Jaegool_'s log

Coursera IBM Data Science Course: Tools for Data Science

Jaegool 2024. 2. 22. 11:54

 

Summary

 

  • The Data Science Task Categories include:
    • Data Management - store, manage, and retrieve data
    • Data Integration and Transformation - streamline data pipelines and automate data processing tasks
    • Data Visualization - provide graphical representation of data and assist with communicating insights
    • Modelling - build, deploy, monitor, and assess data and machine learning models
  • Data Science Tasks are supported by the following:
    • Code Asset Management - store & manage code, track changes and allow collaborative development
    • Data Asset Management - organize and manage data, provide access control, and backup assets
    • Development Environments - develop, test and deploy code
    • Execution Environments - provide computational resources and run the code

The data science ecosystem consists of many open source and commercial options. It includes traditional desktop applications and server-based tools, as well as cloud-based services that can be accessed through web browsers and mobile interfaces.

Data Management Tools include relational databases, NoSQL databases, and Big Data platforms:

  • MySQL and PostgreSQL are examples of open source Relational Database Management Systems (RDBMS). IBM Db2 and SQL Server are examples of commercial RDBMSes, which are also available as cloud services.
  • MongoDB and Apache Cassandra are examples of NoSQL databases.
  • Apache Hadoop and Apache Spark are used for Big Data analytics.
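The difference between the relational and NoSQL (document) models in the list above can be sketched in a few lines. This is only an illustration under stated assumptions: Python's built-in sqlite3 module stands in for an RDBMS such as MySQL or PostgreSQL, a plain list of dictionaries stands in for a document store such as MongoDB, and the table name `tools` and its columns are made up for the example.

```python
import json
import sqlite3

# Relational model: a fixed schema defined up front, queried with SQL.
# sqlite3 is only a stand-in here for a server-based RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tools (id INTEGER PRIMARY KEY, name TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO tools (name, category) VALUES (?, ?)",
    [("MySQL", "relational"), ("PostgreSQL", "relational"), ("MongoDB", "nosql")],
)
relational_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM tools WHERE category = ? ORDER BY id", ("relational",)
    )
]
conn.close()

# Document model: each record is a self-describing document,
# and documents in the same collection may carry different fields.
documents = [
    {"name": "MongoDB", "category": "nosql", "data_model": "document"},
    {"name": "Apache Cassandra", "category": "nosql"},  # no "data_model" field
]
document_names = [doc["name"] for doc in documents if doc.get("category") == "nosql"]

# Documents are typically exchanged in a JSON-like form.
serialized = json.dumps(documents[0])

print(relational_names)
print(document_names)
```

The relational half has to declare its columns before inserting anything, while the document half accepts records with different fields, which is the main trade-off between the two families.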

Data Integration and Transformation Tools include Apache Airflow and Apache Kafka.
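To make "data pipeline" concrete, here is a minimal extract-transform-load sketch. It does not use Airflow's or Kafka's API. It only assumes the general idea that a pipeline is a set of tasks run in dependency order, and it uses the standard library's `graphlib` (Python 3.9+) for the ordering. The task names, the sample data, and the `run_pipeline` helper are all hypothetical.

```python
from graphlib import TopologicalSorter


def extract(_):
    # Pretend these raw strings came from a file or an API.
    return [" 3", "4 ", "x", "5"]


def transform(raw):
    # Keep only values that are digits once whitespace is stripped.
    return [int(value) for value in raw if value.strip().isdigit()]


def load(clean):
    # Stand-in for writing to a database: return a small summary instead.
    return {"rows": len(clean), "total": sum(clean)}


tasks = {"extract": extract, "transform": transform, "load": load}

# Each task maps to the set of tasks that must finish before it.
dependencies = {"transform": {"extract"}, "load": {"transform"}}


def run_pipeline():
    results = {}
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        upstream = dependencies.get(name, set())
        # In this sketch every task has at most one upstream task.
        upstream_result = results[next(iter(upstream))] if upstream else None
        results[name] = tasks[name](upstream_result)
    return order, results


order, results = run_pipeline()
print(order)
print(results["load"])
```

A real orchestration tool adds scheduling, retries, and monitoring on top of this ordering step, which is the part that is tedious to automate by hand.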

Data Visualization Tools include commercial offerings such as Cognos Analytics, Tableau, and Power BI, which can be used to build dynamic, interactive dashboards.

Code Asset Management Tools: Git is an essential code asset management tool. GitHub is a popular web-based platform for storing and managing source code. Its features, including version control, issue tracking, and project management, make it well suited to collaborative software development.

Development Environments: Popular development environments for Data Science include Jupyter Notebooks and RStudio.

  • Jupyter Notebooks provide an interactive, browser-based environment for creating and sharing code, descriptive text, data visualizations, and other computational artifacts.
  • RStudio is an integrated development environment (IDE) designed specifically for working with the R programming language, which is a popular tool for statistical computing and data analysis.