
Preparing for the Google Cloud Professional Data Engineer #day1

Jaegool 2022. 7. 12. 00:59

Table of Contents for Studying

 

https://cloud.google.com/certification/guides/data-engineer/

 


Learn to describe each category.

 

Quiz: Module 1 Assessment

The recommended test-taking strategy is: Bookmark those questions for which you don't know the answer or don't feel confident in your answer, and return to them iteratively.

 

Product and technology knowledge

You need to know the basic information about each product that might be covered on the exam.

You need to know:

 ● What it does, why it exists.

 ● What is special about its design, and for what purpose or purposes was it optimized?

 ● When do you use it, and what are the limits or bounds when it is time to consider an alternative?

 ● What are the key features of this product or technology?

 ● Is there an Open Source alternative? If so, what are the key benefits of the cloud-based service over the Open Source software?

 

 

Section 1. Designing data processing systems

 

1.1 Selecting the appropriate storage technologies. Considerations include:

    a. Mapping storage systems to business requirements

    b. Data modeling

    c. Trade-offs involving latency, throughput, transactions

    d. Distributed systems

    e. Schema design

1.2 Designing data pipelines. Considerations include:

    a. Data publishing and visualization (e.g., BigQuery)

    b. Batch and streaming data (e.g., Dataflow, Dataproc, Apache Beam, Apache Spark and Hadoop ecosystem, Pub/Sub, Apache Kafka)

    c. Online (interactive) vs. batch predictions

    d. Job automation and orchestration (e.g., Cloud Composer)

1.3 Designing a data processing solution. Considerations include:

    a. Choice of infrastructure

    b. System availability and fault tolerance

    c. Use of distributed systems

    d. Capacity planning

    e. Hybrid cloud and edge computing

    f. Architecture options (e.g., message brokers, message queues, middleware, service-oriented architecture, serverless functions)

    g. At least once, in-order, and exactly once, etc., event processing

1.4 Migrating data warehousing and data processing. Considerations include:

    a. Awareness of current state and how to migrate a design to a future state

    b. Migrating from on-premises to cloud (Data Transfer Service, Transfer Appliance, Cloud Networking)

    c. Validating a migration

 

 

Exam tip)

Know how to identify technologies backwards from their properties. For example, which data technology offers the fastest ingest of data? Which one might you use for ingest of streaming data? Managed services are ones where you can see the individual instance or cluster.

 

Exam tip)

Managed services still have some IT overhead. They don't completely eliminate the overhead or manual procedures, but they minimize them compared with on-premises solutions.

 

Data in files and data in transit

CSV, which stands for comma-separated values, is a simple file format used to store tabular data.

XML, which stands for Extensible Markup Language, was designed to store and transport data and to be self-descriptive.

JSON, which stands for JavaScript Object Notation, is a lightweight data-interchange format.
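
As a rough sketch of the difference, here is the same record written out as CSV and as JSON using only the Python standard library (the record itself is made up for illustration):

```python
import csv
import io
import json

# A made-up record used only to illustrate the two formats.
record = {"name": "Jaegool", "day": 1, "exam": "Professional Data Engineer"}

# CSV: a header row of field names followed by comma-separated values.
buf = io.StringIO()
csv_writer = csv.DictWriter(buf, fieldnames=record.keys())
csv_writer.writeheader()
csv_writer.writerow(record)
print(buf.getvalue())

# JSON: lightweight, self-describing key/value text.
print(json.dumps(record))
```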

 

Avro is a remote procedure call and data serialization framework developed within the Apache Hadoop project.

It uses JSON for defining data types and protocols and serializes data in a compact binary format.
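
A minimal sketch of that idea using the third-party fastavro package (not part of the course; the schema and file name are just examples): the schema is plain JSON, while the records themselves are written in Avro's compact binary format.

```python
from fastavro import parse_schema, reader, writer

# The Avro schema is defined as JSON (here, a Python dict of the same shape).
schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
})

records = [{"name": "Jaegool", "age": 30}]

# Serialize to a compact binary Avro file.
with open("users.avro", "wb") as out:
    writer(out, schema, records)

# Read the binary file back into Python dicts.
with open("users.avro", "rb") as fo:
    for rec in reader(fo):
        print(rec)
```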

 

A project contains users and datasets. Use a project to:

- Limit access to datasets and jobs

- Manage billing

 

A data set contains tables and views.

Access Control Lists for Reader/Writer/Owner.

Applied to all tables/views in dataset.

 

A table is a collection of columns.

Columnar storage.

Views are virtual tables defined by a SQL query.

Tables can be external (e.g., on Cloud Storage).

 

A job is a potentially long-running action.

Can be cancelled.
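
A minimal sketch of how those pieces fit together with the google-cloud-bigquery Python client (the project ID is a placeholder; the table comes from a BigQuery public dataset): the client is scoped to a project, the query runs as a job against a table in a dataset, and the job can be cancelled while it runs.

```python
from google.cloud import bigquery

# The client is bound to a project, which controls access and billing.
client = bigquery.Client(project="my-project")  # placeholder project ID

# Tables live inside datasets; this one is in a BigQuery public dataset.
query = """
    SELECT name, number
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    WHERE state = 'TX'
    LIMIT 10
"""

job = client.query(query)      # starts a potentially long-running query job
# job.cancel() could stop it early; result() waits for the job to finish.
for row in job.result():
    print(row.name, row.number)
```

Note that the query selects only the two columns it needs, which is exactly the access pattern BigQuery's columnar storage is optimized for.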

 

Column processing is very cheap and fast in BigQuery, while row processing is slow and expensive.

 

Most queries only work on a small number of fields, and BigQuery only needs to read the relevant columns to execute a query. BigQuery can also compress the column data much more effectively.

 

BigQuery storage is columnar

Relational database:

- Record-oriented storage.

- Supports transactional updates.

 

BigQuery storage:

- Each column in a separate, compressed, encrypted file that is replicated 3+ times.

- No indexes, keys, or partitions required.

- For immutable, massive datasets.

 

Spark hides data complexity with an abstraction: RDDs (Resilient Distributed Datasets).

RDDs are an abstraction that hides the complicated details of how data is located and replicated in a cluster.

Spark can direct processing to occur where processing resources are available.

Data partitioning, data replication, data recovery, and pipelining of processing are all automated by Spark, so you don't have to worry about them.
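
A small PySpark sketch of that abstraction (cluster details are omitted; "local[*]" just runs it on the local machine): the code never says how the data is partitioned or replicated, only what should happen to it.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")  # local mode for illustration

# parallelize() turns a local collection into an RDD; Spark decides how to
# partition and replicate it across whatever resources are available.
lines = sc.parallelize(["spark hides complexity", "rdds hide complexity too"])

counts = (lines
          .flatMap(lambda line: line.split())   # split lines into words
          .map(lambda word: (word, 1))          # pair each word with a count
          .reduceByKey(lambda a, b: a + b))     # sum counts per word

print(counts.collect())  # only now is the pipelined processing executed
sc.stop()
```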

 

Exam tip)

You should know how the different services store data and how each method is optimized for specific use cases, as previously mentioned, but also understand the key value of the approach. In this case, RDDs hide complexity and allow Spark to make decisions on your behalf.

 

 Dataflow terms and concepts: PCollection

- Each step is a Transform

- Together, they form a Pipeline

- The Pipeline is executed on the cloud by a Runner.

1. Each step is elastically scaled.

2. Each Transform is applied on a PCollection

3. The result of an apply() is another PCollection.

 

 

There are a number of concepts that you should know about Cloud Dataflow. Your data in Dataflow is represented in PCollections.

 

The pipeline shown in this example reads data from BigQuery, does some processing, and writes its output to Cloud Storage.

 

Dataflow does ingest, transform, and load on batch (bounded data, files) or streaming data.

Any combination of basic and custom transformations
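
A hedged sketch of such a pipeline with the Apache Beam Python SDK (the project, table, and bucket names are placeholders): each step is a Transform applied to a PCollection, and the whole Pipeline is handed to a Runner, here the Dataflow runner.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project, table, and bucket names for illustration only.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (p
     | "ReadFromBQ" >> beam.io.ReadFromBigQuery(
           table="my-project:my_dataset.my_table")              # PCollection of dict rows
     | "ToCsvLine" >> beam.Map(
           lambda row: ",".join(str(v) for v in row.values()))  # a custom Transform
     | "WriteToGCS" >> beam.io.WriteToText(
           "gs://my-bucket/output/result"))                     # output lands in Cloud Storage
```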

 

 

A tensor is an N-dimensional array of data.

TensorFlow is the open-source code that you use to create machine learning models. A tensor is a powerful abstraction because it relates different kinds of data, and there are transformations in tensor algebra that apply to any dimension or rank of tensor, which makes solving some problems much easier.
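
A tiny illustration of that point with TensorFlow (nothing here is from the course itself): the same operations work on tensors of any rank.

```python
import tensorflow as tf

# Tensors of different ranks: scalar (rank 0), vector (rank 1), matrix (rank 2).
scalar = tf.constant(3.0)
vector = tf.constant([1.0, 2.0, 3.0])
matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])

# The same tensor operations apply regardless of rank.
print(tf.reduce_sum(scalar))   # 3.0
print(tf.reduce_sum(vector))   # 6.0
print(tf.reduce_sum(matrix))   # 10.0
print(tf.transpose(matrix))    # transposition works for rank 2 and above
```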

 

Exam tip #3)

Since the course doesn't explain every term one by one, when an unfamiliar touchstone comes up, you have to prepare for it separately.

It seems helpful to prepare for the exam by clearly distinguishing what you know from what you don't.

From this point on, master the material through repetition.

 

 

----------- From Dataproc onward, the concepts were unfamiliar to me, so they were hard to understand. I judged that continuing was not an efficient use of my time.

 

 

Google recommends that people with three or more years of experience take the GCP PDE exam.

For a beginner like me, rather than taking the GCP Professional Data Engineer exam right away, the Associate Cloud Engineer exam, which overlaps in scope, is recommended.

GCP certifications

For someone like me who only recently started studying data, the professional-level course in English was too much to handle.

So I bailed right away and made up my mind to start preparing for the Associate Cloud Engineer exam first, starting tomorrow. Haha