GCP Cloud Dataflow

Cloud Dataflow is a managed stream and batch processing service. It is a core component for building pipelines that collect, transform, and output data. In the past, developers would typically create a stream processing pipeline to handle data as it arrives - the hot path - and a separate batch processing pipeline - the cold path. Cloud Dataflow pipelines are written using the Apache Beam API, which is a model for combined stream and batch processing. Apache Beam incorporates Beam runners in the data pipeline; the Cloud Dataflow runner is commonly used in GCP, and Apache Flink is another commonly used Beam runner.

Cloud Dataflow doesn't require you to configure instances or clusters - it is a no-ops service. Cloud Dataflow pipelines are run within a region. Cloud Dataflow integrates directly with Cloud Pub/Sub, BigQuery, and Cloud ML Engine, and it also integrates with Bigtable and Apache Kafka.
Much of your work with Cloud Dataflow is coding transformations in one of the languages supported by Apache Beam, which are currently Java and Python. For the purpose of the exam, it is important to understand Cloud Dataflow concepts.
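
For example, the following Python sketch, written against the Apache Beam SDK, shows roughly how a streaming pipeline can read from Cloud Pub/Sub, apply a simple transform, and write to BigQuery on the Cloud Dataflow runner. The project, region, topic, table, and bucket names are placeholders, and the sketch assumes the BigQuery table already exists.

    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Placeholder options; a real job needs your own project, region, and bucket.
    options = PipelineOptions(
        runner="DataflowRunner",             # execute the job on Cloud Dataflow
        project="my-project",                # hypothetical project ID
        region="us-central1",                # Dataflow jobs run within a region
        temp_location="gs://my-bucket/tmp",  # hypothetical staging bucket
        streaming=True,                      # Pub/Sub input makes this a streaming job
    )

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadMessages" >> beam.io.ReadFromPubSub(
                topic="projects/my-project/topics/events")
            | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "my-project:my_dataset.events",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
        )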

Cloud Dataflow Concepts:

Cloud Dataflow, and the Apache Beam model, are designed around several key concepts: pipelines, PCollections, transforms, ParDo, pipeline I/O, aggregation, user-defined functions, runners, and triggers.
Windowing concepts and watermarks are also important.
1. Pipelines in Cloud Dataflow are, as you would expect, a series of computations applied to data that comes from a source. Each computation emits results, which become the input for the next computation in the pipeline. Pipelines represent a job that can be run repeatedly.
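
A minimal sketch of this idea: each computation's output is the input to the next, and wrapping the pipeline in a function emphasizes that the same definition can be submitted repeatedly, each call becoming a separate job.

    import apache_beam as beam

    def build_and_run():
        """Each call builds the same pipeline and submits it as a new job."""
        pipeline = beam.Pipeline()
        numbers = pipeline | "Source" >> beam.Create([1, 2, 3, 4])  # data from a source
        squared = numbers | "Square" >> beam.Map(lambda n: n * n)   # consumes the source's output
        squared | "Print" >> beam.Map(print)                        # consumes Square's output
        result = pipeline.run()                                     # submit the job
        result.wait_until_finish()                                  # wait for it to complete

    build_and_run()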

2. The PCollection abstraction is a dataset, which is the data used when a pipeline job is run. In the case of batch processing, the PCollection contains a fixed set of data. In the case of streaming data, the PCollection is unbounded.
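
As a sketch, a bounded PCollection can be created from an in-memory list for batch processing, while an unbounded PCollection would come from a streaming source such as Cloud Pub/Sub (the subscription name below is a placeholder).

    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        # A bounded PCollection: a fixed dataset, as in a batch job.
        batch_pcoll = pipeline | "FixedData" >> beam.Create(
            [{"user": "alice", "score": 10}, {"user": "bob", "score": 7}])

        # An unbounded PCollection would come from a streaming source instead,
        # for example (placeholder subscription name):
        # stream_pcoll = pipeline | beam.io.ReadFromPubSub(
        #     subscription="projects/my-project/subscriptions/events-sub")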

3. Transforms are operations that map input data to some output data. Transforms operate on one or more PCollections as input and can produce one or more output PCollections.
The operations can be mathematical calculations, data type conversions, and data grouping steps, as well as read and write operations.
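
The sketch below, using made-up sensor readings, shows the three kinds of operations mentioned above: a data type conversion, a mathematical calculation, and a grouping step.

    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        readings = pipeline | beam.Create(
            [("sensor-1", "20.5"), ("sensor-2", "18.0"), ("sensor-1", "21.0")])

        grouped = (
            readings
            | "ToFloat" >> beam.Map(lambda kv: (kv[0], float(kv[1])))             # type conversion
            | "ToFahrenheit" >> beam.Map(lambda kv: (kv[0], kv[1] * 9 / 5 + 32))  # calculation
            | "GroupBySensor" >> beam.GroupByKey()                                # grouping step
        )
        grouped | beam.Map(print)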

4. ParDo is a parallel processing operation that runs a user-specified function on each element in a PCollection. ParDo transforms data in parallel. ParDo receives input data from a main PCollection but may also receive additional inputs from other PCollections by using a side input. Side inputs can be used to perform joins. Similarly, while a ParDo produces a main output PCollection, additional collections can be output using a side output. Side outputs are especially useful when you want additional processing paths. For example, a side output could be used for data that doesn't pass some validation check.
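
The sketch below uses hypothetical user and event records to show a ParDo with a side input (a dictionary used like a join table) and a side output for records that fail validation.

    import apache_beam as beam
    from apache_beam import pvalue

    class ValidateAndEnrich(beam.DoFn):
        """Emits enriched records on the main output, bad records on a side output."""

        def process(self, record, user_names):
            # user_names is a side input: a dict joined against each record.
            if record.get("user_id") not in user_names:
                yield pvalue.TaggedOutput("invalid", record)  # side output
                return
            record["user_name"] = user_names[record["user_id"]]
            yield record                                      # main output

    with beam.Pipeline() as pipeline:
        names = pipeline | "Names" >> beam.Create([(1, "alice"), (2, "bob")])
        events = pipeline | "Events" >> beam.Create(
            [{"user_id": 1, "action": "click"}, {"user_id": 99, "action": "view"}])

        results = events | beam.ParDo(
            ValidateAndEnrich(),
            user_names=pvalue.AsDict(names),   # side input
        ).with_outputs("invalid", main="valid")

        results.valid | "PrintValid" >> beam.Map(print)
        results.invalid | "PrintInvalid" >> beam.Map(print)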

5. Pipeline I/Os are transforms for reading data into a pipeline from a source and writing data from a pipeline to a sink.
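
A minimal sketch with placeholder Cloud Storage paths: a read transform brings text data into the pipeline, and a write transform sends results to a sink. Beam also provides connectors for Cloud Pub/Sub, BigQuery, Bigtable, Kafka, and other sources and sinks.

    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "ReadSource" >> beam.io.ReadFromText("gs://my-bucket/logs/*.log")  # source
            | "KeepErrors" >> beam.Filter(lambda line: "ERROR" in line)
            | "WriteSink" >> beam.io.WriteToText("gs://my-bucket/errors/part")   # sink
        )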

6. Aggregation is the process of computing a result from multiple input values. Aggregation can be simple, like counting the number of messages arriving in a one-minute period or averaging the values of metrics received over the past hour.
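
For example, counting and averaging a small set of made-up latency values uses Beam's built-in combiners; the windowed, per-minute version of this idea appears in the triggers example further below.

    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        latencies = pipeline | beam.Create([120.0, 95.5, 130.2, 101.7])

        # Simple aggregations: count the input values and average them.
        count = latencies | "Count" >> beam.combiners.Count.Globally()
        average = latencies | "Average" >> beam.combiners.Mean.Globally()

        count | "PrintCount" >> beam.Map(lambda n: print("count:", n))
        average | "PrintMean" >> beam.Map(lambda m: print("mean:", m))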

7. User-defined functions (UDFs) are user-specified code for performing some operation, typically using a ParDo.
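
As a small sketch, the hypothetical parse_log_line function below is a UDF: ordinary Python code applied to every element (beam.Map wraps it in a ParDo under the hood).

    import apache_beam as beam

    def parse_log_line(line):
        """Hypothetical UDF that splits a log line into a (level, message) pair."""
        level, _, message = line.partition(":")
        return (level.strip(), message.strip())

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | beam.Create(["ERROR: disk full", "INFO: job started"])
            | "ApplyUDF" >> beam.Map(parse_log_line)
            | beam.Map(print)
        )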

8. Runners are software that executes pipelines as jobs.
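
The runner is chosen through pipeline options rather than in the pipeline code itself; the sketch below contrasts the local DirectRunner with the Cloud Dataflow runner (project, region, and bucket are placeholders).

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Local development: the DirectRunner executes the job on your own machine.
    local_options = PipelineOptions(runner="DirectRunner")

    # Managed execution on GCP via the Cloud Dataflow runner.
    dataflow_options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
    )

    # The same pipeline code runs under either set of options.
    with beam.Pipeline(options=local_options) as pipeline:
        pipeline | beam.Create(["hello"]) | beam.Map(print)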

9. Triggers are functions that determine when to emit an aggregated result. In batch processing jobs, results are emitted when all the data has been processed. When operating on a stream, you have to specify a window over the stream to define a bounded subset of the data; triggers then determine when to emit results for each window.
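
A sketch of windowing plus a trigger on a hypothetical stream of event dictionaries: events are counted per type in one-minute windows, with early results every 30 seconds of processing time and a final result when the watermark passes the end of each window.

    import apache_beam as beam
    from apache_beam.transforms import trigger, window

    def count_per_minute(events):
        """Counts events per type in one-minute windows of an unbounded stream."""
        return (
            events
            | "WindowIntoMinutes" >> beam.WindowInto(
                window.FixedWindows(60),                     # bounded subsets of the stream
                trigger=trigger.AfterWatermark(
                    early=trigger.AfterProcessingTime(30)),  # when to emit aggregated results
                accumulation_mode=trigger.AccumulationMode.DISCARDING)
            | "KeyByType" >> beam.Map(lambda event: (event.get("type", "unknown"), 1))
            | "CountPerWindow" >> beam.CombinePerKey(sum)    # aggregation within each window
        )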

Jobs and Templates:
