Designing a Data Processing Solution
Data engineers need to understand how to design processing solutions that start with data collection and end with data exploration and visualization. In this chapter, you will learn about designing infrastructure for data engineering and machine learning, including how to do the following:
1. Choose an appropriate compute service for your use case.
2. Design for scalability, reliability, availability, and maintainability.
3. Use hybrid and edge computing architecture patterns.
4. Design distributed processing systems and use appropriate event processing models.
5. Migrate a data warehouse from on-premises data centers to GCP.
Designing Infrastructure
Data engineers are expected to understand how to choose infrastructure appropriate for a use case; how to design for scalability, reliability, availability, and maintainability; and how to incorporate hybrid and edge computing capabilities into a design.
GCP provides a range of compute infrastructure options. The best choice for your data engineering needs may depend on several factors. The four key compute options with which you should be familiar are as follows:
1. Compute Engine
2. Kubernetes Engine
3. App Engine
4. Cloud Functions
Newer services, such as Cloud Run and Anthos, are also available, but they are not currently included in the Professional Data Engineer (PDE) exam and so are not covered here.
Compute Engine
Compute Engine is GCP's infrastructure-as-a-service (IaaS) product. With Compute Engine, you have the greatest amount of control over your infrastructure relative to the other GCP compute services.
Compute Engine provides virtual machine (VM) instances, and users have full access to the VM's operating system. Users can choose from a large number of operating systems that are available on GCP. Once an instance is created, users are free to install and configure additional software to meet their needs.
Users also configure the machine type, either by choosing a predefined machine type or by configuring a custom machine type. Machine types vary by the number of vCPUs and the amount of memory provided. Instances can also be configured with additional security features, such as Shielded VMs, and with accelerators, such as GPUs, which are often used with machine learning and other compute-intensive applications.
In addition to specifying the machine type, operating system, and optional features, you will specify a region and zone when creating a VM.
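The choices described above (machine type, operating system, optional features, and location) can be pictured as a single specification for the VM. The following is a minimal Python sketch under stated assumptions, not the Compute Engine API: `InstanceSpec` and `custom_machine_type` are hypothetical names, the field names are illustrative only, and the `custom-<vCPUs>-<memory in MB>` naming, the `debian-12` operating system label, and the `nvidia-tesla-t4` accelerator label are assumptions used for the example. It also shows that a zone name such as `us-central1-a` carries its region as a prefix.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InstanceSpec:
    """Hypothetical container for the choices made when creating a VM."""
    name: str
    zone: str                          # e.g. "us-central1-a"
    machine_type: str                  # predefined or custom machine type name
    operating_system: str              # illustrative label, not an API value
    shielded_vm: bool = False          # optional security feature
    accelerator: Optional[str] = None  # optional GPU type label
    accelerator_count: int = 0

    @property
    def region(self) -> str:
        # A zone name is the region name plus a short suffix,
        # so dropping the last "-x" segment yields the region.
        return self.zone.rsplit("-", 1)[0]


def custom_machine_type(vcpus: int, memory_mb: int) -> str:
    """Build a custom machine type name (naming scheme is an assumption)."""
    if vcpus < 1 or memory_mb < 1:
        raise ValueError("vcpus and memory_mb must be positive")
    return f"custom-{vcpus}-{memory_mb}"


# Example: a custom 4-vCPU, 8 GB machine with Shielded VM and one GPU.
spec = InstanceSpec(
    name="ml-training-1",
    zone="us-central1-a",
    machine_type=custom_machine_type(vcpus=4, memory_mb=8192),
    operating_system="debian-12",
    shielded_vm=True,
    accelerator="nvidia-tesla-t4",
    accelerator_count=1,
)
print(spec.region, spec.machine_type)  # prints: us-central1 custom-4-8192
```

Swapping the call to `custom_machine_type` for a predefined name such as `n1-standard-4` models the other way of choosing a machine type.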
VMs can be grouped together into clusters for high availability and scalability. A managed instance group is a set of VMs with identical configurations that are managed as a single unit. When autoscaling is enabled, a managed instance group is configured with a minimum and a maximum number of instances, and the number of running instances grows or shrinks within those bounds as the workload changes.
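The effect of the minimum and maximum bounds can be illustrated with a small calculation. The sketch below is a simplified model, not the algorithm that Compute Engine's autoscaler actually uses: `desired_group_size` is a hypothetical function, and the CPU utilization target is an assumed scaling signal. It estimates how many instances would bring average utilization back to the target, then clamps the result to the configured bounds. Utilization is expressed in whole percentages so that the arithmetic stays exact.

```python
def desired_group_size(
    current_size: int,
    observed_pct: int,
    target_pct: int,
    min_instances: int,
    max_instances: int,
) -> int:
    """Estimate a group size for a target utilization, within min/max bounds.

    Simplified illustration only; the real autoscaler is more involved.
    """
    if not 0 < target_pct <= 100:
        raise ValueError("target_pct must be in the range 1..100")
    if min_instances > max_instances:
        raise ValueError("min_instances cannot exceed max_instances")

    # Total load is current_size * observed_pct; divide by the target
    # and round up (integer ceiling division) to get the size needed.
    needed = -(-(current_size * observed_pct) // target_pct)

    # The group never shrinks below the minimum or grows past the maximum.
    return max(min_instances, min(max_instances, needed))


# Busy group: 4 instances at 90% with a 60% target needs 6 instances.
print(desired_group_size(4, 90, 60, min_instances=2, max_instances=10))   # 6
# Idle group: the estimate is 1, but the minimum of 2 applies.
print(desired_group_size(4, 10, 60, min_instances=2, max_instances=10))   # 2
# Overloaded group: the estimate is 14, but the maximum of 10 applies.
print(desired_group_size(8, 100, 60, min_instances=2, max_instances=10))  # 10
```

The minimum keeps enough instances running for availability when the load is light, while the maximum caps cost when the load spikes.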
Compute Engine is a good option when you need maximum control over the configuration of VMs and are willing to manage instances.