Building and Operationalizing Processing Infrastructure



Objectives:

1. Provisioning resources.

2. Monitoring pipelines.

3. Adjusting pipelines.

4. Testing and quality control.

Introduction

Data engineering depends heavily on computing, or processing, resources. In this article, you will learn how to provision and adjust processing resources, including Compute Engine, Kubernetes Engine, Cloud Bigtable, and Cloud Dataproc. You will also learn about configuring managed, serverless processing resources, including those offered by App Engine, Cloud Functions, and Cloud Dataflow. We will also discuss how to use Stackdriver Metrics, Stackdriver Logging, and Stackdriver Trace to monitor processing infrastructure.

Provisioning and Adjusting Processing Resources

GCP has a variety of processing resources. Let's categorize them into:
1. Server-based resources,
2. Serverless resources.


Server-based resources are those that require you to specify virtual machines (VMs) or clusters; serverless resources don't require you to specify VMs or clusters but do still require some configuration.

The server-based services described here include the following:

1. Compute Engine: an infrastructure as a service (IaaS) offering that allows clients to run workloads on Google's physical hardware.

2. Kubernetes Engine: provides a managed environment for deploying, managing, and scaling your containerized applications using Google infrastructure.

3. Cloud Bigtable: an HBase-compatible, enterprise-grade NoSQL database with single-digit millisecond latency and virtually limitless scale.

4. Cloud Dataproc: a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, and streaming in a simpler, more cost-efficient way.

5. Cloud Dataflow: a cloud-based data processing service for both batch and real-time data streaming applications.

Compute Engine can be configured using individual VMs or instance groups. The others are configured as clusters of VMs, sometimes with VMs taking on different roles within the cluster.

The serverless GCP services covered here include the following:
1. App Engine.

2. Cloud Functions.

Although you don't need to specify machine types or cluster configurations, you can specify some parameters to adjust these services to meet the requirements of different use cases.

Provisioning and Adjusting Compute Engine


Compute Engine supports provisioning single instances or groups of instances, known as instance groups. Instance groups are either managed or unmanaged. Managed instance groups (MIGs) consist of identically configured VMs; unmanaged instance groups allow for heterogeneous VMs, but you should prefer managed instance groups unless you have a specific need for heterogeneous VMs. The configuration of a VM in a managed instance group is specified in a template, known as an instance template.

Provisioning Single VM Instances


The basic unit of provisioning in Compute Engine is a virtual machine. VMs are provisioned using the cloud console, the command-line SDK, or the REST API. Regardless of the method used to provision a VM, you can specify a wide range of parameters, including the following:

1. Machine type, which specifies the number of vCPUs and the amount of memory.
2. Region and zone in which to create the VM.
3. Boot disk parameters.
4. Network configuration details.
5. Disk image.
6. Service account for the VM.
7. Metadata and tags.

In addition, you can specify optional features, such as Shielded VMs for additional security, or GPUs and tensor processing units (TPUs) for additional processing resources.
Compute Engine instances can be created in the cloud console or provisioned from the command line using the 'gcloud compute instances create' command. For example, consider the following command:

gcloud compute instances create instance-1 --zone=us-central1-a \
--machine-type=n1-standard-1 --subnet=default --network-tier=PREMIUM \
--image=debian-9-stretch-v20191115 --image-project=debian-cloud \
--boot-disk-size=10GB --boot-disk-type=pd-standard

This command creates a VM of the n1-standard-1 machine type in the us-central1-a zone using the Premium network tier and the specified Debian operating system image. The boot disk is a 10 GB standard persistent disk, in other words, not an SSD.
Once a VM instance is created and then stopped, you can adjust some of its configuration parameters. For example, you can detach and reattach a boot disk if it is in need of repair, and you can add or remove GPUs from stopped instances.
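For example, here is a sketch of resizing the instance-1 VM created earlier by stopping it, changing its machine type, and starting it again; the target machine type shown here is illustrative:

# Stop the instance before changing its configuration
gcloud compute instances stop instance-1 --zone=us-central1-a
# Change the machine type of the stopped instance
gcloud compute instances set-machine-type instance-1 \
--zone=us-central1-a --machine-type=n1-standard-2
# Restart the instance with the new machine type
gcloud compute instances start instance-1 --zone=us-central1-a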

Provisioning Managed Instance Groups

Managed instance groups are managed as a single logical resource. MIGs have several useful properties, including:
1. Autohealing based on application-specific health checks, which replaces nonfunctioning instances.
2. Support for multizone groups that provide for availability in spite of zone-level failures.
3. Load balancing to distribute workload across all instances in the group.
4. Autoscaling, which adds or removes instances in the group to accommodate increases and decreases in workloads.
5. Automatic, incremental updates to reduce disruptions to workload processing.
The specifications for a MIG are defined in an instance template, which is a file with VM specifications. A template can be created using the 'gcloud compute instance-templates create' command. For example, here is a command to create a template that specifies the n1-standard-4 machine type, the Debian 9 operating system, and a 250 GB boot disk:

gcloud compute instance-templates create pde-exam-template-custom \
--machine-type n1-standard-4 \
--image-family debian-9 \
--image-project debian-cloud \
--boot-disk-size 250GB

To create an instance from this template, you can use the 'gcloud compute instances create' command seen earlier and include the --source-instance-template parameter. For example, the following command creates an instance using the instance template just created:

gcloud compute instances create pde-exam-instance \
--source-instance-template=pde-exam-template-custom

You can also create an instance template using the cloud console. Once an instance template is created, you can create an instance group using the console as well.
When creating an instance group, you can specify the name of the instance template along with a target CPU utilization that triggers adding instances to the group, up to the maximum number of instances allowed, as well as a minimum number of instances. You can also adjust the default cooldown period, which is the amount of time GCP waits before collecting performance statistics from a new instance. This gives the instance time to complete startup before its performance is considered for adjusting the instance group.
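For example, here is a sketch of how the same could be done from the command line, using the pde-exam-template-custom template created earlier; the group name pde-exam-mig and the scaling thresholds are illustrative values:

# Create a managed instance group of two VMs from the instance template
gcloud compute instance-groups managed create pde-exam-mig \
--zone=us-central1-a --template=pde-exam-template-custom --size=2
# Autoscale between 2 and 6 instances at 65% target CPU utilization,
# with a 90-second cooldown period
gcloud compute instance-groups managed set-autoscaling pde-exam-mig \
--zone=us-central1-a --min-num-replicas=2 --max-num-replicas=6 \
--target-cpu-utilization=0.65 --cool-down-period=90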

Adjusting Compute Engine Resources to Meet Demand


Creating and managing a single VM instance is an appropriate way to provision computing resources in some cases. For example, if you need a development instance or you want to run a small database for a local group of users, this is a reasonable approach. If, however, you want to be able to adapt quickly to changing workloads and provide high availability, a managed instance group is a better option.
One of the advantages of a managed instance group is that the number of VMs in the group can change according to workload. This type of horizontal scaling is readily implemented in MIGs because of autoscaling. The alternative, vertical scaling, requires moving services from one VM to another VM with more or fewer resources. Many data engineering use cases have workloads that can be distributed across VMs, so horizontal scaling with MIGs is a good option. If you have a high-performance computing (HPC) workload that must run on a single server, for example, a monolithic simulation, then you may need to scale VMs vertically.

Provisioning and Adjusting Kubernetes Engine
Kubernetes Engine is a managed Kubernetes service that provides container orchestration. Containers are increasingly used to process workloads because they have less overhead than VMs and allow for finer-grained allocation of resources than VMs.

Overview of Kubernetes Architecture


A Kubernetes cluster has two types of instances: cluster masters and nodes. The cluster master runs four core services that control the cluster: controller manager, API server, scheduler, and etcd. The controller manager runs services that manage Kubernetes abstract components, such as deployments and replica sets. Applications running in Kubernetes use the API server to make calls to the master. The API server also handles intercluster interactions. The scheduler is responsible for determining where to run pods, which are the lowest-level schedulable unit in Kubernetes. The etcd service is a distributed key-value store used to store state information across the cluster.
Nodes are instances that execute workloads; they're implemented as Compute Engine VMs that are run within MIGs. They communicate with the cluster master through an agent called kubelet.
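For example, assuming 'kubectl' has been configured for a cluster (the cluster name here is illustrative), you can list its nodes, which are the underlying Compute Engine VMs:

# Fetch credentials so kubectl can reach the cluster master
gcloud container clusters get-credentials standard-cluster-1 --zone us-central1-a
# List the nodes, which are Compute Engine VMs
kubectl get nodes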

Kubernetes introduces abstractions that facilitate the management of containers, applications, and storage services. Some of the most important are as follows:

1. Pods,
2. Services,
3. ReplicaSets,
4. Deployments,
5. PersistentVolumes,
6. StatefulSets,
7. Ingress
Pods

Pods are the smallest computation unit managed by Kubernetes. Pods contain one or more containers. Usually, pods have just one container, but if the services provided by two containers are tightly coupled, they may be deployed in the same pod. For example, a pod may include a container running an extraction, transformation, and load (ETL) process, as well as a container running ancillary services for decompressing and reformatting data. Multiple containers should be in the same pod only if they are functionally related and have similar scaling and lifecycle characteristics.
Pods are deployed to nodes by the scheduler, usually in groups called replicas, which provides for high availability. Pods are ephemeral and may be terminated if they are not functioning properly. One of the advantages of Kubernetes is that it monitors the health of pods and replaces them if they are not functioning properly. Since multiple replicas of a pod are run, a pod can be destroyed without completely disrupting a service. Pods also support scalability: as load increases or decreases, the number of pods deployed for an application can increase or decrease.

Services

Since pods are ephemeral, Kubernetes provides another abstraction, the service, to give applications a stable point of contact. A service is an abstraction with a stable API endpoint and a stable IP address. Applications that need to use a service communicate with its API endpoints. A service keeps track of its associated pods so that it can always route calls to a functioning pod.
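For example, assuming a deployment named pde-example-application (the same hypothetical deployment used later in this article), a service for it could be created with 'kubectl expose'; the port numbers are illustrative:

# Create a service with a stable endpoint in front of the deployment's pods
kubectl expose deployment pde-example-application --port=80 --target-port=8080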

ReplicaSet

A ReplicaSet is a controller that manages the number of pods running for a deployment. A deployment is a higher-level concept that manages ReplicaSets and provides declarative updates. Each pod in a deployment is created using the same template, which defines how to run a pod. The definition is called a pod specification.
Kubernetes deployments are configured with a desired number of pods. If the actual number of pods varies from the desired state, for example, if a pod is terminated because it is unhealthy, then the ReplicaSet will add or remove pods until the desired state is reached.
PersistentVolumes

Pods may need access to persistent storage, but since pods are ephemeral, it is a good idea to decouple pods, which are responsible for computation, from persistent storage, which should continue to exist even after a pod terminates. A PersistentVolume is Kubernetes' way of representing storage allocated or provisioned for use by a pod. Pods acquire access to persistent volumes by creating a PersistentVolumeClaim, which is a logical way to link a pod to persistent storage.
StatefulSets

Pods as described so far work well for stateless applications, but when state is managed in an application, pods are not functionally interchangeable. Kubernetes uses the StatefulSet abstraction to designate pods as stateful and assign a unique identifier to each of them. Kubernetes uses these identifiers to track which clients are using which pods and to keep them paired.
Ingress

An Ingress is an object that controls external access to services running in a Kubernetes cluster. An Ingress controller must be running in the cluster for an Ingress to function.

Provisioning a Kubernetes Engine Cluster

When you use Kubernetes Engine, you start by provisioning a cluster using either the cloud console, the command line, or the REST API.
When creating a cluster, you will need to specify a cluster name, indicate whether the cluster will run in a single zone or in multiple zones within a region, and specify a version of GKE and a set of node pools.
To create a Kubernetes cluster from the command line, you can use the gcloud container clusters create command. For example, consider the following command:

gcloud container clusters create "standard-cluster-1" --zone "us-central1-a" \
--cluster-version "1.13.11-gke.14" --machine-type "n1-standard-1" \
--image-type "COS" --disk-type "pd-standard" --disk-size "100" \
--num-nodes "5" --enable-autoupgrade --enable-autorepair

This command starts with gcloud container clusters create. Kubernetes Engine was originally called Container Engine, hence the use of gcloud container rather than gcloud kubernetes. The zone, cluster version, machine type, disk configuration, and number of nodes in the cluster are specified, along with the autoupgrade and autorepair options.
Node pools are collections of VMs running in managed instance groups. Since node pools are implemented as MIGs, all machines in the same node pool have the same configuration. Clusters have a default node pool, but you can add custom node pools as well. This is useful when some workloads have specialized needs, such as more memory or higher I/O performance. Node pools are also useful if you want a pool of preemptible VMs to help keep costs down.
When jobs are submitted to Kubernetes, the scheduler determines which nodes are available to run them. Jobs create pods and ensure that a specified number of them successfully terminate. Once the specified number of pods successfully completes, the job is considered complete. Jobs include a specification of how much CPU and memory are required to run the job. If a node has sufficient available resources, the scheduler can run the job on that node. You can control where jobs run using the Kubernetes abstraction known as taints. By assigning taints to node pools and tolerations for those taints to pods, you can control which pods run where, as shown in the sketch below.
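Here is a sketch of adding a preemptible node pool with a taint to the cluster created earlier; the pool name, machine type, and taint key are illustrative, and pods need a matching toleration in their specifications to be scheduled onto these nodes:

# Add a preemptible node pool whose taint keeps ordinary pods off these nodes
gcloud container node-pools create pde-preemptible-pool \
--cluster standard-cluster-1 --zone us-central1-a \
--machine-type n1-highmem-4 --num-nodes 3 --preemptible \
--node-taints preemptible=true:NoSchedule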

Adjusting Kubernetes Engine Resources to Meet Demand

In Kubernetes Engine, two important ways to adjust resources are scaling applications running in clusters and scaling the clusters themselves.


Autoscaling Applications in Kubernetes Engine

When you run an application in Kubernetes, you specify how many replicas of that application should run. When demand for a service increases, more replicas can be added, and as demand drops, replicas can be removed.
To adjust the number of replicas manually, you can use 'kubectl', which is the command-line utility for interacting with Kubernetes clusters. Note that this is not a 'gcloud container' command; 'kubectl' is used to control Kubernetes components, whereas 'gcloud container' commands are used for interacting with Kubernetes Engine itself.
Let's take a look at an example 'kubectl' command to adjust the number of replicas for a deployment named 'pde-example-application':

kubectl scale deployment pde-example-application --replicas 6

This command sets the number of desired replicas for the 'pde-example-application' deployment to 6. There may be times when Kubernetes cannot provide all the desired replicas. To see how many replicas are actually running, you can use the 'kubectl get deployments' command, which lists the name of each deployment, the desired number of replicas, the current number of replicas, the number of replicas available to users, and the amount of time the application has been running in the cluster.
Alternatively, you can configure deployments to autoscale using the 'kubectl autoscale' command. For example, consider the following:

kubectl autoscale deployment pde-example-application --min 2 --max 8 --cpu-percent 65

This command will autoscale the number of replicas of the pde-example-application deployment between two and eight replicas, depending on CPU utilization. In this case, if CPU utilization across all pods is greater than 65%, replicas will be added, up to the maximum number specified.

Autoscaling Clusters in Kubernetes Engine
Kubernetes Engine provides the ability to autoscale the number of nodes in a node pool. (Remember, nodes are implemented as Compute Engine VMs, and node pools are implemented using managed instance groups.)
Cluster autoscaling is done at the node pool level. When you create a node pool, you can specify a minimum and maximum number of nodes in the pool. The autoscaler adjusts the number of nodes in the pool based on resource requests, not actual utilization. If pods are unschedulable because there are not enough nodes in the node pool to run them, then more nodes will be added, up to the maximum number of nodes specified for that node pool. If nodes are underutilized, the autoscaler removes nodes from the node pool, down to the minimum number of nodes in the pool.
By default, GKE assumes that all pods can be restarted on other nodes. If the application is not tolerant of brief disruptions, it is recommended that you not use cluster autoscaling for that application.
Also, if a cluster is running in multiple zones and the node pool contains multiple instance groups of the same instance type, the autoscaler tries to keep them balanced when scaling up.
You can specify autoscaling parameters when creating a cluster. Here is an example:

gcloud container clusters create pde-example-cluster \
--zone us-central1-a \
--node-locations us-central1-a,us-central1-b,us-central1-f \
--num-nodes 2 --enable-autoscaling --min-nodes 1 --max-nodes 4

This command creates a cluster called pde-example-cluster in the us-central1-a zone, with nodes in three zones and a starting set of two nodes per zone. Autoscaling is enabled with a minimum of one node and a maximum of four nodes.
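Autoscaling can also be enabled on an existing cluster with the 'gcloud container clusters update' command. Here is a sketch for the cluster just created; the node pool name and limits are illustrative:

# Enable autoscaling on an existing node pool
gcloud container clusters update pde-example-cluster \
--zone us-central1-a --node-pool default-pool \
--enable-autoscaling --min-nodes 1 --max-nodes 6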

Kubernetes YAML Configurations
As with any other GCP service, we can use the command line, cloud console, and REST APIs to configure GKE resources. Kubernetes makes extensive use of declarative specifications using YAML files. These files contain information needed to configure a resource, such as a cluster or a deployment. Here, for example, is a deployment YAML specification for a deployment of an Nginx server using three replicas:
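apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80

This is a minimal specification along the lines of the standard Kubernetes example. Applying it with 'kubectl apply -f nginx-deployment.yaml' creates a deployment whose ReplicaSet keeps three nginx pods running.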


Provisioning and Adjusting Cloud Bigtable
Cloud Bigtable is a managed wide-column NoSQL database used for applications that require high-volume, low-latency random reads and writes, such as IoT applications, and for analytic use cases, such as storing large volumes of data used to train machine learning models. Bigtable has an HBase interface, so it is also a good alternative to using Hadoop HBase on a Hadoop cluster.

Provisioning Bigtable Instances

Bigtable instances can be provisioned using the cloud console, the command-line SDK, and the REST API. When creating an instance, you provide an instance name, an instance ID, an instance type, a storage type, and cluster specifications.
The instance type can be either production or development. Production instances have clusters with a minimum of three nodes; development instances have a single node and do not provide for high availability.
Storage types are SSD and HDD. SSDs are used when low-latency I/O is a priority. Instances using SSDs support faster reads and are recommended for real-time applications. HDD instances have higher read latency but are less expensive and performant enough for scans, making them a good choice for batch analytical use cases.
Clusters are sets of nodes in a specific region and zone. You specify a name, the location, and the number of nodes when creating a cluster. You can create multiple, replicated clusters in a Bigtable instance.
Instances can be created using the 'gcloud bigtable instances create' command. The following command creates a production instance called 'pde-bt-instance1' with one six-node cluster called 'pde-bt-cluster1' in the 'us-west1-a' zone that uses SSDs for storage:

gcloud bigtable instances create pde-bt-instance1 \
--cluster=pde-bt-cluster1 \
--cluster-zone=us-west1-a \
--display-name=pde-bt-instance-1 \
--cluster-num-nodes=6 \
--cluster-storage-type=SSD \
--instance-type=PRODUCTION

Bigtable has a command-line utility called cbt, which can also be used to create instances, along with performing other operations on Bigtable instances. Here is an example of a cbt command to create a Bigtable instance like the one created with the gcloud command earlier:

cbt createinstance pde-bt-instance1 pde-bt-instance-1 pde-bt-cluster1 us-west1-a 6 SSD

Once you have created a Bigtable instance, you can modify several aspects of it, including the following (see the example after this list):
1. The number of nodes in each cluster.
2. The number of clusters.
3. Application profiles, which contain replication settings.
4. Labels, which are used to specify metadata attributes.
5. The display name.
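For example, here is a sketch of scaling the cluster created earlier from six nodes to ten, assuming the instance and cluster names from the previous example:

# Increase the node count of an existing Bigtable cluster
gcloud bigtable clusters update pde-bt-cluster1 \
--instance=pde-bt-instance1 --num-nodes=10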
