Provisioning and Adjusting Cloud Dataproc
Cloud Dataproc is a managed Hadoop and Spark service. When provisioning Cloud Dataproc resources, you specify the configuration of a cluster using the Cloud Console, the command-line SDK, or the REST API. When you create a cluster, you specify a name, a region, a zone, a cluster mode, machine types, and an optional autoscaling policy.
The cluster mode determines the number of master nodes and possible worker nodes. The standard mode has one master and some number of workers. The single node mode has one master and no workers. The high availability mode has three master nodes and some number of worker nodes. Master nodes and worker nodes are configured separately. For each type of node, you can specify a machine type, disk size, and disk type. For worker nodes, you can also specify a minimum number of nodes and optional local SSDs.
The 'gcloud dataproc clusters create' command is used to create a cluster from the command line. Here is an example:
gcloud dataproc clusters create pde-cluster-1 \
--region us-central1 \
--zone us-central1-b \
--master-machine-type n1-standard-1 \
--master-boot-disk-size 500 \
--num-workers 4 \
--worker-machine-type n1-standard-1 \
--worker-boot-disk-size 500
After a cluster is created, you can adjust the number of worker nodes, including the number of preemptible worker nodes. The number of master nodes cannot be modified. The number of worker nodes can also be adjusted automatically by specifying an autoscaling policy, which defines a maximum number of nodes along with scale-up and scale-down rates.
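For example, an existing cluster can be resized manually with the 'gcloud dataproc clusters update' command. The cluster name below reuses the earlier example, and flag names may vary slightly by SDK version:
# Resize the primary worker pool of an existing cluster
gcloud dataproc clusters update pde-cluster-1 \
    --region us-central1 \
    --num-workers 6
# Adjust the number of preemptible (secondary) workers
# (older SDK versions use --num-preemptible-workers)
gcloud dataproc clusters update pde-cluster-1 \
    --region us-central1 \
    --num-secondary-workers 4
An autoscaling policy is defined in a YAML file; the field values below are only a sketch:
# autoscaling-policy.yaml (illustrative values)
workerConfig:
  minInstances: 2
  maxInstances: 10
basicAlgorithm:
  cooldownPeriod: 2m
  yarnConfig:
    scaleUpFactor: 0.5        # scale-up rate
    scaleDownFactor: 1.0      # scale-down rate
    gracefulDecommissionTimeout: 1h
The policy can then be imported with a command like 'gcloud dataproc autoscaling-policies import pde-autoscaling-policy --source=autoscaling-policy.yaml --region=us-central1' and attached to a cluster with the --autoscaling-policy flag.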
Configuring Cloud Dataflow
Cloud Dataflow executes streaming and batch pipelines as a runner for Apache Beam. You can specify pipeline options when you run a Cloud Dataflow program. The required parameters are as follows:
1. Job name
2. Project ID
3. Runner, which is DataflowRunner for cloud execution
4. Staging location, which is a path to a Cloud Storage location for code packages
5. Temporary location for temporary job files
You can also specify the number of workers to use by default when executing a pipeline as well as a maximum number of workers to use in cases where the workload would benefit from additional workers.
There is an option to specify the disk size to use with Compute Engine worker instances, which can be important for batch jobs that need large amounts of space on the boot disk.
Cloud Dataflow does not require you to specify a machine type, but you can set the worker machine type and worker disk type if you want that level of control.
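These options are typically passed as command-line flags when launching a pipeline written with the Apache Beam Python SDK. The following is a minimal sketch; the script name, project ID, and bucket paths are placeholders:
# Launch a Beam pipeline on Dataflow (script, project, and bucket are placeholders)
python my_pipeline.py \
    --runner DataflowRunner \
    --job_name pde-dataflow-job-1 \
    --project my-project-id \
    --region us-central1 \
    --staging_location gs://my-bucket/staging \
    --temp_location gs://my-bucket/temp \
    --num_workers 2 \
    --max_num_workers 10 \
    --disk_size_gb 100 \
    --worker_machine_type n1-standard-2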
Configuring Managed Serverless Processing Services
Several processing services in GCP are serverless, so there is no need to provision instances or clusters. You can, however, configure some parameters in each of the services. We will review configuring the following:
1. App Engine
2. Cloud Functions
Configuring App Engine
App Engine is a serverless platform-as-a-service (PaaS) that is organized around the concept of a service, which is an application that you run in the App Engine environment. You can configure your service as well as supporting services by specifying three files:
1. app.yaml
2. cron.yaml
3. dispatch.yaml
There is an app.yaml file associated with each version of a service. Usually you create a directory for each version of a service and keep the app.yaml file and other configuration files there.
Parameters in the app.yaml file include the following:
1. runtime specifies the runtime environment, such as Python 3.
2. handlers is a set of URL patterns that specify what code is run in response to invoking a URL.
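A minimal app.yaml for a Python 3 service might look like the following sketch; the service name and handler paths are illustrative:
# app.yaml (illustrative)
runtime: python39
service: pde-reporting    # optional; omit for the default service
handlers:
# Serve static assets directly from the static directory
- url: /static
  static_dir: static
# Route all other URLs to the application code
- url: /.*
  script: auto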
The cron.yaml file is used to configure scheduled tasks for an application. Parameters include a schedule of when to run the task and a URL to be invoked when the task is run.
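For example, a cron.yaml entry might look like this sketch; the task URL and schedule are placeholders:
# cron.yaml (illustrative)
cron:
- description: "daily summary report"
  url: /tasks/daily-report
  schedule: every 24 hours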
The dispatch.yaml file is a place for specifying routing rules to send incoming requests to a specific service based on the URL.
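A dispatch.yaml sketch with hypothetical service names and URL patterns:
# dispatch.yaml (illustrative)
dispatch:
# Send mobile traffic to the mobile-frontend service
- url: "*/mobile/*"
  service: mobile-frontend
# Send API calls to the api-backend service
- url: "*/api/*"
  service: api-backend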


