A beginner's guide to Cloud Data Fusion
CDF offers an extensive set of data aggregation and analysis tools in a single package. The functionality of these tools is central to some of Cybervision's ongoing projects.
CDF pipelines can be either batch or real-time. Batch pipelines can be run manually, on a time-based schedule, or in response to another trigger. Real-time pipelines run continuously, obtain data as it becomes available, and process it immediately. This flexibility is important for handling workloads of very different kinds efficiently.
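As a small illustration of a manual run, here is a minimal Python sketch that starts a deployed batch pipeline through the CDAP REST API that Cloud Data Fusion exposes. The instance endpoint, pipeline name, and token shown are hypothetical placeholders; the endpoint comes from `gcloud beta data-fusion instances describe` and the token from `gcloud auth print-access-token`.

```python
import requests

# Hypothetical values: replace with your instance's API endpoint and your
# deployed pipeline's name.
CDAP_ENDPOINT = "https://my-instance-dot-usw1.datafusion.googleusercontent.com/api"
PIPELINE_NAME = "patient-records-batch"
AUTH_TOKEN = "ya29...."  # OAuth 2.0 access token, e.g. from `gcloud auth print-access-token`

# Batch pipelines are deployed as CDAP applications; starting the
# DataPipelineWorkflow program triggers a single manual run.
url = (
    f"{CDAP_ENDPOINT}/v3/namespaces/default/apps/"
    f"{PIPELINE_NAME}/workflows/DataPipelineWorkflow/start"
)
response = requests.post(url, headers={"Authorization": f"Bearer {AUTH_TOKEN}"})
response.raise_for_status()
print("Pipeline run requested")
```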
Let's take a look at an example CDF pipeline. Assume that raw, semi-structured medical data is loaded into Cloud Datastore. Our goal is to validate data ranges and dates, normalize the records by renaming certain fields, and aggregate them by patient ID and date. In the end, the aggregated data is loaded into BigQuery so that analytics tools can use it for further exploration and analysis. CDF ships with the tools needed for these data manipulations.
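To make the transformations concrete, here is a plain-Python sketch of what the validate, normalize, and aggregate steps do to each record. In CDF itself these steps would be configured as pipeline plugins rather than written by hand, and the record fields and value ranges below are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw records, as they might arrive from Cloud Datastore.
raw_records = [
    {"patientId": "p-001", "visit_date": "2023-04-02", "heart_rate": 72},
    {"patientId": "p-001", "visit_date": "2023-04-02", "heart_rate": 310},  # out of range
    {"patientId": "p-002", "visit_date": "2023-04-03", "heart_rate": 64},
]

def is_valid(record):
    """Validate dates and value ranges, mirroring the pipeline's validation step."""
    try:
        date.fromisoformat(record["visit_date"])
    except ValueError:
        return False
    return 30 <= record["heart_rate"] <= 250

def normalize(record):
    """Rename fields to the normalized schema used downstream."""
    return {
        "patient_id": record["patientId"],
        "date": record["visit_date"],
        "heart_rate": record["heart_rate"],
    }

# Aggregate by (patient_id, date), e.g. averaging heart rate per visit day.
groups = defaultdict(list)
for rec in map(normalize, filter(is_valid, raw_records)):
    groups[(rec["patient_id"], rec["date"])].append(rec["heart_rate"])

aggregates = [
    {"patient_id": pid, "date": d, "avg_heart_rate": sum(v) / len(v)}
    for (pid, d), v in groups.items()
]
print(aggregates)  # rows that would be loaded into BigQuery
```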
Cloud Data Fusion enables users to build pipelines that ingest data with a BigQuery plugin, making high-volume data ingestion faster and easier. Harnessing the full power of Google's infrastructure, this plugin is immensely helpful for querying massive datasets, enabling fast SQL queries against append-only tables. Users can transfer their data to BigQuery and process it further in CDF: the plugin runs an import query or reads an input table, exports the result to a temporary GCS directory, and finally hands the files to the CDF source.
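The plugin performs this staging automatically, but the equivalent manual steps can be sketched with the google-cloud-bigquery client to show what happens under the hood. The project, dataset, table, and bucket names below are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

# Hypothetical names: replace with your own project, dataset, and bucket.
staging_table = "my-project.staging.visits_tmp"
temp_gcs_dir = "gs://my-temp-bucket/cdf-staging/visits/*.avro"

# Step 1: the "import query" materializes the rows to read into a staging table.
job_config = bigquery.QueryJobConfig(
    destination=staging_table, write_disposition="WRITE_TRUNCATE"
)
client.query(
    "SELECT patient_id, date, heart_rate FROM `my-project.clinical.visits`",
    job_config=job_config,
).result()

# Step 2: export the staging table to a temporary GCS directory, from which
# the CDF source reads the files.
extract_config = bigquery.ExtractJobConfig(destination_format="AVRO")
client.extract_table(staging_table, temp_gcs_dir, job_config=extract_config).result()
```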
Google Cloud Datastore (GCD) is a NoSQL document database on GCP designed for automatic scaling, high performance, and easy development of web and mobile applications. Based on Google's Bigtable and Megastore technology stack, GCD is built to handle large analytical and operational workloads and provides access to data via a RESTful interface.
The GCD source/sink allows users to build multiple data pipelines in CDF at once, enabling them to read complete tables from a GCD instance or to perform inserts/upserts into GCD tables in batch. When configuring the GCD plugin, each field defines an entity property and unique keys identify the entities, which makes it possible to process multiple data types.
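For a sense of what the sink does per record, here is a minimal Python sketch of a keyed upsert and read-back using the google-cloud-datastore client. The kind, key, and property names are hypothetical.

```python
from google.cloud import datastore

client = datastore.Client()  # uses application-default credentials

# Hypothetical kind and key: the GCD sink builds an entity key from the
# configured key field(s) of each incoming record.
key = client.key("PatientRecord", "p-001_2023-04-02")

entity = datastore.Entity(key=key)
entity.update({
    "patient_id": "p-001",
    "date": "2023-04-02",
    "avg_heart_rate": 71.5,
})

# put() performs an upsert: it inserts the entity or overwrites the one with
# the same key, which is how the sink writes records in batch.
client.put(entity)

# Reading back by key mirrors what the GCD source does when scanning a kind.
print(client.get(key))
```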
Cloud Data Fusion Features
1. Code-free self-service: Remove bottlenecks by enabling nontechnical users through a code-free graphical interface that delivers point-and-click data integration.
2. Collaborative data engineering: Cloud Data Fusion offers the ability to create an internal library of custom connections and transformations that can be validated, shared, and reused across an organization.
3. Google Cloud-native: Fully managed Google Cloud-native architecture unlocks the scalability, reliability, security, and privacy features of Google Cloud.
4. Real-time data integration: Replicate transactional and operational databases such as SQL Server, Oracle and MySQL directly into BigQuery with just a few clicks using Data Fusion's replication feature. Integration with Datastream allows you to deliver change streams into BigQuery for continuous analytics. Use feasibility assessment for faster development iterations and performance/health monitoring for observability.
5. Batch integration: Design, run, and operate high volumes of data pipelines periodically, with support for popular data sources including file systems and object stores, relational and NoSQL databases, SaaS systems, and mainframes.
6. Enterprise-grade security: Integration with Cloud Identity and Access Management (IAM), Private IP, VPC-SC and CMEK provides enterprise security and alleviates risks by ensuring compliance and data protection.
7. Integration metadata and lineage: Search integrated datasets by technical and business metadata. Track lineage for all integrated datasets at the dataset and field level.
8. Seamless operations: REST APIs, time-based schedules, pipeline-state-based triggers, logs, metrics, and monitoring dashboards make it easy to operate in mission-critical environments (see the sketch after this list).
9. Comprehensive integration toolkit: Built-in connectors to a variety of modern and legacy systems, code-free transformations, conditionals and pre/post processing, alerting and notifications, and error processing provide a comprehensive data integration toolkit.
10. Hybrid enablement: Open source provides the flexibility and portability required to build standardized data integration solutions across hybrid and multi-cloud environments.
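As a small example of the operations point above, recent runs of a deployed pipeline can be polled through the same CDAP REST API used earlier. The endpoint, pipeline name, and token are again hypothetical placeholders.

```python
import requests

# Hypothetical values, as in the earlier example: the instance's API endpoint
# and a deployed pipeline name.
CDAP_ENDPOINT = "https://my-instance-dot-usw1.datafusion.googleusercontent.com/api"
PIPELINE_NAME = "patient-records-batch"
AUTH_TOKEN = "ya29...."  # e.g. from `gcloud auth print-access-token`

# List recent runs of the pipeline's workflow; each run reports a status such
# as RUNNING, COMPLETED, or FAILED, which can feed a monitoring dashboard.
url = (
    f"{CDAP_ENDPOINT}/v3/namespaces/default/apps/"
    f"{PIPELINE_NAME}/workflows/DataPipelineWorkflow/runs"
)
runs = requests.get(url, headers={"Authorization": f"Bearer {AUTH_TOKEN}"}).json()
for run in runs:
    print(run.get("runid"), run.get("status"))
```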