Data Pipelines: An Overview II
Batch pipeline for traditional analytics
Traditional analytics is about making sense of data gathered over time (historical data) to support decision-making. It relies on business intelligence tools and batch data pipelines, in which data is collected, processed, and published to a database in large blocks (batches), either once or on a regular schedule. Once ready for access, a batch is queried by a user or a software program for data exploration and visualization.
Depending on the size of a batch, pipeline execution takes from a few minutes to a few hours or even days. To avoid overloading source systems, the process is often run during periods of low user activity (for example, at night or on weekends).
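As a rough illustration of such a nightly run, the sketch below extracts a day's worth of records from a source table, transforms them, and loads the result into an analytics table in one block. It uses only the Python standard library; the table names, schema, and sample data are hypothetical and not tied to any product mentioned in this article.

```python
import sqlite3
from datetime import date

# A minimal sketch of a nightly batch ETL job (hypothetical tables and schema).
SOURCE_DB = "source.db"        # operational system feeding the pipeline (assumed)
WAREHOUSE_DB = "warehouse.db"  # analytics store queried by BI tools (assumed)

def seed_source(day: date) -> None:
    """Create and populate a toy source table so the sketch runs end to end."""
    con = sqlite3.connect(SOURCE_DB)
    con.execute("CREATE TABLE IF NOT EXISTS orders (order_date TEXT, customer_id TEXT, amount REAL)")
    con.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(day.isoformat(), "c1", 120.0), (day.isoformat(), "c2", 75.5), (day.isoformat(), "c1", 30.0)],
    )
    con.commit()
    con.close()

def run_daily_batch(day: date) -> None:
    src = sqlite3.connect(SOURCE_DB)
    dwh = sqlite3.connect(WAREHOUSE_DB)

    # Extract: pull the whole day's records from the source system in one block.
    rows = src.execute(
        "SELECT customer_id, amount FROM orders WHERE order_date = ?",
        (day.isoformat(),),
    ).fetchall()

    # Transform: aggregate revenue per customer for the day.
    revenue: dict[str, float] = {}
    for customer_id, amount in rows:
        revenue[customer_id] = revenue.get(customer_id, 0.0) + amount

    # Load: publish the processed batch to the analytics table.
    dwh.execute("CREATE TABLE IF NOT EXISTS daily_revenue (day TEXT, customer_id TEXT, revenue REAL)")
    dwh.executemany(
        "INSERT INTO daily_revenue VALUES (?, ?, ?)",
        [(day.isoformat(), c, r) for c, r in revenue.items()],
    )
    dwh.commit()
    src.close()
    dwh.close()

if __name__ == "__main__":
    today = date.today()
    seed_source(today)
    run_daily_batch(today)  # in practice triggered by a scheduler during off-peak hours
```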
Batch processing is a tried-and-true way to work with huge datasets in projects that are not time-sensitive. But if you need real-time insights, it is better to opt for architectures that enable streaming analytics.
Streaming data pipeline for real-time analytics
Real-time or streaming analytics is about deriving insights from constant flows of data within seconds or milliseconds. Unlike batch processing, a streaming pipeline ingests a sequence of data as it's created and progressively updates metrics, reports, and summary statistics in response to every event that becomes available.
Real-time analytics allows companies to get up-to-date information about operations and react without delay, or to provide solutions for smart monitoring of infrastructure performance. Enterprises that can't afford any lag in data processing - like fleet management businesses operating telematics systems - should prefer a streaming architecture over batch.
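To make the contrast with batch concrete, the toy sketch below consumes events one at a time and updates a running metric after every event instead of waiting for a batch window. The simulated sensor stream and the temperature metric are invented for illustration; a production pipeline would read from a broker such as Apache Kafka rather than a Python generator.

```python
import random
import time
from typing import Iterator

# Hypothetical event source: in practice this would be a Kafka topic, a Kinesis
# stream, or similar; here a generator simulates telematics/sensor readings.
def sensor_events() -> Iterator[dict]:
    while True:
        yield {"device_id": random.randint(1, 3), "temperature": random.uniform(20.0, 90.0)}
        time.sleep(0.1)

def stream_pipeline(max_events: int = 50) -> None:
    count = 0
    running_sum = 0.0
    for i, event in enumerate(sensor_events(), start=1):
        # Each event updates the summary statistics immediately -
        # no waiting for a scheduled batch window.
        count += 1
        running_sum += event["temperature"]
        avg = running_sum / count
        if event["temperature"] > 85.0:
            print(f"ALERT: device {event['device_id']} reported {event['temperature']:.1f} C")
        print(f"event {i}: running average temperature = {avg:.1f} C")
        if i >= max_events:
            break

if __name__ == "__main__":
    stream_pipeline()
```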
Big Data pipeline for Big Data analytics
Big Data pipelines perform the same tasks as their smaller counterparts. What differentiates them is the ability to support Big Data analytics, which means handling:
1. huge volumes of data,
2. coming from multiple (100+) sources,
3. in a great variety of formats (structured, semi-structured, and unstructured), and
4. at high speed.
ELT, which loads vast amounts of raw data, and streaming analytics, which extracts insights on the fly, seem to be a perfect fit for a Big Data pipeline. Yet, thanks to modern tools, batch processing and ETL can also cope with massive amounts of information. Typically, to analyze Big Data, organizations run both batch and real-time pipelines, leveraging a combination of ETL and ELT along with several stores for different formats.
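As a hedged sketch of the ELT side of such a pipeline, the PySpark snippet below loads raw, semi-structured JSON straight from a data lake path and only then transforms it into an aggregate for analytics. It assumes PySpark is installed; the lake path, field names, and output location are assumptions made up for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Minimal ELT sketch with PySpark: load raw data first, transform afterwards.
spark = SparkSession.builder.appName("big-data-elt-sketch").getOrCreate()

# Extract + Load: read raw JSON events from a (hypothetical) data lake directory.
raw = spark.read.json("/data/lake/raw/events/")  # assumed path and layout

# Transform: derive an aggregate only when it is needed for analytics.
daily_counts = (
    raw.withColumn("event_date", F.to_date("event_timestamp"))  # assumed field name
       .groupBy("event_date", "event_type")                     # assumed field name
       .count()
)

# Publish the result in a columnar format for downstream BI queries.
daily_counts.write.mode("overwrite").parquet("/data/lake/curated/daily_counts/")

spark.stop()
```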
Data Pipeline Tools
These are the tools and infrastructure behind data flow, storage, processing, workflow, and monitoring. The choice of options depends on many factors, such as organization size and industry, data volumes, use cases for data, budget, security requirements, and so on. Some groups of instruments for data pipelines are as follows:
1. ETL tools: include data preparation and data integration tools such as IBM DataStage, Informatica PowerCenter, Oracle Data Integrator, Talend Open Studio, and many more.
2. Data warehouses (DWs): central repositories that store data transformed (processed) for a particular purpose. Today, all major DWs - such as Amazon Redshift, Azure Synapse, Google BigQuery, Snowflake, and Teradata - support both ETL and ELT processes and allow for stream data loading.
3. Data lakes: store raw data in native formats until it's needed for analytics. Companies typically use data lakes to build ELT-based Big Data pipelines for machine learning projects. All large providers of cloud services - AWS, Microsoft Azure, Google Cloud, IBM - offer data lakes for massive data volumes.
4. Batch workflow schedulers (such as Luigi or Azkaban) enable users to programmatically specify workflows as tasks with dependencies between them, as well as automate and monitor these workflows; a minimal scheduler sketch appears after this list.
5. Real-time data streaming tools: process information continuously generated by sources like machinery sensors, IoT and IoMT devices, transaction systems, etc. Popular instruments in this category are Apache Kafka, Apache Storm, Google Cloud Dataflow, Amazon Kinesis, Azure Stream Analytics, IBM Streaming Analytics, and SQLstream.
6. Big Data tools: comprise all the above-mentioned data streaming solutions and other technologies supporting end-to-end Big Data flow. The Hadoop ecosystem is the number-one source of instruments for working with Big Data. Among them are:
1. Hadoop and Spark for batch processing.
2. Spark Streaming, an analytics service extending core Spark capabilities to process live data streams.
3. Apache Oozie and Apache Airflow for batch job scheduling and monitoring.
4. Apache Cassandra and Apache HBase, NoSQL databases for storing and managing massive amounts of data, and
many other tools.
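As promised under item 4 above, here is a minimal sketch of how a workflow scheduler such as Apache Airflow expresses a pipeline as tasks with dependencies. It assumes Airflow 2.x is installed; the DAG name, task bodies, and nightly schedule are hypothetical, made up for the example, while the DAG, PythonOperator, and >> dependency constructs come from the library itself.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables standing in for real extract/transform/load logic.
def extract():
    print("extracting raw data from source systems")

def transform():
    print("cleaning and aggregating the extracted batch")

def load():
    print("publishing the processed batch to the warehouse")

# A nightly batch pipeline expressed as a DAG of dependent tasks.
with DAG(
    dag_id="nightly_batch_pipeline",   # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",     # run at 2 AM, during low user activity
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Dependencies: extract must finish before transform, transform before load.
    extract_task >> transform_task >> load_task
```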