ETL Architecture in AWS
AWS is our chosen cloud platform and a vital component of our data engineering infrastructure. This document outlines the best practices we follow when building and optimizing data pipelines, with a focus on AWS services.
There is no one-size-fits-all data pipeline architecture, and we use more than one architectural pattern at CRUK. The sections below describe the components we use for the Medallion Architecture, which underpins multiple pipelines for processing supporter and transaction data.
Medallion Architecture
- The Medallion Architecture aims to incrementally enhance the structure and quality of data at each layer, progressing from Bronze to Silver to Gold. The Bronze layer holds raw data, the Silver layer refines and enriches it, and the Gold layer stores curated, high-quality data.
- Within the Medallion Architecture, AWS Glue Data Catalog can be leveraged for centralized metadata management. This enhancement in metadata quality and accessibility supports iterative improvements as data transitions through the Bronze to Gold layers.
- Implement strategies for incremental data processing within the Medallion architecture on AWS. This keeps pipelines responsive to new or updated data while optimizing computational resources, and the iterative nature of the Medallion framework allows data quality to improve with each pass. This can be done with Glue and S3, as sketched below and described in the following sections.
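The sketch below illustrates the layering idea with plain PySpark. The bucket names, column names, and the date-partition convention are illustrative assumptions, not our actual configuration; in practice the run date would be supplied by the orchestrator and the transforms would be richer.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-incremental").getOrCreate()

run_date = "2024-01-31"  # illustrative; normally passed in by the orchestrator

# Bronze: raw data landed as-is, partitioned by ingestion date so each run
# only reads the newly arrived partition (incremental processing).
bronze = spark.read.json(
    f"s3://example-bronze-bucket/transactions/ingest_date={run_date}/"
)

# Silver: cleaned and conformed records.
silver = (
    bronze
    .dropDuplicates(["transaction_id"])
    .withColumn("amount", F.col("amount").cast("decimal(10,2)"))
)
silver.write.mode("overwrite").parquet(
    f"s3://example-silver-bucket/transactions/ingest_date={run_date}/"
)

# Gold: curated, query-ready aggregates.
gold = silver.groupBy("supporter_id").agg(F.sum("amount").alias("total_donated"))
gold.write.mode("overwrite").parquet(
    f"s3://example-gold-bucket/supporter_totals/ingest_date={run_date}/"
)
```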
ETL with AWS Glue
- Leverage AWS Glue's scalability and serverless architecture to handle dynamic data processing needs. This allows us to focus on designing efficient data transformations.
- Exploit AWS Glue's built-in data engineering features to streamline ETL processes, for example built-in connectors, DynamicFrames, and Glue Data Catalog integration (see the job sketch after this list).
- Harness the power of Apache Spark underlying AWS Glue. This facilitates distributed data processing, supporting large-scale transformations and analytics.
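A minimal Glue job script along these lines is shown below. The catalog database, table name, and output path are assumptions for illustration; job bookmarks are enabled on the Glue job itself so each run only picks up unprocessed data.

```python
import sys

from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read from the Bronze layer via the Glue Data Catalog; the transformation_ctx
# lets job bookmarks track which files have already been processed.
bronze = glueContext.create_dynamic_frame.from_catalog(
    database="bronze_db",            # assumed catalog database
    table_name="raw_transactions",   # assumed catalog table
    transformation_ctx="bronze",
)

# Convert to a Spark DataFrame for the Silver-layer refinement.
silver = bronze.toDF().dropna(subset=["transaction_id"])

silver.write.mode("append").parquet("s3://example-silver-bucket/transactions/")

job.commit()  # persists bookmark state so the next run starts from here
```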
Storage with Amazon S3 and Parquet
- Utilize Amazon S3 for durable, scalable, and cost-effective storage.
- Opt for the Parquet file format to store data in Amazon S3. Parquet's columnar storage structure enhances read and write operations, facilitating faster query performance and reduced data scan times.
- Leverage Parquet's built-in compression capabilities to optimize storage costs further. Efficient compression algorithms reduce the storage footprint without compromising query performance.
- Use typed storage structures: Parquet embeds the schema (column names and types) in the files, so downstream consumers read strongly typed data without re-inferring it (see the sketch after this list).
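The following sketch shows an explicitly typed schema being written to S3 as compressed, partitioned Parquet. The bucket paths, column names, and partition column are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DecimalType, DateType

spark = SparkSession.builder.appName("parquet-storage").getOrCreate()

# Explicit schema: Parquet stores these types with the data, so readers get
# typed columns back without schema inference.
schema = StructType([
    StructField("transaction_id", StringType(), nullable=False),
    StructField("supporter_id", StringType(), nullable=False),
    StructField("amount", DecimalType(10, 2), nullable=True),
    StructField("transaction_date", DateType(), nullable=True),
])

df = spark.read.schema(schema).csv(
    "s3://example-bronze-bucket/transactions/", header=True
)

# Columnar layout plus built-in compression (Snappy here) reduces the storage
# footprint and the amount of data scanned by queries.
(
    df.write
    .mode("overwrite")
    .option("compression", "snappy")
    .partitionBy("transaction_date")
    .parquet("s3://example-silver-bucket/transactions_parquet/")
)
```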
Orchestration with Apache Airflow
- Integrate Apache Airflow into our architecture for orchestrating complex data workflows. Airflow provides visibility into pipeline execution and offers control over the entire process.
- Utilize Airflow's Directed Acyclic Graphs (DAGs) to manage dependencies between tasks, ensuring optimal data flow through pipelines and supporting seamless execution according to workflow logic.
- Leverage Airflow's extensibility to integrate seamlessly with various AWS services. For example, you can use the `GlueJobOperator` to trigger a job in AWS Glue with one line of code (a minimal DAG sketch follows this list).
- Implement logging and monitoring features for troubleshooting and performance insights.
- You can define explicit workflow behaviour based on task success and failure.
- You can re-run previously executed jobs.
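A minimal DAG along these lines is sketched below. The DAG id, Glue job names, and schedule are illustrative assumptions; it assumes the Amazon provider package is installed and uses the `GlueJobOperator`, which waits for the Glue run to finish before marking the task as done.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

default_args = {
    "retries": 2,                        # re-run failed tasks automatically
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="supporter_etl",              # assumed DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
) as dag:

    # One task definition is enough to trigger and monitor a Glue job.
    bronze_to_silver = GlueJobOperator(
        task_id="bronze_to_silver",
        job_name="bronze-to-silver-transactions",    # assumed Glue job name
    )

    silver_to_gold = GlueJobOperator(
        task_id="silver_to_gold",
        job_name="silver-to-gold-supporter-totals",  # assumed Glue job name
    )

    # DAG dependencies encode the workflow logic: silver_to_gold only runs
    # if bronze_to_silver succeeds, and failed runs can be cleared and re-run.
    bronze_to_silver >> silver_to_gold
```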