ETL Pipeline Architectures

By Michael Anckaert - published on - posted in Architecture - tagged with ETL AWS Pipeline

This page documents a number of ETL pipeline architectures. An ETL pipeline is an Extract, Transform and Load process: data is extracted from a source, optionally transformed, and then loaded into a different system.
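
As a minimal sketch of the three stages, the Python below chains an extract, a transform and a load function. The CSV input, the field names (`name`, `amount`) and the list used as a target are assumptions made for illustration only; a real pipeline would read from and write to actual systems.

```python
import csv
import io
import json


def extract(raw_csv: str) -> list[dict]:
    """Extract: read rows from the source (a CSV string stands in for it)."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def transform(rows: list[dict]) -> list[dict]:
    """Transform: normalise names and convert amounts to integer cents."""
    return [
        {
            "name": row["name"].strip().title(),
            "amount_cents": round(float(row["amount"]) * 100),
        }
        for row in rows
    ]


def load(rows: list[dict], target: list[str]) -> int:
    """Load: write each row to the target (a list stands in for it)."""
    for row in rows:
        target.append(json.dumps(row, sort_keys=True))
    return len(rows)


def run_pipeline(raw_csv: str, target: list[str]) -> int:
    """Run the three stages in order and return the number of rows loaded."""
    return load(transform(extract(raw_csv)), target)
```

Keeping each stage a separate function makes it straightforward to skip the transform step when the data can be loaded as-is.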

These architectures are simplified versions that I use to convey the overall concept. In these designs I've chosen serverless technologies to achieve scalability and to avoid costs when the pipeline is not active. Depending on the amount of data that needs to be processed, a fan-out design might be required; an example of one can be seen in the last pipeline example.

ETL Details

During data extraction, raw data is copied or exported from source locations to a staging area. Data management teams can extract data from a variety of data sources, which can be structured or unstructured.

In the staging area, the raw data undergoes data processing. Here, the data is transformed and consolidated for its intended analytical use case.

In this last step, the transformed data is loaded into a target data warehouse. Typically, this involves an initial loading of all data, followed by periodic loading of incremental data changes and, less often, full refreshes to erase and replace data in the warehouse. For most organizations that use ETL, the process is automated, well-defined, continuous and batch-driven. Typically, ETL takes place during off-hours when traffic on the source systems and the data warehouse is at its lowest.
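
The difference between an initial load, an incremental load and a full refresh can be sketched with a simple watermark. This is one common approach rather than the only one, and the row fields (`id`, `updated_at`) and the dictionary standing in for the warehouse are assumptions made for illustration.

```python
from datetime import datetime


def select_changes(source_rows, watermark):
    """Return rows changed after the watermark; None means an initial load."""
    if watermark is None:
        return list(source_rows)
    return [row for row in source_rows if row["updated_at"] > watermark]


def load_into_warehouse(warehouse, rows, watermark):
    """Upsert rows by id and return the new watermark."""
    for row in rows:
        warehouse[row["id"]] = row
    # With no new rows the watermark stays where it was.
    return max((row["updated_at"] for row in rows), default=watermark)


def full_refresh(warehouse, source_rows):
    """Erase the warehouse and replace its contents with the current source."""
    warehouse.clear()
    return load_into_warehouse(warehouse, list(source_rows), None)
```

Each periodic run stores the returned watermark and passes it to the next run, so only rows changed since the previous load are copied.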

File upload

The File upload architecture is a design where the data that needs to be processed is uploaded (either by a user or by an automated process) to an S3 bucket and then processed by a transformation step.

Solution Diagram
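
The transformation step for an uploaded file could look roughly like the Python sketch below. It assumes the step is triggered by an S3 event notification and that the uploads are CSV files with `id` and `email` columns; the column names, the `transform` rule and the injected `read_object` and `write_rows` callables are hypothetical and only there so the sketch runs without AWS.

```python
import csv
import io
from urllib.parse import unquote_plus


def uploaded_objects(event):
    """Yield (bucket, key) for every record in an S3 event notification."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 events (spaces become "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        yield bucket, key


def transform(row):
    """Hypothetical transformation: trim whitespace, lower-case the email."""
    return {"id": row["id"].strip(), "email": row["email"].strip().lower()}


def process_upload(event, read_object, write_rows):
    """Transform every uploaded CSV file and hand the rows to the loader.

    read_object(bucket, key) -> bytes and write_rows(rows) are injected.
    In a Lambda function, read_object would typically wrap boto3's
    s3.get_object(Bucket=bucket, Key=key)["Body"].read().
    """
    total = 0
    for bucket, key in uploaded_objects(event):
        text = read_object(bucket, key).decode("utf-8")
        rows = [transform(row) for row in csv.DictReader(io.StringIO(text))]
        write_rows(rows)
        total += len(rows)
    return total
```

Injecting the reader and writer keeps the transformation logic testable on its own, independent of where the file comes from or where the rows end up.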

Periodic Partner API

The Periodic Partner API architecture is a design where an API is called periodically and the resulting data is fed into the ETL pipeline.

Solution Diagram
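
A single scheduled run of this design could be sketched as follows. The endpoint URL, the `items` field in the response and the `sink` callable that passes records on to the rest of the pipeline are all assumptions; the scheduling itself (for example a timer that invokes `run_once`) is left out.

```python
import json
from urllib.request import Request, urlopen

PARTNER_URL = "https://partner.example.com/api/orders"  # hypothetical endpoint


def fetch_json(url, timeout=10):
    """Call the partner API and decode the JSON response body."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def run_once(fetch=fetch_json, sink=print, url=PARTNER_URL):
    """One scheduled run: call the API and feed each record into the pipeline."""
    payload = fetch(url)
    records = payload.get("items", [])  # "items" is an assumed response field
    for record in records:
        sink(record)
    return len(records)
```

Because `fetch` and `sink` are parameters, the same function can be exercised locally with a stubbed response before it is wired to a real schedule and a real downstream step.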

Fan-out design

The following design demonstrates a fan-out principle where transformation is handled by multiple consumers. This can be a useful pattern when processing takes too long (or uses too many resources) for a single consumer.

Solution Diagram
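
The fan-out principle can be illustrated locally with a thread pool standing in for the multiple consumers. In a serverless setup each batch would more likely become a queue message or a separate function invocation; the batch size, the number of consumers and the `consume` transformation below are assumptions chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(records, batch_size):
    """Split the workload into fixed-size batches, one per consumer call."""
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


def consume(batch):
    """Hypothetical consumer: transform one batch independently of the others."""
    return [value * 2 for value in batch]


def fan_out(records, batch_size=3, max_consumers=4):
    """Distribute batches over several consumers and merge their results."""
    batches = make_batches(records, batch_size)
    with ThreadPoolExecutor(max_workers=max_consumers) as pool:
        # map() returns results in batch order even though batches run concurrently.
        results = list(pool.map(consume, batches))
    return [item for batch in results for item in batch]
```

The pattern only works when batches can be processed independently; any step that needs to see the whole data set has to run before the split or after the merge.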