This page documents a number of ETL pipeline architectures. ETL stands for Extract, Transform and Load: data is extracted from a source, transformed (a step that is sometimes optional) and then loaded into a different system.
These architectures are simplified versions that I use to convey the overall concept. In these designs I've chosen serverless technologies to achieve scalability and to avoid costs when the pipeline is not active. Depending on the amount of data that needs to be processed, a fan-out design might be required. An example of a fan-out design can be seen in the last pipeline example.
During data extraction, raw data is copied or exported from source locations to a staging area. Data management teams can extract data from a variety of data sources, which can be structured or unstructured.
In the staging area, the raw data undergoes data processing. Here, the data is transformed and consolidated for its intended analytical use case.
In the final step, loading, the transformed data is written to a target data warehouse. Typically, this involves an initial load of all data, followed by periodic loads of incremental changes and, less often, full refreshes that erase and replace the data in the warehouse. For most organizations that use ETL, the process is automated, well-defined, continuous and batch-driven. ETL usually runs during off-hours, when traffic on the source systems and the data warehouse is at its lowest.
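The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the in-memory `warehouse` dictionary stands in for a real data warehouse, and the function names, the row shape (`id`, `name`) and the `full_refresh` flag are all hypothetical.

```python
from typing import Iterable


def extract(source: Iterable[dict]) -> list[dict]:
    """Copy raw rows from the source into a staging list."""
    return [dict(row) for row in source]


def transform(staged: list[dict]) -> list[dict]:
    """Drop rows without an id and normalise the name field."""
    return [
        {"id": row["id"], "name": row["name"].strip().title()}
        for row in staged
        if row.get("id") is not None
    ]


def load(rows: list[dict], warehouse: dict, full_refresh: bool = False) -> int:
    """Upsert rows keyed by id; a full refresh erases the warehouse first."""
    if full_refresh:
        warehouse.clear()
    for row in rows:
        warehouse[row["id"]] = row
    return len(rows)


def run_pipeline(source: Iterable[dict], warehouse: dict, full_refresh: bool = False) -> int:
    """Run extract, transform and load in sequence; return the number of rows loaded."""
    return load(transform(extract(source)), warehouse, full_refresh)
```

Calling `run_pipeline` repeatedly with new rows models the periodic incremental loads, while passing `full_refresh=True` models the less frequent erase-and-replace run.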
The file upload architecture is a design where the data to be processed is uploaded (either by a user or by an automated process) to an S3 bucket and then processed by a transformation step.
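As a rough sketch of how the upload could trigger the transformation, the snippet below assumes the bucket is configured to send S3 event notifications to a serverless function such as AWS Lambda. That wiring is an assumption, not something this page prescribes, and `upload_handler` and `transform_object` are hypothetical names. The transformation itself is a placeholder that only builds a label.

```python
from urllib.parse import unquote_plus


def objects_from_event(event: dict) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs


def transform_object(bucket: str, key: str) -> str:
    """Hypothetical transformation step; a real one would read and process the object."""
    return f"transformed s3://{bucket}/{key}"


def upload_handler(event: dict, context: object) -> list[str]:
    """Entry point invoked once per notification; processes every uploaded object."""
    return [transform_object(bucket, key) for bucket, key in objects_from_event(event)]
```

Keeping the event parsing separate from the transformation makes the handler easy to exercise locally with a hand-written event dictionary.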
Periodic Partner API
The Periodic Partner API architecture is a design where an API is called periodically and the resulting data is fed into the ETL pipeline.
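A minimal sketch of the periodic call might look as follows, assuming some scheduler (for example a cron-style rule) invokes the function at a fixed interval. The endpoint URL, the `records` field in the response and the names `fetch_json` and `scheduled_handler` are all hypothetical. The real partner API will have its own URL, authentication and response shape.

```python
import json
from urllib.request import urlopen

# Hypothetical endpoint; replace with the partner's real URL.
PARTNER_URL = "https://partner.example.com/api/records"


def fetch_json(url: str, timeout: float = 10.0) -> dict:
    """GET the URL and decode the JSON response body."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def scheduled_handler(event: dict, context: object, fetch=fetch_json) -> dict:
    """Entry point for each scheduled run; returns the records to feed into the pipeline."""
    payload = fetch(PARTNER_URL)
    # Assumes the partner wraps its rows in a "records" list.
    records = payload.get("records", [])
    return {"count": len(records), "records": records}
```

The `fetch` parameter exists only so the handler can be tried without network access, for example `scheduled_handler({}, None, fetch=lambda url: {"records": []})`.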
The following design demonstrates a fan-out principle where transformation is handled by multiple consumers. This can be a useful pattern when processing takes too long (or uses too many resources) for a single consumer.
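The fan-out idea can be illustrated locally with a thread pool standing in for the multiple consumers. This is only a sketch: in a serverless setup the batches would more likely be published to a queue or topic and picked up by separate function invocations, and `chunk`, `consume` and `fan_out` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(items: list, size: int) -> list[list]:
    """Split items into consecutive batches of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def consume(batch: list[int]) -> list[int]:
    """Hypothetical consumer; the transformation here just doubles each value."""
    return [value * 2 for value in batch]


def fan_out(items: list[int], batch_size: int, workers: int = 4) -> list[int]:
    """Distribute batches over several consumers and merge their results in order."""
    batches = chunk(items, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order the batches were submitted.
        results = pool.map(consume, batches)
        return [row for batch in results for row in batch]
```

The batch size controls how much work each consumer receives, which is the knob to turn when a single consumer runs too long or uses too many resources.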