Understanding the data loading process

The data loading process is engineered for large volumes of data. In addition, for each data warehouse, our loader applications ensure the best representation of Snowplow events. That includes automatically adjusting the database types for self-describing events and entities according to their schemas.
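To make the schema-driven type adjustment concrete, here is a minimal, hypothetical sketch of how a loader might derive Redshift column types from a self-describing JSON schema. The function name and the specific type choices are illustrative assumptions, not the actual RDB Loader logic.

```python
# Illustrative sketch only: mapping JSON Schema field definitions to
# Redshift column types. The real loader's rules are more involved.

def redshift_type(field_schema: dict) -> str:
    """Map one JSON Schema field definition to a Redshift column type."""
    json_type = field_schema.get("type")
    if json_type == "integer":
        return "BIGINT"
    if json_type == "number":
        return "DOUBLE PRECISION"
    if json_type == "boolean":
        return "BOOLEAN"
    if json_type == "string":
        if field_schema.get("format") == "date-time":
            return "TIMESTAMP"
        max_len = field_schema.get("maxLength")
        # Fall back to a generous VARCHAR when no length is declared
        return f"VARCHAR({max_len})" if max_len else "VARCHAR(4096)"
    # Objects and arrays would be stored as serialized JSON text
    return "VARCHAR(4096)"

print(redshift_type({"type": "integer"}))                    # BIGINT
print(redshift_type({"type": "string", "maxLength": 255}))   # VARCHAR(255)
```

This kind of mapping is what lets the loader create appropriately typed columns for each entity table instead of storing everything as raw JSON.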

We load data into Redshift using the RDB Loader.

At a high level, RDB Loader reads batches of enriched Snowplow events, converts them to a format supported by Redshift, stores them in an S3 bucket, and instructs Redshift to load them.

[Diagram: enriched events arrive from a Kinesis stream into RDB Loader (the Transformer and Loader apps), which writes transformed events to an S3 bucket; Redshift then loads them into the events table and the extra event/entity tables.]

RDB Loader consists of two applications: Transformer and Loader. The following diagram illustrates the interaction between them and Redshift.

[Sequence diagram: Transformer, Loader, and Redshift. In a loop:]

1. The Transformer reads a batch of events.
2. It transforms the events to TSV, splitting out self-describing events and entities.
3. It writes the data to the S3 bucket.
4. It notifies the Loader (via SQS).
5. The Loader sends SQL commands for loading.
6. Redshift loads the data from the S3 bucket using “COPY FROM”.
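The "split out self-describing events and entities" step above can be sketched in a few lines. This is a hypothetical illustration, not the Transformer's actual code: the atomic field names, the `contexts` key, and the table-naming scheme are assumptions made for the example.

```python
# Hypothetical sketch of the Transformer's split step: one enriched event
# becomes a TSV row for the atomic events table plus one row per attached
# entity, each destined for its own Redshift table.

def split_event(enriched: dict) -> dict:
    """Return {table_name: [tsv_row, ...]} for one enriched event."""
    rows = {}
    # A few atomic columns go to the main events table as a TSV row
    atomic = [str(enriched.get(k, ""))
              for k in ("event_id", "collector_tstamp", "event_name")]
    rows["events"] = ["\t".join(atomic)]
    # Each entity (context) is routed to its own table, keyed by its schema,
    # e.g. "iglu:com.acme/ad_click/jsonschema/1-0-0" -> com_acme_ad_click_1
    for entity in enriched.get("contexts", []):
        vendor, name, _fmt, version = entity["schema"].removeprefix("iglu:").split("/")
        table = f"{vendor}_{name}_{version.split('-')[0]}".replace(".", "_")
        row = "\t".join(str(v) for v in entity["data"].values())
        rows.setdefault(table, []).append(row)
    return rows

event = {
    "event_id": "e1",
    "collector_tstamp": "2024-01-01 00:00:00",
    "event_name": "page_view",
    "contexts": [{"schema": "iglu:com.acme/ad_click/jsonschema/1-0-0",
                  "data": {"cost": 1.5}}],
}
print(split_event(event))
```

Splitting per schema is what allows each entity type to land in its own properly typed table, rather than being squeezed into one generic column.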