Using the Simple Data Lake to load raw data
The Simple Data Lake provides a simple, consistent, and automated way of uploading raw data.
The main components are:
- The Raw bucket in AWS S3 or GCP GCS, which is the landing zone for uploaded raw data
- The Preprocessor that cleans, normalizes, and compresses raw data and puts it in the prepped bucket
- The Prepped bucket that stores files prepped for use by the warehouse
- The RAW database in the warehouse that holds the external stage and table configurations
In the image below you can see the major components. The solid arrows indicate logical data flow. The dashed arrows indicate process control flow.
Structure of the buckets
Object keys in the buckets encode basic metadata and follow the naming convention <provider>/<system>/<feed>/export_date=<YYYY-MM-DD>/<file>
- The provider is the organization that generates or controls the data and is providing it to you for analytics
- The system is the name of the external system (i.e. external to the analytics platform) that stores the data
- The feed represents an entity or table that has a defined schema (even if that schema is just a semi-structured JSON blob)
- The export date is the date a file was extracted and loaded
- The file is the specific extract object on the given export date
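The naming convention above can be sketched as a pair of helpers that build and parse object keys. This is a minimal illustration, not part of the pipeline: the names RawKey, build_key, and parse_key are hypothetical, and the rule that no segment may contain a "/" is an assumption made here so that keys parse unambiguously.

```python
import re
from datetime import date
from typing import NamedTuple


class RawKey(NamedTuple):
    """The five metadata fields encoded in an object key (hypothetical type)."""
    provider: str
    system: str
    feed: str
    export_date: date
    file: str


# Matches <provider>/<system>/<feed>/export_date=<YYYY-MM-DD>/<file>
_KEY_RE = re.compile(
    r"^(?P<provider>[^/]+)/(?P<system>[^/]+)/(?P<feed>[^/]+)/"
    r"export_date=(?P<export_date>\d{4}-\d{2}-\d{2})/(?P<file>[^/]+)$"
)


def build_key(provider: str, system: str, feed: str,
              export_date: date, file: str) -> str:
    """Assemble an object key from its parts."""
    # Assumption: segments are non-empty and never contain a slash.
    for part in (provider, system, feed, file):
        if not part or "/" in part:
            raise ValueError(f"invalid key segment: {part!r}")
    return (f"{provider}/{system}/{feed}/"
            f"export_date={export_date.isoformat()}/{file}")


def parse_key(key: str) -> RawKey:
    """Split an object key back into its metadata fields."""
    match = _KEY_RE.match(key)
    if match is None:
        raise ValueError(f"key does not follow the naming convention: {key!r}")
    return RawKey(
        provider=match["provider"],
        system=match["system"],
        feed=match["feed"],
        # fromisoformat also rejects impossible dates such as 2024-13-01
        export_date=date.fromisoformat(match["export_date"]),
        file=match["file"],
    )
```

For example, build_key("acme", "crm", "contacts", date(2024, 3, 1), "contacts.csv") returns acme/crm/contacts/export_date=2024-03-01/contacts.csv, and parse_key reverses it. The provider, system, and feed values here are made up for illustration.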
Configuration
New feeds, systems, and providers are set up in the pipeline repository. When you add a new dbt source with the external configuration specified, the pipeline automatically applies the necessary configuration and permissions and bridges the raw data into dbt-ready sources.
Supported formats
The raw bucket accepts CSV, tab-delimited CSV, and JSONL files. CSV files must have a header in the first row.
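A quick local check before uploading can catch files that do not meet these format rules. The sketch below is an illustration only: check_delimited and check_jsonl are hypothetical helpers, not part of the Preprocessor, and a script cannot prove that a first row really is a header, so it only verifies that the first row exists and has no empty column names. Tolerating blank lines in JSONL is likewise an assumption.

```python
import csv
import io
import json


def check_delimited(text: str, delimiter: str = ",") -> list[str]:
    """Return the header column names of a CSV or tab-delimited file.

    Raises ValueError if the first row is missing or has an empty name.
    """
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    header = next(reader, None)
    if not header or any(not name.strip() for name in header):
        raise ValueError("first row must be a header with non-empty column names")
    return header


def check_jsonl(text: str) -> int:
    """Return the number of JSON records, one per line.

    Raises ValueError naming the first line that is not valid JSON.
    """
    count = 0
    for lineno, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # assumption: blank lines are skipped rather than rejected
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {lineno} is not valid JSON: {exc.msg}") from exc
        count += 1
    return count
```

Pass delimiter="\t" to check_delimited for tab-delimited files. Both functions take the file contents as a string, so they suit small samples; for large extracts you would stream the file instead.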
Gaining access
The upload-agent IAM user is a service account with permissions to access the raw bucket.
Data engineers have individual IAM user accounts with permissions to access the raw bucket.
You can request access credentials for either of these through the service desk.