As a Data Engineer, Contribute to Data Transformations
Are you an engineer who wants to contribute to data transformations in dbt or Matillion? Follow this guide.
Getting set up
To make the maximum impact, you will want strong SQL skills and a working knowledge of dbt, Git, and Python.
- Install prerequisites
  - Docker
  - Git
  - Visual Studio Code
- Confirm that your system meets the requirements to run containers
- Follow any special instructions for your OS
- Install the VS Code Remote Development extension pack
- Request developer access to the warehouse and access to your pipeline repository, which will be at https://www.github.com/datateer/<customer-code>-pipeline
- Clone the pipeline repository
- Follow the instructions in the `readme.md` file to finish and verify your local setup
Data Pipelines
A pipeline moves data from operational systems, combines data in the warehouse, and applies data transformations that result in a data model designed for analytical queries.
See a high-level design of the data pipeline architecture at https://docs.google.com/presentation/d/11-LqvvXN-Jd1IwMfDr6lv3yQpTVsnEJ6L-Wl2LA-stM/edit#slide=id.p
Data Source
A Data Source is a type of data asset that Datateer manages. Data Sources are defined using the following properties (sketched in code after this list):
- Provider - the company or organization that provides or grants rights to the data
- System - the API, database, or application where the data is stored. A System can have one or more data Feeds
- Feed - a Feed represents an entity in the source System. Feeds have a defined schema of data fields
- File - a File is an instance of a Feed. Files contain the extracted data and conform to the schema of the associated Feed
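To make the relationships concrete, here is a minimal Python sketch of the Data Source hierarchy. The class and field names are illustrative assumptions, not Datateer's actual data model:

```python
from dataclasses import dataclass, field


@dataclass
class Feed:
    """An entity in the source System, with a defined schema of fields."""
    name: str
    schema: dict[str, str]  # field name -> data type


@dataclass
class System:
    """The API, database, or application where the data is stored."""
    name: str
    feeds: list[Feed] = field(default_factory=list)  # one or more Feeds


@dataclass
class DataSource:
    """A data asset managed by Datateer."""
    provider: str  # company or organization granting rights to the data
    system: System


@dataclass
class File:
    """An instance of a Feed: extracted data conforming to the Feed's schema."""
    feed: Feed
    path: str  # where the extracted data landed, e.g. an object-store key
```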
Extraction Strategy
Each Data Source has an extraction strategy that identifies how Feeds are extracted from the source System and loaded into the Data Lake or warehouse. Possible extraction strategies include:
- Pull strategies indicate that a process managed by Datateer pulls data from the source Systems and puts it into the Data Lake or directly into the warehouse
  - Meltano to Data Lake
  - Fivetran to Warehouse
  - Matillion to Warehouse
  - Matillion to Data Lake
  - Portable to Warehouse
  - Portable to Data Lake
  - Segment to Warehouse
  - Precog to Warehouse
- Push strategies indicate that a process outside Datateer's control pushes data into the Datateer Data Lake or warehouse (see the sketch after this list)
  - Upload Agent to Data Lake - Datateer provides a simple Upload Agent utility to assist in pushing data to the Data Lake
  - File Push to Data Lake - the customer has a custom script that pushes data
  - Meltano Local - the customer runs Meltano within their own network, with the Datateer Data Lake as the target
  - Report export to Data Lake - ERPs and some other systems follow a pattern of pushing prebuilt reports or data views
  - Report export to SFTP
  - Snowflake Data Share
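As an example of the File Push to Data Lake strategy, a customer's push script might be as simple as the Python sketch below. The bucket name and key layout are hypothetical assumptions; the actual conventions come from your Datateer onboarding:

```python
import boto3

# Hypothetical values -- your actual bucket and key conventions will differ.
BUCKET = "datateer-example-data-lake"  # assumption, not a real bucket
KEY = "raw/crm/accounts/2024-01-01/accounts.csv"  # <feed>/<date>/<file> layout is illustrative


def push_file_to_data_lake(local_path: str) -> None:
    """Push one extracted File to the Data Lake landing bucket."""
    s3 = boto3.client("s3")
    s3.upload_file(local_path, BUCKET, KEY)


if __name__ == "__main__":
    push_file_to_data_lake("accounts.csv")
```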
Data Lake
The Data Lake receives data in an AWS S3 bucket or a GCP GCS bucket; performs cleansing, compression, and preparation; and creates a view in the warehouse over the data in the Data Lake.
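The preparation step is managed by Datateer, but conceptually it resembles the following sketch. The file formats and transformations are illustrative assumptions, not Datateer's actual implementation:

```python
import pandas as pd


def prepare(raw_csv: str, prepared_path: str) -> None:
    """Illustrative cleansing and compression pass over one landed file."""
    df = pd.read_csv(raw_csv)
    # Cleansing: normalize column names and drop fully empty rows.
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    df = df.dropna(how="all")
    # Compression/preparation: write a compact columnar file the warehouse can query.
    df.to_parquet(prepared_path, compression="snappy")


prepare("accounts.csv", "accounts.parquet")
```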
Warehouse
The warehouse is a cloud warehouse managed by Datateer. A best-practice security scheme, database structure, resource monitors, and operating structure are applied through the Datateer infrastructure module.
Orchestration
Pipelines are scheduled and run through a tool called Prefect. Although it helps to think of one big conceptual pipeline, the implementation is broken down into several concrete pipelines.
You can find pipeline configuration in the `orchestration` folder in the pipeline code repository.
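For orientation, a concrete pipeline in Prefect is a Python flow composed of tasks. This is a minimal sketch in Prefect 2.x style; the task names and bodies are illustrative assumptions, and the real flows live in the `orchestration` folder:

```python
from prefect import flow, task


@task
def extract_feed(feed: str) -> str:
    # In a real pipeline, this would land a File in the Data Lake.
    return f"s3://data-lake/raw/{feed}"


@task
def run_transformations() -> None:
    # In a real pipeline, this would invoke dbt against the warehouse.
    ...


@flow
def nightly_pipeline() -> None:
    extract_feed("accounts")
    run_transformations()


if __name__ == "__main__":
    nightly_pipeline()
```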
Transformations
You can find transformations done in dbt in the `dbt/models` folder in the pipeline code repository. dbt transformations run as part of the pipeline.
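As a sketch of how a pipeline step can invoke dbt (the model selector below is a hypothetical example; during local development you would more often run `dbt` directly in a terminal):

```python
import subprocess


def run_dbt_models(select: str) -> None:
    """Run a subset of dbt models, e.g. as one pipeline step."""
    subprocess.run(
        ["dbt", "run", "--select", select],
        cwd="dbt",   # the dbt project lives in the repo's dbt/ folder
        check=True,  # raise if any model fails, failing the pipeline step
    )


run_dbt_models("stg_accounts+")  # hypothetical model; '+' includes downstream models
```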
DevOps
Workflow
- Create an Issue in GitHub that describes what you plan to do. Doing all work through Issues will help us support you and collaborate with you. Use this Issue to work out requirements or ask questions before beginning development. Be sure to provide context by referencing any affected Metrics, Data Sources, and Data Products (e.g. specific dashboards).
- Create a feature branch from the `main` branch named `<issue-number>-short-description`
- When development work is started, create a draft pull request and link it to the Issue by putting "Resolves #<issue-number>" in the body. Use this PR to request code reviews or ask implementation questions
- When development work is "dev complete", meaning you believe it is ready for testing, deploy the PR to the Integration environment (a scripted alternative appears after this list):
  - From the GitHub UI, navigate to the code repo and click "Actions"
  - Click "Deploy Data Pipeline"
  - Click "Run Workflow"
  - Change the branch to the one you are working on
  - Ensure the environment is "int"
  - Change the name (if necessary) to deploy the pipeline you are working on
  - Click "Run Workflow"
- Request a pipeline run and verify the results
- Request a review of the PR
- After approval:
  - Merge the PR to the `main` branch
  - Create a release version using semantic versioning (a tag of the form v<major>.<minor>.<patch>, e.g. v1.2.3)
  - Notify Datateer in Slack that you plan to release the new version to production
  - Follow the steps above to deploy the release to production
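If you prefer scripting the deployment over clicking through the GitHub UI, the same "Run Workflow" action can be triggered through GitHub's workflow-dispatch REST endpoint. The workflow file name and input names below are assumptions inferred from the UI steps above; verify them in the repo's `.github/workflows` folder:

```python
import os

import requests

REPO = "datateer/<customer-code>-pipeline"
WORKFLOW_FILE = "deploy-data-pipeline.yml"  # hypothetical file name


def deploy(branch: str, environment: str = "int", name: str | None = None) -> None:
    """Trigger the "Deploy Data Pipeline" workflow via workflow_dispatch."""
    inputs = {"environment": environment}  # input names are assumptions
    if name:
        inputs["name"] = name
    resp = requests.post(
        f"https://api.github.com/repos/{REPO}/actions/workflows/{WORKFLOW_FILE}/dispatches",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={"ref": branch, "inputs": inputs},
        timeout=30,
    )
    resp.raise_for_status()  # GitHub returns 204 No Content on success


deploy("42-short-description", environment="int")
```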
Environments
Warehouse environments are schemas dedicated to each purpose:
- Production
- Staging - useful for testing any deployments that have risk
- UAT - a stable environment for end users to evaluate the correctness of data
- Integration - an environment for integrating changes and testing
- Individual developer environments - each developer has a personal schema named dev_<customer-code>_<initials>, and all development work should target this environment (a quick sketch of the naming convention follows)
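As a quick illustration of the developer-schema naming convention (the customer code and initials here are hypothetical):

```python
def dev_schema(customer_code: str, initials: str) -> str:
    """Personal development schema name: dev_<customer-code>_<initials>."""
    return f"dev_{customer_code}_{initials}".lower()


# Hypothetical example: customer code "acme", developer initials "jd"
assert dev_schema("acme", "jd") == "dev_acme_jd"
```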
Useful References
- You can find documentation on data transformations, and all dbt models and sources, at https://dbt.<customer-code>.datateer.com