As a Data Engineer, Contribute to Data Transformations
Are you an engineer who wants to contribute to data transformations in dbt or Matillion? Follow this guide.
Getting set up
To make the maximum impact, you will want strong SQL skills and a working knowledge of dbt, Git, and Python.
- Install prerequisites
  - Docker
  - Git
  - Visual Studio Code
- Confirm that your system meets the requirements to run containers
- Follow any special instructions for your OS
- Install the VS Code Remote Development extension pack
- Request developer access to the warehouse and access to your pipeline repository, which will be at https://www.github.com/datateer/<customer-code>-pipeline
- Clone the pipeline repository
- Follow the instructions in the `readme.md` file to finish and verify your local setup
Data Pipelines
A pipeline moves data from operational systems, combines data in the warehouse, and applies data transformations that result in a data model designed for analytical queries.
See a high-level design of the data pipeline architecture at https://docs.google.com/presentation/d/11-LqvvXN-Jd1IwMfDr6lv3yQpTVsnEJ6L-Wl2LA-stM/edit#slide=id.p
Data Source
A Data Source is a type of data asset that Datateer manages. Data Sources are defined using the following properties (sketched in code after this list):
- Provider - the company or organization that provides or grants rights to the data
- System - the API, database, or application where the data is stored. A System can have one or more data Feeds
- Feed - a Feed represents an entity in the source System. Feeds have a defined schema of data fields
- File - a File is an instance of a Feed. Files contain the extracted data and conform to the schema of the associated Feed
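To make the relationships concrete, here is a minimal Python sketch of the Data Source hierarchy. The class and field names are illustrative assumptions, not Datateer's actual data model:

```python
from dataclasses import dataclass, field


@dataclass
class Feed:
    """An entity in the source System, with a defined schema of fields."""
    name: str
    schema: dict[str, str]  # field name -> data type


@dataclass
class System:
    """The API, database, or application where the data is stored."""
    name: str
    feeds: list[Feed] = field(default_factory=list)  # one or more Feeds


@dataclass
class DataSource:
    """A data asset managed by Datateer."""
    provider: str  # company or organization granting rights to the data
    system: System


@dataclass
class File:
    """An instance of a Feed: extracted data conforming to the Feed's schema."""
    feed: Feed
    path: str  # where the extracted data landed, e.g. an object-store key
```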
Extraction Strategy
Each Data Source has an extraction strategy that identifies how Feeds are extracted from the source System and loaded into the Data Lake or warehouse. Possible extraction strategies include:
- Pull strategies indicate that a process managed by Datateer pulls data from the source Systems and puts it into the Data Lake or directly into the warehouse
  - Meltano to Data Lake
  - Fivetran to Warehouse
  - Matillion to Warehouse
  - Matillion to Data Lake
  - Portable to Warehouse
  - Portable to Data Lake
  - Segment to Warehouse
  - Precog to Warehouse
- Push strategies indicate that a process outside Datateer's control pushes data into the Datateer Data Lake or warehouse (see the sketch after this list)
  - Upload Agent to Data Lake - Datateer provides a simple Upload Agent utility to assist in pushing data to the Data Lake
  - File Push to Data Lake - the customer has a custom script that pushes data
  - Meltano Local - the customer runs Meltano within their own network, with the Datateer Data Lake as the target
  - Report export to Data Lake - ERPs and some other systems follow a pattern of pushing prebuilt reports or data views
  - Report export to SFTP
  - Snowflake Data Share
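As an example of the File Push to Data Lake strategy, a customer's push script might be as simple as the Python sketch below. The bucket name and key layout are hypothetical assumptions; the actual conventions come from your Datateer onboarding:

```python
import boto3

# Hypothetical values -- your actual bucket and key conventions will differ.
BUCKET = "datateer-example-data-lake"  # assumption, not a real bucket
KEY = "raw/crm/accounts/2024-01-01/accounts.csv"  # <feed>/<date>/<file> layout is illustrative


def push_file_to_data_lake(local_path: str) -> None:
    """Push one extracted File to the Data Lake landing bucket."""
    s3 = boto3.client("s3")
    s3.upload_file(local_path, BUCKET, KEY)


if __name__ == "__main__":
    push_file_to_data_lake("accounts.csv")
```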
Data Lake
The Data Lake receives data in an AWS S3 bucket or a GCP GCS bucket; performs cleansing, compression, and preparation; and creates a view in the warehouse over the data in the Data Lake.
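The preparation step is managed by Datateer, but conceptually it resembles the following sketch. The file formats and transformations are illustrative assumptions, not Datateer's actual implementation:

```python
import pandas as pd


def prepare(raw_csv: str, prepared_path: str) -> None:
    """Illustrative cleansing and compression pass over one landed file."""
    df = pd.read_csv(raw_csv)
    # Cleansing: normalize column names and drop fully empty rows.
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    df = df.dropna(how="all")
    # Compression/preparation: write a compact columnar file the warehouse can query.
    df.to_parquet(prepared_path, compression="snappy")


prepare("accounts.csv", "accounts.parquet")
```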
Warehouse
The warehouse is a cloud warehouse managed by Datateer. A best-practice security scheme, database structure, resource monitors, and operating structure are applied through the Datateer infrastructure module.
Orchestration
Pipelines are scheduled and run through a tool called Prefect. Although it helps to think of one big conceptual pipeline, the implementation is broken down into several concrete pipelines.
You can find pipeline configuration in the `orchestration` folder in the pipeline code repository.
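For orientation, a concrete pipeline in Prefect is a Python flow composed of tasks. This is a minimal sketch in Prefect 2.x style; the task names and bodies are illustrative assumptions, and the real flows live in the `orchestration` folder:

```python
from prefect import flow, task


@task
def extract_feed(feed: str) -> str:
    # In a real pipeline, this would land a File in the Data Lake.
    return f"s3://data-lake/raw/{feed}"


@task
def run_transformations() -> None:
    # In a real pipeline, this would invoke dbt against the warehouse.
    ...


@flow
def nightly_pipeline() -> None:
    extract_feed("accounts")
    run_transformations()


if __name__ == "__main__":
    nightly_pipeline()
```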
Transformations
You can find transformations done in dbt in the `dbt/models` folder in the pipeline code repository. dbt transformations run as part of the pipeline.
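As a sketch of how a pipeline step can invoke dbt (the model selector below is a hypothetical example; during local development you would more often run `dbt` directly in a terminal):

```python
import subprocess


def run_dbt_models(select: str) -> None:
    """Run a subset of dbt models, e.g. as one pipeline step."""
    subprocess.run(
        ["dbt", "run", "--select", select],
        cwd="dbt",   # the dbt project lives in the repo's dbt/ folder
        check=True,  # raise if any model fails, failing the pipeline step
    )


run_dbt_models("stg_accounts+")  # hypothetical model; '+' includes downstream models
```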
DevOps
Workflow
- Create an Issue in GitHub that describes what you plan to do. Doing all work through Issues will help us support you and collaborate with you. Use this Issue to work out requirements or ask questions before beginning development. Be sure to provide context by referencing any affected Metrics, Data Sources, and Data Products (e.g. specific dashboards).
- Create a feature branch from the `main` branch named `<issue-number>-short-description`
- When development work is started, create a draft pull request and link it to the Issue by putting "Resolves #<issue-number>" in the body. Use this PR to request code reviews or ask implementation questions
- When development work is "dev complete", meaning you believe it is ready for testing, deploy the PR to the Integration environment (a scripted alternative appears after this list):
  - From the GitHub UI, navigate to the code repo and click "Actions"
  - Click "Deploy Data Pipeline"
  - Click "Run Workflow"
  - Change the branch to the one you are working on
  - Ensure the environment is "int"
  - Change the name (if necessary) to deploy the pipeline you are working on
  - Click "Run Workflow"
- Request a pipeline run and verify the results
- Request a review of the PR
- After approval:
  - Merge the PR to the `main` branch
  - Create a release version using semantic versioning (a tag of the form v<major>.<minor>.<patch>, e.g. v1.2.3)
  - Notify Datateer in Slack that you plan to release the new version to production
  - Follow the steps above to deploy the release to production
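If you prefer scripting the deployment over clicking through the GitHub UI, the same "Run Workflow" action can be triggered through GitHub's workflow-dispatch REST endpoint. The workflow file name and input names below are assumptions inferred from the UI steps above; verify them in the repo's `.github/workflows` folder:

```python
import os

import requests

REPO = "datateer/<customer-code>-pipeline"
WORKFLOW_FILE = "deploy-data-pipeline.yml"  # hypothetical file name


def deploy(branch: str, environment: str = "int", name: str | None = None) -> None:
    """Trigger the "Deploy Data Pipeline" workflow via workflow_dispatch."""
    inputs = {"environment": environment}  # input names are assumptions
    if name:
        inputs["name"] = name
    resp = requests.post(
        f"https://api.github.com/repos/{REPO}/actions/workflows/{WORKFLOW_FILE}/dispatches",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={"ref": branch, "inputs": inputs},
        timeout=30,
    )
    resp.raise_for_status()  # GitHub returns 204 No Content on success


deploy("42-short-description", environment="int")
```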
Environments
Warehouse environments are schemas dedicated to each purpose:
- Production
- Staging - useful for testing any deployments that have risk
- UAT - a stable environment for end users to evaluate the correctness of data
- Integration - an environment for integrating changes and testing
- Individual developer environments - each developer has a personal schema named dev_<customer-code>_<initials>, and all development work should target this environment (a quick sketch of the naming convention follows)
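As a quick illustration of the developer-schema naming convention (the customer code and initials here are hypothetical):

```python
def dev_schema(customer_code: str, initials: str) -> str:
    """Personal development schema name: dev_<customer-code>_<initials>."""
    return f"dev_{customer_code}_{initials}".lower()


# Hypothetical example: customer code "acme", developer initials "jd"
assert dev_schema("acme", "jd") == "dev_acme_jd"
```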
Useful References
- You can find documentation on data transformations, and all dbt models and sources, at https://dbt.<customer-code>.datateer.com