Data engineering, continued
We're going to revisit a number of concepts from earlier.
What can go wrong in data loading/manipulation? What errors/bugs have you hit?
What would you want to happen?
- Graceful degradation
- Examples? (one sketch follows this list)
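As a minimal sketch of graceful degradation (the file layout, schema, and fallback path here are invented for illustration): skip and log malformed rows instead of crashing the whole load, and fall back to the last known-good copy if the source is unreachable.

```python
import csv
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("loader")

def load_events(path, fallback_path=None):
    """Load (user_id, amount) rows, degrading gracefully on bad input."""
    try:
        f = open(path, newline="")
    except OSError as exc:
        # Source unavailable: fall back to the last good copy instead of failing.
        log.warning("could not open %s (%s); using fallback", path, exc)
        if fallback_path is None:
            return []
        f = open(fallback_path, newline="")

    rows, skipped = [], 0
    with f:
        for lineno, row in enumerate(csv.reader(f), start=1):
            try:
                rows.append((row[0], float(row[1])))
            except (IndexError, ValueError):
                # Malformed row: log and skip rather than aborting the load.
                skipped += 1
                log.warning("skipping malformed row %d: %r", lineno, row)
    if skipped:
        log.warning("loaded %d rows, skipped %d", len(rows), skipped)
    return rows
```

Even a best-effort loader should surface how much it degraded (the skipped-row count here), so silent data loss doesn't masquerade as success.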
Directed acyclic graphs
What does that mean?
- Pipelines
- Modeled as a DAG (see the sketch after this list)
- Jobs
- Batch vs. streaming (contrasted in a sketch after this list)
- Online vs. offline
- Online transaction processing (OLTP)
- Online analytical processing (OLAP)
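To make "pipelines modeled as a DAG" concrete, here is a hand-rolled sketch (task names are made up; real orchestrators such as Airflow or Dagster supply this machinery, plus scheduling and retries): each task declares its upstream dependencies, and the runner executes tasks in topologically sorted order, failing fast if there is a cycle.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks it depends on.
pipeline = {
    "extract_users":  set(),
    "extract_orders": set(),
    "clean_users":    {"extract_users"},
    "clean_orders":   {"extract_orders"},
    "join":           {"clean_users", "clean_orders"},
    "report":         {"join"},
}

def run(task):
    print(f"running {task}")  # stand-in for the real job logic

# static_order() yields each task only after all of its dependencies,
# and raises CycleError if the graph is not actually acyclic.
for task in TopologicalSorter(pipeline).static_order():
    run(task)
```

And to contrast batch with streaming (again a toy sketch): a batch job sees the whole dataset at once and produces one answer; a streaming job consumes records as they arrive and keeps its answer up to date.

```python
def batch_total(rows):
    # Batch: all input is available up front; compute once, return once.
    return sum(rows)

def streaming_total(stream):
    # Streaming: input arrives over time; emit a running result per record.
    total = 0
    for record in stream:
        total += record
        yield total

print(batch_total([3, 1, 4]))                   # 8
print(list(streaming_total(iter([3, 1, 4]))))   # [3, 4, 8]
```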
Why is a DAG different from setting up workflows in GitHub Actions?
- Useful for complex ETL
- Dependencies
- Assets
- Data
- Code (continuous integration/deployment)
Why store the data?
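One concrete answer, as a sketch (the content-hashing scheme and cache layout are assumptions for illustration, not any particular tool's API): persist each intermediate asset keyed by a hash of its input data *and* the code that produced it. Then unchanged steps are skipped, and a change to either the data or the code invalidates everything downstream.

```python
import hashlib
import inspect
import json
from pathlib import Path

CACHE = Path("asset_cache")
CACHE.mkdir(exist_ok=True)

def materialize(fn, input_data):
    """Run fn(input_data), caching the result on disk.

    The cache key covers both data and code: editing fn or changing
    its input invalidates the stored asset, which is the dependency
    tracking a data-aware DAG gives you.
    """
    key = hashlib.sha256(
        (inspect.getsource(fn) + json.dumps(input_data, sort_keys=True)).encode()
    ).hexdigest()
    path = CACHE / f"{fn.__name__}-{key[:12]}.json"
    if path.exists():  # unchanged code + unchanged data => reuse stored asset
        return json.loads(path.read_text())
    result = fn(input_data)
    path.write_text(json.dumps(result))  # stored asset doubles as a debug artifact
    return result

def clean(rows):
    return [r for r in rows if r.get("amount", 0) > 0]

orders = [{"amount": 5}, {"amount": -1}]
print(materialize(clean, orders))   # computes and stores
print(materialize(clean, orders))   # second call hits the cache
```

The stored file does double duty: it's a cache (no recompute when nothing changed) and a lineage/debugging artifact to inspect when a downstream step misbehaves.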
[using DAGs] increases data pipeline transparency but simultaneously increases reliance on developer discipline. Code flexibility might just as easily turn into production instability.
There are many alternative data integration / workflow orchestration tools (e.g., Airflow, Dagster, Prefect, Luigi).
They're heavy this week; don't wait!