This repository is my example of processing data from an application database into a data warehouse. I use PySpark and Apache Beam for data processing and have included code snippets in this repository. The flow is:
- Read data from the application database (e.g. MySQL).
- Save the original data to a file (e.g. Parquet).
- Take the column that contains a JSON string and transform it into meaningful records: parse the JSON string into a dict, then process it to create new records (see the PySpark sketches after this list).
- Save the result into the data warehouse.
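A minimal PySpark sketch of the first two steps, assuming a hypothetical `orders` table and placeholder host, database, and credentials; the MySQL Connector/J driver must be on the Spark classpath:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("app-db-ingest")
    .getOrCreate()
)

# Read the source table over JDBC; all connection details are placeholders.
raw_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/app_db")
    .option("dbtable", "orders")            # hypothetical source table
    .option("user", "etl_user")             # placeholder credentials
    .option("password", "********")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .load()
)

# Keep an untouched copy of the source rows as Parquet.
raw_df.write.mode("overwrite").parquet("/data/raw/orders")
```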
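A sketch of the JSON-flattening step, reading back the Parquet written above. The `payload` column name and its schema are assumptions, and a partitioned Parquet table stands in for whatever warehouse sink is actually used:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("json-flatten").getOrCreate()

# Read back the raw Parquet written in the previous step.
raw_df = spark.read.parquet("/data/raw/orders")

# Hypothetical schema for the JSON payload; adjust to the real fields.
payload_schema = StructType([
    StructField("event_type", StringType()),
    StructField("amount", DoubleType()),
])

records_df = (
    raw_df
    # `payload` is an assumed name for the column holding the JSON string.
    .withColumn("payload", F.from_json(F.col("payload"), payload_schema))
    # Flatten the parsed struct into top-level columns.
    .select("id", "payload.event_type", "payload.amount")
)

# Write the flattened records to the warehouse; partitioned Parquet
# stands in here for the actual warehouse table format.
(
    records_df.write
    .mode("append")
    .partitionBy("event_type")
    .parquet("/data/warehouse/order_events")
)
```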
In the Beam pipeline, I integrate with the GCP stack: Dataflow as the runner, GCS for storage, and BigQuery as the warehouse. A sketch follows.
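A minimal sketch of the Beam variant under the same assumptions: Parquet on GCS in, BigQuery out, run on Dataflow. The project, region, bucket, table name, BigQuery schema, and the `payload` column are all placeholders:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def flatten_payload(row):
    """Parse the JSON string column into a flat dict for BigQuery."""
    payload = json.loads(row["payload"])  # assumed JSON column name
    return {
        "id": row["id"],
        "event_type": payload.get("event_type"),
        "amount": payload.get("amount"),
    }


options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",              # placeholder project
    region="us-central1",
    temp_location="gs://my-bucket/tmp",    # placeholder bucket
)

with beam.Pipeline(options=options) as p:
    (
        p
        # ReadFromParquet yields one dict per record.
        | "ReadParquet" >> beam.io.ReadFromParquet(
            "gs://my-bucket/raw/orders/*.parquet")
        | "FlattenJson" >> beam.Map(flatten_payload)
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my-gcp-project:warehouse.order_events",
            schema="id:INTEGER,event_type:STRING,amount:FLOAT",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```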