Right now the codebase uses CSV files to pass information to the next run of the same flow and to downstream flows. Each CSV file records the filename and the start/end time of every file that has already been processed, so that:
- the next run of the same flow does not reprocess data that has already been processed
- downstream flows can simply pick up the already processed data.
This bookkeeping is likely better done with a database, for robustness and efficiency, especially when there are multiple write operations on the cloud side.
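As a rough illustration of what the database version could look like, here is a minimal sketch using Python's built-in `sqlite3`. SQLite is only an assumption to keep the example self-contained; with multiple writers on the cloud side a server database would probably be a better fit. The table name `processed_files` and the helpers `open_tracker`, `mark_processed`, `unprocessed`, and `processed_between` are hypothetical and do not exist in the codebase.

```python
import sqlite3
from datetime import datetime


def open_tracker(db_path=":memory:"):
    """Open (or create) the tracking database. The schema is an assumption."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS processed_files (
            filename   TEXT PRIMARY KEY,  -- one row per processed file
            start_time TEXT NOT NULL,     -- ISO 8601 string, sorts chronologically
            end_time   TEXT NOT NULL
        )
        """
    )
    return conn


def mark_processed(conn, filename, start_time, end_time):
    """Record a file as processed; recording the same file twice is a no-op."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO processed_files VALUES (?, ?, ?)",
            (filename, start_time.isoformat(), end_time.isoformat()),
        )


def unprocessed(conn, candidates):
    """Return the candidate filenames that have not been processed yet."""
    done = {row[0] for row in conn.execute("SELECT filename FROM processed_files")}
    return [name for name in candidates if name not in done]


def processed_between(conn, start, end):
    """Return processed files whose time span overlaps [start, end]."""
    rows = conn.execute(
        "SELECT filename FROM processed_files "
        "WHERE start_time < ? AND end_time > ? ORDER BY start_time",
        (end.isoformat(), start.isoformat()),
    )
    return [row[0] for row in rows]
```

In this sketch, a flow would call `unprocessed` before starting work and `mark_processed` after each file, while a downstream flow would call `processed_between` to find the data it can pick up. The primary key on `filename` is what prevents duplicate entries, which a CSV file cannot enforce on its own.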