|
| 1 | +# Data Ingestion Module |
| 2 | + |
| 3 | +This module provides tools for ingesting structured datasets from YAML manifest files into the OpenGin services. It processes hierarchical data structures (ministers, departments, categories, subcategories, and datasets) and creates corresponding entities and relationships in the OpenGin system. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +The ingestion system reads YAML manifest files that describe the structure of datasets organized in a flexible hierarchy: |
| 8 | +- **Ministers** → **Categories** → **Subcategories** → **Datasets** |
| 9 | +- **Ministers** → **Categories** → **Datasets** |
| 10 | +- **Ministers** → **Departments** → **Categories** → **Subcategories** → **Datasets** |
| 11 | +- **Ministers** → **Departments** → **Categories** → **Datasets** |
| 12 | + |
| 13 | +Each dataset is stored as a JSON file and is ingested as an attribute on the appropriate parent entity (subcategory, minister, department). |
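The four hierarchy shapes listed above can be walked with a single recursive pass. The following is a minimal sketch, assuming the manifest has already been parsed into nested dictionaries. The key names (`name`, `departments`, `categories`, `subcategories`, `datasets`) and the helper `iter_dataset_paths` are hypothetical and may not match the real manifest schema or the module's code.

```python
from typing import Iterator

# Child collections a node may contain, in the order they are explored.
# These key names are assumptions for illustration, not the real manifest schema.
CHILD_KEYS = ("departments", "categories", "subcategories")


def iter_dataset_paths(node: dict, trail: tuple = ()) -> Iterator[tuple]:
    """Yield (ancestor names..., dataset name) for every dataset under node."""
    trail = trail + (node["name"],)
    for dataset in node.get("datasets", []):
        yield trail + (dataset,)
    for key in CHILD_KEYS:
        for child in node.get(key, []):
            yield from iter_dataset_paths(child, trail)


# Hypothetical minister with one direct category and one department.
minister = {
    "name": "Minister A",
    "categories": [
        {"name": "Health", "datasets": ["hospitals"]},
    ],
    "departments": [
        {
            "name": "Department B",
            "categories": [
                {
                    "name": "Education",
                    "subcategories": [
                        {"name": "Schools", "datasets": ["enrolment"]},
                    ],
                }
            ],
        }
    ],
}

paths = list(iter_dataset_paths(minister))
for path in paths:
    print(" -> ".join(path))
```

Because every level is optional, the same function covers paths with or without departments and with or without subcategories.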
| 14 | + |
| 15 | +## Prerequisites |
| 16 | + |
| 17 | +Before running the ingestion script, ensure you have completed the following setup steps: |
| 18 | + |
| 19 | +### 1. Start OpenGin Services |
| 20 | + |
| 21 | +Make sure the OpenGin services are up and running. The ingestion script requires: |
| 22 | +- **Read Service**: For querying existing entities and relationships |
| 23 | +- **Ingestion Service**: For creating and updating entities |
| 24 | + |
| 25 | +### 2. Restore Data Backup |
| 26 | + |
| 27 | +Restore the `0.0.1` data backup to ensure you have the base entities (ministers, departments, etc.) that the ingestion script will reference and build upon. |
| 28 | + |
| 29 | +### 3. Set Up Python Environment |
| 30 | + |
| 31 | +Create a virtual environment and install the required dependencies: |
| 32 | + |
| 33 | +```bash |
| 34 | +# Create a virtual environment |
| 35 | +python -m venv venv |
| 36 | + |
| 37 | +# Activate the virtual environment |
| 38 | +# On macOS/Linux: |
| 39 | +source venv/bin/activate |
| 40 | +# On Windows: |
| 41 | +# venv\Scripts\activate |
| 42 | + |
| 43 | +# Install dependencies (requirements.txt lives in ingestion/; adjust the path if you are not in that directory) |
| 44 | +pip install -r requirements.txt |
| 45 | +``` |
| 46 | + |
| 47 | +### 4. Configure Environment Variables |
| 48 | + |
| 49 | +The ingestion script requires the following environment variables: |
| 50 | + |
| 51 | +- `READ_BASE_URL`: Base URL for the OpenGin Read Service |
| 52 | +- `INGESTION_BASE_URL`: Base URL for the OpenGin Ingestion Service |
| 53 | + |
| 54 | +You can export them in your shell: |
| 55 | + |
| 56 | +```bash |
| 57 | +export READ_BASE_URL="http://localhost:8081" |
| 58 | +export INGESTION_BASE_URL="http://localhost:8080" |
| 59 | +``` |
| 60 | + |
| 61 | +Or create a `.env` file in the `ingestion/` directory: |
| 62 | +``` |
| 63 | +READ_BASE_URL=http://localhost:8081 |
| 64 | +INGESTION_BASE_URL=http://localhost:8080 |
| 65 | +``` |
| 66 | + |
| 67 | +If using a `.env` file, make sure you have `python-dotenv` installed (included in `requirements.txt`). |
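A startup check for these two variables might look like the sketch below. It is illustrative only: the function name `load_service_urls` and the trailing-slash normalisation are assumptions, not the script's actual behaviour. When a `.env` file is used, calling `load_dotenv()` from `python-dotenv` before this check would populate `os.environ`.

```python
import os
from typing import Mapping

REQUIRED_VARS = ("READ_BASE_URL", "INGESTION_BASE_URL")


def load_service_urls(env: Mapping[str, str] = os.environ) -> dict:
    """Return the required base URLs, or raise if any is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # Strip a trailing slash so request paths can be appended consistently.
    return {name: env[name].rstrip("/") for name in REQUIRED_VARS}


# Demonstrated with an explicit mapping instead of the real environment.
example_env = {
    "READ_BASE_URL": "http://localhost:8081",
    "INGESTION_BASE_URL": "http://localhost:8080/",
}
urls = load_service_urls(example_env)
print(urls)
```

Passing the environment in as a parameter keeps the check easy to test without touching the real process environment.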
| 68 | + |
| 69 | +## Usage |
| 70 | + |
| 71 | +Once all prerequisites are met, you can run the ingestion script: |
| 72 | + |
| 73 | +```bash |
| 74 | +# From the project root directory |
| 75 | +python -m ingestion.ingest_flat_yaml data/statistics/2020_flat/manifest_2020.yaml |
| 76 | + |
| 77 | +# Or with an explicit year override |
| 78 | +python -m ingestion.ingest_flat_yaml data/statistics/2020_flat/manifest_2020.yaml --year 2020 |
| 79 | +``` |
| 80 | + |
| 81 | +### Command Line Arguments |
| 82 | + |
| 83 | +- `yaml_file` (required): Path to the YAML manifest file |
| 84 | +- `--year` (optional): Override the year extracted from the filename |
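The two arguments above could be wired up with `argparse` roughly as follows. This is a sketch under assumptions: the helper `year_from_filename` and its regular expression (first standalone four-digit year starting with 19 or 20) are hypothetical, and the real script may extract the year differently.

```python
import argparse
import re
from pathlib import Path
from typing import Optional


def year_from_filename(path: str) -> Optional[int]:
    """Return the first standalone four-digit year in the file name, or None."""
    match = re.search(r"(?<!\d)(19|20)\d{2}(?!\d)", Path(path).name)
    return int(match.group(0)) if match else None


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Ingest a YAML manifest")
    parser.add_argument("yaml_file", help="Path to the YAML manifest file")
    parser.add_argument("--year", type=int, default=None,
                        help="Override the year extracted from the filename")
    args = parser.parse_args(argv)
    # Fall back to the filename only when no explicit override was given.
    if args.year is None:
        args.year = year_from_filename(args.yaml_file)
    if args.year is None:
        parser.error("could not infer a year from the filename; pass --year")
    return args


args = parse_args(["data/statistics/2020_flat/manifest_2020.yaml"])
print(args.yaml_file, args.year)
```

An explicit `--year` always wins; the filename is consulted only as a fallback.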
| 85 | + |
| 86 | +### Example |
| 87 | + |
| 88 | +```bash |
| 89 | +# Ingest 2020 data |
| 90 | +python -m ingestion.ingest_flat_yaml data/statistics/2020_flat/manifest_2020.yaml |
| 91 | + |
| 92 | +# Ingest 2021 data |
| 93 | +python -m ingestion.ingest_flat_yaml data/statistics/2021_flat/manifest_2021.yaml |
| 94 | +``` |
| 95 | + |
| 96 | +## How It Works |
| 97 | + |
| 98 | +1. **Parse YAML Manifest**: Reads the YAML file to extract the hierarchical structure |
| 99 | +2. **Find Entities**: Uses the Read Service to find existing ministers and departments by name and year |
| 100 | +3. **Create Categories**: Creates category and subcategory entities as needed |
| 101 | +4. **Process Datasets**: Reads dataset JSON files and adds them as attributes to parent entities |
| 102 | +5. **Create Relationships**: Establishes relationships between entities (e.g., `AS_CATEGORY`) |
| 103 | + |
| 104 | +## Module Structure |
| 105 | + |
| 106 | +``` |
| 107 | +ingestion/ |
| 108 | +├── ingest_flat_yaml.py # Main ingestion script |
| 109 | +├── models/ # Data models and schemas |
| 110 | +│ └── schema.py |
| 111 | +├── services/ # Service layer |
| 112 | +│ ├── entity_resolver.py # Entity lookup and resolution |
| 113 | +│ ├── ingestion_service.py # OpenGin Ingestion API client |
| 114 | +│ ├── read_service.py # OpenGin Read API client |
| 115 | +│ └── yaml_parser.py # YAML parsing utilities |
| 116 | +├── utils/ # Utility functions |
| 117 | +│ ├── date_utils.py # Date/time calculations |
| 118 | +│ ├── http_client.py # HTTP client for API calls |
| 119 | +│ └── util_functions.py # General utilities |
| 120 | +└── requirements.txt # Python dependencies |
| 121 | +``` |
| 122 | + |
| 123 | +## Troubleshooting |
| 124 | + |
| 125 | +### ModuleNotFoundError: No module named 'ingestion' |
| 126 | + |
| 127 | +Make sure you're running the command from the project root directory (the directory that contains the `ingestion/` folder, e.g. `/Users/LDF/Documents/datasets/` on the author's machine), not from within the `ingestion/` folder. |
| 128 | + |
| 129 | +### Missing Dependencies |
| 130 | + |
| 131 | +If you encounter import errors, ensure all dependencies are installed: |
| 132 | +```bash |
| 133 | +pip install -r requirements.txt |
| 134 | +``` |
| 135 | + |
| 136 | +### Environment Variables Not Set |
| 137 | + |
| 138 | +The script exits with an error if either `READ_BASE_URL` or `INGESTION_BASE_URL` is not set. Make sure both are configured before running it. |
| 139 | + |
| 140 | +### Connection Errors |
| 141 | + |
| 142 | +If you see connection errors, verify that: |
| 143 | +- OpenGin services are running |
| 144 | +- The base URLs in your environment variables are correct |
| 145 | +- Your network/firewall allows connections to these services |
| 146 | + |
| 147 | +## Notes |
| 148 | + |
| 149 | +- The script processes ministers sequentially (can be parallelized later) |
| 150 | +- Datasets are validated before ingestion |
| 151 | +- The script handles time period calculations for attributes based on parent entity time ranges and dataset years |
| 152 | +- Categories and subcategories are checked for existence before creation to avoid duplicates |
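One plausible reading of the time period note above is that an attribute is valid for the part of the dataset's calendar year that overlaps the parent entity's active range. The sketch below illustrates that interpretation only; the function `attribute_period` and the clipping rule are assumptions, not the logic in `date_utils.py`.

```python
from datetime import date
from typing import Optional, Tuple


def attribute_period(year: int, parent_start: date,
                     parent_end: Optional[date]) -> Optional[Tuple[date, date]]:
    """Clip the calendar year to the parent's active range.

    parent_end=None means the parent entity is still active.
    Returns None when the year and the parent's range do not overlap.
    """
    start = max(date(year, 1, 1), parent_start)
    year_end = date(year, 12, 31)
    end = year_end if parent_end is None else min(year_end, parent_end)
    return (start, end) if start <= end else None


# A parent that became active mid-2020 and is still active.
print(attribute_period(2020, date(2020, 8, 12), None))
```

Returning `None` for a non-overlapping year gives the caller a clear signal to skip or report the dataset rather than attach it with an invalid range.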