
Commit dfb54b1

GH-34: Clean up datasets insertion script and folder structure (#103)
* Add ingestion code with full functionality for inserting statistics datasets
* Remove outdated tests
* Fix attribute name formatting
* Update readme
* Remove unused dependencies
* Update readme
* Add api_retry_decorator to fetch_relations method in ReadService
1 parent 6e5c2a5 commit dfb54b1

18 files changed

Lines changed: 1612 additions & 0 deletions

.gitignore

Lines changed: 9 additions & 0 deletions
@@ -1,4 +1,13 @@
+# Python cache
 __pycache__/*
+__pycache__/
+**/__pycache__/
+*.py[cod]
+*$py.class
+*.so
 
+# Environment
 .DS_Store
 docs/.DS_Store
+.env
+ingestion/.env

ingestion/.env.template

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
READ_BASE_URL="http://localhost:8081"
INGESTION_BASE_URL="http://localhost:8080"

ingestion/README.md

Lines changed: 152 additions & 0 deletions
@@ -0,0 +1,152 @@
# Data Ingestion Module

This module provides tools for ingesting structured datasets from YAML manifest files into the OpenGin services. It processes hierarchical data structures (ministers, departments, categories, subcategories, and datasets) and creates corresponding entities and relationships in the OpenGin system.

## Overview

The ingestion system reads YAML manifest files that describe the structure of datasets organized in a flexible hierarchy:
- **Ministers** → **Categories** → **Subcategories** → **Datasets**
- **Ministers** → **Categories** → **Datasets**
- **Ministers** → **Departments** → **Categories** → **Subcategories** → **Datasets**
- **Ministers** → **Departments** → **Categories** → **Datasets**

Each dataset is stored as a JSON file and is ingested as an attribute on the appropriate parent entity (subcategory, minister, department).

## Prerequisites

Before running the ingestion script, ensure you have completed the following setup steps:

### 1. Start OpenGin Services

Make sure the OpenGin services are up and running. The ingestion script requires:
- **Read Service**: For querying existing entities and relationships
- **Ingestion Service**: For creating and updating entities

### 2. Restore Data Backup

Restore the `0.0.1` data backup to ensure you have the base entities (ministers, departments, etc.) that the ingestion script will reference and build upon.

### 3. Set Up Python Environment

Create a virtual environment and install the required dependencies:

```bash
# Create a virtual environment
python -m venv venv

# Activate the virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
# venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

### 4. Configure Environment Variables

The ingestion script requires the following environment variables:

- `READ_BASE_URL`: Base URL for the OpenGin Read Service
- `INGESTION_BASE_URL`: Base URL for the OpenGin Ingestion Service

You can set these in your environment or create a `.env` file in the `ingestion/` directory:

```bash
export READ_BASE_URL="http://localhost:8081"
export INGESTION_BASE_URL="http://localhost:8080"
```

Or create a `.env` file:
```
READ_BASE_URL=http://localhost:8081
INGESTION_BASE_URL=http://localhost:8080
```

If using a `.env` file, make sure you have `python-dotenv` installed (included in `requirements.txt`).

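As an illustration of the configuration step above, the following is a minimal sketch of how the two variables might be read and validated at startup. The function name `load_base_urls` and the trailing-slash handling are assumptions made for this example; they are not taken from the ingestion code. If a `.env` file is used, `load_dotenv()` from `python-dotenv` would typically be called before this check.

```python
import os


def load_base_urls(environ=None):
    """Return (read_base_url, ingestion_base_url), or raise if either is missing.

    Hypothetical helper: the real script may read its settings differently.
    """
    env = os.environ if environ is None else environ
    required = ("READ_BASE_URL", "INGESTION_BASE_URL")

    # Treat unset and empty values the same way.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))

    # Strip a trailing slash so request paths can be appended consistently.
    return tuple(env[name].rstrip("/") for name in required)
```

Passing a plain dictionary instead of relying on `os.environ` keeps the check easy to exercise in tests.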
## Usage

Once all prerequisites are met, you can run the ingestion script:

```bash
# From the project root directory
python -m ingestion.ingest_flat_yaml data/statistics/2020_flat/manifest_2020.yaml

# Or with an explicit year override
python -m ingestion.ingest_flat_yaml data/statistics/2020_flat/manifest_2020.yaml --year 2020
```

### Command Line Arguments

- `yaml_file` (required): Path to the YAML manifest file
- `--year` (optional): Override the year extracted from the filename

### Example
87+
88+
```bash
89+
# Ingest 2020 data
90+
python -m ingestion.ingest_flat_yaml data/statistics/2020_flat/manifest_2020.yaml
91+
92+
# Ingest 2021 data
93+
python -m ingestion.ingest_flat_yaml data/statistics/2021_flat/manifest_2021.yaml
94+
```
95+
96+
## How It Works
97+
98+
1. **Parse YAML Manifest**: Reads the YAML file to extract the hierarchical structure
99+
2. **Find Entities**: Uses the Read Service to find existing ministers and departments by name and year
100+
3. **Create Categories**: Creates category and subcategory entities as needed
101+
4. **Process Datasets**: Reads dataset JSON files and adds them as attributes to parent entities
102+
5. **Create Relationships**: Establishes relationships between entities (e.g., `AS_CATEGORY`)
103+
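To make the traversal behind steps 1 and 4 concrete, here is a minimal sketch that walks an already-parsed manifest and yields each dataset together with the chain of parents above it. The key names (`ministers`, `departments`, `categories`, `subcategories`, `datasets`, `name`) and the sample data are assumptions for illustration; the real manifest schema is not shown here and may differ.

```python
def iter_datasets(node, path=()):
    """Yield (parent_path, dataset) pairs for every dataset under node.

    parent_path is a tuple of entity names from the minister down to the
    entity that directly owns the dataset.
    """
    here = path + (node["name"],) if "name" in node else path

    # Datasets attached directly to this entity.
    for dataset in node.get("datasets", []):
        yield here, dataset

    # Any of the optional child levels may be present or absent.
    for key in ("departments", "categories", "subcategories"):
        for child in node.get(key, []):
            yield from iter_datasets(child, here)


# Hypothetical manifest content, shaped like the hierarchies in the Overview.
manifest = {
    "ministers": [
        {
            "name": "Minister A",
            "categories": [
                {"name": "Cat 1", "datasets": ["d1.json"]},
                {"name": "Cat 2", "subcategories": [
                    {"name": "Sub 1", "datasets": ["d2.json"]},
                ]},
            ],
            "departments": [
                {"name": "Dept X", "categories": [
                    {"name": "Cat 3", "datasets": ["d3.json"]},
                ]},
            ],
        }
    ]
}

pairs = [pair for minister in manifest["ministers"]
         for pair in iter_datasets(minister)]
```

Because every level is optional, one recursive function covers all four hierarchy shapes without a separate code path for each.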
## Module Structure

```
ingestion/
├── ingest_flat_yaml.py          # Main ingestion script
├── models/                      # Data models and schemas
│   └── schema.py
├── services/                    # Service layer
│   ├── entity_resolver.py       # Entity lookup and resolution
│   ├── ingestion_service.py     # OpenGin Ingestion API client
│   ├── read_service.py          # OpenGin Read API client
│   └── yaml_parser.py           # YAML parsing utilities
├── utils/                       # Utility functions
│   ├── date_utils.py            # Date/time calculations
│   ├── http_client.py           # HTTP client for API calls
│   └── util_functions.py        # General utilities
└── requirements.txt             # Python dependencies
```

## Troubleshooting

### ModuleNotFoundError: No module named 'ingestion'

Make sure you're running the command from the project root directory (`/Users/LDF/Documents/datasets/`), not from within the `ingestion/` folder.

### Missing Dependencies

If you encounter import errors, ensure all dependencies are installed:
```bash
pip install -r requirements.txt
```

### Environment Variables Not Set

The script will exit with an error if `READ_BASE_URL` or `INGESTION_BASE_URL` is not set. Make sure both are configured before running.

### Connection Errors

If you see connection errors, verify that:
- OpenGin services are running
- The base URLs in your environment variables are correct
- Your network/firewall allows connections to these services

## Notes

- The script processes ministers sequentially (can be parallelized later)
- Datasets are validated before ingestion
- The script handles time period calculations for attributes based on parent entity time ranges and dataset years
- Categories and subcategories are checked for existence before creation to avoid duplicates
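
The existence check in the last note amounts to a get-or-create pattern. The sketch below shows the idea with an in-memory stand-in; `InMemoryStore`, `get_or_create_category`, and the generated IDs are hypothetical and do not reflect the actual Read or Ingestion Service clients.

```python
class InMemoryStore:
    """Hypothetical stand-in for the Read and Ingestion services."""

    def __init__(self):
        self.entities = {}  # (parent_id, name) -> entity_id
        self.created = 0

    def find(self, parent_id, name):
        # Plays the role of a Read Service lookup.
        return self.entities.get((parent_id, name))

    def create(self, parent_id, name):
        # Plays the role of an Ingestion Service create call.
        self.created += 1
        entity_id = f"cat-{self.created}"
        self.entities[(parent_id, name)] = entity_id
        return entity_id


def get_or_create_category(store, parent_id, name):
    """Return the existing category ID under parent_id, creating it if absent."""
    existing = store.find(parent_id, name)
    if existing is not None:
        return existing
    return store.create(parent_id, name)
```

Running the same manifest twice against such a store would create each category only once, which is the behaviour the note describes.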

ingestion/__init__.py

Whitespace-only changes.

ingestion/exception/__init__.py

Whitespace-only changes.

ingestion/exception/exceptions.py

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
from fastapi import HTTPException, status

class NotFoundError(HTTPException):
    def __init__(self, message: str):
        super().__init__(status_code=status.HTTP_404_NOT_FOUND, detail=message)

class BadRequestError(HTTPException):
    def __init__(self, message: str):
        super().__init__(status_code=status.HTTP_400_BAD_REQUEST, detail=message)

class InternalServerError(HTTPException):
    def __init__(self, message: str):
        super().__init__(status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, detail=message)

class ServiceUnavailableError(HTTPException):
    def __init__(self, message: str):
        super().__init__(status_code=status.HTTP_503_SERVICE_UNAVAILABLE, detail=message)

class GatewayTimeoutError(HTTPException):
    def __init__(self, message: str):
        super().__init__(status_code=status.HTTP_504_GATEWAY_TIMEOUT, detail=message)
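
The commit message mentions an `api_retry_decorator` applied to `fetch_relations`, but its implementation is not part of the diff shown here. As a rough illustration of how transient failures such as the 503 and 504 errors above might be retried, the following sketch uses exponential backoff. The names `retry` and `TransientError`, the parameters, and the backoff schedule are all assumptions, not the project's actual decorator.

```python
import functools
import time


class TransientError(Exception):
    """Stand-in for retryable failures (for example 503 or 504 responses)."""


def retry(max_attempts=3, base_delay=0.5, retry_on=(TransientError,), sleep=time.sleep):
    """Retry the wrapped call on the given exceptions with exponential backoff."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except retry_on:
                    # Give up and re-raise once the attempts are exhausted.
                    if attempt == max_attempts:
                        raise
                    # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
                    sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator
```

The `sleep` parameter is injectable so the backoff schedule can be checked in tests without actually waiting.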
