Python implementation of R's StatMatch package for statistical matching and data fusion.
py-statmatch
provides tools for statistical matching (also known as data fusion or synthetic data matching) between different datasets. This package is a Python port of the popular R package StatMatch, implementing various methods to match records from different data sources that share some common variables.
- NND.hotdeck: Nearest Neighbor Distance Hot Deck matching
- Multiple distance metrics (Euclidean, Manhattan, Mahalanobis, etc.)
- Donation classes (match within groups/strata)
- Constrained matching using Hungarian algorithm
- Handle missing values appropriately
- Coming soon:
- RANDwNND.hotdeck: Random distance hot deck
- rankNND.hotdeck: Rank distance hot deck
- create.fused: Create fused datasets from matching results
pip install py-statmatch
For development:
git clone https://github.com/PolicyEngine/py-statmatch.git
cd py-statmatch
pip install -e ".[dev]"
import pandas as pd
from statmatch import nnd_hotdeck
# Create donor dataset (has X and Y variables)
donor_data = pd.DataFrame({
'age': [25, 30, 35, 40, 45],
'income': [30000, 45000, 55000, 65000, 80000],
'education': ['HS', 'BA', 'BA', 'MA', 'PhD'],
'job_satisfaction': [7, 8, 6, 9, 8] # Variable to donate
})
# Create recipient dataset (has X variables but missing Y)
recipient_data = pd.DataFrame({
'age': [28, 33, 42],
'income': [35000, 50000, 70000],
'education': ['BA', 'BA', 'MA']
})
# Perform matching
result = nnd_hotdeck(
data_rec=recipient_data,
data_don=donor_data,
match_vars=['age', 'income'],
dist_fun='euclidean'
)
# Get matched donor indices
print(result['noad.index']) # [0, 1, 3] for example
# Create fused dataset
fused_data = recipient_data.copy()
fused_data['job_satisfaction'] = donor_data.iloc[result['noad.index']]['job_satisfaction'].values
nnd_hotdeck(
data_rec,
data_don,
match_vars,
don_class=None,
dist_fun="euclidean",
cut_don=None,
k=None,
w_don=None,
w_rec=None,
constr_alg=None
)
Parameters:
data_rec
(pd.DataFrame): The recipient datasetdata_don
(pd.DataFrame): The donor datasetmatch_vars
(List[str]): List of variable names to use for matchingdon_class
(str, optional): Variable name defining donation classesdist_fun
(str): Distance function - "euclidean", "manhattan", "mahalanobis", etc.k
(int, optional): Maximum number of times each donor can be usedconstr_alg
(str, optional): Algorithm for constrained matching - "lpsolve" or "hungarian"
Returns:
- Dictionary containing:
mtc.ids
: DataFrame with recipient and donor IDsnoad.index
: Array of donor indices for each recipient (0-based)dist.rd
: Array of distances between matched recipients and donors
# Run all tests
pytest
# Run with coverage
pytest --cov=statmatch
# Run specific test
pytest tests/test_nnd_hotdeck.py::TestNNDHotdeck::test_euclidean_distance_matching
This project uses Black for code formatting:
black . -l 79
MIT License - see LICENSE file for details.
If you use this package in your research, please cite both this package and the original R package:
@software{pystatmatch2024,
title = {py-statmatch: Python implementation of R's StatMatch package},
author = {PolicyEngine},
year = {2024},
url = {https://github.com/PolicyEngine/py-statmatch}
}
@Manual{rstatmatch,
title = {StatMatch: Statistical Matching or Data Fusion},
author = {Marcello D'Orazio},
year = {2023},
note = {R package version 1.4.2},
url = {https://CRAN.R-project.org/package=StatMatch},
}
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
This repository uses GitHub Actions for automated PyPI publishing. To enable publishing:
- Create an account on PyPI if you don't have one
- Generate an API token:
- Go to your PyPI account settings
- Scroll to "API tokens" section
- Click "Add API token"
- Give it a meaningful name (e.g., "py-statmatch GitHub Actions")
- Select scope: "Entire account" (for first publish) or "Project: py-statmatch" (after first publish)
- Add the token as a GitHub repository secret:
- Go to Settings → Secrets and variables → Actions
- Click "New repository secret"
- Name:
PYPI_API_TOKEN
- Value: Your PyPI API token (starts with
pypi-
)
- First manually publish the package once using Option 1
- Go to your PyPI project page
- Navigate to "Publishing" settings
- Add GitHub as a trusted publisher:
- Owner:
PolicyEngine
- Repository:
py-statmatch
- Workflow:
versioning.yaml
- Environment: (leave blank)
- Owner:
- Remove the PYPI_API_TOKEN secret (optional)
The workflows automatically detect which method to use based on available secrets.
This is a Python port of the R StatMatch package by Marcello D'Orazio. We are grateful for the original implementation which has been invaluable to the statistical matching community.