Skip to content

PolicyEngine/py-statmatch

Repository files navigation

py-statmatch

Python implementation of R's StatMatch package for statistical matching and data fusion.

Overview

py-statmatch provides tools for statistical matching (also known as data fusion or synthetic data matching) between different datasets. This package is a Python port of the popular R package StatMatch, implementing various methods to match records from different data sources that share some common variables.

Features

  • NND.hotdeck: Nearest Neighbor Distance Hot Deck matching
    • Multiple distance metrics (Euclidean, Manhattan, Mahalanobis, etc.)
    • Donation classes (match within groups/strata)
    • Constrained matching using Hungarian algorithm
    • Handle missing values appropriately
  • Coming soon:
    • RANDwNND.hotdeck: Random distance hot deck
    • rankNND.hotdeck: Rank distance hot deck
    • create.fused: Create fused datasets from matching results

Installation

pip install py-statmatch

For development:

git clone https://github.com/PolicyEngine/py-statmatch.git
cd py-statmatch
pip install -e ".[dev]"

Quick Start

import pandas as pd
from statmatch import nnd_hotdeck

# Create donor dataset (has X and Y variables)
donor_data = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'income': [30000, 45000, 55000, 65000, 80000],
    'education': ['HS', 'BA', 'BA', 'MA', 'PhD'],
    'job_satisfaction': [7, 8, 6, 9, 8]  # Variable to donate
})

# Create recipient dataset (has X variables but missing Y)
recipient_data = pd.DataFrame({
    'age': [28, 33, 42],
    'income': [35000, 50000, 70000],
    'education': ['BA', 'BA', 'MA']
})

# Perform matching
result = nnd_hotdeck(
    data_rec=recipient_data,
    data_don=donor_data,
    match_vars=['age', 'income'],
    dist_fun='euclidean'
)

# Get matched donor indices
print(result['noad.index'])  # [0, 1, 3] for example

# Create fused dataset
fused_data = recipient_data.copy()
fused_data['job_satisfaction'] = donor_data.iloc[result['noad.index']]['job_satisfaction'].values

API Reference

nnd_hotdeck

nnd_hotdeck(
    data_rec,
    data_don,
    match_vars,
    don_class=None,
    dist_fun="euclidean",
    cut_don=None,
    k=None,
    w_don=None,
    w_rec=None,
    constr_alg=None
)

Parameters:

  • data_rec (pd.DataFrame): The recipient dataset
  • data_don (pd.DataFrame): The donor dataset
  • match_vars (List[str]): List of variable names to use for matching
  • don_class (str, optional): Variable name defining donation classes
  • dist_fun (str): Distance function - "euclidean", "manhattan", "mahalanobis", etc.
  • k (int, optional): Maximum number of times each donor can be used
  • constr_alg (str, optional): Algorithm for constrained matching - "lpsolve" or "hungarian"

Returns:

  • Dictionary containing:
    • mtc.ids: DataFrame with recipient and donor IDs
    • noad.index: Array of donor indices for each recipient (0-based)
    • dist.rd: Array of distances between matched recipients and donors

Development

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=statmatch

# Run specific test
pytest tests/test_nnd_hotdeck.py::TestNNDHotdeck::test_euclidean_distance_matching

Code Style

This project uses Black for code formatting:

black . -l 79

License

MIT License - see LICENSE file for details.

Citation

If you use this package in your research, please cite both this package and the original R package:

@software{pystatmatch2024,
  title = {py-statmatch: Python implementation of R's StatMatch package},
  author = {PolicyEngine},
  year = {2024},
  url = {https://github.com/PolicyEngine/py-statmatch}
}

@Manual{rstatmatch,
  title = {StatMatch: Statistical Matching or Data Fusion},
  author = {Marcello D'Orazio},
  year = {2023},
  note = {R package version 1.4.2},
  url = {https://CRAN.R-project.org/package=StatMatch},
}

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

PyPI Publishing Setup

This repository uses GitHub Actions for automated PyPI publishing. To enable publishing:

Option 1: PyPI API Token (Recommended for now)

  1. Create an account on PyPI if you don't have one
  2. Generate an API token:
    • Go to your PyPI account settings
    • Scroll to "API tokens" section
    • Click "Add API token"
    • Give it a meaningful name (e.g., "py-statmatch GitHub Actions")
    • Select scope: "Entire account" (for first publish) or "Project: py-statmatch" (after first publish)
  3. Add the token as a GitHub repository secret:
    • Go to Settings → Secrets and variables → Actions
    • Click "New repository secret"
    • Name: PYPI_API_TOKEN
    • Value: Your PyPI API token (starts with pypi-)

Option 2: Trusted Publisher (OIDC) - Future Enhancement

  1. First manually publish the package once using Option 1
  2. Go to your PyPI project page
  3. Navigate to "Publishing" settings
  4. Add GitHub as a trusted publisher:
    • Owner: PolicyEngine
    • Repository: py-statmatch
    • Workflow: versioning.yaml
    • Environment: (leave blank)
  5. Remove the PYPI_API_TOKEN secret (optional)

The workflows automatically detect which method to use based on available secrets.

Acknowledgments

This is a Python port of the R StatMatch package by Marcello D'Orazio. We are grateful for the original implementation which has been invaluable to the statistical matching community.

About

Python implementation of R's StatMatch package for statistical matching and data fusion

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 2

  •  
  •