Conversation

@Joshuathomas18
Metadata

This PR implements the metadata property on the OpenMLBenchmarkSuite class as requested in issue #1126. The property returns a pandas DataFrame containing comprehensive metadata for all tasks in the suite, combining both task-level information (task ID, estimation procedure, target feature) and dataset-level information (dataset ID, version, uploader, number of instances, features, classes, etc.).

Key changes:

  • Added metadata property to OpenMLBenchmarkSuite class in openml/study/study.py
  • Property uses efficient batch API calls (_list_tasks and list_datasets) to minimize network overhead
  • Implements lazy loading with caching to avoid redundant API calls
  • Merges task and dataset metadata using a left join to preserve one row per task
  • Handles edge cases: empty suites, missing datasets, and API errors gracefully
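The lazy-loading, batch-call, and left-join behavior described above can be sketched as follows. This is an illustrative stand-in, not the PR's actual code: the class name, the `_list_*` stubs, and the toy IDs are assumptions that replace the real `_list_tasks`/`list_datasets` API calls.

```python
import pandas as pd

class SuiteMetadataSketch:
    """Illustrative stand-in for the PR's OpenMLBenchmarkSuite.metadata;
    the _list_* stubs replace the real batch API calls."""

    def __init__(self, task_ids):
        self.tasks = task_ids
        self._metadata_cache = None  # lazy: filled on first .metadata access

    def _list_tasks(self):
        # Stub for batch call 1: one request listing every task in the suite.
        return pd.DataFrame({"tid": self.tasks,
                             "did": [10 * t for t in self.tasks]})

    def _list_datasets(self, dids):
        # Stub for batch call 2: one request listing every referenced dataset.
        return pd.DataFrame({"did": dids,
                             "NumberOfInstances": [100 * d for d in dids]})

    @property
    def metadata(self):
        if self._metadata_cache is None:
            tasks = self._list_tasks()
            datasets = self._list_datasets(tasks["did"].tolist())
            # Left join keeps exactly one row per task; a missing dataset
            # yields NaN in its columns instead of dropping the row.
            self._metadata_cache = tasks.merge(datasets, on="did", how="left")
        return self._metadata_cache

suite = SuiteMetadataSketch([1, 2, 3])
first = suite.metadata
second = suite.metadata
assert first is second   # cached: the second access does no "API" work
assert len(first) == 3   # one row per task
```

The cache-then-merge shape is the important part: two batch lookups feed a single `merge(..., how="left")`, so the row count always equals the number of tasks in the suite.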

Currently, researchers using OpenML benchmark suites must manually aggregate metadata from individual tasks and datasets to create the standard "dataset characteristics" tables required for academic publications. This process is:

  • Time-consuming: Requires writing custom scripts that iterate through tasks
  • Inefficient: Triggers N+1 API calls (one per task, plus dataset lookups)
  • Error-prone: Manual data aggregation can lead to inconsistencies

This implementation solves these issues by:

  • Providing a single property that returns all necessary metadata
  • Using batch API calls (only 2 calls regardless of suite size)
  • Ensuring consistency across publications using the same suite
  • Enabling direct LaTeX export using pandas' to_latex() method

Before (manual approach):

import openml
suite = openml.study.get_suite(271)
tasks = [openml.tasks.get_task(tid, download_data=False, download_qualities=False)
         for tid in suite.tasks]  # N API calls
metadata = openml.datasets.list_datasets(
    data_id=[t.dataset_id for t in tasks],
    output_format="dataframe",
)  # Additional API call
# Manual merging and formatting required...

After (with this PR):

import openml
suite = openml.study.get_suite(99)  # OpenML-CC18
meta = suite.metadata  # Single property access, 2 API calls total

# Direct LaTeX export
columns = ['name', 'NumberOfInstances', 'NumberOfFeatures', 'NumberOfClasses']
latex_table = meta[columns].style.to_latex(
    caption="Dataset Characteristics",
    label="tab:suite_metadata"
)

Testing:

# Run the new unit tests
pytest tests/test_study/test_benchmark_suite_metadata.py -v

# Test with a real suite (optional, requires API access)
python -c "import openml; suite = openml.study.get_suite(99); print(suite.metadata.shape)"

  • Performance: The implementation uses batch API calls, reducing network overhead from O(N) to O(1) relative to suite size. For a suite with 72 tasks, this reduces API calls from ~144 to 2.

  • Caching: The property implements lazy loading with caching. The first access triggers API calls, but subsequent accesses return the cached DataFrame instantly.

  • Error Handling: API failures are caught and re-raised as RuntimeError with informative messages that help users debug issues.

  • Backward Compatibility: This is a purely additive change. No existing functionality is modified, ensuring full backward compatibility.

  • Documentation:

    • Comprehensive docstring with examples
    • New example script: examples/Advanced/suite_metadata_latex_export.py
    • Demonstrates basic usage, column selection, LaTeX export, and advanced formatting
  • Code Quality:

    • Follows existing codebase patterns
    • Type hints included
    • No linting errors
    • Comprehensive unit test coverage (7 test cases)
  • Files Changed:

    • openml/study/study.py: Added imports, cache initialization, and metadata property
    • tests/test_study/test_benchmark_suite_metadata.py: New test file with 7 unit tests
    • examples/Advanced/suite_metadata_latex_export.py: New example script demonstrating usage

Joshuathomas18 and others added 7 commits November 26, 2025 10:13
- Add metadata property that returns pandas DataFrame with task and dataset metadata
- Implement efficient batch API calls using _list_tasks and list_datasets
- Add lazy loading with caching to avoid redundant API calls
- Handle edge cases: empty suites, missing datasets, and API errors
- Add comprehensive unit tests (7 test cases)
- Add example script demonstrating LaTeX export usage

Fixes openml#1126

Development

Successfully merging this pull request may close these issues.

Support for Exporting Benchmarking Suites to LaTeX
