Conversation

@Joshuathomas18
Metadata

This PR implements the metadata property on the OpenMLBenchmarkSuite class as requested in issue #1126. The property returns a pandas DataFrame containing comprehensive metadata for all tasks in the suite, combining both task-level information (task ID, estimation procedure, target feature) and dataset-level information (dataset ID, version, uploader, number of instances, features, classes, etc.).

Key changes:

  • Added metadata property to OpenMLBenchmarkSuite class in openml/study/study.py
  • Property uses efficient batch API calls (_list_tasks and list_datasets) to minimize network overhead
  • Implements lazy loading with caching to avoid redundant API calls
  • Merges task and dataset metadata using a left join to preserve one row per task
  • Handles edge cases: empty suites, missing datasets, and API errors gracefully
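The lazy-loading, batch-call, and left-join behavior described above can be sketched as follows. This is an illustrative stand-in, not the PR's actual code: the class name, the `_list_*` stubs, and the toy IDs are assumptions that replace the real `_list_tasks`/`list_datasets` API calls.

```python
import pandas as pd

class SuiteMetadataSketch:
    """Illustrative stand-in for the PR's OpenMLBenchmarkSuite.metadata;
    the _list_* stubs replace the real batch API calls."""

    def __init__(self, task_ids):
        self.tasks = task_ids
        self._metadata_cache = None  # lazy: filled on first .metadata access

    def _list_tasks(self):
        # Stub for batch call 1: one request listing every task in the suite.
        return pd.DataFrame({"tid": self.tasks,
                             "did": [10 * t for t in self.tasks]})

    def _list_datasets(self, dids):
        # Stub for batch call 2: one request listing every referenced dataset.
        return pd.DataFrame({"did": dids,
                             "NumberOfInstances": [100 * d for d in dids]})

    @property
    def metadata(self):
        if self._metadata_cache is None:
            tasks = self._list_tasks()
            datasets = self._list_datasets(tasks["did"].tolist())
            # Left join keeps exactly one row per task; a missing dataset
            # yields NaN in its columns instead of dropping the row.
            self._metadata_cache = tasks.merge(datasets, on="did", how="left")
        return self._metadata_cache

suite = SuiteMetadataSketch([1, 2, 3])
first = suite.metadata
second = suite.metadata
assert first is second   # cached: the second access does no "API" work
assert len(first) == 3   # one row per task
```

The cache-then-merge shape is the important part: two batch lookups feed a single `merge(..., how="left")`, so the row count always equals the number of tasks in the suite.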

Currently, researchers using OpenML benchmark suites must manually aggregate metadata from individual tasks and datasets to create the standard "dataset characteristics" tables required for academic publications. This process is:

  • Time-consuming: Requires writing custom scripts that iterate through tasks
  • Inefficient: Triggers N+1 API calls (one per task, plus dataset lookups)
  • Error-prone: Manual data aggregation can lead to inconsistencies

This implementation solves these issues by:

  • Providing a single property that returns all necessary metadata
  • Using batch API calls (only 2 calls regardless of suite size)
  • Ensuring consistency across publications using the same suite
  • Enabling direct LaTeX export using pandas' to_latex() method

Before (manual approach):

import openml
suite = openml.study.get_suite(271)
tasks = [openml.tasks.get_task(tid, download_data=False, download_qualities=False)
         for tid in suite.tasks]  # N API calls
metadata = openml.datasets.list_datasets(
    data_id=[t.dataset_id for t in tasks],
    output_format="dataframe",
)  # Additional API call
# Manual merging and formatting required...

After (with this PR):

import openml
suite = openml.study.get_suite(99)  # OpenML-CC18
meta = suite.metadata  # Single property access, 2 API calls total

# Direct LaTeX export
columns = ['name', 'NumberOfInstances', 'NumberOfFeatures', 'NumberOfClasses']
latex_table = meta[columns].style.to_latex(
    caption="Dataset Characteristics",
    label="tab:suite_metadata"
)

Testing:

# Run the new unit tests
pytest tests/test_study/test_benchmark_suite_metadata.py -v

# Test with a real suite (optional, requires API access)
python -c "import openml; suite = openml.study.get_suite(99); print(suite.metadata.shape)"

  • Performance: The implementation uses batch API calls, reducing network overhead from O(N) to O(1) relative to suite size. For a suite with 72 tasks, this reduces API calls from ~144 to 2.

  • Caching: The property implements lazy loading with caching. The first access triggers API calls, but subsequent accesses return the cached DataFrame instantly.

  • Error Handling: API failures are caught and re-raised as RuntimeError with informative messages that help users debug issues.

  • Backward Compatibility: This is a purely additive change. No existing functionality is modified, ensuring full backward compatibility.

  • Documentation:

    • Comprehensive docstring with examples
    • New example script: examples/Advanced/suite_metadata_latex_export.py
    • Demonstrates basic usage, column selection, LaTeX export, and advanced formatting
  • Code Quality:

    • Follows existing codebase patterns
    • Type hints included
    • No linting errors
    • Comprehensive unit test coverage (7 test cases)
  • Files Changed:

    • openml/study/study.py: Added imports, cache initialization, and metadata property
    • tests/test_study/test_benchmark_suite_metadata.py: New test file with 7 unit tests
    • examples/Advanced/suite_metadata_latex_export.py: New example script demonstrating usage

Joshuathomas18 and others added 7 commits November 26, 2025 10:13
- Add metadata property that returns pandas DataFrame with task and dataset metadata
- Implement efficient batch API calls using _list_tasks and list_datasets
- Add lazy loading with caching to avoid redundant API calls
- Handle edge cases: empty suites, missing datasets, and API errors
- Add comprehensive unit tests (7 test cases)
- Add example script demonstrating LaTeX export usage

Fixes openml#1126

Development

Successfully merging this pull request may close these issues.

Support for Exporting Benchmarking Suites to LaTeX
