
Conversation

hummuscience

Description

What is this PR

  • Bug fix
  • Addition of a new feature
  • Other

Why is this PR needed?
It allows loading the outputs of EKS (Ensemble Kalman Smoother) into movement.

What does this PR do?
It implements a loader for EKS output files, similar to the loaders for the other supported packages.

References

Relevant discussion on Zulip: #Movement > Importer for Ensemble Kalman Smoother

How has this PR been tested?

I tested it using an output file from my own data. It is referenced in the test file but of course is not part of movement. I am not sure what the best way to handle this would be.

Is this a breaking change?

No.

Does this PR require an update to the documentation?

Yes, the relevant parts would have to be added to the input/output section of the documentation.

Checklist:

  • [x] The code has been tested locally
  • [?] Tests have been added to cover all new functionality (maybe? I am not sure)
  • [ ] The documentation has been updated to reflect any changes
  • [x] The code has been formatted with pre-commit

Muad Abd El Hay and others added 3 commits August 27, 2025 12:52
- Implemented from_eks_file() function to load EKS CSV files
- Added support for ensemble statistics (median, variance, posterior variance)
- Updated from_file() to support "EKS" as source_software
- Added comprehensive unit tests for EKS loader functionality
- EKS files follow DeepLabCut CSV format but include additional ensemble columns

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Restructure ensemble statistics (ens_median, ens_var, posterior_var) to have (time, space, keypoints, individuals) dimensions
- Stack x and y components into space dimension for consistency with position data
- This ensures all data variables have matching dimension structure

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
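
For reference, the dataset layout these commits aim for looks roughly like this (an illustrative sketch inferred from the commit messages and the tests later in this thread, not verbatim movement output):

    <xarray.Dataset>
    Dimensions:        (time: T, space: 2, keypoints: K, individuals: 1)
    Data variables:
        position       (time, space, keypoints, individuals) float64
        confidence     (time, keypoints, individuals) float64
        ens_median     (time, space, keypoints, individuals) float64
        ens_var        (time, space, keypoints, individuals) float64
        posterior_var  (time, space, keypoints, individuals) float64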

codecov bot commented Aug 28, 2025

Codecov Report

❌ Patch coverage is 89.09091% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 99.70%. Comparing base (dacc6ef) to head (44480db).

Files with missing lines Patch % Lines
movement/io/load_poses.py 89.09% 6 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##              main     #670      +/-   ##
===========================================
- Coverage   100.00%   99.70%   -0.30%     
===========================================
  Files           34       34              
  Lines         1981     2036      +55     
===========================================
+ Hits          1981     2030      +49     
- Misses           0        6       +6     


Muad Abd El Hay and others added 2 commits August 28, 2025 10:45
- Change test filename from sub-460_strain-B6_2024-11-12T12_30_00_eks.csv to eks_output.csv
- Update ensemble variable checks to match new structure (ens_median, ens_var, posterior_var)
- Fix shape assertions for ensemble variables to include space dimension

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@hummuscience
Author

I tried to upload the test file but it's too large (77 MB). Let me know how I can send it to you. (It's a 15-minute recording with 19 keypoints.)

@sfmig
Contributor

sfmig commented Aug 29, 2025

I tried to upload the test file but it's too large (77 MB). Let me know how I can send it to you. (It's a 15-minute recording with 19 keypoints.)

One option could be to make it smaller (i.e. select maybe 1-2 min of the data) and add it to our movement dataset repository. Then we can fetch it from there in the tests.

You will need to be provided access to the data repository though (cc @niksirbi). You can see more details on how to add and fetch data in the docs (adding new data and fetching data).

@niksirbi
Member

niksirbi commented Aug 29, 2025

Hey @hummuscience, what @sfmig describes above is our standard procedure for adding data to test new I/O functions.

We don't store data on GitHub, but on GIN instead. Our test suite has built-in support for easily fetching that data.
Specifically, if in the test file you call from pytest import DATA_PATHS, the data files on GIN become accessible to you via DATA_PATHS.get(filename). You can see this in action here, for example.
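
A minimal sketch of that pattern (the filename and test name here are placeholders):

    from pytest import DATA_PATHS

    from movement.io import load_poses


    def test_load_some_poses_file():
        # Fetches the file from GIN (or the local cache) and returns its path
        file_path = DATA_PATHS.get("some_poses_file.csv")
        ds = load_poses.from_file(file_path, source_software="DeepLabCut")
        assert "position" in ds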

In this particular case, I have an idea. Instead of using your test file, @hummuscience, we could use the data provided in the EKS repo itself. I was able to produce EKS output files by running the EKS example scripts for the IBL pupil and paw datasets. I also confirmed that the output file has the same format no matter the number of cameras, mirrors, etc. Sharing these data on GIN should be uncomplicated since they are derived from already public datasets.

What do you both think? I'm happy to handle the GIN upload myself.

@hummuscience
Author

I also thought about this, but it seems like all the data on the EKS repo is the input, and not the output. So we would have to use the output data you got from running them. That's fine for me :)

Anything I need to do from my side to get this going?

@niksirbi
Member

Yeah, I meant uploading the output data. I will do it today and let you know when the upload is complete so you can carry on with updating the tests. Any preference for a particular dataset (paw, pupil)? For the tests it doesn't really matter I suppose.

@hummuscience
Author

For the tests it doesn't really matter. Maybe for variety’s sake (people trying out movement for different types of data) the paw dataset could be nice.

@niksirbi
Member

Hey @hummuscience, I've now uploaded two new .csv files to our GIN repository:
EKS_IBL-paw_multicam_left.predictions.csv and EKS_IBL-paw_multicam_right.predictions.csv.

You can see their descriptions at the bottom of the metadata.yaml file.

I've also verified that they can be fetched by our sample_data module:

>>> from movement import sample_data
>>> 
>>> ds_paths = sample_data.fetch_dataset_paths("EKS_IBL-paw_multicam_right.predictions.csv")
>>> print(ds_paths["poses"])
/Users/nsirmpilatze/.movement/data/poses/EKS_IBL-paw_multicam_right.predictions.csv

This means that you should be able to access them directly in tests via the DATA_PATHS.get("EKS_IBL-paw_multicam_right.predictions.csv") syntax.

Note that the source software should now be called "EKS" everywhere, including in the source_software attribute. We can refer to the full name ("Ensemble Kalman Smoother") in the docstring, for completeness, but it should be "EKS" everywhere else.

    individual_names=individual_names,
    keypoint_names=keypoint_names,
    fps=fps,
    source_software="EnsembleKalmanSmoother",

Suggested change
-    source_software="EnsembleKalmanSmoother",
+    source_software="EKS",

- Change source_software attribute from 'EnsembleKalmanSmoother' to 'EKS'
- Update tests to use sample data files from GIN repository
- Use EKS_IBL-paw_multicam_right.predictions.csv from DATA_PATHS
- Remove dependency on local eks_output.csv file

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@niksirbi left a comment (Member)

Hey @hummuscience, thanks for the PR! It’s great to see that your code already follows many of our conventions.

I’ve gone through the code for a first round of review and left inline comments. Most of my suggestions involve replacing parts of your implementation with helper functions that already exist in movement, to reduce duplication. I realize many of these functions aren’t obvious from the outside (some are even private), so it’s completely understandable that you didn’t use them initially.

In addition to the inline comments, there are a couple of things we’ll need before the PR is ready to merge:

  • Documentation update: The Input/Output guide needs to be updated to include the new loader. The source file is at docs/source/user_guide/input_output.md. Please check our contributing guide on previewing documentation locally and in CI, which will help you debug formatting issues and confirm how the published site will look.
  • Additional test cases: Your current test checks that everything works for a valid EKS .csv file, which is a great start. But we also need to verify behavior with invalid inputs. Some of this will be covered by the ValidDeepLabCutCSV validator (once you apply my suggestions), but two cases are still missing:
    • Files containing extra coordinates beyond those we expect (I’d recommend not loading these, as noted in my inline comments).
    • Files where some expected ensemble coordinates are missing. In this case, I suspect pandas may raise an error when calling df.xs, but we should check whether that’s indeed happening.

For generating invalid files for these tests, I recommend creating custom fixtures. You can find examples of similar fixtures in tests/fixtures/files.py. I can help with that part if needed, as it can be tricky.
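
For illustration, a hypothetical fixture along those lines (the name and column layout are illustrative, not the actual contents of tests/fixtures/files.py):

    import pandas as pd
    import pytest


    @pytest.fixture
    def eks_file_missing_ensemble_columns(tmp_path):
        """An EKS-style .csv file lacking the expected ensemble columns."""
        # DLC-style multi-index header, but without the *_ens_* coords
        columns = pd.MultiIndex.from_product(
            [["eks"], ["paw_l"], ["x", "y", "likelihood"]],
            names=["scorer", "bodyparts", "coords"],
        )
        df = pd.DataFrame([[1.0, 2.0, 0.9], [1.1, 2.1, 0.8]], columns=columns)
        file_path = tmp_path / "missing_ensemble_cols.csv"
        df.to_csv(file_path)
        return file_path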

No particular rush on this btw, I will be away for 2 weeks, so I will only look at this again in the 2nd half of September. Thanks for taking the time to make this contribution, I think this will be appreciated by many of our users.

Comment on lines 115 to +117
 be among those supported by the ``from_dlc_file()``,
-``from_slp_file()`` or ``from_lp_file()`` functions. One of these
-these functions will be called internally, based on
+``from_slp_file()``, ``from_lp_file()``, ``from_eks_file()`` functions.
+One of these functions will be called internally, based on

I realised that it's starting to become a bit unwieldy to list all the functions here (we've already neglected to add some), so let's take this chance to instead point users to the "See Also" list, which is up-to-date.

Suggested change
-be among those supported by the ``from_dlc_file()``,
-``from_slp_file()`` or ``from_lp_file()`` functions. One of these
-these functions will be called internally, based on
-``from_slp_file()``, ``from_lp_file()``, ``from_eks_file()`` functions.
-One of these functions will be called internally, based on
+be among those supported by the software-specific loading functions
+that are listed under "See Also".

def from_eks_file(
    file_path: Path | str, fps: float | None = None
) -> xr.Dataset:
    """Create a ``movement`` poses dataset from an EKS file.

Maybe under this first line we could spell out that EKS stands for Ensemble Kalman Smoother and point to their GitHub repository (you can do this with a reference, as is done for the from_nwb_file() function).

Comment on lines +405 to +406
EKS files have a similar structure to DeepLabCut CSV files but include
additional columns for ensemble statistics:

To match the spelling of .csv earlier in the docstring (and in other docstrings).

Suggested change
-EKS files have a similar structure to DeepLabCut CSV files but include
-additional columns for ensemble statistics:
+EKS files have a similar structure to DeepLabCut .csv files but include
+additional columns for ensemble statistics:

Comment on lines +407 to +409
- x_ens_median, y_ens_median: Median of the ensemble predictions
- x_ens_var, y_ens_var: Variance of the ensemble predictions
- x_posterior_var, y_posterior_var: Posterior variance from the smoother

If you want the bullet point list to render properly in Sphinx .rst you have to leave a blank line. Moreover, and this is more of a subjective stylistic choice, I would make the column names monospace.

Suggested change
-- x_ens_median, y_ens_median: Median of the ensemble predictions
-- x_ens_var, y_ens_var: Variance of the ensemble predictions
-- x_posterior_var, y_posterior_var: Posterior variance from the smoother
+- ``x_ens_median``, ``y_ens_median``: Median of the ensemble predictions
+- ``x_ens_var``, ``y_ens_var``: Variance of the ensemble predictions
+- ``x_posterior_var``, ``y_posterior_var``: Posterior variance from
+  the Kalman smoother
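
In context, the blank line goes right after the sentence that introduces the list, e.g.:

    additional columns for ensemble statistics:

    - ``x_ens_median``, ``y_ens_median``: Median of the ensemble predictions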

    return df


def _df_from_eks_csv(file_path: Path) -> pd.DataFrame:

As far as I can tell this private function is unnecessary, because it's almost identical to the existing _df_from_dlc_csv function. The only difference is the ValidDeepLabCutCSV validator, which you are not using here. But I would say that you can simply use _df_from_dlc_csv as is. The ValidDeepLabCutCSV validator won't hinder you, as it only checks that the expected multi-index levels exist. As these are the same between DLC and EKS, it's actually better to use that validator, so you are completely fine using _df_from_dlc_csv and getting rid of _df_from_eks_csv.

Comment on lines +486 to +556
    # Group ensemble statistics by their type
    # (ens_median, ens_var, posterior_var)
    # Each will have x and y components as the space dimension
    ensemble_stats: dict[str, dict[str, np.ndarray]] = {}

    for coord in ensemble_coords:
        # Parse coordinate name to extract stat type and spatial component
        # e.g., "x_ens_median" -> ("ens_median", "x")
        if coord.startswith("x_"):
            stat_name = coord[2:]  # Remove "x_" prefix
            spatial_component = "x"
        elif coord.startswith("y_"):
            stat_name = coord[2:]  # Remove "y_" prefix
            spatial_component = "y"
        else:
            # If it doesn't follow expected pattern, handle as before
            coord_df = df.xs(coord, level="coords", axis=1)
            coord_data = (
                coord_df.to_numpy()
                .reshape((-1, len(individual_names), len(keypoint_names)))
                .transpose(0, 2, 1)
            )
            da = xr.DataArray(
                coord_data,
                dims=["time", "keypoints", "individuals"],
                coords={
                    "time": ds.coords["time"],
                    "keypoints": ds.coords["keypoints"],
                    "individuals": ds.coords["individuals"],
                },
                name=coord,
            )
            ds[coord] = da
            continue

        # Initialize the stat dict if needed
        if stat_name not in ensemble_stats:
            ensemble_stats[stat_name] = {}

        # Extract the data for this coordinate
        coord_df = df.xs(coord, level="coords", axis=1)
        coord_data = (
            coord_df.to_numpy()
            .reshape((-1, len(individual_names), len(keypoint_names)))
            .transpose(0, 2, 1)
        )

        # Store the data indexed by spatial component
        ensemble_stats[stat_name][spatial_component] = coord_data

    # Create DataArrays with (time, space, keypoints, individuals) dims
    for stat_name, spatial_data in ensemble_stats.items():
        if "x" in spatial_data and "y" in spatial_data:
            # Stack x and y into space dimension
            stat_array = np.stack(
                [spatial_data["x"], spatial_data["y"]], axis=1
            )  # Results in (time, space=2, keypoints, individuals)

            # Create xarray DataArray with proper dimensions
            da = xr.DataArray(
                stat_array,
                dims=["time", "space", "keypoints", "individuals"],
                coords={
                    "time": ds.coords["time"],
                    "space": ["x", "y"],
                    "keypoints": ds.coords["keypoints"],
                    "individuals": ds.coords["individuals"],
                },
                name=stat_name,
            )
            ds[stat_name] = da

I believe there is a more compact way of achieving what you are doing in lines 486 - 556:

    # Add ensemble statistics to the dataset
    for stat in ["ens_median", "ens_var", "posterior_var"]:
        # (time, keypoints, individuals) for each of x and y
        stat_arrays_per_spatial_component = [
            (
                df.xs(f"{c}_{stat}", level="coords", axis=1)
                .to_numpy()
                .reshape((-1, ds.sizes["individuals"], ds.sizes["keypoints"]))
                .transpose(0, 2, 1)
            )
            for c in ["x", "y"]
        ]
        # Stack the arrays into (time, space=2, keypoints, individuals)
        stat_array = np.stack(stat_arrays_per_spatial_component, axis=1)
        # Create a DataArray for the ensemble statistic
        ds[stat] = xr.DataArray(
            stat_array,
            dims=["time", "space", "keypoints", "individuals"],
            coords={
                "time": ds.coords["time"],
                "space": ["x", "y"],
                "keypoints": ds.coords["keypoints"],
                "individuals": ds.coords["individuals"],
            },
        )

Your tests still pass with my version of the code btw, so I think it's equivalent.

Comment on lines +324 to +325
    if file_path is None:
        pytest.skip("EKS example file not found")

We no longer need these two lines.

Comment on lines +362 to +363
    except ImportError:
        pytest.skip("Required dependencies for EKS loading not available")

I would also skip the try-except block here. If something is missing, the test will fail and let us know that way.

Comment on lines +366 to +381
def test_load_from_file_eks():
    """Test that loading EKS files via from_file() works correctly."""
    file_path = DATA_PATHS.get("EKS_IBL-paw_multicam_right.predictions.csv")
    if file_path is None:
        pytest.skip("EKS example file not found")

    try:
        # Test loading via from_file
        ds = load_poses.from_file(file_path, source_software="EKS", fps=30)

        # Should be identical to from_eks_file
        ds_direct = load_poses.from_eks_file(file_path, fps=30)
        xr.testing.assert_identical(ds, ds_direct)

    except ImportError:
        pytest.skip("Required dependencies for EKS loading not available")

Instead of adding this extra test, I recommend adding the EKS option to the existing test_from_file_delegates_correctly test in the same file. That's the one we use for ensuring that from_file() delegates to the correct software-specific loader.
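
Something along these lines, for example (a sketch only; the actual parametrisation of test_from_file_delegates_correctly may differ, and mocker assumes pytest-mock):

    @pytest.mark.parametrize(
        "source_software, loader_name",
        [
            ("DeepLabCut", "from_dlc_file"),
            ("SLEAP", "from_sleap_file"),
            ("LightningPose", "from_lp_file"),
            ("EKS", "from_eks_file"),  # new entry for this PR
        ],
    )
    def test_from_file_delegates_correctly(source_software, loader_name, mocker):
        # Patch the software-specific loader and check it gets called
        mock_loader = mocker.patch(f"movement.io.load_poses.{loader_name}")
        load_poses.from_file("some_file.csv", source_software=source_software)
        mock_loader.assert_called_once()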

        pytest.skip("Required dependencies for EKS loading not available")


def test_load_from_file_eks():

You could make this test much more succinct by re-using one of our helper fixtures:

def test_load_from_eks_file(helpers):
    """Test that loading pose tracks from an EKS CSV file
    returns a proper Dataset with ensemble statistics.
    """
    file_path = DATA_PATHS.get("EKS_IBL-paw_multicam_right.predictions.csv")
    # Load the EKS file
    ds = load_poses.from_eks_file(file_path, fps=30)

    expected_values_eks = {
        "vars_dims": {
            "position": 4,
            "confidence": 3,
            "ens_median": 4,
            "ens_var": 4,
            "posterior_var": 4
        },
        "dim_names": ValidPosesDataset.DIM_NAMES,
        "source_software": "EKS",
        "fps": 30,
    }

    helpers.assert_valid_dataset(ds, expected_values_eks)

The helpers.assert_valid_dataset() helper will do the exact same checks you were doing anyway.

@niksirbi
Member

Hey @hummuscience, just checking in to see when you might have a chance to continue with this PR.
No rush at all — I’m mainly trying to keep track of open PRs and plan my review time. This is a really useful feature and we’d love to get it merged when you’re ready to pick it up again.

@hummuscience
Author

Hey, I will plan it for Thursday this week. Does that work for you?

@niksirbi
Member

Hey, I will plan it for Thursday this week. Does that work for you?

Thanks for the rapid response. I'm away Thu-Fri this week but can catch up with your work next week.
Feel free to work when is best for you, I just wanted to know on a macro level.

@hummuscience
Author

Just an update: I got sick the past week. Will try to finish it tomorrow (finally!). Sorry for the wait.

@hummuscience
Author

Sadly, this PR dropped down heavily on my priority list (PostDoc before SfN syndrome). I guess I will only realistically be able to deal with it at the end of November.

@niksirbi
Member

PostDoc before SfN syndrome

Sounds familiar 😄

There is currently no one else available to work on this PR, so November is still good. I mostly care about bringing it across the finish line at some point, so that the work you've already done doesn't go to waste.

Good luck with SfN!
