Feature/eks loader #670
Conversation
- Implemented from_eks_file() function to load EKS CSV files
- Added support for ensemble statistics (median, variance, posterior variance)
- Updated from_file() to support "EKS" as source_software
- Added comprehensive unit tests for EKS loader functionality
- EKS files follow DeepLabCut CSV format but include additional ensemble columns

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
- Restructure ensemble statistics (ens_median, ens_var, posterior_var) to have (time, space, keypoints, individuals) dimensions
- Stack x and y components into space dimension for consistency with position data
- This ensures all data variables have matching dimension structure

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
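To illustrate the restructuring described in this commit, here is a minimal numpy sketch (the array sizes are made up for the example):

```python
import numpy as np

# Hypothetical per-component ensemble medians, each shaped
# (time, keypoints, individuals)
x_ens_median = np.zeros((100, 19, 1))
y_ens_median = np.zeros((100, 19, 1))

# Stack x and y along a new "space" axis so the result matches the
# (time, space, keypoints, individuals) layout of the position variable
ens_median = np.stack([x_ens_median, y_ens_median], axis=1)
print(ens_median.shape)  # (100, 2, 19, 1)
```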
Codecov Report ❌ Patch coverage is
Additional details and impacted files

```
@@            Coverage Diff            @@
##              main     #670      +/- ##
==========================================
- Coverage   100.00%   99.70%    -0.30%
==========================================
  Files           34       34
  Lines         1981     2036       +55
==========================================
+ Hits          1981     2030       +49
- Misses           0        6        +6
```

☔ View full report in Codecov by Sentry.
- Change test filename from sub-460_strain-B6_2024-11-12T12_30_00_eks.csv to eks_output.csv
- Update ensemble variable checks to match new structure (ens_median, ens_var, posterior_var)
- Fix shape assertions for ensemble variables to include space dimension

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
… into feature/eks-loader
I tried to upload the test file but it's too large (77 MB). Let me know how I can send it to you. (It's a 15-minute recording with 19 keypoints.)
One option could be to make it smaller (e.g. select maybe 1–2 min of the data) and add it to our movement dataset repository. Then we can fetch it from there in the tests. You will need to be provided access to the data repository though (cc @niksirbi). You can see more details on how to add and fetch data in the docs (adding new data and fetching data).
Hey @hummuscience, what @sfmig describes above is our standard procedure for adding data to test new I/O functions. We don't store data on GitHub, but on GIN instead. Our test suite has built-in support for easily fetching that data. In this particular case, I have an idea: instead of using your test file, @hummuscience, we could use the data provided in the EKS repo itself. I was able to produce EKS output files by running the EKS example scripts for the IBL pupil and paw datasets. I also confirmed that the output file has the same format no matter the number of cameras, mirrors, etc. Sharing these data on GIN should be uncomplicated since they are derived from already public datasets. What do you both think? I'm happy to handle the GIN upload myself.
I also thought about this, but it seems like all the data in the EKS repo is the input, not the output. So we would have to use the output data you got from running the scripts. That's fine with me :) Anything I need to do on my side to get this going?
Yeah, I meant uploading the output data. I will do it today and let you know when the upload is complete so you can carry on with updating the tests. Any preference for a particular dataset (paw, pupil)? For the tests it doesn't really matter, I suppose.
For the tests it doesn't really matter. Maybe for variety's sake (people trying out movement for different types of data) the paw dataset could be nice.
Hey @hummuscience, I've now uploaded two new .csv files to our GIN repository. You can see their descriptions at the bottom of the metadata file. I've also verified that they can be fetched by our `sample_data` module:

```python
>>> from movement import sample_data
>>>
>>> ds_paths = sample_data.fetch_dataset_paths("EKS_IBL-paw_multicam_right.predictions.csv")
>>> print(ds_paths["poses"])
/Users/nsirmpilatze/.movement/data/poses/EKS_IBL-paw_multicam_right.predictions.csv
```

This means that you should be able to access them directly in tests via the `DATA_PATHS` dictionary. Note that the source software should now be called "EKS" everywhere, including in the `source_software` argument.
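To spell this out, a minimal sketch of how a test (or user) could combine the fetched path with the new loader — `from_eks_file` is the function added in this PR; the rest is the `sample_data` API shown above:

```python
from movement import sample_data
from movement.io import load_poses

# Fetch the EKS sample file from GIN (cached locally after the first download)
ds_paths = sample_data.fetch_dataset_paths(
    "EKS_IBL-paw_multicam_right.predictions.csv"
)
file_path = ds_paths["poses"]

# Load it with the new EKS loader
ds = load_poses.from_eks_file(file_path, fps=30)
```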
movement/io/load_poses.py (Outdated)

```python
        individual_names=individual_names,
        keypoint_names=keypoint_names,
        fps=fps,
        source_software="EnsembleKalmanSmoother",
```
source_software="EnsembleKalmanSmoother", | |
source_software="EKS", |
- Change source_software attribute from 'EnsembleKalmanSmoother' to 'EKS'
- Update tests to use sample data files from GIN repository
- Use EKS_IBL-paw_multicam_right.predictions.csv from DATA_PATHS
- Remove dependency on local eks_output.csv file

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
Hey @hummuscience, thanks for the PR! It's great to see that your code already follows many of our conventions.
I've gone through the code for a first round of review and left inline comments. Most of my suggestions involve replacing parts of your implementation with helper functions that already exist in `movement`, to reduce duplication. I realize many of these functions aren't obvious from the outside (some are even private), so it's completely understandable that you didn't use them initially.
In addition to the inline comments, there are a couple of things we’ll need before the PR is ready to merge:
- Documentation update: The Input/Output guide needs to be updated to include the new loader. The source file is at `docs/source/user_guide/input_output.md`. Please check our contributing guide on previewing documentation locally and in CI, which will help you debug formatting issues and confirm how the published site will look.
- Additional test cases: Your current test checks that everything works for a valid EKS `.csv` file, which is a great start. But we also need to verify behavior with invalid inputs. Some of this will be covered by the `ValidDeepLabCutCSV` validator (once you apply my suggestions), but two cases are still missing:
  - Files containing extra coordinates beyond those we expect (I'd recommend not loading these, as noted in my inline comments).
  - Files where some expected ensemble coordinates are missing. In this case, I suspect `pandas` may raise an error when calling `df.xs`, but we should check whether that's indeed happening.

For generating invalid files for these tests, I recommend creating custom fixtures. You can find examples of similar fixtures in `tests/fixtures/files.py`. I can help with that part if needed, as it can be tricky.
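For instance, a fixture along these lines could generate a file with one ensemble column missing. This is a hypothetical sketch: the fixture name and column layout are assumptions (modelled on the DeepLabCut-style header that EKS files share), and the real fixtures in `tests/fixtures/files.py` may look different:

```python
import pandas as pd
import pytest


@pytest.fixture
def eks_file_missing_ensemble_column(tmp_path):
    """Return an EKS-style .csv file lacking the y_ens_median column."""
    coords = [
        "x", "y", "likelihood",
        "x_ens_median",  # "y_ens_median" deliberately omitted
        "x_ens_var", "y_ens_var",
        "x_posterior_var", "y_posterior_var",
    ]
    # DLC/EKS files use a (scorer, bodyparts, coords) column multi-index
    columns = pd.MultiIndex.from_arrays(
        [["eks"] * len(coords), ["paw"] * len(coords), coords],
        names=["scorer", "bodyparts", "coords"],
    )
    df = pd.DataFrame(data=[[0.0] * len(coords)] * 5, columns=columns)
    file_path = tmp_path / "missing_ensemble_column.csv"
    df.to_csv(file_path)
    return file_path
```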
No particular rush on this btw, I will be away for 2 weeks, so I will only look at this again in the 2nd half of September. Thanks for taking the time to make this contribution, I think this will be appreciated by many of our users.
```diff
     be among those supported by the ``from_dlc_file()``,
-    ``from_slp_file()`` or ``from_lp_file()`` functions. One of these
-    these functions will be called internally, based on
+    ``from_slp_file()``, ``from_lp_file()``, ``from_eks_file()`` functions.
+    One of these functions will be called internally, based on
```
I realised that it's starting to become a bit unwieldy to list all the functions here (we've already neglected to add some), so let's take this chance to instead point users to the "See Also" list, which is up-to-date.
```diff
-    be among those supported by the ``from_dlc_file()``,
-    ``from_slp_file()`` or ``from_lp_file()`` functions. One of these
-    these functions will be called internally, based on
-    ``from_slp_file()``, ``from_lp_file()``, ``from_eks_file()`` functions.
-    One of these functions will be called internally, based on
+    be among those supported by the software-specific loading functions
+    that are listed under "See Also".
```
```python
def from_eks_file(
    file_path: Path | str, fps: float | None = None
) -> xr.Dataset:
    """Create a ``movement`` poses dataset from an EKS file.
```
Maybe under this first line we could spell out that EKS stands for Ensemble Kalman Smoother and point to their GitHub repository (you can do this with a reference, as is done for the `from_nwb_file()` function).
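Something like the following might work. This is only a sketch: the wording is up to you, and the repository URL is my assumption of the EKS GitHub repo, to be checked against the reference style used by `from_nwb_file()`:

```python
from pathlib import Path

import xarray as xr


def from_eks_file(
    file_path: Path | str, fps: float | None = None
) -> xr.Dataset:
    """Create a ``movement`` poses dataset from an EKS file.

    EKS stands for Ensemble Kalman Smoother [1]_, a post-processing tool
    that combines predictions from an ensemble of pose estimation models.

    References
    ----------
    .. [1] https://github.com/paninski-lab/eks

    """
    ...
```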
```python
    EKS files have a similar structure to DeepLabCut CSV files but include
    additional columns for ensemble statistics:
```
To match the spelling of .csv earlier in the docstring (and in other docstrings).
```diff
-    EKS files have a similar structure to DeepLabCut CSV files but include
-    additional columns for ensemble statistics:
+    EKS files have a similar structure to DeepLabCut .csv files but include
+    additional columns for ensemble statistics:
```
```python
    - x_ens_median, y_ens_median: Median of the ensemble predictions
    - x_ens_var, y_ens_var: Variance of the ensemble predictions
    - x_posterior_var, y_posterior_var: Posterior variance from the smoother
```
If you want the bullet point list to render properly in Sphinx .rst, you have to leave a blank line before it. Moreover, and this is more of a subjective stylistic choice, I would make the column names monospace.
```diff
-    - x_ens_median, y_ens_median: Median of the ensemble predictions
-    - x_ens_var, y_ens_var: Variance of the ensemble predictions
-    - x_posterior_var, y_posterior_var: Posterior variance from the smoother
+    - ``x_ens_median``, ``y_ens_median``: Median of the ensemble predictions
+    - ``x_ens_var``, ``y_ens_var``: Variance of the ensemble predictions
+    - ``x_posterior_var``, ``y_posterior_var``: Posterior variance from
+      the Kalman smoother
```
```python
    return df


def _df_from_eks_csv(file_path: Path) -> pd.DataFrame:
```
As far as I can tell this private function is unnecessary, because it's almost identical to the existing `_df_from_dlc_csv` function. The only difference is the use of the `ValidDeepLabCutCSV` validator, which you are not using here. But I would say that you can simply use `_df_from_dlc_csv` as is. The `ValidDeepLabCutCSV` validator won't hinder you, as it only checks that the expected multi-index levels exist. As these are the same between DLC and EKS, it's actually better to use that validator, so you are completely fine using `_df_from_dlc_csv` and getting rid of `_df_from_eks_csv`.
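Concretely, the top of the loader could then look like this. A sketch only: `_ds_from_eks_df` is a hypothetical stand-in for the rest of the existing processing, and the private helpers are assumed to keep their current signatures:

```python
from pathlib import Path

import xarray as xr


def from_eks_file(
    file_path: Path | str, fps: float | None = None
) -> xr.Dataset:
    """Create a ``movement`` poses dataset from an EKS file."""
    # Reuse the existing DLC parser: its ValidDeepLabCutCSV validator only
    # checks that the expected multi-index levels exist, and these are
    # identical between DLC and EKS files
    df = _df_from_dlc_csv(Path(file_path))
    # Hypothetical helper standing in for the rest of the processing
    return _ds_from_eks_df(df, fps)
```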
```python
    # Group ensemble statistics by their type
    # (ens_median, ens_var, posterior_var)
    # Each will have x and y components as the space dimension
    ensemble_stats: dict[str, dict[str, np.ndarray]] = {}

    for coord in ensemble_coords:
        # Parse coordinate name to extract stat type and spatial component
        # e.g., "x_ens_median" -> ("ens_median", "x")
        if coord.startswith("x_"):
            stat_name = coord[2:]  # Remove "x_" prefix
            spatial_component = "x"
        elif coord.startswith("y_"):
            stat_name = coord[2:]  # Remove "y_" prefix
            spatial_component = "y"
        else:
            # If it doesn't follow expected pattern, handle as before
            coord_df = df.xs(coord, level="coords", axis=1)
            coord_data = (
                coord_df.to_numpy()
                .reshape((-1, len(individual_names), len(keypoint_names)))
                .transpose(0, 2, 1)
            )
            da = xr.DataArray(
                coord_data,
                dims=["time", "keypoints", "individuals"],
                coords={
                    "time": ds.coords["time"],
                    "keypoints": ds.coords["keypoints"],
                    "individuals": ds.coords["individuals"],
                },
                name=coord,
            )
            ds[coord] = da
            continue

        # Initialize the stat dict if needed
        if stat_name not in ensemble_stats:
            ensemble_stats[stat_name] = {}

        # Extract the data for this coordinate
        coord_df = df.xs(coord, level="coords", axis=1)
        coord_data = (
            coord_df.to_numpy()
            .reshape((-1, len(individual_names), len(keypoint_names)))
            .transpose(0, 2, 1)
        )

        # Store the data indexed by spatial component
        ensemble_stats[stat_name][spatial_component] = coord_data

    # Create DataArrays with (time, space, keypoints, individuals) dims
    for stat_name, spatial_data in ensemble_stats.items():
        if "x" in spatial_data and "y" in spatial_data:
            # Stack x and y into space dimension
            stat_array = np.stack(
                [spatial_data["x"], spatial_data["y"]], axis=1
            )  # Results in (time, space=2, keypoints, individuals)

            # Create xarray DataArray with proper dimensions
            da = xr.DataArray(
                stat_array,
                dims=["time", "space", "keypoints", "individuals"],
                coords={
                    "time": ds.coords["time"],
                    "space": ["x", "y"],
                    "keypoints": ds.coords["keypoints"],
                    "individuals": ds.coords["individuals"],
                },
                name=stat_name,
            )
            ds[stat_name] = da
```
I believe there is a more compact way of achieving what you are doing in lines 486–556:
```python
# Add ensemble statistics to the dataset
for stat in ["ens_median", "ens_var", "posterior_var"]:
    # (time, keypoints, individuals) for each of x and y
    stat_arrays_per_spatial_component = [
        (
            df.xs(f"{c}_{stat}", level="coords", axis=1)
            .to_numpy()
            .reshape((-1, ds.sizes["individuals"], ds.sizes["keypoints"]))
            .transpose(0, 2, 1)
        )
        for c in ["x", "y"]
    ]
    # Stack the arrays into (time, space=2, keypoints, individuals)
    stat_array = np.stack(stat_arrays_per_spatial_component, axis=1)
    # Create a DataArray for the ensemble statistic
    ds[stat] = xr.DataArray(
        stat_array,
        dims=["time", "space", "keypoints", "individuals"],
        coords={
            "time": ds.coords["time"],
            "space": ["x", "y"],
            "keypoints": ds.coords["keypoints"],
            "individuals": ds.coords["individuals"],
        },
    )
```
Your tests still pass with my version of the code btw, so I think it's equivalent.
```python
    if file_path is None:
        pytest.skip("EKS example file not found")
```
We no longer need these two lines.
```python
    except ImportError:
        pytest.skip("Required dependencies for EKS loading not available")
```
I would also skip the try-except block here. If something is missing, the test will fail and let us know that way.
```python
def test_load_from_file_eks():
    """Test that loading EKS files via from_file() works correctly."""
    file_path = DATA_PATHS.get("EKS_IBL-paw_multicam_right.predictions.csv")
    if file_path is None:
        pytest.skip("EKS example file not found")

    try:
        # Test loading via from_file
        ds = load_poses.from_file(file_path, source_software="EKS", fps=30)

        # Should be identical to from_eks_file
        ds_direct = load_poses.from_eks_file(file_path, fps=30)
        xr.testing.assert_identical(ds, ds_direct)

    except ImportError:
        pytest.skip("Required dependencies for EKS loading not available")
```
Instead of adding this extra test, I recommend adding the EKS option to the existing `test_from_file_delegates_correctly` test in the same file. That's the one we use for ensuring that `from_file()` delegates to the correct software-specific loader.
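A rough sketch of what that parametrization could look like — the existing test's exact shape and the other software entries are assumptions, and pytest-mock's `mocker` fixture is assumed to be available:

```python
import pytest

from movement.io import load_poses


@pytest.mark.parametrize(
    "source_software, loader",
    [
        ("DeepLabCut", "from_dlc_file"),
        ("SLEAP", "from_sleap_file"),
        ("LightningPose", "from_lp_file"),
        ("EKS", "from_eks_file"),  # new entry for this PR
    ],
)
def test_from_file_delegates_correctly(source_software, loader, mocker):
    """Ensure from_file() dispatches to the matching loader."""
    mocked_loader = mocker.patch(f"movement.io.load_poses.{loader}")
    load_poses.from_file("some_file.csv", source_software)
    mocked_loader.assert_called_once()
```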
pytest.skip("Required dependencies for EKS loading not available") | ||
|
||
|
||
def test_load_from_file_eks(): |
You could make this test much more succinct by re-using one of our helper fixtures:
```python
def test_load_from_eks_file(helpers):
    """Test that loading pose tracks from an EKS CSV file
    returns a proper Dataset with ensemble statistics.
    """
    file_path = DATA_PATHS.get("EKS_IBL-paw_multicam_right.predictions.csv")
    # Load the EKS file
    ds = load_poses.from_eks_file(file_path, fps=30)
    expected_values_eks = {
        "vars_dims": {
            "position": 4,
            "confidence": 3,
            "ens_median": 4,
            "ens_var": 4,
            "posterior_var": 4,
        },
        "dim_names": ValidPosesDataset.DIM_NAMES,
        "source_software": "EKS",
        "fps": 30,
    }
    helpers.assert_valid_dataset(ds, expected_values_eks)
```
The `helpers.assert_valid_dataset()` fixture will do the exact same checks you were doing anyway.
Hey @hummuscience, just checking in to see when you might have a chance to continue with this PR.
Hey, I will plan it for Thursday this week. Does that work for you?
Thanks for the rapid response. I'm away Thu-Fri this week but can catch up with your work next week.
Just an update: I got sick the past week. Will try to finish it tomorrow (finally!). Sorry for the wait.
Sadly, this PR dropped heavily down my priority list (PostDoc before SfN syndrome). I guess I will only realistically be able to deal with it at the end of November.
Sounds familiar 😄 There is currently no one else available to work on this PR, so November is still good. I mostly care about bringing it to a finish line at some point, so that the work you've already done doesn't go to waste. Good luck with SfN!
Description
What is this PR
Why is this PR needed?
It allows loading the outputs of EKS (Ensemble Kalman Smoother) into movement.
What does this PR do?
It implements the loader in the same way as those for the other supported packages, as shown in the sketch below.
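For example (a usage sketch; the file name is illustrative):

```python
from movement.io import load_poses

# Load an EKS output file (a DeepLabCut-style .csv with extra ensemble columns)
ds = load_poses.from_eks_file("eks_output.csv", fps=30)

# Ensemble statistics are stored alongside position and confidence, each
# with (time, space, keypoints, individuals) dimensions
print(list(ds.data_vars))
# ['position', 'confidence', 'ens_median', 'ens_var', 'posterior_var']
```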
References
Relevant discussion on Zulip: #Movement > Importer for Ensemble Kalman Smoother
How has this PR been tested?
I tested it using an output file from my own data. It is referenced in the test file but of course is not part of movement. I am not sure what the best way to handle this would be.
Is this a breaking change?
No.
Does this PR require an update to the documentation?
We would have to add the relevant parts to the input/output section of the documentation.
Checklist: