Feature/eks loader #670
Conversation
- Implemented from_eks_file() function to load EKS CSV files
- Added support for ensemble statistics (median, variance, posterior variance)
- Updated from_file() to support "EKS" as source_software
- Added comprehensive unit tests for EKS loader functionality
- EKS files follow DeepLabCut CSV format but include additional ensemble columns

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
- Restructure ensemble statistics (ens_median, ens_var, posterior_var) to have (time, space, keypoints, individuals) dimensions
- Stack x and y components into space dimension for consistency with position data
- This ensures all data variables have matching dimension structure

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
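To illustrate the restructuring described in this commit, here is a minimal numpy sketch (the array sizes are made up for the example):

```python
import numpy as np

# Hypothetical per-component ensemble medians, each shaped
# (time, keypoints, individuals)
x_ens_median = np.zeros((100, 19, 1))
y_ens_median = np.zeros((100, 19, 1))

# Stack x and y along a new "space" axis so the result matches the
# (time, space, keypoints, individuals) layout of the position variable
ens_median = np.stack([x_ens_median, y_ens_median], axis=1)
print(ens_median.shape)  # (100, 2, 19, 1)
```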
Codecov Report ❌ Patch coverage is
Additional details and impacted files

```
@@            Coverage Diff            @@
##              main     #670      +/- ##
==========================================
- Coverage   100.00%   99.70%    -0.30%
==========================================
  Files           34       34
  Lines         1981     2036       +55
==========================================
+ Hits          1981     2030       +49
- Misses           0        6        +6
```

☔ View full report in Codecov by Sentry.
- Change test filename from sub-460_strain-B6_2024-11-12T12_30_00_eks.csv to eks_output.csv
- Update ensemble variable checks to match new structure (ens_median, ens_var, posterior_var)
- Fix shape assertions for ensemble variables to include space dimension

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
… into feature/eks-loader
I tried to upload the test file but it's too large (77 MB). Let me know how I can send it to you. (It's a 15-minute recording with 19 keypoints.)
One option could be to make it smaller (e.g. select maybe 1–2 min of the data) and add it to our movement dataset repository. Then we can fetch it from there in the tests. You will need to be provided access to the data repository though (cc @niksirbi). You can see more details on how to add and fetch data in the docs (adding new data and fetching data).
Hey @hummuscience, what @sfmig describes above is our standard procedure for adding data to test new I/O functions. We don't store data on GitHub, but on GIN instead. Our test suite has built-in support for easily fetching that data. In this particular case, I have an idea: instead of using your test file, @hummuscience, we could use the data provided in the EKS repo itself. I was able to produce EKS output files by running the EKS example scripts for the IBL pupil and paw datasets. I also confirmed that the output file has the same format no matter the number of cameras, mirrors, etc. Sharing these data on GIN should be uncomplicated since they are derived from already public datasets. What do you both think? I'm happy to handle the GIN upload myself.
I also thought about this, but it seems like all the data in the EKS repo is the input, not the output. So we would have to use the output data you got from running the scripts. That's fine with me :) Anything I need to do on my side to get this going?
Yeah, I meant uploading the output data. I will do it today and let you know when the upload is complete so you can carry on with updating the tests. Any preference for a particular dataset (paw, pupil)? For the tests it doesn't really matter, I suppose.
For the tests it doesn't really matter. Maybe for variety's sake (people trying out movement for different types of data) the paw dataset could be nice.
Hey @hummuscience, I've now uploaded two new .csv files to our GIN repository. You can see their descriptions at the bottom of the metadata file. I've also verified that they can be fetched by our `sample_data` module:

```python
>>> from movement import sample_data
>>>
>>> ds_paths = sample_data.fetch_dataset_paths("EKS_IBL-paw_multicam_right.predictions.csv")
>>> print(ds_paths["poses"])
/Users/nsirmpilatze/.movement/data/poses/EKS_IBL-paw_multicam_right.predictions.csv
```

This means that you should be able to access them directly in tests via the `DATA_PATHS` dictionary. Note that the source software should now be called "EKS" everywhere, including in the `source_software` argument.
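To spell this out, a minimal sketch of how a test (or user) could combine the fetched path with the new loader — `from_eks_file` is the function added in this PR; the rest is the `sample_data` API shown above:

```python
from movement import sample_data
from movement.io import load_poses

# Fetch the EKS sample file from GIN (cached locally after the first download)
ds_paths = sample_data.fetch_dataset_paths(
    "EKS_IBL-paw_multicam_right.predictions.csv"
)
file_path = ds_paths["poses"]

# Load it with the new EKS loader
ds = load_poses.from_eks_file(file_path, fps=30)
```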
movement/io/load_poses.py (Outdated)

```python
        individual_names=individual_names,
        keypoint_names=keypoint_names,
        fps=fps,
        source_software="EnsembleKalmanSmoother",
```
source_software="EnsembleKalmanSmoother", | |
source_software="EKS", |
- Change source_software attribute from 'EnsembleKalmanSmoother' to 'EKS'
- Update tests to use sample data files from GIN repository
- Use EKS_IBL-paw_multicam_right.predictions.csv from DATA_PATHS
- Remove dependency on local eks_output.csv file

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
Hey @hummuscience, thanks for the PR! It's great to see that your code already follows many of our conventions.
I've gone through the code for a first round of review and left inline comments. Most of my suggestions involve replacing parts of your implementation with helper functions that already exist in `movement`, to reduce duplication. I realize many of these functions aren't obvious from the outside (some are even private), so it's completely understandable that you didn't use them initially.
In addition to the inline comments, there are a couple of things we’ll need before the PR is ready to merge:
- Documentation update: The Input/Output guide needs to be updated to include the new loader. The source file is at `docs/source/user_guide/input_output.md`. Please check our contributing guide on previewing documentation locally and in CI, which will help you debug formatting issues and confirm how the published site will look.
- Additional test cases: Your current test checks that everything works for a valid EKS `.csv` file, which is a great start. But we also need to verify behavior with invalid inputs. Some of this will be covered by the `ValidDeepLabCutCSV` validator (once you apply my suggestions), but two cases are still missing:
  - Files containing extra coordinates beyond those we expect (I'd recommend not loading these, as noted in my inline comments).
  - Files where some expected ensemble coordinates are missing. In this case, I suspect `pandas` may raise an error when calling `df.xs`, but we should check whether that's indeed happening.

For generating invalid files for these tests, I recommend creating custom fixtures. You can find examples of similar fixtures in `tests/fixtures/files.py`. I can help with that part if needed, as it can be tricky.
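For instance, a fixture along these lines could generate a file with one ensemble column missing. This is a hypothetical sketch: the fixture name and column layout are assumptions (modelled on the DeepLabCut-style header that EKS files share), and the real fixtures in `tests/fixtures/files.py` may look different:

```python
import pandas as pd
import pytest


@pytest.fixture
def eks_file_missing_ensemble_column(tmp_path):
    """Return an EKS-style .csv file lacking the y_ens_median column."""
    coords = [
        "x", "y", "likelihood",
        "x_ens_median",  # "y_ens_median" deliberately omitted
        "x_ens_var", "y_ens_var",
        "x_posterior_var", "y_posterior_var",
    ]
    # DLC/EKS files use a (scorer, bodyparts, coords) column multi-index
    columns = pd.MultiIndex.from_arrays(
        [["eks"] * len(coords), ["paw"] * len(coords), coords],
        names=["scorer", "bodyparts", "coords"],
    )
    df = pd.DataFrame(data=[[0.0] * len(coords)] * 5, columns=columns)
    file_path = tmp_path / "missing_ensemble_column.csv"
    df.to_csv(file_path)
    return file_path
```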
No particular rush on this btw, I will be away for 2 weeks, so I will only look at this again in the 2nd half of September. Thanks for taking the time to make this contribution, I think this will be appreciated by many of our users.
```diff
     be among those supported by the ``from_dlc_file()``,
-    ``from_slp_file()`` or ``from_lp_file()`` functions. One of these
-    these functions will be called internally, based on
+    ``from_slp_file()``, ``from_lp_file()``, ``from_eks_file()`` functions.
+    One of these functions will be called internally, based on
```
I realised that it's starting to become a bit unwieldy to list all the functions here (we've already neglected to add some), so let's take this chance to instead point users to the "See Also" list, which is up-to-date.
```diff
-    be among those supported by the ``from_dlc_file()``,
-    ``from_slp_file()`` or ``from_lp_file()`` functions. One of these
-    these functions will be called internally, based on
-    ``from_slp_file()``, ``from_lp_file()``, ``from_eks_file()`` functions.
-    One of these functions will be called internally, based on
+    be among those supported by the software-specific loading functions
+    that are listed under "See Also".
```
```python
def from_eks_file(
    file_path: Path | str, fps: float | None = None
) -> xr.Dataset:
    """Create a ``movement`` poses dataset from an EKS file.
```
Maybe under this first line we could spell out that EKS stands for Ensemble Kalman Smoother and point to their GitHub repository (you can do this with a reference, as is done for the `from_nwb_file()` function).
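Something like the following might work. This is only a sketch: the wording is up to you, and the repository URL is my assumption of the EKS GitHub repo, to be checked against the reference style used by `from_nwb_file()`:

```python
from pathlib import Path

import xarray as xr


def from_eks_file(
    file_path: Path | str, fps: float | None = None
) -> xr.Dataset:
    """Create a ``movement`` poses dataset from an EKS file.

    EKS stands for Ensemble Kalman Smoother [1]_, a post-processing tool
    that combines predictions from an ensemble of pose estimation models.

    References
    ----------
    .. [1] https://github.com/paninski-lab/eks

    """
    ...
```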
```python
    EKS files have a similar structure to DeepLabCut CSV files but include
    additional columns for ensemble statistics:
```
To match the spelling of .csv earlier in the docstring (and in other docstrings).
```diff
-    EKS files have a similar structure to DeepLabCut CSV files but include
-    additional columns for ensemble statistics:
+    EKS files have a similar structure to DeepLabCut .csv files but include
+    additional columns for ensemble statistics:
```
```python
    - x_ens_median, y_ens_median: Median of the ensemble predictions
    - x_ens_var, y_ens_var: Variance of the ensemble predictions
    - x_posterior_var, y_posterior_var: Posterior variance from the smoother
```
If you want the bullet point list to render properly in Sphinx .rst, you have to leave a blank line before it. Moreover, and this is more of a subjective stylistic choice, I would make the column names monospace.
```diff
-    - x_ens_median, y_ens_median: Median of the ensemble predictions
-    - x_ens_var, y_ens_var: Variance of the ensemble predictions
-    - x_posterior_var, y_posterior_var: Posterior variance from the smoother
+    - ``x_ens_median``, ``y_ens_median``: Median of the ensemble predictions
+    - ``x_ens_var``, ``y_ens_var``: Variance of the ensemble predictions
+    - ``x_posterior_var``, ``y_posterior_var``: Posterior variance from
+      the Kalman smoother
```
```python
    return df


def _df_from_eks_csv(file_path: Path) -> pd.DataFrame:
```
As far as I can tell this private function is unnecessary, because it's almost identical to the existing `_df_from_dlc_csv` function. The only difference is the use of the `ValidDeepLabCutCSV` validator, which you are not using here. But I would say that you can simply use `_df_from_dlc_csv` as is. The `ValidDeepLabCutCSV` validator won't hinder you, as it only checks that the expected multi-index levels exist. As these are the same between DLC and EKS, it's actually better to use that validator, so you are completely fine using `_df_from_dlc_csv` and getting rid of `_df_from_eks_csv`.
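Concretely, the top of the loader could then look like this. A sketch only: `_ds_from_eks_df` is a hypothetical stand-in for the rest of the existing processing, and the private helpers are assumed to keep their current signatures:

```python
from pathlib import Path

import xarray as xr


def from_eks_file(
    file_path: Path | str, fps: float | None = None
) -> xr.Dataset:
    """Create a ``movement`` poses dataset from an EKS file."""
    # Reuse the existing DLC parser: its ValidDeepLabCutCSV validator only
    # checks that the expected multi-index levels exist, and these are
    # identical between DLC and EKS files
    df = _df_from_dlc_csv(Path(file_path))
    # Hypothetical helper standing in for the rest of the processing
    return _ds_from_eks_df(df, fps)
```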
```python
    # Group ensemble statistics by their type
    # (ens_median, ens_var, posterior_var)
    # Each will have x and y components as the space dimension
    ensemble_stats: dict[str, dict[str, np.ndarray]] = {}

    for coord in ensemble_coords:
        # Parse coordinate name to extract stat type and spatial component
        # e.g., "x_ens_median" -> ("ens_median", "x")
        if coord.startswith("x_"):
            stat_name = coord[2:]  # Remove "x_" prefix
            spatial_component = "x"
        elif coord.startswith("y_"):
            stat_name = coord[2:]  # Remove "y_" prefix
            spatial_component = "y"
        else:
            # If it doesn't follow expected pattern, handle as before
            coord_df = df.xs(coord, level="coords", axis=1)
            coord_data = (
                coord_df.to_numpy()
                .reshape((-1, len(individual_names), len(keypoint_names)))
                .transpose(0, 2, 1)
            )
            da = xr.DataArray(
                coord_data,
                dims=["time", "keypoints", "individuals"],
                coords={
                    "time": ds.coords["time"],
                    "keypoints": ds.coords["keypoints"],
                    "individuals": ds.coords["individuals"],
                },
                name=coord,
            )
            ds[coord] = da
            continue

        # Initialize the stat dict if needed
        if stat_name not in ensemble_stats:
            ensemble_stats[stat_name] = {}

        # Extract the data for this coordinate
        coord_df = df.xs(coord, level="coords", axis=1)
        coord_data = (
            coord_df.to_numpy()
            .reshape((-1, len(individual_names), len(keypoint_names)))
            .transpose(0, 2, 1)
        )

        # Store the data indexed by spatial component
        ensemble_stats[stat_name][spatial_component] = coord_data

    # Create DataArrays with (time, space, keypoints, individuals) dims
    for stat_name, spatial_data in ensemble_stats.items():
        if "x" in spatial_data and "y" in spatial_data:
            # Stack x and y into space dimension
            stat_array = np.stack(
                [spatial_data["x"], spatial_data["y"]], axis=1
            )  # Results in (time, space=2, keypoints, individuals)

            # Create xarray DataArray with proper dimensions
            da = xr.DataArray(
                stat_array,
                dims=["time", "space", "keypoints", "individuals"],
                coords={
                    "time": ds.coords["time"],
                    "space": ["x", "y"],
                    "keypoints": ds.coords["keypoints"],
                    "individuals": ds.coords["individuals"],
                },
                name=stat_name,
            )
            ds[stat_name] = da
```
I believe there is a more compact way of achieving what you are doing in lines 486–556:
```python
# Add ensemble statistics to the dataset
for stat in ["ens_median", "ens_var", "posterior_var"]:
    # (time, keypoints, individuals) for each of x and y
    stat_arrays_per_spatial_component = [
        (
            df.xs(f"{c}_{stat}", level="coords", axis=1)
            .to_numpy()
            .reshape((-1, ds.sizes["individuals"], ds.sizes["keypoints"]))
            .transpose(0, 2, 1)
        )
        for c in ["x", "y"]
    ]
    # Stack the arrays into (time, space=2, keypoints, individuals)
    stat_array = np.stack(stat_arrays_per_spatial_component, axis=1)
    # Create a DataArray for the ensemble statistic
    ds[stat] = xr.DataArray(
        stat_array,
        dims=["time", "space", "keypoints", "individuals"],
        coords={
            "time": ds.coords["time"],
            "space": ["x", "y"],
            "keypoints": ds.coords["keypoints"],
            "individuals": ds.coords["individuals"],
        },
    )
```
Your tests still pass with my version of the code btw, so I think it's equivalent.
```python
    if file_path is None:
        pytest.skip("EKS example file not found")
```
We no longer need these two lines.
```python
    except ImportError:
        pytest.skip("Required dependencies for EKS loading not available")
```
I would also skip the try-except block here. If something is missing, the test will fail and let us know that way.
```python
def test_load_from_file_eks():
    """Test that loading EKS files via from_file() works correctly."""
    file_path = DATA_PATHS.get("EKS_IBL-paw_multicam_right.predictions.csv")
    if file_path is None:
        pytest.skip("EKS example file not found")

    try:
        # Test loading via from_file
        ds = load_poses.from_file(file_path, source_software="EKS", fps=30)

        # Should be identical to from_eks_file
        ds_direct = load_poses.from_eks_file(file_path, fps=30)
        xr.testing.assert_identical(ds, ds_direct)

    except ImportError:
        pytest.skip("Required dependencies for EKS loading not available")
```
Instead of adding this extra test, I recommend adding the EKS option to the existing `test_from_file_delegates_correctly` test in the same file. That's the one we use for ensuring that `from_file()` delegates to the correct software-specific loader.
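A rough sketch of what that parametrization could look like — the existing test's exact shape and the other software entries are assumptions, and pytest-mock's `mocker` fixture is assumed to be available:

```python
import pytest

from movement.io import load_poses


@pytest.mark.parametrize(
    "source_software, loader",
    [
        ("DeepLabCut", "from_dlc_file"),
        ("SLEAP", "from_sleap_file"),
        ("LightningPose", "from_lp_file"),
        ("EKS", "from_eks_file"),  # new entry for this PR
    ],
)
def test_from_file_delegates_correctly(source_software, loader, mocker):
    """Ensure from_file() dispatches to the matching loader."""
    mocked_loader = mocker.patch(f"movement.io.load_poses.{loader}")
    load_poses.from_file("some_file.csv", source_software)
    mocked_loader.assert_called_once()
```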
pytest.skip("Required dependencies for EKS loading not available") | ||
|
||
|
||
def test_load_from_file_eks(): |
You could make this test much more succinct by re-using one of our helper fixtures:
```python
def test_load_from_eks_file(helpers):
    """Test that loading pose tracks from an EKS CSV file
    returns a proper Dataset with ensemble statistics.
    """
    file_path = DATA_PATHS.get("EKS_IBL-paw_multicam_right.predictions.csv")
    # Load the EKS file
    ds = load_poses.from_eks_file(file_path, fps=30)
    expected_values_eks = {
        "vars_dims": {
            "position": 4,
            "confidence": 3,
            "ens_median": 4,
            "ens_var": 4,
            "posterior_var": 4,
        },
        "dim_names": ValidPosesDataset.DIM_NAMES,
        "source_software": "EKS",
        "fps": 30,
    }
    helpers.assert_valid_dataset(ds, expected_values_eks)
```
The `helpers.assert_valid_dataset()` fixture will do the exact same checks you were doing anyway.
Hey @hummuscience, just checking in to see when you might have a chance to continue with this PR.
Hey, I will plan it for Thursday this week. Does that work for you?
Thanks for the rapid response. I'm away Thu-Fri this week but can catch up with your work next week.
Just an update: I got sick the past week. Will try to finish it tomorrow (finally!). Sorry for the wait.
Sadly, this PR dropped heavily down my priority list (PostDoc before SfN syndrome). I guess I will only realistically be able to deal with it at the end of November.
Sounds familiar 😄 There is currently no one else available to work on this PR, so November is still good. I mostly care about bringing it to a finish line at some point, so that the work you've already done doesn't go to waste. Good luck with SfN!
Description
What is this PR
Why is this PR needed?
It allows loading the outputs of EKS (Ensemble Kalman Smoother) into movement.
What does this PR do?
It implements the loader in the same way as those for the other supported packages, as shown in the sketch below.
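For example (a usage sketch; the file name is illustrative):

```python
from movement.io import load_poses

# Load an EKS output file (a DeepLabCut-style .csv with extra ensemble columns)
ds = load_poses.from_eks_file("eks_output.csv", fps=30)

# Ensemble statistics are stored alongside position and confidence, each
# with (time, space, keypoints, individuals) dimensions
print(list(ds.data_vars))
# ['position', 'confidence', 'ens_median', 'ens_var', 'posterior_var']
```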
References
Relevant discussion on Zulip: #Movement > Importer for Ensemble Kalman Smoother
How has this PR been tested?
I tested it using an output file from my own data. It is referenced in the test file but of course is not part of movement. I am not sure what the best way to handle this would be.
Is this a breaking change?
No.
Does this PR require an update to the documentation?
We would have to add the relevant parts to the input/output section of the documentation.
Checklist: