
Commit 5138876

CloseChoice and lhoestq authored
Add nifti support (#7815)
* Add nifti support
* update docs
* update nifti after testing locally and from remote hub
* update setup.py to add nibabel and update docs
* add nifti_dataset
* fix nifti dataset documentation
* add nibabel to test dependency
* Add section for creating a medical imaging dataset

---------

Co-authored-by: Quentin Lhoest <[email protected]>
1 parent 159a645 commit 5138876

File tree

16 files changed

+529
-0
lines changed

docs/source/_toctree.yml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -88,6 +88,8 @@
8888
title: Load document data
8989
- local: document_dataset
9090
title: Create a document dataset
91+
- local: nifti_dataset
92+
title: Create a medical imaging dataset
9193
title: "Vision"
9294
- sections:
9395
- local: nlp_load

docs/source/nifti_dataset.mdx

Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
1+
# Create a NIfTI dataset
2+
3+
This page shows how to create and share a dataset of medical images in NIfTI format (.nii / .nii.gz) using the `datasets` library.
4+
5+
You can share a dataset with your team or with anyone in the community by creating a dataset repository on the Hugging Face Hub:
6+
7+
```py
8+
from datasets import load_dataset
9+
10+
dataset = load_dataset("<username>/my_nifti_dataset")
11+
```
12+
13+
There are two common ways to create a NIfTI dataset:
14+
15+
- Create a dataset from local NIfTI files in Python and upload it with `Dataset.push_to_hub`.
16+
- Use a folder-based convention (one file per example) and a small helper to convert it into a `Dataset`.
17+
18+
> [!TIP]
19+
> You can control access to your dataset by requiring users to share their contact information first. Check out the [Gated datasets](https://huggingface.co/docs/hub/datasets-gated) guide for more information.
20+
21+
## Local files
22+
23+
If you already have a list of file paths to NIfTI files, the easiest workflow is to create a `Dataset` from that list and cast the column to the `Nifti` feature.
24+
25+
```py
26+
from datasets import Dataset
27+
from datasets import Nifti
28+
29+
# simple example: create a dataset from file paths
30+
files = ["/path/to/scan_001.nii.gz", "/path/to/scan_002.nii.gz"]
31+
ds = Dataset.from_dict({"nifti": files}).cast_column("nifti", Nifti())
32+
33+
# access a decoded nibabel image (if decode=True)
34+
# ds[0]["nifti"] will be a nibabel.Nifti1Image object when decode=True
35+
# or a dict {'bytes': None, 'path': '...'} when decode=False
36+
```
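If the file paths are not already in a list, one way to build it is to walk a directory with the standard library. This is a minimal sketch, not part of `datasets`; the `list_nifti_files` helper is a hypothetical name, and the demo uses empty placeholder files rather than real scans.

```python
import tempfile
from pathlib import Path


def list_nifti_files(root):
    """Return sorted paths of .nii and .nii.gz files found under root."""
    root = Path(root)
    return sorted(
        str(p)
        for p in root.rglob("*")
        if p.is_file() and (p.name.endswith(".nii") or p.name.endswith(".nii.gz"))
    )


# Demo on a throwaway directory with empty placeholder files (not real scans).
with tempfile.TemporaryDirectory() as tmp:
    for name in ["scan_001.nii.gz", "scan_002.nii", "notes.txt"]:
        (Path(tmp) / name).touch()
    files = list_nifti_files(tmp)
    names = [Path(f).name for f in files]
```

The resulting `files` list can then be passed to `Dataset.from_dict({"nifti": files})` as in the example above.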
37+
38+
The `Nifti` feature supports a `decode` parameter. When `decode=True` (the default), it loads the NIfTI file into a `nibabel.nifti1.Nifti1Image` object, and you can access the image data as a NumPy array with `img.get_fdata()`. When `decode=False`, it returns a dict with the file path and bytes instead.
39+
40+
```py
41+
from datasets import Dataset, Nifti
42+
43+
ds = Dataset.from_dict({"nifti": ["/path/to/scan.nii.gz"]}).cast_column("nifti", Nifti(decode=True))
44+
img = ds[0]["nifti"] # instance of: nibabel.nifti1.Nifti1Image
45+
arr = img.get_fdata()
46+
```
47+
48+
After preparing the dataset you can push it to the Hub:
49+
50+
```py
51+
ds.push_to_hub("<username>/my_nifti_dataset")
52+
```
53+
54+
This will create a dataset repository containing your NIfTI dataset with a `data/` folder of parquet shards.
55+
56+
## Folder conventions and metadata
57+
58+
If you organize your dataset in folders, you can create splits automatically (train/test/validation) by following a structure like:
59+
60+
```
61+
dataset/train/scan_0001.nii
62+
dataset/train/scan_0002.nii
63+
dataset/validation/scan_1001.nii
64+
dataset/test/scan_2001.nii
65+
```
66+
67+
If you have labels or other metadata, provide a `metadata.csv`, `metadata.jsonl`, or `metadata.parquet` in the folder so files can be linked to metadata rows. The metadata must contain a `file_name` (or `*_file_name`) field holding the path to each NIfTI file, relative to the metadata file.
68+
69+
Example `metadata.csv`:
70+
71+
```csv
72+
file_name,patient_id,age,diagnosis
73+
scan_0001.nii.gz,P001,45,healthy
74+
scan_0002.nii.gz,P002,59,disease_x
75+
```
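A metadata file like this can be written with the standard library. The sketch below is an illustration under assumptions: `write_metadata` is a hypothetical helper (not a `datasets` API), the rows are made-up values matching the example above, and the scans are empty placeholder files. It checks that every `file_name` exists next to the metadata file before writing.

```python
import csv
import tempfile
from pathlib import Path


def write_metadata(folder, rows):
    """Write metadata.csv into `folder`; each row needs a `file_name` relative to it."""
    folder = Path(folder)
    missing = [r["file_name"] for r in rows if not (folder / r["file_name"]).is_file()]
    if missing:
        raise FileNotFoundError(f"listed in metadata but not found: {missing}")
    out = folder / "metadata.csv"
    with out.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return out


# Demo with empty placeholder files standing in for real scans.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["scan_0001.nii.gz", "scan_0002.nii.gz"]:
        (Path(tmp) / name).touch()
    rows = [
        {"file_name": "scan_0001.nii.gz", "patient_id": "P001", "age": 45, "diagnosis": "healthy"},
        {"file_name": "scan_0002.nii.gz", "patient_id": "P002", "age": 59, "diagnosis": "disease_x"},
    ]
    metadata_text = write_metadata(tmp, rows).read_text()
```

Failing early on a missing file is a design choice here: a row that points at a non-existent scan is easier to fix before upload than after.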
76+
77+
The `Nifti` feature works with zipped datasets too: each zip can contain NIfTI files and a metadata file. This is useful when uploading large datasets as archives.
78+
Gzip-compressed (`.nii.gz`) and uncompressed (`.nii`) files can also be mixed within one dataset, so your structure could look like this:
79+
```
80+
dataset/train/scan_0001.nii.gz
81+
dataset/train/scan_0002.nii
82+
dataset/validation/scan_1001.nii.gz
83+
dataset/test/scan_2001.nii
84+
```
85+
86+
## Converting to PyTorch tensors
87+
88+
Use the [`~Dataset.set_transform`] function to apply the transformation on-the-fly to batches of the dataset:
89+
90+
```py
91+
import torch
92+
import nibabel
93+
import numpy as np
94+
95+
def transform_to_pytorch(batch):
96+
    batch["nifti_torch"] = [torch.tensor(img.get_fdata()) for img in batch["nifti"]]
97+
    return batch
98+
99+
ds.set_transform(transform_to_pytorch)
100+
101+
```
102+
Accessing elements now (e.g. `ds[0]`) will yield torch tensors in the `"nifti_torch"` key.
103+
104+
105+
## Usage of Nifti1Image
106+
107+
NIfTI is a format for storing the result of 3 (or even 4) dimensional brain scans. This includes 3 spatial dimensions (x, y, z)
108+
and optionally a time dimension (t). Voxel indices only locate a value within the image array, therefore
109+
each image also carries an affine transform (`img.affine` in nibabel) that maps them to real world coordinates.
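
In nibabel, the voxel-to-world mapping is exposed as a 4x4 affine matrix on the image (`img.affine`). A minimal sketch of what applying it means, in plain Python with a made-up affine; the `voxel_to_world` helper and the matrix values are illustrative assumptions, not part of `datasets` or nibabel:

```python
def voxel_to_world(affine, ijk):
    """Apply a 4x4 affine (nested lists) to a voxel index (i, j, k)."""
    i, j, k = ijk
    homogeneous = (i, j, k, 1.0)
    # Only the first three rows produce coordinates; the last row is (0, 0, 0, 1).
    return tuple(
        sum(a * v for a, v in zip(row, homogeneous))
        for row in affine[:3]
    )


# Hypothetical affine: 2 mm isotropic voxels, origin shifted by (-90, -126, -72) mm.
affine = [
    [2.0, 0.0, 0.0, -90.0],
    [0.0, 2.0, 0.0, -126.0],
    [0.0, 0.0, 2.0, -72.0],
    [0.0, 0.0, 0.0, 1.0],
]
world = voxel_to_world(affine, (45, 63, 36))
corner = voxel_to_world(affine, (0, 0, 0))
```

With a real image you would normally use the NumPy array in `img.affine` together with `nibabel.affines.apply_affine` rather than a hand-written helper.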
110+
111+
You can visualize NIfTI files, for instance with `matplotlib`, as follows:
112+
```python
113+
import matplotlib.pyplot as plt
114+
from datasets import load_dataset
115+
116+
def show_slices(slices):
117+
    """Display a row of image slices."""
118+
    fig, axes = plt.subplots(1, len(slices))
119+
    for i, slice_ in enumerate(slices):
120+
        axes[i].imshow(slice_.T, cmap="gray", origin="lower")
121+
122+
nifti_ds = load_dataset("<username>/my_nifti_dataset", split="train")  # a Dataset, so iteration yields examples
123+
for epi_img in nifti_ds:
124+
    nifti_img = epi_img["nifti"].get_fdata()
125+
    show_slices([nifti_img[:, :, 16], nifti_img[26, :, :], nifti_img[:, 30, :]])
126+
    plt.show()
127+
```
128+
129+
For further reading, see the [nibabel documentation](https://nipy.org/nibabel/index.html), especially [the nibabel tutorial on coordinate systems](https://nipy.org/nibabel/coordinate_systems.html).
130+
---

docs/source/package_reference/loading_methods.mdx

Lines changed: 6 additions & 0 deletions
@@ -103,6 +103,12 @@ load_dataset("csv", data_dir="path/to/data/dir", sep="\t")
103103

104104
[[autodoc]] datasets.packaged_modules.pdffolder.PdfFolder
105105

106+
### Nifti
107+
108+
[[autodoc]] datasets.packaged_modules.niftifolder.NiftiFolderConfig
109+
110+
[[autodoc]] datasets.packaged_modules.niftifolder.NiftiFolder
111+
106112
### WebDataset
107113

108114
[[autodoc]] datasets.packaged_modules.webdataset.WebDataset

docs/source/package_reference/main_classes.mdx

Lines changed: 4 additions & 0 deletions
@@ -271,6 +271,10 @@ Dictionary with split names as keys ('train', 'test' for example), and `Iterable
271271

272272
[[autodoc]] datasets.Pdf
273273

274+
### Nifti
275+
276+
[[autodoc]] datasets.Nifti
277+
274278
## Filesystems
275279

276280
[[autodoc]] datasets.filesystems.is_remote_filesystem

setup.py

Lines changed: 4 additions & 0 deletions
@@ -186,6 +186,7 @@
186186
"polars[timezone]>=0.20.0",
187187
"Pillow>=9.4.0", # When PIL.Image.ExifTags was introduced
188188
"torchcodec>=0.7.0", # minium version to get windows support
189+
"nibabel>=5.3.1",
189190
]
190191

191192
NUMPY2_INCOMPATIBLE_LIBRARIES = [
@@ -207,6 +208,8 @@
207208

208209
PDFS_REQUIRE = ["pdfplumber>=0.11.4"]
209210

211+
NIBABEL_REQUIRE = ["nibabel>=5.3.2"]
212+
210213
EXTRAS_REQUIRE = {
211214
"audio": AUDIO_REQUIRE,
212215
"vision": VISION_REQUIRE,
@@ -224,6 +227,7 @@
224227
"benchmarks": BENCHMARKS_REQUIRE,
225228
"docs": DOCS_REQUIRE,
226229
"pdfs": PDFS_REQUIRE,
230+
"nibabel": NIBABEL_REQUIRE,
227231
}
228232

229233
setup(

src/datasets/config.py

Lines changed: 1 addition & 0 deletions
@@ -139,6 +139,7 @@
139139
TORCHCODEC_AVAILABLE = importlib.util.find_spec("torchcodec") is not None
140140
TORCHVISION_AVAILABLE = importlib.util.find_spec("torchvision") is not None
141141
PDFPLUMBER_AVAILABLE = importlib.util.find_spec("pdfplumber") is not None
142+
NIBABEL_AVAILABLE = importlib.util.find_spec("nibabel") is not None
142143

143144
# Optional compression tools
144145
RARFILE_AVAILABLE = importlib.util.find_spec("rarfile") is not None
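
The `NIBABEL_AVAILABLE` flag follows the same optional-dependency pattern as the neighbouring flags: `importlib.util.find_spec` checks whether a package can be found without importing it. A minimal sketch of how such a flag might gate a feature; the `is_available` and `require` helpers are hypothetical and are not claimed to match how `datasets` raises its own errors:

```python
import importlib.util


def is_available(module_name):
    """Return True if the top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def require(module_name, feature):
    """Raise a helpful error when an optional dependency is missing."""
    if not is_available(module_name):
        raise ImportError(
            f"{feature} requires '{module_name}'; try: pip install {module_name}"
        )


NIBABEL_AVAILABLE = is_available("nibabel")

# A standard-library module is always found; a made-up name is not.
json_found = is_available("json")
fake_found = is_available("surely_not_a_real_module_xyz")
```

Checking availability lazily keeps importing the library cheap for users who never touch NIfTI data.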

src/datasets/features/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -15,10 +15,12 @@
1515
"TranslationVariableLanguages",
1616
"Video",
1717
"Pdf",
18+
"Nifti",
1819
]
1920
from .audio import Audio
2021
from .features import Array2D, Array3D, Array4D, Array5D, ClassLabel, Features, LargeList, List, Sequence, Value
2122
from .image import Image
23+
from .nifti import Nifti
2224
from .pdf import Pdf
2325
from .translation import Translation, TranslationVariableLanguages
2426
from .video import Video

src/datasets/features/features.py

Lines changed: 6 additions & 0 deletions
@@ -42,6 +42,7 @@
4242
from ..utils.py_utils import asdict, first_non_null_value, zip_dict
4343
from .audio import Audio
4444
from .image import Image, encode_pil_image
45+
from .nifti import Nifti
4546
from .pdf import Pdf, encode_pdfplumber_pdf
4647
from .translation import Translation, TranslationVariableLanguages
4748
from .video import Video
@@ -1270,6 +1271,7 @@ def __repr__(self):
12701271
Image,
12711272
Video,
12721273
Pdf,
1274+
Nifti,
12731275
]
12741276

12751277

@@ -1428,6 +1430,7 @@ def decode_nested_example(schema, obj, token_per_repo_id: Optional[dict[str, Uni
14281430
Image.__name__: Image,
14291431
Video.__name__: Video,
14301432
Pdf.__name__: Pdf,
1433+
Nifti.__name__: Nifti,
14311434
}
14321435

14331436

@@ -1761,6 +1764,9 @@ class Features(dict):
17611764
- [`Pdf`] feature to store the absolute path to a PDF file, a `pdfplumber.pdf.PDF` object
17621765
or a dictionary with the relative path to a PDF file ("path" key) and its bytes content ("bytes" key).
17631766
This feature loads the PDF lazily with a PDF reader.
1767+
- [`Nifti`] feature to store the absolute path to a NIfTI neuroimaging file, a `nibabel.Nifti1Image` object
1768+
or a dictionary with the relative path to a NIfTI file ("path" key) and its bytes content ("bytes" key).
1769+
This feature loads the NIfTI file lazily with nibabel.
17641770
- [`Translation`] or [`TranslationVariableLanguages`] feature specific to Machine Translation.
17651771
"""
17661772
