Dataset v2.0 #461

aliberts · 2024-10-03T18:02:10Z

What this does

This PR introduces a new format for LeRobotDataset, which is accompanied by a new file structure. As these changes are not backward compatible, we increase CODEBASE_VERSION from v1.6 to v2.0.

What do I need to do?

If you already pushed a dataset using v1.6 of our codebase, you can use the conversion script lerobot/common/datasets/v2/convert_dataset_v1_to_v2.py to convert it to the new format.
You will be asked to enter a prompt describing the task performed in the dataset.

Examples for a single-task dataset:

python lerobot/common/datasets/v2/convert_dataset_v1_to_v2.py \
    --repo-id lerobot/aloha_sim_insertion_human_image \
    --task "Insert the peg into the socket." \
    --robot-config lerobot/configs/robot/aloha.yaml

python lerobot/common/datasets/v2/convert_dataset_v1_to_v2.py \
    --repo-id aliberts/koch_tutorial \
    --task "Pick the Lego block and drop it in the box on the right." \
    --robot-config lerobot/configs/robot/koch.yaml

For the more complicated cases of one task per episode or multiple tasks per episode, please refer to the documentation in that script.

Motivation

The current implementation of LeRobotDataset has a few shortcomings that make it hard to use in some respects. Specifically:

The structure of the files does not accurately reflect the data structure. Our datasets are structured by episodes, which contrasts with a typical ML scenario with train/val/test splits (although these concepts can still be relevant here). This makes it hard to select a subset of episodes from a dataset, since the whole dataset has to be downloaded and loaded. Related: #440
Due to the hub's limitations, one cannot push a dataset with more than 10k episodes (fewer if there are multiple cameras).
The format is not transparent to the user: to get information about the content of a dataset, the only options are to download the entire dataset and inspect it with a custom script, or to try to visualize it using our visualization tool. Related: #383
The default file cache system used by datasets and huggingface_hub makes it inconvenient to create datasets locally (with recording). To use the newly created files on disk, these libraries check whether those files are present in the cache (which they won't be) and, if not, download them even though they may already be on disk.
Some of the file formats used are too framework-specific for this format to be more universal (e.g. .safetensors).
The dataset viewer on the hub is not compatible with our datasets due to VideoFrame not yet being integrated into datasets.
The current implementation lacks support for future features that we may want to add such as:
- Task-tokens-conditioned training
- Multirobot policies
- Depth images (Related: #435)

Changes

Some of the biggest changes come from the new file structure and the content of the files:

  .
  ├── data
- │   ├── train-00000-of-0001.parquet
+ │   ├── chunk-000
+ │   │   ├── episode_000000.parquet
+ │   │   ├── episode_000001.parquet
+ │   │   ├── episode_000002.parquet
+ │   │   └── ...
+ │   ├── chunk-001
+ │   │   ├── episode_001000.parquet
+ │   │   ├── episode_001001.parquet
+ │   │   ├── episode_001002.parquet
+ │   │   └── ...
+ │   └── ...
- ├── meta_data
+ ├── meta
- │   ├── episode_data_index.safetensors
+ │   ├── episodes.jsonl
  │   ├── info.json
+ │   ├── stats.json
- │   ├── stats.safetensors
+ │   └── tasks.jsonl
  └── videos
+     ├── chunk-000
+     │   ├── observation.images.laptop
      │   │   ├── episode_000000.mp4
      │   │   ├── episode_000001.mp4
      │   │   ├── episode_000002.mp4
      │   │   └── ...
+     │   ├── observation.images.phone
      │   │   ├── episode_000000.mp4
      │   │   ├── episode_000001.mp4
      │   │   ├── episode_000002.mp4
      │   │   └── ...
+     ├── chunk-001
      └── ...

Note that this file-based structure is designed to be as versatile as possible. The parquet files are split by episode (this was already the case for videos), which allows much more granular control over which episodes one wants to use and download. The structure of the dataset is entirely described in the info.json file, which can be easily downloaded or viewed directly on the hub before downloading any actual data. The file types used are very simple and do not need complex tools to be read: only .parquet, .json, .jsonl and .mp4 files are used (plus .md for the README).
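As a rough illustration of how this layout lets you pick out the files for a few episodes, here is a minimal sketch. It assumes 1000 episodes per chunk, which is only inferred from the tree above (chunk-001 starts at episode_001000); the helper names are hypothetical and are not part of the LeRobotDataset api.

```python
from pathlib import PurePosixPath

# Assumption: 1000 episodes per chunk, inferred from the tree above
# (chunk-001 starts at episode_001000). The real value may differ.
CHUNK_SIZE = 1000


def episode_chunk(episode_index: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Return the index of the chunk directory that holds this episode."""
    return episode_index // chunk_size


def parquet_path(episode_index: int) -> PurePosixPath:
    """Relative path of the parquet file for one episode."""
    chunk = episode_chunk(episode_index)
    return PurePosixPath("data") / f"chunk-{chunk:03d}" / f"episode_{episode_index:06d}.parquet"


def video_path(episode_index: int, video_key: str) -> PurePosixPath:
    """Relative path of the mp4 file for one episode and one camera key."""
    chunk = episode_chunk(episode_index)
    return PurePosixPath("videos") / f"chunk-{chunk:03d}" / video_key / f"episode_{episode_index:06d}.mp4"


def files_for_episodes(episodes: list[int], video_keys: list[str]) -> list[str]:
    """List every data and video file needed for the selected episodes."""
    paths = []
    for episode_index in episodes:
        paths.append(parquet_path(episode_index))
        paths.extend(video_path(episode_index, key) for key in video_keys)
    return [str(path) for path in paths]


if __name__ == "__main__":
    for path in files_for_episodes([1, 1001], ["observation.images.laptop"]):
        print(path)
```

A list like this could, for instance, be turned into download patterns so that only the selected episodes are fetched.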

Added

A LeRobotDataset can now be called with an episodes argument (e.g. episodes=[1, 10, 12, 40]) to select a specific subset of episodes by their episode_index. By doing so, only the files corresponding to these episodes will be downloaded (if they're not already on disk). In that case, the hf_dataset attribute will only contain data from these episodes, as well as the episode_data_index.
Dataset metadata logic is now handled by the LeRobotDatasetMetadata class. This makes it possible to get information about a dataset before loading any data. For example, you could do this:

import math

# Assumes LeRobotDatasetMetadata is importable from the same module as LeRobotDataset
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset, LeRobotDatasetMetadata

# Fetch metadata from the hub
metadata = LeRobotDatasetMetadata("lerobot/pusht")

# Calculate train and val episodes (90% / 10% split)
total_episodes = metadata.total_episodes
episodes = list(range(total_episodes))
num_train_episodes = math.floor(total_episodes * 90 / 100)
train_episodes = episodes[:num_train_episodes]
val_episodes = episodes[num_train_episodes:]

# Load train and val datasets
train_dataset = LeRobotDataset("lerobot/pusht", episodes=train_episodes)
val_dataset = LeRobotDataset("lerobot/pusht", episodes=val_episodes)

Tasks, expressed as natural language prompts, are now present in every dataset and are required to create one. Every task of a dataset is listed in tasks.jsonl and mapped to its task_index, which is what is actually stored in the parquet files. Using the api, tasks can be accessed either with dataset.tasks to get that mapping, or through dataset.episode_dict[episode_index]["tasks"] if you're only interested in a particular episode.
Various information about the structure of the dataset has been added and is now centralized in info.json (keys, shapes, number of episodes, etc.). It serves as the source of truth for what's inside the dataset.
episodes.jsonl contains per-episode information (episode_index, tasks in natural language and episode lengths). This is accessed through the episode_dict attribute in the api.
LeRobotDataset.create() makes it possible to create a new dataset from scratch, either for recording data or for porting an existing dataset to the LeRobotDataset format. To that end, new methods are added:
- start_image_writter(): This instantiates an ImageWriter in the image_writer attribute to write images asynchronously during data recording. It is called automatically during LeRobotDataset.create() if specified in the arguments.
- stop_image_writter(): This properly stops the ImageWriter and removes it from the dataset's attributes. Importantly: if the image_writer has been set to a multiprocess ImageWriter, this needs to be called first if you want to pass this dataset into a parallelized DataLoader, as the ImageWriter class is not pickleable (which is required for objects to be transferred between processes). This is not needed when instantiating a dataset with __init__, as the image_writer is not created in that case.
- add_frame(): Adds the data for a single timestamp (one frame) to the episode_buffer, which keeps data in memory temporarily. Note: this will be merged with the DataBuffer from #445 in a subsequent PR.
- add_episode(): Saves the content of the episode_buffer to disk and updates metadata for them to be in sync with the contents of the files. This method expects a task argument as a string prompt in natural language describing the task performed in the episode. Videos from that episode can optionally be encoded during this phase but it's not mandatory and can be done later in order to give more flexibility on when to do that.
- consolidate(): This will encode videos that have not yet been encoded, clean up the temporary image files, compute dataset statistics, check timestamps are in sync with the fps and perform additional sanity checks in the dataset. It needs to be done before uploading the dataset to the hub with push_to_hub().
- clear_episode_buffer(): This can be used to reset the episode_buffer (e.g. to discard data from a current recording).
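To make the tasks.jsonl and episodes.jsonl metadata described above more concrete, here is a minimal sketch that parses two small, made-up files with the standard library. The field names (task_index, task, episode_index, tasks, length) follow the description in this PR, but the exact schema and the sample values are assumptions for illustration only.

```python
import json


def read_jsonl(text: str) -> list[dict]:
    """Parse JSON Lines content: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical file contents; real files may carry different or additional fields.
TASKS_JSONL = """\
{"task_index": 0, "task": "Insert the peg into the socket."}
{"task_index": 1, "task": "Pick the Lego block and drop it in the box on the right."}
"""

EPISODES_JSONL = """\
{"episode_index": 0, "tasks": ["Insert the peg into the socket."], "length": 120}
{"episode_index": 1, "tasks": ["Pick the Lego block and drop it in the box on the right."], "length": 95}
"""

# task_index -> natural language prompt (the index is what parquet files would store)
tasks = {row["task_index"]: row["task"] for row in read_jsonl(TASKS_JSONL)}

# episode_index -> per-episode information
episodes = {row["episode_index"]: row for row in read_jsonl(EPISODES_JSONL)}

# Reverse lookup: natural language prompt -> task_index
task_to_index = {task: index for index, task in tasks.items()}

if __name__ == "__main__":
    for episode_index, episode in episodes.items():
        indices = [task_to_index[task] for task in episode["tasks"]]
        print(episode_index, episode["length"], indices)
```

Because both files are plain JSON Lines, they can be inspected with any text tool before a single parquet or video file is downloaded.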

Changed

The logic for checking that timestamps and delta_timestamps are in sync has been moved out of __getitem__() and now runs during __init__ or consolidate. This both saves computation in __getitem__() and surfaces timestamp sync issues immediately.
The paths for the parquet and video files are now embedded in the info.json to allow flexibility and to easily split chunks of files between directories to avoid the hub's limit of files (10k) per folder.
We now store every dataset (created or downloaded) in ~/.cache/huggingface/lerobot by default. Changing root or setting the LEROBOT_HOME env variable changes that location. Every call to a huggingface_hub download function such as snapshot_download or hf_hub_download sets the local_dir argument to that location, so that files are not duplicated in the cache and files already present on disk are not downloaded again.
Refactored the image writing code from populate_dataset.py into an ImageWriter class.
stats.safetensors is now stats.json (the content remains the same but it's unflattened).
episode_data_index.safetensors is removed but the episode_data_index is still in the api to map episode_index to indices.
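Since episode lengths are available in the metadata, an index from episodes to frame ranges can be rebuilt with a cumulative sum instead of being stored in its own file. The sketch below shows that idea under stated assumptions: the function name is hypothetical, the "from"/"to" keys and the half-open [from, to) convention are assumptions, and plain lists are used where the library may use tensors.

```python
from itertools import accumulate


def compute_episode_data_index(episode_lengths: dict[int, int]) -> dict[str, list[int]]:
    """Build per-episode frame ranges from episode lengths.

    Entry i of "from" / "to" gives the half-open frame range [from, to) of the
    i-th episode, with episodes taken in increasing episode_index order.
    """
    lengths = [episode_lengths[index] for index in sorted(episode_lengths)]
    ends = list(accumulate(lengths))
    starts = [end - length for end, length in zip(ends, lengths)]
    return {"from": starts, "to": ends}


if __name__ == "__main__":
    # Hypothetical lengths for three episodes
    index = compute_episode_data_index({0: 3, 1: 2, 2: 4})
    print(index["from"], index["to"])
```

When only a subset of episodes is loaded, the positions in such an index would presumably refer to the order of the selected episodes rather than to their original episode_index.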

Performance

In the nominal case (no delta_timestamps), LeRobotDataset.__getitem__() is on par with the previous version: sometimes slightly faster, but generally in the same ballpark.

__getitem__() call time in seconds (average over 10k iterations):

repo_id                                 | v1.6   | v2.0  
--------------------------------------- | ------ | ------
lerobot/aloha_sim_insertion_human_image | 0.0036 | 0.0037
lerobot/aloha_sim_insertion_human       | 0.0029 | 0.0027
lerobot/pusht_image                     | 0.0003 | 0.0003
lerobot/pusht                           | 0.0011 | 0.0009
aliberts/koch_tutorial                  | 0.0111 | 0.0106
lerobot/aloha_mobile_cabinet            | 0.0104 | 0.0101

Benchmarking code

from pathlib import Path
import time
import torch
from lerobot.common.datasets.lerobot_dataset import CODEBASE_VERSION, LeRobotDataset

repo_ids = [
    "lerobot/aloha_sim_insertion_human_image",
    "lerobot/aloha_sim_insertion_human",
    "lerobot/pusht_image",
    "lerobot/pusht",
    "aliberts/koch_tutorial",
    "lerobot/aloha_mobile_cabinet",
]
num_iterations = 10000
logfile = Path(f"perf_log_{CODEBASE_VERSION}_{num_iterations}.txt")
with open(logfile, "a") as file:
    file.write(f"__getitem__() call time in seconds (average over {num_iterations} iterations)\n\n")
    file.write(f"repo_id                                 | {CODEBASE_VERSION}  \n")
    file.write("--------------------------------------- | ------\n")

for repo_id in repo_ids:
    dataset = LeRobotDataset(repo_id=repo_id)
    durations = []
    for i in range(num_iterations):
        start = time.perf_counter()
        item = dataset[i]
        duration = time.perf_counter() - start
        durations.append(duration)

    avg_duration = torch.Tensor(durations).mean()
    results = f"{repo_id} | {avg_duration:.4f}s"
    print(results)
    with open(logfile, "a") as file:
        file.write(results + "\n")

With delta_timestamps, results vary more from one dataset to another but remain in the same ballpark.
__getitem__() call time in seconds (average over 10k iterations), delta_timestamps=[-1/fps, 0, 1/fps]:

repo_id                                 | v1.6   | v2.0  
--------------------------------------- | ------ | ------
lerobot/aloha_sim_insertion_human_image | 0.0176 | 0.0160
lerobot/aloha_sim_insertion_human       | 0.0073 | 0.0068
lerobot/pusht_image                     | 0.0024 | 0.0032
lerobot/pusht                           | 0.0028 | 0.0043
aliberts/koch_tutorial                  | 0.0200 | 0.0184
lerobot/aloha_mobile_cabinet            | 0.0224 | 0.0181

Benchmarking code (delta_timestamps)

from pathlib import Path
import time
import torch
from lerobot.common.datasets.lerobot_dataset import CODEBASE_VERSION, LeRobotDataset

repo_ids = [
    "lerobot/aloha_sim_insertion_human_image",
    "lerobot/aloha_sim_insertion_human",
    "lerobot/pusht_image",
    "lerobot/pusht",
    "aliberts/koch_tutorial",
    "lerobot/aloha_mobile_cabinet",
]
num_iterations = 10000
logfile = Path(f"perf_log_{CODEBASE_VERSION}_{num_iterations}.txt")
with open(logfile, "a") as file:
    file.write(f"__getitem__() call time in seconds (average over {num_iterations} iterations)\n\n")
    file.write(f"repo_id                                 | {CODEBASE_VERSION}  \n")
    file.write("--------------------------------------- | ------\n")

for repo_id in repo_ids:
    dataset = LeRobotDataset(repo_id=repo_id)
    fps = dataset.fps
    keys = ["observation.state", *dataset.camera_keys]
    delta_timestamps = {key: [-1/fps, 0, 1/fps] for key in keys}
    dataset = LeRobotDataset(repo_id=repo_id, delta_timestamps=delta_timestamps)
    durations = []
    for i in range(num_iterations):
        start = time.perf_counter()
        item = dataset[i]
        duration = time.perf_counter() - start
        durations.append(duration)

    del dataset
    avg_duration = torch.Tensor(durations).mean()
    results = f"{repo_id} | {avg_duration:.4f}s"
    print(results)
    with open(logfile, "a") as file:
        file.write(results + "\n")

Fixes

Fix a bug in load_previous_and_future_frames which didn't actually raise an error when the requested timestamps from delta_timestamps did not correspond to actual timestamps in the dataset.
Various fixes on the datasets have been made:
- Some tasks already present in some datasets contained strings which were not part of the task (e.g. "tf.Tensor(b'Do something', shape=(), dtype=string)")
- Some video files were not properly tracked by git lfs
- Some datasets present a mismatch between the number of episodes in their parquet and the number of video files. This is being investigated [TODO]
  - lerobot/aloha_mobile_shrimp
  - lerobot/aloha_static_battery
  - lerobot/aloha_static_fork_pick_up
  - lerobot/aloha_static_thread_velcro
  - lerobot/uiuc_d3field
- lerobot/viola is missing video keys [TODO]

How it was tested

Adds tests/fixtures/, which contains fixtures and fixture factories that simplify writing and adding tests. These factories make it possible to create partially mocked objects on the fly for use in tests, without relying on other components of the codebase that are not meant to be exercised in a particular test (e.g. initializing a dataset using hydra).
Adds tests/test_image_writer.py
Adds tests/test_delta_timestamps.py
Deactivates a bunch of tests which will need to be redesigned and simplified in further PRs.

How to checkout & try? (for the reviewer)

Use an existing dataset:

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

REPO_ID = "lerobot/aloha_sim_insertion_human"  # try with '_image' as well

delta_timestamps = {
    "observation.images.top": [-1, -1/50, 0, 25/50],
    "observation.state": [-1, -1/50, 0, 25/50],
}
dataset = LeRobotDataset(repo_id=REPO_ID, delta_timestamps=delta_timestamps)
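The offsets above are given in seconds and are meant to line up with actual frame timestamps (the -1/50 step suggests a 50 fps dataset). As a rough sketch of the kind of check this PR moves to initialization time, the hypothetical helper below flags offsets that are not a whole number of frames within a tolerance. The function name and the tolerance value are assumptions, not the library's implementation.

```python
def find_unsynced_deltas(
    delta_timestamps: dict[str, list[float]],
    fps: float,
    tolerance_s: float = 1e-4,
) -> dict[str, list[float]]:
    """Return, per key, the offsets that do not fall on a multiple of 1/fps."""
    unsynced = {}
    for key, deltas in delta_timestamps.items():
        bad = []
        for delta in deltas:
            frames = delta * fps
            # Distance (in seconds) between the offset and the nearest frame
            error_s = abs(frames - round(frames)) / fps
            if error_s > tolerance_s:
                bad.append(delta)
        if bad:
            unsynced[key] = bad
    return unsynced


if __name__ == "__main__":
    deltas = {
        "observation.images.top": [-1, -1 / 50, 0, 25 / 50],
        "observation.state": [-1, -0.031, 0],  # -0.031 s is not on the 50 fps grid
    }
    problems = find_unsynced_deltas(deltas, fps=50)
    if problems:
        print(f"delta_timestamps not in sync with fps: {problems}")
```

Running a check like this once up front, rather than on every item access, is what lets sync problems show up immediately.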

Try out the new feature to select / download specific episodes:

dataset = LeRobotDataset(repo_id=REPO_ID, episodes=[1, 10, 12, 40])

You can also create a new dataset:

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

REPO_ID = "your_hf_username/test_v2"

new_dataset = LeRobotDataset.create(
    repo_id=REPO_ID,
    fps=30,
    robot=robot,  # assumes `robot` is an already-instantiated robot object (not defined in this snippet)
    image_writer_threads_per_camera=1,
)

# TODO
frame = {
    ...
}
new_dataset.add_frame(frame)
new_dataset.add_episode(task="Do something")
new_dataset.consolidate()
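To show the add_frame / add_episode sequence above without a robot or any files on disk, here is a toy, in-memory stand-in. ToyEpisodeRecorder is entirely hypothetical: it only mimics the buffering pattern described in this PR (frames accumulate in episode_buffer, add_episode flushes them with a task prompt) and does not reproduce what LeRobotDataset actually writes.

```python
class ToyEpisodeRecorder:
    """In-memory illustration of the buffer-then-flush recording pattern."""

    def __init__(self, fps: int):
        self.fps = fps
        self.episode_buffer: list[dict] = []  # frames of the episode being recorded
        self.episodes: list[dict] = []  # completed episodes

    def add_frame(self, frame: dict) -> None:
        """Append one frame to the buffer, stamping it with its index and time."""
        frame_index = len(self.episode_buffer)
        stamped = {**frame, "frame_index": frame_index, "timestamp": frame_index / self.fps}
        self.episode_buffer.append(stamped)

    def add_episode(self, task: str) -> int:
        """Flush the buffer into a new episode and return its episode_index."""
        if not self.episode_buffer:
            raise ValueError("episode_buffer is empty; call add_frame() first")
        episode_index = len(self.episodes)
        self.episodes.append(
            {
                "episode_index": episode_index,
                "tasks": [task],
                "length": len(self.episode_buffer),
                "frames": self.episode_buffer,
            }
        )
        self.episode_buffer = []
        return episode_index

    def clear_episode_buffer(self) -> None:
        """Discard the frames of the episode currently being recorded."""
        self.episode_buffer = []


if __name__ == "__main__":
    recorder = ToyEpisodeRecorder(fps=30)
    for step in range(3):
        recorder.add_frame({"observation.state": [0.0, float(step)]})
    recorder.add_episode(task="Do something")
    print(recorder.episodes[0]["length"])
```

Keeping the in-progress episode separate from completed ones is what makes it cheap to discard a bad recording with clear_episode_buffer().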

Cadene · 2024-10-11T15:38:20Z

lerobot/common/datasets/lerobot_dataset.py

+                '~/.cache/huggingface/lerobot'.
+            episodes (list[int] | None, optional): If specified, this will only load episodes specified by
+                their episode_index in this list. Defaults to None.
+            split (str, optional): _description_. Defaults to "train".


I thought we were removing split?

aliberts · 2024-10-11

I've removed it in 8bd406e (it wasn't used anymore).
I suggest we simply keep a notion of split in the info.json, as I've done in the conversion script:

"splits": { "train": "0:50" }
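For illustration, a split entry like the one above could be turned into a list of episode indices as follows. This is only a sketch: it assumes the "start:stop" string is a half-open range of episode indices (like a Python slice), which this comment does not spell out, and parse_split is a hypothetical helper.

```python
def parse_split(spec: str) -> range:
    """Turn a "start:stop" string into a half-open range of episode indices."""
    start, stop = (int(part) for part in spec.split(":"))
    return range(start, stop)


if __name__ == "__main__":
    info = {"splits": {"train": "0:50"}}
    train_episodes = list(parse_split(info["splits"]["train"]))
    print(len(train_episodes), train_episodes[0], train_episodes[-1])
```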

apockill · 2024-11-07T18:17:12Z

Hey folks! Awesome work here. I was wondering what the timeline is for merging Dataset 2.0?

The reason I ask is I'm about to start working on adding support for Elephant Robotics MyArm M&C, and I'm not sure if I should target dataset 1.0 or 2.0 initially.

aliberts · 2024-11-14T08:08:28Z

> Hey folks! Awesome work here. I was wondering what the timeline is for merging Dataset 2.0?

@apockill Thank you for your support! The team took some time off and we're just getting back. Hopefully this will be merged very soon (we mainly need to update some more tests now).

> The reason I ask is I'm about to start working on adding support for Elephant Robotics MyArm M&C, and I'm not sure if I should target dataset 1.0 or 2.0 initially.

We will refactor how robot classes are structured soon, but for now this PR shouldn't have a big impact on adding support for a new robot. The only things being added to robot classes for now are motor_features and camera_features (this interface might change in the next PR).

WIP

ad115b6

aliberts added the ✨ Enhancement (New feature or request) and 🗃️ Dataset (Something dataset-related) labels Oct 3, 2024

aliberts self-assigned this Oct 3, 2024

aliberts linked an issue Oct 3, 2024 that may be closed by this pull request

[Feature Request] Add Detailed Information about Observation Fields to Metadata File in leRobotDataset Repository #383

Open

aliberts added 11 commits October 4, 2024 11:22

Merge remote-tracking branch 'origin/main' into user/aliberts/2024_09_25_reshape_dataset

17a1214

Add upload folders

1016a98

Add info.json link

07e113c

Merge remote-tracking branch 'origin/main' into user/aliberts/2024_09_25_reshape_dataset

028c17f

Add pixel channels

21ba4b5

Update info.json format

2d75b93

Rework LeRobotDataset.__init__

096824b

Merge remote-tracking branch 'origin/main' into user/aliberts/2024_09_25_reshape_dataset

3113038

Update LeRobotDataset.__get_item__

b417ceb

Add doc, scrap video_frame_keys attribute

6d2bc11

Add huggingface-hub patch for offline snapshot_download with local_dir

7f68088

Cadene self-requested a review October 11, 2024 15:10

Add padding keys and download_data option

3ea5312

Cadene reviewed Oct 11, 2024

aliberts added 11 commits October 11, 2024 18:52

Add suggestions from code review

8bd406e

Add multitask support, refactor conversion script

cf63334

Extend v1 compatibility

cbc51e1

Fix safe_version

f96773d

Cleanup, fix load_tasks

835ab5a

Update load_tasks doc

da78bbf

WIP add batch convert

9433ac5

Add fixes for batch convert

1102640

Add episode chunks logic, move_videos & lfs tracking fix

c146ba9

Write episodes as jsonlines

50a75ad

Add fixes for lfs tracking

ad3f112

aliberts added 22 commits October 31, 2024 21:43

Remove obsolete code

5ea7c78

Mock snapshot_download

cd1509d

Add tasks and episodes factories

2650872

Rename num_samples -> num_frames for consistency

79d114c

Simplify, add test content, add todo

293bdc7

Add img and img_tensor factories

375abd3

Add test_image_writer, accept PIL images, improve ImageWriter perf in main process

6b2ec1e

Add more options to img factories

7a342db

Add todo in skipped test

df2cb51

Fix test_online_buffer.py

ac79e8c

Add LeRobotDatasetMetadata

e4ba084

Fix hanging

16103cb

Fix vizualize

95a4b59

Fix werkzeug alert

c2d6fb6

Remove end-to-end tests

f6c90ca

Deactivate policies backward compatibility test

56e4603

Merge remote-tracking branch 'origin/main' into user/aliberts/2024_09_25_reshape_dataset

fde29e0

Fix advanced example 2

a6762ec

Remove reset_episode_index

74270c8

Move calculate_episode_data_index

7b159a6

Fix test_examples

b69a132

Fix test_examples

757ea17

aliberts marked this pull request as ready for review November 3, 2024 18:53

aliberts requested review from Cadene and michel-aractingi November 3, 2024 18:54

aliberts added 2 commits November 5, 2024 13:10

Refactor dataset features

aed9f40

Fix tests

f3630ad

Henry-Ellis mentioned this pull request Nov 6, 2024

Data version conversion #501

Open
