_March 2024_
tl;dr: A motion controller based on next-token prediction of sensorimotor tokens for bipedal humanoid locomotion. The most obvious advantage over previous methods is scalability.
#### Overall impression
The paper tackles humanoid control, specifically humanoid locomotion (standing upright and moving the legs), as an e2e control problem. Sequences of sensory observations and motor actions make up **sensorimotor trajectories**, the sentences of the physical world. Note that there are NO images or perception involved, only streams of relatively sparse, structured signals.
It rolls out observations and actions jointly, and in this way it is a world model.
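
Below is a minimal sketch (my own illustration, not the authors' code) of next-token prediction over interleaved sensorimotor tokens: observations and actions are projected into a shared embedding space, a causal transformer predicts the next token, and both streams are supervised. The dimensions and the continuous-regression loss are assumptions, not the paper's exact tokenization.

```python
# Minimal sketch: autoregressive prediction over interleaved o_1, a_1, o_2, a_2, ...
# Hypothetical sizes; continuous tokens regressed with MSE (an assumption).
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, D_MODEL, CTX = 36, 12, 192, 32  # hypothetical

class SensorimotorGPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.obs_in = nn.Linear(OBS_DIM, D_MODEL)
        self.act_in = nn.Linear(ACT_DIM, D_MODEL)
        self.pos = nn.Embedding(2 * CTX, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.obs_out = nn.Linear(D_MODEL, OBS_DIM)  # predicts next observation
        self.act_out = nn.Linear(D_MODEL, ACT_DIM)  # predicts next action

    def forward(self, obs, act):
        # obs: (B, T, OBS_DIM), act: (B, T, ACT_DIM); interleave as o_t, a_t, ...
        B, T, _ = obs.shape
        tok = torch.stack([self.obs_in(obs), self.act_in(act)], dim=2)
        tok = tok.reshape(B, 2 * T, D_MODEL) + self.pos(torch.arange(2 * T))
        causal = torch.triu(torch.full((2 * T, 2 * T), float("-inf")), diagonal=1)
        h = self.backbone(tok, mask=causal)
        # the position of o_t predicts a_t; the position of a_t predicts o_{t+1}
        return self.act_out(h[:, 0::2]), self.obs_out(h[:, 1::2])

model = SensorimotorGPT()
obs, act = torch.randn(2, CTX, OBS_DIM), torch.randn(2, CTX, ACT_DIM)
pred_act, pred_obs = model(obs, act)
loss = nn.functional.mse_loss(pred_act, act) + \
       nn.functional.mse_loss(pred_obs[:, :-1], obs[:, 1:])
```

Predicting the observation stream alongside the action stream is what makes this a (small) world model rather than just a policy.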
The model can transfer to the real world when trained with ONLY 27 hours of data. Another interesting fact is that the transformer-based policy is smoother and more accurate than the RL policy, even though the model is trained on trajectories produced by that RL policy. (The student surpasses the teacher. Why?)
[Locomotion as next token](locomotion_next_token_pred.md) deals with missing modalities via masked modeling, [Genie](genie.md) with a LAM (latent action model), and [VPT](vpt.md) with an IDM (inverse dynamics model).
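
A sketch of one way masked modeling can handle missing actions (my interpretation, with made-up names): observation-only clips get a learned [MASK] embedding in place of action tokens, and the action loss is only computed where ground-truth actions exist.

```python
# Sketch: mask-token input for missing actions + loss masking for o-only data.
import torch
import torch.nn as nn

D_MODEL, ACT_DIM = 192, 12                       # hypothetical sizes
act_in = nn.Linear(ACT_DIM, D_MODEL)
mask_embed = nn.Parameter(torch.zeros(D_MODEL))  # learned [MASK] action token

def embed_action_tokens(act, act_observed):
    """act: (B, T, ACT_DIM); act_observed: (B, T) bool, False for o-only clips."""
    emb = act_in(act)
    return torch.where(act_observed.unsqueeze(-1), emb, mask_embed)

def action_loss(pred_act, act, act_observed):
    """Supervise action predictions only where ground-truth actions exist."""
    per_token = ((pred_act - act) ** 2).mean(-1)          # (B, T)
    return (per_token * act_observed).sum() / act_observed.sum().clamp(min=1)
```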
#### Key ideas
- Data sources: a diverse dataset with potentially missing modalities, scraped from the internet and from simulators.
  - Prior RL policies (o + a)
    - Generates torque, which is not consistent with the action space, so the actions were dropped.
  - Motion capture (o only)
    - MoCap captures human keypoints in 3D, and inverse kinematics is used to find the corresponding robot pose.
    - Fitted joint positions are used as observations; the actuated trajectory is used as grounding.
  - YouTube videos (o only, much noisier)
    - Human poses are reconstructed from videos with CV techniques, and both MoCap and YouTube trajectories are retargeted via inverse kinematics (see the toy IK sketch below).
- HW: Agility Robotics
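
As a toy illustration of the retargeting step (a 2-link planar leg, nothing like the real pipeline), inverse kinematics can be posed as least-squares fitting of joint angles so the robot's keypoints match the human keypoints extracted from MoCap or video. Link lengths and keypoints below are made up.

```python
# Toy IK retargeting: fit joint angles of a 2-link planar leg so its keypoints
# (knee, foot) match target keypoints from MoCap / video.
import numpy as np
from scipy.optimize import least_squares

THIGH, SHANK = 0.4, 0.4  # hypothetical link lengths (m)

def forward_kinematics(q):
    """q = (hip, knee) angles -> stacked knee and foot positions in the sagittal plane."""
    hip, knee = q
    knee_pos = np.array([THIGH * np.sin(hip), -THIGH * np.cos(hip)])
    foot_pos = knee_pos + np.array([SHANK * np.sin(hip + knee), -SHANK * np.cos(hip + knee)])
    return np.concatenate([knee_pos, foot_pos])

def retarget(target_keypoints, q0=np.zeros(2)):
    """Joint angles whose keypoints best match the (retargeted) human keypoints."""
    return least_squares(lambda q: forward_kinematics(q) - target_keypoints, q0).x

human_keypoints = np.array([0.10, -0.38, 0.30, -0.70])  # knee_x, knee_z, foot_x, foot_z
print(retarget(human_keypoints))  # fitted (hip, knee) angles in radians
```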
- What is the frequency (in Hz) of the sensorimotor trajectory?
#### Notes taken during a tech-sharing session with a co-author
* Motion planning: classically, a gait is given; that is no longer the case here, and the controller outputs motor torque directly.
* HW:
  * Digit: from Oregon State, spring-loaded.
  * Underactuated DoFs, with a 6-DoF floating base.
  * Foot placement: 3 cameras in the hip, used for detecting ground conditions (slope, gravel).
  * Ostrich-inspired design: it cannot sit, and cannot be modified into a driver.
  * Intel NUC: GPU options cap out at the Nvidia 2060 in the NUC.
* Dataset: the MoCap and internet portions are much bigger.
* NN-based controller: outputs joint positions at 30-50 Hz, a downsampled view; NN + PID together play the role of the MBC (see the PD sketch after this list).
* Model-based controller: outputs joint torques; already a converged approach; runs at high frequency (1000 Hz). The NN-based policy cannot achieve this frequency.
* Model-based: roughly 100k lines of code to unscrew a bottle cap. Model-based control runs fast but needs a lot of compute optimization (simpler dynamics, linearization).
* No images for now. Short term: go straight and adapt to slope and ground conditions, i.e., terrain in the wild.
* The model has ~2M parameters.
* Misc
  * Demos are typically cherry-picked; in public, tolerance for failure is low.
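
A minimal sketch of the NN + PID split described above (my reading of the notes, not Agility's actual controller): the policy emits joint-position targets at a low rate, and a high-rate PD loop turns them into torques. Gains, rates, and the joint count are made up.

```python
# Low-rate policy targets -> high-rate PD torque tracking (hypothetical numbers).
import numpy as np

KP, KD = 80.0, 2.0             # hypothetical PD gains
POLICY_HZ, CTRL_HZ = 50, 1000  # ~30-50 Hz policy, ~1 kHz joint-level control

def pd_torque(q_target, q, qd):
    """Joint torques from a PD law tracking the policy's position targets."""
    return KP * (q_target - q) - KD * qd

q, qd = np.zeros(12), np.zeros(12)   # hypothetical 12-joint state (pos, vel)
q_target = np.full(12, 0.1)          # one policy output, held between policy steps
for _ in range(CTRL_HZ // POLICY_HZ):
    tau = pd_torque(q_target, q, qd)
    # ...send tau to the robot / simulator and read back updated (q, qd)...
```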