Commit b85eefe

Update locomotion next token based on comm with coauthor
1 parent dd45597 commit b85eefe

1 file changed (+31, -1 lines)


paper_notes/locomotion_next_token_pred.md

@@ -2,7 +2,7 @@
_March 2024_

-tl;dr: A motion controller based on next token prediction of sensorimotor tokens for bipedal humanoid locomotion.
+tl;dr: A motion controller based on next token prediction of sensorimotor tokens for bipedal humanoid locomotion. The most obvious advantage over previous methods is its scalability.

#### Overall impression
The paper tackles humanoid control, specifically humanoid locomotion (standing upright and moving the legs), as an end-to-end control problem. The sequences of sensory observations and motor actions make up **sensorimotor trajectories**, the sentences of the physical world. Note that NO images or perception are involved, only streams of relatively sparse, structured signals.
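The "sentences of the physical world" framing can be sketched concretely: interleave observation and action tokens into one stream and train causally on shift-by-one targets. A minimal illustration with made-up token values (token layout only, no actual transformer):

```python
import numpy as np

def interleave(observations, actions):
    """Flatten (o_t, a_t) pairs into one sensorimotor token stream,
    [o_0, a_0, o_1, a_1, ...]: the "sentence" the model reads."""
    tokens = []
    for o, a in zip(observations, actions):
        tokens.extend(o)  # observation tokens for step t
        tokens.extend(a)  # action tokens for step t
    return np.array(tokens)

def next_token_pairs(tokens):
    """Causal training pairs: the model sees tokens[:i] and predicts tokens[i]."""
    return tokens[:-1], tokens[1:]

# toy example: 2 timesteps, 2 observation tokens + 1 action token per step
obs = [[10, 11], [20, 21]]
act = [[1], [2]]
seq = interleave(obs, act)              # [10, 11, 1, 20, 21, 2]
inputs, targets = next_token_pairs(seq)
```

Because actions appear in the same stream as observations, rolling the model out autoregressively predicts both, which is why it doubles as a world model.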
@@ -19,6 +19,8 @@ It rolls out observation and action jointly, and in this way, it is a world mode
The model can transfer to the real world when trained with ONLY 27 hours of data. Another interesting fact is that the transformer-based policy is smoother and more accurate than the RL policy, even though it is trained on trajectories produced by that very RL policy. (The student surpassing the teacher? Why?)

+[Locomotion as next token](locomotion_next_token_pred.md) deals with the missing-modality problem via masked modeling, [Genie](genie.md) via a latent action model (LAM), and [VPT](vpt.md) via an inverse dynamics model (IDM).
+
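A minimal sketch of the masked-modeling trick, assuming a dedicated [MASK] token id and a per-token loss weight (both are my assumptions, not details confirmed by the paper):

```python
import numpy as np

MASK = -1  # stand-in id for a learned [MASK] token (assumed, not from the paper)

def mask_missing_actions(seq, action_positions, has_actions):
    """For observation-only sources (MoCap, YouTube), fill the action slots
    with [MASK] tokens so one next-token model can consume every data source."""
    seq = np.array(seq, copy=True)
    if not has_actions:
        seq[action_positions] = MASK
    return seq

def loss_weights(seq):
    """Zero the loss on masked slots: never supervise on missing tokens."""
    return (seq != MASK).astype(float)

# toy stream [o, o, a, o, o, a] coming from an observation-only source
seq = mask_missing_actions([10, 11, 1, 20, 21, 2], [2, 5], has_actions=False)
w = loss_weights(seq)
```

The same sequence format then serves complete trajectories (RL policy rollouts) and observation-only ones without branching the architecture.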
#### Key ideas
- Data source: a diverse dataset with potentially missing modalities, scraped from the internet and from simulators.
- Prior RL policies (o + a)
@@ -31,6 +33,7 @@ The model can transfer to real world when trained with ONLY 27 hours of data. An
- Generates torque, which is not consistent with the action space, so actions are dropped.
- Motion capture (o only)
  - MoCap captures human keypoints in 3D, and inverse kinematics is used to find the corresponding robot pose.
+  - Fitted joint positions serve as observations; the actuated trajectory is used as grounding.
- YouTube videos (o only, much noisier)
  - Reconstruct human motion from videos with CV techniques, and retarget both motion-capture and YouTube trajectories via inverse kinematics.
- HW: Agility Robotics
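Retargeting keypoints to a robot pose boils down to inverse kinematics; a toy damped-least-squares IK for a 2-link planar arm (illustrative only, not the paper's pipeline; link lengths, damping, and step cap are made up):

```python
import numpy as np

def fk(theta, l1=1.0, l2=1.0):
    """End-effector position of a 2-link planar arm."""
    x = l1 * np.cos(theta[0]) + l2 * np.cos(theta[0] + theta[1])
    y = l1 * np.sin(theta[0]) + l2 * np.sin(theta[0] + theta[1])
    return np.array([x, y])

def jacobian(theta, l1=1.0, l2=1.0):
    s1, c1 = np.sin(theta[0]), np.cos(theta[0])
    s12, c12 = np.sin(theta[0] + theta[1]), np.cos(theta[0] + theta[1])
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [ l1 * c1 + l2 * c12,  l2 * c12]])

def ik(target, iters=300, damping=0.1, max_step=0.2):
    """Damped least squares: pull the joints toward a keypoint target."""
    theta = np.zeros(2)
    for _ in range(iters):
        err = target - fk(theta)
        if np.linalg.norm(err) < 1e-6:
            break
        J = jacobian(theta)
        # (J^T J + lambda I)^-1 J^T err, with a trust-region-style step cap
        dtheta = np.linalg.solve(J.T @ J + damping * np.eye(2), J.T @ err)
        step = np.linalg.norm(dtheta)
        if step > max_step:
            dtheta *= max_step / step
        theta = theta + dtheta
    return theta

theta = ik(np.array([1.2, 0.8]))  # joint angles reaching the keypoint
```

The real retargeting operates on a full humanoid kinematic tree, but the damped-least-squares structure is the standard workhorse.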
@@ -69,3 +72,30 @@ The model can transfer to real world when trained with ONLY 27 hours of data. An
- What is the frequency (in Hz) of the sensorimotor trajectory?

+#### Notes taken during tech sharing with co-author
+* Motion planning: classically a gait is given, but not anymore; the model outputs motor torques directly.
+* HW:
+  * Digit: Oregon State, spring loaded.
+  * Underactuated DoF; 6-DoF floating base.
+  * Foot placement: 3 cameras in the hip, used for detecting ground conditions (slope, gravel).
+  * Ostrich-inspired: cannot sit; cannot be modified to be a driver.
+  * Intel NUC: GPU options cap out at the Nvidia 2060 in the NUC.
+* Dataset: MoCap and internet data are much bigger.
+* NN-based: outputs joint positions at 30-50 Hz, a downsampled view. NN + PID = MBC.
+* Model-based controller: outputs joint torques. Already converged. High frequency (1000 Hz); the NN cannot achieve this frequency.
+* Model-based: ~100k lines of code to unscrew a bottle cap. Model-based control runs fast but needs a lot of compute optimization (simpler dynamics, linearization).
+* Currently no images. Short term: go straight, adapt to slope and ground conditions. Terrain in the wild.
+* Model: 2M parameters.
+* Misc
+  * Typically cherry-picked. In public, tolerance for failure is low.
+  * Poincaré map (phase portrait): stability of a dynamical system. Periodic motion shows up as closed laps.
+  * Locomotion -> manipulation, adhering to the evolution of human beings.
+  * Can be transferred to manipulation. RT-1, RT-2, ALOHA are all based on LLMs, but with different methods.
+  * How to learn from third-person-view videos?
+  * Methodology: RL -> self-supervised + IL. Dataset collection must be large scale.
+  * Manipulation: tactile sensing, the last 1 mm. TRI has related work.
+  * RL as a boot-loader for LLM-based IL.
+  * Psychology: verbal imitation.
+  * RPT: several steps of short-term look-ahead. Stanford paper (which?). 16 steps. Diffusion-based.
+
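The "NN + PID = MBC" note above can be sketched: the NN emits joint-position targets at 30-50 Hz, and a high-rate PID loop converts them into torques. A toy single-joint version with unit inertia (all gains and rates are made up):

```python
class PID:
    """High-rate position-to-torque loop sitting under the low-rate NN policy."""
    def __init__(self, kp=50.0, ki=0.0, kd=5.0, dt=1e-3):  # 1 kHz inner loop
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_err = 0.0

    def torque(self, target_pos, pos):
        err = target_pos - pos
        self.integral += err * self.dt
        deriv = (err - self.prev_err) / self.dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

# one joint with unit inertia tracking a constant NN joint-position target
pid = PID()
pos, vel, target = 0.0, 0.0, 1.0  # the NN would refresh `target` at 30-50 Hz
for _ in range(2000):             # 2 s of 1 kHz control
    tau = pid.torque(target, pos)
    vel += tau * pid.dt           # qdd = tau / I, with I = 1
    pos += vel * pid.dt
```

This split is why the NN only needs 30-50 Hz: the 1000 Hz stabilization burden stays in the classical inner loop.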
