Conversation
Code Review
This pull request introduces StackedDecoderLayers to optimize the transformer forward pass using nnx.vmap and jax.lax.scan, which is a significant performance improvement for training and prefill. The changes are well-encapsulated, making the model code cleaner and more efficient. However, I've identified a critical bug in the load_safetensors utility related to how parameter paths are handled, which would prevent it from loading weights correctly for certain layer types.
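For context, the general pattern behind this optimization is Flax NNX's "scan over layers": create the layers under nnx.vmap so their parameters are stacked along a leading axis, then run them with nnx.scan (which wraps jax.lax.scan) so the layer body is traced and compiled only once. The sketch below shows only that general pattern; Block and StackedBlocks are illustrative names and this is not the code from this PR.

import jax
import jax.numpy as jnp
from flax import nnx


class Block(nnx.Module):
    # Illustrative decoder-style block, not the PR's decoder layer.
    def __init__(self, dim: int, *, rngs: nnx.Rngs):
        self.linear = nnx.Linear(dim, dim, rngs=rngs)

    def __call__(self, x):
        return jax.nn.relu(self.linear(x))


class StackedBlocks(nnx.Module):
    # Holds num_layers Blocks whose parameters are stacked on a leading axis.
    def __init__(self, dim: int, num_layers: int, *, rngs: nnx.Rngs):
        @nnx.split_rngs(splits=num_layers)
        @nnx.vmap(in_axes=(0,), out_axes=0)
        def create_block(rngs: nnx.Rngs):
            return Block(dim, rngs=rngs)

        self.blocks = create_block(rngs)

    def __call__(self, x):
        # Scan over the stacked layer axis instead of a Python loop, so the
        # compiled program contains a single copy of the layer body.
        @nnx.scan(in_axes=(0, nnx.Carry), out_axes=nnx.Carry)
        def forward(block: Block, x):
            return block(x)

        return forward(self.blocks, x)


model = StackedBlocks(16, num_layers=4, rngs=nnx.Rngs(0))
print(model(jnp.ones((2, 16))).shape)  # (2, 16)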
skyrl-tx/tx/utils/models.py
Outdated
updates = []
for path, param in nnx.to_flat_state(nnx.state(module)):
    if filter_fn is not None and not filter_fn(path):
        continue
    key = key_prefix + get_param_key(path)
    if skip_lora and ("lora_A" in path or "lora_B" in path or "lora_scaling" in path or "lora_ranks" in path):
        continue
    if "experts" in path:
        tensor = np.stack([tensors[key_prefix + get_expert_key(path, i)].T for i in range(config.get_num_experts())], axis=0)
    else:
        tensor = tensors[key] if "embed_tokens" in key else tensors[key].T
    if len(path) >= 2 and path[-2] in {"q_proj", "k_proj", "v_proj", "o_proj"}:
        tensor = tensor.reshape(param.shape)
    assert param.shape == tensor.shape, f"shape mismatch for {key}"
    updates.append((path, jax.device_put(tensor.astype(param.dtype), param.sharding)))
nnx.update(module, nnx.from_flat_state(updates))
The checks for parameter types like "lora_A" in path or "experts" in path are incorrect. The path variable is a tuple of nnx.path.PathEntry objects, not strings, so these checks will always evaluate to False. This will prevent weights for LoRA, experts, and projections from being loaded correctly.
To fix this, you should convert the path to a tuple of strings before performing these checks. This will ensure that the logic correctly identifies the parameter types and applies the appropriate loading logic.
Suggested change:

updates = []
for path, param in nnx.to_flat_state(nnx.state(module)):
    path_str_tuple = tuple(map(str, path))
    if filter_fn is not None and not filter_fn(path):
        continue
    key = key_prefix + get_param_key(path)
    if skip_lora and any(p in path_str_tuple for p in ("lora_A", "lora_B", "lora_scaling", "lora_ranks")):
        continue
    if "experts" in path_str_tuple:
        tensor = np.stack([tensors[key_prefix + get_expert_key(path, i)].T for i in range(config.get_num_experts())], axis=0)
    else:
        tensor = tensors[key] if "embed_tokens" in key else tensors[key].T
    if len(path_str_tuple) >= 2 and path_str_tuple[-2] in {"q_proj", "k_proj", "v_proj", "o_proj"}:
        tensor = tensor.reshape(param.shape)
    assert param.shape == tensor.shape, f"shape mismatch for {key}"
    updates.append((path, jax.device_put(tensor.astype(param.dtype), param.sharding)))
nnx.update(module, nnx.from_flat_state(updates))
        return self.get_metadata("_parent")[self.get_metadata("_idx")].shape


class StackedDecoderLayers(nnx.Module):
Probably the easiest way to implement DeepSeekV3 is to add a DualStackedDecoderLayers class that has two StackedDecoderLayers as members and exposes the same interface as StackedDecoderLayers (modulo the constructor, which can take two create_layer_fn functions and their respective layer counts as arguments). This could be a separate PR.
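A rough sketch of what that could look like (the constructor and call signatures below are assumptions for illustration, not the actual StackedDecoderLayers API from this PR):

from flax import nnx


class DualStackedDecoderLayers(nnx.Module):
    # Hypothetical: chains two StackedDecoderLayers so that, e.g., the dense
    # and MoE decoder layers of DeepSeekV3 each get a homogeneous stack.
    def __init__(self, create_layer_fn_a, num_layers_a, create_layer_fn_b, num_layers_b, *, rngs: nnx.Rngs):
        # Assumes a StackedDecoderLayers(create_layer_fn, num_layers, rngs=...) constructor.
        self.stack_a = StackedDecoderLayers(create_layer_fn_a, num_layers_a, rngs=rngs)
        self.stack_b = StackedDecoderLayers(create_layer_fn_b, num_layers_b, rngs=rngs)

    def __call__(self, hidden_states, *args, **kwargs):
        # Same interface as StackedDecoderLayers: run the first stack, then
        # feed its output into the second.
        hidden_states = self.stack_a(hidden_states, *args, **kwargs)
        return self.stack_b(hidden_states, *args, **kwargs)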
/gemini review |
Code Review
This pull request introduces a significant and valuable optimization by implementing stacked weights for decoder layers, which should improve training and prefill performance by leveraging jax.lax.scan. The use of StackedDecoderLayers and the ArrayRef helper class to maintain compatibility with standard checkpoint formats is a clever approach.
The changes are well-integrated across the model, layers, and utility functions. However, I've found a critical issue in the save_safetensors function: it was not fully updated to handle the new ArrayRef variable types returned by unstack_state, which will cause model saving to fail. I've also included a medium-severity suggestion to improve the robustness of extract_adapter_state.
Addressing these points will help ensure the new implementation is both performant and correct.
# unstack_state converts stacked paths (layers._stacked.xxx) to per-layer paths
# (layers.0.xxx) matching the checkpoint key format used by HuggingFace
model_params = nnx.to_flat_state(unstack_state(model))
While using unstack_state is correct for handling stacked models, it introduces a bug in the subsequent loop (lines 146-160). The param variable in the loop is now an ArrayRef (a type of nnx.Variable), not a raw JAX array. Operations like param.reshape, param.T, or slicing like param[i,:,:] will fail on this object.
Additionally, checks like if "q_proj" in path and if "embed_tokens" in path are incorrect: path is a tuple of strings, so these checks are too broad.
The loop body needs to be updated to correctly handle nnx.Variable objects by first extracting their array value (e.g., value = param[...]) and then performing operations on that value. The checks should also be made more specific, for example by checking path[-2] for layer types or key for embed_tokens.
Here is a suggested correction for the loop body:
tensors = {}
for path, param in model_params:
    if "rngs" in path:
        continue
    if filter_fn is not None and not filter_fn(path):
        continue
    key = get_param_key(path, prefix=prefix)
    value = param[...]
    if "experts" in path:
        for i in range(config.get_num_experts()):
            tensors[get_expert_key(path, i)] = value[i, :, :].T
        continue
    if path[-2] in {"q_proj", "k_proj", "v_proj"}:
        value = value.reshape(value.shape[0], -1)
    elif path[-2] == "o_proj":
        value = value.reshape(-1, value.shape[-1])
    tensors[key] = value if "embed_tokens" in key else value.T
skyrl-tx/tx/utils/models.py
Outdated
idx = get_adapter_idx(path, adapter_index)
if key == "lora_A":
    return p[idx][..., :, :rank]
return p[idx][..., :rank, :]
The original assertion on parameter dimensions was correctly removed as it didn't account for stacked weights. However, it would be good to re-introduce a corrected assertion to maintain robustness. You can assert on the shape of the sliced parameter, which should have a consistent number of dimensions (3 for linear, 4 for expert LoRA) regardless of whether the original parameter was stacked.
This helps catch potential issues with parameter shapes early.
Suggested change:

idx = get_adapter_idx(path, adapter_index)
sliced_p = p[idx]
assert sliced_p.ndim in {3, 4}, f"LoRA parameters sliced by adapter should have 3 or 4 dimensions, got shape {sliced_p.shape}"
if key == "lora_A":
    return sliced_p[..., :, :rank]
return sliced_p[..., :rank, :]
This is based on all the great work that @raulchen did in #996 and #906; it also fixes the performance regression in decoding compared to the main branch.