Conversation

@mvinci12
Contributor

@mvinci12 mvinci12 commented Nov 7, 2025

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Contributor

@KeitaW KeitaW left a comment

Nice work! But let's not add a submodule.


# Clone and install VERL
WORKDIR /workspace
RUN git clone https://github.com/volcengine/verl.git
Contributor

You clone verl here. Why do you also need the same repository as a submodule?

Contributor Author

I removed the submodule and added a git clone step to the README. The git clone in the Dockerfile is only there so that RUN pip install -e . can run afterwards.

We run our code from the verl repo cloned locally (we copy this local repo to the Ray pods), so that users can make direct modifications to the training code.

Comment on lines 1 to 74
apiVersion: v1
data:
collector.yaml: "receivers:\n prometheus:\n config:\n scrape_configs:\n
\ - job_name: kubernetes-apiservers\n scrape_interval: 30s\n bearer_token_file:
/var/run/secrets/kubernetes.io/serviceaccount/token\n kubernetes_sd_configs:\n
\ - role: endpoints\n relabel_configs:\n - action:
keep\n regex: default;kubernetes;https\n source_labels:\n
\ - __meta_kubernetes_namespace\n - __meta_kubernetes_service_name\n
\ - __meta_kubernetes_endpoint_port_name\n # Add cluster
name and id as label\n - target_label: cluster_id\n replacement:
0gto4cvyhcmd\n - target_label: cluster_name\n replacement:
mvincig-rlvr\n scheme: https\n tls_config:\n ca_file:
/var/run/secrets/kubernetes.io/serviceaccount/ca.crt\n insecure_skip_verify:
true\n - job_name: kube-state-metrics\n scrape_interval: 30s\n
\ kubernetes_sd_configs:\n - role: service\n relabel_configs:\n
\ - action: keep\n regex: hyp-obs-kube-state-metrics \n
\ source_labels:\n - __meta_kubernetes_service_name\n
\ - action: labelmap\n regex: __meta_kubernetes_service_label_(.+)\n
\ - source_labels:\n - __meta_kubernetes_namespace\n
\ target_label: kubernetes_namespace\n - source_labels:\n
\ - __meta_kubernetes_service_name\n target_label:
kubernetes_name\n # Add cluster name and id as label\n -
target_label: cluster_id\n replacement: 0gto4cvyhcmd\n -
target_label: cluster_name\n replacement: mvincig-rlvr\n -
job_name: kubeflow-trainer-metrics\n scrape_interval: 30s\n kubernetes_sd_configs:\n
\ - role: service\n relabel_configs:\n - action:
drop\n regex: hp-training-operator-controller-manager-metrics-service\n
\ source_labels:\n - __meta_kubernetes_service_name\n
\ - action: keep\n regex: (.*kubeflow.*-controller-manager|training-operator)\n
\ source_labels:\n - __meta_kubernetes_service_name\n
\ - action: keep\n regex: monitoring-port\n source_labels:\n
\ - __meta_kubernetes_service_port_name\n - action: labelmap\n
\ regex: __meta_kubernetes_service_label_(.+)\n - source_labels:\n
\ - __meta_kubernetes_namespace\n target_label: kubernetes_namespace\n
\ - source_labels:\n - __meta_kubernetes_service_name\n
\ target_label: kubernetes_name\n # Add cluster name and
id as label\n - target_label: cluster_id\n replacement:
0gto4cvyhcmd\n - target_label: cluster_name\n replacement:
mvincig-rlvr\n metrics_path: /metrics\n\n - job_name: hp-training-operator-metrics\n
\ scrape_interval: 30s\n kubernetes_sd_configs:\n -
role: service\n relabel_configs:\n - action: keep\n regex:
hp-training-operator-controller-manager-metrics-service\n source_labels:\n
\ - __meta_kubernetes_service_name\n - action: keep\n
\ regex: metrics-port\n source_labels:\n -
__meta_kubernetes_service_port_name\n - action: labelmap\n regex:
__meta_kubernetes_service_label_(.+)\n - source_labels:\n -
__meta_kubernetes_namespace\n target_label: kubernetes_namespace\n
\ - source_labels:\n - __meta_kubernetes_service_name\n
\ target_label: kubernetes_name\n # Add cluster name and
id as label\n - target_label: cluster_id\n replacement:
0gto4cvyhcmd\n - target_label: cluster_name\n replacement:
mvincig-rlvr\n metrics_path: /metrics\n scheme: http\n\nprocessors:\n
\ batch:\n timeout: 5s\n send_batch_size: 5000\n send_batch_max_size:
6000\n\n\nexporters:\n debug:\n verbosity: detailed\n prometheusremotewrite:\n
\ add_metric_suffixes: false\n endpoint: https://aps-workspaces.ap-northeast-1.amazonaws.com/workspaces/ws-b1b56a1d-2b76-48c8-91d2-e65db2925090/api/v1/remote_write\n
\ auth:\n authenticator: sigv4auth\n retry_on_failure:\n enabled:
true\n initial_interval: 1s\n max_interval: 10s\n max_elapsed_time:
60s\n\nextensions:\n sigv4auth:\n region: ap-northeast-1\n service: \"aps\"\n\nservice:\n
\ extensions: [sigv4auth]\n pipelines:\n metrics:\n receivers: [prometheus]\n
\ processors: [batch]\n exporters: [prometheusremotewrite]\n"
kind: ConfigMap
metadata:
creationTimestamp: "2025-10-31T21:35:37Z"
name: hyperpod-observability-central-collector-config
namespace: hyperpod-observability
ownerReferences:
- apiVersion: observability.sagemaker.aws.amazon.com/v1alpha1
blockOwnerDeletion: true
controller: true
kind: ObservabilityConfig
name: hyperpod-observability
uid: bda07f42-03ed-4c46-867a-a6349cbddad7
resourceVersion: "30938"
uid: 533995a8-c884-45c4-ac8f-b957e19c1ceb
Contributor

I guess this is committed accidentally?

Contributor Author

yes - removed

@@ -0,0 +1,109 @@
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
Contributor

Let's refer to the existing nccl-tests instead.

Contributor Author

removed

@KeitaW
Contributor

KeitaW commented Nov 7, 2025

The directory structure under test cases follows framework/library/testcase (e.g. https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/pytorch/trl/grpo and https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/pytorch/picotron/SmolLM-1.7B). Let's use the same structure here (pytorch/verl/rlvr?).
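As a rough illustration of the framework/library/testcase convention described above, the path for this test case could be assembled as in the sketch below. The helper name testcase_path is hypothetical and is not part of the repository; the 3.test_cases prefix is taken from the example URLs in the comment.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not part of the repository): build a test-case path
# from its three components, following framework/library/testcase.
testcase_path() {
  local framework="$1" library="$2" testcase="$3"
  printf '3.test_cases/%s/%s/%s\n' "$framework" "$library" "$testcase"
}

# Path proposed in the review comment for this PR.
testcase_path pytorch verl rlvr
```

With the proposed names, this would place the test case alongside the existing pytorch/trl/grpo and pytorch/picotron/SmolLM-1.7B examples.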

Comment on lines 6 to 29
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

Contributor

We have most of them in the existing .gitignore.

Contributor Author

removed

@mvinci12
Contributor Author

mvinci12 commented Nov 9, 2025

lmk - thanks keita!

@KeitaW
Contributor

KeitaW commented Nov 11, 2025

@@ -0,0 +1,50 @@
# Ray Cluster Environment Variables
# export IMAGE="hiyouga/verl:ngc-th2.6.0-cu126-vllm0.8.4-flashinfer0.2.2-cxx11abi0"
Contributor

Suggested change
# export IMAGE="hiyouga/verl:ngc-th2.6.0-cu126-vllm0.8.4-flashinfer0.2.2-cxx11abi0"

Comment on lines 22 to 24
# export MODEL_PATH="Qwen/Qwen2.5-32B" # Set this to load model from HuggingFace
#export MODEL_PATH="Qwen/Qwen3-0.6B-Base"
export MODEL_PATH="Qwen/Qwen3-8B"
Contributor

Suggested change
# export MODEL_PATH="Qwen/Qwen2.5-32B" # Set this to load model from HuggingFace
#export MODEL_PATH="Qwen/Qwen3-0.6B-Base"
export MODEL_PATH="Qwen/Qwen3-8B"
export MODEL_PATH="Qwen/Qwen3-8B" # Set this to load model from HuggingFace
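A possible alternative to keeping commented-out model lines, sketched here as an assumption rather than something proposed in this review: let a value already set in the caller's environment take precedence and fall back to the default otherwise. The default shown is the one from the suggested change.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Keep a MODEL_PATH already exported by the caller; otherwise use the default
# HuggingFace model ID from the suggested change above.
export MODEL_PATH="${MODEL_PATH:-Qwen/Qwen3-8B}"

echo "MODEL_PATH=${MODEL_PATH}"
```

With this form, a user could switch models by exporting MODEL_PATH before sourcing the file, without editing it.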

Contributor

@KeitaW KeitaW left a comment

Left some additional minor comments.
