Skip to content

Latest commit

 

History

History
60 lines (46 loc) · 1.71 KB

README.md

File metadata and controls

60 lines (46 loc) · 1.71 KB

toolbox

Containerization and setup tools for bootstrapping my ML research projects.

Backlog:

  • Skypilot integration to spin up instances from cli
  • Use a per-project config file for advanced configuration
  • integration with ray job queue and gpuboard project for observability
  • cleaner, less hardcoded defaults for dockerfiles, etc.
  • Rich devcontainer support
graph TB
    classDef controller fill:#2E5EAA,stroke:#173A7B,color:white
    classDef ray fill:#0194E2,stroke:#015B8C,color:white
    classDef container fill:#67B7D1,stroke:#4A8396,color:white
    classDef storage fill:#FFA500,stroke:#CC8400,color:white
    classDef cli fill:#4CAF50,stroke:#357935,color:white

    Storage[(Remote Storage<br>FUSE/S3)]
    Controller[Controller Service<br>Job Monitor & Resource Manager]:::controller
    CLI[Local CLI Tool]:::cli

    subgraph Instance1
        I1_Ray[Ray Cluster]:::ray
        I1_C1[Container 1]:::container
        I1_C2[Container 2]:::container
    end

    subgraph Instance2
        I2_Ray[Ray Cluster]:::ray
        I2_C1[Container 1]:::container
    end

    subgraph InstanceN
        IN_Ray[Ray Cluster]:::ray
        IN_C1[Container 1]:::container
        IN_C2[Container 2]:::container
    end

    CLI -->|Commands & Monitoring| Controller
    Controller -->|Monitor/Control| Instance1
    Controller -->|Monitor/Control| Instance2
    Controller -->|Monitor/Control| InstanceN

    I1_C1 -->|Submit Jobs| I1_Ray
    I1_C2 -->|Submit Jobs| I1_Ray

    I2_C1 -->|Submit Jobs| I2_Ray

    IN_C1 -->|Submit Jobs| IN_Ray
    IN_C2 -->|Submit Jobs| IN_Ray

    Instance1 -.->|Data| Storage
    Instance2 -.->|Data| Storage
    InstanceN -.->|Data| Storage

    Controller -.->|Job Status & Metrics| CLI
Loading