Cli update #301

Open
wants to merge 47 commits into
base: main
47 commits
9050740
CLI Layout and Create RayCluster function (#227)
carsonmh Jul 26, 2023
380f4d3
CLI Authentication (#252)
carsonmh Jul 31, 2023
5107aed
change: use updated auth on get_cluster
carsonmh Jul 31, 2023
2c7f0c7
Cli submit delete raycluster (#257)
carsonmh Aug 2, 2023
3bc9120
add: design doc
carsonmh Aug 3, 2023
a6753d3
add: cli status function
carsonmh Jul 31, 2023
47fda05
add: details cli function
carsonmh Jul 31, 2023
ae451c1
create: function to list rayclusters in all namespaces
carsonmh Jul 27, 2023
9372d4c
add: list raycluster function cli
carsonmh Jul 31, 2023
c213393
test: add unit test for list_clusters_all_namespaces
carsonmh Jul 31, 2023
e41dfab
test: add unit tests for status, details, and list CLI commands
carsonmh Jul 31, 2023
819cf57
cleanup
carsonmh Jul 31, 2023
f548525
fix: unit tests
carsonmh Jul 31, 2023
5937034
change: make namespace required for functions
carsonmh Aug 2, 2023
7ccb625
add: error handling for cluster not found
carsonmh Aug 2, 2023
f190786
add: plural alias to list raycluster
carsonmh Aug 3, 2023
7ee83fd
change: use current namespace when not specified
carsonmh Aug 3, 2023
ec3059e
refactor: make _get_all_rayclusters which handles namespaced and all …
carsonmh Aug 3, 2023
1a94b26
cleanup
carsonmh Aug 3, 2023
61e6723
create: CLI job define command
carsonmh Aug 3, 2023
26d00a1
create: submit job command cli
carsonmh Aug 3, 2023
6e4bcac
fix: login help message no longer has ellipsis
carsonmh Aug 3, 2023
7ad8127
test: unit tests for submit define job
carsonmh Aug 3, 2023
bc8113f
fix: typo
carsonmh Aug 4, 2023
245dde1
change: make submit job use current namespace
carsonmh Aug 4, 2023
25104f2
change: make load_auth only happen on login command
carsonmh Aug 4, 2023
0c28afd
add: raycluster not found error handling
carsonmh Aug 7, 2023
f4aff67
make define params required and refactor job submit
carsonmh Aug 9, 2023
64a8e45
create: list_jobs and get_job functions
carsonmh Aug 3, 2023
2e4fdca
create: list jobs CLI command
carsonmh Aug 3, 2023
2c8f81a
create: job status command
carsonmh Aug 3, 2023
a611267
create: cancel job function
carsonmh Aug 8, 2023
979aaa4
create: jobs logs command
carsonmh Aug 8, 2023
1e81b8f
change: slightly change messages and namespace options for job status…
carsonmh Aug 8, 2023
216d255
add: error handling and refactor to main CLI
carsonmh Aug 8, 2023
4f652fd
test: change tests for job functions, refactor tests and add tests fo…
carsonmh Aug 8, 2023
8f72ab4
cleanup
carsonmh Aug 8, 2023
f256b45
make list command list all resources by default
carsonmh Aug 9, 2023
65925d3
fix unit tests
carsonmh Aug 9, 2023
df432c5
fix description of list jobs/rayclusters and change --no-ray flag
carsonmh Aug 9, 2023
83e8367
fix unit tests and list function
carsonmh Aug 10, 2023
c8206e5
add and implement option to not generate appwrapper in a Cluster
carsonmh Aug 8, 2023
4208ef8
Change create_app_wrapper description
carsonmh Aug 10, 2023
fb502f6
fix down function if no name available and changed up to create an ap…
carsonmh Aug 10, 2023
de3af4f
refactor and cleanup cli unit tests
carsonmh Aug 11, 2023
a03ec9c
fix CLI tests
carsonmh Aug 11, 2023
27c24a6
refactor unit tests
carsonmh Aug 11, 2023
1 change: 1 addition & 0 deletions README.md
@@ -33,6 +33,7 @@ We use pre-commit to make sure the code is consistently formatted. To make sure
- To run the unit tests, run `pytest -v tests/unit_test.py`
- Any new test functions/scripts can be added into the `tests` folder
- NOTE: Functional tests coming soon, will live in `tests/func_test.py`
- To test the CLI, run `codeflare` followed by any command. To see the list of commands, run `codeflare` with no arguments

#### Code Coverage

8 changes: 8 additions & 0 deletions pyproject.toml
@@ -29,6 +29,7 @@ codeflare-torchx = "0.6.0.dev0"
cryptography = "40.0.2"
executing = "1.2.0"
pydantic = "< 2"
click = "8.0.4"

[tool.poetry.group.docs]
optional = true
@@ -40,3 +41,10 @@ pdoc3 = "0.10.0"
pytest = "7.4.0"
coverage = "7.2.7"
pytest-mock = "3.11.1"

[tool.poetry.scripts]
codeflare = "codeflare_sdk.cli.codeflare_cli:cli"

[build-system]
requires = ["poetry_core>=1.0.0"]
build-backend = "poetry.core.masonry.api"
1 change: 1 addition & 0 deletions requirements.txt
@@ -6,3 +6,4 @@ codeflare-torchx==0.6.0.dev0
pydantic<2 # 2.0+ broke ray[default] see detail: https://github.com/ray-project/ray/pull/37000
cryptography==40.0.2
executing==1.2.0
click==8.0.4
4 changes: 4 additions & 0 deletions src/codeflare_sdk.egg-info/SOURCES.txt
@@ -19,3 +19,7 @@ src/codeflare_sdk/utils/generate_cert.py
src/codeflare_sdk/utils/generate_yaml.py
src/codeflare_sdk/utils/kube_api_helpers.py
src/codeflare_sdk/utils/pretty_print.py
src/codeflare_sdk/cli/__init__.py
src/codeflare_sdk/cli/codeflare_cli.py
src/codeflare_sdk/cli/commands/create.py
src/codeflare_sdk/cli/cli_utils.py
179 changes: 179 additions & 0 deletions src/codeflare_sdk/cli/CodeflareCLI_Design_Doc.md
@@ -0,0 +1,179 @@
# CodeFlare CLI Design


## Context and Scope


The primary purpose of the CLI is to serve as an interaction layer between a user and the CodeFlare stack (MCAD, InstaScale, KubeRay) from within the terminal. This addition is needed because a large share of our target users come from a high-performance computing background and are most comfortable submitting jobs to a cluster via a CLI.


The CLI will build on the existing CodeFlare SDK. It will offer operations similar to those the SDK provides (such as Ray cluster and job management), but from the terminal. On top of the existing SDK, the CLI adds some functions of its own, saves time, keeps workspaces simpler, and lets users automate certain processes via bash scripts.




## Goals


- Provide users the ability to request, monitor and stop the Kubernetes resources associated with the CodeFlare stack within the terminal.
- Serve as an interaction layer between the data scientist and CodeFlare stack (MCAD, InstaScale, KubeRay)
- Allow for a user-friendly workflow within the terminal
- Allow for automation and scripting of job/RayCluster management via bash scripts


## Non-Goals


- Re-implementing functionality that already exists in the CodeFlare SDK or in any of the SDK’s clients for Ray, MCAD, or any other service


## Architecture and Design


The CodeFlare CLI is an extension to the CodeFlare SDK package that allows a user to create, monitor, and shut down framework clusters (RayClusters for now) and distributed training jobs on an authenticated Kubernetes cluster from the terminal.


The user should have the ability to do the following from within the terminal:
- Create, view details, view status, submit, delete Ray Clusters via appwrappers
- Create, view logs, view status, submit, delete jobs
- List out all jobs
- List out all ray clusters
- Log in to a Kubernetes cluster
- Log out of a Kubernetes cluster


To support these operations, additional functions to the SDK may include:
- Formatted listing of Ray clusters
- Formatted listing of jobs
- Getting a job given the name


For the majority of functionality, the CLI will utilize the SDK’s already built functionality.


### CLI Framework:


[Click](https://click.palletsprojects.com/en/8.1.x/) is the chosen CLI framework for the following reasons:
- Simple syntax/layout: Since the CLI commands are very complex, it is important that the CLI framework doesn’t add any unnecessary complexity
- Supports functional commands instead of objects: This is important because the SDK is designed around functions, and a CLI with the same shape is easier to read
- Comes with testing and help generation: A built-in testing library and automatic help generation speed up development
- Large community support/documentation: Extensive documentation and a large community lead to fewer errors and easier development
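
As a rough illustration of the function-style commands, generated help, and built-in testing mentioned above, here is a minimal Click sketch. The group and command bodies are hypothetical stand-ins, not the PR's implementation; a real command would call into the SDK instead of echoing a message, and the `default` namespace value is an assumption.

```python
import click
from click.testing import CliRunner


@click.group()
def cli():
    """Hypothetical top-level `codeflare` entry point."""


@cli.group()
def delete():
    """Delete a specified resource."""


@delete.command()
@click.argument("name", type=str)
@click.option("--namespace", "-n", type=str, default="default")
def raycluster(name, namespace):
    """Pretend to delete a RayCluster (a real command would call the SDK)."""
    click.echo(f"Would delete RayCluster '{name}' in namespace '{namespace}'")


def run(args):
    """Invoke the CLI in-process with Click's test runner."""
    result = CliRunner().invoke(cli, args)
    return result.exit_code, result.output
```

Calling `run(["delete", "raycluster", "demo", "-n", "ns1"])` exercises the nested `{operation} {resource} {name}` layout without a terminal, and `run(["--help"])` returns the help text Click generates from the docstrings.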


### Framework Clusters:


When the user invokes the `define raycluster` command, a yaml file with default values is created and put in the user’s current working directory. Users can customize their clusters by adding parameters to the define command and these values will override the defaults when creating the AppWrapper yaml file.
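
The override behaviour can be pictured as a merge of defaults with user-supplied parameters. The sketch below is only an illustration: the parameter names, default values, and the `{name}.yaml` file naming are assumptions, not the SDK's actual defaults.

```python
# Hypothetical defaults; the real `define raycluster` defaults may differ.
DEFAULTS = {
    "name": "mycluster",
    "namespace": "default",
    "num_workers": 1,
    "gpu": 0,
}


def build_cluster_spec(**overrides):
    """Return the defaults with any user-supplied parameters applied on top."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


def spec_filename(spec):
    """Name of the file that would be written to the working directory (assumed)."""
    return f"{spec['name']}.yaml"
```

For example, `build_cluster_spec(name="demo", gpu=2)` keeps `num_workers` at its default while replacing `name` and `gpu`.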


Once the AppWrapper is defined, the user can create the Ray cluster with the `create raycluster` command, specifying the name of the cluster to submit. The CLI first checks whether the specified name is already present in the Kubernetes cluster. If it is not, the CLI searches the current working directory for a yaml file corresponding to the cluster name and applies it to the Kubernetes cluster. If the wait flag is specified, the CLI displays a loading indicator with status updates until the cluster is up.
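
The decision flow for `create raycluster` can be sketched without touching a real cluster. This is a simplified illustration under stated assumptions: the `{name}.yaml` naming convention and both helper functions are hypothetical, and the set of existing cluster names is passed in rather than fetched from Kubernetes.

```python
from pathlib import Path


def find_cluster_yaml(name, directory="."):
    """Return the yaml file for `name` in `directory`, or None if it is absent."""
    candidate = Path(directory) / f"{name}.yaml"
    return candidate if candidate.is_file() else None


def plan_create(name, existing_clusters, directory="."):
    """Decide what a create command should do for the given cluster name."""
    # Step 1: refuse to recreate a cluster that is already present.
    if name in existing_clusters:
        return f"skip: cluster '{name}' already exists"
    # Step 2: look for the matching definition in the working directory.
    yaml_path = find_cluster_yaml(name, directory)
    if yaml_path is None:
        return f"error: no {name}.yaml found in {directory}"
    # Step 3: the definition exists, so it can be applied to the cluster.
    return f"apply: {yaml_path.name}"
```

Returning a plan string instead of applying the yaml keeps the sketch side-effect free; a real command would apply the file and, with the wait flag, poll the cluster status.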


We will try to find a good balance between exposing more parameters and simplifying the process by acting on feedback from CLI users.


When the user invokes `delete raycluster`, the CLI will shut the cluster down and delete it.


### Training Jobs


When the user invokes the `define job` command, a DDPJobDefinition object will be created and saved into a file. Users can customize their jobs by passing parameters to the define command.


Once the job is defined, the user can submit the job via a `job submit` command. When the user submits a job, the user will specify the job name. The CLI will then check to see if the job is already on the Kubernetes cluster and if not it will submit the job. The job submitted will be a DDPJob and it will be submitted onto a specified ray cluster.


When the user wants to delete a job, they just invoke the job delete command, and the CLI will stop the job and delete it. This can happen at any time assuming there is a job running.


### Authentication


Users will need to be authenticated into a Kubernetes cluster in order to be able to perform all operations.


If the user tries to perform any operation without being logged in, the CLI will prompt them to authenticate. A valid kubeconfig must be present in the user’s environment in order to perform any operation.


The user will be able to log in using a simple `login` command with a server and token. The user can also choose whether or not they want TLS verification. If a kubeconfig already exists, the CLI will update it; otherwise it will create one for the user.


Alternatively, the user can invoke the login command with their kubeconfig file path, which logs the user in using that kubeconfig file.


Users can log out of their cluster using the `logout` command.
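
One way to picture the login and logout lifecycle is as saving, loading, and removing a small credentials file. The sketch below is an assumption-laden illustration: it stores JSON at a caller-supplied path, whereas the PR's `load_auth` reads a pickled object from `~/.codeflare/auth`, and the server URL and token shown are placeholders.

```python
import json
import os
import tempfile


def save_auth(path, server, token, skip_tls=False):
    """Persist login details so that later commands can reuse them."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump({"server": server, "token": token, "skip_tls": skip_tls}, f)


def load_auth(path):
    """Return saved login details, or None when the user is not logged in."""
    try:
        with open(path) as f:
            return json.load(f)
    except (OSError, ValueError):
        # Missing or unreadable file, or invalid JSON: treat as logged out.
        return None


def logout(path):
    """Remove saved login details; return True if something was removed."""
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        return False


def demo():
    """Walk through login, reuse, and logout inside a temporary directory."""
    with tempfile.TemporaryDirectory() as home:
        path = os.path.join(home, ".codeflare", "auth.json")
        before = load_auth(path)
        save_auth(path, "https://api.example.com:6443", "placeholder-token")
        during = load_auth(path)
        removed = logout(path)
        after = load_auth(path)
    return before, during["server"], removed, after
```

Returning `None` for a missing file mirrors the idea that commands fall back to prompting for authentication when no saved login exists.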




### Listing Info


Users can list both Ray cluster information and job information by invoking the respective commands. The CLI lists information for each RayCluster/job such as requested resources, status, name, and namespace.
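
A listing command boils down to collecting one record per resource and rendering the records as a table. The sketch below uses only the standard library; the column names match those in this PR's `print_jobs`, while the fixed-width rendering, the `find_job` helper, and the `LookupError` are illustrative assumptions (the PR itself renders with `rich`).

```python
HEADERS = ["Submission ID", "Job ID", "RayCluster", "Namespace", "Status"]


def format_jobs(jobs):
    """Render job records as a fixed-width text table, one row per job."""
    rows = [HEADERS] + [[str(job[h]) for h in HEADERS] for job in jobs]
    # Each column is as wide as its longest cell, header included.
    widths = [max(len(row[i]) for row in rows) for i in range(len(HEADERS))]
    lines = [
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in rows
    ]
    return "\n".join(lines)


def find_job(jobs, submission_id):
    """Return the record whose submission ID matches, or raise LookupError."""
    for job in jobs:
        if job["Submission ID"] == submission_id:
            return job
    raise LookupError(f"Job {submission_id} not found")
```

Keeping the records as plain dictionaries means the same data can feed both the table and lookups by submission ID.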


## Alternatives Considered


- Existing CodeFlare CLI
  - Written in TypeScript and overcomplicated. Did not support
- Just using SDK
  - Making a CLI saves a lot of time and is easier for the user in some cases
- Interactive CLI
  - Interactive CLIs make it harder for automation via bash scripts
- Other CLI libraries
  - **Cliff:** Ugly syntax, less readability, not much functionality.
  - **Argparse:** Less functionality out of the box. More time spent on unnecessary reimplementation.
  - **Cement:** Ugly syntax and low community support.


## Security Considerations


We will rely on Kubernetes’ default security, where users cannot perform any operations on a cluster unless they are authenticated correctly.


## Testing and Validation
The CLI is found within the SDK, so it will be [tested](https://github.com/project-codeflare/codeflare-sdk/blob/main/CodeFlareSDK_Design_Doc.md#testing-and-validation) the same way.


## Deployment and Rollout
- The CLI will be deployed within the CodeFlare SDK so similar [considerations](https://github.com/project-codeflare/codeflare-sdk/blob/main/CodeFlareSDK_Design_Doc.md#deployment-and-rollout) will be taken into account.


## Command Usage Examples
Create ray cluster
- `codeflare create raycluster [options]`


Doing something to a ray cluster:
- `codeflare {operation} raycluster {cluster_name} [options e.g. --gpu=0]`


Create job
- `codeflare create job [options]`


Doing something to a job:
- `codeflare {operation} job {job_name} [options e.g. --cluster-name="mycluster"]`
- Namespace and ray cluster name will be required as options


Listing out clusters
- `codeflare list raycluster -n {namespace}` OR `codeflare list raycluster --all`


Listing out jobs
- `codeflare list job -c {cluster_name} -n {namespace}`
- `codeflare list job -n {namespace}`
- `codeflare list job --all`


Log in to Kubernetes cluster
- `codeflare login [options e.g. --configpath={path/to/kubeconfig}]` (if configpath is left blank default value is used)


Log out of Kubernetes cluster
- `codeflare logout`
Empty file.
173 changes: 173 additions & 0 deletions src/codeflare_sdk/cli/cli_utils.py
@@ -0,0 +1,173 @@
import ast
import click
from kubernetes import client, config
import pickle
import os
from ray.job_submission import JobSubmissionClient
from torchx.runner import get_runner
from rich.table import Table
from rich import print

from codeflare_sdk.cluster.cluster import list_clusters_all_namespaces, get_cluster
from codeflare_sdk.cluster.model import RayCluster
from codeflare_sdk.cluster.auth import _create_api_client_config, config_check
from codeflare_sdk.utils.kube_api_helpers import _kube_api_error_handling
import codeflare_sdk.cluster.auth as sdk_auth


class PythonLiteralOption(click.Option):
    def type_cast_value(self, ctx, value):
        # Parse the option value as a Python literal (e.g. a dict or list)
        try:
            if not value:
                return None
            return ast.literal_eval(value)
        except Exception:
            raise click.BadParameter(value)


class AuthenticationConfig:
    """
    Authentication configuration that will be stored in a file once
    the user logs in using `codeflare login`
    """

    def __init__(
        self,
        token: str,
        server: str,
        skip_tls: bool,
        ca_cert_path: str,
    ):
        self.api_client_config = _create_api_client_config(
            token, server, skip_tls, ca_cert_path
        )
        self.server = server
        self.token = token

    def create_client(self):
        return client.ApiClient(self.api_client_config)


def load_auth():
    """
    Loads AuthenticationConfig and stores it in global variables
    which can be used by the SDK for authentication
    """
    try:
        auth_file_path = os.path.expanduser("~/.codeflare/auth")
        with open(auth_file_path, "rb") as file:
            auth = pickle.load(file)
            sdk_auth.api_client = auth.create_client()
            return auth
    except (IOError, EOFError):
        click.echo("No authentication found, trying default kubeconfig")
    except client.ApiException:
        click.echo("Invalid authentication, trying default kubeconfig")


class PluralAlias(click.Group):
    def get_command(self, ctx, cmd_name):
        rv = click.Group.get_command(self, ctx, cmd_name)
        if rv is not None:
            return rv
        for x in self.list_commands(ctx):
            if x + "s" == cmd_name:
                return click.Group.get_command(self, ctx, x)
        return None

    def resolve_command(self, ctx, args):
        # always return the full command name
        _, cmd, args = super().resolve_command(ctx, args)
        return cmd.name, cmd, args


def print_jobs(jobs):
    headers = ["Submission ID", "Job ID", "RayCluster", "Namespace", "Status"]
    table = Table(show_header=True)
    for header in headers:
        table.add_column(header)
    for job in jobs:
        table.add_row(*[job[header] for header in headers])
    print(table)


def list_all_kubernetes_jobs(print_to_console=True):
    k8s_jobs = []
    runner = get_runner()
    jobs = runner.list(scheduler="kubernetes_mcad")
    rayclusters = {
        raycluster.name for raycluster in list_clusters_all_namespaces(False)
    }
    for job in jobs:
        namespace, name = job.app_id.split(":")
        status = job.state
        if name not in rayclusters:
            k8s_jobs.append(
                {
                    "Submission ID": name,
                    "Job ID": "N/A",
                    "RayCluster": "N/A",
                    "Namespace": namespace,
                    "Status": str(status),
                    "App Handle": job.app_handle,
                }
            )
    if print_to_console:
        print_jobs(k8s_jobs)
    return k8s_jobs


def list_all_jobs(print_to_console=True):
    k8s_jobs = list_all_kubernetes_jobs(False)
    rc_jobs = list_all_raycluster_jobs(False)
    all_jobs = rc_jobs + k8s_jobs
    if print_to_console:
        print_jobs(all_jobs)
    return all_jobs


def list_raycluster_jobs(cluster: RayCluster, print_to_console=True):
    rc_jobs = []
    client = JobSubmissionClient(cluster.dashboard)
    jobs = client.list_jobs()
    for job in jobs:
        job_obj = {
            "Submission ID": job.submission_id,
            "Job ID": job.job_id,
            "RayCluster": cluster.name,
            "Namespace": cluster.namespace,
            "Status": str(job.status),
            "App Handle": "ray://torchx/" + cluster.dashboard + "-" + job.submission_id,
        }
        rc_jobs.append(job_obj)
    if print_to_console:
        print_jobs(rc_jobs)
    return rc_jobs


def list_all_raycluster_jobs(print_to_console=True):
    rc_jobs = []
    clusters = list_clusters_all_namespaces(False)
    for cluster in clusters:
        cluster.dashboard = "http://" + cluster.dashboard
        rc_jobs += list_raycluster_jobs(cluster, False)
    if print_to_console:
        print_jobs(rc_jobs)
    return rc_jobs


def get_job_app_handle(job_submission):
    job = get_job_object(job_submission)
    return job["App Handle"]


def get_job_object(job_submission):
    all_jobs = list_all_jobs(False)
    for job in all_jobs:
        if job["Submission ID"] == job_submission:
            return job
    raise FileNotFoundError(
        f"Job {job_submission} not found. Try using 'codeflare list --all' to see all jobs"
    )