Added Abstract Type for Model Server Client #5
base: main
Conversation
/assign @achandrasekar
@Bslabe123: GitHub didn't allow me to assign the following users: achandrasekar. Note that only kubernetes-sigs members with read permissions, repo collaborators, and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/assign @SergeyKanzhelev
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: Bslabe123
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@@ -0,0 +1,14 @@
# Model Server Clients
I would say we don't need a separate submodule for model server clients right away. We can start with the model server client being a separate file, and as we get more clients, we can split them out as needed. Otherwise, importing all the submodules requires additional work on the part of the caller.
I can get behind that, though keeping text-to-text and text-to-image servers separate may be less confusing in the long run, especially if benchmarking diffusion models is on our roadmap. The point of a dedicated text-to-text abstract class is to deduplicate the common functions like making requests, since that procedure is going to be the same regardless of model server (create request, send request, parse relevant info from response). See the example in the other comment, and the rough sketch below.
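For illustration only, here is a minimal sketch of the split being discussed. Text_To_Text_Model_Server_Client appears later in this thread; the base class and the text-to-image class names are assumptions, not code from the PR:

from abc import ABC, abstractmethod
from typing import Any


class Model_Server_Client(ABC):
    """Common base for all model server clients (hypothetical name)."""


class Text_To_Text_Model_Server_Client(Model_Server_Client):
    """Shared request flow for text-to-text servers: build, send, parse."""

    @abstractmethod
    def build_request(self, prompt: str, settings: Any) -> Any:
        ...

    @abstractmethod
    def parse_response(self, response: Any, settings: Any) -> Any:
        ...


class Text_To_Image_Model_Server_Client(Model_Server_Client):
    """Hypothetical placeholder for diffusion/image servers, kept separate as suggested above."""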
pass

@abstractmethod
def request(self, *args: Any, **kwargs: Any) -> Any: |
I would like to see a concrete implementation alongside this for a model server. vLLM is a good starting point so we can see how this looks in practice and if we need to change / add to this interface.
This is in progress. Agreed; the idea is to do something like this for the text-to-text class:
# (assumes `import time` and `import aiohttp` at module level, plus the
#  Request / Response / Text_To_Text_Request_Settings types defined elsewhere)
async def request(
    self, api_url: str, prompt: str, settings: Text_To_Text_Request_Settings
) -> Response | Exception:
    request: Request = self.build_request(prompt, settings)
    ttft: float = 0.0
    start_time: float = time.perf_counter()
    timeout = aiohttp.ClientTimeout(total=10000)
    async with aiohttp.ClientSession(timeout=timeout, trust_env=True) as session:
        try:
            async with session.post(api_url, **request, ssl=False) as response:
                if settings["streaming"]:
                    # Consume the streamed chunks, recording time to first token.
                    async for chunk_bytes in response.content.iter_chunks():
                        chunk_bytes = chunk_bytes[0].strip()
                        if not chunk_bytes:
                            continue
                        timestamp = time.perf_counter()
                        if ttft == 0.0:
                            ttft = timestamp - start_time
                    standardized_response = self.parse_response(response, settings)
                    standardized_response["time_to_first_token"] = ttft
                    return standardized_response
                else:
                    return self.parse_response(response, settings)
        except Exception as e:
            self.Errors.record_error(e)
            return e
@abstractmethod
def build_request(
    self, prompt: str, settings: Text_To_Text_Request_Settings
) -> Request:
    """
    Build the request headers and body; these depend on the specific model server.
    """
    pass

@abstractmethod
def parse_response(
    self, response: aiohttp.ClientResponse, settings: Text_To_Text_Request_Settings
) -> Response:
    """
    Map a raw response to the standardized Response fields, since model server
    responses are not standardized.
    """
    pass
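The Request, Response, and Text_To_Text_Request_Settings types are not shown in this comment. A minimal sketch of what they might look like, inferred only from how the fields are used above and below (the exact names and fields are assumptions, not part of the PR):

from typing import Any, Dict, Optional, TypedDict


class Text_To_Text_Request_Settings(TypedDict):
    # Fields referenced by the clients below; the real definition may differ.
    streaming: bool
    use_beam_search: bool
    output_len: int


class Request(TypedDict, total=False):
    # Keyword arguments forwarded to aiohttp's session.post().
    headers: Dict[str, str]
    json: Dict[str, Any]


class Response(TypedDict):
    # Standardized metrics returned by parse_response().
    num_output_tokens: int
    request_duration: float
    time_to_first_token: Optional[float]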
That way, this is all we would need for a vLLM client:
class vLLM_Client(Text_To_Text_Model_Server_Client):
    def build_request(
        self, prompt: str, settings: Text_To_Text_Request_Settings
    ) -> Request:
        # Keyword arguments passed straight to session.post() in request();
        # the streaming decision itself is read from settings there.
        return {
            "headers": {"User-Agent": "Test Client"},
            "json": {
                "prompt": prompt,
                "use_beam_search": settings["use_beam_search"],
                "temperature": 0.0,
                "max_tokens": settings["output_len"],
                "stream": settings["streaming"],
            },
        }

    def parse_response(
        self, response: aiohttp.ClientResponse, settings: Text_To_Text_Request_Settings
    ) -> Response:
        # Placeholder: `res` would come from the parsed response body's "choices" list.
        res: List[Any] = []
        output_token_ids = self.tokenizer(res[0]["text"]).input_ids
        return {
            "num_output_tokens": len(output_token_ids),
            "request_duration": 0.0,
            "time_to_first_token": None,
        }
Similar for a JetStream client:
class Jetstream_Client(Text_To_Text_Model_Server_Client):
    def build_request(
        self, prompt: str, settings: Text_To_Text_Request_Settings
    ) -> Request:
        return {
            "json": {
                "prompt": prompt,
                "max_tokens": settings["output_len"],
            }
        }

    def parse_response(
        self, response: aiohttp.ClientResponse, settings: Text_To_Text_Request_Settings
    ) -> Response:
        # Placeholder: `res` would come from the parsed response body's "response" field.
        res: List[Any] = []
        output_token_ids = self.tokenizer(res).input_ids
        return {
            "num_output_tokens": len(output_token_ids),
            "request_duration": 0.0,
            "time_to_first_token": None,
        }
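For completeness, a hypothetical driver showing how either client would be called under this interface, building on the sketches above. The endpoint URL, settings values, and tokenizer wiring are assumptions for illustration only:

import asyncio

from transformers import AutoTokenizer  # assumption: tokenizer comes from HF transformers


async def main() -> None:
    settings: Text_To_Text_Request_Settings = {
        "streaming": False,
        "use_beam_search": False,
        "output_len": 128,
    }
    client = vLLM_Client()
    client.tokenizer = AutoTokenizer.from_pretrained("gpt2")  # hypothetical wiring
    # Hypothetical endpoint; request() returns either a standardized Response or the Exception it caught.
    result = await client.request("http://localhost:8000/generate", "Hello, world", settings)
    print(result)


if __name__ == "__main__":
    asyncio.run(main())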