Added Abstract Type for Model Server Client #5
base: main
Conversation
/assign @achandrasekar
@Bslabe123: GitHub didn't allow me to assign the following users: achandrasekar. Note that only kubernetes-sigs members with read permissions, repo collaborators, and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/assign @SergeyKanzhelev
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: Bslabe123
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@@ -0,0 +1,14 @@
# Model Server Clients
I would say we don't need a separate submodule for model server clients right away. We can start with the model server client being a separate file, and as we get more clients, we can split them out as needed. Otherwise, importing all the submodules requires additional work on the part of the caller.
I can get behind that, though keeping text-to-text and text-to-image servers separate may be less confusing in the long run, especially if benchmarking diffusion models is on our roadmap. The point of a dedicated text-to-text abstract class is to deduplicate the common functions like making requests, since that procedure is going to be the same regardless of model server (create request, send request, parse relevant info from response). See the example in the other comment, and the rough sketch below.
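For illustration only, here is a minimal sketch of the split being discussed. Text_To_Text_Model_Server_Client appears later in this thread; the base class and the text-to-image class names are assumptions, not code from the PR:

from abc import ABC, abstractmethod
from typing import Any


class Model_Server_Client(ABC):
    """Common base for all model server clients (hypothetical name)."""


class Text_To_Text_Model_Server_Client(Model_Server_Client):
    """Shared request flow for text-to-text servers: build, send, parse."""

    @abstractmethod
    def build_request(self, prompt: str, settings: Any) -> Any:
        ...

    @abstractmethod
    def parse_response(self, response: Any, settings: Any) -> Any:
        ...


class Text_To_Image_Model_Server_Client(Model_Server_Client):
    """Hypothetical placeholder for diffusion/image servers, kept separate as suggested above."""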
pass

@abstractmethod
def request(self, *args: Any, **kwargs: Any) -> Any: |
I would like to see a concrete implementation alongside this for a model server. vLLM is a good starting point so we can see how this looks in practice and if we need to change / add to this interface.
This is in progress. Agreed; the idea is to do something like this for the text-to-text class:
# (assumes `import time` and `import aiohttp` at module level, plus the
#  Request / Response / Text_To_Text_Request_Settings types defined elsewhere)
async def request(
    self, api_url: str, prompt: str, settings: Text_To_Text_Request_Settings
) -> Response | Exception:
    request: Request = self.build_request(prompt, settings)
    ttft: float = 0.0
    start_time: float = time.perf_counter()
    timeout = aiohttp.ClientTimeout(total=10000)
    async with aiohttp.ClientSession(timeout=timeout, trust_env=True) as session:
        try:
            async with session.post(api_url, **request, ssl=False) as response:
                if settings["streaming"]:
                    # Consume the streamed chunks, recording time to first token.
                    async for chunk_bytes in response.content.iter_chunks():
                        chunk_bytes = chunk_bytes[0].strip()
                        if not chunk_bytes:
                            continue
                        timestamp = time.perf_counter()
                        if ttft == 0.0:
                            ttft = timestamp - start_time
                    standardized_response = self.parse_response(response, settings)
                    standardized_response["time_to_first_token"] = ttft
                    return standardized_response
                else:
                    return self.parse_response(response, settings)
        except Exception as e:
            self.Errors.record_error(e)
            return e
@abstractmethod
def build_request(
    self, prompt: str, settings: Text_To_Text_Request_Settings
) -> Request:
    """
    Build the request headers and body; these depend on the specific model server.
    """
    pass

@abstractmethod
def parse_response(
    self, response: aiohttp.ClientResponse, settings: Text_To_Text_Request_Settings
) -> Response:
    """
    Map a raw response to the standardized Response fields, since model server
    responses are not standardized.
    """
    pass
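The Request, Response, and Text_To_Text_Request_Settings types are not shown in this comment. A minimal sketch of what they might look like, inferred only from how the fields are used above and below (the exact names and fields are assumptions, not part of the PR):

from typing import Any, Dict, Optional, TypedDict


class Text_To_Text_Request_Settings(TypedDict):
    # Fields referenced by the clients below; the real definition may differ.
    streaming: bool
    use_beam_search: bool
    output_len: int


class Request(TypedDict, total=False):
    # Keyword arguments forwarded to aiohttp's session.post().
    headers: Dict[str, str]
    json: Dict[str, Any]


class Response(TypedDict):
    # Standardized metrics returned by parse_response().
    num_output_tokens: int
    request_duration: float
    time_to_first_token: Optional[float]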
That way, this is all we would need for a vLLM client:
class vLLM_Client(Text_To_Text_Model_Server_Client):
    def build_request(
        self, prompt: str, settings: Text_To_Text_Request_Settings
    ) -> Request:
        # Keyword arguments passed straight to session.post() in request();
        # the streaming decision itself is read from settings there.
        return {
            "headers": {"User-Agent": "Test Client"},
            "json": {
                "prompt": prompt,
                "use_beam_search": settings["use_beam_search"],
                "temperature": 0.0,
                "max_tokens": settings["output_len"],
                "stream": settings["streaming"],
            },
        }

    def parse_response(
        self, response: aiohttp.ClientResponse, settings: Text_To_Text_Request_Settings
    ) -> Response:
        # Placeholder: `res` would come from the parsed response body's "choices" list.
        res: List[Any] = []
        output_token_ids = self.tokenizer(res[0]["text"]).input_ids
        return {
            "num_output_tokens": len(output_token_ids),
            "request_duration": 0.0,
            "time_to_first_token": None,
        }
Similar for a JetStream client:
class Jetstream_Client(Text_To_Text_Model_Server_Client):
    def build_request(
        self, prompt: str, settings: Text_To_Text_Request_Settings
    ) -> Request:
        return {
            "json": {
                "prompt": prompt,
                "max_tokens": settings["output_len"],
            }
        }

    def parse_response(
        self, response: aiohttp.ClientResponse, settings: Text_To_Text_Request_Settings
    ) -> Response:
        # Placeholder: `res` would come from the parsed response body's "response" field.
        res: List[Any] = []
        output_token_ids = self.tokenizer(res).input_ids
        return {
            "num_output_tokens": len(output_token_ids),
            "request_duration": 0.0,
            "time_to_first_token": None,
        }
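For completeness, a hypothetical driver showing how either client would be called under this interface, building on the sketches above. The endpoint URL, settings values, and tokenizer wiring are assumptions for illustration only:

import asyncio

from transformers import AutoTokenizer  # assumption: tokenizer comes from HF transformers


async def main() -> None:
    settings: Text_To_Text_Request_Settings = {
        "streaming": False,
        "use_beam_search": False,
        "output_len": 128,
    }
    client = vLLM_Client()
    client.tokenizer = AutoTokenizer.from_pretrained("gpt2")  # hypothetical wiring
    # Hypothetical endpoint; request() returns either a standardized Response or the Exception it caught.
    result = await client.request("http://localhost:8000/generate", "Hello, world", settings)
    print(result)


if __name__ == "__main__":
    asyncio.run(main())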