
Added Abstract Type for Model Server Client #5

Open

wants to merge 11 commits into main

Conversation

Bslabe123
Contributor

No description provided.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 23, 2025
@Bslabe123
Contributor Author

/assign @achandrasekar

@k8s-ci-robot
Contributor

@Bslabe123: GitHub didn't allow me to assign the following users: achandrasekar.

Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @achandrasekar

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@Bslabe123
Contributor Author

/assign @SergeyKanzhelev

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Bslabe123
Once this PR has been reviewed and has the lgtm label, please ask for approval from sergeykanzhelev. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@@ -0,0 +1,14 @@
# Model Server Clients
Contributor

I would say we don't need a separate submodule for model server clients right away. We can start with the model server client being a separate file. As we get more clients, we can split them as needed. Otherwise, importing all the submodules requires additional work on the part of the caller.
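To make the import ergonomics concrete (the module and class names below are hypothetical, just for illustration):

# With a single model_server_clients.py, the caller needs one import:
from model_server_clients import vLLM_Client, Jetstream_Client

# With per-server submodules, the caller has to know and import each module:
from model_server_clients.vllm import vLLM_Client
from model_server_clients.jetstream import Jetstream_Client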

Contributor Author

I can get behind that. Keeping text-to-text and text-to-image servers separate may be less confusing in the long run, though, especially if benchmarking diffusion models is on our roadmap. The point of a dedicated text-to-text abstract class is to deduplicate the common functions like making requests, since that procedure is the same regardless of model server (create request, send request, parse relevant info from the response). See the example in the other comment.
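As a rough sketch of what that hierarchy could look like (the base class name and the abc usage here are assumptions, not taken from the PR diff):

from abc import ABC, abstractmethod
from typing import Any


class Model_Server_Client(ABC):
    """Common base type for all model server clients (assumed name)."""

    @abstractmethod
    def request(self, *args: Any, **kwargs: Any) -> Any:
        pass


class Text_To_Text_Model_Server_Client(Model_Server_Client):
    """Owns the shared request flow; concrete clients only fill in the hooks below."""

    @abstractmethod
    def build_request(self, prompt: str, settings: Any) -> Any:
        pass

    @abstractmethod
    def parse_response(self, response: Any, settings: Any) -> Any:
        pass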

pass

@abstractmethod
def request(self, *args: Any, **kwargs: Any) -> Any:
Contributor

I would like to see a concrete implementation alongside this for a model server. vLLM is a good starting point so we can see how this looks in practice and if we need to change / add to this interface.

Contributor Author

This is in progress. Agreed, the idea is to do something like this for the text-to-text class:

    async def request(
        self, api_url: str, prompt: str, settings: Text_To_Text_Request_Settings
    ) -> Response | Exception:
        request: Request = self.build_request(prompt, settings)
        ttft: float = 0.0
        start_time: float = time.perf_counter()
        output: str = ""
        timeout = aiohttp.ClientTimeout(total=10000)
        async with aiohttp.ClientSession(timeout=timeout, trust_env=True) as session:
            try:
                async with session.post(api_url, **request, ssl=False) as response:
                    if settings["streaming"]:
                        async for chunk_bytes in response.content.iter_chunks():
                            chunk_bytes = chunk_bytes[0].strip()
                            if not chunk_bytes:
                                continue
                            timestamp = time.perf_counter()

                            if ttft == 0.0:
                                ttft = timestamp - start_time
                        standardized_response = self.parse_response(response, settings)
                        standardized_response["time_to_first_token"] = ttft
                        return standardized_response
                    else:
                        return self.parse_response(response, settings)
            except Exception as e:
                self.Errors.record_error(e)
                return e
                
    @abstractmethod
    def build_request(
        self, prompt: str, settings: Text_To_Text_Request_Settings
    ) -> Request:
        """
        Request headers and bodies depend on the specific model server
        """
        pass

    @abstractmethod
    def parse_response(
        self, response: requests.Response, settings: Text_To_Text_Request_Settings
    ) -> Response:
        """
        Since model server responses are not standardized
        """
        pass
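For reference, the Text_To_Text_Request_Settings, Request, and Response types used above are not pinned down in this PR; a minimal sketch of what they could look like as TypedDicts (field names inferred from the snippets here, everything else an assumption):

from typing import Optional, TypedDict


class Text_To_Text_Request_Settings(TypedDict):
    streaming: bool
    use_beam_search: bool
    output_len: int


class Request(TypedDict, total=False):
    headers: dict
    json: dict


class Response(TypedDict):
    num_output_tokens: int
    request_duration: float
    time_to_first_token: Optional[float]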

That way, this is all we would need for a vLLM client:

class vLLM_Client(Text_To_Text_Model_Server_Client):

    def build_request(
        self, prompt: str, settings: Text_To_Text_Request_Settings
    ) -> Request:
        return {
            "headers": {"User-Agent": "Test Client"},
            "json": {
                "prompt": prompt,
                "use_beam_search": settings["use_beam_search"],
                "temperature": 0.0,
                "max_tokens": settings["output_len"],
                "stream": settings["streaming"],
            },
        }

    def parse_response(
        self, response: requests.Response, settings: Text_To_Text_Request_Settings
    ) -> Response:
        res = response.json()["choices"]
        output_token_ids = self.tokenizer(res[0]["text"]).input_ids
        return {
            "num_output_tokens": len(output_token_ids),
            "request_duration": 0.0,
            "time_to_first_token": None,
        }

Similar for a Jetstream client:

class Jetstream_Client(Text_To_Text_Model_Server_Client):

    def build_request(
        self, prompt: str, settings: Text_To_Text_Request_Settings
    ) -> Request:
        return {
            "json": {
                "prompt": prompt,
                "max_tokens": settings["output_len"],
            }
        }

    def parse_response(
        self, response: requests.Response, settings: Text_To_Text_Request_Settings
    ) -> Response:
        res = response.json()["response"]
        output_token_ids = self.tokenizer(res).input_ids

        return {
            "num_output_tokens": len(output_token_ids),
            "request_duration": 0.0,
            "time_to_first_token": None,
        }
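With that interface, the benchmark driver can treat every client the same way; a rough sketch of the call site, assuming the classes above (the endpoint URL and settings values are made up for illustration):

import asyncio


async def run_one(client: Text_To_Text_Model_Server_Client) -> None:
    # Hypothetical settings and endpoint, just to show the uniform call path.
    settings: Text_To_Text_Request_Settings = {
        "streaming": False,
        "use_beam_search": False,
        "output_len": 128,
    }
    result = await client.request("http://localhost:8000/generate", "test prompt", settings)
    print(result)


# Only the choice of concrete client changes:
# asyncio.run(run_one(vLLM_Client()))
# asyncio.run(run_one(Jetstream_Client()))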
