
Benchmarking script v2 #866

Open · wants to merge 28 commits into main

Conversation

@Bslabe123 (Collaborator) commented Oct 29, 2024

Major script refactoring changes:

  • Individual functions are no longer responsible for backend specifics; these are now abstracted into the Backend class.
  • Added a ttft Prometheus metric.
  • The script now writes text files directly, so there is no longer a need to pipe its logged output to a file via bash.
  • Request rate can now be a function of t; this applies to both the --job flag discussed below and the --request-rate flag (see the sketch after this list). A few examples of newly valid inputs:
    • 10+(t/60), a request rate increasing linearly by 1 rps/min
    • min(40,20+20(floor(t/300))), start at 20, step to 40 after 300 s, and do not step up further
  • --job flag for multi-step benchmarking:
    • May be raw JSON or a path to a file; the JSON must take the following shape, and at least one of "time" or "max_num_prompts" is required:
    {
      "time_between_stages": float,
      "stages": [{
        "rate": float | string,
        "time": float,
        "max_num_prompts": int
      },
      ...]
    }
    
    • If --job has more than one stage, the "metrics" field will not be rendered in the JSON output
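
One way to picture the new rate handling: a minimal sketch (not the PR's actual implementation; the function name and the eval-based approach here are assumptions) of evaluating a rate expression at a given elapsed time t:

import math

def request_rate_at(expression: str, t: float) -> float:
    # Evaluate a rate expression such as "10+(t/60)" or
    # "min(40,20+20*floor(t/300))" at elapsed time t (in seconds).
    # Only a small whitelist of names is exposed to eval.
    allowed = {"t": t, "min": min, "max": max, "floor": math.floor}
    return float(eval(expression, {"__builtins__": {}}, allowed))

print(request_rate_at("10+(t/60)", 120))                   # 12.0: grows by 1 rps per minute
print(request_rate_at("min(40,20+20*floor(t/300))", 120))  # 20.0: steps to 40 only after 300 s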

Breaking changes:

  • --request-rate no longer has a default value
  • If benchmarking multiple models, the "dimensions" field will not be rendered in the JSON output

Given the size of this PR, the following will be added as follow-ups:

  • Separating this into separate files and making the relevant Dockerfile changes
  • Terraform changes to use the --job flag for multi-step benchmarking rather than multiple script runs
  • Proper tests

Tests:

TEST 1: Run with these flags to ensure no breaking changes to the output JSON:

--scrape-server-metrics --save-json-results --host=llama3-8b-vllm-service --port=8000 --model=meta-llama/Meta-Llama-3-8B --dataset=ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer=meta-llama/Meta-Llama-3-8B --backend=vllm --max-input-length=256 --max-output-length=256 --stream-request --num-prompts=300 --request-rate=10

JSON output diffed against the accepted output; output text file:

====Result for Model: meta-llama/Meta-Llama-3-8B====
Errors: {'ClientConnectorErrors': 0, 'TimeoutErrors': 0, 'ContentTypeErrors': 0, 'ClientOSErrors': 0, 'ServerDisconnectedErrors': 0, 'unknown_errors': 0}
Total time: 47.04 s
Successful/total requests: 300/300
Requests/min: 382.64
Output_tokens/min: 86893.21
Input_tokens/min: 10713.86
Tokens/min: 97607.07
Average seconds/token (includes waiting time on server): 0.07
Average milliseconds/request (includes waiting time on server): 19159.64
Average milliseconds/output_token (includes waiting time on server): 84.71
Average length of prompt: 28.00
Average length of response: 227.09
Average throughput in requests per second: 6.38
Average Time to First Token (s): 0.12

TEST 2: Run with these flags to duplicate the above test using the --job flag:

python3 b.py --scrape-server-metrics --save-json-results --host=llama3-8b-vllm-service --port=8000 --model=meta-llama/Meta-Llama-3-8B --dataset=ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer=meta-llama/Meta-Llama-3-8B --backend=vllm --max-input-length=256 --max-output-length=256 --stream-request --job=r.json

where r.json is

{
 "time_between_stages": 10,
 "stages": [{
   "rate": 10,
   "time": 30
 }]
}

Output text file:

====Result for Model: meta-llama/Meta-Llama-3-8B====
Errors: {'ClientConnectorErrors': 0, 'TimeoutErrors': 0, 'ContentTypeErrors': 0, 'ClientOSErrors': 0, 'ServerDisconnectedErrors': 0, 'unknown_errors': 0}
Total time: 48.44 s
Successful/total requests: 298/298
Requests/min: 369.13
Output_tokens/min: 87735.67
Input_tokens/min: 5906.02
Tokens/min: 93641.69
Average seconds/token (includes waiting time on server): 0.08
Average milliseconds/request (includes waiting time on server): 19865.20
Average milliseconds/output_token (includes waiting time on server): 83.60
Average length of prompt: 16.00
Average length of response: 237.68
Average throughput in requests per second: 6.15
Average Time to First Token (s): 0.12

TEST 3: Multi-step benchmarking using the --job flag:
Using the same flags as above with the following r.json:

{
 "time_between_stages": 10,
 "stages": [{
   "rate": 10,
   "time": 30
 },{
   "rate": 30,
   "time": 90
 }]
}

Resulting logs:

Starting Prometheus Server on port 9090
Starting benchmarking at 10 requests/sec for 30 sec
All requests sent, awaiting responses...
Finished benchmarking stage 1
Sleeping for 10 sec...
Starting benchmarking at 30 requests/sec for 90 sec
All requests sent, awaiting responses...
Finished benchmarking stage 2
Completed all stages, generating reports...
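
For reference, here is a minimal sketch of the stage sequencing shown in these logs, assuming a hypothetical run_stage() helper and the job shape described earlier (an illustration only, not the PR's actual code):

import json
import os
import time

def load_job(job_arg: str) -> dict:
    # --job may be raw JSON or a path to a JSON file.
    if os.path.isfile(job_arg):
        with open(job_arg) as f:
            return json.load(f)
    return json.loads(job_arg)

def run_job(job_arg: str) -> None:
    job = load_job(job_arg)
    stages = job["stages"]
    for i, stage in enumerate(stages, start=1):
        print(f"Starting benchmarking at {stage['rate']} requests/sec for {stage['time']} sec")
        run_stage(stage)  # hypothetical: sends requests at stage["rate"], awaits responses
        print(f"Finished benchmarking stage {i}")
        if i < len(stages):
            print(f"Sleeping for {job['time_between_stages']} sec...")
            time.sleep(job["time_between_stages"])
    print("Completed all stages, generating reports...")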

The resulting output is the following two text files and JSON:

  • latency-profile-2024-11-13_23-34-22.txt
====Result for Model: meta-llama/Meta-Llama-3-8B====
Errors: {'ClientConnectorErrors': 0, 'TimeoutErrors': 0, 'ContentTypeErrors': 0, 'ClientOSErrors': 0, 'ServerDisconnectedErrors': 0, 'unknown_errors': 0}
Total time: 110.88 s
Successful/total requests: 2493/2493
Requests/min: 1349.00
Output_tokens/min: 167409.52
Input_tokens/min: 66101.16
Tokens/min: 233510.68
Average seconds/token (includes waiting time on server): 0.08
Average milliseconds/request (includes waiting time on server): 18689.54
Average milliseconds/output_token (includes waiting time on server): 155.00
Average length of prompt: 49.00
Average length of response: 124.10
Average throughput in requests per second: 22.48
Average Time to First Token (s): 0.16
  • latency-profile-2024-11-13_23-33-22.txt
====Result for Model: meta-llama/Meta-Llama-3-8B====
Errors: {'ClientConnectorErrors': 0, 'TimeoutErrors': 0, 'ContentTypeErrors': 0, 'ClientOSErrors': 0, 'ServerDisconnectedErrors': 0, 'unknown_errors': 0}
Total time: 45.41 s
Successful/total requests: 303/303
Requests/min: 400.37
Output_tokens/min: 74140.70
Input_tokens/min: 7206.62
Tokens/min: 81347.32
Average seconds/token (includes waiting time on server): 0.07
Average milliseconds/request (includes waiting time on server): 15297.39
Average milliseconds/output_token (includes waiting time on server): 83.12
Average length of prompt: 18.00
Average length of response: 185.18
Average throughput in requests per second: 6.67
Average Time to First Token (s): 0.12
{
  "config": {
    "model": "meta-llama/Meta-Llama-3-8B",
    "model_server": "vllm",
    "start_time": {
      "seconds": 1731540802,
      "nanos": 327488103
    },
    "num_models": 1
  },
  "summary_stats": {
    "stats": [
      {
        "request_rate": 10,
        "per_token_latency": {
          "mean": 0.07385869081540675,
          "median": 0.07489906784391752,
          "sd": 0.007699994295419569,
          "min": 0.028674545288085936,
          "max": 0.08453639314956024,
          "p90": 0.08223036577721604,
          "p99": 0.08393648173226355
        },
        "request_latency": {
          "mean": 15297.391085734855,
          "median": 17421.419858932495,
          "sd": 5070.809268756617,
          "min": 716.8636322021484,
          "max": 20119.661569595337,
          "p90": 19553.660917282104,
          "p99": 19976.8541097641
        },
        "tpot": {
          "mean": 83.1183274983278,
          "median": 83.7416724725203,
          "sd": 6.300375318379742,
          "min": 68.69693886150013,
          "max": 103.30891141704484,
          "p90": 89.72805160889651,
          "p99": 98.9513516059289
        },
        "input_length": {
          "mean": 18,
          "median": 18,
          "sd": 0,
          "min": 18,
          "max": 18,
          "p90": 18,
          "p99": 18
        },
        "output_length": {
          "mean": 185.18151815181517,
          "median": 220,
          "sd": 61.00622206274113,
          "min": 7,
          "max": 223,
          "p90": 220,
          "p99": 221
        },
        "throughput": {
          "mean": 6.6727949503910775
        },
        "ttft": {
          "mean": 0.11786133185987246,
          "median": 0.11588965501869097,
          "sd": 0.024016406241249174,
          "min": 0.07840587495593354,
          "max": 0.17940028198063374,
          "p90": 0.15001811598194764,
          "p99": 0.1739412961388007
        },
        "model_server_metrics": [
          {
            "name": "vllm:gpu_cache_usage_perc",
            "description": "Metrics for vllm:gpu_cache_usage_perc from vllm backend",
            "mean": 0.14264919941775833,
            "median": 0.0873362445414847,
            "sd": 0.10155519884079389,
            "min": 0.055520898315658096,
            "max": 0.2850904553961322,
            "p90": 0.2455396132252027,
            "p99": 0.28113537117903925
          },
          {
            "name": "vllm:num_requests_waiting",
            "description": "Metrics for vllm:num_requests_waiting from vllm backend",
            "mean": 0,
            "median": 0,
            "sd": 0,
            "min": 0,
            "max": 0,
            "p90": 0,
            "p99": 0
          }
        ]
      },
      {
        "request_rate": 30,
        "per_token_latency": {
          "mean": 0.08403210467408449,
          "median": 0.09693534481484173,
          "sd": 0.048431116032316375,
          "min": 0.0016284847259521484,
          "max": 0.14853053191953877,
          "p90": 0.13983914381904997,
          "p99": 0.1466936592907229
        },
        "request_latency": {
          "mean": 18689.539973055842,
          "median": 17226.099729537964,
          "sd": 15242.002332285088,
          "min": 81.42423629760742,
          "max": 42925.323724746704,
          "p90": 40208.89763832092,
          "p99": 42358.3074092865
        },
        "tpot": {
          "mean": 155.00156706954638,
          "median": 155.06598254044852,
          "sd": 43.786653576254,
          "min": 81.42423629760742,
          "max": 987.2474670410156,
          "p90": 180.54201262957847,
          "p99": 233.23154040745325
        },
        "input_length": {
          "mean": 49,
          "median": 49,
          "sd": 0,
          "min": 49,
          "max": 49,
          "p90": 49,
          "p99": 49
        },
        "output_length": {
          "mean": 124.09867629362215,
          "median": 112,
          "sd": 99.93797459824643,
          "min": 1,
          "max": 245,
          "p90": 240,
          "p99": 241
        },
        "throughput": {
          "mean": 22.483387890744382
        },
        "ttft": {
          "mean": 0.15626823447809177,
          "median": 0.14473288401495665,
          "sd": 0.07395838346933907,
          "min": 0.08015697100199759,
          "max": 0.9864042410044931,
          "p90": 0.19817143981344998,
          "p99": 0.5622111370484344
        },
        "model_server_metrics": [
          {
            "name": "vllm:gpu_cache_usage_perc",
            "description": "Metrics for vllm:gpu_cache_usage_perc from vllm backend",
            "mean": 0.674538811157651,
            "median": 0.7953836556456644,
            "sd": 0.2541336080782851,
            "min": 0.22582657517155336,
            "max": 0.9357454772301934,
            "p90": 0.9106674984404243,
            "p99": 0.9332376793512165
          },
          {
            "name": "vllm:num_requests_waiting",
            "description": "Metrics for vllm:num_requests_waiting from vllm backend",
            "mean": 0.14285714285714285,
            "median": 0,
            "sd": 0.3499271061118826,
            "min": 0,
            "max": 1,
            "p90": 0.40000000000000036,
            "p99": 0.9399999999999995
          }
        ]
      }
    ]
  },
  "dimensions": {
    "date": "20241113-233250",
    "backend": "vllm",
    "model_id": "meta-llama/Meta-Llama-3-8B",
    "tokenizer_id": "meta-llama/Meta-Llama-3-8B"
  },
  "metrics": null
}

TEST 4:
Demonstrates f(t) request rates using the following r.json with the same flags as above; this particular rate should send ~360 requests:

{
 "time_between_stages": 10,
 "stages": [{
   "rate": "1+(t/30)",
   "time": 120
 }]
}
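
As a sanity check on the ~360 figure: integrating the rate over the stage gives ∫₀¹²⁰ (1 + t/30) dt = 120 + 120²/(2·30) = 120 + 240 = 360 expected requests, which the 351 actually sent comes reasonably close to.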

Resulting text file:

====Result for Model: meta-llama/Meta-Llama-3-8B====
Errors: {'ClientConnectorErrors': 0, 'TimeoutErrors': 0, 'ContentTypeErrors': 0, 'ClientOSErrors': 0, 'ServerDisconnectedErrors': 0, 'unknown_errors': 0}
Total time: 127.89 s
Successful/total requests: 351/351
Requests/min: 164.67
Output_tokens/min: 19362.55
Input_tokens/min: 2799.46
Tokens/min: 22162.01
Average seconds/token (includes waiting time on server): 0.06
Average milliseconds/request (includes waiting time on server): 7943.62
Average milliseconds/output_token (includes waiting time on server): 67.82
Average length of prompt: 17.00
Average length of response: 117.58
Average throughput in requests per second: 2.74
Average Time to First Token (s): 0.11

@Bslabe123 Bslabe123 changed the title Request rate as Function of t Benchmarking script v2 Nov 13, 2024
@Bslabe123 Bslabe123 marked this pull request as ready for review November 13, 2024 23:46