
Benchmarking script v2 #866

Open · wants to merge 28 commits into main

Conversation

@Bslabe123 (Collaborator) commented Oct 29, 2024

Major script refactoring changes:

  • Individual functions are no longer responsible for backend specifics; these are now abstracted into the Backend class.
  • Added a ttft Prometheus metric.
  • The script now writes text files directly, so there is no longer a need to pipe its logged output to a file via bash.
  • Request rate can now be a function of t; this applies to both the --job flag discussed below and the --request-rate flag (see the sketch after this list). A few examples of newly valid inputs:
    • 10+(t/60), a request rate increasing linearly by 1 rps/min
    • min(40,20+20(floor(t/300))), start at 20, step to 40 after 300 s, and do not step up further
  • --job flag for multi-step benchmarking:
    • May be raw JSON or a path to a file; the JSON must take the following shape, and at least one of "time" or "max_num_prompts" is required:
    {
      "time_between_stages": float,
      "stages": [{
        "rate": float | string,
        "time": float,
        "max_num_prompts": int
      },
      ...]
    }
    
    • If --job has more than one stage, the "metrics" field will not be rendered in the JSON output
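
One way to picture the new rate handling: a minimal sketch (not the PR's actual implementation; the function name and the eval-based approach here are assumptions) of evaluating a rate expression at a given elapsed time t:

import math

def request_rate_at(expression: str, t: float) -> float:
    # Evaluate a rate expression such as "10+(t/60)" or
    # "min(40,20+20*floor(t/300))" at elapsed time t (in seconds).
    # Only a small whitelist of names is exposed to eval.
    allowed = {"t": t, "min": min, "max": max, "floor": math.floor}
    return float(eval(expression, {"__builtins__": {}}, allowed))

print(request_rate_at("10+(t/60)", 120))                   # 12.0: grows by 1 rps per minute
print(request_rate_at("min(40,20+20*floor(t/300))", 120))  # 20.0: steps to 40 only after 300 s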

Breaking changes:

  • --request-rate no longer has a default value
  • If benchmarking multiple models, the "dimensions" field will not be rendered in the JSON output

Given the size of this PR, the following will be added as follow-ups:

  • Separating this into separate files and making the relevant Dockerfile changes
  • Terraform changes to use the --job flag for multi-step benchmarking rather than multiple script runs
  • Proper tests

Tests:

TEST 1: Run with these flags to ensure no breaking changes to the output JSON:

--scrape-server-metrics --save-json-results --host=llama3-8b-vllm-service --port=8000 --model=meta-llama/Meta-Llama-3-8B --dataset=ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer=meta-llama/Meta-Llama-3-8B --backend=vllm --max-input-length=256 --max-output-length=256 --stream-request --num-prompts=300 --request-rate=10

JSON output diffed against the accepted output; output text file:

====Result for Model: meta-llama/Meta-Llama-3-8B====
Errors: {'ClientConnectorErrors': 0, 'TimeoutErrors': 0, 'ContentTypeErrors': 0, 'ClientOSErrors': 0, 'ServerDisconnectedErrors': 0, 'unknown_errors': 0}
Total time: 47.04 s
Successful/total requests: 300/300
Requests/min: 382.64
Output_tokens/min: 86893.21
Input_tokens/min: 10713.86
Tokens/min: 97607.07
Average seconds/token (includes waiting time on server): 0.07
Average milliseconds/request (includes waiting time on server): 19159.64
Average milliseconds/output_token (includes waiting time on server): 84.71
Average length of prompt: 28.00
Average length of response: 227.09
Average throughput in requests per second: 6.38
Average Time to First Token (s): 0.12

TEST 2: Run with these flags to duplicate the above test using the --job flag:

python3 b.py --scrape-server-metrics --save-json-results --host=llama3-8b-vllm-service --port=8000 --model=meta-llama/Meta-Llama-3-8B --dataset=ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer=meta-llama/Meta-Llama-3-8B --backend=vllm --max-input-length=256 --max-output-length=256 --stream-request --job=r.json

where r.json is

{
 "time_between_stages": 10,
 "stages": [{
   "rate": 10,
   "time": 30
 }]
}

Output text file:

====Result for Model: meta-llama/Meta-Llama-3-8B====
Errors: {'ClientConnectorErrors': 0, 'TimeoutErrors': 0, 'ContentTypeErrors': 0, 'ClientOSErrors': 0, 'ServerDisconnectedErrors': 0, 'unknown_errors': 0}
Total time: 48.44 s
Successful/total requests: 298/298
Requests/min: 369.13
Output_tokens/min: 87735.67
Input_tokens/min: 5906.02
Tokens/min: 93641.69
Average seconds/token (includes waiting time on server): 0.08
Average milliseconds/request (includes waiting time on server): 19865.20
Average milliseconds/output_token (includes waiting time on server): 83.60
Average length of prompt: 16.00
Average length of response: 237.68
Average throughput in requests per second: 6.15
Average Time to First Token (s): 0.12

TEST 3: Multi-step benchmarking using the --job flag:
Using the same flags as above with the following r.json:

{
 "time_between_stages": 10,
 "stages": [{
   "rate": 10,
   "time": 30
 },{
   "rate": 30,
   "time": 90
 }]
}

Resulting logs:

Starting Prometheus Server on port 9090
Starting benchmarking at 10 requests/sec for 30 sec
All requests sent, awaiting responses...
Finished benchmarking stage 1
Sleeping for 10 sec...
Starting benchmarking at 30 requests/sec for 90 sec
All requests sent, awaiting responses...
Finished benchmarking stage 2
Completed all stages, generating reports...
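
For reference, here is a minimal sketch of the stage sequencing shown in these logs, assuming a hypothetical run_stage() helper and the job shape described earlier (an illustration only, not the PR's actual code):

import json
import os
import time

def load_job(job_arg: str) -> dict:
    # --job may be raw JSON or a path to a JSON file.
    if os.path.isfile(job_arg):
        with open(job_arg) as f:
            return json.load(f)
    return json.loads(job_arg)

def run_job(job_arg: str) -> None:
    job = load_job(job_arg)
    stages = job["stages"]
    for i, stage in enumerate(stages, start=1):
        print(f"Starting benchmarking at {stage['rate']} requests/sec for {stage['time']} sec")
        run_stage(stage)  # hypothetical: sends requests at stage["rate"], awaits responses
        print(f"Finished benchmarking stage {i}")
        if i < len(stages):
            print(f"Sleeping for {job['time_between_stages']} sec...")
            time.sleep(job["time_between_stages"])
    print("Completed all stages, generating reports...")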

The resulting output is the following two text files and JSON:

  • latency-profile-2024-11-13_23-34-22.txt
====Result for Model: meta-llama/Meta-Llama-3-8B====
Errors: {'ClientConnectorErrors': 0, 'TimeoutErrors': 0, 'ContentTypeErrors': 0, 'ClientOSErrors': 0, 'ServerDisconnectedErrors': 0, 'unknown_errors': 0}
Total time: 110.88 s
Successful/total requests: 2493/2493
Requests/min: 1349.00
Output_tokens/min: 167409.52
Input_tokens/min: 66101.16
Tokens/min: 233510.68
Average seconds/token (includes waiting time on server): 0.08
Average milliseconds/request (includes waiting time on server): 18689.54
Average milliseconds/output_token (includes waiting time on server): 155.00
Average length of prompt: 49.00
Average length of response: 124.10
Average throughput in requests per second: 22.48
Average Time to First Token (s): 0.16
  • latency-profile-2024-11-13_23-33-22.txt
====Result for Model: meta-llama/Meta-Llama-3-8B====
Errors: {'ClientConnectorErrors': 0, 'TimeoutErrors': 0, 'ContentTypeErrors': 0, 'ClientOSErrors': 0, 'ServerDisconnectedErrors': 0, 'unknown_errors': 0}
Total time: 45.41 s
Successful/total requests: 303/303
Requests/min: 400.37
Output_tokens/min: 74140.70
Input_tokens/min: 7206.62
Tokens/min: 81347.32
Average seconds/token (includes waiting time on server): 0.07
Average milliseconds/request (includes waiting time on server): 15297.39
Average milliseconds/output_token (includes waiting time on server): 83.12
Average length of prompt: 18.00
Average length of response: 185.18
Average throughput in requests per second: 6.67
Average Time to First Token (s): 0.12
{
  "config": {
    "model": "meta-llama/Meta-Llama-3-8B",
    "model_server": "vllm",
    "start_time": {
      "seconds": 1731540802,
      "nanos": 327488103
    },
    "num_models": 1
  },
  "summary_stats": {
    "stats": [
      {
        "request_rate": 10,
        "per_token_latency": {
          "mean": 0.07385869081540675,
          "median": 0.07489906784391752,
          "sd": 0.007699994295419569,
          "min": 0.028674545288085936,
          "max": 0.08453639314956024,
          "p90": 0.08223036577721604,
          "p99": 0.08393648173226355
        },
        "request_latency": {
          "mean": 15297.391085734855,
          "median": 17421.419858932495,
          "sd": 5070.809268756617,
          "min": 716.8636322021484,
          "max": 20119.661569595337,
          "p90": 19553.660917282104,
          "p99": 19976.8541097641
        },
        "tpot": {
          "mean": 83.1183274983278,
          "median": 83.7416724725203,
          "sd": 6.300375318379742,
          "min": 68.69693886150013,
          "max": 103.30891141704484,
          "p90": 89.72805160889651,
          "p99": 98.9513516059289
        },
        "input_length": {
          "mean": 18,
          "median": 18,
          "sd": 0,
          "min": 18,
          "max": 18,
          "p90": 18,
          "p99": 18
        },
        "output_length": {
          "mean": 185.18151815181517,
          "median": 220,
          "sd": 61.00622206274113,
          "min": 7,
          "max": 223,
          "p90": 220,
          "p99": 221
        },
        "throughput": {
          "mean": 6.6727949503910775
        },
        "ttft": {
          "mean": 0.11786133185987246,
          "median": 0.11588965501869097,
          "sd": 0.024016406241249174,
          "min": 0.07840587495593354,
          "max": 0.17940028198063374,
          "p90": 0.15001811598194764,
          "p99": 0.1739412961388007
        },
        "model_server_metrics": [
          {
            "name": "vllm:gpu_cache_usage_perc",
            "description": "Metrics for vllm:gpu_cache_usage_perc from vllm backend",
            "mean": 0.14264919941775833,
            "median": 0.0873362445414847,
            "sd": 0.10155519884079389,
            "min": 0.055520898315658096,
            "max": 0.2850904553961322,
            "p90": 0.2455396132252027,
            "p99": 0.28113537117903925
          },
          {
            "name": "vllm:num_requests_waiting",
            "description": "Metrics for vllm:num_requests_waiting from vllm backend",
            "mean": 0,
            "median": 0,
            "sd": 0,
            "min": 0,
            "max": 0,
            "p90": 0,
            "p99": 0
          }
        ]
      },
      {
        "request_rate": 30,
        "per_token_latency": {
          "mean": 0.08403210467408449,
          "median": 0.09693534481484173,
          "sd": 0.048431116032316375,
          "min": 0.0016284847259521484,
          "max": 0.14853053191953877,
          "p90": 0.13983914381904997,
          "p99": 0.1466936592907229
        },
        "request_latency": {
          "mean": 18689.539973055842,
          "median": 17226.099729537964,
          "sd": 15242.002332285088,
          "min": 81.42423629760742,
          "max": 42925.323724746704,
          "p90": 40208.89763832092,
          "p99": 42358.3074092865
        },
        "tpot": {
          "mean": 155.00156706954638,
          "median": 155.06598254044852,
          "sd": 43.786653576254,
          "min": 81.42423629760742,
          "max": 987.2474670410156,
          "p90": 180.54201262957847,
          "p99": 233.23154040745325
        },
        "input_length": {
          "mean": 49,
          "median": 49,
          "sd": 0,
          "min": 49,
          "max": 49,
          "p90": 49,
          "p99": 49
        },
        "output_length": {
          "mean": 124.09867629362215,
          "median": 112,
          "sd": 99.93797459824643,
          "min": 1,
          "max": 245,
          "p90": 240,
          "p99": 241
        },
        "throughput": {
          "mean": 22.483387890744382
        },
        "ttft": {
          "mean": 0.15626823447809177,
          "median": 0.14473288401495665,
          "sd": 0.07395838346933907,
          "min": 0.08015697100199759,
          "max": 0.9864042410044931,
          "p90": 0.19817143981344998,
          "p99": 0.5622111370484344
        },
        "model_server_metrics": [
          {
            "name": "vllm:gpu_cache_usage_perc",
            "description": "Metrics for vllm:gpu_cache_usage_perc from vllm backend",
            "mean": 0.674538811157651,
            "median": 0.7953836556456644,
            "sd": 0.2541336080782851,
            "min": 0.22582657517155336,
            "max": 0.9357454772301934,
            "p90": 0.9106674984404243,
            "p99": 0.9332376793512165
          },
          {
            "name": "vllm:num_requests_waiting",
            "description": "Metrics for vllm:num_requests_waiting from vllm backend",
            "mean": 0.14285714285714285,
            "median": 0,
            "sd": 0.3499271061118826,
            "min": 0,
            "max": 1,
            "p90": 0.40000000000000036,
            "p99": 0.9399999999999995
          }
        ]
      }
    ]
  },
  "dimensions": {
    "date": "20241113-233250",
    "backend": "vllm",
    "model_id": "meta-llama/Meta-Llama-3-8B",
    "tokenizer_id": "meta-llama/Meta-Llama-3-8B"
  },
  "metrics": null
}

TEST 4:
Demonstrates f(t) request rates using the following r.json with the same flags as above; this particular rate should send ~360 requests:

{
 "time_between_stages": 10,
 "stages": [{
   "rate": "1+(t/30)",
   "time": 120
 }]
}
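
As a sanity check on the ~360 figure: integrating the rate over the stage gives ∫₀¹²⁰ (1 + t/30) dt = 120 + 120²/(2·30) = 120 + 240 = 360 expected requests, which the 351 actually sent comes reasonably close to.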

Resulting text file:

====Result for Model: meta-llama/Meta-Llama-3-8B====
Errors: {'ClientConnectorErrors': 0, 'TimeoutErrors': 0, 'ContentTypeErrors': 0, 'ClientOSErrors': 0, 'ServerDisconnectedErrors': 0, 'unknown_errors': 0}
Total time: 127.89 s
Successful/total requests: 351/351
Requests/min: 164.67
Output_tokens/min: 19362.55
Input_tokens/min: 2799.46
Tokens/min: 22162.01
Average seconds/token (includes waiting time on server): 0.06
Average milliseconds/request (includes waiting time on server): 7943.62
Average milliseconds/output_token (includes waiting time on server): 67.82
Average length of prompt: 17.00
Average length of response: 117.58
Average throughput in requests per second: 2.74
Average Time to First Token (s): 0.11

@Bslabe123 Bslabe123 changed the title Request rate as Function of t Benchmarking script v2 Nov 13, 2024
@Bslabe123 Bslabe123 marked this pull request as ready for review November 13, 2024 23:46