
tests: parametrize benchmark tests #4974

Open
cm-iwata wants to merge 12 commits into base: main

Conversation

cm-iwata
Contributor

@cm-iwata cm-iwata commented Dec 30, 2024

In the previous implementation, it was necessary to adjust the timeout value every time a benchmark test was added.
By parametrizing the benchmark tests, the time required for each test becomes predictable, eliminating the need to adjust the timeout value.

Changes

Parametrize the test by the list of criterion benchmarks.

By parametrizing the tests, git clone will be executed for each parameter here:

with TemporaryDirectory() as tmp_dir:
    dir_a = git_clone(Path(tmp_dir) / a_revision, a_revision)
    result_a = test_runner(dir_a, True)
    if b_revision:
        dir_b = git_clone(Path(tmp_dir) / b_revision, b_revision)
    else:
        # By default, pytest execution happens inside the `tests` subdirectory. Pass the repository root, as
        # documented.
        dir_b = Path.cwd().parent
    result_b = test_runner(dir_b, False)
    comparison = comparator(result_a, result_b)
    return result_a, result_b, comparison

Running all parametrized tests with a single git clone would require major revisions to git_ab_test, so this PR does not address that issue.
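
To make the change concrete, here is a minimal sketch of the parametrized shape described above; the benchmark names and helper names (CRITERION_BENCHMARKS, _run_criterion, _compare_results) are illustrative stand-ins, not the exact code from this PR, and git_ab_test is used as in the snippet above.

import pytest

# Hypothetical list; in practice it would be derived from the criterion benchmarks.
CRITERION_BENCHMARKS = ["page_fault", "queue_pop_16", "serialize_cpu_template"]

@pytest.mark.parametrize("bench_name", CRITERION_BENCHMARKS)
def test_no_regression_relative_to_target_branch(bench_name):
    # git_ab_test clones/builds both revisions and runs the same callable against
    # each, so every parametrized case has a similar, predictable duration.
    result_a, result_b, comparison = git_ab_test(
        lambda checkout, is_a: _run_criterion(checkout, bench_name),
        _compare_results,
    )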

Reason

Closes #4832

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.

In the previous implementation, it was necessary to adjust the timeout
value every time a benchmark test was added.
By parametrizing the benchmark tests, the time required for each test
becomes predictable, eliminating the need to adjust the timeout value.

Signed-off-by: Tomoya Iwata <[email protected]>
@pb8o pb8o added the python (Pull requests that update Python code) label Jan 8, 2025

codecov bot commented Jan 8, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.19%. Comparing base (ae078ee) to head (cfbd0e9).

Current head cfbd0e9 differs from pull request most recent head d1226ab

Please upload reports for the commit d1226ab to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4974      +/-   ##
==========================================
+ Coverage   83.01%   83.19%   +0.17%     
==========================================
  Files         250      247       -3     
  Lines       26897    26641     -256     
==========================================
- Hits        22328    22163     -165     
+ Misses       4569     4478      -91     
Flag Coverage Δ
5.10-c5n.metal 83.67% <ø> (+0.10%) ⬆️
5.10-m5n.metal 83.66% <ø> (+0.09%) ⬆️
5.10-m6a.metal 82.86% <ø> (+0.07%) ⬆️
5.10-m6g.metal 79.66% <ø> (+0.32%) ⬆️
5.10-m6i.metal 83.64% <ø> (+0.09%) ⬆️
5.10-m7g.metal 79.66% <ø> (+0.32%) ⬆️
6.1-c5n.metal 83.67% <ø> (+0.06%) ⬆️
6.1-m5n.metal 83.66% <ø> (+0.05%) ⬆️
6.1-m6a.metal 82.86% <ø> (+0.02%) ⬆️
6.1-m6g.metal ?
6.1-m6i.metal 83.64% <ø> (+0.04%) ⬆️
6.1-m7g.metal 79.66% <ø> (+0.32%) ⬆️

Flags with carried forward coverage won't be shown.


pb8o
pb8o previously approved these changes Jan 8, 2025
)

executables = []
for line in stdout.split("\n"):
Contributor

nit: could be stdout.splitlines()
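
For reference, the two only differ in the trailing element when the output ends with a newline; a quick illustration:

out = "bench_a\nbench_b\n"
out.split("\n")   # ['bench_a', 'bench_b', ''] - trailing empty string
out.splitlines()  # ['bench_a', 'bench_b']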

@roypat roypat left a comment

I think like this we're no longer doing an A/B test; we're just benchmarking the same binary compiled from the PR branch twice (i.e. comparing the PR results to themselves).

@pytest.mark.no_block_pr
@pytest.mark.timeout(900)
def test_no_regression_relative_to_target_branch():
@pytest.mark.timeout(600)
Contributor

from the buildkite run, it seems like the longest duration of one of these is 150s for the queue benchmarks, so I think we can actually drop this timeout marker altogether and just rely on the default timeout specified in pytest.ini (which is 300s)
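
For context, a hedged sketch of how the default and the per-test marker interact (the 300s default is the value quoted in this thread; everything else is illustrative):

# pytest.ini (pytest-timeout plugin):
#   [pytest]
#   timeout = 300      # default per-test timeout, in seconds
#
# A marker overrides the default only where a single case needs more time:
import pytest

@pytest.mark.timeout(600)
def test_slow_case():
    ...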

Contributor Author

fix in 215af23

Contributor Author

When I verified this, I found that the first execution incurs more than 300 seconds of overhead, so the 300-second default is too short.

root@94c9107981fb:/firecracker/tests# pytest "integration_tests/performance/test_benchmarks.py"  -m no_block_pr --durations=0 -v
================================================================================================== test session starts ===================================================================================================
platform linux -- Python 3.12.3, pytest-8.3.5, pluggy-1.5.0 -- /opt/venv/bin/python
cachedir: ../build/pytest_cache
metadata: {'Python': '3.12.3', 'Platform': 'Linux-6.8.0-51-generic-x86_64-with-glibc2.39', 'Packages': {'pytest': '8.3.5', 'pluggy': '1.5.0'}, 'Plugins': {'metadata': '3.1.1', 'rerunfailures': '14.0', 'timeout': '2.3.1', 'xdist': '3.6.1', 'json-report': '1.5.0'}}
EC2 AMI: NA
rootdir: /firecracker/tests
configfile: pytest.ini
plugins: metadata-3.1.1, rerunfailures-14.0, timeout-2.3.1, xdist-3.6.1, json-report-1.5.0
timeout: 300.0s
timeout method: signal
timeout func_only: False
collected 8 items

integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[serialize_cpu_template] PASSED                                                                      [ 12%]
integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[page_fault] PASSED                                                                                  [ 25%]
integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[queue_add_used_16] PASSED                                                                           [ 37%]
integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[queue_pop_16] PASSED                                                                                [ 50%]
integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[queue_add_used_256] PASSED                                                                          [ 62%]
integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[next_descriptor_16] PASSED                                                                          [ 75%]
integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[request_parse] PASSED                                                                               [ 87%]
integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[deserialize_cpu_template] PASSED                                                                    [100%]

------------------------------------------------------------------------------------------------------ JSON report -------------------------------------------------------------------------------------------------------
report saved to: ../test_results/test-report.json
=================================================================================================== slowest durations ====================================================================================================
315.52s call     integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[serialize_cpu_template]
40.62s call     integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[page_fault]
39.61s call     integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[queue_add_used_256]
39.57s call     integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[request_parse]
37.05s call     integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[queue_pop_16]
36.06s call     integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[next_descriptor_16]
34.07s call     integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[queue_add_used_16]
22.73s call     integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[deserialize_cpu_template]
1.63s setup    integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[serialize_cpu_template]
0.15s teardown integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[deserialize_cpu_template]
0.00s setup    integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[next_descriptor_16]
0.00s setup    integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[queue_pop_16]
0.00s setup    integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[deserialize_cpu_template]
0.00s setup    integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[queue_add_used_256]
0.00s setup    integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[page_fault]
0.00s teardown integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[queue_add_used_256]
0.00s setup    integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[request_parse]
0.00s teardown integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[queue_add_used_16]
0.00s setup    integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[queue_add_used_16]
0.00s teardown integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[serialize_cpu_template]
0.00s teardown integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[queue_pop_16]
0.00s teardown integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[next_descriptor_16]
0.00s teardown integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[request_parse]
0.00s teardown integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[page_fault]

Comment on lines 29 to 32
_, stdout, _ = cargo(
    "bench",
    f"--all --quiet --target {platform.machine()}-unknown-linux-musl --message-format json --no-run",
)
Contributor

Mhh, I don't think this does what we want. We precompile the executables once (from the PR branch), and then we use this precompiled executable for both the A and B runs. What we need to do though is compile each benchmark twice, once from the main branch and once from the PR branch, so that this test does a meaningful comparison :/ That's why in #4832 I suggested using --list-only or something: determine the names of the benchmarks here, and then compile them twice in _run_criterion.
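
A rough sketch of the discovery step being suggested here, parsing the `-- --list` output for lines ending in ": benchmark"; the helper name and the exact cargo() arguments are assumptions based on the snippet above, not the PR's final code:

def get_benchmark_names():
    # Ask the criterion harnesses to list their benchmarks without running them;
    # each benchmark shows up as a line like "queue_pop_16: benchmark".
    _, stdout, _ = cargo("bench", "--all --quiet -- --list")
    return sorted(
        line.split(":")[0].strip()
        for line in stdout.splitlines()
        if line.strip().endswith(": benchmark")
    )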

Contributor Author

Sorry, I think I misunderstood a bit how it works.

Let me confirm the modifications.
First, run cargo bench --all -- --list to generate parameters to pass to pytest.parametrize.
I will get the following output:

root@90de30508db0:/firecracker# cargo bench --all -- --list
    Finished `bench` profile [optimized] target(s) in 0.10s
     Running benches/block_request.rs (build/cargo_target/release/deps/block_request-2e4b90407b22a8d0)
request_parse: benchmark

     Running benches/cpu_templates.rs (build/cargo_target/release/deps/cpu_templates-cd18fd51dbad16f4)
Deserialization test - Template size (JSON string): [2380] bytes.
Serialization test - Template size: [72] bytes.
deserialize_cpu_template: benchmark
serialize_cpu_template: benchmark

     Running benches/memory_access.rs (build/cargo_target/release/deps/memory_access-741f97a7c9c33391)
page_fault: benchmark
page_fault #2: benchmark

     Running benches/queue.rs (build/cargo_target/release/deps/queue-b2dfffbab00c4157)
next_descriptor_16: benchmark
queue_pop_16: benchmark
queue_add_used_16: benchmark
queue_add_used_256: benchmark

From this I will get the benchmark names, for example queue_pop_16.

Finally, run a command like cargo bench --all -- queue_pop_16 in _run_criterion.
Is this correct?

Contributor

Yes, that's pretty much it! The main point is that the compilation of the benchmarks needs to happen in _run_criterion, because we actually have to compile them twice, once for the pull request target, and once for the pull request head.
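
A minimal sketch of what _run_criterion could then look like, using plain subprocess rather than the test framework's cargo helper; the function and parameter names are illustrative, and the --exact flag matches the follow-up commit further down:

import subprocess

def _run_criterion(firecracker_checkout, bench_name):
    # Compile and run a single named benchmark from this checkout's sources, so
    # the A and B revisions are each built and measured from their own code.
    # --exact keeps e.g. serialize_cpu_template from also matching
    # deserialize_cpu_template.
    return subprocess.run(
        ["cargo", "bench", "--all", "--", bench_name, "--exact"],
        cwd=firecracker_checkout,
        check=True,
        capture_output=True,
        text=True,
    ).stdout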

Contributor Author

fix in d42d39f

Since it would be very slow to git clone and build the executables for each parameter, I adjusted the fixture so that git clone is executed only once.
033ca8f
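
A hedged sketch of the once-per-class shape; the fixture name and internals are illustrative and reuse the hypothetical helpers from the sketches above, not the exact code in 033ca8f:

from pathlib import Path
import pytest

class TestBenchMarks:
    @pytest.fixture(scope="class")
    def ab_dirs(self, tmp_path_factory):
        # Prepare the A and B working trees once per class instead of once per
        # parametrized benchmark.
        tmp = tmp_path_factory.mktemp("ab_test")
        dir_a = git_clone(tmp / "main", "main")  # baseline revision (assumed main)
        dir_b = Path.cwd().parent                # PR checkout, as in the snippet above
        return dir_a, dir_b

    @pytest.mark.parametrize("bench_name", get_benchmark_names())
    def test_no_regression_relative_to_target_branch(self, ab_dirs, bench_name):
        dir_a, dir_b = ab_dirs
        result_a = _run_criterion(dir_a, bench_name)
        result_b = _run_criterion(dir_b, bench_name)
        _compare_results(result_a, result_b)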

use `splitlines()` instead of `split("\n")`.

Signed-off-by: Tomoya Iwata <[email protected]>
No longer need to set individual timeout values,
because the performance tests are parameterized.

Signed-off-by: Tomoya Iwata <[email protected]>
cm-iwata and others added 5 commits January 17, 2025 09:59
In the previous implementation, the same binary built in the PR branch
was executed twice,
which was not a correct A/B test. This has been fixed.

Signed-off-by: Tomoya Iwata <[email protected]>
In the previous implementation, git clone was executed
for each parameter of the parametrized test.
This has a large overhead, so I adjusted it so that
the fixtures are only called once per class.

Signed-off-by: Tomoya Iwata <[email protected]>
Added the `exact` option to cargo bench to avoid running
deserialize_cpu_template when specifying serialize_cpu_template.

Signed-off-by: Tomoya Iwata <[email protected]>
I parametrized the benchmark test, but it still took more than 300
seconds to run the first time, so I adjusted the timeout value.

Signed-off-by: Tomoya Iwata <[email protected]>
@cm-iwata cm-iwata requested a review from roypat April 27, 2025 09:53
Labels
python Pull requests that update Python code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parametrize test_benchmarks.py test by criterion benchmarks
3 participants