Update "Running large tests" article #992

Merged: @imiric merged 36 commits into main from update/208-running-large-tests on Feb 13, 2023

Conversation

@imiric (Contributor) commented Jan 19, 2023:

This is an update to the "Running large tests" article, and closes #208. I'm creating the PR to get early feedback on the changes, but let's not merge it until the benchmarking is finished.

See the commits for details, but some of the major changes include:

  • Removal of the benchmark results section.
    We plan to move all benchmark results to the k6-hardware-benchmark repository, so that we can automate running the tests and generating the report there. This article will contain only general suggestions, which won't need to be updated as often; it will also be shorter and more readable.

  • Removal of the suggestion to use --compatibility-mode=base. See 2ffca0a for an explanation.

Ivan Mirić added 3 commits January 19, 2023 16:08
The default compatibility-mode (extended) in recent k6 versions only has
minor overhead from Babel, since core.js was removed several versions
ago. And as goja keeps improving, we're planning to drop Babel
altogether, which will likely mean deprecating compatibility-mode.

The end result is that, as of v0.42.0, compatibility-mode=base has very
little benefit over extended, to the point where it's not worth the
hassle of maintaining a build pipeline for most users.

For these reasons it's best to just stop recommending its usage here.
Ivan Mirić added 11 commits January 19, 2023 16:54
The number of terminals is not relevant if using e.g. nmon, which can
monitor CPU, RAM and network.
First explain the default k6 behavior, then introduce the option.
The "cloud" (the web buzzword) is not the same as "k6 Cloud" (the
product).
In order to update this information easily, we will keep benchmark
results in the k6-hardware-benchmark repository, and leave this article
for suggestions only. This way we can avoid having to continuously
update this article, and it will be shorter and more readable.

In the future we might also remove specifically mentioning EC2, as
benchmarks will be run from GitHub Actions (Azure). Or we might want to
expand this article and give broad suggestions for the largest cloud
providers.
@imiric force-pushed the update/208-running-large-tests branch from 999a8a0 to dbb023d on January 19, 2023 15:54
@immavalls (Contributor) left a comment:

Great updates @imiric, just added some minor suggestions. @MattDodsonEnglish should probably review as he is the k6 tech writing expert 😄


### Testing for RPS
We maintain a [repository](https://github.com/grafana/k6-hardware-benchmark) with some scripts used to benchmark k6 and create reports. These tests are run for every new k6 version, and you can see the results in the [`results/` directory](https://github.com/grafana/k6-hardware-benchmark/tree/master/results).
@imiric (Contributor, Author):

Nice catch, but like I mentioned in the PR description, we shouldn't merge this before the benchmarking is finished. That will happen after the results are added to the k6-hardware-benchmark repo, so this link will eventually work.

@MattDodsonEnglish (Contributor) commented:

A rendered version in staging is here:

https://mdr-ci.staging.k6.io/docs/refs/pull/992/merge/testing-guides/running-large-tests/

FYI, I'm going to only read it and see if I have any structural changes. I'll wait to do copy edits till the final version, since that's the most obvious and mechanical work.

- Tests that are using file uploads can consume tens of megabytes per VU.
+ Tests that are using file uploads, or load large JS modules, can consume tens of megabytes per VU.
Keep in mind that each VU has a copy of all JS modules your test uses.
If you need to share memory between VUs, consider using [SharedArray](/javascript-api/k6-data/sharedarray/), or an external data store, such as [Redis](/javascript-api/k6-experimental/redis/).
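For illustration, here is a minimal sketch of the SharedArray approach, assuming a hypothetical users.json data file; the data is parsed once and shared read-only across VUs, instead of each VU holding its own copy:

import { SharedArray } from 'k6/data';

// Hypothetical data file; parsed once in the init context and shared
// read-only between all VUs instead of being copied per VU.
const users = new SharedArray('users', function () {
  return JSON.parse(open('./users.json'));
});

export default function () {
  // Each iteration reads from the shared data without duplicating it.
  const user = users[Math.floor(Math.random() * users.length)];
}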

## General advice

### Make your test code resilient

When running large stress tests, your script can't assume anything about the HTTP response.
Often performance tests are written with a "happy path" in mind.
Contributor:

I've seen this "happy path" term twice now, and I only just realized that it's an industry term. I think we should define it. Now, I'm not the brightest, but even if it's a common term in CS, not all of our readers will come from the same language and educational backgrounds.


Often performance tests are written with a "happy path" in mind.
For example, a "happy path" check like the one below is something that we see in k6 often.
For example, a "happy path" check like the one below is something that we often see in k6 scripts.
Contributor:

Now that I understand what "Happy path" is, is this really a happy path check? It seems like it does not assume the happy path, since it accepts the possibility of an incorrect body size.

Instead, it seems like it's more a case of error handling that's not general enough.

Proposed change:

Even if a script accepts that the SUT won't always respond according to the happy path, it still might not handle a sufficiently wide range of errors.
For example, the following check is something we often see in k6 scripts:

@imiric (Contributor, Author):

It is a happy path check, since it doesn't take into account failure conditions under load, when the response body wouldn't exist. The body size check is a standard assertion, and the user might want the test to fail if it's incorrect.

Your suggestion sounds more confusing, sorry, so I'd rather leave this as is.

Contributor:

It is a happy path check, since it doesn't take into account failure conditions under load, when the response body wouldn't exist. The body size check is a standard assertion, and the user might want the test to fail if it's incorrect.

I don't understand. Isn't an incorrect body size another indication of system failure? Or are we defining failure as "no body response at all"? From the Wikipedia page, I thought a happy path is where there are no error conditions.

@imiric (Contributor, Author):

An incorrect body size would in this case fail the check in an expected way. I.e. the test would just fail. This is not an exceptional condition.

The response body not existing is an exceptional condition the script is not handling, therefore it's only testing the "happy path".

Trust me. This is fine :)
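
For concreteness, a minimal sketch of the kind of unguarded check being discussed, assuming an illustrative homepage test against test.k6.io; if the server fails and res.body is null, the .length access throws a TypeError instead of the check simply failing:

import http from 'k6/http';
import { check } from 'k6';

export default function () {
  const res = http.get('https://test.k6.io/');
  // Happy-path check: assumes a body is always present. Under heavy load
  // res.body can be null, and .length then throws a TypeError.
  check(res, {
    'Homepage body size is 11026 bytes': (r) => r.body.length === 11026,
  });
}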

- The issue here is that the check assumes that there's always a body in a response. The `r.body` may not exist if server is failing.
- In such case, the check itself won't work as expected and error similar to the one below will be returned:
+ The issue here is that the check assumes that there's always a body in a response. The `r.body` may not exist if the server is failing.
+ In such case, the check itself won't work as expected and an error similar to the one below will be returned:
Contributor:

Shouldn't we add an example of how to handle this exception? Right now it seems we diagnose a problem, but don't give any way to find a solution.

Contributor:

Oh, I realize now that this is in the text, just not in the diffs. Still, it looks like the example only fails a check if there is a body to begin with.

const checkRes = check(res, {
  'Homepage body size is 11026 bytes': (r) => r.body && r.body.length === 11026,
});

Wouldn't it make more sense to have two checks? One that checks that a body is returned, and another that checks that, when returned, it's the correct length?

@imiric (Contributor, Author):

Wouldn't it make more sense to have two checks?

You could do that, but you'd then have to add a dependency between the body length check and the body existence check, which would complicate the check functions. It's much simpler to use && here to avoid the TypeError.
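
For reference, a minimal sketch of the two-check alternative discussed above (the URL and check names are illustrative); the length check still has to guard against a missing body itself, so it gains little over the single check with &&:

import http from 'k6/http';
import { check } from 'k6';

export default function () {
  const res = http.get('https://test.k6.io/');
  check(res, {
    // Separate existence check.
    'Homepage returned a body': (r) => !!r.body,
    // The length check must still guard against a missing body to avoid a TypeError.
    'Homepage body size is 11026 bytes': (r) => r.body && r.body.length === 11026,
  });
}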

Contributor:

Sorry, I think I confused myself about the program behavior yesterday.

But, shouldn't the check have a different name?

- Homepage body size is 11026 bytes
+ Body exists and is 11026 bytes

@imiric (Contributor, Author):

Matt, I think you're nitpicking at this point. It's fine as is :)

@MattDodsonEnglish (Contributor) commented Jan 25, 2023:

Nice, I made some edits for structure and sentences to merge into this branch: #1003. @imiric, I'd appreciate a review. I don't think you need to analyze it too deeply; just flag anything that feels off to you. I'll proofread the text after the benchmarks are added.

It's a hard article to organize because of its length and breadth, but I'm happy with how it shaped up. After my merges, I'll be totally fine with the article (apart from my necessary proofreading).

As general guidelines, look for the following:
- CPU utilization doesn't exceed 90%. If all CPU cores are 100% utilized during the test run, you might notice performance degradation in your test results.
- Network utilization is at an acceptable level. Depending on your test, you might expect to fully saturate the network bandwidth, or this might be a signal that your test is bound by the available bandwidth. In other scenarios, you might want to minimize network usage to keep costs down.
- Memory utilization doesn't exceed 90%. If you're close to exhausting available physical RAM, the system might start swapping memory to disk, which will affect performance and system stability. In extreme cases, running out of memory on Linux will cause the system to end the k6 process.
@imiric (Contributor, Author):

Should we recommend running without swap? Our loadgens don't use it, AFAIK. cc: @mstoykov

Contributor:

I would recommend people run without swap in general ;) The only uses at this point IMO are:

  1. systems which don't have enough RAM at all
  2. hibernation (which is why I still have swap on my laptop)

In (my) practice, the difference between a rogue process running out of memory and dying, versus going into swap, is just that in the second case it dies a lot slower and later, usually thrashing the whole system in the process ;).

I guess with NVMe SSDs this is a bit better, but even on my laptop, whenever I hit swap I notice it, because the whole system gets really slow.

Given that k6 is a performance tool, its performance degrading significantly because swap is used seems like a terrible idea, even if the process itself doesn't die.

@imiric (Contributor, Author):

Fully agree, so I think this deserves a separate section that explains it. Will add, thanks.

@imiric (Contributor, Author):

Resolved by 1ed89f2.

@na-- removed their request for review on January 27, 2023 14:00
@na-- (Contributor) commented Jan 27, 2023:

Removed myself from the list of reviewers since this doesn't need my review; I likely can't contribute anything that @mstoykov and @imiric haven't already.

Ivan Mirić added 3 commits January 31, 2023 12:38
@ppcano (Collaborator) commented Feb 8, 2023:

@imiric any blockers, or can we merge it?

@imiric (Contributor, Author) commented Feb 9, 2023:

@ppcano Not yet. It's waiting for PRs in k6-hardware-benchmark to be approved and merged.

@mstoykov (Contributor) left a comment:

LGTM!

This reads as marketing material, with a bunch of hyperbole and statements bordering on "k6 is the only load testing tool that uses your resources fully". Which arguably is not true. Apart from locust, I think all the other fairly well-known load testing tools use the resources fully - some of them just have a lot of other things that eat into their actual performance (jMeter, for example).

And a lot of those statements would've been a lot better if we were comparing with other tools and had something like "see graph: k6 does 10k RPS, jMeter (for example) does 2k RPS, both at 80% CPU". But we show only k6 and then say "it's better than most", which IMO is pointless - better than most is still worse than some, which might be what people use. And with no concrete comparison, there is no way for someone to know whether, with this same setup, their load testing tool would give them twice the performance.

I personally do not think the docs are a place for marketing, so I would prefer if we toned it down a bunch on the comparisons to others where we just say stuff instead of showing it.

I still think there is a lot of valuable information and showing that you can do X RPS for some situation is still beneficial to users, and we should have it. But a bunch of the article reads as "k6 exceptionalism" to me.

@imiric (Contributor, Author) commented Feb 10, 2023:

@mstoykov You're right that the article has a marketing tone biased towards k6. I wouldn't say it reads like k6 is the only tool that fully utilizes system resources, but we're pointing out a clear advantage over other tools that struggle with that. This is not a comparison article like the "Open source load testing tool review" (which we should also eventually update, and possibly automate), so mentioning the performance of other tools would be out of context. This article is about how to configure the system and k6 to fully utilize the available hardware resources.

I personally do not think the docs are a place for marketing

Agreed, so you'll be happy to know that this article will be moved to the blog. 😄

a bunch of the article reads as "k6 exceptionalism" to me

Can you point out all those places?

This shouldn't read as a marketing article, but then again, it should point out k6's strengths, so that's a thin line to tread. I can try toning down some of these parts, with @MattDodsonEnglish's help, of course. Ah, sorry, Matt has limited availability right now, so we can change it ourselves.

@MattDodsonEnglish (Contributor) commented:

, with @MattDodsonEnglish's help, of course. Ah, sorry, Matt has limited availability right now, so we can change it ourselves.

Thanks, @imiric ! 🙏 I can make a quick comment though. Besides that first line, I also don't really see the "k6 exceptionalism" that @mstoykov pointed out. The only other overt line that I found was "not as greedy as other tools" but that does link to a data-driven comparison.

I believe that for technical articles like this, the best marketing is honesty and usefulness. So if there are other places where the doc seems over the top, I'd prefer to delete those lines. I just don't know what lines those are exactly.

However, the first line makes such a strong impression that it might frame the whole article as marketing material, when really the material is quite practical and technical. That line is also from the original version, and there's been much revision since then, so it isn't very harmonious with the rest of the article. I proposed cooling it down here:

#992 (comment).

@imiric requested a review from mstoykov on February 10, 2023 16:26
@imiric (Contributor, Author) commented Feb 10, 2023:

I know that you already approved this @mstoykov, but it sounded like you had major objections to the marketing tone of the article, so please take another look and see whether the recent changes are an improvement.

So with that approval, this will be ready to merge, since the k6-benchmarks changes are done.

@mstoykov (Contributor) left a comment:

I guess I over-reacted, as the start was particularly marketing-like IMO.

Although that in general seems to be mostly from the previous version rather than from anything done to this one.

With the latest changes, IMO there are no unfounded comparisons.

But maybe we should also try to get https://github.com/grafana/loadgentest updated? Maybe with just fewer tools?

@imiric (Contributor, Author) commented Feb 13, 2023:

@mstoykov Yeah, we should definitely update the comparison article and benchmark results at some point as well. This year would be great.

Can we get Ragnar to do it again? 😄

Anyway, I'm merging this as is, with messy history and all. It would be a lot of work to clean it up, and we want to preserve the history, so I wouldn't squash it.

Thanks for the reviews, everyone! 🙇

@imiric merged commit d976e1c into main on Feb 13, 2023
@imiric deleted the update/208-running-large-tests branch on February 13, 2023 10:28