Scope and validate flaky test detection #18

@aj-codecov

From Datadog's website: Flaky tests are tests that exhibit both a passing and failing status across multiple test runs for the same commit. If you commit some code and run it through CI, and a test fails, and you run it through CI again and the test passes, that test is unreliable as proof of quality code.

Datadog Docs
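
As a rough sketch of that definition (not an agreed design): given recorded test results, a test counts as flaky when the same commit has both a pass and a fail for it. The input shape, `(commit_sha, test_name, passed)` tuples, and the function name `find_flaky_tests` are assumptions made for illustration; nothing here reflects an existing Codecov data model.

```python
from collections import defaultdict


def find_flaky_tests(runs):
    """Return the sorted names of tests that both passed and failed on one commit.

    `runs` is an iterable of (commit_sha, test_name, passed) tuples.
    This input shape is hypothetical, chosen only for the sketch.
    """
    # Collect the set of outcomes seen for each (commit, test) pair.
    outcomes = defaultdict(set)
    for commit_sha, test_name, passed in runs:
        outcomes[(commit_sha, test_name)].add(bool(passed))

    # A pair is flaky only if both True and False were observed for it.
    flaky = {
        test_name
        for (_commit_sha, test_name), seen in outcomes.items()
        if seen == {True, False}
    }
    return sorted(flaky)


runs = [
    ("abc123", "test_login", False),
    ("abc123", "test_login", True),   # same commit, different result: flaky
    ("abc123", "test_math", True),
    ("abc123", "test_math", True),    # stable pass
    ("abc123", "test_db", False),
    ("def456", "test_db", True),      # different commits: not flaky by this definition
]
print(find_flaky_tests(runs))  # ['test_login']
```

Note that `test_db` is not flagged: its results differ across commits, which could be a real fix or regression rather than flakiness.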

Prior research:
https://www.notion.so/sentry/Flaky-Tests-Product-717283e859e345d1b58e29952442908b

https://www.notion.so/sentry/Flaky-Tests-Study-39a87ba0bf9f4318a2cb074813d63f66

From Matt Hammerly, in a conversation about Codecov as a test runner:
there were two angles i was thinking of:

  • we can diagnose flaky tests with reruns
  • we can compute difference between project-level coverage from a full test run vs from the incrementally-computed coverage with carry-over flags and the like

neither can be worked into customer CI without causing considerable slowdowns, but maybe we could expose them in codecovcli for manual runs?

  • codecovcli flake-test subcommand using whatever heuristics we have for evaluating flakiness through coverage differences
    • are there other "test quality" analyses we could provide? flagging non-hermetic test characteristics like network/filesystem usage? putting diagnostics in users' hands feels like giving them new powers to improve tests while also reinforcing that, insofar as variable coverage is due to flaky tests, it's their problem
  • codecovcli init subcommand which does a full test run and uploads a report which we use to reset project-level coverage. if there was already project-level coverage computed for the commit, log the difference between the new and old project-level coverage so we can measure how accurate our carry-forward mechanism is over time. if users think project-level coverage is suspicious, ask them to run this (on a commit with sus coverage) and give us the ID it prints so we can investigate, or something.

drawing insights from user-initiated processes is more chaotic than if we just did it every N commits or M days, and codecovcli is awkward to use manually (tho maybe the global tokens solve that, haven't played with it), but on the plus side it takes no extra infra capacity for us + it empowers customers to identify/rule out issues and restore trust in coverage rather than relying on us as an inscrutable oracle
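
To make the two coverage-based ideas above a bit more concrete, here is a minimal sketch. It assumes a coverage report can be reduced to a dict of `{file_path: {line_number: hit_count}}`; that shape and every function name below are hypothetical and are not the Codecov report format or CLI API. `unstable_lines` compares two runs of the same commit (lines covered in only one run hint at nondeterministic tests), and `coverage_drift` compares project-level coverage from a full run against the carry-forward figure.

```python
def covered_lines(report):
    """Return the set of (path, line) pairs with at least one hit."""
    return {
        (path, line)
        for path, hits in report.items()
        for line, count in hits.items()
        if count > 0
    }


def unstable_lines(report_a, report_b):
    """Lines covered in exactly one of two runs of the same commit."""
    return sorted(covered_lines(report_a) ^ covered_lines(report_b))


def project_coverage(report):
    """Percentage of tracked lines that were hit (0.0 for an empty report)."""
    total = sum(len(hits) for hits in report.values())
    if total == 0:
        return 0.0
    return 100.0 * len(covered_lines(report)) / total


def coverage_drift(full_report, carried_report):
    """Full-run coverage minus carry-forward coverage, in percentage points."""
    return project_coverage(full_report) - project_coverage(carried_report)


full = {"app.py": {1: 1, 2: 1, 3: 0, 4: 1}}     # report from a full test run
carried = {"app.py": {1: 1, 2: 0, 3: 0, 4: 1}}  # incrementally computed report

print(unstable_lines(full, carried))  # [('app.py', 2)]
print(coverage_drift(full, carried))  # 25.0
```

A nonzero drift is only a signal to investigate; it does not by itself say whether flaky tests or the carry-forward mechanism caused the gap.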

Project status: MVP/Internal QA
