
Refine triaging logic, improve web crash details and add better "TP/FP" indicator #656

Open
DavidKorczynski opened this issue Sep 29, 2024 · 1 comment

DavidKorczynski commented Sep 29, 2024

There are some weird attributes in the "Bug" column for benchmark results, which makes its meaning a bit difficult to understand. I also think "Bug" is a bit too vague in this context. The "Bug" column is supposed to indicate whether the bug is in the driver code or the project code.

Consider the following result: https://llm-exp.oss-fuzz.com/Result-reports/ofg-pr/2024-09-29-655-d-cov-103-all/benchmark/output-htslib-sam_index_build2/index.html

The two results that crash have:

  • Triaging --> Driver for both
  • Diagnosis --> Semantics vs non-semantic
  • Crashes --> True for both

Both are clearly false positives, but "Bug" is set to opposite values for the two of them.

Bug, in the report template, is defined as:

<td style="color: {{ 'red' if sample.result.crashes and not sample.result.is_semantic_error else 'black' }}">{{ sample.result.crashes and not sample.result.is_semantic_error }}</td>

i.e. sample.result.crashes and not sample.result.is_semantic_error

So if the sample crashes and there is no semantic error, Bug becomes True and is colored red.

Based on the definition in the template, I think True means the crash is considered a valid bug. The color coding confuses me a bit, though -- my intuition would be to make it green if it is a true positive.

I think we should make a couple of improvements here:

  1. Rename Bug to be a bit more descriptive
  2. Add the classification logic to the core rather than in the web app itself (i.e. include sample.result.crashes and not sample.result.is_semantic_error in the core); a minimal sketch follows after this list
  3. Add the LLM-based triage verdict into the conclusion of whether a bug is a TP or FP
  4. Include all semantic validations in the crash triaging logic, and show them all in the UI. The logic starting here:
    symptom = SemanticCheckResult.extract_symptom(fuzzlog)
    crash_stacks = self._parse_stacks_from_libfuzzer_logs(lines)
    crash_func = self._parse_func_from_stacks(project_name, crash_stacks)
    crash_info = SemanticCheckResult.extract_crash_info(fuzzlog)
    only includes a single semantic validation. However, the semantic checks are not mutually exclusive, and in many cases it would be good to know them all, e.g. "the crash is a NULL-deref (if symptom == 'null-deref':), happens in the first iterations of the run (if lastround is None or lastround <= EARLY_FUZZING_ROUND_THRESHOLD:), and occurs in a trace close to the harness (if len(crash_stacks) > 0:)".
  5. Based on the above improvements, come up with a new definition of "True Positive vs False Positive", e.g. based on a more fine-grained scoring system
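
For point 2, here is a minimal sketch of what moving the classification out of the Jinja template and into the core could look like. The class and property names below are illustrative only, not the actual ones in the repo:

    import dataclasses

    @dataclasses.dataclass
    class Result:
        """Illustrative stand-in for the core result object."""
        crashes: bool = False
        is_semantic_error: bool = False

        @property
        def is_potential_true_positive(self) -> bool:
            # Same condition the template currently inlines:
            # crashes and not is_semantic_error.
            return self.crashes and not self.is_semantic_error

The template would then only render sample.result.is_potential_true_positive, and the renamed column (point 1) and any future scoring logic (point 5) could reuse the same property.
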
DavidKorczynski self-assigned this Sep 29, 2024
DonggeLiu commented Sep 30, 2024

Thanks @DavidKorczynski.
Adding @happy-qop and @fdt622 for visibility; they added the semantic analysis and bug triage.

Consider the following result: https://llm-exp.oss-fuzz.com/Result-reports/ofg-pr/2024-09-29-655-d-cov-103-all/benchmark/output-htslib-sam_index_build2/index.html
Both are clearly false positives, but "Bug" is set to opposite values for the two of them.

Yes, the contradiction is because they come from two separate judging systems:

  1. bug+diagnosis: This is from a list of known heuristics.
  2. Triage: This is purely from LLM.

Taking this case as an example:

  1. The heuristic-based method categorised the first one as a valid bug (Bug: True) because its heuristics did not capture any semantic error (NO_SEMANTIC_ERR). It categorised the second one as a false positive (Bug: False) because of a heuristic on immediate crashes.
  2. The LLM-based method correctly categorised both as FP.

We are improving on the LLM method (e.g., via agents).
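
For illustration, folding the LLM triage verdict into the TP/FP conclusion (point 3 in the issue) could look roughly like the sketch below. The function name and the 'driver'/'project' verdict strings are placeholders, not existing names in the repo:

    def conclude_true_positive(heuristic_says_bug: bool,
                               llm_triage_verdict: str) -> bool:
        """Hypothetical helper that combines both judging systems."""
        # Heuristic signal: the target crashes and no semantic error was detected.
        # LLM triage verdict: whether the crash is in the driver or the project.
        # Only report a true positive when both systems agree the bug is
        # in the project-under-test.
        return heuristic_says_bug and llm_triage_verdict == 'project'
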

  1. Rename Bug to be a bit more descriptive

Yep, we use 'bug' to represent a valid true positive bug in the project-under-test (not a false positive from the fuzz target). Please rename it as you see fit.

  2. Add the classification logic to the core rather than in the web app itself (i.e. include sample.result.crashes and not sample.result.is_semantic_error in the core)

Yes good point.
I reckon they are busy with other tasks at hand, so please feel free to change it if you happen to have time : )

  3. Add the LLM-based triage verdict into the conclusion of whether a bug is a TP or FP

IIUC, this is already reflected by the Triage column?

  4. Include all semantic validations in the crash triaging logic, and show them all in the UI.
  5. Based on the above improvements, come up with a new definition of "True Positive vs False Positive", e.g. based on a more fine-grained scoring system

Yep, we are working on that + some more heuristics + agents with access to bash & LLDB + some LLM tricks to be pushed later.
Eventually, the LLM will be given all the info a human needs to triage a bug.
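
As a strawman for the fine-grained scoring in point 5, each non-exclusive semantic signal from point 4 could contribute to a score instead of collapsing into a single boolean. The signal names and weights below are made up purely for illustration, not taken from the repo:

    # Illustrative only -- signal names and weights are hypothetical.
    FP_SIGNAL_WEIGHTS = {
        'null_deref': 0.3,          # symptom == 'null-deref'
        'early_crash': 0.4,         # lastround <= EARLY_FUZZING_ROUND_THRESHOLD
        'stack_near_harness': 0.3,  # crash frames close to the fuzz target
    }

    def false_positive_score(signals: dict) -> float:
        """Sum the weights of the semantic signals that fired."""
        return sum(weight for name, weight in FP_SIGNAL_WEIGHTS.items()
                   if signals.get(name))

    # e.g. a NULL-deref that crashes in an early round close to the harness
    # scores 1.0 and would be flagged as a likely false positive.
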
