
Refine triaging logic, improve web crash details and add better "TP/FP" indicator #656

Open
DavidKorczynski opened this issue Sep 29, 2024 · 1 comment

DavidKorczynski commented Sep 29, 2024

There are some weird attributes in the "Bug" column for benchmark results, which makes its meaning a bit difficult to understand. I also think "Bug" is a bit too vague in this context. The "Bug" column is supposed to indicate whether the bug is in the driver code or the project code.

Consider the following result: https://llm-exp.oss-fuzz.com/Result-reports/ofg-pr/2024-09-29-655-d-cov-103-all/benchmark/output-htslib-sam_index_build2/index.html

The two results that crash have:

  • Triaging --> Driver for both
  • Diagnosis --> Semantics vs non-semantic
  • Crashes --> True for both

Both are clearly false positives, but "Bug" is set to opposite values for the two of them.

Bug, in the report template, is defined as:

<td style="color: {{ 'red' if sample.result.crashes and not sample.result.is_semantic_error else 'black' }}">{{ sample.result.crashes and not sample.result.is_semantic_error }}</td>

i.e. sample.result.crashes and not sample.result.is_semantic_error

So if the sample crashes and there is no semantic error, Bug becomes True and is colored red.

Based on the definition in the template, I think True means the crash is considered a valid bug. The color coding confuses me a bit, though -- my intuition would be to make it green if it is a true positive.

I think we should make a couple of improvements here:

  1. Rename Bug to be a bit more descriptive
  2. Add the classification logic to the core rather than in the web app itself (i.e. include sample.result.crashes and not sample.result.is_semantic_error in the core); a minimal sketch follows after this list
  3. Add the LLM-based triage verdict into the conclusion of whether a bug is a TP or FP
  4. Include all semantic validations in the crash triaging logic, and show them all in the UI. The logic starting here:
    symptom = SemanticCheckResult.extract_symptom(fuzzlog)
    crash_stacks = self._parse_stacks_from_libfuzzer_logs(lines)
    crash_func = self._parse_func_from_stacks(project_name, crash_stacks)
    crash_info = SemanticCheckResult.extract_crash_info(fuzzlog)
    only includes a single semantic validation. However, the semantic checks are not mutually exclusive, and in many cases it would be good to know them all, e.g. "the crash is a NULL-deref (if symptom == 'null-deref':), happens in the first iterations of the run (if lastround is None or lastround <= EARLY_FUZZING_ROUND_THRESHOLD:), and occurs in a trace close to the harness (if len(crash_stacks) > 0:)".
  5. Based on the above improvements, come up with a new definition of "True Positive vs False Positive", e.g. based on a more fine-grained scoring system
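
For point 2, here is a minimal sketch of what moving the classification out of the Jinja template and into the core could look like. The class and property names below are illustrative only, not the actual ones in the repo:

    import dataclasses

    @dataclasses.dataclass
    class Result:
        """Illustrative stand-in for the core result object."""
        crashes: bool = False
        is_semantic_error: bool = False

        @property
        def is_potential_true_positive(self) -> bool:
            # Same condition the template currently inlines:
            # crashes and not is_semantic_error.
            return self.crashes and not self.is_semantic_error

The template would then only render sample.result.is_potential_true_positive, and the renamed column (point 1) and any future scoring logic (point 5) could reuse the same property.
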
DavidKorczynski self-assigned this Sep 29, 2024
DonggeLiu commented Sep 30, 2024

Thanks @DavidKorczynski.
Adding @happy-qop and @fdt622 for visibility; they added the semantic analysis and bug triage.

Consider the following result: https://llm-exp.oss-fuzz.com/Result-reports/ofg-pr/2024-09-29-655-d-cov-103-all/benchmark/output-htslib-sam_index_build2/index.html
Both are clearly false positives, but "Bug" is set to opposite values for the two of them.

Yes, the contradiction is because they come from two separate judging systems:

  1. bug+diagnosis: This is from a list of known heuristics.
  2. Triage: This is purely from LLM.

Taking this case as an example:

  1. The heuristic-based method categorised the first one as a valid bug (Bug: True) because its heuristics did not capture any semantic error (NO_SEMANTIC_ERR). It categorised the second one as a false positive (Bug: False) because of a heuristic on immediate crashes.
  2. The LLM-based method correctly categorised both as FP.

We are improving on the LLM method (e.g., via agents).
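
For illustration, folding the LLM triage verdict into the TP/FP conclusion (point 3 in the issue) could look roughly like the sketch below. The function name and the 'driver'/'project' verdict strings are placeholders, not existing names in the repo:

    def conclude_true_positive(heuristic_says_bug: bool,
                               llm_triage_verdict: str) -> bool:
        """Hypothetical helper that combines both judging systems."""
        # Heuristic signal: the target crashes and no semantic error was detected.
        # LLM triage verdict: whether the crash is in the driver or the project.
        # Only report a true positive when both systems agree the bug is
        # in the project-under-test.
        return heuristic_says_bug and llm_triage_verdict == 'project'
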

  1. Rename Bug to be a bit more descriptive

Yep, we use 'bug' to represent a valid true positive bug in the project-under-test (not a false positive from the fuzz target). Please rename it as you see fit.

  2. Add the classification logic to the core rather than in the web app itself (i.e. include sample.result.crashes and not sample.result.is_semantic_error in the core)

Yes good point.
I reckon they are busy with other tasks at hand, so please feel free to change it if you happen to have time : )

  3. Add the LLM-based triage verdict into the conclusion of whether a bug is a TP or FP

IIUC, this is already reflected by the Triage column?

  4. Include all semantic validations in the crash triaging logic, and show them all in the UI.
  5. Based on the above improvements, come up with a new definition of "True Positive vs False Positive", e.g. based on a more fine-grained scoring system

Yep, we are working on that + some more heuristics + agents with access to bash & LLDB + some LLM tricks to be pushed later.
Eventually, the LLM will be given all the info a human needs to triage a bug.
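
As a strawman for the fine-grained scoring in point 5, each non-exclusive semantic signal from point 4 could contribute to a score instead of collapsing into a single boolean. The signal names and weights below are made up purely for illustration, not taken from the repo:

    # Illustrative only -- signal names and weights are hypothetical.
    FP_SIGNAL_WEIGHTS = {
        'null_deref': 0.3,          # symptom == 'null-deref'
        'early_crash': 0.4,         # lastround <= EARLY_FUZZING_ROUND_THRESHOLD
        'stack_near_harness': 0.3,  # crash frames close to the fuzz target
    }

    def false_positive_score(signals: dict) -> float:
        """Sum the weights of the semantic signals that fired."""
        return sum(weight for name, weight in FP_SIGNAL_WEIGHTS.items()
                   if signals.get(name))

    # e.g. a NULL-deref that crashes in an early round close to the harness
    # scores 1.0 and would be flagged as a likely false positive.
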
