The "Bug" column for benchmark results contains some odd values, which makes its meaning hard to understand. I also think the name "Bug" is too vague in this context. The column is supposed to indicate whether the bug is in the driver code or the project code.
<td style="color: {{ 'red' if sample.result.crashes and not sample.result.is_semantic_error else 'black' }}">{{ sample.result.crashes and not sample.result.is_semantic_error }}</td>
i.e. sample.result.crashes and not sample.result.is_semantic_error
So if the sample crashes and there is no semantic error, Bug becomes True and is colored red.
Based on the definition in the template, I think True means the crash is considered a valid bug. The color coding confuses me a bit, though: my intuition would be to show it in green if it is a true positive.
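To make the current behaviour concrete, the template expression can be restated as a small Python function and evaluated for every input combination. This is only an illustrative sketch: the function name bug_column is hypothetical, and only the two attribute names come from the template.

```python
from itertools import product


def bug_column(crashes: bool, is_semantic_error: bool) -> bool:
    """Mirror the template expression: crashes and not is_semantic_error."""
    return crashes and not is_semantic_error


# Enumerate all four input combinations to see when the column reads True.
table = {
    (crashes, semantic): bug_column(crashes, semantic)
    for crashes, semantic in product([False, True], repeat=2)
}

for (crashes, semantic), bug in table.items():
    print(f"crashes={crashes!s:<5} is_semantic_error={semantic!s:<5} -> Bug={bug}")
```

Only one of the four combinations (a crash with no semantic error) yields True, which is the case the template colors red.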
I think we should do a couple of improvements here:
Rename Bug to be a bit more descriptive
Add the classification logic to the core rather than in the web app itself (i.e. include sample.result.crashes and not sample.result.is_semantic_error in the core)
Add the LLM-based triage verdict to the conclusion of whether a bug is a TP or FP
Include all semantic validations in the crash triaging logic, and show them all in the UI. The logic starting here:
only includes a single semantic validation. However, the semantic checks are not mutually exclusive, and in many cases it would be good to know them all, e.g. "the crash is a NULL-deref (
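One way to report every semantic check that fires, rather than only the first, is to accumulate the hits in a flag set. The sketch below is an assumption-laden illustration: SemanticCheck, CrashInfo, their fields, and the thresholds are all hypothetical and are not taken from builder_runner.py.

```python
from dataclasses import dataclass
from enum import Flag, auto


class SemanticCheck(Flag):
    """Hypothetical check names; the real set would live in the core."""
    NONE = 0
    NULL_DEREF = auto()
    IMMEDIATE_CRASH = auto()


@dataclass
class CrashInfo:
    """Hypothetical summary of one crash; fields are illustrative only."""
    crash_address: int
    executions_before_crash: int


def run_all_checks(info: CrashInfo) -> SemanticCheck:
    """Evaluate every check and OR the hits together instead of returning early."""
    found = SemanticCheck.NONE
    if info.crash_address == 0:
        found |= SemanticCheck.NULL_DEREF
    if info.executions_before_crash == 0:
        found |= SemanticCheck.IMMEDIATE_CRASH
    return found


def check_names(found: SemanticCheck) -> list[str]:
    """List the individual checks contained in a combined flag value."""
    return [c.name for c in SemanticCheck if c.value and c in found]


result = run_all_checks(CrashInfo(crash_address=0, executions_before_crash=0))
print(check_names(result))
```

Because the checks are combined with bitwise OR, the UI could list all of them for a sample instead of a single diagnosis.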
Yes, their contradiction is because they use two separate judging systems:
bug+diagnosis: This is from a list of known heuristics.
Triage: This is purely from LLM.
Taking this case as an example:
The heuristic-based method categorised the first one as a valid bug (Bug: True), because its heuristics did not capture any semantic error (NO_SEMANTIC_ERR). It categorised the second one as not a bug (Bug: False) because of a heuristic on immediate crashes.
The LLM-based method correctly categorised both as FP.
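The disagreement between the two judging systems can be made explicit with a small comparison. This is a sketch under assumptions: the Verdicts class and the agree helper are hypothetical, and treating "Driver" as the triage value that means "false positive in the fuzz target" is inferred from the example rather than confirmed by the code.

```python
from dataclasses import dataclass


@dataclass
class Verdicts:
    heuristic_bug: bool  # value shown in the "Bug" column
    llm_triage: str      # value shown in the triage column (assumed: "Driver" means FP)


def agree(v: Verdicts) -> bool:
    """True when the heuristic and the LLM reach the same TP/FP conclusion."""
    llm_says_project_bug = v.llm_triage != "Driver"
    return v.heuristic_bug == llm_says_project_bug


# The two crashing samples from the example: the LLM blames the driver in both.
first = Verdicts(heuristic_bug=True, llm_triage="Driver")
second = Verdicts(heuristic_bug=False, llm_triage="Driver")

for name, v in (("first", first), ("second", second)):
    print(name, "agree" if agree(v) else "CONFLICT")
```

Flagging conflicting rows like this would at least tell a reader that the two columns come from independent sources.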
We are improving on the LLM method (e.g., via agents).
Rename Bug to be a bit more descriptive
Yep we use 'bug' to represent a valid true positive bug in the project-under-test (not a false positive from the fuzz target). Please rename it as you see fit.
Add the classification logic to the core rather than in the web app itself (i.e. include sample.result.crashes and not sample.result.is_semantic_error in the core)
Yes good point.
I reckon they are busy with other tasks at hand, so please feel free to change it if you happen to have time : )
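Moving the classification into the core could look roughly like the following. It is a minimal sketch, not the project's actual class: RunResult and the property name is_true_positive_crash are hypothetical, and only crashes and is_semantic_error are names taken from the template.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Hypothetical stand-in for the core result object."""
    crashes: bool = False
    is_semantic_error: bool = False

    @property
    def is_true_positive_crash(self) -> bool:
        """Single source of truth for what the report currently labels 'Bug'."""
        return self.crashes and not self.is_semantic_error


crash_without_semantic_error = RunResult(crashes=True, is_semantic_error=False)
crash_with_semantic_error = RunResult(crashes=True, is_semantic_error=True)
print(crash_without_semantic_error.is_true_positive_crash)
print(crash_with_semantic_error.is_true_positive_crash)
```

With a property like this, the template would read one descriptively named attribute instead of repeating the boolean expression for both the cell value and its color.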
Add the LLM-based triage verdict to the conclusion of whether a bug is a TP or FP
IIUC, I think this is already reflected by the triage column?
Include all semantic validations in the crash triaging logic, and show them all in the UI.
Based on the above improvements, come up with a new definition of "True Positive vs False Positive", e.g. based on a more fine-grained scoring system
Yep, we are working on that + some more heuristics + agents with access to bash & LLDB + some LLM tricks to be pushed later.
Eventually, the LLM will be given all the info that a human needs to triage a bug.
Consider the following result: https://llm-exp.oss-fuzz.com/Result-reports/ofg-pr/2024-09-29-655-d-cov-103-all/benchmark/output-htslib-sam_index_build2/index.html
The two results that crash have:
Driver for both
True for both
Both are clearly false positives, but "Bug" is set to opposite values for them.
In the report, "Bug" is defined in oss-fuzz-gen/report/templates/benchmark.html (line 45 in d26a523) as sample.result.crashes and not sample.result.is_semantic_error.
Code referenced in the suggestions above, all in oss-fuzz-gen/experiment/builder_runner.py at d26a523: lines 362 to 365, line 370, line 410, and line 419.