Pinned
-
michigan-bird-evals (Public)
40-scenario LLM benchmark for Michigan bird presence: stratified species sampling, probability + confidence dual scoring, eBird ground truth.
Python
-
bird-taxonomy-evals (Public)
LLM calibration benchmark for taxonomic hierarchy consistency.
Python
-
florida-weather-evals (Public)
LLM calibration benchmark: 18-scenario Florida rainfall eval with specificity + seasonal gradients, CRPS scoring, and ground-truth-free self-consistency checks.
Python