Score Calculation


We evaluated 57 large language models on the hallucination benchmark of 200 fabricated binomials. Frontier systems confined moderate or extensive hallucinations to below one percent of prompts (0-0.5%), whereas the weakest models produced them on nearly two-thirds of queries (about 65%). Counting self-declared limited answers as partial hallucinations widens the range to roughly 0.5-92%. This spread motivates the simple scoring scheme illustrated in the example above.

To compare models on a single scale we devised a "hallucination-avoidance" score, as illustrated in the score-calculation example. Each response earns three points if the model refuses or returns NA, two points for a self-declared limited answer, one point for intermediate, and zero when it claims extensive knowledge. We then sum those points for every fictional name and divide by the number of prompts, yielding an average that ranges from 0 (pure fantasy) to 3 (perfect restraint). Thus a model that usually says "I don't know" or "limited" on fake species will hover near the high end of the scale, while one that invents rich detail will drift toward zero.
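To make the arithmetic concrete, here is a minimal sketch of the score computation in Python. It is an illustration under stated assumptions, not our evaluation pipeline: the category labels ("refusal", "limited", "intermediate", "extensive") and the function name avoidance_score are hypothetical stand-ins for however responses are tagged in practice.

# Points per response category, as defined above: refusal/NA earns 3,
# a self-declared limited answer 2, intermediate 1, extensive 0.
POINTS = {
    "refusal": 3,       # model refuses or returns NA
    "limited": 2,       # model declares limited knowledge
    "intermediate": 1,  # intermediate claimed knowledge
    "extensive": 0,     # model claims extensive knowledge
}

def avoidance_score(labels):
    """Mean points over all prompts: 0 (pure fantasy) to 3 (perfect restraint)."""
    if not labels:
        raise ValueError("no responses to score")
    return sum(POINTS[label] for label in labels) / len(labels)

# Example: four fabricated binomials -> (3 + 3 + 2 + 1) / 4 = 2.25
print(avoidance_score(["refusal", "refusal", "limited", "intermediate"]))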