We wanted to know when an LLM is guessing versus when it actually knows the answer. LLMs expose logprobs: for every token they generate, you can request the top alternative tokens and their log probabilities. The entropy of that distribution is the signal: low entropy means the model was certain, high entropy means it was guessing. I tested two models at temperature 0.0 with three prompts: "what is 2+2", "the opposite of hot is", and "once upon a time". The results for gpt-4o-mini: 0.00, 0.35, 0.55. Math is certain and the story is uncertain, as expected. gpt-4.1-nano: 0.39,
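
To make the entropy measure concrete, here is a minimal sketch of how it could be computed from the top alternatives returned for each token. The post does not say how its numbers were derived, so everything here is an assumption: the function names `token_entropy` and `mean_entropy` are hypothetical, the entropy is in nats, the top-k probabilities are renormalized to sum to 1 (the API only returns the top few alternatives, so some probability mass is missing), and the per-prompt score is taken as the mean over generated tokens.

```python
import math


def token_entropy(top_logprobs):
    """Shannon entropy (in nats) over the top-k alternatives for one token.

    `top_logprobs` is a list of log probabilities, one per alternative.
    Assumption: the truncated top-k distribution is renormalized to sum to 1.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    if total <= 0.0:
        raise ValueError("top_logprobs must contain at least one finite value")
    probs = [p / total for p in probs]
    # Terms with p == 0 contribute nothing to the entropy, so skip them.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(per_token_top_logprobs):
    """Average token entropy over a whole completion (assumed aggregation)."""
    entropies = [token_entropy(t) for t in per_token_top_logprobs]
    return sum(entropies) / len(entropies)


if __name__ == "__main__":
    # Illustrative inputs, not measured data.
    certain = [0.0]                                 # one alternative with p = 1
    coin_flip = [math.log(0.5), math.log(0.5)]      # two equally likely tokens
    print(token_entropy(certain))
    print(token_entropy(coin_flip))
    print(mean_entropy([certain, coin_flip]))
```

A token with a single dominant alternative scores near 0, while two equally likely alternatives score ln 2 (about 0.69), which matches the "certain versus guessing" reading above. Whether to renormalize, and whether to average over tokens or look only at the first one, are design choices that change the absolute numbers.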

LLM guesses or knows
Alex
