Why we run two scoring tracks (LLM + MediaPipe) for our AI face-rating tool

A user ran the same photo through our face-rating tool five times in a row and emailed us the scores: 6.2, 7.5, 6.8, 7.1, 5.9. That's a 1.6-point range (±0.8 around the midpoint) on identical input. That email was the death of single-LLM scoring for us.

This is a short post about the architecture decision we ended up making: running two parallel scoring tracks and using the geometric one as an anchor against LLM hallucination.

The variance problem

Subjective face scoring with an LLM is fundamentally non-deterministic. Ea