I A/B tested 4 LLMs on the same 500 queries. The results surprised me.

I see a lot of claims about which model is "best." Best at what? For whom? At what cost? I got tired of guessing, so I ran my own comparison.

The setup

I took 500 real queries from my production logs, a mix of:

- Code generation (120 queries)
- Document summarization (150 queries)
- Question answering (180 queries)
- Creative writing (50 queries)

I ran each query through four models using the same prompt, same temperature (0.7), same everything.

The models:

- DeepSeek-V4 Pro
- Kimi 2.6
- MiniMax 2.7
- Qwen3 23
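A harness for this kind of comparison can be sketched in a few lines of Python. This is an illustration under assumptions, not the script used for the test: `call_model` is a hypothetical function you would supply to wrap each provider's client, `Query` and `Result` are made-up record types, and the model labels are copied from the list above as written, so real API model IDs will differ. The point it shows is that every query goes to every model with an identical prompt and temperature.

```python
from dataclasses import dataclass
from typing import Callable, List

# Labels copied from the post as written; real API model IDs will differ.
MODELS = ["DeepSeek-V4 Pro", "Kimi 2.6", "MiniMax 2.7", "Qwen3 23"]
TEMPERATURE = 0.7  # same temperature for every model

# Category sizes from the post (they add up to 500 queries).
CATEGORY_COUNTS = {
    "code_generation": 120,
    "summarization": 150,
    "question_answering": 180,
    "creative_writing": 50,
}


@dataclass
class Query:
    category: str
    prompt: str


@dataclass
class Result:
    model: str
    category: str
    prompt: str
    output: str


# Hypothetical signature: (model, prompt, temperature) -> completion text.
CallModel = Callable[[str, str, float], str]


def run_comparison(queries: List[Query], call_model: CallModel) -> List[Result]:
    """Send every query to every model with identical settings."""
    results: List[Result] = []
    for query in queries:
        for model in MODELS:
            output = call_model(model, query.prompt, TEMPERATURE)
            results.append(Result(model, query.category, query.prompt, output))
    return results


def fake_call_model(model: str, prompt: str, temperature: float) -> str:
    """Stand-in for a real API call so the sketch runs offline."""
    return f"[{model} @ T={temperature}] {prompt}"


if __name__ == "__main__":
    # Placeholder prompts; in practice these would come from production logs.
    queries = [
        Query(category, f"{category} query {i}")
        for category, count in CATEGORY_COUNTS.items()
        for i in range(count)
    ]
    results = run_comparison(queries, fake_call_model)
    print(len(queries), "queries ->", len(results), "model responses")
```

Passing `call_model` in as an argument keeps the loop identical for all four models, so any provider-specific differences stay inside that one wrapper rather than leaking into the comparison itself.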