"A benchmark isn't a dataset. It's a triplet: dataset, model, judge. Swapping judges changed the actual ranking of frontier models. The harder the question, the more your benchmark score reflects judge competence instead of model competence."— @rryssf_,论 Omni-MATH 审计结果