Are we in a GPT-4-style leap that evals can't see?
martinalderson.com
Martin Alderson
2025-11-30 00:00

Gemini 3 Pro's design capabilities and Opus 4.5's reduced babysitting needs represent a subtle but significant leap that traditional benchmarks completely miss.