alphaevolve evaluator explicit

<section style=“max-width:100%;margin:0 auto;padding:0 0 30px;font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif;font-size:16px;line-height:1.8!important;color:#333!important;background-color:#ffffff!important;word-wrap:break-word;”> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px; font-size:16px;color:#1a1a1a!important;”>自动驾驶被业界烧了快十年。Waymo、Cruise、特斯拉，加起来几百亿美金，L4 通用乘用至今难产。</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>还是同一拨公司、同一代模型，这一周却在另一个领域给出了一个具体数字——基因测序的变异检测错误率，被压低了 30%。干这件事的是 DeepMind 这一周发的 AlphaEvolve。</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>烧的钱差几个数量级。算力、模型、人没差。差别只在评估这一步——是 AI 自己说了算，还是另有一把外面的尺子说了算。</p> <figure style=“margin:20px 0;text-align:center;padding:0;”><img src=“images/01-two-roads.png” alt=“两条 AI 落地路线对比” style=“width:100%;max-width:100%;height:auto;display:block;margin:20px auto;border-radius:6px;”></figure> <h2 style=“font-size:18px;font-weight:700;color:#333!important;line-height:1.4!important;margin:30px 0 15px;padding-left:12px;border-left:4px solid #07c160;”>agent 这件事，多数公司还在自批卷子</h2> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>一个学生考完试，自己给自己批卷子，打 95 分。</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>放在学校里，这事是过不去的。不是怀疑学生不诚实，是没人会信「判官就是被判的人」打出来的分数。</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>但这一周市面上几乎所有"agent"，都在干这件事——让对话 agent 自己干活、自己评估干得好不好；或者交给另一个 LLM 来评估——卷子换了个老师批，但跟出题、考试的，还是同一拨人。</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>打开任何一份热门 agent 框架的介绍，看到的大致是这种：</p> <ul style=“margin:16px 0;padding-left:24px;;list-style-type:disc!important;list-style-position:outside;”> <li style=“margin:8px 0;line-height:1.8!important;color:#333!important; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>一个对话 agent 调更多工具</li> <li style=“margin:8px 0;line-height:1.8!important;color:#333!important; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>多个 agent 协作做更复杂的任务</li> <li style=“margin:8px 0;line-height:1.8!important;color:#333!important; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>agent 自己反思、自己规划、自己改 prompt</li> </ul> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>这些做法的共同点是：<strong style=“font-weight:700;color:#000!important;”>"什么算好" 这件事长在模型脑子里</strong>。模型自己跑、自己评估、自己迭代。所谓"评估"环节多半是 LLM-as-judge——说到底还是自批卷子。</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>AlphaEvolve 走相反方向。</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>它是 Gemini 驱动的演化型 coding agent。Gemini 反复改写候选代码，每改一版跑一次评分，分高的留下、分低的丢掉，留下的接着改。一代一代演化，跟自然选择一个套路。</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>关键不在 Gemini，关键在<strong style=“font-weight:700;color:#000!important;”>评分谁打</strong>。</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>DeepMind 的做法是——<strong style=“font-weight:700;color:#000!important;”>评分函数显式拆出来，挂在循环外边</strong>。不塞进模型，不让模型自评。这把尺子在模型外面：能写、能 debug，也能换成任何外部检测器。</p> <blockquote style=“margin:20px 0;padding:15px 20px;background-color:#f7f7f7!important;border-left:4px solid #07c160;color:#666!important;line-height:1.8!important; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”> <p style=“margin:15px 0!important;line-height:1.8!important;color:inherit!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>模型在循环里跑，尺子在循环外卡。</p> </blockquote> <h2 style=“font-size:18px;font-weight:700;color:#333!important;line-height:1.4!important;margin:30px 0 15px;padding-left:12px;border-left:4px solid #07c160;”>30% 不是模型变聪明了，是尺子换了一把</h2> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>AlphaEvolve 改了 Google Research 一个叫 DeepConsensus 的 DNA 测序纠错模型。变异检测错误率被压低了 30%。PacBio Senior Director Aaron Wenger 这样描述：</p> <blockquote style=“margin:20px 0;padding:15px 20px;background-color:#f7f7f7!important;border-left:4px solid #07c160;color:#666!important;line-height:1.8!important; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”> <p style=“margin:15px 0!important;line-height:1.8!important;color:inherit!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>“Google 团队用 AlphaEvolve 找到的方案，把我们测序仪的准确率提到了一个有意义的新高度。对研究者来说，更高质量的数据可能让我们看到以前藏在数据里的致病突变。”</p> </blockquote> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>这不是 demo 视频里"AI 帮我做了个 PPT" 那种话。是基因测序仪供应链上游在说"我的产品因为这件事变好了"——这种话从 PacBio 这个级别的公司嘴里说出来，每一个字都过了法务。</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>它做成的原因不是 Gemini 比上一代聪明 5%。是评估器：AI 改一版代码，跑一次 DeepConsensus，错误率降了就保留，没降就丢掉。<strong style=“font-weight:700;color:#000!important;”>评估器就是真世界的尺子</strong>。</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>同一套办法 DeepMind 还在数学猜想、电池材料、矩阵乘法、数据中心调度里跑通了。具体数字没披露，但同一个套路能在差异这么大的五个领域都跑通——evaluator 拆到循环外不是基因测序专属，换个领域照样能用。</p> <figure style=“margin:20px 0;text-align:center;padding:0;”><img src=“images/02-evaluator-loop.png” alt=“评估器拆在循环外的工作流程” style=“width:100%;max-width:100%;height:auto;display:block;margin:20px auto;border-radius:6px;”></figure> <h2 style=“font-size:18px;font-weight:700;color:#333!important;line-height:1.4!important;margin:30px 0 15px;padding-left:12px;border-left:4px solid #07c160;”>自动驾驶为什么不行——CASP 比它早 30 年答了同一道题</h2> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>回到自动驾驶。</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>最难的不是感知模型不够强，是没人能把"一次好驾驶" 显式拆成评估器。Disengagement、MPI、Route completion 都是 proxy，不是真世界尺子。模型越大，没有评估器照样在 corner case 里打转。</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>蛋白结构预测的故事正好相反。AlphaFold 能被 DeepMind 攻下来，不是因为模型聪明 5 倍，是因为 CASP 赛事 30 年下来已经把"什么是好的结构预测"写成了一组可量化指标——GDT_TS、TM-score。模型撞上去就能涨分。<strong style=“font-weight:700;color:#000!important;”>尺子在那儿等了 30 年</strong>。</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>基因测序的故事也一样。错误率是物理可测的，跑一次就出数字。</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”><strong style=“font-weight:700;color:#000!important;”>有尺子的领域，AI 会一直涨分。没尺子的，砸再多算力都进不去。</strong></p> <h2 style=“font-size:18px;font-weight:700;color:#333!important;line-height:1.4!important;margin:30px 0 15px;padding-left:12px;border-left:4px solid #07c160;”>对开发 AI agent 的启发</h2> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>通用 agent 这条路会继续跑。算力会更便宜、模型会更聪明、对话框会变成更复杂的工作流。</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>但 agent 真正落到产业里、在医疗、能源、芯片、材料里做出真指标的那条路，跟"对话框更聪明" 没关系。它跟"这个领域有没有一把能写下来的尺子" 有关。即是否有目标函数——只要能评估，AI agent 就能不断推进到最优路径。</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>所以我们做 AI agent，下面三个问题务必先搞清楚：</p> <ol style=“margin:16px 0;padding-left:24px;;list-style-type:decimal!important;list-style-position:outside;”> <li style=“margin:8px 0;line-height:1.8!important;color:#333!important; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”><strong style=“font-weight:700;color:#000!important;”>评估器是什么？</strong></li> <li style=“margin:8px 0;line-height:1.8!important;color:#333!important; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”><strong style=“font-weight:700;color:#000!important;”>它在循环里还是循环外？</strong></li> <li style=“margin:8px 0;line-height:1.8!important;color:#333!important; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”><strong style=“font-weight:700;color:#000!important;”>评估器的输入是真世界数据，还是另一个 LLM 的判断？</strong></li> </ol> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>第三个问题最难答。多数 agent 框架到这就卡住。</p> </section>