<section style=“max-width:100%;margin:0 auto;padding:0 0 30px;font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif;font-size:16px;line-height:1.8!important;color:#333!important;background-color:#ffffff!important;word-wrap:break-word;”> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px; font-size:16px;color:#1a1a1a!important;”>The industry has been burning money on autonomous driving for nearly a decade. Waymo, Cruise, and Tesla have together spent tens of billions of dollars, and general-purpose L4 passenger driving still has not arrived.</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>The same set of companies, with the same generation of models, put a concrete number on the table this week in a different field: the variant-detection error rate in gene sequencing was cut by 30%. The system behind it is AlphaEvolve, which DeepMind released this week.</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>The money spent differs by several orders of magnitude. The compute, the models, and the people do not. The only difference is in the evaluation step: whether the AI gets the final say itself, or an external yardstick does.</p> <figure style=“margin:20px 0;text-align:center;padding:0;”><img src=“images/01-two-roads.png” alt=“Two routes for deploying AI compared” style=“width:100%;max-width:100%;height:auto;display:block;margin:20px auto;border-radius:6px;”></figure> <h2 style=“font-size:18px;font-weight:700;color:#333!important;line-height:1.4!important;margin:30px 0 15px;padding-left:12px;border-left:4px solid #07c160;”>With agents, most companies are still grading their own exams</h2> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>A student finishes an exam, grades their own paper, and gives it a 95.</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; 
font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>In a school, that would never fly. Not because anyone suspects the student of dishonesty, but because nobody trusts a score when the judge is also the one being judged.</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>Yet nearly every "agent" on the market this week is doing exactly that: a conversational agent does the work and then evaluates its own work. Or the evaluation is handed to another LLM. The paper gets a different grader, but the grader comes from the same crowd that wrote the questions and sat the exam.</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>Open the introduction to any popular agent framework and you will see roughly this:</p> <ul style=“margin:16px 0;padding-left:24px;;list-style-type:disc!important;list-style-position:outside;”> <li style=“margin:8px 0;line-height:1.8!important;color:#333!important; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>One conversational agent calling more tools</li> <li style=“margin:8px 0;line-height:1.8!important;color:#333!important; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>Multiple agents collaborating on more complex tasks</li> <li style=“margin:8px 0;line-height:1.8!important;color:#333!important; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>Agents that reflect, plan, and rewrite their own prompts</li> </ul> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>What these approaches share: <strong style=“font-weight:700;color:#000!important;”>the definition of "good" lives inside the model's head</strong>. The model runs, evaluates, and iterates on its own. The so-called "evaluation" step is mostly LLM-as-judge, which in the end is still grading your own exam.</p> <p style=“margin:15px 
0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>AlphaEvolve goes the opposite way.</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>It is an evolutionary coding agent driven by Gemini. Gemini repeatedly rewrites candidate code; each revision is scored, high scorers are kept, low scorers are discarded, and the survivors are revised again. It evolves generation by generation, the same playbook as natural selection.</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>The key is not Gemini. The key is <strong style=“font-weight:700;color:#000!important;”>who assigns the score</strong>.</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>DeepMind's approach: <strong style=“font-weight:700;color:#000!important;”>the scoring function is split out explicitly and sits outside the loop</strong>. It is not stuffed into the model, and the model does not grade itself. The yardstick lives outside the model: you can write it, debug it, or swap it for any external checker.</p> <blockquote style=“margin:20px 0;padding:15px 20px;background-color:#f7f7f7!important;border-left:4px solid #07c160;color:#666!important;line-height:1.8!important; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”> <p style=“margin:15px 0!important;line-height:1.8!important;color:inherit!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>The model runs inside the loop; the yardstick gates it from outside.</p> </blockquote> <h2 style=“font-size:18px;font-weight:700;color:#333!important;line-height:1.4!important;margin:30px 0 15px;padding-left:12px;border-left:4px solid #07c160;”>The 30% is not a smarter model, it is a different yardstick</h2> <p style=“margin:15px 
0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>AlphaEvolve modified DeepConsensus, a DNA sequencing error-correction model from Google Research. The variant-detection error rate dropped by 30%. Aaron Wenger, Senior Director at PacBio, described it this way:</p> <blockquote style=“margin:20px 0;padding:15px 20px;background-color:#f7f7f7!important;border-left:4px solid #07c160;color:#666!important;line-height:1.8!important; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”> <p style=“margin:15px 0!important;line-height:1.8!important;color:inherit!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>“The solution the Google team found with AlphaEvolve raised the accuracy of our sequencers to a meaningful new level. For researchers, higher-quality data may reveal disease-causing mutations that were previously hidden in the data.”</p> </blockquote> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>This is not the kind of line you hear in a demo video, like "AI made a slide deck for me". It is an upstream player in the gene-sequencer supply chain saying "my product got better because of this". When a company of PacBio's stature says that, every word has been through legal.</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>It worked not because Gemini is 5% smarter than the previous generation, but because of the evaluator: the AI changes the code, DeepConsensus runs once, and the change is kept if the error rate drops and discarded if it does not. <strong style=“font-weight:700;color:#000!important;”>The evaluator is the real-world yardstick</strong>.</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>DeepMind has also made the same method work on mathematical conjectures, battery materials, matrix multiplication, and data-center scheduling. Specific numbers were not disclosed, but the same recipe working across five such different fields shows that moving the evaluator outside the loop is not specific to gene sequencing; it carries over to other fields.</p> 
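<p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>The "propose, score, keep or discard" loop described above can be sketched in a few lines. This is only a toy illustration under loud assumptions, not AlphaEvolve's actual implementation: the names external_evaluator, propose, and evolve are hypothetical, a random mutation stands in for the model rewriting code, and a squared distance to a fixed target stands in for a measured error rate. The one thing it is meant to show is that the scoring function is passed in from outside and the proposer never scores itself.</p>

```python
import random
from typing import Callable, List

Candidate = List[float]


def external_evaluator(candidate: Candidate) -> float:
    """Hypothetical yardstick: lower is better, like an error rate.

    It lives outside the search loop and knows nothing about the proposer.
    Here it is just a toy squared distance to a fixed target.
    """
    target = [3.0, -1.0, 2.0]
    return sum((c - t) ** 2 for c, t in zip(candidate, target))


def propose(parent: Candidate, rng: random.Random) -> Candidate:
    """Stand-in for the model rewriting a candidate (random mutation here)."""
    return [c + rng.gauss(0.0, 0.5) for c in parent]


def evolve(
    seed: Candidate,
    evaluate: Callable[[Candidate], float],
    generations: int = 200,
    children_per_generation: int = 8,
    rng_seed: int = 0,
) -> Candidate:
    """Keep a child only when the external evaluator scores it as better."""
    rng = random.Random(rng_seed)
    best, best_score = seed, evaluate(seed)
    for _ in range(generations):
        for _ in range(children_per_generation):
            child = propose(best, rng)
            score = evaluate(child)  # the score comes from outside the proposer
            if score < best_score:   # error went down: keep this version
                best, best_score = child, score
            # otherwise the child is discarded
    return best


start = [0.0, 0.0, 0.0]
result = evolve(start, external_evaluator)
print(external_evaluator(start), external_evaluator(result))
```

<p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>Because evolve only takes evaluate as an argument, the yardstick can be debugged or replaced without touching the proposer. In this sketch, swapping in an LLM-as-judge would mean passing a different evaluate function, which is exactly the choice the article is warning about.</p>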
<figure style=“margin:20px 0;text-align:center;padding:0;”><img src=“images/02-evaluator-loop.png” alt=“Workflow with the evaluator split outside the loop” style=“width:100%;max-width:100%;height:auto;display:block;margin:20px auto;border-radius:6px;”></figure> <h2 style=“font-size:18px;font-weight:700;color:#333!important;line-height:1.4!important;margin:30px 0 15px;padding-left:12px;border-left:4px solid #07c160;”>Why autonomous driving is stuck: CASP answered the same question 30 years earlier</h2> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>Back to autonomous driving.</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>The hardest part is not that perception models are too weak. It is that nobody can explicitly turn "a good drive" into an evaluator. Disengagements, MPI, and route completion are all proxies, not real-world yardsticks. However large the model gets, without an evaluator it keeps going in circles on corner cases.</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>Protein structure prediction is the opposite story. DeepMind cracked it with AlphaFold not because the model was 5 times smarter, but because roughly three decades of CASP competitions had already written down "what counts as a good structure prediction" as a set of quantifiable metrics: GDT_TS and TM-score. A model that targets those metrics can climb the score. <strong style=“font-weight:700;color:#000!important;”>The yardstick had been waiting there for roughly 30 years</strong>.</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>Gene sequencing is the same story. Error rate is physically measurable: run it once and you get a number.</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”><strong style=“font-weight:700;color:#000!important;”>In fields that have a yardstick, AI will keep raising its score. In fields that do not, no amount of compute gets it through the door.</strong></p> <h2 style=“font-size:18px;font-weight:700;color:#333!important;line-height:1.4!important;margin:30px 0 15px;padding-left:12px;border-left:4px solid #07c160;”>What this means for building AI agents</h2> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>The general-purpose agent route will keep going. Compute will get cheaper, models will get smarter, and the chat box will turn into more complex workflows.</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>But the route where agents actually land in industry and deliver real metrics in healthcare, energy, chips, and materials has nothing to do with "a smarter chat box". It has to do with "whether the field has a yardstick that can be written down", that is, whether there is an objective function. As long as results can be evaluated, an AI agent can keep improving against that objective.</p> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>So when we build AI agents, three questions must be answered first:</p> <ol style=“margin:16px 0;padding-left:24px;;list-style-type:decimal!important;list-style-position:outside;”> <li style=“margin:8px 0;line-height:1.8!important;color:#333!important; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”><strong style=“font-weight:700;color:#000!important;”>What is the evaluator?</strong></li> <li style=“margin:8px 0;line-height:1.8!important;color:#333!important; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”><strong style=“font-weight:700;color:#000!important;”>Is it inside the loop or outside it?</strong></li> <li style=“margin:8px 0;line-height:1.8!important;color:#333!important; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”><strong 
style=“font-weight:700;color:#000!important;”>Is the evaluator's input real-world data, or another LLM's judgment?</strong></li> </ol> <p style=“margin:15px 0!important;line-height:1.8!important;color:#333!important;text-align:justify; font-family:-apple-system,BlinkMacSystemFont,"PingFang SC","Hiragino Sans GB","Microsoft YaHei",sans-serif; font-size:16px;”>The third question is the hardest to answer. Most agent frameworks get stuck right here.</p> </section>