VIEW 1 · A QUESTION'S FULL JOURNEY

See how a question gets predicted

Pick a question, hit ▶ Start, and watch 5 AI workers take turns. When each worker is busy, a thought bubble shows what they're thinking. After the cycle, the system learns from this prediction — next time it'll do better on a similar question.


📦 processed: 1042
🎯 recent accuracy: 88.2%
📘 learned lessons: 23
👋
  1. Pick or edit a question
  2. Hit ▶ Start
  3. Watch 5 AI workers → get probability
this cycle · id #0143 · elapsed +00.000s · cost $0.0000 · shortcuts: Space = start, R = reset
5 AI workers are ready, just hit Start →
QUESTION INTAKE
? Question Intake
Type or pick a question. The system uses a classifier (doubao-flash) to tag its family (rates / inflation / commodity / …) before passing it down the pipeline.
auto-classified as fomc-rate (rates) · confidence 94%
Pick a question, or edit it below
window: 2026-05-19 → 2026-06-02
price threshold: $2384.50
source: hand-written template
The first visit auto-plays once.
5 AI workers take turns
? The Pipeline
Each card below is one AI worker (Splitter → Researchers → Combiner → Skeptic → Fact Checker). Cards light up cyan when working, green when done. Hover any card for that role's job description. Bubbles above show the worker's "thoughts" while they work.
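The relay described above can be sketched as a sequential runner that moves each worker through queued → working → done. This is only an illustration: the `Stage` class, `run_pipeline`, and the stub handlers are hypothetical names, and the real workers call models rather than stubs.

```python
from dataclasses import dataclass
from typing import Callable, List

# Worker order taken from the pipeline description; everything else is illustrative.
STAGE_NAMES = ["Splitter", "Researchers", "Combiner", "Skeptic", "Fact Checker"]


@dataclass
class Stage:
    name: str
    handler: Callable[[dict], dict]  # hypothetical: takes and returns a shared state dict
    status: str = "queued"           # queued -> working -> done


def run_pipeline(question: str, stages: List[Stage], log: List[str]) -> dict:
    """Run each stage in order and record status changes in `log`."""
    state = {"question": question}
    for stage in stages:
        stage.status = "working"
        log.append(f"{stage.name}: working")
        state = stage.handler(state)
        stage.status = "done"
        log.append(f"{stage.name}: done")
    return state


def make_stub(name: str) -> Callable[[dict], dict]:
    """Stub handler that only records that the stage ran."""
    def handler(state: dict) -> dict:
        visited = state.get("visited", []) + [name]
        return {**state, "visited": visited}
    return handler


stages = [Stage(n, make_stub(n)) for n in STAGE_NAMES]
events: List[str] = []
result = run_pipeline("Will the event happen?", stages, events)
```

Each status change maps to a card color on the page (working = cyan, done = green), and `events` plays the role of the live log.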
about 14s
step 0/5 · live · version v0.7
what the worker is doing
? Worker close-up
This panel shows the currently active worker's detailed state: drivers being split, RAG progress per sub-agent, recalled lessons, critic's verdict, etc.
idle…
Hit Start, or pick a question to run.
live log
? Live log
Time-stamped events as they happen: stage start/done, RAG retrievals, critic tags. New rows appear at the top. Read it like a terminal log.
OUTPUT
? The output
The big number on the left is the probability that the event happens. The 4 small cards on the right help you judge how far to trust it: the historic average error on same-family questions (lower is better), the prediction-market price, how much the critic adjusted the estimate, and which fact checks passed.
probability of event
0.00
±0.00
90% confidence interval [0.00 – 0.00]
historic avg (same family)
0.241 · lower is better
prediction-market price
0.38 · Δ vs market +0.09
critic adjustment
−0.06
fact check
base-rate ✓   facts 7/7 ✓   market Δ +0.05 ⚠
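The three rule-based checks on this card could look roughly like the sketch below. The page only says the checks cover the base rate, the facts, and the gap to the market; the function name, the `CheckResult` fields, and both thresholds are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    base_rate_ok: bool
    facts_ok: bool
    market_gap: float
    market_warn: bool


def fact_check(prob: float, base_rate: float, facts_passed: int, facts_total: int,
               market_price: float, max_base_rate_gap: float = 0.30,
               market_gap_warn: float = 0.05) -> CheckResult:
    """Deterministic checks; both thresholds are assumed values, not the real ones."""
    market_gap = round(prob - market_price, 4)
    return CheckResult(
        base_rate_ok=abs(prob - base_rate) <= max_base_rate_gap,
        facts_ok=facts_passed == facts_total,
        market_gap=market_gap,
        market_warn=abs(market_gap) >= market_gap_warn,
    )


# Illustrative inputs only (the base rate of 0.40 is made up for the example).
res = fact_check(prob=0.47, base_rate=0.40, facts_passed=7, facts_total=7,
                 market_price=0.38)
```

Because the checker is plain arithmetic with no model call, the same inputs always give the same flags.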
prediction #1043 complete
Want to see the whole system?
What you just saw is one question's journey. But that's only half of how the system works — the other half runs in the background: it turns each prediction's log into lessons, and every 50 questions it uses those lessons to train itself so the next question is more accurate.
The next page paints all of this on a single flow diagram.
working
done
attention
queued
Hover any AI worker card to see what they do
VIEW 2 · DASHBOARD

History · single-question trace · training timeline

Click any past question → the middle column expands its 5-stage trace + dual-track critique + recalled lessons; the right column simultaneously highlights "which factory version processed this question" (the prompt lineage line).


rolling Brier (last 90d)
0.118 · ↓ 0.027 vs market (better than market)
resolved 968 / total 1042
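For a binary event, the Brier score is the mean squared difference between the forecast probability and the outcome (0 or 1), so lower is better. A minimal sketch of a 90-day rolling version follows; the record layout and function names are assumptions, and the sample data is invented.

```python
from datetime import date, timedelta
from typing import List, Tuple

# (resolution date, predicted probability, outcome as 0 or 1)
Record = Tuple[date, float, int]


def brier(records: List[Record]) -> float:
    """Mean squared error between forecast probability and binary outcome."""
    if not records:
        raise ValueError("no resolved questions")
    return sum((p - o) ** 2 for _, p, o in records) / len(records)


def rolling_brier(records: List[Record], today: date, days: int = 90) -> float:
    """Brier score over questions resolved within the last `days` days."""
    cutoff = today - timedelta(days=days)
    recent = [r for r in records if cutoff <= r[0] <= today]
    return brier(recent)


# Made-up sample data for illustration.
sample = [
    (date(2026, 1, 5), 0.9, 1),   # older than 90 days, excluded
    (date(2026, 4, 1), 0.8, 1),
    (date(2026, 5, 1), 0.3, 0),
]
score = rolling_brier(sample, today=date(2026, 5, 19))
```

Only resolved questions can be scored, which is why the panel reports the resolved count separately from the total.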
💡
How to read this page (3 interactions)
  1. Click any past question on the left → the middle column instantly opens its 5-stage trace + dual-track critique + recalled lessons.
  2. Each timeline node marked "click for diff →" expands an inline red/green prompt diff (first 6 lines); "Open in Training view" shows the full diff.
  3. The right-column amber vertical line = the factory version that processed this question. Switch questions → the line jumps — that's the "prompt lineage line".
HISTORY
filter
id · family · p · brier · st
family Brier (90d)
TRACE · Q#1042
processed-by prompt-pack v0.7
critic verdict · doubao-thinking
verifier flags · rule-engine v0.14
recalled procedural insights · semantic + family-match
TRAINING TIMELINE
prompt-pack versions · slow loop
The amber line marks the prompt-pack used for the selected question. Click any node to see its diff.
💡
How to read the three loops
  1. Color = trigger frequency: cyan runs every prediction · violet distills every 20 questions · amber retrains every 50 questions. Three rhythms an order of magnitude apart.
  2. Top-down = data flow: fast loop writes critique → mid loop distills lessons → slow loop rewrites prompts.
  3. The amber arrow at the bottom flows backward: after DSPy + TextGrad rewrites the prompts, new versions are "deployed" to the 4 fast-loop agents — that's the physical act of "the system getting better".
  4. Hover any node to see its role / model / last-modified time.
VIEW 3 · CONCEPT ATLAS

Three loops · who rewrites whom

The system has three loops: fast (every prediction) + mid (Reflector distills every 20 questions) + slow (DSPy + TextGrad retrains every 50 questions). Arrow direction + color = data flow / training flow. Hover a node for its agent card.


three-loop cadence
fast: every prediction · pure inference
mid: every 20 · Reflector distills
slow: every 50 · DSPy + TextGrad
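The cadence above can be illustrated with a simple counter. This sketch assumes the triggers are plain modulo checks on the number of processed questions; the real scheduler may count only resolved questions or add a time-based trigger, and the function name is hypothetical.

```python
from typing import List

MID_EVERY = 20   # Reflector distills lessons
SLOW_EVERY = 50  # prompt retrain


def loops_triggered(processed_count: int) -> List[str]:
    """Return which loops fire after `processed_count` questions (assumed modulo rule)."""
    fired = ["fast"]  # the fast loop runs on every prediction
    if processed_count > 0 and processed_count % MID_EVERY == 0:
        fired.append("mid")
    if processed_count > 0 and processed_count % SLOW_EVERY == 0:
        fired.append("slow")
    return fired
```

Under this rule the mid and slow loops coincide every 100 questions, the least common multiple of 20 and 50.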
FAST LOOP · runs every prediction · no prompt change
data flow: question → P(event)
every cycle: Critic + Verifier write to ExperienceStore (episodic layer)
↳ Synthesizer recalls procedural insights as context (family-match + semantic top-K)
After every question → dual-track critique (LLM-critic + det-verifier) is written to ExperienceStore.episodic
accumulate 20 → trigger mid loop ↓
MID LOOP · every 20 resolved questions · Reflector distills procedural insights
training flow: critique → procedural insight
Reflector outputs 4 kinds of insight (written to ExperienceStore.procedural, recalled later by Decomposer / Synthesizer): thought-chain-template · checklist-item · driver-taxonomy · common-error
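Recall by "family-match + semantic top-K" can be sketched as a filter followed by a similarity ranking. The `Insight` fields, the `recall` function, the toy 2-dimensional embeddings, and the insight texts are all invented for illustration; this sketch also assumes family-match is a hard filter, whereas the real store might use it only as a ranking boost.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Insight:
    kind: str                   # one of the 4 kinds listed above
    family: str
    text: str
    embedding: Sequence[float]  # toy vector standing in for a real embedding


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def recall(store: List[Insight], family: str,
           query_embedding: Sequence[float], k: int = 3) -> List[Insight]:
    """Keep insights of the same family, rank by cosine similarity, return the top k."""
    candidates = [i for i in store if i.family == family]
    candidates.sort(key=lambda i: cosine(i.embedding, query_embedding), reverse=True)
    return candidates[:k]


store = [
    Insight("checklist-item", "fomc", "check the dot plot", [1.0, 0.0]),
    Insight("common-error", "fomc", "ignoring the base rate", [0.0, 1.0]),
    Insight("driver-taxonomy", "commodity", "supply vs demand", [1.0, 0.0]),
]
top = recall(store, "fomc", [0.9, 0.1], k=1)
```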
Accumulate 50 → trigger slow loop, bootstrap demos from training set + critic signals as gradient → rewrite system prompts
SLOW LOOP · every 50 questions (or monthly) · DSPy MIPROv2 + TextGrad
training flow: bootstrap demos + gradient → rewrite 4 agent prompts
↑ Rewrite system prompts of Decomposer / Sub-agent / Synthesizer / Critic
· deploy only if 1000-bootstrap 95% CI on validation set passes
· fail → auto-rollback to previous version + flag for human review
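A deployment gate of this kind can be sketched with a percentile bootstrap over per-question Brier losses on the validation set. The source does not spell out what "passes" means, so the rule below (the new version's interval must lie entirely below the old one's) is an assumption, as are the function names and seeds.

```python
import random
from typing import List, Tuple


def bootstrap_ci(losses: List[float], n_boot: int = 1000, alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean of per-question Brier losses."""
    rng = random.Random(seed)
    n = len(losses)
    # Resample with replacement n_boot times and collect the resampled means.
    means = sorted(sum(rng.choices(losses, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def should_deploy(old_losses: List[float], new_losses: List[float]) -> bool:
    """Assumed gate: deploy only if the new CI lies entirely below the old CI."""
    old_lo, _ = bootstrap_ci(old_losses, seed=1)
    _, new_hi = bootstrap_ci(new_losses, seed=2)
    return new_hi < old_lo
```

When the gate returns `False`, the previous prompt-pack stays in place, which corresponds to the auto-rollback branch.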
DATA / TRAINING FLOW LEGEND
cyan = inference data flow (every prediction)
violet = critique → insight distillation (every 20)
amber = prompt rewrite (every 50, marching-ants animation)
↑ Reverse arrow = "training flow writes back to fast loop nodes" — the physical act of "system getting better"
KILL CRITERIA (end of v0.1 · 8 weeks)
· Test Brier fails to beat L1 base-rate → demote to "training-paradigm demo"
· Fails to beat L2 market-implied by ≥5% → architecture adjustment
· Even without beating L2, any of the following gives the project value:
  · positive thought-chain ablation (removing a checklist makes Brier worse)
  · positive cross-family transfer (train fomc, test commodity)
  · steady slow-loop descent (training-set Brier monotonically decreases)
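The kill criteria can be read as a small decision function. This sketch assumes "≥5%" means an absolute Brier margin of 0.05 (matching how the training view reports a 0.027 gap as 2.7%); the function name, return strings, and the L2 value used in the test inputs are illustrative, not taken from a specification.

```python
def evaluate_kill_criteria(test_brier: float, l1_brier: float, l2_brier: float,
                           required_margin: float = 0.05,
                           ablation_positive: bool = False,
                           transfer_positive: bool = False,
                           slow_loop_monotonic: bool = False) -> str:
    """Map end-of-v0.1 results to an outcome under the assumed reading above."""
    if test_brier >= l1_brier:
        # Did not beat the L1 base-rate baseline.
        return "demote to training-paradigm demo"
    # Assumption: the margin is an absolute difference in Brier score.
    if l2_brier - test_brier >= required_margin:
        return "pass"
    if ablation_positive or transfer_positive or slow_loop_monotonic:
        return "architecture adjustment, project still has value"
    return "architecture adjustment"
```

For example, a test Brier of 0.184 against an L1 of 0.241 and an assumed L2 of 0.211 clears L1 but misses the 0.05 margin, so the outcome depends on the three secondary signals.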
💡
How to read this diagram (30s)
  1. Your question enters from the left inlet, flows right along the blue pipe, and ends at the bottom-right outlet as a probability number.
  2. It passes 6 steps (numbered ①②③ on the pipe). Each step is one AI's job. Hover any device to see "what it does".
  3. Blue = this prediction's data flow; violet = the system distills lessons from past predictions; amber = trained improvements get loaded back into the upstream AIs.
  4. The core idea: every prediction makes the system a bit more accurate (the violet + amber pipes do exactly that).
VIEW 5 · SYSTEM FLOW

How your question gets processed

The full flow of the AI prediction system. A question pours in from the top-left, passes 6 steps, and a probability comes out at the bottom-right. Blue pipe = data flow of this prediction; violet pipe = system distilling lessons from past predictions; amber pipe = trained improvements flowing back to upstream AIs (how the system gets better).

system status
AI workers: 6/6 online
avg time: 14.3s / question
version: v0.7 · 1042 processed
⟲ Trainer ships improved methods back to upstream AIs (every 50 questions)
↳ The ③ Combiner AI uses these lessons on every prediction
IN · Question Intake · auto-classify (rates/cpi/…)
1 · Splitter · #D-001 · breaks a hard question into 5 sub-questions · AI: doubao-flash
2 · 5 Parallel Researchers · #S-001 (#S₁ #S₂ #S₃ #S₄ #S₅) · 5 AIs look up evidence at the same time · AI: doubao-flash + retrieval
3 · Combiner · #Y-001 · merges 5 answers + uses past lessons · AI: minimax-m2
4 · AI Skeptic · #C-001 · a different AI hunts for flaws · AI: doubao-thinking (different from ③)
5 · Fact Checker · #V-001 · rule-based (no AI, no hallucinations) · 3 checks: base-rate / facts / market gap
6 · Probability · 0.47 (47%)
Memory Store · #X-001 · episodic 1042 (left: per-question log) / procedural 23 (right: lessons)
Lesson Distiller · #R-001 · turns 20 logs into reusable lessons · every 20 questions · AI: minimax-m2
Lessons Shelf (4 kinds) · 23 lessons · flow/checklist/taxonomy/errors
Trainer · rewrites the 4 upstream AIs' instructions using lessons · every 50 questions · auto-rollback if not better
▼ This prediction's main flow (~14s)
↑ System distills lessons (every 20)
↑ System trains itself (every 50)
Blue: this prediction's data
runs every time you submit, ~14s
Violet: system distilling lessons
every 20 questions, turns logs into reusable lessons
Amber: system training itself
every 50 questions, loads improvements back to upstream AIs
🕐 A question's complete journey (~14s)
1 · 0s · Split · hard question → 5 sub-questions
2 · 1.4s · Research · 5 AIs look things up in parallel
3 · 5.9s · Combine · merge 5 answers, use past lessons
4 · 8.0s · Skeptic · a different AI hunts for flaws
5 · 11.0s · Fact check · rules check numbers and deviations
6 · 14.3s · Outlet · probability 47%
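The timestamps in this timeline give a rough per-stage time budget. The sketch below assumes each timestamp marks when that step starts, with the Outlet time marking the end of the run; the names are illustrative.

```python
from typing import Dict, List, Tuple

# Start time (seconds) of each step, as listed in the timeline above.
STARTS: List[Tuple[str, float]] = [
    ("Split", 0.0), ("Research", 1.4), ("Combine", 5.9),
    ("Skeptic", 8.0), ("Fact check", 11.0), ("Outlet", 14.3),
]


def stage_durations(starts: List[Tuple[str, float]]) -> Dict[str, float]:
    """Duration of each stage = the next stage's start minus its own start."""
    return {name: round(nxt - t, 1)
            for (name, t), (_, nxt) in zip(starts, starts[1:])}


durations = stage_durations(STARTS)
slowest = max(durations, key=durations.get)
```

Under that reading, the parallel research step takes the largest share of the roughly 14 seconds.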
Two things happen quietly after the outlet:
· The whole trace gets stored in the memory store. Every 20 logs, the system distills new lessons.
· Every 50 logs the Trainer kicks in and rewrites the 4 upstream AIs' instructions using those lessons. If it doesn't really improve, it auto-rolls back — so training can't make the system worse.
⇒ That's why "more questions answered = more accurate system".
💡
What training changes · how to read the 6 blocks
  1. A Brier curve: 60 blue dots split into 5 versions (amber vertical line = each retrain). Check if dot height drops after each line. L1/L2 are baselines.
  2. B Insight library: 8 distilled procedural insights, click to expand for source_question_ids ("I was learned from these 20 questions") + ablation numbers (how much worse Brier gets without me).
  3. C Prompt diff: pick agent + 2 versions → red/green inline diff shows which line of the system prompt changed; change driven by shows which insight / which optimizer triggered the rewrite.
  4. D Cross-family transfer + E ablation: §11.3 two hard metrics — proves "what training distilled is reusable thinking structure, not memorized answers".
  5. F Auto-rollback log: §9.3 — slow loop auto-rolls back if the 1000-bootstrap CI on validation set fails. This is the hard constraint that "training can't make the system worse".
VIEW 4 · TRAINING EVOLUTION

How the system gets stronger from past predictions

This page makes one concrete thing visible: across the 5 prompt-pack versions from v0.3 to v0.7, what changed in each version, what drove each change, and whether Brier really dropped afterwards. Every procedural insight can be traced back to its source questions; every prompt rewrite has a red/green diff.

overall trend (v0.3 → v0.7)
train Brier: 0.232 → 0.171 ↓ 0.061
val Brier: 0.218 → 0.184 ↓ 0.034
vs market: +0.011 → −0.027
A · ROLLING BRIER · validation set
Each amber dashed line = one slow-loop retrain; each blue dot = the Brier of one resolved question. Check whether the average dot height really drops after each retrain.
L1 base-rate ━ ━ L2 market ━ ━ predictor
§11.3-③ slow-loop monotonicity
✓ v0.3→v0.7 monotonic descent (5 retrains, 4 effective, 1 rolled back)
vs L1 base-rate
✓ beats (0.184 < 0.241)
vs L2 market-implied
▲ −2.7% Δ (target ≥ 5%)
B · PROCEDURAL INSIGHT LIBRARY · Reflector output · 4 kinds
Every insight has source_question_ids ("learned from these 20 questions") + ablation_status ("removing me makes Brier worse by X"). Click to expand for content and source.
C · PROMPT DIFF · pick two versions to see what changed
The slow loop doesn't produce model weights; it produces new system-prompt text. Source = MIPROv2 auto-bootstrap demos + TextGrad using the critic signal as gradient.
pick agent + version pair
agent
compare
change driven by
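A red/green prompt diff of this kind can be produced with Python's standard `difflib`. In the sketch below the two prompt texts are invented for the example, and the 6-line truncation mirrors the inline preview mentioned on the dashboard page; the helper name is hypothetical.

```python
import difflib
from typing import List


def prompt_diff(old: str, new: str, old_label: str, new_label: str,
                max_lines: int = 6) -> List[str]:
    """Unified diff of two prompt texts, truncated to the first `max_lines` lines."""
    diff = difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile=old_label, tofile=new_label, lineterm="",
    )
    return list(diff)[:max_lines]


# Hypothetical prompt texts, invented for the example.
old_prompt = ("You are the Synthesizer.\n"
              "Combine the sub-answers.\n"
              "Return a probability.")
new_prompt = ("You are the Synthesizer.\n"
              "Combine the sub-answers.\n"
              "Check the base rate first.\n"
              "Return a probability.")
lines = prompt_diff(old_prompt, new_prompt, "v0.6", "v0.7")
```

Lines starting with `+` would be rendered green and lines starting with `-` red.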
D · CROSS-FAMILY TRANSFER §11.3-②
Train on family-A, distill insights, then test Brier on an unseen family-B. Green cell = better than pure LLM single-shot (positive transfer).
              test fomc    test cpi     test comm    test geopol
train fomc    0.171 (in)   −0.018 ✓     −0.012 ✓     +0.004 ✗
train cpi     −0.024 ✓     0.183 (in)   −0.009 ✓     +0.001 ≈
train comm    −0.011 ✓     −0.007 ✓     0.176 (in)   −0.005 ✓
✓ 7 of the 9 cross-family cells shown are positive: thought-chain structure really transfers, not just memorizing fomc answers
E · ABLATION §11.3-① · how much worse Brier gets without each insight
Leave-one-out each procedural insight, see how much holdout Brier degrades. More degradation = more important insight.
degradation ≥ 0.01 = significantly useful, keep + continue training; degradation ≤ 0.002 or reversed = mark ablation_status as rejected, remove on next slow loop.
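The two thresholds above translate directly into a small rule. The source does not say what happens between 0.002 and 0.01, so the "undecided" band below is an assumption, as are the function name and the "keep" label.

```python
def ablation_status(degradation: float) -> str:
    """degradation = holdout Brier without the insight minus Brier with it."""
    if degradation >= 0.01:
        return "keep"
    if degradation <= 0.002:   # includes negative values (Brier improved without it)
        return "rejected"
    return "undecided"         # assumption: the source does not define the middle band
```

For instance, `ablation_status(-0.003)` returns `"rejected"`, since removing the insight made the holdout Brier better.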
F · AUTO-ROLLBACK LOG §9.3 · auto-rollback if 1000-bootstrap 95% CI fails
2026-04-30 · slow-loop attempt v0.4.5 → v0.4 · CI overlapping ([0.198, 0.231] vs [0.196, 0.228]) → not significantly better → rolled back
2026-05-12 · slow-loop attempt v0.6 → v0.7 · CI non-overlapping ([0.166, 0.187] vs [0.181, 0.203]) → significantly lower → deployed
↳ This is the hard constraint that "training can't make the system worse". Rollback is part of the design, not a failure.
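A plain interval-overlap test on the two logged entries can be sketched as follows; the function name and the `(lo, hi)` tuple layout are assumptions. Under this strict test the second logged pair, [0.166, 0.187] vs [0.181, 0.203], still shares the range [0.181, 0.187], so the deployed decision presumably rests on a different comparison (for example a CI on the paired Brier difference); the source does not say which.

```python
from typing import Tuple

Interval = Tuple[float, float]


def intervals_overlap(a: Interval, b: Interval) -> bool:
    """True if closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


first = intervals_overlap((0.198, 0.231), (0.196, 0.228))   # 2026-04-30 entry
second = intervals_overlap((0.166, 0.187), (0.181, 0.203))  # 2026-05-12 entry
```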