VIEW 1 · A QUESTION'S FULL JOURNEY

See how a question gets predicted

Pick a question, hit ▶ Start, and watch 5 AI workers take turns. When each worker is busy, a thought bubble shows what they're thinking. After the cycle, the system learns from this prediction — next time it'll do better on a similar question.

📦 processed: 1042 questions

🎯 recent accuracy: 88.2%

📘 learned lessons: 23

👋

1 Pick or edit a question → 2 Hit ▶ Start → 3 Watch 5 AI workers → get probability

this cycle · id #0143 · elapsed +00.000s · cost $0.0000 · shortcuts: Space = start, R = reset

5 AI workers are ready, just hit Start →

QUESTION INTAKE

auto-classified as fomc-rate (rates) · confidence 94%

Pick a question, or edit it below

window: 2026-05-19 → 2026-06-02

price threshold: $2384.50

source: hand-written template

first visit auto-plays once

5 AI workers take turns · about 14s

step 0/5 · live · system version v0.7

what the worker is doing · idle…

Hit Start, or pick a question to run.

live log

OUTPUT

probability of event

0.00

±0.00

90% confidence interval [0.00 – 0.00]

historic avg (same family)

0.241 · lower is better

prediction-market price

0.38 · Δ vs market +0.09

critic adjustment

−0.06

fact check

base-rate ✓ · facts 7/7 ✓ · market Δ +0.05 ⚠

working

done

attention

queued

Hover any AI worker card to see what they do

VIEW 2 · DASHBOARD

History · single-question trace · training timeline

Click any past question → the middle column expands its 5-stage trace + dual-track critique + recalled lessons; the right column simultaneously highlights which factory version (prompt-pack) processed this question (the prompt lineage line).

rolling Brier (last 90d)

0.118 ↓ 0.027 vs market (better than market)

resolved 968 / total 1042
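
For readers new to the metric: the Brier score is the mean squared error between forecast probabilities and 0/1 outcomes, so lower is better. Below is a minimal sketch of a rolling 90-day Brier. It assumes each resolved question is a record with a resolution date, a forecast probability and a binary outcome; the `Resolved` record and the `rolling_brier` helper are illustrative names, not the dashboard's actual code.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Resolved:
    resolved_on: date  # day the question resolved (assumed field)
    p: float           # forecast probability of the event
    outcome: int       # 1 if the event happened, else 0


def brier(p: float, outcome: int) -> float:
    """Squared error of a single probability forecast."""
    return (p - outcome) ** 2


def rolling_brier(rows: list[Resolved], today: date, days: int = 90) -> float | None:
    """Mean Brier over questions resolved within the last `days` days."""
    cutoff = today - timedelta(days=days)
    scores = [brier(r.p, r.outcome) for r in rows if cutoff < r.resolved_on <= today]
    return sum(scores) / len(scores) if scores else None
```

Unresolved questions have no outcome yet, so they cannot contribute to the average; that is why the dashboard reports resolved and total counts separately.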

💡

How to read this page (3 interactions)

Click any past question on the left → the middle column instantly opens its 5-stage trace + dual-track critique + recalled lessons.
Each timeline node marked "click for diff →" expands an inline red/green prompt diff (first 6 lines); "Open in Training view" shows the full diff.
The right-column amber vertical line = the factory version that processed this question. Switch questions → the line jumps — that's the "prompt lineage line".

HISTORY

filter

id	family	p	brier	st

family Brier (90d)

TRACE · Q#1042

processed-by prompt-pack v0.7

critic verdict · doubao-thinking

verifier flags · rule-engine v0.14

recalled procedural insights · semantic + family-match

TRAINING TIMELINE

prompt-pack versions · slow loop

▍ amber line marks the prompt-pack used for the selected question. Click any node to see its diff.

💡

How to read the three loops

Color = trigger frequency: cyan runs every prediction · violet distills every 20 questions · amber retrains every 50 questions. Three rhythms, from every question to every 50.
Top-down = data flow: fast loop writes critique → mid loop distills lessons → slow loop rewrites prompts.
The amber arrow at the bottom flows backward: after DSPy + TextGrad rewrites the prompts, new versions are "deployed" to the 4 fast-loop agents — that's the physical act of "the system getting better".
Hover any node to see its role / model / last-modified time.

VIEW 3 · CONCEPT ATLAS

Three loops · who rewrites whom

The system has three loops: fast (every prediction) + mid (Reflector distills every 20 questions) + slow (DSPy + TextGrad retrains every 50 questions). Arrow direction + color = data flow / training flow. Hover a node for its agent card.

three-loop cadence

fast · every prediction · pure inference

mid · every 20 · Reflector distills

slow · every 50 · DSPy + TextGrad
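
The cadence above can be sketched as a small trigger check. This is an illustration only: it assumes the loops key off a single running count of resolved questions and ignores the slow loop's alternative monthly trigger. `loops_due` and the two constants are hypothetical names.

```python
from __future__ import annotations

MID_EVERY = 20   # resolved questions between Reflector distillations
SLOW_EVERY = 50  # resolved questions between prompt retrains


def loops_due(resolved_count: int) -> list[str]:
    """Return which loops fire once `resolved_count` questions have resolved."""
    due = ["fast"]  # the fast loop runs on every prediction
    if resolved_count > 0 and resolved_count % MID_EVERY == 0:
        due.append("mid")
    if resolved_count > 0 and resolved_count % SLOW_EVERY == 0:
        due.append("slow")
    return due
```

Under this assumption the mid and slow loops coincide every 100 questions, in which case distilling before retraining matches the top-down data flow the diagram describes.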

FAST LOOP · runs every prediction · no prompt change

data flow: question → P(event)

↳ every cycle Critic + Verifier write to ExperienceStore (episodic layer)
↳ Synthesizer recalls procedural insights as context (family-match + semantic top-K)
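
The page does not say how family-match and semantic top-K are combined during recall. One plausible reading, sketched here purely as an assumption, is to rank same-family insights first and break ties by cosine similarity to the question embedding. `Insight`, `cosine` and `recall` are hypothetical names, and the embeddings are plain lists of floats supplied by the caller.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class Insight:
    text: str
    family: str             # e.g. "fomc", "cpi"
    embedding: list[float]  # assumed to be precomputed elsewhere


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall(insights: list[Insight], family: str,
           query_embedding: list[float], k: int = 3) -> list[Insight]:
    """Top-k insights: same-family first, then by semantic similarity."""
    ranked = sorted(
        insights,
        key=lambda i: (i.family == family, cosine(i.embedding, query_embedding)),
        reverse=True,
    )
    return ranked[:k]
```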

▼ After every question → dual-track critique (LLM-critic + det-verifier) is written to ExperienceStore.episodic
accumulate 20 → trigger mid loop ↓

MID LOOP · every 20 resolved questions · Reflector distills procedural insights

training flow: critique → procedural insight

Reflector outputs 4 kinds of insight (written to ExperienceStore.procedural, recalled later by Decomposer / Synthesizer): thought-chain-template · checklist-item · driver-taxonomy · common-error

▼ Accumulate 50 → trigger slow loop: bootstrap demos from the training set + critic signals as gradient → rewrite system prompts ↓

SLOW LOOP · every 50 questions (or monthly) · DSPy MIPROv2 + TextGrad

training flow: bootstrap demos + gradient → rewrite 4 agent prompts

↑ Rewrite system prompts of Decomposer / Sub-agent / Synthesizer / Critic
· deploy only if 1000-bootstrap 95% CI on validation set passes
· fail → auto-rollback to previous version + flag for human review
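
A rough sketch of what such a deployment gate could look like. The rollback log in View 4 describes the check in terms of whether two confidence intervals overlap, so this sketch assumes that is the whole rule: compute a 1000-resample percentile bootstrap CI of mean validation Brier for each version, and deploy only if the candidate's interval lies entirely below the incumbent's. `bootstrap_ci` and `should_deploy` are hypothetical names, and the real gate may differ (for example, by bootstrapping paired differences).

```python
from __future__ import annotations

import random


def bootstrap_ci(scores: list[float], n_boot: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean of per-question Brier scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement n_boot times and record each resample's mean.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_boot))
    lo_idx = int(n_boot * alpha / 2)
    hi_idx = n_boot - lo_idx - 1
    return means[lo_idx], means[hi_idx]


def should_deploy(candidate: list[float], incumbent: list[float]) -> bool:
    """Deploy only if the candidate's CI sits entirely below the incumbent's."""
    _, cand_hi = bootstrap_ci(candidate, seed=0)
    inc_lo, _ = bootstrap_ci(incumbent, seed=1)
    return cand_hi < inc_lo
```

When `should_deploy` returns False, the caller would keep the previous prompt-pack and raise the human-review flag.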

DATA / TRAINING FLOW LEGEND

cyan = inference data flow (every prediction)

violet = critique → insight distillation (every 20)

amber = prompt rewrite (every 50, marching-ants animation)

↑ Reverse arrow = "training flow writes back to fast loop nodes" — the physical act of "system getting better"

KILL CRITERIA (end of v0.1 · 8 weeks)

· Test Brier fails to beat L1 base-rate → demote to "training-paradigm demo"

· Fails to beat L2 market-implied by ≥5% → architecture adjustment

· Even without beating L2, any of the following gives the project value:

① positive thought-chain ablation (removing a checklist makes Brier worse)

② positive cross-family transfer (train fomc, test commodity)

③ steady slow-loop descent (training-set Brier monotonically decreases)

💡

How to read this diagram (30s)

Your question enters from the left inlet, flows right along the blue pipe, and ends at the bottom-right outlet as a probability number.
It passes 6 steps (numbered ①②③ on the pipe). Each step is one AI's job. Hover any device to see "what it does".
Blue = this prediction's data flow; violet = the system distills lessons from past predictions; amber = trained improvements get loaded back into the upstream AIs.
The core idea: every prediction makes the system a bit more accurate (the violet + amber pipes do exactly that).

VIEW 5 · SYSTEM FLOW

How your question gets processed

The full flow of the AI prediction system. Question pours in from top-left, passes 6 steps, and a probability comes out at the bottom-right. Blue pipe = data flow of this prediction; violet pipe = system distilling lessons from past predictions; amber pipe = trained improvements flowing back to upstream AIs (how the system gets better).

system status

AI workers: 6/6 online

avg time: 14.3s / question

version: v0.7 · 1042 processed

Blue: this prediction's data

runs every time you submit, ~14s

Violet: system distilling lessons

every 20 questions, turns logs into reusable lessons

Amber: system training itself

every 50 questions, loads improvements back to upstream AIs

🕐 A question's complete journey (~14s)

step	time	name	what happens
1	0s	Split	hard question → 5 sub-questions
2	1.4s	Research	5 AIs look things up in parallel
3	5.9s	Combine	merge 5 answers, use past lessons
4	8.0s	Skeptic	a different AI hunts for flaws
5	11.0s	Fact check	rules check numbers and deviations
6	14.3s	Outlet	probability 47%
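
The six steps can be sketched as a small pipeline in which only the research step fans out in parallel. Every function below is a hypothetical stub (a fixed estimate, a plain average, a pass-through skeptic, a clamp as the fact check), chosen so that the control flow runs end to end. None of it reflects the real agents' logic.

```python
from __future__ import annotations

import asyncio


def split(question: str, n: int = 5) -> list[str]:
    """Step 1: break the question into n sub-questions (stub)."""
    return [f"{question} [sub-question {i + 1}]" for i in range(n)]


async def research(sub_question: str) -> float:
    """Step 2: one worker's probability estimate (stub returns 0.5)."""
    await asyncio.sleep(0)  # stands in for a real lookup
    return 0.5


def combine(estimates: list[float]) -> float:
    """Step 3: merge the estimates (stub: plain average)."""
    return sum(estimates) / len(estimates)


def skeptic(p: float) -> float:
    """Step 4: a critic would adjust p here (stub: no change)."""
    return p


def fact_check(p: float) -> float:
    """Step 5: rule-based check (stub: clamp to a valid probability)."""
    return min(max(p, 0.0), 1.0)


async def predict(question: str) -> float:
    """Steps 1 to 6: run the stages in order and return the probability."""
    subs = split(question)
    estimates = await asyncio.gather(*(research(s) for s in subs))
    return fact_check(skeptic(combine(list(estimates))))
```

Calling `asyncio.run(predict("Will the rate be cut?"))` walks the stub pipeline once; `asyncio.gather` is what lets the five research calls overlap instead of running back to back.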

Two things happen quietly after the outlet:
· The whole trace gets stored in the memory store. Every 20 logs, the system distills new lessons.
· Every 50 logs the Trainer kicks in and rewrites the 4 upstream AIs' instructions using those lessons. If it doesn't really improve, it auto-rolls back — so training can't make the system worse.
⇒ That's why "more questions answered = more accurate system".

💡

What training changes · how to read the 6 blocks

A Brier curve: 60 blue dots split into 5 versions (amber vertical line = each retrain). Check whether the average dot height drops after each line. L1/L2 are baselines.
B Insight library: 8 distilled procedural insights, click to expand for source_question_ids ("I was learned from these 20 questions") + ablation numbers (how much worse Brier gets without me).
C Prompt diff: pick agent + 2 versions → red/green inline diff shows which line of the system prompt changed; change driven by shows which insight / which optimizer triggered the rewrite.
D Cross-family transfer + E ablation: the two hard metrics of §11.3, which test whether what training distilled is reusable thinking structure rather than memorized answers.
F Auto-rollback log: §9.3 — slow loop auto-rolls back if the 1000-bootstrap CI on validation set fails. This is the hard constraint that "training can't make the system worse".

VIEW 4 · TRAINING EVOLUTION

How the system gets stronger from past predictions

This page makes one concrete thing visible: from v0.3 to v0.7, what changed in each of the 5 prompt-pack versions, what drove the change, and whether Brier really dropped afterwards. Every procedural insight can be traced back to its source questions; every prompt rewrite has a red/green diff.

overall trend (v0.3 → v0.7)

train Brier: 0.232 → 0.171 ↓ 0.061

val Brier: 0.218 → 0.184 ↓ 0.034

vs market: +0.011 → −0.027

A · ROLLING BRIER · validation set

Each amber dashed line = one slow-loop retrain; each blue dot = the Brier of one resolved question. Check whether the average dot height really drops after each retrain.

L1 base-rate ━ ━ L2 market ━ ━ predictor ━

§11.3-③ slow-loop monotonicity

✓ v0.3→v0.7 monotonic descent (5 retrains, 4 effective, 1 rolled back)

vs L1 base-rate

✓ beats (0.184 < 0.241)

vs L2 market-implied

▲ −2.7% Δ (target ≥ 5%)

B · PROCEDURAL INSIGHT LIBRARY · Reflector output · 4 kinds

Every insight has source_question_ids ("learned from these 20 questions") + ablation_status ("removing me makes Brier worse by X"). Click to expand for content and source.

C · PROMPT DIFF · pick two versions to see what changed

The slow loop doesn't produce model weights; it produces new system-prompt text. Source = MIPROv2 auto-bootstrap demos + TextGrad using critic signal as gradient.

pick agent + version pair

agent

compare

→

change driven by

D · CROSS-FAMILY TRANSFER §11.3-②

Train on family-A, distill insights, then test Brier on an unseen family-B. Green cell = better than pure LLM single-shot (positive transfer).

train \ test	test fomc	test cpi	test comm	test geopol
train fomc	0.171 (in)	−0.018 ✓	−0.012 ✓	+0.004 ✗
train cpi	−0.024 ✓	0.183 (in)	−0.009 ✓	+0.001 ≈
train comm	−0.011 ✓	−0.007 ✓	0.176 (in)	−0.005 ✓

✓ 9/12 cells positive: thought-chain structure really transfers, not just memorizing fomc answers

E · ABLATION §11.3-① · how much worse Brier gets without each insight

Leave-one-out each procedural insight, see how much holdout Brier degrades. More degradation = more important insight.

▲ degradation ≥ 0.01 = significantly useful, keep + continue training
▼ degradation ≤ 0.002 or reversed = mark ablation_status as rejected, remove on next slow loop
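
The two thresholds translate directly into a small classifier. The page does not say what happens to an insight whose degradation falls between 0.002 and 0.01, so the "undecided" label below is an assumption, as are the function name and the returned strings.

```python
KEEP_AT = 0.01     # degradation at or above this: significantly useful
REJECT_AT = 0.002  # degradation at or below this (or negative): rejected


def ablation_status(brier_with: float, brier_without: float) -> str:
    """Classify one insight from its leave-one-out holdout Brier scores."""
    # Positive degradation means Brier got worse once the insight was removed.
    degradation = brier_without - brier_with
    if degradation >= KEEP_AT:
        return "keep"
    if degradation <= REJECT_AT:
        return "rejected"
    return "undecided"  # assumption: the source leaves this band unspecified
```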

F · AUTO-ROLLBACK LOG §9.3 · auto-rollback if 1000-bootstrap 95% CI fails

⟲ 2026-04-30 slow-loop attempt v0.4.5 → v0.4 · CI overlapping ([0.198, 0.231] vs [0.196, 0.228]) → not significantly better → rolled back

✓ 2026-05-12 slow-loop attempt v0.6 → v0.7 · CI non-overlapping ([0.166, 0.187] vs [0.181, 0.203]) → significantly lower → deployed

↳ This is the hard constraint that "training can't make the system worse". Rollback is part of the design, not a failure.