This page defines the full Recursive Discovery Loop: what evidence enters the loop, what artifacts can change, which MCP tools expose the process, which gates block promotion, and which system boundaries are never allowed to recursively rewrite themselves.
DiscoveryLab can learn from every failed proof, rejected plan, reviewer correction, returned assay result, and regression. The recursion is bounded: the system improves proposals, rankings, retrieval, mechanism templates, assays, and evals — never the permissions it operates under.
Assay returns, verifier failures, reviewer corrections, counterevidence, calibration regressions, and agent trace defects enter the same event stream.
The system classifies what actually failed: ranking weights, mechanism templates, evidence retrieval, assay selection, schemas, prompts, or evaluation coverage.
An ImprovementProposal is created with an explicit subsystem, artifact diff, expected gain, risk class, validation plan, rollback plan, and evidence trace.
Offline benchmarks, regression tests, calibration checks, counterevidence retrieval, and verifier compatibility run before any change can be promoted.
Human review remains the promotion boundary. Agents can recommend, score, and explain. They cannot approve their own authority changes.
Approved changes become versioned artifacts with audit receipts. Failed changes become new regression tests, negative examples, or rollback records.
A recursive loop is only useful if it turns evidence into concrete patches. DiscoveryLab keeps each feedback signal tied to an artifact class and a validation burden.
Outcome-calibrate ranking weights and next-batch policy.
Repair unsupported edges, wrong signs, missing receptors, or overclaimed endpoints.
Turn reviewer edits into prompt, schema, retrieval, or workflow proposals.
Freeze the failure as a regression case before promotion.
Raise source diversity and contradiction-search requirements.
Propose low-risk UI, schema, or tool-call improvements for review.
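One way to keep each feedback signal tied to a concrete patch is a lookup keyed by signal source. The sketch below rests on an assumption: it pairs the six patch actions above with the six `source` values of the `ImprovementProposal` type in listing order, a pairing the page does not state explicitly. The names `PatchRoute`, `PATCH_ROUTES`, and `routeSignal` are hypothetical.

```typescript
// Signal sources, copied from the ImprovementProposal type on this page.
type Source =
  | "wet_lab_outcome"
  | "lean_verifier_failure"
  | "reviewer_correction"
  | "eval_regression"
  | "counterevidence"
  | "agent_self_critique";

// Hypothetical routing record: the patch a given signal should produce.
interface PatchRoute {
  action: string;
}

// ASSUMPTION: the pairing follows listing order; it is not confirmed by the page.
const PATCH_ROUTES: Record<Source, PatchRoute> = {
  wet_lab_outcome: { action: "Outcome-calibrate ranking weights and next-batch policy." },
  lean_verifier_failure: { action: "Repair unsupported edges, wrong signs, missing receptors, or overclaimed endpoints." },
  reviewer_correction: { action: "Turn reviewer edits into prompt, schema, retrieval, or workflow proposals." },
  eval_regression: { action: "Freeze the failure as a regression case before promotion." },
  counterevidence: { action: "Raise source diversity and contradiction-search requirements." },
  agent_self_critique: { action: "Propose low-risk UI, schema, or tool-call improvements for review." },
};

// Look up the patch action for an incoming signal.
function routeSignal(source: Source): PatchRoute {
  return PATCH_ROUTES[source];
}
```

Typing the table as `Record<Source, PatchRoute>` means a new signal source fails to compile until someone decides what patch it is allowed to produce.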
The loop promotes artifacts only after offline evaluation, regression checks, verifier compatibility, rollback planning, and human review.
source, subsystem, hypothesis, proposedChange, affectedArtifacts, expectedGain, riskClass, validationPlan, evidence, status, semanticHash, auditReceipt
offlineEval, regressionTests, acceptanceCriteria, rollbackPlan
approvedBy, promotedAt, proposalHash, changedArtifacts, acceptanceSummary, rollbackPointer
failedTrace, expectedBehavior, subsystem, blockingGate, reviewerNote, reproducedAt
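The field lists above can be read as record shapes. The sketch below is a guess at how they might be typed: the field names come from the lists, but every field type, the record names `PromotionReceipt` and `RegressionCase`, and the helper `isCompletePlan` are assumptions made for illustration.

```typescript
// Field names come from the lists above; every field TYPE is an assumption.
interface ValidationPlan {
  offlineEval: string;
  regressionTests: string[];
  acceptanceCriteria: string[];
  rollbackPlan: string;
}

// Hypothetical name for the record written when a change is promoted.
interface PromotionReceipt {
  approvedBy: string;
  promotedAt: string; // assumed to be an ISO 8601 timestamp
  proposalHash: string;
  changedArtifacts: string[];
  acceptanceSummary: string;
  rollbackPointer: string;
}

// Hypothetical name for a failure frozen as a regression test.
interface RegressionCase {
  failedTrace: string;
  expectedBehavior: string;
  subsystem: string;
  blockingGate: string;
  reviewerNote: string;
  reproducedAt: string; // assumed to be an ISO 8601 timestamp
}

// Hypothetical helper: treat a plan as usable only when every part is filled in.
function isCompletePlan(plan: ValidationPlan): boolean {
  return (
    plan.offlineEval.trim() !== "" &&
    plan.regressionTests.length > 0 &&
    plan.acceptanceCriteria.length > 0 &&
    plan.rollbackPlan.trim() !== ""
  );
}
```

Requiring a non-empty `rollbackPlan` up front reflects the page's rule that rollback planning precedes promotion.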
type ImprovementProposal = {
source: "wet_lab_outcome" | "lean_verifier_failure" | "reviewer_correction" | "eval_regression" | "counterevidence" | "agent_self_critique"
subsystem: "candidate_generation" | "mechanism_reasoning" | "evidence_retrieval" | "assay_selection" | "ranking_policy" | "agent_prompt" | "schema" | "ui_workflow" | "eval_harness"
hypothesis: string
proposedChange: string
affectedArtifacts: string[]
expectedGain: string
riskClass: "low" | "medium" | "high" | "forbidden"
validationPlan: ValidationPlan
evidence: EvidenceTrace
status: "draft" | "ready_for_eval" | "eval_passed" | "approved" | "promoted" | "rolled_back"
}
The MCP server exposes improvement operations as auditable calls. Tooling can draft and evaluate proposals, but promotion remains gated by approval and risk policy.
Create a candidate change from an outcome, proof failure, reviewer correction, or eval regression.
Run offline evaluation, calibration checks, regression tests, and verifier compatibility checks.
Return the gate report: evidence, risk, regression, verifier, approval, rollback.
Attach human approval to a passing proposal. Agent calls cannot self-approve.
Promote only approved, non-forbidden, eval-passed changes into versioned artifacts.
Inspect active proposals, blocked proposals, receipts, and rollback history.
Mark a promoted change as rolled back and preserve the reason as a future regression.
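The gate report and the promotion rule described above can be sketched as a pure function over a proposal's state. This is a minimal illustration, not the server's actual interface: `ProposalState`, `gateReport`, `blockingGates`, and `canPromote` are hypothetical names, and the boolean inputs stand in for whatever the real evaluation step produces.

```typescript
type RiskClass = "low" | "medium" | "high" | "forbidden";

// The six gates named in the gate report above.
type Gate = "evidence" | "risk" | "regression" | "verifier" | "approval" | "rollback";

// Hypothetical, simplified view of a proposal at promotion time.
interface ProposalState {
  riskClass: RiskClass;
  hasEvidenceTrace: boolean;
  regressionPassed: boolean;
  verifierCompatible: boolean;
  hasRollbackPlan: boolean;
  approver?: { id: string; kind: "human" | "agent" };
}

type GateReport = Record<Gate, boolean>;

// Evaluate every gate independently so the report shows all blockers at once.
function gateReport(p: ProposalState): GateReport {
  return {
    evidence: p.hasEvidenceTrace,
    risk: p.riskClass !== "forbidden",
    regression: p.regressionPassed,
    verifier: p.verifierCompatible,
    // An agent's approval never counts: agent calls cannot self-approve.
    approval: p.approver !== undefined && p.approver.kind === "human",
    rollback: p.hasRollbackPlan,
  };
}

// List the gates that currently block promotion.
function blockingGates(report: GateReport): Gate[] {
  return (Object.keys(report) as Gate[]).filter((gate) => !report[gate]);
}

// Promotion is allowed only when no gate blocks.
function canPromote(p: ProposalState): boolean {
  return blockingGates(gateReport(p)).length === 0;
}
```

Returning the full report instead of a single boolean lets a caller show which gate blocks promotion, not just that something does.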
The propose → evaluate → verify → approve → promote sequence is one-way. A proposal that touches safety gates, approval thresholds, wet-lab permissions, or verifier requirements is classified as forbidden and cannot be promoted by the loop.
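A one-way sequence with a forbidden class can be sketched as a transition table plus a guard. The version below makes several assumptions: it maps the sequence onto the `status` values of `ImprovementProposal` with exactly one successor each, and the protected-area identifiers, `touchesProtectedArea`, and `advance` are hypothetical names.

```typescript
type Status =
  | "draft"
  | "ready_for_eval"
  | "eval_passed"
  | "approved"
  | "promoted"
  | "rolled_back";

// ASSUMPTION: each status has exactly one legal successor.
// The page states only that the sequence is one-way.
const NEXT: Record<Status, Status | null> = {
  draft: "ready_for_eval",
  ready_for_eval: "eval_passed",
  eval_passed: "approved",
  approved: "promoted",
  promoted: "rolled_back",
  rolled_back: null,
};

// Protected areas named in the text; the identifiers are hypothetical.
const PROTECTED_AREAS: string[] = [
  "safety_gates",
  "approval_thresholds",
  "wet_lab_permissions",
  "verifier_requirements",
];

// True when a proposal touches anything the loop may never rewrite.
function touchesProtectedArea(touched: string[]): boolean {
  return touched.some((area) => PROTECTED_AREAS.indexOf(area) !== -1);
}

// Move a proposal forward one step, or throw if the move is not allowed.
function advance(current: Status, target: Status, touched: string[]): Status {
  if (NEXT[current] !== target) {
    throw new Error(`illegal transition: ${current} -> ${target}`);
  }
  if (target === "promoted" && touchesProtectedArea(touched)) {
    throw new Error("forbidden: proposal touches a protected area");
  }
  return target;
}
```

Because the table holds only forward edges, there is no code path that moves a proposal back to an earlier status; a rolled-back change has to re-enter the loop as a new proposal.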
The loop is not a generic promise that the AI gets better. It names exactly which subsystems are allowed to change and which signals justify the change.
Fewer invalid candidates and better scaffold-family selection.
Mechanism canvases that survive evidence and formal checks.
Experiments that reduce uncertainty, not merely produce more data.
Scores that correlate with returned evidence instead of internal prettiness.
Lower reviewer burden without changing scientific authority.
The strongest safety property is simple: a tool may improve scientific artifacts but may not improve its own permissions.
Agents may not weaken or bypass them.
No proposal may approve itself or lower the approval threshold.
Scientific claim boundaries remain human-governed.
Executable lab actions stay approval-gated.
Throughput cannot silently replace truth, safety, or calibration.
Formal checks can be extended, not skipped.
The loop can make DiscoveryLab more calibrated, more skeptical, and more useful. It cannot make itself more autonomous. Permission changes are outside the recursive loop and require a separate governance release.
The UI should show why the loop wants to change, how it will be evaluated, and which gate blocks promotion.
stability score overweighted in IL-17R scaffold family
unsupported receptor-to-endpoint jump in NF-κB template
counterevidence query missed negative ligand study
The loop should be judged by calibration, regression protection, counterevidence quality, and reviewer burden — not by the number of proposals it can generate.
prediction → returned outcome alignment
mechanism proposals that survive encoded assumptions
contradictions found before reviewer review
edits per promoted proposal
known failures blocked before release
time from detected bad change to restored artifact
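Three of these measures reduce to simple arithmetic. The sketch below is one possible way to compute them, not the page's definition: it uses the Brier score as a stand-in for prediction-to-outcome alignment, and the function names and input shapes are hypothetical.

```typescript
// Brier score: mean squared gap between a predicted probability and the
// observed 0/1 outcome. Lower is better. This is a swapped-in, standard
// calibration measure; the page does not name a specific one.
function brierScore(pairs: { predicted: number; outcome: 0 | 1 }[]): number {
  if (pairs.length === 0) throw new Error("no predictions to score");
  let total = 0;
  for (const pair of pairs) {
    const gap = pair.predicted - pair.outcome;
    total += gap * gap;
  }
  return total / pairs.length;
}

// Reviewer burden: mean number of reviewer edits per promoted proposal.
function editsPerPromotedProposal(editCounts: number[]): number {
  if (editCounts.length === 0) throw new Error("no promoted proposals");
  const total = editCounts.reduce((sum, count) => sum + count, 0);
  return total / editCounts.length;
}

// Rollback latency: milliseconds from detecting a bad change to restoring
// the artifact. Inputs are assumed to be ISO 8601 timestamps.
function rollbackLatencyMs(detectedAt: string, restoredAt: string): number {
  return Date.parse(restoredAt) - Date.parse(detectedAt);
}
```

None of these functions counts proposals generated, which matches the point above that volume is not a measure of the loop's quality.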
The safest implementation path starts with outcome-calibrated ranking and active-learning policy evaluation. Model-weight changes come only after artifact, eval, and review loops are stable.
Learn ranking updates from returned assay outcomes and rejected verifier traces.
Compare acquisition policies and update the selector from returned information gain.
Turn Lean failures into bounded mechanism-template repair proposals.
Let agents propose prompt, schema, and UI workflow patches with regression tests.
Fine-tune only from verified traces, reviewer-approved corrections, and outcome-labeled examples.