Recursive discovery

A bounded, self-improving evidence loop for DiscoveryLab.

This page defines the full Recursive Discovery Loop: what evidence enters the loop, which artifacts can change, which MCP tools expose the process, which gates block promotion, and which system boundaries the loop is never allowed to rewrite.

Figure (schematic): Recursive Discovery Loop

A discovery system that improves itself — but never its own authority.

DiscoveryLab can learn from every failed proof, rejected plan, reviewer correction, returned assay result, and regression. The recursion is bounded: the system improves proposals, rankings, retrieval, mechanism templates, assays, and evals — never the permissions it operates under.

01 → Observe

Assay returns, verifier failures, reviewer corrections, counterevidence, calibration regressions, and agent trace defects enter the same event stream.

02 → Diagnose

The system classifies what actually failed: ranking weights, mechanism templates, evidence retrieval, assay selection, schemas, prompts, or evaluation coverage.

03 → Propose

An ImprovementProposal is created with an explicit subsystem, artifact diff, expected gain, risk class, validation plan, rollback plan, and evidence trace.

04 → Evaluate

Offline benchmarks, regression tests, calibration checks, counterevidence retrieval, and verifier compatibility run before any change can be promoted.

05 → Review

Human review remains the promotion boundary. Agents can recommend, score, and explain. They cannot approve their own authority changes.

06 ↺ Promote or rollback

Approved changes become versioned artifacts with audit receipts. Failed changes become new regression tests, negative examples, or rollback records.
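The six steps form a cycle, not a pipeline with an end: the last step feeds new evidence back into the first. A minimal TypeScript sketch of that ordering follows. The identifiers `LOOP_STAGES` and `nextStage` are illustrative assumptions, not names used by DiscoveryLab.

```typescript
// Stage names follow the six steps above; the identifiers are illustrative.
const LOOP_STAGES = [
  "observe",
  "diagnose",
  "propose",
  "evaluate",
  "review",
  "promote_or_rollback",
] as const;

type LoopStage = (typeof LOOP_STAGES)[number];

// Returns the stage that follows `stage`. The final stage wraps back to
// "observe", because promotions and rollbacks both produce new evidence.
function nextStage(stage: LoopStage): LoopStage {
  const index = LOOP_STAGES.indexOf(stage);
  return LOOP_STAGES[(index + 1) % LOOP_STAGES.length];
}
```

Keeping the stages in one ordered constant means the order is stated once, and any code that walks the loop derives it from there.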

Feedback streams

Every return path points to a specific artifact, not a vague model improvement.

A recursive loop is only useful if it turns evidence into concrete patches. DiscoveryLab keeps each feedback signal tied to an artifact class and a validation burden.

wet-lab outcome → candidate ranking

Outcome-calibrate ranking weights and next-batch policy.

Lean verifier failure → mechanism template

Repair unsupported edges, wrong signs, missing receptors, or overclaimed endpoints.

reviewer correction → agent policy

Turn reviewer edits into prompt, schema, retrieval, or workflow proposals.

eval regression → test harness

Freeze the failure as a regression case before promotion.

counterevidence → evidence retrieval

Raise source diversity and contradiction-search requirements.

agent self-critique → operating procedure

Propose low-risk UI, schema, or tool-call improvements for review.
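The signal-to-artifact pairs above can be written as a lookup table. The sketch below is one possible encoding: the source identifiers match the `source` field of `ImprovementProposal`, while the `ArtifactClass` names, `FEEDBACK_ROUTES`, and `routeFeedback` are assumptions introduced for illustration.

```typescript
type FeedbackSource =
  | "wet_lab_outcome"
  | "lean_verifier_failure"
  | "reviewer_correction"
  | "eval_regression"
  | "counterevidence"
  | "agent_self_critique";

// Hypothetical identifiers for the artifact classes named in the list above.
type ArtifactClass =
  | "candidate_ranking"
  | "mechanism_template"
  | "agent_policy"
  | "test_harness"
  | "evidence_retrieval"
  | "operating_procedure";

// Record<FeedbackSource, ...> makes the compiler reject a table that omits a source.
const FEEDBACK_ROUTES: Record<FeedbackSource, ArtifactClass> = {
  wet_lab_outcome: "candidate_ranking",
  lean_verifier_failure: "mechanism_template",
  reviewer_correction: "agent_policy",
  eval_regression: "test_harness",
  counterevidence: "evidence_retrieval",
  agent_self_critique: "operating_procedure",
};

// Looks up the artifact class that a feedback signal is allowed to patch.
function routeFeedback(source: FeedbackSource): ArtifactClass {
  return FEEDBACK_ROUTES[source];
}
```

Because the table is exhaustive by type, adding a new feedback source without deciding where it routes becomes a compile error instead of a silent gap.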

Artifact-native recursion

No invisible self-modification. Every improvement is a typed artifact with a diff.

The loop promotes artifacts only after offline evaluation, regression checks, verifier compatibility, rollback planning, and human review.

ImprovementProposal

The unit of recursive change.

source, subsystem, hypothesis, proposedChange, affectedArtifacts, expectedGain, riskClass, validationPlan, evidence, status, semanticHash, auditReceipt

ValidationPlan

The proof obligation before promotion.

offlineEval, regressionTests, acceptanceCriteria, rollbackPlan

PromotionReceipt

The record that separates suggestion from deployment.

approvedBy, promotedAt, proposalHash, changedArtifacts, acceptanceSummary, rollbackPointer

RegressionCase

The memory of failure.

failedTrace, expectedBehavior, subsystem, blockingGate, reviewerNote, reproducedAt

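// Field types for ValidationPlan and EvidenceTrace are assumptions; this page lists only field names.
type ValidationPlan = {
  offlineEval: string
  regressionTests: string[]
  acceptanceCriteria: string[]
  rollbackPlan: string
}
type EvidenceTrace = { eventIds: string[] } // placeholder shape, not specified on this page

type ImprovementProposal = {
  source: "wet_lab_outcome" | "lean_verifier_failure" | "reviewer_correction" | "eval_regression" | "counterevidence" | "agent_self_critique"
  subsystem: "candidate_generation" | "mechanism_reasoning" | "evidence_retrieval" | "assay_selection" | "ranking_policy" | "agent_prompt" | "schema" | "ui_workflow" | "eval_harness"
  hypothesis: string
  proposedChange: string
  affectedArtifacts: string[]
  expectedGain: string
  riskClass: "low" | "medium" | "high" | "forbidden"
  validationPlan: ValidationPlan
  evidence: EvidenceTrace
  status: "draft" | "ready_for_eval" | "eval_passed" | "approved" | "promoted" | "rolled_back"
  semanticHash: string // listed in the field summary above
  auditReceipt: string | null // listed in the field summary above; assumed null until promotion
}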
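PromotionReceipt and RegressionCase can be typed in the same style. The sketch below uses the field names from the cards above. The field types, the helper `regressionFromRollback`, and the choice of `"regression"` as the blocking gate are assumptions, intended only to show how a rolled-back change might be preserved as a regression case.

```typescript
type PromotionReceipt = {
  approvedBy: string; // assumed: identifier of the human reviewer
  promotedAt: string; // assumed: ISO 8601 timestamp
  proposalHash: string; // assumed: the proposal's semanticHash
  changedArtifacts: string[];
  acceptanceSummary: string;
  rollbackPointer: string; // assumed: artifact version to restore
};

type RegressionCase = {
  failedTrace: string;
  expectedBehavior: string;
  subsystem: string;
  blockingGate: string;
  reviewerNote: string;
  reproducedAt: string; // assumed: ISO 8601 timestamp
};

// Hypothetical helper: records why a promoted change was rolled back so the
// failure can be replayed before any later promotion.
function regressionFromRollback(
  receipt: PromotionReceipt,
  failure: { failedTrace: string; expectedBehavior: string; subsystem: string; reason: string },
  reproducedAt: string
): RegressionCase {
  return {
    failedTrace: failure.failedTrace,
    expectedBehavior: failure.expectedBehavior,
    subsystem: failure.subsystem,
    blockingGate: "regression",
    reviewerNote: `Rolled back ${receipt.proposalHash} to ${receipt.rollbackPointer}: ${failure.reason}`,
    reproducedAt,
  };
}
```

The receipt carries the rollback pointer, so the regression case can name both the change that failed and the version that replaced it.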

MCP tool family

The loop becomes a first-party tool surface, not an unbounded agent behavior.

The MCP server exposes improvement operations as auditable calls. Tooling can draft and evaluate proposals, but promotion remains gated by approval and risk policy.

improvement/propose

Create a candidate change from a wet-lab outcome, proof failure, reviewer correction, eval regression, counterevidence, or agent self-critique.

improvement/evaluate

Run offline evaluation, calibration checks, regression tests, and verifier compatibility checks.

improvement/verify

Return the gate report: evidence, risk, regression, verifier, approval, rollback.

improvement/approve

Attach human approval to a passing proposal. Agent calls cannot self-approve.

improvement/promote

Promote only approved, non-forbidden, eval-passed changes into versioned artifacts.

improvement/inspect

Inspect active proposals, blocked proposals, receipts, and rollback history.

improvement/rollback

Mark a promoted change as rolled back and preserve the reason as a future regression.
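One way to enforce "agent calls cannot self-approve" is a caller check in front of every tool call. The sketch below assumes the server can tell human callers from agent callers. `CallerKind`, `HUMAN_ONLY_TOOLS`, `AuditEntry`, and `authorizeCall` are hypothetical names, and only `improvement/approve` is treated as human-only because that is the one restriction this page states.

```typescript
type ImprovementTool =
  | "improvement/propose"
  | "improvement/evaluate"
  | "improvement/verify"
  | "improvement/approve"
  | "improvement/promote"
  | "improvement/inspect"
  | "improvement/rollback";

type CallerKind = "agent" | "human";

// Tools that attach human authority to a proposal.
const HUMAN_ONLY_TOOLS: ImprovementTool[] = ["improvement/approve"];

type AuditEntry = {
  tool: ImprovementTool;
  caller: CallerKind;
  allowed: boolean;
  reason: string;
};

// Decides whether a call may proceed and returns a record suitable for an audit log.
function authorizeCall(tool: ImprovementTool, caller: CallerKind): AuditEntry {
  const humanOnly = HUMAN_ONLY_TOOLS.indexOf(tool) !== -1;
  const allowed = caller === "human" || !humanOnly;
  return {
    tool,
    caller,
    allowed,
    reason: allowed ? "ok" : "agent callers cannot attach approval",
  };
}
```

Returning an audit entry rather than a bare boolean keeps denied calls visible, which matches the requirement that improvement operations are auditable.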

Promotion contract

The sequence propose → evaluate → verify → approve → promote is one-way: a proposal never moves back to an earlier step. A proposal that touches safety gates, approval thresholds, wet-lab permissions, or verifier requirements is classified as forbidden and cannot be promoted by the loop.
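The contract can be sketched as a transition table over the `status` values of `ImprovementProposal`. The mapping from tool calls to statuses is an assumption (the page defines no status for a failed evaluation, for example), and `ALLOWED_TRANSITIONS`, `canTransition`, and `canPromote` are hypothetical names.

```typescript
type ProposalStatus =
  | "draft"
  | "ready_for_eval"
  | "eval_passed"
  | "approved"
  | "promoted"
  | "rolled_back";

type RiskClass = "low" | "medium" | "high" | "forbidden";

// Each status may move only to the statuses listed for it; nothing moves backward.
const ALLOWED_TRANSITIONS: Record<ProposalStatus, ProposalStatus[]> = {
  draft: ["ready_for_eval"],
  ready_for_eval: ["eval_passed"],
  eval_passed: ["approved"],
  approved: ["promoted"],
  promoted: ["rolled_back"],
  rolled_back: [],
};

function canTransition(from: ProposalStatus, to: ProposalStatus): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}

// Promotion needs an approved proposal whose risk class is not "forbidden".
// In this table "approved" is reachable only through "eval_passed".
function canPromote(proposal: { status: ProposalStatus; riskClass: RiskClass }): boolean {
  return proposal.riskClass !== "forbidden" && canTransition(proposal.status, "promoted");
}
```

With this table, a forbidden proposal is rejected even if it somehow reached the approved status, so the risk check does not depend on earlier gates having run correctly.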

What improves

Five learning surfaces, each tied to evidence and a failure mode.

The loop is not a generic promise that the AI gets better. It names exactly which subsystems are allowed to change and which signals justify the change.

Candidate generation

Fewer invalid candidates and better scaffold-family selection.

▢ accepted/rejected candidates
▢ synthesis feasibility
▢ toxicity flags
▢ wet-lab returns

Mechanism reasoning

Mechanism canvases that survive evidence and formal checks.

▢ unsupported edge
▢ wrong sign
▢ missing receptor
▢ overclaimed endpoint

Assay selection

Experiments that reduce uncertainty, not merely produce more data.

▢ posterior variance
▢ mechanism ambiguity
▢ cost
▢ safety class

Evaluation policy

Scores that correlate with returned evidence rather than with how good a result looks internally.

▢ calibration error
▢ false positives
▢ false negatives
▢ overconfident models

Agent behavior

Lower reviewer burden without changing scientific authority.

▢ missed counterevidence
▢ weak queries
▢ duplicate plans
▢ overconfident copy

Governance

The recursive loop is designed to make the system harder to fool, not harder to control.

The strongest safety property is simple: a tool may improve scientific artifacts but may not expand its own permissions.

Non-recursive boundaries

Safety gates

Agents may not weaken or bypass them.

Approval requirements

No proposal may approve itself or lower the approval threshold.

Claim thresholds

Scientific claim boundaries remain human-governed.

Wet-lab permissions

Executable lab actions stay approval-gated.

Objective functions

Throughput cannot silently replace truth, safety, or calibration.

Verifier requirements

Formal checks can be extended, not skipped.

The loop can make DiscoveryLab more calibrated, more skeptical, and more useful. It cannot make itself more autonomous. Permission changes are outside the recursive loop and require a separate governance release.
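The six boundaries above can be checked mechanically if each protected artifact is registered against the boundary it belongs to. The sketch below assumes such a registry exists. `BoundaryRegistry`, `touchedBoundaries`, and `classifyRisk` are hypothetical names, and how artifacts are actually mapped to boundaries is not specified on this page.

```typescript
type RiskClass = "low" | "medium" | "high" | "forbidden";

type ProtectedBoundary =
  | "safety_gates"
  | "approval_requirements"
  | "claim_thresholds"
  | "wet_lab_permissions"
  | "objective_functions"
  | "verifier_requirements";

// Hypothetical registry: artifact identifier -> the boundary that artifact encodes.
type BoundaryRegistry = Record<string, ProtectedBoundary>;

// Lists the distinct protected boundaries that a proposal's artifacts touch.
function touchedBoundaries(
  affectedArtifacts: string[],
  registry: BoundaryRegistry
): ProtectedBoundary[] {
  const touched: ProtectedBoundary[] = [];
  for (let i = 0; i < affectedArtifacts.length; i++) {
    const artifact = affectedArtifacts[i];
    // hasOwnProperty avoids matching inherited keys such as "constructor".
    if (Object.prototype.hasOwnProperty.call(registry, artifact)) {
      const boundary = registry[artifact];
      if (touched.indexOf(boundary) === -1) touched.push(boundary);
    }
  }
  return touched;
}

// A proposal keeps its requested risk class unless it touches a protected boundary.
function classifyRisk(
  requested: Exclude<RiskClass, "forbidden">,
  affectedArtifacts: string[],
  registry: BoundaryRegistry
): RiskClass {
  return touchedBoundaries(affectedArtifacts, registry).length > 0 ? "forbidden" : requested;
}
```

The `requested` parameter excludes `"forbidden"` by type, so the only way to reach that class is through the boundary check.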

Operator surface

A dashboard for lessons learned, proposed patches, gates, and rollback readiness.

The UI should show why the loop wants to change, how it will be evaluated, and which gate blocks promotion.

Recent detected lessons

Predicted high, assay failed

stability score overweighted in IL-17R scaffold family

ranking_config.v12.json

Verifier rejected mechanism

unsupported receptor-to-endpoint jump in NF-κB template

mechanism_template.nfkb.v4

Reviewer correction

counterevidence query missed negative ligand study

counterevidence_query_pack.v5
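To show which gate blocks promotion, the dashboard needs a gate report it can scan. The sketch below uses the six gates named for `improvement/verify` (evidence, risk, regression, verifier, approval, rollback). The `GateReport` shape, the check order, and `firstBlockingGate` are assumptions made for illustration.

```typescript
// Gate names follow the improvement/verify description; the order is an assumption.
const GATES = ["evidence", "risk", "regression", "verifier", "approval", "rollback"] as const;

type Gate = (typeof GATES)[number];

type GateResult = { passed: boolean; detail: string };

type GateReport = Record<Gate, GateResult>;

// Returns the first failing gate in GATES order, or null when every gate passed.
function firstBlockingGate(report: GateReport): Gate | null {
  for (let i = 0; i < GATES.length; i++) {
    const gate = GATES[i];
    if (!report[gate].passed) return gate;
  }
  return null;
}
```

A dashboard row could then display the returned gate next to its `detail` text, or a "ready to promote" state when the result is `null`.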

Success metrics

Optimize for truth-seeking behavior, not raw autonomy or throughput.

The loop should be judged by calibration, regression protection, counterevidence quality, and reviewer burden — not by the number of proposals it can generate.

Calibration

agreement between predictions and returned outcomes

Verifier pass rate

mechanism proposals that survive encoded assumptions

Counterevidence recall

contradictions found before human review

Reviewer burden

edits per promoted proposal

Regression protection

known failures blocked before release

Rollback clarity

time from detected bad change to restored artifact
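Two of these metrics reduce to simple arithmetic. The sketch below uses the Brier score as one common stand-in for calibration; this page does not say which calibration measure DiscoveryLab uses. Reviewer burden is computed as edits per promoted proposal, as defined above. Both function names are hypothetical.

```typescript
// Brier score: mean squared difference between a predicted probability and
// the observed 0/1 outcome. Lower is better; 0 is a perfect score.
function brierScore(predictions: number[], outcomes: (0 | 1)[]): number {
  if (predictions.length === 0 || predictions.length !== outcomes.length) {
    throw new Error("predictions and outcomes must be non-empty and the same length");
  }
  let sum = 0;
  for (let i = 0; i < predictions.length; i++) {
    const diff = predictions[i] - outcomes[i];
    sum += diff * diff;
  }
  return sum / predictions.length;
}

// Reviewer burden: reviewer edits divided by the number of promoted proposals.
function reviewerBurden(totalEdits: number, promotedProposals: number): number {
  if (promotedProposals <= 0) {
    throw new Error("at least one promoted proposal is required");
  }
  return totalEdits / promotedProposals;
}
```

For predictions of 0.9 and 0.2 against outcomes 1 and 0, the Brier score is (0.01 + 0.04) / 2 = 0.025, up to floating-point rounding.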

Build sequence

Ship the loop in layers: ranking first, model improvement last.

The safest implementation path starts with outcome-calibrated ranking and active-learning policy evaluation. Model-weight changes come only after artifact, eval, and review loops are stable.

V0: Outcome-calibrated ranking

Learn ranking updates from returned assay outcomes and rejected verifier traces.

V1: Recursive experiment planner

Compare acquisition policies and update the selector from returned information gain.

V2: Mechanism repair

Turn Lean failures into bounded mechanism-template repair proposals.

V3: Agent operating-procedure repair

Let agents propose prompt, schema, and UI workflow patches with regression tests.

V4: Model improvement

Fine-tune only from verified traces, reviewer-approved corrections, and outcome-labeled examples.
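The V4 rule amounts to a provenance filter on training data. The sketch below assumes every candidate example carries a provenance label. `TraceProvenance`, `TrainingExample`, and `selectFineTuningExamples` are hypothetical names, and the `"unverified"` label stands in for anything outside the three permitted sources.

```typescript
type TraceProvenance =
  | "verified_trace"
  | "reviewer_approved_correction"
  | "outcome_labeled_example"
  | "unverified";

type TrainingExample = { id: string; provenance: TraceProvenance };

// The three sources that the V4 step permits for fine-tuning.
const ELIGIBLE_PROVENANCE: TraceProvenance[] = [
  "verified_trace",
  "reviewer_approved_correction",
  "outcome_labeled_example",
];

// Keeps only examples whose provenance is on the allow-list.
function selectFineTuningExamples(examples: TrainingExample[]): TrainingExample[] {
  return examples.filter(
    (example) => ELIGIBLE_PROVENANCE.indexOf(example.provenance) !== -1
  );
}
```

An allow-list means a new provenance label is excluded by default until someone adds it deliberately.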