A backbone for Agentic ML R&D

AI coding agents like Claude Code write excellent Python, but ML is stateful, iterative, and unforgiving. Agents lose context after compaction and fall victim to silent failures, with no way to verify results.

Meet Goldfish
# After context compaction...
status
Workspace: transformer-v2 (mounted)
Last run: v3 — accuracy 94.2%
Goal: "Beat baseline with attention"
Next: "Try lr=3e-4 per last note"
✓ Full context restored
Resume · Run · Catch · Learn

Goldfish is the ML platform built for agents. Contract-based runs. Deterministic validation. AI-powered review. Everything documented automatically. A backbone that transforms a coding agent into a research assistant with perfect recall, infinite patience, and documentation you'd never write yourself.

What the agents say...

"A memory I can actually trust."

— Claude

"Reproducibility by default."

— Codex

"The missing link between code and insight."

— Gemini

What Goldfish does

Tools like W&B and MLflow weren't designed for agents. Goldfish is — built around their strengths (tireless, precise, great at documentation) and weaknesses (no persistent memory, no intuition for "normal"). One server handles the MLOps so agents can focus on research.

Memory that persists

Every decision, result, and rationale is captured. After compaction, agents resume with full context — what they tried, why, and what to do next.

Perfect provenance

Every run is versioned before execution. Full lineage from raw data to final model. "What changed between v3 and v4?" is always answerable.
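
One way to picture "versioned before execution" is a content hash computed over everything that defines a run, taken before any training starts. The sketch below is illustrative only and is not Goldfish's implementation or API; the function name `run_version_id` and the idea of passing source files as a dict are assumptions made for the example.

```python
import hashlib
import json


def run_version_id(config, source_files):
    """Return a short, deterministic ID for a run (hypothetical helper).

    config: dict of hyperparameters (must be JSON-serializable).
    source_files: dict mapping file path -> file contents as a string.
    """
    h = hashlib.sha256()
    # sort_keys makes the hash independent of dict insertion order.
    h.update(json.dumps(config, sort_keys=True).encode("utf-8"))
    for path in sorted(source_files):
        # Null bytes separate fields so adjacent values cannot blur together.
        h.update(path.encode("utf-8"))
        h.update(b"\0")
        h.update(source_files[path].encode("utf-8"))
        h.update(b"\0")
    return h.hexdigest()[:12]
```

Because the ID depends only on the config and the code, two runs with the same ID started from the same definition, and any change to either produces a new ID that can be diffed against the old one.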

Silent failure detection

Deterministic checks catch shape mismatches, NaN propagation, and data leakage. AI review catches logic errors. Problems surface before they corrupt results.
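
To make the three deterministic checks concrete, here is a minimal sketch in plain Python. It is not Goldfish's code: the function names, the use of shape tuples, and the assumption that every example carries a unique ID are all hypothetical choices for illustration.

```python
import math


def check_shape(actual, expected):
    """Return an error message if the two shape tuples differ, else None."""
    if tuple(actual) != tuple(expected):
        return f"shape mismatch: got {tuple(actual)}, expected {tuple(expected)}"
    return None


def check_no_nan(values):
    """Return an error message if any float in `values` is NaN, else None."""
    bad = [i for i, v in enumerate(values) if math.isnan(v)]
    if bad:
        return f"NaN at {len(bad)} position(s), first at index {bad[0]}"
    return None


def check_no_leakage(train_ids, test_ids):
    """Return an error message if any example ID is in both splits, else None."""
    overlap = set(train_ids) & set(test_ids)
    if overlap:
        return f"data leakage: {len(overlap)} ID(s) appear in both train and test"
    return None


def validate(output_shape, expected_shape, losses, train_ids, test_ids):
    """Run all checks and return the list of problems (empty means clean)."""
    results = [
        check_shape(output_shape, expected_shape),
        check_no_nan(losses),
        check_no_leakage(train_ids, test_ids),
    ]
    return [r for r in results if r is not None]
```

Each check is a pure function of its inputs, so the same run always yields the same verdict. That property is what separates this layer from the AI review layer, which handles logic errors that fixed rules cannot express.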

Institutional learning

Failed experiments become searchable knowledge. Patterns are extracted, approved, and applied. The same mistake never happens twice.

Instant comparison

What made run B better than run A? Config diff, metric deltas, outcome tracking — all captured automatically, recallable instantly.
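
A config diff and metric deltas can be sketched in a few lines, assuming each run is stored as a flat dict of hyperparameters and a flat dict of numeric metrics. The helper names `diff_configs` and `metric_deltas` are hypothetical and do not describe Goldfish's actual interface.

```python
def diff_configs(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key whose value differs.

    A key missing from one side is reported with None on that side.
    """
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


def metric_deltas(a, b):
    """Return b - a for every metric that both runs recorded."""
    return {k: b[k] - a[k] for k in sorted(set(a) & set(b))}


# Example: run B lowered the learning rate and added warmup.
run_a = {"lr": 0.001, "batch_size": 64}
run_b = {"lr": 0.0003, "batch_size": 64, "warmup_steps": 500}
changes = diff_configs(run_a, run_b)
# `changes` holds entries for "lr" and "warmup_steps" only; "batch_size" is unchanged.
```

Pairing the two results answers "what changed" and "what did it do" in one step, which is the comparison an agent needs before deciding on the next experiment.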

One-line compute

Write "profile: h100-spot" and run. Local Docker for iteration, cloud GPUs for training. Agents focus on research, not DevOps.