What The Agent Sees On Reset
Alert โ no code, no traceback, just this:
"Training completed. Metrics look suspicious and vary between evaluation runs."
run_code
get_traceback
inspect_gradients
print_shapes
view_source
ยท 5 steps total
๐ฎ Live Demo โ Try It
Step 1 โ Reset (start a new episode)
Task:
Click "POST /reset" to start an episode...
Step 2 โ Inspect (call a diagnostic tool)
Tool:
Reset first, then inspect...
Step 3 โ Fix (submit your fix)
Bug type:
Inspect first, then submit fix...
๐ Try In Terminal
curl -s -X POST https://rak2315-ml-debug-env.hf.space/reset -H "Content-Type: application/json" -d "{\"task_id\": \"compound_leakage_eval\"}"
curl -s -X POST https://rak2315-ml-debug-env.hf.space/step -H "Content-Type: application/json" -d "{\"action\": {\"action_type\": \"inspect\", \"tool_name\": \"run_code\"}}"
curl -s -X POST https://rak2315-ml-debug-env.hf.space/step -H "Content-Type: application/json" -d "{\"action\": {\"action_type\": \"inspect\", \"tool_name\": \"inspect_gradients\"}}"
curl -s -X POST https://rak2315-ml-debug-env.hf.space/step -H "Content-Type: application/json" -d "{\"action\": {\"action_type\": \"fix\", \"bug_type\": \"compound_leakage_eval\", \"diagnosis\": \"Two bugs: data normalized before split and model.eval() missing\", \"fixed_code\": \"# placeholder\"}}"
curl https://rak2315-ml-debug-env.hf.space/health
curl https://rak2315-ml-debug-env.hf.space/tasks
๐ Training Results โ GRPO on Qwen2.5-1.5B
Run 1 exposed grader bugs โ fixed them. Run 2 showed improvement. The environment improved itself.
๐ฏ 8 Tasks โ Easy to Expert
shape_mismatchEasy
nn.Linear input dim wrong โ explicit RuntimeError crash on first forward pass.
training_collapseMedium
Bad LR โ NaN loss, or wrong loss fn โ flat plateau. No crash.
wrong_deviceMedium
Model on GPU, data on CPU โ device mismatch crash on first forward pass.
gradient_not_zeroedMedium-Hard
Missing zero_grad() โ gradients accumulate โ loss explodes silently to NaN.
data_leakageHard
Normalized before split โ 96% accuracy that can't be trusted. No crash.
missing_eval_modeHard
No model.eval() โ Dropout active during eval โ non-deterministic metrics.
compound_shape_deviceMedium-Hard
TWO bugs: shape mismatch + device mismatch. Both must be fixed for 0.99.
2 bugs
compound_leakage_evalExpert
TWO silent bugs: data leakage + missing eval mode. No crash. Frontier models score 0.
2 bugs
๐ Scoring Ladder
0.01Wrong bug type identified
0.20Right type, fixed code crashes
0.40Code runs, training incomplete
0.60Training completes, root cause not fixed
0.80Root cause fixed, success signal missing
0.99Perfect fix โ
+ efficiency bonus ร1.2
๐ API Endpoints
POST/resetStart episode โ alert + tools only, no code
POST/stepinspect action (tool call) or fix action
GET/stateEpisode state, tools used, best score
GET/tasksAll 8 tasks with descriptions + action schema
POST/graderScore a fix directly without a full episode
GET/baselineRun built-in LLM agent on all 8 tasks
GET/health{"status": "healthy"}
GET/docsSwagger UI
WS/wsPersistent WebSocket session