🐛 Meta × PyTorch × Scaler OpenEnv Hackathon 2026

ML Debug Env

A partially observable RL environment where AI agents debug broken PyTorch scripts using diagnostic tools, not oracles.

What The Agent Sees On Reset
Alert: no code, no traceback, just this:
"Training completed. Metrics look suspicious and vary between evaluation runs."
Tools: run_code, get_traceback, inspect_gradients, print_shapes, view_source · 5 steps total
🎮 Live Demo: Try It

Step 1: Reset (start a new episode)


Step 2: Inspect (call a diagnostic tool)


Step 3: Fix (submit your fix)

📋 Try In Terminal
Reset: start an episode on the expert task
curl -s -X POST https://rak2315-ml-debug-env.hf.space/reset -H "Content-Type: application/json" -d "{\"task_id\": \"compound_leakage_eval\"}"
Inspect: call the run_code tool
curl -s -X POST https://rak2315-ml-debug-env.hf.space/step -H "Content-Type: application/json" -d "{\"action\": {\"action_type\": \"inspect\", \"tool_name\": \"run_code\"}}"
Inspect: call the inspect_gradients tool
curl -s -X POST https://rak2315-ml-debug-env.hf.space/step -H "Content-Type: application/json" -d "{\"action\": {\"action_type\": \"inspect\", \"tool_name\": \"inspect_gradients\"}}"
Fix: submit a fix attempt
curl -s -X POST https://rak2315-ml-debug-env.hf.space/step -H "Content-Type: application/json" -d "{\"action\": {\"action_type\": \"fix\", \"bug_type\": \"compound_leakage_eval\", \"diagnosis\": \"Two bugs: data normalized before split and model.eval() missing\", \"fixed_code\": \"# placeholder\"}}"
Health check
curl https://rak2315-ml-debug-env.hf.space/health
List all 8 tasks
curl https://rak2315-ml-debug-env.hf.space/tasks
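
The same requests can be issued from Python. The following is a minimal sketch using only the standard library; the payload shapes and base URL are copied from the curl commands above, while the helper names (reset_payload, inspect_payload, fix_payload, post_json) are hypothetical and the reply is assumed to be JSON, since its exact schema is not shown here.

```python
import json
import urllib.request

# Base URL taken from the curl examples above.
BASE_URL = "https://rak2315-ml-debug-env.hf.space"


def reset_payload(task_id: str) -> dict:
    """Body for POST /reset, mirroring the curl example."""
    return {"task_id": task_id}


def inspect_payload(tool_name: str) -> dict:
    """Body for POST /step carrying an inspect action (one tool call)."""
    return {"action": {"action_type": "inspect", "tool_name": tool_name}}


def fix_payload(bug_type: str, diagnosis: str, fixed_code: str) -> dict:
    """Body for POST /step carrying a fix action."""
    return {
        "action": {
            "action_type": "fix",
            "bug_type": bug_type,
            "diagnosis": diagnosis,
            "fixed_code": fixed_code,
        }
    }


def post_json(path: str, payload: dict, timeout: float = 30.0):
    """POST a JSON body and decode the reply (assumed to be JSON)."""
    request = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With these helpers, post_json("/reset", reset_payload("compound_leakage_eval")) followed by post_json("/step", inspect_payload("run_code")) would reproduce the first two curl calls.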
📈 Training Results: GRPO on Qwen2.5-1.5B

Run 1 exposed grader bugs, which were then fixed. Run 2 showed improvement. Training the agent also improved the environment.

Reward Curve
Initial reward: 0.024
Final reward: 0.190 (200 steps · T4)
Improvement: +0.166
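
As a quick check of the reported numbers, the improvement can be recomputed from the initial and final rewards. The relative gain is derived here from those two figures and is not a number quoted by the page.

```python
initial_reward = 0.024
final_reward = 0.190

# Absolute gain is the difference; relative gain is the ratio.
absolute_gain = final_reward - initial_reward
relative_gain = final_reward / initial_reward

print(f"absolute gain: {absolute_gain:+.3f}")   # absolute gain: +0.166
print(f"relative gain: {relative_gain:.1f}x")   # relative gain: 7.9x
```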
🎯 8 Tasks: Easy to Expert
shape_mismatch (Easy)
nn.Linear input dim wrong → explicit RuntimeError crash on first forward pass.
training_collapse (Medium)
Bad LR → NaN loss, or wrong loss fn → flat plateau. No crash.
wrong_device (Medium)
Model on GPU, data on CPU → device mismatch crash on first forward pass.
gradient_not_zeroed (Medium-Hard)
Missing zero_grad() → gradients accumulate → loss explodes silently to NaN.
data_leakage (Hard)
Normalized before split → 96% accuracy that can't be trusted. No crash.
missing_eval_mode (Hard)
No model.eval() → Dropout active during eval → non-deterministic metrics.
compound_shape_device (Medium-Hard, 2 bugs)
TWO bugs: shape mismatch + device mismatch. Both must be fixed for 0.99.
compound_leakage_eval (Expert, 2 bugs)
TWO silent bugs: data leakage + missing eval mode. No crash. Frontier models score 0.
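
To make the data_leakage pattern concrete, here is a small sketch in plain Python (no PyTorch) of "normalized before split" versus the corrected order. The toy data and the standardize helper are made up for illustration and are not taken from the environment's task scripts.

```python
import statistics


def standardize(values, mean, std):
    """Scale values to zero mean and unit variance using the given statistics."""
    return [(v - mean) / std for v in values]


# Toy 1-D feature; the last two values stand in for the held-out test split.
data = [1.0, 2.0, 3.0, 4.0, 50.0, 60.0]
train, test = data[:4], data[4:]

# Leaky: statistics are computed on ALL rows before the split,
# so information about the test rows shapes the training features.
leaky_mean = statistics.mean(data)
leaky_std = statistics.pstdev(data)
leaky_train = standardize(train, leaky_mean, leaky_std)

# Clean: statistics come from the training rows only and are then
# reused, unchanged, to transform the test rows.
clean_mean = statistics.mean(train)
clean_std = statistics.pstdev(train)
clean_train = standardize(train, clean_mean, clean_std)
clean_test = standardize(test, clean_mean, clean_std)

print(leaky_mean, clean_mean)  # 20.0 2.5
```

Neither version crashes, which is why this class of bug has to be found by inspecting the script rather than reading a traceback.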
🏆 Scoring Ladder
0.01  Wrong bug type identified
0.20  Right type, fixed code crashes
0.40  Code runs, training incomplete
0.60  Training completes, root cause not fixed
0.80  Root cause fixed, success signal missing
0.99  Perfect fix ✅ + efficiency bonus ×1.2
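
One way to read the ladder is as a sequence of gates, where a fix earns the score of the first gate it fails. The sketch below encodes that reading; the function name and boolean parameters are hypothetical, this is not the grader's actual code, and the ×1.2 efficiency bonus is left out because the ladder does not say how it is applied or capped.

```python
def ladder_score(
    right_bug_type: bool,
    code_runs: bool,
    training_completes: bool,
    root_cause_fixed: bool,
    success_signal_present: bool,
) -> float:
    """Return the ladder score for the first gate that fails (assumed reading)."""
    if not right_bug_type:
        return 0.01  # wrong bug type identified
    if not code_runs:
        return 0.20  # right type, fixed code crashes
    if not training_completes:
        return 0.40  # code runs, training incomplete
    if not root_cause_fixed:
        return 0.60  # training completes, root cause not fixed
    if not success_signal_present:
        return 0.80  # root cause fixed, success signal missing
    return 0.99      # perfect fix (efficiency bonus not modeled)
```

For example, ladder_score(True, True, True, False, False) corresponds to the 0.60 rung: the script trains to completion but the underlying bug is still present.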
🌐 API Endpoints
POST  /reset     Start episode: alert + tools only, no code
POST  /step      inspect action (tool call) or fix action
GET   /state     Episode state, tools used, best score
GET   /tasks     All 8 tasks with descriptions + action schema
POST  /grader    Score a fix directly without a full episode
GET   /baseline  Run built-in LLM agent on all 8 tasks
GET   /health    Returns {"status": "healthy"}
GET   /docs      Swagger UI
WS    /ws        Persistent WebSocket session