πŸ—„οΈ GRPO SQL Optimizer πŸ“ Writeup πŸ“„ Blog.md πŸ€— Model πŸ’» GitHub
DuckDB-verifiable rewards Β· OpenEnv

# GRPO Training for SQL Query Optimization

Fine-tuned Qwen/Qwen2.5-0.5B-Instruct with GRPO (Group Relative Policy Optimization) to optimize SQL queries, using a DuckDB execution environment to verify the rewrites.

## Overview

This project trains and evaluates SQL optimization with execution-grounded rewards: the environment executes both the original and the rewritten SQL on real DuckDB data, then scores speedup, correctness, and structured diagnostics.
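A minimal sketch of what "execution-grounded" means here: run both queries, gate the reward on matching result sets, and grant the rest for speedup. The function name, reward weights, and caps below are illustrative assumptions, not the repo's actual scoring code, and `sqlite3` stands in for DuckDB since both speak the Python DB-API.

```python
import sqlite3
import time

def execution_reward(con, original_sql, rewritten_sql):
    """Hypothetical execution-grounded reward (sketch, not the repo's code):
    0.0 for invalid or incorrect SQL, otherwise a correctness floor of 0.5
    plus up to 0.5 for speedup, capped at 2x."""
    t0 = time.perf_counter()
    expected = con.execute(original_sql).fetchall()
    t_orig = time.perf_counter() - t0

    t0 = time.perf_counter()
    try:
        got = con.execute(rewritten_sql).fetchall()
    except sqlite3.Error:
        return 0.0                       # rewrite does not even execute
    t_new = time.perf_counter() - t0

    if sorted(got) != sorted(expected):  # wrong rows: no partial credit
        return 0.0
    speedup = t_orig / max(t_new, 1e-9)
    return min(1.0, 0.5 + 0.5 * min(speedup, 2.0) / 2.0)

# Toy usage on an in-memory table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(a INT)")
con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100)])
r = execution_reward(con, "SELECT a FROM t WHERE a < 10",
                          "SELECT a FROM t WHERE a < 10")
```

The correctness gate is the important design choice: a fast query that returns the wrong rows earns nothing, so the policy cannot trade accuracy for speed.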

- **5 tasks** (easy → expert)
- **DuckDB**: verifiable execution
- **GRPO**: group-relative RL
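"Group-relative" means GRPO scores each sampled rewrite against the other samples for the same query, rather than against a learned value baseline. A minimal sketch of that advantage computation, under the standard assumption of per-group mean/std normalization:

```python
def group_relative_advantages(rewards):
    """GRPO-style advantages for one group of sampled completions:
    center each reward on the group mean and scale by the group std,
    so no separate critic/value network is needed."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Four sampled rewrites of the same query, scored by the environment.
adv = group_relative_advantages([0.2, 0.5, 0.8, 0.5])
```

Samples that beat their group get positive advantages and are reinforced; below-average samples get negative ones, even if their absolute reward was decent.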

## Training curve

GRPO training curve

If this image ever breaks, the canonical plot is also in the GitHub repo: results/grpo_reward_curve.png.

### Training progress (100 episodes)

| Metric | Value |
|---|---|
| Start avg (ep 1–10) | 0.3090 |
| End avg (ep 91–100) | 0.5962 |
| Improvement | +93% |
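The +93% figure is simply the relative change between the two ten-episode window averages. A quick arithmetic check (numbers copied from the table; the helper name is ours):

```python
# Window averages reported in the table above.
start_avg, end_avg = 0.3090, 0.5962

def relative_improvement(start, end):
    """Relative gain of the late-training window over the early one."""
    return (end - start) / start

pct = relative_improvement(start_avg, end_avg)  # ~0.93, i.e. the reported +93%
```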

## Final evaluation (per task)

These task scores match the GitHub repo README (the source of truth). Task 5 is the expert scenario, so it is expected to score lowest; that is not an error.

| Task | Difficulty | Score |
|---|---|---|
| task_1_basic_antipatterns | easy | 0.7500 βœ… |
| task_2_correlated_subqueries | medium | 0.8313 βœ… |
| task_3_wildcard_scan | medium-hard | 0.6563 βœ… |
| task_4_implicit_join | hard | 0.6563 βœ… |
| task_5_window_functions | expert | 0.6500 βœ… |

## Before / After (environment-only, reproducible)

To avoid hand-wavy baselines, the GitHub repo provides a reproducible before/after contrast: "before" is analysis-only (no optimized SQL); "after" applies a deterministic fallback that emits a real optimized query. Chart: results/before_after_chart.png.

Before/after chart
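For intuition, a deterministic fallback can be as simple as a fixed rewrite rule per anti-pattern. The sketch below handles one such rule, narrowing a wildcard projection; the function name, the rule, and the column list are hypothetical, and the repo's actual fallback rules are not shown here.

```python
import re

def fallback_rewrite(sql, columns_needed):
    """Hypothetical deterministic fallback rule (illustration only):
    replace a leading 'SELECT *' with only the columns the task needs,
    shrinking the scan. A real fallback would carry one rule per task."""
    if re.match(r"(?is)^\s*select\s+\*", sql):
        cols = ", ".join(columns_needed)
        return re.sub(r"(?is)^\s*select\s+\*", f"SELECT {cols}", sql, count=1)
    return sql  # no rule matched: leave the query unchanged

q = fallback_rewrite("SELECT * FROM trips WHERE fare > 10",
                     ["fare", "distance"])
```

Because the rule is deterministic, the "after" numbers are reproducible without the model, which is what makes the contrast a fair floor rather than a cherry-picked baseline.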

## Approach

### GRPO setup

### Reward components

## Links