Fine-tuned Qwen/Qwen2.5-0.5B-Instruct with GRPO (Group Relative Policy Optimization) to optimize SQL queries in a DuckDB execution environment.
This project trains and evaluates SQL optimization with execution-grounded rewards: the environment executes both the original and the rewritten SQL against real DuckDB data, then scores speedup, result correctness, and structured diagnostics.
If this image ever breaks, the canonical plot is also in the GitHub repo: results/grpo_reward_curve.png.
| Metric | Value |
|---|---|
| Start avg (ep 1–10) | 0.3090 |
| End avg (ep 91–100) | 0.5962 |
| Improvement | +93% |
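The improvement figure follows directly from the two averages in the table:

```python
start_avg = 0.3090  # average reward, episodes 1-10
end_avg = 0.5962    # average reward, episodes 91-100

improvement = (end_avg - start_avg) / start_avg * 100
print(f"{improvement:+.0f}%")  # +93%
```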
These task scores are aligned to the GitHub repo README (source of truth). Task 5 is the expert scenario, so it is expected to score lowest; that is not an error.
| Task | Difficulty | Score |
|---|---|---|
| task_1_basic_antipatterns | easy | 0.7500 |
| task_2_correlated_subqueries | medium | 0.8313 |
| task_3_wildcard_scan | medium-hard | 0.6563 |
| task_4_implicit_join | hard | 0.6563 |
| task_5_window_functions | expert | 0.6500 |
To avoid hand-wavy baselines, we provide a reproducible before/after contrast in the GitHub repo: "before" = analysis-only (no optimized SQL), "after" = deterministic fallback with a real optimized query. Chart: results/before_after_chart.png
Reward components:

- `execution_speedup`: measured DuckDB timing ratio
- `result_correctness`: result-set match check (order-independent for large sets)
- `issue_detection`: anti-pattern detection vs. ground-truth keywords
- `approval_correctness`, `summary_quality`, `severity_labels`
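These components can be combined into a single scalar reward. The sketch below is illustrative only: the weights and the `total_reward` helper are assumptions for this example; the actual weighting lives in the repo's reward code.

```python
# Hypothetical component weights; the real values are defined in the repo.
WEIGHTS = {
    "execution_speedup": 0.4,
    "result_correctness": 0.3,
    "issue_detection": 0.1,
    "approval_correctness": 0.1,
    "summary_quality": 0.05,
    "severity_labels": 0.05,
}

def total_reward(components: dict) -> float:
    """Weighted sum of per-component scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[k] * components.get(k, 0.0) for k in WEIGHTS)

score = total_reward({
    "execution_speedup": 0.8,
    "result_correctness": 1.0,
    "issue_detection": 0.5,
})
print(round(score, 2))  # 0.67
```

Missing components default to 0.0, so a response that produces no optimized SQL at all still earns partial credit from its diagnostics, which matches the analysis-only "before" baseline described above.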