Code Generation Evaluation

20 coding tasks scored by two neutral LLM judges.

Judge: GPT-5.4-mini
Claude Sonnet 4.6: 8.8
Brewmode 8B: 6.8
Qwen3-8B base: 5.3

Judge: Gemini 3.1 Flash
Claude Sonnet 4.6: 8.8
Brewmode 8B: 6.6
Qwen3-8B base: 5.8

Brewmode wins 19/20 tasks vs Base (by GPT-5.4-mini scores)

Brewmode runs locally at $0/call. Claude costs ~$0.03/call. Both judges are neutral third parties.

Per-task scores. FR = Claude Sonnet 4.6 (frontier), BM = Brewmode 8B, Base = Qwen3-8B base.

                           GPT-5.4-mini       Gemini 3.1 Flash
Task                       FR   BM   Base     FR   BM   Base
Trie Autocomplete          9    10   8        9    9    9
Interval Scheduling        9    9    6        9    9    3
Graph BFS Shortest Path    9    8    7        8    6    7
Binary Search Variants     9    8    7        9    8    8
Min Heap                   9    8    7        9    7    8
Observable Pattern         9    8    7        9    7    8
Data Table                 7    8    6        7    9    6
Retry with Backoff         9    7    6        9    7    6
Schema Validator           9    7    5        9    7    5
Todo App                   9    7    6        9    7    7
Connection Pool            9    6    5        9    6    5
Job Queue Worker           9    6    5        9    5    6
API Cache Middleware       9    6    4        9    5    4
Event Emitter              9    6    4        9    5    4
Webhook Handler            9    5    4        9    6    5
Cursor Pagination          9    5    0
State Machine              9    5    4        9    4    3
Rate Limiter               8    4    3        9    6    5
CSV Parser                 9    4    3        8    4    3
Merge K Sorted Lists       9    8    9        9    9    9
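
The summary figures can be checked against the per-task scores with a short script. This is a minimal sketch, not the page's own tooling: the names SCORES, averages, and head_to_head are made up for illustration, the scores are transcribed from the table on the assumption that each group of digits reads as FR, BM, Base, and the Cursor Pagination row is assumed to have no Gemini 3.1 Flash scores.

```python
# Per-task scores as ((FR, BM, Base) from GPT-5.4-mini, (FR, BM, Base) from Gemini 3.1 Flash).
# None marks the one row with no Gemini scores in the source (Cursor Pagination).
SCORES = {
    "Trie Autocomplete": ((9, 10, 8), (9, 9, 9)),
    "Interval Scheduling": ((9, 9, 6), (9, 9, 3)),
    "Graph BFS Shortest Path": ((9, 8, 7), (8, 6, 7)),
    "Binary Search Variants": ((9, 8, 7), (9, 8, 8)),
    "Min Heap": ((9, 8, 7), (9, 7, 8)),
    "Observable Pattern": ((9, 8, 7), (9, 7, 8)),
    "Data Table": ((7, 8, 6), (7, 9, 6)),
    "Retry with Backoff": ((9, 7, 6), (9, 7, 6)),
    "Schema Validator": ((9, 7, 5), (9, 7, 5)),
    "Todo App": ((9, 7, 6), (9, 7, 7)),
    "Connection Pool": ((9, 6, 5), (9, 6, 5)),
    "Job Queue Worker": ((9, 6, 5), (9, 5, 6)),
    "API Cache Middleware": ((9, 6, 4), (9, 5, 4)),
    "Event Emitter": ((9, 6, 4), (9, 5, 4)),
    "Webhook Handler": ((9, 5, 4), (9, 6, 5)),
    "Cursor Pagination": ((9, 5, 0), None),
    "State Machine": ((9, 5, 4), (9, 4, 3)),
    "Rate Limiter": ((8, 4, 3), (9, 6, 5)),
    "CSV Parser": ((9, 4, 3), (8, 4, 3)),
    "Merge K Sorted Lists": ((9, 8, 9), (9, 9, 9)),
}

GPT, GEMINI = 0, 1  # index of each judge inside a SCORES value


def averages(judge):
    """Mean (FR, BM, Base) score for one judge, skipping rows it did not score."""
    rows = [pair[judge] for pair in SCORES.values() if pair[judge] is not None]
    return tuple(sum(column) / len(rows) for column in zip(*rows))


def head_to_head(judge):
    """Count (wins, ties, losses) for Brewmode against the base model."""
    wins = ties = losses = 0
    for pair in SCORES.values():
        if pair[judge] is None:
            continue
        _, bm, base = pair[judge]
        if bm > base:
            wins += 1
        elif bm == base:
            ties += 1
        else:
            losses += 1
    return wins, ties, losses
```

Under these assumptions, averages(GPT) and averages(GEMINI) land within rounding of the summary scores, and head_to_head(GPT) gives 19 wins and 1 loss, matching the 19/20 figure. head_to_head(GEMINI) gives 11 wins, 4 ties, and 4 losses over the 19 tasks that judge scored, so the 19/20 count appears to reflect the GPT-5.4-mini scores only.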
Models
Claude Sonnet 4.6 — frontier
Brewmode Qwen3-8B — fine-tuned on coding logs and code examples
Qwen3-8B base — untuned control
Training
Fine-tuned on coding logs and code examples
Unsloth + LoRA on H100
Thinking mode enabled, 16K context
Judges
GPT-5.4-mini (OpenAI) — neutral
Gemini 3.1 Flash (Google) — neutral
Scored on correctness, completeness, quality