Evaluation Results

Three models each generated 10 interactive Amazon-style components. Two neutral LLM judges scored the output.

Average score by judge

GPT-5.4-mini
Claude Sonnet 4.6: 8.7
Brewmode 8B: 5.6
Qwen3-8B base: 6.8

Gemini 3.1 Flash
Claude Sonnet 4.6: 8.0
Brewmode 8B: 5.1
Qwen3-8B base: 6.0

Models
Claude Sonnet 4.6: frontier, ~$0.03/call
Brewmode 8B: fine-tuned, $0/call
Qwen3-8B base: untuned control

Setup
4,096 token budget, equal for all
Plain HTML + Tailwind + vanilla JS
A10G GPU, 16K context, thinking enabled

Judges
GPT-5.4-mini (OpenAI), neutral
Gemini 3.1 Flash (Google), neutral
Criteria: visual quality, interactivity, completeness, code

Per-component scores

FR = Claude Sonnet 4.6 (frontier), BM = Brewmode 8B, Base = Qwen3-8B base.

| Component | GPT-5.4-mini FR | GPT-5.4-mini BM | GPT-5.4-mini Base | Gemini 3.1 Flash FR | Gemini 3.1 Flash BM | Gemini 3.1 Flash Base | Reason |
|---|---|---|---|---|---|---|---|
| Search Autocomplete | 9 | 6 | 7 | 9 | 6 | 7 | Model A best matches Amazon's visual style, has full interactivity including keyboard navigation and category filtering, complete code with... |
| Shopping Cart | 9 | 5 | 7 | 9 | 7 | 7 | Model A offers the most complete, visually Amazon-like design with fully functional interactivity and clean, well-structured code, while B a... |
| Image Gallery + Zoom | 8 | 6 | 4 | 8 | 4 | 5 | Model A provides a complete, well-structured Amazon-like gallery with functional zoom and thumbnail interactivity, while B is incomplete and... |
| Star Rating Filter | 9 | 7 | 6 | 9 | 6 | 7 | Model A best matches Amazon's visual style with clear interactivity and complete, clean code, while B uses accessible inputs but lacks desel... |
| Add to Cart Widget | 9 | 5 | 7 | 9 | 6 | 7 | Model A best matches Amazon's visual style, has full working interactivity, complete code, and clean structure, while B is incomplete with e... |
| Product Reviews | 7 | 3 | 8 | 9 | 4 | 7 | Model C offers a visually clean, Amazon-like design with functional interactivity and complete, well-structured code, while A is visually go... |
| Deal of the Day | 9 | 6 | 7 | 9 | 6 | 7 | Model A best matches Amazon's visual style with detailed layout and working countdown timer, while B is minimal and incomplete visually, and... |
| Top Navigation | 9 | 6 | 7 | - | - | - | Model A best matches Amazon's visual style, includes interactive JS handlers, is complete, and has clean, well-structured code. |
| Order Confirmation | 9 | 7 | 8 | 9 | 7 | 6 | Model A best matches Amazon's visual style with clear structure and functional tracking interactivity, while B lacks meaningful interactivit... |
| Product Comparison Table | 9 | 5 | 7 | 9 | 5 | 7 | Model A delivers a complete, well-structured Amazon-like comparison table with polished visuals and fully functional interactive buttons, wh... |

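
The per-judge averages can be checked against the per-component scores. The sketch below is an illustration, not the scoring script used for this evaluation. It assumes each three-digit group in the component table reads as FR, BM, Base; that the single Top Navigation group belongs to GPT-5.4-mini; and that the missing Gemini 3.1 Flash entry for Top Navigation counts as zero. The names `GPT_SCORES`, `GEMINI_SCORES`, and `averages` are hypothetical. Under those assumptions, dividing each total by 10 reproduces the reported averages.

```python
# Per-component score groups in table order, one digit each for FR, BM, Base.
# Assumption: the lone Top Navigation group (8th entry) is GPT-5.4-mini's.
GPT_SCORES = ["967", "957", "864", "976", "957", "738", "967", "967", "978", "957"]
# Assumption: Gemini 3.1 Flash has no Top Navigation score (None).
GEMINI_SCORES = ["967", "977", "845", "967", "967", "947", "967", None, "976", "957"]

MODELS = ("FR", "BM", "Base")


def averages(groups, divisor):
    """Sum each model's digit across all scored groups, then divide by `divisor`."""
    totals = [0, 0, 0]
    for group in groups:
        if group is None:  # unscored component contributes nothing to the total
            continue
        for i, digit in enumerate(group):
            totals[i] += int(digit)
    return {model: round(total / divisor, 1) for model, total in zip(MODELS, totals)}


gpt_avg = averages(GPT_SCORES, 10)        # matches 8.7 / 5.6 / 6.8
gemini_avg = averages(GEMINI_SCORES, 10)  # matches 8.0 / 5.1 / 6.0
gemini_scored_only = averages(GEMINI_SCORES, 9)  # mean over the nine scored components

print(gpt_avg, gemini_avg, gemini_scored_only)
```

Averaged over only the nine components that have a Gemini 3.1 Flash score, the figures come out to roughly 8.9, 5.7, and 6.7, so the reported Gemini averages appear to be lowered by the missing entry.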
See the live rendered output: click, type, and interact with each component.