Here's a fact most AI engineers overlook: the majority of requests hitting your API are simple. "Summarize this email." "Extract the date from this text." "Is this sentiment positive or negative?" These tasks don't need GPT-5.5 at $5.00 per million input tokens. They need GPT-5.4 Nano at $0.20, or DeepSeek V4 Flash at $0.14. The problem is that most applications send everything to one model — usually the expensive one — because it's easier.
A model router solves this. It examines each incoming request, estimates its complexity, and sends it to the cheapest model that can handle it well. The result: 80-90% of your traffic goes to budget models, and your API bill drops proportionally.
The Cost Math
Let's quantify the opportunity. Suppose your application handles 50,000 requests per day with an average of 3,000 input tokens and 500 output tokens per request.
Without routing (everything to GPT-5.5)
| Metric | Value |
|---|---|
| Daily input cost | $750.00 |
| Daily output cost | $750.00 |
| Daily total | $1,500.00 |
| Monthly total | $45,000 |
With routing (90% to Nano, 10% to GPT-5.5)
| Tier | Requests | Daily Cost |
|---|---|---|
| Simple → GPT-5.4 Nano ($0.20/$1.25) | 45,000 | $55.13 |
| Complex → GPT-5.5 ($5.00/$30.00) | 5,000 | $150.00 |
| Total | 50,000 | $205.13 |
| Monthly total | | $6,154 |
Routing 90% of traffic from GPT-5.5 to GPT-5.4 Nano reduces the monthly bill from $45,000 to $6,154. That's an 86% reduction — saving $38,846 every month.
And this is a conservative estimate. If you route to DeepSeek V4 Flash ($0.14/$0.28) instead of Nano, the simple-tier daily cost drops to $25.20, and the monthly total falls to $5,256.
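These numbers are easy to verify. Here is the arithmetic as a minimal sketch, using the per-million prices quoted above:

```python
def daily_cost(requests, in_tokens, out_tokens, in_price, out_price):
    """Daily cost: per-request token counts, prices in $ per million tokens."""
    return requests * (in_tokens * in_price + out_tokens * out_price) / 1_000_000

baseline = daily_cost(50_000, 3_000, 500, 5.00, 30.00)  # GPT-5.5 only: 1500.0
routed = (
    daily_cost(45_000, 3_000, 500, 0.20, 1.25)    # 90% to Nano: 55.125
    + daily_cost(5_000, 3_000, 500, 5.00, 30.00)  # 10% to GPT-5.5: 150.0
)
print(routed)                    # 205.125, i.e. ~$205.13/day
print((baseline - routed) * 30)  # 38846.25, i.e. ~$38,846/month saved
```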
What Is a Model Router?
A model router is a layer between your application logic and the AI API calls. Instead of hardcoding a single model, you define multiple tiers and a routing strategy:
- Request arrives — a user sends a message, uploads a document, or triggers an automation.
- Router classifies — the router examines the request and estimates its complexity.
- Model selected — based on complexity, the router picks the cheapest capable model.
- Response returned — the model processes the request and the response goes back to the user.
The user never knows which model answered. The router is invisible to them — they just see a response that's as good as it needs to be, at a fraction of the cost.
Routing Strategies
There are three main approaches to routing, from simplest to most sophisticated:
Strategy 1: Rule-Based Routing
The simplest approach. You define rules based on observable request properties:
- Task type: Classification and extraction go to budget models. Code generation and reasoning go to frontier models.
- Input length: Short inputs (<500 tokens) are usually simpler. Long inputs (>5,000 tokens) often involve complex analysis.
- Endpoint: Different API endpoints serve different purposes. Your `/classify` endpoint doesn't need the same model as `/analyze`.
- User tier: Free users get budget models. Paying users get premium models.
Rule-based routing is fast (zero overhead), predictable, and easy to debug. The downside: it requires you to understand your traffic patterns upfront, and it can't adapt to unexpected complexity within a category.
Rule-based routing works best when your application has clearly defined task types. If you have a chatbot with separate endpoints for FAQ, support, and code help, route by endpoint. If you're building a data pipeline, route by input type (structured vs. unstructured).
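As a concrete sketch, endpoint rules can be a dictionary lookup plus a couple of overrides. The endpoint names, model IDs, and thresholds below are illustrative, not prescriptive:

```python
# Rule-based routing: choose a model from observable request properties.
ENDPOINT_MODELS = {
    "/classify": "gpt-5.4-nano",
    "/extract": "gpt-5.4-nano",
    "/analyze": "gpt-5.5",
}

def route_by_rules(endpoint: str, input_tokens: int, user_tier: str = "free") -> str:
    model = ENDPOINT_MODELS.get(endpoint, "gpt-5.4-mini")  # default: mid-tier
    if input_tokens > 5_000 and model == "gpt-5.4-nano":
        model = "gpt-5.4-mini"  # long inputs often involve complex analysis
    if user_tier == "free" and model == "gpt-5.5":
        model = "gpt-5.4-mini"  # cap free users at mid-tier
    return model
```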
Strategy 2: Classifier-Based Routing
Train or prompt a small model to classify request complexity before routing. This is more flexible than rules — it can handle ambiguous cases and adapt to new patterns.
The classifier can be:
- A dedicated small model: Fine-tuned on your historical data to predict which model gave the best response for similar requests.
- A prompt-based classifier: Use a cheap model (Nano or Flash-Lite) with a simple prompt: "Rate this request's complexity from 1-5." Route 1-2 to budget, 3 to mid-tier, 4-5 to frontier.
- Embedding similarity: Compare the request embedding to a set of known-simple and known-complex examples. Route based on which cluster it's closest to.
The overhead of classification is small: a classifier call costs roughly $0.0002 at Nano pricing. If it saves even one unnecessary frontier-model call (about $0.03 at the example request profile above), it pays for itself roughly 150x over.
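A minimal prompt-based classifier might look like the sketch below. It uses the same module-level openai client as the router later in this article; the prompt wording and the fallback behavior are assumptions to tune:

```python
import openai

def classify_with_model(user_message: str) -> int:
    """Ask a cheap model to rate complexity 1-5; route up if the answer is unparseable."""
    response = openai.chat.completions.create(
        model="gpt-5.4-nano",
        messages=[
            {
                "role": "system",
                "content": "Rate the complexity of the user's request from 1 "
                           "(trivial) to 5 (very complex). Reply with a single digit.",
            },
            {"role": "user", "content": user_message},
        ],
        max_tokens=1,
    )
    try:
        return max(1, min(5, int((response.choices[0].message.content or "").strip())))
    except ValueError:
        return 5  # when in doubt, escalate rather than risk a bad cheap answer
```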
Strategy 3: Cascade Routing
Start with the cheapest model. If its response quality is below a threshold, retry with a more expensive model. This is the most cost-efficient strategy because it only escalates when necessary.
The quality check can be:
- Self-evaluation: Ask the model to rate its own confidence. "On a scale of 1-10, how confident are you in this answer?" If confidence is below 7, escalate.
- Structural validation: If the response should be JSON, check if it's valid JSON. If it should contain a date, check if a date is present. If validation fails, escalate.
- Length heuristic: If the model returns a very short response to a complex question, it probably didn't understand the task. Escalate.
Cascade routing has higher latency for complex requests (two round trips instead of one) but achieves the lowest possible cost for simple requests — which are the majority.
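The check itself can be a few lines. Here's a sketch combining the structural and length heuristics above (the expect_json flag and the 50-character threshold are assumptions to tune per task):

```python
import json

def passes_quality_check(content: str | None, expect_json: bool = False,
                         min_length: int = 50) -> bool:
    """Return True if a response looks good enough to skip escalation."""
    if not content or len(content) < min_length:  # length heuristic
        return False
    if expect_json:  # structural validation
        try:
            json.loads(content)
        except json.JSONDecodeError:
            return False
    return True
```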
Implementation: A Practical Router
Here's a simplified model router in Python that combines rule-based and cascade strategies:
```python
import openai

# Model tiers, cheapest first. Prices are $ per million tokens (input/output).
TIERS = [
    {"name": "nano", "model": "gpt-5.4-nano", "input": 0.20, "output": 1.25},
    {"name": "mini", "model": "gpt-5.4-mini", "input": 0.75, "output": 4.50},
    {"name": "full", "model": "gpt-5.5", "input": 5.00, "output": 30.00},
]


def classify_complexity(messages: list[dict]) -> int:
    """Estimate request complexity on a 1-5 scale."""
    # Look at the most recent user message.
    user_msg = next(
        (m["content"] for m in reversed(messages) if m["role"] == "user"), ""
    )
    # Simple heuristics: adapt to your domain.
    score = 1
    if len(user_msg) > 2000:
        score += 1
    if any(kw in user_msg.lower() for kw in [
        "code", "debug", "implement", "refactor", "analyze"
    ]):
        score += 2
    if any(kw in user_msg.lower() for kw in [
        "compare", "explain why", "step by step", "trade-off"
    ]):
        score += 1
    return min(score, 5)


def select_tier(complexity: int) -> dict:
    """Map complexity score to a model tier."""
    if complexity <= 2:
        return TIERS[0]  # nano
    elif complexity <= 3:
        return TIERS[1]  # mini
    else:
        return TIERS[2]  # full


def call_with_cascade(messages: list[dict], **kwargs) -> dict:
    """Start cheap, escalate if quality is low."""
    result = None
    for tier in TIERS:
        response = openai.chat.completions.create(
            model=tier["model"],
            messages=messages,
            **kwargs,
        )
        result = response.choices[0].message
        # Quality check: did the model produce a substantive response?
        if result.content and len(result.content) > 50:
            return {
                "content": result.content,
                "model": tier["model"],
                "tier": tier["name"],
            }
    # All tiers failed the check: return the last (most capable) attempt.
    return {
        "content": result.content,
        "model": TIERS[-1]["model"],
        "tier": TIERS[-1]["name"],
    }


def route(messages: list[dict], strategy: str = "cascade", **kwargs) -> dict:
    """Main routing function."""
    if strategy == "rules":
        complexity = classify_complexity(messages)
        tier = select_tier(complexity)
        response = openai.chat.completions.create(
            model=tier["model"], messages=messages, **kwargs
        )
        return {
            "content": response.choices[0].message.content,
            "model": tier["model"],
            "tier": tier["name"],
        }
    elif strategy == "cascade":
        return call_with_cascade(messages, **kwargs)
    raise ValueError(f"Unknown routing strategy: {strategy}")
```
This is a starting point. In production, you'd add logging, metrics, timeout handling, and a more sophisticated quality check. But the pattern is clear: classify, route, escalate if needed.
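Calling it looks like this (an illustrative request; assumes your API key is configured in the environment):

```python
result = route(
    [{"role": "user", "content": "Extract the due date from this invoice: ..."}],
    strategy="cascade",
)
print(result["tier"], result["model"])  # e.g. "nano gpt-5.4-nano" if tier 1 sufficed
```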
How to Classify Complexity
The quality of your router depends entirely on how well it classifies request complexity. Here are the signals that work best, ranked by reliability:
High-confidence signals
- Task type (from your application): If your app has different modes (chat, code, search, classify), you already know the task type. This is the strongest signal.
- Input structure: A request with JSON schema, code blocks, or structured data is more complex than a plain text question.
- Historical performance: If you've logged which model handled similar requests successfully, use that data. "Requests like this one were solved by Nano 94% of the time" is the best routing signal.
Medium-confidence signals
- Input length: Longer inputs tend to be more complex, but not always. A 3,000-token legal contract is complex. A 3,000-token product description for extraction is not.
- Keywords: Words like "code," "debug," "analyze," and "compare" correlate with complexity. But keyword matching is brittle — users phrase things unpredictably.
- User behavior: If a user has been asking follow-up questions for 10 turns, the conversation is probably complex. Route up.
Low-confidence signals
- Time of day: Some teams route differently by time — budget models during off-peak, premium during business hours. This is a blunt instrument but can help with cost smoothing.
- User agent / client: Mobile users might get shorter, simpler responses. Desktop users might get more detailed ones. Weak signal, but sometimes useful.
Don't over-engineer the classifier. Start with task-type routing (if your app has distinct endpoints) or a simple prompt-based classifier. Measure the quality difference. If users don't notice, you're routing correctly. Add complexity only when you have data showing it's needed.
Monitoring: The Key to Reliable Routing
A router without monitoring is a time bomb. You need to know when cheap models are failing so you can adjust thresholds. Track these metrics:
- Escalation rate: What percentage of requests get escalated from tier 1 to tier 2? If this exceeds 30%, your tier 1 model is too weak for your traffic — consider upgrading the cheap tier or adjusting classification rules.
- Quality scores: If you have human evaluations or automated quality metrics, track them per tier. If tier 1 quality drops below an acceptable threshold, tighten the routing rules.
- Latency per tier: Cascade routing adds latency for escalated requests. Track p50 and p95 latency per tier to ensure you're meeting SLAs.
- Cost per request: The metric that matters most. Track the blended cost per request across all tiers. This should decrease over time as you optimize routing.
The goal isn't to route everything to the cheapest model. It's to route everything to the cheapest model that maintains acceptable quality. Monitoring tells you where that line is.
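Here's a sketch of in-process tracking for these metrics; a production system would export the same counters to whatever metrics backend you already run:

```python
from collections import Counter

class RouterMetrics:
    """In-process counters for escalation rate and blended cost per request."""

    def __init__(self):
        self.requests_per_tier = Counter()
        self.escalations = 0
        self.total_cost = 0.0

    def record(self, tier: str, cost: float, escalated: bool) -> None:
        self.requests_per_tier[tier] += 1
        self.escalations += int(escalated)
        self.total_cost += cost

    def escalation_rate(self) -> float:
        total = sum(self.requests_per_tier.values())
        return self.escalations / total if total else 0.0

    def blended_cost(self) -> float:
        total = sum(self.requests_per_tier.values())
        return self.total_cost / total if total else 0.0
```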
Real-World Routing Patterns
Pattern 1: Chatbot with Mixed Traffic
A customer support chatbot handles FAQs, account questions, and complex complaints.
- FAQ (60% of traffic): "What's your return policy?" → DeepSeek V4 Flash ($0.14/$0.28)
- Account queries (25%): "Why was I charged twice?" → GPT-5.4 Mini ($0.75/$4.50)
- Complex complaints (15%): "I need to dispute this and escalate to management" → GPT-5.5 ($5.00/$30.00)
Blended cost: $0.62 per request vs. $5.00 without routing. 88% savings.
Pattern 2: Code Assistant
An IDE plugin that handles completions, explanations, and refactoring.
- Completions (50%): Simple code completion → GPT-5.4 Nano ($0.20/$1.25)
- Explanations (30%): "What does this function do?" → Claude Sonnet 4.6 ($3.00/$15.00)
- Refactoring (20%): "Refactor this class to use the observer pattern" → Claude Opus 4.7 ($5.00/$25.00)
Blended cost: $1.62 per request vs. $5.00 without routing. 68% savings.
Pattern 3: Data Processing Pipeline
A pipeline that extracts, classifies, and summarizes documents.
- Extraction (40%): Extract dates, names, amounts from invoices → Gemini 2.5 Flash-Lite ($0.10/$0.40)
- Classification (35%): Categorize documents by type → GPT-5.4 Nano ($0.20/$1.25)
- Summarization (20%): Summarize long reports → Gemini 2.5 Flash ($0.30/$2.50)
- Analysis (5%): Complex multi-document analysis → GPT-5.5 ($5.00/$30.00)
Blended cost: $0.31 per request vs. $2.50 without routing. 88% savings.
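You can sanity-check a traffic mix like these by computing blended per-million prices. A sketch using Pattern 3's mix and the prices quoted above:

```python
# Each entry: (share of traffic, input price, output price), prices per 1M tokens.
mix = [
    (0.40, 0.10, 0.40),   # extraction -> Gemini 2.5 Flash-Lite
    (0.35, 0.20, 1.25),   # classification -> GPT-5.4 Nano
    (0.20, 0.30, 2.50),   # summarization -> Gemini 2.5 Flash
    (0.05, 5.00, 30.00),  # analysis -> GPT-5.5
]
blended_in = sum(share * inp for share, inp, _ in mix)   # 0.42 $/M input
blended_out = sum(share * out for share, _, out in mix)  # ~2.60 $/M output
```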
Common Pitfalls
1. Routing too aggressively to cheap models
If 30%+ of your responses are being escalated or receiving negative feedback, your routing threshold is too aggressive. Pull back and send more traffic to mid-tier models. The cost savings aren't worth it if users are unhappy.
2. Ignoring output cost
Many developers focus on input price when routing. But output tokens cost several times more than input tokens: for the models cited in this article, the multiplier ranges from 2x (DeepSeek V4 Flash) to over 6x (the GPT-5.4 family). A model that generates long, detailed responses might be more expensive overall than a model with higher input prices but more concise outputs. Route based on total expected cost, not just input price.
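A sketch of the comparison, with illustrative token counts: a cheap model that writes long answers can cost more per request than a pricier model that stays concise.

```python
def request_cost(in_tokens, out_tokens, in_price, out_price):
    """Total expected cost per request; prices per 1M tokens."""
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000

verbose_cheap = request_cost(3_000, 3_000, 0.75, 4.50)  # GPT-5.4 Mini: $0.01575
concise_mid = request_cost(3_000, 300, 3.00, 15.00)     # Claude Sonnet 4.6: $0.01350
```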
3. No fallback
If your primary model is down or rate-limited, you need a fallback. Don't let routing logic become a single point of failure. Always have a default model that can handle any request.
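A minimal fallback wrapper might look like the sketch below, assuming the openai client used elsewhere in this article; the fallback model choice is an assumption:

```python
import openai

FALLBACK_MODEL = "gpt-5.4-mini"  # assumed default capable of handling any request

def call_with_fallback(model: str, messages: list[dict], **kwargs):
    try:
        return openai.chat.completions.create(model=model, messages=messages, **kwargs)
    except (openai.RateLimitError, openai.APIStatusError):
        # Primary model is down or rate-limited: degrade gracefully.
        return openai.chat.completions.create(
            model=FALLBACK_MODEL, messages=messages, **kwargs
        )
```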
4. Not accounting for cache
If your application has high cache hit rates (repeated system prompts), the cache read price matters more than the normal input price. A model with higher input cost but lower cache read cost might be cheaper overall. Factor caching into your routing decisions.
5. Over-engineering from day one
Start with two tiers: cheap and expensive. Route by task type or a simple classifier. Measure quality and cost. Add a third tier only when you have data showing the two-tier system leaves money on the table. Most applications never need more than three tiers.
Provider-Specific Considerations
Multi-provider routing
You can route across providers, not just within one. Use DeepSeek for the cheap tier, Claude for the mid-tier, and GPT-5.5 for the frontier tier. This gives you the best price at every level, but adds operational complexity — you need to manage multiple API keys, handle different response formats, and monitor multiple dashboards.
Batch API as a fourth tier
Both OpenAI and Anthropic offer Batch APIs with 50% off for non-real-time workloads. If your application can tolerate 24-hour turnaround, add a batch tier for background processing tasks like document indexing, content generation, or data enrichment.
Cache-aware routing
If you have a heavily cached system prompt, route to the provider with the best cache read pricing for your volume. DeepSeek's cache read at $0.0028/M is 98% cheaper than the normal price. For agent applications with repeated context, this can be more important than the normal input price.
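The effective input price under caching is just a weighted average. A sketch with DeepSeek's numbers from above and an assumed 90% cache hit rate:

```python
def effective_input_price(base_price, cache_read_price, hit_rate):
    """Blend normal and cache-read input prices by cache hit rate ($ per 1M tokens)."""
    return hit_rate * cache_read_price + (1 - hit_rate) * base_price

print(effective_input_price(0.14, 0.0028, 0.90))  # ~0.0165 $/M, an ~88% discount
```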
The Bottom Line
A model router is the highest-ROI cost optimization you can implement. It requires no changes to your prompts, no changes to your application logic, and no changes to your users' experience. You're simply matching request complexity to model capability — something you should have been doing from the start.
Start with two tiers. Route by task type. Monitor quality. Add complexity only when data demands it. The savings compound with every request, and the implementation is measured in hours, not weeks.
Pricing based on official provider documentation (May 2026). Cost examples assume standard API pricing without volume discounts. Actual savings depend on your traffic patterns, routing accuracy, and quality requirements.