Multi-Model Routing
Level: Enhancement (beyond minimal agent)
Prerequisites: Agent loop, tool system
What This Adds
Routes requests to different AI models based on:
- Current mode (smart/rush/free)
- Subagent type (oracle, finder, etc.)
- Feature flags and settings
- Context size constraints
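These inputs can be combined into a single resolver with a clear precedence: explicit override first, then subagent default, then mode default. A minimal sketch of that precedence (`routeModel` and the `RoutingInput` fields are illustrative names, not Amp's actual API):

```typescript
type AgentMode = "smart" | "rush" | "free" | "large" | "deep";

interface RoutingInput {
  mode: AgentMode;
  subagentType?: string;  // e.g. "oracle", "finder"
  overrideModel?: string; // from feature flags or user settings
}

// Precedence: explicit override > subagent default > mode default.
function routeModel(
  input: RoutingInput,
  modeDefaults: Record<AgentMode, string>,
  subagentDefaults: Record<string, string>
): string {
  if (input.overrideModel) return input.overrideModel;
  if (input.subagentType && subagentDefaults[input.subagentType]) {
    return subagentDefaults[input.subagentType];
  }
  return modeDefaults[input.mode];
}
```

The sections below fill in the mode and subagent tables this resolver would consume.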
Why Multiple Models?
Different tasks benefit from different models:
| Need | Best Model Type |
|---|---|
| Complex reasoning | Large, expensive model (Opus, GPT-5) |
| Fast responses | Small, cheap model (Haiku, Gemini Flash) |
| Code generation | Code-specialized model |
| Vision/media | Multimodal model |
| Very large context | High-context model (1M tokens) |
Model Registry
Define Available Models
interface ModelDefinition {
id: string;
provider: "anthropic" | "openai" | "vertexai";
contextWindow: number;
maxOutputTokens: number;
capabilities: {
tools: boolean;
vision: boolean;
reasoning?: boolean;
};
pricing?: {
inputPerMTok: number;
outputPerMTok: number;
cachedInputPerMTok?: number;
};
}
const MODELS: Record<string, ModelDefinition> = {
"claude-opus-4-5": {
id: "claude-opus-4-5-20251101",
provider: "anthropic",
contextWindow: 200000,
maxOutputTokens: 32000,
capabilities: { tools: true, vision: true, reasoning: true },
pricing: { inputPerMTok: 5, outputPerMTok: 25, cachedInputPerMTok: 0.5 }
},
"claude-haiku-4-5": {
id: "claude-haiku-4-5-20251001",
provider: "anthropic",
contextWindow: 200000,
maxOutputTokens: 64000,
capabilities: { tools: true, vision: true },
pricing: { inputPerMTok: 1, outputPerMTok: 5, cachedInputPerMTok: 0.1 }
},
"claude-sonnet-4-5": {
id: "claude-sonnet-4-5-20250929",
provider: "anthropic",
contextWindow: 1000000,
maxOutputTokens: 32000,
capabilities: { tools: true, vision: true, reasoning: true },
pricing: { inputPerMTok: 3, outputPerMTok: 15, cachedInputPerMTok: 0.3 }
},
"gpt-5.2": {
id: "gpt-5.2",
provider: "openai",
contextWindow: 400000,
maxOutputTokens: 128000,
capabilities: { tools: true, vision: true, reasoning: true },
pricing: { inputPerMTok: 1.75, outputPerMTok: 14, cachedInputPerMTok: 0.175 }
},
"gpt-5.2-codex": {
id: "gpt-5.2-codex",
provider: "openai",
contextWindow: 400000,
maxOutputTokens: 128000,
capabilities: { tools: true, vision: true, reasoning: true },
pricing: { inputPerMTok: 1.75, outputPerMTok: 14, cachedInputPerMTok: 0.175 }
},
"gemini-3-pro-preview": {
id: "gemini-3-pro-preview",
provider: "vertexai",
contextWindow: 1048576,
maxOutputTokens: 65535,
capabilities: { tools: true, vision: true, reasoning: true }
},
"gemini-3-flash-preview": {
id: "gemini-3-flash-preview",
provider: "vertexai",
contextWindow: 1048576,
maxOutputTokens: 65535,
capabilities: { tools: true, vision: true, reasoning: true }
},
"gemini-2.5-flash": {
id: "gemini-2.5-flash",
provider: "vertexai",
contextWindow: 1048576,
maxOutputTokens: 65535,
capabilities: { tools: true, vision: true, reasoning: true }
}
};
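One use of the pricing data is picking the cheapest model that satisfies a capability requirement. A sketch of such a query over the registry (`cheapestWith` is an illustrative helper, not part of the registry; a trimmed-down model shape stands in for `ModelDefinition`):

```typescript
interface ModelInfo {
  capabilities: { tools: boolean; vision: boolean; reasoning?: boolean };
  pricing?: { inputPerMTok: number; outputPerMTok: number };
}

// Pick the cheapest registered model (by input price) that has every
// required capability. Models without known pricing are skipped.
function cheapestWith(
  models: Record<string, ModelInfo>,
  required: Array<"tools" | "vision" | "reasoning">
): string | undefined {
  let best: string | undefined;
  for (const [name, def] of Object.entries(models)) {
    if (!def.pricing) continue;
    if (!required.every((cap) => def.capabilities[cap])) continue;
    if (!best || def.pricing.inputPerMTok < models[best].pricing!.inputPerMTok) {
      best = name;
    }
  }
  return best;
}
```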
Routing by Mode
Different modes use different default models:
const MODE_MODELS: Record<string, string> = {
smart: "claude-opus-4-5", // Best quality
rush: "claude-haiku-4-5", // Fast/cheap
free: "claude-haiku-4-5", // Free tier
large: "claude-sonnet-4-5", // 1M context
deep: "gpt-5.2-codex" // Deep reasoning
};
function getModelForMode(mode: AgentMode): string {
return MODE_MODELS[mode] ?? MODE_MODELS.smart;
}
Routing by Subagent
Subagents have default model assignments (finder is overridden at runtime):
const SUBAGENT_MODELS: Record<string, string> = {
finder: "claude-haiku-4-5", // Registry default (runtime override below)
oracle: "gpt-5.2", // Deep reasoning
librarian: "claude-haiku-4-5", // GitHub operations
"kraken-scope": "claude-haiku-4-5",
"kraken-executor": "claude-haiku-4-5",
reviewer: "claude-opus-4-5", // Quality review
"code-review": "claude-sonnet-4-5",
"course-correction": "gemini-3-pro-preview"
};
function getModelForSubagent(type: string, parentMode: AgentMode): string {
// Task subagent inherits from parent
if (type === "task") {
return getModelForMode(parentMode);
}
if (type === "finder") {
return "gemini-3-flash-preview"; // Runtime override in Amp
}
return SUBAGENT_MODELS[type] ?? getModelForMode(parentMode);
}
Context Overflow Fallback
When context exceeds the model's limit, fall back to a larger-context model:
function selectModelWithFallback(
preferredModel: string,
inputTokens: number,
maxOutputTokens: number
): string {
const model = MODELS[preferredModel];
if (!model) throw new Error(`Unknown model: ${preferredModel}`);
const effectiveLimit = model.contextWindow - maxOutputTokens;
if (inputTokens >= effectiveLimit) {
console.info("Context overflow, falling back to large-context model", {
preferredModel,
inputTokens,
limit: effectiveLimit
});
return "gemini-2.5-flash"; // 1M-context fallback (registry key above)
}
return preferredModel;
}
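The `inputTokens` value has to come from somewhere. Absent a provider tokenizer, a common rough heuristic is about four characters per token for English text, which is enough for overflow checks (this ratio is an approximation, not any provider's official count; use the provider's tokenizer or usage fields for billing):

```typescript
// Rough token estimate: ~4 characters per token for English text.
// Adequate for context-overflow checks; too coarse for billing.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}
```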
Provider-Specific Handling
Each provider has different API shapes. Handle them separately:
interface InferenceRequest {
model: string;
system: string;
messages: Message[];
tools: Tool[];
maxTokens: number;
temperature?: number;
}
async function inference(request: InferenceRequest): Promise<Response> {
const modelDef = MODELS[request.model];
switch (modelDef.provider) {
case "anthropic":
return runAnthropicInference(request, modelDef);
case "openai":
return runOpenAIInference(request, modelDef);
case "vertexai":
return runVertexAIInference(request, modelDef);
default:
throw new Error(`Unknown provider: ${modelDef.provider}`);
}
}
Anthropic-Specific
async function runAnthropicInference(
request: InferenceRequest,
model: ModelDefinition
): Promise<Response> {
const client = new Anthropic({ apiKey: getApiKey("anthropic") });
const params = {
model: model.id,
max_tokens: request.maxTokens,
messages: convertToAnthropicMessages(request.messages),
system: request.system,
tools: convertToAnthropicTools(request.tools),
// Anthropic-specific features
thinking: {
type: "enabled",
budget_tokens: 4000
}
};
const stream = client.messages.stream(params); // stream() always streams
return processAnthropicStream(stream);
}
OpenAI-Specific
async function runOpenAIInference(
request: InferenceRequest,
model: ModelDefinition
): Promise<Response> {
const client = new OpenAI({ apiKey: getApiKey("openai") });
const params = {
model: model.id,
max_completion_tokens: request.maxTokens, // GPT-5-era models reject max_tokens
messages: [
{ role: "system", content: request.system },
...convertToOpenAIMessages(request.messages)
],
tools: convertToOpenAITools(request.tools),
stream: true
};
const stream = await client.chat.completions.create(params);
return processOpenAIStream(stream);
}
VertexAI-Specific
async function runVertexAIInference(
request: InferenceRequest,
model: ModelDefinition
): Promise<Response> {
const client = new GoogleGenerativeAI(getApiKey("vertexai"));
const genModel = client.getGenerativeModel({ model: model.id });
const params = {
contents: convertToVertexAIContents(request.messages),
systemInstruction: request.system,
tools: convertToVertexAITools(request.tools),
generationConfig: {
maxOutputTokens: request.maxTokens,
temperature: request.temperature
},
// VertexAI-specific
thinkingConfig: { thinkingLevel: "MEDIUM" }
};
const result = await genModel.generateContentStream(params);
return processVertexAIStream(result);
}
Extended Thinking
Different providers handle thinking differently:
Anthropic Extended Thinking
// Budget tiers based on user phrases
function getThinkingBudget(lastUserMessage: string): number {
const text = lastUserMessage.toLowerCase();
const highTriggers = [
/\bthink harder\b/,
/\bthink intensely\b/,
/\bthink very hard\b/
];
const mediumTriggers = [
/\bthink deeply\b/,
/\bthink hard\b/,
/\bthink more\b/
];
for (const pattern of highTriggers) {
if (pattern.test(text)) return 31999;
}
for (const pattern of mediumTriggers) {
if (pattern.test(text)) return 10000;
}
return 4000; // Default
}
VertexAI Thinking Config
thinkingConfig: {
thinkingLevel: "LOW" | "MEDIUM" | "HIGH"
}
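If the same "think harder" phrases should drive both providers, the Anthropic-style token budget has to be translated into VertexAI's discrete levels. A possible mapping, using the budget tiers above as cutoffs (the thresholds are this document's choices, not provider guidance):

```typescript
type ThinkingLevel = "LOW" | "MEDIUM" | "HIGH";

// Map an Anthropic-style thinking budget onto VertexAI's discrete levels,
// aligned with the 31999 / 10000 / 4000 tiers used by getThinkingBudget.
function budgetToThinkingLevel(budgetTokens: number): ThinkingLevel {
  if (budgetTokens >= 30_000) return "HIGH";
  if (budgetTokens >= 10_000) return "MEDIUM";
  return "LOW";
}
```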
Prompt Caching
Anthropic supports prompt caching for cost savings:
function applyCacheControl(messages: Message[]): Message[] {
if (messages.length === 0) return messages;
const lastIndex = messages.length - 1;
return messages.map((msg, i) => {
if (i === lastIndex) {
// Last message: 5 minute cache
return addCacheControl(msg, "5m");
}
return msg;
});
}
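The `addCacheControl` helper above is left undefined. A minimal sketch, assuming Anthropic-style messages where the cache breakpoint is set via `cache_control` on the last content block (the `Message` and `ContentBlock` shapes here are simplified for illustration):

```typescript
interface ContentBlock {
  type: string;
  text?: string;
  cache_control?: { type: "ephemeral"; ttl?: "5m" | "1h" };
}

interface Message {
  role: "user" | "assistant";
  content: ContentBlock[];
}

// Mark the last content block of a message as a cache breakpoint,
// returning a new message rather than mutating the original.
function addCacheControl(msg: Message, ttl: "5m" | "1h"): Message {
  if (msg.content.length === 0) return msg;
  const content = msg.content.map((block, i) =>
    i === msg.content.length - 1
      ? { ...block, cache_control: { type: "ephemeral" as const, ttl } }
      : block
  );
  return { ...msg, content };
}
```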
Cost Tracking
Track costs per model for billing and optimization:
interface UsageRecord {
model: string;
provider: string;
inputTokens: number;
outputTokens: number;
cachedInputTokens: number;
cost: number;
}
function calculateCost(
model: string,
inputTokens: number,
outputTokens: number,
cachedInputTokens: number
): number {
const pricing = MODELS[model].pricing;
if (!pricing) return 0; // no pricing registered (e.g. Gemini preview models)
const uncachedInput = inputTokens - cachedInputTokens;
return (
(uncachedInput / 1_000_000) * pricing.inputPerMTok +
(cachedInputTokens / 1_000_000) * (pricing.cachedInputPerMTok ?? pricing.inputPerMTok) +
(outputTokens / 1_000_000) * pricing.outputPerMTok
);
}
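Working the formula through for one call: a claude-haiku-4-5 request with 100k input tokens (20k of them cache hits) and 5k output tokens, using the registry pricing above. This self-contained check inlines the same arithmetic as `calculateCost`:

```typescript
// claude-haiku-4-5 pricing from the registry:
// inputPerMTok: 1, outputPerMTok: 5, cachedInputPerMTok: 0.1
const pricing = { inputPerMTok: 1, outputPerMTok: 5, cachedInputPerMTok: 0.1 };

const inputTokens = 100_000;
const cachedInputTokens = 20_000;
const outputTokens = 5_000;

const cost =
  ((inputTokens - cachedInputTokens) / 1_000_000) * pricing.inputPerMTok + // $0.080
  (cachedInputTokens / 1_000_000) * pricing.cachedInputPerMTok +           // $0.002
  (outputTokens / 1_000_000) * pricing.outputPerMTok;                      // $0.025

console.log(cost.toFixed(3)); // → "0.107"
```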
When to Add Multi-Model
Add multi-model routing when:
- Different task types - Need both fast/cheap and slow/smart
- Cost optimization - Use expensive models only when needed
- Subagents - Specialized agents need specialized models
- Large context - Some tasks need 1M+ tokens
Skip if:
- A single model meets all needs
- A simpler implementation is the priority
- You don't need subagent specialization
Enhancement based on Amp Code v0.0.1769212917 patterns