Multi-Model Routing

Level: Enhancement (beyond minimal agent)
Prerequisites: Agent loop, tool system


What This Adds

Routes requests to different AI models based on:

  • Current mode (smart/rush/free)
  • Subagent type (oracle, finder, etc.)
  • Feature flags and settings
  • Context size constraints

Why Multiple Models?

Different tasks benefit from different models:

Need               | Best Model Type
-------------------|------------------------------------------
Complex reasoning  | Large, expensive model (Opus, GPT-5)
Fast responses     | Small, cheap model (Haiku, Gemini Flash)
Code generation    | Code-specialized model
Vision/media       | Multimodal model
Very large context | High-context model (1M tokens)

Model Registry

Define Available Models

interface ModelDefinition {
  id: string;
  provider: "anthropic" | "openai" | "vertexai";
  contextWindow: number;
  maxOutputTokens: number;
  capabilities: {
    tools: boolean;
    vision: boolean;
    reasoning?: boolean;
  };
  pricing?: {
    inputPerMTok: number;
    outputPerMTok: number;
    cachedInputPerMTok?: number;
  };
}

const MODELS: Record<string, ModelDefinition> = {
  "claude-opus-4-5": {
    id: "claude-opus-4-5-20251101",
    provider: "anthropic",
    contextWindow: 200000,
    maxOutputTokens: 32000,
    capabilities: { tools: true, vision: true, reasoning: true },
    pricing: { inputPerMTok: 5, outputPerMTok: 25, cachedInputPerMTok: 0.5 }
  },
  "claude-haiku-4-5": {
    id: "claude-haiku-4-5-20251001",
    provider: "anthropic",
    contextWindow: 200000,
    maxOutputTokens: 64000,
    capabilities: { tools: true, vision: true },
    pricing: { inputPerMTok: 1, outputPerMTok: 5, cachedInputPerMTok: 0.1 }
  },
  "claude-sonnet-4-5": {
    id: "claude-sonnet-4-5-20250929",
    provider: "anthropic",
    contextWindow: 1000000,
    maxOutputTokens: 32000,
    capabilities: { tools: true, vision: true, reasoning: true },
    pricing: { inputPerMTok: 3, outputPerMTok: 15, cachedInputPerMTok: 0.3 }
  },
  "gpt-5.2": {
    id: "gpt-5.2",
    provider: "openai",
    contextWindow: 400000,
    maxOutputTokens: 128000,
    capabilities: { tools: true, vision: true, reasoning: true },
    pricing: { inputPerMTok: 1.75, outputPerMTok: 14, cachedInputPerMTok: 0.175 }
  },
  "gpt-5.2-codex": {
    id: "gpt-5.2-codex",
    provider: "openai",
    contextWindow: 400000,
    maxOutputTokens: 128000,
    capabilities: { tools: true, vision: true, reasoning: true },
    pricing: { inputPerMTok: 1.75, outputPerMTok: 14, cachedInputPerMTok: 0.175 }
  },
  "gemini-3-pro-preview": {
    id: "gemini-3-pro-preview",
    provider: "vertexai",
    contextWindow: 1048576,
    maxOutputTokens: 65535,
    capabilities: { tools: true, vision: true, reasoning: true }
  },
  "gemini-3-flash-preview": {
    id: "gemini-3-flash-preview",
    provider: "vertexai",
    contextWindow: 1048576,
    maxOutputTokens: 65535,
    capabilities: { tools: true, vision: true, reasoning: true }
  },
  "gemini-2.5-flash": {
    id: "gemini-2.5-flash",
    provider: "vertexai",
    contextWindow: 1048576,
    maxOutputTokens: 65535,
    capabilities: { tools: true, vision: true, reasoning: true }
  }
};
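With capabilities and pricing in one registry, routing decisions can be computed rather than hardcoded. A sketch of one such query — cheapest model satisfying a capability requirement — with a minimal two-entry slice of the registry inlined so it runs standalone (the helper name is illustrative, not from the source):

```typescript
interface ModelInfo {
  capabilities: { tools: boolean; vision: boolean; reasoning?: boolean };
  pricing?: { inputPerMTok: number; outputPerMTok: number };
}

// Minimal slice of the MODELS registry, inlined for a self-contained example.
const REGISTRY: Record<string, ModelInfo> = {
  "claude-opus-4-5": {
    capabilities: { tools: true, vision: true, reasoning: true },
    pricing: { inputPerMTok: 5, outputPerMTok: 25 }
  },
  "claude-haiku-4-5": {
    capabilities: { tools: true, vision: true },
    pricing: { inputPerMTok: 1, outputPerMTok: 5 }
  }
};

// Cheapest model (by input price) that has every required capability.
function cheapestWithCapabilities(
  required: Array<"tools" | "vision" | "reasoning">
): string | undefined {
  return Object.entries(REGISTRY)
    .filter(([, m]) => m.pricing && required.every((c) => m.capabilities[c]))
    .sort(([, a], [, b]) => a.pricing!.inputPerMTok - b.pricing!.inputPerMTok)[0]?.[0];
}
```

Models without a pricing entry are excluded, which keeps the free-preview models out of cost-ranked selection.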

Routing by Mode

Different modes use different default models:

const MODE_MODELS: Record<string, string> = {
  smart: "claude-opus-4-5",      // Best quality
  rush: "claude-haiku-4-5",      // Fast/cheap
  free: "claude-haiku-4-5",      // Free tier
  large: "claude-sonnet-4-5",    // 1M context
  deep: "gpt-5.2-codex"          // Deep reasoning
};

function getModelForMode(mode: AgentMode): string {
  return MODE_MODELS[mode] ?? MODE_MODELS.smart;
}

Routing by Subagent

Subagents have default model assignments (finder is overridden at runtime):

const SUBAGENT_MODELS: Record<string, string> = {
  finder: "claude-haiku-4-5",     // Registry default (runtime override below)
  oracle: "gpt-5.2",              // Deep reasoning
  librarian: "claude-haiku-4-5",  // GitHub operations
  "kraken-scope": "claude-haiku-4-5",
  "kraken-executor": "claude-haiku-4-5",
  reviewer: "claude-opus-4-5",    // Quality review
  "code-review": "claude-sonnet-4-5",
  "course-correction": "gemini-3-pro-preview"
};

function getModelForSubagent(type: string, parentMode: AgentMode): string {
  // Task subagent inherits from parent
  if (type === "task") {
    return getModelForMode(parentMode);
  }
  if (type === "finder") {
    return "gemini-3-flash-preview";  // Runtime override in Amp
  }
  return SUBAGENT_MODELS[type] ?? getModelForMode(parentMode);
}
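The resolution order — task inherits the parent mode, finder is forced at runtime, then the subagent table, then the mode default — can be condensed into a self-contained sketch. The tables below restate a subset of the registries above in minimal form:

```typescript
const MODE_DEFAULTS: Record<string, string> = {
  smart: "claude-opus-4-5",
  rush: "claude-haiku-4-5"
};

const SUBAGENT_DEFAULTS: Record<string, string> = {
  oracle: "gpt-5.2",
  reviewer: "claude-opus-4-5"
};

// Resolution order: task inherits the parent mode, finder is overridden at
// runtime, then the subagent table, then the parent mode's default model.
function routeSubagent(type: string, parentMode: string): string {
  const modeDefault = MODE_DEFAULTS[parentMode] ?? MODE_DEFAULTS.smart;
  if (type === "task") return modeDefault;
  if (type === "finder") return "gemini-3-flash-preview";
  return SUBAGENT_DEFAULTS[type] ?? modeDefault;
}
```

Note that an unknown subagent type degrades gracefully to the parent mode's model rather than failing.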

Context Overflow Fallback

When context exceeds the model's limit, fall back to a larger-context model:

function selectModelWithFallback(
  preferredModel: string,
  inputTokens: number,
  maxOutputTokens: number
): string {
  const model = MODELS[preferredModel];
  if (!model) {
    throw new Error(`Unknown model: ${preferredModel}`);
  }
  const effectiveLimit = model.contextWindow - maxOutputTokens;

  if (inputTokens >= effectiveLimit) {
    console.info("Context overflow, falling back to large-context model", {
      preferredModel,
      inputTokens,
      limit: effectiveLimit
    });
    return "gemini-2.5-flash";  // 1M-context fallback (registry key above)
  }

  return preferredModel;
}
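selectModelWithFallback needs an inputTokens figure before the request is sent. A common heuristic — assumed here, not taken from the source — is roughly four characters per token for English text; production code would use the provider's tokenizer or the token counts reported by the previous response:

```typescript
// Rough token estimate: ~4 characters per token for English text.
// Heuristic and helper names are illustrative sketches.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Estimated input size across the system prompt and message history.
function estimateInputTokens(system: string, messages: { content: string }[]): number {
  return estimateTokens(system) +
    messages.reduce((sum, m) => sum + estimateTokens(m.content), 0);
}
```

Because the estimate is coarse, overflow checks should keep a safety margin rather than cutting exactly at the limit.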

Provider-Specific Handling

Each provider has different API shapes. Handle them separately:

interface InferenceRequest {
  model: string;
  system: string;
  messages: Message[];
  tools: Tool[];
  maxTokens: number;
  temperature?: number;
}

async function inference(request: InferenceRequest): Promise<Response> {
  const modelDef = MODELS[request.model];
  if (!modelDef) {
    throw new Error(`Unknown model: ${request.model}`);
  }

  switch (modelDef.provider) {
    case "anthropic":
      return runAnthropicInference(request, modelDef);
    case "openai":
      return runOpenAIInference(request, modelDef);
    case "vertexai":
      return runVertexAIInference(request, modelDef);
    default:
      throw new Error(`Unknown provider: ${modelDef.provider}`);
  }
}

Anthropic-Specific

async function runAnthropicInference(
  request: InferenceRequest,
  model: ModelDefinition
): Promise<Response> {
  const client = new Anthropic({ apiKey: getApiKey("anthropic") });

  const params = {
    model: model.id,
    max_tokens: request.maxTokens,
    messages: convertToAnthropicMessages(request.messages),
    system: request.system,
    tools: convertToAnthropicTools(request.tools),
    stream: true,
    // Anthropic-specific features
    thinking: {
      type: "enabled",
      budget_tokens: 4000
    }
  };

  const stream = await client.messages.stream(params);
  return processAnthropicStream(stream);
}

OpenAI-Specific

async function runOpenAIInference(
  request: InferenceRequest,
  model: ModelDefinition
): Promise<Response> {
  const client = new OpenAI({ apiKey: getApiKey("openai") });

  const params = {
    model: model.id,
    max_completion_tokens: request.maxTokens,  // newer OpenAI models reject the legacy max_tokens
    messages: [
      { role: "system", content: request.system },
      ...convertToOpenAIMessages(request.messages)
    ],
    tools: convertToOpenAITools(request.tools),
    stream: true
  };

  const stream = await client.chat.completions.create(params);
  return processOpenAIStream(stream);
}

VertexAI-Specific

async function runVertexAIInference(
  request: InferenceRequest,
  model: ModelDefinition
): Promise<Response> {
  const client = new GoogleGenerativeAI(getApiKey("vertexai"));
  const genModel = client.getGenerativeModel({ model: model.id });

  const params = {
    contents: convertToVertexAIContents(request.messages),
    systemInstruction: request.system,
    tools: convertToVertexAITools(request.tools),
    generationConfig: {
      maxOutputTokens: request.maxTokens,
      temperature: request.temperature
    },
    // VertexAI-specific
    thinkingConfig: { thinkingLevel: "MEDIUM" }
  };

  const result = await genModel.generateContentStream(params);
  return processVertexAIStream(result);
}

Extended Thinking

Different providers handle thinking differently:

Anthropic Extended Thinking

// Budget tiers based on user phrases
function getThinkingBudget(lastUserMessage: string): number {
  const text = lastUserMessage.toLowerCase();

  const highTriggers = [
    /\bthink harder\b/,
    /\bthink intensely\b/,
    /\bthink very hard\b/
  ];

  const mediumTriggers = [
    /\bthink deeply\b/,
    /\bthink hard\b/,
    /\bthink more\b/
  ];

  for (const pattern of highTriggers) {
    if (pattern.test(text)) return 31999;
  }

  for (const pattern of mediumTriggers) {
    if (pattern.test(text)) return 10000;
  }

  return 4000;  // Default
}

VertexAI Thinking Config

thinkingConfig: {
  thinkingLevel: "LOW" | "MEDIUM" | "HIGH"
}
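To keep the routing layer provider-agnostic, a single Anthropic-style token budget can be mapped onto VertexAI's discrete levels. The thresholds below are illustrative, chosen to line up with the budget tiers in getThinkingBudget above; they are not from any provider spec:

```typescript
// Map an Anthropic-style thinking budget onto VertexAI's discrete levels.
// Thresholds mirror the 31999 / 10000 / 4000 tiers used above.
function budgetToThinkingLevel(budgetTokens: number): "LOW" | "MEDIUM" | "HIGH" {
  if (budgetTokens >= 30000) return "HIGH";
  if (budgetTokens >= 10000) return "MEDIUM";
  return "LOW";
}
```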

Prompt Caching

Anthropic supports prompt caching for cost savings:

function applyCacheControl(messages: Message[]): Message[] {
  if (messages.length === 0) return messages;

  const lastIndex = messages.length - 1;

  return messages.map((msg, i) => {
    if (i === lastIndex) {
      // Last message: 5 minute cache
      return addCacheControl(msg, "5m");
    }
    return msg;
  });
}
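The addCacheControl helper used above is not defined in this section. A minimal sketch, assuming Anthropic's cache_control content-block field (the ttl value and message shape here are simplified, not the exact Amp code):

```typescript
interface ContentBlock {
  type: string;
  text?: string;
  cache_control?: { type: "ephemeral"; ttl?: "5m" | "1h" };
}

interface CacheableMessage {
  role: "user" | "assistant";
  content: ContentBlock[];
}

// Mark the last content block of a message as a cache breakpoint.
// Anthropic expresses breakpoints as `cache_control: { type: "ephemeral" }`.
function addCacheControl(msg: CacheableMessage, ttl: "5m" | "1h"): CacheableMessage {
  if (msg.content.length === 0) return msg;
  return {
    ...msg,
    content: msg.content.map((block, i) =>
      i === msg.content.length - 1
        ? { ...block, cache_control: { type: "ephemeral" as const, ttl } }
        : block
    )
  };
}
```

The helper is pure: it returns a new message rather than mutating the history in place, so retries see the original messages.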

Cost Tracking

Track costs per model for billing and optimization:

interface UsageRecord {
  model: string;
  provider: string;
  inputTokens: number;
  outputTokens: number;
  cachedInputTokens: number;
  cost: number;
}

function calculateCost(
  model: string,
  inputTokens: number,
  outputTokens: number,
  cachedInputTokens: number
): number {
  const pricing = MODELS[model]?.pricing;
  if (!pricing) return 0;  // Models without pricing (e.g. the Gemini previews) are treated as free
  const uncachedInput = inputTokens - cachedInputTokens;

  return (
    (uncachedInput / 1_000_000) * pricing.inputPerMTok +
    (cachedInputTokens / 1_000_000) * (pricing.cachedInputPerMTok ?? pricing.inputPerMTok) +
    (outputTokens / 1_000_000) * pricing.outputPerMTok
  );
}
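A worked example of the same formula, with Opus-tier pricing from the registry inlined so it runs standalone, shows how caching dominates the savings:

```typescript
// Opus-tier pricing from the registry above, inlined for a standalone example.
const opusPricing = { inputPerMTok: 5, outputPerMTok: 25, cachedInputPerMTok: 0.5 };

function opusCost(inputTokens: number, outputTokens: number, cachedInputTokens: number): number {
  const uncached = inputTokens - cachedInputTokens;
  return (
    (uncached / 1_000_000) * opusPricing.inputPerMTok +
    (cachedInputTokens / 1_000_000) * opusPricing.cachedInputPerMTok +
    (outputTokens / 1_000_000) * opusPricing.outputPerMTok
  );
}

// 100k input tokens (80k served from cache) + 2k output:
//   20k uncached * $5/M   = $0.10
//   80k cached   * $0.5/M = $0.04
//   2k output    * $25/M  = $0.05
//   total                 = $0.19
```

Without caching, the same request would cost 100k * $5/M + 2k * $25/M = $0.55 — nearly three times as much.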

When to Add Multi-Model

Add multi-model routing when:

  1. Different task types - Need both fast/cheap and slow/smart
  2. Cost optimization - Use expensive models only when needed
  3. Subagents - Specialized agents need specialized models
  4. Large context - Some tasks need 1M+ tokens

Skip if:

  • Single model meets all needs
  • Simpler implementation is priority
  • Don't need subagent specialization

Enhancement based on Amp Code v0.0.1769212917 patterns