The Secret Life of Azure: The Traffic Controller

Optimizing Cost and Latency with Intent-Based Routing.

#Azure #AIAgents #CloudComputing #SoftwareArchitecture

Optimization & Routing

The whiteboard was covered in the evaluation rubrics from our last session, but Timothy was looking at a billing dashboard with a frown.

"Margaret," Timothy said, "the system is accurate now, but it’s slow. And the cost of running every single request through the high-reasoning models is starting to add up. Most of the time, the user is just asking a basic status question, but we’re spinning up the whole orchestrator and the evaluator to answer it. We're using a sledgehammer to hang a picture frame."

Margaret picked up a yellow marker and drew a diamond-shaped box at the very entry point of the system.

"That’s because you’re treating every request like a crisis, Timothy. In the cloud, efficiency is about Routing. We need to move from 'one-size-fits-all' to Intent-Based Routing."

The Router: The Traffic Controller

"How do we decide which model to use without adding more lag?" Timothy asked.

"We deploy a Router Agent," Margaret explained. "This is a stateless, fast, and inexpensive model whose only job is to categorize the user request. We want it to be deterministic—if the request is a simple keyword, we use a rule; if it's natural language, the model judges the probability of the intent. We even cache the most common routes so we don't have to think twice about a 'status check' or a 'greeting'."
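Margaret's three-layer router can be sketched in a few lines. This is a minimal illustration, not a production design: the rule patterns, the cache, and the `classify_with_model` placeholder are all assumptions standing in for whatever keyword list and lightweight model a real system would use.

```python
import re

# Hypothetical intent router: deterministic rules first, cached routes
# second, and a model call only as a last resort. Names are illustrative.

RULES = {
    r"\b(status|uptime|health)\b": "status_check",
    r"\b(hi|hello|hey)\b": "greeting",
}

route_cache: dict[str, str] = {}  # normalized request -> cached intent

def classify_with_model(text: str) -> str:
    # Placeholder for a call to a small, inexpensive classification model.
    raise NotImplementedError

def route(request: str) -> str:
    key = request.strip().lower()
    # 1. Deterministic: known keywords never need a model call.
    for pattern, intent in RULES.items():
        if re.search(pattern, key):
            return intent
    # 2. Cached: frequent natural-language requests are resolved once.
    if key in route_cache:
        return route_cache[key]
    # 3. Probabilistic: let the lightweight model judge the intent.
    intent = classify_with_model(key)
    route_cache[key] = intent
    return intent
```

The ordering is the point: the cheapest checks run first, so a "status check" or a "greeting" never touches a model at all.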

Model Selection: Right-Sizing the Logic

"So we don't use the high-reasoning agent for everything?" Timothy asked.

"Exactly," Margaret said. "We map our agents to specific model tiers. We keep the high-reasoning models for the orchestrator and the evaluator because they need to handle complex logic. But for a simple SQL lookup or a basic summary, we use a 'Flash' model. It’s significantly faster and a fraction of the cost."
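That mapping can live in a simple configuration table. The model names below are illustrative stand-ins (borrowed from common Azure OpenAI deployments), not a recommendation for any specific workload.

```python
# Illustrative tier map: agent roles to model deployments. The names
# are assumptions; a real system would use its own deployment IDs.

MODEL_TIERS = {
    "orchestrator": "gpt-4o",       # high-reasoning: plans multi-step work
    "evaluator":    "gpt-4o",       # high-reasoning: judges output quality
    "sql_lookup":   "gpt-4o-mini",  # flash-class: cheap, low-latency
    "summarizer":   "gpt-4o-mini",
}

def model_for(agent: str) -> str:
    # Default to the cheapest tier; only named roles get the big model.
    return MODEL_TIERS.get(agent, "gpt-4o-mini")
```

Defaulting unknown roles to the cheap tier keeps the bill honest: an agent has to be explicitly promoted to earn high-reasoning pricing.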

The Fallback: Escalation Logic

"What if the small model gets it wrong because the request was trickier than it looked?" Timothy asked.

Margaret drew an arrow from the small model back to the orchestrator.

"We build in Escalation Logic. If the small model’s output fails our Evaluation Loop, or if the Router detects high ambiguity in the intent, the system automatically 'escalates' the task to the high-reasoning agent. You save resources on the easy stuff and spend them only when the task actually earns it."
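The escalation path Margaret describes looks roughly like this. `run_agent` and `passes_evaluation` are hypothetical helpers; the stubs below exist only so the control flow is runnable, and the ambiguity threshold is an arbitrary illustrative value.

```python
# Sketch of escalation logic under two assumptions: the router reports a
# confidence score, and the evaluation loop returns pass/fail.

AMBIGUITY_THRESHOLD = 0.5  # illustrative router-confidence cutoff

def run_agent(agent: str, request: str) -> str:
    return f"{agent}: {request}"  # stub: a real call would hit the model

def passes_evaluation(output: str) -> bool:
    return "error" not in output  # stub: a real check runs the eval loop

def handle(request: str, intent: str, confidence: float) -> str:
    # High ambiguity from the router goes straight to high reasoning.
    if confidence < AMBIGUITY_THRESHOLD:
        return run_agent("orchestrator", request)
    # Otherwise try the cheap, right-sized agent first.
    draft = run_agent(intent, request)
    if passes_evaluation(draft):
        return draft
    # Escalate only when the small model's output fails evaluation.
    return run_agent("orchestrator", request)
```

Note that escalation fires on two distinct signals: low router confidence (before any work is done) and a failed evaluation (after the cheap attempt). Both paths land on the same high-reasoning agent.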

The Result

Timothy looked at the latency logs. The simple requests were clearing in milliseconds, and the projected monthly bill was dropping. The system was just as smart as before, but it was finally efficient.

"It’s not just a system anymore," Timothy said. "It’s an efficient operation."

Margaret capped her marker. "Exactly. When you stop over-engineering the simple tasks, the architecture finally becomes sustainable."


The Core Concepts

  • Router Agent: A lightweight, stateless agent that classifies user intent to determine the execution path.
  • Model Tiers: Utilizing different model sizes based on task complexity to balance performance and cost.
  • Deterministic vs. Probabilistic: Using fixed rules for known keywords and ML models for ambiguous language.
  • Escalation Logic: A safety pattern that moves a task to a higher-reasoning model if the initial attempt fails.
  • Route Caching: Storing the classification of frequent requests to eliminate redundant processing time.

Aaron Rose is a software engineer and technology writer at tech-reader.blog. For explainer videos and podcasts, check out the Tech-Reader YouTube channel.
