Preface: The "One Model to Rule Them All" strategy is financially unsustainable. Routing 100% of user traffic to GPT-4 is akin to hiring a Ph.D. physicist to greet customers at Walmart. This guide details the Semantic Router Pattern, a machine-learning design pattern that dynamically routes queries to the cheapest model capable of answering them.
1. The Routing Logic (Vector Space)
Traditional routing uses Regular Expressions (Regex). This is brittle. If a user says "I want to bounce the box" instead of "Reboot server", regex fails.
Semantic Routing converts the query into a high-dimensional vector (384 dimensions for `all-MiniLM-L6-v2`) and calculates the Cosine Similarity against a set of pre-defined "Intent Centroids".
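To make "similarity to an intent centroid" concrete, here is a minimal sketch using `sentence-transformers` directly; the sample utterances and the centroid-by-averaging step are illustrative assumptions, not a specific library's behaviour.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# An "intent centroid" here is simply the mean of the embeddings of that intent's utterances
utterances = ["reset my password", "unlock my account", "change my password"]
centroid = model.encode(utterances).mean(axis=0)

query_vec = model.encode("I forgot my login credentials")

# Cosine similarity: dot product of the two vectors divided by the product of their norms
cos_sim = np.dot(query_vec, centroid) / (np.linalg.norm(query_vec) * np.linalg.norm(centroid))
print(f"Similarity to the password intent: {cos_sim:.3f}")
```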
2. Implementation (Python & HuggingFace)
We build a router that separates "Simple Operations" (Route to Llama-3-8B) from "Complex Reasoning" (Route to GPT-4o).
Step 1: Define Intents
```python
from semantic_router import Route
from semantic_router.encoders import HuggingFaceEncoder
from semantic_router.layer import RouteLayer

# Encoder: small, fast model (runs on CPU in < 10 ms)
encoder = HuggingFaceEncoder(name="sentence-transformers/all-MiniLM-L6-v2")

# Route 1: Simple/Factual (Local SLM)
simple_route = Route(
    name="local_slm",
    utterances=[
        "reset my password",
        "what are the office hours?",
        "check uptime for server X",
        "who is the CEO?",
        "summarize this paragraph",
    ],
)

# Route 2: Complex/Creative (Cloud LLM)
complex_route = Route(
    name="cloud_llm",
    utterances=[
        "write a python script to scrape this website",
        "analyze the sentiment of this quarterly report",
        "debug this stack trace",
        "create a marketing persona for...",
    ],
)

# Initialize the router
router = RouteLayer(encoder=encoder, routes=[simple_route, complex_route])
```
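A quick sanity check of the router built above (the expected route name follows from the utterances in Step 1; the exact attributes on the returned object can vary between `semantic-router` versions):

```python
choice = router("I can't log into my account")
print(choice.name)              # expected: "local_slm"
print(choice.similarity_score)  # may be None if the installed version does not populate it
```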
Step 2: Service Layer
```python
def handle_request(user_query):
    # 1. Classify the query against the intent routes
    classification = router(user_query)
    # similarity_score may be None depending on the semantic-router version/config
    score = classification.similarity_score or 0.0

    # 2. Route based on intent and confidence
    if classification.name == "local_slm" and score > 0.8:
        print(f"Routing to local Llama-3. Confidence: {score}")
        return local_inference_engine.generate(user_query)
    elif classification.name == "cloud_llm":
        print("Routing to GPT-4o.")
        return openai_client.chat.completions.create(model="gpt-4o", messages=[...])
    else:
        # Fallback for ambiguous intent
        print("Ambiguous intent. Defaulting to cloud for safety.")
        return openai_client.chat.completions.create(model="gpt-3.5-turbo", messages=[...])
```
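The service layer assumes `local_inference_engine` and `openai_client` already exist. As one illustrative way to back the local side, here is a minimal sketch that points the standard OpenAI client at a local OpenAI-compatible server (e.g. Ollama or vLLM); the URL, model name, and the `LocalInferenceEngine` class itself are assumptions for this example.

```python
from openai import OpenAI

class LocalInferenceEngine:
    """Thin wrapper around a local OpenAI-compatible endpoint (e.g. Ollama, vLLM)."""

    def __init__(self, base_url="http://localhost:11434/v1", model="llama3"):
        # Most local servers ignore the API key, but the client requires one
        self.client = OpenAI(base_url=base_url, api_key="not-needed")
        self.model = model

    def generate(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

local_inference_engine = LocalInferenceEngine()
```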
3. Tuning Dynamic Thresholds
A static threshold (e.g., 0.8) is dangerous. When the router is unsure, paying for a cloud call is cheaper than serving a hallucination from the local model.
We recommend Dynamic Thresholding (a code sketch follows the list):
- If `simple_route` score > 0.85 -> **Local**
- If `simple_route` score is 0.6 - 0.85 -> **Hybrid (Run both, return fastest/best)**
- If `simple_route` score < 0.6 -> **Cloud**
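A minimal sketch of these tiers, reusing the `router` and clients from Section 2. The thresholds mirror the list above, and `run_hybrid` (fire both models and keep the better or faster answer) is a hypothetical helper you would implement yourself.

```python
LOCAL_THRESHOLD = 0.85
HYBRID_THRESHOLD = 0.60

def route_with_dynamic_threshold(user_query):
    choice = router(user_query)
    score = choice.similarity_score or 0.0  # may be None in some versions

    if choice.name == "local_slm":
        if score > LOCAL_THRESHOLD:
            # High confidence: serve from the local SLM
            return local_inference_engine.generate(user_query)
        elif score >= HYBRID_THRESHOLD:
            # Medium confidence: run both, return the fastest/best (hypothetical helper)
            return run_hybrid(user_query)

    # Low confidence or a genuinely complex intent: pay for the cloud model
    return openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": user_query}],
    )
```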
4. The Latency Tax
The router introduces latency. However, because the encoder is tiny (all-MiniLM-L6-v2 is roughly 23M parameters), the overhead is negligible (~10-15ms on CPU) compared to the inference time of an LLM (seconds).
Latency Waterfall
- Request Arrives: T+0ms
- Embedding Gen (CPU): T+10ms
- Vector Search (FAISS): T+12ms
- LLM Generation: T+1500ms
A ~12ms routing tax is a small price for cost savings that can approach 90% when the bulk of traffic stays on the local model.
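To confirm the overhead on your own hardware, a rough timing loop against the router from Step 1 (numbers will vary with CPU and model load):

```python
import time

query = "reset my password"
router(query)  # warm-up call so model loading is not counted

start = time.perf_counter()
for _ in range(100):
    router(query)
avg_ms = (time.perf_counter() - start) * 1000 / 100
print(f"Average routing overhead: {avg_ms:.1f} ms")
```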
5. Circuit Breakers & Failover
Cloud APIs are unreliable (rate limits, 503 errors). The Hybrid Router acts as a natural circuit breaker.
```python
from openai import RateLimitError

# Circuit breaker logic
try:
    response = openai_client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": query}]
    )
except RateLimitError:
    # Fail open to the local model
    logger.warning("OpenAI rate limited. Failing over to Llama-3.")
    response = local_inference_engine.generate(query)
```
This keeps your application answering even during a major cloud outage. Answer quality may dip slightly (Llama-3 is not GPT-4), but the user gets an answer, not an error page.
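The try/except above fails over on each individual error. A fuller circuit breaker also stops calling the cloud for a cool-down window after repeated failures, so every request is not paying for a doomed API call. A minimal sketch, with arbitrary placeholder values for the failure threshold and cool-down:

```python
import time
from openai import OpenAIError

class CloudCircuitBreaker:
    """Skip the cloud provider for `cooldown` seconds after `max_failures` consecutive errors."""

    def __init__(self, max_failures=3, cooldown=60):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = 0.0

    def cloud_available(self):
        # Breaker is "open" (cloud skipped) until the cool-down expires
        return self.failures < self.max_failures or (time.time() - self.opened_at) > self.cooldown

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.time()

    def record_success(self):
        self.failures = 0

breaker = CloudCircuitBreaker()

def generate_with_failover(query):
    if breaker.cloud_available():
        try:
            response = openai_client.chat.completions.create(
                model="gpt-4o", messages=[{"role": "user", "content": query}]
            )
            breaker.record_success()
            return response
        except OpenAIError:
            breaker.record_failure()
    # Breaker open or the cloud call just failed: serve from the local model
    return local_inference_engine.generate(query)
```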
Conclusion: Use the right tool for the job. You don't take a helicopter to get groceries. Intelligent routing is one of the most practical implementation patterns for scaling AI economically.