You wired up your orchestrator. Your subagents respond. The demo runs clean. You ship it.
Three weeks later, a subagent returns something your orchestrator did not expect. No exception is raised. No alert fires. The orchestrator hallucinates a recovery, proceeds with corrupted state, and produces an output that looks completely plausible. A customer finds the bug six days later.
This is not a LangGraph bug. It is not a model quality problem. It is a protocol problem, and it lives in a layer of your system that most frameworks deliberately hide from you.
This article is about what agent-to-agent communication actually looks like at the wire level, why the default implementation is worse than any other protocol in your stack, and what you need to build to make it production-safe.
What actually gets passed between agents
Strip away the framework abstractions. Forget what LangChain shows you in the trace view. Look at what Agent B actually receives when Agent A calls it.
Here is a minimal LangGraph two-agent setup:
from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI
from typing import TypedDict


class AgentState(TypedDict):
    messages: list
    result: str


llm = ChatOpenAI(model="gpt-4o")


def orchestrator(state: AgentState):
    response = llm.invoke(state["messages"])
    return {"messages": state["messages"] + [response], "result": response.content}


def subagent(state: AgentState):
    # What does this agent actually receive?
    print(repr(state["result"]))  # look here
    response = llm.invoke([{"role": "user", "content": state["result"]}])
    return {"result": response.content}


graph = StateGraph(AgentState)
graph.add_node("orchestrator", orchestrator)
graph.add_node("subagent", subagent)
graph.add_edge("orchestrator", "subagent")
graph.set_entry_point("orchestrator")
graph.add_edge("subagent", END)
app = graph.compile()
Run this. Print that repr. What you get is something like this:
'To complete this task, I need you to extract the user email and return it
in JSON format with the key "email". The user data is as follows:\n\n
```json\n{"user_id": 123, "email": "[email protected]", "role": "admin"}\n
```\n\nPlease process this and return only the email field.'
That is a string. A natural language string with JSON embedded inside a markdown code block inside the string, because GPT-4o decided to format it helpfully.
Your subagent now has to figure out what to do with that. It will try. It will usually succeed. But “usually” is not a production SLA.
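To make the fragility concrete, here is a minimal sketch of what a deterministic consumer would have to do to recover the payload from a message like that one. The message text is abbreviated, the email value is a placeholder, and `extract_fenced_json` is a hypothetical helper written for this illustration, not part of LangGraph or LangChain.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence marker

# Abbreviated stand-in for the orchestrator's prose message (placeholder email).
message = (
    "To complete this task, extract the user email and return it as JSON.\n\n"
    + FENCE + "json\n"
    + '{"user_id": 123, "email": "user@example.com", "role": "admin"}\n'
    + FENCE + "\n\n"
    + "Please process this and return only the email field."
)

FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_fenced_json(text: str) -> dict:
    """Hypothetical helper: parse the first fenced block found inside prose."""
    match = FENCED_BLOCK.search(text)
    if match is None:
        raise ValueError("no fenced JSON block found in message")
    return json.loads(match.group(1))


# Parsing the message directly fails, because it is prose rather than JSON.
try:
    json.loads(message)
    direct_parse_ok = True
except json.JSONDecodeError:
    direct_parse_ok = False

payload = extract_fenced_json(message)
print(direct_parse_ok, payload["email"])
```

The helper only works while the model keeps choosing this exact formatting. If it drops the fence, adds a second block, or switches to a bulleted list, the consumer breaks or, worse, a model on the receiving end quietly guesses.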
Now look at what a microservice receives when another microservice calls it:
{
    "user_id": 123,
    "email": "[email protected]",
    "role": "admin"
}
Typed. Validated. Versioned. Contracted.
The gap between those two things is the entire problem. And it is a gap that almost nobody building multi-agent systems today has consciously decided to accept. It is a gap they inherited by using the framework defaults and never looking underneath.
Why this is a worse protocol than anything else in your stack
Think about the last API you built. You wrote a schema. You validated inputs. You returned typed responses. You versioned the endpoint. If a caller sent malformed data, you returned a 422 with a clear error. If you changed the response shape, you incremented the version and kept the old one running during migration. You did not ship an endpoint that accepted whatever string the caller felt like sending and hoped the handler would interpret it correctly.
None of that discipline exists at the inter-agent layer in a default LangGraph, AutoGen, or CrewAI implementation.
What you have instead is two language models exchanging prose, with an implicit expectation that the receiving model will parse the intent correctly. That works often enough in development that it feels like it works. It fails in production often enough to be a serious operational risk.
The specific properties that make this dangerous:
- No schema. There is no definition of what Agent A is allowed to send Agent B. The orchestrator can send anything. The subagent will attempt to handle anything. What “handling” means when the input is malformed is entirely up to the model, and models are optimized to produce plausible outputs, not to fail loudly on bad inputs.
- No contract. If you change what the orchestrator sends, nothing breaks at deploy time. It breaks at runtime, silently, on the subset of inputs where the format change matters. You will find this in production, not in your test suite, because your test suite runs against a fixed prompt and a fixed expected output format that happened to be what the model was producing when you wrote the test.
- No validation. There is no layer that checks whether Agent B received what Agent A intended to send before Agent B acts on it. The first validation is the model’s own interpretation, which is non-deterministic. The same malformed input might be handled correctly nine times and incorrectly on the tenth, depending on sampling temperature, context window state, and which token the model happens to sample first.
- No versioning. When you update your orchestrator prompt and the output format changes subtly, your subagent is now operating on a different implicit schema with no indication that anything changed. There is no migration path. There is no rollback. There is no diff. You shipped a breaking change to an API and called it a prompt update.
Compare this to what you would accept from any other component in your stack. You would not deploy a Kafka consumer that receives unvalidated string payloads and interprets them by passing them to a language model. You would not ship a REST API where the request body is a prose description of the intended operation. But that is exactly what the default inter-agent communication layer is, and most teams shipping multi-agent systems in production right now are running it.
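As a rough illustration of what the missing schema and versioning could look like, the sketch below wraps every inter-agent message in a small envelope and rejects anything that does not match before a model ever sees it. It uses only the standard library; `Envelope`, `ContractError`, `decode_envelope`, and the version numbers are all hypothetical names chosen for this example.

```python
import json
from dataclasses import dataclass

SUPPORTED_VERSIONS = {1}  # hypothetical: schema versions this subagent understands


class ContractError(Exception):
    """Raised when a message does not match the agreed envelope."""


@dataclass(frozen=True)
class Envelope:
    schema_version: int
    message_type: str
    payload: dict


def decode_envelope(raw: str) -> Envelope:
    """Parse and check an incoming message at the boundary."""
    try:
        body = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ContractError(f"message is not JSON: {exc}") from exc
    if not isinstance(body, dict):
        raise ContractError("message must be a JSON object")
    missing = {"schema_version", "message_type", "payload"} - set(body)
    if missing:
        raise ContractError(f"missing envelope fields: {sorted(missing)}")
    version = body["schema_version"]
    if not isinstance(version, int) or version not in SUPPORTED_VERSIONS:
        raise ContractError(f"unsupported schema_version: {version!r}")
    if not isinstance(body["payload"], dict):
        raise ContractError("payload must be a JSON object")
    return Envelope(version, body["message_type"], body["payload"])


incoming = '{"schema_version": 1, "message_type": "extract", "payload": {"user_id": 123}}'
envelope = decode_envelope(incoming)
print(envelope.message_type, envelope.payload)
```

With something like this in place, a prompt change that alters the output shape has to bump `schema_version`, and a subagent that has not been updated refuses the message instead of improvising.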
Three production failure modes
Failure mode 1: Silent schema mismatch
The orchestrator returns a result with a field named output. The subagent expects a field named result. No exception is raised because the communication is a string, not a typed object. The subagent receives the full string, finds no field called result in its parsed interpretation, and proceeds with None or with whatever the model decides is a reasonable substitute.
# Orchestrator returns this (as a string):
orchestrator_output = """
{
    "output": "[email protected]",
    "confidence": 0.97
}
"""
# Subagent is prompted to extract the "result" field.
# Model sees no "result" field.
# Model hallucinates: returns the email anyway because it seems right.
# No error raised. Wrong data flow. Invisible bug.
# The confidence score is silently dropped.
The insidious part: the subagent’s output often looks correct because the model is good at guessing intent. The bug surfaces two hops later when something downstream actually needed the confidence score that got silently dropped. You will spend four hours in LangSmith traces trying to understand why your final output is inconsistently confident before you realize the problem is a field name mismatch that has been silently failing since day one.
This failure mode is almost impossible to catch in development because development inputs are clean, the model guesses correctly, and nothing looks wrong. It surfaces in production on the tail of your input distribution, where the orchestrator output is slightly different from what it produces on your test cases.
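A strict field check at the consumer is enough to turn this particular bug into an immediate error. The sketch below is a standard-library illustration under the assumption that the orchestrator output is valid JSON; `parse_strict` and `SchemaMismatch` are hypothetical names, and the email is a placeholder.

```python
import json


class SchemaMismatch(Exception):
    """Raised when a payload's fields differ from what the consumer expects."""


def parse_strict(raw: str, expected_fields: set[str]) -> dict:
    """Parse JSON and require exactly the expected top-level fields."""
    payload = json.loads(raw)
    if not isinstance(payload, dict):
        raise SchemaMismatch("payload must be a JSON object")
    actual = set(payload)
    missing = expected_fields - actual
    unexpected = actual - expected_fields
    if missing or unexpected:
        raise SchemaMismatch(
            f"missing={sorted(missing)} unexpected={sorted(unexpected)}"
        )
    return payload


# The orchestrator names the field "output"; the subagent expects "result".
orchestrator_output = '{"output": "user@example.com", "confidence": 0.97}'

try:
    parse_strict(orchestrator_output, {"result", "confidence"})
    error = None
except SchemaMismatch as exc:
    error = str(exc)

print(error)  # missing=['result'] unexpected=['output']
```

The mismatch now fails on the first request in development, with a message that names both fields, instead of surfacing two hops later.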
Failure mode 2: Context collapse across hops
Each agent in a chain receives a compressed representation of what the previous agent did. By the third hop, the original task has drifted. Not because any individual agent failed, but because each compression loses information and those losses compound.
# Original task (from user):
# "Analyze the Q1 sales data for the EMEA region, identify the top 3
# underperforming product categories, and draft a remediation plan
# with specific actions for each."
# After Agent 1 (data analysis agent):
# "EMEA Q1 shows underperformance in: Hardware, Professional Services, Licensing."
# After Agent 2 (research agent):
# "Hardware and software categories show market headwinds in European markets."
# Agent 3 (writing agent) receives:
# "Write a remediation plan for hardware and software underperformance in Europe."
# What Agent 3 produces:
# A generic plan about hardware and software in Europe.
#
# Missing from the final output:
# - Professional Services (dropped by Agent 2 during compression)
# - Licensing (also dropped)
# - Specific Q1 framing
# - EMEA vs Europe distinction (quietly broadened)
# - "remediation" softened to "improvement" in the final plan
# - No specific actions, only general recommendations
#
# The output looks like a good business document.
# It answers a different question than the one the user asked.
This failure mode is nearly impossible to catch without evaluating the final output against the original task statement. Most pipelines do not do that. Most pipelines evaluate the final output on its own merits, as a piece of writing, where it will score well. The drift is only visible when you compare it to what the user originally asked for.
The root cause is that agents are compressing and summarizing as they pass results downstream, because the framework gives them no mechanism to pass structured data. They are turning typed information into prose, and prose is lossy by definition.
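One way to limit the drift, sketched here under stated assumptions, is to carry the original task verbatim in a structured context that every hop receives unchanged, and to check the final output against it. `TaskContext`, `HopResult`, `missing_terms`, and the list of required terms are hypothetical. The substring check is deliberately crude; a real pipeline would likely need a stronger evaluation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TaskContext:
    """Travels unchanged with every hop; agents add findings, never rewrite this."""
    original_task: str
    required_terms: tuple[str, ...]


@dataclass
class HopResult:
    context: TaskContext
    agent: str
    findings: dict = field(default_factory=dict)


def missing_terms(final_text: str, context: TaskContext) -> list[str]:
    """Return the required terms that the final output never mentions."""
    lowered = final_text.lower()
    return [t for t in context.required_terms if t.lower() not in lowered]


context = TaskContext(
    original_task=(
        "Analyze the Q1 sales data for the EMEA region, identify the top 3 "
        "underperforming product categories, and draft a remediation plan "
        "with specific actions for each."
    ),
    required_terms=("EMEA", "Q1", "Hardware", "Professional Services", "Licensing"),
)

hop = HopResult(context=context, agent="research_agent",
                findings={"headwinds": ["Hardware"]})

final_plan = "A remediation plan for hardware and software underperformance in Europe."
dropped = missing_terms(final_plan, hop.context)
print(dropped)  # ['EMEA', 'Q1', 'Professional Services', 'Licensing']
```

Because the context is frozen and passed along rather than re-summarized, the writing agent sees the same task statement the user wrote, and the check at the end compares the output to that statement instead of judging it in isolation.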
Failure mode 3: Retry storm
A subagent returns a response in an unexpected format. The orchestrator, seeing an unparseable result, retries. The subagent, unsure why it is being called again, retries its own tool calls. The tools are external APIs with their own rate limits and costs. You now have a retry cascade that multiplies at every layer, with no circuit breaker, no deduplication, and a token burn rate that will surprise you when the invoice arrives.
# Simplified version of what happens without explicit failure handling.
# `subagent` is the compiled subagent runnable, assumed to be defined elsewhere.
import json


async def orchestrator_node(state):
    for attempt in range(3):  # orchestrator retries on parse failure
        result = await subagent.ainvoke(state)  # ainvoke is the awaitable call; invoke is synchronous
        try:
            parsed = json.loads(result)  # fails if model wrapped output in markdown
            return parsed
        except json.JSONDecodeError:
            continue  # retry silently, no logging, no alerting
    # Falls through and returns None after the last attempt: no error, no signal.


# After 3 orchestrator attempts, each of which triggered the subagent
# to retry its own tool calls (3 times each in this example):
#
# 3 orchestrator attempts
# x 3 subagent retries per orchestrator attempt
# x N external tool calls per subagent attempt
#
# = up to 9N external tool calls for one failed task,
#   plus the LLM calls that drive each of them
#
# Actual incident numbers from a production system:
# LLM calls fired: 47
# Time elapsed: 94 seconds
# Task outcome: incomplete
# Cost: $180
# Alerts fired: 0
The fix is not “add more retries with backoff.” The fix is structured failure signaling between agents, so the orchestrator knows the difference between “subagent failed at its task”, “subagent returned an unexpected format”, and “subagent timed out waiting for an external tool call.” Those are three different failure conditions that require three different responses. Without typed inter-agent communication, they are all indistinguishable, and your retry logic cannot be intelligent about any of them.
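The sketch below shows one possible shape for that signaling: a typed failure kind, plus a single retry budget shared by the whole task so that nested retries cannot multiply. The names `FailureKind`, `Action`, `RetryBudget`, and `decide` are hypothetical, and the policy itself (retry only timeouts, abort on format violations, escalate task failures) is an assumption for illustration rather than a recommendation for every system.

```python
from dataclasses import dataclass
from enum import Enum


class FailureKind(str, Enum):
    TASK_FAILED = "task_failed"    # subagent could not do the work
    BAD_FORMAT = "bad_format"      # output violated the contract
    TOOL_TIMEOUT = "tool_timeout"  # external tool did not answer in time


class Action(str, Enum):
    RETRY = "retry"
    ESCALATE = "escalate"
    ABORT = "abort"


@dataclass
class RetryBudget:
    """One shared budget per task, so retries at different layers draw from one pool."""
    remaining: int

    def try_spend(self) -> bool:
        if self.remaining <= 0:
            return False
        self.remaining -= 1
        return True


def decide(kind: FailureKind, budget: RetryBudget) -> Action:
    # Assumed policy: only transient tool timeouts are worth retrying.
    if kind is FailureKind.TOOL_TIMEOUT:
        return Action.RETRY if budget.try_spend() else Action.ABORT
    if kind is FailureKind.BAD_FORMAT:
        return Action.ABORT  # contract violation: retrying would hide the bug
    return Action.ESCALATE   # genuine task failure: hand off to review or a fallback


budget = RetryBudget(remaining=2)
timeline = [decide(FailureKind.TOOL_TIMEOUT, budget) for _ in range(3)]
print([a.value for a in timeline])  # ['retry', 'retry', 'abort']
```

With a shared budget of 2, the worst case for one task is two extra attempts in total, regardless of how many layers would each have retried three times on their own.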
What a proper inter-agent contract looks like
The solution is to treat inter-agent communication exactly like you treat inter-service communication: typed schemas, explicit validation, loud failures on contract violations.
Pydantic is the right tool for this in a Python agent stack. Not because it is the only option, but because it integrates directly with structured output enforcement at the model level, which means you are validating at the source rather than at the consumer. The model is constrained to produce a valid SubagentResult or fail loudly, rather than producing whatever string it feels like and leaving parsing to the downstream agent.
from pydantic import BaseModel, field_validator
from typing import Optional
from enum import Enum


class TaskStatus(str, Enum):
    SUCCESS = "success"
    PARTIAL = "partial"
    FAILED = "failed"


class SubagentResult(BaseModel):
    status: TaskStatus
    data: dict
    confidence: float
    error_message: Optional[str] = None
    tokens_used: int

    @field_validator("confidence")
    @classmethod
    def confidence_must_be_valid(cls, v):
        if not 0.0 <= v <= 1.0:
            raise ValueError(f"Confidence must be between 0 and 1, got {v}")
        return v

    @field_validator("data")
    @classmethod
    def data_must_not_be_empty_on_success(cls, v, info):
        # `status` is declared before `data`, so it is already in info.data here.
        if info.data.get("status") == TaskStatus.SUCCESS and not v:
            raise ValueError("Successful result must contain data")
        return v
Now enforce this schema at the model level using structured outputs. The model does not get to decide what format to return. The format is the contract.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")
structured_llm = llm.with_structured_output(SubagentResult)


# Assumes AgentState has been extended with "task" and "subagent_result" keys.
def subagent_node(state: AgentState) -> AgentState:
    result: SubagentResult = structured_llm.invoke(
        [{"role": "user", "content": state["task"]}]
    )
    # result is now a validated Pydantic object, not a string.
    # If the model cannot produce a valid SubagentResult, this raises.
    # That exception is caught at the orchestrator layer, not swallowed here.
    return {"subagent_result": result}
Add an explicit validation layer at the orchestrator before any result gets passed downstream. The orchestrator is the trust boundary. Nothing crosses it unvalidated.
from pydantic import ValidationError
import logging

logger = logging.getLogger(__name__)


# `subagent_app` is the compiled subagent graph, assumed to be defined elsewhere.
def orchestrator_node(state: AgentState) -> AgentState:
    try:
        result = subagent_app.invoke(state)
        validated = SubagentResult.model_validate(result["subagent_result"])
        if validated.status == TaskStatus.FAILED:
            # Fail loudly. Do not attempt to recover silently.
            raise ValueError(f"Subagent reported failure: {validated.error_message}")
        if validated.confidence < 0.7:
            # Route to human review. Do not continue with low-confidence output.
            return {"requires_review": True, "result": validated}
        return {"result": validated, "requires_review": False}
    except ValidationError as e:
        # Schema mismatch: this is a contract violation.
        # Log it. Alert on it. Do not proceed.
        logger.error(f"Inter-agent schema violation: {e}")
        raise  # do not swallow this exception
The key principle: validation failures must be loud. A ValidationError at the inter-agent boundary is not a recoverable error to handle gracefully. It is a contract violation. It means your system is operating outside its designed parameters, and proceeding anyway is how you get the silent corruption bugs described in failure mode 1. The correct response is to fail loudly, log the violation with the full payload that caused it, alert your on-call, and not produce an output for this request.
Typed task dispatch from the orchestrator
The contract has to go both directions. The orchestrator sends a typed task definition to the subagent, not a prose instruction.
class TaskDefinition(BaseModel):
    task_id: str
    task_type: str
    input_data: dict
    required_fields: list[str]
    max_tokens: int = 1000
    timeout_seconds: int = 30

    @field_validator("task_type")
    @classmethod
    def task_type_must_be_known(cls, v):
        allowed = {"extract", "analyze", "summarize", "classify"}
        if v not in allowed:
            raise ValueError(f"Unknown task type: {v}. Allowed: {allowed}")
        return v
When the orchestrator dispatches a task, it sends a TaskDefinition. The subagent receives a TaskDefinition. Both sides have the same schema. If the orchestrator tries to send a task type that is not in the allowed set, it fails at dispatch time, not at execution time three hops later when a customer is waiting.
def dispatch_to_subagent(task: TaskDefinition) -> SubagentResult:
    # Validate before sending. Not after receiving.
    # Pydantic does not re-check fields assigned after construction by default,
    # so re-validating the dumped payload catches a task mutated into an invalid state.
    TaskDefinition.model_validate(task.model_dump())
    structured_llm = llm.with_structured_output(SubagentResult)
    result = structured_llm.invoke([{
        "role": "user",
        "content": f"Complete this task: {task.model_dump_json()}"
    }])
    # The result is typed. The orchestrator can read result.status,
    # result.confidence, result.data without parsing anything.
    return result
This is the discipline. It is not more code than the untyped version. It is differently structured code, where the contracts between components are explicit rather than implicit, and failures surface at the right layer rather than three hops downstream as a mysterious wrong answer.
Where MCP fits and where it does not
Model Context Protocol is solving a real problem at the tool layer. When an agent needs to call an external system such as a database, an API, or a file system, MCP gives you a standardized protocol for that communication. Tool definitions are typed. The interface is consistent. You are not hand-rolling JSON schemas for every tool integration.
This is genuinely useful and you should be using it for external tool calls.
But MCP operates between agents and external tools. It does not operate between agents and other agents. When your orchestrator passes a result to a subagent, that communication is not governed by MCP. It is governed by whatever you built, or did not build, at the inter-agent layer.
The distinction matters because a lot of teams are adopting MCP right now and assuming it solves the broader communication problem. It solves the tool-calling part. The agent-to-agent result passing, the state management across hops, and the schema enforcement between orchestrator and subagent are all still your responsibility, and nothing in the MCP spec addresses them.
What MCP handles:
- Tool definitions and discovery
- External system integration with typed inputs and outputs
- Tool call schema validation
- Standardized error responses from external tools
What you still have to build yourself:
- Typed Pydantic schemas for every inter-agent message type, in both directions
- Structured output enforcement at the model level for every agent that produces results consumed by another agent
- Explicit validation at the orchestrator before any result is passed downstream
- Loud failure handling that distinguishes schema violations from task failures from timeout failures
- Observability that captures the inter-agent message at the boundary, not just the final output
Use MCP for what it is designed for. Build the inter-agent contract layer yourself, because nothing else will build it for you. The two are complementary, and a production multi-agent system needs both.
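For the observability item in that list, a minimal sketch of capturing the message at the boundary might look like the following. It appends one structured record per inter-agent message to an in-memory list. `record_boundary_message` and the record fields are hypothetical, and a real system would presumably write to a tracing or logging backend instead of a list.

```python
import hashlib
import json
import time


def record_boundary_message(log: list, sender: str, receiver: str, payload: dict) -> dict:
    """Append one structured record per inter-agent message to `log`."""
    # Sorting keys makes the hash independent of dict insertion order.
    body = json.dumps(payload, sort_keys=True)
    record = {
        "ts": time.time(),
        "sender": sender,
        "receiver": receiver,
        "payload": payload,
        "payload_sha256": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    }
    log.append(record)
    return record


boundary_log: list = []
first = record_boundary_message(
    boundary_log, "orchestrator", "subagent",
    {"task_type": "extract", "task_id": "t-1"},
)
second = record_boundary_message(
    boundary_log, "orchestrator", "subagent",
    {"task_id": "t-1", "task_type": "extract"},
)
print(len(boundary_log), first["payload_sha256"] == second["payload_sha256"])
```

Recording the exact payload, with a stable hash, gives you something to diff when a prompt change alters what crosses the boundary, and it makes duplicate dispatches during a retry storm easy to spot.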
What the architecture looks like when it is done right
The difference between a demo multi-agent system and a production multi-agent system is not model quality or prompt engineering. It is the validation layer at every boundary, and the discipline of treating those boundaries like API contracts rather than informal message-passing.
Every message that crosses an agent boundary should be a typed, validated object. Every agent that produces output should do so through structured output enforcement. Every agent that consumes input should validate before acting. Every failure at a boundary should be loud, logged, and routed explicitly, not swallowed and recovered from silently.
This is not more complex than the default setup. It is differently complex. You pay the cost upfront in schema definition and validation logic instead of paying it later in production debugging sessions at 2am trying to understand why an agent chain is producing plausible-looking wrong answers on 3 percent of inputs.
The three failure modes described above, silent schema mismatch, context collapse across hops, and retry storms, all have the same root cause: untyped communication at the inter-agent boundary. The fix for all three is the same: define the contract, enforce it at the source, validate it at the consumer, and fail loudly when it is violated.
The one thing to take away
Your multi-agent system is probably communicating through unstructured natural language at the inter-agent boundary. It works well enough in development that it feels like it works. It will fail in production in ways that are hard to reproduce, hard to debug, and invisible until a customer finds them.
The fix is not a framework upgrade. It is a protocol decision: treat inter-agent communication like inter-service communication. Define the schema. Enforce it at the source. Validate it at the consumer. Fail loudly when the contract is violated.
Your agents are not special. The boundaries between them are just API boundaries without the discipline that API boundaries normally enforce.
Add the discipline.
At Invra we build production AI agent systems and help engineering teams find and fix the architectural problems that only surface after you ship. If your team is hitting these issues in a live system, or you want to audit your agent architecture before they become incidents, reach out.