You built a chatbot that answers questions. Now you need it to do something — fetch data, call an API, update a database. The difference between a chatbot and an agent is a single constraint: agents take actions based on what they learn.
Most attempts fail because developers treat tool calling as an add-on, not as the core of the system. They call an LLM, wait for a response, pass tools as an afterthought. Production agents need a different architecture — one that treats the LLM as a decision engine, not a text generator.
Tool Calling: The Contract, Not the Feature
Tool calling isn’t about giving an LLM access to functions. It’s about defining a contract the LLM must follow.
When you define a tool, you’re not giving the model a black box. You’re specifying:
- What the tool does (description)
- What parameters it requires (schema)
- What format it returns (output specification)
Most tool calling fails because descriptions are vague. "Get user information" loses the model immediately — which information? What format does the ID take? What happens if the user doesn't exist?
Here’s what a bad tool definition looks like:
```json
{
  "name": "get_user",
  "description": "Get user information",
  "parameters": {
    "type": "object",
    "properties": {
      "user_id": {
        "type": "string"
      }
    }
  }
}
```
The LLM doesn’t know what happens when user_id is invalid. It doesn’t know if user_id should be a UUID or an integer. It doesn’t know what fields the response contains.
Here’s the improved version:
```json
{
  "name": "get_user_profile",
  "description": "Retrieve a user's profile by ID. Returns basic account info including name, email, creation date, and account status. Returns null if user not found.",
  "parameters": {
    "type": "object",
    "properties": {
      "user_id": {
        "type": "string",
        "description": "UUID of the user. Format: 550e8400-e29b-41d4-a716-446655440000"
      }
    },
    "required": ["user_id"]
  },
  "returns": {
    "type": "object",
    "properties": {
      "id": {"type": "string"},
      "name": {"type": "string"},
      "email": {"type": "string"},
      "status": {"type": "string", "enum": ["active", "suspended", "deleted"]},
      "created_at": {"type": "string"}
    }
  }
}
```
Recent Claude models follow specific, well-documented schemas far more consistently than vague ones. Vague definitions still confuse them — that's not a model limitation, that's a design failure.
The Loop: Making Decisions Sequential
An agent loop is simple in structure but broken in almost every first implementation.
The basic flow: LLM decides → tool executes → result returns → LLM decides again → repeat until done.
Here’s a working Python example using Claude:
```python
import anthropic
import json

client = anthropic.Anthropic()

tools = [
    {
        "name": "fetch_order",
        "description": "Get order details by order ID",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "Unique order identifier"
                }
            },
            "required": ["order_id"]
        }
    },
    {
        "name": "update_order_status",
        "description": "Update an order's status",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
                "status": {
                    "type": "string",
                    "enum": ["pending", "shipped", "delivered"]
                }
            },
            "required": ["order_id", "status"]
        }
    }
]

messages = [{"role": "user", "content": "Check order ABC123 and mark it as shipped"}]

while True:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        tools=tools,
        messages=messages
    )

    if response.stop_reason == "tool_use":
        # The LLM wants to use one or more tools
        tool_calls = [block for block in response.content if block.type == "tool_use"]
        messages.append({"role": "assistant", "content": response.content})

        tool_results = []
        for tool_call in tool_calls:
            # Execute tool (stubbed here)
            if tool_call.name == "fetch_order":
                result = {"id": "ABC123", "status": "pending", "total": 99.99}
            elif tool_call.name == "update_order_status":
                result = {"success": True, "new_status": "shipped"}
            else:
                # Always handle the unknown-tool case instead of crashing
                result = {"error": f"Unknown tool: {tool_call.name}"}
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": tool_call.id,
                "content": json.dumps(result)
            })

        messages.append({"role": "user", "content": tool_results})
    else:
        # The LLM reached end_turn (or hit max_tokens)
        final_response = next(
            (block.text for block in response.content if hasattr(block, "text")),
            None
        )
        print(final_response)
        break
```
The critical mistake most developers make: they treat tool results as unstructured text. If a tool returns JSON, parse it and make the structure explicit to the LLM. Don’t force it to parse messy strings.
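As a minimal sketch of that rule — the helper below is hypothetical, not part of any SDK — normalize each tool's return value into a predictable JSON string before appending it as a tool result:

```python
import json

def to_structured_result(raw):
    """Normalize a tool's raw return value into a consistent JSON string,
    so the LLM always receives explicit structure rather than a text blob.

    `raw` may be a dict, a JSON string, or plain text.
    """
    if isinstance(raw, dict):
        return json.dumps(raw)
    if isinstance(raw, str):
        try:
            # Already valid JSON: re-serialize so formatting is consistent
            return json.dumps(json.loads(raw))
        except json.JSONDecodeError:
            # Plain text: wrap it so the structure is still explicit
            return json.dumps({"text": raw})
    # Fall back to wrapping any other value
    return json.dumps({"value": raw})
```

The payoff is downstream: every `tool_result` block has the same shape, so the model never has to guess whether it's looking at data or prose.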
Memory: What Agents Actually Need to Remember
Memory is not conversation history. That’s the first thing to unlearn.
An agent needs three types of memory:
- Session memory: What happened in this conversation — user goals, context from previous turns. This is short-term and conversation-specific.
- Knowledge memory: Facts about the user, domain, or system state that persist across conversations. This is long-term and shared.
- Execution memory: What the agent has already tried, what failed, what succeeded. This prevents loops and repeated errors.
Most systems conflate all three into a message history. That kills performance.
Session memory should live in the message array — but summarized, not raw. After 20 turns, compress earlier context into a single system message instead of keeping all 20 turns in context.
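A sketch of that compression step, assuming a `summarize` callback (in practice usually another LLM call; stubbed here) and a function name of my own invention:

```python
def compress_history(messages, summarize, keep_last=6):
    """Replace all but the last `keep_last` messages with a single
    summary message. `summarize` is a callback that turns a list of
    messages into a short text summary (typically another LLM call).
    """
    if len(messages) <= keep_last:
        return messages
    older, recent = messages[:-keep_last], messages[-keep_last:]
    summary = summarize(older)
    # One compact context message stands in for all the older turns
    return [
        {"role": "user", "content": f"Summary of earlier conversation: {summary}"}
    ] + recent
```

Depending on the API you're using, the summary could instead go into the system prompt; the point is that only one compact message carries the old context, not twenty raw turns.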
Knowledge memory should be separate — a vector database (Pinecone, Weaviate) or a structured key-value store. When you need user context, fetch it explicitly with a tool call, don’t stuff it into the initial prompt.
Execution memory should be an explicit log. Before the agent tries a tool, check if it’s already attempted that tool in this session. If it failed last time, pass that failure to the LLM as context.
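A minimal sketch of that check — the function name is mine — assuming log entries with `tool`, `status`, and `error` fields, the same shape as the execution log in the example structure in this section:

```python
def prior_failure_note(execution_log, tool_name):
    """Return a context note for the LLM if this tool's most recent
    attempt in this session failed; None if it hasn't been tried or
    its last attempt succeeded.
    """
    for entry in reversed(execution_log):
        if entry["tool"] == tool_name:
            if entry["status"] == "failed":
                return (
                    f"Note: {tool_name} already failed this session with "
                    f"error: {entry.get('error', 'unknown')}. "
                    "Do not retry with the same arguments."
                )
            return None  # most recent attempt succeeded
    return None
```

Injecting this note into the next LLM turn is what breaks retry loops: the model sees the earlier failure instead of rediscovering it.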
Example structure:
```json
{
  "session_id": "conv_12345",
  "user_goal": "Update billing address and confirm new payment method",
  "session_context": "User has active subscription. Previously tried to update payment in December but process failed.",
  "execution_log": [
    {"tool": "fetch_user_profile", "status": "success", "timestamp": "2025-01-15T10:22:00Z"},
    {"tool": "validate_address", "status": "failed", "error": "Postal code invalid", "timestamp": "2025-01-15T10:22:15Z"}
  ],
  "knowledge_refs": ["user_payment_history", "subscription_terms"],
  "messages": [
    {"role": "user", "content": "Update my address..."},
    {"role": "assistant", "content": "I'll help with that..."}
  ]
}
```
Do This Today
Pick one tool your agent needs to call. Write out the schema with a 3-sentence description, list every parameter with its format and constraints, and define the exact shape of the response. Test it by hand — feed the schema to Claude or GPT-4o and ask it to call the tool. If it calls it wrong, your schema is incomplete.