You asked the model the same question twice. Got two completely different answers. One was useful. One was nonsense.
This isn’t randomness. It’s a setting. Temperature and top-p control how a model picks the next word. Get them wrong and you get inconsistency. Get them right and you get reproducible behavior.
What Temperature Actually Does
Temperature controls how confident the model acts. Think of it as a dial on “how risky the model wants to be.”
Every LLM works the same way at generation time: it calculates a probability score for every possible next word. A word might have a 45% chance, another 23%, another 12%. Then it picks one.
Temperature reshapes those probabilities before the model picks.
- Temperature = 0: Always picks the highest-probability word. Deterministic. Same input, same output, every time. Use this for data extraction, classification, structured outputs.
- Temperature = 1: Uses probabilities as-is. Balanced. This is often the default. Creative enough, consistent enough.
- Temperature = 2+: Flattens the probabilities. Low-probability words become more likely. High variability. Use this for brainstorming or creative writing only.
Here’s what matters: temperature 0 gives you consistency. Temperature above 1 gives you variety at the cost of predictability. In production systems, you almost always want temperature between 0 and 0.7.
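The reshaping step is easy to see in code. Here is a minimal Python sketch, not any provider’s actual implementation: it divides the raw model scores (logits) by the temperature before converting them to probabilities with a softmax. (Temperature = 0 is a special case in real systems: the model simply takes the top-scoring word rather than dividing by zero.)

```python
import math

def apply_temperature(logits, temperature):
    """Toy sketch: turn raw scores into probabilities, reshaped by temperature."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max before exp() for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next words.
logits = [2.0, 1.4, 0.5]

print(apply_temperature(logits, 0.1))  # near-deterministic: the top word dominates
print(apply_temperature(logits, 1.0))  # probabilities used as-is
print(apply_temperature(logits, 2.0))  # flattened: the gap between words shrinks
```

Run it and you can watch the distribution sharpen at low temperature and flatten at high temperature, which is all the dial does.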
Top-P: The Cutoff That Fixes Long Tails
Top-p (nucleus sampling) solves a different problem than temperature.
Temperature changes how confident the model is. Top-p controls which words the model is even allowed to consider.
If top-p = 0.9, the model only considers words that make up the top 90% of probability mass. It ignores the long tail of weird, low-probability words.
Why does this matter? Temperature alone doesn’t prevent bad outputs. If temperature is 1 and the model calculates probabilities like this:
- Word A: 40%
- Word B: 35%
- Word C: 15%
- Word D: 5%
- Word E: 3%
- Word F: 2%

It might still pick Word E or F sometimes. Adding top-p = 0.9 keeps only the highest-probability words whose cumulative total reaches 90%: Words A, B, and C together cover exactly 90%, so Words D, E, and F are removed from consideration entirely.
Use top-p to eliminate nonsense outputs without sacrificing useful variety. Top-p between 0.8 and 0.95 works well for most use cases.
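The cutoff itself is simple enough to sketch. This toy Python version (illustrative, not a real sampler) ranks the candidates, walks down the list until the cumulative probability reaches top-p, and drops everything after that point:

```python
def top_p_filter(probs, top_p):
    """Toy sketch: keep the smallest set of words whose cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for word, p in ranked:
        kept[word] = p
        cumulative += p
        if cumulative >= top_p - 1e-9:  # small tolerance for floating-point sums
            break  # everything past this point is the long tail
    # Renormalize so the surviving probabilities sum to 1 again.
    total = sum(kept.values())
    return {word: p / total for word, p in kept.items()}

# The distribution from the example above.
probs = {"A": 0.40, "B": 0.35, "C": 0.15, "D": 0.05, "E": 0.03, "F": 0.02}
print(top_p_filter(probs, 0.9))  # only A, B, C survive; D, E, F are cut
```

Note the renormalization step: after the tail is cut, the surviving words split the full probability between them, so the model still always picks something.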
When to Use Each Setting
Structured extraction (JSON, CSV, classification): Temperature = 0, top-p = 1. You want the single best answer, consistently.
# Prompt for extracting data
User: "Extract the customer name, order date, and total from this invoice. Return as JSON only."
Invoke with:
- temperature: 0
- top_p: 1
Chat or question-answering: Temperature = 0.7, top-p = 0.9. Consistent enough to stay on-topic, varied enough to feel natural.
```python
# Example in Python using the Claude API
import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    temperature=0.7,
    top_p=0.9,
    messages=[
        {"role": "user", "content": "Explain why my API requests are timing out."}
    ],
)
print(message.content[0].text)
```
Brainstorming or creative work: Temperature = 1.0 to 1.5, top-p = 0.95. You want variation without complete chaos.
Grading or evaluation: Temperature = 0, top-p = 1. Consistency in scoring matters more than anything.
The Real Problem: Interaction Effects
Temperature and top-p interact. They’re not independent dials.
If you set temperature = 0.1 (very conservative) but top-p = 0.5 (aggressive cutoff), the cutoff barely matters, because the low temperature has already concentrated probability on the top words. If you set temperature = 2 and top-p = 0.99, you’re fighting yourself: the temperature spreads probability into the tail while the near-1 cutoff lets almost all of it through.
Start with one dial. Temperature first, usually. If you’re still getting nonsense outputs after dropping temperature to 0.3 or lower, add top-p = 0.85.
In practice: temperature 0 to 0.7 + top-p 0.9 to 1 handles 95% of production use cases. Anything beyond that is usually over-tuning.
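The interaction is easy to demonstrate with another toy sketch (illustrative Python, not a real sampler): apply temperature first, then count how many candidate words survive the top-p cutoff.

```python
import math

def survivors(logits, temperature, top_p):
    """Toy sketch: apply temperature, then count words that survive the top-p cutoff."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max before exp() for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    probs = sorted((e / sum(exps) for e in exps), reverse=True)
    count, cumulative = 0, 0.0
    for p in probs:
        count += 1
        cumulative += p
        if cumulative >= top_p:
            break
    return count

# Hypothetical scores for five candidate words.
logits = [2.0, 1.4, 0.5, -0.5, -1.0]

# At temperature 0.1 the distribution is so peaked that top_p = 0.5
# keeps a single word: the "aggressive" cutoff changes nothing.
print(survivors(logits, 0.1, 0.5))   # 1
# At temperature 2 with top_p = 0.99, nearly everything survives:
# the two settings are pulling in opposite directions.
print(survivors(logits, 2.0, 0.99))  # 5
```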
Test Your Settings Today
Pick one prompt you use regularly. Run it 10 times with temperature = 0. Count how often you get the same output.
Then run it 10 times with temperature = 1 without changing anything else. Notice the difference. This is the gap between consistency and variability.
Now find the temperature that gives you 80% output consistency with acceptable output quality for your use case. That’s your baseline. Add top-p if you still see occasional garbage.
Log the settings. Use them. If you’re debugging output quality later, you’ll know exactly what you changed and why.
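A small helper makes the consistency count concrete. The `runs` list below holds placeholder strings; in practice you would fill it with the actual outputs from your 10 calls:

```python
from collections import Counter

def consistency_ratio(outputs):
    """Fraction of runs that produced the most common output (1.0 = fully deterministic)."""
    top_count = Counter(outputs).most_common(1)[0][1]
    return top_count / len(outputs)

# Placeholder outputs standing in for 10 real model responses.
runs = ["answer A"] * 8 + ["answer B"] * 2
print(f"{consistency_ratio(runs):.0%} consistent")  # 80%: right at the suggested baseline
```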