You run the same prompt twice against Claude and get completely different answers. Same input, wildcard output. The problem isn’t your prompt—it’s the sampling parameters you haven’t touched.
Temperature, top-p, and top-k are knobs that control how creative or deterministic your model's output is. Get them wrong and you're chasing ghosts with prompt revisions. Get them right and you can run production systems that don't surprise you.
What These Parameters Actually Do
When an LLM generates text, it doesn’t just pick the “best” next word. It assigns a probability to every word in its vocabulary, then samples one. Temperature and top-p change which probabilities it considers and how it samples from them.
Temperature scales the probability distribution before sampling. Lower = more confident in high-probability tokens. Higher = more entropy, more randomness.
- Temperature 0.0: greedy decoding. Always pick the highest-probability token. Deterministic in principle (serving-stack nondeterminism can still leak through), but can get stuck in repetitive loops.
- Temperature 0.3–0.7: the sweet spot for structured extraction, reasoning, and anything where consistency matters.
- Temperature 1.0: the usual default. The model's probability distribution, unmodified.
- Temperature 1.5–2.0: creative chaos, on APIs that allow it (Anthropic caps temperature at 1.0). Good for brainstorming, content variation, roleplay. Expect inconsistency.
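To see the scaling concretely, here's a minimal sketch in plain Python. The logit values are made up for illustration; real vocabularies have tens of thousands of entries, but the math is the same:

```python
import math

def apply_temperature(logits, temperature):
    # Divide logits by T, then softmax: T < 1 sharpens the distribution
    # toward the top token, T > 1 flattens it toward uniform.
    scaled = [x / temperature for x in logits]
    exps = [math.exp(x) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0, 0.5]  # hypothetical scores for four tokens
for t in (0.3, 1.0, 1.5):
    probs = apply_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

At T=0.3 the top token absorbs nearly all the probability mass; at T=1.5 the tail tokens become live options.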
Top-P (nucleus sampling) says: "only sample from the smallest set of tokens whose probabilities add up to P." If top-p is 0.9, the model keeps the highest-probability tokens until their cumulative probability reaches 0.9 and discards the rest of the tail, regardless of temperature.
Top-K is simpler: only consider the K highest-probability tokens. If top-k is 40, the model can only pick from the top 40 tokens by probability, period. Less common than top-p, but some APIs (like Together.ai) use it as default.
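A rough sketch of both filters over a toy five-token distribution (the probabilities are invented for illustration):

```python
def top_k_filter(probs, k):
    # Keep only the k highest-probability tokens, then renormalize.
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    # Keep the smallest set of tokens whose cumulative probability >= p,
    # then renormalize over the survivors.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

probs = [0.55, 0.25, 0.12, 0.05, 0.03]
print(top_k_filter(probs, 2))    # only tokens 0 and 1 survive
print(top_p_filter(probs, 0.9))  # tokens 0, 1, 2 (0.55 + 0.25 + 0.12 >= 0.9)
```

Note the difference: top-k keeps a fixed number of tokens no matter how the probability mass is shaped, while top-p adapts, keeping few tokens when the model is confident and more when it's uncertain.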
The Real Problem: Conflicting Parameters
Temperature 0.7 with top-p 0.99 is almost as random as temperature 1.0. Top-p 0.1 with temperature 2.0 is still constrained. These parameters interact—setting one without considering the others is how you end up tuning blindly.
The key tension: low temperature alone can trigger repetition. When the model gets confident about a phrase, temperature 0 means it repeats that phrase forever. Top-p is the guardrail that lets you escape this: it cuts off the low-probability tail, so you can keep temperature high enough to break out of loops without sampling nonsense.
This is why the recommended approach for production systems is:
- Set temperature between 0.3 and 0.7 (rarely go lower unless you’re fine with repetition)
- Set top-p between 0.8 and 0.95 (wider = more natural, narrower = more constrained)
- Leave top-k alone unless you have a specific reason to use it
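The interaction is easiest to see as one sampling step, with the order of operations spelled out: temperature rescales the logits first, then top-p truncates the tail, then the model samples from what survives. This is a sketch; the logit values and helper name are illustrative, not any library's API:

```python
import math
import random

def sample_next_token(logits, temperature=0.5, top_p=0.9, rng=random):
    # 1. Temperature rescales the logits before softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    # 2. Top-p truncates the tail of the rescaled distribution.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # 3. Sample from the surviving tokens.
    weights = [probs[i] for i in keep]
    return rng.choices(keep, weights=weights)[0]

logits = [3.0, 2.5, 1.0, 0.2, -1.0]
print(sample_next_token(logits, temperature=0.3, top_p=0.9))
```

Because temperature runs first, a low setting concentrates the mass so much that top-p keeps only one or two tokens; a high setting spreads the mass so top-p admits far more of the vocabulary. That's why the two have to be tuned together.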
When to Use What
The framework is simpler than it looks once you stop thinking about “randomness” and start thinking about use cases.
Structured extraction (JSON, classification, numeric output): Temperature 0.3, top-p 0.9. You want consistent parsing. No ambiguity.
Summarization and paraphrasing: Temperature 0.5, top-p 0.9. Slightly more variation than extraction, but still reliable. The model shouldn’t hallucinate different facts.
Open-ended writing (blog posts, emails, content): Temperature 0.7, top-p 0.95. Natural variation without random tangents. The output stays coherent.
Brainstorming and creative tasks: Temperature 1.2, top-p 0.95. Higher temperature forces the model to consider lower-probability ideas. Top-p keeps it from complete nonsense.
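These four presets fit in a small config table. The `PRESETS` name and the conservative fallback are my own convention, not any API's:

```python
# Hypothetical preset table mirroring the use cases above.
PRESETS = {
    "extraction":    {"temperature": 0.3, "top_p": 0.9},
    "summarization": {"temperature": 0.5, "top_p": 0.9},
    "writing":       {"temperature": 0.7, "top_p": 0.95},
    "brainstorming": {"temperature": 1.2, "top_p": 0.95},
}

def sampling_params(task: str) -> dict:
    # Fall back to the most conservative preset for unknown task types.
    return PRESETS.get(task, PRESETS["extraction"])

print(sampling_params("writing"))
```

Keeping the presets in one place means a tuning change propagates to every call site instead of being scattered across the codebase.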
The Production Reality: Test Your Own Stack
Different models respond differently to the same settings. GPT-4o at temperature 0.5 is not the same as Claude Sonnet 4 at temperature 0.5. In my experience, OpenAI models are more sensitive to temperature than Anthropic models; a shift of 0.2 matters more. Note also that OpenAI's temperature scale runs 0–2 while Anthropic's runs 0–1, so the same number means different things on each API.
Here’s what I do for production systems:
```python
# Test harness: same input, same parameters, 10 runs
import anthropic

client = anthropic.Anthropic()

input_prompt = "Extract the company name from: Acme Corp filed for IPO yesterday."

results = []
for i in range(10):
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=50,
        temperature=0.3,
        top_p=0.9,
        messages=[{
            "role": "user",
            "content": input_prompt,
        }],
    )
    results.append(response.content[0].text)

# Check for variance
unique_outputs = set(results)
print(f"Unique outputs from 10 runs: {len(unique_outputs)}")
for output in unique_outputs:
    print(f"  - {output}")
```
Run this and count the unique outputs. If most of the 10 runs differ, your parameters are too high for an extraction task; if every run is identical but you wanted variety, they're too low. Aim for 1–2 unique outputs for structured tasks, 5–8 for open-ended ones.
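That decision rule can live next to the harness as a small helper. The name `consistency_verdict` and the exact thresholds are my own framing of the targets above:

```python
def consistency_verdict(outputs, task="structured"):
    # Rule of thumb: structured tasks should land at 1-2 unique outputs,
    # open-ended tasks at 5-8. Outside the band, adjust in the direction shown.
    unique = len(set(outputs))
    low, high = (1, 2) if task == "structured" else (5, 8)
    if unique < low:
        return "too repetitive: raise temperature or top_p"
    if unique > high:
        return "too random: lower temperature or top_p"
    return "in range"

print(consistency_verdict(["Acme Corp"] * 9 + ["Acme"]))
```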
The Parameter You’re Missing: Seed (When Available)
Temperature and top-p control randomness within bounds. If you need true reproducibility, the same output every single time, some APIs support a seed parameter (OpenAI exposes one for its recent chat models on a best-effort basis; support varies by provider, so check your docs before relying on it).
A seed doesn't guarantee identical output across model versions, and providers typically describe it as best-effort even within one version, but it makes output highly repeatable for the same model, parameters, and prompt. If you're building a system where output variance breaks downstream processes, seed plus temperature 0.3 is your move.
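The mechanics are easy to see locally with Python's `random` module. This simulates what an API-side seed buys you, identical draws from identical state; it doesn't call any provider:

```python
import random

def sample_tokens(probs, n, seed):
    # A seeded RNG makes the sampling sequence fully repeatable.
    rng = random.Random(seed)
    return [rng.choices(range(len(probs)), weights=probs)[0] for _ in range(n)]

probs = [0.6, 0.3, 0.1]  # toy three-token distribution
run_a = sample_tokens(probs, 5, seed=42)
run_b = sample_tokens(probs, 5, seed=42)
print(run_a == run_b)  # same seed, same draws
```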
Starting Point: One Action Today
Pick one production system you’re running. Log into your API dashboard and check what temperature and top-p you’re currently using. If they’re set to the API defaults and you’re seeing unexpected variance, change them to 0.3 and 0.9 respectively for your next 100 requests. Measure consistency. If it’s still inconsistent, lower temperature to 0. If it becomes too repetitive, raise top-p to 0.95 and try again.
Don’t tune everything at once. Change one parameter, measure, repeat. In a week you’ll know what works for your specific use case—not theory, fact.