Prompt Engineering vs Fine-Tuning: Which Strategy to Choose
An engineering framework for deciding between prompt engineering, fine-tuning, and RAG based on cost, latency, and quality trade-offs.
The question comes up in every LLM project: should you invest in prompt engineering, fine-tune a model, or build a RAG system? This is not a matter of technical preference — it is an engineering decision with implications for cost, maintainability, latency, and quality.
The Decision Framework
Three dimensions structure the choice:
1. Domain stability: do your data and business rules change frequently?
2. Training data volume: do you have hundreds or thousands of labeled examples?
3. Operational constraints: what are your latency, cost, and confidentiality requirements?
When Prompt Engineering Is Sufficient
Prompting is the right answer when:
- The domain is volatile: rules change often, the model must adapt quickly
- You have little training data (< 100 examples)
- The base model already covers the domain (common languages, general tasks)
- You need flexibility: modify behavior in production without redeployment
Empirical heuristic: a well-structured system prompt can often reach
roughly 80% of the achievable quality with no fine-tuning at all, depending
on the domain and task. Treat this figure as a rule of thumb; it varies significantly in practice.
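Advanced prompting techniques (chain-of-thought, few-shot examples, structured outputs with a JSON schema, role prompting) can improve results substantially. The upfront investment is low, iteration is fast, and there is no training cost. The per-request cost is not zero, though: it grows with prompt length and with the size of the model.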
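As a rough illustration of how few-shot examples, role prompting, and structured output fit together, here is a minimal sketch. The ticket-triage task, the labels, and the helper names `build_messages` and `parse_reply` are all hypothetical and chosen only for the example; the message layout follows the common system/user/assistant chat convention, and the validation is hand-written rather than a full JSON Schema check.

```python
import json

# Hypothetical task: triage support tickets. Labels and examples are illustrative.
ALLOWED_LABELS = {"billing", "bug", "feature_request"}

FEW_SHOT_EXAMPLES = [
    ("I was charged twice this month.", {"label": "billing", "urgent": True}),
    ("Could you add a dark mode?", {"label": "feature_request", "urgent": False}),
]


def build_messages(ticket: str) -> list[dict]:
    """Assemble a role prompt, the few-shot pairs, and the new input."""
    messages = [{
        "role": "system",
        "content": (
            "You are a support triage assistant. Reply with a JSON object "
            'that has exactly two keys: "label" (one of '
            + ", ".join(sorted(ALLOWED_LABELS))
            + ') and "urgent" (a boolean).'
        ),
    }]
    # Each few-shot example is a user turn followed by the ideal assistant turn.
    for text, answer in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": json.dumps(answer)})
    messages.append({"role": "user", "content": ticket})
    return messages


def parse_reply(raw: str) -> dict:
    """Check that the model's reply matches the structure the prompt asked for."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(data, dict) or set(data) != {"label", "urgent"}:
        raise ValueError("reply must be an object with keys 'label' and 'urgent'")
    if data["label"] not in ALLOWED_LABELS:
        raise ValueError(f"unknown label: {data['label']!r}")
    if not isinstance(data["urgent"], bool):
        raise ValueError("'urgent' must be a boolean")
    return data
```

Changing the behavior means editing `FEW_SHOT_EXAMPLES` or the system text, which is the flexibility argument made above: no retraining and no redeployment of a model.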
Critical limitation: prompting cannot inject factual knowledge the model does not have. For proprietary or recent data, you need RAG or fine-tuning.
When to Fine-Tune
Fine-tuning becomes relevant when:
- The domain is stable and specialized (technical terminology, tightly constrained output format)
- You have sufficient data (500+ high-quality examples, ideally 5000+)
- You have latency constraints: a fine-tuned smaller model often beats a large model with a long system prompt
- You have strong confidentiality constraints: here the decisive factor is not fine-tuning itself but where the model and pipeline are hosted. Fine-tuning through an external service still sends your training data to a third party, so it does not resolve the constraint on its own
```python
# Typical structure of a fine-tuning dataset (OpenAI format)
training_example = {
    "messages": [
        {
            "role": "system",
            "content": "You are an assistant specializing in French contract law."
        },
        {
            "role": "user",
            "content": "What is the difference between a penalty clause and damages?"
        },
        {
            "role": "assistant",
            # Expert response validated by a lawyer
            "content": "A penalty clause is a contractual stipulation..."
        }
    ]
}
```

Fine-tuning on open-source models (Mistral, LLaMA 3, Qwen) with techniques like LoRA or QLoRA produces domain-specific models in a few hours on accessible hardware.
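Chat-format examples like the one above are usually stored as JSON Lines, one example per line. The sketch below shows one way to sanity-check examples before writing them out. The helper names `validate_example` and `to_jsonl` are hypothetical, and the role checks are a simplifying assumption (real formats may allow additional roles such as tool messages), so treat it as a starting point, not a complete validator.

```python
import json

# Simplifying assumption: only these three roles appear in the dataset.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_example(example: dict) -> list[str]:
    """Return a list of problems found in one chat-format example."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    problems = []
    roles = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            roles.append(None)
            continue
        role, content = msg.get("role"), msg.get("content")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {role!r}")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or non-string content")
        roles.append(role)
    if "user" not in roles:
        problems.append("no user message")
    if roles[-1] != "assistant":
        problems.append("last message should be the assistant's target reply")
    return problems


def to_jsonl(examples: list[dict]) -> str:
    """Serialize examples as JSON Lines, refusing any invalid example."""
    lines = []
    for n, example in enumerate(examples):
        problems = validate_example(example)
        if problems:
            raise ValueError(f"example {n}: {'; '.join(problems)}")
        lines.append(json.dumps(example, ensure_ascii=False))
    return "\n".join(lines) + "\n"
```

Automated checks like these catch structural mistakes only. The quality bar that matters for fine-tuning, such as expert-validated answers, still has to be enforced by review.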
When to Use RAG Instead of Fine-Tuning
RAG is often confused with fine-tuning, but they address different needs:
| Need | Fine-tuning | RAG |
|---|---|---|
| Response style/format | Yes | No |
| Recent factual knowledge | Limited | Yes |
| Large corpus (>10K docs) | Impractical | Yes |
| Source traceability | No | Yes |
| Real-time updates | No | Yes |
The rule of thumb: fine-tune for the how (style, format, behavior), use RAG for the what (factual content, business knowledge).
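To make the "what" side concrete, here is a toy retrieval sketch using only the standard library. Everything in it is an assumption for illustration: the three documents, their ids, and the bag-of-words cosine similarity, which stands in for the embedding model and vector index a real RAG system would typically use. The point is the shape of the flow: retrieve, then place the passages and their ids in the prompt so the answer can cite them.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base; a real system would query an index instead.
DOCUMENTS = {
    "policy-refunds": "Refunds are accepted within 30 days of purchase.",
    "policy-shipping": "Standard shipping takes five business days.",
    "faq-support": "Premium accounts include priority support by email.",
}


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, k: int = 2) -> list[tuple[str, str]]:
    """Return the k (doc_id, text) pairs most similar to the query."""
    q = tokenize(query)
    ranked = sorted(
        DOCUMENTS.items(),
        key=lambda item: cosine(q, tokenize(item[1])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, k: int = 2) -> str:
    """Inject retrieved passages, tagged with ids, ahead of the question."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in retrieve(query, k))
    return (
        "Answer using only the sources below and cite their ids.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )
```

Updating the system's knowledge here means editing `DOCUMENTS`, with no retraining, which is why the table lists real-time updates and source traceability under RAG.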
Comparative Cost Matrix
| Strategy | Upfront cost | Marginal cost/request | Maintenance |
|---|---|---|---|
| Prompting only | Low | High (large model) | Low |
| RAG | Medium (index infra) | Medium | Medium (index) |
| Fine-tuning + hosted model | Medium | Low (small model) | High |
| Fine-tuning + self-hosted model | High | Very low | Very high |
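The matrix implies a break-even volume: a strategy with a higher upfront cost pays off only after enough requests. The sketch below does that arithmetic. The function names are hypothetical, and the figures in the usage note are placeholders, not measurements. Maintenance is deliberately left out, so the result should be read as optimistic for the fine-tuning rows.

```python
import math
from typing import Optional


def total_cost(upfront: float, cost_per_1k: float, requests: int) -> float:
    """Upfront cost plus marginal cost, with marginal cost quoted per 1,000 requests."""
    return upfront + cost_per_1k * requests / 1000


def break_even_requests(
    upfront_a: float, per_1k_a: float, upfront_b: float, per_1k_b: float
) -> Optional[int]:
    """Requests after which strategy B is no more expensive than strategy A.

    Returns None if B never catches up (its marginal cost is not lower),
    and 0 if B is no more expensive from the first request.
    """
    saving_per_1k = per_1k_a - per_1k_b
    extra_upfront = upfront_b - upfront_a
    if extra_upfront <= 0 and saving_per_1k >= 0:
        return 0
    if saving_per_1k <= 0:
        return None
    return math.ceil(extra_upfront / saving_per_1k * 1000)
```

With made-up inputs of 500 upfront and 20 per 1,000 requests for prompting, against 5,000 upfront and 5 per 1,000 requests for a fine-tuned small model, `break_even_requests(500, 20.0, 5000, 5.0)` gives 300,000 requests. Below that volume, prompting alone remains cheaper under these assumptions.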
The Combined Strategy
In practice, production-grade systems often combine all three approaches:
- A fine-tuned model for style, format, and critical behaviors
- RAG to inject domain knowledge and recent data
- System prompts for contextual instructions and guardrails
This layered architecture lets each technique do what it is best at while keeping costs under control at scale. The key is to exhaust the gains available from prompting first, then from RAG, before investing in fine-tuning.
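The criteria from the earlier sections can be folded into a small helper that returns the layers in the order worth investing in. This is only a sketch: the `ProjectProfile` fields and the `recommend` function are hypothetical names, the 500-example threshold is the heuristic quoted above, and a real decision would also weigh cost and hosting constraints.

```python
from dataclasses import dataclass


@dataclass
class ProjectProfile:
    domain_is_volatile: bool                 # do rules and data change often?
    labeled_examples: int                    # high-quality training examples available
    needs_recent_or_proprietary_facts: bool  # knowledge the base model lacks
    needs_strict_style_or_format: bool       # tightly constrained outputs
    latency_sensitive: bool                  # a smaller model would help


def recommend(profile: ProjectProfile) -> list[str]:
    """Return the layers to invest in, in the suggested order."""
    layers = ["prompting"]  # always the starting point
    if profile.needs_recent_or_proprietary_facts:
        layers.append("rag")  # the "what": facts and business knowledge
    enough_data = profile.labeled_examples >= 500
    stable_domain = not profile.domain_is_volatile
    has_reason = profile.needs_strict_style_or_format or profile.latency_sensitive
    if enough_data and stable_domain and has_reason:
        layers.append("fine-tuning")  # the "how": style, format, behavior
    return layers
```

For example, a volatile domain with 50 labeled examples yields only `["prompting"]`, while a stable, latency-sensitive domain with 5,000 examples and proprietary facts yields all three layers.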