14. LLMs, Prompt Engineering & RAG#
1. LLM System Prompt — Overview#
What is a System Prompt?#
- Instructions given to LLM before user interaction begins
- Defines model’s tone, behavior, response style, knowledge domain
- Most impactful aspect = knowledge domain + expertise level ✅
Purpose:#
- ✅ Defines core communication strategy, learning approach, interaction guidelines — exam answer (JAN_AN Q430)
- ❌ Does NOT create fixed predetermined responses
- ❌ Does NOT prevent model from understanding context
- ❌ Does NOT replace human instructor guidance completely
- ❌ Does NOT limit AI’s ability to understand complex concepts
2. System Prompt — Knowledge Domain Impact#
- What the system prompt instructs the LLM to adopt most strongly shapes its responses
- ✅ Knowledge domain + expertise level — exam answer (JAN_AN Q411)
- ❌ Character limit set by system prompt → not most impactful
- ❌ Formatting requirements → secondary concern
- ❌ Language model version → not defined in system prompt
3. System Prompt — What It Does NOT Do#
❌ Creates fixed predetermined responses for every query
❌ Prevents deviation from pre-determined responses
❌ Configures LLM to only respond to specific commands
❌ Replaces human instructor guidance completely
4. Socratic System Prompt — Educational Use#
Best System Prompt for Educational Assistant:#
✅ "You are an interactive learning assistant for climate science.
Guide students through complex concepts by asking reflective
questions. Avoid giving direct solutions. Encourage independent
thinking and help students develop problem-solving skills."
- ✅ Balancing information delivery with Socratic questioning — exam answer (May_FN Q371)
- ❌ “Provide direct answers to maximize efficiency” → defeats educational purpose
- ❌ “Limiting responses to prevent information overload” → too restrictive
- ❌ “Using technical jargon to maintain academic rigor” → not pedagogically effective
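A minimal sketch of wiring the Socratic system prompt above into an API call, using the Anthropic Python client that also appears later in this section (the prompt text and user question are illustrative):

import os
import anthropic

client = anthropic.Anthropic(api_key=os.getenv('ANTHROPIC_API_KEY'))

# The system prompt goes in the dedicated 'system' parameter,
# separate from the user/assistant message history
response = client.messages.create(
    model="claude-3-sonnet-20240229",
    max_tokens=300,
    system=("You are an interactive learning assistant for climate science. "
            "Guide students through complex concepts by asking reflective "
            "questions. Avoid giving direct solutions."),
    messages=[{"role": "user", "content": "How do greenhouse gases trap heat?"}]
)
print(response.content[0].text)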
5. Direct vs Socratic Response Style:#
| Style | When |
|---|---|
| Socratic | Education, foster critical thinking ✅ |
| Direct | Production systems, efficiency needed |
| Balanced | General purpose use |
6. LLM POST Request — API Inference#
- LLM APIs use POST requests for inference
- Request body contains: model, messages, max_tokens
import requests, os

response = requests.post(
    'https://api.anthropic.com/v1/messages',
    headers={
        'x-api-key': os.getenv('ANTHROPIC_API_KEY'),
        'anthropic-version': '2023-06-01',  # required header for the Messages API
    },
    json={
        'model': 'claude-3-sonnet-20240229',
        'max_tokens': 100,
        'messages': [{'role': 'user', 'content': 'Classify: Great product!'}]
    }
)
result = response.json()['content'][0]['text']
- ✅ POST request with review text in request body — exam answer (TDS Q28)
- ❌ GET request → only retrieves, can’t send body
- ❌ DELETE → removes resource
- ❌ PUT → updates existing resource
7. Prompt Specificity — Key Principle#
Vague vs Specific:#
❌ "Tell me about solar energy"
❌ "Discuss solar energy advancements"
❌ "Write a paragraph about solar energy"
✅ "Outline 4 key solar energy technological breakthroughs
from 2014-2024, including specific efficiency improvements
and implementation challenges"
- ✅ Specific prompt with: count + timeframe + aspects + constraints — exam answer
- More specific = more precise and useful response
8. Specify Output Format in Prompt:#
❌ "Classify this review"
→ Output: "The sentiment is Negative."
✅ "Classify sentiment as exactly one word in lowercase:
positive, negative, or neutral. Review: {text}"
→ Output: "negative"
9. Few-Shot Prompting:#
"Classify customer tickets:
Example 1:
Input: 'App keeps crashing'
Output: technical
Example 2:
Input: 'Wrong charge on my card'
Output: billing
Now classify:
Input: '{ticket_text}'
Output:"
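A small sketch of assembling that few-shot prompt programmatically (the EXAMPLES list and build_few_shot_prompt helper are illustrative, not from the source):

EXAMPLES = [
    ("App keeps crashing", "technical"),
    ("Wrong charge on my card", "billing"),
]

def build_few_shot_prompt(ticket_text):
    # Labeled examples first, then the new input in the same format
    lines = ["Classify customer tickets:"]
    for i, (text, label) in enumerate(EXAMPLES, 1):
        lines.append(f"Example {i}:\nInput: '{text}'\nOutput: {label}")
    lines.append(f"Now classify:\nInput: '{ticket_text}'\nOutput:")
    return "\n".join(lines)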
"Solve this step by step:
1. First identify the main issue
2. Consider possible causes
3. Propose solution
Problem: {problem}"
11. LLM Token Costs — What Matters#
Factors that Impact Cost:#
- ✅ Token count of input prompts — exam answer
- ✅ Token count of generated responses — exam answer
- ✅ Query complexity requiring deeper reasoning — exam answer
- ✅ Context window utilization for multi-turn conversations — exam answer
- ❌ Time of day when queries submitted → does NOT affect cost
- ❌ Student’s academic level → does NOT affect cost
- ❌ Fixed regardless of complexity → FALSE
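A back-of-envelope cost sketch reflecting these factors (the per-token prices are placeholders, not real rates; check your provider's pricing page):

# Hypothetical per-token prices in USD, NOT real rates
PRICE_PER_INPUT_TOKEN = 3 / 1_000_000    # e.g. $3 per million input tokens
PRICE_PER_OUTPUT_TOKEN = 15 / 1_000_000  # e.g. $15 per million output tokens

def estimate_cost(input_tokens, output_tokens):
    # Cost scales with prompt length AND response length
    return (input_tokens * PRICE_PER_INPUT_TOKEN
            + output_tokens * PRICE_PER_OUTPUT_TOKEN)

# Multi-turn conversations re-send the growing context as input tokens,
# so context window utilization drives cost up turn by turn
total = sum(estimate_cost(500 * turn, 200) for turn in range(1, 11))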
12. LLM Batch Processing — Production:#
import time, os
import anthropic

client = anthropic.Anthropic(api_key=os.getenv('ANTHROPIC_API_KEY'))

def analyze_reviews(reviews):
    results = []
    cache = {}  # skip duplicate API calls for repeated reviews
    for review in reviews:
        if review in cache:
            results.append(cache[review])
            continue
        try:
            response = client.messages.create(
                model="claude-3-sonnet-20240229",
                max_tokens=50,
                messages=[{
                    "role": "user",
                    "content": f"Classify as positive/negative/neutral: {review}"
                }]
            )
            label = response.content[0].text.strip()
        except anthropic.APIError:
            label = "error"  # record the failure instead of crashing the batch
        cache[review] = label
        results.append(label)
        time.sleep(0.5)  # rate limiting ✅
    return results
- ✅ Process in batches, track costs, handle errors, cache results — exam answer (TDS Q30)
- ❌ Send all 500 in single API call → context limit exceeded
- ❌ Only analyze 5 to save money → insufficient coverage
- ❌ Call API repeatedly for same review → wasteful
13. LLM Capabilities & Limitations#
What LLMs CAN Do:#
✅ Generate coherent contextual text
✅ Classify sentiment, categories
✅ Summarize documents
✅ Extract information from text
✅ Answer questions based on context
✅ Write and explain code
✅ Evaluate statistical validity when prompted well
✅ Follow formatting instructions
What LLMs CANNOT Do:#
❌ Access real-time internet (unless tool-enabled)
❌ Access live databases or patient records
❌ Cite papers published after training cutoff
❌ Guarantee factually correct information (hallucination)
❌ Guarantee consistent output format without prompting
❌ Remember previous conversations (stateless by default)
- ✅ LLMs may generate plausible-sounding but incorrect medical info — exam answer (JAN_AN Q412)
- ✅ Quality of response depends on prompt specificity — exam answer
- ✅ Can evaluate statistical validity when properly prompted — exam answer
- ❌ Can directly access post-cutoff studies → false
- ❌ Have real-time patient data access → false
- ❌ Quality unaffected by vague prompts → false
14. LLM Output Consistency — Enforcement#
Problem:#
LLM outputs:
"negative" ← correct format
"Negative" ← wrong case
"neg" ← wrong format
"NEGATIVE." ← wrong case + punctuationSolution 1 — Prompt Engineering (Best):#
prompt = """
Respond with EXACTLY one word in lowercase: positive, negative, or neutral.
No punctuation. No explanation.
Review: {text}
"""Solution 2 — Post-Processing (Backup):#
def normalize(output):
output = output.strip().lower()
if 'pos' in output: return 'positive'
if 'neg' in output: return 'negative'
if 'neu' in output: return 'neutral'
    return 'unknown'
- ✅ Use prompt engineering to enforce exact format — exam answer (TDS Q31)
- ❌ Manually edit each response → not scalable
- ❌ Ignore inconsistencies → breaks downstream analysis
- ❌ Switch LLMs for each review → still inconsistent
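Quick sanity check of the normalize fallback above (illustrative calls and expected outputs):

print(normalize("NEGATIVE."))   # -> 'negative'
print(normalize(" Positive "))  # -> 'positive'
print(normalize("maybe good"))  # -> 'unknown'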
15. RAG — Overview#
What is RAG?#
- Retrieval Augmented Generation
- Combines retrieval of relevant documents with LLM generation
- Grounds LLM in real, up-to-date knowledge
- Solves: hallucination + knowledge cutoff problems
RAG vs Pure LLM:#
| Aspect | Pure LLM | RAG |
|---|---|---|
| Knowledge | Training data only | External knowledge base |
| Accuracy | May hallucinate | Grounded in real docs |
| Updatable | Needs retraining | Update knowledge base |
16. RAG Pipeline — Process Flow#
Correct Flow — Exam Answer:#
Student Query
↓
Vectorize Query ← convert to embedding
↓
Retrieve Content from ← similarity search
Vector Database
↓
Pass Retrieved Content ← add as context
to LLM
↓
LLM Generates ← context-aware response ✅
Context-Aware Response
- ✅ Query → Vectorize → Retrieve from vector DB → Pass to LLM → Response — exam answer (JAN_FN Q314, JAN_AN Q433)
- ❌ Student Query → LLM Direct Answer → Response Sent Back → skips retrieval
- ❌ Student Query → Chunk Course Material → Vectorize Query → wrong order
- ❌ Student Query → Rule-Based System → Pre-Written Response → not RAG
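A minimal sketch of this flow in Python; embed, vector_db.search, and llm are placeholders for a real embedding model, vector store, and LLM client:

def answer_query(query, vector_db, embed, llm, k=3):
    # 1. Vectorize the query with the SAME embedding model used at index time
    query_vector = embed(query)
    # 2. Retrieve the k most similar chunks from the vector database
    chunks = vector_db.search(query_vector, k=k)
    # 3. Pass retrieved content to the LLM as context
    context = "\n\n".join(chunks)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    # 4. LLM generates a context-aware response
    return llm(prompt)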
17. RAG Chunking — Strategy#
Why Chunk?#
- LLMs have limited context windows
- Smaller chunks → more precise matching
- Better retrieval accuracy with focused chunks
Chunk Size — Exam Answer:#
| Size | Trade-off | Verdict |
|---|---|---|
| Very small (1-2 sentences) | Too little context | ❌ |
| Medium (200-500 words) | Balances context + focus | ✅ customer support |
| Very large (entire documents) | Too much irrelevant content | ❌ |
| Random sizes | Inconsistent retrieval | ❌ |
- ✅ Medium chunks (1-2 paragraphs, ~200-500 words) — exam answer (TDS Q36)
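A simple word-count chunker along these lines (the 350-word target and 50-word overlap are illustrative choices within the 200-500 word range):

def chunk_text(text, chunk_size=350, overlap=50):
    # ~200-500 word chunks; overlap preserves context across boundaries
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks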
18. RAG Stale Chunks — Exam Answer#
Problem:#
Document v1 indexed → answer from v1
Document v2 released → old chunks still in DB
→ Chatbot returns outdated information
Most Likely Cause:#
- ✅ Old document chunks remain in vector DB and weren’t updated — exam answer (TDS Q35)
- ❌ LLM relies on outdated training data → LLM uses retrieved context, not training
- ❌ Wrong chunking strategy → different problem
- ❌ Embedding model too small → different problem
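The fix is to re-index on document updates: delete the stale chunks, then embed and insert the new version. A sketch against a hypothetical vector-store client (store.delete_where and store.add are illustrative names, not a specific library's API), reusing the chunk_text helper from section 17:

def reindex_document(store, embed, doc_id, new_text):
    # Remove stale chunks for this document before inserting the new version
    store.delete_where({"doc_id": doc_id})
    for i, chunk in enumerate(chunk_text(new_text)):
        store.add(
            id=f"{doc_id}-{i}",
            vector=embed(chunk),
            metadata={"doc_id": doc_id},
            text=chunk,
        )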
19. Vector Databases#
What are They?#
- Specialized databases for storing and searching embeddings
- Find semantically similar content via vector similarity
Options:#
| Database | Type | Best For |
|---|---|---|
| FAISS | Library | Local, fast prototyping |
| Weaviate | Full database | Production |
| Pinecone | Managed service | Scalable |
| Chroma | Open source | Simple local use |
- ✅ FAISS / Weaviate — exam answer (JAN_FN Q314)
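A minimal FAISS sketch (real FAISS calls; the random vectors stand in for actual embeddings):

import faiss
import numpy as np

dim = 384                       # embedding dimensionality
index = faiss.IndexFlatL2(dim)  # exact L2 (Euclidean) similarity search

chunk_vectors = np.random.rand(1000, dim).astype('float32')  # stand-in embeddings
index.add(chunk_vectors)

query_vector = np.random.rand(1, dim).astype('float32')
distances, indices = index.search(query_vector, 3)  # 3 nearest chunks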
20. RAG — LLM Uses Chunks as Context#
LLM uses retrieved chunks as context to generate response
Does NOT copy chunks verbatim
Does NOT ignore retrieved chunks
Does NOT store chunks for future queries
✅ LLM uses chunks as context to generate informed answer — exam answer (TDS Q37)
❌ LLM ignores chunks and uses training data → defeats purpose of RAG
❌ LLM copies text verbatim → not generation
❌ LLM stores chunks for future queries → not how it works
21. Multi-Modal RAG — Overview#
What is Multi-Modal RAG?#
- Processes multiple types of data:
- Text (papers, documents, FAQs)
- Code (GitHub repos, notebooks)
- Visual (figures, charts, diagrams)
- Data (experimental results, sensor data)
Primary Advantage:#
- ✅ Comprehensive understanding through integration of textual concepts, visual data, and computational methods — exam answer (May_FN Q389)
- ❌ Reduced computational complexity → false, it’s MORE complex
- ❌ Simplified architecture → false, more components
- ❌ Lower storage requirements → false, needs MORE storage
Process Flow:#
Research Query
↓
Multi-Modal Embedding (text + code + visual + data)
↓
Cross-Disciplinary Retrieval
↓
Concept Mapping
↓
Synthesized Research Insights ✅
- ✅ Multi-modal embedding → cross-disciplinary retrieval → concept mapping → synthesized insights — exam answer (May_FN Q392)
22. Pedagogically Effective LLM Response — Exam Pattern#
Scenario: Student asks how to calculate GWP or identify ORFs#
Wrong Responses:#
❌ "Here's the exact formula/code: [provides complete answer]"
→ Gives direct answer, no learning
❌ "GWP is determined by molecular structure..."
→ Provides information but no engagement
❌ "I cannot help. Please refer to textbook."
→ Unhelpful
Correct Response — Socratic:#
✅ "Let's break this down. What do you already know about
greenhouse gases? Have you considered how different
molecules might vary in their ability to trap heat?"
→ Guides student to discover answer themselves ✅
- ✅ Ask reflective/guiding questions → promote independent thinking — exam answer (JAN_FN Q313, JAN_AN Q432)
Quick Reference#
System Prompt:
✅ Defines tone, behavior, knowledge domain, guidelines
✅ Most impactful: knowledge domain + expertise level
❌ NOT fixed predetermined responses
❌ NOT replacement for human guidance
Prompt Specificity:
✅ Include: count + timeframe + aspects + constraints
❌ Vague prompts → generic useless responses
Token Costs:
✅ Input tokens + output tokens + complexity + context window
❌ Time of day, academic level → don't affect cost
LLM Limitations:
❌ No real-time data access
❌ No post-cutoff knowledge
❌ May hallucinate
Output Consistency:
✅ Prompt engineering to enforce exact format
❌ Manual editing → not scalable
RAG Pipeline:
Query → Vectorize → Retrieve → Pass to LLM → Response ✅
RAG Chunking:
✅ Medium chunks 200-500 words
❌ Very small → too little context
❌ Very large → too much irrelevant content
Stale chunks:
✅ Old chunks in vector DB not updated
LLM + RAG:
LLM uses chunks as CONTEXT ✅
NOT copied verbatim ❌
NOT from training data ❌
Multi-modal RAG:
✅ Text + code + visual + data integration
✅ Comprehensive cross-disciplinary understanding