14. LLMs, Prompt Engineering & RAG#

1. LLM System Prompt — Overview#

What is a System Prompt?#

  • Instructions given to LLM before user interaction begins
  • Defines model’s tone, behavior, response style, knowledge domain
  • Most impactful aspect = knowledge domain + expertise level

Purpose:#

  • ✅ Defines core communication strategy, learning approach, interaction guidelines — exam answer (JAN_AN Q430)
  • ❌ Does NOT create fixed predetermined responses
  • ❌ Does NOT prevent model from understanding context
  • ❌ Does NOT replace human instructor guidance completely
  • ❌ Does NOT limit AI’s ability to understand complex concepts
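
A minimal sketch of how a system prompt is supplied in practice, here via the Anthropic Messages API (model name and prompt text are illustrative):

import os
import anthropic

client = anthropic.Anthropic(api_key=os.getenv('ANTHROPIC_API_KEY'))

# The system prompt fixes domain + expertise level before any user turn
response = client.messages.create(
    model="claude-3-sonnet-20240229",
    max_tokens=200,
    system="You are a climate science tutor for undergraduates.",  # system prompt
    messages=[{"role": "user", "content": "Why do some gases trap more heat?"}]
)
print(response.content[0].text)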

2. System Prompt — Knowledge Domain Impact#

  • The role/persona the system prompt instructs the LLM to adopt is what most impacts its responses
  • ✅ Knowledge domain + expertise level — exam answer (JAN_AN Q411)
  • ❌ Character limit set by system prompt → not most impactful
  • ❌ Formatting requirements → secondary concern
  • ❌ Language model version → not defined in system prompt

3. System Prompt — What It Does NOT Do#

❌ Creates fixed predetermined responses for every query
❌ Prevents deviation from pre-determined responses
❌ Configures LLM to only respond to specific commands
❌ Replaces human instructor guidance completely

4. Socratic System Prompt — Educational Use#

Best System Prompt for Educational Assistant:#

✅ "You are an interactive learning assistant for climate science.
Guide students through complex concepts by asking reflective
questions. Avoid giving direct solutions. Encourage independent
thinking and help students develop problem-solving skills."
  • ✅ Balancing information delivery with Socratic questioning — exam answer (May_FN Q371)
  • ❌ “Provide direct answers to maximize efficiency” → defeats educational purpose
  • ❌ “Limiting responses to prevent information overload” → too restrictive
  • ❌ “Using technical jargon to maintain academic rigor” → not pedagogically effective

5. Direct vs Socratic Response Style:#

| Style | When to Use |
| --- | --- |
| Socratic | Education, foster critical thinking ✅ |
| Direct | Production systems, efficiency needed |
| Balanced | General-purpose use |

6. LLM POST Request — API Inference#

  • LLM APIs use POST requests for inference
  • Request body contains: model, messages, max_tokens
import requests, os

# Inference = POST: the review text travels in the JSON request body
response = requests.post(
    'https://api.anthropic.com/v1/messages',
    headers={
        'x-api-key': os.getenv('ANTHROPIC_API_KEY'),
        'anthropic-version': '2023-06-01',   # required API versioning header
    },
    json={
        'model': 'claude-3-sonnet-20240229',
        'max_tokens': 100,
        'messages': [{'role': 'user', 'content': 'Classify: Great product!'}]
    }
)
result = response.json()['content'][0]['text']
  • ✅ POST request with review text in request body — exam answer (TDS Q28)
  • ❌ GET request → only retrieves, can’t send body
  • ❌ DELETE → removes resource
  • ❌ PUT → updates existing resource

7. Prompt Specificity — Key Principle#

Vague vs Specific:#

❌ "Tell me about solar energy"
❌ "Discuss solar energy advancements"
❌ "Write a paragraph about solar energy"

✅ "Outline 4 key solar energy technological breakthroughs
    from 2014-2024, including specific efficiency improvements
    and implementation challenges"
  • ✅ Specific prompt with: count + timeframe + aspects + constraints — exam answer
  • More specific = more precise and useful response

8. Specify Output Format in Prompt:#

❌ "Classify this review"
→ Output: "The sentiment is Negative."

✅ "Classify sentiment as exactly one word in lowercase:
    positive, negative, or neutral. Review: {text}"
→ Output: "negative"

9. Few-Shot Prompting:#

"Classify customer tickets:

Example 1:
Input: 'App keeps crashing'
Output: technical

Example 2:
Input: 'Wrong charge on my card'
Output: billing

Now classify:
Input: '{ticket_text}'
Output:"

10. Chain of Thought Prompting:#

"Solve this step by step:
1. First identify the main issue
2. Consider possible causes
3. Propose solution

Problem: {problem}"

11. LLM Token Costs — What Matters#

Factors that Impact Cost:#

  • ✅ Token count of input prompts — exam answer
  • ✅ Token count of generated responses — exam answer
  • ✅ Query complexity requiring deeper reasoning — exam answer
  • ✅ Context window utilization for multi-turn conversations — exam answer
  • ❌ Time of day when queries submitted → does NOT affect cost
  • ❌ Student’s academic level → does NOT affect cost
  • ❌ Fixed regardless of complexity → FALSE
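
A back-of-the-envelope cost model reflecting the factors above (the per-token prices are placeholders, not real pricing):

# Hypothetical prices, USD per 1K tokens (check your provider's price sheet)
PRICE_IN, PRICE_OUT = 0.003, 0.015

def estimate_cost(input_tokens, output_tokens):
    # Cost scales with BOTH input and output token counts
    return (input_tokens / 1000) * PRICE_IN + (output_tokens / 1000) * PRICE_OUT

# Multi-turn chats re-send the history, so input tokens grow each turn
print(estimate_cost(input_tokens=1200, output_tokens=300))   # 0.0081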

12. LLM Batch Processing — Production:#

import time, os
import anthropic

client = anthropic.Anthropic(api_key=os.getenv('ANTHROPIC_API_KEY'))

def analyze_reviews(reviews):
    results, cache = [], {}
    total_tokens = 0

    for review in reviews:
        if review in cache:                # cache results ✅ (duplicates cost nothing)
            results.append(cache[review])
            continue
        try:
            response = client.messages.create(
                model="claude-3-sonnet-20240229",
                max_tokens=50,
                messages=[{
                    "role": "user",
                    "content": f"Classify as positive/negative/neutral: {review}"
                }]
            )
            label = response.content[0].text.strip()
            total_tokens += response.usage.input_tokens + response.usage.output_tokens  # track costs ✅
        except anthropic.APIError:         # handle errors ✅ (skip, don't crash the batch)
            label = "error"
        cache[review] = label
        results.append(label)
        time.sleep(0.5)                    # rate limiting ✅

    return results, total_tokens
  • ✅ Process in batches, track costs, handle errors, cache results — exam answer (TDS Q30)
  • ❌ Send all 500 in single API call → context limit exceeded
  • ❌ Only analyze 5 to save money → insufficient coverage
  • ❌ Call API repeatedly for same review → wasteful

13. LLM Capabilities & Limitations#

What LLMs CAN Do:#

✅ Generate coherent contextual text
✅ Classify sentiment, categories
✅ Summarize documents
✅ Extract information from text
✅ Answer questions based on context
✅ Write and explain code
✅ Evaluate statistical validity when prompted well
✅ Follow formatting instructions

What LLMs CANNOT Do:#

❌ Access real-time internet (unless tool-enabled)
❌ Access live databases or patient records
❌ Cite papers published after training cutoff
❌ Guarantee factually correct information (hallucination)
❌ Guarantee consistent output format without prompting
❌ Remember previous conversations (stateless by default)
  • ✅ LLMs may generate plausible-sounding but incorrect medical info — exam answer (JAN_AN Q412)
  • ✅ Quality of response depends on prompt specificity — exam answer
  • ✅ Can evaluate statistical validity when properly prompted — exam answer
  • ❌ Can directly access post-cutoff studies → false
  • ❌ Have real-time patient data access → false
  • ❌ Quality unaffected by vague prompts → false

14. LLM Output Consistency — Enforcement#

Problem:#

LLM outputs:
"negative"    ← correct format
"Negative"    ← wrong case
"neg"         ← wrong format
"NEGATIVE."   ← wrong case + punctuation

Solution 1 — Prompt Engineering (Best):#

prompt = """
Respond with EXACTLY one word in lowercase: positive, negative, or neutral.
No punctuation. No explanation.
Review: {text}
"""

Solution 2 — Post-Processing (Backup):#

def normalize(output):
    output = output.strip().lower()
    if 'pos' in output: return 'positive'
    if 'neg' in output: return 'negative'
    if 'neu' in output: return 'neutral'
    return 'unknown'
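
Quick sanity check of the fallback normalizer:

print(normalize("NEGATIVE."))   # 'negative'
print(normalize(" Positive "))  # 'positive'
print(normalize("maybe good"))  # 'unknown'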
  • ✅ Use prompt engineering to enforce exact format — exam answer (TDS Q31)
  • ❌ Manually edit each response → not scalable
  • ❌ Ignore inconsistencies → breaks downstream analysis
  • ❌ Switch LLMs for each review → still inconsistent

15. RAG — Overview#

What is RAG?#

  • Retrieval Augmented Generation
  • Combines retrieval of relevant documents with LLM generation
  • Grounds LLM in real, up-to-date knowledge
  • Solves: hallucination + knowledge cutoff problems

RAG vs Pure LLM:#

| Aspect | Pure LLM | RAG |
| --- | --- | --- |
| Knowledge | Training data only | External knowledge base |
| Accuracy | May hallucinate | Grounded in real docs |
| Updatable | Needs retraining | Update knowledge base |

16. RAG Pipeline — Process Flow#

Correct Flow — Exam Answer:#

Student Query
     ↓
Vectorize Query          ← convert to embedding
     ↓
Retrieve Content from    ← similarity search
Vector Database
     ↓
Pass Retrieved Content   ← add as context
to LLM
     ↓
LLM Generates            ← context-aware response ✅
Context-Aware Response
  • ✅ Query → Vectorize → Retrieve from vector DB → Pass to LLM → Response — exam answer (JAN_FN Q314, JAN_AN Q433)
  • ❌ Student Query → LLM Direct Answer → Response Sent Back → skips retrieval
  • ❌ Student Query → Chunk Course Material → Vectorize Query → wrong order
  • ❌ Student Query → Rule-Based System → Pre-Written Response → not RAG
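
A minimal end-to-end sketch of this flow using cosine similarity over toy embeddings (embed() is a stand-in for a real embedding model, and the final LLM call is left commented):

import numpy as np

def embed(text):
    # Stand-in embedding: a real pipeline uses a model like sentence-transformers
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

chunks = ["Notes on radiative forcing ...", "FAQ: assignment deadlines ..."]
chunk_vecs = np.array([embed(c) for c in chunks])

def retrieve(query, k=1):
    q = embed(query)                                        # 1. vectorize query
    sims = chunk_vecs @ q / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q)
    )
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]  # 2. similarity search

query = "When is assignment 2 due?"
context = "\n".join(retrieve(query))                        # 3. retrieved content
prompt = f"Answer using this context:\n{context}\n\nQuestion: {query}"  # 4. to LLM
# response = client.messages.create(...)                    # 5. context-aware answer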

17. RAG Chunking — Strategy#

Why Chunk?#

  • LLMs have limited context windows
  • Smaller chunks → more precise matching
  • Better retrieval accuracy with focused chunks

Chunk Size — Exam Answer:#

| Size | Trade-off | Best For |
| --- | --- | --- |
| Very small (1-2 sentences) | Too little context | |
| Medium (200-500 words) | Balance of context + focus | ✅ customer support |
| Very large (entire documents) | Too much irrelevant content | |
| Random sizes | Inconsistent retrieval | |
  • ✅ Medium chunks (1-2 paragraphs, ~200-500 words) — exam answer (TDS Q36)
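
A minimal word-count chunker targeting the medium range above (the 300-word window and 50-word overlap are illustrative choices):

def chunk_text(text, target_words=300, overlap=50):
    # Slide a ~300-word window with overlap so ideas aren't cut mid-thought
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + target_words]))
        start += target_words - overlap
    return chunks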

18. RAG Stale Chunks — Exam Answer#

Problem:#

Document v1 indexed → answer from v1
Document v2 released → old chunks still in DB
→ Chatbot returns outdated information

Most Likely Cause:#

  • ✅ Old document chunks remain in vector DB and weren’t updated — exam answer (TDS Q35)
  • ❌ LLM relies on outdated training data → LLM uses retrieved context, not training
  • ❌ Wrong chunking strategy → different problem
  • ❌ Embedding model too small → different problem
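
One way to prevent stale chunks: key chunks by document ID so re-indexing a new version replaces the old ones (a dict-based sketch reusing embed() and chunk_text() from the sketches above; a real vector DB would delete/upsert by ID):

chunk_store = {}   # doc_id → list of (chunk_text, embedding)

def index_document(doc_id, text):
    # Overwriting the doc_id entry drops ALL v1 chunks when v2 is indexed
    chunk_store[doc_id] = [(c, embed(c)) for c in chunk_text(text)]

handbook_v1 = "Refunds accepted within 14 days ..."   # old version (illustrative)
handbook_v2 = "Refunds accepted within 30 days ..."   # updated version

index_document("handbook", handbook_v1)   # v1 indexed
index_document("handbook", handbook_v2)   # v2 replaces v1: no stale chunks left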

19. Vector Databases#

What are They?#

  • Specialized databases for storing and searching embeddings
  • Find semantically similar content via vector similarity

Options:#

| Database | Type | Best For |
| --- | --- | --- |
| FAISS | Library | Local, fast prototyping |
| Weaviate | Full database | Production |
| Pinecone | Managed service | Scalable |
| Chroma | Open source | Simple local use |
  • ✅ FAISS / Weaviate — exam answer (JAN_FN Q314)
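
A minimal FAISS example (the index type, 384-dim vectors, and random data are illustrative):

import numpy as np
import faiss

dim = 384
chunk_vectors = np.random.rand(1000, dim).astype('float32')   # one row per chunk

index = faiss.IndexFlatL2(dim)   # exact L2 search, no training step needed
index.add(chunk_vectors)         # store all chunk embeddings

query = np.random.rand(1, dim).astype('float32')
distances, ids = index.search(query, 5)   # top-5 nearest chunks
print(ids[0])                              # row indices into your chunk list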

20. RAG — LLM Uses Chunks as Context#

  • LLM uses retrieved chunks as context to generate the response
  • Does NOT copy chunks verbatim
  • Does NOT ignore retrieved chunks
  • Does NOT store chunks for future queries
  • ✅ LLM uses chunks as context to generate informed answer — exam answer (TDS Q37)
  • ❌ LLM ignores chunks and uses training data → defeats purpose of RAG
  • ❌ LLM copies text verbatim → not generation
  • ❌ LLM stores chunks for future queries → not how it works
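
A sketch of how retrieved chunks become context in the final prompt (template wording is illustrative):

def build_rag_prompt(chunks, question):
    context = "\n\n".join(chunks)   # retrieved chunks go in as context
    return (
        "Answer the question using ONLY the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# The LLM generates a NEW answer informed by this context;
# it does not echo the chunks back verbatim.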


21. Multi-Modal RAG — Overview#

What is Multi-Modal RAG?#

  • Processes multiple types of data:
    • Text (papers, documents, FAQs)
    • Code (GitHub repos, notebooks)
    • Visual (figures, charts, diagrams)
    • Data (experimental results, sensor data)

Primary Advantage:#

  • ✅ Comprehensive understanding through integration of textual concepts, visual data, and computational methods — exam answer (May_FN Q389)
  • ❌ Reduced computational complexity → false, it’s MORE complex
  • ❌ Simplified architecture → false, more components
  • ❌ Lower storage requirements → false, needs MORE storage

Process Flow:#

Research Query
     ↓
Multi-Modal Embedding (text + code + visual + data)
     ↓
Cross-Disciplinary Retrieval
     ↓
Concept Mapping
     ↓
Synthesized Research Insights ✅
  • ✅ Multi-modal embedding → cross-disciplinary retrieval → concept mapping → synthesized insights — exam answer (May_FN Q392)

Pedagogically Effective LLM Response — Exam Pattern#

Scenario: Student asks how to calculate GWP or identify ORFs#

Wrong Responses:#

❌ "Here's the exact formula/code: [provides complete answer]"
   → Gives direct answer, no learning

❌ "GWP is determined by molecular structure..."
   → Provides information but no engagement

❌ "I cannot help. Please refer to textbook."
   → Unhelpful

Correct Response — Socratic:#

✅ "Let's break this down. What do you already know about
    greenhouse gases? Have you considered how different
    molecules might vary in their ability to trap heat?"
→ Guides student to discover answer themselves ✅
  • ✅ Ask reflective/guiding questions → promote independent thinking — exam answer (JAN_FN Q313, JAN_AN Q432)

Quick Reference#

System Prompt:
  ✅ Defines tone, behavior, knowledge domain, guidelines
  ✅ Most impactful: knowledge domain + expertise level
  ❌ NOT fixed predetermined responses
  ❌ NOT replacement for human guidance

Prompt Specificity:
  ✅ Include: count + timeframe + aspects + constraints
  ❌ Vague prompts → generic useless responses

Token Costs:
  ✅ Input tokens + output tokens + complexity + context window
  ❌ Time of day, academic level → don't affect cost

LLM Limitations:
  ❌ No real-time data access
  ❌ No post-cutoff knowledge
  ❌ May hallucinate

Output Consistency:
  ✅ Prompt engineering to enforce exact format
  ❌ Manual editing → not scalable

RAG Pipeline:
  Query → Vectorize → Retrieve → Pass to LLM → Response ✅

RAG Chunking:
  ✅ Medium chunks 200-500 words
  ❌ Very small → too little context
  ❌ Very large → too much irrelevant content

Stale chunks:
  ✅ Old chunks in vector DB not updated

LLM + RAG:
  LLM uses chunks as CONTEXT ✅
  NOT copied verbatim ❌
  NOT from training data ❌

Multi-modal RAG:
  ✅ Text + code + visual + data integration
  ✅ Comprehensive cross-disciplinary understanding