14. LLMs, Prompt Engineering & RAG#

1. LLM System Prompt — Overview#

What is a System Prompt?#

  • Instructions given to LLM before user interaction begins
  • Defines model’s tone, behavior, response style, knowledge domain
  • Most impactful aspect = knowledge domain + expertise level

Purpose:#

  • ✅ Defines core communication strategy, learning approach, interaction guidelines — exam answer (JAN_AN Q430)
  • ❌ Does NOT create fixed predetermined responses
  • ❌ Does NOT prevent model from understanding context
  • ❌ Does NOT replace human instructor guidance completely
  • ❌ Does NOT limit AI’s ability to understand complex concepts
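
A minimal sketch of how a system prompt is supplied in practice, here via the Anthropic Messages API (model name and prompt text are illustrative):

import os
import anthropic

client = anthropic.Anthropic(api_key=os.getenv('ANTHROPIC_API_KEY'))

# The system prompt fixes domain + expertise level before any user turn
response = client.messages.create(
    model="claude-3-sonnet-20240229",
    max_tokens=200,
    system="You are a climate science tutor for undergraduates.",  # system prompt
    messages=[{"role": "user", "content": "Why do some gases trap more heat?"}]
)
print(response.content[0].text)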

2. System Prompt — Knowledge Domain Impact#

  • The role/persona the system prompt instructs the LLM to adopt is what most impacts its responses
  • ✅ Knowledge domain + expertise level — exam answer (JAN_AN Q411)
  • ❌ Character limit set by system prompt → not most impactful
  • ❌ Formatting requirements → secondary concern
  • ❌ Language model version → not defined in system prompt

3. System Prompt — What It Does NOT Do#

❌ Creates fixed predetermined responses for every query
❌ Prevents deviation from pre-determined responses
❌ Configures LLM to only respond to specific commands
❌ Replaces human instructor guidance completely

4. Socratic System Prompt — Educational Use#

Best System Prompt for Educational Assistant:#

✅ "You are an interactive learning assistant for climate science.
Guide students through complex concepts by asking reflective
questions. Avoid giving direct solutions. Encourage independent
thinking and help students develop problem-solving skills."
  • ✅ Balancing information delivery with Socratic questioning — exam answer (May_FN Q371)
  • ❌ “Provide direct answers to maximize efficiency” → defeats educational purpose
  • ❌ “Limiting responses to prevent information overload” → too restrictive
  • ❌ “Using technical jargon to maintain academic rigor” → not pedagogically effective

5. Direct vs Socratic Response Style:#

| Style | When to Use |
| --- | --- |
| Socratic | Education, foster critical thinking ✅ |
| Direct | Production systems, efficiency needed |
| Balanced | General-purpose use |

6. LLM POST Request — API Inference#

  • LLM APIs use POST requests for inference
  • Request body contains: model, messages, max_tokens
import requests, os

# Inference = POST: the review text travels in the JSON request body
response = requests.post(
    'https://api.anthropic.com/v1/messages',
    headers={
        'x-api-key': os.getenv('ANTHROPIC_API_KEY'),
        'anthropic-version': '2023-06-01',   # required API versioning header
    },
    json={
        'model': 'claude-3-sonnet-20240229',
        'max_tokens': 100,
        'messages': [{'role': 'user', 'content': 'Classify: Great product!'}]
    }
)
result = response.json()['content'][0]['text']
  • ✅ POST request with review text in request body — exam answer (TDS Q28)
  • ❌ GET request → only retrieves, can’t send body
  • ❌ DELETE → removes resource
  • ❌ PUT → updates existing resource

7. Prompt Specificity — Key Principle#

Vague vs Specific:#

❌ "Tell me about solar energy"
❌ "Discuss solar energy advancements"
❌ "Write a paragraph about solar energy"

✅ "Outline 4 key solar energy technological breakthroughs
    from 2014-2024, including specific efficiency improvements
    and implementation challenges"
  • ✅ Specific prompt with: count + timeframe + aspects + constraints — exam answer
  • More specific = more precise and useful response

8. Specify Output Format in Prompt:#

❌ "Classify this review"
→ Output: "The sentiment is Negative."

✅ "Classify sentiment as exactly one word in lowercase:
    positive, negative, or neutral. Review: {text}"
→ Output: "negative"

9. Few-Shot Prompting:#

"Classify customer tickets:

Example 1:
Input: 'App keeps crashing'
Output: technical

Example 2:
Input: 'Wrong charge on my card'
Output: billing

Now classify:
Input: '{ticket_text}'
Output:"

10. Chain of Thought Prompting:#

"Solve this step by step:
1. First identify the main issue
2. Consider possible causes
3. Propose solution

Problem: {problem}"

11. LLM Token Costs — What Matters#

Factors that Impact Cost:#

  • ✅ Token count of input prompts — exam answer
  • ✅ Token count of generated responses — exam answer
  • ✅ Query complexity requiring deeper reasoning — exam answer
  • ✅ Context window utilization for multi-turn conversations — exam answer
  • ❌ Time of day when queries submitted → does NOT affect cost
  • ❌ Student’s academic level → does NOT affect cost
  • ❌ Fixed regardless of complexity → FALSE
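
A back-of-the-envelope cost model reflecting the factors above (the per-token prices are placeholders, not real pricing):

# Hypothetical prices, USD per 1K tokens (check your provider's price sheet)
PRICE_IN, PRICE_OUT = 0.003, 0.015

def estimate_cost(input_tokens, output_tokens):
    # Cost scales with BOTH input and output token counts
    return (input_tokens / 1000) * PRICE_IN + (output_tokens / 1000) * PRICE_OUT

# Multi-turn chats re-send the history, so input tokens grow each turn
print(estimate_cost(input_tokens=1200, output_tokens=300))   # 0.0081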

12. LLM Batch Processing — Production:#

import time, os
import anthropic

client = anthropic.Anthropic(api_key=os.getenv('ANTHROPIC_API_KEY'))

def analyze_reviews(reviews):
    results, cache = [], {}
    total_tokens = 0

    for review in reviews:
        if review in cache:                # cache results ✅ (duplicates cost nothing)
            results.append(cache[review])
            continue
        try:
            response = client.messages.create(
                model="claude-3-sonnet-20240229",
                max_tokens=50,
                messages=[{
                    "role": "user",
                    "content": f"Classify as positive/negative/neutral: {review}"
                }]
            )
            label = response.content[0].text.strip()
            total_tokens += response.usage.input_tokens + response.usage.output_tokens  # track costs ✅
        except anthropic.APIError:         # handle errors ✅ (skip, don't crash the batch)
            label = "error"
        cache[review] = label
        results.append(label)
        time.sleep(0.5)                    # rate limiting ✅

    return results, total_tokens
  • ✅ Process in batches, track costs, handle errors, cache results — exam answer (TDS Q30)
  • ❌ Send all 500 in single API call → context limit exceeded
  • ❌ Only analyze 5 to save money → insufficient coverage
  • ❌ Call API repeatedly for same review → wasteful

13. LLM Capabilities & Limitations#

What LLMs CAN Do:#

✅ Generate coherent contextual text
✅ Classify sentiment, categories
✅ Summarize documents
✅ Extract information from text
✅ Answer questions based on context
✅ Write and explain code
✅ Evaluate statistical validity when prompted well
✅ Follow formatting instructions

What LLMs CANNOT Do:#

❌ Access real-time internet (unless tool-enabled)
❌ Access live databases or patient records
❌ Cite papers published after training cutoff
❌ Guarantee factually correct information (hallucination)
❌ Guarantee consistent output format without prompting
❌ Remember previous conversations (stateless by default)
  • ✅ LLMs may generate plausible-sounding but incorrect medical info — exam answer (JAN_AN Q412)
  • ✅ Quality of response depends on prompt specificity — exam answer
  • ✅ Can evaluate statistical validity when properly prompted — exam answer
  • ❌ Can directly access post-cutoff studies → false
  • ❌ Have real-time patient data access → false
  • ❌ Quality unaffected by vague prompts → false

14. LLM Output Consistency — Enforcement#

Problem:#

LLM outputs:
"negative"    ← correct format
"Negative"    ← wrong case
"neg"         ← wrong format
"NEGATIVE."   ← wrong case + punctuation

Solution 1 — Prompt Engineering (Best):#

prompt = """
Respond with EXACTLY one word in lowercase: positive, negative, or neutral.
No punctuation. No explanation.
Review: {text}
"""

Solution 2 — Post-Processing (Backup):#

def normalize(output):
    output = output.strip().lower()
    if 'pos' in output: return 'positive'
    if 'neg' in output: return 'negative'
    if 'neu' in output: return 'neutral'
    return 'unknown'
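
Quick sanity check of the fallback normalizer:

print(normalize("NEGATIVE."))   # 'negative'
print(normalize(" Positive "))  # 'positive'
print(normalize("maybe good"))  # 'unknown'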
  • ✅ Use prompt engineering to enforce exact format — exam answer (TDS Q31)
  • ❌ Manually edit each response → not scalable
  • ❌ Ignore inconsistencies → breaks downstream analysis
  • ❌ Switch LLMs for each review → still inconsistent

15. RAG — Overview#

What is RAG?#

  • Retrieval Augmented Generation
  • Combines retrieval of relevant documents with LLM generation
  • Grounds LLM in real, up-to-date knowledge
  • Solves: hallucination + knowledge cutoff problems

RAG vs Pure LLM:#

| Aspect | Pure LLM | RAG |
| --- | --- | --- |
| Knowledge | Training data only | External knowledge base |
| Accuracy | May hallucinate | Grounded in real docs |
| Updatable | Needs retraining | Update knowledge base |

16. RAG Pipeline — Process Flow#

Correct Flow — Exam Answer:#

Student Query
     ↓
Vectorize Query          ← convert to embedding
     ↓
Retrieve Content from    ← similarity search
Vector Database
     ↓
Pass Retrieved Content   ← add as context
to LLM
     ↓
LLM Generates            ← context-aware response ✅
Context-Aware Response
  • ✅ Query → Vectorize → Retrieve from vector DB → Pass to LLM → Response — exam answer (JAN_FN Q314, JAN_AN Q433)
  • ❌ Student Query → LLM Direct Answer → Response Sent Back → skips retrieval
  • ❌ Student Query → Chunk Course Material → Vectorize Query → wrong order
  • ❌ Student Query → Rule-Based System → Pre-Written Response → not RAG
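
A minimal end-to-end sketch of this flow using cosine similarity over toy embeddings (embed() is a stand-in for a real embedding model, and the final LLM call is left commented):

import numpy as np

def embed(text):
    # Stand-in embedding: a real pipeline uses a model like sentence-transformers
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

chunks = ["Notes on radiative forcing ...", "FAQ: assignment deadlines ..."]
chunk_vecs = np.array([embed(c) for c in chunks])

def retrieve(query, k=1):
    q = embed(query)                                        # 1. vectorize query
    sims = chunk_vecs @ q / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q)
    )
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]  # 2. similarity search

query = "When is assignment 2 due?"
context = "\n".join(retrieve(query))                        # 3. retrieved content
prompt = f"Answer using this context:\n{context}\n\nQuestion: {query}"  # 4. to LLM
# response = client.messages.create(...)                    # 5. context-aware answer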

17. RAG Chunking — Strategy#

Why Chunk?#

  • LLMs have limited context windows
  • Smaller chunks → more precise matching
  • Better retrieval accuracy with focused chunks

Chunk Size — Exam Answer:#

| Size | Trade-off | Best For |
| --- | --- | --- |
| Very small (1-2 sentences) | Too little context | |
| Medium (200-500 words) | Balance of context + focus | ✅ customer support |
| Very large (entire documents) | Too much irrelevant content | |
| Random sizes | Inconsistent retrieval | |
  • ✅ Medium chunks (1-2 paragraphs, ~200-500 words) — exam answer (TDS Q36)
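
A minimal word-count chunker targeting the medium range above (the 300-word window and 50-word overlap are illustrative choices):

def chunk_text(text, target_words=300, overlap=50):
    # Slide a ~300-word window with overlap so ideas aren't cut mid-thought
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + target_words]))
        start += target_words - overlap
    return chunks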

18. RAG Stale Chunks — Exam Answer#

Problem:#

Document v1 indexed → answer from v1
Document v2 released → old chunks still in DB
→ Chatbot returns outdated information

Most Likely Cause:#

  • ✅ Old document chunks remain in vector DB and weren’t updated — exam answer (TDS Q35)
  • ❌ LLM relies on outdated training data → LLM uses retrieved context, not training
  • ❌ Wrong chunking strategy → different problem
  • ❌ Embedding model too small → different problem
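
One way to prevent stale chunks: key chunks by document ID so re-indexing a new version replaces the old ones (a dict-based sketch reusing embed() and chunk_text() from the sketches above; a real vector DB would delete/upsert by ID):

chunk_store = {}   # doc_id → list of (chunk_text, embedding)

def index_document(doc_id, text):
    # Overwriting the doc_id entry drops ALL v1 chunks when v2 is indexed
    chunk_store[doc_id] = [(c, embed(c)) for c in chunk_text(text)]

handbook_v1 = "Refunds accepted within 14 days ..."   # old version (illustrative)
handbook_v2 = "Refunds accepted within 30 days ..."   # updated version

index_document("handbook", handbook_v1)   # v1 indexed
index_document("handbook", handbook_v2)   # v2 replaces v1: no stale chunks left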

19. Vector Databases#

What are They?#

  • Specialized databases for storing and searching embeddings
  • Find semantically similar content via vector similarity

Options:#

| Database | Type | Best For |
| --- | --- | --- |
| FAISS | Library | Local, fast prototyping |
| Weaviate | Full database | Production |
| Pinecone | Managed service | Scalable |
| Chroma | Open source | Simple local use |
  • ✅ FAISS / Weaviate — exam answer (JAN_FN Q314)
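
A minimal FAISS example (the index type, 384-dim vectors, and random data are illustrative):

import numpy as np
import faiss

dim = 384
chunk_vectors = np.random.rand(1000, dim).astype('float32')   # one row per chunk

index = faiss.IndexFlatL2(dim)   # exact L2 search, no training step needed
index.add(chunk_vectors)         # store all chunk embeddings

query = np.random.rand(1, dim).astype('float32')
distances, ids = index.search(query, 5)   # top-5 nearest chunks
print(ids[0])                              # row indices into your chunk list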

20. RAG — LLM Uses Chunks as Context#

  • LLM uses retrieved chunks as context to generate the response
  • Does NOT copy chunks verbatim
  • Does NOT ignore retrieved chunks
  • Does NOT store chunks for future queries
  • ✅ LLM uses chunks as context to generate informed answer — exam answer (TDS Q37)
  • ❌ LLM ignores chunks and uses training data → defeats purpose of RAG
  • ❌ LLM copies text verbatim → not generation
  • ❌ LLM stores chunks for future queries → not how it works
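
A sketch of how retrieved chunks become context in the final prompt (template wording is illustrative):

def build_rag_prompt(chunks, question):
    context = "\n\n".join(chunks)   # retrieved chunks go in as context
    return (
        "Answer the question using ONLY the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# The LLM generates a NEW answer informed by this context;
# it does not echo the chunks back verbatim.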


21. Multi-Modal RAG — Overview#

What is Multi-Modal RAG?#

  • Processes multiple types of data:
    • Text (papers, documents, FAQs)
    • Code (GitHub repos, notebooks)
    • Visual (figures, charts, diagrams)
    • Data (experimental results, sensor data)

Primary Advantage:#

  • ✅ Comprehensive understanding through integration of textual concepts, visual data, and computational methods — exam answer (May_FN Q389)
  • ❌ Reduced computational complexity → false, it’s MORE complex
  • ❌ Simplified architecture → false, more components
  • ❌ Lower storage requirements → false, needs MORE storage

Process Flow:#

Research Query
     ↓
Multi-Modal Embedding (text + code + visual + data)
     ↓
Cross-Disciplinary Retrieval
     ↓
Concept Mapping
     ↓
Synthesized Research Insights ✅
  • ✅ Multi-modal embedding → cross-disciplinary retrieval → concept mapping → synthesized insights — exam answer (May_FN Q392)

Pedagogically Effective LLM Response — Exam Pattern#

Scenario: Student asks how to calculate GWP or identify ORFs#

Wrong Responses:#

❌ "Here's the exact formula/code: [provides complete answer]"
   → Gives direct answer, no learning

❌ "GWP is determined by molecular structure..."
   → Provides information but no engagement

❌ "I cannot help. Please refer to textbook."
   → Unhelpful

Correct Response — Socratic:#

✅ "Let's break this down. What do you already know about
    greenhouse gases? Have you considered how different
    molecules might vary in their ability to trap heat?"
→ Guides student to discover answer themselves ✅
  • ✅ Ask reflective/guiding questions → promote independent thinking — exam answer (JAN_FN Q313, JAN_AN Q432)

Quick Reference#

System Prompt:
  ✅ Defines tone, behavior, knowledge domain, guidelines
  ✅ Most impactful: knowledge domain + expertise level
  ❌ NOT fixed predetermined responses
  ❌ NOT replacement for human guidance

Prompt Specificity:
  ✅ Include: count + timeframe + aspects + constraints
  ❌ Vague prompts → generic useless responses

Token Costs:
  ✅ Input tokens + output tokens + complexity + context window
  ❌ Time of day, academic level → don't affect cost

LLM Limitations:
  ❌ No real-time data access
  ❌ No post-cutoff knowledge
  ❌ May hallucinate

Output Consistency:
  ✅ Prompt engineering to enforce exact format
  ❌ Manual editing → not scalable

RAG Pipeline:
  Query → Vectorize → Retrieve → Pass to LLM → Response ✅

RAG Chunking:
  ✅ Medium chunks 200-500 words
  ❌ Very small → too little context
  ❌ Very large → too much irrelevant content

Stale chunks:
  ✅ Old chunks in vector DB not updated

LLM + RAG:
  LLM uses chunks as CONTEXT ✅
  NOT copied verbatim ❌
  NOT from training data ❌

Multi-modal RAG:
  ✅ Text + code + visual + data integration
  ✅ Comprehensive cross-disciplinary understanding