10. Apache Log Analysis#

1. Apache Log Format - Structure#

Standard Log Entry:#

IP RemoteUser AuthUser [Timestamp] "Method URL Protocol" Status Size "Referrer" "UserAgent" VHost ServerIP

Example Entry:#

203.0.113.7 - - [14/Dec/2024:16:45:11 -0500] "GET /index.html HTTP/1.1" 200 3500 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 15_0)" shop.com 192.168.1.100

Field Breakdown:#

| Field      | Example                      | Notes                   |
|------------|------------------------------|-------------------------|
| IP         | 203.0.113.7                  | Visitor IP              |
| RemoteUser | -                            | Typically -             |
| AuthUser   | -                            | Typically -             |
| Timestamp  | [14/Dec/2024:16:45:11 -0500] | Includes timezone       |
| Method     | GET                          | HTTP method             |
| URL        | /index.html                  | Path visited            |
| Protocol   | HTTP/1.1                     | HTTP version            |
| Status     | 200                          | HTTP status code        |
| Size       | 3500                         | Response bytes          |
| Referrer   | "-"                          | Where request came from |
| UserAgent  | "Mozilla/5.0..."             | Browser/device info     |
| VHost      | shop.com                     | Virtual host            |
| ServerIP   | 192.168.1.100                | Server IP               |
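As a quick sanity check on this structure, the example entry can be tokenized with `shlex.split`, which keeps quoted fields together. Note that the bracketed timestamp is *not* quoted, so it splits into two tokens:

```python
import shlex

line = ('203.0.113.7 - - [14/Dec/2024:16:45:11 -0500] '
        '"GET /index.html HTTP/1.1" 200 3500 "-" '
        '"Mozilla/5.0 (iPhone; CPU iPhone OS 15_0)" shop.com 192.168.1.100')

parts = shlex.split(line)
print(parts[5])            # GET /index.html HTTP/1.1  (quoted → one token)
print(parts[3], parts[4])  # [14/Dec/2024:16:45:11 -0500]  (bracketed → two tokens)
print(parts[6])            # 200
```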

2. Valid Log Entry Rules#

Valid Entry Must Have:#

  • ✅ Numeric HTTP status code (200, 404)
  • ✅ Valid HTTP method (GET, POST, PUT, DELETE)
  • ✅ Valid protocol string (HTTP/1.1, HTTP/2.0)

Invalid Entries:#

"POST /image.jpg INVALID"     ❌ invalid protocol
"PUT /image.jpg HTTP/1.1" OK  ❌ status must be numeric not "OK"

3. Log Filtering Fields#

To filter “POST requests for /images/ from 15:00–18:00 on Mondays”, you need:#

  • Time → for hour AND day-of-week filtering
  • Method → check POST
  • URL → check /images/ path

NOT needed for this filter:#

  • ❌ Status, Size, Referrer, Server → irrelevant to this filter
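A sketch of that filter, assuming entries have already been parsed into dicts as in the later sections. The sample data here is made up, and the 15:00–18:00 range is read as 15:00–17:59:

```python
from datetime import datetime

# hypothetical parsed entries (16 Dec 2024 is a Monday, 17 Dec a Tuesday)
entries = [
    {'timestamp': datetime(2024, 12, 16, 16, 30),
     'method': 'POST', 'url': '/images/logo.png'},
    {'timestamp': datetime(2024, 12, 17, 16, 30),
     'method': 'POST', 'url': '/images/logo.png'},   # Tuesday - excluded
]

filtered = [
    e for e in entries
    if e['timestamp'].weekday() == 0          # Monday
    and 15 <= e['timestamp'].hour < 18        # 15:00-17:59
    and e['method'] == 'POST'
    and e['url'].startswith('/images/')
]
print(len(filtered))  # 1
```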

4. Mobile Detection from UserAgent#

# ✅ Correct - check multiple mobile indicators
mobile_indicators = ['mobile', 'iphone', 'android', 'ipad']
is_mobile = any(
    indicator in user_agent.lower()
    for indicator in mobile_indicators
)

# ❌ Wrong - too narrow
'Mobile' in user_agent              # case sensitive
user_agent.startswith('Mozilla/5.0 (Mobile')  # too specific
'webkit' in user_agent.lower()      # webkit used by desktop too

5. Counting URL Segments - collections.Counter#

from collections import Counter

# Count checkout transactions by product category
# URLs like: /checkout/electronics/, /checkout/clothing/
Counter(
    url.split('/')[2]
    for url in filtered_checkout_urls
    if len(url.split('/')) > 2
)
# Result: {'electronics': 45, 'clothing': 32, 'food': 18}

6. Validate Filters Independently - Debugging#

Systematic Approach:#

# Validate each filter step by step
all_entries = parse_log('access.log')

# Step 1: Check time filter
time_filtered = [e for e in all_entries
                 if 12 <= e['timestamp'].hour <= 15]
print(f"After time filter: {len(time_filtered)}")

# Step 2: Check method filter
method_filtered = [e for e in time_filtered
                   if e['method'] == 'POST']
print(f"After method filter: {len(method_filtered)}")

# Step 3: Check URL filter
url_filtered = [e for e in method_filtered
                if e['url'].startswith('/checkout/')]
print(f"After URL filter: {len(url_filtered)}")

# Step 4: Check status filter
final = [e for e in url_filtered
         if 200 <= e['status'] < 300]
print(f"Final count: {len(final)}")

  • ✅ Validate each filter independently - exam answer
  • ❌ Assume data corruption → wrong first assumption
  • ❌ Reduce dataset randomly → loses data integrity

7. “Any of These Could Be the Reason”#

  • If log count appears unexpectedly high → any filter could be wrong:
    • Wrong weekday number
    • Wrong hour range boundary
    • Missing URL path filter
    • Wrong HTTP method filter
  • ✅ “Any of these could be the reason” - exam answer (JAN_FN Q310, JAN_AN Q429)

8. Double Filter for 404 Count#

# ✅ Must apply BOTH filters before counting
filtered = [
    e for e in entries
    if e['url'].startswith('/error/')    # URL filter
    and e['status'] == 404              # status filter
]
count = len(filtered)

# ❌ Wrong - counts everything in filtered_entries
# without applying the second filter (status==404)
len(filtered_entries)   # if filtered_entries only has URL filter applied

  • ✅ “None of these” - exam answer (JAN_AN Q428) when options show incomplete filtering

9. Reading Compressed Logs#

import gzip

# ✅ Read gzipped log file
with gzip.open('access.log.gz', 'rt') as f:
    for line in f:          # generator - memory efficient
        process(line)

10. Generator Pattern for Large Logs#

# ✅ Memory efficient - one line at a time
with open('access.log') as f:
    for line in f:
        process(line)

# ❌ Loads entire file into memory
lines = open('access.log').readlines()

Complete Log Parsing - Minimal Reference:#

import gzip
from datetime import datetime

def parse_log_line(line):
    """Parse a single Apache log line"""
    import shlex
    try:
        parts = shlex.split(line)   # keeps quoted fields together
        # the bracketed timestamp is NOT quoted, so it splits into two tokens
        timestamp = datetime.strptime(
            parts[3].lstrip('[') + ' ' + parts[4].rstrip(']'),
            '%d/%b/%Y:%H:%M:%S %z'    # ✅ exam format
        )
        method, url, protocol = parts[5].split(' ')   # quoted request field
        return {
            'ip': parts[0],
            'timestamp': timestamp,
            'method': method,             # split(' ')[0] of request
            'url': url,                   # split(' ')[1] of request ✅
            'protocol': protocol,         # split(' ')[2] of request ✅
            'status': int(parts[6]),
            'size': parts[7],             # may be '-' when no body
            'user_agent': parts[9]
        }
    except (ValueError, IndexError):      # malformed or truncated line
        return None

def analyze_log(filepath):
    results = []
    
    # Handle both gzipped and plain files
    opener = gzip.open if filepath.endswith('.gz') else open
    
    with opener(filepath, 'rt') as f:
        for line in f:                    # ✅ generator pattern
            entry = parse_log_line(line.strip())
            if entry:
                results.append(entry)
    
    return results

Common Log Analysis Patterns:#

entries = analyze_log('access.log.gz')

# Filter by time range
peak_hour = [e for e in entries
             if 12 <= e['timestamp'].hour <= 15]  # ✅

# Filter by weekday
mondays = [e for e in entries
           if e['timestamp'].weekday() == 0]       # ✅ 0=Monday

# Filter POST to /checkout/
checkouts = [e for e in entries
             if e['method'] == 'POST'
             and e['url'].startswith('/checkout/')] # ✅

# Filter successful requests
successful = [e for e in entries
              if 200 <= e['status'] < 300]         # ✅

# Filter mobile traffic
mobile = [e for e in entries
          if any(m in e['user_agent'].lower()
                 for m in ['mobile','iphone','android','ipad'])]  # ✅

# Filter redirects
redirects = [e for e in entries
             if 300 <= e['status'] < 400]          # ✅

Quick Reference#

Log Format:
  IP - - [Timestamp] "Method URL Protocol" Status Size "Ref" "UA" VHost ServerIP

Valid entry needs:
  ✅ Numeric status code
  ✅ Valid HTTP method
  ✅ Valid protocol string (HTTP/1.1)

Fields for POST/path/time filter:
  ✅ Time + Method + URL

Datetime format:
  "%d/%b/%Y:%H:%M:%S %z"  ✅

Weekday numbers:
  0=Mon, 1=Tue, 2=Wed, 3=Thu, 4=Fri, 5=Sat, 6=Sun
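A one-line check of that mapping (16 Dec 2024 falls on a Monday):

```python
from datetime import datetime

print(datetime(2024, 12, 16).weekday())  # 0 → Monday
```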

Time ranges:
  12:00-15:59 → 12 <= hour <= 15  ✅
  16:00-18:59 → 16 <= hour < 19   ✅

Request field splitting:
  split(' ')[0] → method
  split(' ')[1] → URL      ✅
  split(' ')[2] → protocol ✅

URL matching:
  url.startswith('/path/')  ✅

Status ranges:
  200 <= status < 300  → success ✅
  300 <= status < 400  → redirect ✅

Unexpected high count:
  → Any filter could be wrong ✅

Large file processing:
  → Generator: for line in open() ✅
  → gzip.open() for .gz files ✅