10. Apache Log Analysis
Standard Log Entry:
IP RemoteUser AuthUser [Timestamp] "Method URL Protocol" Status Size "Referrer" "UserAgent" VHost ServerIP
Example Entry:
203.0.113.7 - - [14/Dec/2024:16:45:11 -0500] "GET /index.html HTTP/1.1" 200 3500 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 15_0)" shop.com 192.168.1.100
Field Breakdown:
| Field | Example | Notes |
|---|---|---|
| IP | 203.0.113.7 | Visitor IP |
| RemoteUser | - | Typically - |
| AuthUser | - | Typically - |
| Timestamp | [14/Dec/2024:16:45:11 -0500] | Includes timezone |
| Method | GET | HTTP method |
| URL | /index.html | Path visited |
| Protocol | HTTP/1.1 | HTTP version |
| Status | 200 | HTTP status code |
| Size | 3500 | Response bytes |
| Referrer | "-" | Where request came from |
| UserAgent | "Mozilla/5.0..." | Browser/device info |
| VHost | shop.com | Virtual host |
| ServerIP | 192.168.1.100 | Server IP |
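The field breakdown can be checked against the example entry with a regular expression. This is a minimal sketch: `LOG_PATTERN` and its group names are hypothetical, written only for the 13-field layout shown above, and real server configurations may log different fields.

```python
import re
from datetime import datetime

# Hypothetical pattern for the layout in these notes (not an official Apache format)
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<remote_user>\S+) (?P<auth_user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\S+) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)" '
    r'(?P<vhost>\S+) (?P<server_ip>\S+)'
)

line = ('203.0.113.7 - - [14/Dec/2024:16:45:11 -0500] '
        '"GET /index.html HTTP/1.1" 200 3500 "-" '
        '"Mozilla/5.0 (iPhone; CPU iPhone OS 15_0)" shop.com 192.168.1.100')

fields = LOG_PATTERN.match(line).groupdict()
when = datetime.strptime(fields['timestamp'], '%d/%b/%Y:%H:%M:%S %z')

print(fields['method'], fields['url'], fields['status'])  # GET /index.html 200
print(when.isoformat())                                   # 2024-12-14T16:45:11-05:00
```

Every value comes back as a string, so the status still needs `int()` before any range comparison.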
2. Valid Log Entry Rules
Valid Entry Must Have:
- ✅ Numeric HTTP status code (200, 404)
- ✅ Valid HTTP method (GET, POST, PUT, DELETE)
- ✅ Valid protocol string (HTTP/1.1, HTTP/2.0)
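The three rules can be checked mechanically. A minimal sketch, assuming the request string and the status field have already been extracted from the line; `is_valid_entry` and the two sets are hypothetical names, and the sets hold only the values named in these notes (real logs also contain HEAD, OPTIONS, HTTP/1.0 and others).

```python
VALID_METHODS = {'GET', 'POST', 'PUT', 'DELETE'}   # only the methods listed in these notes
VALID_PROTOCOLS = {'HTTP/1.1', 'HTTP/2.0'}         # only the protocols listed in these notes

def is_valid_entry(request, status):
    """Return True if the request string and status field pass all three rules."""
    parts = request.split(' ')
    if len(parts) != 3:                # must be exactly: method, URL, protocol
        return False
    method, _url, protocol = parts
    numeric_status = status.isascii() and status.isdigit()
    return method in VALID_METHODS and protocol in VALID_PROTOCOLS and numeric_status

print(is_valid_entry('GET /index.html HTTP/1.1', '200'))  # True
print(is_valid_entry('POST /image.jpg INVALID', '200'))   # False (protocol)
print(is_valid_entry('PUT /image.jpg HTTP/1.1', 'OK'))    # False (status)
```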
Invalid Entries:
"POST /image.jpg INVALID" ❌ invalid protocol
"PUT /image.jpg HTTP/1.1" OK ❌ status must be numeric, not "OK"
3. Log Filtering Fields
To filter “POST requests for /images/ from 15:00–18:00 on Mondays”, you need:
- ✅ Time → for hour AND day-of-week filtering
- ✅ Method → check POST
- ✅ URL → check /images/ path
NOT needed for this filter:
- ❌ Status, Size, Referrer, Server → irrelevant to this filter
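A sketch of that filter using only the three needed fields. The entry dictionaries and `matches_filter` are hypothetical, and the upper bound is an assumption: the window is read here as 15:00 up to but not including 18:00.

```python
from datetime import datetime

def matches_filter(entry):
    """POST requests under /images/ between 15:00 and 17:59 on a Monday."""
    ts = entry['timestamp']
    return (ts.weekday() == 0            # 0 = Monday
            and 15 <= ts.hour < 18       # assumption: 18:00 itself is excluded
            and entry['method'] == 'POST'
            and entry['url'].startswith('/images/'))

# Hypothetical sample entries (16 Dec 2024 is a Monday, 17 Dec a Tuesday)
entries = [
    {'timestamp': datetime(2024, 12, 16, 15, 30), 'method': 'POST', 'url': '/images/a.png'},
    {'timestamp': datetime(2024, 12, 16, 18, 5),  'method': 'POST', 'url': '/images/b.png'},
    {'timestamp': datetime(2024, 12, 17, 16, 0),  'method': 'POST', 'url': '/images/c.png'},
    {'timestamp': datetime(2024, 12, 16, 16, 0),  'method': 'GET',  'url': '/images/d.png'},
]

matched = [e['url'] for e in entries if matches_filter(e)]
print(matched)  # ['/images/a.png']
```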
4. Mobile Detection from UserAgent
# ✅ Correct - check multiple mobile indicators
mobile_indicators = ['mobile', 'iphone', 'android', 'ipad']
is_mobile = any(
    indicator in user_agent.lower()
    for indicator in mobile_indicators
)
# ❌ Wrong - too narrow or too broad
'Mobile' in user_agent  # case sensitive
user_agent.startswith('Mozilla/5.0 (Mobile')  # too specific
'webkit' in user_agent.lower()  # too broad: webkit used by desktop too
5. Counting URL Segments - collections.Counter
from collections import Counter
# Count checkout transactions by product category
# URLs like: /checkout/electronics/, /checkout/clothing/
category_counts = Counter(
    url.split('/')[2]
    for url in filtered_checkout_urls
    # require a non-empty third segment, so a bare /checkout/ is skipped
    if len(url.split('/')) > 2 and url.split('/')[2]
)
# Result (example): Counter({'electronics': 45, 'clothing': 32, 'food': 18})
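A small runnable version of the same idea, on made-up URLs. The `category` helper and the sample list are hypothetical; the helper returns `None` when the URL has no category segment so those URLs are left out of the count.

```python
from collections import Counter

# Hypothetical sample data
filtered_checkout_urls = [
    '/checkout/electronics/',
    '/checkout/clothing/',
    '/checkout/electronics/?id=7',
    '/checkout/',                    # no category segment
]

def category(url):
    """Return the segment after /checkout/, or None if it is missing or empty."""
    parts = url.split('/')           # '/checkout/clothing/' -> ['', 'checkout', 'clothing', '']
    return parts[2] if len(parts) > 2 and parts[2] else None

counts = Counter(c for c in map(category, filtered_checkout_urls) if c)
print(counts.most_common(1))  # [('electronics', 2)]
```

`most_common(n)` returns the top `n` categories as `(name, count)` pairs, which is usually what a "most popular category" question asks for.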
6. Validate Filters Independently - Debugging
Systematic Approach:
# Validate each filter step by step
all_entries = parse_log('access.log')
# Step 1: Check time filter
time_filtered = [e for e in all_entries
                 if 12 <= e['timestamp'].hour <= 15]
print(f"After time filter: {len(time_filtered)}")
# Step 2: Check method filter
method_filtered = [e for e in time_filtered
                   if e['method'] == 'POST']
print(f"After method filter: {len(method_filtered)}")
# Step 3: Check URL filter
url_filtered = [e for e in method_filtered
                if e['url'].startswith('/checkout/')]
print(f"After URL filter: {len(url_filtered)}")
# Step 4: Check status filter
final = [e for e in url_filtered
         if 200 <= e['status'] < 300]
print(f"Final count: {len(final)}")
- ✅ Validate each filter independently - exam answer
- ❌ Assume data corruption → wrong first assumption
- ❌ Reduce dataset randomly → loses data integrity
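The step-by-step check can also be written as a loop that records the count after each filter, so the stage where the numbers stop making sense is easy to spot. A sketch under assumptions: `apply_filters` and the sample entries are hypothetical.

```python
from datetime import datetime

def apply_filters(entries, filters):
    """Apply (name, predicate) pairs in order, recording the count after each one."""
    counts = [('start', len(entries))]
    for name, predicate in filters:
        entries = [e for e in entries if predicate(e)]
        counts.append((name, len(entries)))
    return entries, counts

# Hypothetical sample entries
sample = [
    {'timestamp': datetime(2024, 12, 16, 13, 0), 'method': 'POST', 'url': '/checkout/food/', 'status': 200},
    {'timestamp': datetime(2024, 12, 16, 13, 5), 'method': 'GET',  'url': '/checkout/food/', 'status': 200},
    {'timestamp': datetime(2024, 12, 16, 9, 0),  'method': 'POST', 'url': '/checkout/food/', 'status': 200},
    {'timestamp': datetime(2024, 12, 16, 14, 0), 'method': 'POST', 'url': '/checkout/food/', 'status': 500},
]

filters = [
    ('time',   lambda e: 12 <= e['timestamp'].hour <= 15),
    ('method', lambda e: e['method'] == 'POST'),
    ('url',    lambda e: e['url'].startswith('/checkout/')),
    ('status', lambda e: 200 <= e['status'] < 300),
]

final, counts = apply_filters(sample, filters)
for name, n in counts:
    print(f"{name}: {n}")
```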
7. “Any of These Could Be the Reason”
- If log count appears unexpectedly high → any filter could be wrong:
  - Wrong weekday number
  - Wrong hour range boundary
  - Missing URL path filter
  - Wrong HTTP method filter
- ✅ “Any of these could be the reason” - exam answer (JAN_FN Q310, JAN_AN Q429)
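Two of these causes are easy to reproduce. The sketch below, on made-up values, shows how the two standard `datetime` weekday methods number Monday differently, and how an inclusive upper bound lets an extra hour through.

```python
from datetime import datetime

monday = datetime(2024, 12, 16)   # a Monday
print(monday.weekday())           # 0  (Monday=0 ... Sunday=6)
print(monday.isoweekday())        # 1  (Monday=1 ... Sunday=7)

# Inclusive vs. exclusive upper bound on the hour
hours = range(24)
inclusive = [h for h in hours if 15 <= h <= 18]
exclusive = [h for h in hours if 15 <= h < 18]
print(inclusive)  # [15, 16, 17, 18]
print(exclusive)  # [15, 16, 17]
```

Comparing `weekday()` against 1 when Monday was intended silently selects Tuesdays, which changes the count without raising any error.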
8. Double Filter for 404 Count
# ✅ Must apply BOTH filters before counting
filtered = [
    e for e in entries
    if e['url'].startswith('/error/')  # URL filter
    and e['status'] == 404             # status filter
]
count = len(filtered)
# ❌ Wrong - counts everything in filtered_entries
# without applying the second filter (status==404)
len(filtered_entries)  # if filtered_entries only has URL filter applied
- ✅ “None of these” - exam answer (JAN_AN Q428) when options show incomplete filtering
9. Reading Compressed Logs
import gzip
# ✅ Read gzipped log file ('rt' = text mode, yields str instead of bytes)
with gzip.open('access.log.gz', 'rt') as f:
    for line in f:  # lazy iteration, one line at a time - memory efficient
        process(line)
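To try this without a real log, a compressed file can be written to a temporary directory and read back. A self-contained sketch; the sample line and file name are made up for illustration.

```python
import gzip
import os
import tempfile

sample = ('203.0.113.7 - - [14/Dec/2024:16:45:11 -0500] '
          '"GET /index.html HTTP/1.1" 200 3500 "-" "Mozilla/5.0" shop.com 192.168.1.100\n')

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'access.log.gz')
    # 'wt' writes text that gzip compresses on the fly
    with gzip.open(path, 'wt', encoding='utf-8') as f:
        f.write(sample * 3)
    # 'rt' decompresses and decodes line by line
    with gzip.open(path, 'rt', encoding='utf-8', errors='replace') as f:
        line_count = sum(1 for _ in f)

print(line_count)  # 3
```

Passing `errors='replace'` is a defensive choice for logs that may contain bytes that are not valid UTF-8; without it a single bad byte raises `UnicodeDecodeError`.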
10. Generator Pattern for Large Logs
# ✅ Memory efficient - one line at a time
with open('access.log') as f:
    for line in f:
        process(line)
# ❌ Loads entire file into memory
lines = open('access.log').readlines()
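The same idea extends to a generator function that filters while it reads, so only matching lines are ever held in memory. A minimal sketch; `iter_matching` is a hypothetical helper and `io.StringIO` stands in for an open log file.

```python
import io

def iter_matching(lines, needle):
    """Yield lines containing `needle`, one at a time, without building a list."""
    for line in lines:
        if needle in line:
            yield line.rstrip('\n')

fake_log = io.StringIO('GET /a 200\nPOST /b 201\nPOST /c 500\n')
posts = iter_matching(fake_log, 'POST')

first = next(posts)   # nothing is read until a value is requested
rest = list(posts)
print(first)  # POST /b 201
print(rest)   # ['POST /c 500']
```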
Complete Log Parsing - Minimal Reference:
import gzip
import shlex
from datetime import datetime

def parse_log_line(line):
    """Parse a single Apache log line; return None if it is malformed."""
    try:
        parts = shlex.split(line)  # keeps each quoted field as one token
        # The timestamp is not quoted, so it arrives as two tokens:
        # '[14/Dec/2024:16:45:11' and '-0500]'
        timestamp = f"{parts[3]} {parts[4]}".strip('[]')
        # The quoted request is a single token: 'GET /index.html HTTP/1.1'
        method, url, protocol = parts[5].split(' ')
        return {
            'ip': parts[0],
            'timestamp': datetime.strptime(
                timestamp,
                '%d/%b/%Y:%H:%M:%S %z'  # ✅ exam format
            ),
            'method': method,      # split(' ')[0] of request
            'url': url,            # split(' ')[1] of request ✅
            'protocol': protocol,  # split(' ')[2] of request ✅
            'status': int(parts[6]),
            'size': parts[7],      # kept as a string; may be '-'
            'user_agent': parts[9]
        }
    except (ValueError, IndexError):  # bad quoting, missing fields, non-numeric status
        return None

def analyze_log(filepath):
    results = []
    # Handle both gzipped and plain files
    opener = gzip.open if filepath.endswith('.gz') else open
    with opener(filepath, 'rt') as f:
        for line in f:  # ✅ reads one line at a time
            entry = parse_log_line(line.strip())
            if entry:
                results.append(entry)
    return results
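To see which index each field lands on, it can help to print the tokens that `shlex.split` produces for the example entry from the top of these notes. A quick check:

```python
import shlex

line = ('203.0.113.7 - - [14/Dec/2024:16:45:11 -0500] '
        '"GET /index.html HTTP/1.1" 200 3500 "-" '
        '"Mozilla/5.0 (iPhone; CPU iPhone OS 15_0)" shop.com 192.168.1.100')

parts = shlex.split(line)
for index, token in enumerate(parts):
    print(index, repr(token))

request_fields = parts[5].split(' ')
print(request_fields)  # ['GET', '/index.html', 'HTTP/1.1']
```

The unquoted timestamp splits into two tokens (indexes 3 and 4), while the quoted request stays as one token at index 5 that still needs `split(' ')`. That puts the status at index 6 and the user agent at index 9.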
Common Log Analysis Patterns:
entries = analyze_log('access.log.gz')
# Filter by time range
peak_hour = [e for e in entries
             if 12 <= e['timestamp'].hour <= 15]  # ✅
# Filter by weekday
mondays = [e for e in entries
           if e['timestamp'].weekday() == 0]  # ✅ 0=Monday
# Filter POST to /checkout/
checkouts = [e for e in entries
             if e['method'] == 'POST'
             and e['url'].startswith('/checkout/')]  # ✅
# Filter successful requests
successful = [e for e in entries
              if 200 <= e['status'] < 300]  # ✅
# Filter mobile traffic
mobile = [e for e in entries
          if any(m in e['user_agent'].lower()
                 for m in ['mobile', 'iphone', 'android', 'ipad'])]  # ✅
# Filter redirects
redirects = [e for e in entries
             if 300 <= e['status'] < 400]  # ✅
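When several status ranges are needed at once, one pass with `Counter` can replace a separate list comprehension per range. A sketch on made-up entries; the `2xx`/`3xx` labels are just a convenient grouping key, not something the log contains.

```python
from collections import Counter

# Hypothetical parsed entries (only the status field matters here)
entries = [{'status': 200}, {'status': 204}, {'status': 301},
           {'status': 404}, {'status': 500}]

# Integer-divide by 100 to bucket each status into its class
by_class = Counter(f"{e['status'] // 100}xx" for e in entries)
print(by_class['2xx'], by_class['3xx'], by_class['4xx'])  # 2 1 1
```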
Quick Reference
Log Format:
  IP - - [Timestamp] "Method URL Protocol" Status Size "Ref" "UA" VHost ServerIP
Valid entry needs:
  ✅ Numeric status code
  ✅ Valid HTTP method
  ✅ Valid protocol string (HTTP/1.1)
Fields for POST/path/time filter:
  ✅ Time + Method + URL
Datetime format:
  "%d/%b/%Y:%H:%M:%S %z" ✅
Weekday numbers:
  0=Mon, 1=Tue, 2=Wed, 3=Thu, 4=Fri, 5=Sat, 6=Sun
Time ranges:
  12:00-15:59 → 12 <= hour <= 15 ✅
  16:00-18:59 → 16 <= hour < 19 ✅
Request field splitting:
  split(' ')[0] → method
  split(' ')[1] → URL ✅
  split(' ')[2] → protocol ✅
URL matching:
  url.startswith('/path/') ✅
Status ranges:
  200 <= status < 300 → success ✅
  300 <= status < 400 → redirect ✅
Unexpected high count:
  → Any filter could be wrong ✅
Large file processing:
  → Generator: for line in open() ✅
  → gzip.open() for .gz files ✅