6. Web Scraping#
1. Web Scraping - Overview#
What is Web Scraping?#
- Programmatically extracting data from websites
- Automates what a human would do manually in a browser
- Used for: price monitoring, research data collection, news aggregation, lead generation
General Workflow:#
1. Send HTTP request to URL
↓
2. Receive HTML response
↓
3. Parse HTML to find data
↓
4. Extract specific elements
↓
5. Clean and store data
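A minimal sketch of the full workflow end to end (the URL and CSS class are placeholders):
import requests
from bs4 import BeautifulSoup

# 1-2. Send request, receive HTML
response = requests.get('https://example.com/products')  # placeholder URL
response.raise_for_status()

# 3. Parse HTML
soup = BeautifulSoup(response.text, 'html.parser')

# 4. Extract specific elements (class name is a placeholder)
prices = [p.get_text(strip=True) for p in soup.find_all('span', class_='price')]

# 5. Clean and store data
prices = [p.replace('$', '') for p in prices]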
| Tool | Purpose | Best For |
|---|---|---|
| requests | Fetch HTML pages | Simple static pages |
| BeautifulSoup | Parse HTML/XML | Extracting data from HTML |
| Scrapy | Full scraping framework | Large-scale scraping |
| Selenium | Browser automation | Dynamic/JavaScript pages |
| Playwright | Modern browser automation | Modern JS-heavy pages |
| httpx | Async HTTP requests | Concurrent scraping |
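Since the table lists httpx for concurrent scraping, here is a minimal async sketch (URLs are placeholders; requires pip install httpx):
import asyncio
import httpx

async def fetch_all(urls):
    """Fetch several pages concurrently with one shared client."""
    async with httpx.AsyncClient(timeout=30) as client:
        responses = await asyncio.gather(*(client.get(u) for u in urls))
    return [r.text for r in responses]

# Usage (placeholder URLs)
pages = asyncio.run(fetch_all(['https://example.com/p1',
                               'https://example.com/p2']))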
2. Ethical Scraping - robots.txt#
What is robots.txt?#
- Text file at the root of a website: https://example.com/robots.txt
- Tells web crawlers which paths are allowed or disallowed
- Standard convention - not technically enforced, but legally/ethically important
Reading robots.txt:#
# Example robots.txt
User-agent: * # applies to all bots
Disallow: /admin/ # don't scrape /admin/
Disallow: /private/ # don't scrape /private/
Allow: /public/ # explicitly allow /public/
Crawl-delay: 2 # wait 2 seconds between requests ✅
User-agent: Googlebot # specific rule for Google
Allow: / # Googlebot can access everything
User-agent: BadBot # block specific bot
Disallow: / # disallow everything
Checking robots.txt in Python:#
import requests
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def can_scrape(url, user_agent='*'):
    """Check if URL can be scraped according to robots.txt"""
    # Build the robots.txt URL from the page URL
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    # Parse robots.txt
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    # Check if allowed and get the crawl delay
    allowed = rp.can_fetch(user_agent, url)
    crawl_delay = rp.crawl_delay(user_agent)
    return allowed, crawl_delay

# Usage
url = 'https://example.com/products'
allowed, delay = can_scrape(url)
if allowed:
    response = requests.get(url)
    # process...
    if delay:
        time.sleep(delay)  # ✅ respect crawl-delay
else:
    print(f"Scraping {url} is not allowed by robots.txt")
Ethical Scraping Checklist:#
✅ Parse robots.txt before scraping
✅ Respect disallowed paths (don't scrape them)
✅ Implement crawl-delay between requests
✅ Identify your bot in User-Agent header
✅ Don't overload servers
✅ Check website's Terms of Service
✅ Cache responses to avoid redundant requests
✅ Scrape during off-peak hours if large volume
❌ Ignore robots.txt even for academic purposes
❌ Treat robots.txt as optional guidance
❌ Make rapid-fire requests without delays
❌ Scrape personal/private data without consent
❌ Bypass authentication/paywalls
- ✅ Parse robots.txt, respect disallowed paths, implement crawl-delay - exam answer
- ❌ Ignore robots.txt if data is for academic purposes - wrong
- ❌ Check robots.txt only once per domain - insufficient
- ❌ Treat robots.txt as optional - unethical
3. crawl-delay Directive#
import time
import requests
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

# Get crawl delay
delay = rp.crawl_delay('*')    # for all bots
delay = delay if delay else 1  # default 1 second

urls_to_scrape = ['https://example.com/page1',
                  'https://example.com/page2',
                  'https://example.com/page3']

for url in urls_to_scrape:
    if rp.can_fetch('*', url):
        response = requests.get(url)
        process(response)   # process() is your own handler
        time.sleep(delay)   # ✅ respect crawl-delay
    else:
        print(f"Skipping disallowed URL: {url}")
4. BeautifulSoup - HTML Parsing#
What is BeautifulSoup?#
- Python library for parsing HTML and XML
- Makes it easy to navigate, search, and extract data from HTML
- Works with requests to fetch and parse web pages
- ❌ NOT for making HTTP requests (use requests for that)
- ❌ NOT for dynamic/JavaScript pages (use Selenium/Playwright)
Installation:#
pip install beautifulsoup4
pip install lxml # faster parser (optional)
Import:#
from bs4 import BeautifulSoup
import requests
5. BeautifulSoup() - Creating a Soup Object#
Basic HTML Parsing:#
import requests
from bs4 import BeautifulSoup
# Step 1: Fetch HTML
url = 'https://example.com'
response = requests.get(url)
html_content = response.text
# Step 2: Create soup object
soup = BeautifulSoup(html_content, 'html.parser') # ✅ built-in parser
soup = BeautifulSoup(html_content, 'lxml') # faster, needs lxml
soup = BeautifulSoup(html_content, 'html5lib') # most lenient
# Step 3: Extract data
title = soup.title.text
print(title)
Parsers Comparison:#
| Parser | Speed | Leniency | Requires Install |
|---|---|---|---|
| html.parser | Medium | Medium | No (built-in) |
| lxml | Fast | Strict | Yes |
| html5lib | Slow | Most lenient | Yes |
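A quick sketch of how leniency differs in practice: each parser repairs malformed HTML differently (requires lxml and html5lib installed; exact output can vary by version):
from bs4 import BeautifulSoup

broken = "<p>one<p>two"  # unclosed tags

# Each parser repairs the markup its own way
for parser in ('html.parser', 'lxml', 'html5lib'):
    print(parser, '->', BeautifulSoup(broken, parser))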
Parsing from String (for testing):#
html = """
<html>
<body>
<h1 class="title">Data Science Tools</h1>
<div id="content">
<p class="intro">Welcome to TDS</p>
<ul>
<li>Python</li>
<li>Pandas</li>
<li>Git</li>
</ul>
</div>
<table>
<tr><th>Tool</th><th>Purpose</th></tr>
<tr><td>Git</td><td>Version Control</td></tr>
<tr><td>Docker</td><td>Containerization</td></tr>
</table>
<a href="https://example.com">Click here</a>
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
6. .find() and .find_all() - Locating Elements#
.find() - Returns FIRST matching element:#
# Find by tag name
h1 = soup.find('h1') # first <h1>
div = soup.find('div') # first <div>
p = soup.find('p') # first <p>
# Find by CSS class
intro = soup.find('p', class_='intro') # <p class="intro">
title = soup.find('h1', class_='title') # <h1 class="title">
# Find by id
content = soup.find('div', id='content') # <div id="content">
content = soup.find(id='content') # same result
# Find by attribute
link = soup.find('a', href=True) # any <a> with href
link = soup.find('a', href='https://example.com')
# Find by multiple attributes
elem = soup.find('div', {'class': 'intro', 'id': 'main'})
# Shorthand (tag as attribute)
title = soup.h1 # first <h1>
first_p = soup.p # first <p>
.find_all() - Returns ALL matching elements:#
# Find all by tag
all_links = soup.find_all('a') # all <a> tags
all_paragraphs = soup.find_all('p') # all <p> tags
all_rows = soup.find_all('tr') # all table rows
# Find all by class
all_items = soup.find_all('li', class_='item')
# Find all by attribute
all_hrefs = soup.find_all('a', href=True)
# Limit results
first_3 = soup.find_all('p', limit=3)
# Find multiple tag types
all_headings = soup.find_all(['h1', 'h2', 'h3'])
# Find by regex
import re
all_h = soup.find_all(re.compile('^h[1-6]$')) # all heading tags
Navigating the Tree:#
# Parent/child navigation
div = soup.find('div', id='content')
# Children
children = list(div.children) # direct children
descendants = list(div.descendants) # all descendants
# First/last child
first = div.find() # first child element
# Parent
parent = div.parent # parent element
# Siblings
next_sib = div.next_sibling
prev_sib = div.previous_sibling
all_siblings = div.next_siblings
# Find within element (scoped search)
ul = soup.find('ul')
items = ul.find_all('li') # only li inside this ul
7. .get_text() - Extracting Text#
# Extract text from element
h1 = soup.find('h1')
text = h1.get_text() # "Data Science Tools"
text = h1.text # same result
# Get text with separator
div = soup.find('div', id='content')
text = div.get_text(separator='\n') # text with newlines
text = div.get_text(separator=' ') # text with spaces
text = div.get_text(strip=True) # strip whitespace ✅
# Get text from all matching elements
all_li = soup.find_all('li')
items = [li.get_text(strip=True) for li in all_li]
# ['Python', 'Pandas', 'Git']
# Get all text on page
all_text = soup.get_text()
all_text = soup.get_text(separator='\n', strip=True)
# Clean text
import re
clean = re.sub(r'\s+', ' ', all_text).strip()
8. .get() - Extracting Attributes#
# Get attribute value
link = soup.find('a')
href = link.get('href') # ✅ safe (returns None if missing)
href = link['href'] # raises KeyError if missing
# Get any attribute
img = soup.find('img')
src = img.get('src')
alt = img.get('alt', '') # with default value
width = img.get('width')
# Check if attribute exists
if link.has_attr('href'):
print(link['href'])
# Get all attributes as dict
attrs = link.attrs # {'href': 'url', 'class': ['link']}
# Get all links on page
all_links = soup.find_all('a', href=True)
urls = [link.get('href') for link in all_links]
# Resolve relative URLs
from urllib.parse import urljoin
base_url = 'https://example.com'
absolute_urls = [urljoin(base_url, url) for url in urls]
Complete BeautifulSoup Example - Scraping a Table:#
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

def scrape_table(url):
    """
    Scrape HTML table from webpage
    ✅ Follows ethical scraping practices
    """
    # Set User-Agent to identify bot
    headers = {
        'User-Agent': 'DataScienceBot/1.0 (research purposes)'
    }
    # Fetch page
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    # Parse HTML
    soup = BeautifulSoup(response.text, 'html.parser')
    # Find table
    table = soup.find('table')
    if not table:
        return None
    # Extract column headers (named to avoid shadowing the HTTP headers above)
    header_row = table.find('tr')
    columns = [th.get_text(strip=True)
               for th in header_row.find_all('th')]
    # Extract rows
    rows = []
    for tr in table.find_all('tr')[1:]:  # skip header row
        cells = [td.get_text(strip=True)
                 for td in tr.find_all('td')]
        if cells:
            rows.append(cells)
    # Convert to DataFrame
    df = pd.DataFrame(rows, columns=columns)
    return df

def scrape_multiple_pages(base_url, pages):
    """Scrape multiple pages with rate limiting"""
    all_data = []
    for page in range(1, pages + 1):
        url = f'{base_url}?page={page}'
        try:
            df = scrape_table(url)
            if df is not None:
                all_data.append(df)
                print(f"✅ Scraped page {page}")
        except Exception as e:
            print(f"❌ Failed page {page}: {e}")
        time.sleep(2)  # ✅ rate limiting
    return pd.concat(all_data, ignore_index=True)
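Example usage of the functions above (URL and filename are placeholders):
df = scrape_multiple_pages('https://example.com/products', pages=3)
print(df.head())
df.to_csv('products.csv', index=False)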
Complete BeautifulSoup Reference:#
# CSS Selectors (alternative to find/find_all)
soup.select('div.content') # <div class="content">
soup.select('#main') # id="main"
soup.select('table tr td') # nested elements
soup.select('a[href]') # <a> with href attribute
soup.select('p.intro, p.outro') # multiple selectors
# First match with CSS selector
soup.select_one('h1.title')
# Practical extraction patterns:
# 1. All links
links = [(a.get_text(strip=True), a.get('href'))
for a in soup.find_all('a', href=True)]
# 2. All images
images = [(img.get('alt', ''), img.get('src'))
for img in soup.find_all('img', src=True)]
# 3. Table to DataFrame
import pandas as pd
from io import StringIO
tables = pd.read_html(StringIO(str(soup)))  # ✅ pandas can parse HTML tables (StringIO wrapper needed in newer pandas)
df = tables[0] # first table
# 4. Specific div content
content = soup.find('div', class_='article-body')
paragraphs = [p.get_text(strip=True)
for p in content.find_all('p')]
# 5. Meta tags
meta_desc = soup.find('meta', attrs={'name': 'description'})
description = meta_desc.get('content') if meta_desc else ''
# 6. JSON-LD structured data
import json
script = soup.find('script', type='application/ld+json')
if script:
    data = json.loads(script.string)
# 7. Next page link (pagination)
next_page = soup.find('a', {'rel': 'next'})
next_url = next_page.get('href') if next_page else None
9. Scrapy - Full Web Scraping Framework#
What is Scrapy?#
- Full-featured, asynchronous web scraping framework
- Built for large-scale scraping
- Handles: requests, parsing, pipelines, rate limiting automatically
- More complex than BeautifulSoup but much more powerful
When to Use Scrapy vs BeautifulSoup:#
| Scenario | Use |
|---|---|
| Simple, one-off scraping | BeautifulSoup + requests |
| Large-scale, many pages | Scrapy |
| Need JavaScript | Selenium/Playwright |
| Need data pipelines | Scrapy |
| Quick data extraction | BeautifulSoup |
Basic Scrapy Spider:#
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    # Respect robots.txt
    custom_settings = {
        'ROBOTSTXT_OBEY': True,    # ✅ obey robots.txt
        'DOWNLOAD_DELAY': 2,       # ✅ 2 second delay
        'CONCURRENT_REQUESTS': 4,  # max concurrent requests
        'USER_AGENT': 'DataBot/1.0'
    }

    def parse(self, response):
        # Extract data from page
        for product in response.css('div.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get(),
                'url': product.css('a::attr(href)').get()
            }
        # Follow next page link
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
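Since data pipelines are listed as a Scrapy strength above, here is a minimal item-pipeline sketch (class and file names are illustrative; pipelines are enabled via the ITEM_PIPELINES setting):
# pipelines.py - cleans each scraped item before storage
class PriceCleanerPipeline:
    def process_item(self, item, spider):
        # Normalize the price field, e.g. '$1,299' -> 1299.0
        if item.get('price'):
            item['price'] = float(item['price'].replace('$', '').replace(',', ''))
        return item

# Enable it in settings (or custom_settings):
#   ITEM_PIPELINES = {'pipelines.PriceCleanerPipeline': 300}
# Run the spider and export results:
#   scrapy runspider spider.py -o products.json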
10. requests + BeautifulSoup - Combined Pattern#
Standard Workflow:#
import requests
from bs4 import BeautifulSoup
import time

def scrape_page(url, session=None):
    """
    Standard requests + BeautifulSoup workflow
    """
    requester = session or requests
    response = requester.get(
        url,
        headers={'User-Agent': 'Research Bot 1.0'},
        timeout=30
    )
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup

def extract_articles(soup):
    """Extract article data from parsed HTML"""
    articles = []
    for article in soup.find_all('article'):
        title = article.find('h2')
        link = article.find('a')
        date = article.find('time')
        summary = article.find('p', class_='summary')
        articles.append({
            'title': title.get_text(strip=True) if title else '',
            'url': link.get('href') if link else '',
            'date': date.get('datetime') if date else '',
            'summary': summary.get_text(strip=True) if summary else ''
        })
    return articles

# Full pipeline
def scrape_news_site(base_url, num_pages=5):
    all_articles = []
    with requests.Session() as session:
        session.headers.update({
            'User-Agent': 'DataScience Research Bot 1.0'
        })
        for page in range(1, num_pages + 1):
            url = f'{base_url}/news?page={page}'
            try:
                soup = scrape_page(url, session)
                articles = extract_articles(soup)
                all_articles.extend(articles)
                print(f"Page {page}: {len(articles)} articles")
            except Exception as e:
                print(f"Failed page {page}: {e}")
            time.sleep(2)  # ✅ rate limiting
    return all_articles
11. Handling Dynamic Pages - Selenium & Playwright#
When is JavaScript Needed?#
Static page (BeautifulSoup works):
→ HTML is fully rendered in response
→ Data visible in page source (Ctrl+U)
Dynamic page (need Selenium/Playwright):
→ Data loaded by JavaScript after page loads
→ Data NOT in page source
→ Requires button clicks, scrolling, waiting
→ Examples: SPAs, infinite scroll, login-required pages
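A quick heuristic sketch for deciding: fetch the raw HTML and check whether the data you see in the browser is actually in it (URL and search string are placeholders):
import requests

html = requests.get('https://example.com/products', timeout=30).text

# If text visible in the browser is missing from the raw HTML,
# it is rendered by JavaScript -> use Selenium/Playwright
if 'Product Name' in html:  # placeholder string from the page
    print("Static content - requests + BeautifulSoup is enough")
else:
    print("Likely dynamic - use Selenium or Playwright")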
Selenium Basic Usage:#
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time

# Setup driver (Chrome)
options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run without browser window
options.add_argument('--no-sandbox')
driver = webdriver.Chrome(options=options)

try:
    # Navigate to page
    driver.get('https://example.com')

    # Wait for element to load
    wait = WebDriverWait(driver, 10)
    element = wait.until(
        EC.presence_of_element_located((By.CLASS_NAME, 'product'))
    )

    # Scroll to load more content
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(2)

    # Click button
    button = driver.find_element(By.ID, 'load-more')
    button.click()
    time.sleep(2)

    # Get rendered HTML and parse with BeautifulSoup
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')

    # Extract data normally
    products = soup.find_all('div', class_='product')
    for product in products:
        print(product.get_text(strip=True))
finally:
    driver.quit()  # always close browser
Playwright (Modern Alternative):#
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com')
    page.wait_for_selector('.product')  # wait for element

    # Get content
    html = page.content()
    soup = BeautifulSoup(html, 'html.parser')

    browser.close()
12. Respecting Rate Limits in Scraping#
import time
import random
import requests
from bs4 import BeautifulSoup
from urllib.robotparser import RobotFileParser

class EthicalScraper:
    """
    Scraper that respects robots.txt and rate limits
    """
    def __init__(self, base_url, delay=2):
        self.base_url = base_url
        self.delay = delay
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'ResearchBot/1.0 (academic research)'
        })
        # Load robots.txt
        self.robot_parser = RobotFileParser()
        self.robot_parser.set_url(f'{base_url}/robots.txt')
        self.robot_parser.read()
        # Get crawl delay from robots.txt
        robot_delay = self.robot_parser.crawl_delay('*')
        if robot_delay:
            self.delay = max(self.delay, robot_delay)  # use larger delay

    def fetch(self, url, retries=3):
        """Fetch URL with rate limiting and error handling"""
        # Check robots.txt
        if not self.robot_parser.can_fetch('*', url):
            print(f"Disallowed by robots.txt: {url}")
            return None
        for attempt in range(retries):
            try:
                response = self.session.get(url, timeout=30)
                response.raise_for_status()
                # ✅ Rate limiting - respect crawl delay,
                # with random jitter so requests are less uniform
                sleep_time = self.delay + random.uniform(0, 1)
                time.sleep(sleep_time)
                return response.text
            except requests.exceptions.HTTPError as e:
                if e.response.status_code == 429:  # rate limited
                    time.sleep(60)                 # wait 1 minute
                elif e.response.status_code == 404:
                    return None                    # page not found
                else:
                    time.sleep(2 ** attempt)       # exponential backoff
            except requests.exceptions.RequestException:
                time.sleep(2 ** attempt)
        return None

    def scrape(self, urls):
        """Scrape multiple URLs ethically"""
        results = []
        for url in urls:
            html = self.fetch(url)
            if html:
                soup = BeautifulSoup(html, 'html.parser')
                results.append({'url': url, 'soup': soup})
        return results
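Example usage of the class above (domain and paths are placeholders):
scraper = EthicalScraper('https://example.com', delay=2)
pages = scraper.scrape(['https://example.com/page1',
                        'https://example.com/page2'])
for result in pages:
    print(result['url'], result['soup'].title)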
Web Scraping - Quick Reference Card#
Tool Selection:
Static HTML page → requests + BeautifulSoup ✅
JavaScript page → Selenium or Playwright
Large-scale scraping → Scrapy
Simple HTTP → requests alone
BeautifulSoup:
soup = BeautifulSoup(html, 'html.parser')
soup.find('tag') → first element
soup.find('tag', class_='name') → by class
soup.find(id='name') → by id
soup.find_all('tag') → all elements
element.get_text(strip=True) → extract text ✅
element.get('href') → extract attribute ✅
soup.select('div.class > p') → CSS selector
Ethical Scraping:
✅ Check robots.txt first
✅ Follow crawl-delay
✅ Add time.sleep() between requests
✅ Set User-Agent header
❌ Ignore robots.txt
❌ Rapid-fire requests
Common Patterns:
# Fetch + parse
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Extract all links
links = [a.get('href') for a in soup.find_all('a', href=True)]
# Extract table
df = pd.read_html(response.text)[0]
# Extract all text
text = soup.get_text(separator='\n', strip=True)