6. Web Scraping#

1. Web Scraping - Overview#

What is Web Scraping?#

  • Programmatically extracting data from websites
  • Automates what a human would do manually in a browser
  • Used for: price monitoring, research data collection, news aggregation, lead generation

General Workflow:#

1. Send HTTP request to URL
         ↓
2. Receive HTML response
         ↓
3. Parse HTML to find data
         ↓
4. Extract specific elements
         ↓
5. Clean and store data
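
A minimal sketch of these five steps, assuming a hypothetical static page at https://example.com with <h2 class="title"> elements (the selector is an assumption):

import requests
from bs4 import BeautifulSoup

# 1-2. Send HTTP request, receive HTML response
response = requests.get('https://example.com', timeout=30)
response.raise_for_status()

# 3. Parse HTML
soup = BeautifulSoup(response.text, 'html.parser')

# 4. Extract specific elements (hypothetical selector)
titles = [h2.get_text(strip=True)
          for h2 in soup.find_all('h2', class_='title')]

# 5. Clean and store data
with open('titles.txt', 'w') as f:
    f.write('\n'.join(titles))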

Tools Available:#

Tool          | Purpose                    | Best For
------------- | -------------------------- | --------------------------
requests      | Fetch HTML pages           | Simple static pages
BeautifulSoup | Parse HTML/XML             | Extracting data from HTML
Scrapy        | Full scraping framework    | Large-scale scraping
Selenium      | Browser automation         | Dynamic/JavaScript pages
Playwright    | Modern browser automation  | Modern JS-heavy pages
httpx         | Async HTTP requests        | Concurrent scraping

2. Ethical Scraping - robots.txt#

What is robots.txt?#

  • Text file at root of website: https://example.com/robots.txt
  • Tells web crawlers which paths are allowed or disallowed
  • Standard convention - not technically enforced but legally/ethically important

Reading robots.txt:#

# Example robots.txt
User-agent: *              # applies to all bots
Disallow: /admin/          # don't scrape /admin/
Disallow: /private/        # don't scrape /private/
Allow: /public/            # explicitly allow /public/
Crawl-delay: 2             # wait 2 seconds between requests ✅

User-agent: Googlebot      # specific rule for Google
Allow: /                   # Googlebot can access everything

User-agent: BadBot         # block specific bot
Disallow: /                # disallow everything

Checking robots.txt in Python:#

import requests
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def can_scrape(url, user_agent='*'):
    """Check if URL can be scraped according to robots.txt"""
    # Build the robots.txt URL from the page's scheme and host
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    # Parse robots.txt
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    
    # Check if allowed
    allowed = rp.can_fetch(user_agent, url)
    crawl_delay = rp.crawl_delay(user_agent)
    
    return allowed, crawl_delay

# Usage
url = 'https://example.com/products'
allowed, delay = can_scrape(url)

if allowed:
    response = requests.get(url)
    # process...
    if delay:
        time.sleep(delay)  # ✅ respect crawl-delay
else:
    print(f"Scraping {url} is not allowed by robots.txt")

Ethical Scraping Checklist:#

✅ Parse robots.txt before scraping
✅ Respect Disallow rules (skip disallowed paths)
✅ Implement crawl-delay between requests
✅ Identify your bot in User-Agent header
✅ Don't overload servers
✅ Check website's Terms of Service
✅ Cache responses to avoid redundant requests
✅ Scrape during off-peak hours if large volume

❌ Ignore robots.txt, even for academic purposes
❌ Treat robots.txt as optional guidance
❌ Check robots.txt only once and assume it never changes
❌ Make rapid-fire requests without delays
❌ Scrape personal/private data without consent
❌ Bypass authentication/paywalls

3. crawl-delay Directive#

import time
import requests
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

# Get crawl delay
delay = rp.crawl_delay('*')    # for all bots
delay = delay if delay else 1   # default 1 second

urls_to_scrape = ['https://example.com/page1',
                  'https://example.com/page2',
                  'https://example.com/page3']

for url in urls_to_scrape:
    if rp.can_fetch('*', url):
        response = requests.get(url)
        process(response)       # placeholder: your own parsing/storage logic
        time.sleep(delay)       # ✅ respect crawl-delay
    else:
        print(f"Skipping disallowed URL: {url}")

4. BeautifulSoup - HTML Parsing#

What is BeautifulSoup?#

  • Python library for parsing HTML and XML
  • Makes it easy to navigate, search, and extract data from HTML
  • Works with requests to fetch and parse web pages
  • ❌ NOT for making HTTP requests (use requests for that)
  • ❌ NOT for dynamic/JavaScript pages (use Selenium/Playwright)

Installation:#

pip install beautifulsoup4
pip install lxml    # faster parser (optional)

Import:#

from bs4 import BeautifulSoup
import requests

5. BeautifulSoup() - Creating a Soup Object#

Basic HTML Parsing:#

import requests
from bs4 import BeautifulSoup

# Step 1: Fetch HTML
url = 'https://example.com'
response = requests.get(url)
html_content = response.text

# Step 2: Create soup object
soup = BeautifulSoup(html_content, 'html.parser')  # ✅ built-in parser
soup = BeautifulSoup(html_content, 'lxml')          # faster, needs lxml
soup = BeautifulSoup(html_content, 'html5lib')      # most lenient

# Step 3: Extract data
title = soup.title.text
print(title)

Parsers Comparison:#

Parser      | Speed  | Leniency     | Requires Install
----------- | ------ | ------------ | ----------------
html.parser | Medium | Medium       | No (built-in)
lxml        | Fast   | Strict       | Yes
html5lib    | Slow   | Most lenient | Yes
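
The leniency difference shows up with malformed HTML. A small demo (exact output varies by parser version, but the shapes below are typical):

from bs4 import BeautifulSoup

broken = '<p>First<p>Second'   # unclosed tags

# html.parser keeps just the fragment (no <html>/<body> added)
print(BeautifulSoup(broken, 'html.parser'))

# html5lib repairs it like a browser would: wraps the fragment in
# <html><head><body> and closes the unclosed tags
print(BeautifulSoup(broken, 'html5lib'))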

Parsing from String (for testing):#

html = """
<html>
<body>
    <h1 class="title">Data Science Tools</h1>
    <div id="content">
        <p class="intro">Welcome to TDS</p>
        <ul>
            <li>Python</li>
            <li>Pandas</li>
            <li>Git</li>
        </ul>
    </div>
    <table>
        <tr><th>Tool</th><th>Purpose</th></tr>
        <tr><td>Git</td><td>Version Control</td></tr>
        <tr><td>Docker</td><td>Containerization</td></tr>
    </table>
    <a href="https://example.com">Click here</a>
</body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

6. .find() and .find_all() - Locating Elements#

.find() - Returns FIRST matching element:#

# Find by tag name
h1 = soup.find('h1')                          # first <h1>
div = soup.find('div')                         # first <div>
p = soup.find('p')                             # first <p>

# Find by CSS class
intro = soup.find('p', class_='intro')        # <p class="intro">
title = soup.find('h1', class_='title')       # <h1 class="title">

# Find by id
content = soup.find('div', id='content')      # <div id="content">
content = soup.find(id='content')             # same result

# Find by attribute
link = soup.find('a', href=True)              # any <a> with href
link = soup.find('a', href='https://example.com')

# Find by multiple attributes
elem = soup.find('div', {'class': 'intro', 'id': 'main'})

# Shorthand (tag as attribute)
title = soup.h1                               # first <h1>
first_p = soup.p                              # first <p>
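
Gotcha: .find() returns None when nothing matches, so chaining .text or .get() on the result raises AttributeError. A defensive pattern, using a hypothetical class name:

badge = soup.find('span', class_='badge')            # not in the sample HTML
text = badge.get_text(strip=True) if badge else ''   # ✅ guard against None

# Without the guard this would crash:
# soup.find('span', class_='badge').text    # AttributeError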

.find_all() - Returns ALL matching elements:#

# Find all by tag
all_links = soup.find_all('a')                # all <a> tags
all_paragraphs = soup.find_all('p')           # all <p> tags
all_rows = soup.find_all('tr')                # all table rows

# Find all by class
all_items = soup.find_all('li', class_='item')

# Find all by attribute
all_hrefs = soup.find_all('a', href=True)

# Limit results
first_3 = soup.find_all('p', limit=3)

# Find multiple tag types
all_headings = soup.find_all(['h1', 'h2', 'h3'])

# Find by regex
import re
all_h = soup.find_all(re.compile('^h[1-6]$'))  # all heading tags
Navigating the Tree:#

# Parent/child navigation
div = soup.find('div', id='content')

# Children
children = list(div.children)       # direct children
descendants = list(div.descendants) # all descendants

# First/last child
first = div.find()                  # first child element

# Parent
parent = div.parent                 # parent element

# Siblings
next_sib = div.next_sibling          # may be a text node (e.g. whitespace)
prev_sib = div.previous_sibling
all_siblings = div.next_siblings     # generator of following siblings

# Find within element (scoped search)
ul = soup.find('ul')
items = ul.find_all('li')           # only li inside this ul

7. .get_text() - Extracting Text#

# Extract text from element
h1 = soup.find('h1')
text = h1.get_text()               # "Data Science Tools"
text = h1.text                     # same result

# Get text with separator
div = soup.find('div', id='content')
text = div.get_text(separator='\n')   # text with newlines
text = div.get_text(separator=' ')    # text with spaces
text = div.get_text(strip=True)       # strip whitespace ✅

# Get text from all matching elements
all_li = soup.find_all('li')
items = [li.get_text(strip=True) for li in all_li]
# ['Python', 'Pandas', 'Git']

# Get all text on page
all_text = soup.get_text()
all_text = soup.get_text(separator='\n', strip=True)

# Clean text
import re
clean = re.sub(r'\s+', ' ', all_text).strip()

8. .get('href') - Extracting Attributes#

# Get attribute value
link = soup.find('a')
href = link.get('href')            # ✅ safe (returns None if missing)
href = link['href']                # raises KeyError if missing

# Get any attribute
img = soup.find('img')
src = img.get('src')
alt = img.get('alt', '')          # with default value
width = img.get('width')

# Check if attribute exists
if link.has_attr('href'):
    print(link['href'])

# Get all attributes as dict
attrs = link.attrs                 # {'href': 'url', 'class': ['link']}

# Get all links on page
all_links = soup.find_all('a', href=True)
urls = [link.get('href') for link in all_links]

# Resolve relative URLs
from urllib.parse import urljoin
base_url = 'https://example.com'
absolute_urls = [urljoin(base_url, url) for url in urls]

Complete BeautifulSoup Example - Scraping a Table:#

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

def scrape_table(url):
    """
    Scrape HTML table from webpage
    ✅ Follows ethical scraping practices
    """
    # Set User-Agent to identify bot
    headers = {
        'User-Agent': 'DataScienceBot/1.0 (research purposes)'
    }

    # Fetch page
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()

    # Parse HTML
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find table
    table = soup.find('table')
    if not table:
        return None

    # Extract column names (avoid shadowing the request headers dict)
    header_row = table.find('tr')
    columns = [th.get_text(strip=True)
               for th in header_row.find_all('th')]

    # Extract rows
    rows = []
    for tr in table.find_all('tr')[1:]:  # skip header row
        cells = [td.get_text(strip=True)
                 for td in tr.find_all('td')]
        if cells:
            rows.append(cells)

    # Convert to DataFrame
    df = pd.DataFrame(rows, columns=columns)
    return df

def scrape_multiple_pages(base_url, pages):
    """Scrape multiple pages with rate limiting"""
    all_data = []

    for page in range(1, pages + 1):
        url = f'{base_url}?page={page}'

        try:
            df = scrape_table(url)
            if df is not None:
                all_data.append(df)
                print(f"✅ Scraped page {page}")
        except Exception as e:
            print(f"❌ Failed page {page}: {e}")

        time.sleep(2)           # ✅ rate limiting

    if not all_data:
        return pd.DataFrame()   # nothing scraped
    return pd.concat(all_data, ignore_index=True)
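
A hypothetical usage of the two helpers above (URL and page count are assumptions):

df = scrape_multiple_pages('https://example.com/products', pages=3)
df.to_csv('products.csv', index=False)
print(df.head())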

Complete BeautifulSoup Reference:#

# CSS Selectors (alternative to find/find_all)
soup.select('div.content')          # <div class="content">
soup.select('#main')                # id="main"
soup.select('table tr td')         # nested elements
soup.select('a[href]')             # <a> with href attribute
soup.select('p.intro, p.outro')    # multiple selectors

# First match with CSS selector
soup.select_one('h1.title')

# Practical extraction patterns:

# 1. All links
links = [(a.get_text(strip=True), a.get('href'))
         for a in soup.find_all('a', href=True)]

# 2. All images
images = [(img.get('alt', ''), img.get('src'))
          for img in soup.find_all('img', src=True)]

# 3. Table to DataFrame
import pandas as pd
from io import StringIO
tables = pd.read_html(StringIO(str(soup)))  # ✅ pandas parses HTML tables
df = tables[0]                              # first table

# 4. Specific div content
content = soup.find('div', class_='article-body')
paragraphs = [p.get_text(strip=True)
              for p in content.find_all('p')]

# 5. Meta tags
meta_desc = soup.find('meta', attrs={'name': 'description'})
description = meta_desc.get('content') if meta_desc else ''

# 6. JSON-LD structured data
import json
script = soup.find('script', type='application/ld+json')
if script:
    data = json.loads(script.string)

# 7. Next page link (pagination)
next_page = soup.find('a', {'rel': 'next'})
next_url = next_page.get('href') if next_page else None

9. Scrapy - Full Web Scraping Framework#

What is Scrapy?#

  • Full-featured, asynchronous web scraping framework
  • Built for large-scale scraping
  • Handles: requests, parsing, pipelines, rate limiting automatically
  • More complex than BeautifulSoup but much more powerful

When to Use Scrapy vs BeautifulSoup:#

Scenario                 | Use
------------------------ | -------------------------
Simple, one-off scraping | BeautifulSoup + requests
Large-scale, many pages  | Scrapy
Need JavaScript          | Selenium/Playwright
Need data pipelines      | Scrapy
Quick data extraction    | BeautifulSoup

Basic Scrapy Spider:#

import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    # Respect robots.txt
    custom_settings = {
        'ROBOTSTXT_OBEY': True,           # ✅ obey robots.txt
        'DOWNLOAD_DELAY': 2,              # ✅ 2 second delay
        'CONCURRENT_REQUESTS': 4,         # max concurrent requests
        'USER_AGENT': 'DataBot/1.0'
    }

    def parse(self, response):
        # Extract data from page
        for product in response.css('div.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get(),
                'url': product.css('a::attr(href)').get()
            }

        # Follow next page link
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
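
To run a standalone spider like this without creating a full Scrapy project, save it to a file (say product_spider.py, a hypothetical name) and use scrapy runspider:

scrapy runspider product_spider.py -o products.json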

10. requests + BeautifulSoup - Combined Pattern#

Standard Workflow:#

import requests
from bs4 import BeautifulSoup
import time

def scrape_page(url, session=None):
    """
    Standard requests + BeautifulSoup workflow
    """
    requester = session or requests

    response = requester.get(
        url,
        headers={'User-Agent': 'Research Bot 1.0'},
        timeout=30
    )
    response.raise_for_status()

    soup = BeautifulSoup(response.text, 'html.parser')
    return soup

def extract_articles(soup):
    """Extract article data from parsed HTML"""
    articles = []

    for article in soup.find_all('article'):
        title = article.find('h2')
        link = article.find('a')
        date = article.find('time')
        summary = article.find('p', class_='summary')

        articles.append({
            'title': title.get_text(strip=True) if title else '',
            'url': link.get('href') if link else '',
            'date': date.get('datetime') if date else '',
            'summary': summary.get_text(strip=True) if summary else ''
        })

    return articles

# Full pipeline
def scrape_news_site(base_url, num_pages=5):
    all_articles = []

    with requests.Session() as session:
        session.headers.update({
            'User-Agent': 'DataScience Research Bot 1.0'
        })

        for page in range(1, num_pages + 1):
            url = f'{base_url}/news?page={page}'

            try:
                soup = scrape_page(url, session)
                articles = extract_articles(soup)
                all_articles.extend(articles)
                print(f"Page {page}: {len(articles)} articles")
            except Exception as e:
                print(f"Failed page {page}: {e}")

            time.sleep(2)   # ✅ rate limiting

    return all_articles

11. Handling Dynamic Pages - Selenium & Playwright#

When is JavaScript Needed?#

Static page (BeautifulSoup works):
→ HTML is fully rendered in response
→ Data visible in page source (Ctrl+U)

Dynamic page (need Selenium/Playwright):
→ Data loaded by JavaScript after page loads
→ Data NOT in page source
→ Requires button clicks, scrolling, waiting
→ Examples: SPAs, infinite scroll, login-required pages
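
A quick heuristic for deciding, sketched below: fetch the raw HTML with requests and check whether the element you need is already in it (the CSS selector argument is whatever marks your target data):

import requests
from bs4 import BeautifulSoup

def appears_static(url, css_selector):
    """True if the target element exists in the raw HTML (no JS needed)"""
    html = requests.get(url, timeout=30).text
    return BeautifulSoup(html, 'html.parser').select_one(css_selector) is not None

# e.g. appears_static('https://example.com', 'div.product')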

Selenium Basic Usage:#

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time

# Setup driver (Chrome)
options = webdriver.ChromeOptions()
options.add_argument('--headless')     # run without browser window
options.add_argument('--no-sandbox')

driver = webdriver.Chrome(options=options)

try:
    # Navigate to page
    driver.get('https://example.com')

    # Wait for element to load
    wait = WebDriverWait(driver, 10)
    element = wait.until(
        EC.presence_of_element_located((By.CLASS_NAME, 'product'))
    )

    # Scroll to load more content
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(2)

    # Click button
    button = driver.find_element(By.ID, 'load-more')
    button.click()
    time.sleep(2)

    # Get rendered HTML and parse with BeautifulSoup
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')

    # Extract data normally
    products = soup.find_all('div', class_='product')
    for product in products:
        print(product.get_text(strip=True))

finally:
    driver.quit()               # always close browser
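
Note: Selenium 4.6+ ships with Selenium Manager, which downloads a matching driver automatically, so pip install selenium is usually the only setup needed (no manual ChromeDriver install).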

Playwright (Modern Alternative):#

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    page.goto('https://example.com')
    page.wait_for_selector('.product')  # wait for element

    # Get content
    html = page.content()
    soup = BeautifulSoup(html, 'html.parser')

    browser.close()
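
Playwright needs both the package and a browser download:

pip install playwright
playwright install chromium    # downloads the browser binary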

12. Respecting Rate Limits in Scraping#

import time
import random
import requests
from bs4 import BeautifulSoup
from urllib.robotparser import RobotFileParser

class EthicalScraper:
    """
    Scraper that respects robots.txt and rate limits
    """
    def __init__(self, base_url, delay=2):
        self.base_url = base_url
        self.delay = delay
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'ResearchBot/1.0 (academic research)'
        })

        # Load robots.txt
        self.robot_parser = RobotFileParser()
        self.robot_parser.set_url(f'{base_url}/robots.txt')
        self.robot_parser.read()

        # Get crawl delay from robots.txt
        robot_delay = self.robot_parser.crawl_delay('*')
        if robot_delay:
            self.delay = max(self.delay, robot_delay)  # use larger delay

    def fetch(self, url, retries=3):
        """Fetch URL with rate limiting and error handling"""
        # Check robots.txt
        if not self.robot_parser.can_fetch('*', url):
            print(f"Disallowed by robots.txt: {url}")
            return None

        for attempt in range(retries):
            try:
                response = self.session.get(url, timeout=30)
                response.raise_for_status()

                # ✅ Rate limiting - respect crawl delay
                # Add random jitter so requests aren't perfectly periodic
                sleep_time = self.delay + random.uniform(0, 1)
                time.sleep(sleep_time)

                return response.text

            except requests.exceptions.HTTPError as e:
                if e.response.status_code == 429:   # rate limited
                    time.sleep(60)                  # wait 1 minute
                elif e.response.status_code == 404:
                    return None                     # page not found
                else:
                    time.sleep(2 ** attempt)        # backoff

            except requests.exceptions.RequestException:
                time.sleep(2 ** attempt)

        return None

    def scrape(self, urls):
        """Scrape multiple URLs ethically"""
        results = []

        for url in urls:
            html = self.fetch(url)
            if html:
                soup = BeautifulSoup(html, 'html.parser')
                results.append({'url': url, 'soup': soup})

        return results
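
A hypothetical usage of the class (URLs are assumptions):

scraper = EthicalScraper('https://example.com', delay=2)
pages = scraper.scrape([
    'https://example.com/page1',
    'https://example.com/page2',
])
for page in pages:
    print(page['url'], page['soup'].title)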

Web Scraping - Quick Reference Card#

Tool Selection:
  Static HTML page    → requests + BeautifulSoup ✅
  JavaScript page     → Selenium or Playwright
  Large-scale scraping → Scrapy
  Simple HTTP         → requests alone

BeautifulSoup:
  soup = BeautifulSoup(html, 'html.parser')
  soup.find('tag')                    → first element
  soup.find('tag', class_='name')     → by class
  soup.find(id='name')               → by id
  soup.find_all('tag')               → all elements
  element.get_text(strip=True)       → extract text ✅
  element.get('href')                → extract attribute ✅
  soup.select('div.class > p')       → CSS selector

Ethical Scraping:
  ✅ Check robots.txt first
  ✅ Follow crawl-delay
  ✅ Add time.sleep() between requests
  ✅ Set User-Agent header
  ❌ Ignore robots.txt
  ❌ Rapid-fire requests

Common Patterns:
  # Fetch + parse
  response = requests.get(url)
  soup = BeautifulSoup(response.text, 'html.parser')

  # Extract all links
  links = [a.get('href') for a in soup.find_all('a', href=True)]

  # Extract table (newer pandas wants a file-like object)
  df = pd.read_html(StringIO(response.text))[0]

  # Extract all text
  text = soup.get_text(separator='\n', strip=True)
