11. Data Cleaning & OpenRefine#

1. OpenRefine - Overview#

  • Free, open-source tool for interactive data cleaning and reconciliation
  • Runs locally; the interface opens in your browser, but data never leaves your machine (not cloud-based)
  • Works like a spreadsheet but far more powerful for messy data
  • ❌ NOT for model training
  • ❌ NOT for visual dashboarding
  • ❌ NOT for source control

2. Text Clustering - Overview#

  • Automatically groups similar text values so you can merge them
  • Solves inconsistent naming problem:
"Microsoft Corp"
"MICROSOFT CORPORATION"
"microsoft"
"MSFT"
→ All same entity → cluster and merge into one

3. Clustering Algorithms#

Fingerprinting (default):#

  • Normalizes text → lowercase, remove punctuation, sort words, remove duplicates
  • Best for: simple variations in spacing/case
  • Example: “John Smith” and “smith john” → same fingerprint
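
The fingerprint keying step can be sketched in a few lines of Python (a simplified version, not OpenRefine's exact implementation):

```python
import re

def fingerprint(value):
    """Simplified fingerprint key: lowercase, strip punctuation,
    split into words, deduplicate, sort, rejoin."""
    value = re.sub(r'[^\w\s]', '', value.strip().lower())
    return ' '.join(sorted(set(value.split())))

# Values that produce the same key land in the same cluster:
fingerprint("John Smith")   # → 'john smith'
fingerprint("smith, john")  # → 'john smith'
```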

N-gram Fingerprint:#

  • Uses character n-grams for comparison
  • Best for: slight spelling differences, abbreviations
  • Example: “Micrsoft” (typo) vs “Microsoft” → their n-gram sets are nearly identical
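
The n-gram variant can be sketched the same way (again a simplification of what OpenRefine does; spaces are stripped before the grams are built, so spacing variants collapse):

```python
import re

def ngram_fingerprint(value, n=2):
    """Simplified n-gram key: lowercase, keep only word characters
    (spaces and punctuation removed), then sort the unique character n-grams."""
    value = re.sub(r'[^\w]', '', value.lower())
    grams = sorted({value[i:i + n] for i in range(len(value) - n + 1)})
    return ''.join(grams)

# Spacing/punctuation/case variants collapse to the same key:
ngram_fingerprint("Micro-Soft")  # same key as ngram_fingerprint("microsoft")
```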

Levenshtein (Edit Distance):#

  • Counts minimum edits (insert/delete/replace) to transform one string to another
  • Best for: typos, minor errors
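
A compact dynamic-programming implementation of the edit distance (illustrative; OpenRefine ships its own):

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))          # distances from "" to each b-prefix
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

levenshtein("kitten", "sitting")      # → 3 (classic example)
levenshtein("Micrsoft", "Microsoft")  # → 1 (one missing letter)
```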

PPM:#

  • Compression-based similarity (Prediction by Partial Matching)
  • Best for: complex variations

  • ✅ Fingerprinting + n-gram = most effective for company name standardization (exam answer)
  • ❌ Simple find-and-replace → misses variations
  • ❌ Regular expressions only → miss fuzzy matches
  • ❌ Manual standardization → not scalable
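
PPM itself is a compression model; the idea behind compression-based similarity can be sketched with a normalized compression distance, here using zlib in place of PPM (an approximation for illustration, not what OpenRefine actually runs):

```python
import zlib

def ncd(a, b):
    """Normalized compression distance: similar strings compress
    better together than apart (lower score = more similar)."""
    ca = len(zlib.compress(a.encode()))
    cb = len(zlib.compress(b.encode()))
    cab = len(zlib.compress((a + b).encode()))
    return (cab - min(ca, cb)) / max(ca, cb)

# Related names score lower (closer) than unrelated ones:
ncd("Microsoft Corporation", "Microsoft Corp")       # small
ncd("Microsoft Corporation", "Blue T-Shirt Size M")  # larger
```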


4. GREL - Transformation Language#

What is GREL?#

  • General Refine Expression Language
  • Used to transform cell values programmatically in OpenRefine

Exam-Relevant Functions:#

value.toLowercase()     → "APPLE INC" → "apple inc"
value.toUppercase()     → "apple" → "APPLE"
value.toTitlecase()     → "apple inc" → "Apple Inc"
value.trim()            → removes leading/trailing spaces
value.replace("a","b")  → replace substring
value.split(",")[0]     → split and get first part

Chaining (exam-relevant):#

value.toLowercase().toTitlecase()   ✅ standardize to title case
  • value.toLowercase() then value.toTitlecase() - exam answer (TDS_(1) Q33)
  • ❌ Manually retyping each company name → not scalable
  • ❌ Delete column and recreate → loses data
  • ❌ Export to Excel and use find-replace → not reproducible
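
Outside OpenRefine, the same chain maps directly onto pandas string methods (hypothetical `company` column for illustration):

```python
import pandas as pd

df = pd.DataFrame({'company': ['APPLE INC', '  microsoft corp', 'Google llc']})

# Equivalent of GREL value.trim().toLowercase().toTitlecase():
df['company'] = df['company'].str.strip().str.lower().str.title()

print(df['company'].tolist())  # ['Apple Inc', 'Microsoft Corp', 'Google Llc']
```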

5. Faceting - Grouping & Aggregating#

What is Faceting?#

  • Groups rows by value for exploration and bulk editing
  • Shows distribution of values in a column

Types of Facets:#

| Type           | Use                     |
|----------------|-------------------------|
| Text facet     | Group by text value     |
| Numeric facet  | Filter by number range  |
| Timeline facet | Filter by date range    |
| Custom facet   | Use a GREL expression   |

Workflow for Duplicate Aggregation:#

1. Standardize names (toLowercase → toTitlecase)
2. Apply Text Facet on company name column
3. See all grouped values and counts
4. Merge similar values manually or via clustering
5. Use facet to group → aggregate/sum revenue values
  • ✅ Use faceting to group then aggregate/sum values - exam answer (TDS_(1) Q34)
  • ❌ Delete all duplicates → loses revenue data
  • ❌ Leave duplicates → incorrect totals
  • ❌ Manually total using calculator → not scalable
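
In pandas terms, steps 1–5 boil down to standardize-then-groupby (hypothetical `company`/`revenue` columns):

```python
import pandas as pd

df = pd.DataFrame({
    'company': ['Apple Inc', 'apple inc', 'Microsoft', 'APPLE INC'],
    'revenue': [100, 50, 200, 25],
})

# Step 1: standardize names (the toLowercase -> toTitlecase chain)
df['company'] = df['company'].str.lower().str.title()

# Steps 2-5: group ("facet") by company, then aggregate revenue
totals = df.groupby('company')['revenue'].sum()
print(totals.to_dict())  # {'Apple Inc': 175, 'Microsoft': 200}
```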

6. Operations as JSON - Reproducibility#

What are Operations?#

  • Every cleaning step recorded as an operation
  • Can be exported as JSON
  • Can be replayed on similar datasets

Workflow:#

1. Clean dataset A in OpenRefine
2. Export operations as JSON
3. Open dataset B in OpenRefine
4. Apply (replay) saved JSON operations
5. Dataset B cleaned with same steps ✅
  • ✅ Export operations as JSON → replay on similar datasets (exam answer)
  • ❌ Automatically cleans new data without user action → false
  • ❌ Makes software faster → false
  • ❌ Deletes original data → false

7. Data Quality Issues - Types#

| Issue               | Description                        | Example                                  |
|---------------------|------------------------------------|------------------------------------------|
| Inconsistent naming | Same entity spelled differently ✅ | “Apple”, “Apple Inc.”, “APPLE”           |
| Missing values      | NaN, null, empty cells             | Rating with no review text               |
| Duplicate records   | Same record appears multiple times | Same customer, different format          |
| Wrong data types    | Numbers stored as text             | “001234” losing its leading zeros when cast |
| Outliers            | Extreme values                     | Age = 999                                |
| Structural errors   | Inconsistent formats               | “01/12/2024” vs “2024-12-01”             |
| Extra whitespace    | Leading/trailing spaces            | " Apple " ≠ “Apple”                      |
| Mixed units         | Different units in same column     | USD vs EUR                               |
  • ✅ Inconsistent naming/capitalization - exam answer (TDS_(1) Q36)

8. Deduplication - Best Practice#

Problem:#

Source A: "John Smith"
Source B: "smith, john"
→ Same person, different format → duplicate after combining

Solution:#

# Step 1: Normalize
def normalize_name(name):
    if ',' in name:
        parts = name.split(',')
        name = f"{parts[1].strip()} {parts[0].strip()}"
    return name.strip().title()

# Step 2: Apply and deduplicate
df['normalized'] = df['name'].apply(normalize_name)
df = df.drop_duplicates(subset=['normalized'])
  • ✅ Normalize → standardize → deduplicate (exam answer TDS Q8)
  • ❌ Keep both formats → causes incorrect counts
  • ❌ Delete all duplicates arbitrarily → loses valid data
  • ❌ Require manual resolution → not scalable
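
A quick end-to-end check of the snippet above (`normalize_name` is repeated so this runs standalone; the sample names are made up):

```python
import pandas as pd

def normalize_name(name):
    # Flip "last, first" into "first last", then title-case
    if ',' in name:
        parts = name.split(',')
        name = f"{parts[1].strip()} {parts[0].strip()}"
    return name.strip().title()

df = pd.DataFrame({'name': ['John Smith', 'smith, john', 'JOHN SMITH', 'Jane Doe']})
df['normalized'] = df['name'].apply(normalize_name)
df = df.drop_duplicates(subset=['normalized'])   # keeps first of each group

print(df['normalized'].tolist())  # ['John Smith', 'Jane Doe']
```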

9. Text Matching Challenges#

Why Simple Matching Fails:#

Supplier A: "Blue T-Shirt Size M"
Supplier B: "tshirt blue m"
→ Same product, simple match fails because of:
| Challenge        | Example                          |
|------------------|----------------------------------|
| Word order       | “Blue T-Shirt” vs “T-Shirt Blue” |
| Abbreviations    | “T-Shirt” vs “tshirt”            |
| Case differences | “BLUE” vs “blue”                 |
| Synonyms         | “large” vs “L”                   |
| Extra words      | “Size M” vs “M”                  |
  • ✅ Effective text comparison requires accounting for word order, abbreviations, synonyms - exam answer (TDS Q10)
  • ❌ Text matching is impossible with product names → false
  • ❌ Product codes are different so must be different products → false
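
To make the failure concrete: exact comparison says the two listings differ, while a crude order-insensitive key plus token overlap (a sketch, not a production matcher) shows they are close:

```python
import re

def match_key(text):
    """Lowercase, drop punctuation ("t-shirt" -> "tshirt"),
    and sort tokens so word order stops mattering."""
    text = re.sub(r'[^\w\s]', '', text.lower())
    return ' '.join(sorted(text.split()))

a, b = "Blue T-Shirt Size M", "tshirt blue m"

a == b                                # False: exact match fails outright
ka, kb = match_key(a), match_key(b)   # 'blue m size tshirt' vs 'blue m tshirt'

# Jaccard overlap of the token sets absorbs the extra "size" token:
ta, tb = set(ka.split()), set(kb.split())
jaccard = len(ta & tb) / len(ta | tb)   # 0.75 -> likely the same product
```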

10. Fuzzy Matching#

What is Fuzzy Matching?#

  • Finds strings that are similar but not identical
  • Used when exact matching fails (the fuzzywuzzy library is now maintained as thefuzz, with the same API)
from fuzzywuzzy import fuzz

# Simple ratio
fuzz.ratio("Microsoft Corp", "Microsoft Corporation")    # 80

# Token sort ratio (order-independent) ✅
fuzz.token_sort_ratio("Blue T-Shirt", "T-Shirt Blue")   # 100

# Token set ratio (more tolerant of extra words)
fuzz.token_set_ratio("Blue T-Shirt Size M", "tshirt blue m")  # score limited: "t-shirt" and "tshirt" still tokenize differently

11. String Normalization for Matching#

import re

def normalize_text(text):
    text = text.lower()                      # lowercase
    text = re.sub(r'[^\w\s]', '', text)      # remove punctuation
    text = ' '.join(sorted(text.split()))    # sort words alphabetically
    return text.strip()

# Examples:
normalize_text("Blue T-Shirt Size M")   # → "blue m size tshirt"
normalize_text("tshirt blue m")         # → "blue m tshirt"
# Still not identical, but much closer for fuzzy matching

pandas - Data Cleaning Reference#

# Detect issues
df.isnull().sum()           # missing values per column
df.duplicated().sum()       # count duplicates
df.dtypes                   # check data types
df['col'].value_counts()    # see value distribution

# Fix missing values
df.dropna(subset=['col'])   # drop rows with NaN in col ✅
df.fillna(0)                # fill with constant
df['col'].fillna(df['col'].mean())  # fill with mean

# Fix duplicates
df.drop_duplicates(subset=['col'])  # ✅

# Fix data types
df['col'].astype('float')
pd.to_datetime(df['date_col'])

# Fix strings
df['col'].str.lower()
df['col'].str.strip()
df['col'].str.replace('Inc.', 'Inc', regex=False)  # regex=False treats '.' literally

Quick Reference#

OpenRefine:
  Purpose     → interactive data cleaning ✅
  Clustering  → fingerprinting + n-gram ✅
  GREL        → value.toLowercase().toTitlecase() ✅
  Faceting    → group and aggregate duplicates ✅
  Operations  → export as JSON, replay on new datasets ✅

Data quality issues:
  Inconsistent naming/capitalization ✅ (exam answer)
  Missing values, duplicates, wrong types, outliers

Text matching challenges:
  Word order, abbreviations, synonyms, case ✅

Deduplication:
  Normalize → standardize → deduplicate ✅

Fuzzy matching:
  fuzz.token_sort_ratio() → order-independent ✅