11. Data Cleaning & OpenRefine#

1. OpenRefine - Overview#

  • Free, open-source tool for interactive data cleaning and reconciliation
  • Runs locally; the interface opens in your browser, but data never leaves your machine (not cloud-based)
  • Works like a spreadsheet but far more powerful for messy data
  • ❌ NOT for model training
  • ❌ NOT for visual dashboarding
  • ❌ NOT for source control

2. Text Clustering - Overview#

  • Automatically groups similar text values so you can merge them
  • Solves inconsistent naming problem:
"Microsoft Corp"
"MICROSOFT CORPORATION"
"microsoft"
"MSFT"
→ All same entity → cluster and merge into one

3. Clustering Algorithms#

Fingerprinting (default):#

  • Normalizes text → lowercase, remove punctuation, sort words, remove duplicates
  • Best for: simple variations in spacing/case
  • Example: “John Smith” and “smith john” → same fingerprint
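
The fingerprint keying step can be sketched in a few lines of Python (a simplified version, not OpenRefine's exact implementation):

```python
import re

def fingerprint(value):
    """Simplified fingerprint key: lowercase, strip punctuation,
    split into words, deduplicate, sort, rejoin."""
    value = re.sub(r'[^\w\s]', '', value.strip().lower())
    return ' '.join(sorted(set(value.split())))

# Values that produce the same key land in the same cluster:
fingerprint("John Smith")   # → 'john smith'
fingerprint("smith, john")  # → 'john smith'
```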

N-gram Fingerprint:#

  • Uses character n-grams for comparison
  • Best for: slight spelling differences, abbreviations
  • Example: “Micrsoft” (typo) vs “Microsoft” → their n-gram sets are nearly identical
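
The n-gram variant can be sketched the same way (again a simplification of what OpenRefine does; spaces are stripped before the grams are built, so spacing variants collapse):

```python
import re

def ngram_fingerprint(value, n=2):
    """Simplified n-gram key: lowercase, keep only word characters
    (spaces and punctuation removed), then sort the unique character n-grams."""
    value = re.sub(r'[^\w]', '', value.lower())
    grams = sorted({value[i:i + n] for i in range(len(value) - n + 1)})
    return ''.join(grams)

# Spacing/punctuation/case variants collapse to the same key:
ngram_fingerprint("Micro-Soft")  # same key as ngram_fingerprint("microsoft")
```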

Levenshtein (Edit Distance):#

  • Counts minimum edits (insert/delete/replace) to transform one string to another
  • Best for: typos, minor errors
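
A compact dynamic-programming implementation of the edit distance (illustrative; OpenRefine ships its own):

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))          # distances from "" to each b-prefix
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

levenshtein("kitten", "sitting")      # → 3 (classic example)
levenshtein("Micrsoft", "Microsoft")  # → 1 (one missing letter)
```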

PPM:#

  • Compression-based similarity (Prediction by Partial Matching)
  • Best for: complex variations

  • ✅ Fingerprinting + n-gram = most effective for company name standardization (exam answer)
  • ❌ Simple find-and-replace → misses variations
  • ❌ Regular expressions only → miss fuzzy matches
  • ❌ Manual standardization → not scalable
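
PPM itself is a compression model; the idea behind compression-based similarity can be sketched with a normalized compression distance, here using zlib in place of PPM (an approximation for illustration, not what OpenRefine actually runs):

```python
import zlib

def ncd(a, b):
    """Normalized compression distance: similar strings compress
    better together than apart (lower score = more similar)."""
    ca = len(zlib.compress(a.encode()))
    cb = len(zlib.compress(b.encode()))
    cab = len(zlib.compress((a + b).encode()))
    return (cab - min(ca, cb)) / max(ca, cb)

# Related names score lower (closer) than unrelated ones:
ncd("Microsoft Corporation", "Microsoft Corp")       # small
ncd("Microsoft Corporation", "Blue T-Shirt Size M")  # larger
```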


4. GREL - Transformation Language#

What is GREL?#

  • General Refine Expression Language
  • Used to transform cell values programmatically in OpenRefine

Exam-Relevant Functions:#

value.toLowercase()     → "APPLE INC" → "apple inc"
value.toUppercase()     → "apple" → "APPLE"
value.toTitlecase()     → "apple inc" → "Apple Inc"
value.trim()            → removes leading/trailing spaces
value.replace("a","b")  → replace substring
value.split(",")[0]     → split and get first part

Chaining (exam-relevant):#

value.toLowercase().toTitlecase()   ✅ standardize to title case
  • value.toLowercase() then value.toTitlecase() - exam answer (TDS_(1) Q33)
  • ❌ Manually retyping each company name → not scalable
  • ❌ Delete column and recreate → loses data
  • ❌ Export to Excel and use find-replace → not reproducible
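
Outside OpenRefine, the same chain maps directly onto pandas string methods (hypothetical `company` column for illustration):

```python
import pandas as pd

df = pd.DataFrame({'company': ['APPLE INC', '  microsoft corp', 'Google llc']})

# Equivalent of GREL value.trim().toLowercase().toTitlecase():
df['company'] = df['company'].str.strip().str.lower().str.title()

print(df['company'].tolist())  # ['Apple Inc', 'Microsoft Corp', 'Google Llc']
```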

5. Faceting - Grouping & Aggregating#

What is Faceting?#

  • Groups rows by value for exploration and bulk editing
  • Shows distribution of values in a column

Types of Facets:#

| Type           | Use                     |
|----------------|-------------------------|
| Text facet     | Group by text value     |
| Numeric facet  | Filter by number range  |
| Timeline facet | Filter by date range    |
| Custom facet   | Use a GREL expression   |

Workflow for Duplicate Aggregation:#

1. Standardize names (toLowercase → toTitlecase)
2. Apply Text Facet on company name column
3. See all grouped values and counts
4. Merge similar values manually or via clustering
5. Use facet to group → aggregate/sum revenue values
  • ✅ Use faceting to group then aggregate/sum values - exam answer (TDS_(1) Q34)
  • ❌ Delete all duplicates → loses revenue data
  • ❌ Leave duplicates → incorrect totals
  • ❌ Manually total using calculator → not scalable
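
In pandas terms, steps 1–5 boil down to standardize-then-groupby (hypothetical `company`/`revenue` columns):

```python
import pandas as pd

df = pd.DataFrame({
    'company': ['Apple Inc', 'apple inc', 'Microsoft', 'APPLE INC'],
    'revenue': [100, 50, 200, 25],
})

# Step 1: standardize names (the toLowercase -> toTitlecase chain)
df['company'] = df['company'].str.lower().str.title()

# Steps 2-5: group ("facet") by company, then aggregate revenue
totals = df.groupby('company')['revenue'].sum()
print(totals.to_dict())  # {'Apple Inc': 175, 'Microsoft': 200}
```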

6. Operations as JSON - Reproducibility#

What are Operations?#

  • Every cleaning step recorded as an operation
  • Can be exported as JSON
  • Can be replayed on similar datasets

Workflow:#

1. Clean dataset A in OpenRefine
2. Export operations as JSON
3. Open dataset B in OpenRefine
4. Apply (replay) saved JSON operations
5. Dataset B cleaned with same steps ✅
  • ✅ Export operations as JSON → replay on similar datasets (exam answer)
  • ❌ Automatically cleans new data without user action → false
  • ❌ Makes software faster → false
  • ❌ Deletes original data → false

7. Data Quality Issues - Types#

| Issue               | Description                        | Example                                  |
|---------------------|------------------------------------|------------------------------------------|
| Inconsistent naming | Same entity spelled differently ✅ | “Apple”, “Apple Inc.”, “APPLE”           |
| Missing values      | NaN, null, empty cells             | Rating with no review text               |
| Duplicate records   | Same record appears multiple times | Same customer, different format          |
| Wrong data types    | Numbers stored as text             | “001234” losing its leading zeros when cast |
| Outliers            | Extreme values                     | Age = 999                                |
| Structural errors   | Inconsistent formats               | “01/12/2024” vs “2024-12-01”             |
| Extra whitespace    | Leading/trailing spaces            | " Apple " ≠ “Apple”                      |
| Mixed units         | Different units in same column     | USD vs EUR                               |
  • ✅ Inconsistent naming/capitalization - exam answer (TDS_(1) Q36)

8. Deduplication - Best Practice#

Problem:#

Source A: "John Smith"
Source B: "smith, john"
→ Same person, different format → duplicate after combining

Solution:#

# Step 1: Normalize
def normalize_name(name):
    if ',' in name:
        parts = name.split(',')
        name = f"{parts[1].strip()} {parts[0].strip()}"
    return name.strip().title()

# Step 2: Apply and deduplicate
df['normalized'] = df['name'].apply(normalize_name)
df = df.drop_duplicates(subset=['normalized'])
  • ✅ Normalize → standardize → deduplicate (exam answer TDS Q8)
  • ❌ Keep both formats → causes incorrect counts
  • ❌ Delete all duplicates arbitrarily → loses valid data
  • ❌ Require manual resolution → not scalable
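
A quick end-to-end check of the snippet above (`normalize_name` is repeated so this runs standalone; the sample names are made up):

```python
import pandas as pd

def normalize_name(name):
    # Flip "last, first" into "first last", then title-case
    if ',' in name:
        parts = name.split(',')
        name = f"{parts[1].strip()} {parts[0].strip()}"
    return name.strip().title()

df = pd.DataFrame({'name': ['John Smith', 'smith, john', 'JOHN SMITH', 'Jane Doe']})
df['normalized'] = df['name'].apply(normalize_name)
df = df.drop_duplicates(subset=['normalized'])   # keeps first of each group

print(df['normalized'].tolist())  # ['John Smith', 'Jane Doe']
```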

9. Text Matching Challenges#

Why Simple Matching Fails:#

Supplier A: "Blue T-Shirt Size M"
Supplier B: "tshirt blue m"
→ Same product, simple match fails because of:
| Challenge        | Example                          |
|------------------|----------------------------------|
| Word order       | “Blue T-Shirt” vs “T-Shirt Blue” |
| Abbreviations    | “T-Shirt” vs “tshirt”            |
| Case differences | “BLUE” vs “blue”                 |
| Synonyms         | “large” vs “L”                   |
| Extra words      | “Size M” vs “M”                  |
  • ✅ Effective text comparison requires accounting for word order, abbreviations, synonyms - exam answer (TDS Q10)
  • ❌ Text matching is impossible with product names → false
  • ❌ Product codes are different so must be different products → false
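
To make the failure concrete: exact comparison says the two listings differ, while a crude order-insensitive key plus token overlap (a sketch, not a production matcher) shows they are close:

```python
import re

def match_key(text):
    """Lowercase, drop punctuation ("t-shirt" -> "tshirt"),
    and sort tokens so word order stops mattering."""
    text = re.sub(r'[^\w\s]', '', text.lower())
    return ' '.join(sorted(text.split()))

a, b = "Blue T-Shirt Size M", "tshirt blue m"

a == b                                # False: exact match fails outright
ka, kb = match_key(a), match_key(b)   # 'blue m size tshirt' vs 'blue m tshirt'

# Jaccard overlap of the token sets absorbs the extra "size" token:
ta, tb = set(ka.split()), set(kb.split())
jaccard = len(ta & tb) / len(ta | tb)   # 0.75 -> likely the same product
```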

10. Fuzzy Matching#

What is Fuzzy Matching?#

  • Finds strings that are similar but not identical
  • Used when exact matching fails (the fuzzywuzzy library is now maintained as thefuzz, with the same API)
from fuzzywuzzy import fuzz

# Simple ratio
fuzz.ratio("Microsoft Corp", "Microsoft Corporation")    # 80

# Token sort ratio (order-independent) ✅
fuzz.token_sort_ratio("Blue T-Shirt", "T-Shirt Blue")   # 100

# Token set ratio (more tolerant of extra words)
fuzz.token_set_ratio("Blue T-Shirt Size M", "tshirt blue m")  # score limited: "t-shirt" and "tshirt" still tokenize differently

11. String Normalization for Matching#

import re

def normalize_text(text):
    text = text.lower()                      # lowercase
    text = re.sub(r'[^\w\s]', '', text)      # remove punctuation
    text = ' '.join(sorted(text.split()))    # sort words alphabetically
    return text.strip()

# Examples:
normalize_text("Blue T-Shirt Size M")   # → "blue m size tshirt"
normalize_text("tshirt blue m")         # → "blue m tshirt"
# Still not identical, but much closer for fuzzy matching

pandas - Data Cleaning Reference#

# Detect issues
df.isnull().sum()           # missing values per column
df.duplicated().sum()       # count duplicates
df.dtypes                   # check data types
df['col'].value_counts()    # see value distribution

# Fix missing values
df.dropna(subset=['col'])   # drop rows with NaN in col ✅
df.fillna(0)                # fill with constant
df['col'].fillna(df['col'].mean())  # fill with mean

# Fix duplicates
df.drop_duplicates(subset=['col'])  # ✅

# Fix data types
df['col'].astype('float')
pd.to_datetime(df['date_col'])

# Fix strings
df['col'].str.lower()
df['col'].str.strip()
df['col'].str.replace('Inc.', 'Inc', regex=False)  # regex=False treats '.' literally

Quick Reference#

OpenRefine:
  Purpose     → interactive data cleaning ✅
  Clustering  → fingerprinting + n-gram ✅
  GREL        → value.toLowercase().toTitlecase() ✅
  Faceting    → group and aggregate duplicates ✅
  Operations  → export as JSON, replay on new datasets ✅

Data quality issues:
  Inconsistent naming/capitalization ✅ (exam answer)
  Missing values, duplicates, wrong types, outliers

Text matching challenges:
  Word order, abbreviations, synonyms, case ✅

Deduplication:
  Normalize → standardize → deduplicate ✅

Fuzzy matching:
  fuzz.token_sort_ratio() → order-independent ✅