11. Data Cleaning & OpenRefine#
1. OpenRefine - Overview#
- Free, open-source tool for interactive data cleaning and reconciliation
- Runs locally and is accessed through a web browser (not cloud-based)
- Works like a spreadsheet but far more powerful for messy data
- ❌ NOT for model training
- ❌ NOT for visual dashboarding
- ❌ NOT for source control
2. Text Clustering - Overview#
- Automatically groups similar text values so you can merge them
- Solves inconsistent naming problem:
"Microsoft Corp"
"MICROSOFT CORPORATION"
"microsoft"
"MSFT"
→ All same entity → cluster and merge into one
3. Clustering Algorithms#
Fingerprinting (default):#
- Normalizes text → lowercase, remove punctuation, sort words, remove duplicates
- Best for: simple variations in spacing/case
- Example: “John Smith” and “smith john” → same fingerprint
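The fingerprint idea can be sketched in a few lines of Python — a minimal approximation of OpenRefine's keying method, not its exact implementation:

```python
import re

def fingerprint(value):
    """Approximate OpenRefine's fingerprint keying: lowercase,
    strip punctuation, split into words, dedupe, sort, rejoin."""
    value = value.strip().lower()
    value = re.sub(r'[^\w\s]', '', value)   # remove punctuation
    tokens = sorted(set(value.split()))     # dedupe and sort words
    return ' '.join(tokens)

print(fingerprint("John Smith"))    # → "john smith"
print(fingerprint("smith  john"))   # → "john smith" — same cluster
```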
N-gram Fingerprint:#
- Uses character n-grams for comparison
- Best for: slight spelling differences, abbreviations
- Example: “Microsoft” and “Microsft” (typo) → most character n-grams still overlap
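A rough Python sketch of the n-gram fingerprint (again an approximation, not OpenRefine's exact code): normalize, collect all character n-grams, dedupe, sort, and concatenate. A single dropped letter changes only a couple of n-grams, so keys stay close even when plain fingerprints differ:

```python
import re

def ngram_fingerprint(value, n=2):
    """Approximate the n-gram fingerprint key: normalize the string,
    take all character n-grams, dedupe, sort, and concatenate."""
    value = re.sub(r'[^\w]', '', value.strip().lower())  # drop spaces/punct
    grams = sorted({value[i:i + n] for i in range(len(value) - n + 1)})
    return ''.join(grams)

# Word order and spacing no longer matter at all:
print(ngram_fingerprint("John Smith") == ngram_fingerprint("JohnSmith"))  # True
```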
Levenshtein (Edit Distance):#
- Counts minimum edits (insert/delete/replace) to transform one string to another
- Best for: typos, minor errors
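Edit distance is easy to compute directly with the standard dynamic-programming recurrence — a self-contained sketch:

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes,
    and replacements needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete from a
                            curr[j - 1] + 1,            # insert into a
                            prev[j - 1] + (ca != cb)))  # replace (free if equal)
        prev = curr
    return prev[-1]

print(levenshtein("Microsoft", "Microsft"))  # 1 — one deleted letter
```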
PPM (Prediction by Partial Matching):#
- Compression-based similarity
- Best for: complex variations
✅ Fingerprinting + n-gram = most effective for company name standardization (exam answer)
❌ Simple find and replace → misses variations
❌ Regular expression only → misses fuzzy matches
❌ Manual standardization → not scalable
4. GREL - Transformation Language#
What is GREL?#
- General Refine Expression Language
- Used to transform cell values programmatically in OpenRefine
Exam-Relevant Functions:#
value.toLowercase() → "APPLE INC" → "apple inc"
value.toUppercase() → "apple" → "APPLE"
value.toTitlecase() → "apple inc" → "Apple Inc"
value.trim() → removes leading/trailing spaces
value.replace("a","b") → replace substring
value.split(",")[0] → split and get first part
Chaining (exam-relevant):#
- ✅ value.toLowercase().toTitlecase() → standardize to title case
- ✅ value.toLowercase() then value.toTitlecase() - exam answer (TDS_(1) Q33)
- ❌ Manually retyping each company name → not scalable
- ❌ Delete column and recreate → loses data
- ❌ Export to Excel and use find-replace → not reproducible
5. Faceting - Grouping & Aggregating#
What is Faceting?#
- Groups rows by value for exploration and bulk editing
- Shows distribution of values in a column
Types of Facets:#
| Type | Use |
|---|---|
| Text facet | Group by text value |
| Numeric facet | Filter by number range |
| Timeline facet | Filter by date range |
| Custom facet | Use GREL expression |
Workflow for Duplicate Aggregation:#
1. Standardize names (toLowercase → toTitlecase)
2. Apply Text Facet on company name column
3. See all grouped values and counts
4. Merge similar values manually or via clustering
5. Use facet to group → aggregate/sum revenue values
- ✅ Use faceting to group then aggregate/sum values - exam answer (TDS_(1) Q34)
- ❌ Delete all duplicates → loses revenue data
- ❌ Leave duplicates → incorrect totals
- ❌ Manually total using calculator → not scalable
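Outside OpenRefine, the same standardize-facet-aggregate workflow can be sketched in pandas (the data and column names here are hypothetical):

```python
import pandas as pd

# Hypothetical data with inconsistently named duplicates
df = pd.DataFrame({
    'company': ['Apple Inc', 'APPLE INC', 'apple inc', 'Microsoft'],
    'revenue': [100, 50, 25, 200],
})

# Standardize names (toLowercase → toTitlecase, in GREL terms)
df['company'] = df['company'].str.lower().str.title()

# "Facet" = group by value; then aggregate revenue per group
totals = df.groupby('company')['revenue'].sum()
print(totals)
# Apple Inc    175
# Microsoft    200
```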
6. Operations as JSON - Reproducibility#
What are Operations?#
- Every cleaning step recorded as an operation
- Can be exported as JSON
- Can be replayed on similar datasets
Workflow:#
1. Clean dataset A in OpenRefine
2. Export operations as JSON
3. Open dataset B in OpenRefine
4. Apply (replay) saved JSON operations
5. Dataset B cleaned with same steps ✅
- ✅ Export operations as JSON → replay on similar datasets (exam answer)
- ❌ Automatically cleans new data without user action → false
- ❌ Makes software faster → false
- ❌ Deletes original data → false
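An exported operation history is a JSON array of steps. A single text-transform step looks roughly like this (field names follow OpenRefine's export format, but exact contents vary by version and by the operations recorded):

```json
[
  {
    "op": "core/text-transform",
    "engineConfig": { "mode": "row-based", "facets": [] },
    "columnName": "company",
    "expression": "grel:value.toLowercase().toTitlecase()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  }
]
```

Applying this file to dataset B via Undo/Redo → Apply replays the same transform on its `company` column.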
7. Data Quality Issues - Types#
| Issue | Description | Example |
|---|---|---|
| Inconsistent naming | Same entity spelled differently ✅ | “Apple”, “Apple Inc.”, “APPLE” |
| Missing values | NaN, null, empty cells | Rating with no review text |
| Duplicate records | Same record appears multiple times | Same customer, different format |
| Wrong data types | Numbers stored as text | “001234” losing leading zeros |
| Outliers | Extreme values | Age = 999 |
| Structural errors | Inconsistent formats | “01/12/2024” vs “2024-12-01” |
| Extra whitespace | Leading/trailing spaces | " Apple " ≠ “Apple” |
| Mixed units | Different units in same column | USD vs EUR |
- ✅ Inconsistent naming/capitalization - exam answer (TDS_(1) Q36)
8. Deduplication - Best Practice#
Problem:#
Source A: "John Smith"
Source B: "smith, john"
→ Same person, different format → duplicate after combining
Solution:#
```python
# Step 1: Normalize ("Last, First" → "First Last", title case)
def normalize_name(name):
    if ',' in name:
        parts = name.split(',')
        name = f"{parts[1].strip()} {parts[0].strip()}"
    return name.strip().title()

# Step 2: Apply and deduplicate
df['normalized'] = df['name'].apply(normalize_name)
df = df.drop_duplicates(subset=['normalized'])
```
- ✅ Normalize → standardize → deduplicate (exam answer TDS Q8)
- ❌ Keep both formats → causes incorrect counts
- ❌ Delete all duplicates arbitrarily → loses valid data
- ❌ Require manual resolution → not scalable
9. Text Matching Challenges#
Why Simple Matching Fails:#
Supplier A: "Blue T-Shirt Size M"
Supplier B: "tshirt blue m"
→ Same product, but a simple exact match fails because of:

| Challenge | Example |
|---|---|
| Word order | “Blue T-Shirt” vs “T-Shirt Blue” |
| Abbreviations | “T-Shirt” vs “tshirt” |
| Case differences | “BLUE” vs “blue” |
| Synonyms | “large” vs “L” |
| Extra words | “Size M” vs “M” |
- ✅ Effective text comparison requires accounting for word order, abbreviations, synonyms - exam answer (TDS Q10)
- ❌ Text matching is impossible with product names → false
- ❌ Product codes are different so must be different products → false
10. Fuzzy Matching#
What is Fuzzy Matching?#
- Finds strings that are similar but not identical
- Used when exact matching fails
```python
from fuzzywuzzy import fuzz

# Simple ratio
fuzz.ratio("Microsoft Corp", "Microsoft Corporation")  # 82

# Token sort ratio (order-independent) ✅
fuzz.token_sort_ratio("Blue T-Shirt", "T-Shirt Blue")  # 100

# Token set ratio (handles extra words)
fuzz.token_set_ratio("Blue T-Shirt Size M", "tshirt blue m")  # high score
```
11. String Normalization for Matching#
```python
import re

def normalize_text(text):
    text = text.lower()                    # lowercase
    text = re.sub(r'[^\w\s]', '', text)    # remove punctuation
    text = ' '.join(sorted(text.split()))  # sort words alphabetically
    return text.strip()

# Examples:
normalize_text("Blue T-Shirt Size M")  # → "blue m size tshirt"
normalize_text("tshirt blue m")        # → "blue m tshirt"
# Still not identical, but close enough for fuzzy matching
```
pandas - Data Cleaning Reference#
```python
# Detect issues
df.isnull().sum()                   # missing values per column
df.duplicated().sum()               # count duplicate rows
df.dtypes                           # check data types
df['col'].value_counts()            # see value distribution

# Fix missing values
df.dropna(subset=['col'])           # drop rows with NaN in col ✅
df.fillna(0)                        # fill with constant
df['col'].fillna(df['col'].mean())  # fill with column mean

# Fix duplicates
df.drop_duplicates(subset=['col'])  # ✅

# Fix data types
df['col'].astype('float')
pd.to_datetime(df['date_col'])

# Fix strings
df['col'].str.lower()
df['col'].str.strip()
df['col'].str.replace('Inc.', 'Inc', regex=False)  # regex=False: '.' is literal
```
Quick Reference#
OpenRefine:
Purpose → interactive data cleaning ✅
Clustering → fingerprinting + n-gram ✅
GREL → value.toLowercase().toTitlecase() ✅
Faceting → group and aggregate duplicates ✅
Operations → export as JSON, replay on new datasets ✅
Data quality issues:
Inconsistent naming/capitalization ✅ (exam answer)
Missing values, duplicates, wrong types, outliers
Text matching challenges:
Word order, abbreviations, synonyms, case ✅
Deduplication:
Normalize → standardize → deduplicate ✅
Fuzzy matching:
fuzz.token_sort_ratio() → order-independent ✅