1. Markdown (.md)#
- Lightweight, human-readable documentation format
- Converts easily to HTML
- Simple syntax:
# Heading 1
## Heading 2
**bold**, *italic*, `code`
- bullet point
[link text](url)

- Best for: README files, documentation, notebooks, wikis
- ❌ NOT suitable for complex formatted documents (use DOCX)
- ✅ Plain text, not binary - fully readable without special software
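The "converts easily to HTML" point can be sketched with a few stdlib regex substitutions - a toy single-line converter for illustration only (real projects would use a library such as `markdown` or `mistune`):

```python
import re

def md_to_html(line: str) -> str:
    """Convert one line of basic Markdown to HTML (toy sketch)."""
    # Headings: '# Heading 1' -> <h1>, '## Heading 2' -> <h2>, ...
    m = re.match(r'^(#{1,6})\s+(.*)$', line)
    if m:
        level = len(m.group(1))
        return f'<h{level}>{m.group(2)}</h{level}>'
    # Inline styles: **bold**, *italic*, `code`
    line = re.sub(r'\*\*(.+?)\*\*', r'<strong>\1</strong>', line)
    line = re.sub(r'\*(.+?)\*', r'<em>\1</em>', line)
    line = re.sub(r'`(.+?)`', r'<code>\1</code>', line)
    # Links: [link text](url)
    line = re.sub(r'\[(.+?)\]\((.+?)\)', r'<a href="\2">\1</a>', line)
    return line

print(md_to_html('## Heading 2'))         # <h2>Heading 2</h2>
print(md_to_html('**bold** and `code`'))  # <strong>bold</strong> and <code>code</code>
```

This covers only the syntax shown above; nested or multi-line constructs need a real parser.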
2. DOCX (Word Document)#
- Microsoft Word format
- Supports rich formatting: tables, headers, page numbers, styles
- NOT ideal for lightweight documentation
- NOT easily converted to HTML without tools
- Best for: formal reports, letters, documents sent to clients
- ❌ Binary format - not human-readable as plain text
- ❌ Requires Word or compatible software to open properly
3. Binary Blob#
- Raw binary data with no human-readable structure
- Cannot be opened in text editor meaningfully
- Examples: compiled files, encrypted data, raw image data
- ❌ NOT suitable for documentation
- ❌ NOT human-readable
4. Parquet#
- Columnar storage format
- Most efficient for large numeric arrays
- Minimal storage overhead - uses compression
- Stores data type information (schema-aware)
- Best for: big data, analytics pipelines, data lakes
- Comparison:
| Format | Human-Readable | Efficient | Schema | Best For |
|---|---|---|---|---|
| Parquet | ❌ | ✅✅✅ | ✅ | Large numeric arrays |
| CSV | ✅ | ❌ | ❌ | Sharing with stakeholders |
| JSON | ✅ | ❌ | Partial | APIs, nested data |
| XML | ✅ | ❌❌ | ✅ | Legacy systems |
# Reading/writing Parquet with pandas (requires pyarrow or fastparquet)
import pandas as pd

df.to_parquet('data.parquet')
df = pd.read_parquet('data.parquet')
5. XML (eXtensible Markup Language)#
- Human-readable but very verbose
- Uses tags: <tag>value</tag>
- High overhead - lots of repeated tag text
- NOT efficient for large numeric data
- Best for: configuration files, legacy data exchange, SOAP APIs
<!-- Example: verbose compared to other formats -->
<record>
  <name>John</name>
  <age>25</age>
  <salary>50000</salary>
</record>
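The record above can be parsed with Python's stdlib `xml.etree.ElementTree`. A minimal sketch - note that every value comes back as a string, so numbers must be converted explicitly:

```python
import xml.etree.ElementTree as ET

xml_text = """
<record>
  <name>John</name>
  <age>25</age>
  <salary>50000</salary>
</record>
"""

root = ET.fromstring(xml_text)
name = root.find('name').text           # 'John'
age = int(root.find('age').text)        # tags carry no type info: convert manually
salary = int(root.find('salary').text)
print(name, age, salary)  # John 25 50000
```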
6. JSON (JavaScript Object Notation)#
- Human-readable, flexible format
- Supports nested structures
- NOT efficient for large numeric arrays
- Best for: APIs, configuration files, nested/hierarchical data
import json
import requests  # third-party: pip install requests

# Reading JSON
with open('data.json') as f:
    data = json.load(f)

# Writing JSON
with open('output.json', 'w') as f:
    json.dump(data, f, indent=2)

# Parse JSON string
data = json.loads('{"name": "John", "age": 25}')

# Convert to JSON string
json_str = json.dumps(data, indent=2)

# From API response
response = requests.get(url)
data = response.json()  # parse JSON response directly
7. CSV (Comma-Separated Values)#
- Simple, human-readable tabular format
- Best for sharing results with non-technical stakeholders
- Can be opened directly in Excel
- No schema enforcement - all values stored as text
- Limitations: no data types, no nested structures, no compression
import pandas as pd
import csv

# pandas
df = pd.read_csv('data.csv')
df.to_csv('output.csv', index=False)

# Custom delimiter
df = pd.read_csv('data.tsv', sep='\t')  # tab-delimited

# Built-in csv module
with open('data.csv') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row)
- ✅ Best for sharing summary results with non-technical stakeholders
- ❌ Python pickle (.pkl) - not human-readable, not cross-platform safe
- ❌ JSON with nested structures - harder to open in Excel
- ❌ Binary database file - not accessible without tools
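The "no data types" limitation is easy to demonstrate: everything the csv module reads back is a string, so numeric columns need explicit conversion (a sketch using an in-memory file instead of a real one):

```python
import csv
import io

# Simulate a CSV file in memory
csv_text = "name,age,salary\nJohn,25,50000\n"

reader = csv.DictReader(io.StringIO(csv_text))
row = next(reader)

print(type(row['age']))   # <class 'str'> - not int!
age = int(row['age'])     # convert explicitly
salary = float(row['salary'])
print(age + 1, salary)    # 26 50000.0
```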
8. Format Comparison Table#
| Format | Readable | Size | Types | Nested | Best Use |
|---|---|---|---|---|---|
| Markdown | ✅ | Small | Text | ❌ | Documentation |
| CSV | ✅ | Medium | No | ❌ | Stakeholder sharing |
| JSON | ✅ | Medium | Partial | ✅ | APIs, config |
| XML | ✅ | Large | ✅ | ✅ | Legacy, SOAP |
| Parquet | ❌ | Small | ✅ | ✅ | Big data, analytics |
| DOCX | ❌ | Medium | ✅ | ❌ | Formal documents |
| Pickle | ❌ | Small | ✅ | ✅ | Python-only objects |
| HDF5 | ❌ | Small | ✅ | ✅ | Scientific data |
9. Excel Text-to-Columns - Auto-Date Conversion Problem#
Problem:#
Tab-delimited file contains: "3/10"
Excel auto-interprets as: "March 10" (date)
Fix:#
- ✅ Format the column as Text BEFORE applying Text-to-Columns
- This tells Excel to treat the column as plain text, not numbers or dates
Steps:#
1. Select the column(s) in Excel
2. Format → Cells → Text
3. THEN apply Data → Text to Columns
4. In the wizard → select "Text" as column data format
- ❌ Changing delimiter to commas → doesn’t fix data type issue
- ❌ Using General format → still auto-converts to date
- ❌ Manually changing cells after → tedious, error-prone
10. Excel Text-to-Columns - Leading Zeros Problem#
Problem:#
File contains: "001234"
Excel auto-converts to: 1234 (drops leading zeros)
Fix:#
- ✅ Format cells as Text BEFORE using Text-to-Columns
- Same fix as above - Text format preserves leading zeros
Common cases where leading zeros matter:#
- Postal/ZIP codes: 00123
- Phone numbers: 0412345678
- Product codes: 001234
- Student IDs: 0045
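Python's csv module sidesteps both Excel problems above, because it never guesses types - every field comes back as a literal string, so leading zeros and date-like values survive untouched (a small in-memory sketch):

```python
import csv
import io

# Simulated tab-delimited file with values Excel would mangle
tsv_text = "zip\tproduct_code\tdelivery\n00123\t001234\t3/10\n"

reader = csv.DictReader(io.StringIO(tsv_text), delimiter='\t')
row = next(reader)

print(row['zip'])           # 00123 - leading zeros preserved
print(row['product_code'])  # 001234
print(row['delivery'])      # 3/10 - stays a string, not a date
```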
11. The Universal Fix - Format as Text First#
When to ALWAYS format as Text before import:#
- Values that look like dates but aren't (3/10, 1-2, 12/5)
- Values with leading zeros (001234, 0045)
- Long numeric IDs that might be converted to scientific notation
- Values with special characters that might be misinterpreted
In Text-to-Columns Wizard:#
Step 3 of 3 in wizard:
→ Click on the problematic column
→ Under "Column data format" select "Text"
→ Finish
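The same "format as Text first" idea exists in pandas: passing `dtype=str` to `read_csv` disables type inference for the whole file (or just for named columns), so leading zeros and date-like values come through as typed. A minimal sketch with an in-memory file:

```python
import io
import pandas as pd

csv_text = "id,delivery\n001234,3/10\n"

# Default inference would turn id into the integer 1234
df = pd.read_csv(io.StringIO(csv_text), dtype=str)
print(df['id'][0])        # 001234 - preserved as text
print(df['delivery'][0])  # 3/10

# Or restrict the override to specific columns:
df2 = pd.read_csv(io.StringIO(csv_text), dtype={'id': str})
print(df2['id'][0])       # 001234
```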
12. Python csv Module#
import csv

# Reading CSV
with open('data.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)  # skip header
    for row in reader:
        print(row)  # list of values

# Reading as dictionary
with open('data.csv', 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['name'], row['age'])

# Writing CSV
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'age', 'salary'])  # header
    writer.writerow(['John', 25, 50000])

# Custom delimiter (tab-delimited)
with open('data.tsv', 'r') as f:
    reader = csv.reader(f, delimiter='\t')

# Handling special characters
with open('data.csv', 'r', encoding='utf-8') as f:
    reader = csv.reader(f)
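One behaviour worth adding to the reference above: `csv.writer` automatically quotes fields that contain the delimiter, so embedded commas round-trip safely without any extra handling (sketch with an in-memory buffer):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['name', 'address'])
writer.writerow(['John', '1 Main St, Springfield'])  # value contains a comma

print(buf.getvalue())
# name,address
# John,"1 Main St, Springfield"   <- quoted automatically

# Reading it back splits correctly despite the embedded comma
rows = list(csv.reader(io.StringIO(buf.getvalue())))
print(rows[1])  # ['John', '1 Main St, Springfield']
```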
13. Python json Module - Complete Reference#
import json

# Parse JSON string → Python object
data = json.loads('{"name": "John", "age": 25}')

# Convert Python object → JSON string
json_str = json.dumps(data, indent=2)

# Read JSON file → Python object
with open('data.json', 'r') as f:
    data = json.load(f)

# Write Python object → JSON file
with open('output.json', 'w') as f:
    json.dump(data, f, indent=2)

# Handling nested JSON (API response)
response_json = {
    "weather": [{"description": "clear sky"}],
    "main": {"temp": 298.15, "humidity": 60}
}

# Access nested fields
description = response_json['weather'][0]['description']  # ✅
temperature = response_json['main']['temp']

# Common json.dumps options
json.dumps(data, indent=2)            # pretty print
json.dumps(data, sort_keys=True)      # sort keys alphabetically
json.dumps(data, ensure_ascii=False)  # allow non-ASCII characters
json.dumps(data, default=str)         # handle non-serializable types
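The `default=str` option matters for values like `datetime`, which json cannot serialize natively; `default` is called for any object the encoder doesn't recognize. A short sketch:

```python
import json
from datetime import date

record = {'name': 'John', 'joined': date(2024, 3, 10)}

# json.dumps(record) alone would raise:
#   TypeError: Object of type date is not JSON serializable
json_str = json.dumps(record, default=str)
print(json_str)  # {"name": "John", "joined": "2024-03-10"}
```

For round-tripping, remember the date comes back as a plain string and must be re-parsed on load.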
14. Python Pickle - When NOT to Use#
import pickle

# Save Python object
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Load Python object
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)
Pickle Limitations:#
- ❌ NOT human-readable (binary format)
- ❌ NOT cross-platform safe (Python version dependent)
- ❌ NOT secure - can execute arbitrary code on load
- ❌ NOT suitable for sharing with non-technical stakeholders
- ✅ OK for: saving Python ML models temporarily
- ✅ Better alternatives: joblib for ML models, Parquet for data, JSON for configs
Quick Reference Card#
Documentation needed?
→ Markdown (.md)
Sharing data with non-technical users?
→ CSV (opens in Excel)
Large numeric dataset in pipeline?
→ Parquet
API data / nested data?
→ JSON
Formal Word document?
→ DOCX
Excel importing numbers as dates or dropping leading zeros?
→ Format column as Text BEFORE Text-to-Columns
Comparing formats for efficiency?
→ Parquet > CSV > JSON > XML
(storage efficiency, NOT readability)