1. File Formats & Data Handling#

1. Markdown (.md)#

  • Lightweight, human-readable documentation format
  • Converts easily to HTML
  • Simple syntax:
# Heading 1
## Heading 2
**bold**, *italic*, `code`
- bullet point
[link text](url)
![image](path)
  • Best for: README files, documentation, notebooks, wikis
  • ❌ NOT suitable for complex formatted documents (use DOCX)
  • ✅ Plain text - fully readable, not a binary format

2. DOCX (Word Document)#

  • Microsoft Word format
  • Supports rich formatting: tables, headers, page numbers, styles
  • NOT ideal for lightweight documentation
  • NOT easily converted to HTML without tools
  • Best for: formal reports, letters, documents sent to clients
  • ❌ Binary format - not human-readable as plain text
  • ❌ Requires Word or compatible software to open properly

3. Binary Blob#

  • Raw binary data with no human-readable structure
  • Cannot be opened in text editor meaningfully
  • Examples: compiled files, encrypted data, raw image data
  • ❌ NOT suitable for documentation
  • ❌ NOT human-readable
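
A minimal sketch (with made-up header bytes) of why a blob can't be treated as text:

```python
# Hypothetical binary blob - the bytes here are arbitrary, not real file contents
blob = bytes([0x89, 0x50, 0x4E, 0x47, 0x00, 0xFF])

# Attempting to decode as text fails because the bytes are not valid UTF-8
try:
    text = blob.decode('utf-8')
except UnicodeDecodeError:
    text = None

print(text)  # None - no meaningful text representation
```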

4. Parquet#

  • Columnar storage format
  • Most efficient for large numeric arrays
  • Minimal storage overhead - uses compression
  • Stores data type information (schema-aware)
  • Best for: big data, analytics pipelines, data lakes
  • Comparison:
| Format  | Human-Readable | Efficient | Schema  | Best For                  |
|---------|----------------|-----------|---------|---------------------------|
| Parquet | ❌             | ✅        | ✅      | Large numeric arrays      |
| CSV     | ✅             | ❌        | ❌      | Sharing with stakeholders |
| JSON    | ✅             | ❌        | Partial | APIs, nested data         |
| XML     | ✅             | ❌        | ❌      | Legacy systems            |
# Reading/writing Parquet with pandas (requires pyarrow or fastparquet)
import pandas as pd

df.to_parquet('data.parquet')
df = pd.read_parquet('data.parquet')

5. XML (eXtensible Markup Language)#

  • Human-readable but very verbose
  • Uses tags: <tag>value</tag>
  • High overhead - lots of repeated tag text
  • NOT efficient for large numeric data
  • Best for: configuration files, legacy data exchange, SOAP APIs
<!-- Example: verbose compared to other formats -->
<record>
    <name>John</name>
    <age>25</age>
    <salary>50000</salary>
</record>
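
The record above can be parsed with Python's standard library; a small sketch using `xml.etree.ElementTree`:

```python
import xml.etree.ElementTree as ET

xml_text = """
<record>
    <name>John</name>
    <age>25</age>
    <salary>50000</salary>
</record>
"""

# Parse the string into an element tree and pull out each child tag
record = ET.fromstring(xml_text)
name = record.find('name').text          # 'John'
age = int(record.find('age').text)       # 25
salary = int(record.find('salary').text) # 50000
```

Note how much tag text surrounds just three values - this is the verbosity overhead mentioned above.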

6. JSON (JavaScript Object Notation)#

  • Human-readable, flexible format
  • Supports nested structures
  • NOT efficient for large numeric arrays
  • Best for: APIs, configuration files, nested/hierarchical data
import json

# Reading JSON
with open('data.json') as f:
    data = json.load(f)

# Writing JSON
with open('output.json', 'w') as f:
    json.dump(data, f, indent=2)

# Parse JSON string
data = json.loads('{"name": "John", "age": 25}')

# Convert to JSON string
json_str = json.dumps(data, indent=2)

# From API response (requires the requests library)
import requests
response = requests.get(url)
data = response.json()  # parse JSON response directly

7. CSV (Comma-Separated Values)#

  • Simple, human-readable tabular format
  • Best for sharing results with non-technical stakeholders
  • Can be opened directly in Excel
  • No schema enforcement - all values stored as text
  • Limitations: no data types, no nested structures, no compression
import pandas as pd
import csv

# pandas
df = pd.read_csv('data.csv')
df.to_csv('output.csv', index=False)

# Custom delimiter
df = pd.read_csv('data.tsv', sep='\t')  # tab-delimited

# Built-in csv module
with open('data.csv') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row)
  • ✅ Best for sharing summary results with non-technical stakeholders
  • ❌ Python pickle (.pkl) - not human-readable, not cross-platform safe
  • ❌ JSON with nested structures - harder to open in Excel
  • ❌ Binary database file - not accessible without tools

8. File Format Comparison - Complete#

| Format   | Readable | Size   | Types   | Nested | Best Use            |
|----------|----------|--------|---------|--------|---------------------|
| Markdown | ✅       | Small  | Text    | ❌     | Documentation       |
| CSV      | ✅       | Medium | No      | ❌     | Stakeholder sharing |
| JSON     | ✅       | Medium | Partial | ✅     | APIs, config        |
| XML      | ✅       | Large  | ❌      | ✅     | Legacy, SOAP        |
| Parquet  | ❌       | Small  | ✅      | ✅     | Big data, analytics |
| DOCX     | ❌       | Medium | —      | —      | Formal documents    |
| Pickle   | ❌       | Small  | ✅      | ✅     | Python-only objects |
| HDF5     | ❌       | Small  | ✅      | ✅     | Scientific data     |

9. Excel Text-to-Columns - Auto-Date Conversion Problem#

Problem:#

Tab-delimited file contains: "3/10"
Excel auto-interprets as: "March 10" (date)

Fix:#

  • Format the column as Text BEFORE applying Text-to-Columns
  • This tells Excel to treat the column as plain text, not numbers or dates

Steps:#

1. Select the column(s) in Excel
2. Format → Cells → Text
3. THEN apply Data → Text to Columns
4. In the wizard → select "Text" as column data format
  • ❌ Changing delimiter to commas → doesn’t fix data type issue
  • ❌ Using General format → still auto-converts to date
  • ❌ Manually changing cells after → tedious, error-prone

10. Excel Text-to-Columns - Leading Zeros Problem#

Problem:#

File contains: "001234"
Excel auto-converts to: 1234 (drops leading zeros)

Fix:#

  • Format cells as Text BEFORE using Text-to-Columns
  • Same fix as above - Text format preserves leading zeros

Common cases where leading zeros matter:#

  • Postal/ZIP codes: 00123
  • Phone numbers: 0412345678
  • Product codes: 001234
  • Student IDs: 0045

11. The Universal Fix - Format as Text First#

When to ALWAYS format as Text before import:#

  • Values that look like dates but aren’t (3/10, 1-2, 12/5)
  • Values with leading zeros (001234, 0045)
  • Long numeric IDs that might be converted to scientific notation
  • Values with special characters that might be misinterpreted

In Text-to-Columns Wizard:#

Step 3 of 3 in wizard:
→ Click on the problematic column
→ Under "Column data format" select "Text"
→ Finish
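
The same fix applies when reading files in pandas: the `dtype=str` parameter keeps every value as text, so nothing is inferred as a number or date. A sketch with hypothetical data:

```python
import io
import pandas as pd

# Hypothetical file contents: a leading-zero ID and a date-like value
csv_text = "zip,ratio\n00123,3/10\n"

# dtype=str disables type inference - the Python equivalent of
# formatting the column as Text before import
df = pd.read_csv(io.StringIO(csv_text), dtype=str)
print(df['zip'][0])    # '00123' - leading zeros preserved
print(df['ratio'][0])  # '3/10' - not converted to a date
```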

12. Python csv Module#

import csv

# Reading CSV
with open('data.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)  # read the header row
    for row in reader:
        print(row)  # list of values

# Reading as dictionary
with open('data.csv', 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['name'], row['age'])

# Writing CSV
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'age', 'salary'])  # header
    writer.writerow(['John', 25, 50000])

# Custom delimiter (tab-delimited)
with open('data.tsv', 'r') as f:
    reader = csv.reader(f, delimiter='\t')

# Handling special characters
with open('data.csv', 'r', encoding='utf-8') as f:
    reader = csv.reader(f)

13. Python json Module - Complete Reference#

import json

# Parse JSON string → Python object
data = json.loads('{"name": "John", "age": 25}')

# Convert Python object → JSON string
json_str = json.dumps(data, indent=2)

# Read JSON file → Python object
with open('data.json', 'r') as f:
    data = json.load(f)

# Write Python object → JSON file
with open('output.json', 'w') as f:
    json.dump(data, f, indent=2)

# Handling nested JSON (API response)
response_json = {
    "weather": [{"description": "clear sky"}],
    "main": {"temp": 298.15, "humidity": 60}
}

# Access nested fields
description = response_json['weather'][0]['description']  # ✅
temperature = response_json['main']['temp']

# Common json.dumps options
json.dumps(data, indent=2)            # pretty print
json.dumps(data, sort_keys=True)      # sort keys alphabetically
json.dumps(data, ensure_ascii=False)  # allow non-ASCII characters
json.dumps(data, default=str)         # handle non-serializable types

14. Python Pickle - When NOT to Use#

import pickle

# Save Python object
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Load Python object
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

Pickle Limitations:#

  • ❌ NOT human-readable (binary format)
  • ❌ NOT cross-platform safe (Python version dependent)
  • ❌ NOT secure - can execute arbitrary code on load
  • ❌ NOT suitable for sharing with non-technical stakeholders
  • ✅ OK for: saving Python ML models temporarily
  • ✅ Better alternatives: joblib for ML models, parquet for data, json for configs
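
For plain dicts and lists, JSON is a safe, readable substitute for pickle; a minimal sketch with a hypothetical config (file name is arbitrary):

```python
import json

config = {"learning_rate": 0.01, "epochs": 10}

# Human-readable on disk, safe to load, portable across languages
with open('config.json', 'w') as f:
    json.dump(config, f, indent=2)

with open('config.json') as f:
    loaded = json.load(f)

print(loaded == config)  # True - lossless round-trip for JSON-safe types
```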

Quick Reference Card#

Documentation needed?
→ Markdown (.md)

Sharing data with non-technical users?
→ CSV (opens in Excel)

Large numeric dataset in pipeline?
→ Parquet

API data / nested data?
→ JSON

Formal Word document?
→ DOCX

Excel importing numbers as dates or dropping leading zeros?
→ Format column as Text BEFORE Text-to-Columns

Comparing formats for efficiency?
→ Parquet > CSV > JSON > XML
   (storage efficiency, NOT readability)