1. File Formats & Data Handling#

1. Markdown (.md)#

  • Lightweight, human-readable documentation format
  • Converts easily to HTML
  • Simple syntax:
# Heading 1
## Heading 2
**bold**, *italic*, `code`
- bullet point
[link text](url)
![image](path)
  • Best for: README files, documentation, notebooks, wikis
  • ❌ NOT suitable for complex formatted documents (use DOCX)
  • ✅ Plain text - fully readable, not a binary format

2. DOCX (Word Document)#

  • Microsoft Word format
  • Supports rich formatting: tables, headers, page numbers, styles
  • NOT ideal for lightweight documentation
  • NOT easily converted to HTML without tools
  • Best for: formal reports, letters, documents sent to clients
  • ❌ Binary format - not human-readable as plain text
  • ❌ Requires Word or compatible software to open properly

3. Binary Blob#

  • Raw binary data with no human-readable structure
  • Cannot be opened in text editor meaningfully
  • Examples: compiled files, encrypted data, raw image data
  • ❌ NOT suitable for documentation
  • ❌ NOT human-readable
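
A minimal sketch (with made-up header bytes) of why a blob can't be treated as text:

```python
# Hypothetical binary blob - the bytes here are arbitrary, not real file contents
blob = bytes([0x89, 0x50, 0x4E, 0x47, 0x00, 0xFF])

# Attempting to decode as text fails because the bytes are not valid UTF-8
try:
    text = blob.decode('utf-8')
except UnicodeDecodeError:
    text = None

print(text)  # None - no meaningful text representation
```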

4. Parquet#

  • Columnar storage format
  • Most efficient for large numeric arrays
  • Minimal storage overhead - uses compression
  • Stores data type information (schema-aware)
  • Best for: big data, analytics pipelines, data lakes
  • Comparison:
| Format  | Human-Readable | Efficient | Schema  | Best For                  |
|---------|----------------|-----------|---------|---------------------------|
| Parquet | ❌             | ✅        | ✅      | Large numeric arrays      |
| CSV     | ✅             | ❌        | ❌      | Sharing with stakeholders |
| JSON    | ✅             | ❌        | Partial | APIs, nested data         |
| XML     | ✅             | ❌        | ❌      | Legacy systems            |
# Reading/writing Parquet with pandas (requires pyarrow or fastparquet)
import pandas as pd

df.to_parquet('data.parquet')
df = pd.read_parquet('data.parquet')

5. XML (eXtensible Markup Language)#

  • Human-readable but very verbose
  • Uses tags: <tag>value</tag>
  • High overhead - lots of repeated tag text
  • NOT efficient for large numeric data
  • Best for: configuration files, legacy data exchange, SOAP APIs
<!-- Example: verbose compared to other formats -->
<record>
    <name>John</name>
    <age>25</age>
    <salary>50000</salary>
</record>
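
The record above can be parsed with Python's standard library; a small sketch using `xml.etree.ElementTree`:

```python
import xml.etree.ElementTree as ET

xml_text = """
<record>
    <name>John</name>
    <age>25</age>
    <salary>50000</salary>
</record>
"""

# Parse the string into an element tree and pull out each child tag
record = ET.fromstring(xml_text)
name = record.find('name').text          # 'John'
age = int(record.find('age').text)       # 25
salary = int(record.find('salary').text) # 50000
```

Note how much tag text surrounds just three values - this is the verbosity overhead mentioned above.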

6. JSON (JavaScript Object Notation)#

  • Human-readable, flexible format
  • Supports nested structures
  • NOT efficient for large numeric arrays
  • Best for: APIs, configuration files, nested/hierarchical data
import json

# Reading JSON
with open('data.json') as f:
    data = json.load(f)

# Writing JSON
with open('output.json', 'w') as f:
    json.dump(data, f, indent=2)

# Parse JSON string
data = json.loads('{"name": "John", "age": 25}')

# Convert to JSON string
json_str = json.dumps(data, indent=2)

# From API response (requires the requests library)
import requests
response = requests.get(url)
data = response.json()  # parse JSON response directly

7. CSV (Comma-Separated Values)#

  • Simple, human-readable tabular format
  • Best for sharing results with non-technical stakeholders
  • Can be opened directly in Excel
  • No schema enforcement - all values stored as text
  • Limitations: no data types, no nested structures, no compression
import pandas as pd
import csv

# pandas
df = pd.read_csv('data.csv')
df.to_csv('output.csv', index=False)

# Custom delimiter
df = pd.read_csv('data.tsv', sep='\t')  # tab-delimited

# Built-in csv module
with open('data.csv') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row)
  • ✅ Best for sharing summary results with non-technical stakeholders
  • ❌ Python pickle (.pkl) - not human-readable, not cross-platform safe
  • ❌ JSON with nested structures - harder to open in Excel
  • ❌ Binary database file - not accessible without tools

8. File Format Comparison - Complete#

| Format   | Readable | Size   | Types   | Nested | Best Use            |
|----------|----------|--------|---------|--------|---------------------|
| Markdown | ✅       | Small  | Text    | ❌     | Documentation       |
| CSV      | ✅       | Medium | No      | ❌     | Stakeholder sharing |
| JSON     | ✅       | Medium | Partial | ✅     | APIs, config        |
| XML      | ✅       | Large  | ❌      | ✅     | Legacy, SOAP        |
| Parquet  | ❌       | Small  | ✅      | ✅     | Big data, analytics |
| DOCX     | ❌       | Medium | —      | —      | Formal documents    |
| Pickle   | ❌       | Small  | ✅      | ✅     | Python-only objects |
| HDF5     | ❌       | Small  | ✅      | ✅     | Scientific data     |

9. Excel Text-to-Columns - Auto-Date Conversion Problem#

Problem:#

Tab-delimited file contains: "3/10"
Excel auto-interprets as: "March 10" (date)

Fix:#

  • Format the column as Text BEFORE applying Text-to-Columns
  • This tells Excel to treat the column as plain text, not numbers or dates

Steps:#

1. Select the column(s) in Excel
2. Format → Cells → Text
3. THEN apply Data → Text to Columns
4. In the wizard → select "Text" as column data format
  • ❌ Changing delimiter to commas → doesn’t fix data type issue
  • ❌ Using General format → still auto-converts to date
  • ❌ Manually changing cells after → tedious, error-prone

10. Excel Text-to-Columns - Leading Zeros Problem#

Problem:#

File contains: "001234"
Excel auto-converts to: 1234 (drops leading zeros)

Fix:#

  • Format cells as Text BEFORE using Text-to-Columns
  • Same fix as above - Text format preserves leading zeros

Common cases where leading zeros matter:#

  • Postal/ZIP codes: 00123
  • Phone numbers: 0412345678
  • Product codes: 001234
  • Student IDs: 0045

11. The Universal Fix - Format as Text First#

When to ALWAYS format as Text before import:#

  • Values that look like dates but aren’t (3/10, 1-2, 12/5)
  • Values with leading zeros (001234, 0045)
  • Long numeric IDs that might be converted to scientific notation
  • Values with special characters that might be misinterpreted

In Text-to-Columns Wizard:#

Step 3 of 3 in wizard:
→ Click on the problematic column
→ Under "Column data format" select "Text"
→ Finish
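
The same fix applies when reading files in pandas: the `dtype=str` parameter keeps every value as text, so nothing is inferred as a number or date. A sketch with hypothetical data:

```python
import io
import pandas as pd

# Hypothetical file contents: a leading-zero ID and a date-like value
csv_text = "zip,ratio\n00123,3/10\n"

# dtype=str disables type inference - the Python equivalent of
# formatting the column as Text before import
df = pd.read_csv(io.StringIO(csv_text), dtype=str)
print(df['zip'][0])    # '00123' - leading zeros preserved
print(df['ratio'][0])  # '3/10' - not converted to a date
```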

12. Python csv Module#

import csv

# Reading CSV
with open('data.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)  # read the header row
    for row in reader:
        print(row)  # list of values

# Reading as dictionary
with open('data.csv', 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['name'], row['age'])

# Writing CSV
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'age', 'salary'])  # header
    writer.writerow(['John', 25, 50000])

# Custom delimiter (tab-delimited)
with open('data.tsv', 'r') as f:
    reader = csv.reader(f, delimiter='\t')

# Handling special characters
with open('data.csv', 'r', encoding='utf-8') as f:
    reader = csv.reader(f)

13. Python json Module - Complete Reference#

import json

# Parse JSON string → Python object
data = json.loads('{"name": "John", "age": 25}')

# Convert Python object → JSON string
json_str = json.dumps(data, indent=2)

# Read JSON file → Python object
with open('data.json', 'r') as f:
    data = json.load(f)

# Write Python object → JSON file
with open('output.json', 'w') as f:
    json.dump(data, f, indent=2)

# Handling nested JSON (API response)
response_json = {
    "weather": [{"description": "clear sky"}],
    "main": {"temp": 298.15, "humidity": 60}
}

# Access nested fields
description = response_json['weather'][0]['description']  # ✅
temperature = response_json['main']['temp']

# Common json.dumps options
json.dumps(data, indent=2)            # pretty print
json.dumps(data, sort_keys=True)      # sort keys alphabetically
json.dumps(data, ensure_ascii=False)  # allow non-ASCII characters
json.dumps(data, default=str)         # handle non-serializable types

14. Python Pickle - When NOT to Use#

import pickle

# Save Python object
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Load Python object
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

Pickle Limitations:#

  • ❌ NOT human-readable (binary format)
  • ❌ NOT cross-platform safe (Python version dependent)
  • ❌ NOT secure - can execute arbitrary code on load
  • ❌ NOT suitable for sharing with non-technical stakeholders
  • ✅ OK for: saving Python ML models temporarily
  • ✅ Better alternatives: joblib for ML models, parquet for data, json for configs
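
For plain dicts and lists, JSON is a safe, readable substitute for pickle; a minimal sketch with a hypothetical config (file name is arbitrary):

```python
import json

config = {"learning_rate": 0.01, "epochs": 10}

# Human-readable on disk, safe to load, portable across languages
with open('config.json', 'w') as f:
    json.dump(config, f, indent=2)

with open('config.json') as f:
    loaded = json.load(f)

print(loaded == config)  # True - lossless round-trip for JSON-safe types
```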

Quick Reference Card#

Documentation needed?
→ Markdown (.md)

Sharing data with non-technical users?
→ CSV (opens in Excel)

Large numeric dataset in pipeline?
→ Parquet

API data / nested data?
→ JSON

Formal Word document?
→ DOCX

Excel importing numbers as dates or dropping leading zeros?
→ Format column as Text BEFORE Text-to-Columns

Comparing formats for efficiency?
→ Parquet > CSV > JSON > XML
   (storage efficiency, NOT readability)