4. Git & Version Control

1. Git - Core Concepts & Purpose

What is Git?

  • Distributed version control system
  • Tracks changes to files over time
  • Every developer has a full copy of the repository
  • Works offline - syncs when connected

Core Benefits:

✅ Tracks every change ever made
✅ Enables collaboration among multiple developers
✅ Rollback to any previous state
✅ Branching - work on features without affecting main code
✅ Blame - see who changed what and when
✅ Merge - combine work from multiple people

❌ Does NOT automatically clean data
❌ Does NOT replace testing
❌ Does NOT store only binary assets

Git vs GitHub vs GitLab:

Tool        What it is
Git         Version control software (local)
GitHub      Cloud hosting for Git repos + collaboration
GitLab      Alternative to GitHub (self-hostable)
Bitbucket   Another alternative, by Atlassian

2. Git - Core Architecture

Working Directory    Staging Area      Local Repo       Remote Repo
(your files)         (index)          (.git folder)    (GitHub)
     │                   │                 │                │
     │   git add         │   git commit    │   git push     │
     │──────────────────>│────────────────>│───────────────>│
     │                   │                 │                │
     │<──────────────────────────────────────────────────── │
     │              git pull / git clone                    │
     │                   │                 │                │
     │   git checkout -- .               git fetch          │
     │<──────────────────│                 │<───────────────│

Four Areas Explained:

# 1. Working Directory
# → Files you're currently editing
# → Changes here are not yet staged or committed

# 2. Staging Area (Index)
# → Files marked for next commit
# → git add moves files here

# 3. Local Repository
# → Committed snapshots
# → git commit moves from staging to here

# 4. Remote Repository
# → GitHub/GitLab server
# → git push uploads local commits
# → git pull downloads remote commits
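
The flow through these areas can be sketched with a toy in-memory model. This is only an illustration of the add → commit → push sequence, not how Git actually stores data; the class `ToyRepo` and its methods are hypothetical names invented for this sketch.

```python
# Toy model of the areas described above (illustration only).
class ToyRepo:
    def __init__(self):
        self.working = {}   # working directory: path -> content
        self.staging = {}   # staging area (index): path -> content
        self.local = []     # local repository: list of committed snapshots
        self.remote = []    # remote repository: snapshots that were pushed

    def add(self, path):
        # git add: copy the file's current content into the staging area
        self.staging[path] = self.working[path]

    def commit(self):
        # git commit: record the staged content as a new snapshot
        self.local.append(dict(self.staging))

    def push(self):
        # git push: upload local snapshots the remote does not have yet
        self.remote.extend(self.local[len(self.remote):])


repo = ToyRepo()
repo.working["script.py"] = "v1"
repo.add("script.py")
repo.working["script.py"] = "v2"   # edited again after staging
repo.commit()
repo.push()
print(repo.remote)  # [{'script.py': 'v1'}]
```

The commit contains the staged "v1", not the later "v2" edit: only what was staged with git add goes into the snapshot.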

3. .gitignore - Excluding Files from Tracking

What is .gitignore?

  • Text file listing patterns of files Git should NOT track
  • Usually placed in the root of the repository (subdirectories can have their own)
  • Committed to repo so all team members use same rules

Common .gitignore Entries:

# Python
__pycache__/
*.pyc
*.pyo
*.pyd
.Python
*.egg-info/

# Virtual environments
venv/
.venv/
env/
.env

# Secrets & API keys ✅ most important
.env
.env.*
secrets.json
credentials.json
config/secrets.py

# Data files (often too large for git)
data/raw/
*.csv
*.parquet
*.pkl

# IDE files
.vscode/
.idea/
*.swp

# OS files (comments must be on their own line; a trailing "# ..." becomes part of the pattern)
# macOS
.DS_Store
# Windows
Thumbs.db

# Jupyter
.ipynb_checkpoints/

# Build artifacts
dist/
build/
*.egg-info/

# Logs
*.log
logs/
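
To get a feel for how such patterns select files, here is a rough Python approximation built on the standard library's `fnmatch`. It is a simplified sketch, not Git's real matcher: it ignores `!` negation, `**`, and several anchoring rules, and the helper name `is_ignored` is hypothetical. For real answers, `git check-ignore -v <path>` asks Git itself.

```python
from fnmatch import fnmatch

# A few patterns copied from the list above
PATTERNS = ["__pycache__/", "*.pyc", ".env", ".env.*", "data/raw/", "*.csv", "*.log"]


def is_ignored(path, patterns=PATTERNS):
    """Rough approximation of .gitignore matching (no '!' negation, no '**')."""
    parts = path.split("/")
    for pattern in patterns:
        if pattern.endswith("/"):
            directory = pattern.rstrip("/")
            if "/" in directory:
                # A pattern containing a slash is anchored at the repository root
                if path.startswith(directory + "/"):
                    return True
            elif any(fnmatch(part, directory) for part in parts[:-1]):
                # A bare directory name matches at any depth
                return True
        elif any(fnmatch(part, pattern) for part in parts):
            # A pattern without a slash matches a file or directory name at any depth
            return True
    return False


for candidate in ["src/main.py", ".env.local", "data/raw/sales.parquet"]:
    print(candidate, is_ignored(candidate))
```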

.gitignore vs .git/config:

File             Purpose
.gitignore       Specify files Git should NOT track ✅
.git/config      Repository settings (remote URL, branches)
.gitattributes   File-specific settings (line endings, merge strategy)

A .gitignore file is:
  • ❌ NOT for tracking specific files
  • ❌ NOT for listing contributors
  • ❌ NOT for defining repository settings

4. git diff - View Changes

# Working directory vs staging area (unstaged changes)
git diff                        # ✅ exam answer
# Working directory vs last commit (staged + unstaged changes)
git diff HEAD

# Staged changes vs last commit
git diff --staged
git diff --cached               # same as --staged

# Between two commits
git diff abc123 def456

# Between two branches
git diff main feature-branch

# Specific file
git diff file.py

# Summary only (no actual diff)
git diff --stat

# Related commands (not git diff)
git status    # shows WHICH files changed (not the actual diff)
git log       # shows commit HISTORY
git show      # shows changes in a specific commit
git log -p    # shows commits WITH their diffs

git diff Output:

diff --git a/script.py b/script.py
index abc123..def456 100644
--- a/script.py        ← old version
+++ b/script.py        ← new version
@@ -10,7 +10,8 @@    ← hunk header: old start,count and new start,count
 def clean_data(df):
-    df.dropna()                  ← removed line (red)
+    df.dropna(subset=['name'])   ← added line (green)
+    df.reset_index(drop=True)    ← added line (green)
     return df
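
The `@@ ... @@` line is the part people most often misread, so here is a small sketch that pulls the four numbers out of a hunk header of the form shown above. The function name `parse_hunk_header` is a hypothetical helper, not part of Git.

```python
import re

# "@@ -<old start>[,<old count>] +<new start>[,<new count>] @@"
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(line):
    """Return (old_start, old_count, new_start, new_count) for an '@@' line."""
    match = HUNK_HEADER.match(line)
    if match is None:
        raise ValueError(f"not a hunk header: {line!r}")
    old_start, old_count, new_start, new_count = match.groups()
    # When the count is omitted, the hunk covers exactly one line
    return (int(old_start), int(old_count or 1), int(new_start), int(new_count or 1))


print(parse_hunk_header("@@ -10,7 +10,8 @@"))  # (10, 7, 10, 8)
```

Read as: the hunk starts at line 10 in both versions, spans 7 lines in the old file and 8 lines in the new one.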

5. Git Workflow Commands - Complete Reference

Staging & Committing:

# Check what changed
git status                  # see modified/staged/untracked files

# Stage changes
git add file.py             # stage specific file
git add .                   # stage ALL changes ✅
git add *.py                # stage Python files in the current directory (shell glob)
git add -p                  # stage changes interactively (chunk by chunk)

# Unstage (undo git add)
git restore --staged file.py
git reset HEAD file.py      # older syntax

# Commit
git commit -m "Add data cleaning function"  # ✅
git commit -am "message"    # stage + commit tracked files in one step
git commit --amend          # modify last commit (message and/or staged changes)

# Correct sequence for uploading work:
git add .
git commit -m "message"
git push                    # ✅ exam answer: add → commit → push

Viewing History:

git log                     # full commit history
git log --oneline           # compact one-line per commit
git log --graph             # visual branch graph
git log --oneline --graph --all  # full visual history
git log -n 5                # last 5 commits
git log --author="John"     # commits by specific author
git log --since="2024-01-01" # commits after date
git log -- file.py          # commits touching specific file
git show abc123             # show specific commit details
git show HEAD               # show last commit
git show HEAD~1             # show second-to-last commit
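
When history needs to be processed in a script, the `--oneline` format is easy to split. A minimal sketch, assuming plain output without ref decorations such as `(HEAD -> main)`; the sample text is made up for illustration and `parse_oneline` is a hypothetical helper.

```python
def parse_oneline(log_text):
    """Split `git log --oneline` output into (short_hash, subject) pairs."""
    commits = []
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        short_hash, _, subject = line.partition(" ")
        commits.append((short_hash, subject))
    return commits


# Hypothetical output, newest commit first (as git log prints it)
sample = "def456 Add data cleaning function\nabc123 Initial commit\n"
print(parse_oneline(sample))
```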

6. Syncing with Remote

# Download + integrate remote changes
git pull                    # pull from tracked remote branch
git pull origin main        # ✅ pull from main branch specifically

# Download WITHOUT integrating
git fetch origin            # fetch all remote changes
git fetch origin main       # fetch specific branch

# Pull vs Fetch:
# git pull = git fetch + git merge
# git fetch = download only (safe, no changes to working files)

# Push local commits to remote
git push                    # push to tracked remote branch
git push origin main        # push to specific branch
git push origin feature     # push feature branch
git push -u origin feature  # push + set upstream tracking
git push --force            # ⚠️ force push (dangerous, avoid)
git push --force-with-lease # safer: refuses if the remote branch moved since your last fetch

7. Git Branching

Why Branches?

  • Work on features without affecting main code
  • Multiple people work simultaneously without overwriting each other's work
  • Keep main always production-ready

# Create, list, and delete branches
git branch feature-analysis         # create branch
git branch                          # list local branches
git branch -a                       # list all branches (local + remote)
git branch -d feature-analysis      # delete branch (safe: refuses if unmerged)
git branch -D feature-analysis      # force delete branch

# Switch branch
git checkout feature-analysis       # switch to branch
git checkout main                   # go back to main

# Create + switch in one step ✅
git checkout -b feature-analysis    # ✅ most common shortcut
git switch -c feature-analysis      # newer syntax (Git 2.23+)

# ✅ Exam answer: create and switch
git branch feature-analysis
git checkout feature-analysis
# OR single command:
git checkout -b feature-analysis

Branch Naming Conventions:

feature/data-cleaning       # new features
bugfix/fix-null-handling    # bug fixes
hotfix/critical-api-fix     # urgent production fixes
release/v1.2.0              # release preparation
docs/update-readme          # documentation
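
A convention like this is easy to check automatically, for example in a pre-push hook or CI step. The sketch below uses the prefixes listed above; the allowed characters after the slash are an assumption, so adjust the pattern to your team's rules.

```python
import re

# Prefixes come from the convention above; the character set after "/" is assumed
BRANCH_NAME = re.compile(r"^(feature|bugfix|hotfix|release|docs)/[a-z0-9][a-z0-9._-]*$")


def is_valid_branch_name(name):
    """True if the branch name follows the <prefix>/<short-description> pattern."""
    return BRANCH_NAME.match(name) is not None


for name in ["feature/data-cleaning", "release/v1.2.0", "my-branch"]:
    print(name, is_valid_branch_name(name))
```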

8. Undoing Changes

# Discard working directory changes ✅ (exam answer)
git checkout -- .           # discard ALL unstaged changes
git checkout -- file.py     # discard changes to specific file
git restore .               # newer syntax (Git 2.23+)
git restore file.py         # specific file

# Unstage (undo git add)
git restore --staged file.py
git reset HEAD file.py

# Undo last commit (keep changes in working directory)
git reset --soft HEAD~1     # undo commit, keep staged
git reset HEAD~1            # undo commit, unstage changes
git reset --mixed HEAD~1    # same as above (default)

# Undo last commit (DISCARD all changes) ⚠️ destructive
git reset --hard HEAD~1

# Revert (creates NEW commit that undoes previous)
git revert HEAD             # undo last commit safely
git revert abc123           # undo specific commit

# Stash (save work temporarily)
git stash                   # save current changes
git stash list              # see all stashes
git stash pop               # apply most recent stash + delete it
git stash apply             # apply most recent stash (keep it)
git stash drop              # delete most recent stash
git stash clear             # delete all stashes

Undo Command Comparison:

Command             Effect                            Destructive?
git checkout -- .   Discard working dir changes ✅    Yes (local only)
git reset --soft    Undo commit, keep staged          No
git reset --hard    Undo commit, discard all          Yes ⚠️
git revert          New commit undoing previous       No (safe)
git stash           Temporarily save changes          No
git clean -fd       Remove untracked files            Yes ⚠️
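
The difference between the three reset modes is which areas get rewound. A toy model of `git reset --<mode> HEAD~1` for tracked files, written as a sketch: the dictionary layout and the function `reset` are hypothetical, and real Git tracks far more state.

```python
import copy


def reset(state, mode="mixed"):
    """Toy model of `git reset --<mode> HEAD~1` (tracked files only).

    state has three keys: "commits" (list of snapshots, last one is HEAD),
    "index" (staging area) and "working" (working directory).
    """
    new = copy.deepcopy(state)
    new["commits"].pop()               # every mode moves HEAD back one commit
    target = new["commits"][-1]
    if mode in ("mixed", "hard"):
        new["index"] = dict(target)    # mixed and hard also reset the staging area
    if mode == "hard":
        new["working"] = dict(target)  # only hard overwrites the working files
    return new


state = {
    "commits": [{"a.py": "v1"}, {"a.py": "v2"}],
    "index": {"a.py": "v2"},
    "working": {"a.py": "v2"},
}

for mode in ("soft", "mixed", "hard"):
    result = reset(state, mode)
    print(mode, result["index"], result["working"])
```

With `--soft` the "v2" change stays staged, with `--mixed` it stays only in the working directory, and with `--hard` it is gone from both.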

9. Merging & Merge Conflicts

Merging Branches:

# Merge feature into main
git checkout main           # switch to target branch
git merge feature-analysis  # merge feature into current branch

# Fast-forward merge when possible (the default; no merge commit needed)
git merge --ff feature-analysis

# Always create a merge commit
git merge --no-ff feature-analysis

# Merge with a custom message
git merge -m "Merge feature-analysis" feature-analysis

Merge Conflicts:

# When do conflicts occur?
# → Two people edit the SAME LINE on different branches
# → Git cannot auto-decide which change to keep
# → Must be RESOLVED MANUALLY ✅ (exam answer)

# After conflict:
git status  # shows conflicted files

# Conflicted file looks like:
<<<<<<< HEAD                   ← start of your side (current branch)
df.dropna(subset=['name'])     ← your change
=======
df.dropna(subset=['email'])    ← incoming change
>>>>>>> feature-branch

# Resolution steps:
# 1. Open conflicted file
# 2. Choose which change to keep (or combine both)
# 3. Remove conflict markers (<<<<<<<, =======, >>>>>>>)
# 4. Stage resolved file
git add file.py
# 5. Complete merge
git commit

# Abort merge (go back to before merge)
git merge --abort
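
The marker layout can also be read programmatically, which is handy for a pre-commit check that refuses files with leftover markers. A minimal sketch with hypothetical helper names; note that a line of `=======` can appear legitimately in some files (for example reStructuredText headings), so treat `has_conflict_markers` as a heuristic.

```python
MARKERS = ("<<<<<<<", "=======", ">>>>>>>")


def split_conflict(text):
    """Return (ours, theirs) line lists from the first conflict block in text."""
    ours, theirs = [], []
    section = None  # None = outside a conflict, otherwise "ours" or "theirs"
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            section = "ours"
        elif line.startswith("=======") and section == "ours":
            section = "theirs"
        elif line.startswith(">>>>>>>") and section == "theirs":
            break
        elif section == "ours":
            ours.append(line)
        elif section == "theirs":
            theirs.append(line)
    return ours, theirs


def has_conflict_markers(text):
    """Heuristic: True if any line still starts with a conflict marker."""
    return any(line.startswith(MARKERS) for line in text.splitlines())


conflicted = """def clean_data(df):
<<<<<<< HEAD
    df.dropna(subset=['name'])
=======
    df.dropna(subset=['email'])
>>>>>>> feature-branch
    return df
"""

ours, theirs = split_conflict(conflicted)
print(ours, theirs)
```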

10. Pull Requests (PR)

What is a PR?

  • Formal request to merge your branch into main
  • Team members can:
    • Review code changes
    • Leave comments
    • Request changes
    • Approve the merge
  • Standard practice in collaborative projects

PR Workflow:

1. Create feature branch
   git checkout -b feature-data-cleaning

2. Make changes and commit
   git add .
   git commit -m "Add data cleaning pipeline"

3. Push branch to remote
   git push origin feature-data-cleaning

4. Create PR on GitHub/GitLab
   → Go to repository on GitHub
   → Click "New Pull Request"
   → Select base: main, compare: feature-data-cleaning
   → Add title and description
   → Request reviewers

5. Code review
   → Reviewers comment on specific lines
   → Author makes requested changes
   → Push additional commits to same branch

6. Approval and merge
   → Reviewer approves
   → PR merged into main

7. Clean up
   git branch -d feature-data-cleaning
   git push origin --delete feature-data-cleaning

  • ✅ PRs allow code review + approval before merging
  • ❌ PRs are NOT faster than direct merging
  • ❌ PRs do NOT automatically fix bugs
  • ❌ PRs are NOT only for documentation

11. Git Hotfix Workflow

When to Use Hotfix?

  • Critical bug found in production
  • Cannot wait for regular release cycle
  • Need immediate fix with minimal disruption

Hotfix Process:

# Step 1: Create hotfix branch from the LAST RELEASE
# (in GitFlow, the tip of main is the last tagged release)
git checkout main
git checkout -b hotfix/critical-routing-bug

# Step 2: Implement and test the fix
# (make changes, run tests)
git add .
git commit -m "Fix critical routing bug in intersection algorithm"

# Step 3: Open PR for expedited review
git push origin hotfix/critical-routing-bug
# → Create PR on GitHub
# → Request urgent/expedited review

# Step 4: Merge into BOTH main AND develop ✅ (exam answer)
git checkout main
git merge hotfix/critical-routing-bug
git tag -a v1.0.1 -m "Hotfix: critical routing bug"
git push origin main --tags

git checkout develop
git merge hotfix/critical-routing-bug
git push origin develop

# Step 5: Delete hotfix branch
git branch -d hotfix/critical-routing-bug
git push origin --delete hotfix/critical-routing-bug

  • ✅ Create hotfix branch from release tag → fix → PR → merge to main AND develop
  • ❌ Apply fix directly to main without review
  • ❌ Delay fix to next regular release cycle
  • ❌ Fix in working directory, apply to production, backport to Git later
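
A hotfix normally bumps only the patch number of the release tag. The sketch below computes the next tag, assuming tags follow the `vMAJOR.MINOR.PATCH` form used in the process above; the starting tag `v1.0.0` is an assumed previous release, and `next_patch_tag` is a hypothetical helper.

```python
import re


def next_patch_tag(tag):
    """Return the next patch version for a tag of the form vMAJOR.MINOR.PATCH."""
    match = re.fullmatch(r"v(\d+)\.(\d+)\.(\d+)", tag)
    if match is None:
        raise ValueError(f"unexpected tag format: {tag!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return f"v{major}.{minor}.{patch + 1}"


print(next_patch_tag("v1.0.0"))  # v1.0.1
```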

12. Branch Strategy - GitFlow

Complete Branching Model:

main (production-ready)
  │
  ├── hotfix branches (emergency fixes)
  │   └── merge back to main + develop
  │
develop (integration branch)
  │
  ├── feature/data-cleaning
  ├── feature/visualization
  ├── feature/api-integration
  │   └── merge back to develop when complete
  │
  └── release/v1.2.0 (when ready for production)
      └── merge to main + develop, then tag

GitHub Flow (simpler alternative):

main (always deployable)
  │
  ├── feature/data-cleaning    → PR → main
  ├── bugfix/fix-null-values   → PR → main
  └── hotfix/urgent-fix        → PR → main (expedited)
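
The two models differ mainly in where each branch type is merged. That rule can be written down as a small lookup, sketched here from the diagrams above; the function name and the `model` parameter values are hypothetical.

```python
# Merge targets per branch prefix, following the GitFlow diagram above
GITFLOW_TARGETS = {
    "feature": ["develop"],
    "release": ["main", "develop"],
    "hotfix": ["main", "develop"],
}


def merge_targets(branch, model="gitflow"):
    """Return the branches that `branch` should be merged into."""
    if model == "github-flow":
        return ["main"]  # every branch goes to main through a PR
    prefix = branch.split("/", 1)[0]
    if prefix not in GITFLOW_TARGETS:
        raise ValueError(f"no GitFlow rule for branch {branch!r}")
    return GITFLOW_TARGETS[prefix]


print(merge_targets("hotfix/urgent-fix"))                  # ['main', 'develop']
print(merge_targets("hotfix/urgent-fix", "github-flow"))   # ['main']
```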

13. Pipeline Transformation Versioning

Problem:

Transformation logic changes frequently
→ Hard to track what changed and when
→ Can't reproduce old results
→ Risk of breaking existing processes

Solution:

# ✅ Version transformations with git + unit tests
git checkout -b feature/new-transformation-logic

# Write unit tests FIRST
# tests/test_transform.py

# Implement transformation
# src/transform.py

# Test on sample data (not production)
pytest tests/ -v

# If all tests pass → PR → merge
git add .
git commit -m "Update normalization: handle unicode names"
git push origin feature/new-transformation-logic
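
# tests/test_transform.py: unit tests for the transformation
import pandas as pd

from src.transform import normalize, transform  # assumes src/ is importable

# Hypothetical sample input; substitute a small fixture from your own pipeline
sample_df = pd.DataFrame({"id": [1], "name": ["APPLE INC"], "revenue": [1000.0]})

def test_normalize_company_name():
    assert normalize("APPLE INC") == "Apple Inc"
    assert normalize("apple incorporated") == "Apple Inc"
    assert normalize("  Apple  ") == "Apple"

def test_schema_preserved():
    result = transform(sample_df)
    assert list(result.columns) == ['id', 'name', 'revenue']
    assert result['revenue'].dtype == 'float64'

  • ✅ Version with git + write unit tests + test on sample data before production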
  • ❌ Keep all logic in one script and update directly
  • ❌ Manually document changes in shared doc
  • ❌ Avoid changes to prevent breaking things
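
The notes test `normalize` but never show it. One possible implementation that satisfies the three assertions in `test_normalize_company_name` is sketched below; it is not the original pipeline's code, and the suffix table `SUFFIX_ALIASES` is an assumption introduced for illustration.

```python
# Assumed mapping of long legal suffixes to their short form
SUFFIX_ALIASES = {"incorporated": "Inc"}


def normalize(name):
    """Trim and collapse whitespace, title-case each word, shorten known suffixes."""
    words = name.split()  # split() also drops leading/trailing/repeated spaces
    return " ".join(SUFFIX_ALIASES.get(word.lower(), word.capitalize()) for word in words)


print(normalize("apple incorporated"))  # Apple Inc
```

Writing the tests first, as suggested above, means a change to this function (for example new suffix rules) shows up immediately as a passing or failing assertion before it reaches production data.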

Git Commands - Complete Quick Reference

# SETUP
git config --global user.name "Name"
git config --global user.email "email"
git init                    # initialize new repo
git clone url               # clone existing repo

# DAILY WORKFLOW
git status                  # what changed?
git diff                    # what exactly changed? ✅
git add .                   # stage all ✅
git commit -m "message"     # save snapshot ✅
git push                    # upload to remote ✅
git pull origin main        # get latest ✅

# BRANCHING
git branch                  # list branches
git checkout -b feature     # create + switch ✅
git checkout main           # switch branch
git merge feature           # merge branch
git branch -d feature       # delete branch

# UNDOING
git checkout -- .           # discard changes ✅
git restore --staged file   # unstage
git reset --soft HEAD~1     # undo last commit (keep changes)
git reset --hard HEAD~1     # undo last commit (discard) ⚠️
git revert HEAD             # safe undo (new commit)
git stash / git stash pop   # save/restore temp changes

# REMOTE
git remote -v               # show remote URLs
git fetch origin            # download (don't merge)
git pull origin main        # download + merge ✅
git push origin feature     # push branch

# INSPECTION
git log --oneline --graph   # visual history
git show abc123             # show commit details
git blame file.py           # who changed each line
git tag -a v1.0 -m "msg"    # create tag

# HOTFIX
git checkout -b hotfix/fix  # from main ✅
# → fix → commit → PR
# → merge to main AND develop ✅

Git - Exam Scenario Answers

Scenario                             Command
Get latest team updates              git pull
See what changed in files            git diff
Save work and upload                 git add . → git commit → git push
Create and switch branch             git checkout -b feature-name
Discard working directory changes    git checkout -- .
Files Git should not track           .gitignore
Collaborative review before merge    Pull Request (PR) ✅
Critical production bug fix          Hotfix branch → merge to main + develop ✅
Two people edit same line            Merge conflict → manual resolution ✅