4. Git & Version Control#
1. Git - Core Concepts & Purpose#
What is Git?#
- Distributed version control system
- Tracks changes to files over time
- Every developer has full copy of repository
- Works offline - syncs when connected
Core Benefits:#
✅ Tracks every change ever made
✅ Enables collaboration among multiple developers
✅ Rollback to any previous state
✅ Branching - work on features without affecting main code
✅ Blame - see who changed what and when
✅ Merge - combine work from multiple people
❌ Does NOT automatically clean data
❌ Does NOT replace testing
❌ Does NOT store only binary assets
Git vs GitHub vs GitLab:#
| Tool | What it is |
|---|---|
| Git | Version control software (local) |
| GitHub | Cloud hosting for Git repos + collaboration |
| GitLab | Alternative to GitHub (self-hostable) |
| Bitbucket | Another alternative by Atlassian |
2. Git - Core Architecture#
Working Directory   Staging Area      Local Repo       Remote Repo
(your files)        (index)           (.git folder)    (GitHub)
│                   │                 │                │
│      git add      │   git commit    │    git push    │
│──────────────────>│────────────────>│───────────────>│
│                   │                 │                │
│<─────────────────────────────────────────────────────│
│                 git pull / git clone                 │
│                   │                 │                │
│ git checkout -- . │                 │   git fetch    │
│<──────────────────│                 │<───────────────│
Four Areas Explained:#
# 1. Working Directory
# → Files you're currently editing
# → Changes here are not yet staged or committed
# 2. Staging Area (Index)
# → Files marked for next commit
# → git add moves files here
# 3. Local Repository
# → Committed snapshots
# → git commit moves from staging to here
# 4. Remote Repository
# → GitHub/GitLab server
# → git push uploads local commits
# → git pull downloads remote commits
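The flow between the areas described above can be sketched as a toy model in Python. This is only an illustration under simplifying assumptions: the `ToyRepo` class and its method names are hypothetical, and real Git stores content-addressed objects rather than plain dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRepo:
    """Toy model of the areas; not how Git stores data internally."""
    working_dir: dict = field(default_factory=dict)     # path -> content
    staging: dict = field(default_factory=dict)         # the index
    local_commits: list = field(default_factory=list)   # committed snapshots
    remote_commits: list = field(default_factory=list)  # e.g. GitHub

    def add(self, path: str) -> None:
        # Like `git add <path>`: copy the current file content into the index.
        self.staging[path] = self.working_dir[path]

    def commit(self, message: str) -> None:
        # Like `git commit`: snapshot the index into the local repository.
        self.local_commits.append({"message": message, "files": dict(self.staging)})

    def push(self) -> None:
        # Like `git push`: upload local commits the remote does not have yet.
        self.remote_commits.extend(self.local_commits[len(self.remote_commits):])


repo = ToyRepo()
repo.working_dir["script.py"] = "print('v1')"
repo.add("script.py")
repo.working_dir["script.py"] = "print('v2')"  # edited again, but not re-staged
repo.commit("Add script")
repo.push()
print(repo.remote_commits[0]["files"]["script.py"])  # the staged version
```

The point of the sketch: a commit records what was staged, so an edit made after `git add` is not part of the commit until it is staged again.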
3. .gitignore - Excluding Files from Tracking#
What is .gitignore?#
- Text file listing patterns of files Git should NOT track
- Placed in root of repository
- Committed to repo so all team members use same rules
Common .gitignore Entries:#
# Python
__pycache__/
*.pyc
*.pyo
*.pyd
.Python
*.egg-info/
# Virtual environments
venv/
.venv/
env/
# Secrets & API keys ✅ most important
.env
.env.*
secrets.json
credentials.json
config/secrets.py
# Data files (often too large for git)
data/raw/
*.csv
*.parquet
*.pkl
# IDE files
.vscode/
.idea/
*.swp
# OS files (.gitignore has no trailing comments; keep comments on their own line)
# macOS
.DS_Store
# Windows
Thumbs.db
# Jupyter
.ipynb_checkpoints/
# Build artifacts
dist/
build/
# Logs
*.log
logs/
.gitignore vs .git/config:#
| File | Purpose |
|---|---|
| .gitignore | Specify files Git should NOT track ✅ |
| .git/config | Repository settings (remote URL, branches) |
| .gitattributes | File-specific settings (line endings, merge strategy) |
- ❌ .gitignore is NOT for tracking specific files
- ❌ .gitignore is NOT for listing contributors
- ❌ .gitignore is NOT for defining repository settings
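To build intuition for how ignore patterns behave, here is a rough Python approximation. It is a sketch, not Git's actual matcher: the helper `is_ignored` is hypothetical, it ignores `!` negation, `**`, and leading-`/` anchoring, and `fnmatchcase` lets `*` match `/`, which Git does not. For real answers, `git check-ignore -v <path>` is the authoritative tool.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def is_ignored(path: str, patterns: list) -> bool:
    """Rough approximation of .gitignore matching (no '!' negation, no '**')."""
    parts = PurePosixPath(path).parts
    for raw in patterns:
        pattern = raw.strip()
        if not pattern or pattern.startswith("#"):
            continue  # blank lines and full-line comments are skipped
        if pattern.endswith("/"):
            directory = pattern.rstrip("/")
            if "/" in directory:
                # e.g. data/raw/ : treated as a path relative to the repo root
                if path.startswith(directory + "/"):
                    return True
            elif any(fnmatchcase(part, directory) for part in parts[:-1]):
                # e.g. __pycache__/ : matches a directory at any depth
                return True
        elif "/" in pattern:
            # e.g. config/secrets.py : matched against the whole path
            if fnmatchcase(path, pattern):
                return True
        elif fnmatchcase(parts[-1], pattern):
            # e.g. *.pyc : matches the file name at any depth
            return True
    return False


rules = ["# Python", "__pycache__/", "*.pyc", ".env.*", "data/raw/", ".DS_Store # Mac"]
for candidate in ["src/__pycache__/mod.pyc", ".env.local", "data/raw/sales.txt",
                  "src/app.py", ".DS_Store"]:
    print(candidate, is_ignored(candidate, rules))
```

The last rule is deliberately written with a trailing comment: the text after the file name becomes part of the pattern, so `.DS_Store` is not matched by it.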
4. git diff - View Changes#
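# Working directory vs staging area (unstaged changes only)
git diff # ✅ exam answer
# Working directory vs last commit (staged + unstaged changes)
git diff HEAD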
# Staged changes vs last commit
git diff --staged
git diff --cached # same as --staged
# Between two commits
git diff abc123 def456
# Between two branches
git diff main feature-branch
# Specific file
git diff file.py
# Summary only (no actual diff)
git diff --stat
git status # shows WHICH files changed (not the actual diff)
git log # shows commit HISTORY
git show # shows changes in a specific commit
git log -p # shows commits WITH their diffs
git diff Output:#
diff --git a/script.py b/script.py
index abc123..def456 100644
--- a/script.py ← old version
+++ b/script.py ← new version
@@ -10,7 +10,8 @@ ← line numbers
def clean_data(df):
- df.dropna() ← removed line (red)
+ df.dropna(subset=['name']) ← added line (green)
+ df.reset_index(drop=True) ← added line (green)
return df
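The same unified-diff layout can be reproduced with Python's standard `difflib` module, which may help when reading hunk headers. This is a sketch of the format only; the two file versions below are made up to mirror the example, and Git's own diff engine is not involved.

```python
import difflib

# Old and new versions of a small file, one list entry per line.
old = [
    "def clean_data(df):\n",
    "    df.dropna()\n",
    "    return df\n",
]
new = [
    "def clean_data(df):\n",
    "    df.dropna(subset=['name'])\n",
    "    df.reset_index(drop=True)\n",
    "    return df\n",
]

# unified_diff yields the '---' / '+++' headers, an '@@' hunk header,
# then lines prefixed with ' ' (context), '-' (removed) or '+' (added).
diff_lines = list(
    difflib.unified_diff(old, new, fromfile="a/script.py", tofile="b/script.py")
)
print("".join(diff_lines), end="")
```

Here the hunk header is `@@ -1,3 +1,4 @@`: the hunk covers 3 lines starting at line 1 of the old file and 4 lines starting at line 1 of the new file.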
5. Git Workflow Commands - Complete Reference#
Staging & Committing:#
# Check what changed
git status # see modified/staged/untracked files
# Stage changes
git add file.py # stage specific file
git add . # stage ALL changes ✅
git add *.py # stage Python files matched by the shell glob (current directory)
git add -p # stage changes interactively (chunk by chunk)
# Unstage (undo git add)
git restore --staged file.py
git reset HEAD file.py # older syntax
# Commit
git commit -m "Add data cleaning function" # ✅
git commit -am "message" # stage + commit tracked files in one step
git commit --amend # rewrite the last commit (message and/or staged changes)
# Correct sequence for uploading work:
git add .
git commit -m "message"
git push # ✅ exam answer: add → commit → push
Viewing History:#
git log # full commit history
git log --oneline # compact one-line per commit
git log --graph # visual branch graph
git log --oneline --graph --all # full visual history
git log -n 5 # last 5 commits
git log --author="John" # commits by specific author
git log --since="2024-01-01" # commits after date
git log -- file.py # commits touching specific file
git show abc123 # show specific commit details
git show HEAD # show last commit
git show HEAD~1 # show second-to-last commit
6. Syncing with Remote#
# Download + integrate remote changes
git pull # pull from tracked remote branch
git pull origin main # ✅ pull from main branch specifically
# Download WITHOUT integrating
git fetch origin # fetch all remote changes
git fetch origin main # fetch specific branch
# Pull vs Fetch:
# git pull = git fetch + git merge
# git fetch = download only (safe, no changes to working files)
# Push local commits to remote
git push # push to tracked remote branch
git push origin main # push to specific branch
git push origin feature # push feature branch
git push -u origin feature # push + set upstream tracking
git push --force # ⚠️ force push (dangerous, avoid)
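The difference between fetch and pull can be sketched with a toy model in which a branch is just a list of commit IDs. This is an assumption-laden illustration: the `fetch` and `pull` functions are hypothetical, history is linear, and only the fast-forward case of the merge step is modelled.

```python
def fetch(local: dict, remote_main: list) -> None:
    """Like `git fetch origin`: update the remote-tracking ref only."""
    local["origin/main"] = list(remote_main)


def pull(local: dict, remote_main: list) -> None:
    """Like `git pull origin main`, fast-forward case only."""
    fetch(local, remote_main)
    fetched = local["origin/main"]
    if fetched[:len(local["main"])] != local["main"]:
        raise RuntimeError("histories diverged; a real merge would be needed")
    local["main"] = list(fetched)  # fast-forward the local branch


local = {"main": ["c1", "c2"], "origin/main": ["c1", "c2"]}
remote_main = ["c1", "c2", "c3"]

fetch(local, remote_main)
after_fetch = {name: list(commits) for name, commits in local.items()}
print(after_fetch)  # local main is untouched, origin/main has the new commit

pull(local, remote_main)
print(local)        # local main now includes the new commit too
```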
7. Git Branching#
Why Branches?#
- Work on features without affecting main code
- Multiple people work simultaneously without conflicts
- Keep main always production-ready
# Create branch
git branch feature-analysis # create branch
git branch # list local branches
git branch -a # list all branches (local + remote)
git branch -d feature-analysis # delete branch (safe)
git branch -D feature-analysis # force delete branch
# Switch branch
git checkout feature-analysis # switch to branch
git checkout main # go back to main
# Create + switch in one step ✅
git checkout -b feature-analysis # ✅ most common shortcut
git switch -c feature-analysis # newer syntax (Git 2.23+)
# ✅ Exam answer: create and switch
git branch feature-analysis
git checkout feature-analysis
# OR single command:
git checkout -b feature-analysis
Branch Naming Conventions:#
feature/data-cleaning # new features
bugfix/fix-null-handling # bug fixes
hotfix/critical-api-fix # urgent production fixes
release/v1.2.0 # release preparation
docs/update-readme # documentation
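A team could enforce these conventions with a small check, for example in a pre-push hook or CI step. The sketch below is one possible rule set, not a Git feature: the function name is hypothetical, and restricting names to lowercase letters, digits, dots, underscores, and hyphens is an assumption.

```python
import re

# Prefixes taken from the naming conventions listed above.
BRANCH_PATTERN = re.compile(
    r"(feature|bugfix|hotfix|release|docs)/[a-z0-9][a-z0-9._-]*"
)


def is_valid_branch_name(name: str) -> bool:
    """Return True if the name follows '<prefix>/<lowercase-description>'."""
    return BRANCH_PATTERN.fullmatch(name) is not None


for name in ["feature/data-cleaning", "release/v1.2.0", "my-branch", "Feature/X"]:
    print(name, is_valid_branch_name(name))
```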
8. Undoing Changes#
# Discard working directory changes ✅ (exam answer)
git checkout -- . # discard ALL unstaged changes
git checkout -- file.py # discard changes to specific file
git restore . # newer syntax (Git 2.23+)
git restore file.py # specific file
# Unstage (undo git add)
git restore --staged file.py
git reset HEAD file.py
# Undo last commit (keep changes in working directory)
git reset --soft HEAD~1 # undo commit, keep staged
git reset HEAD~1 # undo commit, unstage changes
git reset --mixed HEAD~1 # same as above (default)
# Undo last commit (DISCARD all changes) ⚠️ destructive
git reset --hard HEAD~1
# Revert (creates NEW commit that undoes previous)
git revert HEAD # undo last commit safely
git revert abc123 # undo specific commit
# Stash (save work temporarily)
git stash # save current changes
git stash list # see all stashes
git stash pop # apply most recent stash + delete it
git stash apply # apply most recent stash (keep it)
git stash drop # delete most recent stash
git stash clear # delete all stashes
Undo Command Comparison:#
| Command | Effect | Destructive? |
|---|---|---|
| git checkout -- . | Discard working dir changes ✅ | Yes (local only) |
| git reset --soft | Undo commit, keep staged | No |
| git reset --hard | Undo commit, discard all | Yes ⚠️ |
| git revert | New commit undoing previous | No (safe) |
| git stash | Temporarily save changes | No |
| git clean -fd | Remove untracked files | Yes ⚠️ |
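The three `git reset` modes differ only in how far the reset reaches: HEAD, then the index, then the working directory. A toy Python model makes this concrete. The `reset` function and the dictionary layout are hypothetical; each snapshot is simplified to a mapping of path to content.

```python
import copy


def reset(state: dict, mode: str) -> dict:
    """Toy model of `git reset --<mode> HEAD~1` on simplified snapshots."""
    if mode not in {"soft", "mixed", "hard"}:
        raise ValueError(f"unknown mode: {mode}")
    new = copy.deepcopy(state)
    new["commits"].pop()                   # every mode moves HEAD back one commit
    target = new["commits"][-1]            # the snapshot HEAD now points to
    if mode in {"mixed", "hard"}:
        new["index"] = dict(target)        # index is reset to match HEAD
    if mode == "hard":
        new["working_dir"] = dict(target)  # working files are overwritten too
    return new


state = {
    "commits": [{"a.py": "v1"}, {"a.py": "v2"}],
    "index": {"a.py": "v2"},
    "working_dir": {"a.py": "v2"},
}
for mode in ("soft", "mixed", "hard"):
    result = reset(state, mode)
    print(mode, "index:", result["index"], "working_dir:", result["working_dir"])
```

Only `hard` changes the working directory, which is why it is the one marked destructive.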
9. Merging & Merge Conflicts#
Merging Branches:#
# Merge feature into main
git checkout main # switch to target branch
git merge feature-analysis # merge feature into current branch
# Fast-forward merge when possible (the default; no merge commit is created)
git merge --ff feature
# Always create merge commit
git merge --no-ff feature
# Merge with message
git merge -m "Merge feature-analysis" feature
Merge Conflicts:#
# When do conflicts occur?
# → Two people edit the SAME LINE on different branches
# → Git cannot auto-decide which change to keep
# → Must be RESOLVED MANUALLY ✅ (exam answer)
# After conflict:
git status # shows conflicted files
# Conflicted file looks like:
<<<<<<< HEAD (current branch)
df.dropna(subset=['name']) ← your change
=======
df.dropna(subset=['email']) ← incoming change
>>>>>>> feature-branch
# Resolution steps:
# 1. Open conflicted file
# 2. Choose which change to keep (or combine both)
# 3. Remove conflict markers (<<<<<<<, =======, >>>>>>>)
# 4. Stage resolved file
git add file.py
# 5. Complete merge
git commit
# Abort merge (go back to before merge)
git merge --abort
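To see how the conflict markers partition a file, the following Python sketch splits the first conflict block into the two competing versions. It is an illustration only: `split_conflict` is a hypothetical helper, it handles the default two-way marker style, and it is not a merge tool.

```python
def split_conflict(text: str) -> tuple:
    """Return (ours, theirs) line lists from the first conflict block in text."""
    ours, theirs = [], []
    section = None  # None = outside a conflict, else "ours" or "theirs"
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            section = "ours"
        elif line.startswith("=======") and section == "ours":
            section = "theirs"
        elif line.startswith(">>>>>>>") and section == "theirs":
            break
        elif section == "ours":
            ours.append(line)
        elif section == "theirs":
            theirs.append(line)
    return ours, theirs


conflicted = """\
def clean_data(df):
<<<<<<< HEAD
    df = df.dropna(subset=['name'])
=======
    df = df.dropna(subset=['email'])
>>>>>>> feature-branch
    return df
"""
ours, theirs = split_conflict(conflicted)
print("current branch:", ours)
print("incoming:", theirs)
```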
10. Pull Requests (PR)#
What is a PR?#
- Formal request to merge your branch into main
- Team members can:
- Review code changes
- Leave comments
- Request changes
- Approve the merge
- Standard practice in collaborative projects
PR Workflow:#
1. Create feature branch
git checkout -b feature-data-cleaning
2. Make changes and commit
git add .
git commit -m "Add data cleaning pipeline"
3. Push branch to remote
git push origin feature-data-cleaning
4. Create PR on GitHub/GitLab
→ Go to repository on GitHub
→ Click "New Pull Request"
→ Select base: main, compare: feature-data-cleaning
→ Add title and description
→ Request reviewers
5. Code review
→ Reviewers comment on specific lines
→ Author makes requested changes
→ Push additional commits to same branch
6. Approval and merge
→ Reviewer approves
→ PR merged into main
7. Clean up
git branch -d feature-data-cleaning
git push origin --delete feature-data-cleaning
- ✅ PRs allow code review + approval before merging
- ❌ PRs are NOT faster than direct merging
- ❌ PRs do NOT automatically fix bugs
- ❌ PRs are NOT only for documentation
11. Git Hotfix Workflow#
When to Use Hotfix?#
- Critical bug found in production
- Cannot wait for regular release cycle
- Need immediate fix with minimal disruption
Hotfix Process:#
# Step 1: Create hotfix branch from the last release (here main, assuming it points at the last release tag)
git checkout main
git checkout -b hotfix/critical-routing-bug
# Step 2: Implement and test the fix
# (make changes, run tests)
git add .
git commit -m "Fix critical routing bug in intersection algorithm"
# Step 3: Open PR for expedited review
git push origin hotfix/critical-routing-bug
# → Create PR on GitHub
# → Request urgent/expedited review
# Step 4: Merge into BOTH main AND develop ✅ (exam answer)
git checkout main
git merge hotfix/critical-routing-bug
git tag -a v1.0.1 -m "Hotfix: critical routing bug"
git push origin main --tags
git checkout develop
git merge hotfix/critical-routing-bug
git push origin develop
# Step 5: Delete hotfix branch
git branch -d hotfix/critical-routing-bug
git push origin --delete hotfix/critical-routing-bug
- ✅ Create hotfix branch from release tag → fix → PR → merge to main AND develop
- ❌ Apply fix directly to main without review
- ❌ Delay fix to next regular release cycle
- ❌ Fix in working directory, apply to production, backport to Git later
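A hotfix is usually tagged as a patch release, as with `v1.0.1` in the process above. The helper below sketches how the next patch tag could be derived from the previous one. It assumes tags follow a `vMAJOR.MINOR.PATCH` scheme; the function name is hypothetical.

```python
import re


def next_patch_tag(tag: str) -> str:
    """Return the tag with its patch number incremented, e.g. for a hotfix."""
    match = re.fullmatch(r"v(\d+)\.(\d+)\.(\d+)", tag)
    if match is None:
        raise ValueError(f"unexpected tag format: {tag!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return f"v{major}.{minor}.{patch + 1}"


print(next_patch_tag("v1.0.0"))
```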
12. Branch Strategy - GitFlow#
Complete Branching Model:#
main (production-ready)
│
├── hotfix branches (emergency fixes)
│ └── merge back to main + develop
│
develop (integration branch)
│
├── feature/data-cleaning
├── feature/visualization
├── feature/api-integration
│ └── merge back to develop when complete
│
└── release/v1.2.0 (when ready for production)
└── merge to main + develop, then tag
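The merge rules in this model can be written down as a small lookup table, which is sometimes handy in review checklists or automation. The sketch below encodes only what the diagram states; the names `GITFLOW_MERGE_TARGETS` and `merge_targets` are hypothetical.

```python
# Where each kind of branch is merged when it is finished, per the model above.
GITFLOW_MERGE_TARGETS = {
    "feature": ["develop"],
    "release": ["main", "develop"],
    "hotfix": ["main", "develop"],
}


def merge_targets(branch: str) -> list:
    """Return the branches that `branch` should be merged into."""
    prefix = branch.split("/", 1)[0]
    if prefix not in GITFLOW_MERGE_TARGETS:
        raise ValueError(f"no GitFlow rule for branch {branch!r}")
    return GITFLOW_MERGE_TARGETS[prefix]


for name in ["feature/data-cleaning", "release/v1.2.0", "hotfix/urgent-fix"]:
    print(name, "->", merge_targets(name))
```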
Simple GitHub Flow (simpler alternative):#
main (always deployable)
│
├── feature/data-cleaning → PR → main
├── bugfix/fix-null-values → PR → main
└── hotfix/urgent-fix → PR → main (expedited)
Problem:#
Transformation logic changes frequently
→ Hard to track what changed and when
→ Can't reproduce old results
→ Risk of breaking existing processes
Solution:#
# ✅ Version transformations with git + unit tests
git checkout -b feature/new-transformation-logic
# Write unit tests FIRST
# tests/test_transform.py
# Implement transformation
# src/transform.py
# Test on sample data (not production)
pytest tests/ -v
# If all tests pass → PR → merge
git add .
git commit -m "Update normalization: handle unicode names"
git push origin feature/new-transformation-logic
# Unit tests for transformation (tests/test_transform.py)
# Assumes normalize and transform are importable from src/transform.py
# and that sample_df is a small sample DataFrame defined for the tests.
def test_normalize_company_name():
    assert normalize("APPLE INC") == "Apple Inc"
    assert normalize("apple incorporated") == "Apple Inc"
    assert normalize(" Apple ") == "Apple"

def test_schema_preserved():
    result = transform(sample_df)
    assert list(result.columns) == ['id', 'name', 'revenue']
    assert result['revenue'].dtype == 'float64'
- ✅ Version with git + write unit tests + test on sample data before production
- ❌ Keep all logic in one script and update directly
- ❌ Manually document changes in shared doc
- ❌ Avoid changes to prevent breaking things
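The tests above call a `normalize` function that the notes do not show. Below is a minimal sketch that would satisfy those three assertions. It is an assumption, not the original implementation: the `SUFFIX_ALIASES` table is hypothetical and covers only the suffix variants needed here.

```python
# Hypothetical alias table: maps lowercase suffix variants to a canonical form.
SUFFIX_ALIASES = {
    "inc": "Inc",
    "inc.": "Inc",
    "incorporated": "Inc",
}


def normalize(name: str) -> str:
    """Trim whitespace, fix capitalization, and unify company suffixes."""
    words = name.strip().split()  # split() also collapses repeated spaces
    cleaned = [SUFFIX_ALIASES.get(word.lower(), word.capitalize()) for word in words]
    return " ".join(cleaned)


print(normalize("APPLE INC"))
print(normalize("apple incorporated"))
print(normalize(" Apple "))
```

Keeping the alias table as data means a new suffix rule is a one-line change that shows up clearly in `git diff` and can be covered by one more assertion.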
Git Commands - Complete Quick Reference#
# SETUP
git config --global user.name "Name"
git config --global user.email "email"
git init # initialize new repo
git clone url # clone existing repo
# DAILY WORKFLOW
git status # what changed?
git diff # what exactly changed? ✅
git add . # stage all ✅
git commit -m "message" # save snapshot ✅
git push # upload to remote ✅
git pull origin main # get latest ✅
# BRANCHING
git branch # list branches
git checkout -b feature # create + switch ✅
git checkout main # switch branch
git merge feature # merge branch
git branch -d feature # delete branch
# UNDOING
git checkout -- . # discard changes ✅
git restore --staged file # unstage
git reset --soft HEAD~1 # undo last commit (keep changes)
git reset --hard HEAD~1 # undo last commit (discard) ⚠️
git revert HEAD # safe undo (new commit)
git stash / git stash pop # save/restore temp changes
# REMOTE
git remote -v # show remote URLs
git fetch origin # download (don't merge)
git pull origin main # download + merge ✅
git push origin feature # push branch
# INSPECTION
git log --oneline --graph # visual history
git show abc123 # show commit details
git blame file.py # who changed each line
git tag -a v1.0 -m "msg" # create tag
# HOTFIX
git checkout -b hotfix/fix # from main ✅
# → fix → commit → PR
# → merge to main AND develop ✅
Git - Exam Scenario Answers#
| Scenario | Command |
|---|---|
| Get latest team updates | git pull ✅ |
| See what changed in files | git diff ✅ |
| Save work and upload | git add . → git commit → git push ✅ |
| Create and switch branch | git checkout -b feature-name ✅ |
| Discard working directory changes | git checkout -- . ✅ |
| Files Git should not track | .gitignore ✅ |
| Collaborative review before merge | Pull Request (PR) ✅ |
| Critical production bug fix | Hotfix branch → merge to main + develop ✅ |
| Two people edit same line | Merge conflict → manual resolution ✅ |