8. Python & pandas - Core Operations#
1. Detecting Missing Values:#
df.isnull().sum() # count NaN per column
df.isnull().sum() / len(df) * 100 # percentage missing
2. df.count() - Count Non-Null Values:#
df.count() # non-null count per column ✅
df['col'].count() # for single column
- ❌ df.notnull() → returns boolean mask, not count
- ❌ df.describe() → summary stats, not just count
3. df.dropna() - Remove Missing Values:#
df.dropna() # drop rows with ANY NaN
df.dropna(subset=['review_text']) # drop rows where specific column is NaN ✅
df.dropna(how='all') # drop only if ALL values are NaN
- ✅ dropna(subset=['col']) - exam answer for targeted column NaN removal
- ❌ Replace all NaN with "missing" → pollutes text analysis
- ❌ Delete entire column → loses all data
4. df.fillna() - Fill Missing Values:#
df.fillna(0) # fill with constant
df['col'].fillna(df['col'].mean()) # fill with mean
df['col'].fillna(df['col'].median()) # fill with median
df.ffill() # forward fill (time series); fillna(method='ffill') is deprecated
df.bfill() # backward fill
5. When to Use Which Fill Strategy:#
| Strategy | When |
|---|---|
| Drop rows | Text analysis, small % missing |
| Fill mean | Normally distributed numeric |
| Fill median | Skewed data, outliers present |
| Fill mode | Categorical data |
| Forward fill | Time series, ordered data |
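A minimal runnable sketch of the numeric strategies on a toy Series (values invented; note how the 100.0 outlier separates mean from median):
import pandas as pd
s = pd.Series([1.0, 2.0, None, 100.0]) # skewed by the 100.0 outlier
s.fillna(s.mean()) # NaN → ~34.3, pulled up by the outlier
s.fillna(s.median()) # NaN → 2.0, robust to the outlier
s.fillna(s.mode()[0]) # mode is meant for categoricals; mode() returns a Series, take the first value
s.ffill() # NaN takes the previous value (2.0)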
6. column.nunique() - Count Distinct Values:#
df['col'].nunique() # number of unique values ✅
df['col'].unique() # array of unique values
df['col'].value_counts() # count of each unique value
- ✅ nunique() - most efficient for just counting distinct values
- ❌ len(set(column)) → works but memory inefficient for large data
- ❌ column.count() → counts non-null, NOT unique
7. Value Inspection Methods:#
| Method | Returns |
|---|---|
| nunique() | Integer count of unique values ✅ |
| unique() | Array of all unique values |
| value_counts() | Frequency of each unique value |
| count() | Count of non-null values |
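A quick comparison on a toy Series (values invented):
import pandas as pd
s = pd.Series(['a', 'b', 'a', None])
s.nunique() # 2 (NaN excluded by default)
s.unique() # array(['a', 'b', None], dtype=object)
s.value_counts() # a: 2, b: 1 (NaN excluded)
s.count() # 3 non-null values, NOT unique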
8. groupby().mean() - Aggregation:#
df.groupby('cuisine')['rating'].mean() # average rating per cuisine ✅
- ❌ df.sort_values('cuisine') → sorts, doesn't aggregate
- ❌ df.filter('cuisine') → selects labels, doesn't aggregate
- ❌ df.merge('cuisine', 'rating') → not valid (merge expects a DataFrame)
9. All Aggregation Functions:#
df.groupby('col')['val'].mean() # average
df.groupby('col')['val'].sum() # total
df.groupby('col')['val'].count() # count
df.groupby('col')['val'].max() # maximum
df.groupby('col')['val'].min() # minimum
df.groupby('col')['val'].median() # median
df.groupby('col')['val'].std() # standard deviation
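To compute several aggregates in one pass, .agg() takes a list of function names (a sketch; 'col' and 'val' are placeholders as above):
df.groupby('col')['val'].agg(['mean', 'max', 'count']) # one row per group, one column per function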
10. drop_duplicates() - Remove Duplicates:#
df.drop_duplicates() # all columns
df.drop_duplicates(subset=['col']) # specific column ✅
df['col'].drop_duplicates().values # memory-efficient unique values ✅
11. Correlation Analysis:#
df.corr(numeric_only=True) # correlation matrix (numeric columns only; avoids errors on mixed dtypes)
df['col1'].corr(df['col2']) # correlation between two columns
# pivot_table + corr for customer similarity
df.pivot_table(
    index='customer_id',
    columns='product_category',
    values='purchase_amount'
).fillna(0).corr() # ✅ exam answer
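A self-contained sketch with invented purchase data (all names hypothetical):
import pandas as pd
purchases = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 3],
    'product_category': ['food', 'books', 'food', 'books', 'food'],
    'purchase_amount': [10.0, 5.0, 8.0, 7.0, 12.0],
})
matrix = purchases.pivot_table(
    index='customer_id',
    columns='product_category',
    values='purchase_amount'
).fillna(0)
matrix.corr() # correlates the pivot's columns (categories); use matrix.T.corr() for customer-by-customer similarity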
12. Sorting:#
df.sort_values('col') # ascending
df.sort_values('col', ascending=False) # descending
df.sort_values(['col1', 'col2']) # multiple columns
13. Merging DataFrames:#
pd.merge(df1, df2, on='key') # inner join (default)
pd.merge(df1, df2, on='key', how='left') # left join
pd.merge(df1, df2, on='key', how='outer') # outer join
pd.concat([df1, df2]) # stack vertically
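A tiny runnable example (DataFrames invented for illustration):
import pandas as pd
left = pd.DataFrame({'key': [1, 2, 3], 'name': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3, 4], 'score': [10, 20, 30]})
pd.merge(left, right, on='key') # inner: keys 2 and 3 only
pd.merge(left, right, on='key', how='left') # keys 1-3; score is NaN for key 1
pd.merge(left, right, on='key', how='outer') # keys 1-4; NaN where either side is missing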
14. Filtering Rows:#
df[df['rating'] < 3.0] # single condition ✅
df[(df['rating'] < 3) & (df['city'] == 'NYC')] # multiple conditions
df[df['col'].isin(['val1', 'val2'])] # filter by list
df[df['col'].str.contains('pattern')] # string filter
15. String Operations on Columns:#
df['col'].str.lower() # lowercase
df['col'].str.upper() # uppercase
df['col'].str.strip() # remove whitespace
df['col'].str.replace('a', 'b') # replace
df['col'].str.split(',') # split
df['col'].str.contains('pat') # boolean mask
16. DataFrame Exploration:#
df.head() # first 5 rows
df.tail() # last 5 rows
df.info() # column types + non-null counts
df.describe() # summary statistics
df.shape # (rows, columns)
df.dtypes # data type of each column
df.columns # column names
17. Data Types:#
df.dtypes # check all types
df['col'].astype('int') # convert to integer
df['col'].astype('float') # convert to float
df['col'].astype('str') # convert to string
df['col'].astype('category') # convert to category (memory efficient)
pd.to_datetime(df['date_col']) # convert to datetime
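A sketch of the category memory win on a low-cardinality column (exact sizes vary by platform):
import pandas as pd
s = pd.Series(['red', 'green', 'blue'] * 100_000)
s.memory_usage(deep=True) # object dtype: one Python string per row
s.astype('category').memory_usage(deep=True) # small integer codes + 3 labels, far smaller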
18. Reading & Writing Data:#
pd.read_csv('file.csv')
pd.read_json('file.json')
pd.read_parquet('file.parquet')
pd.read_excel('file.xlsx')
pd.read_sql(query, connection)
df.to_csv('output.csv', index=False)
df.to_json('output.json')
df.to_parquet('output.parquet')
df.to_excel('output.xlsx', index=False)
19. collections.Counter:#
from collections import Counter
# Count frequencies
Counter(['a', 'b', 'a', 'c']) # Counter({'a': 2, 'b': 1, 'c': 1})
Counter(data).most_common(5) # top 5 most frequent ✅ (call on an instance, not the class)
# Count URL segments in log analysis
Counter(
    url.split('/')[2]
    for url in urls
    if len(url.split('/')) > 2
)
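Putting both together on invented log paths (urls is hypothetical):
from collections import Counter
urls = ['/api/users/1', '/api/users/2', '/api/orders/7']
counts = Counter(u.split('/')[2] for u in urls if len(u.split('/')) > 2)
counts.most_common(2) # [('users', 2), ('orders', 1)]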
Quick Reference#
Missing values:
df.isnull().sum() → count NaN
df.dropna(subset=['col']) → remove targeted NaN ✅
df.fillna(value) → fill NaN
Unique values:
df['col'].nunique() → count distinct ✅
df['col'].unique() → array of unique
df['col'].value_counts() → frequency table
Aggregation:
df.groupby('col')['val'].mean() ✅
Duplicates:
df.drop_duplicates(subset=['col']) ✅
Exception handling:
No exception → try → finally ✅
Exception → try → except → finally ✅
finally → ALWAYS runs ✅
try-catch → Java, NOT Python ❌
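A minimal sketch of the control flow:
def divide(x):
    try:
        return 10 / x
    except ZeroDivisionError:
        print('except runs only on error')
    finally:
        print('finally ALWAYS runs')
divide(2) # prints only the finally message
divide(0) # prints the except message, then finally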