8. Python & pandas - Core Operations#

1. Detecting Missing Values:#

df.isnull().sum()         # count NaN per column
df.isnull().sum() / len(df) * 100  # percentage missing

2. df.count() - Count Non-Null Values:#

df.count()           # non-null count per column ✅
df['col'].count()    # for single column
  • df.notnull() → returns boolean mask, not count
  • df.describe() → summary stats, not just count
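
A minimal sketch on a made-up DataFrame showing how these calls relate:

import pandas as pd
import numpy as np

df = pd.DataFrame({'rating': [4.5, np.nan, 3.0], 'city': ['NYC', 'LA', None]})
df.isnull().sum()     # rating: 1, city: 1
df.count()            # rating: 2, city: 2
df.notnull().sum()    # same counts as count(), via the boolean mask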

3. df.dropna() - Remove Missing Values:#

df.dropna()                        # drop rows with ANY NaN
df.dropna(subset=['review_text'])  # drop rows where specific column is NaN ✅
df.dropna(how='all')               # drop only if ALL values are NaN
  • dropna(subset=['col']) - exam answer for targeted column NaN removal
  • ❌ Replace all NaN with “missing” → pollutes text analysis
  • ❌ Delete entire column → loses all data
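
A quick sketch (toy data) showing that subset= targets only the named column:

import pandas as pd

df = pd.DataFrame({'review_text': ['great', None, 'ok'], 'rating': [5, 4, None]})
df.dropna()                          # keeps only row 0 (the fully complete row)
df.dropna(subset=['review_text'])    # keeps rows 0 and 2; the NaN rating survives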

4. df.fillna() - Fill Missing Values:#

df.fillna(0)                        # fill with constant
df['col'].fillna(df['col'].mean())  # fill with mean
df['col'].fillna(df['col'].median()) # fill with median
df.ffill()                          # forward fill (time series)
df.bfill()                          # backward fill
  • fillna(method='ffill') is deprecated in pandas 2.x; prefer .ffill()/.bfill()

5. When to Use Which Fill Strategy:#

Strategy        When
Drop rows       Text analysis, small % missing
Fill mean       Normally distributed numeric
Fill median     Skewed data, outliers present
Fill mode       Categorical data
Forward fill    Time series, ordered data
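
A small sketch (made-up values) of why median beats mean when outliers are present:

import pandas as pd
import numpy as np

s = pd.Series([1, 2, 3, 100, np.nan])   # 100 is an outlier
s.fillna(s.mean())     # fills with 26.5, dragged up by the outlier
s.fillna(s.median())   # fills with 2.5, robust to the outlier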

6. column.nunique() - Count Distinct Values:#

df['col'].nunique()      # number of unique values ✅
df['col'].unique()       # array of unique values
df['col'].value_counts() # count of each unique value
  • nunique() - most efficient for just counting distinct values
  • len(set(column)) → works but memory inefficient for large data
  • column.count() → counts non-null, NOT unique

7. Value Inspection Methods:#

Method          Returns
nunique()       Integer count of unique values ✅
unique()        Array of all unique values
value_counts()  Frequency of each unique value
count()         Count of non-null values
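
A toy Series (hypothetical cuisine labels) showing what each method returns:

import pandas as pd

s = pd.Series(['thai', 'mexican', 'thai', None])
s.nunique()        # 2 (NaN excluded by default)
s.unique()         # array(['thai', 'mexican', None], dtype=object)
s.value_counts()   # thai: 2, mexican: 1 (NaN excluded)
s.count()          # 3 non-null values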

8. groupby().mean() - Aggregation:#

df.groupby('cuisine')['rating'].mean()   # average rating per cuisine ✅
  • df.sort_values('cuisine') → sorts, doesn’t aggregate
  • df.filter('cuisine') → selects column labels, doesn't aggregate
  • df.merge('cuisine', 'rating') → not valid syntax
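
A minimal sketch with made-up restaurant data:

import pandas as pd

df = pd.DataFrame({
    'cuisine': ['thai', 'thai', 'mexican'],
    'rating':  [4.0, 5.0, 3.0],
})
df.groupby('cuisine')['rating'].mean()   # mexican: 3.0, thai: 4.5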

9. All Aggregation Functions:#

df.groupby('col')['val'].mean()    # average
df.groupby('col')['val'].sum()     # total
df.groupby('col')['val'].count()   # count
df.groupby('col')['val'].max()     # maximum
df.groupby('col')['val'].min()     # minimum
df.groupby('col')['val'].median()  # median
df.groupby('col')['val'].std()     # standard deviation
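
To compute several aggregates in one pass, .agg() accepts a list of function names; a sketch with toy data:

import pandas as pd

df = pd.DataFrame({'col': ['a', 'a', 'b'], 'val': [1, 2, 3]})
df.groupby('col')['val'].agg(['mean', 'count', 'max'])   # one row per group, one column per function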

10. drop_duplicates() - Remove Duplicates:#

df.drop_duplicates()                    # all columns
df.drop_duplicates(subset=['col'])      # specific column ✅
df['col'].drop_duplicates().values      # memory-efficient unique values ✅
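
A toy example showing the difference between full-row and single-column deduplication:

import pandas as pd

df = pd.DataFrame({'col': ['a', 'a', 'b'], 'other': [1, 2, 3]})
df.drop_duplicates()                  # keeps all 3 rows (full rows differ)
df.drop_duplicates(subset=['col'])    # keeps the first 'a' row and the 'b' row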

11. Correlation Analysis:#

df.corr(numeric_only=True)             # correlation matrix for numeric columns
df['col1'].corr(df['col2'])            # correlation between two columns

# pivot_table + corr for customer similarity
# .corr() correlates columns, so transpose first to put customers on the columns
df.pivot_table(
    index='customer_id',
    columns='product_category',
    values='purchase_amount'
).fillna(0).T.corr()                   # customer-by-customer matrix ✅ exam answer
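
An end-to-end sketch of the pivot-then-correlate pattern on made-up purchase records:

import pandas as pd

df = pd.DataFrame({
    'customer_id':      [1, 1, 2, 2],
    'product_category': ['food', 'toys', 'food', 'toys'],
    'purchase_amount':  [10.0, 5.0, 20.0, 8.0],
})
wide = df.pivot_table(index='customer_id',
                      columns='product_category',
                      values='purchase_amount').fillna(0)
wide.T.corr()   # 2x2 customer-by-customer correlation matrix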

12. Sorting:#

df.sort_values('col')                   # ascending
df.sort_values('col', ascending=False)  # descending
df.sort_values(['col1', 'col2'])        # multiple columns

13. Merging DataFrames:#

pd.merge(df1, df2, on='key')              # inner join (default)
pd.merge(df1, df2, on='key', how='left')  # left join
pd.merge(df1, df2, on='key', how='outer') # outer join
pd.concat([df1, df2])                     # stack vertically
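
A minimal sketch with two toy frames, showing how the join type affects the result:

import pandas as pd

orders    = pd.DataFrame({'key': [1, 2], 'item': ['pad thai', 'tacos']})
customers = pd.DataFrame({'key': [2, 3], 'name': ['Ana', 'Bo']})
pd.merge(orders, customers, on='key')              # inner: only key 2 survives
pd.merge(orders, customers, on='key', how='left')  # keys 1 and 2; name is NaN for key 1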

14. Filtering Rows:#

df[df['rating'] < 3.0]                          # single condition ✅
df[(df['rating'] < 3) & (df['city'] == 'NYC')]  # multiple conditions
df[df['col'].isin(['val1', 'val2'])]            # filter by list
df[df['col'].str.contains('pattern', na=False)]  # string filter (na=False skips NaN)

15. String Operations on Columns:#

df['col'].str.lower()           # lowercase
df['col'].str.upper()           # uppercase
df['col'].str.strip()           # remove whitespace
df['col'].str.replace('a', 'b') # replace
df['col'].str.split(',')        # split
df['col'].str.contains('pat')   # boolean mask
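
These accessors return new Series, so they chain cleanly; a small cleanup sketch with made-up values:

import pandas as pd

s = pd.Series(['  Thai ', 'MEXICAN'])
s.str.strip().str.lower()   # ['thai', 'mexican']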

16. DataFrame Exploration:#

df.head()        # first 5 rows
df.tail()        # last 5 rows
df.info()        # column types + non-null counts
df.describe()    # summary statistics
df.shape         # (rows, columns)
df.dtypes        # data type of each column
df.columns       # column names

17. Data Types:#

df.dtypes                          # check all types
df['col'].astype('int')           # convert to integer
df['col'].astype('float')         # convert to float
df['col'].astype('str')           # convert to string
df['col'].astype('category')      # convert to category (memory efficient)
pd.to_datetime(df['date_col'])    # convert to datetime
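
Conversions return new objects, so assign the result back; errors='coerce' turns unparseable dates into NaT. A sketch with made-up values:

import pandas as pd

df = pd.DataFrame({'col': ['1', '2'], 'date_col': ['2024-01-05', 'not a date']})
df['col'] = df['col'].astype('int')
df['date_col'] = pd.to_datetime(df['date_col'], errors='coerce')  # bad value becomes NaT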

18. Reading & Writing Data:#

pd.read_csv('file.csv')
pd.read_json('file.json')
pd.read_parquet('file.parquet')
pd.read_excel('file.xlsx')
pd.read_sql(query, connection)

df.to_csv('output.csv', index=False)
df.to_json('output.json')
df.to_parquet('output.parquet')
df.to_excel('output.xlsx', index=False)

19. collections.Counter:#

from collections import Counter

# Count frequencies
counts = Counter(['a', 'b', 'a', 'c'])  # Counter({'a': 2, 'b': 1, 'c': 1})
counts.most_common(5)                   # top 5 most frequent ✅

# Count URL segments in log analysis
Counter(
    url.split('/')[2]
    for url in urls
    if len(url.split('/')) > 2
)

Quick Reference#

Missing values:
  df.isnull().sum()           → count NaN
  df.dropna(subset=['col'])   → remove targeted NaN ✅
  df.fillna(value)            → fill NaN

Unique values:
  df['col'].nunique()         → count distinct ✅
  df['col'].unique()          → array of unique
  df['col'].value_counts()    → frequency table

Aggregation:
  df.groupby('col')['val'].mean()  ✅

Duplicates:
  df.drop_duplicates(subset=['col']) ✅

Exception handling:
  No exception → try → finally ✅
  Exception → try → except → finally ✅
  finally → ALWAYS runs ✅
  try-catch → Java, NOT Python ❌
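
A minimal sketch of the control flow those bullets describe:

def divide(a, b):
    try:
        return a / b
    except ZeroDivisionError:
        return None          # runs only when an exception is raised
    finally:
        print('cleanup')     # ALWAYS runs, even after a return

divide(6, 2)   # prints 'cleanup', returns 3.0
divide(6, 0)   # prints 'cleanup', returns None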