Handling missing data
How do you handle missing data in Pandas?
Pandas provides several methods for handling missing data, including isnull(), notnull(), dropna(), and fillna(). The sections below cover five common tasks:
1. Detecting Missing Data
2. Dropping Missing Data
3. Filling Missing Data
4. Interpolating Missing Data
5. Replacing Missing Data
Detecting Missing Data
import pandas as pd
import numpy as np
# Create a DataFrame
data = {
    'Open': [99, 100, 101],
    'High': [102, 103, np.nan],
    'Close': [101, 102, 103],
    'Low': [98, 99, np.nan]
}
df = pd.DataFrame(data)
# Check for missing values
print(df.isnull())
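A boolean mask is hard to read on larger tables, so it is often summarized. The following is a minimal sketch that reuses the sample data from above; the variable names missing_per_column and rows_with_missing are illustrative and not part of the original example.

```python
import numpy as np
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({
    'Open': [99, 100, 101],
    'High': [102, 103, np.nan],
    'Close': [101, 102, 103],
    'Low': [98, 99, np.nan],
})

# Number of missing values in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)

# notnull() is the inverse of isnull(): True where a value is present
print(df['High'].notnull())
```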
Dropping Missing Data
# Drop rows with missing values
df_dropped_rows = df.dropna()
# Drop columns with missing values
df_dropped_cols = df.dropna(axis=1)
print("Dropped Rows:")
print(df_dropped_rows)
print("\nDropped Columns:")
print(df_dropped_cols)
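Dropping every row or column that has any missing value can discard more data than intended. As a hedged sketch, dropna() also accepts the subset, how, and thresh parameters to narrow what gets dropped; the variable names and the threshold of 3 below are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({
    'Open': [99, 100, 101],
    'High': [102, 103, np.nan],
    'Close': [101, 102, 103],
    'Low': [98, 99, np.nan],
})

# Drop a row only when 'High' is missing; other columns are ignored
dropped_subset = df.dropna(subset=['High'])

# Drop a row only when every value in it is missing
dropped_all = df.dropna(how='all')

# Keep only rows with at least 3 non-missing values
dropped_thresh = df.dropna(thresh=3)

print(dropped_subset)
print(dropped_all)
print(dropped_thresh)
```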
Filling Missing Data
# Fill missing values with a specific value
# (a string placeholder turns the numeric columns that had NaN into object dtype)
df_filled_value = df.fillna('Unknown')
# Fill missing values with mean (for numerical columns)
df_filled_mean = df.copy()
df_filled_mean['High'] = df_filled_mean['High'].fillna(df_filled_mean['High'].mean())
print("Filled with Value:")
print(df_filled_value)
print("\nFilled with Mean:")
print(df_filled_mean)
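Different columns often call for different fill values. The sketch below shows two further options: passing a dictionary of per-column values to fillna(), and forward filling with ffill(). The choice of median for 'High' and minimum for 'Low' is an assumption made purely for illustration.

```python
import numpy as np
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({
    'Open': [99, 100, 101],
    'High': [102, 103, np.nan],
    'Close': [101, 102, 103],
    'Low': [98, 99, np.nan],
})

# Fill each column with its own value (illustrative choices)
fill_values = {'High': df['High'].median(), 'Low': df['Low'].min()}
filled_per_column = df.fillna(fill_values)

# Forward fill: carry the last valid value in each column downward
filled_forward = df.ffill()

print(filled_per_column)
print(filled_forward)
```

Forward filling is usually reserved for ordered data such as time series, where the previous observation is a sensible stand-in.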
Interpolating Missing Data
# Interpolate missing values (linear interpolation by default)
df_interpolated = df.copy()
df_interpolated['High'] = df_interpolated['High'].interpolate()
print("Interpolated:")
print(df_interpolated)
Replacing Missing Data
# Replace missing values using a dictionary that maps NaN to a placeholder
# (replace() returns a new DataFrame, so no explicit copy is needed)
df_replaced = df.replace({np.nan: 'Unknown'})
print("Replaced:")
print(df_replaced)
Best Practices
- Always check for missing values using isnull() or notnull().
- Choose the appropriate method based on the data type and requirements.
- Verify the results after handling missing data.
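The verification step can be sketched as a before-and-after count of missing values. This is one possible check under the assumption that mean imputation was the chosen strategy; the names before, cleaned, and after are illustrative.

```python
import numpy as np
import pandas as pd

# Same sample data as in the examples above
df = pd.DataFrame({
    'Open': [99, 100, 101],
    'High': [102, 103, np.nan],
    'Close': [101, 102, 103],
    'Low': [98, 99, np.nan],
})

# Total number of missing values before handling
before = df.isnull().sum().sum()

# Fill each numeric column with its own mean
cleaned = df.fillna(df.mean())

# Total number of missing values after handling
after = cleaned.isnull().sum().sum()

# Confirm nothing is missing and no rows or columns were lost
assert after == 0
assert cleaned.shape == df.shape
print(f"Missing values before: {before}, after: {after}")
```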