Chapter 2 - Data Fundamentals#
Data Types and Structures#
Basic Data Types#
1. Numerical Data#
**Continuous Numbers**
Think of continuous data like measuring temperature - it can be any value: 98.6°F, 98.7°F, or even 98.65°F. Other examples include:
Height (5.7 feet, 5.8 feet)
Weight (150.5 pounds)
House prices ($299,999.99)
**Discrete Numbers**
These are counting numbers that can’t be broken down further. Imagine:
Number of pets in a house
Students in a classroom
Cars in a parking lot
2. Categorical Data#
**Nominal Categories**
Think of these as labels with no order, like:
Colors (Red, Blue, Green)
Car brands (Toyota, Honda, Ford)
Cities (New York, London, Tokyo)
**Ordinal Categories**
These are categories with a natural order, such as:
T-shirt sizes (S, M, L, XL)
Movie ratings (1 to 5 stars)
Education levels (High School, Bachelor’s, Master’s)
3. Binary Data#
This type of data has only two possible values, often represented as 0 and 1. Examples include:
Yes/No questions (e.g., Is the light on? Yes or No)
Gender (Male or Female)
Pass/Fail results in exams
4. Time Series Data#
This is data that changes over time, like:
Daily stock prices
Monthly temperature readings
Weekly sales figures
5. Text Data#
Words and sentences that carry meaning, such as:
Customer reviews
Email content
Social media posts
Remember: Understanding data types helps us choose the right tools and techniques for our machine learning models. Just like you wouldn’t use a hammer to cut paper, each data type needs its own special handling!
Exercise: Understanding Data Types and Structures#
Let’s test your understanding of different data types in machine learning. For each scenario, select the most appropriate data type:
| Scenario | Your Answer |
|---|---|
| A sensor measuring ocean water temperature every hour | |
| Customer satisfaction ratings from 1-5 stars | |
| Blood type (A, B, AB, O) | |
| Number of children in a family | |
| Product reviews written by customers | |
| Daily closing prices of a stock | |
| Distance traveled by a car (in miles) | |
| Academic grades (A, B, C, D, F) | |
Data Structures in Python#
These data structures are commonly used in machine learning:
1. NumPy Arrays#
Think of NumPy arrays like a super-powered list. Imagine organizing chocolates in a box:
A 1D array is like a single row of chocolates
A 2D array is like a box with rows and columns
A 3D array is like stacking multiple boxes
import numpy as np
# 1D array (like a line of students)
heights = np.array([170, 175, 160, 180])
# 2D array (like a seating chart)
classroom = np.array([
[1, 2, 3],
[4, 5, 6]
])
2. Pandas DataFrames & Series#
**2.1 Series**
Think of a Series as a single column in a spreadsheet:
Like a list of test scores with student names as labels
Or daily temperatures for a week
import pandas as pd
# Series (like a single column of grades)
grades = pd.Series([85, 90, 88, 92],
index=['Alice', 'Bob', 'Charlie', 'David'])
**2.2 DataFrames**
Imagine a DataFrame as an Excel spreadsheet:
Rows represent individual entries (like students)
Columns represent different features (like subjects)
# DataFrame (like a grade book)
student_data = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [15, 16, 15],
'Grade': [90, 85, 88]
})
3. Tensors#
Think of tensors as stacks of arrays:
0 Dimensional (0D) tensor: Single number (scalar)
1 Dimensional (1D) tensor: Array (vector)
2 Dimensional (2D) tensor: Matrix
3 Dimensional (3D) tensor: Cube of numbers (like a Rubik’s cube)
4 Dimensional (4D) tensor: Multiple cubes (like video frames)
import torch
# Creating a simple tensor (like RGB values of an image)
image_tensor = torch.tensor([
[[255, 0, 0], # Red
[0, 255, 0], # Green
[0, 0, 255]] # Blue
])
4. Common ML Dataset Formats#
**CSV Files**
Like a simple spreadsheet:
name,age,grade
Alice,15,90
Bob,16,85
**JSON Format**
Like a nested dictionary:
{
"students": [
{"name": "Alice", "age": 15},
{"name": "Bob", "age": 16}
]
}
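The JSON snippet above can be parsed with Python’s built-in `json` module; a minimal sketch (the variable names are illustrative):

```python
import json

# The JSON document from above, as a Python string
raw = '{"students": [{"name": "Alice", "age": 15}, {"name": "Bob", "age": 16}]}'

data = json.loads(raw)                         # parse into nested dicts/lists
names = [s["name"] for s in data["students"]]  # walk the nested structure
print(names)                                   # ['Alice', 'Bob']
```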
**Parquet Files**
Think of these as compressed, column-oriented files:
Great for big data
Faster to read than CSV
Takes less storage space
Remember: These structures are like different types of containers. Just like you wouldn’t store soup in a paper bag, choosing the right data structure is crucial for efficient machine learning!
Exploratory Data Analysis#
Statistical Summary#
1. Mean, Median, and Mode#
Think of these as the “averages family”:
Mean: The average value
Like splitting a pizza equally among friends
Example: Test scores [90, 85, 85, 95, 75]
Mean = (90 + 85 + 85 + 95 + 75) ÷ 5 = 86
Median: The middle value
Like finding the middle person in a height-ordered line
Example: [75, 85, 85, 90, 95]
Median = 85 (middle number)
Mode: The most common value
Like the most popular ice cream flavor
Example: [75, 85, 85, 90, 95]
Mode = 85 (appears twice)
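The “averages family” above can be computed directly with Python’s built-in `statistics` module, using the test scores from the example:

```python
from statistics import mean, median, mode

scores = [90, 85, 85, 95, 75]

print(mean(scores))    # 86  (430 ÷ 5)
print(median(scores))  # 85  (middle of sorted [75, 85, 85, 90, 95])
print(mode(scores))    # 85  (appears twice)
```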
2. Standard Deviation#
Think of this as the “spread-out-ness” of your data:
Small standard deviation = data points close together
Like a tight group of friends walking together
Large standard deviation = data points spread out
Like friends scattered across a playground
Example:
Dataset A: [10, 11, 9, 10, 10] → Small deviation
Dataset B: [2, 18, 5, 15, 10] → Large deviation
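The “spread-out-ness” of datasets A and B can be measured with `statistics.pstdev` (population standard deviation); a quick sketch:

```python
from statistics import pstdev

dataset_a = [10, 11, 9, 10, 10]   # tightly grouped
dataset_b = [2, 18, 5, 15, 10]    # widely spread

print(round(pstdev(dataset_a), 2))  # 0.63
print(round(pstdev(dataset_b), 2))  # 5.97
```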
3. Quartiles and IQR#
Imagine lining up students by height:
Q1 (25th percentile): Quarter way through the line
Q2 (50th percentile): Halfway (same as median)
Q3 (75th percentile): Three-quarters way through
IQR = Q3 - Q1 (middle 50% of data)
Example:
Data: [1, 3, 5, 7, 8, 9, 10, 12, 15]
Q1 = 4
Q2 = 8
Q3 = 11
IQR = 11 - 4 = 7
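`statistics.quantiles` (with its default "exclusive" method) reproduces the hand calculation above:

```python
from statistics import quantiles

data = [1, 3, 5, 7, 8, 9, 10, 12, 15]

# The default exclusive method matches the example's Q1/Q2/Q3
q1, q2, q3 = quantiles(data, n=4)
print(q1, q2, q3)   # 4.0 8.0 11.0
print(q3 - q1)      # IQR = 7.0
```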
4. Correlation Analysis#
Think of correlation as how two things move together:
Positive correlation: Both increase together
Like ice cream sales and temperature
Negative correlation: One up, one down
Like study time and gaming time
No correlation: No relationship
Like shoe size and test scores
Scale: -1 to +1
+1: Perfect positive relationship
-1: Perfect negative relationship
0: No relationship
5. Distribution Analysis#
Think of this as the “shape” of your data:
Normal Distribution (Bell Curve)
Like heights in a classroom
Most values cluster in the middle
Fewer values at extremes
Skewed Distribution
Right-skewed: Long tail on right
Like household income
Left-skewed: Long tail on left
Like test scores in an easy exam
Uniform Distribution
Like rolling a fair die
All values equally likely
Data Quality Checks#
1. Identifying Duplicates#
Think of duplicates like identical twins in your dataset - sometimes they’re real, sometimes they’re mistakes!
Common Scenarios
Same customer entered twice
Multiple transactions with identical values
Accidentally repeated rows
# Example of finding duplicates
students = [
{'id': 1, 'name': 'Alice', 'grade': 90},
{'id': 1, 'name': 'Alice', 'grade': 90}, # Duplicate!
{'id': 2, 'name': 'Bob', 'grade': 85}
]
2. Detecting Outliers#
Outliers are like the odd ones out - imagine a basketball player in a kindergarten class!
Common Methods
IQR Method (1.5 × IQR rule)
Normal range: Q1 - 1.5×IQR to Q3 + 1.5×IQR
Z-score Method
Values beyond 3 standard deviations
Example:
Heights (cm): [165, 170, 168, 172, 169, 250]
250 cm is clearly an outlier!
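The 1.5 × IQR rule above can be sketched in a few lines of standard-library Python, using the heights from the example:

```python
from statistics import quantiles

heights = [165, 170, 168, 172, 169, 250]

q1, _, q3 = quantiles(heights, n=4)            # exclusive method
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # "normal" range

outliers = [h for h in heights if h < low or h > high]
print(outliers)   # [250]
```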
3. Checking Data Balance#
Think of data balance like a see-saw:
Class Balance Example
Spam Detection Dataset:
95% Normal emails (Imbalanced!)
5% Spam emails
Like trying to learn about tigers by showing 100 cats and 1 tiger
Solutions
Oversampling minority class
Undersampling majority class
Generating synthetic data
4. Finding Data Anomalies#
Anomalies are like suspicious behavior patterns:
Types of Anomalies
Point Anomalies
Single weird values
Like a -50°C temperature in summer
Contextual Anomalies
Normal values in wrong context
Like high ice cream sales in winter
Pattern Anomalies
Strange sequences
Like sudden spikes in network traffic
5. Verifying Data Types#
Think of this as making sure ingredients are correct before cooking:
Common Issues
Numbers stored as text
“123” instead of 123
Dates in wrong format
“01-12-2023” vs “12-01-2023”
Mixed types in columns
Age column: [25, “thirty”, 40]
Data Preprocessing#
Data Cleaning#
Think of data cleaning like preparing ingredients before cooking a meal - everything needs to be just right!
1. Removing Duplicates#
Think of this like removing extra copies of the same photo:
# Example messy data
student_records = [
{'id': 1, 'name': 'Alice', 'grade': 90},
{'id': 1, 'name': 'Alice', 'grade': 90}, # Duplicate
{'id': 1, 'name': 'alice', 'grade': 90} # Case difference
]
Common Duplicate Types
Exact duplicates (identical rows)
Partial duplicates (same core info, different format)
Case-sensitive duplicates (JOHN vs john)
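A minimal, illustrative dedupe sketch in plain Python that also catches the case-sensitive duplicates mentioned above (pandas users would typically reach for `drop_duplicates` instead):

```python
student_records = [
    {'id': 1, 'name': 'Alice', 'grade': 90},
    {'id': 1, 'name': 'Alice', 'grade': 90},   # exact duplicate
    {'id': 1, 'name': 'alice', 'grade': 90},   # case-different duplicate
    {'id': 2, 'name': 'Bob',   'grade': 85},
]

seen = set()
unique = []
for rec in student_records:
    # Normalize case so 'Alice' and 'alice' collapse to one key
    key = (rec['id'], rec['name'].lower(), rec['grade'])
    if key not in seen:
        seen.add(key)
        unique.append(rec)

print(len(unique))   # 2 — one Alice, one Bob
```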
2. Handling Inconsistent Values#
Think of this like standardizing measurements in a recipe:
Common Inconsistencies
Different spellings
“New York” vs “NY” vs “New York City”
Different units
“1.8m” vs “180cm” vs “5′11″”
Different formats
“Yes” vs “Y” vs “1” vs “True”
Example Standardization:
Before:
- States: ["NY", "New York", "new york"]
- Gender: ["M", "Male", "male", "1"]
- Status: ["Y", "Yes", "TRUE", "1"]
After:
- States: ["NY", "NY", "NY"]
- Gender: ["M", "M", "M", "M"]
- Status: [True, True, True, True]
3. String Cleaning#
Like washing vegetables before cooking:
Common String Issues
Extra spaces: “John Smith “
Mixed case: “john SMITH”
Special characters: “John’s-Account#123”
Unwanted symbols: “Phone: (555) 123-4567”
4. Date/Time Formatting#
Think of this like setting all clocks to the same time zone:
Common Date Formats
Input Formats:
- "01-12-2023"
- "12/01/2023"
- "2023.12.01"
- "Dec 1, 2023"
Standardized Output:
- "2023-12-01"
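One hedged way to standardize such formats with the standard library’s `datetime`. Note the assumption baked in below that "01-12-2023" means day-month-year; verify that against your own data source before reusing the format list:

```python
from datetime import datetime

# Assumed input conventions - adjust to match your data source
formats = ["%d-%m-%Y", "%m/%d/%Y", "%Y.%m.%d", "%b %d, %Y"]

def standardize(date_string):
    """Try each known format and return an ISO date (YYYY-MM-DD)."""
    for fmt in formats:
        try:
            return datetime.strptime(date_string, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {date_string}")

print(standardize("01-12-2023"))   # 2023-12-01
print(standardize("Dec 1, 2023"))  # 2023-12-01
```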
5. Fixing Data Types#
Like putting groceries in their proper places:
Common Conversions
# Before cleaning
data = {
'age': ['25', '30', 'unknown', '40'],
'salary': ['50,000', '60000', '75,000'],
'active': ['Y', 'N', 'yes', 'no']
}
# After cleaning
clean_data = {
'age': [25, 30, None, 40],
'salary': [50000, 60000, 75000],
'active': [True, False, True, False]
}
Data Transformation#
1. One-Hot Encoding#
Think of this like breaking down a multiple-choice question into yes/no questions:
Original Data
colors = ['red', 'blue', 'green']
# Becomes:
# red: [1, 0, 0]
# blue: [0, 1, 0]
# green: [0, 0, 1]
When to Use
Categorical data with no order
Like: Cities, Colors, Product Categories
When categories are equally important
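A minimal one-hot sketch in plain Python (real projects would typically use `pandas.get_dummies` or scikit-learn’s `OneHotEncoder`):

```python
categories = ['red', 'blue', 'green']   # fixed column order

def one_hot(value):
    """Return a 0/1 vector with a single 1 at the category's position."""
    return [1 if value == c else 0 for c in categories]

encoded = {c: one_hot(c) for c in categories}
print(encoded['red'])    # [1, 0, 0]
print(encoded['blue'])   # [0, 1, 0]
print(encoded['green'])  # [0, 0, 1]
```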
2. Label Encoding#
Think of this like giving students numbers instead of names:
# T-Shirt Sizes
sizes = ['S', 'M', 'L', 'XL']
# Becomes:
# S → 0
# M → 1
# L → 2
# XL → 3
When to Use
Ordinal categories
When order matters
Like: Grades, Rankings, Sizes
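Label encoding is just a mapping from the ordered categories to integers; a sketch:

```python
sizes = ['S', 'M', 'L', 'XL']                    # already in natural order
size_to_label = {s: i for i, s in enumerate(sizes)}

print(size_to_label)        # {'S': 0, 'M': 1, 'L': 2, 'XL': 3}
print(size_to_label['L'])   # 2
```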
3. Binning Numerical Data#
Think of this like grouping students by age ranges:
# Ages 0-100 grouped into labeled ranges
age_bins = {
    '0-18': 'Child',
    '19-35': 'Young Adult',
    '36-50': 'Adult',
    '51+': 'Senior'
}
# Example:
# 15 → Child
# 27 → Young Adult
# 45 → Adult
Common Binning Methods
Equal-width bins (same size ranges)
Equal-frequency bins (same number of items)
Custom bins (based on domain knowledge)
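Custom binning can be sketched with the standard library’s `bisect` module; the bin edges below mirror the age groups from the example:

```python
import bisect

edges = [18, 35, 50]    # bin boundaries: 0-18, 19-35, 36-50, 51+
labels = ['Child', 'Young Adult', 'Adult', 'Senior']

def age_group(age):
    """Map an age to its bin label via binary search over the edges."""
    return labels[bisect.bisect_left(edges, age)]

print(age_group(15))  # Child
print(age_group(27))  # Young Adult
print(age_group(45))  # Adult
print(age_group(70))  # Senior
```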
4. Log Transformation#
Think of this like using a zoom lens to see both mountains and molehills:
When to Use
Data with large ranges
Skewed distributions
When ratios matter more than differences
Example:
Income Data:
Original: [30000, 50000, 1000000]
Log: [10.3, 10.8, 13.8]
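The income example above uses the natural logarithm; a sketch with `math.log`:

```python
import math

incomes = [30_000, 50_000, 1_000_000]
log_incomes = [round(math.log(x), 1) for x in incomes]   # natural log

print(log_incomes)  # [10.3, 10.8, 13.8] — the $1M value no longer dominates
```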
5. Power Transformation#
Think of this like adjusting a camera’s exposure:
Common Types
Square Root: For moderately skewed data
Cube Root: For strongly skewed data
Box-Cox: Automatically finds best power
Example:
Original Data: [1, 4, 9, 16]
Square Root: [1, 2, 3, 4]
Cube Root: [1, 1.6, 2.1, 2.5]
Remember:
Choose transformations based on data type
Consider the model you’ll use
Keep track of transformations for future data
Always scale after transforming
Think of data transformation like preparing ingredients - sometimes you need to chop, grind, or puree to get the right texture for your recipe!
Feature Scaling#
Scaling Methods#
Imagine you’re comparing apples and watermelons - you need to make them comparable!
1. Min-Max Scaling#
Think of this like converting test scores to percentages:
Real-Life Example
Original test scores: [60, 75, 90, 100]
Scaled to 0-1: [0, 0.375, 0.75, 1]
# Basketball Players
heights = [160, 180, 200] # cm
scaled_heights = [0, 0.5, 1]
# House Prices
prices = [100_000, 500_000, 900_000]
scaled_prices = [0, 0.5, 1]
When to Use
Like Instagram filters: scaling image pixels (0-255 → 0-1)
Like rating systems (1-5 stars → 0-1)
Perfect for neural networks that expect 0-1 input
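Min-max scaling is a one-line formula; a sketch (scikit-learn’s `MinMaxScaler` does the same with extra bookkeeping):

```python
def min_max_scale(values):
    """Rescale values linearly so min maps to 0 and max maps to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scores = [60, 75, 90, 100]
print(min_max_scale(scores))  # [0.0, 0.375, 0.75, 1.0]
```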
2. Standard Scaling (Z-score)#
Like comparing how unusual something is:
Fun Example: Height in a Class
Original heights (cm): [150, 160, 170, 180, 190]
Mean = 170, population Std ≈ 14.1
Scaled heights: ≈ [-1.41, -0.71, 0, 0.71, 1.41]
Meaning:
- 0 = Average height
- +1 = Taller than average
- -1 = Shorter than average
Real-World Example
Test Scores: [70, 80, 90, 100]
After scaling: ≈ [-1.34, -0.45, 0.45, 1.34]
0 means “average student”
+1 means “one standard deviation above average”
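Z-scores follow directly from the data’s own mean and population standard deviation; a standard-library sketch using the test scores above:

```python
from statistics import mean, pstdev

def standard_scale(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

scores = [70, 80, 90, 100]
z = [round(v, 2) for v in standard_scale(scores)]
print(z)  # [-1.34, -0.45, 0.45, 1.34]
```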
3. Robust Scaling#
Like measuring kids’ heights using a growth chart:
Example: Salary Data
Salaries: [30k, 45k, 50k, 52k, 1M]
↓
After robust scaling: [-1, -0.2, 0, 0.1, 5]
The millionaire doesn't mess up our scale!
4. Normalization#
Like adjusting a recipe for different serving sizes:
Pizza Toppings Example
Original Recipe (4 pizzas):
- Cheese: 400g
- Tomatoes: 200g
- Pepperoni: 200g
↓
Normalized (per pizza):
- Cheese: 0.5 (50%)
- Tomatoes: 0.25 (25%)
- Pepperoni: 0.25 (25%)
When to Use Each Method?#
Think of scaling methods like different measuring tools:
| If Your Data Is Like… | Use This Scaling | Example |
|---|---|---|
| Instagram photo pixels | Min-Max | 0-255 → 0-1 |
| Student test scores | Standard | Compare to class average |
| House prices with mansions | Robust | Handle those million-dollar outliers |
| Recipe ingredients | Normalization | Adjust for different serving sizes |
Remember:
Scale features like you’re making ingredients work together in a recipe
Always scale after splitting training/test data
Keep the original data backed up
Implementation Considerations#
1. Scaling Order in Pipeline#
Think of this like following a recipe in the right order:
Wrong Order (Common Mistake)
# DON'T DO THIS!
1. Scale all data
2. Split into train/test
Result: Data leakage! 😱
Right Order (Like Cooking)
1. Split data first (like separating ingredients)
2. Calculate scaling on training data (like measuring spices)
3. Apply to both train and test (like using same measurements)
Real Example:
# Height and Weight Dataset
heights = [170, 165, 180, 175] # cm
weights = [70, 65, 80, 75] # kg
# Step 1: Split
train_heights = [170, 165]
test_heights = [180, 175]
# Step 2: Calculate scale on train
mean_height = 167.5
std_height = 2.5
# Step 3: Apply the training statistics to both sets
scaled_train = [1.0, -1.0]   # (170-167.5)/2.5, (165-167.5)/2.5
scaled_test = [5.0, 3.0]     # (180-167.5)/2.5, (175-167.5)/2.5
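The three steps can be sketched end-to-end: fit the scaler on the training set only, then reuse those exact statistics for the test set:

```python
from statistics import mean, pstdev

train_heights = [170, 165]
test_heights = [180, 175]

# Fit on training data ONLY (no peeking at the test set)
mu, sigma = mean(train_heights), pstdev(train_heights)   # 167.5, 2.5

# Apply the same parameters to both sets
scaled_train = [(h - mu) / sigma for h in train_heights]
scaled_test = [(h - mu) / sigma for h in test_heights]

print(scaled_train)  # [1.0, -1.0]
print(scaled_test)   # [5.0, 3.0]
```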
Handling New Data#
Like using the same recipe measurements for future cooking:
Example: Temperature Scaling
Training Data:
Temps = [60°F, 70°F, 80°F]
Min = 60°F, Max = 80°F
New Data comes in: 90°F
Use SAME min/max from training:
scaled = (90 - 60) / (80 - 60) = 1.5
Scaling Categorical Data#
Think of this like organizing different types of fruits:
Before Scaling:
Colors: ['red', 'blue', 'red', 'green']
Sizes: ['S', 'M', 'L']
After Proper Handling:
# One-Hot for Colors:
red: [1, 0, 0]
blue: [0, 1, 0]
green: [0, 0, 1]
# Ordinal for Sizes:
S: 0
M: 1
L: 2
Common Pitfalls#
Like common cooking mistakes to avoid:
Pitfall 1: Scaling Target Variable
House Prices (Target):
Original: [100k, 200k, 300k]
DON'T scale unless absolutely necessary!
Pitfall 2: Scaling After Feature Selection
Wrong Order:
1. Select important features
2. Scale data
✓ Right Order:
1. Scale data
2. Select important features
Pitfall 3: Using Wrong Scaler
Salary Data with Outliers:
[30k, 35k, 32k, 1M]
❌ Standard Scaler: Gets messed up
✓ Robust Scaler: Handles it well
Best Practices#
Think of these like kitchen safety rules:
1. Always Save Scaler Parameters
# Save these from training:
min_value = data.min()
max_value = data.max()
# Use for future data
2. Handle Missing Values First
Original: [10, NaN, 20, 30]
❌ Scale first
✓ Fill NaN, then scale
3. Document Your Scaling
# Good Documentation
scaling_info = {
'height': 'standard_scaler',
'weight': 'min_max_scaler',
'age': 'robust_scaler'
}
Remember:
Scale like you’re following a recipe
Keep your scaling consistent
Document everything
Test your pipeline with new data
Handling Missing Values#
Missing Value Analysis#
1. Types of Missing Data#
Think of this like taking attendance in a classroom:
Missing Completely at Random (MCAR)
Like students randomly absent due to various reasons
Example:
Survey responses where some people randomly skipped questions
Temperature readings where sensor occasionally malfunctions
Temperature Log:
Monday: 75°F
Tuesday: ?? (Sensor battery died)
Wednesday: 72°F
Thursday: 74°F
Missing at Random (MAR)
Like teenagers more likely to skip morning classes
Missing values related to other variables
Shopping Survey:
Age: 25 | Income: $50k | Luxury Brands: ??
Age: 65 | Income: $70k | Luxury Brands: Yes
Age: 30 | Income: $40k | Luxury Brands: No
Missing Not at Random (MNAR)
Like people avoiding salary questions in surveys
The missing value itself is meaningful
Salary Survey:
Low Income: Answered
High Income: Often Missing
Very High Income: Usually Missing
2. Missing Patterns#
Think of this like patterns in student homework submission:
Univariate Pattern
Like one student always missing math assignments
Student Data:
Math: ??, ??, ??
English: 85, 90, 88
Science: 92, 87, 90
Multivariate Pattern
Like students missing both test and homework when sick
Sick Day Records:
Test Score: ?? | Homework: ?? | Attendance: Absent
Test Score: 85 | Homework: 90 | Attendance: Present
3. Impact Assessment#
Like checking how missing ingredients affect a recipe:
Critical Missing Data
Recipe Example:
Flour: 2 cups
Sugar: ??
Eggs: 2
Impact: Can't bake without knowing sugar amount!
Non-Critical Missing Data
Customer Profile:
Name: John
Age: ??
Favorite Color: Blue
Impact: Can still provide service without age
Documentation#
Like keeping a kitchen incident log:
Good Documentation Example:
Dataset: Customer Survey 2023
Missing Values Log:
- Income: 15% missing (people skipped)
- Age: 5% missing (form errors)
- Email: 20% missing (optional field)
Actions Taken:
- Income: Used median by profession
- Age: Used mean age
- Email: Marked as 'Not Provided'
Real-World Solutions#
Like Fixing a Recipe:
Original Recipe:
- Sugar: 1 cup
- Flour: ?? cups
- Eggs: 2
Solutions:
1. Check similar recipes (Like looking at similar data points)
2. Ask an expert (Domain knowledge)
3. Use standard amount (Mean/median imputation)
4. Skip recipe (Remove row)
Remember:
Missing values are like missing ingredients - sometimes you can substitute, sometimes you can’t
Document your missing value strategy like a recipe modification
Always consider why the data is missing before deciding how to handle it
Handling Missing Values Techniques#
1. Deletion Methods#
Think of this like cleaning up a messy photo album:
**Complete Case Analysis (Listwise Deletion)**
Like removing group photos where someone blinked:
Before:
Photo 1: [👨 👩 😴 👶] # Someone sleeping
Photo 2: [👨 👩 👧 👶] # Perfect photo
Photo 3: [👨 ?? 👧 👶] # Someone missing
After:
Keep only Photo 2
**Pairwise Deletion**
Like using different photos for different purposes:
Family Height Chart:
Dad: 6'0"
Mom: ??
Son: 5'0"
Daughter: 5'2"
Use what you have:
- Can still compare Dad and kids
- Skip comparisons with Mom
2. Simple Imputation#
Think of this like substituting ingredients in cooking:
**Mean Imputation**
Like using average cookie size when the recipe doesn’t specify:
Cookie Sizes:
[3", 3.5", ??, 3", 4"]
Solution: Use average (3.375")
**Mode Imputation**
Like picking the most common ice cream flavor:
Ice Cream Orders:
[Vanilla, ??, Chocolate, Vanilla, ??, Vanilla]
Solution: Fill with Vanilla
**Constant Imputation**
Like marking “Unknown” for missing names:
Contact List:
Name: John
Phone: ?? → "Not Provided"
Email: john@email.com
3. Statistical Imputation#
Like making educated guesses based on patterns:
**Median Imputation**
Good for skewed data, like house prices:
House Prices:
[$200k, $210k, ??, $220k, $1M]
Better to use median than mean!
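A median-imputation sketch for the house-price example; `statistics.median` is naturally robust to the $1M outlier:

```python
from statistics import median

prices = [200_000, 210_000, None, 220_000, 1_000_000]

known = [p for p in prices if p is not None]
fill = median(known)   # 215000.0 — unaffected by the $1M outlier
imputed = [fill if p is None else p for p in prices]

print(imputed)  # [200000, 210000, 215000.0, 220000, 1000000]
```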
**Regression Imputation**
Like guessing someone’s weight based on height:
Height | Weight
5'0" | 110 lbs
5'6" | ??
6'0" | 180 lbs
Use pattern to estimate missing weight
4. Advanced Imputation#
Like sophisticated detective work:
**K-Nearest Neighbors (KNN)**
Like asking similar people for answers:
Movie Ratings:
Person A: [5, 4, ??, 5]
Similar viewers rated it 4
→ Probably Person A would rate it 4
**Multiple Imputation**
Like getting multiple opinions:
Missing Salary:
Estimate 1: $50k (based on age)
Estimate 2: $52k (based on education)
Estimate 3: $48k (based on experience)
Final: Average of all estimates
When to Use Each Method#
Think of this like choosing tools for different home repairs:
| Missing Data Situation | Method | Real-Life Analogy |
|---|---|---|
| Few missing values (<5%) | Deletion | Removing a few bad apples |
| Random missing values | Mean/Median | Using average recipe portions |
| Categorical data | Mode | Most popular menu item |
| Related variables exist | Regression | Guessing gift price from wrapping |
| Complex patterns | Advanced | Detective solving a complex case |
Remember:
Simple isn’t always worse (like basic recipes often work best)
Consider the impact of your choice (like substituting ingredients)
Document your methods (like writing down recipe modifications)
Test your results (like tasting the food)
Think of handling missing values like being a chef - sometimes you substitute ingredients (imputation), sometimes you modify the recipe (deletion), but always make sure the final dish makes sense!
Feature Selection#
Feature Importance#
Think of this like picking players for a sports team:
1. Statistical Methods#
Like measuring athletes’ performance statistics:
T-Test Example
Basketball Players:
Height Impact:
Tall players: 20 pts average
Short players: 10 pts average
→ Height is statistically significant!
Weight Impact:
Heavy players: 15 pts average
Light players: 14 pts average
→ Weight might not matter much
2. Correlation Analysis#
Like finding which friends always hang out together:
Strong Correlation
Ice Cream Sales vs Temperature:
🌡️ 75°F → $100 sales
🌡️ 85°F → $200 sales
🌡️ 95°F → $300 sales
(Strong positive correlation)
Weak Correlation
Ice Cream Sales vs Day of Week:
Monday: $150
Tuesday: $200
Wednesday: $100
(No clear pattern)
3. Information Gain#
Like choosing questions for “20 Questions” game:
Good Questions (High Information Gain)
Animal Guessing Game:
Q: "Is it a mammal?"
→ Eliminates 70% of possibilities!
Bad Questions (Low Information Gain):
Q: "Is it purple?"
→ Rarely helps identify animal
4. Feature Rankings#
Like ranking kitchen tools by usefulness:
Chef's Tool Ranking:
1. Knife (Used in 90% of dishes)
2. Pan (Used in 70% of dishes)
3. Whisk (Used in 30% of dishes)
4. Melon baller (Used in 1% of dishes)
5. Domain Knowledge#
Like an experienced chef knowing what matters:
Cooking Example:
Making Pizza:
Important Features:
- Dough quality
- Oven temperature
- Cooking time
Less Important:
- Brand of pan
- Color of spatula
- Kitchen temperature
Real-World Example: House Price Prediction#
Like a realtor knowing what affects house prices:
Features Available:
1. Square footage 🏠
2. Location 📍
3. Number of rooms 🚪
4. Wall color 🎨
5. House age 📅
6. Door handle style 🔒
Important (Keep):
- Square footage (Strong correlation)
- Location (High impact)
- Number of rooms (Significant)
Unimportant (Drop):
- Wall color (Can change easily)
- Door handle style (Irrelevant)
Feature Selection Methods Comparison#
Think of this like choosing tools for different jobs:
| Method | Real-Life Analogy | When to Use |
|---|---|---|
| Statistical | Sports tryouts | Clear performance metrics |
| Correlation | Friend groups | Finding relationships |
| Information Gain | Game strategy | Decision trees |
| Domain Knowledge | Expert advice | Industry-specific problems |
Common Patterns to Watch#
Like reading recipe reviews:
Redundant Features
Recipe Ingredients:
- Butter
- Margarine
(Choose one, not both!)
Irrelevant Features
Basketball Performance:
- Height (Relevant)
- Favorite color (Irrelevant)
Remember:
Not all features are created equal (like ingredients in a recipe)
More isn’t always better (like too many spices)
Let the data speak (like taste-testing)
Trust expert knowledge (like following chef’s advice)
Think of feature selection like packing for a trip:
Take what you need (important features)
Leave what you don’t (irrelevant features)
Consider the destination (your model’s purpose)
Ask experienced travelers (domain experts)
Selection Methods#
1. Filter Methods#
Think of this like using tryout tests for a sports team:
Variance Threshold
Basketball Shooting Test:
Player A: [10, 11, 9, 10] points
Player B: [2, 15, 0, 20] points
Player C: [10, 10, 10, 10] points
Keep: Player A, B (show variety)
Drop: Player C (no variance)
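A variance-threshold sketch over the tryout scores above; zero-variance "features" carry no information and get dropped:

```python
from statistics import pvariance

players = {
    'A': [10, 11, 9, 10],
    'B': [2, 15, 0, 20],
    'C': [10, 10, 10, 10],   # identical scores — zero variance
}

threshold = 0.0   # keep only features whose variance exceeds this
kept = [name for name, scores in players.items()
        if pvariance(scores) > threshold]

print(kept)  # ['A', 'B'] — Player C is dropped
```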
**Correlation Filter**
Like choosing between similar players:
Two Players' Stats:
Points | Rebounds
Player1: [20, 10, 15]
Player2: [19, 11, 14]
Decision: Too similar - keep only one!
2. Wrapper Methods#
Like holding actual practice games to pick players:
Forward Selection
Building a Soccer Team:
Start: Empty team
Round 1: Add goalkeeper (biggest impact)
Round 2: Add striker (next best)
Round 3: Add defender
Keep adding until team performance stops improving
Backward Elimination
Orchestra Selection:
Start: Full orchestra
Remove: Triangle player (least impact)
Remove: Second tambourine
Keep removing until performance drops
3. Embedded Methods#
Like having a coach who evaluates players during the actual game:
LASSO Example
Basketball Team:
Player | Points | Defense | Team Spirit
A | High | Low | Medium
B | Medium | Medium | Medium
C | Low | Low | Low
LASSO automatically benches Player C!
4. Dimensionality Reduction#
Like combining similar skills into one rating:
PCA Example
Student Grades:
Math Tests: [90, 85, 88]
Algebra Tests: [92, 87, 89]
Geometry Tests: [88, 86, 90]
Combine into: "Math Ability Score"
Real-World Example:
Restaurant Rating:
Original Features:
- Food Quality (1-5)
- Taste (1-5)
- Flavor (1-5)
- Service Speed (1-5)
- Waiter Friendliness (1-5)
Reduced to:
- Food Score (combining first 3)
- Service Score (combining last 2)
5. Feature Engineering Basics#
Like creating new sports metrics from existing stats:
Combining Features
Shopping Patterns:
Original:
- Items bought: 5
- Total cost: $100
New Feature:
- Cost per item: $20
Breaking Down Features
Date: "2023-11-26"
Becomes:
- Year: 2023
- Month: 11
- Day: 26
- Is_Weekend: True
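Breaking a date into features takes a few lines with the standard library’s `datetime.date`; a sketch using the date from the example:

```python
from datetime import date

purchase = date.fromisoformat("2023-11-26")

features = {
    'year': purchase.year,
    'month': purchase.month,
    'day': purchase.day,
    'is_weekend': purchase.weekday() >= 5,   # Mon=0 … Sat=5, Sun=6
}

print(features)  # {'year': 2023, 'month': 11, 'day': 26, 'is_weekend': True}
```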
Selection Method Comparison#
Think of this like different ways to build a team:
| Method | Real-Life Analogy | Best For |
|---|---|---|
| Filter | Quick tryouts | Large feature sets |
| Wrapper | Practice games | Small feature sets |
| Embedded | In-game evaluation | Complex relationships |
| Reduction | Skill combining | Many similar features |
Common Patterns#
Feature Combinations
Car Performance:
Original Features:
- Distance: 100 miles
- Time: 2 hours
New Feature:
- Speed: 50 mph
Time-Based Features
Sales Data:
Original: Purchase DateTime
New Features:
- Hour of Day
- Day of Week
- Is Holiday
Remember:
Start simple (like basic tryouts)
Test combinations (like team chemistry)
Consider computation costs (like training time)
Use domain knowledge (like coach’s experience)
Think of feature selection methods like different recruitment strategies:
Filters: Quick initial screening
Wrappers: Thorough tryouts
Embedded: On-the-job evaluation
Reduction: Combining roles
Engineering: Creating new positions
The goal is to build the best team (model) with the right players (features)!
Data Visualization#
Basic Plots#
1. Histograms#
Think of this like organizing kids by height in a class photo:
Height Distribution Example:
Kids' Heights:
4'0" - 4'2": 🧍♂️🧍♂️ (2 kids)
4'3" - 4'5": 🧍♂️🧍♂️🧍♂️🧍♂️ (4 kids)
4'6" - 4'8": 🧍♂️🧍♂️🧍♂️ (3 kids)
4'9" - 4'11": 🧍♂️ (1 kid)
Real-Life Uses:
Test scores distribution
Daily steps count
Customer age groups
2. Box Plots#
Like summarizing a family’s height in one picture:
Family Heights:
* (Dad - Outlier)
┌──┬──┐
│ │ │
────┴──┴──┴────
Shortest (Mom) to Tallest (Kids)
Shows:
Minimum (Shortest family member)
Maximum (Tallest regular member)
Median (Middle height)
Outliers (Unusually tall Dad)
3. Scatter Plots#
Like mapping where kids drop their toys:
Toy Location Map:
Kitchen
│
│ 🧸 🎮
│ 🎪
│ 📱
│
─────┼─────────────
│ Living Room
Real Examples:
Ice Cream Sales vs Temperature:
Cold ·
· ·
Sales · · ·
· · · ·
· · · · ·
└──────────
Temperature
4. Line Plots#
Like tracking a child’s height over time:
Growth Chart:
Height
│ 📏
│ 📏
│ 📏
│ 📏
│📏
└─────────
Age (years)
Perfect For:
Temperature changes
Stock prices
Weight loss progress
Monthly sales
5. Bar Charts#
Like comparing kids’ candy collection after Halloween:
Candy Count:
Alice │██████ (6)
Bob │████████ (8)
Charlie│███ (3)
David │█████ (5)
Real-World Uses:
Monthly expenses
Product sales comparison
Survey responses
Sports scores
Choosing the Right Plot#
Think of this like choosing the right camera lens:
| What You’re Showing | Plot Type | Real-Life Example |
|---|---|---|
| Distribution | Histogram | Kids’ ages in school |
| Comparison | Bar Chart | Test scores by subject |
| Trend over time | Line Plot | Daily temperature |
| Relationships | Scatter Plot | Height vs Weight |
| Summary stats | Box Plot | Family heights |
Common Visualization Tips#
Color Usage
Good:
Temperature: Blue (cold) to Red (hot)
Bad:
Rainbow colors for categories
Labels and Titles
Good:
"Monthly Ice Cream Sales 2023"
Bad:
"Graph 1"
Scale Considerations
Good:
Starting bar charts at zero
Bad:
Cutting off bars to exaggerate differences
Remember:
Keep it simple (like a family photo)
Tell a story (like a photo album)
Be honest (no photoshopping!)
Consider your audience (like showing photos to grandma)
Think of data visualization like being a photographer:
Choose the right angle (plot type)
Frame it well (axes and scales)
Make it clear (labels and colors)
Tell the story (context and explanation)
Advanced Visualization#
1. Correlation Heatmaps#
Think of this like a friendship map in a classroom:
Friend Groups:
Alice Bob Charlie David
Alice 🔴 🟡 🟢 🟡
Bob 🟡 🔴 🟡 🟢
Charlie🟢 🟡 🔴 🟡
David 🟡 🟢 🟡 🔴
🔴 = Best Friends
🟢 = Good Friends
🟡 = Sometimes Play Together
⚪ = Don't Interact
Real-World Example:
Menu Item Combinations:
Coffee Tea Cake Sandwich
Coffee 🔴 🟡 🟢 ⚪
Tea 🟡 🔴 🟢 ⚪
Cake 🟢 🟢 🔴 🟡
Sandwich ⚪ ⚪ 🟡 🔴
Shows what customers buy together!
2. Pair Plots#
Like taking photos of every possible pair at a party:
Student Metrics Grid:
Study│Grade│Sleep
Study ━━━━│ 📊 │ 📊
Grade 📊 │━━━━│ 📊
Sleep 📊 │ 📊 │━━━━
Real Example:
Car Features:
Speed│MPG │Price
Speed ━━━━│ 📉 │ 📉
MPG 📉 │━━━━│ 📈
Price 📉 │ 📈 │━━━━
3. Distribution Plots#
Like showing how kids spread out in a playground:
Kernel Density Plot
Height Distribution:
╭─────╮
│ │
╭──┤ ├──╮
___╱ │ │ ╰___
Short Med Tall
Violin Plot
Test Scores by Class:
Class A: )│( Wide spread
Class B: )─( Normal spread
Class C: )•( Tight spread
4. Time Series Plots#
Like making a flip-book of your growing garden:
Multiple Time Series
Plant Growth:
Height│ 🌳
│ 🌲🌳
│ 🌲🌳
│🌲
└─────────
Week 1-4
Seasonal Patterns
Ice Cream Sales:
Sales│ 🍦
│🍦 🍦
│ 🍦
│
└─────────
Win Spr Sum Fall
5. Interactive Visualizations#
Like a digital photo frame that responds to touch:
Hover Effects
Sales Dashboard:
Bar Chart:
│██████ (Hover→"$600")
│████ (Hover→"$400")
│███████ (Hover→"$700")
Zoom Features
Stock Price:
Overview:
└─────────────
Zoomed In:
╭╮
──╯ ╰────
Choosing Advanced Plots#
Think of this like choosing photo effects:
| What You’re Showing | Plot Type | Real-Life Example |
|---|---|---|
| Relationships | Heatmap | Friend groups |
| Multiple variables | Pair Plot | Student performance |
| Complex distributions | Violin Plot | Age groups |
| Patterns over time | Time Series | Weight loss journey |
| Detailed exploration | Interactive | Digital yearbook |
Best Practices#
Color Schemes
Professional:
Low → High
⚪ 🟡 🟠 🔴
Avoid:
Rainbow colors
Interactivity Level
Good:
Click for details
Hover for values
Avoid:
Too many animations
Layout
Good:
┌─────┬─────┐
│ 1 │ 2 │
├─────┼─────┤
│ 3 │ 4 │
└─────┴─────┘
Bad:
Cluttered, random placement
Remember:
Keep it informative (like a well-organized photo album)
Make it intuitive (like natural gestures)
Enable exploration (like flipping through pages)
Maintain clarity (like good lighting in photos)
Think of advanced visualization like being a professional photographer:
Use the right tools (plot types)
Create the right composition (layout)
Add appropriate effects (interactivity)
Tell a complete story (multiple views)