Chapter 2 - Data Fundamentals#
Data Types and Structures#
Basic Data Types#
1. Numerical Data#
**Continuous Numbers**
Think of continuous data like measuring temperature - it can be any value: 98.6°F, 98.7°F, or even 98.65°F. Other examples include:
Height (5.7 feet, 5.8 feet)
Weight (150.5 pounds)
House prices ($299,999.99)
**Discrete Numbers**
These are counting numbers that can’t be broken down further. Imagine:
Number of pets in a house
Students in a classroom
Cars in a parking lot
2. Categorical Data#
**Nominal Categories**
Think of these as labels with no order, like:
Colors (Red, Blue, Green)
Car brands (Toyota, Honda, Ford)
Cities (New York, London, Tokyo)
**Ordinal Categories**
These are categories with a natural order, such as:
T-shirt sizes (S, M, L, XL)
Movie ratings (1 to 5 stars)
Education levels (High School, Bachelor’s, Master’s)
3. Binary Data#
This type of data has only two possible values, often represented as 0 and 1. Examples include:
Yes/No questions (e.g., Is the light on? Yes or No)
Gender (Male or Female)
Pass/Fail results in exams
4. Time Series Data#
This is data that changes over time, like:
Daily stock prices
Monthly temperature readings
Weekly sales figures
5. Text Data#
Words and sentences that carry meaning, such as:
Customer reviews
Email content
Social media posts
Remember: Understanding data types helps us choose the right tools and techniques for our machine learning models. Just like you wouldn’t use a hammer to cut paper, each data type needs its own special handling!
Exercise: Understanding Data Types and Structures#
Let’s test your understanding of different data types in machine learning. For each scenario, select the most appropriate data type:
| Scenario | Your Answer |
|---|---|
| A sensor measuring ocean water temperature every hour | |
| Customer satisfaction ratings from 1-5 stars | |
| Blood type (A, B, AB, O) | |
| Number of children in a family | |
| Product reviews written by customers | |
| Daily closing prices of a stock | |
| Distance traveled by a car (in miles) | |
| Academic grades (A, B, C, D, F) | |
Data Structures in Python#
These data structures are commonly used in machine learning:
1. NumPy Arrays#
Think of NumPy arrays like a super-powered list. Imagine organizing chocolates in a box:
A 1D array is like a single row of chocolates
A 2D array is like a box with rows and columns
A 3D array is like stacking multiple boxes
import numpy as np
# 1D array (like a line of students)
heights = np.array([170, 175, 160, 180])
# 2D array (like a seating chart)
classroom = np.array([
[1, 2, 3],
[4, 5, 6]
])
2. Pandas DataFrames & Series#
**2.1 Series**
Think of a Series as a single column in a spreadsheet:
Like a list of test scores with student names as labels
Or daily temperatures for a week
import pandas as pd
# Series (like a single column of grades)
grades = pd.Series([85, 90, 88, 92],
index=['Alice', 'Bob', 'Charlie', 'David'])
**2.2 DataFrames**
Imagine a DataFrame as an Excel spreadsheet:
Rows represent individual entries (like students)
Columns represent different features (like subjects)
# DataFrame (like a grade book)
student_data = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [15, 16, 15],
'Grade': [90, 85, 88]
})
3. Tensors#
Think of tensors as stacks of arrays:
0 Dimensional (0D) tensor: Single number (scalar)
1 Dimensional (1D) tensor: Array (vector)
2 Dimensional (2D) tensor: Matrix
3 Dimensional (3D) tensor: Cube of numbers (like a Rubik’s cube)
4 Dimensional (4D) tensor: Multiple cubes (like video frames)
import torch
# Creating a simple tensor (like RGB values of an image)
image_tensor = torch.tensor([
[[255, 0, 0], # Red
[0, 255, 0], # Green
[0, 0, 255]] # Blue
])
4. Common ML Dataset Formats#
**CSV Files**
Like a simple spreadsheet:
name,age,grade
Alice,15,90
Bob,16,85
**JSON Format**
Like a nested dictionary:
{
"students": [
{"name": "Alice", "age": 15},
{"name": "Bob", "age": 16}
]
}
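The JSON snippet above can be parsed with Python’s built-in `json` module; a minimal sketch (the variable names are illustrative):

```python
import json

# The JSON document from above, as a Python string
raw = '{"students": [{"name": "Alice", "age": 15}, {"name": "Bob", "age": 16}]}'

data = json.loads(raw)                         # parse into nested dicts/lists
names = [s["name"] for s in data["students"]]  # walk the nested structure
print(names)                                   # ['Alice', 'Bob']
```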
**Parquet Files**
Think of these as compressed, column-oriented files:
Great for big data
Faster to read than CSV
Takes less storage space
Remember: These structures are like different types of containers. Just like you wouldn’t store soup in a paper bag, choosing the right data structure is crucial for efficient machine learning!
Exploratory Data Analysis#
Statistical Summary#
1. Mean, Median, and Mode#
Think of these as the “averages family”:
Mean: The average value
Like splitting a pizza equally among friends
Example: Test scores [90, 85, 85, 95, 75]
Mean = (90 + 85 + 85 + 95 + 75) ÷ 5 = 86
Median: The middle value
Like finding the middle person in a height-ordered line
Example: [75, 85, 85, 90, 95]
Median = 85 (middle number)
Mode: The most common value
Like the most popular ice cream flavor
Example: [75, 85, 85, 90, 95]
Mode = 85 (appears twice)
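The “averages family” above can be computed directly with Python’s built-in `statistics` module, using the test scores from the example:

```python
from statistics import mean, median, mode

scores = [90, 85, 85, 95, 75]

print(mean(scores))    # 86  (430 ÷ 5)
print(median(scores))  # 85  (middle of sorted [75, 85, 85, 90, 95])
print(mode(scores))    # 85  (appears twice)
```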
2. Standard Deviation#
Think of this as the “spread-out-ness” of your data:
Small standard deviation = data points close together
Like a tight group of friends walking together
Large standard deviation = data points spread out
Like friends scattered across a playground
Example:
Dataset A: [10, 11, 9, 10, 10] → Small deviation
Dataset B: [2, 18, 5, 15, 10] → Large deviation
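The “spread-out-ness” of datasets A and B can be measured with `statistics.pstdev` (population standard deviation); a quick sketch:

```python
from statistics import pstdev

dataset_a = [10, 11, 9, 10, 10]   # tightly grouped
dataset_b = [2, 18, 5, 15, 10]    # widely spread

print(round(pstdev(dataset_a), 2))  # 0.63
print(round(pstdev(dataset_b), 2))  # 5.97
```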
3. Quartiles and IQR#
Imagine lining up students by height:
Q1 (25th percentile): Quarter way through the line
Q2 (50th percentile): Halfway (same as median)
Q3 (75th percentile): Three-quarters way through
IQR = Q3 - Q1 (middle 50% of data)
Example:
Data: [1, 3, 5, 7, 8, 9, 10, 12, 15]
Q1 = 4
Q2 = 8
Q3 = 11
IQR = 11 - 4 = 7
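`statistics.quantiles` (with its default "exclusive" method) reproduces the hand calculation above:

```python
from statistics import quantiles

data = [1, 3, 5, 7, 8, 9, 10, 12, 15]

# The default exclusive method matches the example's Q1/Q2/Q3
q1, q2, q3 = quantiles(data, n=4)
print(q1, q2, q3)   # 4.0 8.0 11.0
print(q3 - q1)      # IQR = 7.0
```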
4. Correlation Analysis#
Think of correlation as how two things move together:
Positive correlation: Both increase together
Like ice cream sales and temperature
Negative correlation: One up, one down
Like study time and gaming time
No correlation: No relationship
Like shoe size and test scores
Scale: -1 to +1
+1: Perfect positive relationship
-1: Perfect negative relationship
0: No relationship
5. Distribution Analysis#
Think of this as the “shape” of your data:
Normal Distribution (Bell Curve)
Like heights in a classroom
Most values cluster in the middle
Fewer values at extremes
Skewed Distribution
Right-skewed: Long tail on right
Like household income
Left-skewed: Long tail on left
Like test scores in an easy exam
Uniform Distribution
Like rolling a fair die
All values equally likely
Data Quality Checks#
1. Identifying Duplicates#
Think of duplicates like identical twins in your dataset - sometimes they’re real, sometimes they’re mistakes!
Common Scenarios
Same customer entered twice
Multiple transactions with identical values
Accidentally repeated rows
# Example of finding duplicates
students = [
{'id': 1, 'name': 'Alice', 'grade': 90},
{'id': 1, 'name': 'Alice', 'grade': 90}, # Duplicate!
{'id': 2, 'name': 'Bob', 'grade': 85}
]
2. Detecting Outliers#
Outliers are like the odd ones out - imagine a basketball player in a kindergarten class!
Common Methods
IQR Method (1.5 × IQR rule)
Normal range: Q1 - 1.5×IQR to Q3 + 1.5×IQR
Z-score Method
Values beyond 3 standard deviations
Example:
Heights (cm): [165, 170, 168, 172, 169, 250]
250 cm is clearly an outlier!
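The 1.5 × IQR rule above can be sketched in a few lines of standard-library Python, using the heights from the example:

```python
from statistics import quantiles

heights = [165, 170, 168, 172, 169, 250]

q1, _, q3 = quantiles(heights, n=4)            # exclusive method
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # "normal" range

outliers = [h for h in heights if h < low or h > high]
print(outliers)   # [250]
```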
3. Checking Data Balance#
Think of data balance like a see-saw:
Class Balance Example
Spam Detection Dataset:
95% Normal emails (Imbalanced!)
5% Spam emails
Like trying to learn about tigers by showing 100 cats and 1 tiger
Solutions
Oversampling minority class
Undersampling majority class
Generating synthetic data
4. Finding Data Anomalies#
Anomalies are like suspicious behavior patterns:
Types of Anomalies
Point Anomalies
Single weird values
Like a -50°C temperature in summer
Contextual Anomalies
Normal values in wrong context
Like high ice cream sales in winter
Pattern Anomalies
Strange sequences
Like sudden spikes in network traffic
5. Verifying Data Types#
Think of this as making sure ingredients are correct before cooking:
Common Issues
Numbers stored as text
“123” instead of 123
Dates in wrong format
“01-12-2023” vs “12-01-2023”
Mixed types in columns
Age column: [25, “thirty”, 40]
Data Preprocessing#
Data Cleaning#
Think of data cleaning like preparing ingredients before cooking a meal - everything needs to be just right!
1. Removing Duplicates#
Think of this like removing extra copies of the same photo:
# Example messy data
student_records = [
{'id': 1, 'name': 'Alice', 'grade': 90},
{'id': 1, 'name': 'Alice', 'grade': 90}, # Duplicate
{'id': 1, 'name': 'alice', 'grade': 90} # Case difference
]
Common Duplicate Types
Exact duplicates (identical rows)
Partial duplicates (same core info, different format)
Case-sensitive duplicates (JOHN vs john)
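A minimal, illustrative dedupe sketch in plain Python that also catches the case-sensitive duplicates mentioned above (pandas users would typically reach for `drop_duplicates` instead):

```python
student_records = [
    {'id': 1, 'name': 'Alice', 'grade': 90},
    {'id': 1, 'name': 'Alice', 'grade': 90},   # exact duplicate
    {'id': 1, 'name': 'alice', 'grade': 90},   # case-different duplicate
    {'id': 2, 'name': 'Bob',   'grade': 85},
]

seen = set()
unique = []
for rec in student_records:
    # Normalize case so 'Alice' and 'alice' collapse to one key
    key = (rec['id'], rec['name'].lower(), rec['grade'])
    if key not in seen:
        seen.add(key)
        unique.append(rec)

print(len(unique))   # 2 — one Alice, one Bob
```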
2. Handling Inconsistent Values#
Think of this like standardizing measurements in a recipe:
Common Inconsistencies
Different spellings
“New York” vs “NY” vs “New York City”
Different units
“1.8m” vs “180cm” vs “5′11″”
Different formats
“Yes” vs “Y” vs “1” vs “True”
Example Standardization:
Before:
- States: ["NY", "New York", "new york"]
- Gender: ["M", "Male", "male", "1"]
- Status: ["Y", "Yes", "TRUE", "1"]
After:
- States: ["NY", "NY", "NY"]
- Gender: ["M", "M", "M", "M"]
- Status: [True, True, True, True]
3. String Cleaning#
Like washing vegetables before cooking:
Common String Issues
Extra spaces: “John Smith “
Mixed case: “john SMITH”
Special characters: “John’s-Account#123”
Unwanted symbols: “Phone: (555) 123-4567”
4. Date/Time Formatting#
Think of this like setting all clocks to the same time zone:
Common Date Formats
Input Formats:
- "01-12-2023"
- "12/01/2023"
- "2023.12.01"
- "Dec 1, 2023"
Standardized Output:
- "2023-12-01"
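One hedged way to standardize such formats with the standard library’s `datetime`. Note the assumption baked in below that "01-12-2023" means day-month-year; verify that against your own data source before reusing the format list:

```python
from datetime import datetime

# Assumed input conventions - adjust to match your data source
formats = ["%d-%m-%Y", "%m/%d/%Y", "%Y.%m.%d", "%b %d, %Y"]

def standardize(date_string):
    """Try each known format and return an ISO date (YYYY-MM-DD)."""
    for fmt in formats:
        try:
            return datetime.strptime(date_string, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {date_string}")

print(standardize("01-12-2023"))   # 2023-12-01
print(standardize("Dec 1, 2023"))  # 2023-12-01
```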
5. Fixing Data Types#
Like putting groceries in their proper places:
Common Conversions
# Before cleaning
data = {
'age': ['25', '30', 'unknown', '40'],
'salary': ['50,000', '60000', '75,000'],
'active': ['Y', 'N', 'yes', 'no']
}
# After cleaning
clean_data = {
'age': [25, 30, None, 40],
'salary': [50000, 60000, 75000],
'active': [True, False, True, False]
}
Data Transformation#
1. One-Hot Encoding#
Think of this like breaking down a multiple-choice question into yes/no questions:
Original Data
colors = ['red', 'blue', 'green']
# Becomes:
# red: [1, 0, 0]
# blue: [0, 1, 0]
# green: [0, 0, 1]
When to Use
Categorical data with no order
Like: Cities, Colors, Product Categories
When categories are equally important
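A minimal one-hot sketch in plain Python (real projects would typically use `pandas.get_dummies` or scikit-learn’s `OneHotEncoder`):

```python
categories = ['red', 'blue', 'green']   # fixed column order

def one_hot(value):
    """Return a 0/1 vector with a single 1 at the category's position."""
    return [1 if value == c else 0 for c in categories]

encoded = {c: one_hot(c) for c in categories}
print(encoded['red'])    # [1, 0, 0]
print(encoded['blue'])   # [0, 1, 0]
print(encoded['green'])  # [0, 0, 1]
```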
2. Label Encoding#
Think of this like giving students numbers instead of names:
# T-Shirt Sizes
sizes = ['S', 'M', 'L', 'XL']
# Becomes:
# S → 0
# M → 1
# L → 2
# XL → 3
When to Use
Ordinal categories
When order matters
Like: Grades, Rankings, Sizes
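Label encoding is just a mapping from the ordered categories to integers; a sketch:

```python
sizes = ['S', 'M', 'L', 'XL']                    # already in natural order
size_to_label = {s: i for i, s in enumerate(sizes)}

print(size_to_label)        # {'S': 0, 'M': 1, 'L': 2, 'XL': 3}
print(size_to_label['L'])   # 2
```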
3. Binning Numerical Data#
Think of this like grouping students by age ranges:
# Ages 0-100 grouped into labeled ranges
age_bins = {
    '0-18': 'Child',
    '19-35': 'Young Adult',
    '36-50': 'Adult',
    '51+': 'Senior'
}
# Example:
# 15 → Child
# 27 → Young Adult
# 45 → Adult
Common Binning Methods
Equal-width bins (same size ranges)
Equal-frequency bins (same number of items)
Custom bins (based on domain knowledge)
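Custom binning can be sketched with the standard library’s `bisect` module; the bin edges below mirror the age groups from the example:

```python
import bisect

edges = [18, 35, 50]    # bin boundaries: 0-18, 19-35, 36-50, 51+
labels = ['Child', 'Young Adult', 'Adult', 'Senior']

def age_group(age):
    """Map an age to its bin label via binary search over the edges."""
    return labels[bisect.bisect_left(edges, age)]

print(age_group(15))  # Child
print(age_group(27))  # Young Adult
print(age_group(45))  # Adult
print(age_group(70))  # Senior
```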
4. Log Transformation#
Think of this like using a zoom lens to see both mountains and molehills:
When to Use
Data with large ranges
Skewed distributions
When ratios matter more than differences
Example:
Income Data:
Original: [30000, 50000, 1000000]
Log: [10.3, 10.8, 13.8]
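The income example above uses the natural logarithm; a sketch with `math.log`:

```python
import math

incomes = [30_000, 50_000, 1_000_000]
log_incomes = [round(math.log(x), 1) for x in incomes]   # natural log

print(log_incomes)  # [10.3, 10.8, 13.8] — the $1M value no longer dominates
```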
5. Power Transformation#
Think of this like adjusting a camera’s exposure:
Common Types
Square Root: For moderately skewed data
Cube Root: For strongly skewed data
Box-Cox: Automatically finds best power
Example:
Original Data: [1, 4, 9, 16]
Square Root: [1, 2, 3, 4]
Cube Root: [1, 1.6, 2.1, 2.5]
Remember:
Choose transformations based on data type
Consider the model you’ll use
Keep track of transformations for future data
Always scale after transforming
Think of data transformation like preparing ingredients - sometimes you need to chop, grind, or puree to get the right texture for your recipe!
Feature Scaling#
Scaling Methods#
Imagine you’re comparing apples and watermelons - you need to make them comparable!
1. Min-Max Scaling#
Think of this like converting test scores to percentages:
Real-Life Example
Original test scores: [60, 75, 90, 100]
Scaled to 0-1: [0, 0.375, 0.75, 1]
# Basketball Players
heights = [160, 180, 200] # cm
scaled_heights = [0, 0.5, 1]
# House Prices
prices = [100_000, 500_000, 900_000]
scaled_prices = [0, 0.5, 1]
When to Use
Like Instagram filters: scaling image pixels (0-255 → 0-1)
Like rating systems (1-5 stars → 0-1)
Perfect for neural networks that expect 0-1 input
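Min-max scaling is a one-line formula; a sketch (scikit-learn’s `MinMaxScaler` does the same with extra bookkeeping):

```python
def min_max_scale(values):
    """Rescale values linearly so min maps to 0 and max maps to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scores = [60, 75, 90, 100]
print(min_max_scale(scores))  # [0.0, 0.375, 0.75, 1.0]
```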
2. Standard Scaling (Z-score)#
Like comparing how unusual something is:
Fun Example: Height in a Class
Original heights (cm): [150, 160, 170, 180, 190]
Mean = 170, population Std ≈ 14.1
Scaled heights: ≈ [-1.41, -0.71, 0, 0.71, 1.41]
Meaning:
- 0 = Average height
- +1 = Taller than average
- -1 = Shorter than average
Real-World Example
Test Scores: [70, 80, 90, 100]
After scaling: ≈ [-1.34, -0.45, 0.45, 1.34]
0 means “average student”
+1 means “one standard deviation above average”
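Z-scores follow directly from the data’s own mean and population standard deviation; a standard-library sketch using the test scores above:

```python
from statistics import mean, pstdev

def standard_scale(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

scores = [70, 80, 90, 100]
z = [round(v, 2) for v in standard_scale(scores)]
print(z)  # [-1.34, -0.45, 0.45, 1.34]
```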
3. Robust Scaling#
Like measuring kids’ heights using a growth chart:
Example: Salary Data
Salaries: [30k, 45k, 50k, 52k, 1M]
↓
After robust scaling: [-1, -0.2, 0, 0.1, 5]
The millionaire doesn't mess up our scale!
4. Normalization#
Like adjusting a recipe for different serving sizes:
Pizza Toppings Example
Original Recipe (4 pizzas):
- Cheese: 400g
- Tomatoes: 200g
- Pepperoni: 200g
↓
Normalized (per pizza):
- Cheese: 0.5 (50%)
- Tomatoes: 0.25 (25%)
- Pepperoni: 0.25 (25%)
When to Use Each Method?#
Think of scaling methods like different measuring tools:
| If Your Data Is Like… | Use This Scaling | Example |
|---|---|---|
| Instagram photo pixels | Min-Max | 0-255 → 0-1 |
| Student test scores | Standard | Compare to class average |
| House prices with mansions | Robust | Handle those million-dollar outliers |
| Recipe ingredients | Normalization | Adjust for different serving sizes |
Remember:
Scale features like you’re making ingredients work together in a recipe
Always scale after splitting training/test data
Keep the original data backed up
Implementation Considerations#
1. Scaling Order in Pipeline#
Think of this like following a recipe in the right order:
Wrong Order (Common Mistake)
# DON'T DO THIS!
1. Scale all data
2. Split into train/test
Result: Data leakage! 😱
Right Order (Like Cooking)
1. Split data first (like separating ingredients)
2. Calculate scaling on training data (like measuring spices)
3. Apply to both train and test (like using same measurements)
Real Example:
# Height and Weight Dataset
heights = [170, 165, 180, 175] # cm
weights = [70, 65, 80, 75] # kg
# Step 1: Split
train_heights = [170, 165]
test_heights = [180, 175]
# Step 2: Calculate scale on train
mean_height = 167.5
std_height = 2.5
# Step 3: Apply the training statistics to both sets
scaled_train = [1.0, -1.0]   # (170-167.5)/2.5, (165-167.5)/2.5
scaled_test = [5.0, 3.0]     # (180-167.5)/2.5, (175-167.5)/2.5
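The three steps can be sketched end-to-end: fit the scaler on the training set only, then reuse those exact statistics for the test set:

```python
from statistics import mean, pstdev

train_heights = [170, 165]
test_heights = [180, 175]

# Fit on training data ONLY (no peeking at the test set)
mu, sigma = mean(train_heights), pstdev(train_heights)   # 167.5, 2.5

# Apply the same parameters to both sets
scaled_train = [(h - mu) / sigma for h in train_heights]
scaled_test = [(h - mu) / sigma for h in test_heights]

print(scaled_train)  # [1.0, -1.0]
print(scaled_test)   # [5.0, 3.0]
```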
Handling New Data#
Like using the same recipe measurements for future cooking:
Example: Temperature Scaling
Training Data:
Temps = [60°F, 70°F, 80°F]
Min = 60°F, Max = 80°F
New Data comes in: 90°F
Use SAME min/max from training:
scaled = (90 - 60) / (80 - 60) = 1.5
Scaling Categorical Data#
Think of this like organizing different types of fruits:
Before Scaling:
Colors: ['red', 'blue', 'red', 'green']
Sizes: ['S', 'M', 'L']
After Proper Handling:
# One-Hot for Colors:
red: [1, 0, 0]
blue: [0, 1, 0]
green: [0, 0, 1]
# Ordinal for Sizes:
S: 0
M: 1
L: 2
Common Pitfalls#
Like common cooking mistakes to avoid:
Pitfall 1: Scaling Target Variable
House Prices (Target):
Original: [100k, 200k, 300k]
DON'T scale unless absolutely necessary!
Pitfall 2: Scaling After Feature Selection
Wrong Order:
1. Select important features
2. Scale data
✓ Right Order:
1. Scale data
2. Select important features
Pitfall 3: Using Wrong Scaler
Salary Data with Outliers:
[30k, 35k, 32k, 1M]
❌ Standard Scaler: Gets messed up
✓ Robust Scaler: Handles it well
Best Practices#
Think of these like kitchen safety rules:
1. Always Save Scaler Parameters
# Save these from training:
min_value = data.min()
max_value = data.max()
# Use for future data
2. Handle Missing Values First
Original: [10, NaN, 20, 30]
❌ Scale first
✓ Fill NaN, then scale
3. Document Your Scaling
# Good Documentation
scaling_info = {
'height': 'standard_scaler',
'weight': 'min_max_scaler',
'age': 'robust_scaler'
}
Remember:
Scale like you’re following a recipe
Keep your scaling consistent
Document everything
Test your pipeline with new data
Handling Missing Values#
Missing Value Analysis#
1. Types of Missing Data#
Think of this like taking attendance in a classroom:
Missing Completely at Random (MCAR)
Like students randomly absent due to various reasons
Example:
Survey responses where some people randomly skipped questions
Temperature readings where sensor occasionally malfunctions
Temperature Log:
Monday: 75°F
Tuesday: ?? (Sensor battery died)
Wednesday: 72°F
Thursday: 74°F
Missing at Random (MAR)
Like teenagers more likely to skip morning classes
Missing values related to other variables
Shopping Survey:
Age: 25 | Income: $50k | Luxury Brands: ??
Age: 65 | Income: $70k | Luxury Brands: Yes
Age: 30 | Income: $40k | Luxury Brands: No
Missing Not at Random (MNAR)
Like people avoiding salary questions in surveys
The missing value itself is meaningful
Salary Survey:
Low Income: Answered
High Income: Often Missing
Very High Income: Usually Missing
2. Missing Patterns#
Think of this like patterns in student homework submission:
Univariate Pattern
Like one student always missing math assignments
Student Data:
Math: ??, ??, ??
English: 85, 90, 88
Science: 92, 87, 90
Multivariate Pattern
Like students missing both test and homework when sick
Sick Day Records:
Test Score: ?? | Homework: ?? | Attendance: Absent
Test Score: 85 | Homework: 90 | Attendance: Present
3. Impact Assessment#
Like checking how missing ingredients affect a recipe:
Critical Missing Data
Recipe Example:
Flour: 2 cups
Sugar: ??
Eggs: 2
Impact: Can't bake without knowing sugar amount!
Non-Critical Missing Data
Customer Profile:
Name: John
Age: ??
Favorite Color: Blue
Impact: Can still provide service without age
Documentation#
Like keeping a kitchen incident log:
Good Documentation Example:
Dataset: Customer Survey 2023
Missing Values Log:
- Income: 15% missing (people skipped)
- Age: 5% missing (form errors)
- Email: 20% missing (optional field)
Actions Taken:
- Income: Used median by profession
- Age: Used mean age
- Email: Marked as 'Not Provided'
Real-World Solutions#
Like Fixing a Recipe:
Original Recipe:
- Sugar: 1 cup
- Flour: ?? cups
- Eggs: 2
Solutions:
1. Check similar recipes (Like looking at similar data points)
2. Ask an expert (Domain knowledge)
3. Use standard amount (Mean/median imputation)
4. Skip recipe (Remove row)
Remember:
Missing values are like missing ingredients - sometimes you can substitute, sometimes you can’t
Document your missing value strategy like a recipe modification
Always consider why the data is missing before deciding how to handle it
Handling Missing Values Techniques#
1. Deletion Methods#
Think of this like cleaning up a messy photo album:
**Complete Case Analysis (Listwise Deletion)**
Like removing group photos where someone blinked:
Before:
Photo 1: [👨 👩 😴 👶] # Someone sleeping
Photo 2: [👨 👩 👧 👶] # Perfect photo
Photo 3: [👨 ?? 👧 👶] # Someone missing
After:
Keep only Photo 2
**Pairwise Deletion**
Like using different photos for different purposes:
Family Height Chart:
Dad: 6'0"
Mom: ??
Son: 5'0"
Daughter: 5'2"
Use what you have:
- Can still compare Dad and kids
- Skip comparisons with Mom
2. Simple Imputation#
Think of this like substituting ingredients in cooking:
**Mean Imputation**
Like using average cookie size when the recipe doesn’t specify:
Cookie Sizes:
[3", 3.5", ??, 3", 4"]
Solution: Use average (3.375")
**Mode Imputation**
Like picking the most common ice cream flavor:
Ice Cream Orders:
[Vanilla, ??, Chocolate, Vanilla, ??, Vanilla]
Solution: Fill with Vanilla
**Constant Imputation**
Like marking “Unknown” for missing names:
Contact List:
Name: John
Phone: ?? → "Not Provided"
Email: john@email.com
3. Statistical Imputation#
Like making educated guesses based on patterns:
**Median Imputation**
Good for skewed data, like house prices:
House Prices:
[$200k, $210k, ??, $220k, $1M]
Better to use median than mean!
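A median-imputation sketch for the house-price example; `statistics.median` is naturally robust to the $1M outlier:

```python
from statistics import median

prices = [200_000, 210_000, None, 220_000, 1_000_000]

known = [p for p in prices if p is not None]
fill = median(known)   # 215000.0 — unaffected by the $1M outlier
imputed = [fill if p is None else p for p in prices]

print(imputed)  # [200000, 210000, 215000.0, 220000, 1000000]
```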
**Regression Imputation**
Like guessing someone’s weight based on height:
Height | Weight
5'0" | 110 lbs
5'6" | ??
6'0" | 180 lbs
Use pattern to estimate missing weight
4. Advanced Imputation#
Like sophisticated detective work:
**K-Nearest Neighbors (KNN)**
Like asking similar people for answers:
Movie Ratings:
Person A: [5, 4, ??, 5]
Similar viewers rated it 4
→ Probably Person A would rate it 4
**Multiple Imputation**
Like getting multiple opinions:
Missing Salary:
Estimate 1: $50k (based on age)
Estimate 2: $52k (based on education)
Estimate 3: $48k (based on experience)
Final: Average of all estimates
When to Use Each Method#
Think of this like choosing tools for different home repairs:
| Missing Data Situation | Method | Real-Life Analogy |
|---|---|---|
| Few missing values (<5%) | Deletion | Removing a few bad apples |
| Random missing values | Mean/Median | Using average recipe portions |
| Categorical data | Mode | Most popular menu item |
| Related variables exist | Regression | Guessing gift price from wrapping |
| Complex patterns | Advanced | Detective solving a complex case |
Remember:
Simple isn’t always worse (like basic recipes often work best)
Consider the impact of your choice (like substituting ingredients)
Document your methods (like writing down recipe modifications)
Test your results (like tasting the food)
Think of handling missing values like being a chef - sometimes you substitute ingredients (imputation), sometimes you modify the recipe (deletion), but always make sure the final dish makes sense!
Feature Selection#
Feature Importance#
Think of this like picking players for a sports team:
1. Statistical Methods#
Like measuring athletes’ performance statistics:
T-Test Example
Basketball Players:
Height Impact:
Tall players: 20 pts average
Short players: 10 pts average
→ Height is statistically significant!
Weight Impact:
Heavy players: 15 pts average
Light players: 14 pts average
→ Weight might not matter much
2. Correlation Analysis#
Like finding which friends always hang out together:
Strong Correlation
Ice Cream Sales vs Temperature:
🌡️ 75°F → $100 sales
🌡️ 85°F → $200 sales
🌡️ 95°F → $300 sales
(Strong positive correlation)
Weak Correlation
Ice Cream Sales vs Day of Week:
Monday: $150
Tuesday: $200
Wednesday: $100
(No clear pattern)
3. Information Gain#
Like choosing questions for “20 Questions” game:
Good Questions (High Information Gain)
Animal Guessing Game:
Q: "Is it a mammal?"
→ Eliminates 70% of possibilities!
Bad Questions (Low Information Gain):
Q: "Is it purple?"
→ Rarely helps identify animal
4. Feature Rankings#
Like ranking kitchen tools by usefulness:
Chef's Tool Ranking:
1. Knife (Used in 90% of dishes)
2. Pan (Used in 70% of dishes)
3. Whisk (Used in 30% of dishes)
4. Melon baller (Used in 1% of dishes)
5. Domain Knowledge#
Like an experienced chef knowing what matters:
Cooking Example:
Making Pizza:
Important Features:
- Dough quality
- Oven temperature
- Cooking time
Less Important:
- Brand of pan
- Color of spatula
- Kitchen temperature
Real-World Example: House Price Prediction#
Like a realtor knowing what affects house prices:
Features Available:
1. Square footage 🏠
2. Location 📍
3. Number of rooms 🚪
4. Wall color 🎨
5. House age 📅
6. Door handle style 🔒
Important (Keep):
- Square footage (Strong correlation)
- Location (High impact)
- Number of rooms (Significant)
Unimportant (Drop):
- Wall color (Can change easily)
- Door handle style (Irrelevant)
Feature Selection Methods Comparison#
Think of this like choosing tools for different jobs:
| Method | Real-Life Analogy | When to Use |
|---|---|---|
| Statistical | Sports tryouts | Clear performance metrics |
| Correlation | Friend groups | Finding relationships |
| Information Gain | Game strategy | Decision trees |
| Domain Knowledge | Expert advice | Industry-specific problems |
Common Patterns to Watch#
Like reading recipe reviews:
Redundant Features
Recipe Ingredients:
- Butter
- Margarine
(Choose one, not both!)
Irrelevant Features
Basketball Performance:
- Height (Relevant)
- Favorite color (Irrelevant)
Remember:
Not all features are created equal (like ingredients in a recipe)
More isn’t always better (like too many spices)
Let the data speak (like taste-testing)
Trust expert knowledge (like following chef’s advice)
Think of feature selection like packing for a trip:
Take what you need (important features)
Leave what you don’t (irrelevant features)
Consider the destination (your model’s purpose)
Ask experienced travelers (domain experts)
Selection Methods#
1. Filter Methods#
Think of this like using tryout tests for a sports team:
Variance Threshold
Basketball Shooting Test:
Player A: [10, 11, 9, 10] points
Player B: [2, 15, 0, 20] points
Player C: [10, 10, 10, 10] points
Keep: Player A, B (show variety)
Drop: Player C (no variance)
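A variance-threshold sketch over the tryout scores above; zero-variance "features" carry no information and get dropped:

```python
from statistics import pvariance

players = {
    'A': [10, 11, 9, 10],
    'B': [2, 15, 0, 20],
    'C': [10, 10, 10, 10],   # identical scores — zero variance
}

threshold = 0.0   # keep only features whose variance exceeds this
kept = [name for name, scores in players.items()
        if pvariance(scores) > threshold]

print(kept)  # ['A', 'B'] — Player C is dropped
```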
**Correlation Filter**
Like choosing between similar players:
Two Players' Stats:
Points | Rebounds
Player1: [20, 10, 15]
Player2: [19, 11, 14]
Decision: Too similar - keep only one!
2. Wrapper Methods#
Like holding actual practice games to pick players:
Forward Selection
Building a Soccer Team:
Start: Empty team
Round 1: Add goalkeeper (biggest impact)
Round 2: Add striker (next best)
Round 3: Add defender
Keep adding until team performance stops improving
Backward Elimination
Orchestra Selection:
Start: Full orchestra
Remove: Triangle player (least impact)
Remove: Second tambourine
Keep removing until performance drops
3. Embedded Methods#
Like having a coach who evaluates players during the actual game:
LASSO Example
Basketball Team:
Player | Points | Defense | Team Spirit
A | High | Low | Medium
B | Medium | Medium | Medium
C | Low | Low | Low
LASSO automatically benches Player C!
4. Dimensionality Reduction#
Like combining similar skills into one rating:
PCA Example
Student Grades:
Math Tests: [90, 85, 88]
Algebra Tests: [92, 87, 89]
Geometry Tests: [88, 86, 90]
Combine into: "Math Ability Score"
Real-World Example:
Restaurant Rating:
Original Features:
- Food Quality (1-5)
- Taste (1-5)
- Flavor (1-5)
- Service Speed (1-5)
- Waiter Friendliness (1-5)
Reduced to:
- Food Score (combining first 3)
- Service Score (combining last 2)
5. Feature Engineering Basics#
Like creating new sports metrics from existing stats:
Combining Features
Shopping Patterns:
Original:
- Items bought: 5
- Total cost: $100
New Feature:
- Cost per item: $20
Breaking Down Features
Date: "2023-11-26"
Becomes:
- Year: 2023
- Month: 11
- Day: 26
- Is_Weekend: True
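Breaking a date into features takes a few lines with the standard library’s `datetime.date`; a sketch using the date from the example:

```python
from datetime import date

purchase = date.fromisoformat("2023-11-26")

features = {
    'year': purchase.year,
    'month': purchase.month,
    'day': purchase.day,
    'is_weekend': purchase.weekday() >= 5,   # Mon=0 … Sat=5, Sun=6
}

print(features)  # {'year': 2023, 'month': 11, 'day': 26, 'is_weekend': True}
```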
Selection Method Comparison#
Think of this like different ways to build a team:
| Method | Real-Life Analogy | Best For |
|---|---|---|
| Filter | Quick tryouts | Large feature sets |
| Wrapper | Practice games | Small feature sets |
| Embedded | In-game evaluation | Complex relationships |
| Reduction | Skill combining | Many similar features |
Common Patterns#
Feature Combinations
Car Performance:
Original Features:
- Distance: 100 miles
- Time: 2 hours
New Feature:
- Speed: 50 mph
Time-Based Features
Sales Data:
Original: Purchase DateTime
New Features:
- Hour of Day
- Day of Week
- Is Holiday
Remember:
Start simple (like basic tryouts)
Test combinations (like team chemistry)
Consider computation costs (like training time)
Use domain knowledge (like coach’s experience)
Think of feature selection methods like different recruitment strategies:
Filters: Quick initial screening
Wrappers: Thorough tryouts
Embedded: On-the-job evaluation
Reduction: Combining roles
Engineering: Creating new positions
The goal is to build the best team (model) with the right players (features)!
Data Visualization#
Basic Plots#
1. Histograms#
Think of this like organizing kids by height in a class photo:
Height Distribution Example:
Kids' Heights:
4'0" - 4'2": 🧍♂️🧍♂️ (2 kids)
4'3" - 4'5": 🧍♂️🧍♂️🧍♂️🧍♂️ (4 kids)
4'6" - 4'8": 🧍♂️🧍♂️🧍♂️ (3 kids)
4'9" - 4'11": 🧍♂️ (1 kid)
Real-Life Uses:
Test scores distribution
Daily steps count
Customer age groups
2. Box Plots#
Like summarizing a family’s height in one picture:
Family Heights:
* (Dad - Outlier)
┌──┬──┐
│ │ │
────┴──┴──┴────
Shortest (Mom) to Tallest (Kids)
Shows:
Minimum (Shortest family member)
Maximum (Tallest regular member)
Median (Middle height)
Outliers (Unusually tall Dad)
3. Scatter Plots#
Like mapping where kids drop their toys:
Toy Location Map:
Kitchen
│
│ 🧸 🎮
│ 🎪
│ 📱
│
─────┼─────────────
│ Living Room
Real Examples:
Ice Cream Sales vs Temperature:
Cold ·
· ·
Sales · · ·
· · · ·
· · · · ·
└──────────
Temperature
4. Line Plots#
Like tracking a child’s height over time:
Growth Chart:
Height
│ 📏
│ 📏
│ 📏
│ 📏
│📏
└─────────
Age (years)
Perfect For:
Temperature changes
Stock prices
Weight loss progress
Monthly sales
5. Bar Charts#
Like comparing kids’ candy collection after Halloween:
Candy Count:
Alice │██████ (6)
Bob │████████ (8)
Charlie│███ (3)
David │█████ (5)
Real-World Uses:
Monthly expenses
Product sales comparison
Survey responses
Sports scores
Choosing the Right Plot#
Think of this like choosing the right camera lens:
| What You’re Showing | Plot Type | Real-Life Example |
|---|---|---|
| Distribution | Histogram | Kids’ ages in school |
| Comparison | Bar Chart | Test scores by subject |
| Trend over time | Line Plot | Daily temperature |
| Relationships | Scatter Plot | Height vs Weight |
| Summary stats | Box Plot | Family heights |
Common Visualization Tips#
Color Usage
Good:
Temperature: Blue (cold) to Red (hot)
Bad:
Rainbow colors for categories
Labels and Titles
Good:
"Monthly Ice Cream Sales 2023"
Bad:
"Graph 1"
Scale Considerations
Good:
Starting bar charts at zero
Bad:
Cutting off bars to exaggerate differences
Remember:
Keep it simple (like a family photo)
Tell a story (like a photo album)
Be honest (no photoshopping!)
Consider your audience (like showing photos to grandma)
Think of data visualization like being a photographer:
Choose the right angle (plot type)
Frame it well (axes and scales)
Make it clear (labels and colors)
Tell the story (context and explanation)
Advanced Visualization#
1. Correlation Heatmaps#
Think of this like a friendship map in a classroom:
Friend Groups:
Alice Bob Charlie David
Alice 🔴 🟡 🟢 🟡
Bob 🟡 🔴 🟡 🟢
Charlie🟢 🟡 🔴 🟡
David 🟡 🟢 🟡 🔴
🔴 = Best Friends
🟢 = Good Friends
🟡 = Sometimes Play Together
⚪ = Don't Interact
Real-World Example:
Menu Item Combinations:
Coffee Tea Cake Sandwich
Coffee 🔴 🟡 🟢 ⚪
Tea 🟡 🔴 🟢 ⚪
Cake 🟢 🟢 🔴 🟡
Sandwich ⚪ ⚪ 🟡 🔴
Shows what customers buy together!
2. Pair Plots#
Like taking photos of every possible pair at a party:
Student Metrics Grid:
Study│Grade│Sleep
Study ━━━━│ 📊 │ 📊
Grade 📊 │━━━━│ 📊
Sleep 📊 │ 📊 │━━━━
Real Example:
Car Features:
Speed│MPG │Price
Speed ━━━━│ 📉 │ 📉
MPG 📉 │━━━━│ 📈
Price 📉 │ 📈 │━━━━
3. Distribution Plots#
Like showing how kids spread out in a playground:
Kernel Density Plot
Height Distribution:
╭─────╮
│ │
╭──┤ ├──╮
___╱ │ │ ╰___
Short Med Tall
Violin Plot
Test Scores by Class:
Class A: )│( Wide spread
Class B: )─( Normal spread
Class C: )•( Tight spread
4. Time Series Plots#
Like making a flip-book of your growing garden:
Multiple Time Series
Plant Growth:
Height│ 🌳
│ 🌲🌳
│ 🌲🌳
│🌲
└─────────
Week 1-4
Seasonal Patterns
Ice Cream Sales:
Sales│ 🍦
│🍦 🍦
│ 🍦
│
└─────────
Win Spr Sum Fall
5. Interactive Visualizations#
Like a digital photo frame that responds to touch:
Hover Effects
Sales Dashboard:
Bar Chart:
│██████ (Hover→"$600")
│████ (Hover→"$400")
│███████ (Hover→"$700")
Zoom Features
Stock Price:
Overview:
└─────────────
Zoomed In:
╭╮
──╯ ╰────
Choosing Advanced Plots#
Think of this like choosing photo effects:
| What You’re Showing | Plot Type | Real-Life Example |
|---|---|---|
| Relationships | Heatmap | Friend groups |
| Multiple variables | Pair Plot | Student performance |
| Complex distributions | Violin Plot | Age groups |
| Patterns over time | Time Series | Weight loss journey |
| Detailed exploration | Interactive | Digital yearbook |
Best Practices#
Color Schemes
Professional:
Low → High
⚪ 🟡 🟠 🔴
Avoid:
Rainbow colors
Interactivity Level
Good:
Click for details
Hover for values
Avoid:
Too many animations
Layout
Good:
┌─────┬─────┐
│ 1 │ 2 │
├─────┼─────┤
│ 3 │ 4 │
└─────┴─────┘
Bad:
Cluttered, random placement
Remember:
Keep it informative (like a well-organized photo album)
Make it intuitive (like natural gestures)
Enable exploration (like flipping through pages)
Maintain clarity (like good lighting in photos)
Think of advanced visualization like being a professional photographer:
Use the right tools (plot types)
Create the right composition (layout)
Add appropriate effects (interactivity)
Tell a complete story (multiple views)