Chapter 2 - Data Fundamentals


Data Types and Structures#


Basic Data Types#

1. Numerical Data#

Continuous Numbers Think of continuous data like measuring temperature - it can be any value: 98.6°F, 98.7°F, or even 98.65°F. Other examples include:

  • Height (5.7 feet, 5.8 feet)

  • Weight (150.5 pounds)

  • House prices ($299,999.99)

Discrete Numbers These are counting numbers that can’t be broken down further. Imagine:

  • Number of pets in a house

  • Students in a classroom

  • Cars in a parking lot

2. Categorical Data#

Nominal Categories Think of these as labels with no order, like:

  • Colors (Red, Blue, Green)

  • Car brands (Toyota, Honda, Ford)

  • Cities (New York, London, Tokyo)

Ordinal Categories These are categories with a natural order, such as:

  • T-shirt sizes (S, M, L, XL)

  • Movie ratings (1 to 5 stars)

  • Education levels (High School, Bachelor’s, Master’s)

3. Binary Data#

This type of data has only two possible values, often represented as 0 and 1. Examples include:

  • Yes/No questions (e.g., Is the light on? Yes or No)

  • Gender (Male or Female)

  • Pass/Fail results in exams

4. Time Series Data#

This is data that changes over time, like:

  • Daily stock prices

  • Monthly temperature readings

  • Weekly sales figures

5. Text Data#

Words and sentences that carry meaning, such as:

  • Customer reviews

  • Email content

  • Social media posts

Remember: Understanding data types helps us choose the right tools and techniques for our machine learning models. Just like you wouldn’t use a hammer to cut paper, each data type needs its own special handling!

Exercise: Understanding Data Types and Structures#

Let’s test your understanding of different data types in machine learning. For each scenario, select the most appropriate data type:

| Scenario | Your Answer |
| --- | --- |
| A sensor measuring ocean water temperature every hour | |
| Customer satisfaction ratings from 1-5 stars | |
| Blood type (A, B, AB, O) | |
| Number of children in a family | |
| Product reviews written by customers | |
| Daily closing prices of a stock | |
| Distance traveled by a car (in miles) | |
| Academic grades (A, B, C, D, F) | |

Data Structures in Python#

These data structures are commonly used in machine learning:

1. NumPy Arrays#

Think of NumPy arrays like a super-powered list. Imagine organizing chocolates in a box:

  • A 1D array is like a single row of chocolates

  • A 2D array is like a box with rows and columns

  • A 3D array is like stacking multiple boxes

import numpy as np

# 1D array (like a line of students)
heights = np.array([170, 175, 160, 180])

# 2D array (like a seating chart)
classroom = np.array([
    [1, 2, 3],
    [4, 5, 6]
])

2. Pandas DataFrames & Series#

2.1 Series Think of a Series as a single column in a spreadsheet:

  • Like a list of test scores with student names as labels

  • Or daily temperatures for a week

import pandas as pd

# Series (like a single column of grades)
grades = pd.Series([85, 90, 88, 92], 
                  index=['Alice', 'Bob', 'Charlie', 'David'])

2.2 DataFrames Imagine a DataFrame as an Excel spreadsheet:

  • Rows represent individual entries (like students)

  • Columns represent different features (like subjects)

# DataFrame (like a grade book)
student_data = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [15, 16, 15],
    'Grade': [90, 85, 88]
})

3. Tensors#

Think of tensors as stacks of arrays:

  • 0 Dimensional (0D) tensor: Single number (scalar)

  • 1 Dimensional (1D) tensor: Array (vector)

  • 2 Dimensional (2D) tensor: Matrix

  • 3 Dimensional (3D) tensor: Cube of numbers (like a Rubik’s cube)

  • 4 Dimensional (4D) tensor: Multiple cubes (like video frames)

import torch

# Creating a simple tensor (like RGB values of an image)
image_tensor = torch.tensor([
    [[255, 0, 0],    # Red
     [0, 255, 0],    # Green
     [0, 0, 255]]    # Blue
])

4. Common ML Dataset Formats#

CSV Files Like a simple spreadsheet:

name,age,grade
Alice,15,90
Bob,16,85

JSON Format Like a nested dictionary:

{
    "students": [
        {"name": "Alice", "age": 15},
        {"name": "Bob", "age": 16}
    ]
}

Parquet Files Think of these as compressed, column-oriented files:

  • Great for big data

  • Faster to read than CSV

  • Takes less storage space

Remember: These structures are like different types of containers. Just like you wouldn’t store soup in a paper bag, choosing the right data structure is crucial for efficient machine learning!

Exploratory Data Analysis#


Statistical Summary#

1. Mean, Median, and Mode#

Think of these as the “averages family”:

  • Mean: The average value

    • Like splitting a pizza equally among friends

    • Example: Test scores [90, 85, 85, 95, 75]

    • Mean = (90 + 85 + 85 + 95 + 75) ÷ 5 = 86

  • Median: The middle value

    • Like finding the middle person in a height-ordered line

    • Example: [75, 85, 85, 90, 95]

    • Median = 85 (middle number)

  • Mode: The most common value

    • Like the most popular ice cream flavor

    • Example: [75, 85, 85, 90, 95]

    • Mode = 85 (appears twice)
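
The three values above can be checked with Python's built-in statistics module — a quick sketch using the same test scores:

```python
import statistics

scores = [90, 85, 85, 95, 75]

mean = statistics.mean(scores)      # (90 + 85 + 85 + 95 + 75) / 5
median = statistics.median(scores)  # middle value of the sorted list
mode = statistics.mode(scores)      # most frequent value

print(mean, median, mode)  # 86 85 85
```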

2. Standard Deviation#

Think of this as the “spread-out-ness” of your data:

  • Small standard deviation = data points close together

    • Like a tight group of friends walking together

  • Large standard deviation = data points spread out

    • Like friends scattered across a playground

Example:

  • Dataset A: [10, 11, 9, 10, 10] → Small deviation

  • Dataset B: [2, 18, 5, 15, 10] → Large deviation
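
A minimal sketch confirming the "spread-out-ness" of the two datasets above, using the population standard deviation from the statistics module:

```python
import statistics

dataset_a = [10, 11, 9, 10, 10]  # tight group
dataset_b = [2, 18, 5, 15, 10]   # spread out

# pstdev = population standard deviation
spread_a = statistics.pstdev(dataset_a)
spread_b = statistics.pstdev(dataset_b)

print(round(spread_a, 2))  # 0.63 (small deviation)
print(round(spread_b, 2))  # 5.97 (large deviation)
```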

3. Quartiles and IQR#

Imagine lining up students by height:

  • Q1 (25th percentile): Quarter way through the line

  • Q2 (50th percentile): Halfway (same as median)

  • Q3 (75th percentile): Three-quarters way through

  • IQR = Q3 - Q1 (middle 50% of data)

Example:

Data: [1, 3, 5, 7, 8, 9, 10, 12, 15]
Q1 = 4
Q2 = 8
Q3 = 11
IQR = 11 - 4 = 7
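
The hand calculation above can be reproduced with `statistics.quantiles`; note its default `method='exclusive'` matches these numbers, while other quartile conventions (e.g. NumPy's default interpolation) can give slightly different values:

```python
import statistics

data = [1, 3, 5, 7, 8, 9, 10, 12, 15]

# n=4 splits the data at the three quartile points
q1, q2, q3 = statistics.quantiles(data, n=4)  # default method='exclusive'
iqr = q3 - q1

print(q1, q2, q3, iqr)  # 4.0 8.0 11.0 7.0
```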

4. Correlation Analysis#

Think of correlation as how two things move together:

  • Positive correlation: Both increase together

    • Like ice cream sales and temperature

  • Negative correlation: One up, one down

    • Like study time and gaming time

  • No correlation: No relationship

    • Like shoe size and test scores

Scale: -1 to +1

  • +1: Perfect positive relationship

  • -1: Perfect negative relationship

  • 0: No relationship

5. Distribution Analysis#

Think of this as the “shape” of your data:

Normal Distribution (Bell Curve)

  • Like heights in a classroom

  • Most values cluster in the middle

  • Fewer values at extremes

Skewed Distribution

  • Right-skewed: Long tail on right

    • Like household income

  • Left-skewed: Long tail on left

    • Like test scores in an easy exam

Uniform Distribution

  • Like rolling a fair die

  • All values equally likely

Data Quality Checks#

1. Identifying Duplicates#

Think of duplicates like identical twins in your dataset - sometimes they’re real, sometimes they’re mistakes!

Common Scenarios

  • Same customer entered twice

  • Multiple transactions with identical values

  • Accidentally repeated rows

# Example of finding duplicates
students = [
    {'id': 1, 'name': 'Alice', 'grade': 90},
    {'id': 1, 'name': 'Alice', 'grade': 90},  # Duplicate!
    {'id': 2, 'name': 'Bob', 'grade': 85}
]
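
A sketch of actually detecting and dropping those duplicates with pandas:

```python
import pandas as pd

students = pd.DataFrame([
    {'id': 1, 'name': 'Alice', 'grade': 90},
    {'id': 1, 'name': 'Alice', 'grade': 90},  # exact duplicate
    {'id': 2, 'name': 'Bob', 'grade': 85},
])

print(students.duplicated().sum())  # 1 duplicate row found
deduped = students.drop_duplicates()
print(len(deduped))                 # 2 rows remain
```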

2. Detecting Outliers#

Outliers are like the odd ones out - imagine a basketball player in a kindergarten class!

Common Methods

  • IQR Method (1.5 × IQR rule)

    • Normal range: Q1 - 1.5×IQR to Q3 + 1.5×IQR

  • Z-score Method

    • Values beyond 3 standard deviations

Example:

Heights (cm): [165, 170, 168, 172, 169, 250]
250 cm is clearly an outlier!
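
A sketch of the 1.5 × IQR rule on those heights (quartile values depend on the interpolation method; NumPy's default is used here):

```python
import numpy as np

heights = np.array([165, 170, 168, 172, 169, 250])

q1, q3 = np.percentile(heights, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Anything outside [low, high] is flagged as an outlier
outliers = heights[(heights < low) | (heights > high)]
print(outliers)  # [250]
```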

3. Checking Data Balance#

Think of data balance like a see-saw:

Class Balance Example

  • Spam Detection Dataset:

    • 95% Normal emails (Imbalanced!)

    • 5% Spam emails

  • Like trying to learn about tigers by showing 100 cats and 1 tiger

Solutions

  • Oversampling minority class

  • Undersampling majority class

  • Generating synthetic data
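
A minimal sketch of the first solution, random oversampling: duplicate minority-class examples until the classes balance (the 95/5 split below mirrors the spam example):

```python
import random

random.seed(0)

# Hypothetical imbalanced labels: 95 normal emails, 5 spam
labels = ['normal'] * 95 + ['spam'] * 5
spam = [x for x in labels if x == 'spam']

# Resample the minority class (with replacement) to match the majority
extra = random.choices(spam, k=95 - len(spam))
balanced = labels + extra

print(balanced.count('normal'), balanced.count('spam'))  # 95 95
```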

4. Finding Data Anomalies#

Anomalies are like suspicious behavior patterns:

Types of Anomalies

  • Point Anomalies

    • Single weird values

    • Like a -50°C temperature in summer

  • Contextual Anomalies

    • Normal values in wrong context

    • Like high ice cream sales in winter

  • Pattern Anomalies

    • Strange sequences

    • Like sudden spikes in network traffic

5. Verifying Data Types#

Think of this as making sure ingredients are correct before cooking:

Common Issues

  • Numbers stored as text

    • “123” instead of 123

  • Dates in wrong format

    • “01-12-2023” vs “12-01-2023”

  • Mixed types in columns

    • Age column: [25, “thirty”, 40]

Data Preprocessing#


Data Cleaning#

Think of data cleaning like preparing ingredients before cooking a meal - everything needs to be just right!

1. Removing Duplicates#

Think of this like removing extra copies of the same photo:

# Example messy data
student_records = [
    {'id': 1, 'name': 'Alice', 'grade': 90},
    {'id': 1, 'name': 'Alice', 'grade': 90},  # Duplicate
    {'id': 1, 'name': 'alice', 'grade': 90}   # Case difference
]

Common Duplicate Types

  • Exact duplicates (identical rows)

  • Partial duplicates (same core info, different format)

  • Case-sensitive duplicates (JOHN vs john)

2. Handling Inconsistent Values#

Think of this like standardizing measurements in a recipe:

Common Inconsistencies

  • Different spellings

    • “New York” vs “NY” vs “New York City”

  • Different units

    • “1.8m” vs “180cm” vs “5’11”

  • Different formats

    • “Yes” vs “Y” vs “1” vs “True”

Example Standardization:

Before:
  - States: ["NY", "New York", "new york"]
  - Gender: ["M", "Male", "male", "1"]
  - Status: ["Y", "Yes", "TRUE", "1"]

After:
  - States: ["NY", "NY", "NY"]
  - Gender: ["M", "M", "M", "M"]
  - Status: [True, True, True, True]

3. String Cleaning#

Like washing vegetables before cooking:

Common String Issues

  • Extra spaces: “John Smith “

  • Mixed case: “john SMITH”

  • Special characters: “John’s-Account#123”

  • Unwanted symbols: “Phone: (555) 123-4567”

4. Date/Time Formatting#

Think of this like setting all clocks to the same time zone:

Common Date Formats

Input Formats:
- "01-12-2023"
- "12/01/2023"
- "2023.12.01"
- "Dec 1, 2023"

Standardized Output:
- "2023-12-01"

5. Fixing Data Types#

Like putting groceries in their proper places:

Common Conversions

# Before cleaning
data = {
    'age': ['25', '30', 'unknown', '40'],
    'salary': ['50,000', '60000', '75,000'],
    'active': ['Y', 'N', 'yes', 'no']
}

# After cleaning
clean_data = {
    'age': [25, 30, None, 40],
    'salary': [50000, 60000, 75000],
    'active': [True, False, True, False]
}
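
A sketch of how those conversions might be done by hand; `clean_age` is a hypothetical helper that maps unparseable entries to None:

```python
def clean_age(value):
    """Convert an age string to int, mapping unparseable entries to None."""
    try:
        return int(value)
    except ValueError:
        return None

ages = ['25', '30', 'unknown', '40']
salaries = ['50,000', '60000', '75,000']
actives = ['Y', 'N', 'yes', 'no']

clean_ages = [clean_age(a) for a in ages]
clean_salaries = [int(s.replace(',', '')) for s in salaries]  # strip commas
clean_actives = [a.lower() in ('y', 'yes') for a in actives]  # map to bool

print(clean_ages)      # [25, 30, None, 40]
print(clean_salaries)  # [50000, 60000, 75000]
print(clean_actives)   # [True, False, True, False]
```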

Data Transformation#

1. One-Hot Encoding#

Think of this like breaking down a multiple-choice question into yes/no questions:

Original Data

colors = ['red', 'blue', 'green']

# Becomes:
# red:   [1, 0, 0]
# blue:  [0, 1, 0]
# green: [0, 0, 1]

When to Use

  • Categorical data with no order

  • Like: Cities, Colors, Product Categories

  • When categories are equally important
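
In practice, one-hot encoding is a one-liner with pandas — a quick sketch (note `get_dummies` orders the new columns alphabetically):

```python
import pandas as pd

colors = pd.Series(['red', 'blue', 'green'])
encoded = pd.get_dummies(colors)

print(list(encoded.columns))  # ['blue', 'green', 'red']
print(encoded.shape)          # (3, 3) — one column per category
```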

2. Label Encoding#

Think of this like giving students numbers instead of names:

# T-Shirt Sizes
sizes = ['S', 'M', 'L', 'XL']

# Becomes:
# S  → 0
# M  → 1
# L  → 2
# XL → 3

When to Use

  • Ordinal categories

  • When order matters

  • Like: Grades, Rankings, Sizes
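
A minimal sketch of label encoding with a plain dictionary, building the mapping from the known size order so the codes respect it:

```python
sizes = ['S', 'M', 'L', 'XL']

# Build the mapping from the known order, smallest first
size_order = ['S', 'M', 'L', 'XL']
size_to_code = {size: code for code, size in enumerate(size_order)}

encoded = [size_to_code[s] for s in sizes]
print(encoded)  # [0, 1, 2, 3]
```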

3. Binning Numerical Data#

Think of this like grouping students by age ranges:

# Ages: 0-100
age_bins = {
    '0-18':  'Child',
    '19-35': 'Young Adult',
    '36-50': 'Adult',
    '51+':   'Senior'
}

# Example:
# 15 → Child
# 27 → Young Adult
# 45 → Adult

Common Binning Methods

  • Equal-width bins (same size ranges)

  • Equal-frequency bins (same number of items)

  • Custom bins (based on domain knowledge)
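
A sketch of custom binning with `pandas.cut` (the bin edges below are the illustrative age ranges from above; each interval includes its right edge by default):

```python
import pandas as pd

ages = [15, 27, 45, 70]
bins = [0, 18, 35, 50, 120]
labels = ['Child', 'Young Adult', 'Adult', 'Senior']

groups = pd.cut(ages, bins=bins, labels=labels)
print(list(groups))  # ['Child', 'Young Adult', 'Adult', 'Senior']
```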

4. Log Transformation#

Think of this like using a zoom lens to see both mountains and molehills:

When to Use

  • Data with large ranges

  • Skewed distributions

  • When ratios matter more than differences

Example:

Income Data:
Original: [30000, 50000, 1000000]
Log:     [10.3,  10.8,  13.8]
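
Those values come from the natural logarithm — a quick check:

```python
import math

incomes = [30_000, 50_000, 1_000_000]
logged = [round(math.log(x), 1) for x in incomes]  # natural log

print(logged)  # [10.3, 10.8, 13.8]
```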

5. Power Transformation#

Think of this like adjusting a camera’s exposure:

Common Types

  • Square Root: For moderately skewed data

  • Cube Root: For strongly skewed data

  • Box-Cox: Automatically finds best power

Example:

Original Data: [1, 4, 9, 16]

Square Root: [1, 2, 3, 4]
Cube Root:   [1, 1.6, 2.1, 2.5]

Remember:

  • Choose transformations based on data type

  • Consider the model you’ll use

  • Keep track of transformations for future data

  • Always scale after transforming

Think of data transformation like preparing ingredients - sometimes you need to chop, grind, or puree to get the right texture for your recipe!

Feature Scaling#


Scaling Methods#

Imagine you’re comparing apples and watermelons - you need to make them comparable!

1. Min-Max Scaling#

Think of this like converting test scores to percentages:

Real-Life Example

  • Original test scores: [60, 75, 90, 100]

  • Scaled to 0-1: [0, 0.375, 0.75, 1]

# Basketball Players
heights = [160, 180, 200]  # cm
scaled_heights = [0, 0.5, 1]

# House Prices
prices = [100_000, 500_000, 900_000]
scaled_prices = [0, 0.5, 1]

When to Use

  • Like Instagram filters: scaling image pixels (0-255 → 0-1)

  • Like rating systems (1-5 stars → 0-1)

  • Perfect for neural networks that expect 0-1 input
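
A minimal sketch of min-max scaling the test scores above, using the usual formula (x - min) / (max - min):

```python
scores = [60, 75, 90, 100]
lo, hi = min(scores), max(scores)

scaled = [(x - lo) / (hi - lo) for x in scores]
print(scaled)  # [0.0, 0.375, 0.75, 1.0]
```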

2. Standard Scaling (Z-score)#

Like comparing how unusual something is:

Fun Example: Height in a Class

Original heights (cm): [150, 160, 170, 180, 190]
Mean = 170, Std ≈ 14.1

Scaled heights: [-1.41, -0.71, 0, 0.71, 1.41]
Meaning:
- 0 = Average height
- Positive = Taller than average
- Negative = Shorter than average

Real-World Example

  • Test Scores: [70, 80, 90, 100]

  • Mean = 85, Std ≈ 11.2

  • After scaling: [-1.34, -0.45, 0.45, 1.34]

  • 0 means “average student”

  • +1 means “one standard deviation above average”
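
A sketch of z-score scaling the class heights, using the population standard deviation (the sample standard deviation would give slightly different numbers):

```python
import statistics

heights = [150, 160, 170, 180, 190]
mean = statistics.mean(heights)   # 170
std = statistics.pstdev(heights)  # population std, ≈ 14.14

z_scores = [round((h - mean) / std, 2) for h in heights]
print(z_scores)  # [-1.41, -0.71, 0.0, 0.71, 1.41]
```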

3. Robust Scaling#

Like measuring kids’ heights using a growth chart:

Example: Salary Data

Salaries: [30k, 45k, 50k, 52k, 1M]
Median = 50k, IQR = 52k - 45k = 7k
                    ↓
After robust scaling: [-2.9, -0.7, 0, 0.3, 135.7]

The millionaire is still far out, but the middle
values keep a sensible, compact scale!

4. Normalization#

Like adjusting a recipe for different serving sizes:

Pizza Toppings Example

Original Recipe (4 pizzas):
- Cheese: 400g
- Tomatoes: 200g
- Pepperoni: 200g
                ↓
Normalized (per pizza):
- Cheese: 0.5 (50%)
- Tomatoes: 0.25 (25%)
- Pepperoni: 0.25 (25%)

When to Use Each Method?#

Think of scaling methods like different measuring tools:

| If Your Data Is Like… | Use This Scaling | Example |
| --- | --- | --- |
| Instagram photo pixels | Min-Max | 0-255 → 0-1 |
| Student test scores | Standard | Compare to class average |
| House prices with mansions | Robust | Handle those million-dollar outliers |
| Recipe ingredients | Normalization | Adjust for different serving sizes |

Remember:

  • Scale features like you’re making ingredients work together in a recipe

  • Always scale after splitting training/test data

  • Keep the original data backed up

Implementation Considerations#

1. Scaling Order in Pipeline#

Think of this like following a recipe in the right order:

Wrong Order (Common Mistake)

# DON'T DO THIS!
1. Scale all data
2. Split into train/test
Result: Data leakage! 😱

Right Order (Like Cooking)

1. Split data first (like separating ingredients)
2. Calculate scaling on training data (like measuring spices)
3. Apply to both train and test (like using same measurements)

Real Example:

# Height and Weight Dataset
heights = [170, 165, 180, 175]  # cm
weights = [70, 65, 80, 75]      # kg

# Step 1: Split
train_heights = [170, 165]
test_heights = [180, 175]

# Step 2: Calculate scale on train
mean_height = 167.5   # mean of [170, 165]
std_height = 2.5      # population std of [170, 165]

# Step 3: Apply to both (z = (x - mean) / std)
scaled_train = [1.0, -1.0]
scaled_test = [5.0, 3.0]
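
The three steps above can be sketched end to end — split first, fit the scaler on the training split only, then apply the same parameters everywhere:

```python
import statistics

heights = [170, 165, 180, 175]  # cm

# Step 1: split first (no peeking at test data)
train, test = heights[:2], heights[2:]

# Step 2: fit the scaler on training data only
mean = statistics.mean(train)   # 167.5
std = statistics.pstdev(train)  # 2.5

# Step 3: apply the SAME parameters to both splits
scaled_train = [(h - mean) / std for h in train]
scaled_test = [(h - mean) / std for h in test]

print(scaled_train)  # [1.0, -1.0]
print(scaled_test)   # [5.0, 3.0]
```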

Handling New Data#

Like using the same recipe measurements for future cooking:

Example: Temperature Scaling

Training Data:
Temps = [60°F, 70°F, 80°F]
Min = 60°F, Max = 80°F

New Data comes in: 90°F
Use SAME min/max from training:
scaled = (90 - 60) / (80 - 60) = 1.5

Scaling Categorical Data#

Think of this like organizing different types of fruits:

Before Scaling:

Colors: ['red', 'blue', 'red', 'green']
Sizes: ['S', 'M', 'L']

After Proper Handling:

# One-Hot for Colors:
red:   [1, 0, 0]
blue:  [0, 1, 0]
green: [0, 0, 1]

# Ordinal for Sizes:
S: 0
M: 1
L: 2

Common Pitfalls#

Like common cooking mistakes to avoid:

Pitfall 1: Scaling Target Variable

House Prices (Target):
Original: [100k, 200k, 300k]
DON'T scale unless absolutely necessary!

Pitfall 2: Scaling After Feature Selection

❌ Wrong Order:
1. Select important features
2. Scale data
✓ Right Order:
1. Scale data
2. Select important features

Pitfall 3: Using Wrong Scaler

Salary Data with Outliers:
[30k, 35k, 32k, 1M]
❌ Standard Scaler: Gets messed up
✓ Robust Scaler: Handles it well

Best Practices#

Think of these like kitchen safety rules:

1. Always Save Scaler Parameters

# Save these from training:
min_value = data.min()
max_value = data.max()

# Use for future data

2. Handle Missing Values First

Original: [10, NaN, 20, 30]
❌ Scale first
✓ Fill NaN, then scale

3. Document Your Scaling

# Good Documentation
scaling_info = {
    'height': 'standard_scaler',
    'weight': 'min_max_scaler',
    'age': 'robust_scaler'
}

Remember:

  • Scale like you’re following a recipe

  • Keep your scaling consistent

  • Document everything

  • Test your pipeline with new data

Handling Missing Values#


Missing Value Analysis#

1. Types of Missing Data#

Think of this like taking attendance in a classroom:

Missing Completely at Random (MCAR)

  • Like students randomly absent due to various reasons

  • Example:

    • Survey responses where some people randomly skipped questions

    • Temperature readings where sensor occasionally malfunctions

Temperature Log:
Monday: 75°F
Tuesday: ??    (Sensor battery died)
Wednesday: 72°F
Thursday: 74°F

Missing at Random (MAR)

  • Like teenagers more likely to skip morning classes

  • Missing values related to other variables

Shopping Survey:
Age: 25    | Income: $50k  | Luxury Brands: ??
Age: 65    | Income: $70k  | Luxury Brands: Yes
Age: 30    | Income: $40k  | Luxury Brands: No

Missing Not at Random (MNAR)

  • Like people avoiding salary questions in surveys

  • The missing value itself is meaningful

Salary Survey:
Low Income: Answered
High Income: Often Missing
Very High Income: Usually Missing

2. Missing Patterns#

Think of this like patterns in student homework submission:

Univariate Pattern

  • Like one student always missing math assignments

Student Data:
Math: ??, ??, ??
English: 85, 90, 88
Science: 92, 87, 90

Multivariate Pattern

  • Like students missing both test and homework when sick

Sick Day Records:
Test Score: ??    | Homework: ??    | Attendance: Absent
Test Score: 85   | Homework: 90   | Attendance: Present

3. Impact Assessment#

Like checking how missing ingredients affect a recipe:

Critical Missing Data

Recipe Example:
Flour: 2 cups
Sugar: ??
Eggs: 2
Impact: Can't bake without knowing sugar amount!

Non-Critical Missing Data

Customer Profile:
Name: John
Age: ??
Favorite Color: Blue
Impact: Can still provide service without age

Documentation#

Like keeping a kitchen incident log:

Good Documentation Example:

Dataset: Customer Survey 2023
Missing Values Log:
- Income: 15% missing (people skipped)
- Age: 5% missing (form errors)
- Email: 20% missing (optional field)

Actions Taken:
- Income: Used median by profession
- Age: Used mean age
- Email: Marked as 'Not Provided'

Real-World Solutions#

Like Fixing a Recipe:

Original Recipe:
- Sugar: 1 cup
- Flour: ?? cups
- Eggs: 2

Solutions:
1. Check similar recipes (Like looking at similar data points)
2. Ask an expert (Domain knowledge)
3. Use standard amount (Mean/median imputation)
4. Skip recipe (Remove row)

Remember:

  • Missing values are like missing ingredients - sometimes you can substitute, sometimes you can’t

  • Document your missing value strategy like a recipe modification

  • Always consider why the data is missing before deciding how to handle it

Handling Missing Values Techniques#

1. Deletion Methods#

Think of this like cleaning up a messy photo album:

Complete Case Analysis (Listwise Deletion) Like removing group photos where someone blinked:

Before:
Photo 1: [👨 👩 😴 👶]  # Someone sleeping
Photo 2: [👨 👩 👧 👶]  # Perfect photo
Photo 3: [👨 ?? 👧 👶]  # Someone missing

After:
Keep only Photo 2

Pairwise Deletion Like using different photos for different purposes:

Family Height Chart:
Dad: 6'0"
Mom: ??
Son: 5'0"
Daughter: 5'2"

Use what you have:
- Can still compare Dad and kids
- Skip comparisons with Mom

2. Simple Imputation#

Think of this like substituting ingredients in cooking:

Mean Imputation Like using average cookie size when recipe doesn’t specify:

Cookie Sizes:
[3", 3.5", ??, 3", 4"]
Solution: Use average (3.375")

Mode Imputation Like picking the most common ice cream flavor:

Ice Cream Orders:
[Vanilla, ??, Chocolate, Vanilla, ??, Vanilla]
Solution: Fill with Vanilla

Constant Imputation Like marking “Unknown” for missing names:

Contact List:
Name: John
Phone: ??    → "Not Provided"
Email: john@email.com
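
A sketch of mean and mode imputation with pandas `fillna`, reusing the cookie and ice-cream examples above:

```python
import pandas as pd

cookie_sizes = pd.Series([3.0, 3.5, None, 3.0, 4.0])
flavors = pd.Series(['Vanilla', None, 'Chocolate', 'Vanilla', None, 'Vanilla'])

# Mean imputation for the numeric column
filled_sizes = cookie_sizes.fillna(cookie_sizes.mean())

# Mode imputation for the categorical column
filled_flavors = flavors.fillna(flavors.mode()[0])

print(filled_sizes[2])    # 3.375 (the average of the known sizes)
print(filled_flavors[1])  # Vanilla
```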

3. Statistical Imputation#

Like making educated guesses based on patterns:

Median Imputation Good for skewed data, like house prices:

House Prices:
[$200k, $210k, ??, $220k, $1M]
Better to use median than mean!

Regression Imputation Like guessing someone’s weight based on height:

Height | Weight
5'0"   | 110 lbs
5'6"   | ??
6'0"   | 180 lbs
Use pattern to estimate missing weight

4. Advanced Imputation#

Like sophisticated detective work:

K-Nearest Neighbors (KNN) Like asking similar people for answers:

Movie Ratings:
Person A: [5, 4, ??, 5]
Similar viewers rated it 4
→ Probably Person A would rate it 4

Multiple Imputation Like getting multiple opinions:

Missing Salary:
Estimate 1: $50k (based on age)
Estimate 2: $52k (based on education)
Estimate 3: $48k (based on experience)
Final: Average of all estimates

When to Use Each Method#

Think of this like choosing tools for different home repairs:

| Missing Data Situation | Method | Real-Life Analogy |
| --- | --- | --- |
| Few missing values (<5%) | Deletion | Removing few bad apples |
| Random missing values | Mean/Median | Using average recipe portions |
| Categorical data | Mode | Most popular menu item |
| Related variables exist | Regression | Guessing gift price from wrapping |
| Complex patterns | Advanced | Detective solving complex case |

Remember:

  • Simple isn’t always worse (like basic recipes often work best)

  • Consider the impact of your choice (like substituting ingredients)

  • Document your methods (like writing down recipe modifications)

  • Test your results (like tasting the food)

Think of handling missing values like being a chef - sometimes you substitute ingredients (imputation), sometimes you modify the recipe (deletion), but always make sure the final dish makes sense!

Feature Selection#


Feature Importance#

Think of this like picking players for a sports team:

1. Statistical Methods#

Like measuring athletes’ performance statistics:

T-Test Example

Basketball Players:
Height Impact:
Tall players: 20 pts average
Short players: 10 pts average
→ Height is statistically significant!

Weight Impact:
Heavy players: 15 pts average
Light players: 14 pts average
→ Weight might not matter much

2. Correlation Analysis#

Like finding which friends always hang out together:

Strong Correlation

Ice Cream Sales vs Temperature:
🌡️ 75°F → $100 sales
🌡️ 85°F → $200 sales
🌡️ 95°F → $300 sales
(Strong positive correlation)

Weak Correlation

Ice Cream Sales vs Day of Week:
Monday: $150
Tuesday: $200
Wednesday: $100
(No clear pattern)

3. Information Gain#

Like choosing questions for “20 Questions” game:

Good Questions (High Information Gain)

Animal Guessing Game:
Q: "Is it a mammal?" 
→ Eliminates 70% of possibilities!

Bad Questions (Low Information Gain):
Q: "Is it purple?"
→ Rarely helps identify animal

4. Feature Rankings#

Like ranking kitchen tools by usefulness:

Chef's Tool Ranking:
1. Knife (Used in 90% of dishes)
2. Pan (Used in 70% of dishes)
3. Whisk (Used in 30% of dishes)
4. Melon baller (Used in 1% of dishes)

5. Domain Knowledge#

Like an experienced chef knowing what matters:

Cooking Example:

Making Pizza:
Important Features:
- Dough quality
- Oven temperature
- Cooking time

Less Important:
- Brand of pan
- Color of spatula
- Kitchen temperature

Real-World Example: House Price Prediction#

Like a realtor knowing what affects house prices:

Features Available:
1. Square footage 🏠
2. Location 📍
3. Number of rooms 🚪
4. Wall color 🎨
5. House age 📅
6. Door handle style 🔒

Important (Keep):
- Square footage (Strong correlation)
- Location (High impact)
- Number of rooms (Significant)

Unimportant (Drop):
- Wall color (Can change easily)
- Door handle style (Irrelevant)

Feature Selection Methods Comparison#

Think of this like choosing tools for different jobs:

| Method | Real-Life Analogy | When to Use |
| --- | --- | --- |
| Statistical | Sports tryouts | Clear performance metrics |
| Correlation | Friend groups | Finding relationships |
| Information Gain | Game strategy | Decision trees |
| Domain Knowledge | Expert advice | Industry-specific problems |

Common Patterns to Watch#

Like reading recipe reviews:

Redundant Features

Recipe Ingredients:
- Butter
- Margarine
(Choose one, not both!)

Irrelevant Features

Basketball Performance:
- Height (Relevant)
- Favorite color (Irrelevant)

Remember:

  • Not all features are created equal (like ingredients in a recipe)

  • More isn’t always better (like too many spices)

  • Let the data speak (like taste-testing)

  • Trust expert knowledge (like following chef’s advice)

Think of feature selection like packing for a trip:

  • Take what you need (important features)

  • Leave what you don’t (irrelevant features)

  • Consider the destination (your model’s purpose)

  • Ask experienced travelers (domain experts)

Selection Methods#

1. Filter Methods#

Think of this like using tryout tests for a sports team:

Variance Threshold

Basketball Shooting Test:
Player A: [10, 11, 9, 10] points
Player B: [2, 15, 0, 20] points
Player C: [10, 10, 10, 10] points

Keep: Player A, B (show variety)
Drop: Player C (no variance)
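
A minimal sketch of a variance threshold filter on those tryout scores — features (players) with zero variance carry no information and get dropped:

```python
import statistics

players = {
    'A': [10, 11, 9, 10],
    'B': [2, 15, 0, 20],
    'C': [10, 10, 10, 10],  # no variance — tells us nothing
}

threshold = 0.0
kept = [name for name, scores in players.items()
        if statistics.pvariance(scores) > threshold]

print(kept)  # ['A', 'B']
```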

Correlation Filter Like choosing between similar players:

Two Players' Stats:
Points  | Rebounds
Player1: [20, 10, 15]
Player2: [19, 11, 14]

Decision: Too similar - keep only one!

2. Wrapper Methods#

Like holding actual practice games to pick players:

Forward Selection

Building a Soccer Team:
Start: Empty team
Round 1: Add goalkeeper (biggest impact)
Round 2: Add striker (next best)
Round 3: Add defender
Keep adding until team performance stops improving

Backward Elimination

Orchestra Selection:
Start: Full orchestra
Remove: Triangle player (least impact)
Remove: Second tambourine
Keep removing until performance drops

3. Embedded Methods#

Like having a coach who evaluates players during the actual game:

LASSO Example

Basketball Team:
Player | Points | Defense | Team Spirit
A      | High   | Low     | Medium
B      | Medium | Medium  | Medium
C      | Low    | Low     | Low

LASSO automatically benches Player C!

4. Dimensionality Reduction#

Like combining similar skills into one rating:

PCA Example

Student Grades:
Math Tests: [90, 85, 88]
Algebra Tests: [92, 87, 89]
Geometry Tests: [88, 86, 90]

Combine into: "Math Ability Score"

Real-World Example:

Restaurant Rating:
Original Features:
- Food Quality (1-5)
- Taste (1-5)
- Flavor (1-5)
- Service Speed (1-5)
- Waiter Friendliness (1-5)

Reduced to:
- Food Score (combining first 3)
- Service Score (combining last 2)

5. Feature Engineering Basics#

Like creating new sports metrics from existing stats:

Combining Features

Shopping Patterns:
Original:
- Items bought: 5
- Total cost: $100

New Feature:
- Cost per item: $20

Breaking Down Features

Date: "2023-11-26"
Becomes:
- Year: 2023
- Month: 11
- Day: 26
- Is_Weekend: True
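
A sketch of breaking that date into features with the standard library:

```python
import datetime

d = datetime.date(2023, 11, 26)

features = {
    'year': d.year,
    'month': d.month,
    'day': d.day,
    'is_weekend': d.weekday() >= 5,  # Monday=0 ... Sunday=6
}

print(features)  # {'year': 2023, 'month': 11, 'day': 26, 'is_weekend': True}
```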

Selection Method Comparison#

Think of this like different ways to build a team:

| Method | Real-Life Analogy | Best For |
| --- | --- | --- |
| Filter | Quick tryouts | Large feature sets |
| Wrapper | Practice games | Small feature sets |
| Embedded | In-game evaluation | Complex relationships |
| Reduction | Skill combining | Many similar features |

Common Patterns#

Feature Combinations

Car Performance:
Original Features:
- Distance: 100 miles
- Time: 2 hours

New Feature:
- Speed: 50 mph

Time-Based Features

Sales Data:
Original: Purchase DateTime
New Features:
- Hour of Day
- Day of Week
- Is Holiday

Remember:

  • Start simple (like basic tryouts)

  • Test combinations (like team chemistry)

  • Consider computation costs (like training time)

  • Use domain knowledge (like coach’s experience)

Think of feature selection methods like different recruitment strategies:

  • Filters: Quick initial screening

  • Wrappers: Thorough tryouts

  • Embedded: On-the-job evaluation

  • Reduction: Combining roles

  • Engineering: Creating new positions

The goal is to build the best team (model) with the right players (features)!

Data Visualization#


Basic Plots#

1. Histograms#

Think of this like organizing kids by height in a class photo:

Height Distribution Example:

Kids' Heights:
4'0" - 4'2": 🧍‍♂️🧍‍♂️ (2 kids)
4'3" - 4'5": 🧍‍♂️🧍‍♂️🧍‍♂️🧍‍♂️ (4 kids)
4'6" - 4'8": 🧍‍♂️🧍‍♂️🧍‍♂️ (3 kids)
4'9" - 4'11": 🧍‍♂️ (1 kid)

Real-Life Uses:

  • Test scores distribution

  • Daily steps count

  • Customer age groups

2. Box Plots#

Like summarizing a family’s height in one picture:

Family Heights:
                    *  (Dad - Outlier)
                    
            ┌──┬──┐
            │  │  │
        ────┴──┴──┴────
            
Shortest (Mom) to Tallest (Kids)

Shows:

  • Minimum (Shortest family member)

  • Maximum (Tallest regular member)

  • Median (Middle height)

  • Outliers (Unusually tall Dad)

3. Scatter Plots#

Like mapping where kids drop their toys:

Toy Location Map:
   Kitchen
     │
     │  🧸  🎮
     │    🎪
     │      📱
     │
─────┼─────────────
     │     Living Room

Real Examples:

Ice Cream Sales vs Temperature:
Cold  ·
     · ·
Sales · · ·
     · · · ·
     · · · · ·
     └──────────
      Temperature

4. Line Plots#

Like tracking a child’s height over time:

Growth Chart:
Height
   │    📏
   │   📏
   │  📏
   │ 📏
   │📏
   └─────────
    Age (years)

Perfect For:

  • Temperature changes

  • Stock prices

  • Weight loss progress

  • Monthly sales

5. Bar Charts#

Like comparing kids’ candy collection after Halloween:

Candy Count:
Alice  │██████ (6)
Bob    │████████ (8)
Charlie│███ (3)
David  │█████ (5)

Real-World Uses:

  • Monthly expenses

  • Product sales comparison

  • Survey responses

  • Sports scores

Choosing the Right Plot#

Think of this like choosing the right camera lens:

| What You’re Showing | Plot Type | Real-Life Example |
| --- | --- | --- |
| Distribution | Histogram | Kids’ ages in school |
| Comparison | Bar Chart | Test scores by subject |
| Trend over time | Line Plot | Daily temperature |
| Relationships | Scatter Plot | Height vs Weight |
| Summary stats | Box Plot | Family heights |

Common Visualization Tips#

Color Usage

Good:
Temperature: Blue (cold) to Red (hot)

Bad:
Rainbow colors for categories

Labels and Titles

Good:
"Monthly Ice Cream Sales 2023"

Bad:
"Graph 1"

Scale Considerations

Good:
Starting bar charts at zero

Bad:
Cutting off bars to exaggerate differences

Remember:

  • Keep it simple (like a family photo)

  • Tell a story (like a photo album)

  • Be honest (no photoshopping!)

  • Consider your audience (like showing photos to grandma)

Think of data visualization like being a photographer:

  • Choose the right angle (plot type)

  • Frame it well (axes and scales)

  • Make it clear (labels and colors)

  • Tell the story (context and explanation)

Advanced Visualization#

1. Correlation Heatmaps#

Think of this like a friendship map in a classroom:

Friend Groups:
     Alice  Bob  Charlie  David
Alice  🔴    🟡    🟢      🟡
Bob    🟡    🔴    🟡      🟢
Charlie🟢    🟡    🔴      🟡
David  🟡    🟢    🟡      🔴

🔴 = Best Friends
🟢 = Good Friends
🟡 = Sometimes Play Together
⚪ = Don't Interact

Real-World Example:

Menu Item Combinations:
        Coffee  Tea   Cake  Sandwich
Coffee    🔴     🟡    🟢     ⚪
Tea       🟡     🔴    🟢     ⚪
Cake      🟢     🟢    🔴     🟡
Sandwich  ⚪     ⚪    🟡     🔴

Shows what customers buy together!

2. Pair Plots#

Like taking photos of every possible pair at a party:

Student Metrics Grid:
     Study│Grade│Sleep
Study ━━━━│  📊 │  📊
Grade  📊 │━━━━│  📊
Sleep  📊 │  📊 │━━━━

Real Example:

Car Features:
     Speed│MPG │Price
Speed ━━━━│ 📉 │ 📉
MPG   📉  │━━━━│ 📈
Price 📉  │ 📈 │━━━━

3. Distribution Plots#

Like showing how kids spread out in a playground:

Kernel Density Plot

Height Distribution:
      ╭─────╮
      │     │
   ╭──┤     ├──╮
___╱  │     │  ╰___
   Short  Med  Tall

Violin Plot

Test Scores by Class:
Class A:  )│(  Wide spread
Class B:  )─(  Normal spread
Class C:  )•(  Tight spread

4. Time Series Plots#

Like making a flip-book of your growing garden:

Multiple Time Series

Plant Growth:
Height│   🌳
     │  🌲🌳
     │ 🌲🌳
     │🌲
     └─────────
      Week 1-4

Seasonal Patterns

Ice Cream Sales:
Sales│    🍦
     │🍦    🍦
     │  🍦
     │
     └─────────
      Win Spr Sum Fall

5. Interactive Visualizations#

Like a digital photo frame that responds to touch:

Hover Effects

Sales Dashboard:
Bar Chart:
│██████ (Hover→"$600")
│████ (Hover→"$400")
│███████ (Hover→"$700")

Zoom Features

Stock Price:
Overview:
└─────────────

Zoomed In:
   ╭╮
──╯ ╰────

Choosing Advanced Plots#

Think of this like choosing photo effects:

| What You’re Showing | Plot Type | Real-Life Example |
| --- | --- | --- |
| Relationships | Heatmap | Friend groups |
| Multiple variables | Pair Plot | Student performance |
| Complex distributions | Violin Plot | Age groups |
| Patterns over time | Time Series | Weight loss journey |
| Detailed exploration | Interactive | Digital yearbook |

Best Practices#

Color Schemes

Professional:
Low → High
⚪ 🟡 🟠 🔴

Avoid:
Rainbow colors

Interactivity Level

Good:
Click for details
Hover for values

Avoid:
Too many animations

Layout

Good:
┌─────┬─────┐
│ 1   │  2  │
├─────┼─────┤
│ 3   │  4  │
└─────┴─────┘

Bad:
Cluttered, random placement

Remember:

  • Keep it informative (like a well-organized photo album)

  • Make it intuitive (like natural gestures)

  • Enable exploration (like flipping through pages)

  • Maintain clarity (like good lighting in photos)

Think of advanced visualization like being a professional photographer:

  • Use the right tools (plot types)

  • Create the right composition (layout)

  • Add appropriate effects (interactivity)

  • Tell a complete story (multiple views)