This morning you learned to store data. But what if I told you Netflix analyzes the viewing habits of 260 MILLION users to decide what shows to recommend? How do you think they store and analyze that much data?

The Big Idea:

Variables store one piece of information
Lists store many pieces of related information
Data Science = finding patterns in those collections

Concept: Lists – Your First Dataset

Explanation: “A list in Python is like a column in a spreadsheet. Instead of storing one value, it stores many related values in order.”

# Instead of this (one student):
student_name = "Alex"
test_score = 85

# We can do this (whole class):
student_names = ["Alex", "Sam", "Jordan", "Casey", "Taylor", "Morgan"]
test_scores = [85, 92, 78, 96, 88, 74]

# The data is connected by position:
# Alex scored 85, Sam scored 92, Jordan scored 78, etc.

Code Example: Basic List Operations

Before we code, let’s learn some powerful built-in functions:

len() – counts how many items are in a list
max() – finds the highest value in a list of numbers
min() – finds the lowest value in a list of numbers
sum() – adds up all numbers in a list
round() – rounds a decimal number to fewer decimal places

# Our first real dataset: Class test scores
print("=== CLASSROOM DATA ANALYSIS ===")

# Create our lists (like columns in a spreadsheet)
student_names = ["Alex", "Sam", "Jordan", "Casey", "Taylor", "Morgan"]
test_scores = [85, 92, 78, 96, 88, 74]

# Basic dataset information using built-in functions
print("Number of students:", len(student_names))  # len() counts items in the list
print("Highest score:", max(test_scores))         # max() finds the biggest number
print("Lowest score:", min(test_scores))          # min() finds the smallest number

# Calculate class average (first real data science calculation!)
total_points = sum(test_scores)           # sum() adds all numbers: 85+92+78+96+88+74 = 513
class_average = total_points / len(test_scores)  # 513 ÷ 6 students = 85.5
print("Class average:", round(class_average, 1))  # round(..., 1) keeps one decimal place

# Show all the data using a loop
print("\n=== INDIVIDUAL STUDENT SCORES ===")
for i in range(len(student_names)):  # range(6) produces the positions 0, 1, 2, 3, 4, 5
    # i is the position: 0=Alex, 1=Sam, 2=Jordan, etc.
    print(f"{student_names[i]}: {test_scores[i]} points")

Notice how student_names[0] and test_scores[0] both refer to Alex’s data. The position numbers keep everything connected!
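
The position-based pairing can also be written with Python's built-in zip(), which walks through two lists side by side. This is only a sketch of an alternative to the range(len(...)) loop, using the same class data; the variable names top_score, top_position, and top_student are made up for this illustration.

```python
student_names = ["Alex", "Sam", "Jordan", "Casey", "Taylor", "Morgan"]
test_scores = [85, 92, 78, 96, 88, 74]

# zip() pairs up the items that share the same position in both lists
for name, score in zip(student_names, test_scores):
    print(f"{name}: {score} points")

# The shared positions also tell us who earned the highest score
top_score = max(test_scores)
top_position = test_scores.index(top_score)  # position of the highest score
top_student = student_names[top_position]    # same position in the names list
print(f"Top score: {top_student} with {top_score} points")
```

Both loops print the same lines. zip() reads a little more cleanly, while the range(len(...)) version makes the position number visible, which is why the lesson starts there.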

Challenge 1: Analyzing Your Friend Group

Your Turn: Create lists for your friend group and analyze their screen time habits!

# TODO: Fill in with real data about your friends
friend_names = ["Friend1", "Friend2", "Friend3", "Friend4", "Friend5"]
daily_screen_time = [0, 0, 0, 0, 0]  # Hours per day (estimate!)

# Your mission: Calculate and display:
# 1. Who spends the most time on their phone?
# 2. What's the average screen time for your friend group?
# 3. How does each friend compare to the average?

print("=== FRIEND GROUP SCREEN TIME ANALYSIS ===")
# Write your code here!

Section 2: Finding Patterns (25 minutes)

Concept: Data Filtering and Categorization

Explanation: “Data scientists don’t just calculate averages – they find patterns and group data into categories. This is how Spotify creates playlists or how schools identify students who need extra help.”

New Skills We’ll Learn:

Empty lists – Start with [] and add items later using .append()
.append() – Adds an item to the end of a list
Categorization – Sorting data into groups based on conditions

Code Example: Academic Performance Analysis

print("=== ADVANCED STUDENT PERFORMANCE ANALYSIS ===")

# Expanded dataset with more students
students = ["Alex", "Sam", "Jordan", "Casey", "Taylor", "Morgan", "Riley", "Avery"]
test_scores = [85, 92, 78, 96, 88, 74, 91, 82]
study_hours = [2.5, 4.0, 1.5, 4.5, 3.0, 1.0, 3.5, 2.0]  # Hours per week

# Create empty lists to sort students into categories
excellent_students = []    # Will hold names of students with scores >= 90
good_students = []        # Will hold names of students with scores 80-89  
needs_help_students = []  # Will hold names of students with scores < 80

print("Analyzing each student...")

# Analyze each student (this is real data science!)
for i in range(len(students)):  # i goes from 0 to 7 (8 students total)
    name = students[i]          # Get student name at position i
    score = test_scores[i]      # Get test score at position i  
    hours = study_hours[i]      # Get study hours at position i
    
    # Categorize based on test score
    if score >= 90:
        excellent_students.append(name)  # Add this student's name to excellent list
        performance_category = "Excellent"
    elif score >= 80:  # This means 80-89 (since >= 90 was already checked)
        good_students.append(name)       # Add this student's name to good list
        performance_category = "Good"
    else:  # This means < 80
        needs_help_students.append(name) # Add this student's name to needs help list
        performance_category = "Needs Support"
    
    # Show individual analysis
    print(f"{name}: {score} points, {hours}h study time → {performance_category}")

# Generate insights (the payoff!)
print(f"\n=== CLASS INSIGHTS ===")
print(f"Excellent performers ({len(excellent_students)} students): {excellent_students}")
print(f"Good performers ({len(good_students)} students): {good_students}")
print(f"Students who could use support ({len(needs_help_students)} students): {needs_help_students}")

# Advanced insight: Is there a relationship between study time and performance?
print(f"\n=== STUDY TIME ANALYSIS ===")
if len(excellent_students) > 0:  # Make sure we have excellent students to analyze
    # Calculate average study time for excellent students
    excellent_study_times = []  # Empty list to collect study times
    for name in excellent_students:  # Go through each excellent student
        student_position = students.index(name)  # Find where this student is in the main list
        study_time = study_hours[student_position]  # Get their study time
        excellent_study_times.append(study_time)  # Add to our collection
    
    excellent_avg_study = sum(excellent_study_times) / len(excellent_study_times)
    print(f"Excellent students study an average of {excellent_avg_study:.1f} hours per week")
else:
    print("No excellent students to analyze study time patterns")

Key Learning Points:

.append() lets us build lists as we analyze data
Multiple lists can work together (names, scores, and hours are connected by position)
.index() helps us find where an item is located in a list
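
The study-time calculation above only covers the excellent group. One way to reuse it for any group is to wrap the same steps in a function. This is a sketch under assumptions: the function name average_study_hours is hypothetical, and it uses def, which may not have been introduced yet in this course.

```python
students = ["Alex", "Sam", "Jordan", "Casey", "Taylor", "Morgan", "Riley", "Avery"]
study_hours = [2.5, 4.0, 1.5, 4.5, 3.0, 1.0, 3.5, 2.0]  # Hours per week


def average_study_hours(group_names):
    """Return the average study hours for the named students, or None for an empty group."""
    if len(group_names) == 0:
        return None  # Nothing to average, so avoid dividing by zero
    times = []
    for name in group_names:
        position = students.index(name)      # Where is this student in the main list?
        times.append(study_hours[position])  # Collect their study time
    return sum(times) / len(times)


# Groups as produced by the categorization loop above
excellent = ["Sam", "Casey", "Riley"]
needs_help = ["Jordan", "Morgan"]

print("Excellent group:", average_study_hours(excellent), "hours per week")
print("Needs-support group:", average_study_hours(needs_help), "hours per week")
```

One caution: .index() returns the first matching position, so this approach assumes every name in the students list is unique.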

Challenge 2: Simple Grade Calculator

Your Turn: Create a grade calculator that analyzes multiple students and determines who needs help!

print("=== CLASS GRADE ANALYZER ===")

# Dataset: Student test scores from this week
student_names = ["Maya", "Jordan", "Alex", "Sam", "Casey"]
test_scores = [92, 67, 88, 45, 91]

print("=== BASIC CLASS INFORMATION ===")
print("Students in class:", len(student_names))
print("All test scores:", test_scores)

# TODO: Calculate class statistics using the functions you learned
# STEP 1: Calculate the total points (use sum() function)
total_points = # Your code here

# STEP 2: Calculate the class average (total ÷ number of students)
class_average = # Your code here

# STEP 3: Find the highest and lowest scores (use max() and min() functions)
highest_score = # Your code here
lowest_score = # Your code here

# Display the statistics (round the average to 1 decimal place)
print("Class average:", round(class_average, 1))
print("Highest score:", highest_score)
print("Lowest score:", lowest_score)

# TODO: Analyze each student's performance
print("\n=== INDIVIDUAL STUDENT ANALYSIS ===")

# STEP 4: Create empty lists to categorize students
students_needing_help = []  # For students with scores below 70
students_doing_well = []    # For students with scores 70 or above

# STEP 5: Loop through each student and categorize them
for i in range(len(student_names)):
    name = student_names[i]
    score = test_scores[i]
    
    print(f"Analyzing {name}: {score} points")
    
    # TODO: Write the if/else logic to categorize students
    if # Your condition here (hint: check if score is below 70):
        # Add student to needs help list
        # Set performance message
        print(f"  → NEEDS HELP (below 70)")
    else:
        # Add student to doing well list  
        # Set performance message
        print(f"  → DOING WELL (70 or above)")

# TODO: Generate insights for the teacher
print("\n=== TEACHER INSIGHTS ===")
print(f"Students doing well: {students_doing_well}")
print(f"Students needing help: {students_needing_help}")

# STEP 6: Calculate what percentage of class needs help
# (number needing help ÷ total students) × 100
help_percentage = # Your calculation here

print(f"Number needing support: {len(students_needing_help)} out of {len(student_names)}")
print(f"Percentage needing help: {round(help_percentage, 1)}%")

# STEP 7: Provide recommendation based on results
if len(students_needing_help) > 0:
    print(f"\nRECOMMENDATION: Schedule tutoring sessions for {len(students_needing_help)} students")
    print(f"Focus on: {students_needing_help}")
else:
    print("\nGREAT NEWS: All students are performing well!")

Hints to Get You Started:

# Hint for Step 1: Use sum() to add all scores
total_points = sum(test_scores)

# Hint for Step 2: Divide total by number of students  
class_average = total_points / len(test_scores)

# Hint for Step 5: Check if score is less than 70
if score < 70:
    students_needing_help.append(name)
else:
    students_doing_well.append(name)

What You’re Learning:

Data categorization – Sorting students into “doing well” vs “needs help”
Percentage calculations – Converting counts to percentages for insights
List building – Using .append() to collect students in different categories
Educational applications – How schools use data to identify at-risk students
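
To make the percentage step concrete, here is a small worked example. The quiz scores are invented for this illustration (they are not the challenge data), and the threshold of 70 simply mirrors the one used in the challenge.

```python
# Hypothetical quiz scores, made up for this example
quiz_scores = [55, 81, 64, 90, 72, 68, 95, 77]
threshold = 70

# Collect every score that falls below the threshold
below_threshold = []
for score in quiz_scores:
    if score < threshold:
        below_threshold.append(score)

# (count in the group ÷ total count) × 100 turns a count into a percentage
percentage_below = (len(below_threshold) / len(quiz_scores)) * 100

print(f"{len(below_threshold)} of {len(quiz_scores)} scores are below {threshold}")
print(f"That is {round(percentage_below, 1)}% of the class")
```

Here 3 of the 8 scores are below 70, so the calculation is 3 ÷ 8 × 100 = 37.5%.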

Extension Challenges (If You Finish Early):

Change the threshold: Try changing the “needs help” threshold from 70 to 75. How does this change the results?
Add more categories: Create a third category for “excellent” students (scores >= 90)
Find the struggling student: Write code to find which student has the lowest score and needs the most help
Calculate grade letters: Add logic to assign letter grades (A, B, C, D, F) to each student

Section 3: Multi-Variable Analysis

Concept: Connecting Multiple Data Points

Explanation: “Real data science gets exciting when you connect different types of information. Does study time actually correlate with test scores? Do premium users really listen to more music? This is where insights come from!”

Advanced Skills We’ll Use:

Multi-factor scoring – Combining several pieces of data to make predictions
Weighted factors – Some data points matter more than others
Conditional logic chains – Using if/elif with multiple conditions
Complex business analysis – Multiple separate analytical workflows in one program

Code Example: Student Success Predictor

print("=== STUDENT SUCCESS PREDICTOR ===")

# Comprehensive student dataset (4 different types of information per student)
students = ["Emma", "Liam", "Sophia", "Noah", "Olivia", "William", "Ava", "James"]
test_scores = [92, 78, 96, 85, 88, 74, 91, 82]
study_hours = [4.0, 1.5, 4.5, 2.5, 3.0, 1.0, 3.5, 2.0]  # Hours per week
attendance_rate = [0.95, 0.82, 0.98, 0.90, 0.93, 0.78, 0.96, 0.87]  # Decimal (0.95 = 95%)
extracurriculars = [2, 0, 3, 1, 2, 0, 2, 1]  # Number of activities

print("=== COMPREHENSIVE STUDENT ANALYSIS ===")
print("Analyzing", len(students), "students using 4 different factors...")

for i in range(len(students)):  # Analyze each student individually
    name = students[i]
    score = test_scores[i]
    hours = study_hours[i]
    attendance = attendance_rate[i]
    activities = extracurriculars[i]
    
    print(f"\n--- Analyzing {name} ---")
    print(f"Test Score: {score}, Study Time: {hours}h, Attendance: {attendance:.0%}, Activities: {activities}")
    
    # Calculate a "success prediction score" using multiple factors
    success_score = 0  # Start at zero and add points for each factor
    
    # Factor 1: Test performance (worth up to 40 points - most important!)
    print("Factor 1 - Test Performance:")
    if score >= 90:
        points_earned = 40
        success_score += 40
        print(f"  Score {score} >= 90: Excellent! +{points_earned} points")
    elif score >= 80:  # Between 80-89
        points_earned = 25
        success_score += 25
        print(f"  Score {score} is 80-89: Good! +{points_earned} points")
    else:  # Below 80
        points_earned = 10
        success_score += 10
        print(f"  Score {score} < 80: Needs work. +{points_earned} points")
    
    # Factor 2: Study habits (worth up to 30 points)
    print("Factor 2 - Study Habits:")
    if hours >= 3.5:
        points_earned = 30
        success_score += 30
        print(f"  {hours} hours >= 3.5: Excellent study habits! +{points_earned} points")
    elif hours >= 2.0:  # Between 2.0-3.4 hours
        points_earned = 20
        success_score += 20
        print(f"  {hours} hours is 2.0-3.4: Good study habits. +{points_earned} points")
    else:  # Less than 2 hours
        points_earned = 5
        success_score += 5
        print(f"  {hours} hours < 2.0: Needs more study time. +{points_earned} points")
    
    # Factor 3: Attendance (worth up to 20 points)
    print("Factor 3 - Attendance:")
    if attendance >= 0.95:  # 95% or higher
        points_earned = 20
        success_score += 20
        print(f"  {attendance:.0%} >= 95%: Excellent attendance! +{points_earned} points")
    elif attendance >= 0.85:  # 85-94%
        points_earned = 15
        success_score += 15
        print(f"  {attendance:.0%} is 85-94%: Good attendance. +{points_earned} points")
    else:  # Below 85%
        points_earned = 5
        success_score += 5
        print(f"  {attendance:.0%} < 85%: Attendance needs improvement. +{points_earned} points")
    
    # Factor 4: Engagement/Activities (worth up to 10 points)
    print("Factor 4 - Extracurricular Engagement:")
    if activities >= 2:
        points_earned = 10
        success_score += 10
        print(f"  {activities} activities >= 2: Well-rounded! +{points_earned} points")
    elif activities >= 1:
        points_earned = 5
        success_score += 5
        print(f"  {activities} activity: Some involvement. +{points_earned} points")
    else:  # 0 activities
        points_earned = 0
        success_score += 0
        print(f"  {activities} activities: Consider joining activities. +{points_earned} points")
    
    # Generate prediction based on total score (out of 100 possible points)
    print(f"Total Success Score: {success_score}/100")
    
    if success_score >= 85:
        prediction = "Highly likely to succeed"
        confidence = "High confidence"
    elif success_score >= 65:
        prediction = "Good chance of success" 
        confidence = "Moderate confidence"
    else:
        prediction = "May need additional support"
        confidence = "Requires intervention"
    
    print(f"PREDICTION: {prediction} ({confidence})")

Key Data Science Concepts:

Weighted scoring – Test scores worth 40 points, study habits 30, attendance 20, activities 10
Threshold analysis – Different score ranges lead to different conclusions
Multi-factor prediction – Combining several data types for better accuracy
Confidence levels – Higher scores = more confident predictions
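
The four factors above all follow the same pattern: compare a value to a few thresholds and award points. As a sketch only, that pattern could be pulled into one helper. The names points_for, success_score, and the *_TIERS lists are hypothetical, and the example uses functions and (minimum, points) pairs, which may be beyond what has been covered so far. The thresholds and point values are copied from the predictor above.

```python
def points_for(value, tiers, fallback):
    """Return the points of the first tier whose minimum the value reaches."""
    for minimum, points in tiers:
        if value >= minimum:
            return points
    return fallback  # Value is below every tier


# (minimum value, points) pairs, highest tier first
SCORE_TIERS = [(90, 40), (80, 25)]           # otherwise 10 points
HOURS_TIERS = [(3.5, 30), (2.0, 20)]         # otherwise 5 points
ATTENDANCE_TIERS = [(0.95, 20), (0.85, 15)]  # otherwise 5 points
ACTIVITY_TIERS = [(2, 10), (1, 5)]           # otherwise 0 points


def success_score(score, hours, attendance, activities):
    """Add up the points from all four factors (maximum 100)."""
    return (points_for(score, SCORE_TIERS, 10)
            + points_for(hours, HOURS_TIERS, 5)
            + points_for(attendance, ATTENDANCE_TIERS, 5)
            + points_for(activities, ACTIVITY_TIERS, 0))


print("Emma:", success_score(92, 4.0, 0.95, 2))
print("Liam:", success_score(78, 1.5, 0.82, 0))
```

Keeping the thresholds in lists means the weights can be changed in one place without touching the if/elif logic. The longer version above is still easier to read when you are learning how each decision is made.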

Challenge 3: Music Streaming Analysis (Advanced)

Your Turn: You work for “TeenTunes” (a fictional music streaming app) and need to analyze user listening habits using multiple analytical techniques!

print("=== TEENTUNES COMPREHENSIVE USER ANALYSIS ===")

# Your dataset: User listening habits
usernames = ["MusicLover23", "PopFan2024", "RockStar_Kid", "StudyBeats", "DanceMachine", "ChillVibes"]
monthly_hours = [45, 67, 23, 89, 52, 34]  # Hours listened this month
favorite_genres = ["Pop", "Pop", "Rock", "Lo-fi", "Electronic", "Indie"]
premium_subscriber = [False, True, False, True, True, False]  # True = Premium, False = Free

# TODO: ANALYSIS 1 - Basic Dataset Overview
print("=== BASIC DATASET OVERVIEW ===")
print("Total users analyzed:", len(usernames))
print("All monthly listening hours:", monthly_hours)

# STEP 1: Calculate total listening time across all users
total_listening = # Your code here (use sum function)

# STEP 2: Calculate average listening time per user  
average_listening = # Your code here (total ÷ number of users)

print(f"Total company listening time: {total_listening} hours")
print(f"Average user listening time: {round(average_listening, 1)} hours per month")

# TODO: ANALYSIS 2 - Power User Identification
print("\n=== ANALYSIS 2: POWER USER IDENTIFICATION ===")
print("Finding users who listen more than 50 hours per month...")

# STEP 3: Create empty lists to categorize users
power_users = []     # Users with > 50 hours
regular_users = []   # Users with <= 50 hours

# STEP 4: Loop through users and categorize them
for i in range(len(usernames)):
    user = usernames[i]
    hours = monthly_hours[i]
    
    # TODO: Write if/else logic to categorize users
    if # Your condition here (hint: check if hours > 50):
        # Add user to power_users list
        # Print that they're a power user
        print(f"{user}: {hours} hours - POWER USER!")
    else:
        # Add user to regular_users list  
        # Print that they're a regular user
        print(f"{user}: {hours} hours - Regular user")

# Display results
print(f"\nPower Users Identified: {len(power_users)} out of {len(usernames)} total")
print(f"Power users: {power_users}")
print(f"Regular users: {regular_users}")

# TODO: ANALYSIS 3 - Premium vs Free User Comparison
print("\n=== ANALYSIS 3: PREMIUM VS FREE USER BEHAVIOR ===")

# STEP 5: Set up variables to track premium vs free users
premium_total_hours = 0    # Will add up all premium user hours
premium_count = 0          # Will count premium users
free_total_hours = 0       # Will add up all free user hours  
free_count = 0             # Will count free users

print("Categorizing users by subscription type:")

# STEP 6: Loop through users and separate by subscription type
for i in range(len(usernames)):
    user = usernames[i]
    hours = monthly_hours[i]
    is_premium = premium_subscriber[i]  # True or False
    
    # TODO: Write if/else to handle premium vs free users
    if # Your condition here (check if is_premium == True):
        # Add hours to premium_total_hours
        # Add 1 to premium_count
        # Print user as PREMIUM
        print(f"{user}: {hours} hours (PREMIUM)")
    else:
        # Add hours to free_total_hours
        # Add 1 to free_count  
        # Print user as FREE
        print(f"{user}: {hours} hours (FREE)")

# STEP 7: Calculate averages (with error checking)
print(f"\nSubscription Analysis Results:")

if premium_count > 0:
    premium_avg = # Your calculation here
    print(f"Premium users ({premium_count} total): {round(premium_avg, 1)} hours average")
else:
    print("No premium users found")
    premium_avg = 0

if free_count > 0:
    free_avg = # Your calculation here  
    print(f"Free users ({free_count} total): {round(free_avg, 1)} hours average")
else:
    print("No free users found")
    free_avg = 0

# TODO: ANALYSIS 4 - Genre Popularity
print("\n=== ANALYSIS 4: MUSIC GENRE POPULARITY ===")
print("All user genre preferences:", favorite_genres)

# STEP 8: Count each genre manually
pop_count = 0
rock_count = 0
lofi_count = 0
electronic_count = 0
indie_count = 0

print("Counting genre preferences:")
# STEP 9: Loop through favorite_genres and count each one
for i in range(len(usernames)):
    user = usernames[i]
    genre = favorite_genres[i]
    
    # TODO: Write if/elif statements to count each genre
    if # genre equals "Pop":
        # Add 1 to pop_count
    elif # genre equals "Rock":
        # Add 1 to rock_count  
    elif # genre equals "Lo-fi":
        # Add 1 to lofi_count
    elif # genre equals "Electronic":
        # Add 1 to electronic_count
    elif # genre equals "Indie":
        # Add 1 to indie_count
    
    print(f"{user} likes {genre}")

# STEP 10: Display results and find most popular
print(f"\nGenre Popularity Results:")
print(f"Pop: {pop_count} users")
print(f"Rock: {rock_count} users") 
print(f"Lo-fi: {lofi_count} users")
print(f"Electronic: {electronic_count} users")
print(f"Indie: {indie_count} users")

# TODO: Find which genre is most popular
# STEP 11: Create a list of all counts and find the maximum
all_counts = [pop_count, rock_count, lofi_count, electronic_count, indie_count]
genre_names = ["Pop", "Rock", "Lo-fi", "Electronic", "Indie"]
max_count = # Your code here (use max function)

# Find which genre has the max count
print(f"\nMost popular genre analysis:")
for i in range(len(genre_names)):
    if all_counts[i] == max_count:
        most_popular_genre = genre_names[i]
        print(f"🎵 Winner: {genre_names[i]} with {max_count} users!")

# TODO: FINAL BUSINESS INSIGHTS
print("\n" + "="*50)
print("=== COMPREHENSIVE BUSINESS INSIGHTS FOR TEENTUNES ===")
print("="*50)

# STEP 12: Calculate key business metrics
power_user_percentage = # Your calculation: (power users ÷ total users) × 100
conversion_rate = # Your calculation: (premium users ÷ total users) × 100
popular_genre_percentage = # Your calculation: (max genre count ÷ total users) × 100

print(f"USER BASE: {len(usernames)} total users analyzed")
print(f"ENGAGEMENT: {len(power_users)} power users ({round(power_user_percentage, 1)}% of base)")
print(f"REVENUE: {premium_count} premium subscribers ({round(conversion_rate, 1)}% conversion rate)")
print(f"USAGE: Premium users average {premium_avg:.1f}h vs Free users {free_avg:.1f}h")
print(f"CONTENT: {most_popular_genre} is most popular genre ({round(popular_genre_percentage, 1)}% of users)")

# STEP 13: Generate strategic recommendations
print("\nSTRATEGIC RECOMMENDATIONS:")
# TODO: Write if/else logic for recommendations
if premium_avg > free_avg:
    print(f"• Premium users are more engaged - highlight exclusive content in marketing")
else:
    print(f"• Free users are highly engaged - focus on converting them to premium")

print(f"• Expand {most_popular_genre} music library to attract more users")
print(f"• Target power users for premium upgrades and exclusive features")

Major Hints to Get You Started:

# Hint for totals and averages:
total_listening = sum(monthly_hours)
average_listening = total_listening / len(monthly_hours)

# Hint for power user categorization:
if hours > 50:
    power_users.append(user)
else:
    regular_users.append(user)

# Hint for premium vs free:
if is_premium:  # Same as is_premium == True, because is_premium is already True or False
    premium_total_hours += hours
    premium_count += 1
else:
    free_total_hours += hours
    free_count += 1

# Hint for calculating averages:
premium_avg = premium_total_hours / premium_count
free_avg = free_total_hours / free_count

# Hint for genre counting:
if genre == "Pop":
    pop_count += 1
elif genre == "Rock":
    rock_count += 1
# Continue for other genres...

# Hint for percentages:
power_user_percentage = (len(power_users) / len(usernames)) * 100

Advanced Concepts You’re Mastering:

Multiple analysis workflows in a single program
Data accumulation and counter variables for complex calculations
Business metrics calculation (conversion rates, percentages, averages)
Comparative analysis between user segments
Error handling with conditional checks before calculations
Strategic insight generation from multiple data points
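
The "check before you divide" idea from the list above can be sketched as a small reusable helper. The name safe_average is hypothetical, and returning 0 for an empty list is an assumption chosen to match the fallback values used in the challenge template.

```python
def safe_average(values):
    """Average of a list of numbers, or 0 when the list is empty."""
    if len(values) == 0:
        return 0  # Dividing by zero would crash the program
    return sum(values) / len(values)


monthly_hours = [45, 67, 23, 89, 52, 34]
premium_subscriber = [False, True, False, True, True, False]

# Split the hours into two lists by subscription type
premium_hours = []
free_hours = []
for i in range(len(monthly_hours)):
    if premium_subscriber[i]:
        premium_hours.append(monthly_hours[i])
    else:
        free_hours.append(monthly_hours[i])

print(f"Premium average: {safe_average(premium_hours):.1f} hours")
print(f"Free average: {safe_average(free_hours):.1f} hours")
print(f"Average of an empty group: {safe_average([])}")
```

Collecting the hours in lists (instead of keeping a running total and a counter) is a design choice: it uses a little more memory, but it lets the same helper work for any group.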

Extension Challenges (If You Finish Early):

Add user satisfaction analysis: Create a new list with satisfaction ratings and analyze happy vs unhappy users
Find the most valuable user: Combine premium status and listening hours to identify the most valuable customer
Create user profiles: Write code that generates a complete profile for each user including all their metrics
Build a recommendation engine: Based on listening hours and genre, recommend whether each user should upgrade to premium

INSIGHT DAY 2

Theme of the day: social media advice.

Opening Question:

“Raise your hand if you’ve ever watched a video that had millions of views and thought ‘I could make something better than this!’ Today we’re going to figure out what actually makes videos go viral and which platform gives creators the best shot at fame.”

Today’s Mission:

“By the end of today, you’ll be able to analyze real data from 500+ viral videos and give advice to content creators about platform strategy, optimal posting times, and what types of content actually work!”

Session 1: From Lists to CSV Files

Concept Introduction: The CSV File Format

Explanation: “Yesterday you worked with small lists that fit in a few lines of code. But what if Netflix wants to analyze data from 260 million users? Or TikTok wants to understand patterns from billions of videos? That’s where CSV files come in!”

What is a CSV file?

CSV = Comma Separated Values
Like a spreadsheet, but in plain text
Each row is one record (like one video)
Each column is one piece of information (like view count)

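Show students a simple example (the full dataset used later has more columns and names some of them differently, such as title and creator_name):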

video_title,platform,views,likes,creator
"Dancing Cat",TikTok,2500000,180000,CatLover23
"Study Tips",YouTube,890000,45000,StudyBuddy
"Cooking Hack",Instagram,1200000,75000,ChefLife

Why data scientists love CSV files:

Can handle millions of rows
Work with any programming language
Easy to share between different tools
Industry standard for data analysis
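
To see that a CSV file really is just text, the three sample rows can be parsed straight from a string. This is a sketch, not the lesson's loading code: io.StringIO is used here only so the example runs without a file on disk, and the variable names sample_text and rows are made up for the illustration.

```python
import csv
import io

# The sample rows from above, stored as plain text
sample_text = '''video_title,platform,views,likes,creator
"Dancing Cat",TikTok,2500000,180000,CatLover23
"Study Tips",YouTube,890000,45000,StudyBuddy
"Cooking Hack",Instagram,1200000,75000,ChefLife
'''

# io.StringIO lets the csv module read a string as if it were an open file
reader = csv.DictReader(io.StringIO(sample_text))
rows = list(reader)  # One dictionary per data row; the header row becomes the keys

print("Rows loaded:", len(rows))
print("Columns:", list(rows[0].keys()))
for row in rows:
    print(f"{row['video_title']} on {row['platform']}: {row['views']} views")
```

Notice that the header line is not counted as a row, and that every value (including the view counts) comes back as text.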

Concept: Reading CSV Files with Python

New Skills We’ll Learn:

import csv – Brings in Python’s CSV reading abilities
csv.DictReader() – Reads each row of a CSV file as a dictionary, so we can look up values by column name instead of by position
File handling – Opening and closing files safely

Link to CSV file on Social Media Videos: HERE!!!!!

Step-by-Step Code Example:

# Step 1: Import the CSV library
import csv

# Step 2: Create an empty list to store all our video data
viral_videos = []

# Step 3: Open the CSV file and read it
with open('viral_videos_2024.csv', 'r', newline='') as file:  # newline='' is recommended when reading with the csv module
    # csv.DictReader turns each row into a dictionary
    # This means we can access data by column name instead of position
    reader = csv.DictReader(file)
    
    # Step 4: Loop through each row and add it to our list
    for row in reader:
        viral_videos.append(row)  # Each row is a dictionary with video info

# Step 5: Check what we loaded
print(f"SUCCESS! Loaded {len(viral_videos)} viral videos for analysis!")

# Step 6: Look at the first video to understand our data
print("\nFirst video in our dataset:")
first_video = viral_videos[0]
print(f"Title: {first_video['title']}")
print(f"Platform: {first_video['platform']}")
print(f"Views: {first_video['views']}")
print(f"Likes: {first_video['likes']}")
print(f"Creator: {first_video['creator_name']}")

Now instead of student_names[0] we use viral_videos[0]['creator_name']. The [0] still picks a row by position, but inside that row we look up each value by a meaningful column name instead of a position number!
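
A side-by-side comparison may help here. The sketch below stores the same two students first as yesterday's parallel lists and then as a list of dictionaries, which is the shape csv.DictReader produces. The names student_records and record are hypothetical.

```python
# Day 1 style: parallel lists, connected only by position
student_names = ["Alex", "Sam"]
test_scores = [85, 92]

# Day 2 style: one dictionary per record, collected in a list
student_records = [
    {"name": "Alex", "score": 85},
    {"name": "Sam", "score": 92},
]

# Parallel lists: we must use the same position in every list
print(student_names[1], test_scores[1])

# List of dictionaries: pick the record once, then ask for fields by name
record = student_records[1]
print(record["name"], record["score"])
```

With dictionaries, a record's values travel together, so adding a new column later does not require keeping yet another list in the same order.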

Code Example: Basic Data Exploration

Before analyzing, let’s learn data exploration functions:

len() – How many videos do we have?
.keys() – What information is available for each video?
Data type conversion – CSV files read everything as text, but we need numbers for math

print("=== VIRAL VIDEO DATASET EXPLORATION ===")

# Basic dataset information
print(f"Total videos in dataset: {len(viral_videos)}")
print(f"Data available for each video: {list(viral_videos[0].keys())}")

# Look at a few example videos to understand our data
print("\n=== SAMPLE VIDEOS ===")
for i in range(3):  # Show first 3 videos
    video = viral_videos[i]
    print(f"\nVideo {i+1}:")
    print(f"  Title: {video['title']}")
    print(f"  Platform: {video['platform']}")
    print(f"  Views: {video['views']}")  # This is still text!
    print(f"  Creator: {video['creator_name']}")

# IMPORTANT: Convert text numbers to actual numbers for calculations
print("\n=== CONVERTING DATA FOR ANALYSIS ===")

# Method 1: Convert individual values
first_video = viral_videos[0]
views_as_text = first_video['views']  # This is text like "2500000"
views_as_number = int(first_video['views'])  # Now it's a number: 2500000

print(f"Views as text: '{views_as_text}' (type: {type(views_as_text)})")
print(f"Views as number: {views_as_number} (type: {type(views_as_number)})")
print(f"Now we can do math: {views_as_number} + 1000 = {views_as_number + 1000}")

# Method 2: Convert data for the whole dataset (we'll use this approach)
print("\nConverting all numerical data...")
for video in viral_videos:
    # Convert text numbers to integers for calculations
    video['views'] = int(video['views'])
    video['likes'] = int(video['likes'])
    video['comments'] = int(video['comments'])
    video['shares'] = int(video['shares'])
    video['follower_count'] = int(video['follower_count'])
    video['video_length_seconds'] = int(video['video_length_seconds'])

print("Data conversion complete! Ready for analysis.")

# Now we can do real data science!
print("\n=== FIRST ANALYSIS: BASIC STATISTICS ===")

# Extract all view counts into a list (like yesterday's work!)
all_view_counts = []
for video in viral_videos:
    all_view_counts.append(video['views'])

# Calculate statistics using functions from yesterday
total_views = sum(all_view_counts)
average_views = total_views / len(all_view_counts)
highest_views = max(all_view_counts)
lowest_views = min(all_view_counts)

print(f"Total views across all viral videos: {total_views:,}")
print(f"Average views per viral video: {round(average_views):,}")
print(f"Most viral video had: {highest_views:,} views")
print(f"Least viral video had: {lowest_views:,} views")

# Find the most viral video
for video in viral_videos:
    if video['views'] == highest_views:
        print(f"\nMOST VIRAL VIDEO:")
        print(f"  Title: '{video['title']}'")
        print(f"  Platform: {video['platform']}")
        print(f"  Creator: {video['creator_name']}")
        break

Key Learning Points:

CSV files store everything as text – we must convert to numbers for math
Dictionary access – video['views'] instead of position numbers
Data exploration first – always understand your data before analyzing
Same functions, bigger data – sum(), max(), len() work the same way
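
Real CSV files sometimes contain blank cells or numbers written with commas, and int() stops with an error on those. A defensive conversion could look like the sketch below. The helper name to_int and the choice of 0 as the default are assumptions for illustration, and try/except may not have been introduced yet in this course.

```python
def to_int(text, default=0):
    """Convert CSV text such as '2500000' or '1,200,000' to an int.

    Returns default when the text cannot be read as a whole number.
    """
    cleaned = text.strip().replace(",", "")  # Drop spaces and thousands separators
    try:
        return int(cleaned)
    except ValueError:
        return default  # Blank cells or words like 'n/a' end up here


print(to_int("2500000"))
print(to_int(" 1,200,000 "))
print(to_int(""))
print(to_int("n/a", default=-1))
```

Whether a missing value should count as 0, be skipped, or be flagged is an analysis decision; silently using 0 can pull averages down, so it is worth checking how many rows needed the default.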

Challenge 1: Platform Comparison Basics

Your Turn: Use what you just learned to compare TikTok vs YouTube vs Instagram!

print("=== PLATFORM BATTLE: TIKTOK VS YOUTUBE VS INSTAGRAM ===")

# TODO: STEP 1 - Count videos by platform
tiktok_count = 0
youtube_count = 0
instagram_count = 0

# Loop through all videos and count each platform
for video in viral_videos:
    platform = video['platform']
    
    # TODO: Write if/elif statements to count each platform
    if # Your condition here (check if platform equals "TikTok"):
        # Add 1 to tiktok_count
    elif # Your condition here (check if platform equals "YouTube"):
        # Add 1 to youtube_count
    elif # Your condition here (check if platform equals "Instagram"):
        # Add 1 to instagram_count

print(f"TikTok videos: {tiktok_count}")
print(f"YouTube videos: {youtube_count}")
print(f"Instagram videos: {instagram_count}")

# TODO: STEP 2 - Calculate total views by platform
tiktok_total_views = 0
youtube_total_views = 0
instagram_total_views = 0

# Loop through videos again and add up views for each platform
for video in viral_videos:
    platform = video['platform']
    views = video['views']
    
    # TODO: Write if/elif statements to add views to platform totals
    if # platform == "TikTok":
        # Add views to tiktok_total_views
    elif # platform == "YouTube":
        # Add views to youtube_total_views
    elif # platform == "Instagram":
        # Add views to instagram_total_views

print(f"\nTotal views by platform:")
print(f"TikTok: {tiktok_total_views:,} views")
print(f"YouTube: {youtube_total_views:,} views")
print(f"Instagram: {instagram_total_views:,} views")

# TODO: STEP 3 - Calculate average views per video by platform
# (Only calculate if we have videos for that platform)
if tiktok_count > 0:
    tiktok_avg = # Your calculation here
    print(f"TikTok average views per video: {round(tiktok_avg):,}")

if youtube_count > 0:
    youtube_avg = # Your calculation here
    print(f"YouTube average views per video: {round(youtube_avg):,}")

if instagram_count > 0:
    instagram_avg = # Your calculation here
    print(f"Instagram average views per video: {round(instagram_avg):,}")

# TODO: STEP 4 - Determine the winner
print(f"\n PLATFORM BATTLE RESULTS:")
print(f"Most videos: # Determine which platform has most videos")
print(f"Most total views: # Determine which platform has most total views")
print(f"Highest average views: # Determine which platform has highest average")

Hints:

# Hint for counting platforms:
if platform == "TikTok":
    tiktok_count += 1

# Hint for calculating averages:
tiktok_avg = tiktok_total_views / tiktok_count

# Hint for finding winners:
if tiktok_count > youtube_count and tiktok_count > instagram_count:
    print("Most videos: TikTok")
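
Chained comparisons get long once there are more than three platforms. One alternative, sketched here with made-up counts, is to keep the counts in a dictionary and let max() pick the key with the largest value. If two platforms tie, max() returns whichever one comes first in the dictionary.

```python
# Made-up counts for illustration; use your own totals from Step 1
platform_counts = {"TikTok": 210, "YouTube": 175, "Instagram": 130}

# key=platform_counts.get tells max() to compare the counts, not the names
most_videos = max(platform_counts, key=platform_counts.get)
print(f"Most videos: {most_videos}")
```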

What You’re Learning:

Working with real datasets (500+ videos vs. yesterday’s 5-6 items)
Data type conversion (text to numbers for calculations)
Dictionary access (using meaningful names instead of positions)
Platform comparison (business intelligence basics)

Session 2: Advanced Viral Video Analytics

Concept: Advanced Data Filtering and Analysis

New Skills for Real Data Science:

Multiple conditions – Finding videos that meet several criteria
Engagement rate calculations – Converting raw numbers to meaningful metrics
Data categorization – Grouping videos by performance levels
Temporal analysis – Understanding time-based patterns

Key Metrics Content Creators Care About:

Engagement Rate = (Likes + Comments) ÷ Views
Viral Threshold = Videos above average performance
Platform Success Rate = Percentage of videos that go viral by platform
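
To make these three metrics concrete, here is a minimal sketch on four made-up videos. It assumes "viral" means above the average views of the sample, following the threshold definition above. The field names match the ones used in this lesson, but the numbers are invented and the success_rate helper is hypothetical.

```python
# Four made-up videos (numbers invented for illustration)
sample_videos = [
    {"platform": "TikTok", "views": 1000, "likes": 90, "comments": 10},
    {"platform": "TikTok", "views": 3000, "likes": 200, "comments": 40},
    {"platform": "YouTube", "views": 5000, "likes": 300, "comments": 50},
    {"platform": "YouTube", "views": 1000, "likes": 40, "comments": 10},
]

# Engagement rate = (likes + comments) / views
for video in sample_videos:
    video["engagement_rate"] = (video["likes"] + video["comments"]) / video["views"]

# Viral threshold = average views across the sample
viral_threshold = sum(v["views"] for v in sample_videos) / len(sample_videos)


def success_rate(videos, platform, threshold):
    """Share of a platform's videos with views above the threshold."""
    platform_videos = [v for v in videos if v["platform"] == platform]
    if not platform_videos:
        return 0.0  # No videos for this platform, so avoid dividing by zero
    viral = [v for v in platform_videos if v["views"] > threshold]
    return len(viral) / len(platform_videos)


print(f"Viral threshold: {viral_threshold:,.0f} views")
for name in ("TikTok", "YouTube"):
    rate = success_rate(sample_videos, name, viral_threshold)
    print(f"{name} success rate: {rate:.0%}")
```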

Code Example: Content Creator Intelligence

print("=== CONTENT CREATOR INTELLIGENCE REPORT ===")

# Advanced Analysis 1: Calculate engagement rates
print("Calculating engagement rates for all videos...")

for video in viral_videos:
    views = video['views']
    likes = video['likes']
    comments = video['comments']
    
    # Calculate engagement rate (likes + comments) / views
    if views > 0:  # Avoid division by zero
        engagement_rate = (likes + comments) / views
        video['engagement_rate'] = round(engagement_rate, 4)  # Store for later use
    else:
        video['engagement_rate'] = 0

print("Engagement rates calculated!")

# Advanced Analysis 2: Find top performing videos by engagement
print("\n=== TOP 5 MOST ENGAGING VIDEOS ===")

# Sort the videos themselves by engagement rate (highest first).
# Sorting whole records avoids showing the same video twice when
# two videos happen to share an engagement rate.
videos_by_engagement = sorted(viral_videos, key=lambda video: video['engagement_rate'], reverse=True)
top_5_videos = videos_by_engagement[:5]  # First 5 items

top_5_rates = [video['engagement_rate'] for video in top_5_videos]
print("Highest engagement rates found:", top_5_rates)

# Display the top 5 videos
for video in top_5_videos:
    print(f"\nEngagement Rate: {video['engagement_rate']:.1%}")
    print(f"   Title: '{video['title']}'")
    print(f"   Platform: {video['platform']}")
    print(f"   Views: {video['views']:,}")
    print(f"   Likes: {video['likes']:,}")
    print(f"   Comments: {video['comments']:,}")

# Advanced Analysis 3: Video length analysis
print("\n=== VIDEO LENGTH ANALYSIS ===")

# Categorize videos by length
short_videos = []    # Under 60 seconds
medium_videos = []   # 60-300 seconds (1-5 minutes)
long_videos = []     # Over 300 seconds (5+ minutes)

for video in viral_videos:
    length = video['video_length_seconds']
    
    if length < 60:
        short_videos.append(video)
    elif length <= 300:  # 60-300 seconds
        medium_videos.append(video)
    else:  # Over 300 seconds
        long_videos.append(video)

print(f"Short videos (< 1 min): {len(short_videos)} videos")
print(f"Medium videos (1-5 min): {len(medium_videos)} videos")
print(f"Long videos (5+ min): {len(long_videos)} videos")

# Calculate average views by video length category
if len(short_videos) > 0:
    short_avg_views = sum(video['views'] for video in short_videos) / len(short_videos)
    print(f"Short videos average views: {round(short_avg_views):,}")

if len(medium_videos) > 0:
    medium_avg_views = sum(video['views'] for video in medium_videos) / len(medium_videos)
    print(f"Medium videos average views: {round(medium_avg_views):,}")

if len(long_videos) > 0:
    long_avg_views = sum(video['views'] for video in long_videos) / len(long_videos)
    print(f"Long videos average views: {round(long_avg_views):,}")

# Advanced Analysis 4: Creator success analysis
print("\n=== CREATOR SUCCESS ANALYSIS ===")

# Find creators with multiple viral videos
creator_video_counts = {}  # Dictionary to count videos per creator

for video in viral_videos:
    creator = video['creator_name']
    
    if creator in creator_video_counts:
        creator_video_counts[creator] += 1  # Add to existing count
    else:
        creator_video_counts[creator] = 1   # First video for this creator

# Find creators with more than 1 viral video
successful_creators = []
for creator, count in creator_video_counts.items():
    if count > 1:
        successful_creators.append((creator, count))

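# Sort by video count (highest first) so the slice below really is the top 5
successful_creators.sort(key=lambda pair: pair[1], reverse=True)

print(f"Creators with multiple viral videos: {len(successful_creators)}")
for creator, count in successful_creators[:5]:  # Show top 5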
    print(f"  {creator}: {count} viral videos")

Key Advanced Concepts:

Engagement rate calculation – Converting raw data to meaningful metrics
Data sorting and filtering – Finding top performers
Multiple categorization – Video length analysis
Dictionary usage for counting – Track creator success
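
The dictionary counting pattern is common enough that Python's standard library has a shortcut for it. As a sketch with made-up creator names, collections.Counter builds the same counts in one step, and most_common() returns them sorted from highest to lowest.

```python
from collections import Counter

# Made-up records for illustration
sample_videos = [
    {"creator_name": "ana"},
    {"creator_name": "ben"},
    {"creator_name": "ana"},
    {"creator_name": "cara"},
    {"creator_name": "ana"},
    {"creator_name": "ben"},
]

creator_video_counts = Counter(video["creator_name"] for video in sample_videos)

# Keep only creators with more than one video, highest count first
successful_creators = [
    (creator, count)
    for creator, count in creator_video_counts.most_common()
    if count > 1
]
print(successful_creators)
```

Writing the if/else version by hand first is still worthwhile, because it shows what Counter is doing for you.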

Challenge 2: Ultimate Platform Strategy Analysis

Your Mission: You’re consulting for a new content creator who wants to know: “Which platform should I focus on, what type of content should I make, and when should I post?” Use data analysis to give them evidence-based advice!

print("=== PLATFORM STRATEGY CONSULTING PROJECT ===")
print("Analyzing 500+ viral videos to develop content strategy...")

# TODO: ANALYSIS 1 - Platform Performance Comparison (20 minutes)
print("\n=== ANALYSIS 1: PLATFORM PERFORMANCE DEEP DIVE ===")

# Step 1: Calculate average engagement rate by platform
tiktok_engagement_rates = []
youtube_engagement_rates = []
instagram_engagement_rates = []

# TODO: Loop through videos and collect engagement rates by platform
for video in viral_videos:
    platform = video['platform']
    engagement = video['engagement_rate']
    
    # TODO: Add engagement rate to appropriate platform list
    if # platform == "TikTok":
        # Add engagement to tiktok_engagement_rates list
    elif # platform == "YouTube":
        # Add engagement to youtube_engagement_rates list
    elif # platform == "Instagram":
        # Add engagement to instagram_engagement_rates list

# Calculate average engagement by platform
if len(tiktok_engagement_rates) > 0:
    tiktok_avg_engagement = # Your calculation here
    print(f"TikTok average engagement rate: {tiktok_avg_engagement:.1%}")

if len(youtube_engagement_rates) > 0:
    youtube_avg_engagement = # Your calculation here
    print(f"YouTube average engagement rate: {youtube_avg_engagement:.1%}")

if len(instagram_engagement_rates) > 0:
    instagram_avg_engagement = # Your calculation here
    print(f"Instagram average engagement rate: {instagram_avg_engagement:.1%}")

# TODO: ANALYSIS 2 - Content Category Success Analysis (25 minutes)
print("\n=== ANALYSIS 2: CONTENT CATEGORY ANALYSIS ===")

# Available categories in dataset: "Comedy", "Educational", "Music", "Dance", "Lifestyle", "Gaming"
category_performance = {}

# TODO: Calculate average views by content category
for video in viral_videos:
    category = video['category']
    views = video['views']
    
    # TODO: Track views by category
    if category in category_performance:
        # Add this video's data to existing category
        category_performance[category]['total_views'] += views
        category_performance[category]['video_count'] += 1
    else:
        # Create new category entry
        category_performance[category] = {
            'total_views': views,
            'video_count': 1
        }

# Calculate and display average views by category
print("Content category performance:")
for category, data in category_performance.items():
    avg_views = data['total_views'] / data['video_count']
    print(f"{category}: {round(avg_views):,} avg views ({data['video_count']} videos)")

# TODO: ANALYSIS 3 - Optimal Video Length Analysis (15 minutes)
print("\n=== ANALYSIS 3: OPTIMAL VIDEO LENGTH STRATEGY ===")

# Analyze average views by platform and video length
print("Average views by video length and platform:")

platforms = ["TikTok", "YouTube", "Instagram"]
for platform in platforms:
    print(f"\n{platform} Analysis:")
    
    platform_short = []
    platform_medium = []
    platform_long = []
    
    # TODO: Categorize videos by length for this platform
    for video in viral_videos:
        if video['platform'] == platform:
            length = video['video_length_seconds']
            views = video['views']
            
            # TODO: Add to appropriate length category
            if # length < 60:
                # Add views to platform_short
            elif # length <= 300:
                # Add views to platform_medium  
            else:
                # Add views to platform_long

    # Calculate averages for each length category
    if len(platform_short) > 0:
        short_avg = sum(platform_short) / len(platform_short)
        print(f"  Short videos (< 1 min): {round(short_avg):,} avg views")
    
    if len(platform_medium) > 0:
        medium_avg = sum(platform_medium) / len(platform_medium)
        print(f"  Medium videos (1-5 min): {round(medium_avg):,} avg views")
    
    if len(platform_long) > 0:
        long_avg = sum(platform_long) / len(platform_long)
        print(f"  Long videos (5+ min): {round(long_avg):,} avg views")

# TODO: ANALYSIS 4 - Creator Success Factor Analysis (15 minutes)
print("\n=== ANALYSIS 4: CREATOR SUCCESS FACTORS ===")

# Analyze relationship between follower count and video performance
small_creators = []   # Under 10,000 followers
medium_creators = []  # 10,000 - 100,000 followers  
large_creators = []   # Over 100,000 followers

# TODO: Categorize creators by follower count and track their video performance
for video in viral_videos:
    followers = video['follower_count']
    views = video['views']
    
    # TODO: Add video views to appropriate creator size category
    if # followers < 10000:
        # Add views to small_creators
    elif # followers <= 100000:
        # Add views to medium_creators
    else:
        # Add views to large_creators

# Calculate average views by creator size
print("Video performance by creator size:")
if len(small_creators) > 0:
    small_avg = sum(small_creators) / len(small_creators)
    print(f"Small creators (< 10K followers): {round(small_avg):,} avg views")

if len(medium_creators) > 0:
    medium_avg = sum(medium_creators) / len(medium_creators)
    print(f"Medium creators (10K-100K followers): {round(medium_avg):,} avg views")

if len(large_creators) > 0:
    large_avg = sum(large_creators) / len(large_creators)
    print(f"Large creators (100K+ followers): {round(large_avg):,} avg views")

# TODO: FINAL RECOMMENDATIONS - Synthesize all analysis into advice
print("\n" + "="*60)
print("=== FINAL CONTENT CREATOR STRATEGY RECOMMENDATIONS ===")
print("="*60)

# TODO: Generate data-driven recommendations based on your analysis
print("PLATFORM RECOMMENDATION:")
print("Based on engagement rate analysis: # Your recommendation here")

print("\n CONTENT STRATEGY:")
print("Based on category performance: # Your recommendation here")

print("\n VIDEO LENGTH STRATEGY:")
print("Based on length analysis: # Your recommendation here")

print("\n GROWTH STRATEGY:")
print("Based on creator success factors: # Your recommendation here")

print("\n KEY INSIGHTS:")
print("• # Insight 1 from your analysis")
print("• # Insight 2 from your analysis") 
print("• # Insight 3 from your analysis")

Major Hints to Get You Started:

# Hint for collecting engagement rates by platform:
if platform == "TikTok":
    tiktok_engagement_rates.append(engagement)

# Hint for calculating average engagement:
tiktok_avg_engagement = sum(tiktok_engagement_rates) / len(tiktok_engagement_rates)

# Hint for category performance tracking:
if category in category_performance:
    category_performance[category]['total_views'] += views
    category_performance[category]['video_count'] += 1
else:
    category_performance[category] = {'total_views': views, 'video_count': 1}

# Hint for video length categorization:
if length < 60:
    platform_short.append(views)
elif length <= 300:
    platform_medium.append(views)
else:
    platform_long.append(views)

Advanced Concepts You’re Mastering:

Multi-dimensional analysis – Platform × content type × video length
Business intelligence – Converting data into actionable recommendations
Dictionary data structures – Advanced data organization
Comparative analysis – Understanding relative performance across categories
Strategic thinking – Synthesizing multiple analyses into coherent advice

Extension Challenges (If You Finish Early):

Viral Prediction Model: Create a scoring system to predict if a video will go viral
Seasonal Analysis: Analyze if certain content performs better at different times
Cross-Platform Creator Analysis: Find creators successful on multiple platforms
Engagement Quality Analysis: Compare like-to-comment ratios across platforms

INSIGHT DAY 3

Session 1: Becoming a Data Detective

Concept Introduction: Where Real Data Lives

The Internet is Full of Free, Amazing Datasets:

Government websites – Weather, population, crime, economics
Sports databases – Every game, player, and stat you can imagine
Entertainment platforms – Movie ratings, music charts, social media trends
Science repositories – NASA data, health studies, environmental monitoring

Where Professional Data Scientists Find Data:

Kaggle.com – One of the world’s largest data science communities (with a huge library of public datasets)
Data.gov.ie – Official Irish government data portal (Irish context!)
CSO.ie – Central Statistics Office Ireland (population, economy, society)
Data.gov – US government data (for global comparisons)
Google Dataset Search – Search engine specifically for datasets
Reddit r/datasets – Community-shared interesting datasets

What Makes a Good Dataset for Visualization:

Size: At least 50-100 rows (enough to find patterns)
Clean format: CSV files are easiest to work with
Interesting variables: Multiple columns you can compare
Recent data: 2020-2024 is ideal for relevance
Clear documentation: You understand what each column means
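
The size and column criteria can be checked with a few lines of Python before you commit to a dataset. The helper below is a hypothetical sketch (the function name and the 50-row cutoff are assumptions based on the list above). It is shown on a small in-memory CSV so it runs without a download.

```python
import csv
import io


def summarize_dataset(csv_file, min_rows=50):
    """Return row count, column names, and whether the dataset is big enough."""
    reader = csv.DictReader(csv_file)
    rows = list(reader)
    columns = reader.fieldnames or []
    return {
        "rows": len(rows),
        "columns": list(columns),
        "big_enough": len(rows) >= min_rows,
    }


# Made-up sample; with a real file, pass open('your_file.csv', encoding='utf-8')
sample = io.StringIO("name,score\nA,1\nB,2\n")
summary = summarize_dataset(sample)
print(summary)
```

A two-row sample fails the size check, which is the point: the same call on a real download tells you quickly whether it is worth exploring further.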

Quick Dataset Discovery Demo

Irish Data Sources:

data.gov.ie: Search “population by county”, “transport dublin”, “weather stations”
cso.ie: Look for “education statistics”, “housing data”, “employment figures”

Global Sources:

Kaggle: Search “spotify”, “netflix”, “fifa players”, “food ratings”
Data.gov: Search for interesting US datasets to compare with Irish data

Challenge 1: The Numbers Hunter

Challenge Overview:

In this first challenge, you’ll become a “Numbers Hunter” – learning to find and analyze datasets rich with numerical data. This skill is fundamental because statistical analysis is the backbone of data science.

What You’ll Learn:

How to identify high-quality numerical datasets
Basic statistical analysis (averages, ranges, maximums)
Data loading and error handling techniques
How to extract meaningful insights from numbers

Building Toward Presentation:

The statistics you calculate here will become key talking points in your final data story presentation at the end of class today. You’ll be able to say things like “I discovered that the average temperature in Dublin has increased by 2.3 degrees over the past decade” or “The highest-scoring player had 50% more points than the league average.”

Your Mission:

Find and master a dataset that’s PACKED with numbers – perfect for calculating statistics and finding patterns.

What You’re Looking For:

Sports data: Player statistics, team scores, championship results
Financial data: Stock prices, cryptocurrency values, economic indicators
Weather data: Temperature records, rainfall, climate measurements
Health data: Nutrition facts, fitness tracking, medical statistics

Irish Options: CSO.ie weather data, Dublin transport passenger numbers, county population figures
Global Options: Kaggle sports datasets, financial market data, NASA climate data

Your Dataset Must Have:

At least 100 rows of data
Multiple numeric columns (ratings, scores, prices, measurements)
Data you can calculate averages, maximums, and trends from

Step 1: Find and Download Your Numbers Dataset

Go to one of the data sources and find a dataset with lots of numbers. Download the CSV file.

Step 2: Load and Analyze Your Numbers

print("CHALLENGE 1: NUMBERS HUNTER")
import csv

# Load your numbers-heavy dataset
dataset = []
filename = 'your_numbers_dataset.csv'  # Replace with your file

try:
    with open(filename, 'r', encoding='utf-8') as file:
        reader = csv.DictReader(file)
        dataset = [row for row in reader]
    print(f"Loaded {len(dataset)} records!")
except Exception as e:
    print(f"Error: {e}")

# Analyze your numeric columns
if dataset:
    print(f"Columns in your dataset: {list(dataset[0].keys())}")
    
    # Pick a numeric column and calculate basic statistics
    numeric_column = 'REPLACE_WITH_YOUR_NUMERIC_COLUMN'  # e.g., 'score', 'price', 'temperature'
    
    values = []
    for record in dataset:
        try:
            value = float(record[numeric_column])
            values.append(value)
        except (KeyError, TypeError, ValueError):
            pass  # Skip missing or non-numeric values
    
    if values:
        average = sum(values) / len(values)
        maximum = max(values)
        minimum = min(values)
        
        print(f"{numeric_column} Statistics:")
        print(f"   Average: {average:.2f}")
        print(f"   Highest: {maximum}")
        print(f"   Lowest: {minimum}")
        print(f"   Range: {maximum - minimum}")

Success Goal:

Calculate at least 3 meaningful statistics from your numeric data
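
If you want to go beyond average, highest, and lowest, Python's built-in statistics module offers a few more. A minimal sketch on made-up values:

```python
import statistics

# Made-up values for illustration; use the `values` list from your own dataset
values = [12.0, 15.0, 11.0, 30.0, 14.0]

median = statistics.median(values)   # Middle value when sorted
mean = statistics.mean(values)       # Same as sum(values) / len(values)
spread = statistics.stdev(values)    # Sample standard deviation (needs 2+ values)

print(f"Median: {median}")
print(f"Mean: {mean:.2f}")
print(f"Standard deviation: {spread:.2f}")
```

Here the single large value (30.0) pulls the mean above the median, which is a useful talking point when your data contains outliers.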

Why This Matters:

Numbers are the foundation of data science – you need to be comfortable with statistical analysis

Challenge 2: The Category Detective

Challenge Overview:

Now you’ll become a “Category Detective” – learning to analyze data with groups and classifications. This is crucial for understanding market segments, user demographics, and comparative analysis.

What You’ll Learn:

How to identify and work with categorical data
Grouping and counting techniques
Distribution analysis (understanding how data is spread across categories)
Finding the most and least common groups in your data

Building Toward Presentation:

Your category analysis will provide compelling talking points for your presentation. You’ll be able to share discoveries like “Comedy movies made up 35% of all films released, but only 15% were profitable” or “Dublin has 3x more cafes than any other Irish city.”

Your Mission:

Find and analyze a dataset with rich categorical data – perfect for understanding groups and classifications.

What You’re Looking For:

Entertainment data: Movies by genre, music by artist, books by category
Geographic data: Cities by country, regions by population, locations by type
Product data: Items by brand, foods by cuisine, apps by category
Social data: Posts by platform, users by demographics, content by type

Irish Options: Dublin businesses by type, Irish schools by county, transport routes by mode
Global Options: Netflix shows by genre, Spotify songs by artist, product reviews by brand

Your Dataset Must Have:

Clear categorical columns (genre, brand, country, type, etc.)
Multiple categories to compare (at least 5-10 different groups)
Enough data in each category to make comparisons meaningful

Step 1: Find and Download Your Categories Dataset

Look for a dataset with lots of different categories or groups to analyze.

Step 2: Load and Analyze Your Categories

print("CHALLENGE 2: CATEGORY DETECTIVE")
import csv

# Load your category-rich dataset
dataset = []
filename = 'your_categories_dataset.csv'  # Replace with your file

try:
    with open(filename, 'r', encoding='utf-8') as file:
        reader = csv.DictReader(file)
        dataset = [row for row in reader]
    print(f"Loaded {len(dataset)} records!")
except Exception as e:
    print(f"Error: {e}")

# Analyze your categorical data
if dataset:
    print(f"Columns available: {list(dataset[0].keys())}")
    
    # Pick a categorical column and count the categories
    category_column = 'REPLACE_WITH_YOUR_CATEGORY_COLUMN'  # e.g., 'genre', 'brand', 'country'
    
    category_counts = {}
    for record in dataset:
        # Treat missing or blank values as 'Unknown' (missing cells can come back as None)
        category = (record.get(category_column) or 'Unknown').strip()
        if category in category_counts:
            category_counts[category] += 1
        else:
            category_counts[category] = 1
    
    # Sort categories by popularity
    sorted_categories = sorted(category_counts.items(), key=lambda x: x[1], reverse=True)
    
    print(f"{category_column} Analysis:")
    print(f"   Total categories found: {len(category_counts)}")
    print(f"   Top 5 categories:")
    for i, (category, count) in enumerate(sorted_categories[:5]):
        print(f"      {i+1}. {category}: {count} items")
    
    # Calculate category distribution
    total_items = sum(category_counts.values())
    top_category_percentage = (sorted_categories[0][1] / total_items) * 100
    print(f"   Most common category represents {top_category_percentage:.1f}% of data")

Success Goal:

Identify the top 5 categories and understand the distribution in your data
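
Distribution means more than the top category. As a sketch with made-up counts, the share of every category can be computed from the same kind of category_counts dictionary:

```python
# Made-up counts for illustration; use your own category_counts dictionary
category_counts = {"Comedy": 35, "Drama": 40, "Horror": 15, "Documentary": 10}

total_items = sum(category_counts.values())

# Percentage share of each category
category_shares = {
    category: count / total_items * 100
    for category, count in category_counts.items()
}

# Print shares from largest to smallest
for category, share in sorted(category_shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{category}: {share:.1f}%")
```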

Why This Matters:

Categories help us group and compare data – essential for business intelligence

Challenge 3: The Time Traveler

Challenge Overview:

Your final pre-break challenge makes you a “Time Traveler” – mastering the analysis of data that changes over time. This is essential for understanding trends, making forecasts, and spotting patterns that develop over months or years.

What You’ll Learn:

How to work with time-series and historical data
Trend calculation and analysis
Percentage change calculations over time
Identifying whether things are improving, declining, or staying stable

Building Toward Presentation:

Time-based insights often provide the most dramatic presentation moments. You’ll be able to reveal trends like “Irish housing prices increased 45% over 5 years” or “The popularity of rock music has declined 60% since 2010, while hip-hop increased 200%.”

Your Mission:

Find and explore a dataset with time-based data – perfect for spotting trends and changes over time.

What You’re Looking For:

Historical data: Population changes, economic trends, climate over years
Performance data: Sports results over seasons, stock prices over months
Usage data: Website traffic over time, app downloads by date, sales by quarter
Event data: Movie releases by year, song charts over decades

Irish Options: Irish population by year, Dublin weather historical data, housing prices over time
Global Options: Olympic records over decades, box office trends, technology adoption rates

Your Dataset Must Have:

Date or time columns (years, months, specific dates)
Data spanning multiple time periods (at least 5+ years or time points)
Values you can track changes in over time

Step 1: Find and Download Your Time Dataset

Search for historical data that shows changes over time.

Step 2: Load and Analyze Your Time Data

print("CHALLENGE 3: TIME TRAVELER")
import csv

# Load your time-based dataset
dataset = []
filename = 'your_time_dataset.csv'  # Replace with your file

try:
    with open(filename, 'r', encoding='utf-8') as file:
        reader = csv.DictReader(file)
        dataset = [row for row in reader]
    print(f"Loaded {len(dataset)} records!")
except Exception as e:
    print(f"Error: {e}")

# Analyze your time-based data
if dataset:
    print(f"Columns available: {list(dataset[0].keys())}")
    
    # Pick a time column and a value column to track over time
    time_column = 'REPLACE_WITH_TIME_COLUMN'  # e.g., 'year', 'date', 'month'
    value_column = 'REPLACE_WITH_VALUE_COLUMN'  # e.g., 'population', 'price', 'score'
    
    # Group data by time periods
    time_data = {}
    for record in dataset:
        time_period = record.get(time_column) or 'Unknown'
        try:
            value = float(record[value_column])
        except (KeyError, TypeError, ValueError):
            continue  # Skip missing or non-numeric values
        if time_period in time_data:
            time_data[time_period].append(value)
        else:
            time_data[time_period] = [value]
    
    # Calculate averages for each time period
    time_averages = {}
    for time_period, values in time_data.items():
        if values:
            time_averages[time_period] = sum(values) / len(values)
    
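    # Sort by time period and show trends
    # Note: time periods are text, so they sort alphabetically ('10' comes before '9').
    # This works best with four-digit years or dates written as YYYY-MM-DD.
    sorted_times = sorted(time_averages.items())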
    
    print(f"{value_column} over {time_column}:")
    for time_period, avg_value in sorted_times[:10]:  # Show first 10 time periods
        print(f"   {time_period}: {avg_value:.2f}")
    
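    # Calculate trend (skipped when the starting value is 0 to avoid dividing by zero)
    if len(sorted_times) >= 2 and sorted_times[0][1] != 0:
        first_value = sorted_times[0][1]
        last_value = sorted_times[-1][1]
        change = ((last_value - first_value) / first_value) * 100
        
        if change == 0:
            print(f"   Trend: {value_column} stayed the same over time")
        else:
            trend = "increased" if change > 0 else "decreased"
            print(f"   Trend: {value_column} {trend} by {abs(change):.1f}% over time")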

Success Goal:

Identify a clear trend in your data over time (increasing, decreasing, or cyclical)
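
The template compares only the first and last periods. To see whether a trend is steady or cyclical, it can help to look at the change between each pair of neighbouring periods. A minimal sketch on made-up yearly averages:

```python
# Made-up (year, average) pairs, already sorted by year
sorted_times = [("2019", 100.0), ("2020", 110.0), ("2021", 99.0), ("2022", 118.8)]

period_changes = []
# zip() pairs each period with the one that follows it
for (prev_period, prev_value), (period, value) in zip(sorted_times, sorted_times[1:]):
    if prev_value == 0:
        continue  # Percentage change is undefined when starting from 0
    change = (value - prev_value) / prev_value * 100
    period_changes.append((period, change))
    print(f"{prev_period} -> {period}: {change:+.1f}%")
```

Changes that keep the same sign suggest a steady trend, while alternating signs (as in this made-up sample) hint at an up-and-down pattern.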

Why This Matters:

Time-based analysis reveals trends and helps predict future patterns

What You’ve Built

Skills Developed:

Numbers Skills: Statistical analysis and numeric data handling
Category Skills: Grouping and classification analysis
Time Skills: Trend analysis and temporal data understanding
Three Different Datasets: You now have diverse, real-world data ready for visualization!

Session 2: Bringing Data to Life with Visualizations

Concept: From CSV to Professional Charts

The Visualization Workflow:

Load your dataset – Read the CSV file into Python
Explore the data – Understand what you have
Clean if needed – Handle missing values or formatting issues
Choose chart types – Bar charts, scatter plots, line charts based on your data
Create visualizations – Use matplotlib to make professional charts
Tell the story – What insights does your data reveal?
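
The "clean if needed" step is the easiest one to skip. As a hypothetical sketch (the helper name and the choice to drop incomplete rows are assumptions, and sometimes filling in a default value is the better choice), this is one way to clean a numeric column before charting it:

```python
def clean_numeric_column(rows, column):
    """Return rows whose `column` converts to a number, with the value converted."""
    cleaned = []
    for row in rows:
        raw = row.get(column)
        if raw is None:
            continue  # Column missing from this row
        text = str(raw).strip().replace(",", "")  # "1,200" -> "1200"
        try:
            value = float(text)
        except ValueError:
            continue  # Blank or non-numeric, e.g. "" or "N/A"
        cleaned.append({**row, column: value})
    return cleaned


# Made-up rows showing typical problems
rows = [
    {"title": "A", "rating": "7.5"},
    {"title": "B", "rating": ""},
    {"title": "C", "rating": "N/A"},
    {"title": "D", "rating": " 1,200 "},
]
cleaned_rows = clean_numeric_column(rows, "rating")
print(cleaned_rows)
```

It is worth printing how many rows were dropped, so you can mention it when you present your results.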

Chart Types for Different Data:

Bar Charts: Comparing categories (best movies by genre, top artists by streams)
Scatter Plots: Finding relationships (budget vs box office, calories vs price)
Line Charts: Showing trends over time (music popularity by year, sports performance)
Pie Charts: Showing proportions (market share, category breakdowns)
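
For time-based data, a line chart is usually a better fit than a bar chart. Here is a minimal sketch on made-up yearly values. The helper names are hypothetical, and matplotlib is imported inside the drawing function so the data preparation can be run on its own.

```python
def prepare_line_points(rows, time_column, value_column):
    """Convert text rows into sorted x and y lists for a line chart."""
    points = []
    for row in rows:
        try:
            points.append((int(row[time_column]), float(row[value_column])))
        except (KeyError, TypeError, ValueError):
            continue  # Skip rows with missing or non-numeric values
    points.sort()  # Order by time so the line runs left to right
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return xs, ys


def draw_line_chart(xs, ys, title, x_label, y_label):
    """Draw a simple line chart with a marker at each data point."""
    import matplotlib.pyplot as plt  # Imported here so the helper above needs no extras

    plt.figure(figsize=(10, 5))
    plt.plot(xs, ys, marker="o")
    plt.title(title)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.tight_layout()
    plt.show()


# Made-up rows for illustration, deliberately out of order
rows = [
    {"year": "2022", "value": "30"},
    {"year": "2020", "value": "10"},
    {"year": "2021", "value": "oops"},
    {"year": "2023", "value": "25"},
]
xs, ys = prepare_line_points(rows, "year", "value")
print(xs, ys)
```

To display the chart, call draw_line_chart(xs, ys, "Value by year", "Year", "Value") once matplotlib is installed.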

Code Example: Template for Any Dataset:

print("REAL DATASET VISUALIZATION PROJECT")
import matplotlib.pyplot as plt
import csv

# STEP 1: Load your downloaded dataset
print("Loading your real dataset...")
dataset = []

# Replace 'your_dataset.csv' with your actual filename
with open('your_dataset.csv', 'r', encoding='utf-8') as file:  # encoding helps with special characters
    reader = csv.DictReader(file)
    for row in reader:
        dataset.append(row)

print(f"Loaded {len(dataset)} records from your dataset!")

# STEP 2: Explore what you have
print("Dataset exploration:")
if len(dataset) > 0:
    first_record = dataset[0]
    print(f"Available columns: {list(first_record.keys())}")
    print(f"First record: {first_record}")

# STEP 3: Data preparation example
print("Preparing data for visualization...")

# Convert text numbers to actual numbers for calculations
for record in dataset:
    # Identify which columns have numbers and convert them
    # Example for a movies dataset:
    # if 'rating' in record and record['rating']:
    #     record['rating'] = float(record['rating'])
    # if 'box_office' in record and record['box_office']:
    #     record['box_office'] = int(record['box_office'])
    pass

print("Data preparation complete!")

# STEP 4: Create your first visualization
print("Creating visualization 1...")

# Example: Bar chart of top categories
categories = {}  # Dictionary to count items by category

for record in dataset:
    # Replace 'genre' with an appropriate column from your dataset
    category = record.get('genre', 'Unknown')  # Use .get() to handle missing values
    
    if category in categories:
        categories[category] += 1
    else:
        categories[category] = 1

# Create bar chart
if categories:
    category_names = list(categories.keys())
    category_counts = list(categories.values())
    
    plt.figure(figsize=(12, 6))
    bars = plt.bar(category_names, category_counts, color=plt.cm.Set3(range(len(category_names))))
    
    # Customize with your dataset's specifics
    plt.title('Distribution by Category (Your Real Data)', fontsize=16, fontweight='bold')
    plt.xlabel('Category')
    plt.ylabel('Count')
    plt.xticks(rotation=45, ha='right')
    
    # Add value labels on bars
    for bar, count in zip(bars, category_counts):
        plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
                str(count), ha='center', fontweight='bold')
    
    plt.tight_layout()
    plt.show()

print("First visualization complete!")

Challenge 4: Create Your First Real Data Visualization

Challenge Overview:

Now you’ll transform your data discoveries into professional visualizations. This challenge focuses on creating your first polished chart that could appear in a news article or business presentation.

What You’ll Learn:

How to create publication-quality bar charts
Professional chart formatting and labeling
Data storytelling through visualization
How to highlight the most important insights in your data

Building Toward Presentation:

This visualization will be the centerpiece of your data story presentation. You’ll show your chart to the class and explain what it reveals. The goal is to create something so clear and compelling that anyone can understand your discovery immediately.

Your Mission:

Load your downloaded dataset and create a professional bar chart showing the most interesting pattern you can find!

print("MY REAL DATASET ANALYSIS")
import matplotlib.pyplot as plt
import csv

# STEP 1 - Load your specific dataset
dataset = []
# Replace 'YOUR_FILENAME.csv' with your actual downloaded file
filename = 'YOUR_FILENAME.csv'

try:
    with open(filename, 'r', encoding='utf-8') as file:  # encoding helps with special characters
        reader = csv.DictReader(file)
        for row in reader:
            dataset.append(row)
    print(f"SUCCESS! Loaded {len(dataset)} records")
except FileNotFoundError:
    print(f"ERROR: Could not find {filename}")
    print("Make sure your CSV file is in the same folder as this code!")
except Exception as e:
    print(f"ERROR loading data: {e}")

# STEP 2 - Explore your data
if len(dataset) > 0:
    print(f"Your dataset has these columns: {list(dataset[0].keys())}")
    print(f"First few records:")
    for i in range(min(3, len(dataset))):  # Show first 3 records
        print(f"Record {i+1}: {dataset[i]}")

# STEP 3 - Choose what to visualize
# Look at your columns and pick something interesting to count or compare
# Examples:
# - If you have a movies dataset: count by genre, decade, rating range
# - If you have music data: count by artist, year, genre
# - If you have sports data: count by team, position, year

print(f"Creating visualization...")

# STEP 4 - Count items by your chosen category
category_counts = {}
category_column = 'REPLACE_WITH_YOUR_COLUMN_NAME'  # e.g., 'genre', 'artist', 'team'

for record in dataset:
    category = record.get(category_column, 'Unknown')
    
    # Add counting logic
    if category in category_counts:
        category_counts[category] += 1
    else:
        category_counts[category] = 1

# STEP 5 - Create professional bar chart
if len(category_counts) > 0:
    # Sort by count to show most popular first
    sorted_categories = sorted(category_counts.items(), key=lambda x: x[1], reverse=True)
    
    # Take top 10 to avoid overcrowded chart
    top_categories = sorted_categories[:10]
    
    names = [item[0] for item in top_categories]
    counts = [item[1] for item in top_categories]
    
    plt.figure(figsize=(14, 8))
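    # Spread bar colors across the whole viridis colormap (positions from 0.0 to 1.0)
    color_positions = [i / max(len(names) - 1, 1) for i in range(len(names))]
    bars = plt.bar(names, counts, color=plt.cm.viridis(color_positions))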
    
    # Customize with your specific data
    plt.title('YOUR_CHART_TITLE_HERE', fontsize=16, fontweight='bold')
    plt.xlabel('YOUR_X_LABEL')
    plt.ylabel('YOUR_Y_LABEL')
    plt.xticks(rotation=45, ha='right')
    
    # Add value labels
    for bar, count in zip(bars, counts):
        plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + max(counts)*0.01,
                str(count), ha='center', fontweight='bold')
    
    plt.tight_layout()
    plt.show()
    
    # STEP 6 - Generate insights
    print(f"INSIGHTS FROM YOUR REAL DATA:")
    print(f"Most common {category_column}: {names[0]} ({counts[0]} items)")
    print(f"Total categories found: {len(category_counts)}")
    print(f"Top categories shown range from {min(counts)} to {max(counts)} items each")

else:
    print("No data to visualize. Check your column name and data!")

Hints for Common Dataset Types

# For movie datasets:
category_column = 'genre'  # or 'director', 'year', 'rating_category'

# For music datasets:
category_column = 'artist'  # or 'genre', 'year', 'decade'

# For sports datasets:
category_column = 'team'  # or 'position', 'league', 'country'

# For food datasets:
category_column = 'restaurant'  # or 'cuisine_type', 'price_range'
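
One thing worth checking before you chart: csv.DictReader stores an empty cell as an empty string, so the 'Unknown' default in Step 4 only applies when the column itself is missing. The sketch below shows one possible way to group blank cells under "Unknown", using Counter from Python's standard library in place of the if/else counting. The records are made up for illustration and do not come from any real dataset.

```python
from collections import Counter

# Hypothetical records, shaped like rows from csv.DictReader
sample_records = [
    {"genre": "Comedy"},
    {"genre": "Drama"},
    {"genre": "Comedy"},
    {"genre": ""},        # a blank cell in the CSV file
    {"genre": "Action"},
    {"genre": "Comedy"},
    {"genre": "Drama"},
]
category_column = "genre"

category_counts = Counter()
for record in sample_records:
    # Treat a missing column or a blank cell the same way
    category = (record.get(category_column) or "").strip()
    if not category:
        category = "Unknown"
    category_counts[category] += 1

# most_common(n) returns the n largest counts as (name, count) pairs
top_two = category_counts.most_common(2)
print("Top two categories:", top_two)
print("Blank cells counted as Unknown:", category_counts["Unknown"])
```

If your chart shows one giant "Unknown" bar, the column name is probably misspelled; compare it with the column list printed in Step 2.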

Challenge 5: Multi-Chart Data Story

Challenge Overview:

In this advanced challenge, you’ll create multiple complementary visualizations that tell a complete data story. This mirrors how professional data scientists and journalists combine different chart types to build compelling narratives.

What You’ll Learn:

How to create different types of charts (bar charts, histograms, scatter plots)
Combining multiple visualizations to tell a story
Advanced data analysis techniques
How to synthesize insights across different views of the same data

Preparing Your Final Presentation:

As you create each chart, think about how they connect to tell a larger story. Your 2-minute presentation will showcase 2-3 of your best visualizations and the key insights they reveal. Consider:

What’s the main message you want your audience to remember?
Which charts best support that message?
What surprising or interesting pattern did you discover?

Your Mission:

Create 2-3 different visualizations from your dataset to tell a complete data story!

print("COMPLETE DATA STORY PROJECT")
print("Creating multiple visualizations from your real dataset...")

# Use your dataset from Challenge 4
# Now we'll create multiple charts to tell a complete story

# CHART 1: Category distribution (you already have this!)
print("Chart 1: Category Analysis")
# (Use your code from Challenge 4)

# CHART 2: Numerical analysis (if you have numeric columns)
print("Chart 2: Numerical Insights")

# Find a numeric column in your dataset
numeric_column = 'REPLACE_WITH_NUMERIC_COLUMN'  # e.g., 'rating', 'price', 'score', 'views'

numeric_values = []
for record in dataset:
    try:
        # Convert text to number, skip if conversion fails
        value = float(record.get(numeric_column, 0))
        if value > 0:  # Keep positive values only (this also drops real zeros and negatives)
            numeric_values.append(value)
    except (ValueError, TypeError):
        pass  # Skip records that can't be converted

if len(numeric_values) > 0:
    # Create histogram showing distribution of values
    plt.figure(figsize=(10, 6))
    plt.hist(numeric_values, bins=20, color='skyblue', alpha=0.7, edgecolor='black')
    
    plt.title(f'Distribution of {numeric_column.title()} (Your Real Data)', 
              fontsize=16, fontweight='bold')
    plt.xlabel(numeric_column.title())
    plt.ylabel('Frequency')
    
    # Add statistics
    avg_value = sum(numeric_values) / len(numeric_values)
    max_value = max(numeric_values)
    min_value = min(numeric_values)
    
    plt.axvline(avg_value, color='red', linestyle='--', linewidth=2, 
                label=f'Average: {avg_value:.1f}')
    plt.legend()
    
    plt.tight_layout()
    plt.show()
    
    print(f"{numeric_column.title()} Statistics:")
    print(f"Average: {avg_value:.2f}")
    print(f"Highest: {max_value:.2f}")
    print(f"Lowest: {min_value:.2f}")

# CHART 3: Relationship analysis (if you have multiple numeric columns)
print("Chart 3: Relationship Analysis")

# Find two numeric columns to compare
x_column = 'REPLACE_WITH_X_COLUMN'  # e.g., 'budget', 'followers', 'age'
y_column = 'REPLACE_WITH_Y_COLUMN'  # e.g., 'revenue', 'likes', 'score'

x_values = []
y_values = []
labels = []

for record in dataset:
    try:
        x_val = float(record.get(x_column, 0))
        y_val = float(record.get(y_column, 0))
        if x_val > 0 and y_val > 0:
            x_values.append(x_val)
            y_values.append(y_val)
            # Use title or name for labels if available
            label = record.get('title', record.get('name', f'Item {len(labels)+1}'))
            labels.append(label)
    except (ValueError, TypeError):
        pass

if len(x_values) > 10:  # Need decent amount of data for scatter plot
    plt.figure(figsize=(12, 8))
    scatter = plt.scatter(x_values, y_values, alpha=0.6, s=60, c=y_values, cmap='viridis')
    
    plt.title(f'{y_column.title()} vs {x_column.title()} (Your Real Data)', 
              fontsize=16, fontweight='bold')
    plt.xlabel(x_column.title())
    plt.ylabel(y_column.title())
    
    # Add colorbar
    plt.colorbar(scatter, label=y_column.title())
    
    # Optionally label a few interesting points
    if len(labels) == len(x_values):
        # Label the highest and lowest points
        max_y_index = y_values.index(max(y_values))
        min_y_index = y_values.index(min(y_values))
        
        plt.annotate(labels[max_y_index], 
                    (x_values[max_y_index], y_values[max_y_index]),
                    xytext=(5, 5), textcoords='offset points', fontsize=9)
        plt.annotate(labels[min_y_index], 
                    (x_values[min_y_index], y_values[min_y_index]),
                    xytext=(5, 5), textcoords='offset points', fontsize=9)
    
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    print(f"Relationship Insights:")
    print(f"Analyzed {len(x_values)} data points")
    print(f"Highest {y_column}: {max(y_values):.2f}")
    print(f"Lowest {y_column}: {min(y_values):.2f}")

# FINAL DATA STORY SUMMARY
print("=" * 50)
print("YOUR COMPLETE DATA STORY")
print("=" * 50)

print(f"Dataset: {filename}")
print(f"Records analyzed: {len(dataset)}")
print(f"Key findings:")
print(f"• Most common category: [Your insight here]")
print(f"• Numerical trends: [Your insight here]") 
print(f"• Interesting relationships: [Your insight here]")
print(f"• Surprising discovery: [Your insight here]")

print(f"WHAT THIS DATA TELLS US:")
print(f"Based on your analysis, what story does this data tell?")
print(f"What decisions could someone make using these insights?")

Hints for Different Dataset Types

# For movie datasets:
numeric_column = 'rating'  # or 'box_office', 'runtime'
x_column = 'budget'
y_column = 'box_office'

# For music datasets:
numeric_column = 'popularity'  # or 'duration', 'year'
x_column = 'danceability'
y_column = 'popularity'

# For sports datasets:
numeric_column = 'points'  # or 'salary', 'age'
x_column = 'height'
y_column = 'points'

# For food datasets:
numeric_column = 'calories'  # or 'price', 'protein'
x_column = 'price'
y_column = 'rating'
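
A scatter plot shows a relationship visually; a correlation coefficient summarizes it as a single number between -1 and 1. The sketch below is a minimal hand-written version of the Pearson correlation, offered as an optional extra rather than a required step. The function name pearson_correlation and the demo numbers are assumptions for illustration; in your project you would pass in the x_values and y_values lists built for Chart 3.

```python
import math

def pearson_correlation(xs, ys):
    """Return the Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("Need two lists of the same length with at least 2 values")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # How much x and y move together, and how much each varies on its own
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)

    if spread_x == 0 or spread_y == 0:
        return 0.0  # A constant column has no measurable relationship

    return covariance / math.sqrt(spread_x * spread_y)

# Made-up demo values where y is always exactly double x
demo_x = [1, 2, 3, 4, 5]
demo_y = [2, 4, 6, 8, 10]
r = pearson_correlation(demo_x, demo_y)
print(f"Correlation: {r:.2f}")
```

Values near 1 mean the two columns rise together, values near -1 mean one falls as the other rises, and values near 0 mean no clear straight-line pattern. A strong correlation does not prove that one column causes the other.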

Wrap-Up & Data Story Presentations

Your 2-Minute Data Story Presentation:

Each student will present their most interesting discovery to the class. Your presentation should include:

Structure:

The Hook (15 seconds): “I discovered something surprising about [your topic]…”
The Data (30 seconds): Show your best chart and explain what it shows
The Insight (60 seconds): What does this mean? Why is it interesting or important?
The Impact (15 seconds): How could this information be used in the real world?

Presentation Tips:

Start with your most surprising or interesting finding
Make sure your chart is clearly visible and labeled
Speak to your audience, not to your screen
End with why this matters or what someone could do with this information

INSIGHT DAY 4

Day 4: Music Streaming Wars – Advanced Analytics & Predictions

Today’s Mission:

“By the end of today, you’ll work in teams to compete as music industry consultants. Each team will analyze a different genre of music and pitch their insights to judges representing a major record label looking for their next big investment!”

What Makes Today Different:

Yesterday you learned basic visualization. Today you’ll master advanced data science techniques: correlation analysis, predictive modeling, and statistical validation – the same methods used by Spotify, Apple Music, and record labels to make million-dollar decisions.

The Competition:

Teams will be assigned different music genres (Pop, Hip-Hop, Rock, Electronic, Country, R&B) and compete to present the most compelling insights about what makes songs successful in their genre. The winning team gets recognition as “Music Industry Data Science Champions!”

Session 1: Advanced Music Data Analysis

Concept Introduction: Computational Thinking in Music Analytics

Computational Thinking Practices We’ll Use Today:

1. Pattern Recognition

What we’re doing: Finding correlations between song features and success
Why it matters: Just like recognizing patterns in weather data helps predict storms, recognizing patterns in music helps predict hits
Real example: “Every summer hit has high danceability” vs “Every winter hit is more acoustic”

2. Decomposition

What we’re doing: Breaking down “song success” into measurable components (tempo, energy, danceability, etc.)
Why it matters: Complex problems become manageable when broken into smaller parts
Real example: Instead of asking “What makes a good song?” we ask “How does tempo affect streaming numbers?”

3. Abstraction

What we’re doing: Creating mathematical models that capture the essence of hit songs
Why it matters: Models help us focus on what’s important and ignore irrelevant details
Real example: A song is more than just notes – we abstract it to numerical features we can analyze

4. Algorithm Design

What we’re doing: Creating step-by-step processes to predict song success
Why it matters: Algorithms can be tested, improved, and scaled to analyze millions of songs
Real example: “If danceability > 0.7 AND energy > 0.6, then predict ‘likely hit’”
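
The Algorithm Design rule above can be sketched as a small function. This is only an illustration: the thresholds 0.7 and 0.6 come from the example rule, the function name predict_hit and the "uncertain" label are assumptions, and the three songs use the danceability and energy values from this handout's sample data preview.

```python
def predict_hit(danceability, energy):
    """Apply the example rule: both thresholds must be exceeded."""
    if danceability > 0.7 and energy > 0.6:
        return "likely hit"
    return "uncertain"

# Values copied from the handout's sample data preview
songs = [
    {"track_name": "Blinding Lights", "danceability": 0.514, "energy": 0.730},
    {"track_name": "Good 4 U", "danceability": 0.563, "energy": 0.664},
    {"track_name": "Industry Baby", "danceability": 0.844, "energy": 0.704},
]

predictions = {}
for song in songs:
    label = predict_hit(song["danceability"], song["energy"])
    predictions[song["track_name"]] = label
    print(f"{song['track_name']}: {label}")
```

Only one of the three songs passes the rule, even though all three have high popularity scores in the preview. That gap is exactly why an algorithm needs to be tested against real data and improved.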

Today’s Music Dataset

Dataset Overview:

114,000 songs from Spotify’s database – real industry data!
Key columns for analysis: track_name, artists, track_genre, popularity, danceability, energy
Multiple genres available: Check what genres exist in the data
Real popularity scores: 0-100 scale used by Spotify

Sample Data Preview:

track_name,artists,track_genre,popularity,danceability,energy
"Blinding Lights","The Weeknd","pop",95,0.514,0.730
"Good 4 U","Olivia Rodrigo","pop",88,0.563,0.664
"Industry Baby","Lil Nas X","hip-hop",92,0.844,0.704

Important: The column names are track_name, artists, and track_genre (not song_name, artist, genre)
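
To see what one row looks like once Python has read it, the sketch below parses the sample preview above from a text string instead of a file. Using io.StringIO this way is just a convenience for the demonstration; the real challenge reads from the downloaded CSV file.

```python
import csv
import io

# The sample preview from this handout, held in a string instead of a file
preview_text = '''track_name,artists,track_genre,popularity,danceability,energy
"Blinding Lights","The Weeknd","pop",95,0.514,0.730
"Good 4 U","Olivia Rodrigo","pop",88,0.563,0.664
"Industry Baby","Lil Nas X","hip-hop",92,0.844,0.704
'''

preview_rows = list(csv.DictReader(io.StringIO(preview_text)))
first_song = preview_rows[0]

print("Columns:", list(first_song.keys()))
print("Track:", first_song["track_name"], "by", first_song["artists"])
# Every value arrives as text, even the ones that look like numbers
print("Popularity value:", first_song["popularity"])
print("Popularity type:", type(first_song["popularity"]).__name__)
```

Each row is a dictionary keyed by column name, and every value is a string until you convert it.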

Challenge 1: Load and Explore Music Data

Challenge Overview: Get familiar with the music dataset using simple data exploration. This builds on your CSV skills from Day 2 but with music data that’s much easier to work with.

Download the Spotify dataset here!

Learning Goals:

Practice CSV loading with music industry context
Calculate basic statistics on music features
Understand the three key success metrics

Code Purpose: This is a template for loading data and exploring what’s in the dataset. You need to complete the missing parts to make it work with the real Spotify dataset.

# MUSIC DATA EXPLORATION CHALLENGE
print("MUSIC DATA EXPLORER")
import csv

# Step 1: Load the music data
music_data = []

with open('dataset.csv', 'r', encoding='utf-8') as file:  # encoding helps with special characters in song names
    reader = csv.DictReader(file)
    for song in reader:
        music_data.append(song)

print(f"Loaded {len(music_data)} songs")

# Step 2: Discover what genres exist in the dataset
print("\nDiscovering available genres...")
# TODO: Create a list of all unique genres in the dataset
# HINT: Look at the 'track_genre' column for each song
# HINT: Use a list to collect unique genres (avoid duplicates)

unique_genres = []
for song in music_data:
    genre = song['track_genre']
    # TODO: Add logic to only add genre if it's not already in the list
    # Your code here

print(f"Found {len(unique_genres)} different genres:")
for genre in unique_genres[:10]:  # Show first 10 genres
    print(f"  - {genre}")

# Step 3: Look at a few songs to understand the data structure
print("\nSample songs from dataset:")
for i in range(3):
    song = music_data[i]
    # TODO: Print the track name, artist, genre, and popularity
    # HINT: Use song['track_name'], song['artists'], etc.
    # Your code here

# Step 4: Convert text numbers to real numbers for analysis
print("\nConverting data for analysis...")
conversion_errors = 0

for song in music_data:
    try:
        # TODO: Convert the text values to numbers
        # HINT: popularity should be int(), danceability and energy should be float()
        # Your code here
        pass
    except ValueError:
        conversion_errors += 1

print(f"Data conversion complete! ({conversion_errors} errors found)")

TODO for Students:

Complete the genre discovery: Make the code find all unique genres without duplicates
Fix the data display: Make it show track names, artists, and other info properly
Implement data conversion: Convert the string numbers to integers and floats
Test your code: Run it and see what genres are available in the dataset

To run the movie example code below, download the dataset here: Movie Data

Example: Movie Data Exploration
What this example shows: How to load and explore a different dataset (movies) using the same techniques. This demonstrates the same concepts you just practiced, but with movie data instead of music.
Key learning: The same data exploration approach works for any CSV dataset – just change the column names!

# MOVIE DATA EXPLORATION EXAMPLE
print("MOVIE DATA EXPLORER")
import csv

# Load movie data (same process as music data)
movie_data = []
with open('tmdb_5000_movies.csv', 'r', encoding='utf-8') as file:
    reader = csv.DictReader(file)
    for movie in reader:
        movie_data.append(movie)

print(f"Loaded {len(movie_data)} movies")

# Discover what languages are in the movie dataset
unique_languages = []
for movie in movie_data:
    language = movie['original_language']
    if language not in unique_languages:
        unique_languages.append(language)

print(f"Found {len(unique_languages)} different languages:")
print("First languages found:", unique_languages[:8])

# Look at sample movies
print("\nSample movies:")
for i in range(3):
    movie = movie_data[i]
    print(f"Title: {movie['title']}")
    print(f"Language: {movie['original_language']}")
    print(f"Popularity: {movie['popularity']}")
    print(f"Vote Average: {movie['vote_average']}")
    print()

# Convert movie data for analysis
for movie in movie_data:
    try:
        movie['popularity'] = float(movie['popularity'])
        movie['vote_average'] = float(movie['vote_average'])
        movie['vote_count'] = int(movie['vote_count'])
    except ValueError:
        pass  # If a value can't be converted, leave this movie's data as text

print("Movie data ready for analysis!")
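
Why does the conversion step matter? Python compares text character by character, so numbers stored as text can sort in a surprising order. The short sketch below uses three made-up scores to show the difference.

```python
# How values look straight from a CSV file: text, not numbers
text_scores = ["9", "85", "100"]

# The same values after conversion
number_scores = [int(score) for score in text_scores]

text_max = max(text_scores)      # compares the first characters: "9" beats "8" and "1"
number_max = max(number_scores)  # compares real numeric values

print("Largest as text:", text_max)
print("Largest as numbers:", number_max)
```

As text, "9" counts as the largest value because its first character comes last; as numbers, 100 is correctly the largest. This is why max() and sorting should only be used after converting.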

Challenge 2: Find the Most Popular Songs

Challenge Overview: Use simple list operations to find the most popular songs overall. This reinforces Day 1 concepts (max, sorting) with engaging music data.

Learning Goals:

Apply max() and sorting to real music data
Practice finding patterns in datasets
Build confidence with working code

Code Purpose: This shows you the structure for finding top songs, but you need to implement the actual logic. This builds on Day 1 concepts but requires you to think through the steps.

# FIND THE BIGGEST HITS CHALLENGE
print("FINDING THE MOST POPULAR SONGS")

# Step 1: Extract all popularity scores
# TODO: Create a list of all popularity scores from the dataset
# HINT: Loop through music_data and get song['popularity'] for each song

all_popularity = []
# Your code here

# Step 2: Find the highest popularity score
# TODO: Use a function from Day 1 to find the maximum value
# HINT: What function finds the biggest number in a list?

highest_popularity = 0  # Replace 0 with your code
print(f"Highest popularity score: {highest_popularity}")

# Step 3: Find which song has that highest score
# TODO: Loop through the data to find the song with the highest popularity
# HINT: Use an if statement to check if song['popularity'] equals highest_popularity

print("Most popular song:")
for song in music_data:
    # Your code here - check if this song has the highest popularity
    # If it does, print the track name and artist
    pass  # Remove this line once you add your code

# Step 4: Find top 5 most popular songs (This is the tricky part!)
print("\nTOP 5 MOST POPULAR SONGS:")

# TODO: Create a way to find the top 5 songs
# STRATEGY 1: Sort all popularity scores, get top 5, then find those songs
# STRATEGY 2: Find highest, remove it, find next highest, repeat 5 times
# STRATEGY 3: Create a list of (popularity, song_name) pairs and sort

# Your approach here - this is the real challenge!

TODO for Students:

Complete the popularity extraction: Fill the all_popularity list
Find the maximum: Use the right function to find the highest score
Locate the top song: Write the if statement to find and print the #1 song
Design the top 5 algorithm: This is the real challenge – figure out how to get the top 5 songs
Test with different features: Once working, try finding most danceable or energetic songs

Example: Finding Top Movies
What this example shows: The exact same logic for finding top items, but applied to movies instead of songs. Notice how the structure is identical – only the data and column names change.
Key learning: Once you understand the pattern for finding “top” items, you can apply it to any dataset!

# FINDING TOP MOVIES EXAMPLE
print("FINDING THE MOST POPULAR MOVIES")

# Step 1: Extract all popularity scores (same concept as music)
all_movie_popularity = []
for movie in movie_data:
    all_movie_popularity.append(movie['popularity'])

# Step 2: Find highest popularity
highest_movie_popularity = max(all_movie_popularity)
print(f"Highest movie popularity: {highest_movie_popularity}")

# Step 3: Find which movie has that score
for movie in movie_data:
    if movie['popularity'] == highest_movie_popularity:
        print(f"Most popular movie: {movie['title']}")
        break

# Step 4: Top 5 movies using the sorting approach
print("\nTOP 5 MOST POPULAR MOVIES:")

# Create popularity list and sort it
movie_popularity_scores = []
for movie in movie_data:
    movie_popularity_scores.append(movie['popularity'])

movie_popularity_scores.sort(reverse=True)
top_5_movie_scores = movie_popularity_scores[:5]

# Find movies with these scores
for score in top_5_movie_scores:
    for movie in movie_data:
        if movie['popularity'] == score:
            print(f"{movie['title']} - Popularity: {score}")
            break

# Bonus: Find highest rated movies (different column)
print("\nHIGHEST RATED MOVIE:")
all_ratings = []
for movie in movie_data:
    all_ratings.append(movie['vote_average'])

highest_rating = max(all_ratings)
for movie in movie_data:
    if movie['vote_average'] == highest_rating:
        print(f"Highest rated: {movie['title']} (Rating: {highest_rating})")
        break
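
The score-lookup loop in Step 4 can print the same title twice when two movies share a score. As an alternative, here is a minimal sketch that sorts the records themselves, so ties are handled and the same helper works for any numeric column. The function name top_n and the four sample movies are assumptions for illustration, not part of the real dataset.

```python
def top_n(records, column, n=5):
    """Return the n records with the largest value in the given column."""
    ranked = sorted(records, key=lambda record: record[column], reverse=True)
    return ranked[:n]

# Made-up records; values are already converted to numbers
sample_movies = [
    {"title": "Movie A", "popularity": 40.0, "vote_average": 7.1},
    {"title": "Movie B", "popularity": 75.5, "vote_average": 6.0},
    {"title": "Movie C", "popularity": 75.5, "vote_average": 8.2},
    {"title": "Movie D", "popularity": 12.3, "vote_average": 5.5},
]

top_3 = top_n(sample_movies, "popularity", n=3)
top_3_titles = [movie["title"] for movie in top_3]
for movie in top_3:
    print(f"{movie['title']} - Popularity: {movie['popularity']}")

# The same helper works for a different column
best_rated_title = top_n(sample_movies, "vote_average", n=1)[0]["title"]
print("Highest rated:", best_rated_title)
```

Movie B and Movie C tie on popularity and both appear once, in their original order, because Python's sort keeps tied items in the order it found them.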

Challenge 3: Simple Genre Comparison

Challenge Overview: Compare different music genres using basic counting and averaging. This prepares you for the team competition where you’ll need to analyze your specific genre.

Learning Goals:

Group data by categories (genres)
Calculate averages for different groups
Make simple comparisons between categories

Code Purpose: This template shows the structure for analyzing one genre, but you need to implement all the calculation logic. This is good practice for the team competition.

# GENRE ANALYSIS CHALLENGE
print("ANALYZING A SPECIFIC GENRE")

# Step 1: Choose a genre to analyze (pick from the list you discovered above)
genre_to_analyze = "pop"  # You can change this to any genre from your discovery

# Step 2: Filter songs by genre
# TODO: Create a list containing only songs from your chosen genre
# HINT: Use an if statement to check if song['track_genre'] matches your genre

genre_songs = []
for song in music_data:
    # Your code here - add song to genre_songs if it matches the genre
    pass  # Remove this line once you add your code

print(f"Found {len(genre_songs)} {genre_to_analyze} songs")

# Step 3: Calculate average popularity for this genre
# TODO: Add up all the popularity scores and divide by the count
# HINT: Create a total, loop through genre_songs adding each popularity, then divide

total_popularity = 0
# Your code here to add up all popularity scores

if len(genre_songs) > 0:
    average_popularity = 0  # Replace 0 with your calculation
    print(f"Average popularity for {genre_to_analyze}: {average_popularity:.1f}")

# Step 4: Calculate average danceability
# TODO: Same process as popularity, but for danceability
# HINT: Loop through genre_songs and add up all the danceability values

total_danceability = 0
# Your code here

if len(genre_songs) > 0:
    average_danceability = 0  # Replace 0 with your calculation
    print(f"Average danceability for {genre_to_analyze}: {average_danceability:.2f}")

# Step 5: Find the most popular song in this genre
# TODO: Find the highest popularity score within just this genre's songs
# HINT: Similar to Challenge 2, but only looking at genre_songs

highest_pop_in_genre = 0
best_song_in_genre = ""

for song in genre_songs:
    # Your code here - check if this song's popularity is higher than current highest
    # If so, update both variables
    pass  # Remove this line once you add your code

print(f"Most popular {genre_to_analyze} song: {best_song_in_genre}")

# BONUS CHALLENGE: Can you find the top 3 songs in this genre?
print(f"\nBONUS: Try to find the top 3 {genre_to_analyze} songs!")

TODO for Students:

Complete the genre filtering: Write the if statement to collect songs from one genre
Calculate popularity average: Implement the sum and division
Calculate danceability average: Apply the same logic to a different variable
Find the top song: Write logic to track the highest popularity song
Try different genres: Test your code with different genre values
Bonus challenge: Extend your logic to find top 3 songs in the genre

Example: Movie Language Analysis
What this example shows: How to analyze one category (movie language) using the same filtering and averaging techniques. This is exactly what you’re doing with music genres, but for movies.
Key learning: Category analysis works the same way regardless of the dataset – filter, calculate, compare!

# MOVIE LANGUAGE ANALYSIS EXAMPLE
print("ANALYZING ENGLISH MOVIES")

# Step 1: Choose a language to analyze (like choosing a genre)
language_to_analyze = "en"  # English movies

# Step 2: Filter movies by language (same logic as genre filtering)
language_movies = []
for movie in movie_data:
    if movie['original_language'] == language_to_analyze:
        language_movies.append(movie)

print(f"Found {len(language_movies)} English movies")

# Step 3: Calculate average popularity for English movies
total_popularity = 0
for movie in language_movies:
    total_popularity += movie['popularity']

average_popularity = total_popularity / len(language_movies)
print(f"Average popularity for English movies: {average_popularity:.1f}")

# Step 4: Calculate average rating
total_rating = 0
for movie in language_movies:
    total_rating += movie['vote_average']

average_rating = total_rating / len(language_movies)
print(f"Average rating for English movies: {average_rating:.1f}")

# Step 5: Find most popular English movie
highest_pop = 0
best_movie = ""

for movie in language_movies:
    if movie['popularity'] > highest_pop:
        highest_pop = movie['popularity']
        best_movie = movie['title']

print(f"Most popular English movie: {best_movie}")

# Compare to another language
print("\nComparing to Spanish movies:")
spanish_movies = []
for movie in movie_data:
    if movie['original_language'] == "es":
        spanish_movies.append(movie)

if len(spanish_movies) > 0:
    spanish_avg = sum(movie['popularity'] for movie in spanish_movies) / len(spanish_movies)
    print(f"Spanish movies average popularity: {spanish_avg:.1f}")
    print(f"English vs Spanish: {average_popularity:.1f} vs {spanish_avg:.1f}")
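
Filtering works well for one or two categories. When you want averages for every category at once, one possible approach is to keep running totals and counts in dictionaries and make a single pass through the data. The sketch below uses five made-up songs; the variable names totals, counts, and averages are illustrative choices.

```python
# Made-up records; popularity is already converted to a number
sample_songs = [
    {"track_genre": "pop", "popularity": 80},
    {"track_genre": "rock", "popularity": 60},
    {"track_genre": "pop", "popularity": 90},
    {"track_genre": "rock", "popularity": 70},
    {"track_genre": "jazz", "popularity": 40},
]

totals = {}  # genre -> sum of popularity scores
counts = {}  # genre -> number of songs

for song in sample_songs:
    genre = song["track_genre"]
    totals[genre] = totals.get(genre, 0) + song["popularity"]
    counts[genre] = counts.get(genre, 0) + 1

# Average = total divided by count, one entry per genre
averages = {genre: totals[genre] / counts[genre] for genre in totals}

for genre, average in averages.items():
    print(f"{genre}: {average:.1f} average popularity from {counts[genre]} songs")
```

Printing the count next to each average is a useful habit: an average built from one or two songs is much less trustworthy than one built from hundreds.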

Team Assignment & Strategy Planning

Team Assignment Process:

First: Teams run the genre discovery code to see what genres are actually in the dataset
Then: Teams choose from the available genres (likely: pop, hip-hop, rock, electronic, country, r&b, jazz, classical, etc.)
Finally: Each team picks a different genre to avoid duplicates

Strategy Discussion: “Discuss with your team: Based on the data exploration, what do you already know about your genre? What patterns did you notice? What would convince a record label to invest in it?”

Challenge 4: Team Genre Deep Dive

Challenge Overview: Work as a team to analyze your assigned genre in detail. You’ll create 3 simple charts that show why your genre is the best investment for Breakthrough Records.

Learning Goals:

Apply all week’s skills to a focused analysis
Work collaboratively on data analysis
Create charts that support business arguments

Your Team Mission: Create exactly 3 charts that tell a compelling story about your genre:

Chart 1: How popular is your genre compared to others?
Chart 2: What makes your genre special? (danceability, energy, or another feature)
Chart 3: Show your genre’s biggest hits or most promising trends

Code Purpose: These are simplified chart-making templates. Your team will modify them to create visualizations that support your investment pitch.

Simple Chart Template 1: Genre Popularity Comparison

# CHART 1: GENRE POPULARITY COMPARISON CHALLENGE
import matplotlib.pyplot as plt

# Step 1: Discover all genres in the dataset
# TODO: Create a list of unique genres (use your code from Challenge 1)
all_genres = []
# Your code here to find unique genres

# Step 2: Calculate average popularity for each genre
# TODO: For each genre, calculate its average popularity
# HINT: This is similar to Challenge 3, but for multiple genres

genre_averages = []
for genre in all_genres[:8]:  # Limit to first 8 genres to fit on chart
    # TODO: Find all songs in this genre
    genre_songs = []
    # Your code here
    
    # TODO: Calculate average popularity for this genre
    if len(genre_songs) > 0:
        # Your calculation here
        average_pop = 0  # Replace with your calculation
        genre_averages.append(average_pop)
    else:
        genre_averages.append(0)

# Step 3: Create the chart
plt.figure(figsize=(12, 6))
bars = plt.bar(all_genres[:8], genre_averages)

# Step 4: Highlight YOUR genre
your_genre = "REPLACE_WITH_YOUR_GENRE"  # Teams replace this
for i, genre in enumerate(all_genres[:8]):
    if genre == your_genre:
        bars[i].set_color('red')
    else:
        bars[i].set_color('lightblue')

plt.title('Average Popularity by Genre')
plt.ylabel('Average Popularity Score')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

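# TODO: Print insights about your genre's ranking
if your_genre in all_genres[:8]:
    print(f"Your genre ({your_genre}) average popularity: {genre_averages[all_genres.index(your_genre)]:.1f}")
else:
    print(f"Your genre ({your_genre}) is not one of the 8 genres on the chart. Check the spelling or chart different genres.")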

Simple Chart Template 2: Genre Special Feature

# CHART 2: WHAT MAKES YOUR GENRE SPECIAL CHALLENGE
your_genre = "REPLACE_WITH_YOUR_GENRE"

# Step 1: Filter to your genre's songs
# TODO: Create a list of all songs in your genre
your_genre_songs = []
# Your code here (similar to Challenge 3)

print(f"Analyzing {len(your_genre_songs)} {your_genre} songs")

# Step 2: Extract the feature you want to analyze
# TODO: Choose danceability, energy, or another feature to highlight
# TODO: Create a list of that feature's values for your genre

feature_scores = []
feature_name = "danceability"  # You can change this to "energy" or other features

for song in your_genre_songs:
    # TODO: Add the chosen feature value to your list
    # HINT: Use song[feature_name] to get the value
    pass

# Step 3: Create visualization
plt.figure(figsize=(8, 6))
plt.hist(feature_scores, bins=15, color='green', alpha=0.7, edgecolor='black')
plt.title(f'{your_genre.title()} Songs: {feature_name.title()} Distribution')
plt.xlabel(f'{feature_name.title()} Score')
plt.ylabel('Number of Songs')
plt.show()

# Step 4: Calculate and display insights
# TODO: Calculate average, minimum, and maximum for your feature
if feature_scores:
    avg_feature = 0  # Replace 0 with your calculation
    min_feature = 0  # Replace 0 with your calculation
    max_feature = 0  # Replace 0 with your calculation
    
    print(f"Average {feature_name} for {your_genre}: {avg_feature:.3f}")
    print(f"Range: {min_feature:.3f} to {max_feature:.3f}")
    
    # TODO: Compare to overall dataset average
    print(f"This tells us that {your_genre} music is...")

Simple Chart Template 3: Your Genre’s Top Hits

# CHART 3: YOUR GENRE'S BIGGEST HITS CHALLENGE
your_genre = "REPLACE_WITH_YOUR_GENRE"

# Step 1: Get all songs from your genre
# TODO: Filter the dataset to only your genre's songs
your_genre_songs = []
# Your code here

# Step 2: Find the top songs by popularity
# TODO: This is the big challenge - sort your genre's songs by popularity
# STRATEGY HINT: You could create a list of (popularity, song_info) pairs and sort
# OR use the same approach from Challenge 2 but only for your genre

# Method 1: Create pairs and sort (Recommended)
song_popularity_pairs = []
for song in your_genre_songs:
    # TODO: Create a pair of (popularity_score, song_dictionary)
    # HINT: (song['popularity'], song)
    pass

# TODO: Sort the pairs by popularity (highest first)
# HINT: Use .sort() with reverse=True and key=lambda x: x[0]

# Step 3: Get top 5 and prepare for chart
top_5_songs = song_popularity_pairs[:5]

# Extract data for visualization
song_names = []
popularities = []
for popularity_score, song_info in top_5_songs:
    # TODO: Get the track name and popularity for each top song
    # HINT: Use song_info['track_name'] and popularity_score
    short_name = song_info['track_name'][:20]  # Shorten long names
    song_names.append(short_name)
    popularities.append(popularity_score)

# Step 4: Create the chart
plt.figure(figsize=(10, 6))
plt.barh(song_names, popularities, color='purple')
plt.title(f'Top 5 {your_genre.title()} Hits')
plt.xlabel('Popularity Score')
plt.gca().invert_yaxis()  # Highest at top
plt.tight_layout()
plt.show()

# TODO: Print insights about your top hits
print(f"\n{your_genre.title()} Genre Analysis:")
print(f"Most popular song: {song_names[0]} (Score: {popularities[0]})")
print(f"Average of top 5: {sum(popularities)/len(popularities):.1f}")
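
Before filling in Step 2, it can help to try the pair-and-sort strategy on a tiny list first. The sketch below uses three made-up songs (the names and scores are invented for practice, not taken from the music dataset) and assumes each song is a dictionary with 'track_name' and 'popularity' keys, as in the hints above.

```python
# Made-up sample songs (hypothetical data, only for practice)
sample_songs = [
    {'track_name': 'Song A', 'popularity': 61},
    {'track_name': 'Song B', 'popularity': 88},
    {'track_name': 'Song C', 'popularity': 74},
]

# Build (popularity, song) pairs
pairs = []
for song in sample_songs:
    pairs.append((song['popularity'], song))

# Sort by the first item of each pair, highest first.
# key=lambda x: x[0] makes Python compare only the scores,
# so it never tries to compare two dictionaries when scores tie.
pairs.sort(key=lambda x: x[0], reverse=True)

# Keep the top 2 and print them
top_2 = pairs[:2]
for popularity_score, song_info in top_2:
    print(song_info['track_name'], popularity_score)
```

Once this makes sense on three songs, the same steps should carry over to your full genre list.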

TODO for Teams:

Complete each chart template: Fill in all the TODO sections with working code
Test your implementations: Make sure each chart displays correctly
Analyze your results: What story do your charts tell about your genre?
Prepare insights: What specific evidence supports investing in your genre?
Practice explaining: Can you explain what each chart shows in simple terms?

Example: Movie Visualization Charts
What this example shows: Three complete, working chart examples using movie data. These demonstrate the same visualization techniques you’re implementing for music, but with movies instead.
Key learning: The chart-making process is identical regardless of data – filter your category, calculate what you need, then visualize it!
Movie Chart 1: Language Popularity Comparison

# MOVIE CHART 1: LANGUAGE POPULARITY COMPARISON
import matplotlib.pyplot as plt

# Find top 6 languages by number of movies
language_counts = {}
for movie in movie_data:
    lang = movie['original_language']
    if lang in language_counts:
        language_counts[lang] += 1
    else:
        language_counts[lang] = 1

# Get top 6 languages
sorted_langs = sorted(language_counts.items(), key=lambda x: x[1], reverse=True)
top_languages = [lang for lang, count in sorted_langs[:6]]

# Calculate average popularity for each top language
lang_averages = []
for lang in top_languages:
    lang_movies = []
    for movie in movie_data:
        if movie['original_language'] == lang:
            lang_movies.append(movie)
    
    if len(lang_movies) > 0:
        avg_pop = sum(movie['popularity'] for movie in lang_movies) / len(lang_movies)
        lang_averages.append(avg_pop)
    else:
        lang_averages.append(0)

# Create chart
plt.figure(figsize=(10, 6))
bars = plt.bar(top_languages, lang_averages, color='skyblue')
bars[0].set_color('red')  # Highlight English
plt.title('Average Movie Popularity by Language')
plt.ylabel('Average Popularity')
plt.show()

Movie Chart 2: English Movie Ratings Distribution

# MOVIE CHART 2: ENGLISH MOVIE RATINGS DISTRIBUTION
english_movies = []
for movie in movie_data:
    if movie['original_language'] == 'en':
        english_movies.append(movie)

# Get all ratings for English movies
english_ratings = []
for movie in english_movies:
    english_ratings.append(movie['vote_average'])

# Create histogram
plt.figure(figsize=(8, 6))
plt.hist(english_ratings, bins=15, color='green', alpha=0.7, edgecolor='black')
plt.title('English Movies: Rating Distribution')
plt.xlabel('Rating (0-10)')
plt.ylabel('Number of Movies')
plt.show()

# Print insights
avg_rating = sum(english_ratings) / len(english_ratings)
print(f"Average English movie rating: {avg_rating:.1f}")

Movie Chart 3: Top English Movies

# MOVIE CHART 3: TOP ENGLISH MOVIES BY POPULARITY
english_movies = []
for movie in movie_data:
    if movie['original_language'] == 'en':
        english_movies.append(movie)

# Sort by popularity
english_movies.sort(key=lambda movie: movie['popularity'], reverse=True)
top_5_english = english_movies[:5]

# Extract data for chart
movie_titles = []
movie_popularity = []
for movie in top_5_english:
    movie_titles.append(movie['title'][:20])  # Shorten long titles
    movie_popularity.append(movie['popularity'])

# Create horizontal bar chart
plt.figure(figsize=(10, 6))
plt.barh(movie_titles, movie_popularity, color='purple')
plt.title('Top 5 English Movies by Popularity')
plt.xlabel('Popularity Score')
plt.gca().invert_yaxis()  # Highest at top
plt.tight_layout()
plt.show()

print(f"Most popular English movie: {movie_titles[0]}")

Challenge 5: Pitch Preparation & Competition

Challenge Overview: Prepare and deliver your 5-minute pitch to Breakthrough Records. Use your charts to build a compelling case for why they should invest $5 million in your genre.

Learning Goals:

Communicate data insights clearly and persuasively
Connect statistical findings to business decisions
Present technical work to non-technical audiences

Pitch Structure (5 minutes total):

1. Hook (30 seconds): “[Genre] music is about to explode, and here’s the data to prove it…”

2. Market Evidence (2 minutes):

Show Chart 1: “Our genre compares to others like this…”
Show Chart 2: “What makes our genre unique is…”
Key statistics that support investment

3. Success Stories (1.5 minutes):

Show Chart 3: “Look at these massive hits in our genre…”
Name specific artists/songs and their success

4. Investment Pitch (1 minute):

“Here’s why you should invest $5 million in [Genre]…”
Specific recommendations for what type of artists to sign

Day 5: STEM Research Challenge

Session 1: Scientific Data Basics

Simple Science Data Concepts

What Makes Good Scientific Research? (Just 3 key things)

Clear Question: What specific thing are you trying to understand?
Reliable Data: Numbers from credible scientific sources
Honest Analysis: What does the data actually show (not what you want it to show)

Computational Thinking in Science:

Pattern Recognition: Finding trends over time (temperatures rising, populations changing)
Abstraction: Focusing on key measurements while ignoring complexity
Decomposition: Breaking big questions into smaller, testable parts
Algorithm Design: Creating step-by-step analysis processes (just like you did for music!)

Today’s Research Track

Choose ONE option: use one of our provided datasets (Option A, recommended for reliability) or find your own dataset (Option B).

Option A: Ready-to-Use Real Datasets

Track 1: Climate & Temperature Analysis

Dataset: GlobalLandTemperaturesByCountry.csv
Source: Berkeley Earth Global Temperature Data
Size: 577,000+ records (1743-2013, monthly data)
Sample Questions: How have temperatures changed over time? Which countries are warming fastest? How does Ireland compare globally?
Key Columns: dt (date), AverageTemperature, AverageTemperatureUncertainty, Country
Irish Context: Compare Ireland’s temperature trends to global patterns
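
A first pass at "How have temperatures changed over time?" could look like the sketch below. It assumes the column names listed above (dt, AverageTemperature, Country) and runs on a few made-up rows; the country names and temperatures are invented, so swap in the real records and 'Ireland' when you try it.

```python
# Made-up rows shaped like the temperature CSV (all values are invented)
sample_rows = [
    {'dt': '1900-01-01', 'AverageTemperature': '4.0', 'Country': 'Country A'},
    {'dt': '1900-07-01', 'AverageTemperature': '14.0', 'Country': 'Country A'},
    {'dt': '2000-01-01', 'AverageTemperature': '5.0', 'Country': 'Country A'},
    {'dt': '2000-04-01', 'AverageTemperature': '', 'Country': 'Country A'},  # missing reading
    {'dt': '2000-07-01', 'AverageTemperature': '15.0', 'Country': 'Country A'},
    {'dt': '2000-01-01', 'AverageTemperature': '25.0', 'Country': 'Country B'},
]

country_to_study = 'Country A'

# Collect every usable monthly reading under its year
yearly_temps = {}
for row in sample_rows:
    if row['Country'] != country_to_study:
        continue
    if row['AverageTemperature'] == '':
        continue  # skip months with no reading
    year = int(row['dt'][:4])  # first 4 characters of the date are the year
    temp = float(row['AverageTemperature'])
    if year in yearly_temps:
        yearly_temps[year].append(temp)
    else:
        yearly_temps[year] = [temp]

# Average the readings for each year
yearly_averages = {}
for year in sorted(yearly_temps):
    yearly_averages[year] = sum(yearly_temps[year]) / len(yearly_temps[year])
    print(f"{year}: {yearly_averages[year]:.1f}")
```

One thing to watch: a year with missing months is averaged over fewer readings, so its average may lean warm or cold depending on which months are missing.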

Track 2: CO2 Emissions & Environment

Dataset: co2_emissions_kt_by_country.csv
Source: World Bank CO2 Emissions Data
Size: 13,900+ records (countries 1960-2019)
Sample Questions: Which countries produce the most CO2? How have emissions changed over time? Where does Ireland rank?
Key Columns: country_code, country_name, year, value (CO2 in kilotons)
Irish Context: Analyze Ireland’s emissions compared to EU neighbors
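
To answer "Where does Ireland rank?", one possible approach is to keep a single year, sort the countries by emissions, and look up a position in the sorted list. The sketch below assumes the column names listed above (country_name, year, value) and uses made-up rows; the countries and numbers are invented for illustration.

```python
# Made-up rows shaped like the CO2 CSV (all values are invented)
sample_rows = [
    {'country_code': 'AAA', 'country_name': 'Country A', 'year': '2019', 'value': '500.0'},
    {'country_code': 'BBB', 'country_name': 'Country B', 'year': '2019', 'value': '1200.0'},
    {'country_code': 'CCC', 'country_name': 'Country C', 'year': '2019', 'value': '80.0'},
    {'country_code': 'BBB', 'country_name': 'Country B', 'year': '2018', 'value': '1100.0'},
]

year_to_rank = 2019
country_to_find = 'Country A'

# Keep only the chosen year, as (name, emissions) pairs
ranking = []
for row in sample_rows:
    if int(row['year']) == year_to_rank:
        ranking.append((row['country_name'], float(row['value'])))

# Sort by emissions, highest first
ranking.sort(key=lambda pair: pair[1], reverse=True)

# enumerate(..., start=1) numbers the positions 1, 2, 3, ...
position_found = None
for position, (name, value) in enumerate(ranking, start=1):
    print(f"{position}. {name}: {value:.0f} kt")
    if name == country_to_find:
        position_found = position

print(f"{country_to_find} ranks number {position_found} in {year_to_rank}")
```

If your exploration shows that the file also contains grouped rows (regions or income groups rather than single countries), you would want to leave those out before ranking.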

Track 3: Global Health & Life Expectancy

Dataset: Life Expectancy Data.csv
Source: World Health Organization
Size: 2,900+ records (countries 2000-2015)
Sample Questions: What factors predict longer life? How has Irish health compared globally? What affects life expectancy most?
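Key Columns: Country, Year, Life_expectancy, Adult_Mortality, GDP, Total_expenditure, Schooling (exact header names in the file may differ from this list; for example, the loading code below uses 'Life expectancy ' with spaces, including one at the end, so print the column names first)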
Irish Context: Ireland data available – compare to other developed nations
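
One simple way to start on "What factors predict longer life?" is to split the records into two groups and compare their averages. The sketch below is an illustration on made-up rows: the cutoff of 12 years of schooling is an arbitrary assumption, and the 'Life expectancy ' key (with the trailing space) follows the loading code later in this session.

```python
# Made-up rows shaped like the life expectancy CSV (all values are invented)
sample_rows = [
    {'Country': 'Country A', 'Year': '2015', 'Life expectancy ': '81.0', 'Schooling': '18.0'},
    {'Country': 'Country B', 'Year': '2015', 'Life expectancy ': '62.0', 'Schooling': '9.0'},
    {'Country': 'Country C', 'Year': '2015', 'Life expectancy ': '75.0', 'Schooling': '13.0'},
    {'Country': 'Country D', 'Year': '2015', 'Life expectancy ': '', 'Schooling': '10.0'},  # missing
    {'Country': 'Country E', 'Year': '2015', 'Life expectancy ': '66.0', 'Schooling': '11.0'},
]

schooling_cutoff = 12.0  # assumed cutoff, chosen only for this example

more_schooling = []
less_schooling = []
for row in sample_rows:
    # Skip rows where either number is missing
    if row['Life expectancy '] == '' or row['Schooling'] == '':
        continue
    life = float(row['Life expectancy '])
    schooling = float(row['Schooling'])
    if schooling >= schooling_cutoff:
        more_schooling.append(life)
    else:
        less_schooling.append(life)

avg_more = sum(more_schooling) / len(more_schooling)
avg_less = sum(less_schooling) / len(less_schooling)
gap = avg_more - avg_less

print(f"Average life expectancy, {schooling_cutoff}+ years of schooling: {avg_more:.1f}")
print(f"Average life expectancy, less schooling: {avg_less:.1f}")
print(f"Gap: {gap:.1f} years")
```

A gap like this shows that the two measurements go together. It does not prove that schooling causes longer life, which is where honest analysis comes in.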

Track 4: Digital Development & Technology

Dataset: Final.csv
Source: World Bank Digital Development Indicators
Size: 8,800+ records (countries 1980-2020)
Sample Questions: How has internet adoption grown globally? Does digital access boost economies? How does Ireland compare?
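Key Columns: Entity (country), Year, Cellular_Subscription, Internet_Users(%), No_of_Internet_Users, Broadband_Subscription (exact header names in the file may differ from this list; for example, the loading code below uses 'Internet Users(%)' and 'Cellular Subscription' with spaces, so print the column names first)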
Irish Context: Track Ireland’s digital transformation over decades
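
To track one country's digital growth, a possible starting point is to compare its earliest and latest values. The sketch below uses made-up rows and the 'Entity', 'Year' and 'Internet Users(%)' keys from the loading code later in this session; the country names and percentages are invented.

```python
# Made-up rows shaped like the technology CSV (all values are invented)
sample_rows = [
    {'Entity': 'Country A', 'Year': '2000', 'Internet Users(%)': '18.0'},
    {'Entity': 'Country A', 'Year': '2010', 'Internet Users(%)': '70.0'},
    {'Entity': 'Country A', 'Year': '2020', 'Internet Users(%)': '92.0'},
    {'Entity': 'Country B', 'Year': '2020', 'Internet Users(%)': '50.0'},
]

country_to_study = 'Country A'

# Store one value per year for the chosen country
by_year = {}
for row in sample_rows:
    if row['Entity'] == country_to_study and row['Internet Users(%)'] != '':
        by_year[int(row['Year'])] = float(row['Internet Users(%)'])

# min() and max() on a dictionary look at its keys (the years)
first_year = min(by_year)
last_year = max(by_year)
change = by_year[last_year] - by_year[first_year]

print(f"{country_to_study}: {by_year[first_year]:.0f}% in {first_year}, "
      f"{by_year[last_year]:.0f}% in {last_year}")
print(f"Change: {change:.0f} percentage points")
```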

Alternative: Health Systems Analysis

Dataset: 2.12_Health_systems.csv
Source: WHO Health Systems Data
Size: 200+ countries (single timepoint)
Sample Questions: What makes health systems effective? How do different countries spend on healthcare?
Key Columns: Country_Region, Health_exp_pct_GDP_2016, Health_exp_per_capita_USD_2016, Physicians_per_1000

Option B: Find Your Own Dataset

If you want to explore something different:

Where to find datasets:

Kaggle.com: Search for “climate”, “space”, “health”, or “technology” + “dataset”
Data.gov.ie: Irish government data (education, transport, environment)
Our World in Data: Global development and research data
World Bank Open Data: Economic and social indicators

Requirements for your own dataset:

Must be downloadable as CSV file
Should have 200+ records for meaningful analysis
Must include at least one time column (year) and 2-3 numeric columns
Should address a scientific or social question you care about
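
The first three requirements can be checked with a few lines of code before you show the dataset to an instructor. This is a rough sketch on a tiny made-up CSV: it only recognises a time column if it is literally named "year", and it treats a column as numeric when every non-empty value converts with float(). Both rules are simplifying assumptions.

```python
import csv
import io

# A tiny made-up CSV standing in for a downloaded file (hypothetical data).
# With a real file you would open it instead, for example:
# open("your_file.csv", newline="", encoding="utf-8")
sample_csv = io.StringIO(
    "year,country,score,attendance\n"
    "2019,Country A,3.5,1200\n"
    "2020,Country A,4.0,1350\n"
    "2021,Country B,2.5,\n"
)
rows = list(csv.DictReader(sample_csv))
columns = list(rows[0].keys())

# Requirement: 200+ records
enough_records = len(rows) >= 200

# Requirement: a time column (here: a column called "year")
has_year_column = False
for name in columns:
    if name.strip().lower() == 'year':
        has_year_column = True

# Requirement: 2-3 numeric columns besides the year
numeric_columns = []
for name in columns:
    if name.strip().lower() == 'year':
        continue
    is_numeric = True
    for row in rows:
        if row[name] == '':
            continue  # ignore missing values
        try:
            float(row[name])
        except ValueError:
            is_numeric = False
    if is_numeric:
        numeric_columns.append(name)

print(f"Records: {len(rows)} (enough: {enough_records})")
print(f"Has a year column: {has_year_column}")
print(f"Numeric columns: {numeric_columns}")
```

The sample here is far too small to pass the 200-record check, which is the point: the same checks on your real file tell you quickly whether it is worth analysing.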

Popular student choices:

Sports performance data (Olympics, World Cup, etc.)
Movie/entertainment industry data
Food and nutrition data
Transportation and urban planning data
Education and academic performance data

Dataset approval: Show your chosen dataset to an instructor before starting analysis

Challenge 1: Load and Explore Scientific Data

Challenge Overview: Choose your research track and explore your scientific dataset using the same techniques from Day 4. The skills are identical – only the context is more serious.

Learning Goals:

Apply CSV loading skills to scientific data
Discover what questions your data can answer
Practice scientific thinking with familiar coding techniques

Code Purpose: This is the same data loading structure you used for music, but adapted for scientific datasets. You’ll modify it to explore your chosen research area.

# SCIENTIFIC DATA EXPLORATION CHALLENGE
print("SCIENTIFIC DATA EXPLORER")
import csv

# Step 1: Load your chosen scientific dataset
# TODO: Choose ONE of these real datasets:
# - "GlobalLandTemperaturesByCountry.csv" (climate research - 577k records)
# - "co2_emissions_kt_by_country.csv" (emissions research - 14k records)  
# - "Life Expectancy Data.csv" (health research - 3k records)
# - "Final.csv" (technology research - 9k records)
# - "2.12_Health_systems.csv" (health systems - 200 countries)

scientific_data = []

dataset_filename = "REPLACE_WITH_YOUR_CHOSEN_DATASET.csv"
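# newline='' is what the csv module recommends; encoding='utf-8' helps with accented country names
with open(dataset_filename, 'r', newline='', encoding='utf-8') as file: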
    reader = csv.DictReader(file)
    for record in reader:
        scientific_data.append(record)

print(f"Loaded {len(scientific_data)} scientific records")

# Step 2: Discover what's in your dataset
print("\nExploring dataset structure...")
if scientific_data:
    first_record = scientific_data[0]
    available_columns = list(first_record.keys())
    print(f"Available data columns: {available_columns}")

# Step 3: Look at sample data to understand it
print("\nSample records from dataset:")
for record in scientific_data[:3]:  # First 3 records (works even if the file is very small)
    # TODO: Print the key information from your dataset
    # FOR REAL DATASETS - choose your dataset and uncomment the right section:
    
    # Temperature data: 
    # print(f"Date: {record['dt']}, Country: {record['Country']}, Temp: {record['AverageTemperature']}")
    
    # CO2 data:
    # print(f"Country: {record['country_name']}, Year: {record['year']}, CO2: {record['value']}")
    
    # Life Expectancy data:
    # print(f"Country: {record['Country']}, Year: {record['Year']}, Life Exp: {record['Life expectancy ']}")
    
    # Technology data:
    # print(f"Country: {record['Entity']}, Year: {record['Year']}, Internet: {record['Internet Users(%)']}%")
    
    # Health Systems data:
    # print(f"Country: {record['Country_Region']}, Health Exp: {record['Health_exp_pct_GDP_2016']}%")
    
    # Your code here to print relevant fields
    pass

# Step 4: Convert text numbers to real numbers for analysis
print("\nConverting data for scientific analysis...")
conversion_errors = 0

for record in scientific_data:
    try:
        # TODO: Convert the numeric columns for your dataset
        # FOR REAL DATASETS - uncomment the one you're using:
        
        # Temperature data conversion:
        # if record['AverageTemperature']:  # Skip empty values
        #     record['AverageTemperature'] = float(record['AverageTemperature'])
        #     record['year'] = int(record['dt'][:4])  # Extract year from date
        
        # CO2 data conversion:
        # record['year'] = int(record['year'])
        # record['value'] = float(record['value'])
        
        # Life Expectancy data conversion:
        # record['Year'] = int(record['Year'])
        # if record['Life expectancy ']:  # Handle missing values
        #     record['Life expectancy '] = float(record['Life expectancy '])
        # if record['GDP']:
        #     record['GDP'] = float(record['GDP'])
        
        # Technology data conversion:
        # record['Year'] = int(record['Year'])
        # if record['Internet Users(%)']:
        #     record['Internet Users(%)'] = float(record['Internet Users(%)'])
        # if record['Cellular Subscription']:
        #     record['Cellular Subscription'] = float(record['Cellular Subscription'])
        
        # Health Systems data conversion:
        # if record['Health_exp_pct_GDP_2016']:
        #     record['Health_exp_pct_GDP_2016'] = float(record['Health_exp_pct_GDP_2016'])
        # if record['Health_exp_per_capita_USD_2016']:
        #     record['Health_exp_per_capita_USD_2016'] = float(record['Health_exp_per_capita_USD_2016'])
        
        # Your conversion code here
        pass
        
    except (ValueError, TypeError):
        conversion_errors += 1

print(f"Scientific data ready! ({conversion_errors} conversion errors)")

# Step 5: Basic dataset insights
print(f"\nDataset Overview:")
print(f"Total records: {len(scientific_data)}")

# TODO: Print some basic insights about your data
# Examples based on your chosen dataset:
# Temperature: Countries covered, year range, recent vs historical averages
# CO2: Time span, highest/lowest emitting countries, recent trends
# Health: Countries covered, year range, developed vs developing nations
# Technology: Digital growth timespan, countries with best/worst access
# Your insights here
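
For Step 5, two insights that work for most of the datasets are how many countries are covered and which years the data spans. The sketch below shows one way to get both, using made-up records that are already converted to numbers; the 'country_name', 'year' and 'value' keys follow the CO2 dataset, so adjust them for your own track.

```python
# Made-up, already-converted records (all values are invented)
sample_records = [
    {'country_name': 'Country A', 'year': 1960, 'value': 100.0},
    {'country_name': 'Country A', 'year': 2019, 'value': 300.0},
    {'country_name': 'Country B', 'year': 1990, 'value': 50.0},
]

# A set keeps each country only once, even if it appears in many records
countries = set()
years = []
for record in sample_records:
    countries.add(record['country_name'])
    years.append(record['year'])

print(f"Countries covered: {len(countries)}")
print(f"Year range: {min(years)} to {max(years)}")
```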

TODO for Students:

Choose your research track: Pick climate, emissions, health, or technology
Load your dataset: Replace the filename with your chosen dataset
Explore the structure: Print the column names and understand what data you have
Display sample data: Show key fields from a few records
Convert data types: Make the numbers usable for calculations

Example: Movie Data Loading for Comparison
What this example shows: The exact same data loading process you just practiced, but with the movie dataset you already know. Notice how the structure is identical!
Key learning: Scientific data loading uses exactly the same skills as entertainment data!

# MOVIE DATA LOADING FOR COMPARISON
print("MOVIE DATA LOADER")
import csv

# Load movie dataset (same process)
movie_data = []
with open('tmdb_5000_movies.csv', 'r', newline='', encoding='utf-8') as file:
    reader = csv.DictReader(file)
    for movie in reader:
        movie_data.append(movie)

print(f"Loaded {len(movie_data)} movies")

# Explore structure (same technique)
if movie_data:
    columns = list(movie_data[0].keys())
    print(f"Available columns: {columns[:8]}")  # Show first 8

# Display sample data (same approach)
print("\nSample movies:")
for i in range(2):
    movie = movie_data[i]
    print(f"Title: {movie['title']}")
    print(f"Year: {movie['release_date'][:4]}")  # Extract year
    print(f"Rating: {movie['vote_average']}")
    print(f"Popularity: {movie['popularity']}")
    print()

# Convert data (same process)
print("Converting movie data...")
for movie in movie_data:
    try:
        movie['popularity'] = float(movie['popularity'])
        movie['vote_average'] = float(movie['vote_average'])
        movie['vote_count'] = int(movie['vote_count'])
    except ValueError:
        pass

print("Movie data converted!")

Challenge 2: Find Patterns in Scientific Data (35 minutes)

Challenge Overview: Use the same pattern-finding techniques from Day 4 to discover trends in your scientific dataset. You’re looking for the “biggest hits” – but now it’s biggest changes, highest values, or most important patterns.

Learning Goals:

Apply max/min finding to scientific questions
Look for trends over time in scientific data
Practice the same analysis skills with serious data

Code Purpose: This template shows how to find important patterns in scientific data using the same max/min techniques you mastered with music data. You’ll adapt it to answer real scientific questions.

# SCIENTIFIC PATTERN DISCOVERY CHALLENGE
print("DISCOVERING SCIENTIFIC PATTERNS")

# PART 1: Find the most extreme values (like finding top songs)
print("\nFinding extreme values in the data...")

# TODO: Choose one numeric column to analyze for extremes
# Examples: 'AverageTemperature', 'value' (CO2), 'Life expectancy ', 'Internet Users(%)'
column_to_analyze = "REPLACE_WITH_YOUR_COLUMN"

# TODO: Extract all values for this column into a list
all_values = []
for record in scientific_data:
    # Your code here - add the column value to the list
    pass

# TODO: Find the maximum and minimum values
if all_values:
    highest_value = None  # TODO: replace None with your code
    lowest_value = None  # TODO: replace None with your code
    
    print(f"Highest {column_to_analyze}: {highest_value}")
    print(f"Lowest {column_to_analyze}: {lowest_value}")
    
    # TODO: Find which records have these extreme values
    print("\nRecords with extreme values:")
    for record in scientific_data:
        # Your code here - find and print records with highest/lowest values
        pass

# PART 2: Look for patterns over time (if your data has years)
print("\nAnalyzing trends over time...")

# TODO: Group data by year and calculate averages
# This is similar to grouping music by genre
yearly_data = {}

for record in scientific_data:
    try:
        year = None  # TODO: Extract year from your data
        value = None  # TODO: Extract the value you're analyzing
        
        # TODO: Add this year's data to the yearly_data dictionary
        if year in yearly_data:
            # Add to existing year
            pass
        else:
            # Create new year entry
            pass
            
    except (ValueError, KeyError):
        continue

# TODO: Calculate average for each year
print("Yearly averages:")
for year in sorted(yearly_data.keys())[:10]:  # Show first 10 years
    # Your code here - calculate and print yearly average
    pass

# PART 3: Compare different categories (like comparing genres)
print("\nComparing different categories...")

# TODO: Choose a category column to compare
# Examples: 'Country', 'country_name', 'Entity', 'Country_Region'
category_column = "REPLACE_WITH_CATEGORY_COLUMN"

# TODO: Calculate averages for each category (same logic as music genres)
category_averages = {}
# Your code here - group by category and calculate averages

# TODO: Find the best and worst performing categories
if category_averages:
    best_category = None  # TODO: replace None with your code
    worst_category = None  # TODO: replace None with your code
    
    print(f"Best performing {category_column}: {best_category}")
    print(f"Lowest performing {category_column}: {worst_category}")

TODO for Students:

Choose your analysis column: Pick the most interesting numeric column from your dataset
Find extremes: Implement the max/min finding logic
Identify extreme records: Find which countries or records have the highest/lowest values
Analyze time trends: If your data has years, look for changes over time
Compare categories: Group by a category and compare averages
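
For the "identify extreme records" step, an alternative to looping twice is to let max() and min() pick the whole record by using key=, the same idea as the sorting hints from Day 4. The sketch below runs on made-up records with an invented 'value' column. Note that it returns only one record even if several tie, so the loop in the template is still the way to list them all.

```python
# Made-up records (all values are invented); one value is still empty text
sample_records = [
    {'Country': 'Country A', 'value': 12.5},
    {'Country': 'Country B', 'value': 30.1},
    {'Country': 'Country C', 'value': ''},   # missing value that was never converted
    {'Country': 'Country D', 'value': 4.2},
]

# Keep only records whose value is a real number,
# because max() cannot compare numbers with text
usable_records = []
for record in sample_records:
    if isinstance(record['value'], float):
        usable_records.append(record)

# key= tells max() and min() which part of each record to compare
highest_record = max(usable_records, key=lambda record: record['value'])
lowest_record = min(usable_records, key=lambda record: record['value'])

print(f"Highest: {highest_record['Country']} ({highest_record['value']})")
print(f"Lowest: {lowest_record['Country']} ({lowest_record['value']})")
```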

Example: Movie Pattern Discovery
What this example shows: The same pattern-finding techniques applied to movie data. Notice how the logic structure matches what you’re implementing for scientific data!
Key learning: Pattern discovery works identically whether you’re analyzing box office hits or scientific trends!

# MOVIE PATTERN DISCOVERY EXAMPLE
print("DISCOVERING MOVIE PATTERNS")

# Part 1: Find extreme values (same logic as scientific data)
print("Finding highest and lowest rated movies...")

all_ratings = []
for movie in movie_data:
    all_ratings.append(movie['vote_average'])

highest_rating = max(all_ratings)
lowest_rating = min(all_ratings)

print(f"Highest movie rating: {highest_rating}")
print(f"Lowest movie rating: {lowest_rating}")

# Find movies with extreme ratings
for movie in movie_data:
    if movie['vote_average'] == highest_rating:
        print(f"Highest rated movie: {movie['title']}")
    elif movie['vote_average'] == lowest_rating:
        print(f"Lowest rated movie: {movie['title']}")

# Part 2: Analyze trends over time (same approach)
print("\nAnalyzing movie trends by year...")

yearly_ratings = {}
for movie in movie_data:
    try:
        year = int(movie['release_date'][:4])  # Extract year
        rating = movie['vote_average']
        
        if year in yearly_ratings:
            yearly_ratings[year].append(rating)
        else:
            yearly_ratings[year] = [rating]
    except (ValueError, TypeError):
        continue

# Calculate yearly averages
print("Average ratings by year:")
for year in sorted(yearly_ratings.keys())[-10:]:  # Last 10 years
    avg_rating = sum(yearly_ratings[year]) / len(yearly_ratings[year])
    print(f"{year}: {avg_rating:.1f}")

# Part 3: Compare by language (like comparing categories)
print("\nComparing movie ratings by language...")

language_ratings = {}
for movie in movie_data:
    lang = movie['original_language']
    rating = movie['vote_average']
    
    if lang in language_ratings:
        language_ratings[lang].append(rating)
    else:
        language_ratings[lang] = [rating]

# Find best and worst languages (minimum 10 movies)
lang_averages = {}
for lang, ratings in language_ratings.items():
    if len(ratings) >= 10:  # Only languages with enough movies
        lang_averages[lang] = sum(ratings) / len(ratings)

best_lang = max(lang_averages, key=lang_averages.get)
worst_lang = min(lang_averages, key=lang_averages.get)

print(f"Best average ratings: {best_lang} ({lang_averages[best_lang]:.1f})")
print(f"Lowest average ratings: {worst_lang} ({lang_averages[worst_lang]:.1f})")

Challenge 3: Create Your Scientific Visualization

Challenge Overview: Create one simple, clear chart that tells an important scientific story. This uses the same visualization skills from Day 4, but now your chart could inform real policy decisions or scientific understanding.

Learning Goals:

Apply matplotlib skills to scientific communication
Create charts that non-scientists can understand
Tell compelling stories with data visualization

Code Purpose: These are simplified chart templates adapted for scientific data. Choose ONE chart type that best shows your most interesting finding.

Scientific Chart Option 1: Time Trend Analysis

# SCIENTIFIC CHART 1: TRENDS OVER TIME
import matplotlib.pyplot as plt

# TODO: Create a chart showing how something changes over time
# Examples: Temperature over years, life expectancy improvements, internet adoption per year

# Step 1: Prepare your time data
years = []
values = []

# TODO: Extract years and corresponding values from your dataset
for record in scientific_data:
    try:
        year = None  # TODO: Extract year from your record
        value = None  # TODO: Extract the value you want to track over time
        
        years.append(year)
        values.append(value)
    except (ValueError, KeyError):
        continue

# Step 2: Create the line chart
plt.figure(figsize=(12, 6))
plt.plot(years, values, marker='o', linewidth=2, markersize=4)

# Step 3: Add scientific formatting
plt.title('YOUR_SCIENTIFIC_TITLE_HERE', fontsize=14, fontweight='bold')
plt.xlabel('Year')
plt.ylabel('YOUR_MEASUREMENT_HERE')
plt.grid(True, alpha=0.3)

# Step 4: Add context annotations
# TODO: Add a text box explaining what the trend means
plt.text(0.05, 0.95, 'Key Insight: YOUR_INSIGHT_HERE', 
         transform=plt.gca().transAxes, fontsize=10,
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

plt.tight_layout()
plt.show()

# TODO: Print your analysis
print("Scientific Analysis:")
print(f"Time period analyzed: {min(years)} to {max(years)}")
print(f"Overall trend: YOUR_DESCRIPTION_HERE")
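
A note on Step 1: if your dataset has many records per year (monthly readings, or one row per country), appending every record gives many points for the same year and the line will zigzag. One possible fix is to average each year first and plot only the averages. The sketch below shows that preparation step on made-up records with invented 'year' and 'value' keys; it keeps a running total and count per year, then builds the two lists the chart needs.

```python
# Made-up, already-converted records (all values are invented)
sample_records = [
    {'year': 2001, 'value': 10.0},
    {'year': 2000, 'value': 8.0},
    {'year': 2000, 'value': 12.0},
    {'year': 2001, 'value': 14.0},
    {'year': 2002, 'value': 15.0},
]

# For each year keep [running total, number of records]
totals = {}
for record in sample_records:
    year = record['year']
    if year in totals:
        totals[year][0] += record['value']
        totals[year][1] += 1
    else:
        totals[year] = [record['value'], 1]

# Build the two lists for the chart, in year order
years = sorted(totals)
values = []
for year in years:
    values.append(totals[year][0] / totals[year][1])

print(years)
print(values)
# These lists can then go straight into plt.plot(years, values, ...)
```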

Scientific Chart Option 2: Category Comparison

# SCIENTIFIC CHART 2: COMPARING DIFFERENT GROUPS
import matplotlib.pyplot as plt

# TODO: Compare different countries, regions, or other categories
# Examples: CO2 by country, life expectancy by region, internet access by continent

# Step 1: Calculate averages for each category
category_averages = {}

for record in scientific_data:
    try:
        category = None  # TODO: Extract category (country, region, etc.)
        value = None  # TODO: Extract value to compare
        
        # TODO: Add to category averages
        if category in category_averages:
            # Add to existing category
            pass
        else:
            # Create new category
            pass
            
    except (ValueError, KeyError):
        continue

# TODO: Calculate final averages
final_averages = {}
for category, data in category_averages.items():
    # Your calculation here
    pass

# Step 2: Get top 8 categories for chart
sorted_categories = sorted(final_averages.items(), key=lambda x: x[1], reverse=True)
top_categories = sorted_categories[:8]

categories = [item[0] for item in top_categories]
values = [item[1] for item in top_categories]

# Step 3: Create bar chart
plt.figure(figsize=(12, 6))
bars = plt.bar(categories, values, color='lightcoral')
bars[0].set_color('darkred')  # Highlight the top performer

plt.title('YOUR_COMPARISON_TITLE_HERE', fontsize=14, fontweight='bold')
plt.ylabel('YOUR_MEASUREMENT_HERE')
plt.xticks(rotation=45, ha='right')

# Step 4: Add value labels
for bar, value in zip(bars, values):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + max(values)*0.01,
             f'{value:.1f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

print(f"Highest performing: {categories[0]} ({values[0]:.1f})")
print(f"Scientific insight: YOUR_INSIGHT_HERE")

TODO for Students:

Choose your chart type: Time trends or category comparison?
Implement the data extraction: Fill in the TODO sections with your specific data
Customize titles and labels: Make them scientifically accurate and clear
Add your insights: What does your chart reveal about your scientific question?
Test and refine: Does your chart clearly communicate your finding?

Example: Movie Visualization Charts
What this example shows: Two complete scientific-style visualizations using movie data. These demonstrate the same chart-making process you’re implementing, but with entertainment data.
Key learning: The visualization process is identical – extract data, create chart, add insights!
Movie Chart 1: Movie Ratings Trend Over Time

# MOVIE TREND ANALYSIS EXAMPLE
import matplotlib.pyplot as plt

# Extract years and average ratings
yearly_ratings = {}
for movie in movie_data:
    try:
        year = int(movie['release_date'][:4])
        rating = movie['vote_average']
        
        if 1990 <= year <= 2020:  # Focus on recent decades
            if year in yearly_ratings:
                yearly_ratings[year].append(rating)
            else:
                yearly_ratings[year] = [rating]
    except (ValueError, TypeError):
        continue

# Calculate yearly averages
years = []
avg_ratings = []
for year in sorted(yearly_ratings.keys()):
    if len(yearly_ratings[year]) >= 5:  # Years with enough movies
        avg_rating = sum(yearly_ratings[year]) / len(yearly_ratings[year])
        years.append(year)
        avg_ratings.append(avg_rating)

# Create trend chart
plt.figure(figsize=(12, 6))
plt.plot(years, avg_ratings, marker='o', linewidth=2, markersize=4, color='blue')
plt.title('Movie Quality Trends: Average Ratings Over Time', fontsize=14, fontweight='bold')
plt.xlabel('Year')
plt.ylabel('Average Movie Rating (0-10)')
plt.grid(True, alpha=0.3)

# Add insight annotation
recent_ratings = avg_ratings[-5:]  # Last 5 years (or fewer if there is less data)
recent_avg = sum(recent_ratings) / len(recent_ratings)
plt.text(0.05, 0.95, f'Recent Trend: {recent_avg:.1f} average rating', 
         transform=plt.gca().transAxes, fontsize=10,
         bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.8))

plt.tight_layout()
plt.show()

print(f"Analysis: Movie ratings from {min(years)} to {max(years)}")
print(f"Recent 5-year average: {recent_avg:.2f}")

Movie Chart 2: Movie Production by Language

# MOVIE CATEGORY COMPARISON EXAMPLE
import matplotlib.pyplot as plt

# Count movies by language
language_counts = {}
for movie in movie_data:
    lang = movie['original_language']
    if lang in language_counts:
        language_counts[lang] += 1
    else:
        language_counts[lang] = 1

# Get top 8 languages
sorted_langs = sorted(language_counts.items(), key=lambda x: x[1], reverse=True)
top_8_langs = sorted_langs[:8]

languages = [item[0] for item in top_8_langs]
counts = [item[1] for item in top_8_langs]

# Create comparison chart
plt.figure(figsize=(10, 6))
bars = plt.bar(languages, counts, color='lightgreen')
bars[0].set_color('darkgreen')  # Highlight English

plt.title('Movie Production by Language: Global Cinema Diversity', fontsize=14, fontweight='bold')
plt.ylabel('Number of Movies')
plt.xticks(rotation=45)

# Add value labels
for bar, count in zip(bars, counts):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 20,
             f'{count}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

print(f"Most common language: {languages[0]} ({counts[0]} movies)")
print(f"Cinema diversity: {len(language_counts)} languages represented")

Challenge 4: Scientific Presentation Preparation

Challenge Overview: Prepare a 3-minute presentation of your scientific findings. You’ll present like a real scientist at a conference, sharing one important discovery that could inform decisions or future research.

Learning Goals:

Communicate scientific findings clearly to a general audience
Practice defending conclusions with evidence
Connect data analysis to real-world applications

Presentation Structure (3 minutes total):

1. The Question (30 seconds): “The scientific question I investigated was… This matters because…”

2. The Data (1 minute): “I analyzed [X records] from … Here’s what I found…” [Show your chart]

3. The Discovery (1 minute): “The most important pattern I discovered was… This means…”

4. The Impact (30 seconds): “This research suggests that scientists/governments/people should…”

THE END.