This morning you learned to store data. But what if I told you Netflix analyzes the viewing habits of 260 MILLION users to decide what shows to recommend? How do you think they store and analyze that much data?
The Big Idea:
- Variables store one piece of information
- Lists store many pieces of related information
- Data Science = finding patterns in those collections
Concept: Lists – Your First Dataset
Explanation: “A list in Python is like a column in a spreadsheet. Instead of storing one value, it stores many related values in order.”
# Instead of this (one student):
student_name = "Alex"
test_score = 85
# We can do this (whole class):
student_names = ["Alex", "Sam", "Jordan", "Casey", "Taylor", "Morgan"]
test_scores = [85, 92, 78, 96, 88, 74]
# The data is connected by position:
# Alex scored 85, Sam scored 92, Jordan scored 78, etc.
Code Example: Basic List Operations
Before we code, let’s learn some powerful built-in functions:
- len() – counts how many items are in a list
- max() – finds the highest value in a list of numbers
- min() – finds the lowest value in a list of numbers
- sum() – adds up all numbers in a list
- round() – rounds a decimal number to fewer decimal places
# Our first real dataset: Class test scores
print("=== CLASSROOM DATA ANALYSIS ===")
# Create our lists (like columns in a spreadsheet)
student_names = ["Alex", "Sam", "Jordan", "Casey", "Taylor", "Morgan"]
test_scores = [85, 92, 78, 96, 88, 74]
# Basic dataset information using built-in functions
print("Number of students:", len(student_names)) # len() counts items in the list
print("Highest score:", max(test_scores)) # max() finds the biggest number
print("Lowest score:", min(test_scores)) # min() finds the smallest number
# Calculate class average (first real data science calculation!)
total_points = sum(test_scores) # sum() adds all numbers: 85+92+78+96+88+74 = 513
class_average = total_points / len(test_scores) # 513 ÷ 6 students = 85.5
print("Class average:", round(class_average, 1)) # round(..., 1) keeps at most 1 decimal place: 85.5
# Show all the data using a loop
print("\n=== INDIVIDUAL STUDENT SCORES ===")
for i in range(len(student_names)): # range(6) gives us [0, 1, 2, 3, 4, 5]
    # i is the position: 0=Alex, 1=Sam, 2=Jordan, etc.
    print(f"{student_names[i]}: {test_scores[i]} points")
Notice how student_names[0] and test_scores[0] both refer to Alex’s data. The position numbers keep everything connected!
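As an optional aside, Python's built-in zip() is another way to walk two position-linked lists together. The sketch below reuses the same sample lists; it is an alternative to the index loop above, not a replacement for it, and the variable name paired is just for illustration.

```python
student_names = ["Alex", "Sam", "Jordan", "Casey", "Taylor", "Morgan"]
test_scores = [85, 92, 78, 96, 88, 74]

# zip() pairs up items that share a position: ("Alex", 85), ("Sam", 92), ...
paired = list(zip(student_names, test_scores))

# Each pair unpacks into a name and a score, so no position number is needed
for name, score in paired:
    print(f"{name}: {score} points")
```

The index loop is still worth knowing, because it makes the "connected by position" idea visible.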
Challenge 1: Analyzing Your Friend Group
Your Turn: Create lists for your friend group and analyze their social media habits!
# TODO: Fill in with real data about your friends
friend_names = ["Friend1", "Friend2", "Friend3", "Friend4", "Friend5"]
daily_screen_time = [0, 0, 0, 0, 0] # Hours per day (estimate!)
# Your mission: Calculate and display:
# 1. Who spends the most time on their phone?
# 2. What's the average screen time for your friend group?
# 3. How does each friend compare to the average?
print("=== FRIEND GROUP SCREEN TIME ANALYSIS ===")
# Write your code here!
Section 2: Finding Patterns (25 minutes)
Concept: Data Filtering and Categorization
Explanation: “Data scientists don’t just calculate averages – they find patterns and group data into categories. This is how Spotify creates playlists or how schools identify students who need extra help.”
New Skills We’ll Learn:
- Empty lists – Start with [] and add items later using .append()
- .append() – Adds an item to the end of a list
- Categorization – Sorting data into groups based on conditions
Code Example: Academic Performance Analysis
print("=== ADVANCED STUDENT PERFORMANCE ANALYSIS ===")
# Expanded dataset with more students
students = ["Alex", "Sam", "Jordan", "Casey", "Taylor", "Morgan", "Riley", "Avery"]
test_scores = [85, 92, 78, 96, 88, 74, 91, 82]
study_hours = [2.5, 4.0, 1.5, 4.5, 3.0, 1.0, 3.5, 2.0] # Hours per week
# Create empty lists to sort students into categories
excellent_students = [] # Will hold names of students with scores >= 90
good_students = [] # Will hold names of students with scores 80-89
needs_help_students = [] # Will hold names of students with scores < 80
print("Analyzing each student...")
# Analyze each student (this is real data science!)
for i in range(len(students)): # i goes from 0 to 7 (8 students total)
    name = students[i] # Get student name at position i
    score = test_scores[i] # Get test score at position i
    hours = study_hours[i] # Get study hours at position i
    # Categorize based on test score
    if score >= 90:
        excellent_students.append(name) # Add this student's name to excellent list
        performance_category = "Excellent"
    elif score >= 80: # This means 80-89 (since >= 90 was already checked)
        good_students.append(name) # Add this student's name to good list
        performance_category = "Good"
    else: # This means < 80
        needs_help_students.append(name) # Add this student's name to needs help list
        performance_category = "Needs Support"
    # Show individual analysis
    print(f"{name}: {score} points, {hours}h study time → {performance_category}")
# Generate insights (the payoff!)
print(f"\n=== CLASS INSIGHTS ===")
print(f" Excellent performers ({len(excellent_students)} students): {excellent_students}")
print(f" Good performers ({len(good_students)} students): {good_students}")
print(f" Students who could use support ({len(needs_help_students)} students): {needs_help_students}")
# Advanced insight: Is there a relationship between study time and performance?
print(f"\n=== STUDY TIME ANALYSIS ===")
if len(excellent_students) > 0: # Make sure we have excellent students to analyze
    # Calculate average study time for excellent students
    excellent_study_times = [] # Empty list to collect study times
    for name in excellent_students: # Go through each excellent student
        student_position = students.index(name) # Find where this student is in the main list
        study_time = study_hours[student_position] # Get their study time
        excellent_study_times.append(study_time) # Add to our collection
    excellent_avg_study = sum(excellent_study_times) / len(excellent_study_times)
    print(f"Excellent students study an average of {excellent_avg_study:.1f} hours per week")
else:
    print("No excellent students to analyze study time patterns")
- .append() lets us build lists as we analyze data
- Multiple lists can work together (names, scores, and hours are connected by position)
- .index() helps us find where an item is located in a list
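To see .index() working together with max(), here is a small sketch that finds the top scorer in the same class data. The variable names top_score, top_position, and top_student are made up for this illustration. Note that .index() returns the position of the first match, so if two students tied for the top score only the first would be reported.

```python
students = ["Alex", "Sam", "Jordan", "Casey", "Taylor", "Morgan", "Riley", "Avery"]
test_scores = [85, 92, 78, 96, 88, 74, 91, 82]

top_score = max(test_scores)                 # Highest score in the list
top_position = test_scores.index(top_score)  # Position of the first matching score
top_student = students[top_position]         # Same position in the names list

print(f"Top scorer: {top_student} with {top_score} points")
```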
Challenge 2: Simple Grade Calculator
Your Turn: Create a grade calculator that analyzes multiple students and determines who needs help!
print("=== CLASS GRADE ANALYZER ===")
# Dataset: Student test scores from this week
student_names = ["Maya", "Jordan", "Alex", "Sam", "Casey"]
test_scores = [92, 67, 88, 45, 91]
print("=== BASIC CLASS INFORMATION ===")
print("Students in class:", len(student_names))
print("All test scores:", test_scores)
# TODO: Calculate class statistics using the functions you learned
# STEP 1: Calculate the total points (use sum() function)
total_points = # Your code here
# STEP 2: Calculate the class average (total ÷ number of students)
class_average = # Your code here
# STEP 3: Find the highest and lowest scores (use max() and min() functions)
highest_score = # Your code here
lowest_score = # Your code here
# Display the statistics (round the average to 1 decimal place)
print("Class average:", round(class_average, 1))
print("Highest score:", highest_score)
print("Lowest score:", lowest_score)
# TODO: Analyze each student's performance
print("\n=== INDIVIDUAL STUDENT ANALYSIS ===")
# STEP 4: Create empty lists to categorize students
students_needing_help = [] # For students with scores below 70
students_doing_well = [] # For students with scores 70 or above
# STEP 5: Loop through each student and categorize them
for i in range(len(student_names)):
    name = student_names[i]
    score = test_scores[i]
    print(f"Analyzing {name}: {score} points")
    # TODO: Write the if/else logic to categorize students
    if # Your condition here (hint: check if score is below 70):
        # Add student to needs help list
        # Set performance message
        print(f" → NEEDS HELP (below 70)")
    else:
        # Add student to doing well list
        # Set performance message
        print(f" → DOING WELL (70 or above)")
# TODO: Generate insights for the teacher
print("\n=== TEACHER INSIGHTS ===")
print(f"Students doing well: {students_doing_well}")
print(f"Students needing help: {students_needing_help}")
# STEP 6: Calculate what percentage of class needs help
# (number needing help ÷ total students) × 100
help_percentage = # Your calculation here
print(f"Number needing support: {len(students_needing_help)} out of {len(student_names)}")
print(f"Percentage needing help: {round(help_percentage, 1)}%")
# STEP 7: Provide recommendation based on results
if len(students_needing_help) > 0:
    print(f"\n RECOMMENDATION: Schedule tutoring sessions for {len(students_needing_help)} students")
    print(f"Focus on: {students_needing_help}")
else:
    print(f"\n GREAT NEWS: All students are performing well!")
Hints to Get You Started:
# Hint for Step 1: Use sum() to add all scores
total_points = sum(test_scores)
# Hint for Step 2: Divide total by number of students
class_average = total_points / len(test_scores)
# Hint for Step 5: Check if score is less than 70
if score < 70:
    students_needing_help.append(name)
else:
    students_doing_well.append(name)
What You’re Learning:
- Data categorization – Sorting students into “doing well” vs “needs help”
- Percentage calculations – Converting counts to percentages for insights
- List building – Using .append() to collect students in different categories
- Educational applications – How schools use data to identify at-risk students
Extension Challenges (If You Finish Early):
- Change the threshold: Try changing the “needs help” threshold from 70 to 75. How does this change the results?
- Add more categories: Create a third category for “excellent” students (scores >= 90)
- Find the struggling student: Write code to find which student has the lowest score and needs the most help
- Calculate grade letters: Add logic to assign letter grades (A, B, C, D, F) to each student
Section 3: Multi-Variable Analysis
Concept: Connecting Multiple Data Points
Explanation: “Real data science gets exciting when you connect different types of information. Does study time actually correlate with test scores? Do premium users really listen to more music? This is where insights come from!”
Advanced Skills We’ll Use:
- Multi-factor scoring – Combining several pieces of data to make predictions
- Weighted factors – Some data points matter more than others
- Conditional logic chains – Using if/elif with multiple conditions
- Complex business analysis – Multiple separate analytical workflows in one program
Code Example: Student Success Predictor
print("=== STUDENT SUCCESS PREDICTOR ===")
# Comprehensive student dataset (4 different types of information per student)
students = ["Emma", "Liam", "Sophia", "Noah", "Olivia", "William", "Ava", "James"]
test_scores = [92, 78, 96, 85, 88, 74, 91, 82]
study_hours = [4.0, 1.5, 4.5, 2.5, 3.0, 1.0, 3.5, 2.0] # Hours per week
attendance_rate = [0.95, 0.82, 0.98, 0.90, 0.93, 0.78, 0.96, 0.87] # Decimal (0.95 = 95%)
extracurriculars = [2, 0, 3, 1, 2, 0, 2, 1] # Number of activities
print("=== COMPREHENSIVE STUDENT ANALYSIS ===")
print("Analyzing", len(students), "students using 4 different factors...")
for i in range(len(students)): # Analyze each student individually
    name = students[i]
    score = test_scores[i]
    hours = study_hours[i]
    attendance = attendance_rate[i]
    activities = extracurriculars[i]
    print(f"\n--- Analyzing {name} ---")
    print(f"Test Score: {score}, Study Time: {hours}h, Attendance: {attendance:.0%}, Activities: {activities}")
    # Calculate a "success prediction score" using multiple factors
    success_score = 0 # Start at zero and add points for each factor
    # Factor 1: Test performance (worth up to 40 points - most important!)
    print("Factor 1 - Test Performance:")
    if score >= 90:
        points_earned = 40
        success_score += 40
        print(f" Score {score} >= 90: Excellent! +{points_earned} points")
    elif score >= 80: # Between 80-89
        points_earned = 25
        success_score += 25
        print(f" Score {score} is 80-89: Good! +{points_earned} points")
    else: # Below 80
        points_earned = 10
        success_score += 10
        print(f" Score {score} < 80: Needs work. +{points_earned} points")
    # Factor 2: Study habits (worth up to 30 points)
    print("Factor 2 - Study Habits:")
    if hours >= 3.5:
        points_earned = 30
        success_score += 30
        print(f" {hours} hours >= 3.5: Excellent study habits! +{points_earned} points")
    elif hours >= 2.0: # Between 2.0-3.4 hours
        points_earned = 20
        success_score += 20
        print(f" {hours} hours is 2.0-3.4: Good study habits. +{points_earned} points")
    else: # Less than 2 hours
        points_earned = 5
        success_score += 5
        print(f" {hours} hours < 2.0: Needs more study time. +{points_earned} points")
    # Factor 3: Attendance (worth up to 20 points)
    print("Factor 3 - Attendance:")
    if attendance >= 0.95: # 95% or higher
        points_earned = 20
        success_score += 20
        print(f" {attendance:.0%} >= 95%: Excellent attendance! +{points_earned} points")
    elif attendance >= 0.85: # 85-94%
        points_earned = 15
        success_score += 15
        print(f" {attendance:.0%} is 85-94%: Good attendance. +{points_earned} points")
    else: # Below 85%
        points_earned = 5
        success_score += 5
        print(f" {attendance:.0%} < 85%: Attendance needs improvement. +{points_earned} points")
    # Factor 4: Engagement/Activities (worth up to 10 points)
    print("Factor 4 - Extracurricular Engagement:")
    if activities >= 2:
        points_earned = 10
        success_score += 10
        print(f" {activities} activities >= 2: Well-rounded! +{points_earned} points")
    elif activities >= 1:
        points_earned = 5
        success_score += 5
        print(f" {activities} activity: Some involvement. +{points_earned} points")
    else: # 0 activities
        points_earned = 0
        success_score += 0
        print(f" {activities} activities: Consider joining activities. +{points_earned} points")
    # Generate prediction based on total score (out of 100 possible points)
    print(f"Total Success Score: {success_score}/100")
    if success_score >= 85:
        prediction = "Highly likely to succeed"
        confidence = "High confidence"
    elif success_score >= 65:
        prediction = "Good chance of success"
        confidence = "Moderate confidence"
    else:
        prediction = "May need additional support"
        confidence = "Requires intervention"
    print(f"PREDICTION: {prediction} ({confidence})")
Key Data Science Concepts:
- Weighted scoring – Test scores worth 40 points, study habits 30, attendance 20, activities 10
- Threshold analysis – Different score ranges lead to different conclusions
- Multi-factor prediction – Combining several data types for better accuracy
- Confidence levels – Higher scores = more confident predictions
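For anyone curious how the same weighted scoring could be written more compactly, here is one possible sketch that uses functions, which go a step beyond what this lesson has covered. The helper names points_for and success_score are invented for this illustration; the thresholds and point values are copied from the predictor above.

```python
def points_for(value, tiers, fallback):
    """Return the points for the first threshold that value meets."""
    for threshold, points in tiers:
        if value >= threshold:
            return points
    return fallback  # Value was below every threshold


def success_score(score, hours, attendance, activities):
    """Add up the four weighted factors (maximum 40 + 30 + 20 + 10 = 100)."""
    total = 0
    total += points_for(score, [(90, 40), (80, 25)], 10)         # Test performance
    total += points_for(hours, [(3.5, 30), (2.0, 20)], 5)        # Study habits
    total += points_for(attendance, [(0.95, 20), (0.85, 15)], 5)  # Attendance
    total += points_for(activities, [(2, 10), (1, 5)], 0)        # Activities
    return total


# Three students from the dataset above
emma_total = success_score(92, 4.0, 0.95, 2)
liam_total = success_score(78, 1.5, 0.82, 0)
noah_total = success_score(85, 2.5, 0.90, 1)
print("Emma:", emma_total, "Liam:", liam_total, "Noah:", noah_total)
```

Keeping the thresholds in small lists means a weight can be changed in one place, which is handy when experimenting with how much each factor should matter.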
Challenge 3: Music Streaming Analysis (Advanced)
Your Turn: You work for “TeenTunes” (a fictional music streaming app) and need to analyze user listening habits using multiple analytical techniques!
print("=== TEENTUNES COMPREHENSIVE USER ANALYSIS ===")
# Your dataset: User listening habits
usernames = ["MusicLover23", "PopFan2024", "RockStar_Kid", "StudyBeats", "DanceMachine", "ChillVibes"]
monthly_hours = [45, 67, 23, 89, 52, 34] # Hours listened this month
favorite_genres = ["Pop", "Pop", "Rock", "Lo-fi", "Electronic", "Indie"]
premium_subscriber = [False, True, False, True, True, False] # True = Premium, False = Free
# TODO: ANALYSIS 1 - Basic Dataset Overview
print("=== BASIC DATASET OVERVIEW ===")
print("Total users analyzed:", len(usernames))
print("All monthly listening hours:", monthly_hours)
# STEP 1: Calculate total listening time across all users
total_listening = # Your code here (use sum function)
# STEP 2: Calculate average listening time per user
average_listening = # Your code here (total ÷ number of users)
print(f"Total company listening time: {total_listening} hours")
print(f"Average user listening time: {round(average_listening, 1)} hours per month")
# TODO: ANALYSIS 2 - Power User Identification
print("\n=== ANALYSIS 1: POWER USER IDENTIFICATION ===")
print("Finding users who listen more than 50 hours per month...")
# STEP 3: Create empty lists to categorize users
power_users = [] # Users with > 50 hours
regular_users = [] # Users with <= 50 hours
# STEP 4: Loop through users and categorize them
for i in range(len(usernames)):
    user = usernames[i]
    hours = monthly_hours[i]
    # TODO: Write if/else logic to categorize users
    if # Your condition here (hint: check if hours > 50):
        # Add user to power_users list
        # Print that they're a power user
        print(f"{user}: {hours} hours - POWER USER! ")
    else:
        # Add user to regular_users list
        # Print that they're a regular user
        print(f"{user}: {hours} hours - Regular user")
# Display results
print(f"\nPower Users Identified: {len(power_users)} out of {len(usernames)} total")
print(f"Power users: {power_users}")
print(f"Regular users: {regular_users}")
# TODO: ANALYSIS 3 - Premium vs Free User Comparison
print("\n=== ANALYSIS 2: PREMIUM VS FREE USER BEHAVIOR ===")
# STEP 5: Set up variables to track premium vs free users
premium_total_hours = 0 # Will add up all premium user hours
premium_count = 0 # Will count premium users
free_total_hours = 0 # Will add up all free user hours
free_count = 0 # Will count free users
print("Categorizing users by subscription type:")
# STEP 6: Loop through users and separate by subscription type
for i in range(len(usernames)):
    user = usernames[i]
    hours = monthly_hours[i]
    is_premium = premium_subscriber[i] # True or False
    # TODO: Write if/else to handle premium vs free users
    if # Your condition here (check if is_premium == True):
        # Add hours to premium_total_hours
        # Add 1 to premium_count
        # Print user as PREMIUM
        print(f"{user}: {hours} hours (PREMIUM) ")
    else:
        # Add hours to free_total_hours
        # Add 1 to free_count
        # Print user as FREE
        print(f"{user}: {hours} hours (FREE)")
# STEP 7: Calculate averages (with error checking)
print(f"\nSubscription Analysis Results:")
if premium_count > 0:
    premium_avg = # Your calculation here
    print(f"Premium users ({premium_count} total): {round(premium_avg, 1)} hours average")
else:
    print("No premium users found")
    premium_avg = 0
if free_count > 0:
    free_avg = # Your calculation here
    print(f"Free users ({free_count} total): {round(free_avg, 1)} hours average")
else:
    print("No free users found")
    free_avg = 0
# TODO: ANALYSIS 4 - Genre Popularity
print("\n=== ANALYSIS 3: MUSIC GENRE POPULARITY ===")
print("All user genre preferences:", favorite_genres)
# STEP 8: Count each genre manually
pop_count = 0
rock_count = 0
lofi_count = 0
electronic_count = 0
indie_count = 0
print("Counting genre preferences:")
# STEP 9: Loop through favorite_genres and count each one
for i in range(len(usernames)):
    user = usernames[i]
    genre = favorite_genres[i]
    # TODO: Write if/elif statements to count each genre
    if # genre equals "Pop":
        # Add 1 to pop_count
    elif # genre equals "Rock":
        # Add 1 to rock_count
    elif # genre equals "Lo-fi":
        # Add 1 to lofi_count
    elif # genre equals "Electronic":
        # Add 1 to electronic_count
    elif # genre equals "Indie":
        # Add 1 to indie_count
    print(f"{user} likes {genre}")
# STEP 10: Display results and find most popular
print(f"\nGenre Popularity Results:")
print(f"Pop: {pop_count} users")
print(f"Rock: {rock_count} users")
print(f"Lo-fi: {lofi_count} users")
print(f"Electronic: {electronic_count} users")
print(f"Indie: {indie_count} users")
# TODO: Find which genre is most popular
# STEP 11: Create a list of all counts and find the maximum
all_counts = [pop_count, rock_count, lofi_count, electronic_count, indie_count]
genre_names = ["Pop", "Rock", "Lo-fi", "Electronic", "Indie"]
max_count = # Your code here (use max function)
# Find which genre has the max count
print(f"\nMost popular genre analysis:")
for i in range(len(genre_names)):
    if all_counts[i] == max_count:
        most_popular_genre = genre_names[i]
        print(f"🎵 Winner: {genre_names[i]} with {max_count} users!")
# TODO: FINAL BUSINESS INSIGHTS
print("\n" + "="*50)
print("=== COMPREHENSIVE BUSINESS INSIGHTS FOR TEENTUNES ===")
print("="*50)
# STEP 12: Calculate key business metrics
power_user_percentage = # Your calculation: (power users ÷ total users) × 100
conversion_rate = # Your calculation: (premium users ÷ total users) × 100
popular_genre_percentage = # Your calculation: (max genre count ÷ total users) × 100
print(f" USER BASE: {len(usernames)} total users analyzed")
print(f" ENGAGEMENT: {len(power_users)} power users ({round(power_user_percentage, 1)}% of base)")
print(f" REVENUE: {premium_count} premium subscribers ({round(conversion_rate, 1)}% conversion rate)")
print(f" USAGE: Premium users average {premium_avg:.1f}h vs Free users {free_avg:.1f}h")
print(f" CONTENT: {most_popular_genre} is most popular genre ({round(popular_genre_percentage, 1)}% of users)")
# STEP 13: Generate strategic recommendations
print(f"\n STRATEGIC RECOMMENDATIONS:")
# TODO: Write if/else logic for recommendations
if premium_avg > free_avg:
    print(f"• Premium users are more engaged - highlight exclusive content in marketing")
else:
    print(f"• Free users are highly engaged - focus on converting them to premium")
print(f"• Expand {most_popular_genre} music library to attract more users")
print(f"• Target power users for premium upgrades and exclusive features")
Major Hints to Get You Started:
# Hint for totals and averages:
total_listening = sum(monthly_hours)
average_listening = total_listening / len(monthly_hours)
# Hint for power user categorization:
if hours > 50:
    power_users.append(user)
else:
    regular_users.append(user)
# Hint for premium vs free:
if is_premium == True:
    premium_total_hours += hours
    premium_count += 1
else:
    free_total_hours += hours
    free_count += 1
# Hint for calculating averages:
premium_avg = premium_total_hours / premium_count
free_avg = free_total_hours / free_count
# Hint for genre counting:
if genre == "Pop":
    pop_count += 1
elif genre == "Rock":
    rock_count += 1
# Continue for other genres...
# Hint for percentages:
power_user_percentage = (len(power_users) / len(usernames)) * 100
Advanced Concepts You’re Mastering:
- Multiple analysis workflows in a single program
- Data accumulation and counter variables for complex calculations
- Business metrics calculation (conversion rates, percentages, averages)
- Comparative analysis between user segments
- Error handling with conditional checks before calculations
- Strategic insight generation from multiple data points
Extension Challenges (If You Finish Early):
- Add user satisfaction analysis: Create a new list with satisfaction ratings and analyze happy vs unhappy users
- Find the most valuable user: Combine premium status and listening hours to identify the most valuable customer
- Create user profiles: Write code that generates a complete profile for each user including all their metrics
- Build a recommendation engine: Based on listening hours and genre, recommend whether each user should upgrade to premium
INSIGHT DAY 2
Theme of the day: social media advice.
Opening Question:
“Raise your hand if you’ve ever watched a video that had millions of views and thought ‘I could make something better than this!’ Today we’re going to figure out what actually makes videos go viral and which platform gives creators the best shot at fame.”
Today’s Mission:
“By the end of today, you’ll be able to analyze real data from 500+ viral videos and give advice to content creators about platform strategy, optimal posting times, and what types of content actually work!”
Session 1: From Lists to CSV Files
Concept Introduction: The CSV File Format
Explanation: “Yesterday you worked with small lists that fit in a few lines of code. But what if Netflix wants to analyze data from 260 million users? Or TikTok wants to understand patterns from billions of videos? That’s where CSV files come in!”
What is a CSV file?
- CSV = Comma Separated Values
- Like a spreadsheet, but in plain text
- Each row is one record (like one video)
- Each column is one piece of information (like view count)
Show students a simple example:
video_title,platform,views,likes,creator
"Dancing Cat",TikTok,2500000,180000,CatLover23
"Study Tips",YouTube,890000,45000,StudyBuddy
"Cooking Hack",Instagram,1200000,75000,ChefLife
Why data scientists love CSV files:
- Can handle millions of rows
- Work with any programming language
- Easy to share between different tools
- Industry standard for data analysis
Concept: Reading CSV Files with Python
New Skills We’ll Learn:
- import csv – Brings in Python’s built-in CSV reading tools
- csv.DictReader() – Reads each row of a CSV file as a dictionary (like our lists from yesterday, but bigger!)
- File handling – Opening and closing files safely
Link to CSV file on Social Media Videos: HERE!!!!!
Step-by-Step Code Example:
# Step 1: Import the CSV library
import csv
# Step 2: Create an empty list to store all our video data
viral_videos = []
# Step 3: Open the CSV file and read it
# newline='' is what Python's csv documentation recommends when opening CSV files
with open('viral_videos_2024.csv', 'r', newline='') as file:
    # csv.DictReader turns each row into a dictionary
    # This means we can access data by column name instead of position
    reader = csv.DictReader(file)
    # Step 4: Loop through each row and add it to our list
    for row in reader:
        viral_videos.append(row) # Each row is a dictionary with video info
# Step 5: Check what we loaded
print(f"SUCCESS! Loaded {len(viral_videos)} viral videos for analysis!")
# Step 6: Look at the first video to understand our data
print("\nFirst video in our dataset:")
first_video = viral_videos[0]
print(f"Title: {first_video['title']}")
print(f"Platform: {first_video['platform']}")
print(f"Views: {first_video['views']}")
print(f"Likes: {first_video['likes']}")
print(f"Creator: {first_video['creator_name']}")
Now instead of student_names[0] we use viral_videos[0]['creator_name'] – we access data by meaningful names instead of position numbers!
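If you want to try name-based access before downloading the dataset, the sketch below reads the three sample rows shown earlier straight from a string instead of a file. It assumes the sample's column names (video_title and creator), which differ from the title and creator_name columns used with the full dataset; io.StringIO simply makes the text behave like an open file.

```python
import csv
import io

# The three sample rows from earlier, stored as plain text
csv_text = '''video_title,platform,views,likes,creator
"Dancing Cat",TikTok,2500000,180000,CatLover23
"Study Tips",YouTube,890000,45000,StudyBuddy
"Cooking Hack",Instagram,1200000,75000,ChefLife
'''

# io.StringIO lets csv.DictReader read the string as if it were a file
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)  # Each row is a dictionary keyed by column name

for row in rows:
    print(row["video_title"], "on", row["platform"], "-", row["views"], "views")
```

Notice that the quotes around the titles are removed by the reader, and that the view counts come back as text, not numbers.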
Code Example: Basic Data Exploration
Before analyzing, let’s learn data exploration functions:
- len() – How many videos do we have?
- .keys() – What information is available for each video?
- Data type conversion – CSV files read everything as text, but we need numbers for math
print("=== VIRAL VIDEO DATASET EXPLORATION ===")
# Basic dataset information
print(f"Total videos in dataset: {len(viral_videos)}")
print(f"Data available for each video: {list(viral_videos[0].keys())}")
# Look at a few example videos to understand our data
print("\n=== SAMPLE VIDEOS ===")
for i in range(3): # Show first 3 videos
    video = viral_videos[i]
    print(f"\nVideo {i+1}:")
    print(f" Title: {video['title']}")
    print(f" Platform: {video['platform']}")
    print(f" Views: {video['views']}") # This is still text!
    print(f" Creator: {video['creator_name']}")
# IMPORTANT: Convert text numbers to actual numbers for calculations
print("\n=== CONVERTING DATA FOR ANALYSIS ===")
# Method 1: Convert individual values
first_video = viral_videos[0]
views_as_text = first_video['views'] # This is text like "2500000"
views_as_number = int(first_video['views']) # Now it's a number: 2500000
print(f"Views as text: '{views_as_text}' (type: {type(views_as_text)})")
print(f"Views as number: {views_as_number} (type: {type(views_as_number)})")
print(f"Now we can do math: {views_as_number} + 1000 = {views_as_number + 1000}")
# Method 2: Convert data for the whole dataset (we'll use this approach)
print("\nConverting all numerical data...")
for video in viral_videos:
    # Convert text numbers to integers for calculations
    video['views'] = int(video['views'])
    video['likes'] = int(video['likes'])
    video['comments'] = int(video['comments'])
    video['shares'] = int(video['shares'])
    video['follower_count'] = int(video['follower_count'])
    video['video_length_seconds'] = int(video['video_length_seconds'])
print("Data conversion complete! Ready for analysis.")
# Now we can do real data science!
print("\n=== FIRST ANALYSIS: BASIC STATISTICS ===")
# Extract all view counts into a list (like yesterday's work!)
all_view_counts = []
for video in viral_videos:
    all_view_counts.append(video['views'])
# Calculate statistics using functions from yesterday
total_views = sum(all_view_counts)
average_views = total_views / len(all_view_counts)
highest_views = max(all_view_counts)
lowest_views = min(all_view_counts)
print(f"Total views across all viral videos: {total_views:,}")
print(f"Average views per viral video: {round(average_views):,}")
print(f"Most viral video had: {highest_views:,} views")
print(f"Least viral video had: {lowest_views:,} views")
# Find the most viral video
for video in viral_videos:
    if video['views'] == highest_views:
        print(f"\nMOST VIRAL VIDEO:")
        print(f" Title: '{video['title']}'")
        print(f" Platform: {video['platform']}")
        print(f" Creator: {video['creator_name']}")
        break
Key Learning Points:
- CSV files store everything as text – we must convert to numbers for math
- Dictionary access – video['views'] instead of position numbers
- Data exploration first – always understand your data before analyzing
- Same functions, bigger data – sum(), max(), len() work the same way
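Real CSV files sometimes contain blank or messy cells, and int() stops the program with a ValueError when it meets one. The sketch below shows one possible way to convert more defensively. The helper name to_int, the fallback value of 0, and the sample row are assumptions made for this illustration; whether 0 is a sensible fallback depends on the analysis.

```python
def to_int(text, default=0):
    """Convert CSV text to an int, using default when the text is not a whole number."""
    try:
        return int(text)
    except (ValueError, TypeError):
        return default


# Hypothetical row with one blank cell, as it might come out of csv.DictReader
sample_row = {"views": "2500000", "likes": " 180000 ", "comments": ""}

cleaned = {}
for column in sample_row:
    cleaned[column] = to_int(sample_row[column])

print(cleaned)
```

Here int() accepts the extra spaces around the likes value on its own, and the blank comments cell falls back to 0 instead of crashing.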
Challenge 1: Platform Comparison Basics
Your Turn: Use what you just learned to compare TikTok vs YouTube vs Instagram!
print("=== PLATFORM BATTLE: TIKTOK VS YOUTUBE VS INSTAGRAM ===")
# TODO: STEP 1 - Count videos by platform
tiktok_count = 0
youtube_count = 0
instagram_count = 0
# Loop through all videos and count each platform
for video in viral_videos:
    platform = video['platform']
    # TODO: Write if/elif statements to count each platform
    if # Your condition here (check if platform equals "TikTok"):
        # Add 1 to tiktok_count
    elif # Your condition here (check if platform equals "YouTube"):
        # Add 1 to youtube_count
    elif # Your condition here (check if platform equals "Instagram"):
        # Add 1 to instagram_count
print(f"TikTok videos: {tiktok_count}")
print(f"YouTube videos: {youtube_count}")
print(f"Instagram videos: {instagram_count}")
# TODO: STEP 2 - Calculate total views by platform
tiktok_total_views = 0
youtube_total_views = 0
instagram_total_views = 0
# Loop through videos again and add up views for each platform
for video in viral_videos:
    platform = video['platform']
    views = video['views']
    # TODO: Write if/elif statements to add views to platform totals
    if # platform == "TikTok":
        # Add views to tiktok_total_views
    elif # platform == "YouTube":
        # Add views to youtube_total_views
    elif # platform == "Instagram":
        # Add views to instagram_total_views
print(f"\nTotal views by platform:")
print(f"TikTok: {tiktok_total_views:,} views")
print(f"YouTube: {youtube_total_views:,} views")
print(f"Instagram: {instagram_total_views:,} views")
# TODO: STEP 3 - Calculate average views per video by platform
# (Only calculate if we have videos for that platform)
if tiktok_count > 0:
    tiktok_avg = # Your calculation here
    print(f"TikTok average views per video: {round(tiktok_avg):,}")
if youtube_count > 0:
    youtube_avg = # Your calculation here
    print(f"YouTube average views per video: {round(youtube_avg):,}")
if instagram_count > 0:
    instagram_avg = # Your calculation here
    print(f"Instagram average views per video: {round(instagram_avg):,}")
# TODO: STEP 4 - Determine the winner
print(f"\n PLATFORM BATTLE RESULTS:")
print(f"Most videos: # Determine which platform has most videos")
print(f"Most total views: # Determine which platform has most total views")
print(f"Highest average views: # Determine which platform has highest average")
Hints:
# Hint for counting platforms:
if platform == "TikTok":
    tiktok_count += 1
# Hint for calculating averages:
tiktok_avg = tiktok_total_views / tiktok_count
# Hint for finding winners:
if tiktok_count > youtube_count and tiktok_count > instagram_count:
    print("Most videos: TikTok")
What You’re Learning:
- Working with real datasets (500+ videos vs. yesterday’s 5-6 items)
- Data type conversion (text to numbers for calculations)
- Dictionary access (using meaningful names instead of positions)
- Platform comparison (business intelligence basics)
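Once the if/elif version works, it can be interesting to see how a dictionary could keep one counter per platform without naming each platform in the code. This is an optional sketch of that alternative technique, not the expected challenge solution. The sample_videos list is a tiny made-up stand-in for the real dataset, with views already converted to numbers.

```python
# Tiny made-up stand-in for viral_videos
sample_videos = [
    {"platform": "TikTok", "views": 2500000},
    {"platform": "YouTube", "views": 890000},
    {"platform": "Instagram", "views": 1200000},
    {"platform": "TikTok", "views": 400000},
]

counts = {}       # platform name -> number of videos
total_views = {}  # platform name -> views added up

for video in sample_videos:
    platform = video["platform"]
    # .get(platform, 0) returns 0 the first time a platform appears
    counts[platform] = counts.get(platform, 0) + 1
    total_views[platform] = total_views.get(platform, 0) + video["views"]

for platform in counts:
    average = total_views[platform] / counts[platform]
    print(f"{platform}: {counts[platform]} videos, {round(average):,} average views")
```

A side benefit of this approach is that a platform missing from the if/elif chain would still be counted.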
Session 2: Advanced Viral Video Analytics
Concept: Advanced Data Filtering and Analysis
New Skills for Real Data Science:
- Multiple conditions – Finding videos that meet several criteria
- Engagement rate calculations – Converting raw numbers to meaningful metrics
- Data categorization – Grouping videos by performance levels
- Temporal analysis – Understanding time-based patterns
Key Metrics Content Creators Care About:
- Engagement Rate = (Likes + Comments) ÷ Views
- Viral Threshold = The average view count; videos above it count as viral
- Platform Success Rate = Percentage of videos that go viral by platform
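The three metrics above can be sketched in a few lines. This is only an illustration: the sample data is invented, and it assumes the viral threshold is the average view count across all videos, which is one reasonable reading of "above average performance".

```python
# Invented sample data using the same keys as the course dataset
sample_videos = [
    {"platform": "TikTok", "views": 900, "likes": 90, "comments": 10},
    {"platform": "TikTok", "views": 100, "likes": 5, "comments": 0},
    {"platform": "YouTube", "views": 500, "likes": 20, "comments": 5},
    {"platform": "YouTube", "views": 100, "likes": 2, "comments": 0},
]

# Engagement Rate = (Likes + Comments) / Views
# (every sample video has views > 0, so no division-by-zero check is needed here)
for video in sample_videos:
    video["engagement_rate"] = (video["likes"] + video["comments"]) / video["views"]

# Viral Threshold: assumed here to be the average view count
viral_threshold = sum(v["views"] for v in sample_videos) / len(sample_videos)

# Platform Success Rate = viral videos / all videos, per platform
totals = {}
virals = {}
for video in sample_videos:
    platform = video["platform"]
    totals[platform] = totals.get(platform, 0) + 1
    if video["views"] > viral_threshold:
        virals[platform] = virals.get(platform, 0) + 1

print(f"Viral threshold: {viral_threshold:,.0f} views")
for platform in totals:
    success_rate = virals.get(platform, 0) / totals[platform]
    print(f"{platform}: {success_rate:.0%} of videos are above the threshold")
```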
Code Example: Content Creator Intelligence
print("=== CONTENT CREATOR INTELLIGENCE REPORT ===")
# Advanced Analysis 1: Calculate engagement rates
print("Calculating engagement rates for all videos...")
for video in viral_videos:
views = video['views']
likes = video['likes']
comments = video['comments']
# Calculate engagement rate (likes + comments) / views
if views > 0: # Avoid division by zero
engagement_rate = (likes + comments) / views
video['engagement_rate'] = round(engagement_rate, 4) # Store for later use
else:
video['engagement_rate'] = 0
print("Engagement rates calculated!")
# Advanced Analysis 2: Find top performing videos by engagement
print("\n=== TOP 5 MOST ENGAGING VIDEOS ===")
# Sort the videos themselves by engagement rate and keep the top 5.
# Sorting whole videos (not just the rates) avoids printing the same
# video twice when two videos share an engagement rate.
top_5_videos = sorted(viral_videos, key=lambda video: video['engagement_rate'], reverse=True)[:5]
top_5_rates = [video['engagement_rate'] for video in top_5_videos]
print("Highest engagement rates found:", top_5_rates)
# Display the top 5 videos
for video in top_5_videos:
    print(f"\n Engagement Rate: {video['engagement_rate']:.1%}")
    print(f" Title: '{video['title']}'")
    print(f" Platform: {video['platform']}")
    print(f" Views: {video['views']:,}")
    print(f" Likes: {video['likes']:,}")
    print(f" Comments: {video['comments']:,}")
# Advanced Analysis 3: Video length analysis
print("\n=== VIDEO LENGTH ANALYSIS ===")
# Categorize videos by length
short_videos = [] # Under 60 seconds
medium_videos = [] # 60-300 seconds (1-5 minutes)
long_videos = [] # Over 300 seconds (5+ minutes)
for video in viral_videos:
length = video['video_length_seconds']
if length < 60:
short_videos.append(video)
elif length <= 300: # 60-300 seconds
medium_videos.append(video)
else: # Over 300 seconds
long_videos.append(video)
print(f"Short videos (< 1 min): {len(short_videos)} videos")
print(f"Medium videos (1-5 min): {len(medium_videos)} videos")
print(f"Long videos (5+ min): {len(long_videos)} videos")
# Calculate average views by video length category
if len(short_videos) > 0:
short_avg_views = sum(video['views'] for video in short_videos) / len(short_videos)
print(f"Short videos average views: {round(short_avg_views):,}")
if len(medium_videos) > 0:
medium_avg_views = sum(video['views'] for video in medium_videos) / len(medium_videos)
print(f"Medium videos average views: {round(medium_avg_views):,}")
if len(long_videos) > 0:
long_avg_views = sum(video['views'] for video in long_videos) / len(long_videos)
print(f"Long videos average views: {round(long_avg_views):,}")
# Advanced Analysis 4: Creator success analysis
print("\n=== CREATOR SUCCESS ANALYSIS ===")
# Find creators with multiple viral videos
creator_video_counts = {} # Dictionary to count videos per creator
for video in viral_videos:
creator = video['creator_name']
if creator in creator_video_counts:
creator_video_counts[creator] += 1 # Add to existing count
else:
creator_video_counts[creator] = 1 # First video for this creator
# Find creators with more than 1 viral video
successful_creators = []
for creator, count in creator_video_counts.items():
    if count > 1:
        successful_creators.append((creator, count))
# Sort so the creators with the most viral videos come first
successful_creators.sort(key=lambda pair: pair[1], reverse=True)
print(f"Creators with multiple viral videos: {len(successful_creators)}")
for creator, count in successful_creators[:5]: # Show top 5
    print(f" {creator}: {count} viral videos")
Key Advanced Concepts:
- Engagement rate calculation – Converting raw data to meaningful metrics
- Data sorting and filtering – Finding top performers
- Multiple categorization – Video length analysis
- Dictionary usage for counting – Track creator success
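The count-with-a-dictionary pattern is so common that Python's standard library includes a shortcut for it, `collections.Counter`. A minimal sketch, using invented creator names rather than the course dataset:

```python
from collections import Counter

# Invented creator names, one entry per viral video
creator_names = ["mia", "leo", "mia", "zoe", "leo", "mia"]

# Counter builds the name -> count dictionary in one step
creator_video_counts = Counter(creator_names)

# most_common() returns (name, count) pairs sorted from highest count to lowest
for creator, count in creator_video_counts.most_common():
    if count > 1:
        print(f"{creator}: {count} viral videos")
```

Writing the `if ... in ... else` version by hand is still worth practising, because it shows exactly what `Counter` does behind the scenes.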
Challenge 2: Ultimate Platform Strategy Analysis
Your Mission: You’re consulting for a new content creator who wants to know: “Which platform should I focus on, what type of content should I make, and when should I post?” Use data analysis to give them evidence-based advice!
print("=== PLATFORM STRATEGY CONSULTING PROJECT ===")
print("Analyzing 500+ viral videos to develop content strategy...")
# TODO: ANALYSIS 1 - Platform Performance Comparison (20 minutes)
print("\n=== ANALYSIS 1: PLATFORM PERFORMANCE DEEP DIVE ===")
# Step 1: Calculate average engagement rate by platform
tiktok_engagement_rates = []
youtube_engagement_rates = []
instagram_engagement_rates = []
# TODO: Loop through videos and collect engagement rates by platform
for video in viral_videos:
platform = video['platform']
engagement = video['engagement_rate']
# TODO: Add engagement rate to appropriate platform list
if # platform == "TikTok":
# Add engagement to tiktok_engagement_rates list
elif # platform == "YouTube":
# Add engagement to youtube_engagement_rates list
elif # platform == "Instagram":
# Add engagement to instagram_engagement_rates list
# Calculate average engagement by platform
if len(tiktok_engagement_rates) > 0:
tiktok_avg_engagement = # Your calculation here
print(f"TikTok average engagement rate: {tiktok_avg_engagement:.1%}")
if len(youtube_engagement_rates) > 0:
youtube_avg_engagement = # Your calculation here
print(f"YouTube average engagement rate: {youtube_avg_engagement:.1%}")
if len(instagram_engagement_rates) > 0:
instagram_avg_engagement = # Your calculation here
print(f"Instagram average engagement rate: {instagram_avg_engagement:.1%}")
# TODO: ANALYSIS 2 - Content Category Success Analysis (25 minutes)
print("\n=== ANALYSIS 2: CONTENT CATEGORY ANALYSIS ===")
# Available categories in dataset: "Comedy", "Educational", "Music", "Dance", "Lifestyle", "Gaming"
category_performance = {}
# TODO: Calculate average views by content category
for video in viral_videos:
category = video['category']
views = video['views']
# TODO: Track views by category
if category in category_performance:
# Add this video's data to existing category
category_performance[category]['total_views'] += views
category_performance[category]['video_count'] += 1
else:
# Create new category entry
category_performance[category] = {
'total_views': views,
'video_count': 1
}
# Calculate and display average views by category
print("Content category performance:")
for category, data in category_performance.items():
avg_views = data['total_views'] / data['video_count']
print(f"{category}: {round(avg_views):,} avg views ({data['video_count']} videos)")
# TODO: ANALYSIS 3 - Optimal Video Length Analysis (15 minutes)
print("\n=== ANALYSIS 3: OPTIMAL VIDEO LENGTH STRATEGY ===")
# Compare average views by platform and video length
print("Average views by video length and platform:")
platforms = ["TikTok", "YouTube", "Instagram"]
for platform in platforms:
print(f"\n{platform} Analysis:")
platform_short = []
platform_medium = []
platform_long = []
# TODO: Categorize videos by length for this platform
for video in viral_videos:
if video['platform'] == platform:
length = video['video_length_seconds']
views = video['views']
# TODO: Add to appropriate length category
if # length < 60:
# Add views to platform_short
elif # length <= 300:
# Add views to platform_medium
else:
# Add views to platform_long
# Calculate averages for each length category
if len(platform_short) > 0:
short_avg = sum(platform_short) / len(platform_short)
print(f" Short videos (< 1 min): {round(short_avg):,} avg views")
if len(platform_medium) > 0:
medium_avg = sum(platform_medium) / len(platform_medium)
print(f" Medium videos (1-5 min): {round(medium_avg):,} avg views")
if len(platform_long) > 0:
long_avg = sum(platform_long) / len(platform_long)
print(f" Long videos (5+ min): {round(long_avg):,} avg views")
# TODO: ANALYSIS 4 - Creator Success Factor Analysis (15 minutes)
print("\n=== ANALYSIS 4: CREATOR SUCCESS FACTORS ===")
# Analyze relationship between follower count and video performance
small_creators = [] # Under 10,000 followers
medium_creators = [] # 10,000 - 100,000 followers
large_creators = [] # Over 100,000 followers
# TODO: Categorize creators by follower count and track their video performance
for video in viral_videos:
followers = video['follower_count']
views = video['views']
# TODO: Add video views to appropriate creator size category
if # followers < 10000:
# Add views to small_creators
elif # followers <= 100000:
# Add views to medium_creators
else:
# Add views to large_creators
# Calculate average views by creator size
print("Video performance by creator size:")
if len(small_creators) > 0:
small_avg = sum(small_creators) / len(small_creators)
print(f"Small creators (< 10K followers): {round(small_avg):,} avg views")
if len(medium_creators) > 0:
medium_avg = sum(medium_creators) / len(medium_creators)
print(f"Medium creators (10K-100K followers): {round(medium_avg):,} avg views")
if len(large_creators) > 0:
large_avg = sum(large_creators) / len(large_creators)
print(f"Large creators (100K+ followers): {round(large_avg):,} avg views")
# TODO: FINAL RECOMMENDATIONS - Synthesize all analysis into advice
print("\n" + "="*60)
print("=== FINAL CONTENT CREATOR STRATEGY RECOMMENDATIONS ===")
print("="*60)
# TODO: Generate data-driven recommendations based on your analysis
print("PLATFORM RECOMMENDATION:")
print("Based on engagement rate analysis: # Your recommendation here")
print("\n CONTENT STRATEGY:")
print("Based on category performance: # Your recommendation here")
print("\n VIDEO LENGTH STRATEGY:")
print("Based on length analysis: # Your recommendation here")
print("\n GROWTH STRATEGY:")
print("Based on creator success factors: # Your recommendation here")
print("\n KEY INSIGHTS:")
print("• # Insight 1 from your analysis")
print("• # Insight 2 from your analysis")
print("• # Insight 3 from your analysis")
Major Hints to Get You Started:
# Hint for collecting engagement rates by platform:
if platform == "TikTok":
tiktok_engagement_rates.append(engagement)
# Hint for calculating average engagement:
tiktok_avg_engagement = sum(tiktok_engagement_rates) / len(tiktok_engagement_rates)
# Hint for category performance tracking:
if category in category_performance:
category_performance[category]['total_views'] += views
category_performance[category]['video_count'] += 1
else:
category_performance[category] = {'total_views': views, 'video_count': 1}
# Hint for video length categorization:
if length < 60:
platform_short.append(views)
elif length <= 300:
platform_medium.append(views)
else:
platform_long.append(views)
Advanced Concepts You’re Mastering:
- Multi-dimensional analysis – Platform × content type × video length
- Business intelligence – Converting data into actionable recommendations
- Dictionary data structures – Advanced data organization
- Comparative analysis – Understanding relative performance across categories
- Strategic thinking – Synthesizing multiple analyses into coherent advice
Extension Challenges (If You Finish Early):
- Viral Prediction Model: Create a scoring system to predict if a video will go viral
- Seasonal Analysis: Analyze if certain content performs better at different times
- Cross-Platform Creator Analysis: Find creators successful on multiple platforms
- Engagement Quality Analysis: Compare like-to-comment ratios across platforms
INSIGHT DAY 3
Session 1: Becoming a Data Detective
Concept Introduction: Where Real Data Lives
The Internet is Full of Free, Amazing Datasets:
- Government websites – Weather, population, crime, economics
- Sports databases – Every game, player, and stat you can imagine
- Entertainment platforms – Movie ratings, music charts, social media trends
- Science repositories – NASA data, health studies, environmental monitoring
Where Professional Data Scientists Find Data:
- Kaggle.com – The world’s largest data science community (a huge library of public datasets)
- Data.gov.ie – Official Irish government data portal (Irish context!)
- CSO.ie – Central Statistics Office Ireland (population, economy, society)
- Data.gov – US government data (for global comparisons)
- Google Dataset Search – Search engine specifically for datasets
- Reddit r/datasets – Community-shared interesting datasets
What Makes a Good Dataset for Visualization:
- Size: At least 50-100 rows (enough to find patterns)
- Clean format: CSV files are easiest to work with
- Interesting variables: Multiple columns you can compare
- Recent data: 2020-2024 is ideal for relevance
- Clear documentation: You understand what each column means
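Before committing to a dataset, you can check a few of these points with code. The sketch below is one possible approach, not a required step: the helper name `check_dataset` and the tiny in-memory CSV are invented for illustration, and the numbers in it are not real statistics.

```python
import csv
import io

def check_dataset(file_obj, min_rows=50):
    """Return a small report on row count, columns, and empty cells."""
    rows = list(csv.DictReader(file_obj))
    columns = list(rows[0].keys()) if rows else []
    empty_cells = {}
    for column in columns:
        # Count cells that are missing or contain only spaces
        empty_cells[column] = sum(1 for row in rows if not (row[column] or "").strip())
    return {
        "row_count": len(rows),
        "enough_rows": len(rows) >= min_rows,
        "columns": columns,
        "empty_cells": empty_cells,
    }

# Tiny in-memory CSV so the sketch runs without a downloaded file
sample_csv = io.StringIO("county,population\nDublin,100\nCork,\nGalway,30\n")
report = check_dataset(sample_csv, min_rows=3)
print(report)
```

With a real download, you would pass an open file instead, for example `with open('my_file.csv', newline='', encoding='utf-8') as f: print(check_dataset(f))`.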
Quick Dataset Discovery Demo
Irish Data Sources:
- data.gov.ie: Search “population by county”, “transport dublin”, “weather stations”
- cso.ie: Look for “education statistics”, “housing data”, “employment figures”
Global Sources:
- Kaggle: Search “spotify”, “netflix”, “fifa players”, “food ratings”
- Data.gov: Search for interesting US datasets to compare with Irish data
Challenge 1: The Numbers Hunter
Challenge Overview:
In this first challenge, you’ll become a “Numbers Hunter” – learning to find and analyze datasets rich with numerical data. This skill is fundamental because statistical analysis is the backbone of data science.
What You’ll Learn:
- How to identify high-quality numerical datasets
- Basic statistical analysis (averages, ranges, maximums)
- Data loading and error handling techniques
- How to extract meaningful insights from numbers
Building Toward Presentation:
The statistics you calculate here will become key talking points in your final data story presentation at the end of class today. You’ll be able to say things like “I discovered that the average temperature in Dublin has increased by 2.3 degrees over the past decade” or “The highest-scoring player had 50% more points than the league average.”
Your Mission:
Find and master a dataset that’s PACKED with numbers – perfect for calculating statistics and finding patterns.
What You’re Looking For:
- Sports data: Player statistics, team scores, championship results
- Financial data: Stock prices, cryptocurrency values, economic indicators
- Weather data: Temperature records, rainfall, climate measurements
- Health data: Nutrition facts, fitness tracking, medical statistics
Irish Options: CSO.ie weather data, Dublin transport passenger numbers, county population figures
Global Options: Kaggle sports datasets, financial market data, NASA climate data
Your Dataset Must Have:
- At least 100 rows of data
- Multiple numeric columns (ratings, scores, prices, measurements)
- Data you can calculate averages, maximums, and trends from
Step 1: Find and Download Your Numbers Dataset
Go to one of the data sources and find a dataset with lots of numbers. Download the CSV file.
Step 2: Load and Analyze Your Numbers
print("CHALLENGE 1: NUMBERS HUNTER")
import csv
# Load your numbers-heavy dataset
dataset = []
filename = 'your_numbers_dataset.csv' # Replace with your file
try:
with open(filename, 'r', encoding='utf-8') as file:
reader = csv.DictReader(file)
dataset = [row for row in reader]
print(f"Loaded {len(dataset)} records!")
except Exception as e:
print(f"Error: {e}")
# Analyze your numeric columns
if dataset:
    print(f"Columns in your dataset: {list(dataset[0].keys())}")
    # Pick a numeric column and calculate basic statistics
    numeric_column = 'REPLACE_WITH_YOUR_NUMERIC_COLUMN' # e.g., 'score', 'price', 'temperature'
    values = []
    for record in dataset:
        try:
            value = float(record[numeric_column])
            values.append(value)
        except (KeyError, TypeError, ValueError):
            pass # Skip missing or non-numeric values
    if values:
        average = sum(values) / len(values)
        maximum = max(values)
        minimum = min(values)
        print(f"{numeric_column} Statistics:")
        print(f" Average: {average:.2f}")
        print(f" Highest: {maximum}")
        print(f" Lowest: {minimum}")
        print(f" Range: {maximum - minimum}")
    else:
        print(f"No numeric values found in '{numeric_column}'. Check the column name!")
Success Goal:
Calculate at least 3 meaningful statistics from your numeric data
Why This Matters:
Numbers are the foundation of data science – you need to be comfortable with statistical analysis
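If you want a statistic beyond average, highest, and lowest, Python's built-in `statistics` module offers a few more. A short sketch with invented values standing in for one numeric column:

```python
import statistics

# Invented values standing in for one numeric column (48.0 is an outlier)
values = [12.0, 15.0, 11.0, 14.0, 48.0]

mean_value = statistics.mean(values)      # the usual average
median_value = statistics.median(values)  # the middle value when sorted
spread = statistics.stdev(values)         # sample standard deviation

print(f"Mean: {mean_value:.2f}")
print(f"Median: {median_value:.2f}")
print(f"Standard deviation: {spread:.2f}")
```

Notice that the single large value pulls the mean well above the median. When those two numbers disagree a lot, it is usually worth looking for outliers before quoting an "average" in your presentation.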
Challenge 2: The Category Detective
Challenge Overview:
Now you’ll become a “Category Detective” – learning to analyze data with groups and classifications. This is crucial for understanding market segments, user demographics, and comparative analysis.
What You’ll Learn:
- How to identify and work with categorical data
- Grouping and counting techniques
- Distribution analysis (understanding how data is spread across categories)
- Finding the most and least common groups in your data
Building Toward Presentation:
Your category analysis will provide compelling talking points for your presentation. You’ll be able to share discoveries like “Comedy movies made up 35% of all films released, but only 15% were profitable” or “Dublin has 3x more cafes than any other Irish city.”
Your Mission:
Find and analyze a dataset with rich categorical data – perfect for understanding groups and classifications.
What You’re Looking For:
- Entertainment data: Movies by genre, music by artist, books by category
- Geographic data: Cities by country, regions by population, locations by type
- Product data: Items by brand, foods by cuisine, apps by category
- Social data: Posts by platform, users by demographics, content by type
Irish Options: Dublin businesses by type, Irish schools by county, transport routes by mode
Global Options: Netflix shows by genre, Spotify songs by artist, product reviews by brand
Your Dataset Must Have:
- Clear categorical columns (genre, brand, country, type, etc.)
- Multiple categories to compare (at least 5-10 different groups)
- Enough data in each category to make comparisons meaningful
Step 1: Find and Download Your Categories Dataset
Look for a dataset with lots of different categories or groups to analyze.
Step 2: Load and Analyze Your Categories
print("CHALLENGE 2: CATEGORY DETECTIVE")
import csv
# Load your category-rich dataset
dataset = []
filename = 'your_categories_dataset.csv' # Replace with your file
try:
with open(filename, 'r', encoding='utf-8') as file:
reader = csv.DictReader(file)
dataset = [row for row in reader]
print(f"Loaded {len(dataset)} records!")
except Exception as e:
print(f"Error: {e}")
# Analyze your categorical data
if dataset:
    print(f"Columns available: {list(dataset[0].keys())}")
    # Pick a categorical column and count the categories
    category_column = 'REPLACE_WITH_YOUR_CATEGORY_COLUMN' # e.g., 'genre', 'brand', 'country'
    category_counts = {}
    for record in dataset:
        # "or 'Unknown'" also covers empty cells and short rows (where the value is None)
        category = (record.get(category_column) or 'Unknown').strip()
        if category in category_counts:
            category_counts[category] += 1
        else:
            category_counts[category] = 1
    # Sort categories by popularity
    sorted_categories = sorted(category_counts.items(), key=lambda x: x[1], reverse=True)
    print(f"{category_column} Analysis:")
    print(f" Total categories found: {len(category_counts)}")
    print(f" Top 5 categories:")
    for i, (category, count) in enumerate(sorted_categories[:5]):
        print(f" {i+1}. {category}: {count} items")
    # Calculate category distribution
    total_items = sum(category_counts.values())
    top_category_percentage = (sorted_categories[0][1] / total_items) * 100
    print(f" Most common category represents {top_category_percentage:.1f}% of data")
Success Goal:
Identify the top 5 categories and understand the distribution in your data
Why This Matters:
Categories help us group and compare data – essential for business intelligence
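To understand the whole distribution rather than only the top category, you can turn every count into a percentage. A minimal sketch, assuming a `category_counts` dictionary like the one built above (the genres and counts here are invented):

```python
# Invented counts, shaped like the category_counts dictionary
category_counts = {"Comedy": 4, "Drama": 2, "Action": 2}

total_items = sum(category_counts.values())

# Share of the dataset held by each category, largest first
shares = {}
for category, count in sorted(category_counts.items(), key=lambda x: x[1], reverse=True):
    shares[category] = count / total_items * 100
    print(f"{category}: {count} items ({shares[category]:.1f}%)")
```

The percentages should always add up to 100, which makes a handy sanity check on your counting code.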
Challenge 3: The Time Traveler
Challenge Overview:
Your final pre-break challenge makes you a “Time Traveler” – mastering the analysis of data that changes over time. This is essential for understanding trends, making forecasts, and spotting patterns that develop over months or years.
What You’ll Learn:
- How to work with time-series and historical data
- Trend calculation and analysis
- Percentage change calculations over time
- Identifying whether things are improving, declining, or staying stable
Building Toward Presentation:
Time-based insights often provide the most dramatic presentation moments. You’ll be able to reveal trends like “Irish housing prices increased 45% over 5 years” or “The popularity of rock music has declined 60% since 2010, while hip-hop increased 200%.”
Your Mission:
Find and explore a dataset with time-based data – perfect for spotting trends and changes over time.
What You’re Looking For:
- Historical data: Population changes, economic trends, climate over years
- Performance data: Sports results over seasons, stock prices over months
- Usage data: Website traffic over time, app downloads by date, sales by quarter
- Event data: Movie releases by year, song charts over decades
Irish Options: Irish population by year, Dublin weather historical data, housing prices over time
Global Options: Olympic records over decades, box office trends, technology adoption rates
Your Dataset Must Have:
- Date or time columns (years, months, specific dates)
- Data spanning multiple time periods (at least 5+ years or time points)
- Values you can track changes in over time
Step 1: Find and Download Your Time Dataset
Search for historical data that shows changes over time.
Step 2: Load and Analyze Your Time Data
print("CHALLENGE 3: TIME TRAVELER")
import csv
# Load your time-based dataset
dataset = []
filename = 'your_time_dataset.csv' # Replace with your file
try:
with open(filename, 'r', encoding='utf-8') as file:
reader = csv.DictReader(file)
dataset = [row for row in reader]
print(f"Loaded {len(dataset)} records!")
except Exception as e:
print(f"Error: {e}")
# Analyze your time-based data
if dataset:
    print(f"Columns available: {list(dataset[0].keys())}")
    # Pick a time column and a value column to track over time
    time_column = 'REPLACE_WITH_TIME_COLUMN' # e.g., 'year', 'date', 'month'
    value_column = 'REPLACE_WITH_VALUE_COLUMN' # e.g., 'population', 'price', 'score'
    # Group data by time periods
    time_data = {}
    for record in dataset:
        time_period = record.get(time_column) or 'Unknown'
        try:
            value = float(record[value_column])
        except (KeyError, TypeError, ValueError):
            continue # Skip missing or non-numeric values
        if time_period in time_data:
            time_data[time_period].append(value)
        else:
            time_data[time_period] = [value]
    # Calculate averages for each time period
    time_averages = {}
    for time_period, values in time_data.items():
        if values:
            time_averages[time_period] = sum(values) / len(values)
    # Sort by time period and show trends.
    # Time periods are text, so this sorts alphabetically. That works for years
    # and YYYY-MM-DD dates, but not for month names like 'Jan', 'Feb'.
    sorted_times = sorted(time_averages.items())
    print(f"{value_column} over {time_column}:")
    for time_period, avg_value in sorted_times[:10]: # Show first 10 time periods
        print(f" {time_period}: {avg_value:.2f}")
    # Calculate trend (skipped if the first value is 0, to avoid dividing by zero)
    if len(sorted_times) >= 2 and sorted_times[0][1] != 0:
        first_value = sorted_times[0][1]
        last_value = sorted_times[-1][1]
        change = ((last_value - first_value) / first_value) * 100
        if change == 0:
            print(f" Trend: {value_column} did not change over time")
        else:
            trend = "increased" if change > 0 else "decreased"
            print(f" Trend: {value_column} {trend} by {abs(change):.1f}% over time")
Success Goal:
Identify a clear trend in your data over time (increasing, decreasing, or cyclical)
Why This Matters:
Time-based analysis reveals trends and helps predict future patterns
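Comparing only the first and last time period can hide what happened in between. One way to look closer is to calculate the percentage change from each period to the next. A sketch with invented yearly averages, shaped like the `time_averages` dictionary (the name `yearly_averages` and its values are made up):

```python
# Invented yearly averages, shaped like the time_averages dictionary
yearly_averages = {"2019": 200.0, "2020": 220.0, "2021": 209.0, "2022": 250.8}

# Sort the years as numbers so the order is right even though they are stored as text
years = sorted(yearly_averages, key=int)

changes = {}
# zip(years, years[1:]) pairs each year with the one after it
for previous, current in zip(years, years[1:]):
    old = yearly_averages[previous]
    new = yearly_averages[current]
    changes[current] = (new - old) / old * 100  # assumes old is not zero
    print(f"{previous} -> {current}: {changes[current]:+.1f}%")
```

In this made-up series the overall trend is upward, yet one year shows a dip. A detail like that often makes a better talking point than the headline number.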
What You’ve Built
Skills Developed:
- Numbers Skills: Statistical analysis and numeric data handling
- Category Skills: Grouping and classification analysis
- Time Skills: Trend analysis and temporal data understanding
- Three Different Datasets: You now have diverse, real-world data ready for visualization!
Session 2: Bringing Data to Life with Visualizations
Concept: From CSV to Professional Charts
The Visualization Workflow:
- Load your dataset – Read the CSV file into Python
- Explore the data – Understand what you have
- Clean if needed – Handle missing values or formatting issues
- Choose chart types – Bar charts, scatter plots, line charts based on your data
- Create visualizations – Use matplotlib to make professional charts
- Tell the story – What insights does your data reveal?
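The "clean if needed" step often comes down to turning messy text cells into numbers. The helper below is a sketch rather than a complete cleaner: the name `to_number` is invented, and it assumes commas are thousands separators (as in "1,250"), which is not true for every dataset.

```python
def to_number(text):
    """Convert a CSV cell to a float, or return None if it holds no number."""
    if text is None:
        return None
    # Remove surrounding spaces and thousands separators
    cleaned = text.strip().replace(",", "")
    # Remove common currency and percent symbols
    for symbol in ("€", "$", "£", "%"):
        cleaned = cleaned.replace(symbol, "")
    try:
        return float(cleaned)
    except ValueError:
        return None  # Text such as "N/A" or an empty cell

raw_cells = ["1,250", " 42.5 ", "€19.99", "N/A", "", None]
numbers = [to_number(cell) for cell in raw_cells]
print(numbers)
```

Returning `None` for bad cells, instead of 0, keeps missing values from dragging your averages down. Filter them out before calculating, for example with `[n for n in numbers if n is not None]`.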
Chart Types for Different Data:
- Bar Charts: Comparing categories (best movies by genre, top artists by streams)
- Scatter Plots: Finding relationships (budget vs box office, calories vs price)
- Line Charts: Showing trends over time (music popularity by year, sports performance)
- Pie Charts: Showing proportions (market share, category breakdowns)
Code Example: Template for Any Dataset:
print("REAL DATASET VISUALIZATION PROJECT")
import matplotlib.pyplot as plt
import csv
# STEP 1: Load your downloaded dataset
print("Loading your real dataset...")
dataset = []
# Replace 'your_dataset.csv' with your actual filename
# (encoding='utf-8' helps with special characters such as accented names)
with open('your_dataset.csv', 'r', newline='', encoding='utf-8') as file:
    reader = csv.DictReader(file)
    for row in reader:
        dataset.append(row)
print(f"Loaded {len(dataset)} records from your dataset!")
# STEP 2: Explore what you have
print("Dataset exploration:")
if len(dataset) > 0:
first_record = dataset[0]
print(f"Available columns: {list(first_record.keys())}")
print(f"First record: {first_record}")
# STEP 3: Data preparation example
print("Preparing data for visualization...")
# Convert text numbers to actual numbers for calculations
for record in dataset:
# Identify which columns have numbers and convert them
# Example for a movies dataset:
# if 'rating' in record and record['rating']:
# record['rating'] = float(record['rating'])
# if 'box_office' in record and record['box_office']:
# record['box_office'] = int(record['box_office'])
pass
print("Data preparation complete!")
# STEP 4: Create your first visualization
print("Creating visualization 1...")
# Example: Bar chart of top categories
categories = {} # Dictionary to count items by category
for record in dataset:
# Replace 'genre' with an appropriate column from your dataset
category = record.get('genre', 'Unknown') # Use .get() to handle missing values
if category in categories:
categories[category] += 1
else:
categories[category] = 1
# Create bar chart
if categories:
category_names = list(categories.keys())
category_counts = list(categories.values())
plt.figure(figsize=(12, 6))
bars = plt.bar(category_names, category_counts, color=plt.cm.Set3(range(len(category_names))))
# Customize with your dataset's specifics
plt.title('Distribution by Category (Your Real Data)', fontsize=16, fontweight='bold')
plt.xlabel('Category')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')
# Add value labels on bars
for bar, count in zip(bars, category_counts):
plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
str(count), ha='center', fontweight='bold')
plt.tight_layout()
plt.show()
print("First visualization complete!")
Challenge 4: Create Your First Real Data Visualization
Challenge Overview:
Now you’ll transform your data discoveries into professional visualizations. This challenge focuses on creating your first polished chart that could appear in a news article or business presentation.
What You’ll Learn:
- How to create publication-quality bar charts
- Professional chart formatting and labeling
- Data storytelling through visualization
- How to highlight the most important insights in your data
Building Toward Presentation:
This visualization will be the centerpiece of your data story presentation. You’ll show your chart to the class and explain what it reveals. The goal is to create something so clear and compelling that anyone can understand your discovery immediately.
Your Mission:
Load your downloaded dataset and create a professional bar chart showing the most interesting pattern you can find!
print("MY REAL DATASET ANALYSIS")
import matplotlib.pyplot as plt
import csv
# STEP 1 - Load your specific dataset
dataset = []
# Replace 'YOUR_FILENAME.csv' with your actual downloaded file
filename = 'YOUR_FILENAME.csv'
try:
with open(filename, 'r', encoding='utf-8') as file: # encoding helps with special characters
reader = csv.DictReader(file)
for row in reader:
dataset.append(row)
print(f"SUCCESS! Loaded {len(dataset)} records")
except FileNotFoundError:
print(f"ERROR: Could not find {filename}")
print("Make sure your CSV file is in the same folder as this code!")
except Exception as e:
print(f"ERROR loading data: {e}")
# STEP 2 - Explore your data
if len(dataset) > 0:
print(f"Your dataset has these columns: {list(dataset[0].keys())}")
print(f"First few records:")
for i in range(min(3, len(dataset))): # Show first 3 records
print(f"Record {i+1}: {dataset[i]}")
# STEP 3 - Choose what to visualize
# Look at your columns and pick something interesting to count or compare
# Examples:
# - If you have a movies dataset: count by genre, decade, rating range
# - If you have music data: count by artist, year, genre
# - If you have sports data: count by team, position, year
print(f"Creating visualization...")
# STEP 4 - Count items by your chosen category
category_counts = {}
category_column = 'REPLACE_WITH_YOUR_COLUMN_NAME' # e.g., 'genre', 'artist', 'team'
for record in dataset:
category = record.get(category_column, 'Unknown')
# Add counting logic
if category in category_counts:
category_counts[category] += 1
else:
category_counts[category] = 1
# STEP 5 - Create professional bar chart
if len(category_counts) > 0:
    # Sort by count to show most popular first
    sorted_categories = sorted(category_counts.items(), key=lambda x: x[1], reverse=True)
    # Take top 10 to avoid overcrowded chart
    top_categories = sorted_categories[:10]
    names = [item[0] for item in top_categories]
    counts = [item[1] for item in top_categories]
    plt.figure(figsize=(14, 8))
    # Spread the bar colors across the whole viridis range (values from 0.0 to 1.0);
    # plain integers like 0, 1, 2 would all land on nearly the same dark color
    color_positions = [i / max(len(names) - 1, 1) for i in range(len(names))]
    bars = plt.bar(names, counts, color=plt.cm.viridis(color_positions))
    # Customize with your specific data
    plt.title('YOUR_CHART_TITLE_HERE', fontsize=16, fontweight='bold')
    plt.xlabel('YOUR_X_LABEL')
    plt.ylabel('YOUR_Y_LABEL')
    plt.xticks(rotation=45, ha='right')
    # Add value labels
    for bar, count in zip(bars, counts):
        plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + max(counts)*0.01,
                 str(count), ha='center', fontweight='bold')
    plt.tight_layout()
    plt.show()
    # STEP 6 - Generate insights
    print(f"INSIGHTS FROM YOUR REAL DATA:")
    print(f"Most common {category_column}: {names[0]} ({counts[0]} items)")
    print(f"Total categories found: {len(category_counts)}")
    print(f"Top categories range from {min(counts)} to {max(counts)} items each")
else:
    print("No data to visualize. Check your column name and data!")
Hints for Common Dataset Types
# For movie datasets:
category_column = 'genre' # or 'director', 'year', 'rating_category'
# For music datasets:
category_column = 'artist' # or 'genre', 'year', 'decade'
# For sports datasets:
category_column = 'team' # or 'position', 'league', 'country'
# For food datasets:
category_column = 'restaurant' # or 'cuisine_type', 'price_range'
Challenge 5: Multi-Chart Data Story
Challenge Overview:
In this advanced challenge, you’ll create multiple complementary visualizations that tell a complete data story. This mirrors how professional data scientists and journalists combine different chart types to build compelling narratives.
What You’ll Learn:
- How to create different types of charts (bar charts, histograms, scatter plots)
- Combining multiple visualizations to tell a story
- Advanced data analysis techniques
- How to synthesize insights across different views of the same data
Preparing Your Final Presentation:
As you create each chart, think about how they connect to tell a larger story. Your 2-minute presentation will showcase 2-3 of your best visualizations and the key insights they reveal. Consider:
- What’s the main message you want your audience to remember?
- Which charts best support that message?
- What surprising or interesting pattern did you discover?
Your Mission:
Create 2-3 different visualizations from your dataset to tell a complete data story!
print("COMPLETE DATA STORY PROJECT")
print("Creating multiple visualizations from your real dataset...")
# Use your dataset from Challenge 4
# Now we'll create multiple charts to tell a complete story
# CHART 1: Category distribution (you already have this!)
print("Chart 1: Category Analysis")
# (Use your code from Challenge 4)
# CHART 2: Numerical analysis (if you have numeric columns)
print("Chart 2: Numerical Insights")
# Find a numeric column in your dataset
numeric_column = 'REPLACE_WITH_NUMERIC_COLUMN' # e.g., 'rating', 'price', 'score', 'views'
numeric_values = []
for record in dataset:
    try:
        # Convert text to number, skip if conversion fails
        value = float(record.get(numeric_column, 0))
        if value > 0:  # Skip zero/missing values
            numeric_values.append(value)
    except (ValueError, TypeError):
        pass  # Skip records that can't be converted
if len(numeric_values) > 0:
    # Create histogram showing distribution of values
    plt.figure(figsize=(10, 6))
    plt.hist(numeric_values, bins=20, color='skyblue', alpha=0.7, edgecolor='black')
    plt.title(f'Distribution of {numeric_column.title()} (Your Real Data)',
              fontsize=16, fontweight='bold')
    plt.xlabel(numeric_column.title())
    plt.ylabel('Frequency')
    # Add statistics
    avg_value = sum(numeric_values) / len(numeric_values)
    max_value = max(numeric_values)
    min_value = min(numeric_values)
    plt.axvline(avg_value, color='red', linestyle='--', linewidth=2,
                label=f'Average: {avg_value:.1f}')
    plt.legend()
    plt.tight_layout()
    plt.show()
    print(f"{numeric_column.title()} Statistics:")
    print(f"Average: {avg_value:.2f}")
    print(f"Highest: {max_value:.2f}")
    print(f"Lowest: {min_value:.2f}")
# CHART 3: Relationship analysis (if you have multiple numeric columns)
print("Chart 3: Relationship Analysis")
# Find two numeric columns to compare
x_column = 'REPLACE_WITH_X_COLUMN' # e.g., 'budget', 'followers', 'age'
y_column = 'REPLACE_WITH_Y_COLUMN' # e.g., 'revenue', 'likes', 'score'
x_values = []
y_values = []
labels = []
for record in dataset:
    try:
        x_val = float(record.get(x_column, 0))
        y_val = float(record.get(y_column, 0))
        if x_val > 0 and y_val > 0:
            x_values.append(x_val)
            y_values.append(y_val)
            # Use title or name for labels if available
            label = record.get('title', record.get('name', f'Item {len(labels)+1}'))
            labels.append(label)
    except (ValueError, TypeError):
        pass  # Skip records that can't be converted
if len(x_values) > 10:  # Need decent amount of data for scatter plot
    plt.figure(figsize=(12, 8))
    scatter = plt.scatter(x_values, y_values, alpha=0.6, s=60, c=y_values, cmap='viridis')
    plt.title(f'{y_column.title()} vs {x_column.title()} (Your Real Data)',
              fontsize=16, fontweight='bold')
    plt.xlabel(x_column.title())
    plt.ylabel(y_column.title())
    # Add colorbar
    plt.colorbar(scatter, label=y_column.title())
    # Optionally label a few interesting points
    if len(labels) == len(x_values):
        # Label the highest and lowest points
        max_y_index = y_values.index(max(y_values))
        min_y_index = y_values.index(min(y_values))
        plt.annotate(labels[max_y_index],
                     (x_values[max_y_index], y_values[max_y_index]),
                     xytext=(5, 5), textcoords='offset points', fontsize=9)
        plt.annotate(labels[min_y_index],
                     (x_values[min_y_index], y_values[min_y_index]),
                     xytext=(5, 5), textcoords='offset points', fontsize=9)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    print(f"Relationship Insights:")
    print(f"Analyzed {len(x_values)} data points")
    print(f"Highest {y_column}: {max(y_values):.2f}")
    print(f"Lowest {y_column}: {min(y_values):.2f}")
# FINAL DATA STORY SUMMARY
print("=" * 50)
print("YOUR COMPLETE DATA STORY")
print("=" * 50)
print(f"Dataset: {filename}")
print(f"Records analyzed: {len(dataset)}")
print(f"Key findings:")
print(f"• Most common category: [Your insight here]")
print(f"• Numerical trends: [Your insight here]")
print(f"• Interesting relationships: [Your insight here]")
print(f"• Surprising discovery: [Your insight here]")
print(f"WHAT THIS DATA TELLS US:")
print(f"Based on your analysis, what story does this data tell?")
print(f"What decisions could someone make using these insights?")
Hints for Different Dataset Types
# For movie datasets:
numeric_column = 'rating' # or 'box_office', 'runtime'
x_column = 'budget'
y_column = 'box_office'
# For music datasets:
numeric_column = 'popularity' # or 'duration', 'year'
x_column = 'danceability'
y_column = 'popularity'
# For sports datasets:
numeric_column = 'points' # or 'salary', 'age'
x_column = 'height'
y_column = 'points'
# For food datasets:
numeric_column = 'calories' # or 'price', 'protein'
x_column = 'price'
y_column = 'rating'
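If your dataset does not match any of these hints, you can let Python suggest numeric columns for you. This is only a sketch: the sample rows, the helper names is_number and find_numeric_columns, and the 80% cut-off are all assumptions for illustration.

```python
# Made-up sample rows (illustration only); csv.DictReader gives every value as text
dataset = [
    {"title": "Song A", "popularity": "71", "duration": "3.5", "year": "2020"},
    {"title": "Song B", "popularity": "64", "duration": "", "year": "2018"},
    {"title": "Song C", "popularity": "80", "duration": "4.1", "year": "2021"},
]

def is_number(text):
    """Return True if the text can be converted with float()."""
    try:
        float(text)
        return True
    except (ValueError, TypeError):
        return False

def find_numeric_columns(records, min_share=0.8):
    """List columns where at least min_share of the values are numbers."""
    numeric_columns = []
    for column in records[0].keys():
        numeric_count = sum(1 for record in records if is_number(record.get(column)))
        if numeric_count / len(records) >= min_share:
            numeric_columns.append(column)
    return numeric_columns

numeric_columns = find_numeric_columns(dataset)
print("Possible choices for numeric_column, x_column or y_column:", numeric_columns)
```

Here duration is left out because one of its three values is blank. Lowering min_share makes the check more forgiving of missing data.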
Wrap-Up & Data Story Presentations
Your 2-Minute Data Story Presentation:
Each student will present their most interesting discovery to the class. Your presentation should include:
Structure:
- The Hook (15 seconds): “I discovered something surprising about [your topic]…”
- The Data (30 seconds): Show your best chart and explain what it shows
- The Insight (60 seconds): What does this mean? Why is it interesting or important?
- The Impact (15 seconds): How could this information be used in the real world?
Presentation Tips:
- Start with your most surprising or interesting finding
- Make sure your chart is clearly visible and labeled
- Speak to your audience, not to your screen
- End with why this matters or what someone could do with this information
Day 4: Music Streaming Wars – Advanced Analytics & Predictions
Today’s Mission:
“By the end of today, you’ll work in teams to compete as music industry consultants. Each team will analyze a different genre of music and pitch their insights to judges representing a major record label looking for their next big investment!”
What Makes Today Different:
Yesterday you learned basic visualization. Today you’ll master advanced data science techniques: correlation analysis, predictive modeling, and statistical validation – the same methods used by Spotify, Apple Music, and record labels to make million-dollar decisions.
The Competition:
Teams will be assigned different music genres (Pop, Hip-Hop, Rock, Electronic, Country, R&B) and compete to present the most compelling insights about what makes songs successful in their genre. The winning team gets recognition as “Music Industry Data Science Champions!”
Session 1: Advanced Music Data Analysis
Concept Introduction: Computational Thinking in Music Analytics
Computational Thinking Practices We’ll Use Today:
1. Pattern Recognition
- What we’re doing: Finding correlations between song features and success
- Why it matters: Just like recognizing patterns in weather data helps predict storms, recognizing patterns in music helps predict hits
- Real example: “Every summer hit has high danceability” vs “Every winter hit is more acoustic”
2. Decomposition
- What we’re doing: Breaking down “song success” into measurable components (tempo, energy, danceability, etc.)
- Why it matters: Complex problems become manageable when broken into smaller parts
- Real example: Instead of asking “What makes a good song?” we ask “How does tempo affect streaming numbers?”
3. Abstraction
- What we’re doing: Creating mathematical models that capture the essence of hit songs
- Why it matters: Models help us focus on what’s important and ignore irrelevant details
- Real example: A song is more than just notes – we abstract it to numerical features we can analyze
4. Algorithm Design
- What we’re doing: Creating step-by-step processes to predict song success
- Why it matters: Algorithms can be tested, improved, and scaled to analyze millions of songs
- Real example: “If danceability > 0.7 AND energy > 0.6, then predict ‘likely hit’”
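To make Pattern Recognition and Algorithm Design concrete, here is a small sketch using the three sample songs from today's dataset preview. The function names pearson_correlation and predict_hit are made up for this illustration, and the "not sure" label for songs that fail the rule is an assumption, since the rule above only says when to predict a hit.

```python
# The three sample songs from today's dataset preview; a real analysis would use every song
songs = [
    {"track_name": "Blinding Lights", "popularity": 95, "danceability": 0.514, "energy": 0.730},
    {"track_name": "Good 4 U", "popularity": 88, "danceability": 0.563, "energy": 0.664},
    {"track_name": "Industry Baby", "popularity": 92, "danceability": 0.844, "energy": 0.704},
]

def pearson_correlation(xs, ys):
    """Pearson correlation: near +1 = rise together, near -1 = opposite, near 0 = no straight-line link."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return covariance / (spread_x * spread_y)

def predict_hit(song):
    """The example rule from the text: danceability > 0.7 AND energy > 0.6."""
    if song["danceability"] > 0.7 and song["energy"] > 0.6:
        return "likely hit"
    return "not sure"  # assumed label; the rule does not say what to predict otherwise

# Pattern recognition: do danceability and popularity move together?
danceability = [song["danceability"] for song in songs]
popularity = [song["popularity"] for song in songs]
r = pearson_correlation(danceability, popularity)
print(f"Correlation between danceability and popularity: {r:.2f}")

# Algorithm design: apply the step-by-step rule to every song
predictions = {song["track_name"]: predict_hit(song) for song in songs}
for name, prediction in predictions.items():
    print(f"{name}: {prediction}")
```

Three songs are far too few to trust any correlation, and the rule labels two very popular songs "not sure". That is exactly why algorithms need to be tested against lots of real data before anyone relies on them. Note that pearson_correlation divides by zero if every value in a list is the same.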
Today’s Music Dataset
Dataset Overview:
- 114,000 songs from Spotify’s database – real industry data!
- Key columns for analysis: track_name, artists, track_genre, popularity, danceability, energy
- Multiple genres available: Check what genres exist in the data
- Real popularity scores: 0-100 scale used by Spotify
Sample Data Preview:
track_name,artists,track_genre,popularity,danceability,energy
"Blinding Lights","The Weeknd","pop",95,0.514,0.730
"Good 4 U","Olivia Rodrigo","pop",88,0.563,0.664
"Industry Baby","Lil Nas X","hip-hop",92,0.844,0.704
Important: The column names are track_name, artists, and track_genre (not song_name, artist, genre)
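Here is a quick sketch of what csv.DictReader does with those sample rows. It reads the preview from a string with io.StringIO instead of a file, which is just a convenience for this illustration. The variable names sample_csv and sample_songs are made up.

```python
import csv
import io

# The sample preview from above, stored in a string instead of a file
sample_csv = '''track_name,artists,track_genre,popularity,danceability,energy
"Blinding Lights","The Weeknd","pop",95,0.514,0.730
"Good 4 U","Olivia Rodrigo","pop",88,0.563,0.664
"Industry Baby","Lil Nas X","hip-hop",92,0.844,0.704
'''

reader = csv.DictReader(io.StringIO(sample_csv))
sample_songs = list(reader)  # each row becomes a dictionary keyed by column name

first = sample_songs[0]
print(first["track_name"], "by", first["artists"])
print("Columns:", reader.fieldnames)

# Every value arrives as text, even the numbers
print(type(first["popularity"]))

# Using the wrong column name fails, so check before you use it
print("song_name" in first)
```

Because every value arrives as text, you will need int() and float() before doing any maths with popularity, danceability or energy.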
Challenge 1: Load and Explore Music Data
Challenge Overview: Get familiar with the music dataset using simple data exploration. This builds on your CSV skills from Day 2 but with music data that’s much easier to work with.
Download the Spotify dataset here.
Learning Goals:
- Practice CSV loading with music industry context
- Calculate basic statistics on music features
- Understand the three key success metrics
Code Purpose: This is a template for loading data and exploring what’s in the dataset. You need to complete the missing parts to make it work with the real Spotify dataset.
# MUSIC DATA EXPLORATION CHALLENGE
print("MUSIC DATA EXPLORER")
import csv
# Step 1: Load the music data
music_data = []
# encoding='utf-8' lets Python read song titles with accents and non-English characters
with open('dataset.csv', 'r', encoding='utf-8', newline='') as file:
    reader = csv.DictReader(file)
    for song in reader:
        music_data.append(song)
print(f"Loaded {len(music_data)} songs")
# Step 2: Discover what genres exist in the dataset
print("\nDiscovering available genres...")
# TODO: Create a list of all unique genres in the dataset
# HINT: Look at the 'track_genre' column for each song
# HINT: Use a list to collect unique genres (avoid duplicates)
unique_genres = []
for song in music_data:
    genre = song['track_genre']
    # TODO: Add logic to only add genre if it's not already in the list
    # Your code here
print(f"Found {len(unique_genres)} different genres:")
for genre in unique_genres[:10]:  # Show first 10 genres
    print(f" - {genre}")
# Step 3: Look at a few songs to understand the data structure
print("\nSample songs from dataset:")
for i in range(3):
    song = music_data[i]
    # TODO: Print the track name, artist, genre, and popularity
    # HINT: Use song['track_name'], song['artists'], etc.
    # Your code here
# Step 4: Convert text numbers to real numbers for analysis
print("\nConverting data for analysis...")
conversion_errors = 0
for song in music_data:
    try:
        # TODO: Convert the text values to numbers
        # HINT: popularity should be int(), danceability and energy should be float()
        # Your code here
        pass
    except ValueError:
        conversion_errors += 1
print(f"Data conversion complete! ({conversion_errors} errors found)")
TODO for Students:
- Complete the genre discovery: Make the code find all unique genres without duplicates
- Fix the data display: Make it show track names, artists, and other info properly
- Implement data conversion: Convert the string numbers to integers and floats
- Test your code: Run it and see what genres are available in the dataset
To run the movie example code below, download the movie dataset here: Movie Data
Example: Movie Data Exploration
What this example shows: How to load and explore a different dataset (movies) using the same techniques. This demonstrates the same concepts you just practiced, but with movie data instead of music.
Key learning: The same data exploration approach works for any CSV dataset – just change the column names!
# MOVIE DATA EXPLORATION EXAMPLE
print("MOVIE DATA EXPLORER")
import csv
# Load movie data (same process as music data)
movie_data = []
# encoding='utf-8' lets Python read titles with accents and non-English characters
with open('tmdb_5000_movies.csv', 'r', encoding='utf-8', newline='') as file:
    reader = csv.DictReader(file)
    for movie in reader:
        movie_data.append(movie)
print(f"Loaded {len(movie_data)} movies")
# Discover what languages are in the movie dataset
unique_languages = []
for movie in movie_data:
    language = movie['original_language']
    if language not in unique_languages:
        unique_languages.append(language)
print(f"Found {len(unique_languages)} different languages:")
print("First languages found:", unique_languages[:8])
# Look at sample movies
print("\nSample movies:")
for i in range(3):
    movie = movie_data[i]
    print(f"Title: {movie['title']}")
    print(f"Language: {movie['original_language']}")
    print(f"Popularity: {movie['popularity']}")
    print(f"Vote Average: {movie['vote_average']}")
    print()
# Convert movie data for analysis
for movie in movie_data:
    try:
        movie['popularity'] = float(movie['popularity'])
        movie['vote_average'] = float(movie['vote_average'])
        movie['vote_count'] = int(movie['vote_count'])
    except ValueError:
        pass  # Skip movies with missing data
print("Movie data ready for analysis!")
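The list-plus-"not in" pattern above is one way to collect unique values. As a side note, Python's built-in set does the same job in one line. A tiny sketch, using the genres of the three sample songs from the music dataset preview (the variable names are made up for illustration):

```python
# Genres of the three sample songs from the music dataset preview
genres_seen = ["pop", "pop", "hip-hop"]

# Approach used in the example: a list plus a "not in" check (keeps first-seen order)
unique_list = []
for genre in genres_seen:
    if genre not in unique_list:
        unique_list.append(genre)

# Alternative: a set drops duplicates automatically; sorted() puts them in A-Z order
unique_sorted = sorted(set(genres_seen))

print(unique_list)
print(unique_sorted)
```

Both approaches find the same values. The list version keeps them in the order they first appear, while the set version needs sorted() to give a predictable order.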
Challenge 2: Find the Most Popular Songs
Challenge Overview: Use simple list operations to find the most popular songs overall. This reinforces Day 1 concepts (max, sorting) with engaging music data.
Learning Goals:
- Apply max() and sorting to real music data
- Practice finding patterns in datasets
- Build confidence with working code
Code Purpose: This shows you the structure for finding top songs, but you need to implement the actual logic. This builds on Day 1 concepts but requires you to think through the steps.
# FIND THE BIGGEST HITS CHALLENGE
print("FINDING THE MOST POPULAR SONGS")
# Step 1: Extract all popularity scores
# TODO: Create a list of all popularity scores from the dataset
# HINT: Loop through music_data and get song['popularity'] for each song
all_popularity = []
# Your code here
# Step 2: Find the highest popularity score
# TODO: Use a function from Day 1 to find the maximum value
# HINT: What function finds the biggest number in a list?
highest_popularity = 0  # Replace 0 with your code
print(f"Highest popularity score: {highest_popularity}")
# Step 3: Find which song has that highest score
# TODO: Loop through the data to find the song with the highest popularity
# HINT: Use an if statement to check if song['popularity'] equals highest_popularity
print("Most popular song:")
for song in music_data:
    # Your code here - check if this song has the highest popularity
    # If it does, print the track name and artist
    pass  # Placeholder so the loop runs; remove it once you add your code
# Step 4: Find top 5 most popular songs (This is the tricky part!)
print("\nTOP 5 MOST POPULAR SONGS:")
# TODO: Create a way to find the top 5 songs
# STRATEGY 1: Sort all popularity scores, get top 5, then find those songs
# STRATEGY 2: Find highest, remove it, find next highest, repeat 5 times
# STRATEGY 3: Create a list of (popularity, song_name) pairs and sort
# Your approach here - this is the real challenge!
TODO for Students:
- Complete the popularity extraction: Fill the all_popularity list
- Find the maximum: Use the right function to find the highest score
- Locate the top song: Write the if statement to find and print the #1 song
- Design the top 5 algorithm: This is the real challenge – figure out how to get the top 5 songs
- Test with different features: Once working, try finding most danceable or energetic songs
Example: Finding Top Movies
What this example shows: The exact same logic for finding top items, but applied to movies instead of songs. Notice how the structure is identical – only the data and column names change.
Key learning: Once you understand the pattern for finding “top” items, you can apply it to any dataset!
# FINDING TOP MOVIES EXAMPLE
print("FINDING THE MOST POPULAR MOVIES")
# Step 1: Extract all popularity scores (same concept as music)
all_movie_popularity = []
for movie in movie_data:
    all_movie_popularity.append(movie['popularity'])
# Step 2: Find highest popularity
highest_movie_popularity = max(all_movie_popularity)
print(f"Highest movie popularity: {highest_movie_popularity}")
# Step 3: Find which movie has that score
for movie in movie_data:
    if movie['popularity'] == highest_movie_popularity:
        print(f"Most popular movie: {movie['title']}")
        break
# Step 4: Top 5 movies using the sorting approach
print("\nTOP 5 MOST POPULAR MOVIES:")
# Create popularity list and sort it
movie_popularity_scores = []
for movie in movie_data:
    movie_popularity_scores.append(movie['popularity'])
movie_popularity_scores.sort(reverse=True)
top_5_movie_scores = movie_popularity_scores[:5]
# Find movies with these scores
# Remember which titles were printed, so two movies that share a score both appear
printed_titles = []
for score in top_5_movie_scores:
    for movie in movie_data:
        if movie['popularity'] == score and movie['title'] not in printed_titles:
            print(f"{movie['title']} - Popularity: {score}")
            printed_titles.append(movie['title'])
            break
# Bonus: Find highest rated movies (different column)
print("\nHIGHEST RATED MOVIE:")
all_ratings = []
for movie in movie_data:
    all_ratings.append(movie['vote_average'])
highest_rating = max(all_ratings)
for movie in movie_data:
    if movie['vote_average'] == highest_rating:
        print(f"Highest rated: {movie['title']} (Rating: {highest_rating})")
        break
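Strategy 3 from the challenge (sorting pairs) has one trap worth seeing in action: ties. The sketch below uses made-up songs, two of which share the same popularity, to show why sorting with key=... matters. The song names and scores are invented for illustration only.

```python
# Made-up songs (illustration only); two of them tie on popularity
songs = [
    {"track_name": "Song A", "popularity": 90},
    {"track_name": "Song B", "popularity": 75},
    {"track_name": "Song C", "popularity": 90},
    {"track_name": "Song D", "popularity": 60},
]

# Build (popularity, song) pairs
pairs = [(song["popularity"], song) for song in songs]

# Sort by the first item of each pair only. Without key=..., Python would try to
# compare the two dictionaries whenever popularity ties, which raises a TypeError.
pairs.sort(key=lambda pair: pair[0], reverse=True)

top_3 = [song["track_name"] for popularity, song in pairs[:3]]
print(top_3)
```

Python's sort keeps tied items in their original order, so Song A stays ahead of Song C. Spotify popularity scores are whole numbers from 0 to 100, so ties are very common in the music data.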
Challenge 3: Simple Genre Comparison
Challenge Overview: Compare different music genres using basic counting and averaging. This prepares you for the team competition where you’ll need to analyze your specific genre.
Learning Goals:
- Group data by categories (genres)
- Calculate averages for different groups
- Make simple comparisons between categories
Code Purpose: This template shows the structure for analyzing one genre, but you need to implement all the calculation logic. This is good practice for the team competition.
# GENRE ANALYSIS CHALLENGE
print("ANALYZING A SPECIFIC GENRE")
# Step 1: Choose a genre to analyze (pick from the list you discovered above)
genre_to_analyze = "pop" # You can change this to any genre from your discovery
# Step 2: Filter songs by genre
# TODO: Create a list containing only songs from your chosen genre
# HINT: Use an if statement to check if song['track_genre'] matches your genre
genre_songs = []
for song in music_data:
    # Your code here - add song to genre_songs if it matches the genre
    pass  # Placeholder so the loop runs; remove it once you add your code
print(f"Found {len(genre_songs)} {genre_to_analyze} songs")
# Step 3: Calculate average popularity for this genre
# TODO: Add up all the popularity scores and divide by the count
# HINT: Create a total, loop through genre_songs adding each popularity, then divide
total_popularity = 0
# Your code here to add up all popularity scores
if len(genre_songs) > 0:
    average_popularity = 0  # Replace 0 with your calculation
    print(f"Average popularity for {genre_to_analyze}: {average_popularity:.1f}")
# Step 4: Calculate average danceability
# TODO: Same process as popularity, but for danceability
# HINT: Loop through genre_songs and add up all the danceability values
total_danceability = 0
# Your code here
if len(genre_songs) > 0:
    average_danceability = 0  # Replace 0 with your calculation
    print(f"Average danceability for {genre_to_analyze}: {average_danceability:.2f}")
# Step 5: Find the most popular song in this genre
# TODO: Find the highest popularity score within just this genre's songs
# HINT: Similar to Challenge 2, but only looking at genre_songs
highest_pop_in_genre = 0
best_song_in_genre = ""
for song in genre_songs:
    # Your code here - check if this song's popularity is higher than current highest
    # If so, update both variables
    pass  # Placeholder so the loop runs; remove it once you add your code
print(f"Most popular {genre_to_analyze} song: {best_song_in_genre}")
# BONUS CHALLENGE: Can you find the top 3 songs in this genre?
print(f"\nBONUS: Try to find the top 3 {genre_to_analyze} songs!")
TODO for Students:
- Complete the genre filtering: Write the if statement to collect songs from one genre
- Calculate popularity average: Implement the sum and division
- Calculate danceability average: Apply the same logic to a different variable
- Find the top song: Write logic to track the highest popularity song
- Try different genres: Test your code with different genre values
- Bonus challenge: Extend your logic to find top 3 songs in the genre
Example: Movie Language Analysis
What this example shows: How to analyze one category (movie language) using the same filtering and averaging techniques. This is exactly what you’re doing with music genres, but for movies.
Key learning: Category analysis works the same way regardless of the dataset – filter, calculate, compare!
# MOVIE LANGUAGE ANALYSIS EXAMPLE
print("ANALYZING ENGLISH MOVIES")
# Step 1: Choose a language to analyze (like choosing a genre)
language_to_analyze = "en" # English movies
# Step 2: Filter movies by language (same logic as genre filtering)
language_movies = []
for movie in movie_data:
    if movie['original_language'] == language_to_analyze:
        language_movies.append(movie)
print(f"Found {len(language_movies)} English movies")
# Step 3: Calculate average popularity for English movies
total_popularity = 0
for movie in language_movies:
    total_popularity += movie['popularity']
average_popularity = total_popularity / len(language_movies)
print(f"Average popularity for English movies: {average_popularity:.1f}")
# Step 4: Calculate average rating
total_rating = 0
for movie in language_movies:
    total_rating += movie['vote_average']
average_rating = total_rating / len(language_movies)
print(f"Average rating for English movies: {average_rating:.1f}")
# Step 5: Find most popular English movie
highest_pop = 0
best_movie = ""
for movie in language_movies:
    if movie['popularity'] > highest_pop:
        highest_pop = movie['popularity']
        best_movie = movie['title']
print(f"Most popular English movie: {best_movie}")
# Compare to another language
print("\nComparing to Spanish movies:")
spanish_movies = []
for movie in movie_data:
    if movie['original_language'] == "es":
        spanish_movies.append(movie)
if len(spanish_movies) > 0:
    spanish_avg = sum(movie['popularity'] for movie in spanish_movies) / len(spanish_movies)
    print(f"Spanish movies average popularity: {spanish_avg:.1f}")
    print(f"English vs Spanish: {average_popularity:.1f} vs {spanish_avg:.1f}")
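The example above filters one category at a time. When you want averages for every category at once (handy for the team competition), one possible approach is to keep running totals in dictionaries. This is a sketch with made-up records, not the only way to do it. The names totals, counts and ranking are invented for illustration.

```python
# Made-up records (illustration only); values already converted to numbers
records = [
    {"category": "pop", "popularity": 80},
    {"category": "rock", "popularity": 60},
    {"category": "pop", "popularity": 70},
    {"category": "jazz", "popularity": 40},
    {"category": "rock", "popularity": 50},
]

totals = {}  # category -> sum of popularity
counts = {}  # category -> number of records
for record in records:
    category = record["category"]
    totals[category] = totals.get(category, 0) + record["popularity"]
    counts[category] = counts.get(category, 0) + 1

# Average = total / count for each category
averages = {category: totals[category] / counts[category] for category in totals}

# Rank categories from highest to lowest average
ranking = sorted(averages.items(), key=lambda item: item[1], reverse=True)
for category, average in ranking:
    print(f"{category}: {average:.1f}")
```

This loops through the data only once, however many categories there are, instead of once per category.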
Team Assignment & Strategy Planning
Team Assignment Process:
- First: Teams run the genre discovery code to see what genres are actually in the dataset
- Then: Teams choose from the available genres (likely: pop, hip-hop, rock, electronic, country, r&b, jazz, classical, etc.)
- Finally: Each team picks a different genre to avoid duplicates
Strategy Discussion: “Discuss with your team: Based on the data exploration, what do you already know about your genre? What patterns did you notice? What would convince a record label to invest in it?”
Challenge 4: Team Genre Deep Dive
Challenge Overview: Work as a team to analyze your assigned genre in detail. You’ll create 3 simple charts that show why your genre is the best investment for Breakthrough Records.
Learning Goals:
- Apply all week’s skills to a focused analysis
- Work collaboratively on data analysis
- Create charts that support business arguments
Your Team Mission: Create exactly 3 charts that tell a compelling story about your genre:
- Chart 1: How popular is your genre compared to others?
- Chart 2: What makes your genre special? (danceability, energy, or another feature)
- Chart 3: Show your genre’s biggest hits or most promising trends
Code Purpose: These are simplified chart-making templates. Your team will modify them to create visualizations that support your investment pitch.
Simple Chart Template 1: Genre Popularity Comparison
# CHART 1: GENRE POPULARITY COMPARISON CHALLENGE
import matplotlib.pyplot as plt
# Step 1: Discover all genres in the dataset
# TODO: Create a list of unique genres (use your code from Challenge 1)
all_genres = []
# Your code here to find unique genres
# Step 2: Calculate average popularity for each genre
# TODO: For each genre, calculate its average popularity
# HINT: This is similar to Challenge 3, but for multiple genres
genre_averages = []
for genre in all_genres[:8]:  # Limit to first 8 genres to fit on chart
    # TODO: Find all songs in this genre
    genre_songs = []
    # Your code here
    # TODO: Calculate average popularity for this genre
    if len(genre_songs) > 0:
        # Your calculation here
        average_pop = 0  # Replace with your calculation
        genre_averages.append(average_pop)
    else:
        genre_averages.append(0)
# Step 3: Create the chart
plt.figure(figsize=(12, 6))
bars = plt.bar(all_genres[:8], genre_averages)
# Step 4: Highlight YOUR genre
your_genre = "REPLACE_WITH_YOUR_GENRE" # Teams replace this
for i, genre in enumerate(all_genres[:8]):
    if genre == your_genre:
        bars[i].set_color('red')
    else:
        bars[i].set_color('lightblue')
plt.title('Average Popularity by Genre')
plt.ylabel('Average Popularity Score')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
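# TODO: Print insights about your genre's ranking
# Only the first 8 genres are on the chart, so check that yours is one of them
if your_genre in all_genres[:8]:
    your_average = genre_averages[all_genres[:8].index(your_genre)]
    print(f"Your genre ({your_genre}) average popularity: {your_average:.1f}")
else:
    print(f"{your_genre} is not on this chart. Check the spelling or add it to the genres you plot.")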
Simple Chart Template 2: Genre Special Feature
# CHART 2: WHAT MAKES YOUR GENRE SPECIAL CHALLENGE
your_genre = "REPLACE_WITH_YOUR_GENRE"
# Step 1: Filter to your genre's songs
# TODO: Create a list of all songs in your genre
your_genre_songs = []
# Your code here (similar to Challenge 3)
print(f"Analyzing {len(your_genre_songs)} {your_genre} songs")
# Step 2: Extract the feature you want to analyze
# TODO: Choose danceability, energy, or another feature to highlight
# TODO: Create a list of that feature's values for your genre
feature_scores = []
feature_name = "danceability" # You can change this to "energy" or other features
for song in your_genre_songs:
    # TODO: Add the chosen feature value to your list
    # HINT: Use song[feature_name] to get the value
    pass
# Step 3: Create visualization
plt.figure(figsize=(8, 6))
plt.hist(feature_scores, bins=15, color='green', alpha=0.7, edgecolor='black')
plt.title(f'{your_genre.title()} Songs: {feature_name.title()} Distribution')
plt.xlabel(f'{feature_name.title()} Score')
plt.ylabel('Number of Songs')
plt.show()
# Step 4: Calculate and display insights
# TODO: Calculate average, minimum, and maximum for your feature
if feature_scores:
    avg_feature = 0  # Replace 0 with your calculation
    min_feature = 0  # Replace 0 with your calculation
    max_feature = 0  # Replace 0 with your calculation
    print(f"Average {feature_name} for {your_genre}: {avg_feature:.3f}")
    print(f"Range: {min_feature:.3f} to {max_feature:.3f}")
# TODO: Compare to overall dataset average
print(f"This tells us that {your_genre} music is...")
Simple Chart Template 3: Your Genre’s Top Hits
# CHART 3: YOUR GENRE'S BIGGEST HITS CHALLENGE
your_genre = "REPLACE_WITH_YOUR_GENRE"
# Step 1: Get all songs from your genre
# TODO: Filter the dataset to only your genre's songs
your_genre_songs = []
# Your code here
# Step 2: Find the top songs by popularity
# TODO: This is the big challenge - sort your genre's songs by popularity
# STRATEGY HINT: You could create a list of (popularity, song_info) pairs and sort
# OR use the same approach from Challenge 2 but only for your genre
# Method 1: Create pairs and sort (Recommended)
song_popularity_pairs = []
for song in your_genre_songs:
    # TODO: Create a pair of (popularity_score, song_dictionary)
    # HINT: (song['popularity'], song)
    pass
# TODO: Sort the pairs by popularity (highest first)
# HINT: Use .sort() with reverse=True and key=lambda x: x[0]
# Step 3: Get top 5 and prepare for chart
top_5_songs = song_popularity_pairs[:5]
# Extract data for visualization
song_names = []
popularities = []
for popularity_score, song_info in top_5_songs:
    # TODO: Get the track name and popularity for each top song
    # HINT: Use song_info['track_name'] and popularity_score
    short_name = song_info['track_name'][:20]  # Shorten long names
    song_names.append(short_name)
    popularities.append(popularity_score)
# Step 4: Create the chart
plt.figure(figsize=(10, 6))
plt.barh(song_names, popularities, color='purple')
plt.title(f'Top 5 {your_genre.title()} Hits')
plt.xlabel('Popularity Score')
plt.gca().invert_yaxis() # Highest at top
plt.tight_layout()
plt.show()
# TODO: Print insights about your top hits
print(f"\n{your_genre.title()} Genre Analysis:")
if popularities:  # Stays empty until Steps 1 and 2 are completed
    print(f"Most popular song: {song_names[0]} (Score: {popularities[0]})")
    print(f"Average of top 5: {sum(popularities)/len(popularities):.1f}")
TODO for Teams:
- Complete each chart template: Fill in all the TODO sections with working code
- Test your implementations: Make sure each chart displays correctly
- Analyze your results: What story do your charts tell about your genre?
- Prepare insights: What specific evidence supports investing in your genre?
- Practice explaining: Can you explain what each chart shows in simple terms?
Example: Movie Visualization Charts
What this example shows: Three complete, working chart examples using movie data. These demonstrate the same visualization techniques you’re implementing for music, but with movies instead.
Key learning: The chart-making process is identical regardless of data – filter your category, calculate what you need, then visualize it!
Movie Chart 1: Language Popularity Comparison
# MOVIE CHART 1: LANGUAGE POPULARITY COMPARISON
import matplotlib.pyplot as plt
# Find top 6 languages by number of movies
language_counts = {}
for movie in movie_data:
    lang = movie['original_language']
    if lang in language_counts:
        language_counts[lang] += 1
    else:
        language_counts[lang] = 1
# Get top 6 languages
sorted_langs = sorted(language_counts.items(), key=lambda x: x[1], reverse=True)
top_languages = [lang for lang, count in sorted_langs[:6]]
# Calculate average popularity for each top language
lang_averages = []
for lang in top_languages:
    lang_movies = []
    for movie in movie_data:
        if movie['original_language'] == lang:
            lang_movies.append(movie)
    if len(lang_movies) > 0:
        avg_pop = sum(movie['popularity'] for movie in lang_movies) / len(lang_movies)
        lang_averages.append(avg_pop)
    else:
        lang_averages.append(0)
# Create chart
plt.figure(figsize=(10, 6))
bars = plt.bar(top_languages, lang_averages, color='skyblue')
bars[0].set_color('red') # Highlight English
plt.title('Average Movie Popularity by Language')
plt.ylabel('Average Popularity')
plt.show()
Movie Chart 2: English Movie Ratings Distribution
# MOVIE CHART 2: ENGLISH MOVIE RATINGS DISTRIBUTION
english_movies = []
for movie in movie_data:
    if movie['original_language'] == 'en':
        english_movies.append(movie)
# Get all ratings for English movies
english_ratings = []
for movie in english_movies:
    english_ratings.append(movie['vote_average'])
# Create histogram
plt.figure(figsize=(8, 6))
plt.hist(english_ratings, bins=15, color='green', alpha=0.7, edgecolor='black')
plt.title('English Movies: Rating Distribution')
plt.xlabel('Rating (0-10)')
plt.ylabel('Number of Movies')
plt.show()
# Print insights
avg_rating = sum(english_ratings) / len(english_ratings)
print(f"Average English movie rating: {avg_rating:.1f}")
Movie Chart 3: Top English Movies
# MOVIE CHART 3: TOP ENGLISH MOVIES BY POPULARITY
english_movies = []
for movie in movie_data:
    if movie['original_language'] == 'en':
        english_movies.append(movie)
# Sort by popularity
english_movies.sort(key=lambda movie: movie['popularity'], reverse=True)
top_5_english = english_movies[:5]
# Extract data for chart
movie_titles = []
movie_popularity = []
for movie in top_5_english:
    movie_titles.append(movie['title'][:20])  # Shorten long titles
    movie_popularity.append(movie['popularity'])
# Create horizontal bar chart
plt.figure(figsize=(10, 6))
plt.barh(movie_titles, movie_popularity, color='purple')
plt.title('Top 5 English Movies by Popularity')
plt.xlabel('Popularity Score')
plt.gca().invert_yaxis() # Highest at top
plt.tight_layout()
plt.show()
print(f"Most popular English movie: {movie_titles[0]}")
Challenge 5: Pitch Preparation & Competition
Challenge Overview: Prepare and deliver your 5-minute pitch to Breakthrough Records. Use your charts to build a compelling case for why they should invest $5 million in your genre.
Learning Goals:
- Communicate data insights clearly and persuasively
- Connect statistical findings to business decisions
- Present technical work to non-technical audiences
Pitch Structure (5 minutes total):
1. Hook (30 seconds): “[Genre] music is about to explode, and here’s the data to prove it…”
2. Market Evidence (2 minutes):
- Show Chart 1: “Our genre compares to others like this…”
- Show Chart 2: “What makes our genre unique is…”
- Key statistics that support investment
3. Success Stories (1.5 minutes):
- Show Chart 3: “Look at these massive hits in our genre…”
- Name specific artists/songs and their success
4. Investment Pitch (1 minute):
- “Here’s why you should invest $5 million in [Genre]…”
- Specific recommendations for what type of artists to sign
Day 5: STEM Research Challenge
Session 1: Scientific Data Basics
Simple Science Data Concepts
What Makes Good Scientific Research? (Just 3 key things)
- Clear Question: What specific thing are you trying to understand?
- Reliable Data: Numbers from credible scientific sources
- Honest Analysis: What does the data actually show (not what you want it to show)?
Computational Thinking in Science:
- Pattern Recognition: Finding trends over time (temperatures rising, populations changing)
- Abstraction: Focusing on key measurements while ignoring complexity
- Decomposition: Breaking big questions into smaller, testable parts
- Algorithm Design: Creating step-by-step analysis processes (just like you did for music!)
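As a rough illustration of how these four ideas fit together, the sketch below breaks the question "has it got warmer?" into small functions. The records are made-up numbers for illustration only, and `values_between`, `average`, and `change_between` are hypothetical helper names, not part of any course file.

```python
# Made-up sample measurements for illustration only (not real data)
records = [
    {"year": 1990, "temp": 9.0},
    {"year": 1991, "temp": 9.2},
    {"year": 1992, "temp": 9.1},
    {"year": 2010, "temp": 9.8},
    {"year": 2011, "temp": 9.6},
    {"year": 2012, "temp": 10.0},
]


def values_between(records, start_year, end_year):
    """Abstraction: keep only the one measurement we care about."""
    selected = []
    for record in records:
        if start_year <= record["year"] <= end_year:
            selected.append(record["temp"])
    return selected


def average(values):
    """Decomposition: one small step that is easy to test on its own."""
    return sum(values) / len(values)


def change_between(records, early, late):
    """Algorithm design: average of the late period minus the early period."""
    early_avg = average(values_between(records, early[0], early[1]))
    late_avg = average(values_between(records, late[0], late[1]))
    return late_avg - early_avg


# Pattern recognition: is the later period warmer than the earlier one?
change = change_between(records, (1990, 1992), (2010, 2012))
print(f"Change between periods: {change:+.1f} degrees")
```

Each function answers one small question, so a surprising final result can be traced back to the step that produced it.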
Today’s Research Track
Choose ONE option: use one of our provided datasets (Option A, recommended for reliability) or find your own dataset (Option B).
Option A: Ready-to-Use Real Datasets
Track 1: Climate & Temperature Analysis
- Dataset: GlobalLandTemperaturesByCountry.csv
- Source: Berkeley Earth Global Temperature Data
- Size: 577,000+ records (1743-2013, monthly data)
- Sample Questions: How have temperatures changed over time? Which countries are warming fastest? How does Ireland compare globally?
- Key Columns: dt (date), AverageTemperature, AverageTemperatureUncertainty, Country
- Irish Context: Compare Ireland’s temperature trends to global patterns
Track 2: CO2 Emissions & Environment
- Dataset: co2_emissions_kt_by_country.csv
- Source: World Bank CO2 Emissions Data
- Size: 13,900+ records (countries 1960-2019)
- Sample Questions: Which countries produce the most CO2? How have emissions changed over time? Where does Ireland rank?
- Key Columns: country_code, country_name, year, value (CO2 in kilotons)
- Irish Context: Analyze Ireland’s emissions compared to EU neighbors
Track 3: Global Health & Life Expectancy
- Dataset: Life Expectancy Data.csv
- Source: World Health Organization
- Size: 2,900+ records (countries 2000-2015)
- Sample Questions: What factors predict longer life? How has Irish health compared globally? What affects life expectancy most?
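- Key Columns: Country, Year, Life_expectancy, Adult_Mortality, GDP, Total_expenditure, Schooling (header spellings in the file may differ: the Challenge 1 code reads the life expectancy column as 'Life expectancy ', with a trailing space, so check the printed column names)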
- Irish Context: Ireland data available – compare to other developed nations
Track 4: Digital Development & Technology
- Dataset: Final.csv
- Source: World Bank Digital Development Indicators
- Size: 8,800+ records (countries 1980-2020)
- Sample Questions: How has internet adoption grown globally? Does digital access boost economies? How does Ireland compare?
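- Key Columns: Entity (country), Year, Cellular_Subscription, Internet_Users(%), No_of_Internet_Users, Broadband_Subscription (header spellings in the file may differ: the Challenge 1 code reads 'Internet Users(%)' and 'Cellular Subscription' with spaces, so check the printed column names)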
- Irish Context: Track Ireland’s digital transformation over decades
Alternative: Health Systems Analysis
- Dataset: 2.12_Health_systems.csv
- Source: WHO Health Systems Data
- Size: 200+ countries (single timepoint)
- Sample Questions: What makes health systems effective? How do different countries spend on healthcare?
- Key Columns: Country_Region, Health_exp_pct_GDP_2016, Health_exp_per_capita_USD_2016, Physicians_per_1000
Option B: Find Your Own Dataset
If you want to explore something different:
Where to find datasets:
- Kaggle.com: Search for “climate”, “space”, “health”, or “technology” + “dataset”
- Data.gov.ie: Irish government data (education, transport, environment)
- Our World in Data: Global development and research data
- World Bank Open Data: Economic and social indicators
Requirements for your own dataset:
- Must be downloadable as CSV file
- Should have 200+ records for meaningful analysis
- Must include at least one time column (year) and 2-3 numeric columns
- Should address a scientific or social question you care about
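Before asking for approval, it can help to check a candidate dataset against these requirements in code. The following is a minimal sketch, not an official checker: `check_dataset` and `is_number` are hypothetical names, the tiny CSV text is made up, and the guess that a time column is called "year", "date" or "dt" is an assumption you may need to change for your file.

```python
import csv
import io

# Tiny made-up CSV text standing in for a downloaded file (illustration only)
sample_csv = """year,team,goals,assists
2021,Reds,48,30
2022,Reds,52,
2023,Blues,61,41
"""


def is_number(text):
    """Return True if the text can be converted to a float."""
    try:
        float(text)
        return True
    except (ValueError, TypeError):
        return False


def check_dataset(rows, min_records=200):
    """Check rows (a list of dicts from csv.DictReader) against the requirements."""
    columns = list(rows[0].keys()) if rows else []
    # Assumption: a time column is named something like "year", "date" or "dt"
    time_columns = [c for c in columns if c.strip().lower() in ("year", "date", "dt")]
    numeric_columns = []
    for column in columns:
        if column in time_columns:
            continue
        filled = [row[column] for row in rows if row[column]]  # skip blank cells
        if filled and all(is_number(value) for value in filled):
            numeric_columns.append(column)
    return {
        "records": len(rows),
        "enough_records": len(rows) >= min_records,
        "time_columns": time_columns,
        "numeric_columns": numeric_columns,
    }


rows = list(csv.DictReader(io.StringIO(sample_csv)))
report = check_dataset(rows)
print(report)
```

To try it on a real file, replace the `io.StringIO(sample_csv)` part with an opened file, just as in the loading code from Day 4.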
Popular student choices:
- Sports performance data (Olympics, World Cup, etc.)
- Movie/entertainment industry data
- Food and nutrition data
- Transportation and urban planning data
- Education and academic performance data
Dataset approval: Show your chosen dataset to an instructor before starting analysis
Challenge 1: Load and Explore Scientific Data
Challenge Overview: Choose your research track and explore your scientific dataset using the same techniques from Day 4. The skills are identical – only the context is more serious.
Learning Goals:
- Apply CSV loading skills to scientific data
- Discover what questions your data can answer
- Practice scientific thinking with familiar coding techniques
Code Purpose: This is the same data loading structure you used for music, but adapted for scientific datasets. You’ll modify it to explore your chosen research area.
# SCIENTIFIC DATA EXPLORATION CHALLENGE
print("SCIENTIFIC DATA EXPLORER")
import csv
# Step 1: Load your chosen scientific dataset
# TODO: Choose ONE of these real datasets:
# - "GlobalLandTemperaturesByCountry.csv" (climate research - 577k records)
# - "co2_emissions_kt_by_country.csv" (emissions research - 14k records)
# - "Life Expectancy Data.csv" (health research - 3k records)
# - "Final.csv" (technology research - 9k records)
# - "2.12_Health_systems.csv" (health systems - 200 countries)
scientific_data = []
dataset_filename = "REPLACE_WITH_YOUR_CHOSEN_DATASET.csv"
with open(dataset_filename, 'r') as file:
    reader = csv.DictReader(file)
    for record in reader:
        scientific_data.append(record)
print(f"Loaded {len(scientific_data)} scientific records")
# Step 2: Discover what's in your dataset
print("\nExploring dataset structure...")
if scientific_data:
    first_record = scientific_data[0]
    available_columns = list(first_record.keys())
    print(f"Available data columns: {available_columns}")
# Step 3: Look at sample data to understand it
print("\nSample records from dataset:")
for i in range(3):
    record = scientific_data[i]
    # TODO: Print the key information from your dataset
    # FOR REAL DATASETS - choose your dataset and uncomment the right section:
    # Temperature data:
    # print(f"Date: {record['dt']}, Country: {record['Country']}, Temp: {record['AverageTemperature']}")
    # CO2 data:
    # print(f"Country: {record['country_name']}, Year: {record['year']}, CO2: {record['value']}")
    # Life Expectancy data:
    # print(f"Country: {record['Country']}, Year: {record['Year']}, Life Exp: {record['Life expectancy ']}")
    # Technology data:
    # print(f"Country: {record['Entity']}, Year: {record['Year']}, Internet: {record['Internet Users(%)']}%")
    # Health Systems data:
    # print(f"Country: {record['Country_Region']}, Health Exp: {record['Health_exp_pct_GDP_2016']}%")
    # Your code here to print relevant fields
    pass
# Step 4: Convert text numbers to real numbers for analysis
print("\nConverting data for scientific analysis...")
conversion_errors = 0
for record in scientific_data:
    try:
        # TODO: Convert the numeric columns for your dataset
        # FOR REAL DATASETS - uncomment the one you're using:
        # Temperature data conversion:
        # if record['AverageTemperature']: # Skip empty values
        #     record['AverageTemperature'] = float(record['AverageTemperature'])
        # record['year'] = int(record['dt'][:4]) # Extract year from date
        # CO2 data conversion:
        # record['year'] = int(record['year'])
        # record['value'] = float(record['value'])
        # Life Expectancy data conversion:
        # record['Year'] = int(record['Year'])
        # if record['Life expectancy ']: # Handle missing values
        #     record['Life expectancy '] = float(record['Life expectancy '])
        # if record['GDP']:
        #     record['GDP'] = float(record['GDP'])
        # Technology data conversion:
        # record['Year'] = int(record['Year'])
        # if record['Internet Users(%)']:
        #     record['Internet Users(%)'] = float(record['Internet Users(%)'])
        # if record['Cellular Subscription']:
        #     record['Cellular Subscription'] = float(record['Cellular Subscription'])
        # Health Systems data conversion:
        # if record['Health_exp_pct_GDP_2016']:
        #     record['Health_exp_pct_GDP_2016'] = float(record['Health_exp_pct_GDP_2016'])
        # if record['Health_exp_per_capita_USD_2016']:
        #     record['Health_exp_per_capita_USD_2016'] = float(record['Health_exp_per_capita_USD_2016'])
        # Your conversion code here
        pass
    except (ValueError, TypeError):
        conversion_errors += 1
print(f"Scientific data ready! ({conversion_errors} conversion errors)")
# Step 5: Basic dataset insights
print(f"\nDataset Overview:")
print(f"Total records: {len(scientific_data)}")
# TODO: Print some basic insights about your data
# Examples based on your chosen dataset:
# Temperature: Countries covered, year range, recent vs historical averages
# CO2: Time span, highest/lowest emitting countries, recent trends
# Health: Countries covered, year range, developed vs developing nations
# Technology: Digital growth timespan, countries with best/worst access
# Your insights here
TODO for Students:
- Choose your research track: Pick climate, emissions, health, or technology
- Load your dataset: Replace the filename with your chosen dataset
- Explore the structure: Print the column names and understand what data you have
- Display sample data: Show key fields from a few records
- Convert data types: Make the numbers usable for calculations
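For the Step 5 insights, two quick facts are usually worth printing first: how many countries the data covers and which years it spans. Here is one possible sketch, using made-up records shaped like the CO2 dataset after conversion; the values are invented for illustration and the column names assume the CO2 track.

```python
# Made-up records shaped like the CO2 dataset after conversion (illustration only)
sample_records = [
    {"country_name": "Ireland", "year": 1990, "value": 30000.0},
    {"country_name": "Ireland", "year": 2019, "value": 36000.0},
    {"country_name": "France", "year": 1990, "value": 350000.0},
    {"country_name": "France", "year": 2019, "value": 300000.0},
]

# Collect each country once, and every year we have seen
countries = []
years = []
for record in sample_records:
    if record["country_name"] not in countries:
        countries.append(record["country_name"])
    years.append(record["year"])

print(f"Countries covered: {len(countries)}")
print(f"Year range: {min(years)} to {max(years)}")
```

With your own data, swap `sample_records` for `scientific_data` and change the column names to match your track.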
Example: Movie Data Loading for Comparison
What this example shows: The exact same data loading process you just practiced, but with the movie dataset you already know. Notice how the structure is identical!
Key learning: Scientific data loading uses exactly the same skills as entertainment data!
# MOVIE DATA LOADING FOR COMPARISON
print("MOVIE DATA LOADER")
import csv
# Load movie dataset (same process)
movie_data = []
with open('tmdb_5000_movies.csv', 'r') as file:
    reader = csv.DictReader(file)
    for movie in reader:
        movie_data.append(movie)
print(f"Loaded {len(movie_data)} movies")
# Explore structure (same technique)
if movie_data:
    columns = list(movie_data[0].keys())
    print(f"Available columns: {columns[:8]}") # Show first 8
# Display sample data (same approach)
print("\nSample movies:")
for i in range(2):
    movie = movie_data[i]
    print(f"Title: {movie['title']}")
    print(f"Year: {movie['release_date'][:4]}") # Extract year
    print(f"Rating: {movie['vote_average']}")
    print(f"Popularity: {movie['popularity']}")
    print()
# Convert data (same process)
print("Converting movie data...")
for movie in movie_data:
    try:
        movie['popularity'] = float(movie['popularity'])
        movie['vote_average'] = float(movie['vote_average'])
        movie['vote_count'] = int(movie['vote_count'])
    except ValueError:
        pass
print("Movie data converted!")
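Real datasets often contain blank cells. The conversion loops above leave a blank or unreadable value as text, which can later trip up `max()` or `sum()`. One possible way to handle this, assuming missing values should simply be skipped, is a small helper that returns `None` when a value cannot be converted. The name `to_float` is hypothetical and the records are made up for illustration.

```python
def to_float(text):
    """Convert text to a float, or return None if it is blank or not a number."""
    try:
        return float(text)
    except (ValueError, TypeError):
        return None


# Made-up records with one blank and one unreadable value (illustration only)
sample_records = [
    {"Country": "A", "AverageTemperature": "9.5"},
    {"Country": "B", "AverageTemperature": ""},
    {"Country": "C", "AverageTemperature": "n/a"},
    {"Country": "D", "AverageTemperature": "12.25"},
]

clean_values = []
skipped = 0
for record in sample_records:
    number = to_float(record["AverageTemperature"])
    if number is None:
        skipped += 1  # count it, but keep it out of the calculations
    else:
        clean_values.append(number)

print(f"Usable values: {clean_values}")
print(f"Skipped records: {skipped}")
print(f"Highest: {max(clean_values)}, Lowest: {min(clean_values)}")
```

Reporting how many records were skipped is part of honest analysis: your audience can see how much of the data the result is based on.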
Challenge 2: Find Patterns in Scientific Data (35 minutes)
Challenge Overview: Use the same pattern-finding techniques from Day 4 to discover trends in your scientific dataset. You’re looking for the “biggest hits” – but now it’s biggest changes, highest values, or most important patterns.
Learning Goals:
- Apply max/min finding to scientific questions
- Look for trends over time in scientific data
- Practice the same analysis skills with serious data
Code Purpose: This template shows how to find important patterns in scientific data using the same max/min techniques you mastered with music data. You’ll adapt it to answer real scientific questions.
# SCIENTIFIC PATTERN DISCOVERY CHALLENGE
print("DISCOVERING SCIENTIFIC PATTERNS")
# PART 1: Find the most extreme values (like finding top songs)
print("\nFinding extreme values in the data...")
# TODO: Choose one numeric column to analyze for extremes
# Examples: 'AverageTemperature', 'value' (CO2), 'Life expectancy ', 'Internet Users(%)'
column_to_analyze = "REPLACE_WITH_YOUR_COLUMN"
# TODO: Extract all values for this column into a list
all_values = []
for record in scientific_data:
    # Your code here - add the column value to the list
    pass
# TODO: Find the maximum and minimum values
if all_values:
    highest_value = None # Your code here (replace None)
    lowest_value = None # Your code here (replace None)
    print(f"Highest {column_to_analyze}: {highest_value}")
    print(f"Lowest {column_to_analyze}: {lowest_value}")
# TODO: Find which records have these extreme values
print("\nRecords with extreme values:")
for record in scientific_data:
    # Your code here - find and print records with highest/lowest values
    pass
# PART 2: Look for patterns over time (if your data has years)
print("\nAnalyzing trends over time...")
# TODO: Group data by year and calculate averages
# This is similar to grouping music by genre
yearly_data = {}
for record in scientific_data:
    try:
        year = None # Extract year from your data (replace None)
        value = None # Extract the value you're analyzing (replace None)
        # TODO: Add this year's data to the yearly_data dictionary
        if year in yearly_data:
            # Add to existing year
            pass
        else:
            # Create new year entry
            pass
    except (ValueError, KeyError):
        continue
# TODO: Calculate average for each year
print("Yearly averages:")
for year in sorted(yearly_data.keys())[:10]: # Show first 10 years
    # Your code here - calculate and print yearly average
    pass
# PART 3: Compare different categories (like comparing genres)
print("\nComparing different categories...")
# TODO: Choose a category column to compare
# Examples: 'Country', 'country_name', 'Entity', 'Country_Region'
category_column = "REPLACE_WITH_CATEGORY_COLUMN"
# TODO: Calculate averages for each category (same logic as music genres)
category_averages = {}
# Your code here - group by category and calculate averages
# TODO: Find the best and worst performing categories
if category_averages:
    best_category = None # Your code here (replace None)
    worst_category = None # Your code here (replace None)
    print(f"Best performing {category_column}: {best_category}")
    print(f"Lowest performing {category_column}: {worst_category}")
TODO for Students:
- Choose your analysis column: Pick the most interesting numeric column from your dataset
- Find extremes: Implement the max/min finding logic
- Identify extreme records: Find which countries or records have the highest/lowest values
- Analyze time trends: If your data has years, look for changes over time
- Compare categories: Group by a category and compare averages
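For the "identify extreme records" step, one alternative to finding the value first and then looping again is to give `max()` and `min()` a `key` function, the same idea as the `key=` used to sort movies by popularity. This is a minimal sketch with made-up records; it assumes the values are already converted to numbers and that records with missing values have been filtered out.

```python
# Made-up records for illustration only (values already converted to numbers)
sample_records = [
    {"country_name": "A", "year": 2019, "value": 120.0},
    {"country_name": "B", "year": 2019, "value": 980.5},
    {"country_name": "C", "year": 2019, "value": 45.2},
]

# key= tells max()/min() which field to compare; the whole record is returned
highest_record = max(sample_records, key=lambda record: record["value"])
lowest_record = min(sample_records, key=lambda record: record["value"])

print(f"Highest: {highest_record['country_name']} ({highest_record['value']})")
print(f"Lowest: {lowest_record['country_name']} ({lowest_record['value']})")
```

If several records tie for the highest value, this approach returns only the first one, whereas the loop version prints them all.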
Example: Movie Pattern Discovery
What this example shows: The same pattern-finding techniques applied to movie data. Notice how the logic structure matches what you’re implementing for scientific data!
Key learning: Pattern discovery works identically whether you’re analyzing box office hits or scientific trends!
# MOVIE PATTERN DISCOVERY EXAMPLE
print("DISCOVERING MOVIE PATTERNS")
# Part 1: Find extreme values (same logic as scientific data)
print("Finding highest and lowest rated movies...")
all_ratings = []
for movie in movie_data:
    all_ratings.append(movie['vote_average'])
highest_rating = max(all_ratings)
lowest_rating = min(all_ratings)
print(f"Highest movie rating: {highest_rating}")
print(f"Lowest movie rating: {lowest_rating}")
# Find movies with extreme ratings
for movie in movie_data:
    if movie['vote_average'] == highest_rating:
        print(f"Highest rated movie: {movie['title']}")
    elif movie['vote_average'] == lowest_rating:
        print(f"Lowest rated movie: {movie['title']}")
# Part 2: Analyze trends over time (same approach)
print("\nAnalyzing movie trends by year...")
yearly_ratings = {}
for movie in movie_data:
    try:
        year = int(movie['release_date'][:4]) # Extract year
        rating = movie['vote_average']
        if year in yearly_ratings:
            yearly_ratings[year].append(rating)
        else:
            yearly_ratings[year] = [rating]
    except (ValueError, TypeError):
        continue
# Calculate yearly averages
print("Average ratings by year:")
for year in sorted(yearly_ratings.keys())[-10:]: # Last 10 years
    avg_rating = sum(yearly_ratings[year]) / len(yearly_ratings[year])
    print(f"{year}: {avg_rating:.1f}")
# Part 3: Compare by language (like comparing categories)
print("\nComparing movie ratings by language...")
language_ratings = {}
for movie in movie_data:
    lang = movie['original_language']
    rating = movie['vote_average']
    if lang in language_ratings:
        language_ratings[lang].append(rating)
    else:
        language_ratings[lang] = [rating]
# Find best and worst languages (minimum 10 movies)
lang_averages = {}
for lang, ratings in language_ratings.items():
    if len(ratings) >= 10: # Only languages with enough movies
        lang_averages[lang] = sum(ratings) / len(ratings)
best_lang = max(lang_averages, key=lang_averages.get)
worst_lang = min(lang_averages, key=lang_averages.get)
print(f"Best average ratings: {best_lang} ({lang_averages[best_lang]:.1f})")
print(f"Lowest average ratings: {worst_lang} ({lang_averages[worst_lang]:.1f})")
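The same grouping logic can help with the "Irish Context" questions from the dataset tracks: group by year once for all countries and once for Ireland only, then compare. The sketch below uses made-up temperature records, and `yearly_averages` is a hypothetical helper name. Note that it takes a simple unweighted average across the countries in the list, which is not the same thing as a true global temperature.

```python
# Made-up temperature records for illustration only
sample_records = [
    {"Country": "Ireland", "year": 2000, "AverageTemperature": 9.6},
    {"Country": "Spain", "year": 2000, "AverageTemperature": 14.0},
    {"Country": "Norway", "year": 2000, "AverageTemperature": 1.6},
    {"Country": "Ireland", "year": 2010, "AverageTemperature": 9.0},
    {"Country": "Spain", "year": 2010, "AverageTemperature": 14.4},
    {"Country": "Norway", "year": 2010, "AverageTemperature": 0.6},
]


def yearly_averages(records):
    """Group records by year, then average each year's values."""
    grouped = {}
    for record in records:
        year = record["year"]
        if year in grouped:
            grouped[year].append(record["AverageTemperature"])
        else:
            grouped[year] = [record["AverageTemperature"]]
    averages = {}
    for year, values in grouped.items():
        averages[year] = sum(values) / len(values)
    return averages


# Keep only the Irish records, using the same filter pattern as for English movies
ireland_only = []
for record in sample_records:
    if record["Country"] == "Ireland":
        ireland_only.append(record)

all_avg = yearly_averages(sample_records)
ireland_avg = yearly_averages(ireland_only)

for year in sorted(all_avg):
    if year in ireland_avg:
        gap = ireland_avg[year] - all_avg[year]
        print(f"{year}: Ireland {ireland_avg[year]:.1f}, all countries {all_avg[year]:.1f}, gap {gap:+.1f}")
```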
Challenge 3: Create Your Scientific Visualization
Challenge Overview: Create one simple, clear chart that tells an important scientific story. This uses the same visualization skills from Day 4, but now your chart could inform real policy decisions or scientific understanding.
Learning Goals:
- Apply matplotlib skills to scientific communication
- Create charts that non-scientists can understand
- Tell compelling stories with data visualization
Code Purpose: These are simplified chart templates adapted for scientific data. Choose ONE chart type that best shows your most interesting finding.
Scientific Chart Option 1: Time Trend Analysis
# SCIENTIFIC CHART 1: TRENDS OVER TIME
import matplotlib.pyplot as plt
# TODO: Create a chart showing how something changes over time
# Examples: Temperature over years, life expectancy improvements, internet adoption over time
# Step 1: Prepare your time data
years = []
values = []
# TODO: Extract years and corresponding values from your dataset
for record in scientific_data:
    try:
        year = None # Extract year from your record (replace None)
        value = None # Extract the value you want to track over time (replace None)
        years.append(year)
        values.append(value)
    except (ValueError, KeyError):
        continue
# Step 2: Create the line chart
plt.figure(figsize=(12, 6))
plt.plot(years, values, marker='o', linewidth=2, markersize=4)
# Step 3: Add scientific formatting
plt.title('YOUR_SCIENTIFIC_TITLE_HERE', fontsize=14, fontweight='bold')
plt.xlabel('Year')
plt.ylabel('YOUR_MEASUREMENT_HERE')
plt.grid(True, alpha=0.3)
# Step 4: Add context annotations
# TODO: Add a text box explaining what the trend means
plt.text(0.05, 0.95, 'Key Insight: YOUR_INSIGHT_HERE',
         transform=plt.gca().transAxes, fontsize=10,
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))
plt.tight_layout()
plt.show()
# TODO: Print your analysis
print("Scientific Analysis:")
print(f"Time period analyzed: {min(years)} to {max(years)}")
print(f"Overall trend: YOUR_DESCRIPTION_HERE")
Scientific Chart Option 2: Category Comparison
# SCIENTIFIC CHART 2: COMPARING DIFFERENT GROUPS
import matplotlib.pyplot as plt
# TODO: Compare different countries, regions, or other categories
# Examples: CO2 by country, life expectancy by region, internet access by continent
# Step 1: Calculate averages for each category
category_averages = {}
for record in scientific_data:
    try:
        category = None # Extract category (country, region, etc.) - replace None
        value = None # Extract value to compare - replace None
        # TODO: Add to category averages
        if category in category_averages:
            # Add to existing category
            pass
        else:
            # Create new category
            pass
    except (ValueError, KeyError):
        continue
# TODO: Calculate final averages
final_averages = {}
for category, data in category_averages.items():
    # Your calculation here
    pass
# Step 2: Get top 8 categories for chart
sorted_categories = sorted(final_averages.items(), key=lambda x: x[1], reverse=True)
top_categories = sorted_categories[:8]
categories = [item[0] for item in top_categories]
values = [item[1] for item in top_categories]
# Step 3: Create bar chart
plt.figure(figsize=(12, 6))
bars = plt.bar(categories, values, color='lightcoral')
bars[0].set_color('darkred') # Highlight the top performer
plt.title('YOUR_COMPARISON_TITLE_HERE', fontsize=14, fontweight='bold')
plt.ylabel('YOUR_MEASUREMENT_HERE')
plt.xticks(rotation=45, ha='right')
# Step 4: Add value labels
for bar, value in zip(bars, values):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + max(values)*0.01,
             f'{value:.1f}', ha='center', va='bottom')
plt.tight_layout()
plt.show()
print(f"Highest performing: {categories[0]} ({values[0]:.1f})")
print(f"Scientific insight: YOUR_INSIGHT_HERE")
TODO for Students:
- Choose your chart type: Time trends or category comparison?
- Implement the data extraction: Fill in the TODO sections with your specific data
- Customize titles and labels: Make them scientifically accurate and clear
- Add your insights: What does your chart reveal about your scientific question?
- Test and refine: Does your chart clearly communicate your finding?
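One thing to watch with the time trend option: if your dataset has many records per year (for example, one per country), plotting every record gives a cluttered line that zigzags between countries. A possible approach, sketched here with made-up numbers, is to average per year first and build one point per year, in year order. The two resulting lists can then be passed to `plt.plot(years, values)`.

```python
# Made-up records for illustration only: two measurements per year, out of order
sample_records = [
    {"year": 2001, "value": 10.0},
    {"year": 2000, "value": 8.0},
    {"year": 2000, "value": 12.0},
    {"year": 2001, "value": 14.0},
    {"year": 2002, "value": 13.0},
    {"year": 2002, "value": 15.0},
]

# Step 1: group the values by year
values_by_year = {}
for record in sample_records:
    year = record["year"]
    if year in values_by_year:
        values_by_year[year].append(record["value"])
    else:
        values_by_year[year] = [record["value"]]

# Step 2: build two lists in year order, one point per year
years = []
values = []
for year in sorted(values_by_year):
    years.append(year)
    values.append(sum(values_by_year[year]) / len(values_by_year[year]))

print(years)
print(values)
# These two lists are ready for plt.plot(years, values)
```

Sorting the years before plotting matters: a line chart connects points in the order they appear in the lists.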
Example: Movie Visualization Charts
What this example shows: Two complete scientific-style visualizations using movie data. These demonstrate the same chart-making process you’re implementing, but with entertainment data.
Key learning: The visualization process is identical – extract data, create chart, add insights!
Movie Chart 1: Movie Ratings Trend Over Time
# MOVIE TREND ANALYSIS EXAMPLE
import matplotlib.pyplot as plt
# Extract years and average ratings
yearly_ratings = {}
for movie in movie_data:
    try:
        year = int(movie['release_date'][:4])
        rating = movie['vote_average']
        if 1990 <= year <= 2020: # Focus on recent decades
            if year in yearly_ratings:
                yearly_ratings[year].append(rating)
            else:
                yearly_ratings[year] = [rating]
    except (ValueError, TypeError):
        continue
# Calculate yearly averages
years = []
avg_ratings = []
for year in sorted(yearly_ratings.keys()):
    if len(yearly_ratings[year]) >= 5: # Years with enough movies
        avg_rating = sum(yearly_ratings[year]) / len(yearly_ratings[year])
        years.append(year)
        avg_ratings.append(avg_rating)
# Create trend chart
plt.figure(figsize=(12, 6))
plt.plot(years, avg_ratings, marker='o', linewidth=2, markersize=4, color='blue')
plt.title('Movie Quality Trends: Average Ratings Over Time', fontsize=14, fontweight='bold')
plt.xlabel('Year')
plt.ylabel('Average Movie Rating (1-10)')
plt.grid(True, alpha=0.3)
# Add insight annotation
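recent_ratings = avg_ratings[-5:] # Last 5 years (fewer if the chart has fewer years)
recent_avg = sum(recent_ratings) / len(recent_ratings) # Divide by how many years we actually have
plt.text(0.05, 0.95, f'Recent Trend: {recent_avg:.1f} average rating',
         transform=plt.gca().transAxes, fontsize=10,
         bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.8))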
plt.tight_layout()
plt.show()
print(f"Analysis: Movie ratings from {min(years)} to {max(years)}")
print(f"Recent 5-year average: {recent_avg:.2f}")
Movie Chart 2: Movie Production by Language
# MOVIE CATEGORY COMPARISON EXAMPLE
import matplotlib.pyplot as plt
# Count movies by language
language_counts = {}
for movie in movie_data:
    lang = movie['original_language']
    if lang in language_counts:
        language_counts[lang] += 1
    else:
        language_counts[lang] = 1
# Get top 8 languages
sorted_langs = sorted(language_counts.items(), key=lambda x: x[1], reverse=True)
top_8_langs = sorted_langs[:8]
languages = [item[0] for item in top_8_langs]
counts = [item[1] for item in top_8_langs]
# Create comparison chart
plt.figure(figsize=(10, 6))
bars = plt.bar(languages, counts, color='lightgreen')
bars[0].set_color('darkgreen') # Highlight English
plt.title('Movie Production by Language: Global Cinema Diversity', fontsize=14, fontweight='bold')
plt.ylabel('Number of Movies')
plt.xticks(rotation=45)
# Add value labels
for bar, count in zip(bars, counts):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 20,
             f'{count}', ha='center', va='bottom')
plt.tight_layout()
plt.show()
print(f"Most common language: {languages[0]} ({counts[0]} movies)")
print(f"Cinema diversity: {len(language_counts)} languages represented")
Challenge 4: Scientific Presentation Preparation
Challenge Overview: Prepare a 3-minute presentation of your scientific findings. You’ll present like a real scientist at a conference, sharing one important discovery that could inform decisions or future research.
Learning Goals:
- Communicate scientific findings clearly to a general audience
- Practice defending conclusions with evidence
- Connect data analysis to real-world applications
Presentation Structure (3 minutes total):
1. The Question (30 seconds): “The scientific question I investigated was… This matters because…”
2. The Data (1 minute): “I analyzed [X records] from … Here’s what I found…” [Show your chart]
3. The Discovery (1 minute): “The most important pattern I discovered was… This means…”
4. The Impact (30 seconds): “This research suggests that scientists/governments/people should…”
THE END.