# Load the tidyverse library
library(tidyverse)
# Read in the car sales data
# Make sure the data file is in your lab_0/data/ folder
<- read_csv("data/car_sales_data.csv") car_data
Lab 0: Getting Started with dplyr
Your First Data Analysis
Overview
Welcome to your first lab! In this (not graded) assignment, you’ll practice the fundamental dplyr operations I overviewed in class using car sales data. This lab will help you get comfortable with:
- Basic data exploration
- Column selection and manipulation
- Creating new variables
- Filtering data
- Grouping and summarizing
Instructions: Copy this template into your portfolio repository under a lab_0/
folder, then complete each section with your code and answers. You will write the code under the comment section in each chunk. Be sure to also copy the data folder into your lab_0
folder.
Setup
Exercise 1: Getting to Know Your Data
1.1 Data Structure Exploration
Explore the structure of your data and answer these questions:
# Use glimpse() to see the data structure
glimpse(car_data)
Rows: 50,000
Columns: 7
$ Manufacturer <chr> "Ford", "Porsche", "Ford", "Toyota", "VW", "Ford…
$ Model <chr> "Fiesta", "718 Cayman", "Mondeo", "RAV4", "Polo"…
$ `Engine size` <dbl> 1.0, 4.0, 1.6, 1.8, 1.0, 1.4, 1.8, 1.4, 1.2, 2.0…
$ `Fuel type` <chr> "Petrol", "Petrol", "Diesel", "Hybrid", "Petrol"…
$ `Year of manufacture` <dbl> 2002, 2016, 2014, 1988, 2006, 2018, 2010, 2015, …
$ Mileage <dbl> 127300, 57850, 39190, 210814, 127869, 33603, 866…
$ Price <dbl> 3074, 49704, 24072, 1705, 4101, 29204, 14350, 30…
# Check the column names
colnames(car_data)
[1] "Manufacturer" "Model" "Engine size"
[4] "Fuel type" "Year of manufacture" "Mileage"
[7] "Price"
# Look at the first few rows
head(car_data)
# A tibble: 6 × 7
Manufacturer Model `Engine size` `Fuel type` `Year of manufacture` Mileage
<chr> <chr> <dbl> <chr> <dbl> <dbl>
1 Ford Fiesta 1 Petrol 2002 127300
2 Porsche 718 Caym… 4 Petrol 2016 57850
3 Ford Mondeo 1.6 Diesel 2014 39190
4 Toyota RAV4 1.8 Hybrid 1988 210814
5 VW Polo 1 Petrol 2006 127869
6 Ford Focus 1.4 Petrol 2018 33603
# ℹ 1 more variable: Price <dbl>
Questions to answer: - How many rows and columns does the dataset have? - What types of variables do you see (numeric, character, etc.)? - Are there any column names that might cause problems? Why?
Your answers: - Rows: 50,000 - Columns: 7 - Variable types: Character (chr), Double (dbl) - Problematic names: The columns that have spaces in their names require quotes around them while scripting
1.2 Tibble vs Data Frame
Compare how tibbles and data frames display:
# Look at the tibble version (what we have)
car_data
# A tibble: 50,000 × 7
Manufacturer Model `Engine size` `Fuel type` `Year of manufacture` Mileage
<chr> <chr> <dbl> <chr> <dbl> <dbl>
1 Ford Fiesta 1 Petrol 2002 127300
2 Porsche 718 Cay… 4 Petrol 2016 57850
3 Ford Mondeo 1.6 Diesel 2014 39190
4 Toyota RAV4 1.8 Hybrid 1988 210814
5 VW Polo 1 Petrol 2006 127869
6 Ford Focus 1.4 Petrol 2018 33603
7 Ford Mondeo 1.8 Diesel 2010 86686
8 Toyota Prius 1.4 Hybrid 2015 30663
9 VW Polo 1.2 Petrol 2012 73470
10 Ford Focus 2 Diesel 1992 262514
# ℹ 49,990 more rows
# ℹ 1 more variable: Price <dbl>
# Convert to regular data frame and display
<- as.data.frame(car_data)
car_df %>% head(10) car_df
Manufacturer Model Engine size Fuel type Year of manufacture Mileage
1 Ford Fiesta 1.0 Petrol 2002 127300
2 Porsche 718 Cayman 4.0 Petrol 2016 57850
3 Ford Mondeo 1.6 Diesel 2014 39190
4 Toyota RAV4 1.8 Hybrid 1988 210814
5 VW Polo 1.0 Petrol 2006 127869
6 Ford Focus 1.4 Petrol 2018 33603
7 Ford Mondeo 1.8 Diesel 2010 86686
8 Toyota Prius 1.4 Hybrid 2015 30663
9 VW Polo 1.2 Petrol 2012 73470
10 Ford Focus 2.0 Diesel 1992 262514
Price
1 3074
2 49704
3 24072
4 1705
5 4101
6 29204
7 14350
8 30297
9 9977
10 1049
Question: What differences do you notice in how they print?
Your answer: In Markdown, the identifier icon in the top left changes from “A tibble: 50,000 x 7” to “Description: df [50,000 x 7]” for tibbles and dfs, respectively. When rendering the website, printing a data frame prints all rows and does not allow for scrolling across columns. Instead, it prints extra columns in another row.
Exercise 2: Basic Column Operations
2.1 Selecting Columns
Practice selecting different combinations of columns:
# Select just Model and Mileage columns
select(.data = car_data, Model, Mileage)
# A tibble: 50,000 × 2
Model Mileage
<chr> <dbl>
1 Fiesta 127300
2 718 Cayman 57850
3 Mondeo 39190
4 RAV4 210814
5 Polo 127869
6 Focus 33603
7 Mondeo 86686
8 Prius 30663
9 Polo 73470
10 Focus 262514
# ℹ 49,990 more rows
# Select Manufacturer, Price, and Fuel type
%>% select(Manufacturer, Price, `Fuel type`) car_data
# A tibble: 50,000 × 3
Manufacturer Price `Fuel type`
<chr> <dbl> <chr>
1 Ford 3074 Petrol
2 Porsche 49704 Petrol
3 Ford 24072 Diesel
4 Toyota 1705 Hybrid
5 VW 4101 Petrol
6 Ford 29204 Petrol
7 Ford 14350 Diesel
8 Toyota 30297 Hybrid
9 VW 9977 Petrol
10 Ford 1049 Diesel
# ℹ 49,990 more rows
# Challenge: Select all columns EXCEPT Engine Size
%>% select(-`Engine size`) car_data
# A tibble: 50,000 × 6
Manufacturer Model `Fuel type` `Year of manufacture` Mileage Price
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Ford Fiesta Petrol 2002 127300 3074
2 Porsche 718 Cayman Petrol 2016 57850 49704
3 Ford Mondeo Diesel 2014 39190 24072
4 Toyota RAV4 Hybrid 1988 210814 1705
5 VW Polo Petrol 2006 127869 4101
6 Ford Focus Petrol 2018 33603 29204
7 Ford Mondeo Diesel 2010 86686 14350
8 Toyota Prius Hybrid 2015 30663 30297
9 VW Polo Petrol 2012 73470 9977
10 Ford Focus Diesel 1992 262514 1049
# ℹ 49,990 more rows
2.2 Renaming Columns
Let’s fix a problematic column name:
# Rename 'Year of manufacture' to year
<- car_data %>%
car_data rename(year = `Year of manufacture`)
# Check that it worked
names(car_data)
[1] "Manufacturer" "Model" "Engine size" "Fuel type" "year"
[6] "Mileage" "Price"
Question: Why did we need backticks around Year of manufacture
but not around year
?
Your answer: year no longer has spaces
Exercise 3: Creating New Columns
3.1 Calculate Car Age
# Create a mileage_per_year column
<- car_data %>%
car_data mutate(age = 2025 - year,
mileage_per_year = Mileage / age)
# Look at your new columns
%>% select(Model, year, age, Mileage, mileage_per_year) car_data
# A tibble: 50,000 × 5
Model year age Mileage mileage_per_year
<chr> <dbl> <dbl> <dbl> <dbl>
1 Fiesta 2002 23 127300 5535.
2 718 Cayman 2016 9 57850 6428.
3 Mondeo 2014 11 39190 3563.
4 RAV4 1988 37 210814 5698.
5 Polo 2006 19 127869 6730.
6 Focus 2018 7 33603 4800.
7 Mondeo 2010 15 86686 5779.
8 Prius 2015 10 30663 3066.
9 Polo 2012 13 73470 5652.
10 Focus 1992 33 262514 7955.
# ℹ 49,990 more rows
3.2 Categorize Cars
# Create a price_category column where if price is < 15000, its is coded as budget, between 15000 and 30000 is midrange and greater than 30000 is luxury (use case_when)
<- car_data %>%
car_data mutate(price_category = case_when(Price < 15000 ~ "budget",
>= 15000 & Price < 30000 ~ "midrange",
Price .default = "luxury"))
# Check your categories select the new column and show it
%>% select(Price, price_category) car_data
# A tibble: 50,000 × 2
Price price_category
<dbl> <chr>
1 3074 budget
2 49704 luxury
3 24072 midrange
4 1705 budget
5 4101 budget
6 29204 midrange
7 14350 budget
8 30297 luxury
9 9977 budget
10 1049 budget
# ℹ 49,990 more rows
Exercise 4: Filtering Practice
4.1 Basic Filtering
# Find all Toyota cars
%>% filter(Manufacturer == "Toyota") car_data
# A tibble: 12,554 × 10
Manufacturer Model `Engine size` `Fuel type` year Mileage Price age
<chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 Toyota RAV4 1.8 Hybrid 1988 210814 1705 37
2 Toyota Prius 1.4 Hybrid 2015 30663 30297 10
3 Toyota RAV4 2.2 Petrol 2007 79393 16026 18
4 Toyota Yaris 1.4 Petrol 1998 97286 4046 27
5 Toyota RAV4 2.4 Hybrid 2003 117425 11667 22
6 Toyota Yaris 1.2 Petrol 1992 245990 720 33
7 Toyota RAV4 2 Hybrid 2018 28381 52671 7
8 Toyota Prius 1 Hybrid 2003 115291 6512 22
9 Toyota Prius 1 Hybrid 1990 238571 961 35
10 Toyota Prius 1.8 Hybrid 2017 31958 38961 8
# ℹ 12,544 more rows
# ℹ 2 more variables: mileage_per_year <dbl>, price_category <chr>
# Find cars with mileage less than 30,000
%>% filter(Mileage < 30000) car_data
# A tibble: 5,402 × 10
Manufacturer Model `Engine size` `Fuel type` year Mileage Price age
<chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 Toyota RAV4 2 Hybrid 2018 28381 52671 7
2 VW Golf 2 Petrol 2020 18985 36387 5
3 BMW M5 4 Petrol 2017 22759 97758 8
4 Toyota RAV4 2.4 Petrol 2018 24588 49125 7
5 VW Golf 2 Hybrid 2018 25017 36957 7
6 Porsche 718 Cayman 2.4 Petrol 2021 14070 69526 4
7 Ford Focus 1.8 Petrol 2020 22371 40336 5
8 Ford Mondeo 1.6 Diesel 2015 21834 28435 10
9 VW Passat 1.6 Diesel 2018 22122 36634 7
10 VW Passat 1.4 Diesel 2020 21413 39310 5
# ℹ 5,392 more rows
# ℹ 2 more variables: mileage_per_year <dbl>, price_category <chr>
# Find luxury cars (from price category) with low mileage
%>% filter(price_category == "luxury" & Mileage < 30000) car_data
# A tibble: 3,257 × 10
Manufacturer Model `Engine size` `Fuel type` year Mileage Price age
<chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 Toyota RAV4 2 Hybrid 2018 28381 52671 7
2 VW Golf 2 Petrol 2020 18985 36387 5
3 BMW M5 4 Petrol 2017 22759 97758 8
4 Toyota RAV4 2.4 Petrol 2018 24588 49125 7
5 VW Golf 2 Hybrid 2018 25017 36957 7
6 Porsche 718 Cayman 2.4 Petrol 2021 14070 69526 4
7 Ford Focus 1.8 Petrol 2020 22371 40336 5
8 VW Passat 1.6 Diesel 2018 22122 36634 7
9 VW Passat 1.4 Diesel 2020 21413 39310 5
10 Toyota RAV4 2.4 Petrol 2021 6829 66031 4
# ℹ 3,247 more rows
# ℹ 2 more variables: mileage_per_year <dbl>, price_category <chr>
4.2 Multiple Conditions
# Find cars that are EITHER Ford OR Porsche
%>% filter(Manufacturer %in% c("Ford", "Porsche")) car_data
# A tibble: 17,568 × 10
Manufacturer Model `Engine size` `Fuel type` year Mileage Price age
<chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 Ford Fiesta 1 Petrol 2002 127300 3074 23
2 Porsche 718 Cayman 4 Petrol 2016 57850 49704 9
3 Ford Mondeo 1.6 Diesel 2014 39190 24072 11
4 Ford Focus 1.4 Petrol 2018 33603 29204 7
5 Ford Mondeo 1.8 Diesel 2010 86686 14350 15
6 Ford Focus 2 Diesel 1992 262514 1049 33
7 Ford Mondeo 1.6 Diesel 1996 77584 5667 29
8 Porsche 911 2.6 Petrol 2009 66273 41963 16
9 Porsche 911 3.5 Petrol 2005 151556 19747 20
10 Ford Focus 1 Hybrid 2010 85131 12472 15
# ℹ 17,558 more rows
# ℹ 2 more variables: mileage_per_year <dbl>, price_category <chr>
# Find cars with price between $20,000 and $35,000
%>% filter(Price > 20000 & Price < 35000) car_data
# A tibble: 7,301 × 10
Manufacturer Model `Engine size` `Fuel type` year Mileage Price age
<chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 Ford Mondeo 1.6 Diesel 2014 39190 24072 11
2 Ford Focus 1.4 Petrol 2018 33603 29204 7
3 Toyota Prius 1.4 Hybrid 2015 30663 30297 10
4 Toyota Prius 1.4 Hybrid 2016 43893 29946 9
5 Toyota Prius 1.4 Hybrid 2016 43130 30085 9
6 VW Passat 1.6 Petrol 2016 64344 23641 9
7 Ford Mondeo 1.6 Diesel 2015 21834 28435 10
8 BMW M5 4.4 Petrol 2008 109941 31711 17
9 BMW Z4 2.2 Petrol 2014 61332 26084 11
10 Porsche 911 3.5 Petrol 2003 107705 24378 22
# ℹ 7,291 more rows
# ℹ 2 more variables: mileage_per_year <dbl>, price_category <chr>
# Find diesel cars less than 10 years old
%>% filter(`Fuel type` == "Diesel" & age < 10) car_data
# A tibble: 2,040 × 10
Manufacturer Model `Engine size` `Fuel type` year Mileage Price age
<chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 Ford Fiesta 1 Diesel 2017 38370 16257 8
2 VW Passat 1.6 Diesel 2018 22122 36634 7
3 VW Passat 1.4 Diesel 2020 21413 39310 5
4 BMW X3 2 Diesel 2018 27389 44018 7
5 Ford Mondeo 2 Diesel 2016 51724 28482 9
6 Porsche Cayenne 2.6 Diesel 2019 20147 76182 6
7 VW Polo 1.2 Diesel 2018 37411 19649 7
8 Ford Mondeo 1.8 Diesel 2016 29439 30886 9
9 Ford Mondeo 1.4 Diesel 2020 18929 37720 5
10 Ford Mondeo 1.4 Diesel 2018 42017 28904 7
# ℹ 2,030 more rows
# ℹ 2 more variables: mileage_per_year <dbl>, price_category <chr>
Question: How many diesel cars are less than 10 years old?
Your answer: 2,040 vehicles
Exercise 5: Grouping and Summarizing
5.1 Basic Summaries
# Calculate average price by manufacturer
<- car_data %>%
avg_price_by_brand group_by(Manufacturer) %>%
summarize(avg_price = mean(Price, na.rm = TRUE))
avg_price_by_brand
# A tibble: 5 × 2
Manufacturer avg_price
<chr> <dbl>
1 BMW 24429.
2 Ford 10672.
3 Porsche 29104.
4 Toyota 14340.
5 VW 10363.
# Calculate average mileage by fuel type
<- car_data %>%
avg_mileage_by_fuel group_by(`Fuel type`) %>%
summarize(avg_mileage = mean(Mileage, na.rm = T))
avg_mileage_by_fuel
# A tibble: 3 × 2
`Fuel type` avg_mileage
<chr> <dbl>
1 Diesel 112667.
2 Hybrid 111622.
3 Petrol 112795.
# Count cars by manufacturer
<- car_data %>%
count_by_manufacturer group_by(Manufacturer) %>%
summarize(count = n())
count_by_manufacturer
# A tibble: 5 × 2
Manufacturer count
<chr> <int>
1 BMW 4965
2 Ford 14959
3 Porsche 2609
4 Toyota 12554
5 VW 14913
5.2 Categorical Summaries
# Frequency table for price categories
<- car_data %>%
price_cat_freq_table group_by(price_category) %>%
summarize(count = n()) %>%
mutate(freq = count / sum(count))
price_cat_freq_table
# A tibble: 3 × 3
price_category count freq
<chr> <int> <dbl>
1 budget 34040 0.681
2 luxury 6179 0.124
3 midrange 9781 0.196
Submission Notes
To submit this lab: 1. Make sure your code runs without errors 2. Fill in all the “[YOUR ANSWER]” sections and complete all of the empty code! 3. Save this file in your portfolio’s lab_0/
folder 4. Commit and push to GitHub 5. Check that it appears on your GitHub Pages portfolio site
Questions? Post on the canvas discussion board or come to office hours!