Lab 0: Getting Started with dplyr

Your First Data Analysis

Author

Your Name Here

Published

September 15, 2025

Overview

Welcome to your first lab! In this (not graded) assignment, you’ll practice the fundamental dplyr operations I overviewed in class using car sales data. This lab will help you get comfortable with:

  • Basic data exploration
  • Column selection and manipulation
  • Creating new variables
  • Filtering data
  • Grouping and summarizing

Instructions: Copy this template into your portfolio repository under a lab_0/ folder, then complete each section with your code and answers. You will write the code under the comment section in each chunk. Be sure to also copy the data folder into your lab_0 folder.

Setup

# Load the tidyverse library
library(tidyverse)

# Read in the car sales data
# Make sure the data file is in your lab_0/data/ folder
car_data <- read_csv("data/car_sales_data.csv")

Exercise 1: Getting to Know Your Data

1.1 Data Structure Exploration

Explore the structure of your data and answer these questions:

# Use glimpse() to see the data structure
glimpse(car_data)
Rows: 50,000
Columns: 7
$ Manufacturer          <chr> "Ford", "Porsche", "Ford", "Toyota", "VW", "Ford…
$ Model                 <chr> "Fiesta", "718 Cayman", "Mondeo", "RAV4", "Polo"…
$ `Engine size`         <dbl> 1.0, 4.0, 1.6, 1.8, 1.0, 1.4, 1.8, 1.4, 1.2, 2.0…
$ `Fuel type`           <chr> "Petrol", "Petrol", "Diesel", "Hybrid", "Petrol"…
$ `Year of manufacture` <dbl> 2002, 2016, 2014, 1988, 2006, 2018, 2010, 2015, …
$ Mileage               <dbl> 127300, 57850, 39190, 210814, 127869, 33603, 866…
$ Price                 <dbl> 3074, 49704, 24072, 1705, 4101, 29204, 14350, 30…
# Check the column names
colnames(car_data)
[1] "Manufacturer"        "Model"               "Engine size"        
[4] "Fuel type"           "Year of manufacture" "Mileage"            
[7] "Price"              
# Look at the first few rows
head(car_data)
# A tibble: 6 × 7
  Manufacturer Model     `Engine size` `Fuel type` `Year of manufacture` Mileage
  <chr>        <chr>             <dbl> <chr>                       <dbl>   <dbl>
1 Ford         Fiesta              1   Petrol                       2002  127300
2 Porsche      718 Caym…           4   Petrol                       2016   57850
3 Ford         Mondeo              1.6 Diesel                       2014   39190
4 Toyota       RAV4                1.8 Hybrid                       1988  210814
5 VW           Polo                1   Petrol                       2006  127869
6 Ford         Focus               1.4 Petrol                       2018   33603
# ℹ 1 more variable: Price <dbl>

Questions to answer: - How many rows and columns does the dataset have? - What types of variables do you see (numeric, character, etc.)? - Are there any column names that might cause problems? Why?

Your answers: - Rows: 50,000 - Columns: 7 - Variable types: Character (chr), Double (dbl) - Problematic names: The columns that have spaces in their names require quotes around them while scripting

1.2 Tibble vs Data Frame

Compare how tibbles and data frames display:

# Look at the tibble version (what we have)
car_data
# A tibble: 50,000 × 7
   Manufacturer Model    `Engine size` `Fuel type` `Year of manufacture` Mileage
   <chr>        <chr>            <dbl> <chr>                       <dbl>   <dbl>
 1 Ford         Fiesta             1   Petrol                       2002  127300
 2 Porsche      718 Cay…           4   Petrol                       2016   57850
 3 Ford         Mondeo             1.6 Diesel                       2014   39190
 4 Toyota       RAV4               1.8 Hybrid                       1988  210814
 5 VW           Polo               1   Petrol                       2006  127869
 6 Ford         Focus              1.4 Petrol                       2018   33603
 7 Ford         Mondeo             1.8 Diesel                       2010   86686
 8 Toyota       Prius              1.4 Hybrid                       2015   30663
 9 VW           Polo               1.2 Petrol                       2012   73470
10 Ford         Focus              2   Diesel                       1992  262514
# ℹ 49,990 more rows
# ℹ 1 more variable: Price <dbl>
# Convert to regular data frame and display
car_df <- as.data.frame(car_data)
car_df %>% head(10)
   Manufacturer      Model Engine size Fuel type Year of manufacture Mileage
1          Ford     Fiesta         1.0    Petrol                2002  127300
2       Porsche 718 Cayman         4.0    Petrol                2016   57850
3          Ford     Mondeo         1.6    Diesel                2014   39190
4        Toyota       RAV4         1.8    Hybrid                1988  210814
5            VW       Polo         1.0    Petrol                2006  127869
6          Ford      Focus         1.4    Petrol                2018   33603
7          Ford     Mondeo         1.8    Diesel                2010   86686
8        Toyota      Prius         1.4    Hybrid                2015   30663
9            VW       Polo         1.2    Petrol                2012   73470
10         Ford      Focus         2.0    Diesel                1992  262514
   Price
1   3074
2  49704
3  24072
4   1705
5   4101
6  29204
7  14350
8  30297
9   9977
10  1049

Question: What differences do you notice in how they print?

Your answer: In Markdown, the identifier icon in the top left changes from “A tibble: 50,000 x 7” to “Description: df [50,000 x 7]” for tibbles and dfs, respectively. When rendering the website, printing a data frame prints all rows and does not allow for scrolling across columns. Instead, it prints extra columns in another row.

Exercise 2: Basic Column Operations

2.1 Selecting Columns

Practice selecting different combinations of columns:

# Select just Model and Mileage columns
select(.data = car_data, Model, Mileage)
# A tibble: 50,000 × 2
   Model      Mileage
   <chr>        <dbl>
 1 Fiesta      127300
 2 718 Cayman   57850
 3 Mondeo       39190
 4 RAV4        210814
 5 Polo        127869
 6 Focus        33603
 7 Mondeo       86686
 8 Prius        30663
 9 Polo         73470
10 Focus       262514
# ℹ 49,990 more rows
# Select Manufacturer, Price, and Fuel type
car_data %>% select(Manufacturer, Price, `Fuel type`)
# A tibble: 50,000 × 3
   Manufacturer Price `Fuel type`
   <chr>        <dbl> <chr>      
 1 Ford          3074 Petrol     
 2 Porsche      49704 Petrol     
 3 Ford         24072 Diesel     
 4 Toyota        1705 Hybrid     
 5 VW            4101 Petrol     
 6 Ford         29204 Petrol     
 7 Ford         14350 Diesel     
 8 Toyota       30297 Hybrid     
 9 VW            9977 Petrol     
10 Ford          1049 Diesel     
# ℹ 49,990 more rows
# Challenge: Select all columns EXCEPT Engine Size
car_data %>% select(-`Engine size`)
# A tibble: 50,000 × 6
   Manufacturer Model      `Fuel type` `Year of manufacture` Mileage Price
   <chr>        <chr>      <chr>                       <dbl>   <dbl> <dbl>
 1 Ford         Fiesta     Petrol                       2002  127300  3074
 2 Porsche      718 Cayman Petrol                       2016   57850 49704
 3 Ford         Mondeo     Diesel                       2014   39190 24072
 4 Toyota       RAV4       Hybrid                       1988  210814  1705
 5 VW           Polo       Petrol                       2006  127869  4101
 6 Ford         Focus      Petrol                       2018   33603 29204
 7 Ford         Mondeo     Diesel                       2010   86686 14350
 8 Toyota       Prius      Hybrid                       2015   30663 30297
 9 VW           Polo       Petrol                       2012   73470  9977
10 Ford         Focus      Diesel                       1992  262514  1049
# ℹ 49,990 more rows

2.2 Renaming Columns

Let’s fix a problematic column name:

# Rename 'Year of manufacture' to year
car_data <- car_data %>% 
  rename(year = `Year of manufacture`)

# Check that it worked
names(car_data)
[1] "Manufacturer" "Model"        "Engine size"  "Fuel type"    "year"        
[6] "Mileage"      "Price"       

Question: Why did we need backticks around Year of manufacture but not around year?

Your answer: year no longer has spaces

Exercise 3: Creating New Columns

3.1 Calculate Car Age

# Create a mileage_per_year column  
car_data <- car_data %>%
  mutate(age = 2025 - year,
         mileage_per_year = Mileage / age)

# Look at your new columns
car_data %>% select(Model, year, age, Mileage, mileage_per_year)
# A tibble: 50,000 × 5
   Model       year   age Mileage mileage_per_year
   <chr>      <dbl> <dbl>   <dbl>            <dbl>
 1 Fiesta      2002    23  127300            5535.
 2 718 Cayman  2016     9   57850            6428.
 3 Mondeo      2014    11   39190            3563.
 4 RAV4        1988    37  210814            5698.
 5 Polo        2006    19  127869            6730.
 6 Focus       2018     7   33603            4800.
 7 Mondeo      2010    15   86686            5779.
 8 Prius       2015    10   30663            3066.
 9 Polo        2012    13   73470            5652.
10 Focus       1992    33  262514            7955.
# ℹ 49,990 more rows

3.2 Categorize Cars

# Create a price_category column where if price is < 15000, its is coded as budget, between 15000 and 30000 is midrange and greater than 30000 is luxury (use case_when)
car_data <- car_data %>% 
  mutate(price_category = case_when(Price < 15000 ~ "budget",
                                     Price >= 15000 & Price < 30000 ~ "midrange",
                                     .default = "luxury"))

# Check your categories select the new column and show it
car_data %>% select(Price, price_category)
# A tibble: 50,000 × 2
   Price price_category
   <dbl> <chr>         
 1  3074 budget        
 2 49704 luxury        
 3 24072 midrange      
 4  1705 budget        
 5  4101 budget        
 6 29204 midrange      
 7 14350 budget        
 8 30297 luxury        
 9  9977 budget        
10  1049 budget        
# ℹ 49,990 more rows

Exercise 4: Filtering Practice

4.1 Basic Filtering

# Find all Toyota cars
car_data %>% filter(Manufacturer == "Toyota")
# A tibble: 12,554 × 10
   Manufacturer Model `Engine size` `Fuel type`  year Mileage Price   age
   <chr>        <chr>         <dbl> <chr>       <dbl>   <dbl> <dbl> <dbl>
 1 Toyota       RAV4            1.8 Hybrid       1988  210814  1705    37
 2 Toyota       Prius           1.4 Hybrid       2015   30663 30297    10
 3 Toyota       RAV4            2.2 Petrol       2007   79393 16026    18
 4 Toyota       Yaris           1.4 Petrol       1998   97286  4046    27
 5 Toyota       RAV4            2.4 Hybrid       2003  117425 11667    22
 6 Toyota       Yaris           1.2 Petrol       1992  245990   720    33
 7 Toyota       RAV4            2   Hybrid       2018   28381 52671     7
 8 Toyota       Prius           1   Hybrid       2003  115291  6512    22
 9 Toyota       Prius           1   Hybrid       1990  238571   961    35
10 Toyota       Prius           1.8 Hybrid       2017   31958 38961     8
# ℹ 12,544 more rows
# ℹ 2 more variables: mileage_per_year <dbl>, price_category <chr>
# Find cars with mileage less than 30,000
car_data %>% filter(Mileage < 30000)
# A tibble: 5,402 × 10
   Manufacturer Model      `Engine size` `Fuel type`  year Mileage Price   age
   <chr>        <chr>              <dbl> <chr>       <dbl>   <dbl> <dbl> <dbl>
 1 Toyota       RAV4                 2   Hybrid       2018   28381 52671     7
 2 VW           Golf                 2   Petrol       2020   18985 36387     5
 3 BMW          M5                   4   Petrol       2017   22759 97758     8
 4 Toyota       RAV4                 2.4 Petrol       2018   24588 49125     7
 5 VW           Golf                 2   Hybrid       2018   25017 36957     7
 6 Porsche      718 Cayman           2.4 Petrol       2021   14070 69526     4
 7 Ford         Focus                1.8 Petrol       2020   22371 40336     5
 8 Ford         Mondeo               1.6 Diesel       2015   21834 28435    10
 9 VW           Passat               1.6 Diesel       2018   22122 36634     7
10 VW           Passat               1.4 Diesel       2020   21413 39310     5
# ℹ 5,392 more rows
# ℹ 2 more variables: mileage_per_year <dbl>, price_category <chr>
# Find luxury cars (from price category) with low mileage
car_data %>% filter(price_category == "luxury" & Mileage < 30000)
# A tibble: 3,257 × 10
   Manufacturer Model      `Engine size` `Fuel type`  year Mileage Price   age
   <chr>        <chr>              <dbl> <chr>       <dbl>   <dbl> <dbl> <dbl>
 1 Toyota       RAV4                 2   Hybrid       2018   28381 52671     7
 2 VW           Golf                 2   Petrol       2020   18985 36387     5
 3 BMW          M5                   4   Petrol       2017   22759 97758     8
 4 Toyota       RAV4                 2.4 Petrol       2018   24588 49125     7
 5 VW           Golf                 2   Hybrid       2018   25017 36957     7
 6 Porsche      718 Cayman           2.4 Petrol       2021   14070 69526     4
 7 Ford         Focus                1.8 Petrol       2020   22371 40336     5
 8 VW           Passat               1.6 Diesel       2018   22122 36634     7
 9 VW           Passat               1.4 Diesel       2020   21413 39310     5
10 Toyota       RAV4                 2.4 Petrol       2021    6829 66031     4
# ℹ 3,247 more rows
# ℹ 2 more variables: mileage_per_year <dbl>, price_category <chr>

4.2 Multiple Conditions

# Find cars that are EITHER Ford OR Porsche
car_data %>% filter(Manufacturer %in% c("Ford", "Porsche"))
# A tibble: 17,568 × 10
   Manufacturer Model      `Engine size` `Fuel type`  year Mileage Price   age
   <chr>        <chr>              <dbl> <chr>       <dbl>   <dbl> <dbl> <dbl>
 1 Ford         Fiesta               1   Petrol       2002  127300  3074    23
 2 Porsche      718 Cayman           4   Petrol       2016   57850 49704     9
 3 Ford         Mondeo               1.6 Diesel       2014   39190 24072    11
 4 Ford         Focus                1.4 Petrol       2018   33603 29204     7
 5 Ford         Mondeo               1.8 Diesel       2010   86686 14350    15
 6 Ford         Focus                2   Diesel       1992  262514  1049    33
 7 Ford         Mondeo               1.6 Diesel       1996   77584  5667    29
 8 Porsche      911                  2.6 Petrol       2009   66273 41963    16
 9 Porsche      911                  3.5 Petrol       2005  151556 19747    20
10 Ford         Focus                1   Hybrid       2010   85131 12472    15
# ℹ 17,558 more rows
# ℹ 2 more variables: mileage_per_year <dbl>, price_category <chr>
# Find cars with price between $20,000 and $35,000
car_data %>% filter(Price > 20000 & Price < 35000)
# A tibble: 7,301 × 10
   Manufacturer Model  `Engine size` `Fuel type`  year Mileage Price   age
   <chr>        <chr>          <dbl> <chr>       <dbl>   <dbl> <dbl> <dbl>
 1 Ford         Mondeo           1.6 Diesel       2014   39190 24072    11
 2 Ford         Focus            1.4 Petrol       2018   33603 29204     7
 3 Toyota       Prius            1.4 Hybrid       2015   30663 30297    10
 4 Toyota       Prius            1.4 Hybrid       2016   43893 29946     9
 5 Toyota       Prius            1.4 Hybrid       2016   43130 30085     9
 6 VW           Passat           1.6 Petrol       2016   64344 23641     9
 7 Ford         Mondeo           1.6 Diesel       2015   21834 28435    10
 8 BMW          M5               4.4 Petrol       2008  109941 31711    17
 9 BMW          Z4               2.2 Petrol       2014   61332 26084    11
10 Porsche      911              3.5 Petrol       2003  107705 24378    22
# ℹ 7,291 more rows
# ℹ 2 more variables: mileage_per_year <dbl>, price_category <chr>
# Find diesel cars less than 10 years old
car_data %>% filter(`Fuel type` == "Diesel" & age < 10)
# A tibble: 2,040 × 10
   Manufacturer Model   `Engine size` `Fuel type`  year Mileage Price   age
   <chr>        <chr>           <dbl> <chr>       <dbl>   <dbl> <dbl> <dbl>
 1 Ford         Fiesta            1   Diesel       2017   38370 16257     8
 2 VW           Passat            1.6 Diesel       2018   22122 36634     7
 3 VW           Passat            1.4 Diesel       2020   21413 39310     5
 4 BMW          X3                2   Diesel       2018   27389 44018     7
 5 Ford         Mondeo            2   Diesel       2016   51724 28482     9
 6 Porsche      Cayenne           2.6 Diesel       2019   20147 76182     6
 7 VW           Polo              1.2 Diesel       2018   37411 19649     7
 8 Ford         Mondeo            1.8 Diesel       2016   29439 30886     9
 9 Ford         Mondeo            1.4 Diesel       2020   18929 37720     5
10 Ford         Mondeo            1.4 Diesel       2018   42017 28904     7
# ℹ 2,030 more rows
# ℹ 2 more variables: mileage_per_year <dbl>, price_category <chr>

Question: How many diesel cars are less than 10 years old?

Your answer: 2,040 vehicles

Exercise 5: Grouping and Summarizing

5.1 Basic Summaries

# Calculate average price by manufacturer
avg_price_by_brand <- car_data %>%
  group_by(Manufacturer) %>%
  summarize(avg_price = mean(Price, na.rm = TRUE))

avg_price_by_brand
# A tibble: 5 × 2
  Manufacturer avg_price
  <chr>            <dbl>
1 BMW             24429.
2 Ford            10672.
3 Porsche         29104.
4 Toyota          14340.
5 VW              10363.
# Calculate average mileage by fuel type
avg_mileage_by_fuel <- car_data %>% 
  group_by(`Fuel type`) %>% 
  summarize(avg_mileage = mean(Mileage, na.rm = T))

avg_mileage_by_fuel
# A tibble: 3 × 2
  `Fuel type` avg_mileage
  <chr>             <dbl>
1 Diesel          112667.
2 Hybrid          111622.
3 Petrol          112795.
# Count cars by manufacturer
count_by_manufacturer <- car_data %>% 
  group_by(Manufacturer) %>% 
  summarize(count = n())

count_by_manufacturer
# A tibble: 5 × 2
  Manufacturer count
  <chr>        <int>
1 BMW           4965
2 Ford         14959
3 Porsche       2609
4 Toyota       12554
5 VW           14913

5.2 Categorical Summaries

# Frequency table for price categories
price_cat_freq_table <- car_data %>%
  group_by(price_category) %>% 
  summarize(count = n()) %>% 
  mutate(freq = count / sum(count))

price_cat_freq_table
# A tibble: 3 × 3
  price_category count  freq
  <chr>          <int> <dbl>
1 budget         34040 0.681
2 luxury          6179 0.124
3 midrange        9781 0.196

Submission Notes

To submit this lab: 1. Make sure your code runs without errors 2. Fill in all the “[YOUR ANSWER]” sections and complete all of the empty code! 3. Save this file in your portfolio’s lab_0/ folder 4. Commit and push to GitHub 5. Check that it appears on your GitHub Pages portfolio site

Questions? Post on the canvas discussion board or come to office hours!