Lab 0: Getting Started with dplyr

Your First Data Analysis

Author

Jed Chew

Published

September 15, 2025

Overview

Welcome to your first lab! In this ungraded assignment, you’ll use car sales data to practice the fundamental dplyr operations I covered in class. This lab will help you get comfortable with:

  • Basic data exploration
  • Column selection and manipulation
  • Creating new variables
  • Filtering data
  • Grouping and summarizing

Instructions: Copy this template into your portfolio repository under a lab_0/ folder, then complete each section with your code and answers. Write your code under the matching comment in each chunk. Be sure to also copy the data folder into your lab_0/ folder.

Setup

# Load the tidyverse library
library(tidyverse)

# Read in the car sales data
# Make sure the data file is in your lab_0/data/ folder
car_data <- read_csv("data/car_sales_data.csv")
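If read_csv() cannot find the file, the cause is usually the working directory rather than the code. The sketch below shows one way to check the path before reading. It writes a tiny stand-in CSV to a temporary file (the rows are made up, not taken from the real dataset) so it runs anywhere; in the lab you would check "data/car_sales_data.csv" instead.

```r
library(readr)

# Hypothetical stand-in for data/car_sales_data.csv, written to a temp file
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Manufacturer,Model,Engine size,Price",
  "Toyota,Yaris,1.0,9000",
  "Honda,Civic,1.6,21000"
), csv_path)

# Fail early with a readable message if the path is wrong
stopifnot("Data file not found; check your working directory" = file.exists(csv_path))

# show_col_types = FALSE silences the column specification message
toy_cars <- read_csv(csv_path, show_col_types = FALSE)

# read_csv() keeps column names exactly as written, spaces included
names(toy_cars)
```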

Exercise 1: Getting to Know Your Data

1.1 Data Structure Exploration

Explore the structure of your data and answer these questions:

# Use glimpse() to see the data structure
glimpse(car_data)

# Check the column names
names(car_data)

# Look at the first few rows
head(car_data)

Questions to answer:

  • How many rows and columns does the dataset have?
  • What types of variables do you see (numeric, character, etc.)?
  • Are there any column names that might cause problems? Why?

Your answers:

- Rows: 50,000
- Columns: 7
- Variable types: characters and doubles
- Problematic names: ‘Engine size’, ‘Fuel type’, and ‘Year of manufacture’, because they contain spaces
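You can also find problematic names programmatically instead of by eye. This is a minimal sketch using base R's make.names(); the vector below is typed in by hand from the column names used in this lab, so in practice you would pass names(car_data) instead.

```r
# Column names as they appear in this lab, typed in by hand for illustration
col_names <- c("Manufacturer", "Model", "Engine size", "Fuel type",
               "Year of manufacture", "Mileage", "Price")

# make.names() converts each name to a syntactically valid one;
# any name it has to change is non-syntactic
problem_names <- col_names[col_names != make.names(col_names)]
problem_names
```

Any name returned here needs backticks when you refer to it in code.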

1.2 Tibble vs Data Frame

Compare how tibbles and data frames display:

# Look at the tibble version (what we have)
car_data

# Convert to regular data frame and display
car_df <- as.data.frame(car_data)
car_df

Question: What differences do you notice in how they print?

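Your answer: A data frame prints every row and column (up to R’s max.print limit), which floods the console for 50,000 rows. A tibble prints only the first 10 rows and the columns that fit on screen, along with the dimensions and each column’s type.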
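To see the printing difference without scrolling through the full dataset, here is a small sketch on a made-up 30-row table. The names toy_tbl and toy_df are hypothetical; the point is only to compare how many lines each print method produces.

```r
library(dplyr)

# Toy data: 30 rows, so longer than the tibble's default 10-row preview
toy_tbl <- tibble(id = 1:30, value = id * 2)
toy_df <- as.data.frame(toy_tbl)

# Capture what each print method writes to the console
tbl_lines <- capture.output(print(toy_tbl))
df_lines <- capture.output(print(toy_df))

length(df_lines)   # one header line plus all 30 rows
length(tbl_lines)  # fewer lines: a summary, 10 rows, and a footer

# Ask a tibble for more rows explicitly when you need them
print(toy_tbl, n = 30)
```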

Exercise 2: Basic Column Operations

2.1 Selecting Columns

Practice selecting different combinations of columns:

# Select just Model and Mileage columns
car_data |> select(Model, Mileage)

# Select Manufacturer, Price, and Fuel type
car_data |> select(Manufacturer, Price, `Fuel type`)

# Challenge: Select all columns EXCEPT Engine size
car_data |> select(-`Engine size`)

2.2 Renaming Columns

Let’s fix a problematic column name:

# Rename `Year of manufacture` to year, and fix the other two names that contain spaces
car_data <- car_data |> 
  rename(
    year = `Year of manufacture`,
    fuel = `Fuel type`,
    engine_size = `Engine size`
  )

# Check that it worked
names(car_data)

Question: Why did we need backticks around Year of manufacture but not around year?

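Your answer: ‘Year of manufacture’ contains spaces, so it is not a syntactically valid R name and has to be wrapped in backticks. The new name, year, has no spaces or special characters, so it can be written as is.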
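The sketch below illustrates the backtick rule on a made-up two-row table, then shows one possible way to clean every name in a single step with rename_with(). The lowercase-and-underscore convention is an assumption for illustration, not a requirement of this lab.

```r
library(dplyr)

# Toy tibble with hypothetical values; backticks let us create the spaced name
toy_cars <- tibble(
  Model = c("Yaris", "Civic"),
  `Year of manufacture` = c(2015, 2020)
)

# Without backticks, R would try to parse Year of manufacture as separate symbols
toy_cars <- toy_cars |>
  rename(year = `Year of manufacture`)

# rename_with() applies a function to every name at once:
# lowercase everything and replace spaces with underscores
toy_clean <- tibble(`Engine size` = 1.6, `Fuel type` = "Petrol") |>
  rename_with(\(x) gsub(" ", "_", tolower(x)))

names(toy_cars)
names(toy_clean)
```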

Exercise 3: Creating New Columns

3.1 Calculate Car Age

# Create an 'age' column (2025 minus year of manufacture)
car_data <- car_data |>
  mutate(age = 2025 - year)

# Create a mileage_per_year column  
car_data <- car_data |> 
  mutate(mileage_per_year = Mileage / age)

# Look at your new columns
select(car_data, Model, year, age, Mileage, mileage_per_year)
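One edge case worth knowing about: if any car was built in 2025, its age is 0 and Mileage / age returns Inf. Whether the real dataset contains such rows is not something this lab states, so treat the following as a defensive sketch on made-up values rather than a required fix.

```r
library(dplyr)

# Hypothetical cars, including one built in the current year
toy_cars <- tibble(
  Model = c("Yaris", "Civic", "Leaf"),
  year = c(2015, 2020, 2025),
  Mileage = c(80000, 25000, 500)
)

toy_cars <- toy_cars |>
  mutate(
    age = 2025 - year,
    # Dividing by an age of 0 gives Inf, so return NA for those rows instead
    mileage_per_year = if_else(age > 0, Mileage / age, NA_real_)
  )

toy_cars
```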

3.2 Categorize Cars

# Create a price_category column with case_when():
# price < 15000 is "budget", 15000 to 30000 is "midrange",
# and greater than 30000 is "luxury"
car_data <- car_data |> 
  mutate(price_category = 
      case_when(
        Price < 15000 ~ "budget",
        Price < 30000 ~ "midrange",
        TRUE ~ "luxury"
      )
  )


# Check your categories: select the new column and show it
select(car_data, price_category)
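case_when() checks its conditions in order and uses the first one that is TRUE, which is why the second condition does not need to repeat Price >= 15000. Two consequences are easy to miss: a price of exactly 30000 falls through to the final branch, and so does a missing price, because a comparison with NA is never TRUE. The sketch below uses made-up prices to show both, with an extra first branch (my addition, not part of the lab's solution) that keeps missing prices missing.

```r
library(dplyr)

# Hypothetical prices chosen to hit each boundary, plus one missing value
toy_cars <- tibble(Price = c(9000, 15000, 29999, 30000, 52000, NA))

toy_cars <- toy_cars |>
  mutate(
    price_category = case_when(
      is.na(Price) ~ NA_character_,  # keep missing prices missing
      Price < 15000 ~ "budget",
      Price < 30000 ~ "midrange",    # only reached when Price >= 15000
      TRUE ~ "luxury"                # everything else, including exactly 30000
    )
  )

# Tally the categories to check that every row landed where you expect
count(toy_cars, price_category)
```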

Exercise 4: Filtering Practice

4.1 Basic Filtering

# Find all Toyota cars
car_data |> filter(Manufacturer == "Toyota")

# Find cars with mileage less than 30,000
car_data |> filter(Mileage < 30000)

# Find luxury cars (from price category) with low mileage
car_data |> filter(price_category == "luxury" & Mileage < 30000)

4.2 Multiple Conditions

# Find cars that are EITHER Honda OR Nissan
# %in% keeps rows whose Manufacturer matches any value in the vector
car_data |> filter(Manufacturer %in% c("Honda", "Nissan"))

# Find cars with price between $20,000 and $35,000
car_data |> filter(Price > 20000 & Price < 35000)

# Find diesel cars less than 10 years old
car_data |> filter(fuel == "Diesel" & age < 10)

Question: How many diesel cars are less than 10 years old?

Your answer: 2040
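Rather than reading the row count off the printed tibble, you can compute it directly. The sketch below uses a made-up five-row table, so the counts will not match the real dataset. It also contrasts the strict comparison used for the price range with between(), which includes both endpoints.

```r
library(dplyr)

# Hypothetical cars for illustration only
toy_cars <- tibble(
  Manufacturer = c("Honda", "Nissan", "Toyota", "Honda", "Ford"),
  fuel = c("Diesel", "Diesel", "Petrol", "Diesel", "Diesel"),
  age = c(4, 12, 3, 9, 10),
  Price = c(22000, 8000, 35000, 20000, 15000)
)

# Conditions separated by commas are combined with AND, the same as using &
n_young_diesel <- toy_cars |>
  filter(fuel == "Diesel", age < 10) |>
  nrow()

# between() is inclusive on both ends, unlike Price > 20000 & Price < 35000
n_mid_price <- toy_cars |>
  filter(between(Price, 20000, 35000)) |>
  nrow()

n_young_diesel
n_mid_price
```

Whether "between $20,000 and $35,000" should include the endpoints is a judgment call; just make sure your code matches the interpretation you report.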

Exercise 5: Grouping and Summarizing

5.1 Basic Summaries

# Calculate average price by manufacturer
avg_price_by_brand <- car_data |> 
  group_by(Manufacturer) |> 
  summarize(avg_price = mean(Price, na.rm = TRUE))

avg_price_by_brand

# Calculate average mileage by fuel type
avg_mileage_by_fuel <- car_data |> 
  group_by(fuel) |> 
  summarize(avg_mileage = mean(Mileage, na.rm = TRUE))

avg_mileage_by_fuel

# Count cars by manufacturer
cars_by_manufacturer <- count(car_data, Manufacturer)
cars_by_manufacturer
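count() is shorthand for the group_by() plus summarize() pattern used in the first two summaries. The sketch below, on made-up data, shows that the two give the same result, and that a single summarize() call can compute several statistics at once. The column max_price and the sort order are illustrative choices, not part of the lab.

```r
library(dplyr)

# Hypothetical cars, including one missing price
toy_cars <- tibble(
  Manufacturer = c("Toyota", "Honda", "Toyota", "Ford", "Toyota", "Honda"),
  Price = c(9000, 21000, 12000, 30000, NA, 19000)
)

# count() is shorthand for group_by() followed by summarize(n = n())
by_hand <- toy_cars |>
  group_by(Manufacturer) |>
  summarize(n = n())

shorthand <- count(toy_cars, Manufacturer)

# Several summaries in one call, sorted from highest to lowest average price
price_summary <- toy_cars |>
  group_by(Manufacturer) |>
  summarize(
    n = n(),
    avg_price = mean(Price, na.rm = TRUE),  # na.rm drops the missing price
    max_price = max(Price, na.rm = TRUE)
  ) |>
  arrange(desc(avg_price))

price_summary
```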

5.2 Categorical Summaries

# Frequency table for price categories
car_price_summary <- car_data |> 
  group_by(price_category) |> 
  summarize(n = n()) |> 
  mutate(freq = n / sum(n))
car_price_summary
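The frequency calculation works because summarize() removes the last level of grouping. With a single grouping variable the result is ungrouped, so sum(n) is the total number of rows. With two grouping variables the result is still grouped by the first, so the same mutate() gives shares within each group instead. The sketch below demonstrates both on made-up data; the fuel-by-category breakdown is an illustration and is not asked for in the lab.

```r
library(dplyr)

# Hypothetical cars for illustration only
toy_cars <- tibble(
  fuel = c("Petrol", "Petrol", "Petrol", "Diesel", "Diesel"),
  price_category = c("budget", "budget", "luxury", "budget", "luxury")
)

# One grouping variable: summarize() returns an ungrouped tibble,
# so sum(n) is the total row count and freq sums to 1 overall
overall <- toy_cars |>
  group_by(price_category) |>
  summarize(n = n()) |>
  mutate(freq = n / sum(n))

# Two grouping variables: the result is still grouped by fuel,
# so sum(n) is computed within each fuel type
within_fuel <- toy_cars |>
  group_by(fuel, price_category) |>
  summarize(n = n(), .groups = "drop_last") |>
  mutate(freq = n / sum(n)) |>
  ungroup()

overall
within_fuel
```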

Submission Notes

To submit this lab:

  1. Make sure your code runs without errors
  2. Fill in all the “[YOUR ANSWER]” sections and complete all of the empty code!
  3. Save this file in your portfolio’s lab_0/ folder
  4. Commit and push to GitHub
  5. Check that it appears on your GitHub Pages portfolio site

Questions? Post on the Canvas discussion board or come to office hours!