Jonas, Pietro & Hauke
If you haven’t done so already, please install R as well as RStudio now:
In this session we will
Exercise & Break
Exercise & Break
Exercise & Time for Questions
Scripts vs. WYSIWYG (Excel or SPSS)
Analyses conducted in R are transparent, easily shareable, and reproducible. This not only helps others run and understand your code, but also your future self.
Open Source
R is 100% free of charge and, as a result, has a huge support community: a large community of R programmers constantly develops and distributes new R functionality. It also means that you will find a lot of help online, as others have run into the same problems as you.
Versatility
Yes, R is not Python. You can still use it to do a lot of stuff. If you can imagine an analytic task, you can almost certainly implement (and automate) it in R.
RStudio
RStudio helps you write R code. You can easily and seamlessly organize and combine R code, analyses, plots, and written text into elegant documents all in one place.
RStudio is an integrated development environment (IDE) for R. It helps the user to effectively use R by making things easier with features such as:
Opening RStudio, you will see four window panes:
Console
executes code. You can use it to test code that is not saved.
Source
opens your scripts, markdown documents or notebooks. It is the one you’ll use the most as it allows you to write and save both comments and code. You have to actively run the code, though.
Environment Pane
displays the objects (e.g. data, variables, custom functions) you can access in your current memory.
Exercise
Create a script and save it.
See this cheat sheet
Comments
Everything after a # will not be evaluated.
Sections
You can also use # some title ----- to create foldable sections in your scripts.
source: reddit
In its simplest form, R is an advanced calculator. Operators are symbols you know from other programs (Excel, etc.), such as + or -.
Operation | Description |
---|---|
x + y | Addition |
x - y | Subtraction |
x * y | Multiplication |
x / y | Division |
x ^ y | Exponentiation |
x %/% y | Integer Division |
x %% y | Remainder |
Exercise
What do Integer Division (line 6) and Remainder (line 7) actually mean? Run these lines and try to find out by tweaking the numbers a little.
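To check your intuition, a quick sketch (the numbers are arbitrary):
7 %/% 2  # integer division: how many whole times 2 fits into 7 -> 3
7 %% 2   # remainder (modulo): what is left over after that division -> 1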
Objects are assigned with <- or = (we recommend the former).
Why <-?
There is a general preference among the R community for using <- for assignment. This has at least two reasons, one being that = is also used in other contexts (function arguments).
Shortcuts
option & - (macOS), Alt & - (Windows)
To inspect the objects you have just created, you can simply call them.
Alternatively, you can take a look into your Environment pane. Do you remember where to find it?
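For instance (the object name is just an example):
my_result <- 3 + 4  # assign the value 7 to my_result
my_result           # calling the object prints its value: 7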
Conventions
Use _ to separate words within a name.
Case Sensitivity
Uppercase and lowercase letters are treated as distinct.
Warning
Importantly, be careful not to reassign an object unintentionally.
Warning
Where possible, avoid using names that are already taken (e.g., by built-in functions). Doing so will cause confusion for the readers of your code.
Note
Also, try to avoid special symbols such as +, for example. If you really need them, you can escape them using the backtick `.
There are many different data types. Today, we will focus on the most basic ones:
Data Type | Example |
---|---|
Numeric | 42 |
Character | "forty two" |
Logical | TRUE |
Tip
Use class() to identify data types. class(42) will return numeric, for instance.
You will run into situations where R does not interpret data the way you want it to. For instance, R might interpret the value 3 as "3", that is, as a character, while it should be a numeric.
Luckily, R offers functions for these situations:
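For example, the as.*() coercion functions convert between the basic types:
as.numeric("3")     # character "3" becomes the number 3
as.character(3)     # the number 3 becomes the character "3"
as.logical("TRUE")  # the character "TRUE" becomes the logical TRUE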
We can combine objects with the c() command to create a vector.
The documentation says:
The default method combines its arguments to form a vector. All arguments are coerced to a common type which is the type of the returned value […].
Exercise
What does “All arguments are coerced to a common type” actually mean? Hint: use class() for a & test.
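If you don’t have the original chunk at hand, a minimal illustration (the definitions of a and test are assumptions here):
a <- c(1, 2, 3)           # only numbers
test <- c(1, "two", TRUE) # mixed types
class(a)     # "numeric"
class(test)  # "character": every element was coerced to character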
Please answer the survey if you haven’t done so already.
Description
A Boolean expression is a logical statement that is either TRUE or FALSE.
Note
You can also abbreviate TRUE and FALSE with T and F.
Operation | Description | Output |
---|---|---|
x < y | Less than | TRUE if x is smaller than y. FALSE otherwise |
x <= y | Less than or equal to | TRUE if x is smaller than or equal to y. FALSE otherwise |
x > y | Greater than | TRUE if x is greater than y. FALSE otherwise |
x >= y | Greater than or equal to | TRUE if x is greater than or equal to y. FALSE otherwise |
x == y | Exactly equal to | TRUE if and only if x is equal to y. FALSE otherwise |
x != y | Not equal to | TRUE if and only if x is not equal to y. FALSE otherwise |
!x | Negation | TRUE if x is equal to FALSE or 0. FALSE otherwise |
x | y | OR | TRUE if x or y or both are TRUE . FALSE otherwise |
x & y | AND | TRUE if and only if x and y are both TRUE . FALSE otherwise |
Let’s combine what we learned about vectors and Boolean algebra.
Exercise
Create a vector called countries that contains “CH”, “FR”, “NL” & “DK”.
Tip
Show the vector’s first entry using countries[1].
Tip
Show whether the vector contains "EN" with "EN" %in% countries.
Tip
Show where the vector contains "DK" with match("DK", countries).
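Putting the exercise and the three tips together, a minimal sketch:
countries <- c("CH", "FR", "NL", "DK")
countries[1]            # "CH"
"EN" %in% countries     # FALSE
match("DK", countries)  # 4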
So far, you have seen several built-in functions: class() & c().
You can recognize a function either via class() or by the parentheses.
To call a function, you have to provide some argument(s). The mean() function needs some sort of vector, for instance.
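For example:
mean(c(2, 4, 9))  # returns 5, the mean of the three numbers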
Assume you want to learn how to use the mean() function.
?mean or help(mean) will open its documentation.
Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data.
The Comprehensive R Archive Network (CRAN) currently features 18,694 packages.
The majority of packages are quite niche-specific.
To get started, we’ll show you how to install Tidyverse, an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
An R package is like a lightbulb. First you need to order it with install.packages(). Then, every time you want to use it, you need to turn it on with library()
Your turn
Install tidyverse, which contains the lubridate package with a function called year().
Get today’s date with Sys.Date().
Pipe it into the function: Sys.Date() %>% year().
Alternatively, reference the package lubridate explicitly: Sys.Date() %>% lubridate::year().
Note
Install a package once with install.packages("package_name"); load it in every session with library(package_name).
Remember that the bottom right panel has a Packages tab? Let’s inspect it.
Let’s apply some functions. Can you guess the outcome?
We can do the same using %>%:
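The original chunks are not shown here; a small illustration of a nested call versus the pipe (the numbers are arbitrary):
round(sqrt(2), digits = 2)          # nested: read from the inside out
2 %>% sqrt() %>% round(digits = 2)  # piped: read from left to right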
Your turn
Create a vector vec containing 1, 2, 3 & 99.
Calculate its mean() using both methods and store the results in res1 & res2.
Generate three objects:
my_pi should contain the number pi (try out the command pi)
a should contain the numbers 1 to 3
b should be the multiplication of 1*pi, 2*pi & 3*pi
List the objects in your environment with ls(). Remove objects you no longer need with rm().
You should actually apply this.
Delete unused elements
This makes your environment easier to read and saves some memory.
Start with a fresh session
This forces you to re-create every object your analysis needs from the script itself.
It thus makes your code more reproducible and less likely to break.
Don’t!
Create a variable my_num that contains 6 numbers
Multiply my_num by 4
Create a variable my_char that contains 5 character strings
Combine my_num and my_char into a variable called both (hint: you can use the c() function)
How long is both? (hint: try the length() function)
Multiply both by 3. What happens?
Create two new vectors, x and y
Can you add x and y together? Why?
Change y (hint: you can use the c() function)
Add x and y together
Multiply x and y together. Pay attention to how R performs operations on vectors of the same length.
swirl is a great start as it teaches you R interactively, at your own pace, and right in the R console! Run the following code to get started.
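The chunk itself is not shown above; the standard way to install and start swirl is:
install.packages("swirl")  # install once
library(swirl)             # load the package
swirl()                    # start the interactive tutorial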
Resources we found helpful creating the slides:
Beyond base R’s core functions, packages provide additional functionality for (almost) all of your problems.
Some of the most widely-used packages are included in the tidyverse collection. install.packages("tidyverse") and library(tidyverse) bring you most of the functions you will need (at this point):
For example:
dplyr
tidyr
ggplot2
There are many more packages available
Pitfall
Failing to load a package is a common error source, one you can easily avoid by loading everything you need right at the start.
This is how the very beginning of your code could look:
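A minimal sketch (which packages you load depends on your project; haven is just one example for .sav files):
# Load everything you need right at the start of your script
library(tidyverse)  # dplyr, ggplot2, tidyr, ...
library(haven)      # e.g., for reading .sav files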
Exercise & Break
Exercise & Time for Questions
Data can come in various file formats, e.g., .csv, .xlsx, .sav, etc. We can read all these files into our environment.
For that, we always specify the path to the local data source (and we directly assign our data to an object called data).
Reading in
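For example (the file name and path are placeholders; adjust them to your folder structure):
data <- read.csv(file = "data/raw/survey.csv")  # read the .csv and assign it to an object called data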
Good to know
We recommend text files (.csv or .txt); however, there are packages for reading in other file formats (haven, foreign or rio).
You should know how your data is structured before processing it and R has some neat functions to do that:
head(data) shows the first few rows of your data
names(data) shows the column or variable names of your data
view(data) shows the entire data in a new window
 lfdn external_lfdn tester dispcode lastpage
1 2 0 0 31 6061029
2 4 0 0 31 6061029
3 11 0 0 31 6061029
4 13 0 0 31 6061029
5 14 0 0 31 6061029
[1] "lfdn" "external_lfdn" "tester" "dispcode"
[5] "lastpage" "quality" "duration" "expectation"
[9] "r_experience" "interest_data"
To get a feeling for your data, you can investigate summary statistics:
We can calculate the mean rent of this course’s participants individually:
How about the course’s gender distribution?
Or the mean shoe size by gender?
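The original chunks are not shown; a base R sketch using the column names that appear later in these slides (rent, gender, shoesize):
summary(data)                                           # summary statistics for every column
mean(data$rent, na.rm = TRUE)                           # mean rent
table(data$gender)                                      # gender distribution
tapply(data$shoesize, data$gender, mean, na.rm = TRUE)  # mean shoe size by gender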
Load the survey data and assign it to a dataframe object named df (if not done so already)
How many rows and columns does df contain?
Calculate simple summary statistics for the df
Calculate the standard error for the variable shoesize
Some important functions from dplyr:
select: Choose which columns to include.
filter: Filter which rows to include.
arrange: Sort the data, by size for continuous variables, by date, or alphabetically.
group_by: Group the data by a categorical variable.
n(): Count the number of records. There isn’t a variable in the brackets of the function, because the number of records applies to all variables.
mutate: Create new column(s) in the data, or change existing column(s).
rename: Rename column(s).
bind_rows: Append one data frame to another, combining data from columns with the same name.
select and filter
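The chunk that produced the preview below is not shown; it was presumably a select() call along these lines (possibly combined with head()):
data %>%
  select(gender, rent) %>%  # keep only the gender and rent columns
  head(3)                   # show the first three rows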
gender rent
1 2 NA
2 2 3000
3 1 950
data %>%
  filter(gender == 1 & shoesize == 44) %>% select(gender, shoesize) # show just males with a shoe size of 44
gender shoesize
1 1 44
2 1 44
3 1 44
Recap: the %>% (pipe) operator
%>% makes code easier to write and read. It works similarly to a + sign and is an integral part of the tidyverse syntax.
arrange and n()
data %>%
  filter(gender == 2 & rent >= 900) %>% # only females with a rent of at least 900 CHF
  arrange(desc(rent)) %>% # sort by descending rent
  select(gender, rent) # show only the columns gender and rent
gender rent
1 2 3000
data %>%
  filter(gender == 2) %>% # only females
  summarise(
    n = n()) # number of instances, written to a new column named n
n
1 5
summarise()
summarise() reduces multiple values down to a single summary.
group_by
data %>%
  group_by(gender) %>% # group by gender
  summarise(
    n = n(), # count the instances and write them in a column named n
    avg_height = mean(height, na.rm = TRUE), # calculate the mean height, written to avg_height
    avg_rent = mean(rent, na.rm = TRUE)) # calculate the mean rent, written to avg_rent
# A tibble: 2 × 4
gender n avg_height avg_rent
<int> <int> <dbl> <dbl>
1 1 8 182. 616.
2 2 5 158. 1212
Removing NAs
na.rm = TRUE allows you to exclude missing values. This helps to prevent error messages. But check why that data is missing!
mutate and rename
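The chunk that produced the output below is not shown; judging by the values, it was presumably something like:
data %>%
  mutate(sq_height = height^2) %>%  # new column: squared height
  select(gender, sq_height) %>%
  head(5)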
gender sq_height
1 2 23104
2 2 21025
3 1 34225
4 1 32400
5 1 34225
lfdn external_lfdn tester dispcode lastpage
1 2 0 0 31 6061029
2 4 0 0 31 6061029
3 11 0 0 31 6061029
4 13 0 0 31 6061029
5 14 0 0 31 6061029
data %>% # Start with the data
  filter(r_experience <= 6) %>% # Only those with R experience of at most 6
  group_by(gender) %>% # Group by gender
  summarise(
    age.mean = mean(age), # Define first summary...
    height.mean = median(height), # you get the idea...
    n = n()) %>% # How many are in each group?
  select(age.mean, height.mean, n)
# A tibble: 2 × 3
age.mean height.mean n
<dbl> <int> <int>
1 22.9 185 7
2 27.4 154 5
Help
It can sometimes be hard to understand what a command is doing with the data. Tidy Data Tutor visualizes what is happening to your data in every code step, which is extremely helpful.
Build a new dataframe called df_demographics that contains just the demographic information from our survey data
Rename the gender column to sex in this dataframe
Calculate the mean age by gender in this dataframe
Calculate the logarithmic rent of all male participants of the AIBS course and store it in a column named log_rent
inner_join(): keeps only the rows that match in both data frames (with base R’s merge(), this corresponds to all = FALSE, while all = TRUE keeps all rows from both data frames).
left_join(): includes all the rows of your data frame x and only those from y that match.
Example: Calculate the birth year of the members of this class:
age_table <- read.csv(file = "data/raw/age_table.csv") # read in age table from directory
data_new <- left_join(data, select(age_table, c(X2022_after, year_born)), by = c("age" = "X2022_after")) # perform join to get the birthyear
data_new %>% select(age, year_born) %>% filter(age<24) # show the results
age year_born
1 22 2000
2 22 2000
3 21 2001
4 23 1999
5 22 2000
6 23 1999
7 22 2000
data_wide to data_long
The arguments to pivot_longer():
Reverse Way
The reverse way, from long back to wide format, via pivot_wider()
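A hedged sketch with made-up column names (score_2021, score_2022) to illustrate both directions:
data_long <- data_wide %>%
  pivot_longer(cols = c(score_2021, score_2022), # columns to stack
               names_to = "year",                # new key column
               values_to = "score")              # new value column

data_wide2 <- data_long %>%
  pivot_wider(names_from = year, values_from = score) # and back to wide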
You (R) can do a lot:
Recap: Mean comparisons (2 groups)
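The chunk that produced the Welch test output below is not shown; in base R it was presumably:
t.test(rent ~ gender, data = data)  # Welch two-sample t-test of rent by gender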
Welch Two Sample t-test
data: rent by gender
t = -0.99236, df = 3.0711, p-value = 0.3926
alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
95 percent confidence interval:
-2481.5 1290.0
sample estimates:
mean in group 1 mean in group 2
616.25 1212.00
rstatix::t_test(data = data, formula = rent~gender, alternative = "two.sided") # the same works with other packages, here `rstatix`
# A tibble: 1 × 8
.y. group1 group2 n1 n2 statistic df p
* <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
1 rent 1 2 8 5 -0.992 3.07 0.393
Notice the difference in the code?
Caution
There are similar functions, t.test and t_test, from different packages, using different syntax.
Recap: Mean comparisons (> 2 groups)
Is there a difference in XYZ?
data$education <- as.factor(data$education) # this ensures our grouping variable is of the right class (a factor)
summary(aov(formula = rent~education, data = data))
Df Sum Sq Mean Sq F value Pr(>F)
education 1 573944 573944 1.176 0.304
Residuals 10 4882419 488242
1 observation deleted due to missingness
Caution
The underlying grouping variable needs to be a factor (as.factor). Always check first!
Recap: Relationship between two (or more) variables
Is there a relationship between height and shoesize?
summary(lm(formula = shoesize~height, data = data)) # runs a linear model and provides statistical summary
Call:
lm(formula = shoesize ~ height, data = data)
Residuals:
Min 1Q Median 3Q Max
-2.6840 -1.2487 -0.3436 0.9486 4.0510
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.96106 5.93868 1.172 0.265895
height 0.19729 0.03421 5.767 0.000125 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.871 on 11 degrees of freedom
Multiple R-squared: 0.7514, Adjusted R-squared: 0.7288
F-statistic: 33.26 on 1 and 11 DF, p-value: 0.0001252
?lm() will immediately provide you with help in the Help panel.
Regress rent on height and shoesize
Calculate the mean difference in r_experience by class and perform a t-test
Do r_experience and interest_data correlate?
Get the shoesizes dataset from Canvas and look for potential information that might match with our data
Questions specific to your projects? Put them on Canvas or bring them to the lecture
Tidyverse
You have seen that there are many functions such as filter(), select(), etc. in the dplyr package. Please name other functions the package provides.
Resource
Visit tidydatatutor.com and scroll down a little. You’ll find helpful visualizations over there.
We’ll use tidyverse functions as a reference today and explain how to do similar stuff in base R, that is, without loading any packages.
To illustrate this, we’ll first load some data.
Remember how we selected items in (one-dimensional) vectors?
Here is a more practical example where we want to rename the column UrbanPop.
We can do the same thing with (two dimensional) data frames without using the tidyverse:
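The original chunk is not shown; assuming the data frame is the built-in USArrests (its columns match the output further below), three common ways to pick a column in base R are:
USArrests[, "Murder"]  # bracket notation with the column name
USArrests[["Murder"]]  # double brackets return the column as a vector
USArrests$Murder       # the $ operator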
With dplyr::filter() we were able to display all the rows that match a certain condition, such as Murder > 10.
How shall we do this with base R?
We talked about this yesterday.
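A minimal sketch of logical row indexing in base R (again assuming the USArrests data):
USArrests[USArrests$Murder > 10, ]  # keep only the rows where Murder exceeds 10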
Select Assault statistics
Inspect the data and look for the Assault variable. Then select it using all of the three options displayed above.
Please show all the states with an above-median number of Assaults.
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7
Delaware 5.9 238 72 15.8
Florida 15.4 335 80 31.9
Georgia 17.4 211 60 25.8
Illinois 10.4 249 83 24.0
Louisiana 15.4 249 66 22.2
Maryland 11.3 300 67 27.8
Michigan 12.1 255 74 35.1
Mississippi 16.1 259 44 17.1
Missouri 9.0 178 70 28.2
Nevada 12.2 252 81 46.0
New Mexico 11.4 285 70 32.1
New York 11.1 254 86 26.1
North Carolina 13.0 337 45 16.1
Rhode Island 3.4 174 87 8.3
South Carolina 14.4 279 48 22.5
Tennessee 13.2 188 59 26.9
Texas 12.7 201 80 25.5
Wyoming 6.8 161 60 15.6
What is the average number of murders in these states?
[1] 11.175
Reading in
Saving
head(data) shows the first few rows of your data
names(data) shows the column or variable names of your data
view(data) shows the entire data in a new window
Data visualization is the practice of representing data in a graphical form (e.g. plots). This is a powerful way to present information so that it is understandable to the majority of the audience (and to impress your professors when presenting the empirical project). For this reason, it is crucial to select the type of visualization that best fits our data and our goal!
ggplot2 is a powerful package for creating visualizations. Its strength lies in the underlying Grammar of Graphics, a set of rules for combining independent elements into a graphical representation.
ggplot() initializes the coordinate system of our visual representation. It takes as input the dataset of interest.
geom_*() specifies the type of visualization. It needs a mapping argument which maps the variables from the dataset to the objects on the plot.
aes() is assigned to the mapping argument. It takes as arguments the values corresponding to the x and y axes, as well as other plot-specific characteristics (e.g., color, size, fill, …).
Additional functions adjust titles, labels and appearance (e.g., ggtitle(), xlab(), ylab(), theme()).
Remember: It is always possible to get help regarding these functions by running ?functionName
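A small sketch combining these building blocks (using the survey data and the column names from earlier slides):
ggplot(data = data) +
  geom_point(mapping = aes(x = height, y = shoesize)) + # map columns to the x and y axes
  ggtitle("Shoe size by height") +
  xlab("height") +
  ylab("shoesize")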
Univariate graphical representations visualize the distribution of a single continuous or discrete variable.
Bivariate visual analysis shows the relationship between two variables. It is important to select the correct visualization type according to the variables analysed:
Multivariate presentations show the relationship of three or more variables. There are two main approaches: grouping and faceting.
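A hedged sketch of both approaches, again with the survey columns:
ggplot(data, aes(x = height, y = shoesize, colour = factor(gender))) + # grouping: colour encodes gender
  geom_point()

ggplot(data, aes(x = height, y = shoesize)) +
  geom_point() +
  facet_wrap(~ gender) # faceting: one panel per gender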
Create an R project called BootcampR (e.g., at your_path_to/BootcampR)
Inside it, create a folder data with the subfolders data/raw and data/preprocessed
Save mtcars.csv into data/raw
Download the mtcars.csv file from Canvas :)
Install and load the following packages:
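The original chunk is not shown; judging by the code further below, it includes at least the tidyverse and ggcorrplot:
install.packages(c("tidyverse", "ggcorrplot")) # install once
library(tidyverse)
library(ggcorrplot)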
This is the place where you assign constants, e.g. the Swiss VAT:
Remember that we use a different naming convention for constants, i.e. uppercase letters.
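For example (the rate below is only a placeholder; look up the current value):
SWISS_VAT <- 0.081 # constant in UPPER_CASE; the value is a placeholder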
Import the mtcars.csv file into an object called mtcars_df (see the sketch below the variable table)
Ideally, you have a folder called data that contains two subfolders:
raw, from where you read the original data, &
processed, where you store the data after you have transformed it
variable | description |
---|---|
mpg | Miles/(US) gallon |
cyl | Number of cylinders |
disp | Displacement (cu.in.) |
hp | Gross horsepower |
drat | Rear axle ratio |
wt | Weight (1000 lbs) |
qsec | 1/4 mile time |
vs | Engine (0 = V-shaped, 1 = straight) |
am | Transmission (0 = automatic, 1 = manual) |
gear | Number of forward gears |
carb | Number of carburetors |
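A minimal sketch of the import mentioned above (adjust the path if your folder structure differs):
mtcars_df <- read.csv(file = "data/raw/mtcars.csv")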
Create a new dataframe called new_mtcars which includes only the following columns: mpg, hp, wt, vs, gear
Plot the distribution of vs and the distribution density of wt
names(mtcars_df)
new_mtcars <- mtcars_df %>% select(mpg, hp, wt, vs, gear)
summary(new_mtcars)
#ggplot(new_mtcars) + geom_bar(mapping=aes(x=vs))
ggplot(new_mtcars, aes(x = wt)) +
  geom_histogram(aes(x = wt, y = ..density..),
                 colour = 1, fill = "white") +
  geom_density()
corr <- round(cor(new_mtcars), 1)
ggcorrplot(corr, lab=TRUE)
#ggcorrplot(corr, lab=TRUE, type='lower')
Create a new column hp_wt which is the ratio of hp and wt
Plot the relationship between hp and wt
Save your processed data into data/processed
There are different formats you can use. CSV is the most compatible format.
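For example (the object and file names are just examples):
write.csv(new_mtcars, file = "data/processed/new_mtcars.csv", row.names = FALSE)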
and good luck on your empirical project :)