Hi everyone, my name is Keana, and I’m a second-year grad student working with Coren Apicella. My research focuses on factors that may affect gender differences in willingness to compete. If you’d like to hear more about my work (I’m more than happy to talk about it!), feel free to talk to me after this workshop or shoot me an email. Today, I’ll be leading you through a workshop on data management, which will mainly focus on data cleaning, arguably the most time-consuming part of analyzing your data.
I started learning R in my last year of undergrad (so it was spring/summer of 2017). I first tried using DataCamp and swirl to learn it. But it’s hard to really force yourself unless you’re working with real data, so I’d recommend either taking a class (where you are forced to do homework and projects regularly; there are some good courses in the statistics department that I have been taking) or just forcing yourself to analyze your data completely in R (without any help from Excel). I did this and learned much faster, though be warned in advance that it takes patience.
Before we get started on writing our code for today’s session, I want to really quickly review what you learned last week during the intro to R workshop. Does anyone have any specific questions about any of these concepts they would like for me to address? For each of these, please feel free to shout out any questions that come to your mind.
directories
workspace
scripts
the save button
installing packages
basic R syntax
basic analyses
1 ggplot example
So one thing I think is really important for working in R is being familiar with the R environment. So I’ll quickly refresh your memory on some important parts of the display here. By default, on the left-hand side of the screen, we have two separate sections. The top box is where you can write and save your script for later use. Note: you can still run the code and get output up here, it’s just that your code will also be saved. In the bottom box, you can run code and get output, but it will not be saved. So my personal recommendation is to always write in the top box and save your script at regular intervals. Next, we have the panels on the right. There are a few tabs, but some are less useful than others:
Any questions?
Alright so let’s review one other thing that I definitely found confusing when I first started learning R: graphing with ggplot
library(ggplot2)
graph <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width))
## what does this "graph" look like before we add anything else?
graph
## essentially a template for us to add the data and any other settings we want to set. You NEED to do this every time. Then you add on to it (literally, by using the addition symbol). But you need a basic structure on which to build (think of it like building the foundation of a house and adding on to that). Can't build without the foundation though.
## first we have to tell it we want a scatterplot.
graph + geom_point()
### Seems like there's some separation here, as if there are two different groups that have different slopes etc. Do you remember what groups we defined last time that seemed to be driving this separation?
## Species! Let's add to our graph a setting that differentiates between species based on color AND shape.
graph + geom_point(aes(color=Species, shape=Species))
## so here, each species has a unique color and shape. We could also just use color as their identifiers in the key:
graph + geom_point(aes(color=Species))
## Now we just have a unique color for all of them, but they share the same shape (it looks like the default shape is a circle, but you should be able to change that in the arguments for this function if you have a preference for another shape)
## One other thing we did was add axis labels and a title, like so:
graph + xlab("Sepal Length") + ylab("Sepal Width") +
ggtitle("Sepal Length-Width")
### mention saving a graph here: notice how it got rid of the points, even though it changed the title and axis labels. Does anyone know why?
## finally, there are themes in R that you can use that just make formatting much easier - you can just set it so if you're a person like me who doesn't know anything about aesthetics etc, you don't have to think too hard. So let's tack on a theme and put it all together.
graph + theme_minimal()+ geom_point(aes(color=Species, shape=Species)) + xlab("Sepal Length") + ylab("Sepal Width") +
ggtitle("Sepal Length-Width")
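On the question of why the points disappeared: adding layers with + returns a new plot and leaves graph itself unchanged unless the result is assigned back. Here is a minimal sketch of keeping the finished plot and saving it, assuming ggplot2's ggsave() function. The names graph_final and sepal_plot.pdf are made up for illustration, and the file is written to a temporary folder rather than anywhere in particular.

```r
library(ggplot2)

# Same base template as above
graph <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width))

# Assign the result; otherwise the added layers are only printed, not kept
graph_final <- graph +
  theme_minimal() +
  geom_point(aes(color = Species, shape = Species)) +
  xlab("Sepal Length") + ylab("Sepal Width") +
  ggtitle("Sepal Length-Width")

# Hypothetical file name, written to R's temporary folder for this session
out_file <- file.path(tempdir(), "sepal_plot.pdf")

# Width and height are in inches by default
ggsave(out_file, plot = graph_final, width = 6, height = 4)
```

Passing the plot explicitly through the plot argument avoids relying on whichever plot happened to be displayed last.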
## any questions?
Today we’re going to pretend as though you are about to run your study. I think this is the most important part of your research: if you get this right, you can save a lot of time and pain once your data comes in.
Before collecting data, I suggest writing a data analysis plan. Here’s an example from my own research. In it, you will list out every step you will take (minus the code you will use). This will allow you to optimize all of your procedures BEFORE you collect your data, which will make it much easier to analyze it later. I have found this especially helpful when I am working on Qualtrics and need to know in advance how I want my data to be formatted so I can easily import it into R without too much data cleaning (but you will usually have to clean it up a bit, regardless). This will help you in writing your pre-registration, which I will show you how to create!
As you may know, several scientific fields are going through a replication crisis, including psychology. Several well-known psychological findings are coming under scrutiny, largely because of the discovery of p-hacking (i.e., the selective reporting of statistically significant analyses) and other less-than-scientific practices.
Pre-registration of one’s studies has been proposed to help reduce the prevalence of p-hacking, which is largely attributed to poor planning and unclear predictions. I can attest to this myself, as a person who ran studies before I knew about pre-registration and has since pre-registered a study. Pre-registration is extremely useful for clarifying exactly what your predictions are and your purposes for running the study. So I think it’s overall a win-win situation. I’ll quickly walk you through the pre-registration process.
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#install.packages("magrittr")
library(magrittr)
Note: adapted from STAT705 at Penn - highly recommend checking this class out if possible!
I’ll be using two popular packages, dplyr and magrittr, for some of these examples. They make it much easier and more efficient to change/manipulate your data.
dplyr and magrittr are different packages, but they are most useful when used together. The key idea is to decompose R commands into sequential actions. The result of one action is piped into the next action by the %>% operator. The key actions or verbs, each demonstrated in the examples below, are select, filter, arrange, mutate, summarise, and group_by.
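To make the piping idea concrete before we get to the movies data, here is a small sketch on the built-in iris dataset. It writes the same three steps twice, once as nested function calls and once as a pipeline. The object names nested and piped are just for illustration.

```r
library(dplyr)

# Nested version: you have to read it from the inside out
nested <- arrange(
  select(filter(iris, Species == "setosa"), Sepal.Length, Sepal.Width),
  desc(Sepal.Length)
)

# Piped version: each step reads left to right, top to bottom
piped <- iris %>%
  filter(Species == "setosa") %>%
  select(Sepal.Length, Sepal.Width) %>%
  arrange(desc(Sepal.Length))

# Both versions perform the same steps in the same order
identical(nested, piped)
```

The pipe does not change what is computed; it only changes the order in which you read and write the steps.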
We are going to use the movies dataset from kaggle.com. I sent it to you by email, so go ahead and load it. We have variables such as budget, revenue, runtime (in minutes), and title. We also have vote_count and vote_average, which reflect user votes on the website. The variable popularity is a number computed by TMDB whose formula is unknown. And finally, we have a series of genre indicators giving us the genre(s) of the movies.
Note: you’ll need to change the directory below to reflect the location of the movies dataset on your computer and your current system. An option for not having to remember all of the directories/subdirectories is to use the here package in combination with R Projects. Here are some great links that helped me learn about the here package:
movies <- read.csv("C:/Users/keana/Downloads/movies.csv", stringsAsFactors = TRUE)
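As an alternative to a hard-coded path, here is a rough sketch of how the here package can build the path for you. It assumes you are working inside an R Project and that movies.csv sits in a subfolder called data. That folder name is hypothetical, so adjust it to your own setup.

```r
# install.packages("here")
library(here)

# here() builds a path starting from the project root
# (for example, the folder that contains your .Rproj file).
# "data" is a hypothetical subfolder name; change it to match your setup.
movies_path <- here("data", "movies.csv")

# Only read the file if it is actually there
if (file.exists(movies_path)) {
  movies <- read.csv(movies_path, stringsAsFactors = TRUE)
}
```

Because the path is built relative to the project root, the same script should run on another computer without editing the directory by hand.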
Some basic examples
Select:
movies1 <- movies %>% select(c(title, budget, runtime, revenue))
View(movies1)
Filter:
movies2 <- movies %>% filter(Adventure==1 & Action==1) %>%
select(c(title, budget, runtime, revenue))
View(movies2)
Arrange (low to high):
movies3 <- movies %>% filter(Adventure==1 & Action==1) %>%
select(c(title, budget, runtime, revenue)) %>% arrange(budget)
View(movies3)
Arrange (high to low):
movies4 <- movies %>% filter(Adventure==1 & Action==1) %>%
select(c(title, budget, runtime, revenue)) %>% arrange(desc(budget))
View(movies4)
Mutate:
movies5 <- movies %>% filter(Adventure==1 & Action==1) %>%
select(c(title, budget, runtime, revenue)) %>% arrange(desc(budget)) %>% mutate(efficiency = budget/revenue)
View(movies5)
Summarise:
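## show the NaN result from this (R does not throw an error here) and explain why: most likely some movies have a budget and a revenue of 0, and 0/0 is NaN, which makes the whole mean NaN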
movies %>% filter(Adventure==1 & Action==1) %>%
select(c(title, budget, runtime, revenue)) %>% arrange(desc(budget)) %>% mutate(efficiency = budget/revenue) %>% summarise(mean = mean(efficiency))
## mean
## 1 NaN
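### adjustments to avoid the NaN: keep only rows where efficiency is a finite number. Note that filter(efficiency != Inf | !is.na(efficiency)) is not enough: because of the |, rows with Inf still pass the filter and the mean comes out as Inf.
movies %>% filter(Adventure==1 & Action==1) %>%
select(c(title, budget, runtime, revenue)) %>% arrange(desc(budget)) %>% mutate(efficiency = budget/revenue) %>% filter(is.finite(efficiency)) %>% summarise(mean = mean(efficiency))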
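To see where NaN and Inf come from and why the choice of filter matters, here is a minimal sketch on a made-up four-row data frame. The data frame toy and all of its values are hypothetical and are not taken from the movies data.

```r
library(dplyr)

# Made-up data: movie B has budget 0 and revenue 0, movie C has revenue 0
toy <- data.frame(
  title   = c("A", "B", "C", "D"),
  budget  = c(10, 0, 5, 20),
  revenue = c(20, 0, 0, 80)
)

toy <- toy %>% mutate(efficiency = budget / revenue)
toy$efficiency  # 0.5, NaN (0/0), Inf (5/0), 0.25

# A single NaN is enough to make the mean NaN
mean_all <- toy %>% summarise(mean = mean(efficiency))

# With |, the Inf row passes (it is not NA), so the mean is Inf
mean_or <- toy %>%
  filter(efficiency != Inf | !is.na(efficiency)) %>%
  summarise(mean = mean(efficiency))

# is.finite() drops NA, NaN, Inf and -Inf in one step
mean_finite <- toy %>%
  filter(is.finite(efficiency)) %>%
  summarise(mean = mean(efficiency))

mean_finite$mean  # 0.375, the mean of 0.5 and 0.25
```

Whether dropping those rows is the right call depends on your analysis plan. Here it simply shows how the three results differ.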
group_by:
summarise(group_by(movies, original_language), mean(popularity))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 37 x 2
## original_language `mean(popularity)`
## <fct> <dbl>
## 1 af 2.50
## 2 ar 4.72
## 3 cn 10.6
## 4 cs 1.29
## 5 da 17.7
## 6 de 10.2
## 7 el 28.9
## 8 en 22.3
## 9 es 13.3
## 10 fa 5.66
## # ... with 27 more rows
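The same kind of grouped summary can also be written with the pipe, which lets you name the summary columns. This is a small sketch on the built-in iris dataset. The names by_species, mean_length and n are chosen for illustration.

```r
library(dplyr)

# Group by species, then compute one row of summaries per group
by_species <- iris %>%
  group_by(Species) %>%
  summarise(
    mean_length = mean(Sepal.Length),  # named summary column
    n = n()                            # number of rows in each group
  )

by_species
```

Naming the summary columns avoids awkward column names like `mean(popularity)` in the output.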
Other potentially useful functions!
recode(movies$original_language, en = "English")
## or with numbers
recode(movies$original_language, en = 1L, es = 2L, zh = 3L, de = 4L)
### when recoding to numbers, make sure to recode every value: any value you leave out is replaced with NA
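Here is a small sketch of how recode() treats values you do not mention. It uses a made-up character vector (lang is hypothetical, shown as a character vector for simplicity) and the .default argument of recode().

```r
library(dplyr)

# Made-up language codes
lang <- c("en", "es", "fr")

# Text replacement: values you do not mention are left as they are
labels <- recode(lang, en = "English")
labels  # "English" "es" "fr"

# Numeric replacement: values you do not mention become NA (with a warning)
codes_na <- suppressWarnings(recode(lang, en = 1L, es = 2L))
codes_na  # 1 2 NA

# .default supplies a value for everything you did not list
codes_default <- recode(lang, en = 1L, es = 2L, .default = 0L)
codes_default  # 1 2 0
```

suppressWarnings() is only there to keep the output tidy. Without it, R warns you that the unreplaced values were treated as NA.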
movies <- dplyr::rename(movies, pop = popularity)
##why the "dplyr::"? what happens if I remove that at the front?
movies <- dplyr::select(movies, -overview)
movies <- distinct(movies)
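## funs() is deprecated as of dplyr 0.8.0, so pass mean directly; columns that are not numeric come back as NA
summarise_all(movies, mean)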
## id budget homepage original_language original_title pop
## 1 55988.21 29214581 NA NA NA 21.61734
## release_date revenue runtime status tagline title vote_average vote_count
## 1 NA 82742654 NA NA NA NA 6.114199 694.2574
## Action Adventure Animation Comedy Crime Documentary Drama
## 1 0.2416754 0.165445 0.04900524 0.3606283 0.1457592 0.02303665 0.4810471
## Family Fantasy Foreign History Horror Music Mystery
## 1 0.1074346 0.08879581 0.007120419 0.04125654 0.1086911 0.03874346 0.07287958
## Romance Science.Fiction Thriller TV.Movie War Western
## 1 0.1872251 0.1120419 0.2668063 0.001675393 0.03015707 0.01717277
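If you only want means for the numeric columns, and want to skip missing values, one option is across() together with where(). This sketch assumes dplyr 1.0.0 or later and uses the built-in iris dataset. The name numeric_means is made up for illustration.

```r
library(dplyr)

# Take the mean of every numeric column, ignoring missing values;
# columns that are not numeric (here, Species) are left out
numeric_means <- iris %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

numeric_means
```

This avoids the NA columns and the warnings you get when mean() is applied to text or factor columns.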
library(dplyr)
movies <- add_row(movies, budget= 1, id = 1)
View(movies)