1 Introduction

Hi everyone, my name is Keana, and I’m a second-year grad student working with Coren Apicella. My research focuses on factors that may affect gender differences in willingness to compete. If you’d like to hear more about my work (I’m more than happy to talk about it!), feel free to talk to me after this workshop or shoot me an email. Today, I’ll be leading you through a workshop on data management, which will mainly focus on data cleaning, arguably the most time-consuming part of analyzing your data.

2 R background

I started learning R in my last year of undergrad (spring/summer of 2017). I first tried learning it with DataCamp and swirl. But it’s hard to really force yourself unless you’re working with real data, so I’d recommend either taking a class (where you are forced to do homework and projects regularly - there are some good courses in the statistics department that I have been taking) or just forcing yourself to analyze your data completely in R (without any help from Excel). I did this and learned so much faster. Fair warning, though: you will have to be patient with it.

3 Review

Before we get started on writing our code for today’s session, I want to really quickly review what you learned last week during the intro to R workshop. Does anyone have any specific questions about any of these concepts they would like for me to address? For each of these, please feel free to shout out any questions that come to your mind.

  • directories

  • workspace

  • scripts

  • the save button

  • installing packages

  • basic R syntax

    • operations
    • organizing/interacting with your data
  • basic analyses

  • 1 ggplot example

    So one thing I think is really important for trying to work in R is being familiar with the R environment. So I’ll quickly refresh your memory on some important parts of the display here. By default, on the left hand side of the screen, we have two separate sections. The top box is where you can write and save your script for later use. Note: you can still run the code and get output up here, it’s just that your code will also be saved. In the bottom box, you can run code and get output, but it will not save. So my personal recommendation is to always write in the top box and save your script at regular intervals. Next, we have the panels on the right. There are a few tabs, but some are less useful than others:

    • Top box: you have the Environment tab, which shows everything created in your current session (datasets, variables, and other objects from code you have run). It essentially keeps a running tab of anything you’ve created. I mainly look at it to see what datasets I should call in my code. The other tabs aren’t really as useful for us today, so I’m going to skip them.
    • Bottom box: this has a few options. I think the most relevant one for you is the Help tab, where you can look up specific functions and see their arguments (for instance, let’s look at the mean function in here and some of its arguments). The arguments are essentially features of the function you can change that give you more flexibility. For instance, to calculate the mean of a column with NA values, you should set na.rm = TRUE when you call the mean() function. The Plots tab is where your plots populate automatically. You can also install/update packages in this part of the display using the Packages tab. Another relevant tab is the Files tab, which shows you your current working directory and the associated files - there are some options under the tab to make any changes you want to your wd directly from the R environment.
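
To make the na.rm point concrete, here is a minimal sketch. The vector name `scores` and its values are made up for illustration and are not from any dataset we use today.

```r
# Hypothetical vector with one missing value (illustration only)
scores <- c(4, 8, NA, 6)

# By default, a single NA makes the whole result NA
mean(scores)

# na.rm = TRUE drops the NA values before averaging (here: mean of 4, 8, 6)
mean(scores, na.rm = TRUE)

# Typing ?mean in the console opens the same page you would find in the Help tab
```

The same argument shows up in many other summary functions, such as sum() and sd(), so it is worth checking the Help page whenever a result unexpectedly comes back as NA.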

Any questions?

Alright, so let’s review one other thing that I definitely found confusing when I first started learning R: graphing with ggplot.

library(ggplot2)  # provides ggplot(), geom_point(), and the other plotting functions used below
graph <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width))
## what does this "graph" look like before we add anything else? 

## essentially a template for us to add the data and any other settings we want. You NEED to do this every time. Then you add on to it (literally, by using the + symbol). But you need a basic structure on which to build (think of it like laying the foundation of a house and building on top of it). Can't build without the foundation though.

## first we have to tell it we want a scatterplot. 

graph + geom_point()

### Seems like there's some separation here, as if there are two different groups that have different slopes etc. Do you remember what groups we defined last time that seemed to be driving this separation? 

## Species! Let's add to our graph a setting that differentiates between species based on color AND shape.

graph + geom_point(aes(color=Species, shape=Species))

## so here, each species has a unique color and shape. We could also just use color as their identifiers in the key: 

graph + geom_point(aes(color=Species))

## Now each species has its own color, but they all share the same shape (it looks like the default shape is a circle, but you should be able to change that in the arguments for this function if you have a preference for another shape)
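
As a rough sketch of that last point: a fixed shape can be passed to geom_point() outside of aes(). The value 17 (a filled triangle among R's standard point symbols) is just an arbitrary example, the name `triangles` is a placeholder, and I rebuild `graph` here only so the block runs on its own.

```r
library(ggplot2)

# Same base template as above, rebuilt so this block is self-contained
graph <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width))

# Color still varies by species (inside aes), while shape is fixed for
# every point (outside aes). 17 is a filled triangle; an arbitrary pick.
triangles <- graph + geom_point(aes(color = Species), shape = 17)
triangles
```

The rule of thumb: anything inside aes() is mapped from a column in the data, and anything outside it is a single setting applied to all points.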

## One other thing we did was add axis labels and a title, like so:

graph + xlab("Sepal Length") + ylab("Sepal Width") +
  ggtitle("Sepal Length-Width")

### mention saving a graph here: notice how it got rid of the points, even though it changed the title and axis labels. Does anyone know why?
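
One way to see what is going on, sketched under the assumption that `graph` was only ever assigned the bare template: `graph + geom_point()` draws the points but does not change `graph` itself, so the points have to be stored in an object if we want to carry them forward. The names `graph_points`, `labeled`, and `out_file` are placeholders I made up for illustration; ggsave() is the ggplot2 function for writing a plot to disk.

```r
library(ggplot2)

graph <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width))

# Adding a layer without assigning it draws the plot but leaves `graph` unchanged
graph + geom_point()
length(graph$layers)   # `graph` itself still stores no layers

# Assign the result to keep the points
graph_points <- graph + geom_point()

# Labels and title now build on the version that includes the points
labeled <- graph_points + xlab("Sepal Length") + ylab("Sepal Width") +
  ggtitle("Sepal Length-Width")

# Save to a temporary file; swap in a path inside your working directory
out_file <- tempfile(fileext = ".pdf")
ggsave(out_file, plot = labeled, width = 6, height = 4)
```

Passing the plot explicitly through the `plot` argument avoids relying on whichever plot happened to be displayed last.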

## finally, ggplot2 has themes that make formatting much easier - if you're a person like me who doesn't know anything about aesthetics etc, you can just set one and not think too hard. So let's tack on a theme and put it all together. 

graph + theme_minimal() + geom_point(aes(color = Species, shape = Species)) +
  xlab("Sepal Length") + ylab("Sepal Width") +
  ggtitle("Sepal Length-Width")