tidyverse
These materials are based on the APS's “R for Plant Pathologists”
Some inspiration from J. Bryan's Stat545 and B. Boehmke's Intro to R
All highly recommended
Performance: stable, light and fast
Support network: documentation, community, developers
Reproducibility: anyone anywhere can reproduce results
Versatility: a unified solution to almost any numerical problem, plus graphical capabilities
Ethics: accessible to anyone as it is free and open source
Transition from “point and click” is tough but rewarding
Help:
Learning:
Cheatsheets → https://rstudio.com/resources/cheatsheets/
R – Statistical programming language
http://www.r-project.org/
RStudio – Integrated Development Environment (IDE) makes our life much easier
RStudio – Dashboard
Move the cursor onto a line with R code and press:
Objects, where the data is stored.
Assign with <-
x <- 1
y <- 2
x + y
[1] 3
the same result as:
1 + 2
[1] 3
Functions are applied to objects or to other functions (e.g. to analyze the data). They are called with round brackets!
# I am a comment!!! Just here to help jog the memory later on...
# Let us make a function!
addition <- function(argument_one,
                     argument_two){
  argument_one + argument_two # operations
} # curly brackets define operations
ls() # check content of the environment
[1] "addition" "x" "y"
addition(argument_one = x,
         argument_two = y)
[1] 3
addition(x, y) # Notice the difference? Arguments are matched by position, not by name
[1] 3
addition(x, y) == x+y #notice double "="
[1] TRUE
all.equal(addition(x, y), x + y) # pre-made function: tests near-equality (within a small numeric tolerance)
[1] TRUE
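A short sketch of why `all.equal()` is often preferred over `==` for decimal numbers. The values `0.1`, `0.2` and `0.3` are illustrative choices, not part of the workshop data.

```r
# Decimal fractions are stored approximately, so 0.1 + 0.2 is not exactly 0.3
a <- 0.1 + 0.2
b <- 0.3

exact <- a == b                   # strict comparison of the stored values
near  <- isTRUE(all.equal(a, b))  # comparison within a small tolerance

print(exact)  # FALSE
print(near)   # TRUE
```

Wrapping `all.equal()` in `isTRUE()` is useful because, when the values differ, `all.equal()` returns a text description of the difference rather than `FALSE`.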
Vectors store data of the same type
(a column of an Excel table)
num <- c(50, 60, 65)
char <- c("mouse", "rat", "dog")
fct <- factor(c("low", "med", "high")) # values must be wrapped in c()
dates <- as.Date(c("02/27/92", "02/27/92", "01/14/92"), "%m/%d/%y")
logical <- c(FALSE, FALSE, TRUE) # only TRUE or FALSE
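A minimal sketch of what "same type" means in practice: `class()` reports the type, and mixing types in one vector silently converts everything to a common type. The names `mixed` and `size` are made up for illustration.

```r
num  <- c(50, 60, 65)
char <- c("mouse", "rat", "dog")

class(num)   # "numeric"
class(char)  # "character"

# Mixing types: every element is converted to text
mixed <- c(50, "rat", TRUE)
class(mixed)  # "character"
print(mixed)  # "50" "rat" "TRUE"

# A factor stores categories; `levels` sets their order explicitly
size <- factor(c("low", "high", "med"),
               levels = c("low", "med", "high"))
levels(size)      # "low" "med" "high"
as.integer(size)  # 1 3 2 (position of each value within the levels)
```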
Subsetting - square brackets
num[1] # 1st element
[1] 50
num[num >= 60] # elements greater than or equal to 60
[1] 60 65
char == "dog" # same values as the vector `logical` defined earlier
[1] FALSE FALSE TRUE
char[logical]
[1] "dog"
char[char == "dog"]
[1] "dog"
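A few more subsetting patterns that follow the same square-bracket logic, sketched on the vectors defined earlier. The result names (`dropped`, `picked`, and so on) are only for illustration.

```r
num  <- c(50, 60, 65)
char <- c("mouse", "rat", "dog")

dropped  <- num[-1]                          # negative index drops an element
picked   <- num[c(1, 3)]                     # several positions at once
position <- which(num >= 60)                 # positions where the condition holds
matched  <- char[char %in% c("rat", "dog")]  # match against a set of values
crossed  <- char[num > 55]                   # condition on one vector, subset another

print(dropped)   # 60 65
print(picked)    # 50 65
print(position)  # 2 3
print(matched)   # "rat" "dog"
print(crossed)   # "rat" "dog"
```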
A data frame is a set of vectors of the same length (an entire Excel table)
df <- data.frame(col_one = num,
                 col_two = char)
print(df)
col_one col_two
1 50 mouse
2 60 rat
3 65 dog
head(df,1)
col_one col_two
1 50 mouse
Same logic for indexing, just in 2 dimensions
df[1, 1] # [rows, columns]
[1] 50
df[, 1] # 1st column in the data frame
[1] 50 60 65
df[, -2] # Exclude 2nd column
[1] 50 60 65
df[2:3, "col_two"]
[1] "rat" "dog"
df$col_two
[1] "mouse" "rat" "dog"
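The same indexing can filter rows by a condition or add a column. A minimal base R sketch, where `big`, `one_col` and `col_three` are hypothetical names:

```r
df <- data.frame(col_one = c(50, 60, 65),
                 col_two = c("mouse", "rat", "dog"))

# Keep rows where the condition is TRUE, and all columns
big <- df[df$col_one >= 60, ]

# Selecting a single column returns a vector by default;
# drop = FALSE keeps the result as a data frame
one_col <- df[, "col_one", drop = FALSE]

# Add a new column computed from an existing one
df$col_three <- df$col_one / 10

print(big)
print(one_col)
print(df)
```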
Pre-made set of functions for common (and not so common) tasks
Think of something like the Microsoft Office suite
tidyverse and the data analysis cycle
Several functions within readr and readxl read different types of files.
For this workshop, we will use data on coffee leaf rust from Ethiopia
library(tidyverse) # attaches readr, dplyr, tidyr and the %>% pipe
dt <- read_csv(here::here("data", "survey_clean.csv"))
tibble::glimpse(dt, 70)
Rows: 405
Columns: 13
$ farm <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...
$ region <chr> "SNNPR", "SNNPR", "SNNPR", "SNNPR", "SNNP...
$ zone <chr> "Bench Maji", "Bench Maji", "Bench Maji",...
$ district <chr> "Debub Bench", "Debub Bench", "Debub Benc...
$ lon <dbl> 35.44250, 35.44250, 35.42861, 35.42861, 3...
$ lat <dbl> 6.904722, 6.904722, 6.904444, 6.904444, 6...
$ altitude <dbl> 1100, 1342, 1434, 1100, 1400, 1342, 1432,...
$ cultivar <chr> "Local", "Mixture", "Mixture", "Local", "...
$ shade <chr> "Sun", "Mid shade", "Mid shade", "Sun", "...
$ cropping_system <chr> "Plantation", "Plantation", "Plantation",...
$ farm_management <chr> "Unmanaged", "Minimal", "Minimal", "Unman...
$ inc <dbl> 86.70805, 51.34354, 43.20000, 76.70805, 4...
$ sev2 <dbl> 55.57986, 17.90349, 8.25120, 46.10154, 12...
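To try `read_csv()` without the workshop file, a tiny CSV can be written to a temporary file first. This is only a sketch: the file contents and the name `toy` are made up, and `show_col_types = FALSE` assumes readr 2.0 or newer.

```r
library(readr)

# Write a three-line CSV (header + two rows) to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("farm,zone,inc",
             "1,Sheka,33.2",
             "2,Sidama,16.5"),
           tmp)

# read_csv() guesses the column types from the content
toy <- read_csv(tmp, show_col_types = FALSE)

print(toy)
sapply(toy, class)  # type guessed for each column
```

For Excel files, `readxl::read_excel()` plays the same role.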
dplyr functions
Six key dplyr functions allow you to solve the vast majority of your data transformation challenges:
Function | Description |
---|---|
filter | pick observations based on values |
select | pick variables |
summarize | compute statistical summaries |
group_by | perform operations at different levels of your data |
arrange | reorder data |
mutate | create new variables |
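A minimal sketch of `mutate()` and `arrange()` on a made-up table. The data `toy` and its columns `plot`, `diseased` and `total` are hypothetical and are not taken from the coffee leaf rust survey.

```r
library(dplyr)

# Hypothetical counts of diseased and total leaves per plot
toy <- tibble(plot     = c("a", "b", "c"),
              diseased = c(12, 30, 5),
              total    = c(40, 50, 50))

toy_inc <- toy %>%
  mutate(inc = 100 * diseased / total) %>%  # new variable: incidence in %
  arrange(desc(inc))                        # reorder rows, highest incidence first

print(toy_inc)
```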
Piping
From the magrittr package.
Traditional approach:
function(argument_one, argument_two,...)
Pipe %>% approach:
argument_one %>%
  function(., argument_two,...)
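A small runnable comparison of the two approaches. The vector `values` is an arbitrary example.

```r
library(magrittr)

values <- c(1, 4, 9)

# Traditional approach: read from the inside out
nested <- round(sqrt(sum(values)), 1)

# Pipe approach: read from top to bottom; each result is passed on
# as the first argument of the next function
piped <- values %>%
  sum() %>%
  sqrt() %>%
  round(1)

identical(nested, piped)  # TRUE
```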
Make a small subset of the data
(dt_small <-
  dt %>%
  select(cultivar, zone, inc) %>%
  group_by(cultivar, zone) %>%
  slice(head(row_number(), 1)) %>% # keep the first row of each cultivar-zone group
  filter(zone == "Sheka" | zone == "Sidama") %>%
  ungroup())
# A tibble: 6 x 3
cultivar zone inc
<chr> <chr> <dbl>
1 Improved Sheka 33.2
2 Improved Sidama 16.5
3 Local Sheka 81.8
4 Local Sidama 35.2
5 Mixture Sheka 29.5
6 Mixture Sidama 18.6
dt_small %>%
  select(cultivar, inc) %>%
  filter(inc <= 17)
# A tibble: 1 x 2
cultivar inc
<chr> <dbl>
1 Improved 16.5
dt_small %>%
  group_by(cultivar) %>%
  summarize(mean_inc = mean(inc),
            min_inc = min(inc)) %>%
  arrange(desc(mean_inc))
# A tibble: 3 x 3
cultivar mean_inc min_inc
<chr> <dbl> <dbl>
1 Local 58.5 35.2
2 Improved 24.8 16.5
3 Mixture 24.1 18.6
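The same pattern works for any grouping variable. The sketch below rebuilds the six-row subset by hand from the rounded values printed earlier (so results may differ slightly from the full data) and summarizes by zone instead. `by_zone` and `n_obs` are illustrative names.

```r
library(dplyr)

# Subset re-typed from the printed output (values rounded to one decimal)
dt_small <- tibble(
  cultivar = c("Improved", "Improved", "Local", "Local", "Mixture", "Mixture"),
  zone     = rep(c("Sheka", "Sidama"), times = 3),
  inc      = c(33.2, 16.5, 81.8, 35.2, 29.5, 18.6)
)

by_zone <- dt_small %>%
  group_by(zone) %>%
  summarize(n_obs    = n(),        # number of rows in each group
            mean_inc = mean(inc),  # mean incidence in each group
            .groups  = "drop")     # return an ungrouped tibble

print(by_zone)
```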
Important for data visualization
Our data subset is in long format
dt_small
# A tibble: 6 x 3
cultivar zone inc
<chr> <chr> <dbl>
1 Improved Sheka 33.2
2 Improved Sidama 16.5
3 Local Sheka 81.8
4 Local Sidama 35.2
5 Mixture Sheka 29.5
6 Mixture Sidama 18.6
Change it to wide format with tidyr's pivot_wider()
names_from: column whose values become the new column names
values_from: column whose values fill the new columns
(dt_small_wide <-
  dt_small %>%
  pivot_wider(names_from = "zone",
              values_from = "inc"))
# A tibble: 3 x 3
cultivar Sheka Sidama
<chr> <dbl> <dbl>
1 Improved 33.2 16.5
2 Local 81.8 35.2
3 Mixture 29.5 18.6
Can we do it the other way around?
dt_small_wide
# A tibble: 3 x 3
cultivar Sheka Sidama
<chr> <dbl> <dbl>
1 Improved 33.2 16.5
2 Local 81.8 35.2
3 Mixture 29.5 18.6
Change it to long format with pivot_longer()
cols: columns to gather into rows
names_to: new column that stores the old column names
values_to: new column that stores the values
dt_small_wide %>%
  pivot_longer(cols = c("Sheka", "Sidama"),
               names_to = "zone",
               values_to = "inc")
# A tibble: 6 x 3
cultivar zone inc
<chr> <chr> <dbl>
1 Improved Sheka 33.2
2 Improved Sidama 16.5
3 Local Sheka 81.8
4 Local Sidama 35.2
5 Mixture Sheka 29.5
6 Mixture Sidama 18.6
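A quick check that the two reshaping steps undo each other. The subset is re-typed by hand from the printed output, and `round_trip` is a hypothetical name.

```r
library(dplyr)
library(tidyr)

# Subset re-typed from the printed output (values rounded to one decimal)
dt_small <- tibble(
  cultivar = c("Improved", "Improved", "Local", "Local", "Mixture", "Mixture"),
  zone     = rep(c("Sheka", "Sidama"), times = 3),
  inc      = c(33.2, 16.5, 81.8, 35.2, 29.5, 18.6)
)

# Long -> wide -> long again
round_trip <- dt_small %>%
  pivot_wider(names_from = "zone", values_from = "inc") %>%
  pivot_longer(cols = c("Sheka", "Sidama"),
               names_to = "zone",
               values_to = "inc")

# Compare as plain data frames
all.equal(as.data.frame(dt_small), as.data.frame(round_trip))  # TRUE
```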
So, the painful part is done, enjoy the rest!