A very brief...
Intro to R







Mladen Čučak

mladencucak@gmail.com

Topics

  • About R/RStudio
  • Basics of programming with R
  • Data analysis with tidyverse









These materials are based on the APS's “R for Plant Pathologists”.
Some inspiration comes from J. Bryan's Stat545 and B. Boehmke's Intro to R.
All are highly recommended.

Why R


  • Performance: stable, light and fast

  • Support network: documentation, community, developers

  • Reproducibility: anyone anywhere can reproduce results

  • Versatility: one tool for almost any numerical problem, with strong graphical capabilities

  • Ethics: accessible to anyone as it is free and open source

Be strong!

The transition from “point and click” is tough but rewarding

Baby steps

Help:

  • Google: just add “with R” at the end of any search
  • Stack Overflow: programming questions
  • Cross Validated: scientific questions

Learning:

Your new best friends



R – Statistical programming language


http://www.r-project.org/



RStudio – Integrated Development Environment (IDE) makes our life much easier

https://rstudio.com/

It may be described as...


R – Engine


RStudio – Dashboard



R interface

is not the friendliest one

RStudio (IDE)

Move onto some coding



Move the cursor onto a line with R code and press:

  • (Win) Ctrl + Enter or
  • (Mac) Cmd + Return

    Challenge: do it with the hand you are not using to hold the mouse!


    Tips for later:
  • RStudio has many other keyboard shortcuts; list them with (Win) Alt + Shift + K or (Mac) Option + Shift + K
  • For example, to run an entire script: (Win) Ctrl + Shift + Enter or (Mac) Cmd + Shift + Return

R basics: In R, we have...

Objects, where the data is stored.

Assign with <-

x <- 1
y <- 2
x + y
[1] 3

the same result as:

1 + 2
[1] 3
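
One point worth seeing in code: an object stores a value, not the expression that produced it. The sketch below is only an illustration; the object `z` is a hypothetical name that does not appear in the slides.

```r
x <- 1
y <- 2
z <- x + y   # z stores the value 3, not the formula "x + y"

x <- 10      # overwrite x with a new value
print(z)     # z still holds the old result

z <- x + y   # re-run the assignment to update z
print(z)
```

Re-running the assignment is what updates `z`; changing `x` alone does not.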

Functions, which are applied to objects or to other functions (e.g. to analyze the data): round brackets!

# I am a comment!!! Just here to help jog the memory later on...
# Let us make a function!
addition <- function(argument_one,
                     argument_two){ 
  argument_one + argument_two # operations
} # curly brackets define operations

ls() # check content of the environment
[1] "addition" "x"        "y"       
addition(argument_one = x,
         argument_two = y)
[1] 3

addition <- function(argument_one, argument_two) {
  argument_one + argument_two
}
addition(argument_one = x, argument_two = y) # arguments matched by name
[1] 3
addition(x, y) # arguments matched by position. Notice the difference?!
[1] 3
addition(x, y) == x + y # notice the double "=": this tests equality
[1] TRUE
all.equal(addition(x, y), x + y) # built-in check, tolerant of tiny numeric differences
[1] TRUE
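
With addition, the order of the arguments cannot change the result, so the difference between matching by name and by position is easy to miss. A minimal sketch with a hypothetical `subtraction` function (not part of the original slides) makes it visible:

```r
# Hypothetical helper: subtraction makes the argument order matter
subtraction <- function(argument_one, argument_two) {
  argument_one - argument_two
}

x <- 1
y <- 2

by_position <- subtraction(x, y)                                # 1 - 2
by_name     <- subtraction(argument_two = y, argument_one = x)  # still 1 - 2
swapped     <- subtraction(y, x)                                # 2 - 1

print(c(by_position, by_name, swapped))
```

Named arguments can be given in any order; positional arguments are matched in the order they are written.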

Objects: Vectors

Vectors store data of the same type
(a column of an Excel table)

Types of data:

num <- c(50, 60, 65) 

char <- c("mouse", "rat", "dog") 

fct <- factor(c("low", "med", "high")) # values must be combined with c()

dates <- as.Date(c("02/27/92", "02/27/92", "01/14/92"), "%m/%d/%y")

logical <-  c(FALSE, FALSE, TRUE) # only TRUE or FALSE
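
To see what "the same type" means in practice, the sketch below checks the type of each vector with `class()` and then mixes types on purpose. The names `lgl` and `mixed` are hypothetical and used only for this illustration.

```r
num  <- c(50, 60, 65)
char <- c("mouse", "rat", "dog")
lgl  <- c(FALSE, FALSE, TRUE)

class(num)   # "numeric"
class(char)  # "character"
class(lgl)   # "logical"

# Mixing types in one vector: R converts everything to a single type,
# here character, because text cannot be turned into a number
mixed <- c(50, "mouse", TRUE)
class(mixed)
print(mixed)
```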

Subsetting - square brackets

num[1] # 1st element
[1] 50
num[num >= 60] # elements greater than or equal to 60
[1] 60 65
char == "dog" # returns a logical vector, same as `logical` above
[1] FALSE FALSE  TRUE
char[logical]
[1] "dog"
char[char == "dog"]
[1] "dog"
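
A few more subsetting patterns follow from the same square-bracket logic. This is a sketch only; the result names (`first_and_third` and so on) are hypothetical.

```r
num  <- c(50, 60, 65)
char <- c("mouse", "rat", "dog")

first_and_third <- num[c(1, 3)]              # several positions at once
without_first   <- num[-1]                   # negative index drops an element
in_between      <- num[num > 50 & num < 65]  # combine conditions with &
heavy_animals   <- char[num >= 60]           # subset one vector using another
dog_position    <- which(char == "dog")      # position of the TRUE values

print(first_and_third)
print(without_first)
print(in_between)
print(heavy_animals)
print(dog_position)
```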

Objects: Dataframes

A data frame is a set of vectors of the same length (an entire Excel table)

Creating and viewing data frames

df <- data.frame(col_one = num,
                 col_two = char)
print(df)
  col_one col_two
1      50   mouse
2      60     rat
3      65     dog
head(df,1)
  col_one col_two
1      50   mouse

Same logic for indexing, just in 2 dimensions

df[1, 1] # [rows, columns]
[1] 50
df[, 1] # 1st column in the data frame
[1] 50 60 65
df[, -2] # Exclude 2nd column
[1] 50 60 65
df[2:3, "col_two"] 
[1] "rat" "dog"
df$col_two
[1] "mouse" "rat"   "dog"  
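
The same indexing can filter rows by a condition or add a column. A minimal sketch, assuming the data frame from this slide; `col_three` and `big` are hypothetical names added for illustration.

```r
df <- data.frame(col_one = c(50, 60, 65),
                 col_two = c("mouse", "rat", "dog"))

dim(df)                          # number of rows and columns
str(df)                          # compact overview of each column

big <- df[df$col_one >= 60, ]    # rows where the condition is TRUE, all columns
print(big)

df$col_three <- df$col_one * 2   # hypothetical new column built from an existing one
print(df)
```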

R packages

Pre-made sets of functions for common (and not so common) tasks

A package of R packages: tidyverse

Think of it as something like the Microsoft Office suite
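
A package is installed once and then loaded in every new R session. The sketch below shows one possible pattern, assuming an internet connection for the installation step; the object `has_tidyverse` is a hypothetical name.

```r
# install.packages("tidyverse")  # run once per computer (needs internet)

# Check whether the package is available without attaching it
has_tidyverse <- requireNamespace("tidyverse", quietly = TRUE)

if (has_tidyverse) {
  library(tidyverse)  # attach the package; repeat in every new session
} else {
  message("tidyverse is not installed; run install.packages(\"tidyverse\")")
}

# package::function() uses one function without attaching the whole package
middle <- stats::median(c(1, 5, 9))
print(middle)
```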

tidyverse and data analysis cycle

Data import

Several functions in readr and readxl import different types of files.
For this workshop, we will use data on coffee leaf rust from Ethiopia.

library(tidyverse) # loads readr, dplyr, tidyr and the pipe %>%
dt <- read_csv(here::here("data", "survey_clean.csv"))
tibble::glimpse(dt, 70)
Rows: 405
Columns: 13
$ farm            <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...
$ region          <chr> "SNNPR", "SNNPR", "SNNPR", "SNNPR", "SNNP...
$ zone            <chr> "Bench Maji", "Bench Maji", "Bench Maji",...
$ district        <chr> "Debub Bench", "Debub Bench", "Debub Benc...
$ lon             <dbl> 35.44250, 35.44250, 35.42861, 35.42861, 3...
$ lat             <dbl> 6.904722, 6.904722, 6.904444, 6.904444, 6...
$ altitude        <dbl> 1100, 1342, 1434, 1100, 1400, 1342, 1432,...
$ cultivar        <chr> "Local", "Mixture", "Mixture", "Local", "...
$ shade           <chr> "Sun", "Mid shade", "Mid shade", "Sun", "...
$ cropping_system <chr> "Plantation", "Plantation", "Plantation",...
$ farm_management <chr> "Unmanaged", "Minimal", "Minimal", "Unman...
$ inc             <dbl> 86.70805, 51.34354, 43.20000, 76.70805, 4...
$ sev2            <dbl> 55.57986, 17.90349, 8.25120, 46.10154, 12...

Data transformation

dplyr Functions

Six key dplyr functions that allow you to solve the vast majority of your data transformation challenges:

Function    Description
filter      pick observations based on values
select      pick variables
summarize   compute statistical summaries
group_by    perform operations at different levels of your data
arrange     reorder data
mutate      create new variables

Piping

From the magrittr package.
Traditional approach:

some_function(argument_one, argument_two, ...)

Pipe %>% approach:

argument_one %>%
  some_function(., argument_two, ...)
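
To make the two approaches concrete, here is a minimal sketch that computes the same result both ways. It assumes the magrittr package is installed; the vector `values` is made up for illustration.

```r
library(magrittr)

values <- c(1.234, 5.678, 9.1011)  # hypothetical numbers

# Traditional approach: functions are nested, read from the inside out
nested <- round(mean(values), 1)

# Pipe approach: read left to right, each result feeds the next function
piped <- values %>%
  mean() %>%
  round(1)

identical(nested, piped)
```

Both versions call the same functions in the same order; the pipe only changes how the code reads.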

Let's test these

Make a small subset of the data

(dt_small <-
  dt %>%
  select(cultivar, zone, inc) %>%
  group_by(cultivar, zone) %>%
  slice(1) %>% # keep the first row of each cultivar-zone group
  filter(zone %in% c("Sheka", "Sidama")) %>%
  ungroup())
# A tibble: 6 x 3
  cultivar zone     inc
  <chr>    <chr>  <dbl>
1 Improved Sheka   33.2
2 Improved Sidama  16.5
3 Local    Sheka   81.8
4 Local    Sidama  35.2
5 Mixture  Sheka   29.5
6 Mixture  Sidama  18.6
dt_small %>% 
  select(cultivar, inc) %>% 
  filter(inc <= 17)
# A tibble: 1 x 2
  cultivar   inc
  <chr>    <dbl>
1 Improved  16.5
dt_small %>%
  group_by(cultivar) %>%
  summarize(mean_inc = mean(inc),
            min_inc = min(inc)) %>%
  arrange(desc(mean_inc))
# A tibble: 3 x 3
  cultivar mean_inc min_inc
  <chr>       <dbl>   <dbl>
1 Local        58.5    35.2
2 Improved     24.8    16.5
3 Mixture      24.1    18.6
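
Of the six functions listed earlier, mutate is the one not used so far. The sketch below rebuilds `dt_small` by hand from the printed output (values rounded to one decimal) so that it runs without the survey file. It assumes that `inc` is a percentage, and the new columns `inc_prop` and `high_inc`, as well as the threshold of 50, are hypothetical.

```r
library(dplyr)

# dt_small re-typed from the printed tibble (rounded values)
dt_small <- tibble(
  cultivar = rep(c("Improved", "Local", "Mixture"), each = 2),
  zone     = rep(c("Sheka", "Sidama"), times = 3),
  inc      = c(33.2, 16.5, 81.8, 35.2, 29.5, 18.6)
)

dt_prop <- dt_small %>%
  mutate(inc_prop = inc / 100,  # assumption: inc is a percentage
         high_inc = inc > 50)   # hypothetical threshold for "high" incidence

print(dt_prop)
```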

Reshaping data: wide

Important for data visualization

Our data subset is in long format

dt_small
# A tibble: 6 x 3
  cultivar zone     inc
  <chr>    <chr>  <dbl>
1 Improved Sheka   33.2
2 Improved Sidama  16.5
3 Local    Sheka   81.8
4 Local    Sidama  35.2
5 Mixture  Sheka   29.5
6 Mixture  Sidama  18.6

Change it to wide format with pivot_wider() from tidyr

  • names_from: the column whose values become the new column names
  • values_from: the column whose values fill the new columns

(dt_small_wide <- 
dt_small %>%
  pivot_wider(names_from = "zone", 
              values_from = "inc"))
# A tibble: 3 x 3
  cultivar Sheka Sidama
  <chr>    <dbl>  <dbl>
1 Improved  33.2   16.5
2 Local     81.8   35.2
3 Mixture   29.5   18.6

Reshaping data: long




Can we do it the other way around?

dt_small_wide 
# A tibble: 3 x 3
  cultivar Sheka Sidama
  <chr>    <dbl>  <dbl>
1 Improved  33.2   16.5
2 Local     81.8   35.2
3 Mixture   29.5   18.6

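Change it to long format with pivot_longer()

  • cols: the columns to stack into rows
  • names_to: name of the new column that holds the old column names
  • values_to: name of the new column that holds the values
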
dt_small_wide %>% 
  pivot_longer(cols = 
                 c("Sheka", "Sidama"), 
               names_to = "zone",
               values_to = "inc")
# A tibble: 6 x 3
  cultivar zone     inc
  <chr>    <chr>  <dbl>
1 Improved Sheka   33.2
2 Improved Sidama  16.5
3 Local    Sheka   81.8
4 Local    Sidama  35.2
5 Mixture  Sheka   29.5
6 Mixture  Sidama  18.6
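
One way to convince yourself that the two functions are inverses is to go wide and back again, then compare with the starting table. This is a sketch, assuming dplyr and tidyr are installed; `dt_small` is re-typed from the printed output (rounded values) and `round_trip` is a hypothetical name.

```r
library(dplyr)
library(tidyr)

# dt_small re-typed from the printed tibble (rounded values)
dt_small <- tibble(
  cultivar = rep(c("Improved", "Local", "Mixture"), each = 2),
  zone     = rep(c("Sheka", "Sidama"), times = 3),
  inc      = c(33.2, 16.5, 81.8, 35.2, 29.5, 18.6)
)

round_trip <- dt_small %>%
  pivot_wider(names_from = "zone",
              values_from = "inc") %>%
  pivot_longer(cols = c("Sheka", "Sidama"),
               names_to = "zone",
               values_to = "inc")

# Compare as plain data frames
all.equal(as.data.frame(dt_small), as.data.frame(round_trip))
```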





Congratulations!!

So, the painful part is done, enjoy the rest!