Adapted from R for Data Science by Hadley Wickham and Garrett Grolemund
In this section, we will use a built-in dataset called “diamonds”, which ships with ggplot2, to explore tools and techniques that are useful for exploratory data analysis. We will mostly use tidyverse packages.
# load the library
library(tidyverse)
# call for data
data("diamonds")
# display data structure
str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
# bar chart of the number of diamonds in each cut category
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
Here each bar represents one category of diamond cut. The height of a bar shows how many observations fall in that category.
# if you want a count table
diamonds %>%
count(cut)
## # A tibble: 5 x 2
## cut n
## <ord> <int>
## 1 Fair 1610
## 2 Good 4906
## 3 Very Good 12082
## 4 Premium 13791
## 5 Ideal 21551
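The counts can also be expressed as proportions, which are sometimes easier to compare than raw bar heights. The following is a minimal sketch, assuming dplyr and ggplot2 are installed; the names `cut_share` and `prop` are our own choices for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# share of diamonds in each cut category
cut_share <- diamonds %>%
  count(cut) %>%
  mutate(prop = n / sum(n))

cut_share
```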
Distribution plots are among the most common visualization tools for exploratory data analysis. The choice of tool depends in part on the data type: histograms are used for continuous variables, bar plots are used for categorical variables, and box plots show how a continuous variable is distributed across the levels of a categorical one.
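One quick way to see which variables are continuous and which are categorical is to test each column's type. This is only a sketch: it treats every numeric column as continuous and every other column as categorical, which holds for diamonds but is an assumption for other datasets. The variable names are illustrative.

```r
library(ggplot2)  # provides the diamonds dataset

# TRUE for numeric columns, FALSE for (ordered) factor columns
is_continuous <- sapply(diamonds, is.numeric)

continuous_vars  <- names(diamonds)[is_continuous]
categorical_vars <- names(diamonds)[!is_continuous]

continuous_vars   # candidates for histograms
categorical_vars  # candidates for bar plots
```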
# call for data structure and check the data type for diamond carat
str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
Carat is a numerical measure of a diamond’s weight. Here carat is a continuous variable, so in principle it can take any value within its range.
# maximum value for diamond carat
max(diamonds$carat)
## [1] 5.01
# minimum value for diamond carat
min(diamonds$carat)
## [1] 0.2
# distribution of carat values
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
Note: binwidth sets the width of each bin, in the units of the x variable. If you increase the bin width, the histogram will have fewer bins.
# check the difference when the bin size is one
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat), binwidth = 1)
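The effect of the bin width can also be checked numerically. The sketch below uses cut_width() from ggplot2 together with count() from dplyr to tabulate the bins by hand. The exact bin boundaries may differ slightly from those drawn by geom_histogram(), so treat it as an approximation of what the plots show.

```r
library(dplyr)
library(ggplot2)  # provides diamonds and cut_width()

# count the diamonds that fall into bins of width 0.5 and of width 1
bins_half <- diamonds %>% count(cut_width(carat, 0.5))
bins_one  <- diamonds %>% count(cut_width(carat, 1))

nrow(bins_half)  # number of non-empty bins with a width of 0.5
nrow(bins_one)   # number of non-empty bins with a width of 1
```

The wider bins produce a shorter table, which matches the coarser histogram.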
We can also subset the data to focus on a specific range of values.
# keep only the diamonds with carat less than 3
smaller <- diamonds %>%
filter(carat < 3)
# check the table
head(smaller)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.230 Ideal E SI2 61.5 55. 326 3.95 3.98 2.43
## 2 0.210 Premium E SI1 59.8 61. 326 3.89 3.84 2.31
## 3 0.230 Good E VS1 56.9 65. 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58. 334 4.20 4.23 2.63
## 5 0.310 Good J SI2 63.3 58. 335 4.34 4.35 2.75
## 6 0.240 Very Good J VVS2 62.8 57. 336 3.94 3.96 2.48
# check the x-axis and compare with the previous histogram.
ggplot(data = smaller, mapping = aes(x = carat)) +
geom_histogram(binwidth = 0.5)
# adding a fixed color to the histogram (set fill outside aes())
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.5, fill = "red")
# adding informative color to the histogram
ggplot(data = smaller, mapping = aes(x = carat, fill = cut)) +
geom_histogram(binwidth = 0.5)
If you want lines instead of bars, you can use the geom_freqpoly() function. It performs the same binning as geom_histogram() but displays the counts with lines.
ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
geom_freqpoly(binwidth = 0.1)
Histograms are a good tool for exploring a single continuous variable. However, they are less intuitive for comparing a continuous variable across the levels of a categorical variable. For that, we use a box plot.
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
geom_boxplot()
# if you want to reorder based on the median value
ggplot(data = diamonds, mapping = aes(x = reorder(cut, price, FUN = median), y = price)) +
geom_boxplot()
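The statistics that a box plot draws can also be computed directly, which makes the ordering by median explicit. This is a sketch, assuming dplyr is available; the column names `median_price`, `q1` and `q3` are our own.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# median and quartiles of price for each cut, sorted by the median
price_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(
    median_price = median(price),
    q1 = unname(quantile(price, 0.25)),
    q3 = unname(quantile(price, 0.75))
  ) %>%
  arrange(median_price)

price_by_cut
```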
If you have long variable names, geom_boxplot() will work better if you flip it 90°. You can do that with the coord_flip() function.
ggplot(data = diamonds, mapping = aes(x = reorder(cut, price, FUN = median), y = price)) +
geom_boxplot() +
coord_flip()
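In ggplot2 3.3.0 and later, a horizontal box plot can also be drawn without coord_flip() by mapping the categorical variable to y and the continuous variable to x. The sketch below assumes such a version is installed; the name `p_horizontal` is illustrative.

```r
library(ggplot2)

# categorical variable on y, continuous variable on x: boxes are drawn horizontally
p_horizontal <- ggplot(
  data = diamonds,
  mapping = aes(x = price, y = reorder(cut, price, FUN = median))
) +
  geom_boxplot()

p_horizontal
```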
To visualize covariation between categorical variables, you will need to count the number of observations for each combination.
# display data as table
diamonds %>%
count(color, cut)
## # A tibble: 35 x 3
## color cut n
## <ord> <ord> <int>
## 1 D Fair 163
## 2 D Good 662
## 3 D Very Good 1513
## 4 D Premium 1603
## 5 D Ideal 2834
## 6 E Fair 224
## 7 E Good 933
## 8 E Very Good 2400
## 9 E Premium 2337
## 10 E Ideal 3903
## # ... with 25 more rows
# then plot with geom_tile() and the fill aesthetic:
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = n))
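An alternative to counting first and then drawing tiles is geom_count(), which counts the observations for each combination itself and maps the count to point size. This is a minimal sketch, with `p_count` as an illustrative name.

```r
library(ggplot2)

# the size of each point reflects how many diamonds share that cut/color combination
p_count <- ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = color))

p_count
```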
# scatterplot of price against carat
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price))
# use transparency to reduce overplotting
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price), alpha = 1 / 10)
# color the points by cut
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price, colour = cut), alpha = 1)
So far we have looked at distributions using histograms and box plots. For two continuous variables, we can explore their covariation and study the relationship between them.
Let’s look at one example with the diamonds dataset. It will be interesting to see if there is any relationship between the diamond price and the carat. Here we use the lm() function from base R to build a linear model, with the log of diamond price as the response variable and the log of carat as the predictor.
mod <- lm(log(price) ~ log(carat), data = diamonds)
# summary table for linear model
summary(mod)
##
## Call:
## lm(formula = log(price) ~ log(carat), data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.50833 -0.16951 -0.00591 0.16637 1.33793
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.448661 0.001365 6190.9 <2e-16 ***
## log(carat) 1.675817 0.001934 866.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2627 on 53938 degrees of freedom
## Multiple R-squared: 0.933, Adjusted R-squared: 0.933
## F-statistic: 7.51e+05 on 1 and 53938 DF, p-value: < 2.2e-16
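Because both variables are log-transformed, the slope can be read as an elasticity: a 1% increase in carat is associated with roughly a 1.68% increase in price. As a hedged sketch, the fitted model can also be turned back into prices on the original scale. The carat values and the name `new_diamonds` below are hypothetical, chosen only for illustration. Exponentiating a log-scale prediction gives an approximate, median-type price rather than an exact mean.

```r
library(ggplot2)  # provides the diamonds dataset

# same model as in the text
mod <- lm(log(price) ~ log(carat), data = diamonds)

# intercept and slope on the log scale
coef(mod)

# hypothetical carat values to predict for
new_diamonds <- data.frame(carat = c(0.5, 1, 2))

# predict on the log scale, then back-transform to a price
new_diamonds$predicted_price <- exp(predict(mod, newdata = new_diamonds))

new_diamonds
```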
par(mfrow = c(2, 2)) # arrange the four diagnostic plots in a 2 x 2 grid on a single page
plot(mod)
Outliers are unusual observations: data points that don’t seem to fit the pattern. Use the diamonds dataset to visualize such unusual data points. Hint: try a small binwidth.
ggplot(data = diamonds, mapping = aes(x = carat)) +
geom_histogram(binwidth = 0.01)
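Bins that hold only a handful of diamonds are hard to see next to the tall ones. One possible approach, sketched below, is to zoom in on the y-axis with coord_cartesian() and then list the suspicious rows. The 3-carat cutoff and the names `p_zoom` and `unusual` are assumptions made for illustration, not a rule for what counts as an outlier.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# zoom in on the y-axis so that bins with very few diamonds become visible
p_zoom <- ggplot(data = diamonds, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.01) +
  coord_cartesian(ylim = c(0, 50))

p_zoom

# list the unusually heavy diamonds (the cutoff of 3 carats is an assumption)
unusual <- diamonds %>%
  filter(carat >= 3) %>%
  select(carat, cut, price) %>%
  arrange(desc(carat))

unusual
```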