Chapter 2 Statistical Functions (for one variable)
Here are some basic statistical functions you should really know by heart. Most of them are obvious if you think about them. You should also know the formula or other calculation procedures for these.
These will be presented using the $ format and some commonly used sample data sets. The na.rm option is included because you are always safer using it.
2.1 These are from core R
That means the base and stats packages.
# Mean
mean(chickwts$weight, na.rm = TRUE)
## [1] 261.3099
# Variance (for a sample)
var(chickwts$weight, na.rm = TRUE)
## [1] 6095.503
# Variance (for a population)
# There is no simple function for this. The code below shows one possible way
# The length() function gives you the total rows and na.omit() ensures that you
# are not counting NAs, since they are not in the calculation of var().
var(chickwts$weight, na.rm = TRUE)*
(length(na.omit(chickwts$weight)) - 1)/
length(chickwts$weight)
## [1] 6009.65
# Standard deviation (for a sample)
sd(chickwts$weight, na.rm = TRUE)
## [1] 78.0737
# Note there is no simple conversion from sample standard deviation
# to population standard deviation.
# Calculate the population variance and take the square root. See the next
# section.
# Median
median(chickwts$weight, na.rm = TRUE)
## [1] 258
# Quantiles (you specify which ones you want)
quantile(chickwts$weight, probs = c(.25, .50, .75), na.rm = TRUE)
## 25% 50% 75%
## 204.5 258.0 323.5
# Minimum
min(chickwts$weight, na.rm = TRUE)
## [1] 108
# Maximum
max(chickwts$weight, na.rm = TRUE)
## [1] 423
# Table
# This is normally used for factor variables
table(chickwts$feed, useNA = "ifany")
##
## casein horsebean linseed meatmeal soybean sunflower
## 12 10 12 11 14 12
Also you should know that the summary() function can be used on almost object and will give you results depending on what that object is.
summary(chickwts)
## weight feed
## Min. :108.0 casein :12
## 1st Qu.:204.5 horsebean:10
## Median :258.0 linseed :12
## Mean :261.3 meatmeal :11
## 3rd Qu.:323.5 soybean :14
## Max. :423.0 sunflower:12
summary(chickwts$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 108.0 204.5 258.0 261.3 323.5 423.0
summary(chickwts$feed)
## casein horsebean linseed meatmeal soybean sunflower
## 12 10 12 11 14 12
2.2 These are from lehmansociology
Since that is a package you either need to load it with library() or use the notation we use here.
# MODE
# Notice that you must use all upper case. This is because there is
# already a function called mode in base r that does something
# completely unrelated.
lehmansociology::MODE(chickwts$feed)
## $dataframe
## [1] "soybean"
# Variance for a population
# This is similar to the VARP function in most spreadsheets
lehmansociology::varp(chickwts$weight)
## [1] 6009.65
# Standard deviation for a population
# This is similar to the STDEVP
lehmansociology::sdp(chickwts$weight)
## [1] 77.52194
# Frequency distribution. Read about options in ?frequency
lehmansociology::frequency(chickwts$feed)
## Values Freq Percent
## casein 12 16.9
## horsebean 10 14.1
## linseed 12 16.9
## meatmeal 11 15.5
## soybean 14 19.7
## sunflower 12 16.9
## Total 71 100
You can use an the cumulative.percent and/or cumulative.freq options to get cumulative results.
# Cumulative frequency table
lehmansociology::frequency(warpbreaks$breaks, cumulative.percent = TRUE)
## Values Freq Percent Cum. Percent
## 10 1 1.9 1.9
## 12 1 1.9 3.8
## 13 1 1.9 5.7
## 14 1 1.9 7.6
## 15 3 5.6 13.2
## 16 2 3.7 16.9
## 17 2 3.7 20.6
## 18 3 5.6 26.2
## 19 2 3.7 29.9
## 20 2 3.7 33.6
## 21 4 7.4 41
## 24 2 3.7 44.7
## 25 1 1.9 46.6
## 26 4 7.4 54
## 27 1 1.9 55.9
## 28 3 5.6 61.5
## 29 4 7.4 68.9
## 30 2 3.7 72.6
## 31 1 1.9 74.5
## 35 1 1.9 76.4
## 36 2 3.7 80.1
## 39 2 3.7 83.8
## 41 1 1.9 85.7
## 42 1 1.9 87.6
## 43 1 1.9 89.5
## 44 1 1.9 91.4
## 51 1 1.9 93.3
## 52 1 1.9 95.2
## 54 1 1.9 97.1
## 67 1 1.9 99
## 70 1 1.9 100.9
## Total 54 100.9
2.3 These are from the Janitor package
Since it is a package you must load it as here or use the :: notation. This is a different way to get frequency distributions.
For producing single variable frequencies use tabyl(). It is usually used with the %>%
or “pipe” operator from the maggritr package. There are no missing values in this data set, but the example below illustrates the option you would use to remove them from the graph. If you don’t include the option any missing values will show.
library(janitor)
## Warning: package 'janitor' was built under R version 3.4.4
library(magrittr)
chickwts %>% tabyl(feed, show_na=FALSE)
## feed n percent
## casein 12 0.1690141
## horsebean 10 0.1408451
## linseed 12 0.1690141
## meatmeal 11 0.1549296
## soybean 14 0.1971831
## sunflower 12 0.1690141
Janitor uses adorn_ functions to add extra details or formatting to tables.
chickwts %>% tabyl(feed) %>%
adorn_pct_formatting()
## feed n percent
## casein 12 16.9%
## horsebean 10 14.1%
## linseed 12 16.9%
## meatmeal 11 15.5%
## soybean 14 19.7%
## sunflower 12 16.9%
The lehmansociology package includes extra adorn_ functions for cumulative distributions.
library(lehmansociology)
chickwts %>% tabyl(weight) %>%
adorn_cumulative() %>%
adorn_cumulative_percentages()
## weight n percent cum freq cum pct
## 108 1 0.01408451 1 0.01408451
## 124 1 0.01408451 2 0.02816901
## 136 1 0.01408451 3 0.04225352
## 140 1 0.01408451 4 0.05633803
## 141 1 0.01408451 5 0.07042254
## 143 1 0.01408451 6 0.08450704
## 148 1 0.01408451 7 0.09859155
## 153 1 0.01408451 8 0.11267606
## 158 1 0.01408451 9 0.12676056
## 160 1 0.01408451 10 0.14084507
## 168 1 0.01408451 11 0.15492958
## 169 1 0.01408451 12 0.16901408
## 171 1 0.01408451 13 0.18309859
## 179 1 0.01408451 14 0.19718310
## 181 1 0.01408451 15 0.21126761
## 193 1 0.01408451 16 0.22535211
## 199 1 0.01408451 17 0.23943662
## 203 1 0.01408451 18 0.25352113
## 206 1 0.01408451 19 0.26760563
## 213 1 0.01408451 20 0.28169014
## 216 1 0.01408451 21 0.29577465
## 217 1 0.01408451 22 0.30985915
## 222 1 0.01408451 23 0.32394366
## 226 1 0.01408451 24 0.33802817
## 227 1 0.01408451 25 0.35211268
## 229 1 0.01408451 26 0.36619718
## 230 1 0.01408451 27 0.38028169
## 242 1 0.01408451 28 0.39436620
## 243 1 0.01408451 29 0.40845070
## 244 1 0.01408451 30 0.42253521
## 248 2 0.02816901 32 0.45070423
## 250 1 0.01408451 33 0.46478873
## 257 2 0.02816901 35 0.49295775
## 258 1 0.01408451 36 0.50704225
## 260 2 0.02816901 38 0.53521127
## 263 1 0.01408451 39 0.54929577
## 267 1 0.01408451 40 0.56338028
## 271 2 0.02816901 42 0.59154930
## 283 1 0.01408451 43 0.60563380
## 295 1 0.01408451 44 0.61971831
## 297 1 0.01408451 45 0.63380282
## 303 1 0.01408451 46 0.64788732
## 309 1 0.01408451 47 0.66197183
## 315 1 0.01408451 48 0.67605634
## 316 1 0.01408451 49 0.69014085
## 318 2 0.02816901 51 0.71830986
## 320 1 0.01408451 52 0.73239437
## 322 1 0.01408451 53 0.74647887
## 325 1 0.01408451 54 0.76056338
## 327 1 0.01408451 55 0.77464789
## 329 1 0.01408451 56 0.78873239
## 332 1 0.01408451 57 0.80281690
## 334 1 0.01408451 58 0.81690141
## 339 1 0.01408451 59 0.83098592
## 340 1 0.01408451 60 0.84507042
## 341 1 0.01408451 61 0.85915493
## 344 1 0.01408451 62 0.87323944
## 352 1 0.01408451 63 0.88732394
## 359 1 0.01408451 64 0.90140845
## 368 1 0.01408451 65 0.91549296
## 379 1 0.01408451 66 0.92957746
## 380 1 0.01408451 67 0.94366197
## 390 1 0.01408451 68 0.95774648
## 392 1 0.01408451 69 0.97183099
## 404 1 0.01408451 70 0.98591549
## 423 1 0.01408451 71 1.00000000
## Total 71 1.00000000 71 1.00000000
You can add adorn_cum_pct_formatting to put the proportions into percent format.
chickwts %>% tabyl(weight) %>%
adorn_pct_formatting() %>%
adorn_cumulative() %>%
adorn_cumulative_percentages() %>%
adorn_cum_pct_formatting()
## weight n percent cum freq cum pct
## 1 108 1 1.4% 1 1.408451%
## 2 124 1 1.4% 2 2.8%
## 3 136 1 1.4% 3 4.2%
## 4 140 1 1.4% 4 5.6%
## 5 141 1 1.4% 5 7.0%
## 6 143 1 1.4% 6 8.5%
## 7 148 1 1.4% 7 9.9%
## 8 153 1 1.4% 8 11.3%
## 9 158 1 1.4% 9 12.7%
## 10 160 1 1.4% 10 14.1%
## 11 168 1 1.4% 11 15.5%
## 12 169 1 1.4% 12 16.9%
## 13 171 1 1.4% 13 18.3%
## 14 179 1 1.4% 14 19.7%
## 15 181 1 1.4% 15 21.1%
## 16 193 1 1.4% 16 22.5%
## 17 199 1 1.4% 17 23.9%
## 18 203 1 1.4% 18 25.4%
## 19 206 1 1.4% 19 26.8%
## 20 213 1 1.4% 20 28.2%
## 21 216 1 1.4% 21 29.6%
## 22 217 1 1.4% 22 31.0%
## 23 222 1 1.4% 23 32.4%
## 24 226 1 1.4% 24 33.8%
## 25 227 1 1.4% 25 35.2%
## 26 229 1 1.4% 26 36.6%
## 27 230 1 1.4% 27 38.0%
## 28 242 1 1.4% 28 39.4%
## 29 243 1 1.4% 29 40.8%
## 30 244 1 1.4% 30 42.3%
## 31 248 2 2.8% 32 45.1%
## 32 250 1 1.4% 33 46.5%
## 33 257 2 2.8% 35 49.3%
## 34 258 1 1.4% 36 50.7%
## 35 260 2 2.8% 38 53.5%
## 36 263 1 1.4% 39 54.9%
## 37 267 1 1.4% 40 56.3%
## 38 271 2 2.8% 42 59.2%
## 39 283 1 1.4% 43 60.6%
## 40 295 1 1.4% 44 62.0%
## 41 297 1 1.4% 45 63.4%
## 42 303 1 1.4% 46 64.8%
## 43 309 1 1.4% 47 66.2%
## 44 315 1 1.4% 48 67.6%
## 45 316 1 1.4% 49 69.0%
## 46 318 2 2.8% 51 71.8%
## 47 320 1 1.4% 52 73.2%
## 48 322 1 1.4% 53 74.6%
## 49 325 1 1.4% 54 76.1%
## 50 327 1 1.4% 55 77.5%
## 51 329 1 1.4% 56 78.9%
## 52 332 1 1.4% 57 80.3%
## 53 334 1 1.4% 58 81.7%
## 54 339 1 1.4% 59 83.1%
## 55 340 1 1.4% 60 84.5%
## 56 341 1 1.4% 61 85.9%
## 57 344 1 1.4% 62 87.3%
## 58 352 1 1.4% 63 88.7%
## 59 359 1 1.4% 64 90.1%
## 60 368 1 1.4% 65 91.5%
## 61 379 1 1.4% 66 93.0%
## 62 380 1 1.4% 67 94.4%
## 63 390 1 1.4% 68 95.8%
## 64 392 1 1.4% 69 97.2%
## 65 404 1 1.4% 70 98.6%
## 66 423 1 1.4% 71 100.0%
## 67 Total 71 - 71 100.0%
Piping to the knitr::kable()
function will create a more polished table, In additon if knit to PDF kable()
can add a title using the caption=
option.
chickwts %>% tabyl(weight) %>%
adorn_pct_formatting() %>%
adorn_cumulative() %>%
adorn_cumulative_percentages() %>%
adorn_cum_pct_formatting() %>%
knitr::kable(caption = "Weights of chicks")
weight | n | percent | cum freq | cum pct |
---|---|---|---|---|
108 | 1 | 1.4% | 1 | 1.408451% |
124 | 1 | 1.4% | 2 | 2.8% |
136 | 1 | 1.4% | 3 | 4.2% |
140 | 1 | 1.4% | 4 | 5.6% |
141 | 1 | 1.4% | 5 | 7.0% |
143 | 1 | 1.4% | 6 | 8.5% |
148 | 1 | 1.4% | 7 | 9.9% |
153 | 1 | 1.4% | 8 | 11.3% |
158 | 1 | 1.4% | 9 | 12.7% |
160 | 1 | 1.4% | 10 | 14.1% |
168 | 1 | 1.4% | 11 | 15.5% |
169 | 1 | 1.4% | 12 | 16.9% |
171 | 1 | 1.4% | 13 | 18.3% |
179 | 1 | 1.4% | 14 | 19.7% |
181 | 1 | 1.4% | 15 | 21.1% |
193 | 1 | 1.4% | 16 | 22.5% |
199 | 1 | 1.4% | 17 | 23.9% |
203 | 1 | 1.4% | 18 | 25.4% |
206 | 1 | 1.4% | 19 | 26.8% |
213 | 1 | 1.4% | 20 | 28.2% |
216 | 1 | 1.4% | 21 | 29.6% |
217 | 1 | 1.4% | 22 | 31.0% |
222 | 1 | 1.4% | 23 | 32.4% |
226 | 1 | 1.4% | 24 | 33.8% |
227 | 1 | 1.4% | 25 | 35.2% |
229 | 1 | 1.4% | 26 | 36.6% |
230 | 1 | 1.4% | 27 | 38.0% |
242 | 1 | 1.4% | 28 | 39.4% |
243 | 1 | 1.4% | 29 | 40.8% |
244 | 1 | 1.4% | 30 | 42.3% |
248 | 2 | 2.8% | 32 | 45.1% |
250 | 1 | 1.4% | 33 | 46.5% |
257 | 2 | 2.8% | 35 | 49.3% |
258 | 1 | 1.4% | 36 | 50.7% |
260 | 2 | 2.8% | 38 | 53.5% |
263 | 1 | 1.4% | 39 | 54.9% |
267 | 1 | 1.4% | 40 | 56.3% |
271 | 2 | 2.8% | 42 | 59.2% |
283 | 1 | 1.4% | 43 | 60.6% |
295 | 1 | 1.4% | 44 | 62.0% |
297 | 1 | 1.4% | 45 | 63.4% |
303 | 1 | 1.4% | 46 | 64.8% |
309 | 1 | 1.4% | 47 | 66.2% |
315 | 1 | 1.4% | 48 | 67.6% |
316 | 1 | 1.4% | 49 | 69.0% |
318 | 2 | 2.8% | 51 | 71.8% |
320 | 1 | 1.4% | 52 | 73.2% |
322 | 1 | 1.4% | 53 | 74.6% |
325 | 1 | 1.4% | 54 | 76.1% |
327 | 1 | 1.4% | 55 | 77.5% |
329 | 1 | 1.4% | 56 | 78.9% |
332 | 1 | 1.4% | 57 | 80.3% |
334 | 1 | 1.4% | 58 | 81.7% |
339 | 1 | 1.4% | 59 | 83.1% |
340 | 1 | 1.4% | 60 | 84.5% |
341 | 1 | 1.4% | 61 | 85.9% |
344 | 1 | 1.4% | 62 | 87.3% |
352 | 1 | 1.4% | 63 | 88.7% |
359 | 1 | 1.4% | 64 | 90.1% |
368 | 1 | 1.4% | 65 | 91.5% |
379 | 1 | 1.4% | 66 | 93.0% |
380 | 1 | 1.4% | 67 | 94.4% |
390 | 1 | 1.4% | 68 | 95.8% |
392 | 1 | 1.4% | 69 | 97.2% |
404 | 1 | 1.4% | 70 | 98.6% |
423 | 1 | 1.4% | 71 | 100.0% |
Total | 71 | - | 71 | 100.0% |
Janitor is also used to create cross tabs (see Chapter 5).
2.4 From the RMisc package (Confidence intervals)
You can use the CI()
function from the Rmisc package to obtain confidence intervals.
library(Rmisc)
CI(chickwts$weight, ci = 0.95)
## upper mean lower
## 279.7896 261.3099 242.8301
2.5 From the resample package (bootstrap confidence intervals)
Obtaining bootstrap confidence intervals for a single mean would build on the bootstrap()
function from the resample library. It takes two steps. First obtaining a large number of samples from your sample. You also have to specify what statistic you want to estimate the confidence interval for. This can be anything (mean()
, median()
, sd()
or even a function you create). You can also specify the number of replications; with today’s computer power there is no reason to use a low number unless you have a huge (100,000s) sample size.
Save the result in an object and then use the CI.bca()
function to obtain the level of confidence interval you want. There are also other CI methods available.
library(resample)
bootstrap_results <- bootstrap(chickwts, mean(weight, na.rm = TRUE), R = 10000)
CI.bca(bootstrap_results, probs = c(0.025, 0.975))
## 2.5% 97.5%
## mean(weight, na.rm = TRUE) 242.7296 279.6218
2.6 Conclusion
This chapter presents just a handful of R functions and their many options As you gain experience using R you should read the help files, vignettes and online resources to learn more.