Another basic task that I’m tired of looking up how to perform, so I’m posting this for personal reference.
Task: given a vector of values, create a ggplot histogram with an overlaid best-fitting normal curve and an optional caption giving the mean, standard deviation, and sample size in prettified form.
library(tidyverse)
format_num <- function(n, digits = 3) {
  # Prettify numeric results -- no scientific notation, use significant digits
  formatC(signif(n, digits = digits), digits = digits, format = "fg", flag = "#")
}
hist_normal <- function(values, binwidth = NA, caption = TRUE, num_sd = NA) {
  # values is a vector of numbers
  df <- data.frame(value = values)
  values_mean <- mean(df$value)
  values_sd <- sd(df$value)
  # Default to roughly 30 bins across the range of the data
  if (is.na(binwidth)) {binwidth <- abs((max(df$value) - min(df$value)) / 30)}
  g <- df %>%
    ggplot(aes(x = value)) +
    geom_histogram(
      aes(y = after_stat(density)),
      binwidth = binwidth,
      colour = "black", fill = "white"
    ) +
    stat_function(fun = dnorm, args = list(mean = values_mean, sd = values_sd))
  if (caption) {
    g <- g +
      labs(caption = paste0(
        "mean = ", format_num(values_mean),
        "; sd = ", format_num(values_sd),
        "; n = ", length(values)
      ))
  }
  if (!is.na(num_sd)) {
    # Limit the x-axis to num_sd standard deviations either side of the mean
    g <- g + coord_cartesian(xlim = values_mean + values_sd * c(-num_sd, num_sd))
  }
  return(g)
}
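The histogram is drawn on the density scale (`after_stat(density)`) so that its bars share a y-axis with the `dnorm` curve. As a rough illustration of what that scale means, here is a minimal sketch using base R's `hist()` rather than ggplot2's internals; the seed and sample are arbitrary choices for the example, not part of the function above.

```r
# Illustration only: base R hist() stands in for the ggplot2 density scaling
set.seed(42)
x <- rnorm(1000, mean = 25, sd = 2.5)

h <- hist(x, breaks = 30, plot = FALSE)

# On the density scale, bar areas (height * width) sum to 1,
# which is the same total area as a normal density curve
total_area <- sum(h$density * diff(h$breaks))
total_area
```

Plotting raw counts instead would put the bars on a different scale from the curve, so the overlay would not line up.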
Simple example: the default of around 30 bins will be too many for n = 200 points.
hist_normal(rnorm(200))
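When the default is too fine, one option is to derive a bin width from the data instead of guessing. The sketch below uses the Freedman-Diaconis rule (2 * IQR / n^(1/3)), which is a swapped-in technique and not something the function above does. The helper name `fd_binwidth` is hypothetical.

```r
# Hypothetical helper (an assumption, not part of the original function):
# Freedman-Diaconis bin width, 2 * IQR / n^(1/3)
fd_binwidth <- function(values) {
  values <- values[!is.na(values)]  # ignore missing values
  2 * IQR(values) / length(values)^(1/3)
}

set.seed(1)
x <- rnorm(200)

default_bw <- (max(x) - min(x)) / 30  # the range / 30 default
fd_bw <- fd_binwidth(x)

# Compare the two bin widths side by side
c(default = default_bw, freedman_diaconis = fd_bw)
```

The result can then be passed along as `hist_normal(x, binwidth = fd_binwidth(x))`. For small samples this rule usually gives wider, and therefore fewer, bins than the range / 30 default.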
Smoother example with more points and an explicit bin width, limiting the x-axis to 4 standard deviations either side of the mean:
hist_normal(rnorm(5000, mean = 25, sd = 2.5), binwidth = 0.5, num_sd = 4)