Plot histogram with overlaid normal curve

Given a vector of values, create a ggplot histogram with an overlaid best-fitting normal curve and a prettified caption of summary statistics
R
Code snippet
Published

July 29, 2024

Another basic task that I’m tired of looking up how to perform, so I’m posting this for personal reference.

Task: given a vector of values, create a ggplot histogram with an overlaid best-fitting normal curve and an optional caption giving the mean, standard deviation, and sample size, with the numbers prettified.

library(tidyverse)

format_num <- function(n, digits = 3) {
  # Prettify numeric results -- no scientific notation, use significant digits
  formatC(signif(n, digits=digits), digits=digits, format="fg", flag="#")
}

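hist_normal <- function(values, binwidth = NA, caption = TRUE, num_sd = NA) {
  # values:   numeric vector
  # binwidth: bin width (NA = data range / 30)
  # caption:  if TRUE, add mean, sd and n as a caption
  # num_sd:   x-axis half-width in standard deviations (NA = no zoom)
  df <- data.frame(value = values)
  values_mean <- mean(df$value)
  values_sd   <- sd(df$value)
  # Default bin width splits the data range into 30 bins
  if (is.na(binwidth)) {binwidth <- (max(df$value) - min(df$value)) / 30}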
  
  g <- df %>%
    ggplot(aes(x = value)) +
    geom_histogram(
      aes(y = after_stat(density)),
      binwidth = binwidth,
      colour = "black", fill = "white"
    ) +
    # The histogram is on the density scale, so dnorm() can be overlaid directly
    stat_function(fun = dnorm, args = list(mean = values_mean, sd = values_sd))

  if (caption) {
    g <- g +
      labs(caption = paste0(
        "mean = ", format_num(values_mean),
        "; sd = ", format_num(values_sd),
        "; n = ", length(values)
      ))
  }
  
  if (!is.na(num_sd)) {
    g <- g + coord_cartesian(xlim = values_mean + values_sd * c(-num_sd, num_sd))
  }
  
  return(g)
}
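
The function above puts the histogram on the density scale so that dnorm() lines up with it. If counts on the y-axis are preferred, one option is to scale the curve instead: multiply the normal density by n times the bin width. The following is only a sketch of that idea, not part of the function above; the sample data, the seed, and the names scaled_normal and g_count are made up for illustration.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
values   <- rnorm(500, mean = 10, sd = 2)  # made-up sample data
binwidth <- 0.5
n        <- length(values)
values_mean <- mean(values)
values_sd   <- sd(values)

# Hypothetical helper: the normal density integrates to 1, so multiplying by
# n * binwidth rescales it to the expected count per bin
scaled_normal <- function(x) {
  dnorm(x, mean = values_mean, sd = values_sd) * n * binwidth
}

# Histogram left on the default count scale, curve rescaled to match
g_count <- ggplot(data.frame(value = values), aes(x = value)) +
  geom_histogram(binwidth = binwidth, colour = "black", fill = "white") +
  stat_function(fun = scaled_normal)
```

The shape of the plot is the same either way; only the y-axis changes from density to counts.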

Simple example: the default of around 30 bins is too many for n = 200 points, so the histogram tends to look ragged.

hist_normal(rnorm(200))
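
For small samples, one alternative to the fixed 30 bins is to derive the bin width from the data. The sketch below uses the Freedman-Diaconis rule (bin width = 2 * IQR / n^(1/3)), which is a swapped-in technique and not what the function above does by default. The helper name fd_binwidth is hypothetical, and the seed is arbitrary.

```r
# Hypothetical helper: Freedman-Diaconis bin width, 2 * IQR / n^(1/3)
fd_binwidth <- function(values) {
  2 * IQR(values) / length(values)^(1 / 3)
}

set.seed(42)  # arbitrary seed, only so the sketch is reproducible
x  <- rnorm(200)
bw <- fd_binwidth(x)

# Approximate number of bins this width gives over the data range
n_bins <- ceiling(diff(range(x)) / bw)
```

The result can be passed straight in, as in hist_normal(x, binwidth = fd_binwidth(x)). For 200 roughly normal points this should give noticeably fewer than 30 bins.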

Smoother example with more points and an explicit bin width, with the x-axis limited to 4 standard deviations on either side of the mean:

hist_normal(rnorm(5000, mean = 25, sd = 2.5), binwidth = 0.5, num_sd = 4)
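
To make the num_sd argument concrete, here is the same limit calculation done by hand. This sketch plugs in the population values from the call above (mean 25, sd 2.5) as an assumption; inside hist_normal the sample mean and sd are used, so the real limits will differ slightly.

```r
# Assumed values: the population parameters from the example call,
# standing in for the sample mean and sd that hist_normal computes
values_mean <- 25
values_sd   <- 2.5
num_sd      <- 4

# Same expression that hist_normal passes to coord_cartesian(xlim = ...)
x_limits <- values_mean + values_sd * c(-num_sd, num_sd)
x_limits
# [1] 15 35
```

Because the limits go through coord_cartesian(), the plot is zoomed rather than filtered: points outside the window still count towards the bins, the mean, and the sd.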