Splitting a delimited text column – Incidental Findings

Code to convert a dataframe text column with delimited text substrings (e.g., CSV) into multiple Boolean (or numeric) columns specifying presence or absence of that substring. Requires an identifier column and a delimited text column.

One use case is data collected from multiple checkboxes, but with the results of all the checked boxes concatenated into a delimited list.

Example

Create a dataframe with 2 columns and 4 rows. Each row has a column with a unique ID and another with a CSV list of fruit.

(I intentionally put some white space before and after the comma delimiter to make sure it can be properly handled.)

library(dplyr)
library(tidyr)

input_df <- data.frame(
    id = c(1, 2, 3, 4),
    text = c("apple, banana", "apple", "banana ,orange", "pear")
  )

input_df %>% knitr::kable()

id	text
1	apple, banana
2	apple
3	banana ,orange
4	pear

The following code snippet

separates the comma-delimited substrings, putting each on its own row (ie, “long” format)
strips whitespace that might have been before or after delimeters
removes duplicates
consolidates the multiple rows for each ID back to a “wide” format, with one column for each substring (with an optional prefix)

delimeter <- ","
prefix <- ""

output_df <- input_df %>%
  separate_rows(text, sep = delimeter) %>% # `text` is the delimited string column
  mutate(text = trimws(text, which = "both")) %>% # remove white space around delimiter
  distinct() %>%  # remove any repeat substrings
  pivot_wider(
    names_from = text,
    names_prefix = prefix,
    values_from = text,
    values_fn = ~ as.numeric(!is.na(.x)), # output as 0/1 numeric
    values_fill = 0
    # values_fn = ~ !is.na(.x), # output as Boolean
    # values_fill = FALSE
  )

Here’s how the output looks, with the original text column added back for comparison.

input_df %>% 
  left_join(output_df, by = "id") %>%
  knitr::kable()

id	text	apple	banana	orange	pear
1	apple, banana	1	1	0	0
2	apple	1	0	0	0
3	banana ,orange	0	1	1	0
4	pear	0	0	0	1