Splitting a delimited text column

Convert a delimited text column of substrings in a dataframe into multiple columns specifying presence or absence of that substring
R
Code snippet
Published

August 30, 2023

Code to convert a dataframe text column with delimited text substrings (e.g., CSV) into multiple Boolean (or numeric) columns specifying presence or absence of that substring. Requires an identifier column and a delimited text column.

One use case is data collected from multiple checkboxes, but with the results of all the checked boxes concatenated into a delimited list.

Example

Create a dataframe with 2 columns and 4 rows. Each row has a column with a unique ID and another with a CSV list of fruit.

(I intentionally put some white space before and after the comma delimiter to make sure it can be properly handled.)

library(dplyr)
library(tidyr)

input_df <- data.frame(
    id = c(1, 2, 3, 4),
    text = c("apple, banana", "apple", "banana ,orange", "pear")
  )

input_df %>% knitr::kable()
id text
1 apple, banana
2 apple
3 banana ,orange
4 pear

The following code snippet

  • separates the comma-delimited substrings, putting each on its own row (ie, “long” format)
  • strips whitespace that might have been before or after delimeters
  • removes duplicates
  • consolidates the multiple rows for each ID back to a “wide” format, with one column for each substring (with an optional prefix)
delimeter <- ","
prefix <- ""

output_df <- input_df %>%
  separate_rows(text, sep = delimeter) %>% # `text` is the delimited string column
  mutate(text = trimws(text, which = "both")) %>% # remove white space around delimiter
  distinct() %>%  # remove any repeat substrings
  pivot_wider(
    names_from = text,
    names_prefix = prefix,
    values_from = text,
    values_fn = ~ as.numeric(!is.na(.x)), # output as 0/1 numeric
    values_fill = 0
    # values_fn = ~ !is.na(.x), # output as Boolean
    # values_fill = FALSE
  )

Here’s how the output looks, with the original text column added back for comparison.

input_df %>% 
  left_join(output_df, by = "id") %>%
  knitr::kable()
id text apple banana orange pear
1 apple, banana 1 1 0 0
2 apple 1 0 0 0
3 banana ,orange 0 1 1 0
4 pear 0 0 0 1