Deterministically anonymize identifiers and dates

Code snippets to deterministically anonymize identifiers and dates, while maintaining intra-individual time series relationships, to facilitate sharing examples of clinical dashboards.

Sharing examples of clinical dashboards is difficult because of the presence of protected health information. Anonymizing the information leads to potential challenges as well:

converting identifiers (like names or medical record numbers) to a random string of characters is not easily human readable
converting dates randomly loses intra-individual time series relationships, like the trend of a lab value or weight over time
a deterministic, rather than fully random, method of anonymizing identifiers and dates, is useful if new data for individuals is acquired and needs to be linked to previously anonymized data

Below are a pair of functions that deterministically anonymize identifiers and dates.

Convert an identifier into a human-readable name

The identifier is concatenated to a salt, hashed, converted into two integers, and indexed into the babynames database to yield a human readable name. If the salt is generated randomly (and not saved), this essentially results in full randomization of the identifier. But if the salt (and identifier) are known, then the conversion into names can be reproduced.

#' Deterministically anonymize an identifier into a name
#'
#' Deterministically convert an identifier into a concatenation of two names from the babynames database.
#' NOT guaranteed to be unique, but there should be 9.47 billion unique combinations as of 2022-10.
#'
#' @param id vector of hashable identifier elements
#' @param salt `salt` is concatenated to `id` prior to hashing
#' @export
random_names <- function(id, salt = "") {
  # for each `id`, return a deterministically "random" concatenation of two names from the babynames database
  library(babynames)
  namelist <- unique(babynames$name)

  hashes <- vapply(
    X = paste0(id, salt[1]), # apply salt to the inputs
    FUN = rlang::hash, # rlang::hash() returns 32-hex digits
    FUN.VALUE = "chr",
    USE.NAMES = FALSE
  )

  # take the first and second 10 digits of the hex and convert to a decimal integer
  # use as indices into the unique babynames (use `%%` modulo to generate proper range)
  # return the concatenated pair of names
  return(
    paste(
      namelist[as.numeric(paste0("0x", substr(hashes,  1, 10))) %% length(namelist) + 1],
      namelist[as.numeric(paste0("0x", substr(hashes, 11, 20))) %% length(namelist) + 1]
    )
  )
}

random_names(c("Joe", "Hadley", "Garrett", "Yihui", "Tibshirani"), salt = "secret_salt") # known salt

[1] "Blaza Navraj"      "Kreigh Kimyra"     "Charlita Delishia"
[4] "Rashana Chantina"  "Kentavia Dajonna"

random_names(c("Joe", "Hadley", "Garrett", "Yihui", "Tibshirani"), salt = runif(1)) # random salt

[1] "Shanice Daymien"   "Achilles Adlene"   "Yanal Mizell"     
[4] "Chakakhan Ziaire"  "Hortencia Shaleia"

random_names(c("Joe", "Hadley", "Garrett", "Yihui", "Tibshirani"), salt = runif(1)) # random salt

[1] "Matheson Davyne"   "Twanna Sheddrick"  "Kindel Mckinleigh"
[4] "Buzz Deem"         "Athene Almalik"

random_names(c("Joe", "Hadley", "Garrett", "Yihui", "Tibshirani"), salt = "secret_salt") # reproduce with known salt

[1] "Blaza Navraj"      "Kreigh Kimyra"     "Charlita Delishia"
[4] "Rashana Chantina"  "Kentavia Dajonna"

Shift a POSIXct date-time to within a target year

An identifier is concatenated to a salt, hashed, and converted into an integer, which is used to select a random minute within the target year. For any given individual, a baseline POSIXct date-time (e.g., date of birth or date of admission) is used to calculate the amount of “shift” that should be applied to all date-times for that individual, which maintains relative times for that individual.

#' Deterministically shift POSIXct to within a given year
#'
#' Deterministically return a shift (in seconds) of a POSIXct vector to within a given year, to
#' anonymize. The shift is determined by a hash of the id and salt. If the same shift is applied
#' to all POSIXct for an individual, relative times can be preserved, for time series analysis.
#'
#' @param id Shift generated by hash of `id` + `salt`
#' @param salt `salt` is concatenated to `id` prior to hashing
#' @param year Target year to randomize the dates to
#' @export
random_timeshift <- function(base_date, id, salt = "", year = 2000) {
  # deterministically return a shift (in seconds) to a `base_date` POSIXct column to shift to within a given year
  # - deterministic based on `id` + `salt`
  # - shifts to the minute (no seconds)
  # - vectorizes successfully across `year`
  start <- as.POSIXct(paste0(year, "-01-01"))
  end <- as.POSIXct(paste0(year+1, "-01-01"))
  seconds_in_year <- as.numeric(difftime(end, start), units = "secs") # will account for leap years and leap seconds
  minutes_in_year <- as.integer(seconds_in_year / 60)
  
  # map_chr because rlang::hash is NOT vectorized
  hash = map_chr(id, .f = function(x) substr(rlang::hash(paste0(x, salt)), 1, 8))
  hash_date <- start + 60 * as.numeric(paste0("0x", hash)) %% minutes_in_year
  shift_seconds <- as.numeric(difftime(hash_date, base_date), units = "secs")
  return(shift_seconds)
}

Example of use

library(tidyverse)

Generate a dataframe with identifiers and dates of birth:

df <- tibble(
    id = 1:8,
    dob = as.POSIXct("2022-01-01") + (id-1)*60*60*24 # sequential days
  )

knitr::kable(df)

id	dob
1	2022-01-01
2	2022-01-02
3	2022-01-03
4	2022-01-04
5	2022-01-05
6	2022-01-06
7	2022-01-07
8	2022-01-08

Then, anonymize that dataframe by converting the identifier into a random name and shifting the dates of birth “randomly”:

df %>% 
  mutate(
    name = random_names(id, salt = Sys.time()),
    shift = random_timeshift(dob, name, salt = "", year = 2000),
    dob_shift = dob + shift
  ) %>% 
  knitr::kable()

id	dob	name	shift	dob_shift
1	2022-01-01	Damiun Issabell	-679764420	2000-06-17 09:33:00
2	2022-01-02	Onesti Miyona	-665991540	2000-11-24 18:21:00
3	2022-01-03	Tristram Faatima	-665407200	2000-12-02 12:40:00
4	2022-01-04	Evodio Dejuane	-678727920	2000-07-02 09:28:00
5	2022-01-05	Yazin Desai	-665049480	2000-12-08 16:02:00
6	2022-01-06	Jessamarie Quillen	-664541520	2000-12-15 13:08:00
7	2022-01-07	Carra Raimy	-665548620	2000-12-04 21:23:00
8	2022-01-08	Marcandrew Kayelee	-676239960	2000-08-04 04:34:00

Example converting multiple date-time columns

If there were multiple time series columns associated with each id, they could all be shifted by the appropriate shift to maintain relative timing.

load("biliData.Rda")
enclist_anon <- enclist %>% # start with fully-identified dataset
  select(
    PatientID,      # select patient identifier
    contains("DTS") # select multiple date-time columns
  ) %>% 
  mutate(
    # convert all columns containing "DTS" from string to POSIXct
    across(contains("DTS"), function(x) { as.POSIXct(strptime(x, format = '%Y-%m-%d %H:%M:%S')) } )
  ) %>% 
  mutate(
    PatientID = random_names(PatientID, salt = runif(1)),            # anonymize identifiers
    .shift = random_timeshift(BirthDTS, PatientID, salt = runif(1)), # generate per-individual time shift
    across(contains("DTS"), function(x) {x + .shift} )               # apply time shifts
  ) %>% 
  select(-.shift)                                                    # remove the time shifts

The resulting dataset has now been anonymized, but the relative timing between the dates of birth, hospital admission, and hospital discharge are all maintained.

> summary(enclist_anon)
  PatientID            BirthDTS                   HospitalAdmitDTS              HospitalDischargeDTS         
 Length:23649       Min.   :2000-01-01 01:14:00   Min.   :2000-01-01 01:14:00   Min.   :2000-01-02 05:27:00  
 Class :character   1st Qu.:2000-04-03 00:10:00   1st Qu.:2000-04-17 01:18:30   1st Qu.:2000-04-20 02:58:30  
 Mode  :character   Median :2000-07-02 12:04:00   Median :2000-07-20 13:24:00   Median :2000-07-23 11:45:00  
                    Mean   :2000-07-02 02:26:36   Mean   :2000-07-21 18:41:51   Mean   :2000-07-24 22:54:20  
                    3rd Qu.:2000-09-30 21:23:00   3rd Qu.:2000-10-21 10:54:30   3rd Qu.:2000-10-25 00:41:30  
                    Max.   :2000-12-31 23:25:00   Max.   :2001-10-21 07:39:00   Max.   :2001-10-21 20:21:00  
                                                  NA's   :450                   NA's   :662

> sample_n(enclist_anon, 8) %>% arrange(BirthDTS)
         PatientID            BirthDTS    HospitalAdmitDTS HospitalDischargeDTS
1     Larod Dorita 2000-02-07 14:48:00 2000-02-07 14:48:00  2000-02-09 01:57:00
2  Indalecio Ninon 2000-03-01 02:04:00 2000-03-06 04:18:00  2000-03-06 12:23:00
3     Connal Abron 2000-04-19 13:53:00 2000-08-17 03:37:00  2000-08-17 05:03:00
4 Kaytlynne Lorren 2000-05-12 17:52:00 2000-05-12 17:52:00  2000-05-20 04:54:00
5  Turkessa Montse 2000-06-30 04:41:00 2000-06-30 04:41:00  2000-07-02 09:14:00
6    Venesta Sofie 2000-07-18 00:22:00 2000-07-18 00:22:00  2000-07-19 14:12:00
7    Edisson Teric 2000-07-24 21:42:00 2000-12-25 11:09:00  2000-12-25 13:58:00
8  Latonia Santrez 2000-10-17 09:27:00 2000-10-21 07:47:00  2000-10-21 20:51:00