Change a column from birth date to age in R

Posted 2020-02-05 03:25

Question:

I am using data.table for the first time.

I have a column of about 400,000 birth dates in my table. I need to convert them to ages.

What is the best way to do this?

Answer 1:

From the comments of this blog entry, I found the age_calc function in the eeptools package. It takes care of edge cases (leap years, etc.), checks inputs and looks quite robust.

library(eeptools)
x <- as.Date(c("2011-01-01", "1996-02-29"))
# enddate defaults to Sys.Date(), so the output depends on the day you run it
age_calc(x) # default is age in months

[1] 46.73333 224.83118

age_calc(x, units = "years") # but you can set it to years

[1] 3.893151 18.731507

floor(age_calc(x, units = "years"))

[1] 3 18

For your data

yourdata$age <- floor(age_calc(yourdata$birthdate, units = "years"))

assuming you want age in integer years.
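Since the question mentions data.table, here is a minimal sketch of the same call used with := so the column is added by reference. The table name, column name, and the fixed end date are assumptions made for illustration; drop enddate to use today's date.

```r
library(data.table)
library(eeptools)

# Hypothetical table and column names, standing in for your own data
yourdata <- data.table(birthdate = as.Date(c("2011-01-01", "1996-02-29")))

# Fixed end date (an assumption) so the result does not change from day to day
yourdata[, age := floor(age_calc(birthdate,
                                 enddate = as.Date("2014-11-22"),
                                 units = "years"))]
yourdata
```

age_calc() is vectorized over dob, so the whole column is handled in one call.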



Answer 2:

I've been thinking about this and have been dissatisfied with the two answers so far. I like using lubridate, as @KFB did, but I also want things wrapped up nicely in a function, as in my answer using the eeptools package. So here's a wrapper function using the lubridate interval method with some nice options:

library(lubridate)

#' Calculate age
#' 
#' By default, calculates the typical "age in years", with a
#' \code{floor} applied so that you are, e.g., 5 years old from
#' 5th birthday through the day before your 6th birthday. Set
#' \code{floor = FALSE} to return decimal ages, and change \code{units}
#' for units other than years.
#' @param dob date-of-birth, the day to start calculating age.
#' @param age.day the date on which age is to be calculated.
#' @param units unit to measure age in. Defaults to \code{"years"}. Passed to \code{\link{duration}}.
#' @param floor boolean for whether or not to floor the result. Defaults to \code{TRUE}.
#' @return Age in \code{units}. Will be an integer if \code{floor = TRUE}.
#' @examples
#' my.dob <- as.Date('1983-10-20')
#' age(my.dob)
#' age(my.dob, units = "minutes")
#' age(my.dob, floor = FALSE)
age <- function(dob, age.day = today(), units = "years", floor = TRUE) {
    calc.age = interval(dob, age.day) / duration(num = 1, units = units)
    if (floor) return(as.integer(floor(calc.age)))
    return(calc.age)
}

Usage examples:

> my.dob <- as.Date('1983-10-20')

> age(my.dob)
[1] 31

> age(my.dob, floor = FALSE)
[1] 31.15616

> age(my.dob, units = "minutes")
[1] 16375680

> age(seq(my.dob, length.out = 6, by = "years"))
[1] 31 30 29 28 27 26
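One caveat with dividing by a duration: a duration year has a fixed length, so the result can land a hair below a whole number on an actual birthday. As a hedged alternative sketch, lubridate's time_length() measures an interval in calendar years. The helper name age_calendar is hypothetical and not part of the answer above.

```r
library(lubridate)

# Hypothetical helper (not part of the answer above): whole years of age,
# counted in calendar years rather than fixed-length durations.
age_calendar <- function(dob, age.day = today()) {
  as.integer(floor(time_length(interval(dob, age.day), unit = "years")))
}

dob <- as.Date("2000-03-01")
age_calendar(dob, as.Date("2001-02-28"))  # day before the first birthday
age_calendar(dob, as.Date("2001-03-01"))  # the first birthday itself
```

The two calls should return 0 and 1 respectively, which is the usual "age in years" convention.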


Answer 3:

Assuming you have a data.table, you could do the following:

library(data.table)
library(lubridate)
# toy data
X = data.table(birth=seq(from=as.Date("1970-01-01"), to=as.Date("1980-12-31"), by="year"))
Sys.Date()

Option 1: use "as.period" from the lubridate package

X[, age := as.period(Sys.Date() - birth)][]
         birth                   age
 1: 1970-01-01  44y 0m 327d 0H 0M 0S
 2: 1971-01-01  43y 0m 327d 6H 0M 0S
 3: 1972-01-01 42y 0m 327d 12H 0M 0S
 4: 1973-01-01 41y 0m 326d 18H 0M 0S
 5: 1974-01-01  40y 0m 327d 0H 0M 0S
 6: 1975-01-01  39y 0m 327d 6H 0M 0S
 7: 1976-01-01 38y 0m 327d 12H 0M 0S
 8: 1977-01-01 37y 0m 326d 18H 0M 0S
 9: 1978-01-01  36y 0m 327d 0H 0M 0S
10: 1979-01-01  35y 0m 327d 6H 0M 0S
11: 1980-01-01 34y 0m 327d 12H 0M 0S

Option 2: if you do not like the format of Option 1, you could do the following:

yr = duration(num = 1, units = "years")
# interval() replaces new_interval(), which newer lubridate versions no longer provide
X[, age := interval(birth, Sys.Date())/yr][]
# you get
         birth      age
 1: 1970-01-01 44.92603
 2: 1971-01-01 43.92603
 3: 1972-01-01 42.92603
 4: 1973-01-01 41.92329
 5: 1974-01-01 40.92329
 6: 1975-01-01 39.92329
 7: 1976-01-01 38.92329
 8: 1977-01-01 37.92055
 9: 1978-01-01 36.92055
10: 1979-01-01 35.92055
11: 1980-01-01 34.92055

I believe Option 2 is the more desirable of the two.
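If whole years are wanted rather than decimals, Option 2 can be truncated in the same step. This is a minimal sketch that uses interval() and a fixed reference date (an assumption, so the output is reproducible) in place of Sys.Date().

```r
library(data.table)
library(lubridate)

# Fixed reference date (an assumption, chosen so the result is reproducible)
ref_date <- as.Date("2014-11-24")

X <- data.table(birth = seq(from = as.Date("1970-01-01"),
                            to   = as.Date("1980-12-31"), by = "year"))
yr <- duration(num = 1, units = "years")

# Decimal age as in Option 2, then truncated to whole years
X[, age := as.integer(floor(interval(birth, ref_date) / yr))]
X
```

With this reference date the age column runs from 44 down to 34.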



Answer 4:

I prefer to do this using the lubridate package, borrowing syntax I originally encountered in another post.

It's necessary to standardize your input dates in terms of R date objects, preferably with the lubridate::mdy() or lubridate::ymd() or similar functions, as applicable. You can use the interval() function to generate an interval describing the time elapsed between the two dates, and then use the duration() function to define how this interval should be "diced".

I've summarized the simplest case for calculating an age from two dates below, using the most current syntax in R.

df$DOB <- mdy(df$DOB)
df$EndDate <- mdy(df$EndDate)
df$Calc_Age <- interval(start = df$DOB, end = df$EndDate) /
                     duration(num = 1, units = "years")

Age may be rounded down to the nearest complete integer using the base R floor() function, like so:

df$Calc_AgeF <- floor(df$Calc_Age)

Alternatively, the digits= argument of the base R round() function can be used to round up or down to a specified number of decimals, like so:

df$Calc_Age2 <- round(df$Calc_Age, digits = 2) ## 2 decimals
df$Calc_Age0 <- round(df$Calc_Age, digits = 0) ## nearest integer

It's worth noting that once the input dates are passed through the calculation step described above (i.e., the interval() and duration() functions), the returned value is numeric and no longer a date object in R. This matters because lubridate::floor_date() works strictly on date-time objects, so it cannot be applied to the result; use floor() or round() instead.

The above syntax works regardless of whether the input dates are stored in a data.table or a data.frame object.
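Put together, the steps above might look like the following sketch. The data frame contents are made up for illustration, and the dates are assumed to arrive as month/day/year strings, which is why mdy() is used.

```r
library(lubridate)

# Made-up example data; dates assumed to be "mm/dd/yyyy" strings
df <- data.frame(DOB     = c("10/20/1983", "01/15/2000"),
                 EndDate = c("11/24/2014", "11/24/2014"),
                 stringsAsFactors = FALSE)

# Standardize the inputs as Date objects
df$DOB     <- mdy(df$DOB)
df$EndDate <- mdy(df$EndDate)

# Decimal age in years, then rounded down to complete years
df$Calc_Age  <- interval(start = df$DOB, end = df$EndDate) /
                  duration(num = 1, units = "years")
df$Calc_AgeF <- floor(df$Calc_Age)
df
```

Here Calc_Age is a plain numeric column, and Calc_AgeF comes out as 31 and 14.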



Answer 5:

I wasn't happy with any of the responses when it comes to calculating the age in months or years, when dealing with leap years, so this is my function using the lubridate package.

Basically, it slices the interval between from and to into (up to) yearly chunks, and then adjusts the interval for whether that chunk is leap year or not. The total interval is the sum of the age of each chunk.

library(lubridate)

#' Get Age of Date relative to Another Date
#'
#' @param from,to the date or dates to consider
#' @param units the units to consider
#' @param floor logical as to whether to floor the result
#' @param simple logical as to whether to do a simple calculation, a simple calculation doesn't account for leap year.
#' @author Nicholas Hamilton
#' @export
age <- function(from, to = today(), units = "years", floor = FALSE, simple = FALSE) {

  #Account for Leap Year if Working in Months and Years
  if(!simple && length(grep("^(month|year)",units)) > 0){
    df = data.frame(from,to)
    calc = sapply(1:nrow(df),function(r){

      #Start and Finish Points
      st = df[r,1]; fn = df[r,2]

      #If there is no difference, age is zero
      if(st == fn){ return(0) }

      #If there is a difference, age is not zero and needs to be calculated
      sign = +1 #Age Direction
      if(st > fn){ tmp = st; st = fn; fn = tmp; sign = -1 } #Swap and Change sign

      #Determine the slice-points
      mid   = ceiling_date(seq(st,fn,by='year'),'year')

      #Build the sequence
      dates = unique( c(st,mid,fn) )
      dates = dates[which(dates >= st & dates <= fn)]

      #Determine the age of the chunks
      chunks = sapply(head(seq_along(dates),-1),function(ix){
        k = 365/( 365 + leap_year(dates[ix]) )
        k*interval( dates[ix], dates[ix+1] ) / duration(num = 1, units = units)
      })

      #Sum the Chunks, and account for direction
      sign*sum(chunks)
    })

  #If Simple Calculation or Not Months or Not years
  }else{
    calc = interval(from,to) / duration(num = 1, units = units)
  }

  if (floor) calc = as.integer(floor(calc))
  calc
}
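To make the chunking idea easier to see, here is a small base R sketch of the same principle, not the function above: split the span at each January 1st and weight each piece by the length of its own calendar year. The name year_fraction is hypothetical, and the sketch only handles a single pair of dates with from <= to.

```r
# Hypothetical helper: fractional years between two dates, weighting each
# calendar-year chunk by that year's own length (365 or 366 days).
year_fraction <- function(from, to) {
  stopifnot(from <= to)
  yrs <- as.integer(format(from, "%Y")):as.integer(format(to, "%Y"))
  total <- 0
  for (y in yrs) {
    y_start <- as.Date(sprintf("%d-01-01", y))
    y_end   <- as.Date(sprintf("%d-01-01", y + 1))
    lo <- max(from, y_start)   # part of the span that falls in year y
    hi <- min(to, y_end)
    total <- total + as.numeric(hi - lo) / as.numeric(y_end - y_start)
  }
  total
}

year_fraction(as.Date("2000-01-01"), as.Date("2001-01-01"))  # a full leap year: 1
year_fraction(as.Date("2000-07-01"), as.Date("2001-07-01"))
# 184/366 + 181/365, about 0.9986
```

The second call shows why the result need not be exactly 1 for the same calendar day a year later: the two halves are measured against years of different lengths.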


Answer 6:

(Sys.Date() - yourDate) / 365.25
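A quick worked example of this approximation, with made-up dates: it is fine for decimal ages, but because 365.25 is an average year length, flooring the result can be off by one on or near a birthday.

```r
dob <- as.Date("2000-03-01")
ref <- as.Date("2001-03-01")   # first birthday

# Subtracting two Dates gives a difftime in days; as.numeric() drops the class
age_approx <- as.numeric(ref - dob) / 365.25
age_approx         # just under 1, because this particular year has 365 days
floor(age_approx)  # 0, although the first birthday has been reached
```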


Answer 7:

I wanted an implementation that didn't increase my dependencies beyond data.table, which is usually my only dependency. data.table supplies the year(), month(), and mday() helpers used below; mday() returns the day of the month.

This function follows the way I think about someone's age:

require(data.table)
agecalc <- function(origin, current){
    y <- year(current) - year(origin) - 1
    offset <- 0
    if(month(current) > month(origin)) offset <- 1
    if(month(current) == month(origin) & 
       mday(current) >= mday(origin)) offset <- 1
    age <- y + offset
    return(age)
}

This is the same logic refactored and vectorized:

agecalc <- function(origin, current){
    age <- year(current) - year(origin) - 1
    ii <- (month(current) > month(origin)) | (month(current) == month(origin) & 
                                                  mday(current) >= mday(origin))
    age[ii] <- age[ii] + 1
    return(age)
}
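A short check of the vectorized version, repeated here so the snippet runs on its own. The dates are taken from the tests further down; a single birth date is recycled against several reference dates.

```r
library(data.table)

# Same vectorized logic as above, repeated so this snippet is self-contained
agecalc <- function(origin, current) {
  age <- year(current) - year(origin) - 1
  # TRUE where the birthday has already been reached in the current year
  ii <- (month(current) > month(origin)) |
        (month(current) == month(origin) & mday(current) >= mday(origin))
  age[ii] <- age[ii] + 1
  age
}

birth <- as.IDate("1985-08-13")
agecalc(birth, as.IDate(c("1986-08-12", "1986-08-13", "1986-09-12")))
# 0 1 1: still 0 the day before the birthday, 1 from the birthday onward
```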

You could also do a string comparison on the mm-dd part. I could imagine scenarios where the string comparison would be faster, for example if you had the year as a number and the birth date as a string.

agecalc <- function(origin, current){
    origin <- as.character(origin)
    current <- as.character(current)

    age <- as.numeric(substr(current, 1, 4)) - as.numeric(substr(origin, 1, 4)) - 1
    # Vectorized comparison: add a year wherever the mm-dd part has reached the birthday
    ii <- substr(current, 6, 10) >= substr(origin, 6, 10)
    age[ii] <- age[ii] + 1
    return(age)
}

Some tests:

agecalc(as.IDate("1985-08-13"), as.IDate("1985-08-12"))
agecalc(as.IDate("1985-08-13"), as.IDate("1985-08-13"))
agecalc(as.IDate("1985-08-13"), as.IDate("1986-08-12"))
agecalc(as.IDate("1985-08-13"), as.IDate("1986-08-13"))
agecalc(as.IDate("1985-08-13"), as.IDate("1986-09-12"))

agecalc(as.IDate("2000-02-29"), as.IDate("2000-02-28"))
agecalc(as.IDate("2000-02-29"), as.IDate("2000-02-29"))
agecalc(as.IDate("2000-02-29"), as.IDate("2001-02-28"))
agecalc(as.IDate("2000-02-29"), as.IDate("2001-02-29"))
agecalc(as.IDate("2000-02-29"), as.IDate("2001-03-01"))
agecalc(as.IDate("2000-02-29"), as.IDate("2004-02-28"))
agecalc(as.IDate("2000-02-29"), as.IDate("2004-02-29"))
agecalc(as.IDate("2000-02-29"), as.IDate("2011-03-01"))

## Requires vectorized version:
d <- data.table(d=as.IDate("2000-01-01") + 0:10000)
d[ , b1 := as.IDate("2000-08-15")]
d[ , b2 := as.IDate("2000-02-29")]
d[ , age1_num := (d - b1) / 365]
d[ , age2_num := (d - b2) / 365]
d[ , age1 := agecalc(b1, d)]
d[ , age2 := agecalc(b2, d)]
d