I have a dataframe that looks like:
animal_id trait_id sire_id dam_id
1 25.05 0 0
2 -46.3 1 2
3 41.6 1 2
4 -42.76 3 4
5 -10.99 3 4
6 -49.81 5 4
I want to create another variable that contains the estimate of "trait_id" for each "sire_id" and "dam_id".
All sires (sire_id) and dams (dam_id) are also present in the animal_id column. So what I want to do is to look for their measurement in the trait_id and repeat this variable in the new variable.
The outcome that I want is:
animal_id trait_id sire_id trait_sire dam_id trait_dam
1 25.05 0 NA 0 NA
2 -46.3 1 25.05 2 -46.3
3 41.6 1 25.05 2 -46.3
4 -42.76 3 41.6 4 -42.76
5 -10.99 3 41.6 4 -42.76
6 -49.81 5 -10.99 4 -42.76
Any suggestion would be greatly appreciated.
You can use match
; match(col, df$animal_id)
gives corresponding index of elements from col in the animal_id
, which can be used further to locate the values of trait
:
df[c("trait_sire", "trait_dam")] <-
lapply(df[c("sire_id", "dam_id")], function(col) df$trait_id[match(col, df$animal_id)])
df
# animal_id trait_id sire_id dam_id trait_sire trait_dam
#1 1 25.05 0 0 NA NA
#2 2 -46.30 1 2 25.05 -46.30
#3 3 41.60 1 2 25.05 -46.30
#4 4 -42.76 3 4 41.60 -42.76
#5 5 -10.99 3 4 41.60 -42.76
#6 6 -49.81 5 4 -10.99 -42.76
With data.table joins...
library(data.table)
setDT(DT)
DT[, trait_sire :=
.SD[.SD, on=.(animal_id = sire_id), x.trait_id ]
]
DT[, trait_dam :=
.SD[.SD, on=.(animal_id = dam_id), x.trait_id ]
]
animal_id trait_id sire_id dam_id trait_sire trait_dam
1: 1 25.05 0 0 NA NA
2: 2 -46.30 1 2 25.05 -46.30
3: 3 41.60 1 2 25.05 -46.30
4: 4 -42.76 3 4 41.60 -42.76
5: 5 -10.99 3 4 41.60 -42.76
6: 6 -49.81 5 4 -10.99 -42.76
The syntax is x[i, on=, j]
where j
is some function of the columns. To see how it works, try out DT[DT, on=.(animal_id = dam_id)]
and variations. Some notes:
- The
i.*
/ x.*
syntax helps to distinguish where a column is taken from.
- When
j
is v := expression
, the expression is assigned to column, v
.
- The join
x[i, ...]
uses rows of i
to look up rows of x
.
- The
on=
syntax is like .(xcol = icol)
.
- Inside
j
, the table itself can be written as .SD
.
One advantage of this approach over match
is that it extends to joins on more than one column, like on = .(xcol = icol, xcol2 = icol2)
or even "non equi joins" like on = .(xcol < icol)
. Also, it's part of a consistent syntax for operating on the table (explained in the package's introductory material), rather than specialized code for one task.
You can do this using match
(in base R) in one run (no need to loop over)
df[c("trait_sire", "trait_dam")] <-
cbind(with(df, trait_id[match(sire_id, animal_id)]),
with(df, trait_id[match(dam_id, animal_id)]))
# animal_id trait_id sire_id dam_id trait_sire trait_dam
# 1 1 25.05 0 0 NA NA
# 2 2 -46.30 1 2 25.05 -46.30
# 3 3 41.60 1 2 25.05 -46.30
# 4 4 -42.76 3 4 41.60 -42.76
# 5 5 -10.99 3 4 41.60 -42.76
# 6 6 -49.81 5 4 -10.99 -42.76