I have two data frames. One data frame (Partners.Missing) contains 195 people who are partnered (married, de facto, etc.) and for whom I need to construct a partner, using a random selection from a second data frame (NAsOnly).
The structure of the Partners.Missing data frame is:
str(Partners.Missing)
'data.frame': 195 obs. of 8 variables:
$ V1 : Factor w/ 2 levels "Female","Male": 1 1 1 2 1 1 1 2 2 2 ...
$ V2 : Factor w/ 9 levels "15 - 17 Years",..: 4 4 7 7 4 4 7 3 7 4 ...
$ V3 : Factor w/ 1 level "Partnered": 1 1 1 1 1 1 1 1 1 1 ...
$ V4 : Factor w/ 7 levels "Eight or More Usual Residents",..: 1 1 5 2 1 1 1 1 2 5 ...
$ V5 : Factor w/ 8 levels "1-9 Hours Worked",..: 8 4 8 6 7 8 7 5 4 6 ...
$ SEX : chr "Male" "Male" "Male" "Female" ...
$ Ageband : num 4 4 7 7 4 4 7 3 7 4 ...
$ Inhabitants: num 8 8 6 5 8 8 8 8 5 6 ...
Because V2 is the age band stored as a factor, I have created the Ageband variable, a recode of V2 in which the youngest age group (15 - 17 years) is 1, the next oldest is 2, etc. Inhabitants is a recode of V4, again to construct a numeric variable. SEX is binary "Male"/"Female".
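For reference, the kind of recode described above can be sketched as follows. This is only an illustration on toy data: the data frame toy, the vectors age_levels and size_lookup, and every label other than "15 - 17 Years" and "Eight or More Usual Residents" are assumptions made up for the example, not the real levels.

```r
# Toy data. All labels except "15 - 17 Years" and
# "Eight or More Usual Residents" are hypothetical stand-ins.
age_levels <- c("15 - 17 Years", "18 - 24 Years", "25 - 34 Years")

toy <- data.frame(
  V2 = factor(c("18 - 24 Years", "15 - 17 Years", "25 - 34 Years"),
              levels = age_levels),
  V4 = factor(c("Eight or More Usual Residents",
                "Two Usual Residents",
                "Five Usual Residents"))
)

# Age band: position of each label in the youngest-to-oldest vector
toy$Ageband <- match(as.character(toy$V2), age_levels)

# Household size: explicit lookup, because the factor levels sort
# alphabetically ("Eight ..." comes first) rather than by size
size_lookup <- c("Two Usual Residents"           = 2,
                 "Five Usual Residents"          = 5,
                 "Eight or More Usual Residents" = 8)
toy$Inhabitants <- unname(size_lookup[as.character(toy$V4)])
```

An explicit lookup is used rather than as.integer() on the factor, since the integer codes follow level order, which need not follow age or household size.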
The information on the second data frame (NAsOnly) is:
str(NAsOnly)
'data.frame': 762 obs. of 7 variables:
$ SEX : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 2 2 2 2 2 2 ...
$ AGEBAND : Factor w/ 13 levels "0 - 4 Years",..: 3 3 3 3 3 3 3 3 3 3 ...
$ RELATIONSHIP: Factor w/ 4 levels "Non-partnered",..: 3 3 3 3 1 1 1 1 1 1 ...
$ INHABITANTS : Factor w/ 9 levels "Eight or More Usual Residents",..: 7 7 3 2 9 9 9 9 7 7 ...
$ HRSWORKED : Factor w/ 9 levels "1-9 Hours Worked",..: 1 8 6 3 1 2 3 6 3 4 ...
I can create new variables so that Ageband and Inhabitants in NAsOnly have the same construction, to use in matching. But I'm stuck on how to match. What I want to do, for each row in Partners.Missing, is to randomly sample an observation from NAsOnly using the following criteria:
- opposite SEX (so a "Female" in Partners.Missing will match to a "Male" in NAsOnly)
- the "Female" partner (irrespective of which data frame they originate from) is in the same age band as, or one younger than, the "Male" partner
- the number of Inhabitants is an exact match, so that a "Female" from a 5-person household will only match to a "Male" (of the correct age band) from a 5-person household
- RELATIONSHIP in NAsOnly can only be "Partnered" ("Non-partnered" and "Not elsewhere included" are also valid variable entries in that data frame)*.
So I want a one-to-one match, and I need the match to be a random draw and not the first available. This has to be done 195 times, once for each observation in Partners.Missing, so that their partner is no longer missing.
I can't use first or last match either, as there could be numerous rows in NAsOnly that match on the basis of my criteria. It has to be a random draw, otherwise the same observations will be drawn every time from NAsOnly. Basically, something like random sampling with replacement from NAsOnly. It does not matter whether the sampled observations are used to construct a third data frame of matches, or whether the sampled observations are added to Partners.Missing as additional columns.
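To make the criteria concrete, here is a minimal sketch of the draw on toy data. It is not a tested solution for the real data: the data frames people and pool, the function pick_partner, and all values are hypothetical, and it assumes the numeric Ageband and Inhabitants recodes already exist in both data frames. Drawn rows stay in the pool, so the same row can be picked for more than one person.

```r
set.seed(1)  # only so the toy example is reproducible

# Hypothetical stand-ins for Partners.Missing and NAsOnly
people <- data.frame(V1 = c("Female", "Male"),
                     Ageband = c(4, 7),
                     Inhabitants = c(5, 8),
                     stringsAsFactors = FALSE)
pool <- data.frame(
  SEX = c("Male", "Male", "Male", "Female", "Female", "Male"),
  Ageband = c(4, 5, 6, 6, 7, 4),
  Inhabitants = c(5, 5, 5, 8, 8, 5),
  RELATIONSHIP = c("Partnered", "Partnered", "Partnered",
                   "Partnered", "Non-partnered", "Non-partnered"),
  stringsAsFactors = FALSE)

# Return the row index in `pool` of one randomly drawn partner,
# or NA if no row meets the criteria
pick_partner <- function(person, pool) {
  if (person$V1 == "Female") {
    wanted_sex <- "Male"    # male partner: same band or one older
    wanted_age <- c(person$Ageband, person$Ageband + 1)
  } else {
    wanted_sex <- "Female"  # female partner: same band or one younger
    wanted_age <- c(person$Ageband, person$Ageband - 1)
  }
  candidates <- which(pool$SEX == wanted_sex &
                      pool$Ageband %in% wanted_age &
                      pool$Inhabitants == person$Inhabitants &
                      pool$RELATIONSHIP == "Partnered")
  if (length(candidates) == 0) return(NA_integer_)
  # sample.int avoids sample()'s behaviour of sampling from 1:n
  # when given a single number
  candidates[sample.int(length(candidates), 1)]
}

chosen <- vapply(seq_len(nrow(people)),
                 function(i) pick_partner(people[i, ], pool),
                 integer(1))

# Attach the drawn rows as extra, prefixed columns
partner <- pool[chosen, ]
names(partner) <- paste0("partner_", names(partner))
rownames(partner) <- NULL
matches <- cbind(people, partner)
```

If each pool row should be used at most once, one variation would be to drop the chosen row from pool after each draw, at the cost of the result depending on the order in which people are processed.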
*It has four levels as the original larger data frame had Totals rows, so the fourth (and unused) level is "Total".
Update: I have tried to write a for loop to do this, but it's not working as intended. The code is:
for(i in 1:1) {
  row <- Partners.Missing[i,]
  if(row$V1=="Female")
    matched <- data.frame(row$SEX[i]==Partnered.Censored$SEX &
                          row$Inhabitants[i]==Partnered.Censored$Inhabitants &
                          (row$Ageband[i]==Partnered.Censored$Ageband | row$Ageband[i]==Partnered.Censored$Ageband+1)
    )
  else
    matched <- data.frame(row$SEX[i]==Partnered.Censored$SEX &
                          row$Inhabitants[i]==Partnered.Censored$Inhabitants &
                          (row$Ageband[i]==Partnered.Censored$Ageband | row$Ageband[i]==Partnered.Censored$Ageband-1)
    )
}
This outputs a data frame called matched with a single column of 277 rows, each entry being TRUE or FALSE according to whether the row with that index in Partnered.Censored is a match or not. Once I increase i's maximum value to 2 (knowing I have 195 rows), I get NA as output. I have the following problems remaining:
- I wish to use the row(s) that match from Partnered.Censored rather than outputting a boolean result
- I then wish to sample randomly from the matching rows to generate the new partner
- and then repeat for each row in Partners.Missing.
I also have the problem where increasing the maximum value of i, e.g. to 2, overwrites the single column of TRUE/FALSE values with NA.
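One possible explanation for the NA values, judging only from the loop as posted: row is a one-row data frame, so row$SEX has length 1, and indexing it with i = 2 falls off the end. Separately, matched is reassigned on every pass, so only the last iteration's result survives. The small sketch below reproduces the indexing behaviour on a made-up data frame (toy is hypothetical).

```r
# Hypothetical three-row data frame standing in for Partners.Missing
toy <- data.frame(SEX = c("Male", "Female", "Male"),
                  stringsAsFactors = FALSE)

i <- 2
row <- toy[i, ]        # a data frame with exactly one row

length(row$SEX)        # 1
row$SEX[i]             # NA: there is no second element
row$SEX                # "Female"

bad  <- row$SEX[i] == toy$SEX   # NA NA NA
good <- row$SEX == toy$SEX      # FALSE TRUE FALSE
```

With i = 1 the extra index happens to be harmless, which would be consistent with the loop appearing to work for 1:1 only.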