conditional random matching from one data frame in

Posted 2019-08-11 10:50

Question:

I have two data frames. One data frame (Partners.Missing) contains 195 people who are partnered (married, de facto, etc.) and for whom I need to construct the partner, using a random selection from a second data frame (NAsOnly).

The Partners.Missing data frame information is:

 str(Partners.Missing)
 'data.frame':  195 obs. of  8 variables:
  $ V1         : Factor w/ 2 levels "Female","Male": 1 1 1 2 1 1 1 2 2 2 ...
  $ V2         : Factor w/ 9 levels "15 - 17 Years",..: 4 4 7 7 4 4 7 3 7 4 ...
  $ V3         : Factor w/ 1 level "Partnered": 1 1 1 1 1 1 1 1 1 1 ...
  $ V4         : Factor w/ 7 levels "Eight or More Usual Residents",..: 1 1 5 2 1 1 1 1 2 5 ...
  $ V5         : Factor w/ 8 levels "1-9 Hours Worked",..: 8 4 8 6 7 8 7 5 4 6 ...
  $ SEX        : chr  "Male" "Male" "Male" "Female" ...
  $ Ageband    : num  4 4 7 7 4 4 7 3 7 4 ...
  $ Inhabitants: num  8 8 6 5 8 8 8 8 5 6 ...

Because V2 is age-band as a factor, I have created the Ageband variable that is a recode of V2 so that the youngest age group (15 - 17 years) is 1, the next oldest is 2, etc. Inhabitants is a recode of V4, again to construct a numeric variable. Sex is binary "Male"/"Female".
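A minimal sketch of this kind of recode, assuming the factor levels are listed in age order. The three labels below are invented for illustration and are not the real nine levels of V2. Because NAsOnly uses a different, longer level set, the second half matches on the label text rather than on the level index, so the same band gets the same number in both data frames.

```r
# Toy age-band labels in age order; illustrative only, not the real data.
age_levels <- c("15 - 17 Years", "18 - 24 Years", "25 - 34 Years")
V2 <- factor(c("25 - 34 Years", "15 - 17 Years", "18 - 24 Years"),
             levels = age_levels)

# as.integer() on a factor returns each value's position in levels(V2),
# so the youngest band becomes 1, the next 2, and so on.
Ageband <- as.integer(V2)

# A second factor with a different level set (as in NAsOnly):
# look the labels up in the same reference vector so the codes agree.
AGEBAND <- factor(c("0 - 4 Years", "18 - 24 Years"))
Ageband.other <- match(as.character(AGEBAND), age_levels)  # NA if the band is not in age_levels
```

Bands that fall outside the reference vector come back as NA, which is convenient here because children cannot be partners anyway.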

The information on the second data frame (NAsOnly) is:

 str(NAsOnly)
 'data.frame':  762 obs. of  7 variables:
  $ SEX         : Factor w/ 3 levels "Female","Male",..: 2 2 2 2 2 2 2 2 2 2 ...
  $ AGEBAND     : Factor w/ 13 levels "0 - 4 Years",..: 3 3 3 3 3 3 3 3 3 3 ...
  $ RELATIONSHIP: Factor w/ 4 levels "Non-partnered",..: 3 3 3 3 1 1 1 1 1 1 ...
  $ INHABITANTS : Factor w/ 9 levels "Eight or More Usual Residents",..: 7 7 3 2 9 9 9 9 7 7 ...
  $ HRSWORKED   : Factor w/ 9 levels "1-9 Hours Worked",..: 1 8 6 3 1 2 3 6 3 4 ...

I can create new variables so that Ageband and Inhabitants in NAsOnly are the same construction, to use in matching. But I'm stuck on how to match. What I want to do - for each row in Partners.Missing - is to randomly sample an observation from NAsOnly using the following criteria:

  • opposite SEX (so a "Female" in Partners.Missing will match to a "Male" in NAsOnly)
  • the "Female" partner (irrespective of which data frame they originate from) is in the same age band as, or one band younger than, the "Male" partner
  • the number of Inhabitants is an exact match, so that a "Female" from a 5-person household will only match to a "Male" (of the correct age band) from a 5-person household
  • RELATIONSHIP in NAsOnly can only be "Partnered" ("Non-partnered" and "Not elsewhere included" are also valid variable entries in that data frame)*.
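The four criteria above can be sketched as a single logical filter for one person. This is only an illustration under stated assumptions: `eligible_partners` is a hypothetical helper name, the pool is assumed to already carry numeric Ageband and Inhabitants columns built the same way as in Partners.Missing, and the toy values are invented.

```r
# Hypothetical helper: rows of `pool` that meet all four criteria for one `person`.
eligible_partners <- function(person, pool) {
  own_sex  <- as.character(person$V1)
  opposite <- if (own_sex == "Female") "Male" else "Female"
  # The female is in the same band as the male, or one band younger.
  # A female therefore looks one band up, a male one band down.
  wanted_bands <- if (own_sex == "Female") person$Ageband + 0:1 else person$Ageband - 0:1
  keep <- pool$SEX == opposite &
    pool$Ageband %in% wanted_bands &
    pool$Inhabitants == person$Inhabitants &
    pool$RELATIONSHIP == "Partnered"
  pool[keep, , drop = FALSE]
}

# Toy pool (invented values).
pool <- data.frame(
  SEX          = c("Male", "Male", "Male", "Female", "Male"),
  Ageband      = c(4, 5, 6, 4, 4),
  Inhabitants  = c(5, 5, 5, 5, 3),
  RELATIONSHIP = c("Partnered", "Partnered", "Partnered", "Partnered", "Non-partnered"),
  stringsAsFactors = FALSE
)
person <- data.frame(V1 = "Female", Ageband = 4, Inhabitants = 5)

candidates <- eligible_partners(person, pool)  # keeps pool rows 1 and 2
```

One row can then be drawn at random with `candidates[sample.int(nrow(candidates), 1), ]`, provided `candidates` is not empty.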

So I want a one-to-one match, and I need the match to be a random draw and not the first available. And do this 195 times, once for each observation in Partners.Missing so that their partner is no longer missing.

I can't use first or last match either, as there could be numerous rows in NAsOnly that match on the basis of my criteria - it has to be a random draw, otherwise the same observations will be drawn every time from NAsOnly. Basically, something like random sampling with replacement from NAsOnly. It does not matter whether the sampled observations are used to construct a third data frame of matches, or whether the sampled observations are added to Partners.Missing as additional columns.

*It has four levels as the original larger data frame had Totals rows, so the fourth (and unused) level is "Total".

Update: I have tried to write a for loop to do this, but it's not working as intended. The code is:

 for(i in 1:1) {
   row <- Partners.Missing[i,]
   if(row$V1=="Female")
   matched <- data.frame(row$SEX[i]==Partnered.Censored$SEX &
             row$Inhabitants[i]==Partnered.Censored$Inhabitants &
           (row$Ageband[i]==Partnered.Censored$Ageband | row$Ageband[i]==Partnered.Censored$Ageband+1)
   )
   else
   matched <- data.frame(row$SEX[i]==Partnered.Censored$SEX &
           row$Inhabitants[i]==Partnered.Censored$Inhabitants &
           (row$Ageband[i]==Partnered.Censored$Ageband | row$Ageband[i]==Partnered.Censored$Ageband-1)
   )
 }

This produces a data frame called matched with a single column of 277 TRUE/FALSE values, each indicating whether the corresponding row of Partnered.Censored is a match. Once I increase i's maximum value to 2 (knowing I have 195 rows), I get NA as output. I have the following problems remaining:

  • I wish to use the row(s) that matches from Partnered.Censored rather than outputting a boolean result
  • I then wish to sample randomly from the matching rows to generate the new partner
  • and then repeat for each row in Partners.Missing.

I also have the problem where increasing the maximum value of i, e.g. to 2, overwrites the single column of TRUE/FALSE values with NA.
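A likely cause of the NA values, sketched with invented toy data: `row` is already a one-row data frame, so `row$SEX` has length 1 and `row$SEX[i]` asks for an element that does not exist once i is 2 or more. Comparing that NA with anything yields NA. The sketch also shows `which()` turning the TRUE/FALSE flags into row indices, which is one way to get the matching rows rather than a boolean column.

```r
# Toy data frame (invented values) standing in for Partners.Missing.
df <- data.frame(SEX = c("Male", "Female"), Inhabitants = c(8, 5),
                 stringsAsFactors = FALSE)
pool_sex <- c("Female", "Male", "Female")  # toy stand-in for Partnered.Censored$SEX

i <- 2
row <- df[i, ]                 # one-row data frame holding row 2

wrong <- row$SEX[i]            # NA: row$SEX has length 1, there is no element 2
flags_wrong <- wrong == pool_sex     # NA NA NA

flags_right <- row$SEX == pool_sex   # TRUE FALSE TRUE
matching_rows <- which(flags_right)  # indices 1 and 3, usable for subsetting
```

With i equal to 1 the two forms happen to agree, which would explain why the loop looked correct until the upper bound was raised.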

Answer 1:

This has been top of my mind for the past couple of days, and I appear to have solved the problem using the following code. I'm leaving the question and answer up just in case anyone else needs to do this.

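 for (i in seq_len(nrow(Partners.Missing))) {
   # One-row data frame for the person whose partner is missing
   row <- Partners.Missing[i, ]
   # One-to-many join on the partner's sex and the household size
   result <- merge(row, Partnered.Censored, by = c("SEX", "Inhabitants"), suffixes = c(".r", ".c"))
   # Keep only partners in a valid age band:
   # the female is in the same band as the male, or one band younger
   if (row$V1 == "Female") {
     result <- subset(result, Ageband.r == Ageband.c | Ageband.r == Ageband.c - 1)
   }
   if (row$V1 == "Male") {
     result <- subset(result, Ageband.r == Ageband.c | Ageband.r == Ageband.c + 1)
   }
   # sample(1:0, 1) would silently draw from c(1, 0), so stop if nobody qualifies
   if (nrow(result) == 0) stop("no eligible partner for row ", i)
   # Pick one candidate at random
   j <- sample(seq_len(nrow(result)), 1)
   if (i == 1) {
     Matched.Partners <- result[j, ]
   }
   if (i > 1) {
     Matched.Partners <- rbind(Matched.Partners, result[j, ])
   }
 }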

To explain this code for anyone else who needs it, and to see whether the community has a better answer: for each person in Partners.Missing, a temporary one-row data frame is created holding that person's information. A one-to-many join is then constructed on the two variables that must match exactly: the missing person's sex and the number of inhabitants in the household. Next, depending on whether the person in Partners.Missing is female or male, only potential partners in the correct age band are retained. The code then counts the potential partners identified and generates a random integer between 1 and that count. This integer is used to extract the randomly matched person and put them into the output data frame. Because the output data frame (Matched.Partners) does not exist before this code is run, the first pass through the loop creates it with its first row. On every later pass the data frame already exists, so the new match is appended.
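As a possible variant rather than a replacement, the same steps can be run with the matches collected in a list and bound once at the end, which avoids special-casing the first pass and growing a data frame inside the loop. The sketch below uses small invented stand-ins for the two data frames. It leaves people with no eligible partner out of the result, which is an assumption about the desired behaviour.

```r
set.seed(1)  # only so the sketch is reproducible

# Toy stand-ins for the real data frames (values invented for illustration).
Partners.Missing <- data.frame(
  V1          = c("Female", "Male", "Female"),
  SEX         = c("Male", "Female", "Male"),   # sex of the partner being sought
  Ageband     = c(4, 7, 3),
  Inhabitants = c(5, 2, 8),
  stringsAsFactors = FALSE
)
Partnered.Censored <- data.frame(
  SEX         = c("Male", "Male", "Female", "Female", "Male"),
  Ageband     = c(4, 5, 6, 7, 9),
  Inhabitants = c(5, 5, 2, 2, 8),
  stringsAsFactors = FALSE
)

matches <- vector("list", nrow(Partners.Missing))
for (i in seq_len(nrow(Partners.Missing))) {
  row <- Partners.Missing[i, ]
  result <- merge(row, Partnered.Censored, by = c("SEX", "Inhabitants"),
                  suffixes = c(".r", ".c"))
  # Female: same band as the partner or one below; male: same band or one above.
  offset <- if (row$V1 == "Female") -1 else 1
  ok <- result$Ageband.r == result$Ageband.c |
        result$Ageband.r == result$Ageband.c + offset
  result <- result[ok, , drop = FALSE]
  # sample.int(n, 1) draws from 1..n; skip people with no eligible partner.
  if (nrow(result) > 0) {
    matches[[i]] <- result[sample.int(nrow(result), 1), , drop = FALSE]
  }
}
# NULL entries (no match found) are dropped by rbind.
Matched.Partners <- do.call(rbind, matches)
```

In this toy data the third person has no eligible partner, so Matched.Partners has two rows. As in the loop above, the same row of Partnered.Censored can be drawn for more than one person, which amounts to sampling with replacement.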

I'll not vote up either my question or my answer.