Built Family nested tree parent / children relatio

2020-06-18 01:45发布

问题:

I am working on families trees :

I have adapted Bob Horton's example based on sqldf https://www.r-bloggers.com/exploring-recursive-ctes-with-sqldf/

My data :

      person            father
      Guillou Arthur    NA          
      Cleach Marc       NA          
      Guillou Eric      Guillou Arthur          
      Guillou Jacques   Guillou Arthur          
      Cleach Franck     Cleach Marc         
      Cleach Leo        Cleach Marc         
      Cleach Herbet     Cleach Leo          
      Cleach Adele      Cleach Herbet           
      Guillou Jean      Guillou Eric            
      Guillou Alan      Guillou Eric

My results, descendants ordered by levels of "Guillou Arthur" (top person without father) :

  name    parent_name              level
  Guillou Arthur    NA                  1       
  Guillou Eric      Guillou Arthur      2       
  Guillou Jacques   Guillou Arthur      2       
  Guillou Alan      Guillou Eric        3       
  Guillou Jean     Guillou Eric         3       

You can built this table with recursive query with sqldf :

The data :

 person <- c("Guillou Arthur",
              "Cleach Marc",
              "Guillou Eric",
              "Guillou Jacques", 
              "Cleach Franck",
              "Cleach Leo",
              "Cleach Herbet",
              "Cleach Adele",
              "Guillou Jean",
              "Guillou Alan" )
 father <- c(NA, NA, "Guillou Arthur" , "Guillou Arthur", "Cleach Marc", "Cleach Marc", "Cleach Leo", "Cleach Herbet", "Guillou Eric", "Guillou Eric")


family <- data.frame(person, father)

Large to long format conversion :

    library(tidyr)

    long_family <- gather(family, parent, parent_name, -person)

    long_family

Recursive query to find descendants of "Guillou Arthur" (top person without father) :

    library(sqldf)
      descendants_sql <- "
      WITH RECURSIVE descendants (name, parent_name, level) AS (
        SELECT person, parent_name, 1 FROM long_family 
          WHERE person = '%s'
          AND parent = '%s'

          UNION ALL
          SELECT F.person, F.parent_name, D.level + 1 
              FROM descendants D
              JOIN long_family F
              ON F.parent_name = D.name)

      SELECT * FROM descendants ORDER BY level, name
      "
      fam <- sqldf(sprintf(descendants_sql, 'Guillou Arthur', 'father'))
      fam   

My question :
How can I create a data.frame object including all families trees directly with R (and not sql). Each tree starts with a patriarch (without father) like "Cleach Marc". (with R method or sqldf method)

回答1:

We build a recursive function to get the father line, from there everything is easy.

First we define the data with stringsAsFactors = FALSE for smoother reformatting.

family <- data.frame(person, father,stringsAsFactors = FALSE)

the function

father_line <- function(x){
dad <- subset(family,person==x)$father
if(is.na(dad)) return(x)
c(x,father_line(dad))
}

father_line ("Guillou Alan")
# [1] "Guillou Alan"   "Guillou Eric"   "Guillou Arthur"

Use it to get level and other things

family$father_line <- lapply(family$person,father_line)
family$level       <- lengths(family$father_line)
family$patriarch   <- sapply(family$father_line,tail,1)

#             person         father                                          father_line level      patriarch
# 1   Guillou Arthur           <NA>                                       Guillou Arthur     1 Guillou Arthur
# 2      Cleach Marc           <NA>                                          Cleach Marc     1    Cleach Marc
# 3     Guillou Eric Guillou Arthur                         Guillou Eric, Guillou Arthur     2 Guillou Arthur
# 4  Guillou Jacques Guillou Arthur                      Guillou Jacques, Guillou Arthur     2 Guillou Arthur
# 5    Cleach Franck    Cleach Marc                           Cleach Franck, Cleach Marc     2    Cleach Marc
# 6       Cleach Leo    Cleach Marc                              Cleach Leo, Cleach Marc     2    Cleach Marc
# 7    Cleach Herbet     Cleach Leo               Cleach Herbet, Cleach Leo, Cleach Marc     3    Cleach Marc
# 8     Cleach Adele  Cleach Herbet Cleach Adele, Cleach Herbet, Cleach Leo, Cleach Marc     4    Cleach Marc
# 9     Guillou Jean   Guillou Eric           Guillou Jean, Guillou Eric, Guillou Arthur     3 Guillou Arthur
# 10    Guillou Alan   Guillou Eric           Guillou Alan, Guillou Eric, Guillou Arthur     3 Guillou Arthur

For example to get stated expected output:

subset(family,patriarch == "Guillou Arthur",select=c(person,father,level))
#             person         father level
# 1   Guillou Arthur           <NA>     1
# 3     Guillou Eric Guillou Arthur     2
# 4  Guillou Jacques Guillou Arthur     2
# 9     Guillou Jean   Guillou Eric     3
# 10    Guillou Alan   Guillou Eric     3 

The tidyverse way it would look like this:

library(tidyverse)
family %>%
  mutate(family_line = map(person,father_line),
         level = lengths(family_line),
         patriarch = map(family_line,last)) %>%
  filter(patriarch == "Guillou Arthur") %>%
  select(person,father,level)

#            person         father level
# 1  Guillou Arthur           <NA>     1
# 2    Guillou Eric Guillou Arthur     2
# 3 Guillou Jacques Guillou Arthur     2
# 4    Guillou Jean   Guillou Eric     3
# 5    Guillou Alan   Guillou Eric     3


回答2:

You can probably do this using graph tools. so using igraph, you can get neighbours using ego functions.

A quick sketch (which needs checking!)

library(igraph)

family[] = lapply(family, factor, levels=unique(unlist(family)))

g = graph_from_adjacency_matrix(table(family))

cg = connect.neighborhood(g, order=length(V(g)), mode="out")

cbind( V(cg)$name, 
       sapply(ego(g, mode="out", mindist=1), function(x) replace(names(x), length(names(x))==0, NA)),
       ego_size(cg, mode="out") )[grep("Guillou", V(cg)$name),]

[,1]                   [,2]             [,3]
[1,] "Guillou Arthur"  NA               "1" 
[2,] "Guillou Eric"    "Guillou Arthur" "2" 
[3,] "Guillou Jacques" "Guillou Arthur" "2" 
[4,] "Guillou Jean"    "Guillou Eric"   "3" 
[5,] "Guillou Alan"    "Guillou Eric"   "3"

In fact maybe you dont need to create a neighbourhood graph and can get by with:

cbind( V(g)$name, 
       sapply(ego(g, mode="out", mindist=1), function(x) replace(names(x), length(names(x))==0, NA)),
       ego_size(g, mode="out", order=length(V(g))) )[grep("Cleach", V(g)$name),]