Converting text to a data.frame based on headers

Posted 2020-04-12 13:04

Question:

I read a .txt file into R as follows: Election_Parties <- readr::read_lines("Election_Parties.txt"). Suppose the file contains the following text:

BOLIVIA
P17-Nationalist Revolutionary Movement-Free Bolivia Movement (Movimiento Nacionalista Revolucionario 
P19-Liberty and Justice (Libertad y Justicia [LJ])
P20-Tupak Katari Revolutionary Movement (Movimiento Revolucionario Tupak Katari [MRTK])

COLOMBIA
P1-Democratic Aliance M-19 (Alianza Democratica M-19 [AD-M19])
P2-National Popular Alliance (Alianza Nacional Popular [ANAPO])
P3-Indigenous Authorities of Colombia (Autoridades Indígenas de Colombia)

In words: after every empty line, a new country starts. I would like to convert this text file into a data frame in which the country names form one column and the parties form another.

Desired output:

Bolivia     P17-Nationalist Revolutionary Movement-Free Bolivia Movement (Movimiento Nacionalista 
Bolivia     P19-Liberty and Justice (Libertad y Justicia [LJ])
Bolivia     P20-Tupak Katari Revolutionary Movement (Movimiento Revolucionario Tupak Katari [MRTK])
Colombia    P1-Democratic Aliance M-19 (Alianza Democratica M-19 [AD-M19])
Colombia    P2-National Popular Alliance (Alianza Nacional Popular [ANAPO])
Colombia    P3-Indigenous Authorities of Colombia (Autoridades Indígenas de Colombia)

If possible, I would like the solution to be based on the header.

EDIT: I just realised that every new country starts with P1, so a solution could also be based on that.

Answer 1:

If your separator is always an empty string (""), then once the text is in a character vector you can use the empty lines as delimiters and apply cumsum to assign each line to a group.

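TXT <- readr::read_lines("Election_Parties.txt")
# Prepend a separator so the first country also starts after an empty line
TXT <- c("", TXT)
# Each empty line increments the counter, so every line gets its country's group number
idx <- cumsum(TXT == "")
# If the separator lines contain stray whitespace rather than being truly empty,
# use idx <- cumsum(!grepl("^[A-Z]", TXT)) instead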

As you can see, BOLIVIA falls into group 1 and COLOMBIA into group 2:

tibble::tibble(TXT,idx)
# A tibble: 10 x 2
   TXT                                                                       idx
   <chr>                                                                   <int>
 1 ""                                                                          1
 2 BOLIVIA                                                                     1
 3 "P17-Nationalist Revolutionary Movement-Free Bolivia Movement (Movimie…     1
 4 P19-Liberty and Justice (Libertad y Justicia [LJ])                          1
 5 P20-Tupak Katari Revolutionary Movement (Movimiento Revolucionario Tup…     1
 6 ""                                                                          2
 7 COLOMBIA                                                                    2
 8 P1-Democratic Aliance M-19 (Alianza Democratica M-19 [AD-M19])              2
 9 P2-National Popular Alliance (Alianza Nacional Popular [ANAPO])             2
10 P3-Indigenous Authorities of Colombia (Autoridades Indígenas de Colomb…     2
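The grouping step can be tried without the file. The following is a minimal sketch, assuming a short inline vector stands in for the contents of Election_Parties.txt (the sample lines are taken from the question; the variable is_blank is introduced here for illustration). It uses trimws and nzchar so that whitespace-only separator lines also count as empty, which is a small variation on the TXT == "" test above.

```r
# Hypothetical inline sample standing in for Election_Parties.txt
TXT <- c(
  "BOLIVIA",
  "P19-Liberty and Justice (Libertad y Justicia [LJ])",
  "",
  "COLOMBIA",
  "P2-National Popular Alliance (Alianza Nacional Popular [ANAPO])"
)

# Prepend a separator so the first country also follows an empty line
TXT <- c("", TXT)

# TRUE for lines that are empty or contain only whitespace
is_blank <- !nzchar(trimws(TXT))

# Running count of separators seen so far = group number of each line
idx <- cumsum(is_blank)
print(idx)
```

Each separator line belongs to the group it opens, which is why the country name is always the second element of its group.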

Then apply a function to each group and combine the results into a data frame:

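# x holds one group: x[1] is the empty separator, x[2] the country, the rest the parties
func <- function(x) {
  data.frame(Country = x[2], Parties = x[3:length(x)])
}
# Build one data frame per group and stack them row-wise
do.call(rbind, by(TXT, idx, func))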
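Since the question asks for a solution based on the header, here is an alternative sketch that does not rely on the empty lines at all. It is not the method used above. It assumes that every party line starts with "P", a number, and a hyphen, that every other non-blank line is a country header, and that the file begins with a header. The inline sample and the names is_party, headers, grp, and result are hypothetical.

```r
# Hypothetical inline sample standing in for Election_Parties.txt
TXT <- c(
  "BOLIVIA",
  "P19-Liberty and Justice (Libertad y Justicia [LJ])",
  "P20-Tupak Katari Revolutionary Movement (Movimiento Revolucionario Tupak Katari [MRTK])",
  "",
  "COLOMBIA",
  "P1-Democratic Aliance M-19 (Alianza Democratica M-19 [AD-M19])",
  "P2-National Popular Alliance (Alianza Nacional Popular [ANAPO])"
)

# Drop blank (or whitespace-only) lines; they are not needed here
TXT <- TXT[nzchar(trimws(TXT))]

# Assumption: party lines look like "P<number>-..."; everything else is a header
is_party <- grepl("^P[0-9]+-", TXT)
headers <- TXT[!is_party]

# For each line, the index of the most recent header above it
grp <- cumsum(!is_party)

# Pair every party line with the header it falls under
result <- data.frame(
  Country = headers[grp][is_party],
  Parties = TXT[is_party],
  stringsAsFactors = FALSE
)
print(result)
```

If a country name could itself match the party pattern, or if the file does not begin with a header, the is_party test would need to be adapted.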


Tags: r string parsing