Context
I am trying to read in and tidy an excel file with multiple headers/sections placed at variable positions. The content of these headers need to be added as a variable. The input files are relatively large excel files which are formatted with (human) readability in mind but little more than that.
Input:
Let's say the data set contains the distributions of types of car (based on the fuel they use) for a number of cities. As you will see, in the original file, the name of the city is used as header (or divider as you will). We need this header as a variable. Unfortunately not all types are listed and some values are missing. Here's a fictional example set:
df <- data.frame(
col1= c("Seattle","Diesel","Gasoline","LPG","Electric","Boston","Diesel","Gasoline","Electric"),
col2= c(NA, 80 ,NA,10,10,NA,65,25,10)
)
col1 col2
1 Seattle NA
2 Diesel 80
3 Gasoline NA
4 LPG 10
5 Electric 10
6 Boston NA
7 Diesel 65
8 Gasoline 25
9 Electric 10
Desired result:
city type value
1 Seattle Diesel 80
2 Seattle Gasoline NA
3 Seattle LPG 10
4 Seattle Electric 10
5 Boston Diesel 65
6 Boston Gasoline 25
7 Boston Electric 10
My attempt:
The closest I got was using dplyr's dense_rank()
and lag()
but this was not an ideal solution.
Any input is greatly appreciated!
For completeness' sake, here's a base R solution that also depends on the expectation that you can make a vector of the elements of
col1
that are not city names and use it for reference:Result:
Here is an option based on creating a group based on the
us.cities
dataset frommaps
by matching the elements in 'city' with the 'name' column from 'us.cities' to create a group, and then create thefirst
element of 'col1' as 'city', delete the first row (slice(-1)
)Or another option is using
str_extract
instead of grouping and thenfill
as in the other postNOTE: This would also work if there are 100s of other elements in 'col1' besides the 'city'. Here, we considered only the US cities, if it also includes cities from other countries, use
world.cities
data from the same packageA data.table option.
Similar to @camille's answer, I assume you can make some vector of measures and if the
col1
value isn't in that list it's a city. This groups by thecumsum
of not (!
)col1 %in% meas
, i.e. a group number which increments by 1 each timecol1
is not found inmeas
. Within each group,city
is set as thefirst
value ofcol1
andcol1
/col2
are renamed appropriately. Then I filter to only rows wherecity
doesn't equalcol1
(now renamedtype
) and remove the grouping variableg
.Assuming you have a finite list of measures (diesel, electric, etc), you can make a list to check against. Any value of
col1
not in that set of measures is presumably a city. Extract those (note that it's currently a factor, so I usedas.character
), fill down, and remove any heading rows.