I'm trying to wrap my head around closures, and I think I've found a case where they might be helpful.
I have the following pieces to work with:
- A set of regular expressions designed to clean state names, housed in a function
- A data.frame with state names (of the standardized form that the function above creates) and state ID codes, to link the two (the "merge map")
The idea is, given some data.frame with sloppy state names (is the capital listed as "Washington, D.C.", "washington DC", "District of Columbia", etc.?), to have a single function return the same data.frame with the state name column removed and only the state ID codes remaining. Then subsequent merges can happen consistently.
I can do this in any number of ways, but one approach that seems particularly elegant would be to house the merge map, the regular expressions, and the processing code all inside a closure (following the idea that a closure is a function with data).
Question 1: Is this a reasonable idea?
Question 2: If so, how do I do it in R?
Here's a stupid-simple cleanStateNames function that works on the example data:
cleanStateNames <- function(x) {
  x <- tolower(x)
  x[grepl("columbia", x)] <- "DC"
  x
}
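For example, a quick sanity check on a couple of the sample values:

cleanStateNames(c("District of Columbia", "Florida"))
## [1] "DC"      "florida"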
Here's some example data that the eventual function will be run on:
dat <- structure(list(state = c("Alabama", "Alaska", "Arizona", "Arkansas",
"California", "Colorado", "Connecticut", "Delaware", "District of Columbia",
"Florida"), pop08 = structure(c(29L, 44L, 40L, 18L, 25L, 30L,
22L, 48L, 36L, 13L), .Label = c("1,050,788", "1,288,198", "1,315,809",
"1,316,456", "1,523,816", "1,783,432", "1,814,468", "1,984,356",
"10,003,422", "11,485,910", "12,448,279", "12,901,563", "18,328,340",
"19,490,297", "2,600,167", "2,736,424", "2,802,134", "2,855,390",
"2,938,618", "24,326,974", "3,002,555", "3,501,252", "3,642,361",
"3,790,060", "36,756,666", "4,269,245", "4,410,796", "4,479,800",
"4,661,900", "4,939,456", "5,220,393", "5,627,967", "5,633,597",
"5,911,605", "532,668", "591,833", "6,214,888", "6,376,792",
"6,497,967", "6,500,180", "6,549,224", "621,270", "641,481",
"686,293", "7,769,089", "8,682,661", "804,194", "873,092", "9,222,414",
"9,685,744", "967,440"), class = "factor")), .Names = c("state",
"pop08"), row.names = c(NA, 10L), class = "data.frame")
And a sample merge map (the actual one links FIPS codes to states, so it can't be trivially generated):
merge_map <- data.frame(state = dat$state, id = seq(10))
EDIT: Building off of crippledlambda's answer below, here's an attempt at the function:
prepForMerge <- local({
  merge_map <- structure(list(state = c("alabama", "alaska", "arizona", "arkansas",
                                        "california", "colorado", "connecticut",
                                        "delaware", "DC", "florida"),
                              id = 1:10),
                         .Names = c("state", "id"), row.names = c(NA, -10L),
                         class = "data.frame")
  list(
    replace_merge_map = function(new_merge_map) {
      merge_map <<- new_merge_map
    },
    show_merge_map = function() {
      merge_map
    },
    return_prepped_data.frame = function(dat) {
      dat$state <- cleanStateNames(dat$state)
      dat <- merge(dat, merge_map)
      dat <- subset(dat, select = c(-state))
      dat
    }
  )
})
> prepForMerge$return_prepped_data.frame(dat)
pop08 id
1 4,661,900 1
2 686,293 2
3 6,500,180 3
4 2,855,390 4
5 36,756,666 5
6 4,939,456 6
7 3,501,252 7
8 591,833 9
9 873,092 8
10 18,328,340 10
Two problems remain before I'd consider this question solved:
1. Calling prepForMerge$return_prepped_data.frame(dat) is painful each time. Is there any way to have a default function, so that I could just call prepForMerge(dat)? I'm guessing not, given how it's implemented, but perhaps there's at least a convention for the default function... (one idea is sketched after this list).
2. How do I avoid mixing data and code in the merge_map definition? Ideally I'd clean merge_map elsewhere, then just grab it inside the closure and store it.
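One idea I'm toying with for both points (an untested sketch; the "merge_map.csv" path is just a placeholder for wherever the pre-cleaned merge map actually lives): have local() return the main function itself, attach the helpers as attributes, and read the merge map from a file instead of hard-coding it:

prepForMerge <- local({
  ## placeholder path; in practice this is the merge map cleaned elsewhere
  merge_map <- read.csv("merge_map.csv", stringsAsFactors = FALSE)
  main <- function(dat) {
    dat$state <- cleanStateNames(dat$state)
    dat <- merge(dat, merge_map)
    subset(dat, select = c(-state))
  }
  ## keep the helpers reachable without cluttering the common call
  attr(main, "show_merge_map")    <- function() merge_map
  attr(main, "replace_merge_map") <- function(new_merge_map) merge_map <<- new_merge_map
  main
})

prepForMerge(dat)                        ## the common case
attr(prepForMerge, "show_merge_map")()   ## inspect the stored merge map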
I may be missing the point of your question, but this is one way in which you can use a closure:
> replaceStateNames <- local({
+ statenames <- c("Alabama", "Alaska", "Arizona", "Arkansas",
+ "California", "Colorado", "Connecticut", "Delaware",
+ "District of Columbia", "Florida")
+ function(patt,newtext) {
+ statenames <- tolower(statenames)
+ statenames[grepl(patt,statenames)] <- newtext
+ statenames
+ }
+ })
>
> replaceStateNames("columbia","DC")
[1] "alabama" "alaska" "arizona" "arkansas" "california"
[6] "colorado" "connecticut" "delaware" "DC" "florida"
> replaceStateNames("alaska","palincountry")
[1] "alabama" "palincountry" "arizona"
[4] "arkansas" "california" "colorado"
[7] "connecticut" "delaware" "district of columbia"
[10] "florida"
> replaceStateNames("florida","jebbushland")
[1] "alabama" "alaska" "arizona"
[4] "arkansas" "california" "colorado"
[7] "connecticut" "delaware" "district of columbia"
[10] "jebbushland"
>
But to generalize, you can replace statenames with your data frame definition, and return a function (or list of functions) which uses this data frame without having to pass it as an argument to the function call. Example (but note I've used the ignore.case=TRUE argument in grepl):
> replaceStateNames <- local({
+ statenames <- c("Alabama", "Alaska", "Arizona", "Arkansas",
+ "California", "Colorado", "Connecticut", "Delaware",
+ "District of Columbia", "Florida")
+ list(justreturn=function(patt,newtext) {
+ statenames[grepl(patt,statenames,ignore.case=TRUE)] <- newtext
+ statenames
+ },reassign=function(patt,newtext) {
+ statenames <<- replace(statenames,grepl(patt,statenames,ignore.case=TRUE),newtext)
+ statenames
+ })
+ })
Just like the first example:
> replaceStateNames$justreturn("columbia","DC")
[1] "Alabama" "Alaska" "Arizona" "Arkansas" "California"
[6] "Colorado" "Connecticut" "Delaware" "DC" "Florida"
Just return the lexically scoped value of statenames, to check that the original values are unchanged:
> replaceStateNames$justreturn("shouldnotmatch","anythinghere")
[1] "Alabama" "Alaska" "Arizona"
[4] "Arkansas" "California" "Colorado"
[7] "Connecticut" "Delaware" "District of Columbia"
[10] "Florida"
Do the same thing, but make the change "permanent":
> replaceStateNames$reassign("columbia","DC")
[1] "Alabama" "Alaska" "Arizona" "Arkansas" "California"
[6] "Colorado" "Connecticut" "Delaware" "DC" "Florida"
And note that the value of statenames attached to these functions has changed.
> replaceStateNames$justreturn("shouldnotmatch","anythinghere")
[1] "Alabama" "Alaska" "Arizona" "Arkansas" "California"
[6] "Colorado" "Connecticut" "Delaware" "DC" "Florida"
In any case, you can replace statenames with a data frame, and these simple functions with a "merge map" or any other mapping you desire.
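For instance, here's a rough sketch of a closure that captures such a mapping as a data frame; the map below is a hard-coded toy, and it assumes the cleanStateNames() function from your question:

lookupStateId <- local({
  ## toy mapping; in practice this would be the real FIPS merge map
  map <- data.frame(state = c("alabama", "alaska", "DC", "florida"),
                    id    = c(1, 2, 9, 10),
                    stringsAsFactors = FALSE)
  function(x) map$id[match(cleanStateNames(x), map$state)]
})

lookupStateId(c("Alaska", "District of Columbia"))   ## 2 9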
Edit
Speaking of "merge", is this what you're looking for? An implementation of the first ?merge example using a closure:
> authors <- data.frame(surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
+ nationality = c("US", "Australia", "US", "UK", "Australia"),
+ deceased = c("yes", rep("no", 4)))
> books <- data.frame(name = I(c("Tukey", "Venables", "Tierney",
+ "Ripley", "Ripley", "McNeil", "R Core")),
+ title = c("Exploratory Data Analysis",
+ "Modern Applied Statistics ...",
+ "LISP-STAT",
+ "Spatial Statistics", "Stochastic Simulation",
+ "Interactive Data Analysis",
+ "An Introduction to R"),
+ other.author = c(NA, "Ripley", NA, NA, NA, NA,
+ "Venables & Smith"))
>
> mergewithauthors <- with(list(authors=authors),function(books)
+ merge(authors, books, by.x = "surname", by.y = "name"))
>
> mergewithauthors(books)
surname nationality deceased title other.author
1 McNeil Australia no Interactive Data Analysis <NA>
2 Ripley UK no Spatial Statistics <NA>
3 Ripley UK no Stochastic Simulation <NA>
4 Tierney US no LISP-STAT <NA>
5 Tukey US yes Exploratory Data Analysis <NA>
6 Venables Australia no Modern Applied Statistics ... Ripley
Edit 2
To read a file into an object which will be lexically bound, you can either do
fn <- local({
  data <- read.csv("filename.csv")
  function(...) {
    ...
  }
})
or
fn <- with(list(data = read.csv("filename.csv")),
  function(...) {
    ...
  }
)
or
fn <- with(local({ data <- read.csv("filename.csv"); environment() }),  ## environment() exposes `data` to with()
  function(...) {
    ...
  }
)
and so on. (I assume the function(...) will have to do with your "merge_map".) You can also use evalq in place of local. To "bring in" objects residing in the global space (or enclosing environment), you can just do the following
globalobj <- value  ## could be from read.csv()

fn <- local({
  localobj <- globalobj  ## if globalobj is not locally defined,
                         ## R will look in the enclosing environment,
                         ## in this case the globalenv()
  function(...) {
    ...
  }
})
then modifying globalobj later will not change the localobj attached to the function (since R copies objects on modification, which is effectively pass-by-value semantics). You can also use with instead of local, as shown in the examples above.