I have this for loop in an R script:
library(rvest)  # html_session(), html_nodes(), html_attr(), %>%
library(httr)   # config()

url <- "https://example.com"
page <- html_session(url, config(ssl_verifypeer = FALSE))

# Relative links to the files
links <- page %>%
  html_nodes("td") %>%
  html_nodes("tr") %>%
  html_nodes("a") %>%
  html_attr("href")

# File names to save under
base_names <- page %>%
  html_nodes("td") %>%
  html_nodes("tr") %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  basename()

# seq_along() is safe when links is empty, unlike 1:length(links)
for (i in seq_along(links)) {
  site <- html_session(
    URLencode(paste0("https://example.com", links[i])),
    config(ssl_verifypeer = FALSE))
  writeBin(site$response$content, base_names[i])
}
This loops through the links and downloads each text file to my working directory. I'm wondering if I can put a return somewhere, so that it returns the document.
The reason is that I'm executing my script in NiFi (using ExecuteProcess), and it's not sending my scraped documents down the line. Instead, it just shows the head of my R script. I assume you would wrap the for loop in a fun <- function(x) {}, but I'm not sure how to integrate the x into an already working scraper.
I need it to return documents down the flow, and not just this:
Processor config:
Even if you are not familiar with NiFi, it would be a great help on the R part! Thanks
If your intent is to both (1) save the output (with writeBin) and (2) return the values (in a list), then try this:
out <- Map(function(ln, bn) {
  site <- html_session(
    URLencode(paste0("https://example.com", ln)),
    config(ssl_verifypeer = FALSE))
  writeBin(site$response$content, bn)  # save to disk
  site$response$content                # and return the raw content
}, links, base_names)
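To see the save-and-return pattern without touching the network, here is a minimal offline sketch. The payloads and file names are hypothetical stand-ins for site$response$content and base_names, and the files go to a temporary directory rather than the working directory.

```r
# Hypothetical stand-ins for the scraped data: raw payloads and target paths.
payloads   <- list(charToRaw("first document"), charToRaw("second document"))
file_names <- file.path(tempdir(), c("a.txt", "b.txt"))

out <- Map(function(content, path) {
  writeBin(content, path)  # side effect: write the bytes to disk
  content                  # the last expression is the value kept in the list
}, payloads, file_names)

length(out)              # one element per (payload, path) pair
rawToChar(out[[1]])      # the content is also available in memory
file.exists(file_names)  # and the files were written
```

In the scraper above, links is a character vector, so Map should also name the elements of out after the links, which makes it easy to look up one document with out[["some/link"]].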
The use of Map "zips" together the individual elements. In the base case with a single list, the following are equivalent:
Map(myfunc, list1)
lapply(list1, myfunc)
But if you want to use same-index elements from multiple lists, you can do either of
lapply(seq_along(list1), function(i) myfunc(list1[[i]], list2[[i]], list3[[i]]))
Map(myfunc, list1, list2, list3)
where unrolling Map results effectively in:
myfunc(list1[[1]], list2[[1]], list3[[1]])
myfunc(list1[[2]], list2[[2]], list3[[2]])
# ...
The biggest difference between lapply and Map here is that lapply iterates over only one vector or list, whereas Map accepts one or more (practically unlimited), zipping them together. The lists should all be the same length or length 1 (which is recycled), so it's legitimate to do something like
Map(myfunc, list1, list2, "constant string")
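A small runnable check of the equivalence and of the length-1 recycling. The function myfunc and the two lists are hypothetical and exist only for illustration; note that Map hands each element to the function as if extracted with [[i]].

```r
# Hypothetical function and inputs, used only for illustration.
myfunc <- function(a, b, suffix) paste0(a, "-", b, suffix)
list1 <- list("x", "y", "z")
list2 <- list(1, 2, 3)

# Map zips list1 and list2 and recycles the constant "!"
by_map <- Map(myfunc, list1, list2, "!")

# The same thing written as an index-based lapply
by_index <- lapply(seq_along(list1),
                   function(i) myfunc(list1[[i]], list2[[i]], "!"))

identical(by_map, by_index)  # TRUE
by_map[[2]]                  # "y-2!"
```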
Note: Map-versus-mapply is similar to lapply-versus-sapply. In both pairs, the first always returns a list, while the second simplifies the result to a vector (or matrix) if and only if every return value has the same length; otherwise it too returns a list.
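The simplification behaviour can be checked with a few toy calls. This is only a sketch with made-up inputs, not part of the scraper.

```r
# lapply always returns a list; sapply simplifies when it can.
as_list  <- lapply(1:3, function(i) i^2)
same_len <- sapply(1:3, function(i) i^2)        # every result has length 1
ragged   <- sapply(1:3, function(i) seq_len(i)) # lengths 1, 2, 3

is.list(as_list)   # TRUE
is.list(same_len)  # FALSE, a plain numeric vector
is.list(ragged)    # TRUE, could not be simplified

# Map always returns a list; mapply simplifies when it can.
m_list <- Map(function(a, b) a + b, 1:3, 4:6)
m_vec  <- mapply(function(a, b) a + b, 1:3, 4:6)

is.list(m_list)    # TRUE
m_vec              # 5 7 9
```

For downloaded file contents, which will generally differ in length, sticking with Map (or lapply) keeps the return type predictable.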