In some cases, rvest keeps the connection open?

Posted 2019-05-28 23:49

Question:

rvest seems to leave TCP file descriptors in the CLOSE_WAIT state in some cases.

To reproduce, run the following in an R session:

library(rvest)
k<-html("http://materialresourcing.com/tags/diy")

The call returns normally.

Do NOT close the R session.

Now open a regular terminal and run:

lsof | grep materialresourcing

You will find a TCP socket stuck in the CLOSE_WAIT state indefinitely (i.e. until the R session is closed).

The rvest API doesn't seem to offer any way to force the connection closed.

Any thoughts? Thanks!

UPDATE:

In fact, just doing:

library(curl)
k<-curl_download("http://materialresourcing.com/tags/diy", "a.txt")

also causes it.

UPDATE IN RESPONSE TO @jeroen's QUESTIONS


> library(curl)
> h <- new_handle()
> l2<-"http://rediff.com"
> curl_download(l2,"b.txt",handle=h)
> system("lsof | grep rediff")
> l3<-"http://materialresourcing.com/tags/diy"
> curl_download(l3,"c.txt",handle=h)
> system("lsof | grep materialresourcing")
R         17936          xxxx    7u     IPv4            3559514      0t0     TCP ip-10-0-xx:40844->vps.materialresourcing.com:http (CLOSE_WAIT)
sh        18257          xxxx    7u     IPv4            3559514      0t0     TCP ip-10-0-xx:40844->vps.materialresourcing.com:http (CLOSE_WAIT)
grep      18259          xxxx    7u     IPv4            3559514      0t0     TCP ip-10-0-xx:40844->vps.materialresourcing.com:http (CLOSE_WAIT)
> 

As you can see, CLOSE_WAIT shows up only for this link. The same thing happens with rvest, and gc() has no effect.

Answer 1:

The connection is kept alive deliberately by the curl handle, so that it can be reused. The garbage collector automatically closes the connection when the handle object is deleted or goes out of scope. A simple demonstration:

# Connection is stored on a handle object
library(curl)
h <- new_handle()

# Connection is kept alive
curl_download("http://materialresourcing.com/tags/diy", "a.txt", handle = h)
system("lsof | grep materialresourcing")

# Still there :)
gc()
system("lsof | grep materialresourcing")

# Still there :)
rm(h)
system("lsof | grep materialresourcing")

# after handle gone, garbage collector closes connection
gc()
system("lsof | grep materialresourcing")
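The rm()/gc() sequence above relies on a finalizer: cleanup code that R runs when the garbage collector reclaims an object that is no longer referenced. The base-R sketch below imitates that behaviour offline, with a plain environment standing in for the curl handle. The names state, fake_handle and the open flag are hypothetical and only illustrate the general mechanism; they say nothing about how curl implements it internally.

```r
# Hypothetical stand-in for an open connection.
state <- new.env()
assign("open", TRUE, envir = state)

# Hypothetical stand-in for a curl handle.
fake_handle <- new.env()

# Ask R to "close the connection" once fake_handle is garbage collected.
reg.finalizer(fake_handle, function(e) assign("open", FALSE, envir = state))

# The object is still referenced, so gc() must leave it alone.
invisible(gc())
open_while_referenced <- get("open", envir = state)

# Drop the only reference; the next gc() is free to run the finalizer.
rm(fake_handle)
invisible(gc())
open_after_removal <- get("open", envir = state)
```

Comparing open_while_referenced with open_after_removal should show the flag flipping only after rm(), which mirrors the lsof output changing only after rm(h) and gc().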

The garbage collector runs automatically every now and then in R, so curl cleans up after itself. However, some packages that build on the curl package deliberately keep the handle around to provide persistence or better performance. In that case, the connection stays open until the package deletes its internal handle from the pool.
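When the handle is under your own control, one way to avoid a lingering connection is to create it inside a function, so it becomes unreachable as soon as the function is done with it. This is a minimal sketch, assuming the curl package is installed. fetch_once is a hypothetical helper name, not part of curl or rvest, and it cannot help when another package keeps its own handle alive.

```r
library(curl)

# Hypothetical helper: download with a handle that never leaves this function.
fetch_once <- function(url, destfile) {
  h <- new_handle()                         # private handle, not shared with other calls
  curl_download(url, destfile, handle = h)  # the connection lives on this handle
  rm(h)                                     # drop the only reference to the handle
  invisible(gc())                           # ask the collector to run so the handle can be reclaimed
  invisible(destfile)                       # return the path of the downloaded file
}
```

Calling fetch_once("http://materialresourcing.com/tags/diy", "a.txt") and then running system("lsof | grep materialresourcing") is one way to check whether the CLOSE_WAIT entry is gone.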



Tags: r rvest