Import from web - set user agent in Mathematica

2019-03-12 10:06发布


when I connect to my site with Mathermatica (Import["mysite","Data"]) and look at my Apache log I see:
99.XXX.XXX.XXX - - [22/May/2011:19:36:28 +0200] "GET / HTTP/1.1" 200 6268 "-" "Mathematica/8.0.1.0.0 PM/1.3.1"
Could I set it to be something like this (when I connects with real browser):
99.XXX.XXX.XXX - - [22/May/2011:19:46:17 +0200] "GET /favicon.ico HTTP/1.1" 404 183 "-" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.68 Safari/534.24"

6条回答
该账号已被封号
2楼-- · 2019-03-12 10:09

You can also use J/Link to make your web requests or call curl or wget on the command line.

查看更多
我命由我不由天
3楼-- · 2019-03-12 10:17

Mathematica 9 has the new URLFetch function. It has the option UserAgent.

查看更多
爷的心禁止访问
4楼-- · 2019-03-12 10:24

Here is a way to use the Apache HTTP client through JLink:

Needs["JLink`"]

ClearAll@urlString
urlString[userAgent_String, url_String] :=
  JavaBlock@Module[{http, get}
  , http = JavaNew["org.apache.commons.httpclient.HttpClient"]
  ; http@getParams[]@setParameter["http.useragent", MakeJavaObject@userAgent]
  ; get = JavaNew["org.apache.commons.httpclient.methods.GetMethod", url]
  ; http@executeMethod[get]
  ; get@getResponseBodyAsString[]
  ]

You can use this function as follows:

$userAgent =
  "Mozilla/5.0 (X11;Linux i686) AppleWebKit/534.24 (KHTML,like Gecko) Chrome/11.0.696.68 Safari/534.24";

urlString[$userAgent, "http://www.htttools.com:8080/"]

You can feed the result to ImportString if desired:

ImportString[urlString[$userAgent, "mysite"], "Data"]

A streaming approach would be possible using more elaborate code, but the string-based approach taken above is probably good enough unless the target web resource is very large.

I tried this code in Mathematica 7 and 8, and I expect that it works in v6 as well. Beware that there is no guarantee that Mathematica will always include the Apache HTTP client in future releases.

How It Works

Despite being expressed in Mathematica, the solution is essentially implemented in Java. Mathematica ships with a Java runtime environment built-in and the bridge between Mathematica and Java is a component called JLink.

As is typical of such cross-technology solutions, there is a fair amount of complexity even when there is not much code. It is beyond the scope of this answer to discuss how the code works in detail, but a few items will be emphasized as suggestions for further reading.

The code uses the Apache HTTP client. This Java library was chosen because it ships as an unadvertised part of the standard Mathematica distribution -- and it also happens to be the one that Import appears to use internally.

The whole body of urlString is wrapped in JavaBlock. This ensures that any Java objects that are created over the course of operation are properly released by co-ordinating the activities of the Java and Mathematica memory managers.

JavaNew is used to create the relevant Apache HTTP client objects, HttpClient and GetMethod. Java expressions like http.getParams() are expressed in JLink as http@getParams[]. The Java classes and methods are documented in the Apache HTTP client documentation.

The use of MakeJavaObject is somewhat unusual. It is required in this case as a Mathematica string is being passed as an argument where a Java Object is expected. If a Java String was expected, JLink would automatically create one. But JLink is unable to make this inference when Object is expected, so MakeJavaObject is used to give JLink a hint.

What about URLTools?

Incidentally, the first thing I tried to answer this question was to use Utilities`URLTools`FetchURL. It looked very promising since it takes an option called "RequestHeaderFields". Alas, this did not work because the present implementation of that function uses that option only for HTTP POST verbs -- not GET. Perhaps some future version of Mathematica will support the option for GET.

查看更多
We Are One
5楼-- · 2019-03-12 10:25

I'm extremely lazy and curl is more flexible in less code than J/Link, without the object management issues. This is an example of posting data (userPass) to a url and retrieving the result in JSON format.

Import["!curl -A Mozilla/4.0 --data " <> userPass <> " " <> url, "JSON"]

I isolate this kind of thing in an impure function (unless it is pure) so I know it's tainted, but any web access is that way.

Because I use a pipe, MMA cannot deduce the type of file. ref/Import mentions that « Import["!prog","format"] imports data from a pipe. » and « The format of a file is by default deduced from the file extension in its name, or by FileFormat from its contents. » As a result, it is necessary to specify "CSV", "JSON", etc. as the format parameter. You'll see some strange results otherwise.

curl is a command line tool for transferring data with URL syntax, supporting DICT, FILE, FTP, FTPS, GOPHER, HTTP, HTTPS, IMAP, IMAPS, LDAP, LDAPS, POP3, POP3S, RTMP, RTSP, SCP, SFTP, SMTP, SMTPS, TELNET and TFTP. curl supports SSL certificates, HTTP POST, HTTP PUT, FTP uploading, HTTP form based upload, proxies, cookies, user+password authentication (Basic, Digest, NTLM, Negotiate, kerberos...), file transfer resume, proxy tunneling and a busload of other useful tricks.

From the curl and libcurl welcome page.

查看更多
地球回转人心会变
6楼-- · 2019-03-12 10:26

Mathematica does all of its internet connectivity through a user specified proxy server. If, as Sjoerd suggested, setting one up is too much work, you might want to consider writing the call in C/C++, and then calling that from Mathematica. I don't doubt there are plenty of C libraries that do what you want in a few lines of code.

For calling C code within Mathematica, see the C Language Interface documentation

查看更多
贪生不怕死
7楼-- · 2019-03-12 10:28

As far as I know you can't change the user agent string in Mathematica. I once used a proxy server (CNTLM) to get Mathematica to talk with a firewall which used NTLM authentication (which Mathematica doesn't support). CNTLM also allows you to set the user agent string.

You can find it at http://cntlm.sourceforge.net/. Basically, you set-up this proxy server to run on your own machine and set its port number and ip-address in the Mathematica network settings. The proxy adds user agent stuff and handles the NTLM authentication. Not sure how it works if you don't have a NTLM firewall. There are other free proxies around that might work for you.

EDIT The Squid http proxy seems to do what you want. It has the request_header_replace configuration directive which allows you to change the contents of request headers.

查看更多
登录 后发表回答