Question:
When I connect to my site with Mathematica (Import["mysite","Data"]) and look at my Apache log, I see:
99.XXX.XXX.XXX - - [22/May/2011:19:36:28 +0200] "GET / HTTP/1.1" 200 6268 "-" "Mathematica/8.0.1.0.0 PM/1.3.1"
Could I set the user agent to something like this instead (this is what is logged when I connect with a real browser)?
99.XXX.XXX.XXX - - [22/May/2011:19:46:17 +0200] "GET /favicon.ico HTTP/1.1" 404 183 "-" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.68 Safari/534.24"
Answer 1:
Mathematica 9 has the new URLFetch function. It has the option UserAgent.
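A minimal sketch of how that option might be used, assuming Mathematica 9 or later. The wrapper name fetchWithUserAgent and the variable ua are made up for illustration, and "mysite" is the placeholder URL from the question:

```mathematica
(* Hypothetical wrapper: fetch a URL's content as a string while sending a custom User-Agent. *)
fetchWithUserAgent[url_String, userAgent_String] :=
  URLFetch[url, "Content", "UserAgent" -> userAgent]

(* Browser-like user agent string taken from the question's Apache log. *)
ua = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.68 Safari/534.24";
```

The returned string could then be passed on, for example as ImportString[fetchWithUserAgent["mysite", ua], "Data"].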
Answer 2:
As far as I know you can't change the user agent string in Mathematica. I once used a proxy server (CNTLM) to get Mathematica to talk with a firewall which used NTLM authentication (which Mathematica doesn't support). CNTLM also allows you to set the user agent string.
You can find it at http://cntlm.sourceforge.net/. Basically, you set up this proxy server to run on your own machine and enter its port number and IP address in the Mathematica network settings. The proxy adds the user agent header and handles the NTLM authentication. I'm not sure how it works if you don't have an NTLM firewall. There are other free proxies around that might work for you.
EDIT: The Squid HTTP proxy seems to do what you want. It has the request_header_replace configuration directive, which allows you to change the contents of request headers.
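A sketch of what the relevant squid.conf lines might look like, assuming a Squid version that uses the request_header_access / request_header_replace directive names. As far as I know, request_header_replace only takes effect for headers that have first been denied with request_header_access:

```
# Strip the User-Agent header sent by the client (here: Mathematica) ...
request_header_access User-Agent deny all
# ... and substitute a fixed, browser-like string in its place.
request_header_replace User-Agent Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.68 Safari/534.24
```

Mathematica would then be pointed at the Squid instance through its proxy settings, in the same way as described for CNTLM above.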
Answer 3:
Here is a way to use the Apache HTTP client through JLink:
Needs["JLink`"]
ClearAll@urlString
urlString[userAgent_String, url_String] :=
  JavaBlock@Module[{http, get},
    (* Create the HTTP client and override its default User-Agent header *)
    http = JavaNew["org.apache.commons.httpclient.HttpClient"];
    http@getParams[]@setParameter["http.useragent", MakeJavaObject@userAgent];
    (* Issue the GET request and return the response body as a string *)
    get = JavaNew["org.apache.commons.httpclient.methods.GetMethod", url];
    http@executeMethod[get];
    get@getResponseBodyAsString[]
  ]
You can use this function as follows:
$userAgent =
"Mozilla/5.0 (X11;Linux i686) AppleWebKit/534.24 (KHTML,like Gecko) Chrome/11.0.696.68 Safari/534.24";
urlString[$userAgent, "http://www.htttools.com:8080/"]
You can feed the result to ImportString if desired:
ImportString[urlString[$userAgent, "mysite"], "Data"]
A streaming approach would be possible using more elaborate code, but the string-based approach taken above is probably good enough unless the target web resource is very large.
I tried this code in Mathematica 7 and 8, and I expect that it works in v6 as well. Beware that there is no guarantee that Mathematica will always include the Apache HTTP client in future releases.
How It Works
Despite being expressed in Mathematica, the solution is essentially implemented in Java. Mathematica ships with a Java runtime environment built-in and the bridge between Mathematica and Java is a component called JLink.
As is typical of such cross-technology solutions, there is a fair amount of complexity even when there is not much code. It is beyond the scope of this answer to discuss how the code works in detail, but a few items will be emphasized as suggestions for further reading.
The code uses the Apache HTTP client. This Java library was chosen because it ships as an unadvertised part of the standard Mathematica distribution, and it also happens to be the one that Import appears to use internally.
The whole body of urlString is wrapped in JavaBlock. This ensures that any Java objects that are created over the course of operation are properly released by co-ordinating the activities of the Java and Mathematica memory managers.
JavaNew is used to create the relevant Apache HTTP client objects, HttpClient and GetMethod. Java expressions like http.getParams() are expressed in JLink as http@getParams[]. The Java classes and methods are documented in the Apache HTTP client documentation.
The use of MakeJavaObject is somewhat unusual. It is required in this case because a Mathematica string is being passed as an argument where a Java Object is expected. If a Java String were expected, JLink would automatically create one. But JLink is unable to make this inference when Object is expected, so MakeJavaObject is used to give JLink a hint.
What about URLTools?
Incidentally, the first thing I tried to answer this question was to use Utilities`URLTools`FetchURL. It looked very promising since it takes an option called "RequestHeaderFields". Alas, this did not work because the present implementation of that function uses that option only for HTTP POST requests, not GET. Perhaps some future version of Mathematica will support the option for GET.
Answer 4:
I'm extremely lazy, and curl is more flexible in less code than J/Link, without the object management issues. This is an example of posting data (userPass) to a URL and retrieving the result in JSON format.
Import["!curl -A Mozilla/4.0 --data " <> userPass <> " " <> url, "JSON"]
I isolate this kind of thing in an impure function (unless it is pure) so I know it's tainted, but any web access is that way.
Because I use a pipe, MMA cannot deduce the type of file. ref/Import mentions that « Import["!prog","format"] imports data from a pipe. » and « The format of a file is by default deduced from the file extension in its name, or by FileFormat from its contents. » As a result, it is necessary to specify "CSV", "JSON", etc. as the format parameter. You'll see some strange results otherwise.
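For the browser-like user agent from the question, the string contains spaces and so has to be quoted on the command line. A minimal sketch, assuming curl is on the PATH and that neither argument contains double quotes; the helper name curlCommand is made up for illustration:

```mathematica
(* Hypothetical helper: builds the "!command" string that Import runs as a pipe. *)
(* -s suppresses curl's progress output, -A sets the User-Agent header. *)
curlCommand[userAgent_String, url_String] :=
  "!curl -s -A \"" <> userAgent <> "\" \"" <> url <> "\""
```

It would be used as Import[curlCommand[userAgent, url], "Text"], again with the format given explicitly because the input is a pipe.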
curl is a command line tool for transferring data with URL syntax, supporting DICT, FILE, FTP, FTPS, GOPHER, HTTP, HTTPS, IMAP, IMAPS, LDAP, LDAPS, POP3, POP3S, RTMP, RTSP, SCP, SFTP, SMTP, SMTPS, TELNET and TFTP. curl supports SSL certificates, HTTP POST, HTTP PUT, FTP uploading, HTTP form based upload, proxies, cookies, user+password authentication (Basic, Digest, NTLM, Negotiate, kerberos...), file transfer resume, proxy tunneling and a busload of other useful tricks.
From the curl and libcurl welcome page.
Answer 5:
Mathematica does all of its internet connectivity through a user-specified proxy server. If, as Sjoerd suggested, setting one up is too much work, you might want to consider writing the call in C/C++, and then calling that from Mathematica. I don't doubt there are plenty of C libraries that do what you want in a few lines of code.
For calling C code within Mathematica, see the C Language Interface documentation.
Answer 6:
You can also use J/Link to make your web requests or call curl or wget on the command line.