When I connect to my site with Mathematica (Import["mysite","Data"]) and look at my Apache log, I see:
99.XXX.XXX.XXX - - [22/May/2011:19:36:28 +0200] "GET / HTTP/1.1" 200 6268 "-" "Mathematica/8.0.1.0.0 PM/1.3.1"
Could I set it to look like this instead (as it does when I connect with a real browser)?
99.XXX.XXX.XXX - - [22/May/2011:19:46:17 +0200] "GET /favicon.ico HTTP/1.1" 404 183 "-" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.68 Safari/534.24"
You can also use J/Link to make your web requests or call curl or wget on the command line.
Mathematica 9 has the new URLFetch function. It has the option UserAgent.
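A minimal sketch of that option in use, assuming Mathematica 9 or later. The URL is a placeholder, and the user-agent string is simply the browser string from the question's log.

```mathematica
(* Requires Mathematica 9 or later, where URLFetch was introduced. *)
(* The URL is a placeholder; substitute your own site. *)
userAgent = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.68 Safari/534.24";

(* Fetch the response body as a string, sending the custom User-Agent header. *)
body = URLFetch["http://www.example.com/", "UserAgent" -> userAgent];
```

The Apache access log should then show the supplied string in place of "Mathematica/...".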
Here is a way to use the Apache HTTP client through JLink:
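A minimal sketch of such a function, written to match the description under "How It Works". The class names (org.apache.commons.httpclient.*) and the "http.useragent" parameter key are assumptions based on the Commons HttpClient 3.x API; they only work if that library is on the JLink class path, as it reportedly is in Mathematica 7 and 8.

```mathematica
Needs["JLink`"]
InstallJava[];  (* starts the Java runtime if it is not already running *)

ClearAll[urlString]
urlString[userAgent_String, url_String] :=
  JavaBlock[
    Module[{http, params, get, result},
      http = JavaNew["org.apache.commons.httpclient.HttpClient"];
      params = http@getParams[];
      (* setParameter expects a Java Object, so wrap the Mathematica string explicitly *)
      params@setParameter["http.useragent", MakeJavaObject[userAgent]];
      get = JavaNew["org.apache.commons.httpclient.methods.GetMethod", url];
      http@executeMethod[get];
      result = get@getResponseBodyAsString[];
      get@releaseConnection[];
      result
    ]
  ]
```

The function returns the response body as an ordinary Mathematica string; the Java objects created along the way are released when JavaBlock exits.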
You can use this function as follows:
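A usage sketch, assuming urlString takes the user-agent string first and the URL second. Both the argument order and the URL are assumptions made for illustration.

```mathematica
(* Assumes urlString[userAgent, url] is already defined as described in this answer. *)
browserAgent = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.68 Safari/534.24";

(* Placeholder URL; the result is the page source as a string. *)
html = urlString[browserAgent, "http://www.example.com/"];

(* Look at the first few hundred characters to check the response. *)
StringTake[html, Min[300, StringLength[html]]]
```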
You can feed the result to ImportString if desired.
A streaming approach would be possible using more elaborate code, but the string-based approach taken above is probably good enough unless the target web resource is very large.
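Passing a fetched string to ImportString, as mentioned above, might look like the following sketch. The literal HTML is only a stand-in for a real response, and the "HTML" format is named explicitly here rather than leaving ImportString to guess it.

```mathematica
(* A literal stand-in for the string returned by a web request. *)
html = "<html><body><table><tr><td>1</td><td>2</td></tr></table></body></html>";

(* Extract the tabular/list data, the same element the question requested with Import. *)
data = ImportString[html, {"HTML", "Data"}]
```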
I tried this code in Mathematica 7 and 8, and I expect that it works in v6 as well. Beware that there is no guarantee that Mathematica will always include the Apache HTTP client in future releases.
How It Works
Despite being expressed in Mathematica, the solution is essentially implemented in Java. Mathematica ships with a Java runtime environment built-in and the bridge between Mathematica and Java is a component called JLink.
As is typical of such cross-technology solutions, there is a fair amount of complexity even when there is not much code. It is beyond the scope of this answer to discuss how the code works in detail, but a few items will be emphasized as suggestions for further reading.
The code uses the Apache HTTP client. This Java library was chosen because it ships as an unadvertised part of the standard Mathematica distribution, and it also happens to be the one that Import appears to use internally.
The whole body of urlString is wrapped in JavaBlock. This ensures that any Java objects that are created over the course of operation are properly released by co-ordinating the activities of the Java and Mathematica memory managers.
JavaNew is used to create the relevant Apache HTTP client objects, HttpClient and GetMethod. Java expressions like http.getParams() are expressed in JLink as http@getParams[]. The Java classes and methods are documented in the Apache HTTP client documentation.
The use of MakeJavaObject is somewhat unusual. It is required in this case as a Mathematica string is being passed as an argument where a Java Object is expected. If a Java String were expected, JLink would automatically create one. But JLink is unable to make this inference when Object is expected, so MakeJavaObject is used to give JLink a hint.
What about URLTools?
Incidentally, the first thing I tried to answer this question was to use Utilities`URLTools`FetchURL. It looked very promising since it takes an option called "RequestHeaderFields". Alas, this did not work because the present implementation of that function uses that option only for HTTP POST verbs, not GET. Perhaps some future version of Mathematica will support the option for GET.
I'm extremely lazy and curl is more flexible in less code than J/Link, without the object management issues. This is an example of posting data (userPass) to a URL and retrieving the result in JSON format.
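A minimal sketch of the curl approach, assuming curl is on the system path. The helper name curlPostJSON, the URL, and the credential string are hypothetical; -s, -A and -d are curl's standard flags for silent mode, User-Agent, and POST data.

```mathematica
(* Hypothetical helper: POST form data with curl and import the reply as JSON. *)
(* -s silences progress output, -A sets the User-Agent, -d sends POST data. *)
curlPostJSON[url_String, postData_String, userAgent_String] :=
  Import[
    "!curl -s -A \"" <> userAgent <> "\" -d \"" <> postData <> "\" \"" <> url <> "\"",
    "JSON"
  ]

(* Placeholder values; userPass stands for the posted credentials. *)
userPass = "user=alice&pass=secret";
curlPostJSON["http://www.example.com/login", userPass, "Mozilla/5.0 (X11; Linux i686)"]
```

The arguments are spliced into a shell command without escaping, so this sketch is only suitable for trusted input that contains no double quotes.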
I isolate this kind of thing in an impure function (unless it is pure) so I know it's tainted, but any web access is that way.
Because I use a pipe, MMA cannot deduce the type of file. ref/Import mentions that « Import["!prog","format"] imports data from a pipe. » and « The format of a file is by default deduced from the file extension in its name, or by FileFormat from its contents. » As a result, it is necessary to specify "CSV", "JSON", etc. as the format parameter. You'll see some strange results otherwise.
From the curl and libcurl welcome page.
Mathematica does all of its internet connectivity through a user specified proxy server. If, as Sjoerd suggested, setting one up is too much work, you might want to consider writing the call in C/C++, and then calling that from Mathematica. I don't doubt there are plenty of C libraries that do what you want in a few lines of code.
For calling C code within Mathematica, see the C Language Interface documentation
As far as I know you can't change the user agent string in Mathematica. I once used a proxy server (CNTLM) to get Mathematica to talk with a firewall which used NTLM authentication (which Mathematica doesn't support). CNTLM also allows you to set the user agent string.
You can find it at http://cntlm.sourceforge.net/. Basically, you set up this proxy server to run on your own machine and enter its port number and IP address in the Mathematica network settings. The proxy adds the user agent string and handles the NTLM authentication. I'm not sure how it works if you don't have an NTLM firewall. There are other free proxies around that might work for you.
EDIT The Squid HTTP proxy seems to do what you want. It has the request_header_replace configuration directive, which allows you to change the contents of request headers.
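A sketch of the relevant squid.conf lines, assuming Squid 3.x directive names (older 2.x releases used header_access and header_replace). Squid only substitutes a header that has first been denied, so the two directives go together. The user-agent value is the browser string from the question.

```
# Strip the User-Agent header sent by the client, then substitute a fixed value.
request_header_access User-Agent deny all
request_header_replace User-Agent Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.68 Safari/534.24
```

Mathematica would then need to be pointed at the Squid instance in its network proxy settings.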