I want to write a Greasemonkey script that, while I am on URL_1, fetches and parses the whole HTML page of URL_2 in the background in order to extract a text element from it.
To be specific, I want to download the page's HTML (a Rotten Tomatoes page) in the background, store it in a variable, and then use getElementsByClassName(...)[0] to extract the text I want from the element with class name "critic_consensus".
I found "HTML in XMLHttpRequest" on MDN, so I ended up with this code, which unfortunately doesn't work:
var xhr = new XMLHttpRequest();
xhr.onload = function() {
alert(this.responseXML.getElementsByClassName("critic_consensus")[0].innerHTML);
}
xhr.open("GET", "http://www.rottentomatoes.com/m/godfather/",true);
xhr.responseType = "document";
xhr.send();
It shows this error message when I run it in Firefox Scratchpad:
Cross-Origin Request Blocked: The Same Origin Policy disallows reading
the remote resource at http://www.rottentomatoes.com/m/godfather/.
This can be fixed by moving the resource to the same domain or
enabling CORS.
PS. The reason why I don't use the Rotten Tomatoes API is that they've removed the critics consensus from it.
For cross-origin requests where the fetched site has not set a permissive CORS policy, Greasemonkey provides the GM_xmlhttpRequest() function. (Most other userscript engines also provide it.) GM_xmlhttpRequest is expressly designed to allow cross-origin requests.
To get your target information, create a DOMParser on the result. Do not use jQuery methods, as they will cause extraneous images, scripts, and objects to load, slowing things down or crashing the page.
Here's a complete script that illustrates the process:
// ==UserScript==
// @name _Parse Ajax Response for specific nodes
// @include http://stackoverflow.com/questions/*
// @require http://ajax.googleapis.com/ajax/libs/jquery/2.1.0/jquery.min.js
// @grant GM_xmlhttpRequest
// ==/UserScript==
GM_xmlhttpRequest ( {
method: "GET",
url: "http://www.rottentomatoes.com/m/godfather/",
onload: function (response) {
var parser = new DOMParser ();
/* IMPORTANT!
1) For Chrome, see
https://developer.mozilla.org/en-US/docs/Web/API/DOMParser#DOMParser_HTML_extension_for_other_browsers
for a work-around.
2) jQuery.parseHTML() and similar are bad because they cause images, etc., to be loaded.
*/
var doc = parser.parseFromString (response.responseText, "text/html");
var criticTxt = doc.getElementsByClassName ("critic_consensus")[0].textContent;
$("body").prepend ('<h1>' + criticTxt + '</h1>');
},
onerror: function (e) {
console.error ('**** error ', e);
},
onabort: function (e) {
console.error ('**** abort ', e);
},
ontimeout: function (e) {
console.error ('**** timeout ', e);
}
} );
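One caveat with the script above: if the page layout changes and no element has the "critic_consensus" class, indexing [0] gives undefined and reading .textContent throws. A minimal sketch of a guarded lookup follows; the helper name extractConsensus is hypothetical (it is not part of Greasemonkey or the DOM API), and it only assumes that doc is the document returned by parseFromString().

```javascript
// Hypothetical helper: returns the trimmed consensus text,
// or null when the page has no element with that class.
function extractConsensus(doc) {
    var nodes = doc.getElementsByClassName("critic_consensus");
    if (!nodes || nodes.length === 0) {
        return null;
    }
    return nodes[0].textContent.trim();
}
```

Inside onload you could then write something like var criticTxt = extractConsensus(doc); and only prepend the heading when the result is not null. Since the text comes from another site, building the heading with $("<h1>").text(criticTxt) rather than string concatenation is probably the safer choice.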
The problem is: XMLHttpRequest cannot load http://www.rottentomatoes.com/m/godfather/. No 'Access-Control-Allow-Origin' header is present on the requested resource.
Because you are not the owner of the resource, you cannot set this header.
What you can do is set up a proxy on Heroku that forwards all requests to the Rotten Tomatoes website.
Here is a small Node.js proxy: https://gist.github.com/igorbarinov/a970cdaf5fc9451f8d34
var https = require('https'),
http = require('http'),
util = require('util'),
path = require('path'),
fs = require('fs'),
colors = require('colors'),
url = require('url'),
httpProxy = require('http-proxy'),
dotenv = require('dotenv');
dotenv.load();
var proxy = httpProxy.createProxyServer({});
var host = "www.rottentomatoes.com";
var port = Number(process.env.PORT || 5000);
process.env.NODE_TLS_REJECT_UNAUTHORIZED = "0";
var server = require('http').createServer(function(req, res) {
// You can define here your custom logic to handle the request
// and then proxy the request.
var path = url.parse(req.url, true).path;
req.headers.host = host;
res.setHeader("Access-Control-Allow-Origin", "*");
proxy.web(req, res, {
target: "http://"+host+path,
});
}).listen(port);
proxy.on('proxyRes', function (res) {
console.log('RAW Response from the target', JSON.stringify(res.headers, null, 2));
});
console.log('Proxying to '+ host +'. Server'.blue + ' started '.green.bold + 'on port '.blue + port);
I modified the code from https://github.com/massive/firebase-proxy/ for this.
I published the proxy at http://peaceful-cove-8072.herokuapp.com/ and you can test it at http://peaceful-cove-8072.herokuapp.com/m/godfather
Here is a fiddle to test it: http://jsfiddle.net/uuw8nryy/
var xhr = new XMLHttpRequest();
xhr.onload = function() {
alert(this.responseXML.getElementsByClassName("critic_consensus")[0]);
}
xhr.open("GET", "http://peaceful-cove-8072.herokuapp.com/m/godfather",true);
xhr.responseType = "document";
xhr.send();
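If you would rather not depend on http-proxy, the same idea (forward the request, then add the Access-Control-Allow-Origin header to the response) can be sketched with Node's built-in http module alone. This is a rough sketch, not the code running on the Heroku app above: the names corsHeaders, upstreamOptions and startProxy are made up for illustration, and it assumes the target is reachable over plain HTTP as in the original proxy.

```javascript
const http = require("http");

const TARGET_HOST = "www.rottentomatoes.com";

// Copy the upstream response headers and add a permissive CORS header.
function corsHeaders(upstreamHeaders) {
    return Object.assign({}, upstreamHeaders, { "access-control-allow-origin": "*" });
}

// Build the options for the outgoing request, rewriting the Host header.
function upstreamOptions(req) {
    return {
        host: TARGET_HOST,
        path: req.url,
        method: req.method,
        headers: Object.assign({}, req.headers, { host: TARGET_HOST })
    };
}

// Start the proxy; nothing listens until this function is called.
function startProxy(port) {
    return http.createServer(function (req, res) {
        const upstream = http.request(upstreamOptions(req), function (upstreamRes) {
            res.writeHead(upstreamRes.statusCode, corsHeaders(upstreamRes.headers));
            upstreamRes.pipe(res);
        });
        upstream.on("error", function (err) {
            if (!res.headersSent) {
                res.writeHead(502, { "Content-Type": "text/plain" });
            }
            res.end("Upstream error: " + err.message);
        });
        req.pipe(upstream);
    }).listen(port);
}
```

Calling startProxy(Number(process.env.PORT || 5000)) would start it on the same port the original script uses. Keep in mind that a header value of "*" lets any site read responses through your proxy.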
The JavaScript same-origin policy prevents you from accessing content that belongs to a different domain.
The reference above also describes four techniques for relaxing this rule (CORS being one of them).
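To make "different domain" concrete: two URLs count as the same origin only when scheme, host, and port all match. A small sketch of that comparison using the standard URL class follows; the function name sameOrigin is just for illustration, and the Stack Overflow URL is a placeholder for whatever page the script runs on.

```javascript
// Two URLs share an origin when scheme, host and port are all equal.
function sameOrigin(urlA, urlB) {
    return new URL(urlA).origin === new URL(urlB).origin;
}

// The page running the script and the page being fetched differ in host,
// which is why the browser blocks the plain XMLHttpRequest.
var blocked = !sameOrigin(
    "http://stackoverflow.com/questions/1",
    "http://www.rottentomatoes.com/m/godfather/"
);
```

Note that the path plays no part in the comparison, while a change of scheme (http versus https) alone is enough to make two URLs cross-origin.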