I am trying to write a JavaScript program that will grab the innerHTML of the top news story on the BBC website (http://www.bbc.co.uk/news) and put it in a .txt document.
I don't know much about JavaScript; I know more about .BAT and .VBS, but I know they can't do this.
I'm not sure how to approach this.
I thought of making it scan for a fixed outerHTML string and then copy the inner one to the txt file.
However, I can't seem to find an outerHTML string that stays the same every day. For example, this is today's title:
<span class="title-link__title-text">Benefit plan 'could hit young Britons'</span>
As you can see, it has the headline incorporated.
I'm using Firefox, if that makes a difference.
Any help would be much appreciated.
Regards,
Master-chip.
Pure client-side browser approach:
OK, I made this fiddle for you; it may help others too. This was interesting and challenging for me. Here are the points on how I arrived at a possible solution:
- Used the Blob API (part of the HTML5 File API) to create the text file on the fly.
- Loaded http://www.bbc.co.uk/news in an iframe (cross-domain origin access - see the Note section below).
- On the iframe's load event, trigger the grab with setTimeout, or with setInterval (commented out) for repeated execution hourly or daily; adjust the delay as needed.
- Querying the text nodes with document.querySelectorAll(".title-link span") seemed generic enough, based on examining the page source.
- Check out the fiddle link.
Javascript:
(function () {
    var textFile = null;

    var makeTextFile = function (text) {
        var data = new Blob([text], {
            type: 'text/plain'
        });
        // If we are replacing a previously generated file we need to
        // manually revoke the object URL to avoid memory leaks.
        if (textFile !== null) {
            window.URL.revokeObjectURL(textFile);
        }
        textFile = window.URL.createObjectURL(data);
        return textFile;
    };

    var iframe = document.getElementById('frame');

    var commFunc = function () {
        // Re-query the iframe to get the freshly updated DOM
        var iframe2 = document.getElementById('frame');
        var innerDoc = iframe2.contentDocument || iframe2.contentWindow.document;
        var getAll = Array.prototype.slice.call(innerDoc.querySelectorAll(".title-link span"));
        var dummy = "";
        for (var i = 0; i < getAll.length; i++) {
            dummy = dummy.concat("\n" + getAll[i].innerText);
        }
        var link = document.createElement("a");
        link.href = makeTextFile(dummy);
        link.download = "sample.txt";
        link.click();
        console.log("Downloaded the sample.txt file");
    };

    iframe.onload = function () {
        setTimeout(commFunc, 1000); // Adjust the delay to the page's load time
        //setInterval(commFunc, 1000); // Uncomment for repeated grabbing
    };

    // Click the button once the page inside the iframe has loaded
    document.getElementById('create').addEventListener('click', commFunc);
})();
HTML:
<div>
<iframe id="frame" src="http://www.bbc.co.uk/news"></iframe>
</div>
<button id="create">Download</button>
Note:
- To run the above JavaScript in Chrome you need to disable web security (typically by launching Chrome with the --disable-web-security flag and a temporary --user-data-dir). The script should run fine in Firefox, no tweaks needed.
- This is an illustration of what can be achieved with pure browser scripting. The tab must stay active for periodic grabbing.
- Targeted at modern browsers.
Suggested approach:
Use a Node.js server - you can modify the above script to run standalone (see the scheduling sketch after the dependency list below).
Or use any server-side scripting framework such as PHP, Java Spring, etc.
Node.js approach:
Javascript:
var jsdom = require("node-jsdom");
var fs = require("fs");

jsdom.env({
    url: "http://www.bbc.co.uk/news",
    scripts: ["http://code.jquery.com/jquery.js"],
    done: function (errors, window) {
        var $ = window.$;
        console.log("BBC headlines");
        $(".title-link span").each(function () {
            //console.log(" -", $(this).text());
            // appendFileSync creates sample.txt if it does not exist yet,
            // so no existsSync check is needed; the synchronous call also
            // keeps the headlines in page order.
            fs.appendFileSync("sample.txt", "\r" + $(this).text());
        });
    }
});
Dependencies for the above code:
- Node.js
- jsdom (the node-jsdom package)
- jQuery
- the Node.js fs (filesystem) module, which is built in
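If you want the repeated hourly/daily grabbing as a standalone job, here is a minimal sketch that wraps the jsdom call in a function and schedules it with setInterval. It is untested against the live page, and grabHeadlines and the one-hour interval are just a name and a value I picked for illustration:

var jsdom = require("node-jsdom");
var fs = require("fs");

// Hypothetical wrapper: fetch the page and append the current headlines.
function grabHeadlines() {
    jsdom.env({
        url: "http://www.bbc.co.uk/news",
        scripts: ["http://code.jquery.com/jquery.js"],
        done: function (errors, window) {
            if (errors) {
                return console.error(errors);
            }
            var $ = window.$;
            var lines = [];
            $(".title-link span").each(function () {
                lines.push($(this).text());
            });
            // appendFileSync creates sample.txt on the first run.
            fs.appendFileSync("sample.txt", lines.join("\n") + "\n");
        }
    });
}

grabHeadlines();                             // run once immediately
setInterval(grabHeadlines, 60 * 60 * 1000); // then once an hour

A cron job that runs the script once per invocation would avoid keeping a Node process alive, if that fits your setup better.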
Hope this helps you and others too.
My thoughts -
JS in the browser can be used to get data/text from pages, but to save it into a file automatically you need something on the backend, like Python or PHP.
Why use JS at all? You can scrape the web very well using cURL. Use PHP's cURL bindings if that's easier for you.
You can scrape/download the webpage using -
<?php
// Defining the basic cURL function
function curl($url) {
$ch = curl_init(); // Initialising cURL
curl_setopt($ch, CURLOPT_URL, $url); // Setting cURL's URL option with the $url variable passed into the function
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Setting cURL's option to return the webpage data
$data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
curl_close($ch); // Closing cURL
return $data; // Returning the data from the function
}
?>
Then use the function at your discretion -
<?php
$scraped_website = curl("http://www.yahoo.com"); // Scrape http://www.yahoo.com and keep the returned HTML
file_put_contents("scraped.txt", $scraped_website); // Save the scraped HTML to a text file
?>
Reference Links -
Web scraping with PHP and cURL
Scraping in PHP with cURL
You can scrape more precisely by targeting specific divs and nodes of the HTML.
Check these out - Part1 - Part2 - Part3
Hope it helps. Happy Coding!
You want to download a txt file with content from HTML? If that's right, you can use this to create a txt file and download it. If you want the text from all title spans, do this:
var txt = "";
var nodeList = document.querySelectorAll(".title-link__title-text");
for (var i = 0; i < nodeList.length; i++) {
    txt += "\n" + nodeList[i].innerText;
}
Then write the txt variable to a file, like in the post I mentioned above; a minimal sketch of that step follows.
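For completeness, here is that write step reusing the Blob technique from the first answer (headlines.txt is just an example file name):

var blob = new Blob([txt], { type: "text/plain" });
var link = document.createElement("a");
link.href = window.URL.createObjectURL(blob);
link.download = "headlines.txt"; // example file name
document.body.appendChild(link); // some browsers need the link in the DOM before click()
link.click();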