I have written a script using Google Apps Script to extract text from a web page into Google Sheets. I only need this script to work with a specific web page, so it does not need to be versatile. The script works almost exactly as I want it to except that I have run into a character encoding problem. I am extracting both Hebrew and English text. The meta tag in the HTML has charset=Windows-1255. The English extracts perfectly, but the Hebrew displays as black diamonds containing a question mark.
I found this question that says to pass the data into a blob then use the getDataAsString method to convert to another encoding. I tried converting to different encodings and got different results. UTF-8 displays the black diamonds with question marks, UTF-16 displays Korean, ISO 8859-8 returns an error and says it's not a valid parameter, and the original Windows-1255 displays one Hebrew character but a bunch of other gibberish.
However, I am able to copy and paste the Hebrew text into Google Sheets manually and it displays correctly.
I have even tested passing Hebrew directly from Google Apps Script code like so:
function passHebrew() {
return "וַיְדַבֵּר";
}
This displays the Hebrew text properly on Google Sheets.
My code is as follows:
function parseText(book, chapter) {
//var bk = book;
//var ch = chapter;
var bk = '04'; //hard-coded for testing purposes
var ch = '01'; //hard-coded for testing purposes
var url = 'http://www.mechon-mamre.org/p/pt/pt' + bk + ch + '.htm';
var xml = UrlFetchApp.fetch(url).getContentText();
//I had to "fix" these xml errors for XmlService.parse(xml) below
//to function.
xml = xml.replace('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">', '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "">');
xml = xml.replace('<LINK REL="stylesheet" HREF="p.css" TYPE="text/css">', '<LINK REL="stylesheet" HREF="p.css" TYPE="text/css"></LINK>');
xml = xml.replace('<meta http-equiv="Content-Type" content="text/html; charset=Windows-1255">', '<meta http-equiv="Content-Type" content="text/html; charset=Windows-1255"></meta>');
xml = xml.replace(/ALIGN=CENTER/gi, 'ALIGN="CENTER"');
xml = xml.replace(/<BR>/gi, '<BR></BR>');
xml = xml.replace(/class=h/gi, 'class="h"');
//This section is the specific route to the table in the page I want
var document = XmlService.parse(xml);
var body = document.getRootElement().getChildren("BODY");
var maintable = body[0].getChildren("TABLE");
var maintablechildren = maintable[0].getChildren();
//This creates a two-dimensional array so that I can store the Hebrew
//in the first column and the English in the second column
var array = new Array(maintablechildren.length);
for (var i = 0; i < maintablechildren.length; i++) {
array[i] = new Array(2);
}
//This is where the table gets parsed into the array
for (var i = 0; i < maintablechildren.length; i++) {
var verse = maintablechildren[i].getChildren();
//This is where the encoding problem occurs.
//I originally tried verse[0].getText() but it didn't work.
array[i][0] = Utilities.newBlob(verse[0].getText()).getDataAsString('UTF-8');
//This array receives the English text and works fine.
array[i][1] = verse[1].getText();
}
return array;
}
What am I overlooking, misunderstanding, or doing wrong? I don't have a very good understanding of how encoding works so I don't understand why converting it to UTF-8 isn't working.
Your problem occurs before the lines you've commented as an encoding problem: because the default encoding for UrlFetchApp is munging the unicode text from the start.
You should use the variation of the
.getContentText()
method that Returns the content of an HTTP response encoded as a string of the given charset. For your case:That should be all you need to change, although the
blob()
work-around is no longer needed. (It's harmless, though.) Other comments:The logical OR operator (
||
) is very helpful for setting default values. I've tweaked the first few lines to enable testing but still let the function operate normally with arguments.The way you're setting up an empty array before populating it with strings is Bad JavaScript; it's complex code that isn't needed, so toss it. Instead, we'll declare the
array
Array, thenpush()
rows onto it.The
.replace()
functions can be reduced with more clever RegExp use; I've included the URLs for demos of the really tricky ones.There were
\n
newline characters in the text which I guessed were unnecessary for your purposes, so added areplace()
for them as well.Here's what you're left with:
Results