What is the best way to parse html in google apps

var page = UrlFetchApp.fetch(contestURL);
var doc = XmlService.parse(page);

The above code gives a parse error when used, however if I replace the XmlService class with the deprecated Xml class, with the lenient flag set, it parses the html properly.

var page = UrlFetchApp.fetch(contestURL);
var doc = Xml.parse(page, true);

The problem is mostly caused because of no CDATA in the javascript part of the html and the parser complains with the following error.

The entity name must immediately follow the '&' in the entity reference.

Even if I remove all the <script>(.*?)</script> using regex, it still complains because the <br> tags aren't closed. Is there a clean way of parsing html into a DOM tree.

标签： javascript html regex google-apps-script

6条回答

贼婆χ

2楼-- · 2019-01-11 02:19

I know it is not exactly what OP asked, but I found this question when I was looking for some html parsing options - so it might be useful for others as well.

There is an easy to use the library for TEXT parsing. It's useful if you want to get only one piece of information from the html(xml) code.

It works like in the picture above

function getData() {
    var url = "https://chrome.google.com/webstore/detail/signaturesatori-central-s/fejomcfhljndadjlojamaklegghjnjfn?hl=en";
    var fromText = '<span class="e-f-ih" title="';
    var toText = '">';

    var content = UrlFetchApp.fetch(url).getContentText();
    var scraped = Parser
                    .data(content)
                    .from(fromText)
                    .to(toText)
                    .build();
    Logger.log(scraped);
    return scraped;
}

0人赞添加讨论(0) 举报

再贱就再见

3楼-- · 2019-01-11 02:20

Xml.parse() has an option to turn on lenient parsing, which helps when parsing HTML. Note that the Xml service is deprecated however, and the newer XmlService doesn't have this functionality.

0人赞添加讨论(0) 举报

不美不萌又怎样

4楼-- · 2019-01-11 02:25

I found that the best way to parse html in google apps is to avoid using XmlService.parse or Xml.parse. XmlService.parse doesn't work well with bad html code from certain websites.

Here a basic example on how you can parse any website easily without using XmlService.parse or Xml.parse. In this example, i am retrieving a list of president from "wikipedia.org/wiki/President_of_the_United_States" whit a regular javascript document.getElementsByTagName(), and pasting the values into my google spreadsheet.

1- Create a new Google Sheet;

2- Click the menu Tools > Script editor... to open a new tab with the code editor window and copy the following code into your Code.gs:

function onOpen() {
 var ui = SpreadsheetApp.getUi();
    ui.createMenu("Parse Menu")
      .addItem("Parse", "parserMenuItem")
      .addToUi();

}


function parserMenuItem() {
  var sideBar = HtmlService.createHtmlOutputFromFile("test");
  SpreadsheetApp.getUi().showSidebar(sideBar);
}


function getUrlData(url) {
 var doc = UrlFetchApp.fetch(url).getContentText()
 return doc                               
}

function writeToSpreadSheet(data) {
 var ss = SpreadsheetApp.getActiveSpreadsheet();
 var sheet = ss.getSheets()[0];
 var row=1

   for (var i = 0; i < data.length; i++) {
   var x = data[i];
   var range = sheet.getRange(row, 1)
   range.setValue(x);
   var row = row+1
    }
}

3- Add an HTML file to your Apps Script project. Open the Script Editor and choose File > New > Html File, and name it 'test'.Then copy the following code into your test.html

<!DOCTYPE html>
<html>
<head>    
</head>
<body>
<input id= "mButon" type="button" value="Click here to get list"
onclick="parse()">
<div hidden id="mOutput"></div>
</body>
<script>

window.onload = onOpen;

function onOpen() {
 var url = "https://en.wikipedia.org/wiki/President_of_the_United_States"
 google.script.run.withSuccessHandler(writeHtmlOutput).getUrlData(url)
 document.getElementById("mButon").style.visibility = "visible";
}

function writeHtmlOutput(x) {
 document.getElementById('mOutput').innerHTML = x;
}

function parse() {

var list = document.getElementsByTagName("area");
var data = [];

   for (var i = 0; i < list.length; i++) {
   var x = list[i];
   data.push(x.getAttribute("title"))
    }

google.script.run.writeToSpreadSheet(data);
} 
</script> 
</html>

4- Save your gs and html files and Go back to your spreadsheet. Reload your Spreadsheet. Click on "Parse Menu" - "Parse". Then click on "Click here to get list" in the sidebar.

0人赞添加讨论(0) 举报

Anthone

5楼-- · 2019-01-11 02:29

Use a regular expression:

var page = UrlFetchApp.fetch(contestURL);
var regExp = new RegExp("(pattern)", "gi");
var value = regExp.exec(page.getContentText())[1];   // [1] is the match group when using parenthesis in the pattern

0人赞添加讨论(0) 举报

放荡不羁爱自由

6楼-- · 2019-01-11 02:32

Natively there's no way unless you do what you already tried which wont work if the html doesnt conform with the xml format.

0人赞添加讨论(0) 举报

够拽才男人

7楼-- · 2019-01-11 02:34

I ran into this exact same problem. I was able to circumvent it by first using the deprecated Xml.parse, since it still works, then selecting the body XmlElement, then passing in its Xml String into the new XmlService.parse method:

var page = UrlFetchApp.fetch(contestURL);
var doc = Xml.parse(page, true);
var bodyHtml = doc.html.body.toXmlString();
doc = XmlService.parse(bodyHtml);
var root = doc.getRootElement();

Note: This solution may not work if the old Xml.parse is completely removed from Google Scripts.

0人赞添加讨论(0) 举报

What is the best way to parse html in google apps

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间