(javascript / google scripts) How to get the title

Posted 2019-07-21 03:00

Question:

I'm asking for help because I really spent hours (more than 5) looking for an answer online and couldn't find a proper solution.

My project requires that I scrape the titles of external web pages, but sometimes these pages are encoded in ISO-8859-1.

As the scraped titles are displayed in my page, which is encoded in UTF-8, I get � instead of characters such as é, à, ê, ô ...

So I must find a way to convert the titles from ISO-8859-1 to UTF-8 when needed. Can you help me?

I'm scripting with Google Scripts, i.e. I write code in JavaScript to enhance a Google spreadsheet using the API provided.

To scrape the external web pages, I use this code:

var result = UrlFetchApp.fetch( url );
var wholePage = result.getContentText(); 
var scrap = wholePage.match( /<title>(.*?)<\/title>/ ); 
var title = scrap[1];

It works perfectly if the scraped page is encoded in UTF-8, but not for this URL (as an example): http://www.lexpress.fr/actualite/medias/cannes-pierre-lescure-et-jerome-clement-pressentis-pour-succeder-a-gilles-jacob_1254608.html

This is the result I get for this example:

Cannes: Pierre Lescure et J�r�me Cl�ment pressentis pour succ�der � Gilles Jacob - L'EXPRESS

(yes, I'm French).

Can someone help me with this? I'd be really grateful. I tried to give as much information as I could, since many other questions related to encoding issues on StackOverflow are said to lack real context. Tell me if you need more and I'll answer quickly.

Answer 1:

Where you display the content matters more. If you are writing it to a spreadsheet, then you have to change the settings of the spreadsheet to the correct language (French). In addition, you can specify what charset the original content is in. The contentType advanced parameter of fetch() describes the request you send, not the response you receive, so pass the charset to getContentText() instead. For example,

var result = UrlFetchApp.fetch( url );
var wholePage = result.getContentText( 'ISO-8859-1' ); // decode the response bytes as ISO-8859-1
var scrap = wholePage.match( /<title>(.*?)<\/title>/ );
var title = scrap ? scrap[1] : ''; // avoid a TypeError when the page has no <title>
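
Since the question says only some pages are ISO-8859-1, hard-coding one charset will not fit every page. Here is a minimal sketch of one way to handle that. It assumes the server declares the charset in its Content-Type response header, and it relies on getContentText() accepting a charset name. The function names headerValue, charsetFromContentType, extractTitle and fetchTitle are made up for this example, and the fallback charset is an assumption you should adapt to the sites you scrape.

```javascript
// Hypothetical helper: case-insensitive lookup in a headers object,
// since servers do not all use the same casing for header names.
function headerValue(headers, name) {
  var wanted = name.toLowerCase();
  for (var key in headers) {
    if (key.toLowerCase() === wanted) return headers[key];
  }
  return undefined;
}

// Hypothetical helper: pull the charset out of a Content-Type value such as
// "text/html; charset=ISO-8859-1". Returns fallback when none is declared.
function charsetFromContentType(contentType, fallback) {
  var m = /charset\s*=\s*["']?([^;"'\s]+)/i.exec(contentType || '');
  return m ? m[1] : fallback;
}

// Hypothetical helper: return the trimmed <title> text, or null if there is none.
// [\s\S] also matches newlines, so multi-line titles are handled.
function extractTitle(html) {
  var m = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return m ? m[1].trim() : null;
}

// Runs only inside Google Apps Script, where UrlFetchApp is available.
// fallbackCharset is an assumption used when the header names no charset.
function fetchTitle(url, fallbackCharset) {
  var response = UrlFetchApp.fetch(url);
  var contentType = headerValue(response.getHeaders(), 'Content-Type');
  var charset = charsetFromContentType(contentType, fallbackCharset || 'UTF-8');
  return extractTitle(response.getContentText(charset));
}
```

In a script bound to the spreadsheet you could then call fetchTitle( url, 'ISO-8859-1' ) for sites known to omit the charset from their headers. Pages that declare their encoding only in a <meta> tag are not covered by this sketch.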