Use Google Sheets ImportXML with XPath to import A

2019-09-20 17:41发布

问题:

I want to get the Product title from an Amazon product page in Google Sheets.

I have searched and tried different approaches and came up with this

=ImportXML("https://www.amazon.com/dp/B01MSR8J29","//*[@id='productTitle']")

I want to get the title: Army Flag Shirt: Become Brothers Army TShirt

But I get Error, Imported content is empty.

Demo Sheet

回答1:

You can do this with just the //title xpath and a simple regexreplace to remove the standard Amazon.com: portion:

=REGEXREPLACE(ImportXML("https://www.amazon.com/dp/B01MSR8J29","//title"),"Amazon.com: ","")

If you also want to remove the rest of the breadcrumb e.g. : Clothing just modify it to use the OR symbol which in regex is | like this:

=REGEXREPLACE(ImportXML("https://www.amazon.com/dp/B01MSR8J29","//title"),"Amazon.com: |: Clothing","")



回答2:

Much of HTML isn't valid XML, in particular Amazon's pages are not valid XML. So, importXML fails on them.

You can use an Apps Script via a custom function as follows (remove the space before "amazon", it's here to prevent SO from rewriting the URL):

=producttitle("https://www. amazon.com/dp/B01MSR8J29")  

returns "Army Flag Shirt: Become Brothers Army TShirt", provided that the custom function is entered in Script Editor as follows:

function productTitle(url) {   
  var content = UrlFetchApp.fetch(url).getContentText();
  var match = content.match(/<span id="productTitle".*>([^<]*)<\/span>/);
  return match && match [1] ? match[1] : 'Title not found';
}

Here, the first line gets the source of the page; then a regex extracts the item title.

You will find a similar post here, including the question of whether this activity is compliant with Amazon's Terms of Services.