Scrape ASIN from Amazon URL using JavaScript

Posted 2019-03-08 06:23

Assuming I have an Amazon product URL like so:

http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C/ref=amb_link_86123711_2?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-1&pf_rd_r=0AY9N5GXRYHCADJP5P0V&pf_rd_t=101&pf_rd_p=500528151&pf_rd_i=507846

How could I scrape just the ASIN using JavaScript? Thanks!

11 answers
smile是对你的礼貌
#2 · 2019-03-08 06:54

If the ASIN is always in that position in the URL:

var asin= decodeURIComponent(url.split('/')[5]);

though there's probably little chance of an ASIN getting %-escaped.

倾城 Initia
#3 · 2019-03-08 07:03

Since the ASIN is always a sequence of 10 letters and/or numbers immediately after a slash, try this:

url.match("/([a-zA-Z0-9]{10})(?:[/?]|$)")

The additional (?:[/?]|$) after the ASIN is to ensure that only a full path segment is taken.
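
As a minimal sketch of how the result might be read: `match` returns an array (or null), and the ASIN is in the first capture group. The helper name `extractAsin` is hypothetical, and the pattern is the one above rewritten as a regex literal.

```javascript
// Hypothetical helper name; the pattern is the one from the answer above.
function extractAsin(url) {
  var m = url.match(/\/([a-zA-Z0-9]{10})(?:[\/?]|$)/);
  // m[0] is the whole match (including the leading slash); m[1] is the captured ASIN.
  return m ? m[1] : null;
}

var url = "http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C/ref=amb_link_86123711_2?pf_rd_m=ATVPDKIKX0DER";
console.log(extractAsin(url)); // B0015T963C
```

One caveat: the pattern accepts any ten-character alphanumeric path segment, so an SEO slug that happens to be exactly ten letters long (for example "Headphones") would be returned instead of the ASIN.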

手持菜刀,她持情操
#4 · 2019-03-08 07:04

This may be a simplistic approach, but I have yet to find an error in it using any of the URLs provided in this thread that people say are an issue.

I simply split the URL on "/" to get its discrete parts, then loop through the array and test each part against the regex. In my case (C#), the variable i is an object with a property called RawURL that holds the raw URL I am working with and a property called VendorSKU that I am populating.

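try
{
    // Split the URL on "/" and test each part against the ASIN pattern.
    string[] urlParts = i.RawURL.Split('/');
    // Anchored at the start only, so a part such as "B01N32MQOA?psc=1" still matches.
    Regex regex = new Regex(@"^[A-Z0-9]{10}");

    foreach (string part in urlParts)
    {
        Match m = regex.Match(part);
        if (m.Success)
        {
            // No break here: if several parts match, the last one wins.
            i.VendorSKU = m.Value;
        }
    }
}
catch (Exception) { }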

So far, this has worked perfectly.
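
Since the question asks for JavaScript, here is a rough sketch of the same split-and-test idea. The function name `findAsinInSegments` is hypothetical. Unlike the code above, this version strips the query string first, requires the segment to be exactly ten characters, and returns the first match rather than the last; those are choices made for the sketch, not part of the original approach.

```javascript
// Hypothetical helper: a JavaScript take on the split-and-test approach.
function findAsinInSegments(rawUrl) {
  // Drop the query string and fragment so only path segments are tested.
  var path = rawUrl.split(/[?#]/)[0];
  var parts = path.split("/");
  for (var i = 0; i < parts.length; i++) {
    // Assumption: an ASIN is exactly ten uppercase letters or digits.
    if (/^[A-Z0-9]{10}$/.test(parts[i])) {
      return parts[i];
    }
  }
  return null; // no segment looked like an ASIN
}

console.log(findAsinInSegments("https://www.amazon.de/dp/B01N32MQOA?psc=1"));
```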

The star\"
#5 · 2019-03-08 07:07

Amazon's detail pages can have several forms, so to be thorough you should check for them all. These are all equivalent:

http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C
http://www.amazon.com/dp/B0015T963C
http://www.amazon.com/gp/product/B0015T963C
http://www.amazon.com/gp/product/glance/B0015T963C

They always look like either this or this:

http://www.amazon.com/<SEO STRING>/dp/<VIEW>/ASIN
http://www.amazon.com/gp/product/<VIEW>/ASIN

This should do it:

var url = "http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C";
// Groups: 1 = optional SEO string, 2 = dp or gp/product, 3 = optional view, 4 = ASIN.
var regex = RegExp("http://www\\.amazon\\.com/([\\w-]+/)?(dp|gp/product)/(\\w+/)?(\\w{10})");
var m = url.match(regex);
if (m) {
    alert("ASIN=" + m[4]);
}
小情绪 Triste *
#6 · 2019-03-08 07:08

None of the above works in all cases. I tried matching the following URLs with the examples above:

http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C
http://www.amazon.com/dp/B0015T963C
http://www.amazon.com/gp/product/B0015T963C
http://www.amazon.com/gp/product/glance/B0015T963C

https://www.amazon.de/gp/product/B00LGAQ7NW/ref=s9u_simh_gw_i1?ie=UTF8&pd_rd_i=B00LGAQ7NW&pd_rd_r=5GP2JGPPBAXXP8935Q61&pd_rd_w=gzhaa&pd_rd_wg=HBg7f&pf_rd_m=A3JWKAKR8XB7XF&pf_rd_s=&pf_rd_r=GA7GB6X6K6WMJC6WQ9RB&pf_rd_t=36701&pf_rd_p=c210947d-c955-4398-98aa-d1dc27e614f1&pf_rd_i=desktop

https://www.amazon.de/Sawyer-Wasserfilter-Wasseraufbereitung-Outdoor-Filter/dp/B00FA2RLX2/ref=pd_sim_200_3?_encoding=UTF8&psc=1&refRID=NMR7SMXJAKC4B3MH0HTN

https://www.amazon.de/Notverpflegung-Kg-Marine-wasserdicht-verpackt/dp/B01DFJTYSQ/ref=pd_sim_200_5?_encoding=UTF8&psc=1&refRID=7QM8MPC16XYBAZMJNMA4

https://www.amazon.de/dp/B01N32MQOA?psc=1

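This is the best I could come up with: (?:[/dp/]|$)([A-Z0-9]{10}). The full match also includes the character preceding the ASIN (the / in all of the cases above), which can be removed afterwards; the ASIN on its own is in capture group 1. Note that [/dp/] is a character class, so it matches a single /, d, or p rather than the literal string /dp/.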

You can test it on: http://regexr.com/3gk2s
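
For comparison, here is a hedged sketch that anchors on the path prefixes seen in this thread instead of a single preceding character, and is not tied to one domain or scheme. The function name `asinFromProductUrl` is hypothetical, and the prefix list (dp, gp/product, gp/product/glance) is an assumption based only on the URLs quoted here; other page types would need further alternatives.

```javascript
// Hypothetical helper; the prefix list is an assumption drawn from the URLs in this thread.
function asinFromProductUrl(url) {
  // Require /dp/ or /gp/product/ (optionally followed by glance/) before the ASIN,
  // and a segment boundary (/, ?, # or end of string) after it.
  var pattern = /\/(?:dp|gp\/product(?:\/glance)?)\/([A-Z0-9]{10})(?=[\/?#]|$)/;
  var m = pattern.exec(url);
  return m ? m[1] : null;
}

var samples = [
  "http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C",
  "http://www.amazon.com/gp/product/glance/B0015T963C",
  "https://www.amazon.de/gp/product/B00LGAQ7NW/ref=s9u_simh_gw_i1?ie=UTF8&pd_rd_i=B00LGAQ7NW",
  "https://www.amazon.de/dp/B01N32MQOA?psc=1"
];
samples.forEach(function (u) {
  console.log(asinFromProductUrl(u));
});
```

Because the ASIN is returned from the capture group, there is no leading slash to strip afterwards.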
