Pulling price from amazon rss feed embedded in des

Posted 2019-08-14 19:44

Question:

I am working on code that pulls data from an Amazon RSS feed of books. I am using C# with the .NET Compact Framework 3.5. I can get the title of the book, the date published, etc. from the nodes in the RSS feed. However, the price of the book is embedded in a whole heap of HTML in the description node. How would I go about extracting only the price, without the surrounding HTML?

if (nodeChannel.ChildNodes[i].Name == "item")
{
    nodeItem = nodeChannel.ChildNodes[i];
    row = new ListViewItem();
    row.Text = nodeItem["title"].InnerText;
    row.SubItems.Add(nodeItem["description"].InnerText);
    listBooks.Items.Add(row);
}

An example of the price in the middle of the description node:

<description><![CDATA[    <div class="hreview" style="clear:both;">  <div class="item">        <div style="float:left;" class="tgRssImage"><a class="url" href="https://rads.stackoverflow.com/amzn/click/com/B0013FDM7E" rel="nofollow noreferrer"><img src="http://ecx.images-amazon.com/images/I/51MvRlzFlpL._SL160_SS160_.jpg" width="160" alt="I Am Legend (Widescreen Single-Disc Edition)" class="photo" height="160" border="0" /></a></div>    <span class="tgRssTitle fn summary">I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>)<br />By <span class="tgRssAuthor">Will Smith</span><br /></span>  </div>  <div class="description">    <br />    <span style="display: block;" class="tgRssPriceBlock"><span class="tgProductPriceLine"><a href="https://rads.stackoverflow.com/amzn/click/com/B0013FDM7E" rel="nofollow noreferrer">Buy new</a>: <span class="tgProductPrice">$5.49</span></span><br /><span class="tgProductUsedPrice"><a href="http://www.amazon.com/gp/offer-listing/B0013FDM7E/ref=tag_rso_rs_eofr_used" id="tag_rso_rs_eofr_used">285 used and new</a> from <span class="tgProductPrice">$1.00</span></span><br /></span>    <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0" /><br /></span>    <br />    <span class="tgRssProductTag"></span>    <span class="tgRssAllTags">Customer tags: <a href="http://www.amazon.com/tag/science%20fiction/ref=tag_rss_rs_itdp_item_at">science fiction</a>(92), <a href="http://www.amazon.com/tag/will%20smith/ref=tag_rss_rs_itdp_item_at">will smith</a>(79), <a href="http://www.amazon.com/tag/horror/ref=tag_rss_rs_itdp_item_at">horror</a>(51), <a href="http://www.amazon.com/tag/action/ref=tag_rss_rs_itdp_item_at">action</a>(43), <a href="http://www.amazon.com/tag/adventure/ref=tag_rss_rs_itdp_item_at">adventure</a>(34), <a 
href="http://www.amazon.com/tag/fantasy/ref=tag_rss_rs_itdp_item_at">fantasy</a>(33), <a href="http://www.amazon.com/tag/dvd/ref=tag_rss_rs_itdp_item_at">dvd</a>(30), <a href="http://www.amazon.com/tag/movie/ref=tag_rss_rs_itdp_item_at">movie</a>(20), <a href="http://www.amazon.com/tag/zombies/ref=tag_rss_rs_itdp_item_at">zombies</a>(14), <a href="http://www.amazon.com/tag/i%20am%20legend/ref=tag_rss_rs_itdp_item_at">i am legend</a>(6), <a href="http://www.amazon.com/tag/bad%20sci-fi/ref=tag_rss_rs_itdp_item_at">bad sci-fi</a>(4), <a href="http://www.amazon.com/tag/mutants/ref=tag_rss_rs_itdp_item_at">mutants</a>(4)<br /></span>  </div></div>]]></description>

The $5.49 is in that mess somewhere.

Answer 1:

It could be a dumb idea, but how about doing a string search for class="tgProductPrice"> and then extracting the characters that follow until you hit the closing tag </span>?

You don't need to load any HTML; you already have it in the description.

Will that work for you?
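
The string search suggested above could be sketched roughly as follows. This is only a minimal illustration, not a tested Compact Framework build: the class name PriceFinder and the method name ExtractFirstPrice are made up for this example, and the marker text is copied from the sample description in the question. It returns the first price it finds (the "Buy new" one in the sample) and null when the marker is missing.

```csharp
using System;

public static class PriceFinder
{
    // Marker copied from the sample description; this is an assumption
    // about the feed and will break if Amazon changes the class name.
    private const string Marker = "class=\"tgProductPrice\">";
    private const string EndTag = "</span>";

    // Returns the text between the first marker and the next </span>,
    // or null if either one is not found.
    public static string ExtractFirstPrice(string descriptionHtml)
    {
        if (string.IsNullOrEmpty(descriptionHtml))
        {
            return null;
        }

        int start = descriptionHtml.IndexOf(Marker, StringComparison.Ordinal);
        if (start < 0)
        {
            return null;
        }
        start += Marker.Length;

        int end = descriptionHtml.IndexOf(EndTag, start, StringComparison.Ordinal);
        if (end < 0)
        {
            return null;
        }

        return descriptionHtml.Substring(start, end - start).Trim();
    }
}
```

In the loop from the question, this would be called as PriceFinder.ExtractFirstPrice(nodeItem["description"].InnerText), with the result added as a sub-item instead of the raw description.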



Answer 2:

That description looks really bad. If you have no way of getting a different version of that RSS feed, I think the only solution is to parse the HTML that you have in the description.

For that, you could use the HTML Agility Pack (I haven't used it, but it's the recommended solution for HTML parsing from .NET), or use a regular expression or text search to find that tag and extract the price. The latter feels a bit hacky to me, and could require many changes if the RSS feed changes.

Edit: I used a string search combined with regex a while back and it was a nightmare to maintain, but considering that in your case it's for only one value, it might be OK.
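
For the regular expression route, a rough sketch might look like this. The names PriceRegex and ExtractPrices are hypothetical, and the pattern assumes the price always sits in a span whose only attribute is class="tgProductPrice", as in the sample description. It collects every such price in document order, so for the sample it would yield the new price followed by the used price.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PriceRegex
{
    // Captures the inner text of <span class="tgProductPrice">...</span>.
    // Assumption: the span carries no other attributes, as in the sample.
    private static readonly Regex PricePattern = new Regex(
        "<span\\s+class=\"tgProductPrice\"\\s*>\\s*([^<]+?)\\s*</span>",
        RegexOptions.IgnoreCase);

    // Returns all matched prices in the order they appear;
    // an empty list means nothing matched.
    public static List<string> ExtractPrices(string descriptionHtml)
    {
        var prices = new List<string>();
        if (string.IsNullOrEmpty(descriptionHtml))
        {
            return prices;
        }

        foreach (Match match in PricePattern.Matches(descriptionHtml))
        {
            prices.Add(match.Groups[1].Value);
        }
        return prices;
    }
}
```

Keeping the pattern in one place at least limits the maintenance burden mentioned above to a single line if the feed markup changes.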



Answer 3:

using CsQuery; // get CsQuery from NuGet

// textBox1 is expected to hold the URL of the page to load
string path = textBox1.Text;
var dom = CQ.CreateFromUrl(path);
// priceblock_ourprice is the id of the span where the price is written
string price = dom.Select("#priceblock_ourprice").Text();
label1.Text = price;


Tags: c# rss amazon