2010-11-26 111 views
1

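Extracting the price from an Amazon RSS feed where it is embedded in the description

I'm working on an RSS reader that pulls data from Amazon's RSS feeds. I'm using C# on the .NET Compact Framework 3.5. I can get the book title, publication date and so on from nodes in the feed. However, the price of the book is embedded in a whole heap of HTML inside the description node. How would I go about extracting just the price rather than the whole load of HTML?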

if (nodeChannel.ChildNodes[i].Name == "item") 
{ 
    nodeItem = nodeChannel.ChildNodes[i]; 
    row = new ListViewItem(); 
    row.Text = nodeItem["title"].InnerText; 
    row.SubItems.Add(nodeItem["description"].InnerText); 
    listBooks.Items.Add(row); 
} 

The description node looks like this:

<description><![CDATA[ <div class="hreview" style="clear:both;"> <div class="item">  <div style="float:left;" class="tgRssImage"><a class="url" href="http://rads.stackoverflow.com/amzn/click/B0013FDM7E"><img src="http://ecx.images-amazon.com/images/I/51MvRlzFlpL._SL160_SS160_.jpg" width="160" alt="I Am Legend (Widescreen Single-Disc Edition)" class="photo" height="160" border="0" /></a></div> <span class="tgRssTitle fn summary">I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>)<br />By <span class="tgRssAuthor">Will Smith</span><br /></span> </div> <div class="description"> <br /> <span style="display: block;" class="tgRssPriceBlock"><span class="tgProductPriceLine"><a href="http://rads.stackoverflow.com/amzn/click/B0013FDM7E">Buy new</a>: <span class="tgProductPrice">$5.49</span></span><br /><span class="tgProductUsedPrice"><a href="http://rads.stackoverflow.com/amzn/click/B0013FDM7E" id="tag_rso_rs_eofr_used">285 used and new</a> from <span class="tgProductPrice">$1.00</span></span><br /></span> <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0" /><br /></span> <br /> <span class="tgRssProductTag"></span> <span class="tgRssAllTags">Customer tags: <a href="http://www.amazon.com/tag/science%20fiction/ref=tag_rss_rs_itdp_item_at">science fiction</a>(92), <a href="http://www.amazon.com/tag/will%20smith/ref=tag_rss_rs_itdp_item_at">will smith</a>(79), <a href="http://www.amazon.com/tag/horror/ref=tag_rss_rs_itdp_item_at">horror</a>(51), <a href="http://www.amazon.com/tag/action/ref=tag_rss_rs_itdp_item_at">action</a>(43), <a href="http://www.amazon.com/tag/adventure/ref=tag_rss_rs_itdp_item_at">adventure</a>(34), <a href="http://www.amazon.com/tag/fantasy/ref=tag_rss_rs_itdp_item_at">fantasy</a>(33), <a 
href="http://www.amazon.com/tag/dvd/ref=tag_rss_rs_itdp_item_at">dvd</a>(30), <a href="http://www.amazon.com/tag/movie/ref=tag_rss_rs_itdp_item_at">movie</a>(20), <a href="http://www.amazon.com/tag/zombies/ref=tag_rss_rs_itdp_item_at">zombies</a>(14), <a href="http://www.amazon.com/tag/i%20am%20legend/ref=tag_rss_rs_itdp_item_at">i am legend</a>(6), <a href="http://www.amazon.com/tag/bad%20sci-fi/ref=tag_rss_rs_itdp_item_at">bad sci-fi</a>(4), <a href="http://www.amazon.com/tag/mutants/ref=tag_rss_rs_itdp_item_at">mutants</a>(4)<br /></span> </div></div>]]></description> 

The price, $5.49 in this example, is somewhere in the middle of that mess.

+0

Can you give an example of the HTML that contains the price? – Rox 2010-11-26 12:56:33

Answers

1

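This may be a silly idea, but how about doing a string search for class="tgProductPrice">? Then extract the characters that follow until you hit the closing tag </span>.

You don't need to load the HTML into anything; you can do it directly on the description string.

Would that work for you?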
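The string search described above could be sketched roughly as follows. This is only a minimal illustration: the class name `PriceExtractor` and the method `ExtractFirstPrice` are made up for this example, and it assumes the feed keeps emitting the exact text class="tgProductPrice"> shown in the question. It sticks to `String.IndexOf` and `Substring`, which should keep it usable on the Compact Framework.

```csharp
using System;

public static class PriceExtractor
{
    // Assumption: the marker text is copied from the sample description in the question.
    private const string Marker = "class=\"tgProductPrice\">";
    private const string EndTag = "</span>";

    // Returns the first price found in the description HTML, or null if none is found.
    public static string ExtractFirstPrice(string descriptionHtml)
    {
        if (string.IsNullOrEmpty(descriptionHtml)) return null;

        // Locate the opening marker of the first price span.
        int markerPos = descriptionHtml.IndexOf(Marker, StringComparison.Ordinal);
        if (markerPos < 0) return null;

        // The price text starts right after the marker and runs up to the closing tag.
        int start = markerPos + Marker.Length;
        int end = descriptionHtml.IndexOf(EndTag, start, StringComparison.Ordinal);
        if (end < 0) return null;

        return descriptionHtml.Substring(start, end - start).Trim();
    }
}
```

In the loop from the question this might be used as `row.SubItems.Add(PriceExtractor.ExtractFirstPrice(nodeItem["description"].InnerText) ?? "n/a");`. Note that the sample description contains two tgProductPrice spans (the "Buy new" price and the used price); this sketch only returns the first one.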

1

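That description looks pretty bad. If you have no way of getting a different version of that RSS feed, I think the only solution is to parse the HTML you get in the description. For that you could use the HTML Agility Pack (I haven't used it myself, but it is the commonly recommended way to parse HTML from .NET), or use a regular expression or a plain text search to find the tag and extract the price (this feels a bit hacky to me, and it could require a lot of changes if the RSS feed changes).

Edit: I have done string searching combined with regular expressions before, and it is a nightmare to maintain, but given that in your case it is only one value, it might be fine.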
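For the regular expression route, a minimal sketch might look like the following. The names `PriceRegex` and `ExtractAllPrices` are hypothetical, and the pattern assumes the span is written exactly as in the sample description (a single class attribute with the value tgProductPrice and no nested tags inside it). If Amazon changes the markup, the pattern would need to change too, which is the maintenance cost mentioned above.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PriceRegex
{
    // Assumption: prices appear as <span class="tgProductPrice">TEXT</span>, as in the sample feed.
    private static readonly Regex PricePattern = new Regex(
        @"<span\s+class=""tgProductPrice"">\s*([^<]+?)\s*</span>",
        RegexOptions.IgnoreCase);

    // Returns every price found in the description HTML, in document order.
    public static List<string> ExtractAllPrices(string descriptionHtml)
    {
        var prices = new List<string>();
        if (string.IsNullOrEmpty(descriptionHtml)) return prices;

        foreach (Match match in PricePattern.Matches(descriptionHtml))
        {
            // Group 1 holds the text between the opening and closing span tags.
            prices.Add(match.Groups[1].Value);
        }
        return prices;
    }
}
```

Unlike a single string search, this collects both the "Buy new" price and the used price from the sample, so the caller can decide which one to display.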

0
using CsQuery; // install CsQuery from NuGet

string path = textBox1.Text; // URL of the page to load
var dom = CQ.CreateFromUrl(path);
// priceblock_ourprice is the id of the span in which the price is written
string price = dom.Select("#priceblock_ourprice").Text();
label1.Text = price;