StreamWriter: writing XML node content while ignoring child markup in C#

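I'm trying to write a program that reads an RSS news feed and writes each article's date, title, and body to a txt file. I only started learning C# two days ago, but I have experience with other languages. The program works for some feeds, but in others (Reuters, for example) there is an "email this article" type of link after each article body, and I can't seem to get rid of it when copying the text. I run the program over the whole feed.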

For example, this is the XML for one news item:

<item> 
    <title>Pimco's Ivascyn sees 'significant' opportunity in European bank assets</title> 
    <link>http://feeds.reuters.com/~r/news/wealth/~3/vUJ74S5mXQg/story01.htm</link> 
    <category domain="">PersonalFinance</category> 
    <pubDate>Mon, 16 Jun 2014 15:37:52 GMT</pubDate> 
    <guid isPermaLink="false">http://www.reuters.com/article/2014/06/16/us-investing-pimco-ivascyn-idUSKBN0ER1VV20140616?feedType=RSS&amp;feedName=PersonalFinance</guid> 
    <description>NEW YORK (Reuters) - The expected unloading of roughly $1 trillion in assets by European banks represents a "significant investment opportunity" in residential and commercial real estate as well as...&lt;div class="feedflare"&gt; 
    &lt;a href="http://feeds.reuters.com/~ff/news/wealth?a=vUJ74S5mXQg:y6BPXasLV5o:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/news/wealth?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; 
    &lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/news/wealth/~4/vUJ74S5mXQg" height="1" width="1"/&gt;</description> 
    <feedburner:origLink>http://reuters.us.feedsportal.com/c/35217/f/654211/s/3b8e7c6b/sc/2/l/0L0Sreuters0N0Carticle0C20A140C0A60C160Cus0Einvesting0Epimco0Eivascyn0EidUSKBN0AER1VV20A140A6160DfeedType0FRSS0GfeedName0FPersonalFinance/story01.htm</feedburner:origLink> 
</item>

However, when I run the program I get:

Mon, 16 Jun 2014 15:37:52 GMT 
Pimco's Ivascyn sees 'significant' opportunity in European bank assets 
NEW YORK (Reuters) - The expected unloading of roughly $1 trillion in assets by European banks represents a "significant investment opportunity" in residential and commercial real estate as well as...<div class="feedflare"> 
<a href="http://feeds.reuters.com/~ff/news/wealth?a=vUJ74S5mXQg:y6BPXasLV5o:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/news/wealth?d=yIl2AUoC8zA" border="0"></img></a> 
</div><img src="http://feeds.feedburner.com/~r/news/wealth/~4/vUJ74S5mXQg" height="1" width="1"/> 
**********

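I'm trying to get rid of the markup in the last two lines, after the article body. I added the asterisks to separate the different articles.

Here is my code: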

using System; 
using System.IO; 
using System.Text; 
using System.Xml; 

namespace XmlReading 
{ 
    class RssReading 
    { 
     static void Main(string[] args) 
     { 
      //Create a StreamWriter object to write to a text file. 
      StreamWriter sw = new StreamWriter("C:\\Users\\Testing002.txt"); 

      XmlDocument xmlDoc = new XmlDocument(); 

      // Loads the rss feed page 
      xmlDoc.Load("http://feeds.reuters.com/news/wealth"); 

      //create an object of item nodes. 
      XmlNodeList itemNodes = xmlDoc.SelectNodes("//rss/channel/item"); 

      foreach (XmlNode itemNode in itemNodes) 
      { 
       //Reading the title 
       XmlNode titleNode = itemNode.SelectSingleNode("title"); 
       //Reading the date 
       XmlNode dateNode = itemNode.SelectSingleNode("pubDate"); 
       //Reading the body 
       XmlNode bodyNode = itemNode.SelectSingleNode("description"); 

       if(((titleNode != null) && (dateNode != null)) && (bodyNode!= null)) 
       { 
        /* Xpath of article body, and of extra links. 
        * //*[@id="bodyblock"]/ul/li[2]/div/text() 
        * //*[@id="bodyblock"]/ul/li[2]/div/div 
        */ 
       //writing to console just to check the output. 
        Console.WriteLine(dateNode.InnerText); 
        sw.WriteLine(dateNode.InnerText); 

        Console.WriteLine(titleNode.InnerText); 
        sw.WriteLine(titleNode.InnerText); 

        Console.WriteLine(bodyNode.InnerText); 
        sw.WriteLine(bodyNode.InnerText); 

        Console.WriteLine("**********\n\n\n"); 
        sw.WriteLine("**********\n\n\n"); 
        sw.WriteLine(" "); 
        sw.WriteLine(" "); 

       } 
      } 
      sw.Close(); 
      Console.ReadKey(true); 
     } 
    } 
}

Thanks in advance for any help or suggestions.

Source

2014-06-17 user3748452

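Your "XML code" is not the XML structure of the RSS feed; it is its HTML representation. Please provide the XML structure you are trying to process.

Sorry, my bad. I've corrected it now. – user3748452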

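I found a solution to the problem. At first I thought it was an issue with child nodes, but I realized that the "email this" links are created using entities, such as &lt; and &gt;.

So what I did was take a substring from position 0 up to the index of the first '&' character. To keep the code running when the RSS feed does not have this problem, I wrapped the index in Math.Max to avoid an invalid substring length. The final code is the same except for the line that writes the body to the text file, which is replaced with the following line: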

sw.WriteLine(bodyNode.InnerText.Substring(0,Math.Max(bodyNode.InnerXml.IndexOf("&"),0)));
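One detail of the line above is worth checking: IndexOf returns -1 when the marker is absent, and Math.Max then clamps that to 0, so Substring(0, 0) yields an empty string for any description that contains no '&'. The following is a minimal sketch of a helper that falls back to the full text instead. The names DescriptionTrimmer and TruncateAt are hypothetical and do not appear in the original code.

```csharp
using System;

static class DescriptionTrimmer
{
    // Hypothetical helper: returns the part of 'text' before the first
    // occurrence of 'marker', or the whole text when the marker is absent.
    public static string TruncateAt(string text, string marker)
    {
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(marker))
        {
            return text ?? string.Empty;
        }

        int cut = text.IndexOf(marker, StringComparison.Ordinal);

        // IndexOf returns -1 when nothing is found; keep the full text then.
        return cut < 0 ? text : text.Substring(0, cut);
    }
}
```

With this in place, the write could become sw.WriteLine(DescriptionTrimmer.TruncateAt(bodyNode.InnerXml, "&")); which keeps the same '&' marker but leaves bodies without any entity untouched.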

Also, the Console.WriteLine() calls are no longer needed in the code.

Source

2014-06-18 13:30:22 user3748452

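This approach will not work when the text you want to keep contains character entities (such as >, &, etc.). You could HTML-decode the description text and then use a regular expression to remove the HTML tags. Another option, still better than the current one, is to search for '<' rather than '&', since that more accurately finds the start of an HTML tag.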
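The decode-and-strip idea from the comment above could look roughly like the sketch below. It assumes the input is XmlNode.InnerText, which the XML parser has already unescaped once, so the tags are removed first and any remaining HTML entities are decoded afterwards. The names HtmlStripper and ToPlainText are hypothetical, and the regular expression is a simple pattern rather than a full HTML parser.

```csharp
using System.Net;
using System.Text.RegularExpressions;

static class HtmlStripper
{
    // Hypothetical helper: removes anything that looks like an HTML tag,
    // then decodes leftover HTML entities and trims surrounding whitespace.
    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        // Drop every "<...>" run; text between tags is kept.
        string withoutTags = Regex.Replace(html, "<[^>]+>", string.Empty);

        // Decode entities such as &amp; that survive in the remaining text.
        return WebUtility.HtmlDecode(withoutTags).Trim();
    }
}
```

The body could then be written with sw.WriteLine(HtmlStripper.ToPlainText(bodyNode.InnerText)); unlike cutting at the first marker, this keeps any article text that follows a tag.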

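Searching for '<' is what I tried initially. However, for some reason it only wrote the body of one of the articles correctly and wrote nothing for any of the others. – user3748452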
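One possible explanation, which the thread does not confirm, is that the search for '<' was run against InnerXml. InnerXml returns the escaped form of the text, where '<' appears as &lt;, so a literal '<' is never found there, while InnerText returns the decoded form. The sketch below illustrates the difference; EntityDemo, FirstTagIndexes, and the /item/description path are assumptions made for the illustration.

```csharp
using System;
using System.Xml;

static class EntityDemo
{
    // Returns the index of the first '<' in the escaped (InnerXml) and
    // decoded (InnerText) forms of an item's description, in that order.
    public static int[] FirstTagIndexes(string itemXml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(itemXml);

        XmlNode body = doc.SelectSingleNode("/item/description");
        if (body == null)
        {
            throw new ArgumentException("No <description> element found.", "itemXml");
        }

        return new[] { body.InnerXml.IndexOf('<'), body.InnerText.IndexOf('<') };
    }
}
```

For a description whose markup is entity-escaped, the first value is -1 and the second is the position where the markup starts, so the index used for Substring should come from the same string that is being cut.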
