HTML to RichTextBox as plain text with hyperlinks

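Having read so much about not using RegExes for stripping HTML, I am wondering how to get some links into my RichTextBox without also getting all the messy HTML that comes with the content I download from a newspaper website.

What I have: HTML from a newspaper website.

What I want: the article as plain text in a RichTextBox, but with links (that is, <a href="foo">bar</a> replaced by <Hyperlink NavigateUri="foo">bar</Hyperlink>).

HtmlAgilityPack gives me HtmlNode.InnerText (with all HTML tags stripped) and HtmlNode.InnerHtml (with all tags). I can get the URLs and text of the links via articlenode.SelectNodes(".//a"), but how would I know where to insert them into the plain text of HtmlNode.InnerText?

Any hints would be appreciated.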

2013-06-03 Rokus

Here is how you can do it (shown with a sample console application, but the idea is the same for Silverlight):

Let's assume you have this HTML:

<html> 
<head></head> 
<body> 
Link 1: <a href="foo1">bar</a> 
Link 2: <a href="foo2">bar2</a> 
</body> 
</html>

Then this code:

using System;
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load(myFileHtm);

foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
    // replace the <a> element in the DOM at the exact same place
    // by a deep clone with a different name
    HtmlNode newNode = node.ParentNode.ReplaceChild(node.CloneNode("Hyperlink", true), node);

    // copy href into NavigateUri, then drop the original attribute
    newNode.SetAttributeValue("NavigateUri", newNode.GetAttributeValue("href", null));
    newNode.Attributes.Remove("href");
}
doc.Save(Console.Out);

will output this:

<html> 
<head></head> 
<body> 
Link 1: <hyperlink navigateuri="foo1">bar</hyperlink> 
Link 2: <hyperlink navigateuri="foo2">bar2</hyperlink> 
</body> 
</html>
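
The question also asks how to know where each link belongs in the plain text. One possible approach, sketched here under assumptions rather than taken from the answer above, is to walk the tree in document order and emit text and links as they are encountered, so the positions fall out of the traversal. The names ArticleFlattener, Segment and Flatten are hypothetical; only the HtmlAgilityPack members (ChildNodes, NodeType, Name, InnerText, GetAttributeValue, HtmlEntity.DeEntitize) are library calls.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ArticleFlattener
{
    // Hypothetical helper type: one piece of output.
    // Href is null for plain text and set for a link.
    public sealed class Segment
    {
        public string Text;
        public string Href;
    }

    // Returns the text and links under root, in document order.
    public static List<Segment> Flatten(HtmlNode root)
    {
        var segments = new List<Segment>();
        Walk(root, segments);
        return segments;
    }

    private static void Walk(HtmlNode node, List<Segment> segments)
    {
        foreach (HtmlNode child in node.ChildNodes)
        {
            if (child.NodeType == HtmlNodeType.Text)
            {
                // Plain text: decode entities such as &amp; and keep it as is.
                segments.Add(new Segment { Text = HtmlEntity.DeEntitize(child.InnerText) });
            }
            else if (child.NodeType != HtmlNodeType.Element)
            {
                // Comments and other non-element nodes are dropped.
            }
            else if (child.Name == "a")
            {
                // Link: keep its visible text and its target together.
                segments.Add(new Segment
                {
                    Text = HtmlEntity.DeEntitize(child.InnerText),
                    Href = child.GetAttributeValue("href", null)
                });
            }
            else if (child.Name != "script" && child.Name != "style")
            {
                // Any other element (p, div, ul, li ...): drop the tag, keep walking.
                Walk(child, segments);
            }
        }
    }
}
```

Calling ArticleFlattener.Flatten(articlenode) would give an ordered list in which each entry with a null Href could become a Run and each entry with an Href could become a Hyperlink inline. That mapping is left out here because it depends on the UI framework in use.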

2013-06-03 13:55:22

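Nice! This works, thanks. But I still have to strip all the other HTML tags (img, ul, li, p, div ...) from my text. The regular expression '<[^a].*?>' matches all HTML tags except links, but I also have to keep ''. I don't know how to put an OR operator in there so that it matches every '<.*>' except or ''. – Rokus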

The answer to this question, by the way, would be '<(?!a|/a)[^>]+>'. I have figured it all out now. – Rokus
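
For illustration, a negative-lookahead pattern of the kind described in the comment above can be applied with Regex.Replace. The sketch below uses a slightly stricter variant, which is an assumption of this example and not the commenter's exact pattern: it only protects tags whose name is exactly "a", so tags such as <abbr> are still stripped. TagStripper and StripNonAnchorTags are hypothetical names, and the usual caveats about processing HTML with regular expressions still apply.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Matches any tag except <a ...> and </a>. The lookahead rejects a tag
    // only when "a" (optionally preceded by "/") is followed by whitespace,
    // "/" or ">", i.e. when the tag name is exactly "a".
    private static readonly Regex NonAnchorTag =
        new Regex(@"<(?!/?a[\s/>])[^>]+>", RegexOptions.IgnoreCase);

    // Removes every matched tag and leaves text and anchor tags untouched.
    public static string StripNonAnchorTags(string html)
    {
        return NonAnchorTag.Replace(html, string.Empty);
    }
}
```

For example, passing <p>See <a href="foo">bar</a> and <b>more</b>.</p> to StripNonAnchorTags should leave See <a href="foo">bar</a> and more.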
