How to get all the displayed text from a web page in C#

Actually, I want to get all of the displayed text, but not the HTML tags.

Here is my code:

HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(
    @"http://dawateislami.net/books/bookslibrary.do#!section:bookDetail_521.tr");
string str = doc.DocumentNode.InnerText;

This returns some tags and scripts as well, but I only want the displayed text that is visible to the user. Please help me. Thanks.


2013-10-26 Guria Doll

Can you give an example of the input and the output? – gideon

The output looks like this: <!-- some text --> but I only want the text. –

To remove JavaScript and CSS:

foreach(var script in doc.DocumentNode.Descendants("script").ToArray()) 
    script.Remove(); 
foreach(var style in doc.DocumentNode.Descendants("style").ToArray()) 
    style.Remove();

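To remove comments (untested):

// Descendants(name) matches element names, not XPath, so filter by node type instead
foreach(var comment in doc.DocumentNode.Descendants()
        .Where(n => n.NodeType == HtmlNodeType.Comment).ToArray()) 
    comment.Remove();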
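
Putting the steps above together, here is a minimal sketch that strips script, style and comment nodes and then reads the remaining text. It assumes the HtmlAgilityPack NuGet package is installed; the class name `VisibleTextDemo`, the method name `GetVisibleText` and the sample markup are made up for illustration. It parses an in-memory string rather than the URL from the question so that it can be run without network access.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package, assumed to be installed

static class VisibleTextDemo
{
    // Hypothetical helper: returns the text left after dropping
    // script, style and comment nodes from the parsed document.
    public static string GetVisibleText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Collect first, then remove, so the tree is not modified
        // while it is still being enumerated.
        var unwanted = doc.DocumentNode.Descendants()
            .Where(n => n.Name == "script"
                     || n.Name == "style"
                     || n.NodeType == HtmlNodeType.Comment)
            .ToArray();
        foreach (var node in unwanted)
            node.Remove();

        // Decode entities such as &amp; and trim surrounding whitespace.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText).Trim();
    }

    static void Main()
    {
        const string sample =
            "<html><head><style>p{color:red}</style></head>" +
            "<body><!-- hidden --><p>Hello &amp; welcome</p>" +
            "<script>var x = 1;</script></body></html>";
        Console.WriteLine(GetVisibleText(sample));
    }
}
```

Note that this only removes nodes that are never rendered as text. Elements hidden through CSS (for example `display:none`) or text generated by JavaScript at run time are not handled by this approach.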


2013-10-26 08:27:02 junichiro

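It's not working :) By the way, what is this? Can you recommend another solution? –

I've added a line to remove comments. It would help if you said more precisely what you don't want. – junichiro

using System.Linq is not supported –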

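To remove all HTML tags from a string, you can use:

// C# equivalent of Java's replaceAll; requires using System.Text.RegularExpressions;
string output = Regex.Replace(inputString, "<[^>]*>", "");

To remove a specific tag:

string output = Regex.Replace(inputString, "(?i)<td[^>]*>", "");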
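
For reference, a slightly fuller regex-based sketch in C# is shown here. It uses only the .NET base class library; the class name `TagStripper` and the method name `StripTags` are hypothetical. Regular expressions do not really parse HTML, so treat this as a rough approximation that works on simple markup, not as a robust solution.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class TagStripper
{
    // Hypothetical helper: approximates the visible text of simple markup.
    public static string StripTags(string html)
    {
        // Drop script/style elements together with their contents.
        string text = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Drop HTML comments.
        text = Regex.Replace(text, @"<!--.*?-->", " ", RegexOptions.Singleline);

        // Replace every remaining tag with a space.
        text = Regex.Replace(text, @"<[^>]*>", " ");

        // Decode entities such as &amp; into plain characters.
        text = WebUtility.HtmlDecode(text);

        // Collapse runs of whitespace into single spaces.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    static void Main()
    {
        Console.WriteLine(StripTags(
            "<p>Hello <b>world</b> &amp; all</p><script>var x = 1;</script>"));
    }
}
```

Replacing each tag with a space keeps adjacent cells or paragraphs from running together, at the cost of splitting a word that has inline markup in the middle of it.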

Hope it helps :)


2013-10-26 10:22:53 Avishek

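[I believe this will solve your problem][1]

Method 1 - In-memory cut and paste

Use a WebBrowser control object to process the web page, and then copy the text from the control...

Use the following code to download the web page: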

//Create the WebBrowser control 
WebBrowser wb = new WebBrowser(); 
//Add a new event to process document when download is completed 
wb.DocumentCompleted += 
    new WebBrowserDocumentCompletedEventHandler(DisplayText); 
//Download the webpage 
wb.Url = urlPath;

Use the following event code to process the downloaded web page text:

private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e) 
{ 
    WebBrowser wb = (WebBrowser)sender; 
    wb.Document.ExecCommand("SelectAll", false, null); 
    wb.Document.ExecCommand("Copy", false, null); 
    textResultsBox.Text = CleanText(Clipboard.GetText()); 
}
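
The `CleanText` helper called above is not shown in the answer. A purely hypothetical stand-in, assuming its job is to tidy the whitespace in the copied text, might look like the sketch below (the class name `TextCleaner` is also made up).

```csharp
using System;
using System.Text.RegularExpressions;

static class TextCleaner
{
    // Hypothetical stand-in for the CleanText helper, which the answer does not show.
    public static string CleanText(string text)
    {
        if (string.IsNullOrEmpty(text))
            return string.Empty;

        // Normalize Windows and old Mac line endings to '\n'.
        text = text.Replace("\r\n", "\n").Replace('\r', '\n');

        // Collapse runs of spaces and tabs into a single space.
        text = Regex.Replace(text, @"[ \t]+", " ");

        // Strip spaces around line breaks and collapse blank lines
        // into a single line break.
        text = Regex.Replace(text, @" ?\n[ \n]*", "\n");

        return text.Trim();
    }

    static void Main()
    {
        Console.WriteLine(CleanText("  Hello \t world \r\n\r\n\r\n  Next line  "));
    }
}
```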

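Method 2 - In-memory selection object

This is a second way of processing the downloaded web page text. It seems to take slightly longer (the difference is very small). However, it avoids using the clipboard and the limitations that come with it.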

private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e) 
{
    //Get the WebBrowser control and its IHTMLDocument2 (from the mshtml interop assembly)
    WebBrowser wb = (WebBrowser)sender; 
    IHTMLDocument2 htmlDocument = 
        wb.Document.DomDocument as IHTMLDocument2; 
    //Select all the text on the page and create a selection object 
    wb.Document.ExecCommand("SelectAll", false, null); 
    IHTMLSelectionObject currentSelection = htmlDocument.selection; 
    //Create a text range and send the range's text to your text box 
    IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange; 
    textResultsBox.Text = range.text; 
}

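Method 3 - The elegant, simple, slower XmlDocument approach

A good friend shared this example with me. I am a huge fan of simplicity, and this example wins the simplicity contest. Unfortunately, it is very slow compared to the other two approaches.

The XmlDocument object will load/process HTML files with only 3 simple lines of code: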

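XmlDocument document = new XmlDocument(); 
// Load expects a full URL or file path, and throws XmlException if the page is not well-formed XML
document.Load("http://www.yourwebsite.com"); 
string allText = document.InnerText;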
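
To illustrate how this behaves, the sketch below feeds two in-memory snippets to `XmlDocument.LoadXml`: one that is well-formed XML and one typical HTML fragment with an unclosed `<br>`. The class name `XmlTextDemo`, the method name `TryGetInnerText` and both snippets are invented for the example. Real-world pages are often not well-formed XML, so the second case is the one to expect in practice.

```csharp
using System;
using System.Xml;

static class XmlTextDemo
{
    // Hypothetical helper: returns the concatenated text of the markup,
    // or null when the markup is not well-formed XML.
    public static string TryGetInnerText(string markup)
    {
        var document = new XmlDocument();
        try
        {
            document.LoadXml(markup);
            return document.InnerText;
        }
        catch (XmlException)
        {
            // The XML parser rejects markup such as an unclosed <br>.
            return null;
        }
    }

    static void Main()
    {
        Console.WriteLine(TryGetInnerText(
            "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"));
        Console.WriteLine(TryGetInnerText(
            "<html><body><p>Line one<br>Line two</p></body></html>")
            ?? "(not well-formed XML)");
    }
}
```

`InnerText` simply concatenates the text nodes, so no separator is inserted between the heading and the paragraph in the first snippet.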

There you have it! Three simple ways to extract only the displayed text from a web page, with no external "packages" involved.


2014-03-03 04:54:55
