C# Internet Explorer and stripping HTML tags

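Is there a way to open an Internet Explorer process from C#, send HTML content to that browser, and capture the 'displayed' content?

I know about other HTML-stripping methods (e.g. HtmlAgilityPack), but I would like to explore the approach above.

Thanks, LG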

2012-02-19 Luke G

Perfectly valid question IMO. – 2012-02-19 14:37:52

You can use the 'WebBrowser' control. 'webBrowser.DocumentText = ....' – 2012-02-19 14:47:31

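You can use the WebBrowser control, which exists for both WinForms and WPF, to host IE inside your application. You can then set the control's Source to your HTML, wait for the content to load (use the LayoutUpdated event rather than the Loaded event: Loaded is raised when the HTML has finished downloading, which is not necessarily after the layout has been arranged and all dynamic JS has run), and then access the Document property to get the HTML.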
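For the WinForms flavour of this approach, a minimal sketch might look like the following. It combines the 'DocumentText' property mentioned in the comment above with the control's DocumentCompleted event, which is my own choice and not something named in this thread. The class and method names (HtmlTextExtractor, GetVisibleText) are hypothetical. The control is an ActiveX wrapper, so the sketch assumes it has to run on an STA thread with a message loop, and it does not wait for scripts that keep changing the page after loading completes.

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

// Hypothetical helper, not part of the original thread.
static class HtmlTextExtractor
{
    // Renders the HTML in an off-screen WebBrowser control and returns
    // the text of the rendered body (no tags).
    public static string GetVisibleText(string html)
    {
        string result = string.Empty;

        // WebBrowser wraps an ActiveX control: it needs an STA thread
        // and a running message loop to load anything.
        var thread = new Thread(() =>
        {
            using (var browser = new WebBrowser())
            {
                browser.ScriptErrorsSuppressed = true;
                browser.DocumentCompleted += (sender, e) =>
                {
                    HtmlDocument doc = browser.Document;
                    if (doc != null && doc.Body != null)
                    {
                        result = doc.Body.InnerText ?? string.Empty;
                    }
                    Application.ExitThread(); // stop the loop started below
                };
                browser.DocumentText = html;  // starts loading the markup
                Application.Run();            // pump messages until ExitThread
            }
        });
        thread.SetApartmentState(ApartmentState.STA);
        thread.Start();
        thread.Join();
        return result;
    }
}
```

Calling HtmlTextExtractor.GetVisibleText("<p>Hello <b>world</b></p>") should hand back the text of the paragraph without the markup; exact whitespace depends on how IE lays the content out.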


2012-02-19 14:52:10

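Hi, thanks for pointing me in the right direction. Could you tell me how to extract the visible content from the Document property? Thanks – 2012-02-19 15:42:44

The Document property is declared as 'object' in WPF, but it is actually a wrapper around a COM IHTMLDocument object. Step into it in the debugger and you will see that it has some useful properties, such as 'body', which holds an HTMLBodyClass object. That object in turn has an 'innerHTML' property, containing the displayed HTML (after JavaScript has run), and an 'innerText' property, containing only the text with no tags. – 2012-02-19 16:47:44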
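One way to reach 'body' and 'innerText' from WPF without adding a COM interop reference is late binding through 'dynamic'. The sketch below assumes .NET 4 or later with a reference to Microsoft.CSharp. It uses the WPF control's NavigateToString method and LoadCompleted event, neither of which is named in this thread, and the names WpfBrowserText, CaptureText and onText are hypothetical. Like any load-completion event, LoadCompleted may fire before late-running scripts have finished.

```csharp
using System;
using System.Windows.Controls;
using System.Windows.Navigation;

// Hypothetical helper, not part of the original thread.
static class WpfBrowserText
{
    // Loads the HTML into an existing WPF WebBrowser and passes the
    // rendered body text to onText once loading has completed.
    public static void CaptureText(WebBrowser browser, string html, Action<string> onText)
    {
        LoadCompletedEventHandler handler = null;
        handler = (object sender, NavigationEventArgs e) =>
        {
            browser.LoadCompleted -= handler; // one-shot subscription

            object raw = browser.Document;    // declared as object in WPF
            if (raw == null)
            {
                onText(string.Empty);
                return;
            }

            // Late-bound access to the underlying COM document object.
            dynamic doc = raw;
            dynamic body = doc.body;
            string text = (object)body == null
                ? string.Empty
                : (string)body.innerText;
            onText(text ?? string.Empty);
        };

        browser.LoadCompleted += handler;
        browser.NavigateToString(html);
    }
}
```

Swapping 'innerText' for 'innerHTML' in the same spot would give the rendered markup instead of the plain text.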

using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// The method below is a member of the poster's own class (not shown).
// listTest is assumed to be a List<LinkItem> field of that class.
public List<LinkItem> getListOfLinksFromPage(string webpage)
{
    List<LinkItem> list = new List<LinkItem>();
    try
    {
        // Download the raw HTML; dispose the WebClient when done.
        using (WebClient w = new WebClient())
        {
            string s = w.DownloadString(webpage);

            foreach (LinkItem i in LinkFinder.Find(s))
            {
                //Debug.WriteLine(i);
                //richTextBox1.AppendText(i.ToString() + "\n");
                list.Add(i);
            }
        }
        listTest = list;
        return list;
    }
    catch (Exception)
    {
        // Best effort: on any failure, return whatever was collected so far.
        return list;
    }
}

public struct LinkItem
{
    public string Href;
    public string Text;

    public override string ToString()
    {
        return Href;
    }
}

static class LinkFinder
{
    public static List<LinkItem> Find(string file)
    {
        List<LinkItem> list = new List<LinkItem>();

        // 1.
        // Find all <a ...>...</a> matches in file.
        MatchCollection m1 = Regex.Matches(file, @"(<a.*?>.*?</a>)", RegexOptions.Singleline);

        // 2.
        // Loop over each match.
        foreach (Match m in m1)
        {
            string value = m.Groups[1].Value;
            LinkItem i = new LinkItem();

            // 3.
            // Get href attribute (double-quoted values only).
            Match m2 = Regex.Match(value, @"href=\""(.*?)\""",
                RegexOptions.Singleline);
            if (m2.Success)
            {
                i.Href = m2.Groups[1].Value;
            }

            // 4.
            // Remove inner tags from text.
            string t = Regex.Replace(value, @"\s*<.*?>\s*", "",
                RegexOptions.Singleline);
            i.Text = t;

            list.Add(i);
        }

        return list;
    }
}

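Someone else wrote the regular expressions, so I can't take credit for them, but the code above creates a WebClient object, passes it the web page, and uses regular expressions to find all the child links on that page. I'm not sure this is what you are looking for, but if you just want to "grab" all of the HTML content and save it to a file, you can simply save the string "s" created in the line "string s = w.DownloadString(webpage);" to a file.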
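Saving the downloaded string could be sketched like this. PageSaver and SavePage are hypothetical names. Note that this keeps the raw markup exactly as the server sent it: nothing is rendered and no tags are stripped.

```csharp
using System.IO;
using System.Net;

// Hypothetical helper, not part of the original answer.
static class PageSaver
{
    // Downloads the page at url and writes its HTML source to path.
    public static void SavePage(string url, string path)
    {
        using (var client = new WebClient())
        {
            string s = client.DownloadString(url); // same call as in the answer
            File.WriteAllText(path, s);            // written as UTF-8 by default
        }
    }
}
```

The choice of File.WriteAllText is mine; any other way of writing a string to disk would do just as well.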


2012-02-19 18:08:26
