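Parsing HTML in a web crawler

Following on from my previous question here: Extending a basic web crawler to filter status codes and HTML, I am trying to extract information from HTML tags, in this case the "title", using the following method: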

public static void parsePage() throws IOException, BadLocationException
{
    HTMLEditorKit kit = new HTMLEditorKit();
    HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
    doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
    Reader HTMLReader = new InputStreamReader(testURL.openConnection()
            .getInputStream());
    kit.read(HTMLReader, doc, 0);

    // Create an iterator for all HTML tags.
    ElementIterator it = new ElementIterator(doc);
    Element elem;

    while ((elem = it.next()) != null)
    {
        if (elem.getName().equals("title"))
        {
            System.out.println("found title tag");
        }
    }
}

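This works as far as telling me that it has found the tag. What I am struggling with is how to extract the information contained after/within the tags.

I found this question on the site: Help with Java Swing HTML parsing, but it states that it only works with well-formed HTML. I am hoping there is another way.

Any pointers appreciated.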

2012-07-14 Robert

It turns out that changing the approach produces the desired result:

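public static void parsePage() throws IOException, BadLocationException
{
    HTMLEditorKit kit = new HTMLEditorKit();
    HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
    // Ignore any charset directive in the page rather than failing on it.
    doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
    Reader HTMLReader = new InputStreamReader(testURL.openConnection().getInputStream());
    kit.read(HTMLReader, doc, 0);

    // The title text is stored as a document property while the page is read.
    String title = (String) doc.getProperty(Document.TitleProperty);
    System.out.println(title);
}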

I think I was off on a wild goose chase with the iterator/element stuff.
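
The TitleProperty approach above only covers the title. For text inside other tags, one option that still avoids external libraries is a parser callback. The following is a minimal sketch, not the method used in this thread: the class name TagTextExtractor, the helper firstTextOf, and the sample HTML string are hypothetical and introduced only for illustration. It reads from a string so that it can be run without a network connection.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, named here for illustration only.
public class TagTextExtractor {

    // Collects the text seen between the first start/end pair of one tag.
    private static class Collector extends HTMLEditorKit.ParserCallback {
        private final HTML.Tag wanted;
        private final StringBuilder text = new StringBuilder();
        private boolean inside = false; // currently between <tag> and </tag>
        private boolean found = false;  // the tag has been seen at least once

        Collector(HTML.Tag wanted) {
            this.wanted = wanted;
        }

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            if (tag == wanted && !found) {
                inside = true;
                found = true;
            }
        }

        @Override
        public void handleText(char[] data, int pos) {
            if (inside) {
                text.append(data);
            }
        }

        @Override
        public void handleEndTag(HTML.Tag tag, int pos) {
            if (tag == wanted) {
                inside = false;
            }
        }
    }

    /** Returns the text inside the first occurrence of the tag, or null if the tag never appears. */
    public static String firstTextOf(Reader html, HTML.Tag wanted) throws IOException {
        Collector collector = new Collector(wanted);
        // The last argument tells the parser to ignore charset directives.
        new ParserDelegator().parse(html, collector, true);
        return collector.found ? collector.text.toString() : null;
    }

    public static void main(String[] args) throws IOException {
        // Sample input with an unclosed <p>, assumed for demonstration.
        String html = "<html><head><title>Example Page</title></head>"
                + "<body><p>Hello</body></html>";
        System.out.println(firstTextOf(new StringReader(html), HTML.Tag.TITLE));
    }
}
```

To use it in the crawler, the StringReader would be swapped for the InputStreamReader built from testURL in the code above.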

2012-07-14 21:57:23 Robert

Use Jodd:

Jerry jerry = jerry().enableHtmlMode().parse(html); 
...

Or HtmlParser:

Parser parser = new Parser(htmlInput); 
CssSelectorNodeFilter cssFilter = new CssSelectorNodeFilter("title"); 
NodeList nodes = parser.parse(cssFilter);

2012-07-14 21:24:02

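Thanks Alexey. Is there a way to do this without using an external library? – Robert 2012-07-14 21:26:22

If you need a quick and dirty solution, you can use a regular expression to extract the title, but in general, avoid using regular expressions on HTML – 2012-07-14 21:28:24

Yes, I have noticed that using regular expressions to parse HTML is frowned upon. In this case I only need the "title" information. – Robert 2012-07-14 21:40:27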
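
For reference, the quick and dirty regular expression route mentioned in the comments might look like the sketch below. It is an illustration only: the class name TitleRegex, the method extractTitle, and the exact pattern are assumptions rather than code from the thread, and the pattern can be fooled by a title tag that appears inside a comment or script block.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here for illustration only.
public class TitleRegex {

    // Matches <title ...>text</title>, ignoring case and allowing line breaks in the text.
    private static final Pattern TITLE = Pattern.compile(
            "<title\\b[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the title text with whitespace collapsed, or null if no title tag is found. */
    public static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        if (!m.find()) {
            return null;
        }
        // Collapse runs of whitespace (including newlines) into single spaces.
        return m.group(1).replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><TITLE>\n My  Page </TITLE></head><body></body></html>";
        System.out.println(extractTitle(html));
    }
}
```

This only makes sense when a single, simple field such as the title is needed. For anything structural, a real parser remains the safer choice.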
