How to parse text from HTML

16

From the jsoup cookbook: http://jsoup.org/cookbook/extracting-data/attributes-text-html

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
String text = doc.body().text(); // "An example link."

Source

2010-08-17 22:13:45

+0

How can invisible elements be excluded? (e.g. display: none) – Ehsan 2013-06-19 06:51:26

0

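Well, here is a quick method I threw together once. It uses regular expressions to get the job done. Most people would agree that this is not a good way to do it. So, use it at your own risk.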

public static String getPlainText(String html) { 
    String htmlBody = html.replaceAll("<hr>", ""); // one off for horizontal rule lines 
    String plainTextBody = htmlBody.replaceAll("<[^<>]+>([^<>]*)<[^<>]+>", "$1"); 
    plainTextBody = plainTextBody.replaceAll("<br ?/>", ""); 
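    // decodeHtml is a helper that is not shown in this answer
    // (presumably it decodes HTML entities such as &amp;).
    return decodeHtml(plainTextBody);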
}

This was originally used in my API wrapper for the Stack Overflow API. So, it has only been tested against a small subset of HTML tags.

Source

2010-08-17 22:15:07 jjnguy

+0

Hmm... why not use a simple regex: 'replaceAll("<[^>]+>", "")'? – Crozin 2010-08-17 22:28:04

+0

@Crozin, well, I guess I was teaching myself how to use back-references. It looks like yours would probably work too. – jjnguy 2010-08-17 22:31:03

+0

This hurts! -> http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – sleeplessnerd 2011-08-27 13:54:15
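To make the comparison in the comments above concrete, here is a minimal sketch that runs both patterns on the same input. The class name `TagStripSketch` and the method names are hypothetical and are not part of either poster's code. It only illustrates the difference in behavior and is not a recommendation to parse HTML with regular expressions.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class TagStripSketch {
    // Single-pass pattern from Crozin's comment: anything between '<' and '>'.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    // Paired pattern from the answer: tag, text, tag; keeps only the text.
    private static final Pattern TAG_PAIR =
            Pattern.compile("<[^<>]+>([^<>]*)<[^<>]+>");

    static String stripAllTags(String html) {
        return ANY_TAG.matcher(html).replaceAll("");
    }

    static String stripTagPairs(String html) {
        // "$1" is the back-reference to the text captured between the two tags.
        return TAG_PAIR.matcher(html).replaceAll("$1");
    }

    public static void main(String[] args) {
        String html = "<p>Hello<br>world</p>";
        System.out.println(stripAllTags(html));  // Helloworld
        System.out.println(stripTagPairs(html)); // Helloworld</p>
    }
}
```

Because the paired pattern consumes tags two at a time, an input with an odd number of tags leaves one tag behind, while the single-pass pattern removes each tag independently. Neither version decodes entities or skips the contents of script and style elements.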

1

Use classes that are part of the JDK:

import java.io.*; 
import java.net.*; 
import javax.swing.text.*; 
import javax.swing.text.html.*; 

class GetHTMLText
{
    public static void main(String[] args)
        throws Exception
    {
        EditorKit kit = new HTMLEditorKit();
        Document doc = kit.createDefaultDocument();

        // The Document class does not yet handle charsets properly.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

        // Create a reader on the HTML content.
        Reader rd = getReader(args[0]);

        // Parse the HTML.
        kit.read(rd, doc, 0);

        // The HTML text is now stored in the document.
        System.out.println(doc.getText(0, doc.getLength()));
    }

    // Returns a reader on the HTML data. If 'uri' begins
    // with "http:", it's treated as a URL; otherwise,
    // it's assumed to be a local filename.
    static Reader getReader(String uri)
        throws IOException
    {
        // Retrieve from Internet.
        if (uri.startsWith("http:"))
        {
            URLConnection conn = new URL(uri).openConnection();
            return new InputStreamReader(conn.getInputStream());
        }
        // Retrieve from file.
        else
        {
            return new FileReader(uri);
        }
    }
}
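If the HTML is already in memory as a String, the same JDK classes can probably be fed from a StringReader instead of a URL or file. The following is a minimal sketch under that assumption. The class name `HtmlStringText` and the method `toText` are hypothetical, and the exact whitespace of the returned text may differ from what a browser or jsoup would produce.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.EditorKit;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper class, for illustration only.
class HtmlStringText {
    // Parses an in-memory HTML string and returns the document text.
    static String toText(String html) {
        EditorKit kit = new HTMLEditorKit();
        Document doc = kit.createDefaultDocument();

        // Same workaround as above: ignore any charset directive in the markup.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

        try (Reader rd = new StringReader(html)) {
            kit.read(rd, doc, 0);
            // trim() drops leading and trailing whitespace around the text.
            return doc.getText(0, doc.getLength()).trim();
        } catch (IOException | BadLocationException e) {
            // Rethrown unchecked to keep the sketch easy to call.
            throw new IllegalStateException("Could not parse HTML", e);
        }
    }

    public static void main(String[] args) {
        String html =
            "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
        System.out.println(toText(html));
    }
}
```

This keeps everything inside the JDK, at the cost of depending on the Swing text classes (the java.desktop module on newer Java versions).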

Source

2010-08-17 23:14:11 camickr
