2017-05-17 41 views
0

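Parsing HTML (a web page) with Java SE

I need to build a web scraping tool that fetches a web page by URL and then counts the number of words and the number of characters shown on that page.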

URL url = new URL(urlStr); 
URLConnection connection = url.openConnection(); 
InputStream inputStream = connection.getInputStream(); 
BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream,"UTF-8")); 

With this I can read all the text of the page (including the HTML tags). What should I do next?

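Can anyone help me? Some documentation or something else to read would be welcome. I must use only Java SE; third-party libraries are not allowed.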
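For the counting step on its own, plain Java SE is enough once the visible text is available as a String. The following is a minimal sketch, assuming that a word is a run of non-whitespace characters and that a character is one Unicode code point, spaces included. The class name `TextStats` and both method names are hypothetical, and the assignment may define "word" and "character" differently.

```java
// Hypothetical helper: counts words and characters in already-extracted page text.
final class TextStats {

    // Assumption: a "word" is any maximal run of non-whitespace characters.
    static int countWords(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return trimmed.split("\\s+").length;
    }

    // Assumption: a "character" is one Unicode code point, whitespace included.
    static int countCharacters(String text) {
        return text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        String text = "Username : Password :";
        System.out.println("words: " + countWords(text));
        System.out.println("characters: " + countCharacters(text));
    }
}
```

If whitespace should not count as a character, one option is to strip it first, for example with `text.replaceAll("\\s+", "")`, before calling `countCharacters`.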

+1

Why exactly? There are so many libraries, and *reinventing the wheel* is usually a bad choice.

+0

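@Shashwat I understand, and I know about jsoup and the others. But this is a test assignment. They said "Hint: do not use third-party libraries". I agree with you, but after 5 hours I still have not found a good answer for this task.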

+0

I tried going through HTMLEditorKit, but is that the right approach?
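One way the HTMLEditorKit route might look, using only classes that ship with Java SE (`HTMLEditorKit.ParserCallback` and `ParserDelegator` from the Swing packages). This is a sketch under assumptions, not a definitive solution: the class name `HtmlTextExtractor` and the method `extractText` are hypothetical, text chunks are joined with a single space (so a word split by inline tags would be broken in two), and the parser may also report text that a browser would not show in the page body, such as the `<title>` content.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: pulls text out of HTML using only classes shipped with Java SE.
final class HtmlTextExtractor {

    // Feeds the HTML to Swing's parser and collects every text chunk it reports.
    static String extractText(Reader html) throws IOException {
        StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback collector = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Assumption: chunks are separated by one space, so a word that is
                // split by inline tags would be counted as two words.
                out.append(data).append(' ');
            }
        };
        // true = ignore any charset declared in a <meta> tag; the Reader already decoded the bytes.
        new ParserDelegator().parse(html, collector, true);
        return out.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><p>Hello <b>big</b> world</p></body></html>";
        String text = extractText(new StringReader(html));
        System.out.println("text: " + text);
        System.out.println("words: " + text.split("\\s+").length);
    }
}
```

The `BufferedReader` built from the `URLConnection` in the question could be passed to `extractText` in place of the `StringReader` used here.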

Answers

0

For example, suppose you have the following page (the code below reads it from a file named login.html):

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> 
<html> 
    <head> 
     <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> 
     <title>Login Page</title> 
    </head> 
    <body> 
     <div id="login" class="simple" > 
      <form action="login.do"> 
       Username : <input id="username" type="text" /> 
       Password : <input id="password" type="password" /> 
       <input id="submit" type="submit" /> 
       <input id="reset" type="reset" /> 
      </form> 
     </div> 
    </body> 
</html> 

To parse it with jsoup, you can do this:

import java.io.File; 
import java.io.IOException; 
import org.jsoup.Jsoup; 
import org.jsoup.nodes.Document; 
import org.jsoup.nodes.Element; 

/** 
* Java Program to parse/read HTML documents from File using Jsoup library. 
*/ 
public class HTMLParser{ 

    public static void main(String args[]) { 

     // Parse HTML String using JSoup library 
     String HTMLSTring = "<!DOCTYPE html>" 
       + "<html>" 
       + "<head>" 
       + "<title>JSoup Example</title>" 
       + "</head>" 
       + "<body>" 
       + "<table><tr><td><h1>HelloWorld</h1></tr>" 
       + "</table>" 
       + "</body>" 
       + "</html>"; 

     Document html = Jsoup.parse(HTMLSTring); 
     String title = html.title(); 
     String h1 = html.body().getElementsByTag("h1").text(); 

     System.out.println("Input HTML String to JSoup :" + HTMLSTring); 
     System.out.println("After parsing, Title : " + title); 
     System.out.println("Afte parsing, Heading : " + h1); 

     // JSoup Example 2 - Reading HTML page from URL 
     Document doc; 
     try { 
      doc = Jsoup.connect("http://google.com/").get(); 
      title = doc.title(); 
     } catch (IOException e) { 
      e.printStackTrace(); 
     } 

     System.out.println("Jsoup Can read HTML page from URL, title : " + title); 

     // JSoup Example 3 - Parsing an HTML file in Java
     // Wrong: Jsoup.parse("login.html", "ISO-8859-1") would treat the string
     // "login.html" as the HTML markup itself, not as a file name.
     try {
      // Right: pass a File together with the charset used to decode it
      Document htmlFile = Jsoup.parse(new File("login.html"), "ISO-8859-1");
      title = htmlFile.title();
      Element div = htmlFile.getElementById("login");
      String cssClass = div.className(); // getting the class from the HTML element

      System.out.println("Jsoup can also parse HTML file directly");
      System.out.println("title : " + title);
      System.out.println("class of div tag : " + cssClass);
     } catch (IOException e) {
      // Without the file there is nothing to report, so only log the failure
      e.printStackTrace();
     }
    }
} 

Output:

Input HTML String to JSoup :<!DOCTYPE html><html><head><title>JSoup Example</title></head><body><table><tr><td><h1>HelloWorld</h1></tr></table></body></html> 
After parsing, Title : JSoup Example 
Afte parsing, Heading : HelloWorld 
Jsoup Can read HTML page from URL, title : Google 
Jsoup can also parse HTML file directly 
title : Login Page 
class of div tag : simple 
+0

The OP specifically said that *third-party libraries cannot be used*.

+0

OK, understood. I only read the question through once.