HTML not downloaded correctly

I've been trying to download the source of the Google News RSS feed. It downloads correctly, except that the links are displayed incorrectly.

static String urlNotizie = "https://news.google.it/news/feeds?pz=1&cf=all&ned=it&hl=it&output=rss"; 
Document docHtml = Jsoup.connect(urlNotizie).get(); 
String html = docHtml.toString(); 
System.out.println(html);

Output:

<html> 
<head></head> 
<body> 
    <rss version="2.0"> 
    <channel> 
    <generator> 
    NFE/1.0 
    </generator> 
    <title>Prima pagina - Google News</title> 
    <link />http://news.google.it/news?pz=1&amp;ned=it&amp;hl=it 
    <language> 
    it 
    </language> 
    <webmaster> 
    [email protected] 
    </webmaster> 
    <copyright> 
    &amp;copy;2013 Google 
    </copyright> [...]

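Using a URLConnection I am able to print the correct source of the page. But during parsing I hit the same problem as above: it produces a list of <link /> elements. (Again, only the links are affected; parsing everything else works fine.) URLConnection example: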

URL u = new URL(urlNotizie);
URLConnection yc = u.openConnection();

StringBuilder builder = new StringBuilder();
BufferedReader reader = new BufferedReader(
        new InputStreamReader(yc.getInputStream()));
String line;
while ((line = reader.readLine()) != null) {
    builder.append(line);
    builder.append("\n");
}
String html = builder.toString();
System.out.println("HTML " + html);

Document doc = Jsoup.parse(html);

Elements listaTitoli = doc.select("title");
Elements listaCategorie = doc.select("category");
Elements listaDescrizioni = doc.select("description");
Elements listaUrl = doc.select("link");
System.out.println(listaUrl);


2013-12-12 Angelo Tricarico

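It is downloaded correctly, otherwise jsoup would not be able to turn it into a Document. Something is apparently going wrong in the toString() method. Of course, you could also use URLConnection or Apache HttpClient to fetch the RSS data directly. – Gimby

Updated the question with new code. –

Jsoup is designed as an HTML parser, not as an XML (or RSS) parser.

The HTML <link> element is specified to have no body. A <link> element with a body, as in your XML, is invalid in HTML.

You can parse XML with Jsoup, but you need to explicitly tell it to switch to XML parsing mode.

Replace

Document docHtml = Jsoup.connect(urlNotizie).get();

with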

Document docXml = Jsoup.connect(urlNotizie).parser(Parser.xmlParser()).get();
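
The difference between the two modes can be sketched on a small inline feed, assuming jsoup is on the classpath. The class name LinkParsingDemo, its helper methods, and the shortened sample feed are made up for illustration. Jsoup.parse(String, String, Parser) is the string-based counterpart of the connect(...).parser(...) call above, so the same switch should also apply to markup that was fetched through a URLConnection.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class LinkParsingDemo {
    // Hypothetical sample feed, shortened from the output shown in the question.
    static final String RSS =
        "<rss version=\"2.0\"><channel>"
        + "<title>Prima pagina - Google News</title>"
        + "<link>http://news.google.it/news?pz=1</link>"
        + "</channel></rss>";

    // HTML mode: <link> is treated as an empty element, so the URL ends up outside it.
    static String linkTextAsHtml(String markup) {
        Document doc = Jsoup.parse(markup);
        return doc.select("link").first().text();
    }

    // XML mode: <link> keeps its body like any other element.
    static String linkTextAsXml(String markup) {
        Document doc = Jsoup.parse(markup, "", Parser.xmlParser());
        return doc.select("link").first().text();
    }

    public static void main(String[] args) {
        System.out.println("HTML mode: [" + linkTextAsHtml(RSS) + "]");
        System.out.println("XML mode:  [" + linkTextAsXml(RSS) + "]");
    }
}
```

In HTML mode the selected element is empty, which matches the <link /> lines in the question's output; in XML mode the URL is the element's text.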


2013-12-12 09:52:59 BalusC

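I did: 'Document docHtml = Jsoup.connect(urlNotizie).parser(Parser.xmlParser()).get(); String html = docHtml.toString(); Document doc = Jsoup.parse(html);' but it still doesn't work. It keeps parsing weird links. –

Huh? Why are you doing 'String html = docHtml.toString(); Document doc = Jsoup.parse(html);'? Get rid of those lines. You are simply re-parsing the already parsed XML as HTML here, which makes no sense, as explained in my answer. Just use 'docXml.select("link")' and so on. – BalusC

It works like a charm. Thanks for the explanation. –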
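
Querying the XML-parsed document directly, without the toString() and Jsoup.parse(html) round trip, might look like the sketch below. The class FeedLinks, its methods, and the two-item sample feed with example.com URLs are assumptions for illustration; with a live feed the Document would come from the connect(...).parser(Parser.xmlParser()).get() call in the answer.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class FeedLinks {
    // Hypothetical two-item feed; the channel-level <link> is not part of any item.
    static final String SAMPLE =
        "<rss version=\"2.0\"><channel>"
        + "<link>http://example.com/</link>"
        + "<item><title>A</title><link>http://example.com/a</link></item>"
        + "<item><title>B</title><link>http://example.com/b</link></item>"
        + "</channel></rss>";

    // Parses feed markup in XML mode so <link> bodies are preserved.
    static Document parseFeed(String xml) {
        return Jsoup.parse(xml, "", Parser.xmlParser());
    }

    // Collects the text of the <link> found inside each <item>.
    static List<String> itemLinks(Document docXml) {
        List<String> links = new ArrayList<>();
        for (Element item : docXml.select("item")) {
            links.add(item.select("link").text());
        }
        return links;
    }

    public static void main(String[] args) {
        for (String link : itemLinks(parseFeed(SAMPLE))) {
            System.out.println(link);
        }
    }
}
```

Selecting within each item keeps the channel-level link out of the result, which a plain select("link") on the whole document would include.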
