Unable to parse a complete HTML page into a Document

In this code I want to parse an entire (local) HTML file into the Document variable, but I observe that it only parses less than 10% of the content. Please help!

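Document doc = null;
HashSet<String> urlSet = null;
try {
    // Passing null as the charset makes jsoup detect it from the document, falling back to UTF-8.
    doc = Jsoup.parse(file, null);
} catch (IOException e) {
    e.printStackTrace();
    return urlSet; // still null at this point
}

urlSet = getLinks(doc);
if (urlSet != null)
    urlSet = refineURLs(urlSet);
return urlSet;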

Source

2015-09-17 uniquephase

How many bytes is the `file`? – luksch

The HTML file size is 100 KB. It was downloaded from this link, https://en.wikipedia.org/wiki/Developmental_biology, and saved as an HTML file. – uniquephase

I think this is because of the relative links in the HTML. Use this instead:

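// Read the whole file into a String, then parse it with an explicit base URI
// so that relative links such as /wiki/... can be resolved to absolute URLs.
// Note: the page is served as UTF-8, so StandardCharsets.UTF_8 may be safer than the platform default.
String html = readFile(file.getAbsolutePath(), Charset.defaultCharset());
doc = Jsoup.parse(html, "https://en.wikipedia.org/wiki/Developmental_biology");

// Needs java.io.IOException, java.nio.charset.Charset, java.nio.file.Files, java.nio.file.Paths.
private static String readFile(String path, Charset encoding) throws IOException {
    byte[] encoded = Files.readAllBytes(Paths.get(path));
    return new String(encoded, encoding);
}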
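
To illustrate why the base URI matters, here is a minimal sketch that uses only `java.net.URI` from the JDK rather than jsoup itself. In jsoup, the second argument of `Jsoup.parse(String html, String baseUri)` is the base against which relative links are resolved, while `Jsoup.parse(File, String)` uses the file's location as the base. The class name `RelativeLinkDemo`, the helper `resolve`, the link `/wiki/Embryology`, and the local path `/tmp/Developmental_biology.html` are assumptions made up for this example. Whether this actually explains the missing links depends on how `getLinks` reads them, which the question does not show.

```java
import java.net.URI;

public class RelativeLinkDemo {
    // Hypothetical helper: resolve an href against a base URI.
    // Returns "" when the result has no scheme, i.e. is not an absolute URL.
    static String resolve(String baseUri, String href) {
        URI resolved = URI.create(baseUri).resolve(href);
        return resolved.isAbsolute() ? resolved.toString() : "";
    }

    public static void main(String[] args) {
        String href = "/wiki/Embryology"; // a site-relative link, as found in the saved page

        // Base is the original page URL: the link becomes a full https URL.
        System.out.println(resolve("https://en.wikipedia.org/wiki/Developmental_biology", href));

        // Base is a local file path (assumed location): no scheme, so no absolute URL results.
        System.out.println("[" + resolve("/tmp/Developmental_biology.html", href) + "]");
    }
}
```

If `getLinks` keeps only absolute URLs, most links in a saved Wikipedia page would be dropped with a local-file base, which could look like only a small part of the page being parsed.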

Source

2015-09-17 12:31:55 Flagman

Do you want me to create a method `readFile()`? Could you elaborate on this function? – uniquephase
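
For reference, a helper like `readFile` simply loads all bytes of a file and decodes them into a `String` with the given charset. Below is a minimal, self-contained sketch using only the JDK. The class name `ReadFileDemo`, the `roundTrip` helper, and the temporary file are assumptions added for the demonstration and are not part of the answer above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ReadFileDemo {
    // Same shape as the helper in the answer: read every byte, then decode.
    static String readFile(String path, Charset encoding) throws IOException {
        byte[] encoded = Files.readAllBytes(Paths.get(path));
        return new String(encoded, encoding);
    }

    // Hypothetical demo helper: write content to a temp file as UTF-8 and read it back.
    static String roundTrip(String content) {
        try {
            Path tmp = Files.createTempFile("page", ".html");
            try {
                Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
                return readFile(tmp.toString(), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<a href=\"/wiki/Embryology\">Embryology</a>";
        // The string read back should match what was written.
        System.out.println(roundTrip(html).equals(html));
    }
}
```

The resulting `String` can then be passed to `Jsoup.parse(html, baseUri)`. Decoding with the same charset the file was saved in matters; a mismatch would garble non-ASCII characters.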
