2013-09-25 35 views

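How do I get the cleaned HTML file from HtmlCleaner?

On first launch, my app downloads a website as an HTML file. The HTML is very messy, so I want to clean it with HtmlCleaner so that I can then parse it with Jsoup. But how do I get hold of the new, cleaned HTML after cleaning?

I did some research, and this is what I could find: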

HtmlCleaner htmlCleaner = new HtmlCleaner(); 

TagNode root = htmlCleaner.clean(url); 

htmlCleaner.getInnerHtml(root); 

String html = "<" + root.getName() + ">" + htmlCleaner.getInnerHtml(root) + "</" + root.getName() + ">"; 

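But I can't see how this code would write the result to a new file. If it doesn't, how can I implement this so that the old file is deleted and a new, cleaned HTML file is created?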

What is 'messy' HTML? – RvdK

Check the source of [this](https://www.easistent.com/urniki/263/razredi/16515) link – Guy

Answer

You can do the following:

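import java.net.URL;
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.PrettyXmlSerializer;
import org.htmlcleaner.SimpleHtmlSerializer;
import org.htmlcleaner.TagNode;

HtmlCleaner cleaner = new HtmlCleaner();
// the serializers need the cleaner's properties
CleanerProperties props = cleaner.getProperties();
final String siteUrl = "http://www.themoscowtimes.com/";

// download and clean the page (can throw IOException)
TagNode node = cleaner.clean(new URL(siteUrl));

// serialize to xml file
new PrettyXmlSerializer(props).writeToFile(
    node, "cleaned.xml", "utf-8"
);

// serialize to html file
SimpleHtmlSerializer serializer = new SimpleHtmlSerializer(props);
serializer.writeToFile(node, "c:/temp/cleaned.html");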
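
On the second part of the question (deleting the old file and creating the cleaned one), here is a minimal sketch using only the JDK. It assumes you already have the cleaned markup as a String; as far as I know, the HtmlCleaner 2.x serializers can produce one via `getAsString(node)`, but check that against your version. It also assumes `java.nio.file` is available on your platform. The names `CleanedFileWriter` and `replaceWithCleaned` are made up for illustration and are not part of HtmlCleaner.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of HtmlCleaner or Jsoup.
final class CleanedFileWriter {

    // Writes cleanedHtml to a temporary file in the same directory as target,
    // then moves it over target, replacing the old (messy) file if it exists.
    static void replaceWithCleaned(Path target, String cleanedHtml) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, "cleaned", ".tmp");
        try {
            Files.write(tmp, cleanedHtml.getBytes(StandardCharsets.UTF_8));
            Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // Only has an effect if the move did not happen (e.g. the write failed).
            Files.deleteIfExists(tmp);
        }
    }
}
```

Writing to a temporary file first means the old file is only replaced once the cleaned content has been written completely. If all you need is to feed the result to Jsoup, you may not need a file at all: passing the cleaned String to `Jsoup.parse(html)` should work as well.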