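How do I read an htm file inside a zip file?

I have a zip file that contains Index.htm. I need to read the contents of Index.htm, find a date in it (December 2011), create a directory named after that date, and then extract the zip file into that directory. How do I read the htm file inside the zip file?

Here is the HTML file: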

<HTML>  
    <HEAD></HEAD>  
    <BODY>  
    <A Name="TopOfPage"></A>  
    <TABLE Width="100%" Border="0" CellPadding="0" CellSpacing="0">  
    <TR> 
    <TD Width="30%"><A HRef="HeaderTxt/HetBCFI.htm">Het B.C.F.I.</A></TD>  
    </TR>  
    </TABLE>  
    <TABLE Width="100%" Border="0" CellPadding="0" CellSpacing="0"> 
    <TR> 
    <TD RowSpan="2" Width="10"></TD> 
    <TD Width="70%"><STRONG><FONT Face="Arial" Size="2">Gecommentarieerd Geneesmiddelenrepertorium</FONT></STRONG></TD> 
    <TD Width="29%" Align="Right" Class= "Datum">&nbsp; 
    December 2011&nbsp;&nbsp; 
    </TD> 
    <TD Rowspan="2" Width="10"></TD> 
</TR> 
</TABLE> </BODY> </HTML>

2012-01-16 michdraft

This is the final, working code I ended up using. Thanks for the helpful hints.

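// Requires java.io.File, java.io.IOException, java.io.InputStream, java.util.zip.ZipFile,
// org.jsoup.Jsoup and org.jsoup.nodes.Document.
public static String getDateWithinHtmlInsideZipFile(File archive) throws IOException {
    // try-with-resources closes both the entry stream and the archive
    try (ZipFile zp = new ZipFile(archive);
         InputStream in = zp.getInputStream(zp.getEntry("Index.htm"))) {
        // Parse the entry as UTF-8 with an empty base URI
        Document doc = Jsoup.parse(in, "UTF-8", "");
        // Text of the element(s) with class "Datum", e.g. "December 2011"
        return doc.body().getElementsByClass("Datum").text().trim();
    }
}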

2012-01-18 10:43:48 michdraft

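A few steps:

Use the java.util.zip package and create an unzip stream.
Use an HTML parser (such as jsoup) to walk the nodes, and...
Use a regular expression, or a regular expression together with a date parser (such as SimpleDateFormat), to pick out the date.

This assumes the date you are looking for is always in a text node.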
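The last step above could be sketched as follows. This is only an illustration, not the answerer's code: the class name DateFinder, the method findMonthYear, and the "yyyy-MM" output format are assumptions chosen for this example, and it assumes the month name is in English.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: names and output format are assumptions for illustration.
class DateFinder {
    // A run of letters, whitespace, then a four-digit year, e.g. "December 2011".
    private static final Pattern MONTH_YEAR =
            Pattern.compile("\\b(\\p{Alpha}+)\\s+(\\d{4})\\b");

    /** Returns the first "Month yyyy" found in text as "yyyy-MM", or null if there is none. */
    static String findMonthYear(String text) {
        SimpleDateFormat in = new SimpleDateFormat("MMMM yyyy", Locale.ENGLISH);
        in.setLenient(false);
        SimpleDateFormat out = new SimpleDateFormat("yyyy-MM", Locale.ENGLISH);
        Matcher m = MONTH_YEAR.matcher(text);
        while (m.find()) {
            try {
                // The regex only proposes candidates; the date parser decides
                // whether the word really is a month name.
                Date d = in.parse(m.group(1) + " " + m.group(2));
                return out.format(d);
            } catch (ParseException e) {
                // Not a month name (e.g. "Room 2011"); try the next candidate.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(findMonthYear("Geneesmiddelenrepertorium December 2011")); // prints 2011-12
    }
}
```

Returning a sortable form such as "2011-12" is one possible choice for a directory name; the original text "December 2011" would work just as well.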

2012-01-16 15:17:08 bdares

Extra step 1.5: ZipFile zp = new ZipFile("xxx.zip"); InputStream in = zp.getInputStream(zp.getEntry("Index.htm")); – 2012-01-16 15:21:31

I get this result: ' December 2011 '. How can I omit the &nbsp; from my string? – michdraft 2012-01-18 12:03:29

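Try this:

Use the java.util.zip package to read the HTML.
Use an HTML parser (I would suggest jsoup) to get the date string. Here is a link that should help in your case.

Once you have the date string, create the directory you want.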
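Creating the directory and extracting the archive into it might look roughly like this. It is a minimal sketch using only the JDK; the class name ZipToDatedDir and the method extractInto are made up for this example, and the check against entry names that point outside the target directory is an added precaution, not something the question asked for.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: names are assumptions for illustration.
class ZipToDatedDir {
    /** Extracts every entry of archive into parent/dirName and returns that directory. */
    static Path extractInto(Path archive, Path parent, String dirName) throws IOException {
        // Create the dated directory (and any missing parents) first.
        Path target = Files.createDirectories(parent.resolve(dirName))
                .toAbsolutePath().normalize();
        try (ZipFile zip = new ZipFile(archive.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                Path out = target.resolve(entry.getName()).normalize();
                // Refuse entries such as "../x" that would land outside the target.
                if (!out.startsWith(target)) {
                    throw new IOException("Unsafe entry name: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    try (InputStream in = zip.getInputStream(entry)) {
                        Files.copy(in, out, StandardCopyOption.REPLACE_EXISTING);
                    }
                }
            }
        }
        return target;
    }
}
```

A call such as ZipToDatedDir.extractInto(zipPath, zipPath.toAbsolutePath().getParent(), "December 2011") would place the extracted files next to the archive, where zipPath is whatever Path points at the zip file.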

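EDIT: To remove the &nbsp;, you can do one of the following:

Create another Document from the string containing the &nbsp; and do the following (note that remove() deletes the matched elements themselves, not just the character):

document.select(":containsOwn(\u00a0)").remove(); (taken from here)
Use the following (assuming the string you want to clean is htmlString):

Jsoup.parse(htmlString).text();
Use String's replaceAll() function to get rid of the &nbsp;.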
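The string-replacement option can be illustrated with plain JDK calls. This sketch assumes the text has already been decoded, so the &nbsp; entity appears as the character U+00A0; the class name NbspCleaner is hypothetical.

```java
// Hypothetical helper: the name is an assumption for illustration.
class NbspCleaner {
    /** Replaces each non-breaking space (U+00A0) with a plain space, then trims. */
    static String clean(String s) {
        // String.trim() only strips characters up to U+0020, so a
        // non-breaking space survives it unless it is replaced first.
        return s.replace('\u00a0', ' ').trim();
    }

    public static void main(String[] args) {
        System.out.println("[" + clean("\u00a0 December 2011\u00a0\u00a0") + "]"); // prints [December 2011]
    }
}
```

If the string still contains the literal entity text (for example when it comes from html() rather than text()), replacing "&nbsp;" instead of the character would be the equivalent step.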

2012-01-16 15:18:33 Santosh

When I parse the htm file, I get &nbsp; before and after my string. How do I get rid of it? – michdraft 2012-01-18 12:04:38

Updated my answer to address your concern. – Santosh 2012-01-18 12:44:04

'String date = doc.body().getElementsByClass("Datum").html().toString().replaceAll("&nbsp;", "").trim();' – michdraft 2012-01-18 14:14:35
