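How to distinguish PDF from non-PDF files?

I use the following snippet to download PDF files (I took it from here, credit to Josh M). How can I distinguish PDF from non-PDF files?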

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLConnection;

public final class FileDownloader {

    private FileDownloader() {}

    public static void main(String[] args) throws IOException {
        download("http://pdfobject.com/pdf/sample.pdf", new File("sample.pdf"));
    }

    public static void download(final String url, final File destination) throws IOException {
        final URLConnection connection = new URL(url).openConnection();
        connection.setConnectTimeout(60000); // milliseconds
        connection.setReadTimeout(60000);    // milliseconds
        connection.addRequestProperty("User-Agent", "Mozilla/5.0");
        final byte[] buffer = new byte[2048];
        int read;
        // try-with-resources closes both streams even if the copy fails part-way
        try (InputStream input = connection.getInputStream();
             OutputStream output = new FileOutputStream(destination, false)) {
            while ((read = input.read(buffer)) > -1) {
                output.write(buffer, 0, read);
            }
        }
    }
}

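It works perfectly for PDF files. However, I ran into a "bad file"... I don't know what that file's extension is, but I seem to get stuck in an infinite loop at while((read = input.read(buffer)) > -1). How can I improve this snippet so that it discards any inappropriate (non-PDF) files?

Source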
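One way to reject non-PDF responses before writing anything to disk is to peek at the first bytes of the body. PDF files normally start with the ASCII marker `%PDF-`, so a minimal sketch could look like the following. The class name `PdfSniffer` and the method `looksLikePdf` are made up for illustration and are not part of the original snippet; the check is a heuristic, since some PDF readers also accept files with a few junk bytes before the marker.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of the original FileDownloader.
final class PdfSniffer {

    // ASCII marker that normally opens a PDF file, e.g. "%PDF-1.4".
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfSniffer() {}

    /**
     * Reads the first bytes of the stream, compares them with the PDF marker,
     * and rewinds the stream so the caller can still copy the whole body.
     */
    public static boolean looksLikePdf(final BufferedInputStream input) throws IOException {
        input.mark(PDF_MAGIC.length);
        final byte[] head = new byte[PDF_MAGIC.length];
        int filled = 0;
        // A single read() may return fewer bytes than requested, so loop until full or EOF.
        while (filled < head.length) {
            final int n = input.read(head, filled, head.length - filled);
            if (n < 0) {
                break; // stream ended before the marker could be read
            }
            filled += n;
        }
        input.reset();
        return filled == head.length && Arrays.equals(head, PDF_MAGIC);
    }
}
```

In `download`, this would mean wrapping `connection.getInputStream()` in a `BufferedInputStream`, calling `PdfSniffer.looksLikePdf(...)` before opening the `FileOutputStream`, and returning early when it reports false. `connection.getContentType()` can serve as an additional hint, but servers sometimes mislabel files, so the byte check is likely the more reliable of the two.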

2013-11-15 Бывший Мусор

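*It works perfectly for PDF files. However, I ran into a "bad file"* - Did you check whether that file really was a non-PDF, or whether it was a PDF with a problem? Did you check the contents of the destination file in that case? – mkl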

There is also a question about a similar problem: Infinite Loop in Input Stream.

See a possible solution there: Abort loop after fixed time.

You can also try setting the connection's timeouts: Java URLConnection Timeout.
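As a rough sketch of the "abort after a fixed time" idea without starting an extra thread per download, the copy loop itself can enforce a deadline and a size cap. The class `BoundedCopy`, the method `copy`, and the parameters `maxBytes` and `maxMillis` are hypothetical names chosen for this example. Note the assumption: the deadline is only checked between reads, so a single blocked `read` is still bounded by `setReadTimeout` rather than by this loop.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not taken from the linked answers.
final class BoundedCopy {

    private BoundedCopy() {}

    /**
     * Copies input to output and returns the number of bytes copied.
     * Throws IOException if more than maxBytes arrive or if maxMillis elapse.
     */
    public static long copy(final InputStream input, final OutputStream output,
                            final long maxBytes, final long maxMillis) throws IOException {
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxMillis);
        final byte[] buffer = new byte[8192];
        long total = 0;
        int read;
        while ((read = input.read(buffer)) != -1) {
            total += read;
            if (total > maxBytes) {
                throw new IOException("Aborted: body is larger than " + maxBytes + " bytes");
            }
            // Compare via subtraction, as recommended for System.nanoTime() values.
            if (System.nanoTime() - deadline > 0) {
                throw new IOException("Aborted: download took longer than " + maxMillis + " ms");
            }
            output.write(buffer, 0, read);
        }
        return total;
    }
}
```

A caller could replace the `while` loop in `download` with something like `BoundedCopy.copy(input, output, 50L * 1024 * 1024, 120000)` and delete the partial destination file when an `IOException` is thrown. The limits shown are placeholders, not recommendations.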

Source

2013-11-15 14:32:40

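+1 Thanks. This solution works for small batches. However, starting a new thread for every download would be impractical. I have about 37 million files to check –

I have updated the answer with another possible solution. –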
