Search for a word in multiple PDF files and rank the PDFs by word count

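Can anyone help me search for a word in multiple PDF files and get the number of times it occurs?

I need to list the PDFs in descending order of the word's count in each document, and I have to do this in Java.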

2015-01-21 viren v

It sounds like you are looking for a starting point or some ideas rather than a specific solution. You have a few options here.

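First, you need to make sure the text content of the PDFs is searchable. Using Adobe Acrobat is one way to do that. Second, you need some kind of API to index the PDF files so that they can be searched. There is a section on the Apache Lucene site that may give you some hints.

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.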

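Keep in mind that there is not much context in your question, so indexing the PDFs with Lucene may be overkill for you.

I'd suggest googling a few approaches: try "text search in PDF files", "read PDF files in Java", and so on.

Here is another answer that may help you.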


2015-01-21 09:18:28

Thanks. I used Lucene and it works. – 2015-02-11 10:08:21

Get the data:
Download iText (a PDF library), open every PDF you want to scan, read the text inside, and build a HashMap that stores word -> count(word).
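
The word -> count map could look roughly like this once the text is out of the PDF. This is only a sketch: it assumes the extraction step (iText or anything else) has already given you a plain String, and the class name `WordFrequency` and method `countWords` are made up for illustration. It lower-cases everything and splits on anything that is not a letter or digit, which may or may not be the tokenization you want.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class WordFrequency {
    // Builds a word -> count map from text that was already extracted from a PDF.
    // (Hypothetical helper; the PDF extraction itself is not shown here.)
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Lower-case the text, then split on runs of non-letter, non-digit characters.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                // Add 1 to the existing count, or start at 1 for a new word.
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "PDF search: search every PDF for a word.";
        System.out.println(countWords(text));
    }
}
```

With one such map per file, looking up the search word in each map gives the per-document count you need for ranking.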

Sort your HashMap:
This has already been answered on Stack Overflow here: Sort a Map<Key, Value> by values (Java)
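
For the ranking part, sorting by value can be done roughly like this. It is a sketch, not the code from the linked answer: it assumes you already have a map from file name to the count of the search word in that file, and `PdfRanking`, `rankByCount` and the file names are invented for the example. Ties are broken by file name so the order is predictable.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PdfRanking {
    // Returns the map entries ordered by count, highest first.
    // Files with the same count are ordered by file name.
    static List<Map.Entry<String, Integer>> rankByCount(Map<String, Integer> countsByFile) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(countsByFile.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue()); // descending by count
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    public static void main(String[] args) {
        // Hypothetical counts of one search word in three files.
        Map<String, Integer> countsByFile = new HashMap<>();
        countsByFile.put("report.pdf", 3);
        countsByFile.put("notes.pdf", 7);
        countsByFile.put("draft.pdf", 3);

        for (Map.Entry<String, Integer> e : rankByCount(countsByFile)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```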


2015-01-21 09:16:58 chris

You can use PDFBox to count the occurrences of a word in a PDF file:

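// Requires PDFBox on the classpath. ExtractText is org.apache.pdfbox.ExtractText in 1.8.x
// and org.apache.pdfbox.tools.ExtractText (pdfbox-tools) in 2.x. Place the method inside a class.
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import org.apache.pdfbox.ExtractText;

public static int countWordInFile(String word, String filename, String fileEncoding) throws Exception {
    int count = 0;
    PrintStream ps = null;
    PrintStream originalSystemOut = System.out;

    try {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ps = new PrintStream(baos);
        // Redirect System.out so the text that ExtractText prints can be captured
        System.setOut(ps);

        // Extract the document's text to the (redirected) console
        ExtractText.main(new String[] {
            "-encoding", fileEncoding,
            "-console",
            filename
        });

        ps.flush();
        String content = baos.toString(fileEncoding);

        // Count occurrences of the word (case-sensitive substring match)
        if (!word.isEmpty()) {
            int index = content.indexOf(word);
            while (index >= 0) {
                count++;
                index = content.indexOf(word, index + word.length());
            }
        }
    } finally {
        // Restore System.out before closing the temporary stream
        System.setOut(originalSystemOut);
        if (ps != null) {
            ps.close();
        }
    }

    return count;
}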
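
Once the PDF text is in a String, the counting step is plain string matching. Here is one possible way to do it that matches whole words only and ignores case, so "concat" or "cats" would not count as "cat". It is a sketch under those assumptions; `WholeWordCounter` and `countWholeWord` are hypothetical names, and what counts as a word boundary (here: anything that is not a letter or digit) is a choice you may want to change.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordCounter {
    // Counts case-insensitive, whole-word occurrences of `word` in `content`.
    static int countWholeWord(String word, String content) {
        if (word.isEmpty()) {
            return 0; // an empty search term would otherwise match everywhere
        }
        // Pattern.quote treats the word literally; the lookarounds require that
        // no letter or digit touches the match on either side.
        Pattern pattern = Pattern.compile(
                "(?<![\\p{L}\\p{N}])" + Pattern.quote(word) + "(?![\\p{L}\\p{N}])",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher matcher = pattern.matcher(content);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWholeWord("cat", "Cat, concat cat. CATS cat"));
    }
}
```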


2015-01-21 09:24:35 Stephan

Thanks @Stephan – 2015-02-11 10:07:04
