2014-01-15 261 views

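Searching for a sentence in a PDF using a Lucene PhraseQuery and PDFBox

I am using the code below to search for text in a PDF. It works fine for a single word. For a sentence, such as the one in the code, it reports that the text does not exist even though it is present in the document. Can anyone help me solve this?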

    try {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

        // Store the index in memory:
        Directory directory = new RAMDirectory();
        // To store an index on disk, use this instead:
        //Directory directory = FSDirectory.open("/tmp/testindex");
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
        IndexWriter iwriter = new IndexWriter(directory, config);
        Document doc = new Document();

        // Load the PDF and extract its text with PDFBox
        PDDocument document = null;
        try {
            document = PDDocument.load(strFilepath);
        }
        catch (IOException ex) {
            System.out.println("Exception occurred while loading the document: " + ex);
        }
        String output = new PDFTextStripper().getText(document);
        document.close();

        // Index the extracted text
        doc.add(new Field("contents", output, TextField.TYPE_STORED));
        iwriter.addDocument(doc);
        iwriter.close();

        // Now search the index
        DirectoryReader ireader = DirectoryReader.open(directory);
        IndexSearcher isearcher = new IndexSearcher(ireader);
        QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents", analyzer);

        // Build a phrase query from the words of the sentence
        String sentence = "Following are the";
        PhraseQuery query = new PhraseQuery();
        String[] words = sentence.split(" ");
        for (String word : words) {
            query.add(new Term("contents", word));
        }
        ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
        if (hits.length > 0) {
            System.out.println("Searched text existed in the PDF.");
        }
        ireader.close();
        directory.close();
    }
    catch (Exception e) {
        System.out.println("Exception: " + e.getMessage());
    }
}

Answer

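You should use the query parser to create the query from your sentence instead of building the phrase query yourself. The query you built by hand contains the term "Following", which is not in the index: the standard analyzer lowercases tokens during indexing, so only "following" is indexed.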
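The case mismatch described above can be reproduced without Lucene. The following is a minimal plain-Java sketch, not Lucene code: `analyze` is a hypothetical stand-in that only lowercases and splits on non-alphanumeric characters (the real StandardAnalyzer does more), and `allTermsIndexed` is an illustrative helper that checks exact term matches the way a hand-built query would.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TermCaseDemo {
    // Hypothetical stand-in for index-time analysis:
    // lowercase the text, then split on anything that is not a letter or digit.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // True only if every query term matches an indexed token exactly (case-sensitive).
    static boolean allTermsIndexed(List<String> indexed, String[] queryTerms) {
        for (String term : queryTerms) {
            if (!indexed.contains(term)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Sample text standing in for the extracted PDF contents.
        List<String> indexed = analyze("Following are the steps to install.");

        // Terms taken straight from sentence.split(" "), as in the question.
        String[] raw = "Following are the".split(" ");
        System.out.println(allTermsIndexed(indexed, raw));      // false: "Following" != "following"

        // The same sentence passed through the same analysis as the indexed text.
        String[] analyzed = analyze("Following are the").toArray(new String[0]);
        System.out.println(allTermsIndexed(indexed, analyzed)); // true
    }
}
```

The point of the sketch is that query terms have to go through the same analysis as the indexed text, which is what a query parser constructed with the same analyzer does for you.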

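I used the QueryParser, but it still does not take the complete sentence. Instead, it takes only the first word and reports that it does not exist. I used the following code for the QueryParser. – Lucene1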

    QueryParser queryParser = new QueryParser(Version.LUCENE_CURRENT, "contents", analyzer);
    queryParser.setDefaultOperator(QueryParser.Operator.AND);
    queryParser.setPhraseSlop(0);
    Query query = queryParser.createPhraseQuery("contents", sentence);
    ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;

– Lucene1

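The standard analyzer filters out stop words, so your query effectively becomes contents:following anyway. That means the word "following" does not occur in your PDF text. Can you print out the 'output' string? I am sure it does not contain "following". – fatih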
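To illustrate the stop-word point and the suggested check, here is a small plain-Java sketch. It makes several assumptions: `STOP_WORDS` is only a hand-picked subset of an English stop-word list, `analyze` is a hypothetical simplification of what the analyzer does, and the `output` string is made-up sample text standing in for whatever PDFTextStripper actually returned.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordDemo {
    // Assumed subset of an English stop-word list, for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "are", "is", "of", "the", "to"));

    // Simplified analysis: lowercase, split on non-alphanumerics, drop stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // After analysis only one term of the sentence is left to search for.
        System.out.println(analyze("Following are the")); // [following]

        // Made-up sample text; in the real program this would be the extracted 'output'.
        String output = "The steps below describe the setup.";
        System.out.println(analyze(output).contains("following")); // false
    }
}
```

If the same kind of check on the real extracted text also comes back negative, the problem lies in the text extraction (or in the document itself) rather than in the query.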