Searching for a sentence in a PDF using a Lucene PhraseQuery and PDFBox

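I have used the code below to search for text in a PDF. It works fine with a single word. But for a sentence, like the one in the code, it reports that the text is not present even though it exists in the document. Can anyone help me resolve this?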

    try {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

        // Store the index in memory:
        Directory directory = new RAMDirectory();
        // To store an index on disk, use this instead:
        //Directory directory = FSDirectory.open("/tmp/testindex");
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
        IndexWriter iwriter = new IndexWriter(directory, config);
        Document doc = new Document();
        PDDocument document = null;
        try {
            document = PDDocument.load(strFilepath);
        } catch (IOException ex) {
            System.out.println("Exception occurred while loading the document: " + ex);
        }
        int i = 1;
        String name = null;
        // Extract the full text of the PDF with PDFBox
        String output = new PDFTextStripper().getText(document);
        //String text = "This is the text to be indexed";
        doc.add(new Field("contents", output, TextField.TYPE_STORED));
        iwriter.addDocument(doc);
        iwriter.close();

        // Now search the index
        DirectoryReader ireader = DirectoryReader.open(directory);
        IndexSearcher isearcher = new IndexSearcher(ireader);
        // Parse a simple query that searches for "text":
        QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents", analyzer);

        // Build the phrase query from the raw, whitespace-split words of the sentence
        String sentence = "Following are the";
        PhraseQuery query = new PhraseQuery();
        String[] words = sentence.split(" ");
        for (String word : words) {
            query.add(new Term("contents", word));
        }
        ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
        if (hits.length > 0) {
            System.out.println("Searched text existed in the PDF.");
        }
        ireader.close();
        directory.close();
    } catch (Exception e) {
        System.out.println("Exception: " + e.getMessage());
    }

Source

2014-01-15 Lucene1

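You should use the query parser to create the query from your sentence instead of building the phrase query yourself. The query you built by hand contains the term "Following", which was never indexed: the standard analyzer lowercases text at index time, so only "following" is in the index.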
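The case mismatch described above can be reproduced without Lucene. The following is only a toy model, not Lucene's implementation: the class `ToyPhraseIndex`, its methods, and the sample text are hypothetical names made up for this illustration. It stores lowercased terms with their positions and checks a phrase by looking for consecutive positions. Stop words are deliberately not handled here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy positional index (hypothetical, NOT Lucene): it mimics only the
// lowercasing that an analyzer applies at index time.
class ToyPhraseIndex {
    // term -> positions at which the term occurs in the indexed text
    private final Map<String, List<Integer>> postings = new HashMap<>();

    public ToyPhraseIndex(String text) {
        String[] tokens = text.trim().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            // Index-time normalization: terms are stored in lower case.
            String term = tokens[pos].toLowerCase(Locale.ROOT);
            postings.computeIfAbsent(term, k -> new ArrayList<>()).add(pos);
        }
    }

    // Exact phrase match: terms[i] must occur at position start + i.
    // The query terms are looked up as given, with no normalization.
    public boolean containsPhrase(String[] terms) {
        if (terms.length == 0) {
            return false;
        }
        List<Integer> starts = postings.get(terms[0]);
        if (starts == null) {
            return false;
        }
        for (int start : starts) {
            boolean allMatch = true;
            for (int i = 1; i < terms.length && allMatch; i++) {
                List<Integer> positions = postings.get(terms[i]);
                allMatch = positions != null && positions.contains(start + i);
            }
            if (allMatch) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ToyPhraseIndex index = new ToyPhraseIndex("Following are the results of the test");
        String sentence = "Following are the";
        // Raw terms: "Following" is not in the index, so this prints false.
        System.out.println(index.containsPhrase(sentence.split(" ")));
        // Terms normalized the same way as at index time: prints true.
        System.out.println(index.containsPhrase(sentence.toLowerCase(Locale.ROOT).split(" ")));
    }
}
```

In Lucene the same principle applies: query terms should go through the same analysis as the indexed text, which the query parser does when given the analyzer used for indexing.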

Source

2014-01-15 12:38:41 fatih

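I used the QueryParser, but it still does not take the full sentence. Instead, it takes only the first word and reports that it is not present. I used the following code for the QueryParser. – Lucene1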

    QueryParser queryParser = new QueryParser(Version.LUCENE_CURRENT, "contents", analyzer);
    queryParser.setDefaultOperator(QueryParser.Operator.AND);
    queryParser.setPhraseSlop(0);
    Query query = queryParser.createPhraseQuery("contents", sentence);
    ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
    – Lucene1

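The standard analyzer filters out stop words, so your query is reduced to just contents:following. That means the word "following" does not exist in your PDF text. Can you print out the 'output' string? I am sure it does not contain "following". – fatih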
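To see why the parsed query shrinks to a single term, here is a minimal sketch of the two normalization steps mentioned in this thread, lowercasing and stop-word removal. It is plain Java, not Lucene: `ToyAnalyzer` and `analyze` are hypothetical names, and the stop-word set is a small illustrative subset rather than the analyzer's real list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for an analyzer: lowercases tokens and drops stop words.
class ToyAnalyzer {
    // Small illustrative subset of English stop words (an assumption for this sketch).
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "are", "is", "of", "the", "to"));

    public static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Split on anything that is not a letter or a digit.
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            String term = token.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // "are" and "the" are dropped, "Following" is lowercased.
        System.out.println(analyze("Following are the")); // prints [following]
    }
}
```

With only one term left, a miss means that this single term was not found in the indexed text, which is why printing the extracted `output` string is a sensible next check.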
