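Searching for a sentence in a PDF using Lucene PhraseQuery and PDFBox

I have used the code below to search for text in a PDF. It works fine for a single word. But for a sentence such as the one in the code, it reports that the text is not present even though the text exists in the document. Can anyone help me solve this?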
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
// Store the index in memory:
Directory directory = new RAMDirectory();
// To store an index on disk, use this instead:
//Directory directory = FSDirectory.open("/tmp/testindex");
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
Document doc = new Document();
PDDocument document = null;
try {
document = PDDocument.load(strFilepath);
}
catch (IOException ex) {
System.out.println("Exception occurred while loading the document: " + ex);
}
int i =1;
String name = null;
String output=new PDFTextStripper().getText(document);
//String text = "This is the text to be indexed";
doc.add(new Field("contents", output, TextField.TYPE_STORED));
iwriter.addDocument(doc);
iwriter.close();
// Now search the index
DirectoryReader ireader = DirectoryReader.open(directory);
IndexSearcher isearcher = new IndexSearcher(ireader);
// Parse a simple query that searches for "text":
QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents", analyzer);
String sentence = "Following are the";
PhraseQuery query = new PhraseQuery();
String[] words = sentence.split(" ");
for (String word : words) {
query.add(new Term("contents", word));
}
ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
if(hits.length>0){
System.out.println("Searched text existed in the PDF.");
}
ireader.close();
directory.close();
}
catch(Exception e){
System.out.println("Exception: "+e.getMessage());
}
}
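One likely cause can be sketched in plain Java, without Lucene. The loop above puts the raw words into `Term` objects, and `PhraseQuery` does not pass them through the analyzer, whereas the indexed text has already been lowercased and stop-filtered by `StandardAnalyzer`. The `analyze` method below is only a simplified stand-in for that analyzer: the class name, the sample sentence, and the tokenization rule are assumptions made for illustration. The stop-word list is the default English set shipped with Lucene 4.x.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class AnalysisSketch {
    // Default English stop words of Lucene 4.x, copied here for illustration.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
        "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
        "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"));

    // Toy stand-in for StandardAnalyzer (an assumption, not Lucene's tokenizer):
    // split on anything that is not a letter or digit, lowercase, drop stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sentence = "Following are the";
        // Hypothetical document text containing the sentence.
        List<String> indexed = analyze("Following are the steps to install.");
        System.out.println("indexed tokens: " + indexed); // [following, steps, install]

        // Raw words, as produced by sentence.split(" ") in the question.
        for (String word : sentence.split(" ")) {
            System.out.println(word + " -> in index? " + indexed.contains(word));
        }

        // What is left of the sentence after the same analysis.
        System.out.println("analyzed query: " + analyze(sentence)); // [following]
    }
}
```

Under this model, none of the three raw words matches an indexed token: `Following` differs in case, and `are` and `the` were never indexed. Lowercasing the words and skipping stop words before calling `query.add(...)`, or letting the analyzer produce the terms, would be the corresponding fix to try.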
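I tried QueryParser, but it still does not match the complete sentence. Instead, it takes only the first word and reports that it is not present. I used the following code with QueryParser. – Lucene1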
QueryParser queryParser = new QueryParser(Version.LUCENE_CURRENT, "contents", analyzer);
queryParser.setDefaultOperator(QueryParser.Operator.AND);
queryParser.setPhraseSlop(0);
Query query = queryParser.createPhraseQuery("contents", sentence);
ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
– Lucene1
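StandardAnalyzer filters out stop words, so your query effectively becomes just contents:following anyway. This does mean that the word "following" does not exist in your PDF text. Can you print out the 'output' string? I am sure it does not contain "following". – fatih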
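To act on the suggestion of inspecting `output`, a small check like the following may help. It is a sketch that uses a hypothetical sample string in place of the `PDFTextStripper` result; `normalize` and `containsPhrase` are illustrative helper names, not part of PDFBox or Lucene. Text extracted from a PDF can contain line breaks or repeated spaces inside a sentence, so the check collapses whitespace and ignores case before comparing.

```java
import java.util.Locale;

class ExtractedTextCheck {
    // Collapse whitespace runs (including line breaks) to single spaces and lowercase.
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim().toLowerCase(Locale.ROOT);
    }

    // True if the normalized phrase occurs as a substring of the normalized text.
    static boolean containsPhrase(String extracted, String phrase) {
        return normalize(extracted).contains(normalize(phrase));
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for new PDFTextStripper().getText(document).
        String output = "Introduction\nFollowing  are\nthe steps to install.\n";

        System.out.println(normalize(output));
        // prints: introduction following are the steps to install.
        System.out.println(containsPhrase(output, "Following are the")); // true
        System.out.println(containsPhrase(output, "following is the"));  // false
    }
}
```

This is a plain substring comparison, so it only indicates whether the sentence survived text extraction, not whether a Lucene query will match it.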