How to get specific metadata tags from a file using Apache Tika

I have some files (sample.pdf, sample.html, etc.) in a folder, and I use the following Apache Tika command to extract their metadata.

java -jar tika-app.jar -m -j /sample/sample.pdf > test.txt

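After running this command I get all of the metadata tags for sample.pdf, but I only need specific tags such as the author, the title, and so on. Please suggest how I can get specific metadata tags with Apache Tika.

Thanks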


2013-06-24 user2353439

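'xpdf' provides the utility 'pdfinfo', which reports metadata information for PDFs – devnull

Put the metadata into a temporary file, grep for the metadata keys you are interested in, and use awk to split out the values? Or do you need something more specific / in a different language / etc.? – Gagravarr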
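For anyone who would rather stay in Java than use grep and awk, the same idea can be sketched roughly as below. This is only an illustration: it assumes the dump contains one `name: value` pair per line (the JSON output produced by `-j` would need a JSON parser instead), and the class name `MetadataLineFilter` and the keys `Author` and `title` are made up for the example.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MetadataLineFilter {

    // Keeps only the wanted keys from lines shaped like "name: value".
    // The line format is an assumption about the dump file.
    public static Map<String, String> filter(List<String> lines, Set<String> wanted) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String line : lines) {
            int sep = line.indexOf(": ");
            if (sep < 0) {
                continue; // not a key/value line, skip it
            }
            String key = line.substring(0, sep);
            if (wanted.contains(key)) {
                out.put(key, line.substring(sep + 2).trim());
            }
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: java MetadataLineFilter <metadata-dump-file>");
            return;
        }
        // Hypothetical key names; use whatever names appear in your own dump
        Set<String> wanted = new HashSet<>(Arrays.asList("Author", "title"));
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        filter(lines, wanted).forEach((k, v) -> System.out.println(k + ": " + v));
    }
}
```

It would be run against the file written by the command in the question, for example `java MetadataLineFilter test.txt`.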

You can extract the metadata names as follows (my example parses an XML file; you can simply change it to the PDF parser, or use the auto-detect parser):

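import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.xml.XMLParser;
import org.apache.tika.sax.BodyContentHandler;

public class MetadataFilter {
    public static void main(String[] args) throws Exception {
        // -1 disables the write limit on the extracted body text
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        ParseContext pcontext = new ParseContext();

        // XML parser; swap in PDFParser or AutoDetectParser for other file types
        XMLParser xmlparser = new XMLParser();
        try (InputStream inputstream = new FileInputStream(new File("example.xml"))) {
            xmlparser.parse(inputstream, handler, metadata, pcontext);
        }

        System.out.println("Metadata of the document:");
        // names() returns every metadata tag found in the document
        for (String name : metadata.names()) {
            // Compare with equals(); == compares references, not string contents
            if ("Your Particular Tag".equals(name)) {
                System.out.println(name + ": " + metadata.get(name));
            }
        }
    }
}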
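If you need several tags rather than one, checking against a set is tidier than chaining comparisons. Below is a minimal sketch of that idea, written without a Tika dependency so it runs on its own: a plain `Map` stands in for the parsed `Metadata` object, and the class name `TagSelector`, the tag names, and the values are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class TagSelector {

    // Returns only the wanted tags, in the order they appear in names.
    // lookup maps a tag name to its value.
    public static Map<String, String> select(String[] names,
                                             Function<String, String> lookup,
                                             Set<String> wanted) {
        Map<String, String> selected = new LinkedHashMap<>();
        for (String name : names) {
            if (wanted.contains(name)) {
                selected.put(name, lookup.apply(name));
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Stand-in for a parsed Metadata object; keys and values are made up
        Map<String, String> fake = new LinkedHashMap<>();
        fake.put("Author", "Jane Doe");
        fake.put("Content-Type", "application/pdf");
        fake.put("title", "Sample");

        Set<String> wanted = new HashSet<>(Arrays.asList("Author", "title"));
        String[] names = fake.keySet().toArray(new String[0]);
        select(names, fake::get, wanted)
            .forEach((k, v) -> System.out.println(k + ": " + v));
    }
}
```

With a real `Metadata` object from the code above, the call would presumably look like `TagSelector.select(metadata.names(), metadata::get, wanted)`, since `names()` returns a `String[]` and `get(String)` returns the value for one name. The exact tag names differ by file type and Tika version, so print them all once to see what your documents actually expose.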


2017-04-02 17:49:16
