2014-10-30

How do I get page count information from Tika server? I want to find out how many pages a .doc file has using Tika server. I start the server with:

java -jar tika-server-1.6.jar  

and use curl to fetch the metadata:

curl -X PUT -T /tmp/test.doc http://localhost:9998/meta 

The output is:

"Revision-Number","0" 
"Last-Printed","1601-01-01T00:00:00Z" 
"cp:revision","0" 
"meta:print-date","1601-01-01T00:00:00Z" 
"meta:creation-date","2014-10-30T06:04:11Z" 
"dcterms:modified","1601-01-01T00:00:00Z" 
"meta:save-date","1601-01-01T00:00:00Z" 
"dc:creator","ndemir " 
"Last-Modified","1601-01-01T00:00:00Z" 
"Author","ndemir " 
"dcterms:created","2014-10-30T06:04:11Z" 
"date","1601-01-01T00:00:00Z" 
"X-Parsed-By","org.apache.tika.parser.ParserDecorator$1","org.apache.tika.parser.microsoft.OfficeParser" 
"modified","1601-01-01T00:00:00Z" 
"creator","ndemir " 
"Creation-Date","2014-10-30T06:04:11Z" 
"meta:author","ndemir " 
"Content-Type","application/msword" 
"Last-Save-Date","1601-01-01T00:00:00Z" 

As you can see, there is no page count information in it. How can I get the number of pages from Tika server?

Answer

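Tika will only give you that information if it is stored in the file. Most Microsoft Office documents include it, but some don't. For those, you'd need to load them in Office, tell Office to recalculate the statistics / page count, and then save. Once it is in the file, Tika will be able to find it.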

If we try it with one of the test Word documents that ship with Tika, we see:

$ curl -q -X PUT -T tika-parsers/src/test/resources/test-documents/testWORD.doc http://localhost:9998/meta | grep xmpTPg:NPages 
"xmpTPg:NPages","2" 

For the page count, you want xmpTPg:NPages, which is based on the XMP Paged-Text schema.
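
Since the field may be missing for some files, it can help to check for it explicitly rather than rely on grep printing nothing. Below is a minimal sketch, assuming /meta returns quoted CSV rows in the form shown in the question; `page_count` is a hypothetical helper name, not something Tika provides.

```shell
# Hypothetical helper (not part of Tika): reads the CSV metadata from
# stdin and prints the value of xmpTPg:NPages. Returns non-zero when the
# field is absent, i.e. the page count is not stored in the file.
page_count() {
  awk -F'","' '
    # The first field still carries its opening quote after splitting on ","
    $1 == "\"xmpTPg:NPages" {
      value = $2
      sub(/"[[:space:]]*$/, "", value)  # drop closing quote and trailing whitespace
      print value
      found = 1
      exit
    }
    END { exit found ? 0 : 1 }
  '
}

# Usage (assumes a Tika server on localhost:9998, as in the question):
#   curl -s -X PUT -T /tmp/test.doc http://localhost:9998/meta | page_count \
#     || echo "no page count stored in this file" >&2
```

If the helper reports that the field is missing, the file needs to be re-saved in Office as described above before Tika can report a page count.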