2014-02-21

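Extracting headers from a WARC.gz file

I have searched a lot of sites but could not find what I need. I have a web.warc.gz file containing data, and I need to extract the WARC headers from it. I installed Tomcat and Wayback (1.6) and tried to get the headers with the ./warc-header script that ships with Wayback, but I keep getting an error message about the format I am using: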

Sergeis-MacBook-Pro:bin sergeipashuev$ ./warc-header ~/Desktop/WEB.WARC.gz \r\n\ 
~/Desktop/output.csv type \r\n 
USAGE: tgtWarc fieldsSrc id
    tgtWarc is the path to the target WARC.gz
    fieldsSrc is the path to the text of the record
        make sure each line is terminated by \r\n
        and that the file ends with a blank, \r\n terminiated line
    id is the XXX in:
        Content-Description: Made from XXX by org.archive.wayback.util.WARCHeader
        of the header record... header...

Or a different kind of error:

Sergeis-MacBook-Pro:bin sergeipashuev$ ./warc-header ~/Desktop/WEB.WARC.gz 
    ~/Desktop/output.csv Content-Type 
    java.io.IOException: End-Of-Stream before \r\n\r\n End-Of-ANVLRecord: 

at org.archive.util.anvl.ANVLRecord.load(ANVLRecord.java:163) 
at org.archive.wayback.util.WARCHeader.writeHeaderRecord(WARCHeader.java:43) 
at org.archive.wayback.util.WARCHeader.main(WARCHeader.java:75) 

I am sure the problem is the format of what I am typing on the command line, but I still cannot get it right. Can anyone help?
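
For reference, the usage text and the `writeHeaderRecord` frame in the stack trace suggest that warc-header writes a header record into a target WARC.gz from a fields file, rather than extracting headers from it. As an alternative that does not use Wayback at all, here is a minimal sketch in Python (standard library only) that reads the named fields of every record in a gzipped WARC. The function name `iter_warc_headers` and the command-line handling are hypothetical, and the sketch assumes well-formed records with a `Content-Length` field, no folded header lines, and no repeated field names that need to be preserved.

```python
import gzip
import sys


def iter_warc_headers(stream):
    """Yield one dict of WARC named fields per record in a binary stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        # Keep the version line under a made-up key so it is not lost.
        headers = {"_version": line.strip().decode("utf-8")}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            # A repeated field name overwrites the earlier value here.
            headers[name.strip()] = value.strip()
        # Skip the record body so the next readline() starts the next record.
        stream.read(int(headers.get("Content-Length", "0")))
        yield headers


if __name__ == "__main__":
    # Usage (path is whatever WARC.gz you have): python warc_headers.py WEB.WARC.gz
    # gzip.open reads concatenated gzip members as one continuous stream.
    with gzip.open(sys.argv[1], "rb") as warc:
        for fields in iter_warc_headers(warc):
            print(fields)
```

Each yielded dict holds one record's fields such as `WARC-Type` or `Content-Type`, so writing a CSV with selected columns would be a small step on top of this.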

Answers