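Extracting headers from a WARC.gz file

I have searched the web a lot but could not find what I need. I have a web.warc.gz file with data in it, and I need to extract the WARC headers. I installed Tomcat and Wayback (1.6) and tried to get them with the ./warc-header script that ships with Wayback, but I keep getting error messages about the format I am using: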
Sergeis-MacBook-Pro:bin sergeipashuev$ ./warc-header ~/Desktop/WEB.WARC.gz \r\n\
~/Desktop/output.csv type \r\n
USAGE: tgtWarc fieldsSrc id
tgtWarc is the path to the target WARC.gz
fieldsSrc is the path to the text of the record
make sure each line is terminated by \r\n
and that the file ends with a blank, \r\n terminiated line
id is the XXX in:
Content-Description: Made from XXX by org.archive.wayback.util.WARCHeader
of the header record... header...
Or a different kind of error:
Sergeis-MacBook-Pro:bin sergeipashuev$ ./warc-header ~/Desktop/WEB.WARC.gz
~/Desktop/output.csv Content-Type
java.io.IOException: End-Of-Stream before \r\n\r\n End-Of-ANVLRecord:
at org.archive.util.anvl.ANVLRecord.load(ANVLRecord.java:163)
at org.archive.wayback.util.WARCHeader.writeHeaderRecord(WARCHeader.java:43)
at org.archive.wayback.util.WARCHeader.main(WARCHeader.java:75)
I am sure the problem is the format of what I type on the command line, but I still cannot get it right. Can anyone help?
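
For reference, here is roughly what I mean by "extracting the WARC headers", as a minimal sketch that does not use Wayback at all. It assumes Python 3, a file that follows the usual WARC layout (a `WARC/x.y` version line, `Name: value` header lines, a blank line, then `Content-Length` bytes of body), and it ignores folded continuation lines. The function name `iter_warc_headers` is my own and is not part of any library.

```python
import gzip


def iter_warc_headers(stream):
    """Yield one dict of header fields per WARC record in a binary stream.

    Hypothetical helper for illustration; not part of Wayback or any library.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = {"version": line.strip().decode("utf-8", "replace")}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            # Split at the first colon only; values such as dates contain colons.
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Skip the record body so the next readline() lands on the separator.
        stream.read(int(headers.get("Content-Length", "0")))
        yield headers


# Example use (the path is an assumption):
#     with gzip.open("WEB.WARC.gz", "rb") as f:
#         for h in iter_warc_headers(f):
#             print(h.get("WARC-Type"), h.get("Content-Type"))
```

`gzip.open` reads concatenated gzip members as one stream, so this should also work when each record is compressed separately, which is how .warc.gz files are normally written.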