2014-12-22

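I am trying to use the pyoai metadata harvesting package (https://pypi.python.org/pypi/pyoai) to harvest the metadata from this site: https://www.duo.uio.no/oai/request?verb=Identify

I tried the example from the pyoai site, but it did not work. When I run it, I get an error. The code is: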

from oaipmh.client import Client 
from oaipmh.metadata import MetadataRegistry, oai_dc_reader 

URL = 'http://uni.edu/ir/oaipmh' 
registry = MetadataRegistry() 
registry.registerReader('oai_dc', oai_dc_reader) 
client = Client(URL, registry) 

for record in client.listRecords(metadataPrefix='oai_dc'): 
    print record 

This is the stack trace:

Traceback (most recent call last): 
    File "/Users/arashsaidi/PycharmProjects/get-new-DUO/get-files.py", line 8, in <module> 
    for record in client.listRecords(metadataPrefix='oai_dc'): 
    File "/Users/arashsaidi/.virtualenvs/lbk/lib/python2.7/site-packages/oaipmh/common.py", line 115, in method 
    return obj(self, **kw) 
    File "/Users/arashsaidi/.virtualenvs/lbk/lib/python2.7/site-packages/oaipmh/common.py", line 110, in __call__ 
    return bound_self.handleVerb(self._verb, kw) 
    File "/Users/arashsaidi/.virtualenvs/lbk/lib/python2.7/site-packages/oaipmh/client.py", line 65, in handleVerb 
    kw, self.makeRequestErrorHandling(verb=verb, **kw))  
    File "/Users/arashsaidi/.virtualenvs/lbk/lib/python2.7/site-packages/oaipmh/client.py", line 273, in makeRequestErrorHandling 
    raise error.XMLSyntaxError(kw) 
oaipmh.error.XMLSyntaxError: {'verb': 'ListRecords', 'metadataPrefix': 'oai_dc'} 

I need to access all the files on the page I linked to above, and also generate an additional file with some metadata.

Any suggestions?

Answers

2

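I ended up using the Sickle package, which I found to be much better documented and easier to use:

This code gets all the sets and then retrieves the records for each set. That seems like the best solution given that there are more than 30000 records to process, and working set by set gives more control. Hope this helps others. I don't know why libraries use OAI; it does not seem like a good way of organizing data to me...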

from sickle import Sickle

# Fragment from a method of a larger class: self.spec_list and
# self.write_file_and_metadata() are defined elsewhere in that class.
sickle = Sickle('http://www.duo.uio.no/oai/request')  # client for the OAI endpoint
sets = sickle.ListSets()  # gets all sets
for recs in sets:
    for rec in recs:  # each item is a (field name, list of values) pair
        if rec[0] == 'setSpec':
            try:
                print rec[1][0], self.spec_list[rec[1][0]]
                records = sickle.ListRecords(metadataPrefix='xoai', set=rec[1][0], ignore_deleted=True)
                self.write_file_and_metadata()
            except Exception as e:
                # simple exception handling if not possible to retrieve record
                print('Exception: {}'.format(e))
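
The set-by-set approach described above could also be wrapped in a small generator. This is only a sketch, not the code actually used here: `harvest_by_set` is a hypothetical helper name, `client` is assumed to behave like a `sickle.Sickle` instance (`ListSets()` yielding objects with a `setSpec` attribute, `ListRecords()` accepting `metadataPrefix`, `set` and `ignore_deleted`), and the default prefix `oai_dc` is an assumption (the snippet above used `xoai`).

```python
def harvest_by_set(client, metadata_prefix="oai_dc"):
    """Yield (set_spec, record) pairs, harvesting one set at a time.

    `client` is assumed to be a sickle.Sickle instance or anything
    exposing the same ListSets() / ListRecords() methods.
    """
    for oai_set in client.ListSets():
        set_spec = oai_set.setSpec
        try:
            records = client.ListRecords(
                metadataPrefix=metadata_prefix,
                set=set_spec,
                ignore_deleted=True,
            )
            for record in records:
                yield set_spec, record
        except Exception as exc:
            # Same broad handling as the snippet above: report the
            # failure for this set and carry on with the next one.
            print("Skipping set {}: {}".format(set_spec, exc))


# Possible usage (needs the sickle package and network access):
#   from sickle import Sickle
#   for spec, rec in harvest_by_set(Sickle("https://www.duo.uio.no/oai/request")):
#       print(spec, rec.header.identifier)
```

Passing the client in, rather than creating it inside the function, keeps the loop easy to try out against a stand-in object before pointing it at a repository with tens of thousands of records.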
0

It seems the link from the pyoai site (http://uni.edu/ir/oaipmh) is dead, since it returns a 404.
However, you should be able to get the data from your site like this:

from oaipmh.client import Client 
from oaipmh.metadata import MetadataRegistry, oai_dc_reader 

URL = 'https://www.duo.uio.no/oai/request' 
registry = MetadataRegistry() 
registry.registerReader('oai_dc', oai_dc_reader) 
client = Client(URL, registry) 

# identify info 
identify = client.identify() 
print "Repository name: {0}".format(identify.repositoryName()) 
print "Base URL: {0}".format(identify.baseURL()) 
print "Protocol version: {0}".format(identify.protocolVersion()) 
print "Granularity: {0}".format(identify.granularity()) 
print "Compression: {0}".format(identify.compression()) 
print "Deleted record: {0}".format(identify.deletedRecord()) 

# list records 
records = client.listRecords(metadataPrefix='oai_dc') 
for record in records: 
    # do something with the record 
    pass 

# list metadata formats 
formats = client.listMetadataFormats() 
for f in formats: 
    # do something with f 
    pass
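
As one possible way to fill in the "do something with the record" step above: pyoai's `listRecords` yields `(header, metadata, about)` tuples, and with `oai_dc_reader` registered the metadata object exposes its fields through `getMap()`. The sketch below is a minimal illustration under those assumptions; `record_to_dict` is a hypothetical helper name, and picking only `title` and `creator` is an arbitrary choice for the example.

```python
def record_to_dict(record):
    """Flatten one pyoai record tuple into a plain dict.

    `record` is assumed to be the (header, metadata, about) tuple
    yielded by client.listRecords(). For deleted records the
    metadata part can be None, so it is checked before use.
    """
    header, metadata, _about = record
    result = {
        "identifier": header.identifier(),
        "deleted": header.isDeleted(),
    }
    if metadata is not None:
        fields = metadata.getMap()  # dict mapping field name to list of values
        result["title"] = fields.get("title", [])
        result["creator"] = fields.get("creator", [])
    return result


# Possible usage inside the loop above:
#   for record in records:
#       print(record_to_dict(record))
```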