2012-04-26 34 views
5
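How can you parse a document stored in the MARC21 format with Python?

Yesterday Harvard released open access to all of its library metadata (roughly 12 million records).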

I have been looking to parse the data and play with it, since the stated goal of the release was to "support innovation".

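I downloaded the 12 GB tarball and unpacked it to find 13 .mrc files of roughly 800 MB each.

When I look at the head and tail of the first few files, the data looks quite unstructured, even after reading a bit about the MARC21 format.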

Here is what the first 4K of the first file looks like:

$ head -c 4000 ab.bib.00.20120331.full.mrc

00857nam a2200253 a 4500001001200000005001700012008004100029010001700070020001200087035001600099040001800115043001200133050002500145245011100170260004900281300002100330504004100351610006400392650005300456650003500509700003800544988001300582906000800595002000001-420020606093309.7880822s1985 unr  b 000 0 ruso a 86231326 c0.45rub0 aocm18463285 aDLCcDLCdHLS ae-ur-un0 aJN6639.A8bK665 198500aInformat︠s︡ii︠a︡ v rabote partiĭnykh komitetov /c[sostavitelʹ Stepan Ivanovich I︠A︡lovega]. aKiev :bIzd-vo polit. lit-ry Ukrainy,c1985. a206 p. ;c20 cm. aIncludes bibliographical references.20aKomunistychna partii︠a︡ UkraïnyxInformation services. 0aParty committeeszUkrainexInformation services. 0aInformation serviceszUkraine.1 aI︠A︡lovega, Stepan Ivanovich. a20020608 0DLC00418nam a22001335u 4500001001200000005001700012008004100029110003000070245004600100260006000146500005800206988001300264906000700277002000002-220020606093309.7900925|1944 mx   |||||||spa|d1 aCampeche (Mexico : State)10aLey del notariado del estado de Campeche.0 a[Campeche]bDepartamento de prensa y publicidad,c1944. aAt head of title: Gobierno constitucional del estado. a20020608 0MH00647nam a2200229M 4500001001200000005001700012008004100029010001700070035001600087040001300103041001100116050003600127100004200163245004100205246005600246260001600302300001900318500001500337650004400352988001300396906000800409002000003-020051201172535.0890331s1902 xx  d 000 0 ota a 73960310 0 aocm23499219 aDLCcEYM0 aotaara0 aPJ6636.T8bU5 1973 (Orien Arab)1 aUnsī, Muḥammad ʻAlī ibn Ḥasan.10aQāmūs al-lughah al-ʻUthmānīyah.3 aDarārī al-lāmiʻāt fī muntakhabāt al-lughāt. c[1902 1973] a564 p.c22 cm. aRomanized. 0aTurkish languagevDictionariesxArabic. 
a20020608 0DLC00878nam a2200253 a 4500001001200000005001700012008004100029010001700070035001600087040001800103043001200121050002300133245012800156246004600284260006300330300004800393500003300441610003200474650005000506700002400556710002300580988001300603906000800616002000004-920020606093309.7880404s1980 yu fa   000 0 scco a 82167322 0 aocm17880048 aDLCcDLCdHLS ae-yu---0 aL53.P783bT75 198000aTrideset pet godina Prosvetnog pregleda, 1945-1980 /c[glavni i odgovorni urednik i urednik publikacije Ružica Petrović].3 a35 godina Prosvetnog pregleda, 1945-1980. aBeograd :bNovinska organizacija Prosvetni pregled,c1980. a146 p., [21] p. of plates :bill. ;c29 cm. aIn Serbo-Croatian (Cyrillic)20aProsvetni pregledxHistory. 0aEducationzYugoslaviaxHistoryy20th century.1 aPetrović, Ružica.2 aProsvetni pregled. a20020608 0DLC00449nam a22001455u 4500001001200000005001700012008004100029245008200070260002800152300001100180440006600191700002600257988001300283906000700296002000005-720020606093309.7900925|1981 pl   |||||||pol|d10aZ zagadnień dialektyki i świadomości społecznej /cpod red. K. Ślęczka.0 aKatowice :bUŚ,c1981. a135 p. 0aPrace naukowe Uniwersytetu Śląskiego w Katowicach ;vnr 4621 aŚlęczka, Kazimierz. a20020608 0MH00331nam a22001455u 4500001001200000005001700012008004100029100002200070245002200092250001200114260002800126300001100154988001300165906000700178002000006-520020606093309.7900925|1980 pl   |||||||pol|d1 aMencwel, Andrzej.10aWidziane z dołu. aWyd. 1.0 aWarszawa :bPIW,c1980. a166 p. a20020608 0MH00746cam a2200241 a 4500001001200000005001700012008004100029010001700070020001500087035001600102040001800118050002400136082001600160100001600176245008000192260007100272300002500343504004100368650003400409650004000443988001300483906000800496002000007-300000000000000.0900123s1990 enk  b 001 0 eng a 90031350 a03910368230 aocm21081069 aDLCcDLCdHBS00aHF5439.8b.O35 199 

Has anyone worked with MARC21 before? Does it typically look like this, or do I need to parse it differently?
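For reference, the dump is consistent with MARC21's binary transmission layout (ISO 2709), and it looks unstructured mainly because a terminal does not show the control characters that separate fields (0x1E), subfields (0x1F) and records (0x1D). Each record starts with a 24-character leader: in `00857nam a2200253 a 4500`, `00857` is the record length and `00253` is the offset where field data begins. The 12-character directory entries follow, so `001001200000` means tag 001, length 12, offset 0. The sketch below decodes that layout with the standard library only. The function names and the tiny `SAMPLE` record are made up for illustration, and it assumes UTF-8 data (leader position 9 is `a`), so it is not a substitute for a real MARC parser.

```python
import io

FIELD_TERMINATOR = b"\x1e"
SUBFIELD_DELIMITER = b"\x1f"

# Hypothetical two-field record, built by hand for this example.
SAMPLE = (
    b"00075nam a2200049   4500"  # leader: length 75, base address 49
    b"001001000000"              # directory: tag 001, length 10, offset 0
    b"245001500010"              # directory: tag 245, length 15, offset 10
    b"\x1e"                      # end of directory
    b"000000001\x1e"             # control field 001
    b"10\x1faTest title\x1e"     # data field 245: indicators, then subfield a
    b"\x1d"                      # end of record
)


def iter_raw_records(fh):
    """Yield each record's raw bytes, using the 5-digit length prefix."""
    while True:
        head = fh.read(5)
        if len(head) < 5:
            return
        yield head + fh.read(int(head) - 5)


def parse_record(raw):
    """Return {tag: [field bytes, ...]} by walking the directory."""
    base_address = int(raw[12:17])
    directory = raw[24:base_address - 1]  # skip the directory's terminator
    fields = {}
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[0:3].decode("ascii")
        length = int(entry[3:7])
        start = base_address + int(entry[7:12])
        # The stored length includes the trailing field terminator.
        fields.setdefault(tag, []).append(raw[start:start + length - 1])
    return fields


def subfields(data):
    """Split a data field (tag 010 and up) into indicators and subfields."""
    parts = data.split(SUBFIELD_DELIMITER)
    indicators = parts[0].decode("ascii")
    pairs = [(p[:1].decode("ascii"), p[1:].decode("utf-8")) for p in parts[1:]]
    return indicators, pairs


for raw in iter_raw_records(io.BytesIO(SAMPLE)):
    for tag, values in parse_record(raw).items():
        print(tag, values)
```

Reading by the length prefix rather than splitting on 0x1D keeps memory use flat, which matters for files of about 800 MB.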

+0

Maybe [this tool](http://www.loc.gov/marc/marc-functional-analysis/tool.html) will help. – sarnold 2012-04-26 01:09:39

Answers

10

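pymarc is the best option for parsing MARC21 records with Python (full disclosure: I am one of its maintainers). If you are unfamiliar with working with MARC21, it is worth reading through some of the specification you linked to on the Library of Congress website. I would also read through the Working with MARC page on the Code4lib wiki.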

0

Disclaimer: I am the author of marcx.

pymarc is a great library. For a few operations that I missed in pymarc, I implemented a thin layer on top of it: marcx.

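`marcx.FatRecord` is a small extension of `pymarc.Record` that adds a few shortcuts. The gist is the twins `add` and `remove`, the (subfield) value generator `itervalues`, and a generic `test` function.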

Its main benefit is a simpler way to iterate over field (or subfield) values. For example:

>>> from urllib.request import urlopen  # on Python 2: from urllib import urlopen
>>> from marcx import FatRecord
>>> record = FatRecord(data=urlopen("http://goo.gl/lfJnw9").read())
>>> for val in record.itervalues('100.a', '700.a'): 
...  print(val) 
Hunt, Andrew, 
Thomas, David,