2016-06-13

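Loading triples into Blazegraph using the bulk data loader

I'm trying to use Blazegraph to run graph algorithms on ConceptNet, but first I have to import the data. The data will be write-once, read-many, so I don't need any kind of incremental writes.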

I installed Blazegraph 2.1.1 from its .deb file. I also downloaded blazegraph.jar so that I could follow instructions that involve running commands against blazegraph.jar.

The file assoc.nt is in N-Triples format and contains about 25 million edges. Here are a few lines from the beginning:

</c/af/a_foei_tog/r> </r/SenseOf> </c/af/a_foei_tog> . 
</c/af/a_foei_tog/r> </r/Synonym> </c/af/jammer> . 
</c/af/a_foei_tog/r> </r/Synonym> </c/af/ongelukkig> . 
</c/af/a_foei_tog/r> </r/RelatedTo> </c/fr/malheureusement> . 
</c/af/a_foe%C4%B1_tog/r> </r/SenseOf> </c/af/a_foe%C4%B1_tog> . 
</c/af/a_foe%C4%B1_tog/r> </r/Synonym> </c/af/jammer> . 
</c/af/a_foe%C4%B1_tog/r> </r/Synonym> </c/af/ongelukk%C4%B1g> . 
</c/af/a_foe%C4%B1_tog/r> </r/RelatedTo> </c/fr/malheureusement> . 
</c/af/a_ja_a/r> </r/SenseOf> </c/af/a_ja_a> . 
</c/af/a_ja_a/r> </r/Synonym> </c/af/seker> . 
</c/af/a_ja_a/r> </r/Synonym> </c/af/sekerlik> . 

I took fastload.properties from the Blazegraph samples on GitHub, then made the following changes:

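  • I added com.bigdata.journal.AbstractJournal.file=blazegraph.jnl, because otherwise it told me the property was missing.

  • I changed bufferMode from DiskRW to Disk, because someone's property file indicated that this would give me write-once, read-many semantics, which is what I want.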

This is my final fastload.properties:

# This configuration turns off incremental inference for load and retract, so 
# you must explicitly force these operations if you want to compute the closure 
# of the knowledge base. Forcing the closure requires punching through the SAIL 
# layer. Of course, if you are not using inference then this configuration is 
# just the ticket and is quite fast. 

# set the initial and maximum extent of the journal 
com.bigdata.journal.AbstractJournal.initialExtent=209715200 
com.bigdata.journal.AbstractJournal.maximumExtent=209715200 

# turn off automatic inference in the SAIL 
com.bigdata.rdf.sail.truthMaintenance=false 

# don't store justification chains, meaning retraction requires full manual 
# re-closure of the database 
com.bigdata.rdf.store.AbstractTripleStore.justify=false 

# turn off the statement identifiers feature for provenance 
com.bigdata.rdf.store.AbstractTripleStore.statementIdentifiers=false 

# turn off the free text index 
com.bigdata.rdf.store.AbstractTripleStore.textIndex=false 

com.bigdata.journal.AbstractJournal.bufferMode=Disk 
com.bigdata.journal.AbstractJournal.file=blazegraph.jnl 

The command I ran:

java -cp blazegraph.jar com.bigdata.rdf.store.DataLoader -namespace conceptnet fastload.properties ~/conceptnet5/data/assoc/assoc.nt 

It keeps the CPU busy for a few minutes, but in the end nothing seems to have been added. This is the output I get:

WARN : ServiceProviderHook.java:171: Running. 
INFO: com.bigdata.util.config.LogUtil: Configure: jar:file:/home/rspeer/src/blazegraph/blazegraph.jar!/log4j.properties 

BlazeGraph(TM) Graph Engine 

        Flexible 
        Reliable 
        Affordable 
     Web-Scale Computing for the Enterprise 

Copyright SYSTAP, LLC DBA Blazegraph 2006-2016. All rights reserved. 

[my hostname appeared here] 
Mon Jun 13 13:36:05 EDT 2016 
Linux/3.13.0-83-generic amd64 
Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz Family 6 Model 62 Stepping 4, GenuineIntel #CPU=4 
Oracle Corporation 1.8.0_74 
freeMemory=1002354744 
buildVersion=2.1.1 
gitCommit=90d9e8232969a8afdc830e856643e5416bb50d0a 

Dependency   License                 
ICU    http://source.icu-project.org/repos/icu/icu/trunk/license.html   
bigdata-ganglia http://www.apache.org/licenses/LICENSE-2.0.html       
blueprints-core https://github.com/tinkerpop/blueprints/blob/master/LICENSE.txt   
colt    http://acs.lbl.gov/software/colt/license.html       
commons-codec  http://www.apache.org/licenses/LICENSE-2.0.html       
commons-fileupload http://www.apache.org/licenses/LICENSE-2.0.html       
commons-io   http://www.apache.org/licenses/LICENSE-2.0.html       
commons-logging http://www.apache.org/licenses/LICENSE-2.0.html       
dsiutils   http://www.gnu.org/licenses/lgpl-2.1.html        
fastutil   http://www.apache.org/licenses/LICENSE-2.0.html       
flot    http://www.opensource.org/licenses/mit-license.php      
high-scale-lib  http://creativecommons.org/licenses/publicdomain       
httpclient   http://www.apache.org/licenses/LICENSE-2.0.html       
httpclient-cache http://www.apache.org/licenses/LICENSE-2.0.html       
httpcore   http://www.apache.org/licenses/LICENSE-2.0.html       
httpmime   http://www.apache.org/licenses/LICENSE-2.0.html       
jackson-core  http://www.apache.org/licenses/LICENSE-2.0.html       
jetty    http://www.apache.org/licenses/LICENSE-2.0.html       
jquery    https://github.com/jquery/jquery/blob/master/MIT-LICENSE.txt    
jsonld    https://raw.githubusercontent.com/jsonld-java/jsonld-java/master/LICENCE 
log4j    http://www.apache.org/licenses/LICENSE-2.0.html       
lucene    http://www.apache.org/licenses/LICENSE-2.0.html       
nanohttp   http://elonen.iki.fi/code/nanohttpd/#license        
rexster-core  https://github.com/tinkerpop/rexster/blob/master/LICENSE.txt    
river    http://www.apache.org/licenses/LICENSE-2.0.html       
semargl   https://github.com/levkhomich/semargl/blob/master/LICENSE    
servlet-api  http://www.apache.org/licenses/LICENSE-2.0.html       
sesame    http://www.openrdf.org/download.jsp          
slf4j    http://www.slf4j.org/license.html          
zookeeper   http://www.apache.org/licenses/LICENSE-2.0.html       

Reading properties: fastload.properties 
Will load from: /home/rspeer/conceptnet5/data/assoc/assoc.nt 
Journal file: blazegraph.jnl 
Load: 0 stmts added in 171.173 secs, rate= 0, commitLatency=0ms, {failSet=0,goodSet=1} 
Total elapsed=172015ms 
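
With this much startup output, a load that adds nothing is easy to overlook. As a minimal sketch, assuming the summary line always has the shape shown in the output above (other Blazegraph versions may format it differently), a wrapper script could check it after each run. The helper name `parse_load_summary` is hypothetical and not part of Blazegraph.

```python
import re

# Pattern for the DataLoader summary line as it appears in the output above.
# Assumption: the wording "Load: N stmts added in S secs" is stable.
SUMMARY = re.compile(r"Load: (\d+) stmts added in ([\d.]+) secs")


def parse_load_summary(output):
    """Return (statements_added, seconds), or None if no summary line is found."""
    match = SUMMARY.search(output)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))


if __name__ == "__main__":
    line = ("Load: 0 stmts added in 171.173 secs, rate= 0, "
            "commitLatency=0ms, {failSet=0,goodSet=1}")
    added, seconds = parse_load_summary(line)
    if added == 0:
        print("Nothing was loaded; check the input file.")
```

Treating a zero count as a failure is a design choice for this sketch: the loader itself still reports goodSet=1 for the file.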

Answer

I believe I've found the answer to the problem I was having.

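When Blazegraph imports N-Triples data, it skips relative URIs. The fact that my URIs were relative was my mistake; it seems that only absolute URIs are allowed in N-Triples. Still, it would have been nice for Blazegraph to tell me about this instead of failing silently.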
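
Because the loader gives no warning, it can help to check a file for relative URIs before loading it. The following is a rough sketch, not a full N-Triples parser: it assumes every `<...>` on a line is an IRI (true for the data shown here, but a literal containing angle brackets would confuse it), and the function names are made up for illustration.

```python
import re

# Anything in angle brackets is treated as an IRI reference.
IRI_REF = re.compile(r"<([^<>\s]*)>")
# An absolute IRI begins with a scheme such as "http:".
SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*:")


def relative_iris(line):
    """Return the IRIs on one line that do not start with a scheme."""
    return [iri for iri in IRI_REF.findall(line) if not SCHEME.match(iri)]


def count_relative_lines(lines):
    """Count the lines that contain at least one relative IRI."""
    return sum(1 for line in lines if relative_iris(line))


if __name__ == "__main__":
    sample = [
        "</c/af/a_ja_a/r> </r/SenseOf> </c/af/a_ja_a> .",
        "<http://api.conceptnet.io/c/af/a_ja_a/r> "
        "<http://api.conceptnet.io/r/SenseOf> "
        "<http://api.conceptnet.io/c/af/a_ja_a> .",
    ]
    print(count_relative_lines(sample), "line(s) with relative IRIs")
```

To check a real file, pass an open file object to `count_relative_lines`, since it only needs an iterable of lines.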

I prefixed all of my URIs with http:// and a domain name, and now it is loading the data. Here's what my data looks like now:

<http://api.conceptnet.io/c/af/a_foei_tog/r> <http://api.conceptnet.io/r/SenseOf> <http://api.conceptnet.io/c/af/a_foei_tog> . 
<http://api.conceptnet.io/c/af/a_foei_tog/r> <http://api.conceptnet.io/r/Synonym> <http://api.conceptnet.io/c/af/jammer> . 
<http://api.conceptnet.io/c/af/a_foei_tog/r> <http://api.conceptnet.io/r/Synonym> <http://api.conceptnet.io/c/af/ongelukkig> . 
<http://api.conceptnet.io/c/af/a_foei_tog/r> <http://api.conceptnet.io/r/RelatedTo> <http://api.conceptnet.io/c/fr/malheureusement> . 
<http://api.conceptnet.io/c/af/a_foe%C4%B1_tog/r> <http://api.conceptnet.io/r/SenseOf> <http://api.conceptnet.io/c/af/a_foe%C4%B1_tog> . 
<http://api.conceptnet.io/c/af/a_foe%C4%B1_tog/r> <http://api.conceptnet.io/r/Synonym> <http://api.conceptnet.io/c/af/jammer> . 
<http://api.conceptnet.io/c/af/a_foe%C4%B1_tog/r> <http://api.conceptnet.io/r/Synonym> <http://api.conceptnet.io/c/af/ongelukk%C4%B1g> . 
<http://api.conceptnet.io/c/af/a_foe%C4%B1_tog/r> <http://api.conceptnet.io/r/RelatedTo> <http://api.conceptnet.io/c/fr/malheureusement> . 
<http://api.conceptnet.io/c/af/a_ja_a/r> <http://api.conceptnet.io/r/SenseOf> <http://api.conceptnet.io/c/af/a_ja_a> . 
<http://api.conceptnet.io/c/af/a_ja_a/r> <http://api.conceptnet.io/r/Synonym> <http://api.conceptnet.io/c/af/seker> . 
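
The rewrite itself can be done in many ways; the answer does not say which tool was used. One possible sketch in Python, assuming every relative URI in the file starts with `/` as in the sample above, is shown here. The function names and the file names in the usage note are hypothetical.

```python
import re

# The prefix used in the rewritten data above.
BASE = "http://api.conceptnet.io"

# A bracketed IRI whose content starts with "/", i.e. a path-only reference.
RELATIVE_IRI = re.compile(r"<(/[^<>\s]*)>")


def absolutize(line, base=BASE):
    """Prefix every path-only IRI on the line with the base; leave others alone."""
    return RELATIVE_IRI.sub(lambda m: "<" + base + m.group(1) + ">", line)


def rewrite_file(src_path, dst_path, base=BASE):
    """Stream src_path to dst_path line by line, so a large file is never held in memory."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(absolutize(line, base))


if __name__ == "__main__":
    print(absolutize("</c/af/a_ja_a/r> </r/SenseOf> </c/af/a_ja_a> ."))
```

A call such as `rewrite_file("assoc.nt", "assoc-absolute.nt")` would produce a new file and leave the original untouched. URIs that already carry a scheme do not match the pattern, so running the function twice does not add the prefix twice.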

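I get some worrying output that seems to indicate it is taking between 1 and 10 seconds to load each "record", but I think these warnings are misleading, because they only show up at moments when the load slows down significantly: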

WARN : AbstractBTree.java:3758: wrote: name=kb.spo.OSP, 1 records (#nodes=1, #leaves=0) in 14582ms : addrRoot=22869767568228938 
WARN : AbstractBTree.java:3758: wrote: name=kb.spo.POS, 1 records (#nodes=1, #leaves=0) in 14582ms : addrRoot=22869765391385095 
WARN : AbstractBTree.java:3758: wrote: name=kb.spo.OSP, 9 records (#nodes=5, #leaves=4) in 10690ms : addrRoot=25508598331212042 
WARN : AbstractBTree.java:3758: wrote: name=kb.spo.POS, 1 records (#nodes=1, #leaves=0) in 9335ms : addrRoot=38702680415142364 
WARN : AbstractBTree.java:3758: wrote: name=kb.spo.POS, 9 records (#nodes=6, #leaves=3) in 6932ms : addrRoot=63331668311671368 
WARN : AbstractBTree.java:3758: wrote: name=kb.spo.POS, 1 records (#nodes=1, #leaves=0) in 11326ms : addrRoot=80044185196954272 

Despite the warnings, the load finished in about 8 minutes, which is not bad for 25 million edges.
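
As a back-of-the-envelope check on that figure, using only the two approximate numbers given above (about 25 million edges, about 8 minutes), the average rate works out to roughly 52,000 edges per second:

```python
# Both inputs are the approximate figures quoted in the text, not measurements.
edges = 25_000_000
seconds = 8 * 60

rate = edges / seconds
print(round(rate), "edges per second on average")
```

At that average, the multi-second writes in the warnings can only be occasional stalls rather than the cost of each record.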
