2013-01-18 76 views

I've managed to write a simple indexer script for MongoDB using pymongo. But I don't understand why indexing, adding documents, and querying eat up 96GB of RAM on my server. Why does MongoDB take so much RAM?

Is it because my query is not optimized? How can I optimize a query like database.find_one({"eng":src})?

How can I optimize my indexer script?

My input looks like this (the actual data input has 2,000,000+ lines, with sentences of varying length):

#srcfile

You will be aware from the press and television that there have been a number of bomb explosions and killings in Sri Lanka. 
One of the people assassinated very recently in Sri Lanka was Mr Kumar Ponnambalam, who had visited the European Parliament just a few months ago. 
Would it be appropriate for you, Madam President, to write a letter to the Sri Lankan President expressing Parliament's regret at his and the other violent deaths in Sri Lanka and urging her to do everything she possibly can to seek a peaceful reconciliation to a very difficult situation? 
Yes, Mr Evans, I feel an initiative of the type you have just suggested would be entirely appropriate. 
If the House agrees, I shall do as Mr Evans has suggested. 

#trgfile

Wie Sie sicher aus der Presse und dem Fernsehen wissen, gab es in Sri Lanka mehrere Bombenexplosionen mit zahlreichen Toten. 
Zu den Attentatsopfern, die es in jüngster Zeit in Sri Lanka zu beklagen gab, zählt auch Herr Kumar Ponnambalam, der dem Europäischen Parlament erst vor wenigen Monaten einen Besuch abgestattet hatte. 
Wäre es angemessen, wenn Sie, Frau Präsidentin, der Präsidentin von Sri Lanka in einem Schreiben das Bedauern des Parlaments zum gewaltsamen Tod von Herrn Ponnambalam und anderen Bürgern von Sri Lanka übermitteln und sie auffordern würden, alles in ihrem Kräften stehende zu tun, um nach einer friedlichen Lösung dieser sehr schwierigen Situation zu suchen? 
Ja, Herr Evans, ich denke, daß eine derartige Initiative durchaus angebracht ist. 
Wenn das Haus damit einverstanden ist, werde ich dem Vorschlag von Herrn Evans folgen. 

An example document looks like this:

{ 
    "_id" : ObjectId("50f5fe8916174763f6217994"), 
    "deu" : "Wie Sie sicher aus der Presse und dem Fernsehen wissen, gab es in Sri Lanka mehrere Bombenexplosionen mit zahlreichen Toten.\n", 
    "uid" : 13, 
    "eng" : "You will be aware from the press and television that there have been a number of bomb explosions and killings in Sri Lanka." 
} 

My code:

# -*- coding: utf8 -*- 
import codecs, glob, os 
from pymongo import MongoClient 
from itertools import izip 

import sys 
reload(sys) 
sys.setdefaultencoding("utf-8") 

# Gets first instance of a matching key given a value and a dictionary. 
def getKey(dic, value): 
    return [k for k, v in dic.items() if v == value] 

def langiso(lang, isochar=3): 
    languages = {"en": "eng", 
                 "da": "dan", "de": "deu", 
                 "es": "spa", 
                 "fi": "fin", "fr": "fre", 
                 "it": "ita", 
                 "nl": "nld", 
                 "zh": "mcn"} 
    if len(lang) == 2 and isochar == 3: 
        return languages[lang] 
    if len(lang) == 3 and isochar == 2: 
        return getKey(languages, lang) 

def txtPairs(bitextDir): 
    txtpairs = {} 
    for infile in glob.glob(os.path.join(bitextDir, '*')): 
        k = infile[-8:-3]; lang = infile[-2:] 
        try: 
            txtpairs[k] = (txtpairs[k], infile) if lang == "en" else (infile, txtpairs[k]) 
        except KeyError: 
            txtpairs[k] = infile 
    for i in txtpairs.keys(): 
        if len(txtpairs[i]) != 2: 
            del txtpairs[i] 
    return txtpairs 

def indexEuroparl(sfile, tfile, database): 
    trglang = langiso(tfile[-2:])  #; srclang = langiso(sfile[-2:]) 

    maxdoc = database.find().sort("uid", -1).limit(1) 
    uid = 1 if maxdoc.count() == 0 else maxdoc[0]["uid"] + 1 

    counter = 0 
    for src, trg in izip(codecs.open(sfile, "r", "utf8"), 
                         codecs.open(tfile, "r", "utf8")): 
        quid = database.find_one({"eng": src}) 
        # If the sentence already exists in the db 
        if quid != None: 
            if trglang in quid: 
                print "Sentence uniqID", quid["uid"], "already exists." 
                continue 
            else: 
                print "Reindexing uniqID", quid["uid"], "..." 
                database.update({"uid": quid["uid"]}, {"$push": {trglang: trg}}) 
        else: 
            print "Indexing uniqID", uid, "..." 
            doc = {"uid": uid, "eng": src, trglang: trg} 
            database.insert(doc) 
            uid += 1 
        if counter == 1000: 
            for i in database.find(): 
                print i 
            counter = 0 
        counter += 1 

connection = MongoClient() 
db = connection["europarl"] 
v7 = db["v7"] 

srcfile = "eng-deu.en"; trgfile = "eng-deu.de" 
indexEuroparl(srcfile, trgfile, v7) 

# After indexing the English-German pair, perform the same indexing on other language pairs 
srcfile = "eng-spa.en"; trgfile = "eng-spa.es" 
indexEuroparl(srcfile, trgfile, v7) 

connection = MongoClient() 
db = connection["europarl"] 
v7 = db["v7"] 

srcfile = "eng-deu.en"; trgfile = "eng-deu.de" 
indexEuroparl(srcfile,trgfile,v7) 

# After indexing the english-german pair, i'll perform the same indexing on other language pairs 
srcfile = "eng-spa.en"; trgfile = "eng-spa.es" 
indexEuroparl(srcfile,trgfile,v7) 

To save us having to decipher your code, could you show us a sample document? – Sammaye


So you are querying for the very thing you want to get out, i.e. the translation? Correct me if I'm wrong, but to get the document you are looking for, you must already know which one you want to pull, which raises the question: why perform the query at all? – Sammaye


Post your `getIndexes` output and an `explain()` of the query –

Answer


After several rounds of profiling the code, I found where the RAM was leaking.

First, since I am going to query the "eng" field frequently, I should create an index on that field, like this:

v7.ensure_index([("eng", 1)], unique=True) 

This eliminates the slow serial scan across the un-indexed "eng" field that every lookup otherwise performs.
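With that unique index in place, the three-way branch in indexEuroparl (find_one, then insert or $push) could in principle collapse into a single upsert per sentence pair, removing the lookup query entirely. A minimal sketch of the idea follows; `upsert_spec` is a hypothetical helper introduced here for illustration, and it assumes MongoDB 2.4+ for the $setOnInsert operator (uid handling would still need care, since with an upsert you don't know in advance whether a new document was created):

```python
def upsert_spec(src, trg, trglang, uid):
    """Build the (filter, update) pair for one sentence pair."""
    filter_doc = {"eng": src}
    update_doc = {
        "$setOnInsert": {"uid": uid},   # set uid only when a new doc is created
        "$addToSet": {trglang: trg},    # add the translation; skips exact duplicates
    }
    return filter_doc, update_doc

# Against the live collection from the question this would be used as
# (pymongo 2.x API, not run here):
#   f, u = upsert_spec(src, trg, trglang, uid)
#   v7.update(f, u, upsert=True)

f, u = upsert_spec("You will be aware ...", "Wie Sie sicher ...", "deu", 13)
print(f)
print(u)
```

Unlike $push, $addToSet also avoids appending the same translation twice if a pair is re-indexed.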

Second, the RAM-bleeding problem was due to this expensive call:

doc = {"uid": uid, "eng": src, trglang: trg} 
if counter == 1000: 
    for i in database.find(): 
        print i 
    counter = 0 
counter += 1 

What MongoDB does is keep the results in RAM, as @Sammaye already noted. Every time I call database.find(), it holds the full set of documents I have added to the collection in RAM. That is how I burned through 96GB of RAM. The code above needed to be changed to:

doc = {"uid": uid, "eng": src, trglang: trg} 
if counter == 1000: 
    print doc 
    counter = 0 
counter += 1 

By eliminating database.find() and creating the index on the "eng" field, I use at most 25GB of RAM, and I finished indexing the 2 million sentences in under an hour.
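The progress-reporting pattern above can also be written without resetting the counter. A minimal sketch, where `should_report` is a hypothetical helper added for illustration (only `doc`, `counter`, and `database` come from the original code); if a running total is wanted, the server-side count() of pymongo 2.x returns just a number, so no documents are shipped back into client RAM:

```python
def should_report(counter, every=1000):
    """True once every `every` iterations (counter starts at 0)."""
    return counter > 0 and counter % every == 0

# Inside the indexing loop this replaces the old find() block (not run here):
#   if should_report(counter):
#       print(doc)                # last inserted document only
#       print(database.count())   # server-side count, O(1) client memory
#   counter += 1

print(should_report(999))   # False
print(should_report(1000))  # True
```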
