2013-11-24
3

I have a corpus of words like the ones below, more than 3000 words in total, split across two files (the words need to be matched across multiple files):

File #1: 
#fabulous  7.526 2301 2 
#excellent  7.247 2612 3 
#superb   7.199 1660 2 
#perfection  7.099 3004 4 
#terrific  6.922 629  1 
#magnificent 6.672 490  1 

File #2: 
) #perfect  6.021 511  2 
? #great  5.995 249  1 
! #magnificent 5.979 245  1 
) #ideal  5.925 232  1 
day #great  5.867 219  1 
bed #perfect 5.858 217  1 
) #heavenly  5.73 191  1 
night #perfect 5.671 180  1 
night #great 5.654 177  1 
. #partytime 5.427 141  1 

I also have many sentences, more than 3000 lines like these:

superb, All I know is the road for that Lomardi start at TONIGHT!!!! We will set a record for a pre-season MNF I can guarantee it, perfection. 

All Blue and White fam, we r meeting at Golden Corral for dinner to night at 6pm....great 

I have to go through each line and perform the following tasks:
1) find whether these corpus words match anywhere inside the sentence
2) find whether these corpus words match the leading and trailing words of the sentence

I am able to do part 2) but not part 1). I can do it, but I need to find an efficient way. I have the following code:

for line in sys.stdin:
    (id, num, senti, words) = re.split("\t+", line.strip())
    sentence = re.split("\s+", words.strip().lower())

    for line1 in f1:  # f1 is the file containing the corpus of words like File #1
        (term2, sentimentScore, numPos, numNeg) = re.split("\t", line1.strip())
        wordanalysis["trail"] = bool(re.match(sentence[-1], term2.lower()))
        wordanalysis["lead"] = bool(re.match(sentence[0], term2.lower()))

    for line1 in f2:  # f2 is the file containing the corpus of words like File #2
        (term2, sentimentScore, numPos, numNeg) = re.split("\t", line1.strip())
        wordanalysis["trail_2"] = bool(re.match(sentence[-1], term2.lower()))
        wordanalysis["lead_2"] = bool(re.match(sentence[0], term2.lower()))

Am I right? Is there a better way to do this?
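For part 1), the lookup can be made efficient by loading the corpus terms into a set once, instead of rescanning the corpus file for every sentence. A minimal sketch under that assumption (the `load_terms` and `analyse` helper names are hypothetical, and punctuation handling is simplified):

```python
import re

def load_terms(path):
    """Read a corpus file and return the set of lower-cased terms (first column)."""
    terms = set()
    with open(path) as f:
        for line in f:
            fields = re.split(r"\t+", line.strip())
            if fields and fields[0]:
                terms.add(fields[0].lower().lstrip("#"))
    return terms

def analyse(sentence_words, terms):
    """Return match flags for a tokenised, lower-cased sentence."""
    words = [w.strip(".,!?") for w in sentence_words]
    return {
        "anywhere": any(w in terms for w in words),  # part 1): match inside the sentence
        "lead": words[0] in terms,                   # part 2): leading word
        "trail": words[-1] in terms,                 # part 2): trailing word
    }
```

Set membership is O(1) per word, so each sentence costs time proportional to its length rather than to the 3000+ corpus entries.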

+1

How about using the *hash* data structure in *Redis*? First read the data of the two files into Redis, stored as *hashes*. Then, when a word is read from a sentence, do a hash lookup in Redis, which should be very fast. This may help: [hash commands in redis](http://redis.io/commands#hash) – flyer
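The pattern the commenter suggests can be illustrated without a Redis server: a Redis hash maps field → value with constant-time lookup, which a Python dict emulates in-process. A minimal in-memory sketch (with `redis-py` the dict operations would become `HSET` at load time and `HGET` per lookup; the helper names and sample lines are illustrative):

```python
import re

# Stand-in for a Redis hash: term -> sentiment score.
# With redis-py this would be r.hset("corpus", term, score) at load time
# and r.hget("corpus", word) per sentence word.
corpus_hash = {}

def load_into_hash(lines):
    """Load tab-separated corpus lines (term first) into the hash."""
    for line in lines:
        fields = re.split(r"\t+", line.strip())
        if len(fields) >= 2:
            corpus_hash[fields[0].lower()] = float(fields[1])

def lookup(word):
    """Constant-time check of a single word against the corpus."""
    return corpus_hash.get(word.lower())

# Sample entries taken from File #1 above.
load_into_hash(["#fabulous\t7.526\t2301\t2", "#superb\t7.199\t1660\t2"])
```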

+0

@flyer Like a Hashtable in Java? – fscore

+0

Sorry, I know very little about Java. Here is a simple explanation: [the little redis book](https://github.com/karlseguin/the-little-redis-book/blob/master/en/redis.md#hashes) – flyer

Answer

0

This is a classic map-reduce problem. If you want to take efficiency seriously, you should think about it that way: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

And if you are too lazy, or have too few resources, to set up your own Hadoop environment, you can try Amazon Elastic MapReduce: http://aws.amazon.com/elasticmapreduce/

Feel free to post your code here once it is ready :) it would be nice to see how it translates into a MapReduce algorithm...
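In the Hadoop-streaming style of the linked tutorial, the sentence scan could be sketched as a mapper that emits one `(term, 1)` record per corpus term found in a line. A minimal sketch, with the corpus set inlined as an assumption (in a real streaming job it would be loaded from a file shipped alongside the job):

```python
import re

# Assumed corpus terms; in a real job this set would be loaded from a
# side file distributed with the streaming job, not hard-coded.
TERMS = {"superb", "perfection", "great"}

def map_line(line):
    """Emit (term, 1) pairs for every corpus term found in the sentence."""
    words = [w.strip(".,!?").lower() for w in re.split(r"\s+", line.strip())]
    return [(w, 1) for w in words if w in TERMS]

# In the streaming mapper script you would iterate over sys.stdin and
# print each pair as "term\t1"; a reducer then sums the counts per term.
```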

+0

Hi, yes, I'm glad you noticed. The problem is a map-reduce algorithm; it also has a reducer script, and Hadoop is set up for it, but using an efficient data structure is also important. – fscore