2013-11-24
3

I have a corpus of words like the ones below, more than 3000 words in total, split across two files (the words need to be matched across multiple files):

File #1: 
#fabulous  7.526 2301 2 
#excellent  7.247 2612 3 
#superb   7.199 1660 2 
#perfection  7.099 3004 4 
#terrific  6.922 629  1 
#magnificent 6.672 490  1 

File #2: 
) #perfect  6.021 511  2 
? #great  5.995 249  1 
! #magnificent 5.979 245  1 
) #ideal  5.925 232  1 
day #great  5.867 219  1 
bed #perfect 5.858 217  1 
) #heavenly  5.73 191  1 
night #perfect 5.671 180  1 
night #great 5.654 177  1 
. #partytime 5.427 141  1 

I also have many sentences, more than 3000 lines like these:

superb, All I know is the road for that Lomardi start at TONIGHT!!!! We will set a record for a pre-season MNF I can guarantee it, perfection. 

All Blue and White fam, we r meeting at Golden Corral for dinner to night at 6pm....great 

I have to go through each line and perform the following tasks:
1) find whether these corpus words match anywhere inside the sentence
2) find whether these corpus words match the leading and trailing words of the sentence

I am able to do part 2) but not part 1). I can do it, but I need to find an efficient way. I have the following code:

for line in sys.stdin:
    (id, num, senti, words) = re.split("\t+", line.strip())
    sentence = re.split("\s+", words.strip().lower())

    for line1 in f1:  # f1 is the file containing the corpus of words like File #1
        (term2, sentimentScore, numPos, numNeg) = re.split("\t", line1.strip())
        wordanalysis["trail"] = bool(re.match(sentence[-1], term2.lower()))
        wordanalysis["lead"] = bool(re.match(sentence[0], term2.lower()))

    for line1 in f2:  # f2 is the file containing the corpus of words like File #2
        (term2, sentimentScore, numPos, numNeg) = re.split("\t", line1.strip())
        wordanalysis["trail_2"] = bool(re.match(sentence[-1], term2.lower()))
        wordanalysis["lead_2"] = bool(re.match(sentence[0], term2.lower()))

Am I right? Is there a better way to do this?
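For part 1), the lookup can be made efficient by loading the corpus terms into a set once, instead of rescanning the corpus file for every sentence. A minimal sketch under that assumption (the `load_terms` and `analyse` helper names are hypothetical, and punctuation handling is simplified):

```python
import re

def load_terms(path):
    """Read a corpus file and return the set of lower-cased terms (first column)."""
    terms = set()
    with open(path) as f:
        for line in f:
            fields = re.split(r"\t+", line.strip())
            if fields and fields[0]:
                terms.add(fields[0].lower().lstrip("#"))
    return terms

def analyse(sentence_words, terms):
    """Return match flags for a tokenised, lower-cased sentence."""
    words = [w.strip(".,!?") for w in sentence_words]
    return {
        "anywhere": any(w in terms for w in words),  # part 1): match inside the sentence
        "lead": words[0] in terms,                   # part 2): leading word
        "trail": words[-1] in terms,                 # part 2): trailing word
    }
```

Set membership is O(1) per word, so each sentence costs time proportional to its length rather than to the 3000+ corpus entries.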

+1

How about using the *hash* data structure in *Redis*? First read the data of the two files into Redis, stored as *hashes*. Then, when a word is read from a sentence, do a hash lookup in Redis, which should be very fast. This may help: [hash commands in redis](http://redis.io/commands#hash) – flyer
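The pattern the commenter suggests can be illustrated without a Redis server: a Redis hash maps field → value with constant-time lookup, which a Python dict emulates in-process. A minimal in-memory sketch (with `redis-py` the dict operations would become `HSET` at load time and `HGET` per lookup; the helper names and sample lines are illustrative):

```python
import re

# Stand-in for a Redis hash: term -> sentiment score.
# With redis-py this would be r.hset("corpus", term, score) at load time
# and r.hget("corpus", word) per sentence word.
corpus_hash = {}

def load_into_hash(lines):
    """Load tab-separated corpus lines (term first) into the hash."""
    for line in lines:
        fields = re.split(r"\t+", line.strip())
        if len(fields) >= 2:
            corpus_hash[fields[0].lower()] = float(fields[1])

def lookup(word):
    """Constant-time check of a single word against the corpus."""
    return corpus_hash.get(word.lower())

# Sample entries taken from File #1 above.
load_into_hash(["#fabulous\t7.526\t2301\t2", "#superb\t7.199\t1660\t2"])
```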

+0

@flyer Like a Hashtable in Java? – fscore

+0

Sorry, I know very little about Java. Here is a simple explanation: [the little redis book](https://github.com/karlseguin/the-little-redis-book/blob/master/en/redis.md#hashes) – flyer

Answer

0

This is a classic map-reduce problem. If you want to take efficiency seriously, you should think about it that way: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

And if you are too lazy, or have too few resources, to set up your own Hadoop environment, you can try Amazon Elastic MapReduce: http://aws.amazon.com/elasticmapreduce/

Feel free to post your code here once it is ready :) it would be nice to see how it translates into a MapReduce algorithm...
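In the Hadoop-streaming style of the linked tutorial, the sentence scan could be sketched as a mapper that emits one `(term, 1)` record per corpus term found in a line. A minimal sketch, with the corpus set inlined as an assumption (in a real streaming job it would be loaded from a file shipped alongside the job):

```python
import re

# Assumed corpus terms; in a real job this set would be loaded from a
# side file distributed with the streaming job, not hard-coded.
TERMS = {"superb", "perfection", "great"}

def map_line(line):
    """Emit (term, 1) pairs for every corpus term found in the sentence."""
    words = [w.strip(".,!?").lower() for w in re.split(r"\s+", line.strip())]
    return [(w, 1) for w in words if w in TERMS]

# In the streaming mapper script you would iterate over sys.stdin and
# print each pair as "term\t1"; a reducer then sums the counts per term.
```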

+0

Hi, yes, I'm glad you noticed. The problem is a map-reduce algorithm; it also has a reducer script, and Hadoop is set up for it, but using an efficient data structure is also important. – fscore