Ruby and MongoDB with an inverted index produce some interesting results

For my program, I build an inverted index from Twitter feed data, but some interesting problems occur when the tweets are parsed and put into MongoDB.

A typical entry should look like this:

{"ax"=>1, "easyjet"=>1, "from"=>2}

However, after some tweets are parsed, the entries in the DB end up looking like this:

{""=>{""=>{""=>{""=>{""=>{"giants"=>{"dhem"=>1, "giants"=>1, "giantss"=>1}}}}

I have these lines to split up a tweet and increment the values in the DB:

def pull_hash_tags(tweet, lang)
    hash_tags = tweet.split.find_all { |word| /^#.+/.match word }
    t = tweet.gsub(/https?:\/\/[\S]+/,"") # removing urls
    t = t.gsub(/#\w+/,"") # removing hash tags
    t = t.gsub(/[^0-9a-z ]/i, '') # removing non-alphanumerics and keeping spaces
    t = t.gsub(/\r/," ")
    t = t.gsub(/\n/," ")
    hash_tags.each { |tag| add_to_hash(lang, tag, t) }
end

def add_to_hash(lang, tag, t)
    t.gsub(/\W+/, ' ').split.each do |word|
        @db.collection.update({"_id" => lang}, {"$inc" => {"#{tag}.#{word}" => 1}}, { :upsert => true })
    end
end

I want to end up with normal words (alphanumeric characters only), with no double spaces, no carriage returns, and so on.
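For illustration, here is a minimal sketch of the normalization described above. The helper name clean_words is hypothetical and not part of the original code. One thing worth noting: in pull_hash_tags the non-alphanumeric filter runs before the \r and \n replacements, so line breaks are deleted rather than turned into spaces and words on adjacent lines can get fused. The sketch keeps all whitespace in that step and lets split handle the rest.

```ruby
# Hypothetical helper (an assumption, not from the original post):
# returns the plain alphanumeric words of a tweet.
def clean_words(tweet)
  t = tweet.gsub(%r{https?://\S+}, " ")  # drop URLs
  t = t.gsub(/#\w+/, " ")                # drop hash tags
  t = t.gsub(/[^0-9a-z\s]/i, "")         # keep ASCII alphanumerics and any whitespace
  t.split                                # split on runs of whitespace, ignoring leading/trailing
end

words = clean_words("Flying  from\r\nLondon http://t.co/abc #easyjet ax!")
```

Because split with no arguments treats any run of whitespace as one separator, the result contains no empty strings, double spaces, or carriage returns.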


2012-02-09 Domness

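I suggest adding a logger when you connect, then watching exactly what you are putting into the database. There is probably a problem in your code. – 2012-02-09 19:03:13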

When working through roughly 50 GB of data, that will be hard to pin down. – Domness 2012-02-09 20:26:51

In that case, don't use a logger. Just add some code to your pull_hash_tags method to look for these anomalous documents. – 2012-02-09 20:32:26
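A rough sketch of what such a check might look like. MongoDB treats dots in an update field name as path separators, so a key built from "#{tag}.#{word}" becomes a nested path, and an empty segment between two dots becomes an empty-string key. Whether that is what actually happens here is an assumption. The helper names suspicious_key? and anomalies are hypothetical.

```ruby
# Hypothetical check (an assumption, not from the original post):
# flags update keys that would produce empty or otherwise odd path segments.
def suspicious_key?(key)
  return true if key.empty? || key.start_with?("$")
  # The -1 limit keeps trailing empty segments, e.g. "#tag." => ["#tag", ""]
  key.split(".", -1).any?(&:empty?)
end

# Returns the keys that add_to_hash would send to the DB and that look suspicious.
def anomalies(tag, words)
  words.map { |word| "#{tag}.#{word}" }.select { |key| suspicious_key?(key) }
end

bad = anomalies("#giants.....", ["giants", "dhem"])
```

Calling anomalies just before the update and printing the offending tweet only when the result is non-empty keeps the output small even over a large data set.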

You should add t.strip!, because it looks like the problem might be leading/trailing whitespace.
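A small sketch of the effect, using a made-up input string. It only shows how leading and repeated spaces turn into empty strings when splitting on a literal single space, and how strip! plus collapsing whitespace avoids that. It is not a confirmed diagnosis of the code in the question.

```ruby
raw = "  giants  dhem \r\n"   # made-up sample input

# Splitting on a literal single space keeps empty strings for leading and repeated spaces.
unstripped = raw.split(/ /)

t = raw.dup
t.strip!                      # mutates t in place, removing leading/trailing whitespace
t = t.gsub(/\s+/, " ")        # collapse runs of whitespace into one space
words = t.split(/ /)

# strip! returns nil when there was nothing to strip, so its result should not be assigned.
no_change = "giants".dup.strip!
```

Note that String#split called with no arguments already skips leading whitespace and runs of whitespace, so strip! matters mainly when the string is used whole or split on an explicit pattern.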


2012-04-20 04:40:33
