2013-04-10 88 views
1

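How do I remove items from a Ruby hash?

I have a method that counts the frequency of words in a string. I manually list some words that should be removed. For short strings, 'the' is removed, but for longer strings (like the one below) the method still prints 'the'. Any ideas about why this happens and how to fix it?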

def count_words(string) 
    words = string.downcase.split(' ') 

    delete_list = ['the'] 
    delete_list.each do |del| 
     words.delete_at(words.index(del)) 
    end 

    frequency = Hash.new(0) 
    words.each do |word| 
     frequency[word.downcase] += 1 
    end 

    return frequency.sort_by {|k,v| v}.reverse 
end 

puts count_words('Pros great benefits fair compensation reasonable time off Cons middle management are empty suits, void of vision and very little risk taking 
politics have gotten out of control since gates left the building.. 
sales metrics often do not reflect the contributions of the role, which demonstrates that line management is out of touch of what the individual contributors role really does 
middle management does not care about the career of his/her directs, 90% of the time management competes directly with their people, or takes credit for their work 
lots of back stabbing going on 
Microsoft changes the organization or commitment or comp model, faster than the average deal cycle, making it next to near impossible to develop momentum in role or a rhythm of success 
execs promote themselves in years when they freeze employees merit increases 
only way to advance is to step on your peers/colleagues and take credit for work you had no impact on, beat your chest loud enough and you get "visibility" you need to advance 
visibility is not based on performance by enlarge, it is based on being in your manager\'s swim lane for advancement 
I have observed people get promoted in years when they did not meet their quota, nor did the earn the highest performance on the team, they kissed their way to the promotion 
Advice to Senior Management 1, get back to risk taking and teaming, less politics please, you are killing the company 
2, set realistic commitments and stick to them for multiple years, stop changing the game faster than your people can react 
3, stop over engineering commitments and over segmenting the company, people are not willing to collaborate or be corporate citizens 
4, too many empty suits in middle management, keep flattening out the company and getting rid of middle managers that run reports all day, get back to a culture where managers also sell and drives wins 
5, keep your word microsoft, you said stability, but you keep tinkering with the org too much for any changes to take affect A great Culture 
Limitless opportunities 
Supportive Management team who are passionate about people 
A company that really does want you to have a good work life balance and backs it up with policies that enable you to manage how and where you work. 
Cons Support resources are constrained 
Can be overly competitve and hard to get noticed 
Sales rewards are definitely prioritised and marketing cuts are always prioritised. 
Consumer organisation is still far from ideal. 
Advice to Senior Management Focus on getting the internal organisation simplified to improve performance and increase empowerment. 
Get some REAL consumer focus and invest for the long term 
Start connecting with people, focussing on telling stories rather than selling products.') 
+0

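A couple of quick points: you don't need `frequency[word.downcase]`, because the words have already been downcased. You should also use `each_with_object` instead of the `words.each` loop. See my answer. – meagar 2013-04-10 02:00:46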

+0

This happens because the array contains more than one "the", so `delete_at(words.index(del))` only removes the first match. To fix it, do what @meagar said. – 2013-04-10 02:08:52

+2

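When you write a question, don't include more data than is absolutely necessary. Your question is hard to read because all the unnecessary text adds visual noise. – 2013-04-10 02:44:22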
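
The behaviour described in the comments can be reproduced with a minimal sketch. The method names `remove_first` and `remove_all` and the sample words are made up for illustration; only `Array#index`, `Array#delete_at` and `Array#delete` come from Ruby itself. A short string that contains a single 'the' hides the problem, which would explain why the short test cases appeared to work.

```ruby
# Removes only the first occurrence, mirroring delete_at(words.index(del)).
# Array#index returns nil when the word is absent, so guard against that:
# delete_at(nil) would raise a TypeError.
def remove_first(words, target)
  copy = words.dup
  position = copy.index(target)
  copy.delete_at(position) if position
  copy
end

# Removes every occurrence; Array#delete returns nil if nothing matched.
def remove_all(words, target)
  copy = words.dup
  copy.delete(target)
  copy
end

sample = %w[the cat saw the dog]   # hypothetical sample data
p remove_first(sample, 'the')      # one "the" is still present
p remove_all(sample, 'the')        # no "the" remains
```

Note that the original loop has no nil guard, so it would also fail on any string that does not contain every word in `delete_list`.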

Answers

1

Just use words.delete("the"). All you need to do is give it the key.

A simplified version of the program is:

def count_words(string) 
    words = string.downcase.split(' ').each_with_object(Hash.new(0)) { |w,o| o[w] += 1 } 

    delete_list = ['the'] 

    delete_list.each { |del| words.delete(del) } 

    # `words` is the frequency hash built above; there is no `frequency` variable here
    words.sort_by {|k,v| v}.reverse 
end 
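
As a rough illustration of why a single `delete` per word is enough once the counts live in a Hash, here is a small sketch. The method name `count_without` and its `stop_words` parameter are hypothetical, and the tie-breaking sort key is an addition, not part of the answer above.

```ruby
# Counts words, then drops unwanted keys from the resulting Hash.
# `count_without` and `stop_words` are names invented for this sketch.
def count_without(string, stop_words)
  counts = string.downcase.split.each_with_object(Hash.new(0)) { |w, h| h[w] += 1 }
  # Each word is a key exactly once, so one Hash#delete removes it entirely.
  # Hash#delete returns nil for a missing key, so no existence check is needed.
  stop_words.each { |word| counts.delete(word) }
  # Highest count first; ties are broken alphabetically to keep the order stable.
  counts.sort_by { |word, count| [-count, word] }
end

p count_without('the cat and the dog and the bird', %w[the zebra])
# => [["and", 2], ["bird", 1], ["cat", 1], ["dog", 1]]
```
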
1

This is a very common problem when analyzing web pages for SEO. Here is a quick version I wrote:

require 'pp' 

STOP_WORDS = %w[a and of the] 

def count_words(string) 

    word_count = string
        .downcase
        .gsub(/[^a-z ]+/, '')
        .split
        .group_by { |w| w }

    STOP_WORDS.each do |stop_word|
        word_count.delete(stop_word)
    end

    word_count
        .map { |k, v| [k, v.size] }
        .sort_by { |k, c| [-c, k] }
end 

pp count_words(<<EOT) 
Pros great benefits fair compensation reasonable time off Cons middle management are empty suits, void of vision and very little risk taking 
politics have gotten out of control since gates left the building.. 
Start connecting with people, focussing on telling stories rather than selling products. 
EOT 

I deliberately truncated the sample data to improve readability.

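On that subject, you can use a here-document ("<<") to improve the formatting of your code when you have to pass in a lot of text. An alternative is to insert an __END__ marker, put all the text after it, and then use the special IO object DATA to read the trailing block: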

pp count_words(DATA.read) 

__END__ 
Pros great benefits fair compensation reasonable time off Cons middle management are empty suits, void of vision and very little risk taking 
politics have gotten out of control since gates left the building.. 
Start connecting with people, focussing on telling stories rather than selling products. 

In either case, the code outputs:

 
[["of", 2], 
["and", 1], 
["are", 1], 
["benefits", 1], 
["buildingstart", 1], 
["compensation", 1], 
["connecting", 1], 
["cons", 1], 
["control", 1], 
["empty", 1], 
["fair", 1], 
["focussing", 1], 
["gates", 1], 
["gotten", 1], 
["great", 1], 
["have", 1], 
["left", 1], 
["little", 1], 
["management", 1], 
["middle", 1], 
["off", 1], 
["on", 1], 
["out", 1], 
["people", 1], 
["products", 1], 
["pros", 1], 
["rather", 1], 
["reasonable", 1], 
["risk", 1], 
["selling", 1], 
["since", 1], 
["stories", 1], 
["suits", 1], 
["takingpolitics", 1], 
["telling", 1], 
["than", 1], 
["time", 1], 
["very", 1], 
["vision", 1], 
["void", 1], 
["with", 1]] 

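gsub(/[^a-z ]+/, '') strips anything that is not a lowercase letter or a space. That includes newlines, which is why words at line boundaries are fused in the output ("takingpolitics", "buildingstart"); using /[^a-z\s]+/ instead would keep them apart. Enumerable's group_by does the heavy lifting. Enumerable's sort_by then makes it easy to sort by descending count, with ties broken alphabetically by word.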
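
The pipeline described above can be sketched step by step. This is a variant rather than the answer's exact code: it assumes the `\s` pattern so that newlines survive as separators, and the method names `tokenize` and `ranked_counts` are made up for illustration.

```ruby
# Normalises text while preserving whitespace (\s also matches newlines),
# so words on adjacent lines are not fused together.
def tokenize(text)
  text.downcase.gsub(/[^a-z\s]+/, '').split
end

# group_by { |w| w } builds { word => [word, word, ...] };
# the size of each array is that word's count.
def ranked_counts(text)
  tokenize(text)
    .group_by { |w| w }
    .map { |word, occurrences| [word, occurrences.size] }
    .sort_by { |word, count| [-count, word] }   # count descending, then word ascending
end

p tokenize("risk taking\npolitics, risk!")
# => ["risk", "taking", "politics", "risk"]
p ranked_counts("risk taking\npolitics, risk!")
# => [["risk", 2], ["politics", 1], ["taking", 1]]
```

Negating the count in the sort key gives descending order without a separate `reverse`, which would also flip the alphabetical tie-break.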

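Because walking the STOP_WORDS list is usually faster than iterating over every word in the corpus, I store the counts in a hash rather than an array and delete the stop words by key. A large corpus will most likely contain many more words than there are stop words.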
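
To make the trade-off concrete, here is a hedged sketch of both strategies. The method names `drop_by_key` and `drop_while_scanning` and the sample tokens are invented for this example, and no timings are claimed; the two approaches simply do work proportional to different things.

```ruby
require 'set'

# Approach from the answer: one Hash#delete per stop word.
# The work grows with the number of stop words.
def drop_by_key(counts, stop_words)
  result = counts.dup
  stop_words.each { |word| result.delete(word) }
  result
end

# Alternative: filter the tokens before counting.
# The work grows with the number of tokens; a Set keeps each lookup cheap.
def drop_while_scanning(tokens, stop_words)
  stop_set = stop_words.to_set
  tokens.reject { |t| stop_set.include?(t) }
        .each_with_object(Hash.new(0)) { |t, h| h[t] += 1 }
end

tokens = %w[the cat of the house and a dog]   # hypothetical sample data
counts = tokens.each_with_object(Hash.new(0)) { |t, h| h[t] += 1 }
stops  = %w[a and of the]

p drop_by_key(counts, stops)
p drop_while_scanning(tokens, stops)
```

Both calls produce the same counts for the same input, so the choice between them is mostly about how large the corpus is compared with the stop list.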