3

I am doing extensive work with various word lists: converting a word list into a list of the frequencies with which those words appear.

Consider the following problem. I have:

docText={"settlement", "new", "beginnings", "wildwood", "settlement", "book", 
"excerpt", "agnes", "leffler", "perry", "my", "mother", "junetta", 
"hally", "leffler", "brought", "my", "brother", "frank", "and", "me", 
"to", "edmonton", "from", "monmouth", "illinois", "mrs", "matilda", 
"groff", "accompanied", "us", "her", "husband", "joseph", "groff", 
"my", "father", "george", "leffler", "and", "my", "uncle", "andrew", 
"henderson", "were", "already", "in", "edmonton", "they", "came", 
"in", "1910", "we", "arrived", "july", "1", "1911", "the", "sun", 
"was", "shining", "when", "we", "arrived", "however", "it", "had", 
"been", "raining", "for", "days", "and", "it", "was", "very", 
"muddy", "especially", "around", "the", "cn", "train"} 

searchWords={"the","for","my","and","me","and","we"} 

The real lists are much longer than this (say 250 words in searchWords and roughly 12,000 words in docText).

Right now, I can find the frequency of a given word with something like this:

docFrequency=Sort[Tally[docText],#1[[2]]>#2[[2]]&];  
Flatten[Cases[docFrequency,{"settlement",_}]][[2]] 
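
For the sample docText above this returns 2, since "settlement" occurs twice. (For reference, the same lookup can also be done in one step with Count, which simply counts the elements matching a pattern:)

Count[docText, "settlement"]  (* 2 for the sample docText *)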

Where I am getting hung up is generating a particular list: specifically, turning the word list into a list of the frequencies with which those words appear. I have tried to do this with a Do loop but have hit a wall.

I want to go through docText and replace each element with the frequency of its appearance. That is, since "settlement" appears twice it would be replaced in the list by 2, and since "my" appears four times it would become 4. The list would then be something like 2, 1, 1, 1, 2, and so forth.

I suspect the answer lies somewhere between If[] and Map[]?
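
One brute-force sketch along those lines (correct, but slow for 12,000 words, since it re-counts the whole list for every element) would be:

Map[Count[docText, #] &, docText]  (* {2, 1, 1, 1, 2, ...} for the sample docText *)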

This probably sounds arcane, but I am trying to pre-process a lot of word-frequency information...


Edit for clarity (I hope):

Here is a better example.

searchWords={"0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "a", "A", "about", 
"above", "across", "after", "again", "against", "all", "almost", 
"alone", "along", "already", "also", "although", "always", "among", 
"an", "and", "another", "any", "anyone", "anything", "anywhere", 
"are", "around", "as", "at", "b", "B", "back", "be", "became", 
"because", "become", "becomes", "been", "before", "behind", "being", 
"between", "both", "but", "by", "c", "C", "can", "cannot", "could", 
"d", "D", "do", "done", "down", "during", "e", "E", "each", "either", 
"enough", "even", "ever", "every", "everyone", "everything", 
"everywhere", "f", "F", "few", "find", "first", "for", "four", 
"from", "full", "further", "g", "G", "get", "give", "go", "h", "H", 
"had", "has", "have", "he", "her", "here", "herself", "him", 
"himself", "his", "how", "however", "i", "I", "if", "in", "interest", 
"into", "is", "it", "its", "itself", "j", "J", "k", "K", "keep", "l", 
"L", "last", "least", "less", "m", "M", "made", "many", "may", "me", 
"might", "more", "most", "mostly", "much", "must", "my", "myself", 
"n", "N", "never", "next", "no", "nobody", "noone", "not", "nothing", 
"now", "nowhere", "o", "O", "of", "off", "often", "on", "once", 
"one", "only", "or", "other", "others", "our", "out", "over", "p", 
"P", "part", "per", "perhaps", "put", "q", "Q", "r", "R", "rather", 
"s", "S", "same", "see", "seem", "seemed", "seeming", "seems", 
"several", "she", "should", "show", "side", "since", "so", "some", 
"someone", "something", "somewhere", "still", "such", "t", "T", 
"take", "than", "that", "the", "their", "them", "then", "there", 
"therefore", "these", "they", "this", "those", "though", "three", 
"through", "thus", "to", "together", "too", "toward", "two", "u", 
"U", "under", "until", "up", "upon", "us", "v", "V", "very", "w", 
"W", "was", "we", "well", "were", "what", "when", "where", "whether", 
"which", "while", "who", "whole", "whose", "why", "will", "with", 
"within", "without", "would", "x", "X", "y", "Y", "yet", "you", 
"your", "yours", "z", "Z"} 

These are stop words generated automatically from WordData[]. I want to compare these words against docText. Since "settlement" is not part of searchWords, it should show up as 0; but since "my" is part of searchWords, it should show up as its count (so I can tell how many times a given word appears).
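
Concretely, for the sample data the desired behaviour is something like this (a sketch of the intent only, not a solution):

Count[docText, "settlement"]  (* 2, but "settlement" is not in searchWords, so it should become 0 *)
Count[docText, "my"]          (* 4, and "my" is in searchWords, so it should stay 4 *)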

I really appreciate your help. I am looking forward to taking some formal courses, as I have hit the limit of being able to properly explain what I am trying to do!


Do you only need to handle the words that appear in searchWords? What happens to the rest of 'docWords'? – Szabolcs


@Szabolcs If they do not appear, they should show up as 0. In a previous routine I used 'If' to convert them to 0, because I was running into null problems. –


I still don't completely understand. Can you explain what 'searchWords' is for? – Szabolcs

Answers

7

We can replace everything in docText that does not appear in searchWords with 0 as follows:

preprocessedDocText = 
    Replace[docText, 
     Dispatch@Append[Thread[searchWords -> searchWords], _ -> 0], {1}] 
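
Here Thread[searchWords -> searchWords] simply expands to {"the" -> "the", "for" -> "for", ...}, i.e. every search word is mapped to itself, and the appended catch-all rule _ -> 0 sends everything else to 0.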

Then we can replace the remaining words by their frequencies:

replaceTable = Dispatch[Rule @@@ Tally[docText]]; 

preprocessedDocText /. replaceTable 

Dispatch pre-processes a list of rules (->) and significantly speeds up the replacement when those rules are subsequently applied.

I have not benchmarked this on large data, but Dispatch should give a good speed-up.
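
A rough way to see the effect on synthetic data of about the size mentioned in the question (only a sketch; timings are machine-dependent):

words = Table[ToString@RandomInteger[5000], {12000}];  (* ~12000 pseudo-words *)
rules = Rule @@@ Tally[words];
First@AbsoluteTiming[words /. rules;]            (* plain rule list *)
First@AbsoluteTiming[words /. Dispatch[rules];]  (* with Dispatch, usually much faster *)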


I see the confusion, and it is my fault. Give me a minute or two. –


This is great, very clean code, by the way. Thanks for explaining what it does! –


@ian.milligan See my edit. Does that help? – Szabolcs

4

@Szabolcs gave a nice solution, and I would probably go the same route myself. Here is a slightly different solution, just for fun:

ClearAll[getFreqs]; 
getFreqs[docText_, searchWords_] := 
    Module[{dwords, dfreqs, inSearchWords, lset}, 
    SetAttributes[{lset, inSearchWords}, Listable]; 
    lset[args__] := Set[args]; 
    {dwords, dfreqs} = Transpose@Tally[docText];  (* distinct words and their counts *)
    lset[inSearchWords[searchWords], True]; 
    inSearchWords[_] = False; 
    dfreqs*Boole[inSearchWords[dwords]]] 

This illustrates how the Listable attribute can be used in place of loops or even Map. We have:

In[120]:= getFreqs[docText,searchWords] 
Out[120]= {0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,3,1,1,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,1,2, 
1,0,0,2,0,0,1,0,2,0,2,0,1,1,2,1,1,0,1,0,1,0,0,1,0,0} 
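
As a standalone illustration of the Listable mechanism used above (a toy example, unrelated to the word data):

ClearAll[f];
SetAttributes[f, Listable];
f[x_] := x^2;
f[{1, 2, 3}]  (* -> {1, 4, 9}: a Listable function threads over lists automatically *)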
2

I went about this differently from Szabolcs, but I ended up with something fairly similar.

Nevertheless, I think it is a bit cleaner. It is faster on some data and slower on other data.

docText /. 
    Dispatch[FilterRules[Rule @@@ Tally@docText, searchWords] ~Join~ {_String -> 0}]
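
FilterRules keeps only those rules whose left-hand sides appear in its second argument, so the Tally-derived rules are restricted to searchWords before the catch-all _String -> 0 is appended. A tiny illustration with made-up rules, just to show the behaviour:

FilterRules[{"my" -> 4, "settlement" -> 2}, {"my", "the"}]  (* -> {"my" -> 4} *)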