2017-09-29

I am trying to adapt the Spark word count example so that words are counted grouped by some other value (for example, words and counts per person, where the person is "VI" or "MO" in the case below). I want to reduce a list of (word, count) tuples, aggregated by key.

I have an RDD whose values are lists of (word, count) tuples:

[(u'VI', [(u'word1', 1), (u'word2', 1), (u'word3', 1)]), 
(u'MO', 
    [(u'word4', 1), 
    (u'word4', 1), 
    (u'word5', 1), 
    (u'word8', 1), 
    (u'word10', 1), 
    (u'word1', 1), 
    (u'word4', 1), 
    (u'word6', 1), 
    (u'word9', 1), 
    ... 
    ])] 

I would like to do something like this:

from operator import add 
reduced_tokens = tokenized.reduceByKey(add) 
reduced_tokens.take(2) 

which would give me:

[ 
('VI', 
    [(u'word1', 1), (u'word2', 1), (u'word3', 1)]), 
('MO', 
    [(u'word4', 58), (u'word8', 2), (u'word9', 23), ...]) 
] 

Similar to the word count example here, I would like to be able to filter out words whose counts fall below some threshold. Thanks!
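As a plain-Python sketch (no Spark) of the end result I am after — per-person word counts with a threshold — where the sample observations and the threshold of 2 are made up:

```python
from collections import Counter, defaultdict

# made-up flat (name, word) observations, mirroring the tuples in the RDD above
observations = [
    (u'MO', u'word4'), (u'MO', u'word4'), (u'MO', u'word5'),
    (u'MO', u'word4'), (u'VI', u'word1'),
]

# count words per person
counts = defaultdict(Counter)
for name, word in observations:
    counts[name][word] += 1

# keep only words seen at least `threshold` times for that person
threshold = 2
result = {name: {w: c for w, c in ctr.items() if c >= threshold}
          for name, ctr in counts.items()}
```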

Answers


The key you are trying to reduce by is the (name, word) pair, not just the name. So you need a .map step to restructure your data:

from operator import add 

# assumes each record in `tokenized` is a flat (name, (word, count)) pair 
def key_by_name_word(record): 
    name, (word, count) = record 
    return (name, word), count 

tokenized_by_name_word = tokenized.map(key_by_name_word) 
counts_by_name_word = tokenized_by_name_word.reduceByKey(add) 

This should give you:

[ 
    (('VI', 'word1'), 1), 
    (('VI', 'word2'), 1), 
    (('VI', 'word3'), 1), 
    (('MO', 'word4'), 58), 
    ... 
] 
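As a sanity check without a Spark cluster, the same map plus per-key reduction can be sketched in plain Python; the flat `tokenized` records below are made up to mirror the data above:

```python
from collections import defaultdict
from operator import add

# made-up flat records, assuming (name, (word, count)) pairs
tokenized = [
    (u'VI', (u'word1', 1)), (u'VI', (u'word2', 1)), (u'VI', (u'word3', 1)),
    (u'MO', (u'word4', 1)), (u'MO', (u'word4', 1)), (u'MO', (u'word5', 1)),
]

def key_by_name_word(record):
    name, (word, count) = record
    return (name, word), count

# what reduceByKey(add) does: combine all values that share a key
counts = defaultdict(int)
for key, count in map(key_by_name_word, tokenized):
    counts[key] = add(counts[key], count)
```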

To get it into exactly the format you mentioned, you can then do:

def key_by_name(record): 
    # inverse of key_by_name_word, but wraps the (word, count) pair in a 
    # list so that reduceByKey(add) concatenates the lists per name 
    (name, word), count = record 
    return name, [(word, count)] 

output = counts_by_name_word.map(key_by_name).reduceByKey(add) 

But it may actually be easier to work with your data in the flat counts_by_name_word format.
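For example, a threshold filter is a one-liner on the flat format; on an RDD the same lambda would go into `.filter()`. The sample counts below are made up:

```python
# made-up flat data in the counts_by_name_word shape: ((name, word), count)
counts_by_name_word = [
    ((u'VI', u'word1'), 1),
    ((u'MO', u'word4'), 58),
    ((u'MO', u'word8'), 2),
    ((u'MO', u'word9'), 23),
]

# keep only pairs whose count meets the threshold
frequent = [kv for kv in counts_by_name_word if kv[1] >= 5]
```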


My data structure was a bit different, but this helped me understand how to solve it. My initial data looked like `[Row(key=u'VI', item=u'word1 word2 word3'), ...]`, and I created a function that tokenizes the item and returns `[((name, token), 1) for token in tokens]`. From there I used flatMap to apply the function to my data and get the structure you suggested. – scmz


For completeness, here is how I solved each part of the question:

Question 1: Count words by some key

import re 

def restructure_data(name_and_freetext): 
    name = name_and_freetext[0] 
    # strip digits and punctuation before splitting on whitespace 
    tokens = re.sub(r'[&/\d.,:\-()+$!]', ' ', name_and_freetext[1]).split() 
    return [((name, token), 1) for token in tokens] 

filtered_data = data.filter((data.flag==1)).select('name', 'item') 
tokenized = filtered_data.rdd.flatMap(restructure_data) 

Question 2: Filter out words whose total counts fall below some threshold:

from operator import add 

# keep words which have counts >= 5 
counts_by_name_word = tokenized.reduceByKey(add).filter(lambda x: x[1] >= 5) 

# map filtered word counts into a list by key so we can sort them 
restruct = counts_by_name_word.map(lambda x: (x[0][0], [(x[0][1], x[1])])) 

Bonus: Sort the words from most frequent to least frequent

# sort the word counts from most frequent to least frequent words 
output = restruct.reduceByKey(add).map(lambda x: (x[0], sorted(x[1], key=lambda y: y[1], reverse=True))).collect()
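The regroup-and-sort step can also be checked without Spark using a plain-Python equivalent; the counts below are made up:

```python
from collections import defaultdict

# made-up filtered counts in the ((name, word), count) shape
counts_by_name_word = [
    ((u'MO', u'word4'), 58),
    ((u'MO', u'word9'), 23),
    ((u'MO', u'word8'), 5),
    ((u'VI', u'word1'), 7),
]

# what the map into singleton lists + reduceByKey(add) builds: one list per name
grouped = defaultdict(list)
for (name, word), count in counts_by_name_word:
    grouped[name].append((word, count))

# sort each person's word counts from most to least frequent
output = [(name, sorted(pairs, key=lambda y: y[1], reverse=True))
          for name, pairs in grouped.items()]
```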