2017-09-29

I am trying to adapt the Spark word count example so that words are counted grouped by some other value (for example, words and counts per person, where the person is "VI" or "MO" in the case below). I want to reduce a list of (word, count) tuples, aggregated by key.

I have an RDD whose values are lists of (word, count) tuples:

[(u'VI', [(u'word1', 1), (u'word2', 1), (u'word3', 1)]), 
(u'MO', 
    [(u'word4', 1), 
    (u'word4', 1), 
    (u'word5', 1), 
    (u'word8', 1), 
    (u'word10', 1), 
    (u'word1', 1), 
    (u'word4', 1), 
    (u'word6', 1), 
    (u'word9', 1), 
    ... 
    ])] 

I would like to do something like this:

from operator import add 
reduced_tokens = tokenized.reduceByKey(add) 
reduced_tokens.take(2) 

which would give me:

[ 
('VI', 
    [(u'word1', 1), (u'word2', 1), (u'word3', 1)]), 
('MO', 
    [(u'word4', 58), (u'word8', 2), (u'word9', 23), ...]) 
] 

Similar to the word count example here, I would like to be able to filter out words whose counts fall below some threshold. Thanks!
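As a plain-Python sketch (no Spark) of the end result I am after — per-person word counts with a threshold — where the sample observations and the threshold of 2 are made up:

```python
from collections import Counter, defaultdict

# made-up flat (name, word) observations, mirroring the tuples in the RDD above
observations = [
    (u'MO', u'word4'), (u'MO', u'word4'), (u'MO', u'word5'),
    (u'MO', u'word4'), (u'VI', u'word1'),
]

# count words per person
counts = defaultdict(Counter)
for name, word in observations:
    counts[name][word] += 1

# keep only words seen at least `threshold` times for that person
threshold = 2
result = {name: {w: c for w, c in ctr.items() if c >= threshold}
          for name, ctr in counts.items()}
```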

Answers


The key you are trying to reduce by is the (name, word) pair, not just the name. So you need a .map step to restructure your data:

from operator import add 

# assumes each record in `tokenized` is a flat (name, (word, count)) pair 
def key_by_name_word(record): 
    name, (word, count) = record 
    return (name, word), count 

tokenized_by_name_word = tokenized.map(key_by_name_word) 
counts_by_name_word = tokenized_by_name_word.reduceByKey(add) 

This should give you:

[ 
    (('VI', 'word1'), 1), 
    (('VI', 'word2'), 1), 
    (('VI', 'word3'), 1), 
    (('MO', 'word4'), 58), 
    ... 
] 
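As a sanity check without a Spark cluster, the same map plus per-key reduction can be sketched in plain Python; the flat `tokenized` records below are made up to mirror the data above:

```python
from collections import defaultdict
from operator import add

# made-up flat records, assuming (name, (word, count)) pairs
tokenized = [
    (u'VI', (u'word1', 1)), (u'VI', (u'word2', 1)), (u'VI', (u'word3', 1)),
    (u'MO', (u'word4', 1)), (u'MO', (u'word4', 1)), (u'MO', (u'word5', 1)),
]

def key_by_name_word(record):
    name, (word, count) = record
    return (name, word), count

# what reduceByKey(add) does: combine all values that share a key
counts = defaultdict(int)
for key, count in map(key_by_name_word, tokenized):
    counts[key] = add(counts[key], count)
```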

To get it into exactly the format you mentioned, you can then do:

def key_by_name(record): 
    # inverse of key_by_name_word, but wraps the (word, count) pair in a 
    # list so that reduceByKey(add) concatenates the lists per name 
    (name, word), count = record 
    return name, [(word, count)] 

output = counts_by_name_word.map(key_by_name).reduceByKey(add) 

But it may actually be easier to work with your data in the flat counts_by_name_word format.
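For example, a threshold filter is a one-liner on the flat format; on an RDD the same lambda would go into `.filter()`. The sample counts below are made up:

```python
# made-up flat data in the counts_by_name_word shape: ((name, word), count)
counts_by_name_word = [
    ((u'VI', u'word1'), 1),
    ((u'MO', u'word4'), 58),
    ((u'MO', u'word8'), 2),
    ((u'MO', u'word9'), 23),
]

# keep only pairs whose count meets the threshold
frequent = [kv for kv in counts_by_name_word if kv[1] >= 5]
```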


My data structure was a bit different, but this helped me understand how to solve it. My initial data looked like `[Row(key=u'VI', item=u'word1 word2 word3'), ...]`, and I created a function that tokenizes the item and returns `[((name, token), 1) for token in tokens]`. From there I used flatMap to apply the function to my data and get the structure you suggested. – scmz


For completeness, here is how I solved each part of the question:

Question 1: Count words by some key

import re 

def restructure_data(name_and_freetext): 
    name = name_and_freetext[0] 
    # strip digits and punctuation before splitting on whitespace 
    tokens = re.sub(r'[&/\d.,:\-()+$!]', ' ', name_and_freetext[1]).split() 
    return [((name, token), 1) for token in tokens] 

filtered_data = data.filter((data.flag==1)).select('name', 'item') 
tokenized = filtered_data.rdd.flatMap(restructure_data) 

Question 2: Filter out words whose total counts fall below some threshold:

from operator import add 

# keep words which have counts >= 5 
counts_by_name_word = tokenized.reduceByKey(add).filter(lambda x: x[1] >= 5) 

# map filtered word counts into a list by key so we can sort them 
restruct = counts_by_name_word.map(lambda x: (x[0][0], [(x[0][1], x[1])])) 

Bonus: Sort the words from most frequent to least frequent

# sort the word counts from most frequent to least frequent words 
output = restruct.reduceByKey(add).map(lambda x: (x[0], sorted(x[1], key=lambda y: y[1], reverse=True))).collect()
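The regroup-and-sort step can also be checked without Spark using a plain-Python equivalent; the counts below are made up:

```python
from collections import defaultdict

# made-up filtered counts in the ((name, word), count) shape
counts_by_name_word = [
    ((u'MO', u'word4'), 58),
    ((u'MO', u'word9'), 23),
    ((u'MO', u'word8'), 5),
    ((u'VI', u'word1'), 7),
]

# what the map into singleton lists + reduceByKey(add) builds: one list per name
grouped = defaultdict(list)
for (name, word), count in counts_by_name_word:
    grouped[name].append((word, count))

# sort each person's word counts from most to least frequent
output = [(name, sorted(pairs, key=lambda y: y[1], reverse=True))
          for name, pairs in grouped.items()]
```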