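I'm using the nltk corpus movie_reviews, which contains a lot of documents. My task is to measure the predictive performance on these reviews both with and without preprocessing the data. But there is a problem: the lists documents and documents2 hold the same documents, and I need to shuffle them so that both lists stay in the same order. I can't shuffle them separately, because every shuffle gives me a different result. That's why I need to shuffle both at once in the same order: I have to compare them at the end, and the comparison depends on the order. I'm using Python 2.7. How can I shuffle two lists at once with the same order?
Example (in reality the strings are tokenized, but that isn't relevant here):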
documents = [(['plot : two teen couples go to a church party , '], 'neg'),
(['drink and then drive . '], 'pos'),
(['they get into an accident . '], 'neg'),
(['one of the guys dies'], 'neg')]
documents2 = [(['plot two teen couples church party'], 'neg'),
(['drink then drive . '], 'pos'),
(['they get accident . '], 'neg'),
(['one guys dies'], 'neg')]
After shuffling, I need both lists to come out like this:
documents = [(['one of the guys dies'], 'neg'),
(['they get into an accident . '], 'neg'),
(['drink and then drive . '], 'pos'),
(['plot : two teen couples go to a church party , '], 'neg')]
documents2 = [(['one guys dies'], 'neg'),
(['they get accident . '], 'neg'),
(['drink then drive . '], 'pos'),
(['plot two teen couples church party'], 'neg')]
I have this code:
import random

import nltk
from nltk.corpus import movie_reviews, stopwords

def cleanDoc(doc):
    # Lowercase, drop English stopwords and tokens of 2 characters or fewer, then stem.
    stopset = set(stopwords.words('english'))
    stemmer = nltk.PorterStemmer()
    clean = [token.lower() for token in doc if token.lower() not in stopset and len(token) > 2]
    final = [stemmer.stem(word) for word in clean]
    return final

# Raw reviews, one (tokens, category) pair per file.
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
# The same reviews in the same order, but preprocessed.
documents2 = [(list(cleanDoc(movie_reviews.words(fileid))), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]
# This is where I'm stuck: shuffle documents and documents2 here, in the same order (with random.shuffle or somehow).
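For the shuffle step itself, one common approach is to zip the two lists into pairs, shuffle the pairs, and unzip them again. The following is only a minimal sketch, assuming both lists have the same length and are aligned index by index; shuffle_together is a hypothetical helper name, not something from nltk or the random module, and the sample data is the toy example from the question.

```python
import random

def shuffle_together(first, second, rng=random):
    """Return shuffled copies of two aligned lists, keeping pairs together.

    Hypothetical helper: `rng` can be the random module or a random.Random instance.
    """
    if len(first) != len(second):
        raise ValueError("both lists must have the same length")
    paired = list(zip(first, second))   # [(first[0], second[0]), ...]
    rng.shuffle(paired)                 # one shuffle, applied to the pairs
    if not paired:                      # zip(*[]) cannot be unpacked into two names
        return [], []
    shuffled_first, shuffled_second = zip(*paired)
    return list(shuffled_first), list(shuffled_second)

documents = [(['plot : two teen couples go to a church party , '], 'neg'),
             (['drink and then drive . '], 'pos'),
             (['they get into an accident . '], 'neg'),
             (['one of the guys dies'], 'neg')]
documents2 = [(['plot two teen couples church party'], 'neg'),
              (['drink then drive . '], 'pos'),
              (['they get accident . '], 'neg'),
              (['one guys dies'], 'neg')]

original_pairs = list(zip(documents, documents2))
documents, documents2 = shuffle_together(documents, documents2)
```

Because the pairs travel together, whatever position a raw review ends up in, its preprocessed counterpart ends up in the same position. The same code should run on Python 2.7 and Python 3.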
@thefourtheye, thank you so much! I've updated my answer. – sshashank124
Thanks, that's exactly what I needed. –
(noob question) - what does the * mean? –
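On the * question: the answer being discussed isn't quoted here, so this assumes the * is the one in a call like zip(*pairs). In a function call, * unpacks a sequence into separate positional arguments. A small sketch with made-up data:

```python
pairs = [(1, 'a'), (2, 'b'), (3, 'c')]

# zip(*pairs) is the same call as zip((1, 'a'), (2, 'b'), (3, 'c')):
# each tuple in the list becomes its own argument to zip.
numbers, letters = zip(*pairs)

print(numbers)  # (1, 2, 3)
print(letters)  # ('a', 'b', 'c')
```

So zipping two lists pairs them up, and zip(*...) on the pairs splits them back into two sequences.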