Below I have code that compares a block of text against a set of stop words and returns a list of the words in the text that are not in the stop-word set. I then join the word list back into a single string so I can use it with the textmining module to build a term-document matrix. How can I preserve hyphenated words when building a term-document matrix with the python textmining module?

My checks in the code show that the hyphenated words survive in both the list and the string, but as soon as I pass them to the TDM part of the code, the hyphenated words get broken apart. Is there a way to keep hyphenated words intact in the textmining module and the TDM?

import re

f = open("words")  # stop-word dictionary, one word per line
stops = set()
for line in f:
    stops.add(line.strip())

f = open("azathoth")  # Azathoth (1922)
azathoth = list()
for line in f:
    # [A-Za-z] rather than [A-z]: the latter also matches [, \, ], ^, _ and `
    azathoth.extend(re.findall(r"[A-Za-z\-']+", line.strip()))

azathothcount = list()
for w in azathoth:
    if w in stops:
        continue
    else:
        azathothcount.append(w)

print azathothcount[1:10]
raw_input('Press Enter...')

azathothstr = ' '.join(azathothcount)
print azathothstr
raw_input('Press Enter...')

import textmining

def termdocumentmatrix_example():
    doc1 = azathothstr

    tdm = textmining.TermDocumentMatrix()
    tdm.add_doc(doc1)

    tdm.write_csv('matrixhp.csv', cutoff=1)

    for row in tdm.rows(cutoff=1):
        print row

raw_input('Press Enter...')
termdocumentmatrix_example()

Answer

When the TermDocumentMatrix class is initialized, the textmining package defaults to its own simple_tokenize function, and add_doc() pushes the text through simple_tokenize() before adding it to the tdm.
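A quick check makes the problem visible (a sketch assuming the old textmining 1.0 module; the exact output may vary slightly, but the docstring quoted below confirms that everything which is not a lowercase letter is stripped):

import textmining

# The hyphen is not a lowercase letter, so the default tokenizer
# treats it as a separator and the word falls apart.
print textmining.simple_tokenize('The well-known story')
# expected, roughly: ['the', 'well', 'known', 'story']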

The output of help(textmining), in part:

class TermDocumentMatrix(__builtin__.object) 
| Class to efficiently create a term-document matrix. 
| 
| The only initialization parameter is a tokenizer function, which should 
| take in a single string representing a document and return a list of 
| strings representing the tokens in the document. If the tokenizer 
| parameter is omitted it defaults to using textmining.simple_tokenize 
| 
| Use the add_doc method to add a document (document is a string). Use the 
| write_csv method to output the current term-document matrix to a csv 
| file. You can use the rows method to return the rows of the matrix if 
| you wish to access the individual elements without writing directly to a 
| file. 
| 
| Methods defined here: 
| 
| __init__(self, tokenizer=<function simple_tokenize>) 
| 
| ... 
| 
| simple_tokenize(document) 
| Clean up a document and split into a list of words. 
| 
| Converts document (a string) to lowercase and strips out 
| everything which is not a lowercase letter. 

So you have to roll your own tokenizer that does not split on hyphens, and pass it in when you initialize the TermDocumentMatrix class.
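If you do not care about reproducing the rest of simple_tokenize()'s behaviour, the replacement tokenizer can be a single regex. A minimal sketch (hyphen_tokenize is just an illustrative name, not part of textmining):

import re

def hyphen_tokenize(document):
    # Lowercase, then keep runs of letters optionally joined by hyphens,
    # so 'well-known' survives as a single token.
    return re.findall(r"[a-z]+(?:-[a-z]+)*", document.lower())

Pass it as the tokenizer argument when you construct the matrix, exactly as shown further down.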

In my opinion, though, the best approach keeps the rest of simple_tokenize()'s functionality and only removes the part that destroys hyphenated words, so you can merge the hyphenated words back into that function's results.

doc1 = 'blah "blah" blahbitty-blah, in-the bloopity blip bleep br-rump! ' 

import re 

def toknzr(txt): 
    # pull the hyphenated words out of the document first
    hyph_words = re.findall(r'\w+(?:-\w+)+', txt) 
    # strip them from the text so simple_tokenize() never sees them
    remove = '|'.join(hyph_words) 
    regex = re.compile(r'\b(' + remove + r')\b', flags=re.IGNORECASE) 
    simple = regex.sub("", txt) 
    # tokenize the remainder as usual and merge the two lists
    return hyph_words + textmining.simple_tokenize(simple) 

tdm = textmining.TermDocumentMatrix(tokenizer=toknzr) 
tdm.add_doc(doc1) 

Here, I pull the hyphenated words out of the document before pushing it into the TDM, run the remainder through simple_tokenize(), and then merge the two lists (hyphenated words + the simple_tokenize() results). This may not be the most exciting way to roll your own tokenizer (feedback appreciated!), but the main point is that you have to initialize the class with the new tokenizer rather than rely on the default simple_tokenize().
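To confirm that the hyphenated words now survive, you can dump the matrix the same way the question does (a sketch reusing the tdm built above; the filename is arbitrary):

tdm.write_csv('matrix_hyphens.csv', cutoff=1)
for row in tdm.rows(cutoff=1):
    print row
# 'blahbitty-blah', 'in-the' and 'br-rump' should appear as terms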
