How do I implement re.search() in my code?

I am working on a binary classification problem for text data. I want to classify the words of a text according to their occurrence in some well-defined word classes that I have chosen. So far I have been searching for each whole word of the text in every word class and incrementing that class's count on a match; the counts are then used to compute the frequency of each word class. Here is my code:

import nltk 
import re 

def wordClassFeatures(text):
    home = """woke home sleep today eat tired wake watch
        watched dinner ate bed day house tv early boring
        yesterday watching sit"""

    conversation = """know people think person tell feel friends
        talk new talking mean ask understand feelings care thinking
        friend relationship realize question answer saying"""

    countHome = countConversation = 0

    totalWords = len(text.split())

    text = text.lower()
    text = nltk.word_tokenize(text)
    conversation = nltk.word_tokenize(conversation)
    home = nltk.word_tokenize(home)
    '''
    for word in text:
        if word in conversation:  # this is my current approach
            countConversation += 1
        if word in home:
            countHome += 1
    '''

    for word in text:
        if re.search(word, conversation):  # this is what I want to implement
            countConversation += 1
        if re.search(word, home):
            countHome += 1

    countConversation /= 1.0 * totalWords
    countHome /= 1.0 * totalWords

    return (countHome, countConversation)

text = """ Long time no see. Like always I was rewriting it from scratch a couple of times. But nevertheless 
it's still java and now it uses metropolis sampling to help that poor path tracing converge. Btw. I did MLT on 
yesterday evening after 2 beers (it had to be Ballmer peak). Altough the implementation is still very fresh it 
easily outperforms standard path tracing, what is to be seen especially when difficult caustics are involved. 
I've implemented spectral rendering too, it was very easy actually, cause all computations on wavelengths are 
linear just like rgb. But then I realised that even if it does feel more physically correct to do so, whats the 
point? 3d applications are operating in rgb color space, and because I cant represent a rgb color as spectrum 
interchangeably I have to approximate it, so as long as I'm not running a physical simulation or something I don't 
see the benefits (please correct me if I'm wrong), thus I abandoned that.""" 

print(wordClassFeatures(text)) 

The drawback of this is the extra overhead it creates for every word class: a word from the text only falls into a class if it matches one of the class's words exactly. So I am now trying to feed each word in as a regular expression and search for it within each word class. This raises the error:

line 362, in wordClassFeatures 
if re.search(conversation, word): 
    File "/root/anaconda3/lib/python3.6/re.py", line 182, in search 
    return _compile(pattern, flags).search(string) 
    File "/root/anaconda3/lib/python3.6/re.py", line 289, in _compile 
    p, loc = _cache[type(pattern), pattern, flags] 
TypeError: unhashable type: 'list' 

I know there is a major mistake in the syntax, but I cannot find it. Most of the re.search examples on the web are of the format:

re.search("thank|appreciate|advance", x)

Is there any way to implement this correctly?


It should be 're.search(word, conversation)'.


@Rawing Tried that. With 'if re.search(word, conversation):' it now raises: File "/root/anaconda3/lib/python3.6/re.py", line 182, in search; return _compile(pattern, flags).search(string); TypeError: expected string or bytes-like object


This question needs a [Minimal, Complete, and Verifiable](http://stackoverflow.com/help/mcve) example. That makes it easier for us to help you.

Answer


I believe re.search is looking for a string, not the list that the code is feeding it through the conversation variable.
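
Both tracebacks in the question follow from this: re.search needs plain strings for both the pattern and the text it searches. A minimal, purely illustrative reproduction (each line raises on its own):

import re

tokens = ['know', 'people']  # the kind of list nltk.word_tokenize returns
re.search(tokens, 'know')    # TypeError: unhashable type: 'list' -- a list cannot be a pattern
re.search('know', tokens)    # TypeError: expected string or bytes-like object -- the haystack must be a string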

Also, the way you are tokenizing the text, special characters stay in the tokens and throw the search off, because they get interpreted as regex metacharacters.
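
To see why, note that tokenizing the raw sample text yields tokens such as '(' and '.', which as patterns are a syntax error and a match-anything wildcard respectively. If you ever do need to keep such characters, re.escape is the usual guard; a small sketch of that alternative (not the approach taken below):

import re

home = "woke home sleep today eat tired wake watch"  # abbreviated vocabulary
# re.search('(', home)   # error: missing ), unterminated subpattern
# re.search('.', home)   # matches any character -- silently inflates the count
print(re.search(re.escape('('), home))  # None: the parenthesis is now matched literally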

So first we strip the text of all special characters:

text = re.sub(r'\W+', ' ', text)  # strip text of all special characters

Next, we leave the conversation and home variables as they are, in string form, instead of tokenizing them:

#conversation = nltk.word_tokenize(conversation) 
#home = nltk.word_tokenize(home) 

With that, we get the desired answer:

(0.21301775147928995, 0.20118343195266272) 
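
One caveat with this result: re.search(word, conversation) is a substring match, so a short token can hit inside a longer vocabulary word. For instance the token 'in' (from "operating in rgb color space") matches inside 'think' and 'thinking'. If only whole-word hits should count, anchor the pattern with \b word boundaries; a sketch:

import re

conversation = "know people think person tell feel friends thinking"
print(re.search('in', conversation))                             # matches inside 'think'
print(re.search(r'\b' + re.escape('in') + r'\b', conversation))  # None: 'in' is not a whole word here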

The full code:

import nltk 
import re 

def wordClassFeatures(text):
    home = """woke home sleep today eat tired wake watch
        watched dinner ate bed day house tv early boring
        yesterday watching sit"""

    conversation = """know people think person tell feel friends
        talk new talking mean ask understand feelings care thinking
        friend relationship realize question answer saying"""

    text = re.sub(r'\W+', ' ', text)  # strip text of all special characters

    countHome = countConversation = 0

    totalWords = len(text.split())

    text = text.lower()
    text = nltk.word_tokenize(text)
    # conversation and home stay as plain strings so re.search gets a string to search in
    #conversation = nltk.word_tokenize(conversation)
    #home = nltk.word_tokenize(home)
    '''
    for word in text:
        if word in conversation:  # this is my current approach
            countConversation += 1
        if word in home:
            countHome += 1
    '''

    for word in text:
        if re.search(word, conversation):  # this is what I want to implement
            countConversation += 1
        if re.search(word, home):
            countHome += 1

    countConversation /= 1.0 * totalWords
    countHome /= 1.0 * totalWords

    return (countHome, countConversation)

text = """ Long time no see. Like always I was rewriting it from scratch a couple of times. But nevertheless 
it's still java and now it uses metropolis sampling to help that poor path tracing converge. Btw. I did MLT on 
yesterday evening after 2 beers (it had to be Ballmer peak). Altough the implementation is still very fresh it 
easily outperforms standard path tracing, what is to be seen especially when difficult caustics are involved. 
I've implemented spectral rendering too, it was very easy actually, cause all computations on wavelengths are 
linear just like rgb. But then I realised that even if it does feel more physically correct to do so, whats the 
point? 3d applications are operating in rgb color space, and because I cant represent a rgb color as spectrum 
interchangeably I have to approximate it, so as long as I'm not running a physical simulation or something I don't 
see the benefits (please correct me if I'm wrong), thus I abandoned that.""" 

print(wordClassFeatures(text))
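
Finally, if the word-by-word searches become slow on longer texts, the alternation format quoted in the question (re.search("thank|appreciate|advance", x)) generalizes nicely: compile one pattern per word class and test every token against it. A hedged sketch of that variant (the function name and abbreviated vocabulary are illustrative, not part of the answer above):

import re

def classFrequency(text, vocab):
    """Fraction of tokens in `text` that exactly match a word in `vocab`
    (a whitespace-separated string of class words)."""
    tokens = re.sub(r'\W+', ' ', text).lower().split()
    # one compiled pattern of the form 'woke|home|sleep|...'
    pattern = re.compile('|'.join(map(re.escape, vocab.split())))
    # fullmatch requires the whole token to equal one alternative
    return sum(1 for tok in tokens if pattern.fullmatch(tok)) / len(tokens)

home_vocab = "woke home sleep today eat tired wake watch"  # abbreviated for the example
print(classFrequency(text, home_vocab))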