Python tokenization with different logic

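I want to write two separate tokenization functions in Python. The first takes a string and returns a list of tokens such that (1) all tokens are lowercase and (2) every punctuation mark is kept as a separate token.

The second works the same way, with one difference: whenever the word 'not' appears, the two tokens that follow it get a 'not_' prefix. See the example below.

I managed to build the first one. Here is the code for my first tokenization function: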

import re

def token(text):
    # Lowercase the text, then match runs of word characters or single punctuation marks.
    return re.findall(r"[\w]+|['(&#@$*.,/)!?;^]", text.lower())

Output:

token("Hi! How's it going??? an_underscore is not *really* punctuation.") 
['hi','!','how',"'",'s','it','going','?','?','?','an_underscore','is','not','*','really','*','punctuation','.']

Expected output for the second tokenization function:

tokenize_with_not("This movie is not good. In fact, it is not even really a movie.") 
['this','movie','is','not','not_good','not_.','in','fact',',','it','is','not','not_even','not_really','a','movie','.']

Can someone help me complete the second tokenization function? Any help is appreciated.

Asked 2015-11-05 by Wolf

Try:

import re

def token(text):
    # Lowercase the text, then match runs of word characters or single punctuation marks.
    return re.findall(r"[\w]+|['(&#@$*.,/)!?;^]", text.lower())

def tokenize_with_not(text):
    result = []
    c = 0  # number of upcoming tokens that still need the 'not_' prefix
    for t in token(text):
        if t == 'not':
            c = 2  # prefix the next two tokens
            result.append(t)
        elif c > 0:
            result.append('not_' + t)
            c -= 1
        else:
            result.append(t)
    return result

print(tokenize_with_not("This movie is not good. In fact, it is not even really a movie."))
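If the trigger word or the number of affected tokens might change later, the same counter idea can be parameterized. This is only a sketch: the name `tokenize_with_prefix` and the parameters `trigger` and `window` are made up for illustration and are not part of the question.

```python
import re

def token(text):
    # Same tokenizer as in the question: lowercase words and single punctuation marks.
    return re.findall(r"[\w]+|['(&#@$*.,/)!?;^]", text.lower())

def tokenize_with_prefix(text, trigger="not", window=2):
    # Hypothetical generalization: prefix the `window` tokens that follow
    # each occurrence of `trigger` with '<trigger>_'.
    result = []
    remaining = 0  # tokens still to be prefixed
    for t in token(text):
        if t == trigger:
            remaining = window  # restart the countdown at every trigger
            result.append(t)
        elif remaining > 0:
            result.append(trigger + "_" + t)
            remaining -= 1
        else:
            result.append(t)
    return result

print(tokenize_with_prefix("This movie is not good. In fact, it is not even really a movie."))
```

With the defaults this should behave like `tokenize_with_not` above; passing `window=1` would prefix only the single token after each 'not'.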

Answered 2015-11-05 01:46:03

Nice, thanks, this works! – Wolf

You can try this:

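def token_with(text, t):
    # Relies on the token() function from the question.
    ret = token(text)
    for i in range(len(ret)):
        if ret[i] == t:
            try:
                # Prefix the two tokens that follow t.
                ret[i+1] = '{}_{}'.format(t, ret[i+1])
                ret[i+2] = '{}_{}'.format(t, ret[i+2])
            except IndexError:
                # t is at or near the end of the list; nothing left to prefix.
                pass
    return ret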

How to use it:

token_with("This movie is not good. In fact, it is not even really a movie.", "not")
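One edge case to be aware of: because `token_with` rewrites the list in place, a second 'not' that falls within two tokens of an earlier one becomes 'not_not' and no longer triggers prefixing of its own. A possible workaround, sketched below, is to read from an untouched copy of the tokens and write to a separate list. The name `token_with_copy` is hypothetical, and leaving a nearby trigger word unprefixed is an assumption about the desired behavior, not something stated in the question.

```python
import re

def token(text):
    # Same tokenizer as in the question: lowercase words and single punctuation marks.
    return re.findall(r"[\w]+|['(&#@$*.,/)!?;^]", text.lower())

def token_with_copy(text, t):
    # Hypothetical variant: compare against the original tokens so that
    # earlier rewrites cannot hide a later occurrence of t.
    tokens = token(text)
    ret = list(tokens)
    for i, tok in enumerate(tokens):
        if tok == t:
            for j in (i + 1, i + 2):
                # Skip positions past the end, and leave other triggers untouched (assumption).
                if j < len(tokens) and tokens[j] != t:
                    ret[j] = '{}_{}'.format(t, tokens[j])
    return ret

print(token_with_copy("not x not y z", "not"))
```

For inputs where every 'not' is at least three tokens away from the next one, such as the example sentence in the question, this should give the same result as `token_with`.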

Answered 2015-11-05 01:52:11
