2015-11-05 17 views

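Python tokenization with different logic

I want to write two separate tokenization functions in Python. The first takes a string and returns a list of tokens such that (1) all tokens are lowercase and (2) all punctuation marks are kept as separate tokens.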

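The second is the same as the first with one difference: whenever the word 'not' appears, the two tokens that follow it get a 'not_' prefix. See the example below.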

I was able to build the first one. Here is my code for the first tokenization function:

import re

def token(text):
    # Lowercase the text, then match runs of word characters or single punctuation marks
    x = re.findall(r"[\w]+|['(&#@$*.,/)!?;^]", text.lower())
    return x

Output:

token("Hi! How's it going??? an_underscore is not *really* punctuation.") 
['hi','!','how',"'",'s','it','going','?','?','?','an_underscore','is','not','*','really','*','punctuation','.']

Expected output for the second tokenization function:

tokenize_with_not("This movie is not good. In fact, it is not even really a movie.") 
['this','movie','is','not','not_good','not_.','in','fact',',','it','is','not','not_even','not_really','a','movie','.'] 

Can someone help me complete the second tokenization function? Any help would be appreciated.

Answers

1

Try:

import re

def token(text):
    # Lowercase the text, then match runs of word characters or single punctuation marks
    x = re.findall(r"[\w]+|['(&#@$*.,/)!?;^]", text.lower())
    return x

def tokenize_with_not(text):
    result = []
    c = 0  # number of upcoming tokens that still need the 'not_' prefix
    for t in token(text):
        if t == 'not':
            c = 2
            result.append(t)
        else:
            if c > 0:
                result.append('not_' + t)
                c -= 1
            else:
                result.append(t)

    return result

print(tokenize_with_not("This movie is not good. In fact, it is not even really a movie."))

Nice, thanks, this works! – Wolf
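As a quick check, the counter idea from the answer above can be run against the expected output given in the question. The sketch below is only an illustration: the helper name `prefix_after` and its `trigger` and `window` parameters are hypothetical and not part of the original answer; `token` is copied from the question.

```python
import re

def token(text):
    # Tokenizer from the question: lowercase words or single punctuation marks
    return re.findall(r"[\w]+|['(&#@$*.,/)!?;^]", text.lower())

def prefix_after(tokens, trigger="not", window=2):
    # Hypothetical helper: prefix the `window` tokens that follow each `trigger`
    result = []
    remaining = 0  # how many upcoming tokens still need the prefix
    for tok in tokens:
        if tok == trigger:
            remaining = window  # restart the window at every trigger
            result.append(tok)
        elif remaining > 0:
            result.append(trigger + "_" + tok)
            remaining -= 1
        else:
            result.append(tok)
    return result

sentence = "This movie is not good. In fact, it is not even really a movie."
expected = ['this', 'movie', 'is', 'not', 'not_good', 'not_.', 'in', 'fact', ',',
            'it', 'is', 'not', 'not_even', 'not_really', 'a', 'movie', '.']
print(prefix_after(token(sentence)) == expected)  # prints True
```

Making the window size a parameter is a design choice for this sketch only; with the default of 2 it behaves like the counter in the answer.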

0

You can try this:

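def token_with(text, t):
    # Relies on token() from the question
    ret = token(text)
    for i in range(len(ret)):
        if ret[i] == t:
            try:
                # Prefix the next two tokens in place
                ret[i+1] = '{}_{}'.format(t, ret[i+1])
                ret[i+2] = '{}_{}'.format(t, ret[i+2])
            except IndexError:
                # Fewer than two tokens remain after t
                pass
    return ret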

Usage:

token_with("This movie is not good. In fact, it is not even really a movie.", "not")
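One thing worth noting: the two answers agree on the example from the question, but they appear to differ when a second 'not' falls inside the two-token window of the first. The sketch below reproduces both approaches side by side. The function names `with_counter` and `in_place` are labels chosen here for illustration, and the input sentence is a made-up test case.

```python
import re

def token(text):
    # Tokenizer from the question
    return re.findall(r"[\w]+|['(&#@$*.,/)!?;^]", text.lower())

def with_counter(text):
    # Counter approach from the first answer
    result, c = [], 0
    for t in token(text):
        if t == 'not':
            c = 2  # a new 'not' restarts the window and is left unprefixed
            result.append(t)
        elif c > 0:
            result.append('not_' + t)
            c -= 1
        else:
            result.append(t)
    return result

def in_place(text, t='not'):
    # In-place approach from the second answer
    ret = token(text)
    for i in range(len(ret)):
        if ret[i] == t:
            try:
                ret[i+1] = '{}_{}'.format(t, ret[i+1])
                ret[i+2] = '{}_{}'.format(t, ret[i+2])
            except IndexError:
                pass
    return ret

text = "It is not not good."
print(with_counter(text))  # ['it', 'is', 'not', 'not', 'not_good', 'not_.']
print(in_place(text))      # ['it', 'is', 'not', 'not_not', 'not_good', '.']
```

In the in-place version the second 'not' has already been rewritten to 'not_not' by the time the loop reaches it, so it no longer matches and does not start a new window. Which behavior is wanted depends on how repeated 'not' should be treated, which the question does not specify.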