2017-09-21 87 views

I have a pandas DataFrame with a column that has already been tokenized, so each cell holds a list of tokens:

df['token_col'] = df.col.apply(word_tokenize) 

Now I want to POS-tag those tokenized lists:

df['pos_col'] = nltk.tag.pos_tag(df['token_col']) 
df['wordnet_tagged_pos_col'] = [(w,get_wordnet_pos(t)) for (w, t) in (df['pos_col'])] 

But I get an error that I don't understand:

AttributeError       Traceback (most recent call last) 
<ipython-input-28-99d28433d090> in <module>() 
     1 #tag tokenized lists 
----> 2 df['pos_col'] = nltk.tag.pos_tag(df['token_col']) 
     3 df['wordnet_tagged_pos_col'] = [(w,get_wordnet_pos(t)) for (w, t) in (df['pos_col'])] 

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tag\__init__.py in pos_tag(tokens, tagset, lang) 
    125  """ 
    126  tagger = _get_tagger(lang) 
--> 127  return _pos_tag(tokens, tagset, tagger) 
    128 
    129 

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tag\__init__.py in _pos_tag(tokens, tagset, tagger) 
    93 
    94 def _pos_tag(tokens, tagset, tagger): 
---> 95  tagged_tokens = tagger.tag(tokens) 
    96  if tagset: 
    97   tagged_tokens = [(token, map_tag('en-ptb', tagset, tag)) for (token, tag) in tagged_tokens] 

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tag\perceptron.py in tag(self, tokens) 
    150   output = [] 
    151 
--> 152   context = self.START + [self.normalize(w) for w in tokens] + self.END 
    153   for i, word in enumerate(tokens): 
    154    tag = self.tagdict.get(word) 

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tag\perceptron.py in <listcomp>(.0) 
    150   output = [] 
    151 
--> 152   context = self.START + [self.normalize(w) for w in tokens] + self.END 
    153   for i, word in enumerate(tokens): 
    154    tag = self.tagdict.get(word) 

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tag\perceptron.py in normalize(self, word) 
    236   if '-' in word and word[0] != '-': 
    237    return '!HYPHEN' 
--> 238   elif word.isdigit() and len(word) == 4: 
    239    return '!YEAR' 
    240   elif word[0].isdigit(): 

AttributeError: 'list' object has no attribute 'isdigit' 

If it makes a difference, my next step will be lemmatizing those tagged tokens:

df['lmtzd_col'] = [(lmtzr.lemmatize(w, pos=t if t else 'n').lower(),t) for (w,t) in wordnet_tagged_pos_col] 
print(len(set(wordnet_tagged_pos_col)),(len(set(df['lmtzd_col'])))) 

My df is 70 columns wide, so here is a small snapshot:

ID_number Meeting1 Meeting2 Meeting3 Meeting4 Meeting5 col  
123456789 9/15/2015 1/8/2016 4/27/2016 NaN   NaN   [Assessment, of, Improvement, will, be, on-goi... 
987654321 9/22/2016 NaN   2/25/2017 NaN   NaN   [A, member, of, the, administrative, team, wil.. 
456789123 10/1/2015 11/30/2015 NaN   NaN   NaN   [During, our, second, and, third, meetings, we... 
+0

Can you post a sample of col? – Dark

+0

@Bharathshetty - added some sample data – LMGagne

+0

'get_wordnet_pos' isn't a built-in, right? – Dark

Answers

You can use apply to get the part-of-speech tags for each row, i.e.

df['pos_col'] = df['token_col'].apply(nltk.tag.pos_tag) 

df['pos_col'] 
 
0 [(Assessment, NNP), (of, NNP), (Improvement,... 
1 [(A, DT), (member, NNP), (of, NNP), (the, N... 
2 [(During, IN), (our, JJ), (second, NN), (an... 
Name: pos_col, dtype: object 
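
The reason the original call failed can be reproduced without NLTK or pandas. The sketch below is only an illustration under stated assumptions: `toy_tag` is a hypothetical stand-in for the tagger (it is not an NLTK function), and a plain list of lists plays the role of `df['token_col']`. Like the `normalize` step in the traceback, the stand-in calls `.isdigit()` on every element it receives, so handing it the whole column makes each "token" a list.

```python
# Hypothetical stand-in for a tagger: like the normalize() step in the
# traceback, it calls a string method on every element it is given.
def toy_tag(tokens):
    return [(w, "NUM" if w.isdigit() else "WORD") for w in tokens]

# Each inner list plays the role of one cell of df['token_col'].
token_col = [["Assessment", "of", "Improvement"], ["Meeting", "2", "notes"]]

# Passing the whole column: every element is itself a list, so .isdigit() fails.
try:
    toy_tag(token_col)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Tagging one cell at a time, which is what Series.apply does for each row.
tagged_col = [toy_tag(cell) for cell in token_col]

print(error_message)
print(tagged_col)
```

With a real DataFrame, the list comprehension corresponds to `df['token_col'].apply(nltk.tag.pos_tag)`, which calls the tagger once per cell.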

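Similarly, it is better to use apply with a lambda, so the function runs on each row, than to pass the whole Series to the function, like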

df['wordnet_tagged_pos_col'] = df['pos_col'].apply(lambda x: [(w, get_wordnet_pos(t)) for (w, t) in x])

because get_wordnet_pos has to be applied to every cell in the column.

df['wordnet_tagged_pos_col'] 
 
0 [(Assessment, (N, n)), (of, (N, n)), (Improv... 
1 [(A, (D, n)), (member, (N, n)), (of, (N, n))... 
2 [(During, (I, n)), (our, (J, a)), (second, (... 
Name: wordnet_tagged_pos_col, dtype: object 
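
`get_wordnet_pos` is never shown in this thread, so the version below is an assumption: a common shape for such a helper maps Penn Treebank tag prefixes to the WordNet letters 'a', 'v', 'n', 'r' and returns None otherwise, which fits the `pos=t if t else 'n'` fallback in the question. It may differ from the helper that produced the output above. `to_wordnet`, `lemmatize_tagged` and `stub_lemmatize` are hypothetical names introduced only for this sketch; the stub stands in for a real lemmatizer so the example runs without WordNet data.

```python
def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if unmapped."""
    if treebank_tag.startswith("J"):
        return "a"  # adjective
    if treebank_tag.startswith("V"):
        return "v"  # verb
    if treebank_tag.startswith("N"):
        return "n"  # noun
    if treebank_tag.startswith("R"):
        return "r"  # adverb
    return None

def to_wordnet(tagged_tokens):
    """Convert one row of (word, treebank_tag) pairs to (word, wordnet_pos)."""
    return [(w, get_wordnet_pos(t)) for (w, t) in tagged_tokens]

def lemmatize_tagged(wordnet_tagged, lemmatize):
    """Lemmatize one row; lemmatize is any callable(word, pos=...)."""
    return [(lemmatize(w, pos=t if t else "n").lower(), t)
            for (w, t) in wordnet_tagged]

# Stub lemmatizer (assumption): strips a trailing "s" from nouns only.
def stub_lemmatize(word, pos="n"):
    return word[:-1] if pos == "n" and word.endswith("s") else word

# One row, shaped like a single cell of df['pos_col'].
row = [("During", "IN"), ("our", "PRP$"), ("second", "JJ"), ("meetings", "NNS")]
wordnet_row = to_wordnet(row)
lemmatized_row = lemmatize_tagged(wordnet_row, stub_lemmatize)

print(wordnet_row)
print(lemmatized_row)
```

On the DataFrame, the same functions would be applied per row, for example `df['pos_col'].apply(to_wordnet)`, and the lemmatizing step from the question could pass `lmtzr.lemmatize` (assuming `lmtzr` is an NLTK `WordNetLemmatizer`) in place of the stub.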

Hope it helps.

+0

Thanks, I ran this code and got 'ValueError: too many values to unpack (expected 2)' – LMGagne

+0

On pos_col or the wordnet col? – Dark

+0

It's the pos_col line – LMGagne