Python: Removing POS tags from a txt file

I have the following txt file, in which every word carries a POS (part of speech) tag.

Needless/jj to/to say/vb ,/, I/ppss was/bedz furious/jj at/in this/dt unparalleled/jj intrusion/nn upon/in free/jj enterprise/nn ./. How/wrb dared/vbn they/ppss

Is there any way to read the file without the POS tags, so that the result would be:

Needless to say, I was furious at this unparalleled intrusion upon free enterprise. How dared they

So basically I want to remove the / and any characters that follow it.

words = re.findall('\w+',open(input_file).read())

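The code above removes the /, but the tag abbreviations such as jj and ppss still appear. So how do I remove the / together with whatever characters follow it?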


2013-03-12 Peace

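Will the file contain any '/' that is not a tag indicator? Are the word/tag combinations always separated by whitespace? Using '.split()' may or may not be a viable naive solution. – geoffspear 2013-03-12 15:16:46

Please see my answer – eyquem 2013-03-12 18:26:08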

Is this good enough?

>>> import re 
>>> s = 'Needless/jj to/to say/vb ,/, I/ppss was/bedz furious/jj at/in this/dt unparalleled/jj intrusion/nn upon/in free/jj enterprise/nn ./.' 
>>> re.sub(r'/[^\s]+','',s) 
'Needless to say , I was furious at this unparalleled intrusion upon free enterprise .'

This simply removes any text that starts with a / and runs up to the next whitespace character.
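As a minimal sketch of applying the same substitution to file contents rather than a string literal: the helper names strip_pos_tags and strip_pos_tags_from_file are hypothetical, and the pattern /\S+ is just a shorter spelling of /[^\s]+. The point to note is that re.sub expects one string, so the file is read with .read() instead of being tokenized into a list first.

```python
import re

# A slash followed by everything up to the next whitespace character.
TAG_PATTERN = re.compile(r'/\S+')

def strip_pos_tags(text):
    """Return text with every '/tag' suffix removed (hypothetical helper)."""
    return TAG_PATTERN.sub('', text)

def strip_pos_tags_from_file(path):
    """Read the whole file as a single string and strip the tags from it."""
    with open(path) as f:
        return strip_pos_tags(f.read())

sample = 'How/wrb dared/vbn they/ppss'
print(strip_pos_tags(sample))  # How dared they
```

Because the substitution runs over the whole text, line breaks in the file are left where they were.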


2013-03-12 15:18:46 mgilson

It doesn't work, because the text is in a list: newtxt = re.sub(r'/[^\s]+', '', words) Traceback (most recent call last): File "", line 1, in File "/usr/lib/python2.7/re.py", line 151, in sub return _compile(pattern, flags).sub(repl, string, count) TypeError: expected string or buffer – Peace 2013-03-12 15:43:37

What is 'words'? – mgilson 2013-03-12 15:44:03

words = re.findall('\w\S+', open(file_name).read()) – Peace 2013-03-12 15:45:25

As Wooble suggested, you can do this with two splits inside a comprehension:

s = 'Needless/jj to/to say/vb ,/, I/ppss was/bedz furious/jj at/in this/dt unparalleled/jj intrusion/nn upon/in free/jj enterprise/nn ./.' 
print(" ".join(word.split("/")[0] for word in s.split()))

Output:

Needless to say , I was furious at this unparalleled intrusion upon free enterprise .

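s.split() splits the sentence into individual tokens. word.split("/") separates the English word (or punctuation mark) from its part-of-speech tag. word.split("/")[0] keeps only the English word and discards the tag. " ".join() combines the resulting words into a single string.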
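A small sketch of the same idea wrapped in a function, assuming every token has the form word/tag and tokens are separated by whitespace. The name strip_tags is made up for illustration. It uses rsplit("/", 1) rather than split("/"), which is a deliberate variation: only the last slash counts as the word/tag separator, so a word that itself contains a slash is kept whole.

```python
def strip_tags(sentence):
    """Drop the trailing '/tag' from each whitespace-separated token."""
    # rsplit("/", 1) splits once, starting from the right, so only the
    # final slash in a token is treated as the word/tag separator.
    return " ".join(token.rsplit("/", 1)[0] for token in sentence.split())

sentences = ['How/wrb dared/vbn they/ppss', 'either/or/cc way/nn']
print([strip_tags(s) for s in sentences])
# ['How dared they', 'either/or way']
```

With split("/")[0] the second sentence would come out as 'either way', so which variant is right depends on whether the words in the corpus can contain slashes.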


2013-03-12 15:33:56 Kevin

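Will it work on a list? – Peace 2013-03-12 16:15:39

Sure, then it would be '[" ".join(word.split("/")[0] for word in s.split()) for s in myListOfSentences]' – Kevin 2013-03-12 16:17:26

Thank you very much, it works :) :) :) :) – Peace 2013-03-12 16:26:07

This code takes Wooble's remarks into account and, as far as I understand your need, handles a list of strings: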

li = [ ('//Needless/jj to/to say/vb ,/, ' 
     'I/ppss was/bedz fur/ious/jj at/in this/dt ' 
     'unparalleled/jj intrusion/nn upon/in ' 
     'free/jj enterprise/nn ./. ' 
     'How/wrb dared/vbn they/ppss'), 
     '/Before/jj to/to say/vb ,/, /I/ppss am/bedz h/a/p/p/y/jj'] 

import re 

def clean(s, r=re.compile(r'(?<![\s/])/[^\s/]+(?![\S/])')):
    # Remove only a "/tag" that ends at whitespace or at the end of the string.
    return r.sub('', s)

x = map(clean, li)

print('\n\n'.join(x))

Result

//Needless to say , I was fur/ious at this unparalleled intrusion upon free enterprise . How dared they 

/Before to say , /I am h/a/p/p/y
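For readers who find the pattern dense, here is a sketch that spells out the same regular expression with re.VERBOSE comments. The names TAG and clean_verbose are illustrative and not part of the answer; the pattern is intended to be the same one, with only whitespace and comments added.

```python
import re

TAG = re.compile(r'''
    (?<![\s/])   # the slash must not follow whitespace or another slash
    /            # the slash that introduces the tag
    [^\s/]+      # the tag: one or more characters, no whitespace or slash
    (?![\S/])    # the tag must end at whitespace or at the end of the string
''', re.VERBOSE)

def clean_verbose(s):
    """Strip only the final '/tag' of each token, leaving other slashes alone."""
    return TAG.sub('', s)

print(clean_verbose('/Before/jj to/to say/vb ,/, /I/ppss am/bedz h/a/p/p/y/jj'))
# /Before to say , /I am h/a/p/p/y
```

The lookahead is what protects slashes inside a word: in fur/ious/jj the candidate /ious is followed by another slash, so it is rejected and only /jj is removed.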


2013-03-12 18:25:21 eyquem
