Removing single quotes while retaining apostrophes in Python, NLTK

I'm trying to create a frequency list for a corpus of poetry. The code reads a .txt file and creates a .csv from the data.

The part I'm struggling with is stripping irrelevant punctuation out of the text. The relevant code I have so far is:

import nltk 

raw = open('file_name.txt', 'r').read() 
output = open('output_filename.csv','w') 
txt = raw.lower() 

pattern = r'''(?x)([A_Z]\.)+|\w+(-\w+)*|\.\.\|[][.,;"'?():-_`]''' 
tokenized = nltk.regexp_tokenize(txt,pattern)

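This works almost perfectly, as it preserves the hyphens in words such as chimney-sweeper, but it also cuts contractions into two separate words, which is not what I want.

For example, my text file (the trial run is on Songs of Innocence by William Blake) has the line: 'Pipe a song about a Lamb!'

which I want to be

Pipe | a | song | about | a | Lamb

The code I was using previously kept contractions intact, but also left me with single quotes attached to the words: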

import re
import string

for punct in string.punctuation:
    txt = txt.replace(punct, ' ')
txt = re.sub(r'\r+', ' ', txt)

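So I would get

'Pipe | a | song | about | a | Lamb

I'd like to find a middle ground between these two, as I need to preserve apostrophes in words such as O'er, as well as hyphens, but get rid of everything else.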
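As a rough sketch of that middle ground (not code from the post), a single regular expression can treat hyphens and apostrophes as part of a word only when they sit between word characters. The names `WORD_RE` and `tokenize` are hypothetical, and the pattern assumes that both the straight (') and the curly (’) apostrophe should count as word-internal:

```python
import re

# Hypothetical pattern: a run of word characters, optionally joined by
# internal hyphens or apostrophes (straight or curly). Quotes at the edge
# of a word are not followed or preceded by \w, so they are left out.
WORD_RE = re.compile(r"\w+(?:[-'’]\w+)*")

def tokenize(text):
    """Return lowercase tokens, keeping word-internal hyphens and apostrophes."""
    return WORD_RE.findall(text.lower())

print(tokenize("'Pipe a song about a Lamb!' O'er the chimney-sweeper"))
# ['pipe', 'a', 'song', 'about', 'a', 'lamb', "o'er", 'the', 'chimney-sweeper']
```

Note that `\w` also matches digits and underscores, so this may need tightening for other corpora.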

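I know this topic seems to have been exhausted on this forum, but I have spent the past four days trying every example offered and have not been able to get them to work as advertised, so rather than tear out all my hair I thought I'd try posting a question.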

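EDIT:

It seems that the reason a standard tokenizer wasn't working on my text is that some of the apostrophes slant to the left or right and turn up in odd places. I've produced my desired results using a series of .replace() calls: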

# Treat each line break as a space.
txt = txt.replace("\n", " ")
# Replace stray closing quotation marks with a space.
txt = txt.replace("”", " ")
# Replace stray opening quotation marks with a space.
txt = txt.replace("“", " ")
# Replace a right-leaning apostrophe with a space if it follows a space (which now includes line breaks).
txt = txt.replace(" ’", " ")
# Replace a left-leaning apostrophe with a space if it follows a space.
txt = txt.replace(" ‘", " ")

I don't doubt there's a way of consolidating all of those into a single line of code, but I'm just really glad that this all works!
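One possible consolidation, offered only as a sketch: the five .replace() calls can probably be folded into two `re.sub` calls. The function name `strip_stray_quotes` is made up for illustration, and the sketch assumes the same rules as above (line breaks and curly double quotes always become spaces; a curly single quote is dropped only when it follows a space).

```python
import re

def strip_stray_quotes(text):
    """Fold the chained .replace() calls into two substitutions."""
    # Line breaks and curly double quotation marks become spaces.
    text = re.sub(r"[\n“”]", " ", text)
    # A curly single quote that directly follows a space is removed together
    # with that space and replaced by one space, so word-internal
    # apostrophes such as the one in o’er are left alone.
    return re.sub(r" [‘’]", " ", text)

sample = "“Pipe a song about a Lamb!”\n‘So I piped o’er the vale"
print(strip_stray_quotes(sample).split())
```

The order matters here in the same way as in the chained version: line breaks have to become spaces first so that a quote at the start of a line counts as following a space.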

Source

2014-03-12 E.Kori

Rather than replacing the punctuation, you could split on spaces, then strip punctuation from the start and end of each word:

>>> import string 
>>> phrase = "'This has punctuation, and it's hard to remove!'" 
>>> [word.strip(string.punctuation) for word in phrase.split(" ")] 
['This', 'has', 'punctuation', 'and', "it's", 'hard', 'to', 'remove']

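This keeps apostrophes and hyphens within words, while removing punctuation at the start or end of words.

Note that standalone punctuation will be replaced by an empty string "":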

>>> phrase = "This is - no doubt - punctuated" 
>>> [word.strip(string.punctuation) for word in phrase.split(" ")] 
['This', 'is', '', 'no', 'doubt', '', 'punctuated']

This is easy to filter out, as the empty string evaluates to False:

filtered = [f for f in txt if f and f.lower() not in stopwords] 
          #^excludes empty string
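Putting the pieces together, a minimal sketch of the whole frequency-list step might look like the following. The helper names `word_frequencies` and `write_csv`, the `word,count` header row, and the empty default for `stopwords` are assumptions for illustration, not part of the question. Also note that `string.punctuation` only contains ASCII characters, so curly quotes would still need separate handling.

```python
import csv
import io
import string
from collections import Counter

def word_frequencies(text, stopwords=frozenset()):
    """Count words after stripping leading and trailing ASCII punctuation."""
    # split() with no argument splits on any whitespace, including newlines.
    words = (w.strip(string.punctuation) for w in text.lower().split())
    # Empty strings (standalone punctuation) are falsy and get dropped here.
    return Counter(w for w in words if w and w not in stopwords)

def write_csv(counts, handle):
    """Write word,count rows to an open file handle, most frequent first."""
    writer = csv.writer(handle)
    writer.writerow(["word", "count"])
    writer.writerows(counts.most_common())

counts = word_frequencies("'Pipe a song about a Lamb!' - a song")
buffer = io.StringIO()  # stands in for open('output_filename.csv', 'w', newline='')
write_csv(counts, buffer)
print(buffer.getvalue())
```

Counting a list of words, rather than the raw string, is what makes the result a word frequency list.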

Source

2014-03-12 12:08:30 jonrsharpe

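Can you provide a bit more than "struggling to work"? Errors (provide the full traceback)? Unexpected outputs (provide example inputs, plus expected and actual outputs)? – jonrsharpe

Sorry, I hit enter and submitted before I had the comment's formatting ready. Just working out how to format it now. –

Right, what I have is `import string` `raw = open('file.txt', 'r').read()` `output = open('Output/result.csv', 'w')` `txt = raw.lower()` `[word.strip(string.punctuation) for word in txt.split(" ")]`. Now the result just gives me a number of seemingly random letters and the frequency with which they occur in the text. For example: e - 1635, t - 766, etc. –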
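One possible explanation, assuming the frequency count was run on the string `txt` itself (the comment does not show that step): the list comprehension above is never assigned to a name, so `txt` is still one long string, and iterating over a string yields single characters. The sample text and variable names below are made up for illustration.

```python
import string
from collections import Counter

txt = "it's a song, a song"

# Counting the string directly iterates over it character by character,
# which produces letter frequencies such as e - 1635, t - 766.
char_counts = Counter(txt)

# Keeping the result of the comprehension and counting that list
# gives word frequencies instead.
words = [w.strip(string.punctuation) for w in txt.split(" ")]
word_counts = Counter(w for w in words if w)

print(char_counts.most_common(3))
print(word_counts.most_common(3))
```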
