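The Python csv module is a wonderful library, but using it for simpler tasks can be overkill. This particular case is, to me, a classic example where using the csv module may overcomplicate things.
To me,
- just iterating through the file,
- splitting each line on the comma and extracting the first part,
- then splitting that remaining part on whitespace,
- converting each word to lower case,
- stripping off all punctuation and digits,
- and comprehending the result as a set
is a linear, straightforward approach.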
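The steps above can be sketched as an explicit loop. This is only an illustration under a few assumptions: the function name `unique_words` and the constant `REMOVE_SET` are hypothetical, each line is assumed to end in `,<number>`, and the split is done once from the right so that commas inside the text survive. Unlike a bare set comprehension, this sketch also skips tokens that become empty after stripping.

```python
from string import digits, punctuation

# Characters trimmed from both ends of each word (digits and ASCII punctuation).
REMOVE_SET = digits + punctuation


def unique_words(lines):
    """Collect the distinct, normalised words from lines shaped like 'text,number'."""
    words = set()
    for line in lines:
        # Split once from the right so commas inside the text are preserved.
        text = line.rsplit(",", 1)[0]
        for word in text.split():
            # strip() only trims the ends, so an inner apostrophe is kept.
            cleaned = word.lower().strip(REMOVE_SET)
            if cleaned:  # skip tokens that were only digits or punctuation
                words.add(cleaned)
    return words
```

Because it accepts any iterable of lines, it can be called with an open file object (`with open("test.csv") as fin: unique_words(fin)`) or with a plain list of strings.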
An example run with the following file content:
Lorem Ipsum is simply dummy "text" of the ,0
printing and typesetting; industry. Lorem,1
Ipsum has been the industry's standard ,2
dummy text ever since the 1500s, when an,3
unknown printer took a galley of type and,4
scrambled it to make a type specimen ,5
book. It has survived not only five ,6
centuries, but also the leap into electronic,7
typesetting, remaining essentially unch,8
anged. It was popularised in the 1960s with ,9
the release of Letraset sheets conta,10
ining Lorem Ipsum passages, and more rec,11
ently with desktop publishing software like,12
!!Aldus PageMaker!! including versions of,13
Lorem Ipsum.,14
>>> from string import digits, punctuation
>>> remove_set = digits + punctuation
>>> with open("test.csv") as fin:
...     words = {word.lower().strip(remove_set) for line in fin
...              for word in line.rsplit(",", 1)[0].split()}
...
>>> words
set(['and', 'pagemaker', 'passages', 'sheets', 'galley', 'text', 'is', 'in', 'it', 'anged', 'an', 'simply', 'type', 'electronic', 'was', 'publishing', 'also', 'unknown', 'make', 'since', 'when', 'scrambled', 'been', 'desktop', 'to', 'only', 'book', 'typesetting', 'rec', "industry's", 'has', 'ever', 'into', 'more', 'printer', 'centuries', 'dummy', 'with', 'specimen', 'took', 'but', 'standard', 'five', 'survived', 'leap', 'not', 'lorem', 'a', 'ipsum', 'essentially', 'unch', 'conta', 'like', 'ining', 'versions', 'of', 'industry', 'ently', 'remaining', 's', 'printing', 'letraset', 'popularised', 'release', 'including', 'the', 'aldus', 'software'])
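Thanks for the rundown. Could you show exactly how I would use .split()? I'm not sure how. – Julia
Expanded my example; hopefully that resolves most of the questions. –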
Thanks! Is there a way to get rid of the " character without regular expressions? – Julia