Most efficient way to perform multiple list comprehensions in Python

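Given these three list comprehensions, is there a more efficient way to do this than three separate comprehensions? I believe a for loop may be bad form in this case, but if I am going to iterate over a large number of rows in rowsaslist, I feel that what I have below is not very efficient.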

cachedStopWords = stopwords.words('english') 

rowsaslist = [x.lower() for x in rowsaslist] 
rowsaslist = [''.join(c for c in s if c not in string.punctuation) for s in rowsaslist] 
rowsaslist = [' '.join([word for word in p.split() if word not in cachedStopWords]) for p in rowsaslist]

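Would it be more efficient to combine these into a single comprehension statement? I know that from a readability standpoint it would probably be a mess of code.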


2017-07-29 Sean

You could use map() and filter() instead, but the efficiency would be the same

Thanks, everyone, for your input on this. I'll play around with these suggestions! – Sean

Instead of iterating over the same list 3 times, you can simply define 2 functions and use them in a single list comprehension:

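import string
from nltk.corpus import stopwords

cachedStopWords = stopwords.words('english')


def remove_punctuation(text):
    # Lowercase the text, then drop every punctuation character.
    return ''.join(c for c in text.lower() if c not in string.punctuation)

def remove_stop_words(text):
    # Split on whitespace and keep only the words that are not stop words.
    return ' '.join([word for word in text.split() if word not in cachedStopWords])

rowsaslist = [remove_stop_words(remove_punctuation(text)) for text in rowsaslist]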

I have never used stopwords. If it returns a list, it is better to convert it to a set first, to speed up the word not in cachedStopWords test.
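The set conversion suggested above can be sketched roughly as follows. To keep the example self-contained, stop_word_list is a small hypothetical stand-in for whatever stopwords.words('english') returns, and the names cached_stop_words and rows are made up for illustration.

```python
# Hypothetical stand-in for stopwords.words('english'), which returns a list.
stop_word_list = ['this', 'is', 'a', 'that', 'to', 'the']

# Convert once, up front: a membership test on a set is O(1) on average,
# while `in` on a list scans the elements one by one.
cached_stop_words = set(stop_word_list)

def remove_stop_words(text):
    # Keep only the whitespace-separated words that are not stop words.
    return ' '.join(word for word in text.split() if word not in cached_stop_words)

rows = ['this is a test', 'go to the store']
cleaned = [remove_stop_words(row) for row in rows]
print(cleaned)
```

The conversion happens once, outside the comprehension, so every row benefits from the faster lookup.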

Finally, the NLTK package might help you process the text. See @alvas' answer.


2017-07-29 16:37:53

I think there is a better way to handle this than running nested loops to remove punctuation and stop words. – alvas

@alvas: You're right. I've added a link to your answer.

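The way you currently have it, each list is created in full before the next one is built. You can avoid this by switching from list comprehensions to generator expressions (note the use of () instead of []):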

rowsaslist = (x.lower() for x in rowsaslist)
rowsaslist = (''.join(c for c in s if c not in string.punctuation) for s in rowsaslist)
rowsaslist = (' '.join([word for word in p.split() if word not in cachedStopWords]) for p in rowsaslist)

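Instead of creating lists, this creates 3 generators. Each generator produces a value only when it is needed, rather than eagerly building each whole list at once.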
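A minimal sketch of how the chained generators behave when consumed. The stop_words set is a small hypothetical stand-in for the NLTK list, and the sample rows are invented for illustration.

```python
import string

# Hypothetical stand-in for the NLTK stop word list.
stop_words = {'this', 'is', 'a', 'the'}

rows = ['This is a Test!', 'The END, finally.']

# Three chained generator expressions; none of them does any work yet.
lowered = (row.lower() for row in rows)
no_punct = (''.join(c for c in s if c not in string.punctuation) for s in lowered)
no_stop = (' '.join(w for w in p.split() if w not in stop_words) for p in no_punct)

# Each next() pulls a single row through all three stages.
first = next(no_stop)

# list() drains whatever is left; after this the generators are exhausted.
rest = list(no_stop)
print(first, rest)
```

Keep in mind that a generator can be iterated only once. If the cleaned rows are needed again later, materialize them with list() at the end of the pipeline.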


2017-07-29 16:41:44 Carcigenicate

I would favor a functional approach here*

rowsaslist = list(map(
    lambda row: ' '.join(filter(lambda word: word not in cachedStopWords,
                                ''.join(filter(lambda c: c not in string.punctuation,
                                               row.lower())).split())),
    rowsaslist))

This is ugly as sin, but there is really no way to make it not ugly. Comments are very useful for these large all-in-one processing jobs.

# removes punctuation, filters out stop words, and lowercases

That explains everything nicely.

*Admittedly, that may be because I have been playing with Haskell more and more!


2017-07-29 16:44:51

Using functions instead of lambda expressions helps readability. Then no comment is needed.
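As a rough sketch of what that comment suggests, the same map/filter pipeline can be written with named functions. All names here (STOP_WORDS, is_not_punctuation, is_not_stop_word, clean_row) are hypothetical, and STOP_WORDS is a small stand-in for the NLTK list.

```python
import string

# Hypothetical stand-in for the NLTK stop word list.
STOP_WORDS = {'this', 'is', 'a', 'the'}

def is_not_punctuation(char):
    return char not in string.punctuation

def is_not_stop_word(word):
    return word not in STOP_WORDS

def clean_row(row):
    """Lowercase a row, strip punctuation, then drop stop words."""
    without_punctuation = ''.join(filter(is_not_punctuation, row.lower()))
    return ' '.join(filter(is_not_stop_word, without_punctuation.split()))

rows = ['This is a Test!', 'The END, finally.']
cleaned = list(map(clean_row, rows))
print(cleaned)
```

The function names carry the explanation that the one-line version needed a comment for.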

Depending on whether you need the resulting list to keep the same order as the input, there are at least two ways to solve this.

First, it seems you have two blacklists of things to remove:

Punctuation
Stop words

You want to remove punctuation by looping over characters, whereas you want to remove stop words by looping over tokens.

The assumption is that the input is an untokenized, human-readable string.

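Why can't punctuation marks be tokens too? That way, you can remove both punctuation and stop words in a single loop over the tokens, i.e.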

>>> from nltk import word_tokenize 
>>> from nltk.corpus import stopwords 
>>> from string import punctuation 
>>> blacklist = set(punctuation).union(set(stopwords.words('english'))) 
>>> blacklist 
set([u'all', u'just', u'being', u'when', u'over', u'through', u'during', u'its', u'before', '$', u'hadn', '(', u'll', u'had', ',', u'should', u'to', u'only', u'does', u'under', u'ours', u'has', '<', '@', u'them', u'his', u'very', u'they', u'not', u'yourselves', u'now', '\\', u'nor', '`', u'd', u'did', u'shan', u'didn', u'these', u'she', u'each', u'where', '|', u'because', u'doing', u'there', u'theirs', u'some', u'we', u'him', u'up', u'are', u'further', u'ourselves', u'out', '#', "'", '+', u'weren', '/', u're', u'won', u'above', u'between', ';', '?', u't', u'be', u'hasn', u'after', u'here', u'shouldn', u'hers', '[', u'by', '_', u'both', u'about', u'couldn', u'of', u'o', u's', u'isn', '{', u'or', u'own', u'into', u'yourself', u'down', u'mightn', u'wasn', u'your', u'he', '"', u'from', u'her', '&', u'aren', '*', u'been', '.', u'few', u'too', u'wouldn', u'then', u'themselves', ':', u'was', u'until', '>', u'himself', u'on', u'with', u'but', u'mustn', u'off', u'herself', u'than', u'those', '^', u'me', u'myself', u'ma', u'this', u'whom', u'will', u'while', u'ain', u'below', u'can', u'were', u'more', u'my', '~', u'and', u've', u'do', u'is', u'in', u'am', u'it', u'doesn', u'an', u'as', u'itself', u'against', u'have', u'our', u'their', u'if', '!', u'again', '%', u'no', ')', u'that', '-', u'same', u'any', u'how', u'other', u'which', u'you', '=', u'needn', u'y', u'haven', u'who', u'what', u'most', u'such', ']', u'why', u'a', u'don', u'for', u'i', u'm', u'having', u'so', u'at', u'the', '}', u'yours', u'once']) 
>>> sent = "This is a humanly readable string, that Tina Guo doesn't want to play" 
>>> [word for word in word_tokenize(sent) if word not in blacklist] 
['This', 'humanly', 'readable', 'string', 'Tina', 'Guo', "n't", 'want', 'play']

If the order of the words does not need to match the input, using set().difference can speed up your code:

>>> set(word_tokenize(sent)).difference(blacklist) 
set(['humanly', 'play', 'string', 'This', 'readable', 'Guo', 'Tina', "n't", 'want'])
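The two variants above can be compared side by side without NLTK installed. This is only a sketch under stated assumptions: the regular expression is a crude hypothetical substitute for word_tokenize, and stop_words is a tiny stand-in for stopwords.words('english').

```python
import re
from string import punctuation

# Hypothetical stand-in for stopwords.words('english').
stop_words = {'is', 'a', 'that', 'to'}

# One blacklist holding both punctuation characters and stop words.
blacklist = set(punctuation) | stop_words

sent = "This is a readable string, that Tina wants to play"

# Crude tokenizer (an assumption, not word_tokenize): runs of word
# characters, or single punctuation marks, each become a token.
tokens = re.findall(r"\w+|[^\w\s]", sent)

# Order-preserving: a single pass over the tokens.
ordered = [tok for tok in tokens if tok not in blacklist]

# Order-free: set difference; note that this also drops duplicate tokens.
unordered = set(tokens).difference(blacklist)
print(ordered)
```

The set version is only appropriate when neither word order nor repeated words matter downstream.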

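Alternatively, if you do not want to tokenize the string, you can use str.translate to remove punctuation, which will definitely be more efficient than looping over characters: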

>>> sent 
"This is a humanly readable string, that Tina Guo doesn't want to play" 
>>> sent.translate(None, punctuation) 
'This is a humanly readable string that Tina Guo doesnt want to play'
>>> stoplist = stopwords.words('english') 
>>> [word for word in sent.translate(None, punctuation).split() if word not in stoplist] 
['This', 'humanly', 'readable', 'string', 'Tina', 'Guo', 'doesnt', 'want', 'play']
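The session above uses the Python 2 form of str.translate. In Python 3 the method takes a single translation table, so a roughly equivalent sketch builds the table with str.maketrans. The stoplist here is a small hypothetical stand-in for stopwords.words('english').

```python
from string import punctuation

sent = "This is a humanly readable string, that Tina Guo doesn't want to play"

# The third argument of str.maketrans lists the characters to delete.
table = str.maketrans('', '', punctuation)
stripped = sent.translate(table)

# Hypothetical stand-in for stopwords.words('english'), as a set for fast lookup.
stoplist = {'is', 'a', 'that', 'to'}
words = [word for word in stripped.split() if word not in stoplist]
print(stripped)
print(words)
```

Building the table once and reusing it for every row avoids recomputing it inside the loop.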


2017-07-30 02:12:01 alvas
