Improving the performance of text cleaning on a dataframe

I have a df:

id text 
1  This is a good sentence 
2  This is a sentence with a number: 2015 
3  This is a third sentence

I have a text cleaning function:

import re
from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

def clean(text):
    lettersOnly = re.sub('[^a-zA-Z]', ' ', text)
    tokens = word_tokenize(lettersOnly.lower())
    stops = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stops]
    tokensPOS = pos_tag(tokens)
    tokensLemmatized = []
    for w in tokensPOS:
        tokensLemmatized.append(WordNetLemmatizer().lemmatize(w[0], get_wordnet_pos(w[1])))
    clean = " ".join(tokensLemmatized)
    return clean

get_wordnet_pos() looks like this:

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

I am applying clean() to a pandas column and creating a new column with the results:

df['cleanText'] = df['text'].apply(clean)

The resulting df:

id cleanText 
1  good sentence 
2  sentence number 
3  third sentence

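The run time grows quickly as rows are added. For example, using %%timeit, applying it to five rows takes 17 ms per loop, 300 rows take 800 ms per loop, and 500 rows take 1.26 s per loop.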

I changed it by instantiating stops and WordNetLemmatizer() outside the function, since these only need to be created once.

stops = set(stopwords.words('english'))
lem = WordNetLemmatizer()
def clean(text):
    lettersOnly = re.sub('[^a-zA-Z]', ' ', text)
    tokens = word_tokenize(lettersOnly.lower())
    tokens = [w for w in tokens if w not in stops]
    tokensPOS = pos_tag(tokens)
    tokensLemmatized = []
    for w in tokensPOS:
        tokensLemmatized.append(lem.lemmatize(w[0], get_wordnet_pos(w[1])))
    clean = " ".join(tokensLemmatized)
    return clean

Running %prun -l 10 on the apply line produces this table:

  672542 function calls (672538 primitive calls) in 2.798 seconds

  Ordered by: internal time
  List reduced from 211 to 10 due to restriction <10>

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    4097    0.727    0.000    0.942    0.000 perceptron.py:48(predict)
    4500    0.584    0.000    0.584    0.000 {built-in method nt.stat}
    3500    0.243    0.000    0.243    0.000 {built-in method nt._isdir}
   14971    0.157    0.000    0.178    0.000 {method 'sub' of '_sre.SRE_Pattern' objects}
   57358    0.129    0.000    0.155    0.000 perceptron.py:250(add)
    4105    0.117    0.000    0.201    0.000 {built-in method builtins.max}
  184365    0.084    0.000    0.084    0.000 perceptron.py:58(<lambda>)
    4097    0.057    0.000    0.213    0.000 perceptron.py:245(_get_features)
     500    0.038    0.000    1.220    0.002 perceptron.py:143(tag)
    2000    0.034    0.000    0.068    0.000 ntpath.py:471(normpath)

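It looks like the perceptron tagger is, predictably, taking up a lot of resources, but I'm not sure how to streamline it. Also, I'm not sure where nt.stat or nt._isdir is being called.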

How can I change the function or the apply method to improve performance? Is this function a candidate for Cython or Numba?

2017-08-28 Cameron Taylor

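Can't say without your data and expected output.

Added sample input data and the result of the cleaning function. I'm getting the correct output; the question is more about how to get that output faster.

Interesting. Does the order of the words matter? I'm guessing yes?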

The first obvious point of improvement I see here is that the entire get_wordnet_pos function should be reduced to a dictionary lookup:

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

Instead, initialise a defaultdict from the collections package:

import collections
from nltk.corpus import wordnet

# Any first letter not listed below falls back to NOUN.
get_wordnet_pos = collections.defaultdict(lambda: wordnet.NOUN)
get_wordnet_pos.update({'J': wordnet.ADJ,
                        'V': wordnet.VERB,
                        'N': wordnet.NOUN,
                        'R': wordnet.ADV})

You would then access the lookup like this:

get_wordnet_pos[w[1][0]]
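
Treebank tags such as 'NNS' or 'VBD' are longer than one letter, so the lookup only matches the original startswith logic when it is keyed on the first character of the tag, which is what w[1][0] does. Here is a minimal sketch of that idea. To keep it runnable without NLTK, the single-letter constants stand in for wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV, and the helper name lookup_pos is made up for illustration.

```python
import collections

# Stand-ins for wordnet.ADJ / VERB / NOUN / ADV (assumed here so NLTK is not required).
ADJ, VERB, NOUN, ADV = "a", "v", "n", "r"

# Keys are the first letter of a Treebank tag; anything else defaults to NOUN.
pos_by_first_letter = collections.defaultdict(lambda: NOUN)
pos_by_first_letter.update({"J": ADJ, "V": VERB, "N": NOUN, "R": ADV})

def lookup_pos(treebank_tag):
    """Hypothetical helper: map a full Treebank tag via its first character."""
    # Slicing with [:1] also copes with an empty tag, which falls back to NOUN.
    return pos_by_first_letter[treebank_tag[:1]]

tagged = [("running", "VBG"), ("quickly", "RB"), ("big", "JJ"),
          ("dogs", "NNS"), ("the", "DT")]
print([lookup_pos(tag) for _, tag in tagged])
```

One thing to be aware of: a defaultdict stores every missing key it is asked for, so the table grows by one entry per unseen first letter. That is harmless here, but a plain dict with .get(key, NOUN) would avoid it.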

Next, consider pre-compiling your regex pattern if you use it in multiple places. The speedup isn't large, but it all counts.

pattern = re.compile('[^a-zA-Z]')

Inside the function, you would call:

pattern.sub(' ', text)

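OTOH, if you know where your text comes from and have some idea of what you may and may not see, you can build a table of characters up front and use str.translate instead, which is much, much faster than a clunky regex-based substitution: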

tab = str.maketrans(dict.fromkeys("!@#$%^&*()_+-={}[]|\'\":;,<.>/?\\~`", '')) # pre-compiled, build-once substitution table (keep this outside the function)

text = 'hello., hi! lol, what\'s up' 
new_text = text.translate(tab) # this would run inside your function 

print(new_text) 

'hello hi lol whats up'
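
Note that the table above deletes characters, whereas the re.sub call in the question replaces them with a space and also strips digits. If you want to keep that behaviour, one option is to map the unwanted characters to spaces. The sketch below assumes plain ASCII input (string.punctuation plus string.digits); anything outside that set, such as accented letters, would be treated differently from the regex.

```python
import re
import string

# Characters to blank out: ASCII punctuation plus digits (an assumption about the input).
unwanted = string.punctuation + string.digits
# Map each unwanted character to a space instead of deleting it.
tab = str.maketrans(unwanted, " " * len(unwanted))

text = "This is a sentence with a number: 2015"
via_translate = text.translate(tab)
via_regex = re.sub("[^a-zA-Z]", " ", text)

# For this ASCII sample both approaches give the same string.
print(via_translate == via_regex)
print(via_translate.split())
```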

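Furthermore, I'd say word_tokenize is overkill: you get rid of the special characters anyway, so you lose all the benefits of word_tokenize, which only really makes a difference when punctuation and the like are present. You could fall back to text.split() instead.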

Finally, skip the clean = " ".join(tokensLemmatized) step. Just return the list, and then call df.applymap(" ".join) as a final step.
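
A rough, pandas-free sketch of that last point follows: the cleaning function returns a list of tokens, and the join happens once at the end. The function name clean_tokens and the tiny stop list are placeholders for illustration, and the tagging and lemmatizing steps are left out to keep it short.

```python
def clean_tokens(text, stops):
    """Hypothetical variant of clean() that returns the kept tokens as a list."""
    return [w for w in text.lower().split() if w not in stops]

texts = ["This is a good sentence", "This is a third sentence"]
stops = {"this", "is", "a"}  # tiny stand-in for stopwords.words('english')

# First pass: one token list per row.
token_lists = [clean_tokens(t, stops) for t in texts]
# Final pass: join each list back into a string.
joined = [" ".join(tokens) for tokens in token_lists]
print(joined)
```

On a single pandas column, the equivalent final step would be something like df['cleanText'].apply(" ".join). applymap works on every cell of a DataFrame, so it only fits if all columns hold token lists.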

I'll leave the benchmarking to you.

2017-08-28 14:07:42

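Thank you very much, this is really helpful. With the defaultdict, it throws an error: "TypeError: 'collections.defaultdict' object is not callable". Other than that, your points about replace and split make a lot of sense.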

@CameronTaylor There was a small mistake. You index a dictionary like get_wordnet_pos[...], not (...). Will edit my answer.

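Another quirk may be that in the original function, the tag was matched with startswith. Is there a way to implement that with the defaultdict? As it stands, I believe it treats most things as nouns, because many tags are longer than a single letter.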
