2016-03-03 38 views

Apply a function to every word in each row of a pandas DataFrame column

I have a sample DataFrame as follows:

df = pd.DataFrame({ 
'notes': pd.Series(['speling', 'korrecter']), 
'name': pd.Series(['Walter White', 'Walter White']), 
}) 

           name                               notes
0  Walter White           This speling is incorrect
1  Walter White  Corrector should correct korrecter

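I would like to adapt the spell checker that Peter Norvig provides (linked as "here" in the original post). I then want to apply this function to each row by iterating over every word in the row. How can this be done in Python with pandas?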

I would like the output to be:

           name                               notes
0  Walter White          This spelling is incorrect
1  Walter White  Corrector should correct corrector

Any input is appreciated. Thanks!

Answers

1

You can try this solution based on str.split, but I think performance may become a problem on a large df:

import pandas as pd 
import numpy as np 

df = pd.DataFrame({ 
'notes': pd.Series(['This speling is incorrect', 'Corrector should correct korrecter one']), 
'name': pd.Series(['Walter White', 'Walter White']), 
}) 
print(df)
           name                                   notes
0  Walter White               This speling is incorrect
1  Walter White  Corrector should correct korrecter one

# stand-in that simulates the real correct() function
def correct(x): 
    return x + '888' 

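# split column notes into one word per column and apply correct
# note: apply() passes a whole column at a time, which works for this stand-in;
# a real word-level function needs an element-wise call such as applymap()
# (DataFrame.map() in pandas 2.1+) that also skips the missing values
df1 = df.notes.str.split(expand=True).apply(correct)
print(df1)
              0           1           2             3       4
0       This888  speling888       is888  incorrect888     NaN
1  Corrector888   should888  correct888  korrecter888  one888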

# replace NaN with '' and concatenate all words back together;
# str.strip() drops the trailing spaces left by the empty cells
df['notes'] = df1.fillna('').apply(lambda row: ' '.join(row), axis=1).str.strip()
print(df)
      name            notes 
0 Walter White    This888 speling888 is888 incorrect888 
1 Walter White Corrector888 should888 correct888 korrecter888... 
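
If the wide intermediate frame produced by str.split(expand=True) turns out to be the bottleneck, one alternative is to split, correct, and rejoin each note inside a single Series.apply call. This is only a minimal sketch: it assumes the same placeholder correct as above, and the helper name correct_sentence is made up for illustration.

```python
import pandas as pd

def correct(word):
    # Placeholder for a real word-level spelling corrector.
    return word + '888'

def correct_sentence(sentence):
    # Hypothetical helper: split on whitespace, correct each word, rejoin.
    return ' '.join(correct(word) for word in sentence.split())

df = pd.DataFrame({
    'name': ['Walter White', 'Walter White'],
    'notes': ['This speling is incorrect',
              'Corrector should correct korrecter one'],
})

# One Python-level call per row; no NaN padding and no rejoin step needed.
df['notes'] = df['notes'].apply(correct_sentence)
print(df['notes'].tolist())
```

Because every row is handled on its own, rows with different word counts never need padding, so the fillna step disappears. Whether this is actually faster depends on the data, so it is worth timing both variants.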
0

I used the code from the link you posted to get this working. Take it as inspiration.

import re, collections 
import pandas as pd 

# This code comes from the link you have posted 
def words(text): return re.findall('[a-z]+', text.lower()) 

def train(features): 
    model = collections.defaultdict(lambda: 1) 
    for f in features: 
        model[f] += 1
    return model 

def edits1(word): 
    splits  = [(word[:i], word[i:]) for i in range(len(word) + 1)] 
    deletes = [a + b[1:] for a, b in splits if b] 
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1] 
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b] 
    inserts = [a + c + b  for a, b in splits for c in alphabet] 
    return set(deletes + transposes + replaces + inserts) 

def known_edits2(word): 
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS) 

def known(words): return set(w for w in words if w in NWORDS) 

def correct(word): 
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word] 
    return max(candidates, key=NWORDS.get) 

# file() exists only in Python 2; open() works in both Python 2 and 3
with open('big.txt') as f:
    NWORDS = train(words(f.read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz' 

# This is your code 
df = pd.DataFrame({ 
'notes': pd.Series(['speling', 'korrecter']), 
'name': pd.Series(['Walter White', 'Walter White']), 
}) 

# Spellchecking can be optimized, of course, and should not be hardcoded
for i, row in df.iterrows():
    # DataFrame.set_value was removed in pandas 1.0; .at sets a single cell
    df.at[i, 'notes'] = correct(row['notes'])
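
The loop above corrects notes that hold a single word. To handle whole sentences, as in the question, each note can be split into words and every word passed through correct. The sketch below is an assumption-laden illustration, not the original answer's method: it swaps in a tiny dictionary-backed correct so that it runs without big.txt, and the names TOY_CORRECTIONS, correct_word, and correct_note are hypothetical. It lowercases each word before lookup (the model is built from lowercased text), restores a leading capital afterwards, and caches results so a repeated word is corrected only once.

```python
from functools import lru_cache

import pandas as pd

# Hypothetical stand-in for the Norvig-style correct() defined above,
# so the example runs without big.txt.
TOY_CORRECTIONS = {'speling': 'spelling', 'korrecter': 'corrector'}

def correct(word):
    return TOY_CORRECTIONS.get(word, word)

@lru_cache(maxsize=None)
def correct_word(word):
    # Look the word up in lowercase, then restore a leading capital.
    fixed = correct(word.lower())
    if word[:1].isupper():
        fixed = fixed[:1].upper() + fixed[1:]
    return fixed

def correct_note(note):
    # Whitespace split only; punctuation attached to a word is not handled.
    return ' '.join(correct_word(word) for word in note.split())

df = pd.DataFrame({
    'name': ['Walter White', 'Walter White'],
    'notes': ['This speling is incorrect',
              'Corrector should correct korrecter'],
})

df['notes'] = df['notes'].apply(correct_note)
print(df['notes'].tolist())
```

With the real correct in place of the toy one, the cache matters more, since each uncached lookup may generate and test a large set of candidate edits.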