2017-04-25 199 views
1

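I have a CSV file with 200,000 rows. I loaded it into a DataFrame and want to anonymize it with Faker using the script below. Why does this script take so long to run?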

for i in range(MasterDE1.FirstName.size): 
    MasterDE1.loc[(MasterDE1["Gender__pc"] == 'Female'), ['FirstName','LastName']] = fake.first_name_female(),fake.last_name_female() 
    MasterDE1.loc[(MasterDE1["Gender__pc"] == 'Male'), ['FirstName','LastName']] = fake.first_name_male(),fake.last_name_male() 
    MasterDE1.loc[(MasterDE1["Gender__pc"] == 'Unknown'), ['FirstName','LastName']] = fake.first_name(),fake.last_name() 
    MasterDE1['Name'] = MasterDE1['FirstName'] + ' ' + MasterDE1['LastName'] 
    MasterDE1['EmailAddress'] = 'smithandthunder' + str(i+1) + '@gmail.com' 

It has been running for the past 20 minutes or so (I don't think the kernel is dead).

+0

I don't mind the downvote on the question, but I would appreciate a comment so I can improve future questions. –

Answers

1

Instead of updating the DataFrame on every iteration, you can generate the names first and then assign them:

df = pd.DataFrame({'Gender': np.random.choice(['Female', 'Male', 'Unknown'], p=[0.45, 0.45, 0.1], size=2*10**5), 
        'First Name': np.nan, 'Last Name': np.nan}) 


df.head() 
Out: 
   First Name  Gender  Last Name
0         NaN  Female        NaN
1         NaN    Male        NaN
2         NaN  Female        NaN
3         NaN    Male        NaN
4         NaN    Male        NaN

df.shape 
Out: (200000, 3) 

Now the following should finish within a few minutes:

# Build each boolean mask once and reuse it for both the row count and the assignment.
female = df['Gender'] == 'Female'
male = df['Gender'] == 'Male'
unknown = df['Gender'] == 'Unknown'

# One pair of generator calls per matching row, then a single assignment per gender.
df.loc[female, ['First Name', 'Last Name']] = [(fake.first_name_female(), fake.last_name_female()) for _ in range(female.sum())]
df.loc[male, ['First Name', 'Last Name']] = [(fake.first_name_male(), fake.last_name_male()) for _ in range(male.sum())]
df.loc[unknown, ['First Name', 'Last Name']] = [(fake.first_name(), fake.last_name()) for _ in range(unknown.sum())]

df.head() 
Out: 
   First Name   Gender  Last Name
0        Ruth   Female      Moore
1   Christina   Female      Jones
2     Lindsey   Female      Davis
3       Aaron  Unknown    Watkins
4      Joshua     Male      Henry

After that, something like df['Name'] = df['First Name'] + ' ' + df['Last Name'] should be fast.
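For reference, the whole "generate first, assign once" idea can be sketched end to end as below. This is only an illustration under stated assumptions: the `make_generator` helper and the `GENERATORS` table are hypothetical stand-ins for the Faker calls (so the sketch runs without Faker installed), and it looks up a generator pair per row instead of using the boolean masks from the answer. The column names follow the question.

```python
import random

import pandas as pd


def make_generator(prefix):
    # Hypothetical stand-in for a Faker method such as fake.first_name_female():
    # it returns a fresh string on every call.
    return lambda: f"{prefix}{random.randint(0, 99999)}"


# Assumed mapping from gender label to (first-name, last-name) generators.
GENERATORS = {
    "Female": (make_generator("FirstF"), make_generator("LastF")),
    "Male": (make_generator("FirstM"), make_generator("LastM")),
    "Unknown": (make_generator("FirstU"), make_generator("LastU")),
}


def anonymize(df, gender_col="Gender__pc"):
    """Return a copy of df with generated names and numbered e-mail addresses."""
    out = df.copy()
    fallback = GENERATORS["Unknown"]
    # Pick the generator pair for every row, then call each generator once per row.
    pairs = [GENERATORS.get(gender, fallback) for gender in out[gender_col]]
    out["FirstName"] = [first() for first, _ in pairs]
    out["LastName"] = [last() for _, last in pairs]
    # Whole-column string operations; no Python-level loop over the DataFrame.
    out["Name"] = out["FirstName"] + " " + out["LastName"]
    numbers = pd.Series(range(1, len(out) + 1), index=out.index).astype(str)
    out["EmailAddress"] = "smithandthunder" + numbers + "@gmail.com"
    return out


sample = pd.DataFrame({"Gender__pc": ["Female", "Male", "Unknown", "Male"]})
result = anonymize(sample)
print(result)
```

Each column is assigned exactly once, so the cost grows with the number of rows rather than with the number of rows times the size of the table.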

+0

Worked like a charm! Thanks a lot! –

+0

@Data_Kid You're welcome. :) – ayhan

1

You can omit the loop:

MasterDE1 = pd.DataFrame({'Gender__pc':['Female','Male','Unknown'], 
         'FirstName':['s','d','f'], 
         'LastName': ['d','f','r']}) 
MasterDE1 = pd.concat([MasterDE1]*3).reset_index(drop=True) 
print (MasterDE1) 
   FirstName  Gender__pc  LastName
0          s      Female         d
1          d        Male         f
2          f     Unknown         r
3          s      Female         d
4          d        Male         f
5          f     Unknown         r
6          s      Female         d
7          d        Male         f
8          f     Unknown         r

def f1(): 
    return 'first_name_female' + str(np.random.randint(100)) 
def f2(): 
    return 'last_name_female' + str(np.random.randint(100)) 

maskfem = (MasterDE1["Gender__pc"] == 'Female') 
a = pd.Series(((np.arange(len(MasterDE1.index))) + 1).astype(str)) 

MasterDE1.loc[maskfem, 'FirstName'] = [f1() for x in np.arange(maskfem.sum())] 
MasterDE1.loc[maskfem, 'LastName'] = [f2() for x in np.arange(maskfem.sum())] 

MasterDE1['Name'] = MasterDE1['FirstName'] + ' ' + MasterDE1['LastName'] 
MasterDE1['EmailAddress'] = 'smithandthunder' + a + '@gmail.com' 
print (MasterDE1) 
             FirstName  Gender__pc            LastName  \
0  first_name_female70      Female  last_name_female64
1                    d        Male                   f
2                    f     Unknown                   r
3   first_name_female6      Female  last_name_female67
4                    d        Male                   f
5                    f     Unknown                   r
6  first_name_female59      Female  last_name_female99
7                    d        Male                   f
8                    f     Unknown                   r

                                     Name                EmailAddress
0  first_name_female70 last_name_female64  smithandthunder1@gmail.com
1                                     d f  smithandthunder2@gmail.com
2                                     f r  smithandthunder3@gmail.com
3   first_name_female6 last_name_female67  smithandthunder4@gmail.com
4                                     d f  smithandthunder5@gmail.com
5                                     f r  smithandthunder6@gmail.com
6  first_name_female59 last_name_female99  smithandthunder7@gmail.com
7                                     d f  smithandthunder8@gmail.com
8                                     f r  smithandthunder9@gmail.com
+0

Thanks. When I try this I get this error: TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('
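The error message in this comment is cut off, so its actual cause is not known. One common way to hit an error of this family, shown here purely as an assumption, is adding a Python string to a numeric NumPy array; converting the numbers to strings in a pandas Series first, as the answer does with `a`, avoids it. The `row_numbers` array is a made-up example.

```python
import numpy as np
import pandas as pd

row_numbers = np.arange(1, 4)  # made-up example data: 1, 2, 3

# Adding a str to an integer array has no matching 'add' loop in NumPy,
# so this is expected to raise a TypeError.
try:
    emails = "smithandthunder" + row_numbers + "@gmail.com"
    failed = False
except TypeError as exc:
    failed = True
    print(type(exc).__name__, exc)

# Converting to strings inside a pandas Series makes "+" an element-wise
# string concatenation.
as_text = pd.Series(row_numbers).astype(str)
emails = "smithandthunder" + as_text + "@gmail.com"
print(emails.tolist())
```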

+0

I believe fake.first_name_female() (and the others) generates a new name on every call, so a loop or apply is necessary. – ayhan

+0

Yes, @ayhan. I tried it this way and it gave the same name for the whole table. I want all the names to be different. –
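The behaviour described in these two comments can be reproduced with a small sketch. The `fake_name` function below is a hypothetical stand-in for a Faker call; it is only assumed to share the property that matters here, namely returning a new value on each call.

```python
import itertools

import pandas as pd

counter = itertools.count(1)


def fake_name():
    # Hypothetical stand-in for fake.first_name(): a new value on every call.
    return f"name{next(counter)}"


df = pd.DataFrame({"Gender": ["Female", "Female", "Female"]})

# The generator runs once; pandas broadcasts that single string to every row.
df["Broadcast"] = fake_name()

# The generator runs once per row, so every row gets its own value.
df["PerRow"] = [fake_name() for _ in range(len(df))]

print(df)
```

The first assignment fills the column with one repeated name, which matches the "same name for the whole table" result; the list comprehension is what produces distinct names.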

0

I can't tell you exactly why it takes this long, but it may be because of the size of the file.

However, you can find a way to monitor the loop so you know whether it is still working:

signal = 0

for i in range(0, 200000):
    ...  # something going on in the loop
    # signal the loop
    signal += 1
    if signal in (50000, 100000, 150000):
        print("It's still going!")
    elif signal > 200000:
        # Safety guard: unreachable with range(0, 200000), but useful if the loop bound changes.
        print("It's over 200000 already!")
        break  # or you can raise an error instead of break (raise RuntimeError)
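A slightly more general version of the same monitoring idea might look like the sketch below. The `run_with_progress` function and its `report_every` parameter are illustrative names, not part of the answer, and the no-op `lambda` stands in for the real per-row work.

```python
import time


def run_with_progress(items, work, report_every=50_000):
    """Apply work to each item, printing a progress line every report_every items."""
    start = time.perf_counter()
    done = 0
    for item in items:
        work(item)
        done += 1
        if done % report_every == 0:
            elapsed = time.perf_counter() - start
            print(f"{done} items processed after {elapsed:.1f}s")
    return done


# Usage: a no-op stands in for whatever happens inside the real loop.
processed = run_with_progress(range(200_000), lambda i: None)
```

Printing the elapsed time alongside the count also gives a rough estimate of how long the remaining iterations will take.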
+0

Thanks for your support. This will be useful in the future. –
