2016-12-15

Pandas DataFrame: create a new column when a value from one column contains a substring

member_id,event_type,event_path,event_time,event_date,event_duration 
20077,2016-11-20,"2016-11-20 09:17:07",url,e.mail.ru/message/14794236680000000730/,0 
20077,2016-11-20,"2016-11-20 09:17:07",url,e.mail.ru/message/14794236680000000730/,2 
20077,2016-11-20,"2016-11-20 09:17:09",url,avito.ru/profile/messenger/channel/u2i-558928587-101700461?utm_source=avito_mail&utm_medium=email&utm_campaign=messenger_single&utm_content=test,1 
20077,2016-11-20,"2016-11-20 09:17:37",url,avito.ru/auto/messenger/channel/u2i-558928587-101700461?utm_source=avito_mail&utm_medium=email&utm_campaign=messenger_single&utm_content=test,135 
20077,2016-11-20,"2016-11-20 09:19:53",url,e.mail.ru/message/14794236680000000730/,0 
20077,2016-11-20,"2016-11-20 09:19:53",url,e.mail.ru/message/14794236680000000730/,37 

and there is another DataFrame, df2:

domain category subcategory unique id count_sec Main category Subcategory 
avito.ru/auto Автомобили Авто 1600 83112396 Auto Avito 
youtube.com Видеопортал Видеохостинг 1317 42710996 Video Youtube 
ok.ru Развлечения  Социальные сети 694 13394605 Social network OK 
kinogo.club Развлечения  Кино 497 8438800 Video Illegal 
e.mail.ru Почтовый сервис None 1124 8428984 Mail.ru Email 
vk.com/audio Видеопортал Видеохостинг 1020 7409440 Music VK 

I usually use:

df['category'] = df.event_date.map(df2.set_index('domain')['Main category'])

But that compares the values for exact equality: when a value matches a key, the mapped value is taken and placed in the new column. How can I do the same thing, but matching a substring within the string?
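To see why the exact `.map` approach misses these rows, here is a minimal sketch (with made-up values): `Series.map` looks up each whole string as a key, so a full URL never hits a bare domain key.

```python
import pandas as pd

# Made-up values for illustration: map() matches keys exactly,
# so the full URL finds no entry while the bare domain does.
urls = pd.Series(['e.mail.ru/message/1', 'e.mail.ru'])
lookup = pd.Series({'e.mail.ru': 'Mail.ru'})
categories = urls.map(lookup)  # first entry is NaN, second is 'Mail.ru'
```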


Possible duplicate of [Is it possible to do fuzzy match merge with python pandas?](http://stackoverflow.com/questions/13636848/is-it-possible-to-do-fuzzy-match-merge-python-pandas) – maxymoo

Answers


I am not really sure what exactly you are doing, but my suggestion would be something like this:

mapping = dict(df2.set_index('domain')['Main category'])

def map_to_substring(x):
    # Return the category of the first domain that appears in the URL.
    for key in mapping:
        if key in x:  # the domain must be a substring of the URL, not vice versa
            return mapping[key]
    return ''

df['category'] = df.event_date.apply(map_to_substring)

Test that on the first rows of the df, since it might take a while depending on how much data you have.


Is there any way to make it faster? There are 500,000 strings in event_date –
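One way to speed this up (a sketch of my own, not from the thread): build a single regex alternation from the known domains and let pandas' `str.extract` scan each URL once in vectorized code, instead of looping over the dictionary in Python for every row. The sample frame and mapping below are made up to mirror the question's data.

```python
import re
import pandas as pd

# Hypothetical sample mirroring the question's URL column.
df = pd.DataFrame({'event_date': [
    'e.mail.ru/message/14794236680000000730/',
    'avito.ru/auto/messenger/channel/u2i-558928587',
    'unknown.site/path',
]})
mapping = {'avito.ru/auto': 'Auto', 'e.mail.ru': 'Mail.ru'}

# One alternation of all (escaped) domains, longest first so that
# 'avito.ru/auto' wins over any shorter overlapping key.
pattern = '(' + '|'.join(
    re.escape(d) for d in sorted(mapping, key=len, reverse=True)
) + ')'
df['category'] = (df['event_date']
                  .str.extract(pattern, expand=False)
                  .map(mapping)
                  .fillna(''))
```

Rows whose URL contains none of the domains end up with an empty string, matching the behavior of `map_to_substring` above.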


Without some heuristic for discovering the fuzzy matches on which to join, you will not have a scalable solution, because you would need to do O(N²) comparisons.

For your specific use case, I would suggest extracting the part of the URL that you actually want to compare on. Perhaps something like:

from urlparse import urlparse  # Python 2; on Python 3: from urllib.parse import urlparse

def netloc(s):
    # Prepend a scheme so urlparse treats the string as a full URL,
    # then keep only the host part, e.g. 'e.mail.ru'.
    return urlparse('http://' + s).netloc

df['netloc'] = df['event_date'].apply(netloc)
df2['netloc'] = df2['domain'].apply(netloc)

df.merge(df2, 'left', on='netloc')
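One caveat with joining on the bare netloc: df2 distinguishes path-level entries such as `avito.ru/auto` from plain hosts, and `urlparse(...).netloc` drops the path, so both avito rows would collapse into one key. A hypothetical variant (my own sketch; `match_key` and `depth` are assumptions, not part of the answer above) that keeps the first path segment:

```python
from urllib.parse import urlparse  # Python 3 location of urlparse

def match_key(s, depth=1):
    # Host plus the first `depth` path segments, so that
    # 'avito.ru/auto/messenger/...' maps to 'avito.ru/auto'.
    parts = urlparse('http://' + s)
    segments = [p for p in parts.path.split('/') if p][:depth]
    return '/'.join([parts.netloc] + segments)

key_auto = match_key('avito.ru/auto/messenger/channel/u2i-558928587')   # 'avito.ru/auto'
key_mail = match_key('e.mail.ru/message/14794236680000000730/', depth=0)  # 'e.mail.ru'
```

Since df2 mixes bare hosts with host-plus-path entries, one would still have to choose `depth` per row, for example by trying decreasing depths until a key present in df2 matches.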