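Pandas: improving an algorithm for finding substrings in a column

I have a DataFrame and I want to keep only the rows where a column contains any of several substrings.

I use: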

df_res = pd.DataFrame() 
for i in substr: 
    res = df[df['event_address'].str.contains(i)]

df looks like:

member_id,event_address,event_time,event_duration 
g1497o1ofm5a1963,fotki.yandex.ru/users/atanusha/albums,2015-05-01 00:00:05,8 
g1497o1ofm5a1963,9829192.ru/apple-iphone.html,2015-05-01 00:00:15,2 
g1497o1ofm5a1963,fotki.yandex.ru/users/atanusha/album/165150?&p=3,2015-05-01 00:00:17,2 
g1497o1ofm5a1963,fotki.yandex.ru/tags/%D0%B1%D0%BE%D1%81%D0%B8%D0%BA%D0%BE%D0%BC?text=%D0%B1%D0%BE%D1%81%D0%B8%D0%BA%D0%BE%D0%BC&search_author=utpaladev&&p=2,2015-05-01 00:01:31,10 
g1497o1ofm5a1963,3gmaster.net,2015-05-01 00:01:41,6 
g1497o1ofm5a1963,fotki.yandex.ru/search.xml?text=%D0%B1%D0%BE%D1%81%D0%B8%D0%BA%D0%BE%D0%BC&&p=2,2015-05-01 00:02:01,6 
g1497o1ofm5a1963,fotki.yandex.ru/search.xml?text=%D0%B1%D0%BE%D1%81%D0%B8%D0%BA%D0%BE%D0%BC&search_author=Sunny-Fanny&,2015-05-01 00:02:31,2 
g1497o1ofm5a1963,fotki.9829192.ru/apple-iphone.html,2015-05-01 00:03:25,6

and substr is:

123.ru/gadgets/communicators 
320-8080.ru/mobilephones 
3gmaster.net 
3-q.ru/products/smartfony/s 
9829192.ru/apple-iphone.html 
9829192.ru/index.php?cat=1 
acer.com/ac/ru/ru/content/group/smartphones 
aj.ru

This code gives me the desired result, but it takes too long. I also tried passing the column directly (substr is built as substr = urls.url.values.tolist()):

res = df[df['event_address'].str.contains(urls.url)]

but it returns:

TypeError: 'Series' objects are mutable, thus they cannot be hashed

Is there any way to make this faster, or am I doing something wrong?

2016-10-04 Petr Petrov

What type is 'substr'? Is it a list of strings? – albert

I think you need to join the substrings with | and pass the result to str.contains if you need a faster solution:

res = df[df['event_address'].str.contains('|'.join(urls.url))] 
print (res) 
      member_id      event_address   event_time \ 
1 g1497o1ofm5a1963  9829192.ru/apple-iphone.html 2015-05-01 00:00:15 
4 g1497o1ofm5a1963      3gmaster.net 2015-05-01 00:01:41 
7 g1497o1ofm5a1963 fotki.9829192.ru/apple-iphone.html 2015-05-01 00:03:25 

    event_duration 
1    2 
4    6 
7    6

Another solution, using a list comprehension:

res = df[df['event_address'].apply(lambda x: any([n in x for n in urls.url.tolist()]))] 
print (res) 
      member_id      event_address   event_time \ 
1 g1497o1ofm5a1963  9829192.ru/apple-iphone.html 2015-05-01 00:00:15 
4 g1497o1ofm5a1963      3gmaster.net 2015-05-01 00:01:41 
7 g1497o1ofm5a1963 fotki.9829192.ru/apple-iphone.html 2015-05-01 00:03:25 

    event_duration 
1    2 
4    6 
7    6
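
A possible refinement of the list comprehension approach, offered only as a sketch: the lambda above calls urls.url.tolist() once per row and builds a full list before any() sees it. Converting the column once and passing a generator expression lets any() stop at the first hit. The frames below are a subset of the question's sample data, and the names needles and contains_any are made up for illustration.

```python
import pandas as pd

# A subset of the sample rows and substrings from the question.
df = pd.DataFrame({
    "event_address": [
        "fotki.yandex.ru/users/atanusha/albums",
        "9829192.ru/apple-iphone.html",
        "fotki.yandex.ru/users/atanusha/album/165150?&p=3",
        "3gmaster.net",
        "fotki.9829192.ru/apple-iphone.html",
    ]
})
urls = pd.DataFrame({
    "url": ["3gmaster.net", "9829192.ru/apple-iphone.html", "aj.ru"]
})

# Convert the column to a plain list once, not once per row.
needles = urls["url"].tolist()


def contains_any(address):
    # The generator expression lets any() short-circuit on the first match.
    return any(n in address for n in needles)


res = df[df["event_address"].map(contains_any)]
print(res)
```

Because this uses plain substring tests (the in operator), characters such as . and ? in the URLs are matched literally.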

Timings:

#[8000 rows x 4 columns] 
df = pd.concat([df]*1000).reset_index(drop=True) 

In [68]: %timeit (df[df['event_address'].str.contains('|'.join(urls.url))]) 
100 loops, best of 3: 12 ms per loop 

In [69]: %timeit (df.ix[df.event_address.map(check_exists)]) 
10 loops, best of 3: 155 ms per loop 

In [70]: %timeit (df.ix[df.event_address.map(lambda x: any([True for i in urls.url.tolist() if i in x]))]) 
10 loops, best of 3: 163 ms per loop 

In [71]: %timeit (df[df['event_address'].apply(lambda x: any([n in x for n in urls.url.tolist()]))]) 
10 loops, best of 3: 174 ms per loop
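
To reproduce a comparison like this outside IPython, something along these lines might work. It is a rough sketch: the data is a 5000-row repetition of a subset of the question's sample rather than the 8000 rows timed above, the function names by_regex and by_map are invented here, and re.escape is an addition so that the regex matches the substrings literally and both approaches can be checked against each other. Absolute numbers will differ by machine and pandas version.

```python
import re
import timeit

import pandas as pd

# A subset of the question's sample data, repeated to get a larger frame.
addresses = [
    "fotki.yandex.ru/users/atanusha/albums",
    "9829192.ru/apple-iphone.html",
    "fotki.yandex.ru/users/atanusha/album/165150?&p=3",
    "3gmaster.net",
    "fotki.9829192.ru/apple-iphone.html",
]
needles = ["3gmaster.net", "9829192.ru/apple-iphone.html", "aj.ru"]
df = pd.DataFrame({"event_address": addresses * 1000})

# Escaped alternation so '.' and '?' are treated as literal characters.
pattern = "|".join(re.escape(n) for n in needles)


def by_regex():
    return df[df["event_address"].str.contains(pattern)]


def by_map():
    return df[df["event_address"].map(lambda x: any(n in x for n in needles))]


# Both approaches should select exactly the same rows.
same_rows = by_regex().equals(by_map())
print("same rows:", same_rows)

for fn in (by_regex, by_map):
    # Best of 3 runs, 10 calls each, reported per call.
    seconds = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {seconds * 1000:.1f} ms per call")
```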

2016-10-04 08:52:01 jezrael

I tried df['event_address'].str.contains('|'.join(urls.url)) because I need to add regex=True, but it returns sre_constants.error: multiple repeat –
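
The multiple repeat error most likely comes from regex metacharacters in the substrings: '|'.join(...) builds a regular expression, so characters such as ., ?, + and * are read as operators rather than literal text. One way around it, assuming the substrings are meant to be matched literally, is to pass each one through re.escape before joining. This is a minimal sketch; the frames are hypothetical, and the example.ru/a**b entry was invented purely to trigger the error.

```python
import re

import pandas as pd

# Hypothetical data modelled on the question's frames.
df = pd.DataFrame({
    "event_address": [
        "fotki.yandex.ru/users/atanusha/albums",
        "9829192.ru/apple-iphone.html",
        "3gmaster.net",
        "9829192.ru/index.php?cat=1&page=2",
        "example.ru/a**b/page",
    ]
})
urls = pd.DataFrame({
    "url": ["3gmaster.net", "9829192.ru/index.php?cat=1", "example.ru/a**b"]
})

# Joining the raw strings yields an invalid regex because of '**'.
try:
    re.compile("|".join(urls["url"]))
    unescaped_error = None
except re.error as exc:
    unescaped_error = str(exc)
print("unescaped pattern error:", unescaped_error)

# Escape every substring so metacharacters are matched literally.
pattern = "|".join(re.escape(u) for u in urls["url"])

res = df[df["event_address"].str.contains(pattern, regex=True)]
print(res)
```

Escaping also changes what is matched for patterns that happen to compile: without it, the ? in index.php?cat=1 would make the preceding p optional instead of matching a literal question mark.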

Do it like this:

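def check_exists(x):
    # True if any of the substrings occurs in x
    for i in substr:
        if i in x:
            return True
    return False

# .ix was removed in pandas 1.0; .loc accepts the same boolean mask
df2 = df.loc[df.event_address.map(check_exists)]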

Or, if you prefer to write it in one line:

df.loc[df.event_address.map(lambda x: any(i in x for i in substr))]

2016-10-04 09:07:06 Howardyan
