2015-09-02 63 views

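Python Pandas: lookup table by searching for a substring

I have a DataFrame with a column of application user agents. I need to identify the specific application from this column. For example: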

NewWordsWithFriendsFree/2.3 CFNetwork/672.1.15 Darwin/14.0.0 would be classified as Words With Friends

iPhone3,1; iPhone OS 7.1.2; com.fingerarts.sudoku2; 143441-1,24 would be Sudoku by FingerArts, and so on.

I will have another DataFrame containing the strings I need to match. For example:

Keyword     Game 
NewWordsWithFriends  Words With Friends 
com.fingerarts.sudoku Sudoku by FingerArts 

How can I do this kind of lookup on a pandas DataFrame? For example, the DataFrame looks like this:

user date     user-agent 
A  2015-09-02 13:45:56 NewWordsWithFriendsFree/2.3 CFNetwork/672.1.15 Darwin/14.0.0 
B  2015-08-31 23:04:21 iPhone3,1; iPhone OS 7.1.2; com.fingerarts.sudoku2; 143441-1,24 

After the lookup, I want to end up with a new column, GameName.

Answers

1

One possible way to achieve this would be:

import pandas as pd

# some example data
qry = pd.DataFrame.from_dict({"Keyword": ["NewWordsWithFriends",
                                          "com.fingerarts.sudoku"],
                              "Game": ["Words With Friends",
                                       "Sudoku by FingerArts"]})

df = pd.DataFrame.from_dict({"user-agent": ["NewWordsWithFriendsFree/2.3 CFNetwork/672.1.15 Darwin/14.0.0",
                                            "iPhone3,1; iPhone OS 7.1.2; com.fingerarts.sudoku2; 143441-1,24"]})

keywords = qry["Keyword"].tolist()
games = qry["Game"].tolist()

def select(x):
    # return the game of the first keyword contained in the user-agent string
    for key, game in zip(keywords, games):
        if key in x:
            return game
    return None  # no keyword matched

df["GameName"] = df["user-agent"].apply(select)

This gives:

In [41]: df 
Out[41]: 
              user-agent    GameName 
0 NewWordsWithFriendsFree/2.3 CFNetwork/672.1.15... Words With Friends 
1 iPhone3,1; iPhone OS 7.1.2; com.fingerarts.sud... Sudoku by FingerArts 

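If you need to do this for a large dataset, you should test the performance of this solution and see whether it is fast enough for your purposes.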

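If it is not, optimizing the way the strings are tested might speed things up. For example, put the loop over all possible games on the outside and compute the result for the whole column once per game; this avoids looping over all the games in every call to select().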
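The idea of moving the loop over games to the outside could be sketched roughly as follows. This is only an illustration under some assumptions: it uses the vectorized Series.str.contains instead of .apply for the per-column test, the helper name lookup_games is made up for this example, and the third user agent is a hypothetical row added to show the no-match case.

```python
import pandas as pd

qry = pd.DataFrame({
    "Keyword": ["NewWordsWithFriends", "com.fingerarts.sudoku"],
    "Game": ["Words With Friends", "Sudoku by FingerArts"],
})
df = pd.DataFrame({
    "user-agent": [
        "NewWordsWithFriendsFree/2.3 CFNetwork/672.1.15 Darwin/14.0.0",
        "iPhone3,1; iPhone OS 7.1.2; com.fingerarts.sudoku2; 143441-1,24",
        "SomeUnknownApp/1.0",  # hypothetical row with no matching keyword
    ]
})

def lookup_games(user_agents, qry):
    """Hypothetical helper: outer loop over keywords, one column-wide test each."""
    result = pd.Series(None, index=user_agents.index, dtype=object)
    for key, game in zip(qry["Keyword"], qry["Game"]):
        # regex=False treats the keyword as a literal substring;
        # na=False makes missing user agents count as "no match"
        hit = user_agents.str.contains(key, regex=False, na=False)
        # only fill rows not already matched by an earlier keyword
        result[hit & result.isna()] = game
    return result

df["GameName"] = lookup_games(df["user-agent"], qry)
print(df)
```

Rows that match no keyword stay missing (NaN) in GameName. Whether this is actually faster than select() depends on the number of keywords and rows, so it is worth timing on real data.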

To identify the bottleneck, you can use line_profiler (see "How can I profile python code line-by-line?").

1
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['2015-09-02 13:45:56', '2015-08-31 23:04:21'],
                   'user-agent': ['NewWordsWithFriendsFree/2.3 CFNetwork/672.1.15 Darwin/14.0.0',
                                  'iPhone3,1; iPhone OS 7.1.2; com.fingerarts.sudoku2; 143441-1,24']})

map_df = pd.DataFrame({'Keyword': ['NewWordsWithFriends', 'com.fingerarts.sudoku'],
                       'Game': ['Words With Friends', 'Sudoku by FingerArts']})

# keyword -> game lookup; built by column name so it does not depend on column order
mapping = dict(zip(map_df['Keyword'], map_df['Game']))

# one alternation pattern; re.escape takes care of '.' and other metacharacters
regex = '|'.join(re.escape(keyword) for keyword in map_df['Keyword'])

def get_keyword(user_agent):
    # first keyword found in the user-agent string, or NaN if none matches
    match = re.search(regex, user_agent)
    return match.group(0) if match else np.nan

df['GameName'] = df['user-agent'].apply(get_keyword)

df['GameName'] = df['GameName'].map(mapping)

Another possible implementation of the get_keyword function:

def get_keyword(user_agent):
    for keyword in map_df['Keyword']:
        if keyword in user_agent:
            return keyword

Another way to get the mapping is to create a Series:

mapping = pd.Series(map_df['Game'].values, index=map_df['Keyword'])
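As a possible variation on the regex approach in this answer, the keyword search can also be written with pandas' vectorized string methods instead of .apply. The sketch below assumes the keywords are literal strings (so they are escaped before being joined) and adds one hypothetical user agent to show the no-match case; it is not taken from the original answer.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "user-agent": [
        "NewWordsWithFriendsFree/2.3 CFNetwork/672.1.15 Darwin/14.0.0",
        "iPhone3,1; iPhone OS 7.1.2; com.fingerarts.sudoku2; 143441-1,24",
        "SomeUnknownApp/1.0",  # hypothetical row with no matching keyword
    ]
})
map_df = pd.DataFrame({
    "Keyword": ["NewWordsWithFriends", "com.fingerarts.sudoku"],
    "Game": ["Words With Friends", "Sudoku by FingerArts"],
})

# Assumption: keywords are literal text, so escape regex metacharacters.
# The surrounding parentheses form the single capture group str.extract needs.
pattern = "(" + "|".join(re.escape(k) for k in map_df["Keyword"]) + ")"

# keyword -> game lookup as a Series indexed by keyword
mapping = pd.Series(map_df["Game"].values, index=map_df["Keyword"])

# str.extract returns the first match of the capture group, NaN if there is none
matched = df["user-agent"].str.extract(pattern, expand=False)
df["GameName"] = matched.map(mapping)
print(df)
```

With expand=False and a single capture group, str.extract returns a Series, which can be passed straight to map.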

It would be interesting to see which is faster. – Moritz
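One way to check which approach is faster would be a small timeit comparison along these lines. It is only a rough sketch: the benchmark data is hypothetical (the two sample user agents repeated many times), the function names by_loop and by_regex are made up for this example, and the timings will vary with the machine, the pandas version, and the real data.

```python
import re
import timeit

import pandas as pd

map_df = pd.DataFrame({
    "Keyword": ["NewWordsWithFriends", "com.fingerarts.sudoku"],
    "Game": ["Words With Friends", "Sudoku by FingerArts"],
})

# Hypothetical benchmark data: the two sample user agents repeated 5000 times
user_agents = pd.Series([
    "NewWordsWithFriendsFree/2.3 CFNetwork/672.1.15 Darwin/14.0.0",
    "iPhone3,1; iPhone OS 7.1.2; com.fingerarts.sudoku2; 143441-1,24",
] * 5000)

keywords = map_df["Keyword"].tolist()
games = map_df["Game"].tolist()
mapping = dict(zip(keywords, games))
regex = re.compile("|".join(re.escape(k) for k in keywords))

def by_loop(ua):
    # plain substring test against each keyword in turn
    for key, game in zip(keywords, games):
        if key in ua:
            return game
    return None

def by_regex(ua):
    # single compiled alternation pattern, then a dict lookup
    m = regex.search(ua)
    return mapping[m.group(0)] if m else None

loop_time = timeit.timeit(lambda: user_agents.apply(by_loop), number=5)
regex_time = timeit.timeit(lambda: user_agents.apply(by_regex), number=5)
print(f"substring loop: {loop_time:.3f}s, regex: {regex_time:.3f}s")
```

With only two keywords the difference may be small; the picture could change with a long keyword list, so it is best to benchmark with the real lookup table.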