Using a regular expression to extract information from a large SFrame or DataFrame without a loop

I have the following code, in which I use a loop to extract some information and then use that information to build a new matrix. Because of the loop, this code takes forever to finish.

I wonder whether there is a better way to do this with GraphLab's SFrame or a pandas DataFrame. I appreciate any help!

# This is the regex pattern 
pattern_topic_entry_read = r"\d{15}/discussion_topics/(?P<topic>\d{9})/entries/(?P<entry>\d{9})/read" 

# Using the pattern, I filter my records 
requests_topic_entry_read = requests[requests['url'].apply(lambda x: regex.match(pattern_topic_entry_read, x) is not None)]

# Then for each record in the final set, 
# I need to extract topic and entry info using match.group 
for request in requests_topic_entry_read:
    for match in regex.finditer(pattern_topic_entry_read, request['url']):
        topic, entry = match.group('topic'), match.group('entry')

        # Then, I need to create a new SFrame (or dataframe, or anything suitable)
        newRow = gl.SFrame({'user_id': [request['user_id']],
                            'url': [request['url']],
                            'topic': [topic], 'entry': [entry]})

        # And, append it to my existing SFrame (or dataframe)
        entry_read_matrix = entry_read_matrix.append(newRow)

Some sample data:

user_id | url 
1000 | /123456832960900/discussion_topics/770000832912345/read 
1001 | /123456832960900/discussion_topics/770000832923456/view?per_page=832945307 
1002 | /123456832960900/discussion_topics/770000834562343/entries/832350330/read 
1003 | /123456832960900/discussion_topics/770000534344444/entries/832350367/read

I would like to get something like this:

user_id | topic   | entry 
1002 | 770000834562343 | 832350330 
1003 | 770000534344444 | 832350367

2017-01-18 renakre

You might shave off some execution time if you compile the regular expression. Also, the lambda can be written as _lambda x: regex.... is not None_ – volcano
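A minimal sketch of what this comment suggests, assuming pandas and the standard `re` module instead of the `regex` package used in the question. The pattern is adapted to the sample URLs (leading slash, digit runs of any length), which is an assumption, and the names `PATTERN` and `is_entry_read` are hypothetical.

```python
import re

import pandas as pd

# Assumed pattern, adapted to the sample URLs in the question:
# a leading slash and digit runs of any length.
PATTERN = re.compile(
    r"/\d+/discussion_topics/(?P<topic>\d+)/entries/(?P<entry>\d+)/read"
)


def is_entry_read(url):
    """Return True when the URL is an 'entry read' request."""
    return PATTERN.match(url) is not None


# Sample data from the question.
requests = pd.DataFrame({
    "user_id": [1000, 1001, 1002, 1003],
    "url": [
        "/123456832960900/discussion_topics/770000832912345/read",
        "/123456832960900/discussion_topics/770000832923456/view?per_page=832945307",
        "/123456832960900/discussion_topics/770000834562343/entries/832350330/read",
        "/123456832960900/discussion_topics/770000534344444/entries/832350367/read",
    ],
})

# Filter with the precompiled pattern and a plain boolean test.
requests_topic_entry_read = requests[requests["url"].apply(is_entry_read)]
```

This still calls a Python function once per row, so any gain is likely to be modest; it mainly tidies up the filter.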

Here, let me reproduce it:

>>> import pandas as pd 
>>> df = pd.DataFrame(columns=["user_id","url"]) 
>>> df.user_id = [1000,1001,1002,1003] 
>>> df.url = ['/123456832960900/discussion_topics/770000832912345/read', '/123456832960900/discussion_topics/770000832923456/view?per_page=832945307', '/123456832960900/discussion_topics/770000834562343/entries/832350330/read','/123456832960900/discussion_topics/770000534344444/entries/832350367/read'] 
>>> df["entry"] = df.url.apply(lambda x: x.split("/")[-2] if "entries" in x.split("/") else "---") 
>>> df["topic"] = df.url.apply(lambda x: x.split("/")[-4] if "entries" in x.split("/") else "---") 
>>> df[df.entry!="---"]

This gives you the DataFrame you want.

2017-01-18 10:40:31 Hng

pandas Series have string functions. For example, with your data in a DataFrame df:

import io
import re
import pandas as pd
pattern = re.compile(r'.*/discussion_topics/(?P<topic>\d+)(?:/entries/(?P<entry>\d+))?')
df = pd.read_table(io.StringIO(data), sep=r'\s*\|\s*', index_col='user_id', engine='python')
df.url.str.extract(pattern, expand=True)

produces

                   topic      entry
user_id
1000     770000832912345        NaN
1001     770000832923456        NaN
1002     770000834562343  832350330
1003     770000534344444  832350367
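To keep only the rows that have an entry, as in the question's desired output, one option is to make the entry group mandatory and drop the rows that do not match. The following is a sketch under the assumption that the data sits in a pandas DataFrame named `df` with `user_id` and `url` columns. The pattern is a variant of the one above, not the exact code of this answer, and `extracted` and `result` are hypothetical names.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "user_id": [1000, 1001, 1002, 1003],
    "url": [
        "/123456832960900/discussion_topics/770000832912345/read",
        "/123456832960900/discussion_topics/770000832923456/view?per_page=832945307",
        "/123456832960900/discussion_topics/770000834562343/entries/832350330/read",
        "/123456832960900/discussion_topics/770000534344444/entries/832350367/read",
    ],
})

# Named groups become column names; rows that do not match get NaN in both.
pattern = r"/discussion_topics/(?P<topic>\d+)/entries/(?P<entry>\d+)/read"
extracted = df["url"].str.extract(pattern, expand=True)

# Put user_id next to the extracted columns and drop the non-matching rows.
result = pd.concat([df["user_id"], extracted], axis=1).dropna(subset=["entry"])
print(result)
```

The extracted `topic` and `entry` values are strings; convert them with `astype` if numeric columns are needed.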

2017-01-18 11:27:04 Thorsten
