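Extracting information from a large SFrame or dataframe using regex without looping
I have the following code, in which I use a loop to extract some information and then use that information to build a new matrix. However, because I am using a loop, this code takes forever to finish.
I wonder if there is a better way of doing this using GraphLab's SFrame or a pandas dataframe. I appreciate any help!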
# This is the regex pattern
pattern_topic_entry_read = r"\d{15}/discussion_topics/(?P<topic>\d{9})/entries/(?P<entry>\d{9})/read"
# Using the pattern, I filter my records
requests_topic_entry_read = requests[requests['url'].apply(lambda x: False if regex.match(pattern_topic_entry_read, x) == None else True)]
# Then for each record in the final set,
# I need to extract topic and entry info using match.group
for request in requests_topic_entry_read:
    for match in regex.finditer(pattern_topic_entry_read, request['url']):
        topic, entry = match.group('topic'), match.group('entry')
        # Then, I need to create a new SFrame (or dataframe, or anything suitable)
        newRow = gl.SFrame({'user_id':[request['user_id']],
                            'url':[request['url']],
                            'topic':[topic], 'entry':[entry]})
        # And, append it to my existing SFrame (or dataframe)
        entry_read_matrix = entry_read_matrix.append(newRow)
Some sample data:
user_id | url
1000 | /123456832960900/discussion_topics/770000832912345/read
1001 | /123456832960900/discussion_topics/770000832923456/view?per_page=832945307
1002 | /123456832960900/discussion_topics/770000834562343/entries/832350330/read
1003 | /123456832960900/discussion_topics/770000534344444/entries/832350367/read
I would like to get something like this:
user_id | topic | entry
1002 | 770000834562343 | 832350330
1003 | 770000534344444 | 832350367
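For reference, one loop-free way to get from the sample data to this result in pandas might look like the sketch below. It is only an illustration under some assumptions: the `requests` dataframe is rebuilt here from the four sample rows, and the pattern uses `\d+` rather than the fixed digit counts in the code above, because the sample topic IDs have 15 digits and the sample URLs start with a slash. `Series.str.extract` returns one column per named group and leaves non-matching rows as NaN, so a single `dropna` does the filtering.

```python
import pandas as pd

# Assumption: digit counts relaxed to \d+ and a leading slash added so the
# pattern matches the sample URLs shown above.
pattern = r"/\d+/discussion_topics/(?P<topic>\d+)/entries/(?P<entry>\d+)/read$"

# Sample rows copied from the question.
requests = pd.DataFrame({
    "user_id": [1000, 1001, 1002, 1003],
    "url": [
        "/123456832960900/discussion_topics/770000832912345/read",
        "/123456832960900/discussion_topics/770000832923456/view?per_page=832945307",
        "/123456832960900/discussion_topics/770000834562343/entries/832350330/read",
        "/123456832960900/discussion_topics/770000534344444/entries/832350367/read",
    ],
})

# One vectorized call: columns "topic" and "entry", NaN where the URL does not match.
extracted = requests["url"].str.extract(pattern)

# Attach user_id, then drop the rows that did not match.
entry_read_matrix = (
    requests[["user_id"]]
    .join(extracted)
    .dropna(subset=["topic", "entry"])
    .reset_index(drop=True)
)
print(entry_read_matrix)
```

This avoids both the per-row Python loop and the repeated `append`, which copies the accumulated frame on every iteration.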
You might shave off some execution time if you compile the regular expression. Also, _lambda x: regex.... is not None_ – volcano
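A minimal sketch of what this comment suggests, using the standard-library `re` module in place of the third-party `regex` module from the question. The names `PATTERN` and `is_entry_read` are hypothetical, and the pattern is relaxed to `\d+` with a leading slash (an assumption) so that it matches the sample URLs.

```python
import re

# Compile once, outside the per-row function.
# Assumption: \d+ and a leading slash, so the sample URLs match.
PATTERN = re.compile(
    r"/\d+/discussion_topics/(?P<topic>\d+)/entries/(?P<entry>\d+)/read"
)

def is_entry_read(url):
    """Return True when the URL is an entry-read request."""
    # "is not None" replaces "False if ... == None else True".
    return PATTERN.match(url) is not None

# Sample URLs copied from the question.
urls = [
    "/123456832960900/discussion_topics/770000832912345/read",
    "/123456832960900/discussion_topics/770000832923456/view?per_page=832945307",
    "/123456832960900/discussion_topics/770000834562343/entries/832350330/read",
    "/123456832960900/discussion_topics/770000534344444/entries/832350367/read",
]

mask = [is_entry_read(u) for u in urls]
print(mask)
```

With a dataframe, the same function could be passed to `requests['url'].apply(is_entry_read)`. The gain from compiling may be modest, since Python's `re` module already caches recently used patterns. The loop and the repeated `append` are likely the larger cost.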