2017-07-14 69 views

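I have a DataFrame `data` with 2 columns, ID and Text. The goal is to split the values in the Text column into multiple columns based on dates. Normally a date starts a run of string values that belongs in its own column, unless the date sits at the end of the string (in which case it is treated as part of the string started by the previous date). How can I use dates to split a DataFrame column into multiple columns in Python?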

data: 
ID  Text 
10  6/26/06 begin tramadol, penicilin X 6 CYCLES. 1000mg tylenol X 1 YR after 11/2007 
20  7/17/06-advil, qui; 
10  7/19/06-ibuprofen. 8/31/06-penicilin, tramadol; 
40  9/26/06-penicilin, tramadol; 
91  5/23/06-penicilin, amoxicilin, tylenol; 
84  10/20/06-ibuprofen, tramadol; 
17  12/19/06-vit D, tramadol. 12/1/09 -6/18/10 vit D only for 5 months. 3/7/11 f/up 
23  12/19/06-vit D, tramadol; 12/1/09 -6/18/10 vit D; 3/7/11 video follow-up 
15  Follow up appt. scheduled 
69  talk to care giver 
32  12/15/06-2/16/07 everyday Follow-up; 6/8/16 discharged after 2 months 
70  12/1/06?Follow up but no serious allergies 
70  12/12/06-tylenol, vit D,advil; 1/26/07 scheduled surgery but had to cancel due to severe allergic reactions to advil 

Expected output:

ID  Text                     Text2                     Text3 
10  6/26/06 begin tramadol, penicilin X 6 CYCLES. 1000mg tylenol X 1 YR after 11/2007 
20  7/17/06-advil, qui; 
10  7/19/06-ibuprofen.                  8/31/06-penicilin, tramadol; 
40  9/26/06-penicilin, tramadol; 
91  5/23/06-penicilin, amoxicilin, tylenol; 
84  10/20/06-ibuprofen, tramadol; 
17  12/19/06-vit D, tramadol.                12/1/09 -6/18/10 vit D only for 5 months.            3/7/11 f/up 
23  12/19/06-vit D, tramadol;                12/1/09 -6/18/10 vit D;                 3/7/11 video follow-up 
15  Follow up appt. scheduled 
69  talk to care giver 
32  12/15/06-2/16/07 everyday Follow-up;             6/8/16 discharged after 2 months 
70  12/1/06?Follow up but no serious allergies 
70  12/12/06-tylenol, vit D,advil;               1/26/07 scheduled surgery but had to cancel due to severe allergic reactions to advil 

My code so far:

import re
import datefinder

d = []
for i in data.Text:
    d = list(datefinder.find_dates(i))  # I can get the dates so far but still want to format the date values as %m/%d/%Y

    if len(d) > 1:  # checks for every record that has more than 1 date
        for j in range(len(d)):
            i = " " + " ".join(re.split(r'[^a-z 0-9/-]', i.lower())) + " "  # cleans the text strings of any special characters
            # data.Text[j] = d[j]r'[/^(.*?)]'d[j+1]'/'  # this is not working

            # The goal is for the Text column to retain the string from the first date up to before the second date. Then create a new Text1, get every value from the second date up to before the third date. And if there are more dates, create Textn and so on.
            # Exception: if a date immediately follows a date (i.e. 12/1/09 -6/18/10) or a date ends a value string (i.e. 6/26/06 begin tramadol, penicilin X 6 CYCLES. 1000mg tylenol X 1 YR after 11/2007), they should be considered to be in the same column

Any ideas on how to make this work would save my day. Thanks!


Will all the relevant dates be in MM/DD/YY format? –


@Brad Solomon - preferably in mm/dd/yyyy. Thanks! – CodeLearner


I meant in your input data. –

Answers


Here you go:

from itertools import chain, starmap, zip_longest 
import itertools 
import re 
import pandas as pd 

ids = [10, 20, 10, 40, 91, 84, 17, 23, 15, 69, 32, 70, 70] 

text = [ 
    "6/26/06 begin tramadol, penicilin X 6 CYCLES. 1000mg tylenol X 1 YR after 11/2007", 
    "7/17/06-advil, qui;", 
    "7/19/06-ibuprofen. 8/31/06-penicilin, tramadol;", 
    "9/26/06-penicilin, tramadol;", 
    "5/23/06-penicilin, amoxicilin, tylenol;", 
    "10/20/06-ibuprofen, tramadol;", 
    "12/19/06-vit D, tramadol. 12/1/09 -6/18/10 vit D only for 5 months. 3/7/11 f/up", 
    "12/19/06-vit D, tramadol; 12/1/09 -6/18/10 vit D; 3/7/11 video follow-up", 
    "Follow up appt. scheduled", 
    "talk to care giver", 
    "12/15/06-2/16/07 everyday Follow-up; 6/8/16 discharged after 2 months", 
    "12/1/06?Follow up but no serious allergies", 
     "12/12/06-tylenol, vit D,advil; 1/26/07 scheduled surgery but had to cancel due to severe allergic reactions to advil"] 

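# A date (M/D/YY), optionally followed by the end of a range ("-", "to" or "through"
# plus a second date), then at least one non-space character, so a date that ends
# the string does not start a new item. Raw strings keep \d, \s and \S as regex escapes.
by_date = re.compile(
    r"""((?:0?[1-9]|1[012])/(?:0?[1-9]|[12]\d|3[01])/\d\d\s*"""
    r"""(?:(?:-|to |through)\s*(?:0?[1-9]|1[012])/(?:0?[1-9]|[12]\d|3[01])/\d\d)?\s*\S)""")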


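def to_items(line):
    # Start offsets of every date (or date range) that begins a new item.
    starts = [m.start() for m in by_date.finditer(line)]
    # Keep any text before the first date (or the whole line if it has no date).
    if not starts or starts[0] > 0:
        starts.insert(0, 0)
    # Pair each start with the next one; the last item runs to the end of the line.
    stops = iter(starts)
    next(stops)
    return map(line.__getitem__, starmap(slice, zip_longest(starts, stops)))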


cleaned = zip_longest(*map(to_items, text)) 
col_names = chain(["Text"], map("Text{}".format, itertools.count(2))) 
df = pd.DataFrame(dict(zip(col_names, cleaned), ID=ids)) 

print(df) 
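
To see what the splitting does to a single record without building the whole DataFrame, here is a minimal self-contained sketch. It repeats the same pattern and cuts one line at each match start. The helper name `split_at_dates` is made up for illustration and is not part of the code above.

```python
import re

# Same pattern as in the answer: a M/D/YY date, optionally followed by the end
# of a range ("-", "to", "through" plus a second date), then a non-space character.
DATE = r"(?:0?[1-9]|1[012])/(?:0?[1-9]|[12]\d|3[01])/\d\d"
by_date = re.compile(DATE + r"\s*(?:(?:-|to |through)\s*" + DATE + r")?\s*\S")


def split_at_dates(line):
    """Cut `line` into pieces, each beginning where the pattern matches."""
    starts = [m.start() for m in by_date.finditer(line)]
    if not starts or starts[0] > 0:
        starts.insert(0, 0)  # keep text before the first date (or the whole line)
    stops = starts[1:] + [len(line)]
    return [line[a:b] for a, b in zip(starts, stops)]


record = ("12/19/06-vit D, tramadol. 12/1/09 -6/18/10 vit D only "
          "for 5 months. 3/7/11 f/up")
parts = split_at_dates(record)
for part in parts:
    print(repr(part))
```

The date range `12/1/09 -6/18/10` stays in one piece because the optional range group consumes the second date as part of the same match. Each piece keeps its trailing whitespace, so you may want to call `.strip()` on the pieces before putting them into columns.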

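You are a lifesaver. Thank you! A quick observation: I found that a date at the end of a string still gets pulled into a new column, which should not happen. I mean that any date at the end of a string should be considered part of that string, so it should stay in the same column. How can we get rid of this incorrect split? – CodeLearner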


Please see my comment above. Thanks. – CodeLearner


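@CodeLearner Are you talking about a line in your records? Sorry, I don't see a date at the end of a string forming a new column. Are you testing with other data? The regex uses \S to make sure there is content after the date. – frogcoder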
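
One possible explanation, which is only a guess because the failing data is not shown in the thread: `\S` accepts any non-space character, so a closing date followed by punctuation such as `.` or `;` still counts as "having content after it". The sketch below compares the pattern from the answer with a hypothetical stricter variant that requires a word character somewhere after the date. The names `stricter`, `find_starts` and the two sample strings are made up for illustration.

```python
import re

DATE = r"(?:0?[1-9]|1[012])/(?:0?[1-9]|[12]\d|3[01])/\d\d"
RANGE = r"\s*(?:(?:-|to |through)\s*" + DATE + r")?"

# Pattern from the answer: the date must be followed by any non-space character.
original = re.compile(DATE + RANGE + r"\s*\S")
# Hypothetical variant: the date must be followed by a word character,
# possibly after some punctuation or spaces (checked with a lookahead).
stricter = re.compile(DATE + RANGE + r"(?=\W*\w)")


def find_starts(pattern, line):
    """Offsets at which `pattern` would start a new column."""
    return [m.start() for m in pattern.finditer(line)]


bare = "1/5/07 tylenol until 2/1/07"     # closing date ends the string
dotted = "1/5/07 tylenol until 2/1/07."  # closing date followed by a period

print(find_starts(original, bare))    # [0]      closing date is not split off
print(find_starts(original, dotted))  # [0, 21]  the "." satisfies \S, so it splits
print(find_starts(stricter, dotted))  # [0]      punctuation alone is not enough
```

If your records do end with punctuation after the last date, swapping the final `\s*\S` for the lookahead is one way to keep that date in the same column. Dates that begin a new entry, such as `8/31/06-penicilin`, still match because a word character follows the hyphen.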
