2016-12-14
1

Create a new column at each occurrence of a string in pandas

Suppose I have the following .txt file:

Alabama[edit] 
fooAL 
barAL 
Arizona[edit] 
fooAz 
barAz 
bazAz 
Alaska[edit] 
fooAk 
... 

How can I convert this into a pandas DataFrame of the following form?

| St. Name | Region | 
|----------+--------| 
| Alabama | fooAL | 
| Alabama | barAL | 
| Arizona | fooAz | 
| Arizona | barAz | 
| Arizona | bazAz | 
| Alaska | fooAk | 
| ...  | ... | 

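My idea was to use the [edit] string that follows each state name as the separator in pandas read_csv, i.e. sep='\[edit\]'. But that doesn't give me what I want.

I still think some kind of regular expression could do what I want here, without writing a loop or anything like that. Can you help?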

+0

This sure looks a lot like week 4 of the Introduction to Data Science in Python course (https://www.coursera.org/learn/python-data-analysis) :) –

+0

Yes. They encourage you to ask questions on Stack Overflow. So I did :) – minibuffer

Answers

1

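I'd suggest not relying on pandas for the parsing here. Instead, open the file, process it line by line to build a list of dictionaries, and create the DataFrame from that list: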

import pandas as pd

with open('yourfile.txt', 'r') as f:
    content = f.read().splitlines()

state = None
l_dict = []
for line in content:
    if '[edit]' in line:
        # Header line: remember the state name (the text before '[')
        state = line.split('[')[0]
    else:
        # Region line: pair it with the most recent state
        l_dict.append({'St. Name': state, 'Region': line})

df = pd.DataFrame(l_dict)
df.set_index('St. Name', inplace=True)
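
To check the loop approach without a file on disk, here is a minimal self-contained sketch that runs the same logic on the sample from the question. The names `SAMPLE` and `parse_states` are made up for this illustration, and stripping whitespace and skipping blank lines are my own additions, not part of the answer above.

```python
import io

import pandas as pd

# Sample data copied from the question (the elided "..." rows are left out).
SAMPLE = """Alabama[edit]
fooAL
barAL
Arizona[edit]
fooAz
barAz
bazAz
Alaska[edit]
fooAk
"""


def parse_states(lines):
    """Build a (St. Name, Region) DataFrame from an iterable of text lines."""
    state = None
    rows = []
    for raw in lines:
        line = raw.strip()  # assumption: surrounding whitespace is not meaningful
        if not line:
            continue  # skip blank lines
        if line.endswith('[edit]'):
            # Header line: everything before the marker is the state name
            state = line[:-len('[edit]')]
        else:
            rows.append({'St. Name': state, 'Region': line})
    return pd.DataFrame(rows, columns=['St. Name', 'Region'])


df_loop = parse_states(io.StringIO(SAMPLE))
print(df_loop)
```

Passing any iterable of lines (an open file works too) keeps the parsing separate from where the text comes from.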

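If you really want to do it in pandas, I think you can do it by handling the states and the regions separately and forward-filling the NaNs. DataFrame.ffill does the same thing as fillna(method='ffill') (or 'pad'):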

import pandas as pd

# Read the file into a single-column DataFrame, one row per line
with open('yourfile.txt', 'r') as f:
    df = pd.DataFrame({'txt': f.read().splitlines()})

# Create a column that'll serve as a filter IsState
# (regex=False matches the literal text '[edit]')
df['IsState'] = df['txt'].str.contains('[edit]', regex=False)

# Split and get first item of split
df.loc[df.IsState, 'St. Name'] = df.loc[df.IsState, 'txt'].str.split('[').str.get(0)

# the `~` means not
df.loc[~df.IsState, 'Region'] = df.loc[~df.IsState, 'txt']

# Forward fill the NaNs
df['St. Name'] = df['St. Name'].ffill()

# Select what you truly want and set index
df = df.loc[~df.IsState, ['St. Name', 'Region']]
df.set_index('St. Name', inplace=True)
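
The forward-fill idea can also be written more compactly. This is only a sketch under my own assumptions: the helper name `states_via_ffill` is hypothetical, it takes a list of lines rather than a file name, and it uses `str.extract` to pull out the state name in one step instead of the separate IsState column.

```python
import pandas as pd


def states_via_ffill(lines):
    """Pair each region line with the most recent '<state>[edit]' header."""
    txt = pd.Series(list(lines), dtype='object').str.strip()
    # State name on header rows, NaN on every other row
    state = txt.str.extract(r'^(.*)\[edit\]$', expand=False)
    is_header = state.notna()
    # Forward fill carries each state name down to its region rows
    out = pd.DataFrame({'St. Name': state.ffill(), 'Region': txt})
    # Drop the header rows themselves and renumber the index
    return out.loc[~is_header].reset_index(drop=True)


df_ffill = states_via_ffill([
    'Alabama[edit]', 'fooAL', 'barAL',
    'Arizona[edit]', 'fooAz', 'barAz', 'bazAz',
    'Alaska[edit]', 'fooAk',
])
print(df_ffill)
```

Computing `is_header` before the forward fill matters: after `ffill()` the header rows can no longer be told apart from the region rows.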
3
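import pandas as pd

# header is None and names=['St. Name']
# (.squeeze('columns') replaces the squeeze=True argument, removed in pandas 2.0)
s = pd.read_csv('yourfile.txt', header=None, names=['St. Name']).squeeze('columns')

# grab [edit] lines; expand=False returns a Series rather than a DataFrame
st = s.str.extract(r'(.*)\[edit\]', expand=False).ffill()
# groupby
g = s.groupby(st)
# use tail(-1) to get all but first row
df = g.apply(pd.Series.tail, n=-1)
# reset_index to get what we want
df = df.reset_index('St. Name', name='Region')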



The same thing in one line:

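from io import StringIO

import pandas as pd

# txt is assumed to be a string holding the contents of the file
s = pd.read_csv(StringIO(txt), header=None, names=['St. Name']).squeeze('columns')

s.groupby(s.str.extract(r'(.*)\[edit\]', expand=False).ffill()) \
    .apply(pd.Series.tail, n=-1) \
    .reset_index('St. Name', name='Region')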
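
For anyone who wants to run the groupby version end to end, here is a hedged, self-contained sketch on the sample from the question. The variable names `txt`, `keys` and `df_groupby` are illustrative. Two things are my own additions: `sort=False`, because `groupby` sorts the keys by default and would otherwise put Alaska before Arizona, and the explicit `rename`, so the index level is called 'St. Name' regardless of how `str.extract` names its result.

```python
from io import StringIO

import pandas as pd

# Sample data copied from the question (the elided "..." rows are left out).
txt = """Alabama[edit]
fooAL
barAL
Arizona[edit]
fooAz
barAz
bazAz
Alaska[edit]
fooAk
"""

# One string per line of the file, as a Series
s = pd.read_csv(StringIO(txt), header=None, names=['St. Name']).squeeze('columns')

# State name on header rows, forward-filled onto the region rows below them
keys = s.str.extract(r'(.*)\[edit\]', expand=False).ffill().rename('St. Name')

df_groupby = (
    s.groupby(keys, sort=False)        # keep the states in file order
     .apply(pd.Series.tail, n=-1)      # drop the first row (the header) of each group
     .reset_index('St. Name', name='Region')
)
print(df_groupby)
```

The remaining index holds the original line positions; call `reset_index(drop=True)` on the result if a clean 0..n-1 index is preferred.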
+0

I like it! Very elegant. –