Unpacking list elements in a DataFrame

I have a DataFrame like this:

import pandas as pd

l1 = ['a', 'b', 'c']
l2 = ['x', ['y1', 'y2', 'y3'], 'z']
df = pd.DataFrame(list(zip(l1, l2)), columns = ['l1', 'l2'])

Result:

l1   l2 
0 a    x 
1 b [y1, y2, y3] 
2 c    z

What I need is to unpack the nested list in l2 and propagate the corresponding l1 value, like this:

l1 l2 
0 a x 
1 b y1 
2 b y2 
3 b y3 
4 c z

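What is the right way to do this? Thanks.

Source

2016-06-28 Alexey Trofimov

I think you can use numpy.repeat to repeat the l1 values according to the lengths obtained with str.len, and flatten the nested lists with chain: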

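from itertools import chain

import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    # repeat each l1 value once per element of the matching l2 cell
    "l1": np.repeat(df.l1.values, df.l2.str.len()),
    # flatten all l2 cells into one flat list
    "l2": list(chain.from_iterable(df.l2))})
print (df1)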
    l1 l2 
0 a x 
1 b y1 
2 b y2 
3 b y3 
4 c z
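One caveat worth noting: both str.len and chain treat a plain string as a sequence of characters, so the approach above relies on the scalar entries being single characters. A minimal sketch of a variant that first wraps scalars in one-element lists (the 'zz' value and the names as_lists and lengths are illustrative assumptions, not part of the original answer):

```python
from itertools import chain

import numpy as np
import pandas as pd

# 'zz' is a hypothetical multi-character scalar added for illustration.
df = pd.DataFrame({"l1": ["a", "b", "c"],
                   "l2": ["x", ["y1", "y2", "y3"], "zz"]})

# Wrap every scalar in a one-element list so that all cells are lists.
as_lists = [v if isinstance(v, list) else [v] for v in df["l2"]]
lengths = [len(v) for v in as_lists]

df1 = pd.DataFrame({
    # repeat each l1 value once per element of the matching list
    "l1": np.repeat(df["l1"].to_numpy(), lengths),
    # flatten the lists; strings stay whole because they are wrapped
    "l2": list(chain.from_iterable(as_lists)),
})
print(df1)
```

Without the wrapping step, 'zz' would be split into two 'z' rows.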

Timings:

import string
from itertools import zip_longest

# [100000 rows x 2 columns]
np.random.seed(10)
N = 100000
l1 = ['a', 'b', 'c']
l1 = np.random.choice(l1, N)
l2 = [list(tuple(string.ascii_letters[:np.random.randint(1, 10)])) for _ in np.arange(N)]
df = pd.DataFrame({"l1":l1, "l2":l2})
# Replace one-element lists with their single (scalar) element
df.l2 = df.l2.apply(lambda x: x if len(x) !=1 else x[0])
#print (df)


In [91]: %timeit (pd.DataFrame([(left, right) for outer in zip(l1, l2) for left, right in zip_longest(*outer, fillvalue=outer[0])])) 
1 loop, best of 3: 242 ms per loop 

In [92]: %timeit (pd.DataFrame({ "l1": np.repeat(df.l1.values, df.l2.str.len()), "l2": list(chain.from_iterable(df.l2))})) 
10 loops, best of 3: 84.6 ms per loop

Conclusion:

numpy.repeat is about 3 times faster than the zip_longest solution on the larger df.
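Outside IPython, a comparison along these lines can be approximated with the standard timeit module. This is only a sketch under assumptions: N is kept small, all l2 cells are left as lists (the apply step is skipped), and the function names with_zip_longest and with_repeat are made up for illustration. Absolute numbers will differ by machine and pandas version.

```python
import string
import timeit
from itertools import chain, zip_longest

import numpy as np
import pandas as pd

np.random.seed(10)
N = 1000  # assumption: small size so the script finishes quickly
l1 = np.random.choice(["a", "b", "c"], N)
l2 = [list(string.ascii_letters[:np.random.randint(1, 10)]) for _ in range(N)]
df = pd.DataFrame({"l1": l1, "l2": l2})


def with_zip_longest():
    # pad the single-character l1 value to the length of each l2 list
    pairs = [(left, right) for outer in zip(l1, l2)
             for left, right in zip_longest(*outer, fillvalue=outer[0])]
    return pd.DataFrame(pairs, columns=["l1", "l2"])


def with_repeat():
    # repeat l1 by list length and flatten l2
    lengths = df["l2"].str.len().to_numpy()
    return pd.DataFrame({"l1": np.repeat(df["l1"].to_numpy(), lengths),
                         "l2": list(chain.from_iterable(df["l2"]))})


for func in (with_zip_longest, with_repeat):
    seconds = timeit.timeit(func, number=20) / 20
    print(f"{func.__name__}: {seconds * 1e3:.2f} ms per call")
```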

Edit:

Comparing against the loop version requires a small df, because it is very slow:

#[1000 rows x 2 columns] 
np.random.seed(10) 
N = 1000 
l1 = ['a', 'b', 'c'] 
l1 = np.random.choice(l1, N) 
l2 = [list(tuple(string.ascii_letters[:np.random.randint(1, 10)])) for _ in np.arange(N)] 
df = pd.DataFrame({"l1":l1, "l2":l2}) 
df.l2 = df.l2.apply(lambda x: x if len(x) !=1 else x[0]) 
#print (df)

def alexey(df):
    # Note: the original used df.ix and Series.set_value, both removed in pandas 1.0
    df2 = pd.DataFrame(columns=df.columns, index=df.index)[0:0]

    for idx in df.index:
        new_row = df.loc[idx, :].copy()
        for res in df.loc[idx, 'l2']:
            new_row['l2'] = res
            df2.loc[len(df2)] = new_row
    return df2

print (alexey(df)) 

In [20]: %timeit (alexey(df)) 
1 loop, best of 3: 11.4 s per loop 

In [21]: %timeit pd.DataFrame([(left, right) for outer in zip(l1, l2) for left, right in zip_longest(*outer, fillvalue=outer[0])]) 
100 loops, best of 3: 2.57 ms per loop 

In [22]: %timeit pd.DataFrame({ "l1": np.repeat(df.l1.values, df.l2.str.len()), "l2": list(chain.from_iterable(df.l2))}) 
The slowest run took 4.42 times longer than the fastest. This could mean that an intermediate result is being cached. 
1000 loops, best of 3: 1.41 ms per loop

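Source

2017-03-24 06:21:29 jezrael

Could I get you to weigh in on my answer [**here**](http://stackoverflow.com/a/43020297/2336654)? I answered late. I'm also on the road without my laptop and can't run any code. – piRSquared

Unfortunately I'm on my phone, so I can't test it. But I give an upvote. – jezrael

Thank you! I appreciate it. – piRSquared

You can use a nested list comprehension with itertools.zip_longest.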

import pandas as pd 

from itertools import zip_longest 

l1 = ['a', 'b', 'c'] 
l2 = ['x', ['y1', 'y2', 'y3'], 'z'] 

expanded = [(left, right) for outer in zip(l1, l2) 
          for left, right in zip_longest(*outer, fillvalue=outer[0])] 

pd.DataFrame(expanded)

The result is...

0 1 
0 a x 
1 b y1 
2 b y2 
3 b y3 
4 c z

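To me this is borderline too long for a list comprehension. It also assumes that l1 contains no lists and that l1 is the side being padded.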
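To see why the comprehension works, it may help to look at what zip_longest yields for a single (l1, l2) pair. A small sketch (the variable names scalar_case and list_case are illustrative):

```python
from itertools import zip_longest

# Scalar on the right: both one-character strings are iterated, giving one pair.
scalar_case = list(zip_longest("a", "x", fillvalue="a"))
print(scalar_case)  # [('a', 'x')]

# List on the right: the shorter left side is padded with the fill value.
list_case = list(zip_longest("b", ["y1", "y2", "y3"], fillvalue="b"))
print(list_case)  # [('b', 'y1'), ('b', 'y2'), ('b', 'y3')]
```

Because strings are iterated character by character, this trick depends on the l1 values (and the scalar l2 values) being single characters.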

Source

2016-06-28 14:23:23

Brute force, iterating over the DataFrame:

for idx in df.index:
    # This transforms the item in "l2" into an iterable list
    item = df.loc[idx, "l2"] if isinstance(df.loc[idx, "l2"], (list, tuple)) else [df.loc[idx, "l2"]]
    for element in item:
        print(df.loc[idx, "l1"], element)

Returns:

a x 
b y1 
b y2 
b y3 
c z
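The loop above only prints the pairs. A minimal sketch of the same idea that collects the pairs and builds a new DataFrame from them (the names rows and result are illustrative, not from the original answer):

```python
import pandas as pd

df = pd.DataFrame({"l1": ["a", "b", "c"],
                   "l2": ["x", ["y1", "y2", "y3"], "z"]})

rows = []
for left, right in zip(df["l1"], df["l2"]):
    # Treat a scalar as a one-element list so both cases share one loop.
    items = right if isinstance(right, (list, tuple)) else [right]
    for element in items:
        rows.append((left, element))

# Build the DataFrame once at the end instead of appending row by row.
result = pd.DataFrame(rows, columns=["l1", "l2"])
print(result)
```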

Source

2016-06-28 14:24:20 Sosel

For DataFrames with a non-constant number of columns, I currently do something like this:

import pandas as pd

l1 = ['a', 'b', 'c']
l2 = ['x', ['y1', 'y2', 'y3'], 'z']
df = pd.DataFrame(list(zip(l1, l2)), columns = ['l1', 'l2'])

# Empty frame with the same columns as df
df2 = pd.DataFrame(columns=df.columns, index=df.index)[0:0]

for idx in df.index:
    new_row = df.loc[idx, :].copy()
    # .loc and item assignment replace df.ix and set_value (removed in pandas 1.0)
    for res in df.loc[idx, 'l2']:
        new_row['l2'] = res
        df2.loc[len(df2)] = new_row

It works, but it looks a lot like brute force.
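On newer pandas versions (0.25 or later, which is an assumption about the reader's environment and postdates this thread), DataFrame.explode covers the same case without an explicit loop and carries along any number of other columns. A sketch, where the extra column is hypothetical and only there to show that other columns are repeated too:

```python
import pandas as pd

df = pd.DataFrame({"l1": ["a", "b", "c"],
                   "l2": ["x", ["y1", "y2", "y3"], "z"],
                   "extra": [1, 2, 3]})  # hypothetical additional column

# Each list element gets its own row; scalars are kept as they are.
exploded = df.explode("l2").reset_index(drop=True)
print(exploded)
```

Unlike the str.len and chain approach, explode leaves multi-character strings whole.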

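Source

2017-03-24 06:17:42

It doesn't work for me, so I can't add it to the timings. – jezrael

But I think it will be slow, because of the loops :( – jezrael

I fixed the code, could you please check the timings? Yes, I guess it is slow too. Maybe the loop can be optimized somehow (I'm not an expert). –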
