2016-11-14 203 views
2

Creating a bigram column in a pandas DataFrame

I have this test table in a pandas DataFrame:

Leaf_category_id session_id product_id 
0    111   1   987 
3    111   4   987 
4    111   1   741 
1    222   2   654 
2    333   3   321 

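This is an extension of my earlier question, which was answered by @jazrael (view answer).

So let the values in the product_id column be as follows (this is just an assumption, slightly different from the output of my earlier question):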

|product_id    | 
    --------------------------- 
    |111,987,741,34,12  | 
    |987,1232     | 
    |654,12,324,465,342,324 | 
    |321,741,987    | 
    |324,654,862,467,243,754 | 
    |6453,123,987,741,34,12 | 

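and so on. I want to create a new column in which every value in a row is paired with the value that follows it, and the last value, which has no successor, is paired with the first one. For example: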

|product_id    |Bigram 
    ------------------------------------------------------------------------- 
    |111,987,741,34,12  |(111,987),**(987,741)**,(741,34),(34,12),(12,111) 
    |987,1232     |(987,1232),(1232,987) 
    |654,12,324,465,342,32 |(654,12),(12,324),(324,465),(465,342),(342,32),(32,654) 
    |321,741,987    |(321,741),**(741,987)**,(987,321) 
    |324,654,862    |(324,654),(654,862),(862,324) 
    |123,987,741,34,12  |(123,987),(987,741),(34,12),(12,123) 

Ignore the ** for now (I'll explain later why I starred those).

The code I used to build the bigrams is:
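
The wrap-around pairing described above can be sketched in plain Python. This is only a minimal illustration, assuming each row's product IDs are already available as a list; the helper name cyclic_bigrams is hypothetical and does not come from the question.

```python
def cyclic_bigrams(ids):
    """Pair each item with its successor; the last item pairs with the first."""
    if len(ids) < 2:
        return []
    # ids[1:] + ids[:1] is the same list rotated left by one position
    return list(zip(ids, ids[1:] + ids[:1]))


print(cyclic_bigrams([321, 741, 987]))
# [(321, 741), (741, 987), (987, 321)]
```

Rotating the list by one position avoids special-casing the final element.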

for i in df.Leaf_category_id.unique(): 
    print (df[df.Leaf_category_id == i].groupby('session_id')['product_id'].apply(lambda x: list(zip(x, x[1:]))).reset_index()) 

From this df, I want to take the Bigram column and add one more column named frequency, which gives the number of times each bigram occurs.

Note: (987,741) and (741,987) are to be considered the same, so one duplicate entry should be removed and the frequency of (987,741) should be 2. Similarly, (34,12) occurs twice, so its frequency should be 2.

|Bigram 
    --------------- 
    |(111,987), 
    |**(987,741)** 
    |(741,34) 
    |(34,12) 
    |(12,111) 
    |**(741,987)** 
    |(987,321) 
    |(34,12) 
    |(12,123) 

The final result should be:

|Bigram  | frequency | 
    -------------------------- 
    |(111,987) | 1 
    |(987,741) | 2 
    |(741,34)  | 1 
    |(34,12)  | 2 
    |(12,111)  | 1 
    |(987,321) | 1 
    |(12,123)  | 1 
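
One way to get order-insensitive frequencies like the table above is to sort each pair before counting. The sketch below uses only the standard library and copies the bigram rows from the question's flattened table; it is an illustration under those assumptions, not the method used in the answers.

```python
from collections import Counter

# Bigram rows copied from the question's flattened table
bigrams = [(111, 987), (987, 741), (741, 34), (34, 12), (12, 111),
           (741, 987), (987, 321), (34, 12), (12, 123)]

# Sorting each pair makes (987, 741) and (741, 987) the same key
counts = Counter(tuple(sorted(pair)) for pair in bigrams)

for pair, freq in counts.items():
    print(pair, freq)
```

Note that the keys come out in sorted form, so the starred pair is reported as (741, 987) rather than (987, 741).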

I hope to find an answer here. Please help me; I have explained the problem in as much detail as I can.

+0

How do you want the frequency? In a single row, the Bigram column will contain multiple tuples, so there will be multiple frequencies. – James

+0

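@James: Each tuple in a row should become a new row, as shown in the second-to-last table. Then, if there are duplicates, as I mentioned, the frequency should change accordingly. – Shubham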
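
Turning each tuple into its own row can be sketched with DataFrame.explode, which is available in pandas 0.25 and later. The frame below is hypothetical sample data, and the names df_bigrams and long_form are chosen only for illustration.

```python
import pandas as pd

# Hypothetical frame holding one list of bigram tuples per row
df_bigrams = pd.DataFrame({
    "Bigram": [
        [(321, 741), (741, 987), (987, 321)],
        [(987, 1232), (1232, 987)],
    ]
})

# explode() emits one row per list element; the tuples themselves stay intact
long_form = df_bigrams.explode("Bigram").reset_index(drop=True)
print(long_form)
```

The result has five rows, one per tuple, which is the shape needed before counting frequencies.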

+0

So 'Bigram' and 'frequency' are in a separate data frame? – James

Answers

2

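Try this code:

import pandas as pd 

# DataFrame.from_csv has been deprecated and removed from recent pandas versions;
# read_csv with index_col=0 is the replacement
df = pd.read_csv("data.csv", index_col=0) 
# consecutive: pair each product_id with the next one in its group,
# sorting each pair so (987, 741) and (741, 987) count as the same bigram
grouped_consecutive_product_ids = df.groupby(['Leaf_category_id','session_id'])['product_id'].apply(lambda x: [tuple(sorted(pair)) for pair in zip(x,x[1:])]).reset_index() 

df1=pd.DataFrame(grouped_consecutive_product_ids) 
# spread each list into columns, then unstack into one bigram per row
s=df1.product_id.apply(lambda x: pd.Series(x)).unstack() 
df2=pd.DataFrame(s.reset_index(level=0,drop=True)).dropna() 
df2.rename(columns = {0:'Bigram'}, inplace = True) 
# count how many times each bigram occurs, then keep one row per bigram
df2["freq"] = df2.groupby('Bigram')['Bigram'].transform('count') 
bigram_frequency_consecutive = df2.drop_duplicates(keep="first").sort_values("Bigram").reset_index() 
del bigram_frequency_consecutive["index"] 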

For combinations (all possible bigrams):

from itertools import combinations 
import pandas as pd 

# DataFrame.from_csv has been deprecated and removed from recent pandas versions;
# read_csv with index_col=0 is the replacement
df = pd.read_csv("data.csv", index_col=0) 
# combinations: every unordered pair of product_ids within a group
grouped_combination_product_ids = df.groupby(['Leaf_category_id','session_id'])['product_id'].apply(lambda x: [tuple(sorted(pair)) for pair in combinations(x,2)]).reset_index() 

df1=pd.DataFrame(grouped_combination_product_ids) 
# spread each list into columns, then unstack into one bigram per row
s=df1.product_id.apply(lambda x: pd.Series(x)).unstack() 
df2=pd.DataFrame(s.reset_index(level=0,drop=True)).dropna() 
df2.rename(columns = {0:'Bigram'}, inplace = True) 
# count how many times each bigram occurs, then keep one row per bigram
df2["freq"] = df2.groupby('Bigram')['Bigram'].transform('count') 
bigram_frequency_combinations = df2.drop_duplicates(keep="first").sort_values("Bigram").reset_index() 
del bigram_frequency_combinations["index"] 

where data.csv contains:

Leaf_category_id,session_id,product_id 
0,111,1,111 
3,111,4,987 
4,111,1,741 
1,222,2,654 
2,333,3,321 
5,111,1,87 
6,111,1,34 
7,111,1,12 
8,111,1,987 
9,111,4,1232 
10,222,2,12 
11,222,2,324 
12,222,2,465 
13,222,2,342 
14,222,2,32 
15,333,3,321 
16,333,3,741 
17,333,3,987 
18,333,3,324 
19,333,3,654 
20,333,3,862 
21,222,1,123 
22,222,1,987 
23,222,1,741 
24,222,1,34 
25,222,1,12 

The resulting bigram_frequency_consecutive will be:

  Bigram freq 
0  (12, 34)  2 
1  (12, 324)  1 
2  (12, 654)  1 
3  (12, 987)  1 
4  (32, 342)  1 
5  (34, 87)  1 
6  (34, 741)  1 
7  (87, 741)  1 
8 (111, 741)  1 
9 (123, 987)  1 
10 (321, 321)  1 
11 (321, 741)  1 
12 (324, 465)  1 
13 (324, 654)  1 
14 (324, 987)  1 
15 (342, 465)  1 
16 (654, 862)  1 
17 (741, 987)  2 
18 (987, 1232)  1 

The resulting bigram_frequency_combinations will be:

  Bigram freq 
0  (12, 32)  1 
1  (12, 34)  2 
2  (12, 87)  1 
3  (12, 111)  1 
4  (12, 123)  1 
5  (12, 324)  1 
6  (12, 342)  1 
7  (12, 465)  1 
8  (12, 654)  1 
9  (12, 741)  2 
10 (12, 987)  2 
11 (32, 324)  1 
12 (32, 342)  1 
13 (32, 465)  1 
14 (32, 654)  1 
15  (34, 87)  1 
16 (34, 111)  1 
17 (34, 123)  1 
18 (34, 741)  2 
19 (34, 987)  2 
20 (87, 111)  1 
21 (87, 741)  1 
22 (87, 987)  1 
23 (111, 741)  1 
24 (111, 987)  1 
25 (123, 741)  1 
26 (123, 987)  1 
27 (321, 321)  1 
28 (321, 324)  2 
29 (321, 654)  2 
30 (321, 741)  2 
31 (321, 862)  2 
32 (321, 987)  2 
33 (324, 342)  1 
34 (324, 465)  1 
35 (324, 654)  2 
36 (324, 741)  1 
37 (324, 862)  1 
38 (324, 987)  1 
39 (342, 465)  1 
40 (342, 654)  1 
41 (465, 654)  1 
42 (654, 741)  1 
43 (654, 862)  1 
44 (654, 987)  1 
45 (741, 862)  1 
46 (741, 987)  3 
47 (862, 987)  1 
48 (987, 1232)  1 

In the above case, it is grouped by both.

+0

Very nice answer, +1 – jezrael

+0

@Mr. What is the difference between bigram_frequency_consecutive and bigram_frequency_combinations? – Shubham

+0

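In 'bigram_frequency_consecutive', if a group has the product IDs '[27,35,99]', you get the bigrams '[(27,35),(35,99)]', whereas the bigrams formed by combinations are '[(27,35),(27,99),(35,99)]'. If you are doing any kind of product purchase analysis, you should use the combination bigrams. Since I don't know your exact use case, I gave both solutions: the first follows the code snippet you provided, and the second is the one most likely to be needed. –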
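
The difference described in the comment above can be reproduced with the same three product IDs. This is a small standalone sketch; the variable names are illustrative only.

```python
from itertools import combinations

product_ids = [27, 35, 99]

# Consecutive bigrams: each ID paired with the one right after it
consecutive = list(zip(product_ids, product_ids[1:]))

# Combination bigrams: every unordered pair within the group
all_pairs = list(combinations(product_ids, 2))

print(consecutive)  # [(27, 35), (35, 99)]
print(all_pairs)    # [(27, 35), (27, 99), (35, 99)]
```

For a group of n IDs, the consecutive version yields n - 1 pairs while the combination version yields n(n - 1)/2, so the two frequency tables grow at very different rates.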

1

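We will extract the values from product_id, create the bigrams, sort and deduplicate them, count them to get the frequencies, and then populate a data frame.

from collections import Counter 
import pandas as pd 

# assuming your data frame is called 'df' and each product_id value is a list of IDs 

bigrams = [list(zip(x,x[1:])) for x in df.product_id.values.tolist()] 
# sort each pair and convert it to a tuple so it is hashable and order-insensitive
bigram_set = [tuple(sorted(xx)) for x in bigrams for xx in x] 
freq_dict = Counter(bigram_set) 
# items() yields (bigram, count) pairs, one per output row
df_freq = pd.DataFrame(list(freq_dict.items()), columns=['bigram','freq']) 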
+0

When I run **freq_dict = Counter(bigram_set)** I get this error: **unhashable type: 'list'** – Shubham

+0

The 'tuple' function should take care of that. – James

+0

type(bigram_set) = list. – Shubham
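
For context on the error discussed in these comments: Counter needs hashable elements, so the outer container being a list is fine, but each element must be a tuple rather than a list. The following is a minimal sketch with made-up pairs, not the asker's actual data.

```python
from collections import Counter

pairs_as_lists = [[741, 987], [12, 34]]

try:
    Counter(pairs_as_lists)  # inner lists are mutable, so they cannot be keys
    failed = False
except TypeError as exc:
    failed = True
    print("TypeError:", exc)

# Converting each sorted pair to a tuple makes it usable as a Counter key
pairs_as_tuples = [tuple(sorted(p)) for p in pairs_as_lists]
print(Counter(pairs_as_tuples))
```

If sorted(...) is used without the surrounding tuple(...), every element is still a list and the same TypeError is raised.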