2017-08-02 125 views

Find matching substrings in two dataframes

I have two dataframes like this:

[in]print(training_df.head(n=10)) 

[out] 
          product_id 
transaction_id      
0000001     [P06, P09] 
0000002   [P01, P05, P06, P09] 
0000003     [P01, P06] 
0000004     [P01, P09] 
0000005     [P06, P09] 
0000006     [P02, P09] 
0000007   [P01, P06, P09, P10] 
0000008     [P03, P05] 
0000009     [P03, P09] 
0000010   [P03, P05, P06, P09] 

[in]print(testing_df.head(n=10)) 

[out] 
        product_id 
transaction_id     
001      [P01] 
002     [P01, P02] 
003    [P01, P02, P09] 
004     [P01, P03] 
005    [P01, P03, P05] 
006    [P01, P03, P07] 
007    [P01, P03, P08] 
008     [P01, P04] 
009    [P01, P04, P05] 
010    [P01, P04, P08] 

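Each row in testing_df is a possible "substring" of a row in training_df. I want to find all the matches and return, for each list in testing_df, the possible lists from training_df. It would be helpful to get back a dictionary whose keys are the transaction_id values from testing_df and whose values are all the possible "matches" from training_df. (Each matching list in training_df should be one value longer than the corresponding list in testing_df.)

I tried: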

# Find the substrings that match 
matches = [] 

for string in training_df:
    results = []
    for substring in testing_df:
        if substring in string:
            results.append(substring)
    if results:
        matches.append(results)

But this doesn't work; it only returns the column name 'product_id'.

I also tried:

# Initialize a dict to store the matches between incomplete testing_df and training_df
matches = {} 

# Compare the "incomplete" testing lists to the training set 
for line in testing_df.product_id:
    for line in training_df.product_id:
        if line in testing_df.product_id in line in training_df.product_id:
            matches[line] = training_df[training_df.product_id.str.contains(line)]

However, this raises the error TypeError: unhashable type: 'list'


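I think the problem is the brackets. For example, 'P01' is a substring of '[P01, P06]', but '[P01]' is not. You could try using substring[1:-1] instead of substring to get rid of the brackets. – csander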


@csander I tried 'matches = [] for string in training_df[1:-1]: results = [] for substring in testing_df[1:-1]: if substring in string: results.append(substring) if results: matches.append(results)' but that didn't work either – zsad512


No, you don't want to slice the dataframes; you want to slice the substring – csander

Answers

1

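I think the problem is the brackets. The issue is that in checks whether an element is in a list, not whether one list is a subset of another. You can convert both lists to sets and then check whether one is a subset of the other. You can also use advanced indexing to preserve the transaction_id.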
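
To illustrate the difference between membership and a subset test, here is a minimal sketch with plain Python lists. The names small and large are illustrative only and do not come from the question.

```python
small = ['P01']
large = ['P01', 'P06']

# `in` tests whether a single element is present. The elements of `large`
# are strings, so the list ['P01'] itself is not one of them.
list_in_list = small in large
string_in_list = 'P01' in large

# Converting to sets allows a real subset test.
is_subset = set(small).issubset(set(large))

print(list_in_list, string_in_list, is_subset)
```

The subset test is the one that answers "does the training list contain every product in the testing list?".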

import pandas as pd

training_df = pd.DataFrame([
    ['0000001', ['P06', 'P09']], 
    ['0000002', ['P01', 'P05', 'P06', 'P09']], 
    ['0000003', ['P01', 'P06']], 
    ['0000004', ['P01', 'P09']], 
    ['0000005', ['P06', 'P09']], 
    ['0000006', ['P02', 'P09']], 
    ['0000007', ['P01', 'P06', 'P09', 'P10']], 
    ['0000008', ['P03', 'P05']], 
    ['0000009', ['P03', 'P09']], 
    ['0000010', ['P03', 'P05', 'P06', 'P09']], 
], columns=['transaction_id', 'product_id']) 

testing_df = pd.DataFrame([ 
    ['001', ['P01']], 
    ['002', ['P01', 'P02']], 
    ['003', ['P01', 'P02', 'P09']], 
    ['004', ['P01', 'P03']], 
    ['005', ['P01', 'P03', 'P05']], 
    ['006', ['P01', 'P03', 'P07']], 
    ['007', ['P01', 'P03', 'P08']], 
    ['008', ['P01', 'P04']], 
    ['009', ['P01', 'P04', 'P05']], 
    ['010', ['P01', 'P04', 'P08']], 
], columns=['transaction_id', 'product_id']) 

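# For each testing list, build a boolean mask over training_df marking the
# rows whose product list contains every product in the testing list.
matches = {}
for testing_id in testing_df.product_id:
    testing_id_set = set(testing_id)
    contains_id = training_df.product_id.apply(lambda products: testing_id_set.issubset(set(products)))
    # Lists are unhashable, so the string form of the list is used as the key.
    matches[str(testing_id)] = contains_id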
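
The same idea can be pushed a little further to get the dictionary the question asks for. The following is only a sketch under some assumptions: transaction_id is the index (as in the question's printed output, not as in the definitions above), no list contains duplicate products, and find_matches is a hypothetical helper name. It uses a small subset of the data to stay short.

```python
import pandas as pd

# Small illustrative frames, indexed by transaction_id (assumption).
training_df = pd.DataFrame(
    {'product_id': [['P06', 'P09'], ['P01', 'P06'], ['P01', 'P09'], ['P02', 'P09']]},
    index=pd.Index(['0000001', '0000003', '0000004', '0000006'], name='transaction_id'),
)
testing_df = pd.DataFrame(
    {'product_id': [['P01'], ['P01', 'P02']]},
    index=pd.Index(['001', '002'], name='transaction_id'),
)


def find_matches(testing, training):
    """Map each testing transaction_id to the training lists that
    contain the testing list and are exactly one product longer."""
    training_sets = training['product_id'].apply(set)
    matches = {}
    for test_id, products in testing['product_id'].items():
        wanted = set(products)
        # True where the training set contains every wanted product
        # and has exactly one additional product.
        mask = training_sets.apply(
            lambda s: wanted.issubset(s) and len(s) == len(wanted) + 1
        )
        matches[test_id] = training.loc[mask, 'product_id'].tolist()
    return matches


matches = find_matches(testing_df, training_df)
print(matches)
```

Keying the dictionary by transaction_id avoids the unhashable-list problem, and an empty list marks a testing row with no candidate in training_df.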

testing_df and training_df are already formatted as pandas dataframes. When I try 'matches = {} for testing_id in testing_df.product_id: matches[testing_id] = training_df[training_df.product_id.str.contains(testing_id[1:-1])]' it simply returns an empty dictionary – zsad512


I included the definitions to show the data I was testing with. If it doesn't work for you, what do your dataframes look like? – csander


The brackets are on the outside, so for example in row 1, transaction_id 001, the product_id column reads: '["P01", "P02", "P03"]' – zsad512
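
If that comment means the product_id cells hold strings that merely look like lists (an assumption; the thread does not confirm it), they would need to be parsed into real lists before any set-based comparison. A minimal sketch using a hypothetical frame named df:

```python
import ast

import pandas as pd

# Hypothetical frame whose product_id values are strings, not lists.
df = pd.DataFrame({'product_id': ["['P01', 'P02', 'P03']", "['P01']"]})

# ast.literal_eval safely evaluates a Python literal such as a list of
# strings; values that are already lists are left untouched.
df['product_id'] = df['product_id'].apply(
    lambda value: ast.literal_eval(value) if isinstance(value, str) else value
)

parsed = df['product_id'].tolist()
print(parsed)
```

Checking type(df.product_id.iloc[0]) on the real data would show whether this step is needed at all.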