Find matching sublists between two DataFrames

I have two DataFrames like this:

[in]print(training_df.head(n=10)) 

[out] 
          product_id 
transaction_id      
0000001     [P06, P09] 
0000002   [P01, P05, P06, P09] 
0000003     [P01, P06] 
0000004     [P01, P09] 
0000005     [P06, P09] 
0000006     [P02, P09] 
0000007   [P01, P06, P09, P10] 
0000008     [P03, P05] 
0000009     [P03, P09] 
0000010   [P03, P05, P06, P09] 

[in]print(testing_df.head(n=10)) 

[out] 
        product_id 
transaction_id     
001      [P01] 
002     [P01, P02] 
003    [P01, P02, P09] 
004     [P01, P03] 
005    [P01, P03, P05] 
006    [P01, P03, P07] 
007    [P01, P03, P08] 
008     [P01, P04] 
009    [P01, P04, P05] 
010    [P01, P04, P08]

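Each row in testing_df is a possible "child" of a row in training_df. I want to find all matches and return, for each list in testing_df, the possible training_df lists. It would be helpful if I could return a dictionary whose keys are the transaction_id values from testing_df and whose values are all the possible "matches" in training_df. (Each matching list in training_df should be one value longer than the corresponding list in testing_df.)

I tried: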

# Find the substrings that match 
matches = [] 

for string in training_df:
    results = []
    for substring in testing_df:
        if substring in string:
            results.append(substring)
    if results:
        matches.append(results)

But this doesn't work; it only returns the column name 'product_id'.

I also tried:

# Initialize a list to store the matches between incomplete testing_df and training_df 
matches = {} 

# Compare the "incomplete" testing lists to the training set 
for line in testing_df.product_id:
    for line in training_df.product_id:
        if line in testing_df.product_id in line in training_df.product_id:
            matches[line] = training_df[training_df.product_id.str.contains(line)]

However, this raises the error TypeError: unhashable type: 'list'
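The `TypeError` indicates that a list ended up where a hashable value is required, for example as a dictionary key in `matches[line]`. Below is a small sketch of that behaviour and two common workarounds (a tuple key, or keying by an identifier). The variable names and sample values are illustrative only and are not taken from the asker's real data.

```python
products = ['P01', 'P02']
matches = {}

try:
    # Lists are mutable, so Python refuses to hash them as dict keys.
    matches[products] = 'some value'
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Workaround 1: convert the list to an immutable tuple.
matches[tuple(products)] = 'some value'

# Workaround 2: key by an identifier such as the transaction ID.
matches['002'] = products

print(matches)
```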

Source

2017-08-02 zsad512

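I think the problem is the brackets. For example, 'P01' is a substring of '[P01, P06]', but '[P01]' is not. You could try using substring[1:-1] instead of substring to get rid of the brackets. – csander

@csander I tried `matches = [] for string in training_df[1:-1]: results = [] for substring in testing_df[1:-1]: if substring in string: results.append(substring) if results: matches.append(results)` but that didn't work either – zsad512

No, you don't want to slice the DataFrame, you want to slice the substring – csander

I thought the problem was the brackets. The problem is that `in` checks whether an element is in a list, not whether one list is a subset of another. You can convert both lists to sets and then check whether one is a subset of the other. You can also use advanced indexing to preserve the transaction_id: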

import pandas as pd

training_df = pd.DataFrame([
    ['0000001', ['P06', 'P09']], 
    ['0000002', ['P01', 'P05', 'P06', 'P09']], 
    ['0000003', ['P01', 'P06']], 
    ['0000004', ['P01', 'P09']], 
    ['0000005', ['P06', 'P09']], 
    ['0000006', ['P02', 'P09']], 
    ['0000007', ['P01', 'P06', 'P09', 'P10']], 
    ['0000008', ['P03', 'P05']], 
    ['0000009', ['P03', 'P09']], 
    ['0000010', ['P03', 'P05', 'P06', 'P09']], 
], columns=['transaction_id', 'product_id']) 

testing_df = pd.DataFrame([ 
    ['001', ['P01']], 
    ['002', ['P01', 'P02']], 
    ['003', ['P01', 'P02', 'P09']], 
    ['004', ['P01', 'P03']], 
    ['005', ['P01', 'P03', 'P05']], 
    ['006', ['P01', 'P03', 'P07']], 
    ['007', ['P01', 'P03', 'P08']], 
    ['008', ['P01', 'P04']], 
    ['009', ['P01', 'P04', 'P05']], 
    ['010', ['P01', 'P04', 'P08']], 
], columns=['transaction_id', 'product_id']) 

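matches = {}
for testing_id in testing_df.product_id:
    # Each product_id value is a list of products; convert it to a set
    testing_id_set = set(testing_id)
    # Boolean Series: True where the training row contains every testing product
    contains_id = training_df.product_id.apply(lambda ids: testing_id_set.issubset(set(ids)))
    # Lists are unhashable, so the string form of the list is used as the key
    matches[str(testing_id)] = contains_id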
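Building on the set-based check, here is a minimal sketch of how the dictionary requested in the question might be built: keys are testing transaction IDs, and values are the training transaction IDs whose product lists contain every testing product plus exactly one more. The function name `find_parent_transactions`, the `extra_items` parameter, and the plain-dict inputs are assumptions made for illustration. With DataFrames shaped like the ones defined here, such dicts could be obtained with `df.set_index('transaction_id')['product_id'].to_dict()`.

```python
def find_parent_transactions(testing, training, extra_items=1):
    """Map each testing transaction ID to the training transaction IDs whose
    product lists are supersets with exactly `extra_items` more products.

    Both arguments are dicts of transaction_id -> list of product IDs
    (a hypothetical input shape chosen for this sketch).
    """
    # Convert each training list to a set once, up front.
    training_sets = {tid: set(products) for tid, products in training.items()}

    matches = {}
    for test_id, products in testing.items():
        test_set = set(products)
        matches[test_id] = [
            train_id
            for train_id, train_set in training_sets.items()
            if test_set <= train_set
            and len(train_set) == len(test_set) + extra_items
        ]
    return matches


# A few sample rows taken from the question's data.
training = {
    '0000001': ['P06', 'P09'],
    '0000002': ['P01', 'P05', 'P06', 'P09'],
    '0000003': ['P01', 'P06'],
    '0000004': ['P01', 'P09'],
    '0000008': ['P03', 'P05'],
}
testing = {
    '001': ['P01'],
    '002': ['P01', 'P02'],
    '005': ['P01', 'P03', 'P05'],
}

matches = find_parent_transactions(testing, training)
print(matches)
# {'001': ['0000003', '0000004'], '002': [], '005': []}
```

Keying by transaction ID rather than by the list itself avoids the unhashable-list problem, and the length check enforces the "one value longer" condition from the question.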

Source

2017-08-02 20:30:41 csander

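testing_df and training_df are already formatted as pandas DataFrames. When I try `matches = {} for testing_id in testing_df.product_id: matches[testing_id] = training_df[training_df.product_id.str.contains(testing_id[1:-1])]` it simply returns an empty dictionary – zsad512

I included the definitions to show the data I was testing with. If it doesn't work for you, what does your DataFrame look like? – csander

The brackets are on the outside, so for example in row 1, transaction_id 001, the product_id column looks like this: `["P01", "P02", "P03"]` – zsad512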
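If the `product_id` column actually holds strings such as `'["P01", "P02", "P03"]'` rather than Python lists, which the last comment may be describing, the values would need to be parsed before any set comparison. Here is a minimal sketch using the standard library's `ast.literal_eval`. The helper name `parse_products` is hypothetical, and the sketch assumes every string is a valid Python list literal.

```python
import ast


def parse_products(value):
    """Return a list of product IDs from either a list or its string form."""
    if isinstance(value, str):
        # literal_eval safely evaluates a Python literal such as '["P01", "P02"]'.
        return list(ast.literal_eval(value))
    return list(value)


raw = '["P01", "P02", "P03"]'
products = parse_products(raw)
print(products)                          # ['P01', 'P02', 'P03']
print({'P01', 'P02'} <= set(products))   # True
```

With a DataFrame, this could be applied as `testing_df['product_id'] = testing_df['product_id'].apply(parse_products)` before running the subset check.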
