Finding matching "children" between two dataframes
I have two dataframes like this:
[in]print(training_df.head(n=10))
[out]
product_id
transaction_id
0000001 [P06, P09]
0000002 [P01, P05, P06, P09]
0000003 [P01, P06]
0000004 [P01, P09]
0000005 [P06, P09]
0000006 [P02, P09]
0000007 [P01, P06, P09, P10]
0000008 [P03, P05]
0000009 [P03, P09]
0000010 [P03, P05, P06, P09]
[in]print(testing_df.head(n=10))
[out]
product_id
transaction_id
001 [P01]
002 [P01, P02]
003 [P01, P02, P09]
004 [P01, P03]
005 [P01, P03, P05]
006 [P01, P03, P07]
007 [P01, P03, P08]
008 [P01, P04]
009 [P01, P04, P05]
010 [P01, P04, P08]
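Each row in testing_df is a possible "child" of a row in training_df. I want to find all matches and return, for each list in testing_df, the possible training_df lists. It would be helpful if I could return a dictionary whose keys are the transaction_id values from testing_df and whose values are all the possible "matches" in training_df. (Each matching list in training_df should be one value longer than the corresponding list in testing_df.)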
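To make the matching rule concrete, here is a minimal sketch on plain Python lists. It assumes a "child" means that every product in the testing list also appears in the training list, and that the training list is exactly one item longer. The helper name `is_child` is hypothetical, and the example lists are only in the style of the samples above.

```python
def is_child(child, parent):
    """Return True if `parent` holds every item of `child` plus exactly one more.

    Assumes items within each list are unique, as in the sample data.
    """
    return len(parent) == len(child) + 1 and set(child) <= set(parent)


# One extra item, and P01 is present
print(is_child(["P01"], ["P01", "P06"]))                # True
# Three extra items, so the length rule fails
print(is_child(["P01"], ["P01", "P05", "P06", "P09"]))  # False
# Right length, but P02 is missing from the longer list
print(is_child(["P01", "P02"], ["P01", "P06", "P09"]))  # False
```

Comparing sets rather than strings sidesteps any question of brackets or item order.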
I tried:
# Find the substrings that match
matches = []
for string in training_df:
    results = []
    for substring in testing_df:
        if substring in string:
            results.append(substring)
    if results:
        matches.append(results)
But this doesn't work; it only returns the column name 'PRODUCT_ID'.
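A likely explanation, shown as a small sketch: iterating directly over a pandas DataFrame yields its column labels rather than its rows, so both loops above only ever see the column name. The two-row frame below is made up for illustration and assumes the `product_id` column holds Python lists.

```python
import pandas as pd

# Hypothetical two-row stand-in for training_df
training_df = pd.DataFrame(
    {"product_id": [["P06", "P09"], ["P01", "P06"]]},
    index=pd.Index(["0000001", "0000003"], name="transaction_id"),
)

# Iterating the DataFrame itself yields column labels
column_labels = [label for label in training_df]
print(column_labels)  # ['product_id']

# Iterating the column (a Series) yields the per-row lists
row_lists = [products for products in training_df["product_id"]]
print(row_lists)  # [['P06', 'P09'], ['P01', 'P06']]
```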
I also tried:
# Initialize a list to store the matches between incomplete testing_df and training_df
matches = {}
# Compare the "incomplete" testing lists to the training set
for line in testing_df.product_id:
    for line in training_df.product_id:
        if line in testing_df.product_id in line in training_df.product_id:
            matches[line] = training_df[training_df.product_id.str.contains(line)]
However, this raises the error TypeError: unhashable type: 'list'
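One plausible cause of this error is that a Python list cannot be hashed, so it cannot serve as a dictionary key (as in `matches[line]`) or be looked up with `in` against an index. Below is a sketch of one way around that, keyed by `transaction_id` as the question describes. It assumes `product_id` holds Python lists and `transaction_id` is the index; the small frames and the helper name `find_parents` are illustrative only.

```python
import pandas as pd

# Small stand-ins for the real frames (assumption: product_id holds lists)
training_df = pd.DataFrame(
    {"product_id": [["P06", "P09"], ["P01", "P06"], ["P01", "P09"], ["P03", "P05"]]},
    index=pd.Index(["0000001", "0000003", "0000004", "0000008"], name="transaction_id"),
)
testing_df = pd.DataFrame(
    {"product_id": [["P01"], ["P01", "P02"]]},
    index=pd.Index(["001", "002"], name="transaction_id"),
)


def find_parents(testing, training):
    """Map each testing transaction_id to the training lists that contain
    all of its products plus exactly one more."""
    # Build each training set once instead of once per comparison
    training_sets = [(products, set(products)) for products in training["product_id"]]
    matches = {}
    for transaction_id, child in testing["product_id"].items():
        child_set = set(child)
        matches[transaction_id] = [
            products
            for products, product_set in training_sets
            if len(product_set) == len(child_set) + 1 and child_set <= product_set
        ]
    return matches


matches = find_parents(testing_df, training_df)
print(matches)
# {'001': [['P01', 'P06'], ['P01', 'P09']], '002': []}
```

The keys are the hashable `transaction_id` strings, so no list is ever hashed. The comparison is quadratic in the number of rows, which may matter for large frames.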
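I think the problem is the brackets. For example, "P01" is a substring of "[P01, P06]", but "[P01]" is not. You could try using substring[1:-1] instead of substring to get rid of the brackets. – csander
@csander I tried `matches = [] for string in training_df[1:-1]: results = [] for substring in testing_df[1:-1]: if substring in string: results.append(substring) if results: matches.append(results)` but that didn't work either – zsad512
No, you don't want to slice the dataframe; you want to slice the substring – csander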