2017-05-04 138 views
0

Merge 2 DataFrames by text similarity using pandas

I am running a query like the one below:

select * 
from sd_sms LEFT JOIN categories_phrases 
    on sd_sms.body like concat('%',categories_phrases.phrase1,'%') 
    and sd_sms.body like concat('%',categories_phrases.phrase2,'%') 
    and sd_sms.body like concat('%',categories_phrases.phrase3,'%') 
    and sd_sms.body like concat('%',categories_phrases.phrase4,'%') 

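Basically, it joins the two tables whenever a field in table A contains several phrases from table B, but now I need to do this in Python.

Is there a simple way to merge these two tables with pandas so that it gives me the same result?

Please advise.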

+1

Can you provide some sample data and your desired output? – Allen

+0

You can download the sample data from this link: https://drive.google.com/file/d/0B9sctdRURN0PSXk2ZUxGMU9JdU0/view?usp=sharing –

+0

Basically I need something like https://blog.ouseful.info/2012/09/26/merge-data-sets-based-partially-matched-data-elements/ –

Answers

0

This code sample works for text data, with a LIKE condition in the JOIN clause.

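from pandasql import sqldf
import pandas as pd

# Helper that runs a query against the DataFrames defined at module level
pysqldf = lambda q: sqldf(q, globals())

df1 = pd.DataFrame({"name": ['Antony', 'Mark', 'Jacob'], "age": [11, 12, 13]})
df2 = pd.DataFrame({"name": ['Antony', 'Gill', 'John']})

# SQLite concatenates strings with ||, so the LIKE pattern is built from each row of df2
q = """SELECT * FROM df1 LEFT JOIN df2 ON df1.name LIKE '%' || df2.name || '%'"""

df = pysqldf(q)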

These are just dummy DataFrames with sample data, but I applied a condition similar to the one in your question.

Hope it helps.
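For readers who would rather stay in pandas, the four-phrase join from the question can be approximated with a cross join followed by a substring filter. This is only a sketch under assumptions: the sample rows and the `category` and `sms_id` columns are made up for illustration (only `body` and `phrase1` to `phrase4` come from the question), matching is case-insensitive, and the `%` and `_` wildcards inside phrases are not interpreted.

```python
import pandas as pd

# Hypothetical sample data; only the column names body and phrase1..phrase4
# come from the question.
sd_sms = pd.DataFrame({"body": [
    "your card payment of 20 USD was declined",
    "flight booking confirmed for monday",
    "hello there",
]})
categories_phrases = pd.DataFrame({
    "category": ["payment_declined", "flight"],
    "phrase1": ["card", "flight"],
    "phrase2": ["payment", "booking"],
    "phrase3": ["declined", "confirmed"],
    "phrase4": ["USD", "monday"],
})
phrase_cols = ["phrase1", "phrase2", "phrase3", "phrase4"]

# Give every message an id so matches can be joined back afterwards.
left = sd_sms.reset_index().rename(columns={"index": "sms_id"})

# Cross join through a constant key: every message paired with every phrase row.
pairs = (
    left.assign(_key=1)
    .merge(categories_phrases.assign(_key=1), on="_key")
    .drop(columns="_key")
)

def contains_all(row):
    """True when the body contains every phrase; a missing phrase never matches."""
    body = str(row["body"]).lower()
    return all(
        pd.notna(row[c]) and str(row[c]).lower() in body for c in phrase_cols
    )

matches = pairs[pairs.apply(contains_all, axis=1)]

# Left join back so messages without any match are kept with NaN columns.
result = left.merge(matches.drop(columns="body"), on="sms_id", how="left")
print(result[["body", "category"]])
```

The cross join builds one row per message and phrase-row pair, so memory grows with the product of the two table sizes; for large tables the pandasql query may be the more practical route.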

+0

Nice example, it almost works. I tried to play around with it and modified the code as follows: from pandasql import * import pandas as pd pysqldf = lambda q: sqldf(q, globals()) df1 = pd.DataFrame({"name": ['Antony', 'Mark', 'Jacob'], "age": [11,12,13]}) df2 = pd.DataFrame({"name": ['Anto', 'Mark Gill', 'John']}) q = """ SELECT df1.name, df1.age, df2.name FROM df1 LEFT JOIN df2 ON df1.name LIKE '%{}%' """ q.format(df2.name) df = pysqldf(q) df The last column shows None. Can you help improve it? –

+0

pandasql uses SQLite syntax, so you have to concatenate the strings in the LIKE clause: 'q = """SELECT * FROM df1 LEFT JOIN df2 ON df1.name LIKE '%' || df2.name || '%'"""'. I have edited the answer above accordingly. –

+0

Thanks for the answer above. –
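The modified snippet in the comments returns None for two reasons: `str.format` returns a new string rather than changing `q` in place, so the query still contained the literal pattern '%{}%', and formatting a whole pandas Series into SQL would insert its printed form, not one value per row. The sketch below runs the commenter's data through SQLite's `||` concatenation. It uses the standard-library `sqlite3` module together with `DataFrame.to_sql` and `pd.read_sql_query` instead of pandasql, which is a substitution made here so the example has no extra dependency; the alias `matched_name` is also an assumption.

```python
import sqlite3

import pandas as pd

# Data taken from the comment thread above.
df1 = pd.DataFrame({"name": ["Antony", "Mark", "Jacob"], "age": [11, 12, 13]})
df2 = pd.DataFrame({"name": ["Anto", "Mark Gill", "John"]})

# Load both DataFrames into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
df1.to_sql("df1", conn, index=False)
df2.to_sql("df2", conn, index=False)

# The pattern is built inside SQL, once per row of df2, with the || operator.
q = """
SELECT df1.name, df1.age, df2.name AS matched_name
FROM df1
LEFT JOIN df2 ON df1.name LIKE '%' || df2.name || '%'
ORDER BY df1.age
"""
result = pd.read_sql_query(q, conn)
conn.close()

print(result)
```

With this data only Antony finds a partner, because 'Antony' contains 'Anto'. 'Mark' stays unmatched: the pattern is built from df2, and 'Mark' does not contain the longer string 'Mark Gill'. To match in that direction as well, the ON clause would need a second condition with the operands swapped.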

0

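I can't tell what kind of data you have, because your question is missing some sample data; but if you need to query pandas DataFrames with SQL-like syntax, you can try the pandasql package. It is built on SQLAlchemy and runs the queries in SQLite by default.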

from pandasql import sqldf, load_meat, load_births
import pandas as pd

pysqldf = lambda q: sqldf(q, globals()) 

q = """ 
    SELECT 
    m.date 
    , m.beef 
    , b.births 
    FROM 
    meat m 
    LEFT JOIN 
    births b 
    ON m.date = b.date 
    WHERE 
    m.date > '1974-12-31'; 
    """ 

meat = load_meat() 
births = load_births() 

df = pysqldf(q) 
df 
 

date beef births 
0 1975-01-01 00:00:00.000000 2106.0 265775.0 
1 1975-02-01 00:00:00.000000 1845.0 241045.0 
2 1975-03-01 00:00:00.000000 1891.0 268849.0 
3 1975-04-01 00:00:00.000000 1895.0 247455.0 
4 1975-05-01 00:00:00.000000 1849.0 254545.0 
5 1975-06-01 00:00:00.000000 1849.0 254096.0 
6 1975-07-01 00:00:00.000000 1916.0 275163.0 
7 1975-08-01 00:00:00.000000 1961.0 281300.0 
8 1975-09-01 00:00:00.000000 2065.0 270738.0 
9 1975-10-01 00:00:00.000000 2270.0 265494.0 
10 1975-11-01 00:00:00.000000 1970.0 251973.0 
11 1975-12-01 00:00:00.000000 2055.0 260532.0 
12 1976-01-01 00:00:00.000000 2208.0 257455.0 
13 1976-01-01 00:00:00.000000 2208.0 259173.0 
14 1976-02-01 00:00:00.000000 1966.0 236551.0 
15 1976-02-01 00:00:00.000000 1966.0 238153.0 
16 1976-03-01 00:00:00.000000 2318.0 257951.0 
17 1976-03-01 00:00:00.000000 2318.0 261608.0 
18 1976-04-01 00:00:00.000000 2015.0 246469.0 
19 1976-04-01 00:00:00.000000 2015.0 250992.0 
20 1976-05-01 00:00:00.000000 1969.0 256986.0 
21 1976-05-01 00:00:00.000000 1969.0 261572.0 
22 1976-06-01 00:00:00.000000 2161.0 250525.0 
23 1976-06-01 00:00:00.000000 2161.0 255734.0 
24 1976-07-01 00:00:00.000000 2111.0 279630.0 
25 1976-07-01 00:00:00.000000 2111.0 279744.0 
26 1976-08-01 00:00:00.000000 2233.0 279937.0 
27 1976-08-01 00:00:00.000000 2233.0 286496.0 
28 1976-09-01 00:00:00.000000 2274.0 273750.0 
29 1976-09-01 00:00:00.000000 2274.0 283718.0 
... ... ... ... 
533 2010-06-01 00:00:00.000000 2320.0 NaN 
534 2010-07-01 00:00:00.000000 2229.6 NaN 
535 2010-08-01 00:00:00.000000 2286.6 NaN 
536 2010-09-01 00:00:00.000000 2252.2 NaN 
537 2010-10-01 00:00:00.000000 2234.9 NaN 
538 2010-11-01 00:00:00.000000 2235.5 NaN 
539 2010-12-01 00:00:00.000000 2270.9 NaN 
540 2011-01-01 00:00:00.000000 2122.9 356457.0 
541 2011-02-01 00:00:00.000000 2020.4 338521.0 
542 2011-03-01 00:00:00.000000 2266.2 350630.0 
543 2011-04-01 00:00:00.000000 2052.5 346397.0 
544 2011-05-01 00:00:00.000000 2131.9 354886.0 
545 2011-06-01 00:00:00.000000 2375.0 348587.0 
546 2011-07-01 00:00:00.000000 2134.1 375384.0 
547 2011-08-01 00:00:00.000000 2386.9 373333.0 
548 2011-09-01 00:00:00.000000 2215.2 367965.0 
549 2011-10-01 00:00:00.000000 2215.1 357875.0 
550 2011-11-01 00:00:00.000000 2148.8 323788.0 
551 2011-12-01 00:00:00.000000 2126.3 353871.0 
552 2012-01-01 00:00:00.000000 2113.8 337980.0 
553 2012-02-01 00:00:00.000000 2009.0 316641.0 
554 2012-03-01 00:00:00.000000 2159.8 347803.0 
555 2012-04-01 00:00:00.000000 1990.6 337272.0 
556 2012-05-01 00:00:00.000000 2232.0 345257.0 
557 2012-06-01 00:00:00.000000 2252.1 346971.0 
558 2012-07-01 00:00:00.000000 2200.8 368450.0 
559 2012-08-01 00:00:00.000000 2367.5 359554.0 
560 2012-09-01 00:00:00.000000 2016.0 361922.0 
561 2012-10-01 00:00:00.000000 2343.7 347625.0 
562 2012-11-01 00:00:00.000000 2206.6 320195.0 

Here is the repo: https://github.com/yhat/pandasql and a nice quick-start tutorial: http://blog.yhat.com/posts/pandasql-sql-for-pandas-dataframes.html

+0

My data is text, and I want to join two DataFrames using something like the SQL LIKE clause. Can that be done? –

+0

Does this support the SQL LIKE clause? My boss tried this, but it didn't work. –

+0

I just posted an example with a LIKE clause.