2017-09-16 50 views

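Update a SQLite database from a dataframe, using a condition to change column values or append new rows

I need to connect to an existing SQLite database and compare the values of a key column with the values in a dataframe. For every key that matches between the database and the dataframe, update the value of a specific column in that row. If a key exists in the dataframe but not in the database, append the corresponding row to the database. The target datasets are relatively large, so memory usage and performance are a concern (the database can be 20-60 GB, with ~20 columns and millions of rows).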

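I previously tried reading the database into a dataframe and merging the old and new dataframes in memory, but that proved expensive (a 5 GB dataset would typically grow to 20 GB in memory).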

I'm lost on the logic here; this is as far as I've gotten:

def update_column(tablename, key_value):
    c.execute('SELECT key FROM {}'.format(tablename))
    for row in c.fetchall():
        # populating this key value per row is challenging for me
        if row == key_value:
            c.execute('UPDATE {} SET last_seen = {} WHERE UUID = {}}'.format(tablename, hunt_date, key_value))
        else:
            df.to_sql(table_name, if_exists='append')

for index, row in reader.iterrows():
    key_value = row['key']
    update_column(tablename, key_value)

Example dataset:

Database:

Key        First_Seen  Last_Seen  Data1  Data2
Bigfoot    2015        2015       Blah   Blah
Loch_Ness  2016        2016       Blah   Blah
UFO        2016        2004       Blah   Blah

New data in the dataframe:

Key        First_Seen  Last_Seen  Data1  Data2
UFO        2017        2017       Blah   Blah
Tupac      2017        2017       Blah   Blah

Desired database output:

Key        First_Seen  Last_Seen  Data1  Data2
Bigfoot    2015        2015       Blah   Blah
Loch_Ness  2016        2016       Blah   Blah
UFO        2016        2017       Blah   Blah
Tupac      2017        2017       Blah   Blah

Answers


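As a suggestion, consider writing the dataframe to a temporary table in SQLite and running UPDATE and INSERT INTO queries against it. There is no need to iterate over millions of rows.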

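Because SQLite does not support UPDATE...JOIN, subqueries are required, such as an IN clause. There is no harm in running the append query every time, since it only appends rows with new keys.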

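df.to_sql('pandastable', conn, if_exists='replace')

# SQLite does not accept a qualified column in SET, and p is only visible
# inside its own subquery, so the new value comes from a correlated subquery.
c.execute("UPDATE finaltable "
          "SET last_seen = (SELECT p.last_seen FROM pandastable p "
          "                 WHERE p.[key] = finaltable.[key]) "
          "WHERE [key] IN (SELECT p.[key] FROM pandastable p);")
conn.commit()

# blah, blah, blah are placeholders for the remaining columns
c.execute("INSERT INTO finaltable ([key], first_seen, last_seen, blah, blah, blah) "
          "SELECT [key], first_seen, last_seen, blah, blah, blah "
          "FROM pandastable p "
          "WHERE NOT EXISTS "
          " (SELECT 1 FROM finaltable sub "
          " WHERE sub.[key] = p.[key]);")
conn.commit()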
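
To see the two statements work end to end, here is a minimal, self-contained sketch that runs them against the question's sample rows in an in-memory database. It uses only the standard-library sqlite3 module. The `executemany` call that fills `pandastable` stands in for `df.to_sql`, and the column names `data1`/`data2` and the helper names `build_demo` and `sync` are assumptions made for illustration. The UPDATE fetches each new last_seen value with a correlated subquery.

```python
import sqlite3

UPDATE_SQL = """
UPDATE finaltable
SET last_seen = (SELECT p.last_seen FROM pandastable p
                 WHERE p.[key] = finaltable.[key])
WHERE [key] IN (SELECT [key] FROM pandastable)
"""

INSERT_SQL = """
INSERT INTO finaltable ([key], first_seen, last_seen, data1, data2)
SELECT [key], first_seen, last_seen, data1, data2
FROM pandastable p
WHERE NOT EXISTS (SELECT 1 FROM finaltable sub WHERE sub.[key] = p.[key])
"""


def build_demo(conn):
    """Create both tables and load the sample rows from the question."""
    cols = "([key] TEXT, first_seen INTEGER, last_seen INTEGER, data1 TEXT, data2 TEXT)"
    conn.execute("CREATE TABLE finaltable " + cols)
    conn.execute("CREATE TABLE pandastable " + cols)
    conn.executemany("INSERT INTO finaltable VALUES (?, ?, ?, ?, ?)", [
        ("Bigfoot", 2015, 2015, "Blah", "Blah"),
        ("Loch_Ness", 2016, 2016, "Blah", "Blah"),
        ("UFO", 2016, 2004, "Blah", "Blah"),
    ])
    # Stand-in for df.to_sql('pandastable', conn, if_exists='replace')
    conn.executemany("INSERT INTO pandastable VALUES (?, ?, ?, ?, ?)", [
        ("UFO", 2017, 2017, "Blah", "Blah"),
        ("Tupac", 2017, 2017, "Blah", "Blah"),
    ])
    conn.commit()


def sync(conn):
    """Update matching keys, then append rows whose keys are new."""
    with conn:  # commits on success, rolls back on an exception
        conn.execute(UPDATE_SQL)
        conn.execute(INSERT_SQL)


conn = sqlite3.connect(":memory:")
build_demo(conn)
sync(conn)
rows = conn.execute(
    "SELECT [key], first_seen, last_seen FROM finaltable ORDER BY rowid"
).fetchall()
for row in rows:
    print(row)
```

Running `sync` a second time leaves the table unchanged, which is why the append query is safe to run every time.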

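If you connect with SQLAlchemy rather than a raw connection, consider running the action queries inside transactions instead of through a cursor, after the pandas call: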

import sqlalchemy
from sqlalchemy import text

...
engine = sqlalchemy.create_engine("sqlite:////path/to/database.db")

df.to_sql(name='pandastable', con=engine, if_exists='replace')

# SQL ACTIONS USING TRANSACTIONS
# text() wraps the raw SQL; SQLAlchemy 2.x no longer accepts plain strings in execute()
with engine.begin() as conn:
    conn.execute(text("UPDATE finaltable "
                      "SET last_seen = (SELECT p.last_seen FROM pandastable p "
                      "                 WHERE p.[key] = finaltable.[key]) "
                      "WHERE [key] IN (SELECT p.[key] FROM pandastable p);"))

with engine.begin() as conn:
    conn.execute(text("INSERT INTO finaltable ([key], first_seen, last_seen, blah, blah, blah) "
                      "SELECT [key], first_seen, last_seen, blah, blah, blah "
                      "FROM pandastable p "
                      "WHERE NOT EXISTS "
                      " (SELECT 1 FROM finaltable sub "
                      " WHERE sub.[key] = p.[key]);"))

engine.dispose()
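
As a version-dependent alternative that is not part of the answer above, SQLite 3.24 and later can combine the update and the append into a single UPSERT statement. This is only a sketch under two assumptions: the key column carries a UNIQUE or PRIMARY KEY constraint (ON CONFLICT needs one), and the bundled SQLite is recent enough. The three-column table layout is a cut-down version of the question's sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: [key] is the primary key, so ON CONFLICT([key]) has a constraint to target.
conn.execute(
    "CREATE TABLE finaltable ([key] TEXT PRIMARY KEY, first_seen INTEGER, last_seen INTEGER)"
)
conn.execute(
    "CREATE TABLE pandastable ([key] TEXT, first_seen INTEGER, last_seen INTEGER)"
)
conn.executemany("INSERT INTO finaltable VALUES (?, ?, ?)",
                 [("Bigfoot", 2015, 2015), ("UFO", 2016, 2004)])
conn.executemany("INSERT INTO pandastable VALUES (?, ?, ?)",
                 [("UFO", 2017, 2017), ("Tupac", 2017, 2017)])
conn.commit()

# "WHERE true" avoids a parsing ambiguity when an upsert is fed by a SELECT.
# excluded.last_seen is the value the conflicting row tried to insert.
upsert_sql = """
INSERT INTO finaltable ([key], first_seen, last_seen)
SELECT [key], first_seen, last_seen FROM pandastable WHERE true
ON CONFLICT([key]) DO UPDATE SET last_seen = excluded.last_seen
"""

with conn:  # commits on success, rolls back on an exception
    conn.execute(upsert_sql)

upserted = conn.execute(
    "SELECT [key], first_seen, last_seen FROM finaltable ORDER BY [key]"
).fetchall()
for row in upserted:
    print(row)
```

Existing keys keep their first_seen value and only last_seen is overwritten, matching the desired output in the question.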

I would do this update on the SQLite side.

First, save your DF as a temporary SQLite table, tmp:

df.to_sql('tmp', conn, if_exists='replace') 

sql = """ 
UPDATE table_name set last_seen = (SELECT t.last_seen 
            FROM tmp t 
            WHERE t.Key = table_name.key) 
WHERE EXISTS(
    SELECT * 
    FROM tmp 
    WHERE tmp.key = table_name.key 
) 
""" 

c.execute(sql)
conn.commit()
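
The WHERE EXISTS guard in that statement matters: without it, the correlated subquery returns NULL for every row whose key is absent from tmp, and those rows would have last_seen overwritten with NULL. A small sketch with the standard-library sqlite3 module shows the difference. The two-column tables are a cut-down layout assumed for brevity, and `last_seen_after` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name ([key] TEXT, last_seen INTEGER)")
conn.execute("CREATE TABLE tmp ([key] TEXT, last_seen INTEGER)")
conn.executemany("INSERT INTO table_name VALUES (?, ?)",
                 [("Bigfoot", 2015), ("UFO", 2004)])
conn.executemany("INSERT INTO tmp VALUES (?, ?)", [("UFO", 2017)])
conn.commit()

set_clause = (
    "UPDATE table_name SET last_seen = "
    "(SELECT t.last_seen FROM tmp t WHERE t.[key] = table_name.[key])"
)
guard = " WHERE EXISTS (SELECT 1 FROM tmp WHERE tmp.[key] = table_name.[key])"


def last_seen_after(sql):
    """Run sql, read back key -> last_seen, then roll back so each run starts fresh."""
    conn.execute(sql)
    result = dict(conn.execute("SELECT [key], last_seen FROM table_name"))
    conn.rollback()
    return result


guarded = last_seen_after(set_clause + guard)  # only UFO is touched
unguarded = last_seen_after(set_clause)        # Bigfoot loses its value
print(guarded)
print(unguarded)
```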