Selecting records incrementally from MySQL and saving them as CSV in Python

I need to query a database for some data analysis, and I have more than 20 million records. My access to the database is restricted, and queries time out after 8 minutes. So I am trying to split the query into smaller chunks and save the results to Excel for later processing.

This is what I have so far. How can I make Python loop the query x records at a time (e.g. 1,000,000), storing them in the same CSV until all (20+ million) records have been retrieved?

import MySQLdb
import csv

db_main = MySQLdb.connect(host="localhost",
                          port=1234,
                          user="user1",
                          passwd="test123",
                          db="mainDB")

cur = db_main.cursor()

cur.execute("""SELECT a.user_id, b.last_name, b.first_name,
        FLOOR(DATEDIFF(CURRENT_DATE(), c.birth_date)/365) age,
        DATEDIFF(b.left_date, b.join_date) workDays
    FROM users a
    INNER JOIN users_signup b ON a.user_id = b.user_id
    INNER JOIN users_personal c ON a.user_id = c.user_id
    INNER JOIN
    (
        SELECT DISTINCT d.user_id FROM users_signup d
        WHERE (d.user_id >= 1 AND d.user_id < 1000000)
        AND d.join_date >= '2013-01-01' AND d.join_date < '2014-01-01'
    ) AS t ON a.user_id = t.user_id""")

result = cur.fetchall()
c = csv.writer(open("temp.csv", "wb"))
for row in result:
    c.writerow(row)

Maybe try using LIMIT and OFFSET with the SQL query? –

Answers


Your code should look like the following. You can tune its performance via the per_query variable:

c = csv.writer(open("temp.csv", "wb"))
offset = 0
per_query = 10000
while True:
    cur.execute("__the_query__ LIMIT %s OFFSET %s", (per_query, offset))

    rows = cur.fetchall()
    if len(rows) == 0:
        break  # escape the loop at the end of the data

    for row in rows:  # iterate the batch already fetched, not a second fetchall()
        c.writerow(row)

    offset += per_query
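
One caveat with OFFSET-based paging on a table this size: MySQL still has to scan and discard every row before the requested offset, so each successive batch gets slower. If user_id is an indexed key (an assumption here; the schema isn't shown), keyset pagination avoids the rescan by remembering where the previous batch ended. A minimal sketch against a deliberately simplified single-table query:

c = csv.writer(open("temp.csv", "wb"))
per_query = 10000
last_id = 0  # resume point: the largest user_id written so far
while True:
    # The indexed WHERE clause jumps straight to the next batch
    # instead of scanning past `offset` rows on every query.
    cur.execute("""SELECT user_id, join_date FROM users_signup
                   WHERE user_id > %s
                   ORDER BY user_id
                   LIMIT %s""", (last_id, per_query))
    rows = cur.fetchall()
    if not rows:
        break
    for row in rows:
        c.writerow(row)
    last_id = rows[-1][0]  # first selected column is user_id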

Here is an example implementation that may help you:

from contextlib import contextmanager
import MySQLdb
import csv

connection_args = {"host": "localhost", "port": 1234, "user": "user1", "passwd": "test123", "db": "mainDB"}

@contextmanager
def get_cursor(**kwargs):
    ''' The context manager automatically closes
    the cursor and the connection.
    '''
    db = MySQLdb.connect(**kwargs)
    cursor = db.cursor()
    try:
        yield cursor
    finally:
        cursor.close()
        db.close()

# note the placeholders for the limits
query = """ SELECT a.user_id, b.last_name, b.first_name,
        FLOOR(DATEDIFF(CURRENT_DATE(), c.birth_date)/365) age,
        DATEDIFF(b.left_date, b.join_date) workDays
    FROM users a
    INNER JOIN users_signup b ON a.user_id = b.user_id
    INNER JOIN users_personal c ON a.user_id = c.user_id
    INNER JOIN
    (
        SELECT DISTINCT d.user_id FROM users_signup d
        WHERE (d.user_id >= 1 AND d.user_id < 1000000)
        AND d.join_date >= '2013-01-01' AND d.join_date < '2014-01-01'
    ) AS t ON a.user_id = t.user_id LIMIT %s OFFSET %s """

csv_file = csv.writer(open("temp.csv", "wb"))

# One million at a time
STEP = 1000000
for step_nb in xrange(0, 20):
    with get_cursor(**connection_args) as cursor:
        cursor.execute(query, (STEP, step_nb * STEP))  # row count, then offset
        for row in cursor:  # iterate the cursor instead of fetching everything into memory
            csv_file.writerow(row)

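One note on the "instead of fetching everything into memory" comment: MySQLdb's default cursor still buffers the whole result set on the client as soon as execute() returns, so iterating it only avoids building a second list. To genuinely stream rows from the server, MySQLdb provides a server-side cursor class; a minimal sketch reusing connection_args from above, with a simplified query:

import MySQLdb
import MySQLdb.cursors
import csv

# SSCursor streams results from the server row by row instead of
# buffering the entire result set in client memory.
db = MySQLdb.connect(cursorclass=MySQLdb.cursors.SSCursor, **connection_args)
cursor = db.cursor()
cursor.execute("SELECT user_id FROM users_signup")  # simplified query for illustration
csv_file = csv.writer(open("temp2.csv", "wb"))
for row in cursor:  # rows arrive incrementally
    csv_file.writerow(row)
cursor.close()
db.close()
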
EDIT: I had misunderstood what the batches were (they are on the user_id).


Untested code, but this should get you started...

SQL = """
SELECT a.user_id, b.last_name, b.first_name,
    FLOOR(DATEDIFF(CURRENT_DATE(), c.birth_date)/365) age,
    DATEDIFF(b.left_date, b.join_date) workDays
    FROM users a
    INNER JOIN users_signup b ON a.user_id = b.user_id
    INNER JOIN users_personal c ON a.user_id = c.user_id
    INNER JOIN
    (
        SELECT DISTINCT d.user_id FROM users_signup d
        WHERE (d.user_id >= 1 AND d.user_id < 1000000)
        AND d.join_date >= '2013-01-01' AND d.join_date < '2014-01-01'
    )
    AS t ON a.user_id = t.user_id
    LIMIT %s OFFSET %s
    """

BATCH_SIZE = 100000

with open("temp.csv", "wb") as f:
    writer = csv.writer(f)
    cursor = db_main.cursor()  # reuses db_main from the question above

    offset = 0
    limit = BATCH_SIZE

    while True:
        cursor.execute(SQL, (limit, offset))
        rows = cursor.fetchall()
        if not rows:
            # no more rows, we're done
            break
        for row in rows:
            writer.writerow(row)
        offset += BATCH_SIZE

cursor.close()

When I try to run it, it gives me an SQL query error. No exact error message, but it's complaining about the offset and limit. – Cryssie


I did say it was _untested_ code, didn't I? :-) –
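
For anyone hitting the OFFSET/LIMIT complaint above: MySQL only accepts plain integer literals in LIMIT and OFFSET, so if the bound values reach the driver as strings they get quoted (e.g. LIMIT '100000' OFFSET '0') and the statement is rejected. Casting to int before binding usually fixes it; alternatively, validated integers can be formatted into the statement text. A sketch, where paged_sql is a hypothetical helper and not part of any answer above:

def paged_sql(base_sql, limit, offset):
    # Validate before formatting into SQL text: interpolating
    # unchecked values would be an injection risk.
    if not (isinstance(limit, (int, long)) and isinstance(offset, (int, long))):
        raise ValueError("limit and offset must be integers")
    return "%s LIMIT %d OFFSET %d" % (base_sql, limit, offset)

cursor.execute(paged_sql("SELECT user_id FROM users_signup", 100000, 0))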