2016-10-10

ETL data from BigQuery to Redshift using Python

I have a Python script that sets a variable to the results of a query run in Google BigQuery (some of the libraries imported here are unused; what I was testing is converting the JSON to a CSV file):

import httplib2 
import datetime 
import json 
import csv 
import sys 
from oauth2client.service_account import ServiceAccountCredentials 
from bigquery import get_client 


#Set DAY - 1 
yesterday = datetime.datetime.now() - datetime.timedelta(days=1) 
today = datetime.datetime.now() 

#Format to Date 
yesterday = '{:%Y-%m-%d}'.format(yesterday) 
today = '{:%Y-%m-%d}'.format(today) 


# BigQuery project id as listed in the Google Developers Console. 
project_id = 'project' 

# Service account email address as listed in the Google Developers Console. 
service_account = '[email protected]' 


scope = 'https://www.googleapis.com/auth/bigquery' 

credentials = ServiceAccountCredentials.from_json_keyfile_name('/path/to/file/.json', scope) 

http = httplib2.Http() 
http = credentials.authorize(http) 


client = get_client(project_id, credentials=credentials, service_account=service_account) 

#Synchronous query 
try: 
    _job_id, results = client.query("SELECT * FROM dataset.table WHERE CreatedAt >= PARSE_UTC_USEC('" + yesterday + "') and CreatedAt < PARSE_UTC_USEC('" + today + "') limit 1", timeout=1000) 
except Exception as e: 
    print e 
    results = [] 

print results 

The results returned in the variable look like this:

[ 
{u'Field1': u'Msn', u'Field2': u'00000000000000', u'Field3': u'jsdksf422552d32', u'Field4': u'00000000000000', u'Field5': 1476004363.421, 
u'Field6': u'message', u'Field7': u'msn', 
u'Field8': None, 
u'Field9': u'{"user":{"field":"j23h4sdfsf345","field":"Msn","field":"000000000000000000","field":true,"field":"000000000000000000000","field":"2016-10-09T09:12:43.421Z"}}', u'Field10': 1476004387.016} 
] 
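(One detail worth noting for the Redshift side: values like 1476004363.421 in the sample are epoch seconds, while a Redshift TIMESTAMP column expects a formatted string. A minimal conversion sketch, assuming the epochs are UTC:)

```python
from datetime import datetime, timezone

def epoch_to_iso(epoch_seconds):
    """Turn an epoch-seconds float (like the 1476004363.421 above)
    into a 'YYYY-MM-DD HH:MM:SS.ffffff' string Redshift TIMESTAMP accepts."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S.%f")

print(epoch_to_iso(1476004363.421))
```

For the sample value this yields the same instant as the "2016-10-09T09:12:43.421Z" string embedded in the result.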

I need to load this into Amazon Redshift, but in this format I can't run a COPY from S3 using the .json file it generates...
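(For context: Redshift's `COPY ... FORMAT AS JSON 'auto'` reads one JSON object per line, not a single JSON array like the one printed above, which is one reason that file fails. A minimal sketch of writing the result list as newline-delimited JSON, using placeholder field names from the sample:)

```python
import json

# Sample rows shaped like the query results above (placeholder values).
rows = [
    {"Field1": "Msn", "Field5": 1476004363.421},
    {"Field1": "msn", "Field5": 1476004387.016},
]

# Write newline-delimited JSON: one object per line, no surrounding array.
with open("file.jsonl", "w") as out:
    for row in rows:
        out.write(json.dumps(row) + "\n")
```

A file in this shape can then be uploaded to S3 and loaded with `COPY ... FORMAT AS JSON 'auto'`.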

Is there a way to modify this JSON so it can be loaded into Redshift? Or to return a .csv directly? I don't know much about this BigQuery library, or about Python (this is one of my first scripts).

Thank you very much!

Answer


To remove the 'u' prefix:

results = json.dumps(results) 

Then, to convert the json variable into a CSV file, I wrote:

#Transform the json variable to a csv file 
results = json.loads(json.dumps(results)) 

f = csv.writer(open("file.csv", "w"), delimiter='|') 

f.writerow(["field","field","field","field","field","field","field", "field", "field", "field"]) 

for row in results: 
    f.writerow([row["field"], 
      row["field"], 
      row["field"], 
      row["field"], 
      row["field"], 
      row["field"], 
      row["field"], 
      row["field"], 
      row["field"], 
      row["field"]]) 

After this, I was able to load the file into Redshift.
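(As a side note, `csv.DictWriter` from the standard library can replace the hand-written header row and per-column indexing; a sketch with placeholder column names standing in for the anonymized "field" keys:)

```python
import csv

# Placeholder column names and rows shaped like the query results.
fieldnames = ["Field1", "Field2", "Field3"]
rows = [
    {"Field1": "Msn", "Field2": "00000000000000", "Field3": "message"},
    {"Field1": "msn", "Field2": "11111111111111", "Field3": None},
]

with open("file.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=fieldnames, delimiter="|")
    writer.writeheader()    # header row built from fieldnames
    writer.writerows(rows)  # one line per dict, columns kept in order
```

`None` values come out as empty fields, which pairs naturally with `COPY ... NULL AS ''` on the Redshift side.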
