psycopg2 Postgres COPY EXPERT to pandas read_csv using memory buffer fails with ValueError

So I am running the code below, using the psycopg2 driver in Python 3.5 with pandas 0.19.x, to read from a memory buffer.

buf = io.StringIO() 
cursor = conn.cursor() 
sql_query = 'COPY ('+ base_sql + ' limit 100) TO STDOUT WITH CSV HEADER' 
cursor.copy_expert(sql_query, buf) 
df = pd.read_csv(buf.getvalue(),engine='c') 
buf.close()

The read_csv call then fails with:

pandas\parser.pyx in pandas.parser.TextReader.__cinit__ (pandas\parser.c:4175)() 

pandas\parser.pyx in pandas.parser.TextReader._setup_parser_source (pandas\parser.c:8333)() 

C:\Users\....\AppData\Local\Continuum\Anaconda3\lib\genericpath.py in exists(path) 
    17  """Test whether a path exists. Returns False for broken symbolic links""" 
    18  try: 
---> 19   os.stat(path) 
    20  except OSError: 
    21   return False 

ValueError: stat: path too long for Windows

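Uh... what path? buf is in memory. What am I missing here?

FYI, the COPY itself seems to be working as expected.

SOLUTION CODE BELOW

Thanks to the answer below, my query speed doubled using this method and my memory usage dropped by 500%. Here is my final test code, to help others resolve their performance issues. I'd love to see any code that improves on this! Be sure to link back to this question from yours.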

# COPY TO CSV quick and dirty performance test
import io
import time
from urllib.parse import urlparse

import pandas as pd
import psycopg2
from sqlalchemy import create_engine

# user_id, pswd and my_sql_query are assumed to be defined earlier
start = time.time()
conn_str_copy = r'postgresql+psycopg2://' + user_id + r":" + pswd + r"@xxx.xxx.xxx.xxx:ppppp/my_database"
result = urlparse(conn_str_copy)
username = result.username
password = result.password
database = result.path[1:]
hostname = result.hostname

size = 2**30
buf = io.BytesIO()
# buf = io.StringIO()

engine = create_engine(conn_str_copy)  # not used by the COPY path below
conn_copy = psycopg2.connect(
    database=database, user=username, password=password, host=hostname)

cursor_copy = conn_copy.cursor()
sql_query = 'COPY (' + my_sql_query + ') TO STDOUT WITH CSV HEADER'
cursor_copy.copy_expert(sql_query, buf, size)
print('time:', (time.time() - start)/60, 'minutes or ', time.time() - start, 'seconds')
buf.seek(0)  # rewind so read_csv starts at the beginning of the buffer
df = pd.read_csv(buf, engine='c', low_memory=False)
buf.close()
print('time:', (time.time() - start)/60, 'minutes or ', time.time() - start, 'seconds')

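It took about 4 minutes to copy the data from Postgres and under 30 seconds to load it into the pandas dataframe. Note that the COPY command is a feature of the psycopg2 driver and may not be available in other drivers.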
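If you want to reuse the copy step, it could be wrapped in a small helper along these lines. This is only a sketch: the function name copy_query_to_buffer is made up for illustration, and it assumes cursor is a psycopg2 cursor (or anything else with a compatible copy_expert method) and that query is trusted SQL, since it is concatenated into the COPY statement.

```python
import io


def copy_query_to_buffer(cursor, query):
    """Run COPY (query) TO STDOUT into an in-memory buffer and rewind it.

    `cursor` is assumed to expose copy_expert(sql, file), as psycopg2 cursors do.
    `query` is interpolated as-is, so it must come from a trusted source.
    """
    buf = io.StringIO()
    sql = "COPY ({}) TO STDOUT WITH CSV HEADER".format(query)
    cursor.copy_expert(sql, buf)
    buf.seek(0)  # leave the buffer positioned at the start for the reader
    return buf
```

The returned buffer can then be handed straight to the reader, e.g. df = pd.read_csv(copy_query_to_buffer(cursor_copy, my_sql_query), engine='c').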


2016-12-20 Harvey

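Try removing '.getvalue()' so it's just 'df = pd.read_csv(buf, engine='c')'. Not sure, just guessing. – piRSquared

@piRSquared: you're right. You have to pass a file handle, and passing 'getvalue()' makes pandas believe you're passing a filename.

But there's more: you also have to "rewind" the buffer object, or it won't work.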

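You have to pass a file handle or a filename to pandas.read_csv().

Passing buf.getvalue() makes pandas read_csv believe you're passing a filename, because the string has no read method. The catch is that the "filename" here is the entire buffer content, which is rejected as too long (Windows limits file names to 255 characters).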
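To make the distinction concrete, here is a rough sketch of that kind of check. The function describe_read_csv_input is hypothetical and only mimics the idea described above; it is not pandas' actual code.

```python
import io


def describe_read_csv_input(obj):
    # Simplified stand-in for the dispatch described above, not pandas' real logic.
    if hasattr(obj, "read"):
        return "file-like object"
    if isinstance(obj, str):
        return "path or URL"
    return "unsupported"


buf = io.StringIO("id,name\n1,a\n")

kind_of_buffer = describe_read_csv_input(buf)             # the buffer itself
kind_of_string = describe_read_csv_input(buf.getvalue())  # its content as a str

print(kind_of_buffer)
print(kind_of_string)
```

The buffer has a read method, so it is treated as a file; the string returned by getvalue() does not, so it ends up on the path-handling branch, which is where the os.stat call in the traceback comes from.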

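You almost got it. Since buf is already a file-like object, just pass it as-is. One small detail: you have to rewind it, because the preceding cursor.copy_expert(sql_query, buf) call most likely used write, which leaves the buf position at the end (try it without the rewind and you will probably get an empty dataframe, or an error about there being no data to parse).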

buf.seek(0) # rewind because you're at the end of the buffer 
df = pd.read_csv(buf,engine='c')
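The effect of the rewind can be checked without a database. The sketch below is an illustration under stated assumptions: a plain write stands in for cursor.copy_expert, and the standard csv module stands in for pandas.read_csv so that it runs with no extra dependencies.

```python
import csv
import io

buf = io.StringIO()
buf.write("id,name\n1,a\n2,b\n")  # stand-in for cursor.copy_expert(sql_query, buf)

# After writing, the position is at the end, so a reader sees no data.
position_after_write = buf.tell()
rows_without_rewind = list(csv.reader(buf))

# Rewinding moves the position back to the start of the buffer.
buf.seek(0)
rows_after_rewind = list(csv.reader(buf))

print(rows_without_rewind)
print(rows_after_rewind)
```

Any reader that consumes the buffer from its current position behaves the same way, which is why the seek(0) has to come between the copy and the read.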


2016-12-20 19:03:56
