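I have a dataset of 370k records stored in a Pandas DataFrame that needs to be processed. I tried multiprocessing, threading, Cpython and loop unrolling, but without success: the estimated computation time was 22 hours. How can I speed up this Python loop? The task is as follows: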
%matplotlib inline
from numba import jit, autojit
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
with open('data/full_text.txt', encoding = "ISO-8859-1") as f:
    strdata=f.readlines()
data=[]
for string in strdata:
    data.append(string.split('\t'))
df=pd.DataFrame(data,columns=["uname","date","UT","lat","long","msg"])
df=df.drop('UT',axis=1)
df[['lat','long']] = df[['lat','long']].apply(pd.to_numeric)
from textblob import TextBlob
from tqdm import tqdm
df['polarity']=np.zeros(len(df))
Threading:
from queue import Queue
from threading import Thread
from time import time
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format='(%(threadName)-10s) %(message)s',
)

class DownloadWorker(Thread):
    def __init__(self, queue):
        Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            # Get the work from the queue and expand the tuple
            lowIndex, highIndex = self.queue.get()
            # Half-open range, clamped so the last chunk stops at the end of df
            a = range(lowIndex, min(highIndex, len(df)))
            for i in a:
                df['polarity'][i]=TextBlob(df['msg'][i]).sentiment.polarity
            self.queue.task_done()

def main():
    ts = time()
    # Create a queue to communicate with the worker threads
    queue = Queue()
    # Create 8 worker threads
    for x in range(8):
        worker = DownloadWorker(queue)
        worker.daemon = True
        worker.start()
    # Put the tasks into the queue as a tuple
    for i in tqdm(range(0,len(df),62936)):
        logging.debug('Queueing')
        queue.put((i,i+62936))
    queue.join()
    print('Took {}'.format(time() - ts))

main()
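The work items queued above are (low, high) row ranges. Here is a minimal sketch of generating them as half-open intervals, so that every row is visited exactly once and the last range stops at the end of the data. `chunk_bounds` is a hypothetical helper, not part of the original code, and the sizes used are illustrative only.

```python
def chunk_bounds(n_rows, chunk_size):
    """Yield half-open (low, high) index pairs that together cover range(n_rows)."""
    for low in range(0, n_rows, chunk_size):
        # Clamp the upper bound so the final chunk never runs past the data
        yield low, min(low + chunk_size, n_rows)


# Illustrative sizes only (hypothetical, not the real dataset)
bounds = list(chunk_bounds(10, 4))

# Every row index appears exactly once across all chunks
visited = [i for low, high in bounds for i in range(low, high)]
```

A worker would then loop over `range(low, high)` directly, with no `- 1` adjustment on either side.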
Multiprocessing with loop unrolling:
import multiprocessing

def assign_polarity(df):
    a=range(0,len(df),5)
    for i in tqdm(a):
        df['polarity'][i]=TextBlob(df['msg'][i]).sentiment.polarity
        df['polarity'][i+1]=TextBlob(df['msg'][i+1]).sentiment.polarity
        df['polarity'][i+2]=TextBlob(df['msg'][i+2]).sentiment.polarity
        df['polarity'][i+3]=TextBlob(df['msg'][i+3]).sentiment.polarity
        df['polarity'][i+4]=TextBlob(df['msg'][i+4]).sentiment.polarity

pool = multiprocessing.Pool(processes=2)
r = pool.map(assign_polarity, df)
pool.close()
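How can I increase the computation speed, or store the results in the DataFrame in a faster way? My laptop configuration:
- RAM: 8GB
- Physical cores: 2
- Logical cores: 8
- Windows 10
Implementing multiprocessing gave me a higher computation time. The threads executed sequentially (I think because of the GIL). Loop unrolling gave me the same computation speed. Cpython gave me errors while importing libraries.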
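One point that bears on the multiprocessing result: each worker process operates on its own copy of whatever it receives, so values written into a DataFrame inside a worker never reach the parent process. The scores have to be returned instead. Below is a minimal sketch of that pattern using only the standard library. `toy_polarity` is a hypothetical stand-in for `TextBlob(text).sentiment.polarity`, and `score_chunk`, `score_all`, and the default chunk size are likewise assumptions made for illustration.

```python
from concurrent.futures import ProcessPoolExecutor


def toy_polarity(text):
    """Hypothetical stand-in for TextBlob(text).sentiment.polarity."""
    words = text.lower().split()
    positive = sum(w in {"good", "great", "love"} for w in words)
    negative = sum(w in {"bad", "awful", "hate"} for w in words)
    total = positive + negative
    return 0.0 if total == 0 else (positive - negative) / total


def score_chunk(texts):
    """Score one chunk of messages; this is what runs inside a worker process."""
    return [toy_polarity(t) for t in texts]


def score_all(texts, workers=2, chunk_size=10_000):
    """Return one score per message, in the original order."""
    chunks = [texts[i:i + chunk_size] for i in range(0, len(texts), chunk_size)]
    if workers <= 1:
        # Sequential fallback: same chunking, no process start-up cost
        results = list(map(score_chunk, chunks))
    else:
        with ProcessPoolExecutor(max_workers=workers) as pool:
            # pool.map preserves the order of the input chunks
            results = list(pool.map(score_chunk, chunks))
    # Flatten the per-chunk lists back into a single list
    return [score for chunk in results for score in chunk]
```

The returned list could then be stored in a single step, for example `df['polarity'] = score_all(df['msg'].tolist())`, rather than one cell at a time. On Windows, a script has to make the `score_all` call under an `if __name__ == '__main__':` guard, and worker functions defined only in a notebook cell may not be importable by the child processes, so placing them in a separate module is the safer assumption.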
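"So I tried multiprocessing, threading, Cpython and loop unrolling." What didn't work? Can you post that in the question? – Boggartfly
You need to provide an [MCVE]. – IanS
@Boggartfly Thanks, I have added the things that didn't work – ASD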