Python threads - internal buffer error - out of memory

I made a simple web crawler that uses urllib and BeautifulSoup to extract data from tables on a web page. To speed up the data pull I tried using threads, but I get the following error:

"internal buffer error : Memory allocation failed : growing buffer"

This message appears many times, and then:

"out of memory"
Thanks for your help.
from bs4 import BeautifulSoup
from datetime import datetime
import urllib2
import re
from threading import Thread
stockData = []
#Access the list of stocks to search for data
symbolfile = open("stocks.txt")
symbolslist = symbolfile.read()
newsymbolslist = symbolslist.split("\n")
#text file stock data is stored in
myfile = open("webcrawldata.txt","a")
#initializing data for extraction of web data
lineOfData = ""
i=0
def th(ur):
    stockData = []
    lineOfData = ""
    dataline = ""
    stats = ""
    page = ""
    soup = ""
    i = 0
    # creates a timestamp for when the program was run
    timestamp = datetime.now()
    # Get Data ONLINE
    # Bloomberg stock quotes
    url = "http://www.bloomberg.com/quote/" + ur + ":US"
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
    # Extract key stats table only
    stats = soup.find("table", {"class": "key_stat_data"})
    # iteration for <tr>
    j = 0
    try:
        for row in stats.findAll('tr'):
            stockData.append(row.find('td'))
            j += 1
    except AttributeError:
        print "Table handling error in HTML"
    k = 0
    for cell in stockData:
        # clean up text
        dataline = stockData[k]
        lineOfData = lineOfData + " " + str(dataline)
        k += 1
    g = str(timestamp) + " " + str(ur) + ' ' + str(lineOfData) + ' ' + ("\n\n\n")
    myfile.write(g)
    print (ur + "\n")
    del stockData[:]
    lineOfData = ""
    dataline = ""
    stats = None
    page = None
    soup = None
    i += 1
threadlist = []
for u in newsymbolslist:
    t = Thread(target=th, args=(u,))
    t.start()
    threadlist.append(t)
for b in threadlist:
    b.join()
How many items do you have in 'newsymbolslist'? – csl
About 2,700, all NYSE ticker symbols. – Jesse
So you are starting 2,700 threads at the same time? –
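Starting one thread per symbol means roughly 2,700 threads fetching and parsing pages at the same time, and holding that many BeautifulSoup parse trees at once is likely what exhausts memory. Below is a minimal sketch (not the original code) of one way to bound the concurrency: a fixed pool of worker threads pulls symbols from a Queue and calls the th() function from the question. The pool size of 10 is an arbitrary assumption; newsymbolslist and th() are assumed to be defined as in the question.

from Queue import Queue
from threading import Thread

NUM_WORKERS = 10                 # assumption: a small fixed pool instead of 2,700 threads
symbol_queue = Queue()

def worker():
    # each worker repeatedly takes one symbol off the queue and scrapes it
    while True:
        symbol = symbol_queue.get()
        try:
            th(symbol)           # the scraping function from the question
        finally:
            symbol_queue.task_done()

# start the fixed pool of daemon workers
for _ in range(NUM_WORKERS):
    t = Thread(target=worker)
    t.daemon = True
    t.start()

# enqueue every symbol, then wait until the queue has been fully processed
for u in newsymbolslist:
    symbol_queue.put(u)
symbol_queue.join()

With the pool bounded, only NUM_WORKERS pages are downloaded and parsed at any moment, which keeps the parser's buffer allocations within memory while still giving some speedup over a purely sequential loop.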