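Optimizing my Python scraper
Kind of a long-winded question, and I probably just need someone to point me in the right direction. I'm building a web scraping tool to pull basketball player information from the ESPN website. The URL structure is pretty simple: each player card has a specific ID in the URL. To get the information, I'm writing a loop from 1 to 6000 to scrape the players from their database. My question is: is there a more efficient way to do this?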
from bs4 import BeautifulSoup
import requests

age = []        # Empty list to store player ages
player_id = []  # Empty list to store the IDs of players that were found
BASE = 'http://espn.go.com/nba/player/stats/_/id/'  # Base structure of the player card URL

def get_age(BASE):
    for i in range(1, 6000):  # Try every candidate player ID
        BASE_U = BASE + str(i) + '/'  # Build the URL for this player
        r = requests.get(BASE_U)
        soup = BeautifulSoup(r.text, 'html.parser')
        # Prior to this step, I had to print out the soup object and look through
        # the HTML in order to find the tag that contained my desired information
        # Get age of players
        age_tables = soup.find_all('ul', class_="player-metadata")  # Grabs the metadata tags
        p = str(age_tables)  # Turns the matched tags into a string
        # At this point I had to look at all the text in the p object and
        # determine a way to capture the age info
        if "Age: " not in p:  # Player ID doesn't exist, so go to the next one to avoid an error
            continue
        start = p.index("Age: ") + len("Age: ")  # Position where the player's age starts
        end = p[start:].index(")") + start       # Position of the closing parenthesis
        player_id.append(i)        # Adds the player ID to the player_id list
        age.append(p[start:end])   # Adds the player's age to the age list

get_age(BASE)
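Any help, even small, would be greatly appreciated, even if it just points me in the right direction and isn't necessarily a direct solution.
Thanks, Ben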
Ah, I've heard of multithreading. Do you know of an easy-to-follow online tutorial? – mangodreamz
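Personally, I think the documentation for the `multiprocessing` library is a good place to start. If the docs aren't enough for you, you can look for guides on that library. –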
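To make the `multiprocessing` suggestion a bit more concrete, here is a rough sketch of how the loop might be spread over a pool of worker threads, since most of the time is probably spent waiting on the network. It is only one possible approach, not a tested replacement for the code in the question. The names `extract_age`, `fetch_page`, `scrape_ages` and the `workers` parameter are made up for illustration. It also assumes that the age appears in the raw HTML as `Age: NN`, and it swaps `requests` and BeautifulSoup for the standard library's `urllib.request` and a regular expression so that the sketch has no third-party dependencies.

```python
import re
from multiprocessing.pool import ThreadPool
from urllib.request import urlopen

BASE = 'http://espn.go.com/nba/player/stats/_/id/'
AGE_PATTERN = re.compile(r'Age:\s*(\d+)')  # Assumed page format: "Age: 29"


def extract_age(html):
    """Return the age found in the page text as an int, or None if absent."""
    match = AGE_PATTERN.search(html)
    return int(match.group(1)) if match else None


def fetch_page(player_id):
    """Download one player page; return '' if the request fails."""
    try:
        with urlopen(BASE + str(player_id) + '/', timeout=10) as response:
            return response.read().decode('utf-8', errors='replace')
    except OSError:  # Covers URLError, HTTPError and socket timeouts
        return ''


def scrape_ages(player_ids, fetch=fetch_page, workers=20):
    """Fetch pages concurrently; return {player_id: age} for pages listing an age."""
    player_ids = list(player_ids)
    with ThreadPool(workers) as pool:
        # pool.map keeps results in the same order as player_ids
        pages = pool.map(fetch, player_ids)
    ages = {}
    for pid, html in zip(player_ids, pages):
        age = extract_age(html)
        if age is not None:  # Skip IDs with no player page
            ages[pid] = age
    return ages
```

Calling `scrape_ages(range(1, 6000))` would then cover the same IDs as the original loop. The `fetch` argument is there so the parsing can be tried against saved pages without touching the network, and a modest `workers` value keeps the load on the site reasonable.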