Scraping multiple pages with Python and BeautifulSoup - only returns data from the last page

I am trying to loop through multiple pages to scrape data with Python and BeautifulSoup. My script works for a single page, but when I try to iterate over multiple pages it only returns the data from the last page. I think there may be a problem with the way I am looping, or with how I am storing/appending the player_data list.

Here is what I have so far - any help is greatly appreciated.

#! python3 
# downloadRecruits.py - Downloads espn college basketball recruiting database info 

import requests, os, bs4, csv 
import pandas as pd 

# Starting url (class of 2007) 
base_url = 'http://www.espn.com/college-sports/basketball/recruiting/databaseresults/_/class/2007/page/' 

# Pages to scrape (range end is exclusive, so use the last page number + 1)
pages = map(str, range(1,3)) 

# url for starting page 
url = base_url + pages[0] 

for n in pages: 
    # Create url 
    url = base_url + n 

    # Parse data using BS 
    print('Downloading page %s...' % url) 
    res = requests.get(url) 
    res.raise_for_status() 

    # Creating bs object 
    soup = bs4.BeautifulSoup(res.text, "html.parser") 

    table = soup.find('table') 

    # Get the data 
    data_rows = soup.findAll('tr')[1:] 

    player_data = [] 
    for tr in data_rows:
        tdata = []
        for td in tr:
            tdata.append(td.getText())

            if td.div and td.div['class'][0] == 'school-logo':
                tdata.append(td.div.a['href'])

        player_data.append(tdata)

print(player_data) 

Add 4 spaces before 'print(player_data)' – PRMoureu

Answers


You should have your player_data list defined outside the loop, otherwise only the results from the last iteration will be saved.
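
For example, a minimal sketch of that restructuring, reusing the names from the question:

player_data = []                      # defined once, before the page loop
for n in pages:
    # ... download and parse the page, build data_rows as before ...
    for tr in data_rows:
        tdata = []
        # ... fill tdata as in the question ...
        player_data.append(tdata)     # accumulates rows across every page

print(player_data)                    # now holds the rows from all pages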


Thanks @Kostas Drk. Question - does this also apply to saving to a CSV? When I replace print(player_data) with the following, it only saves the last page: with open('bballRecruits.csv', 'w') as f_output: csv_output = csv.writer(f_output) csv_output.writerows(player_data) – NateRattner


@NateRattner Yes, it applies wherever you use that list; whatever was stored in it is what you get. –
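
For example, a sketch of that CSV step, run once after the scraping loop (this assumes player_data was declared outside the loop as described above; newline='' is the csv module's recommended setting for files opened in Python 3):

import csv

# After the loop, player_data holds the rows from every page; write them all at once
with open('bballRecruits.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerows(player_data)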


It is an indentation problem or a declaration problem, depending on the result you expect.

• If you want to print the results for each page:

You can fix this by adding 4 spaces before print(player_data), as sketched below.

If you leave the print statement outside the loop block, it only executes once, after the loop has finished. So the only value it can display is the last value of player_data, the one leaking out of the final iteration of the for loop.
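
A minimal sketch of that first option, with the print indented into the loop body:

for n in pages:
    # ... scrape the page and build player_data as before ...
    print(player_data)    # runs once per page, printing that page's rows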

• If you want to store all the results in player_data and print them at the end:

You have to declare player_data outside of, and before, your for loop.

player_data = [] 
for n in pages: 
    # [...] 

The OP asked about the data returned, not only about what gets printed. –


@KostasDrk, thanks for pointing that out, I wasn't sure what was expected... – PRMoureu


Thank you @PRMoureu. I should have been clearer - I do want to store all the results in player_data and then print them at the end. So this is perfect. – NateRattner

import requests 
from bs4 import BeautifulSoup 

# Starting url (class of 2007) 
base_url = 'http://www.espn.com/college-sports/basketball/recruiting/databaseresults/_/class/2007/page/' 

# Pages to scrape (range end is exclusive, so use the last page number + 1)
pages = list(map(str, range(1, 3)))
# In Python 3, map() returns a lazy map object, not a subscriptable list,
# so pages[0] below would fail without the list() call.
# url for starting page 
url = base_url + pages[0] 

# Declared once, outside the loop, so the rows from every page accumulate
player_data = []

for n in pages:
    # Create url 
    url = base_url + n 

    # Parse data using BS 
    print('Downloading page %s...' % url) 
    res = requests.get(url) 
    res.raise_for_status() 

    # Creating bs object 
    soup = BeautifulSoup(res.text, "html.parser") 

    table = soup.find('table') 

    # Get the data 
    data_rows = soup.findAll('tr')[1:] 

    for tr in data_rows:
        tdata = []
        for td in tr:
            tdata.append(td.getText())

            if td.div and td.div['class'][0] == 'school-logo':
                tdata.append(td.div.a['href'])

        player_data.append(tdata)

print(player_data)
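
As a quick illustration of the list(map(...)) point from the comment above, in Python 3:

pages = map(str, range(1, 3))
# pages[0]    # raises TypeError: 'map' object is not subscriptable
pages = list(map(str, range(1, 3)))
print(pages[0])    # '1'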