2017-10-05 45 views
0

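Creating a Pandas DataFrame from web scraping results

I'm trying to scrape a table from ESPN and send the data to a pandas DataFrame so that I can export it to Excel. I've done most of the scraping, but I'm stuck on how to send each 'td' tag to a unique DataFrame cell inside my for loop. (Code below.) Any ideas? Thanks!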

import requests 
import urllib.request 
from bs4 import BeautifulSoup 
import re 
import os 
import csv 
import pandas as pd 

def make_soup(url): 
    thepage = urllib.request.urlopen(url) 
    soupdata = BeautifulSoup(thepage, "html.parser") 
    return soupdata 

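# The URL has to be a single string literal; it cannot wrap across lines.
soup = make_soup("http://www.espn.com/nba/statistics/player/_/stat/scoring-per-game/sort/avgPoints/qualified/false")

# Keep only rows whose class starts with a letter from e to o,
# which skips the header rows repeated every n rows.
regex = re.compile("^[e-o]")

for record in soup.findAll('tr', {"class": regex}):
    for data in record.findAll('td'):
        print(data)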
+1

Read: https://stackoverflow.com/a/1732454/4047084 –

+0

What? The regex is there to remove the multiple header rows that appear every n rows. – johankent30

+0

Where does the removal happen? You are applying the regex inside BeautifulSoup's parsing function findAll(). Hence the link above. – Parfait
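To make the point about the regex concrete, here is a small sketch of what the pattern `^[e-o]` accepts when it is tested against class names. The class names below are hypothetical stand-ins (the thread does not list the page's actual classes), and the narrower pattern is only a suggested alternative, not something from the original post.

```python
import re

# Pattern from the question: any class name starting with a letter from e to o.
broad = re.compile("^[e-o]")

# A narrower alternative. Assumption: data rows use "oddrow"/"evenrow" classes.
narrow = re.compile(r"^(?:odd|even)row")

# Hypothetical class names, for illustration only.
class_names = ["oddrow", "evenrow", "colhead", "header"]

for name in class_names:
    # Show whether each pattern would keep a row with this class.
    print(name, bool(broad.match(name)), bool(narrow.match(name)))
```

The character range also accepts any other class that happens to start with a letter between e and o, so spelling out the wanted class names is less likely to pick up unrelated rows.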

Answers

0

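I've actually been scraping sports sites recently while working on a daily fantasy sports algorithm for a class. This is the script I wrote. Maybe this approach can work for you: build a dictionary, then convert it to a DataFrame.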

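import requests
from bs4 import BeautifulSoup
import pandas as pd

# {0} and {1} are placeholders (season year and stats mode, judging by the
# query string); the post does not show the values passed to url.format(...).
url = "http://www.footballdb.com/stats/stats.html?lg=NFL&yr={0}&type=reg&mode={1}&limit=all"

result = requests.get(url)
c = result.content

# Set as Beautiful Soup Object
soup = BeautifulSoup(c, "html.parser")

# Go to the section of interest
tables = soup.find("table", {'class': 'statistics'})

# One inner dict per column, keyed by column index
data = {}
headers = {}
for i, header in enumerate(tables.findAll('th')):
    data[i] = {}
    headers[i] = str(header.get_text())

# Fill data[column][row] from every cell in the table body
table = tables.find('tbody')
for r, row in enumerate(table.select('tr')):
    for i, cell in enumerate(row.select('td')):
        try:
            data[i][r] = str(cell.get_text())
        except UnicodeEncodeError:
            # strip_non_ascii is the answerer's own helper; it is not shown in the post
            stat = strip_non_ascii(cell.get_text())
            data[i][r] = stat

# Overwrite the first column with the player names taken from the link text
for i, name in enumerate(tables.select('tbody .left .hidden-xs a')):
    data[0][i] = str(name.get_text())

df = pd.DataFrame(data=data)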
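For the question as asked (one DataFrame cell per 'td'), a row-oriented variant of the same idea may be easier to follow: collect one list of cell texts per 'tr', then hand the list of rows to pandas. The sketch below runs on a made-up HTML snippet instead of a live page; the class names `colhead`, `oddrow`, and `evenrow`, the column names, and the player values are all assumptions for illustration.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up HTML standing in for the scraped page (not real ESPN markup).
html = """
<table>
  <tr class="colhead"><td>RK</td><td>PLAYER</td><td>PTS</td></tr>
  <tr class="oddrow"><td>1</td><td>Player A</td><td>30.1</td></tr>
  <tr class="evenrow"><td>2</td><td>Player B</td><td>28.4</td></tr>
  <tr class="colhead"><td>RK</td><td>PLAYER</td><td>PTS</td></tr>
  <tr class="oddrow"><td>3</td><td>Player C</td><td>27.9</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def cell_texts(row):
    """Return the stripped text of every td in a table row."""
    return [td.get_text(strip=True) for td in row.find_all("td")]


# Take the column names from the first header row.
columns = cell_texts(soup.find("tr", class_="colhead"))

# One list per data row; the repeated header rows are skipped by class.
rows = [cell_texts(tr) for tr in soup.find_all("tr", class_=["oddrow", "evenrow"])]

# Each inner list becomes a DataFrame row, each td a cell.
df = pd.DataFrame(rows, columns=columns)

# Scraped values arrive as strings; convert numeric columns explicitly.
df["PTS"] = df["PTS"].astype(float)
print(df)
```

From there, `df.to_excel(...)` can write the frame to a spreadsheet, provided an Excel writer engine such as openpyxl is installed.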
+0

OK, great, thanks! – johankent30
