Scrape HTML into a CSV file

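The code below scrapes data from the following page: http://www.gbgb.org.uk/resultsMeeting.aspx?id=136005

It scrapes all the relevant fields and prints them to the screen. However, I would like to get the data into tabular form and convert it into a CSV file, so that it can be exported to a spreadsheet or database.

In the site's source HTML, the track, date, datetime (race time), grade, distance and prizes come from the div class "resultsBlockHeader", which forms the top area of the race card on the web page.

The body of each race in the source HTML comes from the div class "resultsBlock". This includes the finishing position (Fin), greyhound, trap, SP, timeSec and time distance.

Eventually it would look like this: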

track,date,datetime,grade,distance,prize,fin,greyhound,trap,SP,timeSec,time distance

Is this possible, or do I have to get it printing to the screen in tabular form before I can export it to CSV?

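# Python 2 import; on Python 3 use: from urllib.request import urlopen
from urllib import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.gbgb.org.uk/resultsMeeting.aspx?id=136005")
bsObj = BeautifulSoup(html, 'lxml')

# Each block finds every element of one class on the page and prints its text.
nameList = bsObj.findAll("div", {"class": "track"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("div", {"class": "distance"})
for name in nameList:
    print(name.get_text())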
nameList = bsObj.findAll("div", {"class": "prizes"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("li", {"class": "first essential fin"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("li", {"class": "essential greyhound"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("li", {"class": "trap"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("li", {"class": "sp"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("li", {"class": "timeSec"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.findAll("li", {"class": "timeDistance"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("li", {"class": "essential trainer"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("li", {"class": "first essential comment"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("div", {"class": "resultsBlockFooter"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.findAll("li", {"class": "first essential"})
for name in nameList:
    print(name.get_text())

Source

2016-02-13 moonshadow

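This just prints a whole bunch of things, each on its own line. If you want a table or CSV format, you need to restructure this entire code.

Hi cricket_007. Thanks for your reply. How would I get the things on screen to print side by side? (Still very new to all this) :) – moonshadow

'print(1, 2)' will print on the same line. 'print(1)' then 'print(2)' will print on separate lines. It's that simple. You have to put the values for a row into a list in order to print them on one line. Currently you are focusing on columns instead of rows.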
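
To illustrate the rows-versus-columns point, here is a minimal sketch. It assumes each field has already been collected into its own list and that all lists have the same length; the list names and the hard-coded sample values are assumptions for illustration only. zip() turns the columns into rows, and each row is then printed on a single line.

```python
# Hypothetical column lists; in the real script each one would be filled
# from a findAll() loop instead of being hard-coded.
tracks = ["Sheffield", "Sheffield"]
dates = ["02/02/16", "02/02/16"]
greyhounds = ["Miss Eastwood", "Sapphire Man"]

# zip() pairs up the i-th item of every column, giving one tuple per row.
rows = list(zip(tracks, dates, greyhounds))

for row in rows:
    # join() puts all the values of one row on a single comma-separated line.
    print(",".join(row))
```

Note that this only lines up when every column has one entry per row. On the results page the track appears once per race while each race has several greyhounds, so the page-wide lists have different lengths; that is why the fields need to be grouped per race block rather than zipped across the whole page.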

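Not sure why you haven't followed the code that this answer suggested for your previous question - it actually solves the problem of grouping the fields together.

Here is follow-up code that dumps track, date and greyhound to CSV. The contents of results.csv after running it look like this: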

Sheffield,02/02/16,Miss Eastwood 
Sheffield,02/02/16,Sapphire Man 
Sheffield,02/02/16,Swift Millican 
... 
Sheffield,02/02/16,Geelo Storm 
Sheffield,02/02/16,Reflected Light 
Sheffield,02/02/16,Boozed Flame

The code is below. Note that I am using requests here, but you can stay with urllib2 if you want:

import csv

from bs4 import BeautifulSoup
import requests


html = requests.get("http://www.gbgb.org.uk/resultsMeeting.aspx?id=135754").text
soup = BeautifulSoup(html, 'lxml')

rows = []
# Each race has a header block (track, date, ...) followed by a results block.
for header in soup.find_all("div", class_="resultsBlockHeader"):
    # Drop non-ASCII characters and the "|" separators around the text.
    # decode() turns the bytes back into text so strip("|") also works on Python 3.
    track = header.find("div", class_="track").get_text(strip=True).encode('ascii', 'ignore').decode('ascii').strip("|")
    date = header.find("div", class_="date").get_text(strip=True).encode('ascii', 'ignore').decode('ascii').strip("|")

    # The results for this race are in the next sibling "resultsBlock" div.
    results = header.find_next_sibling("div", class_="resultsBlock").find_all("ul", class_="line1")
    for result in results:
        greyhound = result.find("li", class_="greyhound").get_text(strip=True)

        # One CSV row per greyhound, repeating the race-level fields.
        rows.append({
            "track": track,
            "date": date,
            "greyhound": greyhound
        })


with open("results.csv", "w") as f:
    writer = csv.DictWriter(f, ["track", "date", "greyhound"])

    for row in rows:
        writer.writerow(row)
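
The file written by the code above has no header row. If one is wanted, a minimal sketch is to call DictWriter.writeheader() before writing the rows. This assumes Python 3 and a list of dicts shaped like the rows list above. The sample values are illustrative, and io.StringIO stands in for the real file only so that the result can be inspected in memory.

```python
import csv
import io

# Sample rows in the same shape as the scraper's `rows` list (illustrative values).
rows = [
    {"track": "Sheffield", "date": "02/02/16", "greyhound": "Miss Eastwood"},
    {"track": "Sheffield", "date": "02/02/16", "greyhound": "Sapphire Man"},
]

# StringIO stands in for open("results.csv", "w") so the output stays in memory.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["track", "date", "greyhound"])
writer.writeheader()    # writes the column names as the first line
writer.writerows(rows)  # writes one line per dict

csv_text = buffer.getvalue()
print(csv_text)
```

Adding more columns means extending both fieldnames and the keys of each dict. By default DictWriter raises a ValueError if a dict contains a key that is not listed in fieldnames, which helps catch a field that was scraped but never added to the header.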

Source

2016-02-13 22:01:02 alecxe

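Heh. Just noticed how many questions the OP has asked about this one problem alone.

Hi alecxe. Many thanks for your code. I have managed to get all the fields out, except that I can't get the "trainer", "comment" and "first essential" fields out. Any suggestions would be most welcome :) – moonshadow

@moonshadow Sure, but let's avoid solving follow-up problems in comments. If you are having difficulty extracting all the fields, see whether it makes sense to create a separate question. Thanks. – alecxe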
