2016-02-13 66 views

Scrape HTML into a CSV file

The code below scrapes data from the following page: http://www.gbgb.org.uk/resultsMeeting.aspx?id=136005

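It scrapes all the relevant fields and prints them to the screen. However, I would like to print the data in tabular form and convert it to a CSV file, so it can be exported to a spreadsheet or database.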

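In the site's source HTML, the track, date, datetime (race time), grade, distance and prizes come from the div class "resultsBlockHeader" and form the top area of the race card on the web page.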

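The race body in the source HTML comes from the div class "resultsBlock", which includes the finishing position (Fin), greyhound, trap, SP, time/sec and time distance.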

The final output would look like this:

track,date,datetime,grade,distance,prize,fin,greyhound,trap,SP,timeSec,time distance 

Is this possible, or do I have to get it to print to the screen in tabular form before I can export it to CSV?

from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.gbgb.org.uk/resultsMeeting.aspx?id=136005")
bsObj = BeautifulSoup(html, 'lxml')

# Each block below finds every element of one class and prints its text,
# one value per line.
nameList = bsObj.find_all("div", {"class": "track"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.find_all("div", {"class": "distance"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.find_all("div", {"class": "prizes"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.find_all("li", {"class": "first essential fin"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.find_all("li", {"class": "essential greyhound"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.find_all("li", {"class": "trap"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.find_all("li", {"class": "sp"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.find_all("li", {"class": "timeSec"})
for name in nameList:
    print(name.get_text())
nameList = bsObj.find_all("li", {"class": "timeDistance"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.find_all("li", {"class": "essential trainer"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.find_all("li", {"class": "first essential comment"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.find_all("div", {"class": "resultsBlockFooter"})
for name in nameList:
    print(name.get_text())

nameList = bsObj.find_all("li", {"class": "first essential"})
for name in nameList:
    print(name.get_text())

This just prints a whole bunch of values, each on its own line. If you want a table or CSV format, you need to restructure this whole code.


Hi cricket_007. Thanks for your reply. How would I get the output to print side by side on the screen? (Still very new to all this) :) – moonshadow


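'print(1, 2)' prints both values on the same line. 'print(1)' followed by 'print(2)' prints them on separate lines. It's that simple. You have to put the values for each row into a list before you can print them on one line. Currently you are focusing on columns rather than rows.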
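
To illustrate the rows-versus-columns point, here is a minimal sketch. The three lists are hypothetical placeholders standing in for the text collected from each class lookup; the values are invented, not taken from the page. `zip` regroups per-column lists into per-row tuples, so each runner's values end up on one line:

```python
# Hypothetical per-column lists, standing in for the text pulled from each
# class lookup; the values are invented for illustration only.
fins = ["1st", "2nd", "3rd"]
greyhounds = ["Dog A", "Dog B", "Dog C"]
traps = ["4", "1", "6"]

# zip() takes one item from each column at a time, yielding one tuple per row.
rows = list(zip(fins, greyhounds, traps))

for row in rows:
    # Join the values of one row so they appear side by side on a single line.
    print(",".join(row))
```

Note that `zip` stops at the shortest list, so this only lines up correctly when every column has the same number of entries.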

Answers

1

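Not sure why you didn't follow the code suggested in this answer to your previous question - it actually solves the problem of grouping the fields together.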

Here is follow-up code that dumps track, date and greyhound into a CSV:

import csv

from bs4 import BeautifulSoup
import requests


html = requests.get("http://www.gbgb.org.uk/resultsMeeting.aspx?id=135754").text
soup = BeautifulSoup(html, 'lxml')

rows = []
# Each race has a header block, followed by a sibling block holding its results.
for header in soup.find_all("div", class_="resultsBlockHeader"):
    # Drop non-ASCII characters and the "|" separators around the text.
    track = header.find("div", class_="track").get_text(strip=True).encode('ascii', 'ignore').decode('ascii').strip("|")
    date = header.find("div", class_="date").get_text(strip=True).encode('ascii', 'ignore').decode('ascii').strip("|")

    results = header.find_next_sibling("div", class_="resultsBlock").find_all("ul", class_="line1")
    for result in results:
        greyhound = result.find("li", class_="greyhound").get_text(strip=True)

        # One dict per runner, repeating the race-level fields on every row.
        rows.append({
            "track": track,
            "date": date,
            "greyhound": greyhound
        })


# newline="" lets the csv module control line endings (Python 3).
with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, ["track", "date", "greyhound"])

    for row in rows:
        writer.writerow(row)

The contents of results.csv after running the code:

Sheffield,02/02/16,Miss Eastwood
Sheffield,02/02/16,Sapphire Man
Sheffield,02/02/16,Swift Millican
...
Sheffield,02/02/16,Geelo Storm
Sheffield,02/02/16,Reflected Light
Sheffield,02/02/16,Boozed Flame

Note that I used requests here, but you can stay with urllib if you want.
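
The results.csv shown in this answer has no header row. If one is wanted, `csv.DictWriter.writeheader()` writes the field names first. A minimal sketch for Python 3, writing to an in-memory buffer instead of a file so it runs without network access; the two sample rows are copied from the output in this answer, and `fieldnames` and `buffer` are names chosen for this illustration:

```python
import csv
import io

# Column order for the CSV; DictWriter uses it for the header and every row.
fieldnames = ["track", "date", "greyhound"]

# Sample rows copied from the answer's output, in the same dict shape
# that the scraping loop appends to its rows list.
rows = [
    {"track": "Sheffield", "date": "02/02/16", "greyhound": "Miss Eastwood"},
    {"track": "Sheffield", "date": "02/02/16", "greyhound": "Sapphire Man"},
]

# In-memory text buffer standing in for open("results.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()    # emits: track,date,greyhound
writer.writerows(rows)  # emits one line per dict

print(buffer.getvalue())
```

Adding further columns such as trap or SP would then mean extending `fieldnames` and adding the matching keys to each row dict.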


Heh. Just noticed how many questions the OP has asked about this one problem.


Hi alecxe. Many thanks for your code. I have managed to get all the fields out except the "trainer", "comment" and "first essential" fields. Any suggestions most welcome :) – moonshadow


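@moonshadow sure, but let's avoid solving problems in follow-up comments. If you are having difficulties extracting all the fields, see if it makes sense to create a separate question. Thanks. – alecxe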