Formatting data scraped from a website (BeautifulSoup)

I am building a scraper with BeautifulSoup that requests a page from a website and extracts the match schedule (and the results, if available). This is what I have so far:

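import requests
# The original post omitted its imports; the bs4 package is assumed here.
from bs4 import BeautifulSoup

def getMatches(self):
    # Change seriesCode in the URL for a different series.
    url = 'http://icc-cricket.yahoo.net/match_zone/series/fixtures.php?seriesCode=ENG_WI_2012'
    page = requests.get(url)
    page_content = page.content
    soup = BeautifulSoup(page_content)

    # The fixtures table sits inside the div with class "bElementBox".
    result = soup.find('div', attrs={'class': 'bElementBox'})
    tags = result.findChildren('tr')

    # Print the text of every table row.
    for elem in tags:
        x = elem.getText()
        print(x)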

And these are the results I get:

Date &amp; Time (GMT)fixture 
Thu, May 17, 2012 10:00 AMEngland&nbsp; vs &nbsp;West Indies 
3rd&nbsp;TESTA full scorecard will be available shortly.Venue: Edgbaston, BirminghamResult: England won by 5 wickets 
Fri, May 25, 2012 11:00 AMEngland&nbsp; vs &nbsp;West Indies 
2nd&nbsp;TESTClick here for the full scorecardVenue: Trent Bridge, NottinghamResult:  England won by 9 wickets 
Thu, Jun 7, 2012 10:00 AMEngland&nbsp; vs &nbsp;West Indies 
1st&nbsp;TESTClick here for the full scorecardVenue: Lord'sResult: Match Drawn 
Sat, Jun 16, 2012 9:45 AMEngland&nbsp; vs &nbsp;West Indies 
1st&nbsp;ODIClick here for the full scorecardVenue: The Rose Bowl, SouthamptonResult:  England won by 114 runs (D/L Method) 
Tue, Jun 19, 2012 9:45 AMEngland&nbsp; vs &nbsp;West Indies 
2nd&nbsp;ODIVenue: KIA Oval 
Fri, Jun 22, 2012 9:45 AMEngland&nbsp; vs &nbsp;West Indies 
3rd&nbsp;ODIVenue: Headingley Carnegie 
Sun, Jun 24, 2012 12:00 AMEngland&nbsp; vs &nbsp;West Indies 
1st&nbsp;T20Venue: Trent Bridge, Nottingham

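Now I would like to organize the data in some structured format. A list of dictionaries, each holding the information about one match, would be ideal. But I am stuck on how to achieve this. The output strings contain characters like &nbsp;, and the time runs straight into the team name, as in AMEngland. Another problem is that if I split the string on whitespace, country names such as West Indies get split too, so there is no uniform way to parse it.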

So is there a way to parse this data uniformly, so that I end up with something like this:

[ {'date': match_date, 'home_team': team1, 'away_team': team2, 'venue': venue},{ same for match 2}, { match 3 }...]

I would appreciate any help. :)

2012-06-19 Manish Gill

It is not very hard to separate the date/time from the countries. You can do the same for "Venue" and "Result".

>>> import re 
>>> s = "Sun, Jun 24, 2012 12:00 AMEngland&nbsp; vs &nbsp;West Indies" 
>>> match = re.search(r"\b[AP]M", s) 
>>> s[0:match.end()] 
'Sun, Jun 24, 2012 12:00 AM' 
>>> s[match.end():] 
'England&nbsp; vs &nbsp;West Indies'
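
Building on the same idea, here is a minimal sketch (Python 3, standard library only) that turns the row strings from the question into the dictionaries asked for. It assumes the header row has already been dropped and that every match occupies exactly two consecutive rows. The helper names `clean`, `parse_fixture`, `parse_details` and `parse_matches` are made up for illustration, and `ROWS` is sample data copied from the output in the question.

```python
import re
from datetime import datetime
from html import unescape

# Sample rows copied from the question's output (header row dropped).
# Assumption: each match spans exactly two consecutive rows.
ROWS = [
    "Thu, May 17, 2012 10:00 AMEngland&nbsp; vs &nbsp;West Indies",
    "3rd&nbsp;TESTA full scorecard will be available shortly."
    "Venue: Edgbaston, BirminghamResult: England won by 5 wickets",
    "Tue, Jun 19, 2012 9:45 AMEngland&nbsp; vs &nbsp;West Indies",
    "2nd&nbsp;ODIVenue: KIA Oval",
]


def clean(text):
    """Decode entities such as &nbsp; and collapse runs of whitespace."""
    text = unescape(text).replace("\xa0", " ")
    return " ".join(text.split())


def parse_fixture(row):
    """Split 'date time AM/PM' from 'home vs away' at the AM/PM marker."""
    row = clean(row)
    marker = re.search(r"\b[AP]M", row)
    when, teams = row[:marker.end()], row[marker.end():]
    home, away = [name.strip() for name in teams.split(" vs ")]
    return {
        "date": datetime.strptime(when, "%a, %b %d, %Y %I:%M %p"),
        "home_team": home,
        "away_team": away,
    }


def parse_details(row):
    """Pull the venue and, when present, the result out of the second row."""
    row = clean(row)
    found = re.search(
        r"Venue:\s*(?P<venue>.*?)\s*(?:Result:\s*(?P<result>.*))?$", row
    )
    return {"venue": found.group("venue"), "result": found.group("result")}


def parse_matches(rows):
    """Pair up fixture rows and detail rows into one dict per match."""
    matches = []
    for fixture_row, detail_row in zip(rows[0::2], rows[1::2]):
        match = parse_fixture(fixture_row)
        match.update(parse_details(detail_row))
        matches.append(match)
    return matches


for match in parse_matches(ROWS):
    print(match)
```

Matches that have not been played yet get `None` as their result. The match type (for example 3rd TEST) is left out here because it is fused with the scorecard text and would need a pattern of its own.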

2012-06-19 16:05:35 robert

Thanks a lot. I guess staring at HTML all day made me forget that I could just use a simple regular expression. :)

Take a look at scrapy instead; it will make this task much easier.

You define the items to be scraped from the site:

from scrapy.item import Item, Field 

class CricketMatch(Item): 
    date = Field() 
    home_team = Field() 
    away_team = Field() 
    venue = Field()

Then define a loader with XPath expressions to populate these items. After that, you can use the items directly, or produce JSON output or similar.

2012-06-19 16:06:14

I would indeed go with scrapy, but the application I am working on already uses BeautifulSoup for its existing tasks, so I was told not to use it.

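Unfortunately, you did not specify that in your question. Also note that SO aims to provide generally useful questions and answers, not just ones for individual cases, so I will leave my answer up.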
