2015-12-07 19 views
3

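Is there a way to remove or separate web-scraped data in Python?

Hello, I am scraping the latest news from the ABC News website, and the tags my code scrapes look like this: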

<a href="/Politics/huckabee-draws-cheers-fundraiser-west-bank-settlement/story?id=35615831" name="lpos=widget[A_3_freeformlite_4380645_homepage]&amp;lid=link[Headline_2]">Huckabee Draws Cheers at Fundraiser for West Bank Settlement<span class="metaH_timeDay">41 minutes ago</span></a> 

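As you can see, there is a span tag nested inside the a tag, so when I scrape this with BeautifulSoup I get text like this:

Huckabee Draws Cheers at Fundraiser for West Bank Settlement41 minutes ago

The time ends up right next to my data. I would like the "41 minutes ago" part to be separated, so that it looks like this:

Huckabee Draws Cheers at Fundraiser for West Bank Settlement 41 minutes ago

Or at least remove it!

My code looks like this: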

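import requests
from bs4 import BeautifulSoup

url = "http://abcnews.go.com/"

r = requests.get(url)

soup = BeautifulSoup(r.content, "lxml")

# The headline links are numbered Headline_1 .. Headline_9 in their name attribute.
for x in range(1, 10):
    for link in soup.find_all("a", {"name": "lpos=widget[A_3_freeformlite_4380645_homepage]&lid=link[Headline_" + str(x) + "]"}):
        # link.text includes the text of the nested span, so the time is glued to the headline.
        print(link.text)
        # The time on its own, taken from the nested span.
        print(link.find("span", {"class": "metaH_timeDay"}).text)
        print("")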

Can anyone help me?

Answers

1

You can use the decompose() function, run in a while loop, to remove all the span tags from the links inside the div:

import requests
from bs4 import BeautifulSoup

url = "http://abcnews.go.com/"

r = requests.get(url)

soup = BeautifulSoup(r.content, "html.parser")

# Select every headline link inside a div with class "h".
for link in soup.select("div.h a"):
    # Remove each nested span (such as the timestamp) until none are left.
    while link.span:
        link.span.decompose()
    print(link.text)

Output:

Obama Seeks to Remove Fear From ISIS Fight 
Kerry off to Paris Again for Climate Conference 
Huckabee Draws Cheers at Fundraiser for West Bank Settlement 
Sanders Unveils Plan to Address Climate Change 
FBI Looking Into Blatter's Role in Bribery Case 
Armed Bank Robbery Suspect Shot in Miami Had Escaped From Half-Way House 
13 Injured in Attack on Government Office in Western China 
Police Arrest Mother of Newborn Baby Who Was Buried Alive 
Shooting Suspect's Neighbor Says He Became 'More Withdrawn' 
Justice Department to Investigate Chicago Police 
Hillary Clinton Corrects Flub, Thanks to Justice Breyer 
Dashcam Must Be Working 
Clinton Laughs Off Trump’s Claims That She Lacks ‘Stamina’
Man Killed in Wisconsin Standoff Was a Hostage 
2 New York College Students Abducted, Held Hostage 
Transgender Actress, Warhol Muse Holly Woodlawn Dies at 69 
Mood Dour Among Venezuelan Ruling Party Backers 
Hillary Clinton Says ‘We’re Not Winning’ Fight Against ISIS
Jimmy Carter Says Latest Brain Scan Shows No Cancer 
One Direction Leads the Way on Twitter's List of 2015 Tweets 
Promises of Grocery Stores in Needy Areas Mostly Unfulfilled 
McNabb Scores Tiebreaking Goal, Kings Beat Lightning 3-1 
Grocery Chains Leave Food Deserts Barren, AP Analysis Finds 
Medical Examiner Shortage: Facts About Death Investigations 
Roethlisberger Throws 4 TD Passes, Steelers Roll Colts 45-10 
Grocery Chains Leave Food Deserts Barren, AP Analysis Finds 
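
The same idea can be tried without hitting the live site. The following is only a minimal sketch, assuming the markup looks like the sample in the question (the name attribute is left out for brevity); the helper name headline_without_spans is made up for illustration and is not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Sample markup based on the question (name attribute omitted for brevity).
html = (
    '<a href="/Politics/huckabee-draws-cheers-fundraiser-west-bank-settlement/story?id=35615831">'
    'Huckabee Draws Cheers at Fundraiser for West Bank Settlement'
    '<span class="metaH_timeDay">41 minutes ago</span></a>'
)


def headline_without_spans(markup):
    """Hypothetical helper: return the link text with every nested <span> removed."""
    link = BeautifulSoup(markup, "html.parser").a
    while link.span:
        # decompose() removes the tag and its contents from the tree.
        link.span.decompose()
    return link.get_text(strip=True)


print(headline_without_spans(html))
```

Since decompose() throws the span away, this suits the "at least remove it" case; if the time is still needed, it has to be read before the span is decomposed.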
1

Let's pull it out with extract():

>>> link.span.extract()  # remove the first `span` tag that we don't need 
>>> time = link.span.extract() 
>>> time 
<span class="metaH_timeDay">2 hours, 45 minutes ago</span> 
>>> link.text 
' Obama Seeks to Remove Fear From ISIS Fight' 
>>> time.text 
'2 hours, 45 minutes ago' 
>>> 
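
For reference, here is a self-contained sketch of the extract() approach, assuming the single-span markup from the question rather than the live page (which, per the session above, had an extra span first). The function name split_headline_and_time is hypothetical. It also shows get_text() with a separator, which keeps the time but puts a space in front of it.

```python
from bs4 import BeautifulSoup

# Sample markup based on the question (name attribute omitted for brevity).
html = (
    '<a href="/Politics/huckabee-draws-cheers-fundraiser-west-bank-settlement/story?id=35615831">'
    'Huckabee Draws Cheers at Fundraiser for West Bank Settlement'
    '<span class="metaH_timeDay">41 minutes ago</span></a>'
)


def split_headline_and_time(markup):
    """Hypothetical helper: return (headline, time_text); time_text is None without a timestamp span."""
    link = BeautifulSoup(markup, "html.parser").a
    span = link.find("span", class_="metaH_timeDay")
    # extract() removes the span from the link and returns it, so its text is still available.
    time_text = span.extract().get_text(strip=True) if span else None
    return link.get_text(strip=True), time_text


headline, time_text = split_headline_and_time(html)
print(headline)
print(time_text)

# Alternative that leaves the tree untouched: join the text pieces with a space.
joined = BeautifulSoup(html, "html.parser").a.get_text(" ", strip=True)
print(joined)
```

Unlike decompose(), extract() hands the removed tag back, which is why the time can still be printed separately.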