2017-08-15 48 views
0

Extracting text from <p></p> with BeautifulSoup

I want to extract a news article from this link. My code is:

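import requests
from bs4 import BeautifulSoup

def get_news_details(news_url):
    # Download the page and parse it with Python's built-in HTML parser
    source = requests.get(news_url)
    plain_text = source.text
    soup = BeautifulSoup(plain_text, "html.parser")
    content = soup.find_all('div', {'class': 'big-img-box'})
    # Print every <p> tag found inside the first matching <div>
    print(content[0].find_all('p'))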

The result is:

[<p></p>, <p></p>, <p></p>, <p></p>, <p></p>, <p></p>] 

The value of content:

<div class="big-img-box"> 
<div class="left-imgs"> 
<figure> 
<img alt="iOS developer hints possibility of 4K Apple TV" class="img-responsive" src="http://www.aninews.in/contentimages/detail/appletv.jpg"/> 
<figcaption><span class="heading-inner-span"></span></figcaption> 
</figure> 
<div class="mb10"></div> 
</div> 
<p></p>  New York [USA], August 6 <a class="highlights" href="http://aninews.in/" target="_blank">(ANI)</a>: The latest designs from Apple's HomePod firmware revealed that the tech giant is hinting the launch of a <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/4k-apple-tv.html"> 4K Apple TV</a></span> with high dynamic range (HDR) support for both <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/hdr10.html"> HDR10 </a></span> and <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/dolby-vision.html"> Dolby Vision</a></span>.<p></p>  While the current range of Apple's TV set-top box is incompatible to 4K technology, <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/ios.html">iOS</a></span> developer <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/guilherme-rambo.html"> Guilherme Rambo</a></span> revealed that the company is hinting an adoption of the ultra high-definition format, reports <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/the-verge.html">The Verge</a></span>.<p></p>  Reports of the new range of Apple TV have surfaced time and again over the past few months, starting February this year.<p></p>  It is said that implementing the HDR and 4K content will prove to b beneficial for the company, rather than a simpler resolution, since popular online movie and television platforms like <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/netflix.html"> Netflix</a></span> and <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/amazon.html"> Amazon</a></span> support the two high-definition formats.<p></p>  Last month, iTunes started listing movies as supporting 4K and <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/hdr.html"> HDR</a></span> in users' purchase histories, thus providing more thrust to the speculations of 
the 4K <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/apple.html"> Apple</a></span> TV. <a class="highlights" href="http://aninews.in/" target="_blank">(ANI)</a><p></p> 
</div> 

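I can use content[0].text, but that gives me a somewhat clumsy version of the article that I cannot format.

When I inspect the page in Chrome, the article text appears to sit inside the tags, as <p>article_text</p>. In content, however, it shows up as <p></p>article_text, with the text after an empty tag. If the former version appeared in soup, I could get the output I want. What should I do?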

Answers

2

It depends on what you mean by "format". You can make the text "tidier" in a fairly simple way.

>>> import bs4 
>>> import requests 
>>> page = requests.get('http://www.aninews.in/newsdetail-Nw/MzI4NDIy/ios-developer-hints-possibility-of-4k-apple-tv.html').content 
>>> soup = bs4.BeautifulSoup(page, 'lxml') 
>>> big_img_box = soup.select('.big-img-box') 

Get all of the text and strip the leading and trailing whitespace.

>>> big_img_box[0].text.strip() 
"New York [USA], August 6 (ANI): The latest designs from Apple's HomePod firmware revealed that the tech giant is hinting the launch of a 4K Apple TV with high dynamic range (HDR) support for both HDR10 and Dolby Vision.  While the current range of Apple's TV set-top box is incompatible to 4K technology, iOS developer Guilherme Rambo revealed that the company is hinting an adoption of the ultra high-definition format, reports The Verge.  Reports of the new range of Apple TV have surfaced time and again over the past few months, starting February this year.  It is said that implementing the HDR and 4K content will prove to b beneficial for the company, rather than a simpler resolution, since popular online movie and television platforms like Netflix and Amazon support the two high-definition formats.  Last month, iTunes started listing movies as supporting 4K and HDR in users' purchase histories, thus providing more thrust to the speculations of the 4K Apple TV. (ANI)" 

Going a step further, collapse each run of two or more internal whitespace characters into a single space.

>>> import re 
>>> re.sub(r'\s{2,}', ' ', big_img_box[0].text.strip()) 
"New York [USA], August 6 (ANI): The latest designs from Apple's HomePod firmware revealed that the tech giant is hinting the launch of a 4K Apple TV with high dynamic range (HDR) support for both HDR10 and Dolby Vision. While the current range of Apple's TV set-top box is incompatible to 4K technology, iOS developer Guilherme Rambo revealed that the company is hinting an adoption of the ultra high-definition format, reports The Verge. Reports of the new range of Apple TV have surfaced time and again over the past few months, starting February this year. It is said that implementing the HDR and 4K content will prove to b beneficial for the company, rather than a simpler resolution, since popular online movie and television platforms like Netflix and Amazon support the two high-definition formats. Last month, iTunes started listing movies as supporting 4K and HDR in users' purchase histories, thus providing more thrust to the speculations of the 4K Apple TV. (ANI)" 
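If "format" instead means keeping the paragraph breaks, one possible approach is to treat each empty <p></p> as a separator. The following is only a minimal sketch under stated assumptions: SAMPLE_HTML is a shortened stand-in for the markup shown in the question (not the real page), and PARAGRAPH_MARK and split_on_empty_paragraphs are hypothetical names introduced for illustration.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the markup shown in the question (not the real page).
SAMPLE_HTML = (
    '<div class="big-img-box"><div class="left-imgs"></div>\n'
    '<p></p>  New York [USA], August 6 <a href="http://aninews.in/">(ANI)</a>: '
    'First paragraph.<p></p>  Second paragraph, wrapped \n'
    'over two lines.<p></p>\n</div>'
)

# Arbitrary sentinel character, assumed not to occur in the page text.
PARAGRAPH_MARK = "\x1e"


def split_on_empty_paragraphs(box):
    """Treat each empty <p></p> as a separator and return cleaned paragraphs.

    Note: this modifies the parsed tree in place.
    """
    for p in box.find_all("p"):
        if not p.get_text(strip=True):
            # Swap the empty tag for the sentinel so it survives get_text().
            p.replace_with(PARAGRAPH_MARK)
    chunks = box.get_text().split(PARAGRAPH_MARK)
    # " ".join(chunk.split()) collapses runs of whitespace, including newlines.
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
box = soup.find("div", class_="big-img-box")
paragraphs = split_on_empty_paragraphs(box)
print("\n\n".join(paragraphs))
```

A sentinel character is used rather than "\n" because the article text itself can contain newlines in the middle of a paragraph.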
+0

This works for me ("tidy" is what I meant, thanks for clarifying). But I wonder why Chrome's page inspector ('<p>text</p>') and BeautifulSoup's version ('<p></p>text') differ? – Aroonalok

+0

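I'm not sure. But I would say that when browser software or BeautifulSoup encounters a page that has not been coded to conform to the standards, it has to do something with that code in order to display it. Chrome's designers may have gone in one direction when they hit this problem, and BeautifulSoup's in another. In this case the results are slightly different. –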
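To see this kind of divergence first-hand, one can feed the same invalid fragment to each parser that BeautifulSoup supports. This is only an illustrative sketch: the fragment "<a></p>" is the deliberately broken example used in the Beautiful Soup documentation, not markup from the page in the question, and parse_with_each_parser is a hypothetical helper name. lxml and html5lib are optional packages, so the sketch reports them as missing instead of failing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately invalid fragment (a stray closing tag), not taken from the question's page.
BROKEN_FRAGMENT = "<a></p>"


def parse_with_each_parser(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: repaired markup, or None if that parser is not installed}."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages.
            results[name] = None
    return results


if __name__ == "__main__":
    for name, repaired in parse_with_each_parser(BROKEN_FRAGMENT).items():
        shown = repaired if repaired is not None else "not installed"
        print(f"{name:12} -> {shown}")
```

Each parser repairs the fragment according to its own rules, so the printed trees need not match one another or what a browser's inspector shows.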

+1

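@BillBell Hey Bill, I just wanted to express appreciation for your great support of this StackOverflow tag and of the community. Thank you, you are a good person. I wish you all the best, and I wanted you to know how much we appreciate your help. –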
