2017-04-16 67 views

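BeautifulSoup: further cleaning up article text

I need to count the characters in news articles. Some pages contain a lot of material I don't need (navigation, footers, and so on). I have managed to get rid of all of that, but I am still left with things such as image copyright notices, image and video captions, and adverts, which I am struggling to remove. Can anyone suggest how to improve the code below so that it extracts only the useful text of the article?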

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.bbc.co.uk/news/world-europe-39612562")
soup = BeautifulSoup(r.content, "html.parser")  # name the parser explicitly
# join every text node inside the article container
for s in soup.find_all("div", {"class": "story-body__inner"}):
    article = ''.join(s.find_all(string=True))
print(article)
print(len(article))

For this particular URL the code produces the following (only the top is shown, to illustrate the problem):

Image copyright 
AFP 


Image caption 

        Erdogan supporters began celebrating early outside party headquarters in Ankara 


Turks have backed President Recep Tayyip Erdogan's call for sweeping new presidential powers, partial official results of a referendum indicate.With about 98% of ballots counted, "Yes" was on about 51.3% and "No" on about 48.7%.Erdogan supporters say replacing the parliamentary system with an executive presidency would modernise the country. Opponents have attacked a decision to accept unstamped ballot papers as valid unless proven otherwise.The main opposition Republican People's Party (CHP) is already demanding a recount of 60% of the votes. 


      /**/ 
      (function() { 
       if (window.bbcdotcom && bbcdotcom.adverts && bbcdotcom.adverts.slotAsync) { 
        bbcdotcom.adverts.slotAsync('mpu', [1,2,3]); 
       } 
      })(); 
      /**/ 

Answer

It seems you need neither the script tags nor the figure tags, so:

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.bbc.co.uk/news/world-europe-39612562")
soup = BeautifulSoup(r.content, "html.parser")  # name the parser explicitly

# delete unwanted tags:
for e in soup(['figure', 'script']):
    e.decompose()

# collect the text of every article container
article_soup = [e.get_text() for e in soup.find_all(
    'div', {'class': 'story-body__inner'})]

article = ''.join(article_soup)
print(article)
print(len(article))
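
To try the same idea without hitting the network, here is a minimal sketch that runs on an inline HTML string. The sample markup, the `contact-form` class, and the helper name `extract_article_text` are made up for illustration; only the `story-body__inner` class and the `figure`/`script` tags come from the code above. Note that it also collapses whitespace with `get_text(separator=" ", strip=True)`, which is my own addition and will change the character count compared with a plain `get_text()`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, loosely modelled on the page structure discussed above.
SAMPLE_HTML = """
<html><body>
<div class="story-body__inner">
  <figure><span>Image copyright</span><span>AFP</span></figure>
  <p>Turks have backed the proposal.</p>
  <script>(function() { /* advert slot */ })();</script>
  <p>Opponents demand a recount.</p>
</div>
<div class="contact-form">Get in touch</div>
</body></html>
"""


def extract_article_text(html, container_class="story-body__inner",
                         unwanted=("figure", "script")):
    """Return the text of the article containers, minus the unwanted tags."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove the unwanted tags (and everything inside them) from the tree.
    for tag in soup(list(unwanted)):
        tag.decompose()

    parts = []
    for container in soup.find_all("div", class_=container_class):
        # strip=True trims each text node and skips empty ones;
        # separator=" " joins the remaining nodes with single spaces.
        parts.append(container.get_text(separator=" ", strip=True))
    return " ".join(parts)


article = extract_article_text(SAMPLE_HTML)
print(article)
print(len(article))
```

If other residue shows up on a different page, adding its tag name to `unwanted` is probably the simplest thing to try first.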
This leaves only the page's contact form... – aviss

Strange. I've updated the answer; it should work now. – odradek

Thanks a lot! It works now. – aviss