How to extract h2 and h3 headings from a news article


I am trying to build a web scraper that extracts the main heading from news articles. How can I extract h2 and h3 headings from a news article?

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

url = input('enter the url \n')

r = requests.get(url)
content = r.content
soup = BeautifulSoup(content, "html.parser")
heading = soup.find_all('h1')  # list of every <h1> tag; empty if the page has none
print(heading)
print(heading[0].text.strip())  # heading[0] raises IndexError when the list is empty

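This only works for headings in h1 tags; it raises an error for headings in h2 or h3 tags. How can I modify this code so that it also works for h2 and h3 tags? Thanks in advance!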

Source

2016-06-23 Amit Singh

BeautifulSoup is quite flexible here: just pass in a list of the tag names you want to find:

soup.find_all(['h1', 'h2', 'h3'])
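As a minimal sketch of how the list form might be used end to end, the snippet below parses a small made-up HTML string instead of a downloaded page. `SAMPLE_HTML` and the helper name `extract_headings` are assumptions for illustration and are not part of the original code; checking whether the result is empty avoids the error that indexing with `[0]` raises when nothing matched.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded article page.
SAMPLE_HTML = """
<html><body>
  <h2 class="date-header">Thursday, June 23, 2016</h2>
  <h3 class="post-title"> Introducing a sample post </h3>
  <p>Body text.</p>
</body></html>
"""


def extract_headings(html, tags=("h1", "h2", "h3")):
    """Return (tag name, stripped text) pairs for the requested tags, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    return [(tag.name, tag.get_text(strip=True)) for tag in soup.find_all(list(tags))]


headings = extract_headings(SAMPLE_HTML)
if headings:
    for name, text in headings:
        print(name, text)
else:
    # An empty list means none of the requested tags were present.
    print("no h1, h2 or h3 headings found")
```

The same helper could be fed `r.content` from the `requests.get(url)` call in the question.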

You can even do this:

import re 

soup.find_all(re.compile(r"^h\d$")) # would match "h" followed by a single digit
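Note that this pattern matches any tag named "h" plus one digit, so it also picks up h4 to h6. If only h1 to h3 are wanted, a narrower character class is one option. The sketch below compares the two patterns on a made-up HTML string; the markup and variable names are assumptions for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: three heading levels plus a <header> tag, which starts with "h" too.
html = "<h1>Title</h1><h2>Date</h2><h4>Footer note</h4><header>Site banner</header>"
soup = BeautifulSoup(html, "html.parser")

# "h" followed by exactly one digit: matches every heading level present.
any_level = [tag.name for tag in soup.find_all(re.compile(r"^h\d$"))]

# Restrict the digit to 1-3 to skip h4 and deeper levels.
up_to_h3 = [tag.name for tag in soup.find_all(re.compile(r"^h[1-3]$"))]

print(any_level)
print(up_to_h3)
```

The `^` and `$` anchors matter in both patterns: without them, a tag name such as `header` could match on its leading "h" if the rest of the pattern allowed it.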

Source

2016-06-23 19:58:58 alecxe

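Thanks a lot for your help, Alex. It works, and I am able to extract the h1 and h2 tags. But how can I extract the main heading from an article like this one (http://android-developers.blogspot.in/2016/06/introducing-android-basics-nanodegree.html)? There the main heading is in an h3 tag and the date is in an h2. –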

@AmitSingh Well, you can find the date by class name: 'soup.find(class_="date-header").get_text()', and the same goes for the article title: 'soup.find(class_="post-title").get_text()'. – alecxe
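A minimal sketch of the class-based lookup, with a guard for the case where `find` returns `None` because no element carries the class. The class names come from the comment above; the sample markup, its text, and the helper name `text_by_class` are made up for illustration and are not taken from the linked page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup using the class names mentioned in the comment.
html = """
<h2 class="date-header"><span>Thursday, June 23, 2016</span></h2>
<h3 class="post-title entry-title"> A sample post title </h3>
"""
soup = BeautifulSoup(html, "html.parser")


def text_by_class(soup, class_name):
    """Return the stripped text of the first element with this CSS class, or None."""
    node = soup.find(class_=class_name)
    return node.get_text(strip=True) if node is not None else None


print(text_by_class(soup, "date-header"))
print(text_by_class(soup, "post-title"))    # matches even though the tag has two classes
print(text_by_class(soup, "no-such-class"))  # None instead of an AttributeError
```

Looking elements up by class rather than by tag name keeps the scraper working when a site uses h3 for titles and h2 for dates, but the class names are specific to each site's template.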
