Extracting data from the title tag using BeautifulSoup?

I want to extract the title of a link after fetching its HTML, using the BeautifulSoup library in Python. Basically, the whole title tag is:

<title>Imaan Z Hazir on Twitter: &quot;Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)&quot;</title>

The data I want to extract is the part between the &quot; entities, which is just this: Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3). I tried this:

import urllib.error
import urllib.request

from bs4 import BeautifulSoup

link = "https://twitter.com/ImaanZHazir/status/778560899061780481"
try:
    titles = []
    # Send a browser-like User-Agent with the request.
    r = urllib.request.Request(link, headers={'User-Agent': 'Chrome/51.0.2704.103'})
    h = urllib.request.urlopen(r).read()
    data = BeautifulSoup(h, "html.parser")
    # Collect the text of every <title> tag and print the first one.
    for i in data.find_all("title"):
        titles.append(i.text)
        print(titles[0])
except urllib.error.HTTPError as err:
    pass

I also tried:

for i in data.find_all("title.&quot"): 

for i in data.find_all("title>&quot"): 

for i in data.find_all("&quot"):

and

for i in data.find_all("quot"):

But none of them work.

Source

2016-09-21 Amar

I'd expect BeautifulSoup to convert '&quot;' to '"', so you only need to look for '"' – zvone

@zvone What is this '"'? Do you mean something like 'title<"'? – Amar

Just split the text on the colon:

In [1]: h = """<title>Imaan Z Hazir on Twitter: &quot;Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)&quot;</title>""" 

In [2]: from bs4 import BeautifulSoup 

In [3]: soup = BeautifulSoup(h, "lxml") 

In [4]: print(soup.title.text.split(": ", 1)[1]) 
"Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)"

Actually, looking at the page, you don't need to split at all: the text is in the p tag inside div.js-tweet-text-container:

In [8]: import requests 

In [9]: from bs4 import BeautifulSoup 


In [10]: soup = BeautifulSoup(requests.get("https://twitter.com/ImaanZHazir/status/778560899061780481").content, "lxml") 


In [11]: print(soup.select_one("div.js-tweet-text-container p").text) 
Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3) 

In [12]: print(soup.title.text.split(": ", 1)[1]) 
"Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)"

So you can get the same result either way.
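Both approaches can be tried offline. The sketch below is only an illustration: the `html_doc` string is a hypothetical, cut-down stand-in for the page (not the real Twitter markup, which may well have changed), and `html.parser` is used so that lxml is not needed. Since `select_one` returns None when nothing matches, it is probably worth guarding for that.

```python
from bs4 import BeautifulSoup

# Hypothetical, cut-down stand-in for the tweet page (not the real markup).
html_doc = """
<html><head>
<title>Imaan Z Hazir on Twitter: &quot;REALLY, AMERICA? (3)&quot;</title>
</head><body>
<div class="js-tweet-text-container"><p>REALLY, AMERICA? (3)</p></div>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Approach 1: split the title on the first ": " and drop the surrounding quotes.
from_title = soup.title.text.split(": ", 1)[1].strip('"')

# Approach 2: read the tweet text from the page body, guarding against no match.
node = soup.select_one("div.js-tweet-text-container p")
from_body = node.text if node is not None else None

print(from_title)
print(from_body)
```

Note that `strip('"')` removes every leading and trailing double quote, so a tweet that itself starts or ends with a quote would lose it; slicing off one character at each end is an alternative in that case.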

Source

2016-09-21 22:37:16

Caunnungham this works! Thanks for letting me know. 'print(soup.select_one("div.js-tweet-text-container p").text)' – Amar

Once you have parsed the HTML:

data = BeautifulSoup(h,"html.parser")

find the title like this:

title = data.find("title").string # this is without <title> tag

Now find the two quotation marks (") in the string. There are many ways to do that. I would use a regular expression:

import re
match = re.search(r'".*"', title)
if match:
    print(match.group(0))

You never search for &quot; or any other &NAME; sequence, because BeautifulSoup converts them into the actual characters they represent.
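A quick way to see this is to compare the raw markup with what BeautifulSoup hands back. In this small sketch (the shortened title string is made up for illustration), the entity no longer appears in the parsed string, and `find_all("quot")` finds nothing because `find_all` looks for tags by name, not for entities.

```python
from bs4 import BeautifulSoup

# Shortened, made-up title used only for illustration.
raw = "<title>Someone on Twitter: &quot;hello&quot;</title>"
data = BeautifulSoup(raw, "html.parser")

title = data.find("title").string
print(title)                   # the entities have become real quote characters
print("&quot;" in title)       # the entity text is no longer present
print(data.find_all("quot"))   # find_all matches tag names; there is no <quot> tag
```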

EDIT:

A regex that does not capture the quotes is:

re.search(r'(?<=").*(?=")', title)
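To compare the two patterns side by side, here is a small standalone sketch. The `title` value is a shortened version of the string BeautifulSoup would return (entities already decoded), and the third pattern, a plain capturing group, is an extra alternative that is not part of the answer itself.

```python
import re

# Shortened title text, as BeautifulSoup would return it after decoding entities.
title = 'Imaan Z Hazir on Twitter: "REALLY, AMERICA? (3)"'

with_quotes = re.search(r'".*"', title)          # match includes both quotes
lookaround = re.search(r'(?<=").*(?=")', title)  # quotes asserted, not consumed
grouped = re.search(r'"(.*)"', title)            # alternative: capturing group

for m in (with_quotes, lookaround):
    if m:
        print(m.group(0))
if grouped:
    print(grouped.group(1))
```

All three are greedy, so they run from the first double quote to the last one in the string.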

Source

2016-09-21 19:05:06 zvone

Here is a simple, complete example that uses a regular expression to extract the text inside the quotes:

import re
import urllib.request

from bs4 import BeautifulSoup

link = "https://twitter.com/ImaanZHazir/status/778560899061780481"

r = urllib.request.urlopen(link)
soup = BeautifulSoup(r, "html.parser")
title = soup.title.string
# Capture the text between the quotes in the title.
quote = re.match(r'^.*\"(.*)\"', title)
if quote:  # re.match returns None when the title has no quoted part
    print(quote.group(1))

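What happens here is that, after fetching the page source and finding the title, we apply a regular expression to the title to extract the text inside the quotes.

We tell the regex to match any number of characters from the start of the string (^.*) up to the opening quote (\"), and then to capture the text between it and the closing quote (the second \").

We then print the captured text by telling Python to print the first captured group (the part between parentheses in the regex).

More on matching regular expressions in Python: https://docs.python.org/3/library/re.html#match-objects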
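Two things may be worth watching with this pattern: `re.match` returns None when the title has no quoted part, and because the leading `.*` is greedy, a tweet that itself contains quotation marks is captured only from its second-to-last quote onward. Below is a minimal sketch of a more defensive variant. `extract_quoted` is a hypothetical helper name, the sample titles are made up, and the lazy `.*?` prefix is a substitution, not part of the original pattern.

```python
import re

def extract_quoted(title):
    """Return the text between the first and last double quote, or None."""
    # Lazy prefix (.*?) stops at the first quote; greedy (.*) runs to the last one.
    m = re.match(r'^.*?"(.*)"', title)
    return m.group(1) if m else None

sample = 'User on Twitter: "he said "hi" to me"'  # made-up title with inner quotes

print(extract_quoted(sample))
print(extract_quoted('A title without quotes'))

# For comparison, the original greedy pattern on the same input:
greedy = re.match(r'^.*\"(.*)\"', sample).group(1)
print(greedy)
```

For titles with a single pair of quotes, such as the one in the question, both patterns give the same result.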

Source

2016-09-21 20:52:21 4140tm
