2016-09-21

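Extract data from the title tag using BeautifulSoup?

I want to extract the title of a link after fetching its HTML with the BeautifulSoup library in Python. Basically, the whole title tag is: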

<title>Imaan Z Hazir on Twitter: &quot;Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)&quot;</title> 

The data I want to extract is the part between the &quot; markers, which is just this: Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3). I tried:

import urllib 
import urllib.request 

from bs4 import BeautifulSoup 

link = "https://twitter.com/ImaanZHazir/status/778560899061780481" 
try: 
    List=list() 
    r = urllib.request.Request(link, headers={'User-Agent': 'Chrome/51.0.2704.103'}) 
    h = urllib.request.urlopen(r).read() 
    data = BeautifulSoup(h,"html.parser") 
    for i in data.find_all("title"):
        List.append(i.text)
        print(List[0])
except urllib.error.HTTPError as err: 
    pass 

I also tried:

for i in data.find_all("title.&quot"): 

for i in data.find_all("title>&quot"): 

for i in data.find_all("&quot"): 

and

for i in data.find_all("quot"): 

But none of them work.

I would expect BeautifulSoup to convert '&quot;' into '"', so you only need to look for '"'. – zvone

@zvone What is that? Do you mean something like 'title<">'? – Amar

Answers

0

Just split the text on the colon:

In [1]: h = """<title>Imaan Z Hazir on Twitter: &quot;Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)&quot;</title>""" 

In [2]: from bs4 import BeautifulSoup 

In [3]: soup = BeautifulSoup(h, "lxml") 

In [4]: print(soup.title.text.split(": ", 1)[1]) 
"Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)" 

Actually, looking at the page, you don't need to split at all: the text is in the p tag inside div.js-tweet-text-container:

In [8]: import requests 

In [9]: from bs4 import BeautifulSoup 


In [10]: soup = BeautifulSoup(requests.get("https://twitter.com/ImaanZHazir/status/778560899061780481").content, "lxml") 


In [11]: print(soup.select_one("div.js-tweet-text-container p").text) 
Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3) 

In [12]: print(soup.title.text.split(": ", 1)[1]) 
"Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)" 

So you can get the same text either way; the title version just keeps the surrounding quotation marks.
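Note that `select_one` returns `None` when nothing matches the selector, so a page without that container would raise an `AttributeError` on `.text`. A minimal sketch of a guarded version that falls back to the title; the function name `tweet_text` and the inline sample pages are hypothetical stand-ins for the downloaded HTML, and only the `div.js-tweet-text-container p` selector comes from the answer above:

```python
from bs4 import BeautifulSoup


def tweet_text(html):
    """Return the tweet text from the page body, else from the title, else None."""
    soup = BeautifulSoup(html, "html.parser")

    # Preferred source: the paragraph inside the tweet text container.
    node = soup.select_one("div.js-tweet-text-container p")
    if node is not None:
        return node.get_text()

    # Fallback: the part of the title after the first ": ", without quotes.
    if soup.title is not None and soup.title.string:
        return soup.title.string.split(": ", 1)[-1].strip('"')

    return None


# Hypothetical sample pages used only for illustration.
sample_page = """
<html><head><title>Someone on Twitter: &quot;Hello, world&quot;</title></head>
<body><div class="js-tweet-text-container"><p>Hello, world</p></div></body></html>
"""
title_only_page = (
    "<html><head><title>Someone on Twitter: &quot;Hello, world&quot;</title>"
    "</head><body></body></html>"
)

print(tweet_text(sample_page))
print(tweet_text(title_only_page))
```

Both calls print the same text here, because the fallback strips the quotation marks that the title adds.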

Caunnungham this works! Thanks for pointing it out. 'print(soup.select_one("div.js-tweet-text-container p").text)' – Amar

0

Once you have parsed the HTML:

data = BeautifulSoup(h,"html.parser") 

Find the title like this:

title = data.find("title").string # this is without <title> tag 

Now find the two quotation marks (") in the string. There are many ways to do that. I would use a regular expression:

import re 
match = re.search(r'".*"', title) 
if match: 
    print(match.group(0))

You never search for &quot; or any other &NAME; sequence, because BeautifulSoup converts them into the actual characters they represent.
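A small sketch of that behaviour, assuming a made-up title string rather than the real page. It also shows why the `find_all("&quot")` attempts in the question return nothing: `find_all` with a string argument looks for tags with that name, and there is no such tag.

```python
from bs4 import BeautifulSoup

# Hypothetical title used only for illustration.
html = "<title>Someone on Twitter: &quot;Hello &amp; goodbye&quot;</title>"
soup = BeautifulSoup(html, "html.parser")
title = soup.title.string

print(title)                    # entities are already decoded here
print("&quot;" in title)        # the entity text is gone
print('"' in title)             # the real quote character is present
print(soup.find_all("&quot"))   # no tag has this name, so the list is empty
```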

Edit:

A regular expression that does not capture the quotes is:

re.search(r'(?<=").*(?=")', title) 
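To compare the two patterns side by side, here is a short sketch on a made-up title (the string is an assumption, not the real page). A capturing group is shown as a third option that is often easier to read than lookarounds.

```python
import re

# Hypothetical title used only for illustration.
title = 'Someone on Twitter: "Hello, world"'

# Pattern from the answer: the match includes both quotation marks.
with_quotes = re.search(r'".*"', title).group(0)

# Lookbehind/lookahead: the quotes must be present but are not part of the match.
lookaround = re.search(r'(?<=").*(?=")', title).group(0)

# Alternative: match the quotes, but return only the captured group.
grouped = re.search(r'"(.*)"', title).group(1)

print(with_quotes)
print(lookaround)
print(grouped)
```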

0

Here is a simple, complete example that uses a regular expression to extract the text inside the quotes:

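import re
import urllib.request  # "import urllib" alone does not load the request submodule

from bs4 import BeautifulSoup

link = "https://twitter.com/ImaanZHazir/status/778560899061780481"

# Download the page and parse it.
r = urllib.request.urlopen(link)
soup = BeautifulSoup(r, "html.parser")
title = soup.title.string

# Capture the text between the quotation marks in the title.
quote = re.match(r'^.*\"(.*)\"', title)
if quote:
    print(quote.group(1))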

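What happens here is that, after fetching the page source and finding the title, we apply a regular expression to the title to extract the text inside the quotes.

We tell the regular expression to match any number of characters from the start of the string (^.*) up to the opening quote (\"), and then to capture the text between it and the closing quote (the second \").

We then print the captured text by telling Python to print the first captured group (the part between parentheses in the regular expression).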
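One caveat worth illustrating: because the leading `^.*` is greedy, this pattern captures the text between the last two quotation marks, which is the whole tweet only when the title contains exactly two of them. A sketch under that observation, using a made-up title with quotes inside the tweet; the helper name `extract_quoted` and the `[^"]*` prefix are suggestions, not part of the answer above:

```python
import re

# Hypothetical title whose tweet text itself contains quotation marks.
title = 'Someone on Twitter: "He said "hi" to me"'

# Pattern from the answer: the greedy prefix swallows the earlier quotes.
greedy_prefix = re.match(r'^.*"(.*)"', title).group(1)

# Prefix that stops at the first quote, so the group spans first to last quote.
first_quote = re.match(r'^[^"]*"(.*)"', title).group(1)


def extract_quoted(text):
    """Return the text between the first and last quote, or None if absent."""
    match = re.match(r'^[^"]*"(.*)"', text)
    return match.group(1) if match else None


print(repr(greedy_prefix))
print(repr(first_quote))
print(extract_quoted("no quotes here"))
```

Checking the match object before calling `group` also avoids an `AttributeError` when the title has no quotation marks at all.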

More about match objects in Python regular expressions: https://docs.python.org/3/library/re.html#match-objects
