Question: Beautiful Soup + href pattern not scraping the way I want

I have the following HTML pattern that I want to scrape with BeautifulSoup. The HTML pattern is:

<a href="link" target="_blank" onclick="blah blah blah">TITLE</a>

I want to grab TITLE and the information shown at the link. That is, if you click the link, a description of TITLE is displayed. I want that description.

I started by just trying to grab the title with the following code:

import urllib 
from bs4 import BeautifulSoup 
import re 

webpage = urrlib.urlopen("http://urlofinterest") 

title = re.compile('<a>(.*)</a>') 
findTitle = re.findall(title,webpage) 
print findTile

My output is:

% python beta2.py 
[]

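So it is evidently not even finding the title. I also tried <a href>(.*)</a>, but that did not work either. From my reading of the documentation, I thought BeautifulSoup would grab whatever text sits between the symbols I give it. In that case, what am I doing wrong?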


2013-02-02 GeekyOmega

Presumably that should be print findTitle rather than print findTile? –

The regex pattern you compiled does not match the link; try using re.compile('(.*?)<\/a>')... you can practice at https://regex101.com/ – rebeling
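To illustrate the comment above: the pattern <a>(.*)</a> only matches an anchor tag that has no attributes, so it can never match the sample markup. The following is a minimal sketch, assuming Python 3 and the sample string from the question. The looser pattern is an illustration of the idea, not necessarily the commenter's exact pattern, and regular expressions remain fragile for real-world HTML.

```python
import re

html = '<a href="link" target="_blank" onclick="blah blah blah">TITLE</a>'

# The asker's pattern requires a literal "<a>", so any attribute makes it fail.
strict = re.compile(r'<a>(.*)</a>')
no_match = strict.findall(html)

# Allow arbitrary attributes after "<a" and capture the text non-greedily.
loose = re.compile(r'<a\b[^>]*>(.*?)</a>', re.DOTALL)
titles = loose.findall(html)

print(no_match)  # []
print(titles)    # ['TITLE']
```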

How come you import BeautifulSoup and then never use it?

webpage = urrlib.urlopen("http://urlofinterest")

urlopen returns a file-like response object, not a string, so you need to read the data from it:

webpage = urllib.urlopen("http://urlofinterest").read()

Something like this (it should get you to a point where you can go further):

>>> blah = '<a href="link" target="_blank" onclick="blah blah blah">TITLE</a>' 
>>> from bs4 import BeautifulSoup 
>>> soup = BeautifulSoup(blah) # change to webpage later 
>>> for tag in soup('a', href=True): 
    print tag['href'], tag.string 

link TITLE
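The session above uses Python 2 print syntax. For readers on Python 3, a roughly equivalent sketch might look like the following. It assumes bs4 is installed and passes "html.parser" explicitly, which is a choice made here rather than something the answer specifies.

```python
from bs4 import BeautifulSoup

html = '<a href="link" target="_blank" onclick="blah blah blah">TITLE</a>'

# Naming the parser explicitly avoids bs4's "no parser specified" warning.
soup = BeautifulSoup(html, "html.parser")

# find_all("a", href=True) keeps only anchors that carry an href attribute.
links = [(tag["href"], tag.string) for tag in soup.find_all("a", href=True)]

for href, title in links:
    print(href, title)  # link TITLE
```

Calling soup('a', href=True), as in the answer, is shorthand for soup.find_all('a', href=True).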


2013-02-02 18:02:40

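I got it working. I also used it to pick out the exact URLs I want! I check for 'i_need_this' in tag['href'] and print only the URLs I want. So my next question is: how do I get the information behind that link? The link I want contains a description of the title, and maybe some other things. – GeekyOmega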
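One possible way to approach this follow-up is to filter the anchors by keyword, resolve each href against the page URL, and parse the linked page as well. The sketch below assumes Python 3 and bs4. The helper names fetch_html and linked_page_text are hypothetical, and because the thread does not show the structure of the linked pages, the sketch simply returns each page's visible text rather than a specific description element.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its raw bytes (hypothetical helper)."""
    with urlopen(url) as response:
        return response.read()


def linked_page_text(index_html, base_url, keyword, fetch=fetch_html):
    """Map each matching link's title to the visible text of the linked page."""
    soup = BeautifulSoup(index_html, "html.parser")
    results = {}
    for tag in soup.find_all("a", href=True):
        if keyword not in tag["href"]:
            continue
        # Resolve relative hrefs against the page they were found on.
        target = urljoin(base_url, tag["href"])
        page = BeautifulSoup(fetch(target), "html.parser")
        # Assumption: the description is part of the page's visible text.
        results[tag.get_text(strip=True)] = page.get_text(" ", strip=True)
    return results
```

It could be called as linked_page_text(fetch_html("http://urlofinterest"), "http://urlofinterest", "i_need_this"). Once the layout of the description page is known, page.get_text could be replaced by a more targeted lookup such as page.find(...) on the element that holds the description.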
