BeautifulSoup: finding a specific value in meta tags

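I am trying to find all meta tags that carry an author. It works if I give a specific key and a regular expression for the value. It does not work when both the key and the value are regular expressions. Is it possible to extract every meta tag on a page that contains the keyword "author"? This is the code I wrote.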

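import re
import requests
from bs4 import BeautifulSoup

page = requests.get(url)
contents = page.content
soup = BeautifulSoup(contents, 'lxml')
# This is the call that finds nothing: the attribute name is given as a regex.
preys = soup.find_all("meta", attrs={re.compile('.*'): re.compile('author')})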
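For reference, here is a minimal sketch of the variant that does work: a fixed attribute name with a regex value, repeated over a short list of keys. The key list `candidate_keys`, the sample HTML, and the choice of `html.parser` are assumptions made for illustration, not part of the original code.

```python
import re
from bs4 import BeautifulSoup

# Sample markup (assumed for illustration).
html = '''
<meta content="Jami Miscik" name="citation_author"/>
<meta content="Will Ripley, Joshua Berlinger and Allison Brennan, CNN" itemprop="author"/>
<meta content="Alison Griswold" property="author"/>
<meta content="2010-05-15" name="date"/>
'''

soup = BeautifulSoup(html, "html.parser")

author_rx = re.compile("author")
# Hypothetical list of attribute names worth checking.
candidate_keys = ("name", "itemprop", "property")

# Key by id() so a tag that matches under several keys is kept only once.
found = {}
for key in candidate_keys:
    for tag in soup.find_all("meta", attrs={key: author_rx}):
        found[id(tag)] = tag

authors = [tag["content"] for tag in found.values()]
print(authors)
```

The regex is applied with a search, so `citation_author` matches as well as a plain `author`.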

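Edit: To clarify, the problem I want to solve is detecting whether the value "author" is mapped to any key. That key may be "itemprop", "name" or even "property", as I have seen in various examples. Basically, I want to pull every meta tag that has author as a value, no matter which key holds that value. A few examples: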

<meta content="Jami Miscik" name="citation_author"/> 
<meta content="Will Ripley, Joshua Berlinger and Allison Brennan, CNN" itemprop="author"/> 
<meta content="Alison Griswold" property="author"/>

Source

2017-06-13 Furkanicus

Does the documentation suggest anywhere that an attribute name can be a regular expression? I can't find any hint of that at https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attrs – Tomalak

That may be the case. If so, I will have to collect all possible keys and check their values. – Furkanicus
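One way to avoid collecting the keys by hand is to pass a function to `find_all` and inspect each tag's `attrs` dictionary directly. The following is only a sketch under some assumptions: the helper name `has_author_attribute` is made up, `content` is skipped so that an author named in the content does not count as a match, and `html.parser` is used to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Sample markup (taken from the examples in the question, plus one non-author tag).
html = '''
<meta content="Jami Miscik" name="citation_author"/>
<meta content="Will Ripley, Joshua Berlinger and Allison Brennan, CNN" itemprop="author"/>
<meta content="Alison Griswold" property="author"/>
<meta content="2010-05-15" name="date"/>
'''

def has_author_attribute(tag):
    """Return True for <meta> tags where any attribute except content mentions 'author'."""
    if tag.name != "meta":
        return False
    for key, value in tag.attrs.items():
        if key == "content":
            continue
        # Multi-valued attributes such as class or rel are parsed as lists.
        text = " ".join(value) if isinstance(value, list) else value
        if "author" in text.lower():
            return True
    return False

soup = BeautifulSoup(html, "html.parser")
author_tags = soup.find_all(has_author_attribute)
author_names = [tag.get("content") for tag in author_tags]
print(author_names)
```

Because the function sees every attribute, the key can be `name`, `itemprop`, `property` or anything else without being listed in advance.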

If you are looking for citation_author or author, you can get by with a combination of soup.select() and a regular expression:

from bs4 import BeautifulSoup 
import re 

# some test string 
html = ''' 
<meta name="author" content="Anna Lyse"> 
<meta name="date" content="2010-05-15T08:49:37+02:00"> 
<meta itemprop="author" content="2010-05-15T08:49:37+02:00"> 
<meta rel="author" content="2010-05-15T08:49:37+02:00"> 
<meta content="Jami Miscik" name="citation_author"/> 
<meta content="Will Ripley, Joshua Berlinger and Allison Brennan, CNN" itemprop="author"/> 
<meta content="Alison Griswold" property="author"/> 
''' 

soup = BeautifulSoup(html, 'html5lib') 

# Match ="author" or ="citation_author" in the serialized tag
rx = re.compile(r'(?<==)"(?:citation_)?author"')

authors = [author
           for author in soup.select("meta")
           if rx.search(str(author))]

print(authors)

Source

2017-06-13 19:07:40 Jan

I was trying to remember that syntax and couldn't find it in the docs. –

@BillBell: It's a bit hidden; you can find it at https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors - *Test for the existence of an attribute:* – Jan

Thanks. I keep telling myself I'll try writing the BS docs in some different form, for people like me. –
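Since the comments point at the CSS selector section, here is a sketch of how attribute selectors alone might cover the common cases, without a regex over the serialized tag. The three attribute names in the selector are an assumption based on the examples in the question; a tag that stores "author" under some other key would be missed.

```python
from bs4 import BeautifulSoup

# Sample markup (assumed for illustration).
html = '''
<meta name="author" content="Anna Lyse">
<meta name="date" content="2010-05-15T08:49:37+02:00">
<meta content="Jami Miscik" name="citation_author"/>
<meta content="Alison Griswold" property="author"/>
'''

soup = BeautifulSoup(html, "html.parser")

# [attr*="author"] matches when the attribute value contains "author";
# the commas join the three selectors into a single query.
selector = 'meta[name*="author"], meta[itemprop*="author"], meta[property*="author"]'
contents = [tag["content"] for tag in soup.select(selector)]
print(contents)
```

This should pick up the three author tags and skip the date tag.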

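This should do it. I'm sorry that I couldn't quickly find a page with an author meta tag that would demonstrate that this code works. If you find a mistake, please let me know.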

>>> import requests 
>>> import bs4 
>>> page = requests.get('http://reference.sitepoint.com/html/meta').text 
>>> soup = bs4.BeautifulSoup(page, 'lxml') 
>>> [item.attrs['name'] for item in soup('meta') if item.has_attr('name')] 
['robots', 'description'] 
>>> [item.attrs['name'] for item in soup('meta') if item.has_attr('name') and item.attrs['name'].lower()=='author'] 
[]

Edit: It works with Jan's chunk of HTML. His syntax is better; use it.

>>> html = '<meta name="author" content="Anna Lyse"> <meta name="date" content="2010-05-15T08:49:37+02:00">' 
>>> soup = bs4.BeautifulSoup(html, 'lxml') 
>>> [item.attrs['name'] for item in soup('meta') if item.has_attr('name') and item.attrs['name'].lower()=='author'] 
['author']

Source

2017-06-13 19:04:01

Just build a sample string :) – Jan

...why didn't I think of that? –

...no idea :-) ... – Jan
