What is the best way to extract specific text from a Wikipedia infobox using BeautifulSoup and a list?

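I am using BeautifulSoup to extract specific text (the revenue) from Wikipedia infoboxes. My code works when the revenue text is inside an 'a' tag. Unfortunately, not every page lists its revenue inside an 'a' tag; on some pages, for example, the revenue text comes after a 'span' tag. I would like to know the best/safest way to get the revenue text for a list of companies. Would looking for another tag in place of 'a' work best, or something else? Thanks for your help.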

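import re
import urllib.request
from bs4 import BeautifulSoup

company = ['Lockheed_Martin', 'Phillips_66', 'ConocoPhillips', 'Sysco', 'Baker_Hughes']

for c in company:
    r = urllib.request.urlopen('https://en.wikipedia.org/wiki/' + c).read()
    soup = BeautifulSoup(r, "lxml")

    # Find the infobox header cell whose text starts with "Revenue"
    rev = re.compile('^Revenue')
    thRev = [e for e in soup.find_all('th', {'scope': 'row'}) if rev.search(e.text)][0]
    tdRev = thRev.find_next('td')
    # Only <a> tags are searched, so revenue in a <span> or in plain text is missed
    revenue = tdRev.find_all('a')

    for f in revenue:
        print(c + " " + f.text)
        break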


2016-05-03 SallyH

Could you provide two example URLs?

Yes! Sorry. https://en.wikipedia.org/wiki/Lockheed_Martin, https://en.wikipedia.org/wiki/Phillips_66 – SallyH

In both of your examples, the revenue is not inside an 'a' tag.

You can try:

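from bs4 import BeautifulSoup
import urllib.request
import re

company = ['Lockheed_Martin', 'Phillips_66', 'ConocoPhillips', 'Sysco', 'Baker_Hughes']

for c in company:
    r = urllib.request.urlopen('https://en.wikipedia.org/wiki/' + c).read()
    soup = BeautifulSoup(r, "lxml")
    for tr in soup.find_all('tr'):
        trText = tr.text
        # re.MULTILINE lets ^ and $ match a "Revenue" header sitting on its own line
        if re.search(r"^\bRevenue\b$", trText, re.MULTILINE):
            # Currency code, "$", optional spaces, the number, then the scale word
            match = re.search(r"\w+\$(?:\s+)?[\d\.]+.{1}\w+", trText)
            if match:  # skip rows whose text does not fit the pattern
                print(c + "\n" + match.group() + "\n")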

Output:

Lockheed_Martin 
US$ 46.132 billion 
Phillips_66 
US$ 161.21 billion 
ConocoPhillips 
US$55.52 billion 
Sysco 
US$44.41 Billion 
Baker_Hughes 
US$ 22.364 billion
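
If you would rather not depend on which tag wraps the number, one option is to take the whole text of the cell next to the "Revenue" header and ignore the tags inside it. The following is only a minimal sketch under a few assumptions: the function name extract_revenue and the sample HTML are made up for illustration, the built-in "html.parser" is used so the sketch runs without lxml, and real infobox markup may differ from the sample.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample that mimics two infobox rows; real pages will differ.
SAMPLE_HTML = """
<table class="infobox">
  <tr><th scope="row">Industry</th><td><a href="/wiki/Aerospace">Aerospace</a></td></tr>
  <tr><th scope="row">Revenue</th><td><span>US$ 46.132 billion</span> (2015)<sup>[1]</sup></td></tr>
</table>
"""

def extract_revenue(html):
    """Return the text of the cell next to the 'Revenue' header, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for th in soup.find_all("th"):
        if th.get_text(strip=True).startswith("Revenue"):
            td = th.find_next_sibling("td")
            if td is None:
                continue
            # Drop footnote markers such as [1] before reading the text
            for sup in td.find_all("sup"):
                sup.decompose()
            # Join all text pieces, whatever tag they sit in
            text = td.get_text(" ", strip=True).replace("\xa0", " ")
            return re.sub(r"\s+", " ", text)
    return None

print(extract_revenue(SAMPLE_HTML))
```

Because the function returns None when no "Revenue" row is found, a loop over many companies can skip such pages instead of failing on an index or attribute error.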

Note: you may want to use the Wikipedia API instead, for example:

https://en.wikipedia.org/w/api.php?action=query&titles=Baker_Hughes&prop=revisions&rvprop=content&format=json
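
A rough sketch of how that API route might look, using the same query parameters as the URL above. Several things here are assumptions rather than anything confirmed in this thread: the helper names (fetch_wikitext, wikitext_from_response, revenue_field), the User-Agent string, the sample wikitext, the assumed JSON layout (query, pages, page id, revisions[0]["*"]), and the idea that the infobox stores the value in a parameter called "revenue" on a single line.

```python
import json
import re
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"

def wikitext_from_response(data):
    # Assumed layout: query -> pages -> {page id} -> revisions[0]["*"]
    page = next(iter(data["query"]["pages"].values()))
    return page["revisions"][0]["*"]

def fetch_wikitext(title):
    """Download the raw wikitext of a page through the API (needs network access)."""
    params = urllib.parse.urlencode({
        "action": "query", "titles": title, "prop": "revisions",
        "rvprop": "content", "format": "json",
    })
    # Placeholder User-Agent; replace it with one that identifies your script.
    req = urllib.request.Request(
        API_URL + "?" + params,
        headers={"User-Agent": "infobox-revenue-example/0.1"},
    )
    with urllib.request.urlopen(req) as resp:
        return wikitext_from_response(json.load(resp))

def revenue_field(wikitext):
    """Return the raw value of the infobox 'revenue' parameter, or None."""
    m = re.search(r"^[ \t]*\|[ \t]*revenue[ \t]*=[ \t]*(.+)$",
                  wikitext, re.MULTILINE | re.IGNORECASE)
    return m.group(1).strip() if m else None

# Hypothetical wikitext used to try the parser without network access.
SAMPLE_WIKITEXT = """{{Infobox company
| name = Example Corp
| revenue = {{increase}} US$ 22.364 billion (2015)
| num_employees = 1000
}}"""

print(revenue_field(SAMPLE_WIKITEXT))
```

For a live page you would call revenue_field(fetch_wikitext('Baker_Hughes')). The returned value is raw wiki markup, so templates such as {{increase}} still need to be stripped before the number can be used.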


2016-05-03 23:50:19
