2016-11-30

I'm trying to scrape a table from a Wikipedia article, but when I iterate over the table the elements come back as a mix of types: some are <class 'bs4.element.Tag'> and some are <class 'bs4.element.NavigableString'>. Why is BeautifulSoup giving me both bs4.element.NavigableString and bs4.element.Tag objects?

import requests 
import bs4 
import lxml 


resp = requests.get('https://en.wikipedia.org/wiki/List_of_municipalities_in_Massachusetts') 

soup = bs4.BeautifulSoup(resp.text, 'lxml') 

# grab the second table in the article body (index 1)
munis = soup.find(id='mw-content-text')('table')[1] 

for muni in munis: 
    print type(muni) 
    print '============' 

which produces the following output:

<class 'bs4.element.Tag'> 
============ 
<class 'bs4.element.NavigableString'> 
============ 
<class 'bs4.element.Tag'> 
============ 
<class 'bs4.element.NavigableString'> 
============ 
<class 'bs4.element.Tag'> 
============ 
<class 'bs4.element.NavigableString'> 
... 

When I try to access muni.contents I get the error AttributeError: 'NavigableString' object has no attribute 'contents'.

What am I doing wrong? How can I get each muni as a bs4.element.Tag object?

(Using Python 2.7.)

Comment:

As you probably know, **munis** is the representation of a table in the Wikipedia page. If you print it you will see the table's HTML. If you want to see the tag names of the children of **munis**, i.e. its rows, you can use 'child.name for child in munis.childGenerator()' - just a series of 'tr' strings. I suspect that is not what you want. Shouldn't you instead be asking how to scrape the contents of each row of the table, perhaps in the form of a Python list?
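For illustration, here is a minimal sketch of the inspection the commenter describes, continuing from the question's code (it assumes soup and munis have already been built as above); getattr is used because whether NavigableString exposes a name attribute varies by bs4 version:

for child in munis.childGenerator(): 
    # Tag children report their tag name ('tr'); the whitespace 
    # NavigableStrings between them have no meaningful name. 
    print(getattr(child, 'name', None)) 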

Answers

(score 0)

If there is whitespace in the markup between nodes, BeautifulSoup turns that whitespace into NavigableString objects. Just wrap it in a try/except and check whether the contents are being fetched the way you expect them to be:

for muni in munis: 
    # print type(muni) 
    try: 
        print muni.contents 
    except AttributeError: 
        # NavigableStrings (the whitespace between rows) have no .contents 
        pass 
    print '============' 
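
As a side note (a sketch, not part of the answer above, again continuing from the question's code): since the stray elements are just the newline strings between rows, asking the table for its tr Tags directly avoids them entirely, which also answers the original question of getting each muni as a bs4.element.Tag:

# find_all('tr') returns only Tag objects, so every muni has .contents 
for muni in munis.find_all('tr'): 
    print muni.contents 
    print '============' 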
(score 0)
from bs4 import BeautifulSoup 
import requests 

r = requests.get('https://en.wikipedia.org/wiki/List_of_municipalities_in_Massachusetts') 
soup = BeautifulSoup(r.text, 'lxml') 
# select the sortable wikitable and take its rows, skipping the header row 
rows = soup.find(class_="wikitable sortable").find_all('tr')[1:] 

for row in rows: 
    cell = [i.text for i in row.find_all('td')] 
    print(cell) 

Output:

['Abington', 'Town', 'Plymouth', 'Open town meeting', '15,985', '1712'] 
['Acton', 'Town', 'Middlesex', 'Open town meeting', '21,924', '1735'] 
['Acushnet', 'Town', 'Bristol', 'Open town meeting', '10,303', '1860'] 
['Adams', 'Town', 'Berkshire', 'Representative town meeting', '8,485', '1778'] 
['Agawam', 'City[4]', 'Hampden', 'Mayor-council', '28,438', '1855'] 
['Alford', 'Town', 'Berkshire', 'Open town meeting', '494', '1773'] 
['Amesbury', 'City', 'Essex', 'Mayor-council', '16,283', '1668'] 
['Amherst', 'Town', 'Hampshire', 'Representative town meeting', '37,819', '1775'] 
['Andover', 'Town', 'Essex', 'Open town meeting', '33,201', '1646'] 
['Aquinnah', 'Town', 'Dukes', 'Open town meeting', '311', '1870'] 
['Arlington', 'Town', 'Middlesex', 'Representative town meeting', '42,844', '1807'] 
['Ashburnham', 'Town', 'Worcester', 'Open town meeting', '6,081', '1765'] 
['Ashby', 'Town', 'Middlesex', 'Open town meeting', '3,074', '1767'] 
['Ashfield', 'Town', 'Franklin', 'Open town meeting', '1,737', '1765'] 
['Ashland', 'Town', 'Middlesex', 'Open town meeting', '16,593', '1846'] 
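
As a possible follow-up to the snippet above (just a sketch, assuming the header row's th cells line up one-to-one with the td columns shown in the output), the same rows can be turned into dictionaries keyed by column name:

from bs4 import BeautifulSoup 
import requests 

r = requests.get('https://en.wikipedia.org/wiki/List_of_municipalities_in_Massachusetts') 
soup = BeautifulSoup(r.text, 'lxml') 
rows = soup.find(class_="wikitable sortable").find_all('tr') 

# the first row holds the th header cells; the remaining rows hold td data cells 
headers = [th.get_text(strip=True) for th in rows[0].find_all('th')] 
for row in rows[1:]: 
    cells = [td.get_text(strip=True) for td in row.find_all('td')] 
    if len(cells) == len(headers):  # skip rows whose layout doesn't match the header 
        print(dict(zip(headers, cells))) 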
(score 2)
#!/usr/bin/env python 
# coding:utf-8 
'''黃哥Python''' 

import requests 
import bs4 
from bs4 import BeautifulSoup 
# from urllib.request import urlopen 

html = requests.get('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies') 
soup = BeautifulSoup(html.text, 'lxml') 

# take the first table, then walk the siblings of its first row 
symbolslist = soup.find('table').tr.next_siblings 
for sec in symbolslist: 
    # print(type(sec)) 
    # skip the whitespace NavigableStrings between rows and keep only Tags 
    if type(sec) is not bs4.element.NavigableString: 
        print(sec.get_text()) 

[result screenshot]