2015-06-08 44 views
2

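Find all tags and attributes in HTML

I am a beginner and am looking at HTML code for the first time. For my research, I need to know the number of tags and attributes in a web page.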

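I looked at various parsers and found that Beautiful Soup is one of the most popular. The following code (taken from "Parsing HTML using Python") shows one way to parse a page: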

import urllib2 
from BeautifulSoup import BeautifulSoup 

page = urllib2.urlopen('http://www.google.com/') 
soup = BeautifulSoup(page) 

x = soup.body.find('div', attrs={'class' : 'container'}).text 

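I found find_all very useful, but it seems to require an argument that says what to look for.

Can someone show me how to count all the tags and attributes in an HTML page?

Can Google Developer Tools help with this?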

Answers

2

If you call find_all() without any arguments, it recursively finds all the elements on the page. Demo:

>>> from bs4 import BeautifulSoup 
>>> 
>>> data = """ 
... <html><head><title>The Dormouse's story</title></head> 
... <body> 
... <p class="title"><b>The Dormouse's story</b></p> 
... 
... <p class="story">Once upon a time there were three little sisters; and their names were 
... <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, 
... <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and 
... <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 
... and they lived at the bottom of a well.</p> 
... 
... <p class="story">...</p> 
... """ 
>>> 
>>> soup = BeautifulSoup(data) 
>>> for tag in soup.find_all(): 
...  print tag.name 
... 
html 
head 
title 
body 
p 
b 
p 
a 
a 
a 
p 
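As a rough sketch of how the listing above could be turned into per-tag counts, the same find_all() call can feed a collections.Counter. This assumes Python 3 and the bs4 package; the explicit "html.parser" argument and the shortened `sample` string are illustrative choices, not part of the original demo.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Shortened sample document (hypothetical, for illustration only).
sample = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story"><a href="http://example.com/elsie" class="sister" id="link1">Elsie</a></p>
</body></html>
"""

# Naming the parser explicitly keeps the result independent of which
# optional parsers happen to be installed.
soup = BeautifulSoup(sample, "html.parser")

# find_all() with no arguments yields every Tag in document order.
tag_counts = Counter(tag.name for tag in soup.find_all())

for name, count in sorted(tag_counts.items()):
    print(name, count)
```

The total number of elements is then `sum(tag_counts.values())`.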

Padraic showed how to count the elements and attributes with BeautifulSoup. In addition, here is how to do the same with lxml.html:

from lxml.html import fromstring 

root = fromstring(data) 
print int(root.xpath("count(//*)")) + int(root.xpath("count(//@*)")) 

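As a bonus, I ran a simple benchmark suggesting that the latter approach is faster (on my machine, with my setup, and without specifying a parser, which would make BeautifulSoup use lxml under the hood; many things can affect the results, but anyway):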

$ python -mtimeit -s'import test' 'test.count_bs()' 
1000 loops, best of 3: 618 usec per loop 
$ python -mtimeit -s'import test' 'test.count_lxml_html()' 
10000 loops, best of 3: 114 usec per loop 

where test.py contains:

from bs4 import BeautifulSoup 
from lxml.html import fromstring 

data = """ 
<html><head><title>The Dormouse's story</title></head> 
<body> 
<p class="title"><b>The Dormouse's story</b></p> 

<p class="story">Once upon a time there were three little sisters; and their names were 
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, 
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and 
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 
and they lived at the bottom of a well.</p> 

<p class="story">...</p> 
""" 

def count_bs(): 
    return sum(len(ele.attrs) + 1 for ele in BeautifulSoup(data).find_all()) 


def count_lxml_html(): 
    root = fromstring(data) 
    return int(root.xpath("count(//*)")) + int(root.xpath("count(//@*)")) 
+0

Is there a way to know the DOM nodes? I currently use javascript:alert(document.getElementsByTagName('*').length) in the address bar to find out the DOM nodes. – shingaridavesh

+0

@shingaridavesh Sorry, what do you mean by "know the DOM nodes"? – alecxe

+0

The number of DOM nodes in a web page. – shingaridavesh

3

If you want the count of all tags and attrs:

sum(len(ele.attrs) + 1 for ele in BeautifulSoup(page).find_all()) 
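If the tag count and the attribute count are needed separately, one possible sketch uses only the standard library's html.parser instead of BeautifulSoup (a swapped-in technique, assuming Python 3). The class name `TagAttrCounter`, the helper `count_tags_and_attrs`, and the `sample` string are hypothetical names introduced for illustration. Because this approach does not build a tree, its totals may differ from those of bs4 or lxml on malformed markup.

```python
from html.parser import HTMLParser


class TagAttrCounter(HTMLParser):
    """Counts start tags and their attributes (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = 0
        self.attrs = 0

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; self-closing tags such as <br/>
        # are routed here too by the default handle_startendtag.
        self.tags += 1
        self.attrs += len(attrs)


def count_tags_and_attrs(html):
    """Return (number of tags, number of attributes) for an HTML string."""
    parser = TagAttrCounter()
    parser.feed(html)
    parser.close()
    return parser.tags, parser.attrs


sample = '<p class="story"><a href="http://example.com/elsie" id="link1">Elsie</a><br/></p>'
tags, attrs = count_tags_and_attrs(sample)
print(tags, attrs, tags + attrs)
```

Adding the two numbers gives the same kind of combined total as the one-liner above.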