Find all tags and attributes in HTML

I am a beginner and am looking at HTML code for the first time. For my research, I need to know the number of tags and attributes in a web page.

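I looked at various parsers and found that Beautiful Soup is one of the most popular. The code below (taken from "Parsing HTML using Python") shows one way to parse a file: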

import urllib2 
from BeautifulSoup import BeautifulSoup 

page = urllib2.urlopen('http://www.google.com/') 
soup = BeautifulSoup(page) 

x = soup.body.find('div', attrs={'class' : 'container'}).text

I found find_all very useful, but it seems to need an argument that says what to look for.

Can someone guide me on how to count all the tags and attributes in an HTML page?

Can Google Developer Tools help with this?


2015-06-08 shingaridavesh

If you call find_all() without any arguments, it recursively finds all the elements on the page. Demo:

>>> from bs4 import BeautifulSoup 
>>> 
>>> data = """ 
... <html><head><title>The Dormouse's story</title></head> 
... <body> 
... <p class="title"><b>The Dormouse's story</b></p> 
... 
... <p class="story">Once upon a time there were three little sisters; and their names were 
... <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, 
... <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and 
... <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 
... and they lived at the bottom of a well.</p> 
... 
... <p class="story">...</p> 
... """ 
>>> 
>>> soup = BeautifulSoup(data) 
>>> for tag in soup.find_all(): 
...     print(tag.name)
... 
html 
head 
title 
body 
p 
b 
p 
a 
a 
a 
p
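Building on the demo above, here is a minimal sketch of how the same loop could tally tags and attributes by name. It assumes Python 3 and bs4; the sample markup and the names `html`, `tag_counts` and `attr_counts` are made up for illustration, and "html.parser" is named explicitly so the result does not depend on whether lxml is installed.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from a real page.
html = """
<html><head><title>Sample</title></head>
<body>
<p class="title"><b>Heading</b></p>
<a href="http://example.com/a" class="link" id="a1">A</a>
<a href="http://example.com/b" class="link">B</a>
</body></html>
"""

# "html.parser" is the parser from the standard library.
soup = BeautifulSoup(html, "html.parser")

tag_counts = Counter()   # tag name -> number of elements with that name
attr_counts = Counter()  # attribute name -> number of occurrences

# find_all() with no arguments walks every element in the document.
for tag in soup.find_all():
    tag_counts[tag.name] += 1
    attr_counts.update(tag.attrs.keys())

total_tags = sum(tag_counts.values())
total_attrs = sum(attr_counts.values())

print(tag_counts)
print(attr_counts)
print(total_tags, total_attrs)
```

A multi-valued attribute such as class="a b" is stored under a single key in tag.attrs, so it counts as one attribute here.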

Padraic showed how you can count the elements and attributes with BeautifulSoup. In addition to that, here is how to do the same with lxml.html:

from lxml.html import fromstring 

root = fromstring(data) 
# count(//*) counts every element, count(//@*) counts every attribute
print(int(root.xpath("count(//*)")) + int(root.xpath("count(//@*)")))

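As a bonus, I ran a simple benchmark showing that the latter approach is faster (on my machine, with my setup; not specifying a parser would make BeautifulSoup use lxml under the hood, and so on. A lot of things may affect the results, but anyway):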

$ python -mtimeit -s'import test' 'test.count_bs()' 
1000 loops, best of 3: 618 usec per loop 
$ python -mtimeit -s'import test' 'test.count_lxml_html()' 
10000 loops, best of 3: 114 usec per loop

where test.py contains:

from bs4 import BeautifulSoup 
from lxml.html import fromstring 

data = """ 
<html><head><title>The Dormouse's story</title></head> 
<body> 
<p class="title"><b>The Dormouse's story</b></p> 

<p class="story">Once upon a time there were three little sisters; and their names were 
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, 
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and 
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 
and they lived at the bottom of a well.</p> 

<p class="story">...</p> 
""" 

def count_bs(): 
    return sum(len(ele.attrs) + 1 for ele in BeautifulSoup(data).find_all()) 


def count_lxml_html(): 
    root = fromstring(data) 
    return int(root.xpath("count(//*)")) + int(root.xpath("count(//@*)"))


2015-06-08 19:57:08 alecxe

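Is there a way to know the DOM nodes? I am currently using javascript:alert(document.getElementsByTagName('*').length) in the address bar to find out the DOM nodes. – shingaridavesh

@shingaridavesh sorry, what do you mean by "know the DOM nodes"? – alecxe

The number of DOM nodes in a web page. – shingaridavesh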

If you want the count of all the tags and attrs:

sum(len(ele.attrs) + 1 for ele in BeautifulSoup(page).find_all())
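If installing a third-party parser is not an option, roughly the same count can be sketched with only the standard library's html.parser. This is a different technique from the one-liner above, not a drop-in equivalent: the class `TagAttrCounter` and the function `count_tags_and_attrs` are hypothetical names, Python 3 is assumed, and the code counts start tags exactly as they are written in the source rather than building a tree.

```python
from html.parser import HTMLParser


class TagAttrCounter(HTMLParser):
    """Counts start tags and their attributes as the parser meets them."""

    def __init__(self):
        super().__init__()
        self.tags = 0
        self.attrs = 0

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; attrs is a list of (name, value) pairs.
        # Self-closing tags such as <br/> also end up here, because the
        # default handle_startendtag() delegates to handle_starttag().
        self.tags += 1
        self.attrs += len(attrs)


def count_tags_and_attrs(html):
    """Return (number of tags, number of attributes) found in the markup."""
    counter = TagAttrCounter()
    counter.feed(html)
    counter.close()
    return counter.tags, counter.attrs


# Hypothetical sample markup for illustration only.
sample = '<html><body><p class="story"><a href="#" id="x">link</a><br/></p></body></html>'
n_tags, n_attrs = count_tags_and_attrs(sample)
print(n_tags + n_attrs)
```

The numbers may differ from a tree-building parser on messy markup: for example, lxml can add missing html or body elements, and an attribute repeated on the same tag is counted twice here but only once in a dict such as ele.attrs.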


2015-06-08 20:05:25
