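Parsing HTML attribute using R or Python
I have a bunch of HTML files containing <span> tags with style="font-size:...px". I want to automatically find the <span> with the largest font size and get the text between its tags. Preferably in R or Python, but other approaches are welcome. Any ideas?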
For Python 3, you can use html.parser. (For Python 2.x, look at HTMLParser instead.)
An example:
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self, min_span):
        HTMLParser.__init__(self)
        # Keep track of our maximum entry thus far
        self.max_span = min_span  # set a minimum font size if you like, or just use 0
        self.max_text = []  # to keep track of many entries
        # This flags to the object to get data if we found a span tag
        # with a new highest (or equally high) font-size
        self.recording = 0

    def handle_starttag(self, tag, attrs):
        # Ignore all other tags
        if tag != 'span':
            return
        for name, value in attrs:
            if name != 'style':
                continue
            for css_style in value.split(";"):
                sub_attrib = css_style.split(":")
                if sub_attrib[0].strip() != 'font-size':
                    continue
                # Drop the trailing "px"; assumes a whole-number pixel size
                this_size = int(sub_attrib[1].strip()[:-2])
                if this_size > self.max_span:
                    self.max_text = []  # 'reset' the list for new maximum font-size
                    self.max_span = this_size
                    self.recording = 1
                elif this_size == self.max_span:  # For equally large span font-size tags
                    self.recording = 1

    def handle_endtag(self, tag):
        """
        Turns off recording flag
        """
        if tag == 'span' and self.recording:
            self.recording = 0

    def handle_data(self, data):
        if self.recording:
            self.max_text.append(data)
I'm not very good with HTML (as is evident from my previous answer), so you may need more control flow for edge cases.
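As one possible way to cover a couple of those edge cases, here is a minimal sketch, not part of the original answer. It assumes two things you might run into: style values with extra whitespace, decimals or upper-case "PX", and <span> tags nested inside each other. The names parse_font_size_px and NestedSpanParser are hypothetical, and a span without its own font-size is assumed to inherit the size of the enclosing span.

```python
import re
from html.parser import HTMLParser

# Hypothetical helper: matches a "font-size: <number>px" declaration in an inline style.
_FONT_SIZE_RE = re.compile(
    r"(?:^|;)\s*font-size\s*:\s*(\d+(?:\.\d+)?)\s*px\s*(?:;|$)", re.IGNORECASE
)

def parse_font_size_px(style):
    """Return the pixel font size as a float, or None if there is no px value."""
    match = _FONT_SIZE_RE.search(style or "")
    return float(match.group(1)) if match else None

class NestedSpanParser(HTMLParser):
    """Sketch: tracks open <span> tags on a stack so nested spans are handled."""

    def __init__(self):
        super().__init__()
        self.stack = []      # effective font size of each currently open <span>
        self.max_size = 0
        self.max_text = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        size = parse_font_size_px(dict(attrs).get("style"))
        if size is None and self.stack:
            size = self.stack[-1]  # assumption: inherit from the enclosing span
        self.stack.append(size)

    def handle_endtag(self, tag):
        if tag == "span" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        size = self.stack[-1] if self.stack else None
        if size is None or size < self.max_size:
            return
        if size > self.max_size:
            # New maximum: discard text collected for smaller sizes
            self.max_size, self.max_text = size, []
        self.max_text.append(data)
```

Unlike the flag-based version, text is attributed to the innermost open span, so a small span nested inside a large one does not leak into the result. Sizes in other units (em, %, pt) are simply ignored here.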
Usage of MyHTMLParser:
parser = MyHTMLParser(0)
parser.feed("""
<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
<span style="font-size:10px;font-family:test">Not this one</span>
<span style="font-size:20px">Not this one either</span>
<span style="font-size:60px;font-family:hello">Yay!</span>
<span style="font-size:10px">Nope</span>
<span style="font-size:60px">Also this one</span>
</body>
</html>
""")
print(parser.max_text) #prints out ['Yay!', 'Also this one']
#to get individual entry
list_of_text = parser.max_text
first_maximum_text = list_of_text[0]
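Edit: To go through all the HTML files in a directory (here, the current directory). This implementation finds the maximum across all the HTML files together. If you want to analyse each HTML file separately, create a new MyHTMLParser on each iteration and process its results before moving on.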
import os

def main():
    parser = MyHTMLParser(0)
    for file in os.listdir("./"):
        if file.endswith(".html"):
            with open(file, 'r') as fd:
                parser.feed(fd.read())
    print(parser.max_text)

if __name__ == '__main__':
    main()
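For the per-file variant mentioned in the edit, a rough sketch could look like the following. To keep it self-contained it uses a condensed stand-in for MyHTMLParser (the names LargestSpanParser and largest_span_text_per_file are hypothetical), and it assumes whole-number px sizes, non-nested spans and UTF-8 encoded files.

```python
import re
from html.parser import HTMLParser
from pathlib import Path

FONT_SIZE = re.compile(r"font-size\s*:\s*(\d+)\s*px", re.IGNORECASE)

class LargestSpanParser(HTMLParser):
    """Condensed stand-in for MyHTMLParser; assumes spans are not nested."""

    def __init__(self):
        super().__init__()
        self.max_size = 0
        self.max_text = []
        self.recording = False

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        match = FONT_SIZE.search(dict(attrs).get("style") or "")
        if not match:
            return
        size = int(match.group(1))
        if size > self.max_size:
            # New maximum: forget text collected for smaller sizes
            self.max_size, self.max_text = size, []
        # Record only spans that match the current maximum
        self.recording = size == self.max_size

    def handle_endtag(self, tag):
        if tag == "span":
            self.recording = False

    def handle_data(self, data):
        if self.recording:
            self.max_text.append(data)

def largest_span_text_per_file(directory="."):
    """Map each .html file name to (largest font size, list of texts)."""
    results = {}
    for path in sorted(Path(directory).glob("*.html")):
        parser = LargestSpanParser()  # fresh parser, so maxima do not carry over
        parser.feed(path.read_text(encoding="utf-8"))
        parser.close()
        results[path.name] = (parser.max_size, parser.max_text)
    return results
```

Calling largest_span_text_per_file(".") returns a dictionary keyed by file name, so each file's largest-font text can be inspected independently of the others.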
Comments are not for extended discussion; this conversation has been [moved to chat](http://chat.stackoverflow.com/rooms/134080/discussion-on-answer-by-kj-phan-parsing-html-attribute-using-r-or-python).