Extracting a tag from an HTML file using Python

I want to extract a tag from an HTML file in Python without using BeautifulSoup. For example, I want to extract

class="el" href="atsc__root__raised__cosine.html" target="_self">atsc_root_raised_cosine

from

<a class="el" href="atsc__root__raised__cosine.html" target="_self">atsc_root_raised_cosine</a>

Any ideas?

Source

2013-07-01 user2460869

Why don't you want to use BeautifulSoup? There may be a good reason, but including it would make this question more useful to others.

That isn't a tag, it's just a fragment of HTML. What are you trying to do?

For basic DOM parsing, you can use the XML parser in the standard library.

Here is an example (from the documentation) that uses it to convert XML into HTML:

import xml.dom.minidom

document = """\
<slideshow>
<title>Demo slideshow</title>
<slide><title>Slide title</title>
<point>This is a demo</point>
<point>Of a program for processing slides</point>
</slide>

<slide><title>Another demo slide</title>
<point>It is important</point>
<point>To have more than</point>
<point>one slide</point>
</slide>
</slideshow>
"""

dom = xml.dom.minidom.parseString(document)

def getText(nodelist):
    # Concatenate the data of all direct text-node children.
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
    return ''.join(rc)

def handleSlideshow(slideshow):
    print("<html>")
    handleSlideshowTitle(slideshow.getElementsByTagName("title")[0])
    slides = slideshow.getElementsByTagName("slide")
    handleToc(slides)
    handleSlides(slides)
    print("</html>")

def handleSlides(slides):
    for slide in slides:
        handleSlide(slide)

def handleSlide(slide):
    handleSlideTitle(slide.getElementsByTagName("title")[0])
    handlePoints(slide.getElementsByTagName("point"))

def handleSlideshowTitle(title):
    print("<title>%s</title>" % getText(title.childNodes))

def handleSlideTitle(title):
    print("<h2>%s</h2>" % getText(title.childNodes))

def handlePoints(points):
    print("<ul>")
    for point in points:
        handlePoint(point)
    print("</ul>")

def handlePoint(point):
    print("<li>%s</li>" % getText(point.childNodes))

def handleToc(slides):
    # Print each slide title as a table-of-contents entry.
    for slide in slides:
        title = slide.getElementsByTagName("title")[0]
        print("<p>%s</p>" % getText(title.childNodes))

handleSlideshow(dom)
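
Applied to the anchor element from the question, the same minidom calls might look like the sketch below. It assumes Python 3 and that the markup is well-formed XML, which real-world HTML often is not (an unclosed `<br>` is enough to make `parseString` raise an error). `extract_links` is a hypothetical helper name used only for illustration.

```python
import xml.dom.minidom

# The anchor element from the question; it happens to be well-formed XML.
snippet = (
    '<a class="el" href="atsc__root__raised__cosine.html" '
    'target="_self">atsc_root_raised_cosine</a>'
)

def extract_links(markup):
    """Return a list of (attributes dict, text) pairs, one per <a> element."""
    dom = xml.dom.minidom.parseString(markup)
    links = []
    for link in dom.getElementsByTagName("a"):
        # NamedNodeMap.items() yields (name, value) pairs.
        attrs = dict(link.attributes.items())
        # Join the direct text-node children, as getText() does above.
        text = "".join(
            node.data for node in link.childNodes
            if node.nodeType == node.TEXT_NODE
        )
        links.append((attrs, text))
    return links

for attrs, text in extract_links(snippet):
    print(attrs["href"], text)
```

If the input is not guaranteed to be well-formed, a tolerant HTML parser is likely a better fit than an XML DOM.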

Source

2013-07-01 01:31:31

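Take a look at the XML API that Python provides. Its documentation explains how to access attributes and elements, and it includes some examples that use HTML. You can also create a parser object.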
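
Which API this answer linked to is not preserved here. One standard-library option that matches the description (a parser object that handles HTML) is `html.parser.HTMLParser`; the sketch below assumes Python 3 and uses it to collect the attributes and text of every `<a>` element. `LinkExtractor` is a hypothetical class name, not something from the original answer.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (attributes dict, text) for every <a> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished links
        self._attrs = None   # attributes of the <a> currently open, if any
        self._text = []      # text pieces seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._attrs = dict(attrs)
            self._text = []

    def handle_data(self, data):
        if self._attrs is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._attrs is not None:
            self.links.append((self._attrs, "".join(self._text)))
            self._attrs = None


parser = LinkExtractor()
# The unclosed <br> is deliberate: this parser does not require well-formed XML.
parser.feed(
    '<p>See <a class="el" href="atsc__root__raised__cosine.html" '
    'target="_self">atsc_root_raised_cosine</a><br></p>'
)
parser.close()

for attrs, text in parser.links:
    print(attrs["href"], text)
```

Because the parser is event-driven, it never builds a full tree, so it tolerates markup that `xml.dom.minidom` would reject.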

Source

2013-07-01 04:25:30 Saurabh7
