2013-07-01 79 views
2

I want to extract a tag from an HTML file in Python without using BeautifulSoup. For example, I want to extract this tag from the HTML file:

<a class="el" href="atsc__root__raised__cosine.html" target="_self">atsc_root_raised_cosine</a>

Any ideas?

+0

Why don't you want to use BeautifulSoup? There may be a good reason, but including that information would make the question more useful to other people.

+0

That is not a tag, it is just a fragment of HTML. What are you trying to do?

Answers

1

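For basic DOM parsing, you can use the XML parser in the standard library, xml.dom.minidom.

Here is an example (from the documentation) that uses it to convert XML to HTML: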

import xml.dom.minidom

document = """\
<slideshow>
<title>Demo slideshow</title>
<slide><title>Slide title</title>
<point>This is a demo</point>
<point>Of a program for processing slides</point>
</slide>

<slide><title>Another demo slide</title>
<point>It is important</point>
<point>To have more than</point>
<point>one slide</point>
</slide>
</slideshow>
"""

# Parse the XML string into a DOM tree.
dom = xml.dom.minidom.parseString(document)

def getText(nodelist):
    # Concatenate the data of all direct text-node children.
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
    return ''.join(rc)

def handleSlideshow(slideshow):
    print("<html>")
    handleSlideshowTitle(slideshow.getElementsByTagName("title")[0])
    slides = slideshow.getElementsByTagName("slide")
    handleToc(slides)
    handleSlides(slides)
    print("</html>")

def handleSlides(slides):
    for slide in slides:
        handleSlide(slide)

def handleSlide(slide):
    handleSlideTitle(slide.getElementsByTagName("title")[0])
    handlePoints(slide.getElementsByTagName("point"))

def handleSlideshowTitle(title):
    print("<title>%s</title>" % getText(title.childNodes))

def handleSlideTitle(title):
    print("<h2>%s</h2>" % getText(title.childNodes))

def handlePoints(points):
    print("<ul>")
    for point in points:
        handlePoint(point)
    print("</ul>")

def handlePoint(point):
    print("<li>%s</li>" % getText(point.childNodes))

def handleToc(slides):
    # Print each slide title as a table-of-contents entry.
    for slide in slides:
        title = slide.getElementsByTagName("title")[0]
        print("<p>%s</p>" % getText(title.childNodes))

handleSlideshow(dom)
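
Applied to the fragment from the question, the same approach might look like the sketch below. It assumes the input is well-formed XML, because xml.dom.minidom is a strict XML parser and will raise an error on typical real-world HTML (unclosed tags, unquoted attributes). The variable names fragment and links are only illustrative.

```python
import xml.dom.minidom

# The fragment from the question; it happens to be well-formed XML.
fragment = (
    '<a class="el" href="atsc__root__raised__cosine.html" '
    'target="_self">atsc_root_raised_cosine</a>'
)

dom = xml.dom.minidom.parseString(fragment)

# Collect (href, class, text) for every <a> element in the document.
links = []
for a in dom.getElementsByTagName("a"):
    text = "".join(
        node.data for node in a.childNodes if node.nodeType == node.TEXT_NODE
    )
    links.append((a.getAttribute("href"), a.getAttribute("class"), text))

print(links)
```

getAttribute returns an empty string when the attribute is missing, so the loop does not need a separate existence check.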
1

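Take a look at the XML API that Python provides. Its documentation explains how to access attributes and elements, and it includes some HTML examples as well. You can also create parser objects.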
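
As one way to read "parser objects" for HTML that is not well-formed XML, here is a minimal sketch built on the standard library's html.parser.HTMLParser. The class name AnchorCollector and the sample page string are hypothetical, and this is not necessarily the API the answer had in mind.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (attributes, text) for every <a> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.anchors = []    # finished anchors as (attrs dict, text)
        self._attrs = None   # attributes of the <a> currently open, if any
        self._text = []      # text pieces seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._attrs = dict(attrs)
            self._text = []

    def handle_data(self, data):
        if self._attrs is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._attrs is not None:
            self.anchors.append((self._attrs, "".join(self._text)))
            self._attrs = None


# Hypothetical input: note the unclosed <p> and <br>, which a strict
# XML parser would reject.
page = """<html><body><p>Unclosed paragraph
<a class="el" href="atsc__root__raised__cosine.html" target="_self">atsc_root_raised_cosine</a>
<br>
</body></html>"""

collector = AnchorCollector()
collector.feed(page)
collector.close()
anchors = collector.anchors

for attrs, text in anchors:
    print(attrs.get("href"), text)
```

HTMLParser is event-based and tolerant of sloppy markup, so it suits quick extraction jobs like this one. The sketch does not handle nested anchors, which are invalid HTML anyway.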
