2014-02-24 31 views

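Selecting search results by date with Python

I have a website whose search produces a result page with source like this (just a slice of it, of course):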

<div class="search-results-list-item clearfix is-collapsed is-topad-list-item "> 


<div class="list-item-data"> 
    <h2 class="list-item-title"> 
     <a href="http://www.mylink.com" name="61492088">Description</a> 
    </h2> 

      <div class="list-item-location"> 
     <span>Rimini</span> 
    </div> 
     </div> 

<div class="list-item-price"> 
    <span>2.000 &euro;</span> 
</div> 

<div class="list-item-actdate"> 
    <span>16 February</span> 
</div> 

</div>  

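My program should print a link (in the example, the link inside the "list-item-data" div) only when the matching "list-item-actdate" div contains the word "Today". Other links should not be printed, so in my example the only link in the snippet would not be printed.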

I would like to use BeautifulSoup, but I don't know how to apply it to this task.

Answer


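Here is one way to do it using lxml.html instead of BeautifulSoup. It uses XPath to search the document and extract the relevant parts, and it should give you an idea of how to process HTML (or XML) documents.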

import lxml.html 

# Parse the HTML document 
# lxml.html.parse() expects a filename, URL, or file object, not the file's contents
html = lxml.html.parse('/path/to/source/file')

# find the div elements that have a child div with class='list-item-data'
for parent in html.xpath("//div[@class='list-item-data']/.."): 

    # get and check the date 
    # note xpath returns a list of elements, here we assume only the first match is of 
    # interest (based on the stated structure of the document) 
    date = parent.xpath("./div[@class='list-item-actdate']/span")[0].text 
    # skip items whose date is missing or does not contain the word "Today"
    if not date or "Today" not in date:
        continue

    # print the link address 
    href = parent.xpath(".//a")[0].attrib['href'] 
    print(href)
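
Since the question asked about BeautifulSoup, the same idea can be sketched with it as well. This is only a rough sketch, assuming the bs4 package is installed and that every result item is a div carrying the class "search-results-list-item" as in the snippet above. The helper name links_dated_today, the second sample item, and its "Today 10:30" date text are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two result items, only the second one is dated "Today".
SAMPLE_HTML = """
<div class="search-results-list-item clearfix">
  <div class="list-item-data">
    <h2 class="list-item-title"><a href="http://www.mylink.com">Description</a></h2>
  </div>
  <div class="list-item-actdate"><span>16 February</span></div>
</div>
<div class="search-results-list-item clearfix">
  <div class="list-item-data">
    <h2 class="list-item-title"><a href="http://www.example.com/today">Other</a></h2>
  </div>
  <div class="list-item-actdate"><span>Today 10:30</span></div>
</div>
"""


def links_dated_today(html):
    """Return the hrefs of result items whose date block contains 'Today'."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # class_ matches any element whose class list includes the given class
    for item in soup.find_all("div", class_="search-results-list-item"):
        date_div = item.find("div", class_="list-item-actdate")
        if date_div is None or "Today" not in date_div.get_text():
            continue
        data_div = item.find("div", class_="list-item-data")
        anchor = data_div.find("a", href=True) if data_div else None
        if anchor is not None:
            links.append(anchor["href"])
    return links


print(links_dated_today(SAMPLE_HTML))
```

Matching on a single class with class_ avoids depending on the full, exact class attribute string, which on the real page also carries modifiers such as "is-collapsed".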