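Selecting search results by date with Python

For a website, I have the source page of a set of search results, which looks like this (just a slice of it, of course):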

<div class="search-results-list-item clearfix is-collapsed is-topad-list-item "> 


<div class="list-item-data"> 
    <h2 class="list-item-title"> 
     <a href="http://www.mylink.com" name="61492088">Description</a> 
    </h2> 

      <div class="list-item-location"> 
     <span>Rimini</span> 
    </div> 
     </div> 

<div class="list-item-price"> 
    <span>2.000 &euro;</span> 
</div> 

<div class="list-item-actdate"> 
    <span>16 February</span> 
</div> 

</div>

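My program should print only those links (in the example, the link contained in the "list-item-data" div) whose "list-item-actdate" contains the word "Today". Other links should not be printed, so in my example the only link in the code would not be printed.

I would like to use BeautifulSoup, but I don't know how to use it for my purpose.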

Source

2014-02-24 user3348278

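Here is one way to do it using lxml.html instead of BeautifulSoup. It uses XPath to search the document and extract the relevant parts. It should hopefully give you an idea of how to process HTML (or XML) documents.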

import lxml.html 

# Parse the HTML document 
# (parse() expects a filename, URL, or file object, not the file's contents)
html = lxml.html.parse('/path/to/source/file') 

# find the div elements that contain a child div with class='list-item-data' 
for parent in html.xpath("//div[@class='list-item-data']/.."): 

    # get and check the date 
    # note xpath returns a list of elements, here we assume only the first match is of 
    # interest (based on the stated structure of the document) 
    date = parent.xpath("./div[@class='list-item-actdate']/span")[0].text 
    if not date or "Today" not in date: 
        continue 

    # print the link address 
    href = parent.xpath(".//a")[0].attrib['href'] 
    print(href)
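
Since the question asked about BeautifulSoup, the same filter can probably be written with it as well. What follows is only a rough sketch under a few assumptions: the bs4 package is installed, every result sits in its own "search-results-list-item" div as in the posted slice, and the date text simply contains "Today". The function name links_dated_today, the SAMPLE string, and its second entry (with the made-up example.com link and the "Today 10:15" date) are hypothetical and exist only to show a match.

```python
from bs4 import BeautifulSoup

# Two result items: the first mirrors the slice from the question, the second
# is a hypothetical entry (invented link and date) that should be selected.
SAMPLE = """
<div class="search-results-list-item clearfix is-collapsed is-topad-list-item">
  <div class="list-item-data">
    <h2 class="list-item-title">
      <a href="http://www.mylink.com" name="61492088">Description</a>
    </h2>
  </div>
  <div class="list-item-actdate"><span>16 February</span></div>
</div>
<div class="search-results-list-item clearfix">
  <div class="list-item-data">
    <h2 class="list-item-title">
      <a href="http://www.example.com/today" name="1">Other</a>
    </h2>
  </div>
  <div class="list-item-actdate"><span>Today 10:15</span></div>
</div>
"""


def links_dated_today(html):
    """Return the hrefs of result items whose date block contains 'Today'."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # class_ matches any element that has this token among its CSS classes
    for item in soup.find_all("div", class_="search-results-list-item"):
        date = item.find("div", class_="list-item-actdate")
        data = item.find("div", class_="list-item-data")
        if date is None or data is None:
            continue  # skip items that lack the expected structure
        if "Today" not in date.get_text():
            continue  # not dated today
        anchor = data.find("a", href=True)
        if anchor is not None:
            links.append(anchor["href"])
    return links


print(links_dated_today(SAMPLE))
```

With the sample above, only the hypothetical second link is returned; the "16 February" item from the question is skipped. For a real page, pass the downloaded HTML text to links_dated_today in place of SAMPLE.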

Source

2014-02-24 20:40:55 isedev
