Length of content parsed from a web page with lxml

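I'm using lxml in Python to scrape web pages. To get the number of rows in a table, I first select all of them and then call len() on the result. That feels wasteful. Is there another way to get their (dynamic) count for further scraping?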

import lxml.html

doc = None
try:
    doc = lxml.html.parse('url')
except OSError:
    # parse() raises OSError (IOError) when the document cannot be read
    pass

if doc is not None:
    buf = ''
    # get the total number of rows in table
    tr = doc.xpath("/html/body/div[1]/div[1]/table[1]/tbody/tr")
    table = []
    # iterate over the table rows limited to max number
    for i in range(3, len(tr)):
        # get the rows content (XPath positions are 1-based)
        table += doc.xpath("body/div[1]/div[1]/table[1]/tbody/tr[%s]/td" % i)


2012-09-22 Igor Savinkin

Why the 'beautifulsoup' tag? You are only using 'lxml' here.

Sorry, I thought bs could be used instead.

You can use the tr elements you have already matched as a starting point and simply iterate over them, just as you would with a Python list:

tr = doc.xpath("/html/body/div[1]/div[1]/table[1]/tbody/tr") 
for row in tr[3:]: 
    table += row.findall('td')

The above uses .findall() to grab all the contained td elements, but if you need more control you can use further .xpath() calls on each row.
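As a rough sketch of both ideas (slicing the matched rows, and using .xpath() relative to a row), the snippet below runs against a small inline table. The HTML sample and the variable names are made up for illustration, and the looser path //table//tr is an assumption chosen so the example works whether or not the markup contains a tbody. If only the number of rows is needed, the XPath count() function returns it directly.

```python
import lxml.html

# Hypothetical sample markup standing in for the scraped page.
HTML = """
<html><body><table>
  <tr><th>name</th><th>qty</th></tr>
  <tr><td>a</td><td>1</td></tr>
  <tr><td>b</td><td>2</td></tr>
  <tr><td>c</td><td>3</td></tr>
  <tr><td>d</td><td>4</td></tr>
</table></body></html>
"""

doc = lxml.html.fromstring(HTML)

# XPath count() yields the number of matching nodes as a float.
row_count = int(doc.xpath("count(//table//tr)"))

# Select the rows once, then slice them like any Python list.
rows = doc.xpath("//table//tr")
cells = []
for row in rows[3:]:
    # A relative XPath expression is evaluated against the row itself.
    cells += row.xpath("td")

cell_texts = [td.text_content() for td in cells]
print(row_count, cell_texts)
```

Note that Python slices are 0-based while XPath positions such as tr[3] are 1-based, so rows[3:] starts at the fourth row.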


2012-09-22 15:00:50

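from itertools import islice

trs = doc.xpath("/html/body/div[1]/div[1]/table[1]/tbody/tr")
# islice(trs, 3) would yield only the first three rows;
# islice(trs, 3, None) skips them and yields the rest
for tr in islice(trs, 3, None):
    for td in tr.xpath('td'):
        pass  # ...whatever...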
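One detail worth checking with islice is the argument order: with a single number it acts as a stop, not a start. The sketch below uses plain strings as stand-ins for the tr elements (an assumption for illustration only) to compare the two forms with ordinary slicing.

```python
from itertools import islice

# Stand-ins for <tr> elements; any iterable behaves the same way.
rows = ["r0", "r1", "r2", "r3", "r4"]

# One numeric argument is the stop value: the first three items.
first_three = list(islice(rows, 3))

# start=3, stop=None: skip three items, yield the rest.
after_three = list(islice(rows, 3, None))

# For a list, plain slicing gives the same result as the second form.
sliced = rows[3:]

print(first_three, after_three, sliced)
```

Since .xpath() already returns a list, slicing is usually enough; islice is mainly useful when the rows come from an iterator that cannot be sliced.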


2012-09-22 14:37:09 georg

Why 'islice'? Wouldn't 'trs[3:]' work?

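Have you tried the iteration methods explained in this section: http://lxml.de/api.html#iteration? I'm fairly sure there is a way to do it like that. Finding the length of something and then looping over it with (x)range is never an elegant solution, and I'm fairly sure the people behind lxml have given you the right tools.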
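A minimal sketch of what that iteration style might look like, assuming a small inline table in place of the real page (the markup and variable names are invented for the example). Element.iter() walks matching descendants lazily and iterchildren() walks direct children, so no length is computed up front.

```python
import lxml.html

# Hypothetical fragment standing in for the scraped table.
HTML = (
    "<table>"
    "<tr><td>a</td><td>1</td></tr>"
    "<tr><td>b</td><td>2</td></tr>"
    "<tr><td>c</td><td>3</td></tr>"
    "</table>"
)

root = lxml.html.fromstring(HTML)

rows_seen = 0
texts = []
# iter("tr") yields every <tr> descendant in document order.
for row in root.iter("tr"):
    rows_seen += 1
    # iterchildren("td") yields only the row's direct <td> children.
    for td in row.iterchildren("td"):
        texts.append(td.text)

print(rows_seen, texts)
```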


2012-09-22 14:45:15
