2016-05-06

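Trying to get text from a section of a website using lxml.html

I have some Python code that is supposed to get the HTML from a certain part of a website, using the XPath of where the HTML tag is located.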

def wordorigins(word): 
    pageopen = lxml.html.fromstring("http://www.merriam-webster.com/dictionary/" + str(word)) 
    pbody = pageopen.xpath("/html/body/div[1]/div/div[4]/div/div[1]/main/article/div[5]/div[3]/div[1]/div/p[1]") 
    etybody = lxml.html.fromstring(pbody) 
    etytxt = etybody.xpath('text()') 
    etytxt = etytxt.replace("<em>", "") 
    etytxt = etytxt.replace("</em>", "") 
    return etytxt 

This code fails with an error about expecting a string or buffer:

Traceback (most recent call last):
  File "mott.py", line 47, in <module>
    print wordorigins(x)
  File "mott.py", line 30, in wordorigins
    etybody = lxml.html.fromstring(pbody)
  File "/usr/lib/python2.7/site-packages/lxml/html/__init__.py", line 866, in fromstring
    is_full_html = _looks_like_full_html_unicode(html)
TypeError: expected string or buffer

Any thoughts?

Answers


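The xpath() method returns a list of results, while fromstring() expects a string.

However, you do not need to re-parse part of the document. Just use the element you have already found: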

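import lxml.html

def wordorigins(word):
    # parse() accepts a URL and fetches the page; fromstring() only parses a string of markup
    pageopen = lxml.html.parse("http://www.merriam-webster.com/dictionary/" + str(word)).getroot()
    # xpath() returns a list, so take the first matching element
    pbody = pageopen.xpath("/html/body/div[1]/div/div[4]/div/div[1]/main/article/div[5]/div[3]/div[1]/div/p[1]")[0]
    etytxt = pbody.text_content()
    etytxt = etytxt.replace("<em>", "")
    etytxt = etytxt.replace("</em>", "")
    return etytxt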

Note that I used the text_content() method instead of xpath("text()").
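
To see the difference on something smaller than the live page, here is a minimal sketch that parses an inline HTML fragment. The markup and the variable names are made up for illustration and are not taken from the Merriam-Webster page.

```python
import lxml.html

# Hypothetical fragment standing in for the etymology paragraph.
html = '<div><p>Middle English, from <em>Latin</em> origo</p></div>'
root = lxml.html.fromstring(html)

matches = root.xpath("//p[1]")    # an element query returns a list
p = matches[0]                    # the first matching <p> element

text_nodes = p.xpath("text()")    # list of the direct text nodes only
full_text = p.text_content()      # one string, including the text inside <em>

print(type(matches).__name__)
print(text_nodes)
print(full_text)
```

Here xpath("text()") skips the word inside the em element because it only selects text nodes that are direct children of the paragraph, while text_content() collects the text of all descendants.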


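As @alecxe's answer mentions, the xpath() method returns a list (of matched elements, in this case), hence the error when you tried to pass that list to lxml.html.fromstring(). Another thing to note is that neither XPath's text() function nor lxml's text_content() method will ever return a string containing tags such as <em> or </em>. They return text only, with the tags already stripped, so the two replace() lines are not needed. You can simply use text_content() or XPath's string() function (instead of text()):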

...... 
# either of the following lines should be enough 
etytxt = pbody[0].xpath('string()') 
etytxt = pbody[0].text_content() 
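
As a rough check of the point above, the following sketch runs both calls against a made-up fragment and also guards against an XPath that matches nothing. The helper name first_text and the sample markup are assumptions for illustration; the helper is not part of lxml.

```python
import lxml.html

def first_text(root, path):
    """Return the text of the first element matching path, or '' if none match.

    Hypothetical helper, not part of lxml.
    """
    matches = root.xpath(path)
    if not matches:
        # An empty list means the XPath found nothing; avoid an IndexError.
        return ""
    return matches[0].xpath("string()")

# Made-up fragment for illustration.
root = lxml.html.fromstring("<div><p>from <em>Latin</em> origo</p></div>")

via_string = first_text(root, "//p[1]")
via_text_content = root.xpath("//p[1]")[0].text_content()
missing = first_text(root, "//table")

print(via_string)
print(via_string == via_text_content)
print("<em>" in via_string)
```

Checking for an empty result first is a design choice worth considering with long absolute paths like the one in the question, since any change in the page layout makes them match nothing.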