2015-10-16

PyQuery: finding the text of child element nodes

Here is the code:

from pyquery import PyQuery 

content = '''<td field="exceptions"><div style="white-space:normal;height:auto;" \
class="datagrid-cell datagrid-cell-c2-exceptions">Traceback (most recent call last):<br>\
    File "./crawler.py", line 381, in &lt;module&gt;<br> \
    crawler.start()<br> File "./crawler.py", line 153, in start<br> \
     raise RemoteTransportException(e)<br>RemoteTransportException: \
     This socket is already used by another greenlet: &lt;bound method Waiter.\
     switch of &lt;gevent.hub.Waiter object at 0x7f64d499d6e0&gt;&gt;<br></div></td>'''
pq = PyQuery(content) 

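# .text holds only the text that precedes the first child tag
for div in pq('td div'):
    print(div.text)  # prints: Traceback (most recent call last):


# every child of the div is a <br>, which has no text of its own
for div in pq('td div'):
    for sub in div.getchildren():
        print(sub.text)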


# Traceback (most recent call last): 
# None 
# None 
# None 
# None 
# None 
# None 

I want the content of the td div element, which should be:

Traceback (most recent call last): 
File "./crawler.py", line 381, in <module> 
crawler.start() 
File "./crawler.py", line 153, in start 
raise RemoteTransportException(e) 
RemoteTransportException: This socket is already used by another greenlet: <bound method Waiter.switch of <gevent.hub.Waiter object at 0x7f64d499d6e0>> 

But I only get Traceback (most recent call last):. So how can I get all of the text inside td div, including the text that sits between its child tags?

Answers

You can use BeautifulSoup instead:

>>> import bs4
>>> soup = bs4.BeautifulSoup(content, 'html.parser')  # name the parser explicitly
>>> soup.find('td').find('div').text
u'Traceback (most recent call last): File "./crawler.py", line 381, in <module>  crawler.start() File "./crawler.py", line 153, in start  raise RemoteTransportException(e)RemoteTransportException:  This socket is already used by another greenlet: <bound method Waiter.  switch of <gevent.hub.Waiter object at 0x7f64d499d6e0>>' 

It should be soup.find('td').find('div').text –

Oops, sorry :/ – rofls

Do you need a solution with PyQuery? :) – rofls
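
For a PyQuery-side approach, it may help to see why div.text stops at the first <br>. The elements PyQuery yields are lxml elements, and lxml follows the ElementTree model: text that comes after a child tag is stored on that child as .tail, not on the parent's .text. Below is a minimal sketch that collects everything with itertext(). It uses the standard library's xml.etree.ElementTree so it runs without extra packages. The shortened fragment, the <br/> spelling (needed for the XML parser) and the helper name lines_of are assumptions for illustration, not part of the original code.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the markup in the question.
# Assumption: <br/> instead of <br>, so the stdlib XML parser accepts it.
fragment = (
    '<td field="exceptions"><div>Traceback (most recent call last):<br/>'
    'File "./crawler.py", line 381, in &lt;module&gt;<br/>'
    'crawler.start()<br/>'
    'RemoteTransportException: This socket is already used<br/></div></td>'
)


def lines_of(element):
    """Return every non-empty piece of text under element, in document order.

    itertext() yields element.text, then the text and tail of each
    descendant, so text that follows a <br/> is included as well.
    (lines_of is a hypothetical helper name.)
    """
    return [piece.strip() for piece in element.itertext() if piece.strip()]


div = ET.fromstring(fragment).find("div")

# .text alone stops at the first child tag, as in the question.
print(div.text)

# Walking text and tails recovers the whole traceback, one piece per line.
for line in lines_of(div):
    print(line)
```

The same helper should work on the elements from pq('td div'), since lxml elements also provide itertext(). PyQuery's own pq('td div').text() is another option, though how it renders the <br> breaks may depend on the PyQuery version.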
