2013-08-22 40 views

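Can someone please advise how to retrieve the href and img values from a <td>? I wrote the code below, which produces the output shown further down. I can retrieve values as far as the <td> text, but I'm not sure how to go further and get at the attributes of the tags inside it.

Please note that there are many <tr> rows. I have included only a small sample.

My code: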

from bs4 import BeautifulSoup 
import urllib2 
url="http://mywebsite.com/" 
page=urllib2.urlopen(url) 
soup = BeautifulSoup(page.read()) 

records = [] 
tabledata = soup.find("table", {"class" : "class1"}) 
for row in tabledata.findAll('tr'):
    col = row.findAll('td')
    if col:
        col1 = col[1].string.strip()
        col2 = col[2].string.strip()
        col3 = col[3].string.strip()
        record = '%s %s %s' % (col1, col2, col3)
        records.append(record)


for values in records: 
    print values 

Data:

<table class="class1"> 
<tr> 
<th></th> 
<th>Heading1</th> 
<th>Heading2</th> 
<th>Heading3</th>
</tr> 
<tr> 
<td><img src="http://image.com/new.png"/></td> 
<td>Data1</td> 
<td><a href="www.sample.com">Data2</a></td> 
<td>Data3</td> 
</tr> 

Output:

Data1 Data2 Data3 

Desired output:

Data1 Data2 Data3 www.sample.com new.png 

Answers

0

Here is a solution:

from bs4 import BeautifulSoup
import urllib2
#url="http://mywebsite.com/"
#page=urllib2.urlopen(url)


def getdata(col):
    # Collect the img sources, link targets, and text of a single <td>.
    record = []
    for image in col.findAll('img'):
        src = image.get('src')
        record.append(src)
    for a in col.findAll('a'):
        href = a.get('href')
        record.append(href)
    # col.string is None when the cell has no single text child.
    if col.string:
        record.append(col.string.strip())
    return record


def extract():
    # Read the sample table from a local file instead of the live URL.
    url = "test.html"
    soup = BeautifulSoup(open(url).read())

    records = []
    tabledata = soup.find("table", {"class": "class1"})
    for row in tabledata.findAll('tr'):
        cols = row.findAll('td')
        for col in cols:
            record = getdata(col)
            records.extend(record)
    return records

if __name__ == "__main__":
    records = extract()
    print "records:", records
    for v in records:
        print v

Output:

http://image.com/new.png 
Data1 
www.sample.com 
Data2 
Data3 

Loop over every <td>, extract the data you need from it, and append it to the records.
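The code in this answer flattens every cell into a single list, whereas the question asks for one line per row with the link target and image file name at the end. Here is a minimal sketch of that variant, not a definitive implementation. It assumes Python 3 and the built-in "html.parser" backend rather than the Python 2 setup used above. The names SAMPLE_HTML, row_to_record, and extract_records are made up for illustration, and treating the last path segment of src as the image name is an assumption based on the desired output.

```python
import posixpath
from urllib.parse import urlparse

from bs4 import BeautifulSoup

# Hypothetical sample mirroring the table in the question.
SAMPLE_HTML = """
<table class="class1">
<tr><th></th><th>Heading1</th><th>Heading2</th><th>Heading3</th></tr>
<tr>
<td><img src="http://image.com/new.png"/></td>
<td>Data1</td>
<td><a href="www.sample.com">Data2</a></td>
<td>Data3</td>
</tr>
</table>
"""


def row_to_record(row):
    """Join a row's cell text, link targets, and image names into one string."""
    cells = row.find_all("td")
    # Visible text per cell; a cell holding only an image yields "" and is dropped.
    texts = [cell.get_text(strip=True) for cell in cells]
    texts = [text for text in texts if text]
    # Link targets of every <a> in the row that has an href attribute.
    hrefs = [a["href"] for a in row.find_all("a", href=True)]
    # Assumption: the image "name" is the last path segment of its src URL.
    images = [
        posixpath.basename(urlparse(img["src"]).path)
        for img in row.find_all("img", src=True)
    ]
    return " ".join(texts + hrefs + images)


def extract_records(html):
    """Return one record string per data row of the class1 table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "class1"})
    records = []
    for row in table.find_all("tr"):
        if row.find("td"):  # skip the header row, which has only <th> cells
            records.append(row_to_record(row))
    return records


if __name__ == "__main__":
    for record in extract_records(SAMPLE_HTML):
        print(record)
```

Building each record from the whole row, rather than extending one shared list cell by cell, keeps the values of a row together, which is what the desired output needs.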

0

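The .string attribute only returns text when the tag has a single child node to take it from; otherwise it is None. You also need to look in each column for the other tags you are interested in (<a> and <img>) and extract the attributes you want to print from them.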
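To make the .string behaviour concrete, here is a small sketch that compares .string with get_text() on a few cells. It assumes Python 3 and the built-in "html.parser" backend; CELLS_HTML and describe_cells are hypothetical names, and the fourth cell is an invented example of mixed content that does not appear in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical cells: plain text, a link, an image, and mixed content.
CELLS_HTML = """
<table><tr>
<td>Data1</td>
<td><a href="www.sample.com">Data2</a></td>
<td><img src="http://image.com/new.png"/></td>
<td>Data <b>4</b></td>
</tr></table>
"""


def describe_cells(html):
    """Return (.string, get_text(), href, src) for every <td> in the markup."""
    soup = BeautifulSoup(html, "html.parser")
    described = []
    for cell in soup.find_all("td"):
        link = cell.find("a")
        image = cell.find("img")
        described.append((
            cell.string,                      # None unless there is one text source
            cell.get_text(" ", strip=True),   # all text in the cell, joined by spaces
            link.get("href") if link else None,
            image.get("src") if image else None,
        ))
    return described


if __name__ == "__main__":
    for string, text, href, src in describe_cells(CELLS_HTML):
        print(repr(string), repr(text), href, src)
```

The image cell and the mixed-content cell both give None for .string, so calling .strip() on it would raise an AttributeError. Checking for None, or using get_text(), avoids that.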