Extracting data with Python

For training purposes, I'm trying to extract the line:

<td>2017/01/15</td>

from the following web page (inspect-element preview):

<div class="bodyy"> 
      <div id="FullPart"> 
           <p class="d_intro"> 


       <table id="ldeface" cellpadding="0" cellspacing="0"> 
        <tbody><tr> 
         <td class="dtime">Date</td> 
         <td class="datt">Notifier</td> 
         <td class="dHMR">H</td> 
         <td class="dHMR">M</td> 
         <td class="dHMR">R</td> 
         <td class="dhMR">L</td> 
         <td class="dR"><img src="/images/star.gif" border="0"></td> 
         <td class="dDom">Domain</td> 
         <td class="dos">OS</td> 
         <td class="dview">View</td> 
        </tr> 
              <tr> 
         <td>2017/02/10</td> 
         <td><a href="/testarchive/</a></td> 
         <td></td> 
         <td></td> 
         <td></td>

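I'm confused about how to get at the td part, and which parts (class/id) are the right ones to use so that BeautifulSoup returns the correct information. Thanks in advance.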

Source

2017-05-20 VorX

Try scrapy (https://doc.scrapy.org/en/latest/) and read their documentation. – Jahid

For your example, you can use the following.

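from bs4 import BeautifulSoup

soup = BeautifulSoup('your_html_source', 'html.parser')
for table in soup.find_all('table'):
    tr = table.find_all('tr')[1]    # second row, i.e. the first data row
    td = tr.find_all('td')[0].text  # text of the first cell in that row
    print(td)  # prints 2017/02/10 for the table in the question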

If you want the whole element, <td>2017/02/10</td>, rather than just its text, drop the .text attribute from the td variable.
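To make the difference concrete, here is a minimal sketch. The html_source string is a trimmed, made-up copy of the table from the question, not the real page:

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical copy of the table from the question.
html_source = """
<table id="ldeface">
  <tr><td class="dtime">Date</td><td class="datt">Notifier</td></tr>
  <tr><td>2017/02/10</td><td>someone</td></tr>
</table>
"""

soup = BeautifulSoup(html_source, "html.parser")
first_data_row = soup.find("table").find_all("tr")[1]
td = first_data_row.find_all("td")[0]

print(td)       # the whole element: <td>2017/02/10</td>
print(td.text)  # only the text: 2017/02/10
```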

BeautifulSoup4 also has very good documentation: see the Beautiful Soup documentation.

Source

2017-05-20 15:34:31

Take a look at this reference link:

https://chrisalbon.com/python/beautiful_soup_scrape_table.html

Source

2017-05-20 15:39:10

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.find_all('td'):
    # a date such as 2017/02/10 splits into three parts on '/'
    if len(tag.text.split('/')) == 3:
        print(tag.text)

This goes through the tags and, if a tag is a td whose content has the right number of slash-separated parts, prints it.
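Counting slashes also matches cells that merely happen to contain two slashes. A slightly stricter sketch parses each cell with datetime.strptime and keeps only real YYYY/MM/DD dates. The sample html string and the helper name is_date are made up for illustration:

```python
from datetime import datetime

from bs4 import BeautifulSoup

# Hypothetical sample: two date cells plus cells that should be rejected.
html = """
<table>
  <tr><td>2017/02/10</td><td>a/b/c</td><td></td></tr>
  <tr><td>2017/02/11</td><td>x</td><td></td></tr>
</table>
"""


def is_date(text):
    """Return True if text is a date written as YYYY/MM/DD."""
    try:
        datetime.strptime(text.strip(), "%Y/%m/%d")
        return True
    except ValueError:
        return False


soup = BeautifulSoup(html, "html.parser")
dates = [td.text.strip() for td in soup.find_all("td") if is_date(td.text)]
print(dates)  # ['2017/02/10', '2017/02/11']
```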

Source

2017-05-20 15:39:27

Collecting the data:

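To fetch the data for processing you can use urllib2 (on Python 3 it is urllib.request):

from urllib.request import urlopen  # Python 3; on Python 2 use urllib2.urlopen
resource = urlopen("http://www.somewebsite.com/somepage")
html = resource.read()
# assuming html is the example with a few more rows in the table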

Processing the data:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
for table in soup.find_all("table"):
    # .get() avoids a KeyError for tables that have no id attribute
    if table.get("id") == "ldeface":
        rows = table.find_all("tr")
        header = rows[0]
        # index of the column whose header cell reads "Date"
        date_col = [i for i, col in enumerate(header.find_all("td")) if col.text == "Date"][0]
        for row in rows[1:]:
            print(row.find_all("td")[date_col].text)

Result:

2017/02/10 
2017/02/11 
2017/03/10 
...

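You can extract other columns based on the text in their header cell, pick the table by its id attribute as I did, or pick it by its class attribute in a similar way.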
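As a rough sketch of those variations, the snippet below selects the table directly by id and locates a column by the class of its header cell instead of its text. The html string and the domain values in it are invented for illustration; only the id and class names come from the question:

```python
from bs4 import BeautifulSoup

# Hypothetical sample using the id and class names from the question.
html = """
<table id="ldeface">
  <tr><td class="dtime">Date</td><td class="dDom">Domain</td></tr>
  <tr><td>2017/02/10</td><td>example.org</td></tr>
  <tr><td>2017/02/11</td><td>example.net</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Select the table by its id instead of looping over every table.
table = soup.find("table", id="ldeface")
rows = table.find_all("tr")

# Find the column whose header cell carries the class "dDom".
header_cells = rows[0].find_all("td")
domain_col = next(
    i for i, cell in enumerate(header_cells) if "dDom" in cell.get("class", [])
)

domains = [row.find_all("td")[domain_col].text for row in rows[1:]]
print(domains)  # ['example.org', 'example.net']
```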

Source

2017-05-20 16:09:28 Robb
