2015-11-24 26 views

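Python requests gets wrong HTML when using the get function

I am trying to take the data from this site and save it into a database. When I inspect the site with Firebug, the table rows are well-formed. But my code below gets malformed HTML content.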

from bs4 import BeautifulSoup
import requests
from peewee import SqliteDatabase, CharField, Model

db = SqliteDatabase("cybercrime.db")

class CyberCrimeList(Model):
    date = CharField()
    url = CharField()
    ip = CharField()
    type = CharField()

    class Meta:
        database = db


url = "http://cybercrime-tracker.net/index.php?m=4"
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, "html.parser")
table = soup.find('table', attrs={'class': 'ExploitTable'})
print(table.tbody)

But the code returns only the first row, and it is malformed: I get </tr></td> instead of </td></tr>.

What am I misunderstanding? What is wrong with my code?

<tr><td>23-11-2015</td> 
<td>jda3.byethost3.com/panel/index.php?login</td> 
<td><a href="https://www.virustotal.com/en/ip-address/185.27.134.160/information/" target="_blank">185.27.134.160</a></td> 
<td>Solar</td> 
<td><a href="https://www.virustotal.com/latest-scan/http://jda3.byethost3.com/panel/index.php?login" target="_blank"><img alt="Scan with VirusTotal" border="0" height="12" longdesc="Scan with VirusTotal" src="vt.png" width="13"/></a> <a href="http://cybercrime-tracker.net/index.php?s=0&amp;m=40&amp;search=Solar"><img alt="Search the family" border="0" height="12" longdesc="Search the family" src="vwicn008.gif" width="13"/></a></td></tr> 

So what are you trying to get? –


I want to populate the database with the date, URL, IP, and type values. But right now the code returns only one row instead of four. – Pant

Answers


Use lxml to get all the results:

soup = BeautifulSoup(html, "lxml") 

It seems "html.parser" has a problem with this site.
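To check whether the parser is really what makes the difference, one option is to feed the same markup to each available parser and count the rows each one finds. The following is only a minimal sketch: `count_rows`, `compare_parsers`, and the `SAMPLE` markup are hypothetical names and data made up for illustration, and the sample is well-formed, so every installed parser should agree on it. On the real page you would pass `response.text` instead.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical, well-formed stand-in for the real page (assumption for illustration).
SAMPLE = """<table class="ExploitTable"><tbody>
<tr><td>23-11-2015</td><td>example.test/a</td></tr>
<tr><td>23-11-2015</td><td>example.test/b</td></tr>
</tbody></table>"""


def count_rows(html, parser):
    """Return how many <tr> elements the given parser finds in the ExploitTable."""
    soup = BeautifulSoup(html, parser)
    table = soup.find("table", attrs={"class": "ExploitTable"})
    return len(table.find_all("tr")) if table else 0


def compare_parsers(html, parsers=("html.parser", "lxml", "html5lib")):
    """Map each parser name to its row count, or None if it is not installed."""
    results = {}
    for name in parsers:
        try:
            results[name] = count_rows(html, name)
        except FeatureNotFound:
            # Raised by BeautifulSoup when the requested parser is unavailable.
            results[name] = None
    return results


if __name__ == "__main__":
    for name, rows in compare_parsers(SAMPLE).items():
        print(name, rows)
```

If the counts differ on the real page, the markup is probably malformed and the parsers are repairing it in different ways, which would match the behaviour described in the question.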


Using 'lxml' seems to solve the problem. As the docs say, 'html.parser' is 'less lenient'. Thanks. – Pant


Well, you could try searching for the tr tags like this:

from bs4 import BeautifulSoup
import requests
from peewee import SqliteDatabase, CharField, Model

db = SqliteDatabase("cybercrime.db")

class CyberCrimeList(Model):
    date = CharField()
    url = CharField()
    ip = CharField()
    type = CharField()

    class Meta:
        database = db

url = "http://cybercrime-tracker.net/index.php?m=4"
response = requests.get(url)

# I'd recommend using response.text instead of response.content
# when the result is text
html = response.text

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all('tr')

for row in rows[1:]:  # skip the first element
    print(row)
    print()

The output looks like:

<tr><td>23-11-2015</td> 
<td>jda3.byethost3.com/panel/index.php?login</td> 
<td><a href="https://www.virustotal.com/en/ip-address/185.27.134.160/information/" target="_blank">185.27.134.160</a></td> 
<td>Solar</td> 
<td><a href="https://www.virustotal.com/latest-scan/http://jda3.byethost3.com/panel/index.php?login" target="_blank"><img alt="Scan with VirusTotal" border="0" height="12" longdesc="Scan with VirusTotal" src="vt.png" width="13"/></a> <a href="http://cybercrime-tracker.net/index.php?s=0&amp;m=40&amp;search=Solar"><img alt="Search the family" border="0" height="12" longdesc="Search the family" src="vwicn008.gif" width="13"/></a></td></tr> 

<tr><td>23-11-2015</td> 
<td>www.fyzee.top/senikan/web/login.php</td> 
<td><a href="https://www.virustotal.com/en/ip-address/68.168.209.242/information/" target="_blank">68.168.209.242</a></td> 
<td>KeyBase</td> 
<td><a href="https://www.virustotal.com/latest-scan/http://www.fyzee.top/senikan/web/login.php" target="_blank"><img alt="Scan with VirusTotal" border="0" height="12" longdesc="Scan with VirusTotal" src="vt.png" width="13"/></a> <a href="http://cybercrime-tracker.net/index.php?s=0&amp;m=40&amp;search=KeyBase"><img alt="Search the family" border="0" height="12" longdesc="Search the family" src="vwicn008.gif" width="13"/></a></td></tr> 

<tr><td>23-11-2015</td> 
<td>www.fyzee.top/kech/web/login.php</td> 
<td><a href="https://www.virustotal.com/en/ip-address/68.168.209.242/information/" target="_blank">68.168.209.242</a></td> 
<td>KeyBase</td> 
<td><a href="https://www.virustotal.com/latest-scan/http://www.fyzee.top/kech/web/login.php" target="_blank"><img alt="Scan with VirusTotal" border="0" height="12" longdesc="Scan with VirusTotal" src="vt.png" width="13"/></a> <a href="http://cybercrime-tracker.net/index.php?s=0&amp;m=40&amp;search=KeyBase"><img alt="Search the family" border="0" height="12" longdesc="Search the family" src="vwicn008.gif" width="13"/></a></td></tr> 

<tr><td>23-11-2015</td> 
<td>sentfactor.com/medinshushu/admin.php</td> 
<td><a href="https://www.virustotal.com/en/ip-address/50.31.160.159/information/" target="_blank">50.31.160.159</a></td> 
<td>Pony</td> 
<td><a href="https://www.virustotal.com/latest-scan/http://sentfactor.com/medinshushu/admin.php" target="_blank"><img alt="Scan with VirusTotal" border="0" height="12" longdesc="Scan with VirusTotal" src="vt.png" width="13"/></a> <a href="http://cybercrime-tracker.net/index.php?s=0&amp;m=40&amp;search=Pony"><img alt="Search the family" border="0" height="12" longdesc="Search the family" src="vwicn008.gif" width="13"/></a></td></tr> 