2017-07-24

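I made a web scraper that scrapes data from pages that look like this one (it scrapes the tables): https://www.techpowerup.com/gpudb/2 How do I scrape the subheadings from the page at this link?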

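The problem is that my program, for some reason, only scrapes the values and not the subheadings. For example (click the link), it only picks up "R420", "130 nm", "160 million", etc., but not "GPU Name", "Process Size", "Transistors", etc.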

What code do I need to add to get the subheadings? Here is my code:

import csv 
import requests 
import bs4 
url = "https://www.techpowerup.com/gpudb/2" 


#obtain HTML and parse through it 
response = requests.get(url) 
html = response.content 
import sys 
reload(sys) 
sys.setdefaultencoding('utf-8') 
soup = bs4.BeautifulSoup(html, "lxml") 
tables = soup.findAll("table") 

#reading every value in every row in each table and making a matrix 
tableMatrix = []
for table in tables:
    list_of_rows = []
    for row in table.findAll('tr'):
        list_of_cells = []
        for cell in row.findAll('td'):
            text = cell.text.replace(' ', '')
            list_of_cells.append(text)
        list_of_rows.append(list_of_cells)
    tableMatrix.append((list_of_rows, list_of_cells))

#(YOU CAN PROBABLY IGNORE THIS)placeHolder used to avoid duplicate data from appearing in list 
placeHolder = 0
excelTable = []
for table in tableMatrix:
    for row in table:
        if placeHolder == 0:
            for entry in row:
                excelTable.append(entry)
            placeHolder = 1
        else:
            placeHolder = 0
    excelTable.append('\n')

for value in excelTable: 
    print value 
    print '\n' 


#create excel file and write the values into a csv 
fl = open(str(count) + '.csv', 'w') 
writer = csv.writer(fl) 
for values in excelTable: 
    writer.writerow(values) 
fl.close() 

Answers

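If you inspect the page source, you will see that those cells are header cells, so they use TH tags rather than TD tags. You may want to update your loop to include TH cells as well as TD cells.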
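As a rough illustration of that suggestion, here is a minimal sketch in Python 3 that collects both header and data cells from each row. The `SAMPLE_HTML` markup and the `table_rows` helper are made up for this example and only assume that the labels sit in `th` cells and the values in `td` cells, as described above; the real page may be structured differently. It uses the built-in `html.parser` backend rather than `lxml` to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the label is a <th> cell, the value a <td> cell.
SAMPLE_HTML = """
<table>
  <tr><th>GPU Name</th><td>R420</td></tr>
  <tr><th>Process Size</th><td>130 nm</td></tr>
  <tr><th>Transistors</th><td>160 million</td></tr>
</table>
"""


def table_rows(html):
    """Return every table row as a list of cell texts, header cells included."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for row in soup.find_all("tr"):
        # Passing a list of tag names matches any of them, in document order.
        cells = row.find_all(["th", "td"])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


rows = table_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

Each row then carries its label next to its value, which can be passed to `csv.writer.writerow` as a single record.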

When I add "th" so that the call becomes findAll('td', 'th'), it just doesn't show anything. Should I not be adding it this way?

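That is not the correct way to use the findAll method. Please try this instead: findAll(re.compile('td|th')). Basically, the first argument is the tag name and the second argument is the attributes, so what you wrote tries to find a td tag with the attribute th. (Don't forget to import re.) – xycf7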
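To make the difference between these calls concrete, here is a small Python 3 sketch run against a made-up one-row table (the `HTML` string and the `names` helper are assumptions for illustration only). As far as the Beautiful Soup documentation describes it, a string passed as the second positional argument is treated as a CSS class filter, and a compiled regular expression is applied to tag names with `search()`, so an unanchored `td|th` also matches `thead`. Anchoring the pattern or passing a list of names avoids that.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup with a <thead> wrapper to show the regex side effect.
HTML = "<table><thead><tr><th>GPU Name</th><td>R420</td></tr></thead></table>"
soup = BeautifulSoup(HTML, "html.parser")


def names(tags):
    """Return just the tag names of a result set, in document order."""
    return [tag.name for tag in tags]


# Second positional argument is a class filter: <td class="th">, none here.
as_class = names(soup.find_all("td", "th"))

# Unanchored regex is searched within the tag name, so "thead" matches too.
loose = names(soup.find_all(re.compile("td|th")))

# Anchored regex matches only the exact names "td" and "th".
anchored = names(soup.find_all(re.compile(r"^(td|th)$")))

# A list of names is the simplest way to match either tag exactly.
listed = names(soup.find_all(["td", "th"]))

print(as_class, loose, anchored, listed)
```

For the scraper in the question, the list form is probably the least surprising choice, since it needs no extra import and cannot pick up tags such as `thead` by accident.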

Thanks!