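How do I scrape the subheadings from this linked webpage?
I made a web scraper that scrapes data from webpages that look like this one (it scrapes the tables): https://www.techpowerup.com/gpudb/2
The problem is that my program, for some reason, only scrapes the values and not the subheadings. For example (click the link), it only picks up "R420", "130nm", "160 million", etc., but not "GPU Name", "Process Size", "Transistors", etc.
What code do I need to add to get the subheadings as well? Here is my code: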
import csv
import requests
import bs4

url = "https://www.techpowerup.com/gpudb/2"

# obtain HTML and parse through it
response = requests.get(url)
html = response.content
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
soup = bs4.BeautifulSoup(html, "lxml")
tables = soup.findAll("table")

# reading every value in every row in each table and making a matrix
tableMatrix = []
for table in tables:
    list_of_rows = []
    for row in table.findAll('tr'):
        list_of_cells = []
        for cell in row.findAll('td'):
            text = cell.text.replace(' ', '')
            list_of_cells.append(text)
        list_of_rows.append(list_of_cells)
    tableMatrix.append((list_of_rows, list_of_cells))

# (YOU CAN PROBABLY IGNORE THIS) placeHolder used to avoid duplicate data from appearing in list
placeHolder = 0
excelTable = []
for table in tableMatrix:
    for row in table:
        if placeHolder == 0:
            for entry in row:
                excelTable.append(entry)
            placeHolder = 1
        else:
            placeHolder = 0
            excelTable.append('\n')

for value in excelTable:
    print value
    print '\n'

# create excel file and write the values into a csv
fl = open(str(count) + '.csv', 'w')
writer = csv.writer(fl)
for values in excelTable:
    writer.writerow(values)
fl.close()
When I add "th", so that it becomes `findAll('td', 'th')`, it just doesn't show anything. Should I not be adding that?
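That is not the correct way to use the findAll method. Try this instead: `findAll(re.compile('td|th'))`. Basically, the first argument is the tag name and the second is the attributes, so what you wrote tries to find a `td` tag with `th` as an attribute (don't forget to import `re`). – xycf7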
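To illustrate the comment above, here is a minimal sketch that runs on a small inline table rather than the live page. The sample HTML is an assumption about how such a spec table might be laid out (`th` for the subheading, `td` for the value), and it uses the built-in `html.parser` instead of `lxml` so no extra parser is needed. In Beautiful Soup, a string passed as the second positional argument is treated as a CSS class filter, which is why `findAll('td', 'th')` comes back empty. Passing a list of tag names is an alternative to the regex. Beautiful Soup applies a compiled regex with `search`, so an unanchored `td|th` also matches a `thead` tag; anchoring the pattern avoids that.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page's structure may differ.
SAMPLE_HTML = """
<table>
  <thead><tr><th colspan="2">Graphics Processor</th></tr></thead>
  <tbody>
    <tr><th>GPU Name</th><td>R420</td></tr>
    <tr><th>Process Size</th><td>130 nm</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The second positional argument filters by CSS class, so this looks for
# <td class="th"> and finds nothing in the sample.
wrong = soup.find_all("td", "th")

# A list of names matches either tag, in document order.
cells = soup.find_all(["td", "th"])

# An anchored regex matches exactly "td" or "th".
anchored = soup.find_all(re.compile(r"^t[dh]$"))

# An unanchored regex is applied with search(), so "thead" matches too.
unanchored = soup.find_all(re.compile("td|th"))

cell_texts = [cell.get_text(strip=True) for cell in cells]
print(len(wrong))
print(cell_texts)
print([tag.name for tag in unanchored])
```

`find_all` is the current spelling of `findAll`; both names refer to the same method in Beautiful Soup 4.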
Thanks!
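For anyone reading this on Python 3, the flow from the question (collect the cells of every row, then write them out as CSV) can be sketched roughly as follows. It is only an illustration under assumptions: the inline `SAMPLE_HTML` stands in for the downloaded page, `rows_from_tables` is a hypothetical helper name, and the CSV goes to an in-memory buffer instead of a file. Each table row becomes one CSV row holding the subheading and the value together.

```python
import csv
import io
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for response.content.
SAMPLE_HTML = """
<table>
  <tr><th>GPU Name</th><td>R420</td></tr>
  <tr><th>Process Size</th><td>130 nm</td></tr>
  <tr><th>Transistors</th><td>160 million</td></tr>
</table>
"""


def rows_from_tables(html):
    """Return one list of cell texts per <tr>, covering <th> and <td> cells."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for table in soup.find_all("table"):
        for tr in table.find_all("tr"):
            cells = [c.get_text(strip=True) for c in tr.find_all(["th", "td"])]
            if cells:  # skip rows that contain no cells at all
                rows.append(cells)
    return rows


rows = rows_from_tables(SAMPLE_HTML)

# newline="" leaves line endings to the csv module, as its docs recommend.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, the same `csv.writer` call works on a handle opened with `open(path, "w", newline="")`.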