2017-02-12 57 views

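Parsing HTML table data into a dictionary with BeautifulSoup

I want to parse information stored in an HTML table with BeautifulSoup and store it in a dictionary. I can already reach the table and iterate over its values, but the extracted text still contains a lot of junk that I don't know how to handle.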

import requests
from bs4 import BeautifulSoup

# load the HTML file
r = requests.get("http://www.ebay.com/itm/222378225962")
soup = BeautifulSoup(r.content, "html.parser")

# navigate to the item attributes table
table = soup.find('div', 'itemAttr')

# collect the stripped text of every row
attr = []
for i in table.find_all("tr"):
    attr.append(i.text.strip().replace('\t', ''))

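With this approach, the data looks like this. As you can see, there is a lot of junk in it, and some rows contain more than one item, such as Year and VIN.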

[u'Condition:\nUsed', 
u'Seller Notes:\n\u201cExcellent Condition\u201d', 
u'Year: \n\n2015\n\n VIN (Vehicle Identification Number): \n\n2G1FJ1EW2F9192023', 
u'Mileage: \n\n29,000\n\n Transmission: \n\nManual', 
u'Make: \n\nChevrolet\n\n Body Type: \n\nCoupe', 
u'Model: \n\nCamaro\n\n Warranty: \n\nVehicle has an existing warranty', 
u'Trim: \n\nSS Coupe 2-Door\n\n Vehicle Title: \n\nClear', 
u'Engine: \n\n6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated\n\n Options: \n\nLeather Seats', 
u'Drive Type: \n\nRWD\n\n Safety Features: \n\nAnti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags', 
u'Power Options: \n\nAir Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats\n\n Sub Model: \n\n1LE', 
u'Fuel Type: \n\nGasoline\n\n Color: \n\nWhite', 
u'For Sale By: \n\nPrivate Seller\n\n Interior Color: \n\nBlack', 
u'Disability Equipped: \n\nNo\n\n Number of Cylinders: \n\n8', 
u''] 

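Ultimately, I want to store the data in the dictionary below. I know how to create a dictionary, but not how to clean up the data that goes into it without resorting to brute-force find-and-replace.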

{'Condition' : 'Used', 
'Seller Notes' : 'Excellent Condition', 
'Year': '2015', 
'VIN (Vehicle Identification Number)': '2G1FJ1EW2F9192023', 
'Mileage': '29,000', 
'Transmission': 'Manual', 
'Make': 'Chevrolet', 
'Body Type': 'Coupe', 
'Model': 'Camaro', 
'Warranty': 'Vehicle has an existing warranty', 
'Trim': 'SS Coupe 2-Door', 
'Vehicle Title' : 'Clear', 
'Engine': '6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated', 
'Options': 'Leather Seats', 
'Drive Type': 'RWD', 
'Safety Features' : 'Anti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags', 
'Power Options' : 'Air Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats', 
'Sub Model' : '1LE', 
'Fuel Type' : 'Gasoline', 
'Exterior Color' : 'White', 
'For Sale By' : 'Private Seller', 
'Interior Color' : 'Black', 
'Disability Equipped' : 'No', 
'Number of Cylinders': '8'} 

Answer

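Rather than trying to parse the data out of the tr elements, a better approach is to iterate over the td.attrLabels elements. You can use these labels as the keys and the adjacent sibling element of each one as the value.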

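In the example below, the CSS selector div.itemAttr td.attrLabels selects every td element with the class attrLabels that is a descendant of div.itemAttr. From there, the .find_next_sibling() method is used to find the adjacent sibling element.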

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.ebay.com/itm/222378225962")
soup = BeautifulSoup(r.content, 'lxml')

# pair each label cell with the element that follows it
data = []
for label in soup.select('div.itemAttr td.attrLabels'):
    data.append({ label.text.strip(): label.find_next_sibling().text.strip() })

Output:

> [{'Year:': '2015'}, {'VIN (Vehicle Identification Number):': '2G1FJ1EW2F9192023'}, {'Mileage:': '29,000'}, {'Transmission:': 'Manual'}, {'Make:': 'Chevrolet'}, {'Body Type:': 'Coupe'}, {'Model:': 'Camaro'}, {'Warranty:': 'Vehicle has an existing warranty'}, {'Trim:': 'SS Coupe 2-Door'}, {'Vehicle Title:': 'Clear'}, {'Engine:': '6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated'}, {'Options:': 'Leather Seats'}, {'Drive Type:': 'RWD'}, {'Safety Features:': 'Anti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags'}, {'Power Options:': 'Air Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats'}, {'Sub Model:': '1LE'}, {'Fuel Type:': 'Gasoline'}, {'Exterior Color:': 'White'}, {'For Sale By:': 'Private Seller'}, {'Interior Color:': 'Black'}, {'Disability Equipped:': 'No'}, {'Number of Cylinders:': '8'}] 
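The selector-plus-sibling pattern can be tried offline on a small hand-written fragment. The markup below is a hypothetical stand-in that only assumes the structure described above (a div.itemAttr wrapper, and td.attrLabels label cells each followed by a value cell); it is not copied from the real page. As a sketch, it also collects the pairs into one dictionary rather than a list of one-item dictionaries, which is closer to what the question asks for.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the assumed layout: each label cell
# (td.attrLabels) is immediately followed by its value cell.
SAMPLE_HTML = """
<div class="itemAttr">
  <table>
    <tr>
      <td class="attrLabels"> Year: </td><td> <span>2015</span> </td>
      <td class="attrLabels"> Mileage: </td><td> <span>29,000</span> </td>
    </tr>
    <tr>
      <td class="attrLabels"> Make: </td><td> <span>Chevrolet</span> </td>
      <td class="attrLabels"> Model: </td><td> <span>Camaro</span> </td>
    </tr>
  </table>
</div>
"""

# The built-in parser is used here so the sketch has no lxml dependency.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

attributes = {}
for label in soup.select("div.itemAttr td.attrLabels"):
    # find_next_sibling() skips whitespace text nodes and returns the next tag.
    value_cell = label.find_next_sibling()
    attributes[label.get_text(strip=True)] = value_cell.get_text(strip=True)

print(attributes)
```

Writing into a single dictionary means a repeated label would overwrite the earlier value, so this variant only suits tables whose labels are unique.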

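If you also want to retrieve the table header th elements, you can select the table element first and then use the CSS selector th, td.attrLabels to retrieve both kinds of label: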

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.ebay.com/itm/222378225962")
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('div', 'itemAttr')

# th cells and td.attrLabels cells are both treated as labels
data = []
for label in table.select('th, td.attrLabels'):
    data.append({ label.text.strip(): label.find_next_sibling().text.strip() })

Output:

> [{'Condition:': 'Used'}, {'Seller Notes:': '“Excellent Condition”'}, {'Year:': '2015'}, {'VIN (Vehicle Identification Number):': '2G1FJ1EW2F9192023'}, {'Mileage:': '29,000'}, {'Transmission:': 'Manual'}, {'Make:': 'Chevrolet'}, {'Body Type:': 'Coupe'}, {'Model:': 'Camaro'}, {'Warranty:': 'Vehicle has an existing warranty'}, {'Trim:': 'SS Coupe 2-Door'}, {'Vehicle Title:': 'Clear'}, {'Engine:': '6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated'}, {'Options:': 'Leather Seats'}, {'Drive Type:': 'RWD'}, {'Safety Features:': 'Anti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags'}, {'Power Options:': 'Air Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats'}, {'Sub Model:': '1LE'}, {'Fuel Type:': 'Gasoline'}, {'Exterior Color:': 'White'}, {'For Sale By:': 'Private Seller'}, {'Interior Color:': 'Black'}, {'Disability Equipped:': 'No'}, {'Number of Cylinders:': '8'}]
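One thing to watch with this loop: find_next_sibling() returns None when a label cell has no element after it, and calling .text on None raises AttributeError. A defensive variant, sketched here on hypothetical markup (the fragment is invented for illustration and the real page layout is assumed, not verified), simply skips such labels:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one th/td pair, one td.attrLabels/td pair,
# and one label cell with no value cell after it.
SAMPLE_HTML = """
<div class="itemAttr">
  <table>
    <tr><th>Condition:</th><td>Used</td></tr>
    <tr><td class="attrLabels">Year:</td><td>2015</td></tr>
    <tr><td class="attrLabels">Orphan:</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("div", "itemAttr")

data = []
for label in table.select("th, td.attrLabels"):
    value_cell = label.find_next_sibling()
    if value_cell is None:
        # No adjacent element, so there is nothing to pair the label with.
        continue
    data.append({label.get_text(strip=True): value_cell.get_text(strip=True)})

print(data)
```

Whether skipping is the right choice depends on the use case; storing an empty string for the missing value would be an equally reasonable option.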

If you want to strip the non-alphanumeric characters from the keys, you can use:

import re

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.ebay.com/itm/222378225962")
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('div', 'itemAttr')

data = []
for label in table.select('th, td.attrLabels'):
    # drop every run of non-word characters from the key
    key = re.sub(r'\W+', '', label.text.strip())
    value = label.find_next_sibling().text.strip()

    data.append({ key: value })

Output:

> [{'Condition': 'Used'}, {'SellerNotes': '“Excellent Condition”'}, {'Year': '2015'}, {'VINVehicleIdentificationNumber': '2G1FJ1EW2F9192023'}, {'Mileage': '29,000'}, {'Transmission': 'Manual'}, {'Make': 'Chevrolet'}, {'BodyType': 'Coupe'}, {'Model': 'Camaro'}, {'Warranty': 'Vehicle has an existing warranty'}, {'Trim': 'SS Coupe 2-Door'}, {'VehicleTitle': 'Clear'}, {'Engine': '6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated'}, {'Options': 'Leather Seats'}, {'DriveType': 'RWD'}, {'SafetyFeatures': 'Anti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags'}, {'PowerOptions': 'Air Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats'}, {'SubModel': '1LE'}, {'FuelType': 'Gasoline'}, {'ExteriorColor': 'White'}, {'ForSaleBy': 'Private Seller'}, {'InteriorColor': 'Black'}, {'DisabilityEquipped': 'No'}, {'NumberofCylinders': '8'}]
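The \W+ substitution also removes the spaces inside a key (SellerNotes), whereas the dictionary asked for in the question keeps them (Seller Notes). One possible alternative, shown as a sketch on a few label/value pairs taken from the output above, is to strip only the trailing colon from each key and the quotation marks wrapped around a value, and to merge everything into a single dictionary. The helper name clean_pairs is made up for this illustration and is not part of BeautifulSoup.

```python
def clean_pairs(pairs):
    """Merge (label, value) pairs into one dict with tidied keys and values."""
    cleaned = {}
    for label, value in pairs:
        # Keep inner spaces and parentheses; drop only the trailing colon.
        key = label.strip().rstrip(":").strip()
        # Remove curly quotes wrapped around values such as seller notes.
        cleaned[key] = value.strip().strip("\u201c\u201d")
    return cleaned


# A few label/value pairs as they come out of the loops above.
raw_pairs = [
    ("Condition:", "Used"),
    ("Seller Notes:", "\u201cExcellent Condition\u201d"),
    ("VIN (Vehicle Identification Number):", "2G1FJ1EW2F9192023"),
]

result = clean_pairs(raw_pairs)
print(result)
```

In the scraping loop, the same two clean-up steps could be applied to label.text and label.find_next_sibling().text before assigning into the dictionary.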