2016-05-27 59 views
0

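Scraping all tables in a div using BeautifulSoup

I need to extract the <tr> tags from all the tables inside <div id="specs-list">. However, only the first six tables are extracted. Here is the page. This is my code.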

def getPhoneStats(url):
    urls={}
    try:
        request= requests.get(url)
        if request.status_code == 200:
            sourceCode = BeautifulSoup(request.content,"html.parser")
            tables = sourceCode.select('#specs-list table')
            for table in tables:
                tag = table.find('tr')
                print(tag.get_text())
        else:
            print('no table or row found ')
    except requests.HTTPError as e:
        print('Unable to open url',e)

It only prints up to the sixth table in the div:

Network 
Technology 
GSM/HSPA/LTE 


Launch 
Announced 
2015, March 


Body 
Dimensions 
152.6 x 76.2 x 8 mm (6.01 x 3.00 x 0.31 in) 


Display 
Type 
IPS capacitive touchscreen, 16M colors 


Platform 
OS 
Android OS, v5.0.2 (Lollipop), upgradable to v6.0 (Marshmallow) 


Memory 
Card slot 
microSD, up to 32 GB (dedicated slot) 

Process finished with exit code 0 

Answers

2

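The HTML is malformed. The "Memory" table has too many closing </td> and </tr> tags at the end, and I think that confuses the parser. I had better luck skipping the div and looking for the tables directly: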

from bs4 import BeautifulSoup
import requests


def getPhoneStats(url):
    try:
        request = requests.get(url)
        if request.status_code == 200:
            soup = BeautifulSoup(request.content, "html.parser")

            # Search the whole document for tables instead of going through the div
            for table in soup.find_all("table"):
                # The first <th> in the table names the section (e.g. "Network")
                header = table.th.get_text()
                for row in table.find_all("tr"):
                    out_row = [header]
                    for col in row.find_all("td"):
                        out_row.append(col.get_text())
                    print(out_row)
        else:
            print('unable to connect ')
    except requests.RequestException as e:
        # requests.get() raises connection errors, not HTTPError, so catch the base class
        print('Unable to open url', e)

if __name__ == "__main__":
    getPhoneStats('http://www.gsmarena.com/lenovo_k3_note-7147.php')

This gives the result:

['Network', 'Technology', 'GSM/HSPA/LTE'] 
['Network', '2G bands', 'GSM 850/900/1800/1900 - SIM 1 & SIM 2'] 
['Network', '\xa0', 'GSM 850/900/1800/1900 - SIM 1 & SIM 2 - India'] 
['Network', '3G bands', 'HSDPA 850/900/1900/2100 '] 
['Network', '\xa0', 'HSDPA 2100 - India'] 
['Network', '4G bands', 'LTE band 1(2100), 3(1800), 7(2600), 38(2600), 39(1900), 40(2300), 41(2500)'] 
['Network', 'Speed', 'HSPA, TD-SCDMA, LTE, TD-LTE'] 
['Network', 'GPRS', 'Yes'] 
['Network', 'EDGE', 'Yes'] 
['Launch', 'Announced', '2015, March'] 
['Launch', 'Status', 'Available. Released 2015, March'] 
['Body', 'Dimensions', '152.6 x 76.2 x 8 mm (6.01 x 3.00 x 0.31 in)'] 
['Body', 'Weight', '150 g (5.29 oz)'] 
['Body', 'SIM', 'Dual SIM (Micro-SIM, dual stand-by)'] 
['Display', 'Type', 'IPS capacitive touchscreen, 16M colors'] 
['Display', 'Size', '5.5 inches (~71.7% screen-to-body ratio)'] 
['Display', 'Resolution', '1080 x 1920 pixels (~401 ppi pixel density)'] 
['Display', 'Multitouch', 'Yes, up to 5 fingers'] 
['Display', '\xa0', '- Lenovo Vibe 2.0'] 
['Platform', 'OS', 'Android OS, v5.0.2 (Lollipop), upgradable to v6.0 (Marshmallow)'] 
['Platform', 'Chipset', 'Mediatek MT6752'] 
['Platform', 'CPU', 'Octa-core 1.7 GHz Cortex-A53'] 
['Platform', 'GPU', 'Mali-T760MP2'] 
['Memory', 'Card slot', 'microSD, up to 32 GB (dedicated slot)'] 
['Memory', 'Internal', '16 GB, 2 GB RAM'] 
['Camera', 'Primary', '13 MP, f/2.0, autofocus, dual-LED flash, check quality'] 
['Camera', 'Features', 'Geo-tagging, touch focus, face detection, HDR, panorama'] 
['Camera', 'Video', '[email protected], check quality'] 
['Camera', 'Secondary', '5 MP, f/2.4'] 
['Sound', 'Alert types', 'Vibration; MP3, WAV ringtones'] 
['Sound', 'Loudspeaker ', 'Yes'] 
['Sound', '3.5mm jack ', 'Yes'] 
['Sound', '\xa0', '- Dolby Atmos'] 
['Comms', 'WLAN', 'Wi-Fi 802.11 b/g/n, hotspot'] 
['Comms', 'Bluetooth', 'v4.1, A2DP, LE'] 
['Comms', 'GPS', 'Yes, with A-GPS, GLONASS'] 
['Comms', 'Radio', 'FM radio'] 
['Comms', 'USB', 'microUSB v2.0, USB Host'] 
['Features', 'Sensors', 'Accelerometer, gyro, proximity, compass'] 
['Features', 'Messaging', 'SMS(threaded view), MMS, Email, Push Mail, IM'] 
['Features', 'Browser', 'HTML5'] 
['Features', 'Java', 'No'] 
['Features', '\xa0', '- Active noise cancellation with dedicated mic\r\n- MP4/H.264 player\r\n- MP3/WAV/eAAC+/FLAC player\r\n- Photo/video editor\r\n- Document viewer'] 
['Battery', '\xa0', 'Removable Li-Ion 3000 mAh battery'] 
['Battery', 'Stand-by', 'Up to 750 h (3G)'] 
['Battery', 'Talk time', 'Up to 36 h (3G)'] 
['Misc', 'Colors', 'Onyx Black, Pearl White, Laser Yellow'] 
['Misc', 'Price group', '3/10 (About 150 EUR)'] 
['Tests', 'Performance', '\nBasemark OS II: 1053/Basemark OS II 2.0: 984Basemark X: 5656'] 
['Tests', 'Display', '\nContrast ratio: 1793:1 (nominal)'] 
['Tests', 'Camera', '\nPhoto/Video'] 
['Tests', 'Loudspeaker', '\nVoice 65dB/Noise 66dB/Ring 76dB\n'] 
['Tests', 'Battery life', '\n\nEndurance rating 53h\n\n'] 
['Tests'] 
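
The rows above still carry non-breaking spaces ('\xa0'), trailing spaces, stray newlines, and one row with no cells at all (['Tests']). A minimal sketch of one way to tidy them is shown here; clean_rows is a hypothetical helper, not part of the code above or of BeautifulSoup, and it assumes each row has the shape [header, label, value].

```python
def clean_rows(rows):
    """Tidy rows shaped like [header, label, value] (assumed shape)."""
    cleaned = []
    for row in rows:
        if len(row) < 3:
            # e.g. ['Tests']: a <tr> that had no <td> cells
            continue
        # str.split() with no arguments treats '\xa0', '\r' and '\n' as
        # whitespace, so this collapses them and trims both ends.
        cleaned.append([" ".join(cell.split()) for cell in row[:3]])
    return cleaned


sample = [
    ['Network', '\xa0', 'HSDPA 2100 - India'],
    ['Sound', 'Loudspeaker ', 'Yes'],
    ['Tests', 'Battery life', '\n\nEndurance rating 53h\n\n'],
    ['Tests'],
]
result = clean_rows(sample)
for row in result:
    print(row)
```

A label that consisted only of '\xa0' becomes an empty string, which keeps the row while making the missing label explicit.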

Next time, please post code that I can run (like my example).

+0

Thanks! Sorry the imports were missing. This works.

2

This is a problem with the HTML parser. I prefer to use html5lib, but it is slower, so if speed matters, one of the C-based parsers may be a better choice (read more here).
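
The speed trade-off between parsers can be handled with a small fallback, sketched here. pick_parser and its prefer_speed flag are hypothetical names introduced for illustration; the sketch assumes that "lxml" is the C-based parser meant above and only checks which parser packages are installed.

```python
from importlib.util import find_spec


def pick_parser(prefer_speed=False):
    """Return a parser name to pass as BeautifulSoup's second argument."""
    # lxml is C-based and fast; html5lib is pure Python, slower, and lenient.
    order = ["lxml", "html5lib"] if prefer_speed else ["html5lib", "lxml"]
    for name in order:
        # find_spec returns None when the package is not installed
        if find_spec(name) is not None:
            return name
    # Python's built-in parser needs no extra package
    return "html.parser"


print(pick_parser())
print(pick_parser(prefer_speed=True))
```

It would be used as BeautifulSoup(request.content, pick_parser()). Different parsers can build different trees from malformed HTML, so the result should be checked against the page whenever the parser changes.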

I just changed sourceCode = BeautifulSoup(request.content,"html.parser") to sourceCode = BeautifulSoup(request.content,"html5lib"), and it was good to go (full updated code below).

Also, I'm not sure whether you noticed, but by using the line tag = table.find('tr') you only get the first row of each table. If you want the whole table, use print(table.get_text()) in the for loop.

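from bs4 import BeautifulSoup
import requests
import html5lib  # not used directly; importing it fails early if html5lib is not installed


def getPhoneStats(url):
    try:
        request = requests.get(url)
        if request.status_code == 200:
            # html5lib parses the page the way a browser does, repairing bad markup
            sourceCode = BeautifulSoup(request.content, 'html5lib')
            tables = sourceCode.select('#specs-list table')
            for table in tables:
                #tag = table.find('tr')
                #print(tag.get_text())
                print(table.get_text())
        else:
            print('no table or row found ')
    except requests.RequestException as e:
        # requests.get() raises connection errors, not HTTPError, so catch the base class
        print('Unable to open url', e)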
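
The difference between find and find_all mentioned in this answer can be seen without any network access. The sketch below runs on a small hand-written fragment that only imitates the layout of the specs list; it is not the real page markup, and the names HTML, first_rows and all_rows are made up for the example.

```python
from bs4 import BeautifulSoup

# Hand-written, well-formed sample; NOT the actual page source.
HTML = """
<div id="specs-list">
  <table>
    <tr><th>Launch</th><td>Announced</td><td>2015, March</td></tr>
    <tr><td>Status</td><td>Available. Released 2015, March</td></tr>
  </table>
  <table>
    <tr><th>Memory</th><td>Card slot</td><td>microSD, up to 32 GB (dedicated slot)</td></tr>
    <tr><td>Internal</td><td>16 GB, 2 GB RAM</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
tables = soup.select("#specs-list table")

# find() returns only the first matching <tr> of each table
first_rows = [t.find("tr").get_text(" ", strip=True) for t in tables]

# find_all() returns every <tr>, so no rows are skipped
all_rows = [r.get_text(" ", strip=True) for t in tables for r in t.find_all("tr")]

print(first_rows)
print(all_rows)
```

With two tables of two rows each, first_rows has two entries and all_rows has four, which matches the one-row-per-table output shown in the question.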