Python parsing help


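Can someone help me with parsing? I'm having a lot of trouble. I am parsing the information from this site.

Here are a few lines of code that extract the data from a table with 2 headings and 4 values: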

for x in soup.findAll(attrs={'valign':'top'}):
    print(x.contents)
    make_list = x.contents
    print(make_list[1])  # trying to select one of the values on the list.

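When I try to print with the make_list[1] line, I get an error. However, if I take out the last 2 lines, I get the HTML I want in list format, but I can't seem to separate out individual items or filter them (strip out the HTML tags). Can anyone help?

Here is a sample of the output, to be specific about what I mean. I don't know the right regular expression: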

['\n', <td align="left">Western Mutual/Residence <a href="http://interactive.web.insurance.ca.gov/companyprofile/companyprofile?event=companyProfile&amp;doFunction=getCompanyProfile&amp;eid=3303"><small>(Info)</small></a></td>, '\n', <td align="left"><div align="right">           355</div></td>, '\n', <td align="left"><div align="right">250</div></td>, '\n', <td align="left"> </td>, '\n', <td align="left">Western Mutual/Residence <a href="http://interactive.web.insurance.ca.gov/companyprofile/companyprofile?event=companyProfile&amp;doFunction=getCompanyProfile&amp;eid=3303"><small>(Info)</small></a></td>, '\n', <td align="left"><div align="right">           320</div></td>, '\n', <td align="left"><div align="right">500</div></td>, '\n']

Source

2015-09-11 Kenny Truong

What is the expected output? – The6thSense

"It gets an error." What is the error? – Kevin

@Kevin IndexError: list index out of range –

If you are trying to parse the results from that site, the following should work:

from bs4 import BeautifulSoup

html_doc = "....add your html...."  # placeholder: put the page's HTML here
soup = BeautifulSoup(html_doc, 'html.parser')
rows = []
tables = soup.find_all('table')
t1 = t2 = None

# Find the second from last table
for t3 in tables:
    t1, t2 = t2, t3

# Collect the stripped text of every cell, row by row
for row in t1.find_all('tr'):
    cols = row.find_all(['td', 'th'])
    cols = [col.text.strip() for col in cols]
    rows.append(cols)

# Collate the two columns
data = [cols[0:3] for cols in rows]
data.extend([cols[4:7] for cols in rows[1:]])

for row in data:
    print("{:40} {:15} {}".format(row[0], row[1], row[2]))

This gave me output that looks like:

Company Name                             Annual Premium  Deductible
AAA (Interinsurance Exchange) (Info)     N/A             250
Allstate (Info)                          315             250
American Modern (Info)                   N/A             250
Amica Mutual (Info)                      259             250
Bankers Standard (Info)                  N/A             250
California Capital (Info)                160             250
Century National (Info)                  N/A             250
.....

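How does it work?

The page mainly displays a table, so that is what we need to find. The first step is to get the list of tables.

The site uses tables for several parts of the page. The structure of the page will probably at least stay the same between requests.

The table we need is nearly the last one on the page (but not the very last), so I decided to iterate over the available tables and pick the second from last. t1, t2 and t3 are just a workaround to keep hold of the previous values during the iteration.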
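The rolling `t1, t2 = t2, t3` swap can be checked on an ordinary list. This is a minimal sketch, not part of the original answer: the function name `second_from_last` and the sample strings are made up for illustration. It also shows that plain indexing with `[-2]` picks the same item.

```python
def second_from_last(items):
    """Return the second-from-last item using the rolling-variable loop."""
    prev = current = None
    for item in items:
        # Shift along: the old current becomes prev, item becomes current
        prev, current = current, item
    return prev


# Stand-ins for the Tag objects that soup.find_all('table') would return
tables = ["layout table", "header table", "results table", "footer table"]

picked = second_from_last(tables)
print(picked)                # results table
print(picked == tables[-2])  # True
```

One difference worth knowing: with fewer than two items the loop quietly returns `None`, while `tables[-2]` raises an `IndexError`.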

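From here, HTML tables usually have a fairly standard structure of TR and TD elements. This one also uses TH for the header row. With this table in hand, BeautifulSoup then lets us enumerate all of the rows.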
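As a rough sketch of the row and cell enumeration, here is the same `find_all` pattern run on a small hand-written table. The HTML string is invented for illustration (it borrows one company row from the question's sample output) and is not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature table; the real page has more rows and columns
html_doc = """
<table>
  <tr><th>Company Name</th><th>Annual Premium</th><th>Deductible</th></tr>
  <tr valign="top">
    <td align="left">Western Mutual/Residence <a href="#"><small>(Info)</small></a></td>
    <td align="left"><div align="right">           355</div></td>
    <td align="left"><div align="right">250</div></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")
table = soup.find("table")

rows = []
for row in table.find_all("tr"):
    # Passing a list matches either tag, so header and data cells both count
    cells = row.find_all(["td", "th"])
    # get_text() drops the markup; strip() removes the padding whitespace
    rows.append([cell.get_text().strip() for cell in cells])

for cols in rows:
    print(cols)
```

Because `get_text()` already returns the text without tags, no regular expression is needed to clean up the cells.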

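With each row, we can then find all of the columns. If you print what is returned, you will see all of the entries for each row, and from that you can work out which indexes are needed to slice it.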
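To see where the slice numbers come from, it can help to print each cell next to its index. A small sketch, using a row typed in by hand from the sample output in the question rather than one fetched from the site:

```python
# One data row as it would look after col.text.strip() on each cell
# (values copied from the question's sample output)
cols = [
    "Western Mutual/Residence (Info)", "355", "250",
    "",  # blank spacer cell between the two column groups
    "Western Mutual/Residence (Info)", "320", "500",
]

for index, value in enumerate(cols):
    print(index, repr(value))

left = cols[0:3]   # indexes 0, 1, 2
right = cols[4:7]  # indexes 4, 5, 6 (index 3 is the spacer)
print(left)
print(right)
```

A slice stops just before its end index, which is why the two groups are written as `0:3` and `4:7`.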

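They have displayed the output in two column groups with a blank column in between. I build two lists to extract the two sets of columns and then append the second group to the bottom of the first for display.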
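The collation step and the format widths can be tried on a couple of hand-made rows. This is only a sketch: the rows below are invented placeholders (including the assumption that the header row repeats for both groups), and `{:40}` and `{:15}` simply pad the first two fields to 40 and 15 characters so that the columns line up.

```python
# Each row holds two column groups separated by an empty spacer cell.
# Placeholder values, not real data from the site.
rows = [
    ["Company Name", "Annual Premium", "Deductible", "",
     "Company Name", "Annual Premium", "Deductible"],
    ["Company A", "100", "250", "", "Company A", "90", "500"],
    ["Company B", "N/A", "250", "", "Company B", "120", "500"],
]

# Left group from every row, header included
data = [cols[0:3] for cols in rows]
# Right group from the data rows only, so the header is not repeated
data.extend(cols[4:7] for cols in rows[1:])

lines = ["{:40} {:15} {}".format(*row) for row in data]
for line in lines:
    print(line)
```

Skipping the first row with `rows[1:]` for the right-hand group keeps the header from appearing a second time in the middle of the combined list.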

Source

2015-09-11 12:33:41

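OMG thank you, I will try this... it is completely different from what I had in mind... but I don't understand how you found t1, t2, t3 in the web page? How did you do that? How do I find those so I can work it out for future tables? Thank you, I will try this and let you know how it works :) –

How did you know to look specifically for 'td' and 'th'? What I have been doing is right-clicking, inspecting the element, and trying to read and understand that code lol. –

Like, how did you get the numbers 0, 3, 4, 7, 40, 15? Haha, sorry to bother you with questions... but thank you as well! –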
