2014-06-25

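Scraping a web page using BeautifulSoup in Python

I want to scrape data from a table using BeautifulSoup. The following problem occurs: I get [u'A Southern RV, Inc.1642 E New York AveDeland, FLPhone: (386) 734-5678Website: www.southernrvrentals.comEmail: [email protected]\xa0\n'] from HTML like this: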

<table id="ctl00_TemplateBody_WebPartManager1_gwpste_container_SearchForm_ciSearchForm_RTable" border="0"> 
          <tbody><tr style="background-color:#990000;"> 
           <th align="left" colspan="3" style="margin-top:5px;margin-bottom:5px;"><span id="ctl00_TemplateBody_WebPartManager1_gwpste_container_SearchForm_ciSearchForm_RSCount" style="color:White;">Your search results returned (85) records </span></th> 
          </tr><tr> 
           <td class="ml15" align="left" valign="top"><img src="./RVDealers-Florida_files/AfterMarket2.gif" alt="After Market Member Logo" border="0"> </td><td class="ml15" align="left" valign="top"><span style="font-weight:bold;">A Southern RV, Inc.</span><br>1642 E New York Ave<br>Deland, FL<br>Phone: (386) 734-5678<br>Website: <a href="http://www.southernrvrentals.com/" target="_blank">www.southernrvrentals.com</a><br>Email: <a href="mailto:[email protected]" target="_blank">[email protected]</a></td><td class="ml15" align="left" valign="top">&nbsp;</td> 
          </tr><tr> 
           <td colspan="3"><hr></td> 
          </tr><tr> 
           <td class="ml15" align="left" valign="top"><img src="./RVDealers-Florida_files/AfterMarket2.gif" alt="After Market Member Logo" border="0"> </td><td class="ml15" align="left" valign="top"><span style="font-weight:bold;">Alec's Truck Trailer &amp; RV</span><br>16960 S Dixie Hwy<br>Miami, FL<br>Phone: (305) 234-5444<br>Website: <a href="http://www.alecstruck.com/" target="_blank">www.alecstruck.com</a><br>Email: <a href="mailto:[email protected]" target="_blank">[email protected]</a></td><td class="ml15" align="left" valign="top">&nbsp;</td> 
          </tr><tr> 
           <td colspan="3"><hr></td> 
          </tr><tr> 
           <td class="ml15" align="left" valign="top"><img src="./RVDealers-Florida_files/RVRAMember2.gif" alt="RVRA Member Logo" border="0"><br> <img src="./RVDealers-Florida_files/GoRVDealer2.gif" alt="Go RV Dealer Logo" border="0"><br> </td><td class="ml15" align="left" valign="top"><span style="font-weight:bold;">All Star Coaches</span><br>131 NW 73rd Terraces, Bay 1117<br>Fort Lauderdale, FL<br>Phone: (866) 838-4465<br>Website: <a href="http://www.allstarcoaches.com/" target="_blank">www.allstarcoaches.com</a><br>Email: <a href="mailto:[email protected]" target="_blank">[email protected]</a></td><td class="ml15" align="left" valign="top">&nbsp;</td> 
          </tr><tr> 
           <td colspan="3"><hr></td> 
          </tr><tr> 
           <td class="ml15" align="left" valign="top"><img src="./RVDealers-Florida_files/RVDAMember2.gif" alt="RVDA Member Logo" border="0"><br> <img src="./RVDealers-Florida_files/GoRVDealer2.gif" alt="Go RV Dealer Logo" border="0"><br> </td><td class="ml15" align="left" valign="top"><span style="font-weight:bold;">Alliance Coach</span><br>4505 Monaco Way<br>Wildwood, FL<br>Phone: (866) 888-8941<br>Website: <a href="http://www.alliancecoachonline.com/" target="_blank">www.alliancecoachonline.com</a><br>Email: <a href="mailto:[email protected]" target="_blank">[email protected]</a></td><td class="ml15" align="left" valign="top"><table width="100%" border="0" cellpadding="0" cellspacing="5"><tbody><tr><td valign="top" width="75" align="left"><img src="./RVDealers-Florida_files/Cert_web.jpg" height="75" width="75" alt="Certified RV Technician" border="0"></td> <td valign="top" style="font-size:8px;font-weight:bold;" align="left" nowrap=""><img src="./RVDealers-Florida_files/RVLCenter_web.jpg" height="33" width="93" alt="RV Learning Center Certifications" border="0"><br>&nbsp;Certifications:<ul><li style="font-size:7px;">&nbsp;Service Writer/Advisor</li><li style="font-size:7px;">&nbsp;Parts Specialist</li><li style="font-size:7px;">&nbsp;Parts Manager</li><li style="font-size:7px;">&nbsp;Warranty Administrator</li></ul></td></tr></tbody></table></td> 
          </tr><tr> 
           <td colspan="3"><hr></td> 

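The problem is that when I scrape the data, each row's text is condensed into one long string, with no spaces or line breaks between the lines of the table cell. How can I fix this? I am using this code to extract the text from the table: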

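from mechanize import Browser  # assumed import; the original snippet omitted it
from bs4 import BeautifulSoup  # assumed import; the original snippet omitted it

def extract(soup):
    table = soup.find("table", attrs={'id': 'ctl00_TemplateBody_WebPartManager1_gwpste_container_SearchForm_ciSearchForm_RTable'})
    # print table
    data = []
    for row in table.findAll("tr"):
        # getText() joins the row's text nodes with no separator
        s = row.getText()
        data.append(s)
    return data

mech = Browser()
page = mech.open(BASE_URL_DIRECTORY)  # BASE_URL_DIRECTORY is defined elsewhere
html = page.read()
soup = BeautifulSoup(html)
data = extract(soup)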

Answers

You can use replace_with() to replace each br tag with a newline:

def extract(soup):
    table = soup.find("table", attrs={'id': 'ctl00_TemplateBody_WebPartManager1_gwpste_container_SearchForm_ciSearchForm_RTable'})
    for br in table.find_all('br'):
        br.replace_with('\n')
    return table.get_text().strip()

For the HTML input you provided, it prints:

A Southern RV, Inc. 

1642 E New York Ave 
Deland, FL 
Phone: (386) 734-5678 
Website: www.southernrvrentals.com 
Email: [email protected] 
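
A possible variation, not part of the answer above: in bs4, get_text() accepts a separator argument and a strip flag, so the line breaks can be restored without modifying the parsed tree. The sketch below assumes bs4 is installed and runs against a shortened, made-up copy of the table; the id "results" and the helper name extract_rows are illustrative only. It keeps one entry per dealer, as the question's loop over rows did.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the table in the question; the id "results" is illustrative.
SAMPLE_HTML = """
<table id="results">
  <tr><th colspan="3"><span>Your search results returned (2) records</span></th></tr>
  <tr>
    <td class="ml15"><img src="logo.gif" alt="Logo"></td>
    <td class="ml15"><span style="font-weight:bold;">A Southern RV, Inc.</span><br>1642 E New York Ave<br>Deland, FL<br>Phone: (386) 734-5678</td>
    <td class="ml15">&nbsp;</td>
  </tr>
  <tr><td colspan="3"><hr></td></tr>
  <tr>
    <td class="ml15"><img src="logo.gif" alt="Logo"></td>
    <td class="ml15"><span style="font-weight:bold;">Alliance Coach</span><br>4505 Monaco Way<br>Wildwood, FL<br>Phone: (866) 888-8941</td>
    <td class="ml15">&nbsp;</td>
  </tr>
</table>
"""


def extract_rows(html, table_id):
    """Return one newline-separated string per dealer row (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", attrs={"id": table_id})
    records = []
    for row in table.find_all("tr"):
        if row.find("th") is not None:
            continue  # skip the header row with the result count
        # separator="\n" goes between text nodes; strip=True trims each
        # piece and drops pieces that are only whitespace (including &nbsp;).
        text = row.get_text(separator="\n", strip=True)
        if text:  # rows that hold only <hr> yield an empty string
            records.append(text)
    return records
```

Calling extract_rows(SAMPLE_HTML, "results") gives a list with one string per dealer. Note that the separator is placed between every pair of text nodes, so text inside a link ends up on its own line rather than next to a preceding label such as "Website:".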
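I tried your solution, but it only produced the name (in this example, A Southern RV, Inc.). I have included a more comprehensive sample of the HTML I am working with; I would really appreciate it if you took a look. – Apollo

@Apollo I tried it on the example you provided, and it does display the results with newlines. Could you clarify what the problem is now? Thanks. – alecxe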
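
The thread does not settle why only the name came back. One thing that differs in the fuller sample is the nested certifications table in the third cell of some rows, so a narrower variant may be easier to debug. The sketch below is an assumption-laden illustration, not the answerer's code: it applies the same replace_with() idea, but only to the cell that holds the bold name span, and returns the lines of each dealer as a list. The sample markup, the id "results", and the helper name extract_dealers are made up for the example, and matching on the exact style string is brittle.

```python
from bs4 import BeautifulSoup

# Made-up, shortened row modelled on the "Alliance Coach" entry in the question.
SAMPLE_HTML = """<table id="results"><tbody>
<tr><td class="ml15"><img src="logo.gif" alt="Logo"><br> </td><td class="ml15"><span style="font-weight:bold;">Alliance Coach</span><br>4505 Monaco Way<br>Wildwood, FL<br>Phone: (866) 888-8941<br>Website: <a href="http://www.alliancecoachonline.com/">www.alliancecoachonline.com</a></td><td class="ml15"><table><tbody><tr><td>Certifications:<ul><li>Parts Manager</li></ul></td></tr></tbody></table></td></tr>
<tr><td colspan="3"><hr></td></tr>
</tbody></table>"""


def extract_dealers(html, table_id):
    """Return a list of line lists, one per dealer cell (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", attrs={"id": table_id})
    dealers = []
    # The dealer name is the only span with this exact style in the sample.
    for span in table.find_all("span", attrs={"style": "font-weight:bold;"}):
        cell = span.find_parent("td")
        # Same technique as the answer: turn each <br> into a newline.
        for br in cell.find_all("br"):
            br.replace_with("\n")
        lines = [line.strip() for line in cell.get_text().split("\n") if line.strip()]
        dealers.append(lines)
    return dealers
```

Because only the name cell is read, the logo cell and the nested certifications table do not contribute text, and a link's text stays on the same line as its "Website:" label.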