2016-11-17
<h3 id="LABandServerNamingConvention-:"><a href="/display/ES/Lab+Org+Code+Summary+Listing">Lab Org Code Summary Listing</a>:</h3> 
<div class="sectionColumnWrapper"> 
    <div class="sectionMacro"> 
    <div class="sectionMacroRow"> 
     <div class="columnMacro"> 
     <div class="table-wrap"> 
      <table class="confluenceTable"> 
      <tbody> 
       <tr> 
       <th class="confluenceTh"> 
        <p>Prefix</p> 
       </th> 
       <th class="confluenceTh"> 
        <p>Group</p> 
       </th> 
       <th class="confluenceTh"> 
        <p>Contact</p> 
       </th> 
       <th class="confluenceTh"> 
        <p>Dev/Test Lab</p> 
       </th> 
       <th class="confluenceTh"> 
        <p>Performance</p> 
       </th> 
       </tr> 
       <tr> 
       <td class="confluenceTd"> 
        <p>SEE00</p> 
       </td> 
       <td class="confluenceTd"> 
        <p>Entertainment</p> 
       </td> 
<tr><td class="confluenceTd"><p>SEF00</p></td><td class="confluenceTd"><p>APTRA Vision</p></td><td class="confluenceTd"><p> </p></td><td class="confluenceTd"><p><a href="/pages/viewpage.action?pageId=83909590">VCD Lab</a> , <a href="/display/ES/SEF00+%28+Aptra+Vision%29+-+Virtual+Lab+Details">Test Lab</a></p></td> 

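Using BeautifulSoup to import data from specific columns

I have a table with 5 columns, 2 of which are filled in for this particular entry. How can I get the row data from this HTML snippet into my Python code? I am using BeautifulSoup. Here is what I have tried so far: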

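import requests
from bs4 import BeautifulSoup

# url, username and password are defined elsewhere
data = requests.get(url, auth=(username, password))
sample = data.content
soup = BeautifulSoup(sample, 'html.parser')
article_text = ' '
article = soup.find_all('td', {'class': 'confluenceTd'})
for element in article:
    # append the text of each cell on its own line
    article_text += '\n' + ''.join(element.find_all(text=True))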

I want to somehow get 'SEE00' and 'Entertainment'.

Answer
from bs4 import BeautifulSoup 
doc = '''<h3 id="LABandServerNamingConvention-:"><a href="/display/ES/Lab+Org+Code+Summary+Listing">Lab Org Code Summary Listing</a>:</h3> 
<div class="sectionColumnWrapper"><div class="sectionMacro"><div class="sectionMacroRow"><div class="columnMacro"><div class="table-wrap"><table class="confluenceTable"><tbody><tr><th class="confluenceTh"><p>Prefix</p></th><th class="confluenceTh"><p>Group</p></th><th class="confluenceTh"><p>Contact</p></th><th class="confluenceTh"><p>Dev/Test Lab</p></th><th class="confluenceTh"><p>Performance</p></th></tr><tr><td class="confluenceTd"><p>SEE00</p></td><td class="confluenceTd"><p>Entertainment</p></td> 
''' 
soup = BeautifulSoup(doc, 'lxml') 

for row in soup.find_all('tr'): 
    print(row.get_text(separator='\t'))  # the separator is only for formatting; use whatever you like

Out:

Prefix Group Contact Dev/Test Lab Performance 
SEE00 Entertainment 

You can control which rows the loop visits with slicing:

for row in soup.find_all('tr')[1:]:
    print(row.get_text(separator='\t'))

This prints only:

SEE00 Entertainment 
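
Building on the slicing idea, here is a minimal sketch of how the header row and the sliced data rows could be combined into dictionaries. The `html` string is a hypothetical, well-formed stand-in for the table in the question, and `headers` and `records` are illustrative names only. It uses `html.parser` so that nothing beyond `bs4` is needed; `lxml` should work the same way on this input.

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed stand-in for the Confluence table in the question.
html = """
<table class="confluenceTable"><tbody>
<tr><th class="confluenceTh"><p>Prefix</p></th><th class="confluenceTh"><p>Group</p></th></tr>
<tr><td class="confluenceTd"><p>SEE00</p></td><td class="confluenceTd"><p>Entertainment</p></td></tr>
<tr><td class="confluenceTd"><p>SEF00</p></td><td class="confluenceTd"><p>APTRA Vision</p></td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

# The first row holds the <th> header cells; the slice [1:] holds the data rows.
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]

records = []
for row in rows[1:]:
    values = [td.get_text(strip=True) for td in row.find_all("td")]
    # zip stops at the shorter list, so rows with fewer cells than headers still work.
    records.append(dict(zip(headers, values)))

print(records)
```

With this, a single value can be looked up by column name, for example `records[0]['Group']`.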

Update:

In:

for row in soup.find_all('tr'): 
    row_data = row.get_text(strip=True, separator='|').split('|')[:2] 
    print(row_data) 

Out:

['Prefix', 'Group'] 
['SEE00', 'Entertainment'] 
['SEF00', 'APTRA Vision'] 
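
Splitting the row text on '|' works for the first two columns here, but it can shift columns when a cell is empty or contains several text fragments. As a hedged alternative, the sketch below reads each cell separately so that empty cells keep their position. The `html` string is a hypothetical, simplified version of the SEF00 row, and `row_cells` is an illustrative helper, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified version of the SEF00 row, including an empty cell
# and a cell that contains two links.
html = """
<table><tbody>
<tr><th><p>Prefix</p></th><th><p>Group</p></th><th><p>Contact</p></th><th><p>Dev/Test Lab</p></th></tr>
<tr><td><p>SEF00</p></td><td><p>APTRA Vision</p></td><td><p> </p></td><td><p><a href="#">VCD Lab</a> , <a href="#">Test Lab</a></p></td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")


def row_cells(row):
    """Return one string per <th>/<td> cell; an empty cell becomes ''."""
    return [cell.get_text(separator=" ", strip=True)
            for cell in row.find_all(["th", "td"])]


table = [row_cells(row) for row in soup.find_all("tr")]

for cells in table:
    print(cells[:2])  # first two columns only, as in the update above
```

Because every cell yields exactly one list entry, `table[1][3]` still refers to the Dev/Test Lab column even though the Contact cell is empty.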
Comments

OK, but will this work if my web page changes dynamically?

BeautifulSoup + requests cannot handle JavaScript.

Thanks. This solved my problem.