2013-12-14 52 views
1

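Extract a row from an HTML file without the HTML tags

I need to extract the rows that contain a specific string, but my code below returns them together with the HTML markup.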

from BeautifulSoup import BeautifulSoup 
import re 
import os 
import codecs 
import sys 


get_company = "ABB LTD" 


OUTFILE = os.path.join('company', 'a', 'viewids') 

soup = BeautifulSoup(open("/company/a/searches/a")) 
rows = soup.findAll("table",{"id":"cos"})[0].findAll('tr') 
userrows = [t for t in rows if t.findAll(text=re.compile(get_company))] 
print userrows 

This is my table format:

<table id="cos" width="500" cellpadding="3" cellspacing="0" border="1"> 
    <tr> 
    <th>Company Name</th> 
    <th>CIK Number</th> 
    <th>SIC Code</th> 
    </tr> 
    <tr valign="top"> 
    <td>A CONSULTING TEAM INC</td> 
    <td align="right">1040792</td> 
    <td align="right">7380</td> 
    </tr> 
    <tr valign="top"> 
    <td>A J&amp;J PHARMA CORP</td> 
    <td align="right">1140452</td> 
    <td align="right">9995</td> 
    </tr> 
</table> 

So what should I do if I need the CIK number for A J&J PHARMA CORP? Right now it gives me this output:

[<tr valign="top"> 
    <td>A J&amp;J PHARMA CORP</td> 
    <td align="right">1140452</td> 
    <td align="right">9995</td> 
    </tr>] 

Answers

2
import re 
from BeautifulSoup import BeautifulSoup 

html= ''' 
<table id="cos" width="500" cellpadding="3" cellspacing="0" border="1"> 
    <tr> 
    <th>Company Name</th> 
    <th>CIK Number</th> 
    <th>SIC Code</th> 
    </tr> 
    <tr valign="top"> 
    <td>A CONSULTING TEAM INC</td> 
    <td align="right">1040792</td> 
    <td align="right">7380</td> 
    </tr> 
    <tr valign="top"> 
    <td>A J&amp;J PHARMA CORP</td> 
    <td align="right">1140452</td> 
    <td align="right">9995</td> 
    </tr> 
</table> 
''' 

soup = BeautifulSoup(html) 
table = soup.find("table", {"id":"cos"}) 
td = table.find('td', text='A J&amp;J PHARMA CORP') 
# ^ This returns the matching text node, not the td element, hence the two .parent hops below.
print(td.parent.parent.findAll('td')[1].string) 

This prints:

1140452 
+0

Thank you so much, it works. What if I want to search for J&J PHARMA CORP instead of A J&amp;J PHARMA CORP? – blackmamba

+1

@user3092632, yes, you can use [`cgi.escape`](http://docs.python.org/2/library/cgi.html#cgi.escape): `td = table.find('td', text=cgi.escape('J&J PHARMA CORP'))` – falsetru
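
For readers on Python 3: `cgi.escape` was removed in Python 3.8, and `html.escape` is the standard-library replacement. The sketch below shows the escaping step on its own. The helper name `escape_company` is made up for illustration, and `quote=False` is passed on the assumption that you want the same behaviour as the default of `cgi.escape`, which leaves quote characters alone.

```python
import html


def escape_company(name):
    """Turn a plain company name into the entity-escaped form found in the raw HTML.

    escape_company is a hypothetical helper, not part of any library.
    """
    # quote=False escapes only &, < and >, like cgi.escape's default.
    return html.escape(name, quote=False)


print(escape_company("A J&J PHARMA CORP"))
```

The escaped string can then be passed as the `text` argument in the answer's `table.find(...)` call.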

+0

@user3092632, if you use bs4, you don't need to escape it yourself. – falsetru
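
A minimal sketch of the bs4 variant mentioned in the last comment, assuming the `beautifulsoup4` package is installed and using the built-in `html.parser` backend. The names `find_cik` and `HTML_DOC` are made up for illustration. Because bs4 decodes `&amp;` while parsing, the search string is the plain company name. Note also that in bs4, `find('td', string=...)` returns the `td` tag itself rather than the text node, so a single `.parent` reaches the row.

```python
from bs4 import BeautifulSoup

# Same table as in the question, trimmed to the rows that matter.
HTML_DOC = '''
<table id="cos" width="500" cellpadding="3" cellspacing="0" border="1">
    <tr>
    <th>Company Name</th>
    <th>CIK Number</th>
    <th>SIC Code</th>
    </tr>
    <tr valign="top">
    <td>A CONSULTING TEAM INC</td>
    <td align="right">1040792</td>
    <td align="right">7380</td>
    </tr>
    <tr valign="top">
    <td>A J&amp;J PHARMA CORP</td>
    <td align="right">1140452</td>
    <td align="right">9995</td>
    </tr>
</table>
'''


def find_cik(html_doc, company):
    """Return the CIK number for an exact company name, or None if not found.

    find_cik is a hypothetical helper, not part of bs4.
    """
    soup = BeautifulSoup(html_doc, "html.parser")
    table = soup.find("table", {"id": "cos"})
    # Entities are already decoded, so search with the unescaped name.
    cell = table.find("td", string=company)
    if cell is None:
        return None
    # cell is the <td>; its parent is the <tr>, whose second cell holds the CIK.
    return cell.parent.find_all("td")[1].get_text(strip=True)


print(find_cik(HTML_DOC, "A J&J PHARMA CORP"))
```

Returning `None` for a missing company avoids the `AttributeError` that the chained `.parent` lookups would otherwise raise when nothing matches.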