Extract a row from an HTML file without the HTML tags

I need to extract the row that contains a specific string, but my code below returns the row together with its HTML tags.

from BeautifulSoup import BeautifulSoup 
import re 
import os 
import codecs 
import sys 


get_company = "ABB LTD" 


OUTFILE = os.path.join('company', 'a', 'viewids') 

soup = BeautifulSoup(open("/company/a/searches/a")) 
rows = soup.findAll("table",{"id":"cos"})[0].findAll('tr') 
userrows = [t for t in rows if t.findAll(text=re.compile(get_company))] 
print userrows

This is my table format:

<table id="cos" width="500" cellpadding="3" cellspacing="0" border="1"> 
    <tr> 
    <th>Company Name</th> 
    <th>CIK Number</th> 
    <th>SIC Code</th> 
    </tr> 
    <tr valign="top"> 
    <td>A CONSULTING TEAM INC</td> 
    <td align="right">1040792</td> 
    <td align="right">7380</td> 
    </tr> 
    <tr valign="top"> 
    <td>A J&amp;J PHARMA CORP</td> 
    <td align="right">1140452</td> 
    <td align="right">9995</td> 
    </tr> 
</table>

So what should I do if I need the CIK number for A J&J PHARMA CORP? Right now it gives me this output:

[<tr valign="top"> 
    <td>A J&amp;J PHARMA CORP</td> 
    <td align="right">1140452</td> 
    <td align="right">9995</td> 
    </tr>]

Asked 2013-12-14 by blackmamba

import re 
from BeautifulSoup import BeautifulSoup 

html= ''' 
<table id="cos" width="500" cellpadding="3" cellspacing="0" border="1"> 
    <tr> 
    <th>Company Name</th> 
    <th>CIK Number</th> 
    <th>SIC Code</th> 
    </tr> 
    <tr valign="top"> 
    <td>A CONSULTING TEAM INC</td> 
    <td align="right">1040792</td> 
    <td align="right">7380</td> 
    </tr> 
    <tr valign="top"> 
    <td>A J&amp;J PHARMA CORP</td> 
    <td align="right">1140452</td> 
    <td align="right">9995</td> 
    </tr> 
</table> 
''' 

soup = BeautifulSoup(html) 
table = soup.find("table", {"id":"cos"}) 
td = table.find('td', text='A J&amp;J PHARMA CORP') 
# ^ This returns the matching text node, not the <td> element.
print(td.parent.parent.findAll('td')[1].string)

This prints the CIK number from the matching row, 1140452.

Answered 2013-12-14 06:41:43 by falsetru

Thank you so much. It works. What if I want to search for J&J PHARMA CORP instead of A J&amp;J PHARMA CORP? – blackmamba

@user3092632, yes, you can use [`cgi.escape`](http://docs.python.org/2/library/cgi.html#cgi.escape): `td = table.find('td', text=cgi.escape('J&J PHARMA CORP'))` – falsetru
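
For anyone reading this on Python 3: `cgi.escape` was removed in Python 3.8, and `html.escape` is the standard-library replacement. A minimal sketch of the same escaping step, assuming the full company name from the question's table as the search string (`quote=False` leaves quotes alone, which was the default behaviour of `cgi.escape`):

```python
from html import escape

# Assumed search string: the company name as a person would type it.
company = "A J&J PHARMA CORP"

# Escape &, < and > so the string matches the entity-encoded text
# that the answer's code searches for in the parsed table.
escaped = escape(company, quote=False)
print(escaped)
```

The escaped value can then be passed as the `text=` argument in place of the hand-written `'A J&amp;J PHARMA CORP'` literal.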

@user3092632, if you use bs4, you don't need to escape it yourself. – falsetru
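
To illustrate the bs4 remark, here is a rough sketch rather than a definitive version. It assumes the `beautifulsoup4` package is installed and uses the built-in `html.parser` backend. The regular expression is my own addition so that a partial company name matches, and the variable names are illustrative only.

```python
import re
from bs4 import BeautifulSoup

html = '''
<table id="cos" width="500" cellpadding="3" cellspacing="0" border="1">
    <tr><th>Company Name</th><th>CIK Number</th><th>SIC Code</th></tr>
    <tr valign="top">
    <td>A CONSULTING TEAM INC</td>
    <td align="right">1040792</td>
    <td align="right">7380</td>
    </tr>
    <tr valign="top">
    <td>A J&amp;J PHARMA CORP</td>
    <td align="right">1140452</td>
    <td align="right">9995</td>
    </tr>
</table>
'''

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "cos"})

# bs4 decodes &amp; to & while parsing, so the pattern uses a plain ampersand.
# With a tag name plus string=, find() returns the <td> element itself.
cell = table.find("td", string=re.compile(r"J&J PHARMA CORP"))

# The cells of the matching row are: company name, CIK number, SIC code.
cells = cell.parent.find_all("td")
cik = cells[1].get_text(strip=True)
print(cik)
```

Because `get_text()` returns only the text content, the result carries no HTML tags, which is what the question asked for.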
