2014-02-25 128 views

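Parsing specific data from HTML

I have already extracted the second table. From that table, I need to extract only the rows that have a file name in column[0].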

<TABLE WIDTH="100%" BORDER="1" > 
<TR ><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="2" WIDTH="70%">Root</TD></TR> 
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Functions</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;10.1% (1077/10647)</TD></TR> 
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Functions and exits</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;&#160;9.5% (2142/22473)</TD></TR> 
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Statement blocks</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;&#160;9.1% (2191/24167)</TD></TR> 
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Decisions</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;&#160;8.8% (2648/29930)</TD></TR> 
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Loops</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;&#160;8.4% (305/3628)</TD></TR> 
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Basic conditions</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;&#160;8.3% (1759/21254)</TD></TR> 
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Modified conditions</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;&#160;1.8% (35/1997)</TD></TR> 
<TR ><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="70%">Multiple conditions</TD><TD BGCOLOR="#FFFFCC" ROWSPAN="1" COLSPAN="1" WIDTH="30%"> &#160;&#160;&#160;&#160;4.4% (137/3082)</TD></TR> 

</TABLE> 
</P> 
<P ALIGN="LEFT"><BR> 
2 - Files list</P> 
<BR> 
Display absolute values only.<BR> 

<TABLE WIDTH="100%" BORDER="1" > 
<TR BGCOLOR="#FFFF99"><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><b>Item<IMG SRC="cvi_sort_d.png" ALT="cvi_sort_d.xpm"></b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Functions</b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Functions and exits</b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Statement blocks</b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Decisions</b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Loops</b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Basic conditions</b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Modified conditions</b></TD><TD BGCOLOR="#FFFF99" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><b>Multiple conditions</b></TD></TR> 
<TR ><TD BGCOLOR="#FF9999" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><B><A NAME="175746848"></A><a href="LOADER.H.html">LOADER.H</a></B></TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/1</P> 
</TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/2</P> 
</TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/1</P> 
</TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/1</P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD></TR> 
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175746912"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoaderState_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD></TR> 
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175746976"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoadParameters_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD></TR> 
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175747104"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoadOffsets_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD></TR> 
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175747168"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoadAppComponent_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD></TR> 
<TR ><TD BGCOLOR="#FF9999" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><B><A NAME="175746848"></A><a href="CORBA_FIXED.CC.html">CORBA_FIXED.CC</a></B></TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/1</P> 
</TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/2</P> 
</TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/1</P> 
</TD><TD BGCOLOR="#FFDFDD" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">0/1</P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD></TR> 
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175746912"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoaderState_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD></TR> 
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175746976"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoadParameters_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD></TR> 
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175747104"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoadOffsets_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD></TR> 
<TR ><TD BGCOLOR="#9999FF" ROWSPAN="1" COLSPAN="1" WIDTH="27%"><A NAME="175747168"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoadAppComponent_struct</a></TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD><TD BGCOLOR="#CCCCFF" ROWSPAN="1" COLSPAN="1" WIDTH="9%"><P ALIGN="RIGHT">none </P> 
</TD></TR> 
</TABLE> 

To parse this, I wrote the following Python script:

from bs4 import BeautifulSoup
f = open("/home/vignesh/Downloads/html/RateDoc.html", "r")
fl = {'LOADER.H', 'CORBA_FIXED.H'}
soup = BeautifulSoup(f)
t = soup.findAll('table')
for table in t[1:]:
    rows = table.findAll('tr')
    for tr in rows[1:]:
        cols = tr.findAll('td')
        for td in cols:
            text = ''.join((td.find(text=True)).encode('utf-8'))
            print text + "\t",
        print
    print


The above script extracts the data as follows:


LOADER.H 0/1 0/2 0/1 0/1 none none none none  
    none none none none none none none none  
    none none none none none none none none  
        none none none none none none none none  
    none none none none none none none none  
CORBA_FIXED.CC 0/1 0/2 0/1 0/1 none none none none  
    none none none none none none none none  
    none none none none none none none none  
    none none none none none none none none  
    none none none none none none none none 

But the expected result is shown below: I want to extract only the files with the extensions *.cc and *.h.

Required output:

LOADER.H 0/1 0/2 0/1 0/1 none none none none  
CORBA_FIXED.CC 0/1 0/2 0/1 0/1 none none none none  

Could someone help me modify the script above so that it extracts only the entries with the specific extensions *.cc and *.h?

Answers

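from bs4 import BeautifulSoup

INPUT = "/home/vignesh/Downloads/html/RateDoc.html"
EXTENSIONS = {"h", "cc"}

def main():
    with open(INPUT, "rb") as inf:
        # Name the parser explicitly so the result does not depend on what is installed.
        soup = BeautifulSoup(inf, "html.parser")

    for row in soup.find_all("tr"):
        first_col = row.find("td")
        if first_col is None:
            continue
        # File rows and member rows both hold two anchors: <A NAME=...> and <a href=...>.
        links = first_col.find_all("a")
        if len(links) == 2:
            link_text = links[1].text
            # Keep the row only if the link text ends in .h or .cc (case-insensitive).
            parts = link_text.rsplit(".", 1)
            if len(parts) > 1 and parts[-1].lower() in EXTENSIONS:
                print("\t".join(cell.text.strip() for cell in row.find_all("td")))

if __name__ == "__main__":
    main()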

This produces:

LOADER.H 0/1 0/2 0/1 0/1 none none none none 
CORBA_FIXED.CC 0/1 0/2 0/1 0/1 none none none none 
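
For readers who cannot install BeautifulSoup, the same row filter can be sketched with only the standard library. This is an illustration under assumptions, not the answer's method: it swaps in `html.parser.HTMLParser` and `os.path.splitext`, it assumes Python 3, and the names `RowCollector`, `file_rows`, and `SAMPLE` are hypothetical. `SAMPLE` is a two-column excerpt modelled on the HTML in the question. The sketch scans every `<tr>` in the input rather than only the second table.

```python
import os.path
from html.parser import HTMLParser

WANTED = {".h", ".cc"}  # extensions named in the question


class RowCollector(HTMLParser):
    """Collect each <tr> as a list of cell strings (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read, or None outside <tr>
        self._cell = None   # text fragments of the cell being read, or None outside <td>

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lower case, so <TR> arrives as "tr".
        if tag == "tr":
            self._row = []
            self._cell = None
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # In Python 3, strip() also removes the &#160; padding of member rows.
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def file_rows(html_text, wanted=WANTED):
    """Return the rows whose first cell ends in one of the wanted extensions."""
    parser = RowCollector()
    parser.feed(html_text)
    parser.close()
    return [
        row for row in parser.rows
        if row and os.path.splitext(row[0])[1].lower() in wanted
    ]


# Assumed sample: a trimmed, two-column version of the table in the question.
SAMPLE = """
<TABLE>
<TR><TD><b>Item</b></TD><TD><b>Functions</b></TD></TR>
<TR><TD><B><A NAME="1"></A><a href="LOADER.H.html">LOADER.H</a></B></TD><TD><P ALIGN="RIGHT">0/1</P>
</TD></TR>
<TR><TD><A NAME="2"></A> &#160;&#160;&#160;<a href="LOADER.H.html">LoaderState_struct</a></TD><TD><P ALIGN="RIGHT">none </P>
</TD></TR>
<TR><TD><B><A NAME="3"></A><a href="CORBA_FIXED.CC.html">CORBA_FIXED.CC</a></B></TD><TD><P ALIGN="RIGHT">0/1</P>
</TD></TR>
</TABLE>
"""

for row in file_rows(SAMPLE):
    print("\t".join(row))
```

Testing the extension on the cell text, rather than counting anchors, keeps the filter independent of how many `<a>` tags the report generator emits in the first column.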

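It appears that if you wrap the print in an if, it should work. This is based on the fact that, for the lines you want to skip, the first value printed seems to be a blank entry, followed by eight "none" values.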

if text == '':  # compare by value; "is" tests object identity
    break
else:
    print text + '\t',

This comes only from reading your code, as I can't test it at the moment.
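
The blank-first-cell idea can be sketched on already-extracted rows as follows. This is a hedged illustration that assumes Python 3 strings, where `str.strip()` also removes the non-breaking spaces (`&#160;`, U+00A0) that pad the first cell of member rows in the question's HTML. The names `is_blank`, `named_rows`, and `rows` are hypothetical, and the sample rows are made up from the output shown in the question.

```python
def is_blank(text):
    """True for None, "", or whitespace only; str.strip() also removes U+00A0."""
    return text is None or text.strip() == ""


def named_rows(rows):
    """Return tab-joined lines for rows whose first cell has visible text."""
    lines = []
    for cells in rows:
        if not cells or is_blank(cells[0]):
            continue  # skip the whole row, not just the blank cell
        lines.append("\t".join((cell or "").strip() for cell in cells))
    return lines


# Assumed sample: first text node of each cell, as the question's loop reads it.
rows = [
    ["LOADER.H", "0/1", "0/2", "0/1", "0/1", "none ", "none ", "none ", "none "],
    [" \xa0\xa0\xa0"] + ["none "] * 8,
    ["CORBA_FIXED.CC", "0/1", "0/2", "0/1", "0/1", "none ", "none ", "none ", "none "],
]

for line in named_rows(rows):
    print(line)
```

Using `continue` on the row loop, instead of `break` inside the cell loop, also avoids printing an empty line for each skipped row.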