Python regex to find lines containing file names of specific types

I have a text file. I want to get the lines that contain a file name, but only when the file name is of type .doc or .pdf, using Python's re.findall().

For example,

<TR><TD ALIGN="RIGHT">4.</TD> 
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="ABC.pdf"> On Complex Analytic Manifolds</a></TD> 
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=72>L. Sam</TD> 
</TR> 
<TR><TD ALIGN="RIGHT">5.</TD> 
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="DEF.doc"> On the Geometric theory of Fields</a>*</TD> 
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=72>G.K. Ram</TD> 
</TR>

I want the following lines.

<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="ABC.pdf"> On Complex Analytic Manifolds</a></TD> 
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="DEF.doc"> On the Geometric theory of Fields</a>*</TD>

Can anybody tell me a scalable way to define the pattern for re.findall()?

Source

2013-05-15 mxant

like 'href=".+?\.(doc|pdf)' – georg

It only returns ['pdf', 'doc']... but I need the whole line... – mxant

Try 'search' instead of 'findall' – georg
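The behaviour discussed in these comments comes from how re.findall() treats groups: when the pattern contains a capturing group, only the group's text is returned. A minimal sketch, assuming a single sample line (the variable names group_only and whole_match are illustrative only):

```python
import re

line = '<a href="ABC.pdf"> On Complex Analytic Manifolds</a>'

# With a capturing group, findall returns only what the group matched.
group_only = re.findall(r'href=".+?\.(doc|pdf)', line)
print(group_only)   # ['pdf']

# With a non-capturing group (?:...), findall returns the whole match.
whole_match = re.findall(r'href=".+?\.(?:doc|pdf)', line)
print(whole_match)  # ['href="ABC.pdf']
```

Switching to (?:...) is usually the smallest change when the whole match is wanted from findall.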

Something like this:

>>> import re
>>> strs="""<TR><TD ALIGN="RIGHT">4.</TD> 
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="ABC.pdf"> On Complex Analytic Manifolds</a></TD> 
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=72>L. Sam</TD> 
</TR> 
<TR><TD ALIGN="RIGHT">5.</TD> 
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="DEF.doc"> On the Geometric theory of Fields</a>*</TD> 
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=72>G.K. Ram</TD> 
</TR>""" 

>>> [x for x in strs.splitlines() if re.search(r"[a-zA-Z0-9]+\.(pdf|doc)",x)] 
['<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="ABC.pdf"> On Complex Analytic Manifolds</a></TD>', 
'<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="DEF.doc"> On the Geometric theory of Fields</a>*</TD>' 
]
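The pattern above also matches strings such as "x.docx" or a file name mentioned in the link text. A possible stricter variant, sketched under the assumption that the file name always sits in a double-quoted href attribute (LINK_RE and matching_lines are hypothetical names), could look like this:

```python
import re

# Anchoring on the href quotes avoids matching "x.docx" or a stray "a.pdf" in the text.
LINK_RE = re.compile(r'href="[^"]+\.(?:pdf|doc)"', re.IGNORECASE)

def matching_lines(lines):
    """Yield each line (without its trailing newline) that links to a .pdf or .doc file."""
    for line in lines:
        if LINK_RE.search(line):
            yield line.rstrip("\n")

sample = [
    '<TD><a href="ABC.pdf"> On Complex Analytic Manifolds</a></TD>\n',
    '<TD>L. Sam</TD>\n',
]
for hit in matching_lines(sample):
    print(hit)
```

Because the function accepts any iterable of lines, an open file object can be passed directly, so a large file does not have to be read into memory first.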

Source

2013-05-15 06:53:20

Actually, I don't want to use string functions... I need to use a regular expression... – mxant

You can use this expression:

(.*?<a\shref=[\"']\w+(?:\.doc|\.pdf)[\"']>.*)

Output:

>>> import re
>>> html = """<TR><TD ALIGN="RIGHT">4.</TD> 
... <TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="ABC.pdf"> On Complex Analytic Manifolds</a></TD> 
... <TD ALIGN="LEFT" VALIGN="TOP" WIDTH=72>L. Sam</TD> 
... </TR> 
... <TR><TD ALIGN="RIGHT">5.</TD> 
... <TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="DEF.doc"> On the Geometric theory of Fields</a>*</TD> 
... <TD ALIGN="LEFT" VALIGN="TOP" WIDTH=72>G.K. Ram</TD> 
... </TR>""" 
>>> re.findall(r"(.*?<a\shref=[\"']\w+(?:\.doc|\.pdf)[\"']>.*)", html) 
['<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="ABC.pdf"> On Complex Analytic Manifolds</a></TD>', '<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="DEF.doc"> On the Geometric theory of Fields</a>*</TD>']
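This works line by line because "." does not match a newline by default, so each ".*" stays within one line. For the "scalable" part of the question, one option is to build the alternation from a list of extensions. The sketch below is only an illustration: find_link_lines is a hypothetical helper, and it assumes one link per line with a quoted href.

```python
import re

def find_link_lines(text, extensions=("doc", "pdf")):
    """Return every full line of text whose href points at one of the extensions."""
    # re.escape guards against extensions that contain regex metacharacters.
    alternatives = "|".join(re.escape(ext) for ext in extensions)
    # ^ and $ match at line boundaries because of re.MULTILINE.
    pattern = r'^.*href=["\'][^"\']+\.(?:' + alternatives + r')["\'].*$'
    return re.findall(pattern, text, flags=re.MULTILINE)

text = (
    '<TR><TD ALIGN="RIGHT">4.</TD>\n'
    '<TD><a href="ABC.pdf"> On Complex Analytic Manifolds</a></TD>\n'
    '<TD>L. Sam</TD>\n'
    '<TD><a href="DEF.doc"> On the Geometric theory of Fields</a>*</TD>'
)
for line in find_link_lines(text):
    print(line)
```

Adding another file type then only means passing a longer tuple, for example find_link_lines(text, ("doc", "pdf", "odt")).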

Source

2013-05-15 06:55:31 jvallver

But I need the full line... from ... to ... – mxant

OK, I have corrected the regex – jvallver

It works... thanks a lot... – mxant

You can use BeautifulSoup together with re.

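from bs4 import BeautifulSoup  # BeautifulSoup 4
import re

# assumes html holds the page source as a string
soup = BeautifulSoup(html, 'html.parser')
# href is an attribute of the <a> tag, so filter on it instead of using it as the tag name
lines = soup.find_all('a', href=re.compile('your regex here'), attrs={'class': 'text'})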

Here class is the class attribute that the tag carries in your HTML code.
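If installing BeautifulSoup is not an option, a similar idea can be sketched with the standard library's html.parser instead. This swaps in a different technique from the one in the answer, and it collects the href values rather than whole lines; DocLinkCollector is a hypothetical name used only for illustration.

```python
from html.parser import HTMLParser

class DocLinkCollector(HTMLParser):
    """Collect href values of <a> tags that end in one of the wanted extensions."""

    def __init__(self, extensions=(".doc", ".pdf")):
        super().__init__()
        self.extensions = tuple(ext.lower() for ext in extensions)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this method.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().endswith(self.extensions):
                self.links.append(value)

collector = DocLinkCollector()
collector.feed('<TD><a href="ABC.pdf"> A</a></TD><TD><a href="notes.txt">B</a></TD>')
print(collector.links)  # ['ABC.pdf']
```

A parser-based approach tends to be less sensitive to attribute order and quoting than a hand-written pattern, at the cost of not returning the original line text.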

Source

2013-05-15 07:36:17 octoback
