Python: find a string and everything before and after it

I have a bunch of HTML that I download from a website once a week. I need to scrape some information from it and am not sure where to start.

I have about 100 of these repeated in the file and only want to grab 2 lines:

number2 ‑‑computer
3 days ago on Jun 22, 12 11,589 files/4,363 MB

<td width="242"><div align="left"><span class="style9"> 
<span class="style9"><img src="pic.pn" width="32" height="32" border="0" style="vertical-align:text-top;" />number2 &nbsp;&#8209;&#8209;computer</span><br /> 
..... 
<div align="left">License:<br />Backup:<br />Files:</div></td><td width="186" valign="top" nowrap><div align="left" nowrap> 
<span class="black" nowrap><span class="black">Paid&nbsp;Unlimited</span> 
<br />3&nbsp;days&nbsp;ago&nbsp;on&nbsp;Jun&nbsp;22,&nbsp;12<br />11,589 files/4,363&nbsp;MB</span></td> 
<td width="92" valign="top">&nbsp;</td></tr> 
..... 
</div></td>


2012-06-25 namit

[What have you tried?](http://whathaveyoutried.com) – millimoose

You want an HTML parser. In this case I would suggest BeautifulSoup.

@millimoose: Evidently he hasn't tried anything so far, as in "not sure where to start".
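
To make the parser suggestion concrete, here is a minimal sketch. BeautifulSoup is a third-party package, so this swaps in the standard library's `html.parser.HTMLParser` (Python 3) to stay dependency-free; the general idea of walking the markup and collecting text nodes is the same. The `TextCollector` class and the `SAMPLE` string are made up for this example, and `SAMPLE` is only a shortened copy of the block from the question.

```python
from html.parser import HTMLParser

# Shortened copy of the question's markup (hypothetical test input).
SAMPLE = (
    '<span class="style9"><img src="pic.pn" width="32" height="32" border="0" '
    'style="vertical-align:text-top;" />number2 &nbsp;&#8209;&#8209;computer</span><br />\n'
    '<span class="black" nowrap><span class="black">Paid&nbsp;Unlimited</span>\n'
    '<br />3&nbsp;days&nbsp;ago&nbsp;on&nbsp;Jun&nbsp;22,&nbsp;12<br />'
    '11,589 files/4,363&nbsp;MB</span></td>'
)


class TextCollector(HTMLParser):
    """Collect the non-empty text nodes of a document, in order."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; before handle_data
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Turn non-breaking spaces into plain spaces and drop whitespace-only nodes
        text = data.replace('\xa0', ' ').strip()
        if text:
            self.chunks.append(text)


collector = TextCollector()
collector.feed(SAMPLE)
collector.close()

for chunk in collector.chunks:
    print(chunk)
```

Each text node comes out as its own item, so the date and the file count stay separate even though only a `<br />` divides them in the markup.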

First, try stripping all the HTML tags from the string.

>>> import re
>>> # html_source holds the page text; unescape() is the helper credited below
>>> def remove_html_tags(data):
...     p = re.compile(r'<.*?>')
...     return p.sub('', data)
...
>>> stripped = remove_html_tags(unescape(html_source))
>>> stripped
u'\nnumber2 \xa0\u2011\u2011computer\n.....\nLicense:Backup:Files:\nPaid\xa0Unlimited\n3\xa0days\xa0ago\xa0on\xa0Jun\xa022,\xa01211,589 files/4,363\xa0MB\n\xa0\n.....\n'

After that it is an ordinary search/split/re matching problem.

unescape is courtesy of Fredrik Lundh.

This should get you going.
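
The search step could look roughly like the sketch below. It is written for Python 3, where `html.unescape` from the standard library stands in for the `unescape` helper used above. It also replaces `<br />` with a newline before stripping the other tags, because removing it outright glues "12" and "11,589" together, as the output above shows. `SAMPLE`, `html_to_lines`, and the two patterns are assumptions made for illustration, based only on the single block shown in the question.

```python
import re
from html import unescape  # Python 3 stand-in for the unescape() helper

# Shortened copy of the question's markup (hypothetical test input).
SAMPLE = (
    '<span class="style9"><img src="pic.pn" width="32" height="32" border="0" '
    'style="vertical-align:text-top;" />number2 &nbsp;&#8209;&#8209;computer</span><br />\n'
    '<span class="black" nowrap><span class="black">Paid&nbsp;Unlimited</span>\n'
    '<br />3&nbsp;days&nbsp;ago&nbsp;on&nbsp;Jun&nbsp;22,&nbsp;12<br />'
    '11,589 files/4,363&nbsp;MB</span></td>\n'
)

# Patterns guessed from the one example in the question; adjust to the real data.
BACKUP_RE = re.compile(r'^\d+ days? ago on \w+ \d+, \d+$')
FILES_RE = re.compile(r'^[\d,]+ files/[\d,]+ MB$')


def html_to_lines(html_source):
    """Return the visible text of html_source as a list of non-empty lines."""
    text = re.sub(r'<br\s*/?>', '\n', html_source)  # keep <br> as a line break
    text = re.sub(r'<.*?>', '', text)               # drop every other tag
    text = unescape(text).replace('\xa0', ' ')      # decode entities, plain spaces
    return [line.strip() for line in text.splitlines() if line.strip()]


lines = html_to_lines(SAMPLE)
backup = [line for line in lines if BACKUP_RE.match(line)]
files = [line for line in lines if FILES_RE.match(line)]

print(backup)
print(files)
```

Regular expressions are fragile on HTML in general, so this only holds up while the weekly download keeps the same layout.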


2012-06-25 11:15:01

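What you need to do is put a '\n' after each line of the text (if you load the file as a string, it will already be like that). Then you need to search for that part of the text and save the text in a shorter form. The script below should work if search and text are replaced with the right strings: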

# insert the text to search for and the text to be searched
search = '<td width="242"><div align="left"><span class="style9">\n<span class="style9"><img   src="pic.pn" width="32" height="32" border="0" style="vertical-align:text-top;" />number2 &nbsp;&#8209;&#8209;computer</span><br />\n.....\n<div align="left">License:<br />Backup:<br />Files:</div></td><td width="186" valign="top" nowrap><div align="left" nowrap>\n<span class="black" nowrap><span class="black">Paid&nbsp;Unlimited</span>\n<br />3&nbsp;days&nbsp;ago&nbsp;on&nbsp;Jun&nbsp;22,&nbsp;12<br />11,589 files/4,363&nbsp;MB</span></td>\n<td width="92" valign="top">&nbsp;</td></tr>\n.....\n</div></td>\n' 


text = 'a\n'+98*search+'\nb' 


changed = 0 
for x in range(len(text)):
    if text[x:x+len(search)] == search:
        # keep the first two matches, replace every later one
        if changed >= 2:
            text = text[0:x]+' '+text[x+len(search):] # to place a replacement text, switch ' ' for 'replacement text'
        changed += 1


print(text)
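
The loop above compares a slice at every character position. For blocks that do not overlap, the same result can probably be had more cheaply with `str.split`. The following is only a sketch of that idea; `keep_first_n` and its parameters are names invented here, not part of the script above.

```python
def keep_first_n(text, search, n=2, replacement=" "):
    """Keep the first n occurrences of search and replace all later ones."""
    parts = text.split(search)
    # parts has one more element than there are occurrences of search
    if len(parts) <= n + 1:
        return text  # n or fewer occurrences: nothing to replace
    head = search.join(parts[:n + 1])        # first n occurrences stay intact
    tail = replacement.join(parts[n + 1:])   # occurrences after that are swapped
    return head + replacement + tail


print(keep_first_n("aXbXcXdXe", "X"))
```

With the default arguments the first two `X` separators survive and the remaining ones become single spaces.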


2013-12-28 09:14:01 user3140804
