2014-12-22 66 views

Python: find text with BeautifulSoup

I have an HTML comment at the end of my source file:

<!-- FEO DEBUG OUTPUT [TextTransAttempted:RENAME_JAVASCRIPT(18), RENAME_IMAGE(7), MINIFY_JAVASCRIPT(25), (1), JAVASCRIPT_HTML5_CACHE(19), EMBED_JAVASCRIPT(1), RENAME_CSS(3), (1), IMAGE_COMPRESSION(7), RESPONSIVE_IMAGES(6), ASYNC_JAVASCRIPT(2);TextTransApplied:RENAME_JAVASCRIPT(18), RENAME_IMAGE(7), MINIFY_JAVASCRIPT(25), (1), JAVASCRIPT_HTML5_CACHE(19), EMBED_JAVASCRIPT(1), RENAME_CSS(3), (1), IMAGE_COMPRESSION(7), RESPONSIVE_IMAGES(6), ASYNC_JAVASCRIPT(2);TagTransAttempted:(8), ASYNC_JAVASCRIPT(61);TagTransFailed:ASYNC_JAVASCRIPT(42);TagTransApplied:(8), ASYNC_JAVASCRIPT(19); ] --> 

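Now I want to check whether every value in parentheses is greater than zero. For example, I want to get the value 18 from RENAME_JAVASCRIPT and check that it is greater than zero, and likewise for the rest. Since this is a comment and not part of any HTML tag, is there a way to do this with BeautifulSoup?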

http://stackoverflow.com/questions/6062210/how-to-find-the-comment-tag-with-beautifulsoup –
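
As a rough illustration of the approach the linked question points to, BeautifulSoup exposes comments as text nodes of type `Comment`, so they can be found by node type instead of by tag name. This is only a sketch: the `html` string is a shortened, hypothetical stand-in for the real page, and `html.parser` is an assumed parser choice.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the real page; only the trailing comment matters.
html = """<html><body><p>hi</p></body></html>
<!-- FEO DEBUG OUTPUT [TextTransAttempted:RENAME_JAVASCRIPT(18), RENAME_IMAGE(7); ] -->"""

soup = BeautifulSoup(html, "html.parser")

# Comments are text nodes of type Comment, so filter on the node type
# rather than on a tag name.
comments = soup.find_all(string=lambda node: isinstance(node, Comment))

# Keep only the debug comment (the marker text is taken from the question).
feo_comments = [str(c) for c in comments if "FEO DEBUG OUTPUT" in c]
print(feo_comments[0].strip())
```

Converting each match with `str()` gives a plain string that can be handed to `re.findall` afterwards.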

Answer

I would just use re:

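import re
from bs4 import BeautifulSoup

with open("/sample_html.txt") as f:
    soup = BeautifulSoup(f.read())
    # Assumes the comment is the node directly after the <html> element.
    tag = soup.find("html").next_sibling
    # True only if every number in parentheses is greater than zero.
    print(all(x > 0 for x in map(int, re.findall(r"\((\d+)\)", tag))))

True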

If you want to see the names as well:

import re
from bs4 import BeautifulSoup

with open("/sample_html.txt") as f:
    soup = BeautifulSoup(f.read())
    tag = soup.find("html").next_sibling
    # Match NAME(number) pairs; unnamed entries such as "(1)" are skipped.
    for ele in re.findall(r"\w+\(\d+\)", tag):
        if int(ele.split("(")[1].rstrip(")")) > 0:
            print(ele)

RENAME_JAVASCRIPT(18)
RENAME_IMAGE(7) 
MINIFY_JAVASCRIPT(25) 
JAVASCRIPT_HTML5_CACHE(19) 
EMBED_JAVASCRIPT(1) 
RENAME_CSS(3) 
IMAGE_COMPRESSION(7) 
RESPONSIVE_IMAGES(6) 
ASYNC_JAVASCRIPT(2) 
RENAME_JAVASCRIPT(18) 
RENAME_IMAGE(7) 
MINIFY_JAVASCRIPT(25) 
JAVASCRIPT_HTML5_CACHE(19) 
EMBED_JAVASCRIPT(1) 
RENAME_CSS(3) 
IMAGE_COMPRESSION(7) 
RESPONSIVE_IMAGES(6) 
ASYNC_JAVASCRIPT(2) 
ASYNC_JAVASCRIPT(61) 
ASYNC_JAVASCRIPT(42) 
ASYNC_JAVASCRIPT(19) 
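
The flat list above loses track of which section (TextTransAttempted, TagTransFailed, and so on) each count came from, and it skips the unnamed entries such as (1) and (8). A possible variation, sketched here with the standard library only, splits the comment on `;` and `:` first. The helper name `parse_sections` is hypothetical, and the `comment` string is a shortened copy of the one in the question.

```python
import re

# Shortened copy of the comment text from the question.
comment = (
    " FEO DEBUG OUTPUT [TextTransAttempted:RENAME_JAVASCRIPT(18), (1), "
    "ASYNC_JAVASCRIPT(2);TagTransAttempted:(8), ASYNC_JAVASCRIPT(61);"
    "TagTransFailed:ASYNC_JAVASCRIPT(42); ] "
)

def parse_sections(text):
    """Return {section: [(name, count), ...]} for the bracketed part of text."""
    body = text[text.index("[") + 1 : text.rindex("]")]
    sections = {}
    for chunk in body.split(";"):
        if ":" not in chunk:
            continue  # skip the empty piece after the last ';'
        section, entries = chunk.split(":", 1)
        # The name is optional (\w*), so bare entries like "(1)" are kept.
        sections[section.strip()] = [
            (name, int(count))
            for name, count in re.findall(r"(\w*)\((\d+)\)", entries)
        ]
    return sections

sections = parse_sections(comment)
all_positive = all(count > 0 for pairs in sections.values() for _, count in pairs)
print(sections["TagTransAttempted"], all_positive)
```

Keeping the section name makes it possible to treat a non-zero TagTransFailed count differently from a non-zero TagTransApplied count, if that distinction matters.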

This raises the following error. Traceback (most recent call last): File "body_parser.py", line 119, print(all(x > 0 for x in map(int, re.findall("\((\d+)\)", feed)))) File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 177, in findall return _compile(pattern, flags).findall(string) TypeError: expected string or buffer – station

Oh, I see. My input will be the entire HTML source, and the comment will be at the end – station

Yes, I assumed you had already extracted the HTML you provided in the question –
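
When the input is the whole page source as a string, one way to sidestep the TypeError discussed above is to pull the comment out with a regular expression and handle the case where it is missing. The following is a minimal sketch under that assumption: `feo_counts_positive` is a hypothetical helper, and the names `A` and `B` in the sample page are placeholders, not real FEO fields.

```python
import re

def feo_counts_positive(html_source):
    """Return True/False for the FEO counts, or None if no such comment exists."""
    # re.findall needs a str; passing None or another object raises
    # "TypeError: expected string or buffer" on Python 2.
    match = re.search(r"<!--\s*FEO DEBUG OUTPUT(.*?)-->", html_source, re.DOTALL)
    if match is None:
        return None
    counts = [int(n) for n in re.findall(r"\((\d+)\)", match.group(1))]
    return all(n > 0 for n in counts)

# Placeholder page: A and B are made-up names used only for illustration.
page = "<html><body></body></html>\n<!-- FEO DEBUG OUTPUT [A(3), (1);B(0); ] -->"
print(feo_counts_positive(page))             # False, because of B(0)
print(feo_counts_positive("<html></html>"))  # None, no debug comment
```

Returning None for a missing comment keeps "no debug output found" distinct from "a count was zero".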
