2016-06-07 124 views

Extracting tags with multi-valued attributes

I am experimenting with the code below:

import re 
from bs4 import BeautifulSoup 
htmlsource1 = """<div class="small-12 columns "> 
        <h5 class="clsname1 large-text seq2">text1</h5> 
        <h5 class="clsname1 small-text seq1">text2</h5> 
        <h5 class="clsname1 seq1 small-text clsname2">text3</h5> 
       </div>""" 
soup = BeautifulSoup(htmlsource1, "html.parser") 
interesting_h5s = soup.find_all('h5', class_=re.compile('^(?=.*\bsmall-text\b)(?=.*\bseq1\b).*$')) 
for h5 in interesting_h5s: 
    print h5 

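My goal is to extract the h5 tags whose class attribute contains both 'small-text' and 'seq1' (in any order), but for some reason the code returns nothing, even though the regular expression tests positively at http://pythex.org.

I adapted the regular expression from "Regex to match string containing two names in any order".

Thanks for any suggestions.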


Possible duplicate of [BeautifulSoup returns empty list when searching by compound class names](http://stackoverflow.com/questions/34288969/beautifulsoup-returns-empty-list-when-searching-by-compound-class-names) – alecxe

Answers


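Foreword

You really should use an HTML parsing tool, but it appears you have creative control over your HTML, so the possible edge cases will be limited.

Description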
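As one illustration of the parser-based route recommended above, here is a minimal sketch that uses only Python's standard-library html.parser module. The class name H5ClassFilter and its attributes are hypothetical names chosen for this example and are not part of any library. It collects the text of every h5 tag whose class attribute contains all of the required class names, in any order.

```python
from html.parser import HTMLParser


class H5ClassFilter(HTMLParser):
    """Collect the text of <h5> tags that carry every required class (hypothetical helper)."""

    def __init__(self, required_classes):
        super().__init__()
        self.required = set(required_classes)
        self.matches = []        # inner text of each matching <h5>
        self._capturing = False  # True while inside a matching <h5>
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag != "h5":
            return
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        class_value = dict(attrs).get("class") or ""
        # Set comparison makes the order of the class names irrelevant.
        if self.required <= set(class_value.split()):
            self._capturing = True
            self._buffer = []

    def handle_data(self, data):
        if self._capturing:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h5" and self._capturing:
            self.matches.append("".join(self._buffer))
            self._capturing = False


html_source = """<div class="small-12 columns ">
    <h5 class="clsname1 large-text seq2">text1</h5>
    <h5 class="clsname1 small-text seq1">text2</h5>
    <h5 class="clsname1 seq1 small-text clsname2">text3</h5>
</div>"""

parser = H5ClassFilter(["small-text", "seq1"])
parser.feed(html_source)
parser.close()
print(parser.matches)  # ['text2', 'text3']
```

With BeautifulSoup itself, a CSS selector such as soup.select("h5.small-text.seq1") expresses the same any-order requirement.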

<h5(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sclass=['"](?=[^"]*\bsmall-text\b)(?=[^"]*\bseq1\b)([^"]*)['"]?)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?>(.*?)</h5>


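This regular expression will do the following:

  • extract h5 tags whose class attribute contains both 'small-text' and 'seq1' (in any order)
  • avoid some difficult edge cases

Live demo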

https://regex101.com/r/fR0mT7/2

Sample text

Note the difficult edge cases in the last two h5 tags.

<div class="small-12 columns "> 
    <h5 class="clsname1 large-text seq2">text1</h5> 
    <h5 class="clsname1 small-text seq1">text2</h5> 
    <h5 class="clsname1 seq1 small-text clsname2">text3</h5> 
    <h5 onmouseover=' class="small-text seq1" ; ' class="clsname1 large-text seq2">text4</h5> 
    <h5 onmouseover=' class="small-text seq1" ; ' class="clsname1 small-text seq1">text5</h5> 
    </div> 

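Sample matches

  • Capture group 0 gets the entire h5 tag
  • Capture group 1 gets the value of the class attribute
  • Capture group 2 gets the inner text of the h5 tag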
[0][0] = <h5 class="clsname1 small-text seq1">text2</h5> 
[0][1] = clsname1 small-text seq1 
[0][2] = text2 

[1][0] = <h5 class="clsname1 seq1 small-text clsname2">text3</h5> 
[1][1] = clsname1 seq1 small-text clsname2 
[1][2] = text3 

[2][0] = <h5 onmouseover=' class="small-text seq1" ; ' class="clsname1 small-text seq1">text5</h5> 
[2][1] = clsname1 small-text seq1 
[2][2] = text5 
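
The matches listed above can be reproduced in Python with re.findall. This is a minimal sketch rather than part of the original answer: the names H5_PATTERN, sample_html and results are chosen for illustration, and the pattern is the one from the answer, copied verbatim into a raw string.

```python
import re

# The pattern from the answer, kept in a raw string so that \b and \s
# reach the regex engine unchanged (in a normal string, "\b" is a backspace).
H5_PATTERN = re.compile(
    r"""<h5(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sclass=['"](?=[^"]*\bsmall-text\b)(?=[^"]*\bseq1\b)([^"]*)['"]?)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?>(.*?)</h5>"""
)

# The sample text from the answer, including the two onmouseover edge cases.
sample_html = """<div class="small-12 columns ">
    <h5 class="clsname1 large-text seq2">text1</h5>
    <h5 class="clsname1 small-text seq1">text2</h5>
    <h5 class="clsname1 seq1 small-text clsname2">text3</h5>
    <h5 onmouseover=' class="small-text seq1" ; ' class="clsname1 large-text seq2">text4</h5>
    <h5 onmouseover=' class="small-text seq1" ; ' class="clsname1 small-text seq1">text5</h5>
    </div>"""

# findall returns one (group 1, group 2) tuple per match:
# the class attribute value and the inner text of the h5 tag.
results = H5_PATTERN.findall(sample_html)
for class_value, inner_text in results:
    print(inner_text, "->", class_value)
```

Note that text4 is skipped even though the string class="small-text seq1" appears inside its onmouseover value, because the pattern steps over quoted attribute values as a unit.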

Explanation

NODE      EXPLANATION 
---------------------------------------------------------------------- 
    <h5      '<h5' 
---------------------------------------------------------------------- 
    (?=      look ahead to see if there is: 
---------------------------------------------------------------------- 
    \s      whitespace (\n, \r, \t, \f, and " ") 
---------------------------------------------------------------------- 
)      end of look-ahead 
---------------------------------------------------------------------- 
    (?=      look ahead to see if there is: 
---------------------------------------------------------------------- 
    (?:      group, but do not capture (0 or more 
          times (matching the least amount 
          possible)): 
---------------------------------------------------------------------- 
     [^>=]     any character except: '>', '=' 
---------------------------------------------------------------------- 
    |      OR 
---------------------------------------------------------------------- 
     ='      '=\'' 
---------------------------------------------------------------------- 
     [^']*     any character except: ''' (0 or more 
           times (matching the most amount 
           possible)) 
---------------------------------------------------------------------- 
     '      '\'' 
---------------------------------------------------------------------- 
    |      OR 
---------------------------------------------------------------------- 
     ="      '="' 
---------------------------------------------------------------------- 
     [^"]*     any character except: '"' (0 or more 
           times (matching the most amount 
           possible)) 
---------------------------------------------------------------------- 
     "      '"' 
---------------------------------------------------------------------- 
    |      OR 
---------------------------------------------------------------------- 
     =      '=' 
---------------------------------------------------------------------- 
     [^'"]     any character except: ''', '"' 
---------------------------------------------------------------------- 
     [^\s>]*     any character except: whitespace (\n, 
           \r, \t, \f, and " "), '>' (0 or more 
           times (matching the most amount 
           possible)) 
---------------------------------------------------------------------- 
    )*?      end of grouping 
---------------------------------------------------------------------- 
    \s      whitespace (\n, \r, \t, \f, and " ") 
---------------------------------------------------------------------- 
    class=     'class=' 
---------------------------------------------------------------------- 
    ['"]      any character of: ''', '"' 
---------------------------------------------------------------------- 
    (?=      look ahead to see if there is: 
---------------------------------------------------------------------- 
     [^"]*     any character except: '"' (0 or more 
           times (matching the most amount 
           possible)) 
---------------------------------------------------------------------- 
     \b      the boundary between a word char (\w) 
           and something that is not a word char 
---------------------------------------------------------------------- 
     small-text    'small-text' 
---------------------------------------------------------------------- 
     \b      the boundary between a word char (\w) 
           and something that is not a word char 
---------------------------------------------------------------------- 
    )      end of look-ahead 
---------------------------------------------------------------------- 
    (?=      look ahead to see if there is: 
---------------------------------------------------------------------- 
     [^"]*     any character except: '"' (0 or more 
           times (matching the most amount 
           possible)) 
---------------------------------------------------------------------- 
     \b      the boundary between a word char (\w) 
           and something that is not a word char 
---------------------------------------------------------------------- 
     seq1      'seq1' 
---------------------------------------------------------------------- 
     \b      the boundary between a word char (\w) 
           and something that is not a word char 
---------------------------------------------------------------------- 
    )      end of look-ahead 
---------------------------------------------------------------------- 
    (      group and capture to \1: 
---------------------------------------------------------------------- 
     [^"]*     any character except: '"' (0 or more 
           times (matching the most amount 
           possible)) 
---------------------------------------------------------------------- 
    )      end of \1 
---------------------------------------------------------------------- 
    ['"]?     any character of: ''', '"' (optional 
          (matching the most amount possible)) 
---------------------------------------------------------------------- 
)      end of look-ahead 
---------------------------------------------------------------------- 
    (?:      group, but do not capture (0 or more times 
          (matching the most amount possible)): 
---------------------------------------------------------------------- 
    [^>=]     any character except: '>', '=' 
---------------------------------------------------------------------- 
    |      OR 
---------------------------------------------------------------------- 
    ='      '=\'' 
---------------------------------------------------------------------- 
    [^']*     any character except: ''' (0 or more 
          times (matching the most amount 
          possible)) 
---------------------------------------------------------------------- 
    '      '\'' 
---------------------------------------------------------------------- 
    |      OR 
---------------------------------------------------------------------- 
    ="      '="' 
---------------------------------------------------------------------- 
    [^"]*     any character except: '"' (0 or more 
          times (matching the most amount 
          possible)) 
---------------------------------------------------------------------- 
    "      '"' 
---------------------------------------------------------------------- 
    |      OR 
---------------------------------------------------------------------- 
    =      '=' 
---------------------------------------------------------------------- 
    [^'"\s]*     any character except: ''', '"', 
          whitespace (\n, \r, \t, \f, and " ") (0 
          or more times (matching the most amount 
          possible)) 
---------------------------------------------------------------------- 
)*      end of grouping 
---------------------------------------------------------------------- 
    "      '"' 
---------------------------------------------------------------------- 
    \s?      whitespace (\n, \r, \t, \f, and " ") 
          (optional (matching the most amount 
          possible)) 
---------------------------------------------------------------------- 
    \/?      '/' (optional (matching the most amount 
          possible)) 
---------------------------------------------------------------------- 
    >      '>' 
---------------------------------------------------------------------- 
    (      group and capture to \2: 
---------------------------------------------------------------------- 
    .*?      any character except \n (0 or more times 
          (matching the least amount possible)) 
---------------------------------------------------------------------- 
)      end of \2 
---------------------------------------------------------------------- 
    </h5>     '</h5>' 
---------------------------------------------------------------------- 

Based on the post "Disable special "class" attribute handling", the problem was solved by adding the following code:

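from bs4 import BeautifulSoup
from bs4.builder import HTMLParserTreeBuilder

# Use a tree builder that no longer treats "class" as a multi-valued attribute,
# so each tag's class stays a single string such as "clsname1 small-text seq1".
bb = HTMLParserTreeBuilder()
bb.cdata_list_attributes["*"].remove("class")

# bs is assumed to hold the HTML source string (htmlsource1 in the question).
soup = BeautifulSoup(bs, "html.parser", builder=bb)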