Python regex - do not match a word inside an HTML tag

I need to write a regex that does not match a word if it is inside an HTML tag.

Here is a sample of the text:

asdd qwe <a href="http://example.com" title="Some title with word qwe" class="external-link" rel="nofollow"> qwe

My current regex is:

(?!(\<.+))[^a-zA-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ](<class="bad-word"(?: style="[^"]+")?>)?(qwe)(<>)?[^a-zA-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ](?!.+\>)

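It is a bit complicated, but everything works as expected when I test it on regex101.com and regexr.com: it matches only the word after the HTML tag.

Any idea why?

Edit:

I don't want to use an HTML parser or DOM manipulation; I don't want to change that much code.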

def test_tagged_word_present(self): 
    input = 'words <a href="example.com" title="title with word qwe" class="external-link" rel="nofollow"> qwe some other words' 
    expected = 'words <a href="example.com" title="title with word qwe" class="external-link" rel="nofollow"><strong class="bad-word" style="color:red">qwe</strong> some other words' 
    parser = self.get_test_parser(input, search_word='qwe') 
    text = parser.mark_words() 
    self.assertEqual(text, expected)

Everything else works perfectly, but the regex still catches the qwe in the title attribute.

Source

2015-10-12 Cosaquee

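How about using a parser that hands you the text content of the HTML, and then matching against that text content? That way the text inside tags is never returned to you. – hwnd

Are you trying to match everything outside the <> tags? – Ephreal

@Ephreal I am trying to match every word that is not inside any HTML tag. – Cosaquee

A good trick for excluding content inside HTML tags is to use a "not followed by" (negative lookahead) and to include the angle-bracket characters in it. For example, your regex ends with: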

(?!.+\>)

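This is presumably meant to say "not followed by one or more characters and then a closing angle bracket".

However, that "one or more characters" is too broad and will match more than you want, because `.+` can run across other tags. If you make it a little stricter, it is no longer so greedy: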

(?![^<>]*>)

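So this says "not followed by any run of non-angle-bracket characters and then a closing angle bracket".

This way the replacement only happens when the word is outside an HTML tag: if the word is inside a tag, the lookahead pattern matches, so the "NOT followed by" prevents it from being replaced.

You may also need to include <> in your other character classes to restrict them.

Note that this is not 100% correct, because attribute values can legally contain these characters. In many cases, though, you know enough about your input to use [^<>] safely and simplify the task without causing any problems.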
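To make the difference between the two lookaheads concrete, here is a small sketch. The shortened sample string, the `tricky` string and the variable names are my own, not taken from the question. It also shows one case where the stricter version is fooled by a `<` inside an attribute value.

```python
import re

text = 'asdd qwe <a href="http://example.com" title="Some title with word qwe"> qwe'

# Broad lookahead from the question: ".+" may run across "<" and ">",
# so ANY later ">" in the string blocks the match.
broad = re.compile(r'qwe(?!.+>)')

# Stricter lookahead: only a ">" that can be reached without crossing
# another "<" or ">" counts, i.e. the end of the tag we are inside.
strict = re.compile(r'qwe(?![^<>]*>)')

broad_hits = [m.start() for m in broad.finditer(text)]    # only the final qwe
strict_hits = [m.start() for m in strict.finditer(text)]  # first and final qwe

# Caveat: a "<" inside an attribute value stops the lookahead early,
# so the word is matched even though it sits inside the tag.
tricky = '<a title="qwe < other">link</a>'
tricky_hits = strict.findall(tricky)  # ['qwe']
```

With the broad version, the first `qwe` is rejected only because the tag that follows it happens to contain a `>`; the stricter version rejects just the occurrence in the `title` attribute.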

$ python 
Python 2.6.6 (r266:84292, Jan 22 2014, 09:42:36) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 
Type "help", "copyright", "credits" or "license" for more information. 
>>> mystring = 'asdd qwe <a href="http://example.com" title="Some title with word qwe" class="external-link" rel="nofollow"> qwe ' 
>>> import re 
>>> p=re.compile(r'([^\s<>]+)(?![^<>]*>)') 
>>> p.findall(mystring) 
['asdd', 'qwe', 'qwe'] 
>>> 
$

Second test:

$ python 
Python 2.6.6 (r266:84292, Jan 22 2014, 09:42:36) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 
Type "help", "copyright", "credits" or "license" for more information. 
>>> import re 
>>> mystring = r'words <a href="example.com" title="title with word qwe" class="external-link" rel="nofollow"> qwe some other words' 
>>> p=re.compile(r'([^\s<>]+)(?![^<>]*>)') 
>>> p.findall(mystring) 
['words', 'qwe', 'some', 'other', 'words'] 
>>> mystring = r'words <a href="example.com" title="title with word qwe" class="external-link" rel="nofollow"> qwe <strong class="bad-word" style="color:red">podmiotu</strong> some other words' 
>>> p.findall(mystring) 
['words', 'qwe', 'podmiotu', 'some', 'other', 'words'] 
>>>

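Note that 'qwe' appears in both strings outside the HTML tag, so I think it should match.

To search for one specific word, just put it in the regex:

Find the word "some" if it is outside the HTML: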

>>> p=re.compile(r'(some)(?![^<>]*>)') 
>>> p.findall(mystring) 
['some'] 
>>>

Find the word "external" if it is outside the HTML (fails, correctly):

>>> p=re.compile(r'(external)(?![^<>]*>)') 
>>> p.findall(mystring) 
[] 
>>>
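To tie this back to the test in the question, here is a rough sketch of how the same lookahead could drive the replacement. `mark_word` is a hypothetical helper of mine, not the asker's `mark_words` method, and I am assuming the `<strong class="bad-word" style="color:red">` wrapper from the expected string.

```python
import re

def mark_word(text, word):
    """Wrap `word` in a <strong> tag wherever it occurs outside an HTML tag."""
    # \b stops "qwe" from matching inside a longer word; the lookahead
    # rejects occurrences that are followed by a tag-closing ">".
    # (On Python 2 with non-ASCII letters, \b would need unicode
    # strings and the re.UNICODE flag.)
    pattern = re.compile(r'\b(%s)\b(?![^<>]*>)' % re.escape(word))
    return pattern.sub(
        r'<strong class="bad-word" style="color:red">\1</strong>', text)

html = ('words <a href="example.com" title="title with word qwe" '
        'class="external-link" rel="nofollow"> qwe some other words')
marked = mark_word(html, 'qwe')
# The qwe in the title attribute is left alone; the one after the tag is wrapped.
```

Unlike the regex in the question, this does not consume the characters around the word, so the space between the `>` and the wrapped word is kept. The result therefore differs from the question's expected string by that one space.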

Source

2015-10-12 10:45:20

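It works like a charm, but not in Python. Any idea why? My test has the same problem as in my question: after changing the regex the test still does not pass, and the word inside the link is still matched. – Cosaquee

Could you include your expected output? It is not clear to me what you are actually trying to match and end up with. Thanks. –

The test case is now in the question – Cosaquee

Why not do the following: first strip any HTML tags from the string, then search for the word?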

>>> import re
>>> s = 'asdd qwe <a href="http://example.com" title="Some title with word qwe" class="external-link" rel="nofollow"> qwe '
>>> re.findall(r"\bqwe\b", re.sub(r"<[^>]*>", "", s))
['qwe', 'qwe']

Source

2015-10-12 07:53:48 haavee

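I need those HTML tags to stay in the text. I really like your idea, but not in this case. – Cosaquee

This modifies a *copy* of the text. So you can easily do: 'if re.findall(...): do_something_with_string(s)'; it just lets you easily test whether the word you are looking for occurs outside any tag. – haavee

Maybe use re.search and then use the indices to slice the string any way you want – Ephreal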
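A minimal sketch of that "test on a copy" idea, assuming a hypothetical helper name `occurs_outside_tags` and the same simple `<[^>]*>` tag pattern as in the answer:

```python
import re

TAG = re.compile(r'<[^>]*>')

def occurs_outside_tags(text, word):
    """Return True if `word` appears anywhere outside an HTML tag."""
    # re.sub returns a new string; `text` itself is never modified.
    stripped = TAG.sub('', text)
    return re.search(r'\b%s\b' % re.escape(word), stripped) is not None

s = ('asdd qwe <a href="http://example.com" title="Some title with word qwe" '
     'class="external-link" rel="nofollow"> qwe ')

found_qwe = occurs_outside_tags(s, 'qwe')            # True
found_nofollow = occurs_outside_tags(s, 'nofollow')  # False, only inside the tag
```

The original `s`, tags included, is still available afterwards for whatever processing is needed.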
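One way to read that suggestion, as a sketch only: use `re.finditer` (the multi-match sibling of `re.search`) and slice the original string with each match's `span()`. The pattern with the lookahead is borrowed from the other answer, and the square-bracket markers are purely illustrative.

```python
import re

s = 'asdd qwe <a href="http://example.com" title="Some title with word qwe"> qwe '
pattern = re.compile(r'\bqwe\b(?![^<>]*>)')

pieces = []
last = 0
for m in pattern.finditer(s):
    start, end = m.span()
    pieces.append(s[last:start])              # untouched text before the match
    pieces.append('[' + s[start:end] + ']')   # the match, marked up
    last = end
pieces.append(s[last:])                       # remainder after the last match
result = ''.join(pieces)
```

Because the slices come from the original string, the tags stay in place and only the matched words are rewritten.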
