Syntax error - Python re.search (character class, caret)

I'm using BeautifulSoup to scrape a page and trying to filter out links that end in "...html#comments".

The code is as follows:

import urllib.request
import re
from bs4 import BeautifulSoup

base_url = "http://voices.washingtonpost.com/thefix/morning-fix/"
soup = BeautifulSoup(urllib.request.urlopen(base_url)).findAll('a')
links_to_follow = []
for i in soup:
    if i.has_key('href') and \
       re.search(base_url, i['href']) and \
       len(i['href']) > len(base_url) and \
       re.search(r'[^(comments)]', i['href']):
        print(i['href'])

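Python 3.2, Windows 7 64-bit.

The script above still keeps the links ending in "#comments".

I tried re.search([^comments], i['href']), re.search([^(comments)], i['href']) and re.search([^'comments'], i['href']); all of them throw syntax errors.

I'm new to Python, so apologies for the basic question. I think that either (a) I don't understand the 'r' prefix well enough, or (b) given [^(foo)], re.search does not return the lines that exclude 'foo' but the lines that contain anything more than 'foo' (for example, I keep my ...#comments links because ...texttexttext.html precedes the #comments part), or (c) Python interprets "#" as the start of a comment that ends the line re.search should match.

I suspect my mistake is (b).

Sorry, I know this is simple. Thanks,

Zack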


2012-03-24 Zack

You should include the exact text of the error/traceback you are getting. – Amber 2012-03-24 18:49:05

[^(comments)]

means "one character that is neither a (, nor a c, an o, an m, an e, an n, a t, an s, nor a )". That is probably not what you want.
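To make that concrete, here is a small sketch of what the character class actually matches. The URL is a made-up example, not one from the question. The last part is an assumption about where the syntax errors come from: the attempts quoted in the question pass the pattern without quotes, and Python cannot parse that as an expression.

```python
import re

pattern = re.compile(r'[^(comments)]')

# Hypothetical URL, used only for illustration.
url = "http://example.com/post.html#comments"

# The class matches the first single character that is not one of
# ( c o m e n t s ), which here is the leading "h" of "http".
first = pattern.search(url)
print(first.group())

# A string made up only of the excluded characters gives no match at all.
no_match = pattern.search("comments")
print(no_match)

# Assumption: the syntax errors come from the missing quotes. An unquoted
# [^comments] is not valid Python, so it fails before re is ever called.
try:
    compile("re.search([^comments], href)", "<example>", "eval")
    unquoted_is_valid = True
except SyntaxError:
    unquoted_is_valid = False
print(unquoted_is_valid)
```

So almost any URL matches the class, because almost any URL contains at least one character outside that set.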

If your goal is a regex check that succeeds only when the given string does not end in #comments, then I would use

... and not re.search("#comments$", i['href'])

or even better (why use a regex at all if it's this simple?):

... and not i['href'].endswith("#comments")
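A quick way to convince yourself that the two checks agree is to run both over a few sample links. The hrefs below are invented for illustration and `by_regex` / `by_endswith` are just names picked for this sketch.

```python
import re

# Hypothetical hrefs, for illustration only.
hrefs = [
    "http://example.com/a.html",
    "http://example.com/a.html#comments",
    "http://example.com/comments-policy.html",
]

# Keep a link only if it does NOT end in "#comments".
by_regex = [h for h in hrefs if not re.search(r"#comments$", h)]
by_endswith = [h for h in hrefs if not h.endswith("#comments")]

print(by_regex)
print(by_regex == by_endswith)
```

One small difference: without re.MULTILINE, `$` also matches just before a trailing newline, so the regex would treat "a.html#comments\n" as ending in #comments while endswith would not. For hrefs taken from parsed HTML that rarely matters.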

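As for your other questions:

The r'...' notation lets you write "raw strings", which means backslashes don't need to be escaped:

r'\b' means "backslash + b" (which the regex engine then interprets as "word boundary")
'\b' means "backspace character"
etc.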
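The difference between the two spellings is easy to check in an interpreter. This is only a sketch; the sample text is made up.

```python
import re

# A raw string keeps the backslash: two characters, "\" and "b".
raw_len = len(r'\b')
# A normal string turns \b into one character, the backspace (0x08).
plain_len = len('\b')
print(raw_len, plain_len, '\b' == '\x08')

text = "no comments here"  # hypothetical sample text

# With the raw string, the regex engine sees \b and treats it as a word boundary.
with_raw = re.search(r'\bcomments\b', text)
# Without it, the pattern contains literal backspace characters, which the
# sample text does not contain, so nothing matches.
without_raw = re.search('\bcomments\b', text)
print(with_raw is not None, without_raw is not None)
```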

# has no special meaning in a regex unless you use the (?x) or re.VERBOSE option. In that case, it does start a comment in a multiline regex.
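Here is a rough illustration of that last point, using a made-up URL. In verbose mode an unescaped # swallows the rest of the line, so it has to be escaped (or put in a character class) to be matched literally.

```python
import re

url = "http://example.com/a.html#comments"  # hypothetical URL

# Default mode: "#" is an ordinary character.
plain = re.search(r"#comments$", url)
print(plain.group())

# Verbose mode, "#" not escaped: the whole pattern becomes a comment,
# leaving an empty regex that matches the empty string at position 0.
verbose_unescaped = re.search(r"#comments$", url, re.VERBOSE)
print(repr(verbose_unescaped.group()))

# Verbose mode with the "#" escaped: it is matched literally again.
verbose_escaped = re.compile(r"""
    \#comments   # a literal "#comments"
    $            # at the end of the string
""", re.VERBOSE)
escaped_match = verbose_escaped.search(url)
print(escaped_match.group())
```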


2012-03-24 18:48:53

Had to leave, back now. Thanks for the answer. – Zack 2012-03-25 13:47:12

A regex may not be the best solution here:

import urllib.request
from bs4 import BeautifulSoup

base_url = "http://voices.washingtonpost.com/thefix/morning-fix/"
soup = BeautifulSoup(urllib.request.urlopen(base_url)).findAll('a')
links_to_follow = []
for i in soup:
    href = i.get('href')
    if href is None:
        continue
    if not href.startswith(base_url):
        continue
    # Skip links to comment sections; print everything else.
    if not href.endswith('#comments'):
        print(href)


2012-03-24 18:52:55 Amber
