How to strip all attributes from HTML td tags except rowspan in Python?

Using Python 3.3, I am trying to do some regular expression substitutions and failing.

I want to strip all attributes from td tags except the rowspan attribute (example td tags are at the end).

With the following command I can do the substitution successfully when rowspan is present:

re.sub('(<td)[^>]*([\\s]rowspan[\\s]*=[\\s]*[0-9]*)[^>]*(>)', handle_td, file_contents)

where handle_td is:

def handle_td(matchobj):
    # Rebuild the tag from the groups that matched; unmatched groups become ''.
    new_td = ''
    for curr_group in matchobj.groups(''):
        if curr_group != '':
            new_td += curr_group
    return new_td

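But I also want to handle the rest of the td tags, the ones without rowspan. That I have not managed.

If I add ? after the second group, every td tag is changed to <td> and the rowspan attribute is not kept.

What am I doing wrong? How can I fix this?

I don't mind running another command to handle the other td tags, but I didn't...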

<td width=307 valign=top style='width:230.3pt;border:solid windowtext 1.0pt; border-left:none;padding:0cm 5.4pt 0cm 5.4pt'> 
<td width=307 rowspan=4 style='width:230.3pt;border:solid windowtext 1.0pt; border-top:none;padding:0cm 5.4pt 0cm 5.4pt'> 
<td width=307 valign=top style='width:230.3pt;border-top:none;border-left: none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; padding:0cm 5.4pt 0cm 5.4pt'>

This should produce:

<td> 
<td rowspan=4> 
<td>
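
The symptom described in the question can be reproduced on a shortened sample tag. This is a minimal sketch, assuming the only change to the original pattern is a ? after the second group; `OPTIONAL_ROWSPAN`, `SAMPLE_WITH_ROWSPAN` and `join_groups` are hypothetical names used for illustration, and `[\\s]` is written as `\s` in a raw string, which is equivalent. The first `[^>]*` is greedy and consumes the whole attribute list, so the optional group is allowed to match nothing and the engine never backtracks to find rowspan.

```python
import re

# Pattern from the question with "?" added after the rowspan group (assumed edit).
OPTIONAL_ROWSPAN = r'(<td)[^>]*(\srowspan\s*=\s*[0-9]*)?[^>]*(>)'

# Hypothetical sample, shortened from the question's second tag.
SAMPLE_WITH_ROWSPAN = "<td width=307 rowspan=4 style='width:230.3pt'>"


def join_groups(matchobj):
    # Same idea as handle_td: concatenate the groups, using '' for unmatched ones.
    return ''.join(matchobj.groups(''))


match = re.match(OPTIONAL_ROWSPAN, SAMPLE_WITH_ROWSPAN)
# The greedy [^>]* already consumed " rowspan=4", so group 2 did not participate.
rowspan_group = match.group(2)
result = re.sub(OPTIONAL_ROWSPAN, join_groups, SAMPLE_WITH_ROWSPAN)
print(repr(rowspan_group), result)
```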

I managed it this way (if you have a better approach, feel free to add it):

import re

# Leave only specific attributes for td tags
def filter_td_attributes(matchobj):
    if matchobj.group(1) == "rowspan":
        return matchobj.group(1) + '=' + matchobj.group(2)
    return ''  # drop every other attribute explicitly

# Loop over the attributes of a td tag
def handle_td(matchobj):
    new_td = re.sub("([a-zA-Z]+)[\\s]*=[\\s]*([a-zA-Z0-9:;.\\-'\\s]*)([\\s]|>)", filter_td_attributes, matchobj.group(0))
    # strip trailing whitespace, and a '>' left over when the tag had no attributes
    new_td = re.sub("[\\s>]*$", '', new_td)
    new_td = new_td + ">"  # close the td tag
    return new_td

# Separate cleanup: drop a closing </p> (and surrounding whitespace) before </td>
file_contents = re.sub('[\\s]*</p>[\\s]*</td>', '</td>', file_contents)
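
The call that passes handle_td to re.sub is not shown in the post. The sketch below is one possible way to wire it up and check it against shortened versions of the sample tags; the driver pattern `TD_TAG` and the `sample` string are assumptions made for illustration, and the two functions are a lightly tidied copy of the ones above.

```python
import re


def filter_td_attributes(matchobj):
    # Keep rowspan=value and drop every other attribute.
    if matchobj.group(1) == "rowspan":
        return matchobj.group(1) + '=' + matchobj.group(2)
    return ''


def handle_td(matchobj):
    # Filter each name=value pair inside the matched <td ...> tag.
    new_td = re.sub(r"([a-zA-Z]+)\s*=\s*([a-zA-Z0-9:;.\-'\s]*)(\s|>)",
                    filter_td_attributes, matchobj.group(0))
    # Remove trailing whitespace or a leftover '>' before closing the tag.
    new_td = re.sub(r"[\s>]*$", '', new_td)
    return new_td + ">"


# Assumed driver pattern: match every td start tag, with or without attributes.
TD_TAG = r'<td[^>]*>'

# Shortened versions of the sample tags from the question.
sample = (
    "<td width=307 valign=top style='width:230.3pt;border-left:none'>\n"
    "<td width=307 rowspan=4 style='width:230.3pt; border-top:none'>\n"
    "<td>"
)

cleaned = re.sub(TD_TAG, handle_td, sample)
print(cleaned)
```

Note that the value character class only covers the characters that appear in these samples; attribute values containing other characters (double quotes, parentheses, percent signs) would need a wider class.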

Source

2012-12-05 SimonW

Don't parse HTML/XML with regular expressions. Use a parser, it will be *much* easier. – kreativitea

Obligatory reference: http://stackoverflow.com/a/1732454/1350899 – mata

Use [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/) instead. – will
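
As a rough illustration of the parser route suggested in these comments, here is a sketch that uses the standard library's html.parser instead of Beautiful Soup, purely to avoid a third-party dependency. `TdAttributeStripper` and `strip_td_attributes` are hypothetical names. The sketch re-emits the document as it is parsed and rewrites only td start tags; it quotes the kept value (rowspan="4" rather than rowspan=4), lowercases end tag names, and makes no attempt to handle every construct a real document may contain.

```python
from html.parser import HTMLParser


class TdAttributeStripper(HTMLParser):
    """Re-emit HTML as parsed, except that td start tags keep only rowspan."""

    def __init__(self):
        # convert_charrefs=False so entities reach the handlers unconverted.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            kept = "".join(
                ' {}="{}"'.format(name, value)
                for name, value in attrs
                if name == "rowspan" and value is not None
            )
            self.parts.append("<td" + kept + ">")
        else:
            # Every other start tag is copied verbatim from the source text.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied verbatim.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&{};".format(name))

    def handle_charref(self, name):
        self.parts.append("&#{};".format(name))

    def handle_comment(self, data):
        self.parts.append("<!--{}-->".format(data))

    def handle_decl(self, decl):
        self.parts.append("<!{}>".format(decl))


def strip_td_attributes(html):
    parser = TdAttributeStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


row = ("<tr><td width=307 valign=top style='width:230.3pt'>a</td>"
       "<td width=307 rowspan=4>b</td></tr>")
print(strip_td_attributes(row))
```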

You have to make the [^>]* part of the pattern non-greedy when the rowspan part is optional: make it [^>]*?. All together it becomes:

'(<td)[^>]*?([\\s]rowspan[\\s]*=[\\s]*[0-9]*)?[^>]*(>)'

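The greedy version ([^>]*) means "give me as many non-'>' characters as possible, but I will accept zero".

The non-greedy version ([^>]*?) means "give me as few non-'>' characters as possible while still letting the whole regular expression match".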
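
The difference between the two quantifiers can be seen in isolation. This is a small sketch on a made-up tag (`tag` and the variable names are illustrative only): the greedy form takes everything up to the closing bracket, the lazy form on its own takes nothing, and the lazy form followed by more pattern takes just enough for the rest to match.

```python
import re

# Made-up tag for illustration.
tag = "<td width=307 rowspan=4 style='x'>"

# Greedy: as many non-'>' characters as possible.
greedy = re.match(r"<td([^>]*)", tag).group(1)

# Lazy with nothing after it: zero characters is already enough to match.
lazy_alone = re.match(r"<td([^>]*?)", tag).group(1)

# Lazy followed by more pattern: expands only until the rest can match.
lazy_before_rowspan = re.match(r"<td([^>]*?)\srowspan", tag).group(1)

print(repr(greedy), repr(lazy_alone), repr(lazy_before_rowspan))
```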

Source

2012-12-05 22:39:43 EvilBob22

Thanks, I tried that, but it didn't work... – SimonW

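Hey, at least you got something that works, and that is the most important thing. I also noticed some "extra" things that aren't really needed: the square brackets around "\\s" do nothing, and the parentheses around the static text (such as "<td") aren't really needed either. You normally don't need to capture static text, and you are not treating a group of characters as a single entity the way '(<td)*' would. – EvilBob22

Thanks for the answer. The parentheses are there for grouping, so that I can use the contents of the groups in the code. – SimonW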
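
A likely reason the non-greedy suggestion did not work: with `[^>]*?` followed by an optional group, the lazy part can match zero characters, the optional group can match nothing, and the trailing greedy `[^>]*` then consumes every attribute, rowspan included. The sketch below shows this, together with one possible alternative that is not from the thread: move the lazy skip inside the optional group so that it has to reach rowspan or be dropped as a whole. `SUGGESTED`, `ALTERNATIVE` and `tags` are names introduced here, and the alternative assumes unquoted numeric rowspan values as in the samples.

```python
import re

# Pattern suggested in the answer, written as a raw string.
SUGGESTED = r'(<td)[^>]*?(\srowspan\s*=\s*[0-9]*)?[^>]*(>)'

# Possible alternative (an assumption, not from the thread): the lazy skip
# sits inside the optional group, so the group only matches if rowspan is found.
ALTERNATIVE = r'(<td)(?:[^>]*?(\srowspan\s*=\s*[0-9]+))?[^>]*(>)'


def handle_td(matchobj):
    # Same behaviour as the question's handle_td: join the matched groups.
    return ''.join(matchobj.groups(''))


# Shortened sample tags, one without and one with rowspan.
tags = "<td width=307 valign=top>\n<td width=307 rowspan=4 style='width:230.3pt'>"

with_suggested = re.sub(SUGGESTED, handle_td, tags)
with_alternative = re.sub(ALTERNATIVE, handle_td, tags)

print(with_suggested)
print(with_alternative)
```

The alternative would still be fooled by text such as " rowspan=4" inside another attribute's quoted value, which is one more reason the parser route mentioned in the comments is safer.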
