Replace &amp; with & only in links in a partial HTML document

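I've tried several approaches, jsoup among them, to convert &amp; to & in links only. The difficulties I'm having suggest I'm going about this all wrong. I suspect I'll facepalm when a solution is provided, but maybe good old regex is the best answer (since I only need the replacement inside hrefs), unless the reader code is modified?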

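The parsing libraries (I also tried NekoHTML) want to convert every & to &amp;, so I'm having trouble using them to get the real link href to use with String's replace method.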

Input:

String toParse = "The <a href=\"http://example.com?key=val&amp;another_key=val.pdf&amp;action=edit&happy=good\">Link with an encoded ampersand (&amp;)</a> is challenging.";

Desired output:

The <a href=\"http://example.com?key=val&another_key=val.pdf&action=edit&happy=good\">Link with an encoded ampersand (&amp;)</a> is challenging.
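
One way to get from the input above to the desired output without a DOM parser is a regex that touches only href attribute values. The following is a minimal sketch, not a robust HTML parser: the class name HrefAmpersandFixer and the method decodeAmpersandsInHrefs are made up for illustration, and it assumes every href value is quoted and sits on a single line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefAmpersandFixer {
    // Group 1: href= plus the opening quote, group 2: the quote character,
    // group 3: the attribute value up to the matching closing quote.
    private static final Pattern HREF =
            Pattern.compile("(href\\s*=\\s*([\"']))(.*?)\\2", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: decodes &amp; to & inside href values only.
    static String decodeAmpersandsInHrefs(String html) {
        Matcher m = HREF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = m.group(3).replace("&amp;", "&");
            String rebuilt = m.group(1) + value + m.group(2);
            // quoteReplacement stops $ and \ in the URL from being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(rebuilt));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String toParse = "The <a href=\"http://example.com?key=val&amp;another_key=val.pdf"
                + "&amp;action=edit&happy=good\">Link with an encoded ampersand (&amp;)</a> is challenging.";
        System.out.println(decodeAmpersandsInHrefs(toParse));
    }
}
```

Text outside the href, such as the (&amp;) in the link label, is left alone because it is never matched by the pattern.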

I ran into this while trying to read RSS feeds that render their <link>s with &amp; instead of &.

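Update: I ended up using a regex to identify the link and then using replace to put a decoded link, with plain &s, in its place. Pattern.quote() turned out to be very handy, but I had to manually close and reopen the quoted section so I could put a regex OR around my ampersand condition: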

// link is the href as read from the DOM, so it is already decoded (plain &);
// result is assumed here to be a StringBuilder holding the document text
final String cleanLink = StringUtils.strip(link).replaceAll(" ", "%20").replaceAll("'", "%27");
String regex = Pattern.quote(link);
// end and restart literal matching around the OR condition
regex = regex.replaceAll("&", "\\\\E(&amp;|&)\\\\Q");
final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(result);

while (matcher.find()) {
    // copy the match first: matcher.group() reads the live buffer,
    // which changes once result is modified below
    final String found = matcher.group();
    int index = result.indexOf(found);
    while (index != -1) {
        // this replaces the links containing &amp; with the same links containing &,
        // because cleanLink comes from the DOM and has been properly decoded
        result.replace(index, index + found.length(), cleanLink);
        index += cleanLink.length();
        index = result.indexOf(found, index);
        linkReplaced = true;
    }
}

I'm not thrilled with this approach, but there were too many conditions to handle without using DOM tools to identify the links.
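
For readers who want to try the Pattern.quote() idea in isolation, here is a self-contained sketch. It is not the author's code: the class QuotedLinkReplacer and the method replaceEncodedLink are hypothetical names, the %20 and %27 clean-up is left out, and Matcher.replaceAll is swapped in for the manual index loop.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedLinkReplacer {
    // decodedLink stands in for the href taken from the DOM (already decoded, plain &).
    static String replaceEncodedLink(String text, String decodedLink) {
        // Pattern.quote wraps the link in \Q...\E so ?, . and friends are literal.
        // Each & closes the literal section, matches either spelling of the
        // ampersand, and then reopens the literal section.
        String regex = Pattern.quote(decodedLink)
                .replace("&", "\\E(?:&amp;|&)\\Q");
        Matcher matcher = Pattern.compile(regex).matcher(text);
        // quoteReplacement keeps $ and \ in the link from being treated as group references
        return matcher.replaceAll(Matcher.quoteReplacement(decodedLink));
    }

    public static void main(String[] args) {
        String link = "http://example.com?key=val&another_key=val.pdf&action=edit&happy=good";
        String html = "<a href=\"http://example.com?key=val&amp;another_key=val.pdf"
                + "&amp;action=edit&happy=good\">x &amp; y</a>";
        System.out.println(replaceEncodedLink(html, link));
    }
}
```

Because the match is driven by one known link, the &amp; in the link text stays encoded; only occurrences of that specific URL are rewritten.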

2015-06-24 eebbesen

Having &amp; in a URL is actually the standard. Nobody writes their URLs that way, but there is nothing wrong with it as a URL as such. – Stewart

Why do you want to replace &amp; only in hrefs? Why not everywhere? Also, can you show the whole file/document you are processing? – Roman

At least on my machine, this link does not resolve correctly in Safari, Chrome or Firefox: http://www.europarl.europa.eu/sides/getAllAnswers.do?reference=E-2015-006220&amp;language=EN, but this one is fine: http://www.europarl.europa.eu/sides/getAllAnswers.do?reference=E-2015-006220&language=EN. So for me, handling the ampersands correctly is necessary. – eebbesen
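
A rough illustration of why the two links behave differently if the literal text &amp; reaches the server: splitting the raw query string on & yields a different parameter name. This is only a sketch with a naive split; QueryParamNames and paramNames are hypothetical names, and how a particular server parses its query string is an assumption.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

class QueryParamNames {
    // Splits the raw query on & and returns the text before each =.
    static List<String> paramNames(String url) {
        String query = URI.create(url).getRawQuery();
        List<String> names = new ArrayList<>();
        if (query == null) {
            return names;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            names.add(eq >= 0 ? pair.substring(0, eq) : pair);
        }
        return names;
    }

    public static void main(String[] args) {
        String base = "http://www.europarl.europa.eu/sides/getAllAnswers.do?reference=E-2015-006220";
        System.out.println(paramNames(base + "&amp;language=EN")); // [reference, amp;language]
        System.out.println(paramNames(base + "&language=EN"));     // [reference, language]
    }
}
```

With the encoded form, the second parameter is named amp;language rather than language, so a server looking for language would not find it.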

Take a look at StringEscapeUtils. Try using unescapeHtml() on the String.

2015-06-24 02:49:52 bphilipnyc

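Thanks @bphilipnyc! Using it on doc.body() also converts the &amp; that is not in an href to &. And attribute.setValue(StringEscapeUtils.unescapeHtml(attribute.getValue())); doesn't do what I need either: everything in the DOM object still gets forced back into HTML-escaped form. – eebbesen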
