Java regex or XML parser?

I want to remove tags like the one in the example below. Should I use a Java regex or an XML parser?

<p>hello <namespace:tag : a>hello</namespace:tag></p>

becomes

<p> hello hello </p>

What is the best way to do this? If it is a regex, the one below is not working for some reason. Can anyone help?

(<|</)[:]{1,2}[^</>]>

Edit: added

Source

2012-02-02 Paul

Definitely use an XML parser. Regex should not be used to parse *ML.
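As a rough illustration of the parser route in Java, here is a minimal sketch using the DOM parser that ships with the JDK. The class name `StripTagsWithDom` and the helper `textOf` are made up for this example. Note one assumption: the sample in the question (`<namespace:tag : a>`) is not well-formed XML, so a strict XML parser will reject it. The sketch therefore uses a well-formed variant with `a="1"` as the attribute.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class StripTagsWithDom {
    // Hypothetical helper: returns the concatenated text of everything under the root element.
    public static String textOf(String xml) {
        try {
            // Namespace awareness is off by default, so the undeclared "namespace:" prefix
            // is simply treated as part of the element name.
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Input that is not well-formed ends up here.
            throw new IllegalArgumentException("could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        // Well-formed variant of the question's sample (assumed attribute a="1").
        String xml = "<p>hello <namespace:tag a=\"1\">hello</namespace:tag></p>";
        System.out.println("<p>" + textOf(xml) + "</p>"); // prints <p>hello hello</p>
    }
}
```

If the real input is loose HTML rather than well-formed XML, a lenient HTML parser is the better fit than the strict XML parser used here.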

Source

2012-02-02 22:28:56 Bozho

Direct link: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – 2012-02-02 22:40:16

@LouisWasserman: I just added that link. That answer is a standard. – RanRag 2012-02-02 22:42:56

If by "standard" you mean "work of art", then yes. – 2012-02-02 22:43:59

For these purposes, use a parser such as lxml or BeautifulSoup:

>>> import lxml.html as lxht 
>>> myString = '<p>hello <namespace:tag : a>hello</namespace:tag></p>' 
>>> lxht.fromstring(myString).text_content() 
'hello hello'

Don't use regex here; there is a reason why you shouldn't parse HTML/XML with regex.

Source

2012-02-02 22:33:48 RanRag

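+1 - My mistake. I only expected to see the standard "don't parse XML with regex" answer. Sorry about your solution! – sln 2012-02-02 23:45:05

@sln: But you didn't upvote my answer :P. – RanRag 2012-02-02 23:46:04

I tried, but it wants you to edit the answer before it will let me change my vote to an upvote. Just make a pseudo-edit and my upvote will be enabled. I'll check back later. – sln 2012-02-03 00:43:06

If you just want to pull the plain text out of some simple XML, the best approach (fastest, smallest memory footprint) is to run a single for loop over the data.

In pseudocode: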

bool inMarkup = false; 
string text = ""; 
for each character in data // (dunno what you're reading from) 
{ 
    char c = current; 
    if(c == '<') inMarkup = true; 
    else if(c == '>') inMarkup = false; 
    else if(!inMarkup) text += c; 
}

Note: this will break if the input contains things like CDATA, JavaScript, or CSS.
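For anyone who wants to try it, the loop above could be rendered in Java roughly as follows. This is only a sketch: it assumes the data is already in a `String`, and the class and method names (`StripTagsLoop`, `stripTags`) are made up for the example. The second call in `main` shows the CDATA breakage mentioned in the note.

```java
public class StripTagsLoop {
    // Keeps only the characters that fall outside '<' ... '>' pairs.
    public static String stripTags(String data) {
        boolean inMarkup = false;
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < data.length(); i++) {
            char c = data.charAt(i);
            if (c == '<') {
                inMarkup = true;      // entering a tag
            } else if (c == '>') {
                inMarkup = false;     // leaving a tag
            } else if (!inMarkup) {
                text.append(c);       // plain text between tags
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The sample from the question: prints "hello hello".
        System.out.println(stripTags("<p>hello <namespace:tag : a>hello</namespace:tag></p>"));
        // CDATA case: the real text is "a < b > c", but the loop prints " c]]".
        System.out.println(stripTags("<p><![CDATA[a < b > c]]></p>"));
    }
}
```

Unlike a strict XML parser, the loop does not care that `<namespace:tag : a>` is malformed, which is why it copes with the question's sample as posted.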

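So, to sum up: if the input is simple, do something like the above rather than using a regex. If it is not that simple, listen to the others and use a full parser.

Source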

2012-02-02 22:37:53

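He didn't specify whether he is reading from a stream or just from a string, or whether his content contains CDATA or the like, so parts of the answer would differ. I just offered a simple solution that covers a large subset of the problem domain. Thanks for the criticism. – 2012-02-02 23:05:11

+1 - Sorry, my bad. Make a pseudo-edit so that my upvote can count. – sln 2012-02-02 23:46:37

This is the solution I personally use for a similar problem in Java. The library used for this is Jsoup: http://jsoup.org/.

In my particular case, I had to unwrap tags that carry an attribute with a specific value. You will see that reflected in this code. It is not an exact solution to this question, but it may set you on the right path.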

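// Assumed imports: StringUtils and Validate are taken to be Apache Commons Lang 3
// (the original answer does not say which library they come from).
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.lang3.Validate;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Document.OutputSettings;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public static String unWrapTag(String html, String tagName, String attribute, String matchRegEx) {
    Validate.notNull(html, "html must be non null");
    Validate.isTrue(StringUtils.isNotBlank(tagName), "tagName must be non blank");
    if (StringUtils.isNotBlank(attribute)) {
        Validate.notNull(matchRegEx, "matchRegEx must be non null when an attribute is provided");
    }
    Document doc = Jsoup.parse(html);
    // Disable pretty-printing so Jsoup does not re-indent the output.
    OutputSettings outputSettings = doc.outputSettings();
    outputSettings.prettyPrint(false);
    Elements elements = doc.getElementsByTag(tagName);
    for (Element element : elements) {
        if (StringUtils.isBlank(attribute)) {
            // No attribute filter: unwrap every matching tag, keeping its children.
            element.unwrap();
        } else {
            String attr = element.attr(attribute);
            if (!StringUtils.isBlank(attr)) {
                // Unwrap only if nothing but whitespace remains after removing all regex matches.
                String newData = attr.replaceAll(matchRegEx, "");
                if (StringUtils.isBlank(newData)) {
                    element.unwrap();
                }
            }
        }
    }
    // Returns the full document, including the html/head/body wrapper that Jsoup.parse adds.
    return doc.html();
}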

Source

2014-05-13 14:06:27 kenny
