How do I remove everything from HTML except a specific tag?

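I want to parse an HTML string and extract only <form> ... </form>. Nothing else is needed, so it can all be deleted. How do I remove everything from the HTML except a specific tag?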

Currently I have some helpers that use replaceAll to delete the content of specific tags, for example:

/** remove form */ 
    String newString = string.replaceAll("(?s)<form.*?</form>", "");

(?s)<form.*?</form> removes the form element. But I need the opposite: to remove everything except the form.

How can I solve this?

See my Gskinner example.


2013-07-10 Maxim Shoustin

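In general, parsing HTML with a DOM parser is a good idea. – Leri

Yes, but sometimes a page contains errors, such as a missing closing tag, and in that case this approach is a bad idea. –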

In that case you can try: 'String newString = string.replaceAll("*(<form*).?.?", "$1");' – Leri

Try the code below.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Client {

    // Matches a bare <form>...</form> pair on a single line (no attributes).
    private static final String PATTERN = "<form>(.+?)</form>";
    private static final Pattern REGEX = Pattern.compile(PATTERN);

    private static final boolean ONLY_TAG = true;

    public static void main(String[] args) {
        String text = "Hello <form><span><table>Hello Rais</table></span></form> end";
        System.out.println(getValues(text, ONLY_TAG));   // keep only the form
        System.out.println(getValues(text, !ONLY_TAG));  // remove the form
    }

    // flag == true: return the first form block, or null if none is found.
    // flag == false: return the text with every form block removed.
    private static String getValues(final String text, boolean flag) {
        final Matcher matcher = REGEX.matcher(text);
        String tagValues = null;
        if (flag) {
            if (matcher.find()) {
                tagValues = "<form>" + matcher.group(1) + "</form>";
            }
        } else {
            tagValues = text.replaceAll(PATTERN, "");
        }
        return tagValues;
    }
}

You will get the following output:

<form><span><table>Hello Rais</table></span></form>
Hello  end
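
The pattern in this answer only matches a bare <form> tag on a single line. As a minimal sketch of a variant, assuming forms may carry attributes, span several lines, differ in case, and occur more than once, something like the following could be used. The class name FormExtractor and the method extractForms are made up for illustration, and the approach still assumes every form has a closing tag and that forms are not nested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FormExtractor {

    // (?s) lets '.' match line breaks, (?i) ignores case,
    // \b stops tags such as <formula> from matching.
    private static final Pattern FORM =
            Pattern.compile("(?si)<form\\b.*?</form>");

    // Returns every <form>...</form> block found in the input, in order.
    public static List<String> extractForms(String html) {
        List<String> forms = new ArrayList<>();
        Matcher matcher = FORM.matcher(html);
        while (matcher.find()) {
            forms.add(matcher.group());
        }
        return forms;
    }

    public static void main(String[] args) {
        String html = "<p>intro</p>\n"
                + "<form action=\"/a\">\n<input name=\"x\">\n</form>\n"
                + "<FORM>b</FORM> end";
        for (String form : extractForms(html)) {
            System.out.println(form);
        }
    }
}
```

Collecting the matches with find() instead of deleting the surrounding text keeps the regex simple; a form without a closing tag is simply not returned.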


2013-07-10 11:51:38


The code below should point you in the direction you are looking for:

String str = "<html><form>test form</form></html>"; 
String newString = str.replaceAll("[^<form</form>]+|((?s)<form.*?</form>)", "$1"); 
System.out.println(newString);


2013-07-10 11:35:06 Mubin

You should read up on negated character classes (http://www.regular-expressions.info/charclass.html) '[^...]'; they do not behave the way you think. – sp00m
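
To make the comment concrete: '[^<form</form>]' is a character class, so it matches any single character other than <, f, o, r, m, / and >; it does not mean "anything that is not a form tag". The sketch below contrasts it with one possible alternative, which tries to capture a whole form first and otherwise consumes a single character. The class name KeepOnlyForms and the method keepForms are illustrative, and the approach assumes closed, non-nested <form> elements.

```java
public class KeepOnlyForms {

    // Alternative 1 captures a whole form block; alternative 2 matches any
    // other single character. Replacing with "$1" keeps the captured form
    // and drops everything else (an unmatched group expands to nothing).
    private static final String KEEP_FORMS = "(?s)(<form\\b.*?</form>)|.";

    public static String keepForms(String html) {
        return html.replaceAll(KEEP_FORMS, "$1");
    }

    public static void main(String[] args) {
        String str = "<html><form>test form</form></html>";

        // Character-class version from the answer: only characters outside
        // the set {<, f, o, r, m, /, >} are removed, so tag fragments remain.
        System.out.println(
                str.replaceAll("[^<form</form>]+|((?s)<form.*?</form>)", "$1"));
        // prints <m><form>test form</form></m>

        System.out.println(keepForms(str));
        // prints <form>test form</form>
    }
}
```

On pages with missing closing tags, as mentioned in the comments on the question, a form that is never closed is dropped along with the rest of the text.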
