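Help extracting text from HTML tags with Java and regular expressions

I want to use a regex to extract some text from an HTML file. I am still learning regular expressions and do not fully understand them yet. I have code that extracts all the text between <body> and </body>. Here it is: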

import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Harn2 {

    public static void main(String[] args) throws IOException {

        String toMatch = readFile();
        //Pattern pattern=Pattern.compile(".*?<body.*?>(.*?)</body>.*?"); this one works fine
        Pattern pattern = Pattern.compile(".*?<table class=\"claroTable\".*?>(.*?)</table>.*?"); //I want this one to work
        Matcher matcher = pattern.matcher(toMatch);

        if (matcher.matches()) {
            System.out.println(matcher.group(1));
        }
    }

    private static String readFile() {

        try {
            // Open the HTML file (the name is hard-coded)
            FileInputStream fstream = new FileInputStream("user.html");
            // Get the object of DataInputStream
            DataInputStream in = new DataInputStream(fstream);
            BufferedReader br = new BufferedReader(new InputStreamReader(in));
            String strLine = null;
            //Read File Line By Line
            while (br.readLine() != null) {
                // Print the content on the console
                //System.out.println (strLine);
                strLine += br.readLine();
            }
            //Close the input stream
            in.close();
            return strLine;
        } catch (Exception e) { //Catch exception if any
            System.err.println("Error: " + e.getMessage());
            return "";
        }
    }
}

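This works fine as it is, but now I want to extract the text between the tags <table class="claroTable"> and </table>.

So I replaced my regex string with ".*?<table class=\"claroTable\".*?>(.*?)</table>.*?". I also tried ".*?<table class=\"claroTable\">(.*?)</table>.*?", but it does not work and I do not understand why. There is only one table in the HTML file, but the word "table" also appears in the JavaScript code ("... dataTables.js ..."). Could that be what causes the error?

Thanks in advance for your help.

Edit: the HTML to extract from looks something like this: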

<body> 
..... 
<table class="claroTable"> 
<td><th>some data and manya many tags </td> 
..... 
</table>

I want to extract what is between <table class="claroTable"> and </table>.

Source

2011-08-29 vallllll

If you want to extract data from HTML: use an HTML parser. If you want to learn regex: do **not** use HTML or XML input. Sooner or later you will realize that regex does not work for HTML. –

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – NimChimpsky

@NimChimpsky I had a feeling someone would post this, lol. – Matt

Here is how you can do it with the JSoup parser:

File file = new File("path/to/your/file.html"); 
String charSet = "ISO-8859-1"; 
String innerHtml = Jsoup.parse(file,charSet).select("body").html();

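Yes, you can also do it with a regex somehow, but it will never be this easy.

Update: the main problem with your regex pattern is that you are missing the DOTALL flag: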

Pattern pattern=Pattern.compile(".*?<body.*?>(.*?)</body>.*?",Pattern.DOTALL);
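To make the effect of the flag concrete, here is a minimal sketch. The class name DotAllDemo, the helper matchesBody and the sample input are made up for illustration. By default `.` does not match line terminators, so `matches()` fails on multi-line input. With `Pattern.DOTALL` the same pattern succeeds.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class DotAllDemo {
    static final String BODY_REGEX = ".*?<body.*?>(.*?)</body>.*?";

    // Returns true if the whole input matches the body pattern,
    // compiled with or without the DOTALL flag.
    static boolean matchesBody(String html, boolean dotAll) {
        Pattern p = dotAll
                ? Pattern.compile(BODY_REGEX, Pattern.DOTALL)
                : Pattern.compile(BODY_REGEX);
        return p.matcher(html).matches();
    }

    public static void main(String[] args) {
        String html = "<html>\n<body>\nhello\n</body>\n</html>";
        System.out.println(matchesBody(html, false)); // false: '.' stops at '\n'
        System.out.println(matchesBody(html, true));  // true: '.' also matches '\n'
    }
}
```

Note that the flag only matters when the input string actually contains line breaks.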

If you only want the content of the specified table tag, you can do this:

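// html holds the page source as a String.
// The trailing .* is greedy so that everything after </table> is consumed too.
String tableTag = 
    Pattern.compile(".*?<table.*?claroTable.*?>(.*?)</table>.*",Pattern.DOTALL) 
      .matcher(html) 
      .replaceFirst("$1");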

(Update: this now returns only the contents of the table tag, not the table tag itself.)

Source
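An alternative sketch uses `find()` and reads capture group 1 directly, so the pattern does not need leading or trailing `.*?`. The class TableExtractor and its method are hypothetical names. It assumes the class attribute is written exactly as `class="claroTable"` and that the table contains no nested table. The pattern requires a literal `<table`, so the word "table" inside a script name such as dataTables.js does not interfere.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, a sketch rather than a robust HTML extractor.
class TableExtractor {
    // [^>]* keeps the attribute match inside the opening tag;
    // DOTALL lets (.*?) span the line breaks inside the table.
    private static final Pattern TABLE = Pattern.compile(
            "<table[^>]*class=\"claroTable\"[^>]*>(.*?)</table>",
            Pattern.DOTALL);

    // Returns the inner HTML of the first claroTable, or null if none is found.
    static String extractTable(String html) {
        Matcher m = TABLE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<body>\n<script src=\"dataTables.js\"></script>\n"
                + "<table class=\"claroTable\">\n<tr><td>some data</td></tr>\n</table>\n</body>";
        System.out.println(extractTable(html).trim()); // <tr><td>some data</td></tr>
    }
}
```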

2011-08-29 09:24:48

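Thanks Sean Patrick Floyd, it does work with the body tag, but I want to extract the table tag, and that one does not work: .... data to extract ... so something like Pattern pattern = Pattern.compile(".*?(.*?).*?") – vallllll

@vallllll OK, see my update –

Thanks Sean Patrick Floyd, but it returns the whole HTML string to me, as if nothing had happened. I don't understand, what does replaceFirst(...) do? – vallllll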
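On what `replaceFirst` does: it replaces the first match of the pattern with the replacement string, where `$1` stands for capture group 1. If the pattern matches nowhere, the input comes back unchanged, which is consistent with the behaviour described in the comment above. The sketch below uses hypothetical names and sample input. It also shows that under find semantics a trailing lazy `.*?` matches nothing, so the text after `</table>` survives, whereas a greedy `.*` consumes it.

```java
import java.util.regex.Pattern;

// Hypothetical demo class illustrating Matcher.replaceFirst.
class ReplaceFirstDemo {
    static final String LAZY_TAIL = ".*?<table.*?claroTable.*?>(.*?)</table>.*?";
    static final String GREEDY_TAIL = ".*?<table.*?claroTable.*?>(.*?)</table>.*";

    // Replaces the first match of regex in input with the text of group 1.
    static String keepGroupOne(String regex, String input) {
        return Pattern.compile(regex, Pattern.DOTALL)
                .matcher(input)
                .replaceFirst("$1");
    }

    public static void main(String[] args) {
        String html = "<p>intro</p>\n<table class=\"claroTable\">\nrows\n</table>\n<p>end</p>";
        // Greedy tail: the whole input is matched, only group 1 remains.
        System.out.println("[" + keepGroupOne(GREEDY_TAIL, html) + "]");
        // Lazy tail: the match stops at </table>, so the rest of the input is kept.
        System.out.println("[" + keepGroupOne(LAZY_TAIL, html) + "]");
        // No match: the input is returned unchanged.
        System.out.println("[" + keepGroupOne(GREEDY_TAIL, "<p>no table here</p>") + "]");
    }
}
```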

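As mentioned before, this is a bad place to use a regex. Only use regex when you actually need it, so basically try to stay away from it. Take a look at this post about parsers, though: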

How to parse and modify HTML file in Java

Source

2011-08-29 09:20:05 Matt

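To Andreas_D and Matt: I know, but I have to use it. The whole point here is to use regex, I have no choice. The programming language does not matter, but using regex is a requirement, so I would really appreciate some help. – vallllll

@vallllll OK, I have updated my answer to actually address your regex problem. –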
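One more thing worth checking is the question's readFile(). It calls `readLine()` once in the loop condition and again in the loop body, so every line read by the condition is discarded. In addition, `strLine` starts as `null`, so the result begins with the text "null". If the opening table tag happens to sit on a discarded line, no table pattern can match, which may explain the behaviour reported above. `readLine()` also strips line terminators, so the string built there contains no line breaks at all. Below is a minimal sketch of a reader that keeps every line and the line breaks. The class and method names are hypothetical, and it takes a `Reader` so it can be tried without a file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper; a sketch of reading all lines without skipping any.
class HtmlReader {
    // Reads everything from source, joining lines with '\n'.
    static String readAll(Reader source) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            // Read once per iteration and reuse the value in the body.
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = readAll(new StringReader("<body>\n<table class=\"claroTable\">\n</table>"));
        System.out.print(text);
    }
}
```

For the file in the question, the same method could be called with `new java.io.FileReader("user.html")`. Once line breaks are preserved, the pattern needs `Pattern.DOTALL` to match across them.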
