I have the following Java regex replace logic, which I would like to turn into a plain Java algorithm:

text.replaceAll("(?i)(" + keyword + ")(?!([^<]+)?>)", "<b>$1</b>");

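What it does is take a keyword and look for it in an HTML page, ignoring case and the contents of HTML tags. It captures each keyword it finds and wraps it in <b></b> tags.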

How can I do this with a StringBuilder or StringBuffer, and perhaps a HashMap? The goal is better performance.

UPDATE

Using the new Commons Lang 3 beta package, I wrote the following method:

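public static String highlight(String text, String q) {
    // Split on spaces so each token is either a word or a piece of a tag.
    String[] textAr = StringUtils.split(text, " ");
    int len = textAr.length;
    int index = 0;
    while (index < len) {
        // Skip the tokens that make up an HTML tag; the bound check
        // keeps an unclosed tag from running past the end of the array.
        if (textAr[index].startsWith("<")) {
            while (index < len - 1 && !textAr[index].endsWith(">")) {
                index++;
            }
        }
        if (StringUtils.equalsIgnoreCase(textAr[index], q)) {
            textAr[index] = "<b>" + textAr[index] + "</b>";
        }
        index++;
    }
    return StringUtils.join(textAr, " ");
}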

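After running some tests, I got about a 10% performance improvement with the solution above. Any suggestions on how to improve it further without regex would be much appreciated.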


2011-01-23 Mat B.

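Forgive me for nagging, but "don't parse HTML with regex" is one of the most common recommendations on SO. You can find many related answers in which SO's highest-rated users stress this point. – 2011-01-23 16:30:18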

I do agree with Nikita: the best way to parse HTML is to use an HTML or XML parser.

But if you really need this, here are some tips.

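StringBuffer is the thread-safe version of StringBuilder, so if you don't need thread safety, or if thread safety is handled by another layer, use StringBuilder.
StringBuilder does not support replacement using patterns. String does. But using String directly is inefficient when the number of keywords is high.
So the most efficient approach is to generate one pattern that contains all the keywords and then perform a single replace operation. For example, if you have the keywords foo, bar, tar, create a regex like regex = (?i)(foo|bar|tar)(?!([^<]+)?>)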

Now run text.replaceAll(regex, "<b>$1</b>");

When creating the regex you can use a StringBuilder, but I'd suggest using StringUtils.join() from the Jakarta utils, or a similar utility from Guava, to join the keywords.
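A minimal sketch of the single-pattern idea, assuming Java 8 or later so that the standard library's Collectors.joining can stand in for StringUtils.join(). The class and method names (KeywordPattern, build, highlight) are made up for illustration, Pattern.quote is added so keywords with regex metacharacters stay literal, and the lookahead is assumed to end in a single '>'.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of any library.
class KeywordPattern {

    // Builds one case-insensitive pattern matching any keyword outside a tag.
    static Pattern build(List<String> keywords) {
        String alternatives = keywords.stream()
                .map(Pattern::quote)                 // escape regex metacharacters
                .collect(Collectors.joining("|"));   // foo|bar|tar
        return Pattern.compile("(?i)(" + alternatives + ")(?!([^<]+)?>)");
    }

    // Performs a single replace pass for all keywords.
    static String highlight(String text, List<String> keywords) {
        return build(keywords).matcher(text).replaceAll("<b>$1</b>");
    }

    public static void main(String[] args) {
        String html = "<p class=\"foo\">Foo and BAR</p>";
        System.out.println(highlight(html, Arrays.asList("foo", "bar", "tar")));
    }
}
```

If the keyword list is fixed, the Pattern returned by build() could be kept and reused rather than rebuilt per call.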


2011-01-23 16:41:37 AlexR

You can't parse HTML with an XML parser, because HTML is not XML (unless it also happens to be XHTML). – KitsuneYMG 2011-01-24 00:15:02

You may want to escape the keyword, just in case:

Pattern p = Pattern.compile("(?i)(" + Pattern.quote(keyword) + ")(?!([^<]+)?>)");

Then you need to create a matcher:

Matcher m = p.matcher(myInputString);

If the input doesn't match, you're done:

if (!m.find()) { return myInputString; }

Otherwise allocate an output buffer:

StringBuilder out = new StringBuilder(myInputString.length() + 16);

and mark every instance of the keyword in bold:

int nCharsProcessed = 0; 
do { 
    out.append(myInputString, nCharsProcessed, m.start(1)) 
    .append("<b>") 
    .append(m.group(1)) 
    .append("</b>"); 
    nCharsProcessed = m.end(1); 
} while (m.find());

Finally, append the part after the last match and return:

out.append(myInputString, nCharsProcessed, myInputString.length()); 
return out.toString();
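Put together, the steps above can be sketched as one method. This assumes the pattern is compiled with Pattern.compile and that the lookahead ends in a single '>'. The class name MatcherHighlighter and the method name highlight are placeholders chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the steps described in this answer.
class MatcherHighlighter {

    static String highlight(String myInputString, String keyword) {
        // Quote the keyword so regex metacharacters in it are matched literally.
        Pattern p = Pattern.compile("(?i)(" + Pattern.quote(keyword) + ")(?!([^<]+)?>)");
        Matcher m = p.matcher(myInputString);
        if (!m.find()) {
            return myInputString;            // nothing to highlight, no copy needed
        }
        StringBuilder out = new StringBuilder(myInputString.length() + 16);
        int nCharsProcessed = 0;
        do {
            // Copy the text before the match, then the match wrapped in <b> tags.
            out.append(myInputString, nCharsProcessed, m.start(1))
               .append("<b>")
               .append(m.group(1))
               .append("</b>");
            nCharsProcessed = m.end(1);
        } while (m.find());
        // Copy whatever follows the last match.
        out.append(myInputString, nCharsProcessed, myInputString.length());
        return out.toString();
    }
}
```

The early return means input without any match is handed back as the same String instance, without allocating a buffer.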


2011-01-23 16:36:14

~~replaceAll already works with StringBuffers anyway. (Well, to be precise, Matcher.replaceAll() uses a StringBuffer, but String.replaceAll just delegates to Matcher.replaceAll().)~~

For better performance, you could build the regex string using a StringBuffer:

String head = "(?i)(";
String tail = ")(?!([^<]+)?>)";

StringBuffer regex = new StringBuffer();
regex.append(head);
regex.append(keyword);
regex.append(tail);

// Strings are immutable, so keep the result of replaceAll.
text = text.replaceAll(regex.toString(), "<b>$1</b>");

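I really don't know whether there is a faster replace implementation than the Matcher class. But before you implement it with a StringBuffer, I wanted to tell you that it is already implemented that way.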
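Because String.replaceAll compiles its regex on every call before delegating to Matcher.replaceAll, one thing that can be saved when the same keyword is highlighted in many texts is the repeated compilation. A minimal sketch, with the class name PrecompiledHighlighter invented for this example and the lookahead assumed to end in a single '>':

```java
import java.util.regex.Pattern;

// Hypothetical class: compiles the pattern once and reuses it for every text.
class PrecompiledHighlighter {

    private final Pattern pattern;

    PrecompiledHighlighter(String keyword) {
        this.pattern = Pattern.compile("(?i)(" + Pattern.quote(keyword) + ")(?!([^<]+)?>)");
    }

    // Same replacement as text.replaceAll(regex, "<b>$1</b>"), minus the compile step.
    String highlight(String text) {
        return pattern.matcher(text).replaceAll("<b>$1</b>");
    }
}
```

Whether this is measurably faster depends on how often the method is called and how large the texts are, so it is worth benchmarking rather than assuming.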

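The following pseudocode may be buggy, but you could try something like this. (Better performance is not guaranteed, but this should do the same as the above without regex.)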

StringBuffer sb = new StringBuffer(text);

int i = 0;
int size = sb.length();
while (i < size) {
    if (sb.charAt(i) == '<') {
        increase i until you find '>';
    }
    if (sb.charAt(i) == keyword.charAt(0)
            && next chars of sb match next chars of keyword) {
        insert "<b>" before and "</b>" after the keyword;
        size += 7;
        i += keyword.length() + 7;
    } else {
        i++;
    }
}
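A runnable take on the same idea, offered as a sketch rather than a tuned implementation. It appends to a fresh StringBuilder instead of inserting into the buffer in place, and uses String.regionMatches for the case-insensitive comparison. The class name ScanHighlighter is made up for this example, and an unclosed '<' is assumed to run to the end of the text.

```java
// Hypothetical regex-free highlighter: one pass over the characters.
class ScanHighlighter {

    static String highlight(String text, String keyword) {
        if (keyword.isEmpty()) {
            return text;                     // an empty keyword would match everywhere
        }
        StringBuilder out = new StringBuilder(text.length() + 16);
        int n = text.length();
        int k = keyword.length();
        int i = 0;
        while (i < n) {
            char c = text.charAt(i);
            if (c == '<') {
                // Copy the whole tag untouched, up to and including '>'.
                int close = text.indexOf('>', i);
                int end = (close < 0) ? n : close + 1;
                out.append(text, i, end);
                i = end;
            } else if (text.regionMatches(true, i, keyword, 0, k)) {
                // Case-insensitive match: wrap the original characters in <b> tags.
                out.append("<b>").append(text, i, i + k).append("</b>");
                i += k;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

Appending to a new builder avoids the repeated shifting of characters that inserting into the middle of a StringBuffer would cause.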

You may also want to take a look at Matcher's implementation of replaceAll: http://kickjava.com/src/java/util/regex/Matcher.java.htm


2011-01-23 16:36:41 myAces

Added an update – myAces 2011-01-23 17:37:50

This should be really useful: http://stackoverflow.com/questions/2861/options-for-html-scraping – myAces 2011-01-23 17:43:47

Split, then concatenate everything in a StringBuffer:

 
import java.io.*; 
import java.util.*; 


class Hilighter { 

     public static String regex(String text, String key) { 
       System.out.println(System.currentTimeMillis()); 
       text = text.replaceAll("(?i)(" + key + ")(?!([^<]+)?>)", "<b>$1</b>");
       System.out.println(System.currentTimeMillis()); 
       return text; 
     } 


     public static String splitr(String text, String key) { 
       System.out.println(System.currentTimeMillis()); 
       String[] parts = text.split(key); 
       StringBuffer buffer = new StringBuffer(); 
       buffer.append(parts[0]); 
       for (int i = 1; i < parts.length; i++) { 
         buffer.append("<b>"); 
         buffer.append(key); 
         buffer.append("</b>"); 
         buffer.append(parts[i]); 
       } 
       System.out.println(System.currentTimeMillis()); 
       return buffer.toString(); 
     } 


     public static void main(String[] args) { 
       try { 
         String text = readFileAsString("./test.html"); 
         text = splitr(text, args[0]); 
         text = regex(text, args[0]); 
       } catch (Exception e) { 
         System.err.println("IO ERROR"); 
       } 
     } 


     private static String readFileAsString(String filePath) throws java.io.IOException{ 
       StringBuffer fileData = new StringBuffer(1000); 
       BufferedReader reader = new BufferedReader(new FileReader(filePath)); 
       char[] buf = new char[1024]; 
       int numRead=0; 
       while((numRead=reader.read(buf)) != -1){ 
        String readData = String.valueOf(buf, 0, numRead); 
        fileData.append(readData); 
        buf = new char[1024]; 
       } 
       reader.close(); 
       return fileData.toString(); 
     } 



}


2011-01-23 16:42:38

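Note that split() also uses regex. If you really need something that does not involve regex, you will have to loop over the string yourself. Or use indexOf() to find the first match, then check whether it follows a less-than sign.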
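A small illustration of why this matters: because split() treats its argument as a regex, a key containing metacharacters behaves differently from a literal one. The class SplitDemo and its parts helper are invented for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo of String.split interpreting its argument as a regex.
class SplitDemo {

    // Returns how many pieces split() produces, with or without quoting the key.
    static int parts(String text, String key, boolean literal) {
        String regex = literal ? Pattern.quote(key) : key;
        return text.split(regex).length;
    }

    public static void main(String[] args) {
        // "." is a regex wildcard, so every character is treated as a separator.
        System.out.println(parts("a.b.c", ".", false));
        // Quoted, the dot is matched literally and the text splits into a, b, c.
        System.out.println(parts("a.b.c", ".", true));
    }
}
```

split() also drops trailing empty strings and matches case-sensitively by default, so a keyword at the very end of the text, or one in different case, is not handled by a split-based highlighter.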

I don't think you mean that regex can't be used at all, though. I think you mean that Pattern shouldn't be used directly.


2011-01-23 18:09:10
