Java regex to clean MediaWiki markup

Possible duplicate:
Wikipedia : Java library to remove wikipedia text markup removal

I have to clean some content that comes from Confluence. The content is almost clean; however, it contains things like:

[link|]: a link without the URL part
*[link|]*: a bold link (without the URL part)
*text*: bold text
_*text*_: italic bold text

and so on. I need to write a regex that cleans all of that, so I did something like:

String wikiCleanMarkupRegex = "\\[(.*?)[\\|.*?]?\\]|\\*(.*?)\\*|_(.*?)_";

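But this doesn't clean everything. I mean, if I give it the link from #2 above (*[link|]*), I get:

[link|]

which is not what I want; I want to get "link". So I need to re-parse the string again and again until no other match is found.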

This is really slow, because there are millions of records to clean. So, is there a way to do it all with one regex, in a single pass?
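The repeated re-parsing described above can be reproduced with a small sketch. The regex is the one from the question; the "$1$2$3" replacement string and the class and method names are assumptions made for illustration, since the question does not show the replace call itself.

```java
import java.util.regex.Pattern;

class RepeatedPassDemo {
    // The regex from the question, compiled once.
    static final Pattern MARKUP =
            Pattern.compile("\\[(.*?)[\\|.*?]?\\]|\\*(.*?)\\*|_(.*?)_");

    // One pass: keep whichever capture group matched.
    // Groups that did not take part in the match expand to "".
    // The "$1$2$3" replacement is an assumption, not taken from the question.
    static String onePass(String input) {
        return MARKUP.matcher(input).replaceAll("$1$2$3");
    }

    // Re-apply the pass until the string stops changing.
    static String untilStable(String input) {
        String previous;
        String current = input;
        do {
            previous = current;
            current = onePass(previous);
        } while (!current.equals(previous));
        return current;
    }

    public static void main(String[] args) {
        System.out.println(onePass("*[link|]*"));     // one pass only strips the outer layer
        System.out.println(untilStable("*[link|]*")); // repeated passes strip every layer
    }
}
```

With the input *[link|]*, the bold alternative matches the whole string first, so a single pass leaves [link|] behind and only a further pass yields link. Each extra nesting level costs one more pass over the record.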

Thanks a lot.


2012-10-11 user1739166

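Also, if I have something like _*[link|]*_, a bold and italic link (without the URL part), I need to parse it 3 times: once to remove the italics, once to remove the bold, and once to remove the brackets... this is too slow for what I need – user1739166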

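As it seems, there are basically three types of formatting: italic, bold, and link.

I would do 3 passes of regex replacement.

And the order of precedence, according to the input you gave, should be: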

/** 
* FIRST REMOVE ITALICS, THEN BOLD, THEN URL 
*/ 
public static String cleanWikiFormat(CharSequence sequence) { 
    return Test.removeUrl(Test.removeBold(Test.removeItalic(sequence))); 
}
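To see why the order matters, here is a small sketch that applies the italic and bold patterns from this answer in both orders. The class and method names are illustrative only, and String.replaceAll is used here just to keep the demo short.

```java
class PassOrderDemo {
    // Same patterns as the answer's removeItalic and removeBold methods.
    static final String ITALIC_BOLD = "_\\*(.+?)\\*_";
    static final String BOLD = "\\*(.+?)\\*";

    // Order recommended in the answer: strip _*...*_ first, then *...*.
    static String italicThenBold(String s) {
        return s.replaceAll(ITALIC_BOLD, "$1").replaceAll(BOLD, "$1");
    }

    // Reversed order: the bold pass consumes the asterisks that the
    // italic pattern needs, so the underscores are left behind.
    static String boldThenItalic(String s) {
        return s.replaceAll(BOLD, "$1").replaceAll(ITALIC_BOLD, "$1");
    }

    public static void main(String[] args) {
        String input = "_*formatting*_";
        System.out.println(italicThenBold(input));
        System.out.println(boldThenItalic(input));
    }
}
```

For the input _*formatting*_, the first order prints formatting, while the reversed order prints _formatting_ because the italic pattern no longer finds its asterisks.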

Here is some sample code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {

    private static String removeItalic(CharSequence sequence) {
        Pattern patt = Pattern.compile("_\\*(.+?)\\*_");
        Matcher m = patt.matcher(sequence);
        StringBuffer sb = new StringBuffer(sequence.length());
        while (m.find()) {
            String text = m.group(1);
            // ... possibly process 'text' ...
            m.appendReplacement(sb, Matcher.quoteReplacement(text));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    private static String removeBold(CharSequence sequence) {
        Pattern patt = Pattern.compile("\\*(.+?)\\*");
        Matcher m = patt.matcher(sequence);
        StringBuffer sb = new StringBuffer(sequence.length());
        while (m.find()) {
            String text = m.group(1);
            // ... possibly process 'text' ...
            m.appendReplacement(sb, Matcher.quoteReplacement(text));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    private static String removeUrl(CharSequence sequence) {
        Pattern patt = Pattern.compile("\\[(.+?)\\|\\]");
        Matcher m = patt.matcher(sequence);
        StringBuffer sb = new StringBuffer(sequence.length());
        while (m.find()) {
            String text = m.group(1);
            // ... possibly process 'text' ...
            m.appendReplacement(sb, Matcher.quoteReplacement(text));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static String cleanWikiFormat(CharSequence sequence) {
        return Test.removeUrl(Test.removeBold(Test.removeItalic(sequence)));
    }

    public static void main(String[] args) {
        String text = "[hello|] this is just a *[test|]* to clean wiki *type* and _*formatting*_";
        System.out.println("Original");
        System.out.println(text);
        text = Test.cleanWikiFormat(text);
        System.out.println("CHANGED");
        System.out.println(text);
    }
}

This will output:

Original 
[hello|] this is just a *[test|]* to clean wiki *type* and _*formatting*_ 
CHANGED 
hello this is just a test to clean wiki type and formatting
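Since the question mentions millions of records, one possible variation is to compile each pattern once and reuse it, instead of calling Pattern.compile on every invocation. The following is only a sketch under that assumption: the WikiCleaner class and its clean method are hypothetical names, the patterns are the same three as above, and whether this is fast enough would have to be measured on real data.

```java
import java.util.regex.Pattern;

class WikiCleaner {
    // Compiled once; Pattern instances are immutable and safe to share.
    private static final Pattern ITALIC_BOLD = Pattern.compile("_\\*(.+?)\\*_");
    private static final Pattern BOLD = Pattern.compile("\\*(.+?)\\*");
    private static final Pattern LINK = Pattern.compile("\\[(.+?)\\|\\]");

    // Still three passes, in the same order: italic-bold, bold, link.
    static String clean(CharSequence input) {
        String s = ITALIC_BOLD.matcher(input).replaceAll("$1");
        s = BOLD.matcher(s).replaceAll("$1");
        return LINK.matcher(s).replaceAll("$1");
    }

    public static void main(String[] args) {
        String text = "[hello|] this is just a *[test|]* to clean wiki *type* and _*formatting*_";
        System.out.println(WikiCleaner.clean(text));
    }
}
```

On the sample string this produces the same cleaned line as the output shown above. It does not reduce the number of passes; it only avoids recompiling the three patterns for every record.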


2012-10-11 20:18:45 gtgaxiola
