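Filtering a document with regular expressions

I am trying to find the best solution for validating an input document. I need to check every line of the document. Basically, each line may contain one or more invalid characters. The result of the search (validation) should be: the index of each line that contains invalid characters, plus the index of every invalid character within that line.

I know how to do this the standard way (open the file -> read all lines -> check the characters one by one), but that approach is not the best optimized. Instead, the best solution (in my opinion) would be to use MatchCollection.

But how do I do this correctly in C#?

Link: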

http://www.dotnetperls.com/regex-matches

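Example:

"Some Înput text here,\nÎs another lÎne of thÎs text."

In line [0] an invalid character is found at index [5]; in line [1] invalid characters are found at indexes [0, 12, 21].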

using System;
using System.Text.RegularExpressions;

namespace RegularExpresion
{
    class Program
    {
        private static Regex regex = null;

        static void Main(string[] args)
        {
            string input_text = "Some Înput text here, Îs another lÎne of thÎs text.";

            string line_pattern = "\n";

            string invalid_character = "Î";

            regex = new Regex(line_pattern);

            // Check whether the document has multiple lines or a single line
            if (IsMultipleLine(input_text))
            {
                // ---> How to do this correctly for each line? <---
            }
            else
            {
                Console.WriteLine("Is a single line file");

                regex = new Regex(invalid_character);

                MatchCollection mc = regex.Matches(input_text);

                Console.WriteLine($"How many matches: {mc.Count}");

                foreach (Match match in mc)
                    Console.WriteLine($"Index: {match.Index}");
            }

            Console.ReadKey();
        }

        public static bool IsMultipleLine(string input) => regex.IsMatch(input);
    }
}

Output:

Is a single line file
How many matches: 4
Index: 5
Index: 22
Index: 34
Index: 43

Source

2016-09-04 Nerus

What is an *"invalid character"*? The *standard way* may well be faster; post some code.

I suspect you want to match any letter that is not ASCII. Try 'Regex.Matches(s, @"[\p{L}-[a-zA-Z]]")'. However, this does not give you any line index information.
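The pattern suggested in this comment can be combined with a split on '\n' to recover the line index as well. The following is only a minimal sketch: the class name NonAsciiLetterFinder and the method Find are made up for illustration, and it assumes that "invalid" means "a letter outside a-z and A-Z", which the question does not confirm.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original thread.
public static class NonAsciiLetterFinder
{
    // Character class subtraction: any Unicode letter (\p{L}) except a-z and A-Z.
    private static readonly Regex NonAsciiLetter = new Regex(@"[\p{L}-[a-zA-Z]]");

    // Returns (line index, index within the line) pairs, both zero-based.
    public static List<(int Line, int Column)> Find(string text)
    {
        var hits = new List<(int Line, int Column)>();

        // Splitting on '\n' keeps empty lines, so line indexes match the document.
        string[] lines = text.Split('\n');

        for (int i = 0; i < lines.Length; i++)
        {
            foreach (Match m in NonAsciiLetter.Matches(lines[i]))
                hits.Add((i, m.Index));
        }

        return hits;
    }

    public static void Main()
    {
        string text = "Some Înput text here,\nÎs another lÎne of thÎs text.";

        foreach (var (line, column) in Find(text))
            Console.WriteLine($"Line: {line}, Index: {column}");
    }
}
```

For the sample string from the question this reports index 5 on line 0 and indexes 0, 12 and 21 on line 1.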

As in the code, I could not find a multi-line solution that uses MatchCollection. – Nerus

Link: http://www.dotnetperls.com/regexoptions-multiline

SOLUTION

using System;
using System.Text.RegularExpressions;

namespace RegularExpresion
{
    class Program
    {
        private static Regex regex = null;

        static void Main(string[] args)
        {
            string input_text = @"Some Înput text here,
Îs another lÎne of thÎs text.";

            string line_pattern = "\n";

            string invalid_character = "Î";

            regex = new Regex(line_pattern);

            // Check whether the document has multiple lines or a single line
            if (IsMultipleLine(input_text))
            {
                Console.WriteLine("Is a multiple line file");

                // With RegexOptions.Multiline, ^ and $ match at the start and end of each line.
                // Note: "." also matches '\r', and ".+" skips empty lines,
                // so blank lines are not counted in the line number below.
                MatchCollection matches = Regex.Matches(input_text, "^(.+)$", RegexOptions.Multiline);

                int line = 0;

                foreach (Match match in matches)
                {
                    foreach (Capture capture in match.Captures)
                    {
                        line++;

                        Console.WriteLine($"Line: {line}");

                        RegexpLine(capture.Value, invalid_character);
                    }
                }
            }
            else
            {
                Console.WriteLine("Is a single line file");

                RegexpLine(input_text, invalid_character);
            }

            Pause();
        }

        public static bool IsMultipleLine(string input) => regex.IsMatch(input);

        // Prints the number of matches of the given pattern in one line, and the index of each match
        public static void RegexpLine(string line, string characters)
        {
            regex = new Regex(characters);

            MatchCollection mc = regex.Matches(line);

            Console.WriteLine($"How many matches: {mc.Count}");

            foreach (Match match in mc)
                Console.WriteLine($"Index: {match.Index}");
        }

        public static ConsoleKeyInfo Pause(string message = "please press ANY key to continue...")
        {
            Console.WriteLine(message);

            return Console.ReadKey();
        }
    }
}

Thanks for the help, guys. It would be great if someone smarter than me could check this code in terms of performance.

Regards, Nerus.
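On the performance question, one variation worth measuring is to scan the whole text once instead of first extracting each line and then matching inside it. The sketch below is an assumption-laden alternative, not the method used above: SinglePassScanner and Scan are hypothetical names, and whether it is actually faster would have to be benchmarked on real input.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original thread.
public static class SinglePassScanner
{
    // Returns (line index, index within the line) pairs, both zero-based.
    public static List<(int Line, int Column)> Scan(string text, string invalidPattern)
    {
        // One alternation: either a newline, or an invalid character (named group "bad").
        var scanner = new Regex(@"\n|(?<bad>" + invalidPattern + ")");

        var hits = new List<(int Line, int Column)>();
        int line = 0;
        int lineStart = 0; // offset of the first character of the current line

        foreach (Match m in scanner.Matches(text))
        {
            if (m.Groups["bad"].Success)
            {
                // Convert the offset in the whole text to an index within the line.
                hits.Add((line, m.Index - lineStart));
            }
            else
            {
                // A newline was matched: the next line starts right after it.
                line++;
                lineStart = m.Index + 1;
            }
        }

        return hits;
    }

    public static void Main()
    {
        string text = "Some Înput text here,\nÎs another lÎne of thÎs text.";

        foreach (var (line, column) in Scan(text, "Î"))
            Console.WriteLine($"Line: {line}, Index: {column}");
    }
}
```

Because line numbers are advanced on every '\n', empty lines are counted too, which differs from the "^(.+)$" approach above.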

Source

2016-09-04 11:38:24 Nerus

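My approach is to split the string into an array of strings, each containing one line. If the length of the array is only 1, that means you have only one line. From there, you run the regular expression against each line to find the invalid characters you are looking for.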

string input_text = "Some Înput text here,\nÎs another lÎne of thÎs text.";
string line_pattern = "\n";

// split the string into an array of lines
string[] input_texts = input_text.Split(new string[] { line_pattern }, StringSplitOptions.RemoveEmptyEntries);

string invalid_character = "Î";

if (input_texts != null && input_texts.Length > 0)
{
    if (input_texts.Length == 1)
    {
        Console.WriteLine("Is a single line file");
    }

    // loop over every line
    foreach (string oneline in input_texts)
    {
        Regex regex = new Regex(invalid_character);

        MatchCollection mc = regex.Matches(oneline);

        Console.WriteLine("How many matches: {0}", mc.Count);

        foreach (Match match in mc)
        {
            Console.WriteLine("Index: {0}", match.Index);
        }
    }
}

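--- EDIT ---

Things to consider:

If your input comes from a file, I suggest reading it line by line rather than loading the whole text.
Usually, when you search for invalid characters, you do not specify them one by one. Instead, you look for a pattern, for example: characters that are not in a-z, A-Z, 0-9. Your regular expression would then look a bit different.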
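Both points can be sketched together as follows. This is an illustration under stated assumptions, not a definitive implementation: LineByLineValidator and Validate are hypothetical names, and the allowed set (ASCII letters, digits, space, '.' and ',') is assumed purely so that the sample text from the question produces the same hits. File.ReadLines enumerates the file lazily, one line at a time.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original thread.
public static class LineByLineValidator
{
    // Assumed whitelist for illustration: anything NOT in this set counts as invalid.
    private static readonly Regex Invalid = new Regex(@"[^a-zA-Z0-9 .,]");

    // Yields (line index, index within the line) pairs, both zero-based.
    public static IEnumerable<(int Line, int Column)> Validate(IEnumerable<string> lines)
    {
        int lineIndex = 0;

        foreach (string line in lines)
        {
            foreach (Match m in Invalid.Matches(line))
                yield return (lineIndex, m.Index);

            lineIndex++;
        }
    }

    public static void Main()
    {
        // Write the sample to a temporary file so the example is self-contained.
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "Some Înput text here,\nÎs another lÎne of thÎs text.");

        // File.ReadLines streams the file instead of loading it all at once.
        foreach (var (line, column) in Validate(File.ReadLines(path)))
            Console.WriteLine($"Line: {line}, Index: {column}");

        File.Delete(path);
    }
}
```

Validate accepts any sequence of lines, so the same method also works on the result of a Split call like the one shown earlier in this answer.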

Source

2016-09-04 12:55:11 kurakura88
