Regex file scraping

I am using a regular expression to scrape an email address from a file.

Unfortunately, my regex does not match this string:

" <font size=-1><a href=mailto:[email protected]>_ MR NOURS _</a></font> ";

I would like to understand why, and I hope someone on Stack Overflow can explain what is wrong with my pattern.

Here is the code I am testing with:

#include <stdio.h>
#include <stdlib.h>
#include <regex.h>

int main(void)
{
    int match;
    int err;
    regex_t preg;
    regmatch_t pmatch[5];
    size_t nmatch = 5;
    const char *str_request = "   <font size=-1><a href=mailto:[email protected]>_ MR NOURS _</a></font>   ";
    const char *str_regex = "[a-zA-Z0-9][a-zA-Z0-9_.][email protected][a-zA-Z0-9_]+\\.(com|net|[a-zA-Z]{2})$";

    err = regcomp(&preg, str_regex, REG_EXTENDED);
    if (err == 0)
    {
        match = regexec(&preg, str_request, nmatch, pmatch, 0);
        regfree(&preg);
        if (match == 0)
        {
            /* pmatch[0] holds the byte offsets of the whole match */
            int start = (int)pmatch[0].rm_so;
            int end = (int)pmatch[0].rm_eo;
            printf("match\n");
            printf("%d - %d\n", start, end);
        }
        else if (match == REG_NOMATCH)
        {
            printf("unmatch\n");
        }
    }
    puts("\nPress any key\n");
    getchar();
    return (EXIT_SUCCESS);
}

2016-04-21 joelmoluni

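Remove the '$' from the pattern.

@AeroX: this question has nothing to do with HTML.

@AndreaCorbellini Given the HTML markup in the example string, I suspect this may turn out to be an XY problem. The OP mentions scraping and supplies an HTML string, which suggests they may want to scrape web pages later. That is why the possible duplicate points them toward HTML parsing rather than regex. – AeroX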

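I suspect you are trying to match the address as a whole word, which is why you anchored the pattern with $ (end of string) at the end. However, the substring you are looking for is not at the end of the input string, so the anchored pattern cannot match.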
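
To see the effect of the anchor in isolation, here is a minimal sketch. The helper name `pattern_matches`, the simplified pattern, and the address `user@example.com` are all made up for illustration; they are not taken from the question.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `pattern` matches somewhere in `text`,
 * 0 if it does not, and -1 if the pattern fails to compile. */
static int pattern_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    /* REG_NOSUB: we only want a yes/no answer, not match offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With this helper, `pattern_matches("[a-z]+@[a-z]+\\.com$", "<a href=mailto:user@example.com>x</a>")` returns 0 because markup follows the address, while the same pattern against `"mailto:user@example.com"` returns 1. Dropping the `$` makes the first call return 1 as well.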

Since regex.h does not portably support word boundaries (such as \b), you can use a workaround:

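const char *str_regex = "([a-zA-Z0-9][a-zA-Z0-9_.][email protected][a-zA-Z0-9_]+\\.(com|net|[a-zA-Z]{2}))([^a-zA-Z]|$)";

The added trailing group ([^a-zA-Z]|$) requires either a non-letter or the end of the string right after the address. The address itself is then held in capture group 1.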

See this C demo (IDEONE):

#include <stdio.h>
#include <stdlib.h>
#include <regex.h>

int main(void)
{
    int match;
    int err;
    regex_t preg;
    regmatch_t pmatch[5];
    size_t nmatch = 4; // We have 4 groups as a result of matching: 0 - the whole match, and 3 capture groups
    const char *str_request = "   <font size=-1><a href=mailto:[email protected]>_ MR NOURS _</a></font>   ";
    const char *str_regex = "([a-zA-Z0-9][a-zA-Z0-9_.][email protected][a-zA-Z0-9_]+\\.(com|net|[a-zA-Z]{2}))([^a-zA-Z]|$)";

    err = regcomp(&preg, str_regex, REG_EXTENDED);
    if (err == 0)
    {
        match = regexec(&preg, str_request, nmatch, pmatch, 0);
        regfree(&preg);
        if (match == 0)
        {
            int start = (int)pmatch[1].rm_so; // <- Changed from 0 to 1
            int end = (int)pmatch[1].rm_eo;   // <- Changed from 0 to 1
            printf("match\n");
            // Added: also print the captured substring (group 1).
            // The precision for %.*s must be an int, so use end - start.
            printf("%d - %d\n\"%.*s\"", start, end, end - start, &str_request[start]);
        }
        else if (match == REG_NOMATCH)
        {
            printf("unmatch\n");
        }
    }
    puts("\nPress any key\n");
    getchar();
    return (EXIT_SUCCESS);
}

Or simply remove the $ if you do not care about whole-word matching.
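
To make the trade-off between the two options concrete, here is a rough sketch. The helper `first_group`, the macro names, and the sample addresses are hypothetical, and the pattern assumes a plain `+@` between the local part and the domain, which is my reading rather than something confirmed above.

```c
#include <regex.h>
#include <string.h>

/* Assumed pattern: "+@" joins the local part and the domain. */
#define EMAIL "[a-zA-Z0-9][a-zA-Z0-9_.]+@[a-zA-Z0-9_]+\\.(com|net|[a-zA-Z]{2})"
#define EMAIL_LOOSE   EMAIL                        /* no anchor, no boundary */
#define EMAIL_BOUNDED "(" EMAIL ")([^a-zA-Z]|$)"   /* address in group 1 */

/* Hypothetical helper: copies the text of capture group `group` from the
 * first match of `pattern` in `text` into `out` (truncating if needed).
 * Returns 1 on a match, 0 on no match, -1 on a compile error or bad group. */
static int first_group(const char *pattern, const char *text,
                       size_t group, char *out, size_t out_size)
{
    regex_t re;
    regmatch_t pm[4];
    size_t len;
    int rc;

    if (out_size == 0 || group >= 4)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, text, 4, pm, 0);
    regfree(&re);
    /* Groups that did not take part in the match have rm_so == -1. */
    if (rc != 0 || pm[group].rm_so < 0)
        return 0;

    len = (size_t)(pm[group].rm_eo - pm[group].rm_so);
    if (len >= out_size)
        len = out_size - 1;
    memcpy(out, text + pm[group].rm_so, len);
    out[len] = '\0';
    return 1;
}
```

On `"<a href=mailto:user@example.com>x</a>"`, both `first_group(EMAIL_LOOSE, text, 0, ...)` and `first_group(EMAIL_BOUNDED, text, 1, ...)` yield `user@example.com`. The difference shows on an input such as `"user@example.community"`: the loose pattern still reports `user@example.com`, while the bounded one finds no match because a letter follows the top-level domain.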

2016-04-21 09:50:11

Please check the answer, and if it works for you, consider accepting it.
