Bash regex: read a file line by line and extract every href with a capture group

I am trying to read a file line by line and extract every anchor tag's href using a capture group.

So far I have:

regex="(<a href=\")([A-Za-z0-9:/._-]+)\".*(<\/a>)"
while read line; do
    if [[ $line =~ $regex ]]; then
        #echo ${BASH_REMATCH}
        href=${BASH_REMATCH[2]}
        echo $href
    fi
done < file.txt

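This almost works, in that it captures the URL as required. The problem is that when a line contains two or more anchor <a> tags, my regex falls short: only the first anchor tag is captured.

So I am wondering whether there is a way to capture all of the repeated groups.

Sample text: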

This paragraph has only one anchor tag, <a href="http://google.com" target="_blank">google</a>, lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. 

Some paragraph with a lot of anchor tags, <a href="http://en.wikipedia.org/wiki/Regular_expression" target="_blank">regular expression</a>, lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. <a href="http://en.wikipedia.org/wiki/Bash_(Unix_shell)" target="_blank">Bash</a>. Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. <a href="http://stackoverflow.com/questions/ask" target="_blank">asking</a>, lorem ipsum dolor sit amet <a href="http://en.wikipedia.org" target="_blank">wikipedia</a>

With the text above saved as file.txt, running my bash script produces:

http://google.com 
http://en.wikipedia.org/wiki/Regular_expression

...and if you uncomment #echo ${BASH_REMATCH}, you will see that the whole paragraph is matched but only the first anchor is captured.

Thank you for your time!

Source

2014-06-28 asking

You can use a while loop to capture them all:

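regex="<a href=\"([A-Za-z0-9:/._-]+)\"[^<]*</a>(.*$)"
# IFS= and -r keep leading whitespace and backslashes in each line intact
while IFS= read -r line; do
    # match one anchor per pass, then retry on the remainder of the line
    while [[ $line =~ $regex ]]; do
        href=${BASH_REMATCH[1]}
        line=${BASH_REMATCH[2]}
        echo "$href"
    done
done < file.txt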

This prints:

http://google.com 
http://en.wikipedia.org/wiki/Regular_expression 
http://stackoverflow.com/questions/ask 
http://en.wikipedia.org
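
Note that the output above skips http://en.wikipedia.org/wiki/Bash_(Unix_shell): parentheses are not in the character class [A-Za-z0-9:/._-], so that anchor never matches. A minimal sketch of one way around this, assuming every href is double-quoted and contains no embedded double quote, is to capture everything up to the closing quote with [^"]+. The function name extract_hrefs is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every href value found in the file given as $1.
# Assumes double-quoted href attributes with no embedded double quote.
extract_hrefs() {
    local regex='<a href="([^"]+)"[^<]*</a>(.*)$'
    local line
    # IFS= and -r keep leading whitespace and backslashes intact;
    # the || clause also processes a last line that lacks a trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
        while [[ $line =~ $regex ]]; do
            printf '%s\n' "${BASH_REMATCH[1]}"
            line=${BASH_REMATCH[2]}   # continue with the text after this anchor
        done
    done < "$1"
}
```

Calling extract_hrefs file.txt on the sample text should then list the Bash_(Unix_shell) link along with the others. Anchors whose link text contains nested tags are still not matched, because [^<]* stops at the first <.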

Source

2014-06-28 20:17:17 Fabricator

Have you tried grep -o? It prints only the matches.

grep -Po '(?<=<a href=\")([A-Za-z0-9:/._-]+)(?=\".*?<\/a>)' file.txt

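-P turns on Perl-compatible regular expressions
-o returns only the matched part, not the whole line
(?<=...) positive lookbehind: matches a position that is preceded by this pattern
(?=...) positive lookahead: matches a position that is followed by this pattern
.*? non-greedy match: so you do not end up spanning from the first opening <a> tag to the last closing </a> tag

With lookahead and lookbehind, the surrounding patterns are not part of the match; they are only required to be present. That makes grep -o output exactly what you need.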
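
Where grep -P is not available (it is a GNU grep feature, and some builds lack it), a similar result can be sketched without lookarounds: let grep -oE print each <a href="..." fragment, then strip the wrapper with sed. This is an alternative to the command above rather than the same technique. It assumes double-quoted href values placed immediately after <a, and unlike the original pattern it does not check for a closing </a>. The function name hrefs_portable is hypothetical.

```shell
# Hypothetical helper: print every href value in the file given as $1,
# using only -o and -E, which both GNU and BSD grep support.
hrefs_portable() {
    # Step 1: emit each '<a href="URL"' fragment on its own line.
    # Step 2: remove the leading '<a href="' and the trailing '"'.
    grep -oE '<a href="[^"]+"' "$1" |
        sed -e 's/^<a href="//' -e 's/"$//'
}
```

Usage would be hrefs_portable file.txt. Because the URL is matched with [^"]+, links containing characters such as parentheses are printed as well.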

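Just a note: this approach is quite flaky; it does not understand comments and the like. If you need this for something important, use an XML/HTML parser instead.

Source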

2014-06-28 19:57:25 fejese

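+1 for pointing out that regular expressions are not the right tool for this job. –

Running your solution produces the grep help text: usage: grep [-abcDEFGHhIiJLlmnOoPqRSsUVvwxZ] [-A NUM] [-B NUM] [-C [NUM] ... etc. – asking

@BartonChittenden why do you say regular expressions are not the right tool for the job? Please elaborate. – asking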
