Using grep to extract from HTML

Regex for href and rel

<a class="title may-blank" data-event-action="title" href="/r/gaming/comments/6t8dj0/we_can_play_singleplayer_games_off_the_internet/" tabindex="1" data-href-url="/r/gaming/comments/6t8dj0/we_can_play_singleplayer_games_off_the_internet/" data-inbound-url="/r/gaming/comments/6t8dj0/we_can_play_singleplayer_games_off_the_internet/?utm_content=title&amp;utm_medium=hot&amp;utm_source=reddit&amp;utm_name=frontpage" rel="">We can play singleplayer games OFF THE INTERNET? Are they seriously that out of touch to advertise this?</a>

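There are many lines like this one.

I only want what sits between the quotes in href="http://xxxxxxxx" and the text after rel="">yyyyyyyyyy; everything else is unnecessary.

The output should look like this, with each block on its own line: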

<a href="http://xxxxxxxx" rel="">yyyyyyyyyy</a>

Any ideas on how I could go about doing this?

Source

2017-08-12 pxssy

It looks like a reddit link, so you might also want to look at the [reddit API](https://www.reddit.com/dev/api/) instead of parsing the HTML by hand. – user3151902

See https://stackoverflow.com/a/1732454/1682509 – Reeno

So here is a 10-second solution. It may be a bit brittle, but it should work, assuming the markup is in a file named html.txt:

sed -e 's/class.* href="/href="/' -e 's/tabindex.*rel=/rel=/' html.txt
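A slightly stricter variant of the same idea, as a minimal sketch: it uses negated character classes instead of greedy `.*`, so an attribute such as data-href-url cannot be mistaken for href, and it prints only lines that actually contain a link. The function name `extract_links` is made up for illustration, and the sketch assumes at most one `<a>` element per line, with href appearing before rel.

```shell
# extract_links FILE: print one reduced <a> tag per matching input line.
# Hypothetical helper; assumes one link per line and href before rel.
extract_links() {
  # Group 2 = href value, group 3 = rel value, group 4 = link text.
  # -n plus the trailing p prints only lines where the substitution matched.
  sed -n -E 's/.*<a ([^>]* )?href="([^"]*)"[^>]* rel="([^"]*)">([^<]*)<\/a>.*/<a href="\2" rel="\3">\4<\/a>/p' "$1"
}
```

Calling `extract_links html.txt` would then emit one `<a href="..." rel="...">text</a>` line per link. It is still a regex over HTML, so nested tags inside the link text or attributes in a different order would not match.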

Source

2017-08-12 19:57:19 James

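With your HTML example, the following pattern gets me the required values:

<a class=\"(.*) href=\"/(.*)\" tabindex=(.*) rel=\"\">(.*)</a>

by replacing each match with the following pattern:

<a href="http://$2" rel="">$4</a>

You can try it out on regexe; for me it works as expected.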
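The same pattern and replacement can also be tried from the shell. A minimal sketch with `sed -E`, where the backreferences are written `\2` and `\4` instead of `$2` and `$4`; the function name `rewrite_links` is hypothetical, and the quotes need no backslash escaping inside a single-quoted sed script.

```shell
# rewrite_links FILE: apply the four-group pattern from the answer above.
# Hypothetical helper; group 2 is the href path without its leading slash,
# group 4 is the link text. Lines that do not match pass through unchanged.
rewrite_links() {
  sed -E 's|<a class="(.*) href="/(.*)" tabindex=(.*) rel="">(.*)</a>|<a href="http://\2" rel="">\4</a>|' "$1"
}
```

Because the href in the sample is site-relative, the replacement produces `http://r/gaming/...`; to get a usable absolute URL, a host name would have to be added after `http://` in the replacement.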

Source

2017-08-12 19:57:26 Fruchtzwerg
