Parsing HTML with cURL in a shell script

I want to parse specific content from a web page in a shell script.

I need to grep the content inside a <div> tag.

<div class="tracklistInfo"> 
<p class="artist">Diplo - Justin Bieber - Skrillex</p> 
<p>Where Are U Now</p> 
</div>

If I use grep -E -m 1 -o '<div class="tracklistInfo">', the result is just <div class="tracklistInfo">

How can I access the artist (Diplo - Justin Bieber - Skrillex) and the title (Where Are U Now)?

2016-03-22 Fabian

Don't. Use an HTML parser. For example, Python's BeautifulSoup is easy to use and handles this with very little code.

That said, remember that grep works on lines. The pattern is matched against each line, not against the whole input as one string.
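A small sketch of that line-by-line behaviour, assuming the markup from the question is held in a shell variable (the name `html` is only for illustration). With -o, grep prints just the text the pattern matched, which is why the question's command returns nothing but the opening tag:

```shell
#!/bin/sh
# Sample markup from the question; the variable name is illustrative.
html='<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div>'

# -o prints only the part of a line that matched the pattern,
# so nothing beyond the opening tag is shown.
only_match=$(printf '%s\n' "$html" | grep -o '<div class="tracklistInfo">')

# -c counts matching lines: the pattern is tested against each line
# separately, and only the first line contains it.
line_count=$(printf '%s\n' "$html" | grep -c '<div class="tracklistInfo">')

printf '%s\n' "$only_match"
printf 'matching lines: %s\n' "$line_count"
```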

What you can use is -A, which also outputs the lines after the match:

grep -A2 -E -m 1 '<div class="tracklistInfo">'

This should output:

<div class="tracklistInfo"> 
<p class="artist">Diplo - Justin Bieber - Skrillex</p> 
<p>Where Are U Now</p>

You can then get the last or second-to-last line by piping it to tail:

$ grep -A2 -E -m 1 '<div class="tracklistInfo">' | tail -n1 
<p>Where Are U Now</p> 

$ grep -A2 -E -m 1 '<div class="tracklistInfo">' | tail -n2 | head -n1 
<p class="artist">Diplo - Justin Bieber - Skrillex</p>

Then strip the HTML tags with sed:

$ grep -A2 -E -m 1 '<div class="tracklistInfo">' | tail -n1 | sed 's/<[^>]*>//g' 
Where Are U Now 

$ grep -A2 -E -m 1 '<div class="tracklistInfo">' | tail -n2 | head -n1 | sed 's/<[^>]*>//g' 
Diplo - Justin Bieber - Skrillex
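The two pipelines can be wrapped in small functions that read the page from standard input. This is only a sketch: the function names `first_artist` and `first_title` are made up for illustration, and it assumes the two <p> lines directly follow the opening <div>, as in the question.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not from any library.

# Print the artist of the first tracklistInfo block read from stdin.
first_artist() {
    grep -A2 -m 1 '<div class="tracklistInfo">' | tail -n2 | head -n1 | sed 's/<[^>]*>//g'
}

# Print the title of the first tracklistInfo block read from stdin.
first_title() {
    grep -A2 -m 1 '<div class="tracklistInfo">' | tail -n1 | sed 's/<[^>]*>//g'
}

# Sample page standing in for the downloaded HTML.
page='<body>
<p>Blah text</p>
<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div>
</body>'

artist=$(printf '%s\n' "$page" | first_artist)
title=$(printf '%s\n' "$page" | first_title)
printf 'artist: %s\ntitle: %s\n' "$artist" "$title"
```

With a real page the input would come from something like curl -s "$url" | first_title, where $url is a placeholder for the page address.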

But as said, this is fragile, likely to break, and not very pretty. Here is the same thing with BeautifulSoup, by the way:

html = '''<body> 
<p>Blah text</p> 
<div class="tracklistInfo"> 
<p class="artist">Diplo - Justin Bieber - Skrillex</p> 
<p>Where Are U Now</p> 
</div> 
</body>''' 

from bs4 import BeautifulSoup 
soup = BeautifulSoup(html, 'html.parser') 

for track in soup.find_all(class_='tracklistInfo'): 
    print(track.find_all('p')[0].text) 
    print(track.find_all('p')[1].text)

This also works when there are multiple tracklistInfo blocks. Adding that to the shell command would take some more work ;-)
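To give an idea of what that extra work might look like in the shell, here is a rough awk sketch that handles every tracklistInfo block. It is not a parser: it assumes each <p> sits on its own line directly after the opening <div>, and the function name `list_tracks` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print artist and title, one per line, for each
# tracklistInfo block on stdin.
list_tracks() {
    awk '
        # Opening div found: the next two lines hold artist and title.
        /<div class="tracklistInfo">/ { n = 2; next }
        n > 0 {
            gsub(/<[^>]*>/, "")            # strip tags
            gsub(/^[ \t]+|[ \t]+$/, "")    # trim surrounding whitespace
            print
            n--
        }
    '
}

# Sample page with two blocks.
page='<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div>
<div class="tracklistInfo">
<p class="artist">toto</p>
<p>tata</p>
</div>'

tracks=$(printf '%s\n' "$page" | list_tracks)
printf '%s\n' "$tracks"
```

Any change in the page layout (both <p> tags on one line, an extra line in between) breaks this, which is the argument for a real HTML parser.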

2016-03-22 14:36:47 Carpetsmoker

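Thank you very much. The result I get now is: Flo Rida (5,4,3,2,1). That is perfect, but how can I remove the spaces in front of it? And can I use UTF-8? When the text includes special characters it does not work for me, e.g.: Enrique Iglesias - Nicky Jam óň –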

@Fabian Yes, that is why you don't use curl/grep/sed for this, but an HTML parser ;-) – Carpetsmoker

Oh OK, then I will try BeautifulSoup. Thanks –

cat - > file.html << EOF 
<div class="tracklistInfo"> 
<p class="artist">Diplo - Justin Bieber - Skrillex</p> 
<p>Where Are U Now</p> 
</div><div class="tracklistInfo"> 
<p class="artist">toto</p> 
<p>tata</p> 
</div> 
EOF 


cat file.html | tr -d '\n' | sed -e "s/<\/div>/<\/div>\n/g" | sed -n 's/^.*class="artist">\([^<]*\)<\/p> *<p>\([^<]*\)<.*$/artist : \1\ntitle : \2\n/p'

2016-03-22 23:38:10

Using xmllint:

a='<div class="tracklistInfo"> 
<p class="artist">Diplo - Justin Bieber - Skrillex</p> 
<p>Where Are U Now</p> 
</div>' 

xmllint --html --xpath 'concat(//div[@class="tracklistInfo"]/p[1]/text(), "#", //div[@class="tracklistInfo"]/p[2]/text())' <(echo "$a")

You get:

Diplo - Justin Bieber - Skrillex#Where Are U Now

This can then easily be split on the # separator.
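One way to do the split, sketched with plain POSIX parameter expansion. The variable names are illustrative, and it assumes neither the artist nor the title contains a # itself:

```shell
#!/bin/sh
# Output in the form produced by the xmllint command above.
line='Diplo - Justin Bieber - Skrillex#Where Are U Now'

# Everything before the first "#" is the artist ...
artist=${line%%#*}
# ... and everything after the first "#" is the title.
title=${line#*#}

printf 'artist: %s\n' "$artist"
printf 'title: %s\n' "$title"
```

If a # can occur in the data, a different separator string in the concat() call would be the safer choice.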

2016-04-06 02:01:26
