How do I drop 2-3 elements from an html node and scrape the rest?

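To be precise, I have a class, say A, which I select with html_nodes in rvest. A can contain many subclasses and many html tags, such as links and img tags. I want to drop a few specific classes and tags from A while scraping the rest of the data. I don't know the classes of the rest of the data; I only know what I want to blacklist.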

HTML (hypothetical). The tag <div class="messageContent"> repeats up to 25 times in the document, with different content but the same structure.

<div class="messageContent"> 
<article> 
<blockquote class="messageText SelectQuoteContainer ugc baseHtml"> 
<div class="bbCodeBlock bbCodeQuote" data-author="Generic"> 

<aside> 
<div class="attribution type">Generic said: 
<a href="goto/post?id=32554#post-32754" class="AttributionLink">&uarr;</a> 
</div> 
<blockquote class="quoteContainer"><div class="quote">I see what you did there.</div><div class="quoteExpand">Click to expand...</div></blockquote> 
</aside> 

</div><img src="styles/default/xenforo/clear.png" class="mceSmilieSprite  mceSmilie9" alt=":o" title="Eek! :o"/> Really? 
<aside> 
<div class="attribution type">Generic said: 
<a href="goto/post?id=32554#post-32754" class="AttributionLink">&uarr;</a> 
</div> 
<blockquote class="quoteContainer"><div class="quote">I see what you did there.</div><div class="quoteExpand">Click to expand...</div></blockquote> 
</aside> 

<div class="messageTextEndMarker">&nbsp;</div> 
</blockquote> 
</article> 
</div>

So the page I am scraping contains several of these classes. I do

posts <- page %>% html_nodes(".messageContent")

This gives me a list of 25 html nodes, each containing a variation of the html content above.

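I want to remove everything between the <aside> and </aside> tags (which can occur in multiple places within a post) and capture the rest of the HTML via html_text() %>% as.character()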

Can I do this with rvest?

Testing @MirosławZalewski's solution.

test <- page %>% html_node(".messageContent") %>% 
      html_nodes(xpath='//*[not(ancestor::aside or name()="aside")]/text()')

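This returns every element on the page that is not inside an aside, not just those in the first post. A bit of tweaking led me to: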

page %>% html_nodes(xpath='(//div[@class="messageContent"])[1]//*[not(ancestor::aside or name()="aside")]/text()') %>% html_text() %>% as.character()

Iterating over the 25 classes, this gives me exactly what I need.
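The iteration can be sketched roughly as follows. This is only an illustration under assumptions: the two-post page is made up to mimic the structure above, and the names xp and clean are hypothetical. It uses a relative XPath (starting with .//) instead of indexing each div by position, which keeps the search inside the current post.

```r
library(rvest)

# Hypothetical two-post page that mimics the structure in the question.
page <- read_html('
<div class="messageContent"><article><blockquote>
  <aside><div class="quote">quoted one</div></aside> Reply one
</blockquote></article></div>
<div class="messageContent"><article><blockquote>
  <aside><div class="quote">quoted two</div></aside> Reply two
</blockquote></article></div>')

posts <- page %>% html_nodes(".messageContent")

# ".//" searches below the current post only; a leading "//" would
# restart from the document root and match every post on the page.
xp <- './/*[not(ancestor::aside or name()="aside")]/text()'

# One cleaned string per post: join its text nodes, then trim the ends.
clean <- vapply(seq_along(posts), function(i) {
  pieces <- posts[[i]] %>% html_nodes(xpath = xp) %>% html_text()
  trimws(paste(pieces, collapse = ""))
}, character(1))

print(clean)
```

On the real page, posts would come from the html_nodes(".messageContent") call shown earlier, and clean would hold one string per post.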


2015-12-21 user795028

Please give us a reproducible example so that we can help you. – MaxPD

Added an example. – user795028

With XPath, you can select all nodes that are neither <aside> nor descendants of <aside>:

page %>% html_node(".messageContent") %>% 
    html_nodes(xpath='//*[not(ancestor::aside or name()="aside")]')

Unfortunately, this will also match elements that contain <aside>. If you pass them to html_text(), the <aside> text content will be returned anyway.
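A small sketch of that pitfall, on a made-up one-post snippet (the markup and the name kept are assumptions for illustration). It uses .// so that only descendants of the selected post are considered:

```r
library(rvest)

# Hypothetical minimal post: one quoted aside followed by the reply text.
page <- read_html(
  '<div class="messageContent"><blockquote><aside>quoted</aside> Really?</blockquote></div>'
)

kept <- page %>% html_node(".messageContent") %>%
  html_nodes(xpath = './/*[not(ancestor::aside or name()="aside")]')

# The <blockquote> passes the filter because it is neither an aside
# nor inside one, yet it still contains the aside as a child.
print(html_name(kept))

# html_text() concatenates all descendant text, aside included.
print(html_text(kept))
```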

This can be overcome by adding another condition to the query. A good candidate for that condition is "everything that is a text node":

page %>% html_node(".messageContent") %>% 
    html_nodes(xpath='//*[not(ancestor::aside or name()="aside")]/text()')

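In fact, /text() returns only text nodes, which almost lets you skip the html_text() call altogether. But since many of those text nodes are dubious (they contain only whitespace characters) and html_text() has a built-in trim option, you may want to call it anyway.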
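For instance, a rough sketch of trimming and then dropping the whitespace-only nodes could look like this. The snippet is a simplified, made-up version of one post, and the name texts is hypothetical:

```r
library(rvest)

# Hypothetical post with line breaks and an end marker that holds only a space.
page <- read_html('<div class="messageContent"><blockquote>
  <aside>quoted</aside> Really?
  <div class="messageTextEndMarker"> </div>
</blockquote></div>')

texts <- page %>% html_node(".messageContent") %>%
  html_nodes(xpath = './/*[not(ancestor::aside or name()="aside")]/text()') %>%
  html_text(trim = TRUE)   # strip leading and trailing whitespace per node

# Whitespace-only nodes are now empty strings; drop them.
texts <- texts[nzchar(texts)]

print(texts)
```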

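Note that this solution will also skip any non-text content, such as image references (possibly including emoticons). Your original proposal would do the same, but it is not clear to me whether that is intentional.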
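If the non-text content does matter, one alternative, which is not the XPath filter above but a different technique, is to delete the <aside> nodes from the parsed document with xml2's xml_remove() and then read whatever is left. The sketch below assumes a made-up snippet; body_text and smilies are hypothetical names.

```r
library(rvest)
library(xml2)

# Hypothetical post: a quoted aside, a smiley image, then the reply text.
page <- read_html('<div class="messageContent"><blockquote>
  <aside>quoted</aside><img src="clear.png" alt=":o"/> Really?
</blockquote></div>')

post <- page %>% html_node(".messageContent")

# Remove every <aside> below this post. Note that this modifies the
# parsed document in place, so `page` no longer contains them either.
xml_remove(xml_find_all(post, ".//aside"))

body_text <- post %>% html_text(trim = TRUE)
smilies   <- post %>% html_nodes("img") %>% html_attr("alt")

print(body_text)
print(smilies)
```

Because the removal is destructive, re-read the page first if the quoted text is needed later.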


2015-12-21 23:35:11

Your command gave me a list of almost all page elements (1.5 MB, 2500 elements). The command 'page %>% html_nodes(xpath='//article[not(ancestor::aside or name()="aside" or self::aside)]') %>% html_text() %>% as.character()' gives me a list of 25 containing the text of all articles, including the bits between the <aside> tags. I tried a few other combinations: 'page %>% html_nodes(xpath='//article/*[not(parent::aside)]')' or 'page %>% html_nodes(xpath='//article/*[not(parent::blockquote[@class="quoteContainer"])]')' also fail to deselect the relevant nodes. – user795028
