Problem parsing with Ruby

I have a small problem parsing a website with Ruby and Nokogiri.

Here is what the site looks like:

<div id="post_message_111112" class="postcontent"> 

     Hee is text 1 
    here is another 
     </div> 
<div id="post_message_111111" class="postcontent"> 

      Here is text 2 
    </div>

Here is my code to parse it:

doc = Nokogiri::HTML(open(myNewLink)) 
myPost = doc.xpath("//div[@class='postcontent']/text()").to_a() 

ii=0 

while ii!=myPost.length 
    puts "#{ii} #{myPost[ii].to_s().strip}" 
    ii+=1 
end

My problem is that the "Hee is text 1" div comes out oddly after to_a, because the new line splits it into separate entries, like this:

myPost[0] = hee is text 1 
myPost[1] = here is another 
myPost[2] = here is text 2

I want each div to be its own message, like this:

myPost[0] = hee is text 1 here is another 
myPost[1] = here is text 2

How would I fix this? Thanks.

Edit

I tried:

myPost = doc.xpath("//div[@class='postcontent']/text()").to_a() 

myPost.each_with_index do |post, index| 
    puts "#{index} #{post.to_s().gsub(/\n/, ' ').strip}" 
end

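I used post.to_s().gsub because it complained that gsub is not a method of post. But I still have the same problem. I know I'm doing something wrong; I'm just racking my brain.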

Update 2

I forgot to say that the new line is a <br />, and even with

doc.search('br').each do |n| 
    n.replace('') 
end

or

doc.search('br').remove

the problem still persists.

Source

2013-03-10 DanielJ

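If you look at the myPost array, you'll see that each div is actually its own message. The first one just happens to include a newline \n. To replace it with a space, use #gsub(/\n/, ' '). So your loop looks like this: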

myPost.each_with_index do |post, index| 
    puts "#{index} #{post.to_s.gsub(/\n/, ' ').strip}" 
end

Edit:

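As far as my limited understanding goes, XPath can only find nodes. The <br /> elements are child nodes, so you either get multiple text nodes between them or you include the div tag in the search. There is surely a way to join the text between the <br /> nodes, but I don't know it. Until you find it, here is something that works: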

Replace your XPath match with "//div[@class='postcontent']" and adjust your loop to remove the div tags:

myPost.each_with_index do |post, index| 
    post = post.to_s 
    post.gsub!(/\n/, ' ') 
    post.gsub!(/^<div[^>]*>/, '') # delete opening div tag 
    post.gsub!(%r|</\s*div[^>]*>|, '') # delete closing div tag 
    puts "#{index} #{post.strip}" 
end

Source

2013-03-10 17:36:22 Huluk

Thanks for the quick reply, but there is just one small problem. After myPost = doc.xpath("//div[@class='postcontent']/text()").to_a() ... I have ... myPost.each_with_index do |post, index| puts "#{index} #{post.gsub(/\n/, ' ').strip}" end ... but it complains that post has no method gsub, so if I put ... myPost.each_with_index do |post, index| puts "#{index} #{post.to_s().gsub(/\n/, ' ').strip}" end ... it solves the no gsub problem, but the problem with the array remains. – DanielJ 2013-03-10 17:51:48

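Sorry, of course the 'to_s' was missing. I fixed it in the original post, but now it prints each post on a single line. I don't know exactly what is going on; could you provide a working example? The HTML you posted cannot be parsed on its own. – Huluk 2013-03-10 18:23:11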

text text text text text text text text text text text text text text text text text text.

Many thanks.

– DanielJ 2013-03-10 18:30:42

Here, let me clean that up for you:

doc.search('div.postcontent').each_with_index do |div, i| 
    puts "#{i} #{div.text.gsub(/\s+/, ' ').strip}" 
end 
# 0 Hee is text 1 here is another 
# 1 Here is text 2

Source

2013-03-10 23:13:37 pguardiario
