Counting tag and text characters in an HTML document in Ruby

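I need to parse an HTML document in Ruby and count the characters of both the tags (including attributes) and the text. For performance reasons I don't want to use a DOM parser. I have looked at Nokogiri's SAX and Reader parsers as well as SaxMachine, but none of them seems to give me a way to track the parser's position in the input HTML.

Does anyone know a way to access this information in Ruby? Thanks in advance.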


2011-08-08 pablochacin

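To be more precise about my question: I need to count all the HTML characters and all the text, but I also need to know which tag each piece of text is inside. In the future I may also need to filter which tags get processed. – pablochacin

We can't really design your solution for you, and what you are asking for is still vague. Can you give an example of the output you want?

I have added another solution. Is it closer to what you are looking for?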

Input string

html = <<-HTML 
<html> 

<head> 
    <title>Title</title> 
</head> 

<body> 
    Hello world! 
</body> 

</html> 
HTML

Naive solution

A crude solution that counts only word characters (letters, digits and underscores), so </html> counts as 4 characters.

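tag_count = 0
text_count = 0

# True while we are between '<' and '>'
in_tag = false

html.each_char do |char|
    case char
    when '<'
        in_tag = true
    when '>'
        in_tag = false
    when /\w/
        # Only word characters are counted; brackets, slashes,
        # whitespace and punctuation are ignored
        if in_tag
            tag_count += 1
        else
            text_count += 1
        end
    end
end

puts "Text char count: #{text_count}"
puts "Tag char count: #{tag_count}"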
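
The question also asks for every character to be counted and for the enclosing tag of each piece of text. As a rough sketch of how the crude approach could be extended without a DOM parser, the standard library's StringScanner can walk the input one token at a time. Everything named here (count_html_chars, VOID_TAGS, the result keys) is made up for illustration, and the regexes assume that no '<' or '>' appears inside attribute values, comments or scripts, so this is not a real HTML parser.

```ruby
require 'strscan'

# Hypothetical list of tags that never get a closing tag (not exhaustive).
VOID_TAGS = %w[br hr img input meta link].freeze

# Hypothetical helper: counts every character of markup (brackets included)
# and the stripped length of each text run, keyed by the innermost open tag.
def count_html_chars(html)
  scanner = StringScanner.new(html)
  tag_chars = 0
  text_chars = Hash.new(0)
  open_tags = []

  until scanner.eos?
    if (tag = scanner.scan(/<[^>]*>/))
      tag_chars += tag.length
      name = tag[/\A<\/?\s*([A-Za-z][A-Za-z0-9]*)/, 1]
      next if name.nil? # comment, doctype, etc.: counted as markup, no nesting
      name = name.downcase
      if tag.start_with?('</')
        open_tags.pop if open_tags.last == name
      elsif !tag.end_with?('/>') && !VOID_TAGS.include?(name)
        open_tags.push(name)
      end
    elsif (text = scanner.scan(/[^<]+/))
      length = text.strip.length
      text_chars[open_tags.last] += length if length > 0
    else
      # A '<' with no closing '>' anywhere after it: treat it as text
      scanner.getch
      text_chars[open_tags.last] += 1
    end
  end

  { tag_chars: tag_chars, text_chars: text_chars }
end

result = count_html_chars('<p class="a">Hi <b>there</b></p>')
puts "Tag char count: #{result[:tag_chars]}"   # Tag char count: 24
puts "Text char count: #{result[:text_chars]}"
```

Since the scanner consumes the string from left to right, scanner.pos is available at any point if the byte offset of a tag or text run is needed.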

Nokogiri SAX solution

This one can easily be translated into another language (e.g. Java).

require 'nokogiri'

class HtmlCounter < Nokogiri::XML::SAX::Document

    attr_accessor :tag_count, :text_count, :comment_count

    def initialize(filtered_tags = [])
        @filtered_tags = filtered_tags
    end

    def start_document
        @tag_count = Hash.new(0)
        @text_count = Hash.new(0)
        # A Hash rather than 0, because count_comment indexes it by tag name
        @comment_count = Hash.new(0)
        @current_tags = []
    end

    def start_element(name, attrs)
        # Keep track of the nesting
        @current_tags.push(name)

        if should_count?
            # Count the end element as well
            count_tag(name.length * 2)
            # Attribute names and values; compact skips nil values,
            # which attributes without a value may produce
            count_tag(attrs.flatten.compact.map(&:length).inject(0) {|sum, length| sum + length})
        end
    end

    def end_element(name)
        @current_tags.pop
    end

    def comment(string)
        count_comment(string.length) if should_count?
    end

    def characters(string)
        count_text(string.strip.length) if should_count?
    end

    def should_count?
        # True unless we are inside one of the filtered tags
        (@current_tags & @filtered_tags).empty?
    end

    def count_text(count)
        @text_count[@current_tags.last] += count
    end

    def count_tag(count)
        @tag_count[@current_tags.last] += count
    end

    def count_comment(count)
        @comment_count[@current_tags.last] += count
    end
end

# Don't count things in title tags
counter = HtmlCounter.new(["title"])
parser = Nokogiri::HTML::SAX::Parser.new(counter)
parser.parse(html)

puts "Text char count: #{counter.text_count}"
puts "Tag char count: #{counter.tag_count}"

Output:

Text char count: {"body"=>12} 
Tag char count: {"html"=>8, "head"=>8, "body"=>8}
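
To see where these numbers come from, the arithmetic can be reproduced by hand. The snippet below is only an illustration, not part of the solution; the variable names are made up, and it hard-codes the three tag names that pass the title filter in the sample input. Note that only the characters of the tag name (twice, for the opening and closing tag) and of the attributes are counted, not the angle brackets or slashes.

```ruby
# Tags of the sample document that are not inside a filtered tag
counted_tags = %w[html head body]

# Each name is counted twice: once for the opening and once for the closing tag
tag_count = counted_tags.each_with_object({}) { |name, h| h[name] = name.length * 2 }

# The text inside <body>, with surrounding whitespace stripped as in characters()
text_count = { "body" => "\n    Hello world! \n".strip.length }

# The filter check: while <title> is open, the nesting stack intersects
# the filter list, so nothing is counted
current_tags = %w[html head title]
filtered_tags = %w[title]
inside_filtered = !(current_tags & filtered_tags).empty?

puts tag_count["html"]   # 8
puts text_count["body"]  # 12
puts inside_filtered     # true
```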

Hope this helps.


2011-08-08 10:29:49
