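How do I extract HTML elements by class?

I've just started using Perl and wrote a simple script to do some web scraping. I'm using WWW::Mechanize and HTML::TreeBuilder for most of the work, but I've run into some trouble. I have the following HTML: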

<table class="winsTable"> 
    <thead>...</thead> 
    <tbody> 
     <tr> 
      <td class = "wins">15</td> 
     </tr> 
    </tbody> 
</table>

I know there are modules for getting data out of tables, but this is a special case; not all of the data I want is in a table. So I tried:

my $tree = HTML::TreeBuilder->new_from_url($url); 
my @data = $tree->find('td class = "wins"');

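But @data comes back empty. I know this method works without a class name, because I have successfully parsed data with $tree->find('strong'). So, is there a module that can handle this kind of HTML syntax? I scanned through the HTML::TreeBuilder documentation and didn't find anything that seemed to, but I could be wrong.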

2013-07-14 aquemini

I use the excellent (but sometimes a bit slow) HTML::TreeBuilder::XPath module:

my $tree = HTML::TreeBuilder::XPath->new_from_content($mech->content()); 
my @data = $tree->findvalues('//table[ @class = "winsTable" ]//td[@class = "wins"]');
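One caveat worth illustrating: `@class = "wins"` compares the whole attribute value, so a cell marked `class="wins highlight"` would not match. The sketch below is only an illustration under assumptions. The sample markup (including the `highlight` class) is made up, and the token-matching expression is a common XPath 1.0 idiom rather than anything specific to this module.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample markup: the second cell carries two classes.
my $html = <<'HTML';
<table class="winsTable">
  <tbody>
    <tr><td class="wins">15</td><td class="wins highlight">7</td></tr>
  </tbody>
</table>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Exact comparison: matches only cells whose class attribute is exactly "wins".
my @exact = $tree->findvalues('//td[@class = "wins"]');

# Token match: pad the attribute with spaces so "wins" is matched as a whole word.
my @token = $tree->findvalues(
    '//td[contains(concat(" ", normalize-space(@class), " "), " wins ")]'
);

print "exact: @exact\n";
print "token: @token\n";

# Free the tree explicitly once the values have been copied out.
$tree->delete;
```

If every cell you care about has a single class, the simpler exact comparison from the answer is enough.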

2013-07-14 03:55:32 gangabass

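Whoa, that worked with no trouble. Thanks! – aquemini

You can use the look_down method to find the specific tag and attributes you are looking for. It lives in the HTML::Element module (whose methods HTML::TreeBuilder inherits).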

my $data = $tree->look_down(
    _tag => 'td', 
    class => 'wins' 
); 

print $data->content_list, "\n" if $data; #prints '15' using the given HTML 

$data = $tree->look_down(
    _tag => 'td', 
    class => 'losses' 
); 

print $data->content_list, "\n" if $data; #prints nothing using the given HTML
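In scalar context look_down returns only the first match. As a rough sketch of getting every match, the example below calls it in list context and also passes a regex as the attribute criterion. The sample markup, the `highlight` class, and the variable names are assumptions made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup with two rows and a multi-class cell.
my $html = <<'HTML';
<table class="winsTable">
  <tbody>
    <tr><td class="wins">15</td><td class="losses">3</td></tr>
    <tr><td class="wins highlight">7</td><td class="losses">9</td></tr>
  </tbody>
</table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# List context: every <td> whose class attribute is exactly "wins".
my @exact = $tree->look_down(_tag => 'td', class => 'wins');

# A regex criterion also catches cells that carry more than one class.
my @any_wins = $tree->look_down(_tag => 'td', class => qr/\bwins\b/);

# Copy the text out before the tree is deleted.
my @exact_text    = map { $_->as_text } @exact;
my @any_wins_text = map { $_->as_text } @any_wins;

print join(', ', @exact_text), "\n";
print join(', ', @any_wins_text), "\n";

$tree->delete;
```

Note that `\bwins\b` would also match a class such as `wins-total`, so tighten the pattern if your markup has names like that.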

2013-07-14 04:13:05 dms

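Also great. Good stuff, thanks! – aquemini

I'm in the same situation, but I am getting the following error: Odd number of elements in hash assignment at /usr/local/share/perl5/HTML/TreeBuilder.pm line 207 – Nagaraju

(This is a supplement to dspain's answer.)

Actually, you missed a point in the HTML::TreeBuilder documentation, where it says: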

Objects of this class inherit the methods of both HTML::Parser and HTML::Element. The methods inherited from HTML::Parser are used for building the HTML tree, and the methods inherited from HTML::Element are what you use to scrutinize the tree. Besides this (HTML::TreeBuilder) documentation, you must also carefully read the HTML::Element documentation, and also skim the HTML::Parser documentation -- probably only its parse and parse_file methods are of interest.

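(Note: the bold formatting is mine; it is not in the documentation.)

This indicates that you should read HTML::Element's documentation as well, where you will find the find method, which says: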

This is just an alias to find_by_tag_name

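This should tell you that it does not work with class names, but its description also mentions the look_down method, which can be found a little further down. If you look at the examples there, you'll see that it does what you want, and dspain's answer shows how to apply it in your case.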
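To make the point concrete, here is a small sketch under assumptions (the sample markup and variable names are invented). Since find only understands tag names, a string such as 'td class = "wins"' is treated as one tag name that never occurs, which is why the question's call returns nothing. You can still use find and filter on the class attribute yourself:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup modelled on the table in the question.
my $html = '<table class="winsTable"><tbody><tr>'
         . '<td class="wins">15</td><td class="losses">3</td>'
         . '</tr></tbody></table>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# find() takes tag names only, so it returns both <td> cells here.
my @cells = $tree->find('td');

# Filter on the class attribute by hand; attr() returns undef when it is absent.
my @wins = grep { ($_->attr('class') // '') eq 'wins' } @cells;

# Copy the results out before the tree is deleted.
my $cell_count = scalar @cells;
my @wins_text  = map { $_->as_text } @wins;
print "$cell_count cells, wins: @wins_text\n";

$tree->delete;
```

In practice look_down does the same filtering in one call, so this form is mainly useful for seeing what find actually returns.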

To be fair, the documentation is not easy to navigate.

2013-07-21 01:12:30 doubleDown

I found this link the most useful for showing me how to extract specific information from HTML content. I used the last example on the page:

use v5.10; 
use WWW::Mechanize; 
use WWW::Mechanize::TreeBuilder; 

my $mech = WWW::Mechanize->new; 
WWW::Mechanize::TreeBuilder->meta->apply($mech); 

$mech->get('http://htmlparsing.com/'); 

# Find all <h1> tags 
my @list = $mech->find('h1'); 

# Or this way, which I found very useful for pinpointing exact classes within some HTML
# (@list is reused here rather than declared a second time with "my")
@list = $mech->look_down('_tag'  => 'h1',
                         'class' => 'main_title');

# Now just iterate and process 
foreach (@list) { 
    say $_->as_text(); 
}

This looks much simpler than any of the other modules I have looked at. Hope this helps!

2016-03-14 17:28:36 Anna
