2013-01-31

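How do I select only one table with Web::Scraper?

I want to extract from a web page only the text listed under the heading "Node Object Methods". The relevant part of the HTML is as follows: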

<h2>Node Object Properties</h2> 
<p>The &quot;DOM&quot; column indicates in which DOM Level the property was introduced.</p> 

<table class="reference"> 
<tr> 
<th width="23%" align="left">Property</th> 
<th width="71%" align="left">Description</th> 
<th style="text-align:center;">DOM</th> 
</tr> 
<tr> 
    <td><a href="prop_node_attributes.asp">attributes</a></td> 
    <td>Returns a collection of a node's attributes</td> 
    <td style="text-align:center;">1</td> 
</tr> 

<tr> 
    <td><a href="prop_node_baseuri.asp">baseURI</a></td> 
    <td>Returns the absolute base URI of a node</td> 
    <td style="text-align:center;">3</td> 
</tr> 
<tr> 
    <td><a href="prop_node_childnodes.asp">childNodes</a></td> 
    <td>Returns a NodeList of child nodes for a node</td> 
    <td style="text-align:center;">1</td> 
</tr> 
<tr> 
    <td><a href="prop_node_firstchild.asp">firstChild</a></td> 
    <td>Returns the first child of a node</td> 
    <td style="text-align:center;">1</td> 
</tr> 
<tr> 
    <td><a href="prop_node_lastchild.asp">lastChild</a></td> 
    <td>Returns the last child of a node</td> 
    <td style="text-align:center;">1</td> 
</tr> 
<tr> 
    <td><a href="prop_node_localname.asp">localName</a></td> 
    <td>Returns the local part of the name of a node</td> 
    <td style="text-align:center;">2</td> 
</tr> 
<tr> 
    <td><a href="prop_node_namespaceuri.asp">namespaceURI</a></td> 
    <td>Returns the namespace URI of a node</td> 
    <td style="text-align:center;">2</td> 
</tr> 
<tr> 
    <td><a href="prop_node_nextsibling.asp">nextSibling</a></td> 
    <td>Returns the next node at the same node tree level</td> 
    <td style="text-align:center;">1</td> 
</tr> 
<tr> 
    <td><a href="prop_node_nodename.asp">nodeName</a></td> 
    <td>Returns the name of a node, depending on its type</td> 
    <td style="text-align:center;">1</td> 
</tr> 
<tr> 
    <td><a href="prop_node_nodetype.asp">nodeType</a></td> 
    <td>Returns the type of a node</td> 
    <td style="text-align:center;">1</td> 
</tr> 
<tr> 
    <td><a href="prop_node_nodevalue.asp">nodeValue</a></td> 
    <td>Sets or returns the value of a node, depending on its 
    type</td> 
    <td style="text-align:center;">1</td> 
</tr> 
<tr> 
    <td><a href="prop_node_ownerdocument.asp">ownerDocument</a></td> 
    <td>Returns the root element (document object) for a node</td> 
    <td style="text-align:center;">2</td> 
</tr> 
<tr> 
    <td><a href="prop_node_parentnode.asp">parentNode</a></td> 
    <td>Returns the parent node of a node</td> 
    <td style="text-align:center;">1</td> 
</tr> 
<tr> 
    <td><a href="prop_node_prefix.asp">prefix</a></td> 
    <td>Sets or returns the namespace prefix of a node</td> 
    <td style="text-align:center;">2</td> 
</tr> 
<tr> 
    <td><a href="prop_node_previoussibling.asp">previousSibling</a></td> 
    <td>Returns the previous node at the same node tree level</td> 
    <td style="text-align:center;">1</td> 
</tr> 
<tr> 
    <td><a href="prop_node_textcontent.asp">textContent</a></td> 
    <td>Sets or returns the textual content of a node and its 
    descendants</td> 
    <td style="text-align:center;">3</td> 
</tr> 
</table> 

<h2>Node Object Methods</h2> 
<p>The &quot;DOM&quot; column indicates in which DOM Level the method was introduced.</p> 
<table class="reference"> 
<tr> 
<th width="33%" align="left">Method</th> 
<th width="61%" align="left">Description</th> 
<th style="text-align:center;">DOM</th> 
</tr> 
<tr> 
    <td><a href="met_node_appendchild.asp">appendChild()</a></td> 
    <td>Adds a new child node, to the specified node, as the last child node</td> 
    <td style="text-align:center;">1 </td> 
</tr> 
<tr> 
    <td><a href="met_node_clonenode.asp">cloneNode()</a></td> 
    <td>Clones a node</td> 
    <td style="text-align:center;">1 </td> 
</tr> 
<tr> 
    <td><a href="met_node_comparedocumentposition.asp">compareDocumentPosition()</a></td> 
    <td>Compares the document position of two nodes</td> 
    <td style="text-align:center;">1 </td> 
</tr> 
<tr> 
    <td>getFeature(<span class="parameter">feature</span>,<span class="parameter">version</span>)</td> 
    <td>Returns a DOM object which implements the specialized APIs 
    of the specified feature and version</td> 
    <td style="text-align:center;">3 </td> 
</tr> 
<tr> 
    <td>getUserData(<span class="parameter">key</span>)</td> 
    <td>Returns the object associated to a key on a this node. The 
    object must first have been set to this node by calling setUserData with the 
    same key</td> 
    <td style="text-align:center;">3 </td> 
</tr> 
<tr> 
    <td><a href="met_node_hasattributes.asp">hasAttributes()</a></td> 
    <td>Returns true if a node has any attributes, otherwise it 
    returns false</td> 
    <td style="text-align:center;">2 </td> 
</tr> 
<tr> 
    <td><a href="met_node_haschildnodes.asp">hasChildNodes()</a></td> 
    <td>Returns true if a node has any child nodes, otherwise it 
    returns false</td> 
    <td style="text-align:center;">1 </td> 
</tr> 
<tr> 
    <td><a href="met_node_insertbefore.asp">insertBefore()</a></td> 
    <td>Inserts a new child node before a specified, existing, child node</td> 
    <td style="text-align:center;">1 </td> 
</tr> 
</table> 

In Perl, if I write the following:

my $data = scraper {
    process "table.reference > tr > td > a", 'renners[]' => 'TEXT';
};
# $html is assumed to hold the page source shown above
my $res2 = $data->scrape($html);

for my $i (0 .. $#{$res2->{renners}}) {
    print $res2->{renners}[$i];
    print "\n";
}

I get the text of every <a> tag in both tables, i.e.:

attributes 
baseURI 
. 
. 
. 
. 
insertBefore() 

whereas I need only the text of the <a> tags under Node Object Methods, i.e.:

appendChild() 
. 
. 
. 
insertBefore() 

In short, I only want to print the Node Object Methods. What should I change in my code?

Answers


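Web::Scraper can use the nth-of-type pseudo-class to select the right table. There are two tables with the same class, so you can say table.reference:nth-of-type(2). Note that nth-of-type counts sibling table elements regardless of class, so this works here because the two tables are the first and second table elements under the same parent.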

use v5.22; 

use feature qw(postderef); 
no warnings qw(experimental::postderef); 


use Web::Scraper; 

my $html = do { local $/; <DATA> }; 

my $methods = scraper {
    process "table.reference:nth-of-type(2) > tr > td > a", 'renners[]' => 'TEXT';
};
my $res = $methods->scrape($html); 

say join "\n", $res->{renners}->@*; 

Here is a Mojo::DOM version:

use v5.10;  # enables say

use Mojo::DOM;

my $html = do { local $/; <DATA> }; 

my $dom = Mojo::DOM->new($html); 

say $dom 
    ->find('table.reference:nth-of-type(2) > tr > td > a') 
    ->map('text') 
    ->join("\n"); 

I tried to find a solution that selects the table based on the text of the h2, but my kung fu is weak here.
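One possible way to key the selection on the h2 text with Mojo::DOM is sketched below. It is a minimal sketch rather than a tested solution: the helper name cells_after_heading and the small $sample document are assumptions made for illustration, and it relies on the table being a later sibling of the heading, as it is in the question's HTML.

```perl
use v5.22;
use warnings;
use Mojo::DOM;

# Hypothetical helper (not part of the original answers): return the
# first-column text of the first table that follows the <h2> whose text
# equals $heading. Returns an empty list if nothing matches.
sub cells_after_heading {
    my ($html, $heading) = @_;
    my $dom = Mojo::DOM->new($html);

    # Pick the heading by its exact text.
    my $h2 = $dom->find('h2')->first(sub { $_->text eq $heading })
        or return;

    # following('table') collects the later sibling elements matching 'table'.
    my $table = $h2->following('table')->first
        or return;

    # all_text also picks up text inside nested <a> and <span> elements.
    return $table->find('tr > td:first-child')->map('all_text')->each;
}

# Reduced stand-in for the page in the question (assumed, for illustration).
my $sample = <<'HTML';
<h2>Node Object Properties</h2>
<table class="reference">
<tr><th>Property</th></tr>
<tr><td><a href="prop_node_attributes.asp">attributes</a></td></tr>
</table>
<h2>Node Object Methods</h2>
<p>intro</p>
<table class="reference">
<tr><th>Method</th><th>Description</th></tr>
<tr><td><a href="met_node_appendchild.asp">appendChild()</a></td><td>Adds a new child node</td></tr>
<tr><td>getUserData(<span class="parameter">key</span>)</td><td>Returns the object associated to a key</td></tr>
</table>
HTML

say for cells_after_heading($sample, 'Node Object Methods');
```

Unlike nth-of-type(2), this does not depend on how many tables come before the one you want, only on the heading text staying the same.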


Web::Query offers a solution nearly identical to the Mojo::DOM solution proposed by brian d foy.

use v5.10;  # enables say

use Web::Query;

my $html = do { local $/; <DATA> };

wq($html)
    ->find('table.reference:nth-of-type(2) > tr > td > a')
    ->each(sub {
        my ($i, $e) = @_;   # index and element
        say $e->text();
    });

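However, Mojo::DOM looks like the more robust library. To get Web::Query to match its selector correctly, I had to edit the input provided in the question and add a root node that wraps everything else.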

__DATA__ 
<html> 
... 
</html> 

You can use XPath to extract the data from the first table that follows the heading Node Object Methods, like this:

use v5.10;  # enables say

use Web::Scraper;

my $html = do { local $/; <DATA> }; 

my $methods = scraper { 
    process '//h2[.="Node Object Methods"]/following-sibling::table[1]//tr/td[1]', 
     'renners[]' => 'TEXT'; 
}; 
my $res = $methods->scrape($html); 

say join "\n", @{ $res->{renners} }; 

The output will be

appendChild() 
cloneNode() 
compareDocumentPosition() 
getFeature(feature,version) 
getUserData(key) 
hasAttributes() 
hasChildNodes() 
insertBefore() 