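How do I extract Amazon reviews from HTML?

I have been trying to write a Perl script that scrapes Amazon and downloads product reviews, but I have not been able to get it to work. I am using the Perl modules LWP::Simple and HTML::TreeBuilder::XPath.

Given the HTML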

<div id="revData-dpReviewsMostHelpfulAUI-R1GQHD9GMGBDXP" class="a-row a-spacing-small"> 
    <span class="a-size-mini a-color-state a-text-bold"> 
    Verified Purchase 
    </span> 
    <div class="a-section"> 
    I bought this to replace an earlier model that got lost in transit when we moved. It is a real handy helper to have when making tortillas. Follow the recipe for flour tortillas in the little recipe book that comes with it. I make a few changes 

    </div> 
</div> 

</div> 
</div>

I want to extract the product review. For this I wrote:

use LWP::Simple; 

#use HTML::TreeBuilder; 
use HTML::TreeBuilder::XPath; 

# Take the ASIN from the command line. 
my $asin = shift @ARGV or die "Usage: perl get_reviews.pl <asin>\n"; 

# Assemble the URL from the passed ASIN. 
my $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=customer-reviews"; 

# Set up unescape-HTML rules. Quicker than URI::Escape. 
my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' '); 
my $unescape_re = join '|' => keys %unescape; 

# Request the URL. 
my $content = get($url); 
die "Could not retrieve $url" unless $content; 
my $tree = HTML::TreeBuilder::XPath->new_from_content($content); 
my @data = $tree->findvalues('div[@class ="a-section"]'); 

foreach (@data) 
{ 
    print "$_\n"; 
}

But I am not getting any output. Can anyone point out my mistake?

Source

2015-04-01 Aakash Sharma

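You should stick with 'uri_unescape' to remove character entities from the HTML. A hash used with a global regex may well be faster, but the difference is negligible compared with the time it takes to fetch the HTML from the internet. And 'uri_unescape' is much more concise and self-documenting. – Borodin 2015-04-01 13:14:34

Why scrape Amazon? Did you know they have a [product API](https://metacpan.org/release/Net-Amazon)? – 2015-04-08 16:04:55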
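The %unescape hash and $unescape_re in the question are defined but never applied to anything. A minimal sketch of how such a table is usually applied follows; the helper name unescape_html is made up for illustration, and the table only covers the three entities listed. For reference, uri_unescape from URI::Escape decodes %XX URL escapes, while decode_entities from HTML::Entities is the general-purpose decoder for HTML character entities.

```perl
use strict;
use warnings;

# Same table as in the question
my %unescape = ('&quot;' => '"', '&amp;' => '&', '&nbsp;' => ' ');
my $unescape_re = join '|', map { quotemeta } keys %unescape;

# Hypothetical helper: replace every known entity in a single pass
sub unescape_html {
    my ($text) = @_;
    $text =~ s/($unescape_re)/$unescape{$1}/g;
    return $text;
}

print unescape_html('&quot;Tom &amp; Jerry&quot;&nbsp;rocks'), "\n";
```

Because the substitution runs in one pass, text such as '&amp;quot;' is decoded once to '&quot;' and not a second time.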

I think the XPath should be '//div[@class ="a-section"]' (the extra // at the start of the expression finds a div anywhere in the HTML).

Source

2015-04-01 08:27:25 mirod

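As choroba says, your XPath expression should start with // so that it searches for descendants of type div. As it stands, you are searching for <div> elements at the root of the document, and there are none.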
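The difference between the two expressions can be seen on a trimmed copy of the fragment from the question. This is only a sketch, assuming HTML::TreeBuilder::XPath is installed; the variable names ($html, @relative, @anywhere) are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Trimmed copy of the fragment from the question
my $html = <<'HTML';
<div id="revData-dpReviewsMostHelpfulAUI-R1GQHD9GMGBDXP" class="a-row a-spacing-small">
    <span class="a-size-mini a-color-state a-text-bold">Verified Purchase</span>
    <div class="a-section">I bought this to replace an earlier model.</div>
</div>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Relative path: only the children of the context node (the root <html> element)
my @relative = $tree->findvalues('div[@class="a-section"]');

# Leading //: every descendant in the document
my @anywhere = $tree->findvalues('//div[@class="a-section"]');

printf "relative: %d match(es)\n", scalar @relative;
printf "anywhere: %d match(es)\n", scalar @anywhere;
print "$_\n" for @anywhere;

$tree->delete;    # release the parse tree
```

The parser wraps the fragment in <html> and <body>, so the root element has no <div> children and the relative expression comes back empty.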

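You are also looking for a class attribute that is equal to a-section when, in fact, the class attribute of each div element can contain several classes, like

class="a-section a-subheader a-breadcrumb celwidget"

and you want any one of them to be a-section.

There are a few ways to solve this. The most obvious is to use the XPath contains function to see whether a-section appears anywhere in the class string, like this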

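use strict;
use warnings;
use feature 'say';    # needed for say()

use LWP::Simple;
use HTML::TreeBuilder::XPath;

my $asin = 'B0031EJBI4';

my $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=customer-reviews";

# Fetch the page and fail loudly if nothing comes back
my $html = get($url) or die "Could not retrieve $url\n";
my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Match any div whose class string contains "a-section"
my @nodes = $tree->findnodes('//div[contains(@class, "a-section")]');

say scalar @nodes;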

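This reports 64 such nodes. That is the correct result, and you may not want to go any further, but the solution isn't a safe one because it would match

<div class="aaa-sections">

nodes as well. To solve this properly you need to fall back on the non-XPath HTML::Element method look_down, like this, which insists on a word boundary before and after a-section.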

my @nodes = $tree->look_down(
    _tag => 'div', 
    class => qr/\ba-section\b/, 
); 

say scalar @nodes;

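Again, the result is a correct 64.

But even this solution won't allow for class names that start or end with a non-word character, such as -section, because /\b-section\b/ will never match a class of that name. The most general solution is to use a subroutine as a look_down criterion, as below. It splits the class string on whitespace (' ' is correct here: don't change it to / / or /\s+/) and builds a hash %classes with all the substrings as keys. The presence of an a-section class is then a simple test of $classes{'a-section'}.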

@nodes = $tree->look_down(
    _tag => 'div', 
    sub {
        return unless my $class = $_[0]->attr('class');
        my %classes = map { $_ => 1 } split ' ', $class;
        $classes{'a-section'};
    }
); 

say scalar @nodes;

Once again the search on this page gives 64, but this solution will work with any class string.

Source
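The three tests can be compared side by side on plain class strings, without fetching anything. This is a rough sketch using only core Perl; the helper names (by_substring, by_regex, by_split) and the a-section-wide class are hypothetical, and by_substring only imitates what XPath contains does.

```perl
use strict;
use warnings;

# Substring test: what XPath contains(@class, "a-section") does
sub by_substring {
    my ($class) = @_;
    return index($class, 'a-section') >= 0 ? 1 : 0;
}

# Word-boundary regex: the first look_down variant
sub by_regex {
    my ($class) = @_;
    return $class =~ /\ba-section\b/ ? 1 : 0;
}

# Split on whitespace and look up the exact name: the subroutine variant
sub by_split {
    my ($class) = @_;
    my %classes = map { $_ => 1 } split ' ', $class;
    return $classes{'a-section'} ? 1 : 0;
}

my @samples = (
    'a-section',
    'a-section a-subheader a-breadcrumb celwidget',
    'aaa-sections',
    'a-section-wide',        # hypothetical class name
    "  a-row\ta-section ",   # leading spaces and a tab separator
);

printf "%-10s %-6s %-6s %s\n", 'substring', 'regex', 'split', 'class string';
for my $class (@samples) {
    printf "%-10d %-6d %-6d '%s'\n",
        by_substring($class), by_regex($class), by_split($class), $class;
}
```

Only the split version rejects both aaa-sections and a-section-wide; the regex accepts the latter because - is not a word character, so \b matches right after "section". The last sample shows why ' ' is the right separator: split / / would return empty leading fields and would not split at the tab.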

2015-04-01 13:03:27 Borodin

-1

use LWP::Simple; 

#use HTML::TreeBuilder; 
use HTML::TreeBuilder::XPath; 

# Take the ASIN from the command line. 
my $asin = shift @ARGV or die "Usage: perl get_reviews.pl <asin>\n"; 

# Assemble the URL from the passed ASIN. 
my $url = "http://rads.stackoverflow.com/amzn/click/B00R3DO58K"; 

# Set up unescape-HTML rules. Quicker than URI::Escape. 
my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' '); 
my $unescape_re = join '|' => keys %unescape; 

# Request the URL. 
my $content = get($url); 



die "Could not retrieve $url" unless $content; 
my $tree = HTML::TreeBuilder::XPath->new_from_content($content); 
my @data = $tree->findvalues('//span[@class="vtp-byline-text"]'); 


#print $content; 

foreach (@data) 
{ 
    print "$_\n"; 
}

Source

2015-04-01 13:14:29

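A little narrative to explain your post would be nice. And it has the same problem as the OP's code: it won't find elements that have multiple values in the 'class' attribute. – Borodin 2015-04-01 13:16:59

Your '@data' array contains only four nodes, with the text '~ Matthew McConaughey ~ Ian McKellen ~ Jennifer Lawrence ~ Ian McKellen'. That isn't what the OP had in mind when he asked for reviews! – Borodin 2015-04-01 13:22:40

I only gave the span element's attribute as an example. With '//span[@class="a-size-base review-text"]' it will give you the list of reviews from the results on the current page. – 2015-04-02 06:04:22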
