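I have been trying to write a Perl script that scrapes Amazon and downloads product reviews, but I haven't been able to get it working. I am using the Perl modules LWP::Simple and HTML::TreeBuilder::XPath. How can I extract the Amazon reviews from the HTML?
Given this HTML: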
<div id="revData-dpReviewsMostHelpfulAUI-R1GQHD9GMGBDXP" class="a-row a-spacing-small">
<span class="a-size-mini a-color-state a-text-bold">
Verified Purchase
</span>
<div class="a-section">
I bought this to replace an earlier model that got lost in transit when we moved. It is a real handy helper to have when making tortillas. Follow the recipe for flour tortillas in the little recipe book that comes with it. I make a few changes
</div>
</div>
</div>
</div>
I want to extract the text of the product review. This is what I wrote:
use LWP::Simple;
#use HTML::TreeBuilder;
use HTML::TreeBuilder::XPath;
# Take the ASIN from the command line.
my $asin = shift @ARGV or die "Usage: perl get_reviews.pl <asin>\n";
# Assemble the URL from the passed ASIN.
my $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=customer-reviews";
# Set up unescape-HTML rules. Quicker than URI::Escape.
my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' ');
my $unescape_re = join '|' => keys %unescape;
# Request the URL.
my $content = get($url);
die "Could not retrieve $url" unless $content;
my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
my @data = $tree->findvalues('div[@class ="a-section"]');
foreach (@data)
{
print "$_\n";
}
But I don't get any output. Can anyone point out my mistake?
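One likely cause, sketched here under stated assumptions: the expression `div[@class ="a-section"]` is a relative XPath, so it only looks at direct children of the root element, and the review `div` is nested much deeper. Prefixing the path with `//` searches the whole tree. The markup below is a shortened copy of the snippet in the question, and `extract_reviews` is a hypothetical helper name. A live Amazon page may be structured differently, or may not be served to the default LWP::Simple client at all, which this sketch does not address.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Shortened sample markup from the question; a live page would come from get($url).
my $html = <<'HTML';
<div id="revData-dpReviewsMostHelpfulAUI-R1GQHD9GMGBDXP" class="a-row a-spacing-small">
<span class="a-size-mini a-color-state a-text-bold">
Verified Purchase
</span>
<div class="a-section">
I bought this to replace an earlier model.
</div>
</div>
HTML

# Hypothetical helper: returns the trimmed text of every matching review div.
sub extract_reviews {
    my ($content) = @_;
    my $tree = HTML::TreeBuilder::XPath->new_from_content($content);

    # '//' matches at any depth; without it only children of the root are tested.
    my @reviews = map { $_->as_trimmed_text }
        $tree->findnodes('//div[@class="a-section"]');

    $tree->delete;    # release the parse tree
    return @reviews;
}

print "$_\n" for extract_reviews($html);
```

Note that `@class="a-section"` is an exact string match, so it would miss a `div` whose class attribute carries additional class names.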
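You should stick with 'uri_unescape' to remove character entities from the HTML. A hash used with a global regex substitution may be faster, but the difference is unlikely to matter compared with the time it takes to fetch the HTML from the internet, and 'uri_unescape' is much more concise and self-documenting. – Borodin 2015-04-01 13:14:34
Why scrape Amazon? Did you know they have a [product API](https://metacpan.org/release/Net-Amazon)? – 2015-04-08 16:04:55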
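On the unescaping point raised in the first comment, here is a minimal sketch of a module-based alternative to the hand-rolled hash. It uses `decode_entities` from HTML::Entities, which is the module aimed at HTML character entities such as `&amp;` (as opposed to URL percent-escapes). This is a swapped-in technique, not the one named in the comment, and the sample string is made up for illustration. Text returned by HTML::TreeBuilder is already entity-decoded, so this step would only matter when working on raw HTML.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Hypothetical sample string, not taken from a real review.
my $raw = '&quot;Tortillas &amp; more&quot;';

# Called in non-void context, decode_entities returns a decoded copy
# and leaves $raw untouched.
my $decoded = decode_entities($raw);

print "$decoded\n";
```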