2012-02-15 18 views
0

Scraping the HN front page - Handling Simple HTML DOM errors

I'm using Simple HTML DOM to scrape the front page of HN (news.ycombinator.com), which works well most of the time.

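However, every so often they promote a job/company posting that lacks the elements the scraper is looking for (i.e. score, username and number of comments).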

Shameless Promotion

This, of course, breaks the arrays and therefore the output of my script:

<?php 

// 2012-02-12 Maximilian (Extract news.ycombinator.com's Front Page) 

// Set the header during development 
//header ("content-type: text/xml"); 

// Call the external PHP Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net/manual.htm) 
include('lib/simple_html_dom.php'); 

date_default_timezone_set('Europe/Berlin'); 

// Download 'news.ycombinator.com' content 
//$tmp = file_get_contents('http://news.ycombinator.com'); 
//file_put_contents('get.tmp', $tmp); 

// Retrieve the content 
$html = file_get_html('tc.tmp'); 

// Set the extraction pattern for each item 
$title = $html->find("tr td table tr td.title a"); 
$score = $html->find("tr td.subtext span"); 
$user = $html->find("tr td.subtext a[href^=user]"); 
$link = $html->find("tr td table tr td.title a"); 
$time = $html->find("tr td.subtext"); 
$additionals = $html->find("tr td.subtext a[href^=item?id]"); 

// Construct the feed by looping through the items 
$result = ''; 
for($i=0;$i<29;$i++) { 

$cr=1; 

// Check if the item points to an external website 
if (!strstr($link[$i]->href,'http')) { 

$url = 'http://news.ycombinator.com/'.$link[$i]->href; 
$description = "Join the discussion on Hacker News."; 


} else { 

$url = $link[$i]->href; 

// Getting content here 

if (empty($abstract)) { 

$description ="Failed to load any relevant content. Please try again later."; 

} else { 

$description = $abstract; 

} 

} 
// Put all the items together 
    $result .= '<item><id>f'.$i.'</id><title>'.htmlspecialchars(trim($title[$i]->plaintext)).'</title><description><![CDATA['.$description.']]></description><pubDate>'.str_replace(' | '.$additionals[$i]->plaintext,'',str_replace($score[$i]->plaintext.' by '.$user[$i]->plaintext.' ','',$time[$i]->plaintext)).'</pubDate><score>'.$score[$i]->plaintext.'</score><user>'.$user[$i]->plaintext.'</user><comments>'.$additionals[$i]->plaintext.'</comments><id>'.substr($additionals[$i]->href,8).'</id><discussion>http://news.ycombinator.com/'.$additionals[$i]->href.'</discussion><link>'.htmlspecialchars($url).'</link></item>'; 
} 

$output = '<rss><channel><id>news.ycombinator.com Frontpage</id><buildDate>'.date('Y-m-d H:i:s').'</buildDate>'.$result.'</channel></rss>'; 

file_put_contents('tc.xml', $output); 


?> 

Here is the correct output:

<item> 
<id>f0</id> 
<title>Show HN: Bootswatch, free swatches for your Bootstrap site</title> 
<description><![CDATA[Easy to Install Simply download the CSS file from the swatch of your choice and replace the one in Bootstrap. No messing around with hex values. Whole New Feel We've all been there with the black bar and blue buttons. See how a splash of color and typography can transform the feel of your site. Modular Changes are contained in just two LESS files, enabling modification and ensuring forward compatibility.]]></description> 
<pubDate>3 hours ago</pubDate> 
<score>196 points</score> 
<user>parkov</user> 
<comments>30 comments</comments> 
<id>3594540</id> 
<discussion>http://news.ycombinator.com/item?id=3594540</discussion> 
<link>http://bootswatch.com</link> 
</item> 
<item> 
<id>f1</id> 
<title>Louis CK inspires Jim Gaffigan to sell comedy special for $5 online</title> 
<description><![CDATA[Dear Internet Friends,Inspired by the brilliant Louis CK, I have decided to debut my all-new hour stand-up special on my website, Jimgaffigan.com.Beginning sometime in April, “Jim Gaffigan: Mr. Universe” will be available exclusively for download for only $5. A dollar from each download will go directly to The Bob Woodruff Foundation; a charity dedicated to serving injured Veterans and their families.I am confident that the low price of my new comedy special and the fact that 20% of each $5 download will be donated to this very noble cause will prevent people from stealing it. Maybe I’m being naïve, but I trust you guys.]]></description> 
<pubDate>57 minutes ago</pubDate> 
<score>25 points</score> 
<user>rkudeshi</user> 
<comments>4 comments</comments> 
<id>3595285</id> 
<discussion>http://news.ycombinator.com/item?id=3595285</discussion> 
<link>http://www.whosay.com/jimgaffigan/content/218011</link> 
</item> 

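Here is an example of incorrect output. Note that the elements are not empty, so I can't seem to catch the error and simply skip to the next item. Everything after the promotion breaks: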

<item> 
<id>f14</id> 
<title>Build the next Legos: We're hiring an iOS Developer &amp; Web Developer (YC S11)</title> 
<description><![CDATA[Interested in building the next generation of toys on digital devices such as the iPad? That’s what we’re doing here at Launchpad Toys with apps like Toontastic (Named one of the “Top 10 iPad Apps of 2011” by the New York Times and was recently added to the iTunes Hall of Fame) and an awesom]]><![CDATA[e suite of others we have under development. We’re looking for creative and playful coders that have made games or highly visual apps/sites in the past for our two open development positions. As a kid, you probably played with Legos endlessly and grew up to be a hacker because you still love building things. Sounds like you? Email us at [email protected] with a couple links to some projects and code that we can look at along with your resume.]]></description> 
<pubDate>2 hours ago</pubDate> 
<score>14 points</score> 
<user>bproper</user> 
<comments>7 comments</comments> 
<id>3594944</id> 
<discussion>http://news.ycombinator.com/item?id=3594944</discussion> 
<link>http://launchpadtoys.com/blog/2012/02/iosdeveloper-webdeveloper/</link> 
</item> 
<item> 
<id>f15</id> 
<title>SOPA foe Fred Wilson supports a blacklist on pirate sites</title> 
<description><![CDATA[VC Fred Wilson says Google, Bing, Facebook, and Twitter should warn people when they try to log in at known pirate sites: "We don't need legislation." Fred Wilson says: If they try to pass antipiracy legislation, it will once again be 'war.' (Credit: Greg Sandoval/CNET) Fred Wilson, a well-known ven]]><![CDATA[ture capitalist from New York, says he's in favor of creating a blacklist for Web sites found to traffic in pirated films, music, and other intellectual property. The co-founder of Union Square Ventures told a gathering of media executives at the Paley Center for Media yesterday that he believes a good antipiracy measure would be for Google, Twitter, Facebook, and other major sites to issue warnings to people when they try to connect with a known pirate site. Fred Wilson, a co-founder of Union Square Ventures, says 'Our children have been taught to steal.' (Credit: Union Square Ventures) Wilson favors establishing an independent group to create a "black and white list." "The blacklist are those sites we all know are bad news," he told the audience in New York.]]></description> 
<pubDate>14 points by bproper 2 hours ago | 7 comments</pubDate> 
<score>24 points</score> 
<user>andrewcross</user> 
<comments>12 comments</comments> 
<id>3594558</id> 
<discussion>http://news.ycombinator.com/item?id=3594558</discussion> 
<link>http://news.cnet.com/8301-31001_3-57377862-261/post-sopa-influential-tech-investor-favors-blacklisting-pirate-sites/</link> 
</item> 
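
The shift seen in item f15 can be reproduced without any scraping. The following is a minimal sketch with hypothetical data (the titles and scores are made up): each array stands in for the result of a separate find() call, and the promotion simply produces no match in the score list, so every later score moves up by one position.

```php
<?php
// Hypothetical data: three rows on the page, the second is a promotion
// with no score. Each array mimics the result of a separate find() call.
$titles = ['Story A', 'Job posting', 'Story B'];
$scores = ['196 points', '25 points'];   // only two matches, no gap is left

$rows = [];
foreach ($titles as $i => $title) {
    // isset() only guards the missing last index; it cannot detect the shift
    $rows[] = $title . ' => ' . (isset($scores[$i]) ? $scores[$i] : '(none)');
}

print_r($rows);
```

The job posting ends up paired with the score that belongs to the following story, and no element is empty until the very end of the list, which is why an emptiness check inside the loop does not catch the problem.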

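So here is my question: how do I handle the case where a particular element is missing and find() doesn't throw an error? Do I have to start from scratch, or is there a better way to scrape HN's front page?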

For anyone curious, here is the whole XML file: http://thequeue.org/api/tc.xml

+0

They also have [RSS](http://news.ycombinator.com/rss), you know... – rid 2012-02-15 20:58:32

+0

@Radu Indeed they do, but I'm trying to get: the time of posting, the number of comments, the submitter's username and the submission score. – mmackh 2012-02-15 21:00:58

Answers

1

You have to work in chunks to handle this. There seems to be a dummy separator element that can help you:

$news = preg_split('/<tr style="height:5px"><\/tr>/',$html->find('tbody',2)->innertext); 

Then use subselectors:

foreach($news as $article){ 
    $article = str_get_html($article); 
    // No upvote arrow found, so it's not a valid article 
    if(count($article->find('img')) === 0){ 
     continue; 
    } 
} 

And use the same selectors for the other elements.
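
To make the per-chunk idea concrete, here is a minimal sketch that extracts each field inside its own chunk, so a missing element stays missing in that chunk instead of shifting into the next one. It deliberately uses plain string functions and preg_match rather than Simple HTML DOM so that it runs without the library. The sample markup, the helper name extractArticle and the regex patterns are assumptions for illustration, not HN's actual markup.

```php
<?php
// Hypothetical, simplified markup: two stories separated by the spacer row,
// with a promotion (no arrow image, no score, no user) in between.
$sample =
    '<tr><td><img src="grayarrow.gif"></td><td class="title"><a href="http://a.example">Story A</a></td></tr>'
  . '<tr><td class="subtext"><span>10 points</span> by <a href="user?id=alice">alice</a></td></tr>'
  . '<tr style="height:5px"></tr>'
  . '<tr><td class="title"><a href="item?id=1">We are hiring</a></td></tr>'
  . '<tr><td class="subtext">2 hours ago</td></tr>'
  . '<tr style="height:5px"></tr>'
  . '<tr><td><img src="grayarrow.gif"></td><td class="title"><a href="http://b.example">Story B</a></td></tr>'
  . '<tr><td class="subtext"><span>24 points</span> by <a href="user?id=bob">bob</a></td></tr>';

// Hypothetical helper: returns the fields of one chunk, or null for a promotion.
function extractArticle($chunk)
{
    // No upvote arrow means the chunk is a promotion (or empty): skip it
    if (strpos($chunk, 'grayarrow.gif') === false) {
        return null;
    }
    $article = ['title' => null, 'score' => null, 'user' => null];
    if (preg_match('~<td class="title"><a href="[^"]*">([^<]*)</a>~', $chunk, $m)) {
        $article['title'] = $m[1];
    }
    if (preg_match('~<span>(\d+) points?</span>~', $chunk, $m)) {
        $article['score'] = (int) $m[1];
    }
    if (preg_match('~<a href="user\?id=[^"]*">([^<]*)</a>~', $chunk, $m)) {
        $article['user'] = $m[1];
    }
    return $article;
}

$articles = [];
foreach (explode('<tr style="height:5px"></tr>', $sample) as $chunk) {
    $article = extractArticle($chunk);
    if ($article !== null) {
        $articles[] = $article;
    }
}

print_r($articles);
```

With Simple HTML DOM the preg_match calls would be replaced by find() calls on the parsed chunk, as in the loop above. The point of the design is only that every lookup is scoped to one post.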

+0

Thank you very much for the suggestion to split the page into chunks. I've updated the code above with my solution. – mmackh 2012-02-17 10:58:28

+0

Hi, you should leave the question as it was so that people can see what your initial approach was and understand your problem more clearly. – 2012-02-17 14:12:34

+0

Will do, thanks for the suggestion. – mmackh 2012-02-17 21:14:32

0

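Thanks to Ivan's train of thought, I now split the initially scraped HTML into an array, with each node representing one post. Then, looping over each post, I check whether the upvote arrow image is present. If it isn't, I don't add the post to the result. Finally, everything is stitched back together with the sponsored posts excluded. The code is below: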

$clean = ''; 
$array = explode('<tr style="height:5px"></tr>',$html); 
foreach ($array as $post) { 

    // Keep only posts that contain the upvote arrow; sponsored posts lack it 
    if (strpos($post,'grayarrow.gif') !== false) { 

    $clean .= $post; 

    } 

} 
unset($array); 
$html = str_get_html($clean.'</body></html>');
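
Since the filtered markup is fed back into the same index-based loop, it may be worth checking that the lists still line up before building the feed. The following is a minimal sketch of such a check, assuming simplified hypothetical markup and plain string counting rather than Simple HTML DOM. The variable names $page, $separator, $counts and $aligned are made up for the example.

```php
<?php
// Hypothetical, simplified page: one story, one promotion, one story.
$separator = '<tr style="height:5px"></tr>';
$page =
    '<tr><td><img src="grayarrow.gif"></td><td class="title">Story A</td></tr>'
  . '<tr><td class="subtext"><span>10 points</span></td></tr>'
  . $separator
  . '<tr><td class="title">We are hiring</td></tr>'
  . '<tr><td class="subtext">2 hours ago</td></tr>'
  . $separator
  . '<tr><td><img src="grayarrow.gif"></td><td class="title">Story B</td></tr>'
  . '<tr><td class="subtext"><span>24 points</span></td></tr>';

// Filter as in the answer: keep only posts that contain the upvote arrow
$clean = '';
foreach (explode($separator, $page) as $post) {
    if (strpos($post, 'grayarrow.gif') !== false) {
        $clean .= $post;
    }
}

// Every list that is later indexed by the same $i must have the same length
$counts = [
    'title' => substr_count($clean, 'class="title"'),
    'score' => substr_count($clean, '<span>'),
];
$aligned = count(array_unique($counts)) === 1;

echo $aligned ? "aligned\n" : "misaligned\n";
```

In the real script the equivalent check would compare count($title), count($score), count($user) and so on after the find() calls, and stop or log a warning when they differ.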