How do I collect data from other websites for an app?

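I am trying to build a news hub app. My goal is to pull news articles from other news channels, summarize them, and present them as unbiased bullet points. I already have the summarization algorithm running; what I need is code that collects data from other websites such as NDTV, CNN, etc. Please describe how to do this. Code, links, examples, and screenshots would be a great help. Thanks! (Y)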

2013-11-03 Sunny

Most news channels have some kind of RSS feed, and that is probably your best option.

You can use **Python**.
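
The two comments above can be combined into a minimal sketch: reading an RSS feed with Python's standard library. The feed content below is a made-up sample, not a real NDTV or CNN feed, and `parse_rss` is a hypothetical helper name. It assumes a plain RSS 2.0 layout in which each `<item>` carries `<title>`, `<link>`, and `<description>`.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed (assumption). A real feed would be downloaded from the
# channel's RSS URL, e.g. urllib.request.urlopen(feed_url).read(), and the
# resulting bytes can be passed to parse_rss() the same way.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First headline</title>
      <link>http://www.example.com/news/1</link>
      <description>Summary of the first story.</description>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://www.example.com/news/2</link>
      <description>Summary of the second story.</description>
    </item>
  </channel>
</rss>"""


def parse_rss(xml_text):
    """Return one dict (title, link, description) per <item> in the feed."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter("item"):
        articles.append({
            # findtext() returns the default when the child element is missing
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            "description": item.findtext("description", default="").strip(),
        })
    return articles


articles = parse_rss(SAMPLE_RSS)
for article in articles:
    print(article["title"], "->", article["link"])
```

Each dict in `articles` can then be handed to the summarization step. Feeds that use Atom or namespaced elements would need different tag names.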

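Web scraping is the way to go. You can get the news articles, or anything else you need, with Scrapy, BeautifulSoup, or Selenium. These are Python modules for extracting data (text) from HTML pages; afterwards you can save the data wherever you want, for example in a database. For headlines it is best to use the sites' RSS pages, so take that into account as well.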
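
As a rough sketch of that flow (extract text from an HTML page, then save it to a database), the example below uses only the standard library: `html.parser` stands in for Scrapy or BeautifulSoup so that it runs without extra installs, and `sqlite3` stands in for whatever database you choose. The markup, the `headline` class, the `HeadlineParser` name, and the table layout are all assumptions made for illustration; a real news site will have its own structure.

```python
import sqlite3
from html.parser import HTMLParser

# Made-up page markup (assumption); real sites use their own tags and classes.
SAMPLE_HTML = """
<html><body>
  <h2 class="headline"><a href="/news/1">First headline</a></h2>
  <p>Unrelated paragraph.</p>
  <h2 class="headline"><a href="/news/2">Second headline</a></h2>
</body></html>
"""


class HeadlineParser(HTMLParser):
    """Collect (text, href) pairs for links inside <h2 class="headline">."""

    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.current_href = None
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h2" and attrs.get("class") == "headline":
            self.in_headline = True
        elif tag == "a" and self.in_headline:
            self.current_href = attrs.get("href")

    def handle_data(self, data):
        # Only keep text that sits inside a link within a headline element
        if self.in_headline and self.current_href is not None and data.strip():
            self.headlines.append((data.strip(), self.current_href))

    def handle_endtag(self, tag):
        if tag == "a":
            self.current_href = None
        elif tag == "h2":
            self.in_headline = False


parser = HeadlineParser()
parser.feed(SAMPLE_HTML)

# Store the scraped rows; ":memory:" keeps the sketch free of side effects.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, url TEXT)")
conn.executemany("INSERT INTO articles (title, url) VALUES (?, ?)", parser.headlines)
conn.commit()

rows = conn.execute("SELECT title, url FROM articles ORDER BY url").fetchall()
print(rows)
```

With BeautifulSoup or Scrapy the extraction step would usually shrink to a selector query, but the overall shape (fetch, extract, store) stays the same.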

2013-11-03 11:18:11

There is a PHP library called QueryList (http://git.oschina.net/jae/QueryList). It uses phpQuery internally and takes an array of CSS-selector filter rules to scrape specific content from a given URL.

The documentation is in Chinese (I don't think there is an English version), but the library is very simple to use:

<?php 
// include the lib 
require_once('QueryList.class.php'); 

// url to fetch content 
$url = 'http://www.example.com/index.html'; 

// filter rules using css selector grammar 
$regArr = array(
    'time' => array('td:nth-child(2)', 'text'), 
    'summary' => array('td:nth-child(3) td:nth-child(3)', 'text'), 
    'imgSrc' => array('h1 > a > img', 'src') 
    ); 

// optional, firstly find `.divbox > table`, then find the things defined by $regArr in each `.divbox > table` 
$regRange = '.divbox > table'; 

// do the query 
$result = QueryList::Query($url, $regArr, $regRange); 

// the result will be an array like: 
/** Array 
* (
* [0] => Array 
* (
*  'time' => , 
*  'summary' => , 
*  'imgSrc' => 
* ) 
* [1] => Array 
* (
*  'time' => , 
*  'summary' => , 
*  'imgSrc' => 
* ) 
* ... 
*) 
*/ 
echo '<pre>'; 
print_r($result->jsonArr); 
echo '</pre>';

You can also define an exclusion pattern and a callback function for $regArr, which I think will fit your requirements.

2015-03-16 13:38:17 UniFreak
