How do I detect which part of a page is the article?

-5

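I want to build a tool similar to Instapaper or Readability, and I would like to know the best way to find and extract the text from a web page. Do you have any ideas? How do I detect which part of a page is the article?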

Source

2012-06-28 Sławosz

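Since you haven't mentioned any technologies or algorithms, the absolute best approach is to open a web browser, open the page you want, copy the relevant text, and paste it into your database. – Amberlamps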

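The question is too broad for a specific answer, but you can split it into three concerns:

A way to fetch web resources. For example libcurl, or pretty much anything that can speak HTTP.
A DOM parser. For example, Python has xml.dom.minidom.
An algorithm to traverse the DOM tree and extract the text. Whether it scans for class=article, or for <div>s containing more than 1024 characters, etc., is entirely up to you. You will need to experiment to get this right.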
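
As a rough illustration of the third concern, here is a minimal Ruby sketch of one possible heuristic. It is not a standard algorithm. It prefers an element marked class="article" and otherwise picks the <div> whose direct <p> children hold the most text. The names find_article, paragraph_chars, node_text and SAMPLE are hypothetical, and the threshold is arbitrary. REXML is used only so the sketch runs with Ruby's bundled libraries; it accepts well-formed markup only, so real-world HTML would need a more lenient parser.

```ruby
require 'rexml/document'

# Recursively join the text of a node and all of its descendants.
def node_text(node)
  case node
  when REXML::Text    then node.value
  when REXML::Element then node.children.map { |child| node_text(child) }.join
  else ''             # ignore comments and other node types
  end
end

# Number of text characters held by the direct <p> children of a <div>.
def paragraph_chars(div)
  div.get_elements('p').sum { |para| node_text(para).strip.length }
end

# Heuristic (an assumption for illustration): prefer an element marked
# class="article"; otherwise take the <div> with the most paragraph text,
# but only if that text is longer than min_chars.
def find_article(doc, min_chars: 1024)
  marked = REXML::XPath.first(doc, "//*[@class='article']")
  return marked if marked

  best = REXML::XPath.match(doc, '//div').max_by { |div| paragraph_chars(div) }
  best if best && paragraph_chars(best) > min_chars
end

# Hypothetical, well-formed sample page.
SAMPLE = <<~HTML
  <html><body>
    <div id="nav"><p>Home</p><p>About</p></div>
    <div id="main">
      <p>First paragraph of the story, long enough to stand out.</p>
      <p>Second paragraph with <b>inline markup</b> inside it.</p>
    </div>
  </body></html>
HTML

doc = REXML::Document.new(SAMPLE)
article = find_article(doc, min_chars: 50)
puts node_text(article).gsub(/\s+/, ' ').strip if article
```

Scoring only the direct <p> children keeps an outer wrapper <div> from winning just because it contains everything else. Expect to tune both the scoring rule and the threshold against the sites you actually care about.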

I would suggest asking a separate question for each of these. After researching each one, of course. :)

Source

2012-06-28 14:11:21 Deestan

Here is an idea to get you started, using Ruby. I just tested the code below and it works fine for me. See if it helps.

require 'open-uri'
require 'nokogiri'

url = 'http://www.stackoverflow.com'

# Fetch the raw HTML of the page (URI.open is provided by open-uri)
raw_contents = URI.open(url).read

# Parse the page and keep only its text: all HTML tags are stripped
# and encoded characters (entities) are decoded by Nokogiri
text = Nokogiri::HTML(raw_contents).content

# stack.txt now contains a stripped, plain-text file which you can manipulate further
File.write('c:\ruby193\bin\web-content\stack.txt', text)

# Double quotes are needed so that \n is printed as a newline
puts "Here is the stripped text of your webpage\n" + text

Source

2012-06-28 14:17:12
