Scraping HTML in Lisp

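My question is related to another question found here: Scraping an HTML table in Common Lisp? I am trying to extract data from a web page in Common Lisp. I am currently using drakma to send the HTTP request, and I am trying to use chtml to extract the data I am looking for. The web page I am trying to scrape is http://erg.delph-in.net/logon, and here is my code: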

(defun send-request (sentence)
  "Sends SENTENCE in an HTTP request to logon for parsing, and receives
back the web page containing the MRS output."
  (drakma:http-request "http://erg.delph-in.net/logon"
                       :method :post
                       :parameters `(("input" . ,sentence)
                                     ("task" . "Analyze")
                                     ("roots" . "sentences")
                                     ("output" . "mrs")
                                     ("exhaustivep" . "best")
                                     ("nresults" . "1"))))

And here is the function I am having trouble with:

(defun get-mrs (sentence)
  (let* ((str (send-request sentence))
         (document (chtml:parse str (cxml-stp:make-builder))))
    (stp:filter-recursively (stp:of-name "mrsFeatureTop") document)))

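Basically, all the data I need to extract is in an HTML table, but it is too big to paste here. In my get-mrs function I was just trying to get the tag with the name mrsFeatureTop, but I am not sure this is correct, since I get an error: not an NCName: onclick. Any help scraping the table would be greatly appreciated. Thank you.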


2013-05-18 turingcomplete

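Thanks wvxvw, I agree that a lot of web pages are garbage. I will try your suggestion. You have helped me a lot with other questions before, and I thank you for that :). – turingcomplete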

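Old question, I know. But it is one that defeated me for a long time. Granted, a lot of web pages are garbage, but pretty much the whole of Web 2.0 is built on screen scraping, integrating heterogeneous web sites, and hacks upon hacks. It should be an ideal application for Lisp!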

The key (besides drakma) is lquery, which lets you access page content using a lispy transliteration of CSS selectors (what jQuery uses).

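Let's get the links from the media strip on Google's news page! If you open https://news.google.com in your browser and view the source, you will be overwhelmed by the complexity of the page. But if you look at the page in the browser's developer panel (Firefox: F12, Inspector), you will see that the page has some logic to it. Use the search box to find .media-strip-table; that element contains the images we want. Now open your favourite REPL. (OK, let's be honest here, Emacs: M-x slime)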

(ql:quickload '(:drakma :lquery))

;;; Get the links from the media strip on Google's news page.
(defparameter response (drakma:http-request "https://news.google.com/"))

;;; lquery parses the page and gets it ready to be queried.
;;; Note: this must be the same variable that was defined above.
(lquery:$ (initialize response))

Now let's take a look at the results:

;;; Package-qualified '$' operator, barbaric!
;;; Use (use-package :lquery) to omit the package prefix.
(lquery:$ ".media-strip-table" (html))

Wow! That is only a small part of the page? OK, how about just the first element?

(elt (lquery:$ ".media-strip-table" (html)) 0)

OK, that is a bit more manageable. Let's see if there is an image tag in there, Emacs: C-s img. Yay! There it is.

(lquery:$ ".media-strip-table img" (html))

Hmm... it found something, but it only returns empty text... Oh right, image tags are supposed to be empty!

(lquery:$ ".media-strip-table img" (attr :src))

Holy crap! GIFs are not just for jerky, crude animations?
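The same approach should carry over to a table like the one in the question. Below is a minimal sketch that runs against an inline HTML string instead of the live page, so the markup is made up: the class name mrsFeatureTop on the table, the onclick attribute, and the cell contents are all assumptions for illustration, not what http://erg.delph-in.net/logon actually returns. The variable names are hypothetical too.

```lisp
(ql:quickload :lquery)

;;; Made-up markup standing in for the real page. The class name, the
;;; onclick attribute and the cell contents are assumptions for illustration.
(defparameter *sample-html*
  "<table class=\"mrsFeatureTop\" onclick=\"toggle()\">
     <tr><td>LTOP</td><td>h1</td></tr>
     <tr><td>INDEX</td><td>e2</td></tr>
   </table>")

;;; Parse the string; the result is a vector holding the document root.
(defparameter *doc* (lquery:$ (initialize *sample-html*)))

;;; Text of every cell in the matching table, in document order, as a vector.
(defparameter *cells*
  (lquery:$ *doc* "table.mrsFeatureTop td" (text)))

;;; Attributes are read the same way as :src in the image example.
(defparameter *handlers*
  (lquery:$ *doc* "table.mrsFeatureTop" (attr :onclick)))
```

Passing the parsed document as the first argument to lquery:$ keeps the query independent of whatever page was initialized last. For the real page you would feed the string returned by drakma:http-request to initialize and adjust the selector to whatever the Inspector shows.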


2015-05-02 03:00:07

+1 for lquery! It lives here: https://shinmera.github.io/lquery/ (and here is a small tutorial on web scraping: https://lispcookbook.github.io/cl-cookbook/web-scraping.html) – Ehvince
