How to fetch only the first 40 KB of a page with cURL

I don't want to pull the whole page, just its first 40 KB, the way the Facebook Debugger tool does.

My goal is to grab the social media metadata, i.e. og:image and so on.

It can be in any programming language, PHP or Python.

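I already have phpQuery code that uses file_get_contents/cURL, and I know how to parse the HTML I receive. My question is: "How do I fetch only the first n KB of a page without fetching the whole page?"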


2017-09-16 Umair

Maybe this will help: https://stackoverflow.com/a/12014561/661872 –

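@LawrenceCherone I have phpQuery code that uses file_get_contents/cURL, and I know how to parse the HTML I receive. My question is **"How do I fetch only the first n KB of a page without fetching the whole page?"** – Umair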

This seems to have been answered already [here](https://stackoverflow.com/questions/2032924/how-to-partially-download-a-remote-file-with-curl). – Dardanboy

This isn't specific to Facebook or any other social media site, but you can get the first 40 KB with Python like this:

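import urllib2  # Python 2 only; Python 3 moved this to urllib.request
# your_link is a placeholder for the URL of the page to fetch
start = urllib2.urlopen(your_link).read(40000)  # read at most 40000 bytes of the body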
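
For readers on Python 3, where urllib2 no longer exists, the same idea can be sketched with urllib.request. This is only an illustration, not part of the original answer: the function name fetch_head and its limit and timeout parameters are made up here. Leaving the with block closes the connection, so the part of the body that was never read is not consumed by the program.

```python
import urllib.request


def fetch_head(url, limit=40000, timeout=10):
    """Return at most `limit` bytes from the start of the response body.

    `fetch_head`, `limit` and `timeout` are hypothetical names chosen for
    this sketch.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # read(limit) returns at most `limit` bytes; leaving the
        # `with` block then closes the connection.
        return response.read(limit)


# Example call (performs a real network request):
# head = fetch_head("https://example.com/")
```

The server may still have sent somewhat more than the limit before the connection closed, because data already in flight or sitting in socket buffers cannot be recalled.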


2017-09-16 11:13:38 mdegis

Will this stop loading the page once the first 40 KB have arrived? – Umair

@Umair It will read only the first 40 KB. So yes, it stops after that. – mdegis

You can use this:

curl -r 0-40000 -o 40k.raw https://www.keycdn.com/support/byte-range-requests/


The -r option stands for range.

From the curl man page:

-r, --range <range>
      (HTTP FTP SFTP FILE) Retrieve a byte range (i.e. a partial document) from a HTTP/1.1, FTP or SFTP server or a local FILE. Ranges can be
      specified in a number of ways. 

      0-499  specifies the first 500 bytes 

      500-999 specifies the second 500 bytes 

      -500  specifies the last 500 bytes 

      9500-  specifies the bytes from offset 9500 and forward 

      0-0,-1 specifies the first and last byte only(*)(HTTP)

More information can be found in this article: https://www.keycdn.com/support/byte-range-requests/
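
The same range request can be sketched in Python by sending a Range header. Two details are worth keeping in mind. Byte offsets are inclusive, so 0-39999 is exactly 40000 bytes (and the 0-40000 in the curl command above asks for 40001). A server is also free to ignore the header and reply 200 with the full page instead of 206 Partial Content, so the read is capped on the client as well. The names range_header and fetch_first_bytes are hypothetical, chosen only for this sketch.

```python
import urllib.request


def range_header(limit):
    """Range header value asking for the first `limit` bytes (offsets are inclusive)."""
    return "bytes=0-%d" % (limit - 1)


def fetch_first_bytes(url, limit=40000, timeout=10):
    """Return (status, body) where body holds at most `limit` bytes.

    Hypothetical helper for illustration only.
    """
    request = urllib.request.Request(url, headers={"Range": range_header(limit)})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # 206 means the server honored the range; 200 means it ignored the
        # header and is sending the whole page, so cap the read either way.
        return response.status, response.read(limit)


# Example call (performs a real network request):
# status, body = fetch_first_bytes("https://www.keycdn.com/support/byte-range-requests/")
```

Checking the returned status tells you whether the server actually supports range requests for that URL.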

Just in case, here is a basic example of how to do it with Go:

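package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
)

func main() {
    response, err := http.Get("https://google.com")
    if err != nil {
        log.Fatal(err)
    }
    defer response.Body.Close()
    // Read at most 40000 bytes of the body; the rest is never read.
    // io.ReadAll needs Go 1.16+; older versions use ioutil.ReadAll.
    data, err := io.ReadAll(io.LimitReader(response.Body, 40000))
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("data = %s\n", data)
}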
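
Whichever way the first 40 KB are fetched, the result is usually HTML that has been cut off mid-document. As a rough Python sketch (not part of the answer above), the standard library's html.parser can still pull og: tags such as og:image out of it, because it processes every complete tag it is fed and simply leaves a trailing incomplete one unprocessed. The names OpenGraphParser and extract_open_graph are hypothetical, and the sketch assumes the page is UTF-8 and that the og: tags sit inside the fetched prefix.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..." content="..."> pairs."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        content = attributes.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first value seen for each property.
            self.tags.setdefault(prop, content)


def extract_open_graph(html_bytes):
    """Return a dict of og: properties found in possibly truncated HTML."""
    parser = OpenGraphParser()
    # errors="replace" tolerates a multi-byte character cut at the 40 KB mark.
    parser.feed(html_bytes.decode("utf-8", errors="replace"))
    return parser.tags
```

If the dictionary comes back empty, the tags may lie beyond the fetched prefix, in which case a larger limit is the obvious thing to try.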


2017-09-16 12:06:41 nbari
