PHP cURL 405 Not Allowed

2017-02-22
1

Final update: It appears that the targeted website blocked DigitalOcean IPs, which was causing the problems I had been troubleshooting for days. I spun up an EC2 instance and got the code working, together with caching and so on, to reduce the hits on the website and still allow my users to share it.

-

UPDATE: I managed to get the HTML by turning cURL's fail-on-error option off; however, besides returning the 405 error, the website is also not setting some cookies that are required for the page content to load.

curl_setopt($ch, CURLOPT_FAILONERROR, false);
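With fail-on-error disabled, curl_exec() returns the response body even for a 405, so the status code can be checked explicitly instead of being treated as a transport error. A minimal sketch (www.example.com stands in for the real site):

```php
<?php
// With CURLOPT_FAILONERROR off, a 405 response still yields a body,
// so inspect the HTTP status code yourself after the request.
$ch = curl_init('http://www.example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FAILONERROR, false);
$body = curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
if ($status >= 400 && is_string($body) && $body !== '') {
    // The server refused the request (e.g. 405) but still sent HTML to parse.
}
```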

I am using the code below (AJAX to PHP) to retrieve a website's og: meta tags. However, one or two specific websites return errors and the information is not retrieved; I get the errors below. The code works seamlessly for most websites.

Warning: DOMDocument::loadHTML(): Empty string supplied as input in /my/home/path/getUrlMeta.php on line 58

From curl_error in my error_log:

The requested URL returned error: 405 Not Allowed

And:

Failed to connect to www.something.com port 443: Connection refused

I have no problem getting the website's HTML when I curl it from my server's console, and no problem retrieving the required information from most websites using the following code:

function file_get_contents_curl($url) 
{ 
    $ch = curl_init(); 
    $header[0] = "Accept: text/html, text/xml,application/xml,application/xhtml+xml,"; 
    $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5"; 
    $header[] = "Cache-Control: max-age=0"; 
    $header[] = "Connection: keep-alive"; 
    $header[] = "Keep-Alive: 300"; 
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7"; 
    $header[] = "Accept-Language: en-us,en;q=0.5"; 
    $header[] = "Pragma: no-cache"; 
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header); 

    curl_setopt($ch, CURLOPT_HEADER, 0); 
    curl_setopt($ch, CURLOPT_URL, $url); 
    //curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET'); 

    curl_setopt($ch, CURLOPT_FAILONERROR, true); 
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); 
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); 
    curl_setopt($ch, CURLOPT_TIMEOUT, 30); 
    curl_setopt($ch, CURLOPT_USERAGENT,"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0 "); 
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false); 
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); 
    //The following 2 set up lines work with sites like www.nytimes.com 

    //Update: Added option for cookie jar since some websites recommended it. cookies.txt is set to permission 777. Still doesn't work. 
    $cookiefile = '/home/my/folder/cookies.txt'; 
    curl_setopt($ch, CURLOPT_COOKIESESSION, true); 
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookiefile); 
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookiefile); 

    $data = curl_exec($ch); 

    if(curl_error($ch)) 
    { 
     error_log(curl_error($ch)); 
    } 
    curl_close($ch); 

    return $data; 
} 

$html = file_get_contents_curl($url); 

libxml_use_internal_errors(true); // Yeah if you are so worried about using @ with warnings 
$doc = new DomDocument(); 
$doc->loadHTML($html); 
$xpath = new DOMXPath($doc); 
$query = '//*/meta[starts-with(@property, \'og:\')]'; 
$metas = $xpath->query($query); 
$rmetas = array(); 
foreach ($metas as $meta) { 
    $property = substr($meta->getAttribute('property'),3); 
    $content = $meta->getAttribute('content'); 
    $rmetas[$property] = $content; 
} 

/*below code retrieves the next bigger than 600px image should og:image be empty.*/ 
if (empty($rmetas['image'])) { 
    //$src = $xpath->evaluate("string(//img/@src)"); 
    //echo "src=" . $src . "\n"; 
    $query = '//*/img'; 
    $srcs = $xpath->query($query); 
    foreach ($srcs as $src) { 

     $property = $src->getAttribute('src'); 


     if (substr($property,0,4) == 'http' && in_array(substr($property,-3), array('jpg','png','peg'), true)) { 
      if (list($width, $height) = getimagesize($property)) { 
       if ($width > 600) { 
        $rmetas['image'] = $property; 
        break; // stop at the first image wider than 600px 
       } 
      } 
     } 

    } 
} 

echo json_encode($rmetas); 


die(); 
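The "Empty string supplied as input" warning above comes from passing a failed cURL result straight into loadHTML(). A small guard (a sketch, reusing the file_get_contents_curl() helper from the question) avoids parsing an empty string:

```php
<?php
$html = file_get_contents_curl($url);
if ($html === false || trim((string)$html) === '') {
    // cURL failed (e.g. CURLOPT_FAILONERROR aborted on the 405), so there is
    // nothing for DOMDocument to parse; report the failure instead.
    echo json_encode(array('error' => 'unable to retrieve page'));
    die();
}
```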

UPDATE: Error on my part: that website is not HTTPS-enabled, so I still have the 405 Not Allowed error.

cURL info:

{ 
    "url": "http://www.example.com/", 
    "content_type": null, 
    "http_code": 405, 
    "header_size": 0, 
    "request_size": 458, 
    "filetime": -1, 
    "ssl_verify_result": 0, 
    "redirect_count": 0, 
    "total_time": 0.326782, 
    "namelookup_time": 0.004364, 
    "connect_time": 0.007725, 
    "pretransfer_time": 0.007867, 
    "size_upload": 0, 
    "size_download": 0, 
    "speed_download": 0, 
    "speed_upload": 0, 
    "download_content_length": -1, 
    "upload_content_length": -1, 
    "starttransfer_time": 0.326634, 
    "redirect_time": 0, 
    "redirect_url": "", 
    "primary_ip": "SOME IP", 
    "certinfo": [], 
    "primary_port": 80, 
    "local_ip": "SOME IP", 
    "local_port": 52966 
} 

Update: If I do a curl -i from the console I get the following response: a 405 error, but it is followed by all the HTML that I need.

Home> curl -i http://www.domain.com 
HTTP/1.1 405 Not Allowed 
Server: nginx 
Date: Wed, 22 Feb 2017 17:57:03 GMT 
Content-Type: text/html; charset=UTF-8 
Transfer-Encoding: chunked 
Vary: Accept-Encoding 
Vary: Accept-Encoding 
Set-Cookie: PHPSESSID2=ko67tfga36gpvrkk0rtqga4g94; path=/; domain=.domain.com 
Expires: Thu, 19 Nov 1981 08:52:00 GMT 
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0 
Pragma: no-cache 
Set-Cookie: __PAGE_REFERRER=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=0; path=/; domain=www.domain.com 
Set-Cookie: __PAGE_SITE_REFERRER=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=0; path=/; domain=www.domain.com 
X-Repository: legacy 
X-App-Server: production-web23:8018 
X-App-Server: distil2-kvm:80 
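The distil2-kvm header above hints at bot-mitigation software, and the 405 response still sets a PHPSESSID2 cookie. One approach worth trying (a hedged sketch; www.domain.com is the placeholder from the transcript, and this is not guaranteed to get past the protection) is a two-pass fetch: the first request only populates the cookie jar, and the second replays those cookies the way a browser would:

```php
<?php
// Hypothetical two-pass fetch: pass 1 collects the cookies the server sets
// even on a 405; pass 2 replays them like a returning browser.
$cookieFile = tempnam(sys_get_temp_dir(), 'cj');
$html = false;
foreach (array(1, 2) as $pass) {
    $ch = curl_init('http://www.domain.com/');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);   // save Set-Cookie headers
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);  // replay them on pass 2
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64)');
    $html = curl_exec($ch);
    curl_close($ch);
}
```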

If it only stops working on certain sites, it's a server-side problem. There's not much we can do about it. – miken32


@miken32 But the URL is accessible from a web browser. Doesn't cURL simulate a browser? It's a publicly accessible website: no login required, no SSL, etc. –


Remove 'CURLOPT_FAILONERROR' and you will get the full content of the 405, just like the command line you showed. –

Answer

0

Add the following to your code to help debug the issue:

$info = curl_getinfo($ch); 
print_r($info); 

Most likely, the problem is one of the following:

  • 405 Not Allowed: you are making a cURL call that the server does not allow, for example issuing a GET request when only POST is permitted.
  • 443: Connection refused: the website you are trying to reach does not support HTTPS. Alternatively, the site uses an encryption protocol your code does not support, for example it only accepts TLSv1.2 while your code may be using TLSv1.1.
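For the second case, one option (a sketch; it assumes your cURL is version 7.34 or later and is built against an SSL library with TLS 1.2 support) is to pin the TLS version explicitly:

```php
<?php
// Sketch: force TLS 1.2 if the remote host rejects older protocol versions.
// CURL_SSLVERSION_TLSv1_2 requires cURL 7.34+ with a capable SSL backend.
$ch = curl_init('https://www.example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSLVERSION, CURL_SSLVERSION_TLSv1_2);
$data = curl_exec($ch);
curl_close($ch);
```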

I have added the curl_getinfo output to my question. The website is publicly accessible; I am trying to get the og tags when a user shares a website URL in my app (think Facebook URL sharing). –


Turns out the website does not use HTTPS, so I don't need to fix the connection-refused error, but I still haven't been able to resolve the 405 error. –


Have you tried accessing the URL that returns the 405 in a browser? Does that URL allow GET requests? –