How to scrape iframe content using cURL

Goal: I want to scrape the word "Paris" from inside an iframe using cURL.

Say you have a simple page that contains an iframe:

<html> 
<head> 
<title>Curl into this page</title> 
</head> 
<body> 

<iframe src="france.html" title="test" name="test"></iframe>

</body> 
</html>

The iframe's page:

<html> 
<head> 
<title>France</title> 
</head> 
<body> 

<p>The Capital of France is: Paris</p> 

</body> 
</html>

My cURL script:

<?php

// 1. initialize 

$ch = curl_init(); 

// 2. The URL containing the iframe 

$url = "http://localhost/test/index.html"; 

// 3. set the options, including the url 

curl_setopt($ch, CURLOPT_URL, $url); 
curl_setopt($ch, CURLOPT_HEADER, 0); 
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); 
curl_setopt($ch, CURLOPT_TIMEOUT, 2); 
curl_setopt($ch, CURLOPT_MAXREDIRS, 10); 
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); 

// 4. execute and fetch the resulting HTML output by putting into $output 

$output = curl_exec($ch); 

// 5. free up the curl handle 

curl_close($ch); 

// 6. Scrape for a single string/word ("Paris") 

preg_match("'The Capital of France is:(.*?). </p>'si", $output, $match); 
if($match) 

// 7. Display the scraped string 

echo "The Capital of France is: ".$match[1]; 

?>

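Result = nothing!

Can someone help me find out the capital of France?! ;)

I need examples of:

Parsing/grabbing the iframe URL
cURLing that URL (as I have already done with the index.html page)
Parsing for the string "Paris"

Thanks!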


2011-12-06 ven

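This is not a cURL script, it is a PHP script. Don't confuse the language with the library. And don't parse HTML with regular expressions! – sidyll

I don't see the part where you load the iframe. You first have to scrape the index page for any iframes, then load and scrape each one. (P.S. As per [this question](http://stackoverflow.com/questions/292926/robust-mature-html-parser-for-php), you should use [DOMDocument->loadHTML()](http://docs.php.net/manual/en/domdocument.loadhtml.php) to parse HTML in PHP rather than regular expressions.) – CanSpice

Do you, like, accept any answers? – FailedDev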

--Edit-- You could load the page contents into a string, parse that string for the iframe, then load the iframe source into another string.

$baseUrl = 'http://localhost/test/';

$wrapperPage = file_get_contents($baseUrl . 'index.html');

// Capture the value of the first src attribute that points at an .html file.
$pattern = '/src="([^"]*\.html)"/i';

if (!preg_match($pattern, $wrapperPage, $matches)) {
    throw new Exception('No match found!');
}

$src = trim($matches[1]);

// Assumes the src is relative to the wrapper page, as in the question.
$iframeContents = file_get_contents($baseUrl . $src);

var_dump($iframeContents);
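
As the comments on the question suggest, DOMDocument can pull out the iframe's src without a regex. The following is only a minimal sketch: iframeSources() and resolveAgainst() are hypothetical helper names, and the URL resolution is deliberately naive (absolute URLs and plain relative file names only, no "../" or root-relative paths).

```php
<?php
// Sketch: collect iframe src values with DOMDocument instead of a regex.
// iframeSources() and resolveAgainst() are hypothetical helpers for illustration.

function iframeSources(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally so imperfect markup does not print notices.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($doc->getElementsByTagName('iframe') as $iframe) {
        $src = trim($iframe->getAttribute('src'));
        if ($src !== '') {
            $sources[] = $src;
        }
    }
    return $sources;
}

// Naive resolution: absolute URLs pass through, anything else is treated
// as a file name relative to the directory of the wrapper page.
function resolveAgainst(string $pageUrl, string $src): string
{
    if (preg_match('#^https?://#i', $src)) {
        return $src;
    }
    // Keep everything up to and including the last slash of the page URL.
    $base = substr($pageUrl, 0, strrpos($pageUrl, '/') + 1);
    return $base . $src;
}

// Usage with the markup from the question (no network access needed here).
$html = '<html><body><iframe src="france.html" title="test" name="test"></iframe></body></html>';
foreach (iframeSources($html) as $src) {
    echo resolveAgainst('http://localhost/test/index.html', $src), "\n";
}
```

Each resolved URL could then be fetched with file_get_contents() or with the cURL handle from the question.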

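--Original--

Work on your accept rate (accept answers to your previously answered questions).

The URL set in your cURL handle is the file that wraps the iframe; try setting it to the iframe's URL instead: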

$url = "http://localhost/test/france.html";


2011-12-07 00:02:41

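I think the main problem is that I don't know how to scrape the iframe's link, then fetch it, then scrape that! Any examples would be much appreciated. – ven

When I cURL the iframe page (france.html) everything works fine. I need a way to point it at index.html instead, so I need to do a "cURL within a cURL". – ven

@Dri: Post updated. See if that works. –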

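To answer your regex question: your pattern does not match the input text:

  <p>The Capital of France is: Paris</p>

You have an extra space before the closing paragraph tag, which cannot match:

preg_match("'The Capital of France is:(.*?). </p>'si", $output, $match);

You should have the space before the capture group instead, and remove the redundant . after it:

preg_match("'The Capital of France is: (.*?)</p>'si", $output, $match);

To allow optional whitespace in both positions, use \s* instead:

preg_match("'The Capital of France is:\s*(.*?)\s*</p>'si", $output, $match);

You could also make the capture group more specific by using (\w+), which matches only word characters (letters, digits and underscore).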
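
To see the difference between these patterns, they can be run against the paragraph from france.html. This is just a small sketch; the $patterns and $results arrays and their labels are made up for the comparison.

```php
<?php
// The paragraph as it appears in france.html in the question.
$output = '<p>The Capital of France is: Paris</p>';

// Labels are arbitrary; the patterns are the ones discussed above,
// plus the (\w+) variant.
$patterns = [
    'original'       => "'The Capital of France is:(.*?). </p>'si",
    'space moved'    => "'The Capital of France is: (.*?)</p>'si",
    'optional space' => "'The Capital of France is:\s*(.*?)\s*</p>'si",
    'word only'      => "'The Capital of France is:\s*(\w+)\s*</p>'si",
];

$results = [];
foreach ($patterns as $label => $pattern) {
    // Store the captured group, or null when the pattern does not match.
    $results[$label] = preg_match($pattern, $output, $match) ? $match[1] : null;
}

var_export($results);
```

Only the original pattern yields null; the other three capture Paris.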


2011-12-07 00:07:11 mario

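Ah, thanks for pointing that out. – ven

Note that occasionally, for various reasons, an iframe's page cannot be read by cURL directly, outside the context of its own server: requesting it on its own returns some kind of "cannot be read directly or externally" error message. In these cases you can use curl_setopt($ch, CURLOPT_REFERER, $fullpageurl); (if you are in PHP and reading the text with curl_exec). The request then looks to the server as if the iframe were being loaded from within the original page, and you can read the source.

So if, for whatever reason, france.html cannot be read outside the context of the larger page that contains it as an iframe, you can still get the source using the method above, with CURLOPT_REFERER set to the main page (test/index.html in the original question) as the referrer.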
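
A minimal sketch of that idea follows. refererOptions() and fetchWithReferer() are hypothetical helper names, and the 10-second timeout is an arbitrary choice; only the CURLOPT_* constants and curl_* functions come from PHP's cURL extension.

```php
<?php
// Sketch: request the iframe's URL while presenting the wrapper page as the referrer.
// refererOptions() and fetchWithReferer() are hypothetical helpers for illustration.

function refererOptions(string $iframeUrl, string $fullPageUrl): array
{
    return [
        CURLOPT_URL            => $iframeUrl,
        CURLOPT_REFERER        => $fullPageUrl, // sent as the Referer request header
        CURLOPT_RETURNTRANSFER => true,         // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 10,           // arbitrary value for this sketch
    ];
}

function fetchWithReferer(string $iframeUrl, string $fullPageUrl): string
{
    $ch = curl_init();
    curl_setopt_array($ch, refererOptions($iframeUrl, $fullPageUrl));

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    // The handle is released when $ch goes out of scope.
    return $body;
}
```

With the URLs from the question, the call would be fetchWithReferer('http://localhost/test/france.html', 'http://localhost/test/index.html'). Whether this helps depends on the server actually checking the Referer header.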


2013-06-26 18:09:24 Barry

Or just set CURLOPT_AUTOREFERER – nurettin
