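How to extract links and titles from a .html page?

For my website, I'd like to add a new feature.

I'd like users to be able to upload their own bookmarks backup file (from any browser, if possible), so that I can add it to their profile and they don't have to insert every bookmark manually.

The only part I'm missing is extracting the title and URL from the uploaded file. Can anyone give me a clue about where to start or what to read?

I used the search option, and (how to extract data from a raw html file) is the question most related to mine, but it doesn't cover this.

I really don't mind whether it uses jQuery or PHP.

Thank you very much.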


2010-12-12 Toni Michel Caubet

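It might help everyone if you could post examples of the types of bookmark backup files you want to support (one per browser) – scoates 2010-12-12 18:41:58

The Netscape format is the common one: http://msdn.microsoft.com/en-us/library/aa753582(v=vs.85).aspx – Matthew 2010-12-12 18:56:34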
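As a rough illustration of what a Netscape-style export looks like and how it might be read, here is a minimal sketch. The function name extractBookmarks and the sample file contents are assumptions made for this example, not something posted in the thread; real exports from different browsers may carry extra attributes or folders.

```php
<?php
// Hypothetical helper (not from the thread): returns title/URL pairs
// from a Netscape-style bookmark export held in a string.
function extractBookmarks(string $html): array
{
    $dom = new DOMDocument();

    // Bookmark exports are not well-formed HTML (unclosed <DT>, <p>),
    // so collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $bookmarks = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $url = $a->getAttribute('href');
        if ($url === '') {
            continue; // skip anchors without a target
        }
        $bookmarks[] = ['title' => trim($a->textContent), 'url' => $url];
    }
    return $bookmarks;
}

// Assumed sample in the Netscape bookmark layout.
$sample = <<<HTML
<!DOCTYPE NETSCAPE-Bookmark-file-1>
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
    <DT><H3>Dev</H3>
    <DL><p>
        <DT><A HREF="http://php.net/" ADD_DATE="1292178000">PHP</A>
        <DT><A HREF="http://jquery.com/">jQuery</A>
    </DL><p>
</DL><p>
HTML;

foreach (extractBookmarks($sample) as $b) {
    echo $b['title'], ': ', $b['url'], "\n";
}
```

Folder headings (the H3 elements) are ignored here; only the anchors are collected.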

Thank you everyone, I got it!

The final code, which shows the anchor text and the href of every link in the .html file:

$html = file_get_contents('bookmarks.html'); 
//Create a new DOM document 
$dom = new DOMDocument; 

//Parse the HTML. The @ is used to suppress any parsing errors 
//that will be thrown if the $html string isn't valid XHTML. 
@$dom->loadHTML($html); 

//Get all links. You could also use any other tag name here, 
//like 'img' or 'table', to extract other tags. 
$links = $dom->getElementsByTagName('a'); 

//Iterate over the extracted links and display their text and URLs 
foreach ($links as $link){ 
    //Show the anchor text, then the "href" attribute. 
    echo $link->nodeValue, ': '; 
    echo $link->getAttribute('href'), '<br>'; 
}

Again, thanks a lot.


2010-12-12 20:18:17

This may be sufficient:

$dom = new DOMDocument; 
$dom->loadHTML($html); 
foreach ($dom->getElementsByTagName('a') as $node) 
{ 
    echo $node->nodeValue.': '.$node->getAttribute("href")."\n"; 
}


2010-12-12 18:50:07 Matthew

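Is $html the path to the file? Thanks for such a quick answer :D – 2010-12-12 18:53:36

@Toni, $html is a string containing the HTML. You can use $dom->loadHTMLFile() to load directly from a file. (You may want to prefix it with @ to suppress warnings.) – Matthew 2010-12-12 18:54:34

Wow! Thank you very much! It looks like I'm almost done! I can get the links, but I'm having trouble with the name or title (I tried both) – 2010-12-12 19:06:56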
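For the name/title trouble mentioned above, one possible reading is that the visible link text and the title attribute are two separate things. The sketch below loads a file with loadHTMLFile() and returns both; the function name linksFromFile and the sample markup are assumptions for illustration, and libxml_use_internal_errors() is used here in place of the @ prefix.

```php
<?php
// Hypothetical helper: loads an HTML file and returns, for each link,
// its URL, visible text and optional title attribute.
function linksFromFile(string $path): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parse warnings quiet
    $loaded = $dom->loadHTMLFile($path);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($loaded === false) {
        return [];
    }

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $links[] = [
            'url'   => $a->getAttribute('href'),
            'text'  => trim($a->textContent),     // text between <a> and </a>
            'title' => $a->getAttribute('title'), // '' when the attribute is absent
        ];
    }
    return $links;
}

// Write a small assumed sample to a temporary file so the sketch is self-contained.
$path = tempnam(sys_get_temp_dir(), 'bm');
file_put_contents(
    $path,
    '<html><body><a href="http://example.com/" title="Example site">Example</a></body></html>'
);

print_r(linksFromFile($path));
unlink($path);
```

In bookmark exports the name usually lives in the link text rather than in a title attribute, so the 'text' field is likely the one to store.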

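Assuming the links are stored in an HTML file, the best solution is probably to use an HTML parser such as PHP Simple HTML DOM Parser (I have never tried it). (The other option is to search with basic string functions or a regular expression, but you should never use a regexp to parse HTML.)

From the tutorial:

After reading the HTML file with the parser, use its functions to find the a tags: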

// Find all links 
foreach($html->find('a') as $element) 
     echo $element->href . '<br>';


2010-12-12 18:53:17

Here is an example you can use in your case:

$content = file_get_contents('bookmarks.html');

Then run the following code:

<?php 

$content = '<html> 
<title>Random Website I am Crawling</title> 
<body> 
Click <a href="http://clicklink.com">here</a> for foobar 
Another site is http://foobar.com 
</body> 
</html>'; 

$regex = "((https?|ftp)\:\/\/)?"; // SCHEME 
$regex .= "([a-z0-9+!*(),;?&=\$_.-]+(\:[a-z0-9+!*(),;?&=\$_.-]+)?@)?"; // User and Pass 
$regex .= "([a-z0-9-.]*)\.([a-z]{2,4})"; // Host or IP 
$regex .= "(\:[0-9]{2,5})?"; // Port 
$regex .= "(\/([a-z0-9+\$_-]\.?)+)*\/?"; // Path 
$regex .= "(\?[a-z+&\$_.-][a-z0-9;:@&%=+\/\$_.-]*)?"; // GET Query 
$regex .= "(#[a-z_.-][a-z0-9+\$_.-]*)?"; // Anchor 


$matches = array(); //create array 
$pattern = "/$regex/"; 

preg_match_all($pattern, $content, $matches); 

print_r(array_values(array_unique($matches[0]))); 
echo "<br><br>"; 
echo implode("<br>", array_values(array_unique($matches[0])));

Output:

Array 
(
    [0] => http://clicklink.com 
    [1] => http://foobar.com 
)

http://clicklink.com

http://foobar.com
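The regex output above lists URLs only, while the question also asks for titles. As a hedged alternative, the sketch below pairs each URL with its anchor text using DOMXPath; the function name linkTitles is an assumption for this example and is not part of the answer's code.

```php
<?php
// Hypothetical helper: maps each absolute http(s)/ftp link URL to its anchor text.
function linkTitles(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $titles = [];
    // Select only <a> elements that carry an href attribute.
    foreach ($xpath->query('//a[@href]') as $a) {
        $url = $a->getAttribute('href');
        // Keep absolute URLs only; "#" and relative paths are skipped.
        if (preg_match('~^(https?|ftp)://~i', $url)) {
            $titles[$url] = trim($a->textContent);
        }
    }
    return $titles;
}

$content = '<html><title>Random Website I am Crawling</title><body>'
    . 'Click <a href="http://clicklink.com">here</a> for foobar '
    . 'Another site is http://foobar.com</body></html>';

print_r(linkTitles($content));
```

Unlike the regex, this does not pick up the bare http://foobar.com in the text, because it is not inside an a tag. For a bookmark file, where every entry is an anchor, that trade-off is probably acceptable.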


2015-03-28 20:59:50

$html = file_get_contents('your file path'); 

$dom = new DOMDocument; 
@$dom->loadHTML($html); 

$styles = $dom->getElementsByTagName('link'); 
$links = $dom->getElementsByTagName('a'); 
$scripts = $dom->getElementsByTagName('script'); 

foreach ($styles as $style) { 
    if ($style->getAttribute('href') != "#") { 
        echo $style->getAttribute('href'); 
        echo '<br>'; 
    } 
} 

foreach ($links as $link) { 
    if ($link->getAttribute('href') != "#") { 
        echo $link->getAttribute('href'); 
        echo '<br>'; 
    } 
} 

foreach ($scripts as $script) { 
    echo $script->getAttribute('src'); 
    echo '<br>'; 
}


2016-01-08 08:20:56 Raghavendra

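The formatting is broken and the answer is hard to read. Please edit your answer and make it more readable – michaldo 2016-01-08 08:29:26

Too much code for the given question... – 2016-01-08 08:43:17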
