Extracting href links from a website's source with Python

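I asked this question before, to no avail. I am trying to figure out how to use bs4 to get the download links from a website's source code. The part I can't work out is that the links sit inside a dynamic content library. I have removed the previous HTML snippet; see below.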

We have been able to grab the links with the script below, but only after manually copying the source code from the website:

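import re

# SourceCode.txt is the file the page source was pasted into by hand.
# The output name Download.html is an assumption; the original post did not show it.
links = []
with open("SourceCode.txt") as source:
    for line in source:
        line = line.rstrip()
        # The group captures the href text that precedes its final "tif".
        links.extend(re.findall('href=[\'"]?([^\'" >]+)tif', line))

with open("Download.html", "w") as result:
    result.write('<html>\n<body>\n')
    for prefix in links:
        result.write('<a href="' + prefix + 'tif">link</a><br>\n')
    # len() returns an int, so convert it before concatenating.
    result.write("There are " + str(len(links)) + " links\n")
    result.write('</body>\n</html>\n')

print("Download HTML page created.")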

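But it only works after going to the website and doing ctrl + a -> view source -> select all & copy -> paste into SourceCode.txt. I would like to get rid of all this manual labor.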
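
The copy-and-paste step could in principle be scripted. Below is a minimal sketch, assuming the page is reachable with a plain HTTP GET and that the links are already present in the HTML the server returns. `save_page_source` and its arguments are hypothetical names, not part of the original script. If the library rows are only added by JavaScript after the page loads, the saved file will not contain them, and this approach will not be enough on its own.

```python
from urllib.request import urlopen


def save_page_source(url, path="SourceCode.txt"):
    """Download the raw HTML at `url` and write it to `path`.

    Hypothetical helper: it only sees what the server sends back,
    not content that a browser adds later through JavaScript.
    """
    with urlopen(url) as response:
        html = response.read()  # bytes, exactly as served
    with open(path, "wb") as out:
        out.write(html)
    return len(html)  # number of bytes written
```

Calling `save_page_source(url)` with the page's address would then produce the SourceCode.txt file that the script above reads.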

I would greatly appreciate any information, tips, or advice!

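EDIT

I would like to add some information about the website we are using: the library's content only shows up after it has been expanded manually on the site. Otherwise the content (i.e., the download links / href *.tif) is not visible. Here is an example of what we see:

Source code of the website before the library element is opened.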

<html><body>

Source code after opening the library element.

<html><body> 
<h3>Library</h3> 
<div id="libraryModalBody"> 

    <div><table><tbody> 

    <tr> 
    <td>Tile12</td> 
    <td><a href="http://www.website.com/path/Tile12.zip">Button</a></td> 
    </tr> 

    </tbody></table></div> 

</div>

Source code after expanding all download options.

<html><body> 
<h3>Library</h3> 
<div id="libraryModalBody"> 
    <div><table><tbody> 
    <tr> 
    <td>Tile12</td> 
    <td><a href="http://www.website.com/path/Tile12.zip">Button</a></td> 
    </tr> 
    <tr> 
    <td>Tile12_Set1.tif</td> 
    <td><a href="http://www.website.com/path/Tile12_Set1.tif">Button</a></td> 
    </tr> 
    <tr> 
    <td>Tile12_Set2.tif</td> 
    <td><a href="http://www.website.com/path/Tile12_Set2.tif">Button</a></td> 
    </tr> 
    </tbody></table></div> 
</div>

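Our end goal is to grab the download links by entering nothing but the website's URL. The problem seems to lie in how the content is displayed (i.e., the dynamic content is only visible after the library has been expanded manually).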


2015-11-24 D.V

Don't try to parse HTML with regular expressions. It's not possible and it won't work. Use BeautifulSoup4 instead:

from urllib.request import urlopen  # Python 3; on Python 2 this was urllib2
from bs4 import BeautifulSoup

url = "http://www.your-server.com/page.html"
document = urlopen(url)
# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(document, "html.parser")

# look for all URLs:
found_urls = [link["href"] for link in soup.find_all("a", href=True)]

# look only for URLs to *.tif files:
found_tif_urls = [link["href"] for link in soup.find_all("a", href=True) if link["href"].endswith(".tif")]
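
To see the `.tif` filter at work without network access or third-party packages, here is a minimal sketch that swaps BeautifulSoup for the standard library's `html.parser` and runs on the expanded-library HTML from the question. The names `HrefCollector` and `tif_links` are made up for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def tif_links(html):
    """Return the hrefs in `html` that end in .tif (hypothetical helper)."""
    collector = HrefCollector()
    collector.feed(html)
    return [href for href in collector.hrefs if href.endswith(".tif")]


# Sample taken from the question's "all download options expanded" source.
sample = """
<div id="libraryModalBody">
  <table><tbody>
    <tr><td>Tile12</td>
        <td><a href="http://www.website.com/path/Tile12.zip">Button</a></td></tr>
    <tr><td>Tile12_Set1.tif</td>
        <td><a href="http://www.website.com/path/Tile12_Set1.tif">Button</a></td></tr>
    <tr><td>Tile12_Set2.tif</td>
        <td><a href="http://www.website.com/path/Tile12_Set2.tif">Button</a></td></tr>
  </tbody></table>
</div>
"""

for link in tif_links(sample):
    print(link)
```

The .zip link is skipped by the `endswith(".tif")` check. Note that this only helps once the HTML being parsed already contains the expanded rows.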


2015-11-24 15:24:23 geckon

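Thanks, geckon. I may be misreading your script, but it looks like it would only work on HTML source retrieved manually from the website. If that is the case, it would not eliminate the manual labor I am already doing.

@D.V I edited my answer, is it better? – geckon

That makes sense, thanks for the info. Oddly enough, regex has been working for what I've been doing. That said, I'll use what you wrote. Thanks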

You may also want to take a look at the PyQuery library, which uses a (sub)set of the CSS selectors from jQuery:

from pyquery import PyQuery
# body holds the page's HTML source as a string
pq = PyQuery(body)
pq('div.content div#filter-container div.filter-section')


2015-11-24 15:55:34 Ojomio

Thanks, Ojomio. I'll test your suggestion.
