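How do I load a web page (not in a browser) and then get that page's URL?

I want to build something I saw on Reddit: it gives you a random Wikipedia article, shows you its title, and then lets you either A (open the article in your browser) or B (get a new random article). To get a random article you can go to the URL "https://en.wikipedia.org/wiki/Special:Random", but after that I need to load the URL, see what it changed to, and work out which article I got. How would I do this?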

2015-09-07 Beta_Penguin

The Special:Random page on Wikipedia returns a redirection response with the target location:

HTTP/1.1 302 Found 
... 
Location: https://en.wikipedia.org/wiki/URL_redirection 
...

Most libraries (and all browsers) follow that link automatically, but you can disable this behaviour, for example in requests:

import requests 
url = 'https://en.wikipedia.org/wiki/Special:Random' 
response = requests.get(url, allow_redirects=False) 
real_url = response.headers['location'] 
# then use real_url to fetch the page

Alternatively, requests keeps the redirect history:

response = requests.get(url) 
real_url = response.history[-1].headers['location']

In the latter case, response already contains the page you need, so this is the simpler approach.
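If requests is not available, the same "read the Location header without following it" idea can be sketched with the standard library. This is a minimal sketch assuming Python 3; the function name redirect_target and the User-Agent string are made up for illustration, and it assumes the server answers a HEAD request with the same redirect it would send for a GET.

```python
import http.client
from urllib.parse import urljoin, urlsplit


def redirect_target(url, timeout=10):
    """Return the URL that `url` redirects to, or None if it does not redirect.

    http.client never follows redirects on its own, so the Location header
    is read straight from the first response.
    """
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        # Hypothetical User-Agent; some sites reject clients that send none.
        conn.request("HEAD", path,
                     headers={"User-Agent": "random-article-sketch/0.1"})
        response = conn.getresponse()
        location = response.getheader("Location")
        if response.status in (301, 302, 303, 307, 308) and location:
            # Location may be relative, so resolve it against the request URL.
            return urljoin(url, location)
        return None
    finally:
        conn.close()
```

Calling redirect_target('https://en.wikipedia.org/wiki/Special:Random') should then return the URL of a random article, provided the site responds with a redirect as shown above.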

2015-09-07 06:33:25 bereal

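Is there a way to do this without using requests? –

@Beta_Penguin Sure, this can also be done with the standard urllib2, as @dv1337 shows. – bereal

URL - you can get it from the urllib2 response with response.geturl()
Headline - you can parse it out of the page with the BeautifulSoup package
Browser - you can open the URL in a web browser with webbrowser.open(url)

Here is a simple working example: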

import urllib2 
import webbrowser 
from BeautifulSoup import BeautifulSoup 

while True:
    # urlopen follows the redirect, so geturl() returns the article's real URL
    response = urllib2.urlopen('https://en.wikipedia.org/wiki/Special:Random')
    headline = BeautifulSoup(response.read()).html.title.string
    url = response.geturl()
    print "The url:   " + url
    print "The headline:  " + headline

    x = raw_input("Press: [A - Open in browser] [B - Get a new random article] [Anything else to exit]\n>")
    if x == "A":
        webbrowser.open(url)  # open in browser
    elif x == "B":
        continue  # get a new random article
    else:
        break  # exit
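The example above targets Python 2. On Python 3, urllib2 was folded into urllib.request, and the response still offers geturl(). A minimal sketch of just the fetching step, assuming Python 3; fetch_final_url and the User-Agent string are illustrative names, not part of the original answer:

```python
import urllib.request


def fetch_final_url(url, timeout=10):
    """Open `url`, following redirects, and return (final_url, body_bytes)."""
    request = urllib.request.Request(
        url,
        # Hypothetical User-Agent; some sites reject the default one.
        headers={"User-Agent": "random-article-sketch/0.1"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # geturl() reports the URL of the page actually retrieved,
        # i.e. the address after any redirects were followed.
        return response.geturl(), response.read()
```

The returned URL can be passed to webbrowser.open() and the body to an HTML parser, mirroring the loop above.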

2015-09-07 06:51:26 sub

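Breaking the task down into bite-sized chunks:

Get a random Wikipedia article

Cool. This is pretty straightforward. You can use Python's built-in urllib2 or the requests package. Most people recommend requests (pip install requests) because it is a higher-level library that is a bit simpler to use, but what we are doing here is so simple that it may be overkill. Anyway: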

import requests 

RANDOM_WIKI_URL = "https://en.wikipedia.org/wiki/Special:Random" 
response = requests.get(RANDOM_WIKI_URL) 
data = response.content 
url = response.url

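See its title

For this we need to parse the HTML. It is tempting to suggest just using a regular expression to extract the text from the element containing the title, but the right way to do this sort of thing is to use a library like BeautifulSoup (pip install beautifulsoup4):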

from bs4 import BeautifulSoup 
soup = BeautifulSoup(data, 'html.parser') 
title = soup.select('#firstHeading')[0].get_text() 
print title
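If installing BeautifulSoup is not an option, a real parser is still available in the standard library. The following is a rough sketch assuming Python 3 and assuming the heading element carries the id firstHeading, as in the selector above; HeadingParser and extract_heading are hypothetical helper names.

```python
from html.parser import HTMLParser


class HeadingParser(HTMLParser):
    """Collect the text inside the first element whose id is `target_id`."""

    # Void elements have no closing tag, so they must not change the depth.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "source", "track", "wbr"}

    def __init__(self, target_id="firstHeading"):
        super().__init__(convert_charrefs=True)
        self.target_id = target_id
        self.depth = 0      # greater than 0 while inside the target element
        self.done = False   # set once the target element has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in self.VOID:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in self.VOID:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_heading(html, target_id="firstHeading"):
    """Return the heading text, or None if no such element is found."""
    parser = HeadingParser(target_id)
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts).strip()
    return text or None
```

Unlike a regular expression, this keeps working when the heading contains nested tags, and it returns None instead of raising when the element is missing.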

A ([...]) or B ([...])

print "=" * 80 
print "(a): Open in new browser tab" 
print "(b): Get new article" 
print "(q): Quit" 
user_input = raw_input("[a|b|q]: ").lower() 

if user_input == 'a': 
    ... 
elif user_input == 'b': 
    ... 
elif user_input == 'q': 
    ...

Open the article in the browser

import webbrowser 

webbrowser.open_new_tab(url)

Get a new random article

response = requests.get(RANDOM_WIKI_URL) 
data = response.content 
url = response.url

Putting it all together:

from __future__ import unicode_literals 

import webbrowser 

from bs4 import BeautifulSoup 
import requests 


RANDOM_WIKI_URL = "https://en.wikipedia.org/wiki/Special:Random" 


def get_user_input():
    # Keep asking until the user enters one of the three valid choices
    user_input = ''
    while user_input not in ('a', 'b', 'q'):
        print '-' * 79
        print "(a): Open in new browser tab"
        print "(b): Get new random article"
        print "(q): Quit"
        print '-' * 79
        user_input = raw_input("[a|b|q]: ").lower()
    return user_input


def main():
    while True:
        print "=" * 79
        print "Retrieving random wikipedia article..."
        response = requests.get(RANDOM_WIKI_URL)
        data = response.content
        url = response.url  # final URL after the redirect

        soup = BeautifulSoup(data, 'html.parser')
        title = soup.select('#firstHeading')[0].get_text()

        print "Random Wikipedia article: '{}'".format(title)
        user_input = get_user_input()
        if user_input == 'q':
            break
        elif user_input == 'a':
            webbrowser.open_new_tab(url)
        # 'b' (and 'a', after opening the tab) loops round to a new article


if __name__ == '__main__': 
    main()
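The script above is written for Python 2 (print statements, raw_input). As a rough Python 3 sketch of the same control flow, the loop can take its fetching and browser-opening steps as arguments, which also makes it possible to try it without network access. The names run, fetch_article, open_in_browser, ask and show are assumptions made for this illustration.

```python
def run(fetch_article, open_in_browser, ask=input, show=print):
    """Show random articles until the user quits.

    fetch_article() must return a (title, url) pair and open_in_browser(url)
    must open it. Both are passed in so the loop can be exercised offline.
    """
    while True:
        title, url = fetch_article()
        show("Random Wikipedia article: '{}'".format(title))

        # Keep asking until one of the three valid choices is entered.
        choice = ""
        while choice not in ("a", "b", "q"):
            choice = ask("[a|b|q]: ").strip().lower()

        if choice == "q":
            break
        if choice == "a":
            open_in_browser(url)
        # "b" (and "a", after opening) falls through to the next article.
```

In a real run, open_in_browser could be webbrowser.open_new_tab, and fetch_article a small function that wraps the requests and BeautifulSoup calls shown earlier.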

2015-09-07 06:51:30 chucksmash
