2016-12-27 19 views
0

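Printing a substring using the search method

I have the source code of a web page stored in a string variable. I know that a certain date will appear in that source code, and I want to print the first link that appears before that date. The link can be found between double quotes (""). This is the code I have: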

import requests 
from datetime import date 
import re 

link = "https://www.google.com.mx/search?biw=1535&bih=799&tbm=nws&q=%22New+Strong+Buy%22+site%3A+zacks.com&oq=%22New+Strong+Buy%22+site%3A+zacks.com&gs_l=serp.3...1632004.1638057.0.1638325.24.24.0.0.0.0.257.2605.0j15j2.17.0....0...1c.1.64.serp..8.0.0.Nl4BZQWwR3o" 
fetch_data =requests.get(link) 
content = str((fetch_data.content)) 

#this is the source code as a string 

Months = ["January","February","March","April","May","June","July","August","September","October","November","December"] 
today = date.today() 
A= ("%s %s" % (Months[today.month - 1],today.day)) 
a=today.day 
B= A in content 
if B == True: 
    B = ("%s %s" % (Months[today.month - 1], a)) 
else: 
    while B == False: 
        a = a - 1 
        B = ("%s %s" % (Months[today.month - 1], a)) 

#the B variable is the string date that will appear in the variable string content 

c= ('"https:') 
Z= ("%s(.*)%s" % (c,B)) 
result = re.search(Z, content) 
print (result) 

This is what I tried. I expected to get the string between the variables c and B, but the code finds nothing.

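If you look at the source code of the link, you will find that today's date, "December 27", appears only once, and that the link I am interested in, "https://www.zacks.com/commentary/98986/new-strong-buy-stocks-for-december-27th", appears just before it.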

Can anyone help me get Python to identify this link automatically and print it?
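One way to read the goal stated above ("the link closest before the date") is as plain string searching rather than a greedy `(.*)` pattern, which, when it matches at all, starts at the first `"https:` in the page and not the nearest one. The sketch below is only an illustration under that reading: the helper name `last_link_before` and the sample HTML are hypothetical, and the real page may not place the target URL directly after a `"https:` prefix.

```python
def last_link_before(content, marker, prefix='"https:'):
    """Return the quoted URL that starts closest before marker, or None."""
    end = content.find(marker)
    if end == -1:
        return None  # the date does not occur in the text
    # Search backwards from the date for the nearest opening quote + scheme.
    start = content.rfind(prefix, 0, end)
    if start == -1:
        return None
    # The URL runs from just after the opening quote to the next quote.
    closing = content.find('"', start + 1)
    if closing == -1:
        return None
    return content[start + 1:closing]


# Hypothetical sample text, not the real page source.
sample = (
    '<a href="https://example.com/old">Old</a> December 26 '
    '<a href="https://example.com/new">New</a> December 27'
)
print(last_link_before(sample, "December 27"))
```

Searching backwards with `rfind` avoids the greedy-match problem, but it still treats HTML as flat text, so it is fragile if the markup changes.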

+0

The 'while B == False:' loop never searches for 'B' in 'content'. – Barmar
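A minimal sketch of what this comment points at: in the question's loop, `B` is reassigned to a date string, so `B == False` is no longer true and the loop stops after one step without ever checking `content`. A fallback that re-tests every candidate date could look like the following. The helper name `latest_date_in`, the `max_days_back` limit, and the sample string are hypothetical; unlike the original, this version steps back with `timedelta`, so it also crosses month boundaries.

```python
from datetime import date, timedelta


def latest_date_in(content, start, max_days_back=31):
    """Return the most recent 'Month day' string found in content, or None."""
    for offset in range(max_days_back + 1):
        day = start - timedelta(days=offset)
        # %B is the full month name (English in the default locale);
        # day.day is the day number without zero-padding.
        candidate = "{} {}".format(day.strftime("%B"), day.day)
        if candidate in content:
            return candidate
    return None


# Hypothetical sample text, not the real page source.
sample = 'href="https://example.com/a">News</a> December 25, 2016'
found = latest_date_in(sample, date(2016, 12, 27))
print(found)
```

Returning `None` when nothing is found within the limit keeps the loop from running forever if the page contains no date at all.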

+1

Using regular expressions to parse HTML is generally a bad idea. Use a DOM parser library. – Barmar

Answers

0

As Barmar said, you would be better off using a DOM parser such as BeautifulSoup. Here is an example:

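# Written for Python 2 with BeautifulSoup 3; on Python 3 the equivalents
# are `from bs4 import BeautifulSoup` and `from urllib.parse import parse_qs`.
from BeautifulSoup import BeautifulSoup 
import requests, urlparse 
from datetime import datetime 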

link = "https://www.google.com.mx/search?biw=1535&bih=799&tbm=nws&q=%22New+Strong+Buy%22+site%3A+zacks.com&oq=%22New+Strong+Buy%22+site%3A+zacks.com&gs_l=serp.3...1632004.1638057.0.1638325.24.24.0.0.0.0.257.2605.0j15j2.17.0....0...1c.1.64.serp..8.0.0.Nl4BZQWwR3o" 

r = requests.get(link) 

soup = BeautifulSoup(r.text) 

search = datetime.today().strftime("%B %d") 
print("Searching for {}".format(search)) 

result = None 
for i in soup.findAll('h3'): 
    linkText = i.getText() 
    if search in linkText: 
        # The headline's href is a redirect; the target URL is in its 'q' parameter
        result = i.find('a').get('href') 
        result = result.split('?')[-1] 
        result = urlparse.parse_qs(result)['q'][0] 
        break 

print(result) 

The output I received is:

Searching for December 27 
https://www.zacks.com/commentary/98986/new-strong-buy-stocks-for-december-27th 
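The least obvious step in the code above is how the final URL is recovered from the `href`: the query string is split off and its `q` parameter is read. A standalone Python 3 sketch of just that step, using only the standard library, might look like this. The helper name `unwrap_redirect` is hypothetical, and the sample `href` (including the `/url?` path and the `sa=U` parameter) is an assumed shape for illustration, not something copied from the real page.

```python
from urllib.parse import urlsplit, parse_qs


def unwrap_redirect(href):
    """Return the first 'q' query parameter of a redirect-style href, or None."""
    query = urlsplit(href).query       # text after the '?'
    values = parse_qs(query).get("q")  # list of values for 'q', if any
    return values[0] if values else None


# Assumed example of a redirect-style href; the real markup may differ.
href = (
    "/url?q=https://www.zacks.com/commentary/98986/"
    "new-strong-buy-stocks-for-december-27th&sa=U"
)
print(unwrap_redirect(href))
```

Using `.get("q")` instead of indexing avoids a `KeyError` when a link has no `q` parameter.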