After many attempts at crawling an AngularJS page behind single sign-on, I have put together the code below. The code works in that it logs in, opens the desired page, and scrapes it, but I am not getting all of the links and text that Angular loads into the site. My XPath appears to be correct. Crawling: extract all text and links (href and ng-href) from an AngularJS website.
Also, it does not crawl the links that are being extracted. What do I need to change in the code to extract all of the text from the site and from the subsequent pages?
import scrapy
from scrapy import signals
from scrapy.http import TextResponse
from scrapy.xlib.pydispatch import dispatcher
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from ps_crawler.items import PsCrawlerItem
import time
from selenium.webdriver.common.keys import Keys


class SISSpider(scrapy.Spider):
    name = "SIS"
    allowed_domains = ["domain.com"]
    start_urls = ["https://domain.com/login?"]

    def __init__(self):
        self.driver = webdriver.Chrome()
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        self.driver.close()

    def parse(self, response):
        # selenium part of the job
        self.driver.get("https://domain.com/login?")
        time.sleep(5)
        self.driver.find_element_by_xpath('//*[@id="Login"]/div[2]/div[1]/div[2]/form/div[1]/input').send_keys("ssasdad")
        self.driver.find_element_by_xpath('//*[@id="Login"]/div[2]/div[1]/div[2]/form/div[2]/input').send_keys("")
        #self.driver.find_element_by_xpath('//*[@id="login"]').click()
        more_btn = WebDriverWait(self.driver, 10).until(
            EC.visibility_of_element_located((By.XPATH, '//*[@id="login"]'))
        )
        time.sleep(5)
        more_btn.click()
        time.sleep(5)
        self.driver.execute_script("window.open('https://domain.com/#/admin', '_blank');")
        time.sleep(10)
        window_now = self.driver.window_handles[1]
        self.driver.switch_to_window(window_now)
        ## stop when we reach the desired page
        #if self.driver.current_url.endswith('page=20'):
        #    break
        # now scrapy should do the job
        time.sleep(10)
        response = TextResponse(url=self.driver.current_url, body=self.driver.page_source, encoding='utf-8')
        time.sleep(10)
        for post in response.xpath('//div'):
            item = PsCrawlerItem()
            print post.xpath('a/span/text()').extract(), post.xpath('a/@href').extract(), post.xpath('a/@ng-href').extract()
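One likely reason links are missed: the relative XPath `a/span/text()` uses the child axis, so it only matches `<a>` elements that are *direct* children of each `<div>`, while Angular templates usually nest anchors several levels deeper. The descendant axis (`.//a`) picks those up as well. A minimal sketch of the difference, shown with the standard library's ElementTree on made-up markup (not from the real site); the equivalent Scrapy expressions would be `post.xpath('.//a/@href')` and `post.xpath('.//a/@ng-href')`:

```python
import xml.etree.ElementTree as ET

# Illustrative markup (hypothetical, not from the actual site): one anchor
# is a direct child of the div, the other is nested one level deeper.
html = """
<div>
  <a href="/direct"><span>Direct link</span></a>
  <span><a ng-href="/nested"><span>Nested link</span></a></span>
</div>
"""

div = ET.fromstring(html)

# Child axis ('a'): only the direct <a> child is found -- this mirrors
# post.xpath('a/@href') in the spider, and is why nested links are missed.
direct = [a.get('href') for a in div.findall('a')]

# Descendant axis ('.//a'): anchors at any depth are found.
nested = [a.get('href') or a.get('ng-href') for a in div.findall('.//a')]

print(direct)  # ['/direct']
print(nested)  # ['/direct', '/nested']
```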
Do you have any idea how many links there will be on the page? – alecxe
No, I'm just extracting everything. But when I inspect the elements, I can see that a large amount of text and links are being missed. –
You most likely need to wait before grabbing `page_source` and passing it to Scrapy. The question, though, is: wait for what? Is there any indication that the page has finished loading? – alecxe
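One common answer to "wait for what?" on AngularJS 1.x pages is to poll the browser until Angular reports no pending `$http` requests before reading `page_source`, instead of a fixed `time.sleep(10)`. A sketch under the assumption that the page runs AngularJS 1.x and exposes its injector on `document.body`; `wait_for_angular` here is a hypothetical helper, and any object with an `execute_script()` method (such as a Selenium webdriver) can be passed in:

```python
import time

# JavaScript that asks AngularJS 1.x whether any $http requests are still
# in flight; evaluates to true once the app has settled. Assumes the page
# actually runs AngularJS 1.x (an assumption, not verified here).
ANGULAR_IDLE_JS = (
    "return (window.angular !== undefined) && "
    "(angular.element(document.body).injector() !== undefined) && "
    "(angular.element(document.body).injector().get('$http')"
    ".pendingRequests.length === 0);"
)

def wait_for_angular(driver, timeout=30, poll=0.5):
    """Poll driver.execute_script until Angular reports no pending requests.

    Works with any object exposing execute_script(), e.g. a Selenium
    webdriver. Raises TimeoutError if the page never settles.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if driver.execute_script(ANGULAR_IDLE_JS):
            return True
        time.sleep(poll)
    raise TimeoutError("AngularJS did not settle within %s seconds" % timeout)

# In the spider this would replace the fixed sleep before building the
# TextResponse:
#   wait_for_angular(self.driver)
#   response = TextResponse(url=self.driver.current_url,
#                           body=self.driver.page_source, encoding='utf-8')
```

The same idea can also be wrapped in `WebDriverWait(...).until(...)`; the plain polling loop is used here only to keep the sketch free of Selenium imports.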