2016-09-06 24 views

I want to write a web scraper for the Indian patent search website to retrieve data about patents. This is the code I have so far.

# import the necessary modules
import urllib2
# import the BeautifulSoup class to parse the data
from bs4 import BeautifulSoup

# mention the website that you are trying to scrape
# (the assignment to `patentsite` is missing from the post as captured)

# query the website and return the html to the variable 'page'
page = urllib2.urlopen(patentsite)

# parse the html in the 'page' variable and store it in Beautiful Soup format
soup = BeautifulSoup(page, "html.parser")

print soup






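Well, you do have the HTML you requested. However, this page seems to work as a web application, with everything handled through JavaScript (in 'app.js'), so your approach most likely won't work. You might want to check whether the site offers an API you can use. – UnholySheep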


Yes, I did look for that. It doesn't seem to exist. I also tried several online web scrapers. Is there any way I can scrape this site? –


As I said, it is more of a web app than a website (since it is driven entirely by JavaScript). You might be able to do something with Selenium, but I have never used it. – UnholySheep
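To illustrate the point made in the comments, here is a minimal sketch of why parsing the fetched HTML finds no results when a page is filled in by JavaScript. The `SHELL` string is a made-up stand-in, not the real markup of the patent site, and `TagCounter` and `summarize` are hypothetical helper names.

```python
from html.parser import HTMLParser

# Hypothetical page shell: NOT the real markup of the patent site.
SHELL = """<html><head><script src="app.js"></script></head>
<body><div id="results"></div></body></html>"""


class TagCounter(HTMLParser):
    """Count start tags and collect script sources from static HTML."""

    def __init__(self):
        super().__init__()
        self.counts = {}
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1
        if tag == "script":
            self.scripts.append(dict(attrs).get("src"))


def summarize(html):
    """Return (tag counts, script sources) for the given HTML text."""
    parser = TagCounter()
    parser.feed(html)
    return parser.counts, parser.scripts


counts, scripts = summarize(SHELL)
print("table rows in static HTML:", counts.get("tr", 0))
print("scripts that would fill the page:", scripts)
```

The static document contains only an empty container and a script reference; the rows a browser shows are added later by the script, which a plain HTTP fetch never runs.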





import requests
from time import time

data = {
    "publication_type_published": "on",
    "publication_type_granted": "on",
    "fieldDate": "APD",
    "datefieldfrom": "19120101",
    "datefieldto": "20160906",
    "operatordate": " AND ",
    "field[]": ["PA"],  # codes for claims, description, patent number, etc. go here
    "fieldvalue[]": ["chris*"],  # the value to match for each entry in field[]
    "operator[]": [" AND "],  # the SQL-style logic for each entry in field[]
    "page": "1",  # change this to get the next page of results
    "start": "0",  # not sure what effect this actually has
    "limit": "25"}  # not sure how this relates, as len(r.json()[u'record']) stays 25 regardless

# the _dc query parameter is a timestamp with the decimal point removed
post = "http://ipindiaservices.gov.in/publicsearch/resources/webservices/search.php?_dc={}".format(
    str(time()).replace(".", ""))

with requests.Session() as s:
    s.headers.update({"X-Requested-With": "XMLHttpRequest"})
    r = s.post(post, data=data)
    print(r.json())


{u'success': True, u'record': [{u'Publication_Status': u'Published', u'appDate': u'2016/06/16', u'pubDate': u'2016/08/31', u'title': u'ACTUATOR FOR DEPLOYABLE IMPLANT', u'sourceID': u'inpat', u'abstract': u'\n Systems and methods are provided for usin............. 


{u'Publication_Status': u'Published', u'appDate': u'2015/01/27', u'pubDate': u'2015/06/26', u'title': u'CORRUGATED PALLET', u'sourceID': u'inpat', u'abstract': u'\n A corrugated paperboard pallet is produced from two flat blanks which comprise a pallet top and a pallet bottom. The two blanks are each folded to produce only two parallel vertically extending double thickness ribs three horizontal panels two vertical side walls and two horizontal flaps. The ribs of the pallet top and pallet bottom lock each other from opening in the center of the pallet by intersecting perpendicularly with notches in the ribs. The horizontal flaps lock the ribs from opening at the edges of the pallet by intersecting perpendicularly with notches and the vertical sidewalls include vertical flaps that open inward defining fork passages whereby the vertical flaps lock said horizontal flaps from opening.\n ', u'Assignee': u'OLVEY Douglas A., SKETO James L., GUMBERT Sean G., DANKO Joseph J., GABRYS Christopher W., ', u'field_of_invention': u'FI10', u'publication_no': u'26/2015', u'patent_no': u'', u'application_no': u'642/DELNP/2015', u'UCID': u'WVJ4NVVIYzFLcUQvVnJsZGczcVRmSS96Vkh3NWsrS1h3Qk43S2xHczJ2WT0%3D', u'Publication_Type': u'A'} 
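Once the records come back as dictionaries like the one above, saving them is mostly a matter of choosing columns. The following is a minimal sketch, not part of the original answer: `records_to_csv` and `COLUMNS` are hypothetical names, and the sample record is a subset of the fields copied from the response shown above.

```python
import csv
import io

# A subset of the fields of the record shown in the response above.
records = [{
    "title": "CORRUGATED PALLET",
    "application_no": "642/DELNP/2015",
    "appDate": "2015/01/27",
    "pubDate": "2015/06/26",
    "Publication_Status": "Published",
    "Assignee": "OLVEY Douglas A., SKETO James L., GUMBERT Sean G., "
                "DANKO Joseph J., GABRYS Christopher W., ",
}]

# Columns to keep; any other key in a record is ignored.
COLUMNS = ["application_no", "title", "appDate", "pubDate",
           "Publication_Status", "Assignee"]


def records_to_csv(records, columns=COLUMNS):
    """Serialize the chosen columns of each record to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for rec in records:
        # Trim whitespace and the trailing comma seen in 'Assignee'.
        row = {c: str(rec.get(c, "")).strip().rstrip(",") for c in columns}
        writer.writerow(row)
    return buf.getvalue()


csv_text = records_to_csv(records)
print(csv_text)
```

With a live response, `records` would instead be `r.json()["record"]`.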



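The comment on the `page` field in the payload says that changing it returns the next page of results. A minimal sketch of looping over pages follows. It assumes (this is not confirmed by the thread) that a page past the end comes back with an empty `record` list; `iter_records` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real POST so the sketch runs offline.

```python
def iter_records(fetch_page, base_data, max_pages=5):
    """Yield records page by page until a page comes back empty.

    fetch_page is any callable that takes the form payload (a dict)
    and returns the decoded JSON response (a dict with a 'record' list).
    """
    for page in range(1, max_pages + 1):
        payload = dict(base_data, page=str(page))
        response = fetch_page(payload)
        records = response.get("record") or []
        # Assumption: an unsuccessful or empty page means there is no more data.
        if not response.get("success") or not records:
            break
        for rec in records:
            yield rec


def fake_fetch(payload):
    """Offline stand-in for the real request; returns canned pages."""
    pages = {"1": [{"title": "A"}, {"title": "B"}], "2": [{"title": "C"}]}
    return {"success": True, "record": pages.get(payload["page"], [])}


titles = [rec["title"] for rec in iter_records(fake_fetch, {"limit": "25"})]
print(titles)
```

Against the real endpoint, `fetch_page` could be `lambda payload: s.post(post, data=payload).json()`, called inside the `with requests.Session() as s:` block and passing `data` as `base_data`. The `max_pages` cap keeps the loop from hammering the server if the stop condition assumed here turns out to be wrong.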



This is great! Thank you. I'll write the code and then build more on top of it. Many thanks. –


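No worries. It's just a matter of picking whatever values you want, making sure the lists line up, and posting to the URL; you'll get what you want back as JSON. –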
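One way to keep the three lists lined up is to build them from a single list of triples. This is a minimal sketch; `build_criteria` is a hypothetical helper, and `PA` with the `AND` operator are the only field code and operator that appear in the thread, so the second search value here is purely illustrative.

```python
def build_criteria(criteria):
    """Turn (field, value, operator) triples into the three parallel lists."""
    fields, values, operators = [], [], []
    for field, value, operator in criteria:
        fields.append(field)
        values.append(value)
        # The payload in the answer pads the operator with spaces: " AND ".
        operators.append(" {} ".format(operator.strip().upper()))
    return {"field[]": fields, "fieldvalue[]": values, "operator[]": operators}


payload = build_criteria([
    ("PA", "chris*", "AND"),
    ("PA", "doug*", "and"),  # illustrative second criterion
])
print(payload)
```

Because each triple contributes exactly one entry to each list, the lists cannot drift out of alignment; the result can be merged into the answer's payload with `data.update(payload)` before posting.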