2016-09-06 24 views
2

I want to write a web scraper for the Indian patent search website to get data about patents. This is my code so far.

#import the necessary modules 
import urllib2 
#import the beautifulsoup functions to parse the data 
from bs4 import BeautifulSoup 

#mention the website that you are trying to scrape 
patentsite="http://ipindiaservices.gov.in/publicsearch/" 

#Query the website and return the html to the variable 'page' 
page = urllib2.urlopen(patentsite) 

#Parse the html in the 'page' variable, and store it in Beautiful Soup format 
soup = BeautifulSoup(page, "html.parser")

print soup 

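Unfortunately, either the Indian patent site is not robust or I don't know how to take this any further.

This is the output of the above code.

What I mean is: given a company name, the scraper should fetch all of that company's patents. Once I have this part working, I would like to do more, such as supplying a set of inputs for the scraper to search patents with. But I am stuck at this point and cannot get any further.

Any pointers on how to get this data would be greatly appreciated.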

+1

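Well, you have the HTML you requested. However, this page appears to be a web application where everything is handled through JavaScript (in 'app.js'), so your approach most likely won't work. You might want to check whether the site offers an API you can use. – UnholySheep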

+0

Yes, I did look for that information. It doesn't seem to exist. I also tried several online web scrapers. Is there any way I can scrape this site? –

+1

As I said, it is more of a web app than a website (since it is driven entirely by JavaScript). You might be able to do something with Selenium, but I have never used it. – UnholySheep

Answers

4

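All you need is requests. The POST goes to http://ipindiaservices.gov.in/publicsearch/resources/webservices/search.php with a query parameter _dc, which is a timestamp that we create with time.time.

Each value in "field[]" should line up with the corresponding value in "fieldvalue[]", which in turn lines up with the entry in "operator[]", whether you choose *AND*, *OR* or *NOT*. In other words, we pass arrays of values, and the [] after each key specifies that; without it, nothing will work: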

data = {
    "publication_type_published": "on",
    "publication_type_granted": "on",
    "fieldDate": "APD",
    "datefieldfrom": "19120101",
    "datefieldto": "20160906",
    "operatordate": " AND ",
    "field[]": ["PA"], # claims, description, patent-number codes go here
    "fieldvalue[]": ["chris*"], # matching values for ^^ go here
    "operator[]": [" AND "], # matching sql logic for ^^ goes here
    "page": "1", # gives you next page results
    "start": "0", # not sure what effect this actually has.
    "limit": "25"} # not sure how this relates as len(r.json()[u'record']) stays 25 regardless

import requests 
from time import time 

post = "http://ipindiaservices.gov.in/publicsearch/resources/webservices/search.php?_dc={}".format(
    str(time()).replace(".", "")) 

with requests.Session() as s: 
    s.get("http://ipindiaservices.gov.in/publicsearch/") 
    s.headers.update({"X-Requested-With": "XMLHttpRequest"}) 
    r = s.post(post, data=data) 
    print(r.json()) 

The output will look like the following; I can't add all of it because there is too much data to post:

{u'success': True, u'record': [{u'Publication_Status': u'Published', u'appDate': u'2016/06/16', u'pubDate': u'2016/08/31', u'title': u'ACTUATOR FOR DEPLOYABLE IMPLANT', u'sourceID': u'inpat', u'abstract': u'\n Systems and methods are provided for usin............. 

If you use the record key, you get a list of dicts like:

{u'Publication_Status': u'Published', u'appDate': u'2015/01/27', u'pubDate': u'2015/06/26', u'title': u'CORRUGATED PALLET', u'sourceID': u'inpat', u'abstract': u'\n A corrugated paperboard pallet is produced from two flat blanks which comprise a pallet top and a pallet bottom. The two blanks are each folded to produce only two parallel vertically extending double thickness ribs three horizontal panels two vertical side walls and two horizontal flaps. The ribs of the pallet top and pallet bottom lock each other from opening in the center of the pallet by intersecting perpendicularly with notches in the ribs. The horizontal flaps lock the ribs from opening at the edges of the pallet by intersecting perpendicularly with notches and the vertical sidewalls include vertical flaps that open inward defining fork passages whereby the vertical flaps lock said horizontal flaps from opening.\n ', u'Assignee': u'OLVEY Douglas A., SKETO James L., GUMBERT Sean G., DANKO Joseph J., GABRYS Christopher W., ', u'field_of_invention': u'FI10', u'publication_no': u'26/2015', u'patent_no': u'', u'application_no': u'642/DELNP/2015', u'UCID': u'WVJ4NVVIYzFLcUQvVnJsZGczcVRmSS96Vkh3NWsrS1h3Qk43S2xHczJ2WT0%3D', u'Publication_Type': u'A'} 

That is your patent information.
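As a rough sketch of working with those dicts (this is not part of the original answer): the helper `summarize` below is a hypothetical name. It assumes the key names shown in the sample record above, and that assignee names are separated by ", " with a trailing separator, as in that sample. The `sample` dict is abridged from the record above.

```python
def summarize(record):
    """Pick a few fields out of one entry of r.json()['record'].

    Assumption: key names match the sample record shown above, and
    assignee names are joined with ", " (with a trailing separator).
    """
    raw_assignees = record.get("Assignee", "")
    assignees = [name.strip() for name in raw_assignees.split(", ") if name.strip()]
    return {
        "title": record.get("title", ""),
        "application_no": record.get("application_no", ""),
        "published": record.get("pubDate", ""),
        "assignees": assignees,
    }


# Abridged copy of the sample record above, so this runs without network access.
sample = {
    "Publication_Status": "Published",
    "appDate": "2015/01/27",
    "pubDate": "2015/06/26",
    "title": "CORRUGATED PALLET",
    "Assignee": "OLVEY Douglas A., SKETO James L., GUMBERT Sean G., "
                "DANKO Joseph J., GABRYS Christopher W., ",
    "application_no": "642/DELNP/2015",
}

summary = summarize(sample)
print(summary["title"], summary["application_no"], summary["assignees"])
```

With a live response you would call it as `[summarize(rec) for rec in r.json()["record"]]`.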

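You can see that if we choose a few values in the browser, all the values in fieldvalue, field and operator line up. AND is the default, so you see it for each option: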

enter image description here

enter image description here

So just work out the codes, pick what you want, and post.
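To keep the three lists aligned, one option is to build them from a single list of triples. This is only a sketch: `build_payload` is a made-up helper, the fixed keys are copied from the `data` dict above, and the padded " OR " and " NOT " strings are assumed to follow the same format as the " AND " seen in the request. Only the "PA" field code comes from the example above; other codes have to be read from the browser's request.

```python
def build_payload(criteria, page=1):
    """Build form data for search.php from (field, value, operator) triples.

    Assumption: operators are sent padded with spaces, like " AND " above.
    """
    payload = {
        "publication_type_published": "on",
        "publication_type_granted": "on",
        "fieldDate": "APD",
        "datefieldfrom": "19120101",
        "datefieldto": "20160906",
        "operatordate": " AND ",
        "page": str(page),
        "start": "0",
        "limit": "25",
    }
    fields, values, operators = [], [], []
    for field, value, operator in criteria:
        if operator not in ("AND", "OR", "NOT"):
            raise ValueError("unknown operator: {!r}".format(operator))
        # Appending to all three lists together keeps them the same length
        # and in the same order.
        fields.append(field)
        values.append(value)
        operators.append(" {} ".format(operator))
    payload["field[]"] = fields
    payload["fieldvalue[]"] = values
    payload["operator[]"] = operators
    return payload


payload = build_payload([("PA", "chris*", "AND"), ("PA", "smith*", "OR")], page=2)
print(payload["field[]"], payload["fieldvalue[]"], payload["operator[]"])
```

The result can be passed as `data=payload` to `s.post(...)` in the session code above.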

+0

This is awesome! Thank you. I will write the code and then do more with it. Thanks a lot. –

+1

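No worries. It is just a matter of picking whatever values you want, making sure the lists line up, and posting to the URL; you will get what you want back as JSON. –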