
I want to parse the HTML page below using BeautifulSoup (I need to parse a large number of pages). What is the best way in Python to parse HTML pages whose fields are not in a fixed order?

I need to save all the fields from each page, but they can change dynamically (across different pages).

Here is an example of one page - Page 1 - and a page with the fields in a different order - Page 2.

I have written the following code to parse a page.

import requests 
from bs4 import BeautifulSoup 

PTiD = 7680560 

url = "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/PTO/srchnum.htm&r=1&f=G&l=50&s1=" + str(PTiD) + ".PN.&OS=PN/" + str(PTiD) + "&RS=PN/" + str(PTiD) 

res = requests.get(url)

raw_html = res.content 

print "Parser Started.. " 

bs_html = BeautifulSoup(raw_html, "lxml") 

#Initialize all the Search Lists 
fonts = bs_html.find_all('font') 
para = bs_html.find_all('p') 
bs_text = bs_html.find_all(text=True) 
onlytext = [x for x in bs_text if x != '\n' and x != ' '] 

#Initialize the Indexes 
AppNumIndex = onlytext.index('Appl. No.:\n') 
FiledIndex = onlytext.index('Filed:\n ') 
InventorsIndex = onlytext.index('Inventors: ') 
AssigneeIndex = onlytext.index('Assignee:') 
ClaimsIndex = onlytext.index('Claims') 
DescriptionIndex = onlytext.index(' Description') 
CurrentUSClassIndex = onlytext.index('Current U.S. Class:') 
CurrentIntClassIndex = onlytext.index('Current International Class: ') 
PrimaryExaminerIndex = onlytext.index('Primary Examiner:') 
AttorneyOrAgentIndex = onlytext.index('Attorney, Agent or Firm:') 
RefByIndex = onlytext.index('[Referenced By]') 

#~~Title~~ 
for a in fonts:
    if a.has_attr('size') and a['size'] == '+1':
        d_title = a.string
print "title: " + d_title

#~~Abstract~~~ 
d_abstract = para[0].string 
print "abstract: " + d_abstract 

#~~Assignee Name~~ 
d_assigneeName = onlytext[AssigneeIndex +1] 
print "as name: " + d_assigneeName 

#~~Application number~~ 
d_appNum = onlytext[AppNumIndex + 1] 
print "ap num: " + d_appNum 

#~~Application date~~ 
d_appDate = onlytext[FiledIndex + 1] 
print "ap date: " + d_appDate 

#~~ Patent Number~~ 
d_PatNum = onlytext[0].split(':')[1].strip() 
print "patnum: " + d_PatNum 

#~~Issue Date~~ 
d_IssueDate = onlytext[10].strip('\n') 
print "issue date: " + d_IssueDate 

#~~Inventors Name~~ 
d_InventorsName = '' 
for x in range(InventorsIndex+1, AssigneeIndex, 2): 
    d_InventorsName += onlytext[x] 
print "inv name: " + d_InventorsName 

#~~Inventors City~~ 
d_InventorsCity = '' 

for x in range(InventorsIndex+2, AssigneeIndex, 2): 
    d_InventorsCity += onlytext[x].split(',')[0].strip().strip('(') 

d_InventorsCity = d_InventorsCity.strip(',').strip().strip(')') 
print "inv city: " + d_InventorsCity 

#~~Inventors State~~ 
d_InventorsState = '' 
for x in range(InventorsIndex+2, AssigneeIndex, 2): 
    d_InventorsState += onlytext[x].split(',')[1].strip(')').strip() + ',' 

d_InventorsState = d_InventorsState.strip(',').strip() 
print "inv state: " + d_InventorsState 

#~~ Asignee City ~~ 
d_AssigneeCity = onlytext[AssigneeIndex + 2].split(',')[1].strip().strip('\n').strip(')') 
print "asign city: " + d_AssigneeCity 

#~~ Assignee State~~ 
d_AssigneeState = onlytext[AssigneeIndex + 2].split(',')[0].strip('\n').strip().strip('(') 
print "asign state: " + d_AssigneeState 

#~~Current US Class~~ 
d_CurrentUSClass = '' 

for x in range(CurrentUSClassIndex + 1, CurrentIntClassIndex):
    d_CurrentUSClass += onlytext[x] 
print "cur us class: " + d_CurrentUSClass 

#~~ Current Int Class~~ 
d_CurrentIntlClass = onlytext[CurrentIntClassIndex +1] 
print "cur intl class: " + d_CurrentIntlClass 

#~~~Primary Examiner~~~ 
d_PrimaryExaminer = onlytext[PrimaryExaminerIndex +1] 
print "prim ex: " + d_PrimaryExaminer 

#~~d_AttorneyOrAgent~~ 
d_AttorneyOrAgent = onlytext[AttorneyOrAgentIndex +1] 
print "agent: " + d_AttorneyOrAgent 

#~~ Referenced by ~~
d_ReferencedBy = ''
for x in range(RefByIndex + 2, RefByIndex + 400):
    if ('Foreign' in onlytext[x]) or ('Primary' in onlytext[x]):
        break
    else:
        d_ReferencedBy += onlytext[x]
print "ref by: " + d_ReferencedBy

#~~Claims~~ 
d_Claims = '' 

for x in range(ClaimsIndex , DescriptionIndex): 
    d_Claims += onlytext[x] 
print "claims: " + d_Claims 

I put all of the text on the page into a list (using BeautifulSoup's find_all(text=True)). Then I find the index of each field name and walk the list from that position, concatenating the entries into a string until I reach the index of the next field.

When I tried the code on several different pages, I noticed that the structure of the entries changes, and I cannot find their indexes in the list. For example, I search for the index of '123', and on some pages it appears in the list as '12', '3'.
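A minimal reproduction of the behavior (the markup here is a hypothetical fragment, not taken from the actual patent pages):

from bs4 import BeautifulSoup

# The same logical value, with and without a nested tag.
print BeautifulSoup('<p>123</p>', 'lxml').find_all(text=True)
# [u'123']
print BeautifulSoup('<p>12<b>3</b></p>', 'lxml').find_all(text=True)
# [u'12', u'3'] -- the nested <b> splits one value into two text nodes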

Can you think of any other way to parse such pages generically?

Thanks.


For the pattern, I have updated my post – pinkdawn

Answer


If you use BeautifulSoup and the DOM is <p>123</p>, find_all(text=True) will give you ['123'].

But if you have the DOM <p>12<b>3</b></p>, which is semantically the same as before, BeautifulSoup will give you ['12', '3'].

Perhaps you can get the complete ['123'] by simply ignoring/stripping that tag first.

How to remove the <b> tag:

import re
html = '<p>12<b>3</b></p>'
reExp = r'<[/!]?b[^<>]*?>'
print re.sub(reExp, '', html)
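If you would rather stay inside BeautifulSoup instead of using a regex, here is a minimal sketch using unwrap(), bs4's method for removing a tag while keeping its contents in place:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>12<b>3</b></p>', 'lxml')
for b in soup.find_all('b'):
    b.unwrap()              # drop the <b> tag, keep its text
print soup.p.get_text()     # prints '123'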

As some pseudo-code for the pattern approach, you could do something like this:

import re
patterns = r'<TD align=center>(?P<VALUES_TO_FIND>.*?)</TD>'
print re.findall(patterns, your_html)
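To extract a single value, you can also anchor the pattern on the fixed text around it and read the named group; a sketch over a hypothetical fragment:

import re

html = 'Reissue of: <TD align=center>some value</TD>'  # hypothetical fragment
m = re.search(r'Reissue of: <TD align=center>(?P<VALUES_TO_FIND>.*?)</TD>', html)
if m:
    print m.group('VALUES_TO_FIND')  # prints 'some value'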

And what about the pattern, if I want to find content by searching on what comes before and after it? For example, if I have the HTML code: Reissue of: **VALUES_TO_FIND**, and I know for certain that the code before and after **VALUES_TO_FIND** is always the same. How can I find it using re? Thanks. – Rgo


@Rgo I have updated the main post, for your reference – pinkdawn
