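I'm creating a distributed crawling application in Python. It consists of a master server and associated client applications that run on client servers. The purpose of each client application is to run against a target website and extract specific data. The clients need to go "deep" within a site, behind multiple levels of forms, so each client is written specifically for a particular website.
Each client application looks something like this: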
    main:
        parse initial url
        call function level1 (data1)

    function level1 (data1)
        parse the url, for data1
        use the required xpath to get the dom elements
        call the next function
        call level2 (data2)

    function level2 (data2)
        parse the url, for data2
        use the required xpath to get the dom elements
        call the next function
        call level3 (data3)

    function level3 (data3)
        parse the url, for data3
        use the required xpath to get the dom elements
        call the next function
        call level4 (data4)

    function level4 (data4)
        parse the url, for data4
        use the required xpath to get the dom elements

    at the final function..
        --all the data is output, and eventually returned to the server
        --at this point the data has elements from each function...
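For concreteness, the outline above might look roughly like the following in Python. This is only a sketch under stated assumptions: `FAKE_SITE`, `fetch`, and the `level1`..`level3` record keys are hypothetical stand-ins for real pages and XPath results, and only three levels are shown.

```python
# Hypothetical in-memory "site": each URL maps to (value found on the page, child URLs).
FAKE_SITE = {
    "/start": ("root", ["/a", "/b"]),
    "/a": ("A", ["/a/1"]),
    "/b": ("B", ["/b/1", "/b/2"]),
    "/a/1": ("A1", []),
    "/b/1": ("B1", []),
    "/b/2": ("B2", []),
}

def fetch(url):
    """Stand-in for fetching a page and applying XPath to it."""
    return FAKE_SITE[url]

def level1(url, record):
    value, children = fetch(url)
    record = dict(record, level1=value)
    results = []
    for child in children:  # the number of children varies per page
        results.extend(level2(child, record))
    return results

def level2(url, record):
    value, children = fetch(url)
    record = dict(record, level2=value)
    results = []
    for child in children:
        results.extend(level3(child, record))
    return results

def level3(url, record):
    value, _ = fetch(url)
    # Final level: the record now holds one element from every level.
    return [dict(record, level3=value)]

def main(start_url="/start"):
    return level1(start_url, {})

rows = main()
```

Each level copies the record before adding its own field, so sibling branches never overwrite each other's data.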
My question: because the number of child-function calls made by the current function varies, I'm trying to figure out the best approach.
Each function essentially fetches a page of content and then parses
the page using a number of different XPath expressions, combined
with different regular expressions depending on the site/page.
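As an illustration of that fetch-then-parse step, here is a minimal sketch using only the standard library. The page content, the `courses` table, and `COURSE_RE` are all made up for the example. A real client would download the page over HTTP and would probably need an HTML-tolerant parser with full XPath support (lxml is one option), since `xml.etree.ElementTree` only handles well-formed markup and a subset of XPath.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page content; a real client would download this over HTTP.
PAGE = """
<html><body>
  <table id="courses">
    <tr><td class="name">MATH 101 - Calculus</td><td class="link"><a href="/course?id=101">more</a></td></tr>
    <tr><td class="name">PHYS 202 - Mechanics</td><td class="link"><a href="/course?id=202">more</a></td></tr>
  </table>
</body></html>
"""

# Hypothetical site-specific regex applied to the text that XPath selects.
COURSE_RE = re.compile(r"^(?P<dept>[A-Z]+) (?P<number>\d+) - (?P<title>.+)$")

def parse_page(markup):
    """Return (records, next_urls) extracted from one page."""
    root = ET.fromstring(markup)
    records, next_urls = [], []
    # ElementTree supports only a subset of XPath; this path stays within it.
    for row in root.findall(".//table[@id='courses']/tr"):
        name = row.find("td[@class='name']").text
        match = COURSE_RE.match(name.strip())
        if match:
            records.append(match.groupdict())
        link = row.find("td[@class='link']/a")
        if link is not None:
            next_urls.append(link.get("href"))  # URLs for the next level
    return records, next_urls

records, next_urls = parse_page(PAGE)
```

Returning the extracted records and the next-level URLs separately keeps the parsing independent of how the next level gets scheduled.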
If I run a client on a single box as a sequential process, it'll
take a while, but the load on the box is rather small. I've thought
of implementing the child functions as threads spawned from the
current function, but that could be a nightmare, and it could quickly
bring the "box" to its knees!
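One way the threading idea might be kept from overwhelming the box is to cap the number of concurrent fetches with a fixed-size pool instead of starting one thread per child call. The sketch below assumes this approach. `fetch_and_parse`, `MAX_WORKERS`, and the URLs are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4  # assumed cap; tune for the box and the target site

def fetch_and_parse(url):
    """Hypothetical stand-in for one level's fetch + XPath/regex work."""
    return {"url": url, "length": len(url)}

def run_children(urls, max_workers=MAX_WORKERS):
    # The pool never runs more than max_workers fetches at once, no matter
    # how many child URLs a page produces.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_and_parse, urls))  # results keep input order

child_urls = ["/page/%d" % n for n in range(10)]
results = run_children(child_urls)
```

Because the work is mostly waiting on the network, a small pool may already give most of the speedup while keeping the load on the box bounded.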
I've thought of breaking the app up so that the master essentially
passes packets to the client boxes, allowing each client/function to
be run directly from the master. This requires a bit of a rewrite,
but it has a number of advantages: a good deal of redundancy, and
speed. It would also detect if a section of the process was crashing
and restart from that point. But I'm not sure it would be any
faster...
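To make the packet idea concrete, here is a rough single-process simulation. It is a sketch, not a real distributed setup: the "clients" are just a function call, and `SITE`, `run_task`, `MAX_ATTEMPTS`, and the simulated crash are all hypothetical. It only shows how a master could retry a failed packet without redoing the work that already finished.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed retry limit

# Hypothetical site map: url -> child urls. A real client would fetch and parse.
SITE = {"/start": ["/a", "/b"], "/a": [], "/b": ["/b/1"], "/b/1": []}
_failed_once = set()

def run_task(task):
    """Simulated client work for one packet; '/b' crashes on its first try."""
    url = task["url"]
    if url == "/b" and url not in _failed_once:
        _failed_once.add(url)
        raise RuntimeError("simulated client crash")
    path = task["path"] + [url]
    children = [{"url": child, "path": path, "attempts": 0} for child in SITE[url]]
    return path, children

def master(start_url):
    pending = deque([{"url": start_url, "path": [], "attempts": 0}])
    done, dead = [], []
    while pending:
        task = pending.popleft()
        try:
            path, children = run_task(task)
        except Exception:
            task["attempts"] += 1
            # Retry only the failed packet; finished work is not redone.
            (pending if task["attempts"] < MAX_ATTEMPTS else dead).append(task)
            continue
        if children:
            pending.extend(children)
        else:
            done.append(path)  # leaf reached: path holds every level's data
    return done, dead

done, dead = master("/start")
```

Each packet carries everything needed to resume at its level, which is what would let a real master hand it to any available client box.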
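I'm writing the parsing scripts in Python.
So... any thoughts/comments would be greatly appreciated.
I can go into a lot more detail, but I didn't want to bore anyone!
Thanks!
Tom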
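You may want to remove the "code indentation" from the second half of the question, since it contains no code. – viksit 2010-04-19 19:52:58
Also, please use capital letters, especially for the personal pronoun (I). If your question is easy to read, you'll get good answers. If your question is hard to read (for example, lowercase 'i' everywhere), people will stop trying to parse it and move on. – 2010-04-19 20:06:46
Seriously, why delete it? Valid question, valid answers. How about accepting the answer that helped you most? – Will 2011-01-14 18:26:14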