Algorithm:

```text
Set the root URL's distance to zero and all others to NULL
Begin loop:
    Find the URLs at the current distance
    Find the URLs they link to; if their distance is NULL, set it to current + 1
    Increment the current distance
Loop while there are URLs whose distance is not yet set
```
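The pass above is a plain level-by-level breadth-first search. A minimal Python sketch of the same idea, with a made-up adjacency dict standing in for the Links table (the URLs are hypothetical, not from the real data set):

```python
# Sketch of the level-by-level distance pass; the link graph is invented.
links = {
    "/": ["/about", "/blog"],
    "/about": ["/"],
    "/blog": ["/blog/post-1"],
    "/blog/post-1": ["/blog"],
}

distance = {url: None for url in links}   # all distances start as NULL
distance["/"] = 0                         # root URL starts at 0
current = 0

while any(d is None for d in distance.values()):
    frontier = [u for u, d in distance.items() if d == current]
    if not frontier:                      # orphaned URLs: stop rather than loop forever
        break
    for url in frontier:
        for target in links[url]:
            if distance[target] is None:  # only set a distance once
                distance[target] = current + 1
    current += 1

print(distance)
```

The `if not frontier: break` guard handles the orphaned-URL case discussed in the notes below, which the SQL version does not protect against.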
Tested with your data (942 URLs, 27,008 links), with the following results:
Shortest clicks from the start page:
| Distance | Count |
|---------:|------:|
| 0        | 1     |
| 1        | 149   |
| 2        | 600   |
| 3        | 141   |
| 4        | 38    |
| 5        | 7     |
| 6        | 6     |
Shortest clicks from/back to the start page (uncomment the 3 UNION and SELECT lines):
| Distance | Count |
|---------:|------:|
| 0        | 1     |
| 1        | 494   |
| 2        | 447   |
I have put it on SQL Fiddle with a small amount of my own test data (I had to use SQL Server, since the site only allows SELECT queries for MySQL):
http://sqlfiddle.com/#!6/efdd1/4
```sql
UPDATE crawl_urls SET Distance = NULL;      -- Reset distances for the test

UPDATE crawl_urls SET Distance = 0          -- Start root URL at distance 0
WHERE ID = (SELECT MIN(ID) FROM crawl_urls);

DECLARE @UrlsToDo int = -1;                 -- Count of URLs still to process
DECLARE @Distance int = 0;                  -- Current distance from the root

WHILE (@UrlsToDo != 0)                      -- Loop while there are URLs to process
BEGIN
    UPDATE crawl_urls                       -- Find URLs linked from the current distance
    SET Distance = @Distance + 1            -- and set their distance
    WHERE Distance IS NULL AND ID IN (
        SELECT target_urls_id IDs FROM Links L1
        INNER JOIN crawl_urls A ON L1.crawl_urls_id = A.ID AND A.Distance = @Distance
        --UNION ALL                         -- Union of both sides of the link
        --SELECT crawl_urls_id IDs FROM Links L2  -- Uncomment for shortest way BACK
        --INNER JOIN crawl_urls B ON L2.target_urls_id = B.ID AND B.Distance = @Distance
    );

    SET @UrlsToDo = (SELECT COUNT(ID) FROM crawl_urls WHERE Distance IS NULL);
    SET @Distance = @Distance + 1;          -- Increment distance and loop
END

SELECT * FROM crawl_urls ORDER BY Distance; -- Output results
```
Notes: You will need to make sure the root URL starts out with distance 0. Also note that if there are orphaned URLs with no links connecting them to the rest, the loop could run indefinitely, although in theory that should not be possible unless something went wrong during crawling and recording. Proper indexing will make a huge difference on larger data sets.
I will try to make something similar to this as soon as I can; in the meantime, here are a few more things I noticed. About 5% of the rows in the Links table are duplicates; allowing only unique pairs would speed things up considerably, with fewer records and better indexing. Also, the home page has been added twice (with and without a trailing "/"), so there are a lot of extra duplicate links in both directions; the same probably applies to search-friendly URLs and folder names.
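The duplicate-link cleanup described above could be done before loading. A small Python sketch; the trailing-slash normalization rule and the sample data are my assumptions for illustration, not part of the original answer:

```python
# Sketch: deduplicate crawled links before inserting them.
# Stripping the trailing slash is an assumed normalization rule.
def normalize(url: str) -> str:
    """Treat '/path/' and '/path' as the same URL (the root '/' is kept)."""
    return url.rstrip("/") or "/"

raw_links = [
    ("/", "/about/"),
    ("/", "/about"),   # duplicate once normalized
    ("/about", "/"),
]

unique_links = {(normalize(src), normalize(dst)) for src, dst in raw_links}
print(sorted(unique_links))
```

A set of normalized (source, target) pairs collapses both exact duplicates and trailing-slash variants in one pass.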
You want to compute shortest paths in a graph; have a look at [Dijkstra's algorithm](http://en.wikipedia.org/wiki/Dijkstra%27s_algorithm). –
This got me interested, and now I don't think I can get through a day's work without solving it :) Do you have any test data available? Obviously not the full set; dummy URLs are fine as long as the links work. – Geezer68