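Here is a proof of concept :)
Example of splitting URLs by "?":
Parse the parameters.
Count how often each unique parameter value occurs.
Get the Nth percentile.
Generate the URLs, replacing the parameters whose frequency is greater than the Nth percentile.
For small data such as this sandbox example, the 50th percentile is enough to group some URLs.
For "big real data", use the 90th to 95th percentile. For example: I used the 90th percentile on 5000 page links -> result ~200 links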
<?php
$stats = [];
$pages = [
    (object)['page' => 'http://example.com/?page=123'],
    (object)['page' => 'http://example.com/?page=123'],
    (object)['page' => 'http://example.com/?page=123'],
    (object)['page' => 'http://example.com/?page=321'],
    (object)['page' => 'http://example.com/?page=321'],
    (object)['page' => 'http://example.com/?page=321'],
    (object)['page' => 'http://example.com/?page=qwas'],
    (object)['page' => 'http://example.com/?page=safa15'],
]; // array of objects with page property = URL

// Split each URL on "?", parse the query string and count how often
// each value occurs for each parameter name.
$params_counter = [];
foreach ($pages as $page) {
    $components = explode('?', $page->page);
    if (!empty($components[1])) {
        parse_str($components[1], $params);
        foreach ($params as $key => $val) {
            if (is_array($val)) {
                continue; // nested values such as a[]=1 cannot be used as array keys
            }
            if (!isset($params_counter[$key][$val])) {
                $params_counter[$key][$val] = 0;
            }
            $params_counter[$key][$val]++;
        }
    }
}
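// Return the value at the given percentile (0-100) of a list of numbers.
function procentile($percentile, $array)
{
    sort($array); // sort() also reindexes the array from 0
    $count = count($array);
    $index = ($percentile / 100) * $count;
    $i = (int) floor($index);
    if ($i <= 0) {
        return $array[0]; // guard: avoids reading $array[-1]
    }
    if ($i >= $count) {
        return $array[$count - 1]; // guard: avoids reading past the end
    }
    if ($i == $index) {
        // Whole-number index: average the two neighbouring values.
        return ($array[$i - 1] + $array[$i]) / 2;
    }
    return $array[$i];
}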
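// Number of distinct values seen for each parameter name.
$some_data = [];
foreach ($params_counter as $key => $val) {
    $some_data[$key] = count($val);
}
// Threshold: parameters with more distinct values than this are treated as variable.
$procentile = procentile(90, $some_data);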
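// Rebuild each query string, replacing high-cardinality parameters with
// the placeholder '$var', and count how often each pattern occurs.
foreach ($pages as $page) {
    $components = explode('?', $page->page);
    if (!empty($components[1])) {
        parse_str($components[1], $params);
        foreach ($params as $key => $val) {
            if (isset($some_data[$key]) && $some_data[$key] > $procentile) {
                $params[$key] = '$var';
            }
        }
        // Sort by parameter name so the same pattern always yields the same
        // string, whatever order the parameters had in the URL.
        ksort($params);
        $pattern = http_build_query($params);
        $new_url = urldecode('?'.$pattern);
        if (!isset($stats[$new_url])) {
            $stats[$new_url] = 0;
        }
        $stats[$new_url]++;
    }
}
arsort($stats); // most frequent patterns first
print_r($stats);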
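As a rough illustration of the steps above, the same idea can be wrapped in a reusable function and run on input that has more than one parameter. This is only a sketch under stated assumptions: the names `percentile_of` and `url_patterns` and the sample URLs are made up for the example, it uses `parse_url` instead of `explode`, it ignores nested parameters such as `a[]=1`, and it sorts parameters by name.

```php
<?php
// Hypothetical helper: percentile (0-100) of a non-empty list of numbers,
// using the same index rule as the code above, with the bounds clamped.
function percentile_of(float $percentile, array $values): float
{
    if ($values === []) {
        throw new InvalidArgumentException('percentile_of() needs at least one value');
    }
    sort($values);
    $n = count($values);
    $index = ($percentile / 100) * $n;
    $i = (int) floor($index);
    if ($i <= 0) {
        return (float) $values[0];
    }
    if ($i >= $n) {
        return (float) $values[$n - 1];
    }
    if ((float) $i === $index) {
        return ($values[$i - 1] + $values[$i]) / 2;
    }
    return (float) $values[$i];
}

// Hypothetical helper: map each query-string pattern to the number of URLs matching it.
function url_patterns(array $urls, float $percentile, string $placeholder = '$var'): array
{
    $seen = [];   // parameter name => set of distinct values
    $parsed = []; // parsed query parameters, one entry per usable URL
    foreach ($urls as $url) {
        $query = parse_url($url, PHP_URL_QUERY);
        if (!is_string($query) || $query === '') {
            continue; // no query string, or an unparsable URL
        }
        parse_str($query, $params);
        $params = array_filter($params, 'is_string'); // assumption: scalar values only
        if ($params === []) {
            continue;
        }
        ksort($params); // canonical parameter order
        $parsed[] = $params;
        foreach ($params as $key => $val) {
            $seen[$key][$val] = true;
        }
    }
    if ($seen === []) {
        return [];
    }

    // Distinct-value count per parameter, and the threshold derived from it.
    $distinct = array_map('count', $seen);
    $threshold = percentile_of($percentile, array_values($distinct));

    $stats = [];
    foreach ($parsed as $params) {
        foreach ($params as $key => $val) {
            if ($distinct[$key] > $threshold) {
                $params[$key] = $placeholder;
            }
        }
        $pattern = '?' . urldecode(http_build_query($params));
        $stats[$pattern] = ($stats[$pattern] ?? 0) + 1;
    }
    arsort($stats); // most frequent patterns first
    return $stats;
}

// Made-up sample input.
$urls = [
    'http://example.com/?id=1&lang=en',
    'http://example.com/?id=2&lang=en',
    'http://example.com/?id=3&lang=de',
    'http://example.com/?id=4&lang=en',
];
$patterns = url_patterns($urls, 50);
print_r($patterns);
```

With this input `id` has 4 distinct values and `lang` has 2, so the 50th-percentile threshold is 3 and only `id` is replaced: the result is `?id=$var&lang=en` with a count of 3 and `?id=$var&lang=de` with a count of 1.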
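_"Any algorithm to automatically detect URL patterns?"_ What do you mean by "automatically detect patterns"? – guest271314
What is "almost the same"? Not sure what you are trying to achieve? – guest271314
I think OP wants to go through the URLs and find the static parts, e.g. they all start with 'site1.com/?search=', and then find the parts that change, e.g. the search string. – vlaz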