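Here is my approach:
- Sort each author's posts by string length (ascending), so that you work from the shortest text toward the longer ones.
- Split each post's text on one or more whitespace characters, so that only whole non-whitespace substrings are handled during processing.
- Find the substrings that also occur in every subsequent post, comparing each post against an ever-narrowing array of matches (`$overlaps`).
- Group consecutive matching substrings by analyzing their index values.
- Rejoin each group of consecutive substrings into its original string form (trimmed of leading and trailing whitespace, of course).
- Sort the rejoined strings by string length (descending), so that the longest string is assigned index `0`.
- Print the substring that is assumed to be the author's signature (as a best guess), based on commonality and length.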
Code: (Demo)
$posts['Author1']=['sdsadsad daSDA DDASd asd aSD Sd dA SD ASD sadasdasds sadasd
@jhsad.sadas.com sdsdADSA sada',
'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl'];
$posts['Author2']=['This is some random string representative of non-signature text.
This is the
*author\'s* signature.',
'Different message body text. This is the
*author\'s* signature.
This is an afterthought that expresses that a signature is not always at the end.',
'Finally, this is unwanted stuff. This is the
*author\'s* signature.'];
foreach ($posts as $author => $texts) {
    echo "Author: $author\n";
    usort($texts, function ($a, $b) { return strlen($a) - strlen($b); }); // sort ASC by strlen; mb_strlen probably isn't advantageous
    var_export($texts);
    echo "\n";

    foreach ($texts as $index => $string) {
        $words = preg_split('/\s+/', $string, -1, PREG_SPLIT_NO_EMPTY); // -1 = no limit
        if (!$index) {
            $overlaps = $words; // declare with all non-whitespace substrings from the first (shortest) text
        } else {
            $overlaps = array_intersect($overlaps, $words); // narrow the word bank; keys from the first text are preserved
        }
    }
    var_export($overlaps);
    echo "\n";

    // batch consecutive substrings
    $group = null;
    $last = null;
    $consecutives = []; // clear previous iteration's data
    foreach ($overlaps as $i => $word) {
        if ($group === null || $i - $last > 1) {
            $group = $i; // a gap in the indexes starts a new group
        }
        $last = $i;
        $consecutives[$group][] = $word;
    }
    var_export($consecutives);
    echo "\n";

    $potential_signatures = []; // clear previous iteration's data
    foreach ($consecutives as $words) {
        // match potential signatures in the first text for measurement;
        // preg_quote() makes each word literal and also escapes the pattern delimiter
        $pattern = '/' . implode('\s+', array_map(function ($word) { return preg_quote($word, '/'); }, $words)) . '/';
        if (preg_match_all($pattern, $texts[0], $out)) {
            $potential_signatures = array_merge($potential_signatures, $out[0]); // keep candidates from every group
        }
    }
    usort($potential_signatures, function ($a, $b) { return strlen($b) - strlen($a); }); // sort DESC by strlen; mb_strlen probably isn't advantageous
    echo 'Assumed Signature: ' . ($potential_signatures[0] ?? 'none found') . "\n\n";
}
Output:
Author: Author1
array (
0 => 'sdsadsad daSDA DDASd asd aSD Sd dA SD ASD sadasdasds sadasd
@jhsad.sadas.com sdsdADSA sada',
1 => 'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl',
2 => 'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
)
array (
11 => '@jhsad.sadas.com',
)
array (
11 =>
array (
0 => '@jhsad.sadas.com',
),
)
Assumed Signature: @jhsad.sadas.com
Author: Author2
array (
0 => 'Finally, this is unwanted stuff. This is the
*author\'s* signature.',
1 => 'This is some random string representative of non-signature text.
This is the
*author\'s* signature.',
2 => 'Different message body text. This is the
*author\'s* signature.
This is an afterthought that expresses that a signature is not always at the end.',
)
array (
2 => 'is',
5 => 'This',
6 => 'is',
7 => 'the',
8 => '*author\'s*',
9 => 'signature.',
)
array (
2 =>
array (
0 => 'is',
),
5 =>
array (
0 => 'This',
1 => 'is',
2 => 'the',
3 => '*author\'s*',
4 => 'signature.',
),
)
Assumed Signature: This is the
*author's* signature.
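
The non-sequential keys in the output above (2, 5, 6, ...) come from array_intersect(), which keeps the keys of its first argument; that is what makes grouping by index possible. The following is a minimal sketch of just that step, using made-up sample sentences and a hypothetical helper name, groupConsecutive(), neither of which appears in the code above:

```php
<?php
// Hypothetical helper: batch words whose keys are consecutive integers.
function groupConsecutive(array $overlaps): array
{
    $groups = [];
    $start = null; // key of the first word in the current group
    $last = null;  // key of the previously visited word
    foreach ($overlaps as $i => $word) {
        if ($start === null || $i - $last > 1) {
            $start = $i; // a gap in the keys starts a new group
        }
        $last = $i;
        $groups[$start][] = $word;
    }
    return $groups;
}

// Made-up sample texts (assumption, not from the posts above).
$first  = preg_split('/\s+/', 'Thanks for the help. Kind regards, Jo', -1, PREG_SPLIT_NO_EMPTY);
$second = preg_split('/\s+/', 'See the attached file. Kind regards, Jo', -1, PREG_SPLIT_NO_EMPTY);

// array_intersect() preserves the keys of $first.
$overlaps = array_intersect($first, $second);
var_export($overlaps);
echo "\n";
var_export(groupConsecutive($overlaps));
echo "\n";
```

Here the shared word at key 2 ends up in its own group, while the words at keys 4 to 6 form one group. Note that the keys only describe positions in the first text, so words that are adjacent there are not guaranteed to be adjacent in the other texts.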
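Do you need to find '@jhsad.sadas.com', or just confirm that the string contains it? Do you allow loose matches, e.g. '@jhsad.sadas.com.uk'? '@jhsad\.sadas\.com\b' would work, or use 'preg_quote' if the domain is a variable. – chris85
@chris85, I want to find the author's signature in his posts. I don't know what it will be or where he will use it. – mrmrn
If you don't know what it is, then how do you identify it? – chris85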
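
For the case raised in the first comment, where the signature is already known, a rough sketch of building the pattern with preg_quote() might look like this. The variable name $domain and the test strings are assumptions made for illustration:

```php
<?php
// $domain is a hypothetical variable holding the known address fragment.
$domain  = '@jhsad.sadas.com';

// preg_quote() escapes the dots (and the '/' delimiter, if present);
// \b requires a word boundary right after "com".
$pattern = '/' . preg_quote($domain, '/') . '\b/';

$exact  = preg_match($pattern, 'hkgfl @jhsad.sadas.com dsfjdshflkds'); // literal match
$dots   = preg_match($pattern, 'hkgfl @jhsadXsadasXcom dsfjdshflkds'); // dots are literal, so no match
$longer = preg_match($pattern, 'hkgfl @jhsad.sadas.community');        // no boundary after "com"
$loose  = preg_match($pattern, 'hkgfl @jhsad.sadas.com.uk');           // "." after "com" is a boundary

var_dump($exact, $dots, $longer, $loose);
```

Because a dot is a non-word character, \b still lets '@jhsad.sadas.com.uk' through; if that loose match is unwanted, a stricter ending such as a lookahead for whitespace or end of string would be needed.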