How do I find the longest substring that appears in every element of an array?

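I have collections of texts from several authors. Each author has a unique signature or link that occurs in all of their texts. How can I find the longest substring that appears in every element of the array?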

Example for Author1:

$texts=['sdsadsad daSDA DDASd asd aSD Sd dA SD ASD sadasdasds sadasd 

@jhsad.sadas.com sdsdADSA sada', 
'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g 
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf', 
'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl 
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl'];

Expected output for Author1: @jhsad.sadas.com

Example for Author2:

$texts=['This is some random string representative of non-signature text. 

This is the 
*author\'s* signature.', 
'Different message body text.  This is the 
*author\'s* signature. 

This is an afterthought that expresses that a signature is not always at the end.', 
'Finally, this is unwanted stuff. This is the 
*author\'s* signature.'];

Expected output for Author2:

This is the 
*author's* signature.

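Note in particular that there are no reliable identifying characters (or positions) that mark the start or end of a signature. It could be a URL, a Twitter mention, or plain text of any kind and any length, made up of any sequence of characters, and it may occur at the start, end, or middle of a string.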

I am looking for a method that extracts the longest substring that exists in all $texts elements for a single author.

For the purposes of this task, every author is expected to have a signature substring that occurs in each of their posts/texts.

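IDEA: I'm thinking of converting words to vectors and measuring the similarity between each pair of texts. We could use cosine similarity to find the signature. I think the solution must be something along these lines.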
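To make the measure mentioned in the idea concrete, here is a minimal sketch of cosine similarity over word-count vectors. The function name `cosineSimilarity` and the whitespace tokenization are assumptions made for illustration. Note that it only yields a similarity score between two texts; it does not by itself locate a signature.

```php
<?php
// Hypothetical helper: cosine similarity between the word-count vectors of two texts.
function cosineSimilarity(string $a, string $b): float
{
    // Build word => count maps from whitespace-separated tokens.
    $countsA = array_count_values(preg_split('/\s+/', $a, -1, PREG_SPLIT_NO_EMPTY));
    $countsB = array_count_values(preg_split('/\s+/', $b, -1, PREG_SPLIT_NO_EMPTY));

    // Dot product over the words of the first text (absent words contribute 0).
    $dot = 0;
    foreach ($countsA as $word => $count) {
        $dot += $count * ($countsB[$word] ?? 0);
    }

    // Euclidean norms of both count vectors.
    $square = function ($c) { return $c * $c; };
    $normA = sqrt(array_sum(array_map($square, $countsA)));
    $normB = sqrt(array_sum(array_map($square, $countsB)));

    // Guard against empty texts (zero-length vectors).
    return ($normA && $normB) ? $dot / ($normA * $normB) : 0.0;
}

echo cosineSimilarity('body one @sig.example', 'other body @sig.example'), "\n";
```

A score near 1 means the two texts share most of their words; a score of 0 means they share none.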

mickmackusa's commented code captures the essence of what is needed, but I'd like to see whether there are other ways to achieve the desired result.

Source

2017-10-13 mrmrn

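Do you need to find '@jhsad.sadas.com', or to confirm that a string contains it? Do you allow loose matches, e.g. '@jhsad.sadas.com.uk'? '@jhsad\.sadas\.com\b' could work, or, if the domain is a variable, use 'preg_quote'. – chris85

@chris85, I want to find the author's signature in his posts. I don't know what it will be or where he will use it. – mrmrn

If you don't know what it is, then how do you identify it? – chris85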
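As a sketch of the 'preg_quote' suggestion in the comments above: this assumes the signature is already known and stored in a hypothetical $signature variable, which the asker says is not the case here.

```php
<?php
// Assumption: the signature is known in advance (the asker states it is not).
$signature = '@jhsad.sadas.com';
$text = 'hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf';

// preg_quote() escapes regex metacharacters such as the dots;
// the second argument also escapes the "/" pattern delimiter.
$pattern = '/' . preg_quote($signature, '/') . '\b/';

var_dump(preg_match($pattern, $text) === 1);
```

Without the escaping, each dot in the signature would match any character rather than a literal dot.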

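Here is my idea:

1. Sort an author's collection of posts by string length (ascending), so that you work from smaller texts to larger texts.
2. Split each post's text on one or more whitespace characters, so that only wholly non-whitespace substrings are handled during processing.
3. Find the substrings that occur in each subsequent post by comparing it against an ever-narrowing array of substrings (overlaps).
4. Group consecutive matching substrings by analyzing their index values.
5. Reassemble the grouped consecutive substrings into their original string form (trimmed of leading and trailing whitespace, of course).
6. Sort the reassembled strings by string length (descending), so that the longest string is assigned index 0.
7. Print the substring assumed to be the author's signature (as a best guess), based on commonality and length.

Code: (Demo)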

$posts['Author1']=['sdsadsad daSDA DDASd asd aSD Sd dA SD ASD sadasdasds sadasd 

@jhsad.sadas.com sdsdADSA sada', 
'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g 
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf', 
'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl 
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl']; 

$posts['Author2']=['This is some random string representative of non-signature text. 

This is the 
*author\'s* signature.', 
     'Different message body text.  This is the 
*author\'s* signature. 

    This is an afterthought that expresses that a signature is not always at the end.', 
     'Finally, this is unwanted stuff. This is the 
*author\'s* signature.']; 

foreach($posts as $author=>$texts){ 
    echo "Author: $author\n"; 

    usort($texts,function($a,$b){return strlen($a)-strlen($b);}); // sort ASC by strlen; mb_strlen probably isn't advantageous 
    var_export($texts); 
    echo "\n"; 

    foreach($texts as $index=>$string){ 
     if(!$index){ 
      $overlaps=preg_split('/\s+/',$string,-1,PREG_SPLIT_NO_EMPTY); // declare with all non-white-space substrings from first text; -1 means no limit (passing NULL is deprecated as of PHP 8.1) 
     }else{ 
      $overlaps=array_intersect($overlaps,preg_split('/\s+/',$string,-1,PREG_SPLIT_NO_EMPTY)); // filter word bank using narrowing number of words 
     } 
    } 
    var_export($overlaps); 
    echo "\n"; 

    // batch consecutive substrings 
    $group=null; 
    $consecutives=[]; // clear previous iteration's data 
    foreach($overlaps as $i=>$word){ 
     if($group===null || $i-$last>1){ 
      $group=$i; 
     } 
     $last=$i; 
     $consecutives[$group][]=$word; 
    } 
    var_export($consecutives); 
    echo "\n"; 

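    $potential_signatures=[]; // clear previous iteration's data 
    foreach($consecutives as $words){ 
     // match potential signatures in first text for measurement: 
     $pattern='/'.implode('\s+',array_map(function($word){return preg_quote($word,'/');},$words)).'/'; // preg_quote makes each word literal, including any "/" delimiter 
     if(preg_match_all($pattern,$texts[0],$out)){ 
      $potential_signatures=array_merge($potential_signatures,$out[0]); // collect matches from every group, not only the last one 
     } 
    } 
    usort($potential_signatures,function($a,$b){return strlen($b)-strlen($a);}); // sort DESC by strlen; mb_strlen probably isn't advantageous 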

    echo "Assumed Signature: {$potential_signatures[0]}\n\n"; 
}

Output:

Author: Author1 
array (
    0 => 'sdsadsad daSDA DDASd asd aSD Sd dA SD ASD sadasdasds sadasd 

@jhsad.sadas.com sdsdADSA sada', 
    1 => 'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl 
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl', 
    2 => 'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g 
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf', 
) 
array (
    11 => '@jhsad.sadas.com', 
) 
array (
    11 => 
    array (
    0 => '@jhsad.sadas.com', 
), 
) 
Assumed Signature: @jhsad.sadas.com 

Author: Author2 
array (
    0 => 'Finally, this is unwanted stuff. This is the 
*author\'s* signature.', 
    1 => 'This is some random string representative of non-signature text. 

This is the 
*author\'s* signature.', 
    2 => 'Different message body text.  This is the 
*author\'s* signature. 

    This is an afterthought that expresses that a signature is not always at the end.', 
) 
array (
    2 => 'is', 
    5 => 'This', 
    6 => 'is', 
    7 => 'the', 
    8 => '*author\'s*', 
    9 => 'signature.', 
) 
array (
    2 => 
    array (
    0 => 'is', 
), 
    5 => 
    array (
    0 => 'This', 
    1 => 'is', 
    2 => 'the', 
    3 => '*author\'s*', 
    4 => 'signature.', 
), 
) 
Assumed Signature: This is the 
*author's* signature.
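
For comparison with the word-based approach above, a character-level search reads the question's title more literally. The following is a minimal sketch and not part of the original answer. `longestCommonSubstring` is a hypothetical helper name. It works on bytes (`strlen`, `substr`, `strpos`), so multibyte text would need the `mb_*` equivalents, and it is brute force, so it may be slow on long posts.

```php
<?php
// Hypothetical helper: brute-force longest substring common to every text.
function longestCommonSubstring(array $texts): string
{
    if ($texts === []) {
        return '';
    }
    // Start from the shortest text: the result cannot be longer than it.
    usort($texts, function ($a, $b) { return strlen($a) <=> strlen($b); });
    $shortest = array_shift($texts);
    $max = strlen($shortest);

    // Try candidate lengths from longest to shortest.
    for ($len = $max; $len > 0; --$len) {
        for ($start = 0; $start + $len <= $max; ++$start) {
            $candidate = substr($shortest, $start, $len);
            foreach ($texts as $text) {
                if (strpos($text, $candidate) === false) {
                    continue 2; // missing from one text; try the next candidate
                }
            }
            return $candidate; // first hit at this length is a longest match
        }
    }
    return '';
}

echo trim(longestCommonSubstring(['foo @sig.example bar', 'x @sig.example', 'baz @sig.example qux'])), "\n";
```

Unlike the word-based version, this can return a match that begins or ends in the middle of a word or includes surrounding whitespace, hence the `trim()` on the result.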

Source

2017-11-07 02:24:15 mickmackusa

You can use preg_match() with a regular expression to achieve this.

$str = "KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf"; 

preg_match("/\@[^\s]+/", $str, $match); 

var_dump($match); //Will output the signature

Source

2017-10-13 11:21:07 WasteD

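@jhsad.sadas.com here is just an example. I don't know what that author's real signature is! All I have is some of that author's texts, and I know they contain a signature. – mrmrn

@chris85 Yes, I have changed it now! – WasteD

@mrmrn But does the signature always start with @? – WasteD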
