2009-07-02 15 views
7

Flesch-Kincaid readability: improving a PHP function

I wrote this PHP code to implement the Flesch-Kincaid readability score as a function:

function readability($text) { 
    $total_sentences = 1; // one full stop = two sentences => start with 1 
    $punctuation_marks = array('.', '?', '!', ':'); 
    foreach ($punctuation_marks as $punctuation_mark) { 
     $total_sentences += substr_count($text, $punctuation_mark); 
    } 
    $total_words = str_word_count($text); 
    $total_syllables = 3; // placeholder value, since I don't know how to count them 
    $score = 206.835-(1.015*$total_words/$total_sentences)-(84.6*$total_syllables/$total_words); 
    return $score; 
} 

Do you have any suggestions for improving the code? Is it correct? Will it work?

I hope you can help me. Thanks in advance!

Answers

17

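As far as a heuristic goes, the code looks fine. Here are some points to consider that make the quantities you need to calculate considerably difficult for a machine: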

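  1. What is a sentence?

    Seriously, what is a sentence? We have periods, but they are also used in Ph.D., e.g., Y.M.C.A., and for other purposes that do not end a sentence. Once you consider exclamation points, question marks, and ellipses, you are doing yourself a disservice by assuming a period will do the trick. I have looked at this problem before, and if you really want a more reliable sentence count in real text, you need to parse the text. That can be computationally intensive and time-consuming, and free resources for it are hard to find. In the end, you still have to worry about the error rate of the particular parser implementation. However, only full parsing will tell you what is a sentence and what is just one of a period's many other uses. Furthermore, if you are working with text "in the wild", such as HTML, you also have to worry about sentences that end not with punctuation but with a closing tag. For instance, many sites do not add punctuation to h1 and h2 tags, yet those are clearly separate sentences or phrases.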

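  2. Syllables aren't something we should be approximating

    This is a major hallmark of this readability heuristic, and it is the part that makes it hardest to implement. Computational analysis of the syllable count in a work requires the assumption that the assumed reader speaks the same dialect as whatever your syllable count generator was trained on. How sounds fall around a syllable is actually a major part of what makes accents accents. If you don't believe me, try visiting Jamaica sometime. This means that even if a human did the calculation by hand, it would still be a dialect-specific score.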

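  3. What is a word?

    Not to wax psycholinguistic in the slightest, but you will find that space-separated words and what a speaker conceptualizes as words are quite different. This makes the concept of a computable readability score somewhat questionable.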
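
To make the first point concrete, here is a minimal sketch of a sentence counter that is slightly less naive than counting every period. It is not a parser and not part of the original answer: the function name `countSentencesHeuristic` and the abbreviation list are assumptions made for illustration, and a sentence that actually ends in a listed abbreviation will still be miscounted.

```php
<?php
// Hypothetical helper (an assumption, not from the original post): counts
// sentences while ignoring periods inside a small hand-picked abbreviation list.
function countSentencesHeuristic(
    string $text,
    array $abbreviations = ['Ph.D.', 'e.g.', 'i.e.', 'Y.M.C.A.', 'Dr.', 'Mr.', 'Mrs.']
): int {
    // Remove the periods from known abbreviations so they are not seen as sentence ends.
    foreach ($abbreviations as $abbr) {
        $text = str_ireplace($abbr, str_replace('.', '', $abbr), $text);
    }

    // A run of terminators ("...", "?!", "!!") counts as one boundary;
    // PREG_SPLIT_NO_EMPTY drops the empty chunk after the final terminator.
    $parts = preg_split('/[.?!]+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    // Keep only chunks that contain at least one letter or digit.
    $sentences = array_filter($parts, function ($part) {
        return preg_match('/[A-Za-z0-9]/', $part) === 1;
    });

    return count($sentences);
}

// Counting '.', '?', '!' individually and adding 1 would report 11 here.
echo countSentencesHeuristic("Dr. Smith has a Ph.D. in linguistics. Is that so?! Yes... It is."), "\n"; // prints 4
```

Even this version treats an ellipsis in the middle of a sentence as a boundary and knows nothing about HTML tags, which is exactly the kind of error rate the answer warns about.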

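So in the end, I can answer your question of "will it work". If you want to take a piece of text and display this readability score among other metrics to offer some kind of conceivable added value, the discerning user will not bring up all of these questions. If you are trying to do something scientific, or even something pedagogical (which is what this score and those like it were ultimately intended for), I wouldn't really bother. In fact, if you plan to use it to make any kind of suggestion to users about content they have produced, I would be extremely hesitant.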

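A better way to measure the reading difficulty of a text would more likely have to do with the ratio of low-frequency words to high-frequency words, along with the number of hapax legomena in the text. But I wouldn't actually pursue a heuristic like this, because it would be very difficult to test anything like it empirically.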
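
As a rough illustration of the two quantities mentioned above, the following sketch counts hapax legomena (word forms that occur exactly once in the text) and computes a ratio of uncommon to common tokens. The function name `lexicalStats` and the `$commonWords` parameter are hypothetical; the answer does not say which frequency list to use, so the tiny list in the usage line is only a stand-in for a real corpus-derived one.

```php
<?php
// Hypothetical sketch (an assumption, not from the original answer).
// $commonWords stands in for a real high-frequency word list.
function lexicalStats(string $text, array $commonWords): array
{
    // Format 1 makes str_word_count() return the words themselves as an array.
    $words = str_word_count(strtolower($text), 1);
    if (count($words) === 0) {
        return ['hapax' => 0, 'rare_to_common' => null];
    }

    // Hapax legomena: word forms whose count in this text is exactly 1.
    $counts = array_count_values($words);
    $hapax = count(array_filter($counts, function ($n) {
        return $n === 1;
    }));

    // Count tokens that appear in the common-word list; the rest are treated as rare.
    $common = array_flip($commonWords);
    $commonTokens = 0;
    foreach ($words as $word) {
        if (isset($common[$word])) {
            $commonTokens++;
        }
    }
    $rareTokens = count($words) - $commonTokens;

    return [
        'hapax' => $hapax,
        // null when the ratio is undefined (no common tokens at all)
        'rare_to_common' => $commonTokens > 0 ? $rareTokens / $commonTokens : null,
    ];
}

print_r(lexicalStats("The cat sat on the mat. The dog sat too.", ['the', 'on', 'too']));
```

How to turn these numbers into a difficulty score is left open here, which matches the answer's caveat that such a heuristic would be hard to validate empirically.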

+0

Thank you very much for this detailed answer. Now I understand that using this formula makes no sense if I need exact results. – caw 2009-07-03 09:52:34

0

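I actually don't see any problem with that code. Of course, it could be optimized if you really wanted to, by replacing all the different function calls with a single counting loop. However, I would strongly argue that this is unnecessary and perhaps even plain wrong. Your current code is very readable and easy to understand, and from that perspective any optimization would probably make things worse. Use it as it is, and don't try to optimize it unless it really becomes a performance bottleneck.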

6

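Please take a look at the following two classes and the usage information below. They should help you.

Readability syllable count pattern library class: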

<?php class ReadabilitySyllableCheckPattern { 

public $probWords = [ 
    'abalone' => 4, 
    'abare' => 3, 
    'abed' => 2, 
    'abruzzese' => 4, 
    'abbruzzese' => 4, 
    'aborigine' => 5, 
    'acreage' => 3, 
    'adame' => 3, 
    'adieu' => 2, 
    'adobe' => 3, 
    'anemone' => 4, 
    'apache' => 3, 
    'aphrodite' => 4, 
    'apostrophe' => 4, 
    'ariadne' => 4, 
    'cafe' => 2, 
    'calliope' => 4, 
    'catastrophe' => 4, 
    'chile' => 2, 
    'chloe' => 2, 
    'circe' => 2, 
    'coyote' => 3, 
    'epitome' => 4, 
    'forever' => 3, 
    'gethsemane' => 4, 
    'guacamole' => 4, 
    'hyperbole' => 4, 
    'jesse' => 2, 
    'jukebox' => 2, 
    'karate' => 3, 
    'machete' => 3, 
    'maybe' => 2, 
    'people' => 2, 
    'recipe' => 3, 
    'sesame' => 3, 
    'shoreline' => 2, 
    'simile' => 3, 
    'syncope' => 3, 
    'tamale' => 3, 
    'yosemite' => 4, 
    'daphne' => 2, 
    'eurydice' => 4, 
    'euterpe' => 3, 
    'hermione' => 4, 
    'penelope' => 4, 
    'persephone' => 4, 
    'phoebe' => 2, 
    'zoe' => 2 
]; 

public $addSyllablePatterns = [ 
    "([^s]|^)ia", 
    "iu", 
    "io", 
    "eo($|[b-df-hj-np-tv-z])", 
    "ii", 
    "[ou]a$", 
    "[aeiouym]bl$", 
    "[aeiou]{3}", 
    "[aeiou]y[aeiou]", 
    "^mc", 
    "ism$", 
    "asm$", 
    "thm$", 
    // Single quotes: in a double-quoted PHP string "\1" is an octal escape, not a backreference 
    '([^aeiouy])\1l$', 
    "[^l]lien", 
    "^coa[dglx].", 
    "[^gq]ua[^auieo]", 
    "dnt$", 
    "uity$", 
    "[^aeiouy]ie(r|st|t)$", 
    "eings?$", 
    "[aeiouy]sh?e[rsd]$", 
    "iell", 
    "dea$", 
    "real", 
    "[^aeiou]y[ae]", 
    "gean$", 
    "riet", 
    "dien", 
    "uen" 
]; 

public $prefixSuffixPatterns = [ 
    "^un", 
    "^fore", 
    "^ware", 
    "^none?", 
    "^out", 
    "^post", 
    "^sub", 
    "^pre", 
    "^pro", 
    "^dis", 
    "^side", 
    "ly$", 
    "less$", 
    "some$", 
    "ful$", 
    "ers?$", 
    "ness$", 
    "cians?$", 
    "ments?$", 
    "ettes?$", 
    "villes?$", 
    "ships?$", 
    "sides?$", 
    "ports?$", 
    "shires?$", 
    "tion(ed)?$" 
]; 

public $subSyllablePatterns = [ 
    "cia(l|$)", 
    "tia", 
    "cius", 
    "cious", 
    "[^aeiou]giu", 
    "[aeiouy][^aeiouy]ion", 
    "iou", 
    "sia$", 
    "eous$", 
    "[oa]gue$", 
    ".[^aeiuoycgltdb]{2,}ed$", 
    ".ely$", 
    "^jua", 
    "uai", 
    "eau", 
    "[aeiouy](b|c|ch|d|dg|f|g|gh|gn|k|l|ll|lv|m|mm|n|nc|ng|nn|p|r|rc|rn|rs|rv|s|sc|sk|sl|squ|ss|st|t|th|v|y|z)e$", 
    "[aeiouy](b|c|ch|dg|f|g|gh|gn|k|l|lch|ll|lv|m|mm|n|nc|ng|nch|nn|p|r|rc|rn|rs|rv|s|sc|sk|sl|squ|ss|th|v|y|z)ed$", 
    "[aeiouy](b|ch|d|f|gh|gn|k|l|lch|ll|lv|m|mm|n|nch|nn|p|r|rn|rs|rv|s|sc|sk|sl|squ|ss|st|t|th|v|y)es$", 
    "^busi$" 
]; } ?> 

The other is the readability algorithm class, which has two methods for calculating the score:

<?php class ReadabilityAlgorithm { 
function countSyllable($strWord) { 
    $pattern = new ReadabilitySyllableCheckPattern(); 
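    // Normalise: the lookup table and the patterns assume lower-case letters only 
    $strWord = preg_replace('`[^a-z]`', '', strtolower(trim($strWord))); 
    if ($strWord === '') { 
     return 0; // nothing left to count (e.g. a stray punctuation token) 
    } 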

    // Check for problem words 
    if (isset($pattern->{'probWords'}[$strWord])) { 
     return $pattern->{'probWords'}[$strWord]; 
    } 

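    // Strip known prefixes and suffixes; each one removed counts as a syllable. 
    // The patterns are regular expressions, so they need delimiters and preg_replace() 
    // (str_replace() would treat "^un" or "ly$" as literal text and never match). 
    $arrAffixes = array_map(function ($p) { return '`' . $p . '`'; }, $pattern->{'prefixSuffixPatterns'}); 
    $strWord = preg_replace($arrAffixes, '', $strWord, -1, $tmpPrefixSuffixCount); 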

    // Split on runs of non-vowels; each remaining vowel group approximates one syllable 
    $arrWordParts = preg_split('`[^aeiouy]+`', $strWord); 
    $wordPartCount = 0; 
    foreach ($arrWordParts as $strWordPart) { 
     if ($strWordPart <> '') { 
      $wordPartCount++; 
     } 
    } 
    $intSyllableCount = $wordPartCount + $tmpPrefixSuffixCount; 

    // Check syllable patterns 
    foreach ($pattern->{'subSyllablePatterns'} as $strSyllable) { 
     $intSyllableCount -= preg_match('`' . $strSyllable . '`', $strWord); 
    } 

    foreach ($pattern->{'addSyllablePatterns'} as $strSyllable) { 
     $intSyllableCount += preg_match('`' . $strSyllable . '`', $strWord); 
    } 

    // Every word has at least one syllable; the pattern adjustments can push the count to zero or below 
    $intSyllableCount = max(1, $intSyllableCount); 
    return $intSyllableCount; 
} 

function calculateReadabilityScore($stringText) { 
    # Calculate score 
    $totalSentences = 1; 
    $punctuationMarks = array('.', '?', '!', ':', ';'); 

    foreach ($punctuationMarks as $punctuationMark) { 
     $totalSentences += substr_count($stringText, $punctuationMark); 
    } 

    // get ASL value 
    $totalWords = str_word_count($stringText); 
    $ASL = $totalWords/$totalSentences; 

    // find syllables value 
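    $syllableCount = 0; 
    // Use the same tokenisation as str_word_count() above so that the syllable 
    // total and $totalWords refer to the same set of words 
    $arrWords = str_word_count($stringText, 1); 
    $intWordCount = count($arrWords); 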

    for ($i = 0; $i < $intWordCount; $i++) { 
     $syllableCount += $this->countSyllable($arrWords[$i]); 
    } 

    // get ASW value 
    $ASW = $syllableCount/$totalWords; 

    // Count the readability score 
    $score = 206.835 - (1.015 * $ASL) - (84.6 * $ASW); 
    return $score; 
} } ?> 

Example: how to use it

<?php // Create object to count readability score 
$readObj = new ReadabilityAlgorithm(); 
echo $readObj->calculateReadabilityScore("Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into: electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently; with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum!"); 
?>