A simple starting point:
<?php
// source text
$paragraph = "Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Proin congue, quam nec tincidunt congue, massa ipsum sodales tellus,
in rhoncus sem quam quis ante. Nam condimentum pellentesque libero at
blandit. Suspendisse felis sem, interdum pulvinar ultricies a, auctor
vel leo. Curabitur congue mi nec purus placerat sit amet mollis magna
laoreet. Duis eu purus non turpis lacinia sagittis. Aliquam tristique
nulla volutpat neque posuere faucibus. Aenean tempus diam quis sem
convallis id cursus lorem sagittis. Nam feugiat, felis nec tincidunt
aliquet, felis lectus bibendum mi, ut tincidunt purus urna ac felis.
Quisque ut lectus dolor. Duis ipsum arcu, adipiscing id vestibulum
fringilla, euismod non augue. Nullam quis ipsum nec tortor tristique
egestas sed nec leo. Pellentesque tempus velit lacus, sit amet rhoncus
mi. Curabitur justo ipsum, consectetur ac vestibulum sed, porttitor
eget dui. Vivamus nisi lorem, porta vel gravida quis, varius et elit.
Nulla eros metus, congue sit amet interdum at, porta eget ligula.";
// replace newlines with spaces so words on adjacent lines are not glued together
$paragraph = str_replace(array("\r", "\n"), ' ', $paragraph);
// convert to lowercase
$paragraph = strtolower($paragraph);
// remove non-alphanumeric characters
$paragraph = preg_replace('/[^A-Za-z0-9\s]/', '', $paragraph);
// convert into array
$words = explode(' ', $paragraph);
// remove empty strings
$words = array_filter($words, 'strlen');
// remove duplicate values
$words = array_unique($words);
// sort array alphabetically (optional)
natsort($words);
// reindex array
$words = array_values($words);
// display array
print_r($words);
?>
Update: newlines are now stripped, and each transformation is split into its own statement.
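If you would rather have the steps in one reusable place, here is a minimal sketch of the same pipeline as a function. The name `unique_words` is hypothetical and not part of the code above, and the type hints assume PHP 7 or later. It swaps `explode()` for `preg_split()` on runs of whitespace, so newlines and repeated spaces need no separate handling, and uses `sort()` with `SORT_NATURAL`, which reindexes the array as it sorts.

```php
<?php
// Hypothetical helper (not from the original answer): returns the
// lowercased, de-duplicated, naturally sorted words of a text.
function unique_words(string $text): array
{
    // lowercase, then drop everything except ASCII letters, digits and whitespace
    $text = preg_replace('/[^a-z0-9\s]/', '', strtolower($text));
    // split on any run of whitespace (spaces, tabs, newlines), skipping empty pieces
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    // remove duplicate values
    $words = array_unique($words);
    // natural-order sort; unlike natsort(), sort() also reindexes from 0
    sort($words, SORT_NATURAL);
    return $words;
}

// usage
print_r(unique_words("Lorem ipsum dolor.\nLorem IPSUM, sit amet!"));
```

Like the original, this only keeps ASCII letters and digits, so accented or non-Latin words would need a different character class.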
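What is the specific problem? Please don't tell us you need to know how to read a file and split its text into strings with a simple split operation. Otherwise this question deserves to be flagged as low quality. – 2011-03-31 18:02:19
Maybe you should install a search engine, such as [ElasticSearch](http://www.elasticsearch.org/). Unless you really *want* to reinvent it? – bart 2011-04-01 12:33:32
Thanks for the ideas. I'll work from these. I wonder whether, in the long run, performance issues and more complex parsing/highlighting will mean I need some kind of Java- or Python-based backend system, such as Apache Solr. – markwk 2011-04-01 15:37:49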