Can n-grams be generated in bash?

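I have found various implementations of n-grams in Python, Perl, and so on, but I would really like something in a bash script. I came across the "Missing textutils" version, but it only lists the n-grams; it does not count them by frequency, which is rather essential to using n-grams, or at least to my use of them. I just want a basic list of the results with their frequencies, like this: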

17 blue car 
14 red car 
5 and the 
2 brown monkey 
1 orange car

Does anyone have something like this lying around that they could post? Thanks!

2013-01-19 user1889034

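Could you elaborate on what you mean by 'ngram'? A more complete example would be better than sample output alone. –

Sure. An "ngram" is any sequence of adjacent words in a corpus (a text, usually a plain-text file). A bigram is two words ("blue car"), a trigram is three words ("the blue car"), and so on. The "n" simply means that the number of words is arbitrary, although in practice you rarely see more than five. Usually, the value of identifying n-grams lies in measuring their frequency in a text. – user1889034

For details, see http://en.wikipedia.org/wiki/N-gram. A good example of a tool that does this is antconc, which I am currently using, but I would love to simply call a script. Here is the existing script I mentioned: http://www1.cuni.cz/~obo/textutils/ngrams – user1889034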

Here is a pure bash implementation. You need bash >= 4.2, with associative array support.

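#!/usr/bin/env bash
# Usage: ngram N < FILE
# Counts word n-grams within each input line.

((n=${1:-0})) || exit 1    # exit if N is missing or zero

declare -A ngrams

while read -ra line; do
    # Stop once fewer than n words remain, so no short n-grams are counted.
    for ((i = 0; i + n <= ${#line[@]}; i++)); do
        key="${line[*]:i:n}"                         # n words joined by spaces
        ngrams[$key]=$(( ${ngrams[$key]:-0} + 1 ))
    done
done

for key in "${!ngrams[@]}"; do
    printf '%d\t%s\n' "${ngrams[$key]}" "$key"
done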

Save it as ngram and run it as ngram 2 < file.
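The question asks for counts ordered by frequency, whereas a bash associative array is iterated in no particular order. The following is a minimal sketch, assuming the same counting approach wrapped in a hypothetical function named `ngram_counts` (the name and the sample text are illustrative and not part of the original answer), with the output piped through `sort -rn` so the most frequent n-grams come first:

```shell
#!/usr/bin/env bash
# Sketch only: ngram_counts is a hypothetical name, not from the original answer.
# Requires a bash with associative arrays (bash 4 or later).

ngram_counts() {
    local n=$1 i key
    local -a line
    local -A counts
    while read -ra line; do
        # Count only full n-grams within each line.
        for ((i = 0; i + n <= ${#line[@]}; i++)); do
            key="${line[*]:i:n}"
            counts[$key]=$(( ${counts[$key]:-0} + 1 ))
        done
    done
    # Print "count<TAB>ngram", highest count first.
    for key in "${!counts[@]}"; do
        printf '%d\t%s\n' "${counts[$key]}" "$key"
    done | sort -rn
}

# Made-up sample input; "blue car" occurs three times and is listed first.
ngram_counts 2 <<'EOF'
blue car and red car
blue car and the blue car
EOF
```

If the code is saved as a file instead, the file typically needs `chmod +x` before it can be run as `./ngram 2 < file`; running it as `bash ngram 2 < file` works without the executable bit.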

2013-01-19 16:42:11

This works great. Thanks! – user1889034

Thanks! Does 'ngram' need a 'chmod'? –

Does it work with Unicode? –

Yes, n-grams can be implemented in bash.

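# Usage: ngrams N < FILE
ngrams() {
    local N=$1
    local -a words
    set --                                # clear the positional parameters
    while read -ra words; do
        set -- "$@" "${words[@]}"         # append this line's words
        # ${*:$N} expands to parameters N onward, so it is
        # non-empty only while at least N words remain.
        while [[ -n ${*:$N} ]]; do
            printf '%s\n' "${*:1:$N}"     # print the first N words
            shift                         # slide the window by one word
        done
    done |
    sort | uniq -c
}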

$ ngrams 2 
Here is some text, and here is 
some more text, and here is yet 
some more text 
    1 Here is 
    2 and here 
    2 here is 
    2 is some 
    1 is yet 
    1 more text 
    1 more text, 
    2 some more 
    1 some text, 
    2 text, and 
    1 yet some

Note: the above is a function, not a script (perhaps this question helps, or perhaps there is another one that is better). Here is the script version:

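#!/bin/bash
# Usage: ngrams N < FILE
N=$1
set --                                # clear the positional parameters
while read -ra words; do
    set -- "$@" "${words[@]}"         # append this line's words
    # ${*:$N} is non-empty only while at least N words remain.
    while [[ -n ${*:$N} ]]; do
        printf '%s\n' "${*:1:$N}"     # print the first N words
        shift                         # slide the window by one word
    done
done |
sort | uniq -c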
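Both versions rely on bash's offset/length expansion applied to the positional parameters: `${*:$N}` expands to parameters N through the last, and `${*:1:$N}` expands to the first N. A small sketch with made-up words shows why the inner loop's test stays non-empty only while at least N words remain:

```shell
#!/usr/bin/env bash
# Illustration only; the words are made-up sample data.
set -- blue car and red

N=2
echo "${*:$N}"      # parameters 2 through the end: car and red
echo "${*:1:$N}"    # the first 2 parameters: blue car

shift 3             # drop three words; only "red" remains
if [[ -z ${*:$N} ]]; then
    echo "fewer than $N words left"
fi
```

Because the words are kept in the positional parameters across input lines, n-grams in this approach can span a line break, which matches the sample output above.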

2013-01-19 05:50:03 rici

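Thanks a lot for posting this, but I can't seem to get it to work. The example you give does not match the usage note on the first line. I have tried it both ways and nothing happens. – user1889034

@user1889034: You probably put it in a file and tried to execute the file. That does nothing at all; it is a shell function, so it has to be called from the shell. I have added a script version to the answer. The usage comment is correct; it reads from 'stdin'. If you don't use ' – rici

Please explain: ${*:$N} is hard to search for! ty – slashdottir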
