2016-08-29 35 views
1

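Haskell parser with spell checker

To learn more Haskell (Monads in particular), I am trying to build a spell checker. My goal is to go through a LaTeX document and act on the words that are not in a dictionary list.

I have already written the parser (String to AST), and I paste the code below. It basically returns the LaTeX source split into its relevant pieces (text, formulas, commands, etc.). I would like to know how to build a program that, for every word not found in the list, asks the user which word to replace it with.

(All we really care about in the LaTeX is that some parts of the source are text and must be spell-checked, while other parts are formulas and not plain English.)


Let me explain it more clearly with an example of the desired behavior (for simplicity, formulas are written between dollar signs: $ HERE IS THE FORMULA $).

Source: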

This is my frst file and here 
we have a formula: $\forall x \quad x$ 

Desired behavior:

In file 'first.tex' at line 1: 'frst' unknown 
1 This is my **frst** file and here 
2 we have a formula: $\forall x \quad x$ 
Action [Add word to dictionary/Change word]? 

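The main problem is that once I have parsed the file I am left with an AST and no longer have any reference to lines, so I cannot display anything like the example above.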


Parser code:

import System.Environment 
import Text.Parsec (ParseError) 
import Text.Parsec.String (Parser, parseFromFile) 
import Text.Parsec (try) 
import Text.Parsec.Char (oneOf, char, digit, string, letter, satisfy, noneOf, anyChar) 
import Text.Parsec.Combinator (many1, choice, chainl1, between, count, option, optionMaybe, optional, manyTill, eof, lookAhead) 
import Control.Applicative ((<$>), (<*>), (<*), (*>), (<|>), many, (<$)) 
import Control.Monad (void, ap, mzero) 
import Data.Char (isLetter, isDigit) 
import FunctionsAndTypesForParsing 

data TexFile = Items [TexTerm] 
       deriving (Eq, Show) 

data TexTerm = Comment String 
      | Formula String 
      | Command String [TexFile] 
      | Text String 
      | Block TexFile 
       deriving (Eq, Show) 

-- We get the AST as output                                   
texFile :: Parser TexFile 
texFile = Items <$> (many texTerm) <* (optional (try $ eof)) 

texTerm :: Parser TexTerm 
texTerm = lexeme $ (try comment <|> text <|> formula <|> command <|> block) 

whitespace :: Parser() 
whitespace = void $ try $ oneOf " \n\t" 

lexeme :: Parser a -> Parser a 
lexeme p = p <* (many $ whitespace) 

comment :: Parser TexTerm 
comment = Comment <$> between (string "%") (string "\n") (many $ noneOf "\n") 

formula :: Parser TexTerm 
formula = Formula <$> (try singledollar <|> doubledollar <|> equation <|> align) 
    where 
    singledollar = between (string "$") (string "$") (many1 $ noneOf "$") 
    doubledollar = between (string "$$") (string "$$") (many1 $ noneOf "$$") 
    equation = try $ between (try $ string "\\begin{equation}") (string "\\end{equation}") (manyTill anyChar (lookAhead $ try $ string "\\end{equation}")) 
    align = try $ between (try $ string "\\begin{align*}") (string "\\end{align*}") (manyTill anyChar (lookAhead $ try $ string "\\end{align*}")) 

command :: Parser TexTerm 
command = Command <$> com <*> (many arg) 
    where 
    com = char '\\' *> (manyTill (try letter <|> oneOf "*") (lookAhead $ try $ oneOf "[{ \\\n\t")) 
    arg = (try (between (string "{") (string "}") texFile) 
      <|> (between (string "[") (string "]") texFile) 
     ) 

text :: Parser TexTerm 
text = Text <$> many1 textualchars 
    where 
    textualchars = try letter <|> digit <|> oneOf " \n\t\r,.*:;-<>#@()`_!'?" 

block :: Parser TexTerm 
block = Block <$> between (string "{") (string "}") texFile 

Answers

2

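You can use Parsec's getPosition action to get the current position in the input stream. You can then store that position in your AST type, i.e. change it to something like: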

data TexFile = Items [(SourcePos, TexTerm)] 
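A minimal sketch of how this could look, assuming a reduced two-constructor TexTerm and a toy grammar rather than the parser from the question. The names `located`, `term`, `terms` and `termLines` are hypothetical; only `getPosition`, `sourceLine` and the other Parsec combinators are library functions.

```haskell
import Text.Parsec (SourcePos, char, getPosition, many, many1, noneOf, parse, sourceLine, (<|>))
import Text.Parsec.String (Parser)

-- Simplified stand-in for the question's TexTerm (assumption: two constructors only).
data TexTerm = Text String | Formula String
  deriving (Eq, Show)

-- Record the position at which a parser starts, then run the parser.
located :: Parser a -> Parser (SourcePos, a)
located p = (,) <$> getPosition <*> p

-- Toy grammar: $...$ formulas, and runs of any other characters as text.
term :: Parser TexTerm
term = formula <|> text
  where
    formula = Formula <$> (char '$' *> many1 (noneOf "$") <* char '$')
    text    = Text <$> many1 (noneOf "$")

terms :: Parser [(SourcePos, TexTerm)]
terms = many (located term)

-- Pair each term with the 1-based line on which it starts.
termLines :: String -> Either String [(Int, TexTerm)]
termLines src = case parse terms "first.tex" src of
  Left err -> Left (show err)
  Right ts -> Right [ (sourceLine pos, t) | (pos, t) <- ts ]
```

Note that a Text term spanning several lines only carries the line on which it starts, so the line of a word inside it still has to be derived by counting the newlines that precede the word.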

+0

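Accepted, because it fits better with what I have already done. When I have more time I will also look into megaparsec, as the other answer suggests. – trenta3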

1

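Your basic problem is that you are throwing away the information about whitespace in the file. If you record whitespace as just another TexTerm, you can a) reconstruct the file contents from the TexFile, and b) work out which line each TexTerm appears on.

So one approach is to add a WhiteSpace constructor to TexTerm: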

data TexTerm = Comment String 
      | ... 
      | WhiteSpace String 

Now, as you traverse your AST, you can determine which line each construct is on by counting the newline characters in each WhiteSpace constructor.
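The newline counting can be sketched as a left-to-right pass that threads the current line number through the terms. This is only an illustration under two assumptions: the terms are given as a flat list (the recursive Command and Block cases are left out), and newlines occur only inside WhiteSpace terms. The name `withLines` is hypothetical.

```haskell
import Data.List (mapAccumL)

-- Flat stand-in for the AST, including the extra WhiteSpace constructor.
data TexTerm
  = Comment String
  | Formula String
  | Text String
  | WhiteSpace String
  deriving (Eq, Show)

-- Pair each term with the 1-based line it starts on.
-- Assumption: only WhiteSpace terms contain newline characters.
withLines :: [TexTerm] -> [(Int, TexTerm)]
withLines = snd . mapAccumL step 1
  where
    -- The accumulator is the line on which the next term starts.
    step line t = (line + newlinesIn t, (line, t))
    newlinesIn (WhiteSpace s) = length (filter (== '\n') s)
    newlinesIn _              = 0
```

If formulas or comments may also contain newlines, `newlinesIn` would have to count them for those constructors as well.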

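However, since you use lexeme to skip whitespace, this complicates the parser. If all you need to do is spell-check a TeX document, I suggest a "tag soup" approach with a simpler data structure: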

type TexFile = [TexTerm] 

data TexTerm = Comment String 
      | Formula String 
      | Command String  -- e.g. \someCommand 
      | Text String 
      | Sym String   -- e.g. Sym "{" or Sym "}" 
      | WhiteSpace String -- e.g. WhiteSpace "\n" 

Note that TexFile and TexTerm are flat (non-recursive) data structures. We are just tokenizing the TeX input.
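A rough sketch of such a tokenizer, followed by a dictionary lookup over the Text tokens. It makes several assumptions that are not in the answer: formulas are only of the $...$ form, a Text token is a single whitespace-delimited word, and the dictionary is a Data.Set of lower-case words. The names `texToken`, `tokenize` and `unknownWords` are hypothetical.

```haskell
import Data.Char (isAlpha, isSpace, toLower)
import qualified Data.Set as Set
import Text.Parsec (anyChar, char, eof, many, many1, noneOf, oneOf, parse, satisfy, (<|>))
import Text.Parsec.String (Parser)

type TexFile = [TexTerm]

data TexTerm
  = Comment String
  | Formula String
  | Command String
  | Text String
  | Sym String
  | WhiteSpace String
  deriving (Eq, Show)

-- One token. The alternatives start with distinct characters, so no 'try' is needed.
texToken :: Parser TexTerm
texToken =
      WhiteSpace <$> many1 (satisfy isSpace)
  <|> Comment <$> (char '%' *> many (noneOf "\n"))
  <|> Formula <$> (char '$' *> many (noneOf "$") <* char '$')
  -- A command is a backslash followed by letters, or by one other character (e.g. \\ or \%).
  <|> Command <$> (char '\\' *> (many1 (satisfy isAlpha) <|> ((: []) <$> anyChar)))
  <|> Sym . (: []) <$> oneOf "{}[]"
  <|> Text <$> many1 (satisfy isWordChar)
  where
    isWordChar c = not (isSpace c) && c `notElem` "%$\\{}[]"

-- Tokenize a whole document into a flat list of terms.
tokenize :: String -> Either String TexFile
tokenize src = case parse (many texToken <* eof) "" src of
  Left err -> Left (show err)
  Right ts -> Right ts

-- Words from Text tokens (letters only, lower-cased) that are missing from the dictionary.
unknownWords :: Set.Set String -> TexFile -> [String]
unknownWords dict ts =
  [ w
  | Text raw <- ts
  , let w = map toLower (filter isAlpha raw)
  , not (null w)
  , w `Set.notMember` dict
  ]
```

Because whitespace is kept as tokens, concatenating the tokens' source text reproduces the input, and line numbers can be recovered by counting newlines while walking the list.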