\documentclass[a4paper]{article}
\usepackage[english]{babel}
\usepackage[utf8]{inputenc}
\usepackage{amsmath}
\usepackage{hyperref}
\usepackage{algorithm}
\usepackage{xcolor}
\usepackage{listings}
\usepackage{courier}
\usepackage{etoolbox}
\patchcmd{\thebibliography}{\section*}{\section}{}{}
\definecolor{dkgreen}{rgb}{0,0.55,0}
\definecolor{gray}{rgb}{0.5,0.5,0.5}
\definecolor{mauve}{rgb}{0.58,0,0.82}
\lstset{language=Python,
  linewidth=80em,
  breaklines=true,
  numbers=left,
  basicstyle=\ttfamily\footnotesize,
  showstringspaces=false,
  numberstyle=\tiny\color{gray},
  keywordstyle=\color{blue},
  commentstyle=\color{dkgreen},
  stringstyle=\color{mauve},
}

\title{Spell checker}
\author{2ID90 Artificial Intelligence \\ Team 08 \\
Wilco Brouwer (0816718) \\
Peter Wu (0783206)}
\date{\today}

\begin{document}
\maketitle

\section{Introduction}
In this report we describe our solution to the second assignment, which
consists of designing a spell-checker algorithm and implementing it in Java.
The algorithm is given a sentence containing zero to two misspelled words,
each with a Damerau-Levenshtein edit distance of 1 to its correction. In
addition, misspelled words are never consecutive. The expected output is a
sentence in which these errors have been corrected.

\section{Approach}
To remove spelling errors from a text, one has to generate possible
corrections and evaluate them. Spelling suggestions are made on a
word-by-word basis. A method to find \textit{candidate words} that can
substitute a possibly misspelled word is described in
section~\ref{section:wordsgen}.

The sentences resulting from the substitution of a single word are evaluated
as a whole. This evaluation is based on Kernighan's Noisy Channel Model
\cite{kernighan} and is further described in section~\ref{section:wordeval}.
The result of such an evaluation is the probability that a sentence is
correct. Note that this number can only be used to compare similar sentences
with the same number of words and only one differing word. Comparing
sentences with different numbers of words would most likely favor the
shorter one, as it has fewer words affecting the total probability (this
situation does not occur here, since words can be modified but not added or
removed). Comparing completely different sentences can favor a sentence with
errors (but popular n-grams) over one that uses less common words. This
typically does not occur either, as the task assumes that words next to an
error cannot be modified and that at most two errors exist.

If a better sentence is found after evaluating all possible substitutions
for a word, the algorithm recurses on the new sentence. Under the
assumptions of the task, at most two recursion levels are needed, since
there are at most two errors. Initially we tried to continue greedily with
the best word only, but this approach can miss corrections because n-grams
may increase the probability of neighboring words. Finally, the sentence
with the best overall rating is returned. This sentence may be unmodified
if no substitution was ranked better than the original sentence.

\subsection{Overview}
The indices of words that are surely correct are remembered
(\texttt{modifiedIndices} in the listing below). Once a rating is found that
is higher than the initial one, the word is replaced. When a word has been
modified, the hunt for more faulty words continues until all of them (at
most \texttt{MAX\_TYPOS}, which is 2 by default) have been found. Otherwise,
it is assumed that the sentence contains no more spelling errors.
Pseudo-code:

\begin{lstlisting}[language=python]
# Attempt to find a better sentence by modifying words.
def findBetterWord(words, modifiedIndices):
    # Rate the initial sentence to compare it with alternatives.
    bestRating = evaluateSentence(words)
    bestSentence = words
    # For each word in the sentence, try alternatives for a better rating.
    for wordIndex, word in enumerate(words):
        # Skip words if these were modified before or if they are
        # consecutive to a previously modified word (the canModifyWord
        # definition is omitted for brevity).
        if not canModifyWord(modifiedIndices, wordIndex):
            continue
        # Generate a list of candidate words (described later).
        candidates = getCandidateWords(word)
        # Iterate through the candidates and remember the modification
        # that gave the best sentence. Initially assume that no candidate
        # results in a better sentence.
        bestCandidate = None
        for candidate in candidates:
            # Rate the sentence with the candidate word substituted. If
            # it is better than the current best, remember it.
            newWords = words[:wordIndex] + [candidate] + words[wordIndex+1:]
            rating = evaluateSentence(newWords, wordIndex)
            if rating > bestRating:
                bestCandidate = candidate
                bestRating = rating
        # If a better word is found, use it and try to find more errors.
        if bestCandidate is not None:
            words[wordIndex] = bestCandidate
            bestSentence = list(words)
            # Freeze this word (and its context), following the
            # assumption that no consecutive words contain an error.
            modifiedIndices.append(wordIndex)
            # If more typos are allowed, try to find them.
            if len(modifiedIndices) < MAX_TYPOS:
                # Recurse with the modified sentence to find better words.
                newSentence = findBetterWord(words, modifiedIndices)
                if evaluateSentence(newSentence) > bestRating:
                    bestSentence = newSentence
            # Make the word writable again for further iterations.
            modifiedIndices.pop()
            words[wordIndex] = word
        # else: no more suggestions; do not recurse and try the next word.
    # Finish with the best sentence (possibly unmodified).
    return bestSentence

# Split the input on spaces and find the best sentence.
words = input.split(' ')
bestSentence = findBetterWord(words, [])
\end{lstlisting}
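The \texttt{canModifyWord} check is omitted above for brevity. A minimal
sketch under the stated assumption that corrected words and their direct
neighbors become read-only (the actual Java implementation may differ):

\begin{lstlisting}[language=python]
def canModifyWord(modifiedIndices, wordIndex):
    # A word may only be modified if it was not corrected before and is
    # not directly adjacent to a previously corrected word.
    return all(abs(wordIndex - i) > 1 for i in modifiedIndices)
\end{lstlisting}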
\subsection{Generation of candidate words}
\label{section:wordsgen}
A naive approach to generating candidate words in the
\texttt{getCandidateWords()} function is to generate all possible
combinations of the alphabet and then take the intersection with the
dictionary. This does not scale: it costs $O(27^n)$ time and memory.
Instead we opted for a smarter approach. Since we know that the misspelled
words have a Damerau-Levenshtein distance of 1 to their corrections,
candidates are generated by cycling through the letters of the word and
applying an insertion, deletion, substitution or transposition. We also
know that all corrected words are in the vocabulary, so other words can be
dropped from the candidate list.

While performing these operations, the noisy channel probability can also
be obtained, since we have a confusion matrix and know which kind of
alteration was executed. Pseudo-code (the actual Java implementation has
decomposed methods for each of the operations):

\begin{lstlisting}[language=python]
def getWordMutations(word):
    # Walk through the letters of a word and mutate those. Each yielded
    # pair contains a mutation descriptor (used for the confusion matrix
    # lookup) and the mutated word.
    for i in range(len(word) + 1):
        # Insertion of a new character at position i.
        for c in ALPHABET:
            yield ('ins', word[i-1:i], c), word[:i] + c + word[i:]
        if i == len(word):
            continue
        # Deletion of the character at position i.
        yield ('del', word[i-1:i], word[i]), word[:i] + word[i+1:]
        # Substitution of the character at position i.
        for c in ALPHABET:
            yield ('sub', c, word[i]), word[:i] + c + word[i+1:]
        # Transposition of the characters at positions i and i + 1.
        if i + 1 < len(word):
            yield ('trans', word[i], word[i+1]), \
                  word[:i] + word[i+1] + word[i] + word[i+2:]

# Map from candidate word to channel probability. It is also read by
# getWordLikelihood during the sentence evaluation.
channelProbs = {}

def getCandidateWords(word):
    # Initially there are no candidate words.
    candidateWords = []
    channelProbs.clear()
    # Obtain all possible word mutations and pick the valid ones.
    for mutation, candidateWord in getWordMutations(word):
        # Skip words not in the vocabulary and unmodified words.
        if candidateWord not in vocabulary or candidateWord == word:
            continue
        # The word is in the vocabulary, keep it.
        candidateWords.append(candidateWord)
        # Find the channel probability. As multiple mutations can result
        # in the same word, remember the mutation that gives the highest
        # rating. Do not sum those ratings, as that would introduce a
        # bias towards very short words.
        channelProbs[candidateWord] = max(
            channelProbs.get(candidateWord, 0),
            calculateChannelProbability(mutation)
        )
    return candidateWords
\end{lstlisting}
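For reference, the channel probabilities that
\texttt{calculateChannelProbability} looks up follow Kernighan et
al.~\cite{kernighan}. In the notation of that paper, with $x$ the typed
word, $w$ the candidate, $p$ the position of the edit, \textit{del},
\textit{ins}, \textit{sub} and \textit{trans} the confusion matrices, and
\textit{count} the character (pair) frequencies in the corpus:
\begin{equation*}
P(x \mid w) =
\begin{cases}
\mathit{del}[w_{p-1}, w_p] \, / \, \mathit{count}[w_{p-1} w_p] & \text{deletion} \\
\mathit{ins}[w_{p-1}, x_p] \, / \, \mathit{count}[w_{p-1}] & \text{insertion} \\
\mathit{sub}[x_p, w_p] \, / \, \mathit{count}[w_p] & \text{substitution} \\
\mathit{trans}[w_p, w_{p+1}] \, / \, \mathit{count}[w_p w_{p+1}] & \text{transposition}
\end{cases}
\end{equation*}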
\subsection{Evaluation of words}
\label{section:wordeval}
The most important part of the algorithm is rating a word. For this, we
combine the channel model probability with the language model probability.
The channel model probability $P(x|w)$ is obtained via
\texttt{getCandidateWords()}, while the language model probability $P(w)$
depends on the frequency of prior n-grams in a corpus. The combination
gives the word score $P(x|w) \cdot P(w)$.

Initially we used interpolation with lambdas conditional on context
\cite{interpolation} to combine the n-gram probabilities, but this gave bad
results because it put too much weight on high unigram probabilities.
Instead, the language model probability is now obtained by combining the
n-gram probabilities through simple multiplication. The weight of bigrams
is not adjusted separately: since there are fewer bigrams in the corpus,
they naturally rank higher than unigrams. Note that the algorithm can also
be scaled to use trigrams, 4-grams, and so on, simply by modifying the
\texttt{NGRAM\_N} parameter.

Due to the use of n-grams, the score of the sentence as a whole must be
considered: the $n - 1$ words before and after a word can form new n-grams
with different probabilities. Finally, all word scores are combined by
simple multiplication. For this reason, words that are certainly incorrect
(not in the vocabulary) should not be rated 0, but a very small number
(such as $10^{-99}$). This ensures that sentences in which a second error
remains still have a chance to be recognized as a better result.
Pseudo-code to summarize the evaluation:

\begin{lstlisting}[language=python]
def evaluateSentence(words, correctedWordIndex=None):
    p = 1
    # Multiply all word likelihoods to obtain a sentence rating.
    for wordIndex, word in enumerate(words):
        p *= getWordLikelihood(words, wordIndex, correctedWordIndex)
    return p

def getWordLikelihood(words, wordIndex, correctedWordIndex):
    word = words[wordIndex]
    if wordIndex == correctedWordIndex:
        # If this is the possibly misspelled word, use the channel
        # probability to compare multiple candidates.
        p = channelProbs[word]
    else:
        # For other words, assume that they are correct and use a high
        # probability.
        p = LM_PROBABILITY_UNMODIFIED
    # Calculate P(w): how often the word occurs in the corpus. Apply
    # add-1 smoothing under the assumption that there are so many
    # unigrams that it does not significantly affect the probability;
    # it merely ensures that the result is non-zero.
    p *= (getNGramCount(word) + 1) / (unigramCount + 1)
    # Multiply by the n-gram probabilities.
    for n in range(1, NGRAM_N):
        # Calculate P(word|prefix).
        prefix = words[max(0, wordIndex - n):wordIndex]
        p *= ngramProbability(word, prefix)
    return p

# Calculates P(word|prefix) = #(prefix word) / #(prefix).
def ngramProbability(word, prefix):
    a = getNGramCount(' '.join(prefix + [word]))
    b = getNGramCount(' '.join(prefix))
    # Apply smoothing, but add a smaller number to the numerator
    # because these counts are typically very small.
    return (a + 0.001) / (b + 1)
\end{lstlisting}
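The listing assumes a \texttt{getNGramCount} helper backed by the counts
from \textit{samplecnt.txt}. A minimal sketch of such a helper, assuming a
hypothetical ``count ngram'' pair per line (the actual file format and the
Java implementation may differ):

\begin{lstlisting}[language=python]
# Load the n-gram counts into a map. The "count ngram" line format is
# an assumption made for this sketch.
ngramCounts = {}
with open('samplecnt.txt') as f:
    for line in f:
        count, ngram = line.strip().split(' ', 1)
        ngramCounts[ngram] = int(count)

# The total number of unigram occurrences, used to compute P(w).
unigramCount = sum(c for ngram, c in ngramCounts.items()
                   if ' ' not in ngram)

def getNGramCount(ngram):
    # Unseen n-grams have a count of zero; smoothing is applied by the
    # callers above.
    return ngramCounts.get(ngram, 0)
\end{lstlisting}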
\section{Steps to reproduce}
\begin{enumerate}
\item Obtain the source code of the SpellChecker project.
\item Run \texttt{ant run} from the project directory to compile and
execute the program.
\item At the \texttt{run:} step, type the sentence to be corrected and
press Enter. The sentence should contain at most \texttt{MAX\_TYPOS} errors
(defaulting to two, but changeable in the source code of
\texttt{SpellCorrector.java}).
\item The corrected sentence is printed after \texttt{Answer: }.
\end{enumerate}

A mode also exists in which you can keep entering lines; for each line, an
answer is displayed. It can be activated by setting the environment
variable \texttt{NO\_PEACH} prior to executing \texttt{ant run}.
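For illustration, a session in this mode might look as follows (build
output abbreviated; the bracketed sentences are placeholders for arbitrary
input with at most two typos):

\begin{lstlisting}[language={},numbers=none]
$ NO_PEACH=1 ant run
...
run:
<sentence with at most two typos>
Answer: <corrected sentence>
<another sentence>
Answer: <corrected sentence>
\end{lstlisting}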
\section{Results and Explanation}
The algorithm can be tuned with the following parameters:
\begin{description}
\item[NGRAM\_N] The largest $n$ (as in $n$-gram) that should be checked.
This depends on the input file \textit{samplecnt.txt} (which in turn
depends on \textit{samplevoc.txt}).
\item[MAX\_TYPOS] This parameter is set to 2 for the task, but it can be
increased for other situations. The worst-case running time grows
exponentially with this parameter.
\item[LM\_PROBABILITY\_UNMODIFIED] Words that possibly contain an error
are influenced by the channel probability. For the other words (those that
are not being considered for correction), this factor can be used to put
more influence on the channel probability. Values closer to 1 make the
algorithm more conservative, preferring not to correct words; values
closer to zero make it try to correct words whenever an alternative can
be found.
\end{description}

The program was tested against the \textit{test-sentences.txt} dataset,
for which it only failed to correct \textit{the development of diabetes
\textbf{u}s present in mice that \textbf{h}arry a transgen(\textbf{e})}
(errors and additions are emphasized). This issue can be solved by
increasing the parameter \texttt{MAX\_TYPOS} to 3. After that change, the
word \texttt{harry} is also corrected to \texttt{carry}. With
\texttt{MAX\_TYPOS} set to 3, however, the sentence \textit{boxing loves
shield the knuckles nots the head} got corrected to \textit{boxing gloves
shield the knuckles not\textbf{es} the head}. If n-grams were given a
greater weight than the channel probability, this miscorrection would not
occur.

With the initial n-gram text file, the program failed to correct
\textit{kind retards} to \textit{kind regards} (finding \textit{kind
rewards} instead). This happened due to the quality of the learning data.
With a different dataset \cite{sekine} including unigrams and bigrams
(keeping only bigrams with a count greater than 10), it was corrected as
expected. Since this dataset is huge (millions of n-grams) and negatively
impacts the resource requirements, it was not used in the final version.
Sekine's dataset did not include all unigrams from the original sample,
though. For the sentences provided via Peach this does not matter:
Sekine's learning data is accurate enough for them.

An interesting observation is that the arbitrary n-gram support of the
program works as intended. For example:
\begin{itemize}
\item input line: \\
she still refers to me has a friend but i fel i am treated quite badly
\item wrong output with the original samplecnt: \\
he still refers to me has a friend but i feel i am treated quite badly
\item wrong output with Sekine's dataset (up to 2-grams): \\
she still refers to me has a friend but i feel i a treated quite badly
\item correct output with Sekine's dataset (up to 3-grams): \\
she still refers to me as a friend but i feel i am treated quite badly
\end{itemize}

\section{Statement of the Contributions}
Peter wrote most of the Spell Checker implementation and report.

\begin{thebibliography}{9}
\bibitem{kernighan} M. D. Kernighan, K. W. Church, and W. A. Gale.
\emph{A Spelling Correction Program Based on a Noisy Channel Model}, 1990.
\bibitem{interpolation} D. Jurafsky. Lecture on interpolation techniques.
\url{https://class.coursera.org/nlp/lecture/19}
\bibitem{sekine} S. Sekine et al. \emph{Tagged and Cleaned Wikipedia (TC
Wikipedia) and its Ngram}. \url{http://nlp.cs.nyu.edu/wikipedia-data/}
\end{thebibliography}

\end{document}