\documentclass[a4paper]{article}
\usepackage[english]{babel}
\usepackage[utf8]{inputenc}
\usepackage{amsmath}
\usepackage{hyperref}
\usepackage{algorithm}
\usepackage{xcolor}
\usepackage{listings}
\usepackage{courier}

\definecolor{dkgreen}{rgb}{0,0.55,0}
\definecolor{gray}{rgb}{0.5,0.5,0.5}
\definecolor{mauve}{rgb}{0.58,0,0.82}

\lstset{language=Python,
  linewidth=80em,
  breaklines=true,
  numbers=left,
  basicstyle=\ttfamily\footnotesize,
  showstringspaces=false,
  numberstyle=\tiny\color{gray},
  keywordstyle=\color{blue},
  commentstyle=\color{dkgreen},
  stringstyle=\color{mauve},
}

\title{Spell checker}
\author{2ID90 Artificial Intelligence \\ Team 08 \\
  Wilco Brouwer (0816718) \\
  Peter Wu (0783206)}
\date{\today}

\begin{document}
\maketitle

\section{Introduction}
In this report we describe our solution to the second assignment, which consists of designing a spell checker algorithm and implementing it in Java. The algorithm is given a sentence containing zero to two misspelled words, each within a Damerau-Levenshtein edit distance of 1 of the intended word. In addition, misspelled words are never consecutive. The expected output is the sentence with these errors corrected.

\section{Approach}
To remove spelling errors from a text, one has to generate possible corrections and evaluate them. Spelling suggestions are made on a word-by-word basis. Finding the \textit{candidate words} that can substitute a possibly misspelled word requires a method to generate such words; this is described in section~\ref{section:wordsgen}.

The sentences resulting from the substitution of a single word are evaluated as a whole. This evaluation is based on the noisy channel model of Kernighan et al.\ \cite{kernighan} and is further described in section~\ref{section:wordeval}. The result of such an evaluation is a probability that a sentence is correct. Note that this number can only be used to compare similar sentences with the same number of words and with only one differing word. Comparing sentences with a different number of words would most likely favor the shorter one, as fewer word probabilities contribute to the product (this situation does not occur here, because words can be modified but not added or removed). Comparing completely different sentences can favor a sentence with errors (but popular n-grams) over one that uses less common words. This typically does not occur either, as the task assumes that words next to an error cannot be modified and that at most two errors exist.

If a better sentence is found after evaluating all possible substitutions for a word, the search recurses on the new sentence. Under the assumptions of the task, at most two recursions are needed, since no further errors remain. Initially we tried to greedily continue with the best word only; this approach can however miss corrections, because an n-gram may also increase the probability of neighboring words. Finally, the sentence with the best overall rating is returned. This sentence may be unmodified if no substitution ranked better than the original.

\subsection{Overview}
The indices of words that have been corrected, and are therefore assumed correct, are remembered (\texttt{modifiedIndices} in the listing below). Once a rating is found that is higher than the initial one, the word is replaced. When a word has been modified, the hunt for more faulty words continues until all of them (fewer than three) are found. Otherwise, it is assumed that the sentence contains no more spelling errors.
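The pseudo-code below relies on a helper \texttt{canModifyWord}, whose exact definition is not shown. The following is a minimal illustrative sketch of the intended check, under the stated assumption that a corrected word and its direct neighbors may not be changed again; it is not necessarily identical to the Java implementation.

\begin{lstlisting}[language=python]
# Illustrative sketch only; the actual Java implementation may differ.
def canModifyWord(modifiedIndices, wordIndex):
    # A word may only be changed if neither the word itself nor a direct
    # neighbor was corrected before (errors are never consecutive).
    return all(abs(wordIndex - i) > 1 for i in modifiedIndices)
\end{lstlisting}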
Pseudo-code:

\begin{lstlisting}[language=python]
# The task guarantees at most two misspelled words.
MAX_TYPOS = 2

# Attempt to find a better sentence by modifying words.
def findBetterWord(words, modifiedIndices):
    # Rate the initial sentence to compare it with alternatives.
    bestRating = evaluateSentence(words)
    bestSentence = list(words)

    # For each word in the sentence, try alternatives for a better rating.
    for wordIndex, word in enumerate(words):
        # Skip words that were modified before or that are adjacent to a
        # previously modified word (a sketch of canModifyWord is given
        # above).
        if not canModifyWord(modifiedIndices, wordIndex):
            continue

        # Generate a list of candidate words; this is described later.
        candidates = getCandidateWords(word)

        # Iterate through the candidates and remember the modification
        # that gives the best sentence. Initially assume that no
        # candidate results in a better sentence.
        bestCandidate = None
        for candidate in candidates:
            # Rate the sentence with the candidate substituted. If it is
            # better than the best rating so far, remember it.
            words[wordIndex] = candidate
            rating = evaluateSentence(words)
            if rating > bestRating:
                bestCandidate = candidate
                bestRating = rating

        # If a better word is found, use it and try to find more errors.
        if bestCandidate is not None:
            words[wordIndex] = bestCandidate
            bestSentence = list(words)
            # Ensure that this word (and its neighbors) cannot be changed,
            # given the assumption that no consecutive words have an error.
            modifiedIndices.append(wordIndex)

            # If there may be more typos, try to find them.
            if len(modifiedIndices) < MAX_TYPOS:
                # Recurse with the modified sentence to find better words.
                newSentence = findBetterWord(words, modifiedIndices)
                newRating = evaluateSentence(newSentence)
                if newRating > bestRating:
                    bestSentence = newSentence
                    bestRating = newRating

            # Make this position writable again for further iterations.
            modifiedIndices.pop()
        # else: no candidate was better, do not recurse and try the next word.

        # Restore the original word before trying the next position.
        words[wordIndex] = word

    # Finish with the best sentence (possibly unmodified).
    return bestSentence

# Split the input on spaces and find the best sentence.
words = input.split(' ')
bestSentence = findBetterWord(words, [])
\end{lstlisting}

\subsection{Generation of candidate words}
\label{section:wordsgen}
A naive way to generate candidate words in the \texttt{getCandidateWords()} function is to generate all possible strings over the alphabet and then take the intersection with the dictionary. This does not scale, as it costs $O(27^n)$ time and memory. Instead we opted for a smarter approach.

As we know that the misspelled words have a Damerau-Levenshtein distance of 1 to their corrections, candidates are generated by walking through the positions of the word and applying an insertion, deletion, substitution or transposition at each position. We also know that all corrected words are in the vocabulary, so other words can be dropped from the candidate list. While performing these operations, the noisy channel probability can be obtained as well, since we have a confusion matrix and know which kind of alteration was applied.

Pseudo-code follows (the actual Java implementation has separate, decomposed methods for each of the operations):

\begin{lstlisting}[language=python]
def getWordMutations(word):
    # Walk through the positions of the word and mutate the characters.
    # Each yielded pair describes the alteration (used to consult the
    # confusion matrix) and the resulting word.
    for i in range(len(word) + 1):
        # Insertion of a new character at position i.
        for character in ALPHABET:
            yield ('insertion', word[:i] + character + word[i:])
        if i == len(word):
            continue
        # Deletion of the character at position i.
        yield ('deletion', word[:i] + word[i + 1:])
        # Substitution of the character at position i.
        for character in ALPHABET:
            yield ('substitution', word[:i] + character + word[i + 1:])
        # Transposition of the characters at positions i and i + 1.
        if i + 1 < len(word):
            yield ('transposition',
                   word[:i] + word[i + 1] + word[i] + word[i + 2:])

def getCandidateWords(word):
    # Initially there are no candidate words.
    candidateWords = []
    # Start with an empty map from words to channel probabilities.
    channelProbs = {}
    # Obtain all possible word mutations and pick the valid ones.
    for mutation, candidateWord in getWordMutations(word):
        # Skip words that are not in the vocabulary or are unmodified.
        if candidateWord not in vocabulary or candidateWord == word:
            continue
        # The word is in the vocabulary, keep it (without duplicates).
        if candidateWord not in candidateWords:
            candidateWords.append(candidateWord)
        # Find the channel probability. As multiple mutations can result
        # in the same word, remember the modification that gives the
        # highest rating. Do not sum those, as that could introduce a
        # bias for very short words.
        channelProbs[candidateWord] = max(
            channelProbs.get(candidateWord, 0),
            calculateChannelProbability(mutation)
        )
    return candidateWords
\end{lstlisting}

\subsection{Evaluation of words}
\label{section:wordeval}
The most important part of the algorithm is rating a word. For this, we combine the channel model probability with the language model probability. The channel model probability $P(x|w)$ is obtained via \texttt{getCandidateWords()}, while the language model probability $P(w)$ depends on the frequency of n-grams observed in a corpus. The combination gives the word score $P(x|w) \cdot P(w)$.

The n-gram probabilities are combined by interpolation, with lambda weights conditioned on the context \cite{interpolation}. Intuitively, bigrams (or trigrams) are more precise than unigrams because they are more specific and use more of the context. However, there are also fewer bigrams and even fewer trigrams to draw information from. This can skew the results, so while higher-order n-grams should be weighted more heavily than unigrams, their lambda weights should not become too large.

Due to the use of n-grams, the score of the sentence as a whole must be considered: the $n - 1$ words before and after a substituted word form new n-grams, which have different probabilities. Finally, all word scores are combined by multiplication. Because of this multiplication, words that are certainly incorrect (not in the vocabulary) should not be rated 0, but with a very small number (such as $10^{-99}$); a rating of 0 would nullify the score of the whole sentence. This ensures that a sentence which still contains a second, not yet corrected misspelling has the chance to be recognized as a better result.
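To make the interpolation concrete, the sketch below shows how unigram and bigram probabilities could be combined into the language model part of the word score. The fixed lambda value and the helper names (\texttt{unigramCount}, \texttt{bigramCount}, \texttt{corpusSize}, \texttt{vocabulary}) are illustrative assumptions only; our implementation conditions the lambdas on the context and also multiplies in the channel probability $P(x|w)$.

\begin{lstlisting}[language=python]
# Illustrative sketch of the language model side of the word score; the
# channel probability P(x|w) would be multiplied in separately.
LAMBDA = 0.8  # assumed fixed bigram weight; in practice it depends on context

def interpolatedProbability(previousWord, word):
    # Unigram probability: relative frequency of the word in the corpus.
    unigram = unigramCount(word) / corpusSize
    # Bigram probability: how often the word follows the previous word.
    if unigramCount(previousWord) > 0:
        bigram = bigramCount(previousWord, word) / unigramCount(previousWord)
    else:
        bigram = 0.0
    # Weighted combination: the bigram is more specific but rarer, so it
    # is weighted more heavily without fully dominating the unigram.
    return LAMBDA * bigram + (1 - LAMBDA) * unigram

def evaluateSentence(words):
    # Multiply the scores of all words. Out-of-vocabulary words receive a
    # very small probability instead of 0 so that one bad word does not
    # nullify the score of the whole sentence.
    probability = 1.0
    for i, word in enumerate(words):
        if word not in vocabulary:
            probability *= 1e-99
        elif i == 0:
            probability *= unigramCount(word) / corpusSize
        else:
            probability *= interpolatedProbability(words[i - 1], word)
    return probability
\end{lstlisting}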
\section{Results and Explanation}

\section{Additional Resources}

\section{Statement of the Contributions}
Peter wrote most of the Spell Checker implementation and report.

\section{Manual}

\begin{thebibliography}{9}
\bibitem{kernighan}
  M.~D. Kernighan, K.~W. Church and W.~A. Gale.
  \emph{A Spelling Correction Program Based on a Noisy Channel Model}.
  1990.
\bibitem{interpolation}
  D. Jurafsky.
  Presentation on interpolation techniques, Coursera course on Natural Language Processing.
  \url{https://class.coursera.org/nlp/lecture/19}
\end{thebibliography}

\end{document}