\documentclass[a4paper]{article}

\usepackage[english]{babel}
\usepackage[utf8]{inputenc}
\usepackage{amsmath}
\usepackage{hyperref}
\usepackage{algorithm}
\usepackage{xcolor}
\usepackage{listings}
\usepackage{courier}
\usepackage{etoolbox}
\patchcmd{\thebibliography}{\section*}{\section}{}{}

\definecolor{dkgreen}{rgb}{0,0.55,0}
\definecolor{gray}{rgb}{0.5,0.5,0.5}
\definecolor{mauve}{rgb}{0.58,0,0.82}
\lstset{language=Python,
linewidth=80em,
breaklines=true,
numbers=left,
basicstyle=\ttfamily\footnotesize,
showstringspaces=false,
numberstyle=\tiny\color{gray},
keywordstyle=\color{blue},
commentstyle=\color{dkgreen},
stringstyle=\color{mauve},
}

\title{Spell checker}

\author{2ID90 Artificial Intelligence \\ Team 08 \\ Wilco Brouwer (0816718) \\ Peter Wu (0783206)}

\date{\today}

\begin{document}
\maketitle

\section{Introduction}
In this report we describe our solution to the second assignment, which
consists of designing a spell checker algorithm and implementing it in Java.
The algorithm is given a sentence containing zero to two misspelled words, each
with a Damerau-Levenshtein edit distance of 1 from its correction. In addition,
misspelled words are never consecutive.

The expected output is a sentence where these errors have been corrected.

\section{Approach}
To remove spelling errors from a text, one has to generate possible corrections
and evaluate them. Spelling suggestions are made on a word-by-word basis.

Finding \textit{candidate words} that can substitute a possibly misspelled word
requires a method to generate such words. This is described in
section \ref{section:wordsgen}.

The sentences resulting from a substitution of a single word are evaluated as a
whole. This evaluation is based on Kernighan's noisy channel model
\cite{kernighan} and is described further in section \ref{section:wordeval}.
The result of such an evaluation is a probability that a sentence is correct.
Note that this number can only be used to compare similar sentences with the
same number of words and with only one word different. Comparing sentences with
different numbers of words would most likely favor the shorter one, as it has
fewer words contributing factors to the total probability (this situation does
not occur here, since words can be modified but not added or removed).
Comparing completely different sentences can favor a sentence with errors (but
popular n-grams) over one that uses less common words. This typically does not
occur either, as the task guarantees that words next to an error are correct
and that at most two errors exist.

If a better sentence is found after evaluating all possible substitutions for a
word, the algorithm recurses on the new sentence. Under the assumptions of the
task, at most two such recursions are needed, since no further errors remain
afterwards. Initially we tried to continue greedily with only the best word,
but this approach can miss corrections because n-grams may raise the
probability of neighboring words.

Finally, the sentence with the best overall rating is returned. This may be the
unmodified input if no substitution was ranked better than the original
sentence.

\subsection{Overview}

The indices of words that must not be modified are remembered
(\texttt{modifiedIndices} in the listing below). Once a substitution is found
whose rating is higher than the initial one, the word is replaced.

When a word has been modified, the hunt for more faulty words continues until
all of them (at most \texttt{MAX\_TYPOS}, which is 2 by default) are found.
Otherwise, it is assumed that the sentence contains no more spelling errors.

Pseudo-code:

\begin{lstlisting}[language=python]
# Attempt to find a better sentence by modifying words.
def findBetterWord(words, modifiedIndices):
    # Rate the initial sentence to compare it with alternatives (no
    # word is being corrected yet, hence the None index).
    bestRating = evaluateSentence(words, None)
    bestSentence = words

    # For each word in the sentence, try alternatives for a better rating.
    for wordIndex, word in enumerate(words):
        # Skip words if these were modified before or if they were
        # consecutive to a previous word (canModifyWord definition
        # is omitted for brevity).
        if not canModifyWord(modifiedIndices, wordIndex):
            continue

        # Generate a list of candidate words (described later).
        candidates = getCandidateWords(word)

        # Iterate through the candidates and remember the substitution
        # that gives the best sentence. Initially assume that no
        # candidate results in a better sentence.
        bestCandidate = None
        for candidate in candidates:
            # Rate the sentence with the candidate substituted in. If
            # it is better than the current best, remember it.
            rating = evaluateSentence(
                words[:wordIndex] + [candidate] + words[wordIndex+1:],
                wordIndex)
            if rating > bestRating:
                bestCandidate = candidate
                bestRating = rating

        # If a better word is found, use it and try to find more errors.
        if bestCandidate is not None:
            words[wordIndex] = bestCandidate
            bestSentence = words
            # Ensure that this word (and its context) cannot be changed,
            # given the assumption that no consecutive words have an error.
            modifiedIndices.append(wordIndex)

            # If there may be more typos, try to find them.
            if len(modifiedIndices) < MAX_TYPOS:
                # Recurse with the modified sentence to find better words.
                newSentence = findBetterWord(words, modifiedIndices)
                if evaluateSentence(newSentence, None) > bestRating:
                    bestSentence = newSentence

            # Make words writable for further iterations.
            modifiedIndices.pop()
        # else: no more suggestions; do not recurse and try the next word.
        # else no more suggestions, do not recurse and try the next word.

    # Finish with the best sentence (possibly unmodified).
    return bestSentence

# Split the input on spaces and find the best sentence.
words = input.split(' ')
bestSentence = findBetterWord(words, [])
\end{lstlisting}

\subsection{Generation of candidate words}
\label{section:wordsgen}
A naive approach to generating candidate words in \texttt{getCandidateWords()}
is to enumerate all possible strings over the alphabet and then take the
intersection with the dictionary. This does not scale: it costs $O(27^n)$
memory and time for words of length $n$.
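
For illustration, a sketch of this naive enumeration (\texttt{ALPHABET} and
\texttt{vocabulary} are assumed to be available, as in the later listings):

\begin{lstlisting}[language=python]
from itertools import product

# Naive candidate generation: enumerate every string of length n over
# the alphabet and keep those that appear in the dictionary. The loop
# body runs |ALPHABET|^n times, which explodes for longer words.
def naiveCandidates(n):
    for letters in product(ALPHABET, repeat=n):
        candidate = "".join(letters)
        if candidate in vocabulary:
            yield candidate
\end{lstlisting}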

Instead we opted for a smarter approach. Since the misspelled words are known
to have a Damerau-Levenshtein distance of 1 from their corrections, candidates
are generated by cycling through each letter position and applying an
insertion, deletion, substitution or transposition. We also know that all
corrected words are in the vocabulary, so all other words can be dropped from
the candidate list.

While applying these operations, the noisy channel probability can be obtained
at the same time, since we have a confusion matrix and know which kind of
alteration was performed.

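A minimal sketch of that lookup, assuming confusion matrices in the style of
Kernighan et al.\ \cite{kernighan}; the names \texttt{confusionMatrix} and
\texttt{totalEditCount} are our illustration, not necessarily the structures
used in the Java code. The \texttt{mutation} descriptors are the ones produced
by \texttt{getWordMutations()} below:

\begin{lstlisting}[language=python]
# mutation is a (kind, before, after) tuple, e.g. the substitution
# that turns "wird" into "word" yields ('substitution', 'i', 'o').
def calculateChannelProbability(mutation):
    kind, before, after = mutation
    # confusionMatrix[kind] counts how often this particular slip
    # (before -> after) was observed in a corpus of spelling errors.
    count = confusionMatrix[kind].get((before, after), 0)
    # Add-1 smoothing keeps unseen edits non-zero. Kernighan et al.
    # instead normalize by the corpus count of the original letter
    # (pair); this simplified version divides by the total number of
    # observed edits of this kind.
    return (count + 1) / (totalEditCount[kind] + 1)
\end{lstlisting}
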
Pseudo-code (the actual Java implementation decomposes each operation into its
own method):

\begin{lstlisting}[language=python]
def getWordMutations(word):
    n = len(word)
    # Walk through the letters of the word and mutate them. Yield
    # (mutation, candidate) pairs, where mutation describes the edit
    # so that its confusion matrix entry can be looked up.
    for i in range(n + 1):
        # Insertion of a new character at position i.
        for c in ALPHABET:
            yield (('insertion', '', c), word[:i] + c + word[i:])
    for i in range(n):
        # Deletion of the character at position i.
        yield (('deletion', word[i], ''), word[:i] + word[i+1:])

        # Substitution of the character at position i.
        for c in ALPHABET:
            yield (('substitution', word[i], c),
                   word[:i] + c + word[i+1:])

        # Transposition of the characters at positions i and i+1.
        if i + 1 < n:
            yield (('transposition', word[i:i+2], word[i+1] + word[i]),
                   word[:i] + word[i+1] + word[i] + word[i+2:])

def getCandidateWords(word):
    # Initially there are no candidate words.
    candidateWords = []
    # Start with an empty map from word to channel probability (read
    # later by getWordLikelihood; a global in this pseudo-code).
    channelProbs = {}

    # Obtain all possible word mutations and pick the valid ones.
    for mutation, candidateWord in getWordMutations(word):
        # Skip words not in the vocabulary and unmodified words.
        if candidateWord not in vocabulary or word == candidateWord:
            continue

        # The word is in the vocabulary, keep it (once).
        if candidateWord not in candidateWords:
            candidateWords.append(candidateWord)

        # Find the channel probability. As multiple mutations can
        # result in the same word, remember the modification that
        # gives the highest rating. Do not sum those, as that would
        # bias the result for very short words.
        channelProbs[candidateWord] = max(
            channelProbs.get(candidateWord, 0),
            calculateChannelProbability(mutation)
        )

    return candidateWords
\end{lstlisting}


\subsection{Evaluation of words}
\label{section:wordeval}
The most important part of the algorithm is rating a word. For this, we combine
the channel model probability with the language model probability. The channel
model probability $P(x|w)$ is obtained via \texttt{getCandidateWords()}, while
the language model probability $P(w)$ depends on the frequency of prior
n-grams in a corpus. The combination gives the word score $P(x|w) \cdot P(w)$.
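
Written out for an input sentence $x_1 \ldots x_k$ in which position $i$ holds
the word under consideration and $w_i$ is a candidate correction (with
$w_j = x_j$ elsewhere), the rating compared between alternatives is
\begin{equation*}
P(x_i \mid w_i) \cdot \prod_{j=1}^{k} P\left(w_j \mid w_{j-n+1} \ldots w_{j-1}\right),
\end{equation*}
where the conditional terms are the smoothed n-gram probabilities described
below, and unmodified words contribute the constant factor
\texttt{LM\_PROBABILITY\_UNMODIFIED} in place of a channel term.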

Initially we used interpolation with context-dependent lambdas
\cite{interpolation} to combine the n-gram probabilities, but this gave bad
results because it put too much weight on high unigram probabilities. Instead,
the language model probability is now obtained by multiplying the n-gram
probabilities directly. Bigrams are not given an explicit weight; since bigram
counts in the corpus are much lower than unigram counts, a matching bigram
naturally contributes a stronger signal than a matching unigram. Note that the
algorithm also scales to trigrams, 4-grams, and so on, simply by changing the
\texttt{NGRAM\_N} parameter.

Due to the use of n-grams, the score of the sentence as a whole must be
considered: the $n - 1$ words before and after a word can form new n-grams
with different probabilities. For example, substituting \textit{as} for
\textit{has} also changes the bigrams formed with the preceding and following
words.

Finally, all word scores are combined by simple multiplication. For this
reason, words that are certainly incorrect (not in the vocabulary) should not
be rated 0, but with a very small number (such as $10^{-99}$). This ensures
that sentences containing two errors still have a chance to be recognized as a
better result.

Pseudo-code to summarize the evaluation:
\begin{lstlisting}[language=python]
def evaluateSentence(words, correctedWordIndex):
    p = 1
    # Simply multiply all word probabilities to obtain a sentence rating.
    for wordIndex, word in enumerate(words):
        p *= getWordLikelihood(words, wordIndex, correctedWordIndex)
    return p

def getWordLikelihood(words, wordIndex, correctedWordIndex):
    word = words[wordIndex]

    if wordIndex == correctedWordIndex:
        # If this is the possibly misspelled word, use the channel
        # probability to compare multiple candidates.
        p = channelProbs[word]
    else:
        # For other words, assume that they are correct and use a
        # high probability
        p = LM_PROBABILITY_UNMODIFIED

    # Calculate P(w), i.e. how often the word occurs in the corpus.
    # Apply add-1 smoothing under the assumption that there are many
    # unigrams, so this does not significantly affect the probability;
    # it just ensures that it is non-zero.
    p *= (getNGramCount(word) + 1) / (unigramCount + 1)

    # Multiply by the higher-order n-gram probabilities.
    for n in range(1, NGRAM_N):
        # Calculate P(word|prefix) for the n preceding words.
        if wordIndex >= n:
            prefix = " ".join(words[wordIndex-n:wordIndex])
            p *= ngramProbability(word, prefix)

    return p

# Calculates P(word|prefix) = #(prefix word) / #(prefix)
def ngramProbability(word, prefix):
    a = getNGramCount(prefix + " " + word)
    b = getNGramCount(prefix)

    # Apply smoothing, adding a much smaller constant to the numerator
    # because the longer n-gram count "a" is typically very small.
    return (a + 0.001) / (b + 1)
\end{lstlisting}
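
As a quick sanity check of the smoothing in \texttt{ngramProbability()} (the
counts here are made up for illustration):

\begin{lstlisting}[language=python]
# Hypothetical counts: "kind" occurs 120 times in the corpus.
# The seen bigram "kind regards" (count 40) gets a solid probability:
print((40 + 0.001) / (120 + 1))  # ~0.33
# An unseen bigram such as "kind retards" (count 0) stays tiny but
# non-zero, so a sentence containing it can still be ranked:
print((0 + 0.001) / (120 + 1))   # ~8.3e-06
\end{lstlisting}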

\section{Steps to reproduce}
\begin{enumerate}
\item Obtain the source code of the SpellChecker project.
\item Run \texttt{ant run} from the project directory to compile and execute the
program.
\item At the \texttt{run:} step, type the sentence to be corrected and press
Enter. The sentence should contain at most \texttt{MAX\_TYPOS} errors
(defaulting to two, but changeable in the source code of
\texttt{SpellCorrector.java}).
\item The corrected sentence is printed as \texttt{Answer:  <corrected sentence>}.
\end{enumerate}

A mode also exists in which you can keep entering lines; for each line, an
answer is displayed. It is enabled by setting the environment variable
\texttt{NO\_PEACH} prior to executing \texttt{ant run}.
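For example, in a POSIX shell: \texttt{NO\_PEACH=1 ant run} (we assume any
value enables the mode, since only the presence of the variable is required).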

\section{Results and Explanation}
The algorithm can be tuned with the following parameters:
\begin{description}
\item[NGRAM\_N] The largest $n$ (as in $n$-gram) that should be checked. This
depends on the input file \textit{samplecnt.txt} (which in turn depends on
\textit{samplevoc.txt}).
\item[MAX\_TYPOS] This parameter is set to 2 for the task, but it could be
increased for other situations. The worst-case running time increases
exponentially with this parameter.
\item[LM\_PROBABILITY\_UNMODIFIED] Words that possibly contain an error are
weighted by the channel probability. For the other words (those not being
considered for correction), this constant takes the place of that factor.
Values closer to 1 make the algorithm more conservative, preferring not to
correct words; values closer to zero make it correct words whenever an
alternative can be found.
\end{description}
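
To make the roles of these parameters concrete, a sketch in the style of the
earlier pseudo-code (the values are illustrative, not necessarily the defaults
in \texttt{SpellCorrector.java}):

\begin{lstlisting}[language=python]
# Illustrative values only; the actual constants are defined in
# SpellCorrector.java.
NGRAM_N = 2                       # use unigrams and bigrams
MAX_TYPOS = 2                     # at most two corrections per sentence
LM_PROBABILITY_UNMODIFIED = 0.95  # stand-in channel probability for
                                  # words that are not being corrected
\end{lstlisting}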

The program was tested against the \textit{test-sentences.txt} dataset, for
which it only failed to correct \textit{the development of diabetes
\textbf{u}s present in mice that \textbf{h}arry a transgen(\textbf{e})}
(errors and additions are emphasized). This issue can be solved by increasing
the parameter \texttt{MAX\_TYPOS} to 3; after that change, the word
\texttt{harry} is also corrected to \texttt{carry}.

When changing \texttt{MAX\_TYPOS} to 3, however, the sentence \textit{boxing
loves shield the knuckles nots the head} was corrected to \textit{boxing gloves
shield the knuckles not\textbf{es} the head}. If n-grams were given a greater
weight than the channel probability, this miscorrection would not occur.

With the initial n-gram text file, the program failed to correct \textit{kind
retards} to \textit{kind regards} (finding \textit{kind rewards} instead).
This is due to the quality of the learning data. With a different dataset
\cite{sekine} including unigrams and bigrams (keeping only bigrams with a
count greater than 10), it was corrected as expected. Since this dataset is
huge (millions of n-grams) and negatively impacts the resource requirements,
it was not used in the final version.

Sekine's dataset, however, does not include all unigrams from the original
sample. For the sentences provided via Peach this does not matter; Sekine's
learning data is accurate enough for them.

An interesting observation is that the program's support for arbitrary n-gram
sizes works as intended. For example:

\begin{itemize}
\item input line: \\
she still refers to me has a friend but i fel i am treated quite badly
\item wrong output with original samplecnt: \\
he still refers to me has a friend but i feel i am treated quite badly
\item wrong output with Sekine's dataset (up to 2-grams): \\
she still refers to me has a friend but i feel i a treated quite badly
\item correct output with Sekine's dataset (up to 3-grams): \\
she still refers to me as a friend but i feel i am treated quite badly
\end{itemize}

\section{Statement of the Contributions}
Peter wrote most of the Spell Checker implementation and report.

\begin{thebibliography}{9}
\bibitem{kernighan}
Mark D. Kernighan, Kenneth W. Church and William A. Gale.
\emph{A Spelling Correction Program Based on a Noisy Channel Model}.
1990.
\bibitem{interpolation}
Daniel Jurafsky.
Lecture on interpolation techniques.
\url{https://class.coursera.org/nlp/lecture/19}
\bibitem{sekine}
Satoshi Sekine et al.
\emph{Tagged and Cleaned Wikipedia (TC Wikipedia) and its Ngram}.
\url{http://nlp.cs.nyu.edu/wikipedia-data/}
\end{thebibliography}

\end{document}