\documentclass[a4paper]{article}

\usepackage[english]{babel}
\usepackage[utf8]{inputenc}
\usepackage{amsmath}
\usepackage{hyperref}
\usepackage{algorithm}
\usepackage{xcolor}
\usepackage{listings}
\usepackage{courier}

\definecolor{dkgreen}{rgb}{0,0.55,0}
\definecolor{gray}{rgb}{0.5,0.5,0.5}
\definecolor{mauve}{rgb}{0.58,0,0.82}
\lstset{language=Python,
linewidth=80em,
breaklines=true,
numbers=left,
basicstyle=\ttfamily\footnotesize,
showstringspaces=false,
numberstyle=\tiny\color{gray},
keywordstyle=\color{blue},
commentstyle=\color{dkgreen},
stringstyle=\color{mauve},
}

\title{Spell checker}

\author{2ID90 Artificial Intelligence \\ Team 08 \\ Wilco Brouwer (0816718) \\ Peter Wu (0783206)}

\date{\today}

\begin{document}
\maketitle

\section{Introduction}
In this report we describe our solution to the second assignment, which
consists of designing a spell checker algorithm and implementing it in Java.
The algorithm is given a sentence that contains zero to two misspelled words,
each with a Damerau-Levenshtein edit distance of 1 to its correction. In
addition, misspelled words are never consecutive.

The expected output is a sentence where these errors have been corrected.

\section{Approach}
To remove spelling errors from a text, one has to generate possible corrections
and evaluate them. Spelling suggestions are made on a word-by-word basis.

To obtain a list of possible words, \textit{candidate words} are generated with
a Damerau-Levenshtein distance of 1 to the original word. While doing this, a
noisy channel probability can also be obtained, which is needed later for word
evaluation.

Each word of the sentence is then evaluated against its context, which
includes the previous and next words for n-grams. Since we only have the
guarantee that the words directly next to a misspelled one are correct,
trigrams must not reach two words before or after the current word, as those
positions may contain another misspelled word. The sketch below illustrates
this restriction.
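
This is a minimal sketch of such a restricted context window; the helpers
\texttt{unigramProbability} and \texttt{bigramProbability} are hypothetical
names of ours, not part of the assignment:

\begin{lstlisting}[language=python]
def contextProbability(words, i, candidate):
    # Only the words directly adjacent to position i are guaranteed
    # to be correct, so only bigrams with the direct neighbours are
    # consulted; a trigram would reach two positions away, where
    # another misspelled word may occur.
    p = unigramProbability(candidate)
    if i > 0:
        p *= bigramProbability(words[i - 1], candidate)
    if i + 1 < len(words):
        p *= bigramProbability(candidate, words[i + 1])
    return p
\end{lstlisting}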

Finally, the most likely word is chosen. If it differs from the original word,
the search continues for a possible second and third misspelled word.

\subsection{Overview}
The program is given a single sentence, which is split into words. The indices
of words that are certainly correct are remembered (\texttt{readonly\_words} in
the listing below). Once a candidate is found whose rating is higher than that
of the original word, the word is replaced.

When a word was modified, the hunt for more misspelled words continues until
all of them (fewer than three) are found. Otherwise, it is assumed that the
sentence has no more spelling errors.

Pseudo-code:

\begin{lstlisting}[language=python]
words = input.split(' ')

# Initially a spelling suggestion can be proposed for all words.
readonly_words = [False] * len(words)

# Attempt to find wrongly spelled words (at most three passes).
for attempt in range(0, 3):
    modified = False

    for word_index, word in enumerate(words):
        # Skip words that were modified before or that are adjacent
        # to a previously modified word.
        if readonly_words[word_index]:
            continue

        # Find the likelihood of the current word (described later).
        rating_best = getLikelihood(word)

        # Generate a list of candidate words (described later).
        candidates = getCandidateWords(word)

        # Initially there is no best candidate.
        bestCandidate = None

        # Iterate through the candidates and remember the one that is
        # more likely than the current word and all other candidates.
        for candidate in candidates:
            rating = getLikelihood(candidate)
            if rating > rating_best:
                rating_best = rating
                bestCandidate = candidate

        if bestCandidate is not None:
            words[word_index] = bestCandidate
            # Ensure that this word and its direct neighbours are not
            # changed again in later iterations. (Bounds checks are
            # omitted for brevity.)
            readonly_words[word_index-1:word_index+2] = [True] * 3
            modified = True

    if not modified:
        # No word was replaced in this pass; assume that the sentence
        # has no more spelling errors.
        break

# words now contains the sentence with misspelled words corrected
\end{lstlisting}

\subsection{Generation of candidate words}
A naive approach to generating candidate words in the
\texttt{getCandidateWords()} function is to generate all possible strings over
the alphabet and take the intersection with the dictionary. This does not
scale: it uses $O(27^n)$ memory and time, which for a 10-letter word is
already on the order of $10^{14}$ strings.

Instead, we chose a smarter approach. Since we know that a misspelled word has
a Damerau-Levenshtein distance of 1 to its correction, candidates are
generated by cycling through each letter and applying an insertion, a
deletion, a substitution or a transposition.

While performing these operations, the noisy channel probability can also be
obtained, since we have a confusion matrix and know which kind of operation
was executed \cite{kernighan}.

Pseudo-code (bounds checks are omitted; assume that out-of-range operations do
not generate words. The actual Java implementation has separate methods for
each of the operations):

\begin{lstlisting}[language=python]
def getCandidateWords(word):
    # Walk through the letters of a word. Words that are not in the
    # vocabulary are dropped. While generating words, a noisy channel
    # probability is also calculated and attached to the word.
    n = len(word)
    for i in range(n):
        # Insertion of a new character at position i.
        for character in ALPHABET:
            yield word[:i] + character + word[i:]

        # Deletion of the character at position i.
        yield word[:i] + word[i+1:]

        # Substitution of the character at position i.
        for character in ALPHABET:
            yield word[:i] + character + word[i+1:]

        # Transposition of the characters at positions i and i+1.
        if i + 1 < n:
            yield word[:i] + word[i+1] + word[i] + word[i+2:]
\end{lstlisting}
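
To illustrate how the channel probability could be derived during generation,
the sketch below looks up a deletion in the confusion matrix. The dictionaries
\texttt{deletion\_matrix} and \texttt{char\_bigram\_count} are assumed data
structures of ours, not names from the assignment:

\begin{lstlisting}[language=python]
def deletionProbability(prev_char, deleted_char):
    # del[xy]: how often the pair 'xy' was typed as just 'x',
    # taken from the confusion matrix.
    error_count = deletion_matrix[(prev_char, deleted_char)]
    # count[xy]: how often the pair 'xy' occurs in the corpus.
    total_count = char_bigram_count[(prev_char, deleted_char)]
    return error_count / total_count
\end{lstlisting}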


\subsection{Evaluation of words}
The most important part of the algorithm is the actual rating of a word. For
this, we combine the channel model probability with the language model
probability. The channel model probability $P(x|w)$ is obtained via
\texttt{getCandidateWords()}, while the language model probability $P(w)$
depends on the frequency of words (and their predecessors) in a corpus.

The evaluation of a word calculates $P(x|w) \cdot P(w)$. Here the channel
probability $P(x|w)$ is calculated as $\frac{correctionCount}{errorsCount}$,
where $correctionCount$ is the number of times the intended word $w$ was typed
as the error $x$, and $errorsCount$ is the total number of those errors.

Intuitively, bigrams (or trigrams) are more precise than unigrams because they
are more specific and use more context. However, there are also fewer bigram
and even fewer trigram occurrences to draw information from. This sparsity
could skew the results, so while bigram evidence should be weighted more
heavily than unigram evidence, its weight parameter should not be set too
high.
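
A sketch of how these pieces could be combined into a single rating follows.
The interpolation weight \texttt{LAMBDA}, the candidate attribute
\texttt{channel\_probability} and this context-aware variant of
\texttt{getLikelihood} are our own assumptions, not part of the assignment:

\begin{lstlisting}[language=python]
LAMBDA = 0.8  # weight of the bigram evidence, tuned by hand

def languageModel(words, i, candidate):
    # Interpolate bigram and unigram evidence so that sparse bigram
    # counts do not dominate the rating completely.
    p_uni = unigramProbability(candidate)
    if i > 0:
        p_bi = bigramProbability(words[i - 1], candidate)
    else:
        p_bi = p_uni
    return LAMBDA * p_bi + (1 - LAMBDA) * p_uni

def getLikelihood(words, i, candidate):
    # Noisy channel rating: P(x|w) * P(w). The channel probability
    # was attached to the candidate during generation.
    return candidate.channel_probability * \
           languageModel(words, i, candidate.text)
\end{lstlisting}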


\section{Results and Explanation}

\section{Additional Resources}

\section{Statement of the Contributions}
Peter wrote most of the Spell Checker implementation and report.

\section{Manual}

\begin{thebibliography}{9}
\bibitem{kernighan}
	Kernighan, M.~D., Church, K.~W., and Gale, W.~A.
	\emph{A Spelling Correction Program Based on a Noisy Channel Model}.
	Proceedings of COLING, 1990.
\end{thebibliography}

\end{document}