authorPeter Wu <peter@lekensteyn.nl>2015-04-01 07:18:22 +0200
committerPeter Wu <peter@lekensteyn.nl>2015-04-01 07:18:22 +0200
commitf01e7e5769d568293bf31b15d7314740a0f281da (patch)
treee7225dd86d96c8bcfbee579820bfdf85e765ff5d
parentadb460408017c6344ee0f836aa56ce25143d9d6b (diff)
downloadassignment4-f01e7e5769d568293bf31b15d7314740a0f281da.tar.gz
Initial check-in of report
-rw-r--r--  report.tex  200
1 files changed, 200 insertions, 0 deletions
diff --git a/report.tex b/report.tex
new file mode 100644
index 0000000..84771f0
--- /dev/null
+++ b/report.tex
@@ -0,0 +1,200 @@
+\documentclass[a4paper]{article}
+
+\usepackage[english]{babel}
+\usepackage[utf8]{inputenc}
+\usepackage{amsmath}
+\usepackage{hyperref}
+\usepackage{algorithm}
+\usepackage{xcolor}
+\usepackage{listings}
+\usepackage{color,courier}
+
+\definecolor{dkgreen}{rgb}{0,0.55,0}
+\definecolor{gray}{rgb}{0.5,0.5,0.5}
+\definecolor{mauve}{rgb}{0.58,0,0.82}
+\lstset{language=Python,
+linewidth=80em,
+breaklines=true,
+numbers=left,
+basicstyle=\ttfamily\footnotesize,
+showstringspaces=false,
+numberstyle=\tiny\color{gray},
+keywordstyle=\color{blue},
+commentstyle=\color{dkgreen},
+stringstyle=\color{mauve},
+}
+
+\title{Spell checker}
+
+\author{2ID90 Artificial Intelligence \\ Team 08 \\ Wilco Brouwer (0816718) \\ Peter Wu (0783206)}
+
+\date{\today}
+
+\begin{document}
+\maketitle
+
+\section{Introduction}
+In this report we describe our solution to the second assignment, which
+consists of designing a spell-checking algorithm and implementing it in Java.
+The algorithm is given a sentence containing zero to two misspelled words, each
+with a Damerau-Levenshtein edit distance of 1 to its correct form. In addition,
+misspelled words are never consecutive.
+
+The expected output is a sentence where these errors have been corrected.
+
+\section{Approach}
+To remove spelling errors from a text, one has to generate possible corrections
+and evaluate them. Spelling suggestions are done on a word-by-word basis.
+
+To obtain a list of possible corrections, \textit{candidate words} are
+generated with a Damerau-Levenshtein distance of 1 to the original word. While
+doing this, a noisy channel probability can also be obtained, which is needed
+later for word evaluation.
+
+Each word of the sentence is then evaluated against its context, that is, the
+previous and next words for n-grams. Since we are only guaranteed that the
+words directly adjacent to a misspelled word are correct, tri-grams must not
+reach out two words before or after the current word, as those positions may
+themselves contain an error.
+
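+As an illustration (the notation here is ours, not taken from the assignment
+text): for a candidate $c$ at position $i$, the bigram factors
+\begin{equation*}
+P(c \mid w_{i-1}) \quad \text{and} \quad P(w_{i+1} \mid c)
+\end{equation*}
+only involve the directly adjacent words, which are known to be correct,
+whereas a tri-gram factor such as $P(c \mid w_{i-2}\, w_{i-1})$ may include a
+second error at position $i-2$.
+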
+Finally, the most likely word is chosen. If it differs from the original word,
+the search continues for a possible second misspelled word.
+
+\subsection{Overview}
+The program is given a single sentence, which is split into words. The indices
+of words that are known to be correct are remembered (\texttt{readonly\_words}
+in the listing below). Once a candidate is found whose rating is higher than
+that of the original word, the word is replaced.
+
+When a word has been modified, the search for further misspelled words
+continues until all of them (at most two) are found. Otherwise, it is assumed
+that the sentence contains no more spelling errors.
+
+Pseudo-code:
+
+\begin{lstlisting}[language=python]
+# sentence is the input sentence as a string.
+words = sentence.split(' ')
+
+# Initially a spelling suggestion can be proposed for all words.
+readonly_words = [False] * len(words)
+
+# Attempt to find wrongly spelled words.
+for attempt in range(0, 3):
+    changed = False
+
+    for word_index, word in enumerate(words):
+        # Skip words that were modified before or that are adjacent to a
+        # previously modified word.
+        if readonly_words[word_index]:
+            continue
+
+        # Likelihood of the current word (described later).
+        rating_old = getLikelihood(word)
+
+        # Generate a list of candidate words (described later).
+        candidates = getCandidateWords(word)
+
+        # Initially there is no best candidate.
+        bestCandidate = None
+        bestRating = rating_old
+
+        # Find the most likely candidate, if any is more likely than the
+        # current word.
+        for candidate in candidates:
+            rating = getLikelihood(candidate)
+            if rating > bestRating:
+                bestRating = rating
+                bestCandidate = candidate
+
+        if bestCandidate is not None:
+            words[word_index] = bestCandidate
+            changed = True
+            # Ensure that this word and its direct neighbours are not
+            # changed again in later iterations. (Bounds checks are
+            # omitted for brevity.)
+            readonly_words[word_index-1:word_index+2] = [True] * 3
+
+    if not changed:
+        # No suggestion was accepted in this pass; assume the sentence
+        # has no more spelling errors.
+        break
+
+# words now contains the sentence with misspelled words corrected
+\end{lstlisting}
+
+\subsection{Generation of candidate words}
+A naive approach to generating candidate words in the
+\texttt{getCandidateWords()} function is to generate all possible strings over
+the alphabet and then take the intersection with the dictionary. This does not
+scale: it requires $O(27^n)$ memory and time, which for a nine-letter word
+already amounts to roughly $27^9 \approx 7.6 \cdot 10^{12}$ combinations.
+
+Instead we chose a smarter approach. Since we know that a misspelled word has a
+Damerau-Levenshtein distance of 1 to its correction, candidates are generated
+by walking over the letters of the word and applying an insertion, deletion,
+substitution or transposition at each position.
+
+While applying these operations, the noisy channel probability can also be
+obtained, since we have a confusion matrix and know which kind of operation was
+applied (a sketch of this lookup follows the pseudo-code below).
+
+Pseudo-code (the actual Java implementation has separate methods for each of
+the operations):
+
+\begin{lstlisting}[language=python]
+def getCandidateWords(word):
+    # Walk over the positions of the word. Candidates that are not in
+    # the vocabulary are dropped later. While generating words, a noisy
+    # channel probability is also calculated and attached to each word.
+    n = len(word)
+    for i in range(n + 1):
+        # Insertion of a new character at position i (including the end).
+        for character in ALPHABET:
+            yield word[:i] + character + word[i:]
+
+        if i < n:
+            # Deletion of the character at position i.
+            yield word[:i] + word[i+1:]
+
+            # Substitution of the character at position i.
+            for character in ALPHABET:
+                yield word[:i] + character + word[i+1:]
+
+        if i + 1 < n:
+            # Transposition of the characters at positions i and i+1.
+            yield word[:i] + word[i+1] + word[i] + word[i+2:]
+\end{lstlisting}
+
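+As an illustration of how the channel probability can be looked up during
+candidate generation, the sketch below assumes confusion-matrix counts stored
+per operation; the names \texttt{DEL\_MATRIX}, \texttt{INS\_MATRIX} and
+\texttt{charCount} are ours and do not correspond to identifiers in the actual
+implementation.
+
+\begin{lstlisting}[language=python]
+def deletionProbability(prev_char, deleted_char):
+    # Count of this particular deletion in the error corpus, divided by
+    # how often the corrected two-character sequence occurs in the corpus.
+    errors = DEL_MATRIX.get((prev_char, deleted_char), 0)
+    total = charCount(prev_char + deleted_char)
+    return errors / total if total > 0 else 0.0
+
+def insertionProbability(prev_char, inserted_char):
+    # Count of this particular insertion in the error corpus, divided by
+    # how often the preceding character occurs in the corpus.
+    errors = INS_MATRIX.get((prev_char, inserted_char), 0)
+    total = charCount(prev_char)
+    return errors / total if total > 0 else 0.0
+\end{lstlisting}
+
+Substitution and transposition follow the same pattern with their own matrices.
+Each generated candidate carries the probability of the single operation that
+produced it.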
+
+\subsection{Evaluation of words}
+The most important part of the algorithm is rating a word. For this, we combine
+the channel model probability with the language model probability. The channel
+model probability $P(x|w)$ is obtained via \texttt{getCandidateWords()}, while
+the language model probability $P(w)$ depends on the frequency of the word and
+its preceding words in a corpus.
+
+The evaluation of a word essentially calculates $P(x|w) \cdot P(w)$. The
+channel probability $P(x|w)$ is calculated as $\frac{errorCount}{correctionCount}$,
+where $errorCount$ is the number of times this particular error was observed
+for the correction (taken from the confusion matrix) and $correctionCount$ is
+the number of occurrences of the corrected characters in the corpus.
+
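+A minimal sketch of how the rating used by \texttt{getLikelihood()} could
+combine the two probabilities is shown below. The n-gram count tables
+(\texttt{UNIGRAM\_COUNTS}, \texttt{BIGRAM\_COUNTS}), the add-one smoothing and
+the extended signature (the pseudo-code above passes only the word) are our own
+simplification and do not necessarily match the actual Java implementation.
+
+\begin{lstlisting}[language=python]
+def languageModelProbability(prev_word, word):
+    # P(word | prev_word) from bigram counts, with add-one smoothing so
+    # that unseen combinations do not collapse the rating to zero.
+    bigrams = BIGRAM_COUNTS.get((prev_word, word), 0)
+    unigrams = UNIGRAM_COUNTS.get(prev_word, 0)
+    return (bigrams + 1) / (unigrams + len(UNIGRAM_COUNTS))
+
+def getLikelihood(candidate, prev_word, channel_probability):
+    # Rating = channel model probability * language model probability.
+    return channel_probability * languageModelProbability(prev_word, candidate)
+\end{lstlisting}
+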
+Intuitively, bigrams (or trigrams) are more precise than unigrams because they
+are more specific and use more of the context. However, there are also fewer
+bigram and even fewer trigram observations to draw information from, which can
+skew the results. So while bigram information should be weighted more heavily
+than unigram information, this weight should not be set too high.
+
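+One common way to balance this is to interpolate the unigram and bigram
+estimates linearly, building on the \texttt{languageModelProbability} sketch
+above. The weight \texttt{LAMBDA} and the corpus size
+\texttt{TOTAL\_WORD\_COUNT} below are names of our own choosing and their
+values are not prescribed by the assignment.
+
+\begin{lstlisting}[language=python]
+# Weight for the bigram estimate; the remainder goes to the unigram estimate.
+LAMBDA = 0.8
+
+def interpolatedProbability(prev_word, word):
+    # Mix the more specific (but sparser) bigram estimate with the more
+    # robust unigram estimate.
+    p_bigram = languageModelProbability(prev_word, word)
+    p_unigram = UNIGRAM_COUNTS.get(word, 0) / TOTAL_WORD_COUNT
+    return LAMBDA * p_bigram + (1 - LAMBDA) * p_unigram
+\end{lstlisting}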
+
+\section{Results and Explanation}
+
+\section{Additional Resources}
+
+\section{Statement of the Contributions}
+Peter wrote most of the spell checker implementation and the report.
+
+\section{Manual}
+
+\section{References}
+
+%\begin{thebibliography}{9}
+%\bibitem{kernighan}
+% Kernighan et al.
+% \emph{A Spelling Correction Program Based on a Noisy Channel Model}
+% 1990
+%\end{thebibliography}
+
+\end{document}