From f01e7e5769d568293bf31b15d7314740a0f281da Mon Sep 17 00:00:00 2001
From: Peter Wu
Date: Wed, 1 Apr 2015 07:18:22 +0200
Subject: Initial check-in of report

---
 report.tex | 200 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 200 insertions(+)
 create mode 100644 report.tex

diff --git a/report.tex b/report.tex
new file mode 100644
index 0000000..84771f0
--- /dev/null
+++ b/report.tex
@@ -0,0 +1,200 @@
+\documentclass[a4paper]{article}
+
+\usepackage[english]{babel}
+\usepackage[utf8]{inputenc}
+\usepackage{amsmath}
+\usepackage{hyperref}
+\usepackage{algorithm}
+\usepackage{listings}
+\usepackage{xcolor}
+\usepackage{color,courier}
+
+\definecolor{dkgreen}{rgb}{0,0.55,0}
+\definecolor{gray}{rgb}{0.5,0.5,0.5}
+\definecolor{mauve}{rgb}{0.58,0,0.82}
+\lstset{language=Python,
+linewidth=80em,
+breaklines=true,
+numbers=left,
+basicstyle=\ttfamily\footnotesize,
+showstringspaces=false,
+numberstyle=\tiny\color{gray},
+keywordstyle=\color{blue},
+commentstyle=\color{dkgreen},
+stringstyle=\color{mauve},
+}
+
+\title{Spell checker}
+
+\author{2ID90 Artificial Intelligence \\ Team 08 \\ Wilco Brouwer (0816718) \\ Peter Wu (0783206)}
+
+\date{\today}
+
+\begin{document}
+\maketitle
+
+\section{Introduction}
+In this report we describe our solution to the second assignment, which
+consists of designing a spell checker algorithm and implementing it in Java.
+The algorithm is given a sentence that contains zero to two misspelled words,
+each with a Damerau-Levenshtein edit distance of 1 to the intended word. In
+addition, misspelled words are never consecutive.
+
+The expected output is the sentence with these errors corrected.
+
+\section{Approach}
+To remove spelling errors from a text, one has to generate possible
+corrections and evaluate them. Spelling suggestions are made on a word-by-word
+basis.
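To make the "edit distance of 1" criterion concrete, the following Python sketch computes the restricted Damerau-Levenshtein distance (edits plus adjacent transpositions). It is illustrative only; this helper and its name are ours, not part of the Java implementation.

```python
def damerau_levenshtein(a, b):
    # Dynamic-programming edit distance that also counts the swap of two
    # adjacent characters (a transposition) as a single operation.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]

print(damerau_levenshtein("acress", "across"))  # substitution -> 1
print(damerau_levenshtein("thier", "their"))    # transposition -> 1
```

Every candidate correction considered below has distance exactly 1 to the typed word under this measure.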
+
+To obtain a list of possible corrections, \textit{candidate words} are
+generated with a Damerau-Levenshtein distance of 1 to the original word. While
+doing this, a noisy channel probability can also be obtained, which is needed
+later for word evaluation.
+
+Each word of the sentence is then evaluated against its context, that is, the
+previous and next words for the n-grams. Since we only have the guarantee that
+the words adjacent to a misspelled word are correct, trigrams cannot reach
+beyond the word directly before or after the current word.
+
+Finally, the most likely word is chosen. If it differs from the original word,
+the search continues for a possible second misspelled word.
+
+\subsection{Overview}
+The program is given a single sentence, which is split into words. The indices
+of words that are certainly correct are remembered (\texttt{readonly\_words}
+in the listing below). Once a rating is found that is higher than the initial
+one, the word is replaced.
+
+When a word is modified, the hunt for more misspelled words continues until
+all of them (at most two) are found. Otherwise, it is assumed that the
+sentence contains no more spelling errors.
+
+Pseudo-code:
+
+\begin{lstlisting}[language=python]
+words = input.split(' ')
+
+# Initially a spelling suggestion can be proposed for all words.
+readonly_words = [False] * len(words)
+
+# Attempt to find wrongly spelled words (at most two are expected).
+for attempt in range(0, 3):
+
+    for word_index, word in enumerate(words):
+        # Skip words that were modified before or that are adjacent to a
+        # previously corrected word.
+        if readonly_words[word_index]:
+            continue
+
+        # Find the likelihood of the current word (described later).
+        rating_old = getLikelihood(word)
+
+        # Generate a list of candidate words (described later).
+        candidates = getCandidateWords(word)
+
+        # Initially there is no best word.
+        bestCandidate = None
+
+        # Iterate through the candidates and keep the most likely one,
+        # if it beats the current word.
+        for candidate in candidates:
+            # Calculate the likelihood for the candidate word. If it is
+            # better than the best rating so far, choose it.
+            rating = getLikelihood(candidate)
+            if rating > rating_old:
+                rating_old = rating
+                bestCandidate = candidate
+
+        if bestCandidate is not None:
+            words[word_index] = bestCandidate
+            # Ensure that this word and its neighbours are not changed
+            # again in later iterations. (Bounds checks are omitted for
+            # brevity.)
+            readonly_words[word_index-1..word_index+1] = True
+        else:
+            # No more suggestions, abort.
+            break
+
+# words now contains the sentence with misspelled words corrected
+\end{lstlisting}
+
+\subsection{Generation of candidate words}
+A naive approach to generating candidate words in the $getCandidateWords()$
+function is to generate all possible combinations of the alphabet and then
+take the intersection with the dictionary. This does not scale: it needs
+$O(27^n)$ memory and time.
+
+Instead we chose a smarter approach. As we know that a misspelled word has a
+Damerau-Levenshtein distance of 1 to its correction, candidates are generated
+by cycling through the letters of the word and applying an insertion,
+deletion, substitution or transposition at each position.
+
+While doing these operations, the noisy channel probability can also be
+obtained, since we have a confusion matrix and know what kind of operation was
+performed.
+
+Pseudo-code (bounds checks are omitted; assume that out-of-range operations do
+not generate words. The actual Java implementation has separate methods for
+each of the operations):
+
+\begin{lstlisting}[language=python]
+def getCandidateWords(word):
+    # Walk through the letters of a word. Words that are not in the
+    # vocabulary will be dropped. While generating words, a noisy channel
+    # probability is also calculated and attached to the word.
+    for i in word[0..n]:
+        # Insertion of a new character at position i.
+        for character in ALPHABET:
+            yield word[0..i] character word[i..n]
+
+        # Deletion of the current character.
+        yield word[0..i] word[i+1..n]
+
+        # Substitution of the current character.
+        for character in ALPHABET:
+            yield word[0..i] character word[i+1..n]
+
+        # Transposition of the current and next characters.
+        yield word[0..i] word[i+1] word[i] word[i+2..n]
+\end{lstlisting}
+
+
+\subsection{Evaluation of words}
+The most important part of the algorithm is the actual rating of a word. For
+this, we combine the channel model probability with the language model
+probability. The channel model probability $P(x|w)$ is obtained via
+$getCandidateWords()$, while the language model probability $P(w)$ depends on
+the frequency of the word and its preceding words in a corpus.
+
+The evaluation of a word essentially calculates $P(x|w) \cdot P(w)$. Here
+$P(x|w)$ is estimated as $\frac{correctionCount}{errorsCount}$, where
+$correctionCount$ is the number of occurrences of the correction $w$ given
+this error $x$, and $errorsCount$ is the total number of those errors.
+
+Intuitively, bigrams (and trigrams) are more precise than unigrams because
+they are more specific and use more context. However, there are also fewer
+bigrams and even fewer trigrams to get information from. This could skew the
+results, so while higher-order n-grams should be weighted more heavily than
+unigrams, this parameter should not be set too high.
+
+
+\section{Results and Explanation}
+
+\section{Additional Resources}
+
+\section{Statement of the Contributions}
+Peter wrote most of the Spell Checker implementation and report.
+
+\section{Manual}
+
+\section{References}
+
+%\begin{thebibliography}{9}
+%\bibitem{kernighan}
+%  Kernighan et al.
+%  \emph{A Spelling Correction Program Based on a Noisy Channel Model}
+%  1990
+%\end{thebibliography}
+
+\end{document}
-- 
cgit v1.2.1