author    Peter Wu <peter@lekensteyn.nl>    2015-04-03 22:56:33 +0200
committer Peter Wu <peter@lekensteyn.nl>    2015-04-03 22:56:33 +0200
commit    f469c98f2f7c4b0c2c93fc4618b66a6bce725916 (patch)
tree      483c24114ddeb07d1d396e0a39b6192a420c5901
parent    d4b12e91df83753913cd9c9535dc34431ff0d323 (diff)
download  assignment4-f469c98f2f7c4b0c2c93fc4618b66a6bce725916.tar.gz
Report: update with non-greedy approach and more
-rw-r--r--  report.tex  193
1 file changed, 126 insertions, 67 deletions
diff --git a/report.tex b/report.tex
index 6d7932d..84aef14 100644
--- a/report.tex
+++ b/report.tex
@@ -45,26 +45,39 @@ The expected output is a sentence where these errors have been corrected.
\section{Approach}
To remove spelling errors from a text, one has to generate possible corrections
-and evaluate these. Spelling suggestions are done on a word-by-word basis.
-
-To obtain a list of possible words, \textit{candidate words} will be generated with
-a Damerau-Levenshtein distance of 1 to the original word. While doing this, a
-noisy channel probability can also be obtained which is needed later on for word
-evaluation.
-
-For each word of the sentence, the word will be evaluated against the
-context of the word. This includes the previous and next words for n-grams.
-Since we only have the guarantee that the words next to the misspelled one are
-correct, tri-grams cannot reach out for the following two words before or after
-the current word.
-
-Finally the most likely word may be chosen. If this is different from the
-original word, then the search may continue to find a possible second and third
-misspelled word.
+and evaluate these. Spelling suggestions are made on a word-by-word basis.
+
+For a possibly misspelled word, a set of \textit{candidate words} that can
+substitute it has to be generated. The method used to obtain these candidates
+is described in section \ref{section:wordsgen}.
+
+The sentences resulting from the substitution of a single word are evaluated
+as a whole. This evaluation is based on Kernighan's Noisy Channel Model
+\cite{kernighan} and is further described in section \ref{section:wordeval}. The
+result of such an evaluation is a probability that the sentence is correct.
+Note that this number can only be used to compare similar sentences that have
+the same number of words and differ in only one word. Comparing sentences of
+different lengths would most likely favor the shorter one, as it has fewer
+words affecting the total probability (this situation does not occur here
+because words can be modified, but not added or removed). Comparing completely
+different sentences could favor a sentence with errors (but popular n-grams)
+over one that uses less common words. This also does not occur because the
+task assumes that words next to an error are correct and that at most two
+errors exist.
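+
+As an illustration, the evaluation in section \ref{section:wordeval} combines
+the individual word scores of a sentence $s = w_1 \ldots w_k$ into a single
+product,
+\[
+  \mathit{score}(s) = \prod_{j=1}^{k} \mathit{score}(w_j),
+\]
+so a substitution changes not only the score of the substituted word but also
+the n-gram scores of its neighbors, and only such products over sentences of
+equal length are comparable.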
+
+If a better sentence is found after evaluating all possible substitutions for
+a word, the search recurses on the new sentence. Under the assumptions of the
+task, at most two recursions are needed, since there are at most two errors.
+Initially we tried to continue only with the single best word. This greedy
+approach may however miss corrections, because a substitution that is not
+locally the best can still yield the best overall sentence once the n-grams of
+its neighbors are taken into account.
+
+Finally the sentence with the best overall rating is returned. This sentence
+may be unmodified if no substitution was rated better than the original
+sentence.
\subsection{Overview}
-The program is given a single sentence which is split in words. The indices of
-words that are surely correct are remembered (\texttt{readonly\_words} in the
+
+The indices of
+words that were already modified are remembered (\texttt{modifiedIndices} in the
listing below). Once a rating is found which is higher than the initial one,
the word is replaced.
@@ -75,51 +88,63 @@ has no more spelling errors.
Pseudo-code:
\begin{lstlisting}[language=python]
-words = input.split(' ')
-
-# Initially a spelling suggestion can be proposed for all words.
-readonly_words = [False] * len(words)
-
-# Attempt to find wrongly spelled words.
-for attempt in range(0, 3):
-
- for word_index, word in enumerate(words):
+# Attempt to find a better sentence by modifying words.
+def findBetterWord(words, modifiedIndices):
+ # Rate the initial sentence to compare it with alternatives.
+ bestRating = evaluateSentence(words)
+ bestSentence = words
+
+ # For each word in the sentence, try alternatives for a better rating.
+ for wordIndex, word in enumerate(words):
# Skip words if these were modified before or if they were
- # consecutive to a previous word.
- if readonly_words[word_index]:
+ # consecutive to a previous word (canModifyWord definition
+ # is omitted for brevity).
+ if not canModifyWord(modifiedIndices, wordIndex):
continue
- # Finds the likelihood of the current word, it is described later
- rating_old = getLikelihood(word)
-
# Generates a list of candidate words, it is described later.
        candidates = getCandidateWords(word)
- # Initially there is no best word.
+ # Iterates through the candidates and remember the modification
+ # that gave the best word. Initially assume that no candidate
+ # results in a better sentence.
bestCandidate = None
-
- # Iterates through the candidates until one is found that is more
- # likely than the current word.
for candidate in candidates:
# Calculates the likelihood for the candidate word. If it is
- # better than the current word, then it is chosen.
- rating = getLikelihood(candidate)
- if rating > rating_old:
+ # better than the current word, remember it.
+            rating = evaluateSentence(
+                words[0..wordIndex-1] + [candidate] + words[wordIndex+1..n])
+            if rating > bestRating:
bestCandidate = candidate
+ bestRating = rating
+ # If a better word is found, use it and try to find more errors.
if bestCandidate is not None:
- word[word_index = bestCandidate
- # Ensure that these words are not changed again in later
- # iterations. (Bounds checks are omitted for brevity.)
- readonly_words[word_index-1..word_index+1] = 0
- else:
- # no more suggestions, abort
- break
-
-# words now contains the sentence with misspelled words corrected
+            words[wordIndex] = bestCandidate
+ # Ensure that this word (and its context) cannot be changed
+ # given the assumption that no consecutive words have an error.
+ modifiedIndices.append(wordIndex)
+
+ # If there are more typos, try to find them.
+            if len(modifiedIndices) < MAX_TYPOS:
+ # Recurse with the modified sentence to find better words
+                newSentence = findBetterWord(words, modifiedIndices)
+ if evaluateSentence(newSentence) > bestRating:
+ bestSentence = newSentence
+
+ # Make words writable for further iterations.
+ modifiedIndices.pop()
+ # else no more suggestions, do not recurse and try the next word.
+
+ # Finish with the best sentence (possibly unmodified).
+ return bestSentence
+
+# Split the input on spaces and find the best sentence.
+words = input.split(' ')
+bestSentence = findBetterWord(words, [])
\end{lstlisting}
\subsection{Generation of candidate words}
+\label{section:wordsgen}
A naive approach to generate candidate words in the $getCandidateWords()$
function is to generate all possible combinations of the alphabet, and then take
the intersection with the dictionary. This does not scale and costs $O(27^n)$
@@ -128,7 +153,8 @@ memory and time.
Instead we opted for a smarter approach. As we know that the misspelled words
have a Damerau-Levenshtein distance of 1, candidates are generated by cycling
through each letter and applying an insertion, deletion, substitution or
-transposition.
+transposition. We also know that all corrected words are in the vocabulary, so
+other words can be dropped from the candidates list.
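+For example, a word of $n$ letters and an alphabet of 27 characters yield
+roughly $27(n+1)$ insertions, $n$ deletions, $27n$ substitutions and $n-1$
+transpositions, i.e.\ on the order of $55n$ mutations instead of the $27^n$
+combinations of the naive approach.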
While doing these operations, the Noisy Channel Probability can also be obtained
since we have a confusion matrix and know what kind of alteration was executed.
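+For instance, following \cite{kernighan}, the channel probability of a
+candidate obtained by deleting the character at position $i$ can be estimated
+from the confusion matrix as
+\[
+  P(x|w) = \frac{del[w_{i-1}, w_i]}{count[w_{i-1} w_i]},
+\]
+where $del[w_{i-1}, w_i]$ counts how often $w_i$ was deleted after $w_{i-1}$ in
+the training data; the insertion, substitution and transposition matrices are
+used analogously.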
@@ -138,10 +164,8 @@ generate words. The actual Java implementation has decomposed methods for each
of the operations):
\begin{lstlisting}[language=python]
-def getCandidateWords(word):
- # Walk through the letters of a word. Words that are not in a
- # vocabulary will be dropped. While generating words, a Noisy Channel
- # Probability will also be calculated and attached to the word.
+def getWordMutations(word):
+ # Walk through the letters of a word and mutate those.
for i in word[0..n]:
# Insertion of a new character at position i.
for character in ALPHABET:
@@ -157,27 +181,59 @@ def getCandidateWords(word):
# Transposition of the previous and current characters.
for character in ALPHABET:
yield word[0..i-1] word[i+1] word[i] word[i+2..n]
+
+def getCandidateWords(word):
+ # Initially there are no candidate words
+    candidateWords = []
+ # Start with an empty map for words -> channel probability
+ channelProbs = {}
+
+ # Obtain all possible word mutations and pick valid ones.
+ for mutation, candidateWord in getWordMutations(word):
+ # Skip words not in vocabulary or unmodified words:
+ if candidateWord not in vocabulary or word == candidateWord:
+ continue
+
+ # Word is in vocabulary, keep it.
+ candidateWords.append(candidateWord)
+
+ # Find the channel probability. As multiple mutations can
+ # result in the same word, remember the modification that
+ # gives the highest rating. Do not sum those as it can
+ # result in a bias for very small words.
+ channelProbs[candidateWord] = max(
+            channelProbs.get(candidateWord, 0),
+ calculateChannelProbability(mutation)
+ )
+
+ return candidateWords
\end{lstlisting}
\subsection{Evaluation of words}
+\label{section:wordeval}
The most important part of the algorithm is actually rating a word. For this, we
will combine the channel model probability with the language model probability.
The channel model probability $P(x|w)$ is obtained via $getCandidateWords()$
while the language model probability $P(w)$ depends on the frequency of prior
-words from a corpus.
-
-The evaluation of words basically calculates $P(x|w) * P(w)$. $P(w)$ is
-calculated using $\frac{correctionCount}{errorsCount}$ where $correctionCount$
-is the number of occurrences of the correction $x$ given an error, and
-$errorsCount$ is the number of those errors.
+n-grams from a corpus. The combination gives the word score $P(x|w) * P(w)$.
+The prior n-grams are combined by interpolation with lambdas that are
+conditional on the context \cite{interpolation}.
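+A sketch of such an interpolation, with weights $\lambda_i$ that sum to one
+and depend on the available context, is
+\[
+  P(w_i \mid w_{i-2} w_{i-1}) = \lambda_1 P(w_i)
+    + \lambda_2 P(w_i \mid w_{i-1})
+    + \lambda_3 P(w_i \mid w_{i-2} w_{i-1}).
+\]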
Intuitively bigrams (or trigrams) are more precise than unigrams because they
are more specific and use more details from the context. However, there are also
fewer bigrams and even fewer trigrams to get information about. This could cause a
-skew in the results, so while it should be weighted more than unigrams, this
-parameter should not get too high.
+skew in the results, so while these should be weighted more heavily than
+unigrams, the lambda weight parameter should not become too large.
+
+Due to the use of n-grams, the score of a sentence must be considered as a
+whole. This is the case because the $n - 1$ words before and after a
+substituted word can form new n-grams which have different probabilities.
+Finally, all word scores are combined by simple multiplication. For this
+reason, words which are certainly incorrect (not in the vocabulary) should not
+be scored with 0, but with a very small number (such as $10^{-99}$). This
+ensures that sentences which still contain a second error have a chance of
+being rated as a better result.
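+
+A minimal sketch of such an evaluation is shown below. The helper names, the
+extra parameters and the lambda values are illustrative assumptions, not the
+actual Java implementation; trigrams would be added to the interpolation in
+the same way.
+\begin{lstlisting}[language=python]
+FLOOR = 1e-99  # score for words that are not in the vocabulary
+
+def wordScore(word, prevWord, channelProb, unigrams, bigrams,
+              lambda1=0.3, lambda2=0.7):
+    # Words outside the vocabulary get a tiny, non-zero score.
+    if word not in unigrams:
+        return FLOOR
+    # Unigram probability of the word itself.
+    pUnigram = unigrams[word] / sum(unigrams.values())
+    # Bigram probability of the word given its predecessor, if known.
+    pBigram = 0.0
+    if prevWord in unigrams and (prevWord, word) in bigrams:
+        pBigram = bigrams[(prevWord, word)] / unigrams[prevWord]
+    # Interpolated language model probability P(w).
+    pLanguage = lambda1 * pUnigram + lambda2 * pBigram
+    # Combine with the channel model probability P(x|w).
+    return channelProb * pLanguage
+
+def evaluateSentence(words, channelProbs, unigrams, bigrams):
+    # The sentence score is the product of all word scores; words that
+    # were not substituted are assumed to keep a channel probability of 1.
+    score = 1.0
+    for i, word in enumerate(words):
+        prevWord = words[i - 1] if i > 0 else None
+        channelProb = channelProbs.get(word, 1.0)
+        score *= wordScore(word, prevWord, channelProb, unigrams, bigrams)
+    return score
+\end{lstlisting}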
\section{Results and Explanation}
@@ -190,11 +246,14 @@ Peter wrote most of the Spell Checker implementation and report.
\section{References}
-%\begin{thebibliography}{9}
-%\bibitem{kernighan}
-% Kernighan et al.
-% \emph{A Spelling Correction Program Based on a Noisy Channel Model}
-% 1990
-%\end{thebibliography}
+\begin{thebibliography}{9}
+\bibitem{kernighan}
+Kernighan et al.,
+\emph{A Spelling Correction Program Based on a Noisy Channel Model},
+1990.
+\bibitem{interpolation}
+Daniel Jurafsky,
+presentation on interpolation techniques,
+\texttt{https://class.coursera.org/nlp/lecture/19}.
+\end{thebibliography}
\end{document}