From d6fe6c9fd2523ea19f80a1ca523e76719951b00c Mon Sep 17 00:00:00 2001 From: Wilco Date: Thu, 2 Apr 2015 11:48:27 +0200 Subject: several fixes in report --- report.pdf | Bin 0 -> 142274 bytes report.tex | 26 +++++++++++++------------- 2 files changed, 13 insertions(+), 13 deletions(-) create mode 100644 report.pdf diff --git a/report.pdf b/report.pdf new file mode 100644 index 0000000..0e62ce1 Binary files /dev/null and b/report.pdf differ diff --git a/report.tex b/report.tex index 84771f0..6d7932d 100644 --- a/report.tex +++ b/report.tex @@ -37,7 +37,7 @@ stringstyle=\color{mauve}, \section{Introduction} In this report we will describe our solution to the second assignment, which consists of designing a spell checker algorithm and its implementation in Java. -The algorithm is provided with a sentence which contains zero to two misspelled +The algorithm is provided with a sentence containing zero to two misspelled words with an Damerau-Levenshtein edit distance of 1. In addition, misspelled words are not consecutive. @@ -45,14 +45,14 @@ The expected output is a sentence where these errors have been corrected. \section{Approach} To remove spelling errors from a text, one has to generate possible corrections -and evaluate them. Spelling suggestions are done on a word-by-word basis. +and evaluate these. Spelling suggestions are done on a word-by-word basis. -To obtain a list of possible words, \textit{candidate words} will generated with +To obtain a list of possible words, \textit{candidate words} will be generated with a Damerau-Levenshtein distance of 1 to the original word. While doing this, a -noisy channel probability can also be obtained which is needed later for word +noisy channel probability can also be obtained which is needed later on for word evaluation. -For each of word from the sentence, the word will be evaluated against the +For each word of the sentence, the word will be evaluated against the context of the word. This includes the previous and next words for n-grams. Since we only have the guarantee that the words next to the misspelled one are correct, tri-grams cannot reach out for the following two words before or after @@ -66,9 +66,9 @@ misspelled word. The program is given a single sentence which is split in words. The indices of words that are surely correct are remembered (\texttt{readonly\_words} in the listing below). Once a rating is found which is higher than the initial one, -then the word is replaced. +the word is replaced. -When a word was modified, the hunt for more faulty words continue until all of +When a word was modified, the hunt for more faulty words continues until all of them (less than three) are found. Otherwise, it is assumed that the sentence has no more spelling errors. @@ -121,19 +121,19 @@ for attempt in range(0, 3): \subsection{Generation of candidate words} A naive approach to generate candidate words in the $getCandidateWords()$ -function is to generate all possible combinations of the alphabet, and then take -the intersection with the dictionary. This does not scale and uses $O(27^n)$ +function is to generate all possible combinations of the alphabet, and then taking +the intersection with the dictionary. This does not scale and costs $O(27^n)$ memory and time. -Instead we chose for a smarter approach. As we know that the misspelled word +Instead we opted for a smarter approach. As we know that the misspelled words have a Damerau-Levenshtein distance of 1, candidates are generated by cycling through each letter and applying an insertion, deletion, substitution or transposition. While doing these operations, the Noisy Channel Probability can also be obtained -since we have a confusion matrix and know kind of operation was executed. +since we have a confusion matrix and know what kind of alteration was executed. -Pseudo-code (bounds checks are omitted, assume that such operations do not +Pseudo-code (bounds checks are omitted, assuming that such operations do not generate words. The actual Java implementation has decomposed methods for each of the operations): @@ -184,7 +184,7 @@ parameter should not get too high. \section{Additional Resources} \section{Statement of the Contributions} -Peter wrote most of the Spell Checker implementationg and report. +Peter wrote most of the Spell Checker implementation and report. \section{Manual} -- cgit v1.2.1