From f862f5528b8e05730697cf4ca68e21e0781c7886 Mon Sep 17 00:00:00 2001
From: Peter Wu
Date: Tue, 7 Apr 2015 16:57:06 +0200
Subject: Report: update test results

---
 report.pdf | Bin 175996 -> 180829 bytes
 report.tex | 53 ++++++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 42 insertions(+), 11 deletions(-)

diff --git a/report.pdf b/report.pdf
index a757392..714ce86 100644
Binary files a/report.pdf and b/report.pdf differ
diff --git a/report.tex b/report.tex
index 14f982d..6b01c97 100644
--- a/report.tex
+++ b/report.tex
@@ -306,24 +306,51 @@ on \textit{samplevoc.txt}).
 \item[MAX\_TYPOS] This parameter is set to 2 for the task, but it could be
 increased for other situations. The worst-case running time will exponentially
 increase with this parameter.
-%\item[LM\_PROBABILITY\_UNMODIFIED] Words that possibly have an error are
-%influenced by the channel probability. For other words (those that are not being
-%considered for correction), this factor can be .... (I think that the factor
-%can be removed from the code since it only decreases the probability...)
-% todo you can set it to just 1... it does not matter since we are just trying
-% to find the best correction for a word
+\item[LM\_PROBABILITY\_UNMODIFIED] Words that possibly have an error are
+influenced by the channel probability. For other words (those that are not
+being considered for correction), this factor is used instead, so it determines
+how much relative influence the channel probability gets. Values closer to 1
+make the algorithm more conservative, preferring not to correct words. Values
+closer to zero make it correct a word whenever an alternative can be found.
 \end{description}
 
-% todo include results, it looks good but need to be extended!!!
 The program was tested against the \textit{test-sentences.txt} dataset for
 which it only failed to correct \textit{the development of diabetes
 \textbf{u}s present in mice that \textbf{h}arry a transgen(\textbf{e})}
 (errors or additions
-are emphasized).
-% add reason!
+are emphasized). This issue can be solved by increasing the parameter
+\texttt{MAX\_TYPOS} to 3. After that change, the word \texttt{harry} is also
+corrected to \texttt{carry}.
+
+When \texttt{MAX\_TYPOS} is changed to 3, however, the sentence \textit{boxing
+loves shield the knuckles nots the head} is corrected to \textit{boxing gloves
+shield the knuckles not\textbf{es} the head}. If the n-grams were given a
+greater weight than the channel probability, this correction would not have
+been made.
 
 With the initial ngram text file, it failed to correct \textit{kind retards} to
-\textit{kind regards}.
-% add reason!
+\textit{kind regards} (finding \textit{kind rewards} instead). This happened
+due to the quality of the learning data. With a different dataset \cite{sekine}
+including unigrams and bigrams (where the bigram count is greater than 10), the
+phrase is corrected as expected. Since this dataset is huge (millions of
+n-grams) and negatively impacts the resource requirements, it was not used in
+the final version.
+
+Sekine's dataset does not, however, include all unigrams from the original
+sample. For the sentences provided via Peach this does not matter; Sekine's
+learning data is accurate enough for them.
+
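+To summarize how the channel probability, the n-gram probabilities and
+\texttt{LM\_PROBABILITY\_UNMODIFIED} interact, the scoring can be sketched
+roughly as follows (this is an illustration only; the implementation may differ
+in its details). For a candidate sentence $c_1 \ldots c_m$ proposed for the
+observed sentence $w_1 \ldots w_m$, its score is approximately
+\[
+  \mathrm{score}(c_1 \ldots c_m) = \prod_{i=1}^{m}
+    P_{\mathrm{lm}}(c_i \mid c_{i-n+1} \ldots c_{i-1}) \cdot f(c_i, w_i),
+\]
+where $P_{\mathrm{lm}}$ is the (interpolated) n-gram probability and
+$f(c_i, w_i)$ is the channel probability $P(w_i \mid c_i)$ if word $i$ was
+changed, or the constant \texttt{LM\_PROBABILITY\_UNMODIFIED} if it was left
+as is. Written this way it is easy to see why values of
+\texttt{LM\_PROBABILITY\_UNMODIFIED} close to 1 keep words unchanged, why
+values close to 0 favor any available correction, and why giving the n-gram
+term more weight relative to the channel term changes which corrections win.
+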
+An interesting observation is that the arbitrary n-gram support of the program
+does work as intended. For example:
+
+\begin{itemize}
+\item input line: \\
+she still refers to me has a friend but i fel i am treated quite badly
+\item wrong output with the original \textit{samplecnt}: \\
+he still refers to me has a friend but i feel i am treated quite badly
+\item wrong output with Sekine's dataset (up to 2-grams): \\
+she still refers to me has a friend but i feel i a treated quite badly
+\item correct output with Sekine's dataset (up to 3-grams): \\
+she still refers to me as a friend but i feel i am treated quite badly
+\end{itemize}
 
 \section{Statement of the Contributions}
 Peter wrote most of the Spell Checker implementation and report.
@@ -336,6 +363,10 @@ Kernighan et al.
 \bibitem{interpolation}
 Presentation on Interpolation techniques by Daniel Jurafsky
 https://class.coursera.org/nlp/lecture/19
+\bibitem{sekine}
+Satoshi Sekine et al.
+\emph{Tagged and Cleaned Wikipedia (TC Wikipedia) and its Ngram}
+http://nlp.cs.nyu.edu/wikipedia-data/
 \end{thebibliography}
 \end{document}
-- 
cgit v1.2.1