From f862f5528b8e05730697cf4ca68e21e0781c7886 Mon Sep 17 00:00:00 2001
From: Peter Wu
Date: Tue, 7 Apr 2015 16:57:06 +0200
Subject: Report: update test results

---
 report.pdf | Bin 175996 -> 180829 bytes
 report.tex | 53 ++++++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 42 insertions(+), 11 deletions(-)

diff --git a/report.pdf b/report.pdf
index a757392..714ce86 100644
Binary files a/report.pdf and b/report.pdf differ
diff --git a/report.tex b/report.tex
index 14f982d..6b01c97 100644
--- a/report.tex
+++ b/report.tex
@@ -306,24 +306,51 @@ on \textit{samplevoc.txt}).
 \item[MAX\_TYPOS] This parameter is set to 2 for the task, but it could be
 increased for other situations. The worst-case running time will exponentially
 increase with this parameter.
-%\item[LM\_PROBABILITY\_UNMODIFIED] Words that possibly have an error are
-%influenced by the channel probability. For other words (those that are not being
-%considered for correction), this factor can be .... (I think that the factor
-%can be removed from the code since it only decreases the probability...)
-% todo you can set it to just 1... it does not matter since we are just trying
-% to find the best correction for a word
+\item[LM\_PROBABILITY\_UNMODIFIED] Words that possibly have an error are
+influenced by the channel probability. For other words (those that are not
+being considered for correction), this factor is used instead, so it determines
+how much relative influence the channel probability gets. Values closer to 1
+make the algorithm more conservative, preferring not to correct words. Values
+closer to zero make it correct a word whenever an alternative can be found.
 \end{description}
 
-% todo include results, it looks good but need to be extended!!!
 The program was tested against the \textit{test-sentences.txt} dataset for
 which it only failed to correct \textit{the development of diabetes
 \textbf{u}s present in mice that \textbf{h}arry a transgen(\textbf{e})}
 (errors or additions
-are emphasized).
-% add reason!
+are emphasized). This issue can be solved by increasing the parameter
+\texttt{MAX\_TYPOS} to 3. After that change, the word \texttt{harry} is also
+corrected to \texttt{carry}.
+
+When \texttt{MAX\_TYPOS} is changed to 3, however, the sentence \textit{boxing
+loves shield the knuckles nots the head} is corrected to \textit{boxing gloves
+shield the knuckles not\textbf{es} the head}. If the n-grams were given a
+greater weight than the channel probability, this correction would not have
+been made.
 
 With the initial ngram text file, it failed to correct \textit{kind retards} to
-\textit{kind regards}.
-% add reason!
+\textit{kind regards} (finding \textit{kind rewards} instead). This happened
+due to the quality of the learning data. With a different dataset \cite{sekine}
+including unigrams and bigrams (where the bigram count is greater than 10), the
+phrase is corrected as expected. Since this dataset is huge (millions of
+n-grams) and negatively impacts the resource requirements, it was not used in
+the final version.
+
+Sekine's dataset does not, however, include all unigrams from the original
+sample. For the sentences provided via Peach this does not matter; Sekine's
+learning data is accurate enough for them.
+
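+To summarize how the channel probability, the n-gram probabilities and
+\texttt{LM\_PROBABILITY\_UNMODIFIED} interact, the scoring can be sketched
+roughly as follows (this is an illustration only; the implementation may differ
+in its details). For a candidate sentence $c_1 \ldots c_m$ proposed for the
+observed sentence $w_1 \ldots w_m$, its score is approximately
+\[
+  \mathrm{score}(c_1 \ldots c_m) = \prod_{i=1}^{m}
+    P_{\mathrm{lm}}(c_i \mid c_{i-n+1} \ldots c_{i-1}) \cdot f(c_i, w_i),
+\]
+where $P_{\mathrm{lm}}$ is the (interpolated) n-gram probability and
+$f(c_i, w_i)$ is the channel probability $P(w_i \mid c_i)$ if word $i$ was
+changed, or the constant \texttt{LM\_PROBABILITY\_UNMODIFIED} if it was left
+as is. Written this way it is easy to see why values of
+\texttt{LM\_PROBABILITY\_UNMODIFIED} close to 1 keep words unchanged, why
+values close to 0 favor any available correction, and why giving the n-gram
+term more weight relative to the channel term changes which corrections win.
+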
+An interesting observation is that the arbitrary n-gram support of the program
+does work as intended. For example:
+
+\begin{itemize}
+\item input line: \\
+she still refers to me has a friend but i fel i am treated quite badly
+\item wrong output with the original \textit{samplecnt}: \\
+he still refers to me has a friend but i feel i am treated quite badly
+\item wrong output with Sekine's dataset (up to 2-grams): \\
+she still refers to me has a friend but i feel i a treated quite badly
+\item correct output with Sekine's dataset (up to 3-grams): \\
+she still refers to me as a friend but i feel i am treated quite badly
+\end{itemize}
 
 \section{Statement of the Contributions}
 Peter wrote most of the Spell Checker implementation and report.
@@ -336,6 +363,10 @@ Kernighan et al.
 \bibitem{interpolation}
 Presentation on Interpolation techniques by Daniel Jurafsky
 https://class.coursera.org/nlp/lecture/19
+\bibitem{sekine}
+Satoshi Sekine et al.
+\emph{Tagged and Cleaned Wikipedia (TC Wikipedia) and its Ngram}
+http://nlp.cs.nyu.edu/wikipedia-data/
 \end{thebibliography}
 \end{document}
-- 
cgit v1.2.1