author    Peter Wu <peter@lekensteyn.nl>  2015-04-07 16:57:06 +0200
committer Peter Wu <peter@lekensteyn.nl>  2015-04-07 16:57:06 +0200
commit    f862f5528b8e05730697cf4ca68e21e0781c7886 (patch)
tree      46440162cc280477b06918ef55bb54c982ca4d01
parent    0fcfe8be393581d4776807b5aee9355116a6e3da (diff)
download  assignment4-f862f5528b8e05730697cf4ca68e21e0781c7886.tar.gz
Report: update test results (HEAD, master)

-rw-r--r--  report.pdf  bin 175996 -> 180829 bytes
-rw-r--r--  report.tex  53
2 files changed, 42 insertions, 11 deletions
diff --git a/report.pdf b/report.pdf
index a757392..714ce86 100644
--- a/report.pdf
+++ b/report.pdf
Binary files differ
diff --git a/report.tex b/report.tex
index 14f982d..6b01c97 100644
--- a/report.tex
+++ b/report.tex
@@ -306,24 +306,51 @@ on \textit{samplevoc.txt}).
\item[MAX\_TYPOS] This parameter is set to 2 for the task, but it could be
increased for other situations. The worst-case running time increases
exponentially with this parameter.
-%\item[LM\_PROBABILITY\_UNMODIFIED] Words that possibly have an error are
-%influenced by the channel probability. For other words (those that are not being
-%considered for correction), this factor can be .... (I think that the factor
-%can be removed from the code since it only decreases the probability...)
-% todo you can set it to just 1... it does not matter since we are just trying
-% to find the best correction for a word
+\item[LM\_PROBABILITY\_UNMODIFIED] Words that possibly contain an error are
+scored using the channel probability. For the remaining words (those that are
+not being considered for correction), this constant factor takes the place of
+the channel probability and controls how much influence it has. Values closer
+to 1 make the algorithm more conservative, preferring not to correct words;
+values closer to zero make it correct a word whenever an alternative can be
+found (a sketch of the scoring is given below the parameter list).
\end{description}
-% todo include results, it looks good but need to be extended!!!
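+
+As a rough illustration of how these parameters enter the scoring (the notation
+below is only illustrative and does not correspond to variable names in the
+code), each candidate correction $c$ for an observed word $w$ is scored as
+\begin{equation*}
+\mathrm{score}(c \mid w) \propto P_{\mathrm{channel}}(w \mid c) \cdot P_{\mathrm{LM}}(c),
+\end{equation*}
+where only candidates within \texttt{MAX\_TYPOS} edits of $w$ are generated, and
+the constant \texttt{LM\_PROBABILITY\_UNMODIFIED} is used in place of the
+channel term for words that are not being considered for correction.
+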
The program was tested against the \textit{test-sentences.txt} dataset for which
it only failed to correct \textit{the development of diabetes \textbf{u}s
present in mice that \textbf{h}arry a transgen(\textbf{e})} (errors or additions
-are emphasized).
-% add reason!
+are emphasized). This issue can be solved by increasing the parameter
+\texttt{MAX\_TYPOS} to 3. After that change, the word \texttt{harry} will also
+be corrected to \texttt{carry}.
+
+When \texttt{MAX\_TYPOS} is increased to 3, however, the sentence \textit{boxing
+loves shield the knuckles nots the head} got corrected to \textit{boxing gloves
+shield the knuckles not\textbf{es} the head}. If the n-grams were given a
+greater weight than the channel probability, this miscorrection would not occur
+(a sketch of such a weighting follows this paragraph).
With the initial ngram text file, it failed to correct \textit{kind retards} to
-\textit{kind regards}.
-% add reason!
+\textit{kind regards} (finding \textit{kind rewards} instead). This was caused
+by the limited quality of the learning data. With a different dataset
+\cite{sekine} that includes unigrams and bigrams (keeping only bigrams with a
+count greater than 10), the phrase was corrected as expected. Since this
+dataset is huge (millions of n-grams) and greatly increases the resource
+requirements, it was not used in the final version.
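+
+One way to realise the weighting mentioned above (a sketch only; the exponent
+$\alpha$ is not a parameter of the current implementation) would be to raise
+the language-model term to a power $\alpha > 1$ before combining it with the
+channel probability:
+\begin{equation*}
+\mathrm{score}(c \mid w) \propto P_{\mathrm{channel}}(w \mid c) \cdot P_{\mathrm{LM}}(c)^{\alpha}.
+\end{equation*}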
+
+The Sekine dataset does not, however, include all unigrams from the original
+sample. For the sentences provided via Peach this does not matter; Sekine's
+learning data is accurate enough for them.
+
+An interesting observation is that the program's support for arbitrary n-gram
+orders works as intended (a sketch of how the orders can be combined follows
+the list). For example:
+
+\begin{itemize}
+\item input line: \\
+she still refers to me has a friend but i fel i am treated quite badly
+\item wrong output with original samplecnt: \\
+he still refers to me has a friend but i feel i am treated quite badly
+\item wrong output with Sekine's dataset (up to 2-gram): \\
+she still refers to me has a friend but i feel i a treated quite badly
+\item correct output with Sekine's dataset (up to 3-gram): \\
+she still refers to me as a friend but i feel i am treated quite badly
+\end{itemize}
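+
+When several n-gram orders are available, their estimates can be combined by
+linear interpolation as described in \cite{interpolation}; the weights
+$\lambda_i$ below are illustrative and would have to be tuned:
+\begin{equation*}
+\hat{P}(w_i \mid w_{i-2}, w_{i-1}) =
+  \lambda_3 P(w_i \mid w_{i-2}, w_{i-1})
+  + \lambda_2 P(w_i \mid w_{i-1})
+  + \lambda_1 P(w_i),
+\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1.
+\end{equation*}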
\section{Statement of the Contributions}
Peter wrote most of the Spell Checker implementation and report.
@@ -336,6 +363,10 @@ Kernighan et al.
\bibitem{interpolation}
Presentation on Interpolation techniques by Daniel Jurafsky
https://class.coursera.org/nlp/lecture/19
+\bibitem{sekine}
+Satoshi Sekine et al.
+\emph{Tagged and Cleaned Wikipedia (TC Wikipedia) and its Ngram}
+http://nlp.cs.nyu.edu/wikipedia-data/
\end{thebibliography}
\end{document}