From 347fbdcc955f429277c0ab00ca4e0cc995156af0 Mon Sep 17 00:00:00 2001
From: Hongyang Zhou
Date: Sun, 16 Apr 2023 10:05:00 +0300
Subject: [PATCH] Fix typos

* Remove duplicate sentence
* //emph --> markdown syntax
---
 04_naive_bayes.Rmd | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/04_naive_bayes.Rmd b/04_naive_bayes.Rmd
index 7bcd4557..1c4d6de7 100644
--- a/04_naive_bayes.Rmd
+++ b/04_naive_bayes.Rmd
@@ -157,12 +157,11 @@ the spam filter.
 As we mentioned, what we are facing here is a *classification* problem,
 and we will code from scratch and use a *supervised learning* algorithm
 to find a solution with the help of Bayes' theorem. We're going to use a
-*naive Bayes* classifier to create our spam filter. We're going to use a
-\emph{naive Bayes} classifier to create our spam filter. This method is
-going to treat each email just as a collection of words, with no regard
-for the order in which they appear. This means we won't take into
-account semantic considerations like the particular relationship between
-words and their context.
+*naive Bayes* classifier to create our spam filter. This method is going
+to treat each email just as a collection of words, with no regard for
+the order in which they appear. This means we won't take into account
+semantic considerations like the particular relationship between words
+and their context.
 
 Our strategy will be to estimate a probability of an incoming email
 being ham or spam and make a decision based on that. Our general
@@ -200,10 +199,9 @@ common in our example's training data. We would therefore expect
 $P(email|spam)$, the probability of the new email being generated by the
 words encountered in the training spam email set, to be relatively high.
 
-(The word \\emph{win} appears in the form \\emph{won} in the training
-set, but that's OK. The standard linguistic technique of
-\\emph{lemmatization} groups together any related forms of a word and
-treats them as the same word.)
+(The word *win* appears in the form *won* in the training set, but
+that's OK. The standard linguistic technique of *lemmatization* groups
+together any related forms of a word and treats them as the same word.)
 
 Mathematically, the way to calculate $P(email|spam)$ is to take each
 word in our target email, calculate the probability of it appearing in