(b) If Ak is continuous-valued, then we need to do a bit more work, but the calculation is pretty straightforward. A continuous-valued attribute is typically assumed to have a Gaussian distribution with a mean μ and standard deviation σ, defined by

g(x, μ, σ) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²))    (8.13)

so that

P(xk | Ci) = g(xk, μ_Ci, σ_Ci)    (8.14)

These equations may appear daunting, but hold on! We need to compute μ_Ci and σ_Ci, which are the mean (i.e., average) and standard deviation, respectively, of the values of attribute Ak for training tuples of class Ci. We then plug these two quantities into Eq. (8.13), together with xk, to estimate P(xk | Ci).
For example, let X = (35, $40,000), where A1 and A2 are the attributes age and income, respectively. Let the class label attribute be buys_computer. The associated class label for X is yes (i.e., buys_computer = yes). Let's suppose that age has not been discretized and therefore exists as a continuous-valued attribute. Suppose that from the training set, we find that customers in D who buy a computer are 38 ± 12 years of age. In other words, for attribute age and this class, we have μ = 38 years and σ = 12. We can plug these quantities, along with x1 = 35 for our tuple X, into Eq. (8.13) to estimate P(age = 35 | buys_computer = yes). For a quick review of mean and standard deviation calculations, please see Section 2.2. A short code sketch of this calculation is given after step 5 below.
5. To predict the class label of X, P(X | Ci)P(Ci) is evaluated for each class Ci. The classifier predicts that the class label of tuple X is the class Ci if and only if

P(X | Ci)P(Ci) > P(X | Cj)P(Cj)    for 1 ≤ j ≤ m, j ≠ i.    (8.15)

In other words, the predicted class label is the class Ci for which P(X | Ci)P(Ci) is the maximum.
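To make steps 4(b) and 5 concrete in code, here is a minimal Python sketch, assuming the per-class priors and per-attribute likelihood estimates have already been computed from the training data; the names gaussian_density, predict, prior, and likelihood are illustrative and not from the text.

```python
import math

def gaussian_density(x, mu, sigma):
    """Gaussian density g(x, mu, sigma) of Eq. (8.13)."""
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def predict(x, classes, prior, likelihood):
    """Return the class Ci that maximizes P(X|Ci)P(Ci), as in Eq. (8.15).

    prior[c]         -- an estimate of P(Ci)
    likelihood(x, c) -- an estimate of P(X|Ci), e.g., the naive product of Eq. (8.12)
    """
    return max(classes, key=lambda c: likelihood(x, c) * prior[c])

# Step 4(b) on the running example: age is continuous, and for the class
# buys_computer = yes the training data gave mu = 38 years and sigma = 12;
# the resulting density is used as P(age = 35 | buys_computer = yes).
print(gaussian_density(35.0, 38.0, 12.0))  # roughly 0.032
```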
“How effective are Bayesian classifiers?” Various empirical studies of this classifier in comparison to decision tree and neural network classifiers have found it to be comparable in some domains. In theory, Bayesian classifiers have the minimum error rate in comparison to all other classifiers. However, in practice this is not always the case, owing to inaccuracies in the assumptions made for its use, such as class-conditional independence, and the lack of available probability data.
Bayesian classifiers are also useful in that they provide a theoretical justification for other classifiers that do not explicitly use Bayes’ theorem. For example, under certain assumptions, it can be shown that many neural network and curve-fitting algorithms output the maximum posteriori hypothesis, as does the naïve Bayesian classifier.
Example 8.4 Predicting a class label using naïve Bayesian classification
We wish to predict the class label of a tuple using naïve Bayesian classification, given the same training data as in Example 8.3 for decision tree induction. The training data were shown earlier in Table 8.1. The data tuples are described by the attributes age, income, student, and credit_rating. The class label attribute, buys_computer, has two distinct values (namely, {yes, no}). Let C1 correspond to the class buys_computer = yes and C2 correspond to buys_computer = no. The tuple we wish to classify is

X = (age = youth, income = medium, student = yes, credit_rating = fair)

We need to maximize P(X | Ci)P(Ci), for i = 1, 2. P(Ci), the prior probability of each class, can be computed based on the training tuples:

P(buys_computer = yes) = 9/14 = 0.643
P(buys_computer = no) = 5/14 = 0.357

To compute P(X | Ci), for i = 1, 2, we compute the following conditional probabilities:

P(age = youth | buys_computer = yes) = 2/9 = 0.222
P(age = youth | buys_computer = no) = 3/5 = 0.600
P(income = medium | buys_computer = yes) = 4/9 = 0.444
P(income = medium | buys_computer = no) = 2/5 = 0.400
P(student = yes | buys_computer = yes) = 6/9 = 0.667
P(student = yes | buys_computer = no) = 1/5 = 0.200
P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
P(credit_rating = fair | buys_computer = no) = 2/5 = 0.400

Using these probabilities, we obtain

P(X | buys_computer = yes) = P(age = youth | buys_computer = yes) × P(income = medium | buys_computer = yes) × P(student = yes | buys_computer = yes) × P(credit_rating = fair | buys_computer = yes)
= 0.222 × 0.444 × 0.667 × 0.667 = 0.044.

Similarly,

P(X | buys_computer = no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019.

To find the class, Ci, that maximizes P(X | Ci)P(Ci), we compute

P(X | buys_computer = yes)P(buys_computer = yes) = 0.044 × 0.643 = 0.028
P(X | buys_computer = no)P(buys_computer = no) = 0.019 × 0.357 = 0.007
Therefore, the naïve Bayesian classifier predicts buys_computer = yes for tuple X.
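The arithmetic of this example can be replayed in a few lines of Python. This is only a sketch: the priors and class-conditional frequencies below restate the counts quoted above (9 yes tuples and 5 no tuples in Table 8.1), and the dictionary layout and the function name posterior_score are our own.

```python
# Class priors from the 14 training tuples of Table 8.1: 9 yes, 5 no.
prior = {"yes": 9 / 14, "no": 5 / 14}

# Class-conditional probabilities for the attribute values of X
# (age = youth, income = medium, student = yes, credit_rating = fair).
cond = {
    "yes": {"age=youth": 2 / 9, "income=medium": 4 / 9,
            "student=yes": 6 / 9, "credit_rating=fair": 6 / 9},
    "no":  {"age=youth": 3 / 5, "income=medium": 2 / 5,
            "student=yes": 1 / 5, "credit_rating=fair": 2 / 5},
}

def posterior_score(label):
    """P(X|Ci)P(Ci) under the class-conditional independence assumption of Eq. (8.12)."""
    score = prior[label]
    for p in cond[label].values():
        score *= p
    return score

scores = {label: posterior_score(label) for label in prior}
print(scores)                        # approximately {'yes': 0.028, 'no': 0.007}
print(max(scores, key=scores.get))   # 'yes'
```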
“What if I encounter probability values of zero?” Recall that in Eq. (8.12), we estimate P(X | Ci) as the product of the probabilities P(x1 | Ci), P(x2 | Ci), …, P(xn | Ci), based on the assumption of class-conditional independence. These probabilities can be estimated from the training tuples (step 4). We need to compute P(X | Ci)P(Ci) for each class Ci (i = 1, 2, …, m) to find the class Ci for which P(X | Ci)P(Ci) is the maximum (step 5). Let's consider this calculation. For each attribute–value pair (i.e., Ak = xk, for k = 1, 2, …, n) in tuple X, we need to count the number of tuples having that attribute–value pair, per class (i.e., per Ci, for i = 1, …, m). In Example 8.4, we have two classes (m = 2), namely buys_computer = yes and buys_computer = no. Therefore, for the attribute–value pair student = yes of X, say, we need two counts—the number of customers