Abstract:
Though naïve Bayes text classifiers are widely used because of their simplicity and effectiveness, techniques
for improving the performance of these classifiers have rarely been studied. Naïve Bayes classifiers, which are
widely used for text classification in machine learning, are based on the conditional probability of features
belonging to a class, where the features are selected by feature selection methods. However, their performance is
often imperfect: the model does not represent text well, feature selection may be inappropriate, and naïve
Bayes itself has inherent shortcomings. Sentiment classification, or text classification, is the task of taking a set
of labeled text documents, learning a correlation between each document's contents and its corresponding label, and
then predicting the labels of a set of unlabeled test documents as accurately as possible. Text classification is also
sometimes called text categorization. Text classification has many applications in natural language processing
tasks such as e-mail filtering, intrusion detection, news filtering, prediction of user preferences, and
organization of documents. The naïve Bayes model makes strong assumptions about the data: it assumes that the
words in a document are independent. This assumption is clearly violated in natural language text: there are
various kinds of dependencies between words induced by the syntactic, semantic, pragmatic, and conversational
structure of a text. Moreover, the particular form of the probabilistic model makes assumptions about the distribution
of words in documents that are violated in practice. We address this problem and show that it can be alleviated by
modeling text data differently, using N-grams. N-gram-based text categorization is a simple method based on
statistical information about the usage of word sequences. We conducted an experiment demonstrating that this
simple modification significantly improves the performance of naïve Bayes for text classification.
Keywords: Data Mining, Text Classification, Text Categorization, Naïve Bayes, N-Grams
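
To make the idea concrete, the following is a minimal sketch of the kind of modification the abstract describes: a multinomial naïve Bayes classifier whose features are word N-grams rather than single words. It assumes scikit-learn and uses a hypothetical toy corpus; it is an illustration of the technique, not the experimental setup of the paper.

import sklearn  # assumes scikit-learn is installed
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled documents (hypothetical data, for illustration only).
train_docs = [
    "the movie was not good at all",
    "a truly great and moving film",
    "boring plot and weak acting",
    "wonderful performances, highly recommended",
]
train_labels = ["neg", "pos", "neg", "pos"]

# Baseline: unigram (bag-of-words) features. Multinomial naive Bayes
# estimates P(word | class) and assumes words occur independently.
unigram_nb = make_pipeline(
    CountVectorizer(ngram_range=(1, 1)),
    MultinomialNB(),
).fit(train_docs, train_labels)

# Modification: word N-gram features (here unigrams plus bigrams), so
# that a local word sequence such as "not good" becomes a single
# feature, partially compensating for the independence assumption.
ngram_nb = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    MultinomialNB(),
).fit(train_docs, train_labels)

test_docs = ["not a good film", "great acting and a moving plot"]
print("unigram NB:", unigram_nb.predict(test_docs))
print("n-gram  NB:", ngram_nb.predict(test_docs))

With bigram features, a phrase such as "not good" is counted as one feature, so some of the local word dependencies that the unigram independence assumption discards are retained by the model.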