Taita Taveta University Repository

N-gram Based Text Categorization Method for Improved Data Mining

Show simple item record

dc.contributor.author Kennedy Ogada
dc.contributor.author Waweru Mwangi
dc.contributor.author Wilson Cheruiyot
dc.date.accessioned 2025-02-24T06:09:29Z
dc.date.available 2025-02-24T06:09:29Z
dc.date.issued 2015
dc.identifier.issn 2225-0506
dc.identifier.uri http://ir.ttu.ac.ke/xmlui/handle/123456789/111
dc.description.abstract Though naïve Bayes text classifiers are widely used for their simplicity and effectiveness, techniques for improving the performance of these classifiers have rarely been studied. Naïve Bayes classifiers, widely used for text classification in machine learning, are based on the conditional probability of features belonging to a class, where the features are chosen by feature selection methods. However, their performance is often imperfect because they do not model text well, because of inappropriate feature selection, and because of shortcomings of naïve Bayes itself. Sentiment classification, or text classification, is the task of taking a set of labeled text documents, learning a correlation between each document's contents and its label, and then predicting the labels of a set of unlabeled test documents as accurately as possible. Text classification is also sometimes called text categorization. It has many applications in natural language processing, such as e-mail filtering, intrusion detection systems, news filtering, prediction of user preferences, and organization of documents. The naïve Bayes model makes a strong assumption about the data: it assumes that words in a document are independent. This assumption is clearly violated in natural language text, where the syntactic, semantic, pragmatic, and conversational structure of a text induces various dependences between words. In addition, the particular form of the probabilistic model makes assumptions about the distribution of words in documents that are violated in practice. We address this problem and show that it can be solved by modeling text data differently, using N-grams. N-gram based text categorization is a simple method based on statistical information about the usage of sequences of words.
We conducted an experiment to demonstrate that our simple modification significantly improves the performance of naïve Bayes for text classification. Keywords: Data Mining, Text Classification, Text Categorization, Naïve Bayes, N-Grams en_US
dc.language.iso en en_US
dc.publisher Journal of Information Engineering and Applications en_US
dc.subject Data Mining, Text Classification, Text Categorization, Naïve Bayes, N-Grams en_US
dc.title N-gram Based Text Categorization Method for Improved Data Mining en_US
dc.type Article en_US
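The idea described in the abstract, replacing independent single-word features with word N-grams inside a naïve Bayes classifier, can be illustrated with a minimal sketch. This is not the authors' implementation; the toy corpus, function names, and the choice of bigrams with add-one (Laplace) smoothing are illustrative assumptions only.

```python
# Sketch of naive Bayes text classification over word bigrams rather than
# independent single words. Toy data and all names are hypothetical.
import math
from collections import Counter

def bigrams(text):
    """Lowercase word bigrams: 'a b c' -> [('a','b'), ('b','c')]."""
    words = text.lower().split()
    return list(zip(words, words[1:]))

def train(docs):
    """docs: list of (text, label). Returns per-class bigram counts,
    per-class document counts (for priors), and the bigram vocabulary."""
    counts = {}             # label -> Counter of bigrams
    class_docs = Counter()  # label -> number of training documents
    vocab = set()
    for text, label in docs:
        bgs = bigrams(text)
        class_docs[label] += 1
        counts.setdefault(label, Counter()).update(bgs)
        vocab.update(bgs)
    return counts, class_docs, vocab

def classify(text, counts, class_docs, vocab):
    """Return the label maximizing log P(label) + sum_i log P(bigram_i | label),
    with add-one smoothing so unseen bigrams get nonzero probability."""
    total_docs = sum(class_docs.values())
    best_label, best_score = None, float("-inf")
    for label, bigram_counts in counts.items():
        score = math.log(class_docs[label] / total_docs)   # class prior
        denom = sum(bigram_counts.values()) + len(vocab)   # smoothed denominator
        for bg in bigrams(text):
            score += math.log((bigram_counts[bg] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented two-class training corpus, for illustration only.
docs = [
    ("the movie was great fun", "pos"),
    ("great fun and a great cast", "pos"),
    ("the movie was a dull bore", "neg"),
    ("a dull plot and a boring cast", "neg"),
]
counts, class_docs, vocab = train(docs)
print(classify("great fun movie", counts, class_docs, vocab))  # pos
```

Because the features are bigrams, the classifier captures local word-order dependences (e.g. "great fun" as a unit) that a unigram naïve Bayes model, under its independence assumption, would ignore.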



