Taita Taveta University Repository

N-gram Based Text Categorization Method for Improved Data Mining

Show simple item record

dc.contributor.author Kennedy Ogada
dc.contributor.author Waweru Mwangi
dc.contributor.author Wilson Cheruiyot
dc.date.accessioned 2025-02-24T06:09:29Z
dc.date.available 2025-02-24T06:09:29Z
dc.date.issued 2015
dc.identifier.issn 2225-0506
dc.identifier.uri http://ir.ttu.ac.ke/xmlui/handle/123456789/111
dc.description.abstract Though naïve Bayes text classifiers are widely used for their simplicity and effectiveness, techniques for improving the performance of these classifiers have rarely been studied. Naïve Bayes classifiers, widely used for text classification in machine learning, are based on the conditional probability of features belonging to a class, where the features are chosen by feature selection methods. However, their performance is often imperfect because they do not model text well, because of inappropriate feature selection, and because of shortcomings of naïve Bayes itself. Sentiment classification, or text classification, is the task of taking a set of labeled text documents, learning a correlation between each document's contents and its label, and then predicting the labels of a set of unlabeled test documents as accurately as possible. Text classification is also sometimes called text categorization. It has many applications in natural language processing, such as e-mail filtering, intrusion detection systems, news filtering, prediction of user preferences, and organization of documents. The naïve Bayes model makes a strong assumption about the data: it assumes that words in a document are independent. This assumption is clearly violated in natural language text, where the syntactic, semantic, pragmatic, and conversational structure of a text induces various dependences between words. In addition, the particular form of the probabilistic model makes assumptions about the distribution of words in documents that are violated in practice. We address this problem and show that it can be solved by modeling text data differently, using N-grams. N-gram based text categorization is a simple method based on statistical information about the usage of sequences of words.
We conducted an experiment to demonstrate that our simple modification significantly improves the performance of naïve Bayes for text classification. Keywords: Data Mining, Text Classification, Text Categorization, Naïve Bayes, N-Grams en_US
dc.language.iso en en_US
dc.publisher Journal of Information Engineering and Applications en_US
dc.subject Data Mining, Text Classification, Text Categorization, Naïve Bayes, N-Grams en_US
dc.title N-gram Based Text Categorization Method for Improved Data Mining en_US
dc.type Article en_US
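The idea described in the abstract, replacing independent single-word features with word N-grams inside a naïve Bayes classifier, can be illustrated with a minimal sketch. This is not the authors' implementation; the toy corpus, function names, and the choice of bigrams with add-one (Laplace) smoothing are illustrative assumptions only.

```python
# Sketch of naive Bayes text classification over word bigrams rather than
# independent single words. Toy data and all names are hypothetical.
import math
from collections import Counter

def bigrams(text):
    """Lowercase word bigrams: 'a b c' -> [('a','b'), ('b','c')]."""
    words = text.lower().split()
    return list(zip(words, words[1:]))

def train(docs):
    """docs: list of (text, label). Returns per-class bigram counts,
    per-class document counts (for priors), and the bigram vocabulary."""
    counts = {}             # label -> Counter of bigrams
    class_docs = Counter()  # label -> number of training documents
    vocab = set()
    for text, label in docs:
        bgs = bigrams(text)
        class_docs[label] += 1
        counts.setdefault(label, Counter()).update(bgs)
        vocab.update(bgs)
    return counts, class_docs, vocab

def classify(text, counts, class_docs, vocab):
    """Return the label maximizing log P(label) + sum_i log P(bigram_i | label),
    with add-one smoothing so unseen bigrams get nonzero probability."""
    total_docs = sum(class_docs.values())
    best_label, best_score = None, float("-inf")
    for label, bigram_counts in counts.items():
        score = math.log(class_docs[label] / total_docs)   # class prior
        denom = sum(bigram_counts.values()) + len(vocab)   # smoothed denominator
        for bg in bigrams(text):
            score += math.log((bigram_counts[bg] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented two-class training corpus, for illustration only.
docs = [
    ("the movie was great fun", "pos"),
    ("great fun and a great cast", "pos"),
    ("the movie was a dull bore", "neg"),
    ("a dull plot and a boring cast", "neg"),
]
counts, class_docs, vocab = train(docs)
print(classify("great fun movie", counts, class_docs, vocab))  # pos
```

Because the features are bigrams, the classifier captures local word-order dependences (e.g. "great fun" as a unit) that a unigram naïve Bayes model, under its independence assumption, would ignore.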



