Abstract:
Though naïve Bayes text classifiers are widely used because of their simplicity and effectiveness, techniques
for improving the performance of these classifiers have rarely been studied. Naïve Bayes classifiers, which are
widely used for text classification in machine learning, are based on the conditional probability of features
belonging to a class, where the features are selected by feature selection methods. However, their performance is
often imperfect: the model does not represent text well, feature selection may be inappropriate, and naïve
Bayes itself has inherent shortcomings. Sentiment classification, or text classification, is the task of taking a set
of labeled text documents, learning a correlation between each document's contents and its corresponding label, and
then predicting the labels of a set of unlabeled test documents as accurately as possible. Text classification is also
sometimes called text categorization. Text classification has many applications in natural language processing
tasks such as e-mail filtering, intrusion detection, news filtering, prediction of user preferences, and
organization of documents. The naïve Bayes model makes strong assumptions about the data: it assumes that the
words in a document are independent. This assumption is clearly violated in natural language text: there are
various kinds of dependencies between words induced by the syntactic, semantic, pragmatic, and conversational
structure of a text. Moreover, the particular form of the probabilistic model makes assumptions about the distribution
of words in documents that are violated in practice. We address this problem and show that it can be alleviated by
modeling text data differently, using N-grams. N-gram-based text categorization is a simple method based on
statistical information about the usage of word sequences. We conducted an experiment demonstrating that this
simple modification significantly improves the performance of naïve Bayes for text classification.
Keywords: Data Mining, Text Classification, Text Categorization, Naïve Bayes, N-Grams
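
To make the idea concrete, the following is a minimal sketch of the kind of modification the abstract describes: a multinomial naïve Bayes classifier whose features are word N-grams rather than single words. It assumes scikit-learn and uses a hypothetical toy corpus; it is an illustration of the technique, not the experimental setup of the paper.

import sklearn  # assumes scikit-learn is installed
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled documents (hypothetical data, for illustration only).
train_docs = [
    "the movie was not good at all",
    "a truly great and moving film",
    "boring plot and weak acting",
    "wonderful performances, highly recommended",
]
train_labels = ["neg", "pos", "neg", "pos"]

# Baseline: unigram (bag-of-words) features. Multinomial naive Bayes
# estimates P(word | class) and assumes words occur independently.
unigram_nb = make_pipeline(
    CountVectorizer(ngram_range=(1, 1)),
    MultinomialNB(),
).fit(train_docs, train_labels)

# Modification: word N-gram features (here unigrams plus bigrams), so
# that a local word sequence such as "not good" becomes a single
# feature, partially compensating for the independence assumption.
ngram_nb = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    MultinomialNB(),
).fit(train_docs, train_labels)

test_docs = ["not a good film", "great acting and a moving plot"]
print("unigram NB:", unigram_nb.predict(test_docs))
print("n-gram  NB:", ngram_nb.predict(test_docs))

With bigram features, a phrase such as "not good" is counted as one feature, so some of the local word dependencies that the unigram independence assumption discards are retained by the model.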