Naive Bayesian classifier
Naïve Bayes Classifier:
The Naive Bayes classifier is a probabilistic classifier.
We compute the probability of a document d being in a class c as follows:
P(c|d) ∝ P(c) ∏_{1≤k≤n_d} P(t_k|c)
n_d is the length of the document (its number of tokens).
P(t_k|c) is the conditional probability of term t_k occurring in a document of class c.
P(t_k|c) can be read as a measure of how much evidence t_k contributes that c is the correct class.
P(c) is the prior probability of c.
If a document’s terms do not provide clear evidence for one class over another, we choose the class c with the highest prior P(c).
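As a worked illustration of the formula above, here is a minimal Java sketch that scores one document against two classes. The class name `NaiveBayesScore`, the method `logScore`, and all numbers are hypothetical and are not taken from the project's source. The sketch sums logarithms instead of multiplying raw probabilities. That is a common substitution, not something this project is known to do; it avoids floating-point underflow on long documents and leaves the winning class unchanged because the logarithm is monotonic.

```java
import java.util.List;

final class NaiveBayesScore {
    // Returns log P(c) + sum over k of log P(t_k|c), which is the logarithm of
    // the unnormalized score P(c) * product over k of P(t_k|c).
    static double logScore(double prior, List<Double> termProbabilities) {
        double score = Math.log(prior);
        for (double p : termProbabilities) {
            score += Math.log(p);
        }
        return score;
    }

    public static void main(String[] args) {
        // Hypothetical values: the terms give identical evidence for both
        // classes, so the class with the larger prior wins.
        List<Double> evidence = List.of(0.1, 0.2);
        double sports = logScore(0.6, evidence);
        double politics = logScore(0.4, evidence);
        System.out.println(sports > politics ? "sports" : "politics");
    }
}
```

Because the term evidence is the same for both classes here, the comparison is decided by the prior alone, which is the tie-breaking behaviour described above.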
Algorithm (pseudocode):
Naïve Bayes(Test_Data_Dir, Training_Data_Dir)
    For each test file in the test data directory
        Map<class, probability> ProbabilityMap
        For each class
            ClassProbability = P(class)
            For each word in the test file
                WordProbability = probability of occurrence of that word in the class
                ClassProbability = ClassProbability * WordProbability
            ProbabilityMap[class] = ClassProbability
        Classified_class = key of the maximum probability value in ProbabilityMap
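The loop above might be written in Java roughly as follows. This is a sketch under stated assumptions, not the project's actual code: `ClassifySketch`, `classify`, and both input maps are hypothetical names, the word probabilities are taken as already computed, and the product starts from the prior P(c) as in the formula. The source does not say how a word unseen in a class is handled, so the small floor value `UNSEEN_WORD_PROBABILITY` is purely an assumption.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ClassifySketch {
    // Assumption: a word never seen in a class gets a small floor probability
    // instead of zero, so one unseen word does not zero out the whole product.
    static final double UNSEEN_WORD_PROBABILITY = 1e-6;

    static String classify(List<String> words,
                           Map<String, Double> priors,
                           Map<String, Map<String, Double>> wordProbabilitiesByClass) {
        Map<String, Double> probabilityMap = new HashMap<>();
        for (String className : priors.keySet()) {
            Map<String, Double> wordProbabilities =
                    wordProbabilitiesByClass.getOrDefault(className, Map.<String, Double>of());
            double classProbability = priors.get(className);
            for (String word : words) {
                classProbability *=
                        wordProbabilities.getOrDefault(word, UNSEEN_WORD_PROBABILITY);
            }
            probabilityMap.put(className, classProbability);
        }
        // The classified class is the key with the maximum probability value.
        String best = null;
        for (Map.Entry<String, Double> entry : probabilityMap.entrySet()) {
            if (best == null || entry.getValue() > probabilityMap.get(best)) {
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

The probability map is created once per test document rather than once per class, so that the final maximum is taken over all classes.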
Holds the test record as an object.
· String RecordId: file name of the test file.
· String fullRecord: the test record as a single string.
· ArrayList<String> words: the words in the test record.
Used as a cache to store the probabilities of words associated with a particular class.
· String className: the class name.
· HashMap<String, Double>: the probability of each word.
Holds the training record as an object.
· String className: class name of the training file.
· ArrayList<String> content: the words in the class.
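The three holders described above could be declared in Java along these lines. The class names are borrowed from the project's file list, but matching each description to a particular file is an assumption, and the map field name `wordProbabilities` is hypothetical because the description does not name that field. Fields are left package-private for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;

// Holds one test record.
class TestRecord {
    String RecordId;                               // file name of the test file
    String fullRecord;                             // test record as a single string
    ArrayList<String> words = new ArrayList<>();   // words in the test record
}

// Cache of word probabilities for one class.
class OccuranceProbabilties {
    String className;
    // Hypothetical field name: word -> probability of that word in className.
    HashMap<String, Double> wordProbabilities = new HashMap<>();
}

// Holds one training record.
class MemoryFile {
    String className;                              // class name of the training file
    ArrayList<String> content = new ArrayList<>(); // words in the class
}
```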
Flow of the Code:
1. Read each test file, remove stopwords, perform stemming, and load the result into objects.
2. Read each training file, remove stopwords, perform stemming, and load the result into objects.
3. For each test file, for each class name, and for each word, check whether the word's probability already exists in the cache.
4. If it does not, compute the word's probability. Multiply the word probabilities to get the overall probability of the test file for that class.
5. The class with the maximum probability for the test file is the class assigned to the file.
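Steps 3 and 4 amount to a look-up-or-compute cache. A minimal sketch using `Map.computeIfAbsent` is shown here; the names `ProbabilityCache`, `probability`, and `computations` are hypothetical. The source does not state how a word's probability is estimated, so the relative-frequency estimate used below is only an assumption for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ProbabilityCache {
    // className -> (word -> cached probability)
    private final Map<String, Map<String, Double>> cache = new HashMap<>();
    int computations = 0; // counts cache misses, for illustration only

    // Assumed estimator: relative frequency of the word among the
    // training words of the class.
    double probability(String className, String word, List<String> classWords) {
        return cache
                .computeIfAbsent(className, k -> new HashMap<>())
                .computeIfAbsent(word, w -> {
                    computations++;
                    long hits = classWords.stream().filter(w::equals).count();
                    return (double) hits / classWords.size();
                });
    }
}
```

Asking twice for the same class and word runs the estimator only once; the second call is answered from the cache.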
Two Modes of Execution:
Choose a mode depending on the size of the dataset and the computing power of your machine.
· In Memory
o Training data is loaded into memory as objects.
o Executes much faster.
o Significantly fewer file reads.
o Higher memory load.
· File Read
o Reads the training data from files as they are.
o Executes more slowly.
o More file reads.
o Significantly lower memory load.
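The trade-off between the two modes can be sketched with one interface and two implementations. Everything here is hypothetical (`TrainingSource`, `FileReadSource`, `InMemorySource`), and the sketch simplifies in two ways: it splits files on whitespace without stopword removal or stemming, and its in-memory variant loads each file on first use, whereas the project may load all training data up front.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

interface TrainingSource {
    List<String> words(Path trainingFile) throws IOException;
}

// File-read mode: nothing is retained, so every call reads the file again.
final class FileReadSource implements TrainingSource {
    public List<String> words(Path trainingFile) throws IOException {
        return Arrays.asList(Files.readString(trainingFile).trim().split("\\s+"));
    }
}

// In-memory mode: each file is read once and its words are kept in memory.
final class InMemorySource implements TrainingSource {
    private final Map<Path, List<String>> loaded = new HashMap<>();
    private final TrainingSource disk = new FileReadSource();

    public List<String> words(Path trainingFile) throws IOException {
        List<String> cached = loaded.get(trainingFile);
        if (cached == null) {
            cached = disk.words(trainingFile);
            loaded.put(trainingFile, cached);
        }
        return cached;
    }
}
```

With many test files and many classes, the file-read variant repeats the same disk reads for every lookup, while the in-memory variant pays for them once at the cost of holding every word list in memory.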
Steps for Execution:
· Use the following directory structure for the test and training directories.
· NaiveBayesClassifier is the main class. To run the classifier, use one of the following commands:
Correct Usage: java NaiveBayesClassifier <TrainingDataDirectory> <TestDataDirectory> <InMemoryFlag>
java -jar NaiveBayesClassifier <TrainingDataDirectory> <TestDataDirectory> <InMemoryFlag>
Arg 1: Training data directory (no spaces in the directory name, please).
Arg 2: Test data directory (no spaces in the directory name, please).
Arg 3: InMemoryFlag, which selects one of the two modes of execution. Set it to “true” to load the training data into memory (faster computation but higher memory load). If it is set to “false”, the training data is read from file again and again; because the number of file reads is high, execution becomes very slow, but the memory load is significantly lower.
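For illustration, the three arguments could be validated as in the sketch below. `CliArgs` and `parse` are hypothetical names; the real NaiveBayesClassifier may check its arguments differently. The usage text is the one quoted above.

```java
final class CliArgs {
    final String trainingDir;
    final String testDir;
    final boolean inMemory;

    private CliArgs(String trainingDir, String testDir, boolean inMemory) {
        this.trainingDir = trainingDir;
        this.testDir = testDir;
        this.inMemory = inMemory;
    }

    static CliArgs parse(String[] args) {
        if (args.length != 3) {
            throw new IllegalArgumentException("Correct Usage: java NaiveBayesClassifier "
                    + "<TrainingDataDirectory> <TestDataDirectory> <InMemoryFlag>");
        }
        // Boolean.parseBoolean returns true only for "true" (ignoring case);
        // any other value is treated as false, which selects file-read mode.
        return new CliArgs(args[0], args[1], Boolean.parseBoolean(args[2]));
    }
}
```

Since the shell splits arguments on spaces, a directory name containing a space would change the argument count and be rejected here, which matches the advice to avoid spaces in directory names.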
File list:
File | Size | Date
MemoryFile.java | 534.00 B | 2015-09-29 21:42
NaiveBayesClassifier.java | 8.76 kB | 2015-09-30 18:18
OccuranceProbabilties.java | 479.00 B | 2015-09-19 21:27
PorterStemmer.java | 11.21 kB | 2015-09-19 16:31
StopWordAnalyzer.java | 17.84 kB | 2015-09-29 23:23
TestRecord.java | 542.00 B | 2015-09-30 08:40