Text classification (1) - text preprocessing & tex

2016-08-23

Application background

1. Environment: Ubuntu 14, Hadoop 2.6, Eclipse, NLPIR/ICTCLAS2015, etc.

2. Algorithm overview:

(1) The project is developed in parallel on Hadoop 2.6 MapReduce.

(2) The project covers the text preprocessing and text representation stages of text classification: word segmentation, stop-word removal, feature selection, and text representation. (The classification step uses the random forest algorithm, which is not released for now; readers can use Mahout or Weka for verification.)

(3) Word segmentation uses NLPIR/ICTCLAS2015. Text is represented with the vector space model (VSM), with term weights computed by TF-IDF. Feature selection uses the CHI algorithm (chi-square statistic).

(4) For setting up the environment for parallel word segmentation, see my blog post: http://www.cnblogs.com/merru/p/4917665.html

(5) For setting up the Hadoop environment, see my blog posts: http://www.cnblogs.com/merru/p/4901528.html and http://www.cnblogs.com/merru/p/4905118.html
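The feature selection and weighting steps described above can be illustrated with a small single-machine sketch. This is not the project's MapReduce code: the class name TextFeatures, the method names, and the sample counts are hypothetical, and the formula variants are assumptions (the chi-square statistic from a 2x2 term/category table, tf as count divided by document length, idf as the natural log of N/df). The project may use different smoothing or normalization.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; plain Java, no Hadoop dependency.
class TextFeatures {

    /**
     * Chi-square statistic for one (term, category) pair.
     * a = documents in the category that contain the term
     * b = documents outside the category that contain the term
     * c = documents in the category that do not contain the term
     * d = documents outside the category that do not contain the term
     */
    public static double chiSquare(long a, long b, long c, long d) {
        double n = a + b + c + d;
        double denom = (double) (a + c) * (b + d) * (a + b) * (c + d);
        if (denom == 0) {
            return 0.0; // degenerate table: the term or category carries no information
        }
        double diff = (double) a * d - (double) c * b;
        return n * diff * diff / denom;
    }

    /**
     * TF-IDF weights for one segmented document (a sparse VSM vector).
     * docFreq maps each selected feature term to the number of documents
     * containing it; terms missing from docFreq are skipped.
     */
    public static Map<String, Double> tfidf(List<String> docTerms,
                                            Map<String, Integer> docFreq,
                                            int numDocs) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : docTerms) {
            counts.merge(term, 1, Integer::sum);
        }
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            Integer df = docFreq.get(e.getKey());
            if (df == null || df == 0) {
                continue; // not a selected feature
            }
            double tf = e.getValue() / (double) docTerms.size();
            double idf = Math.log(numDocs / (double) df);
            weights.put(e.getKey(), tf * idf);
        }
        return weights;
    }

    public static void main(String[] args) {
        // Made-up counts for illustration only.
        System.out.println(chiSquare(30, 10, 20, 40));

        Map<String, Integer> docFreq = new HashMap<>();
        docFreq.put("hadoop", 2);
        docFreq.put("text", 4);
        docFreq.put("mining", 1);
        List<String> doc = Arrays.asList("hadoop", "hadoop", "text", "mining");
        System.out.println(tfidf(doc, docFreq, 4));
    }
}
```

In a MapReduce setting the counts a, b, c, d and the document frequencies would typically be aggregated by earlier jobs. Terms would then be ranked by chi-square score per category, and only the top-scoring terms kept as features before the TF-IDF vectors are written out.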