Text classification (1) - text preprocessing & text representation based on Hadoop

Author: 青密
Date: 2015-12-23 05:05:27
Downloads: 3
Points: 1
Category: Parallel computing, Hadoop

Description

Application background

1. Environment: Ubuntu 14, Hadoop 2.6, Eclipse, NLPIR/ICTCLAS2015, etc.

2. Algorithm overview:

   (1) The project is developed as parallel MapReduce jobs on Hadoop 2.6.
   (2) It covers the text preprocessing and text representation stages of text classification: word segmentation, stop-word removal, feature selection, and text representation. The classification stage uses the random forest algorithm, which is not released for now; readers can use Mahout or Weka for verification.
   (3) Word segmentation uses NLPIR/ICTCLAS2015. Texts are represented with the vector space model (VSM), with term weights computed by TF-IDF. Feature selection uses the CHI algorithm (chi-square statistic).
   (4) For the parallel word segmentation environment, see my blog post: http://www.cnblogs.com/merru/p/4917665.html
   (5) For setting up the Hadoop environment, see my blog posts: http://www.cnblogs.com/merru/p/4901528.html and http://www.cnblogs.com/merru/p/4905118.html
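The TF-IDF weighting and CHI feature selection named in the description can be sketched in plain Java as follows. This is a minimal single-machine illustration, not the project's MapReduce code: the class name `TextFeatures` and its method names are hypothetical, the input is assumed to be already segmented and stripped of stop words, and the formulas are the common textbook variants (tf = term count / document length, idf = ln(N / df), and chi-square = N(AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D)) over a 2x2 table of document counts). The project's own Tf, Idf, TfIdf, and FeatureSlt classes may use different variants.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; names are hypothetical and not taken from the project.
final class TextFeatures {

    /**
     * Chi-square statistic of a term for one class, from document counts:
     * a = in-class docs containing the term, b = other-class docs containing it,
     * c = in-class docs without it,          d = other-class docs without it.
     */
    static double chiSquare(long a, long b, long c, long d) {
        double n = a + b + c + d;
        double denom = (double) (a + c) * (b + d) * (a + b) * (c + d);
        if (denom == 0.0) {
            return 0.0; // degenerate table: a row or column total is zero
        }
        double diff = (double) a * d - (double) c * b;
        return n * diff * diff / denom;
    }

    /** TF-IDF vector for each tokenized document (one map of term -> weight per document). */
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        // Document frequency: number of documents that contain each term.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : docs) {
            // Raw term counts within this document.
            Map<String, Integer> counts = new HashMap<>();
            for (String term : doc) {
                counts.merge(term, 1, Integer::sum);
            }
            Map<String, Double> vector = new HashMap<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                double tf = e.getValue() / (double) doc.size();
                double idf = Math.log(docs.size() / (double) df.get(e.getKey()));
                vector.put(e.getKey(), tf * idf);
            }
            vectors.add(vector);
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("hadoop", "mapreduce", "hadoop"),
                Arrays.asList("text", "hadoop"));
        System.out.println(tfIdf(docs));
        System.out.println(chiSquare(10, 0, 0, 10));
    }
}
```

In a MapReduce setting, the document-frequency and 2x2 count tables would typically be accumulated by reducers keyed on the term, with the final weights computed in a later job; the sketch keeps everything in memory only to show the arithmetic.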

File list

Name                                 Size       Date
readme.txt                           808.00 B   2015-12-23 17:13
.classpath                           18.91 kB   2015-10-27 22:23
.project                             413.00 B   2015-10-27 22:15
20141225.err                         92.00 B    2015-10-27 22:20
20151027.err                         666.00 B   2015-10-27 22:58
20151028.err                         444.00 B   2015-10-28 16:07
20151106.err                         3.14 kB    2015-11-06 20:09
20151126.err                         1.73 kB    2015-11-26 18:45
20151129.err                         1.95 kB    2015-11-30 15:40
20151130.err                         3.69 kB    2015-11-30 20:52
20151201.err                         444.00 B   2015-12-01 15:06
20151202.err                         1.95 kB    2015-12-02 20:34
20151203.err                         1.19 kB    2015-12-03 22:40
20151204.err                         2.93 kB    2015-12-04 16:54
20151205.err                         1.73 kB    2015-12-05 11:18
20151206.err                         4.44 kB    2015-12-06 17:44
BIG2GBK.map                          279.49 kB  2015-10-27 22:20
BIG5.pdat                            457.48 kB  2015-10-27 22:20
BIG5.wordlist                        154.98 kB  2015-10-27 22:20
BiWord.big                           3.36 MB    2015-10-27 22:20
Configure.xml                        1.06 kB    2015-10-27 22:20
CoreDict.pdat                        1.62 MB    2015-10-27 22:20
CoreDict.pos                         1.70 MB    2015-10-27 22:20
CoreDict.unig                        466.96 kB  2015-10-27 22:20
DocExtractor.user                    3.28 kB    2015-10-27 22:20
English.pdat                         5.06 MB    2015-10-27 22:20
English.pos                          4.29 MB    2015-10-27 22:20
English.ung                          1.60 MB    2015-10-27 22:20
English.wordlist                     2.74 MB    2015-10-27 22:20
Irrel2regular.map                    955.22 kB  2015-10-27 22:20
ne.pdat                              1.11 MB    2015-10-27 22:20
ne.pos                               1.22 MB    2015-10-27 22:20
ne.wordlist                          652.73 kB  2015-10-27 22:20
FTU8.pdat                            533.91 kB  2015-10-27 22:20
FTU8.wordlist                        186.22 kB  2015-10-27 22:20
FTU82GBK.map                         279.49 kB  2015-10-27 22:20
FieldDict.pdat                       371.11 kB  2015-10-27 22:20
FieldDict.pos                        26.62 kB   2015-10-27 22:20
GBK.pdat                             536.33 kB  2015-10-27 22:20
GBK.wordlist                         163.07 kB  2015-10-27 22:20
GBK2BIG.map                          279.49 kB  2015-10-27 22:20
GBK2FTU8.map                         279.49 kB  2015-10-27 22:20
GBK2GBKC.map                         279.49 kB  2015-10-27 22:20
GBK2UTF.map                          279.49 kB  2015-10-27 22:20
GBKA.pdat                            537.94 kB  2015-10-27 22:20
GBKA.wordlist                        163.07 kB  2015-10-27 22:20
GBKA2UTF.map                         279.49 kB  2015-10-27 22:20
GBKC.pdat                            537.94 kB  2015-10-27 22:20
GBKC.wordlist                        163.07 kB  2015-10-27 22:20
GBKC2GBK.map                         279.49 kB  2015-10-27 22:20
GranDict.pdat                        1.89 MB    2015-10-27 22:20
GranDict.pos                         1.70 MB    2015-10-27 22:20
ICTPOS.map                           422.00 B   2015-10-27 22:20
LJHtmlParser.user                    3.28 kB    2015-10-27 22:20
NLPIR.ctx                            36.38 kB   2015-10-27 22:20
NLPIR.user                           3.28 kB    2015-10-27 22:20
NLPIR_First.map                      288.00 B   2015-10-27 22:20
NLPIR_trial.user                     3.28 kB    2015-10-27 22:20
NewWord.lst                          4.98 kB    2015-10-27 22:20
PKU.map                              323.00 B   2015-10-27 22:20
PKU_First.map                        300.00 B   2015-10-27 22:20
UTF2GBK.map                          279.49 kB  2015-10-27 22:20
UTF2GBKA.map                         279.49 kB  2015-10-27 22:20
UTF8.pdat                            544.21 kB  2015-10-27 22:20
UTF8.wordlist                        186.22 kB  2015-10-27 22:20
UserDict.pdat                        32.83 kB   2015-10-27 22:20
charset.type                         64.00 kB   2015-10-27 22:20
classifier.user                      3.28 kB    2015-10-27 22:20
cluster.user                         3.28 kB    2015-10-27 22:20
keyExtract.user                      3.28 kB    2015-10-27 22:20
location.map                         77.55 kB   2015-10-27 22:20
location.pdat                        406.98 kB  2015-10-27 22:20
location.wordlist                    103.68 kB  2015-10-27 22:20
nr.ctx                               2.16 kB    2015-10-27 22:20
nr.fsa                               2.94 kB    2015-10-27 22:20
nr.role                              1.68 MB    2015-10-27 22:20
sentiment.pdat                       834.49 kB  2015-10-27 22:20
sentiment.ung                        85.94 kB   2015-10-27 22:20
summary.user                         3.28 kB    2015-10-27 22:20
BTree.class                          691.00 B   2015-12-06 17:54
BuildTree.class                      12.79 kB   2015-12-06 17:54
Classify$ClassifyMap.class           7.02 kB    2015-12-06 17:54
Classify$ClassifyReduce.class        1.54 kB    2015-12-06 17:54
Classify.class                       446.00 B   2015-12-06 17:54
DTree$DTreeMap.class                 3.60 kB    2015-12-06 17:54
DTree$DTreeReduce.class              1.52 kB    2015-12-06 17:54
DTree.class                          419.00 B   2015-12-06 17:54
SrAndDsr.class                       1.73 kB    2015-12-06 17:54
libNLPIR.so                          1.71 MB    2015-10-27 19:16
log4j.properties                     11.03 kB   2015-10-27 22:53
txMain.class                         2.91 kB    2015-12-06 17:54
CLibrary.class                       832.00 B   2015-12-06 17:54
fenci.class                          3.31 kB    2015-12-06 17:54
Sample$SampleMap.class               4.15 kB    2015-12-06 17:54
Sample$SampleReduce.class            1.52 kB    2015-12-06 17:54
Sample.class                         420.00 B   2015-12-06 17:54
SegWords$SegWordsMap.class           4.44 kB    2015-12-06 17:54
SegWords$SegWordsReduce.class        1.53 kB    2015-12-06 17:54
SegWords.class                       458.00 B   2015-12-06 17:54
treeExecutor.class                   4.38 kB    2015-12-06 17:54
FeatureSlt$FeatureSltMap.class       2.83 kB    2015-12-06 17:54
FeatureSlt$FeatureSltReduce$1.class  1.63 kB    2015-12-06 17:54
FeatureSlt$FeatureSltReduce.class    5.22 kB    2015-12-06 17:54
FeatureSlt.class                     464.00 B   2015-12-06 17:54
FeatureSts$FeatureStsMap.class       2.53 kB    2015-12-06 17:54
FeatureSts$FeatureStsReduce.class    6.34 kB    2015-12-06 17:54
FeatureSts.class                     464.00 B   2015-12-06 17:54
Idf$IdfMap.class                     3.14 kB    2015-12-06 17:54
Idf$IdfReduce.class                  4.02 kB    2015-12-06 17:54
Idf.class                            401.00 B   2015-12-06 17:54
PreExecutor.class                    7.74 kB    2015-12-06 17:54
PreExecutorOld.class                 7.77 kB    2015-12-06 17:54
Tf$TfMap.class                       2.87 kB    2015-12-06 17:54
Tf$TfReduce.class                    2.93 kB    2015-12-06 17:54
Tf.class                             392.00 B   2015-12-06 17:54
TfIdf$TfIdfMap.class                 6.19 kB    2015-12-06 17:54
TfIdf$TfIdfReduce.class              4.62 kB    2015-12-06 17:54
TfIdf.class                          419.00 B   2015-12-06 17:54
sampleUn.class                       1.25 kB    2015-12-06 17:54
jna-4.1.0.jar                        893.16 kB  2015-10-27 19:18
log4j.properties                     11.03 kB   2015-10-23 16:50
libNLPIR.so                          1.71 MB    2015-10-27 19:16
log4j.properties                     11.03 kB   2015-10-27 22:53
txMain.java                          2.97 kB    2015-12-06 17:44
CLibrary.java                        1.55 kB    2015-10-27 22:22
fenci.java                           2.68 kB    2015-10-27 22:31
SegWords.java                        2.42 kB    2015-12-04 13:42
FeatureSlt.java                      4.78 kB    2015-12-04 13:47
FeatureSts.java                      7.20 kB    2015-11-30 16:56
Idf.java                             3.40 kB    2015-12-06 17:43
PreExecutor.java                     9.11 kB    2015-12-06 17:22
PreExecutorOld.java                  9.05 kB    2015-12-06 17:05
Tf.java                              3.47 kB    2015-12-06 17:23
TfIdf.java                           7.02 kB    2015-12-06 17:42
sampleUn.java                        771.00 B   2015-12-03 16:22
...
Text classification (1) - text preprocessing & text representation based on Hadoop (10.67 MB)
