Ling Li: Code: Data Categorization

Code for Data Categorization

Here are the source files we used for the paper Improving Generalization by Data Categorization. The code is sparsely commented, for which we apologize. Please let us know if you have any problems using it.

You may also find the data used for the paper at my data page.

C++ Files

To compile the files here, you will need LEMGA (snapshot 2005/09/04), our software library for machine learning research; LIBSVM; and the Boost C++ Libraries (ver 1.30.0).

Makefile
nnrho.cpp: Randomly samples neural networks with 1 hidden layer, and outputs data for computing the selection cost ρ (see also rho_calc.m)
svm-htlin.tar.gz: Uses LIBSVM 2.8 to obtain the SVM confidence margin (see its README)
nnboost.cpp: Constructs AdaBoost ensembles and outputs the margin information (see also wgt_calc.m)
bstmis.cpp: Our implementation of Merler's misclassification ratio (with arguments <data> <#in> 50 1000 1, i.e., the base learner is AdaBoost with 50 decision stumps, and we boost it for 1000 iterations)
nntest.cpp: Accepts weighted training data, and outputs the test error (see also batch_data.m)
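
For readers unfamiliar with the margin information mentioned for nnboost.cpp, here is a minimal sketch of the usual normalized voting margin of a weighted ensemble. It is not taken from nnboost.cpp and does not use LEMGA. The function name normalized_margin and its argument layout are hypothetical, and the exact quantity that nnboost.cpp outputs may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of LEMGA or nnboost.cpp).
// Computes the normalized margin of one example under a weighted vote:
//   margin = label * sum_t alpha[t] * votes[t] / sum_t alpha[t]
// votes[t] is the prediction (+1 or -1) of the t-th base hypothesis,
// alpha[t] is its non-negative voting weight, label is the true class (+1 or -1).
double normalized_margin(const std::vector<int>& votes,
                         const std::vector<double>& alpha,
                         int label) {
    assert(votes.size() == alpha.size());
    double weighted_vote = 0.0;
    double total_weight = 0.0;
    for (std::size_t t = 0; t < votes.size(); ++t) {
        weighted_vote += alpha[t] * votes[t];
        total_weight += alpha[t];
    }
    // An empty or all-zero ensemble has no opinion; report a zero margin.
    if (total_weight == 0.0) return 0.0;
    return label * weighted_vote / total_weight;
}
```

Under this definition the margin lies in [-1, 1]. It is positive when the weighted vote agrees with the label, and its magnitude reflects how strongly the ensemble agrees.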

Matlab Files

rho_calc.m: Computes the correlation coefficients (see also nnrho.cpp)
wgt_calc.m: Computes the average AdaBoost sample weights
batch_data.m: The main code to randomly split the data, get the estimates, and collect the test errors
Two commonly-used functions: assert.m and options.m
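
As a rough illustration of the kind of computation rho_calc.m is described as performing, the sketch below computes a sample Pearson correlation coefficient between two equally long sequences. Treating the correlation coefficients as Pearson coefficients is an assumption on our part. The function name pearson_correlation is hypothetical, and this is not a translation of rho_calc.m.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a port of rho_calc.m).
// Returns the sample Pearson correlation coefficient of x and y.
// If either sequence is constant the coefficient is undefined;
// this sketch returns 0 in that case.
double pearson_correlation(const std::vector<double>& x,
                           const std::vector<double>& y) {
    assert(x.size() == y.size() && !x.empty());
    const double n = static_cast<double>(x.size());

    // First pass: the two means.
    double mean_x = 0.0, mean_y = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        mean_x += x[i];
        mean_y += y[i];
    }
    mean_x /= n;
    mean_y /= n;

    // Second pass: centered sums of squares and cross products.
    double sxx = 0.0, syy = 0.0, sxy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double dx = x[i] - mean_x;
        const double dy = y[i] - mean_y;
        sxx += dx * dx;
        syy += dy * dy;
        sxy += dx * dy;
    }
    if (sxx == 0.0 || syy == 0.0) return 0.0;
    return sxy / std::sqrt(sxx * syy);
}
```

The two-pass form is used here because it is numerically more stable than the one-pass formula based on raw sums of squares.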