### Code for Data Categorization

Here are the source files we used for the paper
*Improving Generalization by Data Categorization*.
We apologize that the code is only sparsely commented.
Please let us know if you have any problems using it.

You may also find the data used for the paper at
my data page.

#### C++ Files

To compile the files here, you will need
LEMGA (snapshot 2005/09/04), our software library for machine learning research;
LIBSVM; and
the Boost C++ Libraries (ver 1.30.0).

- Makefile
- nnrho.cpp:
Randomly samples neural networks with 1 hidden layer, and outputs data
for computing the selection cost ρ (see also rho_calc.m)
- svm-htlin.tar.gz:
Uses LIBSVM 2.8 to obtain the SVM confidence margin (see its README)
- nnboost.cpp:
Constructs AdaBoost ensembles and outputs the margin information
(see also wgt_calc.m)
- bstmis.cpp:
Our implementation of Merler's misclassification ratio
(with arguments `<data> <#in> 50 1000 1`,
i.e., the base learner is AdaBoost with 50 decision stumps, and
we boost it for 1000 iterations)
- nntest.cpp:
Accepts weighted training data, and outputs the test error
(see also batch_data.m)
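
For readers unfamiliar with the AdaBoost sample weights that the boosting programs above produce and that wgt_calc.m later averages, the following is a minimal sketch of the textbook AdaBoost weight update with the weights averaged over rounds. It is not taken from nnboost.cpp or wgt_calc.m and does not use LEMGA: the function name `average_adaboost_weights` and the `correct` input (a per-round record of which examples the base learner classified correctly) are hypothetical, introduced only for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the released files).
// correct[t][i] is true when the base learner of round t classifies
// example i correctly. Returns, for each of the n examples, its AdaBoost
// sample weight averaged over the rounds that were processed.
std::vector<double> average_adaboost_weights(
    const std::vector<std::vector<bool>>& correct, std::size_t n)
{
    assert(n > 0);
    std::vector<double> w(n, 1.0 / static_cast<double>(n)); // uniform start
    std::vector<double> avg(n, 0.0);
    std::size_t rounds = 0;

    for (const auto& c : correct) {
        assert(c.size() == n);

        // Record the distribution used to train this round's base learner.
        for (std::size_t i = 0; i < n; ++i) avg[i] += w[i];
        ++rounds;

        // Weighted training error of this round's base learner.
        double err = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            if (!c[i]) err += w[i];

        // Textbook stopping rule: the learner is perfect or no better
        // than random guessing, so no further reweighting is done.
        if (err <= 0.0 || err >= 0.5) break;

        // Increase the weight of misclassified examples, decrease the
        // weight of correctly classified ones, then renormalize.
        const double alpha = 0.5 * std::log((1.0 - err) / err);
        double z = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            w[i] *= std::exp(c[i] ? -alpha : alpha);
            z += w[i];
        }
        for (std::size_t i = 0; i < n; ++i) w[i] /= z;
    }

    if (rounds > 0)
        for (std::size_t i = 0; i < n; ++i)
            avg[i] /= static_cast<double>(rounds);
    return avg;
}
```

Under this sketch, examples that are misclassified often accumulate a large average weight, which is the intuition behind using the average weight as a per-example estimate. The actual files may differ in details such as the stopping rule or which rounds are included in the average.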

#### Matlab Files

- rho_calc.m:
Computes the correlation coefficients (see also nnrho.cpp)
- wgt_calc.m:
Computes the average AdaBoost sample weights
- batch_data.m:
The main code to randomly split the data, get the estimates,
and collect the test errors
- Two commonly-used functions:
assert.m and
options.m
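
As a rough illustration of the kind of computation rho_calc.m is described as doing, here is a minimal C++ sketch of a sample correlation coefficient between two equal-length vectors. It assumes the Pearson definition, which the file list above does not confirm, and the function name `pearson_correlation` is hypothetical; it is not a translation of rho_calc.m.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the released files).
// Returns the Pearson correlation coefficient of x and y.
// The result is NaN when either input has zero variance.
double pearson_correlation(const std::vector<double>& x,
                           const std::vector<double>& y)
{
    assert(x.size() == y.size());
    assert(x.size() >= 2);
    const double n = static_cast<double>(x.size());

    // Sample means.
    double mx = 0.0, my = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        mx += x[i];
        my += y[i];
    }
    mx /= n;
    my /= n;

    // Centered sums of products and squares.
    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double dx = x[i] - mx;
        const double dy = y[i] - my;
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    return sxy / std::sqrt(sxx * syy);
}
```

The two-pass form (means first, then centered sums) is chosen here because it is less prone to cancellation error than the one-pass formula based on raw sums of squares.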