The data derive from 1708 human proteins.
A protein is a string of amino acids, drawn from an alphabet of size 20.
Each protein is represented by a bag of N-grams;
i.e., we record the number of occurrences of each possible N-tuple
of amino acids (for us N=4).
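As a rough illustration of the bag-of-N-grams idea, the sketch below counts the 4-grams in a single made-up sequence. The helper name `count_ngrams` and the toy sequence are assumptions for illustration only; they are not part of the data files. With an alphabet of size 20 there are 20^4 = 160,000 possible 4-grams, so this sketch stores only the ones that actually occur.

```r
# Count the N-grams (length-n substrings) of one amino-acid string.
# `count_ngrams` is a hypothetical helper, not part of the data set.
count_ngrams <- function(protein, n = 4) {
  len <- nchar(protein)
  if (len < n) return(table(character(0)))  # too short: no N-grams
  starts <- seq_len(len - n + 1)            # start position of each window
  grams <- substring(protein, starts, starts + n - 1)
  table(grams)                              # occurrences of each observed N-gram
}

# Toy sequence, made up for illustration
counts <- count_ngrams("MKVLMKVLA", n = 4)
print(counts)
```

Here the 4-gram MKVL occurs twice and the other four observed 4-grams occur once each.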
What we actually have is the 1708x1708 inner-product matrix for
such a representation.
This was computed using a string kernel (Leslie et al. 2003).
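To make the inner-product matrix concrete, here is a minimal sketch that builds a count matrix X for three made-up sequences and forms K = X X^T. This is a plain inner product of 4-gram counts, under the assumption that no normalization or mismatch handling is applied; the kernel actually used for the data may differ in such details. All names (`seqs`, `X`, `K`) are illustrative.

```r
# Toy sequences, made up for illustration (each at least n characters long)
seqs <- c(a = "MKVLMKVL", b = "MKVLAAAA", c = "AAAAAAAA")
n <- 4

# All N-grams of each sequence, in order of appearance
grams <- lapply(seqs, function(s) {
  starts <- seq_len(nchar(s) - n + 1)
  substring(s, starts, starts + n - 1)
})

# Vocabulary of N-grams observed in at least one sequence
vocab <- sort(unique(unlist(grams)))

# Count matrix: rows = sequences, columns = observed N-grams
X <- t(vapply(grams,
              function(g) tabulate(match(g, vocab), nbins = length(vocab)),
              integer(length(vocab))))
colnames(X) <- vocab

# Inner-product (Gram) matrix: K[i, j] = sum over N-grams of count_i * count_j
K <- X %*% t(X)
print(K)
```

The resulting matrix is symmetric, and each diagonal entry is the squared length of that sequence's count vector.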
The response label takes values in {-1, +1}: there are 45 pluses
(a particular protein class)
and 1663 minuses.
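The class counts imply a heavy imbalance, which the short calculation below spells out. It uses only the counts stated above; the variable names are illustrative. A rule that always predicts -1 would already be right about 97% of the time, so raw accuracy is probably not a very informative measure here.

```r
n_pos <- 45      # pluses, as stated above
n_neg <- 1663    # minuses, as stated above

n_total <- n_pos + n_neg          # should match the 1708 proteins
prop_pos <- n_pos / n_total       # fraction of pluses, roughly 0.026
baseline_acc <- n_neg / n_total   # accuracy of always predicting -1, roughly 0.974

print(c(total = n_total, prop_pos = prop_pos, baseline_acc = baseline_acc))
```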
The idea is to build a classifier that predicts the label from the kernel matrix.
The data are in two files:
protein_kernel.txt
protein_label.txt
The data can be read into R most simply via
protein_kernel <- matrix(scan("http://hastie.su.domains/CASI_files/DATA/protein_kernel.txt",what=0),1708,1708)
protein_label <- scan("http://hastie.su.domains/CASI_files/DATA/protein_label.txt",what=0)
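Once the kernel matrix and labels are loaded, a classifier can work directly on the precomputed kernel. The text does not say which classifier is intended (a support vector machine is a common choice with string kernels), so the sketch below swaps in kernel ridge regression used as a classifier, simply because it needs nothing beyond base R. The function names, the value of `lambda`, and the toy data are all assumptions for illustration.

```r
# Fit: solve (K_train + lambda * I) alpha = y_train.
# Hypothetical helper; kernel ridge regression, not necessarily the intended method.
fit_kernel_ridge <- function(K_train, y_train, lambda = 1) {
  solve(K_train + lambda * diag(nrow(K_train)), y_train)
}

# Predict: sign of the kernel between new points and training points, times alpha.
predict_kernel_ridge <- function(K_new_train, alpha) {
  ifelse(drop(K_new_train %*% alpha) >= 0, 1, -1)
}

# Toy data standing in for protein_kernel / protein_label
pos <- rbind(c(2, 2), c(3, 2), c(2, 3), c(3, 3))
X <- rbind(pos, -pos)
y <- rep(c(1, -1), each = 4)
K <- X %*% t(X)                 # precomputed inner-product matrix

train <- c(1, 2, 5, 6)          # indices used for fitting
test  <- c(3, 4, 7, 8)          # held-out indices

alpha <- fit_kernel_ridge(K[train, train], y[train], lambda = 1)
pred  <- predict_kernel_ridge(K[test, train], alpha)
print(mean(pred == y[test]))    # held-out accuracy on the toy data
```

With the real data, the same calls would take `protein_kernel[train, train]` and `protein_label[train]` for some chosen split. Given that only 45 of the 1708 labels are pluses, a split that keeps some pluses on each side, and an error measure other than raw accuracy, would likely be worth considering.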