Protein classification (Section 19.6)

The data derive from 1708 human proteins.

A protein is a string of amino acids, drawn from an alphabet of size 20.
Each protein is represented as a bag of N-grams;
i.e., we record the number of occurrences of each possible N-tuple of amino acids (here N=4).
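
To make the bag-of-N-grams idea concrete, here is a minimal sketch in base R that counts overlapping 4-grams in a single sequence. The function name count_ngrams and the toy sequence are made up for illustration and are not part of the data set. With 20 amino acids there are 20^4 = 160000 possible 4-grams, so only the ones that actually occur are stored.

```r
# Count overlapping N-grams in one sequence (hypothetical helper, toy input).
count_ngrams <- function(sequence, n = 4) {
  stopifnot(nchar(sequence) >= n)
  starts <- seq_len(nchar(sequence) - n + 1)
  grams <- substring(sequence, starts, starts + n - 1)
  tab <- table(grams)
  # Return a plain named vector: names are the N-grams, values their counts.
  setNames(as.vector(tab), names(tab))
}

toy_protein <- "MKVLAAMKVL"   # made-up sequence, not from the data
counts <- count_ngrams(toy_protein)
print(counts)
```

In this toy sequence the 4-gram "MKVL" appears twice and the other five 4-grams appear once each.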
What we actually have is the 1708x1708 inner-product matrix for such a representation.
This was computed using a string kernel (Leslie et al. 2003).
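
As an illustration of what one entry of such an inner-product matrix means, the sketch below computes the plain inner product of 4-gram count vectors for a few made-up sequences. This is an assumption-laden simplification: the function names and sequences are hypothetical, and the string kernel actually used to produce the supplied matrix may differ in its details (for example, in how inexact matches are treated).

```r
# Count overlapping N-grams in one sequence (hypothetical helper).
count_ngrams <- function(sequence, n = 4) {
  stopifnot(nchar(sequence) >= n)
  starts <- seq_len(nchar(sequence) - n + 1)
  tab <- table(substring(sequence, starts, starts + n - 1))
  setNames(as.vector(tab), names(tab))
}

# Inner product of two bag-of-N-grams vectors: only shared N-grams contribute.
ngram_inner_product <- function(a, b, n = 4) {
  ca <- count_ngrams(a, n)
  cb <- count_ngrams(b, n)
  shared <- intersect(names(ca), names(cb))
  sum(ca[shared] * cb[shared])
}

toy <- c(p1 = "MKVLAAMKVL", p2 = "AAMKVLGG", p3 = "GGGGGGG")  # made-up sequences
idx <- seq_along(toy)
K <- outer(idx, idx,
           Vectorize(function(i, j) ngram_inner_product(toy[[i]], toy[[j]])))
dimnames(K) <- list(names(toy), names(toy))
print(K)
```

The resulting 3x3 matrix is symmetric, and sequences that share no 4-gram (here p1 and p3) get an entry of zero.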
The response label takes the value -1 or +1, with 45 pluses (a particular protein class) and 1663 minuses.
The idea is to build a classifier that predicts the label from the kernel matrix.
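
One way to see how a classifier can be built from an inner-product matrix alone is the sketch below, which fits a kernel ridge (regularized least squares) classifier on simulated data. This is a stand-in technique chosen because it needs only base R; it is not necessarily the method used in Section 19.6. The simulated data, the train/test split, and the value of lambda are all arbitrary assumptions.

```r
set.seed(1)

# Simulated two-class data (NOT the protein data): 20 pluses, 40 minuses.
X <- rbind(matrix(rnorm(40, mean = 2), 20, 2),
           matrix(rnorm(80, mean = -1), 40, 2))
y <- c(rep(1, 20), rep(-1, 40))
K <- X %*% t(X)                      # inner-product (linear kernel) matrix

train <- sample(nrow(K), 40)         # arbitrary train/test split
test <- setdiff(seq_len(nrow(K)), train)

# Kernel ridge fit: solve (K_train + lambda * I) alpha = y_train.
lambda <- 1                          # arbitrary regularization value
alpha <- solve(K[train, train] + lambda * diag(length(train)), y[train])

# Predict using only kernel values between test and training points.
score <- as.vector(K[test, train, drop = FALSE] %*% alpha)
pred <- ifelse(score > 0, 1, -1)
print(table(predicted = pred, actual = y[test]))
```

For the protein data, K would be the supplied 1708x1708 matrix and y the label vector. Because only 45 of the 1708 labels are pluses, always predicting -1 already gives about 97% accuracy, so a confusion table is more informative than raw accuracy.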

protein_kernel.txt
protein_label.txt

The data can be read into R most simply via:
protein_kernel <- matrix(scan("http://hastie.su.domains/CASI_files/DATA/protein_kernel.txt", what = 0), 1708, 1708)
protein_label <- scan("http://hastie.su.domains/CASI_files/DATA/protein_label.txt", what = 0)
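
After loading, a quick sanity check against the description above can catch a truncated download. The helper below is a hypothetical sketch (check_protein_data is not part of the data distribution); it is shown on a tiny stand-in matrix so that it runs without network access.

```r
# Check dimensions, symmetry, and label counts (hypothetical helper).
check_protein_data <- function(K, y, n = 1708) {
  stopifnot(is.matrix(K), all(dim(K) == c(n, n)), length(y) == n)
  list(symmetric = isSymmetric(K),   # numerical symmetry, up to a small tolerance
       n_plus = sum(y == 1),
       n_minus = sum(y == -1))
}

# Toy stand-in for the real objects:
toy_K <- diag(3)
toy_y <- c(1, -1, -1)
res <- check_protein_data(toy_K, toy_y, n = 3)
print(res)

# With the real data loaded as above, one would call:
# check_protein_data(protein_kernel, protein_label)
```

On the real data, the description above implies n_plus should be 45 and n_minus should be 1663.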