**Deriving a string kernel from text compression tools: the context tree weighting algorithm revisited**

*Marco Cuturi (Ecole des Mines de Paris, France)*

Kernel design for sequences and text compression share a common aim: detecting and quantifying statistical regularities in string objects. This makes any compression utility a potential starting point for kernel design. Intuitively, we might use compression algorithms as kernels directly, taking, for instance, the gain obtained by compressing two sequences concatenated, compared with compressing each sequence separately, as a measure of similarity between strings, and eventually as a string kernel.

This construction is not valid in general, as kernels must satisfy properties that most compression heuristics do not fulfill. Compression algorithms based on model averaging, however, can be shown to yield valid kernels. In this work we show in particular that a string kernel can be derived from the context tree weighting (CTW) algorithm of Willems, Shtarkov and Tjalkens. This kernel is obtained as a three-stage mixture of the joint probabilities of the sequences under variable-length Markov models. Experiments on the problem of protein homology detection are currently being conducted.
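As a toy illustration of the compression-gain intuition (not of the CTW kernel itself, and, as noted above, not a valid positive definite kernel in general), one can measure how many bytes a general-purpose compressor saves on the concatenation of two strings. The helper name `compression_gain` is hypothetical, used only for this sketch:

```python
import hashlib
import zlib

def compression_gain(x: bytes, y: bytes) -> int:
    """Bytes saved by compressing the concatenation versus each string alone."""
    size = lambda s: len(zlib.compress(s))
    return size(x) + size(y) - size(x + y)

a = b"the context tree weighting algorithm " * 20
# deterministic, essentially incompressible bytes, unrelated to a
u = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(4))

# a string shares far more regularity with itself than with unrelated bytes
assert compression_gain(a, a) > compression_gain(a, u)
assert compression_gain(a, a) > 0
```

A larger gain signals more shared regularity; turning such a quantity into a genuinely positive definite kernel is precisely what the model-averaging construction above addresses.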

**Discussion of discrimination of protein sequences by data-driven kernels**

*Tor Fla (University of Tromsø, Norway)*

We discuss the problem of learning the temperature adaptation and function of enzymes. For this purpose we introduce a data-driven kernel to compare protein sequence data within an aligned family of m protein sequences, each of length n with gaps. Relative sequence counts of single, binary, ternary and higher-order compositions of short amino acid strings at different position separations are used as observables. The square roots of the relative sequence counts are used as estimators for the probability amplitudes of individual amino acids, which then give rise to a probability amplitude sequence identified with the amino acid sequence. The advantage of the probability amplitude approach is that it can be given a quantum mechanical interpretation as phase-averaged probability amplitudes in a path integral approach. This also points to more general mixed-state constructions of a real probability density matrix.
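A minimal sketch of the probability-amplitude construction for singleton counts, assuming alignment columns given as strings; `amplitude_profile` and `amplitude_kernel` are illustrative names, and the inner product of amplitudes is the Bhattacharyya affinity between the underlying frequency estimates:

```python
import math
from collections import Counter

def amplitude_profile(column):
    """Square roots of relative single-residue counts in one alignment column."""
    counts = Counter(column)
    n = len(column)
    return {aa: math.sqrt(c / n) for aa, c in counts.items()}

def amplitude_kernel(col_a, col_b):
    """Inner product of amplitude profiles (the Bhattacharyya affinity)."""
    pa, pb = amplitude_profile(col_a), amplitude_profile(col_b)
    return sum(pa[aa] * pb.get(aa, 0.0) for aa in pa)

# identical columns have affinity 1; columns with disjoint residues have 0
assert abs(amplitude_kernel("AAGG", "AAGG") - 1.0) < 1e-12
assert amplitude_kernel("AAAA", "GGGG") == 0.0
```

Because amplitudes are unit vectors in the square-root embedding, the kernel is automatically bounded by 1, which is one practical attraction of the amplitude viewpoint.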

We approximate the family of probability sequences by a Gaussian distribution, valid in the asymptotic limit of large sequence length. The distribution has an intersequence cross-correlation tensor D, which gives rise to m decorrelated projected sequences ordered by their variance, corresponding to the eigenvalues of D. We then introduce a model selection principle in which only the d largest probability amplitudes in each projected sequence are assigned a mean value, while the (n-d) smallest are modelled as a pure noise component. We select the robust, distinguishable model of size d for each projected probability sequence according to its Fisher information and the corresponding Rissanen minimum description length (MDL). This gives rise to data-driven estimators of the mean values and correlation tensor, and to a corresponding data-driven Fisher information, Fisher score and natural kernel. For generalization purposes we insist that the Fisher Riemannian metric F be pulled back to an equivalent invariant fixed metric F0, since otherwise comparisons would be made with respect to different inner products and frames for each data sample and choice of parameter coordinates. We solve the equivalence problem for our simple Gaussian model and show that it corresponds to choosing the logarithm of the relative variance as the basic variance parameter, with the mean values normalized to the more obvious choice of relative standard deviation. We then suggest that the Fisher score, the natural kernel and the corresponding regularization operators should be constructed in an invariant manner according to the above pulled-back parameter variables and structures.

We implement a preliminary data-driven PCA analysis according to the above program and show, for a family of aligned chymotrypsin and elastase sequences, how these can be separated into clusters of warm- and cold-adapted enzymes by sequence counts of singletons, binaries, ternaries, etc. alone.

Finally, we discuss the problem of generalization across families and structures by plotting the contact matrix for chymotrypsin and elastase. We suggest introducing additional quantum mechanical interactions, modeled by interaction operators arising from a prior on the mean parameters that corresponds to the degree of overlap and strength of interaction of amino acids in contact. We demonstrate how such an approach can generate new data-driven kernel structures. A consequence for data-driven machine learning principles is that the kernel structures now have to be carefully modelled for each protein family. If generalization is possible between families, comparison of sequence data can only be done with reference to a common kernel (metric) structure. Further, data-driven observables, e.g. separation curves between different structures and functions, and the resulting machine learning methods such as SVM, FDA, DAG, etc., can now be directly related to the data-driven and biomolecular modeling of bioinformatics. Future robust data-driven machine learning of this kind should be developed by using, e.g., the MDL principle in these algorithms as a preselection principle.

**A hybrid kernel machine for protein secondary structure prediction**

*Yann Guermeur (LORIA, France)*, joint work with Alain Lifchitz and Regis Vert

Predicting the tertiary structure of proteins from their amino acid sequences remains one of the central challenges in structural biology. A standard way to tackle this problem consists in first predicting an intermediate description called the secondary structure. Loosely speaking, each amino acid of the sequence must be assigned to a category corresponding to the local conformational element in which it is involved: either a helix, a strand or an aperiodic structure. Considered from the point of view of pattern recognition, this is thus a three-class discrimination problem.

During the last four decades, many approaches have been implemented to predict the secondary structure. Most of the state-of-the-art methods are based on neural networks, either feedforward (MLPs) or recurrent (BRNNs). We introduce here a new method, which combines a multi-class support vector machine (M-SVM) with an inhomogeneous hidden Markov model (IHMM). The M-SVM performs the low-level treatments. It takes as input PSI-BLAST profiles of alignments and outputs conformational scores, which are subsequently post-processed with a multiple-output perceptron to generate class posterior probability estimates. These estimates provide us with the observation probability distributions of the IHMM. The maximum likelihood state sequence is computed using the Viterbi algorithm. The transition probabilities are optimized through a Monte Carlo algorithm. The aim of this hierarchical architecture is to combine the advantages resulting from applying a discriminant and a generative model to perform the task.
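The decoding step can be sketched as a standard Viterbi recursion over the per-position class posteriors; the toy below assumes three conformational states and uses illustrative names, not the authors' implementation:

```python
import numpy as np

def viterbi(obs_probs, trans, init):
    """Most likely state path; obs_probs[t, k] plays the role of the
    per-position class posteriors produced by the M-SVM stage."""
    T, K = obs_probs.shape
    log_d = np.log(init) + np.log(obs_probs[0])
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = log_d[:, None] + np.log(trans)   # scores[i, j]: state i -> j
        back[t] = scores.argmax(axis=0)
        log_d = scores.max(axis=0) + np.log(obs_probs[t])
    path = [int(log_d.argmax())]
    for t in range(T - 1, 0, -1):                 # backtrack
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# toy check: with uniform transitions, Viterbi reduces to per-position argmax
obs = np.array([[0.9, 0.05, 0.05], [0.8, 0.1, 0.1], [0.1, 0.1, 0.8]])
uniform = np.full((3, 3), 1 / 3)
assert viterbi(obs, uniform, np.full(3, 1 / 3)) == [0, 0, 2]
```

Non-uniform transitions are where the IHMM earns its keep, smoothing implausible state changes that a position-by-position classifier would make.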

*Heijin Kim (POSTECH, Korea)*

This paper addresses the problem of classifying microarray data into subclasses and investigating their relationship with clinical outcomes by projecting the data onto a kernel feature space. The main ingredients in our approach are kernel machines such as the support vector machine (SVM), kernel principal component analysis (KPCA), and kernel partial least squares (KPLS). For the task of breast cancer classification, we employ SVM and KPLS and compare the results of the two techniques. Another important task is to analyze the patterns of clinical outcomes in feature space. In order to visualize these relationships in a low-dimensional space, both KPCA and KPLS are used, and each method is compared to PCA and PLS, respectively. It turns out that these methods are useful not only for classifying cancer subclasses but also for identifying correlations with clinical outcomes by representing the nonlinearly projected expression profiles in a low-dimensional feature space.
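The KPCA projection used for such visualization can be sketched as follows, assuming a precomputed Gram matrix; `kernel_pca` is an illustrative helper, not the paper's code:

```python
import numpy as np

def kernel_pca(K, n_components=2):
    """Embed samples using the top eigenpairs of the doubly centered Gram matrix."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    Kc = H @ K @ H
    vals, vecs = np.linalg.eigh(Kc)              # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:n_components]  # keep the largest ones
    return vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0.0, None))

# sanity check: with all components kept, the embedding reproduces the
# centered Gram matrix (here K is a linear kernel on three toy profiles)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = X @ X.T
Z = kernel_pca(K, n_components=3)
H = np.eye(3) - np.ones((3, 3)) / 3
assert np.allclose(Z @ Z.T, H @ K @ H)
```

With a nonlinear kernel in place of the linear one, the same routine yields the nonlinearly projected coordinates plotted in low dimensions.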

**Diffusion and other kernels on graphs**

*Imre Risi Kondor (Columbia University, USA)*

The structure of the space on which learning algorithms are to operate is often only very loosely known. By merely positing pairwise similarities between examples, the input space can be regarded as a graph. We discuss how to construct meaningful kernels in such a domain and present a canonical family of kernels based on the physical concept of diffusion. A number of recent extensions of this approach will also be discussed.
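A minimal sketch of the diffusion kernel on a graph, computed as the matrix exponential of the (negated) graph Laplacian via its eigendecomposition; names are illustrative:

```python
import numpy as np

def diffusion_kernel(A, beta=1.0):
    """Heat kernel exp(-beta * L) for the Laplacian L = D - A,
    computed through the eigendecomposition of L."""
    L = np.diag(A.sum(axis=1)) - A
    vals, vecs = np.linalg.eigh(L)
    return (vecs * np.exp(-beta * vals)) @ vecs.T

# path graph 1-2-3: diffusion makes neighbours more similar than distant nodes
A = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
K = diffusion_kernel(A, beta=0.5)
assert K[0, 1] > K[0, 2]
assert np.all(np.linalg.eigvalsh(K) > 0)   # positive definite, hence a valid kernel
```

Since the eigenvalues of the exponential are strictly positive, the diffusion kernel is positive definite on any graph, and the parameter beta controls how far similarity spreads along edges.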

**Joint classifier and kernel design**

*Balaji Krishnapuram (Duke University, USA)*

The analysis of gene and protein expression data is challenging from a machine learning perspective due to its extremely large dimensionality. The presence of a large number of irrelevant features (here, genes) makes such analysis error prone due to the curse of dimensionality. In order to overcome this limitation, Support Vector Machines are widely employed, since it is well known that they possess good generalization properties even in the presence of irrelevant predictor variables.

Motivated by error bounds from computational learning theory, we present a Bayesian algorithm that learns the optimal classifier and kernel jointly from the data. The EM algorithm that we derive for this learning task represents a Bayesian generalization of the SVM.

Here, the problem of learning the kernel parameters is equivalent to identifying the optimal scaling factors associated with each feature (gene) in measuring the "similarity" between two expression profiles. A scaling factor of zero for a gene implies that it does not in any way influence the extent of similarity as measured by the kernel. Theoretical arguments and experimental results are provided to show that learning the kernel using our algorithm results in automatic feature selection and hence mitigates the problem of large dimensionality.
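The role of the scaling factors can be sketched with a feature-weighted RBF kernel (an illustrative stand-in, not the authors' Bayesian algorithm): a zero weight removes a gene from the similarity entirely.

```python
import numpy as np

def scaled_rbf(x, z, scales):
    """RBF kernel with one non-negative scale per feature (gene); a zero
    scale removes that gene's influence on the measured similarity."""
    d = scales * (x - z)
    return np.exp(-np.dot(d, d))

x = np.array([1.0, 5.0])   # two toy expression profiles differing only in gene 2
z = np.array([1.0, -5.0])
# zeroing the second gene's scale makes the two profiles look identical
assert scaled_rbf(x, z, np.array([1.0, 0.0])) == 1.0
assert scaled_rbf(x, z, np.array([1.0, 1.0])) < 1.0
```

Learning such scales from data is what turns kernel selection into automatic feature selection: genes whose scale is driven to zero simply drop out of the similarity measure.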

*Gert Lanckriet (U.C. Berkeley, USA)*

An important challenge in bioinformatics is to leverage different descriptions of the same data set, each capturing different aspects of the data. Many such sources of information are now available, such as sequence, expression, protein and regulation information. More data types are going to be available in the near future, such as array-based fitness profiles and protein-protein interaction data from mass spectrometry. Recent work in bioinformatics -- such as gene function prediction, prediction of protein structure and localization, and inference of regulatory and metabolic networks -- could benefit significantly from an approach that treats in a unified way the different types of information, merging them into a single representation, rather than only using the description that is judged to be the most relevant at hand.

The problem of fusing heterogeneous types of information can be addressed by using kernel-based learning methods that involve mapping data into high-dimensional Hilbert spaces. Combining different kernels provides a way to use many different representations of the data in a homogeneous way. Recent advances in the theory of kernel methods have provided efficient algorithms to perform such combinations in an optimal way. These methods exploit semidefinite programming techniques to reduce the problem of optimal kernel combination to a convex optimization problem.

After introducing the semidefinite programming techniques, we will use them to investigate the problem of predicting membrane proteins from different types of information, including amino acid sequences, hydropathy profiles, gene expression data and known protein-protein interactions. We show that the combined kernel performs better than any individual kernel, and significantly better than existing approaches.
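The SDP machinery itself is beyond a short sketch, but the object it optimizes is simple: a convex combination of Gram matrices, which is again a valid kernel. A minimal illustration with hypothetical matrices:

```python
import numpy as np

def combine_kernels(kernels, weights):
    """Convex combination of Gram matrices; non-negative weights preserve
    positive semidefiniteness, so the result is again a valid kernel."""
    weights = np.asarray(weights, dtype=float)
    assert np.all(weights >= 0) and abs(weights.sum() - 1.0) < 1e-9
    return sum(w * K for w, K in zip(weights, kernels))

K1 = np.array([[2.0, 1.0], [1.0, 2.0]])  # e.g. a sequence-derived kernel
K2 = np.array([[1.0, 0.0], [0.0, 1.0]])  # e.g. an expression-derived kernel
K = combine_kernels([K1, K2], [0.7, 0.3])
assert np.all(np.linalg.eigvalsh(K) >= 0)
```

The semidefinite program's role is to choose the weights (and hence the trade-off between data sources) optimally with respect to the learning task, rather than fixing them by hand as above.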

**New String Kernels for Biological Sequence Data**

*Christina Leslie (Columbia University, USA)*

String kernels provide a natural way to apply kernel-based machine learning algorithms to biological sequence data like protein and DNA sequences. We present a class of combinatorial string kernels for biological sequence data, including mismatch string kernels -- which measure sequence similarity based on shared occurrences of $k$-length subsequences called $k$-mers, counted with up to $m$ mismatches -- as well as variants based on $k$-mer matches with wildcards, weightings that incorporate biological knowledge about the string alphabet, and techniques for feature selection. We also discuss data structures for efficient computation of these kernel functions.

Combinatorial string kernels do not rely on generative models like hidden Markov models for sequence data and hence lend themselves to a wide variety of sequence-based learning problems. We discuss applications to two classification problems for which we successfully use string kernels with support vector machines: remote protein homology detection and splice site recognition in pre-mRNA.
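The simplest member of this family, the exact-match spectrum kernel, can be sketched in a few lines (the mismatch and wildcard variants enlarge the set of k-mers each position contributes to); names below are illustrative:

```python
from collections import Counter

def spectrum_kernel(x, y, k=3):
    """Exact-match k-mer (spectrum) kernel: inner product of k-mer count
    vectors. The mismatch variant additionally credits k-mers that differ
    in at most m positions."""
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    return sum(c * cy[kmer] for kmer, c in cx.items())

# "GATTACA" contains five distinct 3-mers, each occurring once
assert spectrum_kernel("GATTACA", "GATTACA", k=3) == 5
assert spectrum_kernel("AAAA", "CCCC", k=3) == 0
```

Trie-based data structures of the kind the abstract mentions make the mismatch generalization of this count efficient without enumerating the full k-mer feature space.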

**Support vector machine evaluation of peptide identification via mass spectrometry**

*William S. Noble (University of Washington, USA)*

Shotgun tandem mass spectrometry-based peptide sequencing allows high-throughput identification of proteins. We have trained a support vector machine to discriminate between peptides that have been correctly and incorrectly identified via this process. Each predicted peptide is characterized by thirteen features derived from the observed and theoretical spectra. The trained SVM classifier performs significantly better than previous cutoff-based methods at separating positive from negative peptides, and is an important step towards automated interpretation of peptide tandem mass spectrometry data in proteomics.

**Protein homology detection using sequence alignment and support vector machines**

*Hiroto Saigo (Kyoto University, Japan)*

Detection of remote homology for protein sequences is one of the most important and well-studied problems in Bioinformatics. We propose new SVM based methods (SVM-SW and SVM-LA), which use the Smith-Waterman (SW) sequence alignment algorithm and variants of the SW algorithm as kernel functions. In this poster abstract, we briefly show the performance of the resulting algorithms on the problem of detecting remote homology among proteins in the SCOP database.
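For reference, a minimal Smith-Waterman local alignment score with a linear gap penalty and a simple match/mismatch scoring (the actual SVM-SW and SVM-LA kernels apply further transformations, since raw SW scores are not guaranteed to be positive definite):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between strings a and b (linear gaps)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # a local alignment may restart anywhere, hence the 0 floor
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

assert smith_waterman("ACGT", "ACGT") == 8       # perfect match: 4 residues * 2
assert smith_waterman("TTACGT", "ACGTAA") == 8   # shared local block "ACGT"
```

In practice a substitution matrix such as BLOSUM replaces the flat match/mismatch scores, and affine rather than linear gap penalties are used.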

*Alexander J. Smola (Australian National University, Australia)*

We present a new algorithm suitable for matching discrete objects such as strings and trees in linear time, thus obviating dynamic programming with quadratic time complexity. Furthermore, prediction cost in many cases can be reduced to linear cost in the length of the sequence to be classified, regardless of the number of support vectors. This improvement on the currently available algorithms makes string kernels a viable alternative for the practitioner.

**Marginalized kernels between labeled graphs**

*Koji Tsuda (Max Planck Institute for Biological Cybernetics, Germany)*

A new kernel function between two labeled graphs is presented. Feature vectors are defined as the counts of label paths produced by random walks on the graphs. The kernel computation finally boils down to obtaining the stationary state of a discrete-time linear system, and is thus efficiently performed by solving a system of simultaneous linear equations. Our kernel is based on an infinite-dimensional feature space, so it is fundamentally different from other string or tree kernels based on dynamic programming.
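A simplified sketch of the idea, using the direct product graph of two integer-labeled graphs and a geometric walk weighting: counting matching label paths reduces to solving one linear system. The names and the specific weighting are illustrative, not the paper's exact marginalization:

```python
import numpy as np

def random_walk_kernel(A1, labels1, A2, labels2, lam=0.1):
    """Counts matching label paths on the direct product graph by solving
    (I - lam * Ax) x = 1 once, instead of enumerating walks."""
    match = np.equal.outer(labels1, labels2).astype(float).ravel()
    Ax = np.kron(A1, A2) * np.outer(match, match)  # product-graph adjacency
    n = Ax.shape[0]
    one = np.ones(n)
    x = np.linalg.solve(np.eye(n) - lam * Ax, one)
    return float(one @ x)

# two-node graphs: identical integer labels share paths, disjoint labels none
A = np.array([[0.0, 1.0], [1.0, 0.0]])
assert random_walk_kernel(A, [0, 1], A, [0, 1]) > random_walk_kernel(A, [0, 1], A, [2, 2])
```

The linear solve sums the geometric series over walks of all lengths at once, which is why no dynamic programming over path lengths is needed.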

We will present promising empirical results in classification of chemical compounds.

**Optimal Encoding of Genomic Information**

*Chris Watkins (Royal Holloway, University of London, UK)*

Informative patterns in DNA are continually degraded by mutation and other errors of reproduction, but are restored by selection. What types of pattern are "load-optimal" in that they can maintain the most information with the least amount of selection?

It is well known that the triplet genetic code is nearly load-optimal among possible triplet codes, in the sense that a mutated codon tends to encode for an amino acid that is similar to that encoded by the original codon. However, the untranslated parts of the genome contain much information which is not encoded by the universal genetic code. What types of encoding are load-optimal for this information?

By analysing a simple genetic model, we show that in sexual organisms, there is a large advantage in encodings that are highly distributed and dispersed, whereas in asexual organisms, distributed and dispersed encodings are not so advantageous. We will also show that load-optimal encodings will tend to take over from less efficient encodings.

Finally, we will discuss the implications of these results for various problems in bioinformatics.

**Generalized kernel canonical correlation analysis for multiple genomic data**

*Yoshihiro Yamanishi (Kyoto University, Japan)*

The integration of different types of genomic data (e.g., biochemical pathways, genomes, and expression data) is a key problem in computational biology today. When the data types differ (e.g., graphs, strings, and vectors), integration strategies often rely on heuristic approaches that depend on the types of data involved.

In this talk we present new methods to measure the correlation between several heterogeneous datasets, and to extract sets of genes which share similarities with respect to multiple biological attributes. The originality of our approach is the extension of the concept of correlation for non-vectorial data, which is made possible by the use of generalized kernel canonical correlation analysis (KCCA), and the method we propose to extract groups of genes responsible for the detected correlations. Moreover, two variants of KCCA are proposed when more than two datasets are available.
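A minimal regularized KCCA sketch for two views, returning the first canonical correlation from precomputed Gram matrices; the particular regularization form and the names are illustrative assumptions, not the authors' exact formulation:

```python
import numpy as np

def kernel_cca(K1, K2, reg=0.1):
    """First canonical correlation between two views, from the regularized
    KCCA eigenproblem on their centered Gram matrices."""
    n = K1.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K1c, K2c = H @ K1 @ H, H @ K2 @ H            # center both Gram matrices
    # rho^2 are the eigenvalues of (K1c + reg I)^-1 K2c (K2c + reg I)^-1 K1c
    M = np.linalg.solve(K1c + reg * np.eye(n), K2c) @ \
        np.linalg.solve(K2c + reg * np.eye(n), K1c)
    rho2 = np.linalg.eigvals(M).real.max()
    return float(np.sqrt(max(rho2, 0.0)))

# identical views: correlation approaches 1, shrunk by the regularizer
X = np.array([[0.0], [1.0], [2.0]])
K = X @ X.T   # centered K has a single nonzero eigenvalue, 2
assert abs(kernel_cca(K, K, reg=0.1) - 2 / 2.1) < 1e-6
```

The regularizer is essential here: without it, any invertible Gram matrix yields a spurious perfect correlation, and the dual coefficients recovered from the eigenvectors are what identify the gene sets driving the detected correlation.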

These methods are successfully tested for their ability to recognize operons in the E. coli genome, by comparing three datasets corresponding to functional relationships between genes in metabolic pathways, geometrical relationships along the chromosome, and co-expression relationships as observed in gene expression data.

Last modified: Wed Apr 16 10:29:02 CEST 2003