bio references

[Watson1953Structure]	J. D. Watson and F. H. C. Crick. A Structure for Deoxyribose Nucleic Acid. Nature, 171:737, 1953. [ bib \| .html \| .pdf ] Keywords: bio
[Felsenstein1981Evolutionary]	J. Felsenstein. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution, 17:368-376, 1981. [ bib ]
[Rarey1996fast]	M. Rarey, B. Kramer, T. Lengauer, and G. Klebe. A fast flexible docking method using an incremental construction algorithm. J. Mol. Biol., 261(3):470-489, Aug 1996. [ bib \| DOI \| http ] We present an automatic method for docking organic ligands into protein binding sites. The method can be used in the design process of specific protein ligands. It combines an appropriate model of the physico-chemical properties of the docked molecules with efficient methods for sampling the conformational space of the ligand. If the ligand is flexible, it can adopt a large variety of different conformations. Each such minimum in conformational space presents a potential candidate for the conformation of the ligand in the complexed state. Our docking method samples the conformation space of the ligand on the basis of a discrete model and uses a tree-search technique for placing the ligand incrementally into the active site. For placing the first fragment of the ligand into the protein, we use hashing techniques adapted from computer vision. The incremental construction algorithm is based on a greedy strategy combined with efficient methods for overlap detection and for the search of new interactions. We present results on 19 complexes of which the binding geometry has been crystallographically determined. All considered ligands are docked in at most three minutes on a current workstation. The experimentally observed binding mode of the ligand is reproduced with 0.5 to 1.2 A rms deviation. It is almost always found among the highest-ranking conformations computed. 
Keywords: Aldehyde Reductase, Algorithms, Amiloride, Aminoimidazole Carboxamide, Animals, Arabinose, Automation, Binding Sites, Carbonic Anhydrases, Computational Biology, Computer Simulation, Concanavalin A, Crystallography, Databases, Drug Design, Drug Evaluation, Enzyme Inhibitors, Factual, Folic Acid, Folic Acid Antagonists, Fructose-Bisphosphatase, Humans, Internet, Ligands, Methotrexate, Models, Molecular, Non-U.S. Gov't, Pancreatic Elastase, Pentamidine, Pliability, Point Mutation, Preclinical, Protein Binding, Protein Conformation, Proteins, Reproducibility of Results, Research Support, Ribonucleosides, Software, Tetrahydrofolate Dehydrogenase, Thermolysin, Time Factors, Trypsin, X-Ray, 8780787
[Goffeau1996Life]	A. Goffeau, B.G. Barrell, H. Bussey, R.W. Davis, B. Dujon, H. Feldmann, F. Galibert, J.D. Hoheisel, C. Jacq, M. Johnston, E.J. Louis, H.W. Mewes, Y. Murakami, P. Philippsen, H. Tettelin, and S. G. Oliver. Life with 6000 genes. Science, 274:546-567, October 1996. [ bib \| DOI \| http \| .pdf ]
[Nielsen1997Identification]	H. Nielsen, J. Engelbrecht, S. Brunak, and G. von Heijne. Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng., 10(1):1-6, 1997. [ bib \| http \| .pdf ]
[Kanehisa1997database]	M. Kanehisa. A database for post-genome analysis. Trends Genet., 13:375-376, 1997. [ bib \| DOI \| http \| .pdf ]
[Altschul1997Gapped]	S.F. Altschul, T.L. Madden, A.A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25:3389-3402, 1997. [ bib \| .pdf \| .pdf ]
[Schneider1998Artificial]	G. Schneider and P. Wrede. Artificial neural networks for computer-based molecular design. Prog Biophys Mol Biol, 70(3):175-222, 1998. [ bib ] The theory of artificial neural networks is briefly reviewed focusing on supervised and unsupervised techniques which have great impact on current chemical applications. An introduction to molecular descriptors and representation schemes is given. In addition, worked examples of recent advances in this field are highlighted and pioneering publications are discussed. Applications of several types of artificial neural networks to compound classification, modelling of structure-activity relationships, biological target identification, and feature extraction from biopolymers are presented and compared to other techniques. Advantages and limitations of neural networks for computer-aided molecular design and sequence analysis are discussed. Keywords: Algorithms, Amino Acid Sequence, Amino Acids, Animals, Artificial Intelligence, Automated, Bacterial, Bacterial Proteins, Bicuculline, Binding Sites, Biological, Biological Availability, Blood Proteins, Blood-Brain Barrier, Cation Transport Proteins, Cats, Cell Membrane Permeability, Chemical, Chemistry, Cluster Analysis, Combinatorial Chemistry Techniques, Comparative Study, Computational Biology, Computer Simulation, Computer Systems, Computer-Aided Design, Computer-Assisted, Computing Methodologies, DNA-Binding Proteins, Databases, Dogs, Drug Design, Electric Stimulation, Electromyography, Enzyme Inhibitors, Ether-A-Go-Go Potassium Channels, Excitatory Amino Acid Antagonists, Factual, False Positive Reactions, Forecasting, Forelimb, GABA Antagonists, Gene Expression Profiling, Genome, Glutamic Acid, Humans, Hydrogen Bonding, Image Enhancement, Image Interpretation, Image Processing, Information Storage and Retrieval, Iontophoresis, Kynurenic Acid, Least-Squares Analysis, Linear Models, Liver, Markov Chains, Metabolic Clearance Rate, Metalloendopeptidases, Microelectrodes, Models, Molecular, Molecular Conformation, Molecular Sequence Data, Molecular Structure, Motor Cortex, Movement, Multivariate Analysis, Nerve Net, Neural Networks (Computer), Neuropeptides, Non-U.S. Gov't, Nonlinear Dynamics, Pattern Recognition, Pharmaceutical, Pharmaceutical Preparations, Pharmacokinetics, Phylogeny, Potassium Channels, Predictive Value of Tests, Protein Interaction Mapping, Protein Sorting Signals, Protein Structure, Proteins, Rats, Reproducibility of Results, Research Support, Sensitivity and Specificity, Sequence Alignment, Sequence Analysis, Shoulder, Signal Processing, Software, Statistical, Stereotaxic Techniques, Structure-Activity Relationship, Terminology, Tertiary, Trans-Activators, Voltage-Gated, Zinc, 9830312
[Poggio1998Sparse]	Poggio and Girosi. A Sparse Representation for Function Approximation. Neural Comput, 10(6):1445-54, Jul 1998. [ bib ] We derive a new general representation for a function as a linear combination of local correlation kernels at optimal sparse locations (and scales) and characterize its relation to principal component analysis, regularization, sparsity principles, and support vector machines. Keywords: Algorithms, Automated, Biometry, Computers, DNA, Databases, Factual, Fungal, Fungal Proteins, GTP-Binding Proteins, Gene Expression, Genes, Learning, Markov Chains, Models, Neural Networks (Computer), Neurological, Non-P.H.S., Non-U.S. Gov't, Nucleic Acid Hybridization, Open Reading Frames, P.H.S., Pattern Recognition, Protein, Protein Structure, Proteins, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Sequence Alignment, Sequence Analysis, Software, Statistical, Tertiary, U.S. Gov't, 9698352
[Mukherjee1998Support]	S. Mukherjee, P. Tamayo, J. P. Mesirov, D. Slonim, A. Verri, and T. Poggio. Support vector machine classification of microarray data. Technical Report 182, C.B.L.C., 1998. A.I. Memo 1677. [ bib \| .html \| .pdf ] Keywords: biosvm microarray
[Kononen1998Tissue]	J. Kononen, L. Bubendorf, A. Kallioniemi, M. Bärlund, P. Schraml, S. Leighton, J. Torhorst, M. J. Mihatsch, G. Sauter, and O. P. Kallioniemi. Tissue microarrays for high-throughput molecular profiling of tumor specimens. Nat Med, 4(7):844-847, Jul 1998. [ bib ] Many genes and signalling pathways controlling cell proliferation, death and differentiation, as well as genomic integrity, are involved in cancer development. New techniques, such as serial analysis of gene expression and cDNA microarrays, have enabled measurement of the expression of thousands of genes in a single experiment, revealing many new, potentially important cancer genes. These genome screening tools can comprehensively survey one tumor at a time; however, analysis of hundreds of specimens from patients in different stages of disease is needed to establish the diagnostic, prognostic and therapeutic importance of each of the emerging cancer gene candidates. Here we have developed an array-based high-throughput technique that facilitates gene expression and copy number surveys of very large numbers of tumors. As many as 1000 cylindrical tissue biopsies from individual tumors can be distributed in a single tumor tissue microarray. Sections of the microarray provide targets for parallel in situ detection of DNA, RNA and protein targets in each specimen on the array, and consecutive sections allow the rapid analysis of hundreds of molecular markers in the same set of specimens. Our detection of six gene amplifications as well as p53 and estrogen receptor expression in breast cancer demonstrates the power of this technique for defining new subgroups of tumors. 
Keywords: Animals; Breast Neoplasms, genetics/metabolism/pathology; Cyclin D1, genetics/metabolism; Female; Genetic Techniques; Humans; Immunoenzyme Techniques; In Situ Hybridization, Fluorescence; Mice; Oncogene Proteins v-myb; Proto-Oncogene Proteins c-myc, genetics/metabolism; Rabbits; Receptor, erbB-2, genetics/metabolism; Receptors, Estrogen, genetics/metabolism; Retroviridae Proteins, Oncogenic, genetics/metabolism; Tumor Markers, Biological, genetics/metabolism; Tumor Suppressor Protein p53, genetics/metabolism
[Karplus1998Hidden]	K. Karplus, C. Barrett, and R. Hughey. Hidden Markov Models for Detecting Remote Protein Homologies. Bioinformatics, 14(10):846-856, 1998. [ bib \| .ps \| .pdf ]
[Grundy1998Family-based]	W. N. Grundy. Family-based Homology Detection via Pairwise Sequence Comparison. In Proceedings of the Second Annual International Conference on Computational Molecular Biology, March 22-25, pages 94-100, 1998. [ bib \| .html \| .pdf ]
[Goto1998LIGAND:]	S. Goto, T. Nishioka, and M. Kanehisa. LIGAND: chemical database for enzyme reactions. Bioinformatics, 14:591-599, 1998. [ bib \| http \| .pdf ]
[Girosi1998Equivalence]	Girosi. An Equivalence Between Sparse Approximation and Support Vector Machines. Neural Comput, 10(6):1455-80, Jul 1998. [ bib ] This article shows a relationship between two different approximation techniques: the support vector machines (SVM), proposed by V. Vapnik (1995) and a sparse approximation scheme that resembles the basis pursuit denoising algorithm (Chen, 1995; Chen, Donoho, and Saunders, 1995). SVM is a technique that can be derived from the structural risk minimization principle (Vapnik, 1982) and can be used to estimate the parameters of several different approximation schemes, including radial basis functions, algebraic and trigonometric polynomials, B-splines, and some forms of multilayer perceptrons. Basis pursuit denoising is a sparse approximation technique in which a function is reconstructed by using a small number of basis functions chosen from a large set (the dictionary). We show that if the data are noiseless, the modified version of basis pursuit denoising proposed in this article is equivalent to SVM in the following sense: if applied to the same data set, the two techniques give the same solution, which is obtained by solving the same quadratic programming problem. In the appendix, we present a derivation of the SVM technique in one framework of regularization theory, rather than statistical learning theory, establishing a connection between SVM, sparse approximation, and regularization theory. Keywords: Algorithms, Automated, Biometry, Computers, DNA, Databases, Factual, Fungal, Fungal Proteins, GTP-Binding Proteins, Gene Expression, Genes, Learning, Markov Chains, Models, Neural Networks (Computer), Neurological, Non-P.H.S., Non-U.S. Gov't, Nucleic Acid Hybridization, Open Reading Frames, P.H.S., Pattern Recognition, Protein, Protein Structure, Proteins, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Sequence Alignment, Sequence Analysis, Software, Statistical, Tertiary, U.S. Gov't, 9698353
[Pontil1998Properties]	M. Pontil and A. Verri. Properties of support vector machines. Neural Comput, 10(4):955-74, May 1998. [ bib ] Support vector machines (SVMs) perform pattern recognition between two point classes by finding a decision surface determined by certain points of the training set, termed support vectors (SV). This surface, which in some feature space of possibly infinite dimension can be regarded as a hyperplane, is obtained from the solution of a problem of quadratic programming that depends on a regularization parameter. In this article, we study some mathematical properties of support vectors and show that the decision surface can be written as the sum of two orthogonal terms, the first depending on only the margin vectors (which are SVs lying on the margin), the second proportional to the regularization parameter. For almost all values of the parameter, this enables us to predict how the decision surface varies for small parameter changes. In the special but important case of feature space of finite dimension m, we also show that m + 1 SVs are usually sufficient to determine the decision surface fully. For relatively small m, this latter result leads to a consistent reduction of the SV number. Keywords: Algorithms, Artificial Intelligence, Automated, Biometry, Computers, DNA, Databases, Factual, Fungal, Fungal Proteins, GTP-Binding Proteins, Gene Expression, Genes, Learning, Linear Models, Markov Chains, Mathematics, Models, Neural Networks (Computer), Neurological, Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Nucleic Acid Hybridization, Open Reading Frames, P.H.S., Pattern Recognition, Protein, Protein Structure, Proteins, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Sequence Alignment, Sequence Analysis, Software, Statistical, Tertiary, U.S. Gov't, 9573414
[Roth1998Finding]	F. P. Roth, J. D. Hughes, P. W. Estep, and G. M. Church. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat. Biotechnol., 16(10):939-945, October 1998. [ bib \| DOI \| http ] Whole-genome mRNA quantitation can be used to identify the genes that are most responsive to environmental or genotypic change. By searching for mutually similar DNA elements among the upstream non-coding DNA sequences of these genes, we can identify candidate regulatory motifs and corresponding candidate sets of coregulated genes. We have tested this strategy by applying it to three extensively studied regulatory systems in the yeast Saccharomyces cerevisiae: galactose response, heat shock, and mating type. Galactose-response data yielded the known binding site of Gal4, and six of nine genes known to be induced by galactose. Heat shock data yielded the cell-cycle activation motif, which is known to mediate cell-cycle dependent activation, and a set of genes coding for all four nucleosomal proteins. Mating type alpha and a data yielded all of the four relevant DNA motifs and most of the known a- and alpha-specific genes. Keywords: bioinformatics, genome-wide, tfs
[Murphy1999Modelling]	K. Murphy and S. Mian. Modelling gene expression data using dynamic Bayesian networks. Technical report, Computer Science Division, University of California, Berkeley, CA., 1999. [ bib \| .pdf ] Keywords: biogm
[Marcotte1999Detecting]	E.M. Marcotte, M. Pellegrini, H.-L. Ng, D.W. Rice, T.O. Yeates, and D. Eisenberg. Detecting Protein Function and Protein-Protein Interactions from Genome Sequences. Science, 285:751-753, 1999. [ bib \| .pdf \| .pdf ]
[Jaakkola1999Exploiting]	T. S. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In Proc. of Tenth Conference on Advances in Neural Information Processing Systems, 1999. [ bib \| .ps \| .pdf ] Keywords: biosvm
[Jaakkola1999Using]	T. S. Jaakkola, M. Diekhans, and D. Haussler. Using the Fisher Kernel Method to Detect Remote Protein Homologies. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pages 149-158. AAAI Press, 1999. [ bib ] Keywords: biosvm
[Heckerman1999tutorial]	D. Heckerman. A tutorial on learning with Bayesian networks. In M. Jordan, editor, Learning in graphical models, pages 301-354. MIT Press, Cambridge, MA, USA, 1999. [ bib \| .pdf ] Keywords: biogm
[Haussler1999Convolution]	D. Haussler. Convolution Kernels on Discrete Structures. Technical Report UCSC-CRL-99-10, UC Santa Cruz, 1999. [ bib \| .pdf ] We introduce a new method of constructing kernels on sets whose elements are discrete structures like strings, trees and graphs. The method can be applied iteratively to build a kernel on an infinite set from kernels involving generators of the set. The family of kernels generated generalizes the family of radial basis kernels. It can also be used to define kernels in the form of joint Gibbs probability distributions. Kernels can be built from hidden Markov random fields, generalized regular expressions, pair-HMMs, or ANOVA decompositions. Uses of the method lead to open problems involving the theory of infinitely divisible positive definite functions. Fundamentals of this theory and the theory of reproducing kernel Hilbert spaces are reviewed and applied in establishing the validity of the method. Keywords: biosvm
[Hartwell1999a]	L. H. Hartwell, J. J. Hopfield, S. Leibler, and A. W. Murray. From molecular to modular cell biology. Nature, 402(6761 Suppl):C47-C52, Dec 1999. [ bib \| DOI \| http ] Cellular functions, such as signal transmission, are carried out by 'modules' made up of many species of interacting molecules. Understanding how modules work has depended on combining phenomenological analysis with molecular studies. General principles that govern the structure and behaviour of modules may be discovered with help from synthetic sciences such as engineering and computer science, from stronger interactions between experiment and theory in cell biology, and from an appreciation of evolutionary constraints. Keywords: Action Potentials; Biological Evolution; Forecasting; Models, Biological; Molecular Biology, trends
[Debouck1999DNA]	C. Debouck and P. N. Goodfellow. DNA microarrays in drug discovery and development. Nat. Genet., 21(1 Suppl):48-50, Jan 1999. [ bib \| DOI \| http ] DNA microarrays can be used to measure the expression patterns of thousands of genes in parallel, generating clues to gene function that can help to identify appropriate targets for therapeutic intervention. They can also be used to monitor changes in gene expression in response to drug treatments. Here, we discuss the different ways in which microarray analysis is likely to affect drug discovery. Keywords: Agricultural, Alleles, Alternaria, Amino Acid, Amino Acid Chloromethyl Ketones, Amino Acid Sequence, Animal, Animals, Apoptosis, Asthma, Bacteria, Base Sequence, Binding Sites, Biotechnology, Blotting, Bone Density, Bone Matrix, Bone and Bones, CCR5, Camptothecin, Caspases, Cathepsins, Cell Surface, Central America, Chloroplast, Chondrocytes, Chromosome Mapping, Chromosomes, Cloning, Cluster Analysis, Collagen, Comparative Study, Coumarins, Crops, Crystallography, DNA, DNA Primers, Dipeptides, Disease, Disease Models, Drug Design, Drug Evaluation, Drug Industry, Enzyme Activation, Enzyme Inhibitors, Escherichia coli, Evolution, Exons, Expressed Sequence Tags, Female, Fetus, Fluorescent Dyes, Food Microbiology, Founder Effect, GTP-Binding Proteins, Gene Expression, Gene Frequency, Gene Library, Genes, Genetic, Genetic Predisposition to Disease, Genome, Geography, Growth Plate, Haplotypes, Hordeum, Human, Humans, Inclusion Bodies, Injections, Intraperitoneal, Introns, Isatin, Knockout, Male, Membrane Proteins, Messenger, Mice, Models, Molecular, Molecular Sequence Data, Molecular Structure, Mutation, Mycotoxins, Neutrophils, Non-U.S. Gov't, Northern, Oligonucleotide Array Sequence Analysis, Osteoarthritis, Osteochondrodysplasias, Osteoclasts, Osteopetrosis, Pair 15, Phaseolus, Polymorphism, Preclinical, Pregnancy, Promoter Regions (Genetics), Protein Precursors, Proteomics, RNA, Receptors, Recombinant Fusion Proteins, Recombinant Proteins, Research Support, Restriction Fragment Length, Ribosomal Proteins, Sequence Alignment, Sequence Analysis, Sequence Homology, South America, Species Specificity, Splenomegaly, Sulfonamides, Synteny, Tissue Distribution, Transcription, Trichothecenes, X-Ray, 9915501
[Cuff1999Evaluation]	J. A. Cuff and G. J. Barton. Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins: Struct. Funct. Genet., 34:508-519, 1999. [ bib \| http \| .pdf ]
[Barabasi1999Emergence]	A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286:509-512, 1999. [ bib \| .pdf \| .pdf ] Systems as diverse as genetic networks or the World Wide Web are best described as networks with complex topology. A common property of many large networks is that the vertex connectivities follow a scale-free power-law distribution. This feature was found to be a consequence of two generic mechanisms: (i) networks expand continuously by the addition of new vertices, and (ii) new vertices attach preferentially to sites that are already well connected. A model based on these two ingredients reproduces the observed stationary scale-free distributions, which indicates that the development of large networks is governed by robust self-organizing phenomena that go beyond the particulars of the individual systems.
[Baldi1999Exploiting]	P. Baldi, S. Brunak, P. Frasconi, G. Soda, and G. Pollastri. Exploiting the past and the future in protein secondary structure prediction. Bioinformatics, 15:937-946, 1999. [ bib \| .pdf \| .pdf ]
[Pellegrini1999Assigning]	M. Pellegrini, E. M. Marcotte, M. J. Thompson, D. Eisenberg, and T. O. Yeates. Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. Proc. Natl. Acad. Sci. USA, 96:4285-4288, April 1999. [ bib \| .pdf \| .pdf ]
[Mathews1999Expandeda]	D. H. Mathews, J. Sabina, M. Zuker, and D. H. Turner. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol., 288(5):911-940, May 1999. [ bib \| DOI \| http ] An improved dynamic programming algorithm is reported for RNA secondary structure prediction by free energy minimization. Thermodynamic parameters for the stabilities of secondary structure motifs are revised to include expanded sequence dependence as revealed by recent experiments. Additional algorithmic improvements include reduced search time and storage for multibranch loop free energies and improved imposition of folding constraints. An extended database of 151,503 nt in 955 structures determined by comparative sequence analysis was assembled to allow optimization of parameters not based on experiments and to test the accuracy of the algorithm. On average, the predicted lowest free energy structure contains 73% of known base-pairs when domains of fewer than 700 nt are folded; this compares with 64% accuracy for previous versions of the algorithm and parameters. For a given sequence, a set of 750 generated structures contains one structure that, on average, has 86% of known base-pairs. Experimental constraints, derived from enzymatic and flavin mononucleotide cleavage, improve the accuracy of structure predictions. Keywords: 16S, 23S, 5S, Affinity, Algorithms, Aluminum Silicates, Amino Acid, Amino Acid Sequence, Amyloidosis, Archaeal, Bacillus, Bacterial, Bacterial Proteins, Bacteriophage T4, Base Sequence, Chloroplast, Chromatography, Circular Dichroism, Comparative Study, Computational Biology, Databases, Electrophoresis, Entropy, Enzyme Stability, Escherichia coli, Factual, Fibroblast Growth Factor 2, Flavin Mononucleotide, Fluorescence, Genetic, Guanidine, Humans, Huntington Disease, Kinetics, Light, Models, Molecular Sequence Data, Non-P.H.S., Non-U.S. Gov't, Nucleic Acid Conformation, P.H.S., Peptides, Phylogeny, Polyacrylamide Gel, Predictive Value of Tests, Protein Binding, Protein Denaturation, Protein Folding, Protein Structure, RNA, Radiation, Recombinant Proteins, Research Support, Ribosomal, Scattering, Secondary, Sequence Homology, Solutions, Spectrometry, Statistical, Temperature, Thermodynamics, Time Factors, Trinucleotide Repeat Expansion, U.S. Gov't, alpha-Amylase, 10329189
[Fields1999Functional]	S. Fields, Y. Kohara, and D. J. Lockhart. Functional genomics. Proc. Natl. Acad. Sci. USA, 96:8825-8826, August 1999. [ bib \| .pdf \| .pdf ]
[Marcotte1999combined]	E. M. Marcotte, M. Pellegrini, M. J. Thompson, T. O. Yeates, and D. Eisenberg. A combined algorithm for genome-wide prediction of protein function. Nature, 402:83-86, November 1999. [ bib \| http \| .pdf ]
[Zien2000Engineering]	A. Zien, G. Rätsch, S. Mika, B. Schölkopf, T. Lengauer, and K.-R. Müller. Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics, 16(9):799-807, 2000. [ bib \| http \| .pdf ] Motivation: In order to extract protein sequences from nucleotide sequences, it is an important step to recognize points at which regions start that code for proteins. These points are called translation initiation sites (TIS). Results: The task of finding TIS can be modeled as a classification problem. We demonstrate the applicability of support vector machines for this task, and show how to incorporate prior biological knowledge by engineering an appropriate kernel function. With the described techniques the recognition performance can be improved by 26%. We provide evidence that existing related methods (e.g. ESTScan) could profit from advanced TIS recognition. Keywords: biosvm
[Wilbur2000Boosting]	W. J. Wilbur. Boosting naive Bayesian learning on a large subset of MEDLINE. Proc AMIA Symp, pages 918-22, 2000. [ bib ] We are concerned with the rating of new documents that appear in a large database (MEDLINE) and are candidates for inclusion in a small specialty database (REBASE). The requirement is to rank the new documents as nearly in order of decreasing potential to be added to the smaller database as possible, so as to improve the coverage of the smaller database without increasing the effort of those who manage this specialty database. To perform this ranking task we have considered several machine learning approaches based on the naïve Bayesian algorithm. We find that adaptive boosting outperforms naïve Bayes, but that a new form of boosting which we term staged Bayesian retrieval outperforms adaptive boosting. Staged Bayesian retrieval involves two stages of Bayesian retrieval and we further find that if the second stage is replaced by a support vector machine we again obtain a significant improvement over the strictly Bayesian approach. 
Keywords: Acute, Acute Disease, Adenocarcinoma, Algorithms, Amino Acid Sequence, Animals, Artificial Intelligence, Automated, B-Lymphocytes, Bacterial Proteins, Base Pair Mismatch, Base Sequence, Bayes Theorem, Binding Sites, Biological, Bone Marrow Cells, Brachyura, Cell Compartmentation, Chemistry, Child, Chromosome Aberrations, Classification, Codon, Colonic Neoplasms, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA, Data Interpretation, Databases, Decision Trees, Diabetes Mellitus, Diagnosis, Discriminant Analysis, Discrimination Learning, Electric Conductivity, Electrophysiology, Escherichia coli Proteins, Factual, Feedback, Female, Fungal, Gastric Emptying, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Genetic Predisposition to Disease, Genomics, Hemolysins, Humans, Indians, Information Storage and Retrieval, Initiator, Ion Channels, Kinetics, Leukemia, Likelihood Functions, Lipid Bilayers, Logistic Models, Lymphocytic, MEDLINE, Male, Markov Chains, Melanoma, Models, Molecular, Myeloid, Neoplasm, Neoplasms, Neoplastic, Neural Networks (Computer), Neurological, Nevus, Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Normal Distribution, North American, Nucleic Acid Conformation, Oligonucleotide Array Sequence Analysis, Organ Specificity, Organelles, Ovarian Neoplasms, Ovary, P.H.S., Pattern Recognition, Physical, Pigmented, Predictive Value of Tests, Promoter Regions (Genetics), Protein Biosynthesis, Protein Folding, Protein Structure, Proteins, Proteome, RNA, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Secondary, Sensitivity and Specificity, Sequence Alignment, Sequence Analysis, Sex Characteristics, Skin Diseases, Skin Neoplasms, Skin Pigmentation, Software, Sound Spectrography, Statistical, Stomach Diseases, T-Lymphocytes, Thermodynamics, Transcription, Transcription Factors, Tumor Markers, Type 2, U.S. Gov't, Vertebrates, 11080018
[Watkins2000Dynamic]	C. Watkins. Dynamic alignment kernels. In A.J. Smola, P.L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 39-50. MIT Press, Cambridge, MA, 2000. [ bib \| .ps.gz \| .pdf ] Keywords: biosvm
[Uetz2000comprehensive]	P. Uetz, L. Giot, G. Cagney, T. A. Mansfield, R. S. Judson, J. R. Knight, D. Lockshon, V. Narayan, M. Srinivasan, P. Pochart, A. Qureshi-Emili, Y. Li, B. Godwin, D. Conover, T. Kalbfleisch, G. Vijayadamodar, M. Yang, M. Johnston, S. Fields, and J. M. Rothberg. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 403:623-627, 2000. [ bib \| http \| .pdf ]
[Strahl2000language]	B. D. Strahl and C. D. Allis. The language of covalent histone modifications. Nature, 403(6765):41-45, Jan 2000. [ bib \| DOI \| http ] Histone proteins and the nucleosomes they form with DNA are the fundamental building blocks of eukaryotic chromatin. A diverse array of post-translational modifications that often occur on tail domains of these proteins has been well documented. Although the function of these highly conserved modifications has remained elusive, converging biochemical and genetic evidence suggests functions in several chromatin-based processes. We propose that distinct histone modifications, on one or more tails, act sequentially or in combination to form a 'histone code' that is read by other proteins to bring about distinct downstream events. Keywords: Acetylation; Amino Acid Sequence; Animals; Chromatin, physiology; Histones, chemistry/metabolism/physiology; Humans; Lysine, physiology; Microtubules, physiology; Models, Biological; Molecular Sequence Data; Phosphorylation; Protein Processing, Post-Translational; Serine, metabolism
[Slanina2000Random]	F. Slanina and M. Kotrla. Random networks created by biological evolution. Phys. Rev. E, 62(5):6170-6177, 2000. [ bib \| http \| .pdf ]
[Risau-Gusman2000Generalization]	Risau-Gusman and Gordon. Generalization properties of finite-size polynomial support vector machines. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics, 62(5 Pt B):7092-9, Nov 2000. [ bib ] The learning properties of finite-size polynomial support vector machines are analyzed in the case of realizable classification tasks. The normalization of the high-order features acts as a squeezing factor, introducing a strong anisotropy in the patterns distribution in feature space. As a function of the training set size, the corresponding generalization error presents a crossover, more or less abrupt depending on the distribution's anisotropy and on the task to be learned, between a fast-decreasing and a slowly decreasing regime. This behavior corresponds to the stepwise decrease found by Dietrich et al. [Phys. Rev. Lett. 82, 2975 (1999)] in the thermodynamic limit. The theoretical results are in excellent agreement with the numerical simulations. Keywords: Acute, Acute Disease, Adenocarcinoma, Algorithms, Amino Acid Sequence, Animals, Artificial Intelligence, Automated, B-Lymphocytes, Bacterial Proteins, Base Pair Mismatch, Base Sequence, Bayes Theorem, Binding Sites, Biological, Bone Marrow Cells, Brachyura, Cell Compartmentation, Chemistry, Child, Chromosome Aberrations, Classification, Codon, Colonic Neoplasms, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA, Data Interpretation, Databases, Decision Trees, Diabetes Mellitus, Diagnosis, Discriminant Analysis, Discrimination Learning, Electric Conductivity, Electrophysiology, Escherichia coli Proteins, Factual, Feedback, Female, Fungal, Gastric Emptying, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Genetic Predisposition to Disease, Genomics, Hemolysins, Humans, Indians, Initiator, Ion Channels, Kinetics, Leukemia, Likelihood Functions, Lipid Bilayers, Logistic Models, Lymphocytic, Male, Markov Chains, Melanoma, Models, Molecular, Myeloid, Neoplasm, Neoplasms, Neoplastic, Neural Networks (Computer), Neurological, Nevus, Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Normal Distribution, North American, Nucleic Acid Conformation, Oligonucleotide Array Sequence Analysis, Organ Specificity, Organelles, Ovarian Neoplasms, Ovary, P.H.S., Pattern Recognition, Physical, Pigmented, Predictive Value of Tests, Promoter Regions (Genetics), Protein Biosynthesis, Protein Folding, Protein Structure, Proteins, Proteome, RNA, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Secondary, Sensitivity and Specificity, Sequence Alignment, Sequence Analysis, Sex Characteristics, Skin Diseases, Skin Neoplasms, Skin Pigmentation, Software, Sound Spectrography, Statistical, Stomach Diseases, T-Lymphocytes, Thermodynamics, Transcription, Transcription Factors, Tumor Markers, Type 2, U.S. Gov't, Vertebrates, 0011102066
[Rice2000EMBOSS]	P. Rice, I. Longden, and A. Bleasby. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet., 16(6):276-277, Jun 2000. [ bib ] Keywords: Internet; Molecular Biology; Sequence Alignment, methods; Software; User-Computer Interface
[Pandey2000Proteomics]	A. Pandey and M. Mann. Proteomics to study genes and genomes. Nature, 405:837-846, 2000. [ bib \| http \| .pdf ]
[Opper2000Gaussian]	M. Opper and O. Winther. Gaussian processes for classification: mean-field algorithms. Neural Comput, 12(11):2655-84, Nov 2000. [ bib ] We derive a mean-field algorithm for binary classification with gaussian processes that is based on the TAP approach originally proposed in statistical physics of disordered systems. The theory also yields an approximate leave-one-out estimator for the generalization error, which is computed with no extra computational cost. We show that from the TAP approach, it is possible to derive both a simpler "naive" mean-field theory and support vector machines (SVMs) as limiting cases. For both mean-field algorithms and support vector machines, simulation results for three small benchmark data sets are presented. They show that one may get state-of-the-art performance by using the leave-one-out estimator for model selection and the built-in leave-one-out estimators are extremely precise when compared to the exact leave-one-out estimate. The second result is taken as strong support for the internal consistency of the mean-field approach. 
Keywords: Acute, Acute Disease, Adenocarcinoma, Algorithms, Amino Acid Sequence, Animals, Artificial Intelligence, Automated, B-Lymphocytes, Bacterial Proteins, Base Pair Mismatch, Base Sequence, Bayes Theorem, Binding Sites, Biological, Bone Marrow Cells, Brachyura, Cell Compartmentation, Chemistry, Child, Chromosome Aberrations, Classification, Colonic Neoplasms, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA, Data Interpretation, Databases, Decision Trees, Diabetes Mellitus, Diagnosis, Discriminant Analysis, Discrimination Learning, Electric Conductivity, Electrophysiology, Escherichia coli Proteins, Factual, Feedback, Female, Fungal, Gastric Emptying, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Genetic Predisposition to Disease, Hemolysins, Humans, Indians, Ion Channels, Kinetics, Leukemia, Likelihood Functions, Lipid Bilayers, Logistic Models, Lymphocytic, Male, Markov Chains, Melanoma, Models, Molecular, Myeloid, Neoplasm, Neoplasms, Neoplastic, Neural Networks (Computer), Neurological, Nevus, Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Normal Distribution, North American, Nucleic Acid Conformation, Oligonucleotide Array Sequence Analysis, Organ Specificity, Organelles, Ovarian Neoplasms, Ovary, P.H.S., Pattern Recognition, Physical, Pigmented, Predictive Value of Tests, Promoter Regions (Genetics), Protein Folding, Protein Structure, Proteins, Proteome, RNA, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Secondary, Sensitivity and Specificity, Sequence Alignment, Sex Characteristics, Skin Diseases, Skin Neoplasms, Skin Pigmentation, Software, Sound Spectrography, Statistical, Stomach Diseases, T-Lymphocytes, Thermodynamics, Transcription, Transcription Factors, Tumor Markers, Type 2, U.S. Gov't, 11110131
[Moler2000Analysis]	E. J. Moler, M. L. Chow, and I. S. Mian. Analysis of molecular profile data using generative and discriminative methods. Physiol. Genomics, 4(2):109-126, Dec 2000. [ bib \| http \| .pdf ] A modular framework is proposed for modeling and understanding the relationships between molecular profile data and other domain knowledge using a combination of generative (here, graphical models) and discriminative [Support Vector Machines (SVMs)] methods. As illustration, naive Bayes models, simple graphical models, and SVMs were applied to published transcription profile data for 1,988 genes in 62 colon adenocarcinoma tissue specimens labeled as tumor or nontumor. These unsupervised and supervised learning methods identified three classes or subtypes of specimens, assigned tumor or nontumor labels to new specimens and detected six potentially mislabeled specimens. The probability parameters of the three classes were utilized to develop a novel gene relevance, ranking, and selection method. SVMs trained to discriminate nontumor from tumor specimens using only the 50-200 top-ranked genes had the same or better generalization performance than the full repertoire of 1,988 genes. Approximately 90 marker genes were pinpointed for use in understanding the basic biology of colon adenocarcinoma, defining targets for therapeutic intervention and developing diagnostic tools. These potential markers highlight the importance of tissue biology in the etiology of cancer. Comparative analysis of molecular profile data is proposed as a mechanism for predicting the physiological function of genes in instances when comparative sequence analysis proves uninformative, such as with human and yeast translationally controlled tumour protein. Graphical models and SVMs hold promise as the foundations for developing decision support systems for diagnosis, prognosis, and monitoring as well as inferring biological networks. Keywords: biosvm
[Lodhi2000Text]	H. Lodhi, J. Shawe-Taylor, N. Cristianini, and C. J. C. H. Watkins. Text Classification using String Kernels. In Adv. Neural Inform. Process. Syst., pages 563-569, 2000. [ bib \| .ps.gz \| .pdf ] Keywords: biosvm
[Lazo2000Combinatorial]	J. S. Lazo and P. Wipf. Combinatorial chemistry and contemporary pharmacology. J. Pharmacol. Exp. Ther., 293(3):705-709, Jun 2000. [ bib ] Both solid- and liquid-phase combinatorial chemistry have emerged as powerful tools for identifying pharmacologically active compounds and optimizing the biological activity of a lead compound. Complementary high-throughput in vitro assays are essential for compound evaluation. Cell-based assays that use optical endpoints permit investigation of a wide variety of functional properties of these compounds including specific intracellular biochemical pathways, protein-protein interactions, and the subcellular localization of targets. Integration of combinatorial chemistry with contemporary pharmacology now represents an important factor in drug discovery and development. Keywords: Alzheimer Disease, Animals, Antineoplastic Agents, Biological, Bleomycin, Cell Cycle, Cell Cycle Proteins, Cell Death, Cell Line, Cell Nucleus, Cell Shape, Cell Transformation, Combinatorial Chemistry Techniques, Cultured, Drug Delivery Systems, Drug Design, Drug Evaluation, Enzyme Inhibitors, Formazans, Gene Expression, Humans, Inhibitory Concentration 50, Kinetics, Magnetic Resonance Spectroscopy, Mass, Mitochondria, Models, Molecular, Neoplasms, Neoplastic, Non-P.H.S., Non-U.S. Gov't, P.H.S., Paclitaxel, Peptide Library, Pharmaceutical Preparations, Pharmacology, Phosphoprotein Phosphatase, Preclinical, Protease Inhibitors, Protein-Tyrosine-Phosphatase, Research Support, Sensitivity and Specificity, Signal Transduction, Spectrum Analysis, Stereoisomerism, Structure-Activity Relationship, Sulfonic Acids, Tetrazolium Salts, Thiazoles, Toxicity Tests, Tumor, Tumor Cells, U.S. Gov't, cdc25 Phosphatase, 10869367
[Jeong2000large-scale]	H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A.-L. Barabási. The large-scale organization of metabolic networks. Nature, 407:651-654, 2000. [ bib \| http \| .pdf ]
[Jaakkola2000Discriminative]	T. Jaakkola, M. Diekhans, and D. Haussler. A Discriminative Framework for Detecting Remote Protein Homologies. J. Comput. Biol., 7(1,2):95-114, 2000. [ bib \| .ps \| .pdf ] Keywords: biosvm
[Ito2000Toward]	T. Ito, K. Tashiro, S. Muta, R. Ozawa, T. Chiba, M. Nishizawa, K. Yamamoto, S. Kuhara, and Y. Sakaki. Toward a protein-protein interaction map of the budding yeast: A comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proc. Natl. Acad. Sci. USA, 97(3):1143-1147, 2000. [ bib \| http \| .pdf ]
[Gether2000Uncovering]	U. Gether. Uncovering molecular mechanisms involved in activation of G protein-coupled receptors. Endocr Rev, 21(1):90-113, Feb 2000. [ bib ] G protein-coupled, seven-transmembrane segment receptors (GPCRs or 7TM receptors), with more than 1000 different members, comprise the largest superfamily of proteins in the body. Since the cloning of the first receptors more than a decade ago, extensive experimental work has uncovered multiple aspects of their function and challenged many traditional paradigms. However, it is only recently that we are beginning to gain insight into some of the most fundamental questions in the molecular function of this class of receptors. How can, for example, so many chemically diverse hormones, neurotransmitters, and other signaling molecules activate receptors believed to share a similar overall tertiary structure? What is the nature of the physical changes linking agonist binding to receptor activation and subsequent transduction of the signal to the associated G protein on the cytoplasmic side of the membrane and to other putative signaling pathways? The goal of the present review is to specifically address these questions as well as to depict the current awareness about GPCR structure-function relationships in general. Keywords: Animals; GTP-Binding Proteins; Humans; Ligands; Models, Biological; Molecular Conformation; Receptors, Cell Surface
[Furey2000Support]	T. S. Furey, N. Cristianini, N. Duffy, D. W. Bednarski, M. Schummer, and D. Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16(10):906-914, Oct 2000. [ bib \| http \| .pdf ] Motivation: DNA microarray experiments generating thousands of gene expression measurements are being used to gather information from tissue and cell samples regarding gene expression differences that will be useful in diagnosing disease. We have developed a new method to analyse this kind of data using support vector machines (SVMs). This analysis consists of both classification of the tissue samples, and an exploration of the data for mis-labeled or questionable tissue results. Results: We demonstrate the method in detail on samples consisting of ovarian cancer tissues, normal ovarian tissues, and other normal tissues. The dataset consists of expression experiment results for 97802 cDNAs for each tissue. As a result of computational analysis, a tissue sample is discovered and confirmed to be wrongly labeled. Upon correction of this mistake and the removal of an outlier, perfect classification of tissues is achieved, but not with high confidence. We identify and analyse a subset of genes from the ovarian dataset whose expression is highly differentiated between the types of tissues. To show robustness of the SVM method, two previously published datasets from other types of tissues or cells are analysed. The results are comparable to those previously obtained. We show that other machine learning methods also perform comparably to the SVM on many of those datasets. Availability: The SVM software is available at http://www.cs.columbia.edu/~bgrundy/svm. Contact: booch@cse.ucsc.edu Keywords: biosvm
[Friedman2000Using]	N. Friedman, M. Linial, I. Nachman, and D. Pe'er. Using Bayesian networks to analyze expression data. J. Comput. Biol., 7(3-4):601-620, 2000. [ bib \| DOI \| http \| .pdf ] DNA hybridization arrays simultaneously measure the expression level for thousands of genes. These measurements provide a "snapshot" of transcription levels within the cell. A major challenge in computational biology is to uncover, from such measurements, gene/protein interactions and key biological features of cellular systems. In this paper, we propose a new framework for discovering interactions between genes based on multiple expression measurements. This framework builds on the use of Bayesian networks for representing statistical dependencies. A Bayesian network is a graph-based model of joint multivariate probability distributions that captures properties of conditional independence between variables. Such models are attractive for their ability to describe complex stochastic processes and because they provide a clear methodology for learning from (noisy) observations. We start by showing how Bayesian networks can describe interactions between genes. We then describe a method for recovering gene interactions from microarray data using tools for learning Bayesian networks. Finally, we demonstrate this method on the S. cerevisiae cell-cycle measurements of Spellman et al. (1998). Keywords: biogm
[Cai2000Support]	Y.D. Cai, X.J. Liu, X.B. Xu, and K.C. Chou. Support vector machines for prediction of protein subcellular location. Mol. Cell Biol. Res. Commun., 4(4):230-234, 2000. [ bib \| DOI \| http \| www: ] Support Vector Machine (SVM), which is one kind of learning machines, was applied to predict the subcellular location of proteins from their amino acid composition. In this research, the proteins are classified into the following 12 groups: (1) chloroplast, (2) cytoplasm, (3) cytoskeleton, (4) endoplasmic reticulum, (5) extracellular, (6) Golgi apparatus, (7) lysosome, (8) mitochondria, (9) nucleus, (10) peroxisome, (11) plasma membrane, and (12) vacuole, which have covered almost all the organelles and subcellular compartments in an animal or plant cell. The examination for the self-consistency and the jackknife test of the SVMs method was tested for the three sets: 2022 proteins, 2161 proteins, and 2319 proteins. As a result, the correct rate of self-consistency and jackknife test reaches 91 and 82% [...] 73% [...] rate was tested by the three independent testing datasets containing 2240 proteins, 2513 proteins, and 2591 proteins. The correct prediction rates reach 82, 75, and 73% for 2240 proteins, 2513 proteins, and 2591 proteins, respectively. Keywords: biosvm
[Brown2000Knowledge-based]	M. P. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. W. Sugnet, T. S. Furey, M. Ares, and D. Haussler. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. USA, 97(1):262-7, Jan 2000. [ bib \| http \| .pdf ] We introduce a method of functionally classifying genes by using gene expression data from DNA microarray hybridization experiments. The method is based on the theory of support vector machines (SVMs). SVMs are considered a supervised computer learning method because they exploit prior knowledge of gene function to identify unknown genes of similar function from expression data. SVMs avoid several problems associated with unsupervised clustering methods, such as hierarchical clustering and self-organizing maps. SVMs have many mathematical features that make them attractive for gene expression analysis, including their flexibility in choosing a similarity function, sparseness of solution when dealing with large data sets, the ability to handle large feature spaces, and the ability to identify outliers. We test several SVMs that use different similarity metrics, as well as some other supervised learning methods, and find that the SVMs best identify sets of genes with a common function using expression data. Finally, we use SVMs to predict functional roles for uncharacterized yeast ORFs based on their expression data. Keywords: biosvm microarray
[Ben-Dor2000Tissue]	A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z. Yakhini. Tissue classification with gene expression profiles. J. Comput. Biol., 7(3-4):559-583, 2000. [ bib \| http \| .pdf ] Constantly improving gene expression profiling technologies are expected to provide understanding and insight into cancer-related cellular processes. Gene expression data is also expected to significantly aid in the development of efficient cancer diagnosis and classification platforms. In this work we examine three sets of gene expression data measured across sets of tumor(s) and normal clinical samples: The first set consists of 2,000 genes, measured in 62 epithelial colon samples (Alon et al., 1999). The second consists of approximately 100,000 clones, measured in 32 ovarian samples (unpublished extension of data set described in Schummer et al. (1999)). The third set consists of approximately 7,100 genes, measured in 72 bone marrow and peripheral blood samples (Golub et al., 1999). We examine the use of scoring methods, measuring separation of tissue type (e.g., tumors from normals) using individual gene expression levels. These are then coupled with high-dimensional classification methods to assess the classification power of complete expression profiles. We present results of performing leave-one-out cross validation (LOOCV) experiments on the three data sets, employing nearest neighbor classifier, SVM (Cortes and Vapnik, 1995), AdaBoost (Freund and Schapire, 1997) and a novel clustering-based classification technique. As tumor samples can differ from normal samples in their cell-type composition, we also perform LOOCV experiments using appropriately modified sets of genes, attempting to eliminate the resulting bias. We demonstrate success rate of at least 90% in tumor versus normal classification, using sets of selected genes, with, as well as without, cellular-contamination-related members.
These results are insensitive to the exact selection mechanism, over a certain range. Keywords: biosvm microarray
[Akutsu2000Inferring]	T. Akutsu, S. Miyano, and S. Kuhara. Inferring qualitative relations in genetic networks and metabolic pathways. Bioinformatics, 16(8):727-734, 2000. [ bib \| http \| .pdf ]
[Ashburner2000Gene]	M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock. Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 25(1):25-29, May 2000. [ bib \| DOI \| http ] Keywords: Animals; Computer Communication Networks; Databases, Factual; Eukaryotic Cells; Genes; Humans; Metaphysics; Mice; Molecular Biology; Sequence Analysis, DNA; Terminology as Topic
[Yeang2001Molecular]	C.H. Yeang, S. Ramaswamy, P. Tamayo, S. Mukherjee, R.M. Rifkin, M. Angelo, M. Reich, E. Lander, J. Mesirov, and T. Golub. Molecular classification of multiple tumor types. Bioinformatics, 17(Suppl. 1):S316-S322, 2001. [ bib \| http \| .pdf ] Using gene expression data to classify tumor types is a very promising tool in cancer diagnosis. Previous works show several pairs of tumor types can be successfully distinguished by their gene expression patterns (Golub et al. 1999, Ben-Dor et al. 2000, Alizadeh et al. 2000). However, the simultaneous classification across a heterogeneous set of tumor types has not been well studied yet. We obtained 190 samples from 14 tumor classes and generated a combined expression dataset containing 16063 genes for each of those samples. We performed multi-class classification by combining the outputs of binary classifiers. Three binary classifiers (k-nearest neighbors, weighted voting, and support vector machines) were applied in conjunction with three combination scenarios (one-vs-all, all-pairs, hierarchical partitioning). We achieved the best cross validation error rate of 18.75% [...] support vector machine algorithm. The results demonstrate the feasibility of performing clinically useful classification from samples of multiple tumor types. Keywords: biosvm
[Xue2001Fingerprint]	L. Xue, F. L. Stahura, J. W. Godden, and J. Bajorath. Fingerprint scaling increases the probability of identifying molecules with similar activity in virtual screening calculations. J Chem Inf Comput Sci, 41(3):746-753, 2001. [ bib ] Results of systematic virtual screening calculations using a structural key-type fingerprint are reported for compounds belonging to 14 activity classes added to randomly selected synthetic molecules. For each class, a fingerprint profile was calculated to monitor the relative occupancy of fingerprint bit positions. Consensus bit patterns were determined consisting of all bits that were always set on in compounds belonging to a specific activity class. In virtual screening calculations, scale factors were applied to each consensus bit position in fingerprints of query molecules. This technique, called "fingerprint scaling", effectively increases the weight of consensus bit positions in fingerprint comparisons. Although overall prediction accuracy was satisfactory using unscaled calculations, scaling significantly increased the number of correct predictions but only slightly increased the rate of false positives. These observations suggest that fingerprint scaling is an attractive approach to increase the probability of identifying molecules with similar activity by virtual screening. It requires the availability of a series of related compounds and can be easily applied to any keyed fingerprint representation that associates bit positions with specific molecular features. Keywords: 16S, Algae, Algorithms, Animals, Archaeal, Automation, Bacteria, Biodiversity, Chemical, Colorimetry, Computational Biology, Computer Terminals, DNA, DNA Fingerprinting, Daphnia, Databases, Ecosystem, Euryarchaeota, Factual, Fresh Water, Hazardous Substances, Humans, Information Storage and Retrieval, Methane, Models, Non-U.S. 
Gov't, Oxidoreductases, Perciformes, Photic Stimulation, Photometry, Polymorphism, Quantitative Structure-Activity Relationship, RNA, Research Support, Restriction Fragment Length, Ribosomal, Seasons, Soil Microbiology, Spain, Sulfur, Theoretical, Time Factors, Toxicity Tests, Water Microbiology, Water Pollutants, 11410055
[Xiong2001Biomarker]	M. Xiong, X. Fang, and J. Zhao. Biomarker Identification by Feature Wrappers. Genome Res., 11(11):1878-1887, 2001. [ bib \| http \| .pdf ] Gene expression studies bridge the gap between DNA information and trait information by dissecting biochemical pathways into intermediate components between genotype and phenotype. These studies open new avenues for identifying complex disease genes and biomarkers for disease diagnosis and for assessing drug efficacy and toxicity. However, the majority of analytical methods applied to gene expression data are not efficient for biomarker identification and disease diagnosis. In this paper, we propose a general framework to incorporate feature (gene) selection into pattern recognition in the process to identify biomarkers. Using this framework, we develop three feature wrappers that search through the space of feature subsets using the classification error as measure of goodness for a particular feature subset being "wrapped around": linear discriminant analysis, logistic regression, and support vector machines. To effectively carry out this computationally intensive search process, we employ sequential forward search and sequential forward floating search algorithms. To evaluate the performance of feature selection for biomarker identification we have applied the proposed methods to three data sets. The preliminary results demonstrate that very high classification accuracy can be attained by identified composite classifiers with several biomarkers. Keywords: biosvm
[Wagner2001Yeast]	A. Wagner. The Yeast Protein Interaction Network Evolves Rapidly and Contains Few Redundant Duplicate Genes. Mol. Biol. Evol., 18:1283-1292, 2001. [ bib \| .html \| .pdf ]
[Vercoutere2001Rapid]	W. Vercoutere, S. Winters-Hilt, H. Olsen, D. Deamer, D. Haussler, and M. Akeson. Rapid discrimination among individual DNA hairpin molecules at single-nucleotide resolution using an ion channel. Nat Biotechnol, 19(3):248-52, Mar 2001. [ bib \| DOI \| http \| .pdf ] RNA and DNA strands produce ionic current signatures when driven through an alpha-hemolysin channel by an applied voltage. Here we combine this nanopore detector with a support vector machine (SVM) to analyze DNA hairpin molecules on the millisecond time scale. Measurable properties include duplex stem length, base pair mismatches, and loop length. This nanopore instrument can discriminate between individual DNA hairpins that differ by one base pair or by one nucleotide. Keywords: Acute, Acute Disease, Adenocarcinoma, Algorithms, Amino Acid Sequence, Artificial Intelligence, Automated, B-Lymphocytes, Bacterial Proteins, Base Pair Mismatch, Base Sequence, Bayes Theorem, Binding Sites, Biological, Bone Marrow Cells, Cell Compartmentation, Chemistry, Child, Chromosome Aberrations, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA, Data Interpretation, Databases, Decision Trees, Diagnosis, Discriminant Analysis, Electric Conductivity, Electrophysiology, Escherichia coli Proteins, Factual, Female, Fungal, Gastric Emptying, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Hemolysins, Humans, Ion Channels, Kinetics, Leukemia, Lipid Bilayers, Logistic Models, Lymphocytic, Male, Markov Chains, Melanoma, Models, Molecular, Myeloid, Neoplasm, Neoplastic, Neural Networks (Computer), Nevus, Non-P.H.S., Non-U.S. 
Gov't, Nucleic Acid Conformation, Organ Specificity, Organelles, P.H.S., Pattern Recognition, Physical, Pigmented, Predictive Value of Tests, Promoter Regions (Genetics), Protein Folding, Protein Structure, Proteins, Proteome, RNA, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Secondary, Sensitivity and Specificity, Sequence Alignment, Sex Characteristics, Skin Diseases, Skin Neoplasms, Skin Pigmentation, Software, Statistical, Stomach Diseases, T-Lymphocytes, Thermodynamics, Transcription, Transcription Factors, Tumor Markers, U.S. Gov't, 11231558
[Venter2001Sequence]	J. C. Venter et al. The Sequence of the Human Genome. Science, 291(5507):1304-1351, 2001. [ bib \| http \| .pdf ] A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies, a whole-genome assembly and a regional chromosome assembly, were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% [...] bp or more, and 25% [...] or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional 12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% [...] with 75% [...] segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history.
Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% [...] task of determining which SNPs have functional consequences remains an open challenge. Keywords: genomics bio
[Vazquez2001Modeling]	A. Vazquez, A. Flammini, A. Maritan, and A. Vespignani. Modeling of protein interaction networks. E-print cond-mat/0108043, Aug 2001. [ bib \| http \| .pdf ]
[Suykens2001Optimal]	J. A. Suykens, J. Vandewalle, and B. De Moor. Optimal control by least squares support vector machines. Neural Netw, 14(1):23-35, Jan 2001. [ bib ] Support vector machines have been very successful in pattern recognition and function estimation problems. In this paper we introduce the use of least squares support vector machines (LS-SVM's) for the optimal control of nonlinear systems. Linear and neural full static state feedback controllers are considered. The problem is formulated in such a way that it incorporates the N-stage optimal control problem as well as a least squares support vector machine approach for mapping the state space into the action space. The solution is characterized by a set of nonlinear equations. An alternative formulation as a constrained nonlinear optimization problem in less unknowns is given, together with a method for imposing local stability in the LS-SVM control scheme. The results are discussed for support vector machines with radial basis function kernel. Advantages of LS-SVM control are that no number of hidden units has to be determined for the controller and that no centers have to be specified for the Gaussian kernels when applying Mercer's condition. The curse of dimensionality is avoided in comparison with defining a regular grid for the centers in classical radial basis function networks. This is at the expense of taking the trajectory of state variables as additional unknowns in the optimization problem, while classical neural network approaches typically lead to parametric optimization problems. In the SVM methodology the number of unknowns equals the number of training data, while in the primal space the number of unknowns can be infinite dimensional. The method is illustrated both on stabilization and tracking problems including examples on swinging up an inverted pendulum with local stabilization at the endpoint and a tracking problem for a ball and beam system. 
Keywords: Acute, Acute Disease, Adenocarcinoma, Algorithms, Amino Acid Sequence, Artificial Intelligence, Automated, B-Lymphocytes, Bacterial Proteins, Base Pair Mismatch, Base Sequence, Bayes Theorem, Binding Sites, Biological, Bone Marrow Cells, Cell Compartmentation, Chemistry, Child, Chromosome Aberrations, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA, Data Interpretation, Databases, Decision Trees, Diagnosis, Discriminant Analysis, Electric Conductivity, Electrophysiology, Escherichia coli Proteins, Factual, Feedback, Female, Fungal, Gastric Emptying, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Hemolysins, Humans, Ion Channels, Kinetics, Leukemia, Lipid Bilayers, Logistic Models, Lymphocytic, Male, Markov Chains, Melanoma, Models, Molecular, Myeloid, Neoplasm, Neoplastic, Neural Networks (Computer), Nevus, Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Normal Distribution, Nucleic Acid Conformation, Organ Specificity, Organelles, P.H.S., Pattern Recognition, Physical, Pigmented, Predictive Value of Tests, Promoter Regions (Genetics), Protein Folding, Protein Structure, Proteins, Proteome, RNA, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Secondary, Sensitivity and Specificity, Sequence Alignment, Sex Characteristics, Skin Diseases, Skin Neoplasms, Skin Pigmentation, Software, Statistical, Stomach Diseases, T-Lymphocytes, Thermodynamics, Transcription, Transcription Factors, Tumor Markers, U.S. Gov't, 11213211
[Su2001Molecular]	A. I. Su, J. B. Welsh, L. M. Sapinoso, S. G. Kern, P. Dimitrov, H. Lapp, P. G. Schultz, S. M. Powell, C. A. Moskaluk, H. F. Frierson Jr., and G. M. Hampton. Molecular Classification of Human Carcinomas by Use of Gene Expression Signatures. Cancer Res., 61(20):7388-7393, 2001. [ bib \| http \| .html ] Classification of human tumors according to their primary anatomical site of origin is fundamental for the optimal treatment of patients with cancer. Here we describe the use of large-scale RNA profiling and supervised machine learning algorithms to construct a first-generation molecular classification scheme for carcinomas of the prostate, breast, lung, ovary, colorectum, kidney, liver, pancreas, bladder/ureter, and gastroesophagus, which collectively account for ~70% of cancer-related deaths in the United States. The classification scheme was based on identifying gene subsets whose expression typifies each cancer class, and we quantified the extent to which these genes are characteristic of a specific tumor type by accurately and confidently predicting the anatomical site of tumor origin for 90% [...], including 9 of 12 metastatic lesions. The predictor gene subsets include those whose expression is typical of specific types of normal epithelial differentiation, as well as other genes whose expression is elevated in cancer. This study demonstrates the feasibility of predicting the tissue origin of a carcinoma in the context of multiple cancer classes. Keywords: biosvm, breastcancer
[Sole2001Model]	R. V. Solé, R. Pastor-Satorras, E. D. Smith, and T. Kepler. A Model of Large-Scale Proteome Evolution. Technical report, Santa Fe Institute, 2001. Working paper 01-08-041. [ bib \| .html \| .pdf ]
[Sherry2001dbSNP]	S. T. Sherry, M. H. Ward, M. Kholodov, J. Baker, L. Phan, E. M. Smigielski, and K. Sirotkin. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res, 29(1):308-311, Jan 2001. [ bib ] In response to a need for a general catalog of genome variation to address the large-scale sampling designs required by association studies, gene mapping and evolutionary biology, the National Center for Biotechnology Information (NCBI) has established the dbSNP database [S.T.Sherry, M.Ward and K. Sirotkin (1999) Genome Res., 9, 677-679]. Submissions to dbSNP will be integrated with other sources of information at NCBI such as GenBank, PubMed, LocusLink and the Human Genome Project data. The complete contents of dbSNP are available to the public at website: http://www.ncbi.nlm.nih.gov/SNP. The complete contents of dbSNP can also be downloaded in multiple formats via anonymous FTP at ftp://ncbi.nlm.nih.gov/snp/. Keywords: Animals; Biotechnology; Databases, Factual; Genetic Variation; Humans; Information Services; Internet; National Institutes of Health (U.S.); National Library of Medicine (U.S.); Polymorphism, Single Nucleotide, genetics; United States
[Ramaswamy2001Multiclass]	S. Ramaswamy, P. Tamayo, R. Rifkin, S. Mukherjee, C.H. Yeang, M. Angelo, C. Ladd, M. Reich, E. Latulippe, J.P. Mesirov, T. Poggio, W. Gerald, M. Loda, E.S. Lander, and T.R. Golub. Multiclass cancer diagnosis using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA, 98(26):15149-15154, Dec 2001. [ bib \| DOI \| http \| .pdf ] The optimal treatment of patients with cancer depends on establishing accurate diagnoses by using a complex combination of clinical and histopathological data. In some instances, this task is difficult or impossible because of atypical clinical presentation or histopathology. To determine whether the diagnosis of multiple common adult malignancies could be achieved purely by molecular classification, we subjected 218 tumor samples, spanning 14 common tumor types, and 90 normal tissue samples to oligonucleotide microarray gene expression analysis. The expression levels of 16,063 genes and expressed sequence tags were used to evaluate the accuracy of a multiclass classifier based on a support vector machine algorithm. Overall classification accuracy was 78% [...]. Poorly differentiated cancers resulted in low-confidence predictions and could not be accurately classified according to their tissue of origin, indicating that they are molecularly distinct entities with dramatically different gene expression patterns compared with their well differentiated counterparts. Taken together, these results demonstrate the feasibility of accurate, multiclass molecular cancer classification and suggest a strategy for future clinical implementation of molecular cancer diagnostics. Keywords: biosvm microarray
[Rain2001protein-protein]	J.-C. Rain, L. Selig, H. De Reuse, V. Battaglia, C. Reverdy, S. Simon, G. Lenzen, F. Petel, J. Wojcik, V. Schächter, Y. Chemama, A. Labigne, and P. Legrain. The protein-protein interaction map of Helicobacter pylori. Nature, 409:211-215, 2001. [ bib \| http \| .pdf ]
[Qian2001Protein]	J. Qian, N. M. Luscombe, and M. Gerstein. Protein Fold and Family Occurrence in Genomes: Power-Law Behaviour and Evolutionary Model. J. Mol. Biol., 313:673-681, 2001. [ bib \| http \| .pdf ]
[Podani2001Comparable]	J. Podani, Z.N. Oltvai, H. Jeong, B. Tombor, A.-L. Barabási, and E. Szathmáry. Comparable system-level organization of Archaea and Eukaryotes. Nat. Genet., 29:54-56, 2001. [ bib \| http \| .pdf ]
[Pavlidis2001Gene]	P. Pavlidis, J. Weston, J. Cai, and W.N. Grundy. Gene functional classification from heterogeneous data. In Proceedings of the Fifth Annual International Conference on Computational Biology, pages 249-255, 2001. [ bib \| .pdf \| .pdf ] Keywords: biosvm
[Pavlidis2001Promoter]	P. Pavlidis, T. S. Furey, M. Liberto, D. Haussler, and W. N. Grundy. Promoter Region-Based Classification of Genes. In Pacific Symposium on Biocomputing, pages 139-150, 2001. [ bib \| .pdf \| .pdf ] Keywords: biosvm
[Model2001Feature]	F. Model, P. Adorjan, A. Olek, and C. Piepenbrock. Feature selection for DNA methylation based cancer classification. Bioinformatics, 17(Supp. 1):S157-S164, 2001. [ bib \| http \| .pdf ] Molecular portraits, such as mRNA expression or DNA methylation patterns, have been shown to be strongly correlated with phenotypical parameters. These molecular patterns can be revealed routinely on a genomic scale. However, class prediction based on these patterns is an under-determined problem, due to the extreme high dimensionality of the data compared to the usually small number of available samples. This makes a reduction of the data dimensionality necessary. Here we demonstrate how phenotypic classes can be predicted by combining feature selection and discriminant analysis. By comparing several feature selection methods we show that the right dimension reduction strategy is of crucial importance for the classification performance. The techniques are demonstrated by methylation pattern based discrimination between acute lymphoblastic leukemia and acute myeloid leukemia. Contact: Fabian.Model@epigenomics.com Keywords: biosvm
[Miwakeichi2001comparison]	F. Miwakeichi, R. Ramirez-Padron, P. A. Valdes-Sosa, and T. Ozaki. A comparison of non-linear non-parametric models for epilepsy data. Comput. Biol. Med., 31(1):41-57, Jan 2001. [ bib ] EEG spike and wave (SW) activity has been described through a non-parametric stochastic model estimated by the Nadaraya-Watson (NW) method. In this paper the performance of the NW, the local linear polynomial regression and support vector machines (SVM) methods were compared. The noise-free realizations obtained by the NW and SVM methods reproduced SW better than as reported in previous works. The tuning parameters had to be estimated manually. Adding dynamical noise, only the NW method was capable of generating SW similar to training data. The standard deviation of the dynamical noise was estimated by means of the correlation dimension. Keywords: Acute, Acute Disease, Adenocarcinoma, Algorithms, Amino Acid Sequence, Animals, Artificial Intelligence, Automated, B-Lymphocytes, Bacterial Proteins, Base Pair Mismatch, Base Sequence, Bayes Theorem, Binding Sites, Biological, Bone Marrow Cells, Brachyura, Cell Compartmentation, Chemistry, Child, Chromosome Aberrations, Classification, Codon, Colonic Neoplasms, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA, Data Interpretation, Databases, Decision Trees, Diabetes Mellitus, Diagnosis, Discriminant Analysis, Discrimination Learning, Electric Conductivity, Electroencephalography, Electrophysiology, Epilepsy, Escherichia coli Proteins, Factual, Feedback, Female, Fungal, Gastric Emptying, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Genetic Predisposition to Disease, Genomics, Hemolysins, Humans, Indians, Information Storage and Retrieval, Initiator, Ion Channels, Kinetics, Leukemia, Likelihood Functions, Linear Models, Lipid Bilayers, Logistic Models, Lymphocytic, MEDLINE, Male, Markov Chains, Melanoma, Models, Molecular, Myeloid, 
Neoplasm, Neoplasms, Neoplastic, Neural Networks (Computer), Neurological, Nevus, Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Normal Distribution, North American, Nucleic Acid Conformation, Oligonucleotide Array Sequence Analysis, Organ Specificity, Organelles, Ovarian Neoplasms, Ovary, P.H.S., Pattern Recognition, Physical, Pigmented, Predictive Value of Tests, Promoter Regions (Genetics), Protein Biosynthesis, Protein Folding, Protein Structure, Proteins, Proteome, RNA, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Secondary, Sensitivity and Specificity, Sequence Alignment, Sequence Analysis, Sex Characteristics, Skin Diseases, Skin Neoplasms, Skin Pigmentation, Software, Sound Spectrography, Statistical, Stochastic Processes, Stomach Diseases, T-Lymphocytes, Thermodynamics, Transcription, Transcription Factors, Tumor Markers, Type 2, U.S. Gov't, Vertebrates, 11058693
[Kim2001Evolving]	J. Kim, P.L. Krapivsky, B. Kahng, and S. Redner. Evolving protein interaction networks. E-print cond-mat/0203167, 2001. [ bib \| http \| .pdf ]
[Jeong2001Lethality]	H. Jeong, S. P. Mason, A.-L. Barabási, and Z. N. Oltvai. Lethality and centrality in protein networks. Nature, 411:41-42, 2001. [ bib \| http \| .pdf ]
[Ito2001comprehensive]	T. Ito, T. Chiba, R. Ozawa, M. Yoshida, M. Hattori, and Y. Sakaki. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. USA, 98(8):4569-4574, 2001. [ bib \| http \| .pdf ]
[Consortium2001Initial]	International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature, 409(6822):860-921, Feb 2001. [ bib \| DOI \| http \| .pdf ] The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence. Keywords: genomics bio
[Hua2001Support]	S. Hua and Z. Sun. Support vector machine approach for protein subcellular localization prediction. Bioinformatics, 17(8):721-728, 2001. [ bib \| http \| .pdf ] Motivation: Subcellular localization is a key functional characteristic of proteins. A fully automatic and reliable prediction system for protein subcellular localization is needed, especially for the analysis of large-scale genome sequences. Results: In this paper, Support Vector Machine has been introduced to predict the subcellular localization of proteins from their amino acid compositions. The total prediction accuracies reach 91.4% [...] in prokaryotic organisms and 79.4% [...] organisms. Predictions by our approach are robust to errors in the protein N-terminal sequences. This new approach provides superior prediction performance compared with existing algorithms based on amino acid composition and can be a complementary method to other existing methods based on sorting signals. Availability: A web server implementing the prediction method is available at http://www.bioinfo.tsinghua.edu.cn/SubLoc/. Contact: sunzhr@mail.tsinghua.edu.cn; huasj00@mails.tsinghua.edu.cn Supplementary information: Supplementary material is available at http://www.bioinfo.tsinghua.edu.cn/SubLoc Keywords: biosvm
[Dreiseitl2001comparison]	S. Dreiseitl, L. Ohno-Machado, H. Kittler, S. Vinterbo, H. Billhardt, and M. Binder. A comparison of machine learning methods for the diagnosis of pigmented skin lesions. J Biomed Inform, 34(1):28-36, Feb 2001. [ bib \| DOI \| http \| .pdf ] We analyze the discriminatory power of k-nearest neighbors, logistic regression, artificial neural networks (ANNs), decision trees, and support vector machines (SVMs) on the task of classifying pigmented skin lesions as common nevi, dysplastic nevi, or melanoma. Three different classification tasks were used as benchmarks: the dichotomous problem of distinguishing common nevi from dysplastic nevi and melanoma, the dichotomous problem of distinguishing melanoma from common and dysplastic nevi, and the trichotomous problem of correctly distinguishing all three classes. Using ROC analysis to measure the discriminatory power of the methods shows that excellent results for specific classification problems in the domain of pigmented skin lesions can be achieved with machine-learning methods. On both dichotomous and trichotomous tasks, logistic regression, ANNs, and SVMs performed on about the same level, with k-nearest neighbors and decision trees performing worse. Keywords: Algorithms, Amino Acid Sequence, Artificial Intelligence, Biological, Cell Compartmentation, Comparative Study, Computer Simulation, Computer-Assisted, Decision Trees, Diagnosis, Discriminant Analysis, Humans, Logistic Models, Melanoma, Models, Neural Networks (Computer), Nevus, Non-U.S. Gov't, Organelles, P.H.S., Pigmented, Predictive Value of Tests, Proteins, Reproducibility of Results, Research Support, Skin Diseases, Skin Neoplasms, Skin Pigmentation, U.S. Gov't, 11376540
[Ding2001Multi-class]	C.H.Q. Ding and I. Dubchak. Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics, 17:349-358, 2001. [ bib \| .pdf \| .pdf ] Motivation: Protein fold recognition is an important approach to structure discovery without relying on sequence similarity. We study this approach with new multi-class classification methods and examined many issues important for a practical recognition system. Results: Most current discriminative methods for protein fold prediction use the one-against-others method, which has the well-known "False Positives" problem. We investigated two new methods: the unique one-against-others and the all-against-all methods. Both improve prediction accuracy by 14-110% [...] SCOP folds. We used the Support Vector Machine (SVM) and the Neural Network (NN) learning methods as base classifiers. SVMs converge fast and lead to high accuracy. When scores of multiple parameter datasets are combined, majority voting reduces noise and increases recognition accuracy. We examined many issues involved with large number of classes, including dependencies of prediction accuracy on the number of folds and on the number of representatives in a fold. Overall, recognition systems achieve 56% [...] accuracy on a protein test dataset, where most of the proteins have below 25% [...] information: The protein parameter datasets used in this paper are available online (http://www.nersc.gov/~cding/protein). Keywords: biosvm
[Chow2001Identifying]	M. L. Chow, E. J. Moler, and I. S. Mian. Identifying marker genes in transcription profiling data using a mixture of feature relevance experts. Physiol. Genomics, 5(2):99-111, Mar 2001. [ bib \| http \| .pdf ] Transcription profiling experiments permit the expression levels of many genes to be measured simultaneously. Given profiling data from two types of samples, genes that most distinguish the samples (marker genes) are good candidates for subsequent in-depth experimental studies and developing decision support systems for diagnosis, prognosis, and monitoring. This work proposes a mixture of feature relevance experts as a method for identifying marker genes and illustrates the idea using published data from samples labeled as acute lymphoblastic and myeloid leukemia (ALL, AML). A feature relevance expert implements an algorithm that calculates how well a gene distinguishes samples, reorders genes according to this relevance measure, and uses a supervised learning method [here, support vector machines (SVMs)] to determine the generalization performances of different nested gene subsets. The mixture of three feature relevance experts examined implement two existing and one novel feature relevance measures. For each expert, a gene subset consisting of the top 50 genes distinguished ALL from AML samples as completely as all 7,070 genes. The 125 genes at the union of the top 50s are plausible markers for a prototype decision support system. Chromosomal aberration and other data support the prediction that the three genes at the intersection of the top 50s, cystatin C, azurocidin, and adipsin, are good targets for investigating the basic biology of ALL/AML. The same data were employed to identify markers that distinguish samples based on their labels of T cell/B cell, peripheral blood/bone marrow, and male/female. Selenoprotein W may discriminate T cells from B cells. 
Results from analysis of transcription profiling data from tumor/nontumor colon adenocarcinoma samples support the general utility of the aforementioned approach. Theoretical issues such as choosing SVM kernels and their parameters, training and evaluating feature relevance experts, and the impact of potentially mislabeled samples on marker identification (feature selection) are discussed. Keywords: biosvm
[Chou2001Using]	K.-C. Chou. Using subsite coupling to predict signal peptides. Protein Eng., 14(2):75-79, 2001. [ bib \| http \| .pdf ]
[Chou2001Prediction]	K.-C. Chou. Prediction of protein signal sequences and their cleavage sites. Protein. Struct. Funct. Genet., 42:136-139, 2001. [ bib \| http \| .pdf ]
[Carter2001computational]	R. J. Carter, I. Dubchak, and S. R. Holbrook. A computational approach to identify genes for functional RNAs in genomic sequences. Nucl. Acids Res., 29(19):3928-3938, 2001. [ bib \| http \| .pdf ] Currently there is no successful computational approach for identification of genes encoding novel functional RNAs (fRNAs) in genomic sequences. We have developed a machine learning approach using neural networks and support vector machines to extract common features among known RNAs for prediction of new RNA genes in the unannotated regions of prokaryotic and archaeal genomes. The Escherichia coli genome was used for development, but we have applied this method to several other bacterial and archaeal genomes. Networks based on nucleotide composition were 80-90% [...] for bacteria and 90-99% [...] achieved a significant improvement in accuracy by combining these predictions with those obtained using a second set of parameters consisting of known RNA sequence motifs and the calculated free energy of folding. Several known fRNAs not included in the training datasets were identified as well as several hundred predicted novel RNAs. These studies indicate that there are many unidentified RNAs in simple genomes that can be predicted computationally as a precursor to experimental study. Public access to our RNA gene predictions and an interface for user predictions is available via the web. Keywords: biosvm
[Cai2001Support]	Y.-D. Cai, X.-J. Liu, X.-B. Xu, and G.-P. Zhou. Support Vector Machines for predicting protein structural class. BMC Bioinformatics, 2(3):3, 2001. [ bib \| DOI \| http \| .pdf ] Background We apply a new machine learning method, the so-called Support Vector Machine method, to predict the protein structural class. Support Vector Machine method is performed based on the database derived from SCOP, in which protein domains are classified based on known structures and the evolutionary relationships and the principles that govern their 3-D structure. Results High rates of both self-consistency and jackknife tests are obtained. The good results indicate that the structural class of a protein is considerably correlated with its amino acid composition. Conclusions It is expected that the Support Vector Machine method and the elegant component-coupled method, also named as the covariant discrimination algorithm, if complemented with each other, can provide a powerful computational tool for predicting the structural classes of proteins. Keywords: biosvm
[Brazma2001Minimum]	A. Brazma, P. Hingamp, J. Quackenbush, G. Sherlock, P. Spellman, C. Stoeckert, J. Aach, W. Ansorge, C. A. Ball, H. C. Causton, T. Gaasterland, P. Glenisson, F. C. Holstege, I. F. Kim, V. Markowitz, J. C. Matese, H. Parkinson, A. Robinson, U. Sarkans, S. Schulze-Kremer, J. Stewart, R. Taylor, J. Vilo, and M. Vingron. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat. Genet., 29(4):365-371, Dec 2001. [ bib \| DOI \| http ] Microarray analysis has become a widely used tool for the generation of gene expression data on a genomic scale. Although many significant results have been derived from microarray studies, one limitation has been the lack of standards for presenting and exchanging such data. Here we present a proposal, the Minimum Information About a Microarray Experiment (MIAME), that describes the minimum information required to ensure that microarray data can be easily interpreted and that results derived from its analysis can be independently verified. The ultimate goal of this work is to establish a standard for recording and reporting microarray-based gene expression data, which will in turn facilitate the establishment of databases and public repositories and enable the development of data analysis tools. With respect to MIAME, we concentrate on defining the content and structure of the necessary information rather than the technical format for capturing it. Keywords: Computational Biology; Gene Expression Profiling, methods; Oligonucleotide Array Sequence Analysis, standards
[Bosshard2001Molecular]	H. R. Bosshard. Molecular recognition by induced fit: how fit is the concept? News Physiol Sci, 16:171-173, Aug 2001. [ bib ] Induced fit explains why biomolecules can bind together even if they are not optimized for binding. However, induced fit can lead to a kinetic bottleneck and does not describe every interaction in the absence of prior complementarity. Preselection of a fitting conformer is an alternative to induced fit. Keywords: Antigen-Antibody Complex, physiology; Biological Products, chemistry/metabolism; Models, Biological; Molecular Conformation
[Bock2001Predicting]	J. R. Bock and D. A. Gough. Predicting protein-protein interactions from primary structure. Bioinformatics, 17(5):455-460, 2001. [ bib \| .pdf \| .pdf ] Keywords: biosvm
[Beerenwinkel2001Geno2pheno]	N. Beerenwinkel, B. Schmidt, H. Walter, R. Kaiser, T. Lengauer, D. Hoffman, K. Korn, and J. Selbig. Geno2pheno: Interpreting Genotypic HIV Drug Resistance Tests. IEEE Intelligent Systems, 6(6):35-41, 2001. [ bib \| DOI \| http \| .pdf ] Rapid accumulation of resistance mutations in the genome of the human immunodeficiency virus (HIV) plays a central role in drug treatment failure in infected patients. The authors have developed geno2pheno, an intelligent system that uses the information encoded in the viral genomic sequence to predict resistance or susceptibility of the virus to 13 antiretroviral agents. To predict phenotypic drug resistance from genotype, they applied two machine learning techniques: decision trees and linear support vector machines. These techniques performed learning on more than 400 genotype-phenotype pairs for each drug. The authors compared the generalization performance of the two families of models in leave-one-out experiments. Except for three drugs, all error estimates ranged between 7.25 and 15.5 percent. Support vector machines performed slightly better for most drugs, but knowledge extraction was easier for decision trees. Geno2pheno is freely available at http://cartan.gmd.de/geno2pheno.html. Keywords: biosvm
[Bazzani2001SVM]	A. Bazzani, A. Bevilacqua, D. Bollini, R. Brancaccio, R. Campanini, N. Lanconelli, A. Riccardi, and D. Romani. An SVM classifier to separate false signals from microcalcifications in digital mammograms. Phys Med Biol, 46(6):1651-63, Jun 2001. [ bib \| DOI \| http \| .pdf ] In this paper we investigate the feasibility of using an SVM (support vector machine) classifier in our automatic system for the detection of clustered microcalcifications in digital mammograms. SVM is a technique for pattern recognition which relies on the statistical learning theory. It minimizes a function of two terms: the number of misclassified vectors of the training set and a term regarding the generalization classifier capability. We compare the SVM classifier with an MLP (multi-layer perceptron) in the false-positive reduction phase of our detection scheme: a detected signal is considered either microcalcification or false signal, according to the value of a set of its features. The SVM classifier gets slightly better results than the MLP one (Az value of 0.963 against 0.958) in the presence of a high number of training data; the improvement becomes much more evident (Az value of 0.952 against 0.918) in training sets of reduced size. Finally, the setting of the SVM classifier is much easier than the MLP one. Keywords: biosvm image
[Achard2001XML]	F. Achard, G. Vaysseix, and E. Barillot. XML, bioinformatics and data integration. Bioinformatics, 17(2):115-125, Feb 2001. [ bib ] Motivation: The eXtensible Markup Language (XML) is an emerging standard for structuring documents, notably for the World Wide Web. In this paper, the authors present XML and examine its use as a data language for bioinformatics. In particular, XML is compared to other languages, and some of the potential uses of XML in bioinformatics applications are presented. The authors propose to adopt XML for data interchange between databases and other sources of data. Finally the discussion is illustrated by a test case of a pedigree data model in XML. Contact: Emmanuel.Barillot@infobiogen.fr Keywords: Computational Biology; Humans; Information Storage and Retrieval; Internet; Programming Languages
[Hua2001Novel]	S. Hua and Z. Sun. A Novel Method of Protein Secondary Structure Prediction with High Segment Overlap Measure: Support Vector Machine Approach. J. Mol. Biol., 308(2):397-407, April 2001. [ bib \| DOI \| .pdf ] Keywords: biosvm
[Opper2001Universal]	M. Opper and R. Urbanczik. Universal learning curves of support vector machines. Phys Rev Lett, 86(19):4410-3, May 2001. [ bib ] Using methods of statistical physics, we investigate the role of model complexity in learning with support vector machines (SVMs), which are an important alternative to neural networks. We show the advantages of using SVMs with kernels of infinite complexity on noisy target rules, which, in contrast to common theoretical beliefs, are found to achieve optimal generalization error although the training error does not converge to the generalization error. Moreover, we find a universal asymptotics of the learning curves which depend only on the target rule but not on the SVM kernel. Keywords: Algorithms, Amino Acid Sequence, Artificial Intelligence, Biological, Cell Compartmentation, Chemistry, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, Databases, Decision Trees, Diagnosis, Discriminant Analysis, Electrophysiology, Factual, Gastric Emptying, Humans, Logistic Models, Melanoma, Models, Neural Networks (Computer), Nevus, Non-U.S. Gov't, Organelles, P.H.S., Physical, Pigmented, Predictive Value of Tests, Proteins, Proteome, Reproducibility of Results, Research Support, Skin Diseases, Skin Neoplasms, Skin Pigmentation, Software, Stomach Diseases, U.S. Gov't, 11328187
[Liang2001Detection]	H. Liang and Z. Lin. Detection of delayed gastric emptying from electrogastrograms with support vector machine. IEEE Trans Biomed Eng, 48(5):601-4, May 2001. [ bib ] A recent study reported a conventional neural network (NN) approach for the noninvasive diagnosis of delayed gastric emptying from the cutaneous electrogastrograms. Using support vector machine, we show that this relatively new technique can be used for detection of delayed gastric emptying and is in fact able to outdo the conventional NN. Keywords: Algorithms, Amino Acid Sequence, Artificial Intelligence, Biological, Cell Compartmentation, Comparative Study, Computer Simulation, Computer-Assisted, Decision Trees, Diagnosis, Discriminant Analysis, Electrophysiology, Gastric Emptying, Humans, Logistic Models, Melanoma, Models, Neural Networks (Computer), Nevus, Non-U.S. Gov't, Organelles, P.H.S., Pigmented, Predictive Value of Tests, Proteins, Reproducibility of Results, Research Support, Skin Diseases, Skin Neoplasms, Skin Pigmentation, Stomach Diseases, U.S. Gov't, 11341535
[Logan2001Study]	B. Logan, P. Moreno, B. Suzek, Z. Weng, and S. Kasif. A Study of Remote Homology Detection. Technical Report CRL 2001/05, Compaq Cambridge Research Laboratory, June 2001. [ bib \| .pdf ] Functional annotation of newly sequenced genomes is an important challenge for computational biology systems. While much progress has been made towards scaling up experimental methods for functional assignment to putative genes, most current genomic annotation systems rely on computational solutions for homology modeling via sequence or structural similarity. We present a new method for remote homology detection that relies on combining probabilistic modeling and supervised learning in high-dimensional feature spaces. Our system uses a transformation that converts protein domains to fixed-dimension representative feature vectors, where each feature records the sensitivity of each protein domain to a previously learned set of "protein motifs" or "blocks". Subsequently, the system utilizes Support Vector Machine (SVM) classifiers to learn the boundaries between structural protein classes. Our experiments suggest that this technique performs well relative to several other remote homology methods for the majority of protein domains in SCOP 1.37 PDB90. Keywords: biosvm
[Burbidge2001Drug]	R. Burbidge, M. Trotter, B. Buxton, and S. Holden. Drug design by machine learning: support vector machines for pharmaceutical data analysis. Comput. Chem., 26(1):4-15, December 2001. [ bib \| .pdf \| .pdf ] Keywords: biosvm chemoinformatics
[Zavaljevski2002Support]	N. Zavaljevski, F.J. Stevens, and J. Reifman. Support vector machines with selective kernel scaling for protein classification and identification of key amino acid positions. Bioinformatics, 18(5):689-696, 2002. [ bib \| http \| .pdf ] Motivation: Data that characterize primary and tertiary structures of proteins are now accumulating at a rapid and accelerating rate and require automated computational tools to extract critical information relating amino acid changes with the spectrum of functional attributes exhibited by a protein. We propose that immunoglobulin-type beta-domains, which are found in approximately 400 functionally distinct forms in humans alone, provide the immense genetic variation within limited conformational changes that might facilitate the development of new computational tools. As an initial step, we describe here an approach based on Support Vector Machine (SVM) technology to identify amino acid variations that contribute to the functional attribute of pathological self-assembly by some human antibody light chains produced during plasma cell diseases. Results: We demonstrate that SVMs with selective kernel scaling are an effective tool in discriminating between benign and pathologic human immunoglobulin light chains. Initial results compare favorably against manual classification performed by experts and indicate the capability of SVMs to capture the underlying structure of the data. The data set consists of 70 proteins of human antibody 1 light chains, each represented by aligned sequences of 120 amino acids. We perform feature selection based on a first-order adaptive scaling algorithm, which confirms the importance of changes in certain amino acid positions and identifies other positions that are key in the characterization of protein function. Keywords: biosvm
[Yuan2002Prediction]	Z. Yuan, K. Burrage, and J.S. Mattick. Prediction of protein solvent accessibility using support vector machines. Proteins, 48(3):566-570, 2002. [ bib \| DOI \| http \| .pdf ] A Support Vector Machine learning system has been trained to predict protein solvent accessibility from the primary structure. Different kernel functions and sliding window sizes have been explored to find how they affect the prediction performance. Using a cut-off threshold of 15 of exposed and buried residues), this method was able to achieve a prediction accuracy of 70.1 for multiple alignment sequence input, respectively. The prediction of three and more states of solvent accessibility was also studied and compared with other methods. The prediction accuracies are better than, or comparable to, those obtained by other methods such as neural networks, Bayesian classification, multiple linear regression, and information theory. In addition, our results further suggest that this system may be combined with other prediction methods to achieve more reliable results, and that the Support Vector Machine method is a very useful tool for biological sequence analysis. Keywords: biosvm
[Yu2002Methods]	Kun Yu, Nikolai Petrovsky, Christian Schönbach, Judice Y L Koh, and Vladimir Brusic. Methods for prediction of peptide binding to MHC molecules: a comparative study. Mol Med, 8(3):137-148, Mar 2002. [ bib ] BACKGROUND: A variety of methods for prediction of peptide binding to major histocompatibility complex (MHC) have been proposed. These methods are based on binding motifs, binding matrices, hidden Markov models (HMM), or artificial neural networks (ANN). There has been little prior work on the comparative analysis of these methods. MATERIALS AND METHODS: We performed a comparison of the performance of six methods applied to the prediction of two human MHC class I molecules, including binding matrices and motifs, ANNs, and HMMs. RESULTS: The selection of the optimal prediction method depends on the amount of available data (the number of peptides of known binding affinity to the MHC molecule of interest), the biases in the data set and the intended purpose of the prediction (screening of a single protein versus mass screening). When little or no peptide data are available, binding motifs are the most useful alternative to random guessing or use of a complete overlapping set of peptides for selection of candidate binders. As the number of known peptide binders increases, binding matrices and HMM become more useful predictors. ANN and HMM are the predictive methods of choice for MHC alleles with more than 100 known binding peptides. CONCLUSION: The ability of bioinformatic methods to reliably predict MHC binding peptides, and thereby potential T-cell epitopes, has major implications for clinical immunology, particularly in the area of vaccine design. Keywords: Amino Acid Motifs; Computational Biology; Histocompatibility Antigens Class I; Humans; Models, Molecular; Peptides; Protein Binding
[Weber2002Building]	Griffin Weber, Staal Vinterbo, and Lucila Ohno-Machado. Building an asynchronous web-based tool for machine learning classification. Proc AMIA Symp, pages 869-73, 2002. [ bib ] Various unsupervised and supervised learning methods including support vector machines, classification trees, linear discriminant analysis and nearest neighbor classifiers have been used to classify high-throughput gene expression data. Simpler and more widely accepted statistical tools have not yet been used for this purpose, hence proper comparisons between classification methods have not been conducted. We developed free software that implements logistic regression with stepwise variable selection as a quick and simple method for initial exploration of important genetic markers in disease classification. To implement the algorithm and allow our collaborators in remote locations to evaluate and compare its results against those of other methods, we developed a user-friendly asynchronous web-based application with a minimal amount of programming using free, downloadable software tools. With this program, we show that classification using logistic regression can perform as well as other more sophisticated algorithms, and it has the advantages of being easy to interpret and reproduce. By making the tool freely and easily available, we hope to promote the comparison of classification methods. In addition, we believe our web application can be used as a model for other bioinformatics laboratories that need to develop web-based analysis tools in a short amount of time and on a limited budget. 
Keywords: Acute, Algorithms, Animals, Artificial Intelligence, Automated, Base Pair Mismatch, Base Pairing, Base Sequence, Biological, Biosensing Techniques, Classification, Cluster Analysis, Comparative Study, Computational Biology, Computer-Assisted, Cystadenoma, DNA, Drug, Drug Design, Eukaryotic Cells, Female, Gene Expression, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Hemolysins, Humans, Internet, Leukemia, Ligands, Likelihood Functions, Logistic Models, Lymphocytic, Markov Chains, Mathematics, Messenger, Models, Molecular, Molecular Probe Techniques, Molecular Sequence Data, Nanotechnology, Neoplasm, Neoplasms, Neoplastic, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Nucleic Acid Conformation, Observer Variation, Oligonucleotide Array Sequence Analysis, Ovarian Neoplasms, P.H.S., Pattern Recognition, Probability, Protein Binding, Proteins, Quality Control, RNA, RNA Splicing, Receptors, Reference Values, Reproducibility of Results, Research Support, Sensitivity and Specificity, Sequence Analysis, Signal Processing, Software, Statistical, Stomach Neoplasms, Thermodynamics, Transcription, Tumor Markers, U.S. Gov't, 12463949
[Warmuth2002Active]	M. K. Warmuth, G. Rätsch, M. Mathieson, L. Liao, and C. Lemmen. Active learning in the drug discovery process. In T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Adv. Neural Inform. Process. Syst., volume 14, pages 1449-1456. MIT Press, 2002. [ bib ] Keywords: biosvm
[Wahba2002Soft]	Grace Wahba. Soft and hard classification by reproducing kernel Hilbert space methods. Proc Natl Acad Sci U S A, 99(26):16524-30, Dec 2002. [ bib \| DOI \| http \| .pdf ] Reproducing kernel Hilbert space (RKHS) methods provide a unified context for solving a wide variety of statistical modelling and function estimation problems. We consider two such problems: We are given a training set [yi, ti, i = 1, ..., n], where yi is the response for the ith subject, and ti is a vector of attributes for this subject. The value of yi is a label that indicates which category it came from. For the first problem, we wish to build a model from the training set that assigns to each t in an attribute domain of interest an estimate of the probability pj(t) that a (future) subject with attribute vector t is in category j. The second problem is in some sense less ambitious; it is to build a model that assigns to each t a label, which classifies a future subject with that t into one of the categories or possibly "none of the above." The approach to the first of these two problems discussed here is a special case of what is known as penalized likelihood estimation. The approach to the second problem is known as the support vector machine. We also note some alternate but closely related approaches to the second problem. These approaches are all obtained as solutions to optimization problems in RKHS. Many other problems, in particular the solution of ill-posed inverse problems, can be obtained as solutions to optimization problems in RKHS and are mentioned in passing. We caution the reader that although a large literature exists in all of these topics, in this inaugural article we are selectively highlighting work of the author, former students, and other collaborators. 
Keywords: Acute, Algorithms, Animals, Automated, Base Pair Mismatch, Base Pairing, Base Sequence, Biological, Biosensing Techniques, Classification, Cluster Analysis, Comparative Study, Computational Biology, Computer-Assisted, Cystadenoma, DNA, Drug, Drug Design, Eukaryotic Cells, Female, Gene Expression, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Hemolysins, Humans, Leukemia, Ligands, Likelihood Functions, Lymphocytic, Markov Chains, Mathematics, Messenger, Models, Molecular, Molecular Probe Techniques, Molecular Sequence Data, Nanotechnology, Neoplasm, Neoplastic, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Nucleic Acid Conformation, Observer Variation, Oligonucleotide Array Sequence Analysis, Ovarian Neoplasms, P.H.S., Pattern Recognition, Probability, Protein Binding, Proteins, Quality Control, RNA, RNA Splicing, Receptors, Reference Values, Reproducibility of Results, Research Support, Sensitivity and Specificity, Sequence Analysis, Signal Processing, Statistical, Stomach Neoplasms, Thermodynamics, Transcription, Tumor Markers, U.S. Gov't, 12477931
[Vert2002Graph-driven]	J.-P. Vert and M. Kanehisa. Graph-driven features extraction from microarray data. Technical Report 0206055, Arxiv physics, 2002. [ bib ] Keywords: biosvm
[Vert2002tree]	J.-P. Vert. A tree kernel to analyze phylogenetic profiles. Bioinformatics, 18:S276-S284, 2002. [ bib \| .html \| .pdf ] Keywords: biosvm
[Vert2002Support]	J.-P. Vert. Support vector machine prediction of signal peptide cleavage site using a new class of kernels for strings. In R. B. Altman, A. K. Dunker, L. Hunter, K. Lauderdale, and T. E. Klein, editors, Proceedings of the Pacific Symposium on Biocomputing 2002, pages 649-660. World Scientific, 2002. [ bib \| .pdf \| .pdf ] Keywords: biosvm
[Valentini2002Gene]	G. Valentini. Gene expression data analysis of human lymphoma using support vector machines and output coding ensembles. Artif. Intell. Med., 26(3):281-304, Nov 2002. [ bib \| DOI \| .pdf ] The large amount of data generated by DNA microarrays was originally analysed using unsupervised methods, such as clustering or self-organizing maps. Recently supervised methods such as decision trees, dot-product support vector machines (SVM) and multi-layer perceptrons (MLP) have been applied in order to classify normal and tumoural tissues. We propose methods based on non-linear SVM with polynomial and Gaussian kernels, and output coding (OC) ensembles of learning machines to separate normal from malignant tissues, to classify different types of lymphoma and to analyse the role of sets of coordinately expressed genes in carcinogenic processes of lymphoid tissues. Using gene expression data from "Lymphochip", a specialised DNA microarray developed at Stanford University School of Medicine, we show that SVM can correctly separate normal from tumoural tissues, and OC ensembles can be successfully used to classify different types of lymphoma. Moreover, we identify a group of coordinately expressed genes related to the separation of two distinct subgroups inside diffuse large B-cell lymphoma (DLBCL), validating a previous Alizadeh's hypothesis about the existence of two distinct diseases inside DLBCL. Keywords: biosvm
[Tsuda2002Marginalized]	K. Tsuda, T. Kin, and K. Asai. Marginalized Kernels for Biological Sequences. Bioinformatics, 18:S268-S275, 2002. [ bib \| .pdf ] Motivation: Kernel methods such as support vector machines require a kernel function between objects to be defined a priori. Several works have been done to derive kernels from probability distributions, e.g., the Fisher kernel. However, a general methodology to design a kernel is not fully developed. Results: We propose a reasonable way of designing a kernel when objects are generated from latent variable models (e.g., HMM). First of all, a joint kernel is designed for complete data which include both visible and hidden variables. Then a marginalized kernel for visible data is obtained by taking the expectation with respect to hidden variables. We will show that the Fisher kernel is a special case of marginalized kernels, which gives another viewpoint to the Fisher kernel theory. Although our approach can be applied to any object, we particularly derive several marginalized kernels useful for biological sequences (e.g., DNA and proteins). The effectiveness of marginalized kernels is illustrated in the task of classifying bacterial gyrase subunit B (gyrB) amino acid sequences. Keywords: biosvm
[Tsuda2002new]	K. Tsuda, M. Kawanabe, G. Rätsch, S. Sonnenburg, and K.-R. Müller. A new discriminative kernel from probabilistic models. Neural Computation, 14(10):2397-2414, 2002. [ bib \| DOI \| http \| .pdf ] Keywords: biosvm
[Sturn2002Genesis:]	Alexander Sturn, John Quackenbush, and Zlatko Trajanoski. Genesis: cluster analysis of microarray data. Bioinformatics, 18(1):207-8, Jan 2002. [ bib ] A versatile, platform independent and easy to use Java suite for large-scale gene expression analysis was developed. Genesis integrates various tools for microarray data analysis such as filters, normalization and visualization tools, distance measures as well as common clustering algorithms including hierarchical clustering, self-organizing maps, k-means, principal component analysis, and support vector machines. The results of the clustering are transparent across all implemented methods and enable the analysis of the outcome of different algorithms and parameters. Additionally, mapping of gene expression data onto chromosomal sequences was implemented to enhance promoter analysis and investigation of transcriptional control mechanisms. Keywords: Algorithms, Artificial Intelligence, Cluster Analysis, Comparative Study, Computational Biology, Databases, Gene Expression Profiling, Genetic, Models, Molecular Structure, Neural Networks (Computer), Non-U.S. Gov't, Oligonucleotide Array Sequence Analysis, Principal Component Analysis, Programming Languages, Promoter Regions (Genetics), Protein, Proteins, Research Support, Software, Statistical, Transcription, 11836235
[Stapley2002Predicting]	B.J. Stapley, L.A. Kelley, and M.J. Sternberg. Predicting the sub-cellular location of proteins from text using support vector machines. In Russ B. Altman, A. Keith Dunker, Lawrence Hunter, Kevin Lauderdale, and Teri E. Klein, editors, Proceedings of the Pacific Symposium on Biocomputing 2002, pages 374-385. World Scientific, 2002. [ bib \| .pdf \| .pdf ] We present an automatic method to classify the sub-cellular location of proteins based on the text of relevant medline abstracts. For each protein, a vector of terms is generated from medline abstracts in which the protein/gene's name or synonym occurs. A Support Vector Machine (SVM) is used to automatically partition the term space and to thus discriminate the textual features that define sub-cellular location. The method is benchmarked on a set of proteins of known sub-cellular location from S. cerevisiae. No prior knowledge of the problem domain nor any natural language processing is used at any stage. The method out-performs support vector machines trained on amino acid composition and has comparable performance to rule-based text classifiers. Combining text with protein amino-acid composition improves recall for some sub-cellular locations. We discuss the generality of the method and its potential application to a variety of biological classification problems. Keywords: biosvm
[Sonnenburg2002New]	S. Sonnenburg, G. Rätsch, A. Jagota, and K.-R. Müller. New methods for splice-site recognition. In J. R. Dorronsoro, editor, Proc. International Conference on Artificial Neural Networks - ICANN'02, number 2415 in LNCS, pages 329-336. Springer Berlin, 2002. [ bib \| .pdf ] Keywords: biosvm
[Song2002Prediction]	Minghu Song, Curt M Breneman, Jinbo Bi, N. Sukumar, Kristin P Bennett, Steven Cramer, and Nihal Tugcu. Prediction of protein retention times in anion-exchange chromatography systems using support vector regression. J Chem Inf Comput Sci, 42(6):1347-57, 2002. [ bib ] Quantitative Structure-Retention Relationship (QSRR) models are developed for the prediction of protein retention times in anion-exchange chromatography systems. Topological, subdivided surface area, and TAE (Transferable Atom Equivalent) electron-density-based descriptors are computed directly for a set of proteins using molecular connectivity patterns and crystal structure geometries. A novel algorithm based on Support Vector Machine (SVM) regression has been employed to obtain predictive QSRR models using a two-step computational strategy. In the first step, a sparse linear SVM was utilized as a feature selection procedure to remove irrelevant or redundant information. Subsequently, the selected features were used to produce an ensemble of nonlinear SVM regression models that were combined using bootstrap aggregation (bagging) techniques, where various combinations of training and validation data sets were selected from the pool of available data. A visualization scheme (star plots) was used to display the relative importance of each selected descriptor in the final set of "bagged" models. Once these predictive models have been validated, they can be used as an automated prediction tool for virtual high-throughput screening (VHTS). 
Keywords: Acute, Algorithms, Animals, Anion Exchange Resins, Artificial Intelligence, Automated, Base Pair Mismatch, Base Pairing, Base Sequence, Biological, Biosensing Techniques, Carcinoma, Chemical, Chromatography, Classification, Cluster Analysis, Comparative Study, Computational Biology, Computer-Assisted, Cystadenoma, DNA, Decision Making, Diagnosis, Differential, Drug, Drug Design, Electrostatics, Eukaryotic Cells, Feasibility Studies, Female, Gene Expression, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Hemolysins, Humans, Internet, Ion Exchange, Leukemia, Ligands, Likelihood Functions, Logistic Models, Lung Neoplasms, Lymphocytic, Lymphoma, Markov Chains, Mathematics, Messenger, Models, Molecular, Molecular Probe Techniques, Molecular Sequence Data, Nanotechnology, Neoplasm, Neoplasms, Neoplastic, Neural Networks (Computer), Non-P.H.S., Non-Small-Cell Lung, Non-U.S. Gov't, Nucleic Acid Conformation, Nucleic Acid Hybridization, Observer Variation, Oligonucleotide Array Sequence Analysis, Ovarian Neoplasms, P.H.S., Pattern Recognition, Probability, Protein Binding, Protein Conformation, Proteins, Quality Control, Quantum Theory, RNA, RNA Splicing, Receptors, Reference Values, Regression Analysis, Reproducibility of Results, Research Support, Sensitivity and Specificity, Sequence Analysis, Signal Processing, Software, Statistical, Stomach Neoplasms, Thermodynamics, Transcription, Tumor Markers, U.S. Gov't, 12444731
[Shipp2002Diffuse]	M. A. Shipp, K. N. Ross, P. Tamayo, A. P. Weng, J. L. Kutok, R. C. T. Aguiar, M. Gaasenbeek, M. Angelo, M. Reich, G. A. Pinkus, T. S. Ray, M. A. Koval, K. W. Last, A. Norton, T. A. Lister, J. Mesirov, D. S. Neuberg, E. S. Lander, J. C. Aster, and T. R. Golub. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat. Med., 8(1):68-74, 2002. [ bib \| DOI \| .pdf ] Diffuse large B-cell lymphoma (DLBCL), the most common lymphoid malignancy in adults, is curable in less than 50 models based on pre-treatment characteristics, such as the International Prognostic Index (IPI), are currently used to predict outcome in DLBCL. However, clinical outcome models identify neither the molecular basis of clinical heterogeneity, nor specific therapeutic targets. We analyzed the expression of 6,817 genes in diagnostic tumor specimens from DLBCL patients who received cyclophosphamide, adriamycin, vincristine and prednisone (CHOP)-based chemotherapy, and applied a supervised learning prediction method to identify cured versus fatal or refractory disease. The algorithm classified two categories of patients with very different five-year overall survival rates (70 within specific IPI risk categories who were likely to be cured or to die of their disease. Genes implicated in DLBCL outcome included some that regulate responses to B-cell-receptor signaling, critical serine/threonine phosphorylation pathways and apoptosis. Our data indicate that supervised learning classification techniques can predict outcome in DLBCL and identify rational targets for intervention. Keywords: biosvm
[Seeger2002Covariance]	M. Seeger. Covariance Kernels from Bayesian Generative Models. In Adv. Neural Inform. Process. Syst., volume 14, pages 905-912, 2002. [ bib \| www: ] Keywords: biosvm
[Schoelkopf2002Kernel]	B. Schölkopf, J. Weston, E. Eskin, C. Leslie, and W.S. Noble. A Kernel Approach for Learning from Almost Orthogonal Patterns. In Proceedings of ECML 2002, 2002. [ bib \| .pdf \| .pdf ]
[Pavlidis2002Learning]	P. Pavlidis, J. Weston, J. Cai, and W.S. Noble. Learning gene functional classifications from multiple data types. J. Comput. Biol., 9(2):401-411, 2002. [ bib \| DOI \| .pdf ] In our attempts to understand cellular function at the molecular level, we must be able to synthesize information from disparate types of genomic data. We consider the problem of inferring gene functional classifications from a heterogeneous data set consisting of DNA microarray expression measurements and phylogenetic profiles from whole-genome sequence comparisons. We demonstrate the application of the support vector machine (SVM) learning algorithm to this functional inference task. Our results suggest the importance of exploiting prior information about the heterogeneity of the data. In particular, we propose an SVM kernel function that is explicitly heterogeneous. In addition, we describe feature scaling methods for further exploiting prior knowledge of heterogeneity by giving each data type different weights. Keywords: biosvm
[Patterson2002Pre-mRNA]	D.J. Patterson, K. Yasuhara, and W.L. Ruzzo. Pre-mRNA secondary structure prediction aids splice site prediction. In Russ B. Altman, A. Keith Dunker, Lawrence Hunter, Kevin Lauderdale, and Teri E. Klein, editors, Proceedings of the Pacific Symposium on Biocomputing 2002, pages 223-234. World Scientific, 2002. [ bib \| .pdf \| .pdf ] Accurate splice site prediction is a critical component of any computational approach to gene prediction in higher organisms. Existing approaches generally use sequence-based models that capture local dependencies among nucleotides in a small window around the splice site. We present evidence that computationally predicted secondary structure of moderate length pre-mRNA subsequences contains information that can be exploited to improve acceptor splice site prediction beyond that possible with conventional sequence-based approaches. Both decision tree and support vector machine classifiers, using folding energy and structure metrics characterizing helix formation near the splice site, achieve a 5-10 a human data set. Based on our data, we hypothesize that acceptors preferentially exhibit short helices at the splice site. Keywords: biosvm
[Pastor-Satorras2002Evolving]	R. Pastor-Satorras, E. D. Smith, and R. V. Solé. Evolving protein interaction networks through gene duplication. Technical report, Santa Fe Institute, 2002. Working paper 02-02-008. [ bib \| .html \| .pdf ]
[Myasnikova2002Support]	E. Myasnikova, A. Samsonova, M. Samsonova, and J. Reinitz. Support vector regression applied to the determination of the developmental age of a Drosophila embryo from its segmentation gene expression patterns. Bioinformatics, 18(Suppl. 1):S87-S95, 2002. [ bib \| http \| .pdf ] Motivation: In this paper we address the problem of the determination of developmental age of an embryo from its segmentation gene expression patterns in Drosophila. Results: By applying support vector regression we have developed a fast method for automated staging of an embryo on the basis of its gene expression pattern. Support vector regression is a statistical method for creating regression functions of arbitrary type from a set of training data. The training set is composed of embryos for which the precise developmental age was determined by measuring the degree of membrane invagination. Testing the quality of regression on the training set showed good prediction accuracy. The optimal regression function was then used for the prediction of the gene expression based age of embryos in which the precise age has not been measured by membrane morphology. Moreover, we show that the same accuracy of prediction can be achieved when the dimensionality of the feature vector was reduced by applying factor analysis. The data reduction allowed us to avoid over-fitting and to increase the efficiency of the algorithm. Availability: This software may be obtained from the authors. Contact: samson@fn.csa.ru Keywords: gene expression patterns; development; embryo staging; support vector regression; segmentation genes; Drosophila. Keywords: biosvm
[Mateos2002Systematic]	Alvaro Mateos, Joaquín Dopazo, Ronald Jansen, Yuhai Tu, Mark Gerstein, and Gustavo Stolovitzky. Systematic learning of gene functional classes from DNA array expression data by using multilayer perceptrons. Genome Res., 12(11):1703-15, Nov 2002. [ bib \| DOI \| http \| .pdf ] Recent advances in microarray technology have opened new ways for functional annotation of previously uncharacterised genes on a genomic scale. This has been demonstrated by unsupervised clustering of co-expressed genes and, more importantly, by supervised learning algorithms. Using prior knowledge, these algorithms can assign functional annotations based on more complex expression signatures found in existing functional classes. Previously, support vector machines (SVMs) and other machine-learning methods have been applied to a limited number of functional classes for this purpose. Here we present, for the first time, the comprehensive application of supervised neural networks (SNNs) for functional annotation. Our study is novel in that we report systematic results for 100 classes in the Munich Information Center for Protein Sequences (MIPS) functional catalog. We found that only 10% of these are learnable (based on the rate of false negatives). A closer analysis reveals that false positives (and negatives) in a machine-learning context are not necessarily "false" in a biological sense. We show that the high degree of interconnections among functional classes confounds the signatures that ought to be learned for a unique class. We term this the "Borges effect" and introduce two new numerical indices for its quantification. Our analysis indicates that classification systems with a lower Borges effect are better suitable for machine learning. Furthermore, we introduce a learning procedure for combining false positives with the original class. 
We show that in a few iterations this process converges to a gene set that is learnable with considerably low rates of false positives and negatives and contains genes that are biologically related to the original class, allowing for a coarse reconstruction of the interactions between associated biological pathways. We exemplify this methodology using the well-studied tricarboxylic acid cycle. 
Keywords: Acute, Algorithms, Animals, Anion Exchange Resins, Artificial Intelligence, Automated, Base Pair Mismatch, Base Pairing, Base Sequence, Biological, Biosensing Techniques, Carcinoma, Chemical, Chromatography, Citric Acid Cycle, Classification, Cluster Analysis, Comparative Study, Computational Biology, Computer-Assisted, Cystadenoma, DNA, Databases, Decision Making, Diagnosis, Differential, Drug, Drug Design, Electrostatics, Eukaryotic Cells, Factual, Feasibility Studies, Female, Gene Expression, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Heterogeneity, Genetic Markers, Hemolysins, Humans, Internet, Ion Exchange, Leukemia, Ligands, Likelihood Functions, Logistic Models, Lung Neoplasms, Lymphocytic, Lymphoma, Markov Chains, Mathematics, Messenger, Models, Molecular, Molecular Probe Techniques, Molecular Sequence Data, Nanotechnology, Neoplasm, Neoplasms, Neoplastic, Neural Networks (Computer), Non-P.H.S., Non-Small-Cell Lung, Non-U.S. Gov't, Nucleic Acid Conformation, Nucleic Acid Hybridization, Observer Variation, Oligonucleotide Array Sequence Analysis, Ovarian Neoplasms, P.H.S., Pattern Recognition, Probability, Protein Binding, Protein Conformation, Proteins, Quality Control, Quantum Theory, RNA, RNA Splicing, Receptors, Reference Values, Regression Analysis, Reproducibility of Results, Research Support, Saccharomyces cerevisiae Proteins, Sensitivity and Specificity, Sequence Analysis, Signal Processing, Software, Statistical, Stomach Neoplasms, Structural, Structure-Activity Relationship, Thermodynamics, Transcription, Tumor Markers, U.S. Gov't, 12421757
[Maslov2002Specificity]	S. Maslov and K. Sneppen. Specificity and stability in topology of protein networks. Science, 296:910-913, 2002. [ bib \| .pdf \| .pdf ]
[Martoglio2002decomposition]	Ann-Marie Martoglio, James W Miskin, Stephen K Smith, and David J C MacKay. A decomposition model to track gene expression signatures: preview on observer-independent classification of ovarian cancer. Bioinformatics, 18(12):1617-24, Dec 2002. [ bib ] MOTIVATION: A number of algorithms and analytical models have been employed to reduce the multidimensional complexity of DNA array data and attempt to extract some meaningful interpretation of the results. These include clustering, principal components analysis, self-organizing maps, and support vector machine analysis. Each method assumes an implicit model for the data, many of which separate genes into distinct clusters defined by similar expression profiles in the samples tested. A point of concern is that many genes may be involved in a number of distinct behaviours, and should therefore be modelled to fit into as many separate clusters as detected in the multidimensional gene expression space. The analysis of gene expression data using a decomposition model that is independent of the observer involved would be highly beneficial to improve standard and reproducible classification of clinical and research samples. RESULTS: We present a variational independent component analysis (ICA) method for reducing high dimensional DNA array data to a smaller set of latent variables, each associated with a gene signature. We present the results of applying the method to data from an ovarian cancer study, revealing a number of tissue type-specific and tissue type-independent gene signatures present in varying amounts among the samples surveyed. The observer independent results of such molecular analysis of biological samples could help identify patients who would benefit from different treatment strategies. We further explore the application of the model to similar high-throughput studies. 
Keywords: Acute, Algorithms, Automated, Base Pair Mismatch, Base Pairing, Base Sequence, Biological, Biosensing Techniques, Cluster Analysis, Comparative Study, Computer-Assisted, Cystadenoma, DNA, Female, Gene Expression, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Hemolysins, Humans, Leukemia, Lymphocytic, Markov Chains, Messenger, Models, Molecular Probe Techniques, Molecular Sequence Data, Nanotechnology, Neoplasm, Neoplastic, Neural Networks (Computer), Non-U.S. Gov't, Nucleic Acid Conformation, Observer Variation, Oligonucleotide Array Sequence Analysis, Ovarian Neoplasms, Pattern Recognition, Quality Control, RNA, Reference Values, Reproducibility of Results, Research Support, Sensitivity and Specificity, Signal Processing, Statistical, Stomach Neoplasms, Transcription, Tumor Markers, 12490446
[Marsland2002self-organising]	Stephen Marsland, Jonathan Shapiro, and Ulrich Nehmzow. A self-organising network that grows when required. Neural Netw, 15(8-9):1041-58, 2002. [ bib ] The ability to grow extra nodes is a potentially useful facility for a self-organising neural network. A network that can add nodes into its map space can approximate the input space more accurately, and often more parsimoniously, than a network with predefined structure and size, such as the Self-Organising Map. In addition, a growing network can deal with dynamic input distributions. Most of the growing networks that have been proposed in the literature add new nodes to support the node that has accumulated the highest error during previous iterations or to support topological structures. This usually means that new nodes are added only when the number of iterations is an integer multiple of some pre-defined constant, A. This paper suggests a way in which the learning algorithm can add nodes whenever the network in its current state does not sufficiently match the input. In this way the network grows very quickly when new data is presented, but stops growing once the network has matched the data. This is particularly important when we consider dynamic data sets, where the distribution of inputs can change to a new regime after some time. We also demonstrate the preservation of neighbourhood relations in the data by the network. The new network is compared to an existing growing network, the Growing Neural Gas (GNG), on an artificial dataset, showing how the network deals with a change in input distribution after some time. Finally, the new network is applied to several novelty detection tasks and is compared with both the GNG and an unsupervised form of the Reduced Coulomb Energy network on a robotic inspection task and with a Support Vector Machine on two benchmark novelty detection tasks. 
Keywords: Acute, Algorithms, Animals, Anion Exchange Resins, Artificial Intelligence, Automated, Base Pair Mismatch, Base Pairing, Base Sequence, Biological, Biosensing Techniques, Carcinoma, Chemical, Chromatography, Citric Acid Cycle, Classification, Cluster Analysis, Comparative Study, Computational Biology, Computer-Assisted, Cystadenoma, DNA, Databases, Decision Making, Diagnosis, Differential, Drug, Drug Design, Electrostatics, Eukaryotic Cells, Factual, Feasibility Studies, Female, Gene Expression, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Heterogeneity, Genetic Markers, Hemolysins, Humans, Internet, Ion Exchange, Leukemia, Ligands, Likelihood Functions, Logistic Models, Lung Neoplasms, Lymphocytic, Lymphoma, Markov Chains, Mathematics, Messenger, Models, Molecular, Molecular Probe Techniques, Molecular Sequence Data, Nanotechnology, Neoplasm, Neoplasms, Neoplastic, Neural Networks (Computer), Non-P.H.S., Non-Small-Cell Lung, Non-U.S. Gov't, Nucleic Acid Conformation, Nucleic Acid Hybridization, Observer Variation, Oligonucleotide Array Sequence Analysis, Ovarian Neoplasms, P.H.S., Pattern Recognition, Probability, Probability Learning, Protein Binding, Protein Conformation, Proteins, Quality Control, Quantum Theory, RNA, RNA Splicing, Receptors, Reference Values, Regression Analysis, Reproducibility of Results, Research Support, Robotics, Saccharomyces cerevisiae Proteins, Sensitivity and Specificity, Sequence Analysis, Signal Processing, Software, Statistical, Stomach Neoplasms, Structural, Structure-Activity Relationship, Thermodynamics, Transcription, Tumor Markers, U.S. Gov't, 12416693
[Lodhi2002Text]	H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. J. Mach. Learn. Res., 2:419-444, 2002. [ bib \| .html \| .pdf ] Keywords: biosvm
[Lin2002Conserved]	K. Lin, Y. Kuang, J. S. Joseph, and P. R. Kolatkar. Conserved codon composition of ribosomal protein coding genes in Escherichia coli, Mycobacterium tuberculosis and Saccharomyces cerevisiae: lessons from supervised machine learning in functional genomics. Nucl. Acids Res., 30(11):2599-2607, 2002. [ bib \| http \| .pdf ] Genomics projects have resulted in a flood of sequence data. Functional annotation currently relies almost exclusively on inter-species sequence comparison and is restricted in cases of limited data from related species and widely divergent sequences with no known homologs. Here, we demonstrate that codon composition, a fusion of codon usage bias and amino acid composition signals, can accurately discriminate, in the absence of sequence homology information, cytoplasmic ribosomal protein genes from all other genes of known function in Saccharomyces cerevisiae, Escherichia coli and Mycobacterium tuberculosis using an implementation of support vector machines, SVMlight. Analysis of these codon composition signals is instructive in determining features that confer individuality to ribosomal protein genes. Each of the sets of positively charged, negatively charged and small hydrophobic residues, as well as codon bias, contribute to their distinctive codon composition profile. The representation of all these signals is sensitively detected, combined and augmented by the SVMs to perform an accurate classification. Of special mention is an obvious outlier, yeast gene RPL22B, highly homologous to RPL22A but employing very different codon usage, perhaps indicating a non-ribosomal function. Finally, we propose that codon composition be used in combination with other attributes in gene/protein classification by supervised machine learning algorithms. Keywords: biosvm
[Liberles2002use]	D. A. Liberles, A. Thorén, G. von Heijne, and A. Elofsson. The use of phylogenetic profiles for gene predictions. Curr. Genom., 2002. To appear. [ bib \| .pdf \| .pdf ]
[Liao2002Combining]	L. Liao and W. S. Noble. Combining pairwise sequence similarity and support vector machines for remote protein homology detection. In Proceedings of the Sixth International Conference on Computational Molecular Biology, 2002. [ bib \| .html \| .pdf ] Keywords: biosvm
[Leslie2002spectrum]	C. Leslie, E. Eskin, and W.S. Noble. The spectrum kernel: a string kernel for SVM protein classification. In Russ B. Altman, A. Keith Dunker, Lawrence Hunter, Kevin Lauderdale, and Teri E. Klein, editors, Proceedings of the Pacific Symposium on Biocomputing 2002, pages 564-575, Singapore, 2002. World Scientific. [ bib \| .pdf ] Keywords: biosvm
[Kramer2002Fragment]	S. Kramer, E. Frank, and C. Helma. Fragment generation and support vector machines for inducing SARs. SAR QSAR Environ Res, 13(5):509-23, Jul 2002. [ bib \| DOI \| http ] We present a new approach to the induction of SARs based on the generation of structural fragments and support vector machines (SVMs). It is tailored for bio-chemical databases, where the examples are two-dimensional descriptions of chemical compounds. The fragment generator finds all fragments (i.e. linearly connected atoms) that satisfy user-specified constraints regarding their frequency and generality. In this paper, we are querying for fragments within a minimum and a maximum frequency in the dataset. After fragment generation, we propose to apply SVMs to the problem of inducing SARs from these fragments. We conjecture that the SVMs are particularly useful in this context, as they can deal with a large number of features. Experiments in the domains of carcinogenicity and mutagenicity prediction show that the minimum and the maximum frequency queries for fragments can be answered within a reasonable time, and that the predictive accuracy obtained using these fragments is satisfactory. However, further experiments will have to confirm that this is a viable approach to inducing SARs. Keywords: biosvm
[Kondor2002Diffusion]	R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input. In Proceedings of the Nineteenth International Conference on Machine Learning, pages 315-322, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc. [ bib \| .pdf ] Keywords: biosvm
[Kin2002Marginalized]	T. Kin, K. Tsuda, and K. Asai. Marginalized kernels for RNA sequence data analysis. In R.H. Lathrop, K. Nakai, S. Miyano, T. Takagi, and M. Kanehisa, editors, Genome Informatics 2002, pages 112-122. Universal Academic Press, 2002. [ bib \| .html \| .pdf ] We present novel kernels that measure similarity of two RNA sequences, taking account of their secondary structures. Two types of kernels are presented. One is for RNA sequences with known secondary structures, the other for those without known secondary structures. The latter employs stochastic context-free grammar (SCFG) for estimating the secondary structure. We call the latter the marginalized count kernel (MCK). We show computational experiments for MCK using 74 sets of human tRNA sequence data: (i) kernel principal component analysis (PCA) for visualizing tRNA similarities, (ii) supervised classification with support vector machines (SVMs). Both types of experiment show promising results for MCKs. Keywords: biosvm
[Karchin2002Classifying]	R. Karchin, K. Karplus, and D. Haussler. Classifying G-protein coupled receptors with support vector machines. Bioinformatics, 18:147-159, 2002. [ bib \| http \| .pdf ] Motivation: The enormous amount of protein sequence data uncovered by genome research has increased the demand for computer software that can automate the recognition of new proteins. We discuss the relative merits of various automated methods for recognizing G-Protein Coupled Receptors (GPCRs), a superfamily of cell membrane proteins. GPCRs are found in a wide range of organisms and are central to a cellular signalling network that regulates many basic physiological processes. They are the focus of a significant amount of current pharmaceutical research because they play a key role in many diseases. However, their tertiary structures remain largely unsolved. The methods described in this paper use only primary sequence information to make their predictions. We compare a simple nearest neighbor approach (BLAST), methods based on multiple alignments generated by a statistical profile Hidden Markov Model (HMM), and methods, including Support Vector Machines (SVMs), that transform protein sequences into fixed-length feature vectors. Results: The last is the most computationally expensive method, but our experiments show that, for those interested in annotation-quality classification, the results are worth the effort. In two-fold cross-validation experiments testing recognition of GPCR subfamilies that bind a specific ligand (such as a histamine molecule), the errors per sequence at the Minimum Error Point (MEP) were 13.7% for multi-class SVMs, 17.1% for our SVMtree method of hierarchical multi-class SVM classification, 25.5% for BLAST, 30% for profile HMMs, and 49% for classification based on nearest neighbor feature vector Kernel Nearest Neighbor (kernNN). The percentage of true positives recognized before the first false positive was 65% for both SVM methods, 13% for BLAST, 5% for profile HMMs and 4% for kernNN. 
Availability: We have set up a web server for GPCR subfamily classification based on hierarchical multi-class SVMs at http://www.soe.ucsc.edu/research/compbio/gpcr-subclass. By scanning predicted peptides found in the human genome with the SVMtree server, we have identified a large number of genes that encode GPCRs. A list of our predictions for human GPCRs is available at http://www.soe.ucsc.edu/research/compbio/gpcr_hg/class_results. We also provide suggested subfamily classification for 18 sequences previously identified as unclassified Class A (rhodopsin-like) GPCRs in GPCRDB (Horn et al., Nucleic Acids Res., 26, 277-281, 1998), available at http://www.soe.ucsc.edu/research/compbio/gpcr/classA_unclassified/ Keywords: fisher-kernel sequence-classification biosvm
[Kanehisa2002KEGG]	M. Kanehisa, S. Goto, S. Kawashima, and A. Nakaya. The KEGG databases at GenomeNet. Nucleic Acids Res., 30:42-46, 2002. [ bib \| http \| .pdf ]
[Imoto2002Bayesian]	S. Imoto, K. Sunyong, T. Goto, S. Aburatani, K. Tashiro, S. Kuhara, and S. Miyano. Bayesian network and nonparametric heteroscedastic regression for nonlinear modeling of genetic network. Proc. IEEE Comput. Soc. Bioinform. Conf., 1:219-227, 2002. [ bib \| DOI \| http \| .pdf ] We propose a new statistical method for constructing genetic network from microarray gene expression data by using a Bayesian network. An essential point of Bayesian network construction is in the estimation of the conditional distribution of each random variable. We consider fitting nonparametric regression models with heterogeneous error variances to the microarray gene expression data to capture the nonlinear structures between genes. A problem still remains to be solved in selecting an optimal graph, which gives the best representation of the system among genes. We theoretically derive a new graph selection criterion from Bayes approach in general situations. The proposed method includes previous methods based on Bayesian networks. We demonstrate the effectiveness of the proposed method through the analysis of Saccharomyces cerevisiae gene expression data newly obtained by disrupting 100 genes. Keywords: biogm
[Imoto2002Estimation]	S. Imoto, T. Goto, and S. Miyano. Estimation of genetic networks and functional structures between genes by using Bayesian networks and nonparametric regression. Pac. Symp. Biocomput., pages 175-186, 2002. [ bib \| .pdf \| .pdf ] We propose a new method for constructing genetic network from gene expression data by using Bayesian networks. We use nonparametric regression for capturing nonlinear relationships between genes and derive a new criterion for choosing the network in general situations. In a theoretical sense, our proposed theory and methodology include previous methods based on Bayes approach. We applied the proposed method to the S. cerevisiae cell cycle data and showed the effectiveness of our method by comparing with previous methods. Keywords: biogm
[Hanisch2002Co-clustering]	D. Hanisch, A. Zien, R. Zimmer, and T. Lengauer. Co-clustering of biological networks and gene expression data. Bioinformatics, 2002. [ bib \| .pdf ]
[Guyon2002Gene]	I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Mach. Learn., 46(1/3):389-422, Jan 2002. [ bib \| .pdf \| .pdf ] DNA micro-arrays now permit scientists to screen thousands of genes simultaneously and determine whether those genes are active, hyperactive or silent in normal or cancerous tissue. Because these new micro-array devices generate bewildering amounts of raw data, new analytical methods must be developed to sort out whether cancer tissues have distinctive signatures of gene expression over normal tissues or other types of cancer tissues. In this paper, we address the problem of selection of a small subset of genes from broad patterns of gene expression data, recorded on DNA micro-arrays. Using available training examples from cancer and normal patients, we build a classifier suitable for genetic diagnosis, as well as drug discovery. Previous attempts to address this problem select genes with correlation techniques. We propose a new method of gene selection utilizing Support Vector Machine methods based on Recursive Feature Elimination (RFE). We demonstrate experimentally that the genes selected by our techniques yield better classification performance and are biologically relevant to cancer. In contrast with the baseline method, our method eliminates gene redundancy automatically and yields better and more compact gene subsets. In patients with leukemia our method discovered 2 genes that yield zero leave-one-out error, while 64 genes are necessary for the baseline method to get the best result (one leave-one-out error). In the colon cancer database, using only 4 genes our method is 98% accurate, while the baseline method is only 86% accurate. Keywords: biosvm
[Guermeur2002Combining]	Y. Guermeur. Combining Discriminant Models with New Multi-Class SVMs. Pattern Anal. Appl., 5(2):168-179, 2002. [ bib \| DOI \| http \| .pdf ] The idea of performing model combination, instead of model selection, has a long theoretical background in statistics. However, making use of theoretical results is ordinarily subject to the satisfaction of strong hypotheses (weak error correlation, availability of large training sets, possibility to rerun the training procedure an arbitrary number of times, etc.). In contrast, the practitioner is frequently faced with the problem of combining a given set of pre-trained classifiers, with highly correlated errors, using only a small training sample. Overfitting is then the main risk, which cannot be overcome but with a strict complexity control of the combiner selected. This suggests that SVMs should be well suited for these difficult situations. Investigating this idea, we introduce a family of multi-class SVMs and assess them as ensemble methods on a real-world problem. This task, protein secondary structure prediction, is an open problem in biocomputing for which model combination appears to be an issue of central importance. Experimental evidence highlights the gain in quality resulting from combining some of the most widely used prediction methods with our SVMs rather than with the ensemble methods traditionally used in the field. The gain increases when the outputs of the combiners are post-processed with a DP algorithm. Keywords: biosvm
[Guelzim2002Topological]	N. Guelzim, S. Bottani, P. Bourgine, and F. Képès. Topological and causal structure of the yeast transcriptional regulatory network. Nat. Genet., 31:60-63, 2002. [ bib \| .html \| .pdf ]
[Goto2002LIGAND:]	S. Goto, Y. Okuno, M. Hattori, T. Nishioka, and M. Kanehisa. LIGAND: database of chemical compounds and reactions in biological pathways. Nucleic Acids Res., 30:402-404, 2002. [ bib \| http \| .pdf ]
[Fritz2002Microarray-based]	B. Fritz, F. Schubert, G. Wrobel, C. Schwaenen, S. Wessendorf, M. Nessling, C. Korz, R. J. Rieker, K. Montgomery, R. Kucherlapati, G. Mechtersheimer, R. Eils, S. Joos, and P. Lichter. Microarray-based Copy Number and Expression Profiling in Dedifferentiated and Pleomorphic Liposarcoma. Cancer Res., 62(11):2993-2998, 2002. [ bib \| http \| .pdf ] Sixteen dedifferentiated and pleomorphic liposarcomas were analyzed by comparative genomic hybridization (CGH) to genomic microarrays (matrix-CGH), cDNA-derived microarrays for expression profiling, and by quantitative PCR. Matrix-CGH revealed copy number gains of numerous oncogenes, i.e., CCND1, MDM2, GLI, CDK4, MYB, ESR1, and AIB1, several of which correlate with a high level of transcripts from the respective gene. In addition, a number of genes were found differentially expressed in dedifferentiated and pleomorphic liposarcomas. Application of dedicated clustering algorithms revealed that both tumor subtypes are clearly separated by the genomic profiles but only with a lesser power by the expression profiles. Using a support vector machine, a subset of five clones was identified as "class discriminators." Thus, for the distinction of these types of liposarcomas, genomic profiling appears to be more advantageous than RNA expression analysis. Keywords: biosvm, cgh
[Ekins2002Towards]	S. Ekins, B. Boulanger, P. W. Swaan, and M. A. Z. Hupcey. Towards a new age of virtual ADME/TOX and multidimensional drug discovery. J Comput Aided Mol Des, 16(5-6):381-401, 2002. [ bib ] With the continual pressure to ensure follow-up molecules to billion dollar blockbuster drugs, there is a hurdle in profitability and growth for pharmaceutical companies in the next decades. With each success and failure we increasingly appreciate that a key to the success of synthesized molecules through the research and development process is the possession of drug-like properties. These properties include an adequate bioactivity as well as adequate solubility, an ability to cross critical membranes (intestinal and sometimes blood-brain barrier), reasonable metabolic stability and of course safety in humans. Dependent on the therapeutic area being investigated it might also be desirable to avoid certain enzymes or transporters to circumvent potential drug-drug interactions. It may also be important to limit the induction of these same proteins that can result in further toxicities. We have clearly moved the assessment of in vitro absorption, distribution, metabolism, excretion and toxicity (ADME/TOX) parameters much earlier in the discovery organization than a decade ago with the inclusion of higher throughput systems. We are also now faced with huge amounts of ADME/TOX data for each molecule that need interpretation and also provide a valuable resource for generating predictive computational models for future drug discovery. The present review aims to show what tools exist today for visualizing and modeling ADME/TOX data, what tools need to be developed, and how both the present and future tools are valuable for virtual filtering using ADME/TOX and bioactivity properties in parallel as a viable addition to present practices. 
Keywords: ATP-Binding Cassette Transporters, Algorithms, Animals, Biological, Biological Availability, Computer Simulation, Drug Design, Drug Evaluation, Drug Industry, Gene Expression Profiling, Humans, Models, Organic Anion Transporters, P.H.S., Pharmaceutical, Pharmaceutical Preparations, Pharmacogenetics, Pharmacokinetics, Preclinical, Proteomics, Research Support, Software, Systems Biology, Technology, Toxicity Tests, U.S. Gov't, 12489686
[Donnes2002Prediction]	P. Dönnes and A. Elofsson. Prediction of MHC class I binding peptides, using SVMHC. BMC Bioinformatics, 3(1):25, Sep 2002. [ bib \| DOI \| http \| .pdf ] Background: T-cells are key players in regulating a specific immune response. Activation of cytotoxic T-cells requires recognition of specific peptides bound to Major Histocompatibility Complex (MHC) class I molecules. MHC-peptide complexes are potential tools for diagnosis and treatment of pathogens and cancer, as well as for the development of peptide vaccines. Only one in 100 to 200 potential binders actually binds to a certain MHC molecule, therefore a good prediction method for MHC class I binding peptides can reduce the number of candidate binders that need to be synthesized and tested. Results: Here, we present a novel approach, SVMHC, based on support vector machines to predict the binding of peptides to MHC class I molecules. This method seems to perform slightly better than two profile based methods, SYFPEITHI and HLA_BIND. The implementation of SVMHC is quite simple and does not involve any manual steps, therefore as more data become available it is trivial to provide prediction for more MHC types. SVMHC currently contains prediction for 26 MHC class I types from the MHCPEP database or alternatively 6 MHC class I types from the higher quality SYFPEITHI database. The prediction models for these MHC types are implemented in a public web service available at http://www.sbc.su.se/svmhc/. Conclusions: Prediction of MHC class I binding peptides using Support Vector Machines, shows high performance and is easy to apply to a large number of MHC class I types. As more peptide data are put into MHC databases, SVMHC can easily be updated to give prediction for additional MHC class I types. We suggest that the number of binding peptides needed for SVM training is at least 20 sequences. Keywords: biosvm immunoinformatics
[Dover2002Methylation]	Jim Dover, Jessica Schneider, Mary Anne Tawiah-Boateng, Adam Wood, Kimberly Dean, Mark Johnston, and Ali Shilatifard. Methylation of histone H3 by COMPASS requires ubiquitination of histone H2B by Rad6. J Biol Chem, 277(32):28368-28371, Aug 2002. [ bib \| DOI \| http ] The DNA of eukaryotes is wrapped around nucleosomes and packaged into chromatin. Covalent modifications of the histone proteins that comprise the nucleosome alter chromatin structure and have major effects on gene expression. Methylation of lysine 4 of histone H3 by COMPASS is required for silencing of genes located near chromosome telomeres and within the rDNA (Krogan, N. J., Dover, J., Khorrami, S., Greenblatt, J. F., Schneider, J., Johnston, M., and Shilatifard, A. (2002) J. Biol. Chem. 277, 10753-10755; Briggs, S. D., Bryk, M., Strahl, B. D., Cheung, W. L., Davie, J. K., Dent, S. Y., Winston, F., and Allis, C. D. (2001) Genes. Dev. 15, 3286-3295). To learn about the mechanism of histone methylation, we surveyed the genome of the yeast Saccharomyces cerevisiae for genes necessary for this process. By analyzing approximately 4800 mutant strains, each deleted for a different non-essential gene, we discovered that the ubiquitin-conjugating enzyme Rad6 is required for methylation of lysine 4 of histone H3. Ubiquitination of histone H2B on lysine 123 is the signal for the methylation of histone H3, which leads to silencing of genes located near telomeres. Keywords: DNA, Ribosomal, metabolism; Electrophoresis, Polyacrylamide Gel; Gene Silencing; Histones, metabolism; Ligases, metabolism; Lysine, metabolism; Methylation; Models, Biological; Mutation; Saccharomyces cerevisiae Proteins; Saccharomyces cerevisiae, genetics; Ubiquitin, metabolism; Ubiquitin-Conjugating Enzymes
[Doniger2002Predicting]	S. Doniger, T. Hofmann, and J. Yeh. Predicting CNS permeability of drug molecules: comparison of neural network and support vector machine algorithms. J. Comput. Biol., 9(6):849-864, 2002. [ bib \| DOI \| .pdf ] Two different machine-learning algorithms have been used to predict the blood-brain barrier permeability of different classes of molecules, to develop a method to predict the ability of drug compounds to penetrate the CNS. The first algorithm is based on a multilayer perceptron neural network and the second algorithm uses a support vector machine. Both algorithms are trained on an identical data set consisting of 179 CNS active molecules and 145 CNS inactive molecules. The training parameters include molecular weight, lipophilicity, hydrogen bonding, and other variables that govern the ability of a molecule to diffuse through a membrane. The results show that the support vector machine outperforms the neural network. Based on over 30 different validation sets, the SVM can predict up to 96% of the molecules correctly, averaging 81.5% over 30 test sets, which comprised of equal numbers of CNS positive and negative molecules. This is quite favorable when compared with the neural network's average performance of 75.7% with the same 30 test sets. The results of the SVM algorithm are very encouraging and suggest that a classification tool like this one will prove to be a valuable prediction approach. Keywords: biosvm
[Deshpande2002Evaluation]	M. Deshpande and G. Karypis. Evaluation of Techniques for Classifying Biological Sequences. In PAKDD '02: Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, pages 417-431. Springer Verlag, 2002. [ bib \| .pdf ] In recent years we have witnessed an exponential increase in the amount of biological information, either DNA or protein sequences, that has become available in public databases. This has been followed by an increased interest in developing computational techniques to automatically classify these large volumes of sequence data into various categories corresponding to either their role in the chromosomes, their structure, and/or their function. In this paper we evaluate some of the widely-used sequence classification algorithms and develop a framework for modeling sequences in a fashion so that traditional machine learning algorithms, such as support vector machines, can be applied easily. Our detailed experimental evaluation shows that the SVM-based approaches are able to achieve higher classification accuracy compared to the more traditional sequence classification algorithms such as Markov model based techniques and K-nearest neighbor based approaches. Keywords: biosvm
[Degroeve2002Feature]	S. Degroeve, B. De Baets, Y. Van de Peer, and P. Rouze. Feature subset selection for splice site prediction. Bioinformatics, 18(Suppl. 1):S75-S83, 2002. [ bib \| http \| .pdf ] Motivation: The large amount of available annotated Arabidopsis thaliana sequences allows the induction of splice site prediction models with supervised learning algorithms (see Haussler (1998) for a review and references). These algorithms need information sources or features from which the models can be computed. For splice site prediction, the features we consider in this study are the presence or absence of certain nucleotides in close proximity to the splice site. Since it is not known how many and which nucleotides are relevant for splice site prediction, the set of features is chosen large enough such that the probability that all relevant information sources are in the set is very high. Using only those features that are relevant for constructing a splice site prediction system might improve the system and might also provide us with useful biological knowledge. Using fewer features will of course also improve the prediction speed of the system. Results: A wrapper-based feature subset selection algorithm using a support vector machine or a naive Bayes prediction method was evaluated against the traditional method for selecting features relevant for splice site prediction. Our results show that this wrapper approach selects features that improve the performance against the use of all features and against the use of the features selected by the traditional method. Availability: The data and additional interactive graphs on the selected feature subsets are available at http://www.psb.rug.ac.be/gps Contact: svgro@gengenp.rug.ac.be yvdp@gengenp.rug.ac.be Keywords: biosvm
[Doennes2002Prediction]	Pierre Dönnes and Arne Elofsson. Prediction of MHC class I binding peptides, using SVMHC. BMC Bioinformatics, 3:25, Sep 2002. [ bib ] BACKGROUND: T-cells are key players in regulating a specific immune response. Activation of cytotoxic T-cells requires recognition of specific peptides bound to Major Histocompatibility Complex (MHC) class I molecules. MHC-peptide complexes are potential tools for diagnosis and treatment of pathogens and cancer, as well as for the development of peptide vaccines. Only one in 100 to 200 potential binders actually binds to a certain MHC molecule, therefore a good prediction method for MHC class I binding peptides can reduce the number of candidate binders that need to be synthesized and tested. RESULTS: Here, we present a novel approach, SVMHC, based on support vector machines to predict the binding of peptides to MHC class I molecules. This method seems to perform slightly better than two profile based methods, SYFPEITHI and HLA_BIND. The implementation of SVMHC is quite simple and does not involve any manual steps, therefore as more data become available it is trivial to provide prediction for more MHC types. SVMHC currently contains prediction for 26 MHC class I types from the MHCPEP database or alternatively 6 MHC class I types from the higher quality SYFPEITHI database. The prediction models for these MHC types are implemented in a public web service available at http://www.sbc.su.se/svmhc/. CONCLUSIONS: Prediction of MHC class I binding peptides using Support Vector Machines, shows high performance and is easy to apply to a large number of MHC class I types. As more peptide data are put into MHC databases, SVMHC can easily be updated to give prediction for additional MHC class I types. We suggest that the number of binding peptides needed for SVM training is at least 20 sequences. 
Keywords: Animals; Artificial Intelligence; Comparative Study; Computational Biology; Databases, Protein; Epitopes, T-Lymphocyte; HLA Antigens; Histocompatibility Antigens Class I; Humans; Peptides; Predictive Value of Tests; Protein Binding; Research Support, Non-U.S. Gov't; Sensitivity and Specificity
[Churchill2002Fundamentals]	G. A. Churchill. Fundamentals of experimental design for cDNA microarrays. Nat. Genet., 32 Suppl:490-495, Dec 2002. [ bib \| DOI \| http ] Microarray technology is now widely available and is being applied to address increasingly complex scientific questions. Consequently, there is a greater demand for statistical assessment of the conclusions drawn from microarray experiments. This review discusses fundamental issues of how to design an experiment to ensure that the resulting data are amenable to statistical analysis. The discussion focuses on two-color spotted cDNA microarrays, but many of the same issues apply to single-color gene-expression assays as well. Keywords: Animals; DNA, Complementary, analysis; Gene Expression; Gene Expression Profiling, methods; Mice; Models, Biological; Oligonucleotide Array Sequence Analysis, methods; Reference Standards; Reproducibility of Results; Research Design; Statistics as Topic
[Chou2002Using]	K.-C. Chou and Y.-D. Cai. Using Functional Domain Composition and Support Vector Machines for Prediction of Protein Subcellular Location. J. Biol. Chem., 277(48):45765-45769, 2002. [ bib \| http \| .pdf ] Proteins are generally classified into the following 12 subcellular locations: 1) chloroplast, 2) cytoplasm, 3) cytoskeleton, 4) endoplasmic reticulum, 5) extracellular, 6) Golgi apparatus, 7) lysosome, 8) mitochondria, 9) nucleus, 10) peroxisome, 11) plasma membrane, and 12) vacuole. Because the function of a protein is closely correlated with its subcellular location, with the rapid increase in new protein sequences entering into databanks, it is vitally important for both basic research and pharmaceutical industry to establish a high throughput tool for predicting protein subcellular location. In this paper, a new concept, the so-called "functional domain composition" is introduced. Based on the novel concept, the representation for a protein can be defined as a vector in a high-dimensional space, where each of the clustered functional domains derived from the protein universe serves as a vector base. With such a novel representation for a protein, the support vector machine (SVM) algorithm is introduced for predicting protein subcellular location. High success rates are obtained by the self-consistency test, jackknife test, and independent dataset test, respectively. The current approach not only can play an important complementary role to the powerful covariant discriminant algorithm based on the pseudo amino acid composition representation (Chou, K. C. (2001) Proteins Struct. Funct. Genet. 43, 246-255; Correction (2001) Proteins Struct. Funct. Genet. 44, 60), but also may greatly stimulate the development of this area. Keywords: biosvm
[Chan2002Comparison]	Kwokleung Chan, Te-Won Lee, Pamela A Sample, Michael H Goldbaum, Robert N Weinreb, and Terrence J Sejnowski. Comparison of machine learning and traditional classifiers in glaucoma diagnosis. IEEE Trans Biomed Eng, 49(9):963-74, Sep 2002. [ bib \| DOI \| http \| .pdf ] Glaucoma is a progressive optic neuropathy with characteristic structural changes in the optic nerve head reflected in the visual field. The visual-field sensitivity test is commonly used in a clinical setting to evaluate glaucoma. Standard automated perimetry (SAP) is a common computerized visual-field test whose output is amenable to machine learning. We compared the performance of a number of machine learning algorithms with STATPAC indexes mean deviation, pattern standard deviation, and corrected pattern standard deviation. The machine learning algorithms studied included multilayer perceptron (MLP), support vector machine (SVM), and linear (LDA) and quadratic discriminant analysis (QDA), Parzen window, mixture of Gaussian (MOG), and mixture of generalized Gaussian (MGG). MLP and SVM are classifiers that work directly on the decision boundary and fall under the discriminative paradigm. Generative classifiers, which first model the data probability density and then perform classification via Bayes' rule, usually give deeper insight into the structure of the data space. We have applied MOG, MGG, LDA, QDA, and Parzen window to the classification of glaucoma from SAP. Performance of the various classifiers was compared by the areas under their receiver operating characteristic curves and by sensitivities (true-positive rates) at chosen specificities (true-negative rates). The machine-learning-type classifiers showed improved performance over the best indexes from STATPAC. Forward-selection and backward-elimination methodology further improved the classification rate and also has the potential to reduce testing time by diminishing the number of visual-field location measurements. 
Keywords: Acute, Algorithms, Animals, Anion Exchange Resins, Artificial Intelligence, Automated, Base Pair Mismatch, Base Pairing, Base Sequence, Biological, Biosensing Techniques, Carcinoma, Chemical, Chromatography, Citric Acid Cycle, Classification, Cluster Analysis, Comparative Study, Computational Biology, Computer-Assisted, Cystadenoma, DNA, Databases, Decision Making, Diagnosis, Differential, Discriminant Analysis, Drug, Drug Design, Electrostatics, Epitopes, Eukaryotic Cells, Factual, False Negative Reactions, False Positive Reactions, Feasibility Studies, Female, Gene Expression, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Heterogeneity, Genetic Markers, Glaucoma, HLA Antigens, Hemolysins, Histocompatibility Antigens Class I, Humans, Internet, Intraocular Pressure, Ion Exchange, Lasers, Leukemia, Ligands, Likelihood Functions, Logistic Models, Lung Neoplasms, Lymphocytic, Lymphoma, Markov Chains, Mathematics, Messenger, Models, Molecular, Molecular Probe Techniques, Molecular Sequence Data, Nanotechnology, Neoplasm, Neoplasms, Neoplastic, Neural Networks (Computer), Neurological, Non-P.H.S., Non-Small-Cell Lung, Non-U.S. Gov't, Nucleic Acid Conformation, Nucleic Acid Hybridization, Observer Variation, Oligonucleotide Array Sequence Analysis, Open-Angle, Ophthalmoscopy, Optic Disk, Optic Nerve Diseases, Ovarian Neoplasms, P.H.S., Pattern Recognition, Peptides, Perimetry, Predictive Value of Tests, Probability, Probability Learning, Protein, Protein Binding, Protein Conformation, Proteins, Quality Control, Quantum Theory, RNA, RNA Splicing, ROC Curve, Receptors, Reference Values, Regression Analysis, Reproducibility of Results, Research Support, Robotics, Saccharomyces cerevisiae Proteins, Sensitivity and Specificity, Sequence Analysis, Signal Processing, Software, Statistical, Stomach Neoplasms, Structural, Structure-Activity Relationship, T-Lymphocyte, Thermodynamics, Transcription, Tumor Markers, U.S. Gov't, 12214886
[Cai2002Prediction]	Y.-D. Cai, X.-J. Liu, X.-B. Xu, and G.-P. Zhou. Prediction of protein structural classes by support vector machines. Comput. Chem., 26(3):293-296, 2002. [ bib \| DOI \| http \| .pdf ] In this paper, we apply a new machine learning method which is called support vector machine to approach the prediction of protein structural class. The support vector machine method is performed based on the database derived from SCOP which is based upon domains of known structure and the evolutionary relationships and the principles that govern their 3D structure. As a result, high rates of both self-consistency and jackknife test are obtained. This indicates that the structural class of a protein is considerably correlated with its amino acid composition, and the support vector machine can be referred as a powerful computational tool for predicting the structural classes of proteins. Keywords: biosvm
[Cai2002Support]	Y.-D. Cai, X.-J. Liu, X.-B. Xu, and K.-C. Chou. Support vector machines for prediction of protein subcellular location by incorporating quasi-sequence-order effect. J. Cell. Biochem., 84(2):343-348, 2002. [ bib \| DOI \| http \| .pdf ] Support Vector Machine (SVM), which is one class of learning machines, was applied to predict the subcellular location of proteins by incorporating the quasi-sequence-order effect (Chou [2000] Biochem. Biophys. Res. Commun. 278:477-483). In this study, the proteins are classified into the following 12 groups: (1) chloroplast, (2) cytoplasm, (3) cytoskeleton, (4) endoplasmic reticulum, (5) extracellular, (6) Golgi apparatus, (7) lysosome, (8) mitochondria, (9) nucleus, (10) peroxisome, (11) plasma membrane, and (12) vacuole, which account for most organelles and subcellular compartments in an animal or plant cell. Examinations for self-consistency and jackknife testing of the SVMs method were conducted for three sets consisting of 1,911, 2,044, and 2,191 proteins. The correct rates for self-consistency and the jackknife test values achieved with these protein sets were 94 and 83 89 and 75 for correct prediction rates were undertaken with three independent testing datasets containing 2,148 proteins, 2,417 proteins, and 2,494 proteins producing values of 84, 77, and 74%, respectively. Keywords: biosvm
[Cai2002Supportc]	Y.D. Cai, X.J. Liu, X.B. Xu, and K.C. Chou. Support vector machines for the classification and prediction of beta-turn types. J. Pept. Sci., 8(7):297-301, 2002. [ bib \| DOI \| http \| www: ] The support vector machines (SVMs) method is proposed because it can reflect the sequence-coupling effect for a tetrapeptide in not only a beta-turn or non-beta-turn, but also in different types of beta-turn. The results of the model for 6022 tetrapeptides indicate that the rates of self-consistency for beta-turn types I, I', II, II', VI and VIII and non-beta-turns are 99.92 98.02 training data, the rate of correct prediction by the SVMs for a given protein: rubredoxin (54 residues. 51 tetrapeptides) which includes 12 beta-turn type I tetrapeptides, 1 beta-turn type II tetrapeptide and 38 non-beta-turns reached 82.4 of the SVMs implies that the formation of different beta-turn types or non-beta-turns is considerably correlated with the sequence of a tetrapeptide. The SVMs can save CPU time and avoid the overfitting problem compared with the neural network method. Keywords: biosvm
[Cai2002Supportb]	Y.D. Cai, X.J. Liu, X.B. Xu, and K.C. Chou. Support vector machines for predicting the specificity of GalNAc-transferase. Peptides, 23:205-208, 2002. [ bib \| DOI \| http \| .pdf ] Support Vector Machines (SVMs) which is one kind of learning machines, was applied to predict the specificity of GalNAc-transferase. The examination for the self-consistency and the jackknife test of the SVMs method were tested for the training dataset (305 oligopeptides), the correct rate of self-consistency and jackknife test reaches 100 and 84.9 testing dataset (30 oligopeptides) was tested, the rate reaches 76.67%. Keywords: biosvm
[Cai2002Supporta]	Y.D. Cai, X.J. Liu, X.B. Xu, and K.C. Chou. Support Vector Machines for predicting HIV protease cleavage sites in protein. J. Comput. Chem., 23(2):267-274, 2002. [ bib \| DOI \| http \| www: ] Knowledge of the polyprotein cleavage sites by HIV protease will refine our understanding of its specificity, and the information thus acquired is useful for designing specific and efficient HIV protease inhibitors. The pace in searching for the proper inhibitors of HIV protease will be greatly expedited if one can find an accurate, robust, and rapid method for predicting the cleavage sites in proteins by HIV protease. In this article, a Support Vector Machine is applied to predict the cleavability of oligopeptides by proteases with multiple and extended specificity subsites. We selected HIV-1 protease as the subject of the study. Two hundred ninety-nine oligopeptides were chosen for the training set, while the other 63 oligopeptides were taken as a test set. Because of its high rate of self-consistency (299/299 = 100%) and correct prediction rate (55/63 = 87%), the Support Vector Machine method can be referred to as a useful assistant technique for finding effective inhibitors of HIV protease, which is one of the targets in designing potential drugs against AIDS. The principle of the Support Vector Machine method can also be applied to analyzing the specificity of other multisubsite enzymes. Keywords: biosvm
[Brusic2002Prediction]	V. Brusic, N. Petrovsky, G. Zhang, and V. B. Bajic. Prediction of promiscuous peptides that bind HLA class I molecules. Immunol. Cell Biol., 80(3):280-285, Jun 2002. [ bib ] Promiscuous T-cell epitopes make ideal targets for vaccine development. We report here a computational system, MULTIPRED, for the prediction of peptide binding to the HLA-A2 supertype. It combines a novel representation of peptide/MHC interactions with a hidden Markov model as the prediction algorithm. MULTIPRED is both sensitive and specific, and demonstrates high accuracy of peptide-binding predictions for HLA-A*0201, *0204, and *0205 alleles, good accuracy for *0206 allele, and marginal accuracy for *0203 allele. MULTIPRED replaces earlier requirements for individual prediction models for each HLA allelic variant and simplifies computational aspects of peptide-binding prediction. Preliminary testing indicates that MULTIPRED can predict peptide binding to HLA-A2 supertype molecules with high accuracy, including those allelic variants for which no experimental binding data are currently available. Keywords: Algorithms, Amino Acid Motifs, Amino Acid Sequence, Antigen-Antibody Complex, Automated, Binding Sites, Computational Biology, Drug Delivery Systems, Drug Design, Epitopes, Forecasting, Genes, HLA Antigens, HLA-A Antigens, HLA-A2 Antigen, HLA-DR Antigens, Humans, Internet, MHC Class I, Markov Chains, Molecular Sequence Data, Neural Networks (Computer), Pattern Recognition, Peptide Fragments, Peptides, Protein, Protein Binding, Protein Interaction Mapping, Sensitivity and Specificity, Sequence Analysis, Software, T-Lymphocyte, User-Computer Interface, Viral Vaccines, 12067415
[Briggs2002Gene]	Scott D Briggs, Tiaojiang Xiao, Zu-Wen Sun, Jennifer A Caldwell, Jeffrey Shabanowitz, Donald F Hunt, C. David Allis, and Brian D Strahl. Gene silencing: trans-histone regulatory pathway in chromatin. Nature, 418(6897):498, Aug 2002. [ bib \| DOI \| http ] The fundamental unit of eukaryotic chromatin, the nucleosome, consists of genomic DNA wrapped around the conserved histone proteins H3, H2B, H2A and H4, all of which are variously modified at their amino- and carboxy-terminal tails to influence the dynamics of chromatin structure and function - for example, conjugation of histone H2B with ubiquitin controls the outcome of methylation at a specific lysine residue (Lys 4) on histone H3, which regulates gene silencing in the yeast Saccharomyces cerevisiae. Here we show that ubiquitination of H2B is also necessary for the methylation of Lys 79 in H3, the only modification known to occur away from the histone tails, but that not all methylated lysines in H3 are regulated by this 'trans-histone' pathway because the methylation of Lys 36 in H3 is unaffected. Given that gene silencing is regulated by the methylation of Lys 4 and Lys 79 in histone H3, we suggest that H2B ubiquitination acts as a master switch that controls the site-selective histone methylation patterns responsible for this silencing. Keywords: Chromatin, chemistry/metabolism; Gene Expression Regulation, Fungal; Gene Silencing; Histone-Lysine N-Methyltransferase; Histones, chemistry/metabolism; Ligases, metabolism; Methylation; Models, Biological; Nuclear Proteins, metabolism; Saccharomyces cerevisiae Proteins; Saccharomyces cerevisiae, genetics/metabolism; Ubiquitin, metabolism; Ubiquitin-Conjugating Enzymes
[Bowd2002Comparing]	Christopher Bowd, Kwokleung Chan, Linda M Zangwill, Michael H Goldbaum, Te-Won Lee, Terrence J Sejnowski, and Robert N Weinreb. Comparing neural networks and linear discriminant functions for glaucoma detection using confocal scanning laser ophthalmoscopy of the optic disc. Invest Ophthalmol Vis Sci, 43(11):3444-54, Nov 2002. [ bib \| http \| .pdf ] PURPOSE: To determine whether neural network techniques can improve differentiation between glaucomatous and nonglaucomatous eyes, using the optic disc topography parameters of the Heidelberg Retina Tomograph (HRT; Heidelberg Engineering, Heidelberg, Germany). METHODS: With the HRT, one eye was imaged from each of 108 patients with glaucoma (defined as having repeatable visual field defects with standard automated perimetry) and 189 subjects without glaucoma (no visual field defects with healthy-appearing optic disc and retinal nerve fiber layer on clinical examination) and the optic nerve topography was defined by 17 global and 66 regional HRT parameters. With all the HRT parameters used as input, receiver operating characteristic (ROC) curves were generated for the classification of eyes, by three neural network techniques: linear and Gaussian support vector machines (SVM linear and SVM Gaussian, respectively) and a multilayer perceptron (MLP), as well as four previously proposed linear discriminant functions (LDFs) and one LDF developed on the current data with all HRT parameters used as input. RESULTS: The areas under the ROC curves for SVM linear and SVM Gaussian were 0.938 and 0.945, respectively; for MLP, 0.941; for the current LDF, 0.906; and for the best previously proposed LDF, 0.890. With the use of forward selection and backward elimination optimization techniques, the areas under the ROC curves for SVM Gaussian and the current LDF were increased to approximately 0.96. 
CONCLUSIONS: Trained neural networks, with global and regional HRT parameters used as input, improve on previously proposed HRT parameter-based LDFs for discriminating between glaucomatous and nonglaucomatous eyes. The performance of both neural networks and LDFs can be improved with optimization of the features in the input. Neural network analyses show promise for increasing diagnostic accuracy of tests for glaucoma. 
Keywords: Acute, Algorithms, Animals, Anion Exchange Resins, Artificial Intelligence, Automated, Base Pair Mismatch, Base Pairing, Base Sequence, Biological, Biosensing Techniques, Carcinoma, Chemical, Chromatography, Citric Acid Cycle, Classification, Cluster Analysis, Comparative Study, Computational Biology, Computer-Assisted, Cystadenoma, DNA, Databases, Decision Making, Diagnosis, Differential, Discriminant Analysis, Drug, Drug Design, Electrostatics, Eukaryotic Cells, Factual, Feasibility Studies, Female, Gene Expression, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Heterogeneity, Genetic Markers, Glaucoma, Hemolysins, Humans, Internet, Intraocular Pressure, Ion Exchange, Lasers, Leukemia, Ligands, Likelihood Functions, Logistic Models, Lung Neoplasms, Lymphocytic, Lymphoma, Markov Chains, Mathematics, Messenger, Models, Molecular, Molecular Probe Techniques, Molecular Sequence Data, Nanotechnology, Neoplasm, Neoplasms, Neoplastic, Neural Networks (Computer), Non-P.H.S., Non-Small-Cell Lung, Non-U.S. Gov't, Nucleic Acid Conformation, Nucleic Acid Hybridization, Observer Variation, Oligonucleotide Array Sequence Analysis, Open-Angle, Ophthalmoscopy, Optic Disk, Ovarian Neoplasms, P.H.S., Pattern Recognition, Probability, Probability Learning, Protein Binding, Protein Conformation, Proteins, Quality Control, Quantum Theory, RNA, RNA Splicing, ROC Curve, Receptors, Reference Values, Regression Analysis, Reproducibility of Results, Research Support, Robotics, Saccharomyces cerevisiae Proteins, Sensitivity and Specificity, Sequence Analysis, Signal Processing, Software, Statistical, Stomach Neoplasms, Structural, Structure-Activity Relationship, Thermodynamics, Transcription, Tumor Markers, U.S. Gov't, 12407155
[Boobis2002In]	A. Boobis, U. Gundert-Remy, P. Kremers, P. Macheras, and O. Pelkonen. In silico prediction of ADME and pharmacokinetics. Report of an expert meeting organised by COST B15. Eur. J. Pharm. Sci., 17(4-5):183-193, Dec 2002. [ bib ] The computational approach is one of the newest and fastest developing techniques in pharmacokinetics, ADME (absorption, distribution, metabolism, excretion) evaluation, drug discovery and toxicity. However, to date, the software packages devoted to ADME prediction, especially of metabolism, have not yet been adequately validated and still require improvements to be effective. Most are 'open' systems, under constant evolution and able to incorporate rapidly, and often easily, new information from user or developer databases. Quantitative in silico predictions are now possible for several pharmacokinetic (PK) parameters, particularly absorption and distribution. The emerging consensus is that the predictions are no worse than those made using in vitro tests, with the decisive advantage that much less investment in technology, resources and time is needed. In addition, and of critical importance, it is possible to screen virtual compounds. Some packages are able to handle thousands of molecules in a few hours. However, common experience shows that, in part at least for essentially irrational reasons, there is currently a lack of confidence in these approaches. An effort should be made by the software producers towards more transparency, in order to improve the confidence of their consumers. It seems highly probable that in silico approaches will evolve rapidly, as did in vitro methods during the last decade. Past experience with the latter should be helpful in avoiding repetition of similar errors and in taking the necessary steps to ensure effective implementation. A general concern is the lack of access to the large amounts of data on compounds no longer in development, but still kept secret by the pharmaceutical industry. 
Controlled access to these data could be particularly helpful in validating new in silico approaches. Keywords: Adsorption, Biological Availability, Chemical, Computer Simulation, Models, Pharmaceutical, Pharmaceutical Preparations, Predictive Value of Tests, Software, Technology, 12453607
[Bock2002New]	J. R. Bock and D. A. Gough. A New Method to Estimate Ligand-Receptor Energetics. Mol Cell Proteomics, 1(11):904-910, 2002. [ bib \| http \| .pdf ] In the discovery of new drugs, lead identification and optimization have assumed critical importance given the number of drug targets generated from genetic, genomics, and proteomic technologies. High-throughput experimental screening assays have been complemented recently by "virtual screening" approaches to identify and filter potential ligands when the characteristics of a target receptor structure of interest are known. Virtual screening mandates a reliable procedure for automatic ranking of structurally distinct ligands in compound library databases. Computing a rank score requires the accurate prediction of binding affinities between these ligands and the target. Many current scoring strategies require information about the target three-dimensional structure. In this study, a new method to estimate the free binding energy between a ligand and receptor is proposed. We extend a central idea previously reported (Bock, J. R., and Gough, D. A. (2001) Predicting protein-protein interactions from primary structure. Bioinformatics 17, 455-460; Bock, J. R., and Gough, D. A. (2002) Whole-proteome interaction mining. Bioinformatics, in press) that uses simple descriptors to represent biomolecules as input examples to train a support vector machine (Smola, A. J., and Scholkopf, B. (1998) A Tutorial on Support Vector Regression, NeuroCOLT Technical Report NC-TR-98-030, Royal Holloway College, University of London, UK) and the application of the trained system to previously unseen pairs, estimating their propensity for interaction. Here we seek to learn the function that maps features of a receptor-ligand pair onto their equilibrium free binding energy. These features do not comprise any direct information about the three-dimensional structures of ligand or target. 
In cross-validation experiments, it is demonstrated that objective measurements of prediction error rate and rank-ordering statistics are competitive with those of several other investigations, most of which depend on three-dimensional structural data. The size of the sample (n = 2,671) indicates that this approach is robust and may have widespread applicability beyond restricted families of receptor types. It is concluded that newly sequenced proteins, or those for which three-dimensional crystal structures are not easily obtained, can be rapidly analyzed for their binding potential against a library of ligands using this methodology. Keywords: biosvm
[Bao2002Identifying]	L. Bao and Z. Sun. Identifying genes related to drug anticancer mechanisms using support vector machine. FEBS Lett., 521:109-114, 2002. [ bib \| .html \| .pdf ] In an effort to identify genes related to the cell line chemosensitivity and to evaluate the functional relationships between genes and anticancer drugs acting by the same mechanism, a supervised machine learning approach called support vector machine was used to label genes into any of the five predefined anticancer drug mechanistic categories. Among dozens of unequivocally categorized genes, many were known to be causally related to the drug mechanisms. For example, a few genes were found to be involved in the biological process triggered by the drugs (e.g. DNA polymerase epsilon was the direct target for the drugs from DNA antimetabolites category). DNA repair-related genes were found to be enriched for about eight-fold in the resulting gene set relative to the entire gene set. Some uncharacterized transcripts might be of interest in future studies. This method of correlating the drugs and genes provides a strategy for finding novel biologically significant relationships for molecular pharmacology. Keywords: biosvm microarray
[Ambroise2002Selection]	C. Ambroise and G.J. McLachlan. Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl. Acad. Sci. USA, 99(10):6562-6566, 2002. [ bib \| http \| .pdf ] In the context of cancer diagnosis and treatment, we consider the problem of constructing an accurate prediction rule on the basis of a relatively small number of tumor tissue samples of known type containing the expression data on very many (possibly thousands) genes. Recently, results have been presented in the literature suggesting that it is possible to construct a prediction rule from only a few genes such that it has a negligible prediction error rate. However, in these results the test error or the leave-one-out cross-validated error is calculated without allowance for the selection bias. There is no allowance because the rule is either tested on tissue samples that were used in the first instance to select the genes being used in the rule or because the cross-validation of the rule is not external to the selection process; that is, gene selection is not performed in training the rule at each stage of the cross-validation process. We describe how in practice the selection bias can be assessed and corrected for by either performing a cross-validation or applying the bootstrap external to the selection process. We recommend using 10-fold rather than leave-one-out cross-validation, and concerning the bootstrap, we suggest using the so-called .632+ bootstrap error estimate designed to handle overfitted prediction rules. Using two published data sets, we demonstrate that when correction is made for the selection bias, the cross-validated error is no longer zero for a subset of only a few genes. Keywords: featureselection biosvm
[Aliferis2002Machine]	C.F. Aliferis, D.P. Hardin, and P. Massion. Machine Learning Models For Lung Cancer Classification Using Array Comparative Genomic Hybridization. In Proceedings of the 2002 American Medical Informatics Association (AMIA) Annual Symposium, pages 7-11, 2002. [ bib \| .pdf ] Array CGH is a recently introduced technology that measures changes in the gene copy number of hundreds of genes in a single experiment. The primary goal of this study was to develop machine learning models that classify non-small Lung Cancers according to histopathology types and to compare several machine learning methods in this learning task. DNA from tumors of 37 patients (21 squamous carcinomas, and 16 adenocarcinomas) were extracted and hybridized onto a 452 BAC clone array. The following algorithms were used: KNN, Decision Tree Induction, Support Vector Machines and Feed-Forward Neural Networks. Performance was measured via leave-one-out classification accuracy. The best multi-gene model found had a leave-one-out accuracy of 89.2%. Decision Trees performed poorer than the other methods in this learning task and dataset. We conclude that gene copy numbers as measured by array CGH are, collectively, an excellent indicator of histological subtype. Several interesting research directions are discussed. Keywords: biosvm microarray, cgh
[Zhu2003Introduction]	Lingyun Zhu, Baoming Wu, and Changxiu Cao. Introduction to medical data mining. Sheng Wu Yi Xue Gong Cheng Xue Za Zhi, 20(3):559-62, Sep 2003. [ bib ] Modern medicine generates a great deal of information stored in the medical database. Extracting useful knowledge and providing scientific decision-making for the diagnosis and treatment of disease from the database increasingly becomes necessary. Data mining in medicine can deal with this problem. It can also improve the management level of hospital information and promote the development of telemedicine and community medicine. Because the medical information is characteristic of redundancy, multi-attribution, incompletion and closely related with time, medical data mining differs from other one. In this paper we have discussed the key techniques of medical data mining involving pretreatment of medical data, fusion of different pattern and resource, fast and robust mining algorithms and reliability of mining results. The methods and applications of medical data mining based on computation intelligence such as artificial neural network, fuzzy system, evolutionary algorithms, rough set, and support vector machine have been introduced. The features and problems in data mining are summarized in the last section. Keywords: Algorithms, Anion Exchange Resins, Automatic Data Processing, Chemical, Chromatography, Computational Biology, Computer-Assisted, Data Interpretation, Databases, Decision Making, Decision Trees, English Abstract, Factual, Fuzzy Logic, Humans, Indicators and Reagents, Information Storage and Retrieval, Ion Exchange, Models, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Nucleic Acid Conformation, P.H.S., Proteins, Quantitative Structure-Activity Relationship, RNA, ROC Curve, Research Support, Sequence Analysis, Statistical, Transfer, U.S. Gov't, 14565039
[Zhao2003Application]	Y. Zhao, C. Pinilla, D. Valmori, R. Martin, and R. Simon. Application of support vector machines for T-cell epitopes prediction. Bioinformatics, 19(15):1978-1984, 2003. [ bib \| http \| .pdf ] Motivation: The T-cell receptor, a major histocompatibility complex (MHC) molecule, and a bound antigenic peptide, play major roles in the process of antigen-specific T-cell activation. T-cell recognition was long considered exquisitely specific. Recent data also indicate that it is highly flexible, and one receptor may recognize thousands of different peptides. Deciphering the patterns of peptides that elicit a MHC restricted T-cell response is critical for vaccine development. Results: For the first time we develop a support vector machine (SVM) for T-cell epitope prediction with an MHC type I restricted T-cell clone. Using cross-validation, we demonstrate that SVMs can be trained on relatively small data sets to provide prediction more accurate than those based on previously published methods or on MHC binding. Supplementary information: Data for 203 synthesized peptides is available at http://linus.nci.nih.gov/Data/LAU203_Peptide.pdf Keywords: biosvm immunoinformatics
[Zhang2003Sequence]	X. H-F. Zhang, K. A. Heller, I. Hefter, C. S. Leslie, and L. A. Chasin. Sequence Information for the Splicing of Human Pre-mRNA Identified by Support Vector Machine Classification. Genome Res., 13(12):2637-2650, 2003. [ bib \| DOI \| http \| .pdf ] Vertebrate pre-mRNA transcripts contain many sequences that resemble splice sites on the basis of agreement to the consensus, yet these more numerous false splice sites are usually completely ignored by the cellular splicing machinery. Even at the level of exon definition, pseudo exons defined by such false splices sites outnumber real exons by an order of magnitude. We used a support vector machine to discover sequence information that could be used to distinguish real exons from pseudo exons. This machine learning tool led to the definition of potential branch points, an extended polypyrimidine tract, and C-rich and TG-rich motifs in a region limited to 50 nt upstream of constitutively spliced exons. C-rich sequences were also found in a region extending to 80 nt downstream of exons, along with G-triplet motifs. In addition, it was shown that combinations of three bases within the splice donor consensus sequence were more effective than consensus values in distinguishing real from pseudo splice sites; two-way base combinations were optimal for distinguishing 3' splice sites. These data also suggest that interactions between two or more of these elements may contribute to exon recognition, and provide candidate sequences for assessment as intronic splicing enhancers. Keywords: biosvm
[Zhang2003Classification]	S.-W. Zhang, Q. Pan, H.-C. Zhang, Y-L. Zhang, and H.-Y. Wang. Classification of protein quaternary structure with support vector machine. Bioinformatics, 19(18):2390-2396, 2003. [ bib \| http \| .pdf ] Motivation: Since the gap between sharply increasing known sequences and slow accumulation of known structures is becoming large, an automatic classification process based on the primary sequences and known three-dimensional structure becomes indispensable. The classification of protein quaternary structure based on the primary sequences can provide some useful information for the biologists. So a fully automatic and reliable classification system is needed. This work tries to look for the effective methods of extracting attribute and the algorithm for classifying the quaternary structure from the primary sequences. Results: Both of the support vector machine (SVM) and the covariant discriminant algorithms have been first introduced to predict quaternary structure properties from the protein primary sequences. The amino acid composition and the auto-correlation functions based on the amino acid index profile of the primary sequence have been taken into account in the algorithms. We have analyzed 472 amino acid indices and selected the four amino acid indices as the examples, which have the best performance. Thus the five attribute parameter data sets (COMP, FASG, NISK, WOLS and KYTJ) were established from the protein primary sequences. The COMP attribute data set is composed of amino acid composition, and the FASG, NISK, WOLS and KYTJ attribute data sets are composed of the amino acid composition and the auto-correlation functions of the corresponding amino acid residue index. The overall accuracies of SVM are 78.5, 87.5, 83.2, 81.7 and 81.9 and KYTJ data sets in jackknife test, which are 19.6, 7.8, 15.5, 13.1 and 15.8 algorithm in the same test. 
The results show that SVM may be applied to discriminate between the primary sequences of homodimers and non-homodimers and the two protein sequence descriptors can reflect the quaternary structure information. Compared with previous Robert Garian's investigation, the performance of SVM is almost equal to that of the Decision tree models, and the methods of extracting feature vector from the primary sequences are superior to Robert's binning function method. Availability: Programs are available on request from the authors. Keywords: biosvm
[Zernov2003Drug]	V. V. Zernov, K. V. Balakin, A. A. Ivaschenko, N. P. Savchuk, and I. V. Pletnev. Drug discovery using support vector machines. The case studies of drug-likeness, agrochemical-likeness, and enzyme inhibition predictions. J Chem Inf Comput Sci, 43(6):2048-56, 2003. [ bib \| DOI \| http \| .pdf ] Support Vector Machines (SVM) is a powerful classification and regression tool that is becoming increasingly popular in various machine learning applications. We tested the ability of SVM, in comparison with well-known neural network techniques, to predict drug-likeness and agrochemical-likeness for large compound collections. For both kinds of data, SVM outperforms various neural networks using the same set of descriptors. We also used SVM for estimating the activity of Carbonic Anhydrase II (CA II) enzyme inhibitors and found that the prediction quality of our SVM model is better than that reported earlier for conventional QSAR. Model characteristics and data set features were studied in detail. Keywords: biosvm chemoinformatics
[Yu2003Fine-grained]	C.S. Yu, J.Y. Wang, J.M. Yang, P.C. Lyu, C.J. Lin, and J.K. Hwang. Fine-grained protein fold assignment by support vector machines using generalized npeptide coding schemes and jury voting from multiple-parameter sets. Proteins, 50(4):531, Jun 2003. [ bib \| DOI \| http \| .pdf ] In the coarse-grained fold assignment of major protein classes, such as all-alpha, all-beta, alpha + beta, alpha/beta proteins, one can easily achieve high prediction accuracy from primary amino acid sequences. However, the fine-grained assignment of folds, such as those defined in the Structural Classification of Proteins (SCOP) database, presents a challenge due to the larger amount of folds available. Recent study yielded reasonable prediction accuracy of 56.0% on an independent set of 27 most populated folds. In this communication, we apply the support vector machine (SVM) method, using a combination of protein descriptors based on the properties derived from the composition of n-peptide and jury voting, to the fine-grained fold prediction, and are able to achieve an overall prediction accuracy of 69.6% on the same independent set, significantly higher than the previous results. On 10-fold cross-validation, we obtained a prediction accuracy of 65.3%. [...] sequence-coding schemes can significantly improve the fine-grained fold prediction. Our approach should be useful in structure prediction and modeling. Keywords: biosvm
[Yoon2003Analysis]	Y. Yoon, J. Song, S.H. Hong, and J.Q. Kim. Analysis of multiple single nucleotide polymorphisms of candidate genes related to coronary heart disease susceptibility by using support vector machines. Clin. Chem. Lab. Med., 41(4):529-534, 2003. [ bib \| .html \| .pdf ] Coronary heart disease (CHD) is a complex genetic disease involving gene-environment interaction. Many association studies between single nucleotide polymorphisms (SNPs) of candidate genes and CHD have been reported. We have applied a new method to analyze such relationships using support vector machines (SVMs), which is one of the methods for artificial neuronal network. We assumed that common haplotype implicit in genotypes will differ between cases and controls, and that this will allow SVM-derived patterns to be classifiable according to subject genotypes. Fourteen SNPs of ten candidate genes in 86 CHD patients and 119 controls were investigated. Genotypes were transformed to a numerical vector by giving scores based on difference between the genotypes of each subject and the reference genotypes, which represent the healthy normal population. Overall classification accuracy by SVMs was 64.4%. By conventional analysis using the chi2 test, the association between CHD and the SNP of the scavenger receptor B1 gene was most significant in terms of allele frequencies in cases vs. controls (p = 0.0001). In conclusion, we suggest that the application of SVMs for association studies of SNPs in candidate genes shows considerable promise and that further work could be usefully performed upon the estimation of CHD susceptibility in individuals of high risk. Keywords: biosvm
[Yamanishi2003Extraction]	Y. Yamanishi, J.-P. Vert, A. Nakaya, and M. Kanehisa. Extraction of correlated gene clusters from multiple genomic data by generalized kernel canonical correlation analysis. Bioinformatics, 19(Suppl. 1):i323-i330, 2003. [ bib \| http \| .pdf ] Motivation: A major issue in computational biology is the reconstruction of pathways from several genomic datasets, such as expression data, protein interaction data and phylogenetic profiles. As a first step toward this goal, it is important to investigate the amount of correlation which exists between these data. Results: These methods are successfully tested on their ability to recognize operons in the Escherichia coli genome, from the comparison of three datasets corresponding to functional relationships between genes in metabolic pathways, geometrical relationships along the chromosome, and co-expression relationships as observed by gene expression data. Contact: yoshi@kuicr.kyoto-u.ac.jp Keywords: biosvm
[Wu2003Comparison]	B. Wu, T. Abbott, D. Fishman, W. McMurray, G. Mor, K. Stone, D. Ward, K. Williams, and H. Zhao. Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics, 19(13):1636-1643, 2003. [ bib \| http \| .pdf ] Motivation: Novel methods, both molecular and statistical, are urgently needed to take advantage of recent advances in biotechnology and the human genome project for disease diagnosis and prognosis. Mass spectrometry (MS) holds great promise for biomarker identification and genome-wide protein profiling. It has been demonstrated in the literature that biomarkers can be identified to distinguish normal individuals from cancer patients using MS data. Such progress is especially exciting for the detection of early-stage ovarian cancer patients. Although various statistical methods have been utilized to identify biomarkers from MS data, there has been no systematic comparison among these approaches in their relative ability to analyze MS data. Results: We compare the performance of several classes of statistical methods for the classification of cancer based on MS spectra. These methods include: linear discriminant analysis, quadratic discriminant analysis, k-nearest neighbor classifier, bagging and boosting classification trees, support vector machine, and random forest (RF). The methods are applied to ovarian cancer and control serum samples from the National Ovarian Cancer Early Detection Program clinic at Northwestern University Hospital. We found that RF outperforms other methods in the analysis of MS data. Supplementary information: http://bioinformatics.med.yale.edu/proteomics/BioSupp1.html Keywords: biosvm
[Winters-Hilt2003Highly]	S. Winters-Hilt, W. Vercoutere, V.S. DeGuzman, D. Deamer, M. Akeson, and D. Haussler. Highly accurate classification of Watson-Crick basepairs on termini of single DNA molecules. Biophys. J., 84(2):967-976, 2003. [ bib \| http \| .pdf ] We introduce a computational method for classification of individual DNA molecules measured by an alpha-hemolysin channel detector. We show classification with better than 99% accuracy for hairpin molecules that differ only in their terminal Watson-Crick basepairs. Signal classification was done in silico to establish performance metrics (i.e., where train and test data were of known type, via single-species data files). It was then performed in solution to assay real mixtures of DNA hairpins. Hidden Markov Models (HMMs) were used with Expectation/Maximization for denoising and for associating a feature vector with the ionic current blockade of the DNA molecule. Support Vector Machines (SVMs) were used as discriminators, and were the focus of off-line training. A multiclass SVM architecture was designed to place less discriminatory load on weaker discriminators, and novel SVM kernels were used to boost discrimination strength. The tuning on HMMs and SVMs enabled biophysical analysis of the captured molecule states and state transitions; structure revealed in the biophysical analysis was used for better feature selection. Keywords: biosvm
[Wilton2003Comparison]	D. Wilton, P. Willett, K. Lawson, and G. Mullier. Comparison of ranking methods for virtual screening in lead-discovery programs. J Chem Inf Comput Sci, 43(2):469-74, 2003. [ bib \| DOI \| http \| .pdf ] This paper discusses the use of several rank-based virtual screening methods for prioritizing compounds in lead-discovery programs, given a training set for which both structural and bioactivity data are available. Structures from the NCI AIDS data set and from the Syngenta corporate database were represented by two types of fragment bit-string and by sets of high-level molecular features. These representations were processed using binary kernel discrimination, similarity searching, substructural analysis, support vector machine, and trend vector analysis, with the effectiveness of the methods being judged by the extent to which active test set molecules were clustered toward the top of the resultant rankings. The binary kernel discrimination approach yielded consistently superior rankings and would appear to have considerable potential for chemical screening applications. Keywords: biosvm
[Weston2003Feature]	J. Weston, F. Pérez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff, and B. Schölkopf. Feature selection and transduction for prediction of molecular bioactivity for drug design. Bioinformatics, 19(6):764-771, 2003. [ bib \| http \| .pdf ] Motivation: In drug discovery a key task is to identify characteristics that separate active (binding) compounds from inactive (non-binding) ones. An automated prediction system can help reduce resources necessary to carry out this task. Results: Two methods for prediction of molecular bioactivity for drug design are introduced and shown to perform well in a data set previously studied as part of the KDD (Knowledge Discovery and Data Mining) Cup 2001. The data is characterized by very few positive examples, a very large number of features (describing three-dimensional properties of the molecules) and rather different distributions between training and test data. Two techniques are introduced specifically to tackle these problems: a feature selection method for unbalanced data and a classifier which adapts to the distribution of the unlabeled test data (a so-called transductive method). We show both techniques improve identification performance and in conjunction provide an improvement over using only one of the techniques. Our results suggest the importance of taking into account the characteristics in this data which may also be relevant in other problems of a similar type. Availability: Matlab source code is available at http://www.kyb.tuebingen.mpg.de/bs/people/weston/kdd/kdd.html Contact: jason.weston@tuebingen.mpg.de Supplementary information: Supplementary material is available at http://www.kyb.tuebingen.mpg.de/bs/people/weston/kdd/kdd.html. Keywords: biosvm
[Warmuth2003Active]	M. K. Warmuth, J. Liao, G. Rätsch, M. Mathieson, S. Putta, and C. Lemmen. Active learning with support vector machines in the drug discovery process. J Chem Inf Comput Sci, 43(2):667-673, 2003. [ bib \| DOI \| http \| .pdf ] We investigate the following data mining problem from computer-aided drug design: From a large collection of compounds, find those that bind to a target molecule in as few iterations of biochemical testing as possible. In each iteration a comparatively small batch of compounds is screened for binding activity toward this target. We employed the so-called "active learning paradigm" from Machine Learning for selecting the successive batches. Our main selection strategy is based on the maximum margin hyperplane generated by "Support Vector Machines". This hyperplane separates the current set of active from the inactive compounds and has the largest possible distance from any labeled compound. We perform a thorough comparative study of various other selection strategies on data sets provided by DuPont Pharmaceuticals and show that the strategies based on the maximum margin hyperplane clearly outperform the simpler ones. Keywords: biosvm
[Ward2003Secondary]	J. J. Ward, L. J. McGuffin, B. F. Buxton, and D. T. Jones. Secondary structure prediction with support vector machines. Bioinformatics, 19(13):1650-1655, 2003. [ bib \| http \| .pdf ] Motivation: A new method that uses support vector machines (SVMs) to predict protein secondary structure is described and evaluated. The study is designed to develop a reliable prediction method using an alternative technique and to investigate the applicability of SVMs to this type of bioinformatics problem. Methods: Binary SVMs are trained to discriminate between two structural classes. The binary classifiers are combined in several ways to predict multi-class secondary structure. Results: The average three-state prediction accuracy per protein (Q3) is estimated by cross-validation to be 77.07 +/- 0.26% [...] +/- 0.39% [...] PSIPRED prediction method on a non-homologous test set of 121 proteins despite being trained on substantially fewer examples. A simple consensus of the SVM, PSIPRED and PROFsec achieves significantly higher prediction accuracy than the individual methods. Availability: The SVM classifier is available from the authors. Work is in progress to make the method available on-line and to integrate the SVM predictions into the PSIPRED server. Keywords: biosvm
[Wagner2003Protocols]	M. Wagner, D. Naik, and A. Pothen. Protocols for disease classification from mass spectrometry data. Proteomics, 3(9):1692-1698, 2003. [ bib \| DOI \| http \| .pdf ] We report our results in classifying protein matrix-assisted laser desorption/ionization-time of flight mass spectra obtained from serum samples into diseased and healthy groups. We discuss in detail five of the steps in preprocessing the mass spectral data for biomarker discovery, as well as our criterion for choosing a small set of peaks for classifying the samples. Cross-validation studies with four selected proteins yielded misclassification rates in the 10-15% range for all the classification methods. Three of these proteins or protein fragments are down-regulated and one up-regulated in lung cancer, the disease under consideration in this data set. When cross-validation studies are performed, care must be taken to ensure that the test set does not influence the choice of the peaks used in the classification. Misclassification rates are lower when both the training and test sets are used to select the peaks used in classification versus when only the training set is used. This expectation was validated for various statistical discrimination methods when thirteen peaks were used in cross-validation studies. One particular classification method, a linear support vector machine, exhibited especially robust performance when the number of peaks was varied from four to thirteen, and when the peaks were selected from the training set alone. Experiments with the samples randomly assigned to the two classes confirmed that misclassification rates were significantly higher in such cases than those observed with the true data. This indicates that our findings are indeed significant. We found closely matching masses in a database for protein expression in lung cancer for three of the four proteins we used to classify lung cancer. 
Data from additional samples, increased experience with the performance of various preprocessing techniques, and affirmation of the biological roles of the proteins that help in classification, will strengthen our conclusions in the future. Keywords: biosvm
[Vert2003Extracting]	J.-P. Vert and M. Kanehisa. Extracting active pathways from gene expression data. Bioinformatics, 19(Suppl. 2):ii238-ii244, 2003. [ bib \| http \| .pdf ] Motivation: A promising way to make sense out of gene expression profiles is to relate them to the activity of metabolic and signalling pathways. Each pathway usually involves many genes, such as enzymes, which can themselves participate in many pathways. The set of all known pathways can therefore be represented by a complex network of genes. Searching for regularities in the set of gene expression profiles with respect to the topology of this gene network is a way to automatically extract active pathways and their associated patterns of activity. Method: We present a method to perform this task, which consists in encoding both the gene network and the set of profiles into two kernel functions, and performing a regularized form of canonical correlation analysis between the two kernels. Results: When applied to publicly available expression data the method is able to extract biologically relevant expression patterns, as well as pathways with related activity. Keywords: biosvm
[Vert2003Graph-driven]	J.-P. Vert and M. Kanehisa. Graph-driven features extraction from microarray data using diffusion kernels and kernel CCA. In S. Becker, S. Thrun, and K. Obermayer, editors, Adv. Neural Inform. Process. Syst., pages 1449-1456. MIT Press, 2003. [ bib \| .pdf ] Keywords: biosvm
[Tsuda2003em]	K. Tsuda, S. Akaho, and K. Asai. The em Algorithm for Kernel Matrix Completion with Auxiliary Data. J. Mach. Learn. Res., 4:67-81, 2003. [ bib \| .html \| .pdf ] In biological data, it is often the case that observed data are available only for a subset of samples. When a kernel matrix is derived from such data, we have to leave the entries for unavailable samples as missing. In this paper, the missing entries are completed by exploiting an auxiliary kernel matrix derived from another information source. The parametric model of kernel matrices is created as a set of spectral variants of the auxiliary kernel matrix, and the missing entries are estimated by fitting this model to the existing entries. For model fitting, we adopt the em algorithm (distinguished from the EM algorithm of Dempster et al., 1977) based on the information geometry of positive definite matrices. We will report promising results on bacteria clustering experiments using two marker sequences: 16S and gyrB. Keywords: biosvm
[Takaoka2003Development]	Y. Takaoka, Y. Endo, S. Yamanobe, H. Kakinuma, T. Okubo, Y. Shimazaki, T. Ota, S. Sumiya, and K. Yoshikawa. Development of a method for evaluating drug-likeness and ease of synthesis using a data set in which compounds are assigned scores based on chemists' intuition. J Chem Inf Comput Sci, 43(4):1269-75, 2003. [ bib \| DOI \| http \| .pdf ] The concept of drug-likeness, an important characteristic for any compound in a screening library, is nevertheless difficult to pin down. Based on our belief that this concept is implicit within the collective experience of working chemists, we devised a data set to capture an intuitive human understanding of both this characteristic and ease of synthesis, a second key characteristic. Five chemists assigned a pair of scores to each of 3980 diverse compounds, with the component scores of each pair corresponding to drug-likeness and ease of synthesis, respectively. Using this data set, we devised binary classifiers with an artificial neural network and a support vector machine. These models were found to efficiently eliminate compounds that are not drug-like and/or hard-to-synthesize derivatives, demonstrating the suitability of these models for use as compound acquisition filters. Keywords: biosvm
[Sun2003Identifying]	Y.F. Sun, X.D. Fan, and Y.D. Li. Identifying splicing sites in eukaryotic RNA: support vector machine approach. Comput. Biol. Med., 33(1):17-29, 2003. [ bib \| DOI \| http \| .pdf ] We introduce a new method for splicing sites prediction based on the theory of support vector machines (SVM). The SVM represents a new approach to supervised pattern classification and has been successfully applied to a wide range of pattern recognition problems. In the process of splicing sites prediction, the statistical information of RNA secondary structure in the vicinity of splice sites, e.g. donor and acceptor sites, is introduced in order to compare recognition ratio of true positive and true negative. From the results of comparison, addition of structural information has brought no significant benefit for the recognition of splice sites and had even lowered the rate of recognition. Our results suggest that, through three cross validation, the SVM method can achieve a good performance for splice sites identification. Keywords: biosvm
[Su2003RankGene]	Yang Su, T.M. Murali, Vladimir Pavlovic, Michael Schaffer, and Simon Kasif. RankGene: identification of diagnostic genes based on expression data. Bioinformatics, 19(12):1578-1579, 2003. [ bib \| http \| .pdf ] Summary: RankGene is a program for analyzing gene expression data and computing diagnostic genes based on their predictive power in distinguishing between different types of samples. The program integrates into one system a variety of popular ranking criteria, ranging from the traditional t-statistic to one-dimensional support vector machines. This flexibility makes RankGene a useful tool in gene expression analysis and feature selection. Availability: http://genomics10.bu.edu/yangsu/rankgene Contact: murali@bu.edu Keywords: biosvm
[Sorich2003Comparison]	M. J. Sorich, J. O. Miners, R. A. McKinnon, D. A. Winkler, F. R. Burden, and P. A. Smith. Comparison of linear and nonlinear classification algorithms for the prediction of drug and chemical metabolism by human UDP-glucuronosyltransferase isoforms. J Chem Inf Comput Sci, 43(6):2019-24, 2003. [ bib \| DOI \| http \| .pdf ] Partial least squares discriminant analysis (PLSDA), Bayesian regularized artificial neural network (BRANN), and support vector machine (SVM) methodologies were compared by their ability to classify substrates and nonsubstrates of 12 isoforms of human UDP-glucuronosyltransferase (UGT), an enzyme "superfamily" involved in the metabolism of drugs, nondrug xenobiotics, and endogenous compounds. Simple two-dimensional descriptors were used to capture chemical information. For each data set, 70% of the data were used for training, and the remainder were used to assess the generalization performance. In general, the SVM methodology was able to produce models with the best predictive performance, followed by BRANN and then PLSDA. However, a small number of data sets showed either equivalent or better predictability using PLSDA, which may indicate relatively linear relationships in these data sets. All SVM models showed predictive ability (>60% of test set predicted correctly) and five out of the 12 test sets showed excellent prediction (>80% prediction accuracy). These models represent the first use of pattern recognition methods to discriminate between substrates and nonsubstrates of human drug metabolizing enzymes and the first thorough assessment of three classification algorithms using multiple metabolic data sets. Keywords: biosvm
[Siepen2003Beta]	J. A. Siepen, S. E. Radford, and D. R. Westhead. Beta Edge strands in protein structure prediction and aggregation. Protein Sci., 12(10):2348-2359, 2003. [ bib \| DOI \| http \| .pdf ] It is well established that recognition between exposed edges of beta-sheets is an important mode of protein-protein interaction and can have pathological consequences; for instance, it has been linked to the aggregation of proteins into a fibrillar structure, which is associated with a number of predominantly neurodegenerative disorders. A number of protective mechanisms have evolved in the edge strands of beta-sheets, preventing the aggregation and insolubility of most natural beta-sheet proteins. Such mechanisms are unfavorable in the interior of a beta-sheet. The problem of distinguishing edge strands from central strands based on sequence information alone is important in predicting residues and mutations likely to be involved in aggregation, and is also a first step in predicting folding topology. Here we report support vector machine (SVM) and decision tree methods developed to classify edge strands from central strands in a representative set of protein domains. Interestingly, rules generated by the decision tree method are in close agreement with our knowledge of protein structure and are potentially useful in a number of different biological applications. When trained on strands from proteins of known structure, using structure-based (Dictionary of Secondary Structure in Proteins) strand assignments, both methods achieved mean cross-validated prediction accuracies of 78% [...] strand assignments from secondary structure prediction were used. Further investigation of this effect revealed that it could be explained by a significant reduction in the accuracy of standard secondary structure prediction methods for edge strands, in comparison with central strands. Keywords: biosvm
[She2003Frequent-subsequence-based]	R. She, F. Chen, K. Wang, M. Ester, J.L. Gardy, and F.S.L. Brinkman. Frequent-subsequence-based prediction of outer membrane proteins. In KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 436-445. ACM Press, 2003. [ bib \| DOI \| .pdf ] A number of medically important disease-causing bacteria (collectively called Gram-negative bacteria) are noted for the extra "outer" membrane that surrounds their cell. Proteins resident in this membrane (outer membrane proteins, or OMPs) are of primary research interest for antibiotic and vaccine drug design as they are on the surface of the bacteria and so are the most accessible targets to develop new drugs against. With the development of genome sequencing technology and bioinformatics, biologists can now deduce all the proteins that are likely produced in a given bacteria and have attempted to classify where proteins are located in a bacterial cell. However such protein localization programs are currently least accurate when predicting OMPs, and so there is a current need for the development of a better OMP classifier. Data mining research suggests that the use of frequent patterns has good performance in aiding the development of accurate and efficient classification algorithms. In this paper, we present two methods to identify OMPs based on frequent subsequences and test them on all Gram-negative bacterial proteins whose localizations have been determined by biological experiments. One classifier follows an association rule approach, while the other is based on support vector machines (SVMs). We compare the proposed methods with the state-of-the-art methods in the biological domain. The results demonstrate that our methods are better both in terms of accurately identifying OMPs and providing biological insights that increase our understanding of the structures and functions of these important proteins. Keywords: biosvm
[Shannon2003Analyzing]	William Shannon, Robert Culverhouse, and Jill Duncan. Analyzing microarray data using cluster analysis. Pharmacogenomics, 4(1):41-52, Jan 2003. [ bib ] As pharmacogenetics researchers gather more detailed and complex data on gene polymorphisms that affect drug metabolizing enzymes, drug target receptors and drug transporters, they will need access to advanced statistical tools to mine that data. These tools include approaches from classical biostatistics, such as logistic regression or linear discriminant analysis, and supervised learning methods from computer science, such as support vector machines and artificial neural networks. In this review, we present an overview of another class of models, cluster analysis, which will likely be less familiar to pharmacogenetics researchers. Cluster analysis is used to analyze data that is not a priori known to contain any specific subgroups. The goal is to use the data itself to identify meaningful or informative subgroups. Specifically, we will focus on demonstrating the use of distance-based methods of hierarchical clustering to analyze gene expression data. Keywords: Algorithms, Automated, Base Pair Mismatch, Base Pairing, Base Sequence, Biosensing Techniques, Cluster Analysis, Comparative Study, Computer-Assisted, DNA, Gene Expression Profiling, Gene Expression Regulation, Genes, Hemolysins, Humans, Markov Chains, Messenger, Molecular Probe Techniques, Molecular Sequence Data, Nanotechnology, Neoplastic, Neural Networks (Computer), Non-U.S. Gov't, Nucleic Acid Conformation, Oligonucleotide Array Sequence Analysis, Pattern Recognition, Quality Control, RNA, Research Support, Signal Processing, Stomach Neoplasms, 12517285
[Serra2003Development]	J.R. Serra, E.D. Thompson, and P.C. Jurs. Development of binary classification of structural chromosome aberrations for a diverse set of organic compounds from molecular structure. Chem. Res. Toxicol., 16(2):153-163, 2003. [ bib \| DOI \| http \| .pdf ] Classification models are generated to predict in vitro cytogenetic results for a diverse set of 383 organic compounds. Both k-nearest neighbor and support vector machine models are developed. They are based on calculated molecular structure descriptors. Endpoints used are the labels clastogenic or nonclastogenic according to an in vitro chromosomal aberration assay with Chinese hamster lung cells. Compounds that were tested with both a 24 and 48 h exposure are included. Each compound is represented by calculated molecular structure descriptors encoding the topological, electronic, geometrical, or polar surface area aspects of the structure. Subsets of informative descriptors are identified with genetic algorithm feature selection coupled to the appropriate classification algorithm. The overall classification success rate for a k-nearest neighbor classifier built with just six topological descriptors is 81.2 and 86.5% [...] success rate for a three-descriptor support vector machine model is 99.7 and 83.8% for an external prediction set. Keywords: biosvm
[Segal2003Classificationa]	N. H. Segal, P. Pavlidis, C. R. Antonescu, R. G. Maki, W. S. Noble, D. DeSantis, J. M. Woodruff, J. J. Lewis, M. F. Brennan, A. N. Houghton, and C. Cordon-Cardo. Classification and Subtype Prediction of Adult Soft Tissue Sarcoma by Functional Genomics. Am. J. Pathol., 163(2):691-700, Aug 2003. [ bib \| http \| .pdf ] Adult soft tissue sarcomas are a heterogeneous group of tumors, including well-described subtypes by histological and genotypic criteria, and pleomorphic tumors typically characterized by non-recurrent genetic aberrations and karyotypic heterogeneity. The latter pose a diagnostic challenge, even to experienced pathologists. We proposed that gene expression profiling in soft tissue sarcoma would identify a genomic-based classification scheme that is useful in diagnosis. RNA samples from 51 pathologically confirmed cases, representing nine different histological subtypes of adult soft tissue sarcoma, were examined using the Affymetrix U95A GeneChip. Statistical tests were performed on experimental groups identified by cluster analysis, to find discriminating genes that could subsequently be applied in a support vector machine algorithm. Synovial sarcomas, round-cell/myxoid liposarcomas, clear-cell sarcomas and gastrointestinal stromal tumors displayed remarkably distinct and homogenous gene expression profiles. Pleomorphic tumors were heterogeneous. Notably, a subset of malignant fibrous histiocytomas, a controversial histological subtype, was identified as a distinct genomic group. The support vector machine algorithm supported a genomic basis for diagnosis, with both high sensitivity and specificity. In conclusion, we showed gene expression profiling to be useful in classification and diagnosis, providing insights into pathogenesis and pointing to potential new therapeutic targets of soft tissue sarcoma. Keywords: biosvm
[Segal2003Regression]	M. R. Segal, K. D. Dahlquist, and B. R. Conklin. Regression approaches for microarray data analysis. J. Comput. Biol., 10(6):961-980, 2003. [ bib \| DOI \| .pdf ] A variety of new procedures have been devised to handle the two-sample comparison (e.g., tumor versus normal tissue) of gene expression values as measured with microarrays. Such new methods are required in part because of some defining characteristics of microarray-based studies: (i) the very large number of genes contributing expression measures which far exceeds the number of samples (observations) available and (ii) the fact that by virtue of pathway/network relationships, the gene expression measures tend to be highly correlated. These concerns are exacerbated in the regression setting, where the objective is to relate gene expression, simultaneously for multiple genes, to some external outcome or phenotype. Correspondingly, several methods have been recently proposed for addressing these issues. We briefly critique some of these methods prior to a detailed evaluation of gene harvesting. This reveals that gene harvesting, without additional constraints, can yield artifactual solutions. Results obtained employing such constraints motivate the use of regularized regression procedures such as the lasso, least angle regression, and support vector machines. Model selection and solution multiplicity issues are also discussed. The methods are evaluated using a microarray-based study of cardiomyopathy in transgenic mice. Keywords: biosvm
[Segal2003Module]	E. Segal, M. Shapira, A. Regev, D. Pe'er, D. Botstein, D. Koller, and N. Friedman. Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat. Genet., 34(2):166-176, Jun 2003. [ bib \| DOI \| http \| .pdf ] Much of a cell's activity is organized as a network of interacting modules: sets of genes coregulated to respond to different conditions. We present a probabilistic method for identifying regulatory modules from gene expression data. Our procedure identifies modules of coregulated genes, their regulators and the conditions under which regulation occurs, generating testable hypotheses in the form 'regulator X regulates module Y under conditions W'. We applied the method to a Saccharomyces cerevisiae expression data set, showing its ability to identify functionally coherent modules and their correct regulators. We present microarray experiments supporting three novel predictions, suggesting regulatory roles for previously uncharacterized proteins. Keywords: biogm
[Sanchez-Carbayo2003Gene]	Marta Sanchez-Carbayo, Nicholas D Socci, Juan Jose Lozano, Wentian Li, Elizabeth Charytonowicz, Thomas J Belbin, Michael B Prystowsky, Angel R Ortiz, Geoffrey Childs, and Carlos Cordon-Cardo. Gene discovery in bladder cancer progression using cDNA microarrays. Am. J. Pathol., 163(2):505-16, Aug 2003. [ bib \| http \| .pdf ] To identify gene expression changes along progression of bladder cancer, we compared the expression profiles of early-stage and advanced bladder tumors using cDNA microarrays containing 17,842 known genes and expressed sequence tags. The application of bootstrapping techniques to hierarchical clustering segregated early-stage and invasive transitional carcinomas into two main clusters. Multidimensional analysis confirmed these clusters and more importantly, it separated carcinoma in situ from papillary superficial lesions and subgroups within early-stage and invasive tumors displaying different overall survival. Additionally, it recognized early-stage tumors showing gene profiles similar to invasive disease. Different techniques including standard t-test, single-gene logistic regression, and support vector machine algorithms were applied to identify relevant genes involved in bladder cancer progression. Cytokeratin 20, neuropilin-2, p21, and p33ING1 were selected among the top ranked molecular targets differentially expressed and validated by immunohistochemistry using tissue microarrays (n = 173). Their expression patterns were significantly associated with pathological stage, tumor grade, and altered retinoblastoma (RB) expression. Moreover, p33ING1 expression levels were significantly associated with overall survival. Analysis of the annotation of the most significant genes revealed the relevance of critical genes and pathways during bladder cancer progression, including the overexpression of oncogenic genes such as DEK in superficial tumors or immune response genes such as Cd86 antigen in invasive disease. 
Gene profiling successfully classified bladder tumors based on their progression and clinical outcome. The present study has identified molecular biomarkers of potential clinical significance and critical molecular targets associated with bladder cancer progression. Keywords: biosvm
[Salim2003Combination]	N. Salim, J. Holliday, and P. Willett. Combination of fingerprint-based similarity coefficients using data fusion. J Chem Inf Comput Sci, 43(2):435-442, 2003. [ bib \| DOI \| http ] Many different types of similarity coefficients have been described in the literature. Since different coefficients take into account different characteristics when assessing the degree of similarity between molecules, it is reasonable to combine them to further optimize the measures of similarity between molecules. This paper describes experiments in which data fusion is used to combine several binary similarity coefficients to get an overall estimate of similarity for searching databases of bioactive molecules. The results show that search performances can be improved by combining coefficients with little extra computational cost. However, there is no single combination which gives a consistently high performance for all search types. Keywords: 80 and over, Acid-Base Imbalance, Acute, Acute Disease, Adolescent, Adult, African Americans, Aged, Anemia, Animals, Anti-HIV Agents, Anti-Infective Agents, Antibiotics, Antibodies, Antineoplastic, Antineoplastic Agents, Antineoplastic Combined Chemotherapy Protocols, Antitubercular Agents, Aorta, Asparaginase, Autoimmune, B-Cell, Bangladesh, Bicarbonates, Biological Markers, Blood Glucose, California, Camptothecin, Cellulitis, Chorionic Gonadotropin, Chronic Disease, Ciprofloxacin, Clinical Protocols, Colorectal Neoplasms, Combination, Comparative Study, Daunorubicin, Decision Trees, Dexamethasone, Diabetes Mellitus, Dideoxynucleosides, Directly Observed Therapy, Disease Transmission, Drug Administration Schedule, Drug Resistance, Drug Therapy, English Abstract, Female, Fluorouracil, Follow-Up Studies, Glucose Tolerance Test, Glucosephosphate Dehydrogenase, Glyburide, HIV Infections, HIV-1, Health Planning, Health Resources, Helminth, Hemolysis, Hemolytic, Hormonal, Hospital Mortality, Human, Humans, Hypoglycemic Agents, 
Immunoglobulin M, In Vitro, Incidence, Indinavir, Insulin, Intensive Care Units, Interstitial, Lactates, Leucovorin, Leukemia, Male, Maternal Age, Middle Aged, Motor Activity, Multidrug-Resistant, Mutation, Nephritis, Non-U.S. Gov't, Organoplatinum Compounds, Pennsylvania, Phytotherapy, Plant Extracts, Plant Leaves, Population Dynamics, Potassium Channels, Prednisone, Pregnancy, Pregnancy Outcome, Prenatal, Prenatal Care, Progesterone, Prognosis, Prospective Studies, Pulmonary, Rabbits, Randomized Controlled Trials, Rats, Research Support, Retrospective Studies, Risk Assessment, Scalp Dermatoses, Schistosomiasis japonica, Severity of Illness Index, Spondylarthropathies, Streptozocin, Survival Rate, Trauma Centers, Trauma Severity Indices, Tubal, Tuberculosis, Type 2, Ultrasonography, Vertical, Vincristine, Viral, Viral Load, Wistar, Wounds and Injuries, Ziziphus, beta Subunit, 12653506
[Saeys2003Fast]	Y. Saeys, S. Degroeve, D. Aeyels, Y. Van de Peer, and P. Rouze. Fast feature selection using a simple estimation of distribution algorithm: a case study on splice site prediction. Bioinformatics, 19(Suppl. 1):ii179-ii188, 2003. [ bib \| http \| .pdf ] Motivation: Feature subset selection is an important preprocessing step for classification. In biology, where structures or processes are described by a large number of features, the elimination of irrelevant and redundant information in a reasonable amount of time has a number of advantages. It enables the classification system to achieve good or even better solutions with a restricted subset of features, allows for a faster classification, and it helps the human expert focus on a relevant subset of features, hence providing useful biological knowledge. Results: We present a heuristic method based on Estimation of Distribution Algorithms to select relevant subsets of features for splice site prediction in Arabidopsis thaliana. We show that this method performs a fast detection of relevant feature subsets using the technique of constrained feature subsets. Compared to the traditional greedy methods the gain in speed can be up to one order of magnitude, with results being comparable or even better than the greedy methods. This makes it a very practical solution for classification tasks that can be solved using a relatively small amount of discriminative features (or feature dependencies), but where the initial set of potential discriminative features is rather large. Keywords: Machine Learning, Feature Subset Selection, Estimation of Distribution Algorithms, Splice Site Prediction. Contact: yvsae@gengenp.rug.ac.be Keywords: biosvm
[Qin2003Kernel]	J. Qin, D. P. Lewis, and W. S. Noble. Kernel hierarchical gene clustering from microarray expression data. Bioinformatics, 19(16):2097-2104, 2003. [ bib \| http \| .pdf ] Motivation: Unsupervised analysis of microarray gene expression data attempts to find biologically significant patterns within a given collection of expression measurements. For example, hierarchical clustering can be applied to expression profiles of genes across multiple experiments, identifying groups of genes that share similar expression profiles. Previous work using the support vector machine supervised learning algorithm with microarray data suggests that higher-order features, such as pairwise and tertiary correlations across multiple experiments, may provide significant benefit in learning to recognize classes of co-expressed genes. Results: We describe a generalization of the hierarchical clustering algorithm that efficiently incorporates these higher-order features by using a kernel function to map the data into a high-dimensional feature space. We then evaluate the utility of the kernel hierarchical clustering algorithm using both internal and external validation. The experiments demonstrate that the kernel representation itself is insufficient to provide improved clustering performance. We conclude that mapping gene expression data into a high-dimensional feature space is only a good idea when combined with a learning algorithm, such as the support vector machine that does not suffer from the curse of dimensionality. Availability: Supplementary data at www.cs.columbia.edu/compbio/hiclust. Software source code available by request. Keywords: biosvm
[Qian2003Prediction]	J. Qian, J. Lin, N. M. Luscombe, H. Yu, and M. Gerstein. Prediction of regulatory networks: genome-wide identification of transcription factor targets from gene expression data. Bioinformatics, 19(15):1917-1926, 2003. [ bib \| http \| .pdf ] Motivation: Defining regulatory networks, linking transcription factors (TFs) to their targets, is a central problem in post-genomic biology. One might imagine one could readily determine these networks through inspection of gene expression data. However, the relationship between the expression timecourse of a transcription factor and its target is not obvious (e.g. simple correlation over the timecourse), and current analysis methods, such as hierarchical clustering, have not been very successful in deciphering them. Results: Here we introduce an approach based on support vector machines (SVMs) to predict the targets of a transcription factor by identifying subtle relationships between their expression profiles. In particular, we used SVMs to predict the regulatory targets for 36 transcription factors in the Saccharomyces cerevisiae genome based on the microarray expression data from many different physiological conditions. We trained and tested our SVM on a data set constructed to include a significant number of both positive and negative examples, directly addressing data imbalance issues. This was non-trivial given that most of the known experimental information is only for positives. Overall, we found that 63% [...] confirmed through cross-validation. We further assessed the performance of our regulatory network identifications by comparing them with the results from two recent genome-wide ChIP-chip experiments. Overall, we find the agreement between our results and these experiments is comparable to the agreement (albeit low) between the two experiments. 
We find that this network has a delocalized structure with respect to chromosomal positioning, with a given transcription factor having targets spread fairly uniformly across the genome. Availability: The overall network of the relationships is available on the web at http://bioinfo.mbb.yale.edu/expression/echipchip Keywords: biosvm
[Pham2003Prediction]	Tho Hoan Pham, Kenji Satou, and Tu Bao Ho. Prediction and analysis of beta-turns in proteins by support vector machine. Genome Inform Ser Workshop Genome Inform, 14:196-205, 2003. [ bib ] Tight turn has long been recognized as one of the three important features of proteins after the alpha-helix and beta-sheet. Tight turns play an important role in globular proteins from both the structural and functional points of view. More than 90% tight turns are beta-turns. Analysis and prediction of beta-turns in particular and tight turns in general are very useful for the design of new molecules such as drugs, pesticides, and antigens. In this paper, we introduce a support vector machine (SVM) approach to prediction and analysis of beta-turns. We have investigated two aspects of applying SVM to the prediction and analysis of beta-turns. First, we developed a new SVM method, called BTSVM, which predicts beta-turns of a protein from its sequence. The prediction results on the dataset of 426 non-homologous protein chains by sevenfold cross-validation technique showed that our method is superior to the other previous methods. Second, we analyzed how amino acid positions support (or prevent) the formation of beta-turns based on the "multivariable" classification model of a linear SVM. This model is more general than the other ones of previous statistical methods. Our analysis results are more comprehensive and easier to use than previously published analysis results. Keywords: biosvm
[Peng2003Molecular]	S. Peng, Q. Xu, X.B. Ling, X. Peng, W. Du, and L. Chen. Molecular classification of cancer types from microarray data using the combination of genetic algorithms and support vector machines. FEBS Lett., 555(2):358-362, 2003. [ bib \| DOI \| http \| .pdf ] Simultaneous multiclass classification of tumor types is essential for future clinical implementations of microarray-based cancer diagnosis. In this study, we have combined genetic algorithms (GAs) and all paired support vector machines (SVMs) for multiclass cancer identification. The predictive features have been selected through iterative SVMs/GAs, and recursive feature elimination post-processing steps, leading to a very compact cancer-related predictive gene set. Leave-one-out cross-validations yielded accuracies of 87.93% [...] the eight-class and 85.19% [...] outperforming the results derived from previously published methods. Keywords: biosvm microarray
[Patterson2003Proteomics]	Scott D Patterson and Ruedi H Aebersold. Proteomics: the first decade and beyond. Nat Genet, 33 Suppl:311-323, Mar 2003. [ bib \| DOI \| http ] Proteomics is the systematic study of the many and diverse properties of proteins in a parallel manner with the aim of providing detailed descriptions of the structure, function and control of biological systems in health and disease. Advances in methods and technologies have catalyzed an expansion of the scope of biological studies from the reductionist biochemical analysis of single proteins to proteome-wide measurements. Proteomics and other complementary analysis methods are essential components of the emerging 'systems biology' approach that seeks to comprehensively describe biological systems through integration of diverse types of data and, in the future, to ultimately allow computational simulations of complex biological systems. Keywords: Amino Acid Sequence; Base Sequence; Chromatography, Liquid; Computational Biology; DNA; Genetic Techniques; History, 20th Century; History, 21st Century; Mass Spectrometry; Oligonucleotide Array Sequence Analysis; Proteins; Proteomics
[Park2003Prediction]	K.-J. Park and M. Kanehisa. Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics, 19(13):1656-1663, 2003. [ bib \| http \| .pdf ] Motivation: The subcellular location of a protein is closely correlated to its function. Thus, computational prediction of subcellular locations from the amino acid sequence information would help annotation and functional prediction of protein coding genes in complete genomes. We have developed a method based on support vector machines (SVMs). Results: We considered 12 subcellular locations in eukaryotic cells: chloroplast, cytoplasm, cytoskeleton, endoplasmic reticulum, extracellular medium, Golgi apparatus, lysosome, mitochondrion, nucleus, peroxisome, plasma membrane, and vacuole. We constructed a data set of proteins with known locations from the SWISS-PROT database. A set of SVMs was trained to predict the subcellular location of a given protein based on its amino acid, amino acid pair, and gapped amino acid pair compositions. The predictors based on these different compositions were then combined using a voting scheme. Results obtained through 5-fold cross-validation tests showed an improvement in prediction accuracy over the algorithm based on the amino acid composition only. This prediction method is available via the Internet. Availability: http://www.genome.ad.jp/SIT/ploc.html Supplementary information: http://web.kuicr.kyoto-u.ac.jp/~park/Seqdata/ Keywords: biosvm
[Nguyen2003Multi-class]	Minh N Nguyen and Jagath C Rajapakse. Multi-class support vector machines for protein secondary structure prediction. Genome Inform Ser Workshop Genome Inform, 14:218-27, 2003. [ bib ] The solution of binary classification problems using the Support Vector Machine (SVM) method has been well developed. Though multi-class classification is typically solved by combining several binary classifiers, recently, several multi-class methods that consider all classes at once have been proposed. However, these methods require resolving a much larger optimization problem and are applicable to small datasets. Three methods based on binary classifications: one-against-all (OAA), one-against-one (OAO), and directed acyclic graph (DAG), and two approaches for multi-class problem by solving one single optimization problem, are implemented to predict protein secondary structure. Our experiments indicate that multi-class SVM methods are more suitable for protein secondary structure (PSS) prediction than the other methods, including binary SVMs, because their capacity to solve an optimization problem in one step. Furthermore, in this paper, we argue that it is feasible to extend the prediction accuracy by adding a second-stage multi-class SVM to capture the contextual information among secondary structural elements and thereby further improving the accuracies. We demonstrate that two-stage SVMs perform better than single-stage SVM techniques for PSS prediction using two datasets and report a maximum accuracy of 79.5%. Keywords: biosvm
[Meireles2003Differentially]	S.I. Meireles, A.F. Carvalho, R. Hirata, A.L. Montagnini, W.K. Martins, F.B. Runza, B.S. Stolf, L. Termini, C.E. Neto, R.L. Silva, F.A. Soares, E.J. Neves, and L.F. Reis. Differentially expressed genes in gastric tumors identified by cDNA array. Cancer Lett., 190(2):199-211, Feb 2003. [ bib \| DOI \| http \| .pdf ] Using cDNA fragments from the FAPESP/LICR Cancer Genome Project, we constructed a cDNA array having 4512 elements and determined gene expression in six normal and six tumor gastric tissues. Using t-statistics, we identified 80 cDNAs whose expression in normal and tumor samples differed more than 3.5 sample standard deviations. Using Self-Organizing Map, the expression profile of these cDNAs allowed perfect separation of malignant and non-malignant samples. Using the supervised learning procedure Support Vector Machine, we identified trios of cDNAs that could be used to classify samples as normal or tumor, based on single-array analysis. Finally, we identified genes with altered linear correlation when their expression in normal and tumor samples were compared. Further investigation concerning the function of these genes could contribute to the understanding of gastric carcinogenesis and may prove useful in molecular diagnostics. Keywords: biosvm microarray
[McKnight2003Categorization]	Larry McKnight and Padmini Srinivasan. Categorization of sentence types in medical abstracts. AMIA Annu Symp Proc, pages 440-4, 2003. [ bib ] This study evaluated the use of machine learning techniques in the classification of sentence type. 7253 structured abstracts and 204 unstructured abstracts of Randomized Controlled Trials from MedLINE were parsed into sentences and each sentence was labeled as one of four types (Introduction, Method, Result, or Conclusion). Support Vector Machine (SVM) and Linear Classifier models were generated and evaluated on cross-validated data. Treating sentences as a simple "bag of words", the SVM model had an average ROC area of 0.92. Adding a feature of relative sentence location improved performance markedly for some models and overall increasing the average ROC to 0.95. Linear classifier performance was significantly worse than the SVM in all datasets. Using the SVM model trained on structured abstracts to predict unstructured abstracts yielded performance similar to that of models trained with unstructured abstracts in 3 of the 4 types. We conclude that classification of sentence type seems feasible within the domain of RCT's. Identification of sentence types may be helpful for providing context to end users or other text summarization techniques. Keywords: biosvm
[Mayr2003Cross-reactive]	Torsten Mayr, Christian Igel, Gregor Liebsch, Ingo Klimant, and Otto S Wolfbeis. Cross-reactive metal ion sensor array in a micro titer plate format. Anal Chem, 75(17):4389-96, Sep 2003. [ bib ] A cross-reactive array in a micro titer plate (MTP) format is described that is based on a versatile and highly flexible scheme. It makes use of rather unspecific metal ions probes having almost identical fluorescence spectra, thus enabling (a) interrogation at identical analytical wavelengths, and (b) imaging of the probes contained in the wells of the MTP using a CCD camera and an array of blue-light-emitting diodes as a light source. The unselective response of the indicators in the presence of mixtures of five divalent cations generates a characteristic pattern that was analyzed by chemometric tools. The fluorescence intensity of the indicators was transferred into a time-dependent parameter applying a scheme called dual lifetime referencing. In this method, the fluorescence decay profile of the indicator is referenced against the phosphorescence of an inert reference dye added to the system. The intrinsically referenced measurements also were performed using blue LEDs as light sources and a CCD camera without intensifiers as the detector. The best performance was observed if each well was excited by a single LED. The assembly allows the detection of dye concentrations in the nanomoles-per-liter range without amplification and the acquisition of 96 wells simultaneously. The pictures obtained form the basis for evaluation by pattern recognition algorithms. Support vector machines are capable of predicting the presence of significant concentrations of metal ions with high accuracy. 
Keywords: Agrochemicals, Air Pollutants, Aircraft, Algorithms, Artificial Intelligence, Automated, Base Composition, Base Sequence, Bayes Theorem, Carbonic Anhydrase Inhibitors, Cluster Analysis, Colonic Neoplasms, Comparative Study, Computational Biology, Computer Simulation, Computer Systems, Computer-Assisted, Computing Methodologies, Confidence Intervals, Cytosine, DNA, Data Interpretation, Databases, Diagnosis, Drug Design, Enhancer Elements (Genetics), Environmental Monitoring, Enzyme Inhibitors, Ethanol, Exons, Forecasting, Fourier Transform Infrared, Gene Expression Profiling, Gene Expression Regulation, Genetic, Genetic Screening, Glucuronosyltransferase, Guanine, Humans, Image Interpretation, Isoenzymes, Least-Squares Analysis, Leukemia, Linear Models, Lymphoma, Models, Molecular, Molecular Conformation, Molecular Sequence Data, Natural Disasters, Neoplasms, Neoplastic, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Oligonucleotide Array Sequence Analysis, Online Systems, P.H.S., Pattern Recognition, Pharmaceutical Preparations, Phenotype, Photography, Probability, Pyrimidines, Quantitative Structure-Activity Relationship, RNA Precursors, RNA Splice Sites, RNA Splicing, Radiation, Reproducibility of Results, Research Support, Sensitivity and Specificity, Sequence Alignment, Sequence Analysis, Signal Processing, Software, Spectroscopy, Statistical, Subtraction Technique, Terminology, Thermodynamics, Time Factors, U.S. Gov't, Untranslated Regions, Video Recording, Walking, 14632041
[Mattfeldt2003Classification]	T. Mattfeldt, H.W. Gottfried, H. Wolter, V. Schmidt, H.A. Kestler, and J. Mayer. Classification of prostatic carcinoma with artificial neural networks using comparative genomic hybridization and quantitative stereological data. Pathol. Res. Pract., 199(12):773-784, 2003. [ bib \| DOI \| http ] Staging of prostate cancer is a mainstay of treatment decisions and prognostication. In the present study, 50 pT2N0 and 28 pT3N0 prostatic adenocarcinomas were characterized by Gleason grading, comparative genomic hybridization (CGH), and histological texture analysis based on principles of stereology and stochastic geometry. The cases were classified by learning vector quantization and support vector machines. The quality of classification was tested by cross-validation. Correct prediction of stage from primary tumor data was possible with an accuracy of 74-80% [...] of prediction was similar when the Gleason score was used as input variable, when stereological data were used, or when a combination of CGH data and stereological data was used. The results of classification by learning vector quantization were slightly better than those by support vector machines. A method is briefly sketched by which training of neural networks can be adapted to unequal sample sizes per class. Progression from pT2 to pT3 prostate cancer is correlated with complex changes of the epithelial cells in terms of volume fraction, of surface area, and of second-order stereological properties. Genetically, this progression is accompanied by a significant global increase in losses and gains of DNA, and specifically by increased numerical aberrations on chromosome arms 1q, 7p, and 8p. Keywords: biosvm, cgh
[Markowetz2003Support]	F. Markowetz, L. Edler, and M. Vingron. Support Vector Machines for Protein Fold Class Prediction. Biometrical Journal, 45(3):377-389, 2003. [ bib \| DOI \| http \| .pdf ] Knowledge of the three-dimensional structure of a protein is essential for describing and understanding its function. Today, a large number of known protein sequences faces a small number of identified structures. Thus, the need arises to predict structure from sequence without using time-consuming experimental identification. In this paper the performance of Support Vector Machines (SVMs) is compared to Neural Networks and to standard statistical classification methods as Discriminant Analysis and Nearest Neighbor Classification. We show that SVMs can beat the competing methods on a dataset of 268 protein sequences to be classified into a set of 42 fold classes. We discuss misclassification with respect to biological function and similarity. In a second step we examine the performance of SVMs if the embedding is varied from frequencies of single amino acids to frequencies of triplets of amino acids. This work shows that SVM provide a promising alternative to standard statistical classification and prediction methods in functional genomics. Keywords: biosvm
[Lu2003Expression]	Y.J. Lu, D. Williamson, R. Wang, B. Summersgill, S. Rodriguez, S. Rogers, K. Pritchard-Jones, C. Campbell, and J. Shipley. Expression profiling targeting chromosomes for tumor classification and prediction of clinical behavior. Genes Chromosomes Cancer, 38(3):207-214, 2003. [ bib \| DOI \| .pdf ] Tumors are associated with altered or deregulated gene products that affect critical cellular functions. Here we assess the use of a global expression profiling technique that identifies chromosome regions corresponding to differential gene expression, termed comparative expressed sequence hybridization (CESH). CESH analysis was performed on a total of 104 tumors with a diagnosis of rhabdomyosarcoma, leiomyosarcoma, prostate cancer, and favorable-histology Wilms tumors. Through the use of the chromosome regions identified as variables, support vector machine analysis was applied to assess classification potential, and feature selection (recursive feature elimination) was used to identify the best discriminatory regions. We demonstrate that the CESH profiles have characteristic patterns in tumor groups and were also able to distinguish subgroups of rhabdomyosarcoma. The overall CESH profiles in favorable-histology Wilms tumors were found to correlate with subsequent clinical behavior. Classification by use of CESH profiles was shown to be similar in performance to previous microarray expression studies and highlighted regions for further investigation. We conclude that analysis of chromosomal expression profiles can group, subgroup, and even predict clinical behavior of tumors to a level of performance similar to that of microarray analysis. CESH is independent of selecting sequences for interrogation and is a simple, rapid, and widely accessible approach to identify clinically useful differential expression. Keywords: biosvm
[Liu2003QSAR]	H. X. Liu, R. S. Zhang, X. J. Yao, M. C. Liu, Z. D. Hu, and B. T. Fan. QSAR study of ethyl 2-[(3-methyl-2,5-dioxo(3-pyrrolinyl))amino]-4-(trifluoromethyl) pyrimidine-5-carboxylate: an inhibitor of AP-1 and NF-kappa B mediated gene expression based on support vector machines. J Chem Inf Comput Sci, 43(4):1288-96, 2003. [ bib \| DOI \| http \| .pdf ] The support vector machine, as a novel type of learning machine, for the first time, was used to develop a QSAR model of 57 analogues of ethyl 2-[(3-methyl-2,5-dioxo(3-pyrrolinyl))amino]-4-(trifluoromethyl)pyrimidine-5-carboxylate (EPC), an inhibitor of AP-1 and NF-kappa B mediated gene expression, based on calculated quantum chemical parameters. The quantum chemical parameters involved in the model are Kier and Hall index (order3) (KHI3), Information content (order 0) (IC0), YZ Shadow (YZS) and Max partial charge for an N atom (MaxPCN), Min partial charge for an N atom (MinPCN). The mean relative error of the training set, the validation set, and the testing set is 1.35%, 1.52%, and 2.23%, respectively, and the maximum relative error is less than 5.00%. Keywords: biosvm
[Liu2003in-silico]	Huiqing Liu, Hao Han, Jinyan Li, and Limsoon Wong. An in-silico method for prediction of polyadenylation signals in human sequences. Genome Inform Ser Workshop Genome Inform, 14:84-93, 2003. [ bib ] This paper presents a machine learning method to predict polyadenylation signals (PASes) in human DNA and mRNA sequences by analysing features around them. This method consists of three sequential steps of feature manipulation: generation, selection and integration of features. In the first step, new features are generated using k-gram nucleotide acid or amino acid patterns. In the second step, a number of important features are selected by an entropy-based algorithm. In the third step, support vector machines are employed to recognize true PASes from a large number of candidates. Our study shows that true PASes in DNA and mRNA sequences can be characterized by different features, and also shows that both upstream and downstream sequence elements are important for recognizing PASes from DNA sequences. We tested our method on several public data sets as well as our own extracted data sets. In most cases, we achieved better validation results than those reported previously on the same data sets. The important motifs observed are highly consistent with those reported in literature. Keywords: biosvm
[Lind2003Support]	P. Lind and T. Maltseva. Support vector machines for the estimation of aqueous solubility. J Chem Inf Comput Sci, 43(6):1855-9, 2003. [ bib \| DOI \| http \| .pdf ] Support Vector Machines (SVMs) are used to estimate aqueous solubility of organic compounds. A SVM equipped with a Tanimoto similarity kernel estimates solubility with accuracy comparable to results from other reported methods where the same data sets have been studied. Complete cross-validation on a diverse data set resulted in a root-mean-squared error = 0.62 and R(2) = 0.88. The data input to the machine is in the form of molecular fingerprints. No physical parameters are explicitly involved in calculations. Keywords: biosvm chemoinformatics
[Liao2003Combining]	L. Liao and W.S. Noble. Combining Pairwise Sequence Similarity and Support Vector Machines for Detecting Remote Protein Evolutionary and Structural Relationships. J. Comput. Biol., 10(6):857-868, 2003. [ bib \| http \| .pdf ] One key element in understanding the molecular machinery of the cell is to understand the structure and function of each protein encoded in the genome. A very successful means of inferring the structure or function of a previously unannotated protein is via sequence similarity with one or more proteins whose structure or function is already known. Toward this end, we propose a means of representing proteins using pairwise sequence similarity scores. This representation, combined with a discriminative classification algorithm known as the support vector machine (SVM), provides a powerful means of detecting subtle structural and evolutionary relationships among proteins. The algorithm, called SVM-pairwise, when tested on its ability to recognize previously unseen families from the SCOP database, yields significantly better performance than SVM-Fisher, profile HMMs, and PSI-BLAST. Keywords: biosvm
[Li2003Simple]	Jinyan Li, Huiqing Liu, James R Downing, Allen Eng-Juh Yeoh, and Limsoon Wong. Simple rules underlying gene expression profiles of more than six subtypes of acute lymphoblastic leukemia (ALL) patients. Bioinformatics, 19(1):71-8, Jan 2003. [ bib ] MOTIVATIONS AND RESULTS: For classifying gene expression profiles or other types of medical data, simple rules are preferable to non-linear distance or kernel functions. This is because rules may help us understand more about the application in addition to performing an accurate classification. In this paper, we discover novel rules that describe the gene expression profiles of more than six subtypes of acute lymphoblastic leukemia (ALL) patients. We also introduce a new classifier, named PCL, to make effective use of the rules. PCL is accurate and can handle multiple parallel classifications. We evaluate this method by classifying 327 heterogeneous ALL samples. Our test error rate is competitive to that of support vector machines, and it is 71% better than C4.5, 50% better than Naive Bayes, and 43% better than k-nearest neighbour. Experimental results on another independent data sets are also presented to show the strength of our method. AVAILABILITY: Under http://sdmc.lit.org.sg/GEDatasets/, click on Supplementary Information. Keywords: Acute, Algorithms, Automated, Base Pair Mismatch, Base Pairing, Base Sequence, Biological, Biosensing Techniques, Cluster Analysis, Comparative Study, Computer-Assisted, DNA, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Hemolysins, Humans, Leukemia, Lymphocytic, Markov Chains, Messenger, Models, Molecular Probe Techniques, Molecular Sequence Data, Nanotechnology, Neoplasm, Neoplastic, Neural Networks (Computer), Non-U.S. Gov't, Nucleic Acid Conformation, Oligonucleotide Array Sequence Analysis, Pattern Recognition, Quality Control, RNA, Research Support, Signal Processing, Statistical, Stomach Neoplasms, Tumor Markers, 12499295
[Leslie2003Mismatch]	C. Leslie, E. Eskin, J. Weston, and W.S. Noble. Mismatch String Kernels for SVM Protein Classification. In Suzanna Becker, Sebastian Thrun, and Klaus Obermayer, editors, Advances in Neural Information Processing Systems 15. MIT Press, 2003. [ bib \| .pdf \| .pdf ] Keywords: biosvm
[Lee2003Classification]	Y. Lee and C.-K. Lee. Classification of multiple cancer types by multicategory support vector machines using gene expression data. Bioinformatics, 19(9):1132-1139, 2003. [ bib \| http \| .pdf ] Motivation: High-density DNA microarray measures the activities of several thousand genes simultaneously and the gene expression profiles have been used for the cancer classification recently. This new approach promises to give better therapeutic measurements to cancer patients by diagnosing cancer types with improved accuracy. The Support Vector Machine (SVM) is one of the classification methods successfully applied to the cancer diagnosis problems. However, its optimal extension to more than two classes was not obvious, which might impose limitations in its application to multiple tumor types. We briefly introduce the Multicategory SVM, which is a recently proposed extension of the binary SVM, and apply it to multiclass cancer diagnosis problems. Results: Its applicability is demonstrated on the leukemia data (Golub et al., 1999) and the small round blue cell tumors of childhood data (Khan et al., 2001). Comparable classification accuracy shown in the applications and its flexibility render the MSVM a viable alternative to other classification methods. Supplementary Information: http://www.stat.ohio-state.edu/ yklee/msvm.html Contact: yklee@stat.ohio-state.edu Keywords: biosvm
[Lee2003Discovery]	Dongkwon Lee, Sang Wook Choi, Myengsoo Kim, Jin Hyun Park, Moonkyu Kim, Jungchul Kim, and In-Beum Lee. Discovery of differentially expressed genes related to histological subtype of hepatocellular carcinoma. Biotechnol Prog., 19(3):1011-5, 2003. [ bib \| DOI \| http \| .pdf ] Hepatocellular carcinoma (HCC) is one of the most common human malignancies in the world. To identify the histological subtype-specific genes of HCC, we analyzed the gene expression profile of 10 HCC patients by means of cDNA microarray. We proposed a systematic approach for determining the discriminatory genes and revealing the biological phenomena of HCC with cDNA microarray data. First, normalization of cDNA microarray data was performed to reduce or minimize systematic variations. On the basis of the suitably normalized data, we identified specific genes involved in histological subtype of HCC. Two classification methods, Fisher's discriminant analysis (FDA) and support vector machine (SVM), were used to evaluate the reliability of the selected genes and discriminate the histological subtypes of HCC. This study may provide a clue for the needs of different chemotherapy and the reason for heterogeneity of the clinical responses according to histological subtypes. Keywords: biosvm
[Krishnan2003comparative]	V. G. Krishnan and D. R. Westhead. A comparative study of machine-learning methods to predict the effects of single nucleotide polymorphisms on protein function. Bioinformatics, 19(17):2199-2209, 2003. [ bib \| http \| .pdf ] Motivation: The large volume of single nucleotide polymorphism data now available motivates the development of methods for distinguishing neutral changes from those which have real biological effects. Here, two different machine-learning methods, decision trees and support vector machines (SVMs), are applied for the first time to this problem. In common with most other methods, only non-synonymous changes in protein coding regions of the genome are considered. Results: In detailed cross-validation analysis, both learning methods are shown to compete well with existing methods, and to out-perform them in some key tests. SVMs show better generalization performance, but decision trees have the advantage of generating interpretable rules with robust estimates of prediction confidence. It is shown that the inclusion of protein structure information produces more accurate methods, in agreement with other recent studies, and the effect of using predicted rather than actual structure is evaluated. Availability: Software is available on request from the authors. Keywords: biosvm
[Kim2003Protein]	H. Kim and H. Park. Protein secondary structure prediction based on an improved support vector machines approach. Protein Eng., 16(8):553-560, Aug 2003. [ bib \| http \| .pdf ] The prediction of protein secondary structure is an important step in the prediction of protein tertiary structure. A new protein secondary structure prediction method, SVMpsi, was developed to improve the current level of prediction by incorporating new tertiary classifiers and their jury decision system, and the PSI-BLAST PSSM profiles. Additionally, efficient methods to handle unbalanced data and a new optimization strategy for maximizing the Q3 measure were developed. The SVMpsi produces the highest published Q3 and SOV94 scores on both the RS126 and CB513 data sets to date. For a new KP480 set, the prediction accuracy of SVMpsi was Q3 = 78.5 for 136 non-redundant protein sequences which do not contain homologues of training data sets were Q3 = 77.2 SVMpsi results in CASP5 illustrate that it is another competitive method to predict protein secondary structure. Keywords: biosvm
[Kashima2003Marginalized]	H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized Kernels between Labeled Graphs. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 321-328, New York, NY, USA, 2003. AAAI Press. [ bib \| .pdf ] Keywords: biosvm
[Jansen2003Bayesian]	R. Jansen, H. Yu, D. Greenbaum, Y. Kluger, N.J. Krogan, S. Chung, A. Emili, M. Snyder, J.F. Greenblatt, and M. Gerstein. A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science, 302(5644):449-453, 2003. [ bib \| DOI \| http \| .pdf ] We have developed an approach using Bayesian networks to predict protein-protein interactions genome-wide in yeast. Our method naturally weights and combines into reliable predictions genomic features only weakly associated with interaction (e.g., mRNA coexpression, coessentiality, and colocalization). In addition to de novo predictions, it can integrate often noisy, experimental interaction data sets. We observe that at given levels of sensitivity, our predictions are more accurate than the existing high-throughput experimental data sets. We validate our predictions with new TAP-tagging (tandem affinity purification) experiments. Keywords: biogm
[Jambon2003New]	Martin Jambon, Anne Imberty, Gilbert Deléage, and Christophe Geourjon. A new bioinformatic approach to detect common 3d sites in protein structures. Proteins, 52(2):137-145, Aug 2003. [ bib \| DOI \| http ] An innovative bioinformatic method has been designed and implemented to detect similar three-dimensional (3D) sites in proteins. This approach allows the comparison of protein structures or substructures and detects local spatial similarities: this method is completely independent from the amino acid sequence and from the backbone structure. In contrast to already existing tools, the basis for this method is a representation of the protein structure by a set of stereochemical groups that are defined independently from the notion of amino acid. An efficient heuristic for finding similarities that uses graphs of triangles of chemical groups to represent the protein structures has been developed. The implementation of this heuristic constitutes a software named SuMo (Surfing the Molecules), which allows the dynamic definition of chemical groups, the selection of sites in the proteins, and the management and screening of databases. To show the relevance of this approach, we focused on two extreme examples illustrating convergent and divergent evolution. In two unrelated serine proteases, SuMo detects one common site, which corresponds to the catalytic triad. In the legume lectins family composed of >100 structures that share similar sequences and folds but may have lost their ability to bind a carbohydrate molecule, SuMo discriminates between functional and non-functional lectins with a selectivity of 96%. The time needed for searching a given site in a protein structure is typically 0.1 s on a PIII 800MHz/Linux computer; thus, in further studies, SuMo will be used to screen the PDB.
Keywords: Algorithms; Catalytic Domain; Chymotrypsin, chemistry/genetics; Computational Biology, methods; Evolution, Molecular; Fabaceae, chemistry; Models, Molecular; Plant Lectins, chemistry/genetics; Protein Conformation; Proteins, chemistry; Reproducibility of Results; Subtilisin, chemistry/genetics
[Imoto2003Bayesian]	S. Imoto, S. Kim, T. Goto, S. Miyano, S. Aburatani, K. Tashiro, and S. Kuhara. Bayesian network and nonparametric heteroscedastic regression for nonlinear modeling of genetic network. J. Bioinform. Comput. Biol., 1(2):231-252, Jul 2003. [ bib \| DOI \| http \| .pdf ] We propose a new statistical method for constructing a genetic network from microarray gene expression data by using a Bayesian network. An essential point of Bayesian network construction is the estimation of the conditional distribution of each random variable. We consider fitting nonparametric regression models with heterogeneous error variances to the microarray gene expression data to capture the nonlinear structures between genes. Selecting the optimal graph, which gives the best representation of the system among genes, is still a problem to be solved. We theoretically derive a new graph selection criterion from Bayes approach in general situations. The proposed method includes previous methods based on Bayesian networks. We demonstrate the effectiveness of the proposed method through the analysis of Saccharomyces cerevisiae gene expression data newly obtained by disrupting 100 genes. Keywords: biogm
[Ifantis2003nonlinear]	A. Ifantis and S. Papadimitriou. The nonlinear predictability of the electrotelluric field variations data analyzed with support vector machines as an earthquake precursor. Int J Neural Syst, 13(5):315-32, Oct 2003. [ bib ] This work investigates the nonlinear predictability of the Electro Telluric Field (ETF) variations data in order to develop new intelligent tools for the difficult task of earthquake prediction. Support Vector Machines trained on a signal window have been used to predict the next sample. We observe a significant increase at this short-term unpredictability of the ETF signal at about two weeks time period before the major earthquakes that took place in regions near the recording devices. The unpredictability increase can be attributed to a quick time variation of the dynamics that produce the ETF signal due to the earthquake generation process. Thus, this increase can be taken into advantage for signaling for an increased possibility of a large earthquake within the next few days in the neighboring region of the recording station. Keywords: Air Pollutants, Aircraft, Algorithms, Artificial Intelligence, Automated, Base Composition, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, Computing Methodologies, Cytosine, Data Interpretation, Databases, Enhancer Elements (Genetics), Environmental Monitoring, Ethanol, Exons, Fourier Transform Infrared, Genetic, Guanine, Humans, Image Interpretation, Natural Disasters, Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Online Systems, P.H.S., Pattern Recognition, Photography, Probability, Pyrimidines, RNA Precursors, RNA Splice Sites, RNA Splicing, Radiation, Reproducibility of Results, Research Support, Sensitivity and Specificity, Signal Processing, Spectroscopy, Statistical, Subtraction Technique, Thermodynamics, Time Factors, U.S. Gov't, Untranslated Regions, Video Recording, Walking, 14652873
[Hou2003Efficient]	Y. Hou, W. Hsu, M. L. Lee, and C. Bystroff. Efficient remote homology detection using local structure. Bioinformatics, 19(17):2294-2301, 2003. [ bib \| http \| .pdf ] Motivation: The function of an unknown biological sequence can often be accurately inferred if we are able to map this unknown sequence to its corresponding homologous family. At present, discriminative methods such as SVM-Fisher and SVM-pairwise, which combine support vector machine (SVM) and sequence similarity, are recognized as the most accurate methods, with SVM-pairwise being the most accurate. However, these methods typically encode sequence information into their feature vectors and ignore the structure information. They are also computationally inefficient. Based on these observations, we present an alternative method for SVM-based protein classification. Our proposed method, SVM-I-sites, utilizes structure similarity for remote homology detection. Result: We run experiments on the Structural Classification of Proteins 1.53 data set. The results show that SVM-I-sites is more efficient than SVM-pairwise. Further, we find that SVM-I-sites outperforms sequence-based methods such as PSI-BLAST, SAM, and SVM-Fisher while achieving a comparable performance with SVM-pairwise. Availability: I-sites server is accessible through the web at http://www.bioinfo.rpi.edu. Programs are available upon request for academics. Licensing agreements are available for commercial interests. The framework of encoding local structure into feature vector is available upon request. Keywords: biosvm
[Harborth2003Sequence]	J. Harborth, S. M. Elbashir, K. Vandenburgh, H. Manninga, S. A. Scaringe, K. Weber, and T. Tuschl. Sequence, chemical, and structural variation of small interfering RNAs and short hairpin RNAs and the effect on mammalian gene silencing. Antisense Nucleic Acid. Drug. Dev., 13(2):83-105, Apr 2003. [ bib \| DOI \| http ] Small interfering RNAs (siRNAs) induce sequence-specific gene silencing in mammalian cells and guide mRNA degradation in the process of RNA interference (RNAi). By targeting endogenous lamin A/C mRNA in human HeLa or mouse SW3T3 cells, we investigated the positional variation of siRNA-mediated gene silencing. We find cell-type-dependent global effects and cell-type-independent positional effects. HeLa cells were about 2-fold more responsive to siRNAs than SW3T3 cells but displayed a very similar pattern of positional variation of lamin A/C silencing. In HeLa cells, 26 of 44 tested standard 21-nucleotide (nt) siRNA duplexes reduced the protein expression by at least 90%, and only 2 duplexes reduced the lamin A/C proteins to <50%. Fluorescent chromophores did not perturb gene silencing when conjugated to the 5'-end or 3'-end of the sense siRNA strand and the 5'-end of the antisense siRNA strand, but conjugation to the 3'-end of the antisense siRNA abolished gene silencing. RNase-protecting phosphorothioate and 2'-fluoropyrimidine RNA backbone modifications of siRNAs did not significantly affect silencing efficiency, although cytotoxic effects were observed when every second phosphate of an siRNA duplex was replaced by phosphorothioate. Synthetic RNA hairpin loops were subsequently evaluated for lamin A/C silencing as a function of stem length and loop composition. As long as the 5'-end of the guide strand coincided with the 5'-end of the hairpin RNA, 19-29 base pair (bp) hairpins effectively silenced lamin A/C, but when the hairpin started with the 5'-end of the sense strand, only 21-29 bp hairpins were highly active. 
Keywords: Adaptor Protein Complex alpha Subunits, Animal, Animals, Antisense, Apolipoproteins B, Base Sequence, Biological Transport, Blotting, Catalytic, Cell Line, Cell Membrane, Cell Survival, Chemical, Cholesterol, Clathrin, Clathrin Heavy Chains, Disease Models, Endocytosis, Epidermal Growth Factor, Fluorescence, Gene Expression Profiling, Gene Silencing, Gene Therapy, Hela Cells, Humans, Injections, Intravenous, Jejunum, Kinetics, Lamin Type A, Liver, Messenger, Metabolic Syndrome X, Mice, Microscopy, Models, Molecular Sequence Data, NIH 3T3 Cells, Non-U.S. Gov't, Nucleic Acid, Oligonucleotides, Open Reading Frames, Post-Transcriptional, Protein Isoforms, Pyrimidines, RNA, RNA Interference, RNA Processing, RNA Stability, Research Support, Reverse Transcriptase Polymerase Chain Reaction, Sensitivity and Specificity, Sequence Homology, Small Interfering, Subcellular Fractions, Swiss 3T3 Cells, Thionucleotides, Time Factors, Transfection, Transferrin, Transgenic, Tumor, Western, 12804036
[Gururaja2003Multiple]	T. Gururaja, W. Li, W.S. Noble, D.G. Payan, and D.C. Anderson. Multiple functional categories of proteins identified in an in vitro cellular ubiquitin affinity extract using shotgun peptide sequencing. J Proteome Res, 2(394-404):394-404, 2003. [ bib \| .pdf ] Using endogenous human cellular ubiquitin system enzymes and added his-tagged ubiquitin, ATP, and an ATP-regenerating system, we labelled cellular proteins with hexahistidine tagged ubiquitin in vitro. Labeling was dependent on ATP and the ATP recycling system, on the proteasome inhibitor MG132 and the ubiquitin protease inhibitor ubiquitin aldehyde, and was inhibited by iodoacetamide. Labeled proteins were affinity extracted in quadruplicate and tryptic peptides identified by 2D capillary LC/MS/MS combined with SEQUEST and MEDUSA analyses. Support vector machine analysis of the mass spectrometry data allowed prediction of correct matches between mass spectrometry data and peptide sequences. Overall, 144 proteins were identified by peptides predicted to be correctly sequenced, and 113 were identified by at least three peptides or one or two peptides with at least an 80 Identified proteins included 22 proteasome subunits or associated proteins, 18 E1, E2 or E3 ubiquitin system enzymes or related proteins, and four ubiquitin domain proteins. Seventeen directly ubiquitinated proteins or proteins associated with the ubiquitin system were identified. Functional clusters of other proteins included redox enzymes, proteins associated with endocytosis, cytoskeletal proteins, DNA damage or repair related proteins, calcium binding proteins, and splicing factor and related proteins, suggesting that in vitro ubiquitination is not random, and that these functions may be regulated by the ubiquitin system. This map of cellular ubiquitinated proteins and their interacting proteins will be useful for further studies of ubiquitin system function. Keywords: biosvm
[Gordon2003Sequence]	L. Gordon, A. Y. Chervonenkis, A. J. Gammerman, I. A. Shahmuradov, and V. V. Solovyev. Sequence alignment kernel for recognition of promoter regions. Bioinformatics, 19(15):1964-1971, 2003. [ bib \| http \| .pdf ] In this paper we propose a new method for recognition of prokaryotic promoter regions with startpoints of transcription. The method is based on Sequence Alignment Kernel, a function reflecting the quantitative measure of match between two sequences. This kernel function is further used in Dual SVM, which performs the recognition. Several recognition methods have been trained and tested on positive data set, consisting of 669 sigma70-promoter regions with known transcription startpoints of Escherichia coli and two negative data sets of 709 examples each, taken from coding and non-coding regions of the same genome. The results show that our method performs well and achieves 16.5 data and 18.6 data. Availability: The demo version of our method is accessible from our website http://mendel.cs.rhul.ac.uk/ Keywords: biosvm
[Gomez2003Learning]	S. M. Gomez, W. S. Noble, and A. Rzhetsky. Learning to predict protein-protein interactions from protein sequences. Bioinformatics, 19(15):1875-1881, 2003. [ bib \| http \| .pdf ] In order to understand the molecular machinery of the cell, we need to know about the multitude of protein-protein interactions that allow the cell to function. High-throughput technologies provide some data about these interactions, but so far that data is fairly noisy. Therefore, computational techniques for predicting protein-protein interactions could be of significant value. One approach to predicting interactions in silico is to produce from first principles a detailed model of a candidate interaction. We take an alternative approach, employing a relatively simple model that learns dynamically from a large collection of data. In this work, we describe an attraction-repulsion model, in which the interaction between a pair of proteins is represented as the sum of attractive and repulsive forces associated with small, domain- or motif-sized features along the length of each protein. The model is discriminative, learning simultaneously from known interactions and from pairs of proteins that are known (or suspected) not to interact. The model is efficient to compute and scales well to very large collections of data. In a cross-validated comparison using known yeast interactions, the attraction-repulsion method performs better than several competing techniques. Keywords: biosvm
[Ge2003Reducing]	Xijin Ge, Shuichi Tsutsumi, Hiroyuki Aburatani, and Shuichi Iwata. Reducing false positives in molecular pattern recognition. Genome Inform Ser Workshop Genome Inform, 14:34-43, 2003. [ bib ] In the search for new cancer subtypes by gene expression profiling, it is essential to avoid misclassifying samples of unknown subtypes as known ones. In this paper, we evaluated the false positive error rates of several classification algorithms through a 'null test' by presenting classifiers a large collection of independent samples that do not belong to any of the tumor types in the training dataset. The benchmark dataset is available at www2.genome.rcast.u-tokyo.ac.jp/pm/. We found that k-nearest neighbor (KNN) and support vector machine (SVM) have very high false positive error rates when fewer genes (<100) are used in prediction. The error rate can be partially reduced by including more genes. On the other hand, prototype matching (PM) method has a much lower false positive error rate. Such robustness can be achieved without loss of sensitivity by introducing suitable measures of prediction confidence. We also proposed a cluster-and-select technique to select genes for classification. The nonparametric Kruskal-Wallis H test is employed to select genes differentially expressed in multiple tumor types. To reduce the redundancy, we then divided these genes into clusters with similar expression patterns and selected a given number of genes from each cluster. The reliability of the new algorithm is tested on three public datasets. 
Keywords: Amino Acid Sequence, Amino Acids, Animals, Automated, Base Sequence, Bayes Theorem, Biological, Carbohydrate Conformation, Carbohydrate Sequence, Cattle, Computational Biology, Computer Simulation, Crystallography, DNA, Databases, Factual, False Positive Reactions, Gene Expression Profiling, Genes, Genetic, Genetic Techniques, Genome, Histocompatibility Antigens Class I, Human, Humans, Introns, Least-Squares Analysis, MHC Class I, Major Histocompatibility Complex, Markov Chains, Messenger, Mice, Models, Monosaccharides, Neoplasms, Non-U.S. Gov't, Nonparametric, Pattern Recognition, Peptides, Phylogeny, Plants, Poly A, Polysaccharides, Predictive Value of Tests, Protein, Protein Structure, Proteins, RNA, Rats, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Secondary, Sequence Alignment, Software, Species Specificity, Statistics, Theoretical, X-Ray, 15706518
[Garrett2003Comparison]	D. Garrett, D. A Peterson, C. Anderson, and M. Thaut. Comparison of linear, nonlinear, and feature selection methods for EEG signal classification. IEEE Trans Neural Syst Rehabil Eng, 11(2):141-4, Jun 2003. [ bib ] The reliable operation of brain-computer interfaces (BCIs) based on spontaneous electroencephalogram (EEG) signals requires accurate classification of multichannel EEG. The design of EEG representations and classifiers for BCI are open research questions whose difficulty stems from the need to extract complex spatial and temporal patterns from noisy multidimensional time series obtained from EEG measurements. The high-dimensional and noisy nature of EEG may limit the advantage of nonlinear classification methods over linear ones. This paper reports the results of a linear (linear discriminant analysis) and two nonlinear classifiers (neural networks and support vector machines) applied to the classification of spontaneous EEG during five mental tasks, showing that nonlinear classifiers produce only slightly better classification results. An approach to feature selection based on genetic algorithms is also presented with preliminary results of application to EEG during finger movement. Keywords: 80 and over, Adnexal Diseases, Adult, Aged, Algorithms, Artificial Intelligence, Automated, Bayes Theorem, Biological, Brain, Brain Mapping, Breast Neoplasms, Case-Control Studies, Chromatography, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA, Diagnosis, Differential, Discriminant Analysis, Electroencephalography, Evoked Potentials, Feasibility Studies, Female, Fingers, Gene Expression Profiling, Gene Expression Regulation, Genetic, Genetic Markers, Genetic Predisposition to Disease, Genetic Screening, Habituation (Psychophysiology), High Pressure Liquid, Humans, Linear Models, Logistic Models, Male, Middle Aged, Migraine, Models, Movement, Neural Networks (Computer), Neurological, Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Nucleosides, Ovarian Neoplasms, Pattern Recognition, Photic Stimulation, Predictive Value of Tests, ROC Curve, Reproducibility of Results, Research Support, Sensitivity and Specificity, Signal Processing, Software, Statistical, Thinking, Tumor Markers, U.S. Gov't, User-Computer Interface, Visual, 12899257
[Furlanello2003Entropy-based]	C. Furlanello, M. Serafini, S. Merler, and G. Jurman. Entropy-based gene ranking without selection bias for the predictive classification of microarray data. BMC Bioinformatics, 4(54), 2003. [ bib \| DOI \| http \| .pdf ] Background We describe the E-RFE method for gene ranking, which is useful for the identification of markers in the predictive classification of array data. The method supports a practical modeling scheme designed to avoid the construction of classification rules based on the selection of too small gene subsets (an effect known as the selection bias, in which the estimated predictive errors are too optimistic due to testing on samples already considered in the feature selection process). Results With E-RFE, we speed up the recursive feature elimination (RFE) with SVM classifiers by eliminating chunks of uninteresting genes using an entropy measure of the SVM weights distribution. An optimal subset of genes is selected according to a two-strata model evaluation procedure: modeling is replicated by an external stratified-partition resampling scheme, and, within each run, an internal K-fold cross-validation is used for E-RFE ranking. Also, the optimal number of genes can be estimated according to the saturation of Zipf's law profiles. Conclusions Without a decrease of classification accuracy, E-RFE allows a speed-up factor of 100 with respect to standard RFE, while improving on alternative parametric RFE reduction strategies. Thus, a process for gene selection and error estimation is made practical, ensuring control of the selection bias, and providing additional diagnostic indicators of gene importance. Keywords: biosvm
[Driel2003new]	M. van Driel, K. Cuelenaere, P.P.C.W. Kemmeren, J.A.M. Leunissen, and H.G. Brunner. A new web-based data mining tool for the identification of candidate genes for human genetic disorders. Eur. J. Hum. Genet., 11(1):57-63, Jan 2003. [ bib \| DOI \| http ] To identify the gene underlying a human genetic disorder can be difficult and time-consuming. Typically, positional data delimit a chromosomal region that contains between 20 and 200 genes. The choice then lies between sequencing large numbers of genes, or setting priorities by combining positional data with available expression and phenotype data, contained in different internet databases. This process of examining positional candidates for possible functional clues may be performed in many different ways, depending on the investigator's knowledge and experience. Here, we report on a new tool called the GeneSeeker, which gathers and combines positional data and expression/phenotypic data in an automated way from nine different web-based databases. This results in a quick overview of interesting candidate genes in the region of interest. The GeneSeeker system is built in a modular fashion allowing for easy addition or removal of databases if required. Databases are searched directly through the web, which obviates the need for data warehousing. In order to evaluate the GeneSeeker tool, we analysed syndromes with known genesis. For each of 10 syndromes the GeneSeeker programme generated a shortlist that contained a significantly reduced number of candidate genes from the critical region, yet still contained the causative gene. On average, a list of 163 genes based on position alone was reduced to a more manageable list of 22 genes based on position and expression or phenotype information. We are currently expanding the tool by adding other databases. The GeneSeeker is available via the web-interface (http://www.cmbi.kun.nl/GeneSeeker/). 
Keywords: Computational Biology; Databases, Genetic; Databases, Nucleic Acid; Gene Expression; Genetic Diseases, Inborn; Humans; Internet; Noonan Syndrome; Software
[Donaldson2003PreBIND]	I. Donaldson, J. Martin, B. de Bruijn, C. Wolting, V. Lay, B. Tuekam, S. Zhang, B. Baskin, G.D. Bader, K. Michalickova, T. Pawson, and C.W.V. Hogue. PreBIND and Textomy - mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics, 4(1):11, Mar 2003. [ bib \| DOI \| http \| .pdf ] Background The majority of experimentally verified molecular interaction and biological pathway data are present in the unstructured text of biomedical journal articles where they are inaccessible to computational methods. The Biomolecular interaction network database (BIND) seeks to capture these data in a machine-readable format. We hypothesized that the formidable task-size of backfilling the database could be reduced by using Support Vector Machine technology to first locate interaction information in the literature. We present an information extraction system that was designed to locate protein-protein interaction data in the literature and present these data to curators and the public for review and entry into BIND. Results Cross-validation estimated the support vector machine's test-set precision, accuracy and recall for classifying abstracts describing interaction information was 92 would be able to recall up to 60 present in another yeast-protein interaction database. Finally, this system was applied to a real-world curation problem and its use was found to reduce the task duration by 70 days. Conclusions Machine learning methods are useful as tools to direct interaction and pathway database back-filling; however, this potential can only be realized if these techniques are coupled with human review and entry into a factual database such as BIND. The PreBIND system described here is available to the public at http://bind.ca. Current capabilities allow searching for human, mouse and yeast protein-interaction information. Keywords: biosvm
[Dobson2003Distinguishing]	P.D. Dobson and A.J. Doig. Distinguishing enzyme structures from non-enzymes without alignments. J. Mol. Biol., 330(4):771-783, 2003. [ bib \| DOI \| http \| .pdf ] The ability to predict protein function from structure is becoming increasingly important as the number of structures resolved is growing more rapidly than our capacity to study function. Current methods for predicting protein function are mostly reliant on identifying a similar protein of known function. For proteins that are highly dissimilar or are only similar to proteins also lacking functional annotations, these methods fail. Here, we show that protein function can be predicted as enzymatic or not without resorting to alignments. We describe 1178 high-resolution proteins in a structurally non-redundant subset of the Protein Data Bank using simple features such as secondary-structure content, amino acid propensities, surface properties and ligands. The subset is split into two functional groupings, enzymes and non-enzymes. We use the support vector machine-learning algorithm to develop models that are capable of assigning the protein class. Validation of the method shows that the function can be predicted to an accuracy of 77 protein. An adaptive search of possible subsets of features produces a simplified model based on 36 features that predicts at an accuracy of 80 avoid calculating alignments and predict a recently released set of unrelated proteins. The most useful features for distinguishing enzymes from non-enzymes are secondary-structure content, amino acid frequencies, number of disulphide bonds and size of the largest cleft. This method is applicable to any structure as it does not require the identification of sequence or structural similarity to a protein of known function. Keywords: biosvm
[Dieterle2003Urinary]	Frank Dieterle, Silvia Müller-Hagedorn, Hartmut M Liebich, and Günter Gauglitz. Urinary nucleosides as potential tumor markers evaluated by learning vector quantization. Artif. Intell. Med., 28(3):265-79, Jul 2003. [ bib \| DOI \| http \| .pdf ] Modified nucleosides were recently presented as potential tumor markers for breast cancer. The patterns of the levels of urinary nucleosides are different for tumor bearing individuals and for healthy individuals. Thus, a powerful pattern recognition method is needed. Although backpropagation (BP) neural networks are becoming increasingly common in medical literature for pattern recognition, it has been shown that often-superior methods exist like learning vector quantization (LVQ) and support vector machines (SVM). The aim of this feasibility study is to get an indication of the performance of urinary nucleoside levels evaluated by LVQ in contrast to the evaluation the popular BP and SVM networks. Urine samples were collected from female breast cancer patients and from healthy females. Twelve different ribonucleosides were isolated and quantified by a high performance liquid chromatography (HPLC) procedure. LVQ, SVM and BP networks were trained and the performance was evaluated by the classification of the test sets into the categories "cancer" and "healthy". All methods showed a good classification with a sensitivity ranging from 58.8 to 70.6% at a specificity of 88.4-94.2% for the test patterns. Although the classification performance of all methods is comparable, the LVQ implementations are superior in terms of more qualitative features: the results of LVQ networks are more reproducible, as the initialization is deterministic. The LVQ networks can be trained by unbalanced sizes of the different classes. LVQ networks are fast during training, need only few parameters adjusted for training and can be retrained by patterns of "local individuals". 
As at least some of these features play an important role in an implementation into a medical decision support system, it is recommended to use LVQ for an extended study. Keywords: 80 and over, Adnexal Diseases, Adult, Aged, Algorithms, Artificial Intelligence, Automated, Bayes Theorem, Biological, Breast Neoplasms, Case-Control Studies, Chromatography, Comparative Study, Computational Biology, Computer-Assisted, Diagnosis, Differential, Feasibility Studies, Female, High Pressure Liquid, Humans, Logistic Models, Middle Aged, Neural Networks (Computer), Non-U.S. Gov't, Nucleosides, Ovarian Neoplasms, Pattern Recognition, Predictive Value of Tests, ROC Curve, Reproducibility of Results, Research Support, Sensitivity and Specificity, Tumor Markers, 12927336
[Diekman2003Hybrid]	Casey Diekman, Wei He, Nagabhushana Prabhu, and Harvey Cramer. Hybrid methods for automated diagnosis of breast tumors. Anal Quant Cytol Histol, 25(4):183-90, Aug 2003. [ bib ] OBJECTIVE: To design and analyze a new family of hybrid methods for the diagnosis of breast tumors using fine needle aspirates. STUDY DESIGN: We present a radically new approach to the design of diagnosis systems. In the new approach, a nonlinear classifier with high sensitivity but low specificity is hybridized with a linear classifier having low sensitivity but high specificity. Data from the Wisconsin Breast Cancer Database are used to evaluate, computationally, the performance of the hybrid classifiers. RESULTS: The diagnosis scheme obtained by hybridizing the nonlinear classifier ellipsoidal multisurface method (EMSM) with the linear classifier proximal support vector machine (PSVM) was found to have a mean sensitivity of 97.36% and a mean specificity of 95.14% and was found to yield a 2.44% improvement in the reliability of positive diagnosis over that of EMSM at the expense of 0.4% degradation in the reliability of negative diagnosis, again compared to EMSM. At the 95% confidence level we can trust the hybrid method to be 96.19-98.53% correct in its malignant diagnosis of new tumors and 93.57-96.71% correct in its benign diagnosis. CONCLUSION: Hybrid diagnosis schemes represent a significant paradigm shift and provide a promising new technique to improve the specificity of nonlinear classifiers without seriously affecting the high sensitivity of nonlinear classifiers. 
Keywords: Algorithms, Amino Acid Sequence, Amino Acids, Anion Exchange Resins, Antigen-Antibody Complex, Artificial Intelligence, Automated, Automatic Data Processing, Benchmarking, Biological, Biological Markers, Biopsy, Blood Cells, Blood Proteins, Breast Neoplasms, Cell Line, Cellular Structures, Chemical, Chromatography, Chromosome Aberrations, Cluster Analysis, Colonic Neoplasms, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, Computing Methodologies, DNA, Data Interpretation, Databases, Decision Making, Decision Trees, Diagnosis, Diffusion Magnetic Resonance Imaging, Disease, English Abstract, Epitopes, Expert Systems, Factual, Female, Fine-Needle, Fusion, Fuzzy Logic, Gene Expression Profiling, Gene Expression Regulation, Gene Targeting, Genetic, Genome, Histocompatibility Antigens Class I, Humans, Hydrogen Bonding, Hydrophobicity, Image Interpretation, Image Processing, In Vitro, Indicators and Reagents, Information Storage and Retrieval, Ion Exchange, Least-Squares Analysis, Leiomyosarcoma, Liver Cirrhosis, Lung Neoplasms, Magnetic Resonance Imaging, Male, Mass, Mathematical Computing, Matrix-Assisted Laser Desorption-Ionization, Models, Molecular, Molecular Sequence Data, Neoplasm Proteins, Neoplasms, Neoplastic, Nephroblastoma, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Nucleic Acid Conformation, Nucleic Acid Hybridization, Oligonucleotide Array Sequence Analysis, Oncogene Proteins, Ovarian Neoplasms, P.H.S., Pattern Recognition, Predictive Value of Tests, Prostatic Neoplasms, Protein, Protein Binding, Protein Interaction Mapping, Protein Structure, Proteins, Proteome, Quantitative Structure-Activity Relationship, RNA, ROC Curve, Reproducibility of Results, Research Support, Rhabdomyosarcoma, Secondary, Sensitivity and Specificity, Sequence Alignment, Sequence Analysis, Severity of Illness Index, Software, Solubility, Spectrometry, Statistical, Structure-Activity Relationship, Subcellular Fractions, Subtraction Technique, T-Lymphocyte, Tissue Distribution, Transcription Factors, Transfer, Treatment Outcome, Tumor, Tumor Markers, U.S. Gov't, User-Computer Interface, 12961824
[Chan2003Detection]	Ian Chan, William Wells, Robert V Mulkern, Steven Haker, Jianqing Zhang, Kelly H Zou, Stephan E Maier, and Clare M C Tempany. Detection of prostate cancer by integration of line-scan diffusion, T2-mapping and T2-weighted magnetic resonance imaging; a multichannel statistical classifier. Med Phys, 30(9):2390-8, Sep 2003. [ bib \| .pdf ] A multichannel statistical classifier for detecting prostate cancer was developed and validated by combining information from three different magnetic resonance (MR) methodologies: T2-weighted, T2-mapping, and line scan diffusion imaging (LSDI). From these MR sequences, four different sets of image intensities were obtained: T2-weighted (T2W) from T2-weighted imaging, Apparent Diffusion Coefficient (ADC) from LSDI, and proton density (PD) and T2 (T2 Map) from T2-mapping imaging. Manually segmented tumor labels from a radiologist, which were validated by biopsy results, served as tumor "ground truth." Textural features were extracted from the images using co-occurrence matrix (CM) and discrete cosine transform (DCT). Anatomical location of voxels was described by a cylindrical coordinate system. A statistical jack-knife approach was used to evaluate our classifiers. Single-channel maximum likelihood (ML) classifiers were based on 1 of the 4 basic image intensities. Our multichannel classifiers: support vector machine (SVM) and Fisher linear discriminant (FLD), utilized five different sets of derived features. Each classifier generated a summary statistical map that indicated tumor likelihood in the peripheral zone (PZ) of the prostate gland. To assess classifier accuracy, the average areas under the receiver operator characteristic (ROC) curves over all subjects were compared. Our best FLD classifier achieved an average ROC area of 0.839(+/-0.064), and our best SVM classifier achieved an average ROC area of 0.761(+/-0.043). 
The T2W ML classifier, our best single-channel classifier, only achieved an average ROC area of 0.599(+/-0.146). Compared to the best single-channel ML classifier, our best multichannel FLD and SVM classifiers have statistically superior ROC performance (P=0.0003 and 0.0017, respectively) from pairwise two-sided t-test. By integrating the information from multiple images and capturing the textural and anatomical features in tumor areas, summary statistical maps can potentially aid in image-guided prostate biopsy and assist in guiding and controlling delivery of localized therapy under image guidance. Keywords: Algorithms, Anion Exchange Resins, Antigen-Antibody Complex, Artificial Intelligence, Automated, Automatic Data Processing, Biological, Blood Cells, Chemical, Chromatography, Cluster Analysis, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, Data Interpretation, Databases, Decision Making, Decision Trees, Diffusion Magnetic Resonance Imaging, English Abstract, Epitopes, Expert Systems, Factual, Fuzzy Logic, Gene Expression Profiling, Gene Expression Regulation, Gene Targeting, Genome, Histocompatibility Antigens Class I, Humans, Image Interpretation, Image Processing, In Vitro, Indicators and Reagents, Information Storage and Retrieval, Ion Exchange, Least-Squares Analysis, Liver Cirrhosis, Magnetic Resonance Imaging, Male, Models, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Nucleic Acid Conformation, P.H.S., Pattern Recognition, Prostatic Neoplasms, Protein, Protein Binding, Protein Interaction Mapping, Proteins, Proteome, Quantitative Structure-Activity Relationship, RNA, ROC Curve, Reproducibility of Results, Research Support, Sensitivity and Specificity, Sequence Analysis, Severity of Illness Index, Statistical, Structure-Activity Relationship, Subtraction Technique, T-Lymphocyte, Transcription Factors, Transfer, Treatment Outcome, U.S. Gov't, User-Computer Interface, 14528961
[Cai2003Support]	Y.-D. Cai, G.-P. Zhou, and K.-C. Chou. Support Vector Machines for Predicting Membrane Protein Types by Using Functional Domain Composition. Biophys. J., 84(5):3257-3263, 2003. [ bib \| http \| .pdf ] Membrane proteins are generally classified into the following five types: 1), type I membrane protein; 2), type II membrane protein; 3), multipass transmembrane proteins; 4), lipid chain-anchored membrane proteins; and 5), GPI-anchored membrane proteins. In this article, based on the concept of using the functional domain composition to define a protein, the Support Vector Machine algorithm is developed for predicting the membrane protein type. High success rates are obtained by both the self-consistency and jackknife tests. The current approach, complemented with the powerful covariant discriminant algorithm based on the pseudo-amino acid composition that has incorporated quasi-sequence-order effect as recently proposed by K. C. Chou (2001), may become a very useful high-throughput tool in the area of bioinformatics and proteomics. Keywords: biosvm
[Cai2003Supportb]	Y.D. Cai, X.J. Liu, X.B. Xu, and K.C. Chou. Support vector machines for prediction of protein domain structural class. J. Theor. Biol., 221(1):115-120, 2003. [ bib \| DOI \| http \| .pdf ] The support vector machines (SVMs) method was introduced for predicting the structural class of protein domains. The results obtained through the self-consistency test, jack-knife test, and independent dataset test have indicated that the current method and the elegant component-coupled algorithm developed by Chou and co-workers, if effectively complemented with each other, may become a powerful tool for predicting the structural class of protein domains. Keywords: biosvm
[Cai2003Prediction]	Y.D. Cai, X.J. Liu, Y.X. Li, X.B. Xu, and K.C. Chou. Prediction of beta-turns with learning machines. Peptides, 24(5):665-669, 2003. [ bib \| DOI \| http \| .pdf ] The support vector machine approach was introduced to predict the beta-turns in proteins. The overall self-consistency rate by the re-substitution test for the training or learning dataset reached 100 were taken from Chou [J. Pept. Res. 49 (1997) 120]. The success prediction rates by the jackknife test for the beta-turn subset of 455 tetrapeptides and non-beta-turn subset of 3807 tetrapeptides in the training dataset were 58.1 and 98.4 success rates with the independent dataset test for the beta-turn subset of 110 tetrapeptides and non-beta-turn subset of 30,231 tetrapeptides were 69.1 and 97.3 study support the conclusion that the residue-coupled effect along a tetrapeptide is important for the formation of a beta-turn. Keywords: biosvm
[Cai2003Supporta]	Y.D. Cai, S.L. Lin, and K.C. Chou. Support vector machines for prediction of protein signal sequences and their cleavage sites. Peptides, 24(1):159-161, 2003. [ bib \| DOI \| .pdf ] Given a nascent protein sequence, how can one predict its signal peptide or "Zipcode" sequence? This is an important problem for scientists to use signal peptides as a vehicle to find new drugs or to reprogram cells for gene therapy (see, e.g. [7] K.C. Chou, Current Protein and Peptide Science 2002;3:615-22). In this paper, support vector machines (SVMs), a new machine learning method, is applied to approach this problem. The overall rate of correct prediction for 1939 secretary proteins and 1440 nonsecretary proteins was over 91 may also serve as a useful tool for further investigating many unclear details regarding the molecular mechanism of the ZIP code protein-sorting system in cells. Keywords: biosvm
[Cai2003Supportd]	Y.D. Cai and S.L. Lin. Support vector machines for predicting rRNA-, RNA-, and DNA-binding proteins from amino acid sequence. Biochim. Biophys. Acta, 1648(1-2):127-133, 2003. [ bib \| DOI \| http \| .pdf ] Classification of gene function remains one of the most important and demanding tasks in the post-genome era. Most of the current predictive computer methods rely on comparing features that are essentially linear to the protein sequence. However, features of a protein nonlinear to the sequence may also be predictive to its function. Machine learning methods, for instance the Support Vector Machines (SVMs), are particularly suitable for exploiting such features. In this work we introduce SVM and the pseudo-amino acid composition, a collection of nonlinear features extractable from protein sequence, to the field of protein function prediction. We have developed prototype SVMs for binary classification of rRNA-, RNA-, and DNA-binding proteins. Using a protein's amino acid composition and limited range correlation of hydrophobicity and solvent accessible surface area as input, each of the SVMs predicts whether the protein belongs to one of the three classes. In self-consistency and cross-validation tests, which measures the success of learning and prediction, respectively, the rRNA-binding SVM has consistently achieved >95 The RNA- and DNA-binding SVMs demonstrate more diverse accuracy, ranging from approximately 76 the test results suggests the directions of improving the SVMs. Keywords: biosvm
[Cai2003Supportc]	Y.D. Cai, K.Y. Feng, Y.X. Li, and K.C. Chou. Support vector machine for predicting alpha-turn types. Peptides, 24(4):629-630, 2003. [ bib \| DOI \| http \| .pdf ] Tight turns play an important role in globular proteins from both the structural and functional points of view. Of tight turns, beta-turns and gamma-turns have been extensively studied, but alpha-turns were little investigated. Recently, a systematic search for alpha-turns classified alpha-turns into nine different types according to their backbone trajectory features. In this paper, Support Vector Machines (SVMs), a new machine learning method, is proposed for predicting the alpha-turn types in proteins. The high rates of correct prediction imply that the formation of different alpha-turn types is evidently correlated with the sequence of a pentapeptide, and hence can be approximately predicted based on the sequence information of the pentapeptide alone, although the incorporation of its interaction with the other part of a protein, the so-called "long distance interaction", will further improve the prediction quality. Keywords: biosvm
[Cai2003SVM-Prot]	C. Z. Cai, L. Y. Han, Z. L. Ji, X. Chen, and Y. Z. Chen. SVM-Prot: Web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res, 31(13):3692-7, Jul 2003. [ bib \| http \| .pdf ] Prediction of protein function is of significance in studying biological processes. One approach for function prediction is to classify a protein into functional family. Support vector machine (SVM) is a useful method for such classification, which may involve proteins with diverse sequence distribution. We have developed a web-based software, SVMProt, for SVM classification of a protein into functional family from its primary sequence. SVMProt classification system is trained from representative proteins of a number of functional families and seed proteins of Pfam curated protein families. It currently covers 54 functional families and additional families will be added in the near future. The computed accuracy for protein family classification is found to be in the range of 69.1-99.6%. SVMProt shows a certain degree of capability for the classification of distantly related proteins and homologous proteins of different function and thus may be used as a protein function prediction tool that complements sequence alignment methods. SVMProt can be accessed at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi. Keywords: biosvm
[Cai2003Protein]	C.Z. Cai, W.L. Wang, L.Z. Sun, and Y.Z. Chen. Protein function classification via support vector machine approach. Math. Biosci., 185(2):111-122, 2003. [ bib \| DOI \| .pdf ] Support vector machine (SVM) is introduced as a method for the classification of proteins into functionally distinguished classes. Studies are conducted on a number of protein classes including RNA-binding proteins; protein homodimers, proteins responsible for drug absorption, proteins involved in drug distribution and excretion, and drug metabolizing enzymes. Testing accuracy for the classification of these protein classes is found to be in the range of 84-96 usefulness of SVM in the classification of protein functional classes and its potential application in protein function prediction. Keywords: biosvm
[Byvatov2003Support]	E. Byvatov and G. Schneider. Support vector machine applications in bioinformatics. Appl Bioinformatics, 2(2):67-77, 2003. [ bib ] The support vector machine (SVM) approach represents a data-driven method for solving classification tasks. It has been shown to produce lower prediction error compared to classifiers based on other methods like artificial neural networks, especially when large numbers of features are considered for sample description. In this review, the theory and main principles of the SVM approach are outlined, and successful applications in traditional areas of bioinformatics research are described. Current developments in techniques related to the SVM approach are reviewed which might become relevant for future functional genomics and chemogenomics projects. In a comparative study, we developed neural network and SVM models to identify small organic molecules that potentially modulate the function of G-protein coupled receptors. The SVM system was able to correctly classify approximately 90% of the compounds in a cross-validation study yielding a Matthews correlation coefficient of 0.78. This classifier can be used for fast filtering of compound libraries in virtual screening applications. Keywords: biosvm
[Byvatov2003Comparison]	E. Byvatov, U. Fechner, J. Sadowski, and G. Schneider. Comparison of support vector machine and artificial neural network systems for drug/nondrug classification. J Chem Inf Comput Sci, 43(6):1882-9, 2003. [ bib \| DOI \| http \| .pdf ] Support vector machine (SVM) and artificial neural network (ANN) systems were applied to a drug/nondrug classification problem as an example of binary decision problems in early-phase virtual compound filtering and screening. The results indicate that solutions obtained by SVM training seem to be more robust with a smaller standard error compared to ANN training. Generally, the SVM classifier yielded slightly higher prediction accuracy than ANN, irrespective of the type of descriptors used for molecule encoding, the size of the training data sets, and the algorithm employed for neural network training. The performance was compared using various different descriptor sets and descriptor combinations based on the 120 standard Ghose-Crippen fragment descriptors, a wide range of 180 different properties and physicochemical descriptors from the Molecular Operating Environment (MOE) package, and 225 topological pharmacophore (CATS) descriptors. For the complete set of 525 descriptors cross-validated classification by SVM yielded 82% correct predictions (Matthews cc = 0.63), whereas ANN reached 80% correct predictions (Matthews cc = 0.58). Although SVM outperformed the ANN classifiers with regard to overall prediction accuracy, both methods were shown to complement each other, as the sets of true positives, false positives (overprediction), true negatives, and false negatives (underprediction) produced by the two classifiers were not identical. The theory of SVM and ANN training is briefly reviewed. Keywords: biosvm chemoinformatics
[Bock2003Whole-proteome]	J. R. Bock and D. A. Gough. Whole-proteome interaction mining. Bioinformatics, 19(1):125-134, 2003. [ bib \| http \| .pdf ] Motivation: A major post-genomic scientific and technological pursuit is to describe the functions performed by the proteins encoded by the genome. One strategy is to first identify the protein-protein interactions in a proteome, then determine pathways and overall structure relating these interactions, and finally to statistically infer functional roles of individual proteins. Although huge amounts of genomic data are at hand, current experimental protein interaction assays must overcome technical problems to scale-up for high-throughput analysis. In the meantime, bioinformatics approaches may help bridge the information gap required for inference of protein function. In this paper, a previously described data mining approach to prediction of protein-protein interactions (Bock and Gough, 2001, Bioinformatics, 17, 455-460) is extended to interaction mining on a proteome-wide scale. An algorithm (the phylogenetic bootstrap) is introduced, which suggests traversal of a phenogram, interleaving rounds of computation and experiment, to develop a knowledge base of protein interactions in genetically-similar organisms. Results: The interaction mining approach was demonstrated by building a learning system based on 1,039 experimentally validated protein-protein interactions in the human gastric bacterium Helicobacter pylori. An estimate of the generalization performance of the classifier was derived from 10-fold cross-validation, which indicated expected upper bounds on precision of 80 One such organism is the enteric pathogen Campylobacter jejuni, in which comprehensive machine learning prediction of all possible pairwise protein-protein interactions was performed. 
The resulting network of interactions shares an average protein connectivity characteristic in common with previous investigations reported in the literature, offering strong evidence supporting the biological feasibility of the hypothesized map. For inferences about complete proteomes in which the number of pairwise non-interactions is expected to be much larger than the number of actual interactions, we anticipate that the sensitivity will remain the same but precision may decrease. We present specific biological examples of two subnetworks of protein-protein interactions in C. jejuni resulting from the application of this approach, including elements of a two-component signal transduction systems for thermoregulation, and a ferritin uptake network. Contact: dgough@bioeng.ucsd.edu Keywords: biosvm
[Ben-Hur2003Remote]	A. Ben-Hur and D. Brutlag. Remote homology detection: a motif based approach. Bioinformatics, 19(Suppl. 1):i26-i33, 2003. [ bib \| http \| .pdf ] Motivation: Remote homology detection is the problem of detecting homology in cases of low sequence similarity. It is a hard computational problem with no approach that works well in all cases. Results: We present a method for detecting remote homology that is based on the presence of discrete sequence motifs. The motif content of a pair of sequences is used to define a similarity that is used as a kernel for a Support Vector Machine (SVM) classifier. We test the method on two remote homology detection tasks: prediction of a previously unseen SCOP family and prediction of an enzyme class given other enzymes that have a similar function on other substrates. We find that it performs significantly better than an SVM method that uses BLAST or Smith-Waterman similarity scores as features. Availability: The software is available from the authors upon request. Keywords: biosvm
[Beerenwinkel2003Methods]	N. Beerenwinkel, T. Lengauer, M. Daumer, R. Kaiser, H. Walter, K. Korn, D. Hoffmann, and J. Selbig. Methods for optimizing antiviral combination therapies. Bioinformatics, 19(Suppl. 1):i16-i25, 2003. [ bib \| http \| .pdf ] Motivation: Despite some progress with antiretroviral combination therapies, therapeutic success in the management of HIV-infected patients is limited. The evolution of drug-resistant genetic variants in response to therapy plays a key role in treatment failure and finding a new potent drug combination after therapy failure is considered challenging. Results: To estimate the activity of a drug combination against a particular viral strain, we develop a scoring function whose independent variables describe a set of antiviral agents and viral DNA sequences coding for the molecular targets of the respective drugs. The construction of this activity score involves (1) predicting phenotypic drug resistance from genotypes for each drug individually, (2) probabilistic modeling of predicted resistance values and integration into a score for drug combinations, and (3) searching through the mutational neighborhood of the considered strain in order to estimate activity on nearby mutants. For a clinical data set, we determine the optimal search depth and show that the scoring scheme is predictive of therapeutic outcome. Properties of the activity score and applications are discussed. Contact: beerenwinkel@mpi-sb.mpg.de Keywords: HIV, antiretroviral therapy, drug resistance, SVM regression, therapy optimization, sequence space search. Keywords: biosvm
[Bagirov2003New]	A. M. Bagirov, B. Ferguson, S. Ivkovic, G. Saunders, and J. Yearwood. New algorithms for multi-class cancer diagnosis using tumor gene expression signatures. Bioinformatics, 19(14):1800-7, Sep 2003. [ bib \| http \| .pdf ] MOTIVATION: The increasing use of DNA microarray-based tumor gene expression profiles for cancer diagnosis requires mathematical methods with high accuracy for solving clustering, feature selection and classification problems of gene expression data. RESULTS: New algorithms are developed for solving clustering, feature selection and classification problems of gene expression data. The clustering algorithm is based on optimization techniques and allows the calculation of clusters step-by-step. This approach allows us to find as many clusters as a data set contains with respect to some tolerance. Feature selection is crucial for a gene expression database. Our feature selection algorithm is based on calculating overlaps of different genes. The database used, contains over 16 000 genes and this number is considerably reduced by feature selection. We propose a classification algorithm where each tissue sample is considered as the center of a cluster which is a ball. The results of numerical experiments confirm that the classification algorithm in combination with the feature selection algorithm perform slightly better than the published results for multi-class classifiers based on support vector machines for this data set. AVAILABILITY: Available on request from the authors. 
Keywords: Algorithms, Amino Acid Sequence, Anion Exchange Resins, Antigen-Antibody Complex, Artificial Intelligence, Automated, Automatic Data Processing, Biological, Blood Cells, Chemical, Chromatography, Cluster Analysis, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA, Data Interpretation, Databases, Decision Making, Decision Trees, Diffusion Magnetic Resonance Imaging, English Abstract, Epitopes, Expert Systems, Factual, Fuzzy Logic, Gene Expression Profiling, Gene Expression Regulation, Gene Targeting, Genetic, Genome, Histocompatibility Antigens Class I, Humans, Image Interpretation, Image Processing, In Vitro, Indicators and Reagents, Information Storage and Retrieval, Ion Exchange, Least-Squares Analysis, Liver Cirrhosis, Magnetic Resonance Imaging, Male, Models, Molecular Sequence Data, Neoplasms, Neoplastic, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Nucleic Acid Conformation, Oligonucleotide Array Sequence Analysis, P.H.S., Pattern Recognition, Prostatic Neoplasms, Protein, Protein Binding, Protein Interaction Mapping, Proteins, Proteome, Quantitative Structure-Activity Relationship, RNA, ROC Curve, Reproducibility of Results, Research Support, Sensitivity and Specificity, Sequence Alignment, Sequence Analysis, Severity of Illness Index, Statistical, Structure-Activity Relationship, Subtraction Technique, T-Lymphocyte, Transcription Factors, Transfer, Treatment Outcome, Tumor Markers, U.S. Gov't, User-Computer Interface, 14512351
[Anderson2003new]	D.C. Anderson, W. Li, D.G. Payan, and W.S. Noble. A new algorithm for the evaluation of shotgun peptide sequencing in proteomics: support vector machine classification of peptide MS/MS spectra and SEQUEST scores. J Proteome Res, 2(2):137-146, 2003. [ bib \| .pdf ] Shotgun tandem mass spectrometry-based peptide sequencing using programs such as SEQUEST allows high-throughput identification of peptides, which in turn allows the identification of corresponding proteins. We have applied a machine learning algorithm, called the support vector machine, to discriminate between correctly and incorrectly identified peptides using SEQUEST output. Each peptide was characterized by SEQUEST-calculated features such as delta Cn and Xcorr, measurements such as precursor ion current and mass, and additional calculated parameters such as the fraction of matched MS/MS peaks. The trained SVM classifier performed significantly better than previous cutoff-based methods at separating positive from negative peptides. Positive and negative peptides were more readily distinguished in training set data acquired on a QTOF, compared to an ion trap mass spectrometer. The use of 13 features, including four new parameters, significantly improved the separation between positive and negative peptides. Use of the support vector machine and these additional parameters resulted in a more accurate interpretation of peptide MS/MS spectra and is an important step toward automated interpretation of peptide tandem mass spectrometry data in proteomics. Keywords: biosvm proteomics
[Alexandersson2003SLAM]	M. Alexandersson, S. Cawley, and L. Pachter. SLAM: cross-species gene finding and alignment with a generalized pair hidden Markov model. Genome Res., 13(3):496-502, Mar 2003. [ bib \| DOI \| http \| .pdf ] Comparative-based gene recognition is driven by the principle that conserved regions between related organisms are more likely than divergent regions to be coding. We describe a probabilistic framework for gene structure and alignment that can be used to simultaneously find both the gene structure and alignment of two syntenic genomic regions. A key feature of the method is the ability to enhance gene predictions by finding the best alignment between two syntenic sequences, while at the same time finding biologically meaningful alignments that preserve the correspondence between coding exons. Our probabilistic framework is the generalized pair hidden Markov model, a hybrid of (1). generalized hidden Markov models, which have been used previously for gene finding, and (2). pair hidden Markov models, which have applications to sequence alignment. We have built a gene finding and alignment program called SLAM, which aligns and identifies complete exon/intron structures of genes in two related but unannotated sequences of DNA. SLAM is able to reliably predict gene structures for any suitably related pair of organisms, most notably with fewer false-positive predictions compared to previous methods (examples are provided for Homo sapiens/Mus musculus and Plasmodium falciparum/Plasmodium vivax comparisons). Accuracy is obtained by distinguishing conserved noncoding sequence (CNS) from conserved coding sequence. CNS annotation is a novel feature of SLAM and may be useful for the annotation of UTRs, regulatory elements, and other noncoding features. Keywords: biogm
[Aebersold2003Mass]	R. Aebersold and M. Mann. Mass spectrometry-based proteomics. Nature, 422(6928):198-207, Mar 2003. [ bib \| DOI \| http \| .pdf ] Recent successes illustrate the role of mass spectrometry-based proteomics as an indispensable tool for molecular and cellular biology and for the emerging field of systems biology. These include the study of protein-protein interactions via affinity-based isolations on a small and proteome-wide scale, the mapping of numerous organelles, the concurrent description of the malaria parasite genome and proteome, and the generation of quantitative protein profiles from diverse species. The ability of mass spectrometry to identify and, increasingly, to precisely quantify thousands of proteins from complex samples can be expected to impact broadly on biology and medicine. Keywords: bio
[Segal2003Classification]	N. H. Segal, P. Pavlidis, W. S. Noble, C. R. Antonescu, A. Viale, U. V. Wesley, K. Busam, H. Gallardo, D. DeSantis, M. F. Brennan, C. Cordon-Cardo, J. D. Wolchok, and A. N. Houghton. Classification of Clear-Cell Sarcoma as a Subtype of Melanoma by Genomic Profiling. J. Clin. Oncol., 21(9):1775-1781, May 2003. [ bib \| DOI \| http \| .pdf ] Purpose: To develop a genome-based classification scheme for clear-cell sarcoma (CCS), also known as melanoma of soft parts (MSP), which would have implications for diagnosis and treatment. This tumor displays characteristic features of soft tissue sarcoma (STS), including deep soft tissue primary location and a characteristic translocation, t(12;22)(q13;q12), involving EWS and ATF1 genes. CCS/MSP also has typical melanoma features, including immunoreactivity for S100 and HMB45, pigmentation, MITF-M expression, and a propensity for regional lymph node metastases. Materials and Methods: RNA samples from 21 cell lines and 60 pathologically confirmed cases of STS, melanoma, and CCS/MSP were examined using the U95A GeneChip (Affymetrix, Santa Clara, CA). Hierarchical cluster analysis, principal component analysis, and support vector machine (SVM) analysis exploited genomic correlations within the data to classify CCS/MSP. Results: Unsupervised analyses demonstrated a clear distinction between STS and melanoma and, furthermore, showed that CCS/MSP cluster with the melanomas as a distinct group. A supervised SVM learning approach further validated this finding and provided a user-independent approach to diagnosis. Genes of interest that discriminate CCS/MSP included those encoding melanocyte differentiation antigens, MITF, SOX10, ERBB3, and FGFR1. Conclusion: Gene expression profiles support the classification of CCS/MSP as a distinct genomic subtype of melanoma. Analysis of these gene profiles using the SVM may be an important diagnostic tool. 
Genomic analysis identified potential targets for the development of therapeutic strategies in the treatment of this disease. Keywords: biosvm
[Yuan2004SVMtm]	Z. Yuan, J.S. Mattick, and R.D. Teasdale. SVMtm: support vector machines to predict transmembrane segments. J. Comput. Chem., 25(5):632, 6 2004. [ bib \| DOI \| http \| .pdf ] A new method has been developed for prediction of transmembrane helices using support vector machines. Different coding schemes of protein sequences were explored, and their performances were assessed by crossvalidation tests. The best performance method can predict the transmembrane helices with sensitivity of 93.4% [...] of 92.0% [...] given to show the strength of transmembrane signal and the prediction reliability. In particular, this method can distinguish transmembrane proteins from soluble proteins with an accuracy of approximately 99% [...] helix prediction methods and can be used for consensus analysis of entire proteomes. The predictor is located at http://genet.imb.uq.edu.au/predictors/SVMtm. Keywords: biosvm
[Yu2004integrated]	J.K. Yu, Y.D. Chen, and S. Zheng. An integrated approach to the detection of colorectal cancer utilizing proteomics and bioinformatics. World J. Gastroenterol., 10(21):3127-3131, 2004. [ bib \| .pdf ] AIM: To find new potential biomarkers and to establish patterns for early detection of colorectal cancer. METHODS: One hundred and eighty-two serum samples including 55 from colorectal cancer (CRC) patients, 35 from colorectal adenoma (CRA) patients and 92 from healthy persons (HP) were detected by surface-enhanced laser desorption/ionization mass spectrometry (SELDI-MS). The data of spectra were analyzed by bioinformatics tools like artificial neural network (ANN) and support vector machine (SVM). RESULTS: The diagnostic pattern combined with 7 potential biomarkers could differentiate CRC patients from CRA patients with a specificity of 83% [...] The diagnostic pattern combined with 4 potential biomarkers could differentiate CRC patients from HP with a specificity of 92% [...] sensitivity of 89% [...] The combination of SELDI with bioinformatics tools could help find new biomarkers and establish patterns with high sensitivity and specificity for the detection of CRC. Keywords: biosvm
[Yu2004Predicting]	C.-S. Yu, C.-J. Lin, and J.-K. Hwang. Predicting subcellular localization of proteins for Gram-negative bacteria by support vector machines based on n-peptide compositions. Protein Sci., 13(5):1402-1406, 2004. [ bib \| DOI \| http \| .pdf ] Gram-negative bacteria have five major subcellular localization sites: the cytoplasm, the periplasm, the inner membrane, the outer membrane, and the extracellular space. The subcellular location of a protein can provide valuable information about its function. With the rapid increase of sequenced genomic data, the need for an automated and accurate tool to predict subcellular localization becomes increasingly important. We present an approach to predict subcellular localization for Gram-negative bacteria. This method uses the support vector machines trained by multiple feature vectors based on n-peptide compositions. For a standard data set comprising 1443 proteins, the overall prediction accuracy reaches 89% [...] highest prediction rate ever reported. Our prediction is 14% [...] than that of the recently developed multimodular PSORT-B. Because of its simplicity, this approach can be easily extended to other organisms and should be a useful tool for the high-throughput and large-scale analysis of proteomic and genomic data. Keywords: biosvm
[Yao2004Comparative]	X. J. Yao, A. Panaye, J. P. Doucet, R. S. Zhang, H. F. Chen, M. C. Liu, Z. D. Hu, and B. T. Fan. Comparative study of QSAR/QSPR correlations using support vector machines, radial basis function neural networks, and multiple linear regression. J Chem Inf Comput Sci, 44(4):1257-66, 2004. [ bib \| DOI \| http \| .pdf ] Support vector machines (SVMs) were used to develop QSAR models that correlate molecular structures to their toxicity and bioactivities. The performance and predictive ability of SVM are investigated and compared with other methods such as multiple linear regression and radial basis function neural network methods. In the present study, two different data sets were evaluated. The first one involves an application of SVM to the development of a QSAR model for the prediction of toxicities of 153 phenols, and the second investigation deals with the QSAR model between the structures and the activities of a set of 85 cyclooxygenase 2 (COX-2) inhibitors. For each application, the molecular structures were described using either the physicochemical parameters or molecular descriptors. In both studied cases, the predictive ability of the SVM model is comparable or superior to those obtained by MLR and RBFNN. The results indicate that SVM can be used as an alternative powerful modeling tool for QSAR studies. Keywords: biosvm chemoinformatics
[Yang2004Bio-support]	Z. R. Yang and K.-C. Chou. Bio-support vector machines for computational proteomics. Bioinformatics, 20(5):735-741, 2004. [ bib \| http \| .pdf ] Motivation: One of the most important issues in computational proteomics is to produce a prediction model for the classification or annotation of biological function of novel protein sequences. In order to improve the prediction accuracy, much attention has been paid to the improvement of the performance of the algorithms used, few is for solving the fundamental issue, namely, amino acid encoding as most existing pattern recognition algorithms are unable to recognize amino acids in protein sequences. Importantly, the most commonly used amino acid encoding method has the flaw that leads to large computational cost and recognition bias. Results: By replacing kernel functions of support vector machines (SVMs) with amino acid similarity measurement matrices, we have modified SVMs, a new type of pattern recognition algorithm for analysing protein sequences, particularly for proteolytic cleavage site prediction. We refer to the modified SVMs as bio-support vector machine. When applied to the prediction of HIV protease cleavage sites, the new method has shown a remarkable advantage in reducing the model complexity and enhancing the model robustness. Keywords: biosvm
[Yan2004Identification]	C. Yan, V. Honavar, and D. Dobbs. Identification of interface residues in protease-inhibitor and antigen-antibody complexes: a support vector machine. Neural Comput. & Applic., 13:123-129, 2004. [ bib \| DOI \| .pdf ] Keywords: biosvm
[Yan2004two-stage]	C. Yan, D. Dobbs, and V. Honavar. A two-stage classifier for identification of protein-protein interface residues. Bioinformatics, 20(Suppl. 1):i371-i378, 2004. [ bib \| http \| .pdf ] Motivation: The ability to identify protein-protein interaction sites and to detect specific amino acid residues that contribute to the specificity and affinity of protein interactions has important implications for problems ranging from rational drug design to analysis of metabolic and signal transduction networks. Results: We have developed a two-stage method consisting of a support vector machine (SVM) and a Bayesian classifier for predicting surface residues of a protein that participate in protein-protein interactions. This approach exploits the fact that interface residues tend to form clusters in the primary amino acid sequence. Our results show that the proposed two-stage classifier outperforms previously published sequence-based methods for predicting interface residues. We also present results obtained using the two-stage classifier on an independent test set of seven CAPRI (Critical Assessment of PRedicted Interactions) targets. The success of the predictions is validated by examining the predictions in the context of the three-dimensional structures of protein complexes. Supplementary information: http://www.public.iastate.edu/ chhyan/ISMB2004/list.html Keywords: biosvm
[Yamanishi2004Protein]	Y. Yamanishi, J.-P. Vert, and M. Kanehisa. Protein network inference from multiple genomic data: a supervised approach. Bioinformatics, 20:i363-i370, 2004. [ bib \| http \| .pdf ] Motivation: An increasing number of observations support the hypothesis that most biological functions involve the interactions between many proteins, and that the complexity of living systems arises as a result of such interactions. In this context, the problem of inferring a global protein network for a given organism, using all available genomic data about the organism, is quickly becoming one of the main challenges in current computational biology. Results: This paper presents a new method to infer protein networks from multiple types of genomic data. Based on a variant of kernel canonical correlation analysis, its originality is in the formalization of the protein network inference problem as a supervised learning problem, and in the integration of heterogeneous genomic data within this framework. We present promising results on the prediction of the protein network for the yeast Saccharomyces cerevisiae from four types of widely available data: gene expressions, protein interactions measured by yeast two-hybrid systems, protein localizations in the cell and protein phylogenetic profiles. The method is shown to outperform other unsupervised protein network inference methods. We finally conduct a comprehensive prediction of the protein network for all proteins of the yeast, which enables us to propose protein candidates for missing enzymes in a biosynthesis pathway. Availability: Softwares are available upon request. Keywords: biosvm
[Yamanishi2004Heterogeneous]	Y. Yamanishi, J.-P. Vert, and M. Kanehisa. Heterogeneous data comparison and gene selection with kernel canonical correlation analysis. In B. Schölkopf, K. Tsuda, and J.P. Vert, editors, Kernel Methods in Computational Biology, pages 209-230. MIT Press, 2004. [ bib \| www: ] Keywords: biosvm
[Xue2004Prediction]	Y. Xue, C. W. Yap, L. Z. Sun, Z. W. Cao, J. F. Wang, and Y. Z. Chen. Prediction of P-glycoprotein substrates by a support vector machine approach. J Chem Inf Comput Sci, 44(4):1497-505, 2004. [ bib \| DOI \| http \| .pdf ] P-glycoproteins (P-gp) actively transport a wide variety of chemicals out of cells and function as drug efflux pumps that mediate multidrug resistance and limit the efficacy of many drugs. Methods for facilitating early elimination of potential P-gp substrates are useful for facilitating new drug discovery. A computational ensemble pharmacophore model has recently been used for the prediction of P-gp substrates with a promising accuracy of 63%. It is desirable to extend the prediction range beyond compounds covered by the known pharmacophore models. For such a purpose, a machine learning method, support vector machine (SVM), was explored for the prediction of P-gp substrates. A set of 201 chemical compounds, including 116 substrates and 85 nonsubstrates of P-gp, was used to train and test a SVM classification system. This SVM system gave a prediction accuracy of at least 81.2% for P-gp substrates based on two different evaluation methods, which is substantially improved against that obtained from the multiple-pharmacophore model. The prediction accuracy for nonsubstrates of P-gp is 79.2% using 5-fold cross-validation. These accuracies are slightly better than those obtained from other statistical classification methods, including k-nearest neighbor (k-NN), probabilistic neural networks (PNN), and C4.5 decision tree, that use the same sets of data and molecular descriptors. Our study indicates the potential of SVM in facilitating the prediction of P-gp substrates. Keywords: biosvm
[Xue2004Effect]	Y. Xue, Z. R. Li, C. W. Yap, L. Z. Sun, X. Chen, and Y. Z. Chen. Effect of molecular descriptor feature selection in support vector machine classification of pharmacokinetic and toxicological properties of chemical agents. J Chem Inf Comput Sci, 44(5):1630-8, 2004. [ bib \| DOI \| http \| .pdf ] Statistical-learning methods have been developed for facilitating the prediction of pharmacokinetic and toxicological properties of chemical agents. These methods employ a variety of molecular descriptors to characterize structural and physicochemical properties of molecules. Some of these descriptors are specifically designed for the study of a particular type of properties or agents, and their use for other properties or agents might generate noise and affect the prediction accuracy of a statistical learning system. This work examines to what extent the reduction of this noise can improve the prediction accuracy of a statistical learning system. A feature selection method, recursive feature elimination (RFE), is used to automatically select molecular descriptors for support vector machines (SVM) prediction of P-glycoprotein substrates (P-gp), human intestinal absorption of molecules (HIA), and agents that cause torsades de pointes (TdP), a rare but serious side effect. RFE significantly reduces the number of descriptors for each of these properties thereby increasing the computational speed for their classification. The SVM prediction accuracies of P-gp and HIA are substantially increased and that of TdP remains unchanged by RFE. These prediction accuracies are comparable to those of earlier studies derived from a selective set of descriptors. Our study suggests that molecular feature selection is useful for improving the speed and, in some cases, the accuracy of statistical learning methods for the prediction of pharmacokinetic and toxicological properties of chemical agents. Keywords: biosvm
[Xue2004Study]	C. X. Xue, R. S. Zhang, M. C. Liu, Z. D. Hu, and B. T. Fan. Study of the quantitative structure-mobility relationship of carboxylic acids in capillary electrophoresis based on support vector machines. J Chem Inf Comput Sci, 44(3):950-7, 2004. [ bib \| DOI \| http \| .pdf ] The support vector machines (SVM), as a novel type of learning machine, were used to develop a quantitative structure-mobility relationship (QSMR) model of 58 aliphatic and aromatic carboxylic acids based on molecular descriptors calculated from the structure alone. Multiple linear regression (MLR) and radial basis function neural networks (RBFNNs) were also utilized to construct the linear and the nonlinear model to compare with the results obtained by SVM. The root-mean-square errors in absolute mobility predictions for the whole data set given by MLR, RBFNNs, and SVM were 1.530, 1.373, and 0.888 mobility units (10(-5) cm(2) S(-1) V(-1)), respectively, which indicated that the prediction result agrees well with the experimental values of these compounds and also revealed the superiority of SVM over MLR and RBFNNs models for the prediction of the absolute mobility of carboxylic acids. Moreover, the models we proposed could also provide some insight into what structural features are related to the absolute mobility of aliphatic and aromatic carboxylic acids. Keywords: biosvm
[Xue2004QSAR]	C. X. Xue, R. S. Zhang, H. X. Liu, X. J. Yao, M. C. Liu, Z. D. Hu, and B. T. Fan. QSAR models for the prediction of binding affinities to human serum albumin using the heuristic method and a support vector machine. J Chem Inf Comput Sci, 44(5):1693-700, 2004. [ bib \| DOI \| http \| .pdf ] The binding affinities to human serum albumin for 94 diverse drugs and drug-like compounds were modeled with the descriptors calculated from the molecular structure alone using a quantitative structure-activity relationship (QSAR) technique. The heuristic method (HM) and support vector machine (SVM) were utilized to construct the linear and nonlinear prediction models, leading to a good correlation coefficient (R2) of 0.86 and 0.94 and root-mean-square errors (rms) of 0.212 and 0.134 albumin drug binding affinity units, respectively. Furthermore, the models were evaluated by a 10 compound external test set, yielding R2 of 0.71 and 0.89 and rms error of 0.430 and 0.222. The specific information described by the heuristic linear model could give some insights into the factors that are likely to govern the binding affinity of the compounds and be used as an aid to the drug design process; however, the prediction results of the nonlinear SVM model seem to be better than that of the HM. Keywords: biosvm
[Xue2004accurate]	C. X. Xue, R. S. Zhang, H. X. Liu, X. J. Yao, M. C. Liu, Z. D. Hu, and B. T. Fan. An accurate QSPR study of O-H bond dissociation energy in substituted phenols based on support vector machines. J Chem Inf Comput Sci, 44(2):669-77, 2004. [ bib \| DOI \| http \| .pdf ] The support vector machine (SVM), as a novel type of learning machine, was used to develop a Quantitative Structure-Property Relationship (QSPR) model of the O-H bond dissociation energy (BDE) of 78 substituted phenols. The six descriptors calculated solely from the molecular structures of compounds selected by forward stepwise regression were used as inputs for the SVM model. The root-mean-square (rms) errors in BDE predictions for the training, test, and overall data sets were 3.808, 3.320, and 3.713 BDE units (kJ mol(-1)), respectively. The results obtained by Gaussian-kernel SVM were much better than those obtained by multiple linear regression, radial basis function neural networks, linear-kernel SVM, and other QSPR approaches. Keywords: biosvm
[Xue2004Support]	C. X. Xue, R. S. Zhang, H. X. Liu, M. C. Liu, Z. D. Hu, and B. T. Fan. Support vector machines-based quantitative structure-property relationship for the prediction of heat capacity. J Chem Inf Comput Sci, 44(4):1267-74, 2004. [ bib \| DOI \| http \| .pdf ] The support vector machine (SVM), as a novel type of learning machine, for the first time, was used to develop a Quantitative Structure-Property Relationship (QSPR) model of the heat capacity of a diverse set of 182 compounds based on the molecular descriptors calculated from the structure alone. Multiple linear regression (MLR) and radial basis function networks (RBFNNs) were also utilized to construct quantitative linear and nonlinear models to compare with the results obtained by SVM. The root-mean-square (rms) errors in heat capacity predictions for the whole data set given by MLR, RBFNNs, and SVM were 4.648, 4.337, and 2.931 heat capacity units, respectively. The prediction results are in good agreement with the experimental value of heat capacity; also, the results reveal the superiority of the SVM over MLR and RBFNNs models. Keywords: biosvm
[Xu2004Molecular]	Xiu-Qin Xu, Chon K Leow, Xin Lu, Xuegong Zhang, Jun S Liu, Wing-Hung Wong, Arndt Asperger, Sören Deininger, and Hon-Chiu Eastwood Leung. Molecular classification of liver cirrhosis in a rat model by proteomics and bioinformatics. Proteomics, 4(10):3235-45, Oct 2004. [ bib \| DOI \| http \| .pdf ] Liver cirrhosis is a worldwide health problem. Reliable, noninvasive methods for early detection of liver cirrhosis are not available. Using a three-step approach, we classified sera from rats with liver cirrhosis following different treatment insults. The approach consisted of: (i) protein profiling using surface-enhanced laser desorption/ionization (SELDI) technology; (ii) selection of a statistically significant serum biomarker set using machine learning algorithms; and (iii) identification of selected serum biomarkers by peptide sequencing. We generated serum protein profiles from three groups of rats: (i) normal (n=8), (ii) thioacetamide-induced liver cirrhosis (n=22), and (iii) bile duct ligation-induced liver fibrosis (n=5) using a weak cation exchanger surface. Profiling data were further analyzed by a recursive support vector machine algorithm to select a panel of statistically significant biomarkers for class prediction. Sensitivity and specificity of classification using the selected protein marker set were higher than 92%. A consistently down-regulated 3495 Da protein in cirrhosis samples was one of the selected significant biomarkers. This 3495 Da protein was purified on-chip and trypsin digested. Further structural characterization of this biomarkers candidate was done by using cross-platform matrix-assisted laser desorption/ionization mass spectrometry (MALDI-MS) peptide mass fingerprinting (PMF) and matrix-assisted laser desorption/ionization time of flight/time of flight (MALDI-TOF/TOF) tandem mass spectrometry (MS/MS). 
Combined data from PMF and MS/MS spectra of two tryptic peptides suggested that this 3495 Da protein shared homology to a histidine-rich glycoprotein. These results demonstrated a novel approach to discovery of new biomarkers for early detection of liver cirrhosis and classification of liver diseases. Keywords: biosvm
[Xing2004LOGOS]	E. P. Xing, W. Wu, M. I. Jordan, and R. M. Karp. LOGOS: A modular Bayesian model for de novo motif detection. J. Bioinform. Comput. Biol., 2:127-154, 2004. [ bib \| DOI \| http \| .pdf ] The complexity of the global organization and internal structure of motifs in higher eukaryotic organisms raises significant challenges for motif detection techniques. To achieve successful de novo motif detection, it is necessary to model the complex dependencies within and among motifs and to incorporate biological prior knowledge. In this paper, we present LOGOS, an integrated LOcal and GlObal motif Sequence model for biopolymer sequences, which provides a principled framework for developing, modularizing, extending and computing expressive motif models for complex biopolymer sequence analysis. LOGOS consists of two interacting submodels: HMDM, a local alignment model capturing biological prior knowledge and positional dependency within the motif local structure; and HMM, a global motif distribution model modeling frequencies and dependencies of motif occurrences. Model parameters can be fit using training motifs within an empirical Bayesian framework. A variational EM algorithm is developed for de novo motif detection. LOGOS improves over existing models that ignore biological priors and dependencies in motif structures and motif occurrences, and demonstrates superior performance on both semi-realistic test data and cis-regulatory sequences from yeast and Drosophila genomes with regard to sensitivity, specificity, flexibility and extensibility. Keywords: biogm
[Williams2004Prognostic]	R.D. Williams, S.N. Hing, B.T. Greer, C.C. Whiteford, J.S. Wei, R. Natrajan, A. Kelsey, S. Rogers, C. Campbell, K. Pritchard-Jones, and J. Khan. Prognostic classification of relapsing favorable histology Wilms tumor using cDNA microarray expression profiling and support vector machines. Genes Chromosomes Cancer, 41(1):65-79, Sep 2004. [ bib \| DOI \| http \| .pdf ] Treatment of Wilms tumor has a high success rate, with some 85% of patients achieving long-term survival. However, late effects of treatment and management of relapse remain significant clinical problems. If accurate prognostic methods were available, effective risk-adapted therapies could be tailored to individual patients at diagnosis. Few molecular prognostic markers for Wilms tumor are currently defined, though previous studies have linked allele loss on 1p or 16q, genomic gain of 1q, and overexpression from 1q with an increased risk of relapse. To identify specific patterns of gene expression that are predictive of relapse, we used high-density (30 k) cDNA microarrays to analyze RNA samples from 27 favorable histology Wilms tumors taken from primary nephrectomies at the time of initial diagnosis. Thirteen of these tumors relapsed within 2 years. Genes differentially expressed between the relapsing and nonrelapsing tumor classes were identified by statistical scoring (t test). These genes encode proteins with diverse molecular functions, including transcription factors, developmental regulators, apoptotic factors, and signaling molecules. Use of a support vector machine classifier, feature selection, and test evaluation using cross-validation led to identification of a generalizable expression signature, a small subset of genes whose expression potentially can be used to predict tumor outcome in new samples. Similar methods were used to identify genes that are differentially expressed between tumors with and without genomic 1q gain. 
This set of discriminators was highly enriched in genes on 1q, indicating close agreement between data obtained from expression profiling with data from genomic copy number analyses. Keywords: biosvm
[Weathers2004Reduced]	E. A. Weathers, M. E. Paulaitis, T. B. Woolf, and J. H. Hoh. Reduced amino acid alphabet is sufficient to accurately recognize intrinsically disordered protein. FEBS Lett., 576(3):348-352, 2004. [ bib \| DOI \| http \| .pdf ] Intrinsically disordered proteins are an important class of proteins with unique functions and properties. Here, we have applied a support vector machine (SVM) trained on naturally occurring disordered and ordered proteins to examine the contribution of various parameters (vectors) to recognizing proteins that contain disordered regions. We find that a SVM that incorporates only amino acid composition has a recognition accuracy of 87+/-2%, suggesting that composition alone is sufficient to accurately recognize disorder. Interestingly, SVMs using reduced sets of amino acids based on chemical similarity preserve high recognition accuracy. A set as small as four retains an accuracy of 84+/-2%, suggesting that general physicochemical properties rather than specific amino acids are important factors contributing to protein disorder. Keywords: biosvm
[Waring2004Interlaboratory]	Jeffrey F Waring, Roger G Ulrich, Nick Flint, David Morfitt, Arno Kalkuhl, Frank Staedtler, Michael Lawton, Johanna M Beekman, and Laura Suter. Interlaboratory evaluation of rat hepatic gene expression changes induced by methapyrilene. Environ Health Perspect, 112(4):439-48, Mar 2004. [ bib ] Several studies using microarrays have shown that changes in gene expression provide information about the mechanism of toxicity induced by xenobiotic agents. Nevertheless, the issue of whether gene expression profiles are reproducible across different laboratories remains to be determined. To address this question, several members of the Hepatotoxicity Working Group of the International Life Sciences Institute Health and Environmental Sciences Institute evaluated the liver gene expression profiles of rats treated with methapyrilene (MP). Animals were treated at one facility, and RNA was distributed to five different sites for gene expression analysis. A preliminary evaluation of the number of modulated genes uncovered striking differences between the five different sites. However, additional data analysis demonstrated that these differences had an effect on the absolute gene expression results but not on the outcome of the study. For all users, unsupervised algorithms showed that gene expression allows the distinction of the high dose of MP from controls and low dose. In addition, the use of a supervised analysis method (support vector machines) made it possible to correctly classify samples. In conclusion, the results show that, despite some variability, robust gene expression changes were consistent between sites. In addition, key expression changes related to the mechanism of MP-induced hepatotoxicity were identified. These results provide critical information regarding the consistency of microarray results across different laboratories and shed light on the strengths and limitations of expression profiling in drug safety analysis. 
Keywords: biosvm
[Wang2004Support]	M-L. Wang, W-J. Li, M-L. Wang, and W-B. Xu. Support vector machines for prediction of peptidyl prolyl cis/trans isomerization. J Pept Res, 63(1):23-8, Jan 2004. [ bib ] A new method for peptidyl prolyl cis/trans isomerization prediction based on the theory of support vector machines (SVM) was introduced. The SVM represents a new approach to supervised pattern classification and has been successfully applied to a wide range of pattern recognition problems. In this study, six training datasets consisting of different length local sequence respectively were used. The polynomial kernel functions with different parameter d were chosen. The test for the independent testing dataset and the jackknife test were both carried out. When the local sequence length was 20-residue and the parameter d = 8, the SVM method achieved the best performance with the correct rate for the cis and trans forms reaching 70.4 and 69.7% for the independent testing dataset, 76.7 and 76.6% for the jackknife test, respectively. Matthew's correlation coefficients for the jackknife test could reach about 0.5. The results obtained through this study indicated that the SVM method would become a powerful tool for predicting peptidyl prolyl cis/trans isomerization. Keywords: biosvm
[Wang2004Weighted-support]	M. Wang, J. Yang, G.-P. Liu, Z.-J. Xu, and K.-C. Chou. Weighted-support vector machines for predicting membrane protein types based on pseudo-amino acid composition. Protein Eng. Des. Sel., 17(6):509-516, 2004. [ bib \| DOI \| arXiv \| http \| .pdf ] Membrane proteins are generally classified into the following five types: (1) type I membrane proteins, (2) type II membrane proteins, (3) multipass transmembrane proteins, (4) lipid chain-anchored membrane proteins and (5) GPI-anchored membrane proteins. Prediction of membrane protein types has become one of the growing hot topics in bioinformatics. Currently, we are facing two critical challenges in this area: first, how to take into account the extremely complicated sequence-order effects, and second, how to deal with the highly uneven sizes of the subsets in a training dataset. In this paper, stimulated by the concept of using the pseudo-amino acid composition to incorporate the sequence-order effects, the spectral analysis technique is introduced to represent the statistical sample of a protein. Based on such a framework, the weighted support vector machine (SVM) algorithm is applied. The new approach has remarkable power in dealing with the bias caused by the situation when one subset in the training dataset contains many more samples than the other. The new method is particularly useful when our focus is aimed at proteins belonging to small subsets. The results obtained by the self-consistency test, jackknife test and independent dataset test are encouraging, indicating that the current approach may serve as a powerful complementary tool to other existing methods for predicting the types of membrane proteins. Keywords: biosvm
[Wang2004Predicting]	Long-Hui Wang, Juan Liu, Yan-Fu Li, and Huai-Bei Zhou. Predicting protein secondary structure by a support vector machine based on a new coding scheme. Genome Inform Ser Workshop Genome Inform, 15(2):181-90, 2004. [ bib \| .html ] Protein structure prediction is one of the most important problems in modern computational biology. Protein secondary structure prediction is a key step in prediction of protein tertiary structure. There have emerged many methods based on machine learning techniques, such as neural networks (NN) and support vector machine (SVM) etc., to focus on the prediction of the secondary structures. In this paper, a new method was proposed based on SVM. Different from the existing methods, this method takes into account the physical-chemical properties and structure properties of amino acids. When tested on the most popular dataset CB513, it achieved a Q(3) accuracy of 0.7844, which illustrates that it is one of the top-ranking methods for protein secondary structure prediction. Keywords: biosvm
[Wang2004Simple]	Kai Wang, Ekachai Jenwitheesuk, Ram Samudrala, and John E Mittler. Simple linear model provides highly accurate genotypic predictions of HIV-1 drug resistance. Antivir Ther, 9(3):343-52, Jun 2004. [ bib ] Drug resistance is a major obstacle to the successful treatment of HIV-1 infection. Genotypic assays are used widely to provide indirect evidence of drug resistance, but the performance of these assays has been mixed. We used standard stepwise linear regression to construct drug resistance models for seven protease inhibitors and 10 reverse transcriptase inhibitors using data obtained from the Stanford HIV drug resistance database. We evaluated these models by hold-one-out experiments and by tests on an independent dataset. Our linear model outperformed other publicly available genotypic interpretation algorithms, including decision tree, support vector machine and four rules-based algorithms (HIVdb, VGI, ANRS and Rega) under both tests. Interestingly, our model did well despite the absence of any terms for interactions between different residues in protease or reverse transcriptase. The resulting linear models are easy to understand and can potentially assist in choosing combination therapy regimens. Keywords: Algorithms, Computational Biology, Databases, Drug Resistance, Forecasting, Genetic, Genotype, HIV Protease Inhibitors, HIV-1, Humans, Information Management, Information Storage and Retrieval, Kinetics, Linear Models, Microbial Sensitivity Tests, Models, Non-U.S. Gov't, P.H.S., Periodicals, Point Mutation, Pyrimidinones, Research Support, Reverse Transcriptase Inhibitors, Theoretical, U.S. Gov't, Viral, 15259897
[Wagner2004Computational]	M. Wagner, D.N. Naik, A. Pothen, S. Kasukurti, R.R. Devineni, B.L. Adam, O.J. Semmes, and G.L. Wright Jr. Computational protein biomarker prediction: a case study for prostate cancer. BMC Bioinformatics, 5(26), 2004. [ bib \| DOI \| http \| .pdf ] Background Recent technological advances in mass spectrometry pose challenges in computational mathematics and statistics to process the mass spectral data into predictive models with clinical and biological significance. We discuss several classification-based approaches to finding protein biomarker candidates using protein profiles obtained via mass spectrometry, and we assess their statistical significance. Our overall goal is to implicate peaks that have a high likelihood of being biologically linked to a given disease state, and thus to narrow the search for biomarker candidates. Results Thorough cross-validation studies and randomization tests are performed on a prostate cancer dataset with over 300 patients, obtained at the Eastern Virginia Medical School using SELDI-TOF mass spectrometry. We obtain average classification accuracies of 87% on a four-group classification problem using a two-stage linear SVM-based procedure and just 13 peaks, with other methods performing comparably. Conclusions Modern feature selection and classification methods are powerful techniques for both the identification of biomarker candidates and the related problem of building predictive models from protein mass spectrometric profiles. Cross-validation and randomization are essential tools that must be performed carefully in order not to bias the results unfairly. However, only a biological validation and identification of the underlying proteins will ultimately confirm the actual value and power of any computational predictions. Keywords: biosvm
[Vinayagam2004Applying]	A. Vinayagam, R. König, J. Moormann, F. Schubert, R. Eils, K.-H. Glatting, and S. Suhai. Applying Support Vector Machines for Gene Ontology based gene function prediction. BMC Bioinformatics, 5(1):116, Aug 2004. [ bib \| DOI \| http \| .pdf ] BACKGROUND: The current progress in sequencing projects calls for rapid, reliable and accurate function assignments of gene products. A variety of methods has been designed to annotate sequences on a large scale. However, these methods can either only be applied for specific subsets, or their results are not formalised, or they do not provide precise confidence estimates for their predictions. RESULTS: We have developed a large-scale annotation system that tackles all of these shortcomings. In our approach, annotation was provided through Gene Ontology terms by applying multiple Support Vector Machines (SVM) for the classification of correct and false predictions. The general performance of the system was benchmarked with a large dataset. An organism-wise cross-validation was performed to define confidence estimates, resulting in an average precision of 80% for 74% of all test sequences. The validation results show that the prediction performance was organism-independent and could reproduce the annotation of other automated systems as well as high-quality manual annotations. We applied our trained classification system to Xenopus laevis sequences, yielding functional annotation for more than half of the known expressed genome. Compared to the currently available annotation, we provided more than twice the number of contigs with good quality annotation, and additionally we assigned a confidence value to each predicted GO term. CONCLUSIONS: We present a complete automated annotation system that overcomes many of the usual problems by applying a controlled vocabulary of Gene Ontology and an established classification method on large and well-described sequence data sets. 
In a case study, the function for Xenopus laevis contig sequences was predicted and the results are publicly available at ftp://genome.dkfz-heidelberg.de/pub/agd/gene_association.agd_Xenopus. Keywords: biosvm
[Vert2004primer]	J.-P. Vert, K. Tsuda, and B. Schölkopf. A primer on kernel methods. In B. Schölkopf, K. Tsuda, and J.P. Vert, editors, Kernel Methods in Computational Biology, pages 35-70. MIT Press, 2004. [ bib ] Keywords: biosvm
[Vert2004Local]	J.-P. Vert, H. Saigo, and T. Akutsu. Local alignment kernels for biological sequences. In B. Schölkopf, K. Tsuda, and J.P. Vert, editors, Kernel Methods in Computational Biology, pages 131-154. MIT Press, Cambridge, Massachusetts, 2004. [ bib \| www: ] Keywords: biosvm
[Vallabhaneni2004Motor]	Anirudh Vallabhaneni and Bin He. Motor imagery task classification for brain computer interface applications using spatiotemporal principle component analysis. Neurol Res, 26(3):282-7, Apr 2004. [ bib \| DOI \| http ] Classification of single-trial imagined left- and right-hand movements recorded through scalp EEG are explored in this study. Classical event-related desynchronization/synchronization (ERD/ERS) calculation approach was utilized to extract ERD features from the raw scalp EEG signal. Principle Component Analysis (PCA) was used for feature extraction and applied on spatial, as well as temporal dimensions in two consecutive steps. A Support Vector Machine (SVM) classifier using a linear decision function was used to classify each trial as either left or right. The present approach has yielded good classification results and promises to have potential for further refinement for increased accuracy as well as application in online brain computer interface (BCI). Keywords: Amino Acids, Antibodies, Artificial Intelligence, Biological, Brain, Brain Mapping, Calibration, Comparative Study, Computational Biology, Cysteine, Cystine, Electrodes, Electroencephalography, Evoked Potentials, Female, Horseradish Peroxidase, Humans, Imagery (Psychotherapy), Imagination, Laterality, Male, Monoclonal, Movement, Neoplasms, Non-P.H.S., Non-U.S. Gov't, P.H.S., Perception, Principal Component Analysis, Protein, Protein Array Analysis, Proteins, Research Support, Sensitivity and Specificity, Sequence Analysis, Tumor Markers, U.S. Gov't, User-Computer Interface, 15142321
[Tsuda2004Learning]	K. Tsuda and W.S. Noble. Learning kernels from biological networks by maximizing entropy. Bioinformatics, 20:i326-i333, 2004. [ bib \| DOI \| http \| .pdf ] Motivation: The diffusion kernel is a general method for computing pairwise distances among all nodes in a graph, based on the sum of weighted paths between each pair of nodes. This technique has been used successfully, in conjunction with kernel-based learning methods, to draw inferences from several types of biological networks. Results: We show that computing the diffusion kernel is equivalent to maximizing the von Neumann entropy, subject to a global constraint on the sum of the Euclidean distances between nodes. This global constraint allows for high variance in the pairwise distances. Accordingly, we propose an alternative, locally constrained diffusion kernel, and we demonstrate that the resulting kernel allows for more accurate support vector machine prediction of protein functional classifications from metabolic and protein-protein interaction networks. Availability: Supplementary results and data are available at noble.gs.washington.edu/proj/maxent Keywords: learning-kernel graph-kernel biosvm
[Tsai2004Gene]	C.A. Tsai, C.H. Chen, T.C. Lee, I.C. Ho, U.C. Yang, and J.J. Chen. Gene selection for sample classifications in microarray experiments. DNA Cell Biol., 23(10):607-614, 2004. [ bib \| DOI \| http \| .pdf ] DNA microarray technology provides useful tools for profiling global gene expression patterns in different cell/tissue samples. One major challenge is the large number of genes relative to the number of samples. The use of all genes can suppress or reduce the performance of a classification rule due to the noise of nondiscriminatory genes. Selection of an optimal subset from the original gene set becomes an important prestep in sample classification. In this study, we propose a family-wise error (FWE) rate approach to selection of discriminatory genes for two-sample or multiple-sample classification. The FWE approach controls the probability of the number of one or more false positives at a prespecified level. A public colon cancer data set is used to evaluate the performance of the proposed approach for the two classification methods: k nearest neighbors (k-NN) and support vector machine (SVM). The selected gene sets from the proposed procedure appear to perform better than or comparable to several results reported in the literature using the univariate analysis without performing multivariate search. In addition, we apply the FWE approach to a toxicogenomic data set with nine treatments (a control and eight metals, As, Cd, Ni, Cr, Sb, Pb, Cu, and AsV) for a total of 55 samples for a multisample classification. Two gene sets are considered: the gene set omegaF formed by the ANOVA F-test, and a gene set omegaT formed by the union of one-versus-all t-tests. The predicted accuracies are evaluated using the internal and external crossvalidation. Using the SVM classification, the overall accuracies to predict 55 samples into one of the nine treatments are above 80% for internal crossvalidation. OmegaF has slightly higher accuracy rates than omegaT. 
The overall predicted accuracies are above 70% for external crossvalidation; the two gene sets omegaT and omegaF performed equally well. Keywords: biosvm microarray
[Sun2004protein]	Zhenghong Sun, Xiaoli Fu, Lu Zhang, Xiaoli Yang, Feizhou Liu, and Gengxi Hu. A protein chip system for parallel analysis of multi-tumor markers and its application in cancer detection. Anticancer Res, 24(2C):1159-65, 2004. [ bib ] BACKGROUND: Tumor markers are routinely measured in clinical oncology. However, their value in cancer detection has been controversial largely because no single tumor marker is sensitive and specific enough to meet strict diagnostic criteria. One strategy to overcome the shortcomings of single tumor markers is to measure a combination of tumor markers to increase sensitivity and look for distinct patterns to increase specificity. This study aimed to develop a system for parallel detection of tumor markers as a tool for tumor detection in both cancer patients and asymptomatic populations at high risk. MATERIALS AND METHODS: A protein chip was fabricated with twelve monoclonal antibodies against the following tumor markers respectively: CA125, CA15-3, CA19-9, CA242, CEA, AFP, PSA, free-PSA, HGH, beta-HCG, NSE and ferritin. Tumor markers were captured after the protein chip was incubated with serum samples. A secondary antibody conjugated with HRP was used to detect the captured tumor markers using chemiluminescence technique. Quantification of the tumor markers was obtained after calibration with standard curves. RESULTS: The chip system showed an overall sensitivity of 68.18% after testing 1147 cancer patients, with high sensitivities for liver, pancreas and ovarian tumors and low sensitivities for gastrointestinal tumors, and a specificity of 97.1% after testing 793 healthy individuals. Application of the chip system in physical checkups of 15,867 individuals resulted in 16 cases that were subsequently confirmed as having cancers. 
Analysis of the detection results with a Support Vector Machine algorithm considerably increased the specificity of the system as reflected in healthy individuals and hepatitis/cirrhosis patients, but only modestly decreased the sensitivity for cancer patients. CONCLUSION: This protein chip system is a potential tool for assisting cancer diagnosis and for screening cancer in high-risk populations. Keywords: Antibodies, Artificial Intelligence, Biological, Calibration, Female, Horseradish Peroxidase, Humans, Male, Monoclonal, Neoplasms, Protein Array Analysis, Sensitivity and Specificity, Tumor Markers, 15154641
[Steiner2004Discriminating]	Guido Steiner, Laura Suter, Franziska Boess, Rodolfo Gasser, Maria Cristina de Vera, Silvio Albertini, and Stefan Ruepp. Discriminating different classes of toxicants by transcript profiling. Environ. Health Perspect., 112(12):1236-48, Aug 2004. [ bib \| .html \| .pdf ] Male rats were treated with various model compounds or the appropriate vehicle controls. Most substances were either well-known hepatotoxicants or showed hepatotoxicity during preclinical testing. The aim of the present study was to determine if biological samples from rats treated with various compounds can be classified based on gene expression profiles. In addition to gene expression analysis using microarrays, a complete serum chemistry profile and liver and kidney histopathology were performed. We analyzed hepatic gene expression profiles using a supervised learning method (support vector machines; SVMs) to generate classification rules and combined this with recursive feature elimination to improve classification performance and to identify a compact subset of probe sets with potential use as biomarkers. Two different SVM algorithms were tested, and the models obtained were validated with a compound-based external cross-validation approach. Our predictive models were able to discriminate between hepatotoxic and nonhepatotoxic compounds. Furthermore, they predicted the correct class of hepatotoxicant in most cases. We provide an example showing that a predictive model built on transcript profiles from one rat strain can successfully classify profiles from another rat strain. In addition, we demonstrate that the predictive models identify nonresponders and are able to discriminate between gene changes related to pharmacology and toxicity. This work confirms the hypothesis that compound classification based on gene expression data is feasible. Keywords: biosvm
[Statnikov2004Methods]	Alexander Statnikov, Constantin F Aliferis, and Ioannis Tsamardinos. Methods for multi-category cancer diagnosis from gene expression data: a comprehensive evaluation to inform decision support system development. Medinfo, 11(Pt 2):813-7, 2004. [ bib ] Cancer diagnosis is a major clinical applications area of gene expression microarray technology. We are seeking to develop a system for cancer diagnostic model creation based on microarray data. In order to equip the system with the optimal combination of data modeling methods, we performed a comprehensive evaluation of several major classification algorithms, gene selection methods, and cross-validation designs using 11 datasets spanning 74 diagnostic categories (41 cancer types and 12 normal tissue types). The Multi-Category Support Vector Machine techniques by Crammer and Singer, Weston and Watkins, and one-versus-rest were found to be the best methods and they outperform other learning algorithms such as K-Nearest Neighbors and Neural Networks often to a remarkable degree. Gene selection techniques are shown to significantly improve classification performance. These results guided the development of a software system that fully automates cancer diagnostic model construction with quality on par with or better than previously published results derived by expert human analysts. Keywords: biosvm
[Stahura2004Virtual]	Florence L Stahura and Jürgen Bajorath. Virtual screening methods that complement HTS. Comb Chem High Throughput Screen, 7(4):259-69, Jun 2004. [ bib ] In this review, we discuss a number of computational methods that have been developed or adapted for molecule classification and virtual screening (VS) of compound databases. In particular, we focus on approaches that are complementary to high-throughput screening (HTS). The discussion is limited to VS methods that operate at the small molecular level, which is often called ligand-based VS (LBVS), and does not take into account docking algorithms or other structure-based screening tools. We describe areas that greatly benefit from combining virtual and biological screening and discuss computational methods that are most suitable to contribute to the integration of screening technologies. Relevant approaches range from established methods such as clustering or similarity searching to techniques that have only recently been introduced for LBVS applications such as statistical methods or support vector machines. Finally, we discuss a number of representative applications at the interface between VS and HTS. Keywords: Algorithms, Animals, Antisense, Artificial Intelligence, Cell Line, Cluster Analysis, Comparative Study, Computational Biology, Computer Simulation, DNA Fingerprinting, Drug Evaluation, Fluorescence, Fuzzy Logic, Gene Silencing, Gene Targeting, Genetic, Hela Cells, Humans, Imaging, Intracellular Space, Microscopy, Models, Neoplasms, Neural Networks (Computer), Non-U.S. Gov't, Oligonucleotides, P.H.S., Preclinical, Prognosis, Proteomics, Quantitative Structure-Activity Relationship, RNA, RNA Interference, Research Support, Sensitivity and Specificity, Small Interfering, Thionucleotides, Three-Dimensional, Tumor, U.S. Gov't, 15200375
[Sorich2004Rapid]	Michael J Sorich, Ross A McKinnon, John O Miners, David A Winkler, and Paul A Smith. Rapid prediction of chemical metabolism by human UDP-glucuronosyltransferase isoforms using quantum chemical descriptors derived with the electronegativity equalization method. J Med Chem, 47(21):5311-7, Oct 2004. [ bib \| DOI \| http \| .pdf ] This study aimed to evaluate in silico models based on quantum chemical (QC) descriptors derived using the electronegativity equalization method (EEM) and to assess the use of QC properties to predict chemical metabolism by human UDP-glucuronosyltransferase (UGT) isoforms. Various EEM-derived QC molecular descriptors were calculated for known UGT substrates and nonsubstrates. Classification models were developed using support vector machine and partial least squares discriminant analysis. In general, the most predictive models were generated with the support vector machine. Combining QC and 2D descriptors (from previous work) using a consensus approach resulted in a statistically significant improvement in predictivity (to 84%) over both the QC and 2D models and the other methods of combining the descriptors. EEM-derived QC descriptors were shown to be both highly predictive and computationally efficient. It is likely that EEM-derived QC properties will be generally useful for predicting ADMET and physicochemical properties during drug discovery. Keywords: biosvm
[Smith2004Towards]	P. A. Smith, M. J. Sorich, L. S C Low, R. A. McKinnon, and J. O. Miners. Towards integrated ADME prediction: past, present and future directions for modelling metabolism by UDP-glucuronosyltransferases. J Mol Graph Model, 22(6):507-17, Jul 2004. [ bib \| DOI \| http \| .pdf ] Undesirable absorption, distribution, metabolism, excretion (ADME) properties are the cause of many drug development failures and this has led to the need to identify such problems earlier in the development process. This review highlights computational (in silico) approaches that have been used to identify the characteristics of ligands influencing molecular recognition and/or metabolism by the drug-metabolising enzyme UDP-glucuronosyltransferase (UGT). Current studies applying pharmacophore elucidation, 2D-quantitative structure metabolism relationships (2D-QSMR), 3D-quantitative structure metabolism relationships (3D-QSMR), and non-linear pattern recognition techniques such as artificial neural networks and support vector machines for modelling metabolism by UGT are reported. An assessment of the utility of in silico approaches for the qualitative and quantitative prediction of drug glucuronidation parameters highlights the benefit of using multiple pharmacophores and also non-linear techniques for classification. Some of the challenges facing the development of generalisable models for predicting metabolism by UGT, including the need for screening of more diverse structures, are also outlined. 
Keywords: Algorithms, Animals, Antisense, Artificial Intelligence, Astrocytoma, Automated, Autonomic Nervous System, Brain, Brain Neoplasms, Cell Line, Cerebral Cortex, Child, Cluster Analysis, Cognition, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA Fingerprinting, Databases, Diagnosis, Discriminant Analysis, Drug Design, Drug Evaluation, Electroencephalography, Emotions, Event-Related Potentials, Evoked Potentials, Factual, Fluorescence, Fuzzy Logic, Gene Silencing, Gene Targeting, Genetic, Glucuronosyltransferase, Hand, Hela Cells, Humans, Imaging, Intracellular Space, Magnetic Resonance Spectroscopy, Male, Meningeal Neoplasms, Meningioma, Microscopy, Models, Molecular Structure, Monitoring, Motor, Neoplasm Metastasis, Neoplasms, Neural Networks (Computer), Non-U.S. Gov't, Oligonucleotides, P.H.S., P300, Pattern Recognition, Peptides, Pharmaceutical Preparations, Physiologic, Preclinical, Predictive Value of Tests, Preschool, Prognosis, Protein Interaction Mapping, Protein Structure, Proteins, Proteomics, Quantitative Structure-Activity Relationship, Quaternary, RNA, RNA Interference, Recognition (Psychology), Reproducibility of Results, Research Support, Sensitivity and Specificity, Signal Processing, Small Interfering, Software, Thionucleotides, Three-Dimensional, Tumor, U.S. Gov't, User-Computer Interface, Word Processing, 15182810
[Shoeb2004Patient-specific]	Ali Shoeb, Herman Edwards, Jack Connolly, Blaise Bourgeois, S. Ted Treves, and John Guttag. Patient-specific seizure onset detection. Epilepsy Behav, 5(4):483-98, Aug 2004. [ bib \| DOI \| http \| .pdf ] This article presents an automated, patient-specific method for the detection of epileptic seizure onset from noninvasive electroencephalography. We adopt a patient-specific approach to exploit the consistency of an individual patient's seizure and nonseizure electroencephalograms. Our method uses a wavelet decomposition to construct a feature vector that captures the morphology and spatial distribution of an electroencephalographic epoch, and then determines whether that vector is representative of a patient's seizure or nonseizure electroencephalogram using the support vector machine classification algorithm. Our completely automated method was tested on noninvasive electroencephalograms from 36 pediatric subjects suffering from a variety of seizure types. It detected 131 of 139 seizure events within 8.0+/-3.2 seconds of electrographic onset, and declared 15 false detections in 60 hours of clinical electroencephalography. Our patient-specific method can be used to initiate delay-sensitive clinical procedures following seizure onset, for example, the injection of a functional imaging radiotracer. Keywords: Algorithms, Comparative Study, Computational Biology, Computer-Assisted, Databases, Diagnosis, Drug Resistance, Electroencephalography, Epilepsy, Forecasting, Genetic, Genotype, HIV Protease Inhibitors, HIV-1, Humans, Information Management, Information Storage and Retrieval, Kinetics, Linear Models, Microbial Sensitivity Tests, Models, Monitoring, Non-U.S. Gov't, P.H.S., Periodicals, Physiologic, Point Mutation, Pyrimidinones, Reaction Time, Research Support, Reverse Transcriptase Inhibitors, Signal Processing, Theoretical, Time Factors, U.S. Gov't, Viral, 15256184
[Sen2004Predicting]	T.Z. Sen, A. Kloczkowski, R.L. Jernigan, C. Yan, V. Honavar, K.M. Ho, C.Z. Wang, Y. Ihm, H. Cao, X. Gu, and D. Dobbs. Predicting binding sites of hydrolase-inhibitor complexes by combining several methods. BMC Bioinformatics, 5(205), 2004. [ bib \| DOI \| .pdf ] Background Protein-protein interactions play a critical role in protein function. Completion of many genomes is being followed rapidly by major efforts to identify interacting protein pairs experimentally in order to decipher the networks of interacting, coordinated-in-action proteins. Identification of protein-protein interaction sites and detection of specific amino acids that contribute to the specificity and the strength of protein interactions is an important problem with broad applications ranging from rational drug design to the analysis of metabolic and signal transduction networks. Results In order to increase the power of predictive methods for protein-protein interaction sites, we have developed a consensus methodology for combining four different methods. These approaches include: data mining using Support Vector Machines, threading through protein structures, prediction of conserved residues on the protein surface by analysis of phylogenetic trees, and the Conservatism of Conservatism method of Mirny and Shakhnovich. Results obtained on a dataset of hydrolase-inhibitor complexes demonstrate that the combination of all four methods yield improved predictions over the individual methods. Conclusions We developed a consensus method for predicting protein-protein interface residues by combining sequence and structure-based methods. The success of our consensus approach suggests that similar methodologies can be developed to improve prediction accuracies for other bioinformatic problems. Keywords: biosvm
[Segal2004module]	E. Segal, N. Friedman, D. Koller, and A. Regev. A module map showing conditional activity of expression modules in cancer. Nat. Genet., 36(10):1090-1098, Oct 2004. [ bib \| DOI \| http \| .pdf ] DNA microarrays are widely used to study changes in gene expression in tumors, but such studies are typically system-specific and do not address the commonalities and variations between different types of tumor. Here we present an integrated analysis of 1,975 published microarrays spanning 22 tumor types. We describe expression profiles in different tumors in terms of the behavior of modules, sets of genes that act in concert to carry out a specific function. Using a simple unified analysis, we extract modules and characterize gene-expression profiles in tumors as a combination of activated and deactivated modules. Activation of some modules is specific to particular types of tumor; for example, a growth-inhibitory module is specifically repressed in acute lymphoblastic leukemias and may underlie the deregulated proliferation in these cancers. Other modules are shared across a diverse set of clinical conditions, suggestive of common tumor progression mechanisms. For example, the bone osteoblastic module spans a variety of tumor types and includes both secreted growth factors and their receptors. Our findings suggest that there is a single mechanism for both primary tumor proliferation and metastasis to bone. Our analysis presents multiple research directions for diagnostic, prognostic and therapeutic studies. Keywords: biogm
[Seeger2004Gaussian]	Matthias Seeger. Gaussian processes for machine learning. Int J Neural Syst, 14(2):69-106, Apr 2004. [ bib ] Gaussian processes (GPs) are natural generalisations of multivariate Gaussian random variables to infinite (countably or continuous) index sets. GPs have been applied in a large number of fields to a diverse range of ends, and very many deep theoretical analyses of various properties are available. This paper gives an introduction to Gaussian processes on a fairly elementary level with special emphasis on characteristics relevant in machine learning. It draws explicit connections to branches such as spline smoothing models and support vector machines in which similar ideas have been investigated. Gaussian process models are routinely used to solve hard machine learning problems. They are attractive because of their flexible non-parametric nature and computational simplicity. Treated within a Bayesian framework, very powerful statistical methods can be implemented which offer valid estimates of uncertainties in our predictions and generic model selection procedures cast as nonlinear optimization problems. Their main drawback of heavy computational scaling has recently been alleviated by the introduction of generic sparse approximations. The mathematical literature on GPs is large and often uses deep concepts which are not required to fully understand most machine learning applications. In this tutorial paper, we aim to present characteristics of GPs relevant to machine learning and to show up precise connections to other "kernel machines" popular in the community. Our focus is on a simple presentation, but references to more detailed sources are provided. 
Keywords: Algorithms, Amino Acids, Antibodies, Artificial Intelligence, Astrocytoma, Automated, Bayes Theorem, Biological, Biopsy, Brain, Brain Mapping, Brain Neoplasms, Calibration, Comparative Study, Computational Biology, Computer-Assisted, Computing Methodologies, Cysteine, Cystine, Dysplastic Nevus Syndrome, Electrodes, Electroencephalography, Entropy, Eosine Yellowish-(YS), Evoked Potentials, Female, Gene Expression Profiling, Hematoxylin, Horseradish Peroxidase, Humans, Image Interpretation, Image Processing, Imagery (Psychotherapy), Imagination, Laterality, Linear Models, Male, Melanoma, Models, Monoclonal, Movement, Neoplasms, Neural Networks (Computer), Neuropeptides, Non-P.H.S., Non-U.S. Gov't, Nonparametric, Normal Distribution, P.H.S., Pattern Recognition, Perception, Principal Component Analysis, Protein, Protein Array Analysis, Protein Interaction Mapping, Proteins, Regression Analysis, Research Support, Sensitivity and Specificity, Sequence Alignment, Sequence Analysis, Skin Neoplasms, Software, Statistical, Statistics, Tumor Markers, U.S. Gov't, User-Computer Interface, World Health Organization, 15112367
[Schoelkopf2004Kernel]	B. Schölkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology. The MIT Press, Cambridge, Massachusetts, 2004. [ bib ] Keywords: biosvm
[Schwender2004pilot]	Holger Schwender, Manuela Zucknick, Katja Ickstadt, Hermann M Bolt, and the GENICA network. A pilot study on the application of statistical classification procedures to molecular epidemiological data. Toxicol Lett, 151(1):291-9, Jun 2004. [ bib ] The development of new statistical methods for use in molecular epidemiology comprises the building and application of appropriate classification rules. The aim of this study was to assess various classification methods that can potentially handle genetic interactions. A data set comprising genotypes at 25 single nucleotide polymorphic (SNP) loci from 518 breast cancer cases and 586 age-matched population-based controls from the GENICA study was used to build a classification rule with the discrimination methods SVM (support vector machine), CART (classification and regression tree), Bagging, Random Forest, LogitBoost and k nearest neighbours (kNN). A blind pilot analysis of the genotypic data set was a first approach to obtain an impression of the statistical structure of the data. Furthermore, this analysis was performed to explore classification methods that may be applied to molecular-epidemiological evaluation. The results showed that all blindly applied classification methods had a slightly smaller misclassification rate than a random classification. The findings, nevertheless, suggest that SNP data might be useful for the classification of individuals into categories of high or low risk of diseases. Keywords: biosvm
[Saigo2004Protein]	H. Saigo, J.-P. Vert, N. Ueda, and T. Akutsu. Protein homology detection using string alignment kernels. Bioinformatics, 20(11):1682-1689, 2004. [ bib \| http \| .pdf ] Motivation: Remote homology detection between protein sequences is a central problem in computational biology. Discriminative methods involving support vector machines (SVMs) are currently the most effective methods for the problem of superfamily recognition in the Structural Classification Of Proteins (SCOP) database. The performance of SVMs depends critically on the kernel function used to quantify the similarity between sequences. Results: We propose new kernels for strings adapted to biological sequences, which we call local alignment kernels. These kernels measure the similarity between two sequences by summing up scores obtained from local alignments with gaps of the sequences. When tested in combination with SVM on their ability to recognize SCOP superfamilies on a benchmark dataset, the new kernels outperform state-of-the-art methods for remote homology detection. Availability: Software and data available upon request. Keywords: biosvm
[Saeys2004Feature]	Y. Saeys, S. Degroeve, D. Aeyels, P. Rouzé, and Y. Van de Peer. Feature selection for splice site prediction: A new method using EDA-based feature ranking. BMC Bioinformatics, 5(64), 2004. [ bib \| DOI \| .pdf ] Background The identification of relevant biological features in large and complex datasets is an important step towards gaining insight in the processes underlying the data. Other advantages of feature selection include the ability of the classification system to attain good or even better solutions using a restricted subset of features, and a faster classification. Thus, robust methods for fast feature selection are of key importance in extracting knowledge from complex biological data. Results In this paper we present a novel method for feature subset selection applied to splice site prediction, based on estimation of distribution algorithms, a more general framework of genetic algorithms. From the estimated distribution of the algorithm, a feature ranking is derived. Afterwards this ranking is used to iteratively discard features. We apply this technique to the problem of splice site prediction, and show how it can be used to gain insight into the underlying biological process of splicing. Conclusion We show that this technique proves to be more robust than the traditional use of estimation of distribution algorithms for feature selection: instead of returning a single best subset of features (as they normally do) this method provides a dynamical view of the feature selection process, like the traditional sequential wrapper methods. However, the method is faster than the traditional techniques, and scales better to datasets described by a large number of features. Keywords: biosvm
[Saetrom2004Predicting]	P. Saetrom. Predicting the efficacy of short oligonucleotides in antisense and RNAi experiments with boosted genetic programming. Bioinformatics, 20(17):3055-3063, 2004. [ bib \| DOI \| http \| .pdf ] Motivation: Both small interfering RNAs (siRNAs) and antisense oligonucleotides can selectively block gene expression. Although the two methods rely on different cellular mechanisms, these methods share the common property that not all oligonucleotides (oligos) are equally effective. That is, if mRNA target sites are picked at random, many of the antisense or siRNA oligos will not be effective. Algorithms that can reliably predict the efficacy of candidate oligos can greatly reduce the cost of knockdown experiments, but previous attempts to predict the efficacy of antisense oligos have had limited success. Machine learning has not previously been used to predict siRNA efficacy. Results: We develop a genetic programming based prediction system that shows promising results on both antisense and siRNA efficacy prediction. We train and evaluate our system on a previously published database of antisense efficacies and our own database of siRNA efficacies collected from the literature. The best models gave an overall correlation between predicted and observed efficacy of 0.46 on both antisense and siRNA data. As a comparison, the best correlations of support vector machine classifiers trained on the same data were 0.40 and 0.30, respectively. Availability: The prediction system uses proprietary hardware and is available for both commercial and strategic academic collaborations. The siRNA database is available upon request. Keywords: biosvm
[Roegnvaldsson2004Why]	Thorsteinn Rögnvaldsson and Liwen You. Why neural networks should not be used for HIV-1 protease cleavage site prediction. Bioinformatics, 20(11):1702-9, Jul 2004. [ bib \| DOI \| http \| .pdf ] SUMMARY: Several papers have been published where nonlinear machine learning algorithms, e.g. artificial neural networks, support vector machines and decision trees, have been used to model the specificity of the HIV-1 protease and extract specificity rules. We show that the dataset used in these studies is linearly separable and that it is a misuse of nonlinear classifiers to apply them to this problem. The best solution on this dataset is achieved using a linear classifier like the simple perceptron or the linear support vector machine, and it is straightforward to extract rules from these linear models. We identify key residues in peptides that are efficiently cleaved by the HIV-1 protease and list the most prominent rules, relating them to experimental results for the HIV-1 protease. MOTIVATION: Understanding HIV-1 protease specificity is important when designing HIV inhibitors and several different machine learning algorithms have been applied to the problem. However, little progress has been made in understanding the specificity because nonlinear and overly complex models have been used. RESULTS: We show that the problem is much easier than what has previously been reported and that linear classifiers like the simple perceptron or linear support vector machines are at least as good predictors as nonlinear algorithms. We also show how sets of specificity rules can be generated from the resulting linear classifiers. AVAILABILITY: The datasets used are available at http://www.hh.se/staff/bioinf/ Keywords: biosvm
[Ratsch2004Accurate]	G. Rätsch and S. Sonnenburg. Accurate splice site detection for Caenorhabditis elegans. In B. Schölkopf, K. Tsuda, and J.P. Vert, editors, Kernel Methods in Computational Biology, pages 277-298. MIT Press, 2004. [ bib ] During the past three years, the support vector machine learning algorithm has been extensively applied within the field of computational biology. The algorithm has been used to detect patterns within and among biological sequences, to classify genes and patients based upon gene expression profiles, and has recently been applied to several new biological problems. This chapter reviews the state of the art with respect to SVM applications in computational biology. Keywords: biosvm
[Riedesel2004Peptide]	Henning Riedesel, Björn Kolbeck, Oliver Schmetzer, and Ernst-Walter Knapp. Peptide binding at class I major histocompatibility complex scored with linear functions and support vector machines. Genome Inform Ser Workshop Genome Inform, 15(1):198-212, 2004. [ bib \| .html \| .pdf ] We explore two different methods to predict the binding ability of nonapeptides at the class I major histocompatibility complex using a general linear scoring function that defines a separating hyperplane in the feature space of sequences. In absence of suitable data on non-binding nonapeptides we generated sequences randomly from a selected set of proteins from the protein data bank. The parameters of the scoring function were determined by a generalized least square optimization (LSM) and alternatively by the support vector machine (SVM). With the generalized LSM impaired data for learning with a small set of binding peptides and a large set of non-binding peptides can be treated in a balanced way rendering LSM more successful than SVM, while for symmetric data sets SVM has a slight advantage compared to LSM. Keywords: biosvm
[Prados2004Mining]	J. Prados, A. Kalousis, J.C. Sanchez, L. Allard, O. Carrette, and M. Hilario. Mining mass spectra for diagnosis and biomarker discovery of cerebral accidents. Proteomics, 4(8):2320-2332, 2004. [ bib \| DOI \| http \| .pdf ] In this paper we try to identify potential biomarkers for early stroke diagnosis using surface-enhanced laser desorption/ionization mass spectrometry coupled with analysis tools from machine learning and data mining. Data consist of 42 specimen samples, i.e., mass spectra divided in two big categories, stroke and control specimens. Among the stroke specimens two further categories exist that correspond to ischemic and hemorrhagic stroke; in this paper we limit our data analysis to discriminating between control and stroke specimens. We performed two suites of experiments. In the first one we simply applied a number of different machine learning algorithms; in the second one we have chosen the best performing algorithm as it was determined from the first phase and coupled it with a number of different feature selection methods. The reason for this was 2-fold, first to establish whether feature selection can indeed improve performance, which in our case it did not seem to confirm, but more importantly to acquire a small list of potentially interesting biomarkers. Of the different methods explored the most promising one was support vector machines which gave us high levels of sensitivity and specificity. Finally, by analyzing the models constructed by support vector machines we produced a small set of 13 features that could be used as potential biomarkers, and which exhibited good performance both in terms of sensitivity, specificity and model stability. Keywords: biosvm proteomics
[Pochet2004Systematic]	N. Pochet, F. De Smet, J. A. K. Suykens, and B. L. R. De Moor. Systematic benchmarking of microarray data classification: assessing the role of non-linearity and dimensionality reduction. Bioinformatics, 20(17):3185-3195, Nov 2004. [ bib \| DOI \| http \| .pdf ] Motivation: Microarrays are capable of determining the expression levels of thousands of genes simultaneously. In combination with classification methods, this technology can be useful to support clinical management decisions for individual patients, e.g. in oncology. The aim of this paper is to systematically benchmark the role of non-linear versus linear techniques and dimensionality reduction methods. Results: A systematic benchmarking study is performed by comparing linear versions of standard classification and dimensionality reduction techniques with their non-linear versions based on non-linear kernel functions with a radial basis function (RBF) kernel. A total of 9 binary cancer classification problems, derived from 7 publicly available microarray datasets, and 20 randomizations of each problem are examined. Conclusions: Three main conclusions can be formulated based on the performances on independent test sets. (1) When performing classification with least squares support vector machines (LS-SVMs) (without dimensionality reduction), RBF kernels can be used without risking too much overfitting. The results obtained with well-tuned RBF kernels are never worse and sometimes even statistically significantly better compared to results obtained with a linear kernel in terms of test set receiver operating characteristic and test set accuracy performances. (2) Even for classification with linear classifiers like LS-SVM with linear kernel, using regularization is very important. 
(3) When performing kernel principal component analysis (kernel PCA) before classification, using an RBF kernel for kernel PCA tends to result in overfitting, especially when using supervised feature selection. It has been observed that an optimal selection of a large number of features is often an indication for overfitting. Kernel PCA with linear kernel gives better results. Availability: Matlab scripts are available on request. Supplementary information: http://www.esat.kuleuven.ac.be/~npochet/Bioinformatics/ Keywords: biosvm microarray
[Pavlidis2004Support]	Paul Pavlidis, Ilan Wapinski, and William Stafford Noble. Support vector machine classification on the web. Bioinformatics, 20(4):586-7, Mar 2004. [ bib \| DOI \| http \| .pdf ] The support vector machine (SVM) learning algorithm has been widely applied in bioinformatics. We have developed a simple web interface to our implementation of the SVM algorithm, called Gist. This interface allows novice or occasional users to apply a sophisticated machine learning algorithm easily to their data. More advanced users can download the software and source code for local installation. The availability of these tools will permit more widespread application of this powerful learning algorithm in bioinformatics. Keywords: Adaptation, Algorithms, Ambergris, Amino Acid Sequence, Animals, Artifacts, Artificial Intelligence, Automated, Cadmium, Candida, Candida albicans, Capillary, Clinical, Cluster Analysis, Combinatorial Chemistry Techniques, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, Computing Methodologies, Databases, Decision Support Systems, Electrophoresis, Enzymes, Europe, Eye Enucleation, Humans, Image Interpretation, Image Processing, Information Storage and Retrieval, Internet, Magnetic Resonance Imaging, Magnetic Resonance Spectroscopy, Markov Chains, Melanoma, Models, Molecular, Molecular Conformation, Molecular Sequence Data, Molecular Structure, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Odors, P.H.S., Pattern Recognition, Perfume, Physiological, Predictive Value of Tests, Prognosis, Prospective Studies, Protein, Protein Structure, Proteins, Proteomics, Quantitative Structure-Activity Relationship, Rats, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Saccharomyces cerevisiae Proteins, Secondary, Sensitivity and Specificity, Signal Processing, Single-Blind Method, Soft Tissue Neoplasms, Software, Statistical, U.S. Gov't, Uveal Neoplasms, Visual, 14990457
[Passerini2004Learning]	A. Passerini and P. Frasconi. Learning to discriminate between ligand-bound and disulfide-bound cysteines. Protein Eng. Des. Sel., 17(4):367-373, 2004. [ bib \| DOI \| http \| .pdf ] We present a machine learning method to discriminate between cysteines involved in ligand binding and cysteines forming disulfide bridges. Our method uses a window of multiple alignment profiles to represent each instance and support vector machines with a polynomial kernel as the learning algorithm. We also report results obtained with two new kernel functions based on similarity matrices. Experimental results indicate that binding type can be predicted at significantly higher accuracy than using PROSITE patterns. Keywords: biosvm
[Pan2004Comprehensive]	Fei Pan, Baoying Wang, Xin Hu, and William Perrizo. Comprehensive vertical sample-based KNN/LSVM classification for gene expression analysis. J Biomed Inform, 37(4):240-8, Aug 2004. [ bib \| DOI \| http \| .pdf ] Classification analysis of microarray gene expression data has been widely used to uncover biological features and to distinguish closely related cell types that often appear in the diagnosis of cancer. However, the number of dimensions of gene expression data is often very high, e.g., in the hundreds or thousands. Accurate and efficient classification of such high-dimensional data remains a contemporary challenge. In this paper, we propose a comprehensive vertical sample-based KNN/LSVM classification approach with weights optimized by genetic algorithms for high-dimensional data. Experiments on common gene expression datasets demonstrated that our approach can achieve high accuracy and efficiency at the same time. The improvement of speed is mainly related to the vertical data representation, P-tree, and its optimized logical algebra. (Patents are pending on the P-tree technology. This work is partially supported by GSA Grant ACT#:K96130308.) The high accuracy is due to the combination of a KNN majority voting approach and a local support vector machine approach that makes optimal decisions at the local level. As a result, our approach could be a powerful tool for high-dimensional gene expression data analysis. Keywords: biosvm
[Noble2004Support]	W. S. Noble. Support vector machine applications in computational biology. In B. Schölkopf, K. Tsuda, and J.P. Vert, editors, Kernel Methods in Computational Biology, pages 71-92. MIT Press, 2004. [ bib \| .pdf ] During the past three years, the support vector machine learning algorithm has been extensively applied within the field of computational biology. The algorithm has been used to detect patterns within and among biological sequences, to classify genes and patients based upon gene expression profiles, and has recently been applied to several new biological problems. This chapter reviews the state of the art with respect to SVM applications in computational biology. Keywords: biosvm
[Natt2004Prediction]	N.K. Natt, H. Kaur, and G.P. Raghava. Prediction of transmembrane regions of beta-barrel proteins using ANN- and SVM-based methods. Proteins, 56(1):11-18, 2004. [ bib \| DOI \| http \| .pdf ] This article describes a method developed for predicting transmembrane beta-barrel regions in membrane proteins using machine learning techniques: artificial neural network (ANN) and support vector machine (SVM). The ANN used in this study is a feed-forward neural network with a standard back-propagation training algorithm. The accuracy of the ANN-based method improved significantly, from 70.4 when evolutionary information was added to a single sequence as a multiple sequence alignment obtained from PSI-BLAST. We have also developed an SVM-based method using a primary sequence as input and achieved an accuracy of 77.4 by adding 36 physicochemical parameters to the amino acid sequence information. Finally, ANN- and SVM-based methods were combined to utilize the full potential of both techniques. The accuracy and Matthews correlation coefficient (MCC) value of SVM, ANN, and combined method are 78.5 and 0.64, respectively. These methods were trained and tested on a nonredundant data set of 16 proteins, and performance was evaluated using "leave one out cross-validation" (LOOCV). Based on this study, we have developed a Web server, TBBPred, for predicting transmembrane beta-barrel regions in proteins (available at http://www.imtech.res.in/raghava/tbbpred). Keywords: biosvm
[Mika2004Protein]	Sven Mika and Burkhard Rost. Protein names precisely peeled off free text. Bioinformatics, 20(Suppl. 1):i241-i247, 2004. [ bib \| http \| .pdf ] Motivation: Automatically identifying protein names from the scientific literature is a pre-requisite for the increasing demand in data-mining this wealth of information. Existing approaches are based on dictionaries, rules and machine-learning. Here, we introduced a novel system that combines a pre-processing dictionary- and rule-based filtering step with several separately trained support vector machines (SVMs) to identify protein names in the MEDLINE abstracts. Results: Our new tagging-system NLProt is capable of extracting protein names with a precision (accuracy) of 75 76 and contains 200 annotated abstracts. For our estimate of sustained performance, we considered partially identified names as false positives. One important issue frequently ignored in the literature is the redundancy in evaluation sets. We suggested some guidelines for removing overly inadequate overlaps between training and testing sets. Applying these new guidelines, our program appeared to significantly out-perform other methods tagging protein names. NLProt was so successful due to the SVM-building blocks that succeeded in utilizing the local context of protein names in the scientific literature. We challenge that our system may constitute the most general and precise method for tagging protein names. Availability: http://cubic.bioc.columbia.edu/services/nlprot/ Keywords: biosvm nlp
[Mika2004NLProt]	Sven Mika and Burkhard Rost. NLProt: extracting protein names and sequences from papers. Nucleic Acids Res, 32(Web Server issue):W634-7, Jul 2004. [ bib \| DOI \| http ] Automatically extracting protein names from the literature and linking these names to the associated entries in sequence databases is becoming increasingly important for annotating biological databases. NLProt is a novel system that combines dictionary- and rule-based filtering with several support vector machines (SVMs) to tag protein names in PubMed abstracts. When considering partially tagged names as errors, NLProt still reached a precision of 75% at a recall of 76%. By many criteria our system outperformed other tagging methods significantly; in particular, it proved very reliable even for novel names. Names encountered particularly frequently in Drosophila, such as white, wing and bizarre, constitute an obvious limitation of NLProt. Our method is available both as an Internet server and as a program for download (http://cubic.bioc.columbia.edu/services/NLProt/). Input can be PubMed/MEDLINE identifiers, authors, titles and journals, as well as collections of abstracts, or entire papers. Keywords: biosvm nlp
[Middendorf2004Discriminative]	M. Middendorf, E. Ziv, C. Adams, J. Hom, R. Koytcheff, C. Levovitz, G. Woods, L. Chen, and C. Wiggins. Discriminative topological features reveal biological network mechanisms. BMC Bioinformatics, 5(181), 2004. [ bib \| DOI \| http \| .pdf ] BACKGROUND: Recent genomic and bioinformatic advances have motivated the development of numerous network models intending to describe graphs of biological, technological, and sociological origin. In most cases the success of a model has been evaluated by how well it reproduces a few key features of the real-world data, such as degree distributions, mean geodesic lengths, and clustering coefficients. Often pairs of models can reproduce these features with indistinguishable fidelity despite being generated by vastly different mechanisms. In such cases, these few target features are insufficient to distinguish which of the different models best describes real world networks of interest; moreover, it is not clear a priori that any of the presently-existing algorithms for network generation offers a predictive description of the networks inspiring them. RESULTS: We present a method to assess systematically which of a set of proposed network generation algorithms gives the most accurate description of a given biological network. To derive discriminative classifiers, we construct a mapping from the set of all graphs to a high-dimensional (in principle infinite-dimensional) "word space". This map defines an input space for classification schemes which allow us to state unambiguously which models are most descriptive of a given network of interest. Our training sets include networks generated from 17 models either drawn from the literature or introduced in this work. We show that different duplication-mutation schemes best describe the E. coli genetic network, the S. cerevisiae protein interaction network, and the C. elegans neuronal network, out of a set of network models including a linear preferential attachment model and a small-world model. CONCLUSIONS: Our method is a first step towards systematizing network models and assessing their predictability, and we anticipate its usefulness for a number of communities. Keywords: biosvm
[Meinicke2004Oligo]	P. Meinicke, M. Tech, B. Morgenstern, and R. Merkl. Oligo kernels for datamining on biological sequences: a case study on prokaryotic translation initiation sites. BMC Bioinformatics, 5(169), 2004. [ bib \| DOI \| http \| .pdf ] Background Kernel-based learning algorithms are among the most advanced machine learning methods and have been successfully applied to a variety of sequence classification tasks within the field of bioinformatics. Conventional kernels utilized so far do not provide an easy interpretation of the learnt representations in terms of positional and compositional variability of the underlying biological signals. Results We propose a kernel-based approach to datamining on biological sequences. With our method it is possible to model and analyze positional variability of oligomers of any length in a natural way. On one hand this is achieved by mapping the sequences to an intuitive but high-dimensional feature space, well-suited for interpretation of the learnt models. On the other hand, by means of the kernel trick we can provide a general learning algorithm for that high-dimensional representation because all required statistics can be computed without performing an explicit feature space mapping of the sequences. By introducing a kernel parameter that controls the degree of position-dependency, our feature space representation can be tailored to the characteristics of the biological problem at hand. A regularized learning scheme enables application even to biological problems for which only small sets of example sequences are available. Our approach includes a visualization method for transparent representation of characteristic sequence features. Thereby importance of features can be measured in terms of discriminative strength with respect to classification of the underlying sequences. To demonstrate and validate our concept on a biochemically well-defined case, we analyze E. coli translation initiation sites in order to show that we can find biologically relevant signals. For that case, our results clearly show that the Shine-Dalgarno sequence is the most important signal upstream a start codon. The variability in position and composition we found for that signal is in accordance with previous biological knowledge. We also find evidence for signals downstream of the start codon, previously introduced as transcriptional enhancers. These signals are mainly characterized by occurrences of adenine in a region of about 4 nucleotides next to the start codon. Conclusions We showed that the oligo kernel can provide a valuable tool for the analysis of relevant signals in biological sequences. In the case of translation initiation sites we could clearly deduce the most discriminative motifs and their positional variation from example sequences. Attractive features of our approach are its flexibility with respect to oligomer length and position conservation. By means of these two parameters oligo kernels can easily be adapted to different biological problems. Keywords: biosvm
[McAuliffe2004Multiple-sequence]	J. D. McAuliffe, L. Pachter, and M. I. Jordan. Multiple-sequence functional annotation and the generalized hidden Markov phylogeny. Bioinformatics, 20(12):1850-1860, Aug 2004. [ bib \| DOI \| http \| .pdf ] MOTIVATION: Phylogenetic shadowing is a comparative genomics principle that allows for the discovery of conserved regions in sequences from multiple closely related organisms. We develop a formal probabilistic framework for combining phylogenetic shadowing with feature-based functional annotation methods. The resulting model, a generalized hidden Markov phylogeny (GHMP), applies to a variety of situations where functional regions are to be inferred from evolutionary constraints. RESULTS: We show how GHMPs can be used to predict complete shared gene structures in multiple primate sequences. We also describe shadower, our implementation of such a prediction system. We find that shadower outperforms previously reported ab initio gene finders, including comparative human-mouse approaches, on a small sample of diverse exonic regions. Finally, we report on an empirical analysis of shadower's performance which reveals that as few as five well-chosen species may suffice to attain maximal sensitivity and specificity in exon demarcation. AVAILABILITY: A Web server is available at http://bonaire.lbl.gov/shadower Keywords: biogm
[Man2004Evaluating]	M.Z. Man, G. Dyson, K. Johnson, and B. Liao. Evaluating methods for classifying expression data. J. Biopharm. Stat., 14(4):1065-1084, 2004. [ bib \| DOI \| .pdf ] An attractive application of expression technologies is to predict drug efficacy or safety using expression data of biomarkers. To evaluate the performance of various classification methods for building predictive models, we applied these methods on six expression datasets. These datasets were from studies using microarray technologies and had either two or more classes. From each of the original datasets, two subsets were generated to simulate two scenarios in biomarker applications. First, a 50-gene subset was used to simulate a candidate gene approach when it might not be practical to measure a large number of genes/biomarkers. Next, a 2000-gene subset was used to simulate a whole genome approach. We evaluated the relative performance of several classification methods by using leave-one-out cross-validation and bootstrap cross-validation. Although all methods perform well in both subsets for a relative easy dataset with two classes, differences in performance do exist among methods for other datasets. Overall, partial least squares discriminant analysis (PLS-DA) and support vector machines (SVM) outperform all other methods. We suggest a practical approach to take advantage of multiple methods in biomarker applications. Keywords: biosvm
[Mahe2004Extensions]	P. Mahé, N. Ueda, T. Akutsu, J.-L. Perret, and J.-P. Vert. Extensions of marginalized graph kernels. In R. Greiner and D. Schuurmans, editors, Proceedings of the Twenty-First International Conference on Machine Learning (ICML 2004), pages 552-559. ACM Press, 2004. [ bib \| www: ] Positive definite kernels between labeled graphs have recently been proposed. They enable the application of kernel methods, such as support vector machines, to the analysis and classification of graphs, for example, chemical compounds. These graph kernels are obtained by marginalizing a kernel between paths with respect to a random walk model on the graph vertices along the edges. We propose two extensions of these graph kernels, with the double goal to reduce their computation time and increase their relevance as measure of similarity between graphs. First, we propose to modify the label of each vertex by automatically adding information about its environment with the use of the Morgan algorithm. Second, we suggest a modification of the random walk model to prevent the walk from coming back to a vertex that was just visited. These extensions are then tested on benchmark experiments of chemical compounds classification, with promising results. Keywords: biosvm chemoinformatics
[Madeira2004Biclustering]	S. C. Madeira and A. L. Oliveira. Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinform, 1(1):24-45, 2004. [ bib \| DOI \| http ] A large number of clustering approaches have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the results from the application of standard clustering methods to genes are limited. This limitation is imposed by the existence of a number of experimental conditions where the activity of genes is uncorrelated. A similar limitation exists when clustering of conditions is performed. For this reason, a number of algorithms that perform simultaneous clustering on the row and column dimensions of the data matrix has been proposed. The goal is to find submatrices, that is, subgroups of genes and subgroups of conditions, where the genes exhibit highly correlated activities for every condition. In this paper, we refer to this class of algorithms as biclustering. Biclustering is also referred in the literature as coclustering and direct clustering, among others names, and has also been used in fields such as information retrieval and data mining. In this comprehensive survey, we analyze a large number of existing approaches to biclustering, and classify them in accordance with the type of biclusters they can find, the patterns of biclusters that are discovered, the methods used to perform the search, the approaches used to evaluate the solution, and the target applications. Keywords: Algorithms; Cluster Analysis; Computational Biology, methods; Gene Expression Profiling, statistics & numerical data; Gene Expression, genetics; Humans; Models, Statistical; Oligonucleotide Array Sequence Analysis, methods; Saccharomyces cerevisiae, genetics
[Liu2004comparative]	Y. Liu. A comparative study on feature selection methods for drug discovery. J Chem Inf Comput Sci, 44(5):1823-8, 2004. [ bib \| DOI \| http \| .pdf ] Feature selection is frequently used as a preprocessing step to machine learning. The removal of irrelevant and redundant information often improves the performance of learning algorithms. This paper is a comparative study of feature selection in drug discovery. The focus is on aggressive dimensionality reduction. Five methods were evaluated, including information gain, mutual information, a chi2-test, odds ratio, and GSS coefficient. Two well-known classification algorithms, Naïve Bayesian and Support Vector Machine (SVM), were used to classify the chemical compounds. The results showed that Naïve Bayesian benefited significantly from the feature selection, while SVM performed better when all features were used. In this experiment, information gain and chi2-test were most effective feature selection methods. Using information gain with a Naïve Bayesian classifier, removal of up to 96% of the features yielded an improved classification accuracy measured by sensitivity. When information gain was used to select the features, SVM was much less sensitive to the reduction of feature space. The feature set size was reduced by 99%, while losing only a few percent in terms of sensitivity (from 58.7% to 52.5%) and specificity (from 98.4% to 97.2%). In contrast to information gain and chi2-test, mutual information had relatively poor performance due to its bias toward favoring rare features and its sensitivity to probability estimation errors. Keywords: biosvm
[Liu2004Active]	Y. Liu. Active learning with support vector machine applied to gene expression data for cancer classification. J. Chem. Inf. Comput. Sci., 44(6):1936-1941, 2004. [ bib \| DOI \| http \| .pdf ] There is growing interest in the application of machine learning techniques in bioinformatics. The supervised machine learning approach has been widely applied to bioinformatics and gained a lot of success in this research area. With this learning approach researchers first develop a large training set, which is a time-consuming and costly process. Moreover, the proportion of the positive examples and negative examples in the training set may not represent the real-world data distribution, which causes concept drift. Active learning avoids these problems. Unlike most conventional learning methods where the training set used to derive the model remains static, the classifier can actively choose the training data and the size of training set increases. We introduced an algorithm for performing active learning with support vector machine and applied the algorithm to gene expression profiles of colon cancer, lung cancer, and prostate cancer samples. We compared the classification performance of active learning with that of passive learning. The results showed that employing the active learning method can achieve high accuracy and significantly reduce the need for labeled training instances. For lung cancer classification, to achieve 96% [...] only 31 labeled examples were needed in active learning whereas in passive learning 174 labeled examples were required. That meant over 82% [...] the areas under the receiver operating characteristic (ROC) curves were over 0.81, while in passive learning the areas under the ROC curves were below 0.50. Keywords: biosvm
[Liu2004QSAR]	H. X. Liu, R. S. Zhang, X. J. Yao, M. C. Liu, Z. D. Hu, and B. T. Fan. QSAR and classification models of a novel series of COX-2 selective inhibitors: 1,5-diarylimidazoles based on support vector machines. J Comput Aided Mol Des, 18(6):389-99, Jun 2004. [ bib ] The support vector machine, which is a novel algorithm from the machine learning community, was used to develop quantitation and classification models which can be used as a potential screening mechanism for a novel series of COX-2 selective inhibitors. Each compound was represented by calculated structural descriptors that encode constitutional, topological, geometrical, electrostatic, and quantum-chemical features. The heuristic method was then used to search the descriptor space and select the descriptors responsible for activity. Quantitative modelling results in a nonlinear, seven-descriptor model based on SVMs with root mean-square errors of 0.107 and 0.136 for training and prediction sets, respectively. The best classification results are found using SVMs: the accuracy for training and test sets is 91.2% and 88.2%, respectively. This paper proposes a new and effective method for drug design and screening. Keywords: biosvm chemoinformatics
[Liu2004Prediction]	H. X. Liu, R. S. Zhang, X. J. Yao, M. C. Liu, Z. D. Hu, and B. T. Fan. Prediction of the isoelectric point of an amino acid based on GA-PLS and SVMs. J Chem Inf Comput Sci, 44(1):161-7, 2004. [ bib \| DOI \| http \| .pdf ] The support vector machine (SVM), as a novel type of a learning machine, for the first time, was used to develop a QSPR model that relates the structures of 35 amino acids to their isoelectric point. Molecular descriptors calculated from the structure alone were used to represent molecular structures. The seven descriptors selected using GA-PLS, which is a sophisticated hybrid approach that combines GA as a powerful optimization method with PLS as a robust statistical method for variable selection, were used as inputs of RBFNNs and SVM to predict the isoelectric point of an amino acid. The optimal QSPR model developed was based on support vector machines, which showed the following results: the root-mean-square error of 0.2383 and the prediction correlation coefficient R=0.9702 were obtained for the whole data set. Satisfactory results indicated that the GA-PLS approach is a very effective method for variable selection, and the support vector machine is a very promising tool for the nonlinear approximation. Keywords: biosvm
[Liu2004Quantitative]	H. X. Liu, C. X. Xue, R. S. Zhang, X. J. Yao, M. C. Liu, Z. D. Hu, and B. T. Fan. Quantitative prediction of logk of peptides in high-performance liquid chromatography based on molecular descriptors by using the heuristic method and support vector machine. J Chem Inf Comput Sci, 44(6):1979-86, 2004. [ bib \| DOI \| http \| .pdf ] A new method support vector machine (SVM) and the heuristic method (HM) were used to develop the nonlinear and linear models between the capacity factor (logk) and seven molecular descriptors of 75 peptides for the first time. The molecular descriptors representing the structural features of the compounds only included the constitutional and topological descriptors, which can be obtained easily without optimizing the structure of the molecule. The seven molecular descriptors selected by the heuristic method in CODESSA were used as inputs for SVM. The results obtained by SVM were compared with those obtained by the heuristic method. The prediction result of the SVM model is better than that of heuristic method. For the test set, a predictive correlation coefficient R = 0.9801 and root-mean-square error of 0.1523 were obtained. The prediction results are in very good agreement with the experimental values. But the linear model of the heuristic method is easier to understand and ready to use for a chemist. This paper provided a new and effective method for predicting the chromatography retention of peptides and some insight into the structural features which are related to the capacity factor of peptides. Keywords: biosvm
[Liu2004Using]	Huiqing Liu, Hao Han, Jinyan Li, and Limsoon Wong. Using amino acid patterns to accurately predict translation initiation sites. In Silico Biol., 4(3):255-69, 2004. [ bib \| http ] The translation initiation site (TIS) prediction problem is about how to correctly identify TIS in mRNA, cDNA, or other types of genomic sequences. High prediction accuracy can be helpful in a better understanding of protein coding from nucleotide sequences. This is an important step in genomic analysis to determine protein coding from nucleotide sequences. In this paper, we present an in silico method to predict translation initiation sites in vertebrate cDNA or mRNA sequences. This method consists of three sequential steps as follows. In the first step, candidate features are generated using k-gram amino acid patterns. In the second step, a small number of top-ranked features are selected by an entropy-based algorithm. In the third step, a classification model is built to recognize true TISs by applying support vector machines or ensembles of decision trees to the selected features. We have tested our method on several independent data sets, including two public ones and our own extracted sequences. The experimental results achieved are better than those reported previously using the same data sets. Our high accuracy not only demonstrates the feasibility of our method, but also indicates that there might be "amino acid" patterns around TIS in cDNA and mRNA sequences. Keywords: biosvm
[Listgarten2004Predictive]	J. Listgarten, S. Damaraju, B. Poulin, L. Cook, J. Dufour, A. Driga, J. Mackey, D. Wishart, R. Greiner, and B. Zanke. Predictive Models for Breast Cancer Susceptibility from Multiple Single Nucleotide Polymorphisms. Clin. Cancer Res., 10(8):2725-2737, 2004. [ bib \| arXiv \| http \| .pdf ] Hereditary predisposition and causative environmental exposures have long been recognized in human malignancies. In most instances, cancer cases occur sporadically, suggesting that environmental influences are critical in determining cancer risk. To test the influence of genetic polymorphisms on breast cancer risk, we have measured 98 single nucleotide polymorphisms (SNPs) distributed over 45 genes of potential relevance to breast cancer etiology in 174 patients and have compared these with matched normal controls. Using machine learning techniques such as support vector machines (SVMs), decision trees, and naive Bayes, we identified a subset of three SNPs as key discriminators between breast cancer and controls. The SVMs performed maximally among predictive models, achieving 69% power in distinguishing between the two groups, compared with a 50% baseline predictive power obtained from the data after repeated random permutation of class labels (individuals with cancer or controls). However, the simpler naive Bayes model as well as the decision tree model performed quite similarly to the SVM. The three SNP sites most useful in this model were (a) the +4536T/C site of the aldosterone synthase gene CYP11B2 at amino acid residue 386 Val/Ala (T/C) (rs4541); (b) the +4328C/G site of the aryl hydrocarbon hydroxylase CYP1B1 at amino acid residue 293 Leu/Val (C/G) (rs5292); and (c) the +4449C/T site of the transcription factor BCL6 at amino acid 387 Asp/Asp (rs1056932). No single SNP site on its own could achieve more than 60% predictive accuracy. We have shown that multiple SNP sites from different genes over distant parts of the genome are better at identifying breast cancer patients than any one SNP alone. As high-throughput technology for SNPs improves and as more SNPs are identified, it is likely that much higher predictive accuracy will be achieved and a useful clinical tool developed. Keywords: biosvm, breastcancer
[Li2004comparative]	T. Li, C. Zhang, and M. Ogihara. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics, 20(15):2429-2437, 2004. [ bib \| http \| .pdf ] Summary: This paper studies the problem of building multiclass classifiers for tissue classification based on gene expression. The recent development of microarray technologies has enabled biologists to quantify gene expression of tens of thousands of genes in a single experiment. Biologists have begun collecting gene expression for a large number of samples. One of the urgent issues in the use of microarray data is to develop methods for characterizing samples based on their gene expression. The most basic step in the research direction is binary sample classification, which has been studied extensively over the past few years. This paper investigates the next step-multiclass classification of samples based on gene expression. The characteristics of expression data (e.g. large number of genes with small sample size) makes the classification problem more challenging. The process of building multiclass classifiers is divided into two components: (i) selection of the features (i.e. genes) to be used for training and testing and (ii) selection of the classification method. This paper compares various feature selection methods as well as various state-of-the-art classification methods on various multiclass gene expression datasets. Our study indicates that multiclass classification problem is much more difficult than the binary one for the gene expression datasets. The difficulty lies in the fact that the data are of high dimensionality and that the sample size is small. The classification accuracy appears to degrade very rapidly as the number of classes increases. In particular, the accuracy was very low regardless of the choices of the methods for large-class datasets (e.g. NCI60 and GCM). 
While increasing the number of samples is a plausible solution to the problem of accuracy degradation, it is important to develop algorithms that are able to analyze effectively multiple-class expression data for these special datasets. Keywords: biosvm
[Li2004Data]	L. Li, H. Tang, Z. Wu, J. Gong, M. Gruidl, J. Zou, M. Tockman, and R.A. Clark. Data mining techniques for cancer detection using serum proteomic profiling. Artif. Intell. Med., 32(2):71-83, 2004. [ bib \| DOI \| http \| .pdf ] OBJECTIVE: Pathological changes in an organ or tissue may be reflected in proteomic patterns in serum. It is possible that unique serum proteomic patterns could be used to discriminate cancer samples from non-cancer ones. Due to the complexity of proteomic profiling, a higher order analysis such as data mining is needed to uncover the differences in complex proteomic patterns. The objectives of this paper are (1) to briefly review the application of data mining techniques in proteomics for cancer detection/diagnosis; (2) to explore a novel analytic method with different feature selection methods; (3) to compare the results obtained on different datasets and that reported by Petricoin et al. in terms of detection performance and selected proteomic patterns. METHODS AND MATERIAL: Three serum SELDI MS data sets were used in this research to identify serum proteomic patterns that distinguish the serum of ovarian cancer cases from non-cancer controls. A support vector machine-based method is applied in this study, in which statistical testing and genetic algorithm-based methods are used for feature selection respectively. Leave-one-out cross validation with receiver operating characteristic (ROC) curve is used for evaluation and comparison of cancer detection performance. RESULTS AND CONCLUSIONS: The results showed that (1) data mining techniques can be successfully applied to ovarian cancer detection with a reasonably high performance; (2) the classification using features selected by the genetic algorithm consistently outperformed those selected by statistical testing in terms of accuracy and robustness; (3) the discriminatory features (proteomic patterns) can be very different from one selection method to another. 
In other words, the pattern selection and its classification efficiency are highly classifier dependent. Therefore, when using data mining techniques, the discrimination of cancer from normal does not depend solely upon the identity and origination of cancer-related proteins. Keywords: biosvm
[Lett2004Interaction]	D. Lett, M. Hsing, and F. Pio. Interaction profile-based protein classification of death domain. BMC Bioinformatics, 5(75), 2004. [ bib \| DOI \| http \| .pdf ] Background The increasing number of protein sequences and 3D structure obtained from genomic initiatives is leading many of us to focus on proteomics, and to dedicate our experimental and computational efforts on the creation and analysis of information derived from 3D structure. In particular, the high-throughput generation of protein-protein interaction data from a few organisms makes such an approach very important towards understanding the molecular recognition that make-up the entire protein-protein interaction network. Since the generation of sequences, and experimental protein-protein interactions increases faster than the 3D structure determination of protein complexes, there is tremendous interest in developing in silico methods that generate such structure for prediction and classification purposes. In this study we focused on classifying protein family members based on their protein-protein interaction distinctiveness. Structure-based classification of protein-protein interfaces has been described initially by Ponstingl et al. [1] and more recently by Valdar et al. [2] and Mintseris et al. [3], from complex structures that have been solved experimentally. However, little has been done on protein classification based on the prediction of protein-protein complexes obtained from homology modeling and docking simulation. Results We have developed an in silico classification system entitled HODOCO (Homology modeling, Docking and Classification Oracle), in which protein Residue Potential Interaction Profiles (RPIPS) are used to summarize protein-protein interaction characteristics. This system applied to a dataset of 64 proteins of the death domain superfamily was used to classify each member into its proper subfamily. 
Two classification methods were attempted, heuristic and support vector machine learning. Both methods were tested with a 5-fold cross-validation. The heuristic approach yielded a 61% accuracy, while the machine learning approach yielded an 89% accuracy. Conclusion We have confirmed the reliability and potential value of classifying proteins via their predicted interactions. Our results are in the same range of accuracy as other studies that classify protein-protein interactions from 3D complex structure obtained experimentally. While our classification scheme does not take directly into account sequence information our results are in agreement with functional and sequence based classification of death domain family members. Keywords: biosvm
[Leslie2004Mismatch]	C. S. Leslie, E. Eskin, A. Cohen, J. Weston, and W. S. Noble. Mismatch string kernels for discriminative protein classification. Bioinformatics, 20(4):467-476, 2004. [ bib \| http \| .pdf ] Motivation: Classification of proteins sequences into functional and structural families based on sequence homology is a central problem in computational biology. Discriminative supervised machine learning approaches provide good performance, but simplicity and computational efficiency of training and prediction are also important concerns. Results: We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the problem of protein classification and remote homology detection. These kernels measure sequence similarity based on shared occurrences of fixed-length patterns in the data, allowing for mutations between patterns. Thus, the kernels provide a biologically well-motivated way to compare protein sequences without relying on family-based generative models such as hidden Markov models. We compute the kernels efficiently using a mismatch tree data structure, allowing us to calculate the contributions of all patterns occurring in the data in one pass while traversing the tree. When used with an SVM, the kernels enable fast prediction on test sequences. We report experiments on two benchmark SCOP datasets, where we show that the mismatch kernel used with an SVM classifier performs competitively with state-of-the-art methods for homology detection, particularly when very few training examples are available. Examination of the highest-weighted patterns learned by the SVM classifier recovers biologically important motifs in protein families and superfamilies. Availability: SVM software is publicly available at http://microarray.cpmc.columbia.edu/gist. Mismatch kernel software is available upon request. Keywords: biosvm
[Lanckriet2004statistical]	G. R. G. Lanckriet, T. De Bie, N. Cristianini, M. I. Jordan, and W. S. Noble. A statistical framework for genomic data fusion. Bioinformatics, 20(16):2626-2635, 2004. [ bib \| DOI \| http \| .pdf ] Motivation: During the past decade, the new focus on genomics has highlighted a particular challenge: to integrate the different views of the genome that are provided by various types of experimental data. Results: This paper describes a computational framework for integrating and drawing inferences from a collection of genome-wide measurements. Each dataset is represented via a kernel function, which defines generalized similarity relationships between pairs of entities, such as genes or proteins. The kernel representation is both flexible and efficient, and can be applied to many different types of data. Furthermore, kernel functions derived from different types of data can be combined in a straightforward fashion. Recent advances in the theory of kernel methods have provided efficient algorithms to perform such combinations in a way that minimizes a statistical loss function. These methods exploit semidefinite programming techniques to reduce the problem of finding optimizing kernel combinations to a convex optimization problem. Computational experiments performed using yeast genome-wide datasets, including amino acid sequences, hydropathy profiles, gene expression data and known protein-protein interactions, demonstrate the utility of this approach. A statistical learning algorithm trained from all of these data to recognize particular classes of proteins-membrane proteins and ribosomal proteins-performs significantly better than the same algorithm trained on any single type of data. Availability: Supplementary data at http://noble.gs.washington.edu/proj/sdp-svm Keywords: biosvm
[Lanckriet2004Kernel-baseda]	G.R. Lanckriet, M. Deng, N. Cristianini, M.I. Jordan, and W.S. Noble. Kernel-based data fusion and its application to protein function prediction in yeast. In Proceedings of the Pacific Symposium on Biocomputing, pages 300-311, 2004. [ bib \| .pdf ] Kernel methods provide a principled framework in which to represent many types of data, including vectors, strings, trees and graphs. As such, these methods are useful for drawing inferences about biological phenomena. We describe a method for combining multiple kernel representations in an optimal fashion, by formulating the problem as a convex optimization problem that can be solved using semidefinite programming techniques. The method is applied to the problem of predicting yeast protein functional classifications using a support vector machine (SVM) trained on five types of data. For this problem, the new method performs better than a previously-described Markov random field method, and better than the SVM trained on any single type of data. Keywords: biosvm
[Lanckriet2004Kernel-based]	G.R.G. Lanckriet, N. Cristianini, M.I. Jordan, and W.S. Noble. Kernel-based integration of genomic data using semidefinite programming. In B. Schölkopf, K. Tsuda, and J.P. Vert, editors, Kernel Methods in Computational Biology, pages 231-259. MIT Press, 2004. [ bib ] Keywords: biosvm
[Lal2004Support]	Thomas Navin Lal, Michael Schröder, Thilo Hinterberger, Jason Weston, Martin Bogdan, Niels Birbaumer, and Bernhard Schölkopf. Support vector channel selection in BCI. IEEE Trans Biomed Eng, 51(6):1003-10, Jun 2004. [ bib ] Designing a brain computer interface (BCI) system one can choose from a variety of features that may be useful for classifying brain activity during a mental task. For the special case of classifying electroencephalogram (EEG) signals we propose the usage of the state of the art feature selection algorithms Recursive Feature Elimination and Zero-Norm Optimization which are based on the training of support vector machines (SVM). These algorithms can provide more accurate solutions than standard filter methods for feature selection. We adapt the methods for the purpose of selecting EEG channels. For a motor imagery paradigm we show that the number of used channels can be reduced significantly without increasing the classification error. The resulting best channels agree well with the expected underlying cortical activity patterns during the mental tasks. Furthermore we show how time dependent task specific information can be visualized. Keywords: Algorithms, Animals, Antisense, Artificial Intelligence, Automated, Autonomic Nervous System, Brain, Cell Line, Cerebral Cortex, Child, Cluster Analysis, Cognition, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA Fingerprinting, Databases, Drug Evaluation, Electroencephalography, Emotions, Event-Related Potentials, Evoked Potentials, Factual, Fluorescence, Fuzzy Logic, Gene Silencing, Gene Targeting, Genetic, Hand, Hela Cells, Humans, Imaging, Intracellular Space, Male, Microscopy, Models, Monitoring, Motor, Neoplasms, Neural Networks (Computer), Non-U.S. Gov't, Oligonucleotides, P.H.S., P300, Pattern Recognition, Peptides, Physiologic, Preclinical, Predictive Value of Tests, Preschool, Prognosis, Protein Interaction Mapping, Protein Structure, Proteins, Proteomics, Quantitative Structure-Activity Relationship, Quaternary, RNA, RNA Interference, Recognition (Psychology), Reproducibility of Results, Research Support, Sensitivity and Specificity, Signal Processing, Small Interfering, Software, Thionucleotides, Three-Dimensional, Tumor, U.S. Gov't, User-Computer Interface, Word Processing, 15188871
[Kuang2004Protein]	R. Kuang, C. S. Leslie, and A.-S. Yang. Protein backbone angle prediction with machine learning approaches. Bioinformatics, 20(10):1612-1621, 2004. [ bib \| http \| .pdf ] Motivation: Protein backbone torsion angle prediction provides useful local structural information that goes beyond conventional three-state (alpha, beta and coil) secondary structure predictions. Accurate prediction of protein backbone torsion angles will substantially improve modeling procedures for local structures of protein sequence segments, especially in modeling loop conformations that do not form regular structures as in alpha-helices or beta-strands. Results: We have devised two novel automated methods in protein backbone conformational state prediction: one method is based on support vector machines (SVMs); the other method combines a standard feed-forward back-propagation artificial neural network (NN) with a local structure-based sequence profile database (LSBSP1). Extensive benchmark experiments demonstrate that both methods have improved the prediction accuracy rate over the previously published methods for conformation state prediction when using an alphabet of three or four states. Availability: LSBSP1 and the NN algorithm have been implemented in PrISM.1, which is available from www.columbia.edu/ ay1/. Supplementary information: Supplementary data for the SVM method can be downloaded from the Website www.cs.columbia.edu/compbio/backbone. Keywords: biosvm
[Kuang2004Profile-based]	R. Kuang, E. Ie, K. Wang, K. Wang, M. Siddiqi, Y. Freund, and C. Leslie. Profile-based string kernels for remote homology detection and motif extraction. Proc IEEE Comput Syst Bioinform Conf, pages 152-160, 2004. [ bib ] We introduce novel profile-based string kernels for use with support vector machines (SVMs) for the problems of protein classification and remote homology detection. These kernels use probabilistic profiles, such as those produced by the PSI-BLAST algorithm, to define position-dependent mutation neighborhoods along protein sequences for inexact matching of k-length subsequences ("k-mers") in the data. By use of an efficient data structure, the kernels are fast to compute once the profiles have been obtained. For example, the time needed to run PSI-BLAST in order to build the profiles is significantly longer than both the kernel computation time and the SVM training time. We present remote homology detection experiments based on the SCOP database where we show that profile-based string kernels used with SVM classifiers strongly outperform all recently presented supervised SVM methods. We also show how we can use the learned SVM classifier to extract "discriminative sequence motifs" - short regions of the original profile that contribute almost all the weight of the SVM classification score - and show that these discriminative motifs correspond to meaningful structural features in the protein data. The use of PSI-BLAST profiles can be seen as a semi-supervised learning technique, since PSI-BLAST leverages unlabeled data from a large sequence database to build more informative profiles. Recently presented "cluster kernels" give general semi-supervised methods for improving SVM protein classification performance. We show that our profile kernel results are comparable to cluster kernels while providing much better scalability to large datasets. Keywords: biosvm
[Krishnapuram2004bayesian]	B. Krishnapuram, A. J. Hartemink, L. Carin, and M. A. T. Figueiredo. A bayesian approach to joint feature selection and classifier design. IEEE T. Pattern. Anal., 26(9):1105-11, Sep 2004. [ bib \| DOI \| http \| .pdf ] This paper adopts a Bayesian approach to simultaneously learn both an optimal nonlinear classifier and a subset of predictor variables (or features) that are most relevant to the classification task. The approach uses heavy-tailed priors to promote sparsity in the utilization of both basis functions and features; these priors act as regularizers for the likelihood function that rewards good classification on the training data. We derive an expectation-maximization (EM) algorithm to efficiently compute a maximum a posteriori (MAP) point estimate of the various parameters. The algorithm is an extension of recent state-of-the-art sparse Bayesian classifiers, which in turn can be seen as Bayesian counterparts of support vector machines. Experimental comparisons using kernel classifiers demonstrate both parsimonious feature selection and excellent classification accuracy on a range of synthetic and benchmark data sets. Keywords: biosvm
[Krishnapuram2004Joint]	B. Krishnapuram, L. Carin, and A. Hartemink. Joint Classifier and Feature Optimization for Comprehensive Cancer Diagnosis Using Gene Expression Data. J. Comput. Biol., 11(2-3):227-242, 2004. [ bib \| DOI \| http \| .pdf ] Recent research has demonstrated quite convincingly that accurate cancer diagnosis can be achieved by constructing classifiers that are designed to compare the gene expression profile of a tissue of unknown cancer status to a database of stored expression profiles from tissues of known cancer status. This paper introduces the JCFO, a novel algorithm that uses a sparse Bayesian approach to jointly identify both the optimal nonlinear classifier for diagnosis and the optimal set of genes on which to base that diagnosis. We show that the diagnostic classification accuracy of the proposed algorithm is superior to a number of current state-of-the-art methods in a full leave-one-out cross-validation study of five widely used benchmark datasets. In addition to its superior classification accuracy, the algorithm is designed to automatically identify a small subset of genes (typically around twenty in our experiments) that are capable of providing complete discriminatory information for diagnosis. Focusing attention on a small subset of genes is useful not only because it produces a classifier with good generalization capacity, but also because this set of genes may provide insights into the mechanisms responsible for the disease itself. A number of the genes identified by the JCFO in our experiments are already in use as clinical markers for cancer diagnosis; some of the remaining genes may be excellent candidates for further clinical investigation. If it is possible to identify a small set of genes that is indeed capable of providing complete discrimination, inexpensive diagnostic assays might be widely deployable in clinical settings. Keywords: biosvm
[Krishnapuram2004Gene]	B. Krishnapuram, L. Carin, and A. Hartemink. Gene expression analysis: joint feature selection and classifier design. In B. Schölkopf, K. Tsuda, and J.P. Vert, editors, Kernel Methods in Computational Biology, pages 299-317. MIT Press, 2004. [ bib \| www: ] Keywords: biosvm
[Kote-Jarai2004Gene]	Zsofia Kote-Jarai, Richard D Williams, Nicola Cattini, Maria Copeland, Ian Giddings, Richard Wooster, Robert H tePoele, Paul Workman, Barry Gusterson, John Peacock, Gerald Gui, Colin Campbell, and Ros Eeles. Gene expression profiling after radiation-induced DNA damage is strongly predictive of BRCA1 mutation carrier status. Clin. Cancer Res., 10(3):958-63, Feb 2004. [ bib \| http \| .pdf ] PURPOSE: The impact of the presence of a germ-line BRCA1 mutation on gene expression in normal breast fibroblasts after radiation-induced DNA damage has been investigated. EXPERIMENTAL DESIGN: High-density cDNA microarray technology was used to identify differential responses to DNA damage in fibroblasts from nine heterozygous BRCA1 mutation carriers compared with five control samples without personal or family history of any cancer. Fibroblast cultures were irradiated, and their expression profile was compared using intensity ratios of the cDNA microarrays representing 5603 IMAGE clones. RESULTS: Class comparison and class prediction analysis has shown that BRCA1 mutation carriers can be distinguished from controls with high probability (approximately 85%). Significance analysis of microarrays and the support vector machine classifier identified gene sets that discriminate the samples according to their mutation status. These include genes already known to interact with BRCA1 such as CDKN1B, ATR, and RAD51. CONCLUSIONS: The results of this initial study suggest that normal cells from heterozygous BRCA1 mutation carriers display a different gene expression profile from controls in response to DNA damage. Adaptations of this pilot result to other cell types could result in the development of a functional assay for BRCA1 mutation status. Keywords: biosvm , breastcancer
[Kondor2004Diffusion]	R. Kondor and J.-P. Vert. Diffusion kernels. In B. Schölkopf, K. Tsuda, and J.P. Vert, editors, Kernel Methods in Computational Biology, pages 171-192. MIT Press, 2004. [ bib \| www: ] Keywords: biosvm
[Koike2004Prediction]	A. Koike and T. Takagi. Prediction of protein-protein interaction sites using support vector machines. Protein Eng. Des. Sel., 17(2):165-173, Feb 2004. [ bib \| DOI \| http \| .pdf ] The identification of protein-protein interaction sites is essential for the mutant design and prediction of protein-protein networks. The interaction sites of residue units were predicted using support vector machines (SVM) and the profiles of sequentially/spatially neighboring residues, plus additional information. When only sequence information was used, prediction performance was highest using the feature vectors, sequentially neighboring profiles and predicted interaction site ratios, which were calculated by SVM regression using amino acid compositions. When structural information was also used, prediction performance was highest using the feature vectors, spatially neighboring residue profiles, accessible surface areas, and the with/without protein interaction sites ratios predicted by SVM regression and amino acid compositions. In the latter case, the precision at recall = 50 test set and >20 30 closest sequentially/spatially neighboring on the interaction site residues. The predicted residues covered 86-87 (96-97 appeared to be slightly higher than a previously reported study. Comparing the prediction accuracy of each molecule, it seems to be easier to predict interaction sites for stable complexes. Keywords: biosvm
[Kohlmann2004Pediatric]	A. Kohlmann, C. Schoch, S. Schnittger, M. Dugas, W. Hiddemann, W. Kern, and T. Haferlach. Pediatric acute lymphoblastic leukemia (ALL) gene expression signatures classify an independent cohort of adult ALL patients. Leukemia, 18(1):63-71, 2004. [ bib \| DOI \| http \| .pdf ] Recent reports support a possible future application of gene expression profiling for the diagnosis of leukemias. However, the robustness of subtype-specific gene expression signatures has to be proven on independent patient samples. Here, we present gene expression data of 34 adult acute lymphoblastic leukemia (ALL) patients (Affymetrix U133A microarrays). Support Vector Machines (SVMs) were applied to stratify our samples based on given gene lists reported to predict MLL, BCR-ABL, and T-ALL, as well as MLL and non-MLL gene rearrangement positive pediatric ALL. In addition, seven other B-precursor ALL cases not bearing t(9;22) or t(11q23)/MLL chromosomal aberrations were analyzed. Using top differentially expressed genes, hierarchical cluster and principal component analyses demonstrate that the genetically more heterogeneous B-precursor ALL samples intercalate with BCR-ABL-positive cases, but were clearly distinct from T-ALL and MLL profiles. Similar expression signatures were observed for both heterogeneous B-precursor ALL and for BCR-ABL-positive cases. As an unrelated laboratory, we demonstrate that gene signatures defined for childhood ALL were also capable of stratifying distinct subtypes in our cohort of adult ALL patients. As such, previously reported gene expression patterns identified by microarray technology are validated and confirmed on truly independent leukemia patient samples. Keywords: biosvm
[Kim2004Prediction]	J. H. Kim, J. Lee, B. Oh, K. Kimm, and I. Koh. Prediction of phosphorylation sites using SVMs. Bioinformatics, 20(17):3179-3184, 2004. [ bib \| DOI \| http \| .pdf ] Motivation: Phosphorylation is involved in diverse signal transduction pathways. By predicting phosphorylation sites and their kinases from primary protein sequences, we can obtain much valuable information that can form the basis for further research. Using support vector machines, we attempted to predict phosphorylation sites and the type of kinase that acts at each site. Results: Our prediction system was limited to phosphorylation sites catalyzed by four protein kinase families and four protein kinase groups. The accuracy of the predictions ranged from 83 to 95 kinase group level. The prediction system used, PredPhospho, can be applied to the functional study of proteins, and can help predict the changes in phosphorylation sites caused by amino acid variations at intra- and interspecies levels. Availability: PredPhospho is available at http://www.ngri.re.kr/proteo/PredPhospho.htm. Supplementary information: http://www.ngri.re.kr/proteo/supplementary.doc Keywords: biosvm
[Kim2004Predictiona]	H. Kim and H. Park. Prediction of protein relative solvent accessibility with support vector machines and long-range interaction 3D local descriptor. Proteins, 54(3):557-562, Feb 2004. [ bib \| DOI \| http \| .pdf ] The prediction of protein relative solvent accessibility gives us helpful information for the prediction of tertiary structure of a protein. The SVMpsi method, which uses support vector machines (SVMs), and the position-specific scoring matrix (PSSM) generated from PSI-BLAST have been applied to achieve better prediction accuracy of the relative solvent accessibility. We have introduced a three-dimensional local descriptor that contains information about the expected remote contacts by both the long-range interaction matrix and neighbor sequences. Moreover, we applied feature weights to kernels in SVMs in order to consider the degree of significance that depends on the distance from the specific amino acid. Relative solvent accessibility based on a two state-model, for 25 and 0 accuracy, respectively. Three-state prediction results provide a 64.5 approach has successfully been applied for solvent accessibility prediction by considering long-range interaction and handling unbalanced data. Keywords: biosvm
[Kharchenko2004Filling]	P. Kharchenko, D. Vitkup, and G. M. Church. Filling gaps in a metabolic network using expression information. Bioinformatics, 20 Suppl 1:I178-I185, Aug 2004. [ bib \| DOI \| http ] MOTIVATION: The metabolic models of both newly sequenced and well-studied organisms contain reactions for which the enzymes have not been identified yet. We present a computational approach for identifying genes encoding such missing metabolic enzymes in a partially reconstructed metabolic network. RESULTS: The metabolic expression placement (MEP) method relies on the coexpression properties of the metabolic network and is complementary to the sequence homology and genome context methods that are currently being used to identify missing metabolic genes. The MEP algorithm predicts over 20% of all known Saccharomyces cerevisiae metabolic enzyme-encoding genes within the top 50 out of 5594 candidates for their enzymatic function, and 70% of metabolic genes whose expression level has been significantly perturbed across the conditions of the expression dataset used. AVAILABILITY: Freely available (in Supplementary information). SUPPLEMENTARY INFORMATION: Available at the following URL http://arep.med.harvard.edu/kharchenko/mep/supplements.html Keywords: Bacterial, Binding Sites, Biological, Comparative Study, DNA, Energy Metabolism, Enzyme Induction, Enzymes, Escherichia coli Proteins, Fungal, Gene Expression Regulation, Genes, Genetic, Genome, Models, Non-P.H.S., Non-U.S. Gov't, Phylogeny, Promoter Regions (Genetics), Protein, Research Support, Saccharomyces cerevisiae, Saccharomyces cerevisiae Proteins, Sequence Analysis, Systems Biology, Transcription Factors, U.S. Gov't, 15262797
[Kashima2004Kernels]	H. Kashima, K. Tsuda, and A. Inokuchi. Kernels for graphs. In B. Schölkopf, K. Tsuda, and J.P. Vert, editors, Kernel Methods in Computational Biology, pages 155-170. MIT Press, Cambridge, Massachusetts, 2004. [ bib ] Keywords: biosvm chemoinformatics
[Kaper2004BCI]	Matthias Kaper, Peter Meinicke, Ulf Grossekathoefer, Thomas Lingner, and Helge Ritter. BCI Competition 2003-Data set IIb: support vector machines for the P300 speller paradigm. IEEE Trans Biomed Eng, 51(6):1073-6, Jun 2004. [ bib ] We propose an approach to analyze data from the P300 speller paradigm using the machine-learning technique support vector machines. In a conservative classification scheme, we found the correct solution after five repetitions. While the classification within the competition is designed for offline analysis, our approach is also well-suited for a real-world online solution: It is fast, requires only 10 electrode positions and demands only a small amount of preprocessing. Keywords: Algorithms, Animals, Antisense, Artificial Intelligence, Automated, Autonomic Nervous System, Brain, Cell Line, Child, Cluster Analysis, Cognition, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA Fingerprinting, Databases, Drug Evaluation, Electroencephalography, Emotions, Event-Related Potentials, Factual, Fluorescence, Fuzzy Logic, Gene Silencing, Gene Targeting, Genetic, Hela Cells, Humans, Imaging, Intracellular Space, Microscopy, Models, Monitoring, Neoplasms, Neural Networks (Computer), Non-U.S. Gov't, Oligonucleotides, P.H.S., P300, Pattern Recognition, Peptides, Physiologic, Preclinical, Predictive Value of Tests, Preschool, Prognosis, Protein Interaction Mapping, Protein Structure, Proteins, Proteomics, Quantitative Structure-Activity Relationship, Quaternary, RNA, RNA Interference, Recognition (Psychology), Reproducibility of Results, Research Support, Sensitivity and Specificity, Signal Processing, Small Interfering, Software, Thionucleotides, Three-Dimensional, Tumor, U.S. Gov't, User-Computer Interface, Word Processing, 15188881
[Jiang-Ning2004Cooperativity]	S. Jiang-Ning, L. Wei-Jiang, and X. Wen-Bo. Cooperativity of the oxidization of cysteines in globular proteins. J. Theor. Biol., 231(1):85-95, 2004. [ bib \| DOI \| http \| .pdf ] Based on the 639 non-homologous proteins with 2910 cysteine-containing segments of well-resolved three-dimensional structures, a novel approach has been proposed to predict the disulfide-bonding state of cysteines in proteins by constructing a two-stage classifier combining a first global linear discriminator based on their amino acid composition and a second local support vector machine classifier. The overall prediction accuracy of this hybrid classifier for the disulfide-bonding state of cysteines in proteins has scored 84.1 on cysteine and protein basis using the rigorous jack-knife procedure, respectively. It shows that whether cysteines should form disulfide bonds depends not only on the global structural features of proteins but also on the local sequence environment of proteins. The result demonstrates the applicability of this novel method and provides comparable prediction performance compared with existing methods for the prediction of the oxidation states of cysteines in proteins. Keywords: biosvm
[Hutter2004Prediction]	B. Hutter, C. Schaab, S. Albrecht, M. Borgmann, N. A. Brunner, C. Freiberg, K. Ziegelbauer, C. O. Rock, I. Ivanov, and H. Loferer. Prediction of Mechanisms of Action of Antibacterial Compounds by Gene Expression Profiling. Antimicrob. Agents Chemother., 48(8):2838-2844, Aug 2004. [ bib \| DOI \| arXiv \| http \| .pdf ] We have generated a database of expression profiles carrying the transcriptional responses of the model organism Bacillus subtilis following treatment with 37 well-characterized antibacterial compounds of different classes. The database was used to build a predictor for the assignment of the mechanisms of action (MoAs) of antibacterial compounds by the use of support vector machines. This predictor was able to correctly classify the MoA class for most compounds tested. Furthermore, we provide evidence that the in vivo MoA of hexachlorophene does not match the MoA predicted from in vitro data, a situation frequently faced in drug discovery. A database of this kind may facilitate the prioritization of novel antibacterial entities in drug discovery programs. Potential applications and limitations are discussed. Keywords: biosvm
[Huan2004Accurate]	J. Huan, W. Wang, A. Washington, J. Prins, R. Shah, and A. Tropsha. Accurate classification of protein structural families using coherent subgraph analysis. In Proceedings of the Pacific Symposium on Biocomputing 2002, pages 411-422, 2004. [ bib \| .pdf ] Protein structural annotation and classification is an important problem in bioinformatics. We report on the development of an efficient subgraph mining technique and its application to finding characteristic substructural patterns within protein structural families. In our method, protein structures are represented by graphs where the nodes are residues and the edges connect residues found within certain distance from each other. Application of subgraph mining to proteins is challenging for a number of reasons: (1) protein graphs are large and complex, (2) current protein databases are large and continue to grow rapidly, and (3) only a small fraction of the frequent subgraphs among the huge pool of all possible subgraphs could be significant in the context of protein classification. To address these challenges, we have developed an information theoretic model called coherent subgraph mining. From information theory, the entropy of a random variable X measures the information content carried by X and the Mutual Information (MI) between two random variables X and Y measures the correlation between X and Y. We define a subgraph X as coherent if it is strongly correlated with every sufficiently large sub-subgraph Y embedded in it. Based on the MI metric, we have designed a search scheme that only reports coherent subgraphs. To determine the significance of coherent protein subgraphs, we have conducted an experimental study in which all coherent subgraphs were identified in several protein structural families annotated in the SCOP database (Murzin et al, 1995). The Support Vector Machine algorithm was used to classify proteins from different families under the binary classification scheme. We find that this approach identifies spatial motifs unique to individual SCOP families and affords excellent discrimination between families. Keywords: biosvm
[Hu2004Improved]	H.J. Hu, Y. Pan, R. Harrison, and P.C. Tai. Improved protein secondary structure prediction using support vector machine with a new encoding scheme and an advanced tertiary classifier. IEEE Trans. Nanobioscience, 3(4):265-271, 2004. [ bib \| .pdf ] Prediction of protein secondary structures is an important problem in bioinformatics and has many applications. The recent trend of secondary structure prediction studies is mostly based on the neural network or the support vector machine (SVM). The SVM method is a comparatively new learning system which has mostly been used in pattern recognition problems. In this study, SVM is used as a machine learning tool for the prediction of secondary structure and several encoding schemes, including orthogonal matrix, hydrophobicity matrix, BLOSUM62 substitution matrix, and combined matrix of these, are applied and optimized to improve the prediction accuracy. Also, the optimal window length for six SVM binary classifiers is established by testing different window sizes and our new encoding scheme is tested based on this optimal window size via sevenfold cross validation tests. The results show 2 classifiers when compared with the instances in which the classical orthogonal matrix is used. Finally, to combine the results of the six SVM binary classifiers, a new tertiary classifier which combines the results of one-versus-one binary classifiers is introduced and the performance is compared with those of existing tertiary classifiers. According to the results, the Q3 prediction accuracy of new tertiary classifier reaches 78.8 reported in the literature. Keywords: biosvm
[Hu2004Developing]	C. Hu, X. Li, and J. Liang. Developing optimal non-linear scoring function for protein design. Bioinformatics, 20(17):3080-3098, 2004. [ bib \| DOI \| http \| www: ] Motivation: Protein design aims to identify sequences compatible with a given protein fold but incompatible to any alternative folds. To select the correct sequences and to guide the search process, a design scoring function is critically important. Such a scoring function should be able to characterize the global fitness landscape of many proteins simultaneously. Results: To find optimal design scoring functions, we introduce two geometric views and propose a formulation using a mixture of non-linear Gaussian kernel functions. We aim to solve a simplified protein sequence design problem. Our goal is to distinguish each native sequence for a major portion of representative protein structures from a large number of alternative decoy sequences, each a fragment from proteins of different folds. Our scoring function discriminates perfectly a set of 440 native proteins from 14 million sequence decoys. We show that no linear scoring function can succeed in this task. In a blind test of unrelated proteins, our scoring function misclassifies only 13 native proteins out of 194. This compares favorably with about three to four times more misclassifications when optimal linear functions reported in the literature are used. We also discuss how to develop protein folding scoring function. Availability: Available on request from the authors. Keywords: biosvm
[Hou2004Remote]	Y. Hou, W. Hsu, M. L. Lee, and C. Bystroff. Remote homolog detection using local sequence-structure correlations. Proteins, 57(3):518-530, 2004. [ bib \| DOI \| .pdf ] Remote homology detection refers to the detection of structural homology in proteins when there is little or no sequence similarity. In this article, we present a remote homolog detection method called SVM-HMMSTR that overcomes the reliance on detectable sequence similarity by transforming the sequences into strings of hidden Markov states that represent local folding motif patterns. These state strings are transformed into fixed-dimension feature vectors for input to a support vector machine. Two sets of features are defined: an order-independent feature set that captures the amino acid and local structure composition; and an order-dependent feature set that captures the sequential ordering of the local structures. Tests using the Structural Classification of Proteins (SCOP) 1.53 data set show that the SVM-HMMSTR gives a significant improvement over several current methods. Keywords: biosvm
[Hochreiter2004Gene]	S. Hochreiter and K. Obermayer. Gene selection for microarray data. In B. Schölkopf, K. Tsuda, and J.P. Vert, editors, Kernel Methods in Computational Biology, pages 319-355. MIT Press, 2004. [ bib \| www: ] Keywords: biosvm
[Helma2004Data]	C. Helma, T. Cramer, S. Kramer, and L. De Raedt. Data mining and machine learning techniques for the identification of mutagenicity inducing substructures and structure activity relationships of noncongeneric compounds. J. Chem. Inf. Comput. Sci., 44(4):1402-11, 2004. [ bib \| DOI \| http \| .pdf ] This paper explores the utility of data mining and machine learning algorithms for the induction of mutagenicity structure-activity relationships (SARs) from noncongeneric data sets. We compare (i) a newly developed algorithm (MOLFEA) for the generation of descriptors (molecular fragments) for noncongeneric compounds with traditional SAR approaches (molecular properties) and (ii) different machine learning algorithms for the induction of SARs from these descriptors. In addition we investigate the optimal parameter settings for these programs and give an exemplary interpretation of the derived models. The predictive accuracies of models using MOLFEA derived descriptors are approximately 10-15 percentage points higher than those using molecular properties alone. Using both types of descriptors together does not improve the derived models. From the applied machine learning techniques the rule learner PART and support vector machines gave the best results, although the differences between the learning algorithms are only marginal. We were able to achieve predictive accuracies up to 78% for 10-fold cross-validation. The resulting models are relatively easy to interpret and usable for predictive as well as for explanatory purposes. Keywords: biosvm chemoinformatics
[Han2004Prediction]	L.Y. Han, C.Z. Cai, S.L. Lo, M.C. Chung, and Y.Z. Chen. Prediction of RNA-binding proteins from primary sequence by a support vector machine approach. RNA, 10(3):355-368, 2004. [ bib \| http \| .pdf ] Elucidation of the interaction of proteins with different molecules is of significance in the understanding of cellular processes. Computational methods have been developed for the prediction of protein-protein interactions. But insufficient attention has been paid to the prediction of protein-RNA interactions, which play central roles in regulating gene expression and certain RNA-mediated enzymatic processes. This work explored the use of a machine learning method, support vector machines (SVM), for the prediction of RNA-binding proteins directly from their primary sequence. Based on the knowledge of known RNA-binding and non-RNA-binding proteins, an SVM system was trained to recognize RNA-binding proteins. A total of 4011 RNA-binding and 9781 non-RNA-binding proteins was used to train and test the SVM classification system, and an independent set of 447 RNA-binding and 4881 non-RNA-binding proteins was used to evaluate the classification accuracy. Testing results using this independent evaluation set show a prediction accuracy of 94.1 proteins, and 98.7 and non-tRNA-binding proteins, respectively. The SVM classification system was further tested on a small class of snRNA-binding proteins with only 60 available sequences. The prediction accuracy is 40.0 and 99.9 a need for a sufficient number of proteins to train SVM. The SVM classification systems trained in this work were added to our Web-based protein functional classification software SVMProt, at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi. Our study suggests the potential of SVM as a useful tool for facilitating the prediction of protein-RNA interactions. Keywords: biosvm
[Han2004Predicting]	L.Y. Han, C.Z. Cai, Z.L. Ji, Z.W. Cao, J. Cui, and Y.Z. Chen. Predicting functional family of novel enzymes irrespective of sequence similarity: a statistical learning approach. Nucl. Acids Res., 32(21):6437-6444, 2004. [ bib \| DOI \| http \| .pdf ] The function of a protein that has no sequence homolog of known function is difficult to assign on the basis of sequence similarity. The same problem may arise for homologous proteins of different functions if one is newly discovered and the other is the only known protein of similar sequence. It is desirable to explore methods that are not based on sequence similarity. One approach is to assign functional family of a protein to provide useful hint about its function. Several groups have employed a statistical learning method, support vector machines (SVMs), for predicting protein functional family directly from sequence irrespective of sequence similarity. These studies showed that SVM prediction accuracy is at a level useful for functional family assignment. But its capability for assignment of distantly related proteins and homologous proteins of different functions has not been critically and adequately assessed. Here SVM is tested for functional family assignment of two groups of enzymes. One consists of 50 enzymes that have no homolog of known function from PSI-BLAST search of protein databases. The other contains eight pairs of homologous enzymes of different families. SVM correctly assigns 72 pairs in the second group, suggesting that it is potentially useful for facilitating functional study of novel proteins. A web version of our software, SVMProt, is accessible at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi. Keywords: biosvm
[Hakenberg2004Finding]	J. Hakenberg, S. Schmeier, A. Kowald, E. Klipp, and U. Leser. Finding kinetic parameters using text mining. OMICS, 8(2):131-152, 2004. [ bib \| http \| .pdf ] The mathematical modeling and description of complex biological processes has become more and more important over the last years. Systems biology aims at the computational simulation of complex systems, up to whole cell simulations. An essential part focuses on solving a large number of parameterized differential equations. However, measuring those parameters is an expensive task, and finding them in the literature is very laborious. We developed a text mining system that supports researchers in their search for experimentally obtained parameters for kinetic models. Our system classifies full text documents regarding the question whether or not they contain appropriate data using a support vector machine. We evaluated our approach on a manually tagged corpus of 800 documents and found that it outperforms keyword searches in abstracts by a factor of five in terms of precision. Keywords: biosvm
[Gartner2004Kernels]	T. Gärtner, J.W. Lloyd, and P.A. Flach. Kernels and distances for structured data. Mach. Learn., 57(3):205-232, 2004. [ bib \| DOI \| http ] This paper brings together two strands of machine learning of increasing importance: kernel methods and highly structured data. We propose a general method for constructing a kernel following the syntactic structure of the data, as defined by its type signature in a higher-order logic. Our main theoretical result is the positive definiteness of any kernel thus defined. We report encouraging experimental results on a range of real-world data sets. By converting our kernel to a distance pseudo-metric for 1-nearest neighbour, we were able to improve the best accuracy from the literature on the Diterpene data set by more than 10%. Keywords: biosvm
[Guo2004novel]	J. Guo, H. Chen, Z. Sun, and Y. Lin. A novel method for protein secondary structure prediction using dual-layer SVM and profiles. Proteins, 54(4):738-743, 2004. [ bib \| DOI \| http \| .pdf ] A high-performance method was developed for protein secondary structure prediction based on the dual-layer support vector machine (SVM) and position-specific scoring matrices (PSSMs). SVM is a new machine learning technology that has been successfully applied in solving problems in the field of bioinformatics. The SVM's performance is usually better than that of traditional machine learning approaches. The performance was further improved by combining PSSM profiles with the SVM analysis. The PSSMs were generated from PSI-BLAST profiles, which contain important evolution information. The final prediction results were generated from the second SVM layer output. On the CB513 data set, the three-state overall per-residue accuracy, Q3, reached 75.2% [...] to 80.0% [...] 74.0% [...] has been constructed and is available at http://www.bioinfo.tsinghua.edu.cn/pmsvm. Keywords: biosvm
[Guermeur2004Combining]	Y. Guermeur, G. Pollastri, A. Elisseeff, D. Zelus, H. Paugam-Moisy, and P. Baldi. Combining protein secondary structure prediction models with ensemble methods of optimal complexity. Neurocomputing, 56:305-327, 2004. [ bib \| DOI \| http \| .pdf ] Many sophisticated methods are currently available to perform protein secondary structure prediction. Since they are frequently based on different principles, and different knowledge sources, significant benefits can be expected from combining them. However, the choice of an appropriate combiner appears to be an issue in its own right. The first difficulty to overcome when combining prediction methods is overfitting. This is the reason why we investigate the implementation of Support Vector Machines to perform the task. A family of multi-class SVMs is introduced. Two of these machines are used to combine some of the current best protein secondary structure prediction methods. Their performance is consistently superior to the performance of the ensemble methods traditionally used in the field. They also outperform the decomposition approaches based on bi-class SVMs. Furthermore, initial experimental evidence suggests that their outputs could be processed by the biologist to perform higher-level treatments. Keywords: biosvm
[Guermeur2004kernel]	Y. Guermeur, A. Lifschitz, and R. Vert. A kernel for protein secondary structure prediction. In B. Schölkopf, K. Tsuda, and J.P. Vert, editors, Kernel Methods in Computational Biology, pages 193-206. MIT Press, 2004. [ bib ] Keywords: biosvm
[Glotsos2004Computer-based]	Dimitris Glotsos, Panagiota Spyridonos, Panagiotis Petalas, Dionisis Cavouras, Panagiota Ravazoula, Petroula-Arampatoni Dadioti, Ioanna Lekka, and George Nikiforidis. Computer-based malignancy grading of astrocytomas employing a support vector machine classifier, the WHO grading system and the regular hematoxylin-eosin diagnostic staining procedure. Anal Quant Cytol Histol, 26(2):77-83, Apr 2004. [ bib ] OBJECTIVE: To investigate and develop an automated technique for astrocytoma malignancy grading compatible with the clinical routine. STUDY DESIGN: One hundred forty biopsies of astrocytomas were collected from 2 hospitals. The degree of tumor malignancy was defined as low or high according to the World Health Organization grading system. From each biopsy, images were digitized and segmented to isolate nuclei from background tissue. Morphologic and textural nuclear features were quantified to encode tumor malignancy. Each case was represented by a 40-dimensional feature vector. An exhaustive search procedure in feature space was utilized to determine the best feature combination that resulted in the smallest classification error. Low and high grade tumors were discriminated using support vector machines (SVMs). To evaluate the system performance, all available data were split randomly into training and test sets. RESULTS: The best vector combination consisted of 3 textural and 2 morphologic features. Low and high grade cases were discriminated with an accuracy of 90.7% and 88.9%, respectively, using an SVM classifier with polynomial kernel of degree 2. CONCLUSION: The proposed methodology was based on standards that are common in daily clinical practice and might be used in parallel with conventional grading as a second-opinion tool to reduce subjectivity in the classification of astrocytomas. 
Keywords: Amino Acids, Antibodies, Artificial Intelligence, Astrocytoma, Biological, Biopsy, Brain, Brain Mapping, Brain Neoplasms, Calibration, Comparative Study, Computational Biology, Computer-Assisted, Cysteine, Cystine, Electrodes, Electroencephalography, Eosine Yellowish-(YS), Evoked Potentials, Female, Hematoxylin, Horseradish Peroxidase, Humans, Image Processing, Imagery (Psychotherapy), Imagination, Laterality, Male, Monoclonal, Movement, Neoplasms, Non-P.H.S., Non-U.S. Gov't, P.H.S., Perception, Principal Component Analysis, Protein, Protein Array Analysis, Proteins, Research Support, Sensitivity and Specificity, Sequence Analysis, Software, Tumor Markers, U.S. Gov't, User-Computer Interface, World Health Organization, 15131894
[Glotsos2004Automated]	Dimitris Glotsos, Panagiota Spyridonos, Dionisis Cavouras, Panagiota Ravazoula, Petroula-Arampantoni Dadioti, and George Nikiforidis. Automated segmentation of routinely hematoxylin-eosin-stained microscopic images by combining support vector machine clustering and active contour models. Anal Quant Cytol Histol, 26(6):331-40, Dec 2004. [ bib ] OBJECTIVE: To develop a method for the automated segmentation of images of routinely hematoxylin-eosin (H-E)-stained microscopic sections to guarantee correct results in computer-assisted microscopy. STUDY DESIGN: Clinical material was composed of 50 H-E-stained biopsies of astrocytomas and 50 H-E-stained biopsies of urinary bladder cancer. The basic idea was to use a support vector machine clustering (SVMC) algorithm to provide gross segmentation of regions holding nuclei and subsequently to refine nuclear boundary detection with active contours. The initialization coordinates of the active contour model were defined using an SVMC pixel-based classification algorithm that discriminated nuclear regions from the surrounding tissue. Starting from the boundaries of these regions, the snake fired and propagated until converging to nuclear boundaries. RESULTS: The method was validated for 2 different types of H-E-stained images. Results were evaluated by 2 histopathologists. On average, 94% of nuclei were correctly delineated. CONCLUSION: The proposed algorithm could be of value in computer-based systems for automated interpretation of microscopic images. 
Keywords: Adenosinetriphosphatase, Adolescent, Adult, Algorithms, Amino Acid Sequence, Amino Acids, Animals, Astrocytoma, Automated, Automation, Base Sequence, Bayes Theorem, Biological, Biopsy, Bladder Neoplasms, Breast Neoplasms, Carbohydrate Conformation, Carbohydrate Sequence, Cattle, Cell Cycle Proteins, Cell Nucleus, Computational Biology, Computer Simulation, Computer-Assisted, Crystallography, DNA, Databases, Diagnosis, Differential, Eosine Yellowish-(YS), Exoribonucleases, Factual, False Negative Reactions, False Positive Reactions, Female, Gene Expression, Gene Expression Profiling, Genes, Genetic, Genetic Techniques, Genetic Vectors, Genome, Hematoxylin, Histocompatibility Antigens Class I, Human, Humans, Image Interpretation, Image Processing, Introns, Least-Squares Analysis, MHC Class I, Major Histocompatibility Complex, Markov Chains, Messenger, Mice, Middle Aged, Models, Molecular Structure, Monosaccharides, Multigene Family, Mutation, Neoplasms, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Nonparametric, Nucleotidyltransferases, Observer Variation, Oligonucleotide Array Sequence Analysis, P.H.S., Pattern Recognition, Peptides, Phenotype, Phylogeny, Plants, Poly A, Polysaccharides, Predictive Value of Tests, Protein, Protein Biosynthesis, Protein Kinase Inhibitors, Protein Structure, Proteins, RNA, RNA Helicases, RNA Splicing, Rats, Reproducibility of Results, Research Support, Retrospective Studies, Saccharomyces cerevisiae, Saccharomyces cerevisiae Proteins, Secondary, Sensitivity and Specificity, Sequence Alignment, Software, Species Specificity, Staining and Labeling, Statistics, Theoretical, Transcription, U.S. Gov't, Ultrasonography, X-Ray, 15678615
[Fong2004Predicting]	J. H. Fong, A. E. Keating, and M. Singh. Predicting specificity in bZIP coiled-coil protein interactions. Genome Biol., 5(R11), 2004. [ bib \| http \| .pdf ] We present a method for predicting protein-protein interactions mediated by the coiled-coil motif. When tested on interactions between nearly all human and yeast bZIP proteins, our method identifies 70% [...] strong interactions while maintaining that 92% [...] correct. Furthermore, cross-validation testing shows that including the bZIP experimental data significantly improves performance. Our method can be used to predict bZIP interactions in other genomes and is a promising approach for predicting coiled-coil interactions more generally. Keywords: biosvm
[Doytchinova2004Identifying]	Irini A Doytchinova, Pingping Guan, and Darren R Flower. Identifying human MHC supertypes using bioinformatic methods. J. Immunol., 172(7):4314-4323, Apr 2004. [ bib ] Classification of MHC molecules into supertypes in terms of peptide-binding specificities is an important issue, with direct implications for the development of epitope-based vaccines with wide population coverage. In view of extremely high MHC polymorphism (948 class I and 633 class II HLA alleles) the experimental solution of this task is presently impossible. In this study, we describe a bioinformatics strategy for classifying MHC molecules into supertypes using information drawn solely from three-dimensional protein structure. Two chemometric techniques-hierarchical clustering and principal component analysis-were used independently on a set of 783 HLA class I molecules to identify supertypes based on structural similarities and molecular interaction fields calculated for the peptide binding site. Eight supertypes were defined: A2, A3, A24, B7, B27, B44, C1, and C4. The two techniques gave 77% consensus, i.e., 605 HLA class I alleles were classified in the same supertype by both methods. The proposed strategy allowed "supertype fingerprints" to be identified. Thus, the A2 supertype fingerprint is Tyr(9)/Phe(9), Arg(97), and His(114) or Tyr(116); the A3-Tyr(9)/Phe(9)/Ser(9), Ile(97)/Met(97) and Glu(114) or Asp(116); the A24-Ser(9) and Met(97); the B7-Asn(63) and Leu(81); the B27-Glu(63) and Leu(81); for B44-Ala(81); the C1-Ser(77); and the C4-Asn(77). Keywords: Alleles; Amino Acid Motifs; Binding Sites; Computational Biology; DNA Fingerprinting; HLA Antigens; HLA-A Antigens; HLA-B Antigens; HLA-C Antigens; Histocompatibility Antigens Class I; Histocompatibility Testing; Humans; Multigene Family; Protein Interaction Mapping
[Cuturi2004mutual]	M. Cuturi and J.-P. Vert. A mutual information kernel for strings. In Proceedings of IJCNN 2004, pages 1904-1910, 2004. [ bib \| www: ] Keywords: biosvm
[Cui2004Esub8]	Q. Cui, T. Jiang, B. Liu, and S. Ma. Esub8: A novel tool to predict protein subcellular localizations in eukaryotic organisms. BMC Bioinformatics, 5(66):66, 2004. [ bib \| DOI \| http \| .pdf ] Background Subcellular localization of a new protein sequence is very important and fruitful for understanding its function. As the number of new genomes has dramatically increased over recent years, a reliable and efficient system to predict protein subcellular location is urgently needed. Results Esub8 was developed to predict protein subcellular localizations for eukaryotic proteins based on amino acid composition. In this research, the proteins are classified into the following eight groups: chloroplast, cytoplasm, extracellular, Golgi apparatus, lysosome, mitochondria, nucleus and peroxisome. We know subcellular localization is a typical classification problem; consequently, a one-against-one (1-v-1) multi-class support vector machine was introduced to construct the classifier. Unlike previous methods, ours considers the order information of protein sequences by a different method. Our method is tested in three subcellular localization predictions for prokaryotic proteins and four subcellular localization predictions for eukaryotic proteins on Reinhardt's dataset. The results are then compared to several other methods. The total prediction accuracies of two tests are both 100% [...] self-consistency test, and are 92.9% [...] test, respectively. Esub8 also provides excellent results: the total prediction accuracies are 100% [...] 87% [...] a different approach for predicting protein subcellular localization and achieved a satisfactory result; furthermore, we believe Esub8 will be a useful tool for predicting protein subcellular localizations in eukaryotic organisms. Keywords: biosvm
[Collier2004Comparison]	Nigel Collier and Koichi Takeuchi. Comparison of character-level and part of speech features for name recognition in biomedical texts. J Biomed Inform, 37(6):423-35, Dec 2004. [ bib \| DOI \| http \| .pdf ] The immense volume of data which is now available from experiments in molecular biology has led to an explosion in reported results most of which are available only in unstructured text format. For this reason there has been great interest in the task of text mining to aid in fact extraction, document screening, citation analysis, and linkage with large gene and gene-product databases. In particular there has been an intensive investigation into the named entity (NE) task as a core technology in all of these tasks which has been driven by the availability of high volume training sets such as the GENIA v3.02 corpus. Despite such large training sets accuracy for biology NE has proven to be consistently far below the high levels of performance in the news domain where F scores above 90 are commonly reported which can be considered near to human performance. We argue that it is crucial that more rigorous analysis of the factors that contribute to the model's performance be applied to discover where the underlying limitations are and what our future research direction should be. Our investigation in this paper reports on variations of two widely used feature types, part of speech (POS) tags and character-level orthographic features, and makes a comparison of how these variations influence performance. We base our experiments on a proven state-of-the-art model, support vector machines using a high quality subset of 100 annotated MEDLINE abstracts. Experiments reveal that the best performing features are orthographic features with F score of 72.6. 
Although the Brill tagger trained in-domain on the GENIA v3.02p POS corpus gives the best overall performance of any POS tagger, at an F score of 68.6, this is still significantly below the orthographic features. In combination these two feature types appear to interfere with each other and degrade performance slightly to an F score of 72.3. Keywords: biosvm nlp
[Chen2004Prediction]	Y.C. Chen, Y.S. Lin, C.J. Lin, and J.K. Hwang. Prediction of the bonding states of cysteines using the support vector machines based on multiple feature vectors and cysteine state sequences. Proteins, 55(4):1036-1042, 2004. [ bib \| DOI \| .pdf \| .pdf ] The support vector machine (SVM) method is used to predict the bonding states of cysteines. Besides using local descriptors such as the local sequences, we include global information, such as amino acid compositions and the patterns of the states of cysteines (bonded or nonbonded), or cysteine state sequences, of the proteins. We found that SVM based on local sequences or global amino acid compositions yielded similar prediction accuracies for the data set comprising 4136 cysteine-containing segments extracted from 969 nonhomologous proteins. However, the SVM method based on multiple feature vectors (combining local sequences and global amino acid compositions) significantly improves the prediction accuracy, from 80% [...] cysteine state sequences, SVM based on multiple feature vectors yields 90% [...] coefficient, around 10% [...] obtained by SVM based on local sequence information. Keywords: biosvm
[Camps-Valls2004Profiled]	G. Camps-Valls, A.M. Chalk, A.J. Serrano-Lopez, J.D. Martin-Guerrero, and E.L. Sonnhammer. Profiled support vector machines for antisense oligonucleotide efficacy prediction. BMC Bioinformatics, 5(135):135, 2004. [ bib \| DOI \| http \| .pdf ] Background This paper presents the use of Support Vector Machines (SVMs) for prediction and analysis of antisense oligonucleotide (AO) efficacy. The collected database comprises 315 AO molecules including 68 features each, inducing a problem well-suited to SVMs. The task of feature selection is crucial given the presence of noisy or redundant features, and the well-known problem of the curse of dimensionality. We propose a two-stage strategy to develop an optimal model: (1) feature selection using correlation analysis, mutual information, and SVM-based recursive feature elimination (SVM-RFE), and (2) AO prediction using standard and profiled SVM formulations. A profiled SVM gives different weights to different parts of the training data to focus the training on the most important regions. Results In the first stage, the SVM-RFE technique was most efficient and robust in the presence of low number of samples and high input space dimension. This method yielded an optimal subset of 14 representative features, which were all related to energy and sequence motifs. The second stage evaluated the performance of the predictors (overall correlation coefficient between observed and predicted efficacy, r; mean error, ME; and root-mean-square-error, RMSE) using 8-fold and minus-one-RNA cross-validation methods. The profiled SVM produced the best results (r = 0.44, ME = 0.022, and RMSE = 0.278) and predicted high (>75% [...] gene expression) and low efficacy (<25% [...] of 83.3% [...] approaches. A web server for AO prediction is available online at http://aosvm.cgb.ki.se/. Conclusions The SVM approach is well suited to the AO prediction problem, and yields a prediction accuracy superior to previous methods. 
The profiled SVM was found to perform better than the standard SVM, suggesting that it could lead to improvements in other prediction problems as well. Keywords: biosvm
[Cai2004Identify]	Y.D. Cai, G.P. Zhou, C.H. Jen, S.L. Lin, and K.C. Chou. Identify catalytic triads of serine hydrolases by support vector machines. J. Theor. Biol., 228(4):551-557, 2004. [ bib \| DOI \| http \| .pdf ] The core of an enzyme molecule is its active site from the viewpoints of both academic research and industrial application. To reveal the structural and functional mechanism of an enzyme, one needs to know its active site; to conduct structure-based drug design by regulating the function of an enzyme, one needs to know the active site and its microenvironment as well. Given the atomic coordinates of an enzyme molecule, how can we predict its active site? To tackle such a problem, a distance group approach was proposed and the support vector machine algorithm applied to predict the catalytic triad of serine hydrolase family. The success rate by jackknife test for the 139 serine hydrolases was 85% [...] promising and may become a useful tool in structural bioinformatics. Keywords: biosvm
[Cai2004Application]	Y.D. Cai, P.W. Ricardo, C.H. Jen, and K.C. Chou. Application of SVM to predict membrane protein types. J. Theor. Biol., 226(4):373-376, 2004. [ bib \| DOI \| http \| .pdf ] As a continuous effort to develop automated methods for predicting membrane protein types that was initiated by Chou and Elrod (PROTEINS: Structure, Function, and Genetics, 1999, 34, 137-153), the support vector machine (SVM) is introduced. Results obtained through re-substitution, jackknife, and independent data set tests, respectively, have indicated that the SVM approach is quite a promising one, suggesting that the covariant discriminant algorithm (Chou and Elrod, Protein Eng. 12 (1999) 107) and SVM, if effectively complemented with each other, will become a powerful tool for predicting membrane protein types and the other protein attributes as well. Keywords: biosvm
[Cai2004Enzyme]	C.Z. Cai, L.Y. Han, Z.L. Ji, and Y.Z. Chen. Enzyme family classification by support vector machines. Proteins, 55(1):66-76, 2004. [ bib \| DOI \| http \| .pdf ] One approach for facilitating protein function prediction is to classify proteins into functional families. Recent studies on the classification of G-protein coupled receptors and other proteins suggest that a statistical learning method, support vector machines (SVM), may be potentially useful for protein classification into functional families. In this work, SVM is applied and tested on the classification of enzymes into functional families defined by the Enzyme Nomenclature Committee of IUBMB. SVM classification system for each family is trained from representative enzymes of that family and seed proteins of Pfam curated protein families. The classification accuracy for enzymes from 46 families and for non-enzymes is in the range of 50.0% [...] Matthews correlation coefficient is in the range of 54.1% [...] Moreover, 80.3% [...] classified into a specific enzyme family by using a scoring function, indicating that SVM may have a certain level of unique prediction capability. Testing results also suggest that SVM in some cases is capable of classification of distantly related enzymes and homologous enzymes of different functions. Effort is being made to use a more comprehensive set of enzymes as training sets and to incorporate multi-class SVM classification systems to further enhance the unique prediction accuracy. Our results suggest the potential of SVM for enzyme family classification and for facilitating protein function prediction. Our software is accessible at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi. Keywords: biosvm
[Byvatov2004SVM-based]	Evgeny Byvatov and Gisbert Schneider. SVM-based feature selection for characterization of focused compound collections. J Chem Inf Comput Sci, 44(3):993-9, 2004. [ bib \| DOI \| http \| .pdf ] Artificial neural networks, the support vector machine (SVM), and other machine learning methods for the classification of molecules are often considered as a "black box", since the molecular features that are most relevant for a given classifier are usually not presented in a human-interpretable form. We report on an SVM-based algorithm for the selection of relevant molecular features from a trained classifier that might be important for an understanding of ligand-receptor interactions. The original SVM approach was extended to allow for feature selection. The method was applied to characterize focused libraries of enzyme inhibitors. A comparison with classical Kolmogorov-Smirnov (KS)-based feature selection was performed. In most of the applications the SVM method showed sustained classification accuracy, thereby relying on a smaller number of molecular features than KS-based classifiers. In one case both methods produced comparable results. Limiting the calculation of descriptors to only the most relevant ones for a certain biological activity can also be used to speed up high-throughput virtual screening. Keywords: biosvm chemoinformatics featureselection
[Busuttil2004Support]	S. Busuttil, J. Abela, and G. J. Pace. Support vector machines with profile-based kernels for remote protein homology detection. Genome Inform Ser Workshop Genome Inform, 15(2):191-200, 2004. [ bib \| .html \| .pdf ] Two new techniques for remote protein homology detection particularly suited for sparse data are introduced. These methods are based on position specific scoring matrices or profiles and use a support vector machine (SVM) for discrimination. The performance on standard benchmarks outperforms previous non-discriminative techniques and is comparable to that of other SVM-based methods while giving distinct advantages. Keywords: biosvm
[Bowd2004Confocal]	Christopher Bowd, Linda M Zangwill, Felipe A Medeiros, Jiucang Hao, Kwokleung Chan, Te-Won Lee, Terrence J Sejnowski, Michael H Goldbaum, Pamela A Sample, Jonathan G Crowston, and Robert N Weinreb. Confocal scanning laser ophthalmoscopy classifiers and stereophotograph evaluation for prediction of visual field abnormalities in glaucoma-suspect eyes. Invest Ophthalmol Vis Sci, 45(7):2255-62, Jul 2004. [ bib \| DOI \| http \| .pdf ] PURPOSE: To determine whether Heidelberg Retina Tomograph (HRT; Heidelberg Engineering, Dossenheim, Germany) classification techniques and investigational support vector machine (SVM) analyses can detect optic disc abnormalities in glaucoma-suspect eyes before the development of visual field abnormalities. METHODS: Glaucoma-suspect eyes (n = 226) were classified as converts or nonconverts based on the development of repeatable (either two or three consecutive) standard automated perimetry (SAP)-detected abnormalities over the course of the study (mean follow-up, approximately 4.5 years). Hazard ratios for development of SAP abnormalities were calculated based on baseline classification results, follow-up time, and end point status (convert, nonconvert). Classification techniques applied were HRT classification (HRTC), Moorfields Regression Analysis, forward-selection optimized SVM (SVM fwd) and backward elimination-optimized SVM (SVM back) analysis of HRT data, and stereophotograph assessment. RESULTS: Univariate analyses indicated that all classification techniques were predictors of the development of two repeatable abnormal SAP results, with hazard ratios (95% confidence interval [CI]) ranging from 1.32 (1.00-1.75) for HRTC to 2.0 (1.48-2.76) for stereophotograph assessment (all P ≤ 0.05). 
Only SVM (SVM fwd and SVM back) analysis of HRT data and stereophotograph assessment were univariate predictors of the development of three repeatable abnormal SAP results, with hazard ratios (95% CI) ranging from 1.73 (1.16-2.82) for SVM fwd to 1.82 (1.19-3.12) for SVM back (both P < 0.007). Multivariate analyses including each classification technique individually in a model with age, baseline SAP pattern standard deviation [PSD], and baseline IOP indicated that all classification techniques except HRTC (P = 0.06) were predictors of the development of two repeatable abnormal SAP results with hazard ratios ranging from 1.30 (0.99, 1.73) for HRTC to 1.90 (1.37, 2.69) for stereophotograph assessment. Only SVM (SVM fwd and SVM back) analysis of HRT data and stereophotograph assessment were significant predictors of the development of three repeatable abnormal SAP results in multivariate analyses; hazard ratios of 1.57 (1.03, 2.59) and 1.70 (1.18, 2.51), respectively. SAP PSD was a significant predictor of two repeatable abnormal SAP results in multivariate models with all classification techniques, with hazard ratios ranging from 3.31 (1.39, 7.89) to 4.70 (2.02, 10.93) per 1-dB increase. CONCLUSIONS: HRT classification techniques and stereophotograph assessment can detect optic disc topography abnormalities in glaucoma-suspect eyes before the development of SAP abnormalities. These data support strongly the importance of optic disc examination for early glaucoma diagnosis. 
Keywords: 80 and over, Adolescent, Adult, Aged, Algorithms, Artificial Intelligence, Auditory, Benchmarking, Binding Sites, Brain Stem, Breast Diseases, Chemical, Child, Chromosomes, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, Data Interpretation, Databases, Diagnosis, Diagnostic Errors, Differential, Drug Resistance, Electroencephalography, Epilepsy, Evoked Potentials, Female, Forecasting, Gene Expression, Gene Expression Profiling, Genetic, Genotype, Glaucoma, Greece, HIV Protease Inhibitors, HIV-1, Human, Humans, Infant, Information Management, Information Storage and Retrieval, Intraocular Pressure, Kinetics, Language Development Disorders, Lasers, Least-Squares Analysis, Linear Models, Male, Microbial Sensitivity Tests, Middle Aged, Models, Molecular, Monitoring, Nephroblastoma, Non-U.S. Gov't, Nonlinear Dynamics, Ocular Hypertension, Oligonucleotide Array Sequence Analysis, Ophthalmoscopy, Optic Disk, Optic Nerve Diseases, P.H.S., Pair 1, Perimetry, Periodicals, Phosphorylation, Phosphotransferases, Photography, Physiologic, Point Mutation, Preschool, Prognosis, Protein, Proteins, Pyrimidinones, Reaction Time, Recurrence, Reproducibility of Results, Research Support, Reverse Transcriptase Inhibitors, Sensitivity and Specificity, Sequence Alignment, Sequence Analysis, Signal Processing, Software, Sound Localization, Statistical, Stochastic Processes, Structure-Activity Relationship, Theoretical, Time Factors, U.S. Gov't, Viral, Vision Disorders, Visual Fields, 15223803
[Bhasin2004SVM]	M. Bhasin and G. P. S. Raghava. SVM based method for predicting HLA-DRB1*0401 binding peptides in an antigen sequence. Bioinformatics, 20(3):421-423, 2004. [ bib \| http \| .pdf ] Summary: Prediction of peptides binding with MHC class II allele HLA-DRB1*0401 can effectively reduce the number of experiments required for identifying helper T cell epitopes. This paper describes a support vector machine (SVM) based method developed for identifying HLA-DRB1*0401 binding peptides in an antigenic sequence. SVM was trained and tested on a large and clean data set consisting of 567 binders and an equal number of non-binders. The accuracy of the method was 86% [...] Available: A web server HLA-DR4Pred based on the above approach is available at http://www.imtech.res.in/raghava/hladr4pred/ and http://bioinformatics.uams.edu/mirror/hladr4pred/ (Mirror Site). Supplementary information: http://www.imtech.res.in/raghava/hladr4pred/info.html Keywords: biosvm immunoinformatics
[Bhasin2004Prediction]	M. Bhasin and G. P. S. Raghava. Prediction of CTL epitopes using QM, SVM and ANN techniques. Vaccine, 22(23-24):3195-3204, 2004. [ bib \| DOI \| http \| .pdf ] Cytotoxic T lymphocyte (CTL) epitopes are potential candidates for subunit vaccine design for various diseases. Most of the existing T cell epitope prediction methods are indirect methods that predict MHC class I binders instead of CTL epitopes. In this study, a systematic attempt has been made to develop a direct method for predicting CTL epitopes from an antigenic sequence. This method is based on quantitative matrix (QM) and machine learning techniques such as Support Vector Machine (SVM) and Artificial Neural Network (ANN). This method has been trained and tested on non-redundant dataset of T cell epitopes and non-epitopes that includes 1137 experimentally proven MHC class I restricted T cell epitopes. The accuracy of QM-, ANN- and SVM-based methods was 70.0, 72.2 and 75.2% [...] has been evaluated through Leave One Out Cross-Validation (LOOCV) at a cutoff score where sensitivity and specificity was nearly equal. Finally, both machine-learning methods were used for consensus and combined prediction of CTL epitopes. The performances of these methods were evaluated on blind dataset where machine learning-based methods perform better than QM-based method. We also demonstrated through subgroup analysis that our methods can discriminate between T-cell epitopes and MHC binders (non-epitopes). In brief this method allows prediction of CTL epitopes using QM, SVM, ANN approaches. The method also facilitates prediction of MHC restriction in predicted T cell epitopes. Keywords: biosvm immunoinformatics
[Bhasin2004GPCRpred]	M. Bhasin and G. P. S. Raghava. GPCRpred: an SVM-based method for prediction of families and subfamilies of G-protein coupled receptors. Nucl. Acids Res., 32(Suppl. 2):W383-389, 2004. [ bib \| DOI \| arXiv \| http \| .pdf ] G-protein coupled receptors (GPCRs) belong to one of the largest superfamilies of membrane proteins and are important targets for drug design. In this study, a support vector machine (SVM)-based method, GPCRpred, has been developed for predicting families and subfamilies of GPCRs from the dipeptide composition of proteins. The dataset used in this study for training and testing was obtained from http://www.soe.ucsc.edu/research/compbio/gpcr/. The method classified GPCRs and non-GPCRs with an accuracy of 99.5 evaluated using 5-fold cross-validation. The method is further able to predict five major classes or families of GPCRs with an overall Matthews correlation coefficient (MCC) and accuracy of 0.81 and 97.5 of the rhodopsin-like family, the method achieved an average MCC and accuracy of 0.97 and 97.3 overall accuracy of 91.3 respectively when evaluated on an independent/blind dataset of 650 GPCRs. A server for recognition and classification of GPCRs based on multiclass SVMs has been set up at http://www.imtech.res.in/raghava/gpcrpred/. We have also suggested subfamilies for 42 sequences which were previously identified as unclassified ClassA GPCRs. The supplementary information is available at http://www.imtech.res.in/raghava/gpcrpred/info.html. Keywords: biosvm
[Bhasin2004ESLpred]	M. Bhasin and G. P. S. Raghava. ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucl. Acids Res., 32(Suppl. 2):W414-419, 2004. [ bib \| DOI \| arXiv \| http \| .pdf ] Automated prediction of subcellular localization of proteins is an important step in the functional annotation of genomes. The existing subcellular localization prediction methods are based on either amino acid composition or N-terminal characteristics of the proteins. In this paper, support vector machine (SVM) has been used to predict the subcellular location of eukaryotic proteins from their different features such as amino acid composition, dipeptide composition and physico-chemical properties. The SVM module based on dipeptide composition performed better than the SVM modules based on amino acid composition or physico-chemical properties. In addition, PSI-BLAST was also used to search the query sequence against the dataset of proteins (experimentally annotated proteins) to predict its subcellular location. In order to improve the prediction accuracy, we developed a hybrid module using all features of a protein, which consisted of an input vector of 458 dimensions (400 dipeptide compositions, 33 properties, 20 amino acid compositions of the protein and 5 from PSI-BLAST output). Using this hybrid approach, the prediction accuracies of nuclear, cytoplasmic, mitochondrial and extracellular proteins reached 95.3, 85.2, 68.2 and 88.9 overall prediction accuracy of SVM modules based on amino acid composition, physico-chemical properties, dipeptide composition and the hybrid approach was 78.1, 77.8, 82.9 and 88.0 The accuracy of all the modules was evaluated using a 5-fold cross-validation technique. Assigning a reliability index (reliability index > or =3), 73.5 on the above approach, an online web server ESLpred was developed, which is available at http://www.imtech.res.in/raghava/eslpred/. Keywords: biosvm
[Bhasin2004Classification]	M. Bhasin and G. P. S. Raghava. Classification of Nuclear Receptors Based on Amino Acid Composition and Dipeptide Composition. J. Biol. Chem., 279(22):23262-23266, 2004. [ bib \| DOI \| arXiv \| http \| .pdf ] Nuclear receptors are key transcription factors that regulate crucial gene networks responsible for cell growth, differentiation, and homeostasis. Nuclear receptors form a superfamily of phylogenetically related proteins and control functions associated with major diseases (e.g. diabetes, osteoporosis, and cancer). In this study, a novel method has been developed for classifying the subfamilies of nuclear receptors. The classification was achieved on the basis of amino acid and dipeptide composition from a sequence of receptors using support vector machines. The training and testing was done on a non-redundant data set of 282 proteins obtained from the NucleaRDB data base (1). The performance of all classifiers was evaluated using a 5-fold cross validation test. In the 5-fold cross-validation, the data set was randomly partitioned into five equal sets and evaluated five times on each distinct set while keeping the remaining four sets for training. It was found that different subfamilies of nuclear receptors were quite closely correlated in terms of amino acid composition as well as dipeptide composition. The overall accuracy of amino acid composition-based and dipeptide composition-based classifiers were 82.6 and 97.5 that different subfamilies of nuclear receptors are predictable with considerable accuracy using amino acid or dipeptide composition. Furthermore, based on above approach, an online web service, NRpred, was developed, which is available at www.imtech.res.in/raghava/nrpred. Keywords: biosvm
[Bhasin2004Analysis]	M. Bhasin and G. P. S. Raghava. Analysis and prediction of affinity of TAP binding peptides using cascade SVM. Protein Sci., 13(3):596-607, Mar 2004. [ bib \| DOI \| http \| .pdf ] The generation of cytotoxic T lymphocyte (CTL) epitopes from an antigenic sequence involves number of intracellular processes, including production of peptide fragments by proteasome and transport of peptides to endoplasmic reticulum through transporter associated with antigen processing (TAP). In this study, 409 peptides that bind to human TAP transporter with varying affinity were analyzed to explore the selectivity and specificity of TAP transporter. The abundance of each amino acid from P1 to P9 positions in high-, intermediate-, and low-affinity TAP binders were examined. The rules for predicting TAP binding regions in an antigenic sequence were derived from the above analysis. The quantitative matrix was generated on the basis of contribution of each position and residue in binding affinity. The correlation of r = 0.65 was obtained between experimentally determined and predicted binding affinity by using a quantitative matrix. Further a support vector machine (SVM)-based method has been developed to model the TAP binding affinity of peptides. The correlation (r = 0.80) was obtained between the predicted and experimental measured values by using sequence-based SVM. The reliability of prediction was further improved by cascade SVM that uses features of amino acids along with sequence. An extremely good correlation (r = 0.88) was obtained between measured and predicted values, when the cascade SVM-based method was evaluated through jackknife testing. A Web service, TAPPred (http://www.imtech.res.in/raghava/tappred/ or http://bioinformatics.uams.edu/mirror/tappred/), has been developed based on this approach. Keywords: biosvm
[Bern2004Automatic]	M. Bern, D. Goldberg, W. H. McDonald, and III Yates, J. R. Automatic Quality Assessment of Peptide Tandem Mass Spectra. Bioinformatics, 20(Suppl. 1):i49-i54, 2004. [ bib \| http \| .pdf ] Motivation: A powerful proteomics methodology couples high-performance liquid chromatography (HPLC) with tandem mass spectrometry and database-search software, such as SEQUEST. Such a set-up, however, produces a large number of spectra, many of which are of too poor quality to be useful. Hence a filter that eliminates poor spectra before the database search can significantly improve throughput and robustness. Moreover, spectra judged to be of high quality, but that cannot be identified by database search, are prime candidates for still more computationally intensive methods, such as de novo sequencing or wider database searches including post-translational modifications. Results: We report on two different approaches to assessing spectral quality prior to identification: binary classification, which predicts whether or not SEQUEST will be able to make an identification, and statistical regression, which predicts a more universal quality metric involving the number of b- and y-ion peaks. The best of our binary classifiers can eliminate over 75 spectra while losing only 10 regression can pick out spectra of modified peptides that can be identified by a de novo program but not by SEQUEST. In a section of independent interest, we discuss intensity normalization of mass spectra. Keywords: biosvm proteomics
[Baumgartner2004Supervised]	C. Baumgartner, C. Bohm, D. Baumgartner, G. Marini, K. Weinberger, B. Olgemoller, B. Liebl, and A. A. Roscher. Supervised machine learning techniques for the classification of metabolic disorders in newborns. Bioinformatics, 20(17):2985-2996, 2004. [ bib \| DOI \| http \| .pdf ] Motivation: During the Bavarian newborn screening programme all newborns have been tested for about 20 inherited metabolic disorders. Owing to the amount and complexity of the generated experimental data, machine learning techniques provide a promising approach to investigate novel patterns in high-dimensional metabolic data which form the source for constructing classification rules with high discriminatory power. Results: Six machine learning techniques have been investigated for their classification accuracy focusing on two metabolic disorders, phenylketonuria (PKU) and medium-chain acyl-CoA dehydrogenase deficiency (MCADD). Logistic regression analysis led to superior classification rules (sensitivity >96.8 to all investigated algorithms. Including novel constellations of metabolites into the models, the positive predictive value could be strongly increased (PKU 71.9 54.6 clearly prove that the mined data confirm the known and indicate some novel metabolic patterns which may contribute to a better understanding of newborn metabolism. Availability: WEKA machine learning package: www.cs.waikato.ac.nz/ ml/weka and statistical software package ADE-4: http://pbil.univ-lyon1.fr/ADE-4 Keywords: biosvm proteomics
[Yap2004Prediction]	C. W. Yap, C. Z. Cai, Y. Xue, and Y. Z. Chen. Prediction of torsade-causing potential of drugs by support vector machine approach. Toxicol Sci, 79(1):170-7, May 2004. [ bib \| DOI \| http \| .pdf ] In an effort to facilitate drug discovery, computational methods for facilitating the prediction of various adverse drug reactions (ADRs) have been developed. So far, attention has not been sufficiently paid to the development of methods for the prediction of serious ADRs that occur less frequently. Some of these ADRs, such as torsade de pointes (TdP), are important issues in the approval of drugs for certain diseases. Thus there is a need to develop tools for facilitating the prediction of these ADRs. This work explores the use of a statistical learning method, support vector machine (SVM), for TdP prediction. TdP involves multiple mechanisms and SVM is a method suitable for such a problem. Our SVM classification system used a set of linear solvation energy relationship (LSER) descriptors and was optimized by leave-one-out cross validation procedure. Its prediction accuracy was evaluated by using an independent set of agents and by comparison with results obtained from other commonly used classification methods using the same dataset and optimization procedure. The accuracies for the SVM prediction of TdP-causing agents and non-TdP-causing agents are 97.4 and 84.6% respectively; one is substantially improved against and the other is comparable to the results obtained by other classification methods useful for multiple-mechanism prediction problems. This indicates the potential of SVM in facilitating the prediction of TdP-causing risk of small molecules and perhaps other ADRs that involve multiple mechanisms. Keywords: biosvm chemoinformatics
[Pavey2004Microarray]	S. Pavey, P. Johansson, L. Packer, J. Taylor, M. Stark, P.M. Pollock, G.J. Walker, G.M. Boyle, U. Harper, S.J. Cozzi, K. Hansen, L. Yudt, C. Schmidt, P. Hersey, K.A. Ellem, M.G. O'Rourke, P.G. Parsons, P. Meltzer, M. Ringner, and N.K. Hayward. Microarray expression profiling in melanoma reveals a BRAF mutation signature. Oncogene, 23(23):4060-4067, May 2004. [ bib \| DOI \| http \| .pdf ] We have used microarray gene expression profiling and machine learning to predict the presence of BRAF mutations in a panel of 61 melanoma cell lines. The BRAF gene was found to be mutated in 42 samples (69 seven samples (11 Using support vector machines, we have built a classifier that differentiates between melanoma cell lines based on BRAF mutation status. As few as 83 genes are able to discriminate between BRAF mutant and BRAF wild-type samples with clear separation observed using hierarchical clustering. Multidimensional scaling was used to visualize the relationship between a BRAF mutation signature and that of a generalized mitogen-activated protein kinase (MAPK) activation (either BRAF or NRAS mutation) in the context of the discriminating gene list. We observed that samples carrying NRAS mutations lie somewhere between those with or without BRAF mutations. These observations suggest that there are gene-specific mutation signals in addition to a common MAPK activation that result from the pleiotropic effects of either BRAF or NRAS on other signaling pathways, leading to measurably different transcriptional changes. Keywords: biosvm microarray
[Mestres2004Computational]	Jordi Mestres. Computational chemogenomics approaches to systematic knowledge-based drug discovery. Curr Opin Drug Discov Devel, 7(3):304-313, May 2004. [ bib ] Chemogenomics, the identification of all possible drugs for all possible targets, has recently emerged as a new paradigm in drug discovery in which efficiency in the compound design and optimization process is achieved through the gain and reuse of targeted knowledge. As targeted knowledge resides at the interface between chemistry and biology, computational tools aimed at integrating the chemical and biological spaces play a central role in chemogenomics. This review covers the recent progress made in integrative computational approaches to data annotation and knowledge generation for the systematic knowledge-based design and screening of chemical libraries. Keywords: Chemistry, Pharmaceutical; Combinatorial Chemistry Techniques; Computational Biology; Drug Design; Genomics; Ligands; Proteins; Receptors, G-Protein-Coupled
[Kim2004Emotion]	K. H. Kim, S. W. Bang, and S. R. Kim. Emotion recognition system using short-term monitoring of physiological signals. Med Biol Eng Comput, 42(3):419-27, May 2004. [ bib ] A physiological signal-based emotion recognition system is reported. The system was developed to operate as a user-independent system, based on physiological signal databases obtained from multiple subjects. The input signals were electrocardiogram, skin temperature variation and electrodermal activity, all of which were acquired without much discomfort from the body surface, and can reflect the influence of emotion on the autonomic nervous system. The system consisted of preprocessing, feature extraction and pattern classification stages. Preprocessing and feature extraction methods were devised so that emotion-specific characteristics could be extracted from short-segment signals. Although the features were carefully extracted, their distribution formed a classification problem, with large overlap among clusters and large variance within clusters. A support vector machine was adopted as a pattern classifier to resolve this difficulty. Correct-classification ratios for 50 subjects were 78.4% and 61.8%, for the recognition of three and four categories, respectively. Keywords: Algorithms, Animals, Antisense, Artificial Intelligence, Autonomic Nervous System, Cell Line, Child, Cluster Analysis, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA Fingerprinting, Drug Evaluation, Emotions, Fluorescence, Fuzzy Logic, Gene Silencing, Gene Targeting, Genetic, Hela Cells, Humans, Imaging, Intracellular Space, Microscopy, Models, Monitoring, Neoplasms, Neural Networks (Computer), Non-U.S. Gov't, Oligonucleotides, P.H.S., Physiologic, Preclinical, Preschool, Prognosis, Proteomics, Quantitative Structure-Activity Relationship, RNA, RNA Interference, Recognition (Psychology), Research Support, Sensitivity and Specificity, Signal Processing, Small Interfering, Thionucleotides, Three-Dimensional, Tumor, U.S. Gov't, User-Computer Interface, 15191089
[Kapetanovic2004Overview]	Izet M Kapetanovic, Simon Rosenfeld, and Grant Izmirlian. Overview of commonly used bioinformatics methods and their applications. Ann N Y Acad Sci, 1020:10-21, May 2004. [ bib \| DOI \| http \| .pdf ] Bioinformatics, in its broad sense, involves application of computer processes to solve biological problems. A wide range of computational tools are needed to effectively and efficiently process large amounts of data being generated as a result of recent technological innovations in biology and medicine. A number of computational tools have been developed or adapted to deal with the experimental riches of complex and multivariate data and transition from data collection to information or knowledge. These include a wide variety of clustering and classification algorithms, including self-organized maps (SOM), artificial neural networks (ANN), support vector machines (SVM), fuzzy logic, and even hyphenated techniques as neuro-fuzzy networks. These bioinformatics tools are being evaluated and applied in various medical areas including early detection, risk assessment, classification, and prognosis of cancer. The goal of these efforts is to develop and identify bioinformatics methods with optimal sensitivity, specificity, and predictive capabilities. Keywords: Computational Biology, Fuzzy Logic, Humans, Neoplasms, Neural Networks (Computer), Prognosis, 15208179
[Gunderson2004Decoding]	Kevin L Gunderson, Semyon Kruglyak, Michael S Graige, Francisco Garcia, Bahram G Kermani, Chanfeng Zhao, Diping Che, Todd Dickinson, Eliza Wickham, Jim Bierle, Dennis Doucet, Monika Milewski, Robert Yang, Chris Siegmund, Juergen Haas, Lixin Zhou, Arnold Oliphant, Jian-Bing Fan, Steven Barnard, and Mark S Chee. Decoding randomly ordered DNA arrays. Genome Res, 14(5):870-877, May 2004. [ bib \| DOI \| http ] We have developed a simple and efficient algorithm to identify each member of a large collection of DNA-linked objects through the use of hybridization, and have applied it to the manufacture of randomly assembled arrays of beads in wells. Once the algorithm has been used to determine the identity of each bead, the microarray can be used in a wide variety of applications, including single nucleotide polymorphism genotyping and gene expression profiling. The algorithm requires only a few labels and several sequential hybridizations to identify thousands of different DNA sequences with great accuracy. We have decoded tens of thousands of arrays, each with 1520 sequences represented at approximately 30-fold redundancy by up to approximately 50,000 beads, with a median error rate of <1 x 10(-4) per bead. The approach makes use of error checking codes and provides, for the first time, a direct functional quality control of every element of each array that is manufactured. The algorithm can be applied to any spatially fixed collection of objects or molecules that are associated with specific DNA sequences. Keywords: Algorithms; Computational Biology, methods; Oligonucleotide Array Sequence Analysis, methods/trends; Random Allocation; Research Design; Sequence Analysis, DNA, methods; Silicon Dioxide, chemistry
[Cai2004Prediction]	Yu-Dong Cai and Andrew J Doig. Prediction of Saccharomyces cerevisiae protein functional class from functional domain composition. Bioinformatics, 20(8):1292-300, May 2004. [ bib \| DOI \| http \| .pdf ] MOTIVATION: A key goal of genomics is to assign function to genes, especially for orphan sequences. RESULTS: We compared the clustered functional domains in the SBASE database to each protein sequence using BLASTP. This representation for a protein is a vector, where each of the non-zero entries in the vector indicates a significant match between the sequence of interest and the SBASE domain. The machine learning methods nearest neighbour algorithm (NNA) and support vector machines are used for predicting protein functional classes from this information. We find that the best results are found using the SBASE-A database and the NNA, namely 72% accuracy for 79% coverage. We tested an assigning function based on searching for InterPro sequence motifs and by taking the most significant BLAST match within the dataset. We applied the functional domain composition method to predict the functional class of 2018 currently unclassified yeast open reading frames. AVAILABILITY: A program for the prediction method, that uses NNA called Functional Class Prediction based on Functional Domains (FCPFD) is available and can be obtained by contacting Y.D.Cai at y.cai@umist.ac.uk Keywords: biosvm
[Zhou2005LS]	Xin Zhou and K. Z. Mao. LS Bound based gene selection for DNA microarray data. Bioinformatics, 21(8):1559-64, Apr 2005. [ bib \| DOI \| http \| .pdf ] MOTIVATION: One problem with discriminant analysis of DNA microarray data is that each sample is represented by quite a large number of genes, and many of them are irrelevant, insignificant or redundant to the discriminant problem at hand. Methods for selecting important genes are, therefore, of much significance in microarray data analysis. In the present study, a new criterion, called LS Bound measure, is proposed to address the gene selection problem. The LS Bound measure is derived from leave-one-out procedure of LS-SVMs (least squares support vector machines), and as the upper bound for leave-one-out classification results it reflects to some extent the generalization performance of gene subsets. RESULTS: We applied this LS Bound measure for gene selection on two benchmark microarray datasets: colon cancer and leukemia. We also compared the LS Bound measure with other evaluation criteria, including the well-known Fisher's ratio and Mahalanobis class separability measure, and other published gene selection algorithms, including Weighting factor and SVM Recursive Feature Elimination. The strength of the LS Bound measure is that it provides gene subsets leading to more accurate classification results than the filter method while its computational complexity is at the level of the filter method. AVAILABILITY: A companion website can be accessed at http://www.ntu.edu.sg/home5/pg02776030/lsbound/. The website contains: (1) the source code of the gene selection algorithm; (2) the complete set of tables and figures regarding the experimental study; (3) proof of the inequality (9). CONTACT: ekzmao@ntu.edu.sg. Keywords: biosvm featureselection microarray
[Zhou2005Recognition]	GuoDong Zhou, Dan Shen, Jie Zhang, Jian Su, and SoonHeng Tan. Recognition of protein/gene names from text using an ensemble of classifiers. BMC Bioinformatics, 6 Suppl 1:S7, 2005. [ bib \| DOI \| http \| .pdf ] This paper proposes an ensemble of classifiers for biomedical name recognition in which three classifiers, one Support Vector Machine and two discriminative Hidden Markov Models, are combined effectively using a simple majority voting strategy. In addition, we incorporate three post-processing modules, including an abbreviation resolution module, a protein/gene name refinement module and a simple dictionary matching module, into the system to further improve the performance. Evaluation shows that our system achieves the best performance from among 10 systems with a balanced F-measure of 82.58 on the closed evaluation of the BioCreative protein/gene name recognition task (Task 1A). Keywords: biosvm nlp
[Zhang2005Descriptor-based]	Z. Zhang, S. Kochhar, and M. G. Grigorov. Descriptor-based protein remote homology identification. Protein Sci., 14(2):431-444, 2005. [ bib \| DOI \| http \| .pdf ] Here, we report a novel protein sequence descriptor-based remote homology identification method, able to infer fold relationships without the explicit knowledge of structure. In a first phase, we have individually benchmarked 13 different descriptor types in fold identification experiments in a highly diverse set of protein sequences. The relevant descriptors were related to the fold class membership by using simple similarity measures in the descriptor spaces, such as the cosine angle. Our results revealed that the three best-performing sets of descriptors were the sequence-alignment-based descriptor using PSI-BLAST e-values, the descriptors based on the alignment of secondary structural elements (SSEA), and the descriptors based on the occurrence of PROSITE functional motifs. In a second phase, the three top-performing descriptors were combined to obtain a final method with improved performance, which we named DescFold. Class membership was predicted by Support Vector Machine (SVM) learning. In comparison with the individual PSI-BLAST-based descriptor, the rate of remote homology identification increased from 33.7 was able to identify the true remote homolog for nearly every sixth sequence at the 95 PSI-BLAST search. We have benchmarked the DescFold method against several other state-of-the-art fold recognition algorithms for the 172 LiveBench-8 targets, and we concluded that it was able to add value to the existing techniques by providing a confident hit for at least 10 known methods. Keywords: biosvm
[Zhang2005MULTIPRED]	G. L. Zhang, A. M. Khan, K. N. Srinivasan, J. T. August, and V. Brusic. MULTIPRED: a computational system for prediction of promiscuous HLA binding peptides. Nucleic Acids Res., 33(Web Server issue):W172-W179, Jul 2005. [ bib \| DOI \| http ] MULTIPRED is a web-based computational system for the prediction of peptide binding to multiple molecules (proteins) belonging to human leukocyte antigens (HLA) class I A2, A3 and class II DR supertypes. It uses hidden Markov models and artificial neural network methods as predictive engines. A novel data representation method enables MULTIPRED to predict peptides that promiscuously bind multiple HLA alleles within one HLA supertype. Extensive testing was performed for validation of the prediction models. Testing results show that MULTIPRED is both sensitive and specific and it has good predictive ability (area under the receiver operating characteristic curve A(ROC) > 0.80). MULTIPRED can be used for the mapping of promiscuous T-cell epitopes as well as the regions of high concentration of these targets, termed T-cell epitope hotspots. MULTIPRED is available at http://antigen.i2r.a-star.edu.sg/multipred/. Keywords: Algorithms, Amino Acid Sequence, Antigen-Antibody Complex, Automated, Binding Sites, Computational Biology, Drug Delivery Systems, Drug Design, Epitopes, HLA Antigens, HLA-A Antigens, HLA-DR Antigens, Humans, Internet, Markov Chains, Molecular Sequence Data, Neural Networks (Computer), Pattern Recognition, Peptides, Protein, Protein Binding, Protein Interaction Mapping, Sequence Analysis, Software, T-Lymphocyte, User-Computer Interface, Viral Vaccines, 15980449
[Zaki2005Application]	N. M. Zaki, S. Deris, and R. Illias. Application of string kernels in protein sequence classification. Appl. Bioinformatics, 4(1):45-52, 2005. [ bib ] INTRODUCTION: The production of biological information has become much greater than its consumption. The key issue now is how to organise and manage the huge amount of novel information to facilitate access to this useful and important biological information. One core problem in classifying biological information is the annotation of new protein sequences with structural and functional features. METHOD: This article introduces the application of string kernels in classifying protein sequences into homogeneous families. A string kernel approach used in conjunction with support vector machines has been shown to achieve good performance in text categorisation tasks. We evaluated and analysed the performance of this approach, and we present experimental results on three selected families from the SCOP (Structural Classification of Proteins) database. We then compared the overall performance of this method with the existing protein classification methods on benchmark SCOP datasets. RESULTS: According to the F1 performance measure and the rate of false positive (RFP) measure, the string kernel method performs well in classifying protein sequences. The method outperformed all the generative-based methods and is comparable with the SVM-Fisher method. DISCUSSION: Although the string kernel approach makes no use of prior biological knowledge, it still captures sufficient biological information to enable it to outperform some of the state-of-the-art methods. Keywords: biosvm
[Yu2005integrated]	J.-k. Yu, S. Zheng, Y. Tang, and L. Li. An integrated approach utilizing proteomics and bioinformatics to detect ovarian cancer. J Zhejiang Univ Sci B, 6(4):227-31, Apr 2005. [ bib \| DOI \| http \| .pdf ] OBJECTIVE: To find new potential biomarkers and establish the patterns for the detection of ovarian cancer. METHODS: Sixty one serum samples including 32 ovarian cancer patients and 29 healthy people were detected by surface-enhanced laser desorption/ionization mass spectrometry (SELDI-MS). The protein fingerprint data were analyzed by bioinformatics tools. Ten folds cross-validation support vector machine (SVM) was used to establish the diagnostic pattern. RESULTS: Five potential biomarkers were found (2085 Da, 5881 Da, 7564 Da, 9422 Da, 6044 Da), combined with which the diagnostic pattern separated the ovarian cancer from the healthy samples with a sensitivity of 96.7%, a specificity of 96.7% and a positive predictive value of 96.7%. CONCLUSIONS: The combination of SELDI with bioinformatics tools could find new biomarkers and establish patterns with high sensitivity and specificity for the detection of ovarian cancer. Keywords: biosvm
[Yu2005Classifying]	C. Yu, N. Zavaljevski, F. J. Stevens, K. Yackovich, and J. Reifman. Classifying noisy protein sequence data: a case study of immunoglobulin light chains. Bioinformatics, 21(Supp 1):i495-i501, Jun 2005. [ bib \| DOI \| http \| .pdf ] SUMMARY: The classification of protein sequences obtained from patients with various immunoglobulin-related conformational diseases may provide insight into structural correlates of pathogenicity. However, clinical data are very sparse and, in the case of antibody-related proteins, the collected sequences have large variability with only a small subset of variations relevant to the protein pathogenicity (function). On this basis, these sequences represent a model system for development of strategies to recognize the small subset of function-determining variations among the much larger number of primary structure diversifications introduced during evolution. Under such conditions, most protein classification algorithms have limited accuracy. To address this problem, we propose a support vector machine (SVM)-based classifier that combines sequence and 3D structural averaging information. Each amino acid in the sequence is represented by a set of six physicochemical properties: hydrophobicity, hydrophilicity, volume, surface area, bulkiness and refractivity. Each position in the sequence is described by the properties of the amino acid at that position and the properties of its neighbors in 3D space or in the sequence. A structure template is selected to determine neighbors in 3D space and a window size is used to determine the neighbors in the sequence. The test data consist of 209 proteins of human antibody immunoglobulin light chains, each represented by aligned sequences of 120 amino acids. 
The methodology is applied to the classification of protein sequences collected from patients with and without amyloidosis, and indicates that the proposed modified classifiers are more robust to sequence variability than standard SVM classifiers, improving classification error between 5 and 25% and sensitivity between 9 and 17%. The classification results might also suggest possible mechanisms for the propensity of immunoglobulin light chains to amyloid formation. CONTACT: cyu@bioanalysis.org. Keywords: biosvm
[Yiu2005Filtering]	S. M. Yiu, Prudence W. H. Wong, T.W. Lam, Y.C. Mui, H. F. Kung, Marie Lin, and Y. T. Cheung. Filtering of Ineffective siRNAs and Improved siRNA Design Tool. Bioinformatics, 21(2):144-151, Jan 2005. To appear. [ bib \| DOI \| http \| .pdf ] Motivation: Short interfering RNAs (siRNAs) can be used to suppress gene expression and possess many potential applications in therapy, but how to design an effective siRNA is still not clear. Based on the MPI (Max-Planck-Institute) basic principles, a number of siRNA design tools have been developed recently. The set of candidates reported by these tools is usually large and often contains ineffective siRNAs. In view of this, we initiate the study of filtering ineffective siRNAs. Results: The contribution of this paper is 2-fold. First, we propose a fair scheme to compare existing design tools based on real data in the literature. Second, we attempt to improve the MPI principles and existing tools by an algorithm that can filter ineffective siRNAs. The algorithm is based on some new observations on the secondary structure, which we have verified by AI techniques (decision trees and support vector machines). We have tested our algorithm together with the MPI principles and the existing tools. The results show that our filtering algorithm is effective. Availability: The siRNA design software tool can be found in the website http://www.cs.hku.hk/ sirna/ Contact: smyiu@cs.hku.hk Keywords: biosvm
[Yap2005Prediction]	C. W. Yap and Y. Z. Chen. Prediction of Cytochrome P450 3A4, 2D6, and 2C9 Inhibitors and Substrates by Using Support Vector Machines. J Chem Inf Model, 45(4):982-92, 2005. [ bib \| DOI \| http \| .pdf ] Statistical learning methods have been used in developing filters for predicting inhibitors of two P450 isoenzymes, CYP3A4 and CYP2D6. This work explores the use of different statistical learning methods for predicting inhibitors of these enzymes and an additional P450 enzyme, CYP2C9, and the substrates of the three P450 isoenzymes. Two consensus support vector machine (CSVM) methods, "positive majority" (PM-CSVM) and "positive probability" (PP-CSVM), were used in this work. These methods were first tested for the prediction of inhibitors of CYP3A4 and CYP2D6 by using a significantly higher number of inhibitors and noninhibitors than that used in earlier studies. They were then applied to the prediction of inhibitors of CYP2C9 and substrates of the three enzymes. Both methods predict inhibitors of CYP3A4 and CYP2D6 at a similar level of accuracy as those of earlier studies. For classification of inhibitors of CYP2C9, the best CSVM method gives an accuracy of 88.9% for inhibitors and 96.3% for noninhibitors. The accuracies for classification of substrates and nonsubstrates of CYP3A4, CYP2D6, and CYP2C9 are 98.2 and 90.9%, 96.6 and 94.4%, and 85.7 and 98.8%, respectively. Both CSVM methods are potentially useful as filters for predicting inhibitors and substrates of P450 isoenzymes. These methods generally give better accuracies than single SVM classification systems, and the performance of the PP-CSVM method is slightly better than that of the PM-CSVM method. Keywords: biosvm chemoinformatics
[Yamanishi2005Supervised]	Y. Yamanishi, J.-P. Vert, and M. Kanehisa. Supervised enzyme network inference from the integration of genomic data and chemical information. Bioinformatics, 21:i468-i477, 2005. [ bib \| DOI \| http \| .pdf ] Motivation: The metabolic network is an important biological network which relates enzyme proteins and chemical compounds. A large number of metabolic pathways remain unknown nowadays, and many enzymes are missing even in known metabolic pathways. There is, therefore, an incentive to develop methods to reconstruct the unknown parts of the metabolic network and to identify genes coding for missing enzymes. Results: This paper presents new methods to infer enzyme networks from the integration of multiple genomic data and chemical information, in the framework of supervised graph inference. The originality of the methods is the introduction of chemical compatibility as a constraint for refining the network predicted by the network inference engine. The chemical compatibility between two enzymes is obtained automatically from the information encoded by their Enzyme Commission (EC) numbers. The proposed methods are tested and compared on their ability to infer the enzyme network of the yeast Saccharomyces cerevisiae from four datasets for enzymes with assigned EC numbers: gene expression data, protein localization data, phylogenetic profiles and chemical compatibility information. It is shown that the prediction accuracy of the network reconstruction consistently improves owing to the introduction of chemical constraints, the use of a supervised approach and the weighted integration of multiple datasets. Finally, we conduct a comprehensive prediction of a global enzyme network consisting of all enzyme candidate proteins of the yeast to obtain new biological findings. Keywords: biosvm
[Yabuki2005GRIFFIN]	Y. Yabuki, T. Muramatsu, T. Hirokawa, H. Mukai, and M. Suwa. GRIFFIN: a system for predicting GPCR-G-protein coupling selectivity using a support vector machine and a hidden Markov model. Nucleic Acids Res., 33(Web Server issue):W148-53, Jul 2005. [ bib \| DOI \| http \| .pdf ] We describe a novel system, GRIFFIN (G-protein and Receptor Interaction Feature Finding INstrument), that predicts G-protein coupled receptor (GPCR) and G-protein coupling selectivity based on a support vector machine (SVM) and a hidden Markov model (HMM) with high sensitivity and specificity. Based on our assumption that whole structural segments of ligands, GPCRs and G-proteins are essential to determine GPCR and G-protein coupling, various quantitative features were selected for ligands, GPCRs and G-protein complex structures, and those parameters that are the most effective in selecting G-protein type were used as feature vectors in the SVM. The main part of GRIFFIN includes a hierarchical SVM classifier using the feature vectors, which is useful for Class A GPCRs, the major family. For the opsins and olfactory subfamilies of Class A and other minor families (Classes B, C, frizzled and smoothened), the binding G-protein is predicted with high accuracy using the HMM. Applying this system to known GPCR sequences, each binding G-protein is predicted with high sensitivity and specificity (>85% on average). GRIFFIN (http://griffin.cbrc.jp/) is freely available and allows users to easily execute this reliable prediction of G-proteins. Keywords: biosvm
[Xie2005LOCSVMPSI]	Dan Xie, Ao Li, Minghui Wang, Zhewen Fan, and Huanqing Feng. LOCSVMPSI: a web server for subcellular localization of eukaryotic proteins using SVM and profile of PSI-BLAST. Nucleic Acids Res., 33(Web Server issue):W105-10, Jul 2005. [ bib \| DOI \| http \| .pdf ] Subcellular location of a protein is one of the key functional characters as proteins must be localized correctly at the subcellular level to have normal biological function. In this paper, a novel method named LOCSVMPSI has been introduced, which is based on the support vector machine (SVM) and the position-specific scoring matrix generated from profiles of PSI-BLAST. With a jackknife test on the RH2427 data set, LOCSVMPSI achieved a high overall prediction accuracy of 90.2%, which is higher than the prediction results by SubLoc and ESLpred on this data set. In addition, prediction performance of LOCSVMPSI was evaluated with 5-fold cross validation test on the PK7579 data set and the prediction results were consistently better than the previous method based on several SVMs using composition of both amino acids and amino acid pairs. Further test on the SWISSPROT new-unique data set showed that LOCSVMPSI also performed better than some widely used prediction methods, such as PSORTII, TargetP and LOCnet. All these results indicate that LOCSVMPSI is a powerful tool for the prediction of eukaryotic protein subcellular localization. An online web server (current version is 1.3) based on this method has been developed and is freely available to both academic and commercial users, which can be accessed by at http://Bioinformatics.ustc.edu.cn/LOCSVMPSI/LOCSVMPSI.php. Keywords: biosvm
[Wang2005Gene]	Yu Wang, Igor V Tetko, Mark A Hall, Eibe Frank, Axel Facius, Klaus F X Mayer, and Hans W Mewes. Gene selection from microarray data for cancer classification-a machine learning approach. Comput. Biol. Chem., 29(1):37-46, Feb 2005. [ bib \| DOI \| http \| .pdf ] A DNA microarray can track the expression levels of thousands of genes simultaneously. Previous research has demonstrated that this technology can be useful in the classification of cancers. Cancer microarray data normally contains a small number of samples which have a large number of gene expression levels as features. To select relevant genes involved in different types of cancer remains a challenge. In order to extract useful gene information from cancer microarray data and reduce dimensionality, feature selection algorithms were systematically investigated in this study. Using a correlation-based feature selector combined with machine learning algorithms such as decision trees, naïve Bayes and support vector machines, we show that classification performance at least as good as published results can be obtained on acute leukemia and diffuse large B-cell lymphoma microarray data sets. We also demonstrate that a combined use of different classification and feature selection approaches makes it possible to select relevant genes with high confidence. This is also the first paper which discusses both computational and biological evidence for the involvement of zyxin in leukaemogenesis. Keywords: biosvm microarray
[Wang2005Prediction]	Ming-Lei Wang, Hui Yao, and Wen-Bo Xu. Prediction by support vector machines and analysis by Z-score of poly-L-proline type II conformation based on local sequence. Comput. Biol. Chem., 29(2):95-100, Apr 2005. [ bib \| DOI \| http \| .pdf ] In recent years, the poly-L-proline type II (PPII) conformation has gained more and more importance. This structure plays vital roles in many biological processes. But few studies have been made to predict PPII secondary structures computationally. The support vector machine (SVM) represents a new approach to supervised pattern classification and has been successfully applied to a wide range of pattern recognition problems. In this paper, we present a SVM prediction method of PPII conformation based on local sequence. The overall accuracy for both the independent testing set and estimate of jackknife testing reached approximately 70%. Matthew's correlation coefficient (MCC) could reach 0.4. By comparing the results of training and testing datasets with different sequence identities, we suggest that the performance of this method correlates with the sequence identity of dataset. The parameter of SVM kernel function was an important factor to the performance of this method. The propensities of residues located at different positions were also analyzed. By computing Z-scores, we found that P and G were the two most important residues to PPII structure conformation. Keywords: biosvm
[Wang2005Using]	M. Wang, J. Yang, and K-C. Chou. Using string kernel to predict signal peptide cleavage site based on subsite coupling model. Amino Acids, 28(4):395-402, Jun 2005. [ bib \| DOI \| http \| .pdf ] Owing to the importance of signal peptides for studying the molecular mechanisms of genetic diseases, reprogramming cells for gene therapy, and finding new drugs for healing a specific defect, it is in great demand to develop a fast and accurate method to identify the signal peptides. Introduction of the so-called -3,-1, +1 coupling model (Chou, K. C.: Protein Engineering, 2001, 14-2, 75-79) has made it possible to take into account the coupling effect among some key subsites and hence can significantly enhance the prediction quality of peptide cleavage site. Based on the subsite coupling model, a kind of string kernels for protein sequence is introduced. Integrating the biologically relevant prior knowledge, the constructed string kernels can thus be used by any kernel-based method. A Support vector machines (SVM) is thus built to predict the cleavage site of signal peptides from the protein sequences. The current approach is compared with the classical weight matrix method. At small false positive ratios, our method outperforms the classical weight matrix method, indicating the current approach may at least serve as a powerful complemental tool to other existing methods for predicting the signal peptide cleavage site. The software that generated the results reported in this paper is available upon requirement, and will appear at http://www.pami.sjtu.edu.cn/wm. Keywords: biosvm
[Wang2005Protein]	J. Wang, W.-K. Sung, A. Krishnan, and K.-B. Li. Protein subcellular localization prediction for Gram-negative bacteria using amino acid subalphabets and a combination of multiple support vector machines. BMC Bioinformatics, 6(1):174, Jul 2005. [ bib \| DOI \| http \| .pdf ] BACKGROUND: Predicting the subcellular localization of proteins is important for determining the function of proteins. Previous works focused on predicting protein localization in Gram-negative bacteria obtained good results. However, these methods had relatively low accuracies for the localization of extracellular proteins. This paper studies ways to improve the accuracy for predicting extracellular localization in Gram-negative bacteria. RESULTS: We have developed a system for predicting the subcellular localization of proteins for Gram-negative bacteria based on amino acid subalphabets and a combination of multiple support vector machines. The recall of the extracellular site and overall recall of our predictor reach 86.0% and 89.8%, respectively, in 5-fold cross-validation. To the best of our knowledge, these are the most accurate results for predicting subcellular localization in Gram-negative bacteria. CONCLUSIONS: Clustering 20 amino acids into a few groups by the proposed greedy algorithm provides a new way to extract features from protein sequences to cover more adjacent amino acids and hence reduce the dimensionality of the input vector of protein features. It was observed that a good amino acid grouping leads to an increase in prediction performance. Furthermore, a proper choice of a subset of complementary support vector machines constructed by different features of proteins maximizes the prediction accuracy. Keywords: biosvm
[Vlahovicek2005SBASE]	Kristian Vlahovicek, László Kaján, Vilmos Agoston, and Sándor Pongor. The SBASE domain sequence resource, release 12: prediction of protein domain-architecture using support vector machines. Nucleic Acids Res, 33(Database issue):D223-5, Jan 2005. [ bib \| DOI \| http \| .pdf ] SBASE (http://www.icgeb.trieste.it/sbase) is an online resource designed to facilitate the detection of domain homologies based on sequence database search. The present release of the SBASE A library of protein domain sequences contains 972,397 protein sequence segments annotated by structure, function, ligand-binding or cellular topology, clustered into 8547 domain groups. SBASE B contains 169,916 domain sequences clustered into 2526 less well-characterized groups. Domain prediction is based on an evaluation of database search results in comparison with a 'similarity network' of inter-sequence similarity scores, using support vector machines trained on similarity search results of known domains. Keywords: biosvm
[Vert2005Supervised]	J.-P. Vert and Y. Yamanishi. Supervised graph inference. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural Inform. Process. Syst., volume 17, pages 1433-1440. MIT Press, Cambridge, MA, 2005. [ bib \| www: ] Keywords: biosvm
[Vert2005Kernel]	J.-P. Vert. Kernel methods in computational biology. Technical Report ccsd-00012124, CNRS-HAL, Oct 2005. [ bib \| http \| .pdf ] Support vector machines and kernel methods are increasingly popular in genomics and computational biology, due to their good performance in real-world applications and strong modularity that makes them suitable to a wide range of problems, from the classification of tumors to the automatic annotation of proteins. Their ability to work in high dimension, to process non-vectorial data, and the natural framework they provide to integrate heterogeneous data are particularly relevant to various problems arising in computational biology. In this chapter we survey some of the most prominent applications published so far, highlighting the particular developments in kernel methods triggered by problems in biology, and mention a few promising research directions likely to expand in the future. Keywords: biosvm
[Tsirigos2005sensitive]	A. Tsirigos and I. Rigoutsos. A sensitive, support-vector-machine method for the detection of horizontal gene transfers in viral, archaeal and bacterial genomes. Nucleic Acids Res., 33(12):3699-707, 2005. [ bib \| DOI \| http \| .pdf ] In earlier work, we introduced and discussed a generalized computational framework for identifying horizontal transfers. This framework relied on a gene's nucleotide composition, obviated the need for knowledge of codon boundaries and database searches, and was shown to perform very well across a wide range of archaeal and bacterial genomes when compared with previously published approaches, such as Codon Adaptation Index and C + G content. Nonetheless, two considerations remained outstanding: we wanted to further increase the sensitivity of detecting horizontal transfers and also to be able to apply the method to increasingly smaller genomes. In the discussion that follows, we present such a method, Wn-SVM, and show that it exhibits a very significant improvement in sensitivity compared with earlier approaches. Wn-SVM uses a one-class support-vector machine and can learn using rather small training sets. This property makes Wn-SVM particularly suitable for studying small-size genomes, similar to those of viruses, as well as the typically larger archaeal and bacterial genomes. We show experimentally that the new method results in a superior performance across a wide range of organisms and that it improves even upon our own earlier method by an average of 10% across all examined genomes. As a small-genome case study, we analyze the genome of the human cytomegalovirus and demonstrate that Wn-SVM correctly identifies regions that are known to be conserved and prototypical of all beta-herpesvirinae, regions that are known to have been acquired horizontally from the human host and, finally, regions that had not up to now been suspected to be horizontally transferred. 
Atypical region predictions for many eukaryotic viruses, including the alpha-, beta- and gamma-herpesvirinae, and 123 archaeal and bacterial genomes, have been made available online at http://cbcsrv.watson.ibm.com/HGT_SVM/. Keywords: biosvm
[Tobita2005discriminant]	M. Tobita, T. Nishikawa, and R. Nagashima. A discriminant model constructed by the support vector machine method for HERG potassium channel inhibitors. Bioorg. Med. Chem. Lett., 15(11):2886-90, Jun 2005. [ bib \| DOI \| http \| .pdf ] HERG attracts attention as a risk factor for arrhythmia, which might trigger torsade de pointes. A highly accurate classifier of chemical compounds for inhibition of the HERG potassium channel is constructed using support vector machine. For two test sets, our discriminant models achieved 90% and 95% accuracy, respectively. The classifier is even applied for the prediction of cardio vascular adverse effects to achieve about 70% accuracy. While modest inhibitors are partly characterized by properties linked to global structure of a molecule including hydrophobicity and diameter, strong inhibitors are exclusively characterized by properties linked to substructures of a molecule. Keywords: biosvm chemoinformatics herg
[Thukral2005Prediction]	Sushil K Thukral, Paul J Nordone, Rong Hu, Leah Sullivan, Eric Galambos, Vincent D Fitzpatrick, Laura Healy, Michael B Bass, Mary E Cosenza, and Cynthia A Afshari. Prediction of nephrotoxicant action and identification of candidate toxicity-related biomarkers. Toxicol Pathol, 33(3):343-55, 2005. [ bib \| DOI \| http ] A vast majority of pharmacological compounds and their metabolites are excreted via the urine, and within the complex structure of the kidney, the proximal tubules are a main target site of nephrotoxic compounds. We used the model nephrotoxicants mercuric chloride, 2-bromoethylamine hydrobromide, hexachlorobutadiene, mitomycin, amphotericin, and puromycin to elucidate time- and dose-dependent global gene expression changes associated with proximal tubular toxicity. Male Sprague-Dawley rats were dosed via intraperitoneal injection once daily for mercuric chloride and amphotericin (up to 7 doses), while a single dose was given for all other compounds. Animals were exposed to 2 different doses of these compounds and kidney tissues were collected on day 1, 3, and 7 postdosing. Gene expression profiles were generated from kidney RNA using 17K rat cDNA dual dye microarray and analyzed in conjunction with histopathology. Analysis of gene expression profiles showed that the profiles clustered based on similarities in the severity and type of pathology of individual animals. Further, the expression changes were indicative of tubular toxicity showing hallmarks of tubular degeneration/regeneration and necrosis. Use of gene expression data in predicting the type of nephrotoxicity was then tested with a support vector machine (SVM)-based approach. A SVM prediction module was trained using 120 profiles of total profiles divided into four classes based on the severity of pathology and clustering. 
Although mitomycin C and amphotericin B treatments did not cause toxicity, their expression profiles were included in the SVM prediction module to increase the sample size. Using this classifier, the SVM predicted the type of pathology of 28 test profiles with 100% selectivity and 82% sensitivity. These data indicate that valid predictions could be made based on gene expression changes from a small set of expression profiles. A set of potential biomarkers showing a time- and dose-response with respect to the progression of proximal tubular toxicity were identified. These include several transporters (Slc21a2, Slc15, Slc34a2), Kim 1, IGFbp-1, osteopontin, alpha-fibrinogen, and Gstalpha. Keywords: Algorithms, Animals, Antibiotics, Antineoplastic, Artificial Intelligence, Butadienes, Chloroplasts, Comparative Study, Computer Simulation, Computer-Assisted, Diagnosis, Disinfectants, Dose-Response Relationship, Drug, Drug Toxicity, Electrodes, Electroencephalography, Ethylamines, Expert Systems, Feedback, Fungicides, Gene Expression Profiling, Genes, Genetic Markers, Humans, Implanted, Industrial, Information Storage and Retrieval, Kidney, Kidney Tubules, MEDLINE, Male, Mercuric Chloride, Microarray Analysis, Molecular Biology, Motor Cortex, Movement, Natural Language Processing, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Plant Proteins, Predictive Value of Tests, Proteins, Proteome, Proximal, Puromycin Aminonucleoside, Rats, Reproducibility of Results, Research Support, Sprague-Dawley, Subcellular Fractions, Terminology, Therapy, Time Factors, Toxicogenetics, U.S. Gov't, User-Computer Interface, 15805072
[Tang2005Discovering]	Thomas Tang, Jinbo Xu, and Ming Li. Discovering sequence-structure motifs from protein segments and two applications. Pac Symp Biocomput, pages 370-81, 2005. [ bib ] We present a novel method for clustering short protein segments having strong sequence-structure correlations, and demonstrate that these clusters contain useful structural information via two applications. When applied to local tertiary structure prediction, we achieve approximately 60% accuracy with a novel dynamic programming algorithm. When applied to secondary structure prediction based on Support Vector Machines, we obtain a approximately 2% gain in Q3 performance by incorporating cluster-derived data into training and classification. These encouraging results illustrate the great potential of using conserved local motifs to tackle protein structure predictions and possibly other important problems in biology. Keywords: biosvm
[Takeuchi2005Bio-medical]	Koichi Takeuchi and Nigel Collier. Bio-medical entity extraction using support vector machines. Artif. Intell. Med., 33(2):125-37, Feb 2005. [ bib \| DOI \| http \| .pdf ] OBJECTIVE: Support vector machines (SVMs) have achieved state-of-the-art performance in several classification tasks. In this article we apply them to the identification and semantic annotation of scientific and technical terminology in the domain of molecular biology. This illustrates the extensibility of the traditional named entity task to special domains with large-scale terminologies such as those in medicine and related disciplines. METHODS AND MATERIALS: The foundation for the model is a sample of text annotated by a domain expert according to an ontology of concepts, properties and relations. The model then learns to annotate unseen terms in new texts and contexts. The results can be used for a variety of intelligent language processing applications. We illustrate SVMs capabilities using a sample of 100 journal abstracts texts taken from the human, blood cell, transcription factor domain of MEDLINE. RESULTS: Approximately 3400 terms are annotated and the model performs at about 74% F-score on cross-validation tests. A detailed analysis based on empirical evidence shows the contribution of various feature sets to performance. CONCLUSION: Our experiments indicate a relationship between feature window size and the amount of training data and that a combination of surface words, orthographic features and head noun features achieve the best performance among the feature sets tested. Keywords: biosvm
[Swamidass2005Kernels]	S. J. Swamidass, J. Chen, J. Bruand, P. Phung, L. Ralaivola, and P. Baldi. Kernels for small molecules and the prediction of mutagenicity, toxicity and anti-cancer activity. Bioinformatics, 21(Suppl. 1):i359-i368, Jun 2005. [ bib \| DOI \| http \| .pdf ] MOTIVATION: Small molecules play a fundamental role in organic chemistry and biology. They can be used to probe biological systems and to discover new drugs and other useful compounds. As increasing numbers of large datasets of small molecules become available, it is necessary to develop computational methods that can deal with molecules of variable size and structure and predict their physical, chemical and biological properties. RESULTS: Here we develop several new classes of kernels for small molecules using their 1D, 2D and 3D representations. In 1D, we consider string kernels based on SMILES strings. In 2D, we introduce several similarity kernels based on conventional or generalized fingerprints. Generalized fingerprints are derived by counting in different ways subpaths contained in the graph of bonds, using depth-first searches. In 3D, we consider similarity measures between histograms of pairwise distances between atom classes. These kernels can be computed efficiently and are applied to problems of classification and prediction of mutagenicity, toxicity and anti-cancer activity on three publicly available datasets. The results derived using cross-validation methods are state-of-the-art. Tradeoffs between various kernels are briefly discussed. AVAILABILITY: Datasets available from http://www.igb.uci.edu/servers/servers.html CONTACT: pfbaldi@ics.uci.edu. Keywords: biosvm
[Statnikov2005comprehensive]	A. Statnikov, C. F. Aliferis, I. Tsamardinos, D. Hardin, and S. Levy. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 2005. To appear. [ bib \| http \| .pdf ] Motivation: Cancer diagnosis is one of the most important emerging clinical applications of gene expression microarray technology. We are seeking to develop a computer system for powerful and reliable cancer diagnostic model creation based on microarray data. To keep a realistic perspective on clinical applications we focus on multicategory diagnosis. In order to equip the system with the optimum combination of classifier, gene selection and cross-validation methods, we performed a systematic and comprehensive evaluation of several major algorithms for multicategory classification, several gene selection methods, multiple ensemble classifier methods, and two cross validation designs using 11 datasets spanning 74 diagnostic categories and 41 cancer types and 12 normal tissue types. Results: Multicategory Support Vector Machines (MC-SVMs) are the most effective classifiers in performing accurate cancer diagnosis from gene expression data. The MC-SVM techniques by Crammer and Singer, Weston and Watkins, and one-versus-rest were found to be the best methods in this domain. MC-SVMs outperform other popular machine learning algorithms such as K-Nearest Neighbors, Backpropagation and Probabilistic Neural Networks, often to a remarkable degree. Gene selection techniques can significantly improve classification performance of both MC-SVMs and other non-SVM learning algorithms. Ensemble classifiers do not generally improve performance of the best non-ensemble models. These results guided the construction of a software system GEMS (Gene Expression Model Selector) that automates high-quality model construction and enforces sound optimization and performance estimation procedures. 
This is the first such system to be informed by a rigorous comparative analysis of the available algorithms and datasets. Availability: The software system GEMS is available for download from http://www.gems-system.org for non-commercial use. Keywords: biosvm microarray
[Shi2005Building]	Lei Shi and Fabien Campagne. Building a protein name dictionary from full text: a machine learning term extraction approach. BMC Bioinformatics, 6(1):88, Apr 2005. [ bib \| DOI \| http \| .pdf ] BACKGROUND: The majority of information in the biological literature resides in full text articles, instead of abstracts. Yet, abstracts remain the focus of many publicly available literature data mining tools. Most literature mining tools rely on pre-existing lexicons of biological names, often extracted from curated gene or protein databases. This is a limitation, because such databases have low coverage of the many name variants which are used to refer to biological entities in the literature. RESULTS: We present an approach to recognize named entities in full text. The approach collects high frequency terms in an article, and uses support vector machines (SVM) to identify biological entity names. It is also computationally efficient and robust to noise commonly found in full text material. We use the method to create a protein name dictionary from a set of 80,528 full text articles. Only 8.3% of the names in this dictionary match SwissProt description lines. We assess the quality of the dictionary by studying its protein name recognition performance in full text. CONCLUSION: This dictionary term lookup method compares favourably to other published methods, supporting the significance of our direct extraction approach. The method is strong in recognizing name variants not found in SwissProt. Keywords: biosvm
[Sharan2005motif-based]	R. Sharan and E. W Myers. A motif-based framework for recognizing sequence families. Bioinformatics, 21 Suppl 1:i387-i393, Jun 2005. [ bib \| DOI \| http \| .pdf ] MOTIVATION: Many signals in biological sequences are based on the presence or absence of base signals and their spatial combinations. One of the best known examples of this is the signal identifying a core promoter, the site at which the basal transcription machinery starts the transcription of a gene. Our goal is a fully automatic pattern recognition system for a family of sequences, which simultaneously discovers the base signals, their spatial relationships and a classifier based upon them. RESULTS: In this paper we present a general method for characterizing a set of sequences by their recurrent motifs. Our approach relies on novel probabilistic models for DNA binding sites and modules of binding sites, on algorithms to study them from the data and on a support vector machine that uses the models studied to classify a set of sequences. We demonstrate the applicability of our approach to diverse instances, ranging from families of promoter sequences to a dataset of intronic sequences flanking alternatively spliced exons. On a core promoter dataset our results are comparable with the state-of-the-art McPromoter. On a dataset of alternatively spliced exons we outperform a previous approach. We also achieve high success rates in recognizing cell cycle regulated genes. These results demonstrate that a fully automatic pattern recognition algorithm can meet or exceed the performance of hand-crafted approaches. AVAILABILITY: The software and datasets are available from the authors upon request. CONTACT: roded@tau.ac.il. Keywords: biosvm
[Senawongse2005Predicting]	Pasak Senawongse, Andrew R Dalby, and Zheng Rong Yang. Predicting the phosphorylation sites using hidden markov models and machine learning methods. J Chem Inf Model, 45(4):1147-52, 2005. [ bib \| DOI \| http \| .pdf ] Accurately predicting phosphorylation sites in proteins is an important issue in postgenomics, for which how to efficiently extract the most predictive features from amino acid sequences for modeling is still challenging. Although both the distributed encoding method and the bio-basis function method work well, they still have some limits in use. The distributed encoding method is unable to code the biological content in sequences efficiently, whereas the bio-basis function method is a nonparametric method, which is often computationally expensive. As hidden Markov models (HMMs) can be used to generate one model for one cluster of aligned protein sequences, the aim in this study is to use HMMs to extract features from amino acid sequences, where sequence clusters are determined using available biological knowledge. In this novel method, HMMs are first constructed using functional sequences only. Both functional and nonfunctional training sequences are then inputted into the trained HMMs to generate functional and nonfunctional feature vectors. From this, a machine learning algorithm is used to construct a classifier based on these feature vectors. It is found in this work that (1) this method provides much better prediction accuracy than the use of HMMs only for prediction, and (2) the support vector machines (SVMs) algorithm outperforms decision trees and neural network algorithms when they are constructed on the features extracted using the trained HMMs. Keywords: biosvm
[Seike2005Proteomic]	M. Seike, T. Kondo, K. Fujii, T. Okano, T. Yamada, Y. Matsuno, A. Gemma, S. Kudoh, and S. Hirohashi. Proteomic signatures for histological types of lung cancer. Proteomics, Jul 2005. [ bib \| DOI \| http \| .pdf ] We performed proteomic studies on lung cancer cells to elucidate the mechanisms that determine histological phenotype. Thirty lung cancer cell lines with three different histological backgrounds (squamous cell carcinoma, small cell lung carcinoma and adenocarcinoma) were subjected to two-dimensional difference gel electrophoresis (2-D DIGE) and grouped by multivariate analyses on the basis of their protein expression profiles. 2-D DIGE achieves more accurate quantification of protein expression by using highly sensitive fluorescence dyes to label the cysteine residues of proteins prior to two-dimensional polyacrylamide gel electrophoresis. We found that hierarchical clustering analysis and principal component analysis divided the cell lines according to their original histology. Spot ranking analysis using a support vector machine algorithm and unsupervised classification methods identified 32 protein spots essential for the classification. The proteins corresponding to the spots were identified by mass spectrometry. Next, lung cancer cells isolated from tumor tissue by laser microdissection were classified on the basis of the expression pattern of these 32 protein spots. Based on the expression profile of the 32 spots, the isolated cancer cells were categorized into three histological groups: the squamous cell carcinoma group, the adenocarcinoma group, and a group of carcinomas with other histological types. In conclusion, our results demonstrate the utility of quantitative proteomic analysis for molecular diagnosis and classification of lung cancer cells. Keywords: biosvm proteomics
[Sarda2005pSLIP]	Deepak Sarda, Gek Huey Chua, Kuo-Bin Li, and Arun Krishnan. pSLIP: SVM based protein subcellular localization prediction using multiple physicochemical properties. BMC Bioinformatics, 6(1):152, Jun 2005. [ bib \| DOI \| http \| .pdf ] BACKGROUND: Protein subcellular localization is an important determinant of protein function and hence, reliable methods for prediction of localization are needed. A number of prediction algorithms have been developed based on amino acid compositions or on the N-terminal characteristics (signal peptides) of proteins. However, such approaches lead to a loss of contextual information. Moreover, where information about the physicochemical properties of amino acids has been used, the methods employed to exploit that information are less than optimal and could use the information more effectively. RESULTS: In this paper, we propose a new algorithm called pSLIP which uses Support Vector Machines (SVMs) in conjunction with multiple physicochemical properties of amino acids to predict protein subcellular localization in eukaryotes across six different locations, namely, chloroplast, cytoplasmic, extracellular, mitochondrial, nuclear and plasma membrane. The algorithm was applied to the dataset provided by Park and Kanehisa and we obtained prediction accuracies for the different classes ranging from 87.7%-97.0% with an overall accuracy of 93.1%. CONCLUSIONS: This study presents a physicochemical property based protein localization prediction algorithm. Unlike other algorithms, contextual information is preserved by dividing the protein sequences into clusters. The prediction accuracy shows an improvement over other algorithms based on various types of amino acid composition (single, pair and gapped pair). We have also implemented a web server to predict protein localization across the six classes (available at http://pslip.bii.a-star.edu.sg). Keywords: biosvm
[Saeh2005Lead]	J. Saeh, P. Lyne, B. Takasaki, and D. Cosgrove. Lead hopping using SVM and 3D pharmacophore fingerprints. J Chem Inf Model, 45(4):1122-1133, Jul 2005. [ bib \| DOI \| http \| .pdf ] The combination of 3D pharmacophore fingerprints and the support vector machine classification algorithm has been used to generate robust models that are able to classify compounds as active or inactive in a number of G-protein-coupled receptor assays. The models have been tested against progressively more challenging validation sets where steps are taken to ensure that compounds in the validation set are chemically and structurally distinct from the training set. In the most challenging example, we simulate a lead-hopping experiment by excluding an entire class of compounds (defined by a core substructure) from the training set. The left-out active compounds comprised approximately 40% of the actives. The model trained on the remaining compounds is able to recall 75% of the actives from the "new" lead series while correctly classifying >99% of the 5000 inactives included in the validation set. Keywords: biosvm chemoinformatics
[Raetsch2005RASE]	G. Rätsch, S. Sonnenburg, and B. Schölkopf. RASE: recognition of alternatively spliced exons in C.elegans. Bioinformatics, 21(Suppl. 1):i369-i377, Jun 2005. [ bib \| DOI \| http \| .pdf ] MOTIVATION: Eukaryotic pre-mRNAs are spliced to form mature mRNA. Pre-mRNA alternative splicing greatly increases the complexity of gene expression. Estimates show that more than half of the human genes and at least one-third of the genes of less complex organisms, such as nematodes or flies, are alternatively spliced. In this work, we consider one major form of alternative splicing, namely the exclusion of exons from the transcript. It has been shown that alternatively spliced exons have certain properties that distinguish them from constitutively spliced exons. Although most recent computational studies on alternative splicing apply only to exons which are conserved among two species, our method only uses information that is available to the splicing machinery, i.e. the DNA sequence itself. We employ advanced machine learning techniques in order to answer the following two questions: (1) Is a certain exon alternatively spliced? (2) How can we identify yet unidentified exons within known introns? RESULTS: We designed a support vector machine (SVM) kernel well suited for the task of classifying sequences with motifs having positional preferences. In order to solve the task (1), we combine the kernel with additional local sequence information, such as lengths of the exon and the flanking introns. The resulting SVM-based classifier achieves a true positive rate of 48.5% at a false positive rate of 1%. By scanning over single EST confirmed exons we identified 215 potential alternatively spliced exons. For 10 randomly selected such exons we successfully performed biological verification experiments and confirmed three novel alternatively spliced exons. To answer question (2), we additionally used SVM-based predictions to recognize acceptor and donor splice sites. 
Combined with the above mentioned features we were able to identify 85.2% of skipped exons within known introns at a false positive rate of 1%. AVAILABILITY: Datasets, model selection results, our predictions and additional experimental results are available at http://www.fml.tuebingen.mpg.de/ raetsch/RASE CONTACT: Gunnar.Raetsch@tuebingen.mpg.de SUPPLEMENTARY INFORMATION: http://www.fml.tuebingen.mpg.de/raetsch/RASE. Keywords: biosvm
[Ruepp2005Assessment]	S. Ruepp, F. Boess, L. Suter, M. C. de Vera, G. Steiner, T. Steele, T. Weiser, and S. Albertini. Assessment of hepatotoxic liabilities by transcript profiling. Toxicol Appl Pharmacol, Jun 2005. [ bib \| DOI \| http \| .pdf ] Male Wistar rats were treated with various model compounds or the appropriate vehicle controls in order to create a reference database for toxicogenomics assessment of novel compounds. Hepatotoxic compounds in the database were either known hepatotoxicants or showed hepatotoxicity during preclinical testing. Histopathology and clinical chemistry data were used to anchor the transcript profiles to an established endpoint (steatosis, cholestasis, direct acting, peroxisomal proliferation or nontoxic/control). These reference data were analyzed using a supervised learning method (support vector machines, SVM) to generate classification rules. This predictive model was subsequently used to assess compounds with regard to a potential hepatotoxic liability. A steatotic and a non-hepatotoxic 5HT(6) receptor antagonist compound from the same series were successfully discriminated by this toxicogenomics model. Additionally, an example is shown where a hepatotoxic liability was correctly recognized in the absence of pathological findings. In vitro experiments and a dog study confirmed the correctness of the toxicogenomics alert. Another interesting observation was that transcript profiles indicate toxicologically relevant changes at an earlier timepoint than routinely used methods. Together, these results support the useful application of toxicogenomics in raising alerts for adverse effects and generating mechanistic hypotheses that can be followed up by confirmatory experiments. Keywords: biosvm
[Rudd2005Eclair]	S. Rudd and I. V. Tetko. Eclair-a web service for unravelling species origin of sequences sampled from mixed host interfaces. Nucleic Acids Res, 33(Web Server issue):W724-7, Jul 2005. [ bib \| DOI \| http \| .pdf ] The identification of the genes that participate at the biological interface of two species remains critical to our understanding of the mechanisms of disease resistance, disease susceptibility and symbiosis. The sequencing of complementary DNA (cDNA) libraries prepared from the biological interface between two organisms provides an inexpensive way to identify the novel genes that may be expressed as a cause or consequence of compatible or incompatible interactions. Sequence classification and annotation of species origin typically use an orthology-based approach and require access to large portions of either genome, or a close relative. Novel species- or clade-specific sequences may have no counterpart within existing databases and remain ambiguous features. Here we present a web-service, Eclair, which utilizes support vector machines for the classification of the origin of expressed sequence tags stemming from mixed host cDNA libraries. In addition to providing an interface for the classification of sequences, users are presented with the opportunity to train a model to suit their preferred species pair. Eclair is freely available at http://eclair.btk.fi. Keywords: biosvm
[Rose2005Correlation]	J. R. Rose, W. H. Turkett, Jr., I. C. Oroian, W. W. Laegreid, and J. Keele. Correlation of amino acid preference and mammalian viral genome type. Bioinformatics, 2005. [ bib \| DOI \| http \| .pdf ] Motivation: In the event of an outbreak of a disease caused by an initially unknown pathogen, the ability to characterize anonymous sequences prior to isolation and culturing of the pathogen will be helpful. We show that it is possible to classify viral sequences by genome type (dsDNA, ssDNA, ssRNA positive strand, ssRNA negative strand, retroid) using amino acid distribution. Results: In this paper we describe the results of analysis of amino acid preference in mammalian viruses. The study was carried out at the genome level as well as two shorter sequence levels: short (300 amino acids) and medium length (660 amino acids). The analysis indicates a correlation between the viral genome types dsDNA, ssDNA, ssRNA positive strand, ssRNA negative strand, and retroid and amino acid preference. We investigated three different models of amino acid preference. The simplest amino acid preference model, 1-AAP, is a normalized description of the frequency of amino acids in genomes of a viral genome type. A slightly more complex model is the ordered pair amino acid preference model (2-AAP), which characterizes genomes of different viral genome types by the frequency of ordered pairs of amino acids. The most complex and accurate model is the ordered triple amino acid preference model (3-AAP), which is based on ordered triples of amino acids. The results demonstrate that mammalian viral genome types differ in their amino acid preference. Availability: The tools used to format and analyze data and supplementary material are available at http://www.cse.sc.edu/ rose/aminoPreference/index.html. Keywords: biosvm
[Rice2005Mining]	Simon B Rice, Goran Nenadic, and Benjamin J Stapley. Mining protein function from text using term-based support vector machines. BMC Bioinformatics, 6 Suppl 1:S22, 2005. [ bib \| DOI \| http \| .pdf ] BACKGROUND: Text mining has spurred huge interest in the domain of biology. The goal of the BioCreAtIvE exercise was to evaluate the performance of current text mining systems. We participated in Task 2, which addressed assigning Gene Ontology terms to human proteins and selecting relevant evidence from full-text documents. We approached it as a modified form of the document classification task. We used a supervised machine-learning approach (based on support vector machines) to assign protein function and select passages that support the assignments. As classification features, we used a protein's co-occurring terms that were automatically extracted from documents. RESULTS: The results evaluated by curators were modest, and quite variable for different problems: in many cases we have relatively good assignment of GO terms to proteins, but the selected supporting text was typically non-relevant (precision spanning from 3% to 50%). The method appears to work best when a substantial set of relevant documents is obtained, while it works poorly on single documents and/or short passages. The initial results suggest that our approach can also mine annotations from text even when an explicit statement relating a protein to a GO term is absent. CONCLUSION: A machine learning approach to mining protein function predictions from text can yield good performance only if sufficient training data is available, and significant amount of supporting data is used for prediction. The most promising results are for combined document retrieval and GO term assignment, which calls for the integration of methods developed in BioCreAtIvE Task 1 and Task 2. Keywords: biosvm
[Rice2005Reconstructing]	J.J. Rice, Y. Tu, and G. Stolovitzky. Reconstructing biological networks using conditional correlation analysis. Bioinformatics, 21(6):765-773, Mar 2005. [ bib \| DOI \| http ] MOTIVATION: One of the present challenges in biological research is the organization of the data originating from high-throughput technologies. One way in which this information can be organized is in the form of networks of influences, physical or statistical, between cellular components. We propose an experimental method for probing biological networks, analyzing the resulting data and reconstructing the network architecture. METHODS: We use networks of known topology consisting of nodes (genes), directed edges (gene-gene interactions) and a dynamics for the genes' mRNA concentrations in terms of the gene-gene interactions. We proposed a network reconstruction algorithm based on the conditional correlation of the mRNA equilibrium concentration between two genes given that one of them was knocked down. Using simulated gene expression data on networks of known connectivity, we investigated how the reconstruction error is affected by noise, network topology, size, sparseness and dynamic parameters. RESULTS: Errors arise from correlation between nodes connected through intermediate nodes (false positives) and when the correlation between two directly connected nodes is obscured by noise, non-linearity or multiple inputs to the target node (false negatives). Two critical components of the method are as follows: (1) the choice of an optimal correlation threshold for predicting connections and (2) the reduction of errors arising from indirect connections (for which a novel algorithm is proposed). With these improvements, we can reconstruct networks with the topology of the transcriptional regulatory network in Escherichia coli with a reasonably low error rate. 
Keywords: Algorithms; Computer Simulation; Gene Expression Profiling; Gene Expression Regulation; Models, Biological; Models, Statistical; Oligonucleotide Array Sequence Analysis; Protein Interaction Mapping; Signal Transduction; Statistics as Topic; Transcription Factors
[Rensing2005Protein]	Stefan A Rensing, Dana Fritzowsky, Daniel Lang, and Ralf Reski. Protein encoding genes in an ancient plant: analysis of codon usage, retained genes and splice sites in a moss, Physcomitrella patens. BMC Genomics, 6(1):43, Mar 2005. [ bib \| DOI \| http \| .pdf ] BACKGROUND: The moss Physcomitrella patens is an emerging plant model system due to its high rate of homologous recombination, haploidy, simple body plan, physiological properties as well as phylogenetic position. Available EST data was clustered and assembled, and provided the basis for a genome-wide analysis of protein encoding genes. RESULTS: We have clustered and assembled Physcomitrella patens EST and CDS data in order to represent the transcriptome of this non-seed plant. Clustering of the publicly available data and subsequent prediction resulted in a total of 19,081 non-redundant ORF. Of these putative transcripts, approximately 30% have a homolog in both rice and Arabidopsis transcriptome. More than 130 transcripts are not present in seed plants but can be found in other kingdoms. These potential "retained genes" might have been lost during seed plant evolution. Functional annotation of these genes reveals unequal distribution among taxonomic groups and intriguing putative functions such as cytotoxicity and nucleic acid repair. Whereas introns in the moss are larger on average than in the seed plant Arabidopsis thaliana, position and amount of introns are approximately the same. Contrary to Arabidopsis, where CDS contain on average 44% G/C, in Physcomitrella the average G/C content is 50%. Interestingly, moss orthologs of Arabidopsis genes show a significant drift of codon fraction usage, towards the seed plant. While averaged codon bias is the same in Physcomitrella and Arabidopsis, the distribution pattern is different, with 15% of moss genes being unbiased. 
Species-specific, sensitive and selective splice site prediction for Physcomitrella has been developed using a dataset of 368 donor and acceptor sites, utilizing a support vector machine. The prediction accuracy is better than those achieved with tools trained on Arabidopsis data. CONCLUSION: Analysis of the moss transcriptome displays differences in gene structure, codon and splice site usage in comparison with the seed plant Arabidopsis. Putative retained genes exhibit possible functions that might explain the peculiar physiological properties of mosses. Both the transcriptome representation (including a BLAST and retrieval service) and splice site prediction have been made available on http://www.cosmoss.org, setting the basis for assembly and annotation of the Physcomitrella genome, of which draft shotgun sequences will become available in 2005. Keywords: biosvm
[Rangwala2005Profile-based]	H. Rangwala and G. Karypis. Profile-based direct kernels for remote homology detection and fold recognition. Bioinformatics, 21(23):4239-4247, Dec 2005. [ bib \| DOI \| http ] MOTIVATION: Protein remote homology detection is a central problem in computational biology. Supervised learning algorithms based on support vector machines are currently one of the most effective methods for remote homology detection. The performance of these methods depends on how the protein sequences are modeled and on the method used to compute the kernel function between them. RESULTS: We introduce two classes of kernel functions that are constructed by combining sequence profiles with new and existing approaches for determining the similarity between pairs of protein sequences. These kernels are constructed directly from these explicit protein similarity measures and employ effective profile-to-profile scoring schemes for measuring the similarity between pairs of proteins. Experiments with remote homology detection and fold recognition problems show that these kernels are capable of producing results that are substantially better than those produced by all of the existing state-of-the-art SVM-based methods. In addition, the experiments show that these kernels, even when used in the absence of profiles, produce results that are better than those produced by existing non-profile-based schemes. AVAILABILITY: The programs for computing the various kernel functions are available on request from the authors. Keywords: biosvm
[Raghava2005Correlation]	Gajendra P S Raghava and Joon H Han. Correlation and prediction of gene expression level from amino acid and dipeptide composition of its protein. BMC Bioinformatics, 6(1):59, Mar 2005. [ bib \| DOI \| http ] BACKGROUND: A large number of papers have been published on analysis of microarray data with particular emphasis on normalization of data, detection of differentially expressed genes, clustering of genes and regulatory network. On other hand there are only few studies on relation between expression level and composition of nucleotide/protein sequence, using expression data. There is a need to understand why particular genes/proteins express more in particular conditions. In this study, we analyze 3468 genes of Saccharomyces cerevisiae obtained from Holstege et al., (1998) to understand the relationship between expression level and amino acid composition. RESULTS: We compute the correlation between expression of a gene and amino acid composition of its protein. It was observed that some residues (like Ala, Gly, Arg and Val) have significant positive correlation (r > 0.20) and some other residues (Like Asp, Leu, Asn and Ser) have negative correlation (r < -0.15) with the expression of genes. A significant negative correlation (r = -0.18) was also found between length and gene expression. These observations indicate the relationship between percent composition and gene expression level. Thus, attempts have been made to develop a Support Vector Machine (SVM) based method for predicting the expression level of genes from its protein sequence. In this method the SVM is trained with proteins whose gene expression data is known in a given condition. Then trained SVM is used to predict the gene expression of other proteins of the same organism in the same condition. 
A correlation coefficient r = 0.70 was obtained between predicted and experimentally determined expression of genes, which improves from r = 0.70 to 0.72 when dipeptide composition was used instead of residue composition. The method was evaluated using 5-fold cross validation test. We also demonstrate that amino acid composition information along with gene expression data can be used for improving the function classification of proteins. CONCLUSION: There is a correlation between gene expression and amino acid composition that can be used to predict the expression level of genes up to a certain extent. A web server based on the above strategy has been developed for calculating the correlation between amino acid composition and gene expression and prediction of expression level http://kiwi.postech.ac.kr/raghava/lgepred/. This server will allow users to study the evolution from expression data. Keywords: biosvm
[Prill2005PlosBiol]	Robert J Prill, Pablo A Iglesias, and Andre Levchenko. Dynamic properties of network motifs contribute to biological network organization. PLoS Biol, 3(11):e343, Nov 2005. [ bib \| DOI \| http ] Biological networks, such as those describing gene regulation, signal transduction, and neural synapses, are representations of large-scale dynamic systems. Discovery of organizing principles of biological networks can be enhanced by embracing the notion that there is a deep interplay between network structure and system dynamics. Recently, many structural characteristics of these non-random networks have been identified, but dynamical implications of the features have not been explored comprehensively. We demonstrate by exhaustive computational analysis that a dynamical property (stability or robustness to small perturbations) is highly correlated with the relative abundance of small subnetworks (network motifs) in several previously determined biological networks. We propose that robust dynamical stability is an influential property that can determine the non-random structure of biological networks. Keywords: Animals; Caenorhabditis elegans, physiology; Computational Biology, methods; Computer Simulation; Drosophila melanogaster, physiology; Escherichia coli, physiology; Models, Biological; Nerve Net; Saccharomyces cerevisiae, physiology; Signal Transduction; Statistics as Topic; Systems Theory; Transcription, Genetic
[Plewczyski2005support]	Dariusz Plewczynski, Adrian Tkacz, Adam Godzik, and Leszek Rychlewski. A support vector machine approach to the identification of phosphorylation sites. Cell Mol Biol Lett, 10(1):73-89, 2005. [ bib \| .pdf ] We describe a bioinformatics tool that can be used to predict the position of phosphorylation sites in proteins based only on sequence information. The method uses the support vector machine (SVM) statistical learning theory. The statistical models for phosphorylation by various types of kinases are built using a dataset of short (9-amino acid long) sequence fragments. The sequence segments are dissected around post-translationally modified sites of proteins that are on the current release of the Swiss-Prot database, and that were experimentally confirmed to be phosphorylated by any kinase. We represent them as vectors in a multidimensional abstract space of short sequence fragments. The prediction method is as follows. First, a given query protein sequence is dissected into overlapping short segments. All the fragments are then projected into the multidimensional space of sequence fragments via a collection of different representations. Those points are classified with pre-built statistical models (the SVM method with linear, polynomial and radial kernel functions) either as phosphorylated or inactive ones. The resulting list of plausible sites for phosphorylation by various types of kinases in the query protein is returned to the user. The efficiency of the method for each type of phosphorylation is estimated using leave-one-out tests and presented here. The sensitivities of the models can reach over 70%, depending on the type of kinase. The additional information from profile representations of short sequence fragments helps in gaining a higher degree of accuracy in some phosphorylation types. 
The further development of an automatic phosphorylation site annotation predictor based on our algorithm should yield a significant improvement when using statistical algorithms in order to quantify the results. Keywords: biosvm
[Pham2005Support]	Tho Hoan Pham, Kenji Satou, and Tu Bao Ho. Support vector machines for prediction and analysis of beta and gamma-turns in proteins. J. Bioinform. Comput. Biol., 3(2):343-58, Apr 2005. [ bib ] Tight turns have long been recognized as one of the three important features of proteins, together with alpha-helix and beta-sheet. Tight turns play an important role in globular proteins from both the structural and functional points of view. More than 90% tight turns are beta-turns and most of the rest are gamma-turns. Analysis and prediction of beta-turns and gamma-turns is very useful for design of new molecules such as drugs, pesticides, and antigens. In this paper we investigated two aspects of applying support vector machine (SVM), a promising machine learning method for bioinformatics, to prediction and analysis of beta-turns and gamma-turns. First, we developed two SVM-based methods, called BTSVM and GTSVM, which predict beta-turns and gamma-turns in a protein from its sequence. When compared with other methods, BTSVM has a superior performance and GTSVM is competitive. Second, we used SVMs with a linear kernel to estimate the support of amino acids for the formation of beta-turns and gamma-turns depending on their position in a protein. Our analysis results are more comprehensive and easier to use than the previous results in designing turns in proteins. Keywords: biosvm
[Peters2005Generating]	Bjoern Peters and Alessandro Sette. Generating quantitative models describing the sequence specificity of biological processes with the stabilized matrix method. BMC Bioinformatics, 6:132, 2005. [ bib \| DOI \| http ] BACKGROUND: Many processes in molecular biology involve the recognition of short sequences of nucleic-or amino acids, such as the binding of immunogenic peptides to major histocompatibility complex (MHC) molecules. From experimental data, a model of the sequence specificity of these processes can be constructed, such as a sequence motif, a scoring matrix or an artificial neural network. The purpose of these models is two-fold. First, they can provide a summary of experimental results, allowing for a deeper understanding of the mechanisms involved in sequence recognition. Second, such models can be used to predict the experimental outcome for yet untested sequences. In the past we reported the development of a method to generate such models called the Stabilized Matrix Method (SMM). This method has been successfully applied to predicting peptide binding to MHC molecules, peptide transport by the transporter associated with antigen presentation (TAP) and proteasomal cleavage of protein sequences. RESULTS: Herein we report the implementation of the SMM algorithm as a publicly available software package. Specific features determining the type of problems the method is most appropriate for are discussed. Advantageous features of the package are: (1) the output generated is easy to interpret, (2) input and output are both quantitative, (3) specific computational strategies to handle experimental noise are built in, (4) the algorithm is designed to effectively handle bounded experimental data, (5) experimental data from randomized peptide libraries and conventional peptides can easily be combined, and (6) it is possible to incorporate pair interactions between positions of a sequence. 
CONCLUSION: Making the SMM method publicly available enables bioinformaticians and experimental biologists to easily access it, to compare its performance to other prediction methods, and to extend it to other applications. Keywords: Algorithms; Amino Acid Sequence; Biology; Computational Biology; Computer Simulation; Data Interpretation, Statistical; Databases, Protein; Models, Biological; Models, Statistical; Neural Networks (Computer); Peptide Library; Peptides; Programming Languages; Protein Binding; Sensitivity and Specificity; Software
[Pahikkala2005Contextual]	Tapio Pahikkala, Filip Ginter, Jorma Boberg, Jouni Jarvinen, and Tapio Salakoski. Contextual weighting for Support Vector Machines in literature mining: an application to gene versus protein name disambiguation. BMC Bioinformatics, 6(1):157, Jun 2005. [ bib \| DOI \| http \| .pdf ] BACKGROUND: The ability to distinguish between genes and proteins is essential for understanding biological text. Support Vector Machines (SVMs) have been proven to be very efficient in general data mining tasks. We explore their capability for the gene versus protein name disambiguation task. RESULTS: We incorporated into the conventional SVM a weighting scheme based on distances of context words from the word to be disambiguated. This weighting scheme increased the performance of SVMs by five percentage points giving performance better than 85% as measured by the area under ROC curve and outperformed the Weighted Additive Classifier, which also incorporates the weighting, and the Naive Bayes classifier. CONCLUSIONS: We show that the performance of SVMs can be improved by the proposed weighting scheme. Furthermore, our results suggest that in this study the increase of the classification performance due to the weighting is greater than that obtained by selecting the underlying classifier or the kernel part of the SVM. Keywords: biosvm
[ODonnell2005Gene]	Rebekah K O'Donnell, Michael Kupferman, S. Jack Wei, Sunil Singhal, Randal Weber, Bert O'Malley, Yi Cheng, Mary Putt, Michael Feldman, Barry Ziober, and Ruth J Muschel. Gene expression signature predicts lymphatic metastasis in squamous cell carcinoma of the oral cavity. Oncogene, 24(7):1244-51, Feb 2005. [ bib \| DOI \| http \| .pdf ] Metastasis via the lymphatics is a major risk factor in squamous cell carcinoma of the oral cavity (OSCC). We sought to determine whether the presence of metastasis in the regional lymph node could be predicted by a gene expression signature of the primary tumor. A total of 18 OSCCs were characterized for gene expression by hybridizing RNA to Affymetrix U133A gene chips. Genes with differential expression were identified using a permutation technique and verified by quantitative RT-PCR and immunohistochemistry. A predictive rule was built using a support vector machine, and the accuracy of the rule was evaluated using crossvalidation on the original data set and prediction of an independent set of four patients. Metastatic primary tumors could be differentiated from nonmetastatic primary tumors by a signature gene set of 116 genes. This signature gene set correctly predicted the four independent patients as well as associating five lymph node metastases from the original patient set with the metastatic primary tumor group. We concluded that lymph node metastasis could be predicted by gene expression profiles of primary oral cavity squamous cell carcinomas. The presence of a gene expression signature for lymph node metastasis indicates that clinical testing to assess risk for lymph node metastasis should be possible. Keywords: biosvm microarray
[Nguyen2005Two-stage]	M. N. Nguyen and J. C. Rajapakse. Two-stage multi-class support vector machines to protein secondary structure prediction. Pac Symp Biocomput, pages 346-57, 2005. [ bib ] Bioinformatics techniques to protein secondary structure (PSS) prediction are mostly single-stage approaches in the sense that they predict secondary structures of proteins by taking into account only the contextual information in amino acid sequences. In this paper, we propose two-stage Multi-class Support Vector Machine (MSVM) approach where a MSVM predictor is introduced to the output of the first stage MSVM to capture the sequential relationship among secondary structure elements for the prediction. By using position specific scoring matrices, generated by PSI-BLAST, the two-stage MSVM approach achieves Q3 accuracies of 78.0% and 76.3% on the RS126 dataset of 126 nonhomologous globular proteins and the CB396 dataset of 396 nonhomologous proteins, respectively, which are better than the highest scores published on both datasets to date. Keywords: biosvm
[Nguyen2005Prediction]	Minh N Nguyen and Jagath C Rajapakse. Prediction of protein relative solvent accessibility with a two-stage SVM approach. Proteins, 59(1):30-7, Apr 2005. [ bib \| DOI \| http \| .pdf ] Information on relative solvent accessibility (RSA) of amino acid residues in proteins provides valuable clues to the prediction of protein structure and function. A two-stage approach with support vector machines (SVMs) is proposed, where an SVM predictor is introduced to the output of the single-stage SVM approach to take into account the contextual relationships among solvent accessibilities for the prediction. By using the position-specific scoring matrices (PSSMs) generated by PSI-BLAST, the two-stage SVM approach achieves accuracies up to 90.4% and 90.2% on the Manesh data set of 215 protein structures and the RS126 data set of 126 nonhomologous globular proteins, respectively, which are better than the highest published scores on both data sets to date. A Web server for protein RSA prediction using a two-stage SVM method has been developed and is available (http://birc.ntu.edu.sg/~pas0186457/rsa.html). Keywords: biosvm
[Nair2005Mimicking]	Rajesh Nair and Burkhard Rost. Mimicking cellular sorting improves prediction of subcellular localization. J Mol Biol, 348(1):85-100, Apr 2005. [ bib \| DOI \| http \| .pdf ] Predicting the native subcellular compartment of a protein is an important step toward elucidating its function. Here we introduce LOCtree, a hierarchical system combining support vector machines (SVMs) and other prediction methods. LOCtree predicts the subcellular compartment of a protein by mimicking the mechanism of cellular sorting and exploiting a variety of sequence and predicted structural features in its input. Currently LOCtree does not predict localization for membrane proteins, since the compositional properties of membrane proteins significantly differ from those of non-membrane proteins. While any information about function can be used by the system, we present estimates of performance that are valid when only the amino acid sequence of a protein is known. When evaluated on a non-redundant test set, LOCtree achieved sustained levels of 74% accuracy for non-plant eukaryotes, 70% for plants, and 84% for prokaryotes. We rigorously benchmarked LOCtree in comparison to the best alternative methods for localization prediction. LOCtree outperformed all other methods in nearly all benchmarks. Localization assignments using LOCtree agreed quite well with data from recent large-scale experiments. Our preliminary analysis of a few entirely sequenced organisms, namely human (Homo sapiens), yeast (Saccharomyces cerevisiae), and weed (Arabidopsis thaliana) suggested that over 35% of all non-membrane proteins are nuclear, about 20% are retained in the cytosol, and that every fifth protein in the weed resides in the chloroplast. Keywords: biosvm
[Nabieva2005Whole-proteome]	Elena Nabieva, Kam Jim, Amit Agarwal, Bernard Chazelle, and Mona Singh. Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics, 21 Suppl 1:i302-i310, Jun 2005. [ bib \| DOI \| http ] MOTIVATION: Determining protein function is one of the most important problems in the post-genomic era. For the typical proteome, there are no functional annotations for one-third or more of its proteins. Recent high-throughput experiments have determined proteome-scale protein physical interaction maps for several organisms. These physical interactions are complemented by an abundance of data about other types of functional relationships between proteins, including genetic interactions, knowledge about co-expression and shared evolutionary history. Taken together, these pairwise linkages can be used to build whole-proteome protein interaction maps. RESULTS: We develop a network-flow based algorithm, FunctionalFlow, that exploits the underlying structure of protein interaction maps in order to predict protein function. In cross-validation testing on the yeast proteome, we show that FunctionalFlow has improved performance over previous methods in predicting the function of proteins with few (or no) annotated protein neighbors. By comparing several methods that use protein interaction maps to predict protein function, we demonstrate that FunctionalFlow performs well because it takes advantage of both network topology and some measure of locality. Finally, we show that performance can be improved substantially as we consider multiple data sources and use them to create weighted interaction networks. AVAILABILITY: http://compbio.cs.princeton.edu/function Keywords: Algorithms; Computational Biology, methods; Evolution, Molecular; Fungal Proteins, chemistry; Genomics; Models, Statistical; Models, Theoretical; Protein Interaction Mapping, methods; Proteins, chemistry; Proteomics, methods
[Mueller2005Classifying]	K.-R. Müller, G. Rätsch, S. Sonnenburg, S. Mika, M. Grimm, and N. Heinrich. Classifying 'drug-likeness' with Kernel-based learning methods. J Chem Inf Model, 45(2):249-53, 2005. [ bib \| DOI \| http \| .pdf ] In this article we report about a successful application of modern machine learning technology, namely Support Vector Machines, to the problem of assessing the 'drug-likeness' of a chemical from a given set of descriptors of the substance. We were able to drastically improve the recent result by Byvatov et al. (2003) on this task and achieved an error rate of about 7% on unseen compounds using Support Vector Machines. We see a very high potential of such machine learning techniques for a variety of computational chemistry problems that occur in the drug discovery and drug design process. Keywords: biosvm chemoinformatics
[Mitsumori2005Gene]	Tomohiro Mitsumori, Sevrani Fation, Masaki Murata, Kouichi Doi, and Hirohumi Doi. Gene/protein name recognition based on support vector machine using dictionary as features. BMC Bioinformatics, 6 Suppl 1:S8, 2005. [ bib \| DOI \| http \| .pdf ] BACKGROUND: Automated information extraction from biomedical literature is important because a vast amount of biomedical literature has been published. Recognition of the biomedical named entities is the first step in information extraction. We developed an automated recognition system based on the SVM algorithm and evaluated it in Task 1.A of BioCreAtIvE, a competition for automated gene/protein name recognition. RESULTS: In the work presented here, our recognition system uses the feature set of the word, the part-of-speech (POS), the orthography, the prefix, the suffix, and the preceding class. We call these features "internal resource features", i.e., features that can be found in the training data. Additionally, we consider the features of matching against dictionaries to be external resource features. We investigated and evaluated the effect of these features as well as the effect of tuning the parameters of the SVM algorithm. We found that the dictionary matching features contributed slightly to the improvement in the performance of the f-score. We attribute this to the possibility that the dictionary matching features might overlap with other features in the current multiple feature setting. CONCLUSION: During SVM learning, each feature alone had a marginally positive effect on system performance. This supports the fact that the SVM algorithm is robust on the high dimensionality of the feature vector space and means that feature selection is not required. Keywords: biosvm nlp
[Mavroforakis2005Significance]	Michael Mavroforakis, Harris Georgiou, Nikos Dimitropoulos, Dionisis Cavouras, and Sergios Theodoridis. Significance analysis of qualitative mammographic features, using linear classifiers, neural networks and support vector machines. Eur J Radiol, 54(1):80-9, Apr 2005. [ bib \| DOI \| http \| .pdf ] Advances in modern technologies and computers have enabled digital image processing to become a vital tool in conventional clinical practice, including mammography. However, the core problem of the clinical evaluation of mammographic tumors remains a highly demanding cognitive task. In order for these automated diagnostic systems to perform in levels of sensitivity and specificity similar to that of human experts, it is essential that a robust framework on problem-specific design parameters is formulated. This study is focused on identifying a robust set of clinical features that can be used as the base for designing the input of any computer-aided diagnosis system for automatic mammographic tumor evaluation. A thorough list of clinical features was constructed and the diagnostic value of each feature was verified against current clinical practices by an expert physician. These features were directly or indirectly related to the overall morphological properties of the mammographic tumor or the texture of the fine-scale tissue structures as they appear in the digitized image, while others contained external clinical data of outmost importance, like the patient's age. The entire feature set was used as an annotation list for describing the clinical properties of mammographic tumor cases in a quantitative way, such that subsequent objective analyses were possible. For the purposes of this study, a mammographic image database was created, with complete clinical evaluation descriptions and positive histological verification for each case. 
All tumors contained in the database were characterized according to the identified clinical features' set and the resulting dataset was used as input for discrimination and diagnostic value analysis for each one of these features. Specifically, several standard methodologies of statistical significance analysis were employed to create feature rankings according to their discriminating power. Moreover, three different classification models, namely linear classifiers, neural networks and support vector machines, were employed to investigate the true efficiency of each one of them, as well as the overall complexity of the diagnostic task of mammographic tumor characterization. Both the statistical and the classification results have proven the explicit correlation of all the selected features with the final diagnosis, qualifying them as an adequate input base for any type of similar automated diagnosis system. The underlying complexity of the diagnostic task has justified the high value of sophisticated pattern recognition architectures. Keywords: Algorithms, Animals, Antibiotics, Antineoplastic, Artificial Intelligence, Butadienes, Chloroplasts, Comparative Study, Computer Simulation, Computer-Assisted, Diagnosis, Disinfectants, Dose-Response Relationship, Drug, Drug Toxicity, Electrodes, Electroencephalography, Ethylamines, Expert Systems, Feedback, Fungicides, Gene Expression Profiling, Genes, Genetic Markers, Humans, Implanted, Industrial, Information Storage and Retrieval, Kidney, Kidney Tubules, MEDLINE, Male, Mercuric Chloride, Microarray Analysis, Molecular Biology, Motor Cortex, Movement, Natural Language Processing, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Plant Proteins, Predictive Value of Tests, Proteins, Proteome, Proximal, Puromycin Aminonucleoside, Rats, Reproducibility of Results, Research Support, Sprague-Dawley, Subcellular Fractions, Terminology, Therapy, Time Factors, Toxicogenetics, U.S. Gov't, User-Computer Interface, 15797296
[Matsuda2005novel]	A. Matsuda, J.-P. Vert, H. Saigo, N. Ueda, H. Toh, and T. Akutsu. A novel representation of protein sequences for prediction of subcellular location using support vector machines. Protein Sci., 14(11):2804-2813, 2005. [ bib \| DOI \| http ] As the number of complete genomes rapidly increases, accurate methods to automatically predict the subcellular location of proteins are increasingly useful to help their functional annotation. In order to improve the predictive accuracy of the many prediction methods developed to date, a novel representation of protein sequences is proposed. This representation involves local compositions of amino acids and twin amino acids, and local frequencies of distance between successive (basic, hydrophobic, and other) amino acids. For calculating the local features, each sequence is split into three parts: N-terminal, middle, and C-terminal. The N-terminal part is further divided into four regions to consider ambiguity in the length and position of signal sequences. We tested this representation with support vector machines on two data sets extracted from the SWISS-PROT database. Through fivefold cross-validation tests, overall accuracies of more than 87% [...] proteins, respectively. It is concluded that considering the respective features in the N-terminal, middle, and C-terminal parts is helpful to predict the subcellular location. Keywords: biosvm
[Martin2005Predicting]	S. Martin, D. Roe, and J.-L. Faulon. Predicting protein-protein interactions using signature products. Bioinformatics, 21(2):218-226, Jan 2005. [ bib \| DOI \| http \| .pdf ] Motivation: Proteome-wide prediction of protein-protein interaction is a difficult and important problem in biology. Although there have been recent advances in both experimental and computational methods for predicting protein-protein interactions, we are only beginning to see a confluence of these techniques. In this paper, we describe a very general, high-throughput method for predicting protein-protein interactions. Our method combines a sequence-based description of proteins with experimental information that can be gathered from any type of protein-protein interaction screen. The method uses a novel description of interacting proteins by extending the signature descriptor, which has demonstrated success in predicting peptide/protein binding interactions for individual proteins. This descriptor is extended to protein pairs by taking signature products. The signature product is implemented within a support vector machine classifier as a kernel function. Results: We have applied our method to publicly available yeast, Helicobacter pylori, human and mouse datasets. We used the yeast and H. pylori datasets to verify the predictive ability of our method, achieving from 70 to 80% [...] human and mouse datasets to demonstrate that our method is capable of cross-species prediction. Finally, we reused the yeast dataset to explore the ability of our algorithm to predict domains. Contact: smartin@sandia.gov. Keywords: biosvm
[Mao2005Multiclass]	Yong Mao, Xiaobo Zhou, Daoying Pi, Youxian Sun, and Stephen T C Wong. Multiclass cancer classification by using fuzzy support vector machine and binary decision tree with gene selection. J Biomed Biotechnol, 2005(2):160-71, 2005. [ bib \| DOI \| http \| .pdf ] We investigate the problems of multiclass cancer classification with gene selection from gene expression data. Two different constructed multiclass classifiers with gene selection are proposed, which are fuzzy support vector machine (FSVM) with gene selection and binary classification tree based on SVM with gene selection. Using F test and recursive feature elimination based on SVM as gene selection methods, binary classification tree based on SVM with F test, binary classification tree based on SVM with recursive feature elimination based on SVM, and FSVM with recursive feature elimination based on SVM are tested in our experiments. To accelerate computation, preselecting the strongest genes is also used. The proposed techniques are applied to analyze breast cancer data, small round blue-cell tumors, and acute leukemia data. Compared to existing multiclass cancer classifiers and binary classification tree based on SVM with F test or binary classification tree based on SVM with recursive feature elimination based on SVM mentioned in this paper, FSVM based on recursive feature elimination based on SVM can find most important genes that affect certain types of cancer with high recognition accuracy. Keywords: biosvm
[Mahe2005Graph]	P. Mahé, N. Ueda, T. Akutsu, J.-L. Perret, and J.-P. Vert. Graph kernels for molecular structure-activity relationship analysis with support vector machines. J. Chem. Inf. Model., 45(4):939-51, 2005. [ bib \| DOI \| http \| .pdf ] The support vector machine algorithm together with graph kernel functions has recently been introduced to model structure-activity relationships (SAR) of molecules from their 2D structure, without the need for explicit molecular descriptor computation. We propose two extensions to this approach with the double goal to reduce the computational burden associated with the model and to enhance its predictive accuracy: description of the molecules by a Morgan index process and definition of a second-order Markov model for random walks on 2D structures. Experiments on two mutagenicity data sets validate the proposed extensions, making this approach a possible complementary alternative to other modeling strategies. Keywords: biosvm chemoinformatics
[Luan2005Classification]	Feng Luan, Ruisheng Zhang, Chunyan Zhao, Xiaojun Yao, Mancang Liu, Zhide Hu, and Botao Fan. Classification of the carcinogenicity of N-nitroso compounds based on support vector machines and linear discriminant analysis. Chem Res Toxicol, 18(2):198-203, Feb 2005. [ bib \| DOI \| http \| .pdf ] The support vector machine (SVM), as a novel type of learning machine, was used to develop a classification model of carcinogenic properties of 148 N-nitroso compounds. The seven descriptors calculated solely from the molecular structures of compounds selected by forward stepwise linear discriminant analysis (LDA) were used as inputs of the SVM model. The obtained results confirmed the discriminative capacity of the calculated descriptors. The result of SVM (total accuracy of 95.2%) is better than that of LDA (total accuracy of 89.8%). Keywords: biosvm
[Lo2005Effect]	Siaw Ling Lo, Cong Zhong Cai, Yu Zong Chen, and Maxey C M Chung. Effect of training datasets on support vector machine prediction of protein-protein interactions. Proteomics, 5(4):876-84, Mar 2005. [ bib \| DOI \| http \| .pdf ] Knowledge of protein-protein interaction is useful for elucidating protein function via the concept of 'guilt-by-association'. A statistical learning method, Support Vector Machine (SVM), has recently been explored for the prediction of protein-protein interactions using artificial shuffled sequences as hypothetical noninteracting proteins and it has shown promising results (Bock, J. R., Gough, D. A., Bioinformatics 2001, 17, 455-460). It remains unclear however, how the prediction accuracy is affected if real protein sequences are used to represent noninteracting proteins. In this work, this effect is assessed by comparison of the results derived from the use of real protein sequences with that derived from the use of shuffled sequences. The real protein sequences of hypothetical noninteracting proteins are generated from an exclusion analysis in combination with subcellular localization information of interacting proteins found in the Database of Interacting Proteins. Prediction accuracy using real protein sequences is 76.9% compared to 94.1% using artificial shuffled sequences. The discrepancy likely arises from the expected higher level of difficulty for separating two sets of real protein sequences than that for separating a set of real protein sequences from a set of artificial sequences. The use of real protein sequences for training a SVM classification system is expected to give better prediction results in practical cases. This is tested by using both SVM systems for predicting putative protein partners of a set of thioredoxin related proteins. 
The prediction results are consistent with observations, suggesting that real sequence is more practically useful in development of SVM classification system for facilitating protein-protein interaction prediction. Keywords: biosvm
[Liu2005Gene]	Zhenqiu Liu, Dechang Chen, and Halima Bensmail. Gene expression data classification with kernel principal component analysis. J Biomed Biotechnol, 2005(2):155-9, 2005. [ bib \| DOI \| http \| .pdf ] One important feature of the gene expression data is that the number of genes M far exceeds the number of samples N. Standard statistical methods do not work well when N < M. Development of new methodologies or modification of existing methodologies is needed for the analysis of the microarray data. In this paper, we propose a novel analysis procedure for classifying the gene expression data. This procedure involves dimension reduction using kernel principal component analysis (KPCA) and classification with logistic regression (discrimination). KPCA is a generalization and nonlinear version of principal component analysis. The proposed algorithm was applied to five different gene expression datasets involving human tumor samples. Comparison with other popular classification methods such as support vector machines and neural networks shows that our algorithm is very promising in classifying gene expression data. Keywords: biosvm
[Liu2005Multiclass]	Jane Jijun Liu, Gene Cutler, Wuxiong Li, Zheng Pan, Sihua Peng, Tim Hoey, Liangbiao Chen, and Xuefeng Bruce Ling. Multiclass cancer classification and biomarker discovery using GA-based algorithms. Bioinformatics, 21(11):2691-7, Jun 2005. [ bib \| DOI \| http \| .pdf ] MOTIVATION: The development of microarray-based high-throughput gene profiling has led to the hope that this technology could provide an efficient and accurate means of diagnosing and classifying tumors, as well as predicting prognoses and effective treatments. However, the large amount of data generated by microarrays requires effective reduction of discriminant gene features into reliable sets of tumor biomarkers for such multiclass tumor discrimination. The availability of reliable sets of biomarkers, especially serum biomarkers, should have a major impact on our understanding and treatment of cancer. RESULTS: We have combined genetic algorithm (GA) and all paired (AP) support vector machine (SVM) methods for multiclass cancer categorization. Predictive features can be automatically determined through iterative GA/SVM, leading to very compact sets of non-redundant cancer-relevant genes with the best classification performance reported to date. Interestingly, these different classifier sets harbor only modest overlapping gene features but have similar levels of accuracy in leave-one-out cross-validations (LOOCV). Further characterization of these optimal tumor discriminant features, including the use of nearest shrunken centroids (NSC), analysis of annotations and literature text mining, reveals previously unappreciated tumor subclasses and a series of genes that could be used as cancer biomarkers. With this approach, we believe that microarray-based multiclass molecular analysis can be an effective tool for cancer biomarker discovery and subsequent molecular cancer diagnosis. Keywords: biosvm
[Liu2005Use]	Huiqing Liu, Jinyan Li, and Limsoon Wong. Use of extreme patient samples for outcome prediction from gene expression data. Bioinformatics, Jun 2005. [ bib \| DOI \| http \| .pdf ] MOTIVATION: Patient outcome prediction using microarray technologies is an important application in bioinformatics. Based on patients' genotypic microarray data, predictions are made to estimate patients' survival time and their risk of tumor metastasis or recurrence. So, accurate prediction can potentially help to provide better treatment for patients. RESULTS: We present a new computational method for patient outcome prediction. In the training phase of this method, we make use of two types of extreme patient samples: short-term survivors who got an unfavorable outcome within a short period and long-term survivors who were maintaining a favorable outcome after a long follow-up time. These extreme training samples yield a clear platform for us to identify relevant genes whose expression is closely related to the outcome. The selected extreme samples and the relevant genes are then integrated by a support vector machine to build a prediction model, by which each validation sample is assigned a risk score that falls into one of special pre-defined risk groups. We apply this method to several public data sets. In most cases, patients in high and low risk groups stratified by our method have clearly distinguishable outcome status as seen in their Kaplan-Meier curves. We also show that the idea of selecting only extreme patient samples for training is effective for improving the prediction accuracy when different gene selection methods are used. SUPPLEMENTARY INFORMATION: http://research.i2r.a-star.edu.sg/huiqing/supplementaldata/survival/survival.html. Keywords: biosvm
[Li2005robust]	L. Li, W. Jiang, X. Li, K.L. Moser, Z. Guo, L. Du, Q. Wang, E.J. Topol, Q. Wang, and S. Rao. A robust hybrid between genetic algorithm and support vector machine for extracting an optimal feature gene subset. Genomics, 85(1):16-23, 2005. [ bib \| DOI \| http \| .pdf ] Development of a robust and efficient approach for extracting useful information from microarray data continues to be a significant and challenging task. Microarray data are characterized by a high dimension, high signal-to-noise ratio, and high correlations between genes, but with a relatively small sample size. Current methods for dimensional reduction can further be improved for the scenario of the presence of a single (or a few) high influential gene(s) in which its effect in the feature subset would prohibit inclusion of other important genes. We have formalized a robust gene selection approach based on a hybrid between genetic algorithm and support vector machine. The major goal of this hybridization was to exploit fully their respective merits (e.g., robustness to the size of solution space and capability of handling a very large dimension of feature genes) for identification of key feature genes (or molecular signatures) for a complex biological phenotype. We have applied the approach to the microarray data of diffuse large B cell lymphoma to demonstrate its behaviors and properties for mining the high-dimension data of genome-wide gene expression profiles. The resulting classifier(s) (the optimal gene subset(s)) has achieved the highest accuracy (99%) for prediction of independent microarray samples in comparisons with marginal filters and a hybrid between genetic algorithm and K nearest neighbors. Keywords: biosvm
[Li2005Prediction]	H. Li, C. Ung, C. Yap, Y. Xue, Z. Li, Z. Cao, and Y. Chen. Prediction of genotoxicity of chemical compounds by statistical learning methods. Chem. Res. Toxicol., 18(6):1071-1080, Jun 2005. [ bib \| DOI \| http \| .pdf ] Various toxicological profiles, such as genotoxic potential, need to be studied in drug discovery processes and submitted to the drug regulatory authorities for drug safety evaluation. As part of the effort for developing low cost and efficient adverse drug reaction testing tools, several statistical learning methods have been used for developing genotoxicity prediction systems with an accuracy of up to 73.8% for genotoxic (GT+) and 92.8% for nongenotoxic (GT-) agents. These systems have been developed and tested by using less than 400 known GT+ and GT- agents, which is significantly less in number and diversity than the 860 GT+ and GT- agents known at present. There is a need to examine if a similar level of accuracy can be achieved for the more diverse set of molecules and to evaluate other statistical learning methods not yet applied to genotoxicity prediction. This work is intended for testing several statistical learning methods by using 860 GT+ and GT- agents, which include support vector machines (SVM), probabilistic neural network (PNN), k-nearest neighbor (k-NN), and C4.5 decision tree (DT). A feature selection method, recursive feature elimination, is used for selecting molecular descriptors relevant to genotoxicity study. The overall accuracies of SVM, k-NN, and PNN are comparable to and those of DT lower than the results from earlier studies, with SVM giving the highest accuracies of 77.8% for GT+ and 92.7% for GT- agents. Our study suggests that statistical learning methods, particularly SVM, k-NN, and PNN, are useful for facilitating the prediction of genotoxic potential of a diverse set of molecules. Keywords: biosvm chemoinformatics
[Kumar2005BhairPred]	M. Kumar, M. Bhasin, N. K. Natt, and G. P. S. Raghava. BhairPred: prediction of beta-hairpins in a protein from multiple alignment information using ANN and SVM techniques. Nucleic Acids Res, 33(Web Server issue):W154-9, Jul 2005. [ bib \| DOI \| http \| .pdf ] This paper describes a method for predicting a supersecondary structural motif, beta-hairpins, in a protein sequence. The method was trained and tested on a set of 5102 hairpins and 5131 non-hairpins, obtained from a non-redundant dataset of 2880 proteins using the DSSP and PROMOTIF programs. Two machine-learning techniques, an artificial neural network (ANN) and a support vector machine (SVM), were used to predict beta-hairpins. An accuracy of 65.5% was achieved using ANN when an amino acid sequence was used as the input. The accuracy improved from 65.5 to 69.1% when evolutionary information (PSI-BLAST profile), observed secondary structure and surface accessibility were used as the inputs. The accuracy of the method further improved from 69.1 to 79.2% when the SVM was used for classification instead of the ANN. The performances of the methods developed were assessed in a test case, where predicted secondary structure and surface accessibility were used instead of the observed structure. The highest accuracy achieved by the SVM based method in the test case was 77.9%. A maximum accuracy of 71.1% with Matthew's correlation coefficient of 0.41 in the test case was obtained on a dataset previously used by X. Cruz, E. G. Hutchinson, A. Shephard and J. M. Thornton (2002) Proc. Natl Acad. Sci. USA, 99, 11157-11162. The performance of the method was also evaluated on proteins used in the '6th community-wide experiment on the critical assessment of techniques for protein structure prediction (CASP6)'. 
Based on the algorithm described, a web server, BhairPred (http://www.imtech.res.in/raghava/bhairpred/), has been developed, which can be used to predict beta-hairpins in a protein using the SVM approach. Keywords: biosvm
[Kuang2005Profile-based]	R. Kuang, E. Ie, K. Wang, K. Wang, M. Siddiqi, Y. Freund, and C. Leslie. Profile-based string kernels for remote homology detection and motif extraction. J. Bioinform. Comput. Biol., 3(3):527-550, Jun 2005. [ bib ] We introduce novel profile-based string kernels for use with support vector machines (SVMs) for the problems of protein classification and remote homology detection. These kernels use probabilistic profiles, such as those produced by the PSI-BLAST algorithm, to define position-dependent mutation neighborhoods along protein sequences for inexact matching of k-length subsequences ("k-mers") in the data. By use of an efficient data structure, the kernels are fast to compute once the profiles have been obtained. For example, the time needed to run PSI-BLAST in order to build the profiles is significantly longer than both the kernel computation time and the SVM training time. We present remote homology detection experiments based on the SCOP database where we show that profile-based string kernels used with SVM classifiers strongly outperform all recently presented supervised SVM methods. We further examine how to incorporate predicted secondary structure information into the profile kernel to obtain a small but significant performance improvement. We also show how we can use the learned SVM classifier to extract "discriminative sequence motifs"-short regions of the original profile that contribute almost all the weight of the SVM classification score-and show that these discriminative motifs correspond to meaningful structural features in the protein data. The use of PSI-BLAST profiles can be seen as a semi-supervised learning technique, since PSI-BLAST leverages unlabeled data from a large sequence database to build more informative profiles. Recently presented "cluster kernels" give general semi-supervised methods for improving SVM protein classification performance. 
We show that our profile kernel results also outperform cluster kernels while providing much better scalability to large datasets. Keywords: biosvm
[Komura2005Multidimensional]	D. Komura, H. Nakamura, S. Tsutsumi, H. Aburatani, and S. Ihara. Multidimensional support vector machines for visualization of gene expression data. Bioinformatics, 21(4):439-444, Feb 2005. [ bib \| DOI \| http \| .pdf ] Motivation: Since DNA microarray experiments provide us with huge amount of gene expression data, they should be analyzed with statistical methods to extract the meanings of experimental results. Some dimensionality reduction methods such as Principal Component Analysis (PCA) are used to roughly visualize the distribution of high dimensional gene expression data. However, in the case of binary classification of gene expression data, PCA does not utilize class information when choosing axes. Thus clearly separable data in the original space may not be so in the reduced space used in PCA. Results: For visualization and class prediction of gene expression data, we have developed a new SVM-based method called multidimensional SVMs, that generate multiple orthogonal axes. This method projects high dimensional data into lower dimensional space to exhibit properties of the data clearly and to visualize a distribution of the data roughly. Furthermore, the multiple axes can be used for class prediction. The basic properties of conventional SVMs are retained in our method: solutions of mathematical programming are sparse, and nonlinear classification is implemented implicitly through the use of kernel functions. The application of our method to the experimentally obtained gene expression datasets for patients' samples indicates that our algorithm is efficient and useful for visualization and class prediction. Keywords: biosvm
[BioCyc2005]	P. D. Karp, C. A. Ouzounis, C. Moore-Kochlacs, L. Goldovsky, P. Kaipa, D. Ahren, S. Tsoka, N. Darzentas, V. Kunin, and N. Lopez-Bigas. Expansion of the BioCyc collection of pathway/genome databases to 160 genomes. Nucleic Acids Res, 33(19):6083-9, 2005. [ bib ] The BioCyc database collection is a set of 160 pathway/genome databases (PGDBs) for most eukaryotic and prokaryotic species whose genomes have been completely sequenced to date. Each PGDB in the BioCyc collection describes the genome and predicted metabolic network of a single organism, inferred from the MetaCyc database, which is a reference source on metabolic pathways from multiple organisms. In addition, each bacterial PGDB includes predicted operons for the corresponding species. The BioCyc collection provides a unique resource for computational systems biology, namely global and comparative analyses of genomes and metabolic networks, and a supplement to the BioCyc resource of curated PGDBs. The Omics viewer available through the BioCyc website allows scientists to visualize combinations of gene expression, proteomics and metabolomics data on the metabolic maps of these organisms. This paper discusses the computational methodology by which the BioCyc collection has been expanded, and presents an aggregate analysis of the collection that includes the range of number of pathways present in these organisms, and the most frequently observed pathways. We seek scientists to adopt and curate individual PGDBs within the BioCyc collection. Only by harnessing the expertise of many scientists we can hope to produce biological databases, which accurately reflect the depth and breadth of knowledge that the biomedical research community is producing. Keywords: Animals; Computational Biology; Databases, Genetic; Genome; Genome, Archaeal; Genome, Bacterial; Genomics; Humans; Metabolism/genetics; Research Support, N.I.H., Extramural; Research Support, Non-U.S. Gov't; Research Support, U.S. Gov't, P.H.S.
[Karklin2005Classification]	Y. Karklin, R. F. Meraz, and S.R. Holbrook. Classification of non-coding RNA using graph representations of secondary structure. Pac. Symp. Biocomput., pages 4-15, 2005. [ bib \| .pdf ] Some genes produce transcripts that function directly in regulatory, catalytic, or structural roles in the cell. These non-coding RNAs are prevalent in all living organisms, and methods that aid the understanding of their functional roles are essential. RNA secondary structure, the pattern of base-pairing, contains the critical information for determining the three dimensional structure and function of the molecule. In this work we examine whether the basic geometric and topological properties of secondary structure are sufficient to distinguish between RNA families in a learning framework. First, we develop a labeled dual graph representation of RNA secondary structure by adding biologically meaningful labels to the dual graphs proposed by Gan et al [1]. Next, we define a similarity measure directly on the labeled dual graphs using the recently developed marginalized kernels [2]. Using this similarity measure, we were able to train Support Vector Machine classifiers to distinguish RNAs of known families from random RNAs with similar statistics. For 22 of the 25 families tested, the classifier achieved better than 70% accuracy, with much higher accuracy rates for some families. Training a set of classifiers to automatically assign family labels to RNAs using a one vs. all multi-class scheme also yielded encouraging results. From these initial learning experiments, we suggest that the labeled dual graph representation, together with kernel machine methods, has potential for use in automated analysis and classification of uncharacterized RNA molecules or efficient genome-wide screens for RNA molecules from existing families. Keywords: biosvm
[Karchin2005Improving]	R. Karchin, L. Kelly, and A. Sali. Improving functional annotation of non-synonomous SNPs with information theory. Pac Symp Biocomput, pages 397-408, 2005. [ bib ] Automated functional annotation of nsSNPs requires that amino-acid residue changes are represented by a set of descriptive features, such as evolutionary conservation, side-chain volume change, effect on ligand-binding, and residue structural rigidity. Identifying the most informative combinations of features is critical to the success of a computational prediction method. We rank 32 features according to their mutual information with functional effects of amino-acid substitutions, as measured by in vivo assays. In addition, we use a greedy algorithm to identify a subset of highly informative features. The method is simple to implement and provides a quantitative measure for selecting the best predictive features given a set of features that a human expert believes to be informative. We demonstrate the usefulness of the selected highly informative features by cross-validated tests of a computational classifier, a support vector machine (SVM). The SVM's classification accuracy is highly correlated with the ranking of the input features by their mutual information. Two features describing the solvent accessibility of "wild-type" and "mutant" amino-acid residues and one evolutionary feature based on superfamily-level multiple alignments produce comparable overall accuracy and 6% fewer false positives than a 32-feature set that considers physiochemical properties of amino acids, protein electrostatics, amino-acid residue flexibility, and binding interactions. Keywords: biosvm
[Jorissen2005Virtual]	R. N. Jorissen and M. K. Gilson. Virtual screening of molecular databases using a support vector machine. J Chem Inf Model, 45(3):549-61, 2005. [ bib \| DOI \| http \| .pdf ] The Support Vector Machine (SVM) is an algorithm that derives a model used for the classification of data into two categories and which has good generalization properties. This study applies the SVM algorithm to the problem of virtual screening for molecules with a desired activity. In contrast to typical applications of the SVM, we emphasize not classification but enrichment of actives by using a modified version of the standard SVM function to rank molecules. The method employs a simple and novel criterion for picking molecular descriptors and uses cross-validation to select SVM parameters. The resulting method is more effective at enriching for active compounds with novel chemistries than binary fingerprint-based methods such as binary kernel discrimination. Keywords: biosvm
[Jarzab2005Gene]	Barbara Jarzab, Malgorzata Wiench, Krzysztof Fujarewicz, Krzysztof Simek, Michal Jarzab, Malgorzata Oczko-Wojciechowska, Jan Wloch, Agnieszka Czarniecka, Ewa Chmielik, Dariusz Lange, Agnieszka Pawlaczek, Sylwia Szpak, Elzbieta Gubala, and Andrzej Swierniak. Gene expression profile of papillary thyroid cancer: sources of variability and diagnostic implications. Cancer Res., 65(4):1587-97, Feb 2005. [ bib \| DOI \| http \| .pdf ] The study looked for an optimal set of genes differentiating between papillary thyroid cancer (PTC) and normal thyroid tissue and assessed the sources of variability in gene expression profiles. The analysis was done by oligonucleotide microarrays (GeneChip HG-U133A) in 50 tissue samples taken intraoperatively from 33 patients (23 PTC patients and 10 patients with other thyroid disease). In the initial group of 16 PTC and 16 normal samples, we assessed the sources of variability in the gene expression profile by singular value decomposition which specified three major patterns of variability. The first and the most distinct mode grouped transcripts differentiating between tumor and normal tissues. Two consecutive modes contained a large proportion of immunity-related genes. To generate a multigene classifier for tumor-normal difference, we used support vector machines-based technique (recursive feature replacement). It included the following 19 genes: DPP4, GJB3, ST14, SERPINA1, LRP4, MET, EVA1, SPUVE, LGALS3, HBB, MKRN2, MRC2, IGSF1, KIAA0830, RXRG, P4HA2, CDH3, IL13RA1, and MTMR4, and correctly discriminated 17 of 18 additional PTC/normal thyroid samples and all 16 samples published in a previous microarray study. Selected novel genes (LRP4, EVA1, TMPRSS4, QPCT, and SLC34A2) were confirmed by Q-PCR. Our results prove that the gene expression signal of PTC is easily detectable even when cancer cells do not prevail over tumor stroma. We indicate and separate the confounding variability related to the immune response. 
Finally, we propose a potent molecular classifier able to discriminate between PTC and nonmalignant thyroid in more than 90% of investigated samples. Keywords: biosvm
[Huang2005Gene]	T. M. Huang and V. Kecman. Gene extraction for cancer diagnosis by support vector machines-An improvement. Artif. Intell. Med., Jul 2005. [ bib \| DOI \| http \| .pdf ] OBJECTIVE: To improve the performance of gene extraction for cancer diagnosis by recursive feature elimination with support vector machines (RFE-SVMs): A cancer diagnosis by using the DNA microarray data faces many challenges the most serious one being the presence of thousands of genes and only several dozens (at the best) of patient's samples. Thus, making any kind of classification in high-dimensional spaces from a limited number of data is both an extremely difficult and a prone to an error procedure. The improved RFE-SVMs is introduced and used here for an elimination of less relevant genes and just for a reduction of the overall number of genes used in a medical diagnostic. METHODS: The paper shows why and how the, usually neglected, penalty parameter C and some standard data preprocessing techniques (normalizing and scaling) influence classification results and the gene selection of RFE-SVMs. The gene selected by RFE-SVMs is compared with eight other gene selection algorithms implemented in the Rankgene software to investigate whether there is any consensus among the algorithms, so the scope of finding the right set of genes can be reduced. RESULTS: The improved RFE-SVMs is applied on the two benchmarking colon and lymphoma cancer data sets with various C parameters and different standard preprocessing techniques. Here, decreasing C leads to the smaller diagnosis error in comparisons to other known methods applied to the benchmarking data sets. With an appropriate parameter C and with a proper preprocessing procedure, the reduction in a diagnosis error is as high as 36%. 
CONCLUSIONS: The results suggest that with a properly chosen parameter C, the extracted genes and the constructed classifier will ensure less overfitting of the training data leading to an increased accuracy in selecting relevant genes. Finally, comparison in gene ranking obtained by different algorithms shows that there is a significant consensus among the various algorithms as to which set of genes is relevant. Keywords: biosvm
[Huang2005Computation]	Shao-Wei Huang and Jenn-Kang Hwang. Computation of conformational entropy from protein sequences using the machine-learning method-application to the study of the relationship between structural conservation and local structural stability. Proteins, 59(4):802-9, Jun 2005. [ bib \| DOI \| http \| .pdf ] A complete protein sequence can usually determine a unique conformation; however, the situation is different for shorter subsequences-some of them are able to adopt unique conformations, independent of context; while others assume diverse conformations in different contexts. The conformations of subsequences are determined by the interplay between local and nonlocal interactions. A quantitative measure of such structural conservation or variability will be useful in the understanding of the sequence-structure relationship. In this report, we developed an approach using the support vector machine method to compute the conformational variability directly from sequences, which is referred to as the sequence structural entropy. As a practical application, we studied the relationship between sequence structural entropy and the hydrogen exchange for a set of well-studied proteins. We found that the slowest exchange cores usually comprise amino acids of the lowest sequence structural entropy. Our results indicate that structural conservation is closely related to the local structural stability. This relationship may have interesting implications in the protein folding processes, and may be useful in the study of the sequence-structure relationship. Keywords: biosvm
[Huang2005CTKPred]	N. Huang, H. Chen, and Z. Sun. CTKPred: an SVM-based method for the prediction and classification of the cytokine superfamily. Protein Eng. Des. Sel., Jun 2005. [ bib \| DOI \| http \| .pdf ] Cell proliferation, differentiation and death are controlled by a multitude of cell-cell signals and loss of this control has devastating consequences. Prominent among these regulatory signals is the cytokine superfamily, which has crucial functions in the development, differentiation and regulation of immune cells. In this study, a support vector machine (SVM)-based method was developed for predicting families and subfamilies of cytokines using dipeptide composition. The taxonomy of the cytokine superfamily with which our method complies was described in the Cytokine Family cDNA Database (dbCFC) and the dataset used in this study for training and testing was obtained from the dbCFC and Structural Classification of Proteins (SCOP). The method classified cytokines and non-cytokines with an accuracy of 92.5% by 7-fold cross-validation. The method is further able to predict seven major classes of cytokine with an overall accuracy of 94.7%. A server for recognition and classification of cytokines based on multi-class SVMs has been set up at http://bioinfo.tsinghua.edu.cn/ huangni/CTKPred/. Keywords: biosvm
[Huang2005Support]	Jing Huang and Feng Shi. Support vector machines for predicting apoptosis proteins types. Acta Biotheor., 53(1):39-47, 2005. [ bib \| DOI \| http \| .pdf ] Apoptosis proteins have a central role in the development and homeostasis of an organism. These proteins are very important for understanding the mechanism of programmed cell death, and their function is related to their types. According to the classification scheme by Zhou and Doctor (2003), the apoptosis proteins are categorized into the following four types: (1) cytoplasmic protein; (2) plasma membrane-bound protein; (3) mitochondrial inner and outer proteins; (4) other proteins. A powerful learning machine, the Support Vector Machine, is applied for predicting the type of a given apoptosis protein by incorporating the sqrt-amino acid composition effect. High success rates were obtained by the re-substitute test (98/98 = 100%) and the jackknife test (89/98 = 90.8%). Keywords: biosvm
[Hua2005Optimal]	J. Hua, Z. Xiong, J. Lowey, E. Suh, and E. R. Dougherty. Optimal number of features as a function of sample size for various classification rules. Bioinformatics, 21(8):1509-1515, Apr 2005. To appear. [ bib \| DOI \| http \| .pdf ] Motivation: Given the joint feature-label distribution, increasing the number of features always results in decreased classification error; however, this is not the case when a classifier is designed via a classification rule from sample data. Typically (but not always), for fixed sample size, the error of a designed classifier decreases and then increases as the number of features grows. The potential downside of using too many features is most critical for small samples, which are commonplace for gene-expression-based classifiers for phenotype discrimination. For fixed sample size and feature-label distribution, the issue is to find an optimal number of features. Results: Since only in rare cases is there a known distribution of the error as a function of the number of features and sample size, this study employs simulation for various feature-label distributions and classification rules, and across a wide range of sample and feature-set sizes. To achieve the desired end, finding the optimal number of features as a function of sample size, it employs massively parallel computation. Seven classifiers are treated: 3-nearest-neighbor, Gaussian kernel, linear support vector machine, polynomial support vector machine, perceptron, regular histogram and linear discriminant analysis. Three Gaussian-based models are considered: linear, nonlinear and bimodal. In addition, real patient data from a large breast-cancer study is considered. 
To mitigate the combinatorial search for finding optimal feature sets, and to model the situation in which subsets of genes are co-regulated and correlation is internal to these subsets, we assume that the covariance matrix of the features is blocked, with each block corresponding to a group of correlated features. Altogether there is a large number of error surfaces for the many cases. These are provided in full on a companion web-site, which is meant to serve as resource for those working with small-sample classification. Availability: For the companion web-site, please visit http://public.tgen.org/tamu/ofs/. Keywords: biosvm
[Han2005Fold]	Sangjo Han, Byung-Chul Lee, Seung Taek Yu, Chan-Seok Jeong, Soyoung Lee, and Dongsup Kim. Fold recognition by combining profile-profile alignment and support vector machine. Bioinformatics, 21(11):2667-73, Jun 2005. [ bib \| DOI \| http \| .pdf ] MOTIVATION: Currently, the most accurate fold-recognition method is to perform profile-profile alignments and estimate the statistical significances of those alignments by calculating Z-score or E-value. Although this scheme is reliable in recognizing relatively close homologs related at the family level, it has difficulty in finding the remote homologs that are related at the superfamily or fold level. RESULTS: In this paper, we present an alternative method to estimate the significance of the alignments. The alignment between a query protein and a template of length n in the fold library is transformed into a feature vector of length n + 1, which is then evaluated by support vector machine (SVM). The output from SVM is converted to a posterior probability that a query sequence is related to a template, given SVM output. Results show that a new method shows significantly better performance than PSI-BLAST and profile-profile alignment with Z-score scheme. While PSI-BLAST and Z-score scheme detect 16 and 20% of superfamily-related proteins, respectively, at 90% specificity, a new method detects 46% of these proteins, resulting in more than 2-fold increase in sensitivity. More significantly, at the fold level, a new method can detect 14% of remotely related proteins at 90% specificity, a remarkable result considering the fact that the other methods can detect almost none at the same level of specificity. Keywords: biosvm
[Han2005Prediction]	L.Y. Han, C.Z. Cai, Z.L. Ji, and Y.Z. Chen. Prediction of functional class of novel viral proteins by a statistical learning method irrespective of sequence similarity. Virology, 331(1):136-143, 2005. [ bib \| DOI \| http \| .pdf ] The function of a substantial percentage of the putative protein-coding open reading frames (ORFs) in viral genomes is unknown. As their sequence is not similar to that of proteins of known function, the function of these ORFs cannot be assigned on the basis of sequence similarity. Methods complement or in combination with sequence similarity-based approaches are being explored. The web-based software SVMProt () to some extent assigns protein functional family irrespective of sequence similarity and has been found to be useful for studying distantly related proteins [Cai, C.Z., Han, L.Y., Ji, Z.L., Chen, X., Chen, Y.Z., 2003. SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res. 31(13): 3692-3697]. Here 25 novel viral proteins are selected to test the capability of SVMProt for functional family assignment of viral proteins whose function cannot be confidently predicted by sequence similarity methods at present. These proteins are without a sequence homolog in the Swissprot database, with its precise function provided in the literature, and not included in the training sets of SVMProt. The predicted functional classes of 72 the literature-described function, which is compared to the overall accuracy of 87 proteins. This suggests that SVMProt to some extent is capable of functional class assignment irrespective of sequence similarity and it is potentially useful for facilitating functional study of novel viral proteins. Keywords: biosvm
[Hakenberg2005Systematic]	Jörg Hakenberg, Steffen Bickel, Conrad Plake, Ulf Brefeld, Hagen Zahn, Lukas Faulstich, Ulf Leser, and Tobias Scheffer. Systematic feature evaluation for gene name recognition. BMC Bioinformatics, 6 Suppl 1:S9, 2005. [ bib \| DOI \| http \| .pdf ] In task 1A of the BioCreAtIvE evaluation, systems had to be devised that recognize words and phrases forming gene or protein names in natural language sentences. We approach this problem by building a word classification system based on a sliding window approach with a Support Vector Machine, combined with a pattern-based post-processing for the recognition of phrases. The performance of such a system crucially depends on the type of features chosen for consideration by the classification method, such as pre- or postfixes, character n-grams, patterns of capitalization, or classification of preceding or following words. We present a systematic approach to evaluate the performance of different feature sets based on recursive feature elimination, RFE. Based on a systematic reduction of the number of features used by the system, we can quantify the impact of different feature sets on the results of the word classification problem. This helps us to identify descriptive features, to learn about the structure of the problem, and to design systems that are faster and easier to understand. We observe that the SVM is robust to redundant features. RFE improves the performance by 0.7%, compared to using the complete set of attributes. Moreover, a performance that is only 2.3% below this maximum can be obtained using fewer than 5% of the features. Keywords: biosvm
[Haferlach2005global]	Torsten Haferlach, Alexander Kohlmann, Susanne Schnittger, Martin Dugas, Wolfgang Hiddemann, Wolfgang Kern, and Claudia Schoch. A global approach to the diagnosis of leukemia using gene expression profiling. Blood, 106(4):1189-1198, Aug 2005. [ bib \| DOI \| http \| .pdf ] Accurate diagnosis and classification of leukemias are the bases for the appropriate management of patients. The diagnostic accuracy and efficiency of present methods may be improved by the use of microarrays for gene expression profiling. We analyzed gene expression profiles in bone marrow and peripheral blood samples from 937 patients with all clinically relevant leukemia subtypes (n=892) and non-leukemic controls (n=45) by U133A and B GeneChips (Affymetrix). For each subgroup differentially expressed genes were calculated. Class prediction was performed using support vector machines. Prediction accuracies were estimated by 10-fold cross validation and assessed for robustness in a 100-fold resampling approach using randomly chosen test-sets consisting of 1/3 of the samples. Applying the top 100 genes of each subgroup an overall prediction accuracy of 95.1% was achieved which was confirmed by resampling (median, 93.8%; 95% confidence interval, 91.4%-95.8%). In particular, AML with t(15;17), t(8;21), or inv(16), CLL, and Pro-B-ALL with t(11q23) were classified with 100% sensitivity and 100% specificity. Accordingly, cluster analysis completely separated all of the 13 subgroups analyzed. Gene expression profiling can predict all clinically relevant subentities of leukemia with high accuracy. Keywords: biosvm microarray
[Haferlach2005AML]	Torsten Haferlach, Alexander Kohlmann, Susanne Schnittger, Martin Dugas, Wolfgang Hiddemann, Wolfgang Kern, and Claudia Schoch. AML M3 and AML M3 variant each have a distinct gene expression signature but also share patterns different from other genetically defined AML subtypes. Genes Chromosomes Cancer, 43(2):113-27, Jun 2005. [ bib \| DOI \| http \| .pdf ] Acute promyelocytic leukemia (APL) with t(15;17) appears in two phenotypes: AML M3, with abnormal promyelocytes showing heavy granulation and bundles of Auer rods, and AML M3 variant (M3v), with non- or hypogranular cytoplasm and a bilobed nucleus. We investigated the global gene expression profiles of 35 APL patients (19 AML M3, 16 AML M3v) by using high-density DNA-oligonucleotide microarrays. First, an unsupervised approach clearly separated APL samples from other AMLs characterized genetically as t(8;21) (n = 35), inv(16) (n = 35), or t(11q23)/MLL (n = 35) or as having a normal karyotype (n = 50). Second, we found genes with functional relevance for blood coagulation that were differentially expressed between APL and other AMLs. Furthermore, a supervised pairwise comparison between M3 and M3v revealed differential expression of genes that encode for biological functions and pathways such as granulation and maturation of hematologic cells, explaining morphologic and clinical differences. Discrimination between M3 and M3v based on gene signatures showed a median classification accuracy of 90% by use of 10-fold CV and support vector machines. Additional molecular mutations such as FLT3-LM, which were significantly more frequent in M3v than in M3 (P < 0.0001), may partly contribute to the different phenotypes. However, linear regression analysis demonstrated that genes differentially expressed between M3 and M3v did not correlate with FLT3-LM. Keywords: biosvm microarray
[Haasdonk2005Feature]	Bernard Haasdonk. Feature space interpretation of SVMs with indefinite kernels. IEEE Trans Pattern Anal Mach Intell, 27(4):482-92, Apr 2005. [ bib \| DOI \| http \| .pdf ] Kernel methods are becoming increasingly popular for various kinds of machine learning tasks, the most famous being the support vector machine (SVM) for classification. The SVM is well understood when using conditionally positive definite (cpd) kernel functions. However, in practice, non-cpd kernels arise and demand application in SVMs. The procedure of "plugging" these indefinite kernels in SVMs often yields good empirical classification results. However, they are hard to interpret due to missing geometrical and theoretical understanding. In this paper, we provide a step toward the comprehension of SVM classifiers in these situations. We give a geometric interpretation of SVMs with indefinite kernel functions. We show that such SVMs are optimal hyperplane classifiers not by margin maximization, but by minimization of distances between convex hulls in pseudo-Euclidean spaces. By this, we obtain a sound framework and motivation for indefinite SVMs. This interpretation is the basis for further theoretical analysis, e.g., investigating uniqueness, and for the derivation of practical guidelines like characterizing the suitability of indefinite SVMs. 
Keywords: Algorithms, Animals, Antibiotics, Antineoplastic, Artificial Intelligence, Automated, Automatic Data Processing, Butadienes, Chloroplasts, Cluster Analysis, Comparative Study, Computer Simulation, Computer-Assisted, Computing Methodologies, Database Management Systems, Databases, Diagnosis, Disinfectants, Dose-Response Relationship, Drug, Drug Toxicity, Electrodes, Electroencephalography, Ethylamines, Expert Systems, Factual, Feedback, Fungicides, Gene Expression Profiling, Genes, Genetic Markers, Humans, Image Enhancement, Image Interpretation, Implanted, Industrial, Information Storage and Retrieval, Kidney, Kidney Tubules, MEDLINE, Male, Mercuric Chloride, Microarray Analysis, Molecular Biology, Motor Cortex, Movement, Natural Language Processing, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Numerical Analysis, Pattern Recognition, Plant Proteins, Predictive Value of Tests, Proteins, Proteome, Proximal, Puromycin Aminonucleoside, Rats, Reproducibility of Results, Research Support, Sensitivity and Specificity, Signal Processing, Sprague-Dawley, Subcellular Fractions, Terminology, Therapy, Time Factors, Toxicogenetics, U.S. Gov't, User-Computer Interface, 15794155
[Guo2005novel]	Ting Guo, Yanxin Shi, and Zhirong Sun. A novel statistical ligand-binding site predictor: application to ATP-binding sites. Protein Eng Des Sel, 18(2):65-70, Feb 2005. [ bib \| DOI \| http \| .pdf ] Structural genomics initiatives are leading to rapid growth in newly determined protein 3D structures, the functional characterization of which may still be inadequate. As an attempt to provide insights into the possible roles of the emerging proteins whose structures are available and/or to complement biochemical research, a variety of computational methods have been developed for the screening and prediction of ligand-binding sites in raw structural data, including statistical pattern classification techniques. In this paper, we report a novel statistical descriptor (the Oriented Shell Model) for protein ligand-binding sites, which utilizes the distance and angular position distribution of various structural and physicochemical features present in immediate proximity to the center of a binding site. Using the support vector machine (SVM) as the classifier, our model identified 69% of the ATP-binding sites in whole-protein scanning tests and in eukaryotic proteins the accuracy is particularly high. We propose that this feature extraction and machine learning procedure can screen out ligand-binding-capable protein candidates and can yield valuable biochemical information for individual proteins. Keywords: biosvm
[Gaudan2005Resolving]	S. Gaudan, H. Kirsch, and D. Rebholz-Schuhmann. Resolving abbreviations to their senses in Medline. Bioinformatics, Jul 2005. [ bib \| DOI \| http \| .pdf ] MOTIVATION: Biological literature contains many abbreviations with one particular sense in each document. However, most abbreviations do not have a unique sense across the literature. Furthermore, many documents do not contain the long-forms of the abbreviations. Resolving an abbreviation in a document consists of retrieving its sense in use. Abbreviation resolution improves accuracy of document retrieval engines and of information extraction systems. RESULTS: We combine an automatic analysis of Medline abstracts and linguistic methods to build a dictionary of abbreviation/sense pairs. The dictionary is used for the resolution of abbreviations occurring with their long-forms. Ambiguous global abbreviations are resolved using Support Vector Machines that have been trained on the context of each instance of the abbreviation/sense pairs, previously extracted for the dictionary setup. The system disambiguates abbreviations with a precision of 98.9% for a recall of 98.2% (98.5% accuracy). This performance is superior in comparison to previously reported research work. AVAILABILITY: The abbreviation resolution module is available at http://www.ebi.ac.uk/Rebholz/software.html. Keywords: biosvm nlp
[Garg2005SVM-based]	A. Garg, M. Bhasin, and G.P. Raghava. SVM-based method for subcellular localization of human proteins using amino acid compositions, their order and similarity search. J. Biol. Chem., 280(15):14427-32, Apr 2005. [ bib \| DOI \| http \| .pdf ] Here we report a systematic approach for predicting subcellular localization (cytoplasm, mitochondrial, nuclear and plasma membrane) of human proteins. Firstly, SVM based modules for predicting subcellular localization using traditional amino acid and dipeptide (i+1) composition achieved overall accuracy of 76.6 when carried out using similarity-based search against non-redundant database of experimentally annotated proteins yielded 73.3 To gain further insight, hybrid module (hybrid1) was developed based on amino acid composition, dipeptide composition, and similarity information and attained better accuracy of 84.9 SVM module based on different higher order dipeptide i.e. i+2, i+3, and i+4 were also constructed for the prediction of subcellular localization of human proteins and overall accuracy of 79.7 and 77.1 module hybrid2 was developed using traditional dipeptide (i+1) and higher order dipeptide (i+2, i+3, and i+4) compositions, which gave an overall accuracy of 81.3 based on amino acid composition, traditional and higher order dipeptide compositions and PSI-BLAST output and achieved an overall accuracy of 84.4 or http://bioinformatics.uams.edu/raghava/hslpred/) has been designed to predict subcellular localization of human proteins using the above approaches. Keywords: biosvm
[Gardy2005PSORTb]	J. L. Gardy, M. R. Laird, F. Chen, S. Rey, C. J. Walsh, M. Ester, and F. S. L. Brinkman. PSORTb v.2.0: expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis. Bioinformatics, 21(5):617-623, Mar 2005. [ bib \| DOI \| http \| .pdf ] Motivation: PSORTb v.1.1 is the most precise bacterial localization prediction tool available. However the program's predictive coverage and recall are low and the method is only applicable to Gram-negative bacteria. The goals of the present work were: increase PSORTb's coverage while maintaining the existing precision level, expand it to include Gram-positive bacteria, and then carry out a comparative analysis of localization. Results: An expanded database of proteins of known localization and new modules using frequent subsequence-based support vector machines were introduced into PSORTb v.2.0. The program attains a precision of 96 bacteria and predictive coverage comparable to other tools for whole proteome analysis. We show that the proportion of proteins at each localization is remarkably consistent across species, even in species with varying proteome size. Availability: Web-based version: http://www.psort.org/psortb. Standalone version: Available through the website under GNU General Public License. Supplementary Information: http://www.psort.org/psortb/supplementaryinfo.html. Keywords: biosvm
[Gangal2005Human]	Rajeev Gangal and Pankaj Sharma. Human pol II promoter prediction: time series descriptors and machine learning. Nucleic Acids Res, 33(4):1332-6, 2005. [ bib \| DOI \| http \| .pdf ] Although several in silico promoter prediction methods have been developed to date, they are still limited in predictive performance. The limitations are due to the challenge of selecting appropriate features of promoters that distinguish them from non-promoters and the generalization or predictive ability of the machine-learning algorithms. In this paper we attempt to define a novel approach by using unique descriptors and machine-learning methods for the recognition of eukaryotic polymerase II promoters. In this study, non-linear time series descriptors along with non-linear machine-learning algorithms, such as support vector machine (SVM), are used to discriminate between promoter and non-promoter regions. The basic idea here is to use descriptors that do not depend on the primary DNA sequence and provide a clear distinction between promoter and non-promoter regions. The classification model built on a set of 1000 promoter and 1500 non-promoter sequences, showed a 10-fold cross-validation accuracy of 87% and an independent test set had an accuracy >85% in both promoter and non-promoter identification. This approach correctly identified all 20 experimentally verified promoters of human chromosome 22. The high sensitivity and selectivity indicates that n-mer frequencies along with non-linear time series descriptors, such as Lyapunov component stability and Tsallis entropy, and supervised machine-learning methods, such as SVMs, can be useful in the identification of pol II promoters. Keywords: biosvm
[Friedel2005Support]	C. C. Friedel, K. H. V. Jahn, S. Sommer, S. Rudd, H. W. Mewes, and I. V. Tetko. Support vector machines for separation of mixed plant-pathogen EST collections based on codon usage. Bioinformatics, 21:1383-1388, 2005. [ bib \| DOI \| http \| .pdf ] Motivation: Discovery of host and pathogen genes expressed at the plant-pathogen interface often requires the construction of mixed libraries that contain sequences from both genomes. Sequence identification requires high-throughput and reliable classification of genome origin. When using single-pass cDNA sequences difficulties arise from the short sequence length, the lack of sufficient taxonomically relevant sequence data in public databases and ambiguous sequence homology between plant and pathogen genes. Results: A novel method is described, which is independent of the availability of homologous genes and relies on subtle differences in codon usage between plant and fungal genes. We used support vector machines (SVMs) to identify the probable origin of sequences. SVMs were compared to several other machine learning techniques and to a probabilistic algorithm (PF-IND, Maor et al., 2003) for EST classification also based on codon bias differences. Our software (ECLAT) has achieved a classification accuracy of 93.1 vulgare and B. graminis, which is a significant improvement compared to PF-IND (prediction accuracy of 81.2 EST sequences with at least 50 nt of coding sequence can be classified by ECLAT with high confidence. ECLAT allows training of classifiers for any host-pathogen combination for which there are sufficient classified training sequences. Availability: ECLAT is freely available on the internet (http://mips.gsf.de/proj/est) or on request as a standalone version. Keywords: biosvm
[Engelhardt2005Protein]	B. E. Engelhardt, M. I. Jordan, K. E. Muratore, and S. E. Brenner. Protein Molecular Function Prediction by Bayesian Phylogenomics. PLoS Comput. Biol., 1(5):e45, Oct 2005. [ bib \| DOI \| http \| .pdf ] We present a statistical graphical model to infer specific molecular function for unannotated protein sequences using homology. Based on phylogenomic principles, SIFTER (Statistical Inference of Function Through Evolutionary Relationships) accurately predicts molecular function for members of a protein family given a reconciled phylogeny and available function annotations, even when the data are sparse or noisy. Our method produced specific and consistent molecular function predictions across 100 Pfam families in comparison to the Gene Ontology annotation database, BLAST, GOtcha, and Orthostrapper. We performed a more detailed exploration of functional predictions on the adenosine-5'-monophosphate/adenosine deaminase family and the lactate/malate dehydrogenase family, in the former case comparing the predictions against a gold standard set of published functional characterizations. Given function annotations for 3% of the proteins in the deaminase family, SIFTER achieves 96% accuracy in predicting molecular function for experimentally characterized proteins as reported in the literature. The accuracy of SIFTER on this dataset is a significant improvement over other currently available methods such as BLAST (75%), GeneQuiz (64%), GOtcha (89%), and Orthostrapper (11%). We also experimentally characterized the adenosine deaminase from Plasmodium falciparum, confirming SIFTER's prediction. The results illustrate the predictive power of exploiting a statistical model of function evolution in phylogenomic problems. A software implementation of SIFTER is available from the authors. Keywords: biogm
[Ehlers2005NBS1]	Justis P Ehlers and J. William Harbour. NBS1 expression as a prognostic marker in uveal melanoma. Clin. Cancer Res., 11(5):1849-53, Mar 2005. [ bib \| DOI \| http \| .pdf ] PURPOSE: Up to half of uveal melanoma patients die of metastatic disease. Treatment of the primary eye tumor does not improve survival in high-risk patients due to occult micrometastatic disease, which is present at the time of eye tumor diagnosis but is not detected and treated until months to years later. Here, we use microarray gene expression data to identify a new prognostic marker. EXPERIMENTAL DESIGN: Microarray gene expression profiles were analyzed in 25 primary uveal melanomas. Tumors were ranked by support vector machine (SVM) and by cytologic severity. Nbs1 protein expression was assessed by quantitative immunohistochemistry in 49 primary uveal melanomas. Survival was assessed using Kaplan-Meier life-table analysis. RESULTS: Expression of the Nijmegen breakage syndrome (NBS1) gene correlated strongly with SVM and cytologic tumor rankings (P < 0.0001). Further, immunohistochemistry expression of the Nbs1 protein correlated strongly with both SVM and cytologic rankings (P < 0.0001). The 6-year actuarial survival was 100% in patients with low immunohistochemistry expression of Nbs1 and 22% in those with high Nbs1 expression (P = 0.01). CONCLUSIONS: NBS1 is a strong predictor of uveal melanoma survival and potentially could be used as a clinical marker for guiding clinical management. 
Keywords: 80 and over, Adult, Aged, Algorithms, Amino Acid Sequence, Amino Acids, Analysis of Variance, Animals, Area Under Curve, Artifacts, Automated, Bacteriophage T4, Base Sequence, Biological, Birefringence, Brain Chemistry, Brain Neoplasms, Cell Cycle Proteins, Comparative Study, Computational Biology, Computer-Assisted, Cornea, Cross-Sectional Studies, Databases, Decision Trees, Diagnosis, Diagnostic Imaging, Diagnostic Techniques, Discriminant Analysis, Evolution, Extramural, Face, Female, Gene Expression Profiling, Genetic, Glaucoma, Humans, Immunohistochemistry, Intraocular Pressure, Lasers, Least-Squares Analysis, Likelihood Functions, Magnetic Resonance Imaging, Magnetic Resonance Spectroscopy, Male, Markov Chains, Melanoma, Middle Aged, Models, Molecular, Mutation, N.I.H., Nerve Fibers, Non-P.H.S., Non-U.S. Gov't, Nuclear Proteins, Nucleic Acid, Nucleic Acid Conformation, Numerical Analysis, Oligonucleotide Array Sequence Analysis, Ophthalmological, Optic Nerve Diseases, Optical Coherence, P.H.S., Pattern Recognition, Photic Stimulation, Polymorphism, Prognosis, Prospective Studies, Protein, Protein Structure, Proteins, RNA, ROC Curve, Regression Analysis, Reproducibility of Results, Research Support, Retinal Ganglion Cells, Secondary, Sensitivity and Specificity, Sequence Analysis, Single Nucleotide, Single-Stranded Conformational, Software, Statistics, Survival Analysis, Tertiary, Tomography, Tumor Markers, U.S. Gov't, Untranslated, Uveal Neoplasms, Visual Fields, beta-Lactamases, 15756009
[Donnes2005Integrated]	P. Dönnes and O. Kohlbacher. Integrated modeling of the major events in the MHC class I antigen processing pathway. Protein Sci., 14:2132-2140, Jun 2005. [ bib \| DOI \| http \| .pdf ] Rational design of epitope-driven vaccines is a key goal of immunoinformatics. Typically, candidate selection relies on the prediction of MHC-peptide binding only, as this is known to be the most selective step in the MHC class I antigen processing pathway. However, proteasomal cleavage and transport by the transporter associated with antigen processing (TAP) are essential steps in antigen processing as well. While prediction methods exist for the individual steps, no method has yet offered an integrated prediction of all three major processing events. Here we present WAPP, a method combining prediction of proteasomal cleavage, TAP transport, and MHC binding into a single prediction system. The proteasomal cleavage site prediction employs a new matrix-based method that is based on experimentally verified proteasomal cleavage sites. Support vector regression is used for predicting peptides transported by TAP. MHC binding is the last step in the antigen processing pathway and was predicted using a support vector machine method, SVMHC. The individual methods are combined in a filtering approach mimicking the natural processing pathway. WAPP thus predicts peptides that are cleaved by the proteasome at the C terminus, transported by TAP, and show significant affinity to MHC class I molecules. This results in a decrease in false positive rates compared to MHC binding prediction alone. Compared to prediction of MHC binding only, we report an increased overall accuracy and a lower rate of false positive predictions for the HLA-A0201, HLA-B2705, HLA-A01, and HLA-A03 alleles using WAPP. The method is available online through our prediction server at http://www-bs.informatik.uni-tuebingen.de/WAPP. Keywords: biosvm immunoinformatics
[Dubey2005Support]	Anshul Dubey, Matthew J Realff, Jay H Lee, and Andreas S Bommarius. Support vector machines for learning to identify the critical positions of a protein. J Theor Biol, 234(3):351-61, Jun 2005. [ bib \| DOI \| http \| .pdf ] A method for identifying the positions in the amino acid sequence, which are critical for the catalytic activity of a protein using support vector machines (SVMs) is introduced and analysed. SVMs are supported by an efficient learning algorithm and can utilize some prior knowledge about the structure of the problem. The amino acid sequences of the variants of a protein, created by inducing mutations, along with their fitness are required as input data by the method to predict its critical positions. To investigate the performance of this algorithm, variants of the beta-lactamase enzyme were created in silico using simulations of both mutagenesis and recombination protocols. Results from literature on beta-lactamase were used to test the accuracy of this method. It was also compared with the results from a simple search algorithm. The algorithm was also shown to be able to predict critical positions that can tolerate two different amino acids and retain function. Keywords: biosvm
[Dror2005Accurate]	G. Dror, R. Sorek, and R. Shamir. Accurate identification of alternatively spliced exons using support vector machine. Bioinformatics, 21(7):897-901, Apr 2005. [ bib \| DOI \| http \| .pdf ] Motivation: Alternative splicing is a major component of the regulation acting on mammalian transcriptomes. It is estimated that over half of all human genes have more than one splice variant. Previous studies have shown that alternatively spliced exons possess several features that distinguish them from constitutively spliced ones. Recently, we have demonstrated that such features can be used to distinguish alternative from constitutive exons. In the current study we use advanced machine learning methods to generate robust alternative exons classifier. Results: We extracted several hundred local sequence features of constitutive as well as alternative exons. Using feature selection methods we find seven attributes that are dominant for the task of classification. Several less informative features help to slightly increase the performance of the classifier. The classifier achieves a true positive rate of 50 positive rate of 0.5 alternatively spliced exons in exon databases that are believed to be dominated by constitutive exons. Availability: Upon request from the authors. Keywords: biosvm
[Doyle2005PlosBiol]	John Doyle and Marie Csete. Motifs, control, and stability. PLoS Biol, 3(11):e392, Nov 2005. [ bib \| DOI \| http ] Keywords: Amino Acid Motifs; Bacterial Physiological Phenomena; Bacterial Proteins, chemistry; Escherichia coli, metabolism; Genes, Bacterial; Genes, Plant; Glycolysis; Heat-Shock Proteins, chemistry; Models, Biological; Models, Theoretical; Molecular Chaperones, chemistry; Plant Proteins, chemistry; Protein Interaction Mapping; Protein Structure, Tertiary; Transcription Factors, chemistry; Transcription, Genetic
[Dong2005Fast]	Jian xiong Dong, Adam Krzyzak, and Ching Y Suen. Fast SVM training algorithm with decomposition on very large data sets. IEEE Trans Pattern Anal Mach Intell, 27(4):603-18, Apr 2005. [ bib ] Training a support vector machine on a data set of huge size with thousands of classes is a challenging problem. This paper proposes an efficient algorithm to solve this problem. The key idea is to introduce a parallel optimization step to quickly remove most of the nonsupport vectors, where block diagonal matrices are used to approximate the original kernel matrix so that the original problem can be split into hundreds of subproblems which can be solved more efficiently. In addition, some effective strategies such as kernel caching and efficient computation of kernel matrix are integrated to speed up the training process. Our analysis of the proposed algorithm shows that its time complexity grows linearly with the number of classes and size of the data set. In the experiments, many appealing properties of the proposed algorithm have been investigated and the results show that the proposed algorithm has a much better scaling capability than Libsvm, SVMlight, and SVMTorch. Moreover, the good generalization performances on several large databases have also been achieved. 
Keywords: Algorithms, Animals, Antibiotics, Antineoplastic, Artificial Intelligence, Automated, Automatic Data Processing, Butadienes, Chloroplasts, Comparative Study, Computer Simulation, Computer-Assisted, Database Management Systems, Databases, Diagnosis, Disinfectants, Dose-Response Relationship, Drug, Drug Toxicity, Electrodes, Electroencephalography, Ethylamines, Expert Systems, Factual, Feedback, Fungicides, Gene Expression Profiling, Genes, Genetic Markers, Humans, Image Enhancement, Image Interpretation, Implanted, Industrial, Information Storage and Retrieval, Kidney, Kidney Tubules, MEDLINE, Male, Mercuric Chloride, Microarray Analysis, Molecular Biology, Motor Cortex, Movement, Natural Language Processing, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Numerical Analysis, Pattern Recognition, Plant Proteins, Predictive Value of Tests, Proteins, Proteome, Proximal, Puromycin Aminonucleoside, Rats, Reproducibility of Results, Research Support, Sensitivity and Specificity, Signal Processing, Sprague-Dawley, Subcellular Fractions, Terminology, Therapy, Time Factors, Toxicogenetics, U.S. Gov't, User-Computer Interface, 15794164
[Dobson2005Predicting]	P.D. Dobson and A.J. Doig. Predicting enzyme class from protein structure without alignments. J. Mol. Biol., 345(1):187-199, Jan 2005. [ bib \| DOI \| http \| .pdf ] Methods for predicting protein function from structure are becoming more important as the rate at which structures are solved increases more rapidly than experimental knowledge. As a result, protein structures now frequently lack functional annotations. The majority of methods for predicting protein function are reliant upon identifying a similar protein and transferring its annotations to the query protein. This method fails when a similar protein cannot be identified, or when any similar proteins identified also lack reliable annotations. Here, we describe a method that can assign function from structure without the use of algorithms reliant upon alignments. Using simple attributes that can be calculated from any crystal structure, such as secondary structure content, amino acid propensities, surface properties and ligands, we describe each enzyme in a non-redundant set. The set is split according to Enzyme Classification (EC) number. We combine the predictions of one-class versus one-class support vector machine models to make overall assignments of EC number to an accuracy of 35 to 60 the utility of simple structural attributes in protein function prediction and shed light on the link between structure and function. We apply our methods to predict the function of every currently unclassified protein in the Protein Data Bank. Keywords: biosvm
[Ding2005Minimum]	Chris Ding and Hanchuan Peng. Minimum redundancy feature selection from microarray gene expression data. J Bioinform Comput Biol, 3(2):185-205, Apr 2005. [ bib ] How to select a small subset out of the thousands of genes in microarray data is important for accurate classification of phenotypes. Widely used methods typically rank genes according to their differential expressions among phenotypes and pick the top-ranked genes. We observe that feature sets so obtained have certain redundancy and study methods to minimize it. We propose a minimum redundancy - maximum relevance (MRMR) feature selection framework. Genes selected via MRMR provide a more balanced coverage of the space and capture broader characteristics of phenotypes. They lead to significantly improved class predictions in extensive experiments on 6 gene expression data sets: NCI, Lymphoma, Lung, Child Leukemia, Leukemia, and Colon. Improvements are observed consistently among 4 classification methods: Naive Bayes, Linear discriminant analysis, Logistic regression, and Support vector machines. SUPPLEMENTARY: The top 60 MRMR genes for each of the datasets are listed in http://crd.lbl.gov/ cding/MRMR/. More information related to MRMR methods can be found at http://www.hpeng.net/. Keywords: Adult, Aged, Aging, Algorithms, Animals, Apoptosis, Artificial Intelligence, Automated, Biological, Bone Marrow, Breast Neoplasms, Classification, Cluster Analysis, Comparative Study, Computer Simulation, Computer-Assisted, Diagnosis, Dose-Response Relationship, Drug, Female, Foot, Gait, Gene Expression Profiling, Gene Expression Regulation, Gene Silencing, Genetic Vectors, Humans, Image Interpretation, Information Storage and Retrieval, Kidney, Liver, Logistic Models, Male, Messenger, Models, Myocardium, Neoplasms, Non-U.S. Gov't, Oligonucleotide Array Sequence Analysis, Pattern Recognition, Pharmaceutical Preparations, Polymerase Chain Reaction, Principal Component Analysis, Proteins, RNA, Rats, Reproducibility of Results, Research Support, Sensitivity and Specificity, Small Interfering, Sprague-Dawley, Statistical, Subcellular Fractions, Unknown Primary, 15852500
[Dhingra2005Substantial]	Vikas Dhingra, Mukta Gupta, Tracy Andacht, and Zhen F Fu. New frontiers in proteomics research: a perspective. Int. J. Pharm., 299(1-2):1-18, Aug 2005. [ bib \| DOI \| http ] Substantial advances have been made in the fundamental understanding of human biology, ranging from DNA structure to identification of diseases associated with genetic abnormalities. Genome sequence information is becoming available in unprecedented amounts. The absence of a direct functional correlation between gene transcripts and their corresponding proteins, however, represents a significant roadblock for improving the efficiency of biological discoveries. The success of proteomics depends on the ability to identify and analyze protein products in a cell or tissue, and this is reliant on the application of several key technologies. Proteomics is in its exponential growth phase. Two-dimensional electrophoresis complemented with mass spectrometry provides a global view of the state of the proteins from the sample. Protein identification is a requirement to understand their functional diversity. Subtle differences in protein structure and function can contribute to complexity and diversity of life. This review focuses on the progress and the applications of proteomics science with special reference to integration of the evolving technologies involved to address biological questions. Keywords: Computational Biology; Electrophoresis, Gel, Two-Dimensional; Humans; Peptide Mapping; Protein Interaction Mapping; Proteomics; Spectrometry, Mass, Matrix-Assisted Laser Desorption-Ionization
[Degroeve2005SpliceMachine]	S. Degroeve, Y. Saeys, B. De Baets, P. Rouze, and Y. Van de Peer. SpliceMachine: predicting splice sites from high-dimensional local context representations. Bioinformatics, 21:1332-1338, 2005. [ bib \| DOI \| http \| .pdf ] Motivation: In this age of complete genome sequencing, finding the location and structure of genes is crucial for further molecular research. The accurate prediction of intron boundaries largely facilitates the correct prediction of gene structure in nuclear genomes. Many tools for localizing these boundaries on DNA sequences have been developed and are available to researchers through the internet. Nevertheless, these tools still make many false positive predictions. Results: This manuscript presents a novel publicly available splice site prediction tool named SpliceMachine that (i) shows state-of-the-art prediction performance on Arabidopsis thaliana and human sequences, (ii) performs a computationally fast annotation, and (iii) can be trained by the user on its own data. Availability: Results, figures and software are available at http://bioinformatics.psb.ugent.be/supplementary_data/. Keywords: biosvm
[Cuturi2005context-tree]	M. Cuturi and J.-P. Vert. The context-tree kernel for strings. Neural Network., 18(4):1111-1123, 2005. [ bib \| DOI \| http \| .pdf ] We propose a new kernel for strings which borrows ideas and techniques from information theory and data compression. This kernel can be used in combination with any kernel method, in particular Support Vector Machines for string classification, with notable applications in proteomics. By using a Bayesian averaging framework with conjugate priors on a class of Markovian models known as probabilistic suffix trees or context-trees, we compute the value of this kernel in linear time and space while only using the information contained in the spectrum of the considered strings. This is ensured through an adaptation of a compression method known as the context-tree weighting algorithm. Encouraging classification results are reported on a standard protein homology detection experiment, showing that the context-tree kernel performs well with respect to other state-of-the-art methods while using no biological prior knowledge. Keywords: biosvm
[Cole2005Comparing]	Jason C Cole, Christopher W Murray, J. Willem M Nissink, Richard D Taylor, and Robin Taylor. Comparing protein-ligand docking programs is difficult. Proteins, 60(3):325-332, Aug 2005. [ bib \| DOI \| http ] There is currently great interest in comparing protein-ligand docking programs. A review of recent comparisons shows that it is difficult to draw conclusions of general applicability. Statistical hypothesis testing is required to ensure that differences in pose-prediction success rates and enrichment rates are significant. Numerical measures such as root-mean-square deviation need careful interpretation and may profitably be supplemented by interaction-based measures and visual inspection of dockings. Test sets must be of appropriate diversity and of good experimental reliability. The effects of crystal-packing interactions may be important. The method used for generating starting ligand geometries and positions may have an appreciable effect on docking results. For fair comparison, programs must be given search problems of equal complexity (e.g. binding-site regions of the same size) and approximately equal time in which to solve them. Comparisons based on rescoring require local optimization of the ligand in the space of the new objective function. Re-implementations of published scoring functions may give significantly different results from the originals. Ostensibly minor details in methodology may have a profound influence on headline success rates. Keywords: Algorithms; Artificial Intelligence; Binding Sites; Computational Biology, methods; Computer Simulation; Crystallization; Crystallography, X-Ray; Databases, Protein; Ligands; Models, Molecular; Molecular Structure; Programming Languages; Protein Binding; Proteins, chemistry; Proteomics, methods; Reproducibility of Results; Software
[Chen2005Understanding]	Y. Chen and D. Xu. Understanding protein dispensability through machine-learning analysis of high-throughput data. Bioinformatics, 21:575-581, Mar 2005. [ bib \| DOI \| http \| .pdf ] Motivation: Protein dispensability is fundamental to understanding of gene function and evolution. Recent advances in generating high-throughput data such as genomic sequence data, protein-protein interaction data, gene-expression data, and growth-rate data of mutants allow us to investigate protein dispensability systematically at the genome scale. Results: In our studies, protein dispensability is represented as a fitness score that is measured by the growth rate of gene-deletion mutants. Through analyses of high-throughput data in yeast Saccharomyces cerevisiae, we found that a protein's dispensability had significant correlations with its evolutionary rate and duplication rate, as well as its connectivity in protein-protein interaction network and gene-expression correlation network. Neural network and support vector machine were applied to predict protein dispensability through high-throughput data. Our studies shed some light on global characteristics of protein dispensability and evolution. Availability: The original datasets for protein dispensability analysis and prediction, together with related scripts, are available at http://digbio.missouri.edu/ ychen/ProDispen/. Keywords: biosvm
[Cavalieri2005]	D. Cavalieri and C. De Filippo. Bioinformatic methods for integrating whole-genome expression results into cellular networks. Drug Discov Today, 10(10):727-34, 2005. [ bib ] Extracting a comprehensive overview from the huge amount of information arising from whole-genome analyses is a significant challenge. This review critically surveys the state of the art methods that are used to connect information from functional genomic studies to biological function. Cluster analysis methods for inferring the correlation between genes are discussed, as are the methods for integrating gene expression information with existing information on biological pathways and the methods that combine cluster analysis with biological information to reconstruct novel biological networks. Keywords: Cluster Analysis Computational Biology/methods/organization & administration/trends Genomics/methods/organization & administration/trends Humans Oligonucleotide Array Sequence Analysis/methods
[Capriotti2005I-Mutant]	E. Capriotti, P. Fariselli, and R. Casadio. I-Mutant2.0: predicting stability changes upon mutation from the protein sequence or structure. Nucleic Acids Res., 33(Web Server issue):W306-10, Jul 2005. [ bib \| DOI \| http \| www: ] I-Mutant2.0 is a support vector machine (SVM)-based tool for the automatic prediction of protein stability changes upon single point mutations. I-Mutant2.0 predictions are performed starting either from the protein structure or, more importantly, from the protein sequence. This latter task, to the best of our knowledge, is exploited for the first time. The method was trained and tested on a data set derived from ProTherm, which is presently the most comprehensive available database of thermodynamic experimental data of free energy changes of protein stability upon mutation under different conditions. I-Mutant2.0 can be used both as a classifier for predicting the sign of the protein stability change upon mutation and as a regression estimator for predicting the related DeltaDeltaG values. Acting as a classifier, I-Mutant2.0 correctly predicts (with a cross-validation procedure) 80% or 77% of the data set, depending on the usage of structural or sequence information, respectively. When predicting DeltaDeltaG values associated with mutations, the correlation of predicted with expected/experimental values is 0.71 (with a standard error of 1.30 kcal/mol) and 0.62 (with a standard error of 1.45 kcal/mol) when structural or sequence information are respectively adopted. Our web interface allows the selection of a predictive mode that depends on the availability of the protein structure and/or sequence. In this latter case, the web server requires only pasting of a protein sequence in a raw format. We therefore introduce I-Mutant2.0 as a unique and valuable helper for protein design, even when the protein structure is not yet known with atomic resolution. 
Availability: http://gpcr.biocomp.unibo.it/cgi/predictors/I-Mutant2.0/I-Mutant2.0.cgi. Keywords: biosvm
[Burckin2005Exploring]	T. Burckin, R. Nagel, Y. Mandel-Gutfreund, L. Shiue, T. A. Clark, J.-L. Chong, T.-H. Chang, S. Squazzo, G. Hartzog, and M. Ares. Exploring functional relationships between components of the gene expression machinery. Nat. Struct. Mol. Biol., 12(2):175-82, Feb 2005. [ bib \| DOI \| http \| .pdf ] Eukaryotic gene expression requires the coordinated activity of many macromolecular machines including transcription factors and RNA polymerase, the spliceosome, mRNA export factors, the nuclear pore, the ribosome and decay machineries. Yeast carrying mutations in genes encoding components of these machineries were examined using microarrays to measure changes in both pre-mRNA and mRNA levels. We used these measurements as a quantitative phenotype to ask how steps in the gene expression pathway are functionally connected. A multiclass support vector machine was trained to recognize the gene expression phenotypes caused by these mutations. In several cases, unexpected phenotype assignments by the computer revealed functional roles for specific factors at multiple steps in the gene expression pathway. The ability to resolve gene expression pathway phenotypes provides insight into how the major machineries of gene expression communicate with each other. Keywords: biosvm microarray
[Bunescu2005Comparative]	R. Bunescu, R. Ge, R. J. Kate, E. M. Marcotte, R. J. Mooney, A. K. Ramani, and Y. W. Wong. Comparative experiments on learning information extractors for proteins and their interactions. Artif. Intell. Med., 33(2):139-55, Feb 2005. [ bib \| DOI \| http \| .pdf ] OBJECTIVE: Automatically extracting information from biomedical text holds the promise of easily consolidating large amounts of biological knowledge in computer-accessible form. This strategy is particularly attractive for extracting data relevant to genes of the human genome from the 11 million abstracts in Medline. However, extraction efforts have been frustrated by the lack of conventions for describing human genes and proteins. We have developed and evaluated a variety of learned information extraction systems for identifying human protein names in Medline abstracts and subsequently extracting information on interactions between the proteins. METHODS AND MATERIAL: We used a variety of machine learning methods to automatically develop information extraction systems for extracting information on gene/protein name, function and interactions from Medline abstracts. We present cross-validated results on identifying human proteins and their interactions by training and testing on a set of approximately 1000 manually-annotated Medline abstracts that discuss human genes/proteins. RESULTS: We demonstrate that machine learning approaches using support vector machines and maximum entropy are able to identify human proteins with higher accuracy than several previous approaches. We also demonstrate that various rule induction methods are able to identify protein interactions with higher precision than manually-developed rules. CONCLUSION: Our results show that it is promising to use machine learning to automatically build systems for extracting information from biomedical text. 
The results also give a broad picture of the relative strengths of a wide variety of methods when tested on a reasonably large human-annotated corpus. Keywords: biosvm
[Bui2005Automated]	Huynh-Hoa Bui, John Sidney, Bjoern Peters, Muthuraman Sathiamurthy, Asabe Sinichi, Kelly-Anne Purton, Bianca R Mothé, Francis V Chisari, David I Watkins, and Alessandro Sette. Automated generation and evaluation of specific MHC binding predictive tools: ARB matrix applications. Immunogenetics, 57(5):304-314, Jun 2005. [ bib \| DOI \| http ] Prediction of which peptides can bind major histocompatibility complex (MHC) molecules is commonly used to assist in the identification of T cell epitopes. However, because of the large numbers of different MHC molecules of interest, each associated with different predictive tools, tool generation and evaluation can be a very resource intensive task. A methodology commonly used to predict MHC binding affinity is the matrix or linear coefficients method. Herein, we described Average Relative Binding (ARB) matrix methods that directly predict IC(50) values allowing combination of searches involving different peptide sizes and alleles into a single global prediction. A computer program was developed to automate the generation and evaluation of ARB predictive tools. Using an in-house MHC binding database, we generated a total of 85 and 13 MHC class I and class II matrices, respectively. Results from the automated evaluation of tool efficiency are presented. We anticipate that this automation framework will be generally applicable to the generation and evaluation of large numbers of MHC predictive methods and tools, and will be of value to centralize and rationalize the process of evaluation of MHC predictions. MHC binding predictions based on ARB matrices were made available at http://epitope.liai.org:8080/matrix web server. Keywords: Animals; Binding Sites; Computer Simulation; Databases, Protein; Epitopes; Histocompatibility Antigens; Humans; Major Histocompatibility Complex; Models, Biological; Protein Binding
[Briem2005Classifying]	Hans Briem and Judith Günther. Classifying "kinase inhibitor-likeness" by using machine-learning methods. ChemBioChem, 6(3):558-66, Mar 2005. [ bib \| DOI \| http \| .pdf ] By using an in-house data set of small-molecule structures, encoded by Ghose-Crippen parameters, several machine learning techniques were applied to distinguish between kinase inhibitors and other molecules with no reported activity on any protein kinase. All four approaches pursued-support-vector machines (SVM), artificial neural networks (ANN), k nearest neighbor classification with GA-optimized feature selection (GA/kNN), and recursive partitioning (RP)-proved capable of providing a reasonable discrimination. Nevertheless, substantial differences in performance among the methods were observed. For all techniques tested, the use of a consensus vote of the 13 different models derived improved the quality of the predictions in terms of accuracy, precision, recall, and F1 value. Support-vector machines, followed by the GA/kNN combination, outperformed the other techniques when comparing the average of individual models. By using the respective majority votes, the prediction of neural networks yielded the highest F1 value, followed by SVMs. Keywords: biosvm chemoinformatics
[Bradford2005Improved]	James R Bradford and David R Westhead. Improved prediction of protein-protein binding sites using a support vector machines approach. Bioinformatics, 21(8):1487-94, Apr 2005. [ bib \| DOI \| http \| .pdf ] MOTIVATION: Structural genomics projects are beginning to produce protein structures with unknown function, therefore, accurate, automated predictors of protein function are required if all these structures are to be properly annotated in reasonable time. Identifying the interface between two interacting proteins provides important clues to the function of a protein and can reduce the search space required by docking algorithms to predict the structures of complexes. RESULTS: We have combined a support vector machine (SVM) approach with surface patch analysis to predict protein-protein binding sites. Using a leave-one-out cross-validation procedure, we were able to successfully predict the location of the binding site on 76% of our dataset made up of proteins with both transient and obligate interfaces. With heterogeneous cross-validation, where we trained the SVM on transient complexes to predict on obligate complexes (and vice versa), we still achieved comparable success rates to the leave-one-out cross-validation suggesting that sufficient properties are shared between transient and obligate interfaces. AVAILABILITY: A web application based on the method can be found at http://www.bioinformatics.leeds.ac.uk/ppi_pred. The dataset of 180 proteins used in this study is also available via the same web site. CONTACT: westhead@bmb.leeds.ac.uk SUPPLEMENTARY INFORMATION: http://www.bioinformatics.leeds.ac.uk/ppi-pred/supp-material. Keywords: biosvm
[Borgwardt2005Protein]	K.M. Borgwardt, C.S. Ong, S. Schönauer, S.V.N. Vishwanathan, A.J. Smola, and H.-P. Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(Suppl. 1):i47-i56, Jun 2005. [ bib \| DOI \| http \| .pdf ] MOTIVATION: Computational approaches to protein function prediction infer protein function by finding proteins with similar sequence, structure, surface clefts, chemical properties, amino acid motifs, interaction partners or phylogenetic profiles. We present a new approach that combines sequential, structural and chemical information into one graph model of proteins. We predict functional class membership of enzymes and non-enzymes using graph kernels and support vector machine classification on these protein graphs. RESULTS: Our graph model, derivable from protein sequence and structure only, is competitive with vector models that require additional protein information, such as the size of surface pockets. If we include this extra information into our graph model, our classifier yields significantly higher accuracy levels than the vector models. Hyperkernels allow us to select and to optimally combine the most relevant node attributes in our protein graphs. We have laid the foundation for a protein function prediction system that integrates protein information from various sources efficiently and effectively. AVAILABILITY: More information available via www.dbs.ifi.lmu.de/Mitarbeiter/borgwardt.html. CONTACT: borgwardt@dbs.ifi.lmu.de. Keywords: biosvm
[Bordner2005Statistical]	Andrew J Bordner and Ruben Abagyan. Statistical analysis and prediction of protein-protein interfaces. Proteins, 60(3):353-66, Aug 2005. [ bib \| DOI \| http \| .pdf ] Predicting protein-protein interfaces from a three-dimensional structure is a key task of computational structural proteomics. In contrast to geometrically distinct small molecule binding sites, protein-protein interfaces are notoriously difficult to predict. We generated a large nonredundant data set of 1494 true protein-protein interfaces using biological symmetry annotation where necessary. The data set was carefully analyzed and a Support Vector Machine was trained on a combination of a new robust evolutionary conservation signal with the local surface properties to predict protein-protein interfaces. Fivefold cross validation verifies the high sensitivity and selectivity of the model. As much as 97% of the predicted patches had an overlap with the true interface patch while only 22% of the surface residues were included in an average predicted patch. The model allowed the identification of potential new interfaces and the correction of mislabeled oligomeric states. Keywords: biosvm
[Bhasin2005Pcleavage]	M. Bhasin and G. P. S. Raghava. Pcleavage: an SVM based method for prediction of constitutive proteasome and immunoproteasome cleavage sites in antigenic sequences. Nucleic Acids Res., 33(Web Server issue):W202-7, Jul 2005. [ bib \| DOI \| http \| .pdf ] This manuscript describes a support vector machine based method for the prediction of constitutive as well as immunoproteasome cleavage sites in antigenic sequences. This method achieved Matthews correlation coefficients of 0.54 and 0.43 on in vitro and major histocompatibility complex ligand data, respectively. This shows that the performance of our method is comparable to that of the NetChop method, which is currently considered to be the best method for proteasome cleavage site prediction. Based on the method, a web server, Pcleavage, has also been developed. This server accepts protein sequences in any standard format and presents results in a user-friendly format. The server is available for free use by all academic users at the URL http://www.imtech.res.in/raghava/pcleavage/ or http://bioinformatics.uams.edu/mirror/pcleavage/. Keywords: biosvm immunoinformatics
[Bhasin2005GPCRsclass]	M. Bhasin and G. P. S. Raghava. GPCRsclass: a web tool for the classification of amine type of G-protein-coupled receptors. Nucleic Acids Res., 33(Web Server issue):W143-7, Jul 2005. [ bib \| DOI \| http \| .pdf ] The receptors of amine subfamily are specifically major drug targets for therapy of nervous disorders and psychiatric diseases. The recognition of novel amine type of receptors and their cognate ligands is of paramount interest for pharmaceutical companies. In the past, Chou and co-workers have shown that different types of amine receptors are correlated with their amino acid composition and are predictable on its basis with considerable accuracy [Elrod and Chou (2002) Protein Eng., 15, 713-715]. This motivated us to develop a better method for the recognition of novel amine receptors and for their further classification. The method was developed on the basis of amino acid composition and dipeptide composition of proteins using support vector machine. The method was trained and tested on 167 proteins of amine subfamily of G-protein-coupled receptors (GPCRs). The method discriminated amine subfamily of GPCRs from globular proteins with Matthews correlation coefficient of 0.98 and 0.99 using amino acid composition and dipeptide composition, respectively. In classifying different types of amine receptors using amino acid composition and dipeptide composition, the method achieved an accuracy of 89.8 and 96.4%, respectively. The performance of the method was evaluated using 5-fold cross-validation. The dipeptide composition based method predicted 67.6% of protein sequences with an accuracy of 100% with a reliability index ≥5. A web server GPCRsclass has been developed for predicting amine-binding receptors from its amino acid sequence [http://www.imtech.res.in/raghava/gpcrsclass/ and http://bioinformatics.uams.edu/raghava/gpersclass/ (mirror site)]. Keywords: biosvm
[Bernardo2005Chemogenomica]	D. di Bernardo, M.J. Thompson, T.S. Gardner, S.E. Chobot, E.L. Eastwood, A.P. Wojtovich, S.J. Elliott, S.E. Schaus, and J.J. Collins. Chemogenomic profiling on a genome-wide scale using reverse-engineered gene networks. Nat Biotechnol, 23(3):377-383, Mar 2005. [ bib \| DOI \| http ] A major challenge in drug discovery is to distinguish the molecular targets of a bioactive compound from the hundreds to thousands of additional gene products that respond indirectly to changes in the activity of the targets. Here, we present an integrated computational-experimental approach for computing the likelihood that gene products and associated pathways are targets of a compound. This is achieved by filtering the mRNA expression profile of compound-exposed cells using a reverse-engineered model of the cell's gene regulatory network. We apply the method to a set of 515 whole-genome yeast expression profiles resulting from a variety of treatments (compounds, knockouts and induced expression), and correctly enrich for the known targets and associated pathways in the majority of compounds examined. We demonstrate our approach with PTSB, a growth inhibitory compound with a previously unknown mode of action, by predicting and validating thioredoxin and thioredoxin reductase as its target. Keywords: Algorithms; Artificial Intelligence; Computer Simulation; Drug Delivery Systems; Drug Design; Gene Expression Profiling; Gene Expression Regulation; Models, Biological; Models, Statistical; Protein Engineering; Protein Interaction Mapping; Saccharomyces cerevisiae; Saccharomyces cerevisiae Proteins; Signal Transduction; Thioredoxin-Disulfide Reductase; Thioredoxins
[Ben-Hur2005Kernel]	A. Ben-Hur and W. S. Noble. Kernel methods for predicting protein-protein interactions. Bioinformatics, 21(Suppl. 1):i38-i46, Jun 2005. [ bib \| DOI \| http \| .pdf ] MOTIVATION: Despite advances in high-throughput methods for discovering protein-protein interactions, the interaction networks of even well-studied model organisms are sketchy at best, highlighting the continued need for computational methods to help direct experimentalists in the search for novel interactions. RESULTS: We present a kernel method for predicting protein-protein interactions using a combination of data sources, including protein sequences, Gene Ontology annotations, local properties of the network, and homologous interactions in other species. Whereas protein kernels proposed in the literature provide a similarity between single proteins, prediction of interactions requires a kernel between pairs of proteins. We propose a pairwise kernel that converts a kernel between single proteins into a kernel between pairs of proteins, and we illustrate the kernel's effectiveness in conjunction with a support vector machine classifier. Furthermore, we obtain improved performance by combining several sequence-based kernels based on k-mer frequency, motif and domain content and by further augmenting the pairwise sequence kernel with features that are based on other sources of data. We apply our method to predict physical interactions in yeast using data from the BIND database. At a false positive rate of 1% the classifier retrieves close to 80% of a set of trusted interactions. We thus demonstrate the ability of our method to make accurate predictions despite the sizeable fraction of false positives that are known to exist in interaction databases. AVAILABILITY: The classification experiments were performed using PyML available at http://pyml.sourceforge.net. Data are available at: http://noble.gs.washington.edu/proj/sppi CONTACT: asa@gs.washington.edu. Keywords: biosvm
[Beal2005Bayesian]	M. J. Beal, F. Falciani, Z. Ghahramani, C. Rangel, and D. L. Wild. A Bayesian approach to reconstructing genetic regulatory networks with hidden factors. Bioinformatics, 21(3):349-356, Feb 2005. [ bib \| DOI \| http \| .pdf ] MOTIVATION: We have used state-space models (SSMs) to reverse engineer transcriptional networks from highly replicated gene expression profiling time series data obtained from a well-established model of T cell activation. SSMs are a class of dynamic Bayesian networks in which the observed measurements depend on some hidden state variables that evolve according to Markovian dynamics. These hidden variables can capture effects that cannot be directly measured in a gene expression profiling experiment, for example: genes that have not been included in the microarray, levels of regulatory proteins, the effects of mRNA and protein degradation, etc. RESULTS: We have approached the problem of inferring the model structure of these state-space models using both classical and Bayesian methods. In our previous work, a bootstrap procedure was used to derive classical confidence intervals for parameters representing 'gene-gene' interactions over time. In this article, variational approximations are used to perform the analogous model selection task in the Bayesian context. Certain interactions are present in both the classical and the Bayesian analyses of these regulatory networks. The resulting models place JunB and JunD at the centre of the mechanisms that control apoptosis and proliferation. These mechanisms are key for clonal expansion and for controlling the long term behavior (e.g. programmed cell death) of these cells. AVAILABILITY: Supplementary data is available at http://public.kgi.edu/wild/index.htm and Matlab source code for variational Bayesian learning of SSMs is available at http://www.cse.ebuffalo.edu/faculty/mbeal/software.html. Keywords: biogm
[Bao2005Identifying]	Lei Bao. Identifying genes related to chemosensitivity using support vector machine. Methods Mol Med, 111:233-40, 2005. [ bib ] In an effort to identify genes involved in chemosensitivity and to evaluate the functional relationships between genes and anticancer drugs acting by the same mechanism, a supervised machine learning approach called support vector machine (SVM) is used to associate genes with any of five predefined anticancer drug mechanistic categories. The drug activity profiles are used as training examples to train the SVM and then the gene expression profiles are used as test examples to predict their associated mechanistic categories. This method of correlating drugs and genes provides a strategy for finding novel biologically significant relationships for molecular pharmacology. Keywords: biosvm
[Atalay2005Implicit]	V. Atalay and R. Cetin-Atalay. Implicit motif distribution based hybrid computational kernel for sequence classification. Bioinformatics, 21(8):1429-1436, Apr 2005. [ bib \| DOI \| http \| .pdf ] MOTIVATION: We designed a general computational kernel for classification problems that require specific motif extraction and search from sequences. Instead of searching for explicit motifs, our approach finds the distribution of implicit motifs and uses it as a feature for classification. The implicit motif distribution approach may be used as a modus operandi for bioinformatics problems that require specific motif extraction and search, which is otherwise computationally prohibitive. RESULTS: A system named P2SL that infers protein subcellular targeting was developed through this computational kernel. Targeting-signal was modeled by the distribution of subsequence occurrences (implicit motifs) using self-organizing maps. The boundaries among the classes were then determined with a set of support vector machines. The P2SL hybrid computational system achieved approximately 81% prediction accuracy over ER targeted, cytosolic, mitochondrial and nuclear protein localization classes. P2SL additionally offers the distribution potential of proteins among localization classes, which is particularly important for proteins that shuttle between nucleus and cytosol. AVAILABILITY: http://staff.vbi.vt.edu/volkan/p2sl and http://www.i-cancer.fen.bilkent.edu.tr/p2sl CONTACT: rengul@bilkent.edu.tr. Keywords: biosvm
[Arodz2005Pattern]	Tomasz Arodź, Marcin Kurdziel, Erik O D Sevre, and David A Yuen. Pattern recognition techniques for automatic detection of suspicious-looking anomalies in mammograms. Comput. Methods Programs Biomed., 79(2):135-49, Aug 2005. [ bib \| DOI \| http \| .pdf ] We have employed two pattern recognition methods used commonly for face recognition in order to analyse digital mammograms. The methods are based on novel classification schemes, the AdaBoost and the support vector machines (SVM). A number of tests have been carried out to evaluate the accuracy of these two algorithms under different circumstances. Results for the AdaBoost classifier method are promising, especially for classifying mass-type lesions. In the best case the algorithm achieved accuracy of 76% for all lesion types and 90% for masses only. The SVM based algorithm did not perform as well. In order to achieve a higher accuracy for this method, we should choose image features that are better suited for analysing digital mammograms than the currently used ones. Keywords: biosvm image
[Arimoto2005Development]	Rieko Arimoto, Madhu-Ashni Prasad, and Eric M Gifford. Development of CYP3A4 inhibition models: comparisons of machine-learning techniques and molecular descriptors. J Biomol Screen, 10(3):197-205, Apr 2005. [ bib \| DOI \| http ] Computational models of cytochrome P450 3A4 inhibition were developed based on high-throughput screening data for 4470 proprietary compounds. Multiple models differentiating inhibitors (IC(50) <3 microM) and noninhibitors were generated using various machine-learning algorithms (recursive partitioning [RP], Bayesian classifier, logistic regression, k-nearest-neighbor, and support vector machine [SVM]) with structural fingerprints and topological indices. Nineteen models were evaluated by internal 10-fold cross-validation and also by an independent test set. Three most predictive models, Barnard Chemical Information (BCI)-fingerprint/SVM, MDL-keyset/SVM, and topological indices/RP, correctly classified 249, 248, and 236 compounds of 291 noninhibitors and 135, 137, and 147 compounds of 179 inhibitors in the validation set. Their overall accuracies were 82%, 82%, and 81%, respectively. Investigating applicability of the BCI/SVM model found a strong correlation between the predictive performance and the structural similarity to the training set. Using Tanimoto similarity index as a confidence measurement for the predictions, the limitation of the extrapolation was 0.7 in the case of the BCI/SVM model. Taking consensus of the 3 best models yielded a further improvement in predictive capability, kappa = 0.65 and accuracy = 83%. The consensus model could also be tuned to minimize either false positives or false negatives depending on the emphasis of the screening. Keywords: biosvm chemoinformatics
[Aphinyanaphongs2005Text]	Yindalon Aphinyanaphongs, Ioannis Tsamardinos, Alexander Statnikov, Douglas Hardin, and Constantin F Aliferis. Text categorization models for high-quality article retrieval in internal medicine. J. Am. Med. Inform. Assoc., 12(2):207-16, 2005. [ bib \| DOI \| http \| .pdf ] OBJECTIVE Finding the best scientific evidence that applies to a patient problem is becoming exceedingly difficult due to the exponential growth of medical publications. The objective of this study was to apply machine learning techniques to automatically identify high-quality, content-specific articles for one time period in internal medicine and compare their performance with previous Boolean-based PubMed clinical query filters of Haynes et al. DESIGN The selection criteria of the ACP Journal Club for articles in internal medicine were the basis for identifying high-quality articles in the areas of etiology, prognosis, diagnosis, and treatment. Naive Bayes, a specialized AdaBoost algorithm, and linear and polynomial support vector machines were applied to identify these articles. MEASUREMENTS The machine learning models were compared in each category with each other and with the clinical query filters using area under the receiver operating characteristic curves, 11-point average recall precision, and a sensitivity/specificity match method. RESULTS In most categories, the data-induced models have better or comparable sensitivity, specificity, and precision than the clinical query filters. The polynomial support vector machine models perform the best among all learning methods in ranking the articles as evaluated by area under the receiver operating curve and 11-point average recall precision. 
CONCLUSION This research shows that, using machine learning methods, it is possible to automatically build models for retrieving high-quality, content-specific articles using inclusion or citation by the ACP Journal Club as a gold standard in a given time period in internal medicine that perform better than the 1994 PubMed clinical query filters. Keywords: biosvm nlp
[Aires-de-Sousa2005Prediction]	J. Aires-de-Sousa and J. Gasteiger. Prediction of enantiomeric excess in a combinatorial library of catalytic enantioselective reactions. J Comb Chem, 7(2):298-301, 2005. [ bib \| DOI \| http \| .pdf ] A quantitative structure-enantioselectivity relationship was established for a combinatorial library of enantioselective reactions performed by addition of diethyl zinc to benzaldehyde. Chiral catalysts and additives were encoded by their chirality codes and presented as input to neural networks. The networks were trained to predict the enantiomeric excess. With independent test sets, predictions of enantiomeric excess could be made with an average error as low as 6% ee. Multilinear regression, perceptrons, and support vector machines were also evaluated as modeling tools. The method is of interest for the computer-aided design of combinatorial libraries involving chiral compounds or enantioselective reactions. This is the first example of a quantitative structure-property relationship based on chirality codes. Keywords: biosvm chemoinformatics
[Zhang2005Improved]	Qidong Zhang, Sukjoon Yoon, and William J Welsh. Improved method for predicting beta-turn using support vector machine. Bioinformatics, 21(10):2370-4, May 2005. [ bib \| DOI \| http \| .pdf ] MOTIVATION: Numerous methods for predicting beta-turns in proteins have been developed based on various computational schemes. Here, we introduce a new method of beta-turn prediction that uses the support vector machine (SVM) algorithm together with predicted secondary structure information. Various parameters from the SVM have been adjusted to achieve optimal prediction performance. RESULTS: The SVM method achieved excellent performance as measured by the Matthews correlation coefficient (MCC = 0.45) using a 7-fold cross validation on a database of 426 non-homologous protein chains. To our best knowledge, this MCC value is the highest achieved so far for predicting beta-turn. The overall prediction accuracy Qtotal was 77.3%, which is the best among the existing prediction methods. Among its unique attractive features, the present SVM method avoids overtraining and compresses information and provides a predicted reliability index. Keywords: biosvm
[Yu2005Ovarian]	J. S. Yu, S. Ongarello, R. Fiedler, X. W. Chen, G. Toffolo, C. Cobelli, and Z. Trajanoski. Ovarian cancer identification based on dimensionality reduction for high-throughput mass spectrometry data. Bioinformatics, 21(10):2200-9, May 2005. [ bib \| DOI \| http \| .pdf ] MOTIVATION: High-throughput and high-resolution mass spectrometry instruments are increasingly used for disease classification and therapeutic guidance. However, the analysis of immense amount of data poses considerable challenges. We have therefore developed a novel method for dimensionality reduction and tested on a published ovarian high-resolution SELDI-TOF dataset. RESULTS: We have developed a four-step strategy for data preprocessing based on: (1) binning, (2) Kolmogorov-Smirnov test, (3) restriction of coefficient of variation and (4) wavelet analysis. Subsequently, support vector machines were used for classification. The developed method achieves an average sensitivity of 97.38% (sd = 0.0125) and an average specificity of 93.30% (sd = 0.0174) in 1000 independent k-fold cross-validations, where k = 2, ..., 10. AVAILABILITY: The software is available for academic and non-commercial institutions. Keywords: biosvm proteomics
[Tothill2005expression-based]	Richard W Tothill, Adam Kowalczyk, Danny Rischin, Alex Bousioutas, Izhak Haviv, Ryan K van Laar, Paul M Waring, John Zalcberg, Robyn Ward, Andrew V Biankin, Robert L Sutherland, Susan M Henshall, Kwun Fong, Jonathan R Pollack, David D L Bowtell, and Andrew J Holloway. An expression-based site of origin diagnostic method designed for clinical application to cancer of unknown origin. Cancer Res., 65(10):4031-40, May 2005. [ bib \| DOI \| http \| .pdf ] Gene expression profiling offers a promising new technique for the diagnosis and prognosis of cancer. We have applied this technology to build a clinically robust site of origin classifier with the ultimate aim of applying it to determine the origin of cancer of unknown primary (CUP). A single cDNA microarray platform was used to profile 229 primary and metastatic tumors representing 14 tumor types and multiple histologic subtypes. This data set was subsequently used for training and validation of a support vector machine (SVM) classifier, demonstrating 89% accuracy using a 13-class model. Further, we show the translation of a five-class classifier to a quantitative PCR-based platform. Selecting 79 optimal gene markers, we generated a quantitative-PCR low-density array, allowing the assay of both fresh-frozen and formalin-fixed paraffin-embedded (FFPE) tissue. Data generated using both quantitative PCR and microarray were subsequently used to train and validate a cross-platform SVM model with high prediction accuracy. Finally, we applied our SVM classifiers to 13 cases of CUP. We show that the microarray SVM classifier was capable of making high confidence predictions in 11 of 13 cases. These predictions were supported by comprehensive review of the patients' clinical histories. Keywords: biosvm microarray
[Teramoto2005Prediction]	Reiji Teramoto, Mikio Aoki, Toru Kimura, and Masaharu Kanaoka. Prediction of siRNA functionality using generalized string kernel and support vector machine. FEBS Lett., 579(13):2878-82, May 2005. [ bib \| DOI \| http \| .pdf ] Small interfering RNAs (siRNAs) are becoming widely used for sequence-specific gene silencing in mammalian cells, but designing an effective siRNA is still a challenging task. In this study, we developed an algorithm for predicting siRNA functionality by using generalized string kernel (GSK) combined with support vector machine (SVM). With GSK, siRNA sequences were represented as vectors in a multi-dimensional feature space according to the numbers of subsequences in each siRNA, and subsequently classified with SVM into effective or ineffective siRNAs. We applied this algorithm to published siRNAs, and could classify effective and ineffective siRNAs with 90.6%, 86.2% accuracy, respectively. Keywords: sirna biosvm
[Res2005evolution]	I. Res, I. Mihalek, and O. Lichtarge. An evolution based classifier for prediction of protein interfaces without using protein structures. Bioinformatics, 21(10):2496-501, May 2005. [ bib \| DOI \| http \| .pdf ] MOTIVATION: The number of available protein structures still lags far behind the number of known protein sequences. This makes it important to predict which residues participate in protein-protein interactions using only sequence information. Few studies have tackled this problem until now. RESULTS: We applied support vector machines to sequences in order to generate a classification of all protein residues into those that are part of a protein interface and those that are not. For the first time evolutionary information was used as one of the attributes and this inclusion of evolutionary importance rankings improves the classification. Leave-one-out cross-validation experiments show that prediction accuracy reaches 64%. Keywords: biosvm
[Plewczynski2005AutoMotif]	Dariusz Plewczynski, Adrian Tkacz, Lucjan Stanislaw Wyrwicz, and Leszek Rychlewski. AutoMotif server: prediction of single residue post-translational modifications in proteins. Bioinformatics, 21(10):2525-7, May 2005. [ bib \| DOI \| http \| .pdf ] The AutoMotif Server allows for identification of post-translational modification (PTM) sites in proteins based only on local sequence information. The local sequence preferences of short segments around PTM residues are described here as linear functional motifs (LFMs). Sequence models for all types of PTMs are trained by support vector machine on short-sequence fragments of proteins in the current release of Swiss-Prot database (phosphorylation by various protein kinases, sulfation, acetylation, methylation, amidation, etc.). The accuracy of the identification is estimated using the standard leave-one-out procedure. The sensitivities for all types of short LFMs are in the range of 70%. AVAILABILITY: The AutoMotif Server is available free for academic use at http://automotif.bioinfo.pl/ Keywords: biosvm
[OFlanagan2005Non]	R. A. O'Flanagan, G. Paillard, R. Lavery, and A. M. Sengupta. Non-additivity in protein-DNA binding. Bioinformatics, 21(10):2254-63, May 2005. [ bib \| DOI \| http \| .pdf ] MOTIVATION: Localizing protein binding sites within genomic DNA is of considerable importance, but remains difficult for protein families, such as transcription factors, which have loosely defined target sequences. It is generally assumed that protein affinity for DNA involves additive contributions from successive nucleotide pairs within the target sequence. This is not necessarily true, and non-additive effects have already been experimentally demonstrated in a small number of cases. The principal origin of non-additivity involves the so-called indirect component of protein-DNA recognition which is related to the sequence dependence of DNA deformation induced during complex formation. Non-additive effects are difficult to study because they require the identification of many more binding sequences than are normally necessary for describing additive specificity (typically via the construction of weight matrices). RESULTS: In the present work we will use theoretically estimated binding energies as a basis for overcoming this problem. Our approach enables us to study the full combinatorial set of sequences for a variety of DNA-binding proteins, make a detailed analysis of non-additive effects and exploit this information to improve binding site predictions using either weight matrices or support vector machines. The results underline the fact that, even in the presence of significant deformation, non-additive effects may involve only a limited number of dinucleotide steps. This information helps to reduce the number of binding sites which need to be identified for successful predictions and to avoid problems of over-fitting. AVAILABILITY: The SVM software is available upon request from the authors. Keywords: biosvm
[Majoros2005Efficient]	W. H. Majoros, L. Pertea, and S. L. Salzberg. Efficient implementation of a generalized pair hidden Markov model for comparative gene finding. Bioinformatics, 21(9):1782-1788, May 2005. [ bib \| DOI \| http \| .pdf ] MOTIVATION: The increased availability of genome sequences of closely related organisms has generated much interest in utilizing homology to improve the accuracy of gene prediction programs. Generalized pair hidden Markov models (GPHMMs) have been proposed as one means to address this need. However, all GPHMM implementations currently available are either closed-source or the details of their operation are not fully described in the literature, leaving a significant hurdle for others wishing to advance the state of the art in GPHMM design. RESULTS: We have developed an open-source GPHMM gene finder, TWAIN, which performs very well on two related Aspergillus species, A.fumigatus and A.nidulans, finding 89% of the exons and predicting 74% of the gene models exactly correctly in a test set of 147 conserved gene pairs. We describe the implementation of this GPHMM and we explicitly address the assumptions and limitations of the system. We suggest possible ways of relaxing those assumptions to improve the utility of the system without sacrificing efficiency beyond what is practical. AVAILABILITY: Available at http://www.tigr.org/software/pirate/twain/twain.html under the open-source Artistic License. Keywords: biogm
[Jonsdottir2005Prediction]	Svava Osk Jónsdóttir, Flemming Steen Jørgensen, and Søren Brunak. Prediction methods and databases within chemoinformatics: emphasis on drugs and drug candidates. Bioinformatics, 21(10):2145-2160, May 2005. [ bib \| DOI \| http ] MOTIVATION: To gather information about available databases and chemoinformatics methods for prediction of properties relevant to the drug discovery and optimization process. RESULTS: We present an overview of the most important databases with 2-dimensional and 3-dimensional structural information about drugs and drug candidates, and of databases with relevant properties. Access to experimental data and numerical methods for selecting and utilizing these data is crucial for developing accurate predictive in silico models. Many interesting predictive methods for classifying the suitability of chemical compounds as potential drugs, as well as for predicting their physico-chemical and ADMET properties have been proposed in recent years. These methods are discussed, and some possible future directions in this rapidly developing field are described. Keywords: Chemistry, Pharmaceutical; Computational Biology; Databases, Factual; Drug Design; Models, Chemical; Models, Molecular; Pharmaceutical Preparations; Structure-Activity Relationship
[Hofmann2005Concept-based]	Oliver Hofmann and Dietmar Schomburg. Concept-based annotation of enzyme classes. Bioinformatics, 21(9):2059-66, May 2005. [ bib \| DOI \| http \| .pdf ] MOTIVATION: Given the explosive growth of biomedical data as well as the literature describing results and findings, it is getting increasingly difficult to keep up to date with new information. Keeping databases synchronized with current knowledge is a time-consuming and expensive task, one which can be alleviated by automatically gathering findings from the literature using linguistic approaches. We describe a method to automatically annotate enzyme classes with disease-related information extracted from the biomedical literature for inclusion in such a database. RESULTS: Enzyme names for the 3901 enzyme classes in the BRENDA database, a repository for quantitative and qualitative enzyme information, were identified in more than 100,000 abstracts retrieved from the PubMed literature database. Phrases in the abstracts were assigned to concepts from the Unified Medical Language System (UMLS) utilizing the MetaMap program, allowing for the identification of disease-related concepts by their semantic fields in the UMLS ontology. Assignments between enzyme classes and diseases were created based on their co-occurrence within a single sentence. False positives could be removed by a variety of filters including minimum number of co-occurrences, removal of sentences containing a negation and the classification of sentences based on their semantic fields by a Support Vector Machine. Verification of the assignments with a manually annotated set of 1500 sentences yielded favorable results of 92% precision at 50% recall, sufficient for inclusion in a high-quality database. AVAILABILITY: Source code is available from the author upon request. SUPPLEMENTARY INFORMATION: ftp.uni-koeln.de/institute/biochemie/pub/brenda/info/diseaseSupp.pdf. Keywords: biosvm
[Bao2005Prediction]	Lei Bao and Yan Cui. Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information. Bioinformatics, 21(10):2185-90, May 2005. [ bib \| DOI \| http \| .pdf ] MOTIVATION: There has been great expectation that the knowledge of an individual's genotype will provide a basis for assessing susceptibility to diseases and designing individualized therapy. Non-synonymous single nucleotide polymorphisms (nsSNPs) that lead to an amino acid change in the protein product are of particular interest because they account for nearly half of the known genetic variations related to human inherited diseases. To facilitate the identification of disease-associated nsSNPs from a large number of neutral nsSNPs, it is important to develop computational tools to predict the phenotypic effects of nsSNPs. RESULTS: We prepared a training set based on the variant phenotypic annotation of the Swiss-Prot database and focused our analysis on nsSNPs having homologous 3D structures. Structural environment parameters derived from the 3D homologous structure as well as evolutionary information derived from the multiple sequence alignment were used as predictors. Two machine learning methods, support vector machine and random forest, were trained and evaluated. We compared the performance of our method with that of the SIFT algorithm, which is one of the best predictive methods to date. An unbiased evaluation study shows that for nsSNPs with sufficient evolutionary information (with not <10 homologous sequences), the performance of our method is comparable with the SIFT algorithm, while for nsSNPs with insufficient evolutionary information (<10 homologous sequences), our method outperforms the SIFT algorithm significantly. These findings indicate that incorporating structural information is critical to achieving good prediction accuracy when sufficient evolutionary information is not available. 
AVAILABILITY: The codes and curated dataset are available at http://compbio.utmem.edu/snp/dataset/ Keywords: biosvm
[Vert2006Kernels]	J.-P. Vert, R. Thurman, and W. S. Noble. Kernels for gene regulatory regions. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Adv. Neural. Inform. Process Syst., volume 18, pages 1401-1408, Cambridge, MA, 2006. MIT Press. [ bib ] Keywords: biosvm
[Salomon2006Predicting]	J. Salomon and D. R. Flower. Predicting Class II MHC-Peptide binding: a kernel based approach using similarity scores. BMC Bioinformatics, 7:501, 2006. [ bib \| DOI \| http ] BACKGROUND: Modelling the interaction between potentially antigenic peptides and Major Histocompatibility Complex (MHC) molecules is a key step in identifying potential T-cell epitopes. For Class II MHC alleles, the binding groove is open at both ends, causing ambiguity in the positional alignment between the groove and peptide, as well as creating uncertainty as to what parts of the peptide interact with the MHC. Moreover, the antigenic peptides have variable lengths, making naive modelling methods difficult to apply. This paper introduces a kernel method that can handle variable length peptides effectively by quantifying similarities between peptide sequences and integrating these into the kernel. RESULTS: The kernel approach presented here shows increased prediction accuracy with a significantly higher number of true positives and negatives on multiple MHC class II alleles, when testing data sets from MHCPEP 1, MCHBN 2, and MHCBench 3. Evaluation by cross validation, when segregating binders and non-binders, produced an average of 0.824 AROC for the MHCBench data sets (up from 0.756), and an average of 0.96 AROC for multiple alleles of the MHCPEP database. CONCLUSION: The method improves performance over existing state-of-the-art methods of MHC class II peptide binding predictions by using a custom, knowledge-based representation of peptides. Similarity scores, in contrast to a fixed-length, pocket-specific representation of amino acids, provide a flexible and powerful way of modelling MHC binding, and can easily be applied to other dynamic sequence problems. 
Keywords: Amino Acid, Binding Sites, Computational Biology, Databases, Epitope Mapping, Genetic, HLA-A Antigens, HLA-DR Antigens, Histocompatibility Antigens Class II, Humans, Peptides, Protein, Protein Binding, Protein Conformation, ROC Curve, Reproducibility of Results, Sequence Alignment, Sequence Analysis, Sequence Homology, 17105666
[Paik2006Gene]	Soonmyung Paik, Gong Tang, Steven Shak, Chungyeul Kim, Joffre Baker, Wanseop Kim, Maureen Cronin, Frederick L. Baehner, Drew Watson, John Bryant, Joseph P. Costantino, Charles E Geyer, Jr, D Lawrence Wickerham, and Norman Wolmark. Gene expression and benefit of chemotherapy in women with node-negative, estrogen receptor-positive breast cancer. J Clin Oncol, 24(23):3726-3734, Aug 2006. [ bib \| DOI \| http ] The 21-gene recurrence score (RS) assay quantifies the likelihood of distant recurrence in women with estrogen receptor-positive, lymph node-negative breast cancer treated with adjuvant tamoxifen. The relationship between the RS and chemotherapy benefit is not known. The RS was measured in tumors from the tamoxifen-treated and tamoxifen plus chemotherapy-treated patients in the National Surgical Adjuvant Breast and Bowel Project (NSABP) B20 trial. Cox proportional hazards models were utilized to test for interaction between chemotherapy treatment and the RS. A total of 651 patients were assessable (227 randomly assigned to tamoxifen and 424 randomly assigned to tamoxifen plus chemotherapy). The test for interaction between chemotherapy treatment and RS was statistically significant (P = .038). Patients with high-RS (≥ 31) tumors (ie, high risk of recurrence) had a large benefit from chemotherapy (relative risk, 0.26; 95% CI, 0.13 to 0.53; absolute decrease in 10-year distant recurrence rate: mean, 27.6%; SE, 8.0%). Patients with low-RS (< 18) tumors derived minimal, if any, benefit from chemotherapy treatment (relative risk, 1.31; 95% CI, 0.46 to 3.78; absolute decrease in distant recurrence rate at 10 years: mean, -1.1%; SE, 2.2%). 
Patients with intermediate-RS tumors did not appear to have a large benefit, but the uncertainty in the estimate cannot exclude a clinically important benefit. The RS assay not only quantifies the likelihood of breast cancer recurrence in women with node-negative, estrogen receptor-positive breast cancer, but also predicts the magnitude of chemotherapy benefit. Keywords: Adult; Aged; Antineoplastic Combined Chemotherapy Protocols, administration & dosage/therapeutic use; Breast Neoplasms, drug therapy/metabolism/pathology/prevention & control; Cisplatin, administration & dosage; Female; Fluorouracil, administration & dosage; Gene Expression Regulation, Neoplastic; Humans; Linear Models; Lymphatic Metastasis; Methotrexate, administration & dosage; Middle Aged; Mitomycins, administration & dosage; Neoplasm Proteins, metabolism; Neoplasm Recurrence, Local, metabolism/prevention & control; Odds Ratio; Predictive Value of Tests; Prognosis; Proportional Hazards Models; Randomized Controlled Trials as Topic; Receptors, Estrogen, metabolism; Recurrence, prevention & control; Reverse Transcriptase Polymerase Chain Reaction; Risk Assessment; Risk Factors; Tamoxifen, administration & dosage; Tumor Markers, Biological, metabolism
[Oloff2006Chemometric]	Scott Oloff, Shuxing Zhang, Nagamani Sukumar, Curt Breneman, and Alexander Tropsha. Chemometric analysis of ligand receptor complementarity: identifying complementary ligands based on receptor information (CoLiBRI). J. Chem. Inf. Model., 46(2):844-851, 2006. [ bib \| DOI \| http ] We have developed a novel structure-based chemoinformatics approach to search for Complementary Ligands Based on Receptor Information (CoLiBRI). CoLiBRI is based on the representation of both receptor binding sites and their respective ligands in a space of universal chemical descriptors. The binding site atoms involved in the interaction with ligands are identified by the means of a computational geometry technique known as Delaunay tessellation as applied to X-ray characterized ligand-receptor complexes. TAE/RECON multiple chemical descriptors are calculated independently for each ligand as well as for its active site atoms. The representation of both ligands and active sites using chemical descriptors allows the application of well-known chemometric techniques in order to correlate chemical similarities between active sites and their respective ligands. We have established a protocol to map patterns of nearest neighbor active site vectors in a multidimensional TAE/RECON space onto those of their complementary ligands and vice versa. This protocol affords the prediction of a virtual complementary ligand vector in the ligand chemical space from the position of a known active site vector. This prediction is followed by chemical similarity calculations between this virtual ligand vector and those calculated for molecules in a chemical database to identify real compounds most similar to the virtual ligand. Consequently, the knowledge of the receptor active site structure affords straightforward and efficient identification of its complementary ligands in large databases of chemical compounds using rapid chemical similarity searches. 
Conversely, starting from the ligand chemical structure, one may identify possible complementary receptor cavities as well. We have applied the CoLiBRI approach to a data set of 800 X-ray characterized ligand-receptor complexes in the PDBbind database. Using a k nearest neighbor (kNN) pattern recognition approach and variable selection, we have shown that knowledge of the active site structure affords identification of its complementary ligand among the top 1% of a large chemical database in over 90% of all test active sites when a binding site of the same protein family was present in the training set. In the case where test receptors are highly dissimilar and not present among the receptor families in the training set, the prediction accuracy is decreased; however, CoLiBRI was still able to quickly eliminate 75% of the chemical database as improbable ligands. CoLiBRI affords rapid prefiltering of a large chemical database to eliminate compounds that have little chance of binding to a receptor active site. Keywords: Algorithms; Binding Sites; Binding, Competitive; Computational Biology; Databases, Factual; Drug Design; Drug Evaluation, Preclinical; Ligands; Models, Biological; Structure-Activity Relationship
[Ma2006MSB]	Wenzhe Ma, Luhua Lai, Qi Ouyang, and Chao Tang. Robustness and modular design of the Drosophila segment polarity network. Mol Syst Biol, 2:70, 2006. [ bib \| DOI \| http ] Biomolecular networks have to perform their functions robustly. A robust function may have preferences in the topological structures of the underlying network. We carried out an exhaustive computational analysis on network topologies in relation to a patterning function in Drosophila embryogenesis. We found that whereas the vast majority of topologies can either not perform the required function or only do so very fragilely, a small fraction of topologies emerges as particularly robust for the function. The topology adopted by Drosophila, that of the segment polarity network, is a top ranking one among all topologies with no direct autoregulation. Furthermore, we found that all robust topologies are modular, each being a combination of three kinds of modules. These modules can be traced back to three subfunctions of the patterning function, and their combinations provide a combinatorial variability for the robust topologies. Our results suggest that the requirement of functional robustness drastically reduces the choices of viable topology to a limited set of modular combinations among which nature optimizes its choice under evolutionary and other biological constraints. Keywords: Animals; Biological Evolution; Body Patterning; Computer Simulation; Drosophila Proteins, physiology; Drosophila melanogaster, anatomy & histology/physiology; Feedback, Physiological; Gene Expression Regulation, Developmental; Genes, Insect; Models, Biological; Signal Transduction; Systems Biology, methods; Transcription Factors
[Kurata2006PlosCompBio]	Hiroyuki Kurata, Hana El-Samad, Rei Iwasaki, Hisao Ohtake, John C Doyle, Irina Grigorova, Carol A Gross, and Mustafa Khammash. Module-based analysis of robustness tradeoffs in the heat shock response system. PLoS Comput Biol, 2(7):e59, Jul 2006. [ bib \| DOI \| http ] Biological systems have evolved complex regulatory mechanisms, even in situations where much simpler designs seem to be sufficient for generating nominal functionality. Using module-based analysis coupled with rigorous mathematical comparisons, we propose that in analogy to control engineering architectures, the complexity of cellular systems and the presence of hierarchical modular structures can be attributed to the necessity of achieving robustness. We employ the Escherichia coli heat shock response system, a strongly conserved cellular mechanism, as an example to explore the design principles of such modular architectures. In the heat shock response system, the sigma-factor sigma32 is a central regulator that integrates multiple feedforward and feedback modules. Each of these modules provides a different type of robustness with its inherent tradeoffs in terms of transient response and efficiency. We demonstrate how the overall architecture of the system balances such tradeoffs. An extensive mathematical exploration nevertheless points to the existence of an array of alternative strategies for the existing heat shock response that could exhibit similar behavior. We therefore deduce that the evolutionary constraints facing the system might have steered its architecture toward one of many robustly functional solutions. Keywords: Computer Simulation; Escherichia coli Proteins, metabolism; Escherichia coli, metabolism; Feedback, physiology; Gene Expression Regulation, Bacterial, physiology; Heat-Shock Proteins, metabolism; Heat-Shock Response, physiology; Models, Biological; Oxidative Stress, physiology; Signal Transduction, physiology; Systems Biology, methods
[Korber2006Immunoinformatics]	Bette Korber, Montiago LaBute, and Karina Yusim. Immunoinformatics comes of age. PLoS Comput. Biol., 2(6):e71, Jun 2006. [ bib \| DOI \| http ] With the burgeoning immunological data in the scientific literature, scientists must increasingly rely on Internet resources to inform and enhance their work. Here we provide a brief overview of the adaptive immune response and summaries of immunoinformatics resources, emphasizing those with Web interfaces. These resources include searchable databases of epitopes and immune-related molecules, and analysis tools for T cell and B cell epitope prediction, vaccine design, and protein structure comparisons. There is an agreeable synergy between the growing collections in immune-related databases and the growing sophistication of analysis software; the databases provide the foundation for developing predictive computational tools, which in turn enable more rapid identification of immune responses to populate the databases. Collectively, these resources contribute to improved understanding of immune responses and escape, and evolution of pathogens under immune pressure. The public health implications are vast, including designing vaccines, understanding autoimmune diseases, and defining the correlates of immune protection. Keywords: Amino Acid Sequence; Animals; Computational Biology; Databases, Factual; Epitopes, B-Lymphocyte; Epitopes, T-Lymphocyte; Humans; Immunity
[Coupez2006Docking]	B. Coupez and R. A. Lewis. Docking and scoring--theoretically easy, practically impossible? Curr. Med. Chem., 13(25):2995-3003, 2006. [ bib ] Structure-based Drug Design (SBDD) is an essential part of the modern medicinal chemistry, and has led to the acceleration of many projects, and even to drugs on the market. Programs that perform docking and scoring of ligands to receptors are powerful tools in the drug designer's armoury that enhance the process of SBDD. They are even deployed on the desktop of many bench chemists. It is timely to review the state of the art, to understand how good our docking programs are, and what are the issues. In this review we would like to provide a guide around the reliable aspects of docking and scoring and the associated pitfalls aiming at an audience of medicinal chemists rather than modellers. For convenience, we will divide the review into two parts: docking and scoring. Docking concerns the preparation of the receptor and the ligand(s), the sampling of conformational space and stereochemistry (if appropriate). Scoring concerns the evaluation of all of the ligand-receptor poses generated by docking. The two processes are not truly independent, and this will be discussed here in detail. The preparation of the receptor and ligand(s) before docking requires great care. For the receptor, issues of protonation, tautomerisation and hydration are key, and we will discuss current approaches to these issues. Even more important is the degree of sampling: can the algorithms reproduce what is observed experimentally? If they can, are the scoring algorithms good enough to recognise this pose as the best? Do the scores correlate with observed binding affinity? How does local knowledge of the target (for example hinge-binding to a kinase) affect the accuracy of the predictions? 
We will review the key findings from several evaluation studies and present conclusions about when and how to interpret and trust the results of docking and scoring. Finally, we will present an outline of some of the latest developments in the area of scoring functions. Keywords: Cluster Analysis; Computational Biology, methods; Computer Simulation; Computer-Aided Design; Databases, Factual; Drug Design; Ligands; Models, Chemical; Software; Structure-Activity Relationship
[Bui2006Structural]	H.-H. Bui, A. J. Schiewe, H. von Grafenstein, and I. S. Haworth. Structural prediction of peptides binding to MHC class I molecules. Proteins, 63(1):43-52, Apr 2006. [ bib \| DOI \| http ] Peptide binding to class I major histocompatibility complex (MHCI) molecules is a key step in the immune response and the structural details of this interaction are of importance in the design of peptide vaccines. Algorithms based on primary sequence have had success in predicting potential antigenic peptides for MHCI, but such algorithms have limited accuracy and provide no structural information. Here, we present an algorithm, PePSSI (peptide-MHC prediction of structure through solvated interfaces), for the prediction of peptide structure when bound to the MHCI molecule, HLA-A2. The algorithm combines sampling of peptide backbone conformations and flexible movement of MHC side chains and is unique among other prediction algorithms in its incorporation of explicit water molecules at the peptide-MHC interface. In an initial test of the algorithm, PePSSI was used to predict the conformation of eight peptides bound to HLA-A2, for which X-ray data are available. Comparison of the predicted and X-ray conformations of these peptides gave RMSD values between 1.301 and 2.475 A. Binding conformations of 266 peptides with known binding affinities for HLA-A2 were then predicted using PePSSI. Structural analyses of these peptide-HLA-A2 conformations showed that peptide binding affinity is positively correlated with the number of peptide-MHC contacts and negatively correlated with the number of interfacial water molecules. These results are consistent with the relatively hydrophobic binding nature of the HLA-A2 peptide binding interface. In summary, PePSSI is capable of rapid and accurate prediction of peptide-MHC binding conformations, which may in turn allow estimation of MHCI-peptide binding affinity. 
Keywords: Algorithms, Amino Acid Sequence, Antigens, Artificial Intelligence, Automated, Binding Sites, Chemical, Computational Biology, Computer Simulation, Crystallography, Electrostatics, Genes, Genetic, HLA Antigens, Histocompatibility Antigens Class I, Humans, Hydrogen Bonding, Ligands, MHC Class I, Major Histocompatibility Complex, Models, Molecular, Molecular Conformation, Molecular Sequence Data, Pattern Recognition, Peptides, Protein, Protein Binding, Protein Conformation, Proteomics, Quantitative Structure-Activity Relationship, Sequence Alignment, Sequence Analysis, Software, Structural Homology, Structure-Activity Relationship, Thermodynamics, Water, X-Ray, X-Rays, 16447245
[Bhavani2006Substructure-based]	S. Bhavani, A. Nagargadde, A. Thawani, V. Sridhar, and N. Chandra. Substructure-based support vector machine classifiers for prediction of adverse effects in diverse classes of drugs. J. Chem. Inform. Model., 46(6):2478-2486, 2006. [ bib \| DOI \| http ] Unforeseen adverse effects exhibited by drugs contribute heavily to late-phase failure and even withdrawal of marketed drugs. Torsade de pointes (TdP) is one such important adverse effect, which causes cardiac arrhythmia and, in some cases, sudden death, making it crucial for potential drugs to be screened for torsadogenicity. The need to tap the power of computational approaches for the prediction of adverse effects such as TdP is increasingly becoming evident. The availability of screening data including those in organized databases greatly facilitates exploration of newer computational approaches. In this paper, we report the development of a prediction method based on a support vector machine algorithm. The method uses a combination of descriptors, encoding both the type of toxicophore as well as the position of the toxicophore in the drug molecule, thus considering both the pharmacophore and the three-dimensional shape information of the molecule. For delineating toxicophores, a novel pattern-recognition method that utilizes substructures within a molecule has been developed. The results obtained using the hybrid approach have been compared with those available in the literature for the same data set. An improvement in prediction accuracy is clearly seen, with the accuracy reaching up to 97% in predicting compounds that can cause TdP and 90% for predicting compounds that do not cause TdP. 
The generic nature of the method has been demonstrated with four data sets available for carcinogenicity, where prediction accuracies were significantly higher, with a best receiver operating characteristics (ROC) value of 0.81 as against a best ROC value of 0.7 reported in the literature for the same data set. Thus, the method holds promise for wide applicability in toxicity prediction. Keywords: Algorithms; Carcinogens; Chemistry, Pharmaceutical; Computational Biology; Drug Evaluation, Preclinical; Drug Industry; Humans; Models, Chemical; Models, Statistical; Neural Networks (Computer); Pattern Recognition, Automated; ROC Curve; Sequence Analysis, Protein; Software; Torsades de Pointes
[Yan2007Determining]	Mingjin Yan and Keying Ye. Determining the number of clusters using the weighted gap statistic. Biometrics, 63(4):1031-1037, Dec 2007. [ bib \| DOI \| http ] Estimating the number of clusters in a data set is a crucial step in cluster analysis. In this article, motivated by the gap method (Tibshirani, Walther, and Hastie, 2001, Journal of the Royal Statistical Society B63, 411-423), we propose the weighted gap and the difference of difference-weighted (DD-weighted) gap methods for estimating the number of clusters in data using the weighted within-clusters sum of errors: a measure of the within-clusters homogeneity. In addition, we propose a "multilayer" clustering approach, which is shown to be more accurate than the original gap method, particularly in detecting the nested cluster structure of the data. The methods are applicable when the input data contain continuous measurements and can be used with any clustering method. Simulation studies and real data are investigated and compared among these proposed methods as well as with the original gap method. Keywords: Algorithms; Biometry, methods; Cluster Analysis; Computer Simulation; Data Interpretation, Statistical; Models, Biological; Models, Statistical; Pattern Recognition, Automated, methods
[Rhodes2007Oncomine]	Daniel R. Rhodes, Shanker Kalyana-Sundaram, Vasudeva Mahavisno, Radhika Varambally, Jianjun Yu, Benjamin B. Briggs, Terrence R. Barrette, Matthew J. Anstet, Colleen Kincead-Beal, Prakash Kulkarni, Sooryanaryana Varambally, Debashis Ghosh, and Arul M. Chinnaiyan. Oncomine 3.0: genes, pathways, and networks in a collection of 18,000 cancer gene expression profiles. Neoplasia, 9(2):166-180, Feb 2007. [ bib ] DNA microarrays have been widely applied to cancer transcriptome analysis; however, the majority of such data are not easily accessible or comparable. Furthermore, several important analytic approaches have been applied to microarray analysis; however, their application is often limited. To overcome these limitations, we have developed Oncomine, a bioinformatics initiative aimed at collecting, standardizing, analyzing, and delivering cancer transcriptome data to the biomedical research community. Our analysis has identified the genes, pathways, and networks deregulated across 18,000 cancer gene expression microarrays, spanning the majority of cancer types and subtypes. Here, we provide an update on the initiative, describe the database and analysis modules, and highlight several notable observations. Results from this comprehensive analysis are available at http://www.oncomine.org. Keywords: Antineoplastic Agents, pharmacology; Automatic Data Processing; Chromosome Mapping; Chromosomes, Human, genetics; Computational Biology, organization & administration; Data Collection; Data Display; Data Interpretation, Statistical; Databases, Genetic; Drug Design; Gene Expression Profiling, statistics & numerical data; Gene Expression Regulation, Neoplastic; Genes, Neoplasm; Humans; Internet; Models, Biological; Neoplasm Proteins, biosynthesis/chemistry/genetics; Neoplasms, classification/genetics/metabolism; Oligonucleotide Array Sequence Analysis; Subtraction Technique; Transcription, Genetic
[Kroemer2007Structure]	Romano T Kroemer. Structure-based drug design: docking and scoring. Curr. Protein Pept. Sci., 8(4):312-328, Aug 2007. [ bib ] This review gives an introduction into ligand - receptor docking and illustrates the basic underlying concepts. An overview of different approaches and algorithms is provided. Although the application of docking and scoring has led to some remarkable successes, there are still some major challenges ahead, which are outlined here as well. Approaches to address some of these challenges and the latest developments in the area are presented. Some aspects of the assessment of docking program performance are discussed. A number of successful applications of structure-based virtual screening are described. Keywords: Algorithms; Artificial Intelligence; Computational Biology; Computer Simulation; Computer-Aided Design; Drug Design; Imaging, Three-Dimensional; Ligands; Models, Molecular; Protein Binding; Protein Conformation; Software; Structure-Activity Relationship
[Guiot2007Morphological]	Caterina Guiot, Pier P Delsanto, and Thomas S Deisboeck. Morphological instability and cancer invasion: a 'splashing water drop' analogy. Theor. Biol. Med. Model., 4:4, 2007. [ bib \| DOI \| http ] BACKGROUND: Tissue invasion, one of the hallmarks of cancer, is a major clinical problem. Recent studies suggest that the process of invasion is driven at least in part by a set of physical forces that may be susceptible to mathematical modelling which could have practical clinical value. MODEL AND CONCLUSION: We present an analogy between two unrelated instabilities. One is caused by the impact of a drop of water on a solid surface while the other concerns a tumor that develops invasive cellular branches into the surrounding host tissue. In spite of the apparent abstractness of the idea, it yields a very practical result, i.e. an index that predicts tumor invasion based on a few measurable parameters. We discuss its application in the context of experimental data and suggest potential clinical implications. Keywords: Animals; Biomechanics; Cell Adhesion; Humans; Mathematics; Models, Biological; Neoplasm Invasiveness; Neoplasms, pathology; Surface Tension
[Dostie2007Chromosome]	Josée Dostie, Ye Zhan, and Job Dekker. Chromosome conformation capture carbon copy technology. Curr Protoc Mol Biol, Chapter 21:Unit 21.14, Oct 2007. [ bib \| DOI \| http ] Chromosome conformation capture (3C) is used to quantify physical DNA contacts in vivo at high resolution. 3C was first used in yeast to map the spatial chromatin organization of chromosome III, and in higher eukaryotes to demonstrate that genomic DNA elements regulate target genes by physically interacting with them. 3C has been widely adopted for small-scale analysis of functional chromatin interactions along (cis) or between (trans) chromosomes. For larger-scale applications, chromosome conformation capture carbon copy (5C) combines 3C with ligation-mediated amplification (LMA) to simultaneously quantify hundreds of thousands of physical DNA contacts by microarray or ultra-high-throughput DNA sequencing. 5C allows the mapping of extensive networks of physical interactions among large sets of genomic elements throughout the genome. Such networks can provide important biological insights, e.g., by identifying relationships between regulatory elements and their target genes. This unit describes 5C for large-scale analysis of cis- and trans-chromatin interactions in mammalian cells. Keywords: Chromosomes, Artificial, Bacterial; Chromosomes, chemistry; DNA Primers, metabolism; Molecular Biology, methods; Nucleic Acid Conformation; Oligonucleotide Array Sequence Analysis; Polymerase Chain Reaction; Sequence Analysis, DNA; Templates, Genetic
[Davies2007Harnessing]	Matthew N Davies and Darren R Flower. Harnessing bioinformatics to discover new vaccines. Drug Discov Today, 12(9-10):389-395, May 2007. [ bib \| DOI \| http ] Vaccine design is highly suited to the application of in silico techniques, for both the discovery and development of new and existing vaccines. Here, we discuss computational contributions to epitope mapping and reverse vaccinology, two techniques central to the new discipline of immunomics. Also discussed are methods to improve the efficiency of vaccination, such as codon optimization and adjuvant discovery in addition to the identification of allergenic proteins. We also review current software developed to facilitate vaccine design. Keywords: Animals; Computational Biology; Drug Design; Epitope Mapping; Humans; Software Design; Vaccination; Vaccines
[Wheeler2008Complete]	David A Wheeler, Maithreyan Srinivasan, Michael Egholm, Yufeng Shen, Lei Chen, Amy McGuire, Wen He, Yi-Ju Chen, Vinod Makhijani, G. Thomas Roth, Xavier Gomes, Karrie Tartaro, Faheem Niazi, Cynthia L Turcotte, Gerard P Irzyk, James R Lupski, Craig Chinault, Xing zhi Song, Yue Liu, Ye Yuan, Lynne Nazareth, Xiang Qin, Donna M Muzny, Marcel Margulies, George M Weinstock, Richard A Gibbs, and Jonathan M Rothberg. The complete genome of an individual by massively parallel DNA sequencing. Nature, 452(7189):872-876, Apr 2008. [ bib \| DOI \| http ] The association of genetic variation with disease and drug response, and improvements in nucleic acid technologies, have given great optimism for the impact of 'genomic medicine'. However, the formidable size of the diploid human genome, approximately 6 gigabases, has prevented the routine application of sequencing methods to deciphering complete individual human genomes. To realize the full potential of genomics for human health, this limitation must be overcome. Here we report the DNA sequence of a diploid genome of a single individual, James D. Watson, sequenced to 7.4-fold redundancy in two months using massively parallel sequencing in picolitre-size reaction vessels. This sequence was completed in two months at approximately one-hundredth of the cost of traditional capillary electrophoresis methods. Comparison of the sequence to the reference genome led to the identification of 3.3 million single nucleotide polymorphisms, of which 10,654 cause amino-acid substitution within the coding sequence. In addition, we accurately identified small-scale (2-40,000 base pair (bp)) insertion and deletion polymorphism as well as copy number variation resulting in the large-scale gain and loss of chromosomal segments ranging from 26,000 to 1.5 million base pairs. Overall, these results agree well with recent results of sequencing of a single individual by traditional methods. 
However, in addition to being faster and significantly less expensive, this sequencing technology avoids the arbitrary loss of genomic sequences inherent in random shotgun sequencing by bacterial cloning because it amplifies DNA in a cell-free system. As a result, we further demonstrate the acquisition of novel human sequence, including novel genes not previously identified by traditional genomic sequencing. This is the first genome sequenced by next-generation technologies. Therefore it is a pilot for the future challenges of 'personalized genome sequencing'. Keywords: Alleles; Computational Biology; Genetic Predisposition to Disease, genetics; Genetic Variation, genetics; Genome, Human, genetics; Genomics, economics/methods/trends; Genotype; Humans; Individuality; Male; Oligonucleotide Array Sequence Analysis; Polymorphism, Single Nucleotide, genetics; Reproducibility of Results; Sensitivity and Specificity; Sequence Alignment; Sequence Analysis, DNA, economics/methods; Software
[Mishra2008Review]	K. P. Mishra, L. Ganju, M. Sairam, P. K. Banerjee, and R. C. Sawhney. A review of high throughput technology for the screening of natural products. Biomed Pharmacother, 62(2):94-98, Feb 2008. [ bib \| DOI \| http ] High throughput screening is commonly defined as automatic testing of potential drug candidates at a rate in excess of 10,000 compounds per week. The aim of high throughput drug discovery is to test large compound collections for potentially active compounds ('hits') in order to allow further development of compounds for pre-clinical testing ('leads'). High throughput technology has emerged over the last few years as an important tool for drug discovery and lead optimisation. In this approach, the molecular diversity and range of biological properties displayed by secondary metabolites constitutes a challenge to combinatorial strategies for natural products synthesis and derivatization. This article reviews the approach of High throughput technique for the screening of natural products for drug discovery. Keywords: Automation; Biological Products, pharmacology; Combinatorial Chemistry Techniques; Drug Design; Drug Evaluation, Preclinical; Technology, Pharmaceutical, methods
[Kohler2008Walking]	S. Köhler, S. Bauer, D. Horn, and P.N. Robinson. Walking the interactome for prioritization of candidate disease genes. Am. J. Hum. Genet., 82(4):949-958, Apr 2008. [ bib \| DOI \| http ] The identification of genes associated with hereditary disorders has contributed to improving medical care and to a better understanding of gene functions, interactions, and pathways. However, there are well over 1500 Mendelian disorders whose molecular basis remains unknown. At present, methods such as linkage analysis can identify the chromosomal region in which unknown disease genes are located, but the regions could contain up to hundreds of candidate genes. In this work, we present a method for prioritization of candidate genes by use of a global network distance measure, random walk analysis, for definition of similarities in protein-protein interaction networks. We tested our method on 110 disease-gene families with a total of 783 genes and achieved an area under the ROC curve of up to 98% on simulated linkage intervals of 100 genes surrounding the disease gene, significantly outperforming previous methods based on local distance measures. Our results not only provide an improved tool for positional-cloning projects but also add weight to the assumption that phenotypically similar diseases are associated with disturbances of subnetworks within the larger protein interactome that extend beyond the disease proteins themselves. Keywords: Animals; Chromosome Mapping; Computational Biology; Databases, Genetic; Genetic Diseases, Inborn; Genetic Predisposition to Disease; Humans; Internet; Linkage (Genetics); Mice; Pedigree; Protein Interaction Mapping; Software
[Harris2008Single-molecule]	Timothy D Harris, Phillip R Buzby, Hazen Babcock, Eric Beer, Jayson Bowers, Ido Braslavsky, Marie Causey, Jennifer Colonell, James Dimeo, J. William Efcavitch, Eldar Giladi, Jaime Gill, John Healy, Mirna Jarosz, Dan Lapen, Keith Moulton, Stephen R Quake, Kathleen Steinmann, Edward Thayer, Anastasia Tyurina, Rebecca Ward, Howard Weiss, and Zheng Xie. Single-molecule DNA sequencing of a viral genome. Science, 320(5872):106-109, Apr 2008. [ bib \| DOI \| http ] The full promise of human genomics will be realized only when the genomes of thousands of individuals can be sequenced for comparative analysis. A reference sequence enables the use of short read length. We report an amplification-free method for determining the nucleotide sequence of more than 280,000 individual DNA molecules simultaneously. A DNA polymerase adds labeled nucleotides to surface-immobilized primer-template duplexes in stepwise fashion, and the asynchronous growth of individual DNA molecules was monitored by fluorescence imaging. Read lengths of >25 bases and equivalent phred software program quality scores approaching 30 were achieved. We used this method to sequence the M13 virus to an average depth of >150x and with 100% coverage; thus, we resequenced the M13 genome with high-sensitivity mutation detection. This demonstrates a strategy for high-throughput low-cost resequencing. Keywords: Algorithms; Bacteriophage M13; Computational Biology; DNA Primers; DNA, Viral; Genome, Viral; Mutation; Sequence Alignment; Sequence Analysis, DNA; Software; Templates, Genetic
[Ala2008Prediction]	U. Ala, R.M. Piro, E. Grassi, C. Damasco, L. Silengo, M. Oti, P. Provero, and F. Di Cunto. Prediction of human disease genes by human-mouse conserved coexpression analysis. PLoS Comput. Biol., 4(3):e1000043, Mar 2008. [ bib \| DOI \| http ] BACKGROUND: Even in the post-genomic era, the identification of candidate genes within loci associated with human genetic diseases is a very demanding task, because the critical region may typically contain hundreds of positional candidates. Since genes implicated in similar phenotypes tend to share very similar expression profiles, high throughput gene expression data may represent a very important resource to identify the best candidates for sequencing. However, so far, gene coexpression has not been used very successfully to prioritize positional candidates. METHODOLOGY/PRINCIPAL FINDINGS: We show that it is possible to reliably identify disease-relevant relationships among genes from massive microarray datasets by concentrating only on genes sharing similar expression profiles in both human and mouse. Moreover, we show systematically that the integration of human-mouse conserved coexpression with a phenotype similarity map allows the efficient identification of disease genes in large genomic regions. Finally, using this approach on 850 OMIM loci characterized by an unknown molecular basis, we propose high-probability candidates for 81 genetic diseases. CONCLUSION: Our results demonstrate that conserved coexpression, even at the human-mouse phylogenetic distance, represents a very strong criterion to predict disease-relevant relationships among human genes. Keywords: Algorithms; Animals; Biological Markers; Chromosome Mapping; Conserved Sequence; Diagnosis, Computer-Assisted; Gene Expression Profiling; Genetic Diseases, Inborn; Genetic Predisposition to Disease; Humans; Mice; Proteome
[Xie2009Unified]	Lei Xie, Li Xie, and Philip E Bourne. A unified statistical model to support local sequence order independent similarity searching for ligand-binding sites and its application to genome-based drug discovery. Bioinformatics, 25(12):i305-i312, Jun 2009. [ bib \| DOI \| http ] Functional relationships between proteins that do not share global structure similarity can be established by detecting their ligand-binding-site similarity. For a large-scale comparison, it is critical to accurately and efficiently assess the statistical significance of this similarity. Here, we report an efficient statistical model that supports local sequence order independent ligand-binding-site similarity searching. Most existing statistical models only take into account the matching vertices between two sites that are defined by a fixed number of points. In reality, the boundary of the binding site is not known or is dependent on the bound ligand making these approaches limited. To address these shortcomings and to perform binding-site mapping on a genome-wide scale, we developed a sequence-order independent profile-profile alignment (SOIPPA) algorithm that is able to detect local similarity between unknown binding sites a priori. The SOIPPA scoring integrates geometric, evolutionary and physical information into a unified framework. However, this imposes a significant challenge in assessing the statistical significance of the similarity because the conventional probability model that is based on fixed-point matching cannot be applied. Here we find that scores for binding-site matching by SOIPPA follow an extreme value distribution (EVD). Benchmark studies show that the EVD model performs at least two-orders faster and is more accurate than the non-parametric statistical method in the previous SOIPPA version. Efficient statistical analysis makes it possible to apply SOIPPA to genome-based drug discovery. 
Consequently, we have applied the approach to the structural genome of Mycobacterium tuberculosis to construct a protein-ligand interaction network. The network reveals highly connected proteins, which represent suitable targets for promiscuous drugs. Keywords: Binding Sites; Computational Biology, methods; Drug Discovery, methods; Genome; Ligands; Models, Statistical; Mycobacterium tuberculosis, genetics/metabolism; Proteins, chemistry
[Vassetzky2009Chromosome]	Yegor Vassetzky, Alexey Gavrilov, Elvira Eivazova, Iryna Pirozhkova, Marc Lipinski, and Sergey Razin. Chromosome conformation capture (from 3C to 5C) and its ChIP-based modification. Methods Mol Biol, 567:171-188, 2009. [ bib \| DOI \| http ] Chromosome conformation capture (3C) methodology was developed to study spatial organization of long genomic regions in living cells. Briefly, chromatin is fixed with formaldehyde in vivo to cross-link interacting sites, digested with a restriction enzyme and ligated at a low DNA concentration so that ligation between cross-linked fragments is favored over ligation between random fragments. Ligation products are then analyzed and quantified by PCR. So far, semi-quantitative PCR methods were widely used to estimate the ligation frequencies. However, it is often important to estimate the ligation frequencies more precisely, which is only possible by using real-time PCR. At the same time, it is equally necessary to monitor the specificity of PCR amplification. That is why real-time PCR with TaqMan probes is becoming more and more popular in 3C studies. In this chapter, we describe the general protocol for 3C analysis with the subsequent estimation of ligation frequencies by using real-time PCR technology with TaqMan probes. We discuss in detail all steps of the experimental procedure, paying special attention to weak points and possible ways to solve the problems. Special attention is also paid to the problems in interpretation of the results and necessary control experiments. In addition, we consider in theory other approaches to analysis of the ligation products used in the so-called 4C and 5C methods. The recently developed chromatin immunoprecipitation (ChIP)-loop assay representing a combination of 3C and ChIP is also discussed. 
Keywords: Chromatin Immunoprecipitation; Chromosome Mapping; Chromosomes; Cross-Linking Reagents; Humans; Models, Biological; Nucleic Acid Conformation; Polymerase Chain Reaction; Quality Control
[Terentiev2009Dynamic]	A. A. Terentiev, N. T. Moldogazieva, and K. V. Shaitan. Dynamic proteomics in modeling of the living cell. Protein-protein interactions. Biochemistry (Mosc), 74(13):1586-1607, Dec 2009. [ bib ] This review is devoted to describing, summarizing, and analyzing of dynamic proteomics data obtained over the last few years and concerning the role of protein-protein interactions in modeling of the living cell. Principles of modern high-throughput experimental methods for investigation of protein-protein interactions are described. Systems biology approaches based on integrative view on cellular processes are used to analyze organization of protein interaction networks. It is proposed that finding of some proteins in different protein complexes can be explained by their multi-modular and polyfunctional properties; the different protein modules can be located in the nodes of protein interaction networks. Mathematical and computational approaches to modeling of the living cell with emphasis on molecular dynamics simulation are provided. The role of the network analysis in fundamental medicine is also briefly reviewed. Keywords: Animals; Humans; Mass Spectrometry; Models, Theoretical; Molecular Dynamics Simulation; Multiprotein Complexes; Protein Conformation; Protein Interaction Mapping; Proteins; Proteomics; Systems Biology; Two-Hybrid System Techniques
[Sutherland2009Transcription]	Heidi Sutherland and Wendy A Bickmore. Transcription factories: gene expression in unions? Nat Rev Genet, 10(7):457-466, Jul 2009. [ bib \| DOI \| http ] Transcription is a fundamental step in gene expression, yet it remains poorly understood at a cellular level. Visualization of transcription sites and active genes has led to the suggestion that transcription occurs at discrete sites in the nucleus, termed transcription factories, where multiple active RNA polymerases are concentrated and anchored to a nuclear substructure. However, this concept is not universally accepted. This Review discusses the experimental evidence in support of the transcription factory model and the evidence that argues against such a spatially structured view of transcription. The transcription factory model has implications for the regulation of transcription initiation and elongation, for the organization of genes in the genome, for the co-regulation of genes and for genome instability. Keywords: Animals; Cell Nucleus; DNA-Directed RNA Polymerases; Genome; Genomic Instability; Humans; Models, Biological; Transcription, Genetic
[Park2009ChIP]	Peter J Park. ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet, 10(10):669-680, Oct 2009. [ bib \| DOI \| http ] Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is a technique for genome-wide profiling of DNA-binding proteins, histone modifications or nucleosomes. Owing to the tremendous progress in next-generation sequencing technology, ChIP-seq offers higher resolution, less noise and greater coverage than its array-based predecessor ChIP-chip. With the decreasing cost of sequencing, ChIP-seq has become an indispensable tool for studying gene regulation and epigenetic mechanisms. In this Review, I describe the benefits and challenges in harnessing this technique with an emphasis on issues related to experimental design and data analysis. ChIP-seq experiments generate large quantities of data, and effective computational analysis will be crucial for uncovering biological mechanisms. Keywords: Animals; Chromatin Immunoprecipitation, methods; Computational Biology; DNA-Binding Proteins, genetics; Epigenesis, Genetic; Humans; Nucleosomes, genetics; Sequence Analysis, DNA, methods
[Marbach2009Replaying]	D. Marbach, C. Mattiussi, and D. Floreano. Replaying the evolutionary tape: biomimetic reverse engineering of gene networks. Ann N Y Acad Sci, 1158:234-245, Mar 2009. [ bib \| DOI \| http ] In this paper, we suggest a new approach for reverse engineering gene regulatory networks, which consists of using a reconstruction process that is similar to the evolutionary process that created these networks. The aim is to integrate prior knowledge into the reverse-engineering procedure, thus biasing the search toward biologically plausible solutions. To this end, we propose an evolutionary method that abstracts and mimics the natural evolution of gene regulatory networks. Our method can be used with a wide range of nonlinear dynamical models. This allows us to explore novel model types such as the log-sigmoid model introduced here. We apply the biomimetic method to a gold-standard dataset from an in vivo gene network. The obtained results won a reverse engineering competition of the second DREAM conference (Dialogue on Reverse Engineering Assessments and Methods 2007, New York, NY). Keywords: Algorithms; Biomimetics; Computational Biology; Databases, Genetic; Evolution; Gene Regulatory Networks; Models, Biological; Nonlinear Dynamics
[Lima-Mendez2009powerful]	G. Lima-Mendez and J. van Helden. The powerful law of the power law and other myths in network biology. Mol Biosyst, 5(12):1482-1493, Dec 2009. [ bib \| DOI \| http ] For almost 10 years, topological analysis of different large-scale biological networks (metabolic reactions, protein interactions, transcriptional regulation) has been highlighting some recurrent properties: power law distribution of degree, scale-freeness, small world, which have been proposed to confer functional advantages such as robustness to environmental changes and tolerance to random mutations. Stochastic generative models inspired different scenarios to explain the growth of interaction networks during evolution. The power law and the associated properties appeared so ubiquitous in complex networks that they were qualified as "universal laws". However, these properties are no longer observed when the data are subjected to statistical tests: in most cases, the data do not fit the expected theoretical models, and the cases of good fitting merely result from sampling artefacts or improper data representation. The field of network biology seems to be founded on a series of myths, i.e. widely believed but false ideas. The weaknesses of these foundations should however not be considered as a failure for the entire domain. Network analysis provides a powerful frame for understanding the function and evolution of biological processes, provided it is brought to an appropriate level of description, by focussing on smaller functional modules and establishing the link between their topological properties and their dynamical behaviour. Keywords: Computational Biology, methods; Gene Regulatory Networks; Metabolic Networks and Pathways; Models, Biological; Semantics; Signal Transduction
[Lievens2009Mammalian]	Sam Lievens, Irma Lemmens, and Jan Tavernier. Mammalian two-hybrids come of age. Trends Biochem Sci, 34(11):579-588, Nov 2009. [ bib \| DOI \| http ] A diverse series of mammalian two-hybrid technologies for the detection of protein-protein interactions have emerged in the past few years, complementing the established yeast two-hybrid approach. Given the mammalian background in which they operate, these assays open new avenues to study the dynamics of mammalian protein interaction networks, i.e. the temporal, spatial and functional modulation of protein-protein associations. In addition, novel assay formats are available that enable high-throughput mammalian two-hybrid applications, facilitating their use in large-scale interactome mapping projects. Finally, as they can be applied in drug discovery and development programs, these techniques also offer exciting new opportunities for biomedical research. Keywords: Animals; Genes, Reporter; Humans; Models, Biological; Protein Binding; Protein Interaction Mapping; Proteins; Recombinant Fusion Proteins; Transfection; Two-Hybrid System Techniques
[LeCao2009Sparse]	K.-A. Lê Cao, P. G. P. Martin, C. Robert-Granié, and P. Besse. Sparse canonical methods for biological data integration: application to a cross-platform study. BMC Bioinformatics, 10:34, 2009. [ bib \| DOI \| http ] In the context of systems biology, few sparse approaches have been proposed so far to integrate several data sets. It is however an important and fundamental issue that will be widely encountered in post genomic studies, when simultaneously analyzing transcriptomics, proteomics and metabolomics data using different platforms, so as to understand the mutual interactions between the different data sets. In this high dimensional setting, variable selection is crucial to give interpretable results. We focus on a sparse Partial Least Squares approach (sPLS) to handle two-block data sets, where the relationship between the two types of variables is known to be symmetric. Sparse PLS has been developed either for a regression or a canonical correlation framework and includes a built-in procedure to select variables while integrating data. To illustrate the canonical mode approach, we analyzed the NCI60 data sets, where two different platforms (cDNA and Affymetrix chips) were used to study the transcriptome of sixty cancer cell lines. We compare the results obtained with two other sparse or related canonical correlation approaches: CCA with Elastic Net penalization (CCA-EN) and Co-Inertia Analysis (CIA). The latter does not include a built-in procedure for variable selection and requires a two-step analysis. We stress the lack of statistical criteria to evaluate canonical correlation methods, which makes biological interpretation absolutely necessary to compare the different gene selections.
We also propose comprehensive graphical representations of both samples and variables to facilitate the interpretation of the results. sPLS and CCA-EN selected highly relevant genes and complementary findings from the two data sets, which enabled a detailed understanding of the molecular characteristics of several groups of cell lines. These two approaches were found to bring similar results, although they highlighted the same phenomena with a different priority. They outperformed CIA that tended to select redundant information. Keywords: Computational Biology, methods; Genomics; Metabolomics; Proteomics; Systems Biology, methods
[Eid2009Real]	John Eid, Adrian Fehr, Jeremy Gray, Khai Luong, John Lyle, Geoff Otto, Paul Peluso, David Rank, Primo Baybayan, Brad Bettman, Arkadiusz Bibillo, Keith Bjornson, Bidhan Chaudhuri, Frederick Christians, Ronald Cicero, Sonya Clark, Ravindra Dalal, Alex Dewinter, John Dixon, Mathieu Foquet, Alfred Gaertner, Paul Hardenbol, Cheryl Heiner, Kevin Hester, David Holden, Gregory Kearns, Xiangxu Kong, Ronald Kuse, Yves Lacroix, Steven Lin, Paul Lundquist, Congcong Ma, Patrick Marks, Mark Maxham, Devon Murphy, Insil Park, Thang Pham, Michael Phillips, Joy Roy, Robert Sebra, Gene Shen, Jon Sorenson, Austin Tomaney, Kevin Travers, Mark Trulson, John Vieceli, Jeffrey Wegener, Dawn Wu, Alicia Yang, Denis Zaccarin, Peter Zhao, Frank Zhong, Jonas Korlach, and Stephen Turner. Real-time DNA sequencing from single polymerase molecules. Science, 323(5910):133-138, Jan 2009. [ bib \| DOI \| http ] We present single-molecule, real-time sequencing data obtained from a DNA polymerase performing uninterrupted template-directed synthesis using four distinguishable fluorescently labeled deoxyribonucleoside triphosphates (dNTPs). We detected the temporal order of their enzymatic incorporation into a growing DNA strand with zero-mode waveguide nanostructure arrays, which provide optical observation volume confinement and enable parallel, simultaneous detection of thousands of single-molecule sequencing reactions. Conjugation of fluorophores to the terminal phosphate moiety of the dNTPs allows continuous observation of DNA synthesis over thousands of bases without steric hindrance. The data report directly on polymerase dynamics, revealing distinct polymerization states and pause sites corresponding to DNA secondary structure. Sequence data were aligned with the known reference sequence to assay biophysical parameters of polymerization for each template position.
Consensus sequences were generated from the single-molecule reads at 15-fold coverage, showing a median accuracy of 99.3%, with no systematic error beyond fluorophore-dependent error rates. Keywords: Base Sequence; Consensus Sequence; DNA, Circular, chemistry; DNA, Single-Stranded, chemistry; DNA, biosynthesis; DNA-Directed DNA Polymerase, metabolism; Deoxyribonucleotides, metabolism; Enzymes, Immobilized; Fluorescent Dyes; Kinetics; Nanostructures; Sequence Analysis, DNA, methods; Spectrometry, Fluorescence
[Tayrac2009Simultaneous]	M. de Tayrac, S. Lê, M. Aubry, J. Mosser, and F. Husson. Simultaneous analysis of distinct omics data sets with integration of biological knowledge: Multiple factor analysis approach. BMC Genomics, 10:32, 2009. [ bib \| DOI \| http ] Genomic analysis will greatly benefit from considering in a global way various sources of molecular data with the related biological knowledge. It is thus of great importance to provide useful integrative approaches dedicated to ease the interpretation of microarray data. Here, we introduce a data-mining approach, Multiple Factor Analysis (MFA), to combine multiple data sets and to add formalized knowledge. MFA is used to jointly analyse the structure emerging from genomic and transcriptomic data sets. The common structures are underlined and graphical outputs are provided such that biological meaning becomes easily retrievable. Gene Ontology terms are used to build gene modules that are superimposed on the experimentally interpreted plots. Functional interpretations are then supported by a step-by-step sequence of graphical representations. When applied to genomic and transcriptomic data and associated Gene Ontology annotations, our method prioritizes the biological processes linked to the experimental settings. Furthermore, it reduces the time and effort to analyze large amounts of 'Omics' data. Keywords: Animals; Comparative Genomic Hybridization; Factor Analysis, Statistical; Gene Expression Profiling, methods; Genomics, methods; Glioma, genetics; Humans; Mice; Models, Biological; Oligonucleotide Array Sequence Analysis, methods
[Voduc2010Breast]	K. David Voduc, Maggie C U Cheang, Scott Tyldesley, Karen Gelmon, Torsten O Nielsen, and Hagen Kennecke. Breast cancer subtypes and the risk of local and regional relapse. J Clin Oncol, 28(10):1684-1691, Apr 2010. [ bib \| DOI \| http \| .pdf ] The risk of local and regional relapse associated with each breast cancer molecular subtype was determined in a large cohort of patients with breast cancer. Subtype assignment was accomplished using a validated six-marker immunohistochemical panel applied to tissue microarrays. Semiquantitative analysis of estrogen receptor (ER), progesterone receptor (PR), Ki-67, human epidermal growth factor receptor 2 (HER2), epidermal growth factor receptor (EGFR), and cytokeratin (CK) 5/6 was performed on tissue microarrays constructed from 2,985 patients with early invasive breast cancer. Patients were classified into the following categories: luminal A, luminal B, luminal-HER2, HER2 enriched, basal-like, or triple-negative phenotype-nonbasal. Multivariable Cox analysis was used to determine the risk of local or regional relapse associated with the intrinsic subtypes, adjusting for standard clinicopathologic factors. The intrinsic molecular subtype was successfully determined in 2,985 tumors. The median follow-up time was 12 years, and there have been a total of 325 local recurrences and 227 regional lymph node recurrences. Luminal A tumors (ER or PR positive, HER2 negative, Ki-67 < 1%) had the best prognosis and the lowest rate of local or regional relapse. For patients undergoing breast conservation, HER2-enriched and basal subtypes demonstrated an increased risk of regional recurrence, and this was statistically significant on multivariable analysis. After mastectomy, luminal B, luminal-HER2, HER2-enriched, and basal subtypes were all associated with an increased risk of local and regional relapse on multivariable analysis. Luminal A tumors are associated with a low risk of local or regional recurrence.
Molecular subtyping of breast tumors using a six-marker immunohistochemical panel can identify patients at increased risk of local and regional recurrence. Keywords: Adult; Breast Neoplasms, mortality/pathology; Female; Humans; Ki-67 Antigen, metabolism; Lymphatic Metastasis; Middle Aged; Neoplasm Metastasis; Neoplasm Recurrence, Local; Neoplasms, Hormone-Dependent; Receptor, Epidermal Growth Factor, metabolism; Receptors, Estrogen, analysis; Receptors, Progesterone, analysis; Tissue Array Analysis; Tumor Markers, Biological, analysis
[Taby2010Cancer]	Rodolphe Taby and Jean-Pierre J Issa. Cancer epigenetics. CA Cancer J Clin, 60(6):376-392, 2010. [ bib \| DOI \| http ] Epigenetics refers to stable alterations in gene expression with no underlying modifications in the genetic sequence and is best exemplified by differentiation, in which multiple cell types diverge physiologically despite a common genetic code. Interest in this area of science has grown over the past decades, especially since it was found to play a major role in physiologic phenomena such as embryogenesis, imprinting, and X chromosome inactivation, and in disease states such as cancer. The latter had been previously thought of as a disease with an exclusive genetic etiology. However, recent data have demonstrated that the complexity of human carcinogenesis cannot be accounted for by genetic alterations alone, but also involves epigenetic changes in processes such as DNA methylation, histone modifications, and microRNA expression. In turn, these molecular alterations lead to permanent changes in the expression of genes that regulate the neoplastic phenotype, such as cellular growth and invasiveness. Targeting epigenetic modifiers has been referred to as epigenetic therapy. The success of this approach in hematopoietic malignancies validates the importance of epigenetic alterations in cancer, not only at the therapeutic level but also with regard to prevention, diagnosis, risk stratification, and prognosis. Keywords: Animals; Cell Cycle, genetics; Cell Transformation, Neoplastic, genetics; DNA Methylation; Epigenesis, Genetic; Histones, genetics; Humans; MicroRNAs, genetics; Neoplasm Invasiveness, genetics; Neoplasms, classification/diagnosis/genetics/metabolism/prevention /&/ control/therapy; Prognosis; Risk Assessment; Tumor Markers, Biological, genetics
[Markowetz2010How]	Florian Markowetz. How to understand the cell by breaking it: network analysis of gene perturbation screens. PLoS Comput Biol, 6(2):e1000655, 2010. [ bib \| DOI \| http ] Keywords: Animals; Cell Physiological Processes; Cluster Analysis; Gene Regulatory Networks; Genomics; Humans; Models, Genetic; Models, Statistical; Phenotype; Signal Transduction; Systems Biology
[Jovanovic2010epigenetics]	Jovana Jovanovic, Jo Anders Rønneberg, Jörg Tost, and Vessela Kristensen. The epigenetics of breast cancer. Mol Oncol, 4(3):242-254, Jun 2010. [ bib \| DOI \| http ] Epigenetic changes can be defined as stable molecular alterations of a cellular phenotype such as the gene expression profile of a cell that are heritable during somatic cell divisions (and sometimes germ line transmissions) but do not involve changes of the DNA sequence itself. Epigenetic phenomena are mediated by several molecular mechanisms comprising histone modifications, polycomb/trithorax protein complexes, small non-coding or antisense RNAs and DNA methylation. These different modifications are closely interconnected. Epigenetic regulation is critical in normal growth and development and closely conditions the transcriptional potential of genes. Epigenetic mechanisms convey genomic adaptation to an environment, thereby ultimately contributing towards a given phenotype. In this review we will describe the various aspects of epigenetics and in particular DNA methylation in breast carcinogenesis and their potential application for diagnosis, prognosis and treatment decision. Keywords: Breast Neoplasms, diagnosis/genetics/pathology/therapy; Chromatin, chemistry/metabolism; DNA Methylation; DNA Modification Methylases, metabolism; DNA, chemistry/metabolism; Epigenesis, Genetic; Female; Gene Expression Regulation, Neoplastic; Histones, metabolism; Humans; MicroRNAs, genetics/metabolism; Molecular Structure; Prognosis; Receptors, Estrogen, genetics/metabolism; Tumor Markers, Biological, metabolism
[Gehlenborg2010Visualization]	Nils Gehlenborg, Seán I O'Donoghue, Nitin S Baliga, Alexander Goesmann, Matthew A Hibbs, Hiroaki Kitano, Oliver Kohlbacher, Heiko Neuweger, Reinhard Schneider, Dan Tenenbaum, and Anne-Claude Gavin. Visualization of omics data for systems biology. Nat Methods, 7(3 Suppl):S56-S68, Mar 2010. [ bib \| DOI \| http ] High-throughput studies of biological systems are rapidly accumulating a wealth of 'omics'-scale data. Visualization is a key aspect of both the analysis and understanding of these data, and users now have many visualization methods and tools to choose from. The challenge is to create clear, meaningful and integrated visualizations that give biological insight, without being overwhelmed by the intrinsic complexity of the data. In this review, we discuss how visualization tools are being used to help interpret protein interaction, gene expression and metabolic profile data, and we highlight emerging new directions. Keywords: Genomics; Image Processing, Computer-Assisted; Mass Spectrometry; Metabolomics; Nuclear Magnetic Resonance, Biomolecular; Protein Binding; Proteomics; Systems Biology
[Fullwood2010Chromatin]	Melissa J Fullwood, Yuyuan Han, Chia-Lin Wei, Xiaoan Ruan, and Yijun Ruan. Chromatin interaction analysis using paired-end tag sequencing. Curr Protoc Mol Biol, Chapter 21:Unit 21.15.1-Unit 21.1525, Jan 2010. [ bib \| DOI \| http ] Chromatin Interaction Analysis using Paired-End Tag sequencing (ChIA-PET) is a technique developed for large-scale, de novo analysis of higher-order chromatin structures. Cells are treated with formaldehyde to cross-link chromatin interactions, DNA segments bound by protein factors are enriched by chromatin immunoprecipitation, and interacting DNA fragments are then captured by proximity ligation. The Paired-End Tag (PET) strategy is applied to the construction of ChIA-PET libraries, which are sequenced by high-throughput next-generation sequencing technologies. Finally, raw PET sequences are subjected to bioinformatics analysis, resulting in a genome-wide map of binding sites and chromatin interactions mediated by the protein factor under study. This unit describes ChIA-PET for genome-wide analysis of chromatin interactions in mammalian cells, with the application of Roche/454 and Illumina sequencing technologies. Keywords: Animals; Chromatin; Computational Biology; Databases, Nucleic Acid; Genome-Wide Association Study; Humans; Sequence Analysis, DNA
[Bullard2010Evaluation]	J. H. Bullard, E. Purdom, K. D. Hansen, and S. Dudoit. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics, 11:94, 2010. [ bib \| DOI \| http ] High-throughput sequencing technologies, such as the Illumina Genome Analyzer, are powerful new tools for investigating a wide range of biological and medical questions. Statistical and computational methods are key for drawing meaningful and accurate conclusions from the massive and complex datasets generated by the sequencers. We provide a detailed evaluation of statistical methods for normalization and differential expression (DE) analysis of Illumina transcriptome sequencing (mRNA-Seq) data. We compare statistical methods for detecting genes that are significantly DE between two types of biological samples and find that there are substantial differences in how the test statistics handle low-count genes. We evaluate how DE results are affected by features of the sequencing platform, such as varying gene lengths, base-calling calibration method (with and without phi X control lane), and flow-cell/library preparation effects. We investigate the impact of the read count normalization method on DE results and show that the standard approach of scaling by total lane counts (e.g., RPKM) can bias estimates of DE. We propose more general quantile-based normalization procedures and demonstrate an improvement in DE detection. Our results have significant practical and methodological implications for the design and analysis of mRNA-Seq experiments. They highlight the importance of appropriate statistical methods for normalization and DE inference, to account for features of the sequencing platform that could impact the accuracy of results. They also reveal the need for further research in the development of statistical and computational methods for mRNA-Seq.
Keywords: Computational Biology; Databases, Genetic; RNA, Messenger; Sequence Analysis, RNA
[Aranda2010IntAct]	B. Aranda, P. Achuthan, Y. Alam-Faruque, I. Armean, A. Bridge, C. Derow, M. Feuermann, A. T. Ghanbarian, S. Kerrien, J. Khadake, J. Kerssemakers, C. Leroy, M. Menden, M. Michaut, L. Montecchi-Palazzi, S. N. Neuhauser, S. Orchard, V. Perreau, B. Roechert, K. van Eijk, and H. Hermjakob. The IntAct molecular interaction database in 2010. Nucleic Acids Res, 38(Database issue):D525-D531, Jan 2010. [ bib \| DOI \| http ] IntAct is an open-source, open data molecular interaction database and toolkit. Data is abstracted from the literature or from direct data depositions by expert curators following a deep annotation model providing a high level of detail. As of September 2009, IntAct contains over 200,000 curated binary interaction evidences. In response to the growing data volume and user requests, IntAct now provides a two-tiered view of the interaction data. The search interface allows the user to iteratively develop complex queries, exploiting the detailed annotation with hierarchical controlled vocabularies. Results are provided at any stage in a simplified, tabular view. Specialized views then allow 'zooming in' on the full annotation of interactions, interactors and their properties. IntAct source code and data are freely available at http://www.ebi.ac.uk/intact. Keywords: Animals; Computational Biology; Databases, Genetic; Databases, Protein; False Positive Reactions; Humans; Information Storage and Retrieval; Internet; Programming Languages; Protein Interaction Mapping; Protein Structure, Tertiary; Proteins; Software; User-Computer Interface; Vocabulary, Controlled
[Consortium2010map]	1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature, 467(7319):1061-1073, Oct 2010. [ bib \| DOI \| http ] The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother-father-child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10^-8 per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.
Keywords: Calibration; Chromosomes, Human, Y, genetics; Computational Biology; DNA Mutational Analysis; DNA, Mitochondrial, genetics; Evolution, Molecular; Female; Genetic Association Studies; Genetic Variation, genetics; Genetics, Population, methods; Genome, Human, genetics; Genome-Wide Association Study; Genomics, methods; Genotype; Haplotypes, genetics; Humans; Male; Mutation, genetics; Pilot Projects; Polymorphism, Single Nucleotide, genetics; Recombination, Genetic, genetics; Sample Size; Selection, Genetic, genetics; Sequence Alignment; Sequence Analysis, DNA, methods
[Tomizaki2010Protein]	Kin-ya Tomizaki, Kenji Usui, and Hisakazu Mihara. Protein-protein interactions and selection: array-based techniques for screening disease-associated biomarkers in predictive/early diagnosis. FEBS J, 277(9):1996-2005, May 2010. [ bib \| DOI \| http ] There has been considerable interest in recent years in the development of miniaturized and parallelized array technology for protein-protein interaction analysis and protein profiling, namely 'protein-detecting microarrays'. Protein-detecting microarrays utilize a wide variety of capture agents (antibodies, fusion proteins, DNA/RNA aptamers, synthetic peptides, carbohydrates, and small molecules) immobilized at high spatial density on a solid surface. Each capture agent binds selectively to its target protein in a complex mixture, such as serum or cell lysate samples. Captured proteins are subsequently detected and quantified in a high-throughput fashion, with minimal sample consumption. Protein-detecting microarrays were first described by MacBeath and Schreiber in 2000, and the number of publications involving this technology is rapidly increasing. Furthermore, the first multiplex immunoassay systems have been cleared by the US Food and Drug Administration, signaling recognition of the usefulness of miniaturized and parallelized array technology for protein detection in predictive/early diagnosis. Although genetic tests still predominate, with further development protein-based diagnosis will become common in clinical use within a few years. Keywords: Animals; Biological Markers, analysis/metabolism; Early Diagnosis; Humans; Mass Screening, methods; Protein Array Analysis, methods; Proteins, analysis/metabolism; Risk Factors
[Blows2010Subtyping]	Fiona M Blows, Kristy E Driver, Marjanka K Schmidt, Annegien Broeks, Flora E van Leeuwen, Jelle Wesseling, Maggie C Cheang, Karen Gelmon, Torsten O Nielsen, Carl Blomqvist, Päivi Heikkilä, Tuomas Heikkinen, Heli Nevanlinna, Lars A Akslen, Louis R Bégin, William D Foulkes, Fergus J Couch, Xianshu Wang, Vicky Cafourek, Janet E Olson, Laura Baglietto, Graham G Giles, Gianluca Severi, Catriona A McLean, Melissa C Southey, Emad Rakha, Andrew R Green, Ian O Ellis, Mark E Sherman, Jolanta Lissowska, William F Anderson, Angela Cox, Simon S Cross, Malcolm W R Reed, Elena Provenzano, Sarah-Jane Dawson, Alison M Dunning, Manjeet Humphreys, Douglas F Easton, Montserrat García-Closas, Carlos Caldas, Paul D Pharoah, and David Huntsman. Subtyping of breast cancer by immunohistochemistry to investigate a relationship between subtype and short and long term survival: a collaborative analysis of data for 10,159 cases from 12 studies. PLoS Med, 7(5):e1000279, May 2010. [ bib \| DOI \| http \| .pdf ] Immunohistochemical markers are often used to classify breast cancer into subtypes that are biologically distinct and behave differently. The aim of this study was to estimate mortality for patients with the major subtypes of breast cancer as classified using five immunohistochemical markers, to investigate patterns of mortality over time, and to test for heterogeneity by subtype. We pooled data from more than 10,000 cases of invasive breast cancer from 12 studies that had collected information on hormone receptor status, human epidermal growth factor receptor-2 (HER2) status, and at least one basal marker (cytokeratin [CK]5/6 or epidermal growth factor receptor [EGFR]) together with survival time data. Tumours were classified as luminal and nonluminal tumours according to hormone receptor expression.
These two groups were further subdivided according to expression of HER2, and finally, the luminal and nonluminal HER2-negative tumours were categorised according to expression of basal markers. Changes in mortality rates over time differed by subtype. In women with luminal HER2-negative subtypes, mortality rates were constant over time, whereas mortality rates associated with the luminal HER2-positive and nonluminal subtypes tended to peak within 5 y of diagnosis and then decline over time. In the first 5 y after diagnosis the nonluminal tumours were associated with a poorer prognosis, but over longer follow-up times the prognosis was poorer in the luminal subtypes, with the worst prognosis at 15 y being in the luminal HER2-positive tumours. Basal marker expression distinguished the HER2-negative luminal and nonluminal tumours into different subtypes. These patterns were independent of any systemic adjuvant therapy. The six subtypes of breast cancer defined by expression of five markers show distinct behaviours with important differences in short term and long term prognosis. Application of these markers in the clinical setting could have the potential to improve the targeting of adjuvant chemotherapy to those most likely to benefit. The different patterns of mortality over time also suggest important biological differences between the subtypes that may result in differences in response to specific therapies, and that stratification of breast cancers by clinically relevant subtypes in clinical trials is urgently required. Keywords: Adult; Aged; Aged, 80 and over; Breast Neoplasms, metabolism/mortality/pathology; Female; Hormones, analysis; Humans; Immunohistochemistry; Keratins; Middle Aged; Prognosis; Proportional Hazards Models; Receptor, Epidermal Growth Factor, analysis; Receptors, Cell Surface, metabolism; Tumor Markers, Biological, analysis; Young Adult
[Rodriguez-Paredes2011Cancer]	Manuel Rodríguez-Paredes and Manel Esteller. Cancer epigenetics reaches mainstream oncology. Nat Med, 17(3):330-339, Mar 2011. [ bib \| DOI \| http ] Epigenetics is one of the most promising and expanding fields in the current biomedical research landscape. Since the inception of epigenetics in the 1940s, the discoveries regarding its implications in normal and disease biology have not stopped, compiling a vast amount of knowledge in the past decade. The field has moved from just one recognized marker, DNA methylation, to a variety of others, including a wide spectrum of histone modifications. From the methodological standpoint, the successful initial single gene candidate approaches have been complemented by the current comprehensive epigenomic approaches that allow the interrogation of genomes to search for translational applications in an unbiased manner. Most important, the discovery of mutations in the epigenetic machinery and the approval of the first epigenetic drugs for the treatment of subtypes of leukemias and lymphomas has been an eye-opener for many biomedical scientists and clinicians. Herein, we will summarize the progress in the field of cancer epigenetics research that has reached mainstream oncology in the development of new biomarkers of the disease and new pharmacological strategies. Keywords: Amino Acid Sequence; DNA Methylation; Epigenesis, Genetic; Humans; Molecular Sequence Data; Neoplasms, genetics/therapy; Tumor Markers, Biological
