stat references

[CatoniGibbs] O. Catoni. Gibbs estimators. Revised version. [ bib | .dvi | .ps ]
[Bernstein1977Protein] F. C. Bernstein, T. F. Koetzle, G. J. Williams, E. F. Meyer, M. D. Brice, J. R. Rodgers, O. Kennard, T. Shimanouchi, and M. Tasumi. The protein data bank: a computer-based archival file for macromolecular structures. J. Mol. Biol., 112(3):535-542, May 1977. [ bib ]
Keywords: Computers; Great Britain; Information Systems; Japan; Protein Conformation; Proteins; United States
[Talagrand1995Concentration] M. Talagrand. Concentration of measure and isoperimetric inequalities in product spaces. Publ. Math. I.H.E.S., 81:73-203, 1995. [ bib | .dvi | .pdf ]
[Talagrand1996Newa] M. Talagrand. A New Look at Independence. Ann. Probab., 24:1-34, 1996. [ bib | .dvi | .pdf ]
[Talagrand1996New] M. Talagrand. New concentration inequalities for product spaces. Inventiones Math., 126:505-563, 1996. [ bib | .dvi | .pdf ]
[Talagrand1996Majorizing] M. Talagrand. Majorizing measures: The generic chaining. Ann. Probab., 24:1049-1103, 1996. [ bib | .dvi | .pdf ]
[Lauritzen1996Graphical] S. Lauritzen. Graphical Models. Oxford, 1996. [ bib ]
[Zhu1997Minimax] S. C. Zhu, Z. N. Wu, and D. Mumford. Minimax Entropy Principle and Its Application to Texture Modeling. Neural Comput., 9(8):1627-1660, 1997. [ bib | .ps.gz | .pdf ]
[Zhu1998FRAME:] S. C. Zhu, Y. Wu, and D. Mumford. FRAME: Filters, Random field And Maximum Entropy: Towards a Unified Theory for Texture Modeling. Int'l Journal of Computer Vision, 27(2):1-20, 1998. [ bib | .ps.gz | .pdf ]
[Schneider1998Artificial] G. Schneider and P. Wrede. Artificial neural networks for computer-based molecular design. Prog Biophys Mol Biol, 70(3):175-222, 1998. [ bib ]
The theory of artificial neural networks is briefly reviewed focusing on supervised and unsupervised techniques which have great impact on current chemical applications. An introduction to molecular descriptors and representation schemes is given. In addition, worked examples of recent advances in this field are highlighted and pioneering publications are discussed. Applications of several types of artificial neural networks to compound classification, modelling of structure-activity relationships, biological target identification, and feature extraction from biopolymers are presented and compared to other techniques. Advantages and limitations of neural networks for computer-aided molecular design and sequence analysis are discussed.

Keywords: Algorithms, Amino Acid Sequence, Amino Acids, Animals, Artificial Intelligence, Automated, Bacterial, Bacterial Proteins, Bicuculline, Binding Sites, Biological, Biological Availability, Blood Proteins, Blood-Brain Barrier, Cation Transport Proteins, Cats, Cell Membrane Permeability, Chemical, Chemistry, Cluster Analysis, Combinatorial Chemistry Techniques, Comparative Study, Computational Biology, Computer Simulation, Computer Systems, Computer-Aided Design, Computer-Assisted, Computing Methodologies, DNA-Binding Proteins, Databases, Dogs, Drug Design, Electric Stimulation, Electromyography, Enzyme Inhibitors, Ether-A-Go-Go Potassium Channels, Excitatory Amino Acid Antagonists, Factual, False Positive Reactions, Forecasting, Forelimb, GABA Antagonists, Gene Expression Profiling, Genome, Glutamic Acid, Humans, Hydrogen Bonding, Image Enhancement, Image Interpretation, Image Processing, Information Storage and Retrieval, Iontophoresis, Kynurenic Acid, Least-Squares Analysis, Linear Models, Liver, Markov Chains, Metabolic Clearance Rate, Metalloendopeptidases, Microelectrodes, Models, Molecular, Molecular Conformation, Molecular Sequence Data, Molecular Structure, Motor Cortex, Movement, Multivariate Analysis, Nerve Net, Neural Networks (Computer), Neuropeptides, Non-U.S. Gov't, Nonlinear Dynamics, Pattern Recognition, Pharmaceutical, Pharmaceutical Preparations, Pharmacokinetics, Phylogeny, Potassium Channels, Predictive Value of Tests, Protein Interaction Mapping, Protein Sorting Signals, Protein Structure, Proteins, Rats, Reproducibility of Results, Research Support, Sensitivity and Specificity, Sequence Alignment, Sequence Analysis, Shoulder, Signal Processing, Software, Statistical, Stereotaxic Techniques, Structure-Activity Relationship, Terminology, Tertiary, Trans-Activators, Voltage-Gated, Zinc, 9830312
[Poggio1998Sparse] Poggio and Girosi. A Sparse Representation for Function Approximation. Neural Comput, 10(6):1445-54, Jul 1998. [ bib ]
We derive a new general representation for a function as a linear combination of local correlation kernels at optimal sparse locations (and scales) and characterize its relation to principal component analysis, regularization, sparsity principles, and support vector machines.

Keywords: Algorithms, Automated, Biometry, Computers, DNA, Databases, Factual, Fungal, Fungal Proteins, GTP-Binding Proteins, Gene Expression, Genes, Learning, Markov Chains, Models, Neural Networks (Computer), Neurological, Non-P.H.S., Non-U.S. Gov't, Nucleic Acid Hybridization, Open Reading Frames, P.H.S., Pattern Recognition, Protein, Protein Structure, Proteins, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Sequence Alignment, Sequence Analysis, Software, Statistical, Tertiary, U.S. Gov't, 9698352
[Lugosi1998On] G. Lugosi. On concentration-of-measure inequalities. Seminar notes, 1998. [ bib | .ps | .pdf ]
[Girosi1998Equivalence] Girosi. An Equivalence Between Sparse Approximation and Support Vector Machines. Neural Comput, 10(6):1455-80, Jul 1998. [ bib ]
This article shows a relationship between two different approximation techniques: the support vector machines (SVM), proposed by V. Vapnik (1995) and a sparse approximation scheme that resembles the basis pursuit denoising algorithm (Chen, 1995; Chen, Donoho, and Saunders, 1995). SVM is a technique that can be derived from the structural risk minimization principle (Vapnik, 1982) and can be used to estimate the parameters of several different approximation schemes, including radial basis functions, algebraic and trigonometric polynomials, B-splines, and some forms of multilayer perceptrons. Basis pursuit denoising is a sparse approximation technique in which a function is reconstructed by using a small number of basis functions chosen from a large set (the dictionary). We show that if the data are noiseless, the modified version of basis pursuit denoising proposed in this article is equivalent to SVM in the following sense: if applied to the same data set, the two techniques give the same solution, which is obtained by solving the same quadratic programming problem. In the appendix, we present a derivation of the SVM technique in one framework of regularization theory, rather than statistical learning theory, establishing a connection between SVM, sparse approximation, and regularization theory.

Keywords: Algorithms, Automated, Biometry, Computers, DNA, Databases, Factual, Fungal, Fungal Proteins, GTP-Binding Proteins, Gene Expression, Genes, Learning, Markov Chains, Models, Neural Networks (Computer), Neurological, Non-P.H.S., Non-U.S. Gov't, Nucleic Acid Hybridization, Open Reading Frames, P.H.S., Pattern Recognition, Protein, Protein Structure, Proteins, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Sequence Alignment, Sequence Analysis, Software, Statistical, Tertiary, U.S. Gov't, 9698353
[Pontil1998Properties] M. Pontil and A. Verri. Properties of support vector machines. Neural Comput, 10(4):955-74, May 1998. [ bib ]
Support vector machines (SVMs) perform pattern recognition between two point classes by finding a decision surface determined by certain points of the training set, termed support vectors (SV). This surface, which in some feature space of possibly infinite dimension can be regarded as a hyperplane, is obtained from the solution of a problem of quadratic programming that depends on a regularization parameter. In this article, we study some mathematical properties of support vectors and show that the decision surface can be written as the sum of two orthogonal terms, the first depending on only the margin vectors (which are SVs lying on the margin), the second proportional to the regularization parameter. For almost all values of the parameter, this enables us to predict how the decision surface varies for small parameter changes. In the special but important case of feature space of finite dimension m, we also show that m + 1 SVs are usually sufficient to determine the decision surface fully. For relatively small m, this latter result leads to a consistent reduction of the SV number.

Keywords: Algorithms, Artificial Intelligence, Automated, Biometry, Computers, DNA, Databases, Factual, Fungal, Fungal Proteins, GTP-Binding Proteins, Gene Expression, Genes, Learning, Linear Models, Markov Chains, Mathematics, Models, Neural Networks (Computer), Neurological, Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Nucleic Acid Hybridization, Open Reading Frames, P.H.S., Pattern Recognition, Protein, Protein Structure, Proteins, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Sequence Alignment, Sequence Analysis, Software, Statistical, Tertiary, U.S. Gov't, 9573414
[Mathews1999Expandeda] D. H. Mathews, J. Sabina, M. Zuker, and D. H. Turner. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol., 288(5):911-940, May 1999. [ bib | DOI | http ]
An improved dynamic programming algorithm is reported for RNA secondary structure prediction by free energy minimization. Thermodynamic parameters for the stabilities of secondary structure motifs are revised to include expanded sequence dependence as revealed by recent experiments. Additional algorithmic improvements include reduced search time and storage for multibranch loop free energies and improved imposition of folding constraints. An extended database of 151,503 nt in 955 structures determined by comparative sequence analysis was assembled to allow optimization of parameters not based on experiments and to test the accuracy of the algorithm. On average, the predicted lowest free energy structure contains 73% of known base-pairs when domains of fewer than 700 nt are folded; this compares with 64% accuracy for previous versions of the algorithm and parameters. For a given sequence, a set of 750 generated structures contains one structure that, on average, has 86% of known base-pairs. Experimental constraints, derived from enzymatic and flavin mononucleotide cleavage, improve the accuracy of structure predictions.

Keywords: 16S, 23S, 5S, Affinity, Algorithms, Aluminum Silicates, Amino Acid, Amino Acid Sequence, Amyloidosis, Archaeal, Bacillus, Bacterial, Bacterial Proteins, Bacteriophage T4, Base Sequence, Chloroplast, Chromatography, Circular Dichroism, Comparative Study, Computational Biology, Databases, Electrophoresis, Entropy, Enzyme Stability, Escherichia coli, Factual, Fibroblast Growth Factor 2, Flavin Mononucleotide, Fluorescence, Genetic, Guanidine, Humans, Huntington Disease, Kinetics, Light, Models, Molecular Sequence Data, Non-P.H.S., Non-U.S. Gov't, Nucleic Acid Conformation, P.H.S., Peptides, Phylogeny, Polyacrylamide Gel, Predictive Value of Tests, Protein Binding, Protein Denaturation, Protein Folding, Protein Structure, RNA, Radiation, Recombinant Proteins, Research Support, Ribosomal, Scattering, Secondary, Sequence Homology, Solutions, Spectrometry, Statistical, Temperature, Thermodynamics, Time Factors, Trinucleotide Repeat Expansion, U.S. Gov't, alpha-Amylase, 10329189
[Lugosi1999Adaptive] G. Lugosi and A. Nobel. Adaptive Model Selection Using Empirical Complexities. Ann. Stat., 27(6):1830-1864, December 1999. [ bib | .ps | .pdf ]
[Wilbur2000Boosting] W. J. Wilbur. Boosting naive Bayesian learning on a large subset of MEDLINE. Proc AMIA Symp, pages 918-22, 2000. [ bib ]
We are concerned with the rating of new documents that appear in a large database (MEDLINE) and are candidates for inclusion in a small specialty database (REBASE). The requirement is to rank the new documents as nearly in order of decreasing potential to be added to the smaller database as possible, so as to improve the coverage of the smaller database without increasing the effort of those who manage this specialty database. To perform this ranking task we have considered several machine learning approaches based on the naïve Bayesian algorithm. We find that adaptive boosting outperforms naïve Bayes, but that a new form of boosting which we term staged Bayesian retrieval outperforms adaptive boosting. Staged Bayesian retrieval involves two stages of Bayesian retrieval and we further find that if the second stage is replaced by a support vector machine we again obtain a significant improvement over the strictly Bayesian approach.

Keywords: Acute, Acute Disease, Adenocarcinoma, Algorithms, Amino Acid Sequence, Animals, Artificial Intelligence, Automated, B-Lymphocytes, Bacterial Proteins, Base Pair Mismatch, Base Sequence, Bayes Theorem, Binding Sites, Biological, Bone Marrow Cells, Brachyura, Cell Compartmentation, Chemistry, Child, Chromosome Aberrations, Classification, Codon, Colonic Neoplasms, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA, Data Interpretation, Databases, Decision Trees, Diabetes Mellitus, Diagnosis, Discriminant Analysis, Discrimination Learning, Electric Conductivity, Electrophysiology, Escherichia coli Proteins, Factual, Feedback, Female, Fungal, Gastric Emptying, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Genetic Predisposition to Disease, Genomics, Hemolysins, Humans, Indians, Information Storage and Retrieval, Initiator, Ion Channels, Kinetics, Leukemia, Likelihood Functions, Lipid Bilayers, Logistic Models, Lymphocytic, MEDLINE, Male, Markov Chains, Melanoma, Models, Molecular, Myeloid, Neoplasm, Neoplasms, Neoplastic, Neural Networks (Computer), Neurological, Nevus, Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Normal Distribution, North American, Nucleic Acid Conformation, Oligonucleotide Array Sequence Analysis, Organ Specificity, Organelles, Ovarian Neoplasms, Ovary, P.H.S., Pattern Recognition, Physical, Pigmented, Predictive Value of Tests, Promoter Regions (Genetics), Protein Biosynthesis, Protein Folding, Protein Structure, Proteins, Proteome, RNA, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Secondary, Sensitivity and Specificity, Sequence Alignment, Sequence Analysis, Sex Characteristics, Skin Diseases, Skin Neoplasms, Skin Pigmentation, Software, Sound Spectrography, Statistical, Stomach Diseases, T-Lymphocytes, Thermodynamics, Transcription, Transcription Factors, Tumor Markers, Type 2, U.S. Gov't, Vertebrates, 11080018
[Risau-Gusman2000Generalization] Risau-Gusman and Gordon. Generalization properties of finite-size polynomial support vector machines. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics, 62(5 Pt B):7092-9, Nov 2000. [ bib ]
The learning properties of finite-size polynomial support vector machines are analyzed in the case of realizable classification tasks. The normalization of the high-order features acts as a squeezing factor, introducing a strong anisotropy in the patterns distribution in feature space. As a function of the training set size, the corresponding generalization error presents a crossover, more or less abrupt depending on the distribution's anisotropy and on the task to be learned, between a fast-decreasing and a slowly decreasing regime. This behavior corresponds to the stepwise decrease found by Dietrich et al. [Phys. Rev. Lett. 82, 2975 (1999)] in the thermodynamic limit. The theoretical results are in excellent agreement with the numerical simulations.

Keywords: Acute, Acute Disease, Adenocarcinoma, Algorithms, Amino Acid Sequence, Animals, Artificial Intelligence, Automated, B-Lymphocytes, Bacterial Proteins, Base Pair Mismatch, Base Sequence, Bayes Theorem, Binding Sites, Biological, Bone Marrow Cells, Brachyura, Cell Compartmentation, Chemistry, Child, Chromosome Aberrations, Classification, Codon, Colonic Neoplasms, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA, Data Interpretation, Databases, Decision Trees, Diabetes Mellitus, Diagnosis, Discriminant Analysis, Discrimination Learning, Electric Conductivity, Electrophysiology, Escherichia coli Proteins, Factual, Feedback, Female, Fungal, Gastric Emptying, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Genetic Predisposition to Disease, Genomics, Hemolysins, Humans, Indians, Initiator, Ion Channels, Kinetics, Leukemia, Likelihood Functions, Lipid Bilayers, Logistic Models, Lymphocytic, Male, Markov Chains, Melanoma, Models, Molecular, Myeloid, Neoplasm, Neoplasms, Neoplastic, Neural Networks (Computer), Neurological, Nevus, Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Normal Distribution, North American, Nucleic Acid Conformation, Oligonucleotide Array Sequence Analysis, Organ Specificity, Organelles, Ovarian Neoplasms, Ovary, P.H.S., Pattern Recognition, Physical, Pigmented, Predictive Value of Tests, Promoter Regions (Genetics), Protein Biosynthesis, Protein Folding, Protein Structure, Proteins, Proteome, RNA, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Secondary, Sensitivity and Specificity, Sequence Alignment, Sequence Analysis, Sex Characteristics, Skin Diseases, Skin Neoplasms, Skin Pigmentation, Software, Sound Spectrography, Statistical, Stomach Diseases, T-Lymphocytes, Thermodynamics, Transcription, Transcription Factors, Tumor Markers, Type 2, U.S. Gov't, Vertebrates, 0011102066
[Opper2000Gaussian] M. Opper and O. Winther. Gaussian processes for classification: mean-field algorithms. Neural Comput, 12(11):2655-84, Nov 2000. [ bib ]
We derive a mean-field algorithm for binary classification with gaussian processes that is based on the TAP approach originally proposed in statistical physics of disordered systems. The theory also yields an approximate leave-one-out estimator for the generalization error, which is computed with no extra computational cost. We show that from the TAP approach, it is possible to derive both a simpler "naive" mean-field theory and support vector machines (SVMs) as limiting cases. For both mean-field algorithms and support vector machines, simulation results for three small benchmark data sets are presented. They show that one may get state-of-the-art performance by using the leave-one-out estimator for model selection and the built-in leave-one-out estimators are extremely precise when compared to the exact leave-one-out estimate. The second result is taken as strong support for the internal consistency of the mean-field approach.

Keywords: Acute, Acute Disease, Adenocarcinoma, Algorithms, Amino Acid Sequence, Animals, Artificial Intelligence, Automated, B-Lymphocytes, Bacterial Proteins, Base Pair Mismatch, Base Sequence, Bayes Theorem, Binding Sites, Biological, Bone Marrow Cells, Brachyura, Cell Compartmentation, Chemistry, Child, Chromosome Aberrations, Classification, Colonic Neoplasms, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA, Data Interpretation, Databases, Decision Trees, Diabetes Mellitus, Diagnosis, Discriminant Analysis, Discrimination Learning, Electric Conductivity, Electrophysiology, Escherichia coli Proteins, Factual, Feedback, Female, Fungal, Gastric Emptying, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Genetic Predisposition to Disease, Hemolysins, Humans, Indians, Ion Channels, Kinetics, Leukemia, Likelihood Functions, Lipid Bilayers, Logistic Models, Lymphocytic, Male, Markov Chains, Melanoma, Models, Molecular, Myeloid, Neoplasm, Neoplasms, Neoplastic, Neural Networks (Computer), Neurological, Nevus, Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Normal Distribution, North American, Nucleic Acid Conformation, Oligonucleotide Array Sequence Analysis, Organ Specificity, Organelles, Ovarian Neoplasms, Ovary, P.H.S., Pattern Recognition, Physical, Pigmented, Predictive Value of Tests, Promoter Regions (Genetics), Protein Folding, Protein Structure, Proteins, Proteome, RNA, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Secondary, Sensitivity and Specificity, Sequence Alignment, Sex Characteristics, Skin Diseases, Skin Neoplasms, Skin Pigmentation, Software, Sound Spectrography, Statistical, Stomach Diseases, T-Lymphocytes, Thermodynamics, Transcription, Transcription Factors, Tumor Markers, Type 2, U.S. Gov't, 11110131
[Boucheron2000sharp] S. Boucheron, G. Lugosi, and P. Massart. A sharp concentration inequality with applications. Random Structures and Algorithms, 16:277-292, 2000. [ bib | .ps | .pdf ]
[Juditsky2000Functional] A. Juditsky and A. Nemirovski. Functional Aggregation for Nonparametric Estimation. Ann. Stat., 28(3):681-712, June 2000. [ bib | .ps.gz | .pdf ]
[Vercoutere2001Rapid] W. Vercoutere, S. Winters-Hilt, H. Olsen, D. Deamer, D. Haussler, and M. Akeson. Rapid discrimination among individual DNA hairpin molecules at single-nucleotide resolution using an ion channel. Nat Biotechnol, 19(3):248-52, Mar 2001. [ bib | DOI | http | .pdf ]
RNA and DNA strands produce ionic current signatures when driven through an alpha-hemolysin channel by an applied voltage. Here we combine this nanopore detector with a support vector machine (SVM) to analyze DNA hairpin molecules on the millisecond time scale. Measurable properties include duplex stem length, base pair mismatches, and loop length. This nanopore instrument can discriminate between individual DNA hairpins that differ by one base pair or by one nucleotide.

Keywords: Acute, Acute Disease, Adenocarcinoma, Algorithms, Amino Acid Sequence, Artificial Intelligence, Automated, B-Lymphocytes, Bacterial Proteins, Base Pair Mismatch, Base Sequence, Bayes Theorem, Binding Sites, Biological, Bone Marrow Cells, Cell Compartmentation, Chemistry, Child, Chromosome Aberrations, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA, Data Interpretation, Databases, Decision Trees, Diagnosis, Discriminant Analysis, Electric Conductivity, Electrophysiology, Escherichia coli Proteins, Factual, Female, Fungal, Gastric Emptying, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Hemolysins, Humans, Ion Channels, Kinetics, Leukemia, Lipid Bilayers, Logistic Models, Lymphocytic, Male, Markov Chains, Melanoma, Models, Molecular, Myeloid, Neoplasm, Neoplastic, Neural Networks (Computer), Nevus, Non-P.H.S., Non-U.S. Gov't, Nucleic Acid Conformation, Organ Specificity, Organelles, P.H.S., Pattern Recognition, Physical, Pigmented, Predictive Value of Tests, Promoter Regions (Genetics), Protein Folding, Protein Structure, Proteins, Proteome, RNA, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Secondary, Sensitivity and Specificity, Sequence Alignment, Sex Characteristics, Skin Diseases, Skin Neoplasms, Skin Pigmentation, Software, Statistical, Stomach Diseases, T-Lymphocytes, Thermodynamics, Transcription, Transcription Factors, Tumor Markers, U.S. Gov't, 11231558
[Suykens2001Optimal] J. A. Suykens, J. Vandewalle, and B. De Moor. Optimal control by least squares support vector machines. Neural Netw, 14(1):23-35, Jan 2001. [ bib ]
Support vector machines have been very successful in pattern recognition and function estimation problems. In this paper we introduce the use of least squares support vector machines (LS-SVMs) for the optimal control of nonlinear systems. Linear and neural full static state feedback controllers are considered. The problem is formulated in such a way that it incorporates the N-stage optimal control problem as well as a least squares support vector machine approach for mapping the state space into the action space. The solution is characterized by a set of nonlinear equations. An alternative formulation as a constrained nonlinear optimization problem in fewer unknowns is given, together with a method for imposing local stability in the LS-SVM control scheme. The results are discussed for support vector machines with radial basis function kernel. Advantages of LS-SVM control are that no number of hidden units has to be determined for the controller and that no centers have to be specified for the Gaussian kernels when applying Mercer's condition. The curse of dimensionality is avoided in comparison with defining a regular grid for the centers in classical radial basis function networks. This is at the expense of taking the trajectory of state variables as additional unknowns in the optimization problem, while classical neural network approaches typically lead to parametric optimization problems. In the SVM methodology the number of unknowns equals the number of training data, while in the primal space the number of unknowns can be infinite dimensional. The method is illustrated both on stabilization and tracking problems including examples on swinging up an inverted pendulum with local stabilization at the endpoint and a tracking problem for a ball and beam system.

Keywords: Acute, Acute Disease, Adenocarcinoma, Algorithms, Amino Acid Sequence, Artificial Intelligence, Automated, B-Lymphocytes, Bacterial Proteins, Base Pair Mismatch, Base Sequence, Bayes Theorem, Binding Sites, Biological, Bone Marrow Cells, Cell Compartmentation, Chemistry, Child, Chromosome Aberrations, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA, Data Interpretation, Databases, Decision Trees, Diagnosis, Discriminant Analysis, Electric Conductivity, Electrophysiology, Escherichia coli Proteins, Factual, Feedback, Female, Fungal, Gastric Emptying, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Hemolysins, Humans, Ion Channels, Kinetics, Leukemia, Lipid Bilayers, Logistic Models, Lymphocytic, Male, Markov Chains, Melanoma, Models, Molecular, Myeloid, Neoplasm, Neoplastic, Neural Networks (Computer), Nevus, Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Normal Distribution, Nucleic Acid Conformation, Organ Specificity, Organelles, P.H.S., Pattern Recognition, Physical, Pigmented, Predictive Value of Tests, Promoter Regions (Genetics), Protein Folding, Protein Structure, Proteins, Proteome, RNA, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Secondary, Sensitivity and Specificity, Sequence Alignment, Sex Characteristics, Skin Diseases, Skin Neoplasms, Skin Pigmentation, Software, Statistical, Stomach Diseases, T-Lymphocytes, Thermodynamics, Transcription, Transcription Factors, Tumor Markers, U.S. Gov't, 11213211
[Sherry2001dbSNP] S. T. Sherry, M. H. Ward, M. Kholodov, J. Baker, L. Phan, E. M. Smigielski, and K. Sirotkin. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res, 29(1):308-311, Jan 2001. [ bib ]
In response to a need for a general catalog of genome variation to address the large-scale sampling designs required by association studies, gene mapping and evolutionary biology, the National Center for Biotechnology Information (NCBI) has established the dbSNP database [S.T.Sherry, M.Ward and K. Sirotkin (1999) Genome Res., 9, 677-679]. Submissions to dbSNP will be integrated with other sources of information at NCBI such as GenBank, PubMed, LocusLink and the Human Genome Project data. The complete contents of dbSNP are available to the public at website: http://www.ncbi.nlm.nih.gov/SNP. The complete contents of dbSNP can also be downloaded in multiple formats via anonymous FTP at ftp://ncbi.nlm.nih.gov/snp/.

Keywords: Animals; Biotechnology; Databases, Factual; Genetic Variation; Humans; Information Services; Internet; National Institutes of Health (U.S.); National Library of Medicine (U.S.); Polymorphism, Single Nucleotide, genetics; United States
[Miwakeichi2001comparison] F. Miwakeichi, R. Ramirez-Padron, P. A. Valdes-Sosa, and T. Ozaki. A comparison of non-linear non-parametric models for epilepsy data. Comput. Biol. Med., 31(1):41-57, Jan 2001. [ bib ]
EEG spike and wave (SW) activity has been described through a non-parametric stochastic model estimated by the Nadaraya-Watson (NW) method. In this paper the performance of the NW, the local linear polynomial regression and support vector machines (SVM) methods were compared. The noise-free realizations obtained by the NW and SVM methods reproduced SW better than as reported in previous works. The tuning parameters had to be estimated manually. Adding dynamical noise, only the NW method was capable of generating SW similar to training data. The standard deviation of the dynamical noise was estimated by means of the correlation dimension.

Keywords: Acute, Acute Disease, Adenocarcinoma, Algorithms, Amino Acid Sequence, Animals, Artificial Intelligence, Automated, B-Lymphocytes, Bacterial Proteins, Base Pair Mismatch, Base Sequence, Bayes Theorem, Binding Sites, Biological, Bone Marrow Cells, Brachyura, Cell Compartmentation, Chemistry, Child, Chromosome Aberrations, Classification, Codon, Colonic Neoplasms, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA, Data Interpretation, Databases, Decision Trees, Diabetes Mellitus, Diagnosis, Discriminant Analysis, Discrimination Learning, Electric Conductivity, Electroencephalography, Electrophysiology, Epilepsy, Escherichia coli Proteins, Factual, Feedback, Female, Fungal, Gastric Emptying, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Genetic Predisposition to Disease, Genomics, Hemolysins, Humans, Indians, Information Storage and Retrieval, Initiator, Ion Channels, Kinetics, Leukemia, Likelihood Functions, Linear Models, Lipid Bilayers, Logistic Models, Lymphocytic, MEDLINE, Male, Markov Chains, Melanoma, Models, Molecular, Myeloid, Neoplasm, Neoplasms, Neoplastic, Neural Networks (Computer), Neurological, Nevus, Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Normal Distribution, North American, Nucleic Acid Conformation, Oligonucleotide Array Sequence Analysis, Organ Specificity, Organelles, Ovarian Neoplasms, Ovary, P.H.S., Pattern Recognition, Physical, Pigmented, Predictive Value of Tests, Promoter Regions (Genetics), Protein Biosynthesis, Protein Folding, Protein Structure, Proteins, Proteome, RNA, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Secondary, Sensitivity and Specificity, Sequence Alignment, Sequence Analysis, Sex Characteristics, Skin Diseases, Skin Neoplasms, Skin Pigmentation, Software, Sound Spectrography, Statistical, Stochastic Processes, Stomach Diseases, T-Lymphocytes, Thermodynamics, Transcription, Transcription Factors, Tumor Markers, Type 2, U.S. Gov't, Vertebrates, 11058693
[Amari2001Information] S.-I. Amari. Information geometry on hierarchy of probability distributions. IEEE Trans. Inform. Theory, 47(5):1701-1711, July 2001. [ bib | .ps.gz | .pdf ]
[Weber2002Building] Griffin Weber, Staal Vinterbo, and Lucila Ohno-Machado. Building an asynchronous web-based tool for machine learning classification. Proc AMIA Symp, pages 869-73, 2002. [ bib ]
Various unsupervised and supervised learning methods including support vector machines, classification trees, linear discriminant analysis and nearest neighbor classifiers have been used to classify high-throughput gene expression data. Simpler and more widely accepted statistical tools have not yet been used for this purpose, hence proper comparisons between classification methods have not been conducted. We developed free software that implements logistic regression with stepwise variable selection as a quick and simple method for initial exploration of important genetic markers in disease classification. To implement the algorithm and allow our collaborators in remote locations to evaluate and compare its results against those of other methods, we developed a user-friendly asynchronous web-based application with a minimal amount of programming using free, downloadable software tools. With this program, we show that classification using logistic regression can perform as well as other more sophisticated algorithms, and it has the advantages of being easy to interpret and reproduce. By making the tool freely and easily available, we hope to promote the comparison of classification methods. In addition, we believe our web application can be used as a model for other bioinformatics laboratories that need to develop web-based analysis tools in a short amount of time and on a limited budget.

Keywords: Acute, Algorithms, Animals, Artificial Intelligence, Automated, Base Pair Mismatch, Base Pairing, Base Sequence, Biological, Biosensing Techniques, Classification, Cluster Analysis, Comparative Study, Computational Biology, Computer-Assisted, Cystadenoma, DNA, Drug, Drug Design, Eukaryotic Cells, Female, Gene Expression, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Hemolysins, Humans, Internet, Leukemia, Ligands, Likelihood Functions, Logistic Models, Lymphocytic, Markov Chains, Mathematics, Messenger, Models, Molecular, Molecular Probe Techniques, Molecular Sequence Data, Nanotechnology, Neoplasm, Neoplasms, Neoplastic, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Nucleic Acid Conformation, Observer Variation, Oligonucleotide Array Sequence Analysis, Ovarian Neoplasms, P.H.S., Pattern Recognition, Probability, Protein Binding, Proteins, Quality Control, RNA, RNA Splicing, Receptors, Reference Values, Reproducibility of Results, Research Support, Sensitivity and Specificity, Sequence Analysis, Signal Processing, Software, Statistical, Stomach Neoplasms, Thermodynamics, Transcription, Tumor Markers, U.S. Gov't, 12463949
[Wahba2002Soft] Grace Wahba. Soft and hard classification by reproducing kernel Hilbert space methods. Proc Natl Acad Sci U S A, 99(26):16524-30, Dec 2002. [ bib | DOI | http | .pdf ]
Reproducing kernel Hilbert space (RKHS) methods provide a unified context for solving a wide variety of statistical modelling and function estimation problems. We consider two such problems: We are given a training set [yi, ti, i = 1, ..., n], where yi is the response for the ith subject, and ti is a vector of attributes for this subject. The value of yi is a label that indicates which category it came from. For the first problem, we wish to build a model from the training set that assigns to each t in an attribute domain of interest an estimate of the probability pj(t) that a (future) subject with attribute vector t is in category j. The second problem is in some sense less ambitious; it is to build a model that assigns to each t a label, which classifies a future subject with that t into one of the categories or possibly "none of the above." The approach to the first of these two problems discussed here is a special case of what is known as penalized likelihood estimation. The approach to the second problem is known as the support vector machine. We also note some alternate but closely related approaches to the second problem. These approaches are all obtained as solutions to optimization problems in RKHS. Many other problems, in particular the solution of ill-posed inverse problems, can be obtained as solutions to optimization problems in RKHS and are mentioned in passing. We caution the reader that although a large literature exists in all of these topics, in this inaugural article we are selectively highlighting work of the author, former students, and other collaborators.

Keywords: Acute, Algorithms, Animals, Automated, Base Pair Mismatch, Base Pairing, Base Sequence, Biological, Biosensing Techniques, Classification, Cluster Analysis, Comparative Study, Computational Biology, Computer-Assisted, Cystadenoma, DNA, Drug, Drug Design, Eukaryotic Cells, Female, Gene Expression, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Hemolysins, Humans, Leukemia, Ligands, Likelihood Functions, Lymphocytic, Markov Chains, Mathematics, Messenger, Models, Molecular, Molecular Probe Techniques, Molecular Sequence Data, Nanotechnology, Neoplasm, Neoplastic, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Nucleic Acid Conformation, Observer Variation, Oligonucleotide Array Sequence Analysis, Ovarian Neoplasms, P.H.S., Pattern Recognition, Probability, Protein Binding, Proteins, Quality Control, RNA, RNA Splicing, Receptors, Reference Values, Reproducibility of Results, Research Support, Sensitivity and Specificity, Sequence Analysis, Signal Processing, Statistical, Stomach Neoplasms, Thermodynamics, Transcription, Tumor Markers, U.S. Gov't, 12477931
[Sturn2002Genesis:] Alexander Sturn, John Quackenbush, and Zlatko Trajanoski. Genesis: cluster analysis of microarray data. Bioinformatics, 18(1):207-8, Jan 2002. [ bib ]
A versatile, platform independent and easy to use Java suite for large-scale gene expression analysis was developed. Genesis integrates various tools for microarray data analysis such as filters, normalization and visualization tools, distance measures as well as common clustering algorithms including hierarchical clustering, self-organizing maps, k-means, principal component analysis, and support vector machines. The results of the clustering are transparent across all implemented methods and enable the analysis of the outcome of different algorithms and parameters. Additionally, mapping of gene expression data onto chromosomal sequences was implemented to enhance promoter analysis and investigation of transcriptional control mechanisms.

Keywords: Algorithms, Artificial Intelligence, Cluster Analysis, Comparative Study, Computational Biology, Databases, Gene Expression Profiling, Genetic, Models, Molecular Structure, Neural Networks (Computer), Non-U.S. Gov't, Oligonucleotide Array Sequence Analysis, Principal Component Analysis, Programming Languages, Promoter Regions (Genetics), Protein, Proteins, Research Support, Software, Statistical, Transcription, 11836235
[Song2002Prediction] Minghu Song, Curt M Breneman, Jinbo Bi, N. Sukumar, Kristin P Bennett, Steven Cramer, and Nihal Tugcu. Prediction of protein retention times in anion-exchange chromatography systems using support vector regression. J Chem Inf Comput Sci, 42(6):1347-57, 2002. [ bib ]
Quantitative Structure-Retention Relationship (QSRR) models are developed for the prediction of protein retention times in anion-exchange chromatography systems. Topological, subdivided surface area, and TAE (Transferable Atom Equivalent) electron-density-based descriptors are computed directly for a set of proteins using molecular connectivity patterns and crystal structure geometries. A novel algorithm based on Support Vector Machine (SVM) regression has been employed to obtain predictive QSRR models using a two-step computational strategy. In the first step, a sparse linear SVM was utilized as a feature selection procedure to remove irrelevant or redundant information. Subsequently, the selected features were used to produce an ensemble of nonlinear SVM regression models that were combined using bootstrap aggregation (bagging) techniques, where various combinations of training and validation data sets were selected from the pool of available data. A visualization scheme (star plots) was used to display the relative importance of each selected descriptor in the final set of "bagged" models. Once these predictive models have been validated, they can be used as an automated prediction tool for virtual high-throughput screening (VHTS).

Keywords: Acute, Algorithms, Animals, Anion Exchange Resins, Artificial Intelligence, Automated, Base Pair Mismatch, Base Pairing, Base Sequence, Biological, Biosensing Techniques, Carcinoma, Chemical, Chromatography, Classification, Cluster Analysis, Comparative Study, Computational Biology, Computer-Assisted, Cystadenoma, DNA, Decision Making, Diagnosis, Differential, Drug, Drug Design, Electrostatics, Eukaryotic Cells, Feasibility Studies, Female, Gene Expression, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Hemolysins, Humans, Internet, Ion Exchange, Leukemia, Ligands, Likelihood Functions, Logistic Models, Lung Neoplasms, Lymphocytic, Lymphoma, Markov Chains, Mathematics, Messenger, Models, Molecular, Molecular Probe Techniques, Molecular Sequence Data, Nanotechnology, Neoplasm, Neoplasms, Neoplastic, Neural Networks (Computer), Non-P.H.S., Non-Small-Cell Lung, Non-U.S. Gov't, Nucleic Acid Conformation, Nucleic Acid Hybridization, Observer Variation, Oligonucleotide Array Sequence Analysis, Ovarian Neoplasms, P.H.S., Pattern Recognition, Probability, Protein Binding, Protein Conformation, Proteins, Quality Control, Quantum Theory, RNA, RNA Splicing, Receptors, Reference Values, Regression Analysis, Reproducibility of Results, Research Support, Sensitivity and Specificity, Sequence Analysis, Signal Processing, Software, Statistical, Stomach Neoplasms, Thermodynamics, Transcription, Tumor Markers, U.S. Gov't, 12444731
[Quackenbush2002Microarray] John Quackenbush. Microarray data normalization and transformation. Nat Genet, 32 Suppl:496-501, Dec 2002. [ bib | DOI | http ]
Keywords: Animals; Data Interpretation, Statistical; Forecasting; Gene Expression Profiling, methods; Humans; Oligonucleotide Array Sequence Analysis, methods; Research Design
[Mateos2002Systematic] Alvaro Mateos, Joaquín Dopazo, Ronald Jansen, Yuhai Tu, Mark Gerstein, and Gustavo Stolovitzky. Systematic learning of gene functional classes from DNA array expression data by using multilayer perceptrons. Genome Res., 12(11):1703-15, Nov 2002. [ bib | DOI | http | .pdf ]
Recent advances in microarray technology have opened new ways for functional annotation of previously uncharacterised genes on a genomic scale. This has been demonstrated by unsupervised clustering of co-expressed genes and, more importantly, by supervised learning algorithms. Using prior knowledge, these algorithms can assign functional annotations based on more complex expression signatures found in existing functional classes. Previously, support vector machines (SVMs) and other machine-learning methods have been applied to a limited number of functional classes for this purpose. Here we present, for the first time, the comprehensive application of supervised neural networks (SNNs) for functional annotation. Our study is novel in that we report systematic results for ~100 classes in the Munich Information Center for Protein Sequences (MIPS) functional catalog. We found that only ~10% of these are learnable (based on the rate of false negatives). A closer analysis reveals that false positives (and negatives) in a machine-learning context are not necessarily "false" in a biological sense. We show that the high degree of interconnections among functional classes confounds the signatures that ought to be learned for a unique class. We term this the "Borges effect" and introduce two new numerical indices for its quantification. Our analysis indicates that classification systems with a lower Borges effect are better suitable for machine learning. Furthermore, we introduce a learning procedure for combining false positives with the original class. We show that in a few iterations this process converges to a gene set that is learnable with considerably low rates of false positives and negatives and contains genes that are biologically related to the original class, allowing for a coarse reconstruction of the interactions between associated biological pathways. We exemplify this methodology using the well-studied tricarboxylic acid cycle.

Keywords: Acute, Algorithms, Animals, Anion Exchange Resins, Artificial Intelligence, Automated, Base Pair Mismatch, Base Pairing, Base Sequence, Biological, Biosensing Techniques, Carcinoma, Chemical, Chromatography, Citric Acid Cycle, Classification, Cluster Analysis, Comparative Study, Computational Biology, Computer-Assisted, Cystadenoma, DNA, Databases, Decision Making, Diagnosis, Differential, Drug, Drug Design, Electrostatics, Eukaryotic Cells, Factual, Feasibility Studies, Female, Gene Expression, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Heterogeneity, Genetic Markers, Hemolysins, Humans, Internet, Ion Exchange, Leukemia, Ligands, Likelihood Functions, Logistic Models, Lung Neoplasms, Lymphocytic, Lymphoma, Markov Chains, Mathematics, Messenger, Models, Molecular, Molecular Probe Techniques, Molecular Sequence Data, Nanotechnology, Neoplasm, Neoplasms, Neoplastic, Neural Networks (Computer), Non-P.H.S., Non-Small-Cell Lung, Non-U.S. Gov't, Nucleic Acid Conformation, Nucleic Acid Hybridization, Observer Variation, Oligonucleotide Array Sequence Analysis, Ovarian Neoplasms, P.H.S., Pattern Recognition, Probability, Protein Binding, Protein Conformation, Proteins, Quality Control, Quantum Theory, RNA, RNA Splicing, Receptors, Reference Values, Regression Analysis, Reproducibility of Results, Research Support, Saccharomyces cerevisiae Proteins, Sensitivity and Specificity, Sequence Analysis, Signal Processing, Software, Statistical, Stomach Neoplasms, Structural, Structure-Activity Relationship, Thermodynamics, Transcription, Tumor Markers, U.S. Gov't, 12421757
[Martoglio2002decomposition] Ann-Marie Martoglio, James W Miskin, Stephen K Smith, and David J C MacKay. A decomposition model to track gene expression signatures: preview on observer-independent classification of ovarian cancer. Bioinformatics, 18(12):1617-24, Dec 2002. [ bib ]
MOTIVATION: A number of algorithms and analytical models have been employed to reduce the multidimensional complexity of DNA array data and attempt to extract some meaningful interpretation of the results. These include clustering, principal components analysis, self-organizing maps, and support vector machine analysis. Each method assumes an implicit model for the data, many of which separate genes into distinct clusters defined by similar expression profiles in the samples tested. A point of concern is that many genes may be involved in a number of distinct behaviours, and should therefore be modelled to fit into as many separate clusters as detected in the multidimensional gene expression space. The analysis of gene expression data using a decomposition model that is independent of the observer involved would be highly beneficial to improve standard and reproducible classification of clinical and research samples. RESULTS: We present a variational independent component analysis (ICA) method for reducing high dimensional DNA array data to a smaller set of latent variables, each associated with a gene signature. We present the results of applying the method to data from an ovarian cancer study, revealing a number of tissue type-specific and tissue type-independent gene signatures present in varying amounts among the samples surveyed. The observer independent results of such molecular analysis of biological samples could help identify patients who would benefit from different treatment strategies. We further explore the application of the model to similar high-throughput studies.

Keywords: Acute, Algorithms, Automated, Base Pair Mismatch, Base Pairing, Base Sequence, Biological, Biosensing Techniques, Cluster Analysis, Comparative Study, Computer-Assisted, Cystadenoma, DNA, Female, Gene Expression, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Hemolysins, Humans, Leukemia, Lymphocytic, Markov Chains, Messenger, Models, Molecular Probe Techniques, Molecular Sequence Data, Nanotechnology, Neoplasm, Neoplastic, Neural Networks (Computer), Non-U.S. Gov't, Nucleic Acid Conformation, Observer Variation, Oligonucleotide Array Sequence Analysis, Ovarian Neoplasms, Pattern Recognition, Quality Control, RNA, Reference Values, Reproducibility of Results, Research Support, Sensitivity and Specificity, Signal Processing, Statistical, Stomach Neoplasms, Transcription, Tumor Markers, 12490446
[Marsland2002self-organising] Stephen Marsland, Jonathan Shapiro, and Ulrich Nehmzow. A self-organising network that grows when required. Neural Netw, 15(8-9):1041-58, 2002. [ bib ]
The ability to grow extra nodes is a potentially useful facility for a self-organising neural network. A network that can add nodes into its map space can approximate the input space more accurately, and often more parsimoniously, than a network with predefined structure and size, such as the Self-Organising Map. In addition, a growing network can deal with dynamic input distributions. Most of the growing networks that have been proposed in the literature add new nodes to support the node that has accumulated the highest error during previous iterations or to support topological structures. This usually means that new nodes are added only when the number of iterations is an integer multiple of some pre-defined constant, A. This paper suggests a way in which the learning algorithm can add nodes whenever the network in its current state does not sufficiently match the input. In this way the network grows very quickly when new data is presented, but stops growing once the network has matched the data. This is particularly important when we consider dynamic data sets, where the distribution of inputs can change to a new regime after some time. We also demonstrate the preservation of neighbourhood relations in the data by the network. The new network is compared to an existing growing network, the Growing Neural Gas (GNG), on an artificial dataset, showing how the network deals with a change in input distribution after some time. Finally, the new network is applied to several novelty detection tasks and is compared with both the GNG and an unsupervised form of the Reduced Coulomb Energy network on a robotic inspection task and with a Support Vector Machine on two benchmark novelty detection tasks.

Keywords: Acute, Algorithms, Animals, Anion Exchange Resins, Artificial Intelligence, Automated, Base Pair Mismatch, Base Pairing, Base Sequence, Biological, Biosensing Techniques, Carcinoma, Chemical, Chromatography, Citric Acid Cycle, Classification, Cluster Analysis, Comparative Study, Computational Biology, Computer-Assisted, Cystadenoma, DNA, Databases, Decision Making, Diagnosis, Differential, Drug, Drug Design, Electrostatics, Eukaryotic Cells, Factual, Feasibility Studies, Female, Gene Expression, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Heterogeneity, Genetic Markers, Hemolysins, Humans, Internet, Ion Exchange, Leukemia, Ligands, Likelihood Functions, Logistic Models, Lung Neoplasms, Lymphocytic, Lymphoma, Markov Chains, Mathematics, Messenger, Models, Molecular, Molecular Probe Techniques, Molecular Sequence Data, Nanotechnology, Neoplasm, Neoplasms, Neoplastic, Neural Networks (Computer), Non-P.H.S., Non-Small-Cell Lung, Non-U.S. Gov't, Nucleic Acid Conformation, Nucleic Acid Hybridization, Observer Variation, Oligonucleotide Array Sequence Analysis, Ovarian Neoplasms, P.H.S., Pattern Recognition, Probability, Probability Learning, Protein Binding, Protein Conformation, Proteins, Quality Control, Quantum Theory, RNA, RNA Splicing, Receptors, Reference Values, Regression Analysis, Reproducibility of Results, Research Support, Robotics, Saccharomyces cerevisiae Proteins, Sensitivity and Specificity, Sequence Analysis, Signal Processing, Software, Statistical, Stomach Neoplasms, Structural, Structure-Activity Relationship, Thermodynamics, Transcription, Tumor Markers, U.S. Gov't, 12416693
[Churchill2002Fundamentals] G. A. Churchill. Fundamentals of experimental design for cDNA microarrays. Nat. Genet., 32 Suppl:490-495, Dec 2002. [ bib | DOI | http ]
Microarray technology is now widely available and is being applied to address increasingly complex scientific questions. Consequently, there is a greater demand for statistical assessment of the conclusions drawn from microarray experiments. This review discusses fundamental issues of how to design an experiment to ensure that the resulting data are amenable to statistical analysis. The discussion focuses on two-color spotted cDNA microarrays, but many of the same issues apply to single-color gene-expression assays as well.

Keywords: Animals; DNA, Complementary, analysis; Gene Expression; Gene Expression Profiling, methods; Mice; Models, Biological; Oligonucleotide Array Sequence Analysis, methods; Reference Standards; Reproducibility of Results; Research Design; Statistics as Topic
[Chan2002Comparison] Kwokleung Chan, Te-Won Lee, Pamela A Sample, Michael H Goldbaum, Robert N Weinreb, and Terrence J Sejnowski. Comparison of machine learning and traditional classifiers in glaucoma diagnosis. IEEE Trans Biomed Eng, 49(9):963-74, Sep 2002. [ bib | DOI | http | .pdf ]
Glaucoma is a progressive optic neuropathy with characteristic structural changes in the optic nerve head reflected in the visual field. The visual-field sensitivity test is commonly used in a clinical setting to evaluate glaucoma. Standard automated perimetry (SAP) is a common computerized visual-field test whose output is amenable to machine learning. We compared the performance of a number of machine learning algorithms with STATPAC indexes mean deviation, pattern standard deviation, and corrected pattern standard deviation. The machine learning algorithms studied included multilayer perceptron (MLP), support vector machine (SVM), and linear (LDA) and quadratic discriminant analysis (QDA), Parzen window, mixture of Gaussian (MOG), and mixture of generalized Gaussian (MGG). MLP and SVM are classifiers that work directly on the decision boundary and fall under the discriminative paradigm. Generative classifiers, which first model the data probability density and then perform classification via Bayes' rule, usually give deeper insight into the structure of the data space. We have applied MOG, MGG, LDA, QDA, and Parzen window to the classification of glaucoma from SAP. Performance of the various classifiers was compared by the areas under their receiver operating characteristic curves and by sensitivities (true-positive rates) at chosen specificities (true-negative rates). The machine-learning-type classifiers showed improved performance over the best indexes from STATPAC. Forward-selection and backward-elimination methodology further improved the classification rate and also has the potential to reduce testing time by diminishing the number of visual-field location measurements.

Keywords: Acute, Algorithms, Animals, Anion Exchange Resins, Artificial Intelligence, Automated, Base Pair Mismatch, Base Pairing, Base Sequence, Biological, Biosensing Techniques, Carcinoma, Chemical, Chromatography, Citric Acid Cycle, Classification, Cluster Analysis, Comparative Study, Computational Biology, Computer-Assisted, Cystadenoma, DNA, Databases, Decision Making, Diagnosis, Differential, Discriminant Analysis, Drug, Drug Design, Electrostatics, Epitopes, Eukaryotic Cells, Factual, False Negative Reactions, False Positive Reactions, Feasibility Studies, Female, Gene Expression, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Heterogeneity, Genetic Markers, Glaucoma, HLA Antigens, Hemolysins, Histocompatibility Antigens Class I, Humans, Internet, Intraocular Pressure, Ion Exchange, Lasers, Leukemia, Ligands, Likelihood Functions, Logistic Models, Lung Neoplasms, Lymphocytic, Lymphoma, Markov Chains, Mathematics, Messenger, Models, Molecular, Molecular Probe Techniques, Molecular Sequence Data, Nanotechnology, Neoplasm, Neoplasms, Neoplastic, Neural Networks (Computer), Neurological, Non-P.H.S., Non-Small-Cell Lung, Non-U.S. Gov't, Nucleic Acid Conformation, Nucleic Acid Hybridization, Observer Variation, Oligonucleotide Array Sequence Analysis, Open-Angle, Ophthalmoscopy, Optic Disk, Optic Nerve Diseases, Ovarian Neoplasms, P.H.S., Pattern Recognition, Peptides, Perimetry, Predictive Value of Tests, Probability, Probability Learning, Protein, Protein Binding, Protein Conformation, Proteins, Quality Control, Quantum Theory, RNA, RNA Splicing, ROC Curve, Receptors, Reference Values, Regression Analysis, Reproducibility of Results, Research Support, Robotics, Saccharomyces cerevisiae Proteins, Sensitivity and Specificity, Sequence Analysis, Signal Processing, Software, Statistical, Stomach Neoplasms, Structural, Structure-Activity Relationship, T-Lymphocyte, Thermodynamics, Transcription, Tumor Markers, U.S. Gov't, 12214886
[Catoni2002Data] O. Catoni. Data Compression and Adaptive Histograms. In Felipe Cucker and J. Maurice Rojas, editors, Foundations of Computational Mathematics, Proceedings of Smalefest 2000. World Scientific, 2002. [ bib | http | .pdf ]
[Bowd2002Comparing] Christopher Bowd, Kwokleung Chan, Linda M Zangwill, Michael H Goldbaum, Te-Won Lee, Terrence J Sejnowski, and Robert N Weinreb. Comparing neural networks and linear discriminant functions for glaucoma detection using confocal scanning laser ophthalmoscopy of the optic disc. Invest Ophthalmol Vis Sci, 43(11):3444-54, Nov 2002. [ bib | http | .pdf ]
PURPOSE: To determine whether neural network techniques can improve differentiation between glaucomatous and nonglaucomatous eyes, using the optic disc topography parameters of the Heidelberg Retina Tomograph (HRT; Heidelberg Engineering, Heidelberg, Germany). METHODS: With the HRT, one eye was imaged from each of 108 patients with glaucoma (defined as having repeatable visual field defects with standard automated perimetry) and 189 subjects without glaucoma (no visual field defects with healthy-appearing optic disc and retinal nerve fiber layer on clinical examination) and the optic nerve topography was defined by 17 global and 66 regional HRT parameters. With all the HRT parameters used as input, receiver operating characteristic (ROC) curves were generated for the classification of eyes, by three neural network techniques: linear and Gaussian support vector machines (SVM linear and SVM Gaussian, respectively) and a multilayer perceptron (MLP), as well as four previously proposed linear discriminant functions (LDFs) and one LDF developed on the current data with all HRT parameters used as input. RESULTS: The areas under the ROC curves for SVM linear and SVM Gaussian were 0.938 and 0.945, respectively; for MLP, 0.941; for the current LDF, 0.906; and for the best previously proposed LDF, 0.890. With the use of forward selection and backward elimination optimization techniques, the areas under the ROC curves for SVM Gaussian and the current LDF were increased to approximately 0.96. CONCLUSIONS: Trained neural networks, with global and regional HRT parameters used as input, improve on previously proposed HRT parameter-based LDFs for discriminating between glaucomatous and nonglaucomatous eyes. The performance of both neural networks and LDFs can be improved with optimization of the features in the input. Neural network analyses show promise for increasing diagnostic accuracy of tests for glaucoma.

Keywords: Acute, Algorithms, Animals, Anion Exchange Resins, Artificial Intelligence, Automated, Base Pair Mismatch, Base Pairing, Base Sequence, Biological, Biosensing Techniques, Carcinoma, Chemical, Chromatography, Citric Acid Cycle, Classification, Cluster Analysis, Comparative Study, Computational Biology, Computer-Assisted, Cystadenoma, DNA, Databases, Decision Making, Diagnosis, Differential, Discriminant Analysis, Drug, Drug Design, Electrostatics, Eukaryotic Cells, Factual, Feasibility Studies, Female, Gene Expression, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Heterogeneity, Genetic Markers, Glaucoma, Hemolysins, Humans, Internet, Intraocular Pressure, Ion Exchange, Lasers, Leukemia, Ligands, Likelihood Functions, Logistic Models, Lung Neoplasms, Lymphocytic, Lymphoma, Markov Chains, Mathematics, Messenger, Models, Molecular, Molecular Probe Techniques, Molecular Sequence Data, Nanotechnology, Neoplasm, Neoplasms, Neoplastic, Neural Networks (Computer), Non-P.H.S., Non-Small-Cell Lung, Non-U.S. Gov't, Nucleic Acid Conformation, Nucleic Acid Hybridization, Observer Variation, Oligonucleotide Array Sequence Analysis, Open-Angle, Ophthalmoscopy, Optic Disk, Ovarian Neoplasms, P.H.S., Pattern Recognition, Probability, Probability Learning, Protein Binding, Protein Conformation, Proteins, Quality Control, Quantum Theory, RNA, RNA Splicing, ROC Curve, Receptors, Reference Values, Regression Analysis, Reproducibility of Results, Research Support, Robotics, Saccharomyces cerevisiae Proteins, Sensitivity and Specificity, Sequence Analysis, Signal Processing, Software, Statistical, Stomach Neoplasms, Structural, Structure-Activity Relationship, Thermodynamics, Transcription, Tumor Markers, U.S. Gov't, 12407155
[Zhu2003Introduction] Lingyun Zhu, Baoming Wu, and Changxiu Cao. Introduction to medical data mining. Sheng Wu Yi Xue Gong Cheng Xue Za Zhi, 20(3):559-62, Sep 2003. [ bib ]
Modern medicine generates a great deal of information stored in the medical database. Extracting useful knowledge and providing scientific decision-making for the diagnosis and treatment of disease from the database increasingly becomes necessary. Data mining in medicine can deal with this problem. It can also improve the management level of hospital information and promote the development of telemedicine and community medicine. Because the medical information is characteristic of redundancy, multi-attribution, incompletion and closely related with time, medical data mining differs from other one. In this paper we have discussed the key techniques of medical data mining involving pretreatment of medical data, fusion of different pattern and resource, fast and robust mining algorithms and reliability of mining results. The methods and applications of medical data mining based on computation intelligence such as artificial neural network, fuzzy system, evolutionary algorithms, rough set, and support vector machine have been introduced. The features and problems in data mining are summarized in the last section.

Keywords: Algorithms, Anion Exchange Resins, Automatic Data Processing, Chemical, Chromatography, Computational Biology, Computer-Assisted, Data Interpretation, Databases, Decision Making, Decision Trees, English Abstract, Factual, Fuzzy Logic, Humans, Indicators and Reagents, Information Storage and Retrieval, Ion Exchange, Models, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Nucleic Acid Conformation, P.H.S., Proteins, Quantitative Structure-Activity Relationship, RNA, ROC Curve, Research Support, Sequence Analysis, Statistical, Transfer, U.S. Gov't, 14565039
[Mayr2003Cross-reactive] Torsten Mayr, Christian Igel, Gregor Liebsch, Ingo Klimant, and Otto S Wolfbeis. Cross-reactive metal ion sensor array in a micro titer plate format. Anal Chem, 75(17):4389-96, Sep 2003. [ bib ]
A cross-reactive array in a micro titer plate (MTP) format is described that is based on a versatile and highly flexible scheme. It makes use of rather unspecific metal ions probes having almost identical fluorescence spectra, thus enabling (a) interrogation at identical analytical wavelengths, and (b) imaging of the probes contained in the wells of the MTP using a CCD camera and an array of blue-light-emitting diodes as a light source. The unselective response of the indicators in the presence of mixtures of five divalent cations generates a characteristic pattern that was analyzed by chemometric tools. The fluorescence intensity of the indicators was transferred into a time-dependent parameter applying a scheme called dual lifetime referencing. In this method, the fluorescence decay profile of the indicator is referenced against the phosphorescence of an inert reference dye added to the system. The intrinsically referenced measurements also were performed using blue LEDs as light sources and a CCD camera without intensifiers as the detector. The best performance was observed if each well was excited by a single LED. The assembly allows the detection of dye concentrations in the nanomoles-per-liter range without amplification and the acquisition of 96 wells simultaneously. The pictures obtained form the basis for evaluation by pattern recognition algorithms. Support vector machines are capable of predicting the presence of significant concentrations of metal ions with high accuracy.

Keywords: Agrochemicals, Air Pollutants, Aircraft, Algorithms, Artificial Intelligence, Automated, Base Composition, Base Sequence, Bayes Theorem, Carbonic Anhydrase Inhibitors, Cluster Analysis, Colonic Neoplasms, Comparative Study, Computational Biology, Computer Simulation, Computer Systems, Computer-Assisted, Computing Methodologies, Confidence Intervals, Cytosine, DNA, Data Interpretation, Databases, Diagnosis, Drug Design, Enhancer Elements (Genetics), Environmental Monitoring, Enzyme Inhibitors, Ethanol, Exons, Forecasting, Fourier Transform Infrared, Gene Expression Profiling, Gene Expression Regulation, Genetic, Genetic Screening, Glucuronosyltransferase, Guanine, Humans, Image Interpretation, Isoenzymes, Least-Squares Analysis, Leukemia, Linear Models, Lymphoma, Models, Molecular, Molecular Conformation, Molecular Sequence Data, Natural Disasters, Neoplasms, Neoplastic, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Oligonucleotide Array Sequence Analysis, Online Systems, P.H.S., Pattern Recognition, Pharmaceutical Preparations, Phenotype, Photography, Probability, Pyrimidines, Quantitative Structure-Activity Relationship, RNA Precursors, RNA Splice Sites, RNA Splicing, Radiation, Reproducibility of Results, Research Support, Sensitivity and Specificity, Sequence Alignment, Sequence Analysis, Signal Processing, Software, Spectroscopy, Statistical, Subtraction Technique, Terminology, Thermodynamics, Time Factors, U.S. Gov't, Untranslated Regions, Video Recording, Walking, 14632041
[Li2003Simple] Jinyan Li, Huiqing Liu, James R Downing, Allen Eng-Juh Yeoh, and Limsoon Wong. Simple rules underlying gene expression profiles of more than six subtypes of acute lymphoblastic leukemia (ALL) patients. Bioinformatics, 19(1):71-8, Jan 2003. [ bib ]
MOTIVATIONS AND RESULTS: For classifying gene expression profiles or other types of medical data, simple rules are preferable to non-linear distance or kernel functions. This is because rules may help us understand more about the application in addition to performing an accurate classification. In this paper, we discover novel rules that describe the gene expression profiles of more than six subtypes of acute lymphoblastic leukemia (ALL) patients. We also introduce a new classifier, named PCL, to make effective use of the rules. PCL is accurate and can handle multiple parallel classifications. We evaluate this method by classifying 327 heterogeneous ALL samples. Our test error rate is competitive with that of support vector machines, and it is 71% better than C4.5, 50% better than Naive Bayes, and 43% better than k-nearest neighbour. Experimental results on other independent data sets are also presented to show the strength of our method. AVAILABILITY: Under http://sdmc.lit.org.sg/GEDatasets/, click on Supplementary Information.

Keywords: Acute, Algorithms, Automated, Base Pair Mismatch, Base Pairing, Base Sequence, Biological, Biosensing Techniques, Cluster Analysis, Comparative Study, Computer-Assisted, DNA, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Hemolysins, Humans, Leukemia, Lymphocytic, Markov Chains, Messenger, Models, Molecular Probe Techniques, Molecular Sequence Data, Nanotechnology, Neoplasm, Neoplastic, Neural Networks (Computer), Non-U.S. Gov't, Nucleic Acid Conformation, Oligonucleotide Array Sequence Analysis, Pattern Recognition, Quality Control, RNA, Research Support, Signal Processing, Statistical, Stomach Neoplasms, Tumor Markers, 12499295
[Ifantis2003nonlinear] A. Ifantis and S. Papadimitriou. The nonlinear predictability of the electrotelluric field variations data analyzed with support vector machines as an earthquake precursor. Int J Neural Syst, 13(5):315-32, Oct 2003. [ bib ]
This work investigates the nonlinear predictability of the Electro Telluric Field (ETF) variations data in order to develop new intelligent tools for the difficult task of earthquake prediction. Support Vector Machines trained on a signal window have been used to predict the next sample. We observe a significant increase in the short-term unpredictability of the ETF signal about two weeks before the major earthquakes that took place in regions near the recording devices. The unpredictability increase can be attributed to a quick time variation of the dynamics that produce the ETF signal due to the earthquake generation process. Thus, this increase can be used to signal an increased possibility of a large earthquake within the next few days in the region neighboring the recording station.

Keywords: Air Pollutants, Aircraft, Algorithms, Artificial Intelligence, Automated, Base Composition, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, Computing Methodologies, Cytosine, Data Interpretation, Databases, Enhancer Elements (Genetics), Environmental Monitoring, Ethanol, Exons, Fourier Transform Infrared, Genetic, Guanine, Humans, Image Interpretation, Natural Disasters, Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Online Systems, P.H.S., Pattern Recognition, Photography, Probability, Pyrimidines, RNA Precursors, RNA Splice Sites, RNA Splicing, Radiation, Reproducibility of Results, Research Support, Sensitivity and Specificity, Signal Processing, Spectroscopy, Statistical, Subtraction Technique, Thermodynamics, Time Factors, U.S. Gov't, Untranslated Regions, Video Recording, Walking, 14652873
[Ge2003Reducing] Xijin Ge, Shuichi Tsutsumi, Hiroyuki Aburatani, and Shuichi Iwata. Reducing false positives in molecular pattern recognition. Genome Inform Ser Workshop Genome Inform, 14:34-43, 2003. [ bib ]
In the search for new cancer subtypes by gene expression profiling, it is essential to avoid misclassifying samples of unknown subtypes as known ones. In this paper, we evaluated the false positive error rates of several classification algorithms through a 'null test' by presenting classifiers a large collection of independent samples that do not belong to any of the tumor types in the training dataset. The benchmark dataset is available at www2.genome.rcast.u-tokyo.ac.jp/pm/. We found that k-nearest neighbor (KNN) and support vector machine (SVM) have very high false positive error rates when fewer genes (<100) are used in prediction. The error rate can be partially reduced by including more genes. On the other hand, prototype matching (PM) method has a much lower false positive error rate. Such robustness can be achieved without loss of sensitivity by introducing suitable measures of prediction confidence. We also proposed a cluster-and-select technique to select genes for classification. The nonparametric Kruskal-Wallis H test is employed to select genes differentially expressed in multiple tumor types. To reduce the redundancy, we then divided these genes into clusters with similar expression patterns and selected a given number of genes from each cluster. The reliability of the new algorithm is tested on three public datasets.

Keywords: Amino Acid Sequence, Amino Acids, Animals, Automated, Base Sequence, Bayes Theorem, Biological, Carbohydrate Conformation, Carbohydrate Sequence, Cattle, Computational Biology, Computer Simulation, Crystallography, DNA, Databases, Factual, False Positive Reactions, Gene Expression Profiling, Genes, Genetic, Genetic Techniques, Genome, Histocompatibility Antigens Class I, Human, Humans, Introns, Least-Squares Analysis, MHC Class I, Major Histocompatibility Complex, Markov Chains, Messenger, Mice, Models, Monosaccharides, Neoplasms, Non-U.S. Gov't, Nonparametric, Pattern Recognition, Peptides, Phylogeny, Plants, Poly A, Polysaccharides, Predictive Value of Tests, Protein, Protein Structure, Proteins, RNA, Rats, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Secondary, Sequence Alignment, Software, Species Specificity, Statistics, Theoretical, X-Ray, 15706518
[Garrett2003Comparison] D. Garrett, D. A Peterson, C. Anderson, and M. Thaut. Comparison of linear, nonlinear, and feature selection methods for EEG signal classification. IEEE Trans Neural Syst Rehabil Eng, 11(2):141-4, Jun 2003. [ bib ]
The reliable operation of brain-computer interfaces (BCIs) based on spontaneous electroencephalogram (EEG) signals requires accurate classification of multichannel EEG. The design of EEG representations and classifiers for BCI are open research questions whose difficulty stems from the need to extract complex spatial and temporal patterns from noisy multidimensional time series obtained from EEG measurements. The high-dimensional and noisy nature of EEG may limit the advantage of nonlinear classification methods over linear ones. This paper reports the results of a linear (linear discriminant analysis) and two nonlinear classifiers (neural networks and support vector machines) applied to the classification of spontaneous EEG during five mental tasks, showing that nonlinear classifiers produce only slightly better classification results. An approach to feature selection based on genetic algorithms is also presented with preliminary results of application to EEG during finger movement.

Keywords: 80 and over, Adnexal Diseases, Adult, Aged, Algorithms, Artificial Intelligence, Automated, Bayes Theorem, Biological, Brain, Brain Mapping, Breast Neoplasms, Case-Control Studies, Chromatography, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA, Diagnosis, Differential, Discriminant Analysis, Electroencephalography, Evoked Potentials, Feasibility Studies, Female, Fingers, Gene Expression Profiling, Gene Expression Regulation, Genetic, Genetic Markers, Genetic Predisposition to Disease, Genetic Screening, Habituation (Psychophysiology), High Pressure Liquid, Humans, Linear Models, Logistic Models, Male, Middle Aged, Migraine, Models, Movement, Neural Networks (Computer), Neurological, Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Nucleosides, Ovarian Neoplasms, Pattern Recognition, Photic Stimulation, Predictive Value of Tests, ROC Curve, Reproducibility of Results, Research Support, Sensitivity and Specificity, Signal Processing, Software, Statistical, Thinking, Tumor Markers, U.S. Gov't, User-Computer Interface, Visual, 12899257
[DiMasi2003price] J. A. DiMasi, R. W. Hansen, and H. G. Grabowski. The price of innovation: new estimates of drug development costs. J Health Econ, 22(2):151-185, Mar 2003. [ bib ]
The research and development costs of 68 randomly selected new drugs were obtained from a survey of 10 pharmaceutical firms. These data were used to estimate the average pre-tax cost of new drug development. The costs of compounds abandoned during testing were linked to the costs of compounds that obtained marketing approval. The estimated average out-of-pocket cost per new drug is 403 million US dollars (2000 dollars). Capitalizing out-of-pocket costs to the point of marketing approval at a real discount rate of 11% yields a total pre-approval cost estimate of 802 million US dollars (2000 dollars). When compared to the results of an earlier study with a similar methodology, total capitalized costs were shown to have increased at an annual rate of 7.4% above general price inflation.

Keywords: Capital Expenditures, Costs and Cost Analysis, Data Collection, Drug Approval, Drug Evaluation, Drug Industry, Drugs, Economic, Humans, Inflation, Investigational, Organizational Innovation, Preclinical, Research Support, United States, 16087260
[Diekman2003Hybrid] Casey Diekman, Wei He, Nagabhushana Prabhu, and Harvey Cramer. Hybrid methods for automated diagnosis of breast tumors. Anal Quant Cytol Histol, 25(4):183-90, Aug 2003. [ bib ]
OBJECTIVE: To design and analyze a new family of hybrid methods for the diagnosis of breast tumors using fine needle aspirates. STUDY DESIGN: We present a radically new approach to the design of diagnosis systems. In the new approach, a nonlinear classifier with high sensitivity but low specificity is hybridized with a linear classifier having low sensitivity but high specificity. Data from the Wisconsin Breast Cancer Database are used to evaluate, computationally, the performance of the hybrid classifiers. RESULTS: The diagnosis scheme obtained by hybridizing the nonlinear classifier ellipsoidal multisurface method (EMSM) with the linear classifier proximal support vector machine (PSVM) was found to have a mean sensitivity of 97.36% and a mean specificity of 95.14% and was found to yield a 2.44% improvement in the reliability of positive diagnosis over that of EMSM at the expense of 0.4% degradation in the reliability of negative diagnosis, again compared to EMSM. At the 95% confidence level we can trust the hybrid method to be 96.19-98.53% correct in its malignant diagnosis of new tumors and 93.57-96.71% correct in its benign diagnosis. CONCLUSION: Hybrid diagnosis schemes represent a significant paradigm shift and provide a promising new technique to improve the specificity of nonlinear classifiers without seriously affecting the high sensitivity of nonlinear classifiers.

Keywords: Algorithms, Amino Acid Sequence, Amino Acids, Anion Exchange Resins, Antigen-Antibody Complex, Artificial Intelligence, Automated, Automatic Data Processing, Benchmarking, Biological, Biological Markers, Biopsy, Blood Cells, Blood Proteins, Breast Neoplasms, Cell Line, Cellular Structures, Chemical, Chromatography, Chromosome Aberrations, Cluster Analysis, Colonic Neoplasms, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, Computing Methodologies, DNA, Data Interpretation, Databases, Decision Making, Decision Trees, Diagnosis, Diffusion Magnetic Resonance Imaging, Disease, English Abstract, Epitopes, Expert Systems, Factual, Female, Fine-Needle, Fusion, Fuzzy Logic, Gene Expression Profiling, Gene Expression Regulation, Gene Targeting, Genetic, Genome, Histocompatibility Antigens Class I, Humans, Hydrogen Bonding, Hydrophobicity, Image Interpretation, Image Processing, In Vitro, Indicators and Reagents, Information Storage and Retrieval, Ion Exchange, Least-Squares Analysis, Leiomyosarcoma, Liver Cirrhosis, Lung Neoplasms, Magnetic Resonance Imaging, Male, Mass, Mathematical Computing, Matrix-Assisted Laser Desorption-Ionization, Models, Molecular, Molecular Sequence Data, Neoplasm Proteins, Neoplasms, Neoplastic, Nephroblastoma, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Nucleic Acid Conformation, Nucleic Acid Hybridization, Oligonucleotide Array Sequence Analysis, Oncogene Proteins, Ovarian Neoplasms, P.H.S., Pattern Recognition, Predictive Value of Tests, Prostatic Neoplasms, Protein, Protein Binding, Protein Interaction Mapping, Protein Structure, Proteins, Proteome, Quantitative Structure-Activity Relationship, RNA, ROC Curve, Reproducibility of Results, Research Support, Rhabdomyosarcoma, Secondary, Sensitivity and Specificity, Sequence Alignment, Sequence Analysis, Severity of Illness Index, Software, Solubility, Spectrometry, Statistical, Structure-Activity Relationship, Subcellular Fractions, Subtraction Technique, T-Lymphocyte, Tissue Distribution, Transcription Factors, Transfer, Treatment Outcome, Tumor, Tumor Markers, U.S. Gov't, User-Computer Interface, 12961824
[Chan2003Detection] Ian Chan, William Wells, Robert V Mulkern, Steven Haker, Jianqing Zhang, Kelly H Zou, Stephan E Maier, and Clare M C Tempany. Detection of prostate cancer by integration of line-scan diffusion, T2-mapping and T2-weighted magnetic resonance imaging; a multichannel statistical classifier. Med Phys, 30(9):2390-8, Sep 2003. [ bib | .pdf ]
A multichannel statistical classifier for detecting prostate cancer was developed and validated by combining information from three different magnetic resonance (MR) methodologies: T2-weighted, T2-mapping, and line scan diffusion imaging (LSDI). From these MR sequences, four different sets of image intensities were obtained: T2-weighted (T2W) from T2-weighted imaging, Apparent Diffusion Coefficient (ADC) from LSDI, and proton density (PD) and T2 (T2 Map) from T2-mapping imaging. Manually segmented tumor labels from a radiologist, which were validated by biopsy results, served as tumor "ground truth." Textural features were extracted from the images using co-occurrence matrix (CM) and discrete cosine transform (DCT). Anatomical location of voxels was described by a cylindrical coordinate system. A statistical jack-knife approach was used to evaluate our classifiers. Single-channel maximum likelihood (ML) classifiers were based on 1 of the 4 basic image intensities. Our multichannel classifiers: support vector machine (SVM) and Fisher linear discriminant (FLD), utilized five different sets of derived features. Each classifier generated a summary statistical map that indicated tumor likelihood in the peripheral zone (PZ) of the prostate gland. To assess classifier accuracy, the average areas under the receiver operator characteristic (ROC) curves over all subjects were compared. Our best FLD classifier achieved an average ROC area of 0.839(+/-0.064), and our best SVM classifier achieved an average ROC area of 0.761(+/-0.043). The T2W ML classifier, our best single-channel classifier, only achieved an average ROC area of 0.599(+/-0.146). Compared to the best single-channel ML classifier, our best multichannel FLD and SVM classifiers have statistically superior ROC performance (P=0.0003 and 0.0017, respectively) from pairwise two-sided t-test. 
By integrating the information from multiple images and capturing the textural and anatomical features in tumor areas, summary statistical maps can potentially aid in image-guided prostate biopsy and assist in guiding and controlling delivery of localized therapy under image guidance.

Keywords: Algorithms, Anion Exchange Resins, Antigen-Antibody Complex, Artificial Intelligence, Automated, Automatic Data Processing, Biological, Blood Cells, Chemical, Chromatography, Cluster Analysis, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, Data Interpretation, Databases, Decision Making, Decision Trees, Diffusion Magnetic Resonance Imaging, English Abstract, Epitopes, Expert Systems, Factual, Fuzzy Logic, Gene Expression Profiling, Gene Expression Regulation, Gene Targeting, Genome, Histocompatibility Antigens Class I, Humans, Image Interpretation, Image Processing, In Vitro, Indicators and Reagents, Information Storage and Retrieval, Ion Exchange, Least-Squares Analysis, Liver Cirrhosis, Magnetic Resonance Imaging, Male, Models, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Nucleic Acid Conformation, P.H.S., Pattern Recognition, Prostatic Neoplasms, Protein, Protein Binding, Protein Interaction Mapping, Proteins, Proteome, Quantitative Structure-Activity Relationship, RNA, ROC Curve, Reproducibility of Results, Research Support, Sensitivity and Specificity, Sequence Analysis, Severity of Illness Index, Statistical, Structure-Activity Relationship, Subtraction Technique, T-Lymphocyte, Transcription Factors, Transfer, Treatment Outcome, U.S. Gov't, User-Computer Interface, 14528961
[Bagirov2003New] A. M. Bagirov, B. Ferguson, S. Ivkovic, G. Saunders, and J. Yearwood. New algorithms for multi-class cancer diagnosis using tumor gene expression signatures. Bioinformatics, 19(14):1800-7, Sep 2003. [ bib | http | .pdf ]
MOTIVATION: The increasing use of DNA microarray-based tumor gene expression profiles for cancer diagnosis requires mathematical methods with high accuracy for solving clustering, feature selection and classification problems of gene expression data. RESULTS: New algorithms are developed for solving clustering, feature selection and classification problems of gene expression data. The clustering algorithm is based on optimization techniques and allows the calculation of clusters step-by-step. This approach allows us to find as many clusters as a data set contains with respect to some tolerance. Feature selection is crucial for a gene expression database. Our feature selection algorithm is based on calculating overlaps of different genes. The database used contains over 16 000 genes and this number is considerably reduced by feature selection. We propose a classification algorithm where each tissue sample is considered as the center of a cluster which is a ball. The results of numerical experiments confirm that the classification algorithm in combination with the feature selection algorithm performs slightly better than the published results for multi-class classifiers based on support vector machines for this data set. AVAILABILITY: Available on request from the authors.

Keywords: Algorithms, Amino Acid Sequence, Anion Exchange Resins, Antigen-Antibody Complex, Artificial Intelligence, Automated, Automatic Data Processing, Biological, Blood Cells, Chemical, Chromatography, Cluster Analysis, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA, Data Interpretation, Databases, Decision Making, Decision Trees, Diffusion Magnetic Resonance Imaging, English Abstract, Epitopes, Expert Systems, Factual, Fuzzy Logic, Gene Expression Profiling, Gene Expression Regulation, Gene Targeting, Genetic, Genome, Histocompatibility Antigens Class I, Humans, Image Interpretation, Image Processing, In Vitro, Indicators and Reagents, Information Storage and Retrieval, Ion Exchange, Least-Squares Analysis, Liver Cirrhosis, Magnetic Resonance Imaging, Male, Models, Molecular Sequence Data, Neoplasms, Neoplastic, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Nucleic Acid Conformation, Oligonucleotide Array Sequence Analysis, P.H.S., Pattern Recognition, Prostatic Neoplasms, Protein, Protein Binding, Protein Interaction Mapping, Proteins, Proteome, Quantitative Structure-Activity Relationship, RNA, ROC Curve, Reproducibility of Results, Research Support, Sensitivity and Specificity, Sequence Alignment, Sequence Analysis, Severity of Illness Index, Statistical, Structure-Activity Relationship, Subtraction Technique, T-Lymphocyte, Transcription Factors, Transfer, Treatment Outcome, Tumor Markers, U.S. Gov't, User-Computer Interface, 14512351
[Yu2004Advances] J. Yu, V.A. Smith, P.P. Wang, A.J. Hartemink, and E.D. Jarvis. Advances to Bayesian network inference for generating causal networks from observational biological data. Bioinformatics, 20(18):3594-3603, Dec 2004. [ bib | DOI | http ]
MOTIVATION: Network inference algorithms are powerful computational tools for identifying putative causal interactions among variables from observational data. Bayesian network inference algorithms hold particular promise in that they can capture linear, non-linear, combinatorial, stochastic and other types of relationships among variables across multiple levels of biological organization. However, challenges remain when applying these algorithms to limited quantities of experimental data collected from biological systems. Here, we use a simulation approach to make advances in our dynamic Bayesian network (DBN) inference algorithm, especially in the context of limited quantities of biological data. RESULTS: We test a range of scoring metrics and search heuristics to find an effective algorithm configuration for evaluating our methodological advances. We also identify sampling intervals and levels of data discretization that allow the best recovery of the simulated networks. We develop a novel influence score for DBNs that attempts to estimate both the sign (activation or repression) and relative magnitude of interactions among variables. When faced with limited quantities of observational data, combining our influence score with moderate data interpolation reduces a significant portion of false positive interactions in the recovered networks. Together, our advances allow DBN inference algorithms to be more effective in recovering biological networks from experimentally collected data. AVAILABILITY: Source code and simulated data are available upon request. SUPPLEMENTARY INFORMATION: http://www.jarvislab.net/Bioinformatics/BNAdvances/

Keywords: Algorithms; Bayes Theorem; Computer Simulation; Gene Expression Profiling; Gene Expression Regulation; Models, Genetic; Models, Statistical; Oligonucleotide Array Sequence Analysis; Signal Transduction; Software
[Tzeng2004Predicting] Huey-Ming Tzeng, Jer-Guang Hsieh, and Yih-Lon Lin. Predicting nurses' intention to quit with a support vector machine: a new approach to set up an early warning mechanism in human resource management. Comput Inform Nurs, 22(4):232-42, 2004. [ bib ]
This project developed a Support Vector Machine for predicting nurses' intention to quit, using working motivation, job satisfaction, and stress levels as predictors. This study was conducted in three hospitals located in southern Taiwan. The target population was all nurses (389 valid cases). For cross-validation, we randomly split cases into four groups of approximately equal sizes, and performed four training runs. After the training, the average percentage of misclassification on the training data was 0.86, while that on the testing data was 10.8, resulting in predictions with 89.2% accuracy. This Support Vector Machine can predict nurses' intention to quit, without asking these nurses whether they have an intention to quit.

Keywords: Adolescent, Adult, Algorithms, Amino Acid Sequence, Amino Acids, Anatomic, Attitude of Health Personnel, Bacterial Proteins, Bias (Epidemiology), Brain, Brain Mapping, Burnout, Comparative Study, Computer Simulation, Computer-Assisted, Data Interpretation, Diffusion Magnetic Resonance Imaging, Facial Asymmetry, Facial Expression, Facial Paralysis, Female, Gene Expression Profiling, Gram-Negative Bacteria, Gram-Positive Bacteria, Hospital, Humans, Image Interpretation, Intention, Job Satisfaction, Logistic Models, Magnetoencephalography, Male, Middle Aged, Models, Motion, Neural Networks (Computer), Neural Pathways, Non-U.S. Gov't, Nonlinear Dynamics, Nursing Administration Research, Nursing Staff, Personnel Management, Personnel Turnover, Photography, Predictive Value of Tests, Professional, Protein, Proteins, Proteome, Psychological, Questionnaires, Regression Analysis, Reproducibility of Results, Research Support, Retina, Risk Factors, Sequence Alignment, Sequence Analysis, Severity of Illness Index, Software, Statistical, Subcellular Fractions, Taiwan, Theoretical, Workplace, 15494654
[Song2004Comparison] Xiaowei Song, Arnold Mitnitski, Jafna Cox, and Kenneth Rockwood. Comparison of machine learning techniques with classical statistical models in predicting health outcomes. Medinfo, 11(Pt 1):736-40, 2004. [ bib ]
Several machine learning techniques (multilayer and single layer perceptron, logistic regression, least square linear separation and support vector machines) are applied to calculate the risk of death from two biomedical data sets, one from patient care records, and another from a population survey. Each dataset contained multiple sources of information: history of related symptoms and other illnesses, physical examination findings, laboratory tests, medications (patient records dataset), health attitudes, and disabilities in activities of daily living (survey dataset). Each technique showed very good mortality prediction in the acute patients data sample (AUC up to 0.89) and fair prediction accuracy for six year mortality (AUC from 0.70 to 0.76) in individuals from epidemiological database surveys. The results suggest that the nature of data is of primary importance rather than the learning technique. However, the consistently superior performance of the artificial neural network (multi-layer perceptron) indicates that nonlinear relationships (which cannot be discerned by linear separation techniques) can provide additional improvement in correctly predicting health outcomes.

Keywords: Aged, Air, Algorithms, Amino Acids, Animals, Area Under Curve, Artifacts, Artificial Intelligence, Atrial, Automated, Canada, Carotid Stenosis, Cerebrovascular Accident, Cerebrovascular Circulation, Comparative Study, Computer-Assisted, Cysteine, Decision Trees, Dementia, Diagnosis, Disulfides, Doppler, Embolism, Expert Systems, Extramural, Factor Analysis, Female, Gene Expression, Gene Expression Profiling, Health Status, Heart Septal Defects, Humans, Intracranial Embolism, Male, Models, Molecular, Myocardial Infarction, N.I.H., Neoplasms, Neural Networks (Computer), Non-U.S. Gov't, Oligonucleotide Array Sequence Analysis, Oxidation-Reduction, P.H.S., Pattern Recognition, Prognosis, Protein Binding, Protein Folding, Proteins, ROC Curve, Research Support, Sensitivity and Specificity, Software, Statistical, Transcranial, Treatment Outcome, U.S. Gov't, Ultrasonography, 15360910
[Seeger2004Gaussian] Matthias Seeger. Gaussian processes for machine learning. Int J Neural Syst, 14(2):69-106, Apr 2004. [ bib ]
Gaussian processes (GPs) are natural generalisations of multivariate Gaussian random variables to infinite (countably or continuous) index sets. GPs have been applied in a large number of fields to a diverse range of ends, and very many deep theoretical analyses of various properties are available. This paper gives an introduction to Gaussian processes on a fairly elementary level with special emphasis on characteristics relevant in machine learning. It draws explicit connections to branches such as spline smoothing models and support vector machines in which similar ideas have been investigated. Gaussian process models are routinely used to solve hard machine learning problems. They are attractive because of their flexible non-parametric nature and computational simplicity. Treated within a Bayesian framework, very powerful statistical methods can be implemented which offer valid estimates of uncertainties in our predictions and generic model selection procedures cast as nonlinear optimization problems. Their main drawback of heavy computational scaling has recently been alleviated by the introduction of generic sparse approximations. The mathematical literature on GPs is large and often uses deep concepts which are not required to fully understand most machine learning applications. In this tutorial paper, we aim to present characteristics of GPs relevant to machine learning and to show up precise connections to other "kernel machines" popular in the community. Our focus is on a simple presentation, but references to more detailed sources are provided.

Keywords: Algorithms, Amino Acids, Antibodies, Artificial Intelligence, Astrocytoma, Automated, Bayes Theorem, Biological, Biopsy, Brain, Brain Mapping, Brain Neoplasms, Calibration, Comparative Study, Computational Biology, Computer-Assisted, Computing Methodologies, Cysteine, Cystine, Dysplastic Nevus Syndrome, Electrodes, Electroencephalography, Entropy, Eosine Yellowish-(YS), Evoked Potentials, Female, Gene Expression Profiling, Hematoxylin, Horseradish Peroxidase, Humans, Image Interpretation, Image Processing, Imagery (Psychotherapy), Imagination, Laterality, Linear Models, Male, Melanoma, Models, Monoclonal, Movement, Neoplasms, Neural Networks (Computer), Neuropeptides, Non-P.H.S., Non-U.S. Gov't, Nonparametric, Normal Distribution, P.H.S., Pattern Recognition, Perception, Principal Component Analysis, Protein, Protein Array Analysis, Protein Interaction Mapping, Proteins, Regression Analysis, Research Support, Sensitivity and Specificity, Sequence Alignment, Sequence Analysis, Skin Neoplasms, Software, Statistical, Statistics, Tumor Markers, U.S. Gov't, User-Computer Interface, World Health Organization, 15112367
[Qin2004[Automated] Dong mei Qin, Zhan yi Hu, and Yong heng Zhao. Automated classification of celestial spectra based on support vector machines. Guang Pu Xue Yu Guang Pu Fen Xi, 24(4):507-11, Apr 2004. [ bib ]
The main objective of an automatic recognition system of celestial objects via their spectra is to classify celestial spectra and estimate physical parameters automatically. This paper proposes a new automatic classification method based on support vector machines to separate non-active objects from active objects via their spectra. With low SNR and unknown red-shift value, it is difficult to extract true spectral lines, and as a result, active objects can not be determined by finding strong spectral lines and the spectral classification between non-active and active objects becomes difficult. The proposed method in this paper combines the principal component analysis with support vector machines, and can automatically recognize the spectra of active objects with unknown red-shift values from non-active objects. It finds its applicability in the automatic processing of voluminous observed data from large sky surveys in astronomy.

Keywords: 80 and over, Adult, Aged, Algorithms, Amino Acids, Animals, Area Under Curve, Artifacts, Automated, Birefringence, Brain Chemistry, Brain Neoplasms, Comparative Study, Computer-Assisted, Cornea, Cross-Sectional Studies, Decision Trees, Diagnosis, Diagnostic Imaging, Diagnostic Techniques, Discriminant Analysis, Evolution, Face, Female, Genetic, Glaucoma, Humans, Intraocular Pressure, Lasers, Least-Squares Analysis, Magnetic Resonance Imaging, Magnetic Resonance Spectroscopy, Male, Middle Aged, Models, Molecular, Nerve Fibers, Non-U.S. Gov't, Numerical Analysis, Ophthalmological, Optic Nerve Diseases, Optical Coherence, P.H.S., Pattern Recognition, Photic Stimulation, Prospective Studies, Protein, ROC Curve, Regression Analysis, Research Support, Retinal Ganglion Cells, Sensitivity and Specificity, Sequence Analysis, Statistics, Tomography, U.S. Gov't, Visual Fields, beta-Lactamases, 15766170
[Pavlidis2004Support] Paul Pavlidis, Ilan Wapinski, and William Stafford Noble. Support vector machine classification on the web. Bioinformatics, 20(4):586-7, Mar 2004. [ bib | DOI | http | .pdf ]
The support vector machine (SVM) learning algorithm has been widely applied in bioinformatics. We have developed a simple web interface to our implementation of the SVM algorithm, called Gist. This interface allows novice or occasional users to apply a sophisticated machine learning algorithm easily to their data. More advanced users can download the software and source code for local installation. The availability of these tools will permit more widespread application of this powerful learning algorithm in bioinformatics.

Keywords: Adaptation, Algorithms, Ambergris, Amino Acid Sequence, Animals, Artifacts, Artificial Intelligence, Automated, Cadmium, Candida, Candida albicans, Capillary, Clinical, Cluster Analysis, Combinatorial Chemistry Techniques, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, Computing Methodologies, Databases, Decision Support Systems, Electrophoresis, Enzymes, Europe, Eye Enucleation, Humans, Image Interpretation, Image Processing, Information Storage and Retrieval, Internet, Magnetic Resonance Imaging, Magnetic Resonance Spectroscopy, Markov Chains, Melanoma, Models, Molecular, Molecular Conformation, Molecular Sequence Data, Molecular Structure, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Odors, P.H.S., Pattern Recognition, Perfume, Physiological, Predictive Value of Tests, Prognosis, Prospective Studies, Protein, Protein Structure, Proteins, Proteomics, Quantitative Structure-Activity Relationship, Rats, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Saccharomyces cerevisiae Proteins, Secondary, Sensitivity and Specificity, Signal Processing, Single-Blind Method, Soft Tissue Neoplasms, Software, Statistical, U.S. Gov't, Uveal Neoplasms, Visual, 14990457
[Madeira2004Biclustering] S. C. Madeira and A. L. Oliveira. Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinform, 1(1):24-45, 2004. [ bib | DOI | http ]
A large number of clustering approaches have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the results from the application of standard clustering methods to genes are limited. This limitation is imposed by the existence of a number of experimental conditions where the activity of genes is uncorrelated. A similar limitation exists when clustering of conditions is performed. For this reason, a number of algorithms that perform simultaneous clustering on the row and column dimensions of the data matrix have been proposed. The goal is to find submatrices, that is, subgroups of genes and subgroups of conditions, where the genes exhibit highly correlated activities for every condition. In this paper, we refer to this class of algorithms as biclustering. Biclustering is also referred to in the literature as coclustering and direct clustering, among other names, and has also been used in fields such as information retrieval and data mining. In this comprehensive survey, we analyze a large number of existing approaches to biclustering, and classify them in accordance with the type of biclusters they can find, the patterns of biclusters that are discovered, the methods used to perform the search, the approaches used to evaluate the solution, and the target applications.

Keywords: Algorithms; Cluster Analysis; Computational Biology, methods; Gene Expression Profiling, statistics & numerical data; Gene Expression, genetics; Humans; Models, Statistical; Oligonucleotide Array Sequence Analysis, methods; Saccharomyces cerevisiae, genetics
[Li2004Fusing] Shutao Li, James Tin-Yau Kwok, Ivor Wai-Hung Tsang, and Yaonan Wang. Fusing images with different focuses using support vector machines. IEEE Trans Neural Netw, 15(6):1555-61, Nov 2004. [ bib ]
Many vision-related processing tasks, such as edge detection, image segmentation and stereo matching, can be performed more easily when all objects in the scene are in good focus. However, in practice, this may not be always feasible as optical lenses, especially those with long focal lengths, only have a limited depth of field. One common approach to recover an everywhere-in-focus image is to use wavelet-based image fusion. First, several source images with different focuses of the same scene are taken and processed with the discrete wavelet transform (DWT). Among these wavelet decompositions, the wavelet coefficient with the largest magnitude is selected at each pixel location. Finally, the fused image can be recovered by performing the inverse DWT. In this paper, we improve this fusion procedure by applying the discrete wavelet frame transform (DWFT) and the support vector machines (SVM). Unlike DWT, DWFT yields a translation-invariant signal representation. Using features extracted from the DWFT coefficients, a SVM is trained to select the source image that has the best focus at each pixel location, and the corresponding DWFT coefficients are then incorporated into the composite wavelet representation. Experimental results show that the proposed method outperforms the traditional approach both visually and quantitatively.

Keywords: Algorithms, Amino Acid, Amino Acids, Artificial Intelligence, Ascomycota, Automated, Base Sequence, Chromosome Mapping, Codon, Colonic Neoplasms, Comparative Study, Computer Simulation, Computer-Assisted, Computing Methodologies, Crystallography, DNA, DNA Primers, Databases, Diagnostic Imaging, Enzymes, Fixation, Gene Expression Profiling, Genetic, Hordeum, Host-Parasite Relations, Humans, Image Enhancement, Image Interpretation, Informatics, Information Storage and Retrieval, Kinetics, Magnetic Resonance Spectroscopy, Models, Nanotechnology, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Ocular, Oligonucleotide Array Sequence Analysis, P.H.S., Pattern Recognition, Plant, Plants, Predictive Value of Tests, Protein, Protein Conformation, Research Support, Sample Size, Selection (Genetics), Sequence Alignment, Sequence Analysis, Sequence Homology, Signal Processing, Skin, Software, Statistical, Subtraction Technique, Theoretical, Thermodynamics, U.S. Gov't, Viral Proteins, X-Ray, 15565781
[Glotsos2004Automated] Dimitris Glotsos, Panagiota Spyridonos, Dionisis Cavouras, Panagiota Ravazoula, Petroula-Arampantoni Dadioti, and George Nikiforidis. Automated segmentation of routinely hematoxylin-eosin-stained microscopic images by combining support vector machine clustering and active contour models. Anal Quant Cytol Histol, 26(6):331-40, Dec 2004. [ bib ]
OBJECTIVE: To develop a method for the automated segmentation of images of routinely hematoxylin-eosin (H-E)-stained microscopic sections to guarantee correct results in computer-assisted microscopy. STUDY DESIGN: Clinical material comprised 50 H-E-stained biopsies of astrocytomas and 50 H-E-stained biopsies of urinary bladder cancer. The basic idea was to use a support vector machine clustering (SVMC) algorithm to provide gross segmentation of regions holding nuclei and subsequently to refine nuclear boundary detection with active contours. The initialization coordinates of the active contour model were defined using a SVMC pixel-based classification algorithm that discriminated nuclear regions from the surrounding tissue. Starting from the boundaries of these regions, the snake fired and propagated until converging to nuclear boundaries. RESULTS: The method was validated for 2 different types of H-E-stained images. Results were evaluated by 2 histopathologists. On average, 94% of nuclei were correctly delineated. CONCLUSION: The proposed algorithm could be of value in computer-based systems for automated interpretation of microscopic images.

Keywords: Adenosinetriphosphatase, Adolescent, Adult, Algorithms, Amino Acid Sequence, Amino Acids, Animals, Astrocytoma, Automated, Automation, Base Sequence, Bayes Theorem, Biological, Biopsy, Bladder Neoplasms, Breast Neoplasms, Carbohydrate Conformation, Carbohydrate Sequence, Cattle, Cell Cycle Proteins, Cell Nucleus, Computational Biology, Computer Simulation, Computer-Assisted, Crystallography, DNA, Databases, Diagnosis, Differential, Eosine Yellowish-(YS), Exoribonucleases, Factual, False Negative Reactions, False Positive Reactions, Female, Gene Expression, Gene Expression Profiling, Genes, Genetic, Genetic Techniques, Genetic Vectors, Genome, Hematoxylin, Histocompatibility Antigens Class I, Human, Humans, Image Interpretation, Image Processing, Introns, Least-Squares Analysis, MHC Class I, Major Histocompatibility Complex, Markov Chains, Messenger, Mice, Middle Aged, Models, Molecular Structure, Monosaccharides, Multigene Family, Mutation, Neoplasms, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Nonparametric, Nucleotidyltransferases, Observer Variation, Oligonucleotide Array Sequence Analysis, P.H.S., Pattern Recognition, Peptides, Phenotype, Phylogeny, Plants, Poly A, Polysaccharides, Predictive Value of Tests, Protein, Protein Biosynthesis, Protein Kinase Inhibitors, Protein Structure, Proteins, RNA, RNA Helicases, RNA Splicing, Rats, Reproducibility of Results, Research Support, Retrospective Studies, Saccharomyces cerevisiae, Saccharomyces cerevisiae Proteins, Secondary, Sensitivity and Specificity, Sequence Alignment, Software, Species Specificity, Staining and Labeling, Statistics, Theoretical, Transcription, U.S. Gov't, Ultrasonography, X-Ray, 15678615
[Faugeras2004Variational] Olivier Faugeras, Geoffray Adde, Guillaume Charpiat, Christophe Chefd'hotel, Maureen Clerc, Thomas Deneux, Rachid Deriche, Gerardo Hermosillo, Renaud Keriven, Pierre Kornprobst, Jan Kybic, Christophe Lenglet, Lucero Lopez-Perez, Théo Papadopoulo, Jean-Philippe Pons, Florent Segonne, Bertrand Thirion, David Tschumperlé, Thierry Viéville, and Nicolas Wotawa. Variational, geometric, and statistical methods for modeling brain anatomy and function. Neuroimage, 23 Suppl 1:S46-55, 2004. [ bib | DOI | http | .pdf ]
We survey the recent activities of the Odyssée Laboratory in the area of the application of mathematics to the design of models for studying brain anatomy and function. We start with the problem of reconstructing sources in MEG and EEG, and discuss the variational approach we have developed for solving these inverse problems. This motivates the need for geometric models of the head. We present a method for automatically and accurately extracting surface meshes of several tissues of the head from anatomical magnetic resonance (MR) images. Anatomical connectivity can be extracted from diffusion tensor magnetic resonance images but, in the current state of the technology, it must be preceded by a robust estimation and regularization stage. We discuss our work based on variational principles and show how the results can be used to track fibers in the white matter (WM) as geodesics in some Riemannian space. We then go to the statistical modeling of functional magnetic resonance imaging (fMRI) signals from the viewpoint of their decomposition in a pseudo-deterministic and stochastic part that we then use to perform clustering of voxels in a way that is inspired by the theory of support vector machines and in a way that is grounded in information theory. Multimodal image matching is discussed next in the framework of image statistics and partial differential equations (PDEs) with an eye on registering fMRI to the anatomy. The paper ends with a discussion of a new theory of random shapes that may prove useful in building anatomical and functional atlases.

Keywords: Adolescent, Adult, Algorithms, Anatomic, Bacterial Proteins, Brain, Brain Mapping, Comparative Study, Computer Simulation, Computer-Assisted, Diffusion Magnetic Resonance Imaging, Facial Asymmetry, Facial Expression, Facial Paralysis, Female, Gene Expression Profiling, Gram-Negative Bacteria, Gram-Positive Bacteria, Humans, Image Interpretation, Magnetoencephalography, Male, Middle Aged, Models, Motion, Neural Pathways, Non-U.S. Gov't, Photography, Protein, Proteome, Research Support, Retina, Sequence Alignment, Sequence Analysis, Severity of Illness Index, Software, Statistical, Subcellular Fractions, 15501100
[Eroes2004Comparison] D. Erös, G. Kéri, I. Kövesdi, C. Szántai-Kis, G. Mészáros, and L. Orfi. Comparison of predictive ability of water solubility QSPR models generated by MLR, PLS and ANN methods. Mini Rev Med Chem, 4(2):167-177, Feb 2004. [ bib ]
ADME/Tox computational screening is one of the hottest topics of modern drug research. About one half of the potential drug candidates fail because of poor ADME/Tox properties. Since the experimental determination of water solubility is also time-consuming, reliable computational predictions are needed for the pre-selection of acceptable "drug-like" compounds from diverse combinatorial libraries. Recently many successful attempts were made for predicting water solubility of compounds. A comprehensive review of previously developed water solubility calculation methods is presented here, followed by the description of the solubility prediction method designed and used in our laboratory. We have carefully selected 1381 compounds from scientific publications in a unified database and used this dataset in the calculations. The externally validated models were based on calculated descriptors only. The aim of model optimization was to improve repeated evaluations statistics of the predictions and effective descriptor scoring functions were used to facilitate quick generation of multiple linear regression analysis (MLR), partial least squares method (PLS) and artificial neural network (ANN) models with optimal predicting ability. Standard error of prediction of the best model generated with ANN (with 39-7-1 network structure) was 0.72 in logS units while the cross validated squared correlation coefficient (Q(2)) was better than 0.85. These values give a good chance for successful pre-selection of screening compounds from virtual libraries, based on the predicted water solubility.

Keywords: Chemical, Chemistry, Comparative Study, Cytochrome P-450 Enzyme System, Estradiol, Least-Squares Analysis, Ligands, Linear Models, Models, Molecular, Naphthalenes, Neural Networks (Computer), Non-U.S. Gov't, Physical, Quantitative Structure-Activity Relationship, Reproducibility of Results, Research Support, Solubility, Spectrum Analysis, Statistical, Water, 14965289
[Cohen2004application] Gilles Cohen, Mélanie Hilario, Hugo Sax, Stéphane Hugonnet, Christian Pellegrini, and Antoine Geissbuhler. An application of one-class support vector machine to nosocomial infection detection. Medinfo, 11(Pt 1):716-20, 2004. [ bib ]
Nosocomial infections (NIs), those acquired in health care settings, are among the major causes of increased mortality among hospitalized patients. They are a significant burden for patients and health authorities alike; it is thus important to monitor and detect them through an effective surveillance system. This paper describes a retrospective analysis of a prevalence survey of NIs done in the Geneva University Hospital. Our goal is to identify patients with one or more NIs on the basis of clinical and other data collected during the survey. In this two-class classification task, the main difficulty lies in the significant imbalance between positive or infected (11%) and negative (89%) cases. To cope with class imbalance, we investigate one-class SVMs which can be trained to distinguish two classes on the basis of examples from a single class (in this case, only "normal" or non-infected patients). The infected ones are then identified as "abnormal" cases or outliers that deviate significantly from the normal profile. Experimental results are encouraging: whereas standard 2-class SVMs scored a baseline sensitivity of 50.6% on this problem, the one-class approach increased sensitivity to as much as 92.6%. These results are comparable to those obtained by the authors in a previous study on asymmetrical soft margin SVMs; they suggest that one-class SVMs can provide an effective and efficient way of overcoming data imbalance in classification problems.

Keywords: Aged, Air, Algorithms, Amino Acids, Animals, Area Under Curve, Artifacts, Artificial Intelligence, Atrial, Automated, Canada, Carotid Stenosis, Cerebrovascular Accident, Cerebrovascular Circulation, Comparative Study, Computer-Assisted, Cross Infection, Cysteine, Data Collection, Decision Trees, Dementia, Diagnosis, Disulfides, Doppler, Embolism, Expert Systems, Extramural, Factor Analysis, Female, Gene Expression, Gene Expression Profiling, Health Status, Heart Septal Defects, Hospitals, Humans, Infection Control, Intracranial Embolism, Male, Models, Molecular, Myocardial Infarction, N.I.H., Neoplasms, Neural Networks (Computer), Non-U.S. Gov't, Oligonucleotide Array Sequence Analysis, Oxidation-Reduction, P.H.S., Pattern Recognition, Population Surveillance, Prevalence, Prognosis, Protein Binding, Protein Folding, Proteins, ROC Curve, Research Support, Retrospective Studies, Sensitivity and Specificity, Software, Statistical, Switzerland, Transcranial, Treatment Outcome, U.S. Gov't, Ultrasonography, University, 15360906
[Bowd2004Confocal] Christopher Bowd, Linda M Zangwill, Felipe A Medeiros, Jiucang Hao, Kwokleung Chan, Te-Won Lee, Terrence J Sejnowski, Michael H Goldbaum, Pamela A Sample, Jonathan G Crowston, and Robert N Weinreb. Confocal scanning laser ophthalmoscopy classifiers and stereophotograph evaluation for prediction of visual field abnormalities in glaucoma-suspect eyes. Invest Ophthalmol Vis Sci, 45(7):2255-62, Jul 2004. [ bib | DOI | http | .pdf ]
PURPOSE: To determine whether Heidelberg Retina Tomograph (HRT; Heidelberg Engineering, Dossenheim, Germany) classification techniques and investigational support vector machine (SVM) analyses can detect optic disc abnormalities in glaucoma-suspect eyes before the development of visual field abnormalities. METHODS: Glaucoma-suspect eyes (n = 226) were classified as converts or nonconverts based on the development of repeatable (either two or three consecutive) standard automated perimetry (SAP)-detected abnormalities over the course of the study (mean follow-up, approximately 4.5 years). Hazard ratios for development of SAP abnormalities were calculated based on baseline classification results, follow-up time, and end point status (convert, nonconvert). Classification techniques applied were HRT classification (HRTC), Moorfields Regression Analysis, forward-selection optimized SVM (SVM fwd) and backward elimination-optimized SVM (SVM back) analysis of HRT data, and stereophotograph assessment. RESULTS: Univariate analyses indicated that all classification techniques were predictors of the development of two repeatable abnormal SAP results, with hazard ratios (95% confidence interval [CI]) ranging from 1.32 (1.00-1.75) for HRTC to 2.0 (1.48-2.76) for stereophotograph assessment (all P ≤ 0.05). Only SVM (SVM fwd and SVM back) analysis of HRT data and stereophotograph assessment were univariate predictors of the development of three repeatable abnormal SAP results, with hazard ratios (95% CI) ranging from 1.73 (1.16-2.82) for SVM fwd to 1.82 (1.19-3.12) for SVM back (both P < 0.007). Multivariate analyses including each classification technique individually in a model with age, baseline SAP pattern standard deviation [PSD], and baseline IOP indicated that all classification techniques except HRTC (P = 0.06) were predictors of the development of two repeatable abnormal SAP results with hazard ratios ranging from 1.30 (0.99, 1.73) for HRTC to 1.90 (1.37, 2.69) for stereophotograph assessment. Only SVM (SVM fwd and SVM back) analysis of HRT data and stereophotograph assessment were significant predictors of the development of three repeatable abnormal SAP results in multivariate analyses; hazard ratios of 1.57 (1.03, 2.59) and 1.70 (1.18, 2.51), respectively. SAP PSD was a significant predictor of two repeatable abnormal SAP results in multivariate models with all classification techniques, with hazard ratios ranging from 3.31 (1.39, 7.89) to 4.70 (2.02, 10.93) per 1-dB increase. CONCLUSIONS: HRT classification techniques and stereophotograph assessment can detect optic disc topography abnormalities in glaucoma-suspect eyes before the development of SAP abnormalities. These data strongly support the importance of optic disc examination for early glaucoma diagnosis.

Keywords: 80 and over, Adolescent, Adult, Aged, Algorithms, Artificial Intelligence, Auditory, Benchmarking, Binding Sites, Brain Stem, Breast Diseases, Chemical, Child, Chromosomes, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, Data Interpretation, Databases, Diagnosis, Diagnostic Errors, Differential, Drug Resistance, Electroencephalography, Epilepsy, Evoked Potentials, Female, Forecasting, Gene Expression, Gene Expression Profiling, Genetic, Genotype, Glaucoma, Greece, HIV Protease Inhibitors, HIV-1, Human, Humans, Infant, Information Management, Information Storage and Retrieval, Intraocular Pressure, Kinetics, Language Development Disorders, Lasers, Least-Squares Analysis, Linear Models, Male, Microbial Sensitivity Tests, Middle Aged, Models, Molecular, Monitoring, Nephroblastoma, Non-U.S. Gov't, Nonlinear Dynamics, Ocular Hypertension, Oligonucleotide Array Sequence Analysis, Ophthalmoscopy, Optic Disk, Optic Nerve Diseases, P.H.S., Pair 1, Perimetry, Periodicals, Phosphorylation, Phosphotransferases, Photography, Physiologic, Point Mutation, Preschool, Prognosis, Protein, Proteins, Pyrimidinones, Reaction Time, Recurrence, Reproducibility of Results, Research Support, Reverse Transcriptase Inhibitors, Sensitivity and Specificity, Sequence Alignment, Sequence Analysis, Signal Processing, Software, Sound Localization, Statistical, Stochastic Processes, Structure-Activity Relationship, Theoretical, Time Factors, U.S. Gov't, Viral, Vision Disorders, Visual Fields, 15223803
[Roth2004Bayesian] Volker Roth and Tilman Lange. Bayesian class discovery in microarray datasets. IEEE Trans Biomed Eng, 51(5):707-718, May 2004. [ bib ]
A novel approach to class discovery in gene expression datasets is presented. In the context of clinical diagnosis, the central goal of class discovery algorithms is to simultaneously find putative (sub-)types of diseases and to identify informative subsets of genes with disease-type specific expression profile. Contrary to many other approaches in the literature, the method presented implements a wrapper strategy for feature selection, in the sense that the features are directly selected by optimizing the discriminative power of the used partitioning algorithm. The usual combinatorial problems associated with wrapper approaches are overcome by a Bayesian inference mechanism. On the technical side, we present an efficient optimization algorithm with guaranteed local convergence property. The only free parameter of the optimization method is selected by a resampling-based stability analysis. Experiments with Leukemia and Lymphoma datasets demonstrate that our method is able to correctly infer partitions and corresponding subsets of genes which both are relevant in a biological sense. Moreover, the frequently observed problem of ambiguities caused by different but equally high-scoring partitions is successfully overcome by the model selection method proposed.

Keywords: Algorithms, Automated, Bayes Theorem, Cluster Analysis, Comparative Study, DNA, Databases, Gene Expression Profiling, Genetic, Genetic Screening, Humans, Leukemia, Models, Non-U.S. Gov't, Nucleic Acid, Oligonucleotide Array Sequence Analysis, Pattern Recognition, Reproducibility of Results, Research Support, Sensitivity and Specificity, Sequence Alignment, Sequence Analysis, Statistical, 15132496
[Luo2004gene-silencing] K. Q. Luo and D. C. Chang. The gene-silencing efficiency of siRNA is strongly dependent on the local structure of mRNA at the targeted region. Biochem. Biophys. Res. Commun., 318(1):303-10, May 2004. [ bib | DOI | http ]
The gene-silencing effect of short interfering RNA (siRNA) is known to vary strongly with the targeted position of the mRNA. A number of hypotheses have been suggested to explain this phenomenon. We would like to test if this positional effect is mainly due to the secondary structure of the mRNA at the target site. We proposed that this structural factor can be characterized by a single parameter called "the hydrogen bond (H-b) index," which represents the average number of hydrogen bonds formed between nucleotides in the target region and the rest of the mRNA. This index can be determined using a computational approach. We tested the correlation between the H-b index and the gene-silencing effects on three genes (Bcl-2, hTF, and cyclin B1) using a variety of siRNAs. We found that the gene-silencing effect is inversely dependent on the H-b index, indicating that the local mRNA structure at the targeted site is the main cause of the positional effect. Based on this finding, we suggest that the H-b index can be a useful guideline for future siRNA design.

Keywords: Animals, Apoptosis, Base Composition, Base Pairing, Base Sequence, Binding Sites, Cell Cycle, Cell Proliferation, Comparative Study, Cultured, Cyclin B, Cyclin D1, DNA-Binding Proteins, Down-Regulation, Extramural, Fluorescence, Gene Silencing, Gene Targeting, Genetic Vectors, Green Fluorescent Proteins, Hela Cells, Humans, Hydrogen Bonding, Luminescent Proteins, Male, Messenger, Mice, Microscopy, Models, Molecular, Molecular Sequence Data, N.I.H., Non-U.S. Gov't, Nucleic Acid Conformation, Nude, P.H.S., Prostatic Neoplasms, Proto-Oncogene Proteins c-bcl-2, Proto-Oncogene Proteins c-myc, RNA, Regression Analysis, Research Support, STAT3 Transcription Factor, Small Interfering, Thromboplastin, Trans-Activators, Tumor Cells, U.S. Gov't, 15110788
[Shen2005[Detection] Li Shen, Jie Yang, and Yue Zhou. Detection of PVCs with support vector machine. Sheng Wu Yi Xue Gong Cheng Xue Za Zhi, 22(1):78-81, Feb 2005. [ bib ]
The classification of heart beats is the foundation for automated arrhythmia monitoring devices. Support vector machines (SVMs) have meant a great advance in solving classification or pattern recognition. This study describes SVM for the identification of premature ventricular contractions (PVCs) in surface ECGs. Features for the classification task are extracted by analyzing the heart rate, morphology and wavelet energy of the heart beats from a single lead. The performance of different SVMs is evaluated on the MIT-BIH arrhythmia database following the Association for the Advancement of Medical Instrumentation (AAMI) recommendations.

Keywords: 80 and over, Adult, Aged, Algorithms, Amino Acids, Animals, Area Under Curve, Artifacts, Automated, Birefringence, Brain Chemistry, Brain Neoplasms, Comparative Study, Computer-Assisted, Cornea, Cross-Sectional Studies, Decision Trees, Diagnosis, Diagnostic Imaging, Diagnostic Techniques, Discriminant Analysis, Evolution, Face, Female, Genetic, Glaucoma, Humans, Intraocular Pressure, Lasers, Least-Squares Analysis, Magnetic Resonance Imaging, Magnetic Resonance Spectroscopy, Male, Middle Aged, Models, Molecular, Nerve Fibers, Non-U.S. Gov't, Numerical Analysis, Ophthalmological, Optic Nerve Diseases, Optical Coherence, P.H.S., Pattern Recognition, Photic Stimulation, Prospective Studies, Protein, ROC Curve, Regression Analysis, Research Support, Retinal Ganglion Cells, Sensitivity and Specificity, Sequence Analysis, Statistics, Tomography, U.S. Gov't, Visual Fields, beta-Lactamases, 15762121
[Sheinerman2005High] Felix B Sheinerman, Elie Giraud, and Abdelazize Laoui. High affinity targets of protein kinase inhibitors have similar residues at the positions energetically important for binding. J. Mol. Biol., 352(5):1134-1156, Oct 2005. [ bib | DOI | http ]
Inhibition of protein kinase activity is a focus of intense drug discovery efforts in several therapeutic areas. Major challenges facing the field include understanding of the factors determining the selectivity of kinase inhibitors and the development of compounds with the desired selectivity profile. Here, we report the analysis of sequence variability among high and low affinity targets of eight different small molecule kinase inhibitors (BIRB796, Tarceva, NU6102, Gleevec, SB203580, balanol, H89, PP1). It is observed that all high affinity targets of each inhibitor are found among a relatively small number of kinases, which have similar residues at the specific positions important for binding. The findings are highly statistically significant, and allow one to exclude the majority of kinases in a genome from a list of likely targets for an inhibitor. The findings have implications for the design of novel inhibitors with a desired selectivity profile (e.g. targeted at multiple kinases), the discovery of new targets for kinase inhibitor drugs, comparative analysis of different in vivo models, and the design of "a-la-carte" chemical libraries tailored for individual kinases.

Keywords: Amino Acid Sequence; Amino Acids; Binding Sites; Electrostatics; Humans; Ligands; Molecular Sequence Data; Piperazines; Protein Binding; Protein Kinase Inhibitors; Protein Kinases; Pyrazoles; Pyrimidines; Sequence Alignment; Thermodynamics
[Sassi2005automated] Alexander P Sassi, Frank Andel, Hans-Marcus L Bitter, Michael P S Brown, Robert G Chapman, Jeraldine Espiritu, Alfred C Greenquist, Isabelle Guyon, Mariana Horchi-Alegre, Kathy L Stults, Ann Wainright, Jonathan C Heller, and John T Stults. An automated, sheathless capillary electrophoresis-mass spectrometry platform for discovery of biomarkers in human serum. Electrophoresis, 26(7-8):1500-12, Apr 2005. [ bib | DOI | http | .pdf ]
A capillary electrophoresis-mass spectrometry (CE-MS) method has been developed to perform routine, automated analysis of low-molecular-weight peptides in human serum. The method incorporates transient isotachophoresis for in-line preconcentration and a sheathless electrospray interface. To evaluate the performance of the method and demonstrate the utility of the approach, an experiment was designed in which peptides were added to sera from individuals at each of two different concentrations, artificially creating two groups of samples. The CE-MS data from the serum samples were divided into separate training and test sets. A pattern-recognition/feature-selection algorithm based on support vector machines was used to select the mass-to-charge (m/z) values from the training set data that distinguished the two groups of samples from each other. The added peptides were identified correctly as the distinguishing features, and pattern recognition based on these peptides was used to assign each sample in the independent test set to its respective group. A twofold difference in peptide concentration could be detected with statistical significance (p-value < 0.0001). The accuracy of the assignment was 95%, demonstrating the utility of this technique for the discovery of patterns of biomarkers in serum.

Keywords: 80 and over, Adult, Aged, Algorithms, Amino Acids, Animals, Area Under Curve, Artifacts, Automated, Birefringence, Brain Chemistry, Brain Neoplasms, Comparative Study, Computer-Assisted, Cornea, Cross-Sectional Studies, Decision Trees, Diagnosis, Diagnostic Imaging, Diagnostic Techniques, Discriminant Analysis, Evolution, Face, Female, Genetic, Glaucoma, Humans, Intraocular Pressure, Lasers, Least-Squares Analysis, Magnetic Resonance Imaging, Magnetic Resonance Spectroscopy, Male, Middle Aged, Models, Molecular, Nerve Fibers, Non-U.S. Gov't, Numerical Analysis, Ophthalmological, Optic Nerve Diseases, Optical Coherence, P.H.S., Pattern Recognition, Photic Stimulation, Prospective Studies, Protein, ROC Curve, Regression Analysis, Research Support, Retinal Ganglion Cells, Sensitivity and Specificity, Sequence Analysis, Statistics, Tomography, U.S. Gov't, Visual Fields, beta-Lactamases, 15765480
[Rice2005Reconstructing] J.J. Rice, Y. Tu, and G. Stolovitzky. Reconstructing biological networks using conditional correlation analysis. Bioinformatics, 21(6):765-773, Mar 2005. [ bib | DOI | http ]
MOTIVATION: One of the present challenges in biological research is the organization of the data originating from high-throughput technologies. One way in which this information can be organized is in the form of networks of influences, physical or statistical, between cellular components. We propose an experimental method for probing biological networks, analyzing the resulting data and reconstructing the network architecture. METHODS: We use networks of known topology consisting of nodes (genes), directed edges (gene-gene interactions) and a dynamics for the genes' mRNA concentrations in terms of the gene-gene interactions. We proposed a network reconstruction algorithm based on the conditional correlation of the mRNA equilibrium concentration between two genes given that one of them was knocked down. Using simulated gene expression data on networks of known connectivity, we investigated how the reconstruction error is affected by noise, network topology, size, sparseness and dynamic parameters. RESULTS: Errors arise from correlation between nodes connected through intermediate nodes (false positives) and when the correlation between two directly connected nodes is obscured by noise, non-linearity or multiple inputs to the target node (false negatives). Two critical components of the method are as follows: (1) the choice of an optimal correlation threshold for predicting connections and (2) the reduction of errors arising from indirect connections (for which a novel algorithm is proposed). With these improvements, we can reconstruct networks with the topology of the transcriptional regulatory network in Escherichia coli with a reasonably low error rate.

Keywords: Algorithms; Computer Simulation; Gene Expression Profiling; Gene Expression Regulation; Models, Biological; Models, Statistical; Oligonucleotide Array Sequence Analysis; Protein Interaction Mapping; Signal Transduction; Statistics as Topic; Transcription Factors
[Perez-Cruz2005Convergence] Fernando Pérez-Cruz, Carlos Bousoño-Calzón, and Antonio Artés-Rodríguez. Convergence of the IRWLS Procedure to the Support Vector Machine Solution. Neural Comput, 17(1):7-18, Jan 2005. [ bib ]
An iterative reweighted least squares (IRWLS) procedure recently proposed is shown to converge to the support vector machine solution. The convergence to a stationary point is ensured by modifying the original IRWLS procedure.

Keywords: 80 and over, Aged, Algorithms, Amino Acids, Animals, Area Under Curve, Automated, Brain Chemistry, Brain Neoplasms, Comparative Study, Computer-Assisted, Cross-Sectional Studies, Decision Trees, Diagnosis, Diagnostic Imaging, Diagnostic Techniques, Discriminant Analysis, Evolution, Face, Genetic, Glaucoma, Humans, Lasers, Least-Squares Analysis, Magnetic Resonance Imaging, Magnetic Resonance Spectroscopy, Middle Aged, Models, Molecular, Nerve Fibers, Non-U.S. Gov't, Numerical Analysis, Ophthalmological, Optic Nerve Diseases, P.H.S., Pattern Recognition, Photic Stimulation, Protein, ROC Curve, Regression Analysis, Research Support, Retinal Ganglion Cells, Sensitivity and Specificity, Sequence Analysis, Statistics, U.S. Gov't, beta-Lactamases, 15779160
[Prill2005PlosBiol] Robert J Prill, Pablo A Iglesias, and Andre Levchenko. Dynamic properties of network motifs contribute to biological network organization. PLoS Biol, 3(11):e343, Nov 2005. [ bib | DOI | http ]
Biological networks, such as those describing gene regulation, signal transduction, and neural synapses, are representations of large-scale dynamic systems. Discovery of organizing principles of biological networks can be enhanced by embracing the notion that there is a deep interplay between network structure and system dynamics. Recently, many structural characteristics of these non-random networks have been identified, but dynamical implications of the features have not been explored comprehensively. We demonstrate by exhaustive computational analysis that a dynamical property (stability or robustness to small perturbations) is highly correlated with the relative abundance of small subnetworks (network motifs) in several previously determined biological networks. We propose that robust dynamical stability is an influential property that can determine the non-random structure of biological networks.

Keywords: Animals; Caenorhabditis elegans, physiology; Computational Biology, methods; Computer Simulation; Drosophila melanogaster, physiology; Escherichia coli, physiology; Models, Biological; Nerve Net; Saccharomyces cerevisiae, physiology; Signal Transduction; Statistics as Topic; Systems Theory; Transcription, Genetic
[Peters2005Generating] Bjoern Peters and Alessandro Sette. Generating quantitative models describing the sequence specificity of biological processes with the stabilized matrix method. BMC Bioinformatics, 6:132, 2005. [ bib | DOI | http ]
BACKGROUND: Many processes in molecular biology involve the recognition of short sequences of nucleic or amino acids, such as the binding of immunogenic peptides to major histocompatibility complex (MHC) molecules. From experimental data, a model of the sequence specificity of these processes can be constructed, such as a sequence motif, a scoring matrix or an artificial neural network. The purpose of these models is two-fold. First, they can provide a summary of experimental results, allowing for a deeper understanding of the mechanisms involved in sequence recognition. Second, such models can be used to predict the experimental outcome for yet untested sequences. In the past we reported the development of a method to generate such models called the Stabilized Matrix Method (SMM). This method has been successfully applied to predicting peptide binding to MHC molecules, peptide transport by the transporter associated with antigen presentation (TAP) and proteasomal cleavage of protein sequences. RESULTS: Herein we report the implementation of the SMM algorithm as a publicly available software package. Specific features determining the type of problems the method is most appropriate for are discussed. Advantageous features of the package are: (1) the output generated is easy to interpret, (2) input and output are both quantitative, (3) specific computational strategies to handle experimental noise are built in, (4) the algorithm is designed to effectively handle bounded experimental data, (5) experimental data from randomized peptide libraries and conventional peptides can easily be combined, and (6) it is possible to incorporate pair interactions between positions of a sequence. CONCLUSION: Making the SMM method publicly available enables bioinformaticians and experimental biologists to easily access it, to compare its performance to other prediction methods, and to extend it to other applications.

Keywords: Algorithms; Amino Acid Sequence; Biology; Computational Biology; Computer Simulation; Data Interpretation, Statistical; Databases, Protein; Models, Biological; Models, Statistical; Neural Networks (Computer); Peptide Library; Peptides; Programming Languages; Protein Binding; Sensitivity and Specificity; Software
[Pang2005Face] Shaoning Pang, Daijin Kim, and Sung Yang Bang. Face membership authentication using SVM classification tree generated by membership-based LLE data partition. IEEE Trans Neural Netw, 16(2):436-46, Mar 2005. [ bib ]
This paper presents a new membership authentication method by face classification using a support vector machine (SVM) classification tree, in which the size of membership group and the members in the membership group can be changed dynamically. Unlike our previous SVM ensemble-based method, which performed only one face classification in the whole feature space, the proposed method employed a divide and conquer strategy that first performs a recursive data partition by membership-based locally linear embedding (LLE) data clustering, then does the SVM classification in each partitioned feature subset. Our experimental results show that the proposed SVM tree not only keeps the good properties that the SVM ensemble method has, such as a good authentication accuracy and the robustness to the change of members, but also has a considerable improvement on the stability under the change of membership group size.

Keywords: 80 and over, Aged, Algorithms, Area Under Curve, Cross-Sectional Studies, Decision Trees, Diagnostic Imaging, Diagnostic Techniques, Face, Glaucoma, Humans, Lasers, Least-Squares Analysis, Middle Aged, Nerve Fibers, Non-U.S. Gov't, Ophthalmological, Optic Nerve Diseases, P.H.S., Photic Stimulation, ROC Curve, Research Support, Retinal Ganglion Cells, Sensitivity and Specificity, Statistics, U.S. Gov't, 15787150
[Nabieva2005Whole-proteome] Elena Nabieva, Kam Jim, Amit Agarwal, Bernard Chazelle, and Mona Singh. Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics, 21 Suppl 1:i302-i310, Jun 2005. [ bib | DOI | http ]
MOTIVATION: Determining protein function is one of the most important problems in the post-genomic era. For the typical proteome, there are no functional annotations for one-third or more of its proteins. Recent high-throughput experiments have determined proteome-scale protein physical interaction maps for several organisms. These physical interactions are complemented by an abundance of data about other types of functional relationships between proteins, including genetic interactions, knowledge about co-expression and shared evolutionary history. Taken together, these pairwise linkages can be used to build whole-proteome protein interaction maps. RESULTS: We develop a network-flow based algorithm, FunctionalFlow, that exploits the underlying structure of protein interaction maps in order to predict protein function. In cross-validation testing on the yeast proteome, we show that FunctionalFlow has improved performance over previous methods in predicting the function of proteins with few (or no) annotated protein neighbors. By comparing several methods that use protein interaction maps to predict protein function, we demonstrate that FunctionalFlow performs well because it takes advantage of both network topology and some measure of locality. Finally, we show that performance can be improved substantially as we consider multiple data sources and use them to create weighted interaction networks. AVAILABILITY: http://compbio.cs.princeton.edu/function

Keywords: Algorithms; Computational Biology, methods; Evolution, Molecular; Fungal Proteins, chemistry; Genomics; Models, Statistical; Models, Theoretical; Protein Interaction Mapping, methods; Proteins, chemistry; Proteomics, methods
[Micchelli2005On] Charles A Micchelli and Massimiliano Pontil. On learning vector-valued functions. Neural Comput, 17(1):177-204, Jan 2005. [ bib | DOI | http ]
In this letter, we provide a study of learning in a Hilbert space of vector-valued functions. We motivate the need for extending learning theory of scalar-valued functions by practical considerations and establish some basic results for learning vector-valued functions that should prove useful in applications. Specifically, we allow an output space Y to be a Hilbert space, and we consider a reproducing kernel Hilbert space of functions whose values lie in Y. In this setting, we derive the form of the minimal norm interpolant to a finite set of data and apply it to study some regularization functionals that are important in learning theory. We consider specific examples of such functionals corresponding to multiple-output regularization networks and support vector machines, for both regression and classification. Finally, we provide classes of operator-valued kernels of the dot product and translation-invariant type.

Keywords: Algorithms, Amino Acid, Amino Acids, Artificial Intelligence, Ascomycota, Automated, Base Sequence, Chromosome Mapping, Codon, Colonic Neoplasms, Comparative Study, Computer Simulation, Computer-Assisted, Computing Methodologies, Crystallography, DNA, DNA Primers, Databases, Decision Support Techniques, Diagnostic Imaging, Enzymes, Feedback, Fixation, Gene Expression Profiling, Genetic, Hordeum, Host-Parasite Relations, Humans, Image Enhancement, Image Interpretation, Informatics, Information Storage and Retrieval, Kinetics, Logistic Models, Magnetic Resonance Spectroscopy, Mathematical Computing, Models, Nanotechnology, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Ocular, Oligonucleotide Array Sequence Analysis, P.H.S., Pattern Recognition, Plant, Plants, Predictive Value of Tests, Protein, Protein Conformation, Regression Analysis, Research Support, Sample Size, Selection (Genetics), Sequence Alignment, Sequence Analysis, Sequence Homology, Signal Processing, Skin, Software, Statistical, Subtraction Technique, Theoretical, Thermodynamics, U.S. Gov't, Viral Proteins, X-Ray, 15563752
[Larsen2005integrative] Mette Voldby Larsen, Claus Lundegaard, Kasper Lamberth, Søren Buus, Søren Brunak, Ole Lund, and Morten Nielsen. An integrative approach to CTL epitope prediction: a combined algorithm integrating MHC class I binding, TAP transport efficiency, and proteasomal cleavage predictions. Eur. J. Immunol., 35(8):2295-2303, Aug 2005. [ bib | DOI | http ]
Reverse immunogenetic approaches attempt to optimize the selection of candidate epitopes, and thus minimize the experimental effort needed to identify new epitopes. When predicting cytotoxic T cell epitopes, the main focus has been on the highly specific MHC class I binding event. Methods have also been developed for predicting the antigen-processing steps preceding MHC class I binding, including proteasomal cleavage and transporter associated with antigen processing (TAP) transport efficiency. Here, we use a dataset obtained from the SYFPEITHI database to show that a method integrating predictions of MHC class I binding affinity, TAP transport efficiency, and C-terminal proteasomal cleavage outperforms any of the individual methods. Using an independent evaluation dataset of HIV epitopes from the Los Alamos database, the validity of the integrated method is confirmed. The performance of the integrated method is found to be significantly higher than that of the two publicly available prediction methods BIMAS and SYFPEITHI. To identify 85% of the epitopes in the HIV dataset, 9% and 10% of all possible nonamers in the HIV proteins must be tested when using the BIMAS and SYFPEITHI methods, respectively, for the selection of candidate epitopes. This number is reduced to 7% when using the integrated method. In practical terms, this means that the experimental effort needed to identify an epitope in a hypothetical protein with 85% probability is reduced by 20-30% when using the integrated method. The method is available at http://www.cbs.dtu.dk/services/NetCTL. Supplementary material is available at http://www.cbs.dtu.dk/suppl/immunology/CTL.php.

Keywords: Algorithms; Data Interpretation, Statistical; Epitopes, T-Lymphocyte; Histocompatibility Antigens Class I; Humans; Hydrolysis; Predictive Value of Tests; Proteasome Endopeptidase Complex; Protein Binding; Research Support, N.I.H., Extramural; Research Support, Non-U.S. Gov't; Research Support, U.S. Gov't, P.H.S.; T-Lymphocytes, Cytotoxic
[Golland2005Detection] Polina Golland, W. Eric L Grimson, Martha E Shenton, and Ron Kikinis. Detection and analysis of statistical differences in anatomical shape. Med Image Anal, 9(1):69-86, Feb 2005. [ bib | DOI | http ]
We present a computational framework for image-based analysis and interpretation of statistical differences in anatomical shape between populations. Applications of such analysis include understanding developmental and anatomical aspects of disorders when comparing patients versus normal controls, studying morphological changes caused by aging, or even differences in normal anatomy, for example, differences between genders. Once a quantitative description of organ shape is extracted from input images, the problem of identifying differences between the two groups can be reduced to one of the classical questions in machine learning of constructing a classifier function for assigning new examples to one of the two groups while making as few misclassifications as possible. The resulting classifier must be interpreted in terms of shape differences between the two groups back in the image domain. We demonstrate a novel approach to such interpretation that allows us to argue about the identified shape differences in anatomically meaningful terms of organ deformation. Given a classifier function in the feature space, we derive a deformation that corresponds to the differences between the two classes while ignoring shape variability within each class. Based on this approach, we present a system for statistical shape analysis using distance transforms for shape representation and the support vector machines learning algorithm for the optimal classifier estimation and demonstrate it on artificially generated data sets, as well as real medical studies.

Keywords: Algorithms, Amino Acid, Artificial Intelligence, Ascomycota, Automated, Base Sequence, Chromosome Mapping, Codon, Colonic Neoplasms, Comparative Study, Computer-Assisted, Crystallography, DNA, DNA Primers, Databases, Diagnostic Imaging, Gene Expression Profiling, Hordeum, Host-Parasite Relations, Humans, Image Interpretation, Informatics, Kinetics, Magnetic Resonance Spectroscopy, Models, Nanotechnology, Non-P.H.S., Non-U.S. Gov't, Oligonucleotide Array Sequence Analysis, P.H.S., Pattern Recognition, Plant, Plants, Predictive Value of Tests, Protein, Research Support, Selection (Genetics), Sequence Alignment, Sequence Analysis, Sequence Homology, Skin, Software, Statistical, Theoretical, Thermodynamics, U.S. Gov't, Viral Proteins, X-Ray, 15581813
[Ehlers2005NBS1] Justis P Ehlers and J. William Harbour. NBS1 expression as a prognostic marker in uveal melanoma. Clin. Cancer Res., 11(5):1849-53, Mar 2005. [ bib | DOI | http | .pdf ]
PURPOSE: Up to half of uveal melanoma patients die of metastatic disease. Treatment of the primary eye tumor does not improve survival in high-risk patients due to occult micrometastatic disease, which is present at the time of eye tumor diagnosis but is not detected and treated until months to years later. Here, we use microarray gene expression data to identify a new prognostic marker. EXPERIMENTAL DESIGN: Microarray gene expression profiles were analyzed in 25 primary uveal melanomas. Tumors were ranked by support vector machine (SVM) and by cytologic severity. Nbs1 protein expression was assessed by quantitative immunohistochemistry in 49 primary uveal melanomas. Survival was assessed using Kaplan-Meier life-table analysis. RESULTS: Expression of the Nijmegen breakage syndrome (NBS1) gene correlated strongly with SVM and cytologic tumor rankings (P < 0.0001). Further, immunohistochemistry expression of the Nbs1 protein correlated strongly with both SVM and cytologic rankings (P < 0.0001). The 6-year actuarial survival was 100% in patients with low immunohistochemistry expression of Nbs1 and 22% in those with high Nbs1 expression (P = 0.01). CONCLUSIONS: NBS1 is a strong predictor of uveal melanoma survival and potentially could be used as a clinical marker for guiding clinical management.

Keywords: 80 and over, Adult, Aged, Algorithms, Amino Acid Sequence, Amino Acids, Analysis of Variance, Animals, Area Under Curve, Artifacts, Automated, Bacteriophage T4, Base Sequence, Biological, Birefringence, Brain Chemistry, Brain Neoplasms, Cell Cycle Proteins, Comparative Study, Computational Biology, Computer-Assisted, Cornea, Cross-Sectional Studies, Databases, Decision Trees, Diagnosis, Diagnostic Imaging, Diagnostic Techniques, Discriminant Analysis, Evolution, Extramural, Face, Female, Gene Expression Profiling, Genetic, Glaucoma, Humans, Immunohistochemistry, Intraocular Pressure, Lasers, Least-Squares Analysis, Likelihood Functions, Magnetic Resonance Imaging, Magnetic Resonance Spectroscopy, Male, Markov Chains, Melanoma, Middle Aged, Models, Molecular, Mutation, N.I.H., Nerve Fibers, Non-P.H.S., Non-U.S. Gov't, Nuclear Proteins, Nucleic Acid, Nucleic Acid Conformation, Numerical Analysis, Oligonucleotide Array Sequence Analysis, Ophthalmological, Optic Nerve Diseases, Optical Coherence, P.H.S., Pattern Recognition, Photic Stimulation, Polymorphism, Prognosis, Prospective Studies, Protein, Protein Structure, Proteins, RNA, ROC Curve, Regression Analysis, Reproducibility of Results, Research Support, Retinal Ganglion Cells, Secondary, Sensitivity and Specificity, Sequence Analysis, Single Nucleotide, Single-Stranded Conformational, Software, Statistics, Survival Analysis, Tertiary, Tomography, Tumor Markers, U.S. Gov't, Untranslated, Uveal Neoplasms, Visual Fields, beta-Lactamases, 15756009
[Ding2005Minimum] Chris Ding and Hanchuan Peng. Minimum redundancy feature selection from microarray gene expression data. J Bioinform Comput Biol, 3(2):185-205, Apr 2005. [ bib ]
Selecting a small subset out of the thousands of genes in microarray data is important for accurate classification of phenotypes. Widely used methods typically rank genes according to their differential expressions among phenotypes and pick the top-ranked genes. We observe that feature sets so obtained have certain redundancy and study methods to minimize it. We propose a minimum redundancy - maximum relevance (MRMR) feature selection framework. Genes selected via MRMR provide a more balanced coverage of the space and capture broader characteristics of phenotypes. They lead to significantly improved class predictions in extensive experiments on 6 gene expression data sets: NCI, Lymphoma, Lung, Child Leukemia, Leukemia, and Colon. Improvements are observed consistently among 4 classification methods: Naive Bayes, Linear discriminant analysis, Logistic regression, and Support vector machines. SUPPLEMENTARY: The top 60 MRMR genes for each of the datasets are listed in http://crd.lbl.gov/ cding/MRMR/. More information related to MRMR methods can be found at http://www.hpeng.net/.

Keywords: Adult, Aged, Aging, Algorithms, Animals, Apoptosis, Artificial Intelligence, Automated, Biological, Bone Marrow, Breast Neoplasms, Classification, Cluster Analysis, Comparative Study, Computer Simulation, Computer-Assisted, Diagnosis, Dose-Response Relationship, Drug, Female, Foot, Gait, Gene Expression Profiling, Gene Expression Regulation, Gene Silencing, Genetic Vectors, Humans, Image Interpretation, Information Storage and Retrieval, Kidney, Liver, Logistic Models, Male, Messenger, Models, Myocardium, Neoplasms, Non-U.S. Gov't, Oligonucleotide Array Sequence Analysis, Pattern Recognition, Pharmaceutical Preparations, Polymerase Chain Reaction, Principal Component Analysis, Proteins, RNA, Rats, Reproducibility of Results, Research Support, Sensitivity and Specificity, Small Interfering, Sprague-Dawley, Statistical, Subcellular Fractions, Unknown Primary, 15852500
[Bernardo2005Chemogenomica] D. di Bernardo, M.J. Thompson, T.S. Gardner, S.E. Chobot, E.L. Eastwood, A.P. Wojtovich, S.J. Elliott, S.E. Schaus, and J.J. Collins. Chemogenomic profiling on a genome-wide scale using reverse-engineered gene networks. Nat Biotechnol, 23(3):377-383, Mar 2005. [ bib | DOI | http ]
A major challenge in drug discovery is to distinguish the molecular targets of a bioactive compound from the hundreds to thousands of additional gene products that respond indirectly to changes in the activity of the targets. Here, we present an integrated computational-experimental approach for computing the likelihood that gene products and associated pathways are targets of a compound. This is achieved by filtering the mRNA expression profile of compound-exposed cells using a reverse-engineered model of the cell's gene regulatory network. We apply the method to a set of 515 whole-genome yeast expression profiles resulting from a variety of treatments (compounds, knockouts and induced expression), and correctly enrich for the known targets and associated pathways in the majority of compounds examined. We demonstrate our approach with PTSB, a growth inhibitory compound with a previously unknown mode of action, by predicting and validating thioredoxin and thioredoxin reductase as its target.

Keywords: Algorithms; Artificial Intelligence; Computer Simulation; Drug Delivery Systems; Drug Design; Gene Expression Profiling; Gene Expression Regulation; Models, Biological; Models, Statistical; Protein Engineering; Protein Interaction Mapping; Saccharomyces cerevisiae; Saccharomyces cerevisiae Proteins; Signal Transduction; Thioredoxin-Disulfide Reductase; Thioredoxins
[Bagga2005Quantitative] Harmohina Bagga, David S Greenfield, and William J Feuer. Quantitative assessment of atypical birefringence images using scanning laser polarimetry with variable corneal compensation. Am J Ophthalmol, 139(3):437-46, Mar 2005. [ bib | DOI | http | .pdf ]
PURPOSE: To define the clinical characteristics of atypical birefringence images and to describe a quantitative method for their identification. DESIGN: Prospective, comparative, clinical observational study. METHODS: Normal and glaucomatous eyes underwent complete examination, standard automated perimetry, scanning laser polarimetry with variable corneal compensation (GDx-VCC), and optical coherence tomography (OCT) of the macula, peripapillary retinal nerve fiber layer (RNFL), and optic disk. Eyes were classified into two groups: normal birefringence pattern (NBP) and atypical birefringence pattern (ABP). Clinical, functional, and structural characteristics were assessed separately. A multiple logistic regression model was used to predict eyes with ABP on the basis of a quantitative scan score generated by a support vector machine (SVM) with GDx-VCC. RESULTS: Sixty-five eyes of 65 patients were enrolled. ABP images were observed in 5 of 20 (25%) normal eyes and 23 of 45 (51%) glaucomatous eyes. Compared with eyes with NBP, glaucomatous eyes with ABP demonstrated significantly lower SVM scores (P < .0001, < 0.0001, 0.008, 0.03, and 0.03, respectively) and greater temporal, mean, inferior, and nasal RNFL thickness using GDx-VCC; and a weaker correlation with OCT generated RNFL thickness (R(2) = .75 vs .27). ABP images were significantly correlated with older age (R(2) = .16, P = .001). The SVM score was the only significant (P < .0001) predictor of ABP images and provided high discriminating power between eyes with NBP and ABP (area under the receiver operator characteristic curve = 0.98). CONCLUSIONS: ABP images exist in a subset of normal and glaucomatous eyes, are associated with older patient age, and produce an artifactual increase in RNFL thickness using GDx-VCC. The SVM score is highly predictive of ABP images.

Keywords: 80 and over, Adult, Aged, Algorithms, Amino Acids, Animals, Area Under Curve, Artifacts, Automated, Birefringence, Brain Chemistry, Brain Neoplasms, Comparative Study, Computer-Assisted, Cornea, Cross-Sectional Studies, Decision Trees, Diagnosis, Diagnostic Imaging, Diagnostic Techniques, Discriminant Analysis, Evolution, Face, Female, Genetic, Glaucoma, Humans, Intraocular Pressure, Lasers, Least-Squares Analysis, Magnetic Resonance Imaging, Magnetic Resonance Spectroscopy, Male, Middle Aged, Models, Molecular, Nerve Fibers, Non-U.S. Gov't, Numerical Analysis, Ophthalmological, Optic Nerve Diseases, Optical Coherence, P.H.S., Pattern Recognition, Photic Stimulation, Prospective Studies, Protein, ROC Curve, Regression Analysis, Research Support, Retinal Ganglion Cells, Sensitivity and Specificity, Sequence Analysis, Statistics, Tomography, U.S. Gov't, Visual Fields, beta-Lactamases, 15767051
[Neuvial2006Spatial] Pierre Neuvial, Philippe Hupé, Isabel Brito, Stéphane Liva, Elodie Manié, Caroline Brennetot, François Radvanyi, Alain Aurias, and Emmanuel Barillot. Spatial normalization of array-cgh data. BMC Bioinformatics, 7:264, 2006. [ bib | DOI | http ]
BACKGROUND: Array-based comparative genomic hybridization (array-CGH) is a recently developed technique for analyzing changes in DNA copy number. As in all microarray analyses, normalization is required to correct for experimental artifacts while preserving the true biological signal. We investigated various sources of systematic variation in array-CGH data and identified two distinct types of spatial effect of no biological relevance as the predominant experimental artifacts: continuous spatial gradients and local spatial bias. Local spatial bias affects a large proportion of arrays, and has not previously been considered in array-CGH experiments. RESULTS: We show that existing normalization techniques do not correct these spatial effects properly. We therefore developed an automatic method for the spatial normalization of array-CGH data. This method makes it possible to delineate and to eliminate and/or correct areas affected by spatial bias. It is based on the combination of a spatial segmentation algorithm called NEM (Neighborhood Expectation Maximization) and spatial trend estimation. We defined quality criteria for array-CGH data, demonstrating significant improvements in data quality with our method for three data sets coming from two different platforms (198, 175 and 26 BAC-arrays). CONCLUSION: We have designed an automatic algorithm for the spatial normalization of BAC CGH-array data, preventing the misinterpretation of experimental artifacts as biologically relevant outliers in the genomic profile. This algorithm is implemented in the R package MANOR (Micro-Array NORmalization), which is described at http://bioinfo.curie.fr/projects/manor and available from the Bioconductor site http://www.bioconductor.org. It can also be tested on the CAPweb bioinformatics platform at http://bioinfo.curie.fr/CAPweb.

Keywords: Algorithms; Artifacts; Base Sequence; Chromosome Mapping, methods; Computer Simulation; Data Interpretation, Statistical; Gene Dosage; In Situ Hybridization, methods; Models, Genetic; Models, Statistical; Molecular Sequence Data; Oligonucleotide Array Sequence Analysis, methods; Sequence Analysis, DNA, methods
[Kapp2006Discovery] Amy V Kapp, Stefanie S Jeffrey, Anita Langerød, Anne-Lise Børresen-Dale, Wonshik Han, Dong-Young Noh, Ida R K Bukholm, Monica Nicolau, Patrick O Brown, and Robert Tibshirani. Discovery and validation of breast cancer subtypes. BMC Genomics, 7:231, 2006. [ bib | DOI | http ]
Previous studies demonstrated breast cancer tumor tissue samples could be classified into different subtypes based upon DNA microarray profiles. The most recent study presented evidence for the existence of five different subtypes: normal breast-like, basal, luminal A, luminal B, and ERBB2+. Based upon the analysis of 599 microarrays (five separate cDNA microarray datasets) using a novel approach, we present evidence in support of the most consistently identifiable subtypes of breast cancer tumor tissue microarrays being: ESR1+/ERBB2-, ESR1-/ERBB2-, and ERBB2+ (collectively called the ESR1/ERBB2 subtypes). We validate all three subtypes statistically and show the subtype to which a sample belongs is a significant predictor of overall survival and distant-metastasis free probability. As a consequence of the statistical validation procedure we have a set of centroids which can be applied to any microarray (indexed by UniGene Cluster ID) to classify it to one of the ESR1/ERBB2 subtypes. Moreover, the method used to define the ESR1/ERBB2 subtypes is not specific to the disease. The method can be used to identify subtypes in any disease for which there are at least two independent microarray datasets of disease samples.

Keywords: Algorithms; Breast Neoplasms, classification/genetics/pathology; Female; Gene Expression Profiling, methods/statistics & numerical data; Humans; Multivariate Analysis; Oligonucleotide Array Sequence Analysis, methods/statistics & numerical data; Proportional Hazards Models; Risk Factors; Survival Analysis
[Consortium2006MicroArray] MAQC Consortium, Leming Shi, Laura H Reid, Wendell D Jones, Richard Shippy, Janet A Warrington, Shawn C Baker, Patrick J Collins, Francoise de Longueville, Ernest S Kawasaki, Kathleen Y Lee, Yuling Luo, Yongming Andrew Sun, James C Willey, Robert A Setterquist, Gavin M Fischer, Weida Tong, Yvonne P Dragan, David J Dix, Felix W Frueh, Frederico M Goodsaid, Damir Herman, Roderick V Jensen, Charles D Johnson, Edward K Lobenhofer, Raj K Puri, Uwe Scherf, Jean Thierry-Mieg, Charles Wang, Mike Wilson, Paul K Wolber, Lu Zhang, Shashi Amur, Wenjun Bao, Catalin C Barbacioru, Anne Bergstrom Lucas, Vincent Bertholet, Cecilie Boysen, Bud Bromley, Donna Brown, Alan Brunner, Roger Canales, Xiaoxi Megan Cao, Thomas A Cebula, James J Chen, Jing Cheng, Tzu-Ming Chu, Eugene Chudin, John Corson, J. Christopher Corton, Lisa J Croner, Christopher Davies, Timothy S Davison, Glenda Delenstarr, Xutao Deng, David Dorris, Aron C Eklund, Xiao-hui Fan, Hong Fang, Stephanie Fulmer-Smentek, James C Fuscoe, Kathryn Gallagher, Weigong Ge, Lei Guo, Xu Guo, Janet Hager, Paul K Haje, Jing Han, Tao Han, Heather C Harbottle, Stephen C Harris, Eli Hatchwell, Craig A Hauser, Susan Hester, Huixiao Hong, Patrick Hurban, Scott A Jackson, Hanlee Ji, Charles R Knight, Winston P Kuo, J. Eugene LeClerc, Shawn Levy, Quan-Zhen Li, Chunmei Liu, Ying Liu, Michael J Lombardi, Yunqing Ma, Scott R Magnuson, Botoul Maqsodi, Tim McDaniel, Nan Mei, Ola Myklebost, Baitang Ning, Natalia Novoradovskaya, Michael S Orr, Terry W Osborn, Adam Papallo, Tucker A Patterson, Roger G Perkins, Elizabeth H Peters, Ron Peterson, Kenneth L Philips, P. Scott Pine, Lajos Pusztai, Feng Qian, Hongzu Ren, Mitch Rosen, Barry A Rosenzweig, Raymond R Samaha, Mark Schena, Gary P Schroth, Svetlana Shchegrova, Dave D Smith, Frank Staedtler, Zhenqiang Su, Hongmei Sun, Zoltan Szallasi, Zivana Tezak, Danielle Thierry-Mieg, Karol L Thompson, Irina Tikhonova, Yaron Turpaz, Beena Vallanat, Christophe Van, Stephen J Walker, Sue Jane Wang, Yonghong Wang, Russ Wolfinger, Alex Wong, Jie Wu, Chunlin Xiao, Qian Xie, Jun Xu, Wen Yang, Liang Zhang, Sheng Zhong, Yaping Zong, and William Slikker. The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat. Biotechnol., 24(9):1151-1161, Sep 2006. [ bib | DOI | http ]
Over the last decade, the introduction of microarray technology has had a profound impact on gene expression research. The publication of studies with dissimilar or altogether contradictory results, obtained using different microarray platforms to analyze identical RNA samples, has raised concerns about the reliability of this technology. The MicroArray Quality Control (MAQC) project was initiated to address these concerns, as well as other performance and data analysis issues. Expression data on four titration pools from two distinct reference RNA samples were generated at multiple test sites using a variety of microarray-based and alternative technology platforms. Here we describe the experimental design and probe mapping efforts behind the MAQC project. We show intraplatform consistency across test sites as well as a high level of interplatform concordance in terms of genes identified as differentially expressed. This study provides a resource that represents an important first step toward establishing a framework for the use of microarrays in clinical and regulatory settings.

Keywords: Equipment Design; Equipment Failure Analysis; Gene Expression Profiling, instrumentation/methods; Oligonucleotide Array Sequence Analysis, instrumentation; Quality Assurance, Health Care, methods; Quality Control; Reproducibility of Results; Sensitivity and Specificity; United States
[Bui2006Structural] H.-H. Bui, A. J. Schiewe, H. von Grafenstein, and I. S. Haworth. Structural prediction of peptides binding to MHC class I molecules. Proteins, 63(1):43-52, Apr 2006. [ bib | DOI | http ]
Peptide binding to class I major histocompatibility complex (MHCI) molecules is a key step in the immune response and the structural details of this interaction are of importance in the design of peptide vaccines. Algorithms based on primary sequence have had success in predicting potential antigenic peptides for MHCI, but such algorithms have limited accuracy and provide no structural information. Here, we present an algorithm, PePSSI (peptide-MHC prediction of structure through solvated interfaces), for the prediction of peptide structure when bound to the MHCI molecule, HLA-A2. The algorithm combines sampling of peptide backbone conformations and flexible movement of MHC side chains and is unique among other prediction algorithms in its incorporation of explicit water molecules at the peptide-MHC interface. In an initial test of the algorithm, PePSSI was used to predict the conformation of eight peptides bound to HLA-A2, for which X-ray data are available. Comparison of the predicted and X-ray conformations of these peptides gave RMSD values between 1.301 and 2.475 Å. Binding conformations of 266 peptides with known binding affinities for HLA-A2 were then predicted using PePSSI. Structural analyses of these peptide-HLA-A2 conformations showed that peptide binding affinity is positively correlated with the number of peptide-MHC contacts and negatively correlated with the number of interfacial water molecules. These results are consistent with the relatively hydrophobic binding nature of the HLA-A2 peptide binding interface. In summary, PePSSI is capable of rapid and accurate prediction of peptide-MHC binding conformations, which may in turn allow estimation of MHCI-peptide binding affinity.

Keywords: Algorithms, Amino Acid Sequence, Antigens, Artificial Intelligence, Automated, Binding Sites, Chemical, Computational Biology, Computer Simulation, Crystallography, Electrostatics, Genes, Genetic, HLA Antigens, Histocompatibility Antigens Class I, Humans, Hydrogen Bonding, Ligands, MHC Class I, Major Histocompatibility Complex, Models, Molecular, Molecular Conformation, Molecular Sequence Data, Pattern Recognition, Peptides, Protein, Protein Binding, Protein Conformation, Proteomics, Quantitative Structure-Activity Relationship, Sequence Alignment, Sequence Analysis, Software, Structural Homology, Structure-Activity Relationship, Thermodynamics, Water, X-Ray, X-Rays, 16447245
[Bhavani2006Substructure-based] S. Bhavani, A. Nagargadde, A. Thawani, V. Sridhar, and N. Chandra. Substructure-based support vector machine classifiers for prediction of adverse effects in diverse classes of drugs. J. Chem. Inform. Model., 46(6):2478-2486, 2006. [ bib | DOI | http ]
Unforeseen adverse effects exhibited by drugs contribute heavily to late-phase failure and even withdrawal of marketed drugs. Torsade de pointes (TdP) is one such important adverse effect, which causes cardiac arrhythmia and, in some cases, sudden death, making it crucial for potential drugs to be screened for torsadogenicity. The need to tap the power of computational approaches for the prediction of adverse effects such as TdP is increasingly becoming evident. The availability of screening data including those in organized databases greatly facilitates exploration of newer computational approaches. In this paper, we report the development of a prediction method based on a support vector machine algorithm. The method uses a combination of descriptors, encoding both the type of toxicophore as well as the position of the toxicophore in the drug molecule, thus considering both the pharmacophore and the three-dimensional shape information of the molecule. For delineating toxicophores, a novel pattern-recognition method that utilizes substructures within a molecule has been developed. The results obtained using the hybrid approach have been compared with those available in the literature for the same data set. An improvement in prediction accuracy is clearly seen, with the accuracy reaching up to 97% in predicting compounds that can cause TdP and 90% for predicting compounds that do not cause TdP. The generic nature of the method has been demonstrated with four data sets available for carcinogenicity, where prediction accuracies were significantly higher, with a best receiver operating characteristics (ROC) value of 0.81 as against a best ROC value of 0.7 reported in the literature for the same data set. Thus, the method holds promise for wide applicability in toxicity prediction.

Keywords: Algorithms; Carcinogens; Chemistry, Pharmaceutical; Computational Biology; Drug Evaluation, Preclinical; Drug Industry; Humans; Models, Chemical; Models, Statistical; Neural Networks (Computer); Pattern Recognition, Automated; ROC Curve; Sequence Analysis, Protein; Software; Torsades de Pointes
[Driel2006text-mining] M.A. van Driel, J. Bruggeman, G. Vriend, H.G. Brunner, and J.A.M. Leunissen. A text-mining analysis of the human phenome. Eur. J. Hum. Genet., 14(5):535-542, May 2006. [ bib | DOI | http ]
A number of large-scale efforts are underway to define the relationships between genes and proteins in various species. But, few attempts have been made to systematically classify all such relationships at the phenotype level. Also, it is unknown whether such a phenotype map would carry biologically meaningful information. We have used text mining to classify over 5000 human phenotypes contained in the Online Mendelian Inheritance in Man database. We find that similarity between phenotypes reflects biological modules of interacting functionally related genes. These similarities are positively correlated with a number of measures of gene function, including relatedness at the level of protein sequence, protein motifs, functional annotation, and direct protein-protein interaction. Phenotype grouping reflects the modular nature of human disease genetics. Thus, phenotype mapping may be used to predict candidate genes for diseases as well as functional relations between genes and proteins. Such predictions will further improve if a unified system of phenotype descriptors is developed. The phenotype similarity data are accessible through a web interface at http://www.cmbi.ru.nl/MimMiner/.

Keywords: Chromosome Mapping; Databases, Genetic; Genetic Predisposition to Disease; Genetic Vectors; Genome, Human; Genotype; Humans; Models, Genetic; Models, Statistical; Multigene Family; Phenotype
[Yan2007Determining] Mingjin Yan and Keying Ye. Determining the number of clusters using the weighted gap statistic. Biometrics, 63(4):1031-1037, Dec 2007. [ bib | DOI | http ]
Estimating the number of clusters in a data set is a crucial step in cluster analysis. In this article, motivated by the gap method (Tibshirani, Walther, and Hastie, 2001, Journal of the Royal Statistical Society B63, 411-423), we propose the weighted gap and the difference of difference-weighted (DD-weighted) gap methods for estimating the number of clusters in data using the weighted within-clusters sum of errors: a measure of the within-clusters homogeneity. In addition, we propose a "multilayer" clustering approach, which is shown to be more accurate than the original gap method, particularly in detecting the nested cluster structure of the data. The methods are applicable when the input data contain continuous measurements and can be used with any clustering method. Simulation studies and real data are investigated and compared among these proposed methods as well as with the original gap method.

Keywords: Algorithms; Biometry, methods; Cluster Analysis; Computer Simulation; Data Interpretation, Statistical; Models, Biological; Models, Statistical; Pattern Recognition, Automated, methods
[Rhodes2007Oncomine] Daniel R. Rhodes, Shanker Kalyana-Sundaram, Vasudeva Mahavisno, Radhika Varambally, Jianjun Yu, Benjamin B. Briggs, Terrence R. Barrette, Matthew J. Anstet, Colleen Kincead-Beal, Prakash Kulkarni, Sooryanaryana Varambally, Debashis Ghosh, and Arul M. Chinnaiyan. Oncomine 3.0: genes, pathways, and networks in a collection of 18,000 cancer gene expression profiles. Neoplasia, 9(2):166-180, Feb 2007. [ bib ]
DNA microarrays have been widely applied to cancer transcriptome analysis; however, the majority of such data are not easily accessible or comparable. Furthermore, several important analytic approaches have been applied to microarray analysis; however, their application is often limited. To overcome these limitations, we have developed Oncomine, a bioinformatics initiative aimed at collecting, standardizing, analyzing, and delivering cancer transcriptome data to the biomedical research community. Our analysis has identified the genes, pathways, and networks deregulated across 18,000 cancer gene expression microarrays, spanning the majority of cancer types and subtypes. Here, we provide an update on the initiative, describe the database and analysis modules, and highlight several notable observations. Results from this comprehensive analysis are available at http://www.oncomine.org.

Keywords: Antineoplastic Agents, pharmacology; Automatic Data Processing; Chromosome Mapping; Chromosomes, Human, genetics; Computational Biology, organization & administration; Data Collection; Data Display; Data Interpretation, Statistical; Databases, Genetic; Drug Design; Gene Expression Profiling, statistics & numerical data; Gene Expression Regulation, Neoplastic; Genes, Neoplasm; Humans; Internet; Models, Biological; Neoplasm Proteins, biosynthesis/chemistry/genetics; Neoplasms, classification/genetics/metabolism; Oligonucleotide Array Sequence Analysis; Subtraction Technique; Transcription, Genetic
[Kahraman2007Shape] A. Kahraman, R. J. Morris, R. A. Laskowski, and J. M. Thornton. Shape variation in protein binding pockets and their ligands. J. Mol. Biol., 368(1):283-301, Apr 2007. [ bib | DOI | http ]
A common assumption about the shape of protein binding pockets is that they are related to the shape of the small ligand molecules that can bind there. But to what extent is that assumption true? Here we use a recently developed shape matching method to compare the shapes of protein binding pockets to the shapes of their ligands. We find that pockets binding the same ligand show greater variation in their shapes than can be accounted for by the conformational variability of the ligand. This suggests that geometrical complementarity in general is not sufficient to drive molecular recognition. Nevertheless, we show when considering only shape and size that a significant proportion of the recognition power of a binding pocket for its ligand resides in its shape. Additionally, we observe a "buffer zone" or a region of free space between the ligand and protein, which results in binding pockets being on average three times larger than the ligand that they bind.

Keywords: Binding Sites; Computer Simulation; Ligands; Models, Molecular; Models, Statistical; Protein Binding; Protein Conformation; Protein Folding
[Johnson2007Adjusting] W. Evan Johnson, Cheng Li, and Ariel Rabinovic. Adjusting batch effects in microarray expression data using empirical bayes methods. Biostatistics, 8(1):118-127, Jan 2007. [ bib | DOI | http ]
Non-biological experimental variation or "batch effects" are commonly observed across multiple batches of microarray experiments, often rendering the task of combining data from these batches difficult. The ability to combine microarray data sets is advantageous to researchers to increase statistical power to detect biological phenomena from studies where logistical considerations restrict sample size or in studies that require the sequential hybridization of arrays. In general, it is inappropriate to combine data sets without adjusting for batch effects. Methods have been proposed to filter batch effects from data, but these are often complicated and require large batch sizes ( > 25) to implement. Because the majority of microarray studies are conducted using much smaller sample sizes, existing methods are not sufficient. We propose parametric and non-parametric empirical Bayes frameworks for adjusting data for batch effects that is robust to outliers in small sample sizes and performs comparable to existing methods for large samples. We illustrate our methods using two example data sets and show that our methods are justifiable, easy to apply, and useful in practice. Software for our method is freely available at: http://biosun1.harvard.edu/complab/batch/.

Keywords: Bayes Theorem; Data Interpretation, Statistical; Gene Expression Profiling, methods; Humans; Oligonucleotide Array Sequence Analysis, methods
[Xie2009Unified] Lei Xie, Li Xie, and Philip E Bourne. A unified statistical model to support local sequence order independent similarity searching for ligand-binding sites and its application to genome-based drug discovery. Bioinformatics, 25(12):i305-i312, Jun 2009. [ bib | DOI | http ]
Functional relationships between proteins that do not share global structure similarity can be established by detecting their ligand-binding-site similarity. For a large-scale comparison, it is critical to accurately and efficiently assess the statistical significance of this similarity. Here, we report an efficient statistical model that supports local sequence order independent ligand-binding-site similarity searching. Most existing statistical models only take into account the matching vertices between two sites that are defined by a fixed number of points. In reality, the boundary of the binding site is not known or is dependent on the bound ligand making these approaches limited. To address these shortcomings and to perform binding-site mapping on a genome-wide scale, we developed a sequence-order independent profile-profile alignment (SOIPPA) algorithm that is able to detect local similarity between unknown binding sites a priori. The SOIPPA scoring integrates geometric, evolutionary and physical information into a unified framework. However, this imposes a significant challenge in assessing the statistical significance of the similarity because the conventional probability model that is based on fixed-point matching cannot be applied. Here we find that scores for binding-site matching by SOIPPA follow an extreme value distribution (EVD). Benchmark studies show that the EVD model performs at least two-orders faster and is more accurate than the non-parametric statistical method in the previous SOIPPA version. Efficient statistical analysis makes it possible to apply SOIPPA to genome-based drug discovery. Consequently, we have applied the approach to the structural genome of Mycobacterium tuberculosis to construct a protein-ligand interaction network. The network reveals highly connected proteins, which represent suitable targets for promiscuous drugs.

Keywords: Binding Sites; Computational Biology, methods; Drug Discovery, methods; Genome; Ligands; Models, Statistical; Mycobacterium tuberculosis, genetics/metabolism; Proteins, chemistry
[Parkhomenko2009Sparse] E. Parkhomenko, D. Tritchler, and J. Beyene. Sparse canonical correlation analysis with application to genomic data integration. Stat Appl Genet Mol Biol, 8(1):Article 1, Jan 2009. [ bib | DOI | http ]
Large scale genomic studies with multiple phenotypic or genotypic measures may require the identification of complex multivariate relationships. In multivariate analysis a common way to inspect the relationship between two sets of variables based on their correlation is canonical correlation analysis, which determines linear combinations of all variables of each type with maximal correlation between the two linear combinations. However, in high dimensional data analysis, when the number of variables under consideration exceeds tens of thousands, linear combinations of the entire sets of features may lack biological plausibility and interpretability. In addition, insufficient sample size may lead to computational problems, inaccurate estimates of parameters and non-generalizable results. These problems may be solved by selecting sparse subsets of variables, i.e. obtaining sparse loadings in the linear combinations of variables of each type. In this paper we present Sparse Canonical Correlation Analysis (SCCA) which examines the relationships between two types of variables and provides sparse solutions that include only small subsets of variables of each type by maximizing the correlation between the subsets of variables of different types while performing variable selection. We also present an extension of SCCA, adaptive SCCA. We evaluate their properties using simulated data and illustrate practical use by applying both methods to the study of natural variation in human gene expression.

Keywords: Algorithms; Genomics, statistics & numerical data; Humans; Models, Statistical; Sample Size
[Ioannidis2009Repeatability] John P A Ioannidis, David B Allison, Catherine A Ball, Issa Coulibaly, Xiangqin Cui, Aedín C Culhane, Mario Falchi, Cesare Furlanello, Laurence Game, Giuseppe Jurman, Jon Mangion, Tapan Mehta, Michael Nitzberg, Grier P Page, Enrico Petretto, and Vera van Noort. Repeatability of published microarray gene expression analyses. Nat Genet, 41(2):149-155, Feb 2009. [ bib | DOI | http ]
Given the complexity of microarray-based gene expression studies, guidelines encourage transparent design and public data availability. Several journals require public data deposition and several public databases exist. However, not all data are publicly available, and even when available, it is unknown whether the published results are reproducible by independent scientists. Here we evaluated the replication of data analyses in 18 articles on microarray-based gene expression profiling published in Nature Genetics in 2005-2006. One table or figure from each article was independently evaluated by two teams of analysts. We reproduced two analyses in principle and six partially or with some discrepancies; ten could not be reproduced. The main reason for failure to reproduce was data unavailability, and discrepancies were mostly due to incomplete data annotation or specification of data processing and analysis. Repeatability of published microarray studies is apparently limited. More strict publication rules enforcing public data availability and explicit description of data processing and analysis should be considered.

Keywords: Animals; Data Interpretation, Statistical; Databases, Genetic; Gene Expression Profiling, standards; Genome-Wide Association Study, standards; Humans; Oligonucleotide Array Sequence Analysis, standards; Peer Review, Research; Publications, standards; Reproducibility of Results
[Tayrac2009Simultaneous] M. de Tayrac, S. Lê, M. Aubry, J. Mosser, and F. Husson. Simultaneous analysis of distinct omics data sets with integration of biological knowledge: Multiple factor analysis approach. BMC Genomics, 10:32, 2009. [ bib | DOI | http ]
Genomic analysis will greatly benefit from considering in a global way various sources of molecular data with the related biological knowledge. It is thus of great importance to provide useful integrative approaches dedicated to ease the interpretation of microarray data. Here, we introduce a data-mining approach, Multiple Factor Analysis (MFA), to combine multiple data sets and to add formalized knowledge. MFA is used to jointly analyse the structure emerging from genomic and transcriptomic data sets. The common structures are underlined and graphical outputs are provided such that biological meaning becomes easily retrievable. Gene Ontology terms are used to build gene modules that are superimposed on the experimentally interpreted plots. Functional interpretations are then supported by a step-by-step sequence of graphical representations. When applied to genomic and transcriptomic data and associated Gene Ontology annotations, our method prioritizes the biological processes linked to the experimental settings. Furthermore, it reduces the time and effort to analyze large amounts of 'Omics' data.

Keywords: Animals; Comparative Genomic Hybridization; Factor Analysis, Statistical; Gene Expression Profiling, methods; Genomics, methods; Glioma, genetics; Humans; Mice; Models, Biological; Oligonucleotide Array Sequence Analysis, methods
[Vanunu2010Associating] O. Vanunu, O. Magger, E. Ruppin, T. Shlomi, and R. Sharan. Associating genes and protein complexes with disease via network propagation. PLoS Comput. Biol., 6(1):e1000641, Jan 2010. [ bib | DOI | http ]
A fundamental challenge in human health is the identification of disease-causing genes. Recently, several studies have tackled this challenge via a network-based approach, motivated by the observation that genes causing the same or similar diseases tend to lie close to one another in a network of protein-protein or functional interactions. However, most of these approaches use only local network information in the inference process and are restricted to inferring single gene associations. Here, we provide a global, network-based method for prioritizing disease genes and inferring protein complex associations, which we call PRINCE. The method is based on formulating constraints on the prioritization function that relate to its smoothness over the network and usage of prior information. We exploit this function to predict not only genes but also protein complex associations with a disease of interest. We test our method on gene-disease association data, evaluating both the prioritization achieved and the protein complexes inferred. We show that our method outperforms extant approaches in both tasks. Using data on 1,369 diseases from the OMIM knowledgebase, our method is able (in a cross validation setting) to rank the true causal gene first for 34% of the diseases, and infer 139 disease-related complexes that are highly coherent in terms of the function, expression and conservation of their member proteins. Importantly, we apply our method to study three multi-factorial diseases for which some causal genes have been found already: prostate cancer, Alzheimer's disease and type 2 diabetes mellitus. PRINCE's predictions for these diseases highly match the known literature, suggesting several novel causal genes and protein complexes for further investigation.

Keywords: Algorithms; Alzheimer Disease; Databases, Genetic; Diabetes Mellitus; Disease; Genes; Humans; Male; Multiprotein Complexes; Prostatic Neoplasms; Protein Interaction Mapping; Proteins; Reproducibility of Results
[Markowetz2010How] Florian Markowetz. How to understand the cell by breaking it: network analysis of gene perturbation screens. PLoS Comput Biol, 6(2):e1000655, 2010. [ bib | DOI | http ]
Keywords: Animals; Cell Physiological Processes; Cluster Analysis; Gene Regulatory Networks; Genomics; Humans; Models, Genetic; Models, Statistical; Phenotype; Signal Transduction; Systems Biology
[Chen2011Removing] Chao Chen, Kay Grennan, Judith Badner, Dandan Zhang, Elliot Gershon, Li Jin, and Chunyu Liu. Removing batch effects in analysis of expression microarray data: an evaluation of six batch adjustment methods. PLoS One, 6(2):e17238, 2011. [ bib | DOI | http ]
The expression microarray is a frequently used approach to study gene expression on a genome-wide scale. However, the data produced by the thousands of microarray studies published annually are confounded by "batch effects," the systematic error introduced when samples are processed in multiple batches. Although batch effects can be reduced by careful experimental design, they cannot be eliminated unless the whole study is done in a single batch. A number of programs are now available to adjust microarray data for batch effects prior to analysis. We systematically evaluated six of these programs using multiple measures of precision, accuracy and overall performance. ComBat, an Empirical Bayes method, outperformed the other five programs by most metrics. We also showed that it is essential to standardize expression data at the probe level when testing for correlation of expression profiles, due to a sizeable probe effect in microarray data that can inflate the correlation among replicates and unrelated samples.

Keywords: Bayes Theorem; Case-Control Studies; Data Interpretation, Statistical; Gene Expression Profiling, standards/statistics & numerical data; Humans; Microarray Analysis, standards/statistics & numerical data; ROC Curve; Reference Standards; Research Design; Sample Size; Selection Bias; Validation Studies as Topic
