References

[CatoniGibbs] O. Catoni. Gibbs estimators. Revised version. [ bib | .dvi | .ps ]
[Raliou2010Human] Mariam Raliou, Marta Grauso, Brice Hoffmann, Claude Nespoulous, Hélène Debat, Bano Singh, Didier Trotier, Jean-Claude Pernollet, Jean-Pierre Montmayeur, Annick Faurion, and Loïc Briand. Human genetic polymorphisms in T1R1 and T1R3 affect their function. Chemical Senses, In Press. [ bib ]
[Degroot1970Optimal] Morris H. De Groot. Optimal statistical decisions. McGraw-Hill, New York, 1970. [ bib ]
[Cayley1859Theory] A. Cayley. On the theory of the analytical forms called trees. Philos. Mag., 37(18):374-378, 1859. [ bib ]
[Galton1869Hereditary] S.F. Galton. Hereditary genius. Macmillan and Company, 1869. [ bib ]
[Cayley1874Mathematical] A. Cayley. On the mathematical theory of isomers. Philos. Mag., 10(47):444-446, 1874. [ bib ]
[Cayley1875Theorychemical] A. Cayley. On the theory of the analytical forms called trees, with application to the theory of chemical combinations. Rep. Brit. Assoc. Sci., 4(45):257-305, 1875. [ bib ]
[Cayley1877Number] A. Cayley. On the number of univalent radicals C_nH_{2n+1}. Philos. Mag., 18(3):34-35, 1877. [ bib ]
[Sylvester1878Chemistry] J. J. Sylvester. Chemistry and algebra. Nature, 17(432), 1878. [ bib ]
[Clifford1878Note] W. K. Clifford. Note on quantics of alternate numbers, used as a means for determining the invariants and covariants of quantics in general. Proc. London Math. Soc., 10(9):258-265, 1878. [ bib ]
[Clifford1878Binary] W. K. Clifford. Binary forms of alternate variables. Proc. London Math. Soc., 10(9):277-286, 1878. [ bib ]
[Pearson1901On] K. Pearson. On lines and planes of closest fit to systems of points in space. Philos. Mag., 2(6):559-572, 1901. [ bib ]
Keywords: pca
[Hadamard1902Sur] J. Hadamard. Sur les problèmes aux dérivées partielles et leur signification physique. Princeton University Bulletin, 13:49-52, 1902. [ bib ]
[Hadamard1923Lectures] J. Hadamard. Lectures on Cauchy's Problem: In Linear Partial Differential Equations. Dover Publications, 1923. [ bib ]
[Redfield1927Theory] J. H. Redfield. The theory of group-reduced distributions. Amer. J. Math., 49:433-455, 1927. [ bib ]
[Lunn1929Isomerism] A. C. Lunn and J. K. Senior. Isomerism and configuration. J. Phys. Chem., 33:1027-1079, 1929. [ bib ]
[Whitney1930Nonseparable] H. Whitney. Non-separable and planar graphs. Proc. Natl. Acad. Sci., 93:415-443, 1930. [ bib ]
[Whitney1932Congruent] H. Whitney. Congruent graphs and the connectivity of graphs. Amer. J. Math., 54:150-168, 1932. [ bib ]
[Polya1934] G. Pólya. Algebraische Berechnung der Anzahl der Isomeren einiger organischer Verbindungen. Z. Kristal., 93:415-443, 1936. [ bib ]
[Hotelling1936Relation] H. Hotelling. Relation between two sets of variates. Biometrika, 28:322-377, 1936. [ bib | http | .pdf ]
[Schoenberg1938Metric] I. J. Schoenberg. Metric spaces and positive definite functions. Trans. Am. Math. Soc., 44(3):522-536, 1938. [ bib | .pdf ]
[Tikhonov1943stability] A.N. Tikhonov. On the stability of inverse problems. Doklady Akademii nauk SSSR, 39(5):195-198, 1943. [ bib ]
[Schrodinger1944Vie] Erwin Schrödinger. Qu'est-ce que la vie? Christian Bourgois Editeur, 1986, 1944. [ bib ]
Keywords: csbcbook
[Wilcoxon1945Individual] F. Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80-83, 1945. [ bib ]
[Mann1947test] H.B. Mann and D.R. Whitney. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, 18(1):50-60, 1947. [ bib ]
[Fisher1950Gene] R.A. Fisher. Gene frequencies in a cline determined by selection and diffusion. Biometrics, 6(4):353-361, 1950. [ bib ]
[Aronszajn1950Theory] N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337-404, 1950. [ bib | .pdf ]
Keywords: kernel-theory
[Markowitz1952Portfolio] H. Markowitz. Portfolio selection. The Journal of Finance, 7(1):77-91, March 1952. [ bib ]
[Watson1953Structure] J. D. Watson and F. H. C. Crick. A Structure for Deoxyribose Nucleic Acid. Nature, 171:737, 1953. [ bib | .html | .pdf ]
Keywords: bio
[Kuhn1955Hungarian] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83-97, 1955. [ bib ]
[Frank1956Algorithm] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95-110, 1956. [ bib ]
[Ray1957Finding] L. C. Ray and R. A. Kirsch. Finding chemical records by digital computers. Science, 126:814, 1957. [ bib ]
[Gotusso1957Fortran] L. Gotusso and A. T. Santolini. A Fortran IV quasi decision algorithm for the p-equivalence of two matrices. Calcolo, 5:17-35, 1957. [ bib ]
[Berge1959Espaces] C. Berge. Espaces topologiques et fonctions multivoques. Dunod, Paris, 1959. [ bib ]
[Bellman1959adaptive] R. Bellman and R. Kalaba. On adaptive control processes. Automatic Control, IRE Transactions on, 4(2):1-9, 1959. [ bib ]
[Zellner1962An] Arnold Zellner. An efficient method of estimating seemingly unrelated regressions and tests for aggregation bias. J. Am. Stat. Assoc., 57(298):348-368, 1962. [ bib | DOI | http ]
In this paper a method of estimating the parameters of a set of regression equations is reported which involves application of Aitken's generalized least-squares [1] to the whole system of equations. Under conditions generally encountered in practice, it is found that the regression coefficient estimators so obtained are at least asymptotically more efficient than those obtained by an equation-by-equation application of least squares. This gain in efficiency can be quite large if "independent" variables in different equations are not highly correlated and if disturbance terms in different equations are highly correlated. Further, tests of the hypothesis that all regression equation coefficient vectors are equal, based on "micro" and "macro" data, are described. If this hypothesis is accepted, there will be no aggregation bias. Finally, the estimation procedure and the "micro-test" for aggregation bias are applied in the analysis of annual investment data, 1935-1954, for two firms.

Keywords: 480-3, econometrics, sur
[Philips1962A] D. L. Phillips. A technique for the numerical solution of certain integral equations of the first kind. J. Assoc. Comput. Mach., 9:84-97, 1962. [ bib ]
[Hoerl1962Application] A. E. Hoerl. Application of ridge regression analysis to regression problems. Chemical Engineering Progress, 58:54-59, 1962. [ bib ]
[Tikhonov1963Solution] A.N. Tikhonov. Solution of incorrectly formulated problems and the regularization method. Soviet Mathematics Doklady, 4:1035-1038, 1963. [ bib ]
[Sussenguth1963Graph] E. H. Sussenguth. A graph-theoretic algorithm for matching chemical structures. J. Chem. Doc., 5(1):36-43, 1963. [ bib ]
[moreau1963inf] JJ Moreau. Inf-convolution des fonctions numeriques sur un espace vectoriel proximite et dualite dans un espace hilbertien. Comptes Rendus de l’Academie des Sciences de Paris, 256:125-129, 1963. [ bib ]
[Huber1964Robust] P. J. Huber. Robust estimation of a location parameter. Ann. Math. Statist., 35(1):73-101, 1964. [ bib | DOI | http | .pdf ]
[Hansch1964method] C. Hansch and T. Fujita. A method for the correlation of biological activity and chemical structure. J. Am. Chem. Soc., 86:1616-1626, 1964. [ bib ]
[Aizerman1964Theoretical] M. A. Aizerman, E. M. Braverman, and L. I. Rozonoér. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821-837, 1964. [ bib ]
[Morgan1965Generation] H.L. Morgan. The Generation of Unique Machine Description for Chemical Structures - A Technique Developed at Chemical Abstracts Service. J Chem Doc, 5:107-113, 1965. [ bib ]
[Moreau1965Proximite] J.-J. Moreau. Proximité et dualité dans un espace hilbertien. Bulletin de la S.M.F., 93:273-299, 1965. [ bib ]
[MacQueen1967Some] J.B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, pages 281-297. University of California Press, 1967. [ bib ]
[Hansch1968Linear] C. Hansch, J. E. Quinlan, and G. L. Lawrence. Linear free-energy relationship between partition coefficients and the aqueous solubility of organic liquids. J. Org. Chem., 33:347-350, 1968. [ bib ]
[Milnor1969Topology] J.W. Milnor. Topology from the Differentiable Viewpoint. Univ. Press of Virginia, 1969. [ bib ]
[Needleman1970general] S.B. Needleman and C.D. Wunsch. A general method applicable to the search for similarities in the amino acid sequences of two proteins. J. Mol. Biol., 48:443-453, 1970. [ bib ]
[Hoerl1970Ridge] A. E. Hoerl and R. W. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1):55-67, 1970. [ bib ]
[Crick1970Central] F. Crick. Central dogma of molecular biology. Nature, 227:561-563, 1970. [ bib | .pdf | .pdf ]
The central dogma of molecular biology deals with the detailed residue-by-residue transfer of sequential information. It states that such information cannot be transferred from protein to either protein or nucleic acid.

Keywords: csbcbook
[Knudson1971Mutation] Alfred G. Knudson. Mutation and cancer: Statistical study of retinoblastoma. Proceedings of the National Academy of Sciences, 68:820-823, 1971. [ bib ]
Keywords: csbcbook
[Kimeldorf1971Some] G. S. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. J. Math. Anal. Appl., 33:82-95, 1971. [ bib ]
[Gower1971general] J. C. Gower. A general coefficient of similarity and some of its properties. Biometrics, 27(4):857-871, 1971. [ bib | .pdf ]
[Baranger1971Matrices] J. Baranger and M. Duc-Jacquet. Matrices tridiagonales symétriques et matrices factorisables. Revue française d'informatique et de recherche opérationnelle, série rouge, 5(3):61-66, 1971. [ bib | http | .pdf ]
[Dempster1972Covariance] A. P. Dempster. Covariance selection. Biometrics, 28:157-175, 1972. [ bib | http ]
[Viterbi1973Error] A. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inform. Theory, 13(2):260-269, 1973. [ bib | http ]
The probability of error in decoding an optimal convolutional code transmitted over a memoryless channel is bounded from above and below as a function of the constraint length of the code. For all but pathological channels the bounds are asymptotically (exponentially) tight for rates above R_0, the computational cutoff rate of sequential decoding. As a function of constraint length the performance of optimal convolutional codes is shown to be superior to that of block codes of the same length, the relative improvement increasing with rate. The upper bound is obtained for a specific probabilistic nonsequential decoding algorithm which is shown to be asymptotically optimum for rates above R_0 and whose performance bears certain similarities to that of sequential decoding algorithms.

[Mallows1973Some] C. L. Mallows. Some comments on Cp. Technometrics, 15:661-675, 1973. [ bib | .pdf ]
Keywords: criteria, model, selection
[Lin73Effective] S. Lin and B. W. Kernighan. An effective heuristic algorithm for the traveling-salesman problem. Operations Res., 21:498-516, 1973. [ bib ]
[Forney1973Viterbi] G. D. Forney. The Viterbi algorithm. Proc. IEEE, 61, 1973. [ bib | .pdf ]
[Davisson1973Universal] L. Davisson. Universal noiseless coding. IEEE Trans. Inform. Theory, 19(6):783-795, Nov 1973. [ bib | .pdf ]
Universal coding is any asymptotically optimum method of block-to-block memoryless source coding for sources with unknown parameters. This paper considers noiseless coding for such sources, primarily in terms of variable-length coding, with performance measured as a function of the coding redundancy relative to the per-letter conditional source entropy given the unknown parameter. It is found that universal (i.e., zero redundancy) coding in a weighted sense is possible if and only if the per-letter average mutual information between the parameter space and the message space is zero. Universal coding is possible in a maximin sense if and only if the channel capacity between the two spaces is zero. Universal coding is possible in a minimax sense if and only if a probability mass function exists, independent of the unknown parameter, for which the relative entropy of the known conditional-probability mass-function is zero. Several examples are given to illustrate the ideas. Particular attention is given to sources that are stationary and ergodic for any fixed parameter although the whole ensemble is not. For such sources, weighted universal codes always exist if the alphabet is finite, or more generally if the entropy is finite. Minimax universal codes result if an additional entropy stability constraint is applied. A discussion of fixed-rate universal coding is also given briefly with performance measured by a probability of error.

Keywords: information-theory universal-coding
[Akaike1973Information] Hirotugu Akaike. Information theory and an extension of the maximum likelihood principle. In Petrov B. N. and Csaki F., editors, Proc. of the 2nd Int. Symp. on Information Theory, pages 267-281, 1973. [ bib ]
The problem of estimating the dimensionality of a model occurs in various forms in applied statistics. There is estimating the number of factor in factor analysis, estimating the degree of a polynomial describing the data, selecting the variables to

Keywords: conf, Akaike Information Criterion, criteria, AIC, model, modelling, parameters, complexity, overfitting, c1973, c197x, c19xx
[Wellings1973origin] S. R. Wellings and H. M. Jensen. On the origin and progression of ductal carcinoma in the human breast. J. Natl. Cancer Inst., 50(5):1111-1118, May 1973. [ bib ]
Keywords: breastcancer
[Vapnik1974Theory] V.N. Vapnik and A. Ya. Chervonenkis. Teoriya raspoznavaniya obrazov: Statisticheskie problemy obucheniya. (Russian) [Theory of Pattern Recognition: Statistical Problems of Learning]. Moscow: Nauka, 1974. [ bib ]
[Martin1974Discriminant] Y. C. Martin, J. B. Holland, C. H. Jarboe, and N. Plotnikoff. Discriminant analysis of the relationship between physical properties and the inhibition of monoamine oxidase by aminotetralins and aminoindans. J. Med. Chem., 17(4):409-413, Apr 1974. [ bib ]
Keywords: Animals, Brain, Dihydroxyphenylalanine, Dose-Response Relationship, Drug, Drug Synergism, Indenes, Mathematics, Mice, Monoamine Oxidase Inhibitors, Naphthalenes, Oxidation-Reduction, Oxotremorine, Reserpine, Structure-Activity Relationship, Tryptamines, 4830537
[Furnival1974Regressions] G.M. Furnival and R.W. Wilson. Regressions by leaps and bounds. Technometrics, 16(4):499-511, 1974. [ bib ]
[Calinski1974dendrite] R. B. Calinski and J. Harabasz. A dendrite method for cluster analysis. Communs Statist., 3:1-27, 1974. [ bib ]
[Southern1975Detection] E. M. Southern. Detection of specific sequences among DNA fragments separated by gel electrophoresis. J. Mol. Biol., 98:503-517, 1975. [ bib ]
[Hartigan1975Clustering] J. Hartigan. Clustering algorithms. Wiley, New York, 1975. [ bib ]
[Ullman1976Algorithm] J. R. Ullmann. An algorithm for subgraph isomorphism. J. ACM, 23(1):31-42, 1976. [ bib | DOI ]
[Schmidt1976Fast] D. C. Schmidt and L. E. Druffel. A fast backtracking algorithm to test directed graphs for isomorphism using distance matrices. J. ACM, 23(3):433-445, 1976. [ bib | DOI ]
[Ivanov1976theory] V.V. Ivanov. The theory of approximate methods and their application to the numerical solution of singular integral equations. Nordhoff International, Leiden, 1976. [ bib ]
[Bondy1976Graph] J. A. Bondy and U. S. R. Murty. Graph theory with applications. Macmillan Press Ltd., 1976. [ bib ]
[Biggs1976Graph] N. L. Biggs, E.K. Lloyd, and R. J. Wilson. Graph theory 1736-1936. Oxford University Press, 1976. [ bib ]
[Tikhonov1977Solutions] A.N. Tikhonov and V.Y. Arsenin. Solutions of ill-posed problems. W.H. Winston, Washington, D.C., 1977. [ bib ]
[Stone1977Consistent] C.J. Stone. Consistent nonparametric regression. Ann. Stat., 8:1348-1360, 1977. [ bib | http | .pdf ]
[Read1977Graph] R. C. Read and D. G. Corneil. The graph isomorphism disease. J. Graph Theor., 1(4):339-363, 1977. [ bib ]
[Narendra1977branch] P.M. Narendra and K. Fukunaga. A branch and bound algorithm for feature subset selection. Computers, IEEE Transactions on, 100(9):917-922, 1977. [ bib ]
[Bernstein1977Protein] F. C. Bernstein, T. F. Koetzle, G. J. Williams, E. F. Meyer, M. D. Brice, J. R. Rodgers, O. Kennard, T. Shimanouchi, and M. Tasumi. The protein data bank: a computer-based archival file for macromolecular structures. J. Mol. Biol., 112(3):535-542, May 1977. [ bib ]
Keywords: Computers; Great Britain; Information Systems; Japan; Protein Conformation; Proteins; United States
[Ziv1978Compression] J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. IEEE Trans. Inform. Theory, 24(5):530-536, Sep 1978. [ bib | .pdf ]
[Schwarz1978Estimating] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461-464, 1978. [ bib ]
[Kieffer1978unified] J. Kieffer. A unified approach to weak universal source coding. IEEE Trans. Inform. Theory, 24(6):674-682, Nov 1978. [ bib | .pdf ]
A new method of constructing a universal sequence of block codes for coding a class of ergodic sources is given. With this method, a weakly universal sequence of codes is constructed for variable-rate noiseless coding and for fixed- and variable-rate coding with respect to a fidelity criterion. In this way a unified approach to weak universal block source coding is obtained. For the noiseless variable-rate coding and the fixed-rate coding with respect to a fidelity criterion, the assumptions made on the alphabets, distortion measures, and class of sources are both necessary and sufficient. For fixed-rate coding with respect to a fidelity criterion, the sample distortion of the universal code sequence converges in L^1 norm for each source to the optimum distortion for that source. For both variable-rate noiseless coding and variable-rate coding with respect to a fidelity criterion, the sample rate of the universal code sequence converges in L^1 norm for each source to the optimum rate for that source. Using this fact, a universal sequence of codes for fixed-rate noiseless coding is obtained. Some applications to stationary nonergodic sources are also considered. The results of Davisson, Ziv, Neuhoff, Gray, Pursley, and Mackenthun are extended.

Keywords: universal-coding information-theory
[Green1978Conjoint] Paul E. Green and V. Srinivasan. Conjoint analysis in consumer research: Issues and outlook. The Journal of Consumer Research, 5(2):103-123, 1978. [ bib | http ]
Since 1971 conjoint analysis has been applied to a wide variety of problems in consumer research. This paper discusses various issues involved in implementing conjoint analysis and describes some new technical developments and application areas for the methodology.

Keywords: conjoint_analysis
[Csorgo1978Strong] M. Csorgo and P. Revesz. Strong Approximations of the Quantile Process. Ann. Stat., 6(4):882-894, July 1978. [ bib | http | .pdf ]
[Levine1979Review] H. A. Levine. Review of: Solutions of ill-posed problems. Bull. Amer. Math. Soc., 1:521-524, 1979. [ bib ]
[Holm1979simple] S. Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65-70, 1979. [ bib ]
[Hartigan1979A] J. A. Hartigan and M. A. Wong. A K-means clustering algorithm. Applied Statistics, 28:100-108, 1979. [ bib ]
[Gati1979Further] G. Gati. Further annotated bibliography on the isomorphism disease. J. Graph Theor., 3:95-109, 1979. [ bib ]
[Garey1979Computer] M. R. Garey and D. S. Johnson. Computers and intractability: A guide to the theory of NP-completeness. San Francisco, CA: W. H. Freeman, 1979. [ bib ]
[Efron1979Bootstrap] B. Efron. Bootstrap methods: another look at the jackknife. Ann. Stat., 7(1):1-26, 1979. [ bib | .pdf ]
[Srivastava1979Estimation] V. K. Srivastava and T. D. Dwivedi. Estimation of seemingly unrelated regression equations: A brief survey. Journal of Econometrics, 10(1):15-32, April 1979. [ bib | .html ]
[Tsai1979Error] W.H. Tsai and K.S. Fu. Error-correcting isomorphisms of attributed relational graphs for pattern analysis. SMC, 9(12):757-768, December 1979. [ bib ]
[Moreau1980Autocorrelation] G. Moreau and P. Broto. Autocorrelation of molecular structures: Application to SAR studies. Nouv. J. Chim., 757-764, 1980. [ bib ]
Keywords: chemoinformatics
[Kumagai1980An] S. Kumagai. An implicit function theorem: Comment. Journal of Optimization Theory and Applications, 31:285-288, Jun 1980. [ bib ]
[Carbo1980How] R. Carbó, L. Leyda, and M. Arnau. How similar is a molecule to another - an electron-density measure of similarity between 2 molecular structures. Int. J. Quantum Chem., 17:1185-1189, 1980. [ bib ]
Keywords: chemoinformatics
[Brown1980Adaptive] P. J. Brown and J. V. Zidek. Adaptive multivariate ridge regression. Ann. Statist., 8(1):64-74, 1980. [ bib ]
[Weisberg1981Applied] S. Weisberg. Applied linear regression. Wiley, New York, 1981. [ bib ]
[Vostrikova1981Detection] L. J. Vostrikova. Detection of disorder in multidimensional stochastic processes. Soviet Mathematics Doklady, 24:55-59, 1981. [ bib ]
[Smith1981Identification] T. Smith and M. Waterman. Identification of common molecular subsequences. J. Mol. Biol., 147:195-197, 1981. [ bib ]
[Rissanen1981Universal] J. Rissanen and G. Langdon, Jr. Universal modeling and coding. IEEE Trans. Inform. Theory, 27(1):12-23, Jan 1981. [ bib | .pdf ]
Keywords: information-theory source-coding
[Krichevsky1981performance] R. Krichevsky and V. Trofimov. The performance of universal encoding. IEEE Trans. Inform. Theory, 27(2):199-207, Mar 1981. [ bib | .pdf ]
Universal coding theory is surveyed from the viewpoint of the interplay between delay and redundancy. The price for universality turns out to be acceptably small.

Keywords: information-theory source-coding
[Fischler1981Random] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. of the ACM, 24(6):381-395, 1981. [ bib ]
A new paradigm, Random Sample Consensus (RANSAC), for fitting a model to experimental data is introduced. RANSAC is capable of interpreting/smoothing data containing a significant percentage of gross errors, and is thus ideally suited for applications in automated image analysis where interpretation is based on the data provided by error-prone feature detectors. A major portion of this paper describes the application of RANSAC to the Location Determination Problem (LDP): Given an image depicting a set of landmarks with known locations, determine that point in space from which the image was obtained. In response to a RANSAC requirement, new results are derived on the minimum number of landmarks needed to obtain a solution, and algorithms are presented for computing these minimum-landmark solutions in closed form. These results provide the basis for an automatic system that can solve the LDP under difficult viewing.

[Felsenstein1981Evolutionary] J. Felsenstein. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution, 17:368-376, 1981. [ bib ]
[Silverman1982On] B. W. Silverman. On the Estimation of a Probability Density Function by the Maximum Penalized Likelihood Method. Ann. Stat., 10:795-810, 1982. [ bib | http | .pdf ]
[Hoerl1982Citation] A. E. Hoerl and R. W. Kennard. Citation classic - ridge regression: biased estimation for nonorthogonal problems. CC/Eng. Tech. Appl. Sci., 35:18-18, 1982. [ bib ]
[McGinnis1983Implementation] L. F. McGinnis. Implementation and testing of a primal-dual algorithm for the assignment problem. Operations Research, 31(2):277-291, 1983. [ bib ]
[Bunke1983Inexact] H. Bunke. Inexact graph matching for structural pattern recognition. Pattern Recogn. Lett., 1(4):245-253, May 1983. [ bib | DOI | http ]
This paper is concerned with the inexact matching of attributed, relational graphs for structural pattern recognition. The matching procedure is based on a state space search utilizing heuristic information. Some experimental results are reported.

Keywords: graph, matching
[Roepstorff1984Proposal] P. Roepstorff and J. Fohlman. Proposal for a common nomenclature for sequence ions in mass spectra of peptides. Biomed Mass Spectrom, 11(11):601, Nov 1984. [ bib | DOI | http ]
Keywords: Mass Spectrometry; Peptides; Terminology as Topic
[Rissanen1984Universal] J. Rissanen. Universal coding, information, prediction, and estimation. IEEE Trans. Inform. Theory, 30(4):629-636, Jul 1984. [ bib | .pdf ]
A connection between universal codes and the problems of prediction and statistical estimation is established. A known lower bound for the mean length of universal codes is sharpened and generalized, and optimum universal codes constructed. The bound is defined to give the information in strings relative to the considered class of processes. The earlier derived minimum description length criterion for estimation of parameters, including their number, is given a fundamental information-theoretic justification by showing that its estimators achieve the information in the strings. It is also shown that one cannot do prediction in Gaussian autoregressive moving average (ARMA) processes below a bound, which is determined by the information in the data.

Keywords: information-theory universal-coding
[Chavel1984Eigenvalues] I. Chavel. Eigenvalues in Riemannian geometry. Academic Press, Orlando, Fl., 1984. [ bib ]
[Breiman1984Classification] L. Breiman, J. Friedman, C.J. Stone, and R.A. Olshen. Classification and regression trees. Chapman & Hall/CRC, 1984. [ bib ]
[Berg1984Harmonic] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic analysis on semigroups. Springer-Verlag, New York, 1984. [ bib ]
[Saito1985ETL9b] T. Saito, H. Yamada, and K. Yamamoto. On the data base ETL9B of handprinted characters in JIS Chinese characters and its analysis. IEICE Trans, 68, 1985. [ bib ]
[Hedges1985Statistical] L. V. Hedges and I. Olkin. Statistical methods for meta-analysis. Academic Press, 1985. [ bib ]
[Carhart1985Atom] R.E. Carhart, D.H. Smith, and R. Venkataraghavan. Atom Pairs as Molecular Features in Structure-Activity Studies: Definitions and Applications. J Chem Inf Comput Sci, 25:64-73, 1985. [ bib ]
[Border1985Fixed] K. C. Border. Fixed point theorems with applications to economics and game theory. Cambridge University Press, Cambridge, UK, 1985. [ bib ]
[Berger1985Statistical] J.O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, 1985. [ bib ]
[Willett1986Implementation] P. Willett, V. Winterman, and D. Bawden. Implementation of nearest-neighbor searching in an online chemical structure search system. J. Chem. Inform. Comput. Sci., 26(1):36-41, 1986. [ bib ]
[Rumelhart1986Learning] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323:533-536, 1986. [ bib | .pdf ]
[Leighton1986Estimating] F. Leighton and R. Rivest. Estimating a probability using finite memory. IEEE Trans. Inform. Theory, 32(6):733-742, Nov 1986. [ bib | .pdf ]
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of independent Bernoulli random variables with probability $p$ that $X_i = 1$ and probability $q = 1-p$ that $X_i = 0$ for all $i \geq 1$. Time-invariant finite-memory (i.e., finite-state) estimation procedures for the parameter $p$ are considered which take $X_1, \cdots$ as an input sequence. In particular, an $n$-state deterministic estimation procedure is described which can estimate $p$ with mean-square error $O(\log n/n)$ and an $n$-state probabilistic estimation procedure which can estimate $p$ with mean-square error $O(1/n)$. It is proved that the $O(1/n)$ bound is optimal to within a constant factor. In addition, it is shown that linear estimation procedures are just as powerful (up to the measure of mean-square error) as arbitrary estimation procedures. The proofs are based on an analog of the well-known matrix tree theorem that is called the Markov chain tree theorem.

[Kuich1986Semirings] W. Kuich and A. Salomaa. Semirings, Automata, Languages. In EATCS Monographs on Computer Science, volume 5. Springer-Verlag, 1986. [ bib ]
[Freier1986Improved] S. M. Freier, R. Kierzek, J. A. Jaeger, N. Sugimoto, M. H. Caruthers, T. Neilson, and D. H. Turner. Improved free-energy parameters for predictions of RNA duplex stability. Proc. Natl. Acad. Sci. USA, 83(24):9373-7, Dec 1986. [ bib ]
Thermodynamic parameters for prediction of RNA duplex stability are reported. One parameter for duplex initiation and 10 parameters for helix propagation are derived from enthalpy and free-energy changes for helix formation by 45 RNA oligonucleotide duplexes. The oligomer sequences were chosen to maximize reliability of secondary structure predictions. Each of the 10 nearest-neighbor sequences is well-represented among the 45 oligonucleotides, and the sequences were chosen to minimize experimental errors in ΔG° at 37 °C. These parameters predict melting temperatures of most oligonucleotide duplexes within 5 °C. This is about as good as can be expected from the nearest-neighbor model. Free-energy changes for helix propagation at dangling ends, terminal mismatches, and internal G·U mismatches, and free-energy changes for helix initiation at hairpin loops, internal loops, or internal bulges are also tabulated.

[Feder1986Maximum] M. Feder. Maximum entropy as a special case of the minimum description length criterion. IEEE Trans. Inform. Theory, 32(6):847-849, Nov 1986. [ bib | .pdf ]
The Maximum Entropy (ME) and Maximum Likelihood (ML) criteria are the bases for two approaches to statistical inference problems. A new criterion, called the Minimum Description Length (MDL), has been recently introduced. This criterion generalizes the ML method so it can be applied to more general situations, e.g., when the number of parameters is unknown. It is shown that ME is also a special case of the MDL criterion; maximizing the entropy subject to some constraints on the underlying probability function is identical to minimizing the code length required to represent all possible i.i.d. realizations of the random variable such that the sample frequencies (or histogram) satisfy those given constraints.

Keywords: information-theory
[Jameson1987Summing] G. J. O. Jameson. Summing and Nuclear Norms in Banach Space Theory. Number 8 in London Mathematical Society Student Texts. Cambridge University Press, 1987. [ bib | DOI | http | .pdf ]
[Hartigan1987Estimation] J. A. Hartigan. Estimation of a convex density contour in two dimensions. J. Amer. Statist. Assoc., 82(397):267-270, 1987. [ bib | http | .pdf ]
[Blake1987Visual] A. Blake and A. Zisserman. Visual Reconstruction. MIT Press, 1987. [ bib ]
[Yao1988Estimating] Y. C. Yao. Estimating the number of change-points via Schwarz' criterion. Stat. Probab. Lett., 6:181-189, 1988. [ bib ]
[Umeyama1988eigendecomposition] S. Umeyama. An eigendecomposition approach to weighted graph matching problems. IEEE Trans. Pattern Anal. Mach. Intell., 10(5):695-703, Sept. 1988. [ bib | DOI | http ]
An approximate solution to the weighted-graph-matching problem is discussed for both undirected and directed graphs. The weighted-graph-matching problem is that of finding the optimum matching between two weighted graphs, which are graphs with weights at each arc. The proposed method uses an analytic instead of a combinatorial or iterative approach to the optimum matching problem. Using the eigendecompositions of the adjacency matrices (in the case of the undirected-graph-matching problem) or Hermitian matrices derived from the adjacency matrices (in the case of the directed-graph-matching problem), a matching close to the optimum can be found efficiently when the graphs are sufficiently close to each other. Simulation results are given to evaluate the performance of the proposed method.

[Saitoh1988Theory] S. Saitoh. Theory of reproducing Kernels and its applications. Longman Scientific & Technical, Harlow, UK, 1988. [ bib ]
[Pearl1988Probabilistic] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, 1988. The classic original book on belief networks, which was certainly motivated by the idea that belief networks might have relevance to brains. [ bib ]
[Groebe1988Characterization] D. R. Groebe and O. C. Uhlenbeck. Characterization of RNA hairpin loop stability. Nucleic Acids Res., 16(24):11725-35, Dec 1988. [ bib ]
Fifteen RNA hairpins that share the same stem sequence and have homopolymer loops of A, C and U residues which vary in length from three to nine nucleotides were synthesized and their thermal stabilities determined. Tm varies as a function of loop size but is almost independent of loop composition. Loops of four or five nucleotides are found to be the most stable loop size. This is consistent with the observation that four-membered loops are the most prevalent loop size in 16S-like RNAs. The contribution of each loop to hairpin stability was calculated by subtracting the known contribution of the helical stem. These data should be useful for predicting the stability of other hairpins.

[Devroye1988Automatic] L. Devroye. Automatic pattern recognition: a study of the probability of error. IEEE Trans. Pattern Anal. Mach. Intell., 10(4):530-543, Jul 1988. [ bib | .pdf ]
A test sequence is used to select the best rule from a class of discrimination rules defined in terms of the training sequence. The Vapnik-Chervonenkis and related inequalities are used to obtain distribution-free bounds on the difference between the probability of error of the selected rule and the probability of error of the best rule in the given class. The bounds are used to prove the consistency and asymptotic optimality for several popular classes, including linear discriminators, nearest-neighbor rules, kernel-based rules, histogram rules, binary tree classifiers, and Fourier series classifiers. In particular, the method can be used to choose the smoothing parameter in kernel-based rules, to choose k in the k-nearest neighbor rule, and to choose between parametric and nonparametric rules.

Keywords: information-theory
[Daubechies1988Orthonormal] I. Daubechies. Orthonormal bases of compactly supported wavelets. Comm. Pure Appl. Math., 41:909-996, 1988. [ bib | .pdf ]
[Basak1988Determining] S.C. Basak, V.R. Magnuson, G.J. Niemi, and R.R. Regal. Determining Structural Similarity of Chemicals Using Graph Theoretic Indices. Discrete Appl. Math., 19:17-44, 1988. [ bib ]
[Barron1988bound] A.R. Barron and T.M. Cover. A bound on the financial value of information. IEEE Trans. Inform. Theory, 34(5):1097-1100, Sep 1988. [ bib | .pdf ]
It is shown that each bit of information at most doubles the resulting wealth in the general stock-market setup. This information bound on the growth of wealth is actually attained for certain probability distributions on the market investigated by J. Kelly (1956). The bound is shown to be a special case of the result that the increase in exponential growth of wealth achieved with true knowledge of the stock market distribution F over that achieved with incorrect knowledge G is bounded above by the entropy of F relative to G.

Keywords: information-theory
[Zuker1989finding] M. Zuker. On finding all suboptimal foldings of an RNA molecule. Science, 244(4900):48-52, Apr 1989. [ bib ]
An algorithm and a computer program have been prepared for determining RNA secondary structures within any prescribed increment of the computed global minimum free energy. The mathematical problem of determining how well defined a minimum energy folding is can now be solved. All predicted base pairs that can participate in suboptimal structures may be displayed and analyzed graphically. Representative suboptimal foldings are generated by selecting these base pairs one at a time and computing the best foldings that contain them. A distance criterion that ensures that no two structures are "too close" is used to avoid multiple generation of similar structures. Thermodynamic parameters, including free-energy increments for single-base stacking at the ends of helices and for terminal mismatched pairs in interior and hairpin loops, are incorporated into the underlying folding model of the above algorithm.

Keywords: sirna
[Yao1989Least] Y.-C. Yao and S. T. Au. Least-squares estimation of a step function. Sankhya: The Indian Journal of Statistics, Series A, 51(3):370-381, 1989. [ bib | http ]
[Wyner1989Some] A.D. Wyner and J. Ziv. Some asymptotic properties of the entropy of a stationary ergodic data source with applications to data compression. IEEE Trans. Inform. Theory, 35(6):1250-1258, Nov 1989. [ bib | .pdf ]
Theorems concerning the entropy of a stationary ergodic information source are derived and used to obtain insight into the workings of certain data-compression coding schemes, in particular the Lempel-Ziv data compression algorithm.

Keywords: information-theory
[Willems1989Universal] F. M. J. Willems. Universal data compression and repetition times. IEEE Trans. Inform. Theory, 35(1):54-58, Jan 1989. [ bib | DOI | http | .pdf ]
A novel universal data compression algorithm is described. This algorithm encodes L source symbols at a time. An upper limit for the number of bits per source symbol is given for the class of binary stationary sources. In the author's analysis, a property of repetition times turns out to be of crucial importance.

[Mallat1989theory] S. G. Mallat. A theory for multiresolution signal decomposition: the wavelet representation. IEEE T. Pattern. Anal., 2:674-693, 1989. [ bib | .pdf ]
[Fields1989novel] S. Fields and O. Song. A novel genetic system to detect protein-protein interactions. Nature, 340(6230):245-246, Jul 1989. [ bib | DOI | http ]
Protein-protein interactions between two proteins have generally been studied using biochemical techniques such as crosslinking, co-immunoprecipitation and co-fractionation by chromatography. We have generated a novel genetic system to study these interactions by taking advantage of the properties of the GAL4 protein of the yeast Saccharomyces cerevisiae. This protein is a transcriptional activator required for the expression of genes encoding enzymes of galactose utilization. It consists of two separable and functionally essential domains: an N-terminal domain which binds to specific DNA sequences (UASG); and a C-terminal domain containing acidic regions, which is necessary to activate transcription. We have generated a system of two hybrid proteins containing parts of GAL4: the GAL4 DNA-binding domain fused to a protein 'X' and a GAL4 activating region fused to a protein 'Y'. If X and Y can form a protein-protein complex and reconstitute proximity of the GAL4 domains, transcription of a gene regulated by UASG occurs. We have tested this system using two yeast proteins that are known to interact, SNF1 and SNF4. High transcriptional activity is obtained only when both hybrids are present in a cell. This system may be applicable as a general method to identify proteins that interact with a known protein by the use of a simple galactose selection.

[Doubet1989Complex] S. Doubet, K. Bock, D. Smith, A. Darvill, and P. Albersheim. The Complex Carbohydrate Structure Database. Trends Biochem Sci, 14(12):475-7, Dec 1989. [ bib ]
The Complex Carbohydrate Structure Database (CCSD) and CarbBank, an IBM PC/AT (or compatible) database management system, were created to provide an information system to meet the needs of people interested in carbohydrate science. The CCSD, which presently contains more than 2000 citations, is expected to double in size in the next two years and to include, soon thereafter, all of the published structures of carbohydrates larger than disaccharides.

Keywords: Carbohydrate Sequence, Carbohydrates, Databases, Factual, Information Systems, Molecular Structure, 2623761
[Bookstein1989Principal] F. L. Bookstein. Principal warps: thin-plate splines and the decomposition of deformations. IEEE T. Pattern. Anal., 11(6):567-585, 1989. [ bib | DOI | http | .pdf ]
The decomposition of deformations by principal warps is demonstrated. The method is extended to deal with curving edges between landmarks. This formulation is related to other applications of splines current in computer vision. How they might aid in the extraction of features for analysis, comparison, and diagnosis of biological and medical images is indicated.

[Johnson1990Concepts] M. A. Johnson and G. M. Maggiora, editors. Concepts and Applications of Molecular Similarity. Wiley, 1990. [ bib ]
[Wahba1990Spline] G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990. [ bib ]
[Pearson1990Rapid] W. R. Pearson. Rapid and sensitive sequence comparisons with FASTP and FASTA. Meth. Enzymol., 183:63-98, 1990. [ bib ]
[Kirby1990Application] M. Kirby and L. Sirovich. Application of the Karhunen-Loève procedure for the characterization of human faces. IEEE Trans. Pattern Anal. Mach. Intell., 12(1):103-108, 1990. [ bib | DOI | http | .pdf ]
[Gribskov1990Profile] M. Gribskov, R. Lüthy, and D. Eisenberg. Profile Analysis. Methods in Enzymology, 183:146-159, 1990. [ bib ]
[Cox1990Asymptotic] D. Cox and F. O'Sullivan. Asymptotic analysis of penalized likelihood and related estimators. Ann. Stat., 18:1676-1695, 1990. [ bib | .pdf ]
[Cover1990Elements] T.M. Cover and J.A. Thomas. Elements of Information Theory. John Wiley, New York, 1990. [ bib ]
[Aoyama1990Neural] T. Aoyama, Y. Suzuki, and H. Ichikawa. Neural networks applied to quantitative structure-activity relationship analysis. J. Med. Chem., 33(9):2583-2590, Sep 1990. [ bib ]
An application of the neural network to quantitative structure-activity relationship (QSAR) analysis has been studied. The new method was compared with the linear multiregression analysis in various ways. It was found that the neural network can be a potential tool in the routine work of QSAR analysis. The mathematical relationship of operation between the neural network and the multiregression analysis was described. It was shown that the neural network can exceed the level of the linear multiregression analysis.

Keywords: Animals, Azirines, Benzodiazepines, Carbazilquinone, Comparative Study, Nerve Net, Nervous System, Regression Analysis, Structure-Activity Relationship, 2202830
[Altschul1990basic] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. A basic local alignment search tool. J. Mol. Biol., 215:403-410, 1990. [ bib ]
[Allgoweri1990Numerical] E.L. Allgower and K. Georg. Numerical continuation methods. Springer, 1990. [ bib ]
[Clarke1990Information-theoretic] B.S. Clarke and A.R. Barron. Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inform. Theory, 36(3):453-471, May 1990. [ bib | .pdf ]
In the absence of knowledge of the true density function, Bayesian models take the joint density function for a sequence of n random variables to be an average of densities with respect to a prior. The authors examine the relative entropy distance Dn between the true density and the Bayesian density and show that the asymptotic distance is (d/2)(log n)+c, where d is the dimension of the parameter vector. Therefore, the relative entropy rate Dn/n converges to zero at rate (log n)/n. The constant c, which the authors explicitly identify, depends only on the prior density function and the Fisher information matrix evaluated at the true parameter value. Consequences are given for density estimation, universal data compression, composite hypothesis testing, and stock-market portfolio selection.

Keywords: information-theory
[Pepperrell1991Techniques] C. A. Pepperrell and P. Willett. Techniques for the calculation of three-dimensional structural similarity using inter-atomic distances. J Comput Aided Mol Des, 5(5):455-474, Oct 1991. [ bib ]
This paper reports a comparison of several methods for measuring the degree of similarity between pairs of 3-D chemical structures that are represented by inter-atomic distance matrices. The methods that have been tested use the distance information in very different ways and have very different computational requirements. Experiments with 10 small datasets, for which both structural and biological activity data are available, suggest that the most cost-effective technique is based on a mapping procedure that tries to match pairs of atoms, one from each of the molecules that are being compared, that have neighbouring atoms at approximately the same distances.

Keywords: Algorithms, Binding Sites, Chemical, Chemistry, Comparative Study, Computer Simulation, Databases, Factual, Macromolecular Substances, Models, Molecular Conformation, Molecular Structure, Non-U.S. Gov't, Physical, Protein Conformation, Protein Structure, Proteins, Research Support, Structure-Activity Relationship, Tertiary, 1770381
[Mohar1991Laplacian] B. Mohar. The Laplacian spectrum of graphs. In Y. Alavi, G. Chartrand, O. Ollermann, and A. Schwenk, editors, Graph theory, combinatorics, and applications, pages 871-898, New York, 1991. John Wiley and Sons, Inc. [ bib | .pdf | .pdf ]
[Fodor1991Light-directed] S. P. Fodor, J. L. Read, M. C. Pirrung, L. Stryer, A. T. Lu, and D. Solas. Light-directed, spatially addressable parallel chemical synthesis. Science, 251:767-773, 1991. [ bib | http | .pdf ]
[Elston1991Pathological] C. W. Elston and I. O. Ellis. Pathological prognostic factors in breast cancer. I. The value of histological grade in breast cancer: experience from a large study with long-term follow-up. Histopathology, 19(5):403-410, Nov 1991. [ bib | DOI | http | .pdf ]
Morphological assessment of the degree of differentiation has been shown in numerous studies to provide useful prognostic information in breast cancer, but until recently histological grading has not been accepted as a routine procedure, mainly because of perceived problems with reproducibility and consistency. In the Nottingham/Tenovus Primary Breast Cancer Study the most commonly used method, described by Bloom & Richardson, has been modified in order to make the criteria more objective. The revised technique involves semiquantitative evaluation of three morphological features: the percentage of tubule formation, the degree of nuclear pleomorphism and an accurate mitotic count using a defined field area. A numerical scoring system is used and the overall grade is derived from a summation of individual scores for the three variables: three grades of differentiation are used. Since 1973, over 2200 patients with primary operable breast cancer have been entered into a study of multiple prognostic factors. Histological grade, assessed in 1831 patients, shows a very strong correlation with prognosis; patients with grade I tumours have a significantly better survival than those with grade II and III tumours (P less than 0.0001). These results demonstrate that this method for histological grading provides important prognostic information and, if the grading protocol is followed consistently, reproducible results can be obtained. Histological grade forms part of the multifactorial Nottingham prognostic index, together with tumour size and lymph node stage, which is used to stratify individual patients for appropriate therapy.

Keywords: csbcbook, csbcbook-ch3
[Debnath1991Structure-Activity] A.K. Debnath, R.L. Lopez de Compadre, G. Debnath, A.J. Schusterman, and C. Hansch. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. J. Med. Chem., 34(2):786-797, 1991. [ bib ]
Keywords: chemoinformatics
[Barron1991Minimum] A.R. Barron and T.M. Cover. Minimum complexity density estimation. IEEE Trans. Inform. Theory, 37(4):1034-1054, Jul 1991. [ bib | .pdf ]
The authors introduce an index of resolvability that is proved to bound the rate of convergence of minimum complexity density estimators as well as the information-theoretic redundancy of the corresponding total description length. The results on the index of resolvability demonstrate the statistical effectiveness of the minimum description-length principle as a method of inference. The minimum complexity estimator converges to true density nearly as fast as an estimator based on prior knowledge of the true subclass of densities. Interpretations and basic properties of minimum complexity estimators are discussed. Some regression and classification problems that can be examined from the minimum description-length framework are considered.

Keywords: information-theory
[Andrea1991Applications] T. A. Andrea and H. Kalayeh. Applications of neural networks in quantitative structure-activity relationships of dihydrofolate reductase inhibitors. J. Med. Chem., 34(9):2824-2836, Sep 1991. [ bib ]
Back propagation neural networks is a new technology useful for modeling nonlinear functions of several variables. This paper explores their applications in the field of quantitative structure-activity relationships. In particular, their ability to fit biological activity surfaces, predict activity, and determine the "functional forms" of its dependence on physical properties is compared to well-established methods in the field. A dataset of 256 5-phenyl-3,4-diamino-6,6-dimethyldihydrotriazines that inhibit dihydrofolate reductase enzyme is used as a basis for comparison. It is found that neural networks lead to enhanced surface fits and predictions relative to standard regression methods. Moreover, they circumvent the need for ad hoc indicator variables, which account for a significant part of the variance in linear regression models. Additionally, they lead to the elucidation of nonlinear and "cross-products" effects that correspond to trade-offs between physical properties in their effect on biological activity. This is the first demonstration of the latter two findings. On the other hand, due to the complexity of the resulting models, an understanding of the local, but not the global, structure-activity relationships is possible. The latter must await further developments. Furthermore, the longer computational time required to train the networks is somewhat inconveniencing, although not restrictive.

Keywords: Animals, Carcinoma 256, Cultured, Experimental, Folic Acid Antagonists, Leukemia, Neural Pathways, Neurons, Regression Analysis, Structure-Activity Relationship, Tumor Cells, Walker, 1895302
[Zhang1992Iterative] Z. Zhang. Iterative point matching for registration of free-form curves. Technical report, Institut National de Recherche en Informatique et en Automatique (INRIA), 1992. [ bib ]
[Tjalkens1992universal] T.J. Tjalkens and F.M.J. Willems. A universal variable-to-fixed length source code based on Lawrence's algorithm. IEEE Trans. Inform. Theory, 38(2):247-253, Mar 1992. [ bib | .pdf ]
It is shown that the modified Lawrence algorithm is universal over the class of binary memoryless sources and that the rate converges asymptotically optimally fast to the source entropy. It is proven that no codes exist that have a better asymptotic performance. The asymptotic bounds show that universal variable-to-fixed-length codes can have a significantly lower redundancy than universal fixed-to-variable-length codes with the same number of codewords.

Keywords: information-theory source-coding
[Simard1992Tangent] P. Simard, B. Victorri, Y. LeCun, and J. S. Denker. Tangent prop - a formalism for specifying selected invariances in an adaptive network. In J. E. Moody, S. J. Hanson, and R. Lippmann, editors, Adv. Neural. Inform. Process Syst. 4, pages 895-903. Morgan Kaufmann, 1992. [ bib | .pdf ]
[Shapiro1992Feature] L. S. Shapiro and M. Brady. Feature-based correspondence: an eigenvector approach. Image Vision Comput., 10(5):283-288, 1992. [ bib ]
[Rotzschke1992Peptide] O. Rötzschke, K. Falk, S. Stevanović, G. Jung, and H. C. Rammensee. Peptide motifs of closely related HLA class I molecules encompass substantial differences. Eur. J. Immunol., 22(9):2453-2456, Sep 1992. [ bib ]
The peptides presented by major histocompatibility complex class I molecules adhere to strict rules concerning peptide length and occupancy by certain amino acid residues at anchor positions. Peptides presented by HLA-A*0201 molecules, for example, are generally nonapeptides requiring Leu or Met at position 2 and an aliphatic residue, predominantly Val, at position 9. A closely related molecule, HLA-A*0205, differing from the former at four amino acid residues, has a related but substantially different peptide motif. A*0205-presented peptides are still nonapeptides, and position 9 is still aliphatic, although it is preferentially occupied by Leu instead of Val. Position 2 not only allows aliphatic residues but also polar ones. Occupancy at position 6, considered as an auxiliary anchor in A*0201, as well as non-anchor residues at positions 3, 4, and 8 are relatively well conserved between the two peptide motifs. Thus, although a number of the T cell epitopes presented by the two HLA-A2 forms is expected to be identical, a considerable number of epitopes should be different.

Keywords: immunoinformatics
[Russell1992Multiple] R. B. Russell and G. J. Barton. Multiple protein sequence alignment from tertiary structure comparison: assignment of global and residue confidence levels. Proteins, 14(2):309-323, Oct 1992. [ bib | DOI | http ]
An algorithm is presented for the accurate and rapid generation of multiple protein sequence alignments from tertiary structure comparisons. A preliminary multiple sequence alignment is performed using sequence information, which then determines an initial superposition of the structures. A structure comparison algorithm is applied to all pairs of proteins in the superimposed set and a similarity tree calculated. Multiple sequence alignments are then generated by following the tree from the branches to the root. At each branchpoint of the tree, a structure-based sequence alignment and coordinate transformations are output, with the multiple alignment of all structures output at the root. The algorithm encoded in STAMP (STructural Alignment of Multiple Proteins) is shown to give alignments in good agreement with published structural accounts within the dehydrogenase fold domains, globins, and serine proteinases. In order to reduce the need for visual verification, two similarity indices are introduced to determine the quality of each generated structural alignment. Sc quantifies the global structural similarity between pairs or groups of proteins, whereas Pij' provides a normalized measure of the confidence in the alignment of each residue. STAMP alignments have the quality of each alignment characterized by Sc and Pij' values and thus provide a reproducible resource for studies of residue conservation within structural motifs.

Keywords: Algorithms; Amino Acid Sequence; Animals; Confidence Intervals; Globins; Humans; Molecular Sequence Data; Protein Structure, Tertiary; Sequence Alignment; Sequence Homology, Amino Acid; Serine Endopeptidases; Software
[Rudin1992Nonlinear] L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D, 60:259-268, 1992. [ bib | .pdf ]
A constrained optimization type of numerical algorithm for removing noise from images is presented. The total variation of the image is minimized subject to constraints involving the statistics of the noise. The constraints are imposed using Lagrange multipliers. The solution is obtained using the gradient-projection method. This amounts to solving a time dependent partial differential equation on a manifold determined by the constraints. As t → ∞ the solution converges to a steady state which is the denoised image. The numerical algorithm is simple and relatively fast. The results appear to be state-of-the-art for very noisy images. The method is noninvasive, yielding sharp edges in the image. The technique could be interpreted as a first step of moving each level set of the image normal to itself with velocity equal to the curvature of the level set divided by the magnitude of the gradient of the image, and a second step which projects the image back onto the constraint set.

Keywords: segmentation
[Rissanen1992Density] J. Rissanen, T. P. Speed, and B. Yu. Density estimation by stochastic complexity. IEEE Trans. Inform. Theory, 38(2):315-323, Mar 1992. [ bib | .pdf ]
The results by P. Hall and E.J. Hannan (1988) on optimization of histogram density estimators with equal bin widths by minimization of the stochastic complexity are extended and sharpened in two separate ways. As the first contribution, two generalized histogram estimators are constructed. The first has unequal bin widths which, together with the number of the bins, are determined by minimization of the stochastic complexity using dynamic programming. The other estimator consists of a mixture of equal bin width estimators, each of which is defined by the associated stochastic complexity. As the main contribution in the present work, two theorems are proved, which together extend the universal coding theorems to a large class of data generating densities. The first gives an asymptotic upper bound for the code redundancy in the order of magnitude, achieved with a special predictive type of histogram estimator, which sharpens a related bound. The second theorem states that this bound cannot be improved upon by any code whatsoever.

Keywords: information-theory
[Perry1992Database] N. Perry and V. J. van Geerestein. Database Searching on the basis of Three-Dimensional Similarity Using the SPERM Program. J Chem Inf Comput Sci, 32:607-616, 1992. [ bib ]
[King1992Drug] R. D. King, S. Muggleton, R. A. Lewis, and M. J. Sternberg. Drug design by machine learning: the use of inductive logic programming to model the structure-activity relationships of trimethoprim analogues binding to dihydrofolate reductase. Proc. Natl. Acad. Sci. USA, 89(23):11322-11326, Dec 1992. [ bib ]
The machine learning program GOLEM from the field of inductive logic programming was applied to the drug design problem of modeling structure-activity relationships. The training data for the program were 44 trimethoprim analogues and their observed inhibition of Escherichia coli dihydrofolate reductase. A further 11 compounds were used as unseen test data. GOLEM obtained rules that were statistically more accurate on the training data and also better on the test data than a Hansch linear regression model. Importantly machine learning yields understandable rules that characterized the chemistry of favored inhibitors in terms of polarity, flexibility, and hydrogen-bonding character. These rules agree with the stereochemistry of the interaction observed crystallographically.

Keywords: Algorithms, Artificial Intelligence, Drug Design, Escherichia coli, Folic Acid Antagonists, Molecular Structure, Mutagens, Nitroso Compounds, Non-U.S. Gov't, Research Support, Structure-Activity Relationship, Trimethoprim, 1454814
[Kallioniemi1992Comparative] A. Kallioniemi, O. P. Kallioniemi, D. Sudar, D. Rutovitz, J. W. Gray, F. Waldman, and D. Pinkel. Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science, 258(5083):818-821, Oct 1992. [ bib | http | .pdf ]
Comparative genomic hybridization produces a map of DNA sequence copy number as a function of chromosomal location throughout the entire genome. Differentially labeled test DNA and normal reference DNA are hybridized simultaneously to normal chromosome spreads. The hybridization is detected with two different fluorochromes. Regions of gain or loss of DNA sequences, such as deletions, duplications, or amplifications, are seen as changes in the ratio of the intensities of the two fluorochromes along the target chromosomes. Analysis of tumor cell lines and primary bladder tumors identified 16 different regions of amplification, many in loci not previously known to be amplified.

Keywords: csbcbook, csbcbook-ch2, cgh
[Henikoff1992Amino] S. Henikoff and J. G. Henikoff. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA, 89(22):10915-10919, Nov 1992. [ bib ]
Methods for alignment of protein sequences typically measure similarity by using a substitution matrix with scores for all possible exchanges of one amino acid with another. The most widely used matrices are based on the Dayhoff model of evolutionary rates. Using a different approach, we have derived substitution matrices from about 2000 blocks of aligned sequence segments characterizing more than 500 groups of related proteins. This led to marked improvements in alignments and in searches using queries from each of the groups.

Keywords: Algorithms; Amino Acid Sequence; Animals; Caenorhabditis elegans; Drosophila; Lod Score; Mathematics; Molecular Sequence Data; Probability; Proteins; Sequence Homology, Amino Acid; Software
[Feder1992Universal] M. Feder, N. Merhav, and M. Gutman. Universal prediction of individual sequences. IEEE Trans. Inform. Theory, 38(4):1258-1270, Jul 1992. [ bib | .pdf ]
The problem of predicting the next outcome of an individual binary sequence using finite memory is considered. The finite-state predictability of an infinite sequence is defined as the minimum fraction of prediction errors that can be made by any finite-state (FS) predictor. It is proven that this FS predictability can be achieved by universal sequential prediction schemes. An efficient prediction procedure based on the incremental parsing procedure of the Lempel-Ziv data compression algorithm is shown to achieve asymptotically the FS predictability. Some relations between compressibility and predictability are discussed, and the predictability is proposed as an additional measure of the complexity of a sequence.

Keywords: information-theory universal-coding
[Ellis1992Pathological] I. O. Ellis, M. Galea, N. Broughton, A. Locker, R. W. Blamey, and C. W. Elston. Pathological prognostic factors in breast cancer. II. Histological type. Relationship with survival in a large study with long-term follow-up. Histopathology, 20(6):479-489, Jun 1992. [ bib | DOI | http | .pdf ]
The histological tumour type determined by current criteria has been investigated in a consecutive series of 1621 women with primary operable breast carcinoma, presenting between 1973 and 1987. All women underwent definitive surgery with node biopsy and none received adjuvant systemic therapy. Special types, tubular, invasive cribriform and mucinous, with a very favourable prognosis can be identified. A common type of tumour recognized by our group and designated tubular mixed carcinoma is shown to be prognostically distinct from carcinomas of no special type; it has a characteristic histological appearance and is the third most common type in this series. Analysis of subtypes of lobular carcinoma confirms differing prognoses. The classical, tubulo-lobular and lobular mixed types are associated with a better prognosis than carcinomas of no special type; this is not so for the solid variant. Tubulo-lobular carcinoma in particular has an extremely good prognosis similar to tumours included in the 'special type' category above. Neither medullary carcinoma nor atypical medullary carcinoma are found to carry a survival advantage over carcinomas of no special type. The results confirm that histological typing of human breast carcinoma can provide useful prognostic information.

Keywords: csbcbook, csbcbook-ch3
[Doubet1992CarbBank] S. Doubet and P. Albersheim. CarbBank. Glycobiology, 2(6):505, Dec 1992. [ bib ]
Keywords: glycans
[Cleveland1992Statistical] W. S. Cleveland, E. Grosse, and W. M. Shyu. Statistical Models in S, chapter Local regression models. Wadsworth & Brooks/Cole, 1992. [ bib ]
[Boser1992training] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the 5th annual ACM workshop on Computational Learning Theory, pages 144-152, New York, NY, USA, 1992. ACM Press. [ bib | .ps.Z | .pdf ]
[Barron1992Distribution] A.R. Barron, L. Györfi, and E.C. van der Meulen. Distribution estimation consistent in total variation and in two types of information divergence. IEEE Trans. Inform. Theory, 38(5):1437-1454, Sep 1992. [ bib | .pdf ]
The problem of the nonparametric estimation of a probability distribution is considered from three viewpoints: the consistency in total variation, the consistency in information divergence, and consistency in reversed-order information divergence. These types of consistencies are relatively strong criteria of convergence, and a probability distribution cannot be consistently estimated in either type of convergence without any restrictions on the class of probability distributions allowed. Histogram-based estimators of distribution are presented which, under certain conditions, converge in total variation, in information divergence, and in reversed-order information divergence to the unknown probability distribution. Some a priori information about the true probability distribution is assumed in each case. As the concept of consistency in information divergence is stronger than that of convergence in total variation, additional assumptions are imposed in the cases of informational divergences

[Razzak1992Applications] M. A-Razzak and R. C. Glen. Applications of rule-induction in the derivation of quantitative structure-activity relationships. J. Comput. Aided Mol. Des., 6(4):349-383, Aug 1992. [ bib ]
Recently, methods have been developed in the field of Artificial Intelligence (AI), specifically in the expert systems area using rule-induction, designed to extract rules from data. We have applied these methods to the analysis of molecular series with the objective of generating rules which are predictive and reliable. The input to rule-induction consists of a number of examples with known outcomes (a training set) and the output is a tree-structured series of rules. Unlike most other analysis methods, the results of the analysis are in the form of simple statements which can be easily interpreted. These are readily applied to new data giving both a classification and a probability of correctness. Rule-induction has been applied to in-house generated and published QSAR datasets and the methodology, application and results of these analyses are discussed. The results imply that in some cases it would be advantageous to use rule-induction as a complementary technique in addition to conventional statistical and pattern-recognition methods.

Keywords: Algorithms, Anticonvulsants, Antimalarials, Artificial Intelligence, Cardiotonic Agents, Software, Structure-Activity Relationship, gamma-Aminobutyric Acid
[Einmahl1992Generalized] J. H. J. Einmahl and D. M. Mason. Generalized Quantile Process. Ann. Stat., 20:1062-1078, June 1992. [ bib | http | .pdf ]
[International1992Enzyme] International Union of Biochemistry and Molecular Biology. Enzyme Nomenclature 1992. Academic Press, San Diego, California, United States, August 1992. [ bib | http ]
Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the Nomenclature and Classification of Enzymes.

Keywords: gene-ontology, ontology
[Westfall1993Resampling] P. H. Westfall and S. S. Young. Resampling-based multiple testing: Examples and methods for p-value adjustment. John Wiley and Sons, 1993. [ bib ]
[de1993Multiplicites] Y. C. de Verdière. Multiplicités des valeurs propres Laplaciens discrets et Laplaciens continus. Rendiconti di Matematica, 13:433-460, 1993. [ bib ]
[Shields1993Universal] P.C. Shields. Universal redundancy rates do not exist. IEEE Trans. Inform. Theory, 39(2):520-524, Mar 1993. [ bib | .pdf ]
The expected redundancy per symbol of an n-block prefix code C_n on a source measures how far the code is from being optimal for that source. The existence of sequences of codes with expected redundancy per symbol of O((log n)/n) for 'nice' classes of sources, such as Markov sources of a given order, is well known. It is shown that some restriction on the class of processes is necessary in order to obtain such redundancy bounds, for there is no universal redundancy rate for any sequence of prefix codes on the class of all ergodic sources.

Keywords: information-theory
[Rucker1993Counts] G. Rücker and C. Rücker. Counts of All Walks as Atomic and Molecular Descriptors. J Chem Inf Comput Sci, 33:683-695, 1993. [ bib ]
Keywords: chemoinformatics
[Ornstein1993Entropy] D.S. Ornstein and B. Weiss. Entropy and data compression schemes. IEEE Trans. Inform. Theory, 39(1):78-83, Jan 1993. [ bib | .pdf ]
Some new ways of defining the entropy of a process by observing a single typical output sequence as well as a new kind of Shannon-McMillan-Breiman theorem are presented. This provides new and conceptually very simple ways of estimating the entropy of an ergodic stationary source as well as new insight into the workings of such well-known data compression schemes as the Lempel-Ziv algorithm.

Keywords: information-theory
[Noon93Efficient] C. Noon and J.C. Bean. An efficient transformation of the generalized traveling salesman problem. INFOR, pages 39-44, 1993. [ bib ]
[Miller1993On] J.W. Miller, R. Goodman, and P. Smyth. On loss functions which minimize to conditional expected values and posterior probabilities. IEEE Trans. Inform. Theory, 39(4):1404-1408, Jul 1993. [ bib | .pdf ]
A loss function, or objective function, is a function used to compare parameters when fitting a model to data. The loss function gives a distance between the model output and the desired output. Two common examples are the squared-error loss function and the cross entropy loss function. Minimizing the mean-square error loss function is equivalent to minimizing the mean square difference between the model output and the expected value of the output given a particular input. This property of minimization to the expected value is formalized as P-admissibility. The necessary and sufficient conditions for P-admissibility, leading to a parametric description of all P-admissible loss functions, are found. In particular, it is shown that two of the simplest members of this class of functions are the squared error and the cross entropy loss functions. One application of this work is in the choice of a loss function for training neural networks to provide probability estimates

[Merhav1993Universal] N. Merhav and M. Feder. Universal schemes for sequential decision from individual data sequences. IEEE Trans. Inform. Theory, 39(4):1280-1292, Jul 1993. [ bib | .pdf ]
Sequential decision algorithms are investigated in relation to a family of additive performance criteria for individual data sequences. Simple universal sequential schemes are known, under certain conditions, to approach optimality uniformly as fast as n^{-1} log n, where n is the sample size. For the case of finite-alphabet observations, the class of schemes that can be implemented by finite-state machines (FSMs) is studied. It is shown that Markovian machines with sufficiently long memory exist, which are asymptotically nearly as good as any given deterministic or randomized FSM for the purpose of sequential decision. For the continuous-valued observation case, a useful class of parametric schemes is discussed with special attention to the recursive least squares algorithm.

Keywords: information-theory source-coding
[Mallat1993Matching] S. G. Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. Signal Processing, IEEE Transactions on, 41(12):3397-3415, 1993. [ bib | DOI | http ]
The authors introduce an algorithm, called matching pursuit, that decomposes any signal into a linear expansion of waveforms that are selected from a redundant dictionary of functions. These waveforms are chosen in order to best match the signal structures. Matching pursuits are general procedures to compute adaptive signal representations. With a dictionary of Gabor functions a matching pursuit defines an adaptive time-frequency transform. They derive a signal energy distribution in the time-frequency plane, which does not include interference terms, unlike Wigner and Cohen class distributions. A matching pursuit isolates the signal structures that are coherent with respect to a given dictionary. An application to pattern extraction from noisy signals is described. They compare a matching pursuit decomposition with a signal expansion over an optimized wavepacket orthonormal basis, selected with the algorithm of Coifman and Wickerhauser (see IEEE Trans. Informat. Theory, vol. 38, Mar. 1992).

Keywords: pursuit
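
The greedy decomposition described in the abstract above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: it assumes a finite dictionary of unit-norm atoms stored as plain Python lists, and the names `matching_pursuit`, `dot`, and the toy three-atom dictionary are hypothetical. At each step the atom with the largest absolute inner product with the current residual is selected and its contribution is subtracted.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matching_pursuit(signal, dictionary, n_iter):
    """Greedy expansion of `signal` over unit-norm atoms (assumed normalized).

    Returns the list of (atom index, coefficient) pairs and the final residual.
    """
    residual = list(signal)
    expansion = []
    for _ in range(n_iter):
        products = [dot(residual, atom) for atom in dictionary]
        # Select the atom that best matches the current residual.
        best = max(range(len(dictionary)), key=lambda k: abs(products[k]))
        coeff = products[best]
        if coeff == 0.0:
            break  # residual is orthogonal to every atom
        expansion.append((best, coeff))
        residual = [r - coeff * g for r, g in zip(residual, dictionary[best])]
    return expansion, residual


# Toy redundant dictionary in R^2 (hypothetical): two axes plus a diagonal.
s = 1 / math.sqrt(2)
atoms = [[1.0, 0.0], [0.0, 1.0], [s, s]]
expansion, residual = matching_pursuit([3.0, -1.0], atoms, n_iter=2)
```

With this toy dictionary the first axis is selected first, then the second, and the residual vanishes after two steps. A time-frequency dictionary such as the Gabor functions mentioned in the abstract would replace `atoms`.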
[Kearns1993Efficient] M. Kearns. Efficient noise-tolerant learning from statistical queries. In Journal of the ACM, pages 392-401, 1993. [ bib | http ]
[Good1993Rapid] A.C. Good and W.G. Richards. Rapid Evaluation of Molecular Shape Similarity Using Gaussian Functions. J Chem Inf Comput Sci, 33:112-116, 1993. [ bib ]
[Farago1993Strong] A. Farago and G. Lugosi. Strong universal consistency of neural network classifiers. IEEE Trans. Inform. Theory, 39(4):1146-1151, Jul 1993. [ bib | .pdf ]
In statistical pattern recognition, a classifier is called universally consistent if its error probability converges to the Bayes-risk as the size of the training data grows for all possible distributions of the random variable pair of the observation vector and its class. It is proven that if a one-layered neural network with properly chosen number of nodes is trained to minimize the empirical risk on the training data, then a universally consistent classifier results. It is shown that the exponent in the rate of convergence does not depend on the dimension if certain smoothness conditions on the distribution are satisfied. That is, this class of universally consistent classifiers does not suffer from the curse of dimensionality. A training algorithm is presented that finds the optimal set of parameters in polynomial time if the number of nodes and the space dimension is fixed and the amount of training data grows

[Egolf1993Prediction] L. M. Egolf and P. C. Jurs. Prediction of Boiling Points of Organic Heterocyclic Compounds Using Regression and Neural Networks Techniques. J Chem Inf Comput Sci, 33:616-635, 1993. [ bib ]
[DeVore1993Constructive] R. A. DeVore and G. G. Lorentz. Constructive Approximation. Springer Grundlehren der Mathematischen Wissenschaften. Springer Verlag, 1993. [ bib ]
[Caruana1993Multitask] Richard Caruana. Multitask learning: A knowledge-based source of inductive bias. In Proceedings of the Tenth International Conference on Machine Learning, pages 41-48. Morgan Kaufmann, 1993. [ bib ]
[Brown1993Using] M.P. Brown, R. Hughey, A. Krogh, I.S. Mian, K. Sjolander, and D. Haussler. Using Dirichlet mixture priors to derive hidden Markov models for protein families. In Proc. First International Conference on Intelligent Systems for Molecular Biology (ISMB 1993), 1993. [ bib ]
[Brodsky1993Nonparametric] B. Brodsky and B. Darkhovsky. Nonparametric Methods in Change-Point Problems. Kluwer Academic Publishers, 1993. [ bib ]
[Basseville1993Detection] M. Basseville and N. Nikiforov. Detection of abrupt changes: theory and application. Information and System Sciences Series. Prentice Hall Information, 1993. [ bib | .pdf ]
[Baird1993Document] H. Baird. Document image defect models and their uses. In Proceedings of the Second International Conference on Document Analysis and Recognition ICDAR-93, 1993. [ bib | .pdf ]
[Friedman1993Some] J. Friedman. Some geometric aspects of graphs and their eigenfunctions. Duke Math. J., 69:487-525, March 1993. [ bib ]
[Merhav1993Some] N. Merhav, M. Feder, and M. Gutman. Some properties of sequential predictors for binary Markov sources. IEEE Trans. Inform. Theory, 39(3):887-892, May 1993. [ bib | .pdf ]
Universal prediction of the next outcome of a binary sequence drawn from a Markov source with unknown parameters is considered. For a given source, the predictability is defined as the least attainable expected fraction of prediction errors. A lower bound is derived on the maximum rate at which the predictability is asymptotically approached uniformly over all sources in the Markov class. This bound is achieved by a simple majority predictor. For Bernoulli sources, bounds on the large deviations performance are investigated. A lower bound is derived for the probability that the fraction of errors will exceed the predictability by a prescribed amount ε > 0. This bound is achieved by the same predictor if ε is sufficiently small.

Keywords: information-theory source-coding
[Barron1993Universal] A.R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory, 39(3):930-945, May 1993. [ bib | .pdf ]
Approximation properties of a class of artificial neural networks are established. It is shown that feedforward networks with one layer of sigmoidal nonlinearities achieve integrated squared error of order O(1/n), where n is the number of nodes. The approximated function is assumed to have a bound on the first moment of the magnitude distribution of the Fourier transform. The nonlinear parameters associated with the sigmoidal nodes, as well as the parameters of linear combination, are adjusted in the approximation. In contrast, it is shown that for series expansions with n terms, in which only the parameters of linear combination are adjusted, the integrated squared approximation error cannot be made smaller than order 1/n^{2/d} uniformly for functions satisfying the same smoothness assumption, where d is the dimension of the input to the function. For the class of functions examined, the approximation rate and the parsimony of the parameterization of the networks are shown to be advantageous in high-dimensional settings.

Keywords: information-theory
[Almohamad1993linear] H.A. Almohamad and S.O. Duffuaa. A linear programming approach for the weighted graph matching problem. IEEE Trans. Pattern Anal. Mach. Intell., 15(5):522-525, May 1993. [ bib | DOI | http | .pdf ]
[Xu94pca] L. Xu and I. King. A pca approach for fast retrieval of structural patterns in attributed graphs. In Humboldt University Berlin, 1994. [ bib ]
[Weinberger1994Optimal] M. J. Weinberger, N. Merhav, and M. Feder. Optimal sequential probability assignment for individual sequences. IEEE Trans. Inform. Theory, 40(2):384-396, Mar 1994. [ bib | .pdf ]
The problem of sequential probability assignment for individual sequences is investigated. The authors compare the probabilities assigned by any sequential scheme to the performance of the best "batch" scheme (model) in some class. For the class of finite-state schemes and other related families, they derive a deterministic performance bound, analogous to the classical (probabilistic) minimum description length (MDL) bound. It holds for "most" sequences, similarly to the probabilistic setting, where the bound holds for "most" sources in a class. It is shown that the bound can be attained both pointwise and sequentially for any model family in the reference class and without any prior knowledge of its order. This is achieved by a universal scheme based on a mixing approach. The bound and its sequential achievability establish a completely deterministic significance to the concept of predictive MDL.

Keywords: information-theory
[Warmke1994family] J. W. Warmke and B. Ganetzky. A family of potassium channel genes related to eag in Drosophila and mammals. Proc. Natl. Acad. Sci. U. S. A., 91(8):3438-3442, Apr 1994. [ bib | http | .pdf ]
We have identified a conserved family of genes related to Drosophila eag, which encodes a distinct type of voltage-activated K+ channel. Three related genes were recovered in screens of cDNA libraries from Drosophila, mouse, and human tissues. One gene is the mouse counterpart of eag; the other two represent additional subfamilies. The human gene maps to chromosome 7. Family members share at least 47% amino acid identity in their hydrophobic cores and all contain a segment homologous to a cyclic nucleotide-binding domain. Sequence comparisons indicate that members of this family are most closely related to vertebrate cyclic nucleotide-gated cation channels and plant inward-rectifying K+ channels. The existence of another family of K+ channel structural genes further extends the known diversity of K+ channels and has important implications for the structure, function, and evolution of the superfamily of voltage-sensitive ion channels.

[Sette1994relationship] A. Sette, A. Vitiello, B. Reherman, P. Fowler, R. Nayersina, W. M. Kast, C. J. Melief, C. Oseroff, L. Yuan, J. Ruppert, J. Sidney, M. F. del Guercio, S. Southwood, R. T. Kubo, R. W. Chesnut, H. M. Grey, and F. V. Chisari. The relationship between class I binding affinity and immunogenicity of potential cytotoxic T cell epitopes. J. Immunol., 153(12):5586-5592, Dec 1994. [ bib ]
The relationship between binding affinity for HLA class I molecules and immunogenicity of discrete peptide epitopes has been analyzed in two different experimental approaches. In the first approach, the immunogenicity of potential epitopes ranging in MHC binding affinity over a 10,000-fold range was analyzed in HLA-A*0201 transgenic mice. In the second approach, the antigenicity of approximately 100 different hepatitis B virus (HBV)-derived potential epitopes, all carrying A*0201 binding motifs, was assessed by using PBL of acute hepatitis patients. In both cases, it was found that an affinity threshold of approximately 500 nM (preferably 50 nM or less) apparently determines the capacity of a peptide epitope to elicit a CTL response. These data correlate well with class I binding affinity measurements of either naturally processed peptides or previously described T cell epitopes. Taken together, these data have important implications for the selection of epitopes for peptide-based vaccines, and also formally demonstrate the crucial role of determinant selection in the shaping of T cell responses. Because in most (but not all) cases, high affinity peptides seem to be immunogenic, our data also suggest that holes in the functional T cell repertoire, if they exist, may be relatively rare.

Keywords: Amino Acid Sequence; Animals; Cell Line; Cytotoxicity Tests, Immunologic; Epitopes; HLA-A Antigens; Hepatitis B; Hepatitis B Antigens; Humans; Mice; Mice, Transgenic; Molecular Sequence Data; Peptides; Protein Binding; T-Lymphocytes, Cytotoxic
[Schuster1994elementary] S. Schuster and C. Hilgetag. On elementary flux modes in biochemical reaction systems at steady state. J. Biol. Syst., 2(2):165-182, 1994. [ bib | DOI | http | .pdf ]
[Rogers1994Application] D. Rogers and A. J. Hopfinger. Application of genetic function approximation to quantitative structure-activity relationships and quantitative structure-property relationships. J Chem Inf Comput Sci, 34:854-866, 1994. [ bib ]
[Parker1994Scheme] K. C. Parker, M. A. Bednarek, and J. E. Coligan. Scheme for ranking potential HLA-A2 binding peptides based on independent binding of individual peptide side-chains. J. Immunol., 152(1):163-175, Jan 1994. [ bib ]
A method to predict the relative binding strengths of all possible nonapeptides to the MHC class I molecule HLA-A2 has been developed based on experimental peptide binding data. These data indicate that, for most peptides, each side-chain of the peptide contributes a certain amount to the stability of the HLA-A2 complex that is independent of the sequence of the peptide. To quantify these contributions, the binding data from a set of 154 peptides were combined together to generate a table containing 180 coefficients (20 amino acids x 9 positions), each of which represents the contribution of one particular amino acid residue at a specified position within the peptide to binding to HLA-A2. Eighty peptides formed stable HLA-A2 complexes, as assessed by measuring the rate of dissociation of beta 2m. The remaining 74 peptides formed complexes that had a half-life of beta 2m dissociation of less than 5 min at 37 degrees C, or did not bind to HLA-A2, and were included because they could be used to constrain the values of some of the coefficients. The "theoretical" binding stability (calculated by multiplying together the corresponding coefficients) matched the experimental binding stability to within a factor of 5. The coefficients were then used to calculate the theoretical binding stability for all the previously identified self or antigenic nonamer peptides known to bind to HLA-A2. The binding stability for all other nonamer peptides that could be generated from the proteins from which these peptides were derived was also predicted. In every case, the previously described HLA-A2 binding peptides were ranked in the top 2% of all possible nonamers for each source protein. Therefore, most biologically relevant nonamer peptides should be identifiable using the table of coefficients. We conclude that the side-chains of most nonamer peptides to the first approximation bind independently of one another to the HLA-A2 molecule.

Keywords: immunoinformatics
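
The scoring rule described in the abstract above (one coefficient per amino acid and position, multiplied together over the nine positions) can be sketched as follows. This is a minimal sketch under stated assumptions: the coefficient values below are hypothetical and are not taken from the published table, the function name `binding_score` is illustrative, and the neutral default of 1.0 for missing entries is a simplification made for the example.

```python
def binding_score(peptide, coefficients, default=1.0):
    """Multiply the per-position coefficients of a nonamer peptide.

    `coefficients` maps (residue, position) to a multiplicative factor,
    with positions numbered 1..9. Entries that are missing from the
    table fall back to `default` (an assumption of this sketch).
    """
    if len(peptide) != 9:
        raise ValueError("expected a nonamer (9 residues)")
    score = 1.0
    for position, residue in enumerate(peptide, start=1):
        score *= coefficients.get((residue, position), default)
    return score


# Hypothetical coefficients for illustration only; a full table would
# hold 20 amino acids x 9 positions = 180 entries.
toy_table = {("L", 2): 4.0, ("V", 9): 2.0, ("P", 1): 0.5}

score_a = binding_score("ALAAAAAAV", toy_table)  # 4.0 * 2.0
score_b = binding_score("PLAAAAAAV", toy_table)  # 0.5 * 4.0 * 2.0
```

Ranking all nonamers of a source protein then amounts to sliding a window of length nine over the sequence and sorting by this score.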
[Paatero1994Positive] P. Paatero and U. Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2):111-126, 1994. [ bib | DOI | http ]
[Nesterov1994Interior] Y. Nesterov and A. Nemirovsky. Interior point polynomial methods in convex programming: Theory and applications. SIAM, 1994. [ bib ]
[Krogh1994Hidden] A. Krogh, M. Brown, I. Mian, K. Sjolander, and D. Haussler. Hidden Markov models in computational biology: Applications to protein modeling. J. Mol. Biol., 235:1501-1531, 1994. [ bib ]
[Gyorfi1994There] L. Gyorfi, I. Pali, and E.C. Van der Meulen. There is no universal source code for an infinite source alphabet. IEEE Trans. Inform. Theory, 40(1):267-271, Jan 1994. [ bib | .pdf ]
Shows that a discrete infinite distribution with finite entropy cannot be estimated consistently in information divergence. As a corollary the authors show that there is no universal source code for an infinite source alphabet over the class of all discrete memoryless sources with finite entropy

Keywords: information-theory
[Foster1994A] Dean P. Foster and Edward I. George. The risk inflation criterion for multiple regression. The Annals of Statistics, 22(4):1947-1975, 1994. [ bib | DOI | http ]
A new criterion is proposed for the evaluation of variable selection procedures in multiple regression. This criterion, which we call the risk inflation, is based on an adjustment to the risk. Essentially, the risk inflation is the maximum increase in risk due to selecting rather than knowing the "correct" predictors. A new variable selection procedure is obtained which, in the case of orthogonal predictors, substantially improves on AIC, Cp and BIC and is close to optimal. In contrast to AIC, Cp and BIC which use dimensionality penalties of 2, 2 and log n, respectively, this new procedure uses a penalty 2 log p, where p is the number of available predictors. For the case of nonorthogonal predictors, bounds for the optimal penalty are obtained.

Keywords: mdl, regression
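
The dimensionality penalties compared in the abstract above can be tabulated with a short sketch. The function name `selection_penalties` and the example values of n and p are hypothetical. Only the per-variable penalties stated in the abstract are encoded (2 for AIC and Cp, log n for BIC, 2 log p for the risk inflation criterion), not the full selection procedures.

```python
import math


def selection_penalties(n, p):
    """Per-variable penalties for n observations and p available predictors."""
    return {
        "AIC": 2.0,
        "Cp": 2.0,
        "BIC": math.log(n),
        "RIC": 2.0 * math.log(p),
    }


# Hypothetical problem size: 100 observations, 50 candidate predictors.
penalties = selection_penalties(n=100, p=50)
```

For these illustrative sizes the RIC penalty exceeds the BIC penalty, since 2 log 50 is larger than log 100. The ordering reverses when n is larger than p squared.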
[Feder1994Relations] M. Feder and N. Merhav. Relations between entropy and error probability. IEEE Trans. Inform. Theory, 40(1):259-266, Jan 1994. [ bib | .pdf ]
The relation between the entropy of a discrete random variable and the minimum attainable probability of error made in guessing its value is examined. While Fano's inequality provides a tight lower bound on the error probability in terms of the entropy, the present authors derive a converse result: a tight upper bound on the minimal error probability in terms of the entropy. Both bounds are sharp, and can draw a relation, as well, between the error probability for the maximum a posteriori (MAP) rule, and the conditional entropy (equivocation), which is a useful uncertainty measure in several applications. Combining this relation and the classical channel coding theorem, the authors present a channel coding theorem for the equivocation which, unlike the channel coding theorem for error probability, is meaningful at all rates. This theorem is proved directly for DMCs, and from this proof it is further concluded that for R ≥ C the equivocation achieves its minimal value of R-C at the rate of n^{1/2}, where n is the block length.

Keywords: information-theory
[Donoho1994Denoising] David L. Donoho. De-noising by soft-thresholding. IEEE Trans. IT, 41(3):613-627, 1994. [ bib ]
[Comon1994Independent] Pierre Comon. Independent component analysis: a new concept? Signal Processing, 36(3):287-314, 1994. [ bib ]
[Clarke1994Jeffreys] B. S. Clarke and A. R. Barron. Jeffreys' prior is asymptotically least favorable under entropy risk. J. Stat. Plann. Infer., 31(1):37-60, 1994. [ bib | DOI | http | .pdf ]
We provide a rigorous proof that Jeffreys' prior asymptotically maximizes Shannon's mutual information between a sample of size n and the parameter. This was conjectured by Bernardo (1979) and, despite the absence of a proof, forms the basis of the reference prior method in Bayesian statistical analysis. Our proof rests on an examination of large sample decision theoretic properties associated with the relative entropy or the Kullback-Leibler distance between probability density functions for independent and identically distributed random variables. For smooth finite-dimensional parametric families we derive an asymptotic expression for the minimax risk and for the related maximin risk. As a result, we show that, among continuous positive priors, Jeffreys' prior uniquely achieves the asymptotic maximin value. In the discrete parameter case we show that, asymptotically, the Bayes risk reduces to the entropy of the prior so that the reference prior is seen to be the maximum entropy prior. We identify the physical significance of the risks by giving two information-theoretic interpretations in terms of probabilistic coding.

Keywords: information-theory
[Clark1994Do] G. M. Clark. Do we really need prognostic factors for breast cancer? Breast Cancer Res Treat, 30:117-126, 1994. [ bib | DOI | http ]
[Baldi1994Hidden] P. Baldi, Y. Chauvin, T. Hunkapiller, and M.A. McClure. Hidden Markov models of biological primary sequence information. Proc. Natl. Acad. Sci. USA, 91(3):1053-1063, 1994. [ bib | .pdf ]
[Bailey1994Fitting] T. L. Bailey and C. Elkan. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol, 2:28-36, 1994. [ bib | http ]
The algorithm described in this paper discovers one or more motifs in a collection of DNA or protein sequences by using the technique of expectation maximization to fit a two-component finite mixture model to the set of sequences. Multiple motifs are found by fitting a mixture model to the data, probabilistically erasing the occurrences of the motif thus found, and repeating the process to find successive motifs. The algorithm requires only a set of unaligned sequences and a number specifying the width of the motifs as input. It returns a model of each motif and a threshold which together can be used as a Bayes-optimal classifier for searching for occurrences of the motif in other databases. The algorithm estimates how many times each motif occurs in each sequence in the dataset and outputs an alignment of the occurrences of the motif. The algorithm is capable of discovering several different motifs with differing numbers of occurrences in a single dataset.

[Algoet1994strong] P.H. Algoet. The strong law of large numbers for sequential decisions under uncertainty. IEEE Trans. Inform. Theory, 40(3):609-633, May 1994. [ bib | DOI | http | .pdf ]
Combines optimization and ergodic theory to characterize the optimum long-run average performance that can be asymptotically attained by nonanticipating sequential decisions. Let X_t be a stationary ergodic process, and suppose an action b_t must be selected in a space ℬ with knowledge of the t-past (X_0, ..., X_{t-1}) at the beginning of every period t ≥ 0. Action b_t will incur a loss l(b_t, X_t) at the end of period t when the random variable X_t is revealed. The author proves under mild integrability conditions that the optimum strategy is to select actions that minimize the conditional expected loss given the currently available information at each step. The minimum long-run average loss per decision can be approached arbitrarily closely by strategies that are finite-order Markov, and under certain continuity conditions, it is equal to the minimum expected loss given the infinite past. If the loss l(b, x) is bounded and continuous and if the space ℬ is compact, then the minimum can be asymptotically attained, even if the distribution of the process X_t is unknown a priori and must be learned from experience.

Keywords: information-theory
[Wagener1995Autocorrelation] M. Wagener, J. Sadowski, and J. Gasteiger. Autocorrelation of molecular surface properties for modeling corticosteroid binding globulin and cytosolic Ah receptor activity by neural networks. J. Am. Chem. Soc., 117:7769-7775, 1995. [ bib ]
[Vapnik1995nature] Vladimir N. Vapnik. The nature of statistical learning theory. Springer-Verlag New York, Inc., New York, NY, USA, 1995. [ bib | http ]
Keywords: svms
[Talagrand1995Concentration] M. Talagrand. Concentration of measure and isoperimetric inequalities in product spaces. Publ. Math. I.H.E.S., 81:73-203, 1995. [ bib | .dvi | .pdf ]
[Lee1995Mouse] L.M. Silver. Mouse Genetics: Concepts and Applications. Oxford University Press, 1995. [ bib | http ]
[Sidney1995Several] J. Sidney, M. F. del Guercio, S. Southwood, V. H. Engelhard, E. Appella, H. G. Rammensee, K. Falk, O. Rötzschke, M. Takiguchi, and R. T. Kubo. Several HLA alleles share overlapping peptide specificities. J. Immunol., 154(1):247-259, Jan 1995. [ bib ]
Herein we describe the establishment of assays to measure peptide binding to purified HLA-B*0701, -B*0801, -B*2705, -B*3501-03, -B*5401, -Cw*0401, -Cw*0602, and -Cw*0702 molecules. The binding of known peptide epitopes or naturally processed peptides correlates well with HLA restriction or origin, underscoring the immunologic relevance of these assays. Analysis of the sequences of various HLA class I alleles suggested that alleles with peptide motifs characterized by proline in position 2 and aromatic or hydrophobic residues at their C-terminus shared key consensus residues at positions 9, 63, 66, 67, and 70 (B pocket) and residue 116 (F pocket). Prediction of the peptide-binding specificity of HLA-B*5401, on the basis of this consensus B and F pocket structure, verified this hypothesis and suggested that a relatively large family of HLA-B alleles (which we have defined as the HLA-B7-like supertype) may significantly overlap in peptide binding specificity. Availability of quantitative binding assays allowed verification that, indeed, many (25%) of the peptide ligands carrying proline in position 2 and hydrophobic/aromatic residues at the C-terminus (the B7-like supermotif) were capable of binding at least three of five HLA-B7-like supertype alleles. Identification of epitopes carrying the B7-like supermotif and binding to a family of alleles represented in over 40% of individuals from all major ethnic groups may be of considerable use in the design of peptide vaccines.

Keywords: Alleles; Amino Acid Sequence; Cell Line, Transformed; Consensus Sequence; Epitopes; Genes, MHC Class I; HLA-B Antigens; HLA-C Antigens; Humans; Molecular Sequence Data; Peptide Fragments; Protein Binding; Protein Structure, Tertiary; Structure-Activity Relationship; Substrate Specificity
[Scholkopf1995Extracting] B. Schölkopf, C. Burges, and V. Vapnik. Extracting support data for a given task. In M. Fayyad and R. Uthurusamy, editors, Proceedings of the First International Conference on Knowledge Discovery & Data Mining. AAAI Press, 1995. [ bib ]
[Schena1995Quantitative] M. Schena, D. Shalon, R. W. Davis, and P. O. Brown. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270(5235):467-470, Oct 1995. [ bib | DOI | http | .pdf ]
A high-capacity system was developed to monitor the expression of many genes in parallel. Microarrays prepared by high-speed robotic printing of complementary DNAs on glass were used for quantitative expression measurements of the corresponding genes. Because of the small format and high density of the arrays, hybridization volumes of 2 microliters could be used that enabled detection of rare transcripts in probe mixtures derived from 2 micrograms of total cellular messenger RNA. Differential expression measurements of 45 Arabidopsis genes were made by means of simultaneous, two-color fluorescence hybridization.

Keywords: microarray
[Rosenfeld1995Flexible] R. Rosenfeld, Q. Zheng, S. Vajda, and C. DeLisi. Flexible docking of peptides to class I major-histocompatibility-complex receptors. Genet. Anal., 12(1):1-21, Mar 1995. [ bib ]
We present a new method for docking flexible peptides to class I Major-Histocompatibility-Complex (MHC) receptors. Docking is performed in two steps: (a) The charged terminal peptide residues are located by randomly distributing multiple copies of each in volumes of approximately 150 A at either end of the binding groove, and then minimizing the system energy using a modified multiple-copy search algorithm. This is followed by (b) construction of the intervening chain using the multiple-copy bond-scaling-relaxation loop closure algorithm. In both steps, the copies tend to cluster and the size of the resulting clusters is proportional to the basin of attraction of the corresponding energy well. We show that native MHC-bound peptides have broad minima and, consequently, that misfolded, low-energy peptide conformations can be eliminated by restricting consideration to groups of peptides which cluster into broad minima. The accuracy of the method is assessed by comparing the predictions with crystallographic data for three different MHC peptide systems, at various degrees of stringency: (a) the extent to which we can determine side chain function (anchor vs. T-cell epitopes); (b) the extent to which we can determine the peptide-receptor orientation; and (c) the accuracy with which we can predict atomic coordinates. We find the method correct on (a) for 19 of the 22 non-Gly positions; the failures appearing to be a consequence of omitting solvation. Predictions related to (b) are also very encouraging, with the overall orientation of the predicted peptides being very similar to the crystal conformation, when measured by the hydrogen bonding pattern between the two. The degree of success in predicting atomic coordinates varied considerably, however, from 1.4 A for the HLA-A2 peptide to 2.7 A for the Kb peptide. The inaccuracy of the latter appears to reflect an incomplete target function, most likely the omission of solvation.
The calculations thus define the current limits of accuracy in docking flexible peptides to Class I receptors and identify the methodological improvements that must be made for the next advance in accuracy.

Keywords: immunoinformatics
[Ravdin1995computer] P. M. Ravdin. A computer based program to assist in adjuvant therapy decisions for individual breast cancer patients. Bull Cancer, 82 Suppl 5:561s-564s, Dec 1995. [ bib ]
This paper describes a personal computer based tool to aid in decision making about whether a woman should receive adjuvant therapy for breast cancer. This tool can assist in engaging women with primary breast cancer in the discussion about: 1) her risk of breast cancer related mortality if she receives only local control measures, but no systemic adjuvant therapy, 2) how much receiving adjuvant therapy may reduce this risk, and 3) what the impact of receiving the adjuvant systemic therapy is in terms of survival. The tool utilizes life table analytical techniques to project outcomes after entry of patient age (used to calculate natural mortality rates), estimated risk of breast cancer related mortality (with a help tool allowing the physician to use estimates based on national database information), and estimate of the efficacy of adjuvant chemotherapy (with included tables of estimates based on the Early Breast Cancer Trialists' meta-analysis). Computer based tools can serve as valuable aids in patient and physician education, and the process of informed decision making.

Keywords: Aged; Breast Neoplasms, drug therapy/mortality; Chemotherapy, Adjuvant; Decision Making, Computer-Assisted; Education, Medical; Female; Humans; Life Tables; Middle Aged; Patient Participation; Prognosis; Software; Survival Rate
[Rammensee1995MHC] H. G. Rammensee, T. Friede, and S. Stevanović. MHC ligands and peptide motifs: first listing. Immunogenetics, 41(4):178-228, 1995. [ bib ]
Keywords: immunoinformatics
[Polonik1995Measuring] W. Polonik. Measuring Mass Concentrations and Estimating Density Contour Clusters-An Excess Mass Approach. Ann. Stat., 23(3):855-881, 1995. [ bib | http | .pdf ]
[Pardalos95aparallel] P. M. Pardalos, L. S. Pitsoulis, and M. G. C. Resende. A parallel GRASP implementation for the quadratic assignment problem. In Parallel Algorithms for Irregularly Structured Problems, pages 111-130. Kluwer Academic Publishers, 1995. [ bib ]
[Natarajan1995Sparse] B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM J. Comput., 24(2):227-234, 1995. [ bib | DOI ]
[Murzin1995SCOP] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247:536-540, 1995. [ bib ]
[Molloy1995critical] M. Molloy and B. Reed. A critical point for random graphs with a given degree sequence. Random Struct. Algorithm., 6:161-179, 1995. [ bib | .ps | .pdf ]
[Levin1995Stock] A. E. Levin. Stock selection via nonlinear multi-factor models. In D. S. Touretzky, M. Mozer, and M. E. Hasselmo, editors, Adv. Neural. Inform. Process Syst., volume 8, pages 966-972. MIT Press, 1995. [ bib | .pdf ]
[Kass1995Reference] R. E. Kass and L. Wasserman. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. J. Am. Stat. Assoc., 90(431):928-934, 1995. [ bib | DOI | http | .pdf ]
[Girosi1995Regularization] F. Girosi, M. Jones, and T. Poggio. Regularization Theory and Neural Networks Architectures. Neural Comput., 7(2):219-269, 1995. [ bib | .html | .pdf ]
[Filatov1995Graph] A. Filatov, A. Gitis, and I. Kil. Graph-based handwritten digit string recognition. In ICDAR '95: Proceedings of the Third International Conference on Document Analysis and Recognition (Volume 2), page 845, Washington, DC, USA, 1995. IEEE Computer Society. [ bib ]
[Curran1995molecular] M. E. Curran, I. Splawski, K. W. Timothy, G. M. Vincent, E. D. Green, and M. T. Keating. A molecular basis for cardiac arrhythmia: HERG mutations cause long QT syndrome. Cell, 80(5):795-803, Mar 1995. [ bib ]
To identify genes involved in cardiac arrhythmia, we investigated patients with long QT syndrome (LQT), an inherited disorder causing sudden death from a ventricular tachyarrhythmia, torsade de pointes. We previously mapped LQT loci on chromosomes 11 (LQT1), 7 (LQT2), and 3 (LQT3). Here, linkage and physical mapping place LQT2 and a putative potassium channel gene, HERG, on chromosome 7q35-36. Single strand conformation polymorphism and DNA sequence analyses reveal HERG mutations in six LQT families, including two intragenic deletions, one splice-donor mutation, and three missense mutations. In one kindred, the mutation arose de novo. Northern blot analyses show that HERG is strongly expressed in the heart. These data indicate that HERG is LQT2 and suggest a likely cellular mechanism for torsade de pointes.

Keywords: herg
[Cortes1995Support-Vector] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273-297, 1995. doi:10.1023/A:1022627411411. [ bib | http ]
The support-vector network is a new learning machine for two-group classification problems. The machine conceptually implements the following idea: input vectors are non-linearly mapped to a very high-dimension feature space. In this feature space a linear decision surface is constructed. Special properties of the decision surface ensure high generalization ability of the learning machine. The idea behind the support-vector network was previously implemented for the restricted case where the training data can be separated without errors. We here extend this result to non-separable training data.

[Caldwell1995introduction] J. Caldwell, I. Gardner, and N. Swales. An introduction to drug disposition: the basic principles of absorption, distribution, metabolism, and excretion. Toxicol. Pathol., 23(2):102-114, 1995. [ bib ]
A knowledge of the fate of a drug, its disposition (absorption, distribution, metabolism, and excretion, known by the acronym ADME) and pharmacokinetics (the mathematical description of the rates of these processes and of concentration-time relationships), plays a central role throughout pharmaceutical research and development. These studies aid in the discovery and selection of new chemical entities, support safety assessment, and are critical in defining conditions for safe and effective use in patients. ADME studies provide the only basis for critical judgments from situations where the behavior of the drug is understood to those where it is unknown: this is most important in bridging from animal studies to the human situation. This presentation is intended to provide an introductory overview of the life cycle of a drug in the animal body and indicates the significance of such information for a full understanding of mechanisms of action and toxicity.

Keywords: chemogenomics
[Benjamini1995Controlling] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B, 57:289-300, 1995. [ bib | .pdf ]
[Willems1995Context] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens. The Context Tree Weighting Method: Basic Properties. IEEE Trans. Inform. Theory, 41(3):653-664, May 1995. [ bib | .ps | .pdf ]
Describes a sequential universal data compression procedure for binary tree sources that performs the "double mixture." Using a context tree, this method weights in an efficient recursive way the coding distributions corresponding to all bounded memory tree sources, and achieves a desirable coding distribution for tree sources with an unknown model and unknown parameters. Computational and storage complexity of the proposed procedure are both linear in the source sequence length. The authors derive a natural upper bound on the cumulative redundancy of the method for individual sequences. The three terms in this bound can be identified as coding, parameter, and model redundancy. The bound holds for all source sequence lengths, not only for asymptotically large lengths. The analysis that leads to this bound is based on standard techniques and turns out to be extremely simple. The upper bound on the redundancy shows that the proposed context-tree weighting procedure is optimal in the sense that it achieves the Rissanen (1984) lower bound.

[Weinberger1995universal] M. J. Weinberger, J. J. Rissanen, and M. Feder. A universal finite memory source. IEEE Trans. Inform. Theory, 41(3):643-652, May 1995. [ bib | .pdf ]
An irreducible parameterization for a finite memory source is constructed in the form of a tree machine. A universal information source for the set of finite memory sources is constructed by a predictive modification of an earlier studied algorithm, Context. It is shown that this universal source incorporates any minimal data-generating tree machine in an asymptotically optimal manner in the following sense: the negative logarithm of the probability it assigns to any long typical sequence, generated by any tree machine, approaches that assigned by the tree machine at the best possible rate.

Keywords: information-theory
[Merhav1995strong] N. Merhav and M. Feder. A strong version of the redundancy-capacity theorem of universal coding. IEEE Trans. Inform. Theory, 41(3):714-722, May 1995. [ bib | .pdf ]
The capacity of the channel induced by a given class of sources is well known to be an attainable lower bound on the redundancy of universal codes with respect to this class, both in the minimax sense and in the Bayesian (maximin) sense. We show that this capacity is essentially a lower bound also in a stronger sense, that is, for "most" sources in the class. This result extends Rissanen's (1984, 1986) lower bound for parametric families. We demonstrate the applicability of this result in several examples, e.g., parametric families with growing dimensionality, piecewise-fixed sources, arbitrarily varying sources, and noisy samples of learnable functions. Finally, we discuss implications of our results to statistical inference.

Keywords: information-theory
[Lugosi1995Nonparametric] G. Lugosi and K. Zeger. Nonparametric estimation via empirical risk minimization. IEEE Trans. Inform. Theory, 41(3):677-687, May 1995. [ bib | .pdf ]
A general notion of universal consistency of nonparametric estimators is introduced that applies to regression estimation, conditional median estimation, curve fitting, pattern recognition, and learning concepts. General methods for proving consistency of estimators based on minimizing the empirical error are shown. In particular, distribution-free almost sure consistency of neural network estimates and generalized linear estimators is established.

Keywords: information-theory
[Yu1996Lower] B. Yu. Lower bounds on expected redundancy for nonparametric classes. IEEE Trans. Inform. Theory, 42(1):272-275, Jan 1996. [ bib | DOI | http | .pdf ]
The article focuses on lower bound results on expected redundancy for universal coding of independent and identically distributed data on [0, 1] from parametric and nonparametric families. After reviewing existing lower bounds, we provide a new proof for minimax lower bounds on expected redundancy over nonparametric density classes. This new proof is based on the calculation of a mutual information quantity, or it utilizes the relationship between redundancy and Shannon capacity. It therefore unifies the minimax redundancy lower bound proofs in the parametric and nonparametric cases.

Keywords: information-theory
[Willems1996Context] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens. Context Weighting for General Finite Context Sources. IEEE Trans. Inform. Theory, 42(5):1514-1520, 1996. [ bib | .ps | .pdf ]
[Willems1996Coding] F. M. J. Willems. Coding for a Binary Independent Piecewise-Identically Distributed Source. IEEE Trans. Inform. Theory, 42:2210-2217, Nov 1996. [ bib | .ps | .pdf ]
[Wilkins1996From] M. R. Wilkins, C. Pasquali, R. D. Appel, K. Ou, O. Golaz, J. C. Sanchez, J. X. Yan, A. A. Gooley, G. Hughes, I. Humphery-Smith, K. L. Williams, and D. F. Hochstrasser. From proteins to proteomes: large scale protein identification by two-dimensional electrophoresis and amino acid analysis. Biotechnology (N Y), 14(1):61-65, Jan 1996. [ bib ]
Separation and identification of proteins by two-dimensional (2-D) electrophoresis can be used for protein-based gene expression analysis. In this report single protein spots, from polyvinylidene difluoride blots of micropreparative E. coli 2-D gels, were rapidly and economically identified by matching their amino acid composition, estimated pI and molecular weight against all E. coli entries in the SWISS-PROT database. Thirty proteins from an E. coli 2-D map were analyzed and identities assigned. Three of the proteins were unknown. By protein sequencing analysis, 20 of the 27 proteins were correctly identified. Importantly, correct identifications showed unambiguous "correct" score patterns, while incorrect protein identifications also showed distinctive score patterns, indicating that the protein must be identified by other means. These techniques allow large-scale screening of the protein complement of simple organisms, or tissues in normal and disease states. The computer program described here is accessible via the World Wide Web at URL address (http://expasy.hcuge.ch/).

Keywords: Amino Acids; Bacterial Proteins; Blood Proteins; Databases, Factual; Electrophoresis, Gel, Two-Dimensional; Escherichia coli; Humans; Microchemistry; Molecular Weight; Multienzyme Complexes; Proteins; Reproducibility of Results; Software; Time Factors
[Tibshirani1996Regression] R. Tibshirani. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B, 58(1):267-288, 1996. [ bib ]
[Talagrand1996Newa] M. Talagrand. A New Look at Independence. Ann. Probab., 24:1-34, 1996. [ bib | .dvi | .pdf ]
[Talagrand1996New] M. Talagrand. New concentration inequalities for product spaces. Inventiones Math., 126:505-563, 1996. [ bib | .dvi | .pdf ]
[Talagrand1996Majorizing] M. Talagrand. Majorizing measures: The generic chaining. Ann. Probab., 24:1049-1103, 1996. [ bib | .dvi | .pdf ]
[Stadler1996Landscapes] P. F. Stadler. Landscapes and Their Correlation Functions. J. Math. Chem., 20:1-45, 1996. [ bib | .html | .pdf ]
[Sidney1996Definition] J. Sidney, H. M. Grey, S. Southwood, E. Celis, P. A. Wentworth, M. F. del Guercio, R. T. Kubo, R. W. Chesnut, and A. Sette. Definition of an HLA-A3-like supermotif demonstrates the overlapping peptide-binding repertoires of common HLA molecules. Hum Immunol, 45(2):79-93, Feb 1996. [ bib ]
An HLA-A3-like supertype (minimally comprised of products from the HLA class I alleles A3, A11, A31, A*3301, and A*6801) has been defined on the basis of (a) structural similarities in the antigen-binding groove, (b) shared main anchor peptide-binding motifs, (c) the identification of peptides cross-reacting with most or all of these molecules, and (d) the definition of an A3-like supermotif that efficiently predicts highly cross-reactive peptides. Detailed secondary anchor maps for A3, A11, A31, A*3301, and A*6801 are also described. The biologic relevance of the A3-like supertype is indicated by the fact that high frequencies of the A3-like supertype alleles are conserved in all major ethnic groups. Because A3-like supertype alleles are found in most major HLA evolutionary lineages, possibly a reflection of common ancestry, the A3-like supermotif might in fact represent a primeval human HLA class I peptide-binding specificity. It is also possible that these phenomena might be related to optimal exploitation of the peptide specificity by human TAP molecules. The grouping of HLA alleles into supertypes on the basis of their overlapping peptide-binding repertoires represents an alternative to serologic or phylogenetic classification.

Keywords: Alleles; Amino Acid Sequence; Cell Line, Transformed; Cross Reactions; HLA Antigens; HLA-A3 Antigen; HLA-B Antigens; Haplotypes; Humans; Molecular Sequence Data; Peptide Fragments; Protein Binding; Structure-Activity Relationship
[Scholkopf1996Incorporating] B. Schölkopf, C. Burges, and V. Vapnik. Incorporating invariances in support vector learning machines. In C. von der Malsburg, W. von Seelen, J. C. Vorbrüggen, and B. Sendhoff, editors, ICANN 96: Proceedings of the 1996 International Conference on Artificial Neural Networks, pages 47-52, London, UK, 1996. Springer-Verlag. [ bib | .pdf ]
[Rissanen1996Fisher] J. J. Rissanen. Fisher information and stochastic complexity. IEEE Trans. Inform. Theory, 42(1):40-47, Jan 1996. [ bib | .pdf ]
By taking into account the Fisher information and removing an inherent redundancy in earlier two-part codes, a sharper code length as the stochastic complexity and the associated universal process are derived for a class of parametric processes. The main condition required is that the maximum-likelihood estimates satisfy the central limit theorem. The same code length is also obtained from the so-called maximum-likelihood code.

Keywords: information-theory
[Ravdin1996computer] P. M. Ravdin. A computer program to assist in making breast cancer adjuvant therapy decisions. Semin Oncol, 23(1 Suppl 2):43-50, Feb 1996. [ bib ]
This report describes a computer program designed to assist health care professionals in making projections of the average benefit of systemic adjuvant therapy for individual breast cancer patients. It requires as input patient age (used to make projections of natural mortality), an estimate of breast cancer-related mortality at 5 years (used to make projections of breast cancer-specific mortality), and the proportional risk reduction for breast cancer mortality expected for the adjuvant therapy (with included tables from the Early Breast Cancer Trialists' 1992 meta-analysis). The program uses life table analytical techniques to make projections of outcome in three scenarios: that the breast cancer never occurred, that the breast cancer patient received definitive local therapy but no adjuvant systemic therapy, and that the patient received adjuvant therapy. The outcome projections are given for total, natural (non-breast cancer-related), and breast cancer-related mortality at several time points and also of total remaining life expectancy. These estimates are currently widely made by clinicians by nonnumerical techniques. Computer-based tools can serve as valuable aids in physician and patient education and in the process of informed decision making.

Keywords: Adult; Age Factors; Aged; Aged, 80 and over; Breast Neoplasms, drug therapy/mortality/surgery; Chemotherapy, Adjuvant; Decision Making; Female; Humans; Life Expectancy; Life Tables; Middle Aged; Proportional Hazards Models; Risk Factors; SEER Program; Software; Survival Rate; Treatment Outcome
[Rarey1996Placement] M. Rarey, S. Wefing, and T. Lengauer. Placement of medium-sized molecular fragments into active sites of proteins. J Comput Aided Mol Des, 10(1):41-54, Feb 1996. [ bib ]
We present an algorithm for placing molecular fragments into the active site of a receptor. A molecular fragment is defined as a connected part of a molecule containing only complete ring systems. The algorithm is part of a docking tool, called FLEXX, which is currently under development at GMD. The overall goal is to provide means of automatically computing low-energy conformations of the ligand within the active site, with an accuracy approaching the limitations of experimental methods for resolving molecular structures and within a run time that allows for docking large sets of ligands. The methods by which we plan to achieve this goal are the explicit exploitation of molecular flexibility of the ligand and the incorporation of physicochemical properties of the molecules. The algorithm for fragment placement, which is the topic of this paper, is based on pattern recognition techniques and is able to predict a small set of possible positions of a molecular fragment with low flexibility within seconds on a workstation. In most cases, a placement with rms deviation below 1.0 A with respect to the X-ray structure is found among the 10 highest ranking solutions, assuming that the receptor is given in the bound conformation.

Keywords: Algorithms; Binding Sites; Databases, Factual; Ligands; Models, Chemical; Peptide Fragments, chemistry; Proteins, chemistry; Software
[Rarey1996fast] M. Rarey, B. Kramer, T. Lengauer, and G. Klebe. A fast flexible docking method using an incremental construction algorithm. J. Mol. Biol., 261(3):470-489, Aug 1996. [ bib | DOI | http ]
We present an automatic method for docking organic ligands into protein binding sites. The method can be used in the design process of specific protein ligands. It combines an appropriate model of the physico-chemical properties of the docked molecules with efficient methods for sampling the conformational space of the ligand. If the ligand is flexible, it can adopt a large variety of different conformations. Each such minimum in conformational space presents a potential candidate for the conformation of the ligand in the complexed state. Our docking method samples the conformation space of the ligand on the basis of a discrete model and uses a tree-search technique for placing the ligand incrementally into the active site. For placing the first fragment of the ligand into the protein, we use hashing techniques adapted from computer vision. The incremental construction algorithm is based on a greedy strategy combined with efficient methods for overlap detection and for the search of new interactions. We present results on 19 complexes of which the binding geometry has been crystallographically determined. All considered ligands are docked in at most three minutes on a current workstation. The experimentally observed binding mode of the ligand is reproduced with 0.5 to 1.2 A rms deviation. It is almost always found among the highest-ranking conformations computed.

Keywords: Aldehyde Reductase, Algorithms, Amiloride, Aminoimidazole Carboxamide, Animals, Arabinose, Automation, Binding Sites, Carbonic Anhydrases, Computational Biology, Computer Simulation, Concanavalin A, Crystallography, Databases, Drug Design, Drug Evaluation, Enzyme Inhibitors, Factual, Folic Acid, Folic Acid Antagonists, Fructose-Bisphosphatase, Humans, Internet, Ligands, Methotrexate, Models, Molecular, Non-U.S. Gov't, Pancreatic Elastase, Pentamidine, Pliability, Point Mutation, Preclinical, Protein Binding, Protein Conformation, Proteins, Reproducibility of Results, Research Support, Ribonucleosides, Software, Tetrahydrofolate Dehydrogenase, Thermolysin, Time Factors, Trypsin, X-Ray, 8780787
[Rangarajan96lagrangian] A. Rangarajan and E. Mjolsness. A Lagrangian relaxation network for graph matching. In IEEE Trans. Neural Networks, pages 4629-4634. IEEE Press, 1996. [ bib ]
[Pickett1996Diversity] S. D. Pickett, J. S. Mason, and I. M. McLay. Diversity profiling and design using 3D pharmacophores: Pharmacophore-Derived Queries (PDQ). J. Chem. Inf. Comput. Sci., 36(6):1214-1223, 1996. [ bib | DOI | http | .pdf ]
The current interest in combinatorial chemistry for lead generation has necessitated the development of methods for design and evaluation of the diversity of the resultant compound libraries. Such methods also have application in selecting diverse sets of compounds for general screening from corporate databases and in the analysis of large sets of structures to identify common patterns. In this paper we describe a novel methodology for calculating diversity and identifying common features based on the three-point pharmacophores expressed by a compound. The method has been implemented within the environment of the Chem-X molecular modeling package (ChemDBS-3D), using a systematic analysis of 3D distance space with three point combinations of six pharmacophoric groups. The strategy used to define the pharmacophores is discussed, including an in-house developed atom type parameterization. The method is compared with the related approach being developed into the ChemDiverse module of Chem-X. Results from an analysis of a large corporate database and examples of combinatorial library profiling with both methods are presented. The use of 3D pharmacophores for assessing diversity, and the application of such methods to combinatorial library design, is discussed.

Keywords: chemoinformatics
[Modha1996Minimum] D.S. Modha and E. Masry. Minimum complexity regression estimation with weakly dependent observations. IEEE Trans. Inform. Theory, 42(6):2133-2145, Nov 1996. [ bib | DOI | http | .pdf ]
The minimum complexity regression estimation framework (Barron, 1991; Barron and Cover, 1991 and Rissanen, 1989) is a general data-driven methodology for estimating a regression function from a given list of parametric models using independent and identically distributed (i.i.d.) observations. We extend Barron's regression estimation framework to m-dependent observations and to strongly mixing observations. In particular, we propose abstract minimum complexity regression estimators for dependent observations, which may be adapted to a particular list of parametric models, and establish upper bounds on the statistical risks of the proposed estimators in terms of certain deterministic indices of resolvability. Assuming that the regression function satisfies a certain Fourier-transform-type representation, we examine minimum complexity regression estimators adapted to a list of parametric models based on neural networks and by using the upper bounds for the abstract estimators, we establish rates of convergence for the statistical risks of these estimators. Also, as a key tool, we extend the classical Bernstein inequality from i.i.d. random variables to m-dependent processes and to strongly mixing processes.

Keywords: information-theory
[Lugosi1996Concept] G. Lugosi and K. Zeger. Concept learning using complexity regularization. IEEE Trans. Inform. Theory, 42(1):48-54, Jan 1996. [ bib | .pdf ]
In pattern recognition or, as it has also been called, concept learning, the value of a $\{0,1\}$-valued random variable $Y$ is to be predicted based upon observing an $\mathbb{R}^d$-valued random variable $X$. We apply the method of complexity regularization to learn concepts from large concept classes. The method is shown to automatically find a good balance between the approximation error and the estimation error. In particular, the error probability of the obtained classifier is shown to decrease as $O(\sqrt{\log n / n})$ to the achievable optimum, for large nonparametric classes of distributions, as the sample size $n$ grows. We also show that if the Bayes error probability is zero and the Bayes rule is in a known family of decision rules, the error probability is $O(\log n / n)$ for many large families, possibly with infinite VC dimension.

Keywords: information-theory
[Lauritzen1996Graphical] S. Lauritzen. Graphical Models. Oxford, 1996. [ bib ]
[Kristiansen1996database] K. Kristiansen, S. G. Dahl, and O. Edvardsen. A database of mutants and effects of site-directed mutagenesis experiments on G protein-coupled receptors. Proteins, 26(1):81-94, Sep 1996. [ bib | DOI | http ]
A database system and computer programs for storage and retrieval of information about guanine nucleotide-binding protein (G protein)-coupled receptor mutants and associated biological effects have been developed. Mutation data on the receptors were collected from the literature and a database of mutants and effects of mutations was developed. The G protein-coupled receptor, family A, point mutation database (GRAP) provides detailed information on ligand-binding and signal transduction properties of more than 2130 receptor mutants. The amino acid sequences of receptors for which mutation experiments have been reported were aligned, and from this alignment mutation data may be retrieved. Alternatively, a search form allowing detailed specification of which mutants to retrieve may be used, for example, to search for specific amino acid substitutions, substitutions in specific protein domains or reported biological effects. Furthermore, ligand and bibliographic oriented queries may be performed. GRAP is available on the Internet (URL: http://www-grap.fagmed.uit.no/GRAP/homepage.html) using the World-Wide Web system.

Keywords: Amino Acid Sequence; Computer Communication Networks; Computers; GTP-Binding Proteins; Information Systems; Molecular Sequence Data; Mutagenesis, Site-Directed; Mutation; Receptors, Cell Surface; Sequence Alignment
[King1996Structure-activity] R. D. King, S. H. Muggleton, A. Srinivasan, and M. J. Sternberg. Structure-activity relationships derived by machine learning: the use of atoms and their bond connectivities to predict mutagenicity by inductive logic programming. Proc. Natl. Acad. Sci. USA, 93(1):438-442, Jan 1996. [ bib ]
We present a general approach to forming structure-activity relationships (SARs). This approach is based on representing chemical structure by atoms and their bond connectivities in combination with the inductive logic programming (ILP) algorithm PROGOL. Existing SAR methods describe chemical structure by using attributes which are general properties of an object. It is not possible to map chemical structure directly to attribute-based descriptions, as such descriptions have no internal organization. A more natural and general way to describe chemical structure is to use a relational description, where the internal construction of the description maps that of the object described. Our atom and bond connectivities representation is a relational description. ILP algorithms can form SARs with relational descriptions. We have tested the relational approach by investigating the SARs of 230 aromatic and heteroaromatic nitro compounds. These compounds had been split previously into two subsets, 188 compounds that were amenable to regression and 42 that were not. For the 188 compounds, a SAR was found that was as accurate as the best statistical or neural network-generated SARs. The PROGOL SAR has the advantages that it did not need the use of any indicator variables handcrafted by an expert, and the generated rules were easily comprehensible. For the 42 compounds, PROGOL formed a SAR that was significantly (P < 0.025) more accurate than linear regression, quadratic regression, and back-propagation. This SAR is based on an automatically generated structural alert for mutagenicity.

[Kam96documentimage] A. C. Kam and G. E. Kopec. Document image decoding by heuristic search. IEEE Trans. Pattern Anal. Mach. Intell., 18:945-950, 1996. [ bib ]
[Jolliffe1996Principal] I.T. Jolliffe. Principal component analysis. Springer-Verlag, New York, 1996. [ bib ]
[John1996Stock] G. H. John, P. Miller, and R. Kerber. Stock selection using Recon, pages 303-316. World Scientific, 1996. [ bib | .pdf ]
[Humphrey1996VMD] W. Humphrey, A. Dalke, and K. Schulten. VMD: visual molecular dynamics. J. Mol. Graph., 14(1):33-8, 27-8, Feb 1996. [ bib ]
VMD is a molecular graphics program designed for the display and analysis of molecular assemblies, in particular biopolymers such as proteins and nucleic acids. VMD can simultaneously display any number of structures using a wide variety of rendering styles and coloring methods. Molecules are displayed as one or more "representations," in which each representation embodies a particular rendering method and coloring scheme for a selected subset of atoms. The atoms displayed in each representation are chosen using an extensive atom selection syntax, which includes Boolean operators and regular expressions. VMD provides a complete graphical user interface for program control, as well as a text interface using the Tcl embeddable parser to allow for complex scripts with variable substitution, control loops, and function calls. Full session logging is supported, which produces a VMD command script for later playback. High-resolution raster images of displayed molecules may be produced by generating input scripts for use by a number of photorealistic image-rendering applications. VMD has also been expressly designed with the ability to animate molecular dynamics (MD) simulation trajectories, imported either from files or from a direct connection to a running MD simulation. VMD is the visualization component of MDScope, a set of tools for interactive problem solving in structural biology, which also includes the parallel MD program NAMD, and the MDCOMM software used to connect the visualization and simulation programs. VMD is written in C++, using an object-oriented design; the program, including source code and extensive documentation, is freely available via anonymous ftp and through the World Wide Web.

[Hughey1996Hidden] R. Hughey and A. Krogh. Hidden Markov models for sequence analysis: Extension and analysis of the basic method. CABIOS, 12(2):95-107, 1996. [ bib ]
[Holst1996Topological] H. van der Holst. Topological and spectral graph characterizations. PhD thesis, Universiteit van Amsterdam, 1996. [ bib ]
[Guorong1996Bhattacharyya] X. Guorong, C. Peiqi, and W. Minhui. Bhattacharyya distance feature selection. In Proceedings of the 13th International Conference on Pattern Recognition, volume 2, pages 195-199. IEEE, 1996. [ bib ]
[Gribskov1996Use] M. Gribskov and N. L. Robinson. Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput. Chem., 20(1):25-33, 1996. [ bib ]
[Golub1996Matrix] G. H. Golub and C. F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. [ bib ]
[Gasteigner1996Chemical] J. Gasteiger, J. Sadowski, J. Schuur, P. Selzer, L. Steinhauer, and V. Steinhauer. Chemical information in 3D space. J. Chem. Inform. Comput. Sci., 36(5):1030-1037, 1996. [ bib | DOI | arXiv | http ]
[Fortin1996graph] S. Fortin. The graph isomorphism problem. Technical report, MIT, 1996. [ bib ]
[Feder1996Hierarchical] M. Feder and N. Merhav. Hierarchical universal coding. IEEE Trans. Inform. Theory, 42(5):1354-1364, Sep 1996. [ bib | .pdf ]
In an earlier paper, we proved a strong version of the redundancy-capacity converse theorem of universal coding, stating that for "most" sources in a given class, the universal coding redundancy is essentially lower-bounded by the capacity of the channel induced by this class. Since this result holds for general classes of sources, it extends Rissanen's (1986) strong converse theorem for parametric families. While our earlier result has established strong optimality only for mixture codes weighted by the capacity-achieving prior, our first result herein extends this finding to a general prior. For some cases our technique also leads to a simplified proof of the above mentioned strong converse theorem. The major interest in this paper, however, is in extending the theory of universal coding to hierarchical structures of classes, where each class may have a different capacity. In this setting, one wishes to incur redundancy essentially as small as that corresponding to the active class, and not the union of classes. Our main result is that the redundancy of a code based on a two-stage mixture (first, within each class, and then over the classes), is no worse than that of any other code for "most" sources of "most" classes. If, in addition, the classes can be efficiently distinguished by a certain decision rule, then the best attainable redundancy is given explicitly by the capacity of the active class plus the normalized negative logarithm of the prior probability assigned to this class. These results suggest some interesting guidelines as for the choice of the prior. We also discuss some examples with a natural hierarchical partition into classes.

Keywords: information-theory source-coding
[Devroye1996Probabilistic] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31 of Applications of Mathematics. Springer, 1996. [ bib ]
[Devillers1996Neural] J. Devillers. Neural Networks in QSAR and Drug Design. Academic Press, London, 1996. [ bib ]
[Cover1996Universal] T.M. Cover and E. Ordentlich. Universal portfolios with side information. IEEE Trans. Inform. Theory, 42(2):348-363, Mar 1996. [ bib | .pdf ]
We present a sequential investment algorithm, the μ-weighted universal portfolio with side information, which achieves, to first order in the exponent, the same wealth as the best side-information dependent investment strategy (the best state-constant rebalanced portfolio) determined in hindsight from observed market and side-information outcomes. This is an individual sequence result which shows the difference between the exponential growth wealth of the best state-constant rebalanced portfolio and the universal portfolio with side information is uniformly less than (d/(2n))log (n+1)+(k/n)log 2 for every stock market and side-information sequence and for all time n. Here d=k(m-1) is the number of degrees of freedom in the state-constant rebalanced portfolio with k states of side information and m stocks. The proof of this result establishes a close connection between universal investment and universal data compression.

Keywords: information-theory
[Cordella1996Efficient] L. P. Cordella, P. Foggia, C. Sansone, and M. Vento. An efficient algorithm for the inexact matching of ARG graphs using a contextual transformational model. Pattern Recognition, International Conference on, 3:180, 1996. [ bib | DOI ]
[Cogoni1996Transgene] C. Cogoni, J. T. Irelan, M. Schumacher, T. J. Schmidhauser, E. U. Selker, and G. Macino. Transgene silencing of the al-1 gene in vegetative cells of Neurospora is mediated by a cytoplasmic effector and does not depend on DNA-DNA interactions or DNA methylation. EMBO J., 15(12):3153-3163, Jun 1996. [ bib | http | .pdf ]
The molecular mechanisms involved in transgene-induced gene silencing ('quelling') in Neurospora crassa were investigated using the carotenoid biosynthetic gene albino-1 (al-1) as a visual marker. Deletion derivatives of the al-1 gene showed that a transgene must contain at least approximately 132 bp of sequences homologous to the transcribed region of the native gene in order to induce quelling. Transgenes containing only al-1 promoter sequences do not cause quelling. Specific sequences are not required for gene silencing, as different regions of the al-1 gene produced quelling. A mutant defective in cytosine methylation (dim-2) exhibited normal frequencies and degrees of silencing, indicating that cytosine methylation is not responsible for quelling, despite the fact that methylation of transgene sequences frequently is correlated with silencing. Silencing was shown to be a dominant trait, operative in heterokaryotic strains containing a mixture of transgenic and non-transgenic nuclei. This result indicates that a diffusable, trans-acting molecule is involved in quelling. A transgene-derived, sense RNA was detected in quelled strains and was found to be absent in their revertants. These data are consistent with a model in which an RNA-DNA or RNA-RNA interaction is involved in transgene-induced gene silencing in Neurospora.

Keywords: sirna
[Brown1996Use] Robert D. Brown and Yvonne C. Martin. Use of Structure-Activity Data To Compare Structure-Based Clustering Methods and Descriptors for Use in Compound Selection. J Chem Inf Comput Sci, 36:572-584, 1996. [ bib ]
Keywords: chemoinformatics
[Breiman1996Bagging] L. Breiman. Bagging predictors. Mach. Learn., 24(2):123-140, 1996. [ bib | DOI | http | .pdf ]
Keywords: PUlearning
[Billerey1996Etude] C. Billerey and L. Boccon-Gibod. Etude des variations inter-pathologistes dans l'évaluation du grade et du stade des tumeurs vésicales. Progrès en Urologie, 6:49-57, 1996. [ bib | .PDF | .pdf ]
Keywords: csbcbook, csbcbook-ch3
[Baxter1996Learning] Jonathan Baxter. Learning model bias. In Advances in Neural Information Processing Systems, pages 169-175. MIT Press, 1996. [ bib ]
[Baxter1996Bayesian/information] Jonathan Baxter. A Bayesian/information theoretic model of bias learning. In COLT '96: Proceedings of the ninth annual conference on Computational learning theory, pages 77-88, New York, NY, USA, 1996. ACM Press. [ bib | DOI ]
[Baulcombe1996RNA] D. C. Baulcombe. RNA as a target and an initiator of post-transcriptional gene silencing in transgenic plants. Plant Mol. Biol., 32(1-2):79-88, Oct 1996. [ bib | DOI | http | .pdf ]
Post-transcriptional gene silencing in transgenic plants is the manifestation of a mechanism that suppresses RNA accumulation in a sequence-specific manner. The target RNA species may be the products of transgenes, endogenous plant genes or viral RNAs. For an RNA to be a target it is necessary only that it has sequence homology to the sense RNA product of the transgene. There are three current hypotheses to account for the mechanism of post transcriptional gene silencing. These models all require production of an antisense RNA of the RNA targets to account for the specificity of the mechanism. There could be either direct transcription of the antisense RNA from the transgene, antisense RNA produced in response to over expression of the transgene or antisense RNA produced in response to the production of an aberrant sense RNA product of the transgene. To determine which of these models is correct it will be necessary to find out whether transgene methylation, which is frequently associated with the potential of transgenes to confer post-transcriptional gene silencing, is a cause or a consequence of the process.

[Bauknecht1996Locating] H. Bauknecht, A. Zell, H. Bayer, P. Levi, M. Wagener, J. Sadowski, and J. Gasteiger. Locating biologically active compounds in medium-sized heterogeneous datasets by topological autocorrelation vectors: dopamine and benzodiazepine agonists. J Chem Inf Comput Sci, 36(6):1205-1213, 1996. [ bib ]
Electronic properties located on the atoms of a molecule such as partial atomic charges as well as electronegativity and polarizability values are encoded by an autocorrelation vector accounting for the constitution of a molecule. This encoding procedure is able to distinguish between compounds being dopamine agonists and those being benzodiazepine receptor agonists even after projection into a two-dimensional self-organizing network. The two types of compounds can still be distinguished if they are buried in a dataset of 8323 compounds of a chemical supplier catalog comprising a wide structural variety. The maps obtained by this sequence of events, calculation of empirical physicochemical effects, encoding in a topological autocorrelation vector, and projection by a self-organizing neural network, can thus be used for searching for structural similarity, and, in particular, for finding new lead structures with biological activity.

Keywords: Animals, Chemical, Chemistry, Databases, Dopamine Agonists, Drug Design, Electrochemistry, Factual, GABA-A, Logistic Models, Models, Molecular Structure, Neural Networks (Computer), Non-U.S. Gov't, Phenols, Physical, Receptors, Research Support, Structure-Activity Relationship, Tetrahymena pyriformis, 8941996
[Gold1996graduated] S. Gold and A. Rangarajan. A graduated assignment algorithm for graph matching. IEEE Trans. Pattern Anal. Mach. Intell., 18(4):377-388, April 1996. [ bib | DOI | http | .pdf ]
A graduated assignment algorithm for graph matching is presented which is fast and accurate even in the presence of high noise. By combining graduated nonconvexity, two-way (assignment) constraints, and sparsity, large improvements in accuracy and speed are achieved. Its low order computational complexity [O(lm), where l and m are the number of links in the two graphs] and robustness in the presence of noise offer advantages over traditional combinatorial approaches. The algorithm, not restricted to any special class of graph, is applied to subgraph isomorphism, weighted graph matching, and attributed relational graph matching. To illustrate the performance of the algorithm, attributed relational graphs derived from objects are matched. Then, results from twenty-five thousand experiments conducted on 100 node random graphs of varying types (graphs with only zero-one links, weighted graphs, and graphs with node attributes and multiple link types) are reported. No comparable results have been reported by any other graph matching algorithm before in the research literature. Twenty-five hundred control experiments are conducted using a relaxation labeling algorithm and large improvements in accuracy are demonstrated.

[Goffeau1996Life] A. Goffeau, B.G. Barrell, H. Bussey, R.W. Davis, B. Dujon, H. Feldmann, F. Galibert, J.D. Hoheisel, C. Jacq, M. Johnston, E.J. Louis, H.W. Mewes, Y. Murakami, P. Philippsen, H. Tettelin, and S. G. Oliver. Life with 6000 genes. Science, 274:546-567, October 1996. [ bib | DOI | http | .pdf ]
[Zhu1997Minimax] S. C. Zhu, Z. N. Wu, and D. Mumford. Minimax Entropy Principle and Its Application to Texture Modeling. Neural Comput., 9(8):1627-1660, 1997. [ bib | .ps.gz | .pdf ]
[Xie1997Minimax] Q. Xie and A.R. Barron. Minimax redundancy for the class of memoryless sources. IEEE Trans. Inform. Theory, 43(2):646-657, Mar 1997. [ bib | .pdf ]
Let $X^n=(X_1,\ldots,X_n)$ be a memoryless source with unknown distribution on a finite alphabet of size $k$. We identify the asymptotic minimax coding redundancy for this class of sources, and provide a sequence of asymptotically minimax codes. Equivalently, we determine the limiting behavior of the minimax relative entropy $\min_{Q_{X^n}} \max_{P_{X^n}} D(P_{X^n} \| Q_{X^n})$, where the maximum is over all independent and identically distributed (i.i.d.) source distributions and the minimum is over all joint distributions. We show in this paper that the minimax redundancy minus $((k-1)/2)\log(n/(2\pi e))$ converges to $\log \int \sqrt{\det I(\theta)}\,d\theta = \log(\Gamma(1/2)^k/\Gamma(k/2))$, where $I(\theta)$ is the Fisher information and the integral is over the whole probability simplex. The Bayes strategy using Jeffreys' prior is shown to be asymptotically maximin but not asymptotically minimax in our setting. The boundary risk using Jeffreys' prior is higher than that of interior points. We provide a sequence of modifications of Jeffreys' prior that put some prior mass near the boundaries of the probability simplex to pull down that risk to the asymptotic minimax level in the limit.

Keywords: information-theory
[Tjalkens1997Implementing] Tj.J. Tjalkens and F. M. J. Willems. Implementing the Context-Tree Weighting Method: Arithmetic Coding. In Int. Conf. on Combinatorics, Information Theory and Statistics, page 83, Portland, Maine, U.S.A, 18-20 1997. [ bib | .ps | .pdf ]
[Tibshirani1997lasso] R. Tibshirani. The lasso method for variable selection in the Cox model. Stat. Med., 16(4):385-395, Feb 1997. [ bib | .pdf ]
I propose a new method for variable selection and shrinkage in Cox's proportional hazards model. My proposal minimizes the log partial likelihood subject to the sum of the absolute values of the parameters being bounded by a constant. Because of the nature of this constraint, it shrinks coefficients and produces some coefficients that are exactly zero. As a result it reduces the estimation variance while providing an interpretable final model. The method is a variation of the 'lasso' proposal of Tibshirani, designed for the linear regression context. Simulations indicate that the lasso can be more accurate than stepwise selection in this setting.

[Sonnhammer1997Pfam] E. L. Sonnhammer, S. R. Eddy, and R. Durbin. Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins, 28(3):405-420, Jul 1997. [ bib ]
Databases of multiple sequence alignments are a valuable aid to protein sequence classification and analysis. One of the main challenges when constructing such a database is to simultaneously satisfy the conflicting demands of completeness on the one hand and quality of alignment and domain definitions on the other. The latter properties are best dealt with by manual approaches, whereas completeness in practice is only amenable to automatic methods. Herein we present a database based on hidden Markov model profiles (HMMs), which combines high quality and completeness. Our database, Pfam, consists of parts A and B. Pfam-A is curated and contains well-characterized protein domain families with high quality alignments, which are maintained by using manually checked seed alignments and HMMs to find and align all members. Pfam-B contains sequence families that were generated automatically by applying the Domainer algorithm to cluster and align the remaining protein sequences after removal of Pfam-A domains. By using Pfam, a large number of previously unannotated proteins from the Caenorhabditis elegans genome project were classified. We have also identified many novel family memberships in known proteins, including new kazal, Fibronectin type III, and response regulator receiver domains. Pfam-A families have permanent accession numbers and form a library of HMMs available for searching and automatic annotation of new protein sequences.

[Solinas1997Matrix] S. Solinas-Toldo, S. Lampel, S. Stilgenbauer, J. Nickolenko, A. Benner, H. Dohner, T. Cremer, and P. Lichter. Matrix-based comparative genomic hybridization: Biochips to screen for genomic imbalances. Genes Chromosomes Cancer, 20:399-407, 1997. [ bib ]
Keywords: csbcbook, csbcbook-ch2
[Sebag1997Tractable] M. Sebag and C. Rouveirol. Tractable Induction and Classification in First-Order Logic via Stochastic Matching. In Proceedings of the 15th International Joint Conference on Artificial Intelligence, pages 888-893. Morgan Kaufmann, 1997. [ bib ]
[Rockafellar1997Convex] R.T. Rockafellar. Convex Analysis. Princeton Univ. Press, 1997. [ bib ]
[Polonik1997Minimum] W. Polonik. Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications, 69:1-24, 1997. [ bib ]
[Nielsen1997Identification] H. Nielsen, J. Engelbrecht, S. Brunak, and G. von Heijne. Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng., 10(1):1-6, 1997. [ bib | http | .pdf ]
[Morvai1997Weakly] G. Morvai, S.J. Yakowitz, and P. Algoet. Weakly convergent nonparametric forecasting of stationary time series. IEEE Trans. Inform. Theory, 43(2):483-498, Mar 1997. [ bib | .pdf ]
The conditional distribution of the next outcome given the infinite past of a stationary process can be inferred from finite but growing segments of the past. Several schemes are known for constructing pointwise consistent estimates, but they all demand prohibitive amounts of input data. We consider real-valued time series and construct conditional distribution estimates that make much more efficient use of the input data. The estimates are consistent in a weak sense, and the question whether they are pointwise-consistent is still open. For finite-alphabet processes one may rely on a universal data compression scheme like the Lempel-Ziv (1978) algorithm to construct conditional probability mass function estimates that are consistent in expected information divergence. Consistency in this strong sense cannot be attained in a universal sense for all stationary processes with values in an infinite alphabet, but weak consistency can. Some applications of the estimates to on-line forecasting, regression, and classification are discussed.

Keywords: information-theory
[Mohar1997Some] B. Mohar. Some applications of Laplace eigenvalues of graphs. In G. Hahn and G. Sabidussi, editors, Graph Symmetry: Algebraic Methods and Applications, volume 497 of NATO ASI Series C, pages 227-275. Kluwer, Dordrecht, 1997. [ bib | .html | .pdf ]
[McGregor1997Clustering] M.J. McGregor and V. Pallai. Clustering of Large Databases of Compounds: Using the MDL "Keys" as Structural Descriptors. J Chem Inf Comput Sci, 37:443-448, 1997. [ bib ]
[Luo1997Mammalian] Y. Luo, A. Batalao, H. Zhou, and L. Zhu. Mammalian two-hybrid system: a complementary approach to the yeast two-hybrid system. Biotechniques, 22(2):350-352, Feb 1997. [ bib ]
Here we demonstrate the use of a mammalian two-hybrid system to study protein-protein interactions. Like the yeast two-hybrid system, this is a genetic, in vivo assay based on the reconstitution of the function of a transcriptional activator. In this system, one protein of interest is expressed as a fusion to the Gal4 DNA-binding domain and another protein is expressed as a fusion to the activation domain of the VP16 protein of the herpes simplex virus. The vectors that express these fusion proteins are cotransfected with a reporter chloramphenicol acetyltransferase (CAT) vector into a mammalian cell line. The reporter plasmid contains a cat gene under the control of five consensus Gal4 binding sites. If the two fusion proteins interact, there will be a significant increase in expression of the cat reporter gene. Previously, it was reported that mouse p53 antitumor protein and simian virus 40 large T antigen interact in a yeast two-hybrid system. Using a mammalian two-hybrid system, we were able to independently confirm this interaction. The mammalian two-hybrid system can be used as a complementary approach to verify protein-protein interactions detected by a yeast two-hybrid system screening. In addition, the mammalian two-hybrid system has two main advantages: (i) Assay results can be obtained within 48 h of transfection, and (ii) protein interactions in mammalian cells may better mimic actual in vivo interactions.

Keywords: Antigens, Polyomavirus Transforming; Binding Sites; Chloramphenicol O-Acetyltransferase; DNA; DNA-Binding Proteins; Fungal Proteins; Genes, Reporter; Genetic Vectors; Hela Cells; Herpes Simplex Virus Protein Vmw65; Humans; Promoter Regions, Genetic; Recombinant Fusion Proteins; Saccharomyces cerevisiae Proteins; Simian virus 40; Transcription Factors; Transfection; Tumor Suppressor Protein p53
[Lemarechal1997Practical] Claude Lemaréchal and Claudia Sagastizábal. Practical aspects of the Moreau-Yosida regularization: Theoretical preliminaries. SIAM Journal on Optimization, 7:367-385, 1997. [ bib ]
[Land1997Variable] S. R. Land and J. H. Friedman. Variable fusion: A new adaptive signal regression method. Technical Report 656, Department of Statistics, Carnegie Mellon University Pittsburgh, 1997. [ bib ]
Keywords: lasso, ordinal, regression
[Kohavi1997Wrappers] R. Kohavi and G. John. Wrappers for feature selection. Artificial Intelligence, 97(1-2):273-324, 1997. [ bib ]
[Kanehisa1997database] M. Kanehisa. A database for post-genome analysis. Trends Genet., 13:375-376, 1997. [ bib | DOI | http | .pdf ]
[Jones1997Development] G. Jones, P. Willett, R. C. Glen, A. R. Leach, and R. Taylor. Development and validation of a genetic algorithm for flexible docking. J. Mol. Biol., 267(3):727-748, Apr 1997. [ bib | DOI | http ]
Prediction of small molecule binding modes to macromolecules of known three-dimensional structure is a problem of paramount importance in rational drug design (the "docking" problem). We report the development and validation of the program GOLD (Genetic Optimisation for Ligand Docking). GOLD is an automated ligand docking program that uses a genetic algorithm to explore the full range of ligand conformational flexibility with partial flexibility of the protein, and satisfies the fundamental requirement that the ligand must displace loosely bound water on binding. Numerous enhancements and modifications have been applied to the original technique resulting in a substantial increase in the reliability and the applicability of the algorithm. The advanced algorithm has been tested on a dataset of 100 complexes extracted from the Brookhaven Protein DataBank. When used to dock the ligand back into the binding site, GOLD achieved a 71% success rate in identifying the experimental binding mode.

Keywords: Algorithms, Binding Sites, Computer Simulation, Crystallography, Genetic, Humans, Ligands, Models, Molecular, NADP, Protein Binding, Protein Conformation, Proteins, Tetrahydrofolate Dehydrogenase, X-Ray, 9126849
[Joachims97aprobabilistic] T. Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In ICML '97: Proceedings of the Fourteenth International Conference on Machine Learning, pages 143-151, Nashville, Tennessee, USA, 1997. Morgan Kaufmann Publishers Inc. [ bib ]
[Holliday1997Using] J. D. Holliday and P. Willett. Using a genetic algorithm to identify common structural features in sets of ligands. J. Mol. Graph. Model., 15(4):221-232, Aug 1997. [ bib ]
This article describes a program for pharmacophore mapping, called MPHIL (Mapping Pharmacophores in Ligands). Given as input a set of molecules that exhibit some common biological activity, MPHIL identifies the smallest 3D pattern of pharmacophore points that has at least m (a user-defined parameter) points in common with each of the input molecules. The program thus differs from existing programs for pharmacophore mapping in that it does not require all of the molecules to share exactly the same pattern of points, although it will find such a common pattern if it does, indeed, exist. MPHIL uses a genetic algorithm (GA) approach in which an initial, and very rapid, GA is used to suggest possible combinations of points that are then processed by the second GA to yield the final 3D pattern.

Keywords: chemoinformatics
[Hershkovits1997On] Y. Hershkovits and J. Ziv. On fixed-database universal data compression with limited memory. IEEE Trans. Inform. Theory, 43(6):1966-1976, Nov 1997. [ bib | .pdf ]
The amount of fixed side information required for lossless data compression is discussed. Nonasymptotic coding and converse theorems are derived for data-compression algorithms with fixed statistical side information ("training sequence") that is not large enough so as to yield the ultimate compression, namely, the entropy of the source.

Keywords: information-theory
[Helmbold1997Predicting] D. P. Helmbold and R. E. Schapire. Predicting Nearly As Well As the Best Pruning of a Decision Tree. Machine Learning, 27(1):51-68, 1997. [ bib | .ps.Z | .pdf ]
[Hawkins1997Analysis] D.M. Hawkins, S.S. Young, and A. Rusinko. Analysis of a large structure-activity data set using recursive partitioning. Quantitative Structure-Activity Relationships, 16:296-302, 1997. [ bib ]
[Haussler1997general] D. Haussler. A general minimax result for relative entropy. IEEE Trans. Inform. Theory, 43(4):1276-1280, Jul 1997. [ bib | .pdf ]
Suppose nature picks a probability measure $P_\theta$ on a complete separable metric space X at random from a measurable set $P_\Theta = \{P_\theta : \theta \in \Theta\}$. Then, without knowing $\theta$, a statistician picks a measure Q on S. Finally, the statistician suffers a loss $D(P_\theta \| Q)$, the relative entropy between $P_\theta$ and Q. We show that the minimax and maximin values of this game are always equal, and there is always a minimax strategy in the closure of the set of all Bayes strategies. This generalizes previous results of Gallager (1979), and Davisson and Leon-Garcia (1980).

[Gulukota1997Two] K. Gulukota, J. Sidney, A. Sette, and C. DeLisi. Two complementary methods for predicting peptides binding major histocompatibility complex molecules. J. Mol. Biol., 267(5):1258-1267, Apr 1997. [ bib | DOI | http ]
Peptides that bind to major histocompatibility complex products (MHC) are known to exhibit certain sequence motifs which, though common, are neither necessary nor sufficient for binding: MHCs bind certain peptides that do not have the characteristic motifs and only about 30% of the peptides having the required motif, bind. In order to develop and test more accurate methods we measured the binding affinity of 463 nonamer peptides to HLA-A2.1. We describe two methods for predicting whether a given peptide will bind to an MHC and apply them to these peptides. One method is based on simulating a neural network and another, called the polynomial method, is based on statistical parameter estimation assuming independent binding of the side-chains of residues. We compare these methods with each other and with standard motif-based methods. The two methods are complementary, and both are superior to sequence motifs. The neural net is superior to simple motif searches in eliminating false positives. Its behavior can be coarsely tuned to the strength of binding desired and it is extendable in a straightforward fashion to other alleles. The polynomial method, on the other hand, has high sensitivity and is a superior method for eliminating false negatives. We discuss the validity of the independent binding assumption in such predictions.

Keywords: Artificial Intelligence; Computing Methodologies; HLA-A2 Antigen; Neural Networks (Computer); Oligopeptides; Protein Binding; Reproducibility of Results
[Dietterich1997Solving] T.G. Dietterich, R.H. Lathrop, and T. Lozano-Perez. Solving the Multiple Instance Problem with Axis-Parallel Rectangles. Artificial Intelligence, 89(1-2):31-71, 1997. [ bib ]
[DeRisi1997Exploring] J. L. DeRisi, V. R. Iyer, and P. O. Brown. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278(5338):680-686, 1997. [ bib | .pdf | .pdf ]
[Decatur1997PAC] S.E. Decatur. PAC learning with constant-partition classification noise and applications to decision tree induction. In Proceedings of the Fourteenth International Conference on Machine Learning, ICML '97, pages 83-91, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc. [ bib | http ]
[Chung1997Spectral] F. R. K. Chung. Spectral graph theory, volume 92 of CBMS Regional Conference Series. American Mathematical Society, Providence, 1997. [ bib ]
[Caruana1997Multitask] Rich Caruana. Multitask learning. Machine Learning, 28(1):41-75, 1997. [ bib ]
[Brown1997information] R. D. Brown and Y. C. Martin. The information content of 2D and 3D structural descriptors relevant to ligand-receptor binding. J Chem Inf Comput Sci, 37:1-9, 1997. [ bib ]
Keywords: chemoinformatics
[Baxter1997A] Jonathan Baxter. A Bayesian/information theoretic model of learning to learn via multiple task sampling. In Machine Learning, pages 7-39, 1997. [ bib ]
[Barkai1997Robustness] N. Barkai and S. Leibler. Robustness in simple biochemical networks. Nature, 387(6636):913-917, Jun 1997. [ bib | DOI | http | .pdf ]
Cells use complex networks of interacting molecular components to transfer and process information. These "computational devices of living cells" are responsible for many important cellular processes, including cell-cycle regulation and signal transduction. Here we address the issue of the sensitivity of the networks to variations in their biochemical parameters. We propose a mechanism for robust adaptation in simple signal transduction networks. We show that this mechanism applies in particular to bacterial chemotaxis. This is demonstrated within a quantitative model which explains, in a unified way, many aspects of chemotaxis, including proper responses to chemical gradients. The adaptation property is a consequence of the network's connectivity and does not require the 'fine-tuning' of parameters. We argue that the key properties of biochemical networks should be robust in order to ensure their proper functioning.

[Arkin1997test] A. Arkin, P. Shen, and J. Ross. A test case of correlation metric construction of a reaction pathway from measurements. Science, 277(5330):1275-1279, 1997. [ bib | .pdf ]
A method for the prediction of the interactions within complex reaction networks from experimentally measured time series of the concentration of the species composing the system has been tested experimentally on the first few steps of the glycolytic pathway. The reconstituted reaction system, containing eight enzymes and 14 metabolic intermediates, was kept away from equilibrium in a continuous-flow, stirred-tank reactor. Input concentrations of adenosine monophosphate and citrate were externally varied over time, and their concentrations in the reactor and the response of eight other species were measured. Multidimensional scaling analysis and heuristic algorithms applied to two-species time-lagged correlation functions derived from the time series yielded a diagram from which the interactions among all of the species could be deduced. The diagram predicts essential features of the known reaction network in regard to chemical reactions and interactions among the measured species. The approach is applicable to many complex reaction systems.

Keywords: reconstruction, kinetic, metabolism, analysis, multidimensional scaling, Correlation Metric Construction
[Altschul1997Gapped] S.F. Altschul, T.L. Madden, A.A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25:3389-3402, 1997. [ bib | .pdf | .pdf ]
[Tsybakov1997On] A. B. Tsybakov. On Nonparametric Estimation of Density Level Sets. Ann. Stat., 25:948-969, June 1997. [ bib | http | .pdf ]
[Thrun1998Learning] Sebastian Thrun and Lorien Pratt, editors. Learning to learn. Kluwer Academic Publishers, Norwell, MA, USA, 1998. [ bib ]
[Zhu1998FRAME:] S. C. Zhu, Y. Wu, and D. Mumford. FRAME: Filters, Random field And Maximum Entropy: Towards a Unified Theory for Texture Modeling. Int'l Journal of Computer Vision, 27(2):1-20, 1998. [ bib | .ps.gz | .pdf ]
[Xia1998Thermodynamic] T. Xia, J. SantaLucia, M. E. Burkard, R. Kierzek, S. J. Schroeder, X. Jiao, C. Cox, and D. H. Turner. Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs. Biochemistry, 37(42):14719-35, Oct 1998. [ bib | DOI | http ]
Improved thermodynamic parameters for prediction of RNA duplex formation are derived from optical melting studies of 90 oligoribonucleotide duplexes containing only Watson-Crick base pairs. To test end or base composition effects, new sets of duplexes are included that have identical nearest neighbors, but different base compositions and therefore different ends. Duplexes with terminal GC pairs are more stable than duplexes with the same nearest neighbors but terminal AU pairs. Penalizing terminal AU base pairs by 0.45 kcal/mol relative to terminal GC base pairs significantly improves predictions of ΔG°37 from a nearest-neighbor model. A physical model is suggested in which the differential treatment of AU and GC ends accounts for the dependence of the total number of Watson-Crick hydrogen bonds on the base composition of a duplex. On average, the new parameters predict ΔG°37, ΔH°, ΔS°, and TM within 3.2%, 6.0%, 6.8%, and 1.3 °C, respectively. These predictions are within the limit of the model, based on experimental results for duplexes predicted to have identical thermodynamic parameters.

[Williams1998Prediction] C.K.I. Williams. Prediction with Gaussian Processes: From Linear Regression to Linear Prediction and Beyond. In M.I. Jordan, editor, Learning and Inference in Graphical Models. Kluwer Academic Press, 1998. [ bib ]
[Willett1998Chemical] P. Willett. Chemical Similarity Searching. J Chem Inf Comput Sci, 38:983-996, 1998. [ bib ]
[Watts1998Collective] D. J. Watts and S. H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393:440-442, 1998. [ bib | http | .pdf ]
[Vapnik1998Statistical] V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998. [ bib ]
[Spellman1998Comprehensive] P.T. Spellman, G. Sherlock, M.Q. Zhang, V.R. Iyer, K. Anders, M.B. Eisen, P.O. Brown, D. Botstein, and B. Futcher. Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization. Mol. Biol. Cell, 9:3273-3297, 1998. [ bib | .pdf | .pdf ]
[Smola1998connection] A.J. Smola, B. Schölkopf, and K.-R. Müller. The connection between regularization operators and support vector kernels. Neural Networks, 11(4):637-649, 1998. [ bib | DOI | http | .pdf ]
[Sette1998HLA] A. Sette and J. Sidney. HLA supertypes and supermotifs: a functional perspective on HLA polymorphism. Curr. Opin. Immunol., 10(4):478-482, Aug 1998. [ bib ]
A large fraction of HLA class I, and possibly class II, molecules can be classified into relatively few supertypes, characterized by overlapping peptide-binding repertoires and consensus B- and F-pocket structures. Cross-binding peptides are frequently recognized by specific T cells in the course of natural disease processes and in the context of multiple HLA molecules, validating the concept of HLA supertypes at the functional level.

Keywords: Animals; Communicable Diseases; Evolution, Molecular; HLA Antigens; HLA-A2 Antigen; HLA-A3 Antigen; Humans; Neoplasms; Polymorphism, Genetic
[Schneider1998Artificial] G. Schneider and P. Wrede. Artificial neural networks for computer-based molecular design. Prog Biophys Mol Biol, 70(3):175-222, 1998. [ bib ]
The theory of artificial neural networks is briefly reviewed focusing on supervised and unsupervised techniques which have great impact on current chemical applications. An introduction to molecular descriptors and representation schemes is given. In addition, worked examples of recent advances in this field are highlighted and pioneering publications are discussed. Applications of several types of artificial neural networks to compound classification, modelling of structure-activity relationships, biological target identification, and feature extraction from biopolymers are presented and compared to other techniques. Advantages and limitations of neural networks for computer-aided molecular design and sequence analysis are discussed.

Keywords: Algorithms, Amino Acid Sequence, Amino Acids, Animals, Artificial Intelligence, Automated, Bacterial, Bacterial Proteins, Bicuculline, Binding Sites, Biological, Biological Availability, Blood Proteins, Blood-Brain Barrier, Cation Transport Proteins, Cats, Cell Membrane Permeability, Chemical, Chemistry, Cluster Analysis, Combinatorial Chemistry Techniques, Comparative Study, Computational Biology, Computer Simulation, Computer Systems, Computer-Aided Design, Computer-Assisted, Computing Methodologies, DNA-Binding Proteins, Databases, Dogs, Drug Design, Electric Stimulation, Electromyography, Enzyme Inhibitors, Ether-A-Go-Go Potassium Channels, Excitatory Amino Acid Antagonists, Factual, False Positive Reactions, Forecasting, Forelimb, GABA Antagonists, Gene Expression Profiling, Genome, Glutamic Acid, Humans, Hydrogen Bonding, Image Enhancement, Image Interpretation, Image Processing, Information Storage and Retrieval, Iontophoresis, Kynurenic Acid, Least-Squares Analysis, Linear Models, Liver, Markov Chains, Metabolic Clearance Rate, Metalloendopeptidases, Microelectrodes, Models, Molecular, Molecular Conformation, Molecular Sequence Data, Molecular Structure, Motor Cortex, Movement, Multivariate Analysis, Nerve Net, Neural Networks (Computer), Neuropeptides, Non-U.S. Gov't, Nonlinear Dynamics, Pattern Recognition, Pharmaceutical, Pharmaceutical Preparations, Pharmacokinetics, Phylogeny, Potassium Channels, Predictive Value of Tests, Protein Interaction Mapping, Protein Sorting Signals, Protein Structure, Proteins, Rats, Reproducibility of Results, Research Support, Sensitivity and Specificity, Sequence Alignment, Sequence Analysis, Shoulder, Signal Processing, Software, Statistical, Stereotaxic Techniques, Structure-Activity Relationship, Terminology, Tertiary, Trans-Activators, Voltage-Gated, Zinc, 9830312
[Poggio1998Sparse] T. Poggio and F. Girosi. A Sparse Representation for Function Approximation. Neural Comput, 10(6):1445-54, Jul 1998. [ bib ]
We derive a new general representation for a function as a linear combination of local correlation kernels at optimal sparse locations (and scales) and characterize its relation to principal component analysis, regularization, sparsity principles, and support vector machines.

Keywords: Algorithms, Automated, Biometry, Computers, DNA, Databases, Factual, Fungal, Fungal Proteins, GTP-Binding Proteins, Gene Expression, Genes, Learning, Markov Chains, Models, Neural Networks (Computer), Neurological, Non-P.H.S., Non-U.S. Gov't, Nucleic Acid Hybridization, Open Reading Frames, P.H.S., Pattern Recognition, Protein, Protein Structure, Proteins, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Sequence Alignment, Sequence Analysis, Software, Statistical, Tertiary, U.S. Gov't, 9698352
[Pinkel1998High] D. Pinkel, R. Segraves, D. Sudar, S. Clark, I. Poole, D. Kowbel, C. Collins, W. L. Kuo, C. Chen, Y. Zhai, S. H. Dairkee, B. M. Ljung, J. W. Gray, and D. G. Albertson. High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nat. Genet., 20(2):207-211, Oct 1998. [ bib | DOI | http | .pdf ]
Gene dosage variations occur in many diseases. In cancer, deletions and copy number increases contribute to alterations in the expression of tumour-suppressor genes and oncogenes, respectively. Developmental abnormalities, such as Down, Prader Willi, Angelman and Cri du Chat syndromes, result from gain or loss of one copy of a chromosome or chromosomal region. Thus, detection and mapping of copy number abnormalities provide an approach for associating aberrations with disease phenotype and for localizing critical genes. Comparative genomic hybridization (CGH) was developed for genome-wide analysis of DNA sequence copy number in a single experiment. In CGH, differentially labelled total genomic DNA from a 'test' and a 'reference' cell population are cohybridized to normal metaphase chromosomes, using blocking DNA to suppress signals from repetitive sequences. The resulting ratio of the fluorescence intensities at a location on the 'cytogenetic map', provided by the chromosomes, is approximately proportional to the ratio of the copy numbers of the corresponding DNA sequences in the test and reference genomes. CGH has been broadly applied to human and mouse malignancies. The use of metaphase chromosomes, however, limits detection of events involving small regions (of less than 20 Mb) of the genome, resolution of closely spaced aberrations and linking ratio changes to genomic/genetic markers. Therefore, more laborious locus-by-locus techniques have been required for higher resolution studies. Hybridization to an array of mapped sequences instead of metaphase chromosomes could overcome the limitations of conventional CGH (ref. 6) if adequate performance could be achieved. Copy number would be related to the test/reference fluorescence ratio on the array targets, and genomic resolution could be determined by the map distance between the targets, or by the length of the cloned DNA segments. We describe here our implementation of array CGH. 
We demonstrate its ability to measure copy number with high precision in the human genome, and to analyse clinical specimens by obtaining new information on chromosome 20 aberrations in breast cancer.

Keywords: cgh, csbcbook
[Nakamura1998ATM] Yusuke Nakamura. ATM: the p53 booster. Nature Medicine, 4:1231-1232, 1998. [ bib ]
Keywords: csbcbook
[Mukherjee1998Support] S. Mukherjee, P. Tamayo, J. P. Mesirov, D. Slonim, A. Verri, and T. Poggio. Support vector machine classification of microarray data. Technical Report 182, C.B.L.C., 1998. A.I. Memo 1677. [ bib | .html | .pdf ]
Keywords: biosvm microarray
[Molloy1998size] M. Molloy and B. Reed. The size of the giant component of a random graph with a given degree sequence. Combinator. Probab. Comput., 7:295-305, 1998. [ bib | .ps | .pdf ]
[Modha1998Memory-universal] D.S. Modha and E. Masry. Memory-universal prediction of stationary random processes. IEEE Trans. Inform. Theory, 44(1):117-133, Jan 1998. [ bib | .pdf ]
We consider the problem of one-step-ahead prediction of a real-valued, stationary, strongly mixing random process $(X_i)_{i=-\infty}^{\infty}$. The best mean-square predictor of $X_0$ is its conditional mean given the entire infinite past $(X_i)_{i=-\infty}^{-1}$. Given a sequence of observations $X_1, X_2, \ldots, X_N$, we propose estimators for the conditional mean based on sequences of parametric models of increasing memory and of increasing dimension, for example, neural networks and Legendre polynomials. The proposed estimators select both the model memory and the model dimension, in a data-driven fashion, by minimizing certain complexity regularized least squares criteria. When the underlying predictor function has a finite memory, we establish that the proposed estimators are memory-universal: the proposed estimators, which do not know the true memory, deliver the same statistical performance (rates of integrated mean-squared error) as that delivered by estimators that know the true memory. Furthermore, when the underlying predictor function does not have a finite memory, we establish that the estimator based on Legendre polynomials is consistent.

Keywords: information-theory
[Milik1998Application] M. Milik, D. Sauer, A. P. Brunmark, L. Yuan, A. Vitiello, M. R. Jackson, P. A. Peterson, J. Skolnick, and C. A. Glass. Application of an artificial neural network to predict specific class I MHC binding peptide sequences. Nat. Biotechnol., 16(8):753-756, Aug 1998. [ bib | DOI | http ]
Computational methods were used to predict the sequences of peptides that bind to the MHC class I molecule, K(b). The rules for predicting binding sequences, which are limited, are based on preferences for certain amino acids in certain positions of the peptide. It is apparent though, that binding can be influenced by the amino acids in all of the positions of the peptide. An artificial neural network (ANN) has the ability to simultaneously analyze the influence of all of the amino acids of the peptide and thus may improve binding predictions. ANNs were compared to statistically analyzed peptides for their abilities to predict the sequences of K(b) binding peptides. ANN systems were trained on a library of binding and nonbinding peptide sequences from a phage display library. Statistical and ANN methods identified strong binding peptides with preferred amino acids. ANNs detected more subtle binding preferences, enabling them to predict medium binding peptides. The ability to predict class I MHC molecule binding peptides is useful for immunological therapies involving cytotoxic T cells.

Keywords: immunoinformatics
[Merhav1998Universal] N. Merhav and M. Feder. Universal prediction. IEEE Trans. Inform. Theory, 44(6):2124-2147, Oct 1998. [ bib | .pdf ]
This paper consists of an overview on universal prediction from an information-theoretic perspective. Special attention is given to the notion of probability assignment under the self-information loss function, which is directly related to the theory of universal data compression. Both the probabilistic setting and the deterministic setting of the universal prediction problem are described with emphasis on the analogy and the differences between results in the two settings.

[Mamitsuka1998Predicting] H. Mamitsuka. Predicting peptides that bind to MHC molecules using supervised learning of hidden Markov models. Proteins, 33(4):460-474, Dec 1998. [ bib ]
The binding of a major histocompatibility complex (MHC) molecule to a peptide originating in an antigen is essential to recognizing antigens in immune systems, and it has proved to be important to use computers to predict the peptides that will bind to an MHC molecule. The purpose of this paper is twofold: First, we propose to apply supervised learning of hidden Markov models (HMMs) to this problem, which can surpass existing methods for the problem of predicting MHC-binding peptides. Second, we generate peptides that have high probabilities to bind to a certain MHC molecule, based on our proposed method using peptides binding to MHC molecules as a set of training data. From our experiments, in a type of cross-validation test, the discrimination accuracy of our supervised learning method is usually approximately 2-15% better than those of other methods, including backpropagation neural networks, which have been regarded as the most effective approach to this problem. Furthermore, using an HMM trained for HLA-A2, we present new peptide sequences that are provided with high binding probabilities by the HMM and that are thus expected to bind to HLA-A2 proteins. Peptide sequences not shown in this paper but with rather high binding probabilities can be obtained from the author.

Keywords: immunoinformatics
[Lugosi1998On] G. Lugosi. On concentration-of-measure inequalities. Seminar notes, 1998. [ bib | .ps | .pdf ]
[Liang1998Reveal] S. Liang, S. Fuhrman, and R. Somogyi. REVEAL, a general reverse engineering algorithm for inference of genetic network architectures. Pac. Symp. Biocomput., 3:18-29, 1998. [ bib | .pdf ]
Given the immanent gene expression mapping covering whole genomes during development, health and disease, we seek computational methods to maximize functional inference from such large data sets. Is it possible, in principle, to completely infer a complex regulatory network architecture from input/output patterns of its variables? We investigated this possibility using binary models of genetic networks. Trajectories, or state transition tables of Boolean nets, resemble time series of gene expression. By systematically analyzing the mutual information between input states and output states, one is able to infer the sets of input elements controlling each element or gene in the network. This process is unequivocal and exact for complete state transition tables. We implemented this REVerse Engineering ALgorithm (REVEAL) in a C program, and found the problem to be tractable within the conditions tested so far. For n = 50 (elements) and k = 3 (inputs per element), the analysis of incomplete state transition tables (100 state transition pairs out of a possible 10^15) reliably produced the original rule and wiring sets. While this study is limited to synchronous Boolean networks, the algorithm is generalizable to include multi-state models, essentially allowing direct application to realistic biological data sets. The ability to adequately solve the inverse problem may enable in-depth analysis of complex dynamic systems in biology and other fields.

[Krichevskiy1998Laplace's] R. E. Krichevskiy. Laplace's law of succession and universal encoding. IEEE Trans. Inform. Theory, 44(1):296-303, Jan 1998. [ bib | .pdf ]
Keywords: information-theory source-coding
[Kononen1998Tissue] J. Kononen, L. Bubendorf, A. Kallioniemi, M. Bärlund, P. Schraml, S. Leighton, J. Torhorst, M. J. Mihatsch, G. Sauter, and O. P. Kallioniemi. Tissue microarrays for high-throughput molecular profiling of tumor specimens. Nat Med, 4(7):844-847, Jul 1998. [ bib ]
Many genes and signalling pathways controlling cell proliferation, death and differentiation, as well as genomic integrity, are involved in cancer development. New techniques, such as serial analysis of gene expression and cDNA microarrays, have enabled measurement of the expression of thousands of genes in a single experiment, revealing many new, potentially important cancer genes. These genome screening tools can comprehensively survey one tumor at a time; however, analysis of hundreds of specimens from patients in different stages of disease is needed to establish the diagnostic, prognostic and therapeutic importance of each of the emerging cancer gene candidates. Here we have developed an array-based high-throughput technique that facilitates gene expression and copy number surveys of very large numbers of tumors. As many as 1000 cylindrical tissue biopsies from individual tumors can be distributed in a single tumor tissue microarray. Sections of the microarray provide targets for parallel in situ detection of DNA, RNA and protein targets in each specimen on the array, and consecutive sections allow the rapid analysis of hundreds of molecular markers in the same set of specimens. Our detection of six gene amplifications as well as p53 and estrogen receptor expression in breast cancer demonstrates the power of this technique for defining new subgroups of tumors.

Keywords: Animals; Breast Neoplasms, genetics/metabolism/pathology; Cyclin D1, genetics/metabolism; Female; Genetic Techniques; Humans; Immunoenzyme Techniques; In Situ Hybridization, Fluorescence; Mice; Oncogene Proteins v-myb; Proto-Oncogene Proteins c-myc, genetics/metabolism; Rabbits; Receptor, erbB-2, genetics/metabolism; Receptors, Estrogen, genetics/metabolism; Retroviridae Proteins, Oncogenic, genetics/metabolism; Tumor Markers, Biological, genetics/metabolism; Tumor Suppressor Protein p53, genetics/metabolism
[Kay1998Fundamentals] S.M. Kay. Fundamentals of Statistical Signal Processing, volume 2. Prentice-Hall, 1998. [ bib ]
[Karplus1998Hidden] K. Karplus, C. Barrett, and R. Hughey. Hidden Markov Models for Detecting Remote Protein Homologies. Bioinformatics, 14(10):846-856, 1998. [ bib | .ps | .pdf ]
[Honeyman1998Neural] M. C. Honeyman, V. Brusic, N. L. Stone, and L. C. Harrison. Neural network-based prediction of candidate T-cell epitopes. Nat. Biotechnol., 16(10):966-969, Oct 1998. [ bib | DOI | http ]
Activation of T cells requires recognition by T-cell receptors of specific peptides bound to major histocompatibility complex (MHC) molecules on the surface of either antigen-presenting or target cells. These peptides, T-cell epitopes, have potential therapeutic applications, such as for use as vaccines. Their identification, however, usually requires that multiple overlapping synthetic peptides encompassing a protein antigen be assayed, which in humans is limited by the volume of donor blood. T-cell epitopes are a subset of peptides that bind to MHC molecules. We use an artificial neural network (ANN) model trained to predict peptides that bind to the MHC class II molecule HLA-DR4(*0401). Binding prediction facilitates identification of T-cell epitopes in tyrosine phosphatase IA-2, an autoantigen in DR4-associated type 1 diabetes. Synthetic peptides encompassing IA-2 were tested experimentally for DR4 binding and T-cell proliferation in humans at risk for diabetes. ANN-based binding prediction was sensitive and specific, and reduced the number of peptides required for T-cell assay by more than half, with only a minor loss of epitopes. This strategy could expedite identification of candidate T-cell epitopes in diverse diseases.

Keywords: immunoinformatics
[Grundy1998Family-based] W. N. Grundy. Family-based Homology Detection via Pairwise Sequence Comparison. In Proceedings of the Second Annual International Conference on Computational Molecular Biology, March 22-25, pages 94-100, 1998. [ bib | .html | .pdf ]
[Goto1998LIGAND:] S. Goto, T. Nishioka, and M. Kanehisa. LIGAND: chemical database for enzyme reactions. Bioinformatics, 14:591-599, 1998. [ bib | http | .pdf ]
[Girosi1998Equivalence] F. Girosi. An Equivalence Between Sparse Approximation and Support Vector Machines. Neural Comput, 10(6):1455-80, Jul 1998. [ bib ]
This article shows a relationship between two different approximation techniques: the support vector machines (SVM), proposed by V. Vapnik (1995) and a sparse approximation scheme that resembles the basis pursuit denoising algorithm (Chen, 1995; Chen, Donoho, and Saunders, 1995). SVM is a technique that can be derived from the structural risk minimization principle (Vapnik, 1982) and can be used to estimate the parameters of several different approximation schemes, including radial basis functions, algebraic and trigonometric polynomials, B-splines, and some forms of multilayer perceptrons. Basis pursuit denoising is a sparse approximation technique in which a function is reconstructed by using a small number of basis functions chosen from a large set (the dictionary). We show that if the data are noiseless, the modified version of basis pursuit denoising proposed in this article is equivalent to SVM in the following sense: if applied to the same data set, the two techniques give the same solution, which is obtained by solving the same quadratic programming problem. In the appendix, we present a derivation of the SVM technique in one framework of regularization theory, rather than statistical learning theory, establishing a connection between SVM, sparse approximation, and regularization theory.

Keywords: Algorithms, Automated, Biometry, Computers, DNA, Databases, Factual, Fungal, Fungal Proteins, GTP-Binding Proteins, Gene Expression, Genes, Learning, Markov Chains, Models, Neural Networks (Computer), Neurological, Non-P.H.S., Non-U.S. Gov't, Nucleic Acid Hybridization, Open Reading Frames, P.H.S., Pattern Recognition, Protein, Protein Structure, Proteins, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Sequence Alignment, Sequence Analysis, Software, Statistical, Tertiary, U.S. Gov't, 9698353
[fujibuchi1998] W. Fujibuchi, K. Sato, H. Ogata, S. Goto, and M. Kanehisa. KEGG and DBGET/LinkDB: Integration of biological relationships in divergent molecular biology data. Knowledge Sharing Across Biological and Medical Knowledge-Based Systems WS-98-04, AAAI Press, 1998. [ bib ]
[Fu1998Penalized] W. Fu. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7:397-416, 1998. [ bib ]
[Flower1998Properties] D. R. Flower. On the properties of bit string-based measures of chemical similarity. J Chem Inf Comput Sci, 38:379-386, 1998. [ bib ]
[Fire1998Potent] A. Fire, S. Xu, M. K. Montgomery, S. A. Kostas, S. E. Driver, and C. C. Mello. Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature, 391(6669):806-811, Feb 1998. [ bib | DOI | http | .pdf ]
Experimental introduction of RNA into cells can be used in certain biological systems to interfere with the function of an endogenous gene. Such effects have been proposed to result from a simple antisense mechanism that depends on hybridization between the injected RNA and endogenous messenger RNA transcripts. RNA interference has been used in the nematode Caenorhabditis elegans to manipulate gene expression. Here we investigate the requirements for structure and delivery of the interfering RNA. To our surprise, we found that double-stranded RNA was substantially more effective at producing interference than was either strand individually. After injection into adult animals, purified single strands had at most a modest effect, whereas double-stranded mixtures caused potent and specific interference. The effects of this interference were evident in both the injected animals and their progeny. Only a few molecules of injected double-stranded RNA were required per affected cell, arguing against stoichiometric interference with endogenous mRNA and suggesting that there could be a catalytic or amplification component in the interference process.

Keywords: sirna
[Finn1998Pharmacophore] P. Finn, S. Muggleton, D. Page, and A. Srinivasan. Pharmacophore discovery using the inductive logic programming language Progol. Machine Learning, 30:241-270, 1998. [ bib ]
Keywords: chemoinformatics
[Feder1998Universal] M. Feder and A.C. Singer. Universal data compression and linear prediction. Data Compression Conference, 1998. [ bib | .pdf ]
The relationship between prediction and data compression can be extended to universal prediction schemes and universal data compression. Previous work shows that minimizing the sequential squared prediction error for individual sequences can be achieved using the same strategies which minimize the sequential code length for data compression of individual sequences. Defining a "probability" as an exponential function of sequential loss, results from universal data compression can be used to develop universal linear prediction algorithms. Specifically, we present an algorithm for linear prediction of individual sequences which is twice-universal, over parameters and model orders.

Keywords: information-theory
[Emsley1998Elements] John Emsley. The Elements (third edition). Oxford University Press, 1998. [ bib ]
[Eisen1998Cluster] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 95:14863-14868, Dec 1998. [ bib | .pdf | .pdf ]
[Early1998Polychemotherapy] Early Breast Cancer Trialists’ Collaborative Group. Polychemotherapy for early breast cancer: an overview of the randomised trials. Lancet, 352(9132):930-942, Sep 1998. [ bib ]
There have been many randomised trials of adjuvant prolonged polychemotherapy among women with early breast cancer, and an updated overview of their results is presented. In 1995, information was sought on each woman in any randomised trial that began before 1990 and involved treatment groups that differed only with respect to the chemotherapy regimens that were being compared. Analyses involved about 18,000 women in 47 trials of prolonged polychemotherapy versus no chemotherapy, about 6000 in 11 trials of longer versus shorter polychemotherapy, and about 6000 in 11 trials of anthracycline-containing regimens versus CMF (cyclophosphamide, methotrexate, and fluorouracil). For recurrence, polychemotherapy produced substantial and highly significant proportional reductions both among women aged under 50 at randomisation (35% [SD 4] reduction; 2p<0.00001) and among those aged 50-69 (20% [SD 3] reduction; 2p<0.00001); few women aged 70 or over had been studied. For mortality, the reductions were also significant both among women aged under 50 (27% [SD 5] reduction; 2p<0.00001) and among those aged 50-69 (11% [SD 3] reduction; 2p=0.0001). The recurrence reductions emerged chiefly during the first 5 years of follow-up, whereas the difference in survival grew throughout the first 10 years. After standardisation for age and time since randomisation, the proportional reductions in risk were similar for women with node-negative and node-positive disease. Applying the proportional mortality reduction observed in all women aged under 50 at randomisation would typically change a 10-year survival of 71% for those with node-negative disease to 78% (an absolute benefit of 7%), and of 42% for those with node-positive disease to 53% (an absolute benefit of 11%). 
The smaller proportional mortality reduction observed in all women aged 50-69 at randomisation would translate into smaller absolute benefits, changing a 10-year survival of 67% for those with node-negative disease to 69% (an absolute gain of 2%) and of 46% for those with node-positive disease to 49% (an absolute gain of 3%). The age-specific benefits of polychemotherapy appeared to be largely irrespective of menopausal status at presentation, oestrogen receptor status of the primary tumour, and of whether adjuvant tamoxifen had been given. In terms of other outcomes, there was a reduction of about one-fifth (2p=0.05) in contralateral breast cancer, which has already been included in the analyses of recurrence, and no apparent adverse effect on deaths from causes other than breast cancer (death rate ratio 0.89 [SD 0.09]). The directly randomised comparisons of longer versus shorter durations of polychemotherapy did not indicate any survival advantage with the use of more than about 3-6 months of polychemotherapy. By contrast, directly randomised comparisons did suggest that, compared with CMF alone, the anthracycline-containing regimens studied produced somewhat greater effects on recurrence (2p=0.006) and mortality (69% vs 72% 5-year survival; log-rank 2p=0.02). But this comparison is one of many that could have been selected for emphasis, the 99% CI reaches zero, and the results of several of the relevant trials are not yet available. Some months of adjuvant polychemotherapy (eg, with CMF or an anthracycline-containing regimen) typically produces an absolute improvement of about 7-11% in 10-year survival for women aged under 50 at presentation with early breast cancer, and of about 2-3% for those aged 50-69 (unless their prognosis is likely to be extremely good even without such treatment). 
Treatment decisions involve consideration not only of improvements in cancer recurrence and survival but also of adverse side-effects of treatment, and this report makes no recommendations as to who should or should not be treated.

Keywords: Adult; Aged; Antineoplastic Combined Chemotherapy Protocols, therapeutic use; Breast Neoplasms, chemistry/drug therapy/mortality; Chemotherapy, Adjuvant; Drug Administration Schedule; Female; Humans; Lymphatic Metastasis; Menopause; Middle Aged; Neoplasm Recurrence, Local; Randomized Controlled Trials as Topic; Receptors, Estrogen, analysis; Tamoxifen, administration /&/ dosage
[Durbin1998Biological] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998. [ bib ]
[Dietterich1998Experimental] T. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Mach. Learn., 40(40):139-157, 1998. [ bib | DOI | http | .pdf ]
[Denis1998PAC] F. Denis. PAC learning from positive statistical queries. In Proceedings of the 9th International Conference on Algorithmic Learning Theory, ALT '98, pages 112-126, London, UK, 1998. Springer-Verlag. [ bib | http ]
[Csorgo1998Limit] M. Csörgö and L. Horvath. Limit theorems in change-point analysis. John Wiley, New York, 1998. [ bib ]
[Chu1998Transcriptional] S. Chu, J. DeRisi, M. Eisen, J. Mulholland, D. Botstein, P.O. Brown, and I. Herskowitz. The Transcriptional Program of Sporulation in Budding Yeast. Science, 282:699-705, 1998. [ bib | .pdf | .pdf ]
[Chen1998Recursive] X. Chen, A. Rusinko III, and S. S. Young. Recursive Partitioning Analysis of a Large Structure-Activity Data Set Using Three-Dimensional Descriptors. J Chem Inf Comput Sci, 38:1054-1062, 1998. [ bib ]
Keywords: chemoinformatics
[Chen1998Atomic] S. S. Chen, D. L. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM J. Sci. Comput., 20(1):33-61, 1998. [ bib | DOI | http ]
[Burges1998Tutorial] C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Min. Knowl. Discov., 2(2):121-167, 1998. [ bib | .ps.gz | .pdf ]
[Bunke1998Graph] H. Bunke and K. Shearer. A Graph Distance Metric based on the Maximal Common Subgraph. Pattern Recogn. Lett., 19:255-259, 1998. [ bib | DOI ]
[Brown1998Chemoinformatics] F.K. Brown. Chemoinformatics : What is it and How does it Impact Drug Discovery. Annual Reports in Med. Chem., 33:375-384, 1998. [ bib ]
Keywords: chemoinformatics
[Breese1998Empirical] J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In 14th Conference on Uncertainty in Artificial Intelligence, pages 43-52, Madison, WI, 1998. Morgan Kaufmann. [ bib ]
[Blum1998Combining] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT' 98: Proceedings of the eleventh annual conference on Computational learning theory, pages 92-100, New York, NY, USA, 1998. ACM. [ bib | DOI | http ]
[Barron1998minimum] A. Barron, J. Rissanen, and Bin Yu. The minimum description length principle in coding and modeling. IEEE Trans. Inform. Theory, 44(6):2743-2760, Oct 1998. [ bib | .pdf ]
We review the principles of minimum description length and stochastic complexity as used in data compression and statistical modeling. Stochastic complexity is formulated as the solution to optimum universal coding problems extending Shannon's basic source coding theorem. The normalized maximized likelihood, mixture, and predictive codings are each shown to achieve the stochastic complexity to within asymptotically vanishing terms. We assess the performance of the minimum description length criterion both from the vantage point of quality of data compression and accuracy of statistical inference. Context tree modeling, density estimation, and model selection in Gaussian linear regression serve as examples.

Keywords: information-theory
[Amari1998Natural] S.-I. Amari. Natural Gradient Works Efficiently in Learning. Neural Computation, 10(2):251-276, 1998. [ bib | .ps.gz | .pdf ]
[Alon1998Finding] N. Alon, M. Krivelevich, and B. Sudakov. Finding a large hidden clique in a random graph. Random Struct. Algorithm., 13:457-466, 1998. [ bib | DOI | http | .pdf ]
[Pontil1998Properties] M. Pontil and A. Verri. Properties of support vector machines. Neural Comput, 10(4):955-74, May 1998. [ bib ]
Support vector machines (SVMs) perform pattern recognition between two point classes by finding a decision surface determined by certain points of the training set, termed support vectors (SV). This surface, which in some feature space of possibly infinite dimension can be regarded as a hyperplane, is obtained from the solution of a problem of quadratic programming that depends on a regularization parameter. In this article, we study some mathematical properties of support vectors and show that the decision surface can be written as the sum of two orthogonal terms, the first depending on only the margin vectors (which are SVs lying on the margin), the second proportional to the regularization parameter. For almost all values of the parameter, this enables us to predict how the decision surface varies for small parameter changes. In the special but important case of feature space of finite dimension m, we also show that m + 1 SVs are usually sufficient to determine the decision surface fully. For relatively small m, this latter result leads to a consistent reduction of the SV number.

Keywords: Algorithms, Artificial Intelligence, Automated, Biometry, Computers, DNA, Databases, Factual, Fungal, Fungal Proteins, GTP-Binding Proteins, Gene Expression, Genes, Learning, Linear Models, Markov Chains, Mathematics, Models, Neural Networks (Computer), Neurological, Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Nucleic Acid Hybridization, Open Reading Frames, P.H.S., Pattern Recognition, Protein, Protein Structure, Proteins, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Sequence Alignment, Sequence Analysis, Software, Statistical, Tertiary, U.S. Gov't, 9573414
[Roth1998Finding] F. P. Roth, J. D. Hughes, P. W. Estep, and G. M. Church. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat. Biotechnol., 16(10):939-945, October 1998. [ bib | DOI | http ]
Whole-genome mRNA quantitation can be used to identify the genes that are most responsive to environmental or genotypic change. By searching for mutually similar DNA elements among the upstream non-coding DNA sequences of these genes, we can identify candidate regulatory motifs and corresponding candidate sets of coregulated genes. We have tested this strategy by applying it to three extensively studied regulatory systems in the yeast Saccharomyces cerevisiae: galactose response, heat shock, and mating type. Galactose-response data yielded the known binding site of Gal4, and six of nine genes known to be induced by galactose. Heat shock data yielded the cell-cycle activation motif, which is known to mediate cell-cycle dependent activation, and a set of genes coding for all four nucleosomal proteins. Mating type alpha and a data yielded all of the four relevant DNA motifs and most of the known a- and alpha-specific genes.

Keywords: bioinformatics, genome-wide, tfs
[Lobo1998Applications] Miguel Sousa Lobo, Lieven Vandenberghe, Hervé Lebret, and Stephen Boyd. Applications of second-order cone programming. Linear Algebra and its Applications, 284:193-228, November 1998. [ bib ]
[Zupan1999Neural] J. Zupan and J. Gasteiger. Neural Networks in Chemistry and Drug Design. Wiley-VCH, 1999. [ bib ]
[Yewdell1999Immunodominance] J. W. Yewdell and J. R. Bennink. Immunodominance in major histocompatibility complex class I-restricted T lymphocyte responses. Annu. Rev. Immunol., 17:51-88, 1999. [ bib | DOI | http ]
Of the many thousands of peptides encoded by a complex foreign antigen that can potentially be presented to CD8+ T cells (TCD8+), only a small fraction induce measurable responses in association with any given major histocompatibility complex class I allele. To design vaccines that elicit optimal TCD8+ responses, a thorough understanding of this phenomenon, known as immunodominance, is imperative. Here we review recent progress in unraveling the molecular and cellular basis for immunodominance. Of foremost importance is peptide binding to class I molecules; only approximately 1/200 of potential determinants bind at greater than the threshold affinity (Kd > 500 nM) associated with immunogenicity. Limitations in the TCD8+ repertoire render approximately half of these peptides nonimmunogenic, and inefficient antigen processing further thins the ranks by approximately four fifths. As a result, only approximately 1/2000 of the peptides in a foreign antigen expressed by an appropriate antigen presenting cell achieve immunodominant status with a given class I allele. A roughly equal fraction of peptides have subdominant status, i.e. they induce weak-to-nondetectable primary TCD8+ responses in the context of their natural antigen. Subdominant determinants may be expressed at or above levels of immunodominant determinants, at least on antigen presenting cells in vitro. The immunogenicity of subdominant determinants is often limited by immunodomination: suppression mediated by TCD8+ specific for immunodominant determinants. Immunodomination is a central feature of TCD8+ responses, as it even occurs among clones responding to the same immunodominant determinant. Little is known about how immunodominant and subdominant determinants are distinguished by the TCD8+ repertoire, or how (and why) immunodomination occurs, but new tools are available to address these questions.

Keywords: immunoinformatics
[Yang1999Minimax] Y. Yang. Minimax nonparametric classification - Part I: rates of convergence. IEEE Trans. Inform. Theory, 45(7):2271-2284, 1999. [ bib | DOI | http | .pdf ]
[Wang1999Human] R. F. Wang. Human tumor antigens: implications for cancer vaccine development. J. Mol. Med., 77(9):640-655, Sep 1999. [ bib ]
The adoptive transfer of tumor-infiltrating lymphocytes along with interleukin 2 into autologous patients resulted in the objective regression of tumor in about 30% of patients with melanoma, indicating that these T cells play a role in tumor rejection. To understand the molecular basis of the T cell-cancer cell interaction we and others started to search for tumor antigens expressed on cancer cells recognized by T cells. This led to the identification of several major histocompatibility complex (MHC) class I restricted tumor antigens. These tumor antigens have been classified into several categories: tissue-specific differentiation antigens, tumor-specific shared antigens, and tumor-specific unique antigens. Because CD4+ T cells play a central role in orchestrating the host immune response against cancer, infectious diseases, and autoimmune diseases, a novel genetic approach has recently been developed to identify these MHC class II restricted tumor antigens. The identification of both MHC class I and II restricted tumor antigens provides new opportunities for the development of therapeutic strategies against cancer. This review summarizes the current status of tumor antigens and their potential applications to cancer treatment.

Keywords: immunoinformatics
[Varki1999Essentials] A. Varki, R. Cummings, J. Esko, H. Freeze, G. Hart, and J. Marth. Essentials of glycobiology. Cold Spring Harbor Laboratory Press, 1999. [ bib ]
[Tuschl1999Targeted] T. Tuschl, P.D. Zamore, R. Lehmann, D.P. Bartel, and P.A. Sharp. Targeted mRNA degradation by double-stranded RNA in vitro. Genes Dev., 13(24):3191-7, Dec 1999. [ bib ]
Double-stranded RNA (dsRNA) directs gene-specific, post-transcriptional silencing in many organisms, including vertebrates, and has provided a new tool for studying gene function. The biochemical mechanisms underlying this dsRNA interference (RNAi) are unknown. Here we report the development of a cell-free system from syncytial blastoderm Drosophila embryos that recapitulates many of the features of RNAi. The interference observed in this reaction is sequence specific, is promoted by dsRNA but not single-stranded RNA, functions by specific mRNA degradation, and requires a minimum length of dsRNA. Furthermore, preincubation of dsRNA potentiates its activity. These results demonstrate that RNAi can be mediated by sequence-specific processes in soluble reactions.

[Tomita1999Bioinformatics] M. Tomita, K. Hashimoto, K. Takahashi, T. S. Shimizu, Y. Matsuzaki, F. Miyoshi, K. Saito, S. Tanida, K. Yugi, J. C. Venter, and C. A. Hutchison. E-CELL: software environment for whole-cell simulation. Bioinformatics, 15(1):72-84, 1999. [ bib | DOI | arXiv | http ]
Keywords: csbcbook
[Tavazoie1999Systematic] S. Tavazoie, J. D. Hughes, M. J. Campbell, R. J. Cho, and G. M. Church. Systematic determination of genetic network architecture. Nat. Genet., 22:281-285, 1999. [ bib | DOI | http | .pdf ]
[Tamayo1999Interpreting] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. S. Lander, and T. R. Golub. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. U. S. A., 96(6):2907-2912, Mar 1999. [ bib | .pdf ]
Array technologies have made it straightforward to monitor simultaneously the expression pattern of thousands of genes. The challenge now is to interpret such massive data sets. The first step is to extract the fundamental patterns of gene expression inherent in the data. This paper describes the application of self-organizing maps, a type of mathematical cluster analysis that is particularly well suited for recognizing and classifying features in complex, multidimensional data. The method has been implemented in a publicly available computer package, GENECLUSTER, that performs the analytical calculations and provides easy data visualization. To illustrate the value of such analysis, the approach is applied to hematopoietic differentiation in four well studied models (HL-60, U937, Jurkat, and NB4 cells). Expression patterns of some 6,000 human genes were assayed, and an online database was created. GENECLUSTER was used to organize the genes into biologically relevant clusters that suggest novel hypotheses about hematopoietic differentiation, for example, highlighting certain genes and pathways involved in "differentiation therapy" used in the treatment of acute promyelocytic leukemia.

[Strahl1999Methylation] B. D. Strahl, R. Ohba, R. G. Cook, and C. D. Allis. Methylation of histone H3 at lysine 4 is highly conserved and correlates with transcriptionally active nuclei in Tetrahymena. Proc Natl Acad Sci U S A, 96(26):14967-14972, Dec 1999. [ bib ]
Studies into posttranslational modifications of histones, notably acetylation, have yielded important insights into the dynamic nature of chromatin structure and its fundamental role in gene expression. The roles of other covalent histone modifications remain poorly understood. To gain further insight into histone methylation, we investigated its occurrence and pattern of site utilization in Tetrahymena, yeast, and human HeLa cells. In Tetrahymena, transcriptionally active macronuclei, but not transcriptionally inert micronuclei, contain a robust histone methyltransferase activity that is highly selective for H3. Microsequence analyses of H3 from Tetrahymena, yeast, and HeLa cells indicate that lysine 4 is a highly conserved site of methylation, which to date, is the major site detected in Tetrahymena and yeast. These data document a nonrandom pattern of H3 methylation that does not overlap with known acetylation sites in this histone. In as much as H3 methylation at lysine 4 appears to be specific to macronuclei in Tetrahymena, we suggest that this modification pattern plays a facilitatory role in the transcription process in a manner that remains to be determined. Consistent with this possibility, H3 methylation in yeast occurs preferentially in a subpopulation of H3 that is preferentially acetylated.

Keywords: Acetyltransferases, metabolism; Amino Acid Sequence; Animals; Cell Nucleus, metabolism; Hela Cells; Histone Acetyltransferases; Histone-Lysine N-Methyltransferase; Histones, metabolism; Humans; Lysine, analogs /&/ derivatives/metabolism; Methylation; Methyltransferases, metabolism; Molecular Sequence Data; Protein Methyltransferases; Protein Processing, Post-Translational; Saccharomyces cerevisiae Proteins; Species Specificity; Tetrahymena thermophila; Transcription, Genetic; Yeasts
[Stadler1999"Spectral] P. F. Stadler. Spectral landscape theory. In J.P. Crutchfield and P. Schuster, editors, Evolutionary Dynamics - Exploring the Interplay of Selection, Neutrality, Accident and Function. Oxford University Press, New York, 1999. [ bib | .html ]
[Southern1999Molecular] Edwin Southern, Kalim Mir, and Mikhail Shchepinov. Molecular interactions on microarrays. Nat. Genet., 21:5-9, 1999. [ bib ]
Keywords: csbcbook, csbcbook-ch2
[Sette1999Nine] A. Sette and J. Sidney. Nine major HLA class I supertypes account for the vast preponderance of HLA-A and -B polymorphism. Immunogenetics, 50(3-4):201-212, Nov 1999. [ bib ]
Herein, we review the epitope approach to vaccine development, and discuss how knowledge of HLA supertypes might be used as a tool in the development of such vaccines. After reviewing the main structural features of the A2-, A3-, B7-, and B44- supertype alleles, and biological data demonstrating their immunological relevance, we analyze the frequency at which these supertype alleles are expressed in various ethnicities and discuss the relevance of those observations to vaccine development. Next, the existence of five new supertypes (A1, A24, B27, B58, and B62) is reported. As a result, it is possible to account for the predominance of all known HLA class I with only nine main functional binding specificities. The practical implications of this finding, as well as its relevance to understanding the functional implication of MHC polymorphism in humans, are discussed.

Keywords: Alleles; Amino Acid Sequence; Animals; Epitopes; HLA-A Antigens; HLA-B Antigens; Histocompatibility Antigens Class I; Humans; Molecular Sequence Data; Polymorphism, Genetic
[Scott99thegromos] W. R. P. Scott, I. G. Tironi, A. E. Mark, S. R. Billeter, J. F., A. E. Torda, T. Huber, and P. Kruger. The GROMOS biomolecular simulation program package. J. Phys. Chem. A, 103:3596-3607, 1999. [ bib ]
[Schoelkopf1999Kernel] B. Schölkopf, A.J. Smola, and K.-R. Müller. Kernel principal component analysis. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 327-352. MIT Press, 1999. [ bib | .pdf ]
[Rigaut1999generic] G. Rigaut, A. Shevchenko, B. Rutz, M. Wilm, M. Mann, and B. Séraphin. A generic protein purification method for protein complex characterization and proteome exploration. Nat Biotechnol, 17(10):1030-1032, Oct 1999. [ bib | DOI | http ]
We have developed a generic procedure to purify proteins expressed at their natural level under native conditions using a novel tandem affinity purification (TAP) tag. The TAP tag allows the rapid purification of complexes from a relatively small number of cells without prior knowledge of the complex composition, activity, or function. Combined with mass spectrometry, the TAP strategy allows for the identification of proteins interacting with a given target protein. The TAP method has been tested in yeast but should be applicable to other cells or organisms.

Keywords: Affinity Labels; Amino Acid Sequence; Electrophoresis, Polyacrylamide Gel; Methods; Molecular Sequence Data; Proteins; Proteome
[Rammensee1999SYFPEITHI] H. Rammensee, J. Bachmann, N. P. Emmerich, O. A. Bachor, and S. Stevanović. SYFPEITHI: database for MHC ligands and peptide motifs. Immunogenetics, 50(3-4):213-219, Nov 1999. [ bib ]
The first version of the major histocompatibility complex (MHC) databank SYFPEITHI: database for MHC ligands and peptide motifs, is now available to the general public. It contains a collection of MHC class I and class II ligands and peptide motifs of humans and other species, such as apes, cattle, chicken, and mouse, for example, and is continuously updated. All motifs currently available are accessible as individual entries. Searches for MHC alleles, MHC motifs, natural ligands, T-cell epitopes, source proteins/organisms and references are possible. Hyperlinks to the EMBL and PubMed databases are included. In addition, ligand predictions are available for a number of MHC allelic products. The database content is restricted to published data only.

Keywords: Amino Acid Motifs; Amino Acid Sequence; Animals; Databases, Factual; Humans; Internet; Ligands; Major Histocompatibility Complex; Molecular Sequence Data; Research Support, Non-U.S. Gov't; Sequence Homology, Amino Acid
[Pollack1999Genome] Jonathan R. Pollack, Charles M. Perou, Ash A. Alizadeh, Michael B. Eisen, Alexander Pergamenschikov, Cheryl F. Williams, Stefanie S. Jeffrey, David Botstein, and Patrick O. Brown. Genome-wide analysis of DNA copy-number changes using cDNA microarrays. Nat. Genet., 23:41-46, 1999. [ bib ]
Keywords: csbcbook, csbcbook-ch2
[Platt1999Fast] J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 185-208. MIT Press, Cambridge, MA, USA, 1999. [ bib ]
Keywords: kernel-theory
[Perou1999Distinctive] C. M. Perou, S. S. Jeffrey, M. van de Rijn, C. A. Rees, M. B. Eisen, D. T. Ross, A. Pergamenschikov, C. F. Williams, S. X. Zhu, J. C. Lee, D. Lashkari, D. Shalon, P. O. Brown, and D. Botstein. Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc. Natl. Acad. Sci. U S A, 96(16):9212-9217, Aug 1999. [ bib | DOI | http | .pdf ]
cDNA microarrays and a clustering algorithm were used to identify patterns of gene expression in human mammary epithelial cells growing in culture and in primary human breast tumors. Clusters of coexpressed genes identified through manipulations of mammary epithelial cells in vitro also showed consistent patterns of variation in expression among breast tumor samples. By using immunohistochemistry with antibodies against proteins encoded by a particular gene in a cluster, the identity of the cell type within the tumor specimen that contributed the observed gene expression pattern could be determined. Clusters of genes with coherent expression patterns in cultured cells and in the breast tumors samples could be related to specific features of biological variation among the samples. Two such clusters were found to have patterns that correlated with variation in cell proliferation rates and with activation of the IFN-regulated signal transduction pathway, respectively. Clusters of genes expressed by stromal cells and lymphocytes in the breast tumors also were identified in this analysis. These results support the feasibility and usefulness of this systematic approach to studying variation in gene expression patterns in human cancers as a means to dissect and classify solid tumors.

Keywords: csbcbook, csbcbook-ch3
[Osborne1999On] Michael R. Osborne, Brett Presnell, and Berwin A. Turlach. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9:319-337, 1999. [ bib ]
[Osborne1999A] M. R. Osborne, Brett Presnell, and B.A. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20:389-404, 1999. [ bib ]
[Murphy1999Modelling] K. Murphy and S. Mian. Modelling gene expression data using dynamic Bayesian networks. Technical report, Computer Science Division, University of California, Berkeley, CA., 1999. [ bib | .pdf ]
Keywords: biogm
[Mika1999Fisher] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.R. Müller. Fisher discriminant analysis with kernels. In Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas, editors, Neural Networks for Signal Processing IX, pages 41-48. IEEE, 1999. [ bib | .ps | .pdf ]
[McGregor1999Pharmacophore] M. J. McGregor and S. M. Muskal. Pharmacophore fingerprinting. 1. Application to QSAR and focused library design. J Chem Inf Comput Sci, 39(3):569-574, 1999. [ bib ]
A new method of rapid pharmacophore fingerprinting (PharmPrint method) has been developed. A basis set of 10,549 three-point pharmacophores has been constructed by enumerating several distance ranges and pharmacophoric features. Software has been developed to assign pharmacophoric types to atoms in chemical structures, generate multiple conformations, and construct the binary fingerprint according to the pharmacophores that result. The fingerprint is used as a descriptor for developing a quantitative structure-activity relationship (QSAR) model using partial least squares. An example is given using sets of ligands for the estrogen receptor (ER). The result is compared with previously published results on the same data to show the superiority of a full 3D, conformationally flexible approach. The QSAR model can be readily interpreted in structural/chemical terms. Further examples are given using binary activity data and some of our novel in-house compounds, which show the value of the model when crossing compound classes.

Keywords: Chemistry, Combinatorial Chemistry Techniques, Drug Design, Drug Evaluation, Estradiol Congeners, Estrogen, Least-Squares Analysis, Ligands, Models, Molecular, Pharmaceutical, Preclinical, Receptors, Software, Structure-Activity Relationship, 10361729
[Matter1999Comparing] H. Matter and T. Pötter. Comparing 3D pharmacophore triplets and 2D fingerprints for selecting diverse compound subsets. J. Chem. Inf. Comput. Sci., 39(6):1211-1225, 1999. [ bib | DOI | http | .pdf ]
The performance of two important 2D and 3D molecular descriptors for rational design to maximize the structural diversity of databases is investigated in this publication. Those methods are based either on a 2D description using a binary fingerprint, which accounts for the absence or presence of molecular fragments, or a 3D description based on the geometry of pharmacophoric features encoded in a fingerprint (pharmacophoric definition triplets, PDTs). Both descriptors in combination with maximum dissimilarity selections, complete linkage hierarchical cluster analysis, or sequential dissimilarity selections were compared to random subsets as reference. This comparison is based on their ability to cover representative biological classes from parent databases (coverage analysis) and the degree of separation between active and inactive compounds for a biological target from hierarchical clustering (cluster separation analysis). While the similarity coefficients (Tanimoto, cosine) show only a minor influence, the number of conformations to generate the 3D PDT fingerprint lead to remarkably different results. PDT fingerprints derived from a lower number of conformers perform significantly better, but they are not comparable to a 2D fingerprint-based design. When 2D and 3D descriptors are combined with weighting factors > 0.5 for 2D fingerprints, a significant improvement of coverage and cluster separation results is observed for a small number of PDT conformers and medium sized subsets. Some combined descriptors outperform 2D fingerprints, but not for all subset populations. Applying sequential dissimilarity selection to PDT descriptors reveals that its performance is dependent on the initial ordering of compounds, while presorting according to 2D fingerprint diversity does not improve results. 
Finally the relationship between biological activity and similarity was investigated, showing that PDTs quantify smaller structural differences due to the large number of bits in the fingerprint.

Keywords: chemoinformatics
[Mason1999New] J. S. Mason, I. Morize, P. R. Menard, D. L. Cheney, C. Hulme, and R. F. Labaudiniere. New 4-point pharmacophore method for molecular similarity and diversity applications: overview of the method and applications, including a novel approach to the design of combinatorial libraries containing privileged substructures. J. Med. Chem., 42(17):3251-3264, Aug 1999. [ bib | DOI | http ]
A new 4-point pharmacophore method for molecular similarity and diversity that rapidly calculates all potential pharmacophores/pharmacophoric shapes for a molecule or a protein site is described. The method, an extension to the ChemDiverse/Chem-X software (Oxford Molecular, Oxford, England), has also been customized to enable a new internally referenced measure of pharmacophore diversity. The "privileged" substructure concept for the design of high-affinity ligands is presented, and an example of this new method is described for the design of combinatorial libraries for 7-transmembrane G-protein-coupled receptor targets, where "privileged" substructures are used as special features to internally reference the pharmacophoric shapes. Up to 7 features and 15 distance ranges are considered, giving up to 350 million potential 4-point 3D pharmacophores/molecule. The resultant pharmacophore "key" ("fingerprint") serves as a powerful measure for diversity or similarity, calculable for both a ligand and a protein site, and provides a consistent frame of reference for comparing molecules, sets of molecules, and protein sites. Explicit "on-the-fly" conformational sampling is performed for a molecule to enable the calculation of all geometries accessible for all combinations of four features (i.e., 4-point pharmacophores) at any desired sampling resolution. For a protein site, complementary site points to groups displayed in the site are generated and all combinations of four site points are considered. 
In this paper we report (i) the details of our customized implementation of the method and its modification to systematically measure 4-point pharmacophores relative to a "special" substructure of interest present in the molecules under study; (ii) comparisons of 3- and 4-point pharmacophore methods, highlighting the much increased resolution of the 4-point method; (iii) applications of the 4-point potential pharmacophore descriptors as a new measure of molecular similarity and diversity and for the design of focused/biased combinatorial libraries.

[Marcotte1999Detecting] E.M. Marcotte, M. Pellegrini, H.-L. Ng, D.W. Rice, T.O. Yeates, and D. Eisenberg. Detecting Protein Function and Protein-Protein Interactions from Genome Sequences. Science, 285:751-753, 1999. [ bib | .pdf | .pdf ]
[Manallack1999Neural] D.T. Manallack and D.J. Livingstone. Neural networks in drug discovery: have they lived up to their promise? Eur. J. Med. Chem., 34:195-208, 1999. [ bib ]
[Mammen1999Smooth] E. Mammen and A. Tsybakov. Smooth discrimination analysis. Ann. Stat., 27(6):1808-1829, 1999. [ bib | DOI | http | .pdf ]
[Lee1999oscillatory] R. S. T. Lee and J. N. K. Liu. An oscillatory elastic graph matching model for recognition of offline handwritten Chinese characters. In KES, pages 284-287, 1999. [ bib ]
[Lee1999Learning] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788-791, Oct 1999. [ bib | DOI | http | .pdf ]
Is perception of the whole based on perception of its parts? There is psychological and physiological evidence for parts-based representations in the brain, and certain computational theories of object recognition rely on such representations. But little is known about how brains or computers might learn the parts of objects. Here we demonstrate an algorithm for non-negative matrix factorization that is able to learn parts of faces and semantic features of text. This is in contrast to other methods, such as principal components analysis and vector quantization, that learn holistic, not parts-based, representations. Non-negative matrix factorization is distinguished from the other methods by its use of non-negativity constraints. These constraints lead to a parts-based representation because they allow only additive, not subtractive, combinations. When non-negative matrix factorization is implemented as a neural network, parts-based representations emerge by virtue of two properties: the firing rates of neurons are never negative and synaptic strengths do not change sign.

[Lander1999Array] E. S. Lander. Array of hope. Nat Genet, 21(1 Suppl):3-4, Jan 1999. [ bib | DOI | http ]
Keywords: Animals; DNA, analysis; Gene Expression; Genetic Variation; Genome; Humans; Molecular Probe Techniques, trends; Oligonucleotide Array Sequence Analysis, methods; RNA, Messenger, analysis; Saccharomyces cerevisiae, genetics
[Knight99Decoding] K. Knight. Decoding complexity in word-replacement translation models. Computational Linguistics, 25:607-615, 1999. [ bib ]
[Jonsson1999Xpose] E. N. Jonsson and M. O. Karlsson. Xpose-an S-PLUS based population pharmacokinetic/pharmacodynamic model building aid for NONMEM. Comput Meth Prog Bio, 58(1):51-64, 1999. [ bib ]
[Joachims1999Transductive] T. Joachims. Transductive inference for text classification using support vector machines. In ICML '99: Proceedings of the Sixteenth International Conference on Machine Learning, pages 200-209, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc. [ bib | .pdf ]
Keywords: PUlearning
[Joachims1999Making] T. Joachims. Making Large-Scale SVM Learning Practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 169-184. MIT Press, 1999. [ bib | .pdf ]
[Jain1999Data] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Comput. Surv., 31:3, 1999. [ bib | .pdf ]
[Jaakkola1999Probabilistic] T. S. Jaakkola and D. Haussler. Probabilistic kernel regression models. In Proceedings of the 1999 Conference on AI and Statistics. Morgan Kaufmann, 1999. [ bib | .ps.gz | .pdf ]
[Jaakkola1999Exploiting] T. S. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In Proc. of Tenth Conference on Advances in Neural Information Processing Systems, 1999. [ bib | .ps | .pdf ]
Keywords: biosvm
[Jaakkola1999Using] T. S. Jaakkola, M. Diekhans, and D. Haussler. Using the Fisher Kernel Method to Detect Remote Protein Homologies. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pages 149-158. AAAI Press, 1999. [ bib ]
Keywords: biosvm
[Jaakkola1999Maximum] Tommi Jaakkola, Marina Meila, and Tony Jebara. Maximum entropy discrimination. In Adv. Neural Inform. Process. Syst., volume 12. MIT Press, Cambridge, MA, 1999. [ bib ]
[Hindsgaul1999Carbohydrate] O. Hindsgaul. Carbohydrate chemistry. Sugars out in the open. Nature, 399(6737):644-5, Jun 1999. [ bib | DOI | http | .pdf ]
Keywords: glycans
[Heckerman1999tutorial] D. Heckerman. A tutorial on learning with Bayesian networks. In M. Jordan, editor, Learning in graphical models, pages 301-354. MIT Press, Cambridge, MA, USA, 1999. [ bib | .pdf ]
Keywords: biogm
[Haussler1999Convolution] D. Haussler. Convolution Kernels on Discrete Structures. Technical Report UCSC-CRL-99-10, UC Santa Cruz, 1999. [ bib | .pdf ]
We introduce a new method of constructing kernels on sets whose elements are discrete structures like strings, trees and graphs. The method can be applied iteratively to build a kernel on an infinite set from kernels involving generators of the set. The family of kernels generated generalizes the family of radial basis kernels. It can also be used to define kernels in the form of joint Gibbs probability distributions. Kernels can be built from hidden Markov random fields, generalized regular expressions, pair-HMMs, or ANOVA decompositions. Uses of the method lead to open problems involving the theory of infinitely divisible positive definite functions. Fundamentals of this theory and the theory of reproducing kernel Hilbert spaces are reviewed and applied in establishing the validity of the method.

Keywords: biosvm
[Hastie1999Generalized] T. Hastie and R. Tibshirani. Generalized Additive Models. Chapman and Hall, London, UK, 1999. [ bib ]
[Hartwell1999a] L. H. Hartwell, J. J. Hopfield, S. Leibler, and A. W. Murray. From molecular to modular cell biology. Nature, 402(6761 Suppl):C47-C52, Dec 1999. [ bib | DOI | http ]
Cellular functions, such as signal transmission, are carried out by 'modules' made up of many species of interacting molecules. Understanding how modules work has depended on combining phenomenological analysis with molecular studies. General principles that govern the structure and behaviour of modules may be discovered with help from synthetic sciences such as engineering and computer science, from stronger interactions between experiment and theory in cell biology, and from an appreciation of evolutionary constraints.

Keywords: Action Potentials; Biological Evolution; Forecasting; Models, Biological; Molecular Biology, trends
[Hann1999Chemoinformatics] M. Hann and R. Green. Chemoinformatics-a new name for an old problem? Curr. Opin. Chem. Biol., 3(4):379-383, Aug 1999. [ bib | DOI | http ]
Library chemistry and high-throughput screening require greater use of chemoinformatics to increase their effectiveness. Recent advances in chemoinformatics include new molecular descriptors and pharmacophore techniques, statistical tools and their applications. Visualisation methods and hardware development are also opening new opportunities. The advent of a chemically aware web language and cross-platform working is ensuring that chemoinformatics methods are becoming available to all chemists in a more appropriate manner. Much time will continue to be wasted with incompatible file types without internationally agreed standards.

Keywords: Chemistry, Computers, Drug Design, Information Science, Software, 10419846
[Gyorfi1999simple] L. Gyorfi, G. Lugosi, and G. Morvai. A simple randomized algorithm for sequential prediction of ergodic time series. IEEE Trans. Inform. Theory, 47(5):2642-2650, Nov 1999. [ bib | .pdf ]
We present a simple randomized procedure for the prediction of a binary sequence. The algorithm uses ideas from previous developments of the theory of the prediction of individual sequences. We show that if the sequence is a realization of a stationary and ergodic random process then the average number of mistakes converges, almost surely, to that of the optimum, given by the Bayes predictor. The desirable finite-sample properties of the predictor are illustrated by its performance for Markov processes. In such cases the predictor exhibits near-optimal behavior even without knowing the order of the Markov process. Prediction with side information is also considered.

Keywords: information-theory
[Gygi1999Quantitative] S. P. Gygi, B. Rist, S. A. Gerber, F. Turecek, M. H. Gelb, and R. Aebersold. Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat Biotechnol, 17(10):994-999, Oct 1999. [ bib | DOI | http ]
We describe an approach for the accurate quantification and concurrent sequence identification of the individual proteins within complex mixtures. The method is based on a class of new chemical reagents termed isotope-coded affinity tags (ICATs) and tandem mass spectrometry. Using this strategy, we compared protein expression in the yeast Saccharomyces cerevisiae, using either ethanol or galactose as a carbon source. The measured differences in protein expression correlated with known yeast metabolic function under glucose-repressed conditions. The method is redundant if multiple cysteinyl residues are present, and the relative quantification is highly accurate because it is based on stable isotope dilution techniques. The ICAT approach should provide a widely applicable means to compare quantitatively global protein expression in cells and tissues.

Keywords: Affinity Labels; Amino Acid Sequence; Chromatography, Liquid; Isotope Labeling; Mass Spectrometry; Proteins
[Gordon1999Classification] A. D. Gordon. Classification. Chapman & Hall/CRC, 1999. [ bib ]
[Golub1999Molecular] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286:531-537, 1999. [ bib | DOI | http | .pdf ]
Although cancer classification has improved over the past 30 years, there has been no general approach for identifying new cancer classes (class discovery) or for assigning tumors to known classes (class prediction). Here, a generic approach to cancer classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemias as a test case. A class discovery procedure automatically discovered the distinction between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) without previous knowledge of these classes. An automatically derived class predictor was able to determine the class of new leukemia cases. The results demonstrate the feasibility of cancer classification based solely on gene expression monitoring and suggest a general strategy for discovering and predicting cancer classes for other types of cancer, independent of previous biological knowledge.

Keywords: csbcbook, csbcbook-ch3, csbcbook-ch4
[Ferea1999Systematic] T. L. Ferea, D. Botstein, P. O. Brown, and R. F. Rosenzweig. Systematic changes in gene expression patterns following adaptive evolution in yeast. Proc. Natl. Acad. Sci. USA, 96(17):9721-9726, 1999. [ bib | .pdf | .pdf ]
[Faloutsos1999On] M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. Comput. Comm. Rev., 29:251-262, 1999. [ bib | .html | .pdf ]
[Debouck1999DNA] C. Debouck and P. N. Goodfellow. DNA microarrays in drug discovery and development. Nat. Genet., 21(1 Suppl):48-50, Jan 1999. [ bib | DOI | http ]
DNA microarrays can be used to measure the expression patterns of thousands of genes in parallel, generating clues to gene function that can help to identify appropriate targets for therapeutic intervention. They can also be used to monitor changes in gene expression in response to drug treatments. Here, we discuss the different ways in which microarray analysis is likely to affect drug discovery.

Keywords: Agricultural, Alleles, Alternaria, Amino Acid, Amino Acid Chloromethyl Ketones, Amino Acid Sequence, Animal, Animals, Apoptosis, Asthma, Bacteria, Base Sequence, Binding Sites, Biotechnology, Blotting, Bone Density, Bone Matrix, Bone and Bones, CCR5, Camptothecin, Caspases, Cathepsins, Cell Surface, Central America, Chloroplast, Chondrocytes, Chromosome Mapping, Chromosomes, Cloning, Cluster Analysis, Collagen, Comparative Study, Coumarins, Crops, Crystallography, DNA, DNA Primers, Dipeptides, Disease, Disease Models, Drug Design, Drug Evaluation, Drug Industry, Enzyme Activation, Enzyme Inhibitors, Escherichia coli, Evolution, Exons, Expressed Sequence Tags, Female, Fetus, Fluorescent Dyes, Food Microbiology, Founder Effect, GTP-Binding Proteins, Gene Expression, Gene Frequency, Gene Library, Genes, Genetic, Genetic Predisposition to Disease, Genome, Geography, Growth Plate, Haplotypes, Hordeum, Human, Humans, Inclusion Bodies, Injections, Intraperitoneal, Introns, Isatin, Knockout, Male, Membrane Proteins, Messenger, Mice, Models, Molecular, Molecular Sequence Data, Molecular Structure, Mutation, Mycotoxins, Neutrophils, Non-U.S. Gov't, Northern, Oligonucleotide Array Sequence Analysis, Osteoarthritis, Osteochondrodysplasias, Osteoclasts, Osteopetrosis, Pair 15, Phaseolus, Polymorphism, Preclinical, Pregnancy, Promoter Regions (Genetics), Protein Precursors, Proteomics, RNA, Receptors, Recombinant Fusion Proteins, Recombinant Proteins, Research Support, Restriction Fragment Length, Ribosomal Proteins, Sequence Alignment, Sequence Analysis, Sequence Homology, South America, Species Specificity, Splenomegaly, Sulfonamides, Synteny, Tissue Distribution, Transcription, Trichothecenes, X-Ray, 9915501
[Cuff1999Evaluation] J. A. Cuff and G. J. Barton. Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Protein. Struct. Funct. Genet., 34:508-519, 1999. [ bib | http | .pdf ]
[Cordella1999Performance] L. P. Cordella, P. Foggia, C. Sansone, and M. Vento. Performance evaluation of the vf graph matching algorithm. In ICIAP '99: Proceedings of the 10th International Conference on Image Analysis and Processing, page 1172, Washington, DC, USA, 1999. IEEE Computer Society. [ bib ]
[Chen1999Modeling] T. Chen, H. L. He, and G. M. Church. Modeling gene expression with differential equations. Pac. Symp. Biocomput., 4:29-40, 1999. [ bib ]
We propose a differential equation model for gene expression and provide two methods to construct the model from a set of temporal data. We model both transcription and translation by kinetic equations with feedback loops from translation products to transcription. Degradation of proteins and mRNAs is also incorporated. We study two methods to construct the model from experimental data: Minimum Weight Solutions to Linear Equations (MWSLE), which determines the regulation by solving under-determined linear equations, and Fourier Transform for Stable Systems (FTSS), which refines the model with cell cycle constraints. The results suggest that a minor set of temporal data may be sufficient to construct the model at the genome level. We also give a comprehensive discussion of other extended models: the RNA Model, the Protein Model, and the Time Delay Model.

[Carlo1999Phylogenetic] Shuying Li, Dennis K. Pearl, and Hani Doss. Phylogenetic tree construction using Markov chain Monte Carlo. Journal of the American Statistical Association, 95:493-508, 1999. [ bib ]
[Bockaert1999Molecular] J. Bockaert and J. P. Pin. Molecular tinkering of G protein-coupled receptors: an evolutionary success. EMBO J., 18(7):1723-1729, Apr 1999. [ bib | DOI | http ]
Among membrane-bound receptors, the G protein-coupled receptors (GPCRs) are certainly the most diverse. They have been very successful during evolution, being capable of transducing messages as different as photons, organic odorants, nucleotides, nucleosides, peptides, lipids and proteins. Indirect studies, as well as two-dimensional crystallization of rhodopsin, have led to a useful model of a common 'central core', composed of seven transmembrane helical domains, and its structural modifications during activation. There are at least six families of GPCRs showing no sequence similarity. They use an amazing number of different domains both to bind their ligands and to activate G proteins. The fine-tuning of their coupling to G proteins is regulated by splicing, RNA editing and phosphorylation. Some GPCRs have been found to form either homo- or heterodimers with a structurally different GPCR, but also with membrane-bound proteins having one transmembrane domain such as nina-A, odr-4 or RAMP, the latter being involved in their targeting, function and pharmacology. Finally, some GPCRs are unfaithful to G proteins and interact directly, via their C-terminal domain, with proteins containing PDZ and Enabled/VASP homology (EVH)-like domains.

Keywords: chemogenomics
[Bhalla1999Emergent] U. S. Bhalla and R. Iyengar. Emergent properties of networks of biological signaling pathways. Science, 283(5400):381-387, 1999. [ bib | DOI | arXiv | http | .pdf ]
Keywords: csbcbook
[Bertsekas1999Nonlinear] D. Bertsekas. Nonlinear programming. Athena Scientific, 1999. [ bib ]
[Bejerano1999Modeling] G. Bejerano and G. Yona. Modeling protein families using probabilistic suffix trees. In Proceedings of RECOMB 1999, pages 15-24. ACM Press, 1999. [ bib | .pdf ]
[Barabasi1999Emergence] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286:509-512, 1999. [ bib | .pdf | .pdf ]
Systems as diverse as genetic networks or the World Wide Web are best described as networks with complex topology. A common property of many large networks is that the vertex connectivities follow a scale-free power-law distribution. This feature was found to be a consequence of two generic mechanisms: (i) networks expand continuously by the addition of new vertices, and (ii) new vertices attach preferentially to sites that are already well connected. A model based on these two ingredients reproduces the observed stationary scale-free distributions, which indicates that the development of large networks is governed by robust self-organizing phenomena that go beyond the particulars of the individual systems.

[Baldi1999Exploiting] P. Baldi, S. Brunak, P. Frasconi, G. Soda, and G. Pollastri. Exploiting the past and the future in protein secondary structure prediction. Bioinformatics, 15:937-946, 1999. [ bib | .pdf | .pdf ]
[Armstrong1999review] J. W. Armstrong. A review of high-throughput screening approaches for drug discovery. Application note, 1999. [ bib ]
[Amari1999Improving] S.-I. Amari and S. Wu. Improving support vector machine classifiers by modifying kernel functions. Neural Networks, 12(6):783-789, Jul 1999. [ bib | .ps | .pdf ]
We propose a method of modifying a kernel function to improve the performance of a support vector machine classifier. This is based on the structure of the Riemannian geometry induced by the kernel function. The idea is to enlarge the spatial resolution around the separating boundary surface, by a conformal mapping, such that the separability between classes is increased. Examples are given specifically for modifying Gaussian Radial Basis Function kernels. Simulation results for both artificial and real data show remarkable improvement of generalization errors, supporting our idea.

[Albert1999Diameter] R. Albert, H. Jeong, and A.-L. Barabási. Diameter of the World-Wide Web. Nature, 401:130-131, 1999. [ bib | .pdf | .pdf ]
[Aguda1999CellProl] B. D. Aguda and Y. Tang. The kinetic origins of the restriction point in the mammalian cell cycle. Cell Prolif, 32(5):321-35, 1999. [ bib ]
A detailed model mechanism for the G1/S transition in the mammalian cell cycle is presented and analysed by computer simulation to investigate whether the kinetic origins of the restriction point (R-point) can be identified. The R-point occurs in mid-to-late G1 phase and marks the transition between mitogen-dependent to mitogen-independent progression of the cell cycle. For purposes of computer simulations, the R-point is defined as the first point in time after mitosis where cutting off mitogen stimulation does not prevent the cell reaching the threshold activity of cyclin-E/cdk2 required for entry into S phase. The key components of the network that generate a dynamic switching behaviour associated with the R-point include a positive feedback loop between cyclin-E/cdk2 and Cdc25A, along with the mutually negative interaction between the cdk inhibitor p27Kip1 and cyclin-E/cdk2. Simulations of the passage through the R-point were carried out and the factors affecting the position of the R-point in G1 are determined. The detailed model also shows various points in the network where the activation of cyclin-E/cdk2 can be initiated with or without the involvement of the retinoblastoma protein.

Keywords: csbcbook
[Aguda1999PNAS] B. D. Aguda. A quantitative analysis of the kinetics of the G(2) DNA damage checkpoint system. Proc Natl Acad Sci U S A, 96(20):11352-7, 1999. [ bib ]
A detailed model of the G(2) DNA damage checkpoint (G2DDC) system is presented that includes complex regulatory networks of the mitotic kinase Cdc2, phosphatase Cdc25, Wee1 kinase, and damage signal transduction pathways involving Chk1 and p53. Assumptions on the kinetic equations of the G2DDC are made, and computer simulations are carried out to demonstrate how the various subsystems operate to delay or arrest cell cycle progression. The detailed model could be used to explain various experiments relevant to G2DDC reported recently, including the nuclear export of 14-3-3-bound Cdc25, the down-regulation of cyclin B1 expression by p53, the effect of Chk1 and p53 on Cdc25 levels, and Wee1 degradation. It also is shown that, under certain conditions, p53 is necessary to sustain a G(2) arrest.

Keywords: csbcbook
[Aguda1999Oncogene] B. D. Aguda. Instabilities in phosphorylation-dephosphorylation cascades and cell cycle checkpoints. Oncogene, 18(18):2846-51, 1999. [ bib ]
The G2-M checkpoint in the cell cycle is identified with a set of phosphorylation-dephosphorylation (PD) cycles involving Cdc25 and the maturation-promoting factor (MPF); these PD cycles are coupled in a way that generates an instability. This instability arises out of a transcritical bifurcation which could be exploited by the G2 DNA damage checkpoint pathway in order to arrest or delay entry into mitosis. The coupling between PD cycles involving Wee1 and MPF does not lead to an instability and therefore Wee1 may not be a crucial target of the checkpoint pathway. A set of PD cycles exhibiting transcritical bifurcation also possesses the integrative ability of a checkpoint for 'checking' that prerequisites are satisfied prior to the next cell cycle event. Such a set of coupled PD cycles is suggested to be a core mechanism of cell cycle checkpoints.

Keywords: csbcbook
[Pellegrini1999Assigning] M. Pellegrini, E. M. Marcotte, M. J. Thompson, D. Eisenberg, and T. O. Yeates. Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. Proc. Natl. Acad. Sci. USA, 96:4285-4288, April 1999. [ bib | .pdf | .pdf ]
[Toeroenen1999Analysis] P. Törönen, M. Kolehmainen, G. Wong, and E. Castrén. Analysis of gene expression data using self-organizing maps. FEBS Lett., 451(2):142-146, May 1999. [ bib | .pdf ]
DNA microarray technologies together with rapidly increasing genomic sequence information is leading to an explosion in available gene expression data. Currently there is a great need for efficient methods to analyze and visualize these massive data sets. A self-organizing map (SOM) is an unsupervised neural network learning algorithm which has been successfully used for the analysis and organization of large data files. We have here applied the SOM algorithm to analyze published data of yeast gene expression and show that SOM is an excellent tool for the analysis and visualization of gene expression profiles.

[Mathews1999Expandeda] D. H. Mathews, J. Sabina, M. Zuker, and D. H. Turner. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol., 288(5):911-940, May 1999. [ bib | DOI | http ]
An improved dynamic programming algorithm is reported for RNA secondary structure prediction by free energy minimization. Thermodynamic parameters for the stabilities of secondary structure motifs are revised to include expanded sequence dependence as revealed by recent experiments. Additional algorithmic improvements include reduced search time and storage for multibranch loop free energies and improved imposition of folding constraints. An extended database of 151,503 nt in 955 structures determined by comparative sequence analysis was assembled to allow optimization of parameters not based on experiments and to test the accuracy of the algorithm. On average, the predicted lowest free energy structure contains 73% of known base-pairs when domains of fewer than 700 nt are folded; this compares with 64% accuracy for previous versions of the algorithm and parameters. For a given sequence, a set of 750 generated structures contains one structure that, on average, has 86% of known base-pairs. Experimental constraints, derived from enzymatic and flavin mononucleotide cleavage, improve the accuracy of structure predictions.

Keywords: 16S, 23S, 5S, Affinity, Algorithms, Aluminum Silicates, Amino Acid, Amino Acid Sequence, Amyloidosis, Archaeal, Bacillus, Bacterial, Bacterial Proteins, Bacteriophage T4, Base Sequence, Chloroplast, Chromatography, Circular Dichroism, Comparative Study, Computational Biology, Databases, Electrophoresis, Entropy, Enzyme Stability, Escherichia coli, Factual, Fibroblast Growth Factor 2, Flavin Mononucleotide, Fluorescence, Genetic, Guanidine, Humans, Huntington Disease, Kinetics, Light, Models, Molecular Sequence Data, Non-P.H.S., Non-U.S. Gov't, Nucleic Acid Conformation, P.H.S., Peptides, Phylogeny, Polyacrylamide Gel, Predictive Value of Tests, Protein Binding, Protein Denaturation, Protein Folding, Protein Structure, RNA, Radiation, Recombinant Proteins, Research Support, Ribosomal, Scattering, Secondary, Sequence Homology, Solutions, Spectrometry, Statistical, Temperature, Thermodynamics, Time Factors, Trinucleotide Repeat Expansion, U.S. Gov't, alpha-Amylase, 10329189
[Mathews1999Expanded] D. H. Mathews, J. Sabina, M. Zuker, and D. H. Turner. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol., 288(5):911-40, May 1999. [ bib | DOI | http ]
An improved dynamic programming algorithm is reported for RNA secondary structure prediction by free energy minimization. Thermodynamic parameters for the stabilities of secondary structure motifs are revised to include expanded sequence dependence as revealed by recent experiments. Additional algorithmic improvements include reduced search time and storage for multibranch loop free energies and improved imposition of folding constraints. An extended database of 151,503 nt in 955 structures determined by comparative sequence analysis was assembled to allow optimization of parameters not based on experiments and to test the accuracy of the algorithm. On average, the predicted lowest free energy structure contains 73% of known base-pairs when domains of fewer than 700 nt are folded; this compares with 64% accuracy for previous versions of the algorithm and parameters. For a given sequence, a set of 750 generated structures contains one structure that, on average, has 86% of known base-pairs. Experimental constraints, derived from enzymatic and flavin mononucleotide cleavage, improve the accuracy of structure predictions.

Keywords: sirna
[Fields1999Functional] S. Fields, Y. Kohara, and D. J. Lockhart. Functional genomics. Proc. Natl. Acad. Sci. USA, 96:8825-8826, August 1999. [ bib | .pdf | .pdf ]
[Marcotte1999combined] E. M. Marcotte, M. Pellegrini, M. J. Thompson, T. O. Yeates, and D. Eisenberg. A combined algorithm for genome-wide prediction of protein function. Nature, 402:83-86, November 1999. [ bib | http | .pdf ]
[Lugosi1999Adaptive] G. Lugosi and A. Nobel. Adaptive Model Selection Using Empirical Complexities. Ann. Stat., 27(6):1830-1864, December 1999. [ bib | .ps | .pdf ]
[Zien2000Engineering] A. Zien, G. Rätsch, S. Mika, B. Schölkopf, T. Lengauer, and K.-R. Müller. Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics, 16(9):799-807, 2000. [ bib | http | .pdf ]
Motivation: In order to extract protein sequences from nucleotide sequences, it is an important step to recognize points at which regions start that code for proteins. These points are called translation initiation sites (TIS). Results: The task of finding TIS can be modeled as a classification problem. We demonstrate the applicability of support vector machines for this task, and show how to incorporate prior biological knowledge by engineering an appropriate kernel function. With the described techniques the recognition performance can be improved by 26% over leading existing approaches. We provide evidence that existing related methods (e.g. ESTScan) could profit from advanced TIS recognition.

Keywords: biosvm
[Zhu2000Two] G. Zhu, P. T. Spellman, T. Volpe, P. O. Brown, D. Botstein, T. N. Davis, and B. Futcher. Two yeast forkhead genes regulate the cell cycle and pseudohyphal growth. Nature, 406:90-94, 2000. [ bib | http | .pdf ]
[Zamore2000RNAi] P.D. Zamore, T. Tuschl, P.A. Sharp, and D.P. Bartel. RNAi: double-stranded RNA directs the ATP-dependent cleavage of mRNA at 21 to 23 nucleotide intervals. Cell, 101(1):25-33, Mar 2000. [ bib | DOI | http ]
Double-stranded RNA (dsRNA) directs the sequence-specific degradation of mRNA through a process known as RNA interference (RNAi). Using a recently developed Drosophila in vitro system, we examined the molecular mechanism underlying RNAi. We find that RNAi is ATP dependent yet uncoupled from mRNA translation. During the RNAi reaction, both strands of the dsRNA are processed to RNA segments 21-23 nucleotides in length. Processing of the dsRNA to the small RNA fragments does not require the targeted mRNA. The mRNA is cleaved only within the region of identity with the dsRNA. Cleavage occurs at sites 21-23 nucleotides apart, the same interval observed for the dsRNA itself, suggesting that the 21-23 nucleotide fragments from the dsRNA are guiding mRNA cleavage.

Keywords: sirna
[Xue2000Molecular] L. Xue and J. Bajorath. Molecular descriptors in chemoinformatics, computational combinatorial chemistry, and virtual screening. Comb. Chem. High. Throughput Screen., 3(5):363-372, Oct 2000. [ bib ]
Many contemporary applications in computer-aided drug discovery and chemoinformatics depend on representations of molecules by descriptors that capture their structural characteristics and properties. Such applications include, among others, diversity analysis, library design, and virtual screening. Hundreds of molecular descriptors have been reported in the literature, ranging from simple bulk properties to elaborate three-dimensional formulations and complex molecular fingerprints, which sometimes consist of thousands of bit positions. Knowledge-based selection of descriptors that are suitable for specific applications is an important task in chemoinformatics research. If descriptors are to be selected on rational grounds, rather than guesses or chemical intuition, detailed evaluation of their performance is required. A number of studies have been reported that investigate the performance of molecular descriptors in specific applications and/or introduce novel types of descriptors. Progress made in this area is reviewed here in the context of other computational developments in combinatorial chemistry and compound screening.

Keywords: chemoinformatics
[Xie2000Asymptotic] Q. Xie and A.R. Barron. Asymptotic minimax regret for data compression, gambling, and prediction. IEEE Trans. Inform. Theory, 46(2):431-445, Mar 2000. [ bib | .pdf ]
For problems of data compression, gambling, and prediction of individual sequences $x_1, \ldots, x_n$ the following questions arise. Given a target family of probability mass functions $p(x_1, \ldots, x_n \mid \theta)$, how do we choose a probability mass function $q(x_1, \ldots, x_n)$ so that it approximately minimizes the maximum regret $\max_{x_1, \ldots, x_n} \left( \log 1/q(x_1, \ldots, x_n) - \log 1/p(x_1, \ldots, x_n \mid \hat{\theta}) \right)$ and so that it achieves the best constant $C$ in the asymptotics of the minimax regret, which is of the form $(d/2) \log(n/2\pi) + C + o(1)$, where $d$ is the parameter dimension? Are there easily implementable strategies $q$ that achieve those asymptotics? And how does the solution to the worst case sequence problem relate to the solution to the corresponding expectation version $\min_q \max_\theta E_\theta \left( \log 1/q(x_1, \ldots, x_n) - \log 1/p(x_1, \ldots, x_n \mid \theta) \right)$? In the discrete memoryless case, with a given alphabet of size $m$, the Bayes procedure with the Dirichlet$(1/2, \ldots, 1/2)$ prior is asymptotically maximin. Simple modifications of it are shown to be asymptotically minimax. The best constant is $C_m = \log\left( \Gamma(1/2)^m / \Gamma(m/2) \right)$, which agrees with the logarithm of the integral of the square root of the determinant of the Fisher information. Moreover, our asymptotically optimal strategies for the worst case problem are also asymptotically optimal for the expectation version. Analogous conclusions are given for the case of prediction, gambling, and compression when, for each observation, one has access to side information from an alphabet of size $k$. In this setting the minimax regret is shown to be $\frac{k(m-1)}{2} \log \frac{n}{2\pi k} + k C_m + o(1)$.

Keywords: information-theory
[Williamson2000Entropy] R.C. Williamson, A.J. Smola, and B. Schoelkopf. Entropy Numbers of Linear Function Classes. In Proc. 13th Annu. Conference on Comput. Learning Theory, pages 309-319. Morgan Kaufmann, San Francisco, 2000. [ bib | .pdf ]
[Wilbur2000Boosting] W. J. Wilbur. Boosting naive Bayesian learning on a large subset of MEDLINE. Proc AMIA Symp, pages 918-22, 2000. [ bib ]
We are concerned with the rating of new documents that appear in a large database (MEDLINE) and are candidates for inclusion in a small specialty database (REBASE). The requirement is to rank the new documents as nearly in order of decreasing potential to be added to the smaller database as possible, so as to improve the coverage of the smaller database without increasing the effort of those who manage this specialty database. To perform this ranking task we have considered several machine learning approaches based on the naïve Bayesian algorithm. We find that adaptive boosting outperforms naïve Bayes, but that a new form of boosting which we term staged Bayesian retrieval outperforms adaptive boosting. Staged Bayesian retrieval involves two stages of Bayesian retrieval and we further find that if the second stage is replaced by a support vector machine we again obtain a significant improvement over the strictly Bayesian approach.

Keywords: Acute, Acute Disease, Adenocarcinoma, Algorithms, Amino Acid Sequence, Animals, Artificial Intelligence, Automated, B-Lymphocytes, Bacterial Proteins, Base Pair Mismatch, Base Sequence, Bayes Theorem, Binding Sites, Biological, Bone Marrow Cells, Brachyura, Cell Compartmentation, Chemistry, Child, Chromosome Aberrations, Classification, Codon, Colonic Neoplasms, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA, Data Interpretation, Databases, Decision Trees, Diabetes Mellitus, Diagnosis, Discriminant Analysis, Discrimination Learning, Electric Conductivity, Electrophysiology, Escherichia coli Proteins, Factual, Feedback, Female, Fungal, Gastric Emptying, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Genetic Predisposition to Disease, Genomics, Hemolysins, Humans, Indians, Information Storage and Retrieval, Initiator, Ion Channels, Kinetics, Leukemia, Likelihood Functions, Lipid Bilayers, Logistic Models, Lymphocytic, MEDLINE, Male, Markov Chains, Melanoma, Models, Molecular, Myeloid, Neoplasm, Neoplasms, Neoplastic, Neural Networks (Computer), Neurological, Nevus, Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Normal Distribution, North American, Nucleic Acid Conformation, Oligonucleotide Array Sequence Analysis, Organ Specificity, Organelles, Ovarian Neoplasms, Ovary, P.H.S., Pattern Recognition, Physical, Pigmented, Predictive Value of Tests, Promoter Regions (Genetics), Protein Biosynthesis, Protein Folding, Protein Structure, Proteins, Proteome, RNA, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Secondary, Sensitivity and Specificity, Sequence Alignment, Sequence Analysis, Sex Characteristics, Skin Diseases, Skin Neoplasms, Skin Pigmentation, Software, Sound Spectrography, Statistical, Stomach Diseases, T-Lymphocytes, Thermodynamics, Transcription, Transcription Factors, Tumor Markers, Type 2, U.S. Gov't, Vertebrates, 11080018
[Watkins2000Dynamic] C. Watkins. Dynamic alignment kernels. In A.J. Smola, P.L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 39-50. MIT Press, Cambridge, MA, 2000. [ bib | .ps.gz | .pdf ]
Keywords: biosvm
[Vert2000Double] J.-P. Vert. Double mixture and universal inference. Technical Report DMA-00-15, Ecole Normale Supérieure, 2000. [ bib ]
[Vapnik2000Bounds] V. Vapnik and O. Chapelle. Bounds on error expectation for support vector machines. Neural Comput, 12(9):2013-36, Sep 2000. [ bib ]
We introduce the concept of span of support vectors (SV) and show that the generalization ability of support vector machines (SVM) depends on this new geometrical concept. We prove that the value of the span is always smaller (and can be much smaller) than the diameter of the smallest sphere containing the support vectors, used in previous bounds (Vapnik, 1998). We also demonstrate experimentally that the prediction of the test error given by the span is very accurate and has direct application in model selection (choice of the optimal parameters of the SVM).

Keywords: Automated, Learning, Models, Neural Networks (Computer), Neurological, Pattern Recognition, 10976137
[Uetz2000comprehensive] P. Uetz, L. Giot, G. Cagney, T. A. Mansfield, R. S. Judson, J. R. Knight, D. Lockshon, V. Narayan, M. Srinivasan, P. Pochart, A. Qureshi-Emili, Y. Li, B. Godwin, D. Conover, T. Kalbfleish, G. Vijayadamodar, M. Yang, M. Johnston, S. Fields, and J. M. Rothberg. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 403:623-627, 2000. [ bib | http | .pdf ]
[Tenenbaum2000global] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-23, Dec 2000. [ bib | DOI | http | .pdf ]
Scientists working with large volumes of high-dimensional data, such as global climate patterns, stellar spectra, or human gene distributions, regularly confront the problem of dimensionality reduction: finding meaningful low-dimensional structures hidden in their high-dimensional observations. The human brain confronts the same problem in everyday perception, extracting from its high-dimensional sensory inputs (30,000 auditory nerve fibers or 10^6 optic nerve fibers) a manageably small number of perceptually relevant features. Here we describe an approach to solving dimensionality reduction problems that uses easily measured local metric information to learn the underlying global geometry of a data set. Unlike classical techniques such as principal component analysis (PCA) and multidimensional scaling (MDS), our approach is capable of discovering the nonlinear degrees of freedom that underlie complex natural observations, such as human handwriting or images of a face under different viewing conditions. In contrast to previous algorithms for nonlinear dimensionality reduction, ours efficiently computes a globally optimal solution, and, for an important class of data manifolds, is guaranteed to converge asymptotically to the true structure.

Keywords: dimred
[Strahl2000language] B. D. Strahl and C. D. Allis. The language of covalent histone modifications. Nature, 403(6765):41-45, Jan 2000. [ bib | DOI | http ]
Histone proteins and the nucleosomes they form with DNA are the fundamental building blocks of eukaryotic chromatin. A diverse array of post-translational modifications that often occur on tail domains of these proteins has been well documented. Although the function of these highly conserved modifications has remained elusive, converging biochemical and genetic evidence suggests functions in several chromatin-based processes. We propose that distinct histone modifications, on one or more tails, act sequentially or in combination to form a 'histone code' that is read by other proteins to bring about distinct downstream events.

Keywords: Acetylation; Amino Acid Sequence; Animals; Chromatin, physiology; Histones, chemistry/metabolism/physiology; Humans; Lysine, physiology; Microtubules, physiology; Models, Biological; Molecular Sequence Data; Phosphorylation; Protein Processing, Post-Translational; Serine, metabolism
[Slanina2000Random] F. Slanina and M. Kotrla. Random networks created by biological evolution. Phys. Rev. E, 62(5):6170-6177, 2000. [ bib | http | .pdf ]
[Selinger2000RNA] Douglas W. Selinger, Kevin J. Cheung, Rui Mei, Erik M. Johansson, Craig S. Richmond, Frederick R. Blattner, David J. Lockhart, and George M. Church. RNA expression analysis using a 30 base pair resolution Escherichia coli genome array. Nat. Biotechnol., 18:1262-1268, 2000. [ bib | http | .pdf ]
[Scholkopf2000Support] B. Schölkopf, R. Williamson, A. Smola, J. Shawe-Taylor, and J. Platt. Support Vector Method for Novelty Detection. In S.A. Solla, T.K. Leen, and K.-R. Müller, editors, Adv. Neural Inform. Process. Syst., volume 12, pages 582-588. MIT Press, 2000. [ bib | .html | .pdf ]
[Schuster2000general] S. Schuster, D. A. Fell, and T. Dandekar. A general definition of metabolic pathways useful for systematic organization and analysis of complex metabolic networks. Nat Biotechnol, 18(3):326-332, Mar 2000. [ bib | DOI | http | .pdf ]
A set of linear pathways often does not capture the full range of behaviors of a metabolic network. The concept of 'elementary flux modes' provides a mathematical tool to define and comprehensively describe all metabolic routes that are both stoichiometrically and thermodynamically feasible for a group of enzymes. We have used this concept to analyze the interplay between the pentose phosphate pathway (PPP) and glycolysis. The set of elementary modes for this system involves conventional glycolysis, a futile cycle, all the modes of PPP function described in biochemistry textbooks, and additional modes that are a priori equally entitled to pathway status. Applications include maximizing product yield in amino acid and antibiotic synthesis, reconstruction and consistency checks of metabolism from genome data, analysis of enzyme deficiencies, and drug target identification in metabolic networks.

[Schueler-Furman2000Structure-based] O. Schueler-Furman, Y. Altuvia, A. Sette, and H. Margalit. Structure-based prediction of binding peptides to MHC class I molecules: application to a broad range of MHC alleles. Protein Sci., 9(9):1838-1846, Sep 2000. [ bib ]
Specific binding of antigenic peptides to major histocompatibility complex (MHC) class I molecules is a prerequisite for their recognition by cytotoxic T-cells. Prediction of MHC-binding peptides must therefore be incorporated in any predictive algorithm attempting to identify immunodominant T-cell epitopes, based on the amino acid sequence of the protein antigen. Development of predictive algorithms based on experimental binding data requires experimental testing of a very large number of peptides. A complementary approach relies on the structural conservation observed in crystallographically solved peptide-MHC complexes. By this approach, the peptide structure in the MHC groove is used as a template upon which peptide candidates are threaded, and their compatibility to bind is evaluated by statistical pairwise potentials. Our original algorithm based on this approach used the pairwise potential table of Miyazawa and Jernigan (Miyazawa S, Jernigan RL, 1996, J Mol Biol 256:623-644) and succeeded to correctly identify good binders only for MHC molecules with hydrophobic binding pockets, probably because of the high emphasis of hydrophobic interactions in this table. A recently developed pairwise potential table by Betancourt and Thirumalai (Betancourt MR, Thirumalai D, 1999, Protein Sci 8:361-369) that is based on the Miyazawa and Jernigan table describes the hydrophilic interactions more appropriately. In this paper, we demonstrate how the use of this table, together with a new definition of MHC contact residues by which only residues that contribute exclusively to sequence specific binding are included, allows the development of an improved algorithm that can be applied to a wide range of MHC class I alleles.

Keywords: immunoinformatics
[Roweis2000Nonlinear] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-6, Dec 2000. [ bib | DOI | http | .pdf ]
Many areas of science depend on exploratory data analysis and visualization. The need to analyze large amounts of multivariate data raises the fundamental problem of dimensionality reduction: how to discover compact representations of high-dimensional data. Here, we introduce locally linear embedding (LLE), an unsupervised learning algorithm that computes low-dimensional, neighborhood-preserving embeddings of high-dimensional inputs. Unlike clustering methods for local dimensionality reduction, LLE maps its inputs into a single global coordinate system of lower dimensionality, and its optimizations do not involve local minima. By exploiting the local symmetries of linear reconstructions, LLE is able to learn the global structure of nonlinear manifolds, such as those generated by images of faces or documents of text.

Keywords: dimred
[Robinson2000IMGT/HLAa] J. Robinson, A. Malik, P. Parham, J. G. Bodmer, and S. G. Marsh. IMGT/HLA database-a sequence database for the human major histocompatibility complex. Tissue Antigens, 55(3):280-287, Mar 2000. [ bib ]
The IMGT/HLA Database is a specialist database for sequences of the human major histocompatibility (MHC) system. It includes all the HLA sequences officially recognised and named by the WHO Nomenclature Committee for Factors of the HLA System. The database provides users with online tools and facilities for the retrieval and analysis of these sequences. These include allele reports, alignment tools and a detailed database of all source cells. The online IMGT/HLA submission tool allows the submission of both new and confirmatory allele sequences directly to the WHO Nomenclature Committee for Factors of the HLA System. The latest version (release 1.4.1, November 1999) contains 1,015 HLA alleles from over 2,270 component sequences derived from the EMBL/GenBank/DDBJ databases. From its release in December 1998 until December 1999 the IMGT/HLA website received approximately 100,000 hits. The database currently focuses on the human major histocompatibility complex but will be used as a model system to provide specialist databases for the MHC sequences of other species.

Keywords: Base Sequence; Databases, Factual; Humans; Major Histocompatibility Complex; Molecular Sequence Data
[Risau-Gusman2000Generalization] Risau-Gusman and Gordon. Generalization properties of finite-size polynomial support vector machines. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics, 62(5 Pt B):7092-9, Nov 2000. [ bib ]
The learning properties of finite-size polynomial support vector machines are analyzed in the case of realizable classification tasks. The normalization of the high-order features acts as a squeezing factor, introducing a strong anisotropy in the patterns distribution in feature space. As a function of the training set size, the corresponding generalization error presents a crossover, more or less abrupt depending on the distribution's anisotropy and on the task to be learned, between a fast-decreasing and a slowly decreasing regime. This behavior corresponds to the stepwise decrease found by Dietrich et al. [Phys. Rev. Lett. 82, 2975 (1999)] in the thermodynamic limit. The theoretical results are in excellent agreement with the numerical simulations.

Keywords: Acute, Acute Disease, Adenocarcinoma, Algorithms, Amino Acid Sequence, Animals, Artificial Intelligence, Automated, B-Lymphocytes, Bacterial Proteins, Base Pair Mismatch, Base Sequence, Bayes Theorem, Binding Sites, Biological, Bone Marrow Cells, Brachyura, Cell Compartmentation, Chemistry, Child, Chromosome Aberrations, Classification, Codon, Colonic Neoplasms, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA, Data Interpretation, Databases, Decision Trees, Diabetes Mellitus, Diagnosis, Discriminant Analysis, Discrimination Learning, Electric Conductivity, Electrophysiology, Escherichia coli Proteins, Factual, Feedback, Female, Fungal, Gastric Emptying, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Genetic Predisposition to Disease, Genomics, Hemolysins, Humans, Indians, Initiator, Ion Channels, Kinetics, Leukemia, Likelihood Functions, Lipid Bilayers, Logistic Models, Lymphocytic, Male, Markov Chains, Melanoma, Models, Molecular, Myeloid, Neoplasm, Neoplasms, Neoplastic, Neural Networks (Computer), Neurological, Nevus, Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Normal Distribution, North American, Nucleic Acid Conformation, Oligonucleotide Array Sequence Analysis, Organ Specificity, Organelles, Ovarian Neoplasms, Ovary, P.H.S., Pattern Recognition, Physical, Pigmented, Predictive Value of Tests, Promoter Regions (Genetics), Protein Biosynthesis, Protein Folding, Protein Structure, Proteins, Proteome, RNA, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Secondary, Sensitivity and Specificity, Sequence Alignment, Sequence Analysis, Sex Characteristics, Skin Diseases, Skin Neoplasms, Skin Pigmentation, Software, Sound Spectrography, Statistical, Stomach Diseases, T-Lymphocytes, Thermodynamics, Transcription, Transcription Factors, Tumor Markers, Type 2, U.S. Gov't, Vertebrates, 0011102066
[Rice2000EMBOSS] P. Rice, I. Longden, and A. Bleasby. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet., 16(6):276-277, Jun 2000. [ bib ]
Keywords: Internet; Molecular Biology; Sequence Alignment, methods; Software; User-Computer Interface
[Ren2000Genome-wide] B. Ren, F. Robert, J. J. Wyrick, O. Aparicio, E. G. Jennings, I. Simon, J. Zeitlinger, J. Schreiber, N. Hannett, E. Kanin, T. L. Volkert, C. J. Wilson, S. P. Bell, and R. A. Young. Genome-wide location and function of DNA binding proteins. Science, 290(5500):2306-2309, Dec 2000. [ bib | DOI | http | .pdf ]
Understanding how DNA binding proteins control global gene expression and chromosomal maintenance requires knowledge of the chromosomal locations at which these proteins function in vivo. We developed a microarray method that reveals the genome-wide location of DNA-bound proteins and used this method to monitor binding of gene-specific transcription activators in yeast. A combination of location and expression profiles was used to identify genes whose expression is directly controlled by Gal4 and Ste12 as cells respond to changes in carbon source and mating pheromone, respectively. The results identify pathways that are coordinately regulated by each of the two activators and reveal previously unknown functions for Gal4 and Ste12. Genome-wide location analysis will facilitate investigation of gene regulatory networks, gene function, and genome maintenance.

[Rea2000Regulation] S. Rea, F. Eisenhaber, D. O'Carroll, B. D. Strahl, Z. W. Sun, M. Schmid, S. Opravil, K. Mechtler, C. P. Ponting, C. D. Allis, and T. Jenuwein. Regulation of chromatin structure by site-specific histone H3 methyltransferases. Nature, 406(6796):593-599, Aug 2000. [ bib | DOI | http ]
The organization of chromatin into higher-order structures influences chromosome function and epigenetic gene regulation. Higher-order chromatin has been proposed to be nucleated by the covalent modification of histone tails and the subsequent establishment of chromosomal subdomains by non-histone modifier factors. Here we show that human SUV39H1 and murine Suv39h1 (mammalian homologues of Drosophila Su(var)3-9 and of Schizosaccharomyces pombe clr4) encode histone H3-specific methyltransferases that selectively methylate lysine 9 of the amino terminus of histone H3 in vitro. We mapped the catalytic motif to the evolutionarily conserved SET domain, which requires adjacent cysteine-rich regions to confer histone methyltransferase activity. Methylation of lysine 9 interferes with phosphorylation of serine 10, but is also influenced by pre-existing modifications in the amino terminus of H3. In vivo, deregulated SUV39H1 or disrupted Suv39h activity modulate H3 serine 10 phosphorylation in native chromatin and induce aberrant mitotic divisions. Our data reveal a functional interdependence of site-specific H3 tail modifications and suggest a dynamic mechanism for the regulation of higher-order chromatin.

Keywords: Amino Acid Sequence; Animals; Catalytic Domain; Chromatin, chemistry/metabolism; Drosophila; Hela Cells; Histone-Lysine N-Methyltransferase; Humans; Lysine, metabolism; Methylation; Methyltransferases, genetics/metabolism; Mice; Molecular Sequence Data; Phosphorylation; Protein Conformation; Protein Methyltransferases; Protein Structure, Tertiary; Recombinant Proteins, metabolism; Repressor Proteins, genetics/metabolism; Sequence Homology, Amino Acid; Serine, metabolism; Substrate Specificity
[Raychaudhuri2000Principal] S. Raychaudhuri, J. M. Stuart, and R. B. Altman. Principal components analysis to summarize microarray experiments: application to sporulation time series. Pac. Symp. Biocomput., pages 455-466, 2000. [ bib | .pdf | .pdf ]
A series of microarray experiments produces observations of differential expression for thousands of genes across multiple conditions. It is often not clear whether a set of experiments are measuring fundamentally different gene expression states or are measuring similar states created through different mechanisms. It is useful, therefore, to define a core set of independent features for the expression states that allow them to be compared directly. Principal components analysis (PCA) is a statistical technique for determining the key variables in a multidimensional data set that explain the differences in the observations, and can be used to simplify the analysis and visualization of multidimensional data sets. We show that application of PCA to expression data (where the experimental conditions are the variables, and the gene expression measurements are the observations) allows us to summarize the ways in which gene responses vary under different conditions. Examination of the components also provides insight into the underlying factors that are measured in the experiments. We applied PCA to the publicly released yeast sporulation data set (Chu et al. 1998). In that work, 7 different measurements of gene expression were made over time. PCA on the time-points suggests that much of the observed variability in the experiment can be summarized in just 2 components, i.e., 2 variables capture most of the information. These components appear to represent (1) overall induction level and (2) change in induction level over time. We also examined the clusters proposed in the original paper, and show how they are manifested in principal component space. Our results are available on the internet at http://www.smi.stanford.edu/project/helix/PCArray.

[Raudys2000How] S. Raudys. How good are support vector machines? Neural Netw, 13(1):17-9, Jan 2000. [ bib ]
Support vector (SV) machines are useful tools to classify populations characterized by abrupt decreases in density functions. At least for one class of Gaussian data model the SV classifier is not an optimal one according to a mean generalization error criterion. In real world problems, we have neither Gaussian populations nor data with sharp linear boundaries. Thus, the SV (maximal margin) classifiers can lose against other methods where more than a fixed number of supporting vectors contribute in determining the final weights of the classification and prediction rules. A good alternative to the linear SV machine is a specially trained and optimally stopped SLP in a transformed feature space obtained after decorrelating and scaling the multivariate data.

Keywords: Automated, Learning, Models, Neural Networks (Computer), Neurological, Pattern Recognition, 10935455
[Rajwan2000Universal] D. Rajwan and M. Feder. Universal finite memory machines for coding binary sequences. In Proceedings of the Data Compression Conference (DCC 2000), pages 113-122, 2000. [ bib | .pdf ]
In this work we consider the problem of universal sequential probability assignment, under self-information loss, where the machine for performing the universal probability assignment is constrained to have a finite memory. Sequential probability assignment is equivalent to lossless source coding if we ignore the number of states required to convert the probability estimate into code bits. We consider both the probabilistic setting where the sequence is generated by a probabilistic source (either Bernoulli IID source or q-th order Markov source), and the deterministic setting where the sequence is a deterministic individual sequence. We also consider the case where the universal machine is deterministic, randomized, time-invariant or time-variant. We provide in most cases lower bounds and describe finite memory universal machines whose performance, in terms of the memory size, is compared to these bounds.

[Perou2000Molecular] C. M. Perou, T. Sørlie, M. B. Eisen, M. van de Rijn, S. S. Jeffrey, C. A. Rees, J. R. Pollack, D. T. Ross, H. Johnsen, L. A. Akslen, O. Fluge, A. Pergamenschikov, C. Williams, S. X. Zhu, P. E. Lønning, A. L. Børresen-Dale, P. O. Brown, and D. Botstein. Molecular portraits of human breast tumours. Nature, 406(6797):747-752, Aug 2000. [ bib | DOI | http | .pdf ]
Human breast tumours are diverse in their natural history and in their responsiveness to treatments. Variation in transcriptional programs accounts for much of the biological diversity of human cells and tumours. In each cell, signal transduction and regulatory systems transduce information from the cell's identity to its environmental status, thereby controlling the level of expression of every gene in the genome. Here we have characterized variation in gene expression patterns in a set of 65 surgical specimens of human breast tumours from 42 different individuals, using complementary DNA microarrays representing 8,102 human genes. These patterns provided a distinctive molecular portrait of each tumour. Twenty of the tumours were sampled twice, before and after a 16-week course of doxorubicin chemotherapy, and two tumours were paired with a lymph node metastasis from the same patient. Gene expression patterns in two tumour samples from the same individual were almost always more similar to each other than either was to any other sample. Sets of co-expressed genes were identified for which variation in messenger RNA levels could be related to specific features of physiological variation. The tumours could be classified into subtypes distinguished by pervasive differences in their gene expression patterns.

Keywords: breastcancer, csbcbook, csbcbook-ch3
[Pelleg2000X-means] D. Pelleg and A. Moore. X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 727-734, San Francisco, 2000. Morgan Kaufmann. [ bib ]
[Pandey2000Proteomics] A. Pandey and M. Mann. Proteomics to study genes and genomes. Nature, 405:837-846, 2000. [ bib | http | .pdf ]
[Overbeek2000WIT] R. Overbeek, N. Larsen, G. D. Pusch, M. D'Souza, E. Jr. Selkov, N. Kyrpides, M. Fonstein, N. Maltsev, and E. Selkov. WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction. Nucleic Acids Res., 28:123-125, 2000. [ bib | http | .pdf ]
[Opper2000Gaussian] M. Opper and O. Winther. Gaussian processes for classification: mean-field algorithms. Neural Comput, 12(11):2655-84, Nov 2000. [ bib ]
We derive a mean-field algorithm for binary classification with Gaussian processes that is based on the TAP approach originally proposed in statistical physics of disordered systems. The theory also yields an approximate leave-one-out estimator for the generalization error, which is computed with no extra computational cost. We show that from the TAP approach, it is possible to derive both a simpler "naive" mean-field theory and support vector machines (SVMs) as limiting cases. For both mean-field algorithms and support vector machines, simulation results for three small benchmark data sets are presented. They show that one may get state-of-the-art performance by using the leave-one-out estimator for model selection and the built-in leave-one-out estimators are extremely precise when compared to the exact leave-one-out estimate. The second result is taken as strong support for the internal consistency of the mean-field approach.

Keywords: Acute, Acute Disease, Adenocarcinoma, Algorithms, Amino Acid Sequence, Animals, Artificial Intelligence, Automated, B-Lymphocytes, Bacterial Proteins, Base Pair Mismatch, Base Sequence, Bayes Theorem, Binding Sites, Biological, Bone Marrow Cells, Brachyura, Cell Compartmentation, Chemistry, Child, Chromosome Aberrations, Classification, Colonic Neoplasms, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA, Data Interpretation, Databases, Decision Trees, Diabetes Mellitus, Diagnosis, Discriminant Analysis, Discrimination Learning, Electric Conductivity, Electrophysiology, Escherichia coli Proteins, Factual, Feedback, Female, Fungal, Gastric Emptying, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Genetic Predisposition to Disease, Hemolysins, Humans, Indians, Ion Channels, Kinetics, Leukemia, Likelihood Functions, Lipid Bilayers, Logistic Models, Lymphocytic, Male, Markov Chains, Melanoma, Models, Molecular, Myeloid, Neoplasm, Neoplasms, Neoplastic, Neural Networks (Computer), Neurological, Nevus, Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Normal Distribution, North American, Nucleic Acid Conformation, Oligonucleotide Array Sequence Analysis, Organ Specificity, Organelles, Ovarian Neoplasms, Ovary, P.H.S., Pattern Recognition, Physical, Pigmented, Predictive Value of Tests, Promoter Regions (Genetics), Protein Folding, Protein Structure, Proteins, Proteome, RNA, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Secondary, Sensitivity and Specificity, Sequence Alignment, Sex Characteristics, Skin Diseases, Skin Neoplasms, Skin Pigmentation, Software, Sound Spectrography, Statistical, Stomach Diseases, T-Lymphocytes, Thermodynamics, Transcription, Transcription Factors, Tumor Markers, Type 2, U.S. Gov't, 11110131
[Ogawa2000New] Nobuo Ogawa, Joseph DeRisi, and Patrick O. Brown. New Components of a System for Phosphate Accumulation and Polyphosphate Metabolism in Saccharomyces cerevisiae Revealed by Genomic Expression Analysis. Mol. Biol. Cell, 11:4309-4321, Dec 2000. [ bib | .pdf | .pdf ]
[Nigam2000Text] K. Nigam, A.K. McCallum, S. Thrun, and T.M. Mitchell. Text classification from labeled and unlabeled documents using EM. Mach. Learn., 39(2/3):103-134, 2000. [ bib | .html ]
Keywords: em, semi-supervised-learning, text-classification, unlabeled-data
[Moler2000Analysis] E. J. Moler, M. L. Chow, and I. S. Mian. Analysis of molecular profile data using generative and discriminative methods. Physiol. Genomics, 4(2):109-126, Dec 2000. [ bib | http | .pdf ]
A modular framework is proposed for modeling and understanding the relationships between molecular profile data and other domain knowledge using a combination of generative (here, graphical models) and discriminative [Support Vector Machines (SVMs)] methods. As illustration, naive Bayes models, simple graphical models, and SVMs were applied to published transcription profile data for 1,988 genes in 62 colon adenocarcinoma tissue specimens labeled as tumor or nontumor. These unsupervised and supervised learning methods identified three classes or subtypes of specimens, assigned tumor or nontumor labels to new specimens and detected six potentially mislabeled specimens. The probability parameters of the three classes were utilized to develop a novel gene relevance, ranking, and selection method. SVMs trained to discriminate nontumor from tumor specimens using only the 50-200 top-ranked genes had the same or better generalization performance than the full repertoire of 1,988 genes. Approximately 90 marker genes were pinpointed for use in understanding the basic biology of colon adenocarcinoma, defining targets for therapeutic intervention and developing diagnostic tools. These potential markers highlight the importance of tissue biology in the etiology of cancer. Comparative analysis of molecular profile data is proposed as a mechanism for predicting the physiological function of genes in instances when comparative sequence analysis proves uninformative, such as with human and yeast translationally controlled tumour protein. Graphical models and SVMs hold promise as the foundations for developing decision support systems for diagnosis, prognosis, and monitoring as well as inferring biological networks.

Keywords: biosvm
[Miertus2000Concepts] S. Miertus, G. Fassina, and P.F. Seneci. Concepts of Combinatorial Chemistry and Combinatorial Technologies. Chemické Listy, 94:1104-1110, 2000. [ bib | www: ]
Keywords: chemoinformatics
[Massart2000Some] P. Massart. Some applications of concentration inequalities to statistics. Ann. Fac. Sc. Toulouse, IX(2):245-303, 2000. [ bib ]
[Luo2000Alignement] B. Luo and E. R. Hancock. Alignment and correspondence using singular value decomposition. In Proceedings of the Joint IAPR International Workshops on Advances in Pattern Recognition, pages 226-235, London, UK, 2000. Springer-Verlag. [ bib ]
[Lodish2000Molecular] H. Lodish, A. Berk, S.L. Zipursky, P. Matsudaira, D. Baltimore, and J. Darnell. Molecular cell biology. New York, 2000. [ bib ]
[Lodhi2000Text] H. Lodhi, J. Shawe-Taylor, N. Cristianini, and C. J. C. H. Watkins. Text Classification using String Kernels. In Adv. Neural Inform. Process. Syst., pages 563-569, 2000. [ bib | .ps.gz | .pdf ]
Keywords: biosvm
[Lockhart2000Genomics] D.J. Lockhart, E.A. Winzeler, et al. Genomics, gene expression and DNA arrays. Nature, 405:827-836, 2000. [ bib ]
[LevBarOr2000PNAS] R. Lev Bar-Or, R. Maya, L. A. Segel, U. Alon, A. J. Levine, and M. Oren. Generation of oscillations by the p53-mdm2 feedback loop: a theoretical and experimental study. Proc Natl Acad Sci U S A, 97(21):11250-5, 2000. [ bib ]
The intracellular activity of the p53 tumor suppressor protein is regulated through a feedback loop involving its transcriptional target, mdm2. We present a simple mathematical model suggesting that, under certain circumstances, oscillations in p53 and Mdm2 protein levels can emerge in response to a stress signal. A delay in p53-dependent induction of Mdm2 is predicted to be required, albeit not sufficient, for this oscillatory behavior. In line with the predictions of the model, oscillations of both p53 and Mdm2 indeed occur on exposure of various cell types to ionizing radiation. Such oscillations may allow cells to repair their DNA without risking the irreversible consequences of continuous excessive p53 activation.

Keywords: csbcbook
[Letouzey2000Learning] F. Letouzey, F. Denis, and R. Gilleron. Learning from positive and unlabeled examples, 2000. [ bib | http ]
[Lemmen2000Computational] C. Lemmen and T. Lengauer. Computational methods for the structural alignment of molecules. J. Comput. Aided. Mol. Des., 14(3):215-232, Mar 2000. [ bib ]
In drug design, often enough, no structural information on a particular receptor protein is available. However, frequently a considerable number of different ligands is known together with their measured binding affinities towards a receptor under consideration. In such a situation, a set of plausible relative superpositions of different ligands, hopefully approximating their putative binding geometry, is usually the method of choice for preparing data for the subsequent application of 3D methods that analyze the similarity or diversity of the ligands. Examples are 3D-QSAR studies, pharmacophore elucidation, and receptor modeling. An aggravating fact is that ligands are usually quite flexible and a rigorous analysis has to incorporate molecular flexibility. We review the past six years of scientific publishing on molecular superposition. Our focus lies on automatic procedures to be performed on arbitrary molecular structures. Methodical aspects are our main concern here. Accordingly, plain application studies with few methodical elements are omitted in this presentation. While this review cannot mention every contribution to this actively developing field, we intend to provide pointers to the recent literature providing important contributions to computational methods for the structural alignment of molecules. Finally we provide a perspective on how superposition methods can effectively be used for the purpose of virtual database screening. In our opinion it is the ultimate goal to detect analogues in structure databases of nontrivial size in order to narrow down the search space for subsequent experiments.

Keywords: chemoinformatics
[Lazo2000Combinatorial] J. S. Lazo and P. Wipf. Combinatorial chemistry and contemporary pharmacology. J. Pharmacol. Exp. Ther., 293(3):705-709, Jun 2000. [ bib ]
Both solid- and liquid-phase combinatorial chemistry have emerged as powerful tools for identifying pharmacologically active compounds and optimizing the biological activity of a lead compound. Complementary high-throughput in vitro assays are essential for compound evaluation. Cell-based assays that use optical endpoints permit investigation of a wide variety of functional properties of these compounds including specific intracellular biochemical pathways, protein-protein interactions, and the subcellular localization of targets. Integration of combinatorial chemistry with contemporary pharmacology now represents an important factor in drug discovery and development.

Keywords: Alzheimer Disease, Animals, Antineoplastic Agents, Biological, Bleomycin, Cell Cycle, Cell Cycle Proteins, Cell Death, Cell Line, Cell Nucleus, Cell Shape, Cell Transformation, Combinatorial Chemistry Techniques, Cultured, Drug Delivery Systems, Drug Design, Drug Evaluation, Enzyme Inhibitors, Formazans, Gene Expression, Humans, Inhibitory Concentration 50, Kinetics, Magnetic Resonance Spectroscopy, Mass, Mitochondria, Models, Molecular, Neoplasms, Neoplastic, Non-P.H.S., Non-U.S. Gov't, P.H.S., Paclitaxel, Peptide Library, Pharmaceutical Preparations, Pharmacology, Phosphoprotein Phosphatase, Preclinical, Protease Inhibitors, Protein-Tyrosine-Phosphatase, Research Support, Sensitivity and Specificity, Signal Transduction, Spectrum Analysis, Stereoisomerism, Structure-Activity Relationship, Sulfonic Acids, Tetrazolium Salts, Thiazoles, Toxicity Tests, Tumor, Tumor Cells, U.S. Gov't, cdc25 Phosphatase, 10869367
[Lai2000Kernel] P.L. Lai and C. Fyfe. Kernel and nonlinear canonical correlation analysis. Int. J. Neural Syst., 10(5):365-377, 2000. [ bib | .html | .pdf ]
[Kwon2000candidate] J. M. Kwon and A. M. Goate. The candidate gene approach. Alcohol Res Health, 24(3):164-168, 2000. [ bib ]
Alcoholism has a significant genetic basis, and identifying genes that confer a susceptibility to alcoholism will aid clinicians in preventing and effectively treating the disease. One commonly used technique to identify genetic risk factors for complex disorders such as alcoholism is the candidate gene approach, which directly tests the effects of genetic variants of a potentially contributing gene in an association study. These studies, which may include members of an affected family or unrelated cases and controls, can be performed relatively quickly and inexpensively and may allow identification of genes with small effects. However, the candidate gene approach is limited by how much is known of the biology of the disease being investigated. As researchers identify potential candidate genes using animal studies or linking them to DNA regions implicated through other analyses, the candidate gene approach will continue to be commonly used.

Keywords: Alcoholism; Animals; Chromosome Mapping; Disease Models, Animal; Genetic Predisposition to Disease; Genetic Variation; Humans; Mice; Mutation; Pedigree; Polymorphism, Genetic; Quantitative Trait, Heritable
[Knight2000Asymptotics] K. Knight and W. Fu. Asymptotics for lasso-type estimators. Ann. Stat., 28(5):1356-1378, 2000. [ bib | DOI | http ]
Keywords: lasso
[Klebe2000Recent] G. Klebe. Recent developments in structure-based drug design. J Mol Med, 78(5):269-281, 2000. [ bib ]
Structure-based design has emerged as a new tool in medicinal chemistry. A prerequisite for this new approach is an understanding of the principles of molecular recognition in protein-ligand complexes. If the three-dimensional structure of a given protein is known, this information can be directly exploited for the retrieval and design of new ligands. Structure-based ligand design is an iterative approach. First of all, it requires the crystal structure or a model derived from the crystal structure of a closely related homolog of the target protein, preferentially complexed with a ligand. This complex unravels the binding mode and conformation of a ligand under investigation and indicates the essential aspects determining its binding affinity. It is then used to generate new ideas about ways of improving an existing ligand or of developing new alternative bonding skeletons. Computational methods supplemented by molecular graphics are applied to assist this step of hypothesis generation. The features of the protein binding pocket can be translated into queries used for virtual computer screening of large compound libraries or to design novel ligands de novo. These initial proposals must be confirmed experimentally. Subsequently they are optimized toward higher affinity and better selectivity. The latter aspect is of utmost importance in defining and controlling the pharmacological profile of a ligand. A prerequisite to tailoring selectivity by rational design is a detailed understanding of molecular parameters determining selectivity. Taking examples from current drug development programs (HIV proteinase, t-RNA transglycosylase, thymidylate synthase, thrombin, and related serine proteinases), we describe recent advances in lead discovery via computer screening, iterative design, and understanding of selectivity discrimination.

Keywords: Animals, Chemistry, Computer Simulation, Cross-Over Studies, Crystallography, Deglutition, Deglutition Disorders, Drug Design, Endoscopy, Enzyme Inhibitors, Female, Fluoroscopy, Glossopharyngeal Nerve, HIV Protease Inhibitors, Horse Diseases, Horses, Male, Models, Molecular, Nerve Block, Non-U.S. Gov't, P.H.S., Pharmaceutical, Proteins, Quantitative Structure-Activity Relationship, Random Allocation, Research Support, Thrombin, Thymidylate Synthase, U.S. Gov't, X-Ray, 10954196
[Jeong2000large-scale] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A.-L. Barabási. The large-scale organization of metabolic networks. Nature, 407:651-654, 2000. [ bib | http | .pdf ]
[Jaakkola2000Discriminative] T. Jaakkola, M. Diekhans, and D. Haussler. A Discriminative Framework for Detecting Remote Protein Homologies. J. Comput. Biol., 7(1,2):95-114, 2000. [ bib | .ps | .pdf ]
Keywords: biosvm
[Ito2000Toward] T. Ito, K. Tashiro, S. Muta, R. Ozawa, T. Chiba, M. Nishizawa, K. Yamamoto, S. Kuhara, and Y. Sakaki. Toward a protein-protein interaction map of the budding yeast: A comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proc. Natl. Acad. Sci. USA, 97(3):1143-1147, 2000. [ bib | http | .pdf ]
[Heskes2000Empirical] Tom Heskes. Empirical Bayes for learning to learn. In ICML '00: Proceedings of the Seventeenth International Conference on Machine Learning, pages 367-374, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc. [ bib ]
[Heckerman2000Dependency] D. Heckerman, D. M. Chickering, C. Meek, R. Rounthwaite, and C. Kadie. Dependency networks for inference, collaborative filtering, and data visualization. J. Mach. Learn. Res., 1:49-75, 2000. [ bib ]
[He2000Alternating] B. S. He, H. Yang, and S. L. Wang. Alternating direction method with self-adaptive penalty parameters for monotone variational inequalities. Journal of Optimization Theory and Applications, 106(2):337-356, 2000. [ bib ]
[Haverkamp2000potential] W. Haverkamp, G. Breithardt, A. J. Camm, M. J. Janse, M. R. Rosen, C. Antzelevitch, D. Escande, M. Franz, M. Malik, A. Moss, and R. Shah. The potential for QT prolongation and proarrhythmia by non-antiarrhythmic drugs: clinical and regulatory implications. Report on a policy conference of the European Society of Cardiology. Eur. Heart J., 21(15):1216-1231, Aug 2000. [ bib | DOI | http ]
Keywords: herg
[Hanahan2000hallmarks] D. Hanahan and R. A. Weinberg. The hallmarks of cancer. Cell, 100:57-70, 2000. [ bib | DOI | http | .pdf ]
Keywords: csbcbook, csbcbook-mustread
[Guner2000Pharmacophore] O. F. Güner. Pharmacophore Perception, Development, and Use in Drug Design, volume 2 of IUL Biotechnology Series. International University Line, 2000. [ bib ]
[Gross2000Identification] C. Gross, M. Kelleher, V.R. Iyer, P.O. Brown, and D.R. Winge. Identification of the copper regulon in Saccharomyces cerevisiae by DNA microarrays. J. Biol. Chem., 275(41):32310-32316, 2000. [ bib | http | .pdf ]
[Goldfarb2000What] L. Goldfarb, O. Golubitsky, and D. Korkin. What is a structural representation? Technical report, University of New Brunswick, 2000. Technical report TR00-137. [ bib | .ps ]
[Godsil2000Algebraic] C. Godsil and G. Royle. Algebraic graph theory. Springer-Verlag, 2000. [ bib ]
[Gether2000Uncovering] U. Gether. Uncovering molecular mechanisms involved in activation of G protein-coupled receptors. Endocr Rev, 21(1):90-113, Feb 2000. [ bib ]
G protein-coupled, seven-transmembrane segment receptors (GPCRs or 7TM receptors), with more than 1000 different members, comprise the largest superfamily of proteins in the body. Since the cloning of the first receptors more than a decade ago, extensive experimental work has uncovered multiple aspects of their function and challenged many traditional paradigms. However, it is only recently that we are beginning to gain insight into some of the most fundamental questions in the molecular function of this class of receptors. How can, for example, so many chemically diverse hormones, neurotransmitters, and other signaling molecules activate receptors believed to share a similar overall tertiary structure? What is the nature of the physical changes linking agonist binding to receptor activation and subsequent transduction of the signal to the associated G protein on the cytoplasmic side of the membrane and to other putative signaling pathways? The goal of the present review is to specifically address these questions as well as to depict the current awareness about GPCR structure-function relationships in general.

Keywords: Animals; GTP-Binding Proteins; Humans; Ligands; Models, Biological; Molecular Conformation; Receptors, Cell Surface
[Gerstein2000current] M. Gerstein and R. Jansen. The current excitement in bioinformatics - analysis of whole-genome expression data: how does it relate to protein structure and function? Curr. Opin. Struct. Biol., 10(5):574-584, Oct 2000. [ bib | .pdf ]
Whole-genome expression profiles provide a rich new data-trove for bioinformatics. Initial analyses of the profiles have included clustering and cross-referencing to 'external' information on protein structure and function. Expression profile clusters do relate to protein function, but the correlation is not perfect, with the discrepancies partially resulting from the difficulty in consistently defining function. Other attributes of proteins can also be related to expression (in particular, structure and localization) and sometimes show a clearer relationship than function.

[Gasch2000Genomic] A. P. Gasch, P. T. Spellman, C. M. Kao, O. Carmel-Harel, M. B. Eisen, G. Storz, D. Botstein, and P. O. Brown. Genomic Expression Programs in the Response of Yeast Cells to Environmental Changes. Mol. Biol. Cell, 11:4241-4257, Dec 2000. [ bib | .pdf | .pdf ]
[Fussenegger2000NatBio] M. Fussenegger, J. Bailey, and J. Varner. A mathematical model of caspase function in apoptosis. Nat. Biotechnol., 18:768-774, 2000. [ bib | DOI | .pdf ]
Caspases (cysteine-containing aspartate-specific proteases) are at the core of the cell's suicide machinery. These enzymes, once activated, dismantle the cell by selectively cleaving key proteins after aspartate residues. The events culminating in caspase activation are the subject of intense study because of their role in cancer, and neurodegenerative and autoimmune disorders. Here we present a mechanistic mathematical model, formulated on the basis of newly emerging information, describing key elements of receptor-mediated and stress-induced caspase activation. We have used mass-conservation principles in conjunction with kinetic rate laws to formulate ordinary differential equations that describe the temporal evolution of caspase activation. Qualitative strategies for the prevention of caspase activation are simulated and compared with experimental data. We show that model predictions are consistent with available information. Thus, the model could aid in better understanding caspase activation and identifying therapeutic approaches promoting or retarding apoptotic cell death.

Keywords: csbcbook
[Furey2000Support] T. S. Furey, N. Cristianini, N. Duffy, D. W. Bednarski, M. Schummer, and D. Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16(10):906-914, Oct 2000. [ bib | http | .pdf ]
Motivation: DNA microarray experiments, generating thousands of gene expression measurements, are being used to gather information from tissue and cell samples regarding gene expression differences that will be useful in diagnosing disease. We have developed a new method to analyse this kind of data using support vector machines (SVMs). This analysis consists of both classification of the tissue samples, and an exploration of the data for mis-labeled or questionable tissue results. Results: We demonstrate the method in detail on samples consisting of ovarian cancer tissues, normal ovarian tissues, and other normal tissues. The dataset consists of expression experiment results for 97,802 cDNAs for each tissue. As a result of computational analysis, a tissue sample is discovered and confirmed to be wrongly labeled. Upon correction of this mistake and the removal of an outlier, perfect classification of tissues is achieved, but not with high confidence. We identify and analyse a subset of genes from the ovarian dataset whose expression is highly differentiated between the types of tissues. To show robustness of the SVM method, two previously published datasets from other types of tissues or cells are analysed. The results are comparable to those previously obtained. We show that other machine learning methods also perform comparably to the SVM on many of those datasets. Availability: The SVM software is available at http://www.cs.columbia.edu/~bgrundy/svm. Contact: booch@cse.ucsc.edu

Keywords: biosvm
[Friedman2000Using] N. Friedman, M. Linial, I. Nachman, and D. Pe'er. Using Bayesian networks to analyze expression data. J. Comput. Biol., 7(3-4):601-620, 2000. [ bib | DOI | http | .pdf ]
DNA hybridization arrays simultaneously measure the expression level for thousands of genes. These measurements provide a "snapshot" of transcription levels within the cell. A major challenge in computational biology is to uncover, from such measurements, gene/protein interactions and key biological features of cellular systems. In this paper, we propose a new framework for discovering interactions between genes based on multiple expression measurements. This framework builds on the use of Bayesian networks for representing statistical dependencies. A Bayesian network is a graph-based model of joint multivariate probability distributions that captures properties of conditional independence between variables. Such models are attractive for their ability to describe complex stochastic processes and because they provide a clear methodology for learning from (noisy) observations. We start by showing how Bayesian networks can describe interactions between genes. We then describe a method for recovering gene interactions from microarray data using tools for learning Bayesian networks. Finally, we demonstrate this method on the S. cerevisiae cell-cycle measurements of Spellman et al. (1998).

Keywords: biogm
[Freund2000Analysis] Y. Freund, Y. Mansour, and R. E. Schapire. Analysis of a Pseudo-Bayesian Prediction Method. In Conference on Information Sciences and Systems, Princeton University, March 15-17, 2000. [ bib | .pdf ]
[Evgeniou2000Regularization] T. Evgeniou, M. Pontil, and T. Poggio. Regularization Networks and Support Vector Machines. Adv. Comput. Math., 13(1):1-50, 2000. [ bib | DOI | http | .pdf ]
[Eskin2000Protein] E. Eskin, W.N. Grundy, and Y. Singer. Protein family classification using sparse Markov transducers. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB 2000), pages 134-145, 2000. [ bib ]
[Egan2000Prediction] W. J. Egan, K. M. Merz, and J. J. Baldwin. Prediction of drug absorption using multivariate statistics. J. Med. Chem., 43(21):3867-3877, Oct 2000. [ bib ]
Literature data on compounds both well- and poorly-absorbed in humans were used to build a statistical pattern recognition model of passive intestinal absorption. Robust outlier detection was utilized to analyze the well-absorbed compounds, some of which were intermingled with the poorly-absorbed compounds in the model space. Outliers were identified as being actively transported. The descriptors chosen for inclusion in the model were PSA and AlogP98, based on consideration of the physical processes involved in membrane permeability and the interrelationships and redundancies between available descriptors. These descriptors are quite straightforward for a medicinal chemist to interpret, enhancing the utility of the model. Molecular weight, while often used in passive absorption models, was shown to be superfluous, as it is already a component of both PSA and AlogP98. Extensive validation of the model on hundreds of known orally delivered drugs, "drug-like" molecules, and Pharmacopeia, Inc. compounds, which had been assayed for Caco-2 cell permeability, demonstrated a good rate of successful predictions (74-92%, depending on the dataset and exact criterion used).

Keywords: chemogenomics
[Donoho2000High] D. L. Donoho, I. Johnstone, B. Stine, and G. Piatetsky-Shapiro. High-dimensional data analysis: The curses and blessings of dimensionality. Statistics, pages 1-33, 2000. [ bib ]
[Doi2000Hybrid] A. Doi, H. Matsuno, M. Nagasaki, and S. Miyano. Hybrid Petri net representation of gene regulatory network. In Proceedings of the Pacific Symposium on Biocomputing, volume 5, pages 341-352, 2000. [ bib | .pdf | .pdf ]
[Diestel2000Graph] R. Diestel. Graph theory. Springer-Verlag, 2000. [ bib ]
[Devroye2000Combinatorial] L. Devroye and G. Lugosi. Combinatorial Methods in Density Estimation. Springer Series in Statistics. Springer, 2000. [ bib ]
[Cristianini2000introduction] N. Cristianini and J. Shawe-Taylor. An introduction to Support Vector Machines and other kernel-based learning methods. Cambridge University Press, 2000. [ bib | http ]
[Coelho2000Genome-wide] P.S. Coelho, A. Kumar, and M. Snyder. Genome-wide mutant collections: toolboxes for functional genomics. Curr. Opin. Microbiol., 3:309-315, 2000. [ bib | .pdf ]
[Cheng2000Biclustering] Y. Cheng and G. M. Church. Biclustering of expression data. Proc Int Conf Intell Syst Mol Biol, 8:93-103, 2000. [ bib ]
An efficient node-deletion algorithm is introduced to find submatrices in expression data that have low mean squared residue scores and it is shown to perform well in finding co-regulation patterns in yeast and human. This introduces "biclustering", or simultaneous clustering of both genes and conditions, to knowledge discovery from expression data. This approach overcomes some problems associated with traditional clustering methods, by allowing automatic discovery of similarity based on a subset of attributes, simultaneous clustering of genes and conditions, and overlapped grouping that provides a better representation for genes with multiple functions or regulated by many factors.

Keywords: Algorithms; Animals; Gene Expression Profiling, methods; Humans; Multigene Family; Oligonucleotide Array Sequence Analysis, methods
[Cai2000Support] Y.D. Cai, X.J. Liu, X.B. Xu, and K.C. Chou. Support vector machines for prediction of protein subcellular location. Mol. Cell Biol. Res. Commun., 4(4):230-234, 2000. [ bib | DOI | http | www: ]
Support Vector Machine (SVM), which is one kind of learning machine, was applied to predict the subcellular location of proteins from their amino acid composition. In this research, the proteins are classified into the following 12 groups: (1) chloroplast, (2) cytoplasm, (3) cytoskeleton, (4) endoplasmic reticulum, (5) extracell, (6) Golgi apparatus, (7) lysosome, (8) mitochondria, (9) nucleus, (10) peroxisome, (11) plasma membrane, and (12) vacuole, which have covered almost all the organelles and subcellular compartments in an animal or plant cell. The self-consistency and the jackknife test of the SVM method were examined for the three sets: 2022 proteins, 2161 proteins, and 2319 proteins. As a result, the correct rate of self-consistency and jackknife test reaches 91 and 82% [...] 73% [...] rate was tested by the three independent testing datasets containing 2240 proteins, 2513 proteins, and 2591 proteins. The correct prediction rates reach 82, 75, and 73% for 2240 proteins, 2513 proteins, and 2591 proteins, respectively.

Keywords: biosvm
[Butte2000Discovering] A. J. Butte, P. Tamayo, D. Slonim, T. R. Golub, and I. S. Kohane. Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proc. Natl. Acad. Sci. USA, 97(22):12182-12186, Oct 2000. [ bib | DOI | http | .pdf ]
In an effort to find gene regulatory networks and clusters of genes that affect cancer susceptibility to anticancer agents, we joined a database with baseline expression levels of 7,245 genes measured by using microarrays in 60 cancer cell lines, to a database with the amounts of 5,084 anticancer agents needed to inhibit growth of those same cell lines. Comprehensive pair-wise correlations were calculated between gene expression and measures of agent susceptibility. Associations weaker than a threshold strength were removed, leaving networks of highly correlated genes and agents called relevance networks. Hypotheses for potential single-gene determinants of anticancer agent susceptibility were constructed. The effect of random chance in the large number of calculations performed was empirically determined by repeated random permutation testing; only associations stronger than those seen in multiply permuted data were used in clustering. We discuss the advantages of this methodology over alternative approaches, such as phylogenetic-type tree clustering and self-organizing maps.

[Butte2000Mutual] A. J. Butte and I. S. Kohane. Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pac Symp Biocomput, pages 418-429, 2000. [ bib ]
Increasing numbers of methodologies are available to find functional genomic clusters in RNA expression data. We describe a technique that computes comprehensive pair-wise mutual information for all genes in such a data set. An association with a high mutual information means that one gene is non-randomly associated with another; we hypothesize this means the two are related biologically. By picking a threshold mutual information and using only associations at or above the threshold, we show how this technique was used on a public data set of 79 RNA expression measurements of 2,467 genes to construct 22 clusters, or Relevance Networks. The biological significance of each Relevance Network is explained.

Keywords: Computer Simulation; Gene Expression; Genome; Genome, Fungal; Genome, Human; Humans; Models, Genetic; Multigene Family; RNA; Saccharomyces cerevisiae
[Brown2000Exploring] P.O. Brown and D. Botstein. Exploring the new world of the genome with DNA microarrays. Nat. Genet., 21:33-37, 2000. [ bib | .html | .pdf ]
[Brown2000Knowledge-based] M. P. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. W. Sugnet, T. S. Furey, M. Ares, and D. Haussler. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. USA, 97(1):262-7, Jan 2000. [ bib | http | .pdf ]
We introduce a method of functionally classifying genes by using gene expression data from DNA microarray hybridization experiments. The method is based on the theory of support vector machines (SVMs). SVMs are considered a supervised computer learning method because they exploit prior knowledge of gene function to identify unknown genes of similar function from expression data. SVMs avoid several problems associated with unsupervised clustering methods, such as hierarchical clustering and self-organizing maps. SVMs have many mathematical features that make them attractive for gene expression analysis, including their flexibility in choosing a similarity function, sparseness of solution when dealing with large data sets, the ability to handle large feature spaces, and the ability to identify outliers. We test several SVMs that use different similarity metrics, as well as some other supervised learning methods, and find that the SVMs best identify sets of genes with a common function using expression data. Finally, we use SVMs to predict functional roles for uncharacterized yeast ORFs based on their expression data.

Keywords: biosvm microarray
[Boucheron2000sharp] S. Boucheron, G. Lugosi, and P. Massart. A sharp concentration inequality with applications. Random Structures and Algorithms, 16:277-292, 2000. [ bib | .ps | .pdf ]
[Borwein2000Convex] J. M. Borwein and A. S. Lewis. Convex Analysis and Nonlinear Optimization. Springer-Verlag, New York, 2000. [ bib ]
[Blake2000Chemoinformatics] J. F. Blake. Chemoinformatics - predicting the physicochemical properties of 'drug-like' molecules. Curr. Opin. Biotechnol., 11(1):104-107, Feb 2000. [ bib | DOI | http | .pdf ]
A few major advances have occurred in the area of physicochemical modeling of organic compounds during the past several years, spurred on by changes in the pharmaceutical industry. Recent advances include the ability to categorize and screen the overall physicochemical properties of potential drug candidates based entirely on their molecular structures and the ability to model the components that contribute to the oral absorption characteristics of potential drug candidates.

[Bittner2000Molecular] M. Bittner, P. Meltzer, Y. Chen, Y. Jiang, E. Seftor, M. Hendrix, M. Radmacher, R. Simon, Z. Yakhini, A. Ben-Dor, N. Sampas, E. Dougherty, E. Wang, F. Marincola, C. Gooden, J. Lueders, A. Glatfelter, P. Pollock, J. Carpten, E. Gillanders, D. Leja, K. Dietrich, C. Beaudry, M. Berens, D. Alberts, and V. Sondak. Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature, 406(6795):536-540, Aug 2000. [ bib | DOI | http | .pdf ]
The most common human cancers are malignant neoplasms of the skin. Incidence of cutaneous melanoma is rising especially steeply, with minimal progress in non-surgical treatment of advanced disease. Despite significant effort to identify independent predictors of melanoma outcome, no accepted histopathological, molecular or immunohistochemical marker defines subsets of this neoplasm. Accordingly, though melanoma is thought to present with different 'taxonomic' forms, these are considered part of a continuous spectrum rather than discrete entities. Here we report the discovery of a subset of melanomas identified by mathematical analysis of gene expression in a series of samples. Remarkably, many genes underlying the classification of this subset are differentially regulated in invasive melanomas that form primitive tubular networks in vitro, a feature of some highly aggressive metastatic melanomas. Global transcript analysis can identify unrecognized subtypes of cutaneous melanoma and predict experimentally verifiable phenotypic characteristics that may be of importance to disease progression.

[Ben-Dor2000Tissue] A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z. Yakhini. Tissue classification with gene expression profiles. J. Comput. Biol., 7(3-4):559-583, 2000. [ bib | http | .pdf ]
Constantly improving gene expression profiling technologies are expected to provide understanding and insight into cancer-related cellular processes. Gene expression data is also expected to significantly aid in the development of efficient cancer diagnosis and classification platforms. In this work we examine three sets of gene expression data measured across sets of tumor(s) and normal clinical samples: The first set consists of 2,000 genes, measured in 62 epithelial colon samples (Alon et al., 1999). The second consists of approximately 100,000 clones, measured in 32 ovarian samples (unpublished extension of data set described in Schummer et al. (1999)). The third set consists of approximately 7,100 genes, measured in 72 bone marrow and peripheral blood samples (Golub et al., 1999). We examine the use of scoring methods, measuring separation of tissue type (e.g., tumors from normals) using individual gene expression levels. These are then coupled with high-dimensional classification methods to assess the classification power of complete expression profiles. We present results of performing leave-one-out cross validation (LOOCV) experiments on the three data sets, employing nearest neighbor classifier, SVM (Cortes and Vapnik, 1995), AdaBoost (Freund and Schapire, 1997) and a novel clustering-based classification technique. As tumor samples can differ from normal samples in their cell-type composition, we also perform LOOCV experiments using appropriately modified sets of genes, attempting to eliminate the resulting bias. We demonstrate success rate of at least 90% in tumor versus normal classification, using sets of selected genes, with, as well as without, cellular-contamination-related members. These results are insensitive to the exact selection mechanism, over a certain range.

Keywords: biosvm microarray
[Baxter2000Model] Jonathan Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149-198, 2000. [ bib | .html ]
[Baudat2000Generalized] G. Baudat and F. Anouar. Generalized discriminant analysis using a kernel approach. Neural Comput., 12(10):2385-404, Oct 2000. [ bib | DOI | http | .pdf ]
We present a new method that we call generalized discriminant analysis (GDA) to deal with nonlinear discriminant analysis using kernel function operator. The underlying theory is close to the support vector machines (SVM) insofar as the GDA method provides a mapping of the input vectors into high-dimensional feature space. In the transformed space, linear properties make it easy to extend and generalize the classical linear discriminant analysis (LDA) to nonlinear discriminant analysis. The formulation is expressed as an eigenvalue problem resolution. Using a different kernel, one can cover a wide class of nonlinearities. For both simulated data and alternate kernels, we give classification results, as well as the shape of the decision function. The results are confirmed using real data to perform seed classification.

[Amaral2000Classes] L. A. N. Amaral, A. Scala, M. Barthélémy, and H. E. Stanley. Classes of small-world networks. Proc. Natl. Acad. Sci. USA, 97(21):11149-11152, 2000. [ bib | http | .pdf ]
[Alter2000Singular] O. Alter, P. O. Brown, and D. Botstein. Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci U S A, 97(18):10101-10106, Aug 2000. [ bib | DOI | http | .pdf ]
We describe the use of singular value decomposition in transforming genome-wide expression data from genes x arrays space to reduced diagonalized "eigengenes" x "eigenarrays" space, where the eigengenes (or eigenarrays) are unique orthonormal superpositions of the genes (or arrays). Normalizing the data by filtering out the eigengenes (and eigenarrays) that are inferred to represent noise or experimental artifacts enables meaningful comparison of the expression of different genes across different arrays in different experiments. Sorting the data according to the eigengenes and eigenarrays gives a global picture of the dynamics of gene expression, in which individual genes and arrays appear to be classified into groups of similar regulation and function, or similar cellular state and biological phenotype, respectively. After normalization and sorting, the significant eigengenes and eigenarrays can be associated with observed genome-wide effects of regulators, or with measured samples, in which these regulators are overactive or underactive, respectively.

[Alizadeh2000Distinct] A. A. Alizadeh, M. B. Eisen, R. E. Davis, C. Ma, I. S. Lossos, A. Rosenwald, J. C. Boldrick, H. Sabet, T. Tran, X. Yu, J. I. Powell, L. Yang, G. E. Marti, T. Moore, J. Hudson, L. Lu, D. B. Lewis, R. Tibshirani, G. Sherlock, W. C. Chan, T. C. Greiner, D. D. Weisenburger, J. O. Armitage, R. Warnke, R. Levy, W. Wilson, M. R. Grever, J. C. Byrd, D. Botstein, P. O. Brown, and L. M. Staudt. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403(6769):503-511, Feb 2000. [ bib | DOI | http | .pdf ]
Diffuse large B-cell lymphoma (DLBCL), the most common subtype of non-Hodgkin's lymphoma, is clinically heterogeneous: 40% of patients respond well to current therapy and have prolonged survival, whereas the remainder succumb to the disease. We proposed that this variability in natural history reflects unrecognized molecular heterogeneity in the tumours. Using DNA microarrays, we have conducted a systematic characterization of gene expression in B-cell malignancies. Here we show that there is diversity in gene expression among the tumours of DLBCL patients, apparently reflecting the variation in tumour proliferation rate, host response and differentiation state of the tumour. We identified two molecularly distinct forms of DLBCL which had gene expression patterns indicative of different stages of B-cell differentiation. One type expressed genes characteristic of germinal centre B cells ('germinal centre B-like DLBCL'); the second type expressed genes normally induced during in vitro activation of peripheral blood B cells ('activated B-like DLBCL'). Patients with germinal centre B-like DLBCL had a significantly better overall survival than those with activated B-like DLBCL. The molecular classification of tumours on the basis of gene expression can thus identify previously undetected and clinically significant subtypes of cancer.

Keywords: csbcbook
[Albert2000Attack] R. Albert, H. Jeong, and A.-L. Barabási. Attack and error tolerance in complex networks. Nature, 406:378-381, 2000. [ bib | .pdf | .pdf ]
[Akutsu2000Inferring] T. Akutsu, S. Miyano, and S. Kuhara. Inferring qualitative relations in genetic networks and metabolic pathways. Bioinformatics, 16(8):727-734, 2000. [ bib | http | .pdf ]
[Akutsu2000Algorithms] T. Akutsu, S. Miyano, and S. Kuhara. Algorithms for identifying Boolean networks and related biological networks based on matrix multiplication and fingerprint function. J. Comput. Biol., 7(3-4):331-343, 2000. [ bib | DOI | http | .pdf ]
Due to the recent progress of the DNA microarray technology, a large number of gene expression profile data are being produced. How to analyze gene expression data is an important topic in computational molecular biology. Several studies have been done using the Boolean network as a model of a genetic network. This paper proposes efficient algorithms for identifying Boolean networks of bounded indegree and related biological networks, where identification of a Boolean network can be formalized as a problem of identifying many Boolean functions simultaneously. For the identification of a Boolean network, an O(m n^{D+1}) time naive algorithm and a simple O(m n^D) time algorithm are known, where n denotes the number of nodes, m denotes the number of examples, and D denotes the maximum indegree. This paper presents an improved O(m^{omega-2} n^D + m n^{D+omega-3}) time Monte-Carlo type randomized algorithm, where omega is the exponent of matrix multiplication (currently, omega < 2.376). The algorithm is obtained by combining fast matrix multiplication with the randomized fingerprint function for string matching. Although the algorithm and its analysis are simple, the result is nontrivial and the technique can be applied to several related problems.
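
As a rough illustration of the naive identification step mentioned in this abstract, the sketch below tries every set of D candidate input nodes for one target node and keeps the sets for which the observed examples are consistent with some Boolean function. The function name `consistent_inputs` and the (state, next_state) example format are assumptions made for illustration; this is not the authors' code, and it does not implement the faster randomized algorithm.

```python
from itertools import combinations


def consistent_inputs(examples, target, indegree):
    """Return every tuple of `indegree` input nodes that can explain `target`.

    `examples` is a list of (state, next_state) pairs of 0/1 tuples
    (a hypothetical data layout). An input tuple is kept when no two
    examples agree on the inputs but disagree on the target's next value,
    i.e. when some Boolean function of those inputs fits all examples.
    """
    n = len(examples[0][0])
    found = []
    for inputs in combinations(range(n), indegree):  # O(n^D) candidate sets
        table = {}  # partial truth table observed so far
        ok = True
        for state, next_state in examples:  # O(m) examples per candidate
            key = tuple(state[i] for i in inputs)
            out = next_state[target]
            if table.setdefault(key, out) != out:
                ok = False  # same inputs, different output: inconsistent
                break
        if ok:
            found.append(inputs)
    return found
```

Calling this once per node gives the O(m n^{D+1}) overall cost quoted for the naive algorithm.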

[Drews2000Drug] J. Drews. Drug Discovery: A Historical Perspective. Science, 287:1960-1964, March 2000. [ bib | DOI | http | .pdf ]
Keywords: chemoinformatics
[Edwards2000Escherichia] J. S. Edwards and B. O. Palsson. The Escherichia coli MG1655 in silico metabolic genotype: its definition, characteristics, and capabilities. Proc Natl Acad Sci U S A, 97(10):5528-5533, May 2000. [ bib | DOI | http | .pdf ]
The Escherichia coli MG1655 genome has been completely sequenced. The annotated sequence, biochemical information, and other information were used to reconstruct the E. coli metabolic map. The stoichiometric coefficients for each metabolic enzyme in the E. coli metabolic map were assembled to construct a genome-specific stoichiometric matrix. The E. coli stoichiometric matrix was used to define the system's characteristics and the capabilities of E. coli metabolism. The effects of gene deletions in the central metabolic pathways on the ability of the in silico metabolic network to support growth were assessed, and the in silico predictions were compared with experimental observations. It was shown that based on stoichiometric and capacity constraints the in silico analysis was able to qualitatively predict the growth potential of mutant strains in 86% of the cases examined. Herein, it is demonstrated that the synthesis of in silico metabolic genotypes based on genomic, biochemical, and strain-specific information is possible, and that systems analysis methods are available to analyze and interpret the metabolic phenotype.

[Ashburner2000Gene] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 25(1):25-29, May 2000. [ bib | DOI | http ]
Keywords: Animals; Computer Communication Networks; Databases, Factual; Eukaryotic Cells; Genes; Humans; Metaphysics; Mice; Molecular Biology; Sequence Analysis, DNA; Terminology as Topic
[Juditsky2000Functional] A. Juditsky and A. Nemirovski. Functional Aggregation for Nonparametric Estimation. Ann. Stat., 28(3):681-712, June 2000. [ bib | .ps.gz | .pdf ]
[Shimodaira2000Improving] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227-244, October 2000. [ bib | DOI | http ]
A class of predictive densities is derived by weighting the observed samples in maximizing the log-likelihood function. This approach is effective in cases such as sample surveys or design of experiments, where the observed covariate follows a different distribution than that in the whole population. Under misspecification of the parametric model, the optimal choice of the weight function is asymptotically shown to be the ratio of the density function of the covariate in the population to that in the observations. This is the pseudo-maximum likelihood estimation of sample surveys. The optimality is defined by the expected Kullback-Leibler loss, and the optimal weight is obtained by considering the importance sampling identity. Under correct specification of the model, however, the ordinary maximum likelihood estimate (i.e. the uniform weight) is shown to be optimal asymptotically. For moderate sample size, the situation is in between the two extreme cases, and the weight function is selected by minimizing a variant of the information criterion derived as an estimate of the expected loss. The method is also applied to a weighted version of the Bayesian predictive density. Numerical examples as well as Monte-Carlo simulations are shown for polynomial regression. A connection with the robust parametric estimation is discussed.

Keywords: domain-adaptation
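
The weighting idea in the Shimodaira abstract above can be sketched for the simplest case: a straight-line model with Gaussian noise, where maximizing the weighted log-likelihood reduces to weighted least squares. The Gaussian covariate densities, the function names, and the (mean, std) parameter tuples are assumptions chosen for illustration and do not come from the paper, which treats general parametric models and a tunable weight function.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution (assumed covariate model)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_weights(xs, observed, population):
    """Ratio of population density to observed density at each covariate.

    `observed` and `population` are hypothetical (mean, std) pairs.
    """
    return [gaussian_pdf(x, *population) / gaussian_pdf(x, *observed) for x in xs]


def weighted_line_fit(xs, ys, ws):
    """Weighted least squares for y = intercept + slope * x.

    With Gaussian noise this is the maximizer of the weighted log-likelihood.
    """
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope
```

When the observed and population densities coincide, every weight equals 1 and the fit reduces to ordinary least squares, which matches the uniform-weight case described in the abstract.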
[Jordan2001Learning] M. Jordan, editor. Learning in Graphical Models. The MIT Press, 2001. [ bib ]
[Zhu2001Global] H. Zhu, M. Bilgin, R. Bangham, D. Hall, A. Casamayor, P. Bertone, N. Lan, R. Jansen, S. Bidlingmaier, T. Houfek, T. Mitchell, P. Miller, R. A. Dean, M. Gerstein, and M. Snyder. Global analysis of protein activities using proteome chips. Science, 293(5537):2101-5, Sep 2001. [ bib | DOI | http | .pdf ]
To facilitate studies of the yeast proteome, we cloned 5800 open reading frames and overexpressed and purified their corresponding proteins. The proteins were printed onto slides at high spatial density to form a yeast proteome microarray and screened for their ability to interact with proteins and phospholipids. We identified many new calmodulin- and phospholipid-interacting proteins; a common potential binding motif was identified for many of the calmodulin-binding proteins. Thus, microarrays of an entire eukaryotic proteome can be prepared and screened for diverse biochemical activities. The microarrays can also be used to screen protein-drug interactions and to detect posttranslational modifications.

Keywords: Amino Acid Motifs, Amino Acid Sequence, Calmodulin, Calmodulin-Binding Proteins, Cell Membrane, Cloning, Fungal Proteins, Glucose, Liposomes, Membrane Proteins, Molecular, Molecular Sequence Data, Non-U.S. Gov't, Open Reading Frames, P.H.S., Peptide Library, Phosphatidylcholines, Phosphatidylinositols, Phospholipids, Protein Binding, Proteome, Recombinant Fusion Proteins, Research Support, Saccharomyces cerevisiae, Signal Transduction, Streptavidin, U.S. Gov't, 11474067
[Yook2001Weighted] S. H. Yook, H. Jeong, Y. Tu, and A.-L. Barabási. Weighted evolution networks. Phys. Rev. Lett., 86(25):5835-5838, 2001. [ bib | .pdf | .pdf ]
[Yeang2001Molecular] C.H. Yeang, S. Ramaswamy, P. Tamayo, S. Mukherjee, R.M. Rifkin, M. Angelo, M. Reich, E. Lander, J. Mesirov, and T. Golub. Molecular classification of multiple tumor types. Bioinformatics, 17(Suppl. 1):S316-S322, 2001. [ bib | http | .pdf ]
Using gene expression data to classify tumor types is a very promising tool in cancer diagnosis. Previous works show several pairs of tumor types can be successfully distinguished by their gene expression patterns (Golub et al. 1999, Ben-Dor et al. 2000, Alizadeh et al. 2000). However, the simultaneous classification across a heterogeneous set of tumor types has not been well studied yet. We obtained 190 samples from 14 tumor classes and generated a combined expression dataset containing 16063 genes for each of those samples. We performed multi-class classification by combining the outputs of binary classifiers. Three binary classifiers (k-nearest neighbors, weighted voting, and support vector machines) were applied in conjunction with three combination scenarios (one-vs-all, all-pairs, hierarchical partitioning). We achieved the best cross validation error rate of 18.75% [...] support vector machine algorithm. The results demonstrate the feasibility of performing clinically useful classification from samples of multiple tumor types.

Keywords: biosvm
[Yano2001Evaluating] Y. Yano, S.L. Beal, and L.B. Sheiner. Evaluating pharmacokinetic/pharmacodynamic models using the Posterior Predictive Check. J Pharmacokin Pharmacodynam, 28(2):171-192, 2001. [ bib ]
The posterior predictive check (PPC) is a model evaluation tool. It assigns a value (p_PPC) to the probability that the value of a given statistic computed from data arising under an analysis model is as or more extreme than the value computed from the real data themselves. If this probability is too small, the analysis model is regarded as invalid for the given statistic. Properties of the PPC for pharmacokinetic (PK) and pharmacodynamic (PD) model evaluation are examined herein for a particularly simple simulation setting: extensive sampling of a single individual's data arising from simple PK/PD and error models. To test the performance characteristics of the PPC, repeatedly, "real" data are simulated and for a variety of statistics, the PPC is applied to an analysis model, which may (null hypothesis) or may not (alternative hypothesis) be identical to the simulation model. Five models are used here: (PK1) mono-exponential with proportional error, (PK2) biexponential with proportional error, (PK2ε) biexponential with additive error, (PD1) Emax model with additive error under the logit transform, and (PD2) sigmoid Emax model with additive error under the logit transform. Six simulation/analysis settings are studied. The first three, (PK1/PK1), (PK2/PK2), and (PD1/PD1) evaluate whether the PPC has appropriate type-I error level, whereas the second three (PK2/PK1), (PK2ε/PK2), and (PD2/PD1) evaluate whether the PPC has adequate power. For a set of 100 data sets simulated/analyzed under each model pair according to a stipulated extensive sampling design, the p_PPC is computed for a number of statistics in three different ways (each way uses a different approximation to the posterior distribution on the model parameters). We find that in general: (i) The PPC is conservative under the null in the sense that for many statistics, prob(p_PPC ≤ α) ≤ α for small α.
With respect to such statistics, this means that useful models will rarely be regarded incorrectly as invalid. A high correlation of a statistic with the parameter estimates obtained from the same data used to compute the statistic (a measure of statistical "sufficiency") tends to identify the most conservative statistics. (ii) Power is not very great, at least for the alternative models we tested, and it is especially poor with "statistics" that are in part a function of parameters as well as data. Although there is a tendency for nonsufficient statistics (as we have measured this) to have greater power, this is by no means an infallible diagnostic. (iii) No clear advantage for one or another method of approximating the posterior distribution on model parameters is found.

[Xue2001Mini-fingerprints] L. Xue, F. L. Stahura, J. W. Godden, and J. Bajorath. Mini-fingerprints detect similar activity of receptor ligands previously recognized only by three-dimensional pharmacophore-based methods. J Chem Inf Comput Sci, 41(2):394-401, 2001. [ bib ]
Mini-fingerprints (MFPs) are short binary bit string representations of molecular structure and properties, composed of few selected two-dimensional (2D) descriptors and a number of structural keys. MFPs were specifically designed to recognize compounds with similar activity. Here we report that MFPs are capable of detecting similar activities of some druglike molecules, including endothelin A antagonists and alpha(1)-adrenergic receptor ligands, the recognition of which was previously thought to depend on the use of multiple point three-dimensional (3D) pharmacophore methods. Thus, in these cases, MFPs and pharmacophore fingerprints produce similar results, although they define, in terms of their complexity, opposite ends of the spectrum of methods currently used to study molecular similarity or diversity. For each of the studied compound classes, comparison of MFP bit settings identified a consensus or signature pattern. Scaling factors can be applied to these bits in order to increase the probability of finding compounds with similar activity by virtual screening.

Keywords: Adrenergic, Angiotensin II, Cell Surface, Combinatorial Chemistry Techniques, Databases, Drug Evaluation, Endothelins, Environmental Pollutants, Factual, Information Management, Ligands, Molecular Structure, Pharmaceutical Preparations, Platelet Glycoprotein GPIIb-IIIa Complex, Preclinical, Receptors, Serine Proteinase Inhibitors, Structure-Activity Relationship, User-Computer Interface, alpha-1, 11277728
[Xue2001Fingerprint] L. Xue, F. L. Stahura, J. W. Godden, and J. Bajorath. Fingerprint scaling increases the probability of identifying molecules with similar activity in virtual screening calculations. J Chem Inf Comput Sci, 41(3):746-753, 2001. [ bib ]
Results of systematic virtual screening calculations using a structural key-type fingerprint are reported for compounds belonging to 14 activity classes added to randomly selected synthetic molecules. For each class, a fingerprint profile was calculated to monitor the relative occupancy of fingerprint bit positions. Consensus bit patterns were determined consisting of all bits that were always set on in compounds belonging to a specific activity class. In virtual screening calculations, scale factors were applied to each consensus bit position in fingerprints of query molecules. This technique, called "fingerprint scaling", effectively increases the weight of consensus bit positions in fingerprint comparisons. Although overall prediction accuracy was satisfactory using unscaled calculations, scaling significantly increased the number of correct predictions but only slightly increased the rate of false positives. These observations suggest that fingerprint scaling is an attractive approach to increase the probability of identifying molecules with similar activity by virtual screening. It requires the availability of a series of related compounds and can be easily applied to any keyed fingerprint representation that associates bit positions with specific molecular features.

Keywords: 16S, Algae, Algorithms, Animals, Archaeal, Automation, Bacteria, Biodiversity, Chemical, Colorimetry, Computational Biology, Computer Terminals, DNA, DNA Fingerprinting, Daphnia, Databases, Ecosystem, Euryarchaeota, Factual, Fresh Water, Hazardous Substances, Humans, Information Storage and Retrieval, Methane, Models, Non-U.S. Gov't, Oxidoreductases, Perciformes, Photic Stimulation, Photometry, Polymorphism, Quantitative Structure-Activity Relationship, RNA, Research Support, Restriction Fragment Length, Ribosomal, Seasons, Soil Microbiology, Spain, Sulfur, Theoretical, Time Factors, Toxicity Tests, Water Microbiology, Water Pollutants, 11410055
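
To make the fingerprint scaling idea from the Xue et al. abstract above concrete, the sketch below computes consensus bits for an activity class and a Tanimoto-style similarity in which those bits carry extra weight. Representing fingerprints as sets of on-bit indices, the weighted Tanimoto form, and the names `consensus_bits` and `scaled_tanimoto` are assumptions for illustration; the paper's exact scaling scheme may differ.

```python
def consensus_bits(fingerprints):
    """Bit positions that are set in every fingerprint of one activity class.

    Each fingerprint is a set of on-bit indices (assumed representation);
    at least one fingerprint is required.
    """
    return set.intersection(*(set(fp) for fp in fingerprints))


def scaled_tanimoto(query, candidate, consensus, scale):
    """Tanimoto coefficient with consensus bits counted `scale` times.

    With scale == 1.0 this is the ordinary Tanimoto coefficient on bit sets.
    """
    def weight(bit):
        return scale if bit in consensus else 1.0

    shared = sum(weight(b) for b in query & candidate)
    total = sum(weight(b) for b in query | candidate)
    return shared / total if total else 0.0
```

For example, with consensus bits {1, 2}, a query {1, 2, 5} and a candidate {1, 2, 8} score 0.5 unscaled and 0.75 with a scale factor of 3, so candidates sharing the class signature rise in the ranking.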
[Xiong2001Biomarker] M. Xiong, X. Fang, and J. Zhao. Biomarker Identification by Feature Wrappers. Genome Res., 11(11):1878-1887, 2001. [ bib | http | .pdf ]
Gene expression studies bridge the gap between DNA information and trait information by dissecting biochemical pathways into intermediate components between genotype and phenotype. These studies open new avenues for identifying complex disease genes and biomarkers for disease diagnosis and for assessing drug efficacy and toxicity. However, the majority of analytical methods applied to gene expression data are not efficient for biomarker identification and disease diagnosis. In this paper, we propose a general framework to incorporate feature (gene) selection into pattern recognition in the process to identify biomarkers. Using this framework, we develop three feature wrappers that search through the space of feature subsets using the classification error as measure of goodness for a particular feature subset being "wrapped around": linear discriminant analysis, logistic regression, and support vector machines. To effectively carry out this computationally intensive search process, we employ sequential forward search and sequential forward floating search algorithms. To evaluate the performance of feature selection for biomarker identification we have applied the proposed methods to three data sets. The preliminary results demonstrate that very high classification accuracy can be attained by identified composite classifiers with several biomarkers.

Keywords: biosvm
[Wang2001Methylation] H. Wang, Z. Q. Huang, L. Xia, Q. Feng, H. Erdjument-Bromage, B. D. Strahl, S. D. Briggs, C. D. Allis, J. Wong, P. Tempst, and Y. Zhang. Methylation of histone H4 at arginine 3 facilitating transcriptional activation by nuclear hormone receptor. Science, 293(5531):853-857, Aug 2001. [ bib | DOI | http ]
Acetylation of core histone tails plays a fundamental role in transcription regulation. In addition to acetylation, other posttranslational modifications, such as phosphorylation and methylation, occur in core histone tails. Here, we report the purification, molecular identification, and functional characterization of a histone H4-specific methyltransferase PRMT1, a protein arginine methyltransferase. PRMT1 specifically methylates arginine 3 (Arg 3) of H4 in vitro and in vivo. Methylation of Arg 3 by PRMT1 facilitates subsequent acetylation of H4 tails by p300. However, acetylation of H4 inhibits its methylation by PRMT1. Most important, a mutation in the S-adenosyl-l-methionine-binding site of PRMT1 substantially crippled its nuclear receptor coactivator activity. Our finding reveals Arg 3 of H4 as a novel methylation site by PRMT1 and indicates that Arg 3 methylation plays an important role in transcriptional regulation.

Keywords: Acetylation; Amino Acid Sequence; Animals; Arginine, metabolism; Binding Sites; Cell Nucleus, metabolism; Hela Cells; Histones, chemistry/metabolism; Humans; Hydroxamic Acids, pharmacology; Lysine, metabolism; Methylation; Methyltransferases, chemistry/genetics/isolation & purification/metabolism; Molecular Sequence Data; Mutation; Oocytes; Receptors, Androgen, metabolism; Recombinant Proteins, metabolism; S-Adenosylmethionine, metabolism; Transcriptional Activation; Xenopus
[Wagner2001Yeast] A. Wagner. The Yeast Protein Interaction Network Evolves Rapidly and Contains Few Redundant Duplicate Genes. Mol. Biol. Evol., 18:1283-1292, 2001. [ bib | .html | .pdf ]
[Vert2001Statistical] J.-P. Vert. Statistical Methods for Natural Language Modelling. PhD thesis, Paris 6 University, 2001. [ bib | www: ]
[Vert2001Adaptive] J.-P. Vert. Adaptive context trees and text clustering. IEEE Trans. Inform. Theory, 47(5):1884-1901, Jul 2001. [ bib | DOI | http | .pdf ]
In the finite-alphabet context we propose four alternatives to fixed-order Markov models to estimate a conditional distribution. They consist in working with a large class of variable-length Markov models represented by context trees, and building an estimator of the conditional distribution with a risk of the same order as the risk of the best estimator for every model simultaneously, in a conditional Kullback-Leibler sense. Such estimators can be used to model complex objects like texts written in natural language and define a notion of similarity between them. This idea is illustrated by experimental results of unsupervised text clustering.

[Vert2001Text] J.-P. Vert. Text categorization using adaptive context trees. In A. Gelbukh, editor, Proceedings of the CICLing-2001 conference, volume 2004 of LNCS, pages 423-436. Springer Verlag, 2001. [ bib ]
[Vercoutere2001Rapid] W. Vercoutere, S. Winters-Hilt, H. Olsen, D. Deamer, D. Haussler, and M. Akeson. Rapid discrimination among individual DNA hairpin molecules at single-nucleotide resolution using an ion channel. Nat Biotechnol, 19(3):248-52, Mar 2001. [ bib | DOI | http | .pdf ]
RNA and DNA strands produce ionic current signatures when driven through an alpha-hemolysin channel by an applied voltage. Here we combine this nanopore detector with a support vector machine (SVM) to analyze DNA hairpin molecules on the millisecond time scale. Measurable properties include duplex stem length, base pair mismatches, and loop length. This nanopore instrument can discriminate between individual DNA hairpins that differ by one base pair or by one nucleotide.

Keywords: Acute, Acute Disease, Adenocarcinoma, Algorithms, Amino Acid Sequence, Artificial Intelligence, Automated, B-Lymphocytes, Bacterial Proteins, Base Pair Mismatch, Base Sequence, Bayes Theorem, Binding Sites, Biological, Bone Marrow Cells, Cell Compartmentation, Chemistry, Child, Chromosome Aberrations, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA, Data Interpretation, Databases, Decision Trees, Diagnosis, Discriminant Analysis, Electric Conductivity, Electrophysiology, Escherichia coli Proteins, Factual, Female, Fungal, Gastric Emptying, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Hemolysins, Humans, Ion Channels, Kinetics, Leukemia, Lipid Bilayers, Logistic Models, Lymphocytic, Male, Markov Chains, Melanoma, Models, Molecular, Myeloid, Neoplasm, Neoplastic, Neural Networks (Computer), Nevus, Non-P.H.S., Non-U.S. Gov't, Nucleic Acid Conformation, Organ Specificity, Organelles, P.H.S., Pattern Recognition, Physical, Pigmented, Predictive Value of Tests, Promoter Regions (Genetics), Protein Folding, Protein Structure, Proteins, Proteome, RNA, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Secondary, Sensitivity and Specificity, Sequence Alignment, Sex Characteristics, Skin Diseases, Skin Neoplasms, Skin Pigmentation, Software, Statistical, Stomach Diseases, T-Lymphocytes, Thermodynamics, Transcription, Transcription Factors, Tumor Markers, U.S. Gov't, 11231558
[Venter2001Sequence] J. C. Venter et al. The Sequence of the Human Genome. Science, 291(5507):1304-1351, 2001. [ bib | http | .pdf ]
A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies, a whole-genome assembly and a regional chromosome assembly, were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% [...] bp or more, and 25% [...] or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional ~12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% [...] with 75% [...] segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history. Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems.
DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% [...] task of determining which SNPs have functional consequences remains an open challenge.

Keywords: genomics bio
[Vazquez2001Modeling] A. Vazquez, A. Flammini, A. Maritan, and A. Vespignani. Modeling of protein interaction networks. E-print cond-mat/0108043, Aug 2001. [ bib | http | .pdf ]
[Tut2001Cyclin] V.M. Tut, K.L. Braithwaite, B. Angus, D.E. Neal, J. Lunec, and J.K. Mellon. Cyclin D1 expression in transitional cell carcinoma of the bladder: correlation with p53, WAF1, pRb and Ki67. Br J Cancer, 84:270-275, 2001. [ bib ]
[Tusher2001Significance] V. G. Tusher, R. Tibshirani, and G. Chu. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA, 98(9):5116-5121, Apr 2001. [ bib | DOI | http | .pdf ]
Microarrays can measure the expression of thousands of genes to identify changes in expression between different biological states. Methods are needed to determine the significance of these changes while accounting for the enormous number of genes. We describe a method, Significance Analysis of Microarrays (SAM), that assigns a score to each gene on the basis of change in gene expression relative to the standard deviation of repeated measurements. For genes with scores greater than an adjustable threshold, SAM uses permutations of the repeated measurements to estimate the percentage of genes identified by chance, the false discovery rate (FDR). When the transcriptional response of human cells to ionizing radiation was measured by microarrays, SAM identified 34 genes that changed at least 1.5-fold with an estimated FDR of 12%, compared with FDRs of 60 and 84% by using conventional methods of analysis. Of the 34 genes, 19 were involved in cell cycle regulation and 3 in apoptosis. Surprisingly, four nucleotide excision repair genes were induced, suggesting that this repair pathway for UV-damaged DNA might play a previously unrecognized role in repairing DNA damaged by ionizing radiation.

Keywords: csbcbook, csbcbook-ch4
[Turlin2001Regulation] E. Turlin, M. Perrotte-Piquemal, A. Danchin, and F. Biville. Regulation of the early steps of 3-phenylpropionate catabolism in Escherichia coli. J. Mol. Microbiol. Biotechnol., 3(1):127-133, Jan 2001. [ bib ]
Microbial catabolism of phenylpropanoid compounds plays a key role in the degradation of aromatic molecules originating from the degradation of proteins and plant constituents. In this study, the regulation of the early steps in the utilisation of 3-phenylpropionate, a phenylpropanoid compound, was investigated. Expression of the hcaA gene product, which is involved in 3-phenylpropionate catabolism in Escherichia coli, was positively regulated by HcaR, a regulatory protein similar to members of the LysR regulators family. Remarkably, the expression of hcaA in the presence of 3-phenylpropionate was sharply and transiently induced at the end of the exponential growth phase. This occurred in a rpoS-independent manner. This transient induction was also mediated by HcaR. The expression of this positive regulator is negatively autoregulated, as for other members of the LysR family. The expression of hcaR is strongly repressed in the presence of glucose. Glucose-dependent repression of hcaR expression could only be partially overcome by adding exogenous cAMP.

[Troyanskaya2001Missing] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. B. Altman. Missing value estimation methods for DNA microarrays. Bioinformatics, 17:520-525, 2001. [ bib | http | .pdf ]
[Tibshirani2001Estimating] R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters in a data set via the gap statistic. J. R. Stat. Soc. Ser. B, 63:411-423, 2001. [ bib ]
[ThomasDLF2001] R. Thomas and M. Kaufman. Multistationarity, the basis of cell differentiation and memory. II. Logical analysis of regulatory networks in terms of feedback circuits. Chaos, 11(1):180-195, 2001. [ bib ]
Circuits and their involvement in complex dynamics are described in differential terms in Part I of this work. Here, we first explain why it may be appropriate to use a logical description, either by itself or in symbiosis with the differential description. The major problem of a logical description is to find an adequate way to involve time. The procedure we adopted differs radically from the classical one by its fully asynchronous character. In Sec. II we describe our "naive" logical approach, and use it to illustrate the major laws of circuitry (namely, the involvement of positive circuits in multistationarity and of negative circuits in periodicity) and in a biological example. Already in the naive description, the major steps of the logical description are to: (i) describe a model as a set of logical equations, (ii) derive the state table from the equations, (iii) derive the graph of the sequences of states from the state table, and (iv) determine which of the possible pathways will be actually followed in terms of time delays. In the following sections we consider multivalued variables where required, the introduction of logical parameters and of logical values ascribed to the thresholds, and the concept of characteristic state of a circuit. This generalized logical description provides an image whose qualitative fit with the differential description is quite remarkable. A major interest of the generalized logical description is that it implies a limited and often quite small number of possible combinations of values of the logical parameters. The space of the logical parameters is thus cut into a limited number of boxes, each of which is characterized by a defined qualitative behavior of the system. Our analysis tells which constraints on the logical parameters must be fulfilled in order for any circuit (or combination of circuits) to be functional. 
Functionality of a circuit will result in multistationarity (in the case of a positive circuit) or in a cycle (in the case of a negative circuit). The last sections deal with "more about time delays" and "reverse logic," an approach that aims to proceed rationally from facts to models. (c) 2001 American Institute of Physics.

[Teixeira2001Recent] R. D. Teixeira, A. P. Braga, R. H. Takahashi, and R. R. Saldanha. Recent advances in the MOBJ algorithm for training artificial neural networks. Int J Neural Syst, 11(3):265-70, Jun 2001. [ bib ]
This paper presents a new scheme for training MLPs which employs a relaxation method for multi-objective optimization. The algorithm works by obtaining a reduced set of solutions, from which the one with the best generalization is selected. This approach allows balancing between the training error and norm of network weight vectors, which are the two objective functions of the multi-objective optimization problem. The method is applied to classification and regression problems and compared with Weight Decay (WD), Support Vector Machines (SVMs) and standard Backpropagation (BP). It is shown that the systematic procedure for training proposed results on good generalization neural models, and outperforms traditional methods.

[Tax2001Uniform] D. M. J. Tax and R. P. W. Duin. Uniform Object Generation for Optimizing One-class Classifiers. J. Mach. Learn. Res., 2:155-173, 2001. [ bib | .pdf ]
[Talukder2001closed-form] A. Talukder and D. Casasent. A closed-form neural network for discriminatory feature extraction from high-dimensional data. Neural Netw, 14(9):1201-18, Nov 2001. [ bib ]
We consider a new neural network for data discrimination in pattern recognition applications. We refer to this as a maximum discriminating feature (MDF) neural network. Its weights are obtained in closed-form, thereby overcoming problems associated with other nonlinear neural networks. It uses neuron activation functions that are dynamically chosen based on the application. It is theoretically shown to provide nonlinear transforms of the input data that are more general than those provided by other nonlinear multilayer perceptron neural network and support-vector machine techniques for cases involving high-dimensional (image) inputs where training data are limited and the classes are not linearly separable. We experimentally verify this on synthetic examples.

[Soerlie2001Gene] T. Sørlie, C. M. Perou, R. Tibshirani, T. Aas, S. Geisler, H. Johnsen, T. Hastie, M. B. Eisen, M. van de Rijn, S. S. Jeffrey, T. Thorsen, H. Quist, J. C. Matese, P. O. Brown, D. Botstein, P. Eystein Lønning, and A. L. Børresen-Dale. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc. Natl. Acad. Sci. USA, 98(19):10869-10874, Sep 2001. [ bib | DOI | http | .pdf ]
The purpose of this study was to classify breast carcinomas based on variations in gene expression patterns derived from cDNA microarrays and to correlate tumor characteristics to clinical outcome. A total of 85 cDNA microarray experiments representing 78 cancers, three fibroadenomas, and four normal breast tissues were analyzed by hierarchical clustering. As reported previously, the cancers could be classified into a basal epithelial-like group, an ERBB2-overexpressing group and a normal breast-like group based on variations in gene expression. A novel finding was that the previously characterized luminal epithelial/estrogen receptor-positive group could be divided into at least two subgroups, each with a distinctive expression profile. These subtypes proved to be reasonably robust by clustering using two different gene sets: first, a set of 456 cDNA clones previously selected to reflect intrinsic properties of the tumors and, second, a gene set that highly correlated with patient outcome. Survival analyses on a subcohort of patients with locally advanced breast cancer uniformly treated in a prospective study showed significantly different outcomes for the patients belonging to the various groups, including a poor prognosis for the basal-like subtype and a significant difference in outcome for the two estrogen receptor-positive groups.

Keywords: breastcancer, csbcbook, csbcbook-ch2
[Suykens2001Optimal] J. A. Suykens, J. Vandewalle, and B. De Moor. Optimal control by least squares support vector machines. Neural Netw, 14(1):23-35, Jan 2001. [ bib ]
Support vector machines have been very successful in pattern recognition and function estimation problems. In this paper we introduce the use of least squares support vector machines (LS-SVM's) for the optimal control of nonlinear systems. Linear and neural full static state feedback controllers are considered. The problem is formulated in such a way that it incorporates the N-stage optimal control problem as well as a least squares support vector machine approach for mapping the state space into the action space. The solution is characterized by a set of nonlinear equations. An alternative formulation as a constrained nonlinear optimization problem in less unknowns is given, together with a method for imposing local stability in the LS-SVM control scheme. The results are discussed for support vector machines with radial basis function kernel. Advantages of LS-SVM control are that no number of hidden units has to be determined for the controller and that no centers have to be specified for the Gaussian kernels when applying Mercer's condition. The curse of dimensionality is avoided in comparison with defining a regular grid for the centers in classical radial basis function networks. This is at the expense of taking the trajectory of state variables as additional unknowns in the optimization problem, while classical neural network approaches typically lead to parametric optimization problems. In the SVM methodology the number of unknowns equals the number of training data, while in the primal space the number of unknowns can be infinite dimensional. The method is illustrated both on stabilization and tracking problems including examples on swinging up an inverted pendulum with local stabilization at the endpoint and a tracking problem for a ball and beam system.

Keywords: Acute, Acute Disease, Adenocarcinoma, Algorithms, Amino Acid Sequence, Artificial Intelligence, Automated, B-Lymphocytes, Bacterial Proteins, Base Pair Mismatch, Base Sequence, Bayes Theorem, Binding Sites, Biological, Bone Marrow Cells, Cell Compartmentation, Chemistry, Child, Chromosome Aberrations, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA, Data Interpretation, Databases, Decision Trees, Diagnosis, Discriminant Analysis, Electric Conductivity, Electrophysiology, Escherichia coli Proteins, Factual, Feedback, Female, Fungal, Gastric Emptying, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Hemolysins, Humans, Ion Channels, Kinetics, Leukemia, Lipid Bilayers, Logistic Models, Lymphocytic, Male, Markov Chains, Melanoma, Models, Molecular, Myeloid, Neoplasm, Neoplastic, Neural Networks (Computer), Nevus, Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Normal Distribution, Nucleic Acid Conformation, Organ Specificity, Organelles, P.H.S., Pattern Recognition, Physical, Pigmented, Predictive Value of Tests, Promoter Regions (Genetics), Protein Folding, Protein Structure, Proteins, Proteome, RNA, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Secondary, Sensitivity and Specificity, Sequence Alignment, Sex Characteristics, Skin Diseases, Skin Neoplasms, Skin Pigmentation, Software, Statistical, Stomach Diseases, T-Lymphocytes, Thermodynamics, Transcription, Transcription Factors, Tumor Markers, U.S. Gov't, 11213211
[Su2001Molecular] A. I. Su, J. B. Welsh, L. M. Sapinoso, S. G. Kern, P. Dimitrov, H. Lapp, P. G. Schultz, S. M. Powell, C. A. Moskaluk, H. F. Frierson Jr., and G. M. Hampton. Molecular Classification of Human Carcinomas by Use of Gene Expression Signatures. Cancer Res., 61(20):7388-7393, 2001. [ bib | http | .html ]
Classification of human tumors according to their primary anatomical site of origin is fundamental for the optimal treatment of patients with cancer. Here we describe the use of large-scale RNA profiling and supervised machine learning algorithms to construct a first-generation molecular classification scheme for carcinomas of the prostate, breast, lung, ovary, colorectum, kidney, liver, pancreas, bladder/ureter, and gastroesophagus, which collectively account for ~70% of all cancer-related deaths in the United States. The classification scheme was based on identifying gene subsets whose expression typifies each cancer class, and we quantified the extent to which these genes are characteristic of a specific tumor type by accurately and confidently predicting the anatomical site of tumor origin for 90% of 175 carcinomas, including 9 of 12 metastatic lesions. The predictor gene subsets include those whose expression is typical of specific types of normal epithelial differentiation, as well as other genes whose expression is elevated in cancer. This study demonstrates the feasibility of predicting the tissue origin of a carcinoma in the context of multiple cancer classes.

Keywords: biosvm, breastcancer
[Strogatz2001Exploring] S. S. Strogatz. Exploring complex networks. Nature, 410:268-276, 2001. [ bib | http | .pdf ]
[Strahl2001Methylation] B. D. Strahl, S. D. Briggs, C. J. Brame, J. A. Caldwell, S. S. Koh, H. Ma, R. G. Cook, J. Shabanowitz, D. F. Hunt, M. R. Stallcup, and C. D. Allis. Methylation of histone H4 at arginine 3 occurs in vivo and is mediated by the nuclear receptor coactivator PRMT1. Curr Biol, 11(12):996-1000, Jun 2001. [ bib ]
Posttranslational modifications of histone amino termini play an important role in modulating chromatin structure and function. Lysine methylation of histones has been well documented, and recently this modification has been linked to cellular processes involving gene transcription and heterochromatin assembly. However, the existence of arginine methylation on histones has remained unclear. Recent discoveries of protein arginine methyltransferases, CARM1 and PRMT1, as transcriptional coactivators for nuclear receptors suggest that histones may be physiological targets of these enzymes as part of a poorly defined transcriptional activation pathway. Here we show by using mass spectrometry that histone H4, isolated from asynchronously growing human 293T cells, is methylated at arginine 3 (Arg-3) in vivo. In support, a novel antibody directed against histone H4 methylated at Arg-3 independently demonstrates the in vivo occurrence of this modification and reveals that H4 Arg-3 methylation is highly conserved throughout eukaryotes. Finally, we show that PRMT1 is the major, if not exclusive, H4 Arg-3 methyltransferase in human 293T cells. These findings suggest a role for arginine methylation of histones in the transcription process.

Keywords: Amino Acid Motifs; Animals; Arginine, metabolism; Cell Line; Genes, Reporter; Histones, metabolism; Humans; Immunoblotting; Methylation; Protein-Arginine N-Methyltransferases, metabolism; Recombinant Fusion Proteins, genetics/metabolism
[Steinwart2001On] I. Steinwart. On the influence of the kernel on the consistency of support vector machines. J. Mach. Learn. Res., 2:67-93, 2001. [ bib | .html | .pdf ]
In this article we study the generalization abilities of several classifiers of support vector machine (SVM) type using a certain class of kernels that we call universal. It is shown that the soft margin algorithms with universal kernels are consistent for a large class of classification problems including some kind of noisy tasks provided that the regularization parameter is chosen well. In particular we derive a simple sufficient condition for this parameter in the case of Gaussian RBF kernels. On the one hand our considerations are based on an investigation of an approximation property-the so-called universality-of the used kernels that ensures that all continuous functions can be approximated by certain kernel expressions. This approximation property also gives a new insight into the role of kernels in these and other algorithms. On the other hand the results are achieved by a precise study of the underlying optimization problems of the classifiers. Furthermore, we show consistency for the maximal margin classifier as well as for the soft margin SVM's in the presence of large margins. In this case it turns out that also constant regularization parameters ensure consistency for the soft margin SVM's. Finally we prove that even for simple, noise free classification problems SVM's with polynomial kernels can behave arbitrarily badly.

[Sole2001Model] R. V. Solé, R. Pastor-Satorras, E. D. Smith, and T. Kepler. A Model of Large-Scale Proteome Evolution. Technical report, Santa Fe Institute, 2001. Working paper 01-08-041. [ bib | .html | .pdf ]
[Shimodaira2001Dynamic] H. Shimodaira, K.-I. Noma, M. Nakai, and S. Sagayama. Dynamic time-alignment kernel in support vector machine. In Adv. Neural. Inform. Process Syst., pages 921-928, 2001. [ bib ]
[Sherry2001dbSNP] S. T. Sherry, M. H. Ward, M. Kholodov, J. Baker, L. Phan, E. M. Smigielski, and K. Sirotkin. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res, 29(1):308-311, Jan 2001. [ bib ]
In response to a need for a general catalog of genome variation to address the large-scale sampling designs required by association studies, gene mapping and evolutionary biology, the National Center for Biotechnology Information (NCBI) has established the dbSNP database [S.T.Sherry, M.Ward and K. Sirotkin (1999) Genome Res., 9, 677-679]. Submissions to dbSNP will be integrated with other sources of information at NCBI such as GenBank, PubMed, LocusLink and the Human Genome Project data. The complete contents of dbSNP are available to the public at website: http://www.ncbi.nlm.nih.gov/SNP. The complete contents of dbSNP can also be downloaded in multiple formats via anonymous FTP at ftp://ncbi.nlm.nih.gov/snp/.

Keywords: Animals; Biotechnology; Databases, Factual; Genetic Variation; Humans; Information Services; Internet; National Institutes of Health (U.S.); National Library of Medicine (U.S.); Polymorphism, Single Nucleotide, genetics; United States
[Sherlock2001Stanford] G. Sherlock, T. Hernandez-Boussard, A. Kasarskis, G. Binkley, J.C. Matese, S.S. Dwight, M. Kaloper, S. Weng, H. Jin, C.A. Ball, M.B. Eisen, and P.T. Spellman. The Stanford Microarray Database. Nucleic Acids Res., 29(1):152-155, Jan 2001. [ bib | .pdf | .pdf ]
[Sette2001HLA] A. Sette, R. Chesnut, and J. Fikes. HLA expression in cancer: implications for T cell-based immunotherapy. Immunogenetics, 53(4):255-263, 2001. [ bib ]
HLA class I expression is altered in a significant fraction of the tumor types reviewed here, reflecting either immune pressure or, simply, the accumulation of pathological changes and alterations. However, in all tumor types analyzed, a majority of the tumors express HLA class I, with a general tendency for the more severe alterations to be found in later-stage and less differentiated tumors. These results are encouraging for the development of specific immunotherapies, especially considering that (1) the relatively low sensitivity of immunohistochemical techniques might underestimate HLA expression in tumors, (2) class I expression can be induced in tumor cells as a result of local inflammation and lymphokine release, and (3) class I-negative cells would be predicted to be sensitive to lysis by natural killer cells.

Keywords: immunoinformatics
[Seol2001Skp1] J. H. Seol, A. Shevchenko, A. Shevchenko, and R. J. Deshaies. Skp1 forms multiple protein complexes, including RAVE, a regulator of V-ATPase assembly. Nat Cell Biol, 3(4):384-91, Apr 2001. [ bib | DOI | http | .pdf ]
SCF ubiquitin ligases are composed of Skp1, Cdc53, Hrt1 and one member of a large family of substrate receptors known as F-box proteins (FBPs). Here we report the identification, using sequential rounds of epitope tagging, affinity purification and mass spectrometry, of 16 Skp1 and Cdc53-associated proteins in budding yeast, including all components of SCF, 9 FBPs, Yjr033 (Rav1) and Ydr202 (Rav2). Rav1, Rav2 and Skp1 form a complex that we have named 'regulator of the (H+)-ATPase of the vacuolar and endosomal membranes' (RAVE), which associates with the V1 domain of the vacuolar membrane (H+)-ATPase (V-ATPase). V-ATPases are conserved throughout eukaryotes, and have been implicated in tumour metastasis and multidrug resistance, and here we show that RAVE promotes glucose-triggered assembly of the V-ATPase holoenzyme. Previous systematic genome-wide two-hybrid screens yielded 17 proteins that interact with Skp1 and Cdc53, only 3 of which overlap with those reported here. Thus, our results provide a distinct view of the interactions that link proteins into a comprehensive cellular network.

Keywords: Affinity, Affinity Labels, Amino Acid Sequence, Animals, Cell Cycle Proteins, Cells, Chromatography, Cloning, Comparative Study, Cullin Proteins, Cultured, Cytoplasm, DNA, DNA Damage, DNA Repair, Electrospray Ionization, Fungal, Fungal Proteins, Gene Targeting, Genetic, Glucose, Holoenzymes, Humans, Macromolecular Substances, Mass, Matrix-Assisted Laser Desorption-Ionization, Mitosis, Molecular, Molecular Sequence Data, Non-P.H.S., Non-U.S. Gov't, P.H.S., Phosphoric Monoester Hydrolases, Protein Binding, Protein Interaction Mapping, Protein Kinases, Proteome, Proteomics, Proton-Translocating ATPases, Recombinant Fusion Proteins, Research Support, Ribonucleoproteins, Ribosomes, S-Phase Kinase-Associated Proteins, Saccharomyces cerevisiae, Saccharomyces cerevisiae Proteins, Sensitivity and Specificity, Sequence Alignment, Signal Transduction, Species Specificity, Spectrometry, Spectrum Analysis, Transcription, U.S. Gov't, Vacuolar Proton-Translocating ATPases, 11283612
[Scholkopf2001Estimating] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Comput., 13:1443-1471, 2001. [ bib | .pdf ]
[Scholkopf2001Generalized] B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, volume 2011 of Lecture Notes in Computer Science, pages 416-426, Berlin / Heidelberg, 2001. Springer. [ bib | DOI ]
[Schellewald2001Evaluation] C. Schellewald, S. Roth, and C. Schnörr. Evaluation of convex optimization techniques for the weighted graph-matching problem in computer vision. In Proceedings of the 23rd DAGM-Symposium on Pattern Recognition, pages 361-368, London, UK, 2001. Springer-Verlag. [ bib ]
[Saupe20013D] D. Saupe and D. V. Vranic. 3D model retrieval with spherical harmonics and moments. In Proceedings of the 23rd DAGM-Symposium on Pattern Recognition, pages 392-397, London, UK, 2001. Springer-Verlag. [ bib ]
[Sachidanandam2001Map] R. Sachidanandam, D. Weissman, S. C. Schmidt, J. M. Kakol, L. D. Stein, G. Marth, S. Sherry, J. C. Mullikin, B. J. Mortimore, D. L. Willey, S. E. Hunt, C. G. Cole, P. C. Coggill, C. M. Rice, Z. Ning, J. Rogers, D. R. Bentley, P. Y. Kwok, E. R. Mardis, R. T. Yeh, B. Schultz, L. Cook, R. Davenport, M. Dante, L. Fulton, L. Hillier, R. H. Waterston, J. D. McPherson, B. Gilman, S. Schaffner, W. J. Van Etten, D. Reich, J. Higgins, M. J. Daly, B. Blumenstiel, J. Baldwin, N. Stange-Thomann, M. C. Zody, L. Linton, E. S. Lander, D. Altshuler, and International SNP Map Working Group. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature, 409(6822):928-933, Feb 2001. [ bib ]
We describe a map of 1.42 million single nucleotide polymorphisms (SNPs) distributed throughout the human genome, providing an average density on available sequence of one SNP every 1.9 kilobases. These SNPs were primarily discovered by two projects: The SNP Consortium and the analysis of clone overlaps by the International Human Genome Sequencing Consortium. The map integrates all publicly available SNPs with described genes and other genomic features. We estimate that 60,000 SNPs fall within exon (coding and untranslated regions), and 85% of exons are within 5 kb of the nearest SNP. Nucleotide diversity varies greatly across the genome, in a manner broadly consistent with a standard population genetic model of human history. This high-density SNP map provides a public resource for defining haplotype variation across the genome, and should help to identify biomedically important genes for diagnosis and therapy.

Keywords: Chromosome Mapping; Genetic Variation; Genetics, Medical; Genetics, Population; Genome, Human; Humans; Nucleotides; Polymorphism, Single Nucleotide
[Rosipal2001Kernel] R. Rosipal and L. J. Trejo. Kernel partial least squares regression in reproducing kernel Hilbert space. J. Mach. Learn. Res., 2:97-123, 2001. [ bib ]
[Risau-Gusman2001Statistical] S. Risau-Gusman and M. B. Gordon. Statistical mechanics of learning with soft margin classifiers. Phys Rev E Stat Nonlin Soft Matter Phys, 64(3 Pt 1):031907, Sep 2001. [ bib ]
We study the typical learning properties of the recently introduced soft margin classifiers (SMCs), learning realizable and unrealizable tasks, with the tools of statistical mechanics. We derive analytically the behavior of the learning curves in the regime of very large training sets. We obtain exponential and power laws for the decay of the generalization error towards the asymptotic value, depending on the task and on general characteristics of the distribution of stabilities of the patterns to be learned. The optimal learning curves of the SMCs, which give the minimal generalization error, are obtained by tuning the coefficient controlling the trade-off between the error and the regularization terms in the cost function. If the task is realizable by the SMC, the optimal performance is better than that of a hard margin support vector machine and is very close to that of a Bayesian classifier.

[Remm2001Automatic] M. Remm, C.E. Storm, and E.L. Sonnhammer. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol., 314(5):1041-1052, Dec 2001. [ bib | DOI | http | .pdf ]
Orthologs are genes in different species that originate from a single gene in the last common ancestor of these species. Such genes have often retained identical biological roles in the present-day organisms. It is hence important to identify orthologs for transferring functional information between genes in different organisms with a high degree of reliability. For example, orthologs of human proteins are often functionally characterized in model organisms. Unfortunately, orthology analysis between human and e.g. invertebrates is often complex because of large numbers of paralogs within protein families. Paralogs that predate the species split, which we call out-paralogs, can easily be confused with true orthologs. Paralogs that arose after the species split, which we call in-paralogs, however, are bona fide orthologs by definition. Orthologs and in-paralogs are typically detected with phylogenetic methods, but these are slow and difficult to automate. Automatic clustering methods based on two-way best genome-wide matches on the other hand, have so far not separated in-paralogs from out-paralogs effectively. We present a fully automatic method for finding orthologs and in-paralogs from two species. Ortholog clusters are seeded with a two-way best pairwise match, after which an algorithm for adding in-paralogs is applied. The method bypasses multiple alignments and phylogenetic trees, which can be slow and error-prone steps in classical ortholog detection. Still, it robustly detects complex orthologous relationships and assigns confidence values for both orthologs and in-paralogs. The program, called INPARANOID, was tested on all completely sequenced eukaryotic genomes. To assess the quality of INPARANOID results, ortholog clusters were generated from a dataset of worm and mammalian transmembrane proteins, and were compared to clusters derived by manual tree-based ortholog detection methods. 
This study led to the identification with a high degree of confidence of over a dozen novel worm-mammalian ortholog assignments that were previously undetected because of shortcomings of phylogenetic methods. A WWW server that allows searching for orthologs between human and several fully sequenced genomes is installed at http://www.cgb.ki.se/inparanoid/. This is the first comprehensive resource with orthologs of all fully sequenced eukaryotic genomes. Programs and tables of orthology assignments are available from the same location.

[Ravdin2001Computer] P. M. Ravdin, L. A. Siminoff, G. J. Davis, M. B. Mercer, J. Hewlett, N. Gerson, and H. L. Parker. Computer program to assist in making decisions about adjuvant therapy for women with early breast cancer. J Clin Oncol, 19(4):980-991, Feb 2001. [ bib ]
The goal of the computer program Adjuvant! is to allow health professionals and their patients with early breast cancer to make more informed decisions about adjuvant therapy. Actuarial analysis was used to project outcomes of patients with and without adjuvant therapy based on estimates of prognosis largely derived from Surveillance, Epidemiology, and End-Results data and estimates of the efficacy of adjuvant therapy based on the 1998 overviews of randomized trials of adjuvant therapy. These estimates can be refined using the Prognostic Factor Impact Calculator, which uses a Bayesian method to make adjustments based on relative risks conferred and prevalence of positive test results. From the entries of patient information (age, menopausal status, comorbidity estimate) and tumor staging and characteristics (tumor size, number of positive axillary nodes, estrogen receptor status), baseline prognostic estimates are made. Estimates for the efficacy of endocrine therapy (5 years of tamoxifen) and of polychemotherapy (cyclophosphamide/methotrexate/fluorouracil-like regimens, or anthracycline-based therapy, or therapy based on both an anthracycline and a taxane) can then be used to project outcomes presented in both numerical and graphical formats. Outcomes for overall survival and disease-free survival and the improvement seen in clinical trials, are reasonably modeled by Adjuvant!, although an ideal validation for all patient subsets with all treatment options is not possible. Additional speculative estimates of years of remaining life expectancy and long-term survival curves can also be produced. Help files supply general information about breast cancer. The program's Internet links supply national treatment guidelines, cooperative group trial options, and other related information. The computer program Adjuvant! can play practical and educational roles in clinical settings.

Keywords: Actuarial Analysis; Breast Neoplasms, mortality/therapy; Chemotherapy, Adjuvant; Decision Making; Female; Humans; Prognosis; Software; Survival Analysis
[Ramaswamy2001Multiclass] S. Ramaswamy, P. Tamayo, R. Rifkin, S. Mukherjee, C.H. Yeang, M. Angelo, C. Ladd, M. Reich, E. Latulippe, J.P. Mesirov, T. Poggio, W. Gerald, M. Loda, E.S. Lander, and T.R. Golub. Multiclass cancer diagnosis using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA, 98(26):15149-15154, Dec 2001. [ bib | DOI | http | .pdf ]
The optimal treatment of patients with cancer depends on establishing accurate diagnoses by using a complex combination of clinical and histopathological data. In some instances, this task is difficult or impossible because of atypical clinical presentation or histopathology. To determine whether the diagnosis of multiple common adult malignancies could be achieved purely by molecular classification, we subjected 218 tumor samples, spanning 14 common tumor types, and 90 normal tissue samples to oligonucleotide microarray gene expression analysis. The expression levels of 16,063 genes and expressed sequence tags were used to evaluate the accuracy of a multiclass classifier based on a support vector machine algorithm. Overall classification accuracy was 78%. Poorly differentiated cancers resulted in low-confidence predictions and could not be accurately classified according to their tissue of origin, indicating that they are molecularly distinct entities with dramatically different gene expression patterns compared with their well differentiated counterparts. Taken together, these results demonstrate the feasibility of accurate, multiclass molecular cancer classification and suggest a strategy for future clinical implementation of molecular cancer diagnostics.

Keywords: biosvm microarray
[Rain2001protein-protein] J.-C. Rain, L. Selig, H. De Reuse, V. Battaglia, C. Reverdy, S. Simon, G. Lenzen, F. Petel, J. Wojcik, V. Schächter, Y. Chemama, A. Labigne, and P. Legrain. The protein-protein interaction map of Helicobacter pylori. Nature, 409:211-215, 2001. [ bib | http | .pdf ]
[Qian2001Protein] J. Qian, N. M. Luscombe, and M. Gerstein. Protein Fold and Family Occurrence in Genomes: Power-Law Behaviour and Evolutionary Model. J. Mol. Biol., 313:673-681, 2001. [ bib | http | .pdf ]
[Puig2001tandem] O. Puig, F. Caspary, G. Rigaut, B. Rutz, E. Bouveret, E. Bragado-Nilsson, M. Wilm, and B. Séraphin. The tandem affinity purification (TAP) method: a general procedure of protein complex purification. Methods, 24(3):218-229, Jul 2001. [ bib | DOI | http ]
Identification of components present in biological complexes requires their purification to near homogeneity. Methods of purification vary from protein to protein, making it impossible to design a general purification strategy valid for all cases. We have developed the tandem affinity purification (TAP) method as a tool that allows rapid purification under native conditions of complexes, even when expressed at their natural level. Prior knowledge of complex composition or function is not required. The TAP method requires fusion of the TAP tag, either N- or C-terminally, to the target protein of interest. Starting from a relatively small number of cells, active macromolecular complexes can be isolated and used for multiple applications. Variations of the method to specifically purify complexes containing two given components or to subtract undesired complexes can easily be implemented. The TAP method was initially developed in yeast but can be successfully adapted to various organisms. Its simplicity, high yield, and wide applicability make the TAP method a very useful procedure for protein purification and proteome exploration.

Keywords: Bacterial Proteins; Blotting, Western; DNA, Bacterial; Fungal Proteins; Genetic Vectors; Methods; Mutation; Polymerase Chain Reaction; Proteins; Proteome; Ribonucleases; Ribonucleoproteins; Saccharomyces cerevisiae; Saccharomyces cerevisiae Proteins; Staphylococcus aureus
[Popat] K. Popat, D. H. Greene, J. K. Romberg, and D. S. Bloomberg. Adding linguistic constraints to document image decoding: Comparing the iterated complete path and stack algorithms, 2001. [ bib ]
[Podani2001Comparable] J. Podani, Z.N. Oltvai, H. Jeong, B. Tombor, A.-L. Barabási, and E. Szathmáry. Comparable system-level organization of Archaea and Eukaryotes. Nat. Genet., 29:54-56, 2001. [ bib | http | .pdf ]
[Pilpel2001Identifying] Y. Pilpel, P. Sudarsanam, and G. M. Church. Identifying regulatory networks by combinatorial analysis of promoter elements. Nat. Genet., 29:153-159, 2001. [ bib | http | .pdf ]
[Pazos2001Similarity] F. Pazos and A. Valencia. Similarity of phylogenetic trees as indicator of protein-protein interaction. Protein Eng., 14(9):609-614, 2001. [ bib | http | .pdf ]
[Pavlidis2001Gene] P. Pavlidis, J. Weston, J. Cai, and W.N. Grundy. Gene functional classification from heterogeneous data. In Proceedings of the Fifth Annual International Conference on Computational Biology, pages 249-255, 2001. [ bib | .pdf | .pdf ]
Keywords: biosvm
[Pavlidis2001Promoter] P. Pavlidis, T. S. Furey, M. Liberto, D. Haussler, and W. N. Grundy. Promoter Region-Based Classification of Genes. In Pacific Symposium on Biocomputing, pages 139-150, 2001. [ bib | .pdf | .pdf ]
Keywords: biosvm
[BLEU] K. Papineni, S. Roukos, T. Ward, and W. J. Zhu. BLEU: a Method for Automatic Evaluation of Machine Translation. IBM Research Report, RC22176, 2001. [ bib ]
[Ng2001On] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14, pages 849-856. MIT Press, 2001. [ bib ]
[Newman2001Random] M. E. J. Newman, S. H. Strogatz, and D. J. Watts. Random graphs with arbitrary degree distributions and their applications. Phys. Rev. E, 64:026118, 2001. [ bib | http | .pdf ]
[Nakayama2001Role] J. Nakayama, J. C. Rice, B. D. Strahl, C. D. Allis, and S. I. Grewal. Role of histone H3 lysine 9 methylation in epigenetic control of heterochromatin assembly. Science, 292(5514):110-113, Apr 2001. [ bib | DOI | http ]
The assembly of higher order chromatin structures has been linked to the covalent modifications of histone tails. We provide in vivo evidence that lysine 9 of histone H3 (H3 Lys9) is preferentially methylated by the Clr4 protein at heterochromatin-associated regions in fission yeast. Both the conserved chromo- and SET domains of Clr4 are required for H3 Lys9 methylation in vivo. Localization of Swi6, a homolog of Drosophila HP1, to heterochromatic regions is dependent on H3 Lys9 methylation. Moreover, an H3-specific deacetylase Clr3 and a beta-propeller domain protein Rik1 are required for H3 Lys9 methylation by Clr4 and Swi6 localization. These data define a conserved pathway wherein sequential histone modifications establish a "histone code" essential for the epigenetic inheritance of heterochromatin assembly.

Keywords: Acetylation; Cell Cycle Proteins, chemistry/genetics/metabolism; Centromere, metabolism; Chromosomes, Fungal, metabolism; Fungal Proteins, genetics/metabolism; Gene Silencing; Genes, Fungal; Heterochromatin, metabolism; Histone Deacetylases, genetics/metabolism; Histone-Lysine N-Methyltransferase; Histones, chemistry/metabolism; Lysine, metabolism; Methylation; Methyltransferases, chemistry/genetics/metabolism; Mutation; Protein Methyltransferases; Protein Structure, Tertiary; Recombinant Proteins, chemistry/metabolism; Saccharomyces cerevisiae Proteins; Schizosaccharomyces pombe Proteins; Schizosaccharomyces, genetics/metabolism; Transcription Factors, metabolism
[Nakaya2001Extraction] A. Nakaya, S. Goto, and M. Kanehisa. Extraction of correlated gene clusters by multiple graph comparison. In Genome Informatics 2001, pages 44-53. Universal Academy Press, Tokyo, Japan, 2001. [ bib | .html | .pdf ]
[Model2001Feature] F. Model, P. Adorjan, A. Olek, and C. Piepenbrock. Feature selection for DNA methylation based cancer classification. Bioinformatics, 17(Supp. 1):S157-S164, 2001. [ bib | http | .pdf ]
Molecular portraits, such as mRNA expression or DNA methylation patterns, have been shown to be strongly correlated with phenotypical parameters. These molecular patterns can be revealed routinely on a genomic scale. However, class prediction based on these patterns is an under-determined problem, due to the extremely high dimensionality of the data compared to the usually small number of available samples. This makes a reduction of the data dimensionality necessary. Here we demonstrate how phenotypic classes can be predicted by combining feature selection and discriminant analysis. By comparing several feature selection methods we show that the right dimension reduction strategy is of crucial importance for the classification performance. The techniques are demonstrated by methylation pattern based discrimination between acute lymphoblastic leukemia and acute myeloid leukemia. Contact: Fabian.Model@epigenomics.com

Keywords: biosvm
[Miwakeichi2001comparison] F. Miwakeichi, R. Ramirez-Padron, P. A. Valdes-Sosa, and T. Ozaki. A comparison of non-linear non-parametric models for epilepsy data. Comput. Biol. Med., 31(1):41-57, Jan 2001. [ bib ]
EEG spike and wave (SW) activity has been described through a non-parametric stochastic model estimated by the Nadaraya-Watson (NW) method. In this paper the performance of the NW, the local linear polynomial regression and support vector machines (SVM) methods were compared. The noise-free realizations obtained by the NW and SVM methods reproduced SW better than reported in previous work. The tuning parameters had to be estimated manually. When dynamical noise was added, only the NW method was capable of generating SW similar to the training data. The standard deviation of the dynamical noise was estimated by means of the correlation dimension.

Keywords: Acute, Acute Disease, Adenocarcinoma, Algorithms, Amino Acid Sequence, Animals, Artificial Intelligence, Automated, B-Lymphocytes, Bacterial Proteins, Base Pair Mismatch, Base Sequence, Bayes Theorem, Binding Sites, Biological, Bone Marrow Cells, Brachyura, Cell Compartmentation, Chemistry, Child, Chromosome Aberrations, Classification, Codon, Colonic Neoplasms, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA, Data Interpretation, Databases, Decision Trees, Diabetes Mellitus, Diagnosis, Discriminant Analysis, Discrimination Learning, Electric Conductivity, Electroencephalography, Electrophysiology, Epilepsy, Escherichia coli Proteins, Factual, Feedback, Female, Fungal, Gastric Emptying, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Genetic Predisposition to Disease, Genomics, Hemolysins, Humans, Indians, Information Storage and Retrieval, Initiator, Ion Channels, Kinetics, Leukemia, Likelihood Functions, Linear Models, Lipid Bilayers, Logistic Models, Lymphocytic, MEDLINE, Male, Markov Chains, Melanoma, Models, Molecular, Myeloid, Neoplasm, Neoplasms, Neoplastic, Neural Networks (Computer), Neurological, Nevus, Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Normal Distribution, North American, Nucleic Acid Conformation, Oligonucleotide Array Sequence Analysis, Organ Specificity, Organelles, Ovarian Neoplasms, Ovary, P.H.S., Pattern Recognition, Physical, Pigmented, Predictive Value of Tests, Promoter Regions (Genetics), Protein Biosynthesis, Protein Folding, Protein Structure, Proteins, Proteome, RNA, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Secondary, Sensitivity and Specificity, Sequence Alignment, Sequence Analysis, Sex Characteristics, Skin Diseases, Skin Neoplasms, Skin Pigmentation, Software, Sound Spectrography, Statistical, Stochastic Processes, Stomach Diseases, T-Lymphocytes, Thermodynamics, Transcription, Transcription Factors, Tumor Markers, Type 2, U.S. Gov't, Vertebrates, 11058693
[Mercier2001Transcriptional] G. Mercier, Y. Denis, P. Marc, L. Picard, and M. Dutreix. Transcriptional induction of repair genes during slowing of replication in irradiated Saccharomyces cerevisiae. Mutat. Res., 487(3-4):157-172, Dec 2001. [ bib ]
We investigated the inhibition of cell-cycle progression and replication and the induction of the transcriptional response in diploid budding yeast populations exposed to two different doses of gamma-rays resulting in 15 and 85% survival respectively. We studied the kinetics of the cellular response to ionizing treatment during the period required for all of the surviving cells to achieve at least one cell division. The length of these periods increased with the dose. Irradiated populations arrested as large-budded cells containing partially replicated chromosomes. The extent of the S-phase was proportional to the amount of damage and lasted 3 or 7 h depending on the irradiation dose. In parallel to the division study, we carried out a kinetic analysis of the expression of 126 selected genes by use of dedicated microarrays. About 26 genes were induced by irradiation and displayed various patterns of expression. Interestingly, 10 repair genes (RAD51, RAD54, CDC8, MSH2, RFA2, RFA3, UBC5, SRS2, SPO12 and TOP1), involved in recombination and DNA synthesis, display similar regulation of expression in the two irradiated populations. Their patterns of expression were confirmed by Northern analysis. At the two doses, the expression of this group of genes closely followed the extended replication period, and their expression resumed when replication restarted. These results suggest that the damage-induced response and DNA synthesis are closely regulated during repair. The analysis of the promoter regions indicates a high occurrence of the three MCB, HAP and UASH regulatory boxes in the promoters of this group of genes. The association of the three boxes could confer an irradiation-replication specific regulation.

[Menard2001Applied] S. Menard. Applied logistic regression analysis, volume 106. Sage Publications, Incorporated, 2001. [ bib ]
[Manly2001impact] C. Manly, S. Louise-May, and J. Hammer. The impact of informatics and computational chemistry on synthesis and screening. Drug Discov. Today, 6(21):1101-1110, Nov 2001. [ bib ]
High-throughput synthesis and screening technologies have enhanced the impact of computational chemistry on the drug discovery process. From the design of targeted, drug-like libraries to 'virtual' optimization of potency, selectivity and ADME/Tox properties, computational chemists are able to efficiently manage costly resources and dramatically shorten drug discovery cycle times. This review will describe some of the successful strategies and applications of state-of-the-art algorithms to enhance drug discovery, as well as key points in the drug discovery process where computational methods can have, and have had, greatest impact.

Keywords: chemoinformatics
[Manevitz2001One-Class] L. M. Manevitz and M. Yousef. One-class SVMs for document classification. J. Mach. Learn. Res., 2:139-154, 2001. [ bib | http | .pdf ]
[Ma2001Hormone-dependent] H. Ma, C. T. Baumann, H. Li, B. D. Strahl, R. Rice, M. A. Jelinek, D. W. Aswad, C. D. Allis, G. L. Hager, and M. R. Stallcup. Hormone-dependent, CARM1-directed, arginine-specific methylation of histone H3 on a steroid-regulated promoter. Curr Biol, 11(24):1981-1985, Dec 2001. [ bib ]
Activation of gene transcription involves chromatin remodeling by coactivator proteins that are recruited by DNA-bound transcription factors. Local modification of chromatin structure at specific gene promoters by ATP-dependent processes and by posttranslational modifications of histone N-terminal tails provides access to RNA polymerase II and its accompanying transcription initiation complex. While the roles of lysine acetylation, serine phosphorylation, and lysine methylation of histones in chromatin remodeling are beginning to emerge, low levels of arginine methylation of histones have only recently been documented, and its physiological role is unknown. The coactivator CARM1 methylates histone H3 at Arg17 and Arg26 in vitro and cooperates synergistically with p160-type coactivators (e.g., GRIP1, SRC-1, ACTR) and coactivators with histone acetyltransferase activity (e.g., p300, CBP) to enhance gene activation by steroid and nuclear hormone receptors (NR) in transient transfection assays. In the current study, CARM1 cooperated with GRIP1 to enhance steroid hormone-dependent activation of stably integrated mouse mammary tumor virus (MMTV) promoters, and this coactivator function required the methyltransferase activity of CARM1. Chromatin immunoprecipitation assays and immunofluorescence studies indicated that CARM1 and the CARM1-methylated form of histone H3 specifically associated with a large tandem array of MMTV promoters in a hormone-dependent manner. Thus, arginine-specific histone methylation by CARM1 is an important part of the transcriptional activation process.

Keywords: Acetylation; Arginine, metabolism; Fluorescent Antibody Technique; Histones, chemistry/metabolism; Hormones, physiology; Lysine, metabolism; Mammary Tumor Virus, Mouse, genetics; Methylation; Phosphorylation; Precipitin Tests; Promoter Regions, Genetic; Protein-Arginine N-Methyltransferases, physiology; Serine, metabolism; Steroids, physiology
[Lipinski2001Experimental] C. A. Lipinski, F. Lombardo, B. W. Dominy, and P. J. Feeney. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Deliv. Rev., 46(1-3):3-26, Mar 2001. [ bib ]
Experimental and computational approaches to estimate solubility and permeability in discovery and development settings are described. In the discovery setting 'the rule of 5' predicts that poor absorption or permeation is more likely when there are more than 5 H-bond donors, 10 H-bond acceptors, the molecular weight (MWT) is greater than 500 and the calculated Log P (CLogP) is greater than 5 (or MLogP > 4.15). Computational methodology for the rule-based Moriguchi Log P (MLogP) calculation is described. Turbidimetric solubility measurement is described and applied to known drugs. High throughput screening (HTS) leads tend to have higher MWT and Log P and lower turbidimetric solubility than leads in the pre-HTS era. In the development setting, solubility calculations focus on exact value prediction and are difficult because of polymorphism. Recent work on linear free energy relationships and Log P approaches is critically reviewed. Useful predictions are possible in closely related analog series when coupled with experimental thermodynamic solubility measurements.

Keywords: chemoinformatics
[Leung2001Representing] T. Leung and J. Malik. Representing and recognizing the visual appearance of materials using three-dimensional textons. Int. J. Comput. Vision, 43(1):29-44, 2001. [ bib | DOI | http | .pdf ]
We study the recognition of surfaces made from different materials such as concrete, rug, marble, or leather on the basis of their textural appearance. Such natural textures arise from spatial variation of two surface attributes: (1) reflectance and (2) surface normal. In this paper, we provide a unified model to address both these aspects of natural texture. The main idea is to construct a vocabulary of prototype tiny surface patches with associated local geometric and photometric properties. We call these 3D textons. Examples might be ridges, grooves, spots or stripes or combinations thereof. Associated with each texton is an appearance vector, which characterizes the local irradiance distribution, represented as a set of linear Gaussian derivative filter outputs, under different lighting and viewing conditions. Given a large collection of images of different materials, a clustering approach is used to acquire a small (on the order of 100) 3D texton vocabulary. Given a few (1 to 4) images of any material, it can be characterized using these textons. We demonstrate the application of this representation for recognition of the material viewed under novel lighting and viewing conditions. We also illustrate how the 3D texton model can be used to predict the appearance of materials under novel conditions.

[Lenovere2001MELTING] N. Le Novère. MELTING, computing the melting temperature of nucleic acid duplex. Bioinformatics, 17(12):1226-1227, Dec 2001. [ bib ]
MELTING computes the enthalpy and entropy of an oligonucleotide duplex helix-coil transition, and then its melting temperature. The program uses the method of nearest-neighbours. The set of thermodynamic parameters can be easily customized. The program provides several correction methods for the concentration of salt. MELTING is a free program, available at no cost and open-source. Perl scripts are provided to show how MELTING can be used to construct more ambitious programs. AVAILABILITY: MELTING is available for several platforms (http://www.pasteur.fr/recherche/unites/neubiomol/meltinghome.html) and is accessible via a www server (http://bioweb.pasteur.fr/seqanal/interfaces/melting.html). CONTACT: nl223@cus.cam.ac.uk

[Lapinsh2001Development] M. Lapinsh, P. Prusis, A. Gutcaits, T. Lundstedt, and J. E. S. Wikberg. Development of proteo-chemometrics: A novel technology of use for analysis of drug-receptor interactions. Biochim. Biophys. Acta, 1525:180-190, 2001. [ bib ]
[Langford2001property] E. Langford, N. Schwertman, and M. Owens. Is the property of being positively correlated transitive? The American Statistician, 55(4):322-325, 2001. [ bib ]
[Lafferty2001Conditional] J. Lafferty, A. McCallum, and F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. 18th International Conf. on Machine Learning, pages 282-289. Morgan Kaufmann, San Francisco, CA, 2001. [ bib | .html | .pdf ]
Keywords: conditional-random-field
[Kuhn2001Global] K. M. Kuhn, J. L. DeRisi, P. O. Brown, and P. Sarnow. Global and specific translational regulation in the genomic response of Saccharomyces cerevisiae to a rapid transfer from a fermentable to a nonfermentable carbon source. Mol. Cell. Biol., 21(3):916-927, 2001. [ bib | http | .pdf ]
[Kramer2001Feature] S. Kramer and L. De Raedt. Feature Construction with Version Spaces for Biochemical Applications. In C.E. Brodley and A. Pohoreckyj Danyluk, editors, Proceedings of the Eighteenth International Conference on Machine Learning, pages 258-265. Morgan Kaufmann, 2001. [ bib ]
[Kitano2001Foundations] H. Kitano. Foundations of Systems Biology. MIT Press, 2001. [ bib ]
[Kim2001Evolving] J. Kim, P.L. Krapivsky, B. Kahng, and S. Redner. Evolving protein interaction networks. E-print cond-mat/0203167, 2001. [ bib | http | .pdf ]
[Kerr2001Experimental] M. K. Kerr and G. A. Churchill. Experimental design for gene expression microarrays. Biostatistics, 2(2):183-201, Jun 2001. [ bib | DOI | http ]
We examine experimental design issues arising with gene expression microarray technology. Microarray experiments have multiple sources of variation, and experimental plans should ensure that effects of interest are not confounded with ancillary effects. A commonly used design is shown to violate this principle and to be generally inefficient. We explore the connection between microarray designs and classical block design and use a family of ANOVA models as a guide to choosing a design. We combine principles of good design and A-optimality to give a general set of recommendations for design with microarrays. These recommendations are illustrated in detail for one kind of experimental objective, where we also give the results of a computer search for good designs.

[Kato2001Operator] T. Kato. Operator dynamics in molecular biology. Technical report, I.H.E.S., 2001. Technical report IHES/M/01/41. [ bib | .html | .pdf ]
[Kanehisa2001Prediction] M. Kanehisa. Prediction of higher order functional networks from genomic data. Pharmacogenomics, 2(4):373-385, 2001. [ bib | DOI | http ]
[Kallioniemi2001Tissue] O. P. Kallioniemi, U. Wagner, J. Kononen, and G. Sauter. Tissue microarray technology for high-throughput molecular profiling of cancer. Hum Mol Genet, 10(7):657-662, Apr 2001. [ bib ]
Tissue microarray (TMA) technology allows rapid visualization of molecular targets in thousands of tissue specimens at a time, either at the DNA, RNA or protein level. The technique facilitates rapid translation of molecular discoveries to clinical applications. By revealing the cellular localization, prevalence and clinical significance of candidate genes, TMAs are ideally suitable for genomics-based diagnostic and drug target discovery. TMAs have a number of advantages compared with conventional techniques. The speed of molecular analyses is increased by more than 100-fold, precious tissues are not destroyed and a very large number of molecular targets can be analyzed from consecutive TMA sections. The ability to study archival tissue specimens is an important advantage as such specimens are usually not applicable in other high-throughput genomic and proteomic surveys. Construction and analysis of TMAs can be automated, increasing the throughput even further. Most of the applications of the TMA technology have come from the field of cancer research. Examples include analysis of the frequency of molecular alterations in large tumor materials, exploration of tumor progression, identification of predictive or prognostic factors and validation of newly discovered genes as diagnostic and therapeutic targets.

Keywords: Animals; Genetic Techniques; Humans; In Situ Hybridization, methods; Neoplasms, metabolism/pathology; Oligonucleotide Array Sequence Analysis; Tissue Distribution
[Jeong2001Lethality] H. Jeong, S. P. Mason, A.-L. Barabási, and Z. N. Oltvai. Lethality and centrality in protein networks. Nature, 411:41-42, 2001. [ bib | http | .pdf ]
[Ito2001comprehensive] T. Ito, T. Chiba, R. Ozawa, M. Yoshida, M. Hattori, and Y. Sakaki. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. USA, 98(8):4569-4574, 2001. [ bib | http | .pdf ]
[Consortium2001Initial] International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature, 409(6822):860-921, Feb 2001. [ bib | DOI | http | .pdf ]
The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.

Keywords: genomics bio
[Consortium2001Physical] International Human Genome Mapping Consortium. A physical map of the human genome. Nature, 409, 2001. [ bib ]
[Hua2001Support] S. Hua and Z. Sun. Support vector machine approach for protein subcellular localization prediction. Bioinformatics, 17(8):721-728, 2001. [ bib | http | .pdf ]
Motivation: Subcellular localization is a key functional characteristic of proteins. A fully automatic and reliable prediction system for protein subcellular localization is needed, especially for the analysis of large-scale genome sequences. Results: In this paper, Support Vector Machine has been introduced to predict the subcellular localization of proteins from their amino acid compositions. The total prediction accuracies reach 91.4% in prokaryotic organisms and 79.4% in eukaryotic organisms. Predictions by our approach are robust to errors in the protein N-terminal sequences. This new approach provides superior prediction performance compared with existing algorithms based on amino acid composition and can be a complementary method to other existing methods based on sorting signals. Availability: A web server implementing the prediction method is available at http://www.bioinfo.tsinghua.edu.cn/SubLoc/. Contact: sunzhr@mail.tsinghua.edu.cn; huasj00@mails.tsinghua.edu.cn Supplementary information: Supplementary material is available at http://www.bioinfo.tsinghua.edu.cn/SubLoc

Keywords: biosvm
[Helden2001Application] J. van Helden, D. Gilbert, L. Wernisch, M. Schroeder, and S. J. Wodak. Application of regulatory sequence analysis and metabolic network analysis to the interpretation of gene expression data. In JOBIM '00: Selected papers from the First International Conference on Computational Biology, Biology, Informatics, and Mathematics, pages 147-164, London, UK, 2001. Springer-Verlag. [ bib ]
[Hastie2001elements] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning: data mining, inference, and prediction. Springer, 2001. [ bib ]
[Goektuerk2001statistical] S. B. Göktürk, C. Tomasi, B. Acar, C. F. Beaulieu, D. S. Paik, R. B. Jeffrey, J. Yee, and S. Napel. A statistical 3-D pattern processing method for computer-aided detection of polyps in CT colonography. IEEE Trans Med Imaging, 20(12):1251-60, Dec 2001. [ bib ]
Adenomatous polyps in the colon are believed to be the precursor to colorectal carcinoma, the second leading cause of cancer deaths in the United States. In this paper, we propose a new method for computer-aided detection of polyps in computed tomography (CT) colonography (virtual colonoscopy), a technique in which polyps are imaged along the wall of the air-inflated, cleansed colon with X-ray CT. Initial work with computer-aided detection has shown high sensitivity, but at a cost of too many false positives. We present a statistical approach that uses support vector machines to distinguish the differentiating characteristics of polyps and healthy tissue, and uses this information for the classification of the new cases. One of the main contributions of the paper is the new three-dimensional pattern processing approach, called random orthogonal shape sections method, which combines the information from many random images to generate reliable signatures of shape. The input to the proposed system is a collection of volume data from candidate polyps obtained by a high-sensitivity, low-specificity system that we developed previously. The results of our ten-fold cross-validation experiments show that, on average, the system increases the specificity from 0.19 (0.35) to 0.69 (0.74) at a sensitivity level of 1.0 (0.95).

[Germann01Fast] U. Germann, M. Jahr, K. Knight, and D. Marcu. Fast decoding and optimal decoding for machine translation. In Proceedings of ACL 39, pages 228-235, 2001. [ bib ]
[Gasch2001Genomic] A.P. Gasch, M. Huang, S. Metzner, D. Botstein, S.J. Elledge, and P.O. Brown. Genomic expression responses to DNA-damaging agents and the regulatory role of the yeast ATR homolog Mec1p. Mol. Biol. Cell, 12(10):2987-3003, 2001. [ bib | http | .pdf ]
[Garber2001Diversity] M. E. Garber, O. G. Troyanskaya, K. Schluens, S. Petersen, Z. Thaesler, M. Pacyna-Gengelbach, M. van de Rijn, G. D. Rosen, C. M. Perou, R. I. Whyte, R. B. Altman, P. O. Brown, D. Botstein, and I. Petersen. Diversity of gene expression in adenocarcinoma of the lung. Proc Natl Acad Sci U S A, 98(24):13784-13789, Nov 2001. [ bib | DOI | http | .pdf ]
The global gene expression profiles for 67 human lung tumors representing 56 patients were examined by using 24,000-element cDNA microarrays. Subdivision of the tumors based on gene expression patterns faithfully recapitulated morphological classification of the tumors into squamous, large cell, small cell, and adenocarcinoma. The gene expression patterns made possible the subclassification of adenocarcinoma into subgroups that correlated with the degree of tumor differentiation as well as patient survival. Gene expression analysis thus promises to extend and refine standard pathologic analysis.

[Fine2001Efficient] S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. J. Mach. Learn. Res., 2:243-264, 2001. [ bib ]
[Fazel2001rank] M. Fazel, H. Hindi, and S. Boyd. A rank minimization heuristic with application to minimum order system approximation. In Proceedings of the 2001 American Control Conference, volume 6, pages 4734-4739, 2001. [ bib | DOI ]
[Fan2001Stock] A. Fan and M. Palaniswami. Stock selection using support vector machines. In Proc. Int. Joint Conf. Neural Networks IJCNN '01, volume 3, pages 1793-1798, 2001. [ bib | DOI | .pdf ]
We used the support vector machines (SVM) in a classification approach to `beat the market'. Given the fundamental accounting and price information of stocks trading on the Australian Stock Exchange, we attempt to use SVM to identify stocks that are likely to outperform the market by having exceptional returns. The equally weighted portfolio formed by the stocks selected by SVM has a total return of 208% over a five years period, significantly outperformed the benchmark of 71%. [...] whereby the output of SVM is interpreted as a probability measure and ranked, such that the stocks selected can be fixed to the top 25%.

[Elbashir2001RNA] S. M. Elbashir, W. Lendeckel, and T. Tuschl. RNA interference is mediated by 21- and 22-nucleotide RNAs. Genes Dev., 15(2):188-200, Jan 2001. [ bib ]
Double-stranded RNA (dsRNA) induces sequence-specific posttranscriptional gene silencing in many organisms by a process known as RNA interference (RNAi). Using a Drosophila in vitro system, we demonstrate that 21- and 22-nt RNA fragments are the sequence-specific mediators of RNAi. The short interfering RNAs (siRNAs) are generated by an RNase III-like processing reaction from long dsRNA. Chemically synthesized siRNA duplexes with overhanging 3' ends mediate efficient target RNA cleavage in the lysate, and the cleavage site is located near the center of the region spanned by the guiding siRNA. Furthermore, we provide evidence that the direction of dsRNA processing determines whether sense or antisense target RNA can be cleaved by the siRNA-protein complex.

Keywords: sirna
[Duda2001Pattern] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience, 2001. [ bib ]
[Dreiseitl2001comparison] S. Dreiseitl, L. Ohno-Machado, H. Kittler, S. Vinterbo, H. Billhardt, and M. Binder. A comparison of machine learning methods for the diagnosis of pigmented skin lesions. J Biomed Inform, 34(1):28-36, Feb 2001. [ bib | DOI | http | .pdf ]
We analyze the discriminatory power of k-nearest neighbors, logistic regression, artificial neural networks (ANNs), decision trees, and support vector machines (SVMs) on the task of classifying pigmented skin lesions as common nevi, dysplastic nevi, or melanoma. Three different classification tasks were used as benchmarks: the dichotomous problem of distinguishing common nevi from dysplastic nevi and melanoma, the dichotomous problem of distinguishing melanoma from common and dysplastic nevi, and the trichotomous problem of correctly distinguishing all three classes. Using ROC analysis to measure the discriminatory power of the methods shows that excellent results for specific classification problems in the domain of pigmented skin lesions can be achieved with machine-learning methods. On both dichotomous and trichotomous tasks, logistic regression, ANNs, and SVMs performed on about the same level, with k-nearest neighbors and decision trees performing worse.

Keywords: Algorithms, Amino Acid Sequence, Artificial Intelligence, Biological, Cell Compartmentation, Comparative Study, Computer Simulation, Computer-Assisted, Decision Trees, Diagnosis, Discriminant Analysis, Humans, Logistic Models, Melanoma, Models, Neural Networks (Computer), Nevus, Non-U.S. Gov't, Organelles, P.H.S., Pigmented, Predictive Value of Tests, Proteins, Reproducibility of Results, Research Support, Skin Diseases, Skin Neoplasms, Skin Pigmentation, U.S. Gov't, 11376540
[Ding2001Multi-class] C.H.Q. Ding and I. Dubchak. Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics, 17:349-358, 2001. [ bib | .pdf | .pdf ]
Motivation: Protein fold recognition is an important approach to structure discovery without relying on sequence similarity. We study this approach with new multi-class classification methods and examined many issues important for a practical recognition system. Results: Most current discriminative methods for protein fold prediction use the one-against-others method, which has the well-known "False Positives" problem. We investigated two new methods: the unique one-against-others and the all-against-all methods. Both improve prediction accuracy by 14-110% [...] SCOP folds. We used the Support Vector Machine (SVM) and the Neural Network (NN) learning methods as base classifiers. SVM converges fast and leads to high accuracy. When scores of multiple parameter datasets are combined, majority voting reduces noise and increases recognition accuracy. We examined many issues involved with large number of classes, including dependencies of prediction accuracy on the number of folds and on the number of representatives in a fold. Overall, recognition systems achieve 56% [...] accuracy on a protein test dataset, where most of the proteins have below 25% [...] information: The protein parameter datasets used in this paper are available online (http://www.nersc.gov/ cding/protein).

Keywords: biosvm
[Deng2001Unsupervised] Y. Deng and B. S. Manjunath. Unsupervised segmentation of color-texture regions in images and video. IEEE Trans. Pattern Anal. Mach. Intell., 23(8):800-810, Aug 2001. [ bib | DOI | http | .pdf ]
A method for unsupervised segmentation of color-texture regions in images and video is presented. This method, which we refer to as JSEG, consists of two independent steps: color quantization and spatial segmentation. In the first step, colors in the image are quantized to several representative classes that can be used to differentiate regions in the image. The image pixels are then replaced by their corresponding color class labels, thus forming a class-map of the image. The focus of this work is on spatial segmentation, where a criterion for "good" segmentation using the class-map is proposed. Applying the criterion to local windows in the class-map results in the "J-image," in which high and low values correspond to possible boundaries and interiors of color-texture regions. A region growing method is then used to segment the image based on the multiscale J-images. A similar approach is applied to video sequences. An additional region tracking scheme is embedded into the region growing process to achieve consistent segmentation and tracking results, even for scenes with nonrigid object motion. Experiments show the robustness of the JSEG algorithm on real images and video.

[Davies2001Discrete] E. B. Davies, G. M. L. Gladwell, J. Leydold, and P. F. Stadler. Discrete Nodal Domain Theorems. Lin. Alg. Appl., 336:51-60, 2001. [ bib | .html | .pdf ]
[Cordella01improved] L. P. Cordella, P. Foggia, C. Sansone, and M. Vento. An improved algorithm for matching large graphs. In 3rd IAPR-TC15 Workshop on Graph-based Representations in Pattern Recognition, Cuen, pages 149-159, 2001. [ bib ]
[Cooper2001GlycoSuiteDB] C. Cooper, M. Harrison, M. Wilkins, and N. Packer. GlycoSuiteDB: a new curated relational database of glycoprotein glycan structures and their biological sources. Nucleic Acids Res., 29:332-335, 2001. [ bib | http | .pdf ]
[Collins2001Convolution] M. Collins and N. Duffy. Convolution Kernels for Natural Language. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Adv. Neural. Inform. Process Syst., volume 14, pages 625-632. MIT Press, 2001. [ bib ]
[Cole2001Polychemotherapy] B. F. Cole, R. D. Gelber, S. Gelber, A. S. Coates, and A. Goldhirsch. Polychemotherapy for early breast cancer: an overview of the randomised clinical trials with quality-adjusted survival analysis. Lancet, 358(9278):277-286, Jul 2001. [ bib | DOI | http ]
Keywords: Aged; Antineoplastic Agents, therapeutic use; Breast Neoplasms, drug therapy/mortality; Chemotherapy, Adjuvant; Drug Therapy, Combination; Female; Follow-Up Studies; Humans; Middle Aged; Prognosis; Quality-Adjusted Life Years; Randomized Controlled Trials as Topic; Survival Analysis; Tamoxifen, therapeutic use
[Chow2001Identifying] M. L. Chow, E. J. Moler, and I. S. Mian. Identifying marker genes in transcription profiling data using a mixture of feature relevance experts. Physiol. Genomics, 5(2):99-111, Mar 2001. [ bib | http | .pdf ]
Transcription profiling experiments permit the expression levels of many genes to be measured simultaneously. Given profiling data from two types of samples, genes that most distinguish the samples (marker genes) are good candidates for subsequent in-depth experimental studies and developing decision support systems for diagnosis, prognosis, and monitoring. This work proposes a mixture of feature relevance experts as a method for identifying marker genes and illustrates the idea using published data from samples labeled as acute lymphoblastic and myeloid leukemia (ALL, AML). A feature relevance expert implements an algorithm that calculates how well a gene distinguishes samples, reorders genes according to this relevance measure, and uses a supervised learning method [here, support vector machines (SVMs)] to determine the generalization performances of different nested gene subsets. The mixture of three feature relevance experts examined implement two existing and one novel feature relevance measures. For each expert, a gene subset consisting of the top 50 genes distinguished ALL from AML samples as completely as all 7,070 genes. The 125 genes at the union of the top 50s are plausible markers for a prototype decision support system. Chromosomal aberration and other data support the prediction that the three genes at the intersection of the top 50s, cystatin C, azurocidin, and adipsin, are good targets for investigating the basic biology of ALL/AML. The same data were employed to identify markers that distinguish samples based on their labels of T cell/B cell, peripheral blood/bone marrow, and male/female. Selenoprotein W may discriminate T cells from B cells. Results from analysis of transcription profiling data from tumor/nontumor colon adenocarcinoma samples support the general utility of the aforementioned approach. 
Theoretical issues such as choosing SVM kernels and their parameters, training and evaluating feature relevance experts, and the impact of potentially mislabeled samples on marker identification (feature selection) are discussed.

Keywords: biosvm
[Chou2001Using] K.-C. Chou. Using subsite coupling to predict signal peptides. Protein Eng., 14(2):75-79, 2001. [ bib | http | .pdf ]
[Chou2001Prediction] K.-C. Chou. Prediction of protein signal sequences and their cleavage sites. Proteins Struct. Funct. Genet., 42:136-139, 2001. [ bib | http | .pdf ]
[Chiang2001Visualizing] D. Y. Chiang, P. O. Brown, and M. B. Eisen. Visualizing associations between genome sequences and gene expression data using genome-mean expression profiles. Bioinformatics, 17:49S-55S, 2001. [ bib | .pdf | .pdf ]
[Chang2001LIBSVM] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. [ bib ]
[Carter2001computational] R. J. Carter, I. Dubchak, and S. R. Holbrook. A computational approach to identify genes for functional RNAs in genomic sequences. Nucl. Acids Res., 29(19):3928-3938, 2001. [ bib | http | .pdf ]
Currently there is no successful computational approach for identification of genes encoding novel functional RNAs (fRNAs) in genomic sequences. We have developed a machine learning approach using neural networks and support vector machines to extract common features among known RNAs for prediction of new RNA genes in the unannotated regions of prokaryotic and archaeal genomes. The Escherichia coli genome was used for development, but we have applied this method to several other bacterial and archaeal genomes. Networks based on nucleotide composition were 80-90% [...] for bacteria and 90-99% [...] achieved a significant improvement in accuracy by combining these predictions with those obtained using a second set of parameters consisting of known RNA sequence motifs and the calculated free energy of folding. Several known fRNAs not included in the training datasets were identified as well as several hundred predicted novel RNAs. These studies indicate that there are many unidentified RNAs in simple genomes that can be predicted computationally as a precursor to experimental study. Public access to our RNA gene predictions and an interface for user predictions is available via the web.

Keywords: biosvm
[Caplen2001Specific] N. J. Caplen, S. Parrish, F. Imani, A. Fire, and R. A. Morgan. Specific inhibition of gene expression by small double-stranded RNAs in invertebrate and vertebrate systems. Proc. Natl. Acad. Sci. USA, 98(17):9742-9747, Aug 2001. [ bib | DOI | http | .pdf ]
Short interfering RNAs (siRNAs) are double-stranded RNAs of approximately 21-25 nucleotides that have been shown to function as key intermediaries in triggering sequence-specific RNA degradation during posttranscriptional gene silencing in plants and RNA interference in invertebrates. siRNAs have a characteristic structure, with 5'-phosphate/3'-hydroxyl ends and a 2-base 3' overhang on each strand of the duplex. In this study, we present data that synthetic siRNAs can induce gene-specific inhibition of expression in Caenorhabditis elegans and in cell lines from humans and mice. In each case, the interference by siRNAs was superior to the inhibition of gene expression mediated by single-stranded antisense oligonucleotides. The siRNAs seem to avoid the well documented nonspecific effects triggered by longer double-stranded RNAs in mammalian cells. These observations may open a path toward the use of siRNAs as a reverse genetic and therapeutic tool in mammalian cells.

Keywords: sirna
[Cai2001Support] Y.-D. Cai, X.-J. Liu, X.-B. Xu, and G.-P. Zhou. Support Vector Machines for predicting protein structural class. BMC Bioinformatics, 2(3):3, 2001. [ bib | DOI | http | .pdf ]
Background We apply a new machine learning method, the so-called Support Vector Machine method, to predict the protein structural class. Support Vector Machine method is performed based on the database derived from SCOP, in which protein domains are classified based on known structures and the evolutionary relationships and the principles that govern their 3-D structure. Results High rates of both self-consistency and jackknife tests are obtained. The good results indicate that the structural class of a protein is considerably correlated with its amino acid composition. Conclusions It is expected that the Support Vector Machine method and the elegant component-coupled method, also named as the covariant discrimination algorithm, if complemented with each other, can provide a powerful computational tool for predicting the structural classes of proteins.

Keywords: biosvm
[Bussemaker2001Regulatory] H. J. Bussemaker, H. Li, and E. D. Siggia. Regulatory element detection using correlation with expression. Nat. Genet., 27:167-174, 2001. [ bib | http | .pdf ]
[Briggs2001Histone] S. D. Briggs, M. Bryk, B. D. Strahl, W. L. Cheung, J. K. Davie, S. Y. Dent, F. Winston, and C. D. Allis. Histone H3 lysine 4 methylation is mediated by Set1 and required for cell growth and rDNA silencing in Saccharomyces cerevisiae. Genes Dev, 15(24):3286-3295, Dec 2001. [ bib | DOI | http ]
Histone methylation is known to be associated with both transcriptionally active and repressive chromatin states. Recent studies have identified SET domain-containing proteins such as SUV39H1 and Clr4 as mediators of H3 lysine 9 (Lys9) methylation and heterochromatin formation. Interestingly, H3 Lys9 methylation is not observed from bulk histones isolated from asynchronous populations of Saccharomyces cerevisiae or Tetrahymena thermophila. In contrast, H3 lysine 4 (Lys4) methylation is a predominant modification in these smaller eukaryotes. To identify the responsible methyltransferase(s) and to gain insight into the function of H3 Lys4 methylation, we have developed a histone H3 Lys4 methyl-specific antiserum. With this antiserum, we show that deletion of SET1, but not of other putative SET domain-containing genes, in S. cerevisiae, results in the complete abolishment of H3 Lys4 methylation in vivo. Furthermore, loss of H3 Lys4 methylation in a set1 Delta strain can be rescued by SET1. Analysis of histone H3 mutations at Lys4 revealed a slow-growth defect similar to a set1 Delta strain. Chromatin immunoprecipitation assays show that H3 Lys4 methylation is present at the rDNA locus and that Set1-mediated H3 Lys4 methylation is required for repression of RNA polymerase II transcription within rDNA. Taken together, these data suggest that Set1-mediated H3 Lys4 methylation is required for normal cell growth and transcriptional silencing.

Keywords: Animals; Antibody Formation; Blotting, Western; Cell Division; DNA Primers, chemistry; DNA, Bacterial, genetics; DNA, Ribosomal, genetics; DNA-Binding Proteins, metabolism; Fungal Proteins, metabolism; Gene Silencing; Genetic Vectors; Heterochromatin, chemistry/metabolism; Histone-Lysine N-Methyltransferase; Histones, metabolism; Lysine, metabolism; Methylation; Methyltransferases, genetics/metabolism; Mutation; Nucleosomes, chemistry/metabolism; Polymerase Chain Reaction; Precipitin Tests; Protein Methyltransferases; RNA Polymerase III, metabolism; Rabbits; Saccharomyces cerevisiae Proteins; Saccharomyces cerevisiae, genetics; Transcription Factors, metabolism
[Breiman2001Statistical] L. Breiman. Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3):199-231, 2001. [ bib ]
[Breiman2001Random] L. Breiman. Random forests. Mach. Learn., 45(1):5-32, 2001. [ bib | DOI | http | .pdf ]
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, ***, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.

Keywords: PUlearning
[Brazma2001Minimum] A. Brazma, P. Hingamp, J. Quackenbush, G. Sherlock, P. Spellman, C. Stoeckert, J. Aach, W. Ansorge, C. A. Ball, H. C. Causton, T. Gaasterland, P. Glenisson, F. C. Holstege, I. F. Kim, V. Markowitz, J. C. Matese, H. Parkinson, A. Robinson, U. Sarkans, S. Schulze-Kremer, J. Stewart, R. Taylor, J. Vilo, and M. Vingron. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat. Genet., 29(4):365-371, Dec 2001. [ bib | DOI | http ]
Microarray analysis has become a widely used tool for the generation of gene expression data on a genomic scale. Although many significant results have been derived from microarray studies, one limitation has been the lack of standards for presenting and exchanging such data. Here we present a proposal, the Minimum Information About a Microarray Experiment (MIAME), that describes the minimum information required to ensure that microarray data can be easily interpreted and that results derived from its analysis can be independently verified. The ultimate goal of this work is to establish a standard for recording and reporting microarray-based gene expression data, which will in turn facilitate the establishment of databases and public repositories and enable the development of data analysis tools. With respect to MIAME, we concentrate on defining the content and structure of the necessary information rather than the technical format for capturing it.

Keywords: Computational Biology; Gene Expression Profiling, methods; Oligonucleotide Array Sequence Analysis, standards
[Bower2001Computational] J. M. Bower and H. Bolouri. Computational modeling of genetic and biochemical networks. MIT Press, Cambridge, MA, 2001. [ bib ]
[Bostroem2001Reproducing] J. Boström. Reproducing the conformations of protein-bound ligands: a critical evaluation of several popular conformational searching tools. J Comput Aided Mol Des, 15(12):1137-1152, Dec 2001. [ bib ]
Several programs (Catalyst, Confort, Flo99, MacroModel, and Omega) that are commonly used to generate conformational ensembles have been tested for their ability to reproduce bioactive conformations. The ligands from thirty-two different ligand-protein complexes determined by high-resolution (< 2.0 Å) X-ray crystallography have been analyzed. The Low-Mode Conformational Search method (with AMBER* and the GB/SA hydration model), as implemented in MacroModel, was found to perform better than the other algorithms. The rule-based method Omega, which is orders of magnitude faster than the other methods, also gave reasonable results, but these were found to be dependent on the input structure. The methods supporting diverse sampling (Catalyst, Confort) performed least well. For the seven ligands in the set having eight or more rotatable bonds, none of the bioactive conformations were ever found, save for one exception (Flo99). These ligands do not bind in a local minimum conformation according to AMBER*/SA. Taking these last two observations together, it is clear that geometrically similar structures should be collected in order to increase the probability of finding the bioactive conformation among the generated ensembles. Factors influencing bioactive conformational retrieval have been identified and are discussed.

Keywords: Algorithms; Crystallography, X-Ray; Ligands; Models, Molecular; Molecular Conformation; Protein Binding; Quantum Theory; Software
[Bosshard2001Molecular] H. R. Bosshard. Molecular recognition by induced fit: how fit is the concept? News Physiol Sci, 16:171-173, Aug 2001. [ bib ]
Induced fit explains why biomolecules can bind together even if they are not optimized for binding. However, induced fit can lead to a kinetic bottleneck and does not describe every interaction in the absence of prior complementarity. Preselection of a fitting conformer is an alternative to induced fit.

Keywords: Antigen-Antibody Complex, physiology; Biological Products, chemistry/metabolism; Models, Biological; Molecular Conformation
[Bock2001Predicting] J. R. Bock and D. A. Gough. Predicting protein-protein interactions from primary structure. Bioinformatics, 17(5):455-460, 2001. [ bib | .pdf | .pdf ]
Keywords: biosvm
[Blanchard2001Methodes] G. Blanchard. Méthodes de mélange et d'aggregation d'estimateurs en reconnaissance de formes. Applications aux arbres de decision. PhD thesis, University Paris 13, January 2001. [ bib | .ps.gz | .pdf ]
[Birge2001Gaussian] L. Birgé and P. Massart. Gaussian model selection. J. Eur. Math. Soc., 3:203-268, 2001. [ bib ]
[Billerey2001Frequent] C. Billerey, D. Chopin, M. H. Aubriot-Lorton, D. Ricol, S. Gil Diez de Medina, B. Van Rhijn, M. P. Bralet, M. A. Lefrere-Belda, J. B. Lahaye, C. C. Abbou, J. Bonaventure, E. S. Zafrani, T. van der Kwast, J. P. Thiery, and F. Radvanyi. Frequent FGFR3 mutations in papillary non-invasive bladder (pTa) tumors. Am J Pathol., 158:1955-1959, 2001. [ bib ]
We recently identified activating mutations of fibroblast growth factor receptor 3 (FGFR3) in bladder carcinoma. In this study we assessed the incidence of FGFR3 mutations in a series of 132 bladder carcinomas: 20 carcinoma in situ (CIS), 50 pTa, 19 pT1, and 43 pT2-4. All 48 mutations identified were identical to the germinal activating mutations that cause thanatophoric dysplasia, a lethal form of dwarfism. The S249C mutation, found in 33 of the 48 mutated tumors, was the most common. The frequency of mutations was higher in pTa tumors (37 of 50, 74% [...] P < 0.0001) and pT2-4 tumors (7 of 43, 16% [...] were detected in 27 of 32 (84% [...] (7% [...] grade was highly significant (P < 0.0001). FGFR3 is the first gene found to be mutated at a high frequency in pTa tumors. The absence of FGFR3 mutations in CIS and the low frequency of FGFR3 mutations in pT1 and pT2-4 tumors are consistent with the model of bladder tumor progression in which the most common precursor of pT1 and pT2-4 tumors is CIS.

[Bhattacharjee2001Classification] A. Bhattacharjee, W. G. Richards, J. Staunton, C. Li, S. Monti, P. Vasa, C. Ladd, J. Beheshti, R. Bueno, M. Gillette, M. Loda, G. Weber, E. J. Mark, E. S. Lander, W. Wong, B. E. Johnson, T. R. Golub, D. A. Sugarbaker, and M. Meyerson. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA, 98(24):13790-13795, Nov 2001. [ bib | DOI | http | .pdf ]
We have generated a molecular taxonomy of lung carcinoma, the leading cause of cancer death in the United States and worldwide. Using oligonucleotide microarrays, we analyzed mRNA expression levels corresponding to 12,600 transcript sequences in 186 lung tumor samples, including 139 adenocarcinomas resected from the lung. Hierarchical and probabilistic clustering of expression data defined distinct subclasses of lung adenocarcinoma. Among these were tumors with high relative expression of neuroendocrine genes and of type II pneumocyte genes, respectively. Retrospective analysis revealed a less favorable outcome for the adenocarcinomas with neuroendocrine gene expression. The diagnostic potential of expression profiling is emphasized by its ability to discriminate primary lung adenocarcinomas from metastases of extra-pulmonary origin. These results suggest that integration of expression profile data with clinical parameters could aid in diagnosis of lung cancer patients.

[Ben-Hur2001Support] A. Ben-Hur, D. Horn, H.T. Siegelmann, and V. Vapnik. Support Vector Clustering. J. Mach. Learn. Res., 2:125-137, 2001. [ bib | .pdf | .pdf ]
[Bejerano2001Variations] G. Bejerano and G. Yona. Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics, 17:23-43, 2001. [ bib | http | .pdf ]
[Beerenwinkel2001Geno2pheno] N. Beerenwinkel, B. Schmidt, H. Walter, R. Kaiser, T. Lengauer, D. Hoffman, K. Korn, and J. Selbig. Geno2pheno: Interpreting Genotypic HIV Drug Resistance Tests. IEEE Intelligent Systems, 16(6):35-41, 2001. [ bib | DOI | http | .pdf ]
Rapid accumulation of resistance mutations in the genome of the human immunodeficiency virus (HIV) plays a central role in drug treatment failure in infected patients. The authors have developed geno2pheno, an intelligent system that uses the information encoded in the viral genomic sequence to predict resistance or susceptibility of the virus to 13 antiretroviral agents. To predict phenotypic drug resistance from genotype, they applied two machine learning techniques: decision trees and linear support vector machines. These techniques performed learning on more than 400 genotype-phenotype pairs for each drug. The authors compared the generalization performance of the two families of models in leave-one-out experiments. Except for three drugs, all error estimates ranged between 7.25 and 15.5 percent. Support vector machines performed slightly better for most drugs, but knowledge extraction was easier for decision trees. Geno2pheno is freely available at http://cartan.gmd.de/geno2pheno.html.

Keywords: biosvm
[Bazzani2001SVM] A. Bazzani, A. Bevilacqua, D. Bollini, R. Brancaccio, R. Campanini, N. Lanconelli, A. Riccardi, and D. Romani. An SVM classifier to separate false signals from microcalcifications in digital mammograms. Phys Med Biol, 46(6):1651-63, Jun 2001. [ bib | DOI | http | .pdf ]
In this paper we investigate the feasibility of using an SVM (support vector machine) classifier in our automatic system for the detection of clustered microcalcifications in digital mammograms. SVM is a technique for pattern recognition which relies on the statistical learning theory. It minimizes a function of two terms: the number of misclassified vectors of the training set and a term regarding the generalization classifier capability. We compare the SVM classifier with an MLP (multi-layer perceptron) in the false-positive reduction phase of our detection scheme: a detected signal is considered either microcalcification or false signal, according to the value of a set of its features. The SVM classifier gets slightly better results than the MLP one (Az value of 0.963 against 0.958) in the presence of a high number of training data; the improvement becomes much more evident (Az value of 0.952 against 0.918) in training sets of reduced size. Finally, the setting of the SVM classifier is much easier than the MLP one.

Keywords: biosvm image
[Barabasi2001Deterministic] A.-L. Barabási and E. Ravasz. Deterministic scale-free networks. E-print cond-mat/0107419, 2001. [ bib | http | .pdf ]
[Ballesteros2001G] J. Ballesteros and K. Palczewski. G protein-coupled receptor drug discovery: implications from the crystal structure of rhodopsin. Curr. Opin. Drug Discov. Devel., 4(5):561-574, Sep 2001. [ bib ]
G protein-coupled receptors (GPCRs) are a functionally diverse group of membrane proteins that play a critical role in signal transduction. Because of the lack of a high-resolution structure, the heptahelical transmembrane bundle within the N-terminal extracellular and C-terminal intracellular region of these receptors has initially been modeled based on the high-resolution structure of bacterial retinal-binding protein, bacteriorhodopsin. However, the low-resolution structure of rhodopsin, a prototypical GPCR, revealed that there is a minor relationship between GPCRs and bacteriorhodopsins. The high-resolution crystal structure of the rhodopsin ground state and further refinements of the model provide the first structural information about the entire organization of the polypeptide chain and post-translational moieties. These studies provide a structural template for Family 1 GPCRs that has the potential to significantly improve structure-based approaches to GPCR drug discovery.

Keywords: Amino Acid Sequence; Animals; Crystallography, X-Ray; Drug Design; GTP-Binding Proteins; Humans; Models, Molecular; Molecular Sequence Data; Receptors, Drug; Rhodopsin
[Baldi2001Bioinformatcs] P. Baldi and S. Brunak. Bioinformatics: the machine learning approach. MIT Press, 2001. [ bib ]
[Bajorath2001Selected] J. Bajorath. Selected concepts and investigations in compound classification, molecular descriptor analysis, and virtual screening. J Chem Inf Comput Sci, 41(2):233-245, 2001. [ bib ]
Keywords: chemoinformatics
[Anstreicheir2001new] K. M. Anstreicher and N. W. Brixius. A new bound for the quadratic assignment problem based on convex quadratic programming. Math. Program., 89(3):341-357, 2001. [ bib ]
[Amari2001Methods] S.-I. Amari and H. Nagaoka. Methods of information geometry. AMS vol. 191, 2001. [ bib ]
[Aguda2001Chaos] B. D. Aguda. Kick-starting the cell cycle: From growth-factor stimulation to initiation of DNA replication. Chaos, 11(1):269-276, 2001. [ bib ]
The essential genes, proteins and associated regulatory networks involved in the entry into the mammalian cell cycle are identified, from activation of growth-factor receptors to intracellular signal transduction pathways that impinge on the cell cycle machinery and ultimately on the initiation of DNA replication. Signaling pathways mediated by the oncoproteins Ras and Myc induce the activation of cyclin-dependent kinases CDK4 and CDK2, and the assembly and firing of pre-replication complexes require a collaboration among E2F, CDK2, and Cdc7 kinase. A proposed core mechanism of the restriction point, the major checkpoint prior to commitment to DNA synthesis, involves cyclin E/CDK2, the phosphatase Cdc25A, and the CDK inhibitor p27Kip1. (c) 2001 American Institute of Physics.

[Achard2001XML] F. Achard, G. Vaysseix, and E. Barillot. XML, bioinformatics and data integration. Bioinformatics, 17(2):115-125, Feb 2001. [ bib ]
Motivation: The eXtensible Markup Language (XML) is an emerging standard for structuring documents, notably for the World Wide Web. In this paper, the authors present XML and examine its use as a data language for bioinformatics. In particular, XML is compared to other languages, and some of the potential uses of XML in bioinformatics applications are presented. The authors propose to adopt XML for data interchange between databases and other sources of data. Finally the discussion is illustrated by a test case of a pedigree data model in XML. Contact: Emmanuel.Barillot@infobiogen.fr

Keywords: Computational Biology; Humans; Information Storage and Retrieval; Internet; Programming Languages
[Hua2001Novel] S. Hua and Z. Sun. A Novel Method of Protein Secondary Structure Prediction with High Segment Overlap Measure: Support Vector Machine Approach. J. Mol. Biol., 308(2):397-407, April 2001. [ bib | DOI | .pdf ]
Keywords: biosvm
[Opper2001Universal] M. Opper and R. Urbanczik. Universal learning curves of support vector machines. Phys Rev Lett, 86(19):4410-3, May 2001. [ bib ]
Using methods of statistical physics, we investigate the role of model complexity in learning with support vector machines (SVMs), which are an important alternative to neural networks. We show the advantages of using SVMs with kernels of infinite complexity on noisy target rules, which, in contrast to common theoretical beliefs, are found to achieve optimal generalization error although the training error does not converge to the generalization error. Moreover, we find a universal asymptotics of the learning curves which depend only on the target rule but not on the SVM kernel.

Keywords: Algorithms, Amino Acid Sequence, Artificial Intelligence, Biological, Cell Compartmentation, Chemistry, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, Databases, Decision Trees, Diagnosis, Discriminant Analysis, Electrophysiology, Factual, Gastric Emptying, Humans, Logistic Models, Melanoma, Models, Neural Networks (Computer), Nevus, Non-U.S. Gov't, Organelles, P.H.S., Physical, Pigmented, Predictive Value of Tests, Proteins, Proteome, Reproducibility of Results, Research Support, Skin Diseases, Skin Neoplasms, Skin Pigmentation, Software, Stomach Diseases, U.S. Gov't, 11328187
[Liang2001Detection] H. Liang and Z. Lin. Detection of delayed gastric emptying from electrogastrograms with support vector machine. IEEE Trans Biomed Eng, 48(5):601-4, May 2001. [ bib ]
A recent study reported a conventional neural network (NN) approach for the noninvasive diagnosis of delayed gastric emptying from the cutaneous electrogastrograms. Using support vector machine, we show that this relatively new technique can be used for detection of delayed gastric emptying and is in fact able to outdo the conventional NN.

Keywords: Algorithms, Amino Acid Sequence, Artificial Intelligence, Biological, Cell Compartmentation, Comparative Study, Computer Simulation, Computer-Assisted, Decision Trees, Diagnosis, Discriminant Analysis, Electrophysiology, Gastric Emptying, Humans, Logistic Models, Melanoma, Models, Neural Networks (Computer), Nevus, Non-U.S. Gov't, Organelles, P.H.S., Pigmented, Predictive Value of Tests, Proteins, Reproducibility of Results, Research Support, Skin Diseases, Skin Neoplasms, Skin Pigmentation, Stomach Diseases, U.S. Gov't, 11341535
[Jeong2001] H. Jeong, S. P. Mason, A. L. Barabási, and Z. N. Oltvai. Lethality and centrality in protein networks. Nature, 411(6833):41-42, May 2001. [ bib | DOI | http ]
Keywords: Fungal Proteins, genetics/physiology; Gene Deletion; Protein Binding; Proteome; Saccharomyces cerevisiae, genetics/physiology; Signal Transduction
[Elbashir2001Duplexes] S. M. Elbashir, J. Harborth, W. Lendeckel, A. Yalcin, K. Weber, and T. Tuschl. Duplexes of 21-nucleotide RNAs mediate RNA interference in cultured mammalian cells. Nature, 411(6836):494-498, May 2001. [ bib | DOI | http | .pdf ]
RNA interference (RNAi) is the process of sequence-specific, post-transcriptional gene silencing in animals and plants, initiated by double-stranded RNA (dsRNA) that is homologous in sequence to the silenced gene. The mediators of sequence-specific messenger RNA degradation are 21- and 22-nucleotide small interfering RNAs (siRNAs) generated by ribonuclease III cleavage from longer dsRNAs. Here we show that 21-nucleotide siRNA duplexes specifically suppress expression of endogenous and heterologous genes in different mammalian cell lines, including human embryonic kidney (293) and HeLa cells. Therefore, 21-nucleotide siRNA duplexes provide a new tool for studying gene function in mammalian cells and may eventually be used as gene-specific therapeutics.

Keywords: sirna
[Logan2001Study] B. Logan, P. Moreno, B. Suzek, Z. Weng, and S. Kasif. A Study of Remote Homology Detection. Technical Report CRL 2001/05, Compaq Cambridge Research Laboratory, June 2001. [ bib | .pdf ]
Functional annotation of newly sequenced genomes is an important challenge for computational biology systems. While much progress has been made towards scaling up experimental methods for functional assignment to putative genes, most current genomic annotation systems rely on computational solutions for homology modeling via sequence or structural similarity. We present a new method for remote homology detection that relies on combining probabilistic modeling and supervised learning in high-dimensional feature spaces. Our system uses a transformation that converts protein domains to fixed-dimension representative feature vectors, where each feature records the sensitivity of each protein domain to a previously learned set of "protein motifs" or "blocks". Subsequently, the system utilizes Support Vector Machine (SVM) classifiers to learn the boundaries between structural protein classes. Our experiments suggest that this technique performs well relative to several other remote homology methods for the majority of protein domains in SCOP 1.37 PDB90.

Keywords: biosvm
[Amari2001Information] S.-I. Amari. Information geometry on hierarchy of probability distributions. IEEE Trans. Inform. Theory, 47(5):1701-1711, July 2001. [ bib | .ps.gz | .pdf ]
[Burbidge2001Drug] R. Burbidge, M. Trotter, B. Buxton, and S. Holden. Drug design by machine learning: support vector machines for pharmaceutical data analysis. Comput. Chem., 26(1):4-15, December 2001. [ bib | .pdf | .pdf ]
Keywords: biosvm chemoinformatics
[Zhou2002covering] D. Zhou. The covering number in learning theory. J. Complexity, 18:739-767, 2002. [ bib | DOI | http | .pdf ]
The covering number of a ball of a reproducing kernel Hilbert space as a subset of the continuous function space plays an important role in Learning Theory. We give estimates for this covering number by means of the regularity of the Mercer kernel $K$. For convolution type kernels $K(x, t) = k(x - t)$ on $[0, 1]^n$, we provide estimates depending on the decay of $\hat{k}$, the Fourier transform of $k$. In particular, when $\hat{k}$ decays exponentially, our estimate for this covering number is better than all the previous results and covers many important Mercer kernels. A counter example is presented to show that the eigenfunctions of the Hilbert-Schmidt operator $L_K$ associated with a Mercer kernel $K$ may not be uniformly bounded. Hence some previous methods used for estimating the covering number in Learning Theory are not valid. We also provide an example of a Mercer kernel to show that $L_K^{1/2}$ may not be generated by a Mercer kernel.

[Zavaljevski2002Support] N. Zavaljevski, F.J. Stevens, and J. Reifman. Support vector machines with selective kernel scaling for protein classification and identification of key amino acid positions. Bioinformatics, 18(5):689-696, 2002. [ bib | http | .pdf ]
Motivation: Data that characterize primary and tertiary structures of proteins are now accumulating at a rapid and accelerating rate and require automated computational tools to extract critical information relating amino acid changes with the spectrum of functional attributes exhibited by a protein. We propose that immunoglobulin-type beta-domains, which are found in approximately 400 functionally distinct forms in humans alone, provide the immense genetic variation within limited conformational changes that might facilitate the development of new computational tools. As an initial step, we describe here an approach based on Support Vector Machine (SVM) technology to identify amino acid variations that contribute to the functional attribute of pathological self-assembly by some human antibody light chains produced during plasma cell diseases. Results: We demonstrate that SVMs with selective kernel scaling are an effective tool in discriminating between benign and pathologic human immunoglobulin light chains. Initial results compare favorably against manual classification performed by experts and indicate the capability of SVMs to capture the underlying structure of the data. The data set consists of 70 proteins of human antibody 1 light chains, each represented by aligned sequences of 120 amino acids. We perform feature selection based on a first-order adaptive scaling algorithm, which confirms the importance of changes in certain amino acid positions and identifies other positions that are key in the characterization of protein function.

Keywords: biosvm
[Yuan2002Prediction] Z. Yuan, K. Burrage, and J.S. Mattick. Prediction of protein solvent accessibility using support vector machines. Proteins, 48(3):566-570, 2002. [ bib | DOI | http | .pdf ]
A Support Vector Machine learning system has been trained to predict protein solvent accessibility from the primary structure. Different kernel functions and sliding window sizes have been explored to find how they affect the prediction performance. Using a cut-off threshold of 15% of exposed and buried residues), this method was able to achieve a prediction accuracy of 70.1% for multiple alignment sequence input, respectively. The prediction of three and more states of solvent accessibility was also studied and compared with other methods. The prediction accuracies are better than, or comparable to, those obtained by other methods such as neural networks, Bayesian classification, multiple linear regression, and information theory. In addition, our results further suggest that this system may be combined with other prediction methods to achieve more reliable results, and that the Support Vector Machine method is a very useful tool for biological sequence analysis.

Keywords: biosvm
[Yu2002Methods] Kun Yu, Nikolai Petrovsky, Christian Schönbach, Judice Y L Koh, and Vladimir Brusic. Methods for prediction of peptide binding to MHC molecules: a comparative study. Mol Med, 8(3):137-148, Mar 2002. [ bib ]
BACKGROUND: A variety of methods for prediction of peptide binding to major histocompatibility complex (MHC) have been proposed. These methods are based on binding motifs, binding matrices, hidden Markov models (HMM), or artificial neural networks (ANN). There has been little prior work on the comparative analysis of these methods. MATERIALS AND METHODS: We performed a comparison of the performance of six methods applied to the prediction of two human MHC class I molecules, including binding matrices and motifs, ANNs, and HMMs. RESULTS: The selection of the optimal prediction method depends on the amount of available data (the number of peptides of known binding affinity to the MHC molecule of interest), the biases in the data set and the intended purpose of the prediction (screening of a single protein versus mass screening). When little or no peptide data are available, binding motifs are the most useful alternative to random guessing or use of a complete overlapping set of peptides for selection of candidate binders. As the number of known peptide binders increases, binding matrices and HMM become more useful predictors. ANN and HMM are the predictive methods of choice for MHC alleles with more than 100 known binding peptides. CONCLUSION: The ability of bioinformatic methods to reliably predict MHC binding peptides, and thereby potential T-cell epitopes, has major implications for clinical immunology, particularly in the area of vaccine design.

Keywords: Amino Acid Motifs; Computational Biology; Histocompatibility Antigens Class I; Humans; Models, Molecular; Peptides; Protein Binding
[Yeung2002Reverse] M. K. Stephen Yeung, Jesper Tegnér, and James J. Collins. Reverse engineering gene networks using singular value decomposition and robust regression. Proc. Natl. Acad. Sci. USA, 99(9):6163-6168, 2002. [ bib | DOI | arXiv | http ]
We propose a scheme to reverse-engineer gene networks on a genome-wide scale using a relatively small amount of gene expression data from microarray experiments. Our method is based on the empirical observation that such networks are typically large and sparse. It uses singular value decomposition to construct a family of candidate solutions and then uses robust regression to identify the solution with the smallest number of connections as the most likely solution. Our algorithm has $O(\log N)$ sampling complexity and $O(N^4)$ computational complexity. We test and validate our approach in a series of in numero experiments on model gene networks.

[Yang2002Design] Yee Hwa Yang and Terry Speed. Design issues for cDNA microarray experiments. Nat Rev Genet, 3(8):579-588, Aug 2002. [ bib | DOI | http ]
Keywords: Gene Expression Profiling; Gene Expression Regulation; Humans; Oligonucleotide Array Sequence Analysis, methods; Research Design; Time Factors
[Xu2002Chemoinformatics] J. Xu and A. Hagler. Chemoinformatics and Drug Discovery. Molecules, 7:566-600, 2002. [ bib | .pdf ]
Keywords: chemoinformatics
[Xenarios2002DIP] I. Xenarios, L. Salwínski, X. J. Duan, P. Higney, S-M Kim, and D. Eisenberg. DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res, 30(1):303-305, Jan 2002. [ bib ]
The Database of Interacting Proteins (DIP: http://dip.doe-mbi.ucla.edu) is a database that documents experimentally determined protein-protein interactions. It provides the scientific community with an integrated set of tools for browsing and extracting information about protein interaction networks. As of September 2001, the DIP catalogs approximately 11 000 unique interactions among 5900 proteins from >80 organisms; the vast majority from yeast, Helicobacter pylori and human. Tools have been developed that allow users to analyze, visualize and integrate their own experimental data with the information about protein-protein interactions available in the DIP database.

[Wu2002Natural] Jiann-Ming Wu. Natural discriminant analysis using interactive Potts models. Neural Comput, 14(3):689-713, Mar 2002. [ bib | DOI | http ]
Natural discriminant analysis based on interactive Potts models is developed in this work. A generative model composed of piece-wise multivariate Gaussian distributions is used to characterize the input space, exploring the embedded clustering and mixing structures and developing proper internal representations of input parameters. The maximization of a log-likelihood function measuring the fitness of all input parameters to the generative model, and the minimization of a design cost summing up square errors between posterior outputs and desired outputs constitute a mathematical framework for discriminant analysis. We apply a hybrid of the mean-field annealing and the gradient-descent methods to the optimization of this framework and obtain multiple sets of interactive dynamics, which realize coupled Potts models for discriminant analysis. The new learning process is a whole process of component analysis, clustering analysis, and labeling analysis. Its major improvement compared to the radial basis function and the support vector machine is described by using some artificial examples and a real-world application to breast cancer diagnosis.

[Weber2002Building] Griffin Weber, Staal Vinterbo, and Lucila Ohno-Machado. Building an asynchronous web-based tool for machine learning classification. Proc AMIA Symp, pages 869-73, 2002. [ bib ]
Various unsupervised and supervised learning methods including support vector machines, classification trees, linear discriminant analysis and nearest neighbor classifiers have been used to classify high-throughput gene expression data. Simpler and more widely accepted statistical tools have not yet been used for this purpose, hence proper comparisons between classification methods have not been conducted. We developed free software that implements logistic regression with stepwise variable selection as a quick and simple method for initial exploration of important genetic markers in disease classification. To implement the algorithm and allow our collaborators in remote locations to evaluate and compare its results against those of other methods, we developed a user-friendly asynchronous web-based application with a minimal amount of programming using free, downloadable software tools. With this program, we show that classification using logistic regression can perform as well as other more sophisticated algorithms, and it has the advantages of being easy to interpret and reproduce. By making the tool freely and easily available, we hope to promote the comparison of classification methods. In addition, we believe our web application can be used as a model for other bioinformatics laboratories that need to develop web-based analysis tools in a short amount of time and on a limited budget.

Keywords: Acute, Algorithms, Animals, Artificial Intelligence, Automated, Base Pair Mismatch, Base Pairing, Base Sequence, Biological, Biosensing Techniques, Classification, Cluster Analysis, Comparative Study, Computational Biology, Computer-Assisted, Cystadenoma, DNA, Drug, Drug Design, Eukaryotic Cells, Female, Gene Expression, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Hemolysins, Humans, Internet, Leukemia, Ligands, Likelihood Functions, Logistic Models, Lymphocytic, Markov Chains, Mathematics, Messenger, Models, Molecular, Molecular Probe Techniques, Molecular Sequence Data, Nanotechnology, Neoplasm, Neoplasms, Neoplastic, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Nucleic Acid Conformation, Observer Variation, Oligonucleotide Array Sequence Analysis, Ovarian Neoplasms, P.H.S., Pattern Recognition, Probability, Protein Binding, Proteins, Quality Control, RNA, RNA Splicing, Receptors, Reference Values, Reproducibility of Results, Research Support, Sensitivity and Specificity, Sequence Analysis, Signal Processing, Software, Statistical, Stomach Neoplasms, Thermodynamics, Transcription, Tumor Markers, U.S. Gov't, 12463949
[Weber2002Cancer] B. L. Weber. Cancer genomics. Cancer Cell, 1(1):37-47, 2002. [ bib | DOI | http | .pdf ]
The draft human genome sequence and the dissemination of high throughput technology provide opportunities for systematic analysis of cancer cells. Genome-wide mutation screens, high resolution analysis of chromosomal aberrations and expression profiling all give comprehensive views of genetic alterations in cancer cells. From these analyses will come a complete list of the genetic changes that drive malignant transformation and of the therapeutic targets that may be exploited for clinical benefit.

[Warmuth2002Active] M. K. Warmuth, G. Rätsch, M. Mathieson, L. Liao, and C. Lemmen. Active learning in the drug discovery process. In T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Adv. Neural Inform. Process. Syst., volume 14, pages 1449-1456. MIT Press, 2002. [ bib ]
Keywords: biosvm
[Walters2002Prediction] W. Patrick Walters and Mark A. Murcko. Prediction of 'drug-likeness'. Adv. Drug Deliv. Rev., 54:255-271, 2002. [ bib ]
[Wahba2002Soft] Grace Wahba. Soft and hard classification by reproducing kernel Hilbert space methods. Proc Natl Acad Sci U S A, 99(26):16524-30, Dec 2002. [ bib | DOI | http | .pdf ]
Reproducing kernel Hilbert space (RKHS) methods provide a unified context for solving a wide variety of statistical modelling and function estimation problems. We consider two such problems: We are given a training set $\{y_i, t_i,\ i = 1, \ldots, n\}$, where $y_i$ is the response for the $i$th subject, and $t_i$ is a vector of attributes for this subject. The value of $y_i$ is a label that indicates which category it came from. For the first problem, we wish to build a model from the training set that assigns to each $t$ in an attribute domain of interest an estimate of the probability $p_j(t)$ that a (future) subject with attribute vector $t$ is in category $j$. The second problem is in some sense less ambitious; it is to build a model that assigns to each $t$ a label, which classifies a future subject with that $t$ into one of the categories or possibly "none of the above." The approach to the first of these two problems discussed here is a special case of what is known as penalized likelihood estimation. The approach to the second problem is known as the support vector machine. We also note some alternate but closely related approaches to the second problem. These approaches are all obtained as solutions to optimization problems in RKHS. Many other problems, in particular the solution of ill-posed inverse problems, can be obtained as solutions to optimization problems in RKHS and are mentioned in passing. We caution the reader that although a large literature exists in all of these topics, in this inaugural article we are selectively highlighting work of the author, former students, and other collaborators.

Keywords: Acute, Algorithms, Animals, Automated, Base Pair Mismatch, Base Pairing, Base Sequence, Biological, Biosensing Techniques, Classification, Cluster Analysis, Comparative Study, Computational Biology, Computer-Assisted, Cystadenoma, DNA, Drug, Drug Design, Eukaryotic Cells, Female, Gene Expression, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Hemolysins, Humans, Leukemia, Ligands, Likelihood Functions, Lymphocytic, Markov Chains, Mathematics, Messenger, Models, Molecular, Molecular Probe Techniques, Molecular Sequence Data, Nanotechnology, Neoplasm, Neoplastic, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Nucleic Acid Conformation, Observer Variation, Oligonucleotide Array Sequence Analysis, Ovarian Neoplasms, P.H.S., Pattern Recognition, Probability, Protein Binding, Proteins, Quality Control, RNA, RNA Splicing, Receptors, Reference Values, Reproducibility of Results, Research Support, Sensitivity and Specificity, Sequence Analysis, Signal Processing, Statistical, Stomach Neoplasms, Thermodynamics, Transcription, Tumor Markers, U.S. Gov't, 12477931
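For the two-category case, the contrast between the "soft" and "hard" problems in the Wahba abstract above can be written as a pair of optimization problems over an RKHS. This is a sketch in commonly used notation rather than a quotation from the article: the symbols $f$, $\lambda$, and $\mathcal{H}_K$ are assumptions introduced here, as is the coding of labels as $y_i \in \{0, 1\}$ for the likelihood and $y_i \in \{-1, +1\}$ for the support vector machine.

```latex
% Soft classification (penalized likelihood), labels y_i in {0, 1}:
% f estimates the log odds, so p(t) = e^{f(t)} / (1 + e^{f(t)}).
\min_{f \in \mathcal{H}_K} \;
  \frac{1}{n} \sum_{i=1}^{n} \Bigl[ -y_i f(t_i) + \log\bigl(1 + e^{f(t_i)}\bigr) \Bigr]
  + \lambda \, \|f\|_{\mathcal{H}_K}^{2}

% Hard classification (support vector machine), labels y_i in {-1, +1}:
% only the sign of f(t) is used, with (u)_+ = \max(u, 0) the hinge loss.
\min_{f \in \mathcal{H}_K} \;
  \frac{1}{n} \sum_{i=1}^{n} \bigl( 1 - y_i f(t_i) \bigr)_{+}
  + \lambda \, \|f\|_{\mathcal{H}_K}^{2}
```

Both problems share the same penalty term and differ only in the loss, which is one way to read the abstract's remark that the second problem is "less ambitious": it asks for a label rather than a probability.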
[Vinokourov2002Finding] A. Vinokourov, J. Shawe-Taylor, and N. Cristianini. Finding Language-Independent Semantic Representation of Text using Kernel Canonical Correlation Analysis. Technical report, Neurocolt, 2002. NeuroCOLT Technical Report NC-TR-02-119. [ bib | .html | .ps.gz ]
[Vijver2002gene-expression] M. J. van de Vijver, Y. D. He, L. J. van't Veer, H. Dai, A. A. M. Hart, D. W. Voskuil, G. J. Schreiber, J. L. Peterse, C. Roberts, M. J. Marton, M. Parrish, D. Atsma, A. Witteveen, A. Glas, L. Delahaye, T. van der Velde, H. Bartelink, S. Rodenhuis, E. T. Rutgers, S. H. Friend, and R. Bernards. A gene-expression signature as a predictor of survival in breast cancer. N. Engl. J. Med., 347(25):1999-2009, Dec 2002. [ bib | DOI | http | .pdf ]
BACKGROUND: A more accurate means of prognostication in breast cancer will improve the selection of patients for adjuvant systemic therapy. METHODS: Using microarray analysis to evaluate our previously established 70-gene prognosis profile, we classified a series of 295 consecutive patients with primary breast carcinomas as having a gene-expression signature associated with either a poor prognosis or a good prognosis. All patients had stage I or II breast cancer and were younger than 53 years old; 151 had lymph-node-negative disease, and 144 had lymph-node-positive disease. We evaluated the predictive power of the prognosis profile using univariable and multivariable statistical analyses. RESULTS: Among the 295 patients, 180 had a poor-prognosis signature and 115 had a good-prognosis signature, and the mean (+/-SE) overall 10-year survival rates were 54.6+/-4.4 percent and 94.5+/-2.6 percent, respectively. At 10 years, the probability of remaining free of distant metastases was 50.6+/-4.5 percent in the group with a poor-prognosis signature and 85.2+/-4.3 percent in the group with a good-prognosis signature. The estimated hazard ratio for distant metastases in the group with a poor-prognosis signature, as compared with the group with the good-prognosis signature, was 5.1 (95 percent confidence interval, 2.9 to 9.0; P<0.001). This ratio remained significant when the groups were analyzed according to lymph-node status. Multivariable Cox regression analysis showed that the prognosis profile was a strong independent factor in predicting disease outcome. CONCLUSIONS: The gene-expression profile we studied is a more powerful predictor of the outcome of disease in young patients with breast cancer than standard systems based on clinical and histologic criteria.

Keywords: breastcancer, csbcbook, csbcbook-ch3
[Vert2002Graph-driven] J.-P. Vert and M. Kanehisa. Graph-driven features extraction from microarray data. Technical Report 0206055, Arxiv physics, 2002. [ bib ]
Keywords: biosvm
[Vert2002tree] J.-P. Vert. A tree kernel to analyze phylogenetic profiles. Bioinformatics, 18:S276-S284, 2002. [ bib | .html | .pdf ]
Keywords: biosvm
[Vert2002Support] J.-P. Vert. Support vector machine prediction of signal peptide cleavage site using a new class of kernels for strings. In R. B. Altman, A. K. Dunker, L. Hunter, K. Lauderdale, and T. E. Klein, editors, Proceedings of the Pacific Symposium on Biocomputing 2002, pages 649-660. World Scientific, 2002. [ bib | .pdf | .pdf ]
Keywords: biosvm
[Venables2002Modern] W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. Springer, New York, fourth edition, 2002. ISBN 0-387-95457-0. [ bib | http ]
[Veer2002Gene] L. J. van 't Veer, H. Dai, M. J. van de Vijver, Y. D. He, A. A. M. Hart, M. Mao, H. L. Peterse, K. van der Kooy, M. J. Marton, A. T. Witteveen, G. J. Schreiber, R. M. Kerkhoven, C. Roberts, P. S. Linsley, R. Bernards, and S. H. Friend. Gene expression profiling predicts clinical outcome of breast cancers. Nature, 415(6871):530-536, Jan 2002. [ bib | DOI | http | .pdf ]
Breast cancer patients with the same stage of disease can have markedly different treatment responses and overall outcome. The strongest predictors for metastases (for example, lymph node status and histological grade) fail to classify accurately breast tumours according to their clinical behaviour. Chemotherapy or hormonal therapy reduces the risk of distant metastases by approximately one-third; however, 70-80% of patients receiving this treatment would have survived without it. None of the signatures of breast cancer gene expression reported to date allow for patient-tailored therapy strategies. Here we used DNA microarray analysis on primary breast tumours of 117 young patients, and applied supervised classification to identify a gene expression signature strongly predictive of a short interval to distant metastases ('poor prognosis' signature) in patients without tumour cells in local lymph nodes at diagnosis (lymph node negative). In addition, we established a signature that identifies tumours of BRCA1 carriers. The poor prognosis signature consists of genes regulating cell cycle, invasion, metastasis and angiogenesis. This gene expression profile will outperform all currently used clinical parameters in predicting disease outcome. Our findings provide a strategy to select patients who would benefit from adjuvant therapy.

Keywords: breastcancer, csbcbook, csbcbook-ch3
[Veber2002Molecular] D. F. Veber, S. R. Johnson, H.-Y. Cheng, B. R. Smith, K. W. Ward, and K. D. Kopple. Molecular properties that influence the oral bioavailability of drug candidates. J. Med. Chem., 45(12):2615-2623, Jun 2002. [ bib ]
Oral bioavailability measurements in rats for over 1100 drug candidates studied at SmithKline Beecham Pharmaceuticals (now GlaxoSmithKline) have allowed us to analyze the relative importance of molecular properties considered to influence that drug property. Reduced molecular flexibility, as measured by the number of rotatable bonds, and low polar surface area or total hydrogen bond count (sum of donors and acceptors) are found to be important predictors of good oral bioavailability, independent of molecular weight. That on average both the number of rotatable bonds and polar surface area or hydrogen bond count tend to increase with molecular weight may in part explain the success of the molecular weight parameter in predicting oral bioavailability. The commonly applied molecular weight cutoff at 500 does not itself significantly separate compounds with poor oral bioavailability from those with acceptable values in this extensive data set. Our observations suggest that compounds which meet only the two criteria of (1) 10 or fewer rotatable bonds and (2) polar surface area equal to or less than 140 Å² (or 12 or fewer H-bond donors and acceptors) will have a high probability of good oral bioavailability in the rat. Data sets for the artificial membrane permeation rate and for clearance in the rat were also examined. Reduced polar surface area correlates better with increased permeation rate than does lipophilicity (C log P), and increased rotatable bond count has a negative effect on the permeation rate. A threshold permeation rate is a prerequisite of oral bioavailability. The rotatable bond count does not correlate with the data examined here for the in vivo clearance rate in the rat.

Keywords: chemogenomics
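The two criteria stated in the Veber abstract above can be sketched as a simple filter. This is a minimal illustration, not code from the paper: the `Descriptors` container and the function name `passes_veber_criteria` are hypothetical, the descriptor values are assumed to be computed elsewhere, and reading the parenthetical H-bond count as an alternative to the polar surface area test is an interpretation of the abstract's wording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed molecular descriptors (hypothetical container, not from the paper)."""
    rotatable_bonds: int
    polar_surface_area: float  # in square angstroms
    h_bond_donors: int
    h_bond_acceptors: int


def passes_veber_criteria(d: Descriptors) -> bool:
    """Return True when both criteria quoted in the abstract are met.

    (1) 10 or fewer rotatable bonds.
    (2) Polar surface area <= 140 square angstroms, or (assumed alternative)
        12 or fewer H-bond donors and acceptors in total.
    """
    flexibility_ok = d.rotatable_bonds <= 10
    polarity_ok = (
        d.polar_surface_area <= 140.0
        or d.h_bond_donors + d.h_bond_acceptors <= 12
    )
    return flexibility_ok and polarity_ok


if __name__ == "__main__":
    # Made-up descriptor values, used only to exercise the filter.
    candidate = Descriptors(rotatable_bonds=5, polar_surface_area=90.0,
                            h_bond_donors=2, h_bond_acceptors=5)
    print(passes_veber_criteria(candidate))
```

Note that molecular weight is deliberately absent from the filter, in line with the abstract's finding that the 500 cutoff does not by itself separate the compounds.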
[Valentini2002Gene] G. Valentini. Gene expression data analysis of human lymphoma using support vector machines and output coding ensembles. Artif. Intell. Med., 26(3):281-304, Nov 2002. [ bib | DOI | .pdf ]
The large amount of data generated by DNA microarrays was originally analysed using unsupervised methods, such as clustering or self-organizing maps. Recently supervised methods such as decision trees, dot-product support vector machines (SVM) and multi-layer perceptrons (MLP) have been applied in order to classify normal and tumoural tissues. We propose methods based on non-linear SVM with polynomial and Gaussian kernels, and output coding (OC) ensembles of learning machines to separate normal from malignant tissues, to classify different types of lymphoma and to analyse the role of sets of coordinately expressed genes in carcinogenic processes of lymphoid tissues. Using gene expression data from "Lymphochip", a specialised DNA microarray developed at Stanford University School of Medicine, we show that SVM can correctly separate normal from tumoural tissues, and OC ensembles can be successfully used to classify different types of lymphoma. Moreover, we identify a group of coordinately expressed genes related to the separation of two distinct subgroups inside diffuse large B-cell lymphoma (DLBCL), validating an earlier hypothesis of Alizadeh about the existence of two distinct diseases inside DLBCL.

Keywords: biosvm
[Turner2002Cellular] Bryan M. Turner. Cellular memory and the histone code. Cell, 111:285-291, 2002. [ bib ]
Keywords: csbcbook
[Tucker2002Gene] D. L. Tucker, N. Tucker, and T. Conway. Gene expression profiling of the pH response in Escherichia coli. J Bacteriol., 184(23):6551-6558, Dec 2002. [ bib ]
Escherichia coli MG1655 acid-inducible genes were identified by whole-genome expression profiling. Cultures were grown to the mid-logarithmic phase on acidified glucose minimal medium, conditions that induce glutamate-dependent acid resistance (AR), while the other AR systems are either repressed or not induced. A total of 28 genes were induced in at least two of three experiments in which the gene expression profiles of cells grown in acid (pH 5.5 or 4.5) were compared to those of cells grown at pH 7.4. As expected, the genes encoding glutamate decarboxylase, gadA and gadB, were significantly induced. Interestingly, two acid-inducible genes code for small basic proteins with pIs of >10.5, and six code for small acidic proteins with pIs ranging from 5.7 to 4.0; the roles of these small basic and acidic proteins in acid resistance are unknown. The acid-induced genes represented only five functional grouping categories, including eight genes involved in metabolism, nine associated with cell envelope structures or modifications, two encoding chaperones, six regulatory genes, and six unknown genes. It is unlikely that all of these genes are involved in the glutamate-dependent AR. However, nine acid-inducible genes are clustered in the gadA region, including hdeA, which encodes a putative periplasmic chaperone, and four putative regulatory genes. One of these putative regulators, yhiE, was shown to significantly increase acid resistance when overexpressed in cells that had not been preinduced by growth at pH 5.5, and mutation of yhiE decreased acid resistance; yhiE could therefore encode an activator of AR genes. Thus, the acid-inducible genes clustered in the gadA region appear to be involved in glutamate-dependent acid resistance, although their specific roles remain to be elucidated.

Keywords: Culture Media; Escherichia coli; Escherichia coli Proteins; Gene Expression Profiling; Gene Expression Regulation, Bacterial; Heat-Shock Response; Hydrogen-Ion Concentration; Morpholines; Oligonucleotide Array Sequence Analysis
[Tsuda2002Marginalized] K. Tsuda, T. Kin, and K. Asai. Marginalized Kernels for Biological Sequences. Bioinformatics, 18:S268-S275, 2002. [ bib | .pdf ]
Motivation: Kernel methods such as support vector machines require a kernel function between objects to be defined a priori. Several works have been done to derive kernels from probability distributions, e.g., the Fisher kernel. However, a general methodology to design a kernel is not fully developed. Results: We propose a reasonable way of designing a kernel when objects are generated from latent variable models (e.g., HMM). First of all, a joint kernel is designed for complete data which include both visible and hidden variables. Then a marginalized kernel for visible data is obtained by taking the expectation with respect to hidden variables. We will show that the Fisher kernel is a special case of marginalized kernels, which gives another viewpoint to the Fisher kernel theory. Although our approach can be applied to any object, we particularly derive several marginalized kernels useful for biological sequences (e.g., DNA and proteins). The effectiveness of marginalized kernels is illustrated in the task of classifying bacterial gyrase subunit B (gyrB) amino acid sequences.

Keywords: biosvm
[Tsuda2002new] K. Tsuda, M. Kawanabe, G. Rätsch, S. Sonnenburg, and K.-R. Müller. A new discriminative kernel from probabilistic models. Neural Computation, 14(10):2397-2414, 2002. [ bib | DOI | http | .pdf ]
Keywords: biosvm
[Todeschini2002Handbook] R. Todeschini and V. Consonni. Handbook of Molecular Descriptors. Wiley-VCH, New York, 2002. [ bib ]
Keywords: chemoinformatics
[Surabhi2002RNA] R. M. Surabhi and R. B. Gaynor. RNA interference directed against viral and cellular targets inhibits human immunodeficiency virus type 1 replication. J. Virol., 76(24):12963-12973, Dec 2002. [ bib ]
Human immunodeficiency virus type 1 (HIV-1) gene expression is regulated by both cellular transcription factors and Tat. The ability of Tat to stimulate transcriptional elongation is dependent on its binding to TAR RNA in conjunction with cyclin T1 and CDK9. A variety of other cellular factors that bind to the HIV-1 long terminal repeat, including NF-kappaB, SP1, LBP, and LEF, are also important in the control of HIV-1 gene expression. Although these factors have been demonstrated to regulate HIV-1 gene expression by both genetic and biochemical analysis, in most cases a direct in vivo demonstration of their role on HIV-1 replication has not been established. Recently, the efficacy of RNA interference in mammalian cells has been shown utilizing small interfering RNAs (siRNAs) to result in the specific degradation of host mRNAs and decreases the levels of their corresponding proteins. In this study, we addressed whether siRNAs directed against either HIV-1 tat or reverse transcriptase or the NF-kappaB p65 subunit could specifically decrease the levels of these proteins and thus alter HIV-1 replication. Our results demonstrate the specificity of siRNAs for decreasing the expression of these viral and cellular proteins and inhibiting HIV-1 replication. These studies suggest that RNA interference is useful in exploring the biological role of cellular and viral regulatory factors involved in the control of HIV-1 gene expression.

Keywords: sirna
[Sultan2002Binary] M. Sultan, D. A. Wigle, C. A. Cumbaa, M. Maziarz, J. Glasgow, M. S. Tsao, and I. Jurisica. Binary tree-structured vector quantization approach to clustering and visualizing microarray data. Bioinformatics, 18 Suppl 1:S111-9, 2002. [ bib ]
MOTIVATION: With the increasing number of gene expression databases, the need for more powerful analysis and visualization tools is growing. Many techniques have successfully been applied to unravel latent similarities among genes and/or experiments. Most of the current systems for microarray data analysis use statistical methods, hierarchical clustering, self-organizing maps, support vector machines, or k-means clustering to organize genes or experiments into 'meaningful' groups. Without prior explicit bias almost all of these clustering methods applied to gene expression data not only produce different results, but may also produce clusters with little or no biological relevance. Of these methods, agglomerative hierarchical clustering has been the most widely applied, although many limitations have been identified. RESULTS: Starting with a systematic comparison of the underlying theories behind clustering approaches, we have devised a technique that combines tree-structured vector quantization and partitive k-means clustering (BTSVQ). This hybrid technique has revealed clinically relevant clusters in three large publicly available data sets. In contrast to existing systems, our approach is less sensitive to data preprocessing and data normalization. In addition, the clustering results produced by the technique have strong similarities to those of self-organizing maps (SOMs). We discuss the advantages and the mathematical reasoning behind our approach.

[Su2002Large-scale] A.I. Su, M.P. Cooke, K.A. Ching, Y. Hakak, J.R. Walker, T. Wiltshire, A.P. Orth, R.Q. Vega, L.M. Sapinoso, A. Moqrich, A. Patapoutian, G.M. Hampton, P.G. Schultz, and J.B. Hogenesch. Large-scale analysis of the human and mouse transcriptomes. Proc. Natl. Acad. Sci. U. S. A., 99(7):4465-4470, Apr 2002. [ bib | DOI | http ]
High-throughput gene expression profiling has become an important tool for investigating transcriptional activity in a variety of biological samples. To date, the vast majority of these experiments have focused on specific biological processes and perturbations. Here, we have generated and analyzed gene expression from a set of samples spanning a broad range of biological conditions. Specifically, we profiled gene expression from 91 human and mouse samples across a diverse array of tissues, organs, and cell lines. Because these samples predominantly come from the normal physiological state in the human and mouse, this dataset represents a preliminary, but substantial, description of the normal mammalian transcriptome. We have used this dataset to illustrate methods of mining these data, and to reveal insights into molecular and physiological gene function, mechanisms of transcriptional regulation, disease etiology, and comparative genomics. Finally, to allow the scientific community to use this resource, we have built a free and publicly accessible website (http://expression.gnf.org) that integrates data visualization and curation of current gene annotations.

Keywords: Animals; Collagen; Female; Gene Expression Profiling; Humans; Male; Mice; Organ Specificity; Polymerase Chain Reaction; Receptors, Cell Surface; Transcription, Genetic
[Sturn2002Genesis:] Alexander Sturn, John Quackenbush, and Zlatko Trajanoski. Genesis: cluster analysis of microarray data. Bioinformatics, 18(1):207-8, Jan 2002. [ bib ]
A versatile, platform independent and easy to use Java suite for large-scale gene expression analysis was developed. Genesis integrates various tools for microarray data analysis such as filters, normalization and visualization tools, distance measures as well as common clustering algorithms including hierarchical clustering, self-organizing maps, k-means, principal component analysis, and support vector machines. The results of the clustering are transparent across all implemented methods and enable the analysis of the outcome of different algorithms and parameters. Additionally, mapping of gene expression data onto chromosomal sequences was implemented to enhance promoter analysis and investigation of transcriptional control mechanisms.

Keywords: Algorithms, Artificial Intelligence, Cluster Analysis, Comparative Study, Computational Biology, Databases, Gene Expression Profiling, Genetic, Models, Molecular Structure, Neural Networks (Computer), Non-U.S. Gov't, Oligonucleotide Array Sequence Analysis, Principal Component Analysis, Programming Languages, Promoter Regions (Genetics), Protein, Proteins, Research Support, Software, Statistical, Transcription, 11836235
[Strahl2002Set2] Brian D Strahl, Patrick A Grant, Scott D Briggs, Zu-Wen Sun, James R Bone, Jennifer A Caldwell, Sahana Mollah, Richard G Cook, Jeffrey Shabanowitz, Donald F Hunt, and C. David Allis. Set2 is a nucleosomal histone H3-selective methyltransferase that mediates transcriptional repression. Mol Cell Biol, 22(5):1298-1306, Mar 2002. [ bib ]
Recent studies of histone methylation have yielded fundamental new insights pertaining to the role of this modification in gene activation as well as in gene silencing. While a number of methylation sites are known to occur on histones, only limited information exists regarding the relevant enzymes that mediate these methylation events. We thus sought to identify native histone methyltransferase (HMT) activities from Saccharomyces cerevisiae. Here, we describe the biochemical purification and characterization of Set2, a novel HMT that is site-specific for lysine 36 (Lys36) of the H3 tail. Using an antiserum directed against Lys36 methylation in H3, we show that Set2, via its SET domain, is responsible for methylation at this site in vivo. Tethering of Set2 to a heterologous promoter reveals that Set2 represses transcription, and part of this repression is mediated through the HMT activity of the SET domain. These results suggest that Set2 and methylation at H3 Lys36 play a role in the repression of gene transcription.

Keywords: Amino Acid Sequence; Gene Expression Regulation, Fungal; Histones, metabolism; Methyltransferases, metabolism; Molecular Sequence Data; Nucleosomes, enzymology; Saccharomyces cerevisiae Proteins, metabolism; Saccharomyces cerevisiae, enzymology/genetics; Substrate Specificity; Transcription, Genetic; Transcriptional Activation
[Stelling2002Metabolic] J. Stelling, S. Klamt, K. Bettenbrock, S. Schuster, and E. D. Gilles. Metabolic network structure determines key aspects of functionality and regulation. Nature, 420(6912):190-193, Nov 2002. [ bib | DOI | http | .pdf ]
The relationship between structure, function and regulation in complex cellular networks is a still largely open question. Systems biology aims to explain this relationship by combining experimental and theoretical approaches. Current theories have various strengths and shortcomings in providing an integrated, predictive description of cellular networks. Specifically, dynamic mathematical modelling of large-scale networks meets difficulties because the necessary mechanistic detail and kinetic parameters are rarely available. In contrast, structure-oriented analyses only require network topology, which is well known in many cases. Previous approaches of this type focus on network robustness or metabolic phenotype, but do not give predictions on cellular regulation. Here, we devise a theoretical method for simultaneously predicting key aspects of network functionality, robustness and gene regulation from network structure alone. This is achieved by determining and analysing the non-decomposable pathways able to operate coherently at steady state (elementary flux modes). We use the example of Escherichia coli central metabolism to illustrate the method.

[Steinwart2002Support] I. Steinwart. Support Vector Machines are Universally Consistent. J. Complexity, 18:768-791, 2002. [ bib | DOI | http | .pdf ]
We show that support vector machines of the 1-norm soft margin type are universally consistent provided that the regularization parameter is chosen in a distinct manner and the kernel belongs to a specific class (the so-called universal kernels) which has recently been considered by the author. In particular it is shown that the 1-norm soft margin classifier with Gaussian RBF kernel on a compact subset X of R^d and regularization parameter c_n = n^(β-1) is universally consistent, if n is the training set size and 0 < β < 1/d.

[Stapley2002Predicting] B.J. Stapley, L.A. Kelley, and M.J. Sternberg. Predicting the sub-cellular location of proteins from text using support vector machines. In Russ B. Altman, A. Keith Dunker, Lawrence Hunter, Kevin Lauderdale, and Teri E. Klein, editors, Proceedings of the Pacific Symposium on Biocomputing 2002, pages 374-385. World Scientific, 2002. [ bib | .pdf | .pdf ]
We present an automatic method to classify the sub-cellular location of proteins based on the text of relevant medline abstracts. For each protein, a vector of terms is generated from medline abstracts in which the protein/gene's name or synonym occurs. A Support Vector Machine (SVM) is used to automatically partition the term space and to thus discriminate the textual features that define sub-cellular location. The method is benchmarked on a set of proteins of known sub-cellular location from S. cerevisiae. No prior knowledge of the problem domain nor any natural language processing is used at any stage. The method out-performs support vector machines trained on amino acid composition and has comparable performance to rule-based text classifiers. Combining text with protein amino-acid composition improves recall for some sub-cellular locations. We discuss the generality of the method and its potential application to a variety of biological classification problems.

Keywords: biosvm
[Speybroeck2002From] L Van Speybroeck. From epigenesis to epigenetics: The case of C. H. Waddington. Annals of the New York Academy of Sciences, 981:61-81, 2002. [ bib ]
Keywords: csbcbook
[Sonnenburg2002New] S. Sonnenburg, G. Rätsch, A. Jagota, and K.-R. Müller. New methods for splice-site recognition. In JR. Dorronsoro, editor, Proc. International Conference on Artificial Neural Networks (ICANN'02), number 2415 in LNCS, pages 329-336. Springer Berlin, 2002. [ bib | .pdf ]
Keywords: biosvm
[Song2002Prediction] Minghu Song, Curt M Breneman, Jinbo Bi, N. Sukumar, Kristin P Bennett, Steven Cramer, and Nihal Tugcu. Prediction of protein retention times in anion-exchange chromatography systems using support vector regression. J Chem Inf Comput Sci, 42(6):1347-57, 2002. [ bib ]
Quantitative Structure-Retention Relationship (QSRR) models are developed for the prediction of protein retention times in anion-exchange chromatography systems. Topological, subdivided surface area, and TAE (Transferable Atom Equivalent) electron-density-based descriptors are computed directly for a set of proteins using molecular connectivity patterns and crystal structure geometries. A novel algorithm based on Support Vector Machine (SVM) regression has been employed to obtain predictive QSRR models using a two-step computational strategy. In the first step, a sparse linear SVM was utilized as a feature selection procedure to remove irrelevant or redundant information. Subsequently, the selected features were used to produce an ensemble of nonlinear SVM regression models that were combined using bootstrap aggregation (bagging) techniques, where various combinations of training and validation data sets were selected from the pool of available data. A visualization scheme (star plots) was used to display the relative importance of each selected descriptor in the final set of "bagged" models. Once these predictive models have been validated, they can be used as an automated prediction tool for virtual high-throughput screening (VHTS).

Keywords: Acute, Algorithms, Animals, Anion Exchange Resins, Artificial Intelligence, Automated, Base Pair Mismatch, Base Pairing, Base Sequence, Biological, Biosensing Techniques, Carcinoma, Chemical, Chromatography, Classification, Cluster Analysis, Comparative Study, Computational Biology, Computer-Assisted, Cystadenoma, DNA, Decision Making, Diagnosis, Differential, Drug, Drug Design, Electrostatics, Eukaryotic Cells, Feasibility Studies, Female, Gene Expression, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Hemolysins, Humans, Internet, Ion Exchange, Leukemia, Ligands, Likelihood Functions, Logistic Models, Lung Neoplasms, Lymphocytic, Lymphoma, Markov Chains, Mathematics, Messenger, Models, Molecular, Molecular Probe Techniques, Molecular Sequence Data, Nanotechnology, Neoplasm, Neoplasms, Neoplastic, Neural Networks (Computer), Non-P.H.S., Non-Small-Cell Lung, Non-U.S. Gov't, Nucleic Acid Conformation, Nucleic Acid Hybridization, Observer Variation, Oligonucleotide Array Sequence Analysis, Ovarian Neoplasms, P.H.S., Pattern Recognition, Probability, Protein Binding, Protein Conformation, Proteins, Quality Control, Quantum Theory, RNA, RNA Splicing, Receptors, Reference Values, Regression Analysis, Reproducibility of Results, Research Support, Sensitivity and Specificity, Sequence Analysis, Signal Processing, Software, Statistical, Stomach Neoplasms, Thermodynamics, Transcription, Tumor Markers, U.S. Gov't, 12444731
[Slonim2002From] D. K. Slonim. From patterns to pathways: gene expression data analysis comes of age. Nat. Genet., 32 Suppl:502-508, Dec 2002. [ bib | DOI | http | .pdf ]
Many different biological questions are routinely studied using transcriptional profiling on microarrays. A wide range of approaches are available for gleaning insights from the data obtained from such experiments. The appropriate choice of data-analysis technique depends both on the data and on the goals of the experiment. This review summarizes some of the common themes in microarray data analysis, including detection of differential expression, clustering, and predicting sample characteristics. Several approaches to each problem, and their relative merits, are discussed and key areas for additional research highlighted.

[Singer2002Universal] A.C. Singer, S.S. Kozat, and M. Feder. Universal linear least squares prediction: upper and lower bounds. IEEE Trans. Inform. Theory, 48(8):2354-2362, Aug 2002. [ bib | DOI | http | .pdf ]
[Shipp2002Diffuse] M. A. Shipp, K. N. Ross, P. Tamayo, A. P. Weng, J. L. Kutok, R. C. T. Aguiar, M. Gaasenbeek, M. Angelo, M. Reich, G. A. Pinkus, T. S. Ray, M. A. Koval, K. W. Last, A. Norton, T. A. Lister, J. Mesirov, D. S. Neuberg, E. S. Lander, J. C. Aster, and T. R. Golub. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat. Med., 8(1):68-74, 2002. [ bib | DOI | .pdf ]
Diffuse large B-cell lymphoma (DLBCL), the most common lymphoid malignancy in adults, is curable in less than 50% [...] models based on pre-treatment characteristics, such as the International Prognostic Index (IPI), are currently used to predict outcome in DLBCL. However, clinical outcome models identify neither the molecular basis of clinical heterogeneity, nor specific therapeutic targets. We analyzed the expression of 6,817 genes in diagnostic tumor specimens from DLBCL patients who received cyclophosphamide, adriamycin, vincristine and prednisone (CHOP)-based chemotherapy, and applied a supervised learning prediction method to identify cured versus fatal or refractory disease. The algorithm classified two categories of patients with very different five-year overall survival rates (70% [...] within specific IPI risk categories who were likely to be cured or to die of their disease. Genes implicated in DLBCL outcome included some that regulate responses to B-cell receptor signaling, critical serine/threonine phosphorylation pathways and apoptosis. Our data indicate that supervised learning classification techniques can predict outcome in DLBCL and identify rational targets for intervention.

Keywords: biosvm
[Segre2002Analysis] D. Segrè, D. Vitkup, and G. M. Church. Analysis of optimality in natural and perturbed metabolic networks. Proc Natl Acad Sci U S A, 99(23):15112-15117, Nov 2002. [ bib | DOI | http | .pdf ]
An important goal of whole-cell computational modeling is to integrate detailed biochemical information with biological intuition to produce testable predictions. Based on the premise that prokaryotes such as Escherichia coli have maximized their growth performance along evolution, flux balance analysis (FBA) predicts metabolic flux distributions at steady state by using linear programming. Corroborating earlier results, we show that recent intracellular flux data for wild-type E. coli JM101 display excellent agreement with FBA predictions. Although the assumption of optimality for a wild-type bacterium is justifiable, the same argument may not be valid for genetically engineered knockouts or other bacterial strains that were not exposed to long-term evolutionary pressure. We address this point by introducing the method of minimization of metabolic adjustment (MOMA), whereby we test the hypothesis that knockout metabolic fluxes undergo a minimal redistribution with respect to the flux configuration of the wild type. MOMA employs quadratic programming to identify a point in flux space, which is closest to the wild-type point, compatibly with the gene deletion constraint. Comparing MOMA and FBA predictions to experimental flux data for E. coli pyruvate kinase mutant PB25, we find that MOMA displays a significantly higher correlation than FBA. Our method is further supported by experimental data for E. coli knockout growth rates. It can therefore be used for predicting the behavior of perturbed metabolic networks, whose growth performance is in general suboptimal. MOMA and its possible future extensions may be useful in understanding the evolutionary optimization of metabolism.

[Seeger2002Covariance] M. Seeger. Covariance Kernels from Bayesian Generative Models. In Adv. Neural Inform. Process. Syst., volume 14, pages 905-912, 2002. [ bib | www: ]
Keywords: biosvm
[Schoelkopf2002Kernel] B. Schölkopf, J. Weston, E. Eskin, C. Leslie, and W.S. Noble. A Kernel Approach for Learning from Almost Orthogonal Patterns. In Proceedings of ECML 2002, 2002. [ bib | .pdf | .pdf ]
[Scholkopf2002Learning] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, 2002. [ bib | http ]
[Schuster2002Reaction] S. Schuster, C. Hilgetag, J. H. Woods, and D. A. Fell. Reaction routes in biochemical reaction systems: algebraic properties, validated calculation procedure and example from nucleotide metabolism. J Math Biol, 45(2):153-181, Aug 2002. [ bib | DOI | http | .pdf ]
Elementary flux modes (direct reaction routes) are minimal sets of enzymes that can operate at steady state, with all irreversible reactions used in the appropriate direction. They can be interpreted as component pathways of a (bio)chemical reaction network. Here, two different definitions of elementary modes are given and their equivalence is proved. Several algebraic properties of elementary modes are then presented and proved. This concerns, amongst other features, the minimal number of enzymes of the network not used in an elementary mode and the situations where irreversible reactions are replaced by reversible ones. Based on these properties, a refined algorithm is presented, and it is formally proved that this algorithm will exclusively generate all the elementary flux modes of an arbitrary network containing reversible or irreversible reactions or both. The algorithm is illustrated by a biochemical example relevant in nucleotide metabolism. The computer implementation in two different programming languages is discussed.

[Schuffenhauer2002ontology] A. Schuffenhauer, J. Zimmermann, R. Stoop, J. J. van der Vyver, S. Lecchini, and E. Jacoby. An ontology for pharmaceutical ligands and its application for in silico screening and library design. J. Chem. Inf. Comput. Sci., 42(4):947-955, 2002. [ bib ]
Annotation efforts in biosciences have focused in past years mainly on the annotation of genomic sequences. Only very limited effort has been put into annotation schemes for pharmaceutical ligands. Here we propose annotation schemes for the ligands of four major target classes, enzymes, G protein-coupled receptors (GPCRs), nuclear receptors (NRs), and ligand-gated ion channels (LGICs), and outline their usage for in silico screening and combinatorial library design. The proposed schemes cover ligand functionality and hierarchical levels of target classification. The classification schemes are based on those established by the EC, GPCRDB, NuclearDB, and LGICDB. The ligands of the MDL Drug Data Report (MDDR) database serve as a reference data set of known pharmacologically active compounds. All ligands were annotated according to the schemes when attribution was possible based on the activity classification provided by the reference database. The purpose of the ligand-target classification schemes is to allow annotation-based searching of the ligand database. In addition, the biological sequence information of the target is directly linkable to the ligand, hereby allowing sequence similarity-based identification of ligands of next homologous receptors. Ligands of specified levels can easily be retrieved to serve as comprehensive reference sets for cheminformatics-based similarity searches and for design of target class focused compound libraries. Retrospective in silico screening experiments within the MDDR01.1 database, searching for structures binding to dopamine D2, all dopamine receptors and all amine-binding class A GPCRs using known dopamine D2 binding compounds as a reference set, have shown that such reference sets are in particular useful for the identification of ligands binding to receptors closely related to the reference system. The potential for ligand identification drops with increasing phylogenetic distance. 
The analysis of the focus of a tertiary amine based combinatorial library compared to known amine binding class A GPCRs, peptide binding class A GPCRs, and LGIC ligands constitutes a second application scenario which illustrates how the focus of a combinatorial library can be treated quantitatively. The provided annotation schemes, which bridge chem- and bioinformatics by linking ligands to sequences, are expected to be of key utility for further systematic chemogenomics exploration of previously well explored target families.

Keywords: chemogenomics
[Schmitt2002New] Stefan Schmitt, Daniel Kuhn, and Gerhard Klebe. A new method to detect related function among proteins independent of sequence and fold homology. J. Mol. Biol., 323(2):387-406, Oct 2002. [ bib ]
A new method has been developed to detect functional relationships among proteins independent of a given sequence or fold homology. It is based on the idea that protein function is intimately related to the recognition and subsequent response to the binding of a substrate or an endogenous ligand in a well-characterized binding pocket. Thus, recognition of similar ligands, supposedly linked to similar function, requires conserved recognition features exposed in terms of common physicochemical interaction properties via the functional groups of the residues flanking a particular binding cavity. Following a technique commonly used in the comparison of small molecule ligands, generic pseudocenters coding for possible interaction properties were assigned for a large sample set of cavities extracted from the entire PDB and stored in the database Cavbase. Using a particular query cavity a series of related cavities of decreasing similarity is detected based on a clique detection algorithm. The detected similarity is ranked according to property-based surface patches shared in common by the different clique solutions. The approach either retrieves protein cavities accommodating the same (e.g. co-factors) or closely related ligands or it extracts proteins exhibiting similar function in terms of a related catalytic mechanism. Finally the new method has strong potential to suggest alternative molecular skeletons in de novo design. The retrieval of molecular building blocks accommodated in a particular sub-pocket that shares similarity with the pocket in a protein studied by drug design can inspire the discovery of novel ligands.

Keywords: Algorithms; Binding Sites; Databases, Protein; Models, Molecular; Molecular Structure; Protein Binding; Protein Folding; Protein Structure, Tertiary; Proteins, chemistry/metabolism; Reproducibility of Results
[Sample2002Using] Pamela A Sample, Michael H Goldbaum, Kwokleung Chan, Catherine Boden, Te-Won Lee, Christiana Vasile, Andreas G Boehm, Terrence Sejnowski, Chris A Johnson, and Robert N Weinreb. Using machine learning classifiers to identify glaucomatous change earlier in standard visual fields. Invest Ophthalmol Vis Sci, 43(8):2660-5, Aug 2002. [ bib ]
PURPOSE: To compare the ability of several machine learning classifiers to predict development of abnormal fields at follow-up in ocular hypertensive (OHT) eyes that had normal visual fields in baseline examination. METHODS: The visual fields of 114 eyes of 114 patients with OHT with four or more visual field tests with standard automated perimetry over three or more years and for whom stereophotographs were available were assessed. The mean (+/-SD) number of visual field tests was 7.89 +/- 3.04. The mean number of years covered (+/-SD) was 5.92 +/- 2.34 (range, 2.81-11.77). Fields were classified as normal or abnormal based on Statpac-like methods (Humphrey Instruments, Dublin, CA) and by several machine learning classifiers. The machine learning classifiers were two types of support vector machine (SVM), a mixture of Gaussian (MoG) classifier, a constrained MoG, and a mixture of generalized Gaussian (MGG). Specificity was set to 96% for all classifiers, using data from 94 normal eyes evaluated longitudinally. Specificity cutoffs required confirmation of abnormality. RESULTS: Thirty-two percent (36/114) of the eyes converted to abnormal fields during follow-up based on the Statpac-like methods. All 36 were identified by at least one machine classifier. In nearly all cases, the machine learning classifiers predicted the confirmed abnormality, on average, 3.92 +/- 0.55 years earlier than traditional Statpac-like methods. CONCLUSIONS: Machine learning classifiers can learn complex patterns and trends in data and adapt to create a decision surface without the constraints imposed by statistical classifiers. This adaptation allowed the machine learning classifiers to identify abnormality in visual field converts much earlier than the traditional methods.

[Roth2002Thegeneralized] Volker Roth. The generalized lasso: a wrapper approach to gene selection for microarray data. In Proc. CADE-14, 252-255, 2002. [ bib ]
[Roth02Thegeneralized] Volker Roth. The generalized lasso: a wrapper approach to gene selection for microarray data. Technical report, Proceedings 14th International Conference on Automated Deduction (CADE-14), 252-255, 2002. [ bib ]
[Rhoades2002Prediction] Matthew W Rhoades, Brenda J Reinhart, Lee P Lim, Christopher B Burge, Bonnie Bartel, and David P Bartel. Prediction of plant microRNA targets. Cell, 110(4):513-520, Aug 2002. [ bib | .pdf ]
We predict regulatory targets for 14 Arabidopsis microRNAs (miRNAs) by identifying mRNAs with near complementarity. Complementary sites within predicted targets are conserved in rice. Of the 49 predicted targets, 34 are members of transcription factor gene families involved in developmental patterning or cell differentiation. The near-perfect complementarity between plant miRNAs and their targets suggests that many plant miRNAs act similarly to small interfering RNAs and direct mRNA cleavage. The targeting of developmental transcription factors suggests that many plant miRNAs function during cellular differentiation to clear key regulatory transcripts from daughter cell lineages.

Keywords: sirna
[Reche2002Prediction] P. A. Reche, J.-P. Glutting, and E. L. Reinherz. Prediction of MHC class I binding peptides using profile motifs. Hum. Immunol., 63(9):701-709, Sep 2002. [ bib ]
Peptides that bind to a given major histocompatibility complex (MHC) molecule share sequence similarity. Therefore, a position specific scoring matrix (PSSM) or profile derived from a set of peptides known to bind to a specific MHC molecule would be a suitable predictor of whether other peptides might bind, thus anticipating possible T-cell epitopes within a protein. In this approach, the binding potential of any peptide sequence (query) to a given MHC molecule is linked to its similarity to a group of aligned peptides known to bind to that MHC, and can be obtained by comparing the query to the PSSM. This article describes the derivation of alignments and profiles from a collection of peptides known to bind a specific MHC, compatible with the structural and molecular basis of the peptide-MHC class I (MHCI) interaction. Moreover, in order to apply these profiles to the prediction of peptide-MHCI binding, we have developed a new search algorithm (RANKPEP) that ranks all possible peptides from an input protein using the PSSM coefficients. The predictive power of the method was evaluated by running RANKPEP on proteins known to bear MHCI K(b)- and D(b)-restricted T-cell epitopes. Analysis of the results indicates that > 80% of these epitopes are among the top 2% of scoring peptides. Prediction of peptide-MHC binding using a variety of MHCI-specific PSSMs is available on line at our RANKPEP web server (www.mifoundation.org/Tools/rankpep.html). In addition, the RANKPEP server also allows the user to enter additional profiles, making the server a powerful and versatile computational biology benchmark for the prediction of peptide-MHC binding.

Keywords: immunoinformatics
[Quackenbush2002Microarray] John Quackenbush. Microarray data normalization and transformation. Nat Genet, 32 Suppl:496-501, Dec 2002. [ bib | DOI | http ]
Keywords: Animals; Data Interpretation, Statistical; Forecasting; Gene Expression Profiling, methods; Humans; Oligonucleotide Array Sequence Analysis, methods; Research Design
[Prakash2002Fetal] K. N Bhanu Prakash, A. G. Ramakrishnan, S. Suresh, and Teresa W P Chow. Fetal lung maturity analysis using ultrasound image features. IEEE Trans Inf Technol Biomed, 6(1):38-45, Mar 2002. [ bib ]
This pilot study was carried out to find the feasibility of analyzing the maturity of the fetal lung using ultrasound images. Data were collected from normal pregnant women at intervals of two weeks from the gestation age of 24 to 38 weeks. Images were acquired at two centers located at different geographical locations. The total data acquired consisted of 750 images of immature and 250 images of mature class. A region of interest of 64 x 64 pixels was used for extracting the features. Various textural features were computed from the fetal lung and liver images. The ratios of fetal lung to liver feature values were investigated as possible indexes for classifying the images into those from mature (reduced pulmonary risk) and immature (possible pulmonary risk) lung. The features used are fractal dimension, lacunarity, and features derived from the histogram of the images. The following classifiers were used to classify the fetal lung images as belonging to mature or immature lung: nearest neighbor, k-nearest neighbor, modified k-nearest neighbor, multilayer perceptron, radial basis function network, and support vector machines. The classification accuracy obtained for the testing set ranges from 73% to 96%.

[Pollack2002Microarray] Jonathan R Pollack, Therese Sørlie, Charles M Perou, Christian A Rees, Stefanie S Jeffrey, Per E Lonning, Robert Tibshirani, David Botstein, Anne-Lise Børresen-Dale, and Patrick O Brown. Microarray analysis reveals a major direct role of dna copy number alteration in the transcriptional program of human breast tumors. Proc Natl Acad Sci U S A, 99(20):12963-12968, Oct 2002. [ bib | DOI | http ]
Genomic DNA copy number alterations are key genetic events in the development and progression of human cancers. Here we report a genome-wide microarray comparative genomic hybridization (array CGH) analysis of DNA copy number variation in a series of primary human breast tumors. We have profiled DNA copy number alteration across 6,691 mapped human genes, in 44 predominantly advanced, primary breast tumors and 10 breast cancer cell lines. While the overall patterns of DNA amplification and deletion corroborate previous cytogenetic studies, the high-resolution (gene-by-gene) mapping of amplicon boundaries and the quantitative analysis of amplicon shape provide significant improvement in the localization of candidate oncogenes. Parallel microarray measurements of mRNA levels reveal the remarkable degree to which variation in gene copy number contributes to variation in gene expression in tumor cells. Specifically, we find that 62% of highly amplified genes show moderately or highly elevated expression, that DNA copy number influences gene expression across a wide range of DNA copy number alterations (deletion, low-, mid- and high-level amplification), that on average, a 2-fold change in DNA copy number is associated with a corresponding 1.5-fold change in mRNA levels, and that overall, at least 12% of all the variation in gene expression among the breast tumors is directly attributable to underlying variation in gene copy number. These findings provide evidence that widespread DNA copy number alteration can lead directly to global deregulation of gene expression, which may contribute to the development or progression of cancer.

Keywords: Breast Neoplasms, genetics; Chromosome Aberrations; Disease Progression; Gene Dosage; Genome; Humans; Oligonucleotide Array Sequence Analysis; RNA, Messenger, metabolism; Transcription, Genetic; Tumor Cells, Cultured
[Perez-Iratxeta2002Association] C. Perez-Iratxeta, P. Bork, and M. A. Andrade. Association of genes to genetically inherited diseases using data mining. Nat. Genet., 31(3):316-319, Jul 2002. [ bib | DOI | http | .pdf ]
Although approximately one-quarter of the roughly 4,000 genetically inherited diseases currently recorded in respective databases (LocusLink, OMIM) are already linked to a region of the human genome, about 450 have no known associated gene. Finding disease-related genes requires laborious examination of hundreds of possible candidate genes (sometimes, these are not even annotated; see, for example, refs 3,4). The public availability of the human genome draft sequence has fostered new strategies to map molecular functional features of gene products to complex phenotypic descriptions, such as those of genetically inherited diseases. Owing to recent progress in the systematic annotation of genes using controlled vocabularies, we have developed a scoring system for the possible functional relationships of human genes to 455 genetically inherited diseases that have been mapped to chromosomal regions without assignment of a particular gene. In a benchmark of the system with 100 known disease-associated genes, the disease-associated gene was among the 8 best-scoring genes with a 25% chance, and among the best 30 genes with a 50% chance, showing that there is a relationship between the score of a gene and its likelihood of being associated with a particular disease. The scoring also indicates that for some diseases, the chance of identifying the underlying gene is higher.

[Pavlidis2002Learning] P. Pavlidis, J. Weston, J. Cai, and W.S. Noble. Learning gene functional classifications from multiple data types. J. Comput. Biol., 9(2):401-411, 2002. [ bib | DOI | .pdf ]
In our attempts to understand cellular function at the molecular level, we must be able to synthesize information from disparate types of genomic data. We consider the problem of inferring gene functional classifications from a heterogeneous data set consisting of DNA microarray expression measurements and phylogenetic profiles from whole-genome sequence comparisons. We demonstrate the application of the support vector machine (SVM) learning algorithm to this functional inference task. Our results suggest the importance of exploiting prior information about the heterogeneity of the data. In particular, we propose an SVM kernel function that is explicitly heterogeneous. In addition, we describe feature scaling methods for further exploiting prior knowledge of heterogeneity by giving each data type different weights.

Keywords: biosvm
[Pavlidis2002Exploring] P. Pavlidis, D. P. Lewis, and W. S. Noble. Exploring gene expression data with class scores. Pac. Symp. Biocomput., pages 474-485, 2002. [ bib | .pdf ]
We address a commonly asked question about gene expression data sets: "What functional classes of genes are most interesting in the data?" In the methods we present, expression data is partitioned into classes based on existing annotation schemes. Each class is then given three separately derived "interest" scores. The first score is based on an assessment of the statistical significance of gene expression changes experienced by members of the class, in the context of the experimental design. The second is based on the co-expression of genes in the class. The third score is based on the learnability of the classification. We show that all three methods reveal significant classes in each of three different gene expression data sets. Many classes are identified by one method but not the others, indicating that the methods are complementary. The classes identified are in many cases of clear relevance to the experiment. Our results suggest that these class scoring methods are useful tools for exploring gene expression data.

[Patterson2002Pre-mRNA] D.J. Patterson, K. Yasuhara, and W.L. Ruzzo. Pre-mRNA secondary structure prediction aids splice site prediction. In Russ B. Altman, A. Keith Dunker, Lawrence Hunter, Kevin Lauerdale, and Teri E. Klein, editors, Proceedings of the Pacific Symposium on Biocomputing 2002, pages 223-234. World Scientific, 2002. [ bib | .pdf | .pdf ]
Accurate splice site prediction is a critical component of any computational approach to gene prediction in higher organisms. Existing approaches generally use sequence-based models that capture local dependencies among nucleotides in a small window around the splice site. We present evidence that computationally predicted secondary structure of moderate length pre-mRNA subsequences contains information that can be exploited to improve acceptor splice site prediction beyond that possible with conventional sequence-based approaches. Both decision tree and support vector machine classifiers, using folding energy and structure metrics characterizing helix formation near the splice site, achieve a 5-10 a human data set. Based on our data, we hypothesize that acceptors preferentially exhibit short helices at the splice site.

Keywords: biosvm
[Pastor-Satorras2002Evolving] R. Pastor-Satorras, E. D. Smith, and R. V. Solé. Evolving protein interaction networks through gene duplication. Technical report, Santa Fe Institute, 2002. Working paper 02-02-008. [ bib | .html | .pdf ]
[Newman2002Random] M. E. J. Newman. Random graphs as models of networks. In S. Bornholdt and H. G. Schuster, editors, Handbook of Graphs and Networks. Wiley-VCH, Berlin, 2002. To appear. [ bib | http | .pdf ]
[Myasnikova2002Support] E. Myasnikova, A. Samsonova, M. Samsonova, and J. Reinitz. Support vector regression applied to the determination of the developmental age of a Drosophila embryo from its segmentation gene expression patterns. Bioinformatics, 18(Suppl. 1):S87-S95, 2002. [ bib | http | .pdf ]
Motivation: In this paper we address the problem of the determination of developmental age of an embryo from its segmentation gene expression patterns in Drosophila. Results: By applying support vector regression we have developed a fast method for automated staging of an embryo on the basis of its gene expression pattern. Support vector regression is a statistical method for creating regression functions of arbitrary type from a set of training data. The training set is composed of embryos for which the precise developmental age was determined by measuring the degree of membrane invagination. Testing the quality of regression on the training set showed good prediction accuracy. The optimal regression function was then used for the prediction of the gene expression based age of embryos in which the precise age has not been measured by membrane morphology. Moreover, we show that the same accuracy of prediction can be achieved when the dimensionality of the feature vector was reduced by applying factor analysis. The data reduction allowed us to avoid over-fitting and to increase the efficiency of the algorithm. Availability: This software may be obtained from the authors. Contact: samson@fn.csa.ru Keywords: gene expression patterns; development; embryo staging; support vector regression; segmentation genes; Drosophila.

Keywords: biosvm
[Misra2002Interactive] J. Misra, W. Schmitt, D. Hwang, L.-L. Hsiao, S. Gullans, G. Stephanopoulos, and G. Stephanopoulos. Interactive exploration of microarray gene expression patterns in a reduced dimensional space. Genome Res., 12(7):1112-1120, Jul 2002. [ bib | DOI | http | .pdf ]
The very high dimensional space of gene expression measurements obtained by DNA microarrays impedes the detection of underlying patterns in gene expression data and the identification of discriminatory genes. In this paper we show the use of projection methods such as principal components analysis (PCA) to obtain a direct link between patterns in the genes and patterns in samples. This feature is useful in the initial interactive pattern exploration of gene expression data and data-driven learning of the nature and types of samples. Using oligonucleotide microarray measurements of 40 samples from different normal human tissues, we show that distinct patterns are obtained when the genes are projected on a two-dimensional plane spanned by the loadings of the two major principal components. These patterns define the particular genes associated with a sample class (i.e., tissue). When used separately from the other genes, these class-specific (i.e., tissue-specific) genes in turn define distinct tissue patterns in the projection space spanned by the scores of the two major principal components. In this study, PCA projection facilitated discriminatory gene selection for different tissues and identified tissue-specific gene expression signatures for liver, skeletal muscle, and brain samples. Furthermore, it allowed the classification of nine new samples belonging to these three types using the linear combination of the expression levels of the tissue-specific genes determined from the first set of samples. The application of the technique to other published data sets is also discussed.

[Mewes2002MIPS:] H.W. Mewes, D. Frishman, U. Güldener, G. Mannhaupt, K. Mayer, M. Mokrejs, B. Morgenstern, M. Münsterkoetter, S. Rudd, and B. Weil. MIPS: a database for genomes and protein sequences. Nucleic Acids Res., 30(1):31-34, 2002. [ bib | http | .pdf ]
[McMichael2002quest] A. McMichael and T. Hanke. The quest for an AIDS vaccine: is the CD8+ T-cell approach feasible? Nat. Rev. Immunol., 2(4):283-291, Apr 2002. [ bib | DOI | http ]
The rationale for developing anti-HIV vaccines that stimulate cytotoxic T-lymphocyte responses is given. We argue that such vaccines will work, provided that attention is paid to the development of memory T-cell responses that are strong and preferably activated. Furthermore, the vaccine should match the prevailing virus clade as closely as possible. Vaccines will have to stimulate a wide range of responses, but it is not clear how this can be achieved.

Keywords: immunoinformatics
[McManus2002Gene] M. T. McManus and P. A. Sharp. Gene silencing in mammals by small interfering RNAs. Nat. Rev. Genet., 3(10):737-747, Oct 2002. [ bib | DOI | http | .pdf ]
Among the 3 billion base pairs of the human genome, there are approximately 30,000-40,000 protein-coding genes, but the function of at least half of them remains unknown. A new tool - short interfering RNAs (siRNAs) - has now been developed for systematically deciphering the functions and interactions of these thousands of genes. siRNAs are an intermediate of RNA interference, the process by which double-stranded RNA silences homologous genes. Although the use of siRNAs to silence genes in vertebrate cells was only reported a year ago, the emerging literature indicates that most vertebrate genes can be studied with this technology.

Keywords: sirna
[Mateos2002Systematic] Alvaro Mateos, Joaquín Dopazo, Ronald Jansen, Yuhai Tu, Mark Gerstein, and Gustavo Stolovitzky. Systematic learning of gene functional classes from DNA array expression data by using multilayer perceptrons. Genome Res., 12(11):1703-15, Nov 2002. [ bib | DOI | http | .pdf ]
Recent advances in microarray technology have opened new ways for functional annotation of previously uncharacterised genes on a genomic scale. This has been demonstrated by unsupervised clustering of co-expressed genes and, more importantly, by supervised learning algorithms. Using prior knowledge, these algorithms can assign functional annotations based on more complex expression signatures found in existing functional classes. Previously, support vector machines (SVMs) and other machine-learning methods have been applied to a limited number of functional classes for this purpose. Here we present, for the first time, the comprehensive application of supervised neural networks (SNNs) for functional annotation. Our study is novel in that we report systematic results for approximately 100 classes in the Munich Information Center for Protein Sequences (MIPS) functional catalog. We found that only approximately 10% of these are learnable (based on the rate of false negatives). A closer analysis reveals that false positives (and negatives) in a machine-learning context are not necessarily "false" in a biological sense. We show that the high degree of interconnections among functional classes confounds the signatures that ought to be learned for a unique class. We term this the "Borges effect" and introduce two new numerical indices for its quantification. Our analysis indicates that classification systems with a lower Borges effect are better suitable for machine learning. Furthermore, we introduce a learning procedure for combining false positives with the original class. We show that in a few iterations this process converges to a gene set that is learnable with considerably low rates of false positives and negatives and contains genes that are biologically related to the original class, allowing for a coarse reconstruction of the interactions between associated biological pathways. We exemplify this methodology using the well-studied tricarboxylic acid cycle.

Keywords: Acute, Algorithms, Animals, Anion Exchange Resins, Artificial Intelligence, Automated, Base Pair Mismatch, Base Pairing, Base Sequence, Biological, Biosensing Techniques, Carcinoma, Chemical, Chromatography, Citric Acid Cycle, Classification, Cluster Analysis, Comparative Study, Computational Biology, Computer-Assisted, Cystadenoma, DNA, Databases, Decision Making, Diagnosis, Differential, Drug, Drug Design, Electrostatics, Eukaryotic Cells, Factual, Feasibility Studies, Female, Gene Expression, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Heterogeneity, Genetic Markers, Hemolysins, Humans, Internet, Ion Exchange, Leukemia, Ligands, Likelihood Functions, Logistic Models, Lung Neoplasms, Lymphocytic, Lymphoma, Markov Chains, Mathematics, Messenger, Models, Molecular, Molecular Probe Techniques, Molecular Sequence Data, Nanotechnology, Neoplasm, Neoplasms, Neoplastic, Neural Networks (Computer), Non-P.H.S., Non-Small-Cell Lung, Non-U.S. Gov't, Nucleic Acid Conformation, Nucleic Acid Hybridization, Observer Variation, Oligonucleotide Array Sequence Analysis, Ovarian Neoplasms, P.H.S., Pattern Recognition, Probability, Protein Binding, Protein Conformation, Proteins, Quality Control, Quantum Theory, RNA, RNA Splicing, Receptors, Reference Values, Regression Analysis, Reproducibility of Results, Research Support, Saccharomyces cerevisiae Proteins, Sensitivity and Specificity, Sequence Analysis, Signal Processing, Software, Statistical, Stomach Neoplasms, Structural, Structure-Activity Relationship, Thermodynamics, Transcription, Tumor Markers, U.S. Gov't, 12421757
[Matache2002Hilbert] M. T. Matache and V. Matache. Hilbert spaces induced by Toeplitz covariance kernels. In Lecture Notes in Control and Information Sciences, volume 280, pages 319-334. Springer, Jan 2002. [ bib ]
[Maslov2002Specificity] S. Maslov and K. Sneppen. Specificity and stability in topology of protein networks. Science, 296:910-913, 2002. [ bib | .pdf | .pdf ]
[Martoglio2002decomposition] Ann-Marie Martoglio, James W Miskin, Stephen K Smith, and David J C MacKay. A decomposition model to track gene expression signatures: preview on observer-independent classification of ovarian cancer. Bioinformatics, 18(12):1617-24, Dec 2002. [ bib ]
MOTIVATION: A number of algorithms and analytical models have been employed to reduce the multidimensional complexity of DNA array data and attempt to extract some meaningful interpretation of the results. These include clustering, principal components analysis, self-organizing maps, and support vector machine analysis. Each method assumes an implicit model for the data, many of which separate genes into distinct clusters defined by similar expression profiles in the samples tested. A point of concern is that many genes may be involved in a number of distinct behaviours, and should therefore be modelled to fit into as many separate clusters as detected in the multidimensional gene expression space. The analysis of gene expression data using a decomposition model that is independent of the observer involved would be highly beneficial to improve standard and reproducible classification of clinical and research samples. RESULTS: We present a variational independent component analysis (ICA) method for reducing high dimensional DNA array data to a smaller set of latent variables, each associated with a gene signature. We present the results of applying the method to data from an ovarian cancer study, revealing a number of tissue type-specific and tissue type-independent gene signatures present in varying amounts among the samples surveyed. The observer independent results of such molecular analysis of biological samples could help identify patients who would benefit from different treatment strategies. We further explore the application of the model to similar high-throughput studies.

Keywords: Acute, Algorithms, Automated, Base Pair Mismatch, Base Pairing, Base Sequence, Biological, Biosensing Techniques, Cluster Analysis, Comparative Study, Computer-Assisted, Cystadenoma, DNA, Female, Gene Expression, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Hemolysins, Humans, Leukemia, Lymphocytic, Markov Chains, Messenger, Models, Molecular Probe Techniques, Molecular Sequence Data, Nanotechnology, Neoplasm, Neoplastic, Neural Networks (Computer), Non-U.S. Gov't, Nucleic Acid Conformation, Observer Variation, Oligonucleotide Array Sequence Analysis, Ovarian Neoplasms, Pattern Recognition, Quality Control, RNA, Reference Values, Reproducibility of Results, Research Support, Sensitivity and Specificity, Signal Processing, Statistical, Stomach Neoplasms, Transcription, Tumor Markers, 12490446
[Marsland2002self-organising] Stephen Marsland, Jonathan Shapiro, and Ulrich Nehmzow. A self-organising network that grows when required. Neural Netw, 15(8-9):1041-58, 2002. [ bib ]
The ability to grow extra nodes is a potentially useful facility for a self-organising neural network. A network that can add nodes into its map space can approximate the input space more accurately, and often more parsimoniously, than a network with predefined structure and size, such as the Self-Organising Map. In addition, a growing network can deal with dynamic input distributions. Most of the growing networks that have been proposed in the literature add new nodes to support the node that has accumulated the highest error during previous iterations or to support topological structures. This usually means that new nodes are added only when the number of iterations is an integer multiple of some pre-defined constant, A. This paper suggests a way in which the learning algorithm can add nodes whenever the network in its current state does not sufficiently match the input. In this way the network grows very quickly when new data is presented, but stops growing once the network has matched the data. This is particularly important when we consider dynamic data sets, where the distribution of inputs can change to a new regime after some time. We also demonstrate the preservation of neighbourhood relations in the data by the network. The new network is compared to an existing growing network, the Growing Neural Gas (GNG), on an artificial dataset, showing how the network deals with a change in input distribution after some time. Finally, the new network is applied to several novelty detection tasks and is compared with both the GNG and an unsupervised form of the Reduced Coulomb Energy network on a robotic inspection task and with a Support Vector Machine on two benchmark novelty detection tasks.

Keywords: Acute, Algorithms, Animals, Anion Exchange Resins, Artificial Intelligence, Automated, Base Pair Mismatch, Base Pairing, Base Sequence, Biological, Biosensing Techniques, Carcinoma, Chemical, Chromatography, Citric Acid Cycle, Classification, Cluster Analysis, Comparative Study, Computational Biology, Computer-Assisted, Cystadenoma, DNA, Databases, Decision Making, Diagnosis, Differential, Drug, Drug Design, Electrostatics, Eukaryotic Cells, Factual, Feasibility Studies, Female, Gene Expression, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Heterogeneity, Genetic Markers, Hemolysins, Humans, Internet, Ion Exchange, Leukemia, Ligands, Likelihood Functions, Logistic Models, Lung Neoplasms, Lymphocytic, Lymphoma, Markov Chains, Mathematics, Messenger, Models, Molecular, Molecular Probe Techniques, Molecular Sequence Data, Nanotechnology, Neoplasm, Neoplasms, Neoplastic, Neural Networks (Computer), Non-P.H.S., Non-Small-Cell Lung, Non-U.S. Gov't, Nucleic Acid Conformation, Nucleic Acid Hybridization, Observer Variation, Oligonucleotide Array Sequence Analysis, Ovarian Neoplasms, P.H.S., Pattern Recognition, Probability, Probability Learning, Protein Binding, Protein Conformation, Proteins, Quality Control, Quantum Theory, RNA, RNA Splicing, Receptors, Reference Values, Regression Analysis, Reproducibility of Results, Research Support, Robotics, Saccharomyces cerevisiae Proteins, Sensitivity and Specificity, Sequence Analysis, Signal Processing, Software, Statistical, Stomach Neoplasms, Structural, Structure-Activity Relationship, Thermodynamics, Transcription, Tumor Markers, U.S. Gov't, 12416693
[MacBeath2002Protein] Gavin MacBeath. Protein microarrays and proteomics. Nat Genet, 32 Suppl:526-532, Dec 2002. [ bib | DOI | http ]
The system-wide study of proteins presents an exciting challenge in this information-rich age of whole-genome biology. Although traditional investigations have yielded abundant information about individual proteins, they have been less successful at providing us with an integrated understanding of biological systems. The promise of proteomics is that, by studying many components simultaneously, we will learn how proteins interact with each other, as well as with non-proteinaceous molecules, to control complex processes in cells, tissues and even whole organisms. Here, I discuss the role of microarray technology in this burgeoning area.

Keywords: Forecasting; Humans; Immunoassay, methods; Protein Array Analysis, methods; Proteomics, methods
[Lodhi2002Text] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. J. Mach. Learn. Res., 2:419-444, 2002. [ bib | .html | .pdf ]
Keywords: biosvm
[Liu2002Comparative] H. Liu, J. Li, and L. Wong. A Comparative Study on Feature Selection and Classification Methods Using Gene Expression Profiles and Proteomic Patterns. In R. Lathrop, K. Nakai, S. Miyano, T. Takagi, and M. Kanehisa, editors, Genome Informatics 2002, volume 12, Tokyo, 2002. Universal Academy Press. [ bib | .html | .pdf ]
Feature selection plays an important role in classification. We present a comparative study on six feature selection heuristics by applying them to two sets of data. The first set of data are gene expression profiles from Acute Lymphoblastic Leukemia (ALL) patients. The second set of data are proteomic patterns from ovarian cancer patients. Based on features chosen by these methods, error rates of several classification algorithms were obtained for analysis. Our results demonstrate the importance of feature selection in accurately classifying new samples.

[Liu2002Partially] B. Liu, W. S. Lee, P. S. Yu, and X. Li. Partially supervised classification of text documents. In C. Sammut and A. G. Hoffmann, editors, ICML '02: Proceedings of the Nineteenth International Conference on Machine Learning, pages 387-394, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc. [ bib | http | .pdf ]
Keywords: PUlearning
[Lin2002Support] Y. Lin. Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery, 6(3):259-275, 2002. [ bib | DOI | http | .pdf ]
[Lin2002Conserved] K. Lin, Y. Kuang, J. S. Joseph, and P. R. Kolatkar. Conserved codon composition of ribosomal protein coding genes in Escherichia coli, Mycobacterium tuberculosis and Saccharomyces cerevisiae: lessons from supervised machine learning in functional genomics. Nucl. Acids Res., 30(11):2599-2607, 2002. [ bib | http | .pdf ]
Genomics projects have resulted in a flood of sequence data. Functional annotation currently relies almost exclusively on inter-species sequence comparison and is restricted in cases of limited data from related species and widely divergent sequences with no known homologs. Here, we demonstrate that codon composition, a fusion of codon usage bias and amino acid composition signals, can accurately discriminate, in the absence of sequence homology information, cytoplasmic ribosomal protein genes from all other genes of known function in Saccharomyces cerevisiae, Escherichia coli and Mycobacterium tuberculosis using an implementation of support vector machines, SVMlight. Analysis of these codon composition signals is instructive in determining features that confer individuality to ribosomal protein genes. Each of the sets of positively charged, negatively charged and small hydrophobic residues, as well as codon bias, contribute to their distinctive codon composition profile. The representation of all these signals is sensitively detected, combined and augmented by the SVMs to perform an accurate classification. Of special mention is an obvious outlier, yeast gene RPL22B, highly homologous to RPL22A but employing very different codon usage, perhaps indicating a non-ribosomal function. Finally, we propose that codon composition be used in combination with other attributes in gene/protein classification by supervised machine learning algorithms.

Keywords: biosvm
[Liebermeister2002Linear] W. Liebermeister. Linear modes of gene expression determined by independent component analysis. Bioinformatics, 18(1):51-60, Jan 2002. [ bib | DOI | http | .pdf ]
MOTIVATION: The expression of genes is controlled by specific combinations of cellular variables. We applied Independent Component Analysis (ICA) to gene expression data, deriving a linear model based on hidden variables, which we term 'expression modes'. The expression of each gene is a linear function of the expression modes, where, according to the ICA model, the linear influences of different modes show a minimal statistical dependence, and their distributions deviate sharply from the normal distribution. RESULTS: Studying cell cycle-related gene expression in yeast, we found that the dominant expression modes could be related to distinct biological functions, such as phases of the cell cycle or the mating response. Analysis of human lymphocytes revealed modes that were related to characteristic differences between cell types. With both data sets, the linear influences of the dominant modes showed distributions with large tails, indicating the existence of specifically up- and downregulated target genes. The expression modes and their influences can be used to visualize the samples and genes in low-dimensional spaces. A projection to expression modes helps to highlight particular biological functions, to reduce noise, and to compress the data in a biologically sensible way.

[Liberles2002use] D. A. Liberles, A. Thorén, G. von Heijne, and A. Elofsson. The use of phylogenetic profiles for gene predictions. Curr. Genom., 2002. To appear. [ bib | .pdf | .pdf ]
[Liaw2002Classification] A. Liaw and M. Wiener. Classification and Regression by randomForest. R News, 2(3):18-22, 2002. [ bib ]
[Liao2002Combining] L. Liao and W. S. Noble. Combining pairwise sequence similarity and support vector machines for remote protein homology detection. In Proceedings of the Sixth International Conference on Computational Molecular Biology, 2002. [ bib | .html | .pdf ]
Keywords: biosvm
[Li2002Involvement] Jiwen Li, Qiushi Lin, Ho-Geun Yoon, Zhi-Qing Huang, Brian D Strahl, C. David Allis, and Jiemin Wong. Involvement of histone methylation and phosphorylation in regulation of transcription by thyroid hormone receptor. Mol Cell Biol, 22(16):5688-5697, Aug 2002. [ bib ]
Previous studies have established an important role of histone acetylation in transcriptional control by nuclear hormone receptors. With chromatin immunoprecipitation assays, we have now investigated whether histone methylation and phosphorylation are also involved in transcriptional regulation by thyroid hormone receptor (TR). We found that repression by unliganded TR is associated with a substantial increase in methylation of H3 lysine 9 (H3-K9) and a decrease in methylation of H3 lysine 4 (H3-K4), methylation of H3 arginine 17 (H3-R17), and a dual modification of phosphorylation of H3 serine 10 and acetylation of lysine 14 (pS10/acK14). On the other hand, transcriptional activation by liganded TR is coupled with a substantial decrease in both H3-K4 and H3-K9 methylation and a robust increase in H3-R17 methylation and the dual modification of pS10/acK14. Trichostatin A treatment results in not only histone hyperacetylation but also an increase in methylation of H3-K4, increase in dual modification of pS10/acK14, and reduction in methylation of H3-K9, revealing an extensive interplay between histone acetylation, methylation, and phosphorylation. In an effort to understand the underlying mechanism for an increase in H3-K9 methylation during repression by unliganded TR, we demonstrated that TR interacts in vitro with an H3-K9-specific histone methyltransferase (HMT), SUV39H1. Functional analysis indicates that SUV39H1 can facilitate repression by unliganded TR and in so doing requires its HMT activity. Together, our data uncover a novel role of H3-K9 methylation in repression by unliganded TR and provide strong evidence for the involvement of multiple distinct histone covalent modifications (acetylation, methylation, and phosphorylation) in transcriptional control by nuclear hormone receptors.

Keywords: Animals; Cell Fractionation; Gene Expression Regulation, drug effects; Genes, Reporter; Histone-Lysine N-Methyltransferase; Histones, chemistry/genetics/metabolism; Humans; Hydroxamic Acids, pharmacology; Methylation; Methyltransferases, metabolism; Oocytes, physiology; Phosphorylation; Protein Methyltransferases; Protein Synthesis Inhibitors, pharmacology; Receptors, Thyroid Hormone, metabolism; Transcription, Genetic; Xenopus laevis, physiology
[Leslie2002spectrum] C. Leslie, E. Eskin, and W.S. Noble. The spectrum kernel: a string kernel for SVM protein classification. In Russ B. Altman, A. Keith Dunker, Lawrence Hunter, Kevin Lauerdale, and Teri E. Klein, editors, Proceedings of the Pacific Symposium on Biocomputing 2002, pages 564-575, Singapore, 2002. World Scientific. [ bib | .pdf ]
Keywords: biosvm
[Lasonder2002Analysis] Edwin Lasonder, Yasushi Ishihama, Jens S Andersen, Adriaan M W Vermunt, Arnab Pain, Robert W Sauerwein, Wijnand M C Eling, Neil Hall, Andrew P Waters, Hendrik G Stunnenberg, and Matthias Mann. Analysis of the Plasmodium falciparum proteome by high-accuracy mass spectrometry. Nature, 419(6906):537-542, Oct 2002. [ bib | DOI | http | .pdf ]
The annotated genomes of organisms define a 'blueprint' of their possible gene products. Post-genome analyses attempt to confirm and modify the annotation and impose a sense of the spatial, temporal and developmental usage of genetic information by the organism. Here we describe a large-scale, high-accuracy (average deviation less than 0.02 Da at 1,000 Da) mass spectrometric proteome analysis of selected stages of the human malaria parasite Plasmodium falciparum. The analysis revealed 1,289 proteins of which 714 proteins were identified in asexual blood stages, 931 in gametocytes and 645 in gametes. The last two groups provide insights into the biology of the sexual stages of the parasite, and include conserved, stage-specific, secreted and membrane-associated proteins. A subset of these proteins contain domains that indicate a role in cell-cell interactions, and therefore can be evaluated as potential components of a malaria vaccine formulation. We also report a set of peptides with significant matches in the parasite genome but not in the protein set predicted by computational methods.

Keywords: plasmodium
[Kramer2002Fragment] S. Kramer, E. Frank, and C. Helma. Fragment generation and support vector machines for inducing SARs. SAR QSAR Environ Res, 13(5):509-23, Jul 2002. [ bib | DOI | http ]
We present a new approach to the induction of SARs based on the generation of structural fragments and support vector machines (SVMs). It is tailored for bio-chemical databases, where the examples are two-dimensional descriptions of chemical compounds. The fragment generator finds all fragments (i.e. linearly connected atoms) that satisfy user-specified constraints regarding their frequency and generality. In this paper, we are querying for fragments within a minimum and a maximum frequency in the dataset. After fragment generation, we propose to apply SVMs to the problem of inducing SARs from these fragments. We conjecture that the SVMs are particularly useful in this context, as they can deal with a large number of features. Experiments in the domains of carcinogenicity and mutagenicity prediction show that the minimum and the maximum frequency queries for fragments can be answered within a reasonable time, and that the predictive accuracy obtained using these fragments is satisfactory. However, further experiments will have to confirm that this is a viable approach to inducing SARs.

Keywords: biosvm
[Kou2002Karyotyping] Zhenzhen Kou, Liang Ji, and Xuegong Zhang. Karyotyping of comparative genomic hybridization human metaphases by using support vector machines. Cytometry, 47(1):17-23, Jan 2002. [ bib | DOI | http | .pdf ]
BACKGROUND: Comparative genomic hybridization (CGH) is a relatively new molecular cytogenetic method for detecting chromosomal imbalance. Karyotyping of human metaphases is an important step to assign each chromosome to one of 23 or 24 classes (22 autosomes and two sex chromosomes). Automatic karyotyping in CGH analysis is needed. However, conventional karyotyping approaches based on DAPI images require complex image enhancement procedures. METHODS: This paper proposes a simple feature extraction method, one that generates density profiles from original true color CGH images and uses normalized profiles as feature vectors without quantization. A classifier is developed by using a support vector machine (SVM). It has good generalization ability and needs only limited training samples. RESULTS: Experiment results show that the feature extraction method of using color information in CGH images can greatly improve the classification success rate. The SVM classifier is able to acquire knowledge about human chromosomes from relatively few samples and has good generalization ability. A success rate of more than 90% has been achieved and the time for training and testing is very short. CONCLUSIONS: The feature extraction method proposed here and the SVM-based classifier offer a promising computerized intelligent system for automatic karyotyping of CGH human chromosomes.

Keywords: cgh
[Kondor2002Diffusion] R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input. In Proceedings of the Nineteenth International Conference on Machine Learning, pages 315-322, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc. [ bib | .pdf ]
Keywords: biosvm
[Kitano2002Computational] H. Kitano. Computational systems biology. Nature, 420:206-210, 2002. [ bib | DOI | http | .pdf ]
To understand complex biological systems requires the integration of experimental and computational research - in other words a systems biology approach. Computational biology, through pragmatic modelling and theoretical exploration, provides a powerful foundation from which to address critical scientific questions head-on. The reviews in this Insight cover many different aspects of this energetic field, although all, in one way or another, illuminate the functioning of modular circuits, including their robustness, design and manipulation. Computational systems biology addresses questions fundamental to our understanding of life, yet progress here will lead to practical innovations in medicine, drug discovery and engineering.

[Kin2002Marginalized] T. Kin, K. Tsuda, and K. Asai. Marginalized kernels for RNA sequence data analysis. In R.H. Lathrop, K. Nakai, S. Miyano, T. Takagi, and M. Kanehisa, editors, Genome Informatics 2002, pages 112-122. Universal Academic Press, 2002. [ bib | .html | .pdf ]
We present novel kernels that measure similarity of two RNA sequences, taking account of their secondary structures. Two types of kernels are presented. One is for RNA sequences with known secondary structures, the other for those without known secondary structures. The latter employs stochastic context-free grammar (SCFG) for estimating the secondary structure. We call the latter the marginalized count kernel (MCK). We show computational experiments for MCK using 74 sets of human tRNA sequence data: (i) kernel principal component analysis (PCA) for visualizing tRNA similarities, (ii) supervised classification with support vector machines (SVMs). Both types of experiment show promising results for MCKs.

Keywords: biosvm
[Karchin2002Classifying] R. Karchin, K. Karplus, and D. Haussler. Classifying G-protein coupled receptors with support vector machines. Bioinformatics, 18:147-159, 2002. [ bib | http | .pdf ]
Motivation: The enormous amount of protein sequence data uncovered by genome research has increased the demand for computer software that can automate the recognition of new proteins. We discuss the relative merits of various automated methods for recognizing G-Protein Coupled Receptors (GPCRs), a superfamily of cell membrane proteins. GPCRs are found in a wide range of organisms and are central to a cellular signalling network that regulates many basic physiological processes. They are the focus of a significant amount of current pharmaceutical research because they play a key role in many diseases. However, their tertiary structures remain largely unsolved. The methods described in this paper use only primary sequence information to make their predictions. We compare a simple nearest neighbor approach (BLAST), methods based on multiple alignments generated by a statistical profile Hidden Markov Model (HMM), and methods, including Support Vector Machines (SVMs), that transform protein sequences into fixed-length feature vectors. Results: The last is the most computationally expensive method, but our experiments show that, for those interested in annotation-quality classification, the results are worth the effort. In two-fold cross-validation experiments testing recognition of GPCR subfamilies that bind a specific ligand (such as a histamine molecule), the errors per sequence at the Minimum Error Point (MEP) were 13.7 SVMs, 17.1 SVM classification, 25.5 and 49 Kernel Nearest Neighbor (kernNN). The percentage of true positives recognized before the first false positive was 65 both SVM methods, 13 for kernNN. Availability: We have set up a web server for GPCR subfamily classification based on hierarchical multi-class SVMs at http://www.soe.ucsc.edu/research/compbio/gpcr-subclass. By scanning predicted peptides found in the human genome with the SVMtree server, we have identified a large number of genes that encode GPCRs. 
A list of our predictions for human GPCRs is available at http://www.soe.ucsc.edu/research/compbio/gpcr·hg/class·results. We also provide suggested subfamily classification for 18 sequences previously identified as unclassified Class A (rhodopsin-like) GPCRs in GPCRDB (Horn et al., Nucleic Acids Res., 26, 277-281, 1998), available at http://www.soe.ucsc.edu/research/compbio/gpcr/classA·unclassified/

Keywords: fisher-kernel sequence-classification biosvm
[Kanehisa2002KEGG] M. Kanehisa, S. Goto, S. Kawashima, and A. Nakaya. The KEGG databases at GenomeNet. Nucleic Acids Res., 30:42-46, 2002. [ bib | http | .pdf ]
[Kandola2002On] J. Kandola, J. Shawe-Taylor, and N. Cristianini. On the application of diffusion kernel to text data. Technical report, Neurocolt, 2002. NeuroCOLT Technical Report NC-TR-02-122. [ bib | .html | .ps.gz ]
[Jones2002fundamental] P. A. Jones and S. B. Baylin. The fundamental role of epigenetic events in cancer. Nat. Rev. Genet., 3(6):415-428, Jun 2002. [ bib | DOI | http | .pdf ]
Patterns of DNA methylation and chromatin structure are profoundly altered in neoplasia and include genome-wide losses of, and regional gains in, DNA methylation. The recent explosion in our knowledge of how chromatin organization modulates gene transcription has further highlighted the importance of epigenetic mechanisms in the initiation and progression of human cancer. These epigenetic changes - in particular, aberrant promoter hypermethylation that is associated with inappropriate gene silencing - affect virtually every step in tumour progression. In this review, we discuss these epigenetic events and the molecular alterations that might cause them and/or underlie altered gene expression in cancer.

[Jones2002DNA] Peter A Jones. DNA methylation and cancer. Oncogene, 21(35):5358-5360, Aug 2002. [ bib | DOI | http ]
There is tremendous ferment in the field of epigenetics as the relationships between chromatin structure and DNA methylation patterns become clearer. Central to this activity is the realization that the 'histone code', which involves the post-translational modification of histones and which has important ramifications for chromatin structure, may be linked to the DNA cytosine methylation pattern. New discoveries have suggested that histone lysine 9 methylation is implicated in the spread of heterochromatin in Drosophila and other organisms. Very recently it has been found that histone lysine 9 methylation is also necessary for some DNA methylation in Neurospora and plants. There is therefore the possibility that these two processes are closely linked, suggesting ways in which DNA methylation patterns may be established during normal development. Understanding these processes is fundamental to understanding what goes awry during the process of aging and carcinogenesis where DNA methylation patterns become substantially altered and contribute to the malignant phenotype.

Keywords: Animals; Chromatin; CpG Islands; DNA Methylation; DNA, Neoplasm; Gene Expression Regulation; Gene Silencing; Histone-Lysine N-Methyltransferase; Humans; Neoplasms; Plants; Transcription, Genetic
[Johnson02Experimental] D.S. Johnson, G. Gutin, L.A. McGeoch, A. Yeo, W. Zhang, and A. Zverovich. Experimental analysis of heuristics for the ATSP. In The Travelling Salesman Problem and Its Variations, pages 445-487, 2002. [ bib ]
[Joachims2002Learning] T. Joachims. Learning to Classify Text Using Support Vector Machines. Kluwer Academic Publishers, 2002. [ bib ]
[Jansen2002Relating] R. Jansen, D. Greenbaum, and M. Gerstein. Relating whole-genome expression data with protein-protein interactions. Genome Res., 12(1):37-46, Jan 2002. [ bib | DOI | http | .pdf ]
We investigate the relationship of protein-protein interactions with mRNA expression levels, by integrating a variety of data sources for yeast. We focus on known protein complexes that have clearly defined interactions between their subunits. We find that subunits of the same protein complex show significant coexpression, both in terms of similarities of absolute mRNA levels and expression profiles, e.g., we can often see subunits of a complex having correlated patterns of expression over a time course. We classify the yeast protein complexes as either permanent or transient, with permanent ones being maintained through most cellular conditions. We find that, generally, permanent complexes, such as the ribosome and proteasome, have a particularly strong relationship with expression, while transient ones do not. However, we note that several transient complexes, such as the RNA polymerase II holoenzyme and the replication complex, can be subdivided into smaller permanent ones, which do have a strong relationship to gene expression. We also investigated the interactions in aggregated, genome-wide data sets, such as the comprehensive yeast two-hybrid experiments, and found them to have only a weak relationship with gene expression, similar to that of transient complexes. (Further details on genecensus.org/expression/interactions and bioinfo.mbb.yale.edu/expression/interactions.)

[Jablonka2002changing] A. Jablonka and M. J. Lamb. The changing concept of epigenetics. Ann N Y Acad Sci, 981:82-96, Dec 2002. [ bib ]
We discuss the changing use of epigenetics, a term coined by Conrad Waddington in the 1940s, and how the epigenetic approach to development differs from the genetic approach. Originally, epigenetics referred to the study of the way genes and their products bring the phenotype into being. Today, it is primarily concerned with the mechanisms through which cells become committed to a particular form or function and through which that functional or structural state is then transmitted in cell lineages. We argue that modern epigenetics is important not only because it has practical significance for medicine, agriculture, and species conservation, but also because it has implications for the way in which we should view heredity and evolution. In particular, recognizing that there are epigenetic inheritance systems through which non-DNA variations can be transmitted in cell and organismal lineages broadens the concept of heredity and challenges the widely accepted gene-centered neo-Darwinian version of Darwinism.

Keywords: csbcbook
[Imoto2002Bayesian] S. Imoto, S. Kim, T. Goto, S. Aburatani, K. Tashiro, S. Kuhara, and S. Miyano. Bayesian network and nonparametric heteroscedastic regression for nonlinear modeling of genetic network. Proc. IEEE Comput. Soc. Bioinform. Conf., 1:219-227, 2002. [ bib | DOI | http | .pdf ]
We propose a new statistical method for constructing a genetic network from microarray gene expression data by using a Bayesian network. An essential point of Bayesian network construction is the estimation of the conditional distribution of each random variable. We consider fitting nonparametric regression models with heterogeneous error variances to the microarray gene expression data to capture the nonlinear structures between genes. A problem still remains to be solved in selecting an optimal graph, which gives the best representation of the system among genes. We theoretically derive a new graph selection criterion from a Bayes approach in general situations. The proposed method includes previous methods based on Bayesian networks. We demonstrate the effectiveness of the proposed method through the analysis of Saccharomyces cerevisiae gene expression data newly obtained by disrupting 100 genes.

Keywords: biogm
[Imoto2002Estimation] S. Imoto, T. Goto, and S. Miyano. Estimation of genetic networks and functional structures between genes by using Bayesian networks and nonparametric regression. Pac. Symp. Biocomput., pages 175-186, 2002. [ bib | .pdf | .pdf ]
We propose a new method for constructing a genetic network from gene expression data by using Bayesian networks. We use nonparametric regression for capturing nonlinear relationships between genes and derive a new criterion for choosing the network in general situations. In a theoretical sense, our proposed theory and methodology include previous methods based on the Bayes approach. We applied the proposed method to the S. cerevisiae cell cycle data and showed the effectiveness of our method by comparing it with previous methods.

Keywords: biogm
[Ideker2002Discovering] T. Ideker, O. Ozier, B. Schwikowski, and A. F. Siegel. Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics, 18 Suppl 1:S233-S240, 2002. [ bib | .pdf ]
MOTIVATION: In model organisms such as yeast, large databases of protein-protein and protein-DNA interactions have become an extremely important resource for the study of protein function, evolution, and gene regulatory dynamics. In this paper we demonstrate that by integrating these interactions with widely-available mRNA expression data, it is possible to generate concrete hypotheses for the underlying mechanisms governing the observed changes in gene expression. To perform this integration systematically and at large scale, we introduce an approach for screening a molecular interaction network to identify active subnetworks, i.e., connected regions of the network that show significant changes in expression over particular subsets of conditions. The method we present here combines a rigorous statistical measure for scoring subnetworks with a search algorithm for identifying subnetworks with high score. RESULTS: We evaluated our procedure on a small network of 332 genes and 362 interactions and a large network of 4160 genes containing all 7462 protein-protein and protein-DNA interactions in the yeast public databases. In the case of the small network, we identified five significant subnetworks that covered 41 out of 77 (53%) of all significant changes in expression. Both network analyses returned several top-scoring subnetworks with good correspondence to known regulatory mechanisms in the literature. These results demonstrate how large-scale genomic approaches may be used to uncover signalling and regulatory pathways in a systematic, integrative fashion.

[Hyman2002Impact] Elizabeth Hyman, Päivikki Kauraniemi, Sampsa Hautaniemi, Maija Wolf, Spyro Mousses, Ester Rozenblum, Markus Ringnér, Guido Sauter, Outi Monni, Abdel Elkahloun, Olli-P. Kallioniemi, and Anne Kallioniemi. Impact of DNA amplification on gene expression patterns in breast cancer. Cancer Res, 62(21):6240-6245, Nov 2002. [ bib ]
Genetic changes underlie tumor progression and may lead to cancer-specific expression of critical genes. Over 1100 publications have described the use of comparative genomic hybridization (CGH) to analyze the pattern of copy number alterations in cancer, but very few of the genes affected are known. Here, we performed high-resolution CGH analysis on cDNA microarrays in breast cancer and directly compared copy number and mRNA expression levels of 13,824 genes to quantitate the impact of genomic changes on gene expression. We identified and mapped the boundaries of 24 independent amplicons, ranging in size from 0.2 to 12 Mb. Throughout the genome, both high- and low-level copy number changes had a substantial impact on gene expression, with 44% of the highly amplified genes showing overexpression and 10.5% of the highly overexpressed genes being amplified. Statistical analysis with random permutation tests identified 270 genes whose expression levels across 14 samples were systematically attributable to gene amplification. These included most previously described amplified genes in breast cancer and many novel targets for genomic alterations, including the HOXB7 gene, the presence of which in a novel amplicon at 17q21.3 was validated in 10.2% of primary breast cancers and associated with poor patient prognosis. In conclusion, CGH on cDNA microarrays revealed hundreds of novel genes whose overexpression is attributable to gene amplification. These genes may provide insights to the clonal evolution and progression of breast cancer and highlight promising therapeutic targets.

[Hopkins2002druggable] A. L. Hopkins and C. R. Groom. The druggable genome. Nat. Rev. Drug Discov., 1(9):727-730, Sep 2002. [ bib | DOI | http ]
An assessment of the number of molecular targets that represent an opportunity for therapeutic intervention is crucial to the development of post-genomic research strategies within the pharmaceutical industry. Now that we know the size of the human genome, it is interesting to consider just how many molecular targets this opportunity represents. We start from the position that we understand the properties that are required for a good drug, and therefore must be able to understand what makes a good drug target.

Keywords: chemogenomics
[Holen2002Positional] T. Holen, M. Amarzguioui, M. T. Wiiger, E. Babaie, and H. Prydz. Positional effects of short interfering RNAs targeting the human coagulation trigger Tissue Factor. Nucleic Acids Res., 30(8):1757-1766, Apr 2002. [ bib ]
Chemically synthesised 21-23 bp double-stranded short interfering RNAs (siRNA) can induce sequence-specific post-transcriptional gene silencing, in a process termed RNA interference (RNAi). In the present study, several siRNAs synthesised against different sites on the same target mRNA (human Tissue Factor) demonstrated striking differences in silencing efficiency. Only a few of the siRNAs resulted in a significant reduction in expression, suggesting that accessible siRNA target sites may be rare in some human mRNAs. Blocking of the 3'-OH with FITC did not reduce the effect on target mRNA. Mutations in the siRNAs relative to target mRNA sequence gradually reduced, but did not abolish mRNA depletion. Inactive siRNAs competed reversibly with active siRNAs in a sequence-independent manner. Several lines of evidence suggest the existence of a near equilibrium kinetic balance between mRNA production and siRNA-mediated mRNA depletion. The silencing effect was transient, with the level of mRNA recovering fully within 4-5 days, suggesting absence of a propagative system for RNAi in humans. Finally, we observed 3' mRNA cleavage fragments resulting from the action of the most effective siRNAs. The depletion rate-dependent appearance of these fragments argues for the existence of a two-step mRNA degradation mechanism.

Keywords: sirna
[Ho2002Systematic] Yuen Ho, Albrecht Gruhler, Adrian Heilbut, Gary D Bader, Lynda Moore, Sally-Lin Adams, Anna Millar, Paul Taylor, Keiryn Bennett, Kelly Boutilier, Lingyun Yang, Cheryl Wolting, Ian Donaldson, Søren Schandorff, Juanita Shewnarane, Mai Vo, Joanne Taggart, Marilyn Goudreault, Brenda Muskat, Cris Alfarano, Danielle Dewar, Zhen Lin, Katerina Michalickova, Andrew R Willems, Holly Sassi, Peter A Nielsen, Karina J Rasmussen, Jens R Andersen, Lene E Johansen, Lykke H Hansen, Hans Jespersen, Alexandre Podtelejnikov, Eva Nielsen, Janne Crawford, Vibeke Poulsen, Birgitte D Sørensen, Jesper Matthiesen, Ronald C Hendrickson, Frank Gleeson, Tony Pawson, Michael F Moran, Daniel Durocher, Matthias Mann, Christopher W V Hogue, Daniel Figeys, and Mike Tyers. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415(6868):180-3, Jan 2002. [ bib | DOI | http | .pdf ]
The recent abundance of genome sequence data has brought an urgent need for systematic proteomics to decipher the encoded protein networks that dictate cellular function. To date, generation of large-scale protein-protein interaction maps has relied on the yeast two-hybrid system, which detects binary interactions through activation of reporter gene expression. With the advent of ultrasensitive mass spectrometric protein identification methods, it is feasible to identify directly protein complexes on a proteome-wide scale. Here we report, using the budding yeast Saccharomyces cerevisiae as a test case, an example of this approach, which we term high-throughput mass spectrometric protein complex identification (HMS-PCI). Beginning with 10% of predicted yeast proteins as baits, we detected 3,617 associated proteins covering 25% of the yeast proteome. Numerous protein complexes were identified, including many new interactions in various signalling pathways and in the DNA damage response. Comparison of the HMS-PCI data set with interactions reported in the literature revealed an average threefold higher success rate in detection of known complexes compared with large-scale two-hybrid studies. Given the high degree of connectivity observed in this study, even partial HMS-PCI coverage of complex proteomes, including that of humans, should allow comprehensive identification of cellular networks.

Keywords: Affinity Labels, Amino Acid Sequence, Animals, Cell Cycle Proteins, Cloning, Comparative Study, DNA, DNA Damage, DNA Repair, Electrospray Ionization, Fungal, Genetic, Humans, Macromolecular Substances, Mass, Mitosis, Molecular, Molecular Sequence Data, Non-P.H.S., Non-U.S. Gov't, P.H.S., Phosphoric Monoester Hydrolases, Protein Binding, Protein Interaction Mapping, Protein Kinases, Proteome, Proteomics, Research Support, Ribonucleoproteins, Ribosomes, Saccharomyces cerevisiae, Saccharomyces cerevisiae Proteins, Sequence Alignment, Signal Transduction, Spectrometry, Spectrum Analysis, Transcription, U.S. Gov't, 11805813
[Hartemink2002Using] A.J. Hartemink, D.K. Gifford, T.S. Jaakkola, and R.A. Young. Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks. In Russ B. Altman, A. Keith Dunker, Lawrence Hunter, Kevin Lauerdale, and Teri E. Klein, editors, Proceedings of the Pacific Symposium on Biocomputing 2002, pages 422-433. World Scientific, 2002. [ bib | .html | .pdf ]
[Hanisch2002Co-clustering] D. Hanisch, A. Zien, R. Zimmer, and T. Lengauer. Co-clustering of biological networks and gene expression data. Bioinformatics, 2002. [ bib | .pdf ]
[Halperin2002Principles] I. Halperin, B. Ma, H. Wolfson, and R. Nussinov. Principles of docking: An overview of search algorithms and a guide to scoring functions. Proteins, 47(4):409-443, Jun 2002. [ bib | DOI | http ]
The docking field has come of age. The time is ripe to present the principles of docking, reviewing the current state of the field. Two reasons are largely responsible for the maturity of the computational docking area. First, the early optimism that the very presence of the "correct" native conformation within the list of predicted docked conformations signals a near solution to the docking problem, has been replaced by the stark realization of the extreme difficulty of the next scoring/ranking step. Second, in the last couple of years more realistic approaches to handling molecular flexibility in docking schemes have emerged. As in folding, these derive from concepts abstracted from statistical mechanics, namely, populations. Docking and folding are interrelated. From the purely physical standpoint, binding and folding are analogous processes, with similar underlying principles. Computationally, the tools developed for docking will be tremendously useful for folding. For large, multidomain proteins, domain docking is probably the only rational way, mimicking the hierarchical nature of protein folding. The complexity of the problem is huge. Here we divide the computational docking problem into its two separate components. As in folding, solving the docking problem involves efficient search (and matching) algorithms, which cover the relevant conformational space, and selective scoring functions, which are both efficient and effectively discriminate between native and non-native solutions. It is universally recognized that docking of drugs is immensely important. However, protein-protein docking is equally so, relating to recognition, cellular pathways, and macromolecular assemblies. Proteins function when they are bound to other molecules. Consequently, we present the review from both the computational and the biological points of view. 
Although large, it covers only partially the extensive body of literature, relating to small (drug) and to large protein-protein molecule docking, to rigid and to flexible. Unfortunately, when reviewing these, a major difficulty in assessing the results is the non-uniformity in the formats in which they are presented in the literature. Consequently, we further propose a way to rectify it here.

Keywords: chemoinformatics
[Gartner03graph] T. Gärtner, K. Driessens, and J. Ramon. Exponential and geometric kernels for graphs. In Mach. Learn., pages 146-163. Springer, 2002. [ bib ]
[Gartner2002Multi-Instance] T. Gärtner, P.A. Flach, A. Kowalczyk, and A.J. Smola. Multi-Instance Kernels. In C. Sammut and A. Hoffmann, editors, Proceedings of the Nineteenth International Conference on Machine Learning, pages 179-186. Morgan Kaufmann, 2002. [ bib ]
[Gartner2002Exponential] T. Gärtner. Exponential and Geometric Kernels for Graphs. In NIPS Workshop on Unreal Data: Principles of Modeling Nonvectorial Data, 2002. [ bib ]
[Guyon2002Gene] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Mach. Learn., 46(1/3):389-422, Jan 2002. [ bib | .pdf | .pdf ]
DNA micro-arrays now permit scientists to screen thousands of genes simultaneously and determine whether those genes are active, hyperactive or silent in normal or cancerous tissue. Because these new micro-array devices generate bewildering amounts of raw data, new analytical methods must be developed to sort out whether cancer tissues have distinctive signatures of gene expression over normal tissues or other types of cancer tissues. In this paper, we address the problem of selection of a small subset of genes from broad patterns of gene expression data, recorded on DNA micro-arrays. Using available training examples from cancer and normal patients, we build a classifier suitable for genetic diagnosis, as well as drug discovery. Previous attempts to address this problem select genes with correlation techniques. We propose a new method of gene selection utilizing Support Vector Machine methods based on Recursive Feature Elimination (RFE). We demonstrate experimentally that the genes selected by our techniques yield better classification performance and are biologically relevant to cancer. In contrast with the baseline method, our method eliminates gene redundancy automatically and yields better and more compact gene subsets. In patients with leukemia our method discovered 2 genes that yield zero leave-one-out error, while 64 genes are necessary for the baseline method to get the best result (one leave-one-out error). In the colon cancer database, using only 4 genes our method is 98% accurate, while the baseline method is only 86% accurate.

Keywords: biosvm
[Guermeur2002Combining] Y. Guermeur. Combining Discriminant Models with New Multi-Class SVMs. Pattern Anal. Appl., 5(2):168-179, 2002. [ bib | DOI | http | .pdf ]
The idea of performing model combination, instead of model selection, has a long theoretical background in statistics. However, making use of theoretical results is ordinarily subject to the satisfaction of strong hypotheses (weak error correlation, availability of large training sets, possibility to rerun the training procedure an arbitrary number of times, etc.). In contrast, the practitioner is frequently faced with the problem of combining a given set of pre-trained classifiers, with highly correlated errors, using only a small training sample. Overfitting is then the main risk, which cannot be overcome but with a strict complexity control of the combiner selected. This suggests that SVMs should be well suited for these difficult situations. Investigating this idea, we introduce a family of multi-class SVMs and assess them as ensemble methods on a real-world problem. This task, protein secondary structure prediction, is an open problem in biocomputing for which model combination appears to be an issue of central importance. Experimental evidence highlights the gain in quality resulting from combining some of the most widely used prediction methods with our SVMs rather than with the ensemble methods traditionally used in the field. The gain increases when the outputs of the combiners are post-processed with a DP algorithm.

Keywords: biosvm
[Guelzim2002Topological] N. Guelzim, S. Bottani, P. Bourgine, and F. Képès. Topological and causal structure of the yeast transcriptional regulatory network. Nat. Genet., 31:60-63, 2002. [ bib | .html | .pdf ]
[Goto2002LIGAND:] S. Goto, Y. Okuno, M. Hattori, T. Nishioka, and M. Kanehisa. LIGAND: database of chemical compounds and reactions in biological pathways. Nucleic Acids Res., 30:402-404, 2002. [ bib | http | .pdf ]
[Goldbaum2002Comparing] Michael H Goldbaum, Pamela A Sample, Kwokleung Chan, Julia Williams, Te-Won Lee, Eytan Blumenthal, Christopher A Girkin, Linda M Zangwill, Christopher Bowd, Terrence Sejnowski, and Robert N Weinreb. Comparing machine learning classifiers for diagnosing glaucoma from standard automated perimetry. Invest Ophthalmol Vis Sci, 43(1):162-9, Jan 2002. [ bib ]
PURPOSE: To determine which machine learning classifier learns best to interpret standard automated perimetry (SAP) and to compare the best of the machine classifiers with the global indices of STATPAC 2 and with experts in glaucoma. METHODS: Multilayer perceptrons (MLP), support vector machines (SVM), mixture of Gaussian (MoG), and mixture of generalized Gaussian (MGG) classifiers were trained and tested by cross validation on the numerical plot of absolute sensitivity plus age of 189 normal eyes and 156 glaucomatous eyes, designated as such by the appearance of the optic nerve. The authors compared performance of these classifiers with the global indices of STATPAC, using the area under the ROC curve. Two human experts were judged against the machine classifiers and the global indices by plotting their sensitivity-specificity pairs. RESULTS: MoG had the greatest area under the ROC curve of the machine classifiers. Pattern SD (PSD) and corrected PSD (CPSD) had the largest areas under the curve of the global indices. MoG had significantly greater ROC area than PSD and CPSD. Human experts were not better at classifying visual fields than the machine classifiers or the global indices. CONCLUSIONS: MoG, using the entire visual field and age for input, interpreted SAP better than the global indices of STATPAC. Machine classifiers may augment the global indices of STATPAC.

[Gavin2002Functionala] Anne-Claude Gavin, Markus Bösche, Roland Krause, Paola Grandi, Martina Marzioch, Andreas Bauer, Jörg Schultz, Jens M Rick, Anne-Marie Michon, Cristina-Maria Cruciat, Marita Remor, Christian Höfert, Malgorzata Schelder, Miro Brajenovic, Heinz Ruffner, Alejandro Merino, Karin Klein, Manuela Hudak, David Dickson, Tatjana Rudi, Volker Gnau, Angela Bauch, Sonja Bastuck, Bettina Huhse, Christina Leutwein, Marie-Anne Heurtier, Richard R Copley, Angela Edelmann, Erich Querfurth, Vladimir Rybin, Gerard Drewes, Manfred Raida, Tewis Bouwmeester, Peer Bork, Bertrand Seraphin, Bernhard Kuster, Gitte Neubauer, and Giulio Superti-Furga. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415(6868):141-7, Jan 2002. [ bib | DOI | http | .pdf ]
Most cellular processes are carried out by multiprotein complexes. The identification and analysis of their components provides insight into how the ensemble of expressed proteins (proteome) is organized into functional units. We used tandem-affinity purification (TAP) and mass spectrometry in a large-scale approach to characterize multiprotein complexes in Saccharomyces cerevisiae. We processed 1,739 genes, including 1,143 human orthologues of relevance to human biology, and purified 589 protein assemblies. Bioinformatic analysis of these assemblies defined 232 distinct multiprotein complexes and proposed new cellular roles for 344 proteins, including 231 proteins with no previous functional annotation. Comparison of yeast and human complexes showed that conservation across species extends from single proteins to their molecular environment. Our analysis provides an outline of the eukaryotic proteome as a network of protein complexes at a level of organization beyond binary interactions. This higher-order map contains fundamental biological information and offers the context for a more reasoned and informed approach to drug discovery.

Keywords: Affinity, Affinity Labels, Amino Acid Sequence, Animals, Cell Cycle Proteins, Cells, Chromatography, Cloning, Comparative Study, Cultured, DNA, DNA Damage, DNA Repair, Electrospray Ionization, Fungal, Gene Targeting, Genetic, Humans, Macromolecular Substances, Mass, Matrix-Assisted Laser Desorption-Ionization, Mitosis, Molecular, Molecular Sequence Data, Non-P.H.S., Non-U.S. Gov't, P.H.S., Phosphoric Monoester Hydrolases, Protein Binding, Protein Interaction Mapping, Protein Kinases, Proteome, Proteomics, Recombinant Fusion Proteins, Research Support, Ribonucleoproteins, Ribosomes, Saccharomyces cerevisiae, Saccharomyces cerevisiae Proteins, Sensitivity and Specificity, Sequence Alignment, Signal Transduction, Species Specificity, Spectrometry, Spectrum Analysis, Transcription, U.S. Gov't, 11805813
[Gardner2002Genome] Malcolm J Gardner, Neil Hall, Eula Fung, Owen White, Matthew Berriman, Richard W Hyman, Jane M Carlton, Arnab Pain, Karen E Nelson, Sharen Bowman, Ian T Paulsen, Keith James, Jonathan A Eisen, Kim Rutherford, Steven L Salzberg, Alister Craig, Sue Kyes, Man-Suen Chan, Vishvanath Nene, Shamira J Shallom, Bernard Suh, Jeremy Peterson, Sam Angiuoli, Mihaela Pertea, Jonathan Allen, Jeremy Selengut, Daniel Haft, Michael W Mather, Akhil B Vaidya, David M A Martin, Alan H Fairlamb, Martin J Fraunholz, David S Roos, Stuart A Ralph, Geoffrey I McFadden, Leda M Cummings, G. Mani Subramanian, Chris Mungall, J. Craig Venter, Daniel J Carucci, Stephen L Hoffman, Chris Newbold, Ronald W Davis, Claire M Fraser, and Bart Barrell. Genome sequence of the human malaria parasite Plasmodium falciparum. Nature, 419(6906):498-511, Oct 2002. [ bib | DOI | http | .pdf ]
The parasite Plasmodium falciparum is responsible for hundreds of millions of cases of malaria, and kills more than one million African children annually. Here we report an analysis of the genome sequence of P. falciparum clone 3D7. The 23-megabase nuclear genome consists of 14 chromosomes, encodes about 5,300 genes, and is the most (A + T)-rich genome sequenced to date. Genes involved in antigenic variation are concentrated in the subtelomeric regions of the chromosomes. Compared to the genomes of free-living eukaryotic microbes, the genome of this intracellular parasite encodes fewer enzymes and transporters, but a large proportion of genes are devoted to immune evasion and host-parasite interactions. Many nuclear-encoded proteins are targeted to the apicoplast, an organelle involved in fatty-acid and isoprenoid metabolism. The genome sequence provides the foundation for future studies of this organism, and is being exploited in the search for new drugs and vaccines to fight malaria.

Keywords: plasmodium
[Fritz2002Microarray-based] B. Fritz, F. Schubert, G. Wrobel, C. Schwaenen, S. Wessendorf, M. Nessling, C. Korz, R. J. Rieker, K. Montgomery, R. Kucherlapati, G. Mechtersheimer, R. Eils, S. Joos, and P. Lichter. Microarray-based Copy Number and Expression Profiling in Dedifferentiated and Pleomorphic Liposarcoma. Cancer Res., 62(11):2993-2998, 2002. [ bib | http | .pdf ]
Sixteen dedifferentiated and pleomorphic liposarcomas were analyzed by comparative genomic hybridization (CGH) to genomic microarrays (matrix-CGH), cDNA-derived microarrays for expression profiling, and by quantitative PCR. Matrix-CGH revealed copy number gains of numerous oncogenes, i.e., CCND1, MDM2, GLI, CDK4, MYB, ESR1, and AIB1, several of which correlate with a high level of transcripts from the respective gene. In addition, a number of genes were found differentially expressed in dedifferentiated and pleomorphic liposarcomas. Application of dedicated clustering algorithms revealed that both tumor subtypes are clearly separated by the genomic profiles but only with a lesser power by the expression profiles. Using a support vector machine, a subset of five clones was identified as "class discriminators." Thus, for the distinction of these types of liposarcomas, genomic profiling appears to be more advantageous than RNA expression analysis.

Keywords: biosvm, cgh
[Freudenberg2002similarity-based] J. Freudenberg and P. Propping. A similarity-based method for genome-wide prediction of disease-relevant human genes. Bioinformatics, 18 Suppl 2:S110-S115, 2002. [ bib | .pdf ]
MOTIVATION: A method for prediction of disease-relevant human genes from the phenotypic appearance of a query disease is presented. Diseases of known genetic origin are clustered according to their phenotypic similarity. Each cluster entry consists of a disease and its underlying disease gene. Potential disease genes from the human genome are scored by their functional similarity to known disease genes in these clusters, which are phenotypically similar to the query disease. RESULTS: For assessment of the approach, a leave-one-out cross-validation of 878 diseases from the OMIM database, using 10672 candidate genes from the human genome, is performed. Depending on the applied parameters, in roughly one-third of cases the true solution is contained within the top scoring 3% of predictions and in two-thirds of cases the true solution is contained within the top scoring 15% of predictions. The prediction results can either be used to identify target genes, when searching for a mutation in monogenic diseases or for selection of loci in genotyping experiments in genetically complex diseases.

[Florens2002proteomic] Laurence Florens, Michael P Washburn, J. Dale Raine, Robert M Anthony, Munira Grainger, J. David Haynes, J. Kathleen Moch, Nemone Muster, John B Sacci, David L Tabb, Adam A Witney, Dirk Wolters, Yimin Wu, Malcolm J Gardner, Anthony A Holder, Robert E Sinden, John R Yates, and Daniel J Carucci. A proteomic view of the Plasmodium falciparum life cycle. Nature, 419(6906):520-526, Oct 2002. [ bib | DOI | http | .pdf ]
The completion of the Plasmodium falciparum clone 3D7 genome provides a basis on which to conduct comparative proteomics studies of this human pathogen. Here, we applied a high-throughput proteomics approach to identify new potential drug and vaccine targets and to better understand the biology of this complex protozoan parasite. We characterized four stages of the parasite life cycle (sporozoites, merozoites, trophozoites and gametocytes) by multidimensional protein identification technology. Functional profiling of over 2,400 proteins agreed with the physiology of each stage. Unexpectedly, the antigenically variant proteins of var and rif genes, defined as molecules on the surface of infected erythrocytes, were also largely expressed in sporozoites. The detection of chromosomal clusters encoding co-expressed proteins suggested a potential mechanism for controlling gene expression.

Keywords: plasmodium
[Fischetti02Generalized] M. Fischetti, J.J. Salazar-Gonzalez, and P. Toth. The generalized travelling salesman and orienteering problems. In The Travelling Salesman Problem and Its Variations, pages 609-662, 2002. [ bib ]
[El-Naqa2002support] I. El-Naqa, Y. Yang, M. N. Wernick, N. P. Galatsanos, and R. M. Nishikawa. A support vector machine approach for detection of microcalcifications. IEEE Trans Med Imaging, 21(12):1552-63, Dec 2002. [ bib ]
In this paper, we investigate an approach based on support vector machines (SVMs) for detection of microcalcification (MC) clusters in digital mammograms, and propose a successive enhancement learning scheme for improved performance. SVM is a machine-learning method, based on the principle of structural risk minimization, which performs well when applied to data outside the training set. We formulate MC detection as a supervised-learning problem and apply SVM to develop the detection algorithm. We use the SVM to detect at each location in the image whether an MC is present or not. We tested the proposed method using a database of 76 clinical mammograms containing 1120 MCs. We use free-response receiver operating characteristic curves to evaluate detection performance, and compare the proposed algorithm with several existing methods. In our experiments, the proposed SVM framework outperformed all the other methods tested. In particular, a sensitivity as high as 94% was achieved by the SVM method at an error rate of one false-positive cluster per image. The ability of SVM to outperform several well-known methods developed for the widely studied problem of MC detection suggests that SVM is a promising technique for object detection in a medical imaging application.

[Ekins2002Towards] S. Ekins, B. Boulanger, P. W. Swaan, and M. A. Z. Hupcey. Towards a new age of virtual ADME/TOX and multidimensional drug discovery. J Comput Aided Mol Des, 16(5-6):381-401, 2002. [ bib ]
With the continual pressure to ensure follow-up molecules to billion dollar blockbuster drugs, there is a hurdle in profitability and growth for pharmaceutical companies in the next decades. With each success and failure we increasingly appreciate that a key to the success of synthesized molecules through the research and development process is the possession of drug-like properties. These properties include an adequate bioactivity as well as adequate solubility, an ability to cross critical membranes (intestinal and sometimes blood-brain barrier), reasonable metabolic stability and of course safety in humans. Dependent on the therapeutic area being investigated it might also be desirable to avoid certain enzymes or transporters to circumvent potential drug-drug interactions. It may also be important to limit the induction of these same proteins that can result in further toxicities. We have clearly moved the assessment of in vitro absorption, distribution, metabolism, excretion and toxicity (ADME/TOX) parameters much earlier in the discovery organization than a decade ago with the inclusion of higher throughput systems. We are also now faced with huge amounts of ADME/TOX data for each molecule that need interpretation and also provide a valuable resource for generating predictive computational models for future drug discovery. The present review aims to show what tools exist today for visualizing and modeling ADME/TOX data, what tools need to be developed, and how both the present and future tools are valuable for virtual filtering using ADME/TOX and bioactivity properties in parallel as a viable addition to present practices.

Keywords: ATP-Binding Cassette Transporters, Algorithms, Animals, Biological, Biological Availability, Computer Simulation, Drug Design, Drug Evaluation, Drug Industry, Gene Expression Profiling, Humans, Models, Organic Anion Transporters, P.H.S., Pharmaceutical, Pharmaceutical Preparations, Pharmacogenetics, Pharmacokinetics, Preclinical, Proteomics, Research Support, Software, Systems Biology, Technology, Toxicity Tests, U.S. Gov't, 12489686
[Donnes2002Prediction] P. Dönnes and A. Elofsson. Prediction of MHC class I binding peptides, using SVMHC. BMC Bioinformatics, 3(1):25, Sep 2002. [ bib | DOI | http | .pdf ]
BACKGROUND: T-cells are key players in regulating a specific immune response. Activation of cytotoxic T-cells requires recognition of specific peptides bound to Major Histocompatibility Complex (MHC) class I molecules. MHC-peptide complexes are potential tools for diagnosis and treatment of pathogens and cancer, as well as for the development of peptide vaccines. Only one in 100 to 200 potential binders actually binds to a certain MHC molecule, therefore a good prediction method for MHC class I binding peptides can reduce the number of candidate binders that need to be synthesized and tested. RESULTS: Here, we present a novel approach, SVMHC, based on support vector machines to predict the binding of peptides to MHC class I molecules. This method seems to perform slightly better than two profile based methods, SYFPEITHI and HLA_BIND. The implementation of SVMHC is quite simple and does not involve any manual steps, therefore as more data become available it is trivial to provide prediction for more MHC types. SVMHC currently contains prediction for 26 MHC class I types from the MHCPEP database or alternatively 6 MHC class I types from the higher quality SYFPEITHI database. The prediction models for these MHC types are implemented in a public web service available at http://www.sbc.su.se/svmhc/. CONCLUSIONS: Prediction of MHC class I binding peptides using Support Vector Machines, shows high performance and is easy to apply to a large number of MHC class I types. As more peptide data are put into MHC databases, SVMHC can easily be updated to give prediction for additional MHC class I types. We suggest that the number of binding peptides needed for SVM training is at least 20 sequences.

Keywords: biosvm immunoinformatics
[Dudoit2002Statistical] S. Dudoit, Y. H. Yang, M. J. Callow, and T. P. Speed. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica, 12:111-139, 2002. [ bib | .pdf ]
[Dudoit2002Comparison] S. Dudoit, J. Fridlyand, and T. Speed. Comparison of discrimination methods for classification of tumors using gene expression data. J. Am. Stat. Assoc., 97:77-87, 2002. [ bib ]
[Dover2002Methylation] Jim Dover, Jessica Schneider, Mary Anne Tawiah-Boateng, Adam Wood, Kimberly Dean, Mark Johnston, and Ali Shilatifard. Methylation of histone H3 by COMPASS requires ubiquitination of histone H2B by Rad6. J Biol Chem, 277(32):28368-28371, Aug 2002. [ bib | DOI | http ]
The DNA of eukaryotes is wrapped around nucleosomes and packaged into chromatin. Covalent modifications of the histone proteins that comprise the nucleosome alter chromatin structure and have major effects on gene expression. Methylation of lysine 4 of histone H3 by COMPASS is required for silencing of genes located near chromosome telomeres and within the rDNA (Krogan, N. J., Dover, J., Khorrami, S., Greenblatt, J. F., Schneider, J., Johnston, M., and Shilatifard, A. (2002) J. Biol. Chem. 277, 10753-10755; Briggs, S. D., Bryk, M., Strahl, B. D., Cheung, W. L., Davie, J. K., Dent, S. Y., Winston, F., and Allis, C. D. (2001) Genes. Dev. 15, 3286-3295). To learn about the mechanism of histone methylation, we surveyed the genome of the yeast Saccharomyces cerevisiae for genes necessary for this process. By analyzing approximately 4800 mutant strains, each deleted for a different non-essential gene, we discovered that the ubiquitin-conjugating enzyme Rad6 is required for methylation of lysine 4 of histone H3. Ubiquitination of histone H2B on lysine 123 is the signal for the methylation of histone H3, which leads to silencing of genes located near telomeres.

Keywords: DNA, Ribosomal, metabolism; Electrophoresis, Polyacrylamide Gel; Gene Silencing; Histones, metabolism; Ligases, metabolism; Lysine, metabolism; Methylation; Models, Biological; Mutation; Saccharomyces cerevisiae Proteins; Saccharomyces cerevisiae, genetics; Ubiquitin, metabolism; Ubiquitin-Conjugating Enzymes
[Doniger2002Predicting] S. Doniger, T. Hofmann, and J. Yeh. Predicting CNS permeability of drug molecules: comparison of neural network and support vector machine algorithms. J. Comput. Biol., 9(6):849-864, 2002. [ bib | DOI | .pdf ]
Two different machine-learning algorithms have been used to predict the blood-brain barrier permeability of different classes of molecules, to develop a method to predict the ability of drug compounds to penetrate the CNS. The first algorithm is based on a multilayer perceptron neural network and the second algorithm uses a support vector machine. Both algorithms are trained on an identical data set consisting of 179 CNS active molecules and 145 CNS inactive molecules. The training parameters include molecular weight, lipophilicity, hydrogen bonding, and other variables that govern the ability of a molecule to diffuse through a membrane. The results show that the support vector machine outperforms the neural network. Based on over 30 different validation sets, the SVM can predict up to 96% of the molecules correctly, averaging 81.5% [...] of equal numbers of CNS positive and negative molecules. This is quite favorable when compared with the neural network's average performance of 75.7% [...] the SVM algorithm are very encouraging and suggest that a classification tool like this one will prove to be a valuable prediction approach.

Keywords: biosvm
[Dietterich2002Machine] T. G. Dietterich. Machine Learning for Sequential Data: A Review. In T. Caelli, editor, Structural, Syntactic, and Statistical Pattern Recognition; Lecture Notes in Computer Science, Vol. 2396, pages 15-30. Springer-Verlag, 2002. [ bib | .pdf ]
Keywords: conditional-random-field
[Deshpande2002Evaluation] M. Deshpande and G. Karypis. Evaluation of Techniques for Classifying Biological Sequences. In PAKDD '02: Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, pages 417-431. Springer Verlag, 2002. [ bib | .pdf ]
In recent years we have witnessed an exponential increase in the amount of biological information, either DNA or protein sequences, that has become available in public databases. This has been followed by an increased interest in developing computational techniques to automatically classify these large volumes of sequence data into various categories corresponding to either their role in the chromosomes, their structure, and/or their function. In this paper we evaluate some of the widely-used sequence classification algorithms and develop a framework for modeling sequences in a fashion so that traditional machine learning algorithms, such as support vector machines, can be applied easily. Our detailed experimental evaluation shows that the SVM-based approaches are able to achieve higher classification accuracy compared to the more traditional sequence classification algorithms such as Markov model based techniques and K-nearest neighbor based approaches.

Keywords: biosvm
[Deshpande2002Automated] M. Deshpande and G. Karypis. Automated Approaches for Classifying Structures. In Proceedings of the 2nd Workshop on Data Mining in Bioinformatics (BIOKDD '02), 2002. [ bib ]
[Denis2002Text] F. Denis, R. Gilleron, and M. Tommasi. Text classification from positive and unlabeled examples. In Proc. of the 9th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, 2002. [ bib | http ]
[Dekker2002Capturing] Job Dekker, Karsten Rippe, Martijn Dekker, and Nancy Kleckner. Capturing chromosome conformation. Science, 295(5558):1306-1311, Feb 2002. [ bib | DOI | http ]
We describe an approach to detect the frequency of interaction between any two genomic loci. Generation of a matrix of interaction frequencies between sites on the same or different chromosomes reveals their relative spatial disposition and provides information about the physical properties of the chromatin fiber. This methodology can be applied to the spatial organization of entire genomes in organisms from bacteria to human. Using the yeast Saccharomyces cerevisiae, we could confirm known qualitative features of chromosome organization within the nucleus and dynamic changes in that organization during meiosis. We also analyzed yeast chromosome III at the G1 stage of the cell cycle. We found that chromatin is highly flexible throughout. Furthermore, functionally distinct AT- and GC-rich domains were found to exhibit different conformations, and a population-average 3D model of chromosome III could be determined. Chromosome III emerges as a contorted ring.

Keywords: AT Rich Sequence; Cell Fractionation; Cell Nucleus; Centromere; Chromatin; Chromosomes, Fungal; Cross-Linking Reagents; Deoxyribonuclease EcoRI; Formaldehyde; G1 Phase; GC Rich Sequence; Genome, Fungal; Mathematics; Meiosis; Mitosis; Polymerase Chain Reaction; Protein Conformation; Saccharomyces cerevisiae; Telomere
[Degroeve2002Feature] S. Degroeve, B. De Baets, Y. Van de Peer, and P. Rouze. Feature subset selection for splice site prediction. Bioinformatics, 18(Suppl. 1):S75-S83, 2002. [ bib | http | .pdf ]
Motivation: The large amount of available annotated Arabidopsis thaliana sequences allows the induction of splice site prediction models with supervised learning algorithms (see Haussler (1998) for a review and references). These algorithms need information sources or features from which the models can be computed. For splice site prediction, the features we consider in this study are the presence or absence of certain nucleotides in close proximity to the splice site. Since it is not known how many and which nucleotides are relevant for splice site prediction, the set of features is chosen large enough such that the probability that all relevant information sources are in the set is very high. Using only those features that are relevant for constructing a splice site prediction system might improve the system and might also provide us with useful biological knowledge. Using fewer features will of course also improve the prediction speed of the system. Results: A wrapper-based feature subset selection algorithm using a support vector machine or a naive Bayes prediction method was evaluated against the traditional method for selecting features relevant for splice site prediction. Our results show that this wrapper approach selects features that improve the performance against the use of all features and against the use of the features selected by the traditional method. Availability: The data and additional interactive graphs on the selected feature subsets are available at http://www.psb.rug.ac.be/gps Contact: svgro@gengenp.rug.ac.be yvdp@gengenp.rug.ac.be

Keywords: biosvm
[Decoste2002Training] D. Decoste and B. Schölkopf. Training invariant support vector machines. Mach. Learn., 46(1-3):161-190, 2002. [ bib | .pdf ]
[Doennes2002Prediction] Pierre Dönnes and Arne Elofsson. Prediction of MHC class I binding peptides, using SVMHC. BMC Bioinformatics, 3:25, Sep 2002. [ bib ]
BACKGROUND: T-cells are key players in regulating a specific immune response. Activation of cytotoxic T-cells requires recognition of specific peptides bound to Major Histocompatibility Complex (MHC) class I molecules. MHC-peptide complexes are potential tools for diagnosis and treatment of pathogens and cancer, as well as for the development of peptide vaccines. Only one in 100 to 200 potential binders actually binds to a certain MHC molecule, therefore a good prediction method for MHC class I binding peptides can reduce the number of candidate binders that need to be synthesized and tested. RESULTS: Here, we present a novel approach, SVMHC, based on support vector machines to predict the binding of peptides to MHC class I molecules. This method seems to perform slightly better than two profile based methods, SYFPEITHI and HLA_BIND. The implementation of SVMHC is quite simple and does not involve any manual steps, therefore as more data become available it is trivial to provide prediction for more MHC types. SVMHC currently contains prediction for 26 MHC class I types from the MHCPEP database or alternatively 6 MHC class I types from the higher quality SYFPEITHI database. The prediction models for these MHC types are implemented in a public web service available at http://www.sbc.su.se/svmhc/. CONCLUSIONS: Prediction of MHC class I binding peptides using Support Vector Machines, shows high performance and is easy to apply to a large number of MHC class I types. As more peptide data are put into MHC databases, SVMHC can easily be updated to give prediction for additional MHC class I types. We suggest that the number of binding peptides needed for SVM training is at least 20 sequences.

Keywords: Animals; Artificial Intelligence; Comparative Study; Computational Biology; Databases, Protein; Epitopes, T-Lymphocyte; HLA Antigens; Histocompatibility Antigens Class I; Humans; Peptides; Predictive Value of Tests; Protein Binding; Research Support, Non-U.S. Gov't; Sensitivity and Specificity
[Cucker2002On] F. Cucker and S. Smale. On the mathematical foundations of learning. Bull. Amer. Math. Soc., 39:1-49, 2002. [ bib | .pdf ]
[Cucker2002Best] F. Cucker and S. Smale. Best choices for regularization parameters in learning theory: on the bias-variance problem. Foundations of Computational Mathematics, 2(4):413-428, 2002. [ bib | DOI | http | .pdf ]
[Cristianini2002Latent] N. Cristianini, J. Shawe-Taylor, and H. Lodhi. Latent semantic kernels. J. Intell. Inform. Syst., 18(2-3):127-152, 2002. [ bib | DOI | http | .pdf ]
[Churchill2002Fundamentals] G. A. Churchill. Fundamentals of experimental design for cDNA microarrays. Nat. Genet., 32 Suppl:490-495, Dec 2002. [ bib | DOI | http ]
Microarray technology is now widely available and is being applied to address increasingly complex scientific questions. Consequently, there is a greater demand for statistical assessment of the conclusions drawn from microarray experiments. This review discusses fundamental issues of how to design an experiment to ensure that the resulting data are amenable to statistical analysis. The discussion focuses on two-color spotted cDNA microarrays, but many of the same issues apply to single-color gene-expression assays as well.

Keywords: Animals; DNA, Complementary, analysis; Gene Expression; Gene Expression Profiling, methods; Mice; Models, Biological; Oligonucleotide Array Sequence Analysis, methods; Reference Standards; Reproducibility of Results; Research Design; Statistics as Topic
[Chou2002Using] K.-C. Chou and Y.-D. Cai. Using Functional Domain Composition and Support Vector Machines for Prediction of Protein Subcellular Location. J. Biol. Chem., 277(48):45765-45769, 2002. [ bib | http | .pdf ]
Proteins are generally classified into the following 12 subcellular locations: 1) chloroplast, 2) cytoplasm, 3) cytoskeleton, 4) endoplasmic reticulum, 5) extracellular, 6) Golgi apparatus, 7) lysosome, 8) mitochondria, 9) nucleus, 10) peroxisome, 11) plasma membrane, and 12) vacuole. Because the function of a protein is closely correlated with its subcellular location, with the rapid increase in new protein sequences entering into databanks, it is vitally important for both basic research and pharmaceutical industry to establish a high throughput tool for predicting protein subcellular location. In this paper, a new concept, the so-called "functional domain composition" is introduced. Based on the novel concept, the representation for a protein can be defined as a vector in a high-dimensional space, where each of the clustered functional domains derived from the protein universe serves as a vector base. With such a novel representation for a protein, the support vector machine (SVM) algorithm is introduced for predicting protein subcellular location. High success rates are obtained by the self-consistency test, jackknife test, and independent dataset test, respectively. The current approach not only can play an important complementary role to the powerful covariant discriminant algorithm based on the pseudo amino acid composition representation (Chou, K. C. (2001) Proteins Struct. Funct. Genet. 43, 246-255; Correction (2001) Proteins Struct. Funct. Genet. 44, 60), but also may greatly stimulate the development of this area.

Keywords: biosvm
[Chen2002Gene] X. Chen, S. T. Cheung, S. So, S. T. Fan, C. Barry, J. Higgins, K.-M. Lai, J. Ji, S. Dudoit, I. O. L. Ng, M. Van De Rijn, D. Botstein, and P. O. Brown. Gene expression patterns in human liver cancers. Mol. Biol. Cell, 13(6):1929-1939, Jun 2002. [ bib | DOI | http | .pdf ]
Hepatocellular carcinoma (HCC) is a leading cause of death worldwide. Using cDNA microarrays to characterize patterns of gene expression in HCC, we found consistent differences between the expression patterns in HCC compared with those seen in nontumor liver tissues. The expression patterns in HCC were also readily distinguished from those associated with tumors metastatic to liver. The global gene expression patterns intrinsic to each tumor were sufficiently distinctive that multiple tumor nodules from the same patient could usually be recognized and distinguished from all the others in the large sample set on the basis of their gene expression patterns alone. The distinctive gene expression patterns are characteristic of the tumors and not the patient; the expression programs seen in clonally independent tumor nodules in the same patient were no more similar than those in tumors from different patients. Moreover, clonally related tumor masses that showed distinct expression profiles were also distinguished by genotypic differences. Some features of the gene expression patterns were associated with specific phenotypic and genotypic characteristics of the tumors, including growth rate, vascular invasion, and p53 overexpression.

Keywords: csbcbook-ch3, csbcbook
[Chan2002Comparison] Kwokleung Chan, Te-Won Lee, Pamela A Sample, Michael H Goldbaum, Robert N Weinreb, and Terrence J Sejnowski. Comparison of machine learning and traditional classifiers in glaucoma diagnosis. IEEE Trans Biomed Eng, 49(9):963-74, Sep 2002. [ bib | DOI | http | .pdf ]
Glaucoma is a progressive optic neuropathy with characteristic structural changes in the optic nerve head reflected in the visual field. The visual-field sensitivity test is commonly used in a clinical setting to evaluate glaucoma. Standard automated perimetry (SAP) is a common computerized visual-field test whose output is amenable to machine learning. We compared the performance of a number of machine learning algorithms with STATPAC indexes mean deviation, pattern standard deviation, and corrected pattern standard deviation. The machine learning algorithms studied included multilayer perceptron (MLP), support vector machine (SVM), and linear (LDA) and quadratic discriminant analysis (QDA), Parzen window, mixture of Gaussian (MOG), and mixture of generalized Gaussian (MGG). MLP and SVM are classifiers that work directly on the decision boundary and fall under the discriminative paradigm. Generative classifiers, which first model the data probability density and then perform classification via Bayes' rule, usually give deeper insight into the structure of the data space. We have applied MOG, MGG, LDA, QDA, and Parzen window to the classification of glaucoma from SAP. Performance of the various classifiers was compared by the areas under their receiver operating characteristic curves and by sensitivities (true-positive rates) at chosen specificities (true-negative rates). The machine-learning-type classifiers showed improved performance over the best indexes from STATPAC. Forward-selection and backward-elimination methodology further improved the classification rate and also has the potential to reduce testing time by diminishing the number of visual-field location measurements.

Keywords: Acute, Algorithms, Animals, Anion Exchange Resins, Artificial Intelligence, Automated, Base Pair Mismatch, Base Pairing, Base Sequence, Biological, Biosensing Techniques, Carcinoma, Chemical, Chromatography, Citric Acid Cycle, Classification, Cluster Analysis, Comparative Study, Computational Biology, Computer-Assisted, Cystadenoma, DNA, Databases, Decision Making, Diagnosis, Differential, Discriminant Analysis, Drug, Drug Design, Electrostatics, Epitopes, Eukaryotic Cells, Factual, False Negative Reactions, False Positive Reactions, Feasibility Studies, Female, Gene Expression, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Heterogeneity, Genetic Markers, Glaucoma, HLA Antigens, Hemolysins, Histocompatibility Antigens Class I, Humans, Internet, Intraocular Pressure, Ion Exchange, Lasers, Leukemia, Ligands, Likelihood Functions, Logistic Models, Lung Neoplasms, Lymphocytic, Lymphoma, Markov Chains, Mathematics, Messenger, Models, Molecular, Molecular Probe Techniques, Molecular Sequence Data, Nanotechnology, Neoplasm, Neoplasms, Neoplastic, Neural Networks (Computer), Neurological, Non-P.H.S., Non-Small-Cell Lung, Non-U.S. Gov't, Nucleic Acid Conformation, Nucleic Acid Hybridization, Observer Variation, Oligonucleotide Array Sequence Analysis, Open-Angle, Ophthalmoscopy, Optic Disk, Optic Nerve Diseases, Ovarian Neoplasms, P.H.S., Pattern Recognition, Peptides, Perimetry, Predictive Value of Tests, Probability, Probability Learning, Protein, Protein Binding, Protein Conformation, Proteins, Quality Control, Quantum Theory, RNA, RNA Splicing, ROC Curve, Receptors, Reference Values, Regression Analysis, Reproducibility of Results, Research Support, Robotics, Saccharomyces cerevisiae Proteins, Sensitivity and Specificity, Sequence Analysis, Signal Processing, Software, Statistical, Stomach Neoplasms, Structural, Structure-Activity Relationship, T-Lymphocyte, Thermodynamics, Transcription, Tumor Markers, U.S. Gov't, 12214886
[Cavalli2002Toward] A. Cavalli, E. Poluzzi, F. De Ponti, and M. Recanatini. Toward a pharmacophore for drugs inducing the long QT syndrome: insights from a CoMFA study of HERG K(+) channel blockers. J. Med. Chem., 45(18):3844-3853, Aug 2002. [ bib ]
In this paper, we present a pharmacophore for QT-prolonging drugs, along with a 3D QSAR (CoMFA) study for a series of very structurally variegate HERG K(+) channel blockers. The blockade of HERG K(+) channels is one of the most important molecular mechanisms through which QT-prolonging drugs increase cardiac action potential duration. Since QT prolongation is one of the most undesirable side effects of drugs, we first tried to identify the minimum set of molecular features responsible for this action and then we attempted to develop a quantitative model correlating the 3D stereoelectronic characteristics of the molecules with their HERG blocking potency. Having considered an initial set of 31 QT-prolonging drugs for which the HERG K(+) channel blocking activity was measured on mammalian transfected cells, we started the construction of a theoretical screening tool able to predict whether a new molecule can interact with the HERG channel and eventually induce the long QT syndrome. This in silico tool might be useful in the design of new drug candidates devoid of the physicochemical features likely to cause the above-mentioned side effect.

Keywords: chemoinformatics herg
[Catoni2002Data] O. Catoni. Data Compression and Adaptive Histograms. In Felipe Cucker and J. Maurice Rojas, editors, Foundations of Computational Mathematics, Proceedings of Smalefest 2000. World Scientific, 2002. [ bib | http | .pdf ]
[Cai2002Prediction] Y.-D. Cai, X.-J. Liu, X.-B. Xu, and G.-P. Zhou. Prediction of protein structural classes by support vector machines. Comput. Chem., 26(3):293-296, 2002. [ bib | DOI | http | .pdf ]
In this paper, we apply a new machine learning method which is called support vector machine to approach the prediction of protein structural class. The support vector machine method is performed based on the database derived from SCOP which is based upon domains of known structure and the evolutionary relationships and the principles that govern their 3D structure. As a result, high rates of both self-consistency and jackknife test are obtained. This indicates that the structural class of a protein is considerably correlated with its amino acid composition, and the support vector machine can be referred to as a powerful computational tool for predicting the structural classes of proteins.

Keywords: biosvm
[Cai2002Support] Y.-D. Cai, X.-J. Liu, X.-B. Xu, and K.-C. Chou. Support vector machines for prediction of protein subcellular location by incorporating quasi-sequence-order effect. J. Cell. Biochem., 84(2):343-348, 2002. [ bib | DOI | http | .pdf ]
Support Vector Machine (SVM), which is one class of learning machines, was applied to predict the subcellular location of proteins by incorporating the quasi-sequence-order effect (Chou [2000] Biochem. Biophys. Res. Commun. 278:477-483). In this study, the proteins are classified into the following 12 groups: (1) chloroplast, (2) cytoplasm, (3) cytoskeleton, (4) endoplasmic reticulum, (5) extracellular, (6) Golgi apparatus, (7) lysosome, (8) mitochondria, (9) nucleus, (10) peroxisome, (11) plasma membrane, and (12) vacuole, which account for most organelles and subcellular compartments in an animal or plant cell. Examinations for self-consistency and jackknife testing of the SVMs method were conducted for three sets consisting of 1,911, 2,044, and 2,191 proteins. The correct rates for self-consistency and the jackknife test values achieved with these protein sets were 94 and 83% [...] 89 and 75% [...] for correct prediction rates were undertaken with three independent testing datasets containing 2,148 proteins, 2,417 proteins, and 2,494 proteins producing values of 84, 77, and 74%, respectively.

Keywords: biosvm
[Cai2002Supportc] Y.D. Cai, X.J. Liu, X.B. Xu, and K.C. Chou. Support vector machines for the classification and prediction of beta-turn types. J. Pept. Sci., 8(7):297-301, 2002. [ bib | DOI | http | www: ]
The support vector machines (SVMs) method is proposed because it can reflect the sequence-coupling effect for a tetrapeptide in not only a beta-turn or non-beta-turn, but also in different types of beta-turn. The results of the model for 6022 tetrapeptides indicate that the rates of self-consistency for beta-turn types I, I', II, II', VI and VIII and non-beta-turns are 99.92% [...] 98.02% [...] training data, the rate of correct prediction by the SVMs for a given protein: rubredoxin (54 residues, 51 tetrapeptides) which includes 12 beta-turn type I tetrapeptides, 1 beta-turn type II tetrapeptide and 38 non-beta-turns reached 82.4% [...] of the SVMs implies that the formation of different beta-turn types or non-beta-turns is considerably correlated with the sequence of a tetrapeptide. The SVMs can save CPU time and avoid the overfitting problem compared with the neural network method.

Keywords: biosvm
[Cai2002Supportb] Y.D. Cai, X.J. Liu, X.B. Xu, and K.C. Chou. Support vector machines for predicting the specificity of GalNAc-transferase. Peptides, 23:205-208, 2002. [ bib | DOI | http | .pdf ]
Support Vector Machines (SVMs) which is one kind of learning machines, was applied to predict the specificity of GalNAc-transferase. The examination for the self-consistency and the jackknife test of the SVMs method were tested for the training dataset (305 oligopeptides), the correct rate of self-consistency and jackknife test reaches 100 and 84.9% [...] testing dataset (30 oligopeptides) was tested, the rate reaches 76.67%.

Keywords: biosvm
[Cai2002Supporta] Y.D. Cai, X.J. Liu, X.B. Xu, and K.C. Chou. Support Vector Machines for predicting HIV protease cleavage sites in protein. J. Comput. Chem., 23(2):267-274, 2002. [ bib | DOI | http | www: ]
Knowledge of the polyprotein cleavage sites by HIV protease will refine our understanding of its specificity, and the information thus acquired is useful for designing specific and efficient HIV protease inhibitors. The pace in searching for the proper inhibitors of HIV protease will be greatly expedited if one can find an accurate, robust, and rapid method for predicting the cleavage sites in proteins by HIV protease. In this article, a Support Vector Machine is applied to predict the cleavability of oligopeptides by proteases with multiple and extended specificity subsites. We selected HIV-1 protease as the subject of the study. Two hundred ninety-nine oligopeptides were chosen for the training set, while the other 63 oligopeptides were taken as a test set. Because of its high rate of self-consistency (299/299 = 100%) and correct prediction rate (55/63 = 87%), [...] Support Vector Machine method can be referred to as a useful assistant technique for finding effective inhibitors of HIV protease, which is one of the targets in designing potential drugs against AIDS. The principle of the Support Vector Machine method can also be applied to analyzing the specificity of other multisubsite enzymes.

Keywords: biosvm
[Butina2002Predicting] D. Butina, M. D. Segall, and K. Frankcombe. Predicting ADME properties in silico: methods and models. Drug Discov Today, 7(11 Suppl):S83-S88, Jun 2002. [ bib ]
Unfavourable absorption, distribution, metabolism and elimination (ADME) properties have been identified as a major cause of failure for candidate molecules in drug development. Consequently, there is increasing interest in the early prediction of ADME properties, with the objective of increasing the success rate of compounds reaching development. This review explores in silico approaches and selected published models for predicting ADME properties from chemical structure alone. In particular, we provide a comparison of methods based on pattern recognition to identify correlations between molecular descriptors and ADME properties, structural models based on classical molecular mechanics and quantum mechanical techniques for modelling chemical reactions.

Keywords: chemoinformatics
[Bryk2002Evidence] Mary Bryk, Scott D Briggs, Brian D Strahl, M. Joan Curcio, C. David Allis, and Fred Winston. Evidence that Set1, a factor required for methylation of histone H3, regulates rDNA silencing in S. cerevisiae by a Sir2-independent mechanism. Curr Biol, 12(2):165-170, Jan 2002. [ bib ]
Several types of histone modifications have been shown to control transcription. Recent evidence suggests that specific combinations of these modifications determine particular transcription patterns. The histone modifications most recently shown to play critical roles in transcription are arginine-specific and lysine-specific methylation. Lysine-specific histone methyltransferases all contain a SET domain, a conserved 130 amino acid motif originally identified in polycomb- and trithorax-group proteins from Drosophila. Members of the SU(VAR)3-9 family of SET-domain proteins methylate K9 of histone H3. Methylation of H3 has also been shown to occur at K4. Several studies have suggested a correlation between K4-methylated H3 and active transcription. In this paper, we provide evidence that K4-methylated H3 is required in a negative role, rDNA silencing in Saccharomyces cerevisiae. In a screen for rDNA silencing mutants, we identified a mutation in SET1, previously shown to regulate silencing at telomeres and HML. Recent work has shown that Set1 is a member of a complex and is required for methylation of K4 of H3 at several genomic locations. In addition, we demonstrate that a K4R change in H3, which prevents K4 methylation, impairs rDNA silencing, indicating that Set1 regulates rDNA silencing, directly or indirectly, via H3 methylation. Furthermore, we present several lines of evidence that the role of Set1 in rDNA silencing is distinct from that of the histone deacetylase Sir2. Together, these results suggest that Set1-dependent H3 methylation is required for rDNA silencing in a Sir2-independent fashion.

Keywords: Acetylation; DNA Methylation; DNA, Ribosomal, genetics; DNA-Binding Proteins, metabolism; Drosophila Proteins; Fungal Proteins, metabolism; Gene Silencing; Histone Deacetylases, metabolism; Histone-Lysine N-Methyltransferase; Histones, metabolism; Mutation; Saccharomyces cerevisiae Proteins; Saccharomyces cerevisiae, metabolism; Silent Information Regulator Proteins, Saccharomyces cerevisiae; Sirtuin 2; Sirtuins; Trans-Activators, metabolism; Transcription Factors, metabolism
[Brusic2002Prediction] V. Brusic, N. Petrovsky, G. Zhang, and V. B. Bajic. Prediction of promiscuous peptides that bind HLA class I molecules. Immunol. Cell Biol., 80(3):280-285, Jun 2002. [ bib ]
Promiscuous T-cell epitopes make ideal targets for vaccine development. We report here a computational system, MULTIPRED, for the prediction of peptide binding to the HLA-A2 supertype. It combines a novel representation of peptide/MHC interactions with a hidden Markov model as the prediction algorithm. MULTIPRED is both sensitive and specific, and demonstrates high accuracy of peptide-binding predictions for HLA-A*0201, *0204, and *0205 alleles, good accuracy for *0206 allele, and marginal accuracy for *0203 allele. MULTIPRED replaces earlier requirements for individual prediction models for each HLA allelic variant and simplifies computational aspects of peptide-binding prediction. Preliminary testing indicates that MULTIPRED can predict peptide binding to HLA-A2 supertype molecules with high accuracy, including those allelic variants for which no experimental binding data are currently available.

Keywords: Algorithms, Amino Acid Motifs, Amino Acid Sequence, Antigen-Antibody Complex, Automated, Binding Sites, Computational Biology, Drug Delivery Systems, Drug Design, Epitopes, Forecasting, Genes, HLA Antigens, HLA-A Antigens, HLA-A2 Antigen, HLA-DR Antigens, Humans, Internet, MHC Class I, Markov Chains, Molecular Sequence Data, Neural Networks (Computer), Pattern Recognition, Peptide Fragments, Peptides, Protein, Protein Binding, Protein Interaction Mapping, Sensitivity and Specificity, Sequence Analysis, Software, T-Lymphocyte, User-Computer Interface, Viral Vaccines, 12067415
[Briggs2002Gene] Scott D Briggs, Tiaojiang Xiao, Zu-Wen Sun, Jennifer A Caldwell, Jeffrey Shabanowitz, Donald F Hunt, C. David Allis, and Brian D Strahl. Gene silencing: trans-histone regulatory pathway in chromatin. Nature, 418(6897):498, Aug 2002. [ bib | DOI | http ]
The fundamental unit of eukaryotic chromatin, the nucleosome, consists of genomic DNA wrapped around the conserved histone proteins H3, H2B, H2A and H4, all of which are variously modified at their amino- and carboxy-terminal tails to influence the dynamics of chromatin structure and function - for example, conjugation of histone H2B with ubiquitin controls the outcome of methylation at a specific lysine residue (Lys 4) on histone H3, which regulates gene silencing in the yeast Saccharomyces cerevisiae. Here we show that ubiquitination of H2B is also necessary for the methylation of Lys 79 in H3, the only modification known to occur away from the histone tails, but that not all methylated lysines in H3 are regulated by this 'trans-histone' pathway because the methylation of Lys 36 in H3 is unaffected. Given that gene silencing is regulated by the methylation of Lys 4 and Lys 79 in histone H3, we suggest that H2B ubiquitination acts as a master switch that controls the site-selective histone methylation patterns responsible for this silencing.

Keywords: Chromatin, chemistry/metabolism; Gene Expression Regulation, Fungal; Gene Silencing; Histone-Lysine N-Methyltransferase; Histones, chemistry/metabolism; Ligases, metabolism; Methylation; Models, Biological; Nuclear Proteins, metabolism; Saccharomyces cerevisiae Proteins; Saccharomyces cerevisiae, genetics/metabolism; Ubiquitin, metabolism; Ubiquitin-Conjugating Enzymes
[Bowd2002Comparing] Christopher Bowd, Kwokleung Chan, Linda M Zangwill, Michael H Goldbaum, Te-Won Lee, Terrence J Sejnowski, and Robert N Weinreb. Comparing neural networks and linear discriminant functions for glaucoma detection using confocal scanning laser ophthalmoscopy of the optic disc. Invest Ophthalmol Vis Sci, 43(11):3444-54, Nov 2002. [ bib | http | .pdf ]
PURPOSE: To determine whether neural network techniques can improve differentiation between glaucomatous and nonglaucomatous eyes, using the optic disc topography parameters of the Heidelberg Retina Tomograph (HRT; Heidelberg Engineering, Heidelberg, Germany). METHODS: With the HRT, one eye was imaged from each of 108 patients with glaucoma (defined as having repeatable visual field defects with standard automated perimetry) and 189 subjects without glaucoma (no visual field defects with healthy-appearing optic disc and retinal nerve fiber layer on clinical examination) and the optic nerve topography was defined by 17 global and 66 regional HRT parameters. With all the HRT parameters used as input, receiver operating characteristic (ROC) curves were generated for the classification of eyes, by three neural network techniques: linear and Gaussian support vector machines (SVM linear and SVM Gaussian, respectively) and a multilayer perceptron (MLP), as well as four previously proposed linear discriminant functions (LDFs) and one LDF developed on the current data with all HRT parameters used as input. RESULTS: The areas under the ROC curves for SVM linear and SVM Gaussian were 0.938 and 0.945, respectively; for MLP, 0.941; for the current LDF, 0.906; and for the best previously proposed LDF, 0.890. With the use of forward selection and backward elimination optimization techniques, the areas under the ROC curves for SVM Gaussian and the current LDF were increased to approximately 0.96. CONCLUSIONS: Trained neural networks, with global and regional HRT parameters used as input, improve on previously proposed HRT parameter-based LDFs for discriminating between glaucomatous and nonglaucomatous eyes. The performance of both neural networks and LDFs can be improved with optimization of the features in the input. Neural network analyses show promise for increasing diagnostic accuracy of tests for glaucoma.

Keywords: Acute, Algorithms, Animals, Anion Exchange Resins, Artificial Intelligence, Automated, Base Pair Mismatch, Base Pairing, Base Sequence, Biological, Biosensing Techniques, Carcinoma, Chemical, Chromatography, Citric Acid Cycle, Classification, Cluster Analysis, Comparative Study, Computational Biology, Computer-Assisted, Cystadenoma, DNA, Databases, Decision Making, Diagnosis, Differential, Discriminant Analysis, Drug, Drug Design, Electrostatics, Eukaryotic Cells, Factual, Feasibility Studies, Female, Gene Expression, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Heterogeneity, Genetic Markers, Glaucoma, Hemolysins, Humans, Internet, Intraocular Pressure, Ion Exchange, Lasers, Leukemia, Ligands, Likelihood Functions, Logistic Models, Lung Neoplasms, Lymphocytic, Lymphoma, Markov Chains, Mathematics, Messenger, Models, Molecular, Molecular Probe Techniques, Molecular Sequence Data, Nanotechnology, Neoplasm, Neoplasms, Neoplastic, Neural Networks (Computer), Non-P.H.S., Non-Small-Cell Lung, Non-U.S. Gov't, Nucleic Acid Conformation, Nucleic Acid Hybridization, Observer Variation, Oligonucleotide Array Sequence Analysis, Open-Angle, Ophthalmoscopy, Optic Disk, Ovarian Neoplasms, P.H.S., Pattern Recognition, Probability, Probability Learning, Protein Binding, Protein Conformation, Proteins, Quality Control, Quantum Theory, RNA, RNA Splicing, ROC Curve, Receptors, Reference Values, Regression Analysis, Reproducibility of Results, Research Support, Robotics, Saccharomyces cerevisiae Proteins, Sensitivity and Specificity, Sequence Analysis, Signal Processing, Software, Statistical, Stomach Neoplasms, Structural, Structure-Activity Relationship, Thermodynamics, Transcription, Tumor Markers, U.S. Gov't, 12407155
[Boobis2002In] A. Boobis, U. Gundert-Remy, P. Kremers, P. Macheras, and O. Pelkonen. In silico prediction of ADME and pharmacokinetics. Report of an expert meeting organised by COST B15. Eur. J. Pharm. Sci., 17(4-5):183-193, Dec 2002. [ bib ]
The computational approach is one of the newest and fastest developing techniques in pharmacokinetics, ADME (absorption, distribution, metabolism, excretion) evaluation, drug discovery and toxicity. However, to date, the software packages devoted to ADME prediction, especially of metabolism, have not yet been adequately validated and still require improvements to be effective. Most are 'open' systems, under constant evolution and able to incorporate rapidly, and often easily, new information from user or developer databases. Quantitative in silico predictions are now possible for several pharmacokinetic (PK) parameters, particularly absorption and distribution. The emerging consensus is that the predictions are no worse than those made using in vitro tests, with the decisive advantage that much less investment in technology, resources and time is needed. In addition, and of critical importance, it is possible to screen virtual compounds. Some packages are able to handle thousands of molecules in a few hours. However, common experience shows that, in part at least for essentially irrational reasons, there is currently a lack of confidence in these approaches. An effort should be made by the software producers towards more transparency, in order to improve the confidence of their consumers. It seems highly probable that in silico approaches will evolve rapidly, as did in vitro methods during the last decade. Past experience with the latter should be helpful in avoiding repetition of similar errors and in taking the necessary steps to ensure effective implementation. A general concern is the lack of access to the large amounts of data on compounds no longer in development, but still kept secret by the pharmaceutical industry. Controlled access to these data could be particularly helpful in validating new in silico approaches.

Keywords: Adsorption, Biological Availability, Chemical, Computer Simulation, Models, Pharmaceutical, Pharmaceutical Preparations, Predictive Value of Tests, Software, Technology, 12453607
[Bock2002New] J. R. Bock and D. A. Gough. A New Method to Estimate Ligand-Receptor Energetics. Mol Cell Proteomics, 1(11):904-910, 2002. [ bib | http | .pdf ]
In the discovery of new drugs, lead identification and optimization have assumed critical importance given the number of drug targets generated from genetic, genomics, and proteomic technologies. High-throughput experimental screening assays have been complemented recently by "virtual screening" approaches to identify and filter potential ligands when the characteristics of a target receptor structure of interest are known. Virtual screening mandates a reliable procedure for automatic ranking of structurally distinct ligands in compound library databases. Computing a rank score requires the accurate prediction of binding affinities between these ligands and the target. Many current scoring strategies require information about the target three-dimensional structure. In this study, a new method to estimate the free binding energy between a ligand and receptor is proposed. We extend a central idea previously reported (Bock, J. R., and Gough, D. A. (2001) Predicting protein-protein interactions from primary structure. Bioinformatics 17, 455-460; Bock, J. R., and Gough, D. A. (2002) Whole-proteome interaction mining. Bioinformatics, in press) that uses simple descriptors to represent biomolecules as input examples to train a support vector machine (Smola, A. J., and Scholkopf, B. (1998) A Tutorial on Support Vector Regression, NeuroCOLT Technical Report NC-TR-98-030, Royal Holloway College, University of London, UK) and the application of the trained system to previously unseen pairs, estimating their propensity for interaction. Here we seek to learn the function that maps features of a receptor-ligand pair onto their equilibrium free binding energy. These features do not comprise any direct information about the three-dimensional structures of ligand or target. 
In cross-validation experiments, it is demonstrated that objective measurements of prediction error rate and rank-ordering statistics are competitive with those of several other investigations, most of which depend on three-dimensional structural data. The size of the sample (n = 2,671) indicates that this approach is robust and may have widespread applicability beyond restricted families of receptor types. It is concluded that newly sequenced proteins, or those for which three-dimensional crystal structures are not easily obtained, can be rapidly analyzed for their binding potential against a library of ligands using this methodology.

Keywords: biosvm
[Belongie02Shape] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Anal. Mach. Intell., 24(4):509-522, 2002. [ bib | DOI | http | .pdf ]
We present a novel approach to measuring similarity between shapes and exploit it for object recognition. In our framework, the measurement of similarity is preceded by: (1) solving for correspondences between points on the two shapes; (2) using the correspondences to estimate an aligning transform. In order to solve the correspondence problem, we attach a descriptor, the shape context, to each point. The shape context at a reference point captures the distribution of the remaining points relative to it, thus offering a globally discriminative characterization. Corresponding points on two similar shapes will have similar shape contexts, enabling us to solve for correspondences as an optimal assignment problem. Given the point correspondences, we estimate the transformation that best aligns the two shapes; regularized thin-plate splines provide a flexible class of transformation maps for this purpose. The dissimilarity between the two shapes is computed as a sum of matching errors between corresponding points, together with a term measuring the magnitude of the aligning transform. We treat recognition in a nearest-neighbor classification framework as the problem of finding the stored prototype shape that is maximally similar to that in the image. Results are presented for silhouettes, trademarks, handwritten digits, and the COIL data set.

[Bartlett2002Rademacher] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. J. Mach. Learn. Res., 3:463-482, 2002. [ bib | .pdf | .pdf ]
[Bao2002Identifying] L. Bao and Z. Sun. Identifying genes related to drug anticancer mechanisms using support vector machine. FEBS Lett., 521:109-114, 2002. [ bib | .html | .pdf ]
In an effort to identify genes related to the cell line chemosensitivity and to evaluate the functional relationships between genes and anticancer drugs acting by the same mechanism, a supervised machine learning approach called support vector machine was used to label genes into any of the five predefined anticancer drug mechanistic categories. Among dozens of unequivocally categorized genes, many were known to be causally related to the drug mechanisms. For example, a few genes were found to be involved in the biological process triggered by the drugs (e.g. DNA polymerase epsilon was the direct target for the drugs from DNA antimetabolites category). DNA repair-related genes were found to be enriched for about eight-fold in the resulting gene set relative to the entire gene set. Some uncharacterized transcripts might be of interest in future studies. This method of correlating the drugs and genes provides a strategy for finding novel biologically significant relationships for molecular pharmacology.

Keywords: biosvm microarray
[Balasubramanian2002isomap] M. Balasubramanian and E. L. Schwartz. The isomap algorithm and topological stability. Science, 295(5552):7, Jan 2002. [ bib | DOI | http | .pdf ]
Keywords: dimred
[Balakin2002Property-based] K. V. Balakin, S. E. Tkachenko, S. A. Lang, I. Okun, A. A. Ivashchenko, and N. P. Savchuk. Property-based design of GPCR-targeted library. J. Chem. Inf. Comput. Sci., 42(6):1332-1342, 2002. [ bib ]
The design of a GPCR-targeted library, based on a scoring scheme for the classification of molecules into "GPCR-ligand-like" and "non-GPCR-ligand-like", is outlined. The methodology is a valuable tool that can aid in the selection and prioritization of potential GPCR ligands for bioscreening from large collections of compounds. It is based on the distillation of knowledge from large databases of GPCR and non-GPCR active agents. The method employs a set of descriptors to encode the molecular structures and trains a neural network to classify the molecules. The molecular requirements were profiled and validated by using available databases of GPCR- and non-GPCR-active agents [5736 diverse GPCR-active molecules and 7506 diverse non-GPCR-active molecules from the Ensemble Database (Prous Science, 2002)]. The method enables efficient qualification or disqualification of a molecule as a potential GPCR ligand and represents a useful tool for constraining the size of GPCR-targeted libraries that will help speed up the development of new GPCR-active drugs.

Keywords: chemogenomics
[Bajorath2002Integration] J. Bajorath. Integration of virtual and high-throughput screening. Nat Rev Drug Discov, 1(11):882-894, Nov 2002. [ bib | DOI | http ]
High-throughput and virtual screening are important components of modern drug discovery research. Typically, these screening technologies are considered distinct approaches, as one is experimental and the other is theoretical in nature. However, given their similar tasks and goals, these approaches are much more complementary to each other than often thought. Various statistical, informatics and filtering methods have recently been introduced to foster the integration of experimental and in silico screening and maximize their output in drug discovery. Although many of these ideas and efforts have not yet proceeded much beyond the conceptual level, there are several success stories and good indications that early-stage drug discovery will benefit greatly from a more unified and knowledge-based approach to biological screening, despite the many technical advances towards even higher throughput that are made in the screening arena.

Keywords: Animals, Cluster Analysis, Computer Simulation, DNA Fingerprinting, Drug Design, Drug Evaluation, Humans, Pharmaceutical, Preclinical, Quantitative Structure-Activity Relationship, Structure-Activity Relationship, Technology, 12415248
[Bach2002Kernel] F.R. Bach and M.I. Jordan. Kernel independent component analysis. J. Mach. Learn. Res., 3:1-48, 2002. [ bib | .html | .pdf ]
[Andrews2002Multiple] S. Andrews, T. Hofmann, and I. Tsochantaridis. Multiple Instance Learning with Generalized Support Vector Machines. In Proceedings of the Eighteenth National Conference on Artificial Intelligence, pages 943-944. American Association for Artificial Intelligence, 2002. [ bib ]
Keywords: kernel-theory
[Ambroise2002Selection] C. Ambroise and G.J. McLachlan. Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl. Acad. Sci. USA, 99(10):6562-6566, 2002. [ bib | http | .pdf ]
In the context of cancer diagnosis and treatment, we consider the problem of constructing an accurate prediction rule on the basis of a relatively small number of tumor tissue samples of known type containing the expression data on very many (possibly thousands) genes. Recently, results have been presented in the literature suggesting that it is possible to construct a prediction rule from only a few genes such that it has a negligible prediction error rate. However, in these results the test error or the leave-one-out cross-validated error is calculated without allowance for the selection bias. There is no allowance because the rule is either tested on tissue samples that were used in the first instance to select the genes being used in the rule or because the cross-validation of the rule is not external to the selection process; that is, gene selection is not performed in training the rule at each stage of the cross-validation process. We describe how in practice the selection bias can be assessed and corrected for by either performing a cross-validation or applying the bootstrap external to the selection process. We recommend using 10-fold rather than leave-one-out cross-validation, and concerning the bootstrap, we suggest using the so-called .632+ bootstrap error estimate designed to handle overfitted prediction rules. Using two published data sets, we demonstrate that when correction is made for the selection bias, the cross-validated error is no longer zero for a subset of only a few genes.

Keywords: featureselection biosvm
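
The external cross-validation described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the gene-ranking score (absolute difference of class means), the nearest-centroid classifier, the interleaved fold assignment, and all function names are assumptions chosen to keep the example short. The point it shows is that gene selection is repeated inside every training fold, so test samples never influence which genes are chosen.

```python
def select_genes(X, y, n_genes):
    """Rank genes by absolute difference of class means; keep the top n_genes."""
    def class_mean(j, label):
        vals = [row[j] for row, lab in zip(X, y) if lab == label]
        return sum(vals) / len(vals)
    scores = [abs(class_mean(j, 0) - class_mean(j, 1)) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:n_genes]


def nearest_centroid_predict(X_train, y_train, x, genes):
    """Assign x to the class whose centroid (over the selected genes) is closer."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [r for r, lab in zip(X_train, y_train) if lab == label]
        centroid = [sum(r[j] for r in rows) / len(rows) for j in genes]
        dist = sum((x[j] - c) ** 2 for j, c in zip(genes, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def external_cv_error(X, y, k, n_genes):
    """k-fold CV in which gene selection is redone inside every training fold.

    Assumes binary labels 0/1 and that every training fold contains both classes.
    """
    n = len(X)
    errors = 0
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        # Selection uses the training fold only; test samples are never seen here.
        genes = select_genes(X_tr, y_tr, n_genes)
        for i in test_idx:
            if nearest_centroid_predict(X_tr, y_tr, X[i], genes) != y[i]:
                errors += 1
    return errors / n


# Hypothetical toy data: 6 samples, 3 genes, only gene 0 is informative.
X = [[0.0, 5.0, 1.0], [10.0, 5.1, 1.2], [0.2, 4.9, 0.9],
     [9.8, 5.0, 1.1], [0.1, 5.2, 1.0], [10.1, 4.8, 1.0]]
y = [0, 1, 0, 1, 0, 1]
cv_error = external_cv_error(X, y, k=3, n_genes=1)
```

The biased variant criticized in the abstract would call `select_genes` once on all samples before the loop; keeping the call inside the loop is the whole difference.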
[Aliferis2002Machine] C.F. Aliferis, D.P. Hardin, and P. Massion. Machine Learning Models For Lung Cancer Classification Using Array Comparative Genomic Hybridization. In Proceedings of the 2002 American Medical Informatics Association (AMIA) Annual Symposium, pages 7-11, 2002. [ bib | .pdf ]
Array CGH is a recently introduced technology that measures changes in the gene copy number of hundreds of genes in a single experiment. The primary goal of this study was to develop machine learning models that classify non-small Lung Cancers according to histopathology types and to compare several machine learning methods in this learning task. DNA from tumors of 37 patients (21 squamous carcinomas, and 16 adenocarcinomas) were extracted and hybridized onto a 452 BAC clone array. The following algorithms were used: KNN, Decision Tree Induction, Support Vector Machines and Feed-Forward Neural Networks. Performance was measured via leave-one-out classification accuracy. The best multi-gene model found had a leave-one-out accuracy of 89.2%. Decision Trees performed poorer than the other methods in this learning task and dataset. We conclude that gene copy numbers as measured by array CGH are, collectively, an excellent indicator of histological subtype. Several interesting research directions are discussed.

Keywords: biosvm microarray, cgh
[Alberts2002Molecular] B. Alberts, A. Johnson, J. Lewis, M. Raff, K. Roberts, and P. Walter. Molecular Biology of the Cell. Garland Science, Taylor & Francis Group, LLC, 2002. Fourth Edition. [ bib ]
Keywords: csbcbook
[Albert2002Statistical] R. Albert and A.L. Barabási. Statistical mechanics of complex networks. Rev. Mod. Phys., 74:47-97, 2002. [ bib | .pdf | .pdf ]
[Yang2002Normalization] Y. H. Yang, S. Dudoit, P. Luu, D. M. Lin, V. Peng, J. Ngai, and T. P. Speed. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res., 30(4), February 2002. [ bib | DOI | http ]
There are many sources of systematic variation in cDNA microarray experiments which affect the measured gene expression levels (e.g. differences in labeling efficiency between the two fluorescent dyes). The term normalization refers to the process of removing such variation. A constant adjustment is often used to force the distribution of the intensity log ratios to have a median of zero for each slide. However, such global normalization approaches are not adequate in situations where dye biases can depend on spot overall intensity and/or spatial location within the array. This article proposes normalization methods that are based on robust local regression and account for intensity and spatial dependence in dye biases for different types of cDNA microarray experiments. The selection of appropriate controls for normalization is discussed and a novel set of controls (microarray sample pool, MSP) is introduced to aid in intensity-dependent normalization. Lastly, to allow for comparisons of expression levels across slides, a robust method based on maximum likelihood estimation is proposed to adjust for scale differences among slides.
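
As a small illustration of the constant adjustment mentioned in the abstract above, the sketch below shifts the log ratios of one slide so that their median is zero. It covers only this global baseline, which the abstract calls inadequate when dye bias depends on intensity or location; it does not implement the local-regression methods the article proposes. The function name and the intensity values are hypothetical.

```python
import math
from statistics import median


def global_median_normalize(red, green):
    """Return log2(R/G) ratios shifted so that the slide median is zero."""
    log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    shift = median(log_ratios)
    return [m - shift for m in log_ratios]


# Hypothetical background-corrected intensities for five spots on one slide.
red = [1200.0, 800.0, 1500.0, 640.0, 2000.0]
green = [600.0, 400.0, 500.0, 320.0, 1000.0]
normalized = global_median_normalize(red, green)
```

Here four of the five spots share the same raw ratio, so after the shift they sit at zero and only the third spot retains a positive log ratio.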

[Taylor2002Protein] W. R. Taylor. Protein structure comparison using bipartite graph matching and its application to protein structure classification. Mol. Cell. Proteomics, 1(4):334-339, April 2002. [ bib | http ]
A measure of protein structure similarity is calculated from the matching of pairs of secondary structure elements between two proteins. The interaction of each pair was estimated from their axial line segments and combined with other geometric features to produce an optimal discrimination between intrafamily and interfamily relationships. The matching used a fast bipartite graph-matching algorithm that avoids the computational complexity of searching for the full subgraph isomorphism between the two sets of interactions. The main algorithm used was the "stable marriage" algorithm, which works on the ranked "preferences" of one interaction for another. The method takes 1/10 of a second for a typical comparison making it suitable as a fast pre-filter for slower, more exhaustive approaches. An application to protein structure classification is described.

Keywords: structure_classification, structure_comparison
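
The "stable marriage" matching named in the abstract above can be sketched with the classic Gale-Shapley procedure. This is a generic textbook version under stated assumptions (two equal-sized sets and complete preference lists); how the paper derives preferences from secondary-structure interactions is not given in the abstract, so the item names and preference lists below are hypothetical.

```python
def stable_marriage(proposer_prefs, receiver_prefs):
    """Gale-Shapley matching; returns a dict {proposer: receiver}.

    Assumes both sides have the same size and rank every member of the other side.
    """
    # rank[r][p] = position of proposer p in receiver r's list (lower is better)
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in receiver_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}  # index of next receiver to try
    engaged = {}                                  # receiver -> proposer
    free = list(proposer_prefs)
    while free:
        p = free.pop()
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = engaged.get(r)
        if current is None:
            engaged[r] = p                # receiver was unmatched
        elif rank[r][p] < rank[r][current]:
            engaged[r] = p                # receiver prefers the new proposer
            free.append(current)
        else:
            free.append(p)                # rejected; p tries its next choice later
    return {p: r for r, p in engaged.items()}


# Hypothetical ranked preferences between items of two structures.
prefs_a = {"a1": ["b1", "b2", "b3"],
           "a2": ["b1", "b3", "b2"],
           "a3": ["b2", "b1", "b3"]}
prefs_b = {"b1": ["a2", "a1", "a3"],
           "b2": ["a1", "a3", "a2"],
           "b3": ["a3", "a2", "a1"]}
matching = stable_marriage(prefs_a, prefs_b)
```

Each proposer makes at most one proposal per receiver, so the loop runs in time quadratic in the set size, which is consistent with the abstract's use of the method as a fast pre-filter.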
[Roche2002virtual] O. Roche, G. Trube, J. Zuegge, P. Pflimlin, A. Alanine, and G. Schneider. A virtual screening method for prediction of the HERG potassium channel liability of compound libraries. ChemBioChem, 3(5):455-459, May 2002. [ bib ]
A computer-based method has been developed for prediction of the hERG (human ether-à-go-go related gene) K(+)-channel affinity of low molecular weight compounds. hERG channel blockage is a major concern in drug design, as such blocking agents can cause sudden cardiac death. Various techniques were applied to finding appropriate molecular descriptors for modeling structure-activity relationships: substructure analysis, self-organizing maps (SOM), principal component analysis (PCA), partial least squares fitting (PLS), and supervised neural networks. The most accurate prediction system was based on an artificial neural network. In a validation study, 93 % of the nonblocking agents and 71 % of the hERG channel blockers were correctly classified. This virtual screening method can be used for general compound-library shaping and combinatorial library design.

Keywords: chemoinformatics herg
[Ong2002Stable] Shao-En Ong, Blagoy Blagoev, Irina Kratchmarova, Dan Bach Kristensen, Hanno Steen, Akhilesh Pandey, and Matthias Mann. Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Mol Cell Proteomics, 1(5):376-386, May 2002. [ bib ]
Quantitative proteomics has traditionally been performed by two-dimensional gel electrophoresis, but recently, mass spectrometric methods based on stable isotope quantitation have shown great promise for the simultaneous and automated identification and quantitation of complex protein mixtures. Here we describe a method, termed SILAC, for stable isotope labeling by amino acids in cell culture, for the in vivo incorporation of specific amino acids into all mammalian proteins. Mammalian cell lines are grown in media lacking a standard essential amino acid but supplemented with a non-radioactive, isotopically labeled form of that amino acid, in this case deuterated leucine (Leu-d3). We find that growth of cells maintained in these media is no different from growth in normal media as evidenced by cell morphology, doubling time, and ability to differentiate. Complete incorporation of Leu-d3 occurred after five doublings in the cell lines and proteins studied. Protein populations from experimental and control samples are mixed directly after harvesting, and mass spectrometric identification is straightforward as every leucine-containing peptide incorporates either all normal leucine or all Leu-d3. We have applied this technique to the relative quantitation of changes in protein expression during the process of muscle cell differentiation. Proteins that were found to be up-regulated during this process include glyceraldehyde-3-phosphate dehydrogenase, fibronectin, and pyruvate kinase M2. SILAC is a simple, inexpensive, and accurate procedure that can be used as a quantitative proteomic approach in any cell culture system.

Keywords: 3T3 Cells; Amino Acids; Animals; Cell Culture Techniques; Cell Differentiation; Cell Line; Deuterium; Genetic Techniques; Hydrogen-Ion Concentration; Leucine; Mice; Muscles; Peptides; Proteomics; Time Factors; Up-Regulation
[Mering2002Comparative] C. von Mering, R. Krause, B. Snel, M. Cornell, S. G. Oliver, S. Fields, and P. Bork. Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 417(6887):399-403, May 2002. [ bib | DOI | http ]
Comprehensive protein-protein interaction maps promise to reveal many aspects of the complex regulatory network underlying cellular function. Recently, large-scale approaches have predicted many new protein interactions in yeast. To measure their accuracy and potential as well as to identify biases, strengths and weaknesses, we compare the methods with each other and with a reference set of previously reported protein interactions.

[Gestel2002Bayesian] T. Van Gestel, J. A. K. Suykens, G. Lanckriet, A. Lambrechts, B. De Moor, and J. Vandewalle. Bayesian framework for least-squares support vector machine classifiers, Gaussian processes, and kernel Fisher discriminant analysis. Neural Comput, 14(5):1115-47, May 2002. [ bib | DOI | http | .pdf ]
The Bayesian evidence framework has been successfully applied to the design of multilayer perceptrons (MLPs) in the work of MacKay. Nevertheless, the training of MLPs suffers from drawbacks like the nonconvex optimization problem and the choice of the number of hidden units. In support vector machines (SVMs) for classification, as introduced by Vapnik, a nonlinear decision boundary is obtained by mapping the input vector first in a nonlinear way to a high-dimensional kernel-induced feature space in which a linear large margin classifier is constructed. Practical expressions are formulated in the dual space in terms of the related kernel function, and the solution follows from a (convex) quadratic programming (QP) problem. In least-squares SVMs (LS-SVMs), the SVM problem formulation is modified by introducing a least-squares cost function and equality instead of inequality constraints, and the solution follows from a linear system in the dual space. Implicitly, the least-squares formulation corresponds to a regression formulation and is also related to kernel Fisher discriminant analysis. The least-squares regression formulation has advantages for deriving analytic expressions in a Bayesian evidence framework, in contrast to the classification formulations used, for example, in gaussian processes (GPs). The LS-SVM formulation has clear primal-dual interpretations, and without the bias term, one explicitly constructs a model that yields the same expressions as have been obtained with GPs for regression. In this article, the Bayesian evidence framework is combined with the LS-SVM classifier formulation. Starting from the feature space formulation, analytic expressions are obtained in the dual space on the different levels of Bayesian inference, while posterior class probabilities are obtained by marginalizing over the model parameters. 
Empirical results obtained on 10 public domain data sets show that the LS-SVM classifier designed within the Bayesian evidence framework consistently yields good generalization performances.

[Ekins2002Three-dimensional] S. Ekins, W. J. Crumb, R. D. Sarazan, J. H. Wikel, and S. A. Wrighton. Three-dimensional quantitative structure-activity relationship for inhibition of human ether-a-go-go-related gene potassium channel. J. Pharmacol. Exp. Ther., 301(2):427-434, May 2002. [ bib ]
The protein product of the human ether-a-go-go gene (hERG) is a potassium channel that when inhibited by some drugs may lead to cardiac arrhythmia. Previously, a three-dimensional quantitative structure-activity relationship (3D-QSAR) pharmacophore model was constructed using Catalyst with in vitro inhibition data for antipsychotic agents. The rationale of the current study was to use a combination of in vitro and in silico technologies to further test the pharmacophore model and qualitatively predict whether molecules are likely to inhibit this potassium channel. These predictions were assessed with the experimental data using the Spearman's rho rank correlation. The antipsychotic-based hERG inhibitor model produced a statistically significant Spearman's rho of 0.71 for 11 molecules. In addition, 15 molecules from the literature were used as a further test set and were also well ranked by the same model with a statistically significant Spearman's rho value of 0.76. A Catalyst General hERG pharmacophore model was generated with these literature molecules, which contained four hydrophobic features and one positive ionizable feature. Linear regression of log-transformed observed versus predicted IC(50) values for this training set resulted in an r(2) value of 0.90. The model based on literature data was evaluated with the in vitro data generated for the original 22 molecules (including the antipsychotics) and illustrated a significant Spearman's rho of 0.77. Thus, the Catalyst 3D-QSAR approach provides useful qualitative predictions for test set molecules. The model based on literature data therefore provides a potentially valuable tool for discovery chemistry as future molecules may be synthesized that are less likely to inhibit hERG based on information provided by a pharmacophore for the inhibition of this potassium channel.

Keywords: herg
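
The Spearman's rho rank correlation used in the abstract above to compare predicted and observed inhibition can be computed as the Pearson correlation of the two rank vectors. The sketch below is a plain-Python version with average ranks for ties; the function names and the example values are hypothetical and are not data from the study.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical observed and predicted potency values for five molecules.
observed = [0.5, 1.2, 3.0, 8.0, 20.0]
predicted = [0.8, 0.6, 4.0, 6.0, 30.0]
rho = spearman_rho(observed, predicted)
```

Because only the ordering matters, a model can score well on this statistic while its absolute predictions are off, which matches the abstract's emphasis on qualitative ranking.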
[Dahlquist2002GenMAPP] K. D. Dahlquist, N. Salomonis, K. Vranizan, S. C. Lawlor, and B. R. Conklin. GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nat. Genet., 31(1):19-20, May 2002. [ bib | DOI | http | .pdf ]
[Collobert2002parallel] Ronan Collobert, Samy Bengio, and Yoshua Bengio. A parallel mixture of SVMs for very large scale problems. Neural Comput, 14(5):1105-14, May 2002. [ bib | DOI | http ]
Support vector machines (SVMs) are the state-of-the-art models for many classification problems, but they suffer from the complexity of their training algorithm, which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve real-life problems having more than a few hundred thousand examples with SVMs. This article proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole data set. Experiments on a large benchmark data set (Forest) yielded significant time improvement (time complexity appears empirically to locally grow linearly with the number of examples). In addition, and surprisingly, a significant improvement in generalization was observed.

[Shawe-Taylor2002On] J. Shawe-Taylor and N. Cristianini. On the Generalization of Soft Margin Algorithms. IEEE Transactions on Information Theory, 48(10):2721-2735, October 2002. [ bib | DOI | http | .pdf ]
[Gasteiger2003Chemoinformatics] J. Gasteiger and T. Engel, editors. Chemoinformatics : a Textbook. Wiley, New York, NY, USA, 2003. [ bib ]
Keywords: chemoinformatics
[Zhu2003Introduction] Lingyun Zhu, Baoming Wu, and Changxiu Cao. Introduction to medical data mining. Sheng Wu Yi Xue Gong Cheng Xue Za Zhi, 20(3):559-62, Sep 2003. [ bib ]
Modern medicine generates a great deal of information, which is stored in medical databases. Extracting useful knowledge from these databases to support scientific decision-making in the diagnosis and treatment of disease is becoming increasingly necessary. Data mining in medicine can address this problem. It can also improve the management of hospital information and promote the development of telemedicine and community medicine. Because medical information is characterized by redundancy, multiple attributes, incompleteness, and a close relationship with time, medical data mining differs from data mining in other fields. In this paper we discuss the key techniques of medical data mining, including preprocessing of medical data, fusion of different patterns and resources, fast and robust mining algorithms, and the reliability of mining results. Methods and applications of medical data mining based on computational intelligence, such as artificial neural networks, fuzzy systems, evolutionary algorithms, rough sets, and support vector machines, are introduced. The features and problems of data mining are summarized in the last section.

Keywords: Algorithms, Anion Exchange Resins, Automatic Data Processing, Chemical, Chromatography, Computational Biology, Computer-Assisted, Data Interpretation, Databases, Decision Making, Decision Trees, English Abstract, Factual, Fuzzy Logic, Humans, Indicators and Reagents, Information Storage and Retrieval, Ion Exchange, Models, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Nucleic Acid Conformation, P.H.S., Proteins, Quantitative Structure-Activity Relationship, RNA, ROC Curve, Research Support, Sequence Analysis, Statistical, Transfer, U.S. Gov't, 14565039
[Zhao2003Application] Y. Zhao, C. Pinilla, D. Valmori, R. Martin, and R. Simon. Application of support vector machines for T-cell epitopes prediction. Bioinformatics, 19(15):1978-1984, 2003. [ bib | http | .pdf ]
Motivation: The T-cell receptor, a major histocompatibility complex (MHC) molecule, and a bound antigenic peptide, play major roles in the process of antigen-specific T-cell activation. T-cell recognition was long considered exquisitely specific. Recent data also indicate that it is highly flexible, and one receptor may recognize thousands of different peptides. Deciphering the patterns of peptides that elicit a MHC restricted T-cell response is critical for vaccine development. Results: For the first time we develop a support vector machine (SVM) for T-cell epitope prediction with an MHC type I restricted T-cell clone. Using cross-validation, we demonstrate that SVMs can be trained on relatively small data sets to provide prediction more accurate than those based on previously published methods or on MHC binding. Supplementary information: Data for 203 synthesized peptides is available at http://linus.nci.nih.gov/Data/LAU203_Peptide.pdf

Keywords: biosvm immunoinformatics
[Zhang2003Sequence] X. H-F. Zhang, K. A. Heller, I. Hefter, C. S. Leslie, and L. A. Chasin. Sequence Information for the Splicing of Human Pre-mRNA Identified by Support Vector Machine Classification. Genome Res., 13(12):2637-2650, 2003. [ bib | DOI | http | .pdf ]
Vertebrate pre-mRNA transcripts contain many sequences that resemble splice sites on the basis of agreement to the consensus, yet these more numerous false splice sites are usually completely ignored by the cellular splicing machinery. Even at the level of exon definition, pseudo exons defined by such false splice sites outnumber real exons by an order of magnitude. We used a support vector machine to discover sequence information that could be used to distinguish real exons from pseudo exons. This machine learning tool led to the definition of potential branch points, an extended polypyrimidine tract, and C-rich and TG-rich motifs in a region limited to 50 nt upstream of constitutively spliced exons. C-rich sequences were also found in a region extending to 80 nt downstream of exons, along with G-triplet motifs. In addition, it was shown that combinations of three bases within the splice donor consensus sequence were more effective than consensus values in distinguishing real from pseudo splice sites; two-way base combinations were optimal for distinguishing 3' splice sites. These data also suggest that interactions between two or more of these elements may contribute to exon recognition, and provide candidate sequences for assessment as intronic splicing enhancers.

Keywords: biosvm
[Zhang2003Classification] S.-W. Zhang, Q. Pan, H.-C. Zhang, Y-L. Zhang, and H.-Y. Wang. Classification of protein quaternary structure with support vector machine. Bioinformatics, 19(18):2390-2396, 2003. [ bib | http | .pdf ]
Motivation: Since the gap between sharply increasing known sequences and slow accumulation of known structures is becoming large, an automatic classification process based on the primary sequences and known three-dimensional structure becomes indispensable. The classification of protein quaternary structure based on the primary sequences can provide some useful information for the biologists. So a fully automatic and reliable classification system is needed. This work tries to look for the effective methods of extracting attribute and the algorithm for classifying the quaternary structure from the primary sequences. Results: Both the support vector machine (SVM) and the covariant discriminant algorithms have been first introduced to predict quaternary structure properties from the protein primary sequences. The amino acid composition and the auto-correlation functions based on the amino acid index profile of the primary sequence have been taken into account in the algorithms. We have analyzed 472 amino acid indices and selected the four amino acid indices as the examples, which have the best performance. Thus the five attribute parameter data sets (COMP, FASG, NISK, WOLS and KYTJ) were established from the protein primary sequences. The COMP attribute data set is composed of amino acid composition, and the FASG, NISK, WOLS and KYTJ attribute data sets are composed of the amino acid composition and the auto-correlation functions of the corresponding amino acid residue index. The overall accuracies of SVM are 78.5, 87.5, 83.2, 81.7 and 81.9% for the COMP, FASG, NISK, WOLS and KYTJ data sets in jackknife test, which are 19.6, 7.8, 15.5, 13.1 and 15.8% higher than those of the covariant discriminant algorithm in the same test. The results show that SVM may be applied to discriminate between the primary sequences of homodimers and non-homodimers and the two protein sequence descriptors can reflect the quaternary structure information. 
Compared with Robert Garian's previous investigation, the performance of SVM is almost equal to that of the Decision tree models, and the methods of extracting feature vector from the primary sequences are superior to Robert's binning function method. Availability: Programs are available on request from the authors.

Keywords: biosvm
[Zernov2003Drug] V. V. Zernov, K. V. Balakin, A. A. Ivaschenko, N. P. Savchuk, and I. V. Pletnev. Drug discovery using support vector machines. The case studies of drug-likeness, agrochemical-likeness, and enzyme inhibition predictions. J Chem Inf Comput Sci, 43(6):2048-56, 2003. [ bib | DOI | http | .pdf ]
Support Vector Machines (SVM) is a powerful classification and regression tool that is becoming increasingly popular in various machine learning applications. We tested the ability of SVM, in comparison with well-known neural network techniques, to predict drug-likeness and agrochemical-likeness for large compound collections. For both kinds of data, SVM outperforms various neural networks using the same set of descriptors. We also used SVM for estimating the activity of Carbonic Anhydrase II (CA II) enzyme inhibitors and found that the prediction quality of our SVM model is better than that reported earlier for conventional QSAR. Model characteristics and data set features were studied in detail.

Keywords: biosvm chemoinformatics
[Yu2003Fine-grained] C.S. Yu, J.Y. Wang, J.M. Yang, P.C. Lyu, C.J. Lin, and J.K. Hwang. Fine-grained protein fold assignment by support vector machines using generalized npeptide coding schemes and jury voting from multiple-parameter sets. Proteins, 50(4):531, Jun 2003. [ bib | DOI | http | .pdf ]
In the coarse-grained fold assignment of major protein classes, such as all-alpha, all-beta, alpha + beta, alpha/beta proteins, one can easily achieve high prediction accuracy from primary amino acid sequences. However, the fine-grained assignment of folds, such as those defined in the Structural Classification of Proteins (SCOP) database, presents a challenge due to the larger number of folds available. A recent study yielded a reasonable prediction accuracy of 56.0% on an independent set of the 27 most populated folds. In this communication, we apply the support vector machine (SVM) method, using a combination of protein descriptors based on the properties derived from the composition of n-peptide and jury voting, to the fine-grained fold prediction, and are able to achieve an overall prediction accuracy of 69.6% on the same independent set, significantly higher than the previous results. On 10-fold cross-validation, we obtained a prediction accuracy of 65.3%. [...] sequence-coding schemes can significantly improve the fine-grained fold prediction. Our approach should be useful in structure prediction and modeling.

Keywords: biosvm
[Yoon2003Analysis] Y. Yoon, J. Song, S.H. Hong, and J.Q. Kim. Analysis of multiple single nucleotide polymorphisms of candidate genes related to coronary heart disease susceptibility by using support vector machines. Clin. Chem. Lab. Med., 41(4):529-534, 2003. [ bib | .html | .pdf ]
Coronary heart disease (CHD) is a complex genetic disease involving gene-environment interaction. Many association studies between single nucleotide polymorphisms (SNPs) of candidate genes and CHD have been reported. We have applied a new method to analyze such relationships using support vector machines (SVMs), which is one of the methods for artificial neuronal network. We assumed that common haplotype implicit in genotypes will differ between cases and controls, and that this will allow SVM-derived patterns to be classifiable according to subject genotypes. Fourteen SNPs of ten candidate genes in 86 CHD patients and 119 controls were investigated. Genotypes were transformed to a numerical vector by giving scores based on difference between the genotypes of each subject and the reference genotypes, which represent the healthy normal population. Overall classification accuracy by SVMs was 64.4% [...]. By conventional analysis using the chi-squared test, the association between CHD and the SNP of the scavenger receptor B1 gene was most significant in terms of allele frequencies in cases vs. controls (p = 0.0001). In conclusion, we suggest that the application of SVMs for association studies of SNPs in candidate genes shows considerable promise and that further work could be usefully performed upon the estimation of CHD susceptibility in individuals of high risk.

Keywords: biosvm
[Yamanishi2003Extraction] Y. Yamanishi, J.-P. Vert, A. Nakaya, and M. Kanehisa. Extraction of correlated gene clusters from multiple genomic data by generalized kernel canonical correlation analysis. Bioinformatics, 19(Suppl. 1):i323-i330, 2003. [ bib | http | .pdf ]
Motivation: A major issue in computational biology is the reconstruction of pathways from several genomic datasets, such as expression data, protein interaction data and phylogenetic profiles. As a first step toward this goal, it is important to investigate the amount of correlation which exists between these data. Results: These methods are successfully tested on their ability to recognize operons in the Escherichia coli genome, from the comparison of three datasets corresponding to functional relationships between genes in metabolic pathways, geometrical relationships along the chromosome, and co-expression relationships as observed by gene expression data. Contact: yoshi@kuicr.kyoto-u.ac.jp

Keywords: biosvm
[Xing2003Distance] E.P. Xing, A.Y. Ng, M.I. Jordan, and S. Russell. Distance metric learning with application to clustering with side-information. In S. Becker, S. Thrun, and K. Obermayer, editors, Adv. Neural Inform. Process. Syst., volume 15, pages 505-512, Cambridge, MA, 2003. MIT Press. [ bib ]
[Wu2003Model] Z Wu, R A Irizarry, R Gentleman, F M Murillo, and F Spencer. A model based background adjustment for oligonucleotide expression arrays. Technical report, Johns Hopkins University, Department of Biostatistics Working Papers, Baltimore, MD, 2003. [ bib ]
Keywords: csbcbook, csbcbook-ch2
[Wu2003Comparison] B. Wu, T. Abbott, D. Fishman, W. McMurray, G. Mor, K. Stone, D. Ward, K. Williams, and H. Zhao. Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics, 19(13):1636-1643, 2003. [ bib | http | .pdf ]
Motivation: Novel methods, both molecular and statistical, are urgently needed to take advantage of recent advances in biotechnology and the human genome project for disease diagnosis and prognosis. Mass spectrometry (MS) holds great promise for biomarker identification and genome-wide protein profiling. It has been demonstrated in the literature that biomarkers can be identified to distinguish normal individuals from cancer patients using MS data. Such progress is especially exciting for the detection of early-stage ovarian cancer patients. Although various statistical methods have been utilized to identify biomarkers from MS data, there has been no systematic comparison among these approaches in their relative ability to analyze MS data. Results: We compare the performance of several classes of statistical methods for the classification of cancer based on MS spectra. These methods include: linear discriminant analysis, quadratic discriminant analysis, k-nearest neighbor classifier, bagging and boosting classification trees, support vector machine, and random forest (RF). The methods are applied to ovarian cancer and control serum samples from the National Ovarian Cancer Early Detection Program clinic at Northwestern University Hospital. We found that RF outperforms other methods in the analysis of MS data. Supplementary information: http://bioinformatics.med.yale.edu/proteomics/BioSupp1.html

Keywords: biosvm
[Wolf2003Learning] L. Wolf and A. Shashua. Learning over Sets using Kernel Principal Angles. J. Mach. Learn. Res., 4:913-931, 2003. [ bib | .html ]
Keywords: kernel-theory
[Winters-Hilt2003Highly] S. Winters-Hilt, W. Vercoutere, V.S. DeGuzman, D. Deamer, M. Akeson, and D. Haussler. Highly accurate classification of Watson-Crick basepairs on termini of single DNA molecules. Biophys. J., 84(2):967-976, 2003. [ bib | http | .pdf ]
We introduce a computational method for classification of individual DNA molecules measured by an alpha-hemolysin channel detector. We show classification with better than 99% accuracy for hairpin molecules that differ only in their terminal Watson-Crick basepairs. Signal classification was done in silico to establish performance metrics (i.e., where train and test data were of known type, via single-species data files). It was then performed in solution to assay real mixtures of DNA hairpins. Hidden Markov Models (HMMs) were used with Expectation/Maximization for denoising and for associating a feature vector with the ionic current blockade of the DNA molecule. Support Vector Machines (SVMs) were used as discriminators, and were the focus of off-line training. A multiclass SVM architecture was designed to place less discriminatory load on weaker discriminators, and novel SVM kernels were used to boost discrimination strength. The tuning on HMMs and SVMs enabled biophysical analysis of the captured molecule states and state transitions; structure revealed in the biophysical analysis was used for better feature selection.

Keywords: biosvm
[Wilton2003Comparison] D. Wilton, P. Willett, K. Lawson, and G. Mullier. Comparison of ranking methods for virtual screening in lead-discovery programs. J Chem Inf Comput Sci, 43(2):469-74, 2003. [ bib | DOI | http | .pdf ]
This paper discusses the use of several rank-based virtual screening methods for prioritizing compounds in lead-discovery programs, given a training set for which both structural and bioactivity data are available. Structures from the NCI AIDS data set and from the Syngenta corporate database were represented by two types of fragment bit-string and by sets of high-level molecular features. These representations were processed using binary kernel discrimination, similarity searching, substructural analysis, support vector machine, and trend vector analysis, with the effectiveness of the methods being judged by the extent to which active test set molecules were clustered toward the top of the resultant rankings. The binary kernel discrimination approach yielded consistently superior rankings and would appear to have considerable potential for chemical screening applications.

Keywords: biosvm
[Weston2003Feature] J. Weston, F. Pérez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff, and B. Schölkopf. Feature selection and transduction for prediction of molecular bioactivity for drug design. Bioinformatics, 19(6):764-771, 2003. [ bib | http | .pdf ]
Motivation: In drug discovery a key task is to identify characteristics that separate active (binding) compounds from inactive (non-binding) ones. An automated prediction system can help reduce resources necessary to carry out this task. Results: Two methods for prediction of molecular bioactivity for drug design are introduced and shown to perform well in a data set previously studied as part of the KDD (Knowledge Discovery and Data Mining) Cup 2001. The data is characterized by very few positive examples, a very large number of features (describing three-dimensional properties of the molecules) and rather different distributions between training and test data. Two techniques are introduced specifically to tackle these problems: a feature selection method for unbalanced data and a classifier which adapts to the distribution of the unlabeled test data (a so-called transductive method). We show both techniques improve identification performance and in conjunction provide an improvement over using only one of the techniques. Our results suggest the importance of taking into account the characteristics in this data which may also be relevant in other problems of a similar type. Availability: Matlab source code is available at http://www.kyb.tuebingen.mpg.de/bs/people/weston/kdd/kdd.html Contact: jason.weston@tuebingen.mpg.de Supplementary information: Supplementary material is available at http://www.kyb.tuebingen.mpg.de/bs/people/weston/kdd/kdd.html.

Keywords: biosvm
[Waterman2003Transcriptional] S.R. Waterman and P.L.C. Small. Transcriptional expression of Escherichia coli glutamate-dependent acid resistance genes gadA and gadBC in an hns rpoS mutant. J. Bacteriol., 185(15):4644-4647, Aug 2003. [ bib ]
Resistance to being killed by acidic environments with pH values lower than 3 is an important feature of both pathogenic and nonpathogenic Escherichia coli. The most potent E. coli acid resistance system utilizes two isoforms of glutamate decarboxylase encoded by gadA and gadB and a putative glutamate:gamma-aminobutyric acid antiporter encoded by gadC. The gad system is controlled by two repressors (H-NS and CRP), one activator (GadX), one repressor-activator (GadW), and two sigma factors (sigma(S) and sigma(70)). In contrast to results of previous reports, we demonstrate that gad transcription can be detected in an hns rpoS mutant strain of E. coli K-12, indicating that gad promoters can be initiated by sigma(70) in the absence of H-NS.

Keywords: Bacterial Proteins; DNA-Binding Proteins; Drug Resistance, Bacterial; Escherichia coli; Escherichia coli Proteins; Gene Expression Regulation, Bacterial; Glutamate Decarboxylase; Glutamates; Hydrogen-Ion Concentration; Membrane Proteins; Mutation; Sigma Factor; Transcription, Genetic
[Warmuth2003Active] M. K. Warmuth, J. Liao, G. Rätsch, M. Mathieson, S. Putta, and C. Lemmen. Active learning with support vector machines in the drug discovery process. J Chem Inf Comput Sci, 43(2):667-673, 2003. [ bib | DOI | http | .pdf ]
We investigate the following data mining problem from computer-aided drug design: From a large collection of compounds, find those that bind to a target molecule in as few iterations of biochemical testing as possible. In each iteration a comparatively small batch of compounds is screened for binding activity toward this target. We employed the so-called "active learning paradigm" from Machine Learning for selecting the successive batches. Our main selection strategy is based on the maximum margin hyperplane generated by "Support Vector Machines". This hyperplane separates the current set of active from the inactive compounds and has the largest possible distance from any labeled compound. We perform a thorough comparative study of various other selection strategies on data sets provided by DuPont Pharmaceuticals and show that the strategies based on the maximum margin hyperplane clearly outperform the simpler ones.

Keywords: biosvm
[Ward2003Secondary] J. J. Ward, L. J. McGuffin, B. F. Buxton, and D. T. Jones. Secondary structure prediction with support vector machines. Bioinformatics, 19(13):1650-1655, 2003. [ bib | http | .pdf ]
Motivation: A new method that uses support vector machines (SVMs) to predict protein secondary structure is described and evaluated. The study is designed to develop a reliable prediction method using an alternative technique and to investigate the applicability of SVMs to this type of bioinformatics problem. Methods: Binary SVMs are trained to discriminate between two structural classes. The binary classifiers are combined in several ways to predict multi-class secondary structure. Results: The average three-state prediction accuracy per protein (Q3) is estimated by cross-validation to be 77.07 +/- 0.26% [...] +/- 0.39% [...] PSIPRED prediction method on a non-homologous test set of 121 proteins despite being trained on substantially fewer examples. A simple consensus of the SVM, PSIPRED and PROFsec achieves significantly higher prediction accuracy than the individual methods. Availability: The SVM classifier is available from the authors. Work is in progress to make the method available on-line and to integrate the SVM predictions into the PSIPRED server.

Keywords: biosvm
[Wang2003Nonlinear] Yongmei Michelle Wang, Robert T Schultz, R. Todd Constable, and Lawrence H Staib. Nonlinear estimation and modeling of fMRI data using spatio-temporal support vector regression. Inf Process Med Imaging, 18:647-59, Jul 2003. [ bib ]
This paper presents a new and general nonlinear framework for fMRI data analysis based on statistical learning methodology: support vector machines. Unlike most current methods which assume a linear model for simplicity, the estimation and analysis of fMRI signal within the proposed framework is nonlinear, which matches recent findings on the dynamics underlying neural activity and hemodynamic physiology. The approach utilizes spatio-temporal support vector regression (SVR), within which the intrinsic spatio-temporal autocorrelations in fMRI data are reflected. The novel formulation of the problem allows merging model-driven with data-driven methods, and therefore unifies these two currently separate modes of fMRI analysis. In addition, multiresolution signal analysis is achieved and developed. Other advantages of the approach are: avoidance of interpolation after motion estimation, embedded removal of low-frequency noise components, and easy incorporation of multi-run, multi-subject, and multi-task studies into the framework.

[Wang2003Application] Haojun Wang, Chongxun Zheng, Ying Li, Huafeng Zhu, and Xiangguo Yan. Application of support vector machines to classification of blood cells. Sheng Wu Yi Xue Gong Cheng Xue Za Zhi, 20(3):484-7, Sep 2003. [ bib ]
The support vector machine (SVM) is a new learning technique based on the statistical learning theory. It was originally developed for two-class classification. In this paper, the SVM approach is extended to multi-class classification problems: a hierarchical SVM is applied to classify blood cells in different maturation stages from bone marrow. Based on stepwise decomposition, a hierarchical clustering method is presented to construct the architecture of the hierarchical (tree-like) SVM, then the optimal control parameters of SVM are determined by some criterion for each discriminant step. To verify the performance of the classifiers, the SVM method is compared with three classical classifiers using 3-fold cross validation. The preliminary results indicate that the proposed method avoids the curse of dimensionality and has greater generalization. Thus, the method can improve the classification correctness for blood cells from bone marrow.

[Walter2003Detection] T. Walter, J.-C. Klein, P. Massin, and A. Erignay. Detection of the median axis of vessels in retinal images. European Journal of Ophthalmology, 13(2), 2003. [ bib ]
[Wagner2003Protocols] M. Wagner, D. Naik, and A. Pothen. Protocols for disease classification from mass spectrometry data. Proteomics, 3(9):1692-1698, 2003. [ bib | DOI | http | .pdf ]
We report our results in classifying protein matrix-assisted laser desorption/ionization-time of flight mass spectra obtained from serum samples into diseased and healthy groups. We discuss in detail five of the steps in preprocessing the mass spectral data for biomarker discovery, as well as our criterion for choosing a small set of peaks for classifying the samples. Cross-validation studies with four selected proteins yielded misclassification rates in the 10-15% range for all the classification methods. Three of these proteins or protein fragments are down-regulated and one up-regulated in lung cancer, the disease under consideration in this data set. When cross-validation studies are performed, care must be taken to ensure that the test set does not influence the choice of the peaks used in the classification. Misclassification rates are lower when both the training and test sets are used to select the peaks used in classification versus when only the training set is used. This expectation was validated for various statistical discrimination methods when thirteen peaks were used in cross-validation studies. One particular classification method, a linear support vector machine, exhibited especially robust performance when the number of peaks was varied from four to thirteen, and when the peaks were selected from the training set alone. Experiments with the samples randomly assigned to the two classes confirmed that misclassification rates were significantly higher in such cases than those observed with the true data. This indicates that our findings are indeed significant. We found closely matching masses in a database for protein expression in lung cancer for three of the four proteins we used to classify lung cancer. Data from additional samples, increased experience with the performance of various preprocessing techniques, and affirmation of the biological roles of the proteins that help in classification, will strengthen our conclusions in the future.

Keywords: biosvm
[Vinokourov2003Inferring] A. Vinokourov, J. Shawe-Taylor, and N. Cristianini. Inferring a semantic representation of text via cross-language correlation analysis. In Suzanna Becker, Sebastian Thrun, and Klaus Obermayer, editors, Adv. Neural Inform. Process. Syst. MIT Press, 2003. [ bib | .pdf ]
[Vickers2003Efficient] T. A. Vickers, S. Koo, C. F. Bennett, S. T. Crooke, N. M. Dean, and B. F. Baker. Efficient reduction of target RNAs by small interfering RNA and RNase H-dependent antisense agents. A comparative analysis. J. Biol. Chem., 278(9):7108-18, Feb 2003. [ bib | DOI | http ]
RNA interference can be considered as an antisense mechanism of action that utilizes a double-stranded RNase to promote hydrolysis of the target RNA. We have performed a comparative study of optimized antisense oligonucleotides designed to work by an RNA interference mechanism to oligonucleotides designed to work by an RNase H-dependent mechanism in human cells. The potency, maximal effectiveness, duration of action, and sequence specificity of optimized RNase H-dependent oligonucleotides and small interfering RNA (siRNA) oligonucleotide duplexes were evaluated and found to be comparable. Effects of base mismatches on activity were determined to be position-dependent for both siRNA oligonucleotides and RNase H-dependent oligonucleotides. In addition, we determined that the activity of both siRNA oligonucleotides and RNase H-dependent oligonucleotides is affected by the secondary structure of the target mRNA. To determine whether positions on target RNA identified as being susceptible for RNase H-mediated degradation would be coincident with siRNA target sites, we evaluated the effectiveness of siRNAs designed to bind the same position on the target mRNA as RNase H-dependent oligonucleotides. Examination of 80 siRNA oligonucleotide duplexes designed to bind to RNA from four distinct human genes revealed that, in general, activity correlated with the activity to RNase H-dependent oligonucleotides designed to the same site, although some exceptions were noted. The one major difference between the two strategies is that RNase H-dependent oligonucleotides were determined to be active when directed against targets in the pre-mRNA, whereas siRNAs were not. These results demonstrate that siRNA oligonucleotide- and RNase H-dependent antisense strategies are both valid strategies for evaluating function of genes in cell-based assays.

Keywords: Animals, Antisense, Base Sequence, COS Cells, Calf Thymus, Cultured, Dose-Response Relationship, Drug, Flow Cytometry, Humans, Intercellular Adhesion Molecule-1, Introns, Luciferases, Messenger, Molecular Sequence Data, Nucleic Acid Conformation, Oligonucleotides, PTEN Phosphohydrolase, Phosphoric Monoester Hydrolases, Protein Structure, RNA, Ribonuclease H, Small Interfering, Tertiary, Time Factors, Tumor Cells, Tumor Suppressor Proteins, 12500975
[Vert2003Extracting] J.-P. Vert and M. Kanehisa. Extracting active pathways from gene expression data. Bioinformatics, 19(Suppl. 2):ii238-ii244, 2003. [ bib | http | .pdf ]
Motivation: A promising way to make sense out of gene expression profiles is to relate them to the activity of metabolic and signalling pathways. Each pathway usually involves many genes, such as enzymes, which can themselves participate in many pathways. The set of all known pathways can therefore be represented by a complex network of genes. Searching for regularities in the set of gene expression profiles with respect to the topology of this gene network is a way to automatically extract active pathways and their associated patterns of activity. Method: We present a method to perform this task, which consists in encoding both the gene network and the set of profiles into two kernel functions, and performing a regularized form of canonical correlation analysis between the two kernels. Results: When applied to publicly available expression data the method is able to extract biologically relevant expression patterns, as well as pathways with related activity.

Keywords: biosvm
[Vert2003Graph-driven] J.-P. Vert and M. Kanehisa. Graph-driven features extraction from microarray data using diffusion kernels and kernel CCA. In S. Becker, S. Thrun, and K. Obermayer, editors, Adv. Neural Inform. Process. Syst., pages 1449-1456. MIT Press, 2003. [ bib | .pdf ]
Keywords: biosvm
[Turner2003POCUS] F. S. Turner, D. R. Clutterbuck, and C. A. M. Semple. POCUS: mining genomic sequence annotation to predict disease genes. Genome Biol., 4(11):R75, 2003. [ bib | DOI | http | .pdf ]
Here we present POCUS (prioritization of candidate genes using statistics), a novel computational approach to prioritize candidate disease genes that is based on over-representation of functional annotation between loci for the same disease. We show that POCUS can provide high (up to 81-fold) enrichment of real disease genes in the candidate-gene shortlists it produces compared with the original large sets of positional candidates. In contrast to existing methods, POCUS can also suggest counterintuitive candidates.

[Tugcu2003Prediction] Nihal Tugcu, Minghu Song, Curt M Breneman, N. Sukumar, Kristin P Bennett, and Steven M Cramer. Prediction of the effect of mobile-phase salt type on protein retention and selectivity in anion exchange systems. Anal Chem, 75(14):3563-72, Jul 2003. [ bib ]
This study examines the effect of different salt types on protein retention and selectivity in anion exchange systems. Particularly, linear retention data for various proteins were obtained on two structurally different anion exchange stationary-phase materials in the presence of three salts with different counterions. The data indicated that the effects are, for the most part, nonspecific, although various specific effects could also be observed. Quantitative structure retention relationship (QSRR) models based on support vector machine feature selection and regression models were developed using the experimental chromatographic data in conjunction with various molecular descriptors computed from protein crystal structure geometries. Star plots for each descriptor used in the final model were generated to aid in interpretation. The resulting QSRR models were predictive, with cross-validated r2 values of 0.9445, 0.9676, and 0.8897 for Source 15Q and 0.9561, 0.9876, and 0.9760 for Q Sepharose resins in the presence of three different salts. The predictive power of these models was validated using a set of test proteins that were not used in the generation of these models. Interpretation of the models revealed that particular trends for proteins and salts could be captured using QSRR techniques.

[Tugcu2003Identification] Nihal Tugcu, Asif Ladiwala, Curt M Breneman, and Steven M Cramer. Identification of chemically selective displacers using parallel batch screening experiments and quantitative structure efficacy relationship models. Anal Chem, 75(21):5806-16, Nov 2003. [ bib | DOI | http | .pdf ]
Parallel batch screening experiments were carried out to examine how displacer chemistry and salt counterions affect the selectivity of batch protein displacements in anion exchange chromatographic systems. The results indicate that both salt type and displacer chemistry can have a significant impact on the amount of protein displaced. Importantly, the results indicate that, by changing the displacer, salt counterion, or both, one can induce significant selectivity changes in the relative displacement of two model proteins. This indicates that highly selective separations can be developed in ion exchange systems by the appropriate selection of displacer chemistry and salt counterion. The experimental batch screening data were also used in conjunction with various molecular descriptors to generate quantitative structure efficacy relationship (QSER) models based on a support vector machine feature selection and regression tool. The models resulted in good correlations and successful predictions for an external test set of displacers. A star plot approach was shown to be a powerful tool to aid in the interpretation of the QSER models. These results indicate that this modeling approach can be employed for the a priori prediction of displacer efficacy as well as for providing insight into displacer design and the selection of proper mobile-phase conditions for highly selective separations.

[Tsujinishi2003Fuzzy] Daisuke Tsujinishi and Shigeo Abe. Fuzzy least squares support vector machines for multiclass problems. Neural Netw, 16(5-6):785-92, 2003. [ bib ]
In least squares support vector machines (LS-SVMs), the optimal separating hyperplane is obtained by solving a set of linear equations instead of solving a quadratic programming problem. But since SVMs and LS-SVMs are formulated for two-class problems, unclassifiable regions exist when they are extended to multiclass problems. In this paper, we discuss fuzzy LS-SVMs that resolve unclassifiable regions for multiclass problems. We define a membership function in the direction perpendicular to the optimal separating hyperplane that separates a pair of classes. Using the minimum or average operation for these membership functions, we define a membership function for each class. Using some benchmark data sets, we show that recognition performance of fuzzy LS-SVMs with the minimum operator is comparable to that of fuzzy SVMs, but fuzzy LS-SVMs with the average operator showed inferior performance.

[Tsuda2003em] K. Tsuda, S. Akaho, and K. Asai. The em Algorithm for Kernel Matrix Completion with Auxiliary Data. J. Mach. Learn. Res., 4:67-81, 2003. [ bib | .html | .pdf ]
In biological data, it is often the case that observed data are available only for a subset of samples. When a kernel matrix is derived from such data, we have to leave the entries for unavailable samples as missing. In this paper, the missing entries are completed by exploiting an auxiliary kernel matrix derived from another information source. The parametric model of kernel matrices is created as a set of spectral variants of the auxiliary kernel matrix, and the missing entries are estimated by fitting this model to the existing entries. For model fitting, we adopt the em algorithm (distinguished from the EM algorithm of Dempster et al., 1977) based on the information geometry of positive definite matrices. We will report promising results on bacteria clustering experiments using two marker sequences: 16S and gyrB.

Keywords: biosvm
[Tsang2003Distance] I. W. Tsang and J. T. Kwok. Distance metric learning with kernels. In Proceedings of the International Conference on Artificial Neural Networks, pages 126-129, 2003. [ bib ]
[Troyanskaya2003Bayesian] O. G. Troyanskaya, K. Dolinski, A. B. Owen, R. B. Altman, and D. Botstein. A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc. Natl. Acad. Sci. USA, 100(14):8348-8353, 2003. [ bib | DOI | http ]
Genomic sequencing is no longer a novelty, but gene function annotation remains a key challenge in modern biology. A variety of functional genomics experimental techniques are available, from classic methods such as affinity precipitation to advanced high-throughput techniques such as gene expression microarrays. In the future, more disparate methods will be developed, further increasing the need for integrated computational analysis of data generated by these studies. We address this problem with MAGIC (Multisource Association of Genes by Integration of Clusters), a general framework that uses formal Bayesian reasoning to integrate heterogeneous types of high-throughput biological data (such as large-scale two-hybrid screens and multiple microarray analyses) for accurate gene function prediction. The system formally incorporates expert knowledge about relative accuracies of data sources to combine them within a normative framework. MAGIC provides a belief level with its output that allows the user to vary the stringency of predictions. We applied MAGIC to Saccharomyces cerevisiae genetic and physical interactions, microarray, and transcription factor binding sites data and assessed the biological relevance of gene groupings using Gene Ontology annotations produced by the Saccharomyces Genome Database. We found that by creating functional groupings based on heterogeneous data types, MAGIC improved accuracy of the groupings compared with microarray analysis alone. We describe several of the biological gene groupings identified.

[Tommaso2003Steady-state] Marina de Tommaso, Sebastiano Stramaglia, Jan Mathijs Schoffelen, Marco Guido, Giuseppe Libro, Luciana Losito, Vittorio Sciruicchio, Michele Sardaro, Mario Pellicoro, and Franco Michele Puca. Steady-state visual evoked potentials in the low frequency range in migraine: a study of habituation and variability phenomena. Int J Psychophysiol, 49(2):165-74, Aug 2003. [ bib ]
Previous studies have revealed that migraine patients display an increased photic driving to flash stimuli in the medium frequency range. The aim of this study was to perform a topographic analysis of steady-state visual evoked potentials (SVEPs) in the low frequency range (3-9 Hz), evaluating the temporal behaviour of the F1 amplitude by investigating habituation and variability phenomena. The main component of SVEPs, the F1, demonstrated an increased amplitude in several channels at 3 Hz. Behaviour of F1 amplitude was rather variable over time, and the wavelet-transform standard deviation was increased in migraine patients at a low stimulus rate. The discriminative value of the F1 mean amplitude and variability index, tested by both an artificial neural network classifier and a support vector machine, was high according to both methods. The increased photic driving in migraine should be subtended by a more generic abnormality of visual reactivity instead of a selective impairment of a visual subsystem. Temporal behaviour of SVEPs is not influenced by a clear tendency to habituation, but the F1 amplitude seemed to change in a complex way, which is better described by variability phenomena. An increased variability in response to flicker stimuli in migraine patients could be interpreted as an overactive regulation mechanism, prone to instability and consequently to headache attacks, whether spontaneous or triggered.

[Tillman03Word] C. Tillmann and H. Ney. Word reordering and a dynamic programming beam search algorithm for statistical machine translation. Comput. Linguist., 29(1):97-133, 2003. [ bib | DOI ]
[Tarumi2003Remote] Toshiyasu Tarumi, Gary W Small, Roger J Combs, and Robert T Kroutil. Remote detection of heated ethanol plumes by airborne passive Fourier transform infrared spectrometry. Appl Spectrosc, 57(11):1432-41, Nov 2003. [ bib ]
Methodology is developed for the automated detection of heated plumes of ethanol vapor with airborne passive Fourier transform infrared spectrometry. Positioned in a fixed-wing aircraft in a downward-looking mode, the spectrometer is used to detect ground sources of ethanol vapor from an altitude of 2000-3000 ft. Challenges to the use of this approach for the routine detection of chemical plumes include (1) the presence of a constantly changing background radiance as the aircraft flies, (2) the cost and complexity of collecting the data needed to train the classification algorithms used in implementing the plume detection, and (3) the need for rapid interferogram scans to minimize the ground area viewed per scan. To address these challenges, this work couples a novel ground-based data collection and training protocol with the use of signal processing and pattern recognition methods based on short sections of the interferogram data collected by the spectrometer. In the data collection, heated plumes of ethanol vapor are released from a portable emission stack and viewed by the spectrometer from ground level against a synthetic background designed to simulate a terrestrial radiance source. Classifiers trained with these data are subsequently tested with airborne data collected over a period of 2.5 years. Two classifier architectures are compared in this work: support vector machines (SVM) and piecewise linear discriminant analysis (PLDA). When applied to the airborne test data, the SVM classifiers perform best, failing to detect ethanol in only 8% of the cases in which it is present. False detections occur at a rate of less than 0.5%. The classifier performs well in spite of differences between the backgrounds associated with the ground-based and airborne data collections and the instrumental drift arising from the long time span of the data collection. 
Further improvements in classification performance are judged to require increased sophistication in the ground-based data collection in order to provide a better match to the infrared backgrounds observed from the air.

Keywords: Air Pollutants, Aircraft, Algorithms, Artificial Intelligence, Automated, Comparative Study, Computer Simulation, Computer-Assisted, Computing Methodologies, Environmental Monitoring, Ethanol, Fourier Transform Infrared, Humans, Image Interpretation, Non-P.H.S., Non-U.S. Gov't, Online Systems, Pattern Recognition, Photography, Reproducibility of Results, Research Support, Sensitivity and Specificity, Signal Processing, Spectroscopy, Subtraction Technique, U.S. Gov't, Video Recording, Walking, 14658159
[Takaoka2003Development] Y. Takaoka, Y. Endo, S. Yamanobe, H. Kakinuma, T. Okubo, Y. Shimazaki, T. Ota, S. Sumiya, and K. Yoshikawa. Development of a method for evaluating drug-likeness and ease of synthesis using a data set in which compounds are assigned scores based on chemists' intuition. J Chem Inf Comput Sci, 43(4):1269-75, 2003. [ bib | DOI | http | .pdf ]
The concept of drug-likeness, an important characteristic for any compound in a screening library, is nevertheless difficult to pin down. Based on our belief that this concept is implicit within the collective experience of working chemists, we devised a data set to capture an intuitive human understanding of both this characteristic and ease of synthesis, a second key characteristic. Five chemists assigned a pair of scores to each of 3980 diverse compounds, with the component scores of each pair corresponding to drug-likeness and ease of synthesis, respectively. Using this data set, we devised binary classifiers with an artificial neural network and a support vector machine. These models were found to efficiently eliminate compounds that are not drug-like and/or hard-to-synthesize derivatives, demonstrating the suitability of these models for use as compound acquisition filters.

Keywords: biosvm
[Takahashi2003Proteomic] Nobuhiro Takahashi, Mitsuaki Yanagida, Sally Fujiyama, Toshiya Hayano, and Toshiaki Isobe. Proteomic snapshot analyses of preribosomal ribonucleoprotein complexes formed at various stages of ribosome biogenesis in yeast and mammalian cells. Mass Spectrom Rev, 22(5):287-317, 2003. [ bib | DOI | http | .pdf ]
Proteomic technologies powered by advancements in mass spectrometry and bioinformatics and coupled with accumulated genome sequence data allow a comprehensive study of cell function through large-scale and systematic protein identifications of protein constituents of the cell and tissues, as well as of multi-protein complexes that carry out many cellular functions in a higher-order network in the cell. One of the most extensively analyzed cellular functions by proteomics is the production of ribosome, the protein-synthesis machinery, in the nucle(ol)us, the main site of ribosome biogenesis. The use of tagged proteins as affinity bait, coupled with mass spectrometric identification, enabled us to isolate synthetic intermediates of ribosomes that might represent snapshots of nascent ribosomes at particular stages of ribosome biogenesis and to identify their constituents, some of which showed dynamic changes for association with the intermediates at various stages of ribosome biogenesis. In this review, in conjunction with the results from yeast cells, our proteomic approach to analyze ribosome biogenesis in mammalian cells is described.

Keywords: Affinity Labels, Animals, Comparative Study, Electrospray Ionization, Genetic, Macromolecular Substances, Mass, Mitosis, Non-P.H.S., Non-U.S. Gov't, P.H.S., Protein Interaction Mapping, Proteome, Proteomics, Research Support, Ribonucleoproteins, Ribosomes, Saccharomyces cerevisiae, Saccharomyces cerevisiae Proteins, Signal Transduction, Spectrometry, Transcription, U.S. Gov't, 12949916
[Sorlie2003Repeated] T. Sørlie, R. Tibshirani, J. Parker, T. Hastie, J.S. Marron, A. Nobel, S. Deng, H. Johnsen, R. Pesich, S. Geisler, J. Demeter, C.M. Perou, P.E. Lønning, P.O. Brown, A.L. Børresen-Dale, and D. Botstein. Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc. Natl. Acad. Sci. USA, 100(14):8418-8423, Jul 2003. [ bib | DOI | http | .pdf ]
Characteristic patterns of gene expression measured by DNA microarrays have been used to classify tumors into clinically relevant subgroups. In this study, we have refined the previously defined subtypes of breast tumors that could be distinguished by their distinct patterns of gene expression. A total of 115 malignant breast tumors were analyzed by hierarchical clustering based on patterns of expression of 534 "intrinsic" genes and shown to subdivide into one basal-like, one ERBB2-overexpressing, two luminal-like, and one normal breast tissue-like subgroup. The genes used for classification were selected based on their similar expression levels between pairs of consecutive samples taken from the same tumor separated by 15 weeks of neoadjuvant treatment. Similar cluster analyses of two published, independent data sets representing different patient cohorts from different laboratories, uncovered some of the same breast cancer subtypes. In the one data set that included information on time to development of distant metastasis, subtypes were associated with significant differences in this clinical feature. By including a group of tumors from BRCA1 carriers in the analysis, we found that this genotype predisposes to the basal tumor subtype. Our results strongly support the idea that many of these breast tumor subtypes represent biologically distinct disease entities.

Keywords: csbcbook, csbcbook-ch3
[Sutherland2003Spline-fitting] J. J. Sutherland, L. A. O'Brien, and D. F. Weaver. Spline-fitting with a genetic algorithm: a method for developing classification structure-activity relationships. J. Chem. Inf. Comput. Sci., 43(6):1906-1915, 2003. [ bib | DOI | http | .pdf ]
Classification methods allow for the development of structure-activity relationship models when the target property is categorical rather than continuous. We describe a classification method which fits descriptor splines to activities, with descriptors selected using a genetic algorithm. This method, which we identify as SFGA, is compared to the well-established techniques of recursive partitioning (RP) and soft independent modeling by class analogy (SIMCA) using five series of compounds: cyclooxygenase-2 (COX-2) inhibitors, benzodiazepine receptor (BZR) ligands, estrogen receptor (ER) ligands, dihydrofolate reductase (DHFR) inhibitors, and monoamine oxidase (MAO) inhibitors. Only 1-D and 2-D descriptors were used. Approximately 40% of compounds in each series were assigned to a test set, "cherry-picked" from the complete set such that they lie outside the training set as much as possible. SFGA produced models that were more predictive for all but the DHFR set, for which SIMCA was most predictive. RP gave the least predictive models for all but the MAO set. A similar trend was observed when using training and test sets to which compounds were randomly assigned and when gradually eliminating compounds from the (designed) training set. The stability of models was examined for the random and reduced sets, where stability means that classification statistics and the selected descriptors are similar for models derived from different sets. Here, SIMCA produced the most stable models, followed by SFGA and RP. We show that a consensus approach that combines all three methods outperforms the single best model for all data sets.

Keywords: chemoinformatics
[Sun2003Identifying] Y.F. Sun, X.D. Fan, and Y.D. Li. Identifying splicing sites in eukaryotic RNA: support vector machine approach. Comput. Biol. Med., 33(1):17-29, 2003. [ bib | DOI | http | .pdf ]
We introduce a new method for splicing sites prediction based on the theory of support vector machines (SVM). The SVM represents a new approach to supervised pattern classification and has been successfully applied to a wide range of pattern recognition problems. In the process of splicing sites prediction, the statistical information of RNA secondary structure in the vicinity of splice sites, e.g. donor and acceptor sites, is introduced in order to compare recognition ratio of true positive and true negative. From the results of comparison, addition of structural information has brought no significant benefit for the recognition of splice sites and had even lowered the rate of recognition. Our results suggest that, through three-fold cross validation, the SVM method can achieve a good performance for splice sites identification.

Keywords: biosvm
[Su2003RankGene] Yang Su, T.M. Murali, Vladimir Pavlovic, Michael Schaffer, and Simon Kasif. RankGene: identification of diagnostic genes based on expression data. Bioinformatics, 19(12):1578-1579, 2003. [ bib | http | .pdf ]
Summary: RankGene is a program for analyzing gene expression data and computing diagnostic genes based on their predictive power in distinguishing between different types of samples. The program integrates into one system a variety of popular ranking criteria, ranging from the traditional t-statistic to one-dimensional support vector machines. This flexibility makes RankGene a useful tool in gene expression analysis and feature selection. Availability: http://genomics10.bu.edu/yangsu/rankgene Contact: murali@bu.edu

Keywords: biosvm
[Steinwart2003Sparseness] I. Steinwart. Sparseness of Support Vector Machines. J. Mach. Learn. Res., 4:1071-1105, 2003. [ bib | .pdf ]
Support vector machines (SVMs) construct decision functions that are linear combinations of kernel evaluations on the training set. The samples with non-vanishing coefficients are called support vectors. In this work we establish lower (asymptotical) bounds on the number of support vectors. On our way we prove several results which are of great importance for the understanding of SVMs. In particular, we describe to which "limit" SVM decision functions tend, discuss the corresponding notion of convergence and provide some results on the stability of SVMs using subdifferential calculus in the associated reproducing kernel Hilbert space.

[Srebro2003Weighted] N. Srebro and T. Jaakkola. Weighted low-rank approximations. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 720-727. AAAI Press, 2003. [ bib ]
[Sotiriou2003Breast] C. Sotiriou, S.-Y. Neo, L. M. McShane, E. L. Korn, P. M. Long, A. Jazaeri, P. Martiat, S. B. Fox, A. L. Harris, and E. T. Liu. Breast cancer classification and prognosis based on gene expression profiles from a population-based study. Proc. Natl. Acad. Sci. U. S. A., 100(18):10393-10398, Sep 2003. [ bib | DOI | http | .pdf ]
Comprehensive gene expression patterns generated from cDNA microarrays were correlated with detailed clinico-pathological characteristics and clinical outcome in an unselected group of 99 node-negative and node-positive breast cancer patients. Gene expression patterns were found to be strongly associated with estrogen receptor (ER) status and moderately associated with grade, but not associated with menopausal status, nodal status, or tumor size. Hierarchical cluster analysis segregated the tumors into two main groups based on their ER status, which correlated well with basal and luminal characteristics. Cox proportional hazards regression analysis identified 16 genes that were significantly associated with relapse-free survival at a stringent significance level of 0.001 to account for multiple comparisons. Of 231 genes previously reported by others [van't Veer, L. J., et al. (2002) Nature 415, 530-536] as being associated with survival, 93 probe elements overlapped with the set of 7,650 probe elements represented on the arrays used in this study. Hierarchical cluster analysis based on the set of 93 probe elements segregated our population into two distinct subgroups with different relapse-free survival (P < 0.03). The number of these 93 probe elements showing significant univariate association with relapse-free survival (P < 0.05) in the present study was 14, representing 11 unique genes. Genes involved in cell cycle, DNA replication, and chromosomal stability were consistently elevated in the various poor prognostic groups. In addition, glutathione S-transferase M3 emerged as an important survival marker in both studies. When taken together with other array studies, our results highlight the consistent biological and clinical associations with gene expression profiles.

Keywords: breastcancer
[Sorich2003Comparison] M. J. Sorich, J. O. Miners, R. A. McKinnon, D. A. Winkler, F. R. Burden, and P. A. Smith. Comparison of linear and nonlinear classification algorithms for the prediction of drug and chemical metabolism by human UDP-glucuronosyltransferase isoforms. J Chem Inf Comput Sci, 43(6):2019-24, 2003. [ bib | DOI | http | .pdf ]
Partial least squares discriminant analysis (PLSDA), Bayesian regularized artificial neural network (BRANN), and support vector machine (SVM) methodologies were compared by their ability to classify substrates and nonsubstrates of 12 isoforms of human UDP-glucuronosyltransferase (UGT), an enzyme "superfamily" involved in the metabolism of drugs, nondrug xenobiotics, and endogenous compounds. Simple two-dimensional descriptors were used to capture chemical information. For each data set, 70% of the data were used for training, and the remainder were used to assess the generalization performance. In general, the SVM methodology was able to produce models with the best predictive performance, followed by BRANN and then PLSDA. However, a small number of data sets showed either equivalent or better predictability using PLSDA, which may indicate relatively linear relationships in these data sets. All SVM models showed predictive ability (>60% of test set predicted correctly) and five out of the 12 test sets showed excellent prediction (>80% prediction accuracy). These models represent the first use of pattern recognition methods to discriminate between substrates and nonsubstrates of human drug metabolizing enzymes and the first thorough assessment of three classification algorithms using multiple metabolic data sets.

Keywords: biosvm
[Smyth2003Normalization] G. K. Smyth and T. P. Speed. Normalization of cDNA microarray data. Methods, 31:265-273, 2003. [ bib ]
[Smola2003Kernels] A. Smola and R. Kondor. Kernels and Regularization on Graphs. In B. Schölkopf and M.K. Warmuth, editors, Proceedings of 16th Annual Conference on Computational Learning Theory, pages 144-158. Springer-Verlag, 2003. [ bib ]
[Smale2003Estimating] S. Smale and D. Zhou. Estimating the approximation error in learning theory. Analysis and Applications, 1(1), 2003. [ bib | .pdf ]
[Simon2003Pitfalls] R. Simon, M.D. Radmacher, K. Dobbin, and L.M. McShane. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. Journal of the National Cancer Institute, 95(1):14-18, 2003. [ bib ]
[Siepen2003Beta] J. A. Siepen, S. E. Radford, and D. R. Westhead. Beta Edge strands in protein structure prediction and aggregation. Protein Sci., 12(10):2348-2359, 2003. [ bib | DOI | http | .pdf ]
It is well established that recognition between exposed edges of beta-sheets is an important mode of protein-protein interaction and can have pathological consequences; for instance, it has been linked to the aggregation of proteins into a fibrillar structure, which is associated with a number of predominantly neurodegenerative disorders. A number of protective mechanisms have evolved in the edge strands of beta-sheets, preventing the aggregation and insolubility of most natural beta-sheet proteins. Such mechanisms are unfavorable in the interior of a beta-sheet. The problem of distinguishing edge strands from central strands based on sequence information alone is important in predicting residues and mutations likely to be involved in aggregation, and is also a first step in predicting folding topology. Here we report support vector machine (SVM) and decision tree methods developed to classify edge strands from central strands in a representative set of protein domains. Interestingly, rules generated by the decision tree method are in close agreement with our knowledge of protein structure and are potentially useful in a number of different biological applications. When trained on strands from proteins of known structure, using structure-based (Dictionary of Secondary Structure in Proteins) strand assignments, both methods achieved mean cross-validated prediction accuracies of approximately 78%; accuracy was lower when strand assignments from secondary structure prediction were used. Further investigation of this effect revealed that it could be explained by a significant reduction in the accuracy of standard secondary structure prediction methods for edge strands, in comparison with central strands.

Keywords: biosvm
[Shevade2003simple] S.K. Shevade and S.S. Keerthi. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17):2246-2253, 2003. [ bib ]
[Sheinerman2003Sequence] Felix B Sheinerman, Bissan Al-Lazikani, and Barry Honig. Sequence, structure and energetic determinants of phosphopeptide selectivity of SH2 domains. J. Mol. Biol., 334(4):823-841, Dec 2003. [ bib ]
Here, we present an approach for the prediction of binding preferences of members of a large protein family for which structural information for a number of family members bound to a substrate is available. The approach involves a number of steps. First, an accurate multiple alignment of sequences of all members of a protein family is constructed on the basis of a multiple structural superposition of family members with known structure. Second, the methods of continuum electrostatics are used to characterize the energetic contribution of each residue in a protein to the binding of its substrate. Residues that make a significant contribution are mapped onto the protein sequence and are used to define a "binding site signature" for the complex being considered. Third, sequences whose structures have not been determined are checked to see if they have binding-site signatures similar to one of the known complexes. Predictions of binding affinity to a given substrate are based on similarities in binding-site signature. An important component of the approach is the introduction of a context-specific substitution matrix suitable for comparison of binding-site residues. The methods are applied to the prediction of phosphopeptide selectivity of SH2 domains. To this end, the energetic roles of all protein residues in 17 different complexes of SH2 domains with their cognate targets are analyzed. The total number of residues that make significant contributions to binding is found to vary from nine to 19 in different complexes. These energetically important residues are found to contribute to binding through a variety of mechanisms, involving both electrostatic and hydrophobic interactions. Binding-site signatures are found to involve residues in different positions in SH2 sequences, some of them as far as 9 Å away from a bound peptide. 
Surprisingly, similarities in the signatures of different domains do not correlate with whole-domain sequence identities unless the latter is greater than 50%. An extensive comparison with the optimal binding motifs determined by peptide library experiments, as well as other experimental data indicate that the similarity in binding preferences of different SH2 domains can be deduced on the basis of their binding-site signatures. The analysis provides a rationale for the empirically derived classification of SH2 domains described by Songyang & Cantley, in that proteins in the same group are found to have similar residues at positions important for binding. Confident predictions of binding preference can be made for about 85% of SH2 domain sequences found in SWISSPROT. The approach described in this work is quite general and can, in principle, be used to analyze binding preferences of members of large protein families for which structural information for a number of family members is available. It also offers a strategy for predicting cross-reactivity of compounds designed to bind to a particular target, for example in structure-based drug design.

Keywords: Amino Acid Sequence; Binding Sites; Molecular Sequence Data; Peptide Library; Phosphopeptides; Protein Binding; Sequence Alignment; Substrate Specificity; src Homology Domains
[She2003Frequent-subsequence-based] R. She, F. Chen, K. Wang, M. Ester, J.L. Gardy, and F.S.L. Brinkman. Frequent-subsequence-based prediction of outer membrane proteins. In KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 436-445. ACM Press, 2003. [ bib | DOI | .pdf ]
A number of medically important disease-causing bacteria (collectively called Gram-negative bacteria) are noted for the extra "outer" membrane that surrounds their cell. Proteins resident in this membrane (outer membrane proteins, or OMPs) are of primary research interest for antibiotic and vaccine drug design as they are on the surface of the bacteria and so are the most accessible targets to develop new drugs against. With the development of genome sequencing technology and bioinformatics, biologists can now deduce all the proteins that are likely produced in a given bacteria and have attempted to classify where proteins are located in a bacterial cell. However such protein localization programs are currently least accurate when predicting OMPs, and so there is a current need for the development of a better OMP classifier. Data mining research suggests that the use of frequent patterns has good performance in aiding the development of accurate and efficient classification algorithms. In this paper, we present two methods to identify OMPs based on frequent subsequences and test them on all Gram-negative bacterial proteins whose localizations have been determined by biological experiments. One classifier follows an association rule approach, while the other is based on support vector machines (SVMs). We compare the proposed methods with the state-of-the-art methods in the biological domain. The results demonstrate that our methods are better both in terms of accurately identifying OMPs and providing biological insights that increase our understanding of the structures and functions of these important proteins.

Keywords: biosvm
[Shannon2003Analyzing] William Shannon, Robert Culverhouse, and Jill Duncan. Analyzing microarray data using cluster analysis. Pharmacogenomics, 4(1):41-52, Jan 2003. [ bib ]
As pharmacogenetics researchers gather more detailed and complex data on gene polymorphisms that affect drug metabolizing enzymes, drug target receptors and drug transporters, they will need access to advanced statistical tools to mine that data. These tools include approaches from classical biostatistics, such as logistic regression or linear discriminant analysis, and supervised learning methods from computer science, such as support vector machines and artificial neural networks. In this review, we present an overview of another class of models, cluster analysis, which will likely be less familiar to pharmacogenetics researchers. Cluster analysis is used to analyze data that is not a priori known to contain any specific subgroups. The goal is to use the data itself to identify meaningful or informative subgroups. Specifically, we will focus on demonstrating the use of distance-based methods of hierarchical clustering to analyze gene expression data.

Keywords: Algorithms, Automated, Base Pair Mismatch, Base Pairing, Base Sequence, Biosensing Techniques, Cluster Analysis, Comparative Study, Computer-Assisted, DNA, Gene Expression Profiling, Gene Expression Regulation, Genes, Hemolysins, Humans, Markov Chains, Messenger, Molecular Probe Techniques, Molecular Sequence Data, Nanotechnology, Neoplastic, Neural Networks (Computer), Non-U.S. Gov't, Nucleic Acid Conformation, Oligonucleotide Array Sequence Analysis, Pattern Recognition, Quality Control, RNA, Research Support, Signal Processing, Stomach Neoplasms, 12517285
[Shannon2003Cytoscape] P. Shannon, A. Markiel, O. Ozier, N. S. Baliga, J. T. Wang, D. Ramage, N. Amin, B. Schwikowski, and T. Ideker. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res., 13(11):2498-2504, Nov 2003. [ bib | DOI | http ]
Cytoscape is an open source software project for integrating biomolecular interaction networks with high-throughput expression data and other molecular states into a unified conceptual framework. Although applicable to any system of molecular components and interactions, Cytoscape is most powerful when used in conjunction with large databases of protein-protein, protein-DNA, and genetic interactions that are increasingly available for humans and model organisms. Cytoscape's software Core provides basic functionality to lay out and query the network; to visually integrate the network with expression profiles, phenotypes, and other molecular states; and to link the network to databases of functional annotations. The Core is extensible through a straightforward plug-in architecture, allowing rapid development of additional computational analyses and features. Several case studies of Cytoscape plug-ins are surveyed, including a search for interaction pathways correlating with changes in gene expression, a study of protein complexes involved in cellular recovery to DNA damage, inference of a combined physical/functional interaction network for Halobacterium, and an interface to detailed stochastic/kinetic gene regulatory models.

[Serra2003Development] J.R. Serra, E.D. Thompson, and P.C. Jurs. Development of binary classification of structural chromosome aberrations for a diverse set of organic compounds from molecular structure. Chem. Res. Toxicol., 16(2):153-163, 2003. [ bib | DOI | http | .pdf ]
Classification models are generated to predict in vitro cytogenetic results for a diverse set of 383 organic compounds. Both k-nearest neighbor and support vector machine models are developed. They are based on calculated molecular structure descriptors. Endpoints used are the labels clastogenic or nonclastogenic according to an in vitro chromosomal aberration assay with Chinese hamster lung cells. Compounds that were tested with both a 24 and 48 h exposure are included. Each compound is represented by calculated molecular structure descriptors encoding the topological, electronic, geometrical, or polar surface area aspects of the structure. Subsets of informative descriptors are identified with genetic algorithm feature selection coupled to the appropriate classification algorithm. The overall classification success rate for a k-nearest neighbor classifier built with just six topological descriptors is 81.2% [...] and 86.5% [...] success rate for a three-descriptor support vector machine model is 99.7% [...] and 83.8% for an external prediction set.

Keywords: biosvm
[Segal2003Classificationa] N. H. Segal, P. Pavlidis, C. R. Antonescu, R. G. Maki, W. S. Noble, D. DeSantis, J. M. Woodruff, J. J. Lewis, M. F. Brennan, A. N. Houghton, and C. Cordon-Cardo. Classification and Subtype Prediction of Adult Soft Tissue Sarcoma by Functional Genomics. Am. J. Pathol., 163(2):691-700, Aug 2003. [ bib | http | .pdf ]
Adult soft tissue sarcomas are a heterogeneous group of tumors, including well-described subtypes by histological and genotypic criteria, and pleomorphic tumors typically characterized by non-recurrent genetic aberrations and karyotypic heterogeneity. The latter pose a diagnostic challenge, even to experienced pathologists. We proposed that gene expression profiling in soft tissue sarcoma would identify a genomic-based classification scheme that is useful in diagnosis. RNA samples from 51 pathologically confirmed cases, representing nine different histological subtypes of adult soft tissue sarcoma, were examined using the Affymetrix U95A GeneChip. Statistical tests were performed on experimental groups identified by cluster analysis, to find discriminating genes that could subsequently be applied in a support vector machine algorithm. Synovial sarcomas, round-cell/myxoid liposarcomas, clear-cell sarcomas and gastrointestinal stromal tumors displayed remarkably distinct and homogenous gene expression profiles. Pleomorphic tumors were heterogeneous. Notably, a subset of malignant fibrous histiocytomas, a controversial histological subtype, was identified as a distinct genomic group. The support vector machine algorithm supported a genomic basis for diagnosis, with both high sensitivity and specificity. In conclusion, we showed gene expression profiling to be useful in classification and diagnosis, providing insights into pathogenesis and pointing to potential new therapeutic targets of soft tissue sarcoma.

Keywords: biosvm
[Segal2003Regression] M. R. Segal, K. D. Dahlquist, and B. R. Conklin. Regression approaches for microarray data analysis. J. Comput. Biol., 10(6):961-980, 2003. [ bib | DOI | .pdf ]
A variety of new procedures have been devised to handle the two-sample comparison (e.g., tumor versus normal tissue) of gene expression values as measured with microarrays. Such new methods are required in part because of some defining characteristics of microarray-based studies: (i) the very large number of genes contributing expression measures which far exceeds the number of samples (observations) available and (ii) the fact that by virtue of pathway/network relationships, the gene expression measures tend to be highly correlated. These concerns are exacerbated in the regression setting, where the objective is to relate gene expression, simultaneously for multiple genes, to some external outcome or phenotype. Correspondingly, several methods have been recently proposed for addressing these issues. We briefly critique some of these methods prior to a detailed evaluation of gene harvesting. This reveals that gene harvesting, without additional constraints, can yield artifactual solutions. Results obtained employing such constraints motivate the use of regularized regression procedures such as the lasso, least angle regression, and support vector machines. Model selection and solution multiplicity issues are also discussed. The methods are evaluated using a microarray-based study of cardiomyopathy in transgenic mice.

Keywords: biosvm
[Segal2003Module] E. Segal, M. Shapira, A. Regev, D. Pe'er, D. Botstein, D. Koller, and N. Friedman. Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat. Genet., 34(2):166-176, Jun 2003. [ bib | DOI | http | .pdf ]
Much of a cell's activity is organized as a network of interacting modules: sets of genes coregulated to respond to different conditions. We present a probabilistic method for identifying regulatory modules from gene expression data. Our procedure identifies modules of coregulated genes, their regulators and the conditions under which regulation occurs, generating testable hypotheses in the form 'regulator X regulates module Y under conditions W'. We applied the method to a Saccharomyces cerevisiae expression data set, showing its ability to identify functionally coherent modules and their correct regulators. We present microarray experiments supporting three novel predictions, suggesting regulatory roles for previously uncharacterized proteins.

Keywords: biogm
[Schwarz2003Asymmetry] D. S. Schwarz, G. Hutvagner, T. Du, Z. Xu, N. Aronin, and P. D. Zamore. Asymmetry in the assembly of the RNAi enzyme complex. Cell, 115(2):199-208, Oct 2003. [ bib | DOI | http | .pdf ]
A key step in RNA interference (RNAi) is assembly of the RISC, the protein-siRNA complex that mediates target RNA cleavage. Here, we show that the two strands of an siRNA duplex are not equally eligible for assembly into RISC. Rather, both the absolute and relative stabilities of the base pairs at the 5′ ends of the two siRNA strands determine the degree to which each strand participates in the RNAi pathway. siRNA duplexes can be functionally asymmetric, with only one of the two strands able to trigger RNAi. Asymmetry is the hallmark of a related class of small, single-stranded, noncoding RNAs, microRNAs (miRNAs). We suggest that single-stranded miRNAs are initially generated as siRNA-like duplexes whose structures predestine one strand to enter the RISC and the other strand to be destroyed. Thus, the common step of RISC assembly is an unexpected source of asymmetry for both siRNA function and miRNA biogenesis.

Keywords: sirna
[Schuffenhauer2003Similarity] A. Schuffenhauer, P. Floersheim, P. Acklin, and E. Jacoby. Similarity metrics for ligands reflecting the similarity of the target proteins. J. Chem. Inf. Comput. Sci., 43(2):391-405, 2003. [ bib | DOI | http ]
In this study we evaluate how far the scope of similarity searching can be extended to identify not only ligands binding to the same target as the reference ligand(s) but also ligands of other homologous targets without initially known ligands. This "homology-based similarity searching" requires molecular representations reflecting the ability of a molecule to interact with target proteins. The Similog keys, which are introduced here as a new molecular representation, were designed to fulfill such requirements. They are based only on the molecular constitution and are counts of atom triplets. Each triplet is characterized by the graph distances and the types of its atoms. The atom-typing scheme classifies each atom by its function as H-bond donor or acceptor and by its electronegativity and bulkiness. In this study the Similog keys are investigated in retrospective in silico screening experiments and compared with other conformation independent molecular representations. Studied were molecules of the MDDR database for which the activity data was augmented by standardized target classification information from public protein classification databases. The MDDR molecule set was split randomly into two halves. The first half formed the candidate set. Ligands of four targets (dopamine D2 receptor, opioid delta-receptor, factor Xa serine protease, and progesterone receptor) were taken from the second half to form the respective reference sets. Different similarity calculation methods are used to rank the molecules of the candidate set by their similarity to each of the four reference sets. The accumulated counts of molecules binding to the reference target and groups of targets with decreasing homology to it were examined as a function of the similarity rank for each reference set and similarity method. 
In summary, similarity searching based on Unity 2D-fingerprints or Similog keys are found to be equally effective in the identification of molecules binding to the same target as the reference set. However, the application of the Similog keys is more effective in comparison with the other investigated methods in the identification of ligands binding to any target belonging to the same family as the reference target. We attribute this superiority to the fact that the Similog keys provide a generalization of the chemical elements and that the keys are counted instead of merely noting their presence or absence in a binary form. The second most effective molecular representation are the occurrence counts of the public ISIS key fragments, which like the Similog method, incorporates key counting as well as a generalization of the chemical elements. The results obtained suggest that ligands for a new target can be identified by the following three-step procedure: 1. Select at least one target with known ligands which is homologous to the new target. 2. Combine the known ligands of the selected target(s) to a reference set. 3. Search candidate ligands for the new targets by their similarity to the reference set using the Similog method. This clearly enlarges the scope of similarity searching from the classical application for a single target to the identification of candidate ligands for whole target families and is expected to be of key utility for further systematic chemogenomics exploration of previously well explored target families.

Keywords: chemogenomics
[Saxena2003Comparison] A. K. Saxena and P. Prathipati. Comparison of MLR, PLS and GA-MLR in QSAR analysis. SAR QSAR Environ. Res., 14:433-445, 2003. [ bib ]
[Saul2003Think] L. K. Saul and S. T. Roweis. Think Globally, Fit Locally: Unsupervised Learning of Low Dimensional Manifolds. J. Mach. Learn. Res., 4:119-155, 2003. [ bib | .html | www: ]
The problem of dimensionality reduction arises in many fields of information processing, including machine learning, data compression, scientific visualization, pattern recognition, and neural computation. Here we describe locally linear embedding (LLE), an unsupervised learning algorithm that computes low dimensional, neighborhood preserving embeddings of high dimensional data. The data, assumed to be sampled from an underlying manifold, are mapped into a single global coordinate system of lower dimensionality. The mapping is derived from the symmetries of locally linear reconstructions, and the actual computation of the embedding reduces to a sparse eigenvalue problem. Notably, the optimizations in LLE, though capable of generating highly nonlinear embeddings, are simple to implement, and they do not involve local minima. In this paper, we describe the implementation of the algorithm in detail and discuss several extensions that enhance its performance. We present results of the algorithm applied to data sampled from known manifolds, as well as to collections of images of faces, lips, and handwritten digits. These examples are used to provide extensive illustrations of the algorithm's performance (both successes and failures) and to relate the algorithm to previous and ongoing work in nonlinear dimensionality reduction.

Keywords: dimred
[Sanjiv2003Discriminative] S. Kumar and M. Hebert. Discriminative Random Fields: A Discriminative Framework for Contextual Interaction in Classification. In Proceedings of the Ninth IEEE International Conference on Computer Vision, page 1150. IEEE Computer Society, 2003. [ bib ]
In this work we present Discriminative Random Fields (DRFs), a discriminative framework for the classification of image regions by incorporating neighborhood interactions in the labels as well as the observed data. The discriminative random fields offer several advantages over the conventional Markov Random Field (MRF) framework. First, the DRFs allow to relax the strong assumption of conditional independence of the observed data generally used in the MRF framework for tractability. This assumption is too restrictive for a large number of applications in vision. Second, the DRFs derive their classification power by exploiting the probabilistic discriminative models instead of the generative models used in the MRF framework. Finally, all the parameters in the DRF model are estimated simultaneously from the training data unlike the MRF framework where likelihood parameters are usually learned separately from the field parameters. We illustrate the advantages of the DRFs over the MRF framework in an application of man-made structure detection in natural images taken from the Corel database.

[Sanchez-Carbayo2003Gene] Marta Sanchez-Carbayo, Nicholas D Socci, Juan Jose Lozano, Wentian Li, Elizabeth Charytonowicz, Thomas J Belbin, Michael B Prystowsky, Angel R Ortiz, Geoffrey Childs, and Carlos Cordon-Cardo. Gene discovery in bladder cancer progression using cDNA microarrays. Am. J. Pathol., 163(2):505-16, Aug 2003. [ bib | http | .pdf ]
To identify gene expression changes along progression of bladder cancer, we compared the expression profiles of early-stage and advanced bladder tumors using cDNA microarrays containing 17,842 known genes and expressed sequence tags. The application of bootstrapping techniques to hierarchical clustering segregated early-stage and invasive transitional carcinomas into two main clusters. Multidimensional analysis confirmed these clusters and more importantly, it separated carcinoma in situ from papillary superficial lesions and subgroups within early-stage and invasive tumors displaying different overall survival. Additionally, it recognized early-stage tumors showing gene profiles similar to invasive disease. Different techniques including standard t-test, single-gene logistic regression, and support vector machine algorithms were applied to identify relevant genes involved in bladder cancer progression. Cytokeratin 20, neuropilin-2, p21, and p33ING1 were selected among the top ranked molecular targets differentially expressed and validated by immunohistochemistry using tissue microarrays (n = 173). Their expression patterns were significantly associated with pathological stage, tumor grade, and altered retinoblastoma (RB) expression. Moreover, p33ING1 expression levels were significantly associated with overall survival. Analysis of the annotation of the most significant genes revealed the relevance of critical genes and pathways during bladder cancer progression, including the overexpression of oncogenic genes such as DEK in superficial tumors or immune response genes such as Cd86 antigen in invasive disease. Gene profiling successfully classified bladder tumors based on their progression and clinical outcome. The present study has identified molecular biomarkers of potential clinical significance and critical molecular targets associated with bladder cancer progression.

Keywords: biosvm
[Salim2003Combination] N. Salim, J. Holliday, and P. Willett. Combination of fingerprint-based similarity coefficients using data fusion. J Chem Inf Comput Sci, 43(2):435-442, 2003. [ bib | DOI | http ]
Many different types of similarity coefficients have been described in the literature. Since different coefficients take into account different characteristics when assessing the degree of similarity between molecules, it is reasonable to combine them to further optimize the measures of similarity between molecules. This paper describes experiments in which data fusion is used to combine several binary similarity coefficients to get an overall estimate of similarity for searching databases of bioactive molecules. The results show that search performances can be improved by combining coefficients with little extra computational cost. However, there is no single combination which gives a consistently high performance for all search types.

[Saeys2003Fast] Y. Saeys, S. Degroeve, D. Aeyels, Y. Van de Peer, and P. Rouze. Fast feature selection using a simple estimation of distribution algorithm: a case study on splice site prediction. Bioinformatics, 19(Suppl. 1):ii179-ii188, 2003. [ bib | http | .pdf ]
Motivation: Feature subset selection is an important preprocessing step for classification. In biology, where structures or processes are described by a large number of features, the elimination of irrelevant and redundant information in a reasonable amount of time has a number of advantages. It enables the classification system to achieve good or even better solutions with a restricted subset of features, allows for a faster classification, and it helps the human expert focus on a relevant subset of features, hence providing useful biological knowledge. Results: We present a heuristic method based on Estimation of Distribution Algorithms to select relevant subsets of features for splice site prediction in Arabidopsis thaliana. We show that this method performs a fast detection of relevant feature subsets using the technique of constrained feature subsets. Compared to the traditional greedy methods the gain in speed can be up to one order of magnitude, with results being comparable or even better than the greedy methods. This makes it a very practical solution for classification tasks that can be solved using a relatively small amount of discriminative features (or feature dependencies), but where the initial set of potential discriminative features is rather large. Keywords: Machine Learning, Feature Subset Selection, Estimation of Distribution Algorithms, Splice Site Prediction. Contact: yvsae@gengenp.rug.ac.be

Keywords: biosvm
[Roschke2003Karyotypic] Anna V Roschke, Giovanni Tonon, Kristen S Gehlhaus, Nicolas McTyre, Kimberly J Bussey, Samir Lababidi, Dominic A Scudiero, John N Weinstein, and Ilan R Kirsch. Karyotypic complexity of the NCI-60 drug-screening panel. Cancer Res, 63(24):8634-8647, Dec 2003. [ bib ]
We used spectral karyotyping to provide a detailed analysis of karyotypic aberrations in the diverse group of cancer cell lines established by the National Cancer Institute for the purpose of anticancer drug discovery. Along with the karyotypic description of these cell lines we defined and studied karyotypic complexity and heterogeneity (metaphase-to-metaphase variations) based on three separate components of genomic anatomy: (a) ploidy; (b) numerical changes; and (c) structural rearrangements. A wide variation in these parameters was evident in these cell lines, and different association patterns between them were revealed. Analysis of the breakpoints and other specific features of chromosomal changes across the entire set of cell lines or within particular lineages pointed to a striking lability of centromeric regions that distinguishes the epithelial tumor cell lines. We have also found that balanced translocations are as frequent in absolute number within the cell lines derived from solid as from hematopoietic tumors. Important similarities were noticed between karyotypic changes in cancer cell lines and that seen in primary tumors. This dataset offers insights into the causes and consequences of the destabilizing events and chromosomal instability that may occur during tumor development and progression. It also provides a foundation for investigating associations between structural genome anatomy and cancer molecular markers and targets, gene expression, gene dosage, and resistance or sensitivity to tens of thousands of molecular compounds.

Keywords: Cell Line, Tumor; Chromosome Aberrations; DNA Repair, genetics; Drug Screening Assays, Antitumor; Humans; Neoplasms, genetics/pathology; Ploidies; Retinoblastoma Protein, genetics; Spectral Karyotyping; Translocation, Genetic; Tumor Suppressor Protein p53, genetics
[Ramon2003Expressivity] J. Ramon and T. Gärtner. Expressivity versus efficiency of graph kernels. In T. Washio and L. De Raedt, editors, Proceedings of the First International Workshop on Mining Graphs, Trees and Sequences, pages 65-74, 2003. [ bib ]
Keywords: kernel-theory chemoinformatics
[Ramaswamy2003molecular] S. Ramaswamy, K. N. Ross, E. S. Lander, and T. R. Golub. A molecular signature of metastasis in primary solid tumors. Nat. Genet., 33(1):49-54, Jan 2003. [ bib | DOI | http | .pdf ]
Metastasis is the principal event leading to death in individuals with cancer, yet its molecular basis is poorly understood. To explore the molecular differences between human primary tumors and metastases, we compared the gene-expression profiles of adenocarcinoma metastases of multiple tumor types to unmatched primary adenocarcinomas. We found a gene-expression signature that distinguished primary from metastatic adenocarcinomas. More notably, we found that a subset of primary tumors resembled metastatic tumors with respect to this gene-expression signature. We confirmed this finding by applying the expression signature to data on 279 primary solid tumors of diverse types. We found that solid tumors carrying the gene-expression signature were most likely to be associated with metastasis and poor clinical outcome (P < 0.03). These results suggest that the metastatic potential of human tumors is encoded in the bulk of a primary tumor, thus challenging the notion that metastases arise from rare cells within a primary tumor that have the ability to metastasize.

[Qin2003Kernel] J. Qin, D. P. Lewis, and W. S. Noble. Kernel hierarchical gene clustering from microarray expression data. Bioinformatics, 19(16):2097-2104, 2003. [ bib | http | .pdf ]
Motivation: Unsupervised analysis of microarray gene expression data attempts to find biologically significant patterns within a given collection of expression measurements. For example, hierarchical clustering can be applied to expression profiles of genes across multiple experiments, identifying groups of genes that share similar expression profiles. Previous work using the support vector machine supervised learning algorithm with microarray data suggests that higher-order features, such as pairwise and tertiary correlations across multiple experiments, may provide significant benefit in learning to recognize classes of co-expressed genes. Results: We describe a generalization of the hierarchical clustering algorithm that efficiently incorporates these higher-order features by using a kernel function to map the data into a high-dimensional feature space. We then evaluate the utility of the kernel hierarchical clustering algorithm using both internal and external validation. The experiments demonstrate that the kernel representation itself is insufficient to provide improved clustering performance. We conclude that mapping gene expression data into a high-dimensional feature space is only a good idea when combined with a learning algorithm, such as the support vector machine, that does not suffer from the curse of dimensionality. Availability: Supplementary data at www.cs.columbia.edu/compbio/hiclust. Software source code available by request.

Keywords: biosvm
[Qian2003Prediction] J. Qian, J. Lin, N. M. Luscombe, H. Yu, and M. Gerstein. Prediction of regulatory networks: genome-wide identification of transcription factor targets from gene expression data. Bioinformatics, 19(15):1917-1926, 2003. [ bib | http | .pdf ]
Motivation: Defining regulatory networks, linking transcription factors (TFs) to their targets, is a central problem in post-genomic biology. One might imagine one could readily determine these networks through inspection of gene expression data. However, the relationship between the expression timecourse of a transcription factor and its target is not obvious (e.g. simple correlation over the timecourse), and current analysis methods, such as hierarchical clustering, have not been very successful in deciphering them. Results: Here we introduce an approach based on support vector machines (SVMs) to predict the targets of a transcription factor by identifying subtle relationships between their expression profiles. In particular, we used SVMs to predict the regulatory targets for 36 transcription factors in the Saccharomyces cerevisiae genome based on the microarray expression data from many different physiological conditions. We trained and tested our SVM on a data set constructed to include a significant number of both positive and negative examples, directly addressing data imbalance issues. This was non-trivial given that most of the known experimental information is only for positives. Overall, we found that 63% of our TF-target relationships were confirmed through cross-validation. We further assessed the performance of our regulatory network identifications by comparing them with the results from two recent genome-wide ChIP-chip experiments. Overall, we find the agreement between our results and these experiments is comparable to the agreement (albeit low) between the two experiments. We find that this network has a delocalized structure with respect to chromosomal positioning, with a given transcription factor having targets spread fairly uniformly across the genome. Availability: The overall network of the relationships is available on the web at http://bioinfo.mbb.yale.edu/expression/echipchip

Keywords: biosvm
[Pham2003Prediction] Tho Hoan Pham, Kenji Satou, and Tu Bao Ho. Prediction and analysis of beta-turns in proteins by support vector machine. Genome Inform Ser Workshop Genome Inform, 14:196-205, 2003. [ bib ]
The tight turn has long been recognized as one of the three important features of proteins after the alpha-helix and beta-sheet. Tight turns play an important role in globular proteins from both the structural and functional points of view. More than 90% of tight turns are beta-turns. Analysis and prediction of beta-turns in particular and tight turns in general are very useful for the design of new molecules such as drugs, pesticides, and antigens. In this paper, we introduce a support vector machine (SVM) approach to prediction and analysis of beta-turns. We have investigated two aspects of applying SVM to the prediction and analysis of beta-turns. First, we developed a new SVM method, called BTSVM, which predicts beta-turns of a protein from its sequence. The prediction results on the dataset of 426 non-homologous protein chains by sevenfold cross-validation technique showed that our method is superior to the other previous methods. Second, we analyzed how amino acid positions support (or prevent) the formation of beta-turns based on the "multivariable" classification model of a linear SVM. This model is more general than the other ones of previous statistical methods. Our analysis results are more comprehensive and easier to use than previously published analysis results.

Keywords: biosvm
[Perrin2003Gene] B.E. Perrin, L. Ralaivola, A. Mazurie, S. Bottani, J. Mallet, and F. d'Alche Buc. Gene networks inference using dynamic bayesian networks. Bioinformatics, 19(suppl 2):ii138-ii148, 2003. [ bib ]
[Peng2003Molecular] S. Peng, Q. Xu, X.B. Ling, X. Peng, W. Du, and L. Chen. Molecular classification of cancer types from microarray data using the combination of genetic algorithms and support vector machines. FEBS Lett., 555(2):358-362, 2003. [ bib | DOI | http | .pdf ]
Simultaneous multiclass classification of tumor types is essential for future clinical implementations of microarray-based cancer diagnosis. In this study, we have combined genetic algorithms (GAs) and all paired support vector machines (SVMs) for multiclass cancer identification. The predictive features have been selected through iterative SVMs/GAs, and recursive feature elimination post-processing steps, leading to a very compact cancer-related predictive gene set. Leave-one-out cross-validations yielded accuracies of 87.93% for the eight-class and 85.19% for the fourteen-class cancer classifications, outperforming the results derived from previously published methods.

Keywords: biosvm microarray
[Peng2003Splicing-site] Si hua Peng, Long jiang Fan, Xiao ning Peng, Shu lin Zhuang, Wei Du, and Liang biao Chen. Splicing-site recognition of rice (Oryza sativa L.) DNA sequences by support vector machines. J Zhejiang Univ Sci, 4(5):573-7, 2003. [ bib ]
MOTIVATION: It was found that high accuracy splicing-site recognition of rice (Oryza sativa L.) DNA sequence is especially difficult. We described a new method for the splicing-site recognition of rice DNA sequences. METHOD: Based on the intron in eukaryotic organisms conforming to the principle of GT-AG, we used support vector machines (SVM) to predict the splicing sites. By machine learning, we built a model and used it to test the effect of the test data set of true and pseudo splicing sites. RESULTS: The prediction accuracy we obtained was 87.53% at the true 5' end splicing site and 87.37% at the true 3' end splicing sites. The results suggested that the SVM approach could achieve higher accuracy than the previous approaches.

[Patterson2003Proteomics] Scott D Patterson and Ruedi H Aebersold. Proteomics: the first decade and beyond. Nat Genet, 33 Suppl:311-323, Mar 2003. [ bib | DOI | http ]
Proteomics is the systematic study of the many and diverse properties of proteins in a parallel manner with the aim of providing detailed descriptions of the structure, function and control of biological systems in health and disease. Advances in methods and technologies have catalyzed an expansion of the scope of biological studies from the reductionist biochemical analysis of single proteins to proteome-wide measurements. Proteomics and other complementary analysis methods are essential components of the emerging 'systems biology' approach that seeks to comprehensively describe biological systems through integration of diverse types of data and, in the future, to ultimately allow computational simulations of complex biological systems.

Keywords: Amino Acid Sequence; Base Sequence; Chromatography, Liquid; Computational Biology; DNA; Genetic Techniques; History, 20th Century; History, 21st Century; Mass Spectrometry; Oligonucleotide Array Sequence Analysis; Proteins; Proteomics
[Park2003Prediction] K.-J. Park and M. Kanehisa. Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics, 19(13):1656-1663, 2003. [ bib | http | .pdf ]
Motivation: The subcellular location of a protein is closely correlated to its function. Thus, computational prediction of subcellular locations from the amino acid sequence information would help annotation and functional prediction of protein coding genes in complete genomes. We have developed a method based on support vector machines (SVMs). Results: We considered 12 subcellular locations in eukaryotic cells: chloroplast, cytoplasm, cytoskeleton, endoplasmic reticulum, extracellular medium, Golgi apparatus, lysosome, mitochondrion, nucleus, peroxisome, plasma membrane, and vacuole. We constructed a data set of proteins with known locations from the SWISS-PROT database. A set of SVMs was trained to predict the subcellular location of a given protein based on its amino acid, amino acid pair, and gapped amino acid pair compositions. The predictors based on these different compositions were then combined using a voting scheme. Results obtained through 5-fold cross-validation tests showed an improvement in prediction accuracy over the algorithm based on the amino acid composition only. This prediction method is available via the Internet. Availability: http://www.genome.ad.jp/SIT/ploc.html Supplementary information: http://web.kuicr.kyoto-u.ac.jp/ park/Seqdata/

Keywords: biosvm
[Palmer2003Comparison] Gregory M Palmer, Changfang Zhu, Tara M Breslin, Fushen Xu, Kennedy W Gilchrist, and Nirmala Ramanujam. Comparison of multiexcitation fluorescence and diffuse reflectance spectroscopy for the diagnosis of breast cancer (March 2003). IEEE Trans Biomed Eng, 50(11):1233-42, Nov 2003. [ bib ]
Nonmalignant (n = 36) and malignant (n = 20) tissue samples were obtained from breast cancer and breast reduction surgeries. These tissues were characterized using multiple excitation wavelength fluorescence spectroscopy and diffuse reflectance spectroscopy in the ultraviolet-visible wavelength range, immediately after excision. Spectra were then analyzed using principal component analysis (PCA) as a data reduction technique. PCA was performed on each fluorescence spectrum, as well as on the diffuse reflectance spectrum individually, to establish a set of principal components for each spectrum. A Wilcoxon rank-sum test was used to determine which principal components show statistically significant differences between malignant and nonmalignant tissues. Finally, a support vector machine (SVM) algorithm was utilized to classify the samples based on the diagnostically useful principal components. Cross-validation of this nonparametric algorithm was carried out to determine its classification accuracy in an unbiased manner. Multiexcitation fluorescence spectroscopy was successful in discriminating malignant and nonmalignant tissues, with a sensitivity and specificity of 70% and 92%, respectively. The sensitivity (30%) and specificity (78%) of diffuse reflectance spectroscopy alone was significantly lower. Combining fluorescence and diffuse reflectance spectra did not improve the classification accuracy of an algorithm based on fluorescence spectra alone. The fluorescence excitation-emission wavelengths identified as being diagnostic from the PCA-SVM algorithm suggest that the important fluorophores for breast cancer diagnosis are most likely tryptophan, NAD(P)H and flavoproteins.

[OHagan2003Array] R. C. O'Hagan, C. W. Brennan, A. Strahs, X. Zhang, K. Kannan, M. Donovan, C. Cauwels, N. E. Sharpless, W. H. Wong, and L. Chin. Array comparative genome hybridization for tumor classification and gene discovery in mouse models of malignant melanoma. Cancer Res., 63(17):5352-5356, Sep 2003. [ bib | .pdf ]
Chromosomal numerical aberrations (CNAs), particularly regional amplifications and deletions, are a hallmark of solid tumor genomes. These genomic alterations carry the potential to convey etiologic and clinical significance by virtue of their clonality within a tumor cell population, their distinctive patterns in relation to tumor staging, and their recurrence across different tumor types. In this study, we showed that array-based comparative genomic hybridization (CGH) analysis of genome-wide CNAs can classify tumors on the basis of differing etiologies and provide mechanistic insights to specific biological processes. In a RAS-induced p19(Arf-/-) mouse model that experienced accelerated melanoma formation after UV exposure, array-CGH analysis was effective in distinguishing phenotypically identical melanomas that differed solely by previous UV exposure. Moreover, classification by array-CGH identified key CNAs unique to each class, including amplification of cyclin-dependent kinase 6 in UV-treated cohort, a finding consistent with our recent report that UVB targets components of the p16(INK4a)-cyclin-dependent kinase-RB pathway in melanoma genesis (K. Kannan, et al., Proc. Natl. Acad. Sci. USA, 21: 2003). These results are the first to establish the utility of array-CGH as a means of etiology-based tumor classification in genetically defined cancer-prone models.

Keywords: cgh
[Nguyen2003Multi-class] Minh N Nguyen and Jagath C Rajapakse. Multi-class support vector machines for protein secondary structure prediction. Genome Inform Ser Workshop Genome Inform, 14:218-27, 2003. [ bib ]
The solution of binary classification problems using the Support Vector Machine (SVM) method has been well developed. Though multi-class classification is typically solved by combining several binary classifiers, recently, several multi-class methods that consider all classes at once have been proposed. However, these methods require resolving a much larger optimization problem and are applicable to small datasets. Three methods based on binary classifications: one-against-all (OAA), one-against-one (OAO), and directed acyclic graph (DAG), and two approaches for multi-class problem by solving one single optimization problem, are implemented to predict protein secondary structure. Our experiments indicate that multi-class SVM methods are more suitable for protein secondary structure (PSS) prediction than the other methods, including binary SVMs, because of their capacity to solve an optimization problem in one step. Furthermore, in this paper, we argue that it is feasible to extend the prediction accuracy by adding a second-stage multi-class SVM to capture the contextual information among secondary structural elements, thereby further improving the accuracies. We demonstrate that two-stage SVMs perform better than single-stage SVM techniques for PSS prediction using two datasets and report a maximum accuracy of 79.5%.

Keywords: biosvm
[Murphy2003Using] Kevin Murphy, Antonio Torralba, and William T.F. Freeman. Using the forest to see the trees: a graphical model relating features, objects and scenes. In Adv. Neural Inform. Process. Syst., Vancouver, BC, 2003. MIT Press. [ bib | .pdf ]
Keywords: conditional-random-field
[Mootha2003PGC] V. K. Mootha, C. M. Lindgren, K.-F. Eriksson, A. Subramanian, S. Sihag, J. Lehar, P. Puigserver, E. Carlsson, M. Ridderstrale, E. Laurila, N. Houstis, M. J. Daly, N. Patterson, J. P. Mesirov, T. R. Golub, P. Tamayo, B. Spiegelman, E. S. Lander, J. N. Hirschhorn, D. Altshuler, and L. C. Groop. PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet., 34(3):267-273, Jul 2003. [ bib | DOI | http | .pdf ]
DNA microarrays can be used to identify gene expression changes characteristic of human disease. This is challenging, however, when relevant differences are subtle at the level of individual genes. We introduce an analytical strategy, Gene Set Enrichment Analysis, designed to detect modest but coordinate changes in the expression of groups of functionally related genes. Using this approach, we identify a set of genes involved in oxidative phosphorylation whose expression is coordinately decreased in human diabetic muscle. Expression of these genes is high at sites of insulin-mediated glucose disposal, activated by PGC-1alpha and correlated with total-body aerobic capacity. Our results associate this gene set with clinically important variation in human metabolism and illustrate the value of pathway relationships in the analysis of genomic profiling experiments.

[Mirzadegan2003Sequence] T. Mirzadegan, G. Benkö, S. Filipek, and K. Palczewski. Sequence analyses of G-protein-coupled receptors: similarities to rhodopsin. Biochemistry, 42(10):2759-2767, Mar 2003. [ bib | DOI ]
Keywords: chemogenomics
[Mendelson2003Geometric] S. Mendelson. Geometric parameters in Learning Theory. Lecture notes, 2003. [ bib ]
[Meireles2003Differentially] S.I. Meireles, A.F. Carvalho, R. Hirata, A.L. Montagnini, W.K. Martins, F.B. Runza, B.S. Stolf, L. Termini, C.E. Neto, R.L. Silva, F.A. Soares, E.J. Neves, and L.F. Reis. Differentially expressed genes in gastric tumors identified by cDNA array. Cancer Lett., 190(2):199-211, Feb 2003. [ bib | DOI | http | .pdf ]
Using cDNA fragments from the FAPESP/LICR Cancer Genome Project, we constructed a cDNA array having 4512 elements and determined gene expression in six normal and six tumor gastric tissues. Using t-statistics, we identified 80 cDNAs whose expression in normal and tumor samples differed more than 3.5 sample standard deviations. Using Self-Organizing Map, the expression profile of these cDNAs allowed perfect separation of malignant and non-malignant samples. Using the supervised learning procedure Support Vector Machine, we identified trios of cDNAs that could be used to classify samples as normal or tumor, based on single-array analysis. Finally, we identified genes with altered linear correlation when their expression in normal and tumor samples was compared. Further investigation concerning the function of these genes could contribute to the understanding of gastric carcinogenesis and may prove useful in molecular diagnostics.

Keywords: biosvm microarray
[McKnight2003Categorization] Larry McKnight and Padmini Srinivasan. Categorization of sentence types in medical abstracts. AMIA Annu Symp Proc, pages 440-4, 2003. [ bib ]
This study evaluated the use of machine learning techniques in the classification of sentence type. 7253 structured abstracts and 204 unstructured abstracts of Randomized Controlled Trials from MedLINE were parsed into sentences and each sentence was labeled as one of four types (Introduction, Method, Result, or Conclusion). Support Vector Machine (SVM) and Linear Classifier models were generated and evaluated on cross-validated data. Treating sentences as a simple "bag of words", the SVM model had an average ROC area of 0.92. Adding a feature of relative sentence location improved performance markedly for some models and overall increasing the average ROC to 0.95. Linear classifier performance was significantly worse than the SVM in all datasets. Using the SVM model trained on structured abstracts to predict unstructured abstracts yielded performance similar to that of models trained with unstructured abstracts in 3 of the 4 types. We conclude that classification of sentence type seems feasible within the domain of RCT's. Identification of sentence types may be helpful for providing context to end users or other text summarization techniques.

Keywords: biosvm
[Mcgann2003Shape] M. R. Mcgann, H. R. Almond, A. Nicholls, A. J. Grant, and F. K. Brown. Gaussian docking functions. Biopolymers, 68(1):76-90, 2003. [ bib | DOI | http ]
A shape-based Gaussian docking function is constructed which uses Gaussian functions to represent the shapes of individual atoms. A set of 20 trypsin ligand-protein complexes are drawn from the Protein Data Bank (PDB), the ligands are separated from the proteins, and then are docked back into the active sites using numerical optimization of this function. It is found that by employing this docking function, quasi-Newton optimization is capable of moving ligands great distances [on average 7 Å root mean square distance (RMSD)] to locate the correctly docked structure. It is also found that a ligand drawn from one PDB file can be docked into a trypsin structure drawn from any of the trypsin PDB files. This implies that this scoring function is not limited to more accurate x-ray structures, as is the case for many of the conventional docking methods, but could be extended to homology models.

Keywords: docking, energyfunctions, openeye
[Mayr2003Cross-reactive] Torsten Mayr, Christian Igel, Gregor Liebsch, Ingo Klimant, and Otto S Wolfbeis. Cross-reactive metal ion sensor array in a micro titer plate format. Anal Chem, 75(17):4389-96, Sep 2003. [ bib ]
A cross-reactive array in a micro titer plate (MTP) format is described that is based on a versatile and highly flexible scheme. It makes use of rather unspecific metal ion probes having almost identical fluorescence spectra, thus enabling (a) interrogation at identical analytical wavelengths, and (b) imaging of the probes contained in the wells of the MTP using a CCD camera and an array of blue-light-emitting diodes as a light source. The unselective response of the indicators in the presence of mixtures of five divalent cations generates a characteristic pattern that was analyzed by chemometric tools. The fluorescence intensity of the indicators was transferred into a time-dependent parameter applying a scheme called dual lifetime referencing. In this method, the fluorescence decay profile of the indicator is referenced against the phosphorescence of an inert reference dye added to the system. The intrinsically referenced measurements also were performed using blue LEDs as light sources and a CCD camera without intensifiers as the detector. The best performance was observed if each well was excited by a single LED. The assembly allows the detection of dye concentrations in the nanomoles-per-liter range without amplification and the acquisition of 96 wells simultaneously. The pictures obtained form the basis for evaluation by pattern recognition algorithms. Support vector machines are capable of predicting the presence of significant concentrations of metal ions with high accuracy.

Keywords: Agrochemicals, Air Pollutants, Aircraft, Algorithms, Artificial Intelligence, Automated, Base Composition, Base Sequence, Bayes Theorem, Carbonic Anhydrase Inhibitors, Cluster Analysis, Colonic Neoplasms, Comparative Study, Computational Biology, Computer Simulation, Computer Systems, Computer-Assisted, Computing Methodologies, Confidence Intervals, Cytosine, DNA, Data Interpretation, Databases, Diagnosis, Drug Design, Enhancer Elements (Genetics), Environmental Monitoring, Enzyme Inhibitors, Ethanol, Exons, Forecasting, Fourier Transform Infrared, Gene Expression Profiling, Gene Expression Regulation, Genetic, Genetic Screening, Glucuronosyltransferase, Guanine, Humans, Image Interpretation, Isoenzymes, Least-Squares Analysis, Leukemia, Linear Models, Lymphoma, Models, Molecular, Molecular Conformation, Molecular Sequence Data, Natural Disasters, Neoplasms, Neoplastic, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Oligonucleotide Array Sequence Analysis, Online Systems, P.H.S., Pattern Recognition, Pharmaceutical Preparations, Phenotype, Photography, Probability, Pyrimidines, Quantitative Structure-Activity Relationship, RNA Precursors, RNA Splice Sites, RNA Splicing, Radiation, Reproducibility of Results, Research Support, Sensitivity and Specificity, Sequence Alignment, Sequence Analysis, Signal Processing, Software, Spectroscopy, Statistical, Subtraction Technique, Terminology, Thermodynamics, Time Factors, U.S. Gov't, Untranslated Regions, Video Recording, Walking, 14632041
[Mattick2003Challenging] John S. Mattick. Challenging the dogma: the hidden layer of non-protein-coding RNAs in complex organisms. BioEssays, 25:930-939, 2003. [ bib ]
Keywords: csbcbook
[Mattfeldt2003Classification] T. Mattfeldt, H.W. Gottfried, H. Wolter, V. Schmidt, H.A. Kestler, and J. Mayer. Classification of prostatic carcinoma with artificial neural networks using comparative genomic hybridization and quantitative stereological data. Pathol. Res. Pract., 199(12):773-784, 2003. [ bib | DOI | http ]
Staging of prostate cancer is a mainstay of treatment decisions and prognostication. In the present study, 50 pT2N0 and 28 pT3N0 prostatic adenocarcinomas were characterized by Gleason grading, comparative genomic hybridization (CGH), and histological texture analysis based on principles of stereology and stochastic geometry. The cases were classified by learning vector quantization and support vector machines. The quality of classification was tested by cross-validation. Correct prediction of stage from primary tumor data was possible with an accuracy of 74-80% [...]. The accuracy of prediction was similar when the Gleason score was used as input variable, when stereological data were used, or when a combination of CGH data and stereological data was used. The results of classification by learning vector quantization were slightly better than those by support vector machines. A method is briefly sketched by which training of neural networks can be adapted to unequal sample sizes per class. Progression from pT2 to pT3 prostate cancer is correlated with complex changes of the epithelial cells in terms of volume fraction, of surface area, and of second-order stereological properties. Genetically, this progression is accompanied by a significant global increase in losses and gains of DNA, and specifically by increased numerical aberrations on chromosome arms 1q, 7p, and 8p.

Keywords: biosvm, cgh
[Markowetz2003Support] F. Markowetz, L. Edler, and M. Vingron. Support Vector Machines for Protein Fold Class Prediction. Biometrical Journal, 45(3):377-389, 2003. [ bib | DOI | http | .pdf ]
Knowledge of the three-dimensional structure of a protein is essential for describing and understanding its function. Today, a large number of known protein sequences faces a small number of identified structures. Thus, the need arises to predict structure from sequence without using time-consuming experimental identification. In this paper the performance of Support Vector Machines (SVMs) is compared to Neural Networks and to standard statistical classification methods such as Discriminant Analysis and Nearest Neighbor Classification. We show that SVMs can beat the competing methods on a dataset of 268 protein sequences to be classified into a set of 42 fold classes. We discuss misclassification with respect to biological function and similarity. In a second step we examine the performance of SVMs if the embedding is varied from frequencies of single amino acids to frequencies of triplets of amino acids. This work shows that SVMs provide a promising alternative to standard statistical classification and prediction methods in functional genomics.

Keywords: biosvm
[Marchal2003Bioinformatics] I. Marchal, G. Golfier, O. Dugas, and M. Majed. Bioinformatics in glycobiology. Biochimie, 85(1-2):75-81, 2003. [ bib ]
In comparison with genes and proteins, attention paid to oligosaccharides that modify proteins is still marginal. Accordingly, bioinformatics is so far poorly involved in glycobiology. Some initiatives have been taken, however, to collect in databases all glycobiology-relevant information or to design specific data mining algorithms to infer predictions or identify oligosaccharide structures. In this review, we make a non-exhaustive survey of the available glycobiology-related bioinformatic resources, focussing mainly on those resources that are available through the World Wide Web. Some well-curated databases are identified, but the development of specialised algorithms appears to be limited.

Keywords: glycans
[Ma2003GadE] Z. Ma, S. Gong, H. Richard, D. L. Tucker, T. Conway, and J. W. Foster. GadE (YhiE) activates glutamate decarboxylase-dependent acid resistance in Escherichia coli K-12. Mol. Microbiol., 49(5):1309-1320, Sep 2003. [ bib ]
Commensal and pathogenic strains of Escherichia coli possess three inducible acid resistance systems that collaboratively protect cells against acid stress to pH 2 or below. The most effective system requires glutamate in the acid challenge media and relies on two glutamate decarboxylases (GadA and B) combined with a putative glutamate:gamma-aminobutyric acid antiporter (GadC). A complex network of regulators mediates induction of this system in response to various media, pH and growth phase signals. We report that the LuxR-like regulator GadE (formerly YhiE) is required for expression of gadA and gadBC regardless of media or growth conditions. This protein binds directly to the 20 bp GAD box sequence found in the control regions of both loci. Two previously identified AraC-like regulators, GadX and GadW, are only needed for gadA/BC expression under some circumstances. Overexpression of GadX or GadW will not overcome a need for GadE. However, overexpression of GadE can supplant a requirement for GadX and W. Data provided also indicate that GadX and GadE can simultaneously bind the area around the GAD box region and probably form a complex. The gadA, gadBC and gadE genes are all induced by low pH in exponential phase cells grown in minimal glucose media. The acid induction of gadA/BC results primarily from the acid induction of gadE. Constitutive expression of GadE removes most pH control over the glutamate decarboxylase and antiporter genes. The small amount of remaining pH control is governed by GadX and W. The finding that gadE mutations also diminish the effectiveness of the other two acid resistance systems suggests that GadE influences the expression of additional acid resistance components. The number of regulatory proteins (five), sigma factors (two) and regulatory feedback loops focused on gadA/BC expression make this one of the most intensively regulated systems in E. coli.

[Lugosi2003Concentration-of-measure] G. Lugosi. Concentration-of-measure inequalities. Lecture notes, January 2003. [ bib ]
[Lu2003Expression] Y.J. Lu, D. Williamson, R. Wang, B. Summersgill, S. Rodriguez, S. Rogers, K. Pritchard-Jones, C. Campbell, and J. Shipley. Expression profiling targeting chromosomes for tumor classification and prediction of clinical behavior. Genes Chromosomes Cancer, 38(3):207-214, 2003. [ bib | DOI | .pdf ]
Tumors are associated with altered or deregulated gene products that affect critical cellular functions. Here we assess the use of a global expression profiling technique that identifies chromosome regions corresponding to differential gene expression, termed comparative expressed sequence hybridization (CESH). CESH analysis was performed on a total of 104 tumors with a diagnosis of rhabdomyosarcoma, leiomyosarcoma, prostate cancer, and favorable-histology Wilms tumors. Through the use of the chromosome regions identified as variables, support vector machine analysis was applied to assess classification potential, and feature selection (recursive feature elimination) was used to identify the best discriminatory regions. We demonstrate that the CESH profiles have characteristic patterns in tumor groups and were also able to distinguish subgroups of rhabdomyosarcoma. The overall CESH profiles in favorable-histology Wilms tumors were found to correlate with subsequent clinical behavior. Classification by use of CESH profiles was shown to be similar in performance to previous microarray expression studies and highlighted regions for further investigation. We conclude that analysis of chromosomal expression profiles can group, subgroup, and even predict clinical behavior of tumors to a level of performance similar to that of microarray analysis. CESH is independent of selecting sequences for interrogation and is a simple, rapid, and widely accessible approach to identify clinically useful differential expression.

Keywords: biosvm
[Lu2003Preoperative] C. Lu, T. Van Gestel, J.A. Suykens, S. Van Huffel, I. Vergote, and D. Timmerman. Preoperative prediction of malignancy of ovarian tumors using least squares support vector machines. Artif. Intell. Med., 28(3):281-306, 2003. [ bib | DOI | http | .pdf ]
In this work, we develop and evaluate several least squares support vector machine (LS-SVM) classifiers within the Bayesian evidence framework, in order to preoperatively predict malignancy of ovarian tumors. The analysis includes exploratory data analysis, optimal input variable selection, parameter estimation, and performance evaluation via receiver operating characteristic (ROC) curve analysis. LS-SVM models with linear and radial basis function (RBF) kernels, and logistic regression models have been built on 265 training data, and tested on 160 newly collected patient data. The LS-SVM model with nonlinear RBF kernel achieves the best performance on the test set, with the area under the ROC curve (AUC), sensitivity and specificity equal to 0.92, 81.5% [...]. The best averaged performance over 30 runs of randomized cross-validation is also obtained by an LS-SVM RBF model, with AUC, sensitivity and specificity equal to 0.94, 90.0% [...]. The results show that the LS-SVM models have the potential to obtain a reliable preoperative distinction between benign and malignant ovarian tumors, and to assist the clinicians for making a correct diagnosis.

[Liu2003QSAR] H. X. Liu, R. S. Zhang, X. J. Yao, M. C. Liu, Z. D. Hu, and B. T. Fan. QSAR study of ethyl 2-[(3-methyl-2,5-dioxo(3-pyrrolinyl))amino]-4-(trifluoromethyl) pyrimidine-5-carboxylate: an inhibitor of AP-1 and NF-kappa B mediated gene expression based on support vector machines. J Chem Inf Comput Sci, 43(4):1288-96, 2003. [ bib | DOI | http | .pdf ]
The support vector machine, as a novel type of learning machine, for the first time, was used to develop a QSAR model of 57 analogues of ethyl 2-[(3-methyl-2,5-dioxo(3-pyrrolinyl))amino]-4-(trifluoromethyl)pyrimidine-5-carboxylate (EPC), an inhibitor of AP-1 and NF-kappa B mediated gene expression, based on calculated quantum chemical parameters. The quantum chemical parameters involved in the model are Kier and Hall index (order3) (KHI3), Information content (order 0) (IC0), YZ Shadow (YZS) and Max partial charge for an N atom (MaxPCN), Min partial charge for an N atom (MinPCN). The mean relative error of the training set, the validation set, and the testing set is 1.35%, 1.52%, and 2.23%, respectively, and the maximum relative error is less than 5.00%.

Keywords: biosvm
[Liu2003Diagnosing] H. X. Liu, R. S. Zhang, F. Luan, X. J. Yao, M. C. Liu, Z. D. Hu, and B. T. Fan. Diagnosing breast cancer based on support vector machines. J. Chem. Inf. Comput. Sci., 43(3):900-7, 2003. [ bib | DOI | http | .pdf ]
The Support Vector Machine (SVM) classification algorithm, recently developed from the machine learning community, was used to diagnose breast cancer. At the same time, the SVM was compared to several machine learning techniques currently used in this field. The classification task involves predicting the state of diseases, using data obtained from the UCI machine learning repository. SVM outperformed k-means cluster and two artificial neural networks on the whole. It can be concluded that nine samples could be mislabeled from the comparison of several machine learning techniques.

Keywords: breastcancer
[Liu2003in-silico] Huiqing Liu, Hao Han, Jinyan Li, and Limsoon Wong. An in-silico method for prediction of polyadenylation signals in human sequences. Genome Inform Ser Workshop Genome Inform, 14:84-93, 2003. [ bib ]
This paper presents a machine learning method to predict polyadenylation signals (PASes) in human DNA and mRNA sequences by analysing features around them. This method consists of three sequential steps of feature manipulation: generation, selection and integration of features. In the first step, new features are generated using k-gram nucleotide acid or amino acid patterns. In the second step, a number of important features are selected by an entropy-based algorithm. In the third step, support vector machines are employed to recognize true PASes from a large number of candidates. Our study shows that true PASes in DNA and mRNA sequences can be characterized by different features, and also shows that both upstream and downstream sequence elements are important for recognizing PASes from DNA sequences. We tested our method on several public data sets as well as our own extracted data sets. In most cases, we achieved better validation results than those reported previously on the same data sets. The important motifs observed are highly consistent with those reported in literature.

Keywords: biosvm
[Liu2003Building] B. Liu, Y. Dai, X. Li, W. S. Lee, and P. S. Yu. Building text classifiers using positive and unlabeled examples. In X. Wu, A. Tuzhilin, and J. Shavlik, editors, Proceedings of the Third IEEE International Conference on Data Mining, ICDM '03, pages 179-186, Washington, DC, USA, 2003. IEEE Computer Society. [ bib | http | .pdf ]
Keywords: PUlearning
[Lind2003Support] P. Lind and T. Maltseva. Support vector machines for the estimation of aqueous solubility. J Chem Inf Comput Sci, 43(6):1855-9, 2003. [ bib | DOI | http | .pdf ]
Support Vector Machines (SVMs) are used to estimate aqueous solubility of organic compounds. A SVM equipped with a Tanimoto similarity kernel estimates solubility with accuracy comparable to results from other reported methods where the same data sets have been studied. Complete cross-validation on a diverse data set resulted in a root-mean-squared error = 0.62 and R(2) = 0.88. The data input to the machine is in the form of molecular fingerprints. No physical parameters are explicitly involved in calculations.

Keywords: biosvm chemoinformatics
[Liao2003Combining] L. Liao and W.S. Noble. Combining Pairwise Sequence Similarity and Support Vector Machines for Detecting Remote Protein Evolutionary and Structural Relationships. J. Comput. Biol., 10(6):857-868, 2003. [ bib | http | .pdf ]
One key element in understanding the molecular machinery of the cell is to understand the structure and function of each protein encoded in the genome. A very successful means of inferring the structure or function of a previously unannotated protein is via sequence similarity with one or more proteins whose structure or function is already known. Toward this end, we propose a means of representing proteins using pairwise sequence similarity scores. This representation, combined with a discriminative classification algorithm known as the support vector machine (SVM), provides a powerful means of detecting subtle structural and evolutionary relationships among proteins. The algorithm, called SVM-pairwise, when tested on its ability to recognize previously unseen families from the SCOP database, yields significantly better performance than SVM-Fisher, profile HMMs, and PSI-BLAST.

Keywords: biosvm
[Liao2003Network] J. C. Liao, R. Boscolo, Y.-L. Yang, L. M. Tran, C. Sabatti, and V. P. Roychowdhury. Network component analysis: Reconstruction of regulatory signals in biological systems. Proc. Natl. Acad. Sci. USA, 100(26):15522-15527, 2003. [ bib | DOI | http | .pdf ]
[Li2003Learning] X. Li and B. Liu. Learning to classify texts using positive and unlabeled data. In G. Gottlob and T. Walsh, editors, IJCAI'03: Proceedings of the 18th international joint conference on Artificial intelligence, pages 587-592, San Francisco, CA, USA, 2003. Morgan Kaufmann Publishers Inc. [ bib ]
In traditional text classification, a classifier is built using labeled training documents of every class. This paper studies a different problem. Given a set P of documents of a particular class (called positive class) and a set U of unlabeled documents that contains documents from class P and also other types of documents (called negative class documents), we want to build a classifier to classify the documents in U into documents from P and documents not from P. The key feature of this problem is that there is no labeled negative document, which makes traditional text classification techniques inapplicable. In this paper, we propose an effective technique to solve the problem. It combines the Rocchio method and the SVM technique for classifier building. Experimental results show that the new method outperforms existing methods significantly.

[Li2003Simple] Jinyan Li, Huiqing Liu, James R Downing, Allen Eng-Juh Yeoh, and Limsoon Wong. Simple rules underlying gene expression profiles of more than six subtypes of acute lymphoblastic leukemia (ALL) patients. Bioinformatics, 19(1):71-8, Jan 2003. [ bib ]
MOTIVATIONS AND RESULTS: For classifying gene expression profiles or other types of medical data, simple rules are preferable to non-linear distance or kernel functions. This is because rules may help us understand more about the application in addition to performing an accurate classification. In this paper, we discover novel rules that describe the gene expression profiles of more than six subtypes of acute lymphoblastic leukemia (ALL) patients. We also introduce a new classifier, named PCL, to make effective use of the rules. PCL is accurate and can handle multiple parallel classifications. We evaluate this method by classifying 327 heterogeneous ALL samples. Our test error rate is competitive to that of support vector machines, and it is 71% better than C4.5, 50% better than Naive Bayes, and 43% better than k-nearest neighbour. Experimental results on another independent data sets are also presented to show the strength of our method. AVAILABILITY: Under http://sdmc.lit.org.sg/GEDatasets/, click on Supplementary Information.

Keywords: Acute, Algorithms, Automated, Base Pair Mismatch, Base Pairing, Base Sequence, Biological, Biosensing Techniques, Cluster Analysis, Comparative Study, Computer-Assisted, DNA, Gene Expression Profiling, Gene Expression Regulation, Genes, Genetic, Genetic Markers, Hemolysins, Humans, Leukemia, Lymphocytic, Markov Chains, Messenger, Models, Molecular Probe Techniques, Molecular Sequence Data, Nanotechnology, Neoplasm, Neoplastic, Neural Networks (Computer), Non-U.S. Gov't, Nucleic Acid Conformation, Oligonucleotide Array Sequence Analysis, Pattern Recognition, Quality Control, RNA, Research Support, Signal Processing, Statistical, Stomach Neoplasms, Tumor Markers, 12499295
[Lewis2003Prediction] Benjamin P Lewis, I-hung Shih, Matthew W Jones-Rhoades, David P Bartel, and Christopher B Burge. Prediction of mammalian microRNA targets. Cell, 115(7):787-798, Dec 2003. [ bib ]
MicroRNAs (miRNAs) can play important gene regulatory roles in nematodes, insects, and plants by basepairing to mRNAs to specify posttranscriptional repression of these messages. However, the mRNAs regulated by vertebrate miRNAs are all unknown. Here we predict more than 400 regulatory target genes for the conserved vertebrate miRNAs by identifying mRNAs with conserved pairing to the 5' region of the miRNA and evaluating the number and quality of these complementary sites. Rigorous tests using shuffled miRNA controls supported a majority of these predictions, with the fraction of false positives estimated at 31% for targets identified in human, mouse, and rat and 22% for targets identified in pufferfish as well as mammals. Eleven predicted targets (out of 15 tested) were supported experimentally using a HeLa cell reporter system. The predicted regulatory targets of mammalian miRNAs were enriched for genes involved in transcriptional regulation but also encompassed an unexpectedly broad range of other functions.

Keywords: sirna
[Leslie2003Mismatch] C. Leslie, E. Eskin, J. Weston, and W.S. Noble. Mismatch String Kernels for SVM Protein Classification. In Suzanna Becker, Sebastian Thrun, and Klaus Obermayer, editors, Advances in Neural Information Processing Systems 15. MIT Press, 2003. [ bib | .pdf | .pdf ]
Keywords: biosvm
[Lee2003Classification] Y. Lee and C.-K. Lee. Classification of multiple cancer types by multicategory support vector machines using gene expression data. Bioinformatics, 19(9):1132-1139, 2003. [ bib | http | .pdf ]
Motivation: High-density DNA microarray measures the activities of several thousand genes simultaneously and the gene expression profiles have been used for the cancer classification recently. This new approach promises to give better therapeutic measurements to cancer patients by diagnosing cancer types with improved accuracy. The Support Vector Machine (SVM) is one of the classification methods successfully applied to the cancer diagnosis problems. However, its optimal extension to more than two classes was not obvious, which might impose limitations in its application to multiple tumor types. We briefly introduce the Multicategory SVM, which is a recently proposed extension of the binary SVM, and apply it to multiclass cancer diagnosis problems. Results: Its applicability is demonstrated on the leukemia data (Golub et al., 1999) and the small round blue cell tumors of childhood data (Khan et al., 2001). Comparable classification accuracy shown in the applications and its flexibility render the MSVM a viable alternative to other classification methods. Supplementary Information: http://www.stat.ohio-state.edu/~yklee/msvm.html Contact: yklee@stat.ohio-state.edu

Keywords: biosvm
[Lee2003Learning] W. S. Lee and B. Liu. Learning with positive and unlabeled examples using weighted logistic regression. In T. Fawcett and N. Mishra, editors, Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), pages 448-455. AAAI Press, 2003. [ bib ]
[Lee2003Application] S.-I. Lee and S. Batzoglou. Application of independent component analysis to microarrays. Genome Biol., 4(11):R76, 2003. [ bib | DOI | http | .pdf ]
[Lee2003Discovery] Dongkwon Lee, Sang Wook Choi, Myengsoo Kim, Jin Hyun Park, Moonkyu Kim, Jungchul Kim, and In-Beum Lee. Discovery of differentially expressed genes related to histological subtype of hepatocellular carcinoma. Biotechnol Prog., 19(3):1011-5, 2003. [ bib | DOI | http | .pdf ]
Hepatocellular carcinoma (HCC) is one of the most common human malignancies in the world. To identify the histological subtype-specific genes of HCC, we analyzed the gene expression profile of 10 HCC patients by means of cDNA microarray. We proposed a systematic approach for determining the discriminatory genes and revealing the biological phenomena of HCC with cDNA microarray data. First, normalization of cDNA microarray data was performed to reduce or minimize systematic variations. On the basis of the suitably normalized data, we identified specific genes involved in histological subtype of HCC. Two classification methods, Fisher's discriminant analysis (FDA) and support vector machine (SVM), were used to evaluate the reliability of the selected genes and discriminate the histological subtypes of HCC. This study may provide a clue for the needs of different chemotherapy and the reason for heterogeneity of the clinical responses according to histological subtypes.

Keywords: biosvm
[Leach2003introduction] A. R. Leach and V. J. Gillet. An introduction to chemoinformatics. Kluwer Academic Publishers, 2003. [ bib ]
[LeRoch2003Discovery] K. G. Le Roch, Y. Zhou, P. L. Blair, M. Grainger, J. K. Moch, J. D. Haynes, P. De la Vega, A. A. Holder, S. Batalov, D. J. Carucci, and E. A. Winzeler. Discovery of gene function by expression profiling of the malaria parasite life cycle. Science, 301(5639):1503-1508, 2003. [ bib | DOI | http | .pdf ]
The completion of the genome sequence for Plasmodium falciparum, the species responsible for most malaria human deaths, has the potential to reveal hundreds of new drug targets and proteins involved in pathogenesis. However, only approximately 35% of the genes code for proteins with an identifiable function. The absence of routine genetic tools for studying Plasmodium parasites suggests that this number is unlikely to change quickly if conventional serial methods are used to characterize encoded proteins. Here, we use a high-density oligonucleotide array to generate expression profiles of human and mosquito stages of the malaria parasite's life cycle. Genes with highly correlated levels and temporal patterns of expression were often involved in similar functions or cellular processes.

Keywords: microarray plasmodium
[Kubinyi2003Comparative] H. Kubinyi. Comparative Molecular Field Analysis. In J. Gasteiger, editor, Handbook of Chemoinformatics. From Data to Knowledge, Volume 4, pages 1555-1574. Wiley-VCH, Weinheim, 2003. [ bib ]
Keywords: chemoinformatics
[Krishnan2003comparative] V. G. Krishnan and D. R. Westhead. A comparative study of machine-learning methods to predict the effects of single nucleotide polymorphisms on protein function. Bioinformatics, 19(17):2199-2209, 2003. [ bib | http | .pdf ]
Motivation: The large volume of single nucleotide polymorphism data now available motivates the development of methods for distinguishing neutral changes from those which have real biological effects. Here, two different machine-learning methods, decision trees and support vector machines (SVMs), are applied for the first time to this problem. In common with most other methods, only non-synonymous changes in protein coding regions of the genome are considered. Results: In detailed cross-validation analysis, both learning methods are shown to compete well with existing methods, and to out-perform them in some key tests. SVMs show better generalization performance, but decision trees have the advantage of generating interpretable rules with robust estimates of prediction confidence. It is shown that the inclusion of protein structure information produces more accurate methods, in agreement with other recent studies, and the effect of using predicted rather than actual structure is evaluated. Availability: Software is available on request from the authors.

Keywords: biosvm
[Kondor2003Kernel] R. Kondor and T. Jebara. A kernel between sets of vectors. In ICML '03: Proceedings of the 20th international conference on Machine learning, 2003. [ bib ]
[Koltchinskii2003Localized] V. Koltchinskii. Localized Rademacher complexities. Manuscript, September 2003. [ bib ]
[Koehn-et-al-03] P. Koehn, F. J. Och, and D. Marcu. Statistical phrase-based translation. In NAACL 2003, pages 48-54, Morristown, NJ, USA, 2003. Association for Computational Linguistics. [ bib | DOI ]
[Kim2003Protein] H. Kim and H. Park. Protein secondary structure prediction based on an improved support vector machines approach. Protein Eng., 16(8):553-560, Aug 2003. [ bib | http | .pdf ]
The prediction of protein secondary structure is an important step in the prediction of protein tertiary structure. A new protein secondary structure prediction method, SVMpsi, was developed to improve the current level of prediction by incorporating new tertiary classifiers and their jury decision system, and the PSI-BLAST PSSM profiles. Additionally, efficient methods to handle unbalanced data and a new optimization strategy for maximizing the Q3 measure were developed. The SVMpsi produces the highest published Q3 and SOV94 scores on both the RS126 and CB513 data sets to date. For a new KP480 set, the prediction accuracy of SVMpsi was Q3 = 78.5% [...] for 136 non-redundant protein sequences which do not contain homologues of training data sets were Q3 = 77.2% [...] SVMpsi results in CASP5 illustrate that it is another competitive method to predict protein secondary structure.

Keywords: biosvm
[Khvorova2003Functional] A. Khvorova, A. Reynolds, and S.D. Jayasena. Functional siRNAs and miRNAs exhibit strand bias. Cell, 115(2):209-216, Oct 2003. [ bib | DOI | http | .pdf ]
Both microRNAs (miRNA) and small interfering RNAs (siRNA) share a common set of cellular proteins (Dicer and the RNA-induced silencing complex [RISC]) to elicit RNA interference. In the following work, a statistical analysis of the internal stability of published miRNA sequences in the context of miRNA precursor hairpins revealed enhanced flexibility of miRNA precursors, especially at the 5'-anti-sense (AS) terminal base pair. The same trend was observed in siRNA, with functional duplexes displaying a lower internal stability (Δ0.5 kcal/mol) at the 5'-AS end than nonfunctional duplexes. Average internal stability of siRNA molecules retrieved from plant cells after introduction of long RNA sequences also shows this characteristic thermodynamic signature. Together, these results suggest that the thermodynamic properties of siRNA play a critical role in determining the molecule's function and longevity, possibly biasing the steps involved in duplex unwinding and strand retention by RISC.

Keywords: sirna
[Keserue2003Prediction] G. M. Keserü. Prediction of hERG potassium channel affinity by traditional and hologram QSAR methods. Bioorg. Med. Chem. Lett., 13(16):2773-2775, Aug 2003. [ bib ]
Traditional and hologram QSAR (HQSAR) models were developed for the prediction of hERG potassium channel affinities. The models were validated on three different test sets including compounds with published patch-clamp IC(50) data and two subsets from the World Drug Index (compounds indicated to have ECG modifying adverse effect and drugs marked to be approved, respectively). Discriminant analysis performed on the full set of hERG data resulted in a traditional QSAR model that classified 83% of actives and 87% of inactives correctly. Analysis of our HQSAR model revealed it to be predictive in both IC(50) and discrimination studies.

Keywords: chemoinformatics herg
[Keselman03many-to-many] Y. Keselman, A. Shokoufandeh, M. F. Demirci, and S. Dickinson. Many-to-many graph matching via metric embedding. In CVPR, pages 850-857, 2003. [ bib ]
[Kelley2003Conserved] B.P. Kelley, R. Sharan, R.M. Karp, T. Sittler, D.E. Root, B.R. Stockwell, and T. Ideker. Conserved pathways within bacteria and yeast as revealed by global protein network alignment. Proc. Natl. Acad. Sci. USA, 100(20):11394-11399, Sep 2003. [ bib | DOI | http | .pdf ]
We implement a strategy for aligning two protein-protein interaction networks that combines interaction topology and protein sequence similarity to identify conserved interaction pathways and complexes. Using this approach we show that the protein-protein interaction networks of two distantly related species, Saccharomyces cerevisiae and Helicobacter pylori, harbor a large complement of evolutionarily conserved pathways, and that a large number of pathways appears to have duplicated and specialized within yeast. Analysis of these findings reveals many well characterized interaction pathways as well as many unanticipated pathways, the significance of which is reinforced by their presence in the networks of both species.

[Keerthi2003SMO] S. S. Keerthi and S. K. Shevade. SMO algorithm for least-squares SVM formulations. Neural Comput, 15(2):487-507, Feb 2003. [ bib | DOI | http ]
This article extends the well-known SMO algorithm of support vector machines (SVMs) to least-squares SVM formulations that include LS-SVM classification, kernel ridge regression, and a particular form of regularized kernel Fisher discriminant. The algorithm is shown to be asymptotically convergent. It is also extremely easy to implement. Computational experiments show that the algorithm is fast and scales efficiently (quadratically) as a function of the number of examples.

[Keerthi2003Asymptotic] S. Sathiya Keerthi and Chih-Jen Lin. Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Comput, 15(7):1667-89, Jul 2003. [ bib | DOI | http ]
Support vector machines (SVMs) with the gaussian (RBF) kernel have been popular for practical use. Model selection in this class of SVMs involves two hyperparameters: the penalty parameter C and the kernel width sigma. This letter analyzes the behavior of the SVM classifier when these hyperparameters take very small or very large values. Our results help in understanding the hyperparameter space that leads to an efficient heuristic method of searching for hyperparameter values with small generalization errors. The analysis also indicates that if complete model selection using the gaussian kernel has been conducted, there is no need to consider linear SVM.

[Kazhdan2003Rotation] Michael Kazhdan, Thomas Funkhouser, and Szymon Rusinkiewicz. Rotation invariant spherical harmonic representation of 3d shape descriptors. In SGP '03: Proceedings of the 2003 Eurographics/ACM SIGGRAPH symposium on Geometry processing, pages 156-164, Aire-la-Ville, Switzerland, 2003. Eurographics Association. [ bib ]
[Kashima2003Marginalized] H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized Kernels between Labeled Graphs. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 321-328, New York, NY, USA, 2003. AAAI Press. [ bib | .pdf ]
Keywords: biosvm
[Kandola2003Learning] J. Kandola, J. Shawe-Taylor, and N. Cristianini. Learning Semantic Similarity. In Suzanna Becker, Sebastian Thrun, and Klaus Obermayer, editors, Advances in Neural Information Processing Systems 15. MIT Press, 2003. [ bib ]
[Kamath2003Systematic] R. S. Kamath, A. G. Fraser, Y. Dong, G. Poulin, R. Durbin, M. Gotta, A. Kanapin, N. Le Bot, S. Moreno, M. Sohrmann, D. P. Welchman, P. Zipperlen, and J. Ahringer. Systematic functional analysis of the Caenorhabditis elegans genome using RNAi. Nature, 421(6920):231-237, Jan 2003. [ bib | DOI | http | .pdf ]
A principal challenge currently facing biologists is how to connect the complete DNA sequence of an organism to its development and behaviour. Large-scale targeted-deletions have been successful in defining gene functions in the single-celled yeast Saccharomyces cerevisiae, but comparable analyses have yet to be performed in an animal. Here we describe the use of RNA interference to inhibit the function of approximately 86% of the 19,427 predicted genes of C. elegans. We identified mutant phenotypes for 1,722 genes, about two-thirds of which were not previously associated with a phenotype. We find that genes of similar functions are clustered in distinct, multi-megabase regions of individual chromosomes; genes in these regions tend to share transcriptional profiles. Our resulting data set and reusable RNAi library of 16,757 bacterial clones will facilitate systematic analyses of the connections among gene sequence, chromosomal location and gene function in C. elegans.

[Kalatzis2003Support] I. Kalatzis, D. Pappas, N. Piliouras, and D. Cavouras. Support vector machines based analysis of brain SPECT images for determining cerebral abnormalities in asymptomatic diabetic patients. Med Inform Internet Med, 28(3):221-30, Sep 2003. [ bib | DOI | http ]
Purpose: An image processing method was developed to investigate whether brain SPECT images of patients with diabetes mellitus type II (DMII) and no brain damage differ from those of normal subjects. Materials and methods: Twenty-five DMII patients and eight healthy volunteers underwent brain 99mTc-Bicisate SPECT examination. A semi-automatic method, allowing for physician's interaction, was developed to delineate specific brain regions (ROIs) on the SPECT images. Twenty-eight features from the grey-level histogram and the spatial-dependence matrix were computed from numerous small image-samples collected from each specific ROI. Classification into 'diabetics' and 'non-diabetics' was performed for each ROI separately. The classical least squares-minimum distance (LSMD) classifier and the recently developed support vector machines (SVM) classifier were used. System performance was evaluated by means of the leave-one-out method; one sample was left out, the classifier was trained by the rest of the samples, and the left-out sample was classified. By repeating for all samples, the classifier's performance could be tested on data not incorporated in its design. Results: Highest classification accuracies (LSMD: 97.8%, SVM: 99.1%) were achieved at the right occipital lobule employing two features, the standard deviation and entropy. For the rest of the ROIs classification accuracies ranged between 84.5 and 98.6%. Conclusion: Our findings indicate cerebral blood flow disruption in patients with DMII. The proposed system may assist physicians in evaluating cerebral blood flow in patients with DMII undergoing brain SPECT.

[Jorgensen2003Sense] R. A. Jorgensen. Sense cosuppression in plants: Past, present, and future. In G. J. Hannon, editor, RNAi: A guide to gene silencing. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, 2003. [ bib ]
Keywords: sirna
[Jansen2003Bayesian] R. Jansen, H. Yu, D. Greenbaum, Y. Kluger, N.J. Krogan, S. Chung, A. Emili, M. Snyder, J.F. Greenblatt, and M. Gerstein. A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science, 302(5644):449-453, 2003. [ bib | DOI | http | .pdf ]
We have developed an approach using Bayesian networks to predict protein-protein interactions genome-wide in yeast. Our method naturally weights and combines into reliable predictions genomic features only weakly associated with interaction (e.g., mRNA coexpression, coessentiality, and colocalization). In addition to de novo predictions, it can integrate often noisy, experimental interaction data sets. We observe that at given levels of sensitivity, our predictions are more accurate than the existing high-throughput experimental data sets. We validate our predictions with new TAP-tagging (tandem affinity purification) experiments.

Keywords: biogm
[Jambon2003New] Martin Jambon, Anne Imberty, Gilbert Deléage, and Christophe Geourjon. A new bioinformatic approach to detect common 3d sites in protein structures. Proteins, 52(2):137-145, Aug 2003. [ bib | DOI | http ]
An innovative bioinformatic method has been designed and implemented to detect similar three-dimensional (3D) sites in proteins. This approach allows the comparison of protein structures or substructures and detects local spatial similarities: this method is completely independent from the amino acid sequence and from the backbone structure. In contrast to already existing tools, the basis for this method is a representation of the protein structure by a set of stereochemical groups that are defined independently from the notion of amino acid. An efficient heuristic for finding similarities that uses graphs of triangles of chemical groups to represent the protein structures has been developed. The implementation of this heuristic constitutes a software named SuMo (Surfing the Molecules), which allows the dynamic definition of chemical groups, the selection of sites in the proteins, and the management and screening of databases. To show the relevance of this approach, we focused on two extreme examples illustrating convergent and divergent evolution. In two unrelated serine proteases, SuMo detects one common site, which corresponds to the catalytic triad. In the legume lectins family composed of >100 structures that share similar sequences and folds but may have lost their ability to bind a carbohydrate molecule, SuMo discriminates between functional and non-functional lectins with a selectivity of 96%. The time needed for searching a given site in a protein structure is typically 0.1 s on a PIII 800MHz/Linux computer; thus, in further studies, SuMo will be used to screen the PDB.

Keywords: Algorithms; Catalytic Domain; Chymotrypsin, chemistry/genetics; Computational Biology, methods; Evolution, Molecular; Fabaceae, chemistry; Models, Molecular; Plant Lectins, chemistry/genetics; Protein Conformation; Proteins, chemistry; Reproducibility of Results; Subtilisin, chemistry/genetics
[Jaenisch2003Epigenetic] Rudolf Jaenisch and Adrian Bird. Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals. Nat. Genet., 33 Suppl:245-254, Mar 2003. [ bib | DOI | http ]
Cells of a multicellular organism are genetically homogeneous but structurally and functionally heterogeneous owing to the differential expression of genes. Many of these differences in gene expression arise during development and are subsequently retained through mitosis. Stable alterations of this kind are said to be 'epigenetic', because they are heritable in the short term but do not involve mutations of the DNA itself. Research over the past few years has focused on two molecular mechanisms that mediate epigenetic phenomena: DNA methylation and histone modifications. Here, we review advances in the understanding of the mechanism and role of DNA methylation in biological processes. Epigenetic effects by means of DNA methylation have an important role in development but can also arise stochastically as animals age. Identification of proteins that mediate these effects has provided insight into this complex process and diseases that occur when it is perturbed. External influences on epigenetic processes are seen in the effects of diet on long-term diseases such as cancer. Thus, epigenetic mechanisms seem to allow an organism to respond to the environment through changes in gene expression. The extent to which environmental effects can provoke epigenetic responses represents an exciting area of future research.

Keywords: Aging; Animals; Cloning, Organism; DNA Methylation; Diet; Dosage Compensation, Genetic; Gene Expression Regulation, Developmental; Genetic Diseases, Inborn; Genome; Genomic Imprinting; Humans; Mice; Models, Genetic; Mutation; Neoplasms; Phenotype; Signal Transduction
[Jackson2003Expression] A. L. Jackson, S. R. Bartz, J. Schelter, S. V. Kobayashi, J. Burchard, M. Mao, B. Li, G. Cavet, and P. S. Linsley. Expression profiling reveals off-target gene regulation by RNAi. Nat. Biotechnol., 21(6):635-7, Jun 2003. [ bib | DOI | http ]
RNA interference is thought to require near-identity between the small interfering RNA (siRNA) and its cognate mRNA. Here, we used gene expression profiling to characterize the specificity of gene silencing by siRNAs in cultured human cells. Transcript profiles revealed siRNA-specific rather than target-specific signatures, including direct silencing of nontargeted genes containing as few as eleven contiguous nucleotides of identity to the siRNA. These results demonstrate that siRNAs may cross-react with targets of limited sequence similarity.

Keywords: sirna
[Irizarry2003Exploration] R. A. Irizarry, B. Hobbs, F. Collin, Y. D. Beazer-Barclay, K. J. Antonellis, U. Scherf, and T. P. Speed. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4(2):249-264, Apr 2003. [ bib | DOI | http | .pdf ]
In this paper we report exploratory analyses of high-density oligonucleotide array data from the Affymetrix GeneChip system with the objective of improving upon currently used measures of gene expression. Our analyses make use of three data sets: a small experimental study consisting of five MGU74A mouse GeneChip arrays, part of the data from an extensive spike-in study conducted by Gene Logic and Wyeth's Genetics Institute involving 95 HG-U95A human GeneChip arrays; and part of a dilution study conducted by Gene Logic involving 75 HG-U95A GeneChip arrays. We display some familiar features of the perfect match and mismatch probe (PM and MM) values of these data, and examine the variance-mean relationship with probe-level data from probes believed to be defective, and so delivering noise only. We explain why we need to normalize the arrays to one another using probe level intensities. We then examine the behavior of the PM and MM using spike-in data and assess three commonly used summary measures: Affymetrix's (i) average difference (AvDiff) and (ii) MAS 5.0 signal, and (iii) the Li and Wong multiplicative model-based expression index (MBEI). The exploratory data analyses of the probe level data motivate a new summary measure that is a robust multi-array average (RMA) of background-adjusted, normalized, and log-transformed PM values. We evaluate the four expression summary measures using the dilution study data, assessing their behavior in terms of bias, variance and (for MBEI and RMA) model fit. Finally, we evaluate the algorithms in terms of their ability to detect known levels of differential expression using the spike-in data. We conclude that there is no obvious downside to using RMA and attaching a standard error (SE) to this quantity using a linear model which removes probe-specific affinities.

Keywords: csbcbook, csbcbook-ch2
[Consortium2003International] International HapMap Consortium. The International HapMap Project. Nature, 426(6968):789-796, Dec 2003. [ bib ]
The goal of the International HapMap Project is to determine the common patterns of DNA sequence variation in the human genome and to make this information freely available in the public domain. An international consortium is developing a map of these patterns across the genome by determining the genotypes of one million or more sequence variants, their frequencies and the degree of association between them, in DNA samples from populations with ancestry from parts of Africa, Asia and Europe. The HapMap will allow the discovery of sequence variants that affect common disease, will facilitate development of diagnostic tools, and will enhance our ability to choose targets for therapeutic intervention.

Keywords: Base Sequence; Continental Population Groups, genetics; DNA, genetics; Gene Frequency; Genetic Variation, genetics; Genome, Human; Genomics, methods; Haplotypes, genetics; Humans; International Cooperation; Polymorphism, Single Nucleotide, genetics; Public Sector
[Inokuchi2003Complete] A. Inokuchi, T. Washio, and H. Motoda. Complete mining of frequent patterns from graphs: mining graph data. Mach. Learn., 50(3):321-354, 2003. [ bib ]
[Imoto2003Bayesian] S. Imoto, S. Kim, T. Goto, S. Miyano, S. Aburatani, K. Tashiro, and S. Kuhara. Bayesian network and nonparametric heteroscedastic regression for nonlinear modeling of genetic network. J. Bioinform. Comput. Biol., 1(2):231-252, Jul 2003. [ bib | DOI | http | .pdf ]
We propose a new statistical method for constructing a genetic network from microarray gene expression data by using a Bayesian network. An essential point of Bayesian network construction is the estimation of the conditional distribution of each random variable. We consider fitting nonparametric regression models with heterogeneous error variances to the microarray gene expression data to capture the nonlinear structures between genes. Selecting the optimal graph, which gives the best representation of the system among genes, is still a problem to be solved. We theoretically derive a new graph selection criterion from Bayes approach in general situations. The proposed method includes previous methods based on Bayesian networks. We demonstrate the effectiveness of the proposed method through the analysis of Saccharomyces cerevisiae gene expression data newly obtained by disrupting 100 genes.

Keywords: biogm
[Iizuka2003Oligonucleotide] Norio Iizuka, Masaaki Oka, Hisafumi Yamada-Okabe, Minekatsu Nishida, Yoshitaka Maeda, Naohide Mori, Takashi Takao, Takao Tamesa, Akira Tangoku, Hisahiro Tabuchi, Kenji Hamada, Hironobu Nakayama, Hideo Ishitsuka, Takanobu Miyamoto, Akira Hirabayashi, Shunji Uchimura, and Yoshihiko Hamamoto. Oligonucleotide microarray for prediction of early intrahepatic recurrence of hepatocellular carcinoma after curative resection. Lancet, 361(9361):923-9, Mar 2003. [ bib | DOI | http | .pdf ]
BACKGROUND: Hepatocellular carcinoma has a poor prognosis because of the high intrahepatic recurrence rate. There are technological limitations to traditional methods such as TNM staging for accurate prediction of recurrence, suggesting that new techniques are needed. METHODS: We investigated mRNA expression profiles in tissue specimens from a training set, comprising 33 patients with hepatocellular carcinoma, with high-density oligonucleotide microarrays representing about 6000 genes. We used this training set in a supervised learning manner to construct a predictive system, consisting of 12 genes, with the Fisher linear classifier. We then compared the predictive performance of our system with that of a predictive system with a support vector machine (SVM-based system) on a blinded set of samples from 27 newly enrolled patients. FINDINGS: Early intrahepatic recurrence within 1 year after curative surgery occurred in 12 (36%) and eight (30%) patients in the training and blinded sets, respectively. Our system correctly predicted early intrahepatic recurrence or non-recurrence in 25 (93%) of 27 samples in the blinded set and had a positive predictive value of 88% and a negative predictive value of 95%. By contrast, the SVM-based system predicted early intrahepatic recurrence or non-recurrence correctly in only 16 (60%) individuals in the blinded set, and the result yielded a positive predictive value of only 38% and a negative predictive value of 79%. INTERPRETATION: Our system predicted early intrahepatic recurrence or non-recurrence for patients with hepatocellular carcinoma much more accurately than the SVM-based system, suggesting that our system could serve as a new method for characterising the metastatic potential of hepatocellular carcinoma.

[Ifantis2003nonlinear] A. Ifantis and S. Papadimitriou. The nonlinear predictability of the electrotelluric field variations data analyzed with support vector machines as an earthquake precursor. Int J Neural Syst, 13(5):315-32, Oct 2003. [ bib ]
This work investigates the nonlinear predictability of the Electro Telluric Field (ETF) variations data in order to develop new intelligent tools for the difficult task of earthquake prediction. Support Vector Machines trained on a signal window have been used to predict the next sample. We observe a significant increase at this short-term unpredictability of the ETF signal at about two weeks time period before the major earthquakes that took place in regions near the recording devices. The unpredictability increase can be attributed to a quick time variation of the dynamics that produce the ETF signal due to the earthquake generation process. Thus, this increase can be taken into advantage for signaling for an increased possibility of a large earthquake within the next few days in the neighboring region of the recording station.

Keywords: Air Pollutants, Aircraft, Algorithms, Artificial Intelligence, Automated, Base Composition, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, Computing Methodologies, Cytosine, Data Interpretation, Databases, Enhancer Elements (Genetics), Environmental Monitoring, Ethanol, Exons, Fourier Transform Infrared, Genetic, Guanine, Humans, Image Interpretation, Natural Disasters, Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Online Systems, P.H.S., Pattern Recognition, Photography, Probability, Pyrimidines, RNA Precursors, RNA Splice Sites, RNA Splicing, Radiation, Reproducibility of Results, Research Support, Sensitivity and Specificity, Signal Processing, Spectroscopy, Statistical, Subtraction Technique, Thermodynamics, Time Factors, U.S. Gov't, Untranslated Regions, Video Recording, Walking, 14652873
[Huh2003Global] W.-K. Huh, J. V. Falvo, L. C. Gerke, A. S. Carroll, R. W. Howson, J. S. Weissman, and E. K. O'Shea. Global analysis of protein localization in budding yeast. Nature, 425(6959):686-691, Oct 2003. [ bib | DOI | http | .pdf ]
A fundamental goal of cell biology is to define the functions of proteins in the context of compartments that organize them in the cellular environment. Here we describe the construction and analysis of a collection of yeast strains expressing full-length, chromosomally tagged green fluorescent protein fusion proteins. We classify these proteins, representing 75% of the yeast proteome, into 22 distinct subcellular localization categories, and provide localization information for 70% of previously unlocalized proteins. Analysis of this high-resolution, high-coverage localization data set in the context of transcriptional, genetic, and protein-protein interaction data helps reveal the logic of transcriptional co-regulation, and provides a comprehensive view of interactions within and between organelles in eukaryotic cells.

[Bild2003] E. Huang, S. Ishida, J. Pittman, H. Dressman, A. Bild, M. Kloos, M. D'Amico, R. G. Pestell, M. West, and J. R. Nevins. Gene expression phenotypic models that predict the activity of oncogenic pathways. Nat Genet, 34(2):226-30, 2003. [ bib ]
High-density DNA microarrays measure expression of large numbers of genes in one assay. The ability to find underlying structure in complex gene expression data sets and rigorously test association of that structure with biological conditions is essential to developing multi-faceted views of the gene activity that defines cellular phenotype. We sought to connect features of gene expression data with biological hypotheses by integrating 'metagene' patterns from DNA microarray experiments in the characterization and prediction of oncogenic phenotypes. We applied these techniques to the analysis of regulatory pathways controlled by the genes HRAS (Harvey rat sarcoma viral oncogene homolog), MYC (myelocytomatosis viral oncogene homolog) and E2F1, E2F2 and E2F3 (encoding E2F transcription factors 1, 2 and 3, respectively). The phenotypic models accurately predict the activity of these pathways in the context of normal cell proliferation. Moreover, the metagene models trained with gene expression patterns evoked by ectopic production of Myc or Ras proteins in primary tissue culture cells properly predict the activity of in vivo tumor models that result from deregulation of the MYC or HRAS pathways. We conclude that these gene expression phenotypes have the potential to characterize the complex genetic alterations that typify the neoplastic state, whether in vitro or in vivo, in a way that truly reflects the complexity of the regulatory pathways that are affected.

Keywords: Animals; *Cell Cycle Proteins; *DNA-Binding Proteins; E2F Transcription Factors; E2F1 Transcription Factor; E2F2 Transcription Factor; E2F3 Transcription Factor; Female; *Gene Expression; Gene Expression Profiling; Gene Expression Regulation, Neoplastic; Genes, myc; Genes, ras; Mammary Neoplasms, Experimental/genetics; Mice; Mice, Transgenic; *Models, Genetic; Oligonucleotide Array Sequence Analysis; *Oncogenes; Phenotype; Transcription Factors/genetics
[Hou2003Efficient] Y. Hou, W. Hsu, M. L. Lee, and C. Bystroff. Efficient remote homology detection using local structure. Bioinformatics, 19(17):2294-2301, 2003. [ bib | http | .pdf ]
Motivation: The function of an unknown biological sequence can often be accurately inferred if we are able to map this unknown sequence to its corresponding homologous family. At present, discriminative methods such as SVM-Fisher and SVM-pairwise, which combine support vector machine (SVM) and sequence similarity, are recognized as the most accurate methods, with SVM-pairwise being the most accurate. However, these methods typically encode sequence information into their feature vectors and ignore the structure information. They are also computationally inefficient. Based on these observations, we present an alternative method for SVM-based protein classification. Our proposed method, SVM-I-sites, utilizes structure similarity for remote homology detection. Result: We run experiments on the Structural Classification of Proteins 1.53 data set. The results show that SVM-I-sites is more efficient than SVM-pairwise. Further, we find that SVM-I-sites outperforms sequence-based methods such as PSI-BLAST, SAM, and SVM-Fisher while achieving a comparable performance with SVM-pairwise. Availability: I-sites server is accessible through the web at http://www.bioinfo.rpi.edu. Programs are available upon request for academics. Licensing agreements are available for commercial interests. The framework of encoding local structure into feature vector is available upon request.

Keywords: biosvm
[Horvath2003Neighborhooda] D. Horvath and C. Jeandenans. Neighborhood behavior of in silico structural spaces with respect to in vitro activity spaces - a benchmark for neighborhood behavior assessment of different in silico similarity metrics. J. Chem. Inf. Comput. Sci., 43(2):691-698, 2003. [ bib | DOI | http | .pdf ]
In a previous work, we have introduced Neighborhood Behavior (NB) criteria for calculated molecular similarity metrics, based on the analysis of in vitro activity spaces that simultaneously monitor the responses of a compound with respect to an entire panel of biologically relevant receptors and enzymes. Now, these novel NB criteria will be used as a benchmark for the comparison of different in silico molecular similarity metrics, addressing the following topics: (1) the relative performance of 2D vs 3D descriptors, (2) the importance of the similarity scoring function for a given descriptor set, and (3) binary or Fuzzy Pharmacophore Fingerprints: can they capture the similarity of the spatial distribution of pharmacophoric groups despite different molecular connectivity? It was found that fuzzy pharmacophore descriptors (FBPA) displayed an optimal NB and, unlike their binary counterparts, were successful in evidencing pharmacophore pattern similarity independently of topological similarity. Topological FBPA, identical to the former except for the use of topological instead of 3D atom pair distances, display a somewhat weaker, but significant NB. Metrics based on "classical" global 2D and 3D molecular descriptors and a Dice scoring function also performed well. The choice of the similarity scoring function is therefore as important as the choice of the appropriate molecular descriptors.

Keywords: chemoinformatics
[Horn2003GPCRDB] F. Horn, E. Bettler, L. Oliveira, F. Campagne, F. E. Cohen, and G. Vriend. GPCRDB information system for G protein-coupled receptors. Nucl. Acids Res., 31(1):294-297, 2003. [ bib | DOI | arXiv | http ]
The GPCRDB is a molecular class-specific information system that collects, combines, validates and disseminates heterogeneous data on G protein-coupled receptors (GPCRs). The database stores data on sequences, ligand binding constants and mutations. The system also provides computationally derived data such as sequence alignments, homology models, and a series of query and visualization tools. The GPCRDB is updated automatically once every 4-5 months and is freely accessible at http://www.gpcr.org/7tm/.

Keywords: chemogenomics
[Harborth2003Sequence] J. Harborth, S. M. Elbashir, K. Vandenburgh, H. Manninga, S. A. Scaringe, K. Weber, and T. Tuschl. Sequence, chemical, and structural variation of small interfering RNAs and short hairpin RNAs and the effect on mammalian gene silencing. Antisense Nucleic Acid. Drug. Dev., 13(2):83-105, Apr 2003. [ bib | DOI | http ]
Small interfering RNAs (siRNAs) induce sequence-specific gene silencing in mammalian cells and guide mRNA degradation in the process of RNA interference (RNAi). By targeting endogenous lamin A/C mRNA in human HeLa or mouse SW3T3 cells, we investigated the positional variation of siRNA-mediated gene silencing. We find cell-type-dependent global effects and cell-type-independent positional effects. HeLa cells were about 2-fold more responsive to siRNAs than SW3T3 cells but displayed a very similar pattern of positional variation of lamin A/C silencing. In HeLa cells, 26 of 44 tested standard 21-nucleotide (nt) siRNA duplexes reduced the protein expression by at least 90%, and only 2 duplexes reduced the lamin A/C proteins to <50%. Fluorescent chromophores did not perturb gene silencing when conjugated to the 5'-end or 3'-end of the sense siRNA strand and the 5'-end of the antisense siRNA strand, but conjugation to the 3'-end of the antisense siRNA abolished gene silencing. RNase-protecting phosphorothioate and 2'-fluoropyrimidine RNA backbone modifications of siRNAs did not significantly affect silencing efficiency, although cytotoxic effects were observed when every second phosphate of an siRNA duplex was replaced by phosphorothioate. Synthetic RNA hairpin loops were subsequently evaluated for lamin A/C silencing as a function of stem length and loop composition. As long as the 5'-end of the guide strand coincided with the 5'-end of the hairpin RNA, 19-29 base pair (bp) hairpins effectively silenced lamin A/C, but when the hairpin started with the 5'-end of the sense strand, only 21-29 bp hairpins were highly active.

Keywords: Adaptor Protein Complex alpha Subunits, Animal, Animals, Antisense, Apolipoproteins B, Base Sequence, Biological Transport, Blotting, Catalytic, Cell Line, Cell Membrane, Cell Survival, Chemical, Cholesterol, Clathrin, Clathrin Heavy Chains, Disease Models, Endocytosis, Epidermal Growth Factor, Fluorescence, Gene Expression Profiling, Gene Silencing, Gene Therapy, Hela Cells, Humans, Injections, Intravenous, Jejunum, Kinetics, Lamin Type A, Liver, Messenger, Metabolic Syndrome X, Mice, Microscopy, Models, Molecular Sequence Data, NIH 3T3 Cells, Non-U.S. Gov't, Nucleic Acid, Oligonucleotides, Open Reading Frames, Post-Transcriptional, Protein Isoforms, Pyrimidines, RNA, RNA Interference, RNA Processing, RNA Stability, Research Support, Reverse Transcriptase Polymerase Chain Reaction, Sensitivity and Specificity, Sequence Homology, Small Interfering, Subcellular Fractions, Swiss 3T3 Cells, Thionucleotides, Time Factors, Transfection, Transferrin, Transgenic, Tumor, Western, 12804036
[Gartner2003grapha] T. Gärtner, P. Flach, and S. Wrobel. On graph kernels: hardness results and efficient alternatives. In Proceedings of COLT / Kernel workshop, 2003. [ bib ]
[Gartner2003graph] T. Gärtner, P. Flach, and S. Wrobel. On graph kernels: hardness results and efficient alternatives. In B. Schölkopf and M. Warmuth, editors, Proceedings of the Sixteenth Annual Conference on Computational Learning Theory and the Seventh Annual Workshop on Kernel Machines, volume 2777 of Lecture Notes in Computer Science, pages 129-143, Heidelberg, 2003. Springer. [ bib | DOI | http ]
As most 'real-world' data is structured, research in kernel methods has begun investigating kernels for various kinds of structured data. One of the most widely used tools for modeling structured data are graphs. An interesting and important challenge is thus to investigate kernels on instances that are represented by graphs. So far, only very specific graphs such as trees and strings have been considered. This paper investigates kernels on labeled directed graphs with general structure. It is shown that computing a strictly positive definite graph kernel is at least as hard as solving the graph isomorphism problem. It is also shown that computing an inner product in a feature space indexed by all possible graphs, where each feature counts the number of subgraphs isomorphic to that graph, is NP-hard. On the other hand, inner products in an alternative feature space, based on walks in the graph, can be computed in polynomial time. Such kernels are defined in this paper.

[Gartner2003Survey] T. Gärtner. A Survey of Kernels for Structured Data. SIGKDD Explor. Newsl., 5(1):49-58, 2003. [ bib | DOI ]
Keywords: kernel-theory
[Guyon2003introduction] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. J. Mach. Learn. Res., 3:1157-1182, 2003. [ bib | .pdf | .pdf ]
[Gutin03Traveling] G. Gutin. Travelling salesman and related problems. In Handbook of Graph Theory, 2003. [ bib ]
[Gururaja2003Multiple] T. Gururaja, W. Li, W.S. Noble, D.G. Payan, and D.C. Anderson. Multiple functional categories of proteins identified in an in vitro cellular ubiquitin affinity extract using shotgun peptide sequencing. J Proteome Res, 2:394-404, 2003. [ bib | .pdf ]
Using endogenous human cellular ubiquitin system enzymes and added his-tagged ubiquitin, ATP, and an ATP-regenerating system, we labelled cellular proteins with hexahistidine tagged ubiquitin in vitro. Labeling was dependent on ATP and the ATP recycling system, on the proteasome inhibitor MG132 and the ubiquitin protease inhibitor ubiquitin aldehyde, and was inhibited by iodoacetamide. Labeled proteins were affinity extracted in quadruplicate and tryptic peptides identified by 2D capillary LC/MS/MS combined with SEQUEST and MEDUSA analyses. Support vector machine analysis of the mass spectrometry data allowed prediction of correct matches between mass spectrometry data and peptide sequences. Overall, 144 proteins were identified by peptides predicted to be correctly sequenced, and 113 were identified by at least three peptides or one or two peptides with at least an 80% [...]. Identified proteins included 22 proteasome subunits or associated proteins, 18 E1, E2 or E3 ubiquitin system enzymes or related proteins, and four ubiquitin domain proteins. Seventeen directly ubiquitinated proteins or proteins associated with the ubiquitin system were identified. Functional clusters of other proteins included redox enzymes, proteins associated with endocytosis, cytoskeletal proteins, DNA damage or repair related proteins, calcium binding proteins, and splicing factor and related proteins, suggesting that in vitro ubiquitination is not random, and that these functions may be regulated by the ubiquitin system. This map of cellular ubiquitinated proteins and their interacting proteins will be useful for further studies of ubiquitin system function.

Keywords: biosvm
[Gordon2003Sequence] L. Gordon, A. Y. Chervonenkis, A. J. Gammerman, I. A. Shahmuradov, and V. V. Solovyev. Sequence alignment kernel for recognition of promoter regions. Bioinformatics, 19(15):1964-1971, 2003. [ bib | http | .pdf ]
In this paper we propose a new method for recognition of prokaryotic promoter regions with startpoints of transcription. The method is based on Sequence Alignment Kernel, a function reflecting the quantitative measure of match between two sequences. This kernel function is further used in Dual SVM, which performs the recognition. Several recognition methods have been trained and tested on a positive data set, consisting of 669 sigma70-promoter regions with known transcription startpoints of Escherichia coli, and two negative data sets of 709 examples each, taken from coding and non-coding regions of the same genome. The results show that our method performs well and achieves 16.5% [...] data and 18.6% [...] data. Availability: The demo version of our method is accessible from our website http://mendel.cs.rhul.ac.uk/

Keywords: biosvm
[Gomez2003Learning] S. M. Gomez, W. S. Noble, and A. Rzhetsky. Learning to predict protein-protein interactions from protein sequences. Bioinformatics, 19(15):1875-1881, 2003. [ bib | http | .pdf ]
In order to understand the molecular machinery of the cell, we need to know about the multitude of protein-protein interactions that allow the cell to function. High-throughput technologies provide some data about these interactions, but so far that data is fairly noisy. Therefore, computational techniques for predicting protein-protein interactions could be of significant value. One approach to predicting interactions in silico is to produce from first principles a detailed model of a candidate interaction. We take an alternative approach, employing a relatively simple model that learns dynamically from a large collection of data. In this work, we describe an attraction-repulsion model, in which the interaction between a pair of proteins is represented as the sum of attractive and repulsive forces associated with small, domain- or motif-sized features along the length of each protein. The model is discriminative, learning simultaneously from known interactions and from pairs of proteins that are known (or suspected) not to interact. The model is efficient to compute and scales well to very large collections of data. In a cross-validated comparison using known yeast interactions, the attraction-repulsion method performs better than several competing techniques.

Keywords: biosvm
[Gillet2003Similarity] V. Gillet, P. Willett, and J. Bradshaw. Similarity searching using reduced graphs. J. Chem. Inf. Comput. Sci., 43:338-345, 2003. [ bib ]
[Ge2003Reducing] Xijin Ge, Shuichi Tsutsumi, Hiroyuki Aburatani, and Shuichi Iwata. Reducing false positives in molecular pattern recognition. Genome Inform Ser Workshop Genome Inform, 14:34-43, 2003. [ bib ]
In the search for new cancer subtypes by gene expression profiling, it is essential to avoid misclassifying samples of unknown subtypes as known ones. In this paper, we evaluated the false positive error rates of several classification algorithms through a 'null test' by presenting classifiers a large collection of independent samples that do not belong to any of the tumor types in the training dataset. The benchmark dataset is available at www2.genome.rcast.u-tokyo.ac.jp/pm/. We found that k-nearest neighbor (KNN) and support vector machine (SVM) have very high false positive error rates when fewer genes (<100) are used in prediction. The error rate can be partially reduced by including more genes. On the other hand, prototype matching (PM) method has a much lower false positive error rate. Such robustness can be achieved without loss of sensitivity by introducing suitable measures of prediction confidence. We also proposed a cluster-and-select technique to select genes for classification. The nonparametric Kruskal-Wallis H test is employed to select genes differentially expressed in multiple tumor types. To reduce the redundancy, we then divided these genes into clusters with similar expression patterns and selected a given number of genes from each cluster. The reliability of the new algorithm is tested on three public datasets.

Keywords: Amino Acid Sequence, Amino Acids, Animals, Automated, Base Sequence, Bayes Theorem, Biological, Carbohydrate Conformation, Carbohydrate Sequence, Cattle, Computational Biology, Computer Simulation, Crystallography, DNA, Databases, Factual, False Positive Reactions, Gene Expression Profiling, Genes, Genetic, Genetic Techniques, Genome, Histocompatibility Antigens Class I, Human, Humans, Introns, Least-Squares Analysis, MHC Class I, Major Histocompatibility Complex, Markov Chains, Messenger, Mice, Models, Monosaccharides, Neoplasms, Non-U.S. Gov't, Nonparametric, Pattern Recognition, Peptides, Phylogeny, Plants, Poly A, Polysaccharides, Predictive Value of Tests, Protein, Protein Structure, Proteins, RNA, Rats, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Secondary, Sequence Alignment, Software, Species Specificity, Statistics, Theoretical, X-Ray, 15706518
[Garrett2003Comparison] D. Garrett, D. A. Peterson, C. Anderson, and M. Thaut. Comparison of linear, nonlinear, and feature selection methods for EEG signal classification. IEEE Trans Neural Syst Rehabil Eng, 11(2):141-4, Jun 2003. [ bib ]
The reliable operation of brain-computer interfaces (BCIs) based on spontaneous electroencephalogram (EEG) signals requires accurate classification of multichannel EEG. The design of EEG representations and classifiers for BCI are open research questions whose difficulty stems from the need to extract complex spatial and temporal patterns from noisy multidimensional time series obtained from EEG measurements. The high-dimensional and noisy nature of EEG may limit the advantage of nonlinear classification methods over linear ones. This paper reports the results of a linear (linear discriminant analysis) and two nonlinear classifiers (neural networks and support vector machines) applied to the classification of spontaneous EEG during five mental tasks, showing that nonlinear classifiers produce only slightly better classification results. An approach to feature selection based on genetic algorithms is also presented with preliminary results of application to EEG during finger movement.

Keywords: 80 and over, Adnexal Diseases, Adult, Aged, Algorithms, Artificial Intelligence, Automated, Bayes Theorem, Biological, Brain, Brain Mapping, Breast Neoplasms, Case-Control Studies, Chromatography, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA, Diagnosis, Differential, Discriminant Analysis, Electroencephalography, Evoked Potentials, Feasibility Studies, Female, Fingers, Gene Expression Profiling, Gene Expression Regulation, Genetic, Genetic Markers, Genetic Predisposition to Disease, Genetic Screening, Habituation (Psychophysiology), High Pressure Liquid, Humans, Linear Models, Logistic Models, Male, Middle Aged, Migraine, Models, Movement, Neural Networks (Computer), Neurological, Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Nucleosides, Ovarian Neoplasms, Pattern Recognition, Photic Stimulation, Predictive Value of Tests, ROC Curve, Reproducibility of Results, Research Support, Sensitivity and Specificity, Signal Processing, Software, Statistical, Thinking, Tumor Markers, U.S. Gov't, User-Computer Interface, Visual, 12899257
[Gardner2003Inferring] T. S. Gardner, D. Bernardo, D. Lorenz, and J. J. Collins. Inferring genetic networks and identifying compound mode of action via expression profiling. Science, 301(5629):102-105, Jul 2003. [ bib | DOI | http | .pdf ]
The complexity of cellular gene, protein, and metabolite networks can hinder attempts to elucidate their structure and function. To address this problem, we used systematic transcriptional perturbations to construct a first-order model of regulatory interactions in a nine-gene subnetwork of the SOS pathway in Escherichia coli. The model correctly identified the major regulatory genes and the transcriptional targets of mitomycin C activity in the subnetwork. This approach, which is experimentally and computationally scalable, provides a framework for elucidating the functional properties of genetic networks and identifying molecular targets of pharmacological compounds.

[Furlanello2003Entropy-based] C. Furlanello, M. Serafini, S. Merler, and G. Jurman. Entropy-based gene ranking without selection bias for the predictive classification of microarray data. BMC Bioinformatics, 4(54), 2003. [ bib | DOI | http | .pdf ]
Background: We describe the E-RFE method for gene ranking, which is useful for the identification of markers in the predictive classification of array data. The method supports a practical modeling scheme designed to avoid the construction of classification rules based on the selection of too small gene subsets (an effect known as the selection bias, in which the estimated predictive errors are too optimistic due to testing on samples already considered in the feature selection process). Results: With E-RFE, we speed up the recursive feature elimination (RFE) with SVM classifiers by eliminating chunks of uninteresting genes using an entropy measure of the SVM weights distribution. An optimal subset of genes is selected according to a two-strata model evaluation procedure: modeling is replicated by an external stratified-partition resampling scheme, and, within each run, an internal K-fold cross-validation is used for E-RFE ranking. Also, the optimal number of genes can be estimated according to the saturation of Zipf's law profiles. Conclusions: Without a decrease of classification accuracy, E-RFE allows a speed-up factor of 100 with respect to standard RFE, while improving on alternative parametric RFE reduction strategies. Thus, a process for gene selection and error estimation is made practical, ensuring control of the selection bias, and providing additional diagnostic indicators of gene importance.

Keywords: biosvm
[Freund2003efficient] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. J. Mach. Learn. Res., 4:933-969, 2003. [ bib | .pdf | .pdf ]
[Formosa2003Changing] T. Formosa. Changing the DNA landscape: putting a SPN on chromatin. Curr Top Microbiol Immunol, 274:171-201, 2003. [ bib ]
In eukaryotic cells, transcription and replication each occur on DNA templates that are incorporated into nucleosomes. Formation of chromatin generally limits accessibility of specific DNA sequences and inhibits progression of polymerases as they copy information from the DNA. The processes that select sites for initiating either transcription or replication are therefore strongly influenced by factors that modulate the properties of chromatin proteins. Further, in order to elongate their products, both DNA and RNA polymerases must be able to overcome the inhibition presented by chromatin (Lipford and Bell 2001; Workman and Kingston 1998). One way to adjust the properties of chromatin proteins is to covalently modify them by adding or removing chemical moieties. Both histone and non-histone chromatin proteins are altered by acetylation, methylation, and other changes, and the 'nucleosome modifying' complexes that perform these reactions are important components of pathways of transcriptional regulation (Cote 2002; Orphanides and Reinberg 2000; Roth et al. 2001; Strahl and Allis 2000; Workman and Kingston 1998). Another way to alter the effects of nucleosomes is to change the position of the histone octamers relative to specific DNA sequences (Orphanides and Reinberg 2000; Verrijzer 2002; Wang 2002; Workman and Kingston 1998). Since the ability of a sequence to be bound by specific proteins can vary significantly whether the sequence is in the linkers between nucleosomes or at various positions within a nucleosome, 'nucleosome remodeling' complexes that rearrange nucleosome positioning are also important regulators of transcription. Since the DNA replication machinery has to encounter many of the same challenges posed by chromatin, it seems likely that modifying and remodeling complexes also act during duplication of the genome, but most of the current information on these factors relates to regulation of transcription. 
This chapter describes the factor known variously as FACT in humans, where it promotes elongation of RNA polymerase II on nucleosomal templates in vitro (Orphanides et al. 1998, 1999), DUF in frogs, where it is needed for DNA replication in oocyte extracts (Okuhara et al. 1999), and CP or SPN in yeast, where it is linked in vivo to both transcription and replication (Brewster et al. 2001; Formosa et al. 2001). Like the nucleosome modifying and remodeling complexes, it is broadly conserved among eukaryotes, affects a wide range of processes that utilize chromatin, and directly alters the properties of nucleosomes. However, it does not have nucleosome modifying or standard ATP-dependent remodeling activity, and therefore represents a third class of chromatin modulating factors. It is also presently unique in the extensive connections it displays with both transcription and replication: FACT/DUF/CP/SPN appears to modify nucleosomes in a way that is directly important for the efficient functioning of both RNA polymerases and DNA polymerases. While less is known about the mechanisms it uses to promote its functions than for other factors that affect chromatin, it is clearly an essential part of the complex mixture of activities that modulate access to DNA within chromatin. Physical and genetic interactions suggest that FACT/DUF/CP/SPN affects multiple pathways within replication and transcription as a member of several distinct complexes. Some of the interactions are easy to assimilate into models for replication or transcription, such as direct binding to DNA polymerase alpha (Wittmeyer and Formosa 1997; Wittmeyer et al. 1999), association with nucleosome modifying complexes (John et al. 2000), and interaction with factors that participate in elongation of RNA Polymerase II (Gavin et al. 2002; Squazzo et al. 2002). Others are more surprising such as an association with the 19S complex that regulates the function of the 20S proteasome (Ferdous et al. 2001; Xu et al. 1995), and the indication that FACT/DUF/CP/SPN can act as a specificity factor for casein kinase II (Keller et al. 2001). This chapter reviews the varied approaches that have each revealed different aspects of the function of FACT/DUF/CP/SPN, and presents a picture of a factor that can both alter nucleosomes and orchestrate the assembly or activity of a broad range of complexes that act upon chromatin.

Keywords: Animals; Cell Cycle Proteins, metabolism; Chromatin, metabolism; DNA, metabolism; Eukaryotic Cells, metabolism; Gene Expression Regulation; Humans; Saccharomyces cerevisiae Proteins; Transcription Factors, metabolism; Transcription, Genetic; Transcriptional Elongation Factors
[Fawcett2003ROC] T. Fawcett. ROC graphs: notes and practical considerations for data mining researchers. Technical Report 2003-4, HP Laboratories, Palo Alto, CA, USA, 2003. [ bib ]
[Fare2003Effects] Thomas L Fare, Ernest M Coffey, Hongyue Dai, Yudong D He, Deborah A Kessler, Kristopher A Kilian, John E Koch, Eric LeProust, Matthew J Marton, Michael R Meyer, Roland B Stoughton, George Y Tokiwa, and Yanqun Wang. Effects of atmospheric ozone on microarray data quality. Anal Chem, 75(17):4672-4675, Sep 2003. [ bib ]
A data anomaly was observed that affected the uniformity and reproducibility of fluorescent signal across DNA microarrays. Results from experimental sets designed to identify potential causes (from microarray production to array scanning) indicated that the anomaly was linked to a batch process; further work allowed us to localize the effect to the posthybridization array stringency washes. Ozone levels were monitored and highly correlated with the batch effect. Controlled exposures of microarrays to ozone confirmed this factor as the root cause, and we present data that show susceptibility of a class of cyanine dyes (e.g., Cy5, Alexa 647) to ozone levels as low as 5-10 ppb for periods as short as 10-30 s. Other cyanine dyes (e.g., Cy3, Alexa 555) were not significantly affected until higher ozone levels (> 100 ppb). To address this environmental effect, laboratory ozone levels should be kept below 2 ppb (e.g., with filters in HVAC) to achieve high quality microarray data.

Keywords: Artifacts; Atmosphere, chemistry; Carbocyanines, chemistry; Desiccation; Fluorescence; Oligonucleotide Array Sequence Analysis, instrumentation/standards; Ozone, analysis/chemistry; Quality Control; Reproducibility of Results
[Ekins2003In] S. Ekins. In silico approaches to predicting drug metabolism, toxicology and beyond. Biochem. Soc. Trans., 31(Pt 3):611-614, Jun 2003. [ bib | DOI | http ]
The discovery and optimization of new drug candidates is becoming increasingly reliant upon the combination of experimental and computational approaches related to drug metabolism, toxicology and general biopharmaceutical properties. With the considerable output of high-throughput assays for cytochrome-P450-mediated drug-drug interactions, metabolic stability and assays for toxicology, we have orders of magnitude more data that will facilitate model building. A recursive partitioning model for human liver microsomal metabolic stability based on over 800 structurally diverse molecules was used to predict molecules with known log in vitro clearance data (Spearman's rho -0.64, P <0.0001). In addition, with solely published data, a quantitative structure-activity relationship for 66 inhibitors of the potassium channel human ether-a-gogo (hERG) that has been implicated in the failure of a number of recent drugs has been generated. This model has been validated with further published data for 25 molecules (Spearman's rho 0.83, P <0.0001). If continued value is to be realized from these types of computational models, there needs to be some applied research on their validation and optimization with new data. Some relatively simple approaches may have value when it comes to combining data from multiple models in order to improve and focus drug discovery on the molecules most likely to succeed.

Keywords: herg
[Driel2003new] M. van Driel, K. Cuelenaere, P.P.C.W. Kemmeren, J.A.M. Leunissen, and H.G. Brunner. A new web-based data mining tool for the identification of candidate genes for human genetic disorders. Eur. J. Hum. Genet., 11(1):57-63, Jan 2003. [ bib | DOI | http ]
To identify the gene underlying a human genetic disorder can be difficult and time-consuming. Typically, positional data delimit a chromosomal region that contains between 20 and 200 genes. The choice then lies between sequencing large numbers of genes, or setting priorities by combining positional data with available expression and phenotype data, contained in different internet databases. This process of examining positional candidates for possible functional clues may be performed in many different ways, depending on the investigator's knowledge and experience. Here, we report on a new tool called the GeneSeeker, which gathers and combines positional data and expression/phenotypic data in an automated way from nine different web-based databases. This results in a quick overview of interesting candidate genes in the region of interest. The GeneSeeker system is built in a modular fashion allowing for easy addition or removal of databases if required. Databases are searched directly through the web, which obviates the need for data warehousing. In order to evaluate the GeneSeeker tool, we analysed syndromes with known genesis. For each of 10 syndromes the GeneSeeker programme generated a shortlist that contained a significantly reduced number of candidate genes from the critical region, yet still contained the causative gene. On average, a list of 163 genes based on position alone was reduced to a more manageable list of 22 genes based on position and expression or phenotype information. We are currently expanding the tool by adding other databases. The GeneSeeker is available via the web-interface (http://www.cmbi.kun.nl/GeneSeeker/).

Keywords: Computational Biology; Databases, Genetic; Databases, Nucleic Acid; Gene Expression; Genetic Diseases, Inborn; Humans; Internet; Noonan Syndrome; Software
[Doolan2003Identification] Denise L Doolan, Scott Southwood, Daniel A Freilich, John Sidney, Norma L Graber, Lori Shatney, Lolita Bebris, Laurence Florens, Carlota Dobano, Adam A Witney, Ettore Appella, Stephen L Hoffman, John R Yates, Daniel J Carucci, and Alessandro Sette. Identification of Plasmodium falciparum antigens by antigenic analysis of genomic and proteomic data. Proc. Natl. Acad. Sci. U. S. A., 100(17):9952-9957, Aug 2003. [ bib | DOI | http | .pdf ]
The recent explosion in genomic sequencing has made available a wealth of data that can now be analyzed to identify protein antigens, potential targets for vaccine development. Here we present, in the context of Plasmodium falciparum, a strategy that rapidly identifies target antigens from large and complex genomes. Sixteen antigenic proteins recognized by volunteers immunized with radiation-attenuated P. falciparum sporozoites, but not by mock immunized controls, were identified. Several of these were more antigenic than previously identified and well characterized P. falciparum-derived protein antigens. The data suggest that immune responses to Plasmodium are dispersed on a relatively large number of parasite antigens. These studies have implications for our understanding of immunodominance and breadth of responses to complex pathogens.

Keywords: plasmodium
[Donoho2003Hessian] D. L. Donoho and C. Grimes. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proc. Natl. Acad. Sci. USA, 100(10):5591-5596, 2003. [ bib | DOI | http | www: ]
Keywords: dimred
[Donaldson2003PreBIND] I. Donaldson, J. Martin, B. de Bruijn, C. Wolting, V. Lay, B. Tuekam, S. Zhang, B. Baskin, G.D. Bader, K. Michalickova, T. Pawson, and C.W.V. Hogue. PreBIND and Textomy - mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics, 4(1):11, Mar 2003. [ bib | DOI | http | .pdf ]
Background: The majority of experimentally verified molecular interaction and biological pathway data are present in the unstructured text of biomedical journal articles where they are inaccessible to computational methods. The Biomolecular interaction network database (BIND) seeks to capture these data in a machine-readable format. We hypothesized that the formidable task-size of backfilling the database could be reduced by using Support Vector Machine technology to first locate interaction information in the literature. We present an information extraction system that was designed to locate protein-protein interaction data in the literature and present these data to curators and the public for review and entry into BIND. Results: Cross-validation estimated the support vector machine's test-set precision, accuracy and recall for classifying abstracts describing interaction information was 92% [...] would be able to recall up to 60% [...] present in another yeast-protein interaction database. Finally, this system was applied to a real-world curation problem and its use was found to reduce the task duration by 70% [...] days. Conclusions: Machine learning methods are useful as tools to direct interaction and pathway database back-filling; however, this potential can only be realized if these techniques are coupled with human review and entry into a factual database such as BIND. The PreBIND system described here is available to the public at http://bind.ca. Current capabilities allow searching for human, mouse and yeast protein-interaction information.

Keywords: biosvm
[Dobson2003Distinguishing] P.D. Dobson and A.J. Doig. Distinguishing enzyme structures from non-enzymes without alignments. J. Mol. Biol., 330(4):771-783, 2003. [ bib | DOI | http | .pdf ]
The ability to predict protein function from structure is becoming increasingly important as the number of structures resolved is growing more rapidly than our capacity to study function. Current methods for predicting protein function are mostly reliant on identifying a similar protein of known function. For proteins that are highly dissimilar or are only similar to proteins also lacking functional annotations, these methods fail. Here, we show that protein function can be predicted as enzymatic or not without resorting to alignments. We describe 1178 high-resolution proteins in a structurally non-redundant subset of the Protein Data Bank using simple features such as secondary-structure content, amino acid propensities, surface properties and ligands. The subset is split into two functional groupings, enzymes and non-enzymes. We use the support vector machine-learning algorithm to develop models that are capable of assigning the protein class. Validation of the method shows that the function can be predicted to an accuracy of 77% [...] protein. An adaptive search of possible subsets of features produces a simplified model based on 36 features that predicts at an accuracy of 80% [...] avoid calculating alignments and predict a recently released set of unrelated proteins. The most useful features for distinguishing enzymes from non-enzymes are secondary-structure content, amino acid frequencies, number of disulphide bonds and size of the largest cleft. This method is applicable to any structure as it does not require the identification of sequence or structural similarity to a protein of known function.

Keywords: biosvm
[Djordjevic2003biophysical] Marko Djordjevic, Anirvan M Sengupta, and Boris I Shraiman. A biophysical approach to transcription factor binding site discovery. Genome Res., 13(11):2381-90, Nov 2003. [ bib | DOI | http | .pdf ]
Identification of transcription factor binding sites within regulatory segments of genomic DNA is an important step toward understanding of the regulatory circuits that control expression of genes. Here, we describe a novel bioinformatics method that bases classification of potential binding sites explicitly on the estimate of sequence-specific binding energy of a given transcription factor. The method also estimates the chemical potential of the factor that defines the threshold of binding. In contrast with the widely used information-theoretic weight matrix method, the new approach correctly describes saturation in the transcription factor/DNA binding probability. This results in a significant improvement in the number of expected false positives, particularly in the ubiquitous case of low-specificity factors. In the strong binding limit, the algorithm is related to the "support vector machine" approach to pattern recognition. The new method is used to identify likely genomic binding sites for the E. coli transcription factors collected in the DPInteract database. In addition, for CRP (a global regulatory factor), the likely regulatory modality (i.e., repressor or activator) of predicted binding sites is determined.

[Ding2003statistical] Y. Ding and C. E. Lawrence. A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Res., 31(24):7280-301, Dec 2003. [ bib ]
An RNA molecule, particularly a long-chain mRNA, may exist as a population of structures. Furthermore, multiple structures have been demonstrated to play important functional roles. Thus, a representation of the ensemble of probable structures is of interest. We present a statistical algorithm to sample rigorously and exactly from the Boltzmann ensemble of secondary structures. The forward step of the algorithm computes the equilibrium partition functions of RNA secondary structures with recent thermodynamic parameters. Using conditional probabilities computed with the partition functions in a recursive sampling process, the backward step of the algorithm quickly generates a statistically representative sample of structures. With cubic run time for the forward step, quadratic run time in the worst case for the sampling step, and quadratic storage, the algorithm is efficient for broad applicability. We demonstrate that, by classifying sampled structures, the algorithm enables a statistical delineation and representation of the Boltzmann ensemble. Applications of the algorithm show that alternative biological structures are revealed through sampling. Statistical sampling provides a means to estimate the probability of any structural motif, with or without constraints. For example, the algorithm enables probability profiling of single-stranded regions in RNA secondary structure. Probability profiling for specific loop types is also illustrated. By overlaying probability profiles, a mutual accessibility plot can be displayed for predicting RNA:RNA interactions. Boltzmann probability-weighted density of states and free energy distributions of sampled structures can be readily computed. We show that a sample of moderate size from the ensemble of an enormous number of possible structures is sufficient to guarantee statistical reproducibility in the estimates of typical sampling statistics. Our applications suggest that the sampling algorithm may be well suited to prediction of mRNA structure and target accessibility. The algorithm is applicable to the rational design of small interfering RNAs (siRNAs), antisense oligonucleotides, and trans-cleaving ribozymes in gene knock-down studies.

[DiMasi2003price] J. A. DiMasi, R. W. Hansen, and H. G. Grabowski. The price of innovation: new estimates of drug development costs. J Health Econ, 22(2):151-185, Mar 2003. [ bib ]
The research and development costs of 68 randomly selected new drugs were obtained from a survey of 10 pharmaceutical firms. These data were used to estimate the average pre-tax cost of new drug development. The costs of compounds abandoned during testing were linked to the costs of compounds that obtained marketing approval. The estimated average out-of-pocket cost per new drug is 403 million US dollars (2000 dollars). Capitalizing out-of-pocket costs to the point of marketing approval at a real discount rate of 11% yields a total pre-approval cost estimate of 802 million US dollars (2000 dollars). When compared to the results of an earlier study with a similar methodology, total capitalized costs were shown to have increased at an annual rate of 7.4% above general price inflation.

Keywords: Capital Expenditures, Costs and Cost Analysis, Data Collection, Drug Approval, Drug Evaluation, Drug Industry, Drugs, Economic, Humans, Inflation, Investigational, Organizational Innovation, Preclinical, Research Support, United States, 16087260
[Dieterle2003Urinary] Frank Dieterle, Silvia Müller-Hagedorn, Hartmut M Liebich, and Günter Gauglitz. Urinary nucleosides as potential tumor markers evaluated by learning vector quantization. Artif. Intell. Med., 28(3):265-79, Jul 2003. [ bib | DOI | http | .pdf ]
Modified nucleosides were recently presented as potential tumor markers for breast cancer. The patterns of the levels of urinary nucleosides are different for tumor bearing individuals and for healthy individuals. Thus, a powerful pattern recognition method is needed. Although backpropagation (BP) neural networks are becoming increasingly common in medical literature for pattern recognition, it has been shown that often-superior methods exist like learning vector quantization (LVQ) and support vector machines (SVM). The aim of this feasibility study is to get an indication of the performance of urinary nucleoside levels evaluated by LVQ in contrast to the evaluation by the popular BP and SVM networks. Urine samples were collected from female breast cancer patients and from healthy females. Twelve different ribonucleosides were isolated and quantified by a high performance liquid chromatography (HPLC) procedure. LVQ, SVM and BP networks were trained and the performance was evaluated by the classification of the test sets into the categories "cancer" and "healthy". All methods showed a good classification with a sensitivity ranging from 58.8 to 70.6% at a specificity of 88.4-94.2% for the test patterns. Although the classification performance of all methods is comparable, the LVQ implementations are superior in terms of more qualitative features: the results of LVQ networks are more reproducible, as the initialization is deterministic. The LVQ networks can be trained by unbalanced sizes of the different classes. LVQ networks are fast during training, need only a few parameters adjusted for training and can be retrained by patterns of "local individuals". As at least some of these features play an important role in an implementation into a medical decision support system, it is recommended to use LVQ for an extended study.

Keywords: 80 and over, Adnexal Diseases, Adult, Aged, Algorithms, Artificial Intelligence, Automated, Bayes Theorem, Biological, Breast Neoplasms, Case-Control Studies, Chromatography, Comparative Study, Computational Biology, Computer-Assisted, Diagnosis, Differential, Feasibility Studies, Female, High Pressure Liquid, Humans, Logistic Models, Middle Aged, Neural Networks (Computer), Non-U.S. Gov't, Nucleosides, Ovarian Neoplasms, Pattern Recognition, Predictive Value of Tests, ROC Curve, Reproducibility of Results, Research Support, Sensitivity and Specificity, Tumor Markers, 12927336
[Diekman2003Hybrid] Casey Diekman, Wei He, Nagabhushana Prabhu, and Harvey Cramer. Hybrid methods for automated diagnosis of breast tumors. Anal Quant Cytol Histol, 25(4):183-90, Aug 2003. [ bib ]
OBJECTIVE: To design and analyze a new family of hybrid methods for the diagnosis of breast tumors using fine needle aspirates. STUDY DESIGN: We present a radically new approach to the design of diagnosis systems. In the new approach, a nonlinear classifier with high sensitivity but low specificity is hybridized with a linear classifier having low sensitivity but high specificity. Data from the Wisconsin Breast Cancer Database are used to evaluate, computationally, the performance of the hybrid classifiers. RESULTS: The diagnosis scheme obtained by hybridizing the nonlinear classifier ellipsoidal multisurface method (EMSM) with the linear classifier proximal support vector machine (PSVM) was found to have a mean sensitivity of 97.36% and a mean specificity of 95.14% and was found to yield a 2.44% improvement in the reliability of positive diagnosis over that of EMSM at the expense of 0.4% degradation in the reliability of negative diagnosis, again compared to EMSM. At the 95% confidence level we can trust the hybrid method to be 96.19-98.53% correct in its malignant diagnosis of new tumors and 93.57-96.71% correct in its benign diagnosis. CONCLUSION: Hybrid diagnosis schemes represent a significant paradigm shift and provide a promising new technique to improve the specificity of nonlinear classifiers without seriously affecting the high sensitivity of nonlinear classifiers.

Keywords: Algorithms, Amino Acid Sequence, Amino Acids, Anion Exchange Resins, Antigen-Antibody Complex, Artificial Intelligence, Automated, Automatic Data Processing, Benchmarking, Biological, Biological Markers, Biopsy, Blood Cells, Blood Proteins, Breast Neoplasms, Cell Line, Cellular Structures, Chemical, Chromatography, Chromosome Aberrations, Cluster Analysis, Colonic Neoplasms, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, Computing Methodologies, DNA, Data Interpretation, Databases, Decision Making, Decision Trees, Diagnosis, Diffusion Magnetic Resonance Imaging, Disease, English Abstract, Epitopes, Expert Systems, Factual, Female, Fine-Needle, Fusion, Fuzzy Logic, Gene Expression Profiling, Gene Expression Regulation, Gene Targeting, Genetic, Genome, Histocompatibility Antigens Class I, Humans, Hydrogen Bonding, Hydrophobicity, Image Interpretation, Image Processing, In Vitro, Indicators and Reagents, Information Storage and Retrieval, Ion Exchange, Least-Squares Analysis, Leiomyosarcoma, Liver Cirrhosis, Lung Neoplasms, Magnetic Resonance Imaging, Male, Mass, Mathematical Computing, Matrix-Assisted Laser Desorption-Ionization, Models, Molecular, Molecular Sequence Data, Neoplasm Proteins, Neoplasms, Neoplastic, Nephroblastoma, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Nucleic Acid Conformation, Nucleic Acid Hybridization, Oligonucleotide Array Sequence Analysis, Oncogene Proteins, Ovarian Neoplasms, P.H.S., Pattern Recognition, Predictive Value of Tests, Prostatic Neoplasms, Protein, Protein Binding, Protein Interaction Mapping, Protein Structure, Proteins, Proteome, Quantitative Structure-Activity Relationship, RNA, ROC Curve, Reproducibility of Results, Research Support, Rhabdomyosarcoma, Secondary, Sensitivity and Specificity, Sequence Alignment, Sequence Analysis, Severity of Illness Index, Software, Solubility, Spectrometry, Statistical, Structure-Activity Relationship, Subcellular Fractions, Subtraction Technique, T-Lymphocyte, Tissue Distribution, Transcription Factors, Transfer, Treatment Outcome, Tumor, Tumor Markers, U.S. Gov't, User-Computer Interface, 12961824
[Denis2003Text] F. Denis, R. Gilleron, A. Laurent, and M. Tommasi. Text classification and co-training from positive and unlabeled examples. In Proceedings of the ICML 2003 Workshop: The Continuum from Labeled to Unlabeled Data, 2003. [ bib | http ]
[Deb2003Reliable] Kalyanmoy Deb and A. Raji Reddy. Reliable classification of two-class cancer data using evolutionary algorithms. Biosystems, 72(1-2):111-29, Nov 2003. [ bib | DOI | http ]
In the area of bioinformatics, the identification of gene subsets responsible for classifying available disease samples to two or more of its variants is an important task. Such problems have been solved in the past by means of unsupervised learning methods (hierarchical clustering, self-organizing maps, k-means clustering, etc.) and supervised learning methods (weighted voting approach, k-nearest neighbor method, support vector machine method, etc.). Such problems can also be posed as optimization problems of minimizing gene subset size to achieve reliable and accurate classification. The main difficulties in solving the resulting optimization problem are the availability of only a few samples compared to the number of genes in the samples and the exorbitantly large search space of solutions. Although there exist a few applications of evolutionary algorithms (EAs) for this task, here we treat the problem as a multiobjective optimization problem of minimizing the gene subset size and minimizing the number of misclassified samples. Moreover, for a more reliable classification, we consider multiple training sets in evaluating a classifier. Contrary to the past studies, the use of a multiobjective EA (NSGA-II) has enabled us to discover a smaller gene subset size (such as four or five) to correctly classify 100% or near 100% samples for three cancer samples (Leukemia, Lymphoma, and Colon). We have also extended the NSGA-II to obtain multiple non-dominated solutions discovering as many as 352 different three-gene combinations providing a 100% correct classification to the Leukemia data. In order to have further confidence in the identification task, we have also introduced a prediction strength threshold for determining a sample's belonging to one class or the other. All simulation results show consistent gene subset identifications on three disease samples and exhibit the flexibilities and efficacies in using a multiobjective EA for the gene subset identification task.

[Cox2003Functional] David D Cox and Robert L Savoy. Functional magnetic resonance imaging (fMRI) "brain reading": detecting and classifying distributed patterns of fMRI activity in human visual cortex. Neuroimage, 19(2 Pt 1):261-70, Jun 2003. [ bib ]
Traditional (univariate) analysis of functional MRI (fMRI) data relies exclusively on the information contained in the time course of individual voxels. Multivariate analyses can take advantage of the information contained in activity patterns across space, from multiple voxels. Such analyses have the potential to greatly expand the amount of information extracted from fMRI data sets. In the present study, multivariate statistical pattern recognition methods, including linear discriminant analysis and support vector machines, were used to classify patterns of fMRI activation evoked by the visual presentation of various categories of objects. Classifiers were trained using data from voxels in predefined regions of interest during a subset of trials for each subject individually. Classification of subsequently collected fMRI data was attempted according to the similarity of activation patterns to prior training examples. Classification was done using only small amounts of data (20 s worth) at a time, so such a technique could, in principle, be used to extract information about a subject's percept on a near real-time basis. Classifiers trained on data acquired during one session were equally accurate in classifying data collected within the same session and across sessions separated by more than a week, in the same subject. Although the highest classification accuracies were obtained using patterns of activity including lower visual areas as input, classification accuracies well above chance were achieved using regions of interest restricted to higher-order object-selective visual areas. In contrast to typical fMRI data analysis, in which hours of data across many subjects are averaged to reveal slight differences in activation, the use of pattern recognition methods allows a subtle 10-way discrimination to be performed on an essentially trial-by-trial basis within individuals, demonstrating that fMRI data contain far more information than is typically appreciated.

[Cortes2003Rational] C. Cortes, P. Haffner, and M. Mohri. Rational Kernels. In Suzanna Becker, Sebastian Thrun, and Klaus Obermayer, editors, Advances in Neural Information Processing Systems 15. MIT Press, 2003. [ bib ]
[Collins2003Vision] Francis S Collins, Eric D Green, Alan E Guttmacher, Mark S Guyer, and U. S. National Human Genome Research Institute. A vision for the future of genomics research. Nature, 422(6934):835-847, Apr 2003. [ bib ]
[Chung2003Radius] Kai-Min Chung, Wei-Chun Kao, Chia-Liang Sun, Li-Lun Wang, and Chih-Jen Lin. Radius margin bounds for support vector machines with the RBF kernel. Neural Comput, 15(11):2643-81, Nov 2003. [ bib | DOI | http ]
An important approach for efficient support vector machine (SVM) model selection is to use differentiable bounds of the leave-one-out (loo) error. Past efforts focused on finding tight bounds of loo (e.g., radius margin bounds, span bounds). However, their practical viability is still not very satisfactory. Duan, Keerthi, and Poo (2003) showed that radius margin bound gives good prediction for L2-SVM, one of the cases we look at. In this letter, through analyses about why this bound performs well for L2-SVM, we show that finding a bound whose minima are in a region with small loo values may be more important than its tightness. Based on this principle, we propose modified radius margin bounds for L1-SVM (the other case) where the original bound is applicable only to the hard-margin case. Our modification for L1-SVM achieves comparable performance to L2-SVM. To study whether L1- or L2-SVM should be used, we analyze other properties, such as their differentiability, number of support vectors, and number of free support vectors. In this aspect, L1-SVM possesses the advantage of having fewer support vectors. Their implementations are also different, so we discuss related issues in detail.

[Chipman2003Statistical] H. Chipman, T. Hastie, and R. Tibshirani. Statistical Analysis of Gene Expression Microarray Data, chapter Clustering Microarray Data, pages 159-200. Chapman and Hall, CRC press., 2003. [ bib ]
[Chipman2003Clustering] H. Chipman, T. Hastie, and R. Tibshirani. Clustering microarray data. In T. Speed, editor, Statistical Analysis of Gene Expression Microarray Data, pages 159-200. Chapman and Hall, CRC press., 2003. [ bib ]
[Chenna2003Multiple] R. Chenna, H. Sugawara, T. Koike, R. Lopez, T. J. Gibson, D. G. Higgins, and J. D. Thompson. Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res., 31(13):3497-3500, Jul 2003. [ bib ]
The Clustal series of programs are widely used in molecular biology for the multiple alignment of both nucleic acid and protein sequences and for preparing phylogenetic trees. The popularity of the programs depends on a number of factors, including not only the accuracy of the results, but also the robustness, portability and user-friendliness of the programs. New features include NEXUS and FASTA format output, printing range numbers and faster tree calculation. Although Clustal was originally developed to run on a local computer, numerous Web servers have been set up, notably at the EBI (European Bioinformatics Institute) (http://www.ebi.ac.uk/clustalw/).

Keywords: Algorithms; Amino Acid Sequence; Internet; Nucleic Acids; Phylogeny; Sequence Alignment; Sequence Analysis; Sequence Analysis, Protein; Software
[Chang2003Support] Ruey-Feng Chang, Wen-Jie Wu, Woo Kyung Moon, Yi-Hong Chou, and Dar-Ren Chen. Support vector machines for diagnosis of breast tumors on US images. Acad Radiol, 10(2):189-97, Feb 2003. [ bib | DOI | http ]
RATIONALE AND OBJECTIVES: Breast cancer has become the leading cause of cancer deaths among women in developed countries. To decrease the related mortality, disease must be treated as early as possible, but it is hard to detect and diagnose tumors at an early stage. A well-designed computer-aided diagnostic system can help physicians avoid misdiagnosis and avoid unnecessary biopsy without missing cancers. In this study, the authors tested one such system to determine its effectiveness. MATERIALS AND METHODS: Many computer-aided diagnostic systems for ultrasonography are based on the neural network model and classify breast tumors according to texture features. The authors tested a refinement of this model, an advanced support vector machine (SVM), in 250 cases of pathologically proved breast tumors (140 benign and 110 malignant), and compared its performance with that of a multilayer propagation neural network. RESULTS: The accuracy of the SVM for classifying malignancies was 85.6% (214 of 250); the sensitivity, 95.45% (105 of 110); the specificity, 77.86% (109 of 140); the positive predictive value, 77.21% (105 of 136); and the negative predictive value, 95.61% (109 of 114). CONCLUSION: The SVM proved helpful in the imaging diagnosis of breast cancer. The classification ability of the SVM is nearly equal to that of the neural network model, and the SVM has a much shorter training time (1 vs 189 seconds). Given the increasing size and complexity of data sets, the SVM is therefore preferable for computer-aided diagnosis.
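The five rates quoted in the RESULTS sentence above all follow from a single 2x2 confusion matrix. A minimal sketch of that arithmetic, assuming only the counts given in the abstract (105 of 110 malignant and 109 of 140 benign tumors classified correctly); the variable names and the `pct` helper are illustrative and not taken from the paper:

```python
# Counts as reported in the abstract: 250 tumors, 110 malignant, 140 benign.
tp = 105            # malignant tumors classified as malignant
fn = 110 - tp       # malignant tumors missed (5)
tn = 109            # benign tumors classified as benign
fp = 140 - tn       # benign tumors flagged as malignant (31)


def pct(numerator, denominator):
    """Return a ratio as a percentage rounded to two decimals."""
    return round(100.0 * numerator / denominator, 2)


metrics = {
    "accuracy": pct(tp + tn, tp + tn + fp + fn),   # 214 of 250
    "sensitivity": pct(tp, tp + fn),               # 105 of 110
    "specificity": pct(tn, tn + fp),               # 109 of 140
    "ppv": pct(tp, tp + fp),                       # 105 of 136
    "npv": pct(tn, tn + fn),                       # 109 of 114
}

for name, value in metrics.items():
    print(f"{name}: {value}%")
```

The denominators recovered this way (136 predicted malignant, 114 predicted benign) match the ones printed in the abstract, so the reported figures are internally consistent.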

[Chanda2003Fulfilling] S. K. Chanda and J. S. Caldwell. Fulfilling the promise: drug discovery in the post-genomic era. Drug Discov Today, 8(4):168-174, Feb 2003. [ bib ]
The genomic era has brought with it a basic change in experimentation, enabling researchers to look more comprehensively at biological systems. The sequencing of the human genome coupled with advances in automation and parallelization technologies have afforded a fundamental transformation in the drug target discovery paradigm, towards systematic whole genome and proteome analyses. In conjunction with novel proteomic techniques, genome-wide annotation of function in cellular models is possible. Overlaying data derived from whole genome sequence, expression and functional analysis will facilitate the identification of causal genes in disease and significantly streamline the target validation process. Moreover, several parallel technological advances in small molecule screening have resulted in the development of expeditious and powerful platforms for elucidating inhibitors of protein or pathway function. Conversely, high-throughput and automated systems are currently being used to identify targets of orphan small molecules. The consolidation of these emerging functional genomics and drug discovery technologies promises to reap the fruits of the genomic revolution.

[Chan2003Detection] Ian Chan, William Wells, Robert V Mulkern, Steven Haker, Jianqing Zhang, Kelly H Zou, Stephan E Maier, and Clare M C Tempany. Detection of prostate cancer by integration of line-scan diffusion, T2-mapping and T2-weighted magnetic resonance imaging; a multichannel statistical classifier. Med Phys, 30(9):2390-8, Sep 2003. [ bib | .pdf ]
A multichannel statistical classifier for detecting prostate cancer was developed and validated by combining information from three different magnetic resonance (MR) methodologies: T2-weighted, T2-mapping, and line scan diffusion imaging (LSDI). From these MR sequences, four different sets of image intensities were obtained: T2-weighted (T2W) from T2-weighted imaging, Apparent Diffusion Coefficient (ADC) from LSDI, and proton density (PD) and T2 (T2 Map) from T2-mapping imaging. Manually segmented tumor labels from a radiologist, which were validated by biopsy results, served as tumor "ground truth." Textural features were extracted from the images using co-occurrence matrix (CM) and discrete cosine transform (DCT). Anatomical location of voxels was described by a cylindrical coordinate system. A statistical jack-knife approach was used to evaluate our classifiers. Single-channel maximum likelihood (ML) classifiers were based on 1 of the 4 basic image intensities. Our multichannel classifiers: support vector machine (SVM) and Fisher linear discriminant (FLD), utilized five different sets of derived features. Each classifier generated a summary statistical map that indicated tumor likelihood in the peripheral zone (PZ) of the prostate gland. To assess classifier accuracy, the average areas under the receiver operator characteristic (ROC) curves over all subjects were compared. Our best FLD classifier achieved an average ROC area of 0.839(+/-0.064), and our best SVM classifier achieved an average ROC area of 0.761(+/-0.043). The T2W ML classifier, our best single-channel classifier, only achieved an average ROC area of 0.599(+/-0.146). Compared to the best single-channel ML classifier, our best multichannel FLD and SVM classifiers have statistically superior ROC performance (P=0.0003 and 0.0017, respectively) from pairwise two-sided t-test. By integrating the information from multiple images and capturing the textural and anatomical features in tumor areas, summary statistical maps can potentially aid in image-guided prostate biopsy and assist in guiding and controlling delivery of localized therapy under image guidance.

Keywords: Algorithms, Anion Exchange Resins, Antigen-Antibody Complex, Artificial Intelligence, Automated, Automatic Data Processing, Biological, Blood Cells, Chemical, Chromatography, Cluster Analysis, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, Data Interpretation, Databases, Decision Making, Decision Trees, Diffusion Magnetic Resonance Imaging, English Abstract, Epitopes, Expert Systems, Factual, Fuzzy Logic, Gene Expression Profiling, Gene Expression Regulation, Gene Targeting, Genome, Histocompatibility Antigens Class I, Humans, Image Interpretation, Image Processing, In Vitro, Indicators and Reagents, Information Storage and Retrieval, Ion Exchange, Least-Squares Analysis, Liver Cirrhosis, Magnetic Resonance Imaging, Male, Models, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Nucleic Acid Conformation, P.H.S., Pattern Recognition, Prostatic Neoplasms, Protein, Protein Binding, Protein Interaction Mapping, Proteins, Proteome, Quantitative Structure-Activity Relationship, RNA, ROC Curve, Reproducibility of Results, Research Support, Sensitivity and Specificity, Sequence Analysis, Severity of Illness Index, Statistical, Structure-Activity Relationship, Subtraction Technique, T-Lymphocyte, Transcription Factors, Transfer, Treatment Outcome, U.S. Gov't, User-Computer Interface, 14528961
[Carcassoni2002Spectral] M. Carcassoni and E. Hancock. Spectral correspondence for point pattern matching. Pattern Recogn., 36(1):193-204, January 2003. [ bib | DOI | http ]
This paper investigates the correspondence matching of point-sets using spectral graph analysis. In particular, we are interested in the problem of how the modal analysis of point-sets can be rendered robust to contamination and drop-out. We make three contributions. First, we show how the modal structure of point-sets can be embedded within the framework of the EM algorithm. Second, we present several methods for computing the probabilities of point correspondences from the modes of the point proximity matrix. Third, we consider alternatives to the Gaussian proximity matrix. We evaluate the new method on both synthetic and real-world data. Here we show that the method can be used to compute useful correspondences even when the level of point contamination is as large as 50%. We also provide some examples on deformed point-set tracking.

Keywords: correspondences, matching
[Cai2003Support] Y.-D. Cai, G.-P. Zhou, and K.-C. Chou. Support Vector Machines for Predicting Membrane Protein Types by Using Functional Domain Composition. Biophys. J., 84(5):3257-3263, 2003. [ bib | http | .pdf ]
Membrane proteins are generally classified into the following five types: (1) type I membrane protein; (2) type II membrane protein; (3) multipass transmembrane proteins; (4) lipid chain-anchored membrane proteins; and (5) GPI-anchored membrane proteins. In this article, based on the concept of using the functional domain composition to define a protein, the Support Vector Machine algorithm is developed for predicting the membrane protein type. High success rates are obtained by both the self-consistency and jackknife tests. The current approach, complemented with the powerful covariant discriminant algorithm based on the pseudo-amino acid composition that has incorporated quasi-sequence-order effect as recently proposed by K. C. Chou (2001), may become a very useful high-throughput tool in the area of bioinformatics and proteomics.

Keywords: biosvm
[Cai2003Supportb] Y.D. Cai, X.J. Liu, X.B. Xu, and K.C. Chou. Support vector machines for prediction of protein domain structural class. J. Theor. Biol., 221(1):115-120, 2003. [ bib | DOI | http | .pdf ]
The support vector machines (SVMs) method was introduced for predicting the structural class of protein domains. The results obtained through the self-consistency test, jack-knife test, and independent dataset test have indicated that the current method and the elegant component-coupled algorithm developed by Chou and co-workers, if effectively complemented with each other, may become a powerful tool for predicting the structural class of protein domains.

Keywords: biosvm
[Cai2003Prediction] Y.D. Cai, X.J. Liu, Y.X. Li, X.B. Xu, and K.C. Chou. Prediction of beta-turns with learning machines. Peptides, 24(5):665-669, 2003. [ bib | DOI | http | .pdf ]
The support vector machine approach was introduced to predict the beta-turns in proteins. The overall self-consistency rate by the re-substitution test for the training or learning dataset reached 100% [...] were taken from Chou [J. Pept. Res. 49 (1997) 120]. The success prediction rates by the jackknife test for the beta-turn subset of 455 tetrapeptides and non-beta-turn subset of 3807 tetrapeptides in the training dataset were 58.1 and 98.4% [...] success rates with the independent dataset test for the beta-turn subset of 110 tetrapeptides and non-beta-turn subset of 30,231 tetrapeptides were 69.1 and 97.3% [...] study support the conclusion that the residue-coupled effect along a tetrapeptide is important for the formation of a beta-turn.

Keywords: biosvm
[Cai2003Supporta] Y.D. Cai, S.L. Lin, and K.C. Chou. Support vector machines for prediction of protein signal sequences and their cleavage sites. Peptides, 24(1):159-161, 2003. [ bib | DOI | .pdf ]
Given a nascent protein sequence, how can one predict its signal peptide or "Zipcode" sequence? This is an important problem for scientists to use signal peptides as a vehicle to find new drugs or to reprogram cells for gene therapy (see, e.g. [7] K.C. Chou, Current Protein and Peptide Science 2002;3:615-22). In this paper, support vector machines (SVMs), a new machine learning method, are applied to approach this problem. The overall rate of correct prediction for 1939 secretary proteins and 1440 nonsecretary proteins was over 91% [...] may also serve as a useful tool for further investigating many unclear details regarding the molecular mechanism of the ZIP code protein-sorting system in cells.

Keywords: biosvm
[Cai2003Supportd] Y.D. Cai and S.L. Lin. Support vector machines for predicting rRNA-, RNA-, and DNA-binding proteins from amino acid sequence. Biochim. Biophys. Acta, 1648(1-2):127-133, 2003. [ bib | DOI | http | .pdf ]
Classification of gene function remains one of the most important and demanding tasks in the post-genome era. Most of the current predictive computer methods rely on comparing features that are essentially linear to the protein sequence. However, features of a protein nonlinear to the sequence may also be predictive to its function. Machine learning methods, for instance the Support Vector Machines (SVMs), are particularly suitable for exploiting such features. In this work we introduce SVM and the pseudo-amino acid composition, a collection of nonlinear features extractable from protein sequence, to the field of protein function prediction. We have developed prototype SVMs for binary classification of rRNA-, RNA-, and DNA-binding proteins. Using a protein's amino acid composition and limited range correlation of hydrophobicity and solvent accessible surface area as input, each of the SVMs predicts whether the protein belongs to one of the three classes. In self-consistency and cross-validation tests, which measure the success of learning and prediction, respectively, the rRNA-binding SVM has consistently achieved >95% [...] The RNA- and DNA-binding SVMs demonstrate more diverse accuracy, ranging from approximately 76% [...] the test results suggests the directions of improving the SVMs.

Keywords: biosvm
[Cai2003Supportc] Y.D. Cai, K.Y. Feng, Y.X. Li, and K.C. Chou. Support vector machine for predicting alpha-turn types. Peptides, 24(4):629-630, 2003. [ bib | DOI | http | .pdf ]
Tight turns play an important role in globular proteins from both the structural and functional points of view. Of tight turns, beta-turns and gamma-turns have been extensively studied, but alpha-turns were little investigated. Recently, a systematic search for alpha-turns classified alpha-turns into nine different types according to their backbone trajectory features. In this paper, Support Vector Machines (SVMs), a new machine learning method, are proposed for predicting the alpha-turn types in proteins. The high rates of correct prediction imply that the formation of different alpha-turn types is evidently correlated with the sequence of a pentapeptide, and hence can be approximately predicted based on the sequence information of the pentapeptide alone, although the incorporation of its interaction with the other part of a protein, the so-called "long distance interaction", will further improve the prediction quality.

Keywords: biosvm
[Cai2003SVM-Prot] C. Z. Cai, L. Y. Han, Z. L. Ji, X. Chen, and Y. Z. Chen. SVM-Prot: Web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res, 31(13):3692-7, Jul 2003. [ bib | http | .pdf ]
Prediction of protein function is of significance in studying biological processes. One approach for function prediction is to classify a protein into functional family. Support vector machine (SVM) is a useful method for such classification, which may involve proteins with diverse sequence distribution. We have developed a web-based software, SVMProt, for SVM classification of a protein into functional family from its primary sequence. SVMProt classification system is trained from representative proteins of a number of functional families and seed proteins of Pfam curated protein families. It currently covers 54 functional families and additional families will be added in the near future. The computed accuracy for protein family classification is found to be in the range of 69.1-99.6%. SVMProt shows a certain degree of capability for the classification of distantly related proteins and homologous proteins of different function and thus may be used as a protein function prediction tool that complements sequence alignment methods. SVMProt can be accessed at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi.

Keywords: biosvm
[Cai2003Protein] C.Z. Cai, W.L. Wang, L.Z. Sun, and Y.Z. Chen. Protein function classification via support vector machine approach. Math. Biosci., 185(2):111-122, 2003. [ bib | DOI | .pdf ]
Support vector machine (SVM) is introduced as a method for the classification of proteins into functionally distinguished classes. Studies are conducted on a number of protein classes including RNA-binding proteins, protein homodimers, proteins responsible for drug absorption, proteins involved in drug distribution and excretion, and drug metabolizing enzymes. Testing accuracy for the classification of these protein classes is found to be in the range of 84-96% [...] usefulness of SVM in the classification of protein functional classes and its potential application in protein function prediction.

Keywords: biosvm
[Byvatov2003Support] E. Byvatov and G. Schneider. Support vector machine applications in bioinformatics. Appl Bioinformatics, 2(2):67-77, 2003. [ bib ]
The support vector machine (SVM) approach represents a data-driven method for solving classification tasks. It has been shown to produce lower prediction error compared to classifiers based on other methods like artificial neural networks, especially when large numbers of features are considered for sample description. In this review, the theory and main principles of the SVM approach are outlined, and successful applications in traditional areas of bioinformatics research are described. Current developments in techniques related to the SVM approach are reviewed which might become relevant for future functional genomics and chemogenomics projects. In a comparative study, we developed neural network and SVM models to identify small organic molecules that potentially modulate the function of G-protein coupled receptors. The SVM system was able to correctly classify approximately 90% of the compounds in a cross-validation study yielding a Matthews correlation coefficient of 0.78. This classifier can be used for fast filtering of compound libraries in virtual screening applications.

Keywords: biosvm
[Byvatov2003Comparison] E. Byvatov, U. Fechner, J. Sadowski, and G. Schneider. Comparison of support vector machine and artificial neural network systems for drug/nondrug classification. J Chem Inf Comput Sci, 43(6):1882-9, 2003. [ bib | DOI | http | .pdf ]
Support vector machine (SVM) and artificial neural network (ANN) systems were applied to a drug/nondrug classification problem as an example of binary decision problems in early-phase virtual compound filtering and screening. The results indicate that solutions obtained by SVM training seem to be more robust with a smaller standard error compared to ANN training. Generally, the SVM classifier yielded slightly higher prediction accuracy than ANN, irrespective of the type of descriptors used for molecule encoding, the size of the training data sets, and the algorithm employed for neural network training. The performance was compared using various different descriptor sets and descriptor combinations based on the 120 standard Ghose-Crippen fragment descriptors, a wide range of 180 different properties and physicochemical descriptors from the Molecular Operating Environment (MOE) package, and 225 topological pharmacophore (CATS) descriptors. For the complete set of 525 descriptors cross-validated classification by SVM yielded 82% correct predictions (Matthews cc = 0.63), whereas ANN reached 80% correct predictions (Matthews cc = 0.58). Although SVM outperformed the ANN classifiers with regard to overall prediction accuracy, both methods were shown to complement each other, as the sets of true positives, false positives (overprediction), true negatives, and false negatives (underprediction) produced by the two classifiers were not identical. The theory of SVM and ANN training is briefly reviewed.

Keywords: biosvm chemoinformatics
[Buus2003Sensitive] S. Buus, S. L. Lauemøller, P. Worning, C. Kesmir, T. Frimurer, S. Corbet, A. Fomsgaard, J. Hilden, A. Holm, and S. Brunak. Sensitive quantitative predictions of peptide-MHC binding by a 'query by committee' artificial neural network approach. Tissue Antigens, 62(5):378-384, Nov 2003. [ bib ]
We have generated Artificial Neural Networks (ANN) capable of performing sensitive, quantitative predictions of peptide binding to the MHC class I molecule, HLA-A*0204. We have shown that such quantitative ANN are superior to conventional classification ANN, that have been trained to predict binding vs non-binding peptides. Furthermore, quantitative ANN allowed a straightforward application of a 'Query by Committee' (QBC) principle whereby particularly information-rich peptides could be identified and subsequently tested experimentally. Iterative training based on QBC-selected peptides considerably increased the sensitivity without compromising the efficiency of the prediction. This suggests a general, rational and unbiased approach to the development of high quality predictions of epitopes restricted to this and other HLA molecules. Due to their quantitative nature, such predictions will cover a wide range of MHC-binding affinities of immunological interest, and they can be readily integrated with predictions of other events involved in generating immunogenic epitopes. These predictions have the capacity to perform rapid proteome-wide searches for epitopes. Finally, it is an example of an iterative feedback loop whereby advanced, computational bioinformatics optimize experimental strategy, and vice versa.

Keywords: HLA-A Antigens; Humans; Neural Networks (Computer); Peptides; Protein Binding; Proteome; Research Support, Non-U.S. Gov't; Research Support, U.S. Gov't, P.H.S.
[Bultinck2003Quantum] P. Bultinck, T. Kuppens, X. Gironès, and R. Carbó-Dorca. Quantum similarity superposition algorithm (QSSA): a consistent scheme for molecular alignment and molecular similarity based on quantum chemistry. J Chem Inf Comput Sci, 43(4):1143-1150, 2003. [ bib | DOI | http ]
The use of the molecular quantum similarity overlap measure for molecular alignment is investigated. A new algorithm is presented, the quantum similarity superposition algorithm (QSSA), expressing the relative positions of two molecules in terms of mutual translation in three Cartesian directions and three Euler angles. The quantum similarity overlap is then used to optimize the mutual positions of the molecules. A comparison is made with TGSA, a topogeometrical approach, and the influence of differences on molecular clustering is discussed.

Keywords: Aldosterone, Algorithms, Chemical, Comparative Study, Estrone, Isomerism, Models, Molecular, Molecular Structure, Non-U.S. Gov't, Quantitative Structure-Activity Relationship, Quantum Theory, Research Support, 12870905
[Bozdech2003Expression] Z. Bozdech, J. Zhu, M. Joachimiak, F. Cohen, B. Pulliam, and J. DeRisi. Expression profiling of the schizont and trophozoite stages of Plasmodium falciparum with a long-oligonucleotide microarray. Genome Biology, 4(2):R9, 2003. [ bib | DOI | http | .pdf ]
BACKGROUND: The worldwide persistence of drug-resistant Plasmodium falciparum, the most lethal variety of human malaria, is a global health concern. The P. falciparum sequencing project has brought new opportunities for identifying molecular targets for antimalarial drug and vaccine development. RESULTS: We developed a software package, ArrayOligoSelector, to design an open reading frame (ORF)-specific DNA microarray using the publicly available P. falciparum genome sequence. Each gene was represented by one or more long 70-mer oligonucleotides selected on the basis of uniqueness within the genome, exclusion of low-complexity sequence, balanced base composition and proximity to the 3' end. A first-generation microarray representing approximately 6,000 ORFs of the P. falciparum genome was constructed. Array performance was evaluated through the use of control oligonucleotide sets with increasing levels of introduced mutations, as well as traditional northern blotting. Using this array, we extensively characterized the gene-expression profile of the intraerythrocytic trophozoite and schizont stages of P. falciparum. The results revealed extensive transcriptional regulation of genes specialized for processes specific to these two stages. CONCLUSIONS: DNA microarrays based on long oligonucleotides are powerful tools for the functional annotation and exploration of the P. falciparum genome. Expression profiling of trophozoites and schizonts revealed genes associated with stage-specific processes and may serve as the basis for future drug targets and vaccine development.

Keywords: microarray plasmodium
[Bozdech2003Transcriptome] Z. Bozdech, M. Llinas, B. L. Pulliam, E. D. Wong, J. Zhu, and J. L. DeRisi. The Transcriptome of the Intraerythrocytic Developmental Cycle of Plasmodium falciparum. PLoS Biology, 1(1):e5, 2003. [ bib | DOI | http | .pdf ]
Plasmodium falciparum is the causative agent of the most burdensome form of human malaria, affecting 200-300 million individuals per year worldwide. The recently sequenced genome of P. falciparum revealed over 5,400 genes, of which 60% encode proteins of unknown function. Insights into the biochemical function and regulation of these genes will provide the foundation for future drug and vaccine development efforts toward eradication of this disease. By analyzing the complete asexual intraerythrocytic developmental cycle (IDC) transcriptome of the HB3 strain of P. falciparum, we demonstrate that at least 60% of the genome is transcriptionally active during this stage. Our data demonstrate that this parasite has evolved an extremely specialized mode of transcriptional regulation that produces a continuous cascade of gene expression, beginning with genes corresponding to general cellular processes, such as protein synthesis, and ending with Plasmodium-specific functionalities, such as genes involved in erythrocyte invasion. The data reveal that genes contiguous along the chromosomes are rarely coregulated, while transcription from the plastid genome is highly coregulated and likely polycistronic. Comparative genomic hybridization between HB3 and the reference genome strain (3D7) was used to distinguish between genes not expressed during the IDC and genes not detected because of possible sequence variations. Genomic differences between these strains were found almost exclusively in the highly antigenic subtelomeric regions of chromosomes. The simple cascade of gene regulation that directs the asexual development of P. falciparum is unprecedented in eukaryotic biology. The transcriptome of the IDC resembles a "just-in-time" manufacturing process whereby induction of any given gene occurs once per cycle and only at a time when it is required.
These data provide to our knowledge the first comprehensive view of the timing of transcription throughout the intraerythrocytic development of P. falciparum and provide a resource for the identification of new chemotherapeutic and vaccine candidates.

Keywords: microarray plasmodium
[Bouchaud2003Theory] J.-P. Bouchaud and M. Potters. Theory of financial risk and derivative pricing. Cambridge University Press, 2003. [ bib ]
[Bostroem2003Assessing] J. Boström, J. R. Greenwood, and J. Gottfries. Assessing the performance of OMEGA with respect to retrieving bioactive conformations. J. Mol. Graph. Model., 21(5):449-462, Mar 2003. [ bib ]
OMEGA is a rule-based program which rapidly generates conformational ensembles of small molecules. We have varied the parameters which control the nature of the ensembles generated by OMEGA in a statistical fashion (D-optimal) with the aim of increasing the probability of generating bioactive conformations. Thirty-six drug-like ligands from different ligand-protein complexes determined by high-resolution (≤2.0 Å) X-ray crystallography have been analyzed. Statistically significant models (Q² ≥ 0.75) confirm that one can increase the performance of OMEGA by modifying the parameters. Twenty-eight of the bioactive conformations were retrieved when using a low-energy cut-off (5 kcal/mol), a low RMSD value (0.6 Å) for duplicate removal, and a maximum of 1000 output conformations. All of those that were not retrieved had eight or more rotatable bonds. The duplicate removal parameter was found to have the largest impact on retrieval of bioactive conformations, and the maximum number of conformations also affected the results considerably. The input conformation was found to influence the results largely because certain bond angles can prevent the bioactive conformation from being generated as a low-energy conformation. Pre-optimizing the input structures with MMFF94s improved the results significantly. We also investigated the performance of OMEGA in connection with database searching. The shape-matching program Rapid Overlay of Chemical Structures (ROCS) was used as search tool. Two multi-conformational databases were built from the MDDR database plus the 36 compounds; one large (maximum 1000 conformations/mol) and one small (maximum 100 conformations/mol). Both databases provided satisfactory results in terms of retrieval. ROCS was able to rank 35 out of 36 X-ray structures among the top 500 hits from the large database.

[Bolstad2003comparison] B.M. Bolstad, R.A. Irizarry, M. Åstrand, and T.P. Speed. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2):185-193, 2003. [ bib ]
[Bock2003Whole-proteome] J. R. Bock and D. A. Gough. Whole-proteome interaction mining. Bioinformatics, 19(1):125-134, 2003. [ bib | http | .pdf ]
Motivation: A major post-genomic scientific and technological pursuit is to describe the functions performed by the proteins encoded by the genome. One strategy is to first identify the protein-protein interactions in a proteome, then determine pathways and overall structure relating these interactions, and finally to statistically infer functional roles of individual proteins. Although huge amounts of genomic data are at hand, current experimental protein interaction assays must overcome technical problems to scale up for high-throughput analysis. In the meantime, bioinformatics approaches may help bridge the information gap required for inference of protein function. In this paper, a previously described data mining approach to prediction of protein-protein interactions (Bock and Gough, 2001, Bioinformatics, 17, 455-460) is extended to interaction mining on a proteome-wide scale. An algorithm (the phylogenetic bootstrap) is introduced, which suggests traversal of a phenogram, interleaving rounds of computation and experiment, to develop a knowledge base of protein interactions in genetically-similar organisms. Results: The interaction mining approach was demonstrated by building a learning system based on 1,039 experimentally validated protein-protein interactions in the human gastric bacterium Helicobacter pylori. An estimate of the generalization performance of the classifier was derived from 10-fold cross-validation, which indicated expected upper bounds on precision of 80% [...] One such organism is the enteric pathogen Campylobacter jejuni, in which comprehensive machine learning prediction of all possible pairwise protein-protein interactions was performed. The resulting network of interactions shares an average protein connectivity characteristic in common with previous investigations reported in the literature, offering strong evidence supporting the biological feasibility of the hypothesized map.
For inferences about complete proteomes in which the number of pairwise non-interactions is expected to be much larger than the number of actual interactions, we anticipate that the sensitivity will remain the same but precision may decrease. We present specific biological examples of two subnetworks of protein-protein interactions in C. jejuni resulting from the application of this approach, including elements of a two-component signal transduction system for thermoregulation, and a ferritin uptake network. Contact: dgough@bioeng.ucsd.edu

Keywords: biosvm
[Bissantz2003Protein-based] C. Bissantz, P. Bernard, M. Hibert, and D. Rognan. Protein-based virtual screening of chemical databases. II. Are homology models of G-protein coupled receptors suitable targets? Proteins, 50(1):5-25, Jan 2003. [ bib | DOI | http ]
The aim of the current study is to investigate whether homology models of G-Protein-Coupled Receptors (GPCRs) that are based on bovine rhodopsin are reliable enough to be used for virtual screening of chemical databases. Starting from the recently described 2.8 Å-resolution X-ray structure of bovine rhodopsin, homology models of an "antagonist-bound" form of three human GPCRs (dopamine D3 receptor, muscarinic M1 receptor, vasopressin V1a receptor) were constructed. The homology models were used to screen three-dimensional databases using three different docking programs (Dock, FlexX, Gold) in combination with seven scoring functions (ChemScore, Dock, FlexX, Fresno, Gold, Pmf, Score). Rhodopsin-based homology models turned out to be suitable, indeed, for virtual screening since known antagonists seeded in the test databases could be distinguished from randomly chosen molecules. However, such models are not accurate enough for retrieving known agonists. To generate receptor models better suited for agonist screening, we developed a new knowledge- and pharmacophore-based modeling procedure that might partly simulate the conformational changes occurring in the active site during receptor activation. Receptor coordinates generated by this new procedure are now suitable for agonist screening. We thus propose two alternative strategies for the virtual screening of GPCR ligands, relying on a different set of receptor coordinates (antagonist-bound and agonist-bound states).

Keywords: chemogenomics
[Bi2003Dimensionality] J. Bi, K. Bennett, M. Embrechts, C. Breneman, and M. Song. Dimensionality reduction via sparse support vector machines. J. Mach. Learn. Res., 3:1229-1243, 2003. [ bib ]
[Bhasin2003MHCBN] Manoj Bhasin, Harpreet Singh, and G. P. S. Raghava. MHCBN: a comprehensive database of MHC binding and non-binding peptides. Bioinformatics, 19(5):665-666, Mar 2003. [ bib ]
MHCBN is a comprehensive database of Major Histocompatibility Complex (MHC) binding and non-binding peptides compiled from published literature and existing databases. The latest version of the database has 19 777 entries including 17 129 MHC binders and 2648 MHC non-binders for more than 400 MHC molecules. The database has sequence and structure data of (a) source proteins of peptides and (b) MHC molecules. MHCBN has a number of web tools that include: (i) mapping of peptide on query sequence; (ii) search on any field; (iii) creation of data sets; and (iv) online data submission. The database also provides hypertext links to major databases like SWISS-PROT, PDB, IMGT/HLA-DB, GenBank and PUBMED.

Keywords: Amino Acid Sequence; Binding Sites; Database Management Systems; Databases, Protein; Histocompatibility Antigens; Information Storage and Retrieval; Macromolecular Substances; Major Histocompatibility Complex; Molecular Sequence Data; Peptide Fragments; Peptides; Protein Binding; Protein Conformation; Sequence Alignment; Sequence Analysis, Protein; Structure-Activity Relationship; User-Computer Interface
[Berlinet2003Reproducing] A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer, 2003. [ bib ]
[Ben-Hur2003Remote] A. Ben-Hur and D. Brutlag. Remote homology detection: a motif based approach. Bioinformatics, 19(Suppl. 1):i26-i33, 2003. [ bib | http | .pdf ]
Motivation: Remote homology detection is the problem of detecting homology in cases of low sequence similarity. It is a hard computational problem with no approach that works well in all cases. Results: We present a method for detecting remote homology that is based on the presence of discrete sequence motifs. The motif content of a pair of sequences is used to define a similarity that is used as a kernel for a Support Vector Machine (SVM) classifier. We test the method on two remote homology detection tasks: prediction of a previously unseen SCOP family and prediction of an enzyme class given other enzymes that have a similar function on other substrates. We find that it performs significantly better than an SVM method that uses BLAST or Smith-Waterman similarity scores as features. Availability: The software is available from the authors upon request.

Keywords: biosvm
[Ben-David2003Exploiting] S. Ben-David and R. Schuller. Exploiting task relatedness for multiple task learning, 2003. [ bib | .html ]
[Belkin2003Laplacian] M. Belkin and P. Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Comput., 15(6):1373-1396, 2003. [ bib | http | www: ]
One of the central problems in machine learning and pattern recognition is to develop appropriate representations for complex data. We consider the problem of constructing a representation for data lying on a low-dimensional manifold embedded in a high-dimensional space. Drawing on the correspondence between the graph Laplacian, the Laplace Beltrami operator on the manifold, and the connections to the heat equation, we propose a geometrically motivated algorithm for representing the high-dimensional data. The algorithm provides a computationally efficient approach to nonlinear dimensionality reduction that has locality-preserving properties and a natural connection to clustering. Some potential applications and illustrative examples are discussed.

Keywords: dimred
[Beerenwinkel2003Methods] N. Beerenwinkel, T. Lengauer, M. Daumer, R. Kaiser, H. Walter, K. Korn, D. Hoffmann, and J. Selbig. Methods for optimizing antiviral combination therapies. Bioinformatics, 19(Suppl. 1):i16-i25, 2003. [ bib | http | .pdf ]
Motivation: Despite some progress with antiretroviral combination therapies, therapeutic success in the management of HIV-infected patients is limited. The evolution of drug-resistant genetic variants in response to therapy plays a key role in treatment failure and finding a new potent drug combination after therapy failure is considered challenging. Results: To estimate the activity of a drug combination against a particular viral strain, we develop a scoring function whose independent variables describe a set of antiviral agents and viral DNA sequences coding for the molecular targets of the respective drugs. The construction of this activity score involves (1) predicting phenotypic drug resistance from genotypes for each drug individually, (2) probabilistic modeling of predicted resistance values and integration into a score for drug combinations, and (3) searching through the mutational neighborhood of the considered strain in order to estimate activity on nearby mutants. For a clinical data set, we determine the optimal search depth and show that the scoring scheme is predictive of therapeutic outcome. Properties of the activity score and applications are discussed. Contact: beerenwinkel@mpi-sb.mpg.de Keywords: HIV, antiretroviral therapy, drug resistance, SVM regression, therapy optimization, sequence space search.

Keywords: biosvm
[Bartlett2003Convexity] P.L. Bartlett, M.I. Jordan, and J.D. McAuliffe. Convexity, classification and risk bounds. Technical Report 638, UC Berkeley Statistics, 2003. [ bib | .pdf ]
[Balmain2003genetics] A. Balmain, J. Gray, and B. Ponder. The genetics and genomics of cancer. Nat. Genet., 33:238-244, 2003. [ bib | DOI | http | .pdf ]
The past decade has seen great strides in our understanding of the genetic basis of human disease. Arguably, the most profound impact has been in the area of cancer genetics, where the explosion of genomic sequence and molecular profiling data has illustrated the complexity of human malignancies. In a tumor cell, dozens of different genes may be aberrant in structure or copy number, and hundreds or thousands of genes may be differentially expressed. A number of familial cancer genes with high-penetrance mutations have been identified, but the contribution of low-penetrance genetic variants or polymorphisms to the risk of sporadic cancer development remains unclear. Studies of the complex somatic genetic events that take place in the emerging cancer cell may aid the search for the more elusive germline variants that confer increased susceptibility. Insights into the molecular pathogenesis of cancer have provided new strategies for treatment, but a deeper understanding of this disease will require new statistical and computational approaches for analysis of the genetic and signaling networks that orchestrate individual cancer susceptibility and tumor behavior.

[Bakker2003Task] B. Bakker and T. Heskes. Task clustering and gating for Bayesian multitask learning. J. Mach. Learn. Res., 4:83-99, 2003. [ bib ]
[Bagirov2003New] A. M. Bagirov, B. Ferguson, S. Ivkovic, G. Saunders, and J. Yearwood. New algorithms for multi-class cancer diagnosis using tumor gene expression signatures. Bioinformatics, 19(14):1800-7, Sep 2003. [ bib | http | .pdf ]
MOTIVATION: The increasing use of DNA microarray-based tumor gene expression profiles for cancer diagnosis requires mathematical methods with high accuracy for solving clustering, feature selection and classification problems of gene expression data. RESULTS: New algorithms are developed for solving clustering, feature selection and classification problems of gene expression data. The clustering algorithm is based on optimization techniques and allows the calculation of clusters step-by-step. This approach allows us to find as many clusters as a data set contains with respect to some tolerance. Feature selection is crucial for a gene expression database. Our feature selection algorithm is based on calculating overlaps of different genes. The database used contains over 16 000 genes and this number is considerably reduced by feature selection. We propose a classification algorithm where each tissue sample is considered as the center of a cluster which is a ball. The results of numerical experiments confirm that the classification algorithm in combination with the feature selection algorithm performs slightly better than the published results for multi-class classifiers based on support vector machines for this data set. AVAILABILITY: Available on request from the authors.

Keywords: Algorithms, Amino Acid Sequence, Anion Exchange Resins, Antigen-Antibody Complex, Artificial Intelligence, Automated, Automatic Data Processing, Biological, Blood Cells, Chemical, Chromatography, Cluster Analysis, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA, Data Interpretation, Databases, Decision Making, Decision Trees, Diffusion Magnetic Resonance Imaging, English Abstract, Epitopes, Expert Systems, Factual, Fuzzy Logic, Gene Expression Profiling, Gene Expression Regulation, Gene Targeting, Genetic, Genome, Histocompatibility Antigens Class I, Humans, Image Interpretation, Image Processing, In Vitro, Indicators and Reagents, Information Storage and Retrieval, Ion Exchange, Least-Squares Analysis, Liver Cirrhosis, Magnetic Resonance Imaging, Male, Models, Molecular Sequence Data, Neoplasms, Neoplastic, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Nucleic Acid Conformation, Oligonucleotide Array Sequence Analysis, P.H.S., Pattern Recognition, Prostatic Neoplasms, Protein, Protein Binding, Protein Interaction Mapping, Proteins, Proteome, Quantitative Structure-Activity Relationship, RNA, ROC Curve, Reproducibility of Results, Research Support, Sensitivity and Specificity, Sequence Alignment, Sequence Analysis, Severity of Illness Index, Statistical, Structure-Activity Relationship, Subtraction Technique, T-Lymphocyte, Transcription Factors, Transfer, Treatment Outcome, Tumor Markers, U.S. Gov't, User-Computer Interface, 14512351
[Bader2003BIND] G.D. Bader, D. Betel, and C.W.V. Hogue. BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res, 31(1):248-250, Jan 2003. [ bib ]
The Biomolecular Interaction Network Database (BIND: http://bind.ca) archives biomolecular interaction, complex and pathway information. A web-based system is available to query, view and submit records. BIND continues to grow with the addition of individual submissions as well as interaction data from the PDB and a number of large-scale interaction and complex mapping experiments using yeast two hybrid, mass spectrometry, genetic interactions and phage display. We have developed a new graphical analysis tool that provides users with a view of the domain composition of proteins in interaction and complex records to help relate functional domains to protein interactions. An interaction network clustering tool has also been developed to help focus on regions of interest. Continued input from users has helped further mature the BIND data specification, which now includes the ability to store detailed information about genetic interactions. The BIND data specification is available as ASN.1 and XML DTD.

[Bach2003Learning] Francis R. Bach and Michael I. Jordan. Learning spectral clustering. In Advances in Neural Information Processing Systems 16. MIT Press, 2003. [ bib ]
[Bohm2003Protein-ligand] H.-J. Böhm, G. Schneider, R. Mannhold, H. Kubinyi, and G. Folkers. Protein-ligand interactions. Wiley, 2003. [ bib ]
Keywords: chemoinformatics
[Attwood2003PRINTS] T. K. Attwood, P. Bradley, D. R. Flower, A. Gaulton, N. Maudling, A. L. Mitchell, G. Moulton, A. Nordle, K. Paine, P. Taylor, A. Uddin, and C. Zygouri. PRINTS and its automatic supplement, prePRINTS. Nucleic Acids Res., 31(1):400-402, Jan 2003. [ bib ]
The PRINTS database houses a collection of protein fingerprints. These may be used to assign uncharacterised sequences to known families and hence to infer tentative functions. The September 2002 release (version 36.0) includes 1800 fingerprints, encoding approximately 11 000 motifs, covering a range of globular and membrane proteins, modular polypeptides and so on. In addition to its continued steady growth, we report here the development of an automatic supplement, prePRINTS, designed to increase the coverage of the resource and reduce some of the manual burdens inherent in its maintenance. The databases are accessible for interrogation and searching at http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/.

Keywords: Amino Acid Motifs; Animals; Automation; Conserved Sequence; Databases, Protein; Proteins; Software
[Arakawa2003Application] M. Arakawa, K. Hasegawa, and K. Funatsu. Application of the novel molecular alignment method using the Hopfield Neural Network to 3D-QSAR. J Chem Inf Comput Sci, 43(5):1396-1402, 2003. [ bib | DOI | http ]
Recently, we investigated and proposed the novel molecular alignment method with the Hopfield Neural Network (HNN). Molecules are represented by four kinds of chemical properties (hydrophobic group, hydrogen-bonding acceptor, hydrogen-bonding donor, and hydrogen-bonding donor/acceptor), and then those properties between two molecules correspond to each other using HNN. The 12 pairs of enzyme-inhibitors were used for validation, and our method could successfully reproduce the real molecular alignments obtained from X-ray crystallography. In this paper, we apply the molecular alignment method to three-dimensional quantitative structure-activity relationship (3D-QSAR) analysis. The two data sets (human epidermal growth factor receptor-2 inhibitors and cyclooxygenase-2 inhibitors) were investigated to validate our method. As a result, the robust and predictive 3D-QSAR models were successfully obtained in both data sets.

Keywords: Chemical, Cyclooxygenase 2, Cyclooxygenase 2 Inhibitors, Cyclooxygenase Inhibitors, Databases, Enzyme Inhibitors, Epidermal Growth Factor, Factual, Humans, Isoenzymes, Membrane Proteins, Models, Molecular, Neural Networks (Computer), Prostaglandin-Endoperoxide Synthases, Quantitative Structure-Activity Relationship, Receptor, 14502472
[Anguita2003Quantum] Davide Anguita, Sandro Ridella, Fabio Rivieccio, and Rodolfo Zunino. Quantum optimization for training support vector machines. Neural Netw., 16(5-6):763-70, 2003. [ bib | DOI | http | .pdf ]
Refined concepts, such as Rademacher estimates of model complexity and nonlinear criteria for weighting empirical classification errors, represent recent and promising approaches to characterize the generalization ability of Support Vector Machines (SVMs). The advantages of those techniques lie in both improving the SVM representation ability and yielding tighter generalization bounds. On the other hand, they often make Quadratic-Programming algorithms no longer applicable, and SVM training cannot benefit from efficient, specialized optimization techniques. The paper considers the application of Quantum Computing to solve the problem of effective SVM training, especially in the case of digital implementations. The presented research compares the behavioral aspects of conventional and enhanced SVMs; experiments on both synthetic and real-world problems support the theoretical analysis. At the same time, the related differences between Quadratic-Programming and Quantum-based optimization techniques are considered.

[Anderson2003new] D.C. Anderson, W. Li, D.G. Payan, and W.S. Noble. A new algorithm for the evaluation of shotgun peptide sequencing in proteomics: support vector machine classification of peptide MS/MS spectra and SEQUEST scores. J Proteome Res, 2(2):137-146, 2003. [ bib | .pdf ]
Shotgun tandem mass spectrometry-based peptide sequencing using programs such as SEQUEST allows high-throughput identification of peptides, which in turn allows the identification of corresponding proteins. We have applied a machine learning algorithm, called the support vector machine, to discriminate between correctly and incorrectly identified peptides using SEQUEST output. Each peptide was characterized by SEQUEST-calculated features such as delta Cn and Xcorr, measurements such as precursor ion current and mass, and additional calculated parameters such as the fraction of matched MS/MS peaks. The trained SVM classifier performed significantly better than previous cutoff-based methods at separating positive from negative peptides. Positive and negative peptides were more readily distinguished in training set data acquired on a QTOF, compared to an ion trap mass spectrometer. The use of 13 features, including four new parameters, significantly improved the separation between positive and negative peptides. Use of the support vector machine and these additional parameters resulted in a more accurate interpretation of peptide MS/MS spectra and is an important step toward automated interpretation of peptide tandem mass spectrometry data in proteomics.

Keywords: biosvm proteomics
[Ambauen2003Graph] R. Ambauen, S. Fischer, and H. Bunke. Graph edit distance with node splitting and merging, and its application to diatom identification. In GbRPR, pages 95-106, 2003. [ bib ]
[Altun2003Large] Y. Altun and T. Hofmann. Large Margin Methods for Label Sequence Learning. In 8th European Conference on Speech Communication and Technology (EuroSpeech), 2003. [ bib | .pdf ]
Label sequence learning is the problem of inferring a state sequence from an observation sequence, where the state sequence may encode a labeling, annotation or segmentation of the sequence. In this paper we give an overview of discriminative methods developed for this problem. Special emphasis is put on large margin methods by generalizing multiclass Support Vector Machines and AdaBoost to the case of label sequences. An experimental evaluation demonstrates the advantages over classical approaches like Hidden Markov Models and the competitiveness with methods like Conditional Random Fields.

Keywords: conditional-random-field
[Alexandersson2003SLAM] M. Alexandersson, S. Cawley, and L. Pachter. SLAM: cross-species gene finding and alignment with a generalized pair hidden Markov model. Genome Res., 13(3):496-502, Mar 2003. [ bib | DOI | http | .pdf ]
Comparative-based gene recognition is driven by the principle that conserved regions between related organisms are more likely than divergent regions to be coding. We describe a probabilistic framework for gene structure and alignment that can be used to simultaneously find both the gene structure and alignment of two syntenic genomic regions. A key feature of the method is the ability to enhance gene predictions by finding the best alignment between two syntenic sequences, while at the same time finding biologically meaningful alignments that preserve the correspondence between coding exons. Our probabilistic framework is the generalized pair hidden Markov model, a hybrid of (1) generalized hidden Markov models, which have been used previously for gene finding, and (2) pair hidden Markov models, which have applications to sequence alignment. We have built a gene finding and alignment program called SLAM, which aligns and identifies complete exon/intron structures of genes in two related but unannotated sequences of DNA. SLAM is able to reliably predict gene structures for any suitably related pair of organisms, most notably with fewer false-positive predictions compared to previous methods (examples are provided for Homo sapiens/Mus musculus and Plasmodium falciparum/Plasmodium vivax comparisons). Accuracy is obtained by distinguishing conserved noncoding sequence (CNS) from conserved coding sequence. CNS annotation is a novel feature of SLAM and may be useful for the annotation of UTRs, regulatory elements, and other noncoding features.

Keywords: biogm
[Albertson2003Genomic] D. G. Albertson and D. Pinkel. Genomic microarrays in human genetic disease and cancer. Hum. Mol. Genet., 12 Spec No 2:R145-R152, Oct 2003. [ bib | DOI | http | .pdf ]
Alterations in the genome that lead to changes in DNA sequence copy number are a characteristic of solid tumors and are found in association with developmental abnormalities and/or mental retardation. Comparative genomic hybridization (CGH) can be used to detect and map these changes. Recent improvements in the resolution and sensitivity of CGH have been possible through implementation of microarray-based CGH (array CGH). Here we discuss the performance characteristics of different array platforms and review some of the recent applications of array CGH in cancer and medical genetics.

Keywords: csbcbook
[Albertson2003Chromosome] D. G. Albertson, C. Collins, F. McCormick, and J. W. Gray. Chromosome aberrations in solid tumors. Nat. Genet., 34(4):369-376, Aug 2003. [ bib | DOI | http | .pdf ]
Chromosome aberrations in human solid tumors are hallmarks of gene deregulation and genome instability. This review summarizes current knowledge regarding aberrations, discusses their functional importance, suggests mechanisms by which aberrations may form during cancer progression and provides examples of clinical advances that have come from studies of chromosome aberrations.

Keywords: csbcbook
[Aguda2003CellCycle] B. D. Aguda and C. K. Algar. A structural analysis of the qualitative networks regulating the cell cycle and apoptosis. Cell Cycle, 2(6):538-44, 2003. [ bib ]
This paper proposes an integration and modular organization of the complex regulatory networks involved in the mammalian cell cycle, apoptosis, and related intracellular signaling cascades. A common node linking the cell cycle and apoptosis permits the possibility of coordinate control between the initiation of these two cellular processes. From this node, pathways emanate that lead to the activation of cyclin-dependent kinases (in the cell cycle) and caspases (in apoptosis). Computer simulations are carried out to demonstrate that the proposed network architecture and certain module-module interactions can account for the experimentally observed sequence of cellular events (quiescence, cell cycle, and apoptosis) as the transcriptional activities of E2F-1 and c-Myc are increased. Despite the lack of quantitative kinetic data on most of the pathways, it is demonstrated that there can be meaningful conclusions regarding system stability that arise from the topology of the network. It is shown that only cycles in the network graph determine stability. Thus, several positive and negative feedback loops are identified from a literature review of the major pathways involved in the initiation of the cell cycle and of apoptosis.

Keywords: csbcbook
[Aebersold2003Mass] R. Aebersold and M. Mann. Mass spectrometry-based proteomics. Nature, 422(6928):198-207, Mar 2003. [ bib | DOI | http | .pdf ]
Recent successes illustrate the role of mass spectrometry-based proteomics as an indispensable tool for molecular and cellular biology and for the emerging field of systems biology. These include the study of protein-protein interactions via affinity-based isolations on a small and proteome-wide scale, the mapping of numerous organelles, the concurrent description of the malaria parasite genome and proteome, and the generation of quantitative protein profiles from diverse species. The ability of mass spectrometry to identify and, increasingly, to precisely quantify thousands of proteins from complex samples can be expected to impact broadly on biology and medicine.

Keywords: bio
[Abrahamian2003Efficient] E. Abrahamian, P. C. Fox, L. Naerum, I. T. Christensen, H. Thøgersen, and R. D. Clark. Efficient generation, storage, and manipulation of fully flexible pharmacophore multiplets and their use in 3-D similarity searching. J. Chem. Inf. Comput. Sci., 43(2):458-468, 2003. [ bib | DOI | http | .pdf ]
Pharmacophore triplets and quartets have been used by many groups in recent years, primarily as a tool for molecular diversity analysis. In most cases, slow processing speeds and the very large size of the bitsets generated have forced researchers to compromise in terms of how such multiplets were stored, manipulated, and compared, e.g., by using simple unions to represent multiplets for sets of molecules. Here we report using bitmaps in place of bitsets to reduce storage demands and to improve processing speed. Here, a bitset is taken to mean a fully enumerated string of zeros and ones, from which a compressed bitmap is obtained by replacing uniform blocks ("runs") of digits in the bitset with a pair of values identifying the content and length of the block (run-length encoding compression). High-resolution multiplets involving four features are enabled by using 64 bit executables to create and manipulate bitmaps, which "connect" to the 32 bit executables used for database access and feature identification via an extensible mark-up language (XML) data stream. The encoding system used supports simple pairs, triplets, and quartets; multiplets in which a privileged substructure is used as an anchor point; and augmented multiplets in which an additional vertex is added to represent a contingent feature such as a hydrogen bond extension point linked to a complementary feature (e.g., a donor or an acceptor atom) in a base pair or triplet. It can readily be extended to larger, more complex multiplets as well. Database searching is one particular potential application for this technology. Consensus bitmaps built up from active ligands identified in preliminary screening can be used to generate hypothesis bitmaps, a process which includes allowance for differential weighting to allow greater emphasis to be placed on bits arising from multiplets expected to be particularly discriminating. 
Such hypothesis bitmaps are shown to be useful queries for database searching, successfully retrieving active compounds across a range of structural classes from a corporate database. The current implementation allows multiconformer bitmaps to be obtained from pregenerated conformations or by random perturbation on-the-fly. The latter application involves random sampling of the full range of conformations not precluded by steric clashes, which limits the usefulness of classical fingerprint similarity measures. A new measure of similarity, The Stochastic Cosine, is introduced here to address this need. This new similarity measure uses the average number of bits common to independently drawn conformer sets to normalize the cosine coefficient. Its use frees the user from having to ensure strict comparability of starting conformations and having to use fixed torsional increments, thereby allowing fully flexible characterization of pharmacophoric patterns.
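The run-length encoding described in this abstract can be illustrated with a minimal Python sketch. It assumes a bitset is held as a string of '0' and '1' characters and a bitmap as a list of (bit, run length) pairs; the function names are hypothetical, and this is not the authors' implementation or storage format.

```python
from itertools import groupby


def rle_encode(bitset):
    """Compress a '0'/'1' string into a list of (bit, run_length) pairs."""
    return [(bit, sum(1 for _ in run)) for bit, run in groupby(bitset)]


def rle_decode(bitmap):
    """Expand (bit, run_length) pairs back into the full '0'/'1' string."""
    return "".join(bit * length for bit, length in bitmap)


def common_set_bits(bitmap_a, bitmap_b):
    """Count positions set to '1' in both bitmaps without decompressing them."""
    runs_a, runs_b = iter(bitmap_a), iter(bitmap_b)
    bit_a, left_a = next(runs_a, (None, 0))
    bit_b, left_b = next(runs_b, (None, 0))
    common = 0
    while left_a and left_b:
        # Advance by the shorter of the two current runs.
        step = min(left_a, left_b)
        if bit_a == "1" and bit_b == "1":
            common += step
        left_a -= step
        left_b -= step
        if left_a == 0:
            bit_a, left_a = next(runs_a, (None, 0))
        if left_b == 0:
            bit_b, left_b = next(runs_b, (None, 0))
    return common


bitset_a = "0000111100000011"
bitset_b = "0000011100001111"
bitmap_a = rle_encode(bitset_a)
bitmap_b = rle_encode(bitset_b)
```

Counting shared bits directly on the runs is what makes the compressed form useful for similarity searching: sparse bitsets consist mostly of long runs of zeros, so the loop touches far fewer items than there are bits.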

[Tegner2003Reverse] J. Tegner, M. K. S. Yeung, J. Hasty, and J. J. Collins. Reverse engineering gene networks: integrating genetic perturbations with dynamical modeling. Proc. Natl. Acad. Sci. USA, 100(10):5944-5949, May 2003. [ bib | DOI | http ]
While the fundamental building blocks of biology are being tabulated by the various genome projects, microarray technology is setting the stage for the task of deducing the connectivity of large-scale gene networks. We show how the perturbation of carefully chosen genes in a microarray experiment can be used in conjunction with a reverse engineering algorithm to reveal the architecture of an underlying gene regulatory network. Our iterative scheme identifies the network topology by analyzing the steady-state changes in gene expression resulting from the systematic perturbation of a particular node in the network. We highlight the validity of our reverse engineering approach through the successful deduction of the topology of a linear in numero gene network and a recently reported model for the segmentation polarity network in Drosophila melanogaster. Our method may prove useful in identifying and validating specific drug targets and in deconvolving the effects of chemical compounds.

[Semizarov2003Specificity] D. Semizarov, L. Frost, A. Sarthy, P. Kroeger, D. N. Halbert, and S. W. Fesik. Specificity of short interfering RNA determined through gene expression signatures. Proc. Natl. Acad. Sci. USA, 100(11):6347-52, May 2003. [ bib | DOI | http ]
Short interfering RNA (siRNA) is widely used for studying gene function and holds great promise as a tool for validating drug targets and treating disease. A critical assumption in these applications is that the effect of siRNA on cells is specific, i.e., limited to the specific knockdown of the target gene. In this article, we characterize the specificity of siRNA by applying gene expression profiling. Several siRNAs were designed against different regions of the same target gene for three different targets. Their effects on cells were compared by using DNA microarrays to generate gene expression signatures. When the siRNA design and transfection conditions were optimized, the signatures for different siRNAs against the same target were shown to correlate very closely, whereas the signatures for different genes revealed no correlation. These results indicate that siRNA is a highly specific tool for targeted gene knockdown, establishing siRNA-mediated gene silencing as a reliable approach for large-scale screening of gene function and drug target validation.

Keywords: sirna
[Segal2003Classification] N. H. Segal, P. Pavlidis, W. S. Noble, C. R. Antonescu, A. Viale, U. V. Wesley, K. Busam, H. Gallardo, D. DeSantis, M. F. Brennan, C. Cordon-Cardo, J. D. Wolchok, and A. N. Houghton. Classification of Clear-Cell Sarcoma as a Subtype of Melanoma by Genomic Profiling. J. Clin. Oncol., 21(9):1775-1781, May 2003. [ bib | DOI | http | .pdf ]
Purpose: To develop a genome-based classification scheme for clear-cell sarcoma (CCS), also known as melanoma of soft parts (MSP), which would have implications for diagnosis and treatment. This tumor displays characteristic features of soft tissue sarcoma (STS), including deep soft tissue primary location and a characteristic translocation, t(12;22)(q13;q12), involving EWS and ATF1 genes. CCS/MSP also has typical melanoma features, including immunoreactivity for S100 and HMB45, pigmentation, MITF-M expression, and a propensity for regional lymph node metastases. Materials and Methods: RNA samples from 21 cell lines and 60 pathologically confirmed cases of STS, melanoma, and CCS/MSP were examined using the U95A GeneChip (Affymetrix, Santa Clara, CA). Hierarchical cluster analysis, principal component analysis, and support vector machine (SVM) analysis exploited genomic correlations within the data to classify CCS/MSP. Results: Unsupervised analyses demonstrated a clear distinction between STS and melanoma and, furthermore, showed that CCS/MSP cluster with the melanomas as a distinct group. A supervised SVM learning approach further validated this finding and provided a user-independent approach to diagnosis. Genes of interest that discriminate CCS/MSP included those encoding melanocyte differentiation antigens, MITF, SOX10, ERBB3, and FGFR1. Conclusion: Gene expression profiles support the classification of CCS/MSP as a distinct genomic subtype of melanoma. Analysis of these gene profiles using the SVM may be an important diagnostic tool. Genomic analysis identified potential targets for the development of therapeutic strategies in the treatment of this disease.

Keywords: biosvm
[Pearlstein2003Understanding] R. Pearlstein, R. Vaz, and D. Rampe. Understanding the structure-activity relationship of the human ether-a-go-go-related gene cardiac K+ channel. A model for bad behavior. J. Med. Chem., 46(11):2017-2022, May 2003. [ bib | DOI | http ]
Keywords: chemoinformatics herg
[Pearlstein2003Characterization] R. A. Pearlstein, R. J. Vaz, J. Kang, X.-L. Chen, M. Preobrazhenskaya, A. E. Shchekotikhin, A. M. Korolev, L. N. Lysenkova, O. V. Miroshnikova, J. Hendrix, and D. Rampe. Characterization of HERG potassium channel inhibition using CoMSiA 3D QSAR and homology modeling approaches. Bioorg. Med. Chem. Lett., 13(10):1829-1835, May 2003. [ bib ]
A data set consisting of twenty-two sertindole analogues and ten structurally diverse inhibitors, spanning a wide range in potency, was analyzed using CoMSiA. A homology model of HERG was constructed from the crystal structure of the open MthK potassium channel. A complementary relationship between our CoMSiA and homology models is apparent when the long inhibitor axis is oriented parallel to the longitudinal axis of the pore, with the tail region pointed toward the selectivity filter. The key elements of the pharmacophore, the CoMSiA and the homology model are: (1) The hydrophobic feature optimally consists of an aromatic group that is capable of engaging in pi-stacking with a Phe656 side chain. Optionally, a second aromatic or hydrophobic group present in some inhibitors may contact an additional Phe656 side chain. (2) The basic nitrogen appears to undergo a pi-cation interaction with Tyr652. (3) The pore diameter (12 Å+), and depth of the selectivity loop relative to the intracellular opening, act as constraints on the conformation-dependent inhibitor dimensions.

Keywords: chemoinformatics herg
[Nielsen2003Reliable] M. Nielsen, C. Lundegaard, P. Worning, S. L. Lauemøller, K. Lamberth, S. Buus, S. Brunak, and O. Lund. Reliable prediction of T-cell epitopes using neural networks with novel sequence representations. Protein Sci., 12(5):1007-1017, May 2003. [ bib ]
In this paper we describe an improved neural network method to predict T-cell class I epitopes. A novel input representation has been developed consisting of a combination of sparse encoding, Blosum encoding, and input derived from hidden Markov models. We demonstrate that the combination of several neural networks derived using different sequence-encoding schemes has a performance superior to neural networks derived using a single sequence-encoding scheme. The new method is shown to have a performance that is substantially higher than that of other methods. By use of mutual information calculations we show that peptides that bind to the HLA A*0204 complex display signal of higher order sequence correlations. Neural networks are ideally suited to integrate such higher order correlations when predicting the binding affinity. It is this feature combined with the use of several neural networks derived from different and novel sequence-encoding schemes and the ability of the neural network to be trained on data consisting of continuous binding affinities that gives the new method an improved performance. The difference in predictive performance between the neural network methods and that of the matrix-driven methods is found to be most significant for peptides that bind strongly to the HLA molecule, confirming that the signal of higher order sequence correlation is most strongly present in high-binding peptides. Finally, we use the method to predict T-cell epitopes for the genome of hepatitis C virus and discuss possible applications of the prediction method to guide the process of rational vaccine design.

Keywords: immunoinformatics
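As an illustration of the "sparse encoding" input representation named in the abstract above, the following Python sketch maps each residue of a peptide to a 20-element indicator vector. The alphabet ordering, the 0/1 values, the function name and the example 9-mer are assumptions made for illustration; the paper's exact encoding values may differ.

```python
# The 20 standard amino acids in one-letter code (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sparse_encode(peptide):
    """Return a flat list with one 20-element indicator block per residue."""
    encoded = []
    for residue in peptide:
        block = [0.0] * len(AMINO_ACIDS)
        # str.index raises ValueError for non-standard residue symbols.
        block[AMINO_ACIDS.index(residue)] = 1.0
        encoded.extend(block)
    return encoded


# A hypothetical 9-mer peptide, the typical length of class I epitopes.
vector = sparse_encode("ALAKAAAAM")
```

A 9-mer therefore becomes a 180-element input vector with exactly nine non-zero entries, one per residue position.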
[Elkon2003Genome-wide] R. Elkon, C. Linhart, R. Sharan, R. Shamir, and Y. Shiloh. Genome-wide in silico identification of transcriptional regulators controlling the cell cycle in human cells. Genome Res., 13(5):773-780, May 2003. [ bib | DOI | http ]
Dissection of regulatory networks that control gene transcription is one of the greatest challenges of functional genomics. Using human genomic sequences, models for binding sites of known transcription factors, and gene expression data, we demonstrate that the reverse engineering approach, which infers regulatory mechanisms from gene expression patterns, can reveal transcriptional networks in human cells. To date, such methodologies were successfully demonstrated only in prokaryotes and low eukaryotes. We developed computational methods for identifying putative binding sites of transcription factors and for evaluating the statistical significance of their prevalence in a given set of promoters. Focusing on transcriptional mechanisms that control cell cycle progression, our computational analyses revealed eight transcription factors whose binding sites are significantly overrepresented in promoters of genes whose expression is cell-cycle-dependent. The enrichment of some of these factors is specific to certain phases of the cell cycle. In addition, several pairs of these transcription factors show a significant co-occurrence rate in cell-cycle-regulated promoters. Each such pair indicates functional cooperation between its members in regulating the transcriptional program associated with cell cycle progression. The methods presented here are general and can be applied to the analysis of transcriptional networks controlling any biological process.

[Chung2003Spectra] Fan Chung, Linyuan Lu, and Van Vu. Spectra of random graphs with given expected degrees. Proc. Natl. Acad. Sci. USA, 100(11):6313-6318, May 2003. [ bib | DOI | http ]
In the study of the spectra of power-law graphs, there are basically two competing approaches. One is to prove analogues of Wigner's semicircle law, whereas the other predicts that the eigenvalues follow a power-law distribution. Although the semicircle law and the power law have nothing in common, we will show that both approaches are essentially correct if one considers the appropriate matrices. We will prove that (under certain mild conditions) the eigenvalues of the (normalized) Laplacian of a random power-law graph follow the semicircle law, whereas the spectrum of the adjacency matrix of a power-law graph obeys the power law. Our results are based on the analysis of random graphs with given expected degrees and their relations to several key invariants. Of interest are a number of (new) values for the exponent beta, where phase transitions for eigenvalue distributions occur. The spectrum distributions have direct implications to numerous graph algorithms such as, for example, randomized algorithms that involve rapidly mixing Markov chains.

Keywords: 12743375
[Chang2003Improvement] Ruey-Feng Chang, Wen-Jie Wu, Woo Kyung Moon, and Dar-Ren Chen. Improvement in breast tumor discrimination by support vector machines and speckle-emphasis texture analysis. Ultrasound Med Biol, 29(5):679-86, May 2003. [ bib | DOI | http | .pdf ]
Recent statistics show that breast cancer is a major cause of death among women in developed countries. Hence, finding an accurate and effective diagnostic method is very important. In this paper, we propose a high precision computer-aided diagnosis (CAD) system for sonography. We utilize a support vector machine (SVM) to classify breast tumors according to their texture information surrounding speckle pixels. We test our system with 250 pathologically-proven breast tumors including 140 benign and 110 malignant ones. Also we compare the diagnostic performances of three texture features, i.e., speckle-emphasis texture feature, nonspeckle-emphasis texture feature and conventional all pixels texture feature, applied to breast sonography using SVM. In our experiment, the accuracy of SVM with speckle information for classifying malignancies is 93.2% (233/250), the sensitivity is 95.45% (105/110), the specificity is 91.43% (128/140), the positive predictive value is 89.74% (105/117) and the negative predictive value is 96.24% (128/133). Based on the experimental results, speckle phenomenon is a useful tool to be used in computer-aided diagnosis; its performance is better than those of the other two features. Speckle phenomenon, which is considered as noise in sonography, can intrude into judgments of a physician using naked eyes but it is another story for application in a computer-aided diagnosis algorithm.

Keywords: breastcancer
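Note: the performance figures quoted in the abstract above follow directly from the reported counts (105 of 110 malignant and 128 of 140 benign tumors classified correctly). The short Python sketch below reproduces that arithmetic. The function name and the derived error counts (5 missed malignant tumors, 12 misclassified benign tumors) are illustrative assumptions, not taken from the paper.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return accuracy, sensitivity, specificity, PPV and NPV as percentages."""
    total = tp + fn + tn + fp
    return {
        "accuracy": 100.0 * (tp + tn) / total,
        "sensitivity": 100.0 * tp / (tp + fn),   # true positives / all malignant
        "specificity": 100.0 * tn / (tn + fp),   # true negatives / all benign
        "ppv": 100.0 * tp / (tp + fp),           # positive predictive value
        "npv": 100.0 * tn / (tn + fn),           # negative predictive value
    }


# Counts implied by the abstract; fn and fp are derived as 110 - 105 and 140 - 128.
metrics = diagnostic_metrics(tp=105, fn=5, tn=128, fp=12)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}%")
```

Rounded to two decimals, the values agree with the percentages reported in the abstract.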
[Cavasotto2003Structure-based] C. N. Cavasotto, A. J. W. Orry, and R. A. Abagyan. Structure-based identification of binding sites, native ligands and potential inhibitors for G-protein coupled receptors. Proteins, 51(3):423-433, May 2003. [ bib | DOI | http ]
G-protein coupled receptors (GPCRs) are the largest family of cell-surface receptors involved in signal transmission. Drugs associated with GPCRs represent more than one fourth of the 100 top-selling drugs and are the targets of more than half of the current therapeutic agents on the market. Our methodology based on the internal coordinate mechanics (ICM) program can accurately identify the ligand-binding pocket in the currently available crystal structures of seven transmembrane (7TM) proteins [bacteriorhodopsin (BR) and bovine rhodopsin (bRho)]. The binding geometry of the ligand can be accurately predicted by ICM flexible docking with and without the loop regions, a useful finding for GPCR docking because the transmembrane regions are easier to model. We also demonstrate that the native ligand can be identified by flexible docking and scoring in 1.5% and 0.2% (for bRho and BR, respectively) of the best scoring compounds from two different types of compound database. The same procedure can be applied to the database of available chemicals to identify specific GPCR binders. Finally, we demonstrate that even if the sidechain positions in the bRho binding pocket are entirely wrong, their correct conformation can be fully restored with high accuracy (0.28 Å) through the ICM global optimization with and without the ligand present. These binding site adjustments are critical for flexible docking of new ligands to known structures or for docking to GPCR homology models. The ICM docking method has the potential to be used to "de-orphanize" orphan GPCRs (oGPCRs) and to identify antagonists-agonists for GPCRs if an accurate model (experimentally and computationally validated) of the structure has been constructed or when future crystal structures are determined.

Keywords: chemogenomics
[Bleicher2003Hit] K. H. Bleicher, H.-J. Böhm, K. Müller, and A. I. Alanine. Hit and lead generation: beyond high-throughput screening. Nat Rev Drug Discov, 2(5):369-378, May 2003. [ bib | DOI | http ]
The identification of small-molecule modulators of protein function, and the process of transforming these into high-content lead series, are key activities in modern drug discovery. The decisions taken during this process have far-reaching consequences for success later in lead optimization and even more crucially in clinical development. Recently, there has been an increased focus on these activities due to escalating downstream costs resulting from high clinical failure rates. In addition, the vast emerging opportunities from efforts in functional genomics and proteomics demand a departure from the linear process of identification, evaluation and refinement activities towards a more integrated parallel process. This calls for flexible, fast and cost-effective strategies to meet the demands of producing high-content lead series with improved prospects for clinical success.

Keywords: Amino Acid Motifs, Combinatorial Chemistry Techniques, Drug Design, Drug Evaluation, Genomics, Preclinical, Proteomics, 12750740
[Lacoste-Julien2003introduction] S. Lacoste-Julien. An introduction to Max-Margin Markov Networks. UC Berkeley cs281a project report, December 2003. [ bib | .pdf ]
Keywords: conditional-random-field
[Kubinyi2004Chemogenomics] H. Kubinyi, G. Müller, R. Mannhold, and G. Folkers, editors. Chemogenomics in Drug Discovery: A Medicinal Chemistry Perspective. Methods and Principles in Medicinal Chemistry. Wiley-VCH, New York, 2004. [ bib ]
[Zomer2004Toxicological] Simeone Zomer, Christelle Guillo, Richard G Brereton, and Melissa Hanna-Brown. Toxicological classification of urine samples using pattern recognition techniques and capillary electrophoresis. Anal Bioanal Chem, 378(8):2008-20, Apr 2004. [ bib | DOI | http | .pdf ]
In toxicology, hazardous substances detected in organisms may often lead to different pathological conditions depending on the type of exposure and level of dosage; hence, further analysis on this can suggest the best cure. Urine profiling may serve the purpose because samples typically contain hundreds of compounds representing an effective metabolic fingerprint. This paper proposes a pattern recognition procedure for determining the type of cadmium dosage, acute or chronic, administrated to laboratory rats, where urinary profiles are detected using capillary electrophoresis. The procedure is based on the composition of a sample data matrix consisting of areas of common peaks, with appropriate pre-processing aimed at reducing the lack of reproducibility and enhancing the potential contribution of low-level metabolites in discrimination. The matrix is then used for pattern recognition including principal components analysis, cluster analysis, discriminant analysis and support vector machines. Attention is particularly focussed on the last of these techniques, because of its novelty and some attractive features such as its suitability to work with datasets that are small and/or have low samples/variable ratios. The type of cadmium administration is detected as a relevant feature that contributes to the structure of the sample matrix, and samples are classified according to the class membership, with discriminant analysis and support vector machines performing complementarily on a training and on a test set.

Keywords: Algorithms, Ambergris, Animals, Automated, Cadmium, Candida, Candida albicans, Capillary, Cluster Analysis, Combinatorial Chemistry Techniques, Electrophoresis, Eye Enucleation, Humans, Magnetic Resonance Spectroscopy, Melanoma, Models, Molecular, Molecular Conformation, Non-U.S. Gov't, Odors, P.H.S., Pattern Recognition, Perfume, Predictive Value of Tests, Prognosis, Prospective Studies, Quantitative Structure-Activity Relationship, Rats, Research Support, U.S. Gov't, Uveal Neoplasms, 15007590
[Zhu20041norm] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani. 1-norm support vector machines. In S. Thrun, L. Saul, and B. Schölkopf, editors, Adv. Neural. Inform. Process Syst., volume 16, Cambridge, MA, 2004. MIT Press. [ bib ]
[Zhu2004Classification] J. Zhu and T. Hastie. Classification of gene microarrays by penalized logistic regression. Biostatistics, 5(3):427-43, Jul 2004. [ bib | DOI | http | .pdf ]
Classification of patient samples is an important aspect of cancer diagnosis and treatment. The support vector machine (SVM) has been successfully applied to microarray cancer diagnosis problems. However, one weakness of the SVM is that given a tumor sample, it only predicts a cancer class label but does not provide any estimate of the underlying probability. We propose penalized logistic regression (PLR) as an alternative to the SVM for the microarray cancer diagnosis problem. We show that when using the same set of genes, PLR and the SVM perform similarly in cancer classification, but PLR has the advantage of additionally providing an estimate of the underlying probability. Often a primary goal in microarray cancer diagnosis is to identify the genes responsible for the classification, rather than class prediction. We consider two gene selection methods in this paper, univariate ranking (UR) and recursive feature elimination (RFE). Empirical results indicate that PLR combined with RFE tends to select fewer genes than other methods and also performs well in both cross-validation and test samples. A fast algorithm for solving PLR is also described.
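Note: the point that penalized logistic regression returns a class probability rather than only a label can be illustrated with a minimal Python sketch. It assumes an L2 (ridge) penalty and plain batch gradient descent. It is not the fast algorithm described in the paper, and the toy data, function names and hyperparameters (lam, lr, n_iter) are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_plr(X, y, lam=0.1, lr=0.5, n_iter=2000):
    """Fit L2-penalized logistic regression by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Residual between the predicted probability and the 0/1 label.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        # The penalty term lam * w shrinks the weights; the bias is left unpenalized.
        w = [wj - lr * (gj + lam * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict_proba(w, b, x):
    """Estimated probability that sample x belongs to class 1."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Hypothetical two-feature toy data: class 1 has a high first and low second feature.
X = [[0.2, 1.1], [0.4, 0.9], [1.5, 0.3], [1.7, 0.1]]
y = [0, 0, 1, 1]
w, b = fit_plr(X, y)
p_new = predict_proba(w, b, [1.6, 0.2])
print(f"estimated P(class 1) = {p_new:.3f}")
```

Thresholding the returned probability at 0.5 gives a class label comparable to an SVM decision, while the probability itself carries the additional confidence information the abstract refers to.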

[Zhao2004Diagnosing] C. Y. Zhao, R. S. Zhang, H. X. Liu, C. X. Xue, S. G. Zhao, X. F. Zhou, M. C. Liu, and B. T. Fan. Diagnosing anorexia based on partial least squares, back propagation neural network, and support vector machines. J Chem Inf Comput Sci, 44(6):2040-6, 2004. [ bib | DOI | http | .pdf ]
A support vector machine (SVM), a novel type of learning machine, was used for the first time to develop a predictive model for early diagnosis of anorexia. It was based on the concentration of six elements (Zn, Fe, Mg, Cu, Ca, and Mn) and the age extracted from 90 cases. Compared with the results obtained from two other classifiers, partial least squares (PLS) and back-propagation neural network (BPNN), the SVM method exhibited the best overall performance. The accuracies for the test set by PLS, BPNN, and SVM methods were 52%, 65%, and 87%, respectively. Moreover, the models we proposed could also provide some insight into what factors were related to anorexia.

[Zhang2004Statistical] T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Stat., 32:56-134, 2004. [ bib | DOI | http | .pdf ]
We study how closely the optimal Bayes error rate can be approximately reached using a classification algorithm that computes a classifier by minimizing a convex upper bound of the classification error function. The measurement of closeness is characterized by the loss function used in the estimation. We show that such a classification scheme can be generally regarded as a (nonmaximum-likelihood) conditional in-class probability estimate, and we use this analysis to compare various convex loss functions that have appeared in the literature. Furthermore, the theoretical insight allows us to design good loss functions with desirable properties. Another aspect of our analysis is to demonstrate the consistency of certain classification methods using convex risk minimization. This study sheds light on the good performance of some recently proposed linear classification methods including boosting and support vector machines. It also shows their limitations and suggests possible improvements.

[Zhang2004Wavelet] Li Zhang, Weida Zhou, and Licheng Jiao. Wavelet support vector machine. IEEE Trans Syst Man Cybern B Cybern, 34(1):34-9, Feb 2004. [ bib | DOI | http | .pdf ]
An admissible support vector (SV) kernel (the wavelet kernel), by which we can construct a wavelet support vector machine (SVM), is presented. The wavelet kernel is a kind of multidimensional wavelet function that can approximate arbitrary nonlinear functions. The existence of wavelet kernels is proven by results of theoretic analysis. Computer simulations show the feasibility and validity of wavelet support vector machines (WSVMs) in regression and pattern recognition.

[Zhang2004Hidden] Li Zhang, Weida Zhou, and Licheng Jiao. Hidden space support vector machines. IEEE Trans Neural Netw, 15(6):1424-34, Nov 2004. [ bib | DOI | http | .pdf ]
Hidden space support vector machines (HSSVMs) are presented in this paper. The input patterns are mapped into a high-dimensional hidden space by a set of hidden nonlinear functions and then the structural risk is introduced into the hidden space to construct HSSVMs. Moreover, the conditions for the nonlinear kernel function in HSSVMs are more relaxed, and even differentiability is not required. Compared with support vector machines (SVMs), HSSVMs can adopt more kinds of kernel functions because the positive definite property of the kernel function is not a necessary condition. The performance of HSSVMs for pattern recognition and regression estimation is also analyzed. Experiments on artificial and real-world domains confirm the feasibility and the validity of our algorithms.

[Zhang2004Intrusion] Lian hua Zhang, Guan hua Zhang, Jie Zhang, and Ying cai Bai. Intrusion detection using rough set classification. J Zhejiang Univ Sci, 5(9):1076-86, Sep 2004. [ bib | DOI | http | .pdf ]
Recently, machine learning-based intrusion detection approaches have been the subject of extensive research because they can detect both misuse and anomaly. In this paper, rough set classification (RSC), a modern learning algorithm, is used to rank the features extracted for detecting intrusions and generate intrusion detection models. Feature ranking is a very critical step when building the model. RSC performs feature ranking before generating rules, and converts the feature ranking to a minimal hitting set problem, which is addressed using a genetic algorithm (GA). In classical approaches using a support vector machine (SVM), this is done by executing many iterations, each of which removes one useless feature. Compared with those methods, our method can avoid many iterations. In addition, a hybrid genetic algorithm is proposed to increase the convergence speed and decrease the training time of RSC. The models generated by RSC take the form of "IF-THEN" rules, which have the advantage of being easy to explain. Tests and comparison of RSC with SVM on DARPA benchmark data showed that for Probe and DoS attacks both RSC and SVM yielded highly accurate results (greater than 99% accuracy on testing set).

[Zangwill2004Heidelberg] Linda M Zangwill, Kwokleung Chan, Christopher Bowd, Jicuang Hao, Te-Won Lee, Robert N Weinreb, Terrence J Sejnowski, and Michael H Goldbaum. Heidelberg retina tomograph measurements of the optic disc and parapapillary retina for detecting glaucoma analyzed by machine learning classifiers. Invest Ophthalmol Vis Sci, 45(9):3144-51, Sep 2004. [ bib | DOI | http | .pdf ]
PURPOSE: To determine whether topographical measurements of the parapapillary region analyzed by machine learning classifiers can detect early to moderate glaucoma better than similarly processed measurements obtained within the disc margin and to improve methods for optimization of machine learning classifier feature selection. METHODS: One eye of each of 95 patients with early to moderate glaucomatous visual field damage and of each of 135 normal subjects older than 40 years participating in the longitudinal Diagnostic Innovations in Glaucoma Study (DIGS) were included. Heidelberg Retina Tomograph (HRT; Heidelberg Engineering, Dossenheim, Germany) mean height contour was measured in 36 equal sectors, both along the disc margin and in the parapapillary region (at a mean contour line radius of 1.7 mm). Each sector was evaluated individually and in combination with other sectors. Gaussian support vector machine (SVM) learning classifiers were used to interpret HRT sector measurements along the disc margin and in the parapapillary region, to differentiate between eyes with normal and glaucomatous visual fields and to compare the results with global and regional HRT parameter measurements. The area under the receiver operating characteristic (ROC) curve was used to measure diagnostic performance of the HRT parameters and to evaluate the cross-validation strategies and forward selection and backward elimination optimization techniques that were used to generate the reduced feature sets. RESULTS: The area under the ROC curve for mean height contour of the 36 sectors along the disc margin was larger than that for the mean height contour in the parapapillary region (0.97 and 0.85, respectively). Of the 36 individual sectors along the disc margin, those in the inferior region between 240 degrees and 300 degrees, had the largest area under the ROC curve (0.85-0.91). 
With SVM Gaussian techniques, the regional parameters showed the best ability to discriminate between normal eyes and eyes with glaucomatous visual field damage, followed by the global parameters, mean height contour measures along the disc margin, and mean height contour measures in the parapapillary region. The area under the ROC curve was 0.98, 0.94, 0.93, and 0.85, respectively. Cross-validation and optimization techniques demonstrated that good discrimination (99% of peak area under the ROC curve) can be obtained with a reduced number of HRT parameters. CONCLUSIONS: Mean height contour measurements along the disc margin discriminated between normal and glaucomatous eyes better than measurements obtained in the parapapillary region.

[Yuan2004SVMtm] Z. Yuan, J.S. Mattick, and R.D. Teasdale. SVMtm: support vector machines to predict transmembrane segments. J. Comput. Chem., 25(5):632, 6 2004. [ bib | DOI | http | .pdf ]
A new method has been developed for prediction of transmembrane helices using support vector machines. Different coding schemes of protein sequences were explored, and their performances were assessed by crossvalidation tests. The best performance method can predict the transmembrane helices with sensitivity of 93.4 of 92.0 given to show the strength of transmembrane signal and the prediction reliability. In particular, this method can distinguish transmembrane proteins from soluble proteins with an accuracy of approximately 99 helix prediction methods and can be used for consensus analysis of entire proteomes. The predictor is located at http://genet.imb.uq.edu.au/predictors/SVMtm.

Keywords: biosvm
[Yu2004Efficient] L. Yu and H. Liu. Efficient feature selection via analysis of relevance and redundancy. The Journal of Machine Learning Research, 5:1205-1224, 2004. [ bib ]
[Yu2004Advances] J. Yu, V.A. Smith, P.P. Wang, A.J. Hartemink, and E.D. Jarvis. Advances to Bayesian network inference for generating causal networks from observational biological data. Bioinformatics, 20(18):3594-3603, Dec 2004. [ bib | DOI | http ]
MOTIVATION: Network inference algorithms are powerful computational tools for identifying putative causal interactions among variables from observational data. Bayesian network inference algorithms hold particular promise in that they can capture linear, non-linear, combinatorial, stochastic and other types of relationships among variables across multiple levels of biological organization. However, challenges remain when applying these algorithms to limited quantities of experimental data collected from biological systems. Here, we use a simulation approach to make advances in our dynamic Bayesian network (DBN) inference algorithm, especially in the context of limited quantities of biological data. RESULTS: We test a range of scoring metrics and search heuristics to find an effective algorithm configuration for evaluating our methodological advances. We also identify sampling intervals and levels of data discretization that allow the best recovery of the simulated networks. We develop a novel influence score for DBNs that attempts to estimate both the sign (activation or repression) and relative magnitude of interactions among variables. When faced with limited quantities of observational data, combining our influence score with moderate data interpolation reduces a significant portion of false positive interactions in the recovered networks. Together, our advances allow DBN inference algorithms to be more effective in recovering biological networks from experimentally collected data. AVAILABILITY: Source code and simulated data are available upon request. SUPPLEMENTARY INFORMATION: http://www.jarvislab.net/Bioinformatics/BNAdvances/

Keywords: Algorithms; Bayes Theorem; Computer Simulation; Gene Expression Profiling; Gene Expression Regulation; Models, Genetic; Models, Statistical; Oligonucleotide Array Sequence Analysis; Signal Transduction; Software
[Yu2004integrated] J.K. Yu, Y.D. Chen, and S. Zheng. An integrated approach to the detection of colorectal cancer utilizing proteomics and bioinformatics. World J. Gastroenterol., 10(21):3127-3131, 2004. [ bib | .pdf ]
AIM: To find new potential biomarkers and to establish patterns for early detection of colorectal cancer. METHODS: One hundred and eighty-two serum samples including 55 from colorectal cancer (CRC) patients, 35 from colorectal adenoma (CRA) patients and 92 from healthy persons (HP) were detected by surface-enhanced laser desorption/ionization mass spectrometry (SELDI-MS). The data of spectra were analyzed by bioinformatics tools like artificial neural network (ANN) and support vector machine (SVM). RESULTS: The diagnostic pattern combined with 7 potential biomarkers could differentiate CRC patients from CRA patients with a specificity of 83 The diagnostic pattern combined with 4 potential biomarkers could differentiate CRC patients from HP with a specificity of 92 sensitivity of 89 The combination of SELDI with bioinformatics tools could help find new biomarkers and establish patterns with high sensitivity and specificity for the detection of CRC.

Keywords: biosvm
[Yu2004PEBL] H. Yu, J. Han, and K. C.-C. Chang. PEBL: Web page classification without negative examples. IEEE Trans. Knowl. Data Eng., 16(1):70-81, 2004. [ bib | DOI | http | .pdf ]
Keywords: PUlearning
[Yu2004Predicting] C.-S. Yu, C.-J. Lin, and J.-K. Hwang. Predicting subcellular localization of proteins for Gram-negative bacteria by support vector machines based on n-peptide compositions. Protein Sci., 13(5):1402-1406, 2004. [ bib | DOI | http | .pdf ]
Gram-negative bacteria have five major subcellular localization sites: the cytoplasm, the periplasm, the inner membrane, the outer membrane, and the extracellular space. The subcellular location of a protein can provide valuable information about its function. With the rapid increase of sequenced genomic data, the need for an automated and accurate tool to predict subcellular localization becomes increasingly important. We present an approach to predict subcellular localization for Gram-negative bacteria. This method uses the support vector machines trained by multiple feature vectors based on n-peptide compositions. For a standard data set comprising 1443 proteins, the overall prediction accuracy reaches 89 highest prediction rate ever reported. Our prediction is 14 than that of the recently developed multimodular PSORT-B. Because of its simplicity, this approach can be easily extended to other organisms and should be a useful tool for the high-throughput and large-scale analysis of proteomic and genomic data.

Keywords: biosvm
[Yao2004Comparative] X. J. Yao, A. Panaye, J. P. Doucet, R. S. Zhang, H. F. Chen, M. C. Liu, Z. D. Hu, and B. T. Fan. Comparative study of QSAR/QSPR correlations using support vector machines, radial basis function neural networks, and multiple linear regression. J Chem Inf Comput Sci, 44(4):1257-66, 2004. [ bib | DOI | http | .pdf ]
Support vector machines (SVMs) were used to develop QSAR models that correlate molecular structures to their toxicity and bioactivities. The performance and predictive ability of SVM are investigated and compared with other methods such as multiple linear regression and radial basis function neural network methods. In the present study, two different data sets were evaluated. The first one involves an application of SVM to the development of a QSAR model for the prediction of toxicities of 153 phenols, and the second investigation deals with the QSAR model between the structures and the activities of a set of 85 cyclooxygenase 2 (COX-2) inhibitors. For each application, the molecular structures were described using either the physicochemical parameters or molecular descriptors. In both studied cases, the predictive ability of the SVM model is comparable or superior to those obtained by MLR and RBFNN. The results indicate that SVM can be used as an alternative powerful modeling tool for QSAR studies.

Keywords: biosvm chemoinformatics
[Yang2004Bio-support] Z. R. Yang and K.-C. Chou. Bio-support vector machines for computational proteomics. Bioinformatics, 20(5):735-741, 2004. [ bib | http | .pdf ]
Motivation: One of the most important issues in computational proteomics is to produce a prediction model for the classification or annotation of biological function of novel protein sequences. In order to improve the prediction accuracy, much attention has been paid to improving the performance of the algorithms used, but little to the fundamental issue of amino acid encoding, since most existing pattern recognition algorithms are unable to recognize amino acids in protein sequences. Importantly, the most commonly used amino acid encoding method has a flaw that leads to large computational cost and recognition bias. Results: By replacing kernel functions of support vector machines (SVMs) with amino acid similarity measurement matrices, we have modified SVMs, a new type of pattern recognition algorithm for analysing protein sequences, particularly for proteolytic cleavage site prediction. We refer to the modified SVMs as bio-support vector machines. When applied to the prediction of HIV protease cleavage sites, the new method has shown a remarkable advantage in reducing the model complexity and enhancing the model robustness.

Keywords: biosvm
[Yang2004Biological] Zheng Rong Yang. Biological applications of support vector machines. Brief Bioinform, 5(4):328-38, Dec 2004. [ bib ]
[Yan2004Identification] C. Yan, V. Honavar, and D. Dobbs. Identification of interface residues in protease-inhibitor and antigen-antibody complexes: a support vector machine. Neural Comput. & Applic., 13:123-129, 2004. [ bib | DOI | .pdf ]
Keywords: biosvm
[Yan2004two-stage] C. Yan, D. Dobbs, and V. Honavar. A two-stage classifier for identification of protein-protein interface residues. Bioinformatics, 20(Suppl. 1):i371-i378, 2004. [ bib | http | .pdf ]
Motivation: The ability to identify protein-protein interaction sites and to detect specific amino acid residues that contribute to the specificity and affinity of protein interactions has important implications for problems ranging from rational drug design to analysis of metabolic and signal transduction networks. Results: We have developed a two-stage method consisting of a support vector machine (SVM) and a Bayesian classifier for predicting surface residues of a protein that participate in protein-protein interactions. This approach exploits the fact that interface residues tend to form clusters in the primary amino acid sequence. Our results show that the proposed two-stage classifier outperforms previously published sequence-based methods for predicting interface residues. We also present results obtained using the two-stage classifier on an independent test set of seven CAPRI (Critical Assessment of PRedicted Interactions) targets. The success of the predictions is validated by examining the predictions in the context of the three-dimensional structures of protein complexes. Supplementary information: http://www.public.iastate.edu/~chhyan/ISMB2004/list.html

Keywords: biosvm
[Yamanishi2004Protein] Y. Yamanishi, J.-P. Vert, and M. Kanehisa. Protein network inference from multiple genomic data: a supervised approach. Bioinformatics, 20:i363-i370, 2004. [ bib | http | .pdf ]
Motivation: An increasing number of observations support the hypothesis that most biological functions involve the interactions between many proteins, and that the complexity of living systems arises as a result of such interactions. In this context, the problem of inferring a global protein network for a given organism, using all available genomic data about the organism, is quickly becoming one of the main challenges in current computational biology. Results: This paper presents a new method to infer protein networks from multiple types of genomic data. Based on a variant of kernel canonical correlation analysis, its originality is in the formalization of the protein network inference problem as a supervised learning problem, and in the integration of heterogeneous genomic data within this framework. We present promising results on the prediction of the protein network for the yeast Saccharomyces cerevisiae from four types of widely available data: gene expressions, protein interactions measured by yeast two-hybrid systems, protein localizations in the cell and protein phylogenetic profiles. The method is shown to outperform other unsupervised protein network inference methods. We finally conduct a comprehensive prediction of the protein network for all proteins of the yeast, which enables us to propose protein candidates for missing enzymes in a biosynthesis pathway. Availability: Software is available upon request.

Keywords: biosvm
[Yamanishi2004Heterogeneous] Y. Yamanishi, J.-P. Vert, and M. Kanehisa. Heterogeneous data comparison and gene selection with kernel canonical correlation analysis. In B. Schölkopf, K. Tsuda, and J.P. Vert, editors, Kernel Methods in Computational Biology, pages 209-230. MIT Press, 2004. [ bib | www: ]
Keywords: biosvm
[Xue2004Prediction] Y. Xue, C. W. Yap, L. Z. Sun, Z. W. Cao, J. F. Wang, and Y. Z. Chen. Prediction of P-glycoprotein substrates by a support vector machine approach. J Chem Inf Comput Sci, 44(4):1497-505, 2004. [ bib | DOI | http | .pdf ]
P-glycoproteins (P-gp) actively transport a wide variety of chemicals out of cells and function as drug efflux pumps that mediate multidrug resistance and limit the efficacy of many drugs. Methods for facilitating early elimination of potential P-gp substrates are useful for facilitating new drug discovery. A computational ensemble pharmacophore model has recently been used for the prediction of P-gp substrates with a promising accuracy of 63%. It is desirable to extend the prediction range beyond compounds covered by the known pharmacophore models. For such a purpose, a machine learning method, support vector machine (SVM), was explored for the prediction of P-gp substrates. A set of 201 chemical compounds, including 116 substrates and 85 nonsubstrates of P-gp, was used to train and test a SVM classification system. This SVM system gave a prediction accuracy of at least 81.2% for P-gp substrates based on two different evaluation methods, which is substantially improved against that obtained from the multiple-pharmacophore model. The prediction accuracy for nonsubstrates of P-gp is 79.2% using 5-fold cross-validation. These accuracies are slightly better than those obtained from other statistical classification methods, including k-nearest neighbor (k-NN), probabilistic neural networks (PNN), and C4.5 decision tree, that use the same sets of data and molecular descriptors. Our study indicates the potential of SVM in facilitating the prediction of P-gp substrates.

Keywords: biosvm
[Xue2004Effect] Y. Xue, Z. R. Li, C. W. Yap, L. Z. Sun, X. Chen, and Y. Z. Chen. Effect of molecular descriptor feature selection in support vector machine classification of pharmacokinetic and toxicological properties of chemical agents. J Chem Inf Comput Sci, 44(5):1630-8, 2004. [ bib | DOI | http | .pdf ]
Statistical-learning methods have been developed for facilitating the prediction of pharmacokinetic and toxicological properties of chemical agents. These methods employ a variety of molecular descriptors to characterize structural and physicochemical properties of molecules. Some of these descriptors are specifically designed for the study of a particular type of properties or agents, and their use for other properties or agents might generate noise and affect the prediction accuracy of a statistical learning system. This work examines to what extent the reduction of this noise can improve the prediction accuracy of a statistical learning system. A feature selection method, recursive feature elimination (RFE), is used to automatically select molecular descriptors for support vector machines (SVM) prediction of P-glycoprotein substrates (P-gp), human intestinal absorption of molecules (HIA), and agents that cause torsades de pointes (TdP), a rare but serious side effect. RFE significantly reduces the number of descriptors for each of these properties thereby increasing the computational speed for their classification. The SVM prediction accuracies of P-gp and HIA are substantially increased and that of TdP remains unchanged by RFE. These prediction accuracies are comparable to those of earlier studies derived from a selective set of descriptors. Our study suggests that molecular feature selection is useful for improving the speed and, in some cases, the accuracy of statistical learning methods for the prediction of pharmacokinetic and toxicological properties of chemical agents.

Keywords: biosvm
[Xue2004Study] C. X. Xue, R. S. Zhang, M. C. Liu, Z. D. Hu, and B. T. Fan. Study of the quantitative structure-mobility relationship of carboxylic acids in capillary electrophoresis based on support vector machines. J Chem Inf Comput Sci, 44(3):950-7, 2004. [ bib | DOI | http | .pdf ]
The support vector machines (SVM), as a novel type of learning machine, were used to develop a quantitative structure-mobility relationship (QSMR) model of 58 aliphatic and aromatic carboxylic acids based on molecular descriptors calculated from the structure alone. Multiple linear regression (MLR) and radial basis function neural networks (RBFNNs) were also utilized to construct the linear and the nonlinear model to compare with the results obtained by SVM. The root-mean-square errors in absolute mobility predictions for the whole data set given by MLR, RBFNNs, and SVM were 1.530, 1.373, and 0.888 mobility units (10^-5 cm^2 s^-1 V^-1), respectively, which indicated that the prediction result agrees well with the experimental values of these compounds and also revealed the superiority of SVM over MLR and RBFNNs models for the prediction of the absolute mobility of carboxylic acids. Moreover, the models we proposed could also provide some insight into what structural features are related to the absolute mobility of aliphatic and aromatic carboxylic acids.

Keywords: biosvm
[Xue2004QSAR] C. X. Xue, R. S. Zhang, H. X. Liu, X. J. Yao, M. C. Liu, Z. D. Hu, and B. T. Fan. QSAR models for the prediction of binding affinities to human serum albumin using the heuristic method and a support vector machine. J Chem Inf Comput Sci, 44(5):1693-700, 2004. [ bib | DOI | http | .pdf ]
The binding affinities to human serum albumin for 94 diverse drugs and drug-like compounds were modeled with the descriptors calculated from the molecular structure alone using a quantitative structure-activity relationship (QSAR) technique. The heuristic method (HM) and support vector machine (SVM) were utilized to construct the linear and nonlinear prediction models, leading to a good correlation coefficient (R2) of 0.86 and 0.94 and root-mean-square errors (rms) of 0.212 and 0.134 albumin drug binding affinity units, respectively. Furthermore, the models were evaluated by a 10 compound external test set, yielding R2 of 0.71 and 0.89 and rms error of 0.430 and 0.222. The specific information described by the heuristic linear model could give some insights into the factors that are likely to govern the binding affinity of the compounds and be used as an aid to the drug design process; however, the prediction results of the nonlinear SVM model seem to be better than that of the HM.

Keywords: biosvm
[Xue2004accurate] C. X. Xue, R. S. Zhang, H. X. Liu, X. J. Yao, M. C. Liu, Z. D. Hu, and B. T. Fan. An accurate QSPR study of O-H bond dissociation energy in substituted phenols based on support vector machines. J Chem Inf Comput Sci, 44(2):669-77, 2004. [ bib | DOI | http | .pdf ]
The support vector machine (SVM), as a novel type of learning machine, was used to develop a Quantitative Structure-Property Relationship (QSPR) model of the O-H bond dissociation energy (BDE) of 78 substituted phenols. The six descriptors calculated solely from the molecular structures of compounds selected by forward stepwise regression were used as inputs for the SVM model. The root-mean-square (rms) errors in BDE predictions for the training, test, and overall data sets were 3.808, 3.320, and 3.713 BDE units (kJ mol^-1), respectively. The results obtained by Gaussian-kernel SVM were much better than those obtained by multiple linear regression, radial basis function neural networks, linear-kernel SVM, and other QSPR approaches.

Keywords: biosvm
[Xue2004Support] C. X. Xue, R. S. Zhang, H. X. Liu, M. C. Liu, Z. D. Hu, and B. T. Fan. Support vector machines-based quantitative structure-property relationship for the prediction of heat capacity. J Chem Inf Comput Sci, 44(4):1267-74, 2004. [ bib | DOI | http | .pdf ]
The support vector machine (SVM), as a novel type of learning machine, for the first time, was used to develop a Quantitative Structure-Property Relationship (QSPR) model of the heat capacity of a diverse set of 182 compounds based on the molecular descriptors calculated from the structure alone. Multiple linear regression (MLR) and radial basis function networks (RBFNNs) were also utilized to construct quantitative linear and nonlinear models to compare with the results obtained by SVM. The root-mean-square (rms) errors in heat capacity predictions for the whole data set given by MLR, RBFNNs, and SVM were 4.648, 4.337, and 2.931 heat capacity units, respectively. The prediction results are in good agreement with the experimental value of heat capacity; also, the results reveal the superiority of the SVM over MLR and RBFNNs models.

Keywords: biosvm
[Xu2004Molecular] Xiu-Qin Xu, Chon K Leow, Xin Lu, Xuegong Zhang, Jun S Liu, Wing-Hung Wong, Arndt Asperger, Sören Deininger, and Hon-Chiu Eastwood Leung. Molecular classification of liver cirrhosis in a rat model by proteomics and bioinformatics. Proteomics, 4(10):3235-45, Oct 2004. [ bib | DOI | http | .pdf ]
Liver cirrhosis is a worldwide health problem. Reliable, noninvasive methods for early detection of liver cirrhosis are not available. Using a three-step approach, we classified sera from rats with liver cirrhosis following different treatment insults. The approach consisted of: (i) protein profiling using surface-enhanced laser desorption/ionization (SELDI) technology; (ii) selection of a statistically significant serum biomarker set using machine learning algorithms; and (iii) identification of selected serum biomarkers by peptide sequencing. We generated serum protein profiles from three groups of rats: (i) normal (n=8), (ii) thioacetamide-induced liver cirrhosis (n=22), and (iii) bile duct ligation-induced liver fibrosis (n=5) using a weak cation exchanger surface. Profiling data were further analyzed by a recursive support vector machine algorithm to select a panel of statistically significant biomarkers for class prediction. Sensitivity and specificity of classification using the selected protein marker set were higher than 92%. A consistently down-regulated 3495 Da protein in cirrhosis samples was one of the selected significant biomarkers. This 3495 Da protein was purified on-chip and trypsin digested. Further structural characterization of this biomarker candidate was done by using cross-platform matrix-assisted laser desorption/ionization mass spectrometry (MALDI-MS) peptide mass fingerprinting (PMF) and matrix-assisted laser desorption/ionization time of flight/time of flight (MALDI-TOF/TOF) tandem mass spectrometry (MS/MS). Combined data from PMF and MS/MS spectra of two tryptic peptides suggested that this 3495 Da protein shared homology to a histidine-rich glycoprotein. These results demonstrated a novel approach to discovery of new biomarkers for early detection of liver cirrhosis and classification of liver diseases.

Keywords: biosvm
[Xing2004LOGOS] E. P. Xing, W. Wu, M. I. Jordan, and R. M. Karp. LOGOS: A modular Bayesian model for de novo motif detection. J. Bioinform. Comput. Biol., 2:127-154, 2004. [ bib | DOI | http | .pdf ]
The complexity of the global organization and internal structure of motifs in higher eukaryotic organisms raises significant challenges for motif detection techniques. To achieve successful de novo motif detection, it is necessary to model the complex dependencies within and among motifs and to incorporate biological prior knowledge. In this paper, we present LOGOS, an integrated LOcal and GlObal motif Sequence model for biopolymer sequences, which provides a principled framework for developing, modularizing, extending and computing expressive motif models for complex biopolymer sequence analysis. LOGOS consists of two interacting submodels: HMDM, a local alignment model capturing biological prior knowledge and positional dependency within the motif local structure; and HMM, a global motif distribution model modeling frequencies and dependencies of motif occurrences. Model parameters can be fit using training motifs within an empirical Bayesian framework. A variational EM algorithm is developed for de novo motif detection. LOGOS improves over existing models that ignore biological priors and dependencies in motif structures and motif occurrences, and demonstrates superior performance on both semi-realistic test data and cis-regulatory sequences from yeast and Drosophila genomes with regard to sensitivity, specificity, flexibility and extensibility.

Keywords: biogm
[Xing2004MotifPrototyper] E. Xing and R. Karp. MotifPrototyper: A Bayesian profile model for motif families. PNAS, 101(29):10523-10528, 2004. [ bib ]
[Xia2004one-layer] Youshen Xia and Jun Wang. A one-layer recurrent neural network for support vector machine learning. IEEE Trans Syst Man Cybern B Cybern, 34(2):1261-9, Apr 2004. [ bib ]
This paper presents a one-layer recurrent neural network for support vector machine (SVM) learning in pattern classification and regression. The SVM learning problem is first converted into an equivalent formulation, and then a one-layer recurrent neural network for SVM learning is proposed. The proposed neural network is guaranteed to obtain the optimal solution of support vector classification and regression. Compared with the existing two-layer neural network for the SVM classification, the proposed neural network has a low complexity for implementation. Moreover, the proposed neural network can converge exponentially to the optimal solution of SVM learning. The rate of the exponential convergence can be made arbitrarily high by simply turning up a scaling parameter. Simulation examples based on benchmark problems are discussed to show the good performance of the proposed neural network for SVM learning.

[Xia2004RNAi] H. Xia, Q. Mao, S. L. Eliason, S. Q. Harper, I. H. Martins, H. T. Orr, H. L. Paulson, L. Yang, R. M. Kotin, and B. L. Davidson. RNAi suppresses polyglutamine-induced neurodegeneration in a model of spinocerebellar ataxia. Nat. Med., 10(8):816-820, Aug 2004. [ bib | DOI | http ]
The dominant polyglutamine expansion diseases, which include spinocerebellar ataxia type 1 (SCA1) and Huntington disease, are progressive, untreatable, neurodegenerative disorders. In inducible mouse models of SCA1 and Huntington disease, repression of mutant allele expression improves disease phenotypes. Thus, therapies designed to inhibit expression of the mutant gene would be beneficial. Here we evaluate the ability of RNA interference (RNAi) to inhibit polyglutamine-induced neurodegeneration caused by mutant ataxin-1 in a mouse model of SCA1. Upon intracerebellar injection, recombinant adeno-associated virus (AAV) vectors expressing short hairpin RNAs profoundly improved motor coordination, restored cerebellar morphology and resolved characteristic ataxin-1 inclusions in Purkinje cells of SCA1 mice. Our data demonstrate in vivo the potential use of RNAi as therapy for dominant neurodegenerative disease.

Keywords: Adenoviridae, Animal, Animals, Blotting, Brain, Cells, Comparative Study, Cultured, Disease Models, Gene Expression, Genetic, Glutamine, Immunohistochemistry, Messenger, Mice, Nerve Degeneration, Nerve Tissue Proteins, Non-U.S. Gov't, Northern, Nuclear Proteins, P.H.S., Plasmids, Psychomotor Performance, Purkinje Cells, RNA, RNA Interference, Research Support, Reverse Transcriptase Polymerase Chain Reaction, Small Interfering, Spinocerebellar Ataxias, Transduction, Transgenic, U.S. Gov't, 15286770
[Winters-Hilt2004Nanopore] Stephen Winters-Hilt and Mark Akeson. Nanopore cheminformatics. DNA Cell Biol, 23(10):675-83, Oct 2004. [ bib | DOI | http ]
A cheminformatics method is described for classification, and biophysical examination, of individual molecules. A novel molecular detector is used, one based on current blockade measurements through a nanometer-scale ion channel (alpha-hemolysin). Classification results are described for blockades caused by DNA molecules in the alpha-hemolysin nanopore detector, with signal analysis and pattern recognition performed using a combination of methods from bioinformatics and machine learning. Due to the size of the alpha-hemolysin protein channel, the blockade events report on one DNA molecule at a time, which enables a variety of reproducible, single-molecule biophysical experiments. To capture the full sensitivity of the nanopore detector's blockade signal, Hidden Markov Models (HMMs) were used with Expectation/Maximization for denoising and for associating a feature vector with the ionic current blockade of each captured DNA molecule. Support Vector Machines (SVMs) that employ novel kernel designs were then used as discriminators. With SVM training performed off-line, and economical HMM processing on-line, blockade classification was possible during capture. HMMs were also used in conjunction with a time-domain finite state automaton (off-line) for feature discovery and kinetics analysis. Analysis of the DNA data indicates a variety of binding (DNA-protein), fraying, and conformational shifts that are consistent with data obtained from thermodynamic analyses (melting curves), X-ray crystallography, and NMR studies. The software tools are designed for analysis of generic blockades in ionic channels, including those in other biological pore-forming toxins, other biological channels in general, and semiconductor-based channels.

Keywords: Algorithms, Artificial Intelligence, Ascomycota, Automated, Base Sequence, Chromosome Mapping, Codon, Comparative Study, Crystallography, DNA, DNA Primers, Hordeum, Host-Parasite Relations, Informatics, Kinetics, Magnetic Resonance Spectroscopy, Nanotechnology, Non-U.S. Gov't, Pattern Recognition, Plant, Plants, Research Support, Sequence Alignment, Sequence Analysis, Thermodynamics, X-Ray, 15585125
[Williams2004Prognostic] R.D. Williams, S.N. Hing, B.T. Greer, C.C. Whiteford, J.S. Wei, R. Natrajan, A. Kelsey, S. Rogers, C. Campbell, K. Pritchard-Jones, and J. Khan. Prognostic classification of relapsing favorable histology Wilms tumor using cDNA microarray expression profiling and support vector machines. Genes Chromosomes Cancer, 41(1):65-79, Sep 2004. [ bib | DOI | http | .pdf ]
Treatment of Wilms tumor has a high success rate, with some 85% of patients achieving long-term survival. However, late effects of treatment and management of relapse remain significant clinical problems. If accurate prognostic methods were available, effective risk-adapted therapies could be tailored to individual patients at diagnosis. Few molecular prognostic markers for Wilms tumor are currently defined, though previous studies have linked allele loss on 1p or 16q, genomic gain of 1q, and overexpression from 1q with an increased risk of relapse. To identify specific patterns of gene expression that are predictive of relapse, we used high-density (30 k) cDNA microarrays to analyze RNA samples from 27 favorable histology Wilms tumors taken from primary nephrectomies at the time of initial diagnosis. Thirteen of these tumors relapsed within 2 years. Genes differentially expressed between the relapsing and nonrelapsing tumor classes were identified by statistical scoring (t test). These genes encode proteins with diverse molecular functions, including transcription factors, developmental regulators, apoptotic factors, and signaling molecules. Use of a support vector machine classifier, feature selection, and test evaluation using cross-validation led to identification of a generalizable expression signature, a small subset of genes whose expression potentially can be used to predict tumor outcome in new samples. Similar methods were used to identify genes that are differentially expressed between tumors with and without genomic 1q gain. This set of discriminators was highly enriched in genes on 1q, indicating close agreement between data obtained from expression profiling with data from genomic copy number analyses.

Keywords: biosvm
[Wilhelm2004Analysis] T. Wilhelm, J. Behre, and S. Schuster. Analysis of structural robustness of metabolic networks. Syst Biol, 1(1):114-120, Jun 2004. [ bib | .pdf ]
We study the structural robustness of metabolic networks on the basis of the concept of elementary flux modes. It is shown that the number of elementary modes itself is not an appropriate measure of structural robustness. Instead, we introduce three new robustness measures. These are based on the relative number of elementary modes remaining after the knockout of enzymes. We discuss the relevance of these measures with the help of simple examples, as well as with larger, realistic metabolic networks. Thereby we demonstrate quantitatively that the metabolism of Escherichia coli, which must be able to adapt to varying conditions, is more robust than the metabolism of the human erythrocyte, which lives under much more homeostatic conditions.

[Weinberger2004Learning] K. Q. Weinberger, F. Sha, and L. K. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In ICML '04: Twenty-first international conference on Machine learning, New York, NY, USA, 2004. ACM Press. [ bib | DOI | www: ]
We investigate how to learn a kernel matrix for high dimensional data that lies on or near a low dimensional manifold. Noting that the kernel matrix implicitly maps the data into a nonlinear feature space, we show how to discover a mapping that "unfolds" the underlying manifold from which the data was sampled. The kernel matrix is constructed by maximizing the variance in feature space subject to local constraints that preserve the angles and distances between nearest neighbors. The main optimization involves an instance of semidefinite programming, a fundamentally different computation than previous algorithms for manifold learning, such as Isomap and locally linear embedding. The optimized kernels perform better than polynomial and Gaussian kernels for problems in manifold learning, but worse for problems in large margin classification. We explain these results in terms of the geometric properties of different kernels and comment on various interpretations of other manifold learning algorithms as kernel methods.

Keywords: dimred
[Weinberger2004Unsupervised] K. Q. Weinberger and L. K. Saul. Unsupervised Learning of Image Manifolds by Semidefinite Programming. In 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'04), volume 2, pages 988-995, 2004. [ bib | DOI | www: ]
Can we detect low dimensional structure in high dimensional data sets of images and video? The problem of dimensionality reduction arises often in computer vision and pattern recognition. In this paper, we propose a new solution to this problem based on semidefinite programming. Our algorithm can be used to analyze high dimensional data that lies on or near a low dimensional manifold. It overcomes certain limitations of previous work in manifold learning, such as Isomap and locally linear embedding. We illustrate the algorithm on easily visualized examples of curves and surfaces, as well as on actual images of faces, handwritten digits, and solid objects.

Keywords: dimred
[Weathers2004Reduced] E. A. Weathers, M. E. Paulaitis, T. B. Woolf, and J. H. Hoh. Reduced amino acid alphabet is sufficient to accurately recognize intrinsically disordered protein. FEBS Lett., 576(3):348-352, 2004. [ bib | DOI | http | .pdf ]
Intrinsically disordered proteins are an important class of proteins with unique functions and properties. Here, we have applied a support vector machine (SVM) trained on naturally occurring disordered and ordered proteins to examine the contribution of various parameters (vectors) to recognizing proteins that contain disordered regions. We find that a SVM that incorporates only amino acid composition has a recognition accuracy of 87+/-2%, suggesting that composition alone is sufficient to accurately recognize disorder. Interestingly, SVMs using reduced sets of amino acids based on chemical similarity preserve high recognition accuracy. A set as small as four retains an accuracy of 84+/-2%, suggesting that general physicochemical properties rather than specific amino acids are important factors contributing to protein disorder.

Keywords: biosvm
[Waring2004Interlaboratory] Jeffrey F Waring, Roger G Ulrich, Nick Flint, David Morfitt, Arno Kalkuhl, Frank Staedtler, Michael Lawton, Johanna M Beekman, and Laura Suter. Interlaboratory evaluation of rat hepatic gene expression changes induced by methapyrilene. Environ Health Perspect, 112(4):439-48, Mar 2004. [ bib ]
Several studies using microarrays have shown that changes in gene expression provide information about the mechanism of toxicity induced by xenobiotic agents. Nevertheless, the issue of whether gene expression profiles are reproducible across different laboratories remains to be determined. To address this question, several members of the Hepatotoxicity Working Group of the International Life Sciences Institute Health and Environmental Sciences Institute evaluated the liver gene expression profiles of rats treated with methapyrilene (MP). Animals were treated at one facility, and RNA was distributed to five different sites for gene expression analysis. A preliminary evaluation of the number of modulated genes uncovered striking differences between the five different sites. However, additional data analysis demonstrated that these differences had an effect on the absolute gene expression results but not on the outcome of the study. For all users, unsupervised algorithms showed that gene expression allows the distinction of the high dose of MP from controls and low dose. In addition, the use of a supervised analysis method (support vector machines) made it possible to correctly classify samples. In conclusion, the results show that, despite some variability, robust gene expression changes were consistent between sites. In addition, key expression changes related to the mechanism of MP-induced hepatotoxicity were identified. These results provide critical information regarding the consistency of microarray results across different laboratories and shed light on the strengths and limitations of expression profiling in drug safety analysis.

Keywords: biosvm
[Wang2004Bipartie] Y. Wang, F. Makedon, and J. Ford. A bipartite graph matching framework for finding correspondences between structural elements in two proteins. In Proceedings of the 26th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2004. [ bib | .pdf ]
A protein molecule consists of one or more chains of amino acid sequences that fold into a complex three-dimensional structure. A protein's functions are often determined by its 3D structure, and so comparing the similarity of 3D structures between proteins is an important problem. To accomplish such comparison, one must align two proteins properly with rotation and translation in 3D space. Finding the correspondences between structural elements in the two proteins is the key step in many protein structure alignment algorithms. In this paper, we introduce a new graph theoretic framework based on bipartite graph matching for finding sufficiently good correspondences. It is capable of providing both sequence-dependent and sequence-independent correspondences. It is a general framework for pair-wise matching of atoms, amino acid residues or secondary structure elements.

[Wang2004Objective] S. Wang, H. Li, F. Qi, and Y. Zhao. Objective facial paralysis grading based on Pface and eigenflow. Med Biol Eng Comput, 42(5):598-603, Sep 2004. [ bib ]
To provide physicians with an objective and quantitative measurement of single-sided facial paralysis, the paper presents a computer-based approach that is different from the nine existing, subjective and hand-performed international scales, such as House-Brackmann. For voluntary expressions of a patient, this approach used Pface, which stems from Dface, to measure the asymmetry between two sides of the face and used eigenflow to measure the expression variations between the patient and normal subjects. The results from Pface and eigenflow were then combined by the support vector machine to produce Pdegree. A study of 25 subjects revealed that Pdegree could differentiate paralysis states (Pdegree ≥ 0) and normal states (Pdegree < 0), with the ability to grade facial paralysis automatically. Moreover, the Pface of specific facial areas can be used in the supervision of the rehabilitation process.

Keywords: Adolescent, Adult, Computer-Assisted, Facial Asymmetry, Facial Expression, Facial Paralysis, Female, Humans, Image Interpretation, Male, Middle Aged, Motion, Photography, Severity of Illness Index, 15503959
[Wang2004Support] M-L. Wang, W-J. Li, M-L. Wang, and W-B. Xu. Support vector machines for prediction of peptidyl prolyl cis/trans isomerization. J Pept Res, 63(1):23-8, Jan 2004. [ bib ]
A new method for peptidyl prolyl cis/trans isomerization prediction based on the theory of support vector machines (SVM) was introduced. The SVM represents a new approach to supervised pattern classification and has been successfully applied to a wide range of pattern recognition problems. In this study, six training datasets, each consisting of local sequences of a different length, were used. Polynomial kernel functions with different values of the parameter d were chosen. The test for the independent testing dataset and the jackknife test were both carried out. When the local sequence length was 20 residues and the parameter d = 8, the SVM method achieved the best performance, with the correct rate for the cis and trans forms reaching 70.4 and 69.7% for the independent testing dataset, and 76.7 and 76.6% for the jackknife test, respectively. Matthews correlation coefficients for the jackknife test could reach about 0.5. The results obtained through this study indicated that the SVM method would become a powerful tool for predicting peptidyl prolyl cis/trans isomerization.

Keywords: biosvm
[Wang2004Weighted-support] M. Wang, J. Yang, G.-P. Liu, Z.-J. Xu, and K.-C. Chou. Weighted-support vector machines for predicting membrane protein types based on pseudo-amino acid composition. Protein Eng. Des. Sel., 17(6):509-516, 2004. [ bib | DOI | arXiv | http | .pdf ]
Membrane proteins are generally classified into the following five types: (1) type I membrane proteins, (2) type II membrane proteins, (3) multipass transmembrane proteins, (4) lipid chain-anchored membrane proteins and (5) GPI-anchored membrane proteins. Prediction of membrane protein types has become one of the growing hot topics in bioinformatics. Currently, we are facing two critical challenges in this area: first, how to take into account the extremely complicated sequence-order effects, and second, how to deal with the highly uneven sizes of the subsets in a training dataset. In this paper, stimulated by the concept of using the pseudo-amino acid composition to incorporate the sequence-order effects, the spectral analysis technique is introduced to represent the statistical sample of a protein. Based on such a framework, the weighted support vector machine (SVM) algorithm is applied. The new approach has remarkable power in dealing with the bias caused by the situation when one subset in the training dataset contains many more samples than the other. The new method is particularly useful when our focus is aimed at proteins belonging to small subsets. The results obtained by the self-consistency test, jackknife test and independent dataset test are encouraging, indicating that the current approach may serve as a powerful complementary tool to other existing methods for predicting the types of membrane proteins.

Keywords: biosvm
[Wang2004Predicting] Long-Hui Wang, Juan Liu, Yan-Fu Li, and Huai-Bei Zhou. Predicting protein secondary structure by a support vector machine based on a new coding scheme. Genome Inform Ser Workshop Genome Inform, 15(2):181-90, 2004. [ bib | .html ]
Protein structure prediction is one of the most important problems in modern computational biology. Protein secondary structure prediction is a key step in prediction of protein tertiary structure. Many methods based on machine learning techniques, such as neural networks (NN) and support vector machines (SVM), have emerged for the prediction of secondary structures. In this paper, a new method was proposed based on SVM. Different from the existing methods, this method takes into account the physical-chemical properties and structure properties of amino acids. When tested on the most popular dataset CB513, it achieved a Q3 accuracy of 0.7844, which illustrates that it is one of the top-performing methods for protein secondary structure prediction.

Keywords: biosvm
[Wang2004Simple] Kai Wang, Ekachai Jenwitheesuk, Ram Samudrala, and John E Mittler. Simple linear model provides highly accurate genotypic predictions of HIV-1 drug resistance. Antivir Ther, 9(3):343-52, Jun 2004. [ bib ]
Drug resistance is a major obstacle to the successful treatment of HIV-1 infection. Genotypic assays are used widely to provide indirect evidence of drug resistance, but the performance of these assays has been mixed. We used standard stepwise linear regression to construct drug resistance models for seven protease inhibitors and 10 reverse transcriptase inhibitors using data obtained from the Stanford HIV drug resistance database. We evaluated these models by hold-one-out experiments and by tests on an independent dataset. Our linear model outperformed other publicly available genotypic interpretation algorithms, including decision tree, support vector machine and four rules-based algorithms (HIVdb, VGI, ANRS and Rega) under both tests. Interestingly, our model did well despite the absence of any terms for interactions between different residues in protease or reverse transcriptase. The resulting linear models are easy to understand and can potentially assist in choosing combination therapy regimens.

Keywords: Algorithms, Computational Biology, Databases, Drug Resistance, Forecasting, Genetic, Genotype, HIV Protease Inhibitors, HIV-1, Humans, Information Management, Information Storage and Retrieval, Kinetics, Linear Models, Microbial Sensitivity Tests, Models, Non-U.S. Gov't, P.H.S., Periodicals, Point Mutation, Pyrimidinones, Research Support, Reverse Transcriptase Inhibitors, Theoretical, U.S. Gov't, Viral, 15259897
[Wagner2004Computational] M. Wagner, D.N. Naik, A. Pothen, S. Kasukurti, R.R. Devineni, B.L. Adam, O.J. Semmes, and G.L. Wright Jr. Computational protein biomarker prediction: a case study for prostate cancer. BMC Bioinformatics, 5(26), 2004. [ bib | DOI | http | .pdf ]
Background: Recent technological advances in mass spectrometry pose challenges in computational mathematics and statistics to process the mass spectral data into predictive models with clinical and biological significance. We discuss several classification-based approaches to finding protein biomarker candidates using protein profiles obtained via mass spectrometry, and we assess their statistical significance. Our overall goal is to implicate peaks that have a high likelihood of being biologically linked to a given disease state, and thus to narrow the search for biomarker candidates. Results: Thorough cross-validation studies and randomization tests are performed on a prostate cancer dataset with over 300 patients, obtained at the Eastern Virginia Medical School using SELDI-TOF mass spectrometry. We obtain average classification accuracies of 87% on a four-group classification problem using a two-stage linear SVM-based procedure and just 13 peaks, with other methods performing comparably. Conclusions: Modern feature selection and classification methods are powerful techniques for both the identification of biomarker candidates and the related problem of building predictive models from protein mass spectrometric profiles. Cross-validation and randomization are essential tools that must be performed carefully in order not to bias the results unfairly. However, only a biological validation and identification of the underlying proteins will ultimately confirm the actual value and power of any computational predictions.

Keywords: biosvm
[Vogelstein2004Cancer] B. Vogelstein and K. W. Kinzler. Cancer genes and the pathways they control. Nat. Med., 10(8):789-799, Aug 2004. [ bib | DOI | http | .pdf ]
The revolution in cancer research can be summed up in a single sentence: cancer is, in essence, a genetic disease. In the last decade, many important genes responsible for the genesis of various cancers have been discovered, their mutations precisely identified, and the pathways through which they act characterized. The purposes of this review are to highlight examples of progress in these areas, indicate where knowledge is scarce and point out fertile grounds for future investigation.

[Vishwanathan2004Fast] S. V. N. Vishwanathan and A. J. Smola. Fast kernels for string and tree matching. In B. Schölkopf, K. Tsuda, and J.-P. Vert, editors, Kernel methods in computational biology, pages 113-130. MIT Press, 2004. [ bib ]
[Vinayagam2004Applying] A. Vinayagam, R. König, J. Moormann, F. Schubert, R. Eils, K.-H. Glatting, and S. Suhai. Applying Support Vector Machines for Gene Ontology based gene function prediction. BMC Bioinformatics, 5(1):116, Aug 2004. [ bib | DOI | http | .pdf ]
BACKGROUND: The current progress in sequencing projects calls for rapid, reliable and accurate function assignments of gene products. A variety of methods has been designed to annotate sequences on a large scale. However, these methods can either only be applied for specific subsets, or their results are not formalised, or they do not provide precise confidence estimates for their predictions. RESULTS: We have developed a large-scale annotation system that tackles all of these shortcomings. In our approach, annotation was provided through Gene Ontology terms by applying multiple Support Vector Machines (SVM) for the classification of correct and false predictions. The general performance of the system was benchmarked with a large dataset. An organism-wise cross-validation was performed to define confidence estimates, resulting in an average precision of 80% for 74% of all test sequences. The validation results show that the prediction performance was organism-independent and could reproduce the annotation of other automated systems as well as high-quality manual annotations. We applied our trained classification system to Xenopus laevis sequences, yielding functional annotation for more than half of the known expressed genome. Compared to the currently available annotation, we provided more than twice the number of contigs with good quality annotation, and additionally we assigned a confidence value to each predicted GO term. CONCLUSIONS: We present a complete automated annotation system that overcomes many of the usual problems by applying a controlled vocabulary of Gene Ontology and an established classification method on large and well-described sequence data sets. In a case study, the function for Xenopus laevis contig sequences was predicted and the results are publicly available at ftp://genome.dkfz-heidelberg.de/pub/agd/gene_association.agd_Xenopus.

Keywords: biosvm
[Vert2004primer] J.-P. Vert, K. Tsuda, and B. Schölkopf. A primer on kernel methods. In B. Schölkopf, K. Tsuda, and J.P. Vert, editors, Kernel Methods in Computational Biology, pages 35-70. MIT Press, 2004. [ bib ]
Keywords: biosvm
[Vert2004Local] J.-P. Vert, H. Saigo, and T. Akutsu. Local alignment kernels for biological sequences. In B. Schölkopf, K. Tsuda, and J.P. Vert, editors, Kernel Methods in Computational Biology, pages 131-154. MIT Press, Cambridge, Massachusetts, 2004. [ bib ]
Keywords: biosvm
[Venables2004Aberrant] Julian P. Venables. Aberrant and alternative splicing in cancer. Cancer Res., 64:7647-7654, 2004. [ bib ]
Keywords: csbcbook
[Vallabhaneni2004Motor] Anirudh Vallabhaneni and Bin He. Motor imagery task classification for brain computer interface applications using spatiotemporal principle component analysis. Neurol Res, 26(3):282-7, Apr 2004. [ bib | DOI | http ]
Classification of single-trial imagined left- and right-hand movements recorded through scalp EEG is explored in this study. The classical event-related desynchronization/synchronization (ERD/ERS) calculation approach was used to extract ERD features from the raw scalp EEG signal. Principal Component Analysis (PCA) was used for feature extraction and applied on the spatial as well as the temporal dimension in two consecutive steps. A Support Vector Machine (SVM) classifier using a linear decision function was used to classify each trial as either left or right. The present approach has yielded good classification results and shows potential for further refinement for increased accuracy as well as application in online brain computer interfaces (BCI).

Keywords: Amino Acids, Antibodies, Artificial Intelligence, Biological, Brain, Brain Mapping, Calibration, Comparative Study, Computational Biology, Cysteine, Cystine, Electrodes, Electroencephalography, Evoked Potentials, Female, Horseradish Peroxidase, Humans, Imagery (Psychotherapy), Imagination, Laterality, Male, Monoclonal, Movement, Neoplasms, Non-P.H.S., Non-U.S. Gov't, P.H.S., Perception, Principal Component Analysis, Protein, Protein Array Analysis, Proteins, Research Support, Sensitivity and Specificity, Sequence Analysis, Tumor Markers, U.S. Gov't, User-Computer Interface, 15142321
[Ui-Tei2004Guidelines] K. Ui-Tei, Y. Naito, F. Takahashi, T. Haraguchi, H. Ohki-Hamazaki, A. Juni, R. Ueda, and K. Saigo. Guidelines for the selection of highly effective siRNA sequences for mammalian and chick RNA interference. Nucleic Acids Res., 32(3):936-948, Feb 2004. [ bib | DOI | http ]
In the present study, the relationship between short interfering RNA (siRNA) sequence and RNA interference (RNAi) effect was extensively analyzed using 62 targets of four exogenous and two endogenous genes and three mammalian and Drosophila cells. We present the rules that may govern siRNA sequence preference and in accordance with which highly effective siRNAs essential for systematic mammalian functional genomics can be readily designed. These rules indicate that siRNAs which simultaneously satisfy all four of the following sequence conditions are capable of inducing highly effective gene silencing in mammalian cells: (i) A/U at the 5' end of the antisense strand; (ii) G/C at the 5' end of the sense strand; (iii) at least five A/U residues in the 5' terminal one-third of the antisense strand; and (iv) the absence of any GC stretch of more than 9 nt in length. siRNAs opposite in features with respect to the first three conditions give rise to little or no gene silencing in mammalian cells. Essentially the same rules for siRNA sequence preference were found applicable to DNA-based RNAi in mammalian cells and in ovo RNAi using chick embryos. In contrast to mammalian and chick cells, little siRNA sequence preference could be detected in Drosophila in vivo RNAi.

Keywords: sirna
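
The four sequence conditions listed in the Ui-Tei et al. abstract above can be read as a simple filter. The following is only a minimal sketch, not the authors' software: the function name is hypothetical, both strands are assumed to be given 5' to 3', the "5' terminal one-third" is assumed to be the first len // 3 nucleotides, and the GC-stretch check is applied to both strands because the abstract does not say which strand it refers to.

```python
import re


def satisfies_ui_tei_rules(sense: str, antisense: str) -> bool:
    """Check the four conditions from the abstract (hypothetical helper).

    Both strands are expected 5'->3'. DNA-style input (T) is mapped to U.
    """
    sense = sense.upper().replace("T", "U")
    antisense = antisense.upper().replace("T", "U")

    # (i) A/U at the 5' end of the antisense strand
    rule_i = antisense.startswith(("A", "U"))
    # (ii) G/C at the 5' end of the sense strand
    rule_ii = sense.startswith(("G", "C"))
    # (iii) at least five A/U in the 5' terminal third of the antisense strand
    #       (assumption: the third is the first len // 3 nucleotides)
    third = antisense[: len(antisense) // 3]
    rule_iii = sum(base in "AU" for base in third) >= 5
    # (iv) no run of consecutive G/C longer than 9 nt
    #      (assumption: checked on both strands)
    rule_iv = all(re.search(r"[GC]{10,}", s) is None for s in (sense, antisense))

    return rule_i and rule_ii and rule_iii and rule_iv
```

For example, a call such as satisfies_ui_tei_rules("GCAUCGAUCGUAGCUAGCUAA", "UUAGCUAGCUACGAUCGAUGC") passes all four checks, while an antisense strand that starts with G fails condition (i). The sketch does not verify that the two strands are complementary.
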
[Udupa04Algorithmic] R. Udupa, T. A. Faruquie, and H. K. Maji. An algorithmic framework for solving the decoding problem in statistical machine translation. In Proceedings of Coling 2004, pages 631-637, Geneva, Switzerland, Aug 23-Aug 27 2004. COLING. [ bib ]
[Tzeng2004Predicting] Huey-Ming Tzeng, Jer-Guang Hsieh, and Yih-Lon Lin. Predicting nurses' intention to quit with a support vector machine: a new approach to set up an early warning mechanism in human resource management. Comput Inform Nurs, 22(4):232-42, 2004. [ bib ]
This project developed a Support Vector Machine for predicting nurses' intention to quit, using working motivation, job satisfaction, and stress levels as predictors. This study was conducted in three hospitals located in southern Taiwan. The target population was all nurses (389 valid cases). For cross-validation, we randomly split cases into four groups of approximately equal sizes, and performed four training runs. After the training, the average percentage of misclassification on the training data was 0.86%, while that on the testing data was 10.8%, resulting in predictions with 89.2% accuracy. This Support Vector Machine can predict nurses' intention to quit, without asking these nurses whether they have an intention to quit.

Keywords: Adolescent, Adult, Algorithms, Amino Acid Sequence, Amino Acids, Anatomic, Attitude of Health Personnel, Bacterial Proteins, Bias (Epidemiology), Brain, Brain Mapping, Burnout, Comparative Study, Computer Simulation, Computer-Assisted, Data Interpretation, Diffusion Magnetic Resonance Imaging, Facial Asymmetry, Facial Expression, Facial Paralysis, Female, Gene Expression Profiling, Gram-Negative Bacteria, Gram-Positive Bacteria, Hospital, Humans, Image Interpretation, Intention, Job Satisfaction, Logistic Models, Magnetoencephalography, Male, Middle Aged, Models, Motion, Neural Networks (Computer), Neural Pathways, Non-U.S. Gov't, Nonlinear Dynamics, Nursing Administration Research, Nursing Staff, Personnel Management, Personnel Turnover, Photography, Predictive Value of Tests, Professional, Protein, Proteins, Proteome, Psychological, Questionnaires, Regression Analysis, Reproducibility of Results, Research Support, Retina, Risk Factors, Sequence Alignment, Sequence Analysis, Severity of Illness Index, Software, Statistical, Subcellular Fractions, Taiwan, Theoretical, Workplace, 15494654
[Tu2004Image] Zhuowen Tu, Xiangrong Chen, and Alan L. Yuille. Image parsing: Unifying segmentation, detection, and recognition. Int. J. Comput. Vis., 2004. [ bib ]
[Tsybakov2004Introduction] A. B. Tsybakov. Introduction à l'estimation non-paramétrique. Springer, 2004. [ bib ]
[Tsuda2004Learning] K. Tsuda and W.S. Noble. Learning kernels from biological networks by maximizing entropy. Bioinformatics, 20:i326-i333, 2004. [ bib | DOI | http | .pdf ]
Motivation: The diffusion kernel is a general method for computing pairwise distances among all nodes in a graph, based on the sum of weighted paths between each pair of nodes. This technique has been used successfully, in conjunction with kernel-based learning methods, to draw inferences from several types of biological networks. Results: We show that computing the diffusion kernel is equivalent to maximizing the von Neumann entropy, subject to a global constraint on the sum of the Euclidean distances between nodes. This global constraint allows for high variance in the pairwise distances. Accordingly, we propose an alternative, locally constrained diffusion kernel, and we demonstrate that the resulting kernel allows for more accurate support vector machine prediction of protein functional classifications from metabolic and protein-protein interaction networks. Availability: Supplementary results and data are available at noble.gs.washington.edu/proj/maxent

Keywords: learning-kernel graph-kernel biosvm
[Tsochantaridis2004Support] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In Twenty-first international conference on Machine learning. ACM Press, 2004. [ bib | DOI | .pdf ]
Learning general functional dependencies is one of the main goals in machine learning. Recent progress in kernel-based methods has focused on designing flexible and powerful input representations. This paper addresses the complementary issue of problems involving complex outputs such as multiple dependent output variables and structured output spaces. We propose to generalize multiclass Support Vector Machine learning in a formulation that involves features extracted jointly from inputs and outputs. The resulting optimization problem is solved efficiently by a cutting plane algorithm that exploits the sparseness and structural decomposition of the problem. We demonstrate the versatility and effectiveness of our method on problems ranging from supervised grammar learning and named-entity recognition, to taxonomic text classification and sequence alignment.

Keywords: structured-output
[Tsai2004Gene] C.A. Tsai, C.H. Chen, T.C. Lee, I.C. Ho, U.C. Yang, and J.J. Chen. Gene selection for sample classifications in microarray experiments. DNA Cell Biol., 23(10):607-614, 2004. [ bib | DOI | http | .pdf ]
DNA microarray technology provides useful tools for profiling global gene expression patterns in different cell/tissue samples. One major challenge is the large number of genes relative to the number of samples. The use of all genes can suppress or reduce the performance of a classification rule due to the noise of nondiscriminatory genes. Selection of an optimal subset from the original gene set becomes an important prestep in sample classification. In this study, we propose a family-wise error (FWE) rate approach to selection of discriminatory genes for two-sample or multiple-sample classification. The FWE approach controls the probability of one or more false positives at a prespecified level. A public colon cancer data set is used to evaluate the performance of the proposed approach for the two classification methods: k nearest neighbors (k-NN) and support vector machine (SVM). The selected gene sets from the proposed procedure appear to perform better than or comparably to several results reported in the literature using the univariate analysis without performing multivariate search. In addition, we apply the FWE approach to a toxicogenomic data set with nine treatments (a control and eight metals, As, Cd, Ni, Cr, Sb, Pb, Cu, and AsV) for a total of 55 samples for a multisample classification. Two gene sets are considered: the gene set omegaF formed by the ANOVA F-test, and a gene set omegaT formed by the union of one-versus-all t-tests. The predicted accuracies are evaluated using internal and external crossvalidation. Using the SVM classification, the overall accuracies to predict 55 samples into one of the nine treatments are above 80% for internal crossvalidation. OmegaF has slightly higher accuracy rates than omegaT. The overall predicted accuracies are above 70% for external crossvalidation; the two gene sets omegaT and omegaF performed equally well.

Keywords: biosvm microarray
[Tropp2004Greed] Joel A. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Trans. Inform. Theory, 50:2231-2242, 2004. [ bib ]
[Tian2004novel] Liang Tian and Afzel Noore. A novel approach for short-term load forecasting using support vector machines. Int J Neural Syst, 14(5):329-35, Oct 2004. [ bib ]
A support vector machine (SVM) modeling approach for short-term load forecasting is proposed. The SVM learning scheme is applied to the power load data, forcing the network to learn the inherent internal temporal property of power load sequence. We also study the performance when other related input variables such as temperature and humidity are considered. The performance of our proposed SVM modeling approach has been tested and compared with feed-forward neural network and cosine radial basis function neural network approaches. Numerical results show that the SVM approach yields better generalization capability and lower prediction error compared to those neural network approaches.

[Thissen2004Multivariate] Uwe Thissen, Bülent Ustün, Willem J Melssen, and Lutgarde M C Buydens. Multivariate calibration with least-squares support vector machines. Anal Chem, 76(11):3099-105, Jun 2004. [ bib | DOI | http | .pdf ]
This paper proposes the use of least-squares support vector machines (LS-SVMs) as a relatively new nonlinear multivariate calibration method, capable of dealing with ill-posed problems. LS-SVMs are an extension of "traditional" SVMs that have been introduced recently in the field of chemistry and chemometrics. The advantages of SVM-based methods over many other methods are that these lead to global models that are often unique, and nonlinear regression can be performed easily as an extension to linear regression. An additional advantage of LS-SVM (compared to SVM) is that model calculation and optimization can be performed relatively fast. As a test case to study the use of LS-SVM, the well-known and important chemical problem is considered in which spectra are affected by nonlinear interferences. As one specific example, a commonly used case is studied in which near-infrared spectra are affected by temperature-induced spectral variation. Using this test case, model optimization, pruning, and model interpretation of the LS-SVM have been demonstrated. Furthermore, excellent performance of the LS-SVM, compared to other approaches, has been presented on the specific example. Therefore, it can be concluded that LS-SVMs can be seen as very promising techniques to solve ill-posed problems. Furthermore, these have been shown to lead to robust models in cases of spectral variations due to nonlinear interferences.

[Thimm2004Comparison] M. Thimm, A. Goede, S. Hougardy, and R. Preissner. Comparison of 2D similarity and 3D superposition. Application to searching a conformational drug database. J Chem Inf Comput Sci, 44(5):1816-1822, 2004. [ bib | DOI | http ]
In a database of about 2000 approved drugs, represented by 10^5 structural conformers, we have performed 2D comparisons (Tanimoto coefficients) and 3D superpositions. For one class of drugs the correlation between structural resemblance and similar action was analyzed in detail. In general Tanimoto coefficients and 3D scores give similar results, but we find that 2D similarity measures neglect important structural/functional features. Examples for both over- and underestimation of similarity by 2D metrics are discussed. The required additional effort for 3D superpositions is assessed by implementation of a fast algorithm with a processing time below 0.01 s and a more sophisticated approach (0.5 s per superposition). According to the improvement of similarity detection compared to 2D screening and the pleasant rapidity on a desktop PC, full-atom 3D superposition will be an upcoming method of choice for library prioritization or similarity screening approaches.

Keywords: Arabidopsis, Carbohydrates, Circadian Rhythm, Comparative Study, Database Management Systems, Gene Expression Regulation, Genes, Genome, Messenger, Molecular Conformation, Mutation, Nitrogen, Non-U.S. Gov't, Oligonucleotide Array Sequence Analysis, Pharmaceutical Preparations, Plant, RNA, Research Support, 15446841
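
For readers unfamiliar with the 2D measure named in the Thimm et al. abstract above, the Tanimoto coefficient of two feature sets is the size of their intersection divided by the size of their union. The sketch below assumes that each molecule is represented as a set of integer fingerprint bit positions; that representation, the function name, and the value returned for two empty sets are illustrative assumptions, not details taken from the paper.

```python
def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto (Jaccard) coefficient of two fingerprint bit sets.

    Returns |a & b| / |a | b|. Two empty fingerprints are given a
    similarity of 0.0 here, which is a convention chosen for this sketch.
    """
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Two hypothetical fingerprints sharing two of five distinct bits.
similarity = tanimoto({1, 2, 3, 4}, {3, 4, 5})
```
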
[Thies2004Optimal] Thorsten Thies and Frank Weber. Optimal reduced-set vectors for support vector machines with a quadratic kernel. Neural Comput, 16(9):1769-77, Sep 2004. [ bib ]
To reduce computational cost, the discriminant function of a support vector machine (SVM) should be represented using as few vectors as possible. This problem has been tackled in different ways. In this article, we develop an explicit solution in the case of a general quadratic kernel $k(x, x') = (C + D\,x^{T} x')^{2}$. For a given number of vectors, this solution provides the best possible approximation and can even recover the discriminant function if the number of used vectors is large enough. The key idea is to express the inhomogeneous kernel as a homogeneous kernel on a space having one dimension more than the original one and to follow the approach of Burges (1996).
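
The key idea stated in the abstract above, writing the inhomogeneous quadratic kernel as a homogeneous one in a space with one extra dimension, can be checked numerically. The sketch below uses one standard embedding, z = (sqrt(D) x, sqrt(C)), so that z . z' = C + D x . x'; it assumes C >= 0 and D >= 0, and this particular embedding and all function names are illustrative choices rather than the construction given in the paper.

```python
import math


def quadratic_kernel(x, y, c, d):
    """Inhomogeneous quadratic kernel (c + d * <x, y>)^2."""
    dot = sum(a * b for a, b in zip(x, y))
    return (c + d * dot) ** 2


def augment(x, c, d):
    """Map x to a space with one more dimension (assumes c >= 0, d >= 0)."""
    return [math.sqrt(d) * a for a in x] + [math.sqrt(c)]


def homogeneous_kernel(u, v):
    """Homogeneous quadratic kernel <u, v>^2."""
    return sum(a * b for a, b in zip(u, v)) ** 2


# Arbitrary illustrative values, not taken from the paper.
x, y, c, d = [1.0, 2.0], [0.5, -1.0], 2.0, 3.0
lhs = quadratic_kernel(x, y, c, d)
rhs = homogeneous_kernel(augment(x, c, d), augment(y, c, d))
```

With these inputs both sides agree up to floating-point rounding, since the augmented inner product equals C + D x . x' by construction.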

[Taskar2004Max-Margin] B. Taskar, C. Guestrin, and D. Koller. Max-Margin Markov Networks. In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16, Cambridge, MA, 2004. MIT Press. [ bib | .pdf ]
Keywords: conditional-random-field
[Sun2004protein] Zhenghong Sun, Xiaoli Fu, Lu Zhang, Xiaoli Yang, Feizhou Liu, and Gengxi Hu. A protein chip system for parallel analysis of multi-tumor markers and its application in cancer detection. Anticancer Res, 24(2C):1159-65, 2004. [ bib ]
BACKGROUND: Tumor markers are routinely measured in clinical oncology. However, their value in cancer detection has been controversial largely because no single tumor marker is sensitive and specific enough to meet strict diagnostic criteria. One strategy to overcome the shortcomings of single tumor markers is to measure a combination of tumor markers to increase sensitivity and look for distinct patterns to increase specificity. This study aimed to develop a system for parallel detection of tumor markers as a tool for tumor detection in both cancer patients and asymptomatic populations at high risk. MATERIALS AND METHODS: A protein chip was fabricated with twelve monoclonal antibodies against the following tumor markers respectively: CA125, CA15-3, CA19-9, CA242, CEA, AFP, PSA, free-PSA, HGH, beta-HCG, NSE and ferritin. Tumor markers were captured after the protein chip was incubated with serum samples. A secondary antibody conjugated with HRP was used to detect the captured tumor markers using chemiluminescence technique. Quantification of the tumor markers was obtained after calibration with standard curves. RESULTS: The chip system showed an overall sensitivity of 68.18% after testing 1147 cancer patients, with high sensitivities for liver, pancreas and ovarian tumors and low sensitivities for gastrointestinal tumors, and a specificity of 97.1% after testing 793 healthy individuals. Application of the chip system in physical checkups of 15,867 individuals resulted in 16 cases that were subsequently confirmed as having cancers. Analysis of the detection results with a Support Vector Machine algorithm considerably increased the specificity of the system as reflected in healthy individuals and hepatitis/cirrhosis patients, but only modestly decreased the sensitivity for cancer patients. CONCLUSION: This protein chip system is a potential tool for assisting cancer diagnosis and for screening cancer in high-risk populations.

Keywords: Antibodies, Artificial Intelligence, Biological, Calibration, Female, Horseradish Peroxidase, Humans, Male, Monoclonal, Neoplasms, Protein Array Analysis, Sensitivity and Specificity, Tumor Markers, 15154641
[Strauss2004Objective] Daniel J Strauss, Wolfgang Delb, and Peter K Plinkert. Objective detection of the central auditory processing disorder: a new machine learning approach. IEEE Trans Biomed Eng, 51(7):1147-55, Jul 2004. [ bib ]
The objective detection of binaural interaction is of diagnostic interest for the evaluation of the central auditory processing disorder (CAPD). The beta-wave of the binaural interaction component in auditory brainstem responses has been suggested as an objective measure of binaural interaction and has been shown to be of diagnostic value in the CAPD diagnosis. However, a reliable and automated detection of the beta-wave capable of clinical use still remains a challenge. We propose a new machine learning approach to the detection of the CAPD that is based on adapted tight frame decompositions which are tailored for support vector machines with radial kernels. Using shift-invariant scale and morphological features of the binaurally evoked brainstem potentials, our approach provides at least comparable results to the beta-wave detection in view of the discrimination of subjects being at risk for CAPD and subjects being not at risk for CAPD. Furthermore, as no information from the monaurally evoked potentials is necessary, the measurement cost is reduced by two-thirds compared to the computation of the binaural interaction component. We conclude that a machine learning approach in the form of a hybrid tight frame-support vector classification is effective in the objective detection of the CAPD.

[Steinwart2004Fast] I. Steinwart and C. Scovel. Fast Rates for Support Vector Machines using Gaussian Kernels. Technical Report LA-UR 04-8796, Los Alamos National Laboratory, 2004. [ bib ]
[Steinwart2005explicit] I. Steinwart, D. Hush, and C. Scovel. An explicit description of the reproducing kernel Hilbert spaces of Gaussian RBF kernels. Technical Report LA-UR 04-8274, Los Alamos National Laboratory, 2004. [ bib ]
[Steiner2004Discriminating] Guido Steiner, Laura Suter, Franziska Boess, Rodolfo Gasser, Maria Cristina de Vera, Silvio Albertini, and Stefan Ruepp. Discriminating different classes of toxicants by transcript profiling. Environ. Health Perspect., 112(12):1236-48, Aug 2004. [ bib | .html | .pdf ]
Male rats were treated with various model compounds or the appropriate vehicle controls. Most substances were either well-known hepatotoxicants or showed hepatotoxicity during preclinical testing. The aim of the present study was to determine if biological samples from rats treated with various compounds can be classified based on gene expression profiles. In addition to gene expression analysis using microarrays, a complete serum chemistry profile and liver and kidney histopathology were performed. We analyzed hepatic gene expression profiles using a supervised learning method (support vector machines; SVMs) to generate classification rules and combined this with recursive feature elimination to improve classification performance and to identify a compact subset of probe sets with potential use as biomarkers. Two different SVM algorithms were tested, and the models obtained were validated with a compound-based external cross-validation approach. Our predictive models were able to discriminate between hepatotoxic and nonhepatotoxic compounds. Furthermore, they predicted the correct class of hepatotoxicant in most cases. We provide an example showing that a predictive model built on transcript profiles from one rat strain can successfully classify profiles from another rat strain. In addition, we demonstrate that the predictive models identify nonresponders and are able to discriminate between gene changes related to pharmacology and toxicity. This work confirms the hypothesis that compound classification based on gene expression data is feasible.

Keywords: biosvm
[Statnikov2004Methods] Alexander Statnikov, Constantin F Aliferis, and Ioannis Tsamardinos. Methods for multi-category cancer diagnosis from gene expression data: a comprehensive evaluation to inform decision support system development. Medinfo, 11(Pt 2):813-7, 2004. [ bib ]
Cancer diagnosis is a major clinical applications area of gene expression microarray technology. We are seeking to develop a system for cancer diagnostic model creation based on microarray data. In order to equip the system with the optimal combination of data modeling methods, we performed a comprehensive evaluation of several major classification algorithms, gene selection methods, and cross-validation designs using 11 datasets spanning 74 diagnostic categories (41 cancer types and 12 normal tissue types). The Multi-Category Support Vector Machine techniques by Crammer and Singer, Weston and Watkins, and one-versus-rest were found to be the best methods and they outperform other learning algorithms such as K-Nearest Neighbors and Neural Networks often to a remarkable degree. Gene selection techniques are shown to significantly improve classification performance. These results guided the development of a software system that fully automates cancer diagnostic model construction with quality on par with or better than previously published results derived by expert human analysts.

Keywords: biosvm
[Stahura2004Virtual] Florence L Stahura and Jürgen Bajorath. Virtual screening methods that complement HTS. Comb Chem High Throughput Screen, 7(4):259-69, Jun 2004. [ bib ]
In this review, we discuss a number of computational methods that have been developed or adapted for molecule classification and virtual screening (VS) of compound databases. In particular, we focus on approaches that are complementary to high-throughput screening (HTS). The discussion is limited to VS methods that operate at the small molecular level, which is often called ligand-based VS (LBVS), and does not take into account docking algorithms or other structure-based screening tools. We describe areas that greatly benefit from combining virtual and biological screening and discuss computational methods that are most suitable to contribute to the integration of screening technologies. Relevant approaches range from established methods such as clustering or similarity searching to techniques that have only recently been introduced for LBVS applications such as statistical methods or support vector machines. Finally, we discuss a number of representative applications at the interface between VS and HTS.

Keywords: Algorithms, Animals, Antisense, Artificial Intelligence, Cell Line, Cluster Analysis, Comparative Study, Computational Biology, Computer Simulation, DNA Fingerprinting, Drug Evaluation, Fluorescence, Fuzzy Logic, Gene Silencing, Gene Targeting, Genetic, Hela Cells, Humans, Imaging, Intracellular Space, Microscopy, Models, Neoplasms, Neural Networks (Computer), Non-U.S. Gov't, Oligonucleotides, P.H.S., Preclinical, Prognosis, Proteomics, Quantitative Structure-Activity Relationship, RNA, RNA Interference, Research Support, Sensitivity and Specificity, Small Interfering, Thionucleotides, Three-Dimensional, Tumor, U.S. Gov't, 15200375
[Sorich2004Rapid] Michael J Sorich, Ross A McKinnon, John O Miners, David A Winkler, and Paul A Smith. Rapid prediction of chemical metabolism by human UDP-glucuronosyltransferase isoforms using quantum chemical descriptors derived with the electronegativity equalization method. J Med Chem, 47(21):5311-7, Oct 2004. [ bib | DOI | http | .pdf ]
This study aimed to evaluate in silico models based on quantum chemical (QC) descriptors derived using the electronegativity equalization method (EEM) and to assess the use of QC properties to predict chemical metabolism by human UDP-glucuronosyltransferase (UGT) isoforms. Various EEM-derived QC molecular descriptors were calculated for known UGT substrates and nonsubstrates. Classification models were developed using support vector machine and partial least squares discriminant analysis. In general, the most predictive models were generated with the support vector machine. Combining QC and 2D descriptors (from previous work) using a consensus approach resulted in a statistically significant improvement in predictivity (to 84%) over both the QC and 2D models and the other methods of combining the descriptors. EEM-derived QC descriptors were shown to be both highly predictive and computationally efficient. It is likely that EEM-derived QC properties will be generally useful for predicting ADMET and physicochemical properties during drug discovery.

Keywords: biosvm
[Song2004Comparison] Xiaowei Song, Arnold Mitnitski, Jafna Cox, and Kenneth Rockwood. Comparison of machine learning techniques with classical statistical models in predicting health outcomes. Medinfo, 11(Pt 1):736-40, 2004. [ bib ]
Several machine learning techniques (multilayer and single layer perceptron, logistic regression, least square linear separation and support vector machines) are applied to calculate the risk of death from two biomedical data sets, one from patient care records, and another from a population survey. Each dataset contained multiple sources of information: history of related symptoms and other illnesses, physical examination findings, laboratory tests, medications (patient records dataset), health attitudes, and disabilities in activities of daily living (survey dataset). Each technique showed very good mortality prediction in the acute patients data sample (AUC up to 0.89) and fair prediction accuracy for six year mortality (AUC from 0.70 to 0.76) in individuals from epidemiological database surveys. The results suggest that the nature of data is of primary importance rather than the learning technique. However, the consistently superior performance of the artificial neural network (multi-layer perceptron) indicates that nonlinear relationships (which cannot be discerned by linear separation techniques) can provide additional improvement in correctly predicting health outcomes.

Keywords: Aged, Air, Algorithms, Amino Acids, Animals, Area Under Curve, Artifacts, Artificial Intelligence, Atrial, Automated, Canada, Carotid Stenosis, Cerebrovascular Accident, Cerebrovascular Circulation, Comparative Study, Computer-Assisted, Cysteine, Decision Trees, Dementia, Diagnosis, Disulfides, Doppler, Embolism, Expert Systems, Extramural, Factor Analysis, Female, Gene Expression, Gene Expression Profiling, Health Status, Heart Septal Defects, Humans, Intracranial Embolism, Male, Models, Molecular, Myocardial Infarction, N.I.H., Neoplasms, Neural Networks (Computer), Non-U.S. Gov't, Oligonucleotide Array Sequence Analysis, Oxidation-Reduction, P.H.S., Pattern Recognition, Prognosis, Protein Binding, Protein Folding, Proteins, ROC Curve, Research Support, Sensitivity and Specificity, Software, Statistical, Transcranial, Treatment Outcome, U.S. Gov't, Ultrasonography, 15360910
[Sohler2004New] F. Sohler, D. Hanisch, and R. Zimmer. New methods for joint analysis of biological networks and expression data. Bioinformatics, 20(10):1517-1521, Jul 2004. [ bib | DOI | http | .pdf ]
SUMMARY: Biological networks, such as protein interaction, regulatory or metabolic networks, derived from public databases, biological experiments or text mining can be useful for the analysis of high-throughput experimental data. We present two algorithms embedded in the ToPNet application that show promising performance in analyzing expression data in the context of such networks. First, the Significant Area Search algorithm detects subnetworks consisting of significantly regulated genes. These subnetworks often provide hints on which biological processes are affected in the measured conditions. Second, Pathway Queries allow detection of networks including molecules that are not necessarily significantly regulated, such as transcription factors or signaling proteins. Moreover, using these queries, the user can formulate biological hypotheses and check their validity with respect to experimental data. All resulting networks and pathways can be explored further using the interactive analysis tools provided by the ToPNet program.

[Snoeve2004Designing] O. Snøve, M. Nedland, S. H. Fjeldstad, H. Humberset, O. R. Birkeland, T. Grönfeld, and P. Saetrom. Designing effective siRNAs with off-target control. Biochem. Biophys. Res. Commun., 325(3):769-73, Dec 2004. [ bib | DOI | http ]
Successful gene silencing by RNA interference requires a potent and specific depletion of the target mRNA. Target candidates must be chosen so that their corresponding short interfering RNAs are likely to be effective against that target and unlikely to accidentally silence other transcripts due to sequence similarity. We show that both effective and unique targets exist in mouse, fruit fly, and worm, and present a new design tool that enables users to make the trade-off between efficacy and uniqueness. The tool lists all targets with partial sequence similarity to the primary target to highlight candidates for negative controls.

Keywords: sirna
[Smyth2004Linear] G. K. Smyth. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol., 3:Article3, 2004. [ bib | DOI | http | .pdf ]
The problem of identifying differentially expressed genes in designed microarray experiments is considered. Lönnstedt and Speed (2002) derived an expression for the posterior odds of differential expression in a replicated two-color experiment using a simple hierarchical parametric model. The purpose of this paper is to develop the hierarchical model of Lönnstedt and Speed (2002) into a practical approach for general microarray experiments with arbitrary numbers of treatments and RNA samples. The model is reset in the context of general linear models with arbitrary coefficients and contrasts of interest. The approach applies equally well to both single channel and two color microarray experiments. Consistent, closed form estimators are derived for the hyperparameters in the model. The estimators proposed have robust behavior even for small numbers of arrays and allow for incomplete data arising from spot filtering or spot quality weights. The posterior odds statistic is reformulated in terms of a moderated t-statistic in which posterior residual standard deviations are used in place of ordinary standard deviations. The empirical Bayes approach is equivalent to shrinkage of the estimated sample variances towards a pooled estimate, resulting in far more stable inference when the number of arrays is small. The use of moderated t-statistics has the advantage over the posterior odds that the number of hyperparameters which need to be estimated is reduced; in particular, knowledge of the non-null prior for the fold changes is not required. The moderated t-statistic is shown to follow a t-distribution with augmented degrees of freedom. The moderated t inferential approach extends to accommodate tests of composite null hypotheses through the use of moderated F-statistics. The performance of the methods is demonstrated in a simulation study. Results are presented for two publicly available data sets.

[Smith2004Towards] P. A. Smith, M. J. Sorich, L. S C Low, R. A. McKinnon, and J. O. Miners. Towards integrated ADME prediction: past, present and future directions for modelling metabolism by UDP-glucuronosyltransferases. J Mol Graph Model, 22(6):507-17, Jul 2004. [ bib | DOI | http | .pdf ]
Undesirable absorption, distribution, metabolism, excretion (ADME) properties are the cause of many drug development failures and this has led to the need to identify such problems earlier in the development process. This review highlights computational (in silico) approaches that have been used to identify the characteristics of ligands influencing molecular recognition and/or metabolism by the drug-metabolising enzyme UDP-glucuronosyltransferase (UGT). Current studies applying pharmacophore elucidation, 2D-quantitative structure metabolism relationships (2D-QSMR), 3D-quantitative structure metabolism relationships (3D-QSMR), and non-linear pattern recognition techniques such as artificial neural networks and support vector machines for modelling metabolism by UGT are reported. An assessment of the utility of in silico approaches for the qualitative and quantitative prediction of drug glucuronidation parameters highlights the benefit of using multiple pharmacophores and also non-linear techniques for classification. Some of the challenges facing the development of generalisable models for predicting metabolism by UGT, including the need for screening of more diverse structures, are also outlined.

Keywords: Algorithms, Animals, Antisense, Artificial Intelligence, Astrocytoma, Automated, Autonomic Nervous System, Brain, Brain Neoplasms, Cell Line, Cerebral Cortex, Child, Cluster Analysis, Cognition, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA Fingerprinting, Databases, Diagnosis, Discriminant Analysis, Drug Design, Drug Evaluation, Electroencephalography, Emotions, Event-Related Potentials, Evoked Potentials, Factual, Fluorescence, Fuzzy Logic, Gene Silencing, Gene Targeting, Genetic, Glucuronosyltransferase, Hand, Hela Cells, Humans, Imaging, Intracellular Space, Magnetic Resonance Spectroscopy, Male, Meningeal Neoplasms, Meningioma, Microscopy, Models, Molecular Structure, Monitoring, Motor, Neoplasm Metastasis, Neoplasms, Neural Networks (Computer), Non-U.S. Gov't, Oligonucleotides, P.H.S., P300, Pattern Recognition, Peptides, Pharmaceutical Preparations, Physiologic, Preclinical, Predictive Value of Tests, Preschool, Prognosis, Protein Interaction Mapping, Protein Structure, Proteins, Proteomics, Quantitative Structure-Activity Relationship, Quaternary, RNA, RNA Interference, Recognition (Psychology), Reproducibility of Results, Research Support, Sensitivity and Specificity, Signal Processing, Small Interfering, Software, Thionucleotides, Three-Dimensional, Tumor, U.S. Gov't, User-Computer Interface, Word Processing, 15182810
[Sjoelander2004Phylogenomic] K. Sjölander. Phylogenomic inference of protein molecular function: advances and challenges. Bioinformatics, 20(2):170-179, Jan 2004. [ bib | http | .pdf ]
MOTIVATION: Protein families evolve a multiplicity of functions through gene duplication, speciation and other processes. As a number of studies have shown, standard methods of protein function prediction produce systematic errors on these data. Phylogenomic analysis (combining phylogenetic tree construction, integration of experimental data and differentiation of orthologs and paralogs) has been proposed to address these errors and improve the accuracy of functional classification. The explicit integration of structure prediction and analysis in this framework, which we call structural phylogenomics, provides additional insights into protein superfamily evolution. RESULTS: Results of protein functional classification using phylogenomic analysis show fewer expected false positives overall than when pairwise methods of functional classification are employed. We present an overview of the motivations and fundamental principles of phylogenomic analysis, new methods developed for the key tasks, benchmark datasets for these tasks (when available) and suggest procedures to increase accuracy. We also discuss some of the methods used in the Celera Genomics high-throughput phylogenomic classification of the human genome. AVAILABILITY: Software tools from the Berkeley Phylogenomics Group are available at http://phylogenomics.berkeley.edu

[Sindhwani2004Feature] Vikas Sindhwani, Subrata Rakshit, Dipti Deodhare, Deniz Erdogmus, Jose C Principe, and Partha Niyogi. Feature selection in MLPs and SVMs based on maximum output information. IEEE Trans Neural Netw, 15(4):937-48, Jul 2004. [ bib ]
This paper presents feature selection algorithms for multilayer perceptrons (MLPs) and multiclass support vector machines (SVMs), using mutual information between class labels and classifier outputs, as an objective function. This objective function involves inexpensive computation of information measures only on discrete variables; provides immunity to prior class probabilities; and brackets the probability of error of the classifier. The maximum output information (MOI) algorithms employ this function for feature subset selection by greedy elimination and directed search. The output of the MOI algorithms is a feature subset of user-defined size and an associated trained classifier (MLP/SVM). These algorithms compare favorably with a number of other methods in terms of performance on various artificial and real-world data sets.

[Sindhwani2004Manifold] V. Sindhwani, P. Niyogi, and M. Belkin. Manifold Regularization: A Geometric Framework for Learning from Examples. Technical Report TR-2004-06, The University of Chicago, 2004. [ bib ]
[Sinden2004proteomic] R. E. Sinden. A proteomic analysis of malaria biology: integration of old literature and new technologies. Int. J. Parasitol., 34(13-14):1441-1450, Dec 2004. [ bib | DOI | http | .pdf ]
The genomic revolution has brought a new vitality into research on Plasmodium, its insect and vertebrate hosts. At the cellular level nowhere is the impact greater than in the analysis of protein expression and the 'assembly' of the supramolecular machines that together comprise the functional cell. The repetitive phases of invasion and replication that typify the malaria life cycle, together with the unique phase of sexual differentiation provide a powerful platform on which to investigate the 'molecular machines' that underpin parasite strategy and stage-specific functions. This approach is illustrated here in an analysis of the ookinete of Plasmodium berghei. Such analyses are useful only if conducted with a secure understanding of parasite biology. The importance of carefully searching the older literature to reach this understanding cannot be over-emphasised. When viewed together, the old and new data can give rapid and penetrating insights into what some might now term the 'Systems-Biology' of Plasmodium.

Keywords: plasmodium
[Shulman-Peleg2004Recognition] Alexandra Shulman-Peleg, Ruth Nussinov, and Haim J Wolfson. Recognition of functional sites in protein structures. J Mol Biol, 339(3):607-633, Jun 2004. [ bib | DOI | http ]
Recognition of regions on the surface of one protein that are similar to a binding site of another is crucial for the prediction of molecular interactions and for functional classifications. We first describe a novel method, SiteEngine, that assumes no sequence or fold similarities and is able to recognize proteins that have similar binding sites and may perform similar functions. We achieve high efficiency and speed by introducing a low-resolution surface representation via chemically important surface points, by hashing triangles of physico-chemical properties and by application of hierarchical scoring schemes for a thorough exploration of global and local similarities. We proceed to rigorously apply this method to functional site recognition in three possible ways: first, we search a given functional site on a large set of complete protein structures. Second, a potential functional site on a protein of interest is compared with known binding sites, to recognize similar features. Third, a complete protein structure is searched for the presence of an a priori unknown functional site, similar to known sites. Our method is robust and efficient enough to allow computationally demanding applications such as the first and the third. From the biological standpoint, the first application may identify secondary binding sites of drugs that may lead to side-effects. The third application finds new potential sites on the protein that may provide targets for drug design. Each of the three applications may aid in assigning a function and in classification of binding patterns. We highlight the advantages and disadvantages of each type of search, provide examples of large-scale searches of the entire Protein Data Base and make functional predictions.

Keywords: Algorithms; Catalytic Domain; Hydrogen Bonding; Models, Molecular; Protein Conformation; Proteins, chemistry
[Shulman2004Uniform] N. Shulman and M. Feder. The Uniform Distribution as a Universal Prior. IEEE Trans. Inform. Theory, 50(6):1356-1362, Jun 2004. [ bib | .pdf ]
In this correspondence, we discuss the properties of the uniform prior as a universal prior, i.e., a prior that induces a mutual information that is simultaneously close to the capacity for all channels. We determine bounds on the amount of the mutual information loss in using the uniform prior instead of the capacity-achieving prior. Specifically, for the class of binary input channels with any output alphabet, we show that the Z-channel has the minimal mutual information with uniform prior, out of all channels with a given capacity. From this, we conclude that the degradation of the mutual information with respect to the capacity is at most 0.011 bit, and as was shown previously, at most 6%. A related result is that the capacity-achieving prior, for any channel, is not far from uniform. Some of these results are extended to channels with nonbinary input.

Keywords: information-theory
[Shoeb2004Patient-specific] Ali Shoeb, Herman Edwards, Jack Connolly, Blaise Bourgeois, S. Ted Treves, and John Guttag. Patient-specific seizure onset detection. Epilepsy Behav, 5(4):483-98, Aug 2004. [ bib | DOI | http | .pdf ]
This article presents an automated, patient-specific method for the detection of epileptic seizure onset from noninvasive electroencephalography. We adopt a patient-specific approach to exploit the consistency of an individual patient's seizure and nonseizure electroencephalograms. Our method uses a wavelet decomposition to construct a feature vector that captures the morphology and spatial distribution of an electroencephalographic epoch, and then determines whether that vector is representative of a patient's seizure or nonseizure electroencephalogram using the support vector machine classification algorithm. Our completely automated method was tested on noninvasive electroencephalograms from 36 pediatric subjects suffering from a variety of seizure types. It detected 131 of 139 seizure events within 8.0+/-3.2 seconds of electrographic onset, and declared 15 false detections in 60 hours of clinical electroencephalography. Our patient-specific method can be used to initiate delay-sensitive clinical procedures following seizure onset, for example, the injection of a functional imaging radiotracer.

Keywords: Algorithms, Comparative Study, Computational Biology, Computer-Assisted, Databases, Diagnosis, Drug Resistance, Electroencephalography, Epilepsy, Forecasting, Genetic, Genotype, HIV Protease Inhibitors, HIV-1, Humans, Information Management, Information Storage and Retrieval, Kinetics, Linear Models, Microbial Sensitivity Tests, Models, Monitoring, Non-U.S. Gov't, P.H.S., Periodicals, Physiologic, Point Mutation, Pyrimidinones, Reaction Time, Research Support, Reverse Transcriptase Inhibitors, Signal Processing, Theoretical, Time Factors, U.S. Gov't, Viral, 15256184
[Sherr2004Principles] Charles J. Sherr. Principles of tumor suppression. Cell, 116:235-246, 2004. [ bib ]
Keywords: csbcbook
[Shawe-Taylor2004Kernel] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA, 2004. [ bib ]
[Shah2004Fingerprint] Shesha Shah and P. S. Sastry. Fingerprint classification using a feedback-based line detector. IEEE Trans Syst Man Cybern B Cybern, 34(1):85-94, Feb 2004. [ bib ]
We present a fingerprint classification algorithm in this paper. This algorithm classifies a fingerprint image into one of the five classes: Arch, Left loop, Right loop, Whorl, and Tented arch. We use a new low-dimensional feature vector obtained from the output of a novel oriented line detector presented here. Our line detector is a co-operative dynamical system that gives oriented lines and preserves multiple orientations at points where differently oriented lines meet. Our feature extraction process is based on characterizing the distribution of orientations around the fingerprint. We discuss three different classifiers: support vector machines, nearest-neighbor classifier, and neural network classifier. We present results obtained on a National Institute of Standards and Technology (NIST) fingerprint database and compare with other published results on NIST databases. All our classifiers perform equally well, and this suggests that our novel line detection and feature extraction process indeed captures all the crucial information needed for classification in this problem.

[Shacham2004PREDICT] S. Shacham, Y. Marantz, S. Bar-Haim, O. Kalid, D. Warshaviak, N. Avisar, B. Inbal, A. Heifetz, M. Fichman, M. Topf, Z. Naor, S. Noiman, and O. M. Becker. PREDICT modeling and in-silico screening for G-protein coupled receptors. Proteins, 57(1):51-86, Oct 2004. [ bib | DOI | http ]
G-protein coupled receptors (GPCRs) are a major group of drug targets for which only one x-ray structure is known (the nondrugable rhodopsin), limiting the application of structure-based drug discovery to GPCRs. In this paper we present the details of PREDICT, a new algorithmic approach for modeling the 3D structure of GPCRs without relying on homology to rhodopsin. PREDICT, which focuses on the transmembrane domain of GPCRs, starts from the primary sequence of the receptor, simultaneously optimizing multiple 'decoy' conformations of the protein in order to find its most stable structure, culminating in a virtual receptor-ligand complex. In this paper we present a comprehensive analysis of three PREDICT models for the dopamine D2, neurokinin NK1, and neuropeptide Y Y1 receptors. A shorter discussion of the CCR3 receptor model is also included. All models were found to be in good agreement with a large body of experimental data. The quality of the PREDICT models, at least for drug discovery purposes, was evaluated by their successful utilization in in-silico screening. Virtual screening using all three PREDICT models yielded enrichment factors 9-fold to 44-fold better than random screening. Namely, the PREDICT models can be used to identify active small-molecule ligands embedded in large compound libraries with an efficiency comparable to that obtained using crystal structures for non-GPCR targets.

Keywords: chemogenomics
[Sen2004Predicting] T.Z. Sen, A. Kloczkowski, R.L. Jernigan, C. Yan, V. Honavar, K.M. Ho, C.Z. Wang, Y. Ihm, H. Cao, X. Gu, and D. Dobbs. Predicting binding sites of hydrolase-inhibitor complexes by combining several methods. BMC Bioinformatics, 5(205), 2004. [ bib | DOI | .pdf ]
BACKGROUND: Protein-protein interactions play a critical role in protein function. Completion of many genomes is being followed rapidly by major efforts to identify interacting protein pairs experimentally in order to decipher the networks of interacting, coordinated-in-action proteins. Identification of protein-protein interaction sites and detection of specific amino acids that contribute to the specificity and the strength of protein interactions is an important problem with broad applications ranging from rational drug design to the analysis of metabolic and signal transduction networks. RESULTS: In order to increase the power of predictive methods for protein-protein interaction sites, we have developed a consensus methodology for combining four different methods. These approaches include: data mining using Support Vector Machines, threading through protein structures, prediction of conserved residues on the protein surface by analysis of phylogenetic trees, and the Conservatism of Conservatism method of Mirny and Shakhnovich. Results obtained on a dataset of hydrolase-inhibitor complexes demonstrate that the combination of all four methods yields improved predictions over the individual methods. CONCLUSIONS: We developed a consensus method for predicting protein-protein interface residues by combining sequence and structure-based methods. The success of our consensus approach suggests that similar methodologies can be developed to improve prediction accuracies for other bioinformatic problems.

Keywords: biosvm
[Segal2004module] E. Segal, N. Friedman, D. Koller, and A. Regev. A module map showing conditional activity of expression modules in cancer. Nat. Genet., 36(10):1090-1098, Oct 2004. [ bib | DOI | http | .pdf ]
DNA microarrays are widely used to study changes in gene expression in tumors, but such studies are typically system-specific and do not address the commonalities and variations between different types of tumor. Here we present an integrated analysis of 1,975 published microarrays spanning 22 tumor types. We describe expression profiles in different tumors in terms of the behavior of modules, sets of genes that act in concert to carry out a specific function. Using a simple unified analysis, we extract modules and characterize gene-expression profiles in tumors as a combination of activated and deactivated modules. Activation of some modules is specific to particular types of tumor; for example, a growth-inhibitory module is specifically repressed in acute lymphoblastic leukemias and may underlie the deregulated proliferation in these cancers. Other modules are shared across a diverse set of clinical conditions, suggestive of common tumor progression mechanisms. For example, the bone osteoblastic module spans a variety of tumor types and includes both secreted growth factors and their receptors. Our findings suggest that there is a single mechanism for both primary tumor proliferation and metastasis to bone. Our analysis presents multiple research directions for diagnostic, prognostic and therapeutic studies.

Keywords: biogm
[Seeger2004Gaussian] Matthias Seeger. Gaussian processes for machine learning. Int J Neural Syst, 14(2):69-106, Apr 2004. [ bib ]
Gaussian processes (GPs) are natural generalisations of multivariate Gaussian random variables to infinite (countably or continuous) index sets. GPs have been applied in a large number of fields to a diverse range of ends, and very many deep theoretical analyses of various properties are available. This paper gives an introduction to Gaussian processes on a fairly elementary level with special emphasis on characteristics relevant in machine learning. It draws explicit connections to branches such as spline smoothing models and support vector machines in which similar ideas have been investigated. Gaussian process models are routinely used to solve hard machine learning problems. They are attractive because of their flexible non-parametric nature and computational simplicity. Treated within a Bayesian framework, very powerful statistical methods can be implemented which offer valid estimates of uncertainties in our predictions and generic model selection procedures cast as nonlinear optimization problems. Their main drawback of heavy computational scaling has recently been alleviated by the introduction of generic sparse approximations. The mathematical literature on GPs is large and often uses deep concepts which are not required to fully understand most machine learning applications. In this tutorial paper, we aim to present characteristics of GPs relevant to machine learning and to show up precise connections to other "kernel machines" popular in the community. Our focus is on a simple presentation, but references to more detailed sources are provided.

Keywords: Algorithms, Amino Acids, Antibodies, Artificial Intelligence, Astrocytoma, Automated, Bayes Theorem, Biological, Biopsy, Brain, Brain Mapping, Brain Neoplasms, Calibration, Comparative Study, Computational Biology, Computer-Assisted, Computing Methodologies, Cysteine, Cystine, Dysplastic Nevus Syndrome, Electrodes, Electroencephalography, Entropy, Eosine Yellowish-(YS), Evoked Potentials, Female, Gene Expression Profiling, Hematoxylin, Horseradish Peroxidase, Humans, Image Interpretation, Image Processing, Imagery (Psychotherapy), Imagination, Laterality, Linear Models, Male, Melanoma, Models, Monoclonal, Movement, Neoplasms, Neural Networks (Computer), Neuropeptides, Non-P.H.S., Non-U.S. Gov't, Nonparametric, Normal Distribution, P.H.S., Pattern Recognition, Perception, Principal Component Analysis, Protein, Protein Array Analysis, Protein Interaction Mapping, Proteins, Regression Analysis, Research Support, Sensitivity and Specificity, Sequence Alignment, Sequence Analysis, Skin Neoplasms, Software, Statistical, Statistics, Tumor Markers, U.S. Gov't, User-Computer Interface, World Health Organization, 15112367
[Sebat2004Large-scale] J. Sebat, B. Lakshmi, J. Troge, J. Alexander, J. Young, P. Lundin, S. Månér, H. Massa, M. Walker, M. Chi, N. Navin, R. Lucito, J. Healy, J. Hicks, K. Ye, A. Reiner, T. C. Gilliam, B. Trask, N. Patterson, A. Zetterberg, and M. Wigler. Large-scale copy number polymorphism in the human genome. Science, 305(5683):525-528, Jul 2004. [ bib | DOI | http | .pdf ]
The extent to which large duplications and deletions contribute to human genetic variation and diversity is unknown. Here, we show that large-scale copy number polymorphisms (CNPs) (about 100 kilobases and greater) contribute substantially to genomic variation between normal humans. Representational oligonucleotide microarray analysis of 20 individuals revealed a total of 221 copy number differences representing 76 unique CNPs. On average, individuals differed by 11 CNPs, and the average length of a CNP interval was 465 kilobases. We observed copy number variation of 70 different genes within CNP intervals, including genes involved in neurological function, regulation of cell growth, regulation of metabolism, and several genes known to be associated with disease.

Keywords: cgh
[Scovel2004Fast] C. Scovel and I. Steinwart. Fast Rates for Support Vector Machines. Technical report, Los Alamos National Laboratory, 2004. [ bib ]
[Schoelkopf2004Kernel] B. Schölkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology. The MIT Press, Cambridge, Massachusetts, 2004. [ bib ]
Keywords: biosvm
[Schwender2004pilot] Holger Schwender, Manuela Zucknick, Katja Ickstadt, Hermann M Bolt, and the GENICA network. A pilot study on the application of statistical classification procedures to molecular epidemiological data. Toxicol Lett, 151(1):291-9, Jun 2004. [ bib ]
The development of new statistical methods for use in molecular epidemiology comprises the building and application of appropriate classification rules. The aim of this study was to assess various classification methods that can potentially handle genetic interactions. A data set comprising genotypes at 25 single nucleotide polymorphic (SNP) loci from 518 breast cancer cases and 586 age-matched population-based controls from the GENICA study was used to build a classification rule with the discrimination methods SVM (support vector machine), CART (classification and regression tree), Bagging, Random Forest, LogitBoost and k nearest neighbours (kNN). A blind pilot analysis of the genotypic data set was a first approach to obtain an impression of the statistical structure of the data. Furthermore, this analysis was performed to explore classification methods that may be applied to molecular-epidemiological evaluation. The results showed that all blindly applied classification methods had a slightly smaller misclassification rate than a random classification. The findings, nevertheless, suggest that SNP data might be useful for the classification of individuals into categories of high or low risk of diseases.

Keywords: biosvm
[Schneider2004Advances] Gisbert Schneider and Uli Fechner. Advances in the prediction of protein targeting signals. Proteomics, 4(6):1571-80, Jun 2004. [ bib | DOI | http | .pdf ]
[Scacheri2004Short] P. C. Scacheri, O. Rozenblatt-Rosen, N. J. Caplen, T. G. Wolfsberg, L. Umayam, J. C. Lee, C. M. Hughes, K. S. Shanmugam, A. Bhattacharjee, M. Meyerson, and F. S. Collins. Short interfering RNAs can induce unexpected and divergent changes in the levels of untargeted proteins in mammalian cells. Proc. Natl. Acad. Sci. USA, 101(7):1892-7, Feb 2004. [ bib | DOI | http ]
RNA interference (RNAi) mediated by short interfering RNAs (siRNAs) is a widely used method to analyze gene function. To use RNAi knockdown accurately to infer gene function, it is essential to determine the specificity of siRNA-mediated RNAi. We have assessed the specificity of 10 different siRNAs corresponding to the MEN1 gene by examining the expression of two additional genes, TP53 (p53) and CDKN1A (p21), which are considered functionally unrelated to menin but are sensitive markers of cell state. MEN1 RNA and corresponding protein levels were all reduced after siRNA transfection of HeLa cells, although the degree of inhibition mediated by individual siRNAs varied. Unexpectedly, we observed dramatic and significant changes in protein levels of p53 and p21 that were unrelated to silencing of the target gene. The modulations in p53 and p21 levels were not abolished on titration of the siRNAs, and similar results were obtained in three other cell lines; in none of the cell lines tested did we see an effect on the protein levels of actin. These data suggest that siRNAs can induce nonspecific effects on protein levels that are siRNA sequence dependent but that these effects may be difficult to detect until genes central to a pivotal cellular response, such as p53 and p21, are studied. We find no evidence that activation of the double-stranded RNA-triggered IFN-associated antiviral pathways accounts for these effects, but we speculate that partial complementary sequence matches to off-target genes may result in a micro-RNA-like inhibition of translation.

Keywords: sirna
[Saigo2004Protein] H. Saigo, J.-P. Vert, N. Ueda, and T. Akutsu. Protein homology detection using string alignment kernels. Bioinformatics, 20(11):1682-1689, 2004. [ bib | http | .pdf ]
Motivation: Remote homology detection between protein sequences is a central problem in computational biology. Discriminative methods involving support vector machines (SVMs) are currently the most effective methods for the problem of superfamily recognition in the Structural Classification Of Proteins (SCOP) database. The performance of SVMs depends critically on the kernel function used to quantify the similarity between sequences. Results: We propose new kernels for strings adapted to biological sequences, which we call local alignment kernels. These kernels measure the similarity between two sequences by summing up scores obtained from local alignments with gaps of the sequences. When tested in combination with SVM on their ability to recognize SCOP superfamilies on a benchmark dataset, the new kernels outperform state-of-the-art methods for remote homology detection. Availability: Software and data available upon request.

Keywords: biosvm
[Saeys2004Feature] Y. Saeys, S. Degroeve, D. Aeyels, P. Rouzé, and Y. Van de Peer. Feature selection for splice site prediction: A new method using EDA-based feature ranking. BMC Bioinformatics, 5(64), 2004. [ bib | DOI | .pdf ]
Background: The identification of relevant biological features in large and complex datasets is an important step towards gaining insight into the processes underlying the data. Other advantages of feature selection include the ability of the classification system to attain good or even better solutions using a restricted subset of features, and a faster classification. Thus, robust methods for fast feature selection are of key importance in extracting knowledge from complex biological data. Results: In this paper we present a novel method for feature subset selection applied to splice site prediction, based on estimation of distribution algorithms, a more general framework of genetic algorithms. From the estimated distribution of the algorithm, a feature ranking is derived. Afterwards this ranking is used to iteratively discard features. We apply this technique to the problem of splice site prediction, and show how it can be used to gain insight into the underlying biological process of splicing. Conclusion: We show that this technique proves to be more robust than the traditional use of estimation of distribution algorithms for feature selection: instead of returning a single best subset of features (as they normally do) this method provides a dynamical view of the feature selection process, like the traditional sequential wrapper methods. However, the method is faster than the traditional techniques, and scales better to datasets described by a large number of features.

Keywords: biosvm
[Saetrom2004comparison] P. Saetrom and O. Snøve. A comparison of siRNA efficacy predictors. Biochem. Biophys. Res. Commun., 321(1):247-53, Aug 2004. [ bib | DOI | http ]
Short interfering RNA (siRNA) efficacy prediction algorithms aim to increase the probability of selecting target sites that are applicable for gene silencing by RNA interference. Many algorithms have been published recently, and they base their predictions on such different features as duplex stability, sequence characteristics, mRNA secondary structure, and target site uniqueness. We compare the performance of the algorithms on a collection of publicly available siRNAs. First, we show that our regularized genetic programming algorithm GPboost appears to have a higher and more stable performance than other algorithms on the collected datasets. Second, several algorithms gave close to random classification on unseen data, and only GPboost and three other algorithms have a reasonably high and stable performance on all parts of the dataset. Third, the results indicate that the siRNAs' sequence is sufficient input to siRNA efficacy algorithms, and that other features that have been suggested to be important may be indirectly captured by the sequence.

Keywords: sirna
[Saetrom2004Predicting] P. Saetrom. Predicting the efficacy of short oligonucleotides in antisense and RNAi experiments with boosted genetic programming. Bioinformatics, 20(17):3055-3063, 2004. [ bib | DOI | http | .pdf ]
Motivation: Both small interfering RNAs (siRNAs) and antisense oligonucleotides can selectively block gene expression. Although the two methods rely on different cellular mechanisms, these methods share the common property that not all oligonucleotides (oligos) are equally effective. That is, if mRNA target sites are picked at random, many of the antisense or siRNA oligos will not be effective. Algorithms that can reliably predict the efficacy of candidate oligos can greatly reduce the cost of knockdown experiments, but previous attempts to predict the efficacy of antisense oligos have had limited success. Machine learning has not previously been used to predict siRNA efficacy. Results: We develop a genetic programming based prediction system that shows promising results on both antisense and siRNA efficacy prediction. We train and evaluate our system on a previously published database of antisense efficacies and our own database of siRNA efficacies collected from the literature. The best models gave an overall correlation between predicted and observed efficacy of 0.46 on both antisense and siRNA data. As a comparison, the best correlations of support vector machine classifiers trained on the same data were 0.40 and 0.30, respectively. Availability: The prediction system uses proprietary hardware and is available for both commercial and strategic academic collaborations. The siRNA database is available upon request.

Keywords: biosvm
[Sadik2004Detection] Omowunmi Sadik, Walker H Land, Adam K Wanekaya, Michiko Uematsu, Mark J Embrechts, Lut Wong, Dale Leibensperger, and Alex Volykin. Detection and classification of organophosphate nerve agent simulants using support vector machines with multiarray sensors. J Chem Inf Comput Sci, 44(2):499-507, 2004. [ bib | DOI | http | .pdf ]
The need for rapid and accurate detection systems is expanding and the utilization of cross-reactive sensor arrays to detect chemical warfare agents in conjunction with novel computational techniques may prove to be a potential solution to this challenge. We have investigated the detection, prediction, and classification of various organophosphate (OP) nerve agent simulants using sensor arrays with a novel learning scheme known as support vector machines (SVMs). The OPs tested include parathion, malathion, dichlorvos, trichlorfon, paraoxon, and diazinon. A new data reduction software program was written in MATLAB V. 6.1 to extract steady-state and kinetic data from the sensor arrays. The program also creates training sets by mixing and randomly sorting any combination of data categories into both positive and negative cases. The resulting signals were fed into SVM software for "pairwise" and "one vs all" classification. Experimental results for this new paradigm show a significant increase in classification accuracy when compared to artificial neural networks (ANNs). Three kernels, the S2000, the polynomial, and the Gaussian radial basis function (RBF), were tested and compared to the ANN. The following measures of performance were considered in the pairwise classification: receiver operating curve (ROC) Az indices, specificities, and positive predictive values (PPVs). The increases in ROC Az values, specificities, and PPVs ranged from 5% to 25%, 108% to 204%, and 13% to 54%, respectively, in all OP pairs studied when compared to the ANN baseline. Dichlorvos, trichlorfon, and paraoxon were perfectly predicted. Positive prediction for malathion was 95%.

Keywords: Algorithms, Ambergris, Combinatorial Chemistry Techniques, Models, Molecular, Molecular Conformation, Odors, P.H.S., Perfume, Predictive Value of Tests, Quantitative Structure-Activity Relationship, Research Support, U.S. Gov't, 15032529
[Roegnvaldsson2004Why] Thorsteinn Rögnvaldsson and Liwen You. Why neural networks should not be used for HIV-1 protease cleavage site prediction. Bioinformatics, 20(11):1702-9, Jul 2004. [ bib | DOI | http | .pdf ]
SUMMARY: Several papers have been published where nonlinear machine learning algorithms, e.g. artificial neural networks, support vector machines and decision trees, have been used to model the specificity of the HIV-1 protease and extract specificity rules. We show that the dataset used in these studies is linearly separable and that it is a misuse of nonlinear classifiers to apply them to this problem. The best solution on this dataset is achieved using a linear classifier like the simple perceptron or the linear support vector machine, and it is straightforward to extract rules from these linear models. We identify key residues in peptides that are efficiently cleaved by the HIV-1 protease and list the most prominent rules, relating them to experimental results for the HIV-1 protease. MOTIVATION: Understanding HIV-1 protease specificity is important when designing HIV inhibitors and several different machine learning algorithms have been applied to the problem. However, little progress has been made in understanding the specificity because nonlinear and overly complex models have been used. RESULTS: We show that the problem is much easier than what has previously been reported and that linear classifiers like the simple perceptron or linear support vector machines are at least as good predictors as nonlinear algorithms. We also show how sets of specificity rules can be generated from the resulting linear classifiers. AVAILABILITY: The datasets used are available at http://www.hh.se/staff/bioinf/

Keywords: biosvm
[Ratsch2004Accurate] G. Rätsch and S. Sonnenburg. Accurate splice site detection for Caenorhabditis elegans. In B. Schölkopf, K. Tsuda, and J.P. Vert, editors, Kernel Methods in Computational Biology, pages 277-298. MIT Press, 2004. [ bib ]
During the past three years, the support vector machine learning algorithm has been extensively applied within the field of computational biology. The algorithm has been used to detect patterns within and among biological sequences, to classify genes and patients based upon gene expression profiles, and has recently been applied to several new biological problems. This chapter reviews the state of the art with respect to SVM applications in computational biology.

Keywords: biosvm
[Roth2004generalized] Volker Roth. The generalized LASSO. IEEE Trans Neural Netw, 15(1):16-28, Jan 2004. [ bib | DOI | http | .pdf ]
In the last few years, the support vector machine (SVM) method has motivated new interest in kernel regression techniques. Although the SVM has been shown to exhibit excellent generalization properties in many experiments, it suffers from several drawbacks, both of a theoretical and a technical nature: the absence of probabilistic outputs, the restriction to Mercer kernels, and the steep growth of the number of support vectors with increasing size of the training set. In this paper, we present a different class of kernel regressors that effectively overcome the above problems. We call this approach generalized LASSO regression. It has a clear probabilistic interpretation, can handle learning sets that are corrupted by outliers, produces extremely sparse solutions, and is capable of dealing with large-scale problems. For regression functionals which can be modeled as iteratively reweighted least-squares (IRLS) problems, we present a highly efficient algorithm with guaranteed global convergence. This defines a unique framework for sparse regression models in the very rich class of IRLS models, including various types of robust regression models and logistic regression. Performance studies for many standard benchmark datasets effectively demonstrate the advantages of this model over related approaches.

Keywords: Algorithms, Bayes Theorem, Models, Neural Networks (Computer), Non-U.S. Gov't, Research Design, Research Support, Theoretical, 15387244
[Ross2004Multiplexed] Philip L Ross, Yulin N Huang, Jason N Marchese, Brian Williamson, Kenneth Parker, Stephen Hattan, Nikita Khainovski, Sasi Pillai, Subhakar Dey, Scott Daniels, Subhasish Purkayastha, Peter Juhasz, Stephen Martin, Michael Bartlet-Jones, Feng He, Allan Jacobson, and Darryl J Pappin. Multiplexed protein quantitation in Saccharomyces cerevisiae using amine-reactive isobaric tagging reagents. Mol Cell Proteomics, 3(12):1154-1169, Dec 2004. [ bib | DOI | http ]
We describe here a multiplexed protein quantitation strategy that provides relative and absolute measurements of proteins in complex mixtures. At the core of this methodology is a multiplexed set of isobaric reagents that yield amine-derivatized peptides. The derivatized peptides are indistinguishable in MS, but exhibit intense low-mass MS/MS signature ions that support quantitation. In this study, we have examined the global protein expression of a wild-type yeast strain and the isogenic upf1Δ and xrn1Δ mutant strains that are defective in the nonsense-mediated mRNA decay and the general 5' to 3' decay pathways, respectively. We also demonstrate the use of 4-fold multiplexing to enable relative protein measurements simultaneously with determination of absolute levels of a target protein using synthetic isobaric peptide standards. We find that inactivation of Upf1p and Xrn1p causes common as well as unique effects on protein expression.

Keywords: Cations; Chromatography, Ion Exchange; Chromatography, Liquid; Down-Regulation; Exoribonucleases; Fungal Proteins; Indicators and Reagents; Ions; Mass Spectrometry; Models, Chemical; Peptides; Phenotype; Proteomics; RNA Helicases; RNA, Messenger; Saccharomyces cerevisiae; Saccharomyces cerevisiae Proteins; Succinimides
[Rosasco2004loss] L. Rosasco, E.D. Vito, A. Caponnetto, M. Piana, and A. Verri. Are loss functions all the same? Neural Computation, 16(5):1063-1076, 2004. [ bib ]
[Riedesel2004Peptide] Henning Riedesel, Björn Kolbeck, Oliver Schmetzer, and Ernst-Walter Knapp. Peptide binding at class I major histocompatibility complex scored with linear functions and support vector machines. Genome Inform Ser Workshop Genome Inform, 15(1):198-212, 2004. [ bib | .html | .pdf ]
We explore two different methods to predict the binding ability of nonapeptides at the class I major histocompatibility complex using a general linear scoring function that defines a separating hyperplane in the feature space of sequences. In the absence of suitable data on non-binding nonapeptides, we generated sequences randomly from a selected set of proteins from the protein data bank. The parameters of the scoring function were determined by a generalized least square optimization (LSM) and alternatively by the support vector machine (SVM). With the generalized LSM, impaired data for learning (a small set of binding peptides and a large set of non-binding peptides) can be treated in a balanced way, rendering LSM more successful than SVM, while for symmetric data sets SVM has a slight advantage over LSM.

Keywords: biosvm
[Rhodes2004Large-scale] D. R. Rhodes, J. Yu, K. Shanker, N. Deshpande, R. Varambally, D. Ghosh, T. Barrette, A. Pandey, and A. M. Chinnaiyan. Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proc. Natl. Acad. Sci. U. S. A., 101(25):9309-9314, Jun 2004. [ bib | DOI | http | .pdf ]
Many studies have used DNA microarrays to identify the gene expression signatures of human cancer, yet the critical features of these often unmanageably large signatures remain elusive. To address this, we developed a statistical method, comparative metaprofiling, which identifies and assesses the intersection of multiple gene expression signatures from a diverse collection of microarray data sets. We collected and analyzed 40 published cancer microarray data sets, comprising 38 million gene expression measurements from >3,700 cancer samples. From this, we characterized a common transcriptional profile that is universally activated in most cancer types relative to the normal tissues from which they arose, likely reflecting essential transcriptional features of neoplastic transformation. In addition, we characterized a transcriptional profile that is commonly activated in various types of undifferentiated cancer, suggesting common molecular mechanisms by which cancer cells progress and avoid differentiation. Finally, we validated these transcriptional profiles on independent data sets.

[Reynolds2004Rational] A. Reynolds, D. Leake, Q. Boese, S. Scaringe, W. S. Marshall, and A. Khvorova. Rational siRNA design for RNA interference. Nat. Biotechnol., 22(3):326-330, Mar 2004. [ bib | DOI | http | .pdf ]
Short-interfering RNAs suppress gene expression through a highly regulated enzyme-mediated process called RNA interference (RNAi). RNAi involves multiple RNA-protein interactions characterized by four major steps: assembly of siRNA with the RNA-induced silencing complex (RISC), activation of the RISC, target recognition and target cleavage. These interactions may bias strand selection during siRNA-RISC assembly and activation, and contribute to the overall efficiency of RNAi. To identify siRNA-specific features likely to contribute to efficient processing at each step, we performed a systematic analysis of 180 siRNAs targeting the mRNA of two genes. Eight characteristics associated with siRNA functionality were identified: low G/C content, a bias towards low internal stability at the sense strand 3'-terminus, lack of inverted repeats, and sense strand base preferences (positions 3, 10, 13 and 19). Further analyses revealed that application of an algorithm incorporating all eight criteria significantly improves potent siRNA selection. This highlights the utility of rational design for selecting potent siRNAs and facilitating functional gene knockdown studies.

Keywords: sirna
[Rege2004Parallel] Kaushal Rege, Asif Ladiwala, Nihal Tugcu, Curt M Breneman, and Steven M Cramer. Parallel screening of selective and high-affinity displacers for proteins in ion-exchange systems. J Chromatogr A, 1033(1):19-28, Apr 2004. [ bib ]
This paper employs a parallel batch screening technique for the identification of both selective and high-affinity displacers for a model binary mixture of proteins in a cation-exchange system. A variety of molecules were screened as possible displacers for the proteins ribonuclease A (RNAseA) and alpha-chymotrypsinogen A (alpha-chyA) on high performance Sepharose SP. The batch screening data for each protein was used to select leads for selective and high-affinity displacers and column experiments were carried out to evaluate the performance of the selected leads. The data from the batch displacements was also employed to generate quantitative structure-efficacy relationship (QSER) models based on a support vector machine regression approach. The resulting models had high correlation coefficients and were able to predict the behaviour of molecules not included in the training set. The descriptors selected in the QSER models for both proteins were examined to provide insights into factors influencing displacer selectivity in ion-exchange systems. The results presented in this paper demonstrate that this parallel batch screening-QSER approach can be employed for the identification of selective and high-affinity displacers for protein mixtures.

[Reche2004Enhancement] Pedro A Reche, John-Paul Glutting, Hong Zhang, and Ellis L Reinherz. Enhancement to the RANKPEP resource for the prediction of peptide binding to MHC molecules using profiles. Immunogenetics, 56(6):405-419, Sep 2004. [ bib | DOI | http ]
We introduced previously an on-line resource, RANKPEP that uses position specific scoring matrices (PSSMs) or profiles for the prediction of peptide-MHC class I (MHCI) binding as a basis for CD8 T-cell epitope identification. Here, using PSSMs that are structurally consistent with the binding mode of MHC class II (MHCII) ligands, we have extended RANKPEP to prediction of peptide-MHCII binding and anticipation of CD4 T-cell epitopes. Currently, 88 and 50 different MHCI and MHCII molecules, respectively, can be targeted for peptide binding predictions in RANKPEP. Because appropriate processing of antigenic peptides must occur prior to major histocompatibility complex (MHC) binding, cleavage site prediction methods are important adjuncts for T-cell epitope discovery. Given that the C-terminus of most MHCI-restricted epitopes results from proteasomal cleavage, we have modeled the cleavage site from known MHCI-restricted epitopes using statistical language models. The RANKPEP server now determines whether the C-terminus of any predicted MHCI ligand may result from such proteasomal cleavage. Also implemented is a variability masking function. This feature focuses prediction on conserved rather than highly variable protein segments encoded by infectious genomes, thereby offering identification of invariant T-cell epitopes to thwart mutation as an immune evasion mechanism.

Keywords: Algorithms; Amino Acid Motifs; Antigen Presentation; Epitopes, T-Lymphocyte; Histocompatibility Antigens Class I; Histocompatibility Antigens Class II; Humans; Models, Molecular; Peptide Fragments; Research Support, Non-U.S. Gov't; Research Support, U.S. Gov't, P.H.S.; T-Lymphocytes
[Rahnenfuehrer2004Calculating] J. Rahnenführer, F. S. Domingues, J. Maydt, and T. Lengauer. Calculating the statistical significance of changes in pathway activity from gene expression data. Stat. Appl. Genet. Mol. Biol., 3:Article16, 2004. [ bib | DOI | http ]
We present a statistical approach to scoring changes in activity of metabolic pathways from gene expression data. The method identifies the biologically relevant pathways with corresponding statistical significance. Based on gene expression data alone, only local structures of genetic networks can be recovered. Instead of inferring such a network, we propose a hypothesis-based approach. We use given knowledge about biological networks to improve sensitivity and interpretability of findings from microarray experiments. Recently introduced methods test if members of predefined gene sets are enriched in a list of top-ranked genes in a microarray study. We improve this approach by defining scores that depend on all members of the gene set and that also take pairwise co-regulation of these genes into account. We calculate the significance of co-regulation of gene sets with a nonparametric permutation test. On two data sets the method is validated and its biological relevance is discussed. It turns out that useful measures for co-regulation of genes in a pathway can be identified adaptively. We refine our method in two aspects specific to pathways. First, to overcome the ambiguity of enzyme-to-gene mappings for a fixed pathway, we introduce algorithms for selecting the best fitting gene for a specific enzyme in a specific condition. In selected cases, functional assignment of genes to pathways is feasible. Second, the sensitivity of detecting relevant pathways is improved by integrating information about pathway topology. The distance of two enzymes is measured by the number of reactions needed to connect them, and enzyme pairs with a smaller distance receive a higher weight in the score calculation.

[Rahnenfuhrer2004] J Rahnenfuhrer, FS Domingues, J Maydt, and T. Lengauer. Calculating the statistical significance of changes in pathway activity from gene expression data. Statistical Applications in Genetics and Molecular Biology, 3(1):Article 16, 2004. [ bib ]
[Rahmann2004Mean] S. Rahmann and C. Gräfe. Mean and variance of the Gibbs free energy of oligonucleotides in the nearest neighbor model under varying conditions. Bioinformatics, 20(17):2928-33, Nov 2004. [ bib | DOI | http ]
MOTIVATION: In order to assess the stability of DNA-DNA hybridizations (for example during PCR primer design or oligonucleotide selection for microarrays), one needs to predict the change in Gibbs free energy ΔG during hybridization. The nearest neighbor model provides a good compromise between accuracy and computational simplicity for this task. To determine optimal combinations of reaction parameters (temperature, salt concentration, oligonucleotide length and GC-content), one would like to understand how ΔG depends on all of these parameters simultaneously. RESULTS: We derive analytic results about the distribution of nearest neighbor ΔG values for a Bernoulli random sequence model (specified by oligonucleotide length and average GC-content) under given experimental conditions. We find that the distribution of ΔG values is approximately Gaussian and provide exact formulas for expectation and variance.

[Qu2004Supervised] Yi Qu and Shizhong Xu. Supervised cluster analysis for microarray data based on multivariate Gaussian mixture. Bioinformatics, 20(12):1905-13, Aug 2004. [ bib | DOI | http | .pdf ]
MOTIVATION: Grouping genes having similar expression patterns is called gene clustering, which has proved to be a useful tool for extracting underlying biological information from gene expression data. Many clustering procedures have shown success in microarray gene clustering; most of them belong to the family of heuristic clustering algorithms. Model-based algorithms are alternative clustering algorithms, which are based on the assumption that the whole set of microarray data is a finite mixture of a certain type of distributions with different parameters. Application of the model-based algorithms to unsupervised clustering has been reported. Here, for the first time, we demonstrate the use of the model-based algorithm in supervised clustering of microarray data. RESULTS: We applied the proposed methods to real gene expression data and simulated data. We showed that the supervised model-based algorithm is superior to the unsupervised method and the support vector machines (SVM) method. AVAILABILITY: The program written in the SAS language implementing methods I-III in this report is available upon request. The SVM software is available on the website http://svm.sdsc.edu/cgi-bin/nph-SVMsubmit.cgi

[Qin2004Automated] Dong mei Qin, Zhan yi Hu, and Yong heng Zhao. Automated classification of celestial spectra based on support vector machines. Guang Pu Xue Yu Guang Pu Fen Xi, 24(4):507-11, Apr 2004. [ bib ]
The main objective of an automatic recognition system of celestial objects via their spectra is to classify celestial spectra and estimate physical parameters automatically. This paper proposes a new automatic classification method based on support vector machines to separate non-active objects from active objects via their spectra. With low SNR and unknown red-shift value, it is difficult to extract true spectral lines; as a result, active objects cannot be determined by finding strong spectral lines, and the spectral classification between non-active and active objects becomes difficult. The method proposed in this paper combines principal component analysis with support vector machines, and can automatically distinguish the spectra of active objects with unknown red-shift values from those of non-active objects. It is applicable to the automatic processing of voluminous observed data from large sky surveys in astronomy.

Keywords: 80 and over, Adult, Aged, Algorithms, Amino Acids, Animals, Area Under Curve, Artifacts, Automated, Birefringence, Brain Chemistry, Brain Neoplasms, Comparative Study, Computer-Assisted, Cornea, Cross-Sectional Studies, Decision Trees, Diagnosis, Diagnostic Imaging, Diagnostic Techniques, Discriminant Analysis, Evolution, Face, Female, Genetic, Glaucoma, Humans, Intraocular Pressure, Lasers, Least-Squares Analysis, Magnetic Resonance Imaging, Magnetic Resonance Spectroscopy, Male, Middle Aged, Models, Molecular, Nerve Fibers, Non-U.S. Gov't, Numerical Analysis, Ophthalmological, Optic Nerve Diseases, Optical Coherence, P.H.S., Pattern Recognition, Photic Stimulation, Prospective Studies, Protein, ROC Curve, Regression Analysis, Research Support, Retinal Ganglion Cells, Sensitivity and Specificity, Sequence Analysis, Statistics, Tomography, U.S. Gov't, Visual Fields, beta-Lactamases, 15766170
[Praz2004CleanEx] V. Praz, V. Jagannathan, and P. Bucher. CleanEx: a database of heterogeneous gene expression data based on a consistent gene nomenclature. Nucleic Acids Res., 32(Database issue):D542-D547, Jan 2004. [ bib | DOI | http | .pdf ]
The main goal of CleanEx is to provide access to public gene expression data via unique gene names. A second objective is to represent heterogeneous expression data produced by different technologies in a way that facilitates joint analysis and cross-data set comparisons. A consistent and up-to-date gene nomenclature is achieved by associating each single experiment with a permanent target identifier consisting of a physical description of the targeted RNA population or the hybridization reagent used. These targets are then mapped at regular intervals to the growing and evolving catalogues of human genes and genes from model organisms. The completely automatic mapping procedure relies partly on external genome information resources such as UniGene and RefSeq. The central part of CleanEx is a weekly built gene index containing cross-references to all public expression data already incorporated into the system. In addition, the expression target database of CleanEx provides gene mapping and quality control information for various types of experimental resource, such as cDNA clones or Affymetrix probe sets. The web-based query interfaces offer access to individual entries via text string searches or quantitative expression criteria. CleanEx is accessible at: http://www.cleanex.isb-sib.ch/.

[Prados2004Mining] J. Prados, A. Kalousis, J.C. Sanchez, L. Allard, O. Carrette, and M. Hilario. Mining mass spectra for diagnosis and biomarker discovery of cerebral accidents. Proteomics, 4(8):2320-2332, 2004. [ bib | DOI | http | .pdf ]
In this paper we try to identify potential biomarkers for early stroke diagnosis using surface-enhanced laser desorption/ionization mass spectrometry coupled with analysis tools from machine learning and data mining. Data consist of 42 specimen samples, i.e., mass spectra divided in two big categories, stroke and control specimens. Among the stroke specimens two further categories exist that correspond to ischemic and hemorrhagic stroke; in this paper we limit our data analysis to discriminating between control and stroke specimens. We performed two suites of experiments. In the first one we simply applied a number of different machine learning algorithms; in the second one we have chosen the best performing algorithm as it was determined from the first phase and coupled it with a number of different feature selection methods. The reason for this was 2-fold, first to establish whether feature selection can indeed improve performance, which in our case it did not seem to confirm, but more importantly to acquire a small list of potentially interesting biomarkers. Of the different methods explored the most promising one was support vector machines which gave us high levels of sensitivity and specificity. Finally, by analyzing the models constructed by support vector machines we produced a small set of 13 features that could be used as potential biomarkers, and which exhibited good performance both in terms of sensitivity, specificity and model stability.

Keywords: biosvm proteomics
[Pochet2004Systematic] N. Pochet, F. De Smet, J. A. K. Suykens, and B. L. R. De Moor. Systematic benchmarking of microarray data classification: assessing the role of non-linearity and dimensionality reduction. Bioinformatics, 20(17):3185-3195, Nov 2004. [ bib | DOI | http | .pdf ]
Motivation: Microarrays are capable of determining the expression levels of thousands of genes simultaneously. In combination with classification methods, this technology can be useful to support clinical management decisions for individual patients, e.g. in oncology. The aim of this paper is to systematically benchmark the role of non-linear versus linear techniques and dimensionality reduction methods. Results: A systematic benchmarking study is performed by comparing linear versions of standard classification and dimensionality reduction techniques with their non-linear versions based on non-linear kernel functions with a radial basis function (RBF) kernel. A total of 9 binary cancer classification problems, derived from 7 publicly available microarray datasets, and 20 randomizations of each problem are examined. Conclusions: Three main conclusions can be formulated based on the performances on independent test sets. (1) When performing classification with least squares support vector machines (LS-SVMs) (without dimensionality reduction), RBF kernels can be used without risking too much overfitting. The results obtained with well-tuned RBF kernels are never worse and sometimes even statistically significantly better compared to results obtained with a linear kernel in terms of test set receiver operating characteristic and test set accuracy performances. (2) Even for classification with linear classifiers like LS-SVM with linear kernel, using regularization is very important. (3) When performing kernel principal component analysis (kernel PCA) before classification, using an RBF kernel for kernel PCA tends to result in overfitting, especially when using supervised feature selection. It has been observed that an optimal selection of a large number of features is often an indication for overfitting. Kernel PCA with linear kernel gives better results. Availability: Matlab scripts are available on request. 
Supplementary information: http://www.esat.kuleuven.ac.be/~npochet/Bioinformatics/

Keywords: biosvm microarray
[Piliouras2004Development] N. Piliouras, I. Kalatzis, N. Dimitropoulos, and D. Cavouras. Development of the cubic least squares mapping linear-kernel support vector machine classifier for improving the characterization of breast lesions on ultrasound. Comput Med Imaging Graph, 28(5):247-55, Jul 2004. [ bib | DOI | http ]
An efficient classification algorithm is proposed for characterizing breast lesions. The algorithm is based on the cubic least squares mapping and the linear-kernel support vector machine (SVM(LSM)) classifier. Ultrasound images of 154 confirmed lesions (59 benign and 52 malignant solid masses, 7 simple cysts, and 32 complicated cysts) were manually segmented by a physician using custom-developed software. Texture and outline features and the SVM(LSM) algorithm were used to design a hierarchical tree classification system. Classification accuracy was 98.7%, misdiagnosing only 1 malignant and 1 benign solid lesion. This system may be used as a second-opinion tool for radiologists.

[Perlman2004Multidimensional] Z. E. Perlman, M. D. Slack, Y. Feng, T. J. Mitchison, L. F. Wu, and S. J. Altschuler. Multidimensional drug profiling by automated microscopy. Science, 306(5699):1194-1198, Nov 2004. [ bib | DOI | http | .pdf ]
We present a method for high-throughput cytological profiling by microscopy. Our system provides quantitative multidimensional measures of individual cell states over wide ranges of perturbations. We profile dose-dependent phenotypic effects of drugs in human cell culture with a titration-invariant similarity score (TISS). This method successfully categorized blinded drugs and suggested targets for drugs of uncertain mechanism. Multivariate single-cell analysis is a starting point for identifying relationships among drug effects at a systems level and a step toward phenotypic profiling at the single-cell level. Our methods will be useful for discovering the mechanism and predicting the toxicity of new drugs.

Keywords: chemogenomics, highcontentscreening
[Pavlidis2004Support] Paul Pavlidis, Ilan Wapinski, and William Stafford Noble. Support vector machine classification on the web. Bioinformatics, 20(4):586-7, Mar 2004. [ bib | DOI | http | .pdf ]
The support vector machine (SVM) learning algorithm has been widely applied in bioinformatics. We have developed a simple web interface to our implementation of the SVM algorithm, called Gist. This interface allows novice or occasional users to apply a sophisticated machine learning algorithm easily to their data. More advanced users can download the software and source code for local installation. The availability of these tools will permit more widespread application of this powerful learning algorithm in bioinformatics.

Keywords: Adaptation, Algorithms, Ambergris, Amino Acid Sequence, Animals, Artifacts, Artificial Intelligence, Automated, Cadmium, Candida, Candida albicans, Capillary, Clinical, Cluster Analysis, Combinatorial Chemistry Techniques, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, Computing Methodologies, Databases, Decision Support Systems, Electrophoresis, Enzymes, Europe, Eye Enucleation, Humans, Image Interpretation, Image Processing, Information Storage and Retrieval, Internet, Magnetic Resonance Imaging, Magnetic Resonance Spectroscopy, Markov Chains, Melanoma, Models, Molecular, Molecular Conformation, Molecular Sequence Data, Molecular Structure, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Odors, P.H.S., Pattern Recognition, Perfume, Physiological, Predictive Value of Tests, Prognosis, Prospective Studies, Protein, Protein Structure, Proteins, Proteomics, Quantitative Structure-Activity Relationship, Rats, Reproducibility of Results, Research Support, Saccharomyces cerevisiae, Saccharomyces cerevisiae Proteins, Secondary, Sensitivity and Specificity, Signal Processing, Single-Blind Method, Soft Tissue Neoplasms, Software, Statistical, U.S. Gov't, Uveal Neoplasms, Visual, 14990457
[Passerini2004New] Andrea Passerini, Massimiliano Pontil, and Paolo Frasconi. New results on error correcting output codes of kernel machines. IEEE Trans Neural Netw, 15(1):45-54, Jan 2004. [ bib ]
We study the problem of multiclass classification within the framework of error correcting output codes (ECOC) using margin-based binary classifiers. Specifically, we address two important open problems in this context: decoding and model selection. The decoding problem concerns how to map the outputs of the classifiers into class codewords. In this paper we introduce a new decoding function that combines the margins through an estimate of their class conditional probabilities. Concerning model selection, we present new theoretical results bounding the leave-one-out (LOO) error of ECOC of kernel machines, which can be used to tune kernel hyperparameters. We report experiments using support vector machines as the base binary classifiers, showing the advantage of the proposed decoding function over other functions of the margin commonly used in practice. Moreover, our empirical evaluations on model selection indicate that the bound leads to good estimates of kernel parameters.

Keywords: Neural Networks (Computer), Research Design, 15387246
[Passerini2004Learning] A. Passerini and P. Frasconi. Learning to discriminate between ligand-bound and disulfide-bound cysteines. Protein Eng. Des. Sel., 17(4):367-373, 2004. [ bib | DOI | http | .pdf ]
We present a machine learning method to discriminate between cysteines involved in ligand binding and cysteines forming disulfide bridges. Our method uses a window of multiple alignment profiles to represent each instance and support vector machines with a polynomial kernel as the learning algorithm. We also report results obtained with two new kernel functions based on similarity matrices. Experimental results indicate that binding type can be predicted at significantly higher accuracy than using PROSITE patterns.

Keywords: biosvm
[Parham2004The] Peter Parham. The Immune System. Garland Science Publishing, 2004. [ bib ]
[Pan2004Comprehensive] Fei Pan, Baoying Wang, Xin Hu, and William Perrizo. Comprehensive vertical sample-based KNN/LSVM classification for gene expression analysis. J Biomed Inform, 37(4):240-8, Aug 2004. [ bib | DOI | http | .pdf ]
Classification analysis of microarray gene expression data has been widely used to uncover biological features and to distinguish closely related cell types that often appear in the diagnosis of cancer. However, the number of dimensions of gene expression data is often very high, e.g., in the hundreds or thousands. Accurate and efficient classification of such high-dimensional data remains a contemporary challenge. In this paper, we propose a comprehensive vertical sample-based KNN/LSVM classification approach with weights optimized by genetic algorithms for high-dimensional data. Experiments on common gene expression datasets demonstrated that our approach can achieve high accuracy and efficiency at the same time. The improvement of speed is mainly related to the vertical data representation, P-tree (patents are pending on the P-tree technology; this work is partially supported by GSA Grant ACT#:K96130308), and its optimized logical algebra. The high accuracy is due to the combination of a KNN majority voting approach and a local support vector machine approach that makes optimal decisions at the local level. As a result, our approach could be a powerful tool for high-dimensional gene expression data analysis.

Keywords: biosvm
[Pajares2004On] Gonzalo Pajares and Jesús M de la Cruz. On combining support vector machines and simulated annealing in stereovision matching. IEEE Trans Syst Man Cybern B Cybern, 34(4):1646-57, Aug 2004. [ bib ]
This paper outlines a method for solving the stereovision matching problem using edge segments as the primitives. In stereovision matching, the following constraints are commonly used: epipolar, similarity, smoothness, ordering, and uniqueness. We propose a new strategy in which such constraints are sequentially combined. The goal is to achieve high performance in terms of correct matches by combining several strategies. The contributions of this paper are reflected in the development of a similarity measure through a support vector machines classification approach; the transformation of the smoothness, ordering and epipolar constraints into the form of an energy function, through an optimization simulated annealing approach, whose minimum value corresponds to a good matching solution and by introducing specific conditions to overcome the violation of the smoothness and ordering constraints. The performance of the proposed method is illustrated by comparative analysis against some recent global matching methods.

[Osowski2004Support] Stanislaw Osowski, Linh Tran Hoai, and Tomasz Markiewicz. Support vector machine-based expert system for reliable heartbeat recognition. IEEE Trans Biomed Eng, 51(4):582-9, Apr 2004. [ bib ]
This paper presents a new solution to the expert system for reliable heartbeat recognition. The recognition system uses the support vector machine (SVM) working in the classification mode. Two different preprocessing methods for generation of features are applied. One method involves the higher order statistics (HOS) while the second the Hermite characterization of QRS complex of the registered electrocardiogram (ECG) waveform. Combining the SVM network with these preprocessing methods yields two neural classifiers, which have been combined into one final expert system. The combination of classifiers utilizes the least mean square method to optimize the weights of the weighted voting integrating scheme. The results of the performed numerical experiments for the recognition of 13 heart rhythm types on the basis of ECG waveforms confirmed the reliability and advantage of the proposed approach.

[Okada2004retinal] T. Okada, M. Sugihara, A.-N. Bondar, M. Elstner, P. Entel, and V. Buss. The retinal conformation and its environment in rhodopsin in light of a new 2.2 Å crystal structure. J. Mol. Biol., 342(2):571-583, Sep 2004. [ bib | DOI | http ]
A new high-resolution structure is reported for bovine rhodopsin, the visual pigment in rod photoreceptor cells. Substantial improvement of the resolution limit to 2.2 Å has been achieved by new crystallization conditions, which also reduce significantly the probability of merohedral twinning in the crystals. The new structure completely resolves the polypeptide chain and provides further details of the chromophore binding site including the configuration about the C6-C7 single bond of the 11-cis-retinal Schiff base. Based on both an earlier structure and the new improved model of the protein, a theoretical study of the chromophore geometry has been carried out using combined quantum mechanics/force field molecular dynamics. The consistency between the experimental and calculated chromophore structures is found to be significantly improved for the 2.2 Å model, including the angle of the negatively twisted 6-s-cis-bond. Importantly, the new crystal structure refinement reveals significant negative pre-twist of the C11-C12 double bond and this is also supported by the theoretical calculation although the latter converges to a smaller value. Bond alternation along the unsaturated chain is significant, but weaker in the calculated structure than the one obtained from the X-ray data. Other differences between the experimental and theoretical structures in the chromophore binding site are discussed with respect to the unique spectral properties and excited state reactivity of the chromophore.

Keywords: chemogenomics
[ODriscoll2004Virtual] Cath O'Driscoll. A Virtual Space Odyssey. Nature Horizon : Charting Chemical Space, 2004. [ bib ]
[Noble2004Support] W. S. Noble. Support vector machine applications in computational biology. In B. Schölkopf, K. Tsuda, and J.P. Vert, editors, Kernel Methods in Computational Biology, pages 71-92. MIT Press, 2004. [ bib | .pdf ]
During the past three years, the support vector machine learning algorithm has been extensively applied within the field of computational biology. The algorithm has been used to detect patterns within and among biological sequences, to classify genes and patients based upon gene expression profiles, and has recently been applied to several new biological problems. This chapter reviews the state of the art with respect to SVM applications in computational biology.

Keywords: biosvm
[Nilsson2004Approximate] J. Nilsson, T. Fioretos, M. Höglund, and M. Fontes. Approximate geodesic distances reveal biologically relevant structures in microarray data. Bioinformatics, 20(6):874-80, Apr 2004. [ bib | DOI | http | .pdf ]
MOTIVATION: Genome-wide gene expression measurements, as currently determined by the microarray technology, can be represented mathematically as points in a high-dimensional gene expression space. Genes interact with each other in regulatory networks, restricting the cellular gene expression profiles to a certain manifold, or surface, in gene expression space. To obtain knowledge about this manifold, various dimensionality reduction methods and distance metrics are used. For data points distributed on curved manifolds, a sensible distance measure would be the geodesic distance along the manifold. In this work, we examine whether an approximate geodesic distance measure captures biological similarities better than the traditionally used Euclidean distance. RESULTS: We computed approximate geodesic distances, determined by the Isomap algorithm, for one set of lymphoma and one set of lung cancer microarray samples. Compared with the ordinary Euclidean distance metric, this distance measure produced more instructive, biologically relevant, visualizations when applying multidimensional scaling. This suggests the Isomap algorithm as a promising tool for the interpretation of microarray data. Furthermore, the results demonstrate the benefit and importance of taking nonlinearities in gene expression data into account.

Keywords: dimred
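The approximate geodesic distance mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the usual Isomap recipe (a k-nearest-neighbour graph followed by all-pairs shortest paths) and omits the final multidimensional scaling step. The function name, the choice of Floyd-Warshall, and the toy data are illustrative assumptions, not the authors' code.

```python
import math


def approximate_geodesic_distances(points, k):
    """Approximate geodesic distances via shortest paths in a k-NN graph.

    Sketch of the distance step of Isomap only (no embedding is computed).
    Pairs that are not connected in the graph keep distance math.inf.
    """
    n = len(points)
    euclid = [[math.dist(p, q) for q in points] for p in points]

    # Start with no edges: zero on the diagonal, infinity elsewhere.
    d = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]

    # Connect each point to its k nearest neighbours (symmetric edges).
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: euclid[i][j])
        for j in others[:k]:
            d[i][j] = d[j][i] = euclid[i][j]

    # Floyd-Warshall all-pairs shortest paths; fine for small n.
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][m] + d[m][j] < d[i][j]:
                    d[i][j] = d[i][m] + d[m][j]
    return d


# Toy example: five points along an L-shaped path.
corner_path = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
geodesic = approximate_geodesic_distances(corner_path, k=2)
```

On this toy path the graph distance between the two end points follows the bend, so it is longer than the straight-line Euclidean distance between them.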
[Nielsen2004Improved] M. Nielsen, C. Lundegaard, P. Worning, C. S. Hvid, K. Lamberth, S. Buus, S. Brunak, and O. Lund. Improved prediction of MHC class I and class II epitopes using a novel Gibbs sampling approach. Bioinformatics, 20(9):1388-1397, Jun 2004. [ bib | DOI | http ]
MOTIVATION: Prediction of which peptides will bind a specific major histocompatibility complex (MHC) constitutes an important step in identifying potential T-cell epitopes suitable as vaccine candidates. MHC class II binding peptides have a broad length distribution complicating such predictions. Thus, identifying the correct alignment is a crucial part of identifying the core of an MHC class II binding motif. In this context, we wish to describe a novel Gibbs motif sampler method ideally suited for recognizing such weak sequence motifs. The method is based on the Gibbs sampling method, and it incorporates novel features optimized for the task of recognizing the binding motif of MHC classes I and II. The method locates the binding motif in a set of sequences and characterizes the motif in terms of a weight-matrix. Subsequently, the weight-matrix can be applied to identifying effectively potential MHC binding peptides and to guiding the process of rational vaccine design. RESULTS: We apply the motif sampler method to the complex problem of MHC class II binding. The input to the method is amino acid peptide sequences extracted from the public databases of SYFPEITHI and MHCPEP and known to bind to the MHC class II complex HLA-DR4(B1*0401). Prior identification of information-rich (anchor) positions in the binding motif is shown to improve the predictive performance of the Gibbs sampler. Similarly, a consensus solution obtained from an ensemble average over suboptimal solutions is shown to outperform the use of a single optimal solution. In a large-scale benchmark calculation, the performance is quantified using relative operating characteristics curve (ROC) plots and we make a detailed comparison of the performance with that of both the TEPITOPE method and a weight-matrix derived using the conventional alignment algorithm of ClustalW. 
The calculation demonstrates that the predictive performance of the Gibbs sampler is higher than that of ClustalW and in most cases also higher than that of the TEPITOPE method.

Keywords: immunoinformatics
[Natt2004Prediction] N.K. Natt, H. Kaur, and G.P. Raghava. Prediction of transmembrane regions of beta-barrel proteins using ANN- and SVM-based methods. Proteins, 56(1):11-18, 2004. [ bib | DOI | http | .pdf ]
This article describes a method developed for predicting transmembrane beta-barrel regions in membrane proteins using machine learning techniques: artificial neural network (ANN) and support vector machine (SVM). The ANN used in this study is a feed-forward neural network with a standard back-propagation training algorithm. The accuracy of the ANN-based method improved significantly, from 70.4 when evolutionary information was added to a single sequence as a multiple sequence alignment obtained from PSI-BLAST. We have also developed an SVM-based method using a primary sequence as input and achieved an accuracy of 77.4 by adding 36 physicochemical parameters to the amino acid sequence information. Finally, ANN- and SVM-based methods were combined to utilize the full potential of both techniques. The accuracy and Matthews correlation coefficient (MCC) value of SVM, ANN, and combined method are 78.5 and 0.64, respectively. These methods were trained and tested on a nonredundant data set of 16 proteins, and performance was evaluated using "leave one out cross-validation" (LOOCV). Based on this study, we have developed a Web server, TBBPred, for predicting transmembrane beta-barrel regions in proteins (available at http://www.imtech.res.in/raghava/tbbpred).

Keywords: biosvm
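The Matthews correlation coefficient (MCC) reported in the abstract above is a standard summary of a binary confusion matrix. As a minimal sketch, it can be computed as follows. The function name and the zero-denominator convention are assumptions made for illustration, and the counts are not taken from the paper.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    tp, tn, fp, fn are the numbers of true positives, true negatives,
    false positives and false negatives.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: return 0 when a row or column total is zero.
        return 0.0
    return (tp * tn - fp * fn) / denom
```

The value ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (perfect prediction).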
[Morley2004Genetic] Michael Morley, Cliona M. Molony, Teresa M. Weber, James L. Devlin, Kathryn G. Ewens, Richard S. Spielman, and Vivian G. Cheung. Genetic analysis of genome-wide variation in human gene expression. Nature, 430(7001):743-747, Aug 2004. [ bib | DOI | http ]
Natural variation in gene expression is extensive in humans and other organisms, and variation in the baseline expression level of many genes has a heritable component. To localize the genetic determinants of these quantitative traits (expression phenotypes) in humans, we used microarrays to measure gene expression levels and performed genome-wide linkage analysis for expression levels of 3,554 genes in 14 large families. For approximately 1,000 expression phenotypes, there was significant evidence of linkage to specific chromosomal regions. Both cis- and trans-acting loci regulate variation in the expression levels of genes, although most act in trans. Many gene expression phenotypes are influenced by several genetic determinants. Furthermore, we found hotspots of transcriptional regulation where significant evidence of linkage for several expression phenotypes (up to 31) coincides, and expression levels of many genes that share the same regulatory region are significantly correlated. The combination of microarray techniques for phenotyping and linkage analysis for quantitative traits allows the genetic mapping of determinants that contribute to variation in human gene expression.

[Mitra2004probabilistic] Pabitra Mitra, C. A. Murthy, and Sankar K Pal. A probabilistic active support vector learning algorithm. IEEE Trans Pattern Anal Mach Intell, 26(3):413-8, Mar 2004. [ bib ]
The paper describes a probabilistic active learning strategy for support vector machine (SVM) design in large data applications. The learning strategy is motivated by the statistical query model. While most existing methods of active SVM learning query for points based on their proximity to the current separating hyperplane, the proposed method queries for a set of points according to a distribution as determined by the current separating hyperplane and a newly defined concept of an adaptive confidence factor. This enables the algorithm to have more robust and efficient learning capabilities. The confidence factor is estimated from local information using the k nearest neighbor principle. The effectiveness of the method is demonstrated on real-life data sets both in terms of generalization performance, query complexity, and training time.

[Mika2004Protein] Sven Mika and Burkhard Rost. Protein names precisely peeled off free text. Bioinformatics, 20(Suppl. 1):i241-i247, 2004. [ bib | http | .pdf ]
Motivation: Automatically identifying protein names from the scientific literature is a pre-requisite for the increasing demand in data-mining this wealth of information. Existing approaches are based on dictionaries, rules and machine-learning. Here, we introduced a novel system that combines a pre-processing dictionary- and rule-based filtering step with several separately trained support vector machines (SVMs) to identify protein names in the MEDLINE abstracts. Results: Our new tagging-system NLProt is capable of extracting protein names with a precision (accuracy) of 75% at a recall of 76% [...] and contains 200 annotated abstracts. For our estimate of sustained performance, we considered partially identified names as false positives. One important issue frequently ignored in the literature is the redundancy in evaluation sets. We suggested some guidelines for removing overly inadequate overlaps between training and testing sets. Applying these new guidelines, our program appeared to significantly out-perform other methods tagging protein names. NLProt was so successful due to the SVM-building blocks that succeeded in utilizing the local context of protein names in the scientific literature. We challenge that our system may constitute the most general and precise method for tagging protein names. Availability: http://cubic.bioc.columbia.edu/services/nlprot/

Keywords: biosvm nlp
[Mika2004NLProt] Sven Mika and Burkhard Rost. NLProt: extracting protein names and sequences from papers. Nucleic Acids Res, 32(Web Server issue):W634-7, Jul 2004. [ bib | DOI | http ]
Automatically extracting protein names from the literature and linking these names to the associated entries in sequence databases is becoming increasingly important for annotating biological databases. NLProt is a novel system that combines dictionary- and rule-based filtering with several support vector machines (SVMs) to tag protein names in PubMed abstracts. When considering partially tagged names as errors, NLProt still reached a precision of 75% at a recall of 76%. By many criteria our system outperformed other tagging methods significantly; in particular, it proved very reliable even for novel names. Names encountered particularly frequently in Drosophila, such as white, wing and bizarre, constitute an obvious limitation of NLProt. Our method is available both as an Internet server and as a program for download (http://cubic.bioc.columbia.edu/services/NLProt/). Input can be PubMed/MEDLINE identifiers, authors, titles and journals, as well as collections of abstracts, or entire papers.

Keywords: biosvm nlp
[Middendorf2004Discriminative] M. Middendorf, E. Ziv, C. Adams, J. Hom, R. Koytcheff, C. Levovitz, G. Woods, L. Chen, and C. Wiggins. Discriminative topological features reveal biological network mechanisms. BMC Bioinformatics, 5(181), 2004. [ bib | DOI | http | .pdf ]
BACKGROUND: Recent genomic and bioinformatic advances have motivated the development of numerous network models intending to describe graphs of biological, technological, and sociological origin. In most cases the success of a model has been evaluated by how well it reproduces a few key features of the real-world data, such as degree distributions, mean geodesic lengths, and clustering coefficients. Often pairs of models can reproduce these features with indistinguishable fidelity despite being generated by vastly different mechanisms. In such cases, these few target features are insufficient to distinguish which of the different models best describes real world networks of interest; moreover, it is not clear a priori that any of the presently-existing algorithms for network generation offers a predictive description of the networks inspiring them. RESULTS: We present a method to assess systematically which of a set of proposed network generation algorithms gives the most accurate description of a given biological network. To derive discriminative classifiers, we construct a mapping from the set of all graphs to a high-dimensional (in principle infinite-dimensional) "word space". This map defines an input space for classification schemes which allow us to state unambiguously which models are most descriptive of a given network of interest. Our training sets include networks generated from 17 models either drawn from the literature or introduced in this work. We show that different duplication-mutation schemes best describe the E. coli genetic network, the S. cerevisiae protein interaction network, and the C. elegans neuronal network, out of a set of network models including a linear preferential attachment model and a small-world model. CONCLUSIONS: Our method is a first step towards systematizing network models and assessing their predictability, and we anticipate its usefulness for a number of communities.

Keywords: biosvm
[Meron2004Finite-memory] E. Meron and M. Feder. Finite-memory universal prediction of individual sequences. IEEE Trans. Inform. Theory, 50(7):1506-1523, Jul 2004. [ bib | .pdf ]
The problem of predicting the next outcome of an individual binary sequence under the constraint that the universal predictor has a finite memory, is explored. In this analysis, the finite-memory universal predictors are either deterministic or random time-invariant finite-state (FS) machines with K states (K-state machines). The paper provides bounds on the asymptotic achievable regret of these constrained universal predictors as a function of K, the number of their states, for long enough sequences. The specific results are as follows. When the universal predictors are deterministic machines, the comparison class consists of constant predictors, and prediction is with respect to the 0-1 loss function (Hamming distance), we get tight bounds indicating that the optimal asymptotic regret is 1/(2K). In that case of K-state deterministic universal predictors, the constant predictors comparison class, but prediction is with respect to the self-information (code length) and the square-error loss functions, we show an upper bound on the regret (coding redundancy) of O(K^{-2/3}) and a lower bound of Θ(K^{-4/5}). For these loss functions, if the predictor is allowed to be a random K-state machine, i.e., a machine with random state transitions, we get a lower bound of Θ(1/K) on the regret, with a matching upper bound of O(1/K) for the square-error loss, and an upper bound of O(log K/K) for the self-information loss. In addition, we provide results for all these loss functions in the case where the comparison class consists of all predictors that are order-L Markov machines.

Keywords: information-theory source-coding
[Merkwirth2004Ensemble] Christian Merkwirth, Harald Mauser, Tanja Schulz-Gasch, Olivier Roche, Martin Stahl, and Thomas Lengauer. Ensemble methods for classification in cheminformatics. J Chem Inf Comput Sci, 44(6):1971-8, 2004. [ bib | DOI | http | .pdf ]
We describe the application of ensemble methods to binary classification problems on two pharmaceutical compound data sets. Several variants of single and ensemble models of k-nearest neighbors classifiers, support vector machines (SVMs), and single ridge regression models are compared. All methods exhibit robust classification even when more features are given than observations. On two data sets dealing with specific properties of drug-like substances (cytochrome P450 inhibition and "Frequent Hitters", i.e., unspecific protein inhibition), we achieve classification rates above 90%. We are able to reduce the cross-validated misclassification rate for the Frequent Hitters problem by a factor of 2 compared to previous results obtained for the same data set with different modeling techniques.

Keywords: chemoinformatics
[Mercier2004Biological] G. Mercier, N. Berthault, J. Mary, J. Peyre, A. Antoniadis, J.-P. Comet, A. Cornuejols, C. Froidevaux, and M. Dutreix. Biological detection of low radiation doses by combining results of two microarray analysis methods. Nucleic Acids Res., 32(1):e12, 2004. [ bib | DOI | http ]
The accurate determination of the biological effects of low doses of pollutants is a major public health challenge. DNA microarrays are a powerful tool for investigating small intracellular changes. However, the inherent low reliability of this technique, the small number of replicates and the lack of suitable statistical methods for the analysis of such a large number of attributes (genes) impair accurate data interpretation. To overcome this problem, we combined results of two independent analysis methods (ANOVA and RELIEF). We applied this analysis protocol to compare gene expression patterns in Saccharomyces cerevisiae growing in the absence and continuous presence of varying low doses of radiation. Global distribution analysis highlights the importance of mitochondrial membrane functions in the response. We demonstrate that microarrays detect cellular changes induced by irradiation at doses that are 1000-fold lower than the minimal dose associated with mutagenic effects.

[Puijalon2004Malaria] O. Mercereau-Puijalon. Malaria research in the post-genomic era. J. Soc. Biol., 198(3):193-197, 2004. [ bib ]
Genomic sequence determination of Plasmodium falciparum and other species of the genus, as well as that of Anopheles gambiae, and human, rat and mouse genome sequencing have completely changed the landscape of fundamental research about malaria. These data should urgently be exploited in order to develop new tools to combat the disease: new drugs, fine dissection of the cascade of events following infection of the various vector species and vertebrate host, analysis of the complex interaction leading to the pathology or, inversely, contributing to sustained protection. Powerful population biology tools are now available, making it possible to investigate genetic exchanges within natural populations and to identify factors structuring parasitic and vector populations. Nevertheless, important impediments persist, including the complexity of experimental systems and the unclear relevance of animal models. Numerous challenges are to be faced; they call for a more efficient organisation of research efforts in the systematic explorations using the powerful novel post-genomic technologies, as well as the development of new tools and experimental models required by functional genomics and integrative biology.

Keywords: plasmodium
[Mello2004Revealing] Craig C. Mello and Darryl Conte Jr. Revealing the world of RNA interference. Nature, 431:338-342, 2004. [ bib ]
Keywords: csbcbook
[Meister2004Mechanisms] G. Meister and T. Tuschl. Mechanisms of gene silencing by double-stranded RNA. Nature, 431(7006):343-9, Sep 2004. [ bib | DOI | http | .pdf ]
Double-stranded RNA (dsRNA) is an important regulator of gene expression in many eukaryotes. It triggers different types of gene silencing that are collectively referred to as RNA silencing or RNA interference. A key step in known silencing pathways is the processing of dsRNAs into short RNA duplexes of characteristic size and structure. These short dsRNAs guide RNA silencing by specific and distinct mechanisms. Many components of the RNA silencing machinery still need to be identified and characterized, but a more complete understanding of the process is imminent.

Keywords: sirna
[Meinicke2004Oligo] P. Meinicke, M. Tech, B. Morgenstern, and R. Merkl. Oligo kernels for datamining on biological sequences: a case study on prokaryotic translation initiation sites. BMC Bioinformatics, 5(169), 2004. [ bib | DOI | http | .pdf ]
Background: Kernel-based learning algorithms are among the most advanced machine learning methods and have been successfully applied to a variety of sequence classification tasks within the field of bioinformatics. Conventional kernels utilized so far do not provide an easy interpretation of the learnt representations in terms of positional and compositional variability of the underlying biological signals. Results: We propose a kernel-based approach to datamining on biological sequences. With our method it is possible to model and analyze positional variability of oligomers of any length in a natural way. On one hand this is achieved by mapping the sequences to an intuitive but high-dimensional feature space, well-suited for interpretation of the learnt models. On the other hand, by means of the kernel trick we can provide a general learning algorithm for that high-dimensional representation because all required statistics can be computed without performing an explicit feature space mapping of the sequences. By introducing a kernel parameter that controls the degree of position-dependency, our feature space representation can be tailored to the characteristics of the biological problem at hand. A regularized learning scheme enables application even to biological problems for which only small sets of example sequences are available. Our approach includes a visualization method for transparent representation of characteristic sequence features. Thereby importance of features can be measured in terms of discriminative strength with respect to classification of the underlying sequences. To demonstrate and validate our concept on a biochemically well-defined case, we analyze E. coli translation initiation sites in order to show that we can find biologically relevant signals. For that case, our results clearly show that the Shine-Dalgarno sequence is the most important signal upstream of a start codon. The variability in position and composition we found for that signal is in accordance with previous biological knowledge. We also find evidence for signals downstream of the start codon, previously introduced as transcriptional enhancers. These signals are mainly characterized by occurrences of adenine in a region of about 4 nucleotides next to the start codon. Conclusions: We showed that the oligo kernel can provide a valuable tool for the analysis of relevant signals in biological sequences. In the case of translation initiation sites we could clearly deduce the most discriminative motifs and their positional variation from example sequences. Attractive features of our approach are its flexibility with respect to oligomer length and position conservation. By means of these two parameters oligo kernels can easily be adapted to different biological problems.

Keywords: biosvm
[McAuliffe2004Multiple-sequence] J. D. McAuliffe, L. Pachter, and M. I. Jordan. Multiple-sequence functional annotation and the generalized hidden Markov phylogeny. Bioinformatics, 20(12):1850-1860, Aug 2004. [ bib | DOI | http | .pdf ]
MOTIVATION: Phylogenetic shadowing is a comparative genomics principle that allows for the discovery of conserved regions in sequences from multiple closely related organisms. We develop a formal probabilistic framework for combining phylogenetic shadowing with feature-based functional annotation methods. The resulting model, a generalized hidden Markov phylogeny (GHMP), applies to a variety of situations where functional regions are to be inferred from evolutionary constraints. RESULTS: We show how GHMPs can be used to predict complete shared gene structures in multiple primate sequences. We also describe shadower, our implementation of such a prediction system. We find that shadower outperforms previously reported ab initio gene finders, including comparative human-mouse approaches, on a small sample of diverse exonic regions. Finally, we report on an empirical analysis of shadower's performance which reveals that as few as five well-chosen species may suffice to attain maximal sensitivity and specificity in exon demarcation. AVAILABILITY: A Web server is available at http://bonaire.lbl.gov/shadower

Keywords: biogm
[Mattfeldt2004Classification] Torsten Mattfeldt, Danilo Trijic, Hans-Werner Gottfried, and Hans A Kestler. Classification of incidental carcinoma of the prostate using learning vector quantization and support vector machines. Cell Oncol, 26(1-2):45-55, 2004. [ bib ]
The subclassification of incidental prostatic carcinoma into the categories T1a and T1b is of major prognostic and therapeutic relevance. In this paper an attempt was made to find out which properties mainly predispose to these two tumor categories, and whether it is possible to predict the category from a battery of clinical and histopathological variables using newer methods of multivariate data analysis. The incidental prostatic carcinomas of the decade 1990-99 diagnosed at our department were reexamined. Besides acquisition of routine clinical and pathological data, the tumours were scored by immunohistochemistry for proliferative activity and p53-overexpression. Tumour vascularization (angiogenesis) and epithelial texture were investigated by quantitative stereology. Learning vector quantization (LVQ) and support vector machines (SVM) were used for the purpose of prediction of tumour category from a set of 10 input variables (age, Gleason score, preoperative PSA value, immunohistochemical scores for proliferation and p53-overexpression, 3 stereological parameters of angiogenesis, 2 stereological parameters of epithelial texture). In a stepwise logistic regression analysis with the tumour categories T1a and T1b as dependent variables, only the Gleason score and the volume fraction of epithelial cells proved to be significant as independent predictor variables of the tumour category. Using LVQ and SVM with the information from all 10 input variables, more than 80% of the cases could be correctly predicted as T1a or T1b category with specificity, sensitivity, negative and positive predictive value from 74-92%. Using only the two significant input variables Gleason score and epithelial volume fraction, the accuracy of prediction was not worse. Thus, descriptive and quantitative texture parameters of tumour cells are of major importance for the extent of propagation in the prostate gland in incidental prostatic adenocarcinomas. Classical statistical tools and neuronal approaches led to consistent conclusions.

[Mattfeldt2004Prediction] T. Mattfeldt, H. A. Kestler, and H. P. Sinn. Prediction of the axillary lymph node status in mammary cancer on the basis of clinicopathological data and flow cytometry. Med Biol Eng Comput, 42(6):733-9, Nov 2004. [ bib ]
Axillary lymph node status is a major prognostic factor in mammary carcinoma. It is clinically desirable to predict the axillary lymph node status from data from the mammary cancer specimen. In the study, the axillary lymph node status, routine histological parameters and flow-cytometric data were retrospectively obtained from 1139 specimens of invasive mammary cancer. The ten variables: age, tumour type, tumour grade, tumour size, skin infiltration, lymphangiosis carcinomatosa, pT4 category, percentage of tumour cells in G2/M- and S-phases of the cell cycle, and ploidy index were considered as predictor variables, and the single variable lymph node metastasis pN (0 for pN0, or 1 for pN1 or pN2) was used as an output variable. A stepwise logistic regression analysis, with the axillary lymph node as a dependent variable, was used for feature selection. Only lymphangiosis carcinomatosa and tumour size proved to be significant as independent predictor variables; the other variables were non-contributory. Three paradigms with supervised learning rules (multilayer perceptron, learning vector quantisation and support vector machines) were used for the purpose of prediction. If any of these paradigms was used with the information from all ten input variables, 73% of cases could be correctly predicted, with specificity ranging from 82 to 84% and sensitivity ranging from 60 to 63%. If only the two significant input variables were used, lymphangiosis carcinomatosa and tumour diameter, the prediction accuracy was no worse. Nearly identical results were obtained by two different techniques of cross-validation (leave-one-out against ten-fold cross validation). It was concluded that: artificial neural networks can be used for risk stratification on the basis of routine data in individual cases of mammary cancer; and lymphangiosis carcinomatosa and tumour size are independent predictors of axillary lymph node metastasis in mammary cancer.

Keywords: breastcancer
[Martin2004Classification] T. C. Martin, J. Moecks, A. Belooussov, S. Cawthraw, B. Dolenko, M. Eiden, J. Von Frese, W. Kohler, J. Schmitt, R. Somorjai, T. Udelhoven, S. Verzakov, and W. Petrich. Classification of signatures of Bovine Spongiform Encephalopathy in serum using infrared spectroscopy. Analyst, 129(10):897-901, Oct 2004. [ bib | DOI | http | .pdf ]
Signatures of Bovine Spongiform Encephalopathy (BSE) have been identified in serum by means of "Diagnostic Pattern Recognition (DPR)". For DPR-analysis, mid-infrared spectroscopy of dried films of 641 serum samples was performed using disposable silicon sample carriers and a semi-automated DPR research system operating at room temperature. The combination of four mathematical classification approaches (principal component analysis plus linear discriminant analysis, robust linear discriminant analysis, artificial neural network, support vector machine) allowed for a reliable assignment of spectra to the class "BSE-positive" or "BSE-negative". An independent, blinded validation study was carried out on a second DPR research system at the Veterinary Laboratory Agency, Weybridge, UK. Out of 84 serum samples originating from terminally-ill, BSE-positive cattle, 78 were classified correctly. Similarly, 73 out of 76 BSE-negative samples were correctly identified by DPR such that, numerically, an accuracy of 94.4% can be calculated. At a confidence level of 0.95 (alpha = 0.05) these results correspond to a sensitivity > 85% and a specificity > 90%. Identical class assignment by all four classifiers occurred in 75% of the cases while ambiguous results were obtained in only 8 of the 160 cases. With an area under the ROC (receiver operating characteristics) curve of 0.991, DPR may potentially supply a valuable surrogate marker for BSE even in cases in which a deliberate bias towards improved sensitivity or specificity is desired. To the best of our knowledge, DPR is the first and, up to now, only method which has demonstrated its capability of detecting BSE-related signatures in serum.
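The counts quoted in the abstract above (78 of 84 positives and 73 of 76 negatives classified correctly) are enough to recompute the reported figures. The sketch below does that arithmetic; the one-sided Wilson score bound is an assumption chosen for illustration, since the abstract does not say which interval the authors used, and `wilson_lower` is a hypothetical helper name.

```python
from math import sqrt

def wilson_lower(successes, n, z=1.6449):
    """One-sided Wilson score lower bound for a binomial proportion
    (z = 1.6449 corresponds to a one-sided 95% level)."""
    p = successes / n
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / (1 + z * z / n)

# Counts taken from the abstract
tp, n_pos = 78, 84      # correctly classified BSE-positive samples
tn, n_neg = 73, 76      # correctly classified BSE-negative samples

sensitivity = tp / n_pos
specificity = tn / n_neg
accuracy = (tp + tn) / (n_pos + n_neg)   # 151 / 160

sensitivity_lower = wilson_lower(tp, n_pos)
specificity_lower = wilson_lower(tn, n_neg)
```

Here `accuracy` evaluates to 151/160, which rounds to the 94.4% stated in the abstract.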

[Mao2004Feature] K. Z. Mao. Feature subset selection for support vector machines through discriminative function pruning analysis. IEEE Trans Syst Man Cybern B Cybern, 34(1):60-7, Feb 2004. [ bib ]
In many pattern classification applications, data are represented by high dimensional feature vectors, which induce high computational cost and reduce classification speed in the context of support vector machines (SVMs). To reduce the dimensionality of pattern representation, we develop a discriminative function pruning analysis (DFPA) feature subset selection method in the present study. The basic idea of the DFPA method is to first learn the SVM discriminative function from training data using all available input variables, and then to select a feature subset through pruning analysis. In the present study, the pruning is implemented using a forward selection procedure combined with a linear least squares estimation algorithm, taking advantage of the linear-in-the-parameter structure of the SVM discriminative function. The strength of the DFPA method is that it combines the good characteristics of both filter and wrapper methods. Firstly, it retains the simplicity of the filter method, avoiding the training of a large number of SVM classifiers. Secondly, it inherits the good performance of the wrapper method by taking the SVM classification algorithm into account.

[Man2004Evaluating] M.Z. Man, G. Dyson, K. Johnson, and B. Liao. Evaluating methods for classifying expression data. J. Biopharm. Stat., 14(4):1065-1084, 2004. [ bib | DOI | .pdf ]
An attractive application of expression technologies is to predict drug efficacy or safety using expression data of biomarkers. To evaluate the performance of various classification methods for building predictive models, we applied these methods on six expression datasets. These datasets were from studies using microarray technologies and had either two or more classes. From each of the original datasets, two subsets were generated to simulate two scenarios in biomarker applications. First, a 50-gene subset was used to simulate a candidate gene approach when it might not be practical to measure a large number of genes/biomarkers. Next, a 2000-gene subset was used to simulate a whole genome approach. We evaluated the relative performance of several classification methods by using leave-one-out cross-validation and bootstrap cross-validation. Although all methods perform well in both subsets for a relatively easy dataset with two classes, differences in performance do exist among methods for other datasets. Overall, partial least squares discriminant analysis (PLS-DA) and support vector machines (SVM) outperform all other methods. We suggest a practical approach to take advantage of multiple methods in biomarker applications.

Keywords: biosvm
[Mahe2004Extensions] P. Mahé, N. Ueda, T. Akutsu, J.-L. Perret, and J.-P. Vert. Extensions of marginalized graph kernels. In R. Greiner and D. Schuurmans, editors, Proceedings of the Twenty-First International Conference on Machine Learning (ICML 2004), pages 552-559. ACM Press, 2004. [ bib | www: ]
Positive definite kernels between labeled graphs have recently been proposed. They enable the application of kernel methods, such as support vector machines, to the analysis and classification of graphs, for example, chemical compounds. These graph kernels are obtained by marginalizing a kernel between paths with respect to a random walk model on the graph vertices along the edges. We propose two extensions of these graph kernels, with the double goal to reduce their computation time and increase their relevance as measure of similarity between graphs. First, we propose to modify the label of each vertex by automatically adding information about its environment with the use of the Morgan algorithm. Second, we suggest a modification of the random walk model to prevent the walk from coming back to a vertex that was just visited. These extensions are then tested on benchmark experiments of chemical compounds classification, with promising results.

Keywords: biosvm chemoinformatics
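The first extension in the abstract above enriches vertex labels with Morgan indices. A minimal sketch of that idea follows, assuming the common formulation in which every vertex starts at index 1 and each iteration replaces a vertex's index by the sum of its neighbours' indices; the function names and the adjacency-dictionary representation are illustrative assumptions, not the authors' code.

```python
def morgan_indices(adjacency, iterations):
    """adjacency maps each vertex to a list of its neighbours."""
    index = {v: 1 for v in adjacency}              # iteration 0: all ones
    for _ in range(iterations):
        # new index of v = sum of the current indices of its neighbours
        index = {v: sum(index[u] for u in adjacency[v]) for v in adjacency}
    return index

def relabel(labels, adjacency, iterations):
    """Pair each original vertex label with its Morgan index."""
    idx = morgan_indices(adjacency, iterations)
    return {v: (labels[v], idx[v]) for v in adjacency}

# Toy three-atom chain C-C-O (hypothetical example data)
chain = {0: [1], 1: [0, 2], 2: [1]}
atoms = {0: "C", 1: "C", 2: "O"}
enriched = relabel(atoms, chain, 1)
```

After one iteration the index equals the vertex degree, so the two carbon atoms in the toy chain receive different labels, ("C", 1) and ("C", 2), even though their original labels are identical.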
[Maglogiannis2004Characterization] I. G. Maglogiannis and E. P. Zafiropoulos. Characterization of digital medical images utilizing support vector machines. BMC Med. Informat. Decis. Making, 4(4), 2004. [ bib | DOI | .pdf ]
Background: In this paper we discuss an efficient methodology for the image analysis and characterization of digital images containing skin lesions using Support Vector Machines and present the results of a preliminary study. Methods: The methodology is based on the support vector machines algorithm for data classification and it has been applied to the problem of the recognition of malignant melanoma versus dysplastic naevus. Border and colour based features were extracted from digital images of skin lesions acquired under reproducible conditions, using basic image processing techniques. Two alternative classification methods, the statistical discriminant analysis and the application of neural networks, were also applied to the same problem and the results are compared. Results: The SVM (Support Vector Machines) algorithm performed quite well achieving 94.1% [...] performance of the other two classification methodologies. The method of discriminant analysis classified correctly 88% (71% [...] while the neural networks performed approximately the same. Conclusion: The use of a computer-based system, like the one described in this paper, is intended to avoid human subjectivity and to perform specific tasks according to a number of criteria. However, the presence of an expert dermatologist is considered necessary for the overall visual assessment of the skin lesion and the final diagnosis.

[Madeira2004Biclustering] S. C. Madeira and A. L. Oliveira. Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinform, 1(1):24-45, 2004. [ bib | DOI | http ]
A large number of clustering approaches have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the results from the application of standard clustering methods to genes are limited. This limitation is imposed by the existence of a number of experimental conditions where the activity of genes is uncorrelated. A similar limitation exists when clustering of conditions is performed. For this reason, a number of algorithms that perform simultaneous clustering on the row and column dimensions of the data matrix have been proposed. The goal is to find submatrices, that is, subgroups of genes and subgroups of conditions, where the genes exhibit highly correlated activities for every condition. In this paper, we refer to this class of algorithms as biclustering. Biclustering is also referred to in the literature as coclustering and direct clustering, among other names, and has also been used in fields such as information retrieval and data mining. In this comprehensive survey, we analyze a large number of existing approaches to biclustering, and classify them in accordance with the type of biclusters they can find, the patterns of biclusters that are discovered, the methods used to perform the search, the approaches used to evaluate the solution, and the target applications.

Keywords: Algorithms; Cluster Analysis; Computational Biology, methods; Gene Expression Profiling, statistics & numerical data; Gene Expression, genetics; Humans; Models, Statistical; Oligonucleotide Array Sequence Analysis, methods; Saccharomyces cerevisiae, genetics
[Maby2004Analysis] E. Maby, R. Le Bouquin Jeannès, C. Liégeois-Chauvel, B. Gourevitch, and G. Faucon. Analysis of auditory evoked potential parameters in the presence of radiofrequency fields using a support vector machines method. Med Biol Eng Comput, 42(4):562-8, Jul 2004. [ bib ]
The paper presents a study of global system for mobile (GSM) phone radiofrequency effects on human cerebral activity. The work was based on the study of auditory evoked potentials (AEPs) recorded from healthy humans and epileptic patients. The protocol allowed the comparison of AEPs recorded with or without exposure to electrical fields. Ten variables measured from AEPs were employed in the design of a supervised support vector machines classifier. The classification performance measured the classifier's ability to discriminate features performed with or without radiofrequency exposure. Most significant features were chosen by a backward sequential selection that ranked the variables according to their pertinence for the discrimination. Finally, the most discriminating features were analysed statistically by a Wilcoxon signed rank test. For both populations, the N100 amplitudes were reduced under the influence of GSM radiofrequency (mean attenuation of -0.36 microV for healthy subjects and -0.60 microV for epileptic patients). Healthy subjects showed a N100 latency decrease (-5.23 ms in mean), which could be consistent with mild, localised heating. The auditory cortical activity in humans was modified by GSM phone radiofrequencies, but an effect on brain functionality has not been proven.

[Lopez-Bigas2004Genome-wide] N. López-Bigas and C. A. Ouzounis. Genome-wide identification of genes likely to be involved in human genetic disease. Nucleic Acids Res., 32(10):3108-3114, 2004. [ bib | DOI | http | .pdf ]
Sequence analysis of the group of proteins known to be associated with hereditary diseases allows the detection of key distinctive features shared within this group. The disease proteins are characterized by greater length of their amino acid sequence, a broader phylogenetic extent, and specific conservation and paralogy profiles compared with all human proteins. This unique property pattern provides insights into the global nature of hereditary diseases and moreover can be used to predict novel disease genes. We have developed a computational method that allows the detection of genes likely to be involved in hereditary disease in the human genome. The probability score assignments for the human genome are accessible at http://maine.ebi.ac.uk:8000/services/dgp.

[Luo2004Recognizing] Tong Luo, Kurt Kramer, Dmitry B Goldgof, Lawrence O Hall, Scott Samson, Andrew Remsen, and Thomas Hopkins. Recognizing plankton images from the shadow image particle profiling evaluation recorder. IEEE Trans Syst Man Cybern B Cybern, 34(4):1753-62, Aug 2004. [ bib ]
We present a system to recognize underwater plankton images from the shadow image particle profiling evaluation recorder (SIPPER). The challenge of the SIPPER image set is that many images do not have clear contours. To address that, shape features that do not heavily depend on contour information were developed. A soft margin support vector machine (SVM) was used as the classifier. We developed a way to assign probability after multiclass SVM classification. Our approach achieved approximately 90% accuracy on a collection of plankton images. On another larger image set containing manually unidentifiable particles, it also provided 75.6% overall accuracy. The proposed approach was statistically significantly more accurate on the two data sets than a C4.5 decision tree and a cascade correlation neural network. The single SVM significantly outperformed ensembles of decision trees created by bagging and random forests on the smaller data set and was slightly better on the other data set. The 15-feature subset produced by our feature selection approach provided slightly better accuracy than using all 29 features. Our probability model gave us a reasonable rejection curve on the larger data set.

[Lugosi2004On] G. Lugosi and N. Vayatis. On the Bayes-risk consistency of regularized boosting methods. Ann. Stat., 32:30-55, 2004. [ bib | DOI | http | .pdf ]
[Liu2004comparative] Y. Liu. A comparative study on feature selection methods for drug discovery. J Chem Inf Comput Sci, 44(5):1823-8, 2004. [ bib | DOI | http | .pdf ]
Feature selection is frequently used as a preprocessing step to machine learning. The removal of irrelevant and redundant information often improves the performance of learning algorithms. This paper is a comparative study of feature selection in drug discovery. The focus is on aggressive dimensionality reduction. Five methods were evaluated: information gain, mutual information, a chi2-test, odds ratio, and GSS coefficient. Two well-known classification algorithms, Naïve Bayesian and Support Vector Machine (SVM), were used to classify the chemical compounds. The results showed that Naïve Bayesian benefited significantly from the feature selection, while SVM performed better when all features were used. In this experiment, information gain and chi2-test were the most effective feature selection methods. Using information gain with a Naïve Bayesian classifier, removal of up to 96% of the features yielded an improved classification accuracy measured by sensitivity. When information gain was used to select the features, SVM was much less sensitive to the reduction of feature space. The feature set size was reduced by 99%, while losing only a few percent in terms of sensitivity (from 58.7% to 52.5%) and specificity (from 98.4% to 97.2%). In contrast to information gain and chi2-test, mutual information had relatively poor performance due to its bias toward favoring rare features and its sensitivity to probability estimation errors.

Keywords: biosvm
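Information gain, one of the feature scoring criteria compared in the abstract above, can be sketched in a few lines. The version below assumes binary features and binary class labels encoded as 0/1; the function names are illustrative and the data layout is an assumption, not the paper's implementation.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def information_gain(feature, labels):
    """Reduction in label entropy after splitting on a 0/1 feature.
    feature and labels are equal-length lists of 0/1 values."""
    n = len(labels)
    base = entropy([labels.count(0), labels.count(1)])
    conditional = 0.0
    for value in (0, 1):
        subset = [y for x, y in zip(feature, labels) if x == value]
        if subset:
            weight = len(subset) / n
            conditional += weight * entropy([subset.count(0), subset.count(1)])
    return base - conditional
```

Ranking features by this score and keeping only the top few is one way to realise the aggressive dimensionality reduction the abstract describes; a feature identical to the label scores 1 bit on a balanced data set, and a feature independent of the label scores 0.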
[Liu2004Active] Y. Liu. Active learning with support vector machine applied to gene expression data for cancer classification. J. Chem. Inf. Comput. Sci., 44(6):1936-1941, 2004. [ bib | DOI | http | .pdf ]
There is growing interest in the application of machine learning techniques in bioinformatics. The supervised machine learning approach has been widely applied to bioinformatics and gained a lot of success in this research area. With this learning approach researchers first develop a large training set, which is a time-consuming and costly process. Moreover, the proportion of the positive examples and negative examples in the training set may not represent the real-world data distribution, which causes concept drift. Active learning avoids these problems. Unlike most conventional learning methods where the training set used to derive the model remains static, the classifier can actively choose the training data and the size of training set increases. We introduced an algorithm for performing active learning with support vector machine and applied the algorithm to gene expression profiles of colon cancer, lung cancer, and prostate cancer samples. We compared the classification performance of active learning with that of passive learning. The results showed that employing the active learning method can achieve high accuracy and significantly reduce the need for labeled training instances. For lung cancer classification, to achieve 96% [...] only 31 labeled examples were needed in active learning whereas in passive learning 174 labeled examples were required. That meant over 82% [...] the areas under the receiver operating characteristic (ROC) curves were over 0.81, while in passive learning the areas under the ROC curves were below 0.50.

Keywords: biosvm
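The active learning loop described in the abstract above can be illustrated with pool-based uncertainty sampling. To keep the sketch dependency-free, a nearest-centroid classifier stands in for the SVM (a deliberate substitution, not the paper's method); `oracle` is a hypothetical callback returning the true 0/1 label of a pool item, and the seed set is assumed to contain at least one example of each class.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(c) / len(points) for c in zip(*points)]

def margin(x, c_neg, c_pos):
    """Signed score: positive leans toward class 1, near zero means unsure."""
    d_neg = sum((a - b) ** 2 for a, b in zip(x, c_neg))
    d_pos = sum((a - b) ** 2 for a, b in zip(x, c_pos))
    return d_neg - d_pos

def active_learning(pool, oracle, seed_idx, budget):
    """Query labels for the `budget` pool items the current model is least sure about."""
    labeled = {i: oracle(i) for i in seed_idx}
    for _ in range(budget):
        c_neg = centroid([pool[i] for i, y in labeled.items() if y == 0])
        c_pos = centroid([pool[i] for i, y in labeled.items() if y == 1])
        candidates = [i for i in range(len(pool)) if i not in labeled]
        if not candidates:
            break
        # pick the unlabeled item closest to the current decision boundary
        query = min(candidates, key=lambda i: abs(margin(pool[i], c_neg, c_pos)))
        labeled[query] = oracle(query)
    return labeled

# Toy one-dimensional pool (hypothetical data)
pool = [[0.0], [1.0], [4.5], [6.0], [9.0], [10.0]]
truth = [0, 0, 0, 1, 1, 1]
result = active_learning(pool, lambda i: truth[i], seed_idx=[0, 5], budget=2)
```

On the toy pool the two queries go to the items nearest the class boundary (4.5 and 6.0) rather than to the easy items at 1.0 and 9.0, which is the mechanism by which active learning reduces the number of labels needed.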
[Liu2004Comments] Xiaomei Liu, Lawrence O Hall, and Kevin W Bowyer. Comments on "a parallel mixture of SVMs for very large scale problems". Neural Comput, 16(7):1345-51, Jul 2004. [ bib | DOI | http ]
Collobert, Bengio, and Bengio (2002) recently introduced a novel approach to using a neural network to provide a class prediction from an ensemble of support vector machines (SVMs). This approach has the advantage that the required computation scales well to very large data sets. Experiments on the Forest Cover data set show that this parallel mixture is more accurate than a single SVM, with 90.72% accuracy reported on an independent test set. Although this accuracy is impressive, their article does not consider alternative types of classifiers. We show that a simple ensemble of decision trees results in a higher accuracy, 94.75%, and is computationally efficient. This result is somewhat surprising and illustrates the general value of experimental comparisons using different types of classifiers.

[Liu2004QSAR] H. X. Liu, R. S. Zhang, X. J. Yao, M. C. Liu, Z. D. Hu, and B. T. Fan. QSAR and classification models of a novel series of COX-2 selective inhibitors: 1,5-diarylimidazoles based on support vector machines. J Comput Aided Mol Des, 18(6):389-99, Jun 2004. [ bib ]
The support vector machine, which is a novel algorithm from the machine learning community, was used to develop quantitation and classification models which can be used as a potential screening mechanism for a novel series of COX-2 selective inhibitors. Each compound was represented by calculated structural descriptors that encode constitutional, topological, geometrical, electrostatic, and quantum-chemical features. The heuristic method was then used to search the descriptor space and select the descriptors responsible for activity. Quantitative modelling results in a nonlinear, seven-descriptor model based on SVMs with root mean-square errors of 0.107 and 0.136 for training and prediction sets, respectively. The best classification results are found using SVMs: the accuracy for training and test sets is 91.2% and 88.2%, respectively. This paper proposes a new and effective method for drug design and screening.

Keywords: biosvm chemoinformatics
[Liu2004Prediction] H. X. Liu, R. S. Zhang, X. J. Yao, M. C. Liu, Z. D. Hu, and B. T. Fan. Prediction of the isoelectric point of an amino acid based on GA-PLS and SVMs. J Chem Inf Comput Sci, 44(1):161-7, 2004. [ bib | DOI | http | .pdf ]
The support vector machine (SVM), a novel type of learning machine, was used for the first time to develop a QSPR model that relates the structures of 35 amino acids to their isoelectric point. Molecular descriptors calculated from the structure alone were used to represent molecular structures. The seven descriptors selected using GA-PLS, which is a sophisticated hybrid approach that combines GA as a powerful optimization method with PLS as a robust statistical method for variable selection, were used as inputs of RBFNNs and SVM to predict the isoelectric point of an amino acid. The optimal QSPR model developed was based on support vector machines, which showed the following results: a root-mean-square error of 0.2383 and a prediction correlation coefficient R=0.9702 were obtained for the whole data set. Satisfactory results indicated that the GA-PLS approach is a very effective method for variable selection, and the support vector machine is a very promising tool for nonlinear approximation.

Keywords: biosvm
[Liu2004Quantitative] H. X. Liu, C. X. Xue, R. S. Zhang, X. J. Yao, M. C. Liu, Z. D. Hu, and B. T. Fan. Quantitative prediction of logk of peptides in high-performance liquid chromatography based on molecular descriptors by using the heuristic method and support vector machine. J Chem Inf Comput Sci, 44(6):1979-86, 2004. [ bib | DOI | http | .pdf ]
A new method, the support vector machine (SVM), and the heuristic method (HM) were used for the first time to develop nonlinear and linear models relating the capacity factor (logk) to seven molecular descriptors of 75 peptides. The molecular descriptors representing the structural features of the compounds only included the constitutional and topological descriptors, which can be obtained easily without optimizing the structure of the molecule. The seven molecular descriptors selected by the heuristic method in CODESSA were used as inputs for SVM. The results obtained by SVM were compared with those obtained by the heuristic method. The prediction result of the SVM model is better than that of the heuristic method. For the test set, a predictive correlation coefficient R = 0.9801 and root-mean-square error of 0.1523 were obtained. The prediction results are in very good agreement with the experimental values. However, the linear model of the heuristic method is easier to understand and ready to use for a chemist. This paper provided a new and effective method for predicting the chromatography retention of peptides and some insight into the structural features which are related to the capacity factor of peptides.

Keywords: biosvm
[Liu2004Using] Huiqing Liu, Hao Han, Jinyan Li, and Limsoon Wong. Using amino acid patterns to accurately predict translation initiation sites. In Silico Biol., 4(3):255-69, 2004. [ bib | http ]
The translation initiation site (TIS) prediction problem is about how to correctly identify TIS in mRNA, cDNA, or other types of genomic sequences. High prediction accuracy can be helpful in a better understanding of protein coding from nucleotide sequences. This is an important step in genomic analysis to determine protein coding from nucleotide sequences. In this paper, we present an in silico method to predict translation initiation sites in vertebrate cDNA or mRNA sequences. This method consists of three sequential steps as follows. In the first step, candidate features are generated using k-gram amino acid patterns. In the second step, a small number of top-ranked features are selected by an entropy-based algorithm. In the third step, a classification model is built to recognize true TISs by applying support vector machines or ensembles of decision trees to the selected features. We have tested our method on several independent data sets, including two public ones and our own extracted sequences. The experimental results achieved are better than those reported previously using the same data sets. Our high accuracy not only demonstrates the feasibility of our method, but also indicates that there might be "amino acid" patterns around TIS in cDNA and mRNA sequences.
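The first step of the method above, generating candidate features from k-gram amino acid patterns, can be illustrated with a minimal sketch. It only counts overlapping k-grams in an amino acid string; the function name `kgram_counts` and the toy sequence are assumptions made for illustration, and the translation of nucleotide windows, the entropy-based selection, and the classifier from the paper are not reproduced here.

```python
from collections import Counter


def kgram_counts(protein, k):
    """Count every overlapping k-gram (length-k substring) of an amino acid string.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Slide a window of width k over the sequence and tally each pattern.
    return Counter(protein[i:i + k] for i in range(len(protein) - k + 1))


# Toy amino acid string (made up); each distinct 2-gram becomes one candidate feature.
counts = kgram_counts("MKVMKV", 2)
```

Each key of `counts` would be one candidate feature, and its value the feature's count for that sequence.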

Keywords: biosvm
[Listgarten2004Predictive] J. Listgarten, S. Damaraju, B. Poulin, L. Cook, J. Dufour, A. Driga, J. Mackey, D. Wishart, R. Greiner, and B. Zanke. Predictive Models for Breast Cancer Susceptibility from Multiple Single Nucleotide Polymorphisms. Clin. Cancer Res., 10(8):2725-2737, 2004. [ bib | arXiv | http | .pdf ]
Hereditary predisposition and causative environmental exposures have long been recognized in human malignancies. In most instances, cancer cases occur sporadically, suggesting that environmental influences are critical in determining cancer risk. To test the influence of genetic polymorphisms on breast cancer risk, we have measured 98 single nucleotide polymorphisms (SNPs) distributed over 45 genes of potential relevance to breast cancer etiology in 174 patients and have compared these with matched normal controls. Using machine learning techniques such as support vector machines (SVMs), decision trees, and naive Bayes, we identified a subset of three SNPs as key discriminators between breast cancer and controls. The SVMs performed maximally among predictive models, achieving 69% predictive power in distinguishing between the two groups, compared with a 50% baseline predictive power obtained from the data after repeated random permutation of class labels (individuals with cancer or controls). However, the simpler naive Bayes model as well as the decision tree model performed quite similarly to the SVM. The three SNP sites most useful in this model were (a) the +4536T/C site of the aldosterone synthase gene CYP11B2 at amino acid residue 386 Val/Ala (T/C) (rs4541); (b) the +4328C/G site of the aryl hydrocarbon hydroxylase CYP1B1 at amino acid residue 293 Leu/Val (C/G) (rs5292); and (c) the +4449C/T site of the transcription factor BCL6 at amino acid 387 Asp/Asp (rs1056932). No single SNP site on its own could achieve more than 60% predictive accuracy. We have shown that multiple SNP sites from different genes over distant parts of the genome are better at identifying breast cancer patients than any one SNP alone. As high-throughput technology for SNPs improves and as more SNPs are identified, it is likely that much higher predictive accuracy will be achieved and a useful clinical tool developed.

Keywords: biosvm, breastcancer
[Lin2004Classification] WuMei Lin, Xin Yuan, Powing Yuen, William I Wei, Jonathan Sham, PengCheng Shi, and Jianan Qu. Classification of in vivo autofluorescence spectra using support vector machines. J Biomed Opt, 9(1):180-6, 2004. [ bib | DOI | http | .pdf ]
An algorithm based on support vector machines (SVM), the most recent advance in pattern recognition, is presented for use in classifying light-induced autofluorescence collected from cancerous and normal tissues. The in vivo autofluorescence spectra used for development and evaluation of SVM diagnostic algorithms were measured from 85 nasopharyngeal carcinoma (NPC) lesions and 131 normal tissue sites from 59 subjects during routine nasal endoscopy. Leave-one-out cross-validation was used to evaluate the performance of the algorithms. An overall diagnostic accuracy of 96%, a sensitivity of 94%, and a specificity of 97% for discriminating nasopharyngeal carcinomas from normal tissues were achieved using a linear SVM algorithm. A diagnostic accuracy of 98%, a sensitivity of 95%, and a specificity of 99% for detecting NPC were achieved with a nonlinear SVM algorithm. In a comparison with previously developed algorithms using the same dataset and the principal component analysis (PCA) technique, the SVM algorithms produced better diagnostic accuracy in all instances. In addition, we investigated a method combining PCA and SVM techniques for reducing the complexity of the SVM algorithms.

[Lin2004Adaptive] Tzu-Chao Lin and Pao-Ta Yu. Adaptive two-pass median filter based on support vector machines for image restoration. Neural Comput, 16(2):332-53, Feb 2004. [ bib | DOI | http ]
In this letter, a novel adaptive filter, the adaptive two-pass median (ATM) filter based on support vector machines (SVMs), is proposed to preserve more image details while effectively suppressing impulse noise for image restoration. The proposed filter is composed of a noise decision maker and two-pass median filters. Our new approach basically uses an SVM impulse detector to judge whether the input pixel is noise. If a pixel is detected as a corrupted pixel, the noise-free reduction median filter will be triggered to replace it. Otherwise, it remains unchanged. Then, to improve the quality of the restored image, a decision impulse filter is put to work in the second-pass filtering procedure. The results of our extensive experiments demonstrate that the proposed filter outperforms earlier median-based filters in the literature at suppressing both fixed-valued and random-valued impulse noise without degrading the quality of the fine details. Our new filter also provides excellent robustness at various percentages of impulse noise.

Keywords: Adaptation, Algorithms, Ambergris, Animals, Artifacts, Artificial Intelligence, Automated, Cadmium, Candida, Candida albicans, Capillary, Cluster Analysis, Combinatorial Chemistry Techniques, Computer-Assisted, Electrophoresis, Eye Enucleation, Humans, Image Processing, Magnetic Resonance Spectroscopy, Melanoma, Models, Molecular, Molecular Conformation, Neural Networks (Computer), Non-U.S. Gov't, Nonlinear Dynamics, Odors, P.H.S., Pattern Recognition, Perfume, Physiological, Predictive Value of Tests, Prognosis, Prospective Studies, Quantitative Structure-Activity Relationship, Rats, Research Support, Signal Processing, U.S. Gov't, Uveal Neoplasms, Visual, 15006099
[Lin2004Orphan] S. H. S. Lin and O. Civelli. Orphan G protein-coupled receptors: targets for new therapeutic interventions. Ann. Med., 36(3):204-214, 2004. [ bib | DOI | http ]
With the completion of the human genome, many genes will be uncovered with unknown functions. The 'orphan' G protein coupled receptors (GPCRs) are examples of genes without known functions. These are genes that exhibit the seven helical conformation hallmark of the GPCRs but that are called 'orphans' because they are activated by none of the primary messengers known to activate GPCRs in vivo. They are the targets of undiscovered transmitters and this lack of knowledge precludes understanding their function. Yet, because they belong to the supergene family that has the widest regulatory role in the organism, the orphan GPCRs have generated much excitement in academia and industry. They hold much hope for revealing new intercellular interactions that will open new areas of basic research which ultimately will lead to new therapeutic applications. However, the first step in understanding the function of orphan GPCRs is to 'deorphanize' them, to identify their natural transmitters. Here we review the search for the natural primary messengers of orphan GPCRs and focus on two recently deorphanized GPCR systems, the melanin-concentrating hormone (MCH) and prolactin-releasing peptide (PrRP) systems, to illustrate the strategies applied to solve their function and to exemplify the therapeutic potentials that such systems hold.

Keywords: chemogenomics
[Li2004comparative] T. Li, C. Zhang, and M. Ogihara. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics, 20(15):2429-2437, 2004. [ bib | http | .pdf ]
Summary: This paper studies the problem of building multiclass classifiers for tissue classification based on gene expression. The recent development of microarray technologies has enabled biologists to quantify gene expression of tens of thousands of genes in a single experiment. Biologists have begun collecting gene expression for a large number of samples. One of the urgent issues in the use of microarray data is to develop methods for characterizing samples based on their gene expression. The most basic step in the research direction is binary sample classification, which has been studied extensively over the past few years. This paper investigates the next step: multiclass classification of samples based on gene expression. The characteristics of expression data (e.g. large number of genes with small sample size) make the classification problem more challenging. The process of building multiclass classifiers is divided into two components: (i) selection of the features (i.e. genes) to be used for training and testing and (ii) selection of the classification method. This paper compares various feature selection methods as well as various state-of-the-art classification methods on various multiclass gene expression datasets. Our study indicates that the multiclass classification problem is much more difficult than the binary one for the gene expression datasets. The difficulty lies in the fact that the data are of high dimensionality and that the sample size is small. The classification accuracy appears to degrade very rapidly as the number of classes increases. In particular, the accuracy was very low regardless of the choices of the methods for large-class datasets (e.g. NCI60 and GCM). While increasing the number of samples is a plausible solution to the problem of accuracy degradation, it is important to develop algorithms that are able to analyze effectively multiple-class expression data for these special datasets.

Keywords: biosvm
[Li2004Fusing] Shutao Li, James Tin-Yau Kwok, Ivor Wai-Hung Tsang, and Yaonan Wang. Fusing images with different focuses using support vector machines. IEEE Trans Neural Netw, 15(6):1555-61, Nov 2004. [ bib ]
Many vision-related processing tasks, such as edge detection, image segmentation and stereo matching, can be performed more easily when all objects in the scene are in good focus. However, in practice, this may not always be feasible as optical lenses, especially those with long focal lengths, only have a limited depth of field. One common approach to recover an everywhere-in-focus image is to use wavelet-based image fusion. First, several source images with different focuses of the same scene are taken and processed with the discrete wavelet transform (DWT). Among these wavelet decompositions, the wavelet coefficient with the largest magnitude is selected at each pixel location. Finally, the fused image can be recovered by performing the inverse DWT. In this paper, we improve this fusion procedure by applying the discrete wavelet frame transform (DWFT) and support vector machines (SVMs). Unlike DWT, DWFT yields a translation-invariant signal representation. Using features extracted from the DWFT coefficients, an SVM is trained to select the source image that has the best focus at each pixel location, and the corresponding DWFT coefficients are then incorporated into the composite wavelet representation. Experimental results show that the proposed method outperforms the traditional approach both visually and quantitatively.
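The traditional selection rule described in the abstract above (keep, at each location, the wavelet coefficient with the largest magnitude) can be sketched as follows. This is a minimal illustration under stated assumptions: the coefficient arrays are given as plain nested lists, the forward and inverse wavelet transforms are left out, and the function name `fuse_max_magnitude` is hypothetical. It shows the baseline rule only, not the SVM-based selection the paper proposes.

```python
def fuse_max_magnitude(coefficient_sets):
    """Fuse equally sized 2-D coefficient arrays, one per source image.

    At every location the coefficient with the largest absolute value wins.
    Hypothetical helper; the wavelet transforms themselves are not performed here.
    """
    rows = len(coefficient_sets[0])
    cols = len(coefficient_sets[0][0])
    return [
        [max((c[r][col] for c in coefficient_sets), key=abs) for col in range(cols)]
        for r in range(rows)
    ]


# Two made-up coefficient arrays standing in for two differently focused images.
a = [[1.0, -5.0], [0.5, 2.0]]
b = [[-3.0, 4.0], [0.2, -2.5]]
fused = fuse_max_magnitude([a, b])
```

In a full pipeline the fused coefficients would then be passed to the inverse transform to recover the image.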

Keywords: Algorithms, Amino Acid, Amino Acids, Artificial Intelligence, Ascomycota, Automated, Base Sequence, Chromosome Mapping, Codon, Colonic Neoplasms, Comparative Study, Computer Simulation, Computer-Assisted, Computing Methodologies, Crystallography, DNA, DNA Primers, Databases, Diagnostic Imaging, Enzymes, Fixation, Gene Expression Profiling, Genetic, Hordeum, Host-Parasite Relations, Humans, Image Enhancement, Image Interpretation, Informatics, Information Storage and Retrieval, Kinetics, Magnetic Resonance Spectroscopy, Models, Nanotechnology, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Ocular, Oligonucleotide Array Sequence Analysis, P.H.S., Pattern Recognition, Plant, Plants, Predictive Value of Tests, Protein, Protein Conformation, Research Support, Sample Size, Selection (Genetics), Sequence Alignment, Sequence Analysis, Sequence Homology, Signal Processing, Skin, Software, Statistical, Subtraction Technique, Theoretical, Thermodynamics, U.S. Gov't, Viral Proteins, X-Ray, 15565781
[Li2004Data] L. Li, H. Tang, Z. Wu, J. Gong, M. Gruidl, J. Zou, M. Tockman, and R.A. Clark. Data mining techniques for cancer detection using serum proteomic profiling. Artif. Intell. Med., 32(2):71-83, 2004. [ bib | DOI | http | .pdf ]
OBJECTIVE: Pathological changes in an organ or tissue may be reflected in proteomic patterns in serum. It is possible that unique serum proteomic patterns could be used to discriminate cancer samples from non-cancer ones. Due to the complexity of proteomic profiling, a higher order analysis such as data mining is needed to uncover the differences in complex proteomic patterns. The objectives of this paper are (1) to briefly review the application of data mining techniques in proteomics for cancer detection/diagnosis; (2) to explore a novel analytic method with different feature selection methods; (3) to compare the results obtained on different datasets and those reported by Petricoin et al. in terms of detection performance and selected proteomic patterns. METHODS AND MATERIAL: Three serum SELDI MS data sets were used in this research to identify serum proteomic patterns that distinguish the serum of ovarian cancer cases from non-cancer controls. A support vector machine-based method is applied in this study, in which statistical testing and genetic algorithm-based methods are used for feature selection respectively. Leave-one-out cross validation with receiver operating characteristic (ROC) curve is used for evaluation and comparison of cancer detection performance. RESULTS AND CONCLUSIONS: The results showed that (1) data mining techniques can be successfully applied to ovarian cancer detection with a reasonably high performance; (2) the classification using features selected by the genetic algorithm consistently outperformed those selected by statistical testing in terms of accuracy and robustness; (3) the discriminatory features (proteomic patterns) can be very different from one selection method to another. In other words, the pattern selection and its classification efficiency are highly classifier dependent. Therefore, when using data mining techniques, the discrimination of cancer from normal does not depend solely upon the identity and origination of cancer-related proteins.

Keywords: biosvm
[Lett2004Interaction] D. Lett, M. Hsing, and F. Pio. Interaction profile-based protein classification of death domain. BMC Bioinformatics, 5(75), 2004. [ bib | DOI | http | .pdf ]
Background: The increasing number of protein sequences and 3D structures obtained from genomic initiatives is leading many of us to focus on proteomics, and to dedicate our experimental and computational efforts to the creation and analysis of information derived from 3D structure. In particular, the high-throughput generation of protein-protein interaction data from a few organisms makes such an approach very important towards understanding the molecular recognition that makes up the entire protein-protein interaction network. Since the generation of sequences and experimental protein-protein interactions increases faster than the 3D structure determination of protein complexes, there is tremendous interest in developing in silico methods that generate such structure for prediction and classification purposes. In this study we focused on classifying protein family members based on their protein-protein interaction distinctiveness. Structure-based classification of protein-protein interfaces has been described initially by Ponstingl et al. [1] and more recently by Valdar et al. [2] and Mintseris et al. [3], from complex structures that have been solved experimentally. However, little has been done on protein classification based on the prediction of protein-protein complexes obtained from homology modeling and docking simulation. Results: We have developed an in silico classification system entitled HODOCO (Homology modeling, Docking and Classification Oracle), in which protein Residue Potential Interaction Profiles (RPIPS) are used to summarize protein-protein interaction characteristics. This system applied to a dataset of 64 proteins of the death domain superfamily was used to classify each member into its proper subfamily. Two classification methods were attempted, heuristic and support vector machine learning. Both methods were tested with a 5-fold cross-validation. The heuristic approach yielded a 61% accuracy, while the machine learning approach yielded an 89% accuracy. Conclusion: We have confirmed the reliability and potential value of classifying proteins via their predicted interactions. Our results are in the same range of accuracy as other studies that classify protein-protein interactions from 3D complex structure obtained experimentally. While our classification scheme does not take directly into account sequence information, our results are in agreement with functional and sequence based classification of death domain family members.

Keywords: biosvm
[Leslie2004Mismatch] C. S. Leslie, E. Eskin, A. Cohen, J. Weston, and W. S. Noble. Mismatch string kernels for discriminative protein classification. Bioinformatics, 20(4):467-476, 2004. [ bib | http | .pdf ]
Motivation: Classification of protein sequences into functional and structural families based on sequence homology is a central problem in computational biology. Discriminative supervised machine learning approaches provide good performance, but simplicity and computational efficiency of training and prediction are also important concerns. Results: We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the problem of protein classification and remote homology detection. These kernels measure sequence similarity based on shared occurrences of fixed-length patterns in the data, allowing for mutations between patterns. Thus, the kernels provide a biologically well-motivated way to compare protein sequences without relying on family-based generative models such as hidden Markov models. We compute the kernels efficiently using a mismatch tree data structure, allowing us to calculate the contributions of all patterns occurring in the data in one pass while traversing the tree. When used with an SVM, the kernels enable fast prediction on test sequences. We report experiments on two benchmark SCOP datasets, where we show that the mismatch kernel used with an SVM classifier performs competitively with state-of-the-art methods for homology detection, particularly when very few training examples are available. Examination of the highest-weighted patterns learned by the SVM classifier recovers biologically important motifs in protein families and superfamilies. Availability: SVM software is publicly available at http://microarray.cpmc.columbia.edu/gist. Mismatch kernel software is available upon request.
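The idea of measuring similarity through shared fixed-length patterns while allowing mutations can be sketched with a naive (k, m)-mismatch kernel. This is a minimal illustration, not the mismatch tree algorithm from the paper: it enumerates mismatch neighborhoods by brute force, which is only practical for small k, m and alphabets. The function names and the two-letter toy alphabet are assumptions.

```python
from collections import Counter
from itertools import combinations, product


def neighbors(kmer, m, alphabet):
    """Return all strings within Hamming distance m of kmer (brute force)."""
    result = {kmer}
    for d in range(1, m + 1):
        for positions in combinations(range(len(kmer)), d):
            for letters in product(alphabet, repeat=d):
                candidate = list(kmer)
                for pos, letter in zip(positions, letters):
                    candidate[pos] = letter
                result.add("".join(candidate))
    return result


def mismatch_features(seq, k, m, alphabet):
    """Feature map: each k-mer in seq adds 1 to every k-mer within m mismatches."""
    features = Counter()
    for i in range(len(seq) - k + 1):
        for neighbor in neighbors(seq[i:i + k], m, alphabet):
            features[neighbor] += 1
    return features


def mismatch_kernel(x, y, k, m, alphabet):
    """Kernel value: inner product of the two mismatch feature vectors."""
    fx = mismatch_features(x, k, m, alphabet)
    fy = mismatch_features(y, k, m, alphabet)
    return sum(count * fy[kmer] for kmer, count in fx.items())


# Toy example over a two-letter alphabet (real protein data would use 20 letters).
exact = mismatch_kernel("AA", "BB", 2, 0, "AB")
one_mismatch = mismatch_kernel("AA", "BB", 2, 1, "AB")
```

With no mismatches allowed the two toy strings share nothing; allowing one mismatch lets them meet at the intermediate patterns AB and BA.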

Keywords: biosvm
[Leslie2004Fast] C. Leslie and R. Kuang. Fast string kernels using inexact matching for protein sequences. J. Mach. Learn. Res., 5:1435-1455, 2004. [ bib ]
[Leng2004note] C. Leng, Y. Lin, and G. Wahba. A note on the Lasso and related procedures in model selection. Statistica Sinica, 16(4):1273-1284, 2006. [ bib | .pdf ]
[Lao2004Morphological] Zhiqiang Lao, Dinggang Shen, Zhong Xue, Bilge Karacali, Susan M Resnick, and Christos Davatzikos. Morphological classification of brains via high-dimensional shape transformations and machine learning methods. Neuroimage, 21(1):46-57, Jan 2004. [ bib ]
A high-dimensional shape transformation posed in a mass-preserving framework is used as a morphological signature of a brain image. Population differences with complex spatial patterns are then determined by applying a nonlinear support vector machine (SVM) pattern classification method to the morphological signatures. Significant reduction of the dimensionality of the morphological signatures is achieved via wavelet decomposition and feature reduction methods. Applying the method to MR images with simulated atrophy shows that the method can correctly detect subtle and spatially complex atrophy, even when the simulated atrophy represents only a 5% variation from the original image. Applying this method to actual MR images shows that brains can be correctly determined to be male or female with a successful classification rate of 97%, using the leave-one-out method. This proposed method also shows a high classification rate for old adults' age classification, even under difficult test scenarios. The main characteristic of the proposed methodology is that, by applying multivariate pattern classification methods, it can detect subtle and spatially complex patterns of morphological group differences which are often not detectable by voxel-based morphometric methods, because these methods analyze morphological measurements voxel-by-voxel and do not consider the entirety of the data simultaneously.

[Lanckriet2004statistical] G. R. G. Lanckriet, T. De Bie, N. Cristianini, M. I. Jordan, and W. S. Noble. A statistical framework for genomic data fusion. Bioinformatics, 20(16):2626-2635, 2004. [ bib | DOI | http | .pdf ]
Motivation: During the past decade, the new focus on genomics has highlighted a particular challenge: to integrate the different views of the genome that are provided by various types of experimental data. Results: This paper describes a computational framework for integrating and drawing inferences from a collection of genome-wide measurements. Each dataset is represented via a kernel function, which defines generalized similarity relationships between pairs of entities, such as genes or proteins. The kernel representation is both flexible and efficient, and can be applied to many different types of data. Furthermore, kernel functions derived from different types of data can be combined in a straightforward fashion. Recent advances in the theory of kernel methods have provided efficient algorithms to perform such combinations in a way that minimizes a statistical loss function. These methods exploit semidefinite programming techniques to reduce the problem of finding optimizing kernel combinations to a convex optimization problem. Computational experiments performed using yeast genome-wide datasets, including amino acid sequences, hydropathy profiles, gene expression data and known protein-protein interactions, demonstrate the utility of this approach. A statistical learning algorithm trained from all of these data to recognize particular classes of proteins (membrane proteins and ribosomal proteins) performs significantly better than the same algorithm trained on any single type of data. Availability: Supplementary data at http://noble.gs.washington.edu/proj/sdp-svm

Keywords: biosvm
[Lanckriet2004Kernel-baseda] G.R. Lanckriet, M. Deng, N. Cristianini, M.I. Jordan, and W.S. Noble. Kernel-based data fusion and its application to protein function prediction in yeast. In Proceedings of the Pacific Symposium on Biocomputing, pages 300-311, 2004. [ bib | .pdf ]
Kernel methods provide a principled framework in which to represent many types of data, including vectors, strings, trees and graphs. As such, these methods are useful for drawing inferences about biological phenomena. We describe a method for combining multiple kernel representations in an optimal fashion, by formulating the problem as a convex optimization problem that can be solved using semidefinite programming techniques. The method is applied to the problem of predicting yeast protein functional classifications using a support vector machine (SVM) trained on five types of data. For this problem, the new method performs better than a previously-described Markov random field method, and better than the SVM trained on any single type of data.

Keywords: biosvm
[Lanckriet2004Kernel-based] G.R.G. Lanckriet, N. Cristianini, M.I. Jordan, and W.S. Noble. Kernel-based integration of genomic data using semidefinite programming. In B. Schölkopf, K. Tsuda, and J.P. Vert, editors, Kernel Methods in Computational Biology, pages 231-259. MIT Press, 2004. [ bib ]
Keywords: biosvm
[Lanckriet2004Learning] G.R.G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M.I. Jordan. Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res., 5:27-72, 2004. [ bib | .html | .pdf ]
[Lal2004Support] Thomas Navin Lal, Michael Schröder, Thilo Hinterberger, Jason Weston, Martin Bogdan, Niels Birbaumer, and Bernhard Schölkopf. Support vector channel selection in BCI. IEEE Trans Biomed Eng, 51(6):1003-10, Jun 2004. [ bib ]
When designing a brain-computer interface (BCI) system, one can choose from a variety of features that may be useful for classifying brain activity during a mental task. For the special case of classifying electroencephalogram (EEG) signals, we propose the use of the state-of-the-art feature selection algorithms Recursive Feature Elimination and Zero-Norm Optimization, which are based on the training of support vector machines (SVMs). These algorithms can provide more accurate solutions than standard filter methods for feature selection. We adapt the methods for the purpose of selecting EEG channels. For a motor imagery paradigm, we show that the number of used channels can be reduced significantly without increasing the classification error. The resulting best channels agree well with the expected underlying cortical activity patterns during the mental tasks. Furthermore, we show how time-dependent, task-specific information can be visualized.

Keywords: Algorithms, Animals, Antisense, Artificial Intelligence, Automated, Autonomic Nervous System, Brain, Cell Line, Cerebral Cortex, Child, Cluster Analysis, Cognition, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA Fingerprinting, Databases, Drug Evaluation, Electroencephalography, Emotions, Event-Related Potentials, Evoked Potentials, Factual, Fluorescence, Fuzzy Logic, Gene Silencing, Gene Targeting, Genetic, Hand, Hela Cells, Humans, Imaging, Intracellular Space, Male, Microscopy, Models, Monitoring, Motor, Neoplasms, Neural Networks (Computer), Non-U.S. Gov't, Oligonucleotides, P.H.S., P300, Pattern Recognition, Peptides, Physiologic, Preclinical, Predictive Value of Tests, Preschool, Prognosis, Protein Interaction Mapping, Protein Structure, Proteins, Proteomics, Quantitative Structure-Activity Relationship, Quaternary, RNA, RNA Interference, Recognition (Psychology), Reproducibility of Results, Research Support, Sensitivity and Specificity, Signal Processing, Small Interfering, Software, Thionucleotides, Three-Dimensional, Tumor, U.S. Gov't, User-Computer Interface, Word Processing, 15188871
[Lahav2004NatGenet] G. Lahav, N. Rosenfeld, A. Sigal, N. Geva-Zatorsky, A. J. Levine, M. B. Elowitz, and U. Alon. Dynamics of the p53-mdm2 feedback loop in individual cells. Nat Genet, 36(2):147-50, 2004. [ bib ]
The tumor suppressor p53, one of the most intensely investigated proteins, is usually studied by experiments that are averaged over cell populations, potentially masking the dynamic behavior in individual cells. We present a system for following, in individual living cells, the dynamics of p53 and its negative regulator Mdm2 (refs. 1,4-7): this system uses functional p53-CFP and Mdm2-YFP fusion proteins and time-lapse fluorescence microscopy. We found that p53 was expressed in a series of discrete pulses after DNA damage. Genetically identical cells had different numbers of pulses: zero, one, two or more. The mean height and duration of each pulse were fixed and did not depend on the amount of DNA damage. The mean number of pulses, however, increased with DNA damage. This approach can be used to study other signaling systems and suggests that the p53-Mdm2 feedback loop generates a 'digital' clock that releases well-timed quanta of p53 until damage is repaired or the cell dies.

Keywords: csbcbook
[LHeureux2004Locally] P. J. L'Heureux, J. Carreau, Y. Bengio, O. Delalleau, and S. Y. Yue. Locally linear embedding for dimensionality reduction in QSAR. J. Comput. Aided Mol. Des., 18(7-9):475-82, 2004. [ bib | DOI | http | .pdf ]
Current practice in Quantitative Structure Activity Relationship (QSAR) methods usually involves generating a great number of chemical descriptors and then cutting them back with variable selection techniques. Variable selection is an effective method to reduce the dimensionality but may discard some valuable information. This paper introduces Locally Linear Embedding (LLE), a local non-linear dimensionality reduction technique, that can statistically discover a low-dimensional representation of the chemical data. LLE is shown to create more stable representations than other non-linear dimensionality reduction algorithms, and to be capable of capturing non-linearity in chemical data.

Keywords: dimred
[Kuncheva2004Combining] Ludmila I. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. Wiley-Interscience, 2004. [ bib ]
[Kuang2004Protein] R. Kuang, C. S. Leslie, and A.-S. Yang. Protein backbone angle prediction with machine learning approaches. Bioinformatics, 20(10):1612-1621, 2004. [ bib | http | .pdf ]
Motivation: Protein backbone torsion angle prediction provides useful local structural information that goes beyond conventional three-state (alpha, beta and coil) secondary structure predictions. Accurate prediction of protein backbone torsion angles will substantially improve modeling procedures for local structures of protein sequence segments, especially in modeling loop conformations that do not form regular structures as in alpha-helices or beta-strands. Results: We have devised two novel automated methods in protein backbone conformational state prediction: one method is based on support vector machines (SVMs); the other method combines a standard feed-forward back-propagation artificial neural network (NN) with a local structure-based sequence profile database (LSBSP1). Extensive benchmark experiments demonstrate that both methods have improved the prediction accuracy rate over the previously published methods for conformation state prediction when using an alphabet of three or four states. Availability: LSBSP1 and the NN algorithm have been implemented in PrISM.1, which is available from www.columbia.edu/~ay1/. Supplementary information: Supplementary data for the SVM method can be downloaded from the Website www.cs.columbia.edu/compbio/backbone.

Keywords: biosvm
[Kuang2004Profile-based] R. Kuang, E. Ie, K. Wang, K. Wang, M. Siddiqi, Y. Freund, and C. Leslie. Profile-based string kernels for remote homology detection and motif extraction. Proc IEEE Comput Syst Bioinform Conf, pages 152-160, 2004. [ bib ]
We introduce novel profile-based string kernels for use with support vector machines (SVMs) for the problems of protein classification and remote homology detection. These kernels use probabilistic profiles, such as those produced by the PSI-BLAST algorithm, to define position-dependent mutation neighborhoods along protein sequences for inexact matching of k-length subsequences ("k-mers") in the data. By use of an efficient data structure, the kernels are fast to compute once the profiles have been obtained. For example, the time needed to run PSI-BLAST in order to build the profiles is significantly longer than both the kernel computation time and the SVM training time. We present remote homology detection experiments based on the SCOP database where we show that profile-based string kernels used with SVM classifiers strongly outperform all recently presented supervised SVM methods. We also show how we can use the learned SVM classifier to extract "discriminative sequence motifs" - short regions of the original profile that contribute almost all the weight of the SVM classification score - and show that these discriminative motifs correspond to meaningful structural features in the protein data. The use of PSI-BLAST profiles can be seen as a semi-supervised learning technique, since PSI-BLAST leverages unlabeled data from a large sequence database to build more informative profiles. Recently presented "cluster kernels" give general semi-supervised methods for improving SVM protein classification performance. We show that our profile kernel results are comparable to cluster kernels while providing much better scalability to large datasets.

Keywords: biosvm
[Krishnapuram2004bayesian] B. Krishnapuram, A. J. Hartemink, L. Carin, and M. A. T. Figueiredo. A Bayesian approach to joint feature selection and classifier design. IEEE T. Pattern. Anal., 26(9):1105-11, Sep 2004. [ bib | DOI | http | .pdf ]
This paper adopts a Bayesian approach to simultaneously learn both an optimal nonlinear classifier and a subset of predictor variables (or features) that are most relevant to the classification task. The approach uses heavy-tailed priors to promote sparsity in the utilization of both basis functions and features; these priors act as regularizers for the likelihood function that rewards good classification on the training data. We derive an expectation-maximization (EM) algorithm to efficiently compute a maximum a posteriori (MAP) point estimate of the various parameters. The algorithm is an extension of recent state-of-the-art sparse Bayesian classifiers, which in turn can be seen as Bayesian counterparts of support vector machines. Experimental comparisons using kernel classifiers demonstrate both parsimonious feature selection and excellent classification accuracy on a range of synthetic and benchmark data sets.

Keywords: biosvm
[Krishnapuram2004Joint] B. Krishnapuram, L. Carin, and A. Hartemink. Joint Classifier and Feature Optimization for Comprehensive Cancer Diagnosis Using Gene Expression Data. J. Comput. Biol., 11(2-3):227-242, 2004. [ bib | DOI | http | .pdf ]
Recent research has demonstrated quite convincingly that accurate cancer diagnosis can be achieved by constructing classifiers that are designed to compare the gene expression profile of a tissue of unknown cancer status to a database of stored expression profiles from tissues of known cancer status. This paper introduces the JCFO, a novel algorithm that uses a sparse Bayesian approach to jointly identify both the optimal nonlinear classifier for diagnosis and the optimal set of genes on which to base that diagnosis. We show that the diagnostic classification accuracy of the proposed algorithm is superior to a number of current state-of-the-art methods in a full leave-one-out cross-validation study of five widely used benchmark datasets. In addition to its superior classification accuracy, the algorithm is designed to automatically identify a small subset of genes (typically around twenty in our experiments) that are capable of providing complete discriminatory information for diagnosis. Focusing attention on a small subset of genes is useful not only because it produces a classifier with good generalization capacity, but also because this set of genes may provide insights into the mechanisms responsible for the disease itself. A number of the genes identified by the JCFO in our experiments are already in use as clinical markers for cancer diagnosis; some of the remaining genes may be excellent candidates for further clinical investigation. If it is possible to identify a small set of genes that is indeed capable of providing complete discrimination, inexpensive diagnostic assays might be widely deployable in clinical settings.

Keywords: biosvm
[Krishnapuram2004Gene] B. Krishnapuram, L. Carin, and A. Hartemink. Gene expression analysis: joint feature selection and classifier design. In B. Schölkopf, K. Tsuda, and J.P. Vert, editors, Kernel Methods in Computational Biology, pages 299-317. MIT Press, 2004. [ bib | www: ]
Keywords: biosvm
[Kovatcheva2004Combinatorial] Assia Kovatcheva, Alexander Golbraikh, Scott Oloff, Yun-De Xiao, Weifan Zheng, Peter Wolschann, Gerhard Buchbauer, and Alexander Tropsha. Combinatorial QSAR of ambergris fragrance compounds. J Chem Inf Comput Sci, 44(2):582-95, 2004. [ bib | DOI | http | .pdf ]
A combinatorial quantitative structure-activity relationships (Combi-QSAR) approach has been developed and applied to a data set of 98 ambergris fragrance compounds with complex stereochemistry. The Combi-QSAR approach explores all possible combinations of different independent descriptor collections and various individual correlation methods to obtain statistically significant models with high internal (for the training set) and external (for the test set) accuracy. Seven different descriptor collections were generated with commercially available MOE, CoMFA, CoMMA, Dragon, VolSurf, and MolconnZ programs; we also included chirality topological descriptors recently developed in our laboratory (Golbraikh, A.; Bonchev, D.; Tropsha, A. J. Chem. Inf. Comput. Sci. 2001, 41, 147-158). CoMMA descriptors were used in combination with MOE descriptors. MolconnZ descriptors were used in combination with chirality descriptors. Each descriptor collection was combined individually with four correlation methods, including k-nearest neighbors (kNN) classification, Support Vector Machines (SVM), decision trees, and binary QSAR, giving rise to 28 different types of QSAR models. Multiple diverse and representative training and test sets were generated by the divisions of the original data set in two. Each model with high values of leave-one-out cross-validated correct classification rate for the training set was subjected to extensive internal and external validation to avoid overfitting and achieve reliable predictive power. Two validation techniques were employed, i.e., the randomization of the target property (in this case, odor intensity) also known as the Y-randomization test and the assessment of external prediction accuracy using test sets. We demonstrate that not every combination of the data modeling technique and the descriptor collection yields a validated and predictive QSAR model. 
kNN classification in combination with CoMFA descriptors was found to be the best QSAR approach overall since predictive models with correct classification rates for both training and test sets of 0.7 and higher were obtained for all divisions of the ambergris data set into the training and test sets. Many predictive QSAR models were also found using a combination of kNN classification method with other collections of descriptors. The combinatorial QSAR affords automation, computational efficiency, and higher probability of identifying significant QSAR models for experimental data sets than the traditional approaches that rely on a single QSAR method.

Keywords: Algorithms, Ambergris, Combinatorial Chemistry Techniques, Models, Molecular, Molecular Conformation, Odors, P.H.S., Perfume, Predictive Value of Tests, Quantitative Structure-Activity Relationship, Research Support, U.S. Gov't, 15032539
[Kote-Jarai2004Gene] Zsofia Kote-Jarai, Richard D Williams, Nicola Cattini, Maria Copeland, Ian Giddings, Richard Wooster, Robert H tePoele, Paul Workman, Barry Gusterson, John Peacock, Gerald Gui, Colin Campbell, and Ros Eeles. Gene expression profiling after radiation-induced DNA damage is strongly predictive of BRCA1 mutation carrier status. Clin. Cancer Res., 10(3):958-63, Feb 2004. [ bib | http | .pdf ]
PURPOSE: The impact of the presence of a germ-line BRCA1 mutation on gene expression in normal breast fibroblasts after radiation-induced DNA damage has been investigated. EXPERIMENTAL DESIGN: High-density cDNA microarray technology was used to identify differential responses to DNA damage in fibroblasts from nine heterozygous BRCA1 mutation carriers compared with five control samples without personal or family history of any cancer. Fibroblast cultures were irradiated, and their expression profile was compared using intensity ratios of the cDNA microarrays representing 5603 IMAGE clones. RESULTS: Class comparison and class prediction analysis has shown that BRCA1 mutation carriers can be distinguished from controls with high probability (approximately 85%). Significance analysis of microarrays and the support vector machine classifier identified gene sets that discriminate the samples according to their mutation status. These include genes already known to interact with BRCA1 such as CDKN1B, ATR, and RAD51. CONCLUSIONS: The results of this initial study suggest that normal cells from heterozygous BRCA1 mutation carriers display a different gene expression profile from controls in response to DNA damage. Adaptations of this pilot result to other cell types could result in the development of a functional assay for BRCA1 mutation status.

Keywords: biosvm, breastcancer
[Kondor2004Diffusion] R. Kondor and J.-P. Vert. Diffusion kernels. In B. Schölkopf, K. Tsuda, and J.P. Vert, editors, Kernel Methods in Computational Biology, pages 171-192. MIT Press, 2004. [ bib | www: ]
Keywords: biosvm
[Koike2004Prediction] A. Koike and T. Takagi. Prediction of protein-protein interaction sites using support vector machines. Protein Eng. Des. Sel., 17(2):165-173, Feb 2004. [ bib | DOI | http | .pdf ]
The identification of protein-protein interaction sites is essential for the mutant design and prediction of protein-protein networks. The interaction sites of residue units were predicted using support vector machines (SVM) and the profiles of sequentially/spatially neighboring residues, plus additional information. When only sequence information was used, prediction performance was highest using the feature vectors, sequentially neighboring profiles and predicted interaction site ratios, which were calculated by SVM regression using amino acid compositions. When structural information was also used, prediction performance was highest using the feature vectors, spatially neighboring residue profiles, accessible surface areas, and the with/without protein interaction sites ratios predicted by SVM regression and amino acid compositions. In the latter case, the precision at recall = 50 test set and >20 30 closest sequentially/spatially neighboring on the interaction site residues. The predicted residues covered 86-87 (96-97 appeared to be slightly higher than a previously reported study. Comparing the prediction accuracy of each molecule, it seems to be easier to predict interaction sites for stable complexes.

Keywords: biosvm
[Kohlmann2004Pediatric] A. Kohlmann, C. Schoch, S. Schnittger, M. Dugas, W. Hiddemann, W. Kern, and T. Haferlach. Pediatric acute lymphoblastic leukemia (ALL) gene expression signatures classify an independent cohort of adult ALL patients. Leukemia, 18(1):63-71, 2004. [ bib | DOI | http | .pdf ]
Recent reports support a possible future application of gene expression profiling for the diagnosis of leukemias. However, the robustness of subtype-specific gene expression signatures has to be proven on independent patient samples. Here, we present gene expression data of 34 adult acute lymphoblastic leukemia (ALL) patients (Affymetrix U133A microarrays). Support Vector Machines (SVMs) were applied to stratify our samples based on given gene lists reported to predict MLL, BCR-ABL, and T-ALL, as well as MLL and non-MLL gene rearrangement positive pediatric ALL. In addition, seven other B-precursor ALL cases not bearing t(9;22) or t(11q23)/MLL chromosomal aberrations were analyzed. Using top differentially expressed genes, hierarchical cluster and principal component analyses demonstrate that the genetically more heterogeneous B-precursor ALL samples intercalate with BCR-ABL-positive cases, but were clearly distinct from T-ALL and MLL profiles. Similar expression signatures were observed for both heterogeneous B-precursor ALL and for BCR-ABL-positive cases. As an unrelated laboratory, we demonstrate that gene signatures defined for childhood ALL were also capable of stratifying distinct subtypes in our cohort of adult ALL patients. As such, previously reported gene expression patterns identified by microarray technology are validated and confirmed on truly independent leukemia patient samples.

Keywords: biosvm
[Klamt2004Minimal] S. Klamt and E. D. Gilles. Minimal cut sets in biochemical reaction networks. Bioinformatics, 20(2):226-234, Jan 2004. [ bib | DOI | http ]
Structural studies of metabolic networks yield deeper insight into topology, functionality and capabilities of the metabolisms of different organisms. Here, we address the analysis of potential failure modes in metabolic networks whose occurrence will render the network structurally incapable of performing certain functions. Such studies will help to identify crucial parts in the network structure and to find suitable targets for repressing undesired metabolic functions. We introduce the concept of minimal cut sets for biochemical networks. A minimal cut set (MCS) is a minimal (irreducible) set of reactions in the network whose inactivation will definitely lead to a failure in certain network functions. We present an algorithm which enables the computation of the MCSs in a given network related to user-defined objective reactions. This algorithm operates on elementary modes. A number of potential applications are outlined, including network verifications, phenotype predictions, assessing structural robustness and fragility, metabolic flux analysis and target identification in drug discovery. Applications are illustrated by the MCSs in the central metabolism of Escherichia coli for growth on different substrates. Computation and analysis of MCSs is an additional feature of the FluxAnalyzer (freely available for academic users upon request, special contracts for industrial companies; see web page below). Supplementary information: http://www.mpi-magdeburg.mpg.de/projects/fluxanalyzer

[Kitchen2004Docking] D. B. Kitchen, H. Decornez, J. R. Furr, and J. Bajorath. Docking and scoring in virtual screening for drug discovery: methods and applications. Nat Rev Drug Discov, 3(11):935-949, Nov 2004. [ bib | DOI | http ]
Computational approaches that 'dock' small molecules into the structures of macromolecular targets and 'score' their potential complementarity to binding sites are widely used in hit identification and lead optimization. Indeed, there are now a number of drugs whose development was heavily influenced by or based on structure-based design and screening strategies, such as HIV protease inhibitors. Nevertheless, there remain significant challenges in the application of these approaches, in particular in relation to current scoring schemes. Here, we review key concepts and specific features of small-molecule-protein docking methods, highlight selected applications and discuss recent advances that aim to address the acknowledged limitations of established approaches.

[Kitano2004Cancer] H. Kitano. Cancer as a robust system: implications for anticancer therapy. Nat. Rev. Cancer, 4:227-235, 2004. [ bib | DOI | http | .pdf ]
Cancers are extremely complex, heterogeneous diseases. Many approaches to anticancer treatment have had limited success; cures are still rare. A fundamental hurdle to cancer therapy is acquired tumour 'robustness'. The goal of this article is to present a perspective on cancer as a robust system to provide a framework from which the complexity of tumours can be approached to yield novel therapies.

[Kim2004Enhancing] Sang-Woon Kim and B. John Oommen. Enhancing prototype reduction schemes with recursion: a method applicable for "large" data sets. IEEE Trans Syst Man Cybern B Cybern, 34(3):1384-97, Jun 2004. [ bib ]
Most of the prototype reduction schemes (PRS), which have been reported in the literature, process the data in its entirety to yield a subset of prototypes that are useful in nearest-neighbor-like classification. Foremost among these are the prototypes for nearest neighbor classifiers, the vector quantization technique, and the support vector machines. These methods suffer from a major disadvantage, namely, that of the excessive computational burden encountered by processing all the data. In this paper, we suggest a recursive and computationally superior mechanism referred to as adaptive recursive partitioning (ARP)_PRS. Rather than process all the data using a PRS, we propose that the data be recursively subdivided into smaller subsets. This recursive subdivision can be arbitrary, and need not utilize any underlying clustering philosophy. The advantage of ARP_PRS is that the PRS processes subsets of data points that effectively sample the entire space to yield smaller subsets of prototypes. These prototypes are then, in turn, gathered and processed by the PRS to yield more refined prototypes. In this manner, prototypes which are in the interior of the Voronoi spaces, and thus ineffective in the classification, are eliminated at the subsequent invocations of the PRS. We are unaware of any PRS that employs such a recursive philosophy. Although we marginally forfeit accuracy in return for computational efficiency, our experimental results demonstrate that the proposed recursive mechanism yields classification comparable to the best reported prototype condensation schemes reported to-date. Indeed, this is true for both artificial data sets and for samples involving real-life data sets. The results especially demonstrate that a fair computational advantage can be obtained by using such a recursive strategy for "large" data sets, such as those involved in data mining and text categorization applications.

[Kim2004Prediction] J. H. Kim, J. Lee, B. Oh, K. Kimm, and I. Koh. Prediction of phosphorylation sites using SVMs. Bioinformatics, 20(17):3179-3184, 2004. [ bib | DOI | http | .pdf ]
Motivation: Phosphorylation is involved in diverse signal transduction pathways. By predicting phosphorylation sites and their kinases from primary protein sequences, we can obtain much valuable information that can form the basis for further research. Using support vector machines, we attempted to predict phosphorylation sites and the type of kinase that acts at each site. Results: Our prediction system was limited to phosphorylation sites catalyzed by four protein kinase families and four protein kinase groups. The accuracy of the predictions ranged from 83 to 95 kinase group level. The prediction system used, PredPhospho, can be applied to the functional study of proteins, and can help predict the changes in phosphorylation sites caused by amino acid variations at intra- and interspecies levels. Availability: PredPhospho is available at http://www.ngri.re.kr/proteo/PredPhospho.htm. Supplementary information: http://www.ngri.re.kr/proteo/supplementary.doc

Keywords: biosvm
[Kim2004Predictiona] H. Kim and H. Park. Prediction of protein relative solvent accessibility with support vector machines and long-range interaction 3D local descriptor. Proteins, 54(3):557-562, Feb 2004. [ bib | DOI | http | .pdf ]
The prediction of protein relative solvent accessibility gives us helpful information for the prediction of tertiary structure of a protein. The SVMpsi method, which uses support vector machines (SVMs), and the position-specific scoring matrix (PSSM) generated from PSI-BLAST have been applied to achieve better prediction accuracy of the relative solvent accessibility. We have introduced a three-dimensional local descriptor that contains information about the expected remote contacts by both the long-range interaction matrix and neighbor sequences. Moreover, we applied feature weights to kernels in SVMs in order to consider the degree of significance that depends on the distance from the specific amino acid. Relative solvent accessibility based on a two state-model, for 25 and 0 accuracy, respectively. Three-state prediction results provide a 64.5 approach has successfully been applied for solvent accessibility prediction by considering long-range interaction and handling unbalanced data.

Keywords: biosvm
[Kharchenko2004Filling] P. Kharchenko, D. Vitkup, and G. M. Church. Filling gaps in a metabolic network using expression information. Bioinformatics, 20 Suppl 1:I178-I185, Aug 2004. [ bib | DOI | http ]
MOTIVATION: The metabolic models of both newly sequenced and well-studied organisms contain reactions for which the enzymes have not been identified yet. We present a computational approach for identifying genes encoding such missing metabolic enzymes in a partially reconstructed metabolic network. RESULTS: The metabolic expression placement (MEP) method relies on the coexpression properties of the metabolic network and is complementary to the sequence homology and genome context methods that are currently being used to identify missing metabolic genes. The MEP algorithm predicts over 20% of all known Saccharomyces cerevisiae metabolic enzyme-encoding genes within the top 50 out of 5594 candidates for their enzymatic function, and 70% of metabolic genes whose expression level has been significantly perturbed across the conditions of the expression dataset used. AVAILABILITY: Freely available (in Supplementary information). SUPPLEMENTARY INFORMATION: Available at the following URL http://arep.med.harvard.edu/kharchenko/mep/supplements.html

Keywords: Bacterial, Binding Sites, Biological, Comparative Study, DNA, Energy Metabolism, Enzyme Induction, Enzymes, Escherichia coli Proteins, Fungal, Gene Expression Regulation, Genes, Genetic, Genome, Models, Non-P.H.S., Non-U.S. Gov't, Phylogeny, Promoter Regions (Genetics), Protein, Research Support, Saccharomyces cerevisiae, Saccharomyces cerevisiae Proteins, Sequence Analysis, Systems Biology, Transcription Factors, U.S. Gov't, 15262797
[Kelley2004PathBLAST] B.P. Kelley, B. Yuan, F. Lewitter, R. Sharan, B.R. Stockwell, and T. Ideker. PathBLAST: a tool for alignment of protein interaction networks. Nucleic Acids Res., 32(Web Server issue):W83-W88, Jul 2004. [ bib | DOI | http ]
PathBLAST is a network alignment and search tool for comparing protein interaction networks across species to identify protein pathways and complexes that have been conserved by evolution. The basic method searches for high-scoring alignments between pairs of protein interaction paths, for which proteins of the first path are paired with putative orthologs occurring in the same order in the second path. This technique discriminates between true- and false-positive interactions and allows for functional annotation of protein interaction pathways based on similarity to the network of another, well-characterized species. PathBLAST is now available at http://www.pathblast.org/ as a web-based query. In this implementation, the user specifies a short protein interaction path for query against a target protein-protein interaction network selected from a network database. PathBLAST returns a ranked list of matching paths from the target network along with a graphical view of these paths and the overlap among them. Target protein-protein interaction networks are currently available for Helicobacter pylori, Saccharomyces cerevisiae, Caenorhabditis elegans and Drosophila melanogaster. Just as BLAST enables rapid comparison of protein sequences between genomes, tools such as PathBLAST are enabling comparative genomics at the network level.

[Kellenberger2004Comparative] E. Kellenberger, J. Rodrigo, P. Muller, and D. Rognan. Comparative evaluation of eight docking tools for docking and virtual screening accuracy. Proteins, 57(2):225-242, Nov 2004. [ bib | DOI | http ]
Eight docking programs (DOCK, FLEXX, FRED, GLIDE, GOLD, SLIDE, SURFLEX, and QXP) that can be used for either single-ligand docking or database screening have been compared for their propensity to recover the X-ray pose of 100 small-molecular-weight ligands, and for their capacity to discriminate known inhibitors of an enzyme (thymidine kinase) from randomly chosen "drug-like" molecules. Interestingly, both properties are found to be correlated, since the tools showing the best docking accuracy (GLIDE, GOLD, and SURFLEX) are also the most successful in ranking known inhibitors in a virtual screening experiment. Moreover, the current study pinpoints some physicochemical descriptors of either the ligand or its cognate protein-binding site that generally lead to docking/scoring inaccuracies.

[Kashima2004Kernels] H. Kashima, K. Tsuda, and A. Inokuchi. Kernels for graphs. In B. Schölkopf, K. Tsuda, and J.P. Vert, editors, Kernel Methods in Computational Biology, pages 155-170. MIT Press, Cambridge, Massachusetts, 2004. [ bib ]
Keywords: biosvm, chemoinformatics
[Kaper2004BCI] Matthias Kaper, Peter Meinicke, Ulf Grossekathoefer, Thomas Lingner, and Helge Ritter. BCI Competition 2003-Data set IIb: support vector machines for the P300 speller paradigm. IEEE Trans Biomed Eng, 51(6):1073-6, Jun 2004. [ bib ]
We propose an approach to analyze data from the P300 speller paradigm using the machine-learning technique support vector machines. In a conservative classification scheme, we found the correct solution after five repetitions. While the classification within the competition is designed for offline analysis, our approach is also well-suited for a real-world online solution: It is fast, requires only 10 electrode positions and demands only a small amount of preprocessing.

Keywords: Algorithms, Animals, Antisense, Artificial Intelligence, Automated, Autonomic Nervous System, Brain, Cell Line, Child, Cluster Analysis, Cognition, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA Fingerprinting, Databases, Drug Evaluation, Electroencephalography, Emotions, Event-Related Potentials, Factual, Fluorescence, Fuzzy Logic, Gene Silencing, Gene Targeting, Genetic, Hela Cells, Humans, Imaging, Intracellular Space, Microscopy, Models, Monitoring, Neoplasms, Neural Networks (Computer), Non-U.S. Gov't, Oligonucleotides, P.H.S., P300, Pattern Recognition, Peptides, Physiologic, Preclinical, Predictive Value of Tests, Preschool, Prognosis, Protein Interaction Mapping, Protein Structure, Proteins, Proteomics, Quantitative Structure-Activity Relationship, Quaternary, RNA, RNA Interference, Recognition (Psychology), Reproducibility of Results, Research Support, Sensitivity and Specificity, Signal Processing, Small Interfering, Software, Thionucleotides, Three-Dimensional, Tumor, U.S. Gov't, User-Computer Interface, Word Processing, 15188881
[Kanehisa2004KEGG] M. Kanehisa, S. Goto, S. Kawashima, Y. Okuno, and M. Hattori. The KEGG resource for deciphering the genome. Nucleic Acids Res., 32(Database issue):D277-80, Jan 2004. [ bib | DOI | http ]
A grand challenge in the post-genomic era is a complete computer representation of the cell and the organism, which will enable computational prediction of higher-level complexity of cellular processes and organism behavior from genomic information. Toward this end we have been developing a knowledge-based approach for network prediction, which is to predict, given a complete set of genes in the genome, the protein interaction networks that are responsible for various cellular processes. KEGG at http://www.genome.ad.jp/kegg/ is the reference knowledge base that integrates current knowledge on molecular interaction networks such as pathways and complexes (PATHWAY database), information about genes and proteins generated by genome projects (GENES/SSDB/KO databases) and information about biochemical compounds and reactions (COMPOUND/GLYCAN/REACTION databases). These three types of database actually represent three graph objects, called the protein network, the gene universe and the chemical universe. New efforts are being made to abstract knowledge, both computationally and manually, about ortholog clusters in the KO (KEGG Orthology) database, and to collect and analyze carbohydrate structures in the GLYCAN database.

Keywords: glycans
[Kalatzis2004Design] I. Kalatzis, N. Piliouras, E. Ventouras, C. C. Papageorgiou, A. D. Rabavilas, and D. Cavouras. Design and implementation of an SVM-based computer classification system for discriminating depressive patients from healthy controls using the P600 component of ERP signals. Comput Methods Programs Biomed, 75(1):11-22, Jul 2004. [ bib | DOI | http | .pdf ]
A computer-based classification system has been designed capable of distinguishing patients with depression from normal controls by event-related potential (ERP) signals using the P600 component. Clinical material comprised 25 patients with depression and an equal number of gender- and age-matched healthy controls. All subjects were evaluated by a computerized version of the digit span Wechsler test. EEG activity was recorded and digitized from 15 scalp electrodes (leads). Seventeen features related to the shape of the waveform were generated and were employed in the design of an optimum support vector machine (SVM) classifier at each lead. The outcomes of those SVM classifiers were selected by a majority-vote engine (MVE), which assigned each subject to either the normal or depressive classes. MVE classification accuracy was 94% when using all leads and 92% or 82% when using only the right or left scalp leads, respectively. These findings support the hypothesis that depression is associated with dysfunction of right hemisphere mechanisms mediating the processing of information that assigns a specific response to a specific stimulus, as those mechanisms are reflected by the P600 component of ERPs. Our method may aid the further understanding of the neurophysiology underlying depression, due to its potentiality to integrate theories of depression and psychophysiology.
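
The majority-vote step described in this abstract can be illustrated with a minimal sketch. It assumes each lead's classifier has already produced a label; the function name `majority_vote` and the label strings are hypothetical and are not taken from the paper.

```python
from collections import Counter

def majority_vote(lead_predictions):
    """Return the label predicted by the largest number of per-lead classifiers.

    lead_predictions: one label per scalp lead (hypothetical input format).
    With an odd number of leads and two classes, a tie cannot occur.
    """
    if not lead_predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(lead_predictions)
    label, _ = counts.most_common(1)[0]
    return label

# Example: 15 leads, 9 of which vote "depressive".
votes = ["depressive"] * 9 + ["normal"] * 6
print(majority_vote(votes))
```

With an even number of voters a tie-breaking rule would have to be chosen; the abstract does not say how the authors handled that case.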

[Jorgensen2004many] W. L. Jorgensen. The many roles of computation in drug discovery. Science, 303(5665):1813-1818, Mar 2004. [ bib | DOI | http | .pdf ]
An overview is given on the diverse uses of computational chemistry in drug discovery. Particular emphasis is placed on virtual screening, de novo design, evaluation of drug-likeness, and advanced methods for determining protein-ligand binding.

Keywords: chemoinformatics
[Jong2004Breakpoint] K. Jong, E. Marchiori, G. Meijer, A. V. D. Vaart, and B. Ylstra. Breakpoint identification and smoothing of array comparative genomic hybridization data. Bioinformatics, 20(18):3636-3637, Dec 2004. [ bib | DOI | http | .pdf ]
SUMMARY: We describe a tool, called aCGH-Smooth, for the automated identification of breakpoints and smoothing of microarray comparative genomic hybridization (array CGH) data. aCGH-Smooth is written in Visual C++, has a user-friendly interface including a visualization of the results and user-defined parameters adapting the performance of data smoothing and breakpoint recognition. aCGH-Smooth can handle array-CGH data generated by all array-CGH platforms: BAC, PAC, cosmid, cDNA and oligo CGH arrays. The tool has been successfully applied to real-life data. AVAILABILITY: aCGH-Smooth is free for researchers at academic and non-profit institutions at http://www.few.vu.nl/ vumarray/.

Keywords: cgh
[Jones-Rhoades2004Computational] Matthew W Jones-Rhoades and David P Bartel. Computational identification of plant microRNAs and their targets, including a stress-induced miRNA. Mol Cell, 14(6):787-799, Jun 2004. [ bib | DOI | http | .pdf ]
MicroRNAs (miRNAs) are approximately 21-nucleotide RNAs, some of which have been shown to play important gene-regulatory roles during plant development. We developed comparative genomic approaches to systematically identify both miRNAs and their targets that are conserved in Arabidopsis thaliana and rice (Oryza sativa). Twenty-three miRNA candidates, representing seven newly identified gene families, were experimentally validated in Arabidopsis, bringing the total number of reported miRNA genes to 92, representing 22 families. Nineteen newly identified target candidates were confirmed by detecting mRNA fragments diagnostic of miRNA-directed cleavage in plants. Overall, plant miRNAs have a strong propensity to target genes controlling development, particularly those of transcription factors and F-box proteins. However, plant miRNAs have conserved regulatory functions extending beyond development, in that they also target superoxide dismutases, laccases, and ATP sulfurylases. The expression of miR395, the sulfurylase-targeting miRNA, increases upon sulfate starvation, showing that miRNAs can be induced by environmental stress.

Keywords: sirna
[Jones2004Molecular] C. Jones, E. Ford, C. Gillett, K. Ryder, S. Merrett, J. S. Reis-Filho, L. G. Fulford, A. Hanby, and S. R. Lakhani. Molecular cytogenetic identification of subgroups of grade III invasive ductal breast carcinomas with different clinical outcomes. Clin. Cancer Res., 10(18):5988-5997, 2004. [ bib | DOI | arXiv | http | .pdf ]
Tumor grade is an established indicator of breast cancer outcome, although considerable heterogeneity exists even within-grade. Around 25% [...] with a "basal" phenotype, and these tumors are reported to be a distinct subgroup. We have investigated whether this group of breast cancers has a distinguishing pattern of genetic alterations and which of these may relate to the different clinical outcome of these patients. We performed comparative genomic hybridization (CGH) analysis on 43 grade III invasive ductal breast carcinomas positive for basal cytokeratin 14, as well as 43 grade- and age-matched CK14-negative controls, all with up to 25 years (median, 7 years) of clinical follow-up. Significant differences in CGH alterations were seen between the two groups in terms of mean number of changes (CK14+ve - 6.5, CK14-ve - 10.3; P = 0.0012) and types of alterations at chromosomes 4q, 7q, 8q, 9p, 13q, 16p, 17p, 17q, 19p, 19q, 20p, 20q and Xp. Supervised and unsupervised algorithms separated the two groups on CGH data alone with 76% [...] revealed distinct subgroups, one of which contained 18 (42%) [...] CK14+ve tumors. This subgroup had significantly shorter overall survival (P = 0.0414) than other grade III tumors, regardless of CK14 status, and was an independent prognostic marker (P = 0.031). These data provide evidence that the "basal" phenotype on its own does not convey a poor prognosis. Basal tumors are also heterogeneous with only a subset, identifiable by pattern of genetic alterations, exhibiting a shorter overall survival. Robust characterization of this basal group is necessary if it is to have a major impact on management of patients with breast cancer.

Keywords: breastcancer, cgh
[John2004Human] Bino John, Anton J Enright, Alexei Aravin, Thomas Tuschl, Chris Sander, and Debora S Marks. Human microRNA targets. PLoS Biol, 2(11):e363, Nov 2004. [ bib | DOI | http | .pdf ]
MicroRNAs (miRNAs) interact with target mRNAs at specific sites to induce cleavage of the message or inhibit translation. The specific function of most mammalian miRNAs is unknown. We have predicted target sites on the 3' untranslated regions of human gene transcripts for all currently known 218 mammalian miRNAs to facilitate focused experiments. We report about 2,000 human genes with miRNA target sites conserved in mammals and about 250 human genes conserved as targets between mammals and fish. The prediction algorithm optimizes sequence complementarity using position-specific rules and relies on strict requirements of interspecies conservation. Experimental support for the validity of the method comes from known targets and from strong enrichment of predicted targets in mRNAs associated with the fragile X mental retardation protein in mammals. This is consistent with the hypothesis that miRNAs act as sequence-specific adaptors in the interaction of ribonuclear particles with translationally regulated messages. Overrepresented groups of targets include mRNAs coding for transcription factors, components of the miRNA machinery, and other proteins involved in translational regulation, as well as components of the ubiquitin machinery, representing novel feedback loops in gene regulation. Detailed information about target genes, target processes, and open-source software for target prediction (miRanda) is available at http://www.microrna.org. Our analysis suggests that miRNA genes, which are about 1% of all human genes, regulate protein production for 10% or more of all human genes.

Keywords: sirna
[Jiang-Ning2004Cooperativity] S. Jiang-Ning, L. Wei-Jiang, and X. Wen-Bo. Cooperativity of the oxidization of cysteines in globular proteins. J. Theor. Biol., 231(1):85-95, 2004. [ bib | DOI | http | .pdf ]
Based on the 639 non-homologous proteins with 2910 cysteine-containing segments of well-resolved three-dimensional structures, a novel approach has been proposed to predict the disulfide-bonding state of cysteines in proteins by constructing a two-stage classifier combining a first global linear discriminator based on their amino acid composition and a second local support vector machine classifier. The overall prediction accuracy of this hybrid classifier for the disulfide-bonding state of cysteines in proteins has scored 84.1% [...] on cysteine and protein basis using the rigorous jack-knife procedure, respectively. It shows that whether cysteines should form disulfide bonds depends not only on the global structural features of proteins but also on the local sequence environment of proteins. The result demonstrates the applicability of this novel method and provides comparable prediction performance compared with existing methods for the prediction of the oxidation states of cysteines in proteins.

Keywords: biosvm
[Jebara2004Probability] T. Jebara, R. Kondor, and A. Howard. Probability Product Kernels. J. Mach. Learn. Res., 5:819-844, 2004. [ bib | .html ]
Keywords: kernel-theory
[Jebara2004Multi-task] Tony Jebara. Multi-task feature and kernel selection for SVMs. In ICML '04: Proceedings of the twenty-first international conference on Machine learning, page 55, New York, NY, USA, 2004. ACM. [ bib | DOI ]
[Jackson2004Noise] A. L. Jackson and P. S. Linsley. Noise amidst the silence: off-target effects of siRNAs? Trends Genet., 20(11):521-4, Nov 2004. [ bib | DOI | http ]
RNA interference (RNAi), mediated by short interfering RNAs (siRNAs), is widely used to silence gene expression and to define gene function in mammalian cells. Initially, this gene silencing via transcript degradation was believed to be exquisitely specific, requiring near-identity between the siRNA and the target mRNA. However, several recent reports have suggested that non-specific effects can be induced by siRNAs, both at the level of mRNA and protein. These findings suggest that siRNAs can regulate the expression of unintended targets, and argue for further experiments on the mechanism and extent of off-target gene regulation(s). In the meantime, caution is warranted in interpreting gene function and phenotypes resulting from RNAi experiments.

Keywords: sirna
[Ishkanian2004tiling] A. S. Ishkanian, C. A. Malloff, S. K. Watson, R. J. DeLeeuw, B. Chi, B. P. Coe, A. Snijders, D. G. Albertson, D. Pinkel, M. A. Marra, V. Ling, C. MacAulay, and W. L. Lam. A tiling resolution DNA microarray with complete coverage of the human genome. Nat. Genet., 36(3):299-303, Mar 2004. [ bib | DOI | http | .pdf ]
We constructed a tiling resolution array consisting of 32,433 overlapping BAC clones covering the entire human genome. This increases our ability to identify genetic alterations and their boundaries throughout the genome in a single comparative genomic hybridization (CGH) experiment. At this tiling resolution, we identified minute DNA alterations not previously reported. These alterations include microamplifications and deletions containing oncogenes, tumor-suppressor genes and new genes that may be associated with multiple tumor types. Our findings show the need to move beyond conventional marker-based genome comparison approaches, that rely on inference of continuity between interval markers. Our submegabase resolution tiling set for array CGH (SMRT array) allows comprehensive assessment of genomic integrity and thereby the identification of new genes associated with disease.

Keywords: csbcbook, microarray
[Consortium2004Finishing] International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature, 431, 2004. [ bib ]
[Iafrate2004Detection] A. John Iafrate, Lars Feuk, Miguel N Rivera, Marc L Listewnik, Patricia K Donahoe, Ying Qi, Stephen W Scherer, and Charles Lee. Detection of large-scale variation in the human genome. Nat. Genet., 36(9):949-951, Sep 2004. [ bib | DOI | http | .pdf ]
We identified 255 loci across the human genome that contain genomic imbalances among unrelated individuals. Twenty-four variants are present in > 10% of the individuals that we examined. Half of these regions overlap with genes, and many coincide with segmental duplications or gaps in the human genome assembly. This previously unappreciated heterogeneity may underlie certain human phenotypic variation and susceptibility to disease and argues for a more dynamic human genome structure.

Keywords: cgh, csbcbook, csbcbook-ch2
[Hutter2004Prediction] B. Hutter, C. Schaab, S. Albrecht, M. Borgmann, N. A. Brunner, C. Freiberg, K. Ziegelbauer, C. O. Rock, I. Ivanov, and H. Loferer. Prediction of Mechanisms of Action of Antibacterial Compounds by Gene Expression Profiling. Antimicrob. Agents Chemother., 48(8):2838-2844, Aug 2004. [ bib | DOI | arXiv | http | .pdf ]
We have generated a database of expression profiles carrying the transcriptional responses of the model organism Bacillus subtilis following treatment with 37 well-characterized antibacterial compounds of different classes. The database was used to build a predictor for the assignment of the mechanisms of action (MoAs) of antibacterial compounds by the use of support vector machines. This predictor was able to correctly classify the MoA class for most compounds tested. Furthermore, we provide evidence that the in vivo MoA of hexachlorophene does not match the MoA predicted from in vitro data, a situation frequently faced in drug discovery. A database of this kind may facilitate the prioritization of novel antibacterial entities in drug discovery programs. Potential applications and limitations are discussed.

Keywords: biosvm
[Hupe2004Analysis] P. Hupé, N. Stransky, J.-P. Thiery, F. Radvanyi, and E. Barillot. Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics, 20(18):3413-3422, Dec 2004. [ bib | DOI | http | .pdf ]
MOTIVATION: Genomic DNA regions are frequently lost or gained during tumor progression. Array Comparative Genomic Hybridization (array CGH) technology makes it possible to assess these changes in DNA in cancers, by comparison with a normal reference. The identification of systematically deleted or amplified genomic regions in a set of tumors enables biologists to identify genes involved in cancer progression because tumor suppressor genes are thought to be located in lost genomic regions and oncogenes, in gained regions. Array CGH profiles should also improve the classification of tumors. The achievement of these goals requires a methodology for detecting the breakpoints delimiting altered regions in genomic patterns and assigning a status (normal, gained or lost) to each chromosomal region. RESULTS: We have developed a methodology for the automatic detection of breakpoints from array CGH profile, and the assignment of a status to each chromosomal region. The breakpoint detection step is based on the Adaptive Weights Smoothing (AWS) procedure and provides highly convincing results: our algorithm detects 97, 100 and 94% of breakpoints in simulated data, karyotyping results and manually analyzed profiles, respectively. The percentage of correctly assigned statuses ranges from 98.9 to 99.8% for simulated data and is 100% for karyotyping results. Our algorithm also outperforms other solutions on a public reference dataset. AVAILABILITY: The R package GLAD (Gain and Loss Analysis of DNA) is available upon request.

Keywords: cgh
[Hue2004Semi-supervised] M. Hue. Semi-supervised learning for protein structure prediction. Master's thesis, Ecole des Mines de Paris, 2004. [ bib ]
[Huang2004Boosting] K. Huang and R.F. Murphy. Boosting accuracy of automated classification of fluorescence microscope images for location proteomics. BMC Bioinformatics, 5(78):78, 2004. [ bib | DOI | http | .pdf ]
Background Detailed knowledge of the subcellular location of each expressed protein is critical to a full understanding of its function. Fluorescence microscopy, in combination with methods for fluorescent tagging, is the most suitable current method for proteome-wide determination of subcellular location. Previous work has shown that neural network classifiers can distinguish all major protein subcellular location patterns in both 2D and 3D fluorescence microscope images. Building on these results, we evaluate here new classifiers and features to improve the recognition of protein subcellular location patterns in both 2D and 3D fluorescence microscope images. Results We report here a thorough comparison of the performance on this problem of eight different state-of-the-art classification methods, including neural networks, support vector machines with linear, polynomial, radial basis, and exponential radial basis kernel functions, and ensemble methods such as AdaBoost, Bagging, and Mixtures-of-Experts. Ten-fold cross validation was used to evaluate each classifier with various parameters on different Subcellular Location Feature sets representing both 2D and 3D fluorescence microscope images, including new feature sets incorporating features derived from Gabor and Daubechies wavelet transforms. After optimal parameters were chosen for each of the eight classifiers, optimal majority-voting ensemble classifiers were formed for each feature set. Comparison of results for each image for all eight classifiers permits estimation of the lower bound classification error rate for each subcellular pattern, which we interpret to reflect the fraction of cells whose patterns are distorted by mitosis, cell death or acquisition errors. 
Overall, we obtained statistically significant improvements in classification accuracy over the best previously published results, with the overall error rate being reduced by one-third to one-half and with the average accuracy for single 2D images being higher than 90% [...] accuracy for the easily confused endomembrane compartments (endoplasmic reticulum, Golgi, endosomes, lysosomes) was improved by 5-15%. We achieved further improvements when classification was conducted on image sets rather than on individual cell images. Conclusions The availability of accurate, fast, automated classification systems for protein location patterns in conjunction with high throughput fluorescence microscope imaging techniques enables a new subfield of proteomics, location proteomics. The accuracy and sensitivity of this approach represents an important alternative to low-resolution assignments by curation or sequence-based prediction.

[Huang2004novel] Jian qiang Huang, Xiang xian Chen, and Le yu Wang. A novel method for tracking pedestrians from real-time video. J Zhejiang Univ Sci, 5(1):99-105, Jan 2004. [ bib ]
This novel method of Pedestrian Tracking using Support Vector (PTSV) proposed for a video surveillance instrument combines the Support Vector Machine (SVM) classifier into an optic-flow based tracker. The traditional method using optical flow tracks objects by minimizing an intensity difference function between successive frames, while PTSV tracks objects by maximizing the SVM classification score. As the SVM classifier for object and non-object is pre-trained, there is need only to classify an image block as object or non-object without having to compare the pixel region of the tracked object in the previous frame. To account for large motions between successive frames we build pyramids from the support vectors and use a coarse-to-fine scan in the classification stage. To accelerate the training of SVM, a Sequential Minimal Optimization Method (SMO) is adopted. The results of using a kernel-PTSV for pedestrian tracking from real time video are shown at the end. Comparative experimental results showed that PTSV improves the reliability of tracking compared to that of traditional tracking method using optical flow.

[Huan2004Accurate] J. Huan, W. Wang, A. Washington, J. Prins, R. Shah, and A. Tropsha. Accurate classification of protein structural families using coherent subgraph analysis. In Proceedings of the Pacific Symposium on Biocomputing 2004, pages 411-422, 2004. [ bib | .pdf ]
Protein structural annotation and classification is an important problem in bioinformatics. We report on the development of an efficient subgraph mining technique and its application to finding characteristic substructural patterns within protein structural families. In our method, protein structures are represented by graphs where the nodes are residues and the edges connect residues found within certain distance from each other. Application of subgraph mining to proteins is challenging for a number of reasons: (1) protein graphs are large and complex, (2) current protein databases are large and continue to grow rapidly, and (3) only a small fraction of the frequent subgraphs among the huge pool of all possible subgraphs could be significant in the context of protein classification. To address these challenges, we have developed an information theoretic model called coherent subgraph mining. From information theory, the entropy of a random variable X measures the information content carried by X and the Mutual Information (MI) between two random variables X and Y measures the correlation between X and Y. We define a subgraph X as coherent if it is strongly correlated with every sufficiently large sub-subgraph Y embedded in it. Based on the MI metric, we have designed a search scheme that only reports coherent subgraphs. To determine the significance of coherent protein subgraphs, we have conducted an experimental study in which all coherent subgraphs were identified in several protein structural families annotated in the SCOP database (Murzin et al., 1995). The Support Vector Machine algorithm was used to classify proteins from different families under the binary classification scheme. We find that this approach identifies spatial motifs unique to individual SCOP families and affords excellent discrimination between families.

Keywords: biosvm
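
The entropy and mutual information quantities mentioned in the abstract above can be computed as in the following sketch. It uses the standard textbook definitions on a small joint probability table and is not the authors' coherent subgraph mining procedure; the function names and the table layout (rows index X, columns index Y) are assumptions made for illustration.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def mutual_information(joint):
    """Mutual information I(X;Y) in bits from a joint probability table.

    joint[i][j] is P(X = i, Y = j); all entries are assumed to sum to 1.
    """
    p_x = [sum(row) for row in joint]        # marginal distribution of X
    p_y = [sum(col) for col in zip(*joint)]  # marginal distribution of Y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p_xy in enumerate(row):
            if p_xy > 0:
                mi += p_xy * math.log2(p_xy / (p_x[i] * p_y[j]))
    return mi

# Two binary variables: independent versus perfectly dependent.
independent = [[0.25, 0.25], [0.25, 0.25]]
dependent = [[0.5, 0.0], [0.0, 0.5]]
print(mutual_information(independent), mutual_information(dependent))
```

For the perfectly dependent table the mutual information equals the entropy of either variable, which is the sense in which a high value indicates strong dependence between X and Y.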
[Hu2004Improved] H.J. Hu, Y. Pan, R. Harrison, and P.C. Tai. Improved protein secondary structure prediction using support vector machine with a new encoding scheme and an advanced tertiary classifier. IEEE Trans. Nanobioscience, 3(4):265-271, 2004. [ bib | .pdf ]
Prediction of protein secondary structures is an important problem in bioinformatics and has many applications. The recent trend of secondary structure prediction studies is mostly based on the neural network or the support vector machine (SVM). The SVM method is a comparatively new learning system which has mostly been used in pattern recognition problems. In this study, SVM is used as a machine learning tool for the prediction of secondary structure and several encoding schemes, including orthogonal matrix, hydrophobicity matrix, BLOSUM62 substitution matrix, and combined matrix of these, are applied and optimized to improve the prediction accuracy. Also, the optimal window length for six SVM binary classifiers is established by testing different window sizes and our new encoding scheme is tested based on this optimal window size via sevenfold cross validation tests. The results show 2% [...] classifiers when compared with the instances in which the classical orthogonal matrix is used. Finally, to combine the results of the six SVM binary classifiers, a new tertiary classifier which combines the results of one-versus-one binary classifiers is introduced and the performance is compared with those of existing tertiary classifiers. According to the results, the Q3 prediction accuracy of new tertiary classifier reaches 78.8% [...] reported in the literature.

Keywords: biosvm
[Hu2004Developing] C. Hu, X. Li, and J. Liang. Developing optimal non-linear scoring function for protein design. Bioinformatics, 20(17):3080-3098, 2004. [ bib | DOI | http | www: ]
Motivation: Protein design aims to identify sequences compatible with a given protein fold but incompatible with any alternative folds. To select the correct sequences and to guide the search process, a design scoring function is critically important. Such a scoring function should be able to characterize the global fitness landscape of many proteins simultaneously. Results: To find optimal design scoring functions, we introduce two geometric views and propose a formulation using a mixture of non-linear Gaussian kernel functions. We aim to solve a simplified protein sequence design problem. Our goal is to distinguish each native sequence for a major portion of representative protein structures from a large number of alternative decoy sequences, each a fragment from proteins of different folds. Our scoring function discriminates perfectly a set of 440 native proteins from 14 million sequence decoys. We show that no linear scoring function can succeed in this task. In a blind test of unrelated proteins, our scoring function misclassifies only 13 native proteins out of 194. This compares favorably with about three to four times more misclassifications when optimal linear functions reported in the literature are used. We also discuss how to develop a protein folding scoring function. Availability: Available on request from the authors.

Keywords: biosvm
[Hsieh2004library] A. C. Hsieh, R. Bo, J. Manola, F. Vazquez, O. Bare, A. Khvorova, S. Scaringe, and W. R. Sellers. A library of siRNA duplexes targeting the phosphoinositide 3-kinase pathway: determinants of gene silencing for use in cell-based screens. Nucleic Acids Res., 32(3):893-901, 2004. [ bib | DOI | http | .pdf ]
Gene silencing through RNA interference (RNAi) has been established as a means of conducting reverse genetic studies. In order to better understand the determinants of short interfering RNA (siRNA) knockdown for use in high-throughput cell-based screens, 148 siRNA duplexes targeting 30 genes within the PI3K pathway were selected and synthesized. The extent of RNA knockdown was measured for 22 genes by quantitative real-time PCR. Analysis of the parameters correlating with effective knockdown showed that (i) duplexes targeting the middle of the coding sequence silenced significantly poorer, (ii) silencing by duplexes targeting the 3'UTR was comparable with duplexes targeting the coding sequence, (iii) pooling of four or five duplexes per gene was remarkably efficient in knocking down gene expression and (iv) among duplexes that achieved a >70% knockdown of the mRNA there were strong nucleotide preferences at specific positions, most notably positions 11 (G or C) and 19 (T) of the siRNA duplex. Finally, in a proof-of-principle pathway-wide cell-based genetic screen, conducted to detect negative genetic regulators of Akt S473 phosphorylation, both known negative regulators of this phosphorylation, PTEN and PDK1, were found. These data help to lay the foundation for genome-wide siRNA screens in mammalian cells.

Keywords: sirna
[Hoyer2004Non-negative] P. O. Hoyer. Non-negative matrix factorization with sparseness constraints. J. Mach. Learn. Res., 5:1457-1469, 2004. [ bib | .pdf ]
[Hou2004Remote] Y. Hou, W. Hsu, M. L. Lee, and C. Bystroff. Remote homolog detection using local sequence-structure correlations. Proteins, 57(3):518-530, 2004. [ bib | DOI | .pdf ]
Remote homology detection refers to the detection of structural homology in proteins when there is little or no sequence similarity. In this article, we present a remote homolog detection method called SVM-HMMSTR that overcomes the reliance on detectable sequence similarity by transforming the sequences into strings of hidden Markov states that represent local folding motif patterns. These state strings are transformed into fixed-dimension feature vectors for input to a support vector machine. Two sets of features are defined: an order-independent feature set that captures the amino acid and local structure composition; and an order-dependent feature set that captures the sequential ordering of the local structures. Tests using the Structural Classification of Proteins (SCOP) 1.53 data set show that the SVM-HMMSTR gives a significant improvement over several current methods.

Keywords: biosvm
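
The idea of turning a variable-length state string into a fixed-dimension, order-independent feature vector, as described in the abstract above, can be sketched as a simple composition count. This is only an illustration: the function name, the toy three-letter alphabet, and the normalization are assumptions, and the actual SVM-HMMSTR features are defined in the paper.

```python
def composition_features(state_string, alphabet):
    """Map a string of state symbols to relative frequencies over a fixed alphabet.

    The output length depends only on the alphabet, so strings of different
    lengths yield vectors of the same dimension, and symbol order is ignored.
    """
    n = len(state_string)
    if n == 0:
        return [0.0] * len(alphabet)
    return [state_string.count(symbol) / n for symbol in alphabet]

# Toy alphabet of three hypothetical state symbols.
print(composition_features("AABC", "ABC"))
print(composition_features("CABA", "ABC"))  # same composition, different order
```

An order-dependent feature set would need additional terms, for example counts of adjacent symbol pairs, since the composition vector is identical for any permutation of the same string.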
[Horvath2004Cyclic] T. Horváth, T. Gärtner, and S. Wrobel. Cyclic pattern kernels for predictive graph mining. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 158-167, New York, NY, USA, 2004. ACM Press. [ bib | DOI ]
Keywords: chemoinformatics kernel-theory
[Horn2004Dynamic] David Horn, Gideon Dror, and Brigitte Quenet. Dynamic proximity of spatio-temporal sequences. IEEE Trans Neural Netw, 15(5):1002-8, Sep 2004. [ bib ]
Recurrent networks can generate spatio-temporal neural sequences of very large cycles, having an apparent random behavior. Nonetheless a proximity measure between these sequences may be defined through comparison of the synaptic weight matrices that generate them. Following the dynamic neural filter (DNF) formalism we demonstrate this concept by comparing teacher and student recurrent networks of binary neurons. We show that large sequences, providing a training set well exceeding the Cover limit, allow for good determination of the synaptic matrices. Alternatively, assuming the matrices to be known, very fast determination of the biases can be achieved. Thus, a spatio-temporal sequence may be regarded as spatio-temporal encoding of the bias vector. We introduce a linear support vector machine (SVM) variant of the DNF in order to specify an optimal weight matrix. This approach allows us to deal with noise. Spatio-temporal sequences generated by different DNFs with the same number of neurons may be compared by calculating correlations of the synaptic matrices of the reconstructed DNFs. Other types of spatio-temporal sequences need the introduction of hidden neurons, and/or the use of a kernel variant of the SVM approach. The latter is being defined as a recurrent support vector network (RSVN).

[Hood2004Systems] L. Hood, J. R. Heath, M. E. Phelps, and B. Lin. Systems biology and new technologies enable predictive and preventative medicine. Science, 306(5696):640-643, Oct 2004. [ bib | DOI | http ]
Systems approaches to disease are grounded in the idea that disease-perturbed protein and gene regulatory networks differ from their normal counterparts; we have been pursuing the possibility that these differences may be reflected by multiparameter measurements of the blood. Such concepts are transforming current diagnostic and therapeutic approaches to medicine and, together with new technologies, will enable a predictive and preventive medicine that will lead to personalized medicine.

[Hockstein2004Diagnosis] Neil G Hockstein, Erica R Thaler, Drew Torigian, Wallace T Miller, Olivia Deffenderfer, and C. William Hanson. Diagnosis of pneumonia with an electronic nose: correlation of vapor signature with chest computed tomography scan findings. Laryngoscope, 114(10):1701-5, Oct 2004. [ bib ]
OBJECTIVES/HYPOTHESIS: The electronic nose is a sensor of volatile molecules that is useful in the analysis of expired gases. The device is well suited to testing the breath of patients receiving mechanical ventilation and is a potential diagnostic adjunct that can aid in the detection of patients with ventilator-associated pneumonia. STUDY DESIGN: A prospective study. METHODS: We performed a prospective study of patients receiving mechanical ventilation in a surgical intensive care unit who underwent chest computed tomography (CT) scanning. A single attending radiologist reviewed the chest CT scans, and imaging features were recorded on a standardized form. Within 48 hours of chest CT scan, five sets of exhaled gas were sampled from the expiratory limb of the ventilator circuit. The gases were assayed with a commercially available electronic nose. Both linear and nonlinear analyses were performed to identify correlations between imaging features and the assayed gas signatures. RESULTS: Twenty-five patients were identified, 13 of whom were diagnosed with pneumonia by CT scan. Support vector machine analysis was performed in two separate analyses. In the first analysis, in which a training set was identical to a prediction set, the accuracy of prediction results was greater than 91.6%. In the second analysis, in which the training set and the prediction set were different, the accuracy of prediction results was at least 80%, with higher accuracy depending on the specific parameters and models being used. CONCLUSION: The electronic nose is a new technology that continues to show promise as a potential diagnostic adjunct in the diagnosis of pneumonia and other infectious diseases.

[Hochreiter2004Gene] S. Hochreiter and K. Obermayer. Gene selection for microarray data. In B. Schölkopf, K. Tsuda, and J.P. Vert, editors, Kernel Methods in Computational Biology, pages 319-355. MIT Press, 2004. [ bib | www: ]
Keywords: biosvm
[Hizukuri2004Extraction] Yoshiyuki Hizukuri, Yoshihiro Yamanishi, Kosuke Hashimoto, and Minoru Kanehisa. Extraction of species-specific glycan substructures. Genome Inform Ser Workshop Genome Inform, 15(1):69-81, 2004. [ bib | .html | .pdf ]
Glycans, which are carbohydrate sugar chains attached to some lipids or proteins, have a huge variety of structures and play a key role in cell communication, protein interaction and immunity. The availability of a number of glycan structures stored in the KEGG/GLYCAN database makes it possible for us to conduct large-scale comparative research on glycans. In this paper, we present a novel approach to compare glycan structures and extract characteristic glycan substructures of certain organisms. In the algorithm we developed a new similarity measure of glycan structures taking into account several biological aspects of glycan synthesis and glycosyltransferases, and we confirmed the validity of our similarity measure by conducting experiments on its ability to classify glycans between organisms in the framework of a support vector machine. Finally, our method successfully extracted a set of candidates of substructures which are characteristic of human, rat, mouse, bovine, pig, chicken, yeast, wheat and sycamore, respectively. We confirmed that the characteristic substructures extracted by our method correspond to the substructures which are known as the species-specific sugar chain of gamma-glutamyltranspeptidases in the kidney.

Keywords: Amino Acid Sequence, Animals, Carbohydrate Conformation, Carbohydrate Sequence, Cattle, Computer Simulation, Databases, Genes, Histocompatibility Antigens Class I, Humans, Least-Squares Analysis, MHC Class I, Major Histocompatibility Complex, Mice, Monosaccharides, Non-U.S. Gov't, Peptides, Phylogeny, Plants, Polysaccharides, Protein, Rats, Research Support, Saccharomyces cerevisiae, Species Specificity, 15712111
[Hert2004Comparison] J. Hert, P. Willett, D. J. Wilton, P. Acklin, K. Azzaoui, E. Jacoby, and A. Schuffenhauer. Comparison of fingerprint-based methods for virtual screening using multiple bioactive reference structures. J Chem Inf Comput Sci, 44(3):1177-1185, 2004. [ bib | DOI | http ]
Fingerprint-based similarity searching is widely used for virtual screening when only a single bioactive reference structure is available. This paper reviews three distinct ways of carrying out such searches when multiple bioactive reference structures are available: merging the individual fingerprints into a single combined fingerprint; applying data fusion to the similarity rankings resulting from individual similarity searches; and approximations to substructural analysis. Extended searches on the MDL Drug Data Report database suggest that fusing similarity scores is the most effective general approach, with the best individual results coming from the binary kernel discrimination technique.

Keywords: chemoinformatics, PUlearning
[Helma2004Data] C. Helma, T. Cramer, S. Kramer, and L. De Raedt. Data mining and machine learning techniques for the identification of mutagenicity inducing substructures and structure activity relationships of noncongeneric compounds. J. Chem. Inf. Comput. Sci., 44(4):1402-11, 2004. [ bib | DOI | http | .pdf ]
This paper explores the utility of data mining and machine learning algorithms for the induction of mutagenicity structure-activity relationships (SARs) from noncongeneric data sets. We compare (i) a newly developed algorithm (MOLFEA) for the generation of descriptors (molecular fragments) for noncongeneric compounds with traditional SAR approaches (molecular properties) and (ii) different machine learning algorithms for the induction of SARs from these descriptors. In addition we investigate the optimal parameter settings for these programs and give an exemplary interpretation of the derived models. The predictive accuracies of models using MOLFEA derived descriptors are approximately 10-15 percentage points higher than those using molecular properties alone. Using both types of descriptors together does not improve the derived models. From the applied machine learning techniques the rule learner PART and support vector machines gave the best results, although the differences between the learning algorithms are only marginal. We were able to achieve predictive accuracies up to 78% for 10-fold cross-validation. The resulting models are relatively easy to interpret and usable for predictive as well as for explanatory purposes.

Keywords: biosvm chemoinformatics
[Heiner2004Biosystems] M. Heiner, I. Koch, and J. Will. Model validation of biological pathways using Petri nets - demonstrated for apoptosis. Biosystems, 75(1-3):15-28, 2004. [ bib ]
This paper demonstrates the first steps of a new integrating methodology to develop and analyse models of biological pathways in a systematic manner using well established Petri net technologies. The whole approach comprises step-wise modelling, animation, model validation as well as qualitative and quantitative analysis for behaviour prediction. In this paper, the first phase is addressed: how to develop and validate a qualitative model, which might be extended afterwards to a quantitative model. The example used in this paper is devoted to apoptosis, the genetically programmed cell death. Apoptosis is an essential part of normal physiology for most metazoan species. Disturbances in the apoptotic process could lead to several diseases. The signal transduction pathway of apoptosis includes highly complex mechanisms to control and execute programmed cell death. This paper explains how to model and validate this pathway using qualitative Petri nets. The results provide a mathematically unique and valid model enabling the confirmation of known properties as well as new insights in this pathway.

Keywords: csbcbook
[Hannon2004Unlocking] G. J. Hannon and J. J. Rossi. Unlocking the potential of the human genome with RNA interference. Nature, 431(7006):371-8, Sep 2004. [ bib | DOI | http | .pdf ]
The discovery of RNA interference (RNAi) may well be one of the transforming events in biology in the past decade. RNAi can result in gene silencing or even in the expulsion of sequences from the genome. Harnessed as an experimental tool, RNAi has revolutionized approaches to decoding gene function. It also has the potential to be exploited therapeutically, and clinical trials to test this possibility are already being planned.

Keywords: sirna
[Hanash2004Integrated] S. Hanash. Integrated global profiling of cancer. Nat. Rev. Cancer, 4(8):638-644, 2004. [ bib | DOI | http | .pdf ]
Tumours are complex biological systems. No single type of molecular approach fully elucidates tumour behaviour, necessitating analysis at multiple levels encompassing genomics and proteomics. Integrated data sets are required to fully determine the contributions of genome alterations, host factors and environmental exposures to tumour growth and progression, as well as the consequences of interactions between malignant or premalignant cells and their microenvironment. The sheer amount and heterogeneous nature of data that need to be collected and integrated are daunting, but effort has already begun to address these obstacles.

[Han2004Prediction] L.Y. Han, C.Z. Cai, S.L. Lo, M.C. Chung, and Y.Z. Chen. Prediction of RNA-binding proteins from primary sequence by a support vector machine approach. RNA, 10(3):355-368, 2004. [ bib | http | .pdf ]
Elucidation of the interaction of proteins with different molecules is of significance in the understanding of cellular processes. Computational methods have been developed for the prediction of protein-protein interactions. But insufficient attention has been paid to the prediction of protein-RNA interactions, which play central roles in regulating gene expression and certain RNA-mediated enzymatic processes. This work explored the use of a machine learning method, support vector machines (SVM), for the prediction of RNA-binding proteins directly from their primary sequence. Based on the knowledge of known RNA-binding and non-RNA-binding proteins, an SVM system was trained to recognize RNA-binding proteins. A total of 4011 RNA-binding and 9781 non-RNA-binding proteins was used to train and test the SVM classification system, and an independent set of 447 RNA-binding and 4881 non-RNA-binding proteins was used to evaluate the classification accuracy. Testing results using this independent evaluation set show a prediction accuracy of 94.1% [...] proteins, and 98.7% [...] and non-tRNA-binding proteins, respectively. The SVM classification system was further tested on a small class of snRNA-binding proteins with only 60 available sequences. The prediction accuracy is 40.0% and 99.9% [...] a need for a sufficient number of proteins to train SVM. The SVM classification systems trained in this work were added to our Web-based protein functional classification software SVMProt, at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi. Our study suggests the potential of SVM as a useful tool for facilitating the prediction of protein-RNA interactions.

Keywords: biosvm
[Han2004Predicting] L.Y. Han, C.Z. Cai, Z.L. Ji, Z.W. Cao, J. Cui, and Y.Z. Chen. Predicting functional family of novel enzymes irrespective of sequence similarity: a statistical learning approach. Nucl. Acids Res., 32(21):6437-6444, 2004. [ bib | DOI | http | .pdf ]
The function of a protein that has no sequence homolog of known function is difficult to assign on the basis of sequence similarity. The same problem may arise for homologous proteins of different functions if one is newly discovered and the other is the only known protein of similar sequence. It is desirable to explore methods that are not based on sequence similarity. One approach is to assign the functional family of a protein to provide a useful hint about its function. Several groups have employed a statistical learning method, support vector machines (SVMs), for predicting protein functional family directly from sequence irrespective of sequence similarity. These studies showed that SVM prediction accuracy is at a level useful for functional family assignment. But its capability for assignment of distantly related proteins and homologous proteins of different functions has not been critically and adequately assessed. Here SVM is tested for functional family assignment of two groups of enzymes. One consists of 50 enzymes that have no homolog of known function from PSI-BLAST search of protein databases. The other contains eight pairs of homologous enzymes of different families. SVM correctly assigns 72% [...] pairs in the second group, suggesting that it is potentially useful for facilitating functional study of novel proteins. A web version of our software, SVMProt, is accessible at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi.

Keywords: biosvm
[Han2004Evidence] J.-D. J. Han, N. Bertin, T. Hao, D. S. Goldberg, G. F. Berriz, L. V. Zhang, D. Dupuy, A. J. M. Walhout, M. E. Cusick, F. P. Roth, and M. Vidal. Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature, 430(6995):88-93, Jul 2004. [ bib | DOI | http | .pdf ]
In apparently scale-free protein-protein interaction networks, or 'interactome' networks, most proteins interact with few partners, whereas a small but significant proportion of proteins, the 'hubs', interact with many partners. Both biological and non-biological scale-free networks are particularly resistant to random node removal but are extremely sensitive to the targeted removal of hubs. A link between the potential scale-free topology of interactome networks and genetic robustness seems to exist, because knockouts of yeast genes encoding hubs are approximately threefold more likely to confer lethality than those of non-hubs. Here we investigate how hubs might contribute to robustness and other cellular properties for protein-protein interactions dynamically regulated both in time and in space. We uncovered two types of hub: 'party' hubs, which interact with most of their partners simultaneously, and 'date' hubs, which bind their different partners at different times or locations. Both in silico studies of network connectivity and genetic interactions described in vivo support a model of organized modularity in which date hubs organize the proteome, connecting biological processes (or modules) to each other, whereas party hubs function inside modules.

[Hall2004Unravelling] J. Hall. Unravelling the general properties of siRNAs: strength in numbers and lessons from the past. Nat. Rev. Genet., 5(7):552-7, Jul 2004. [ bib | DOI | http | .pdf ]
Keywords: sirna
[Haley2004Kinetic] B. Haley and P. D. Zamore. Kinetic analysis of the RNAi enzyme complex. Nat. Struct. Mol. Biol., 11(7):599-606, Jul 2004. [ bib | DOI | http | .pdf ]
The siRNA-directed ribonucleoprotein complex, RISC, catalyzes target RNA cleavage in the RNA interference pathway. Here, we show that siRNA-programmed RISC is a classical Michaelis-Menten enzyme in the presence of ATP. In the absence of ATP, the rate of multiple rounds of catalysis is limited by release of the cleaved products from the enzyme. Kinetic analysis suggests that different regions of the siRNA play distinct roles in the cycle of target recognition, cleavage, and product release. Bases near the siRNA 5' end disproportionately contribute to target RNA-binding energy, whereas base pairs formed by the central and 3' regions of the siRNA provide a helical geometry required for catalysis. Finally, the position of the scissile phosphate on the target RNA seems to be determined during RISC assembly, before the siRNA encounters its RNA target.

Keywords: sirna
[Hakenberg2004Finding] J. Hakenberg, S. Schmeier, A. Kowald, E. Klipp, and U. Leser. Finding kinetic parameters using text mining. OMICS, 8(2):131-152, 2004. [ bib | http | .pdf ]
The mathematical modeling and description of complex biological processes has become more and more important over the last years. Systems biology aims at the computational simulation of complex systems, up to whole cell simulations. An essential part focuses on solving a large number of parameterized differential equations. However, measuring those parameters is an expensive task, and finding them in the literature is very laborious. We developed a text mining system that supports researchers in their search for experimentally obtained parameters for kinetic models. Our system classifies full text documents regarding the question whether or not they contain appropriate data using a support vector machine. We evaluated our approach on a manually tagged corpus of 800 documents and found that it outperforms keyword searches in abstracts by a factor of five in terms of precision.

Keywords: biosvm
[Gartner2004Kernels] T. Gärtner, J.W. Lloyd, and P.A. Flach. Kernels and distances for structured data. Mach. Learn., 57(3):205-232, 2004. [ bib | DOI | http ]
This paper brings together two strands of machine learning of increasing importance: kernel methods and highly structured data. We propose a general method for constructing a kernel following the syntactic structure of the data, as defined by its type signature in a higher-order logic. Our main theoretical result is the positive definiteness of any kernel thus defined. We report encouraging experimental results on a range of real-world data sets. By converting our kernel to a distance pseudo-metric for 1-nearest neighbour, we were able to improve the best accuracy from the literature on the Diterpene data set by more than 10%.

Keywords: biosvm
[Guo2004novel] J. Guo, H. Chen, Z. Sun, and Y. Lin. A novel method for protein secondary structure prediction using dual-layer SVM and profiles. Proteins, 54(4):738-743, 2004. [ bib | DOI | http | .pdf ]
A high-performance method was developed for protein secondary structure prediction based on the dual-layer support vector machine (SVM) and position-specific scoring matrices (PSSMs). SVM is a new machine learning technology that has been successfully applied in solving problems in the field of bioinformatics. The SVM's performance is usually better than that of traditional machine learning approaches. The performance was further improved by combining PSSM profiles with the SVM analysis. The PSSMs were generated from PSI-BLAST profiles, which contain important evolution information. The final prediction results were generated from the second SVM layer output. On the CB513 data set, the three-state overall per-residue accuracy, Q3, reached 75.2% [...] to 80.0% [...] 74.0% [...] has been constructed and is available at http://www.bioinfo.tsinghua.edu.cn/pmsvm.

Keywords: biosvm
[Guermeur2004Combining] Y. Guermeur, G. Pollastri, A. Elisseeff, D. Zelus, H. Paugam-Moisy, and P. Baldi. Combining protein secondary structure prediction models with ensemble methods of optimal complexity. Neurocomputing, 56:305-327, 2004. [ bib | DOI | http | .pdf ]
Many sophisticated methods are currently available to perform protein secondary structure prediction. Since they are frequently based on different principles, and different knowledge sources, significant benefits can be expected from combining them. However, the choice of an appropriate combiner appears to be an issue in its own right. The first difficulty to overcome when combining prediction methods is overfitting. This is the reason why we investigate the implementation of Support Vector Machines to perform the task. A family of multi-class SVMs is introduced. Two of these machines are used to combine some of the current best protein secondary structure prediction methods. Their performance is consistently superior to the performance of the ensemble methods traditionally used in the field. They also outperform the decomposition approaches based on bi-class SVMs. Furthermore, initial experimental evidence suggests that their outputs could be processed by the biologist to perform higher-level treatments.

Keywords: biosvm
[Guermeur2004kernel] Y. Guermeur, A. Lifschitz, and R. Vert. A kernel for protein secondary structure prediction. In B. Schölkopf, K. Tsuda, and J.P. Vert, editors, Kernel Methods in Computational Biology, pages 193-206. MIT Press, 2004. [ bib ]
Keywords: biosvm
[Greenshtein2004Persistence] E. Greenshtein and Y. Ritov. Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli, 10(6):971-988, 2004. [ bib | DOI | http | .pdf ]
Let $Z^i = (Y^i, X_1^i, \ldots, X_m^i)$, $i = 1, \ldots, n$, be independent and identically distributed random vectors, $Z^i \sim F$, $F \in \mathcal{F}$. It is desired to predict $Y$ by $\sum_j \beta_j X_j$, where $(\beta_1, \ldots, \beta_m) \in B_n \subseteq \mathbb{R}^m$, under a prediction loss. Suppose that $m = n^{\alpha}$, $\alpha > 1$, that is, there are many more explanatory variables than observations. We consider sets $B_n$ restricted by the maximal number of non-zero coefficients of their members, or by their $l_1$ radius. We study the following asymptotic question: how 'large' may the set $B_n$ be, so that it is still possible to select empirically a predictor whose risk under $F$ is close to that of the best predictor in the set? Sharp bounds for orders of magnitudes are given under various assumptions on $\mathcal{F}$. Algorithmic complexity of the ensuing procedures is also studied. The main message of this paper and the implications of the orders derived are that under various sparsity assumptions on the optimal predictor there is 'asymptotically no harm' in introducing many more explanatory variables than observations. Furthermore, such practice can be beneficial in comparison with a procedure that screens in advance a small subset of explanatory variables. Another main result is that 'lasso' procedures, that is, optimization under $l_1$ constraints, could be efficient in finding optimal sparse predictors in high dimensions.

[Graumann2004Applicability] Johannes Graumann, Leslie A Dunipace, Jae Hong Seol, W Hayes McDonald, John R Yates, Barbara J Wold, and Raymond J Deshaies. Applicability of tandem affinity purification MudPIT to pathway proteomics in yeast. Mol Cell Proteomics, 3(3):226-37, Mar 2004. [ bib | DOI | http | .pdf ]
A combined multidimensional chromatography-mass spectrometry approach known as "MudPIT" enables rapid identification of proteins that interact with a tagged bait while bypassing some of the problems associated with analysis of polypeptides excised from SDS-polyacrylamide gels. However, the reproducibility, success rate, and applicability of MudPIT to the rapid characterization of dozens of proteins have not been reported. We show here that MudPIT reproducibly identified bona fide partners for budding yeast Gcn5p. Additionally, we successfully applied MudPIT to rapidly screen through a collection of tagged polypeptides to identify new protein interactions. Twenty-five proteins involved in transcription and progression through mitosis were modified with a new tandem affinity purification (TAP) tag. TAP-MudPIT analysis of 22 yeast strains that expressed these tagged proteins uncovered known or likely interacting partners for 21 of the baits, a figure that compares favorably with traditional approaches. The proteins identified here comprised 102 previously known and 279 potential physical interactions. Even for the intensively studied Swi2p/Snf2p, the catalytic subunit of the Swi/Snf chromatin remodeling complex, our analysis uncovered a new interacting protein, Rtt102p. Reciprocal tagging and TAP-MudPIT analysis of Rtt102p revealed subunits of both the Swi/Snf and RSC complexes, identifying Rtt102p as a common interactor with, and possible integral component of, these chromatin remodeling machines. Our experience indicates it is feasible for an investigator working with a single ion trap instrument in a conventional molecular/cellular biology laboratory to carry out proteomic characterization of a pathway, organelle, or process (i.e. "pathway proteomics") by systematic application of TAP-MudPIT.

Keywords: Affinity Labels, Comparative Study, Electrospray Ionization, Genetic, Mass, Mitosis, Non-P.H.S., Non-U.S. Gov't, P.H.S., Protein Interaction Mapping, Proteome, Proteomics, Research Support, Saccharomyces cerevisiae, Saccharomyces cerevisiae Proteins, Signal Transduction, Spectrometry, Transcription, U.S. Gov't, 14660704
[Gong2004Picking] D. Gong and J. E. Ferrell. Picking a winner: new mechanistic insights into the design of effective siRNAs. Trends Biotechnol., 22(9):451-4, Sep 2004. [ bib | DOI | http ]
Recent work has shown that the efficacy of a small interfering RNA (siRNA) for silencing gene expression is a function of how easy it is to unwind the siRNA from the 5'-antisense end. Based on these insights, one group has designed an algorithm that substantially improves the odds of picking an effective siRNA, and two groups have shown that 'forked' or 'frayed' siRNAs, which should be easier to unwind from the 5'-antisense end, are more effective than conventional siRNAs. These strategies represent important steps towards the rational design of effective siRNAs.

Keywords: sirna
[Glotsos2004Computer-based] Dimitris Glotsos, Panagiota Spyridonos, Panagiotis Petalas, Dionisis Cavouras, Panagiota Ravazoula, Petroula-Arampatoni Dadioti, Ioanna Lekka, and George Nikiforidis. Computer-based malignancy grading of astrocytomas employing a support vector machine classifier, the WHO grading system and the regular hematoxylin-eosin diagnostic staining procedure. Anal Quant Cytol Histol, 26(2):77-83, Apr 2004. [ bib ]
OBJECTIVE: To investigate and develop an automated technique for astrocytoma malignancy grading compatible with the clinical routine. STUDY DESIGN: One hundred forty biopsies of astrocytomas were collected from 2 hospitals. The degree of tumor malignancy was defined as low or high according to the World Health Organization grading system. From each biopsy, images were digitized and segmented to isolate nuclei from background tissue. Morphologic and textural nuclear features were quantified to encode tumor malignancy. Each case was represented by a 40-dimensional feature vector. An exhaustive search procedure in feature space was utilized to determine the best feature combination that resulted in the smallest classification error. Low and high grade tumors were discriminated using support vector machines (SVMs). To evaluate the system performance, all available data were split randomly into training and test sets. RESULTS: The best vector combination consisted of 3 textural and 2 morphologic features. Low and high grade cases were discriminated with an accuracy of 90.7% and 88.9%, respectively, using an SVM classifier with polynomial kernel of degree 2. CONCLUSION: The proposed methodology was based on standards that are common in daily clinical practice and might be used in parallel with conventional grading as a second-opinion tool to reduce subjectivity in the classification of astrocytomas.

Keywords: Amino Acids, Antibodies, Artificial Intelligence, Astrocytoma, Biological, Biopsy, Brain, Brain Mapping, Brain Neoplasms, Calibration, Comparative Study, Computational Biology, Computer-Assisted, Cysteine, Cystine, Electrodes, Electroencephalography, Eosine Yellowish-(YS), Evoked Potentials, Female, Hematoxylin, Horseradish Peroxidase, Humans, Image Processing, Imagery (Psychotherapy), Imagination, Laterality, Male, Monoclonal, Movement, Neoplasms, Non-P.H.S., Non-U.S. Gov't, P.H.S., Perception, Principal Component Analysis, Protein, Protein Array Analysis, Proteins, Research Support, Sensitivity and Specificity, Sequence Analysis, Software, Tumor Markers, U.S. Gov't, User-Computer Interface, World Health Organization, 15131894
[Glotsos2004Automated] Dimitris Glotsos, Panagiota Spyridonos, Dionisis Cavouras, Panagiota Ravazoula, Petroula-Arampantoni Dadioti, and George Nikiforidis. Automated segmentation of routinely hematoxylin-eosin-stained microscopic images by combining support vector machine clustering and active contour models. Anal Quant Cytol Histol, 26(6):331-40, Dec 2004. [ bib ]
OBJECTIVE: To develop a method for the automated segmentation of images of routinely hematoxylin-eosin (H-E)-stained microscopic sections to guarantee correct results in computer-assisted microscopy. STUDY DESIGN: Clinical material was composed of 50 H-E-stained biopsies of astrocytomas and 50 H-E-stained biopsies of urinary bladder cancer. The basic idea was to use a support vector machine clustering (SVMC) algorithm to provide gross segmentation of regions holding nuclei and subsequently to refine nuclear boundary detection with active contours. The initialization coordinates of the active contour model were defined using a SVMC pixel-based classification algorithm that discriminated nuclear regions from the surrounding tissue. Starting from the boundaries of these regions, the snake fired and propagated until converging to nuclear boundaries. RESULTS: The method was validated for 2 different types of H-E-stained images. Results were evaluated by 2 histopathologists. On average, 94% of nuclei were correctly delineated. CONCLUSION: The proposed algorithm could be of value in computer-based systems for automated interpretation of microscopic images.

Keywords: Adenosinetriphosphatase, Adolescent, Adult, Algorithms, Amino Acid Sequence, Amino Acids, Animals, Astrocytoma, Automated, Automation, Base Sequence, Bayes Theorem, Biological, Biopsy, Bladder Neoplasms, Breast Neoplasms, Carbohydrate Conformation, Carbohydrate Sequence, Cattle, Cell Cycle Proteins, Cell Nucleus, Computational Biology, Computer Simulation, Computer-Assisted, Crystallography, DNA, Databases, Diagnosis, Differential, Eosine Yellowish-(YS), Exoribonucleases, Factual, False Negative Reactions, False Positive Reactions, Female, Gene Expression, Gene Expression Profiling, Genes, Genetic, Genetic Techniques, Genetic Vectors, Genome, Hematoxylin, Histocompatibility Antigens Class I, Human, Humans, Image Interpretation, Image Processing, Introns, Least-Squares Analysis, MHC Class I, Major Histocompatibility Complex, Markov Chains, Messenger, Mice, Middle Aged, Models, Molecular Structure, Monosaccharides, Multigene Family, Mutation, Neoplasms, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Nonparametric, Nucleotidyltransferases, Observer Variation, Oligonucleotide Array Sequence Analysis, P.H.S., Pattern Recognition, Peptides, Phenotype, Phylogeny, Plants, Poly A, Polysaccharides, Predictive Value of Tests, Protein, Protein Biosynthesis, Protein Kinase Inhibitors, Protein Structure, Proteins, RNA, RNA Helicases, RNA Splicing, Rats, Reproducibility of Results, Research Support, Retrospective Studies, Saccharomyces cerevisiae, Saccharomyces cerevisiae Proteins, Secondary, Sensitivity and Specificity, Sequence Alignment, Software, Species Specificity, Staining and Labeling, Statistics, Theoretical, Transcription, U.S. Gov't, Ultrasonography, X-Ray, 15678615
[Glenisson2004TXTGate:] P. Glenisson, B. Coessens, S. Van Vooren, J. Mathys, Y. Moreau, and B. De Moor. TXTGate: profiling gene groups with text-based information. Genome Biol, 5(6):R43, 2004. [ bib | DOI | http ]
We implemented a framework called TXTGate that combines literature indices of selected public biological resources in a flexible text-mining system designed towards the analysis of groups of genes. By means of tailored vocabularies, term- as well as gene-centric views are offered on selected textual fields and MEDLINE abstracts used in LocusLink and the Saccharomyces Genome Database. Subclustering and links to external resources allow for in-depth analysis of the resulting term profiles.

Keywords: Animals; Cluster Analysis; Databases, Genetic; Disease Models, Animal; Gene Expression Profiling; Gene Expression Regulation, Fungal; Gene Expression Regulation, Neoplastic; Genes, Fungal; Genes, Neoplasm; Genome, Fungal; Genome, Human; Humans; Information Storage and Retrieval; MEDLINE; Mice; Saccharomyces; Salivary Gland Neoplasms; Vocabulary
[GatViks2004] I. Gat-Viks, A. Tanay, and R. Shamir. Modeling and analysis of heterogeneous regulation in biological networks. J Comput Biol, 11(6):1034-49, 2004. [ bib ]
In this study, we propose a novel model for the representation of biological networks and provide algorithms for learning model parameters from experimental data. Our approach is to build an initial model based on extant biological knowledge and refine it to increase the consistency between model predictions and experimental data. Our model encompasses networks which contain heterogeneous biological entities (mRNA, proteins, metabolites) and aims to capture diverse regulatory circuitry on several levels (metabolism, transcription, translation, post-translation and feedback loops, among them). Algorithmically, the study raises two basic questions: how to use the model for predictions and inference of hidden variables states, and how to extend and rectify model components. We show that these problems are hard in the biologically relevant case where the network contains cycles. We provide a prediction methodology in the presence of cycles and a polynomial time, constant factor approximation for learning the regulation of a single entity. A key feature of our approach is the ability to utilize both high-throughput experimental data, which measure many model entities in a single experiment, as well as specific experimental measurements of few entities or even a single one. In particular, we use together gene expression, growth phenotypes, and proteomics data. We tested our strategy on the lysine biosynthesis pathway in yeast. We constructed a model of more than 150 variables based on an extensive literature survey and evaluated it with diverse experimental data. We used our learning algorithms to propose novel regulatory hypotheses in several cases where the literature-based model was inconsistent with the experiments. We showed that our approach has better accuracy than extant methods of learning regulation.

[Garcia-Gomez2004Benign] Juan M García-Gómez, César Vidal, Luis Martí-Bonmatí, Joaquín Galant, Nicolas Sans, Montserrat Robles, and Francisco Casacuberta. Benign/malignant classifier of soft tissue tumors using MR imaging. MAGMA, 16(4):194-201, Mar 2004. [ bib | DOI | http | .pdf ]
This article presents a pattern-recognition approach to the soft tissue tumors (STT) benign/malignant character diagnosis using magnetic resonance (MR) imaging applied to a large multicenter database. OBJECTIVE: To develop and test an automatic classifier of STT into benign or malignant by using classical MR imaging findings and epidemiological information. MATERIALS AND METHODS: A database of 430 patients (62% benign and 38% malignant) from several European multicenter registers. There were 61 different histologies (36 with benign and 25 with malignant nature). Three pattern-recognition methods (artificial neural networks, support vector machine, k-nearest neighbor) were applied to learn the discrimination between benignity and malignancy based on a defined MR imaging findings protocol. After the systems had learned by using training samples (with 302 cases), the clinical decision support system was tested in the diagnosis of 128 new STT cases. RESULTS: An 88-92% efficacy was obtained in a not-viewed set of tumors using the pattern-recognition techniques. The best results were obtained with a back-propagation artificial neural network. CONCLUSION: Benign vs. malignant STT discrimination is accurate by using pattern-recognition methods based on classical MR image findings. This objective tool will assist radiologists in STT grading.

[Friedman2004Inferring] N. Friedman. Inferring cellular networks using probabilistic graphical models. Science, 303(5659):799, 2004. [ bib ]
[Fong2004Predicting] J. H. Fong, A. E. Keating, and M. Singh. Predicting specificity in bZIP coiled-coil protein interactions. Genome Biol., 5(R11), 2004. [ bib | http | .pdf ]
We present a method for predicting protein-protein interactions mediated by the coiled-coil motif. When tested on interactions between nearly all human and yeast bZIP proteins, our method identifies 70% of strong interactions while maintaining that 92% of predictions are correct. Furthermore, cross-validation testing shows that including the bZIP experimental data significantly improves performance. Our method can be used to predict bZIP interactions in other genomes and is a promising approach for predicting coiled-coil interactions more generally.

Keywords: biosvm
[Faugeras2004Variational] Olivier Faugeras, Geoffray Adde, Guillaume Charpiat, Christophe Chefd'hotel, Maureen Clerc, Thomas Deneux, Rachid Deriche, Gerardo Hermosillo, Renaud Keriven, Pierre Kornprobst, Jan Kybic, Christophe Lenglet, Lucero Lopez-Perez, Théo Papadopoulo, Jean-Philippe Pons, Florent Segonne, Bertrand Thirion, David Tschumperlé, Thierry Viéville, and Nicolas Wotawa. Variational, geometric, and statistical methods for modeling brain anatomy and function. Neuroimage, 23 Suppl 1:S46-55, 2004. [ bib | DOI | http | .pdf ]
We survey the recent activities of the Odyssée Laboratory in the area of the application of mathematics to the design of models for studying brain anatomy and function. We start with the problem of reconstructing sources in MEG and EEG, and discuss the variational approach we have developed for solving these inverse problems. This motivates the need for geometric models of the head. We present a method for automatically and accurately extracting surface meshes of several tissues of the head from anatomical magnetic resonance (MR) images. Anatomical connectivity can be extracted from diffusion tensor magnetic resonance images but, in the current state of the technology, it must be preceded by a robust estimation and regularization stage. We discuss our work based on variational principles and show how the results can be used to track fibers in the white matter (WM) as geodesics in some Riemannian space. We then go to the statistical modeling of functional magnetic resonance imaging (fMRI) signals from the viewpoint of their decomposition in a pseudo-deterministic and stochastic part that we then use to perform clustering of voxels in a way that is inspired by the theory of support vector machines and in a way that is grounded in information theory. Multimodal image matching is discussed next in the framework of image statistics and partial differential equations (PDEs) with an eye on registering fMRI to the anatomy. The paper ends with a discussion of a new theory of random shapes that may prove useful in building anatomical and functional atlases.

Keywords: Adolescent, Adult, Algorithms, Anatomic, Bacterial Proteins, Brain, Brain Mapping, Comparative Study, Computer Simulation, Computer-Assisted, Diffusion Magnetic Resonance Imaging, Facial Asymmetry, Facial Expression, Facial Paralysis, Female, Gene Expression Profiling, Gram-Negative Bacteria, Gram-Positive Bacteria, Humans, Image Interpretation, Magnetoencephalography, Male, Middle Aged, Models, Motion, Neural Pathways, Non-U.S. Gov't, Photography, Protein, Proteome, Research Support, Retina, Sequence Alignment, Sequence Analysis, Severity of Illness Index, Software, Statistical, Subcellular Fractions, 15501100
[Evgeniou2004Regularized] Theodoros Evgeniou and Massimiliano Pontil. Regularized multi-task learning. In KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 109-117, New York, NY, USA, 2004. ACM. [ bib | DOI ]
[Eroes2004Comparison] D. Erös, G. Kéri, I. Kövesdi, C. Szántai-Kis, G. Mészáros, and L. Orfi. Comparison of predictive ability of water solubility QSPR models generated by MLR, PLS and ANN methods. Mini Rev Med Chem, 4(2):167-177, Feb 2004. [ bib ]
ADME/Tox computational screening is one of the hottest topics in modern drug research. About one half of the potential drug candidates fail because of poor ADME/Tox properties. Since the experimental determination of water solubility is also time-consuming, reliable computational predictions are needed for the pre-selection of acceptable "drug-like" compounds from diverse combinatorial libraries. Recently many successful attempts were made at predicting the water solubility of compounds. A comprehensive review of previously developed water solubility calculation methods is presented here, followed by the description of the solubility prediction method designed and used in our laboratory. We have carefully selected 1381 compounds from scientific publications in a unified database and used this dataset in the calculations. The externally validated models were based on calculated descriptors only. The aim of model optimization was to improve repeated evaluations statistics of the predictions and effective descriptor scoring functions were used to facilitate quick generation of multiple linear regression analysis (MLR), partial least squares method (PLS) and artificial neural network (ANN) models with optimal predicting ability. Standard error of prediction of the best model generated with ANN (with 39-7-1 network structure) was 0.72 in logS units while the cross validated squared correlation coefficient (Q(2)) was better than 0.85. These values give a good chance for successful pre-selection of screening compounds from virtual libraries, based on the predicted water solubility.

Keywords: Chemical, Chemistry, Comparative Study, Cytochrome P-450 Enzyme System, Estradiol, Least-Squares Analysis, Ligands, Linear Models, Models, Molecular, Naphthalenes, Neural Networks (Computer), Non-U.S. Gov't, Physical, Quantitative Structure-Activity Relationship, Reproducibility of Results, Research Support, Solubility, Spectrum Analysis, Statistical, Water, 14965289
[El-Naqa2004similarity] I. El-Naqa, Y. Yang, N. P. Galatsanos, R. M. Nishikawa, and M. N. Wernick. A similarity learning approach to content-based image retrieval: application to digital mammography. IEEE Trans Med Imaging, 23(10):1233-44, Oct 2004. [ bib ]
In this paper, we describe an approach to content-based retrieval of medical images from a database, and provide a preliminary demonstration of our approach as applied to retrieval of digital mammograms. Content-based image retrieval (CBIR) refers to the retrieval of images from a database using information derived from the images themselves, rather than solely from accompanying text indices. In the medical-imaging context, the ultimate aim of CBIR is to provide radiologists with a diagnostic aid in the form of a display of relevant past cases, along with proven pathology and other suitable information. CBIR may also be useful as a training tool for medical students and residents. The goal of information retrieval is to recall from a database information that is relevant to the user's query. The most challenging aspect of CBIR is the definition of relevance (similarity), which is used to guide the retrieval machine. In this paper, we pursue a new approach, in which similarity is learned from training examples provided by human observers. Specifically, we explore the use of neural networks and support vector machines to predict the user's notion of similarity. Within this framework we propose using a hierarchal learning approach, which consists of a cascade of a binary classifier and a regression module to optimize retrieval effectiveness and efficiency. We also explore how to incorporate online human interaction to achieve relevance feedback in this learning framework. Our experiments are based on a database consisting of 76 mammograms, all of which contain clustered microcalcifications (MCs). Our goal is to retrieve mammogram images containing similar MC clusters to that in a query. The performance of the retrieval system is evaluated using precision-recall curves computed using a cross-validation procedure. 
Our experimental results demonstrate that: 1) the learning framework can accurately predict the perceptual similarity reported by human observers, thereby serving as a basis for CBIR; 2) the learning-based framework can significantly outperform a simple distance-based similarity metric; 3) the use of the hierarchical two-stage network can improve retrieval performance; and 4) relevance feedback can be effectively incorporated into this learning framework to achieve improvement in retrieval precision based on online interaction with users; and 5) the retrieved images by the network can have predicting value for the disease condition of the query.

[Eissing2004Bistability] T. Eissing, H. Conzelmann, E. D. Gilles, F. Allgower, E. Bullinger, and P. Scheurich. Bistability analyses of a caspase activation model for receptor-induced apoptosis. J. Biol. Chem., 279(35):36892-36897, 2004. [ bib | DOI | arXiv | http | .pdf ]
Apoptosis is an important physiological process crucially involved in development and homeostasis of multicellular organisms. Although the major signaling pathways have been unraveled, a detailed mechanistic understanding of the complex underlying network remains elusive. We have translated here the current knowledge of the molecular mechanisms of the death-receptor-activated caspase cascade into a mathematical model. A reduction down to the apoptotic core machinery enables the application of analytical mathematical methods to evaluate the system behavior within a wide range of parameters. Using parameter values from the literature, the model reveals an unstable status of survival indicating the need for further control. Based on recent publications we tested one additional regulatory mechanism at the level of initiator caspase activation and demonstrated that the resulting system displays desired characteristics such as bistability. In addition, the results from our model studies allowed us to reconcile the fast kinetics of caspase 3 activation observed at the single cell level with the much slower kinetics found at the level of a cell population.

Keywords: csbcbook
[Efron2004Least] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Ann. Stat., 32(2):407-499, 2004. [ bib | .pdf ]
[Doytchinova2004Identifying] Irini A Doytchinova, Pingping Guan, and Darren R Flower. Identifying human MHC supertypes using bioinformatic methods. J. Immunol., 172(7):4314-4323, Apr 2004. [ bib ]
Classification of MHC molecules into supertypes in terms of peptide-binding specificities is an important issue, with direct implications for the development of epitope-based vaccines with wide population coverage. In view of extremely high MHC polymorphism (948 class I and 633 class II HLA alleles) the experimental solution of this task is presently impossible. In this study, we describe a bioinformatics strategy for classifying MHC molecules into supertypes using information drawn solely from three-dimensional protein structure. Two chemometric techniques (hierarchical clustering and principal component analysis) were used independently on a set of 783 HLA class I molecules to identify supertypes based on structural similarities and molecular interaction fields calculated for the peptide binding site. Eight supertypes were defined: A2, A3, A24, B7, B27, B44, C1, and C4. The two techniques gave 77% consensus, i.e., 605 HLA class I alleles were classified in the same supertype by both methods. The proposed strategy allowed "supertype fingerprints" to be identified. Thus, the A2 supertype fingerprint is Tyr(9)/Phe(9), Arg(97), and His(114) or Tyr(116); the A3: Tyr(9)/Phe(9)/Ser(9), Ile(97)/Met(97) and Glu(114) or Asp(116); the A24: Ser(9) and Met(97); the B7: Asn(63) and Leu(81); the B27: Glu(63) and Leu(81); the B44: Ala(81); the C1: Ser(77); and the C4: Asn(77).

Keywords: Alleles; Amino Acid Motifs; Binding Sites; Computational Biology; DNA Fingerprinting; HLA Antigens; HLA-A Antigens; HLA-B Antigens; HLA-C Antigens; Histocompatibility Antigens Class I; Histocompatibility Testing; Humans; Multigene Family; Protein Interaction Mapping
[Ding2004Sfold] Y. Ding, C. Chan Yu, and C. E. Lawrence. Sfold web server for statistical folding and rational design of nucleic acids. Nucleic Acids Res., 32(Web Server issue):W135-W141, Jul 2004. [ bib | DOI | http ]
The Sfold web server provides user-friendly access to Sfold, a recently developed nucleic acid folding software package, via the World Wide Web (WWW). The software is based on a new statistical sampling paradigm for the prediction of RNA secondary structure. One of the main objectives of this software is to offer computational tools for the rational design of RNA-targeting nucleic acids, which include small interfering RNAs (siRNAs), antisense oligonucleotides and trans-cleaving ribozymes for gene knock-down studies. The methodology for siRNA design is based on a combination of RNA target accessibility prediction, siRNA duplex thermodynamic properties and empirical design rules. Our approach to target accessibility evaluation is an original extension of the underlying RNA folding algorithm to account for the likely existence of a population of structures for the target mRNA. In addition to the application modules Sirna, Soligo and Sribo for siRNAs, antisense oligos and ribozymes, respectively, the module Srna offers comprehensive features for statistical representation of sampled structures. Detailed output in both graphical and text formats is available for all modules. The Sfold server is available at http://sfold.wadsworth.org and http://www.bioinfo.rpi.edu/applications/sfold.

Keywords: sirna
[Devos2004Classification] A. Devos, L. Lukas, J. A. K. Suykens, L. Vanhamme, A. R. Tate, F. A. Howe, C. Majós, A. Moreno-Torres, M. van der Graaf, C. Arús, and S. Van Huffel. Classification of brain tumours using short echo time 1H MR spectra. J Magn Reson, 170(1):164-75, Sep 2004. [ bib | DOI | http | .pdf ]
The purpose was to objectively compare the application of several techniques and the use of several input features for brain tumour classification using Magnetic Resonance Spectroscopy (MRS). Short echo time 1H MRS signals from patients with glioblastomas (n = 87), meningiomas (n = 57), metastases (n = 39), and astrocytomas grade II (n = 22) were provided by six centres in the European Union funded INTERPRET project. Linear discriminant analysis, least squares support vector machines (LS-SVM) with a linear kernel and LS-SVM with radial basis function kernel were applied and evaluated over 100 stratified random splittings of the dataset into training and test sets. The area under the receiver operating characteristic curve (AUC) was used to measure the performance of binary classifiers, while the percentage of correct classifications was used to evaluate the multiclass classifiers. The influence of several factors on the classification performance has been tested: L2- vs. water normalization, magnitude vs. real spectra and baseline correction. The effect of input feature reduction was also investigated by using only the selected frequency regions containing the most discriminatory information, and peak integrated values. Using L2-normalized complete spectra the automated binary classifiers reached a mean test AUC of more than 0.95, except for glioblastomas vs. metastases. Similar results were obtained for all classification techniques and input features except for water normalized spectra, where classification performance was lower. This indicates that data acquisition and processing can be simplified for classification purposes, excluding the need for separate water signal acquisition, baseline correction or phasing.
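
Note (illustration only, not taken from the paper): the abstract above scores binary classifiers by the area under the ROC curve (AUC). A minimal sketch of one common way to compute AUC, as the fraction of (positive, negative) pairs that the classifier ranks correctly, is given below. The function name `auc` and the example scores are hypothetical.

```python
def auc(scores_pos, scores_neg):
    """Rank-based AUC: probability that a random positive case
    receives a higher score than a random negative case.
    Ties count as half a correct ranking."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


if __name__ == "__main__":
    # Hypothetical classifier outputs for two tumour classes.
    positives = [0.9, 0.8, 0.4]
    negatives = [0.5, 0.3, 0.2]
    print(f"AUC = {auc(positives, negatives):.3f}")
```

The pairwise form is quadratic in the number of cases, which is acceptable for datasets of the size quoted in the abstract; a sort-based version would be preferable for much larger sets.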

[Daubechies2004iterative] Ingrid Daubechies, Michel Defrise, and Christine De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413-1457, 2004. [ bib ]
[Darbellay2004Solid] Georges A Darbellay, Rebecca Duff, Jean-Marc Vesin, Paul-André Despland, Dirk W Droste, Carlos Molina, Joachim Serena, Roman Sztajzel, Patrick Ruchat, Theodoros Karapanayiotides, Afksendyios Kalangos, Julien Bogousslavsky, Erich B Ringelstein, and Gérald Devuyst. Solid or gaseous circulating brain emboli: are they separable by transcranial ultrasound? J Cereb Blood Flow Metab, 24(8):860-8, Aug 2004. [ bib ]
High-intensity transient signals (HITS) detected by transcranial Doppler (TCD) ultrasound may correspond to artifacts or to microembolic signals, the latter being either solid or gaseous emboli. The goal of this study was to assess what can be achieved with an automatic signal processing system for artifact/microembolic signals and solid/gas differentiation in different clinical situations. The authors studied 3,428 HITS in vivo in a multicenter study, i.e., 1,608 artifacts in healthy subjects, 649 solid emboli in stroke patients with a carotid stenosis, and 1,171 gaseous emboli in stroke patients with patent foramen ovale. They worked with the dual-gate TCD combined to three types of statistical classifiers: binary decision trees (BDT), artificial neural networks (ANN), and support vector machines (SVM). The sensitivity and specificity to separate artifacts from microembolic signals by BDT reached was 94% and 97%, respectively. For the discrimination between solid and gaseous emboli, the classifier achieved a sensitivity and specificity of 81% and 81% for BDT, 84% and 84% for ANN, and 86% and 86% for SVM, respectively. The current results for artifact elimination and solid/gas differentiation are already useful to extract data for future prospective clinical studies.

Keywords: Air, Algorithms, Amino Acids, Animals, Artifacts, Atrial, Carotid Stenosis, Cerebrovascular Accident, Cerebrovascular Circulation, Comparative Study, Cysteine, Decision Trees, Disulfides, Doppler, Embolism, Heart Septal Defects, Humans, Intracranial Embolism, Models, Molecular, Neural Networks (Computer), Non-U.S. Gov't, Oxidation-Reduction, Protein Binding, Protein Folding, Proteins, Research Support, Sensitivity and Specificity, Transcranial, Ultrasonography, 15362716
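
Note (illustration only, not taken from the paper): the Darbellay et al. abstract above reports classifier quality as sensitivity and specificity. The sketch below shows how the two quantities are obtained from confusion-matrix counts and checks that the three HITS subgroups quoted in the abstract add up to the stated total. The function names and the example counts passed to them are hypothetical.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of actual positives that are detected."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of actual negatives that are correctly rejected."""
    return true_neg / (true_neg + false_pos)


if __name__ == "__main__":
    # Subgroup sizes quoted in the abstract: artifacts, solid, gaseous.
    artifacts, solid, gaseous = 1608, 649, 1171
    print("total HITS:", artifacts + solid + gaseous)

    # Hypothetical confusion counts, chosen only to show the arithmetic.
    print("sensitivity:", sensitivity(true_pos=86, false_neg=14))
    print("specificity:", specificity(true_neg=86, false_pos=14))
```
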
[Cuturi2004mutual] M. Cuturi and J.-P. Vert. A mutual information kernel for strings. In Proceedings of IJCNN 2004, pages 1904-1910, 2004. [ bib | www: ]
Keywords: biosvm
[Cui2004Esub8] Q. Cui, T. Jiang, B. Liu, and S. Ma. Esub8: A novel tool to predict protein subcellular localizations in eukaryotic organisms. BMC Bioinformatics, 5(66):66, 2004. [ bib | DOI | http | .pdf ]
BACKGROUND: Subcellular localization of a new protein sequence is very important and fruitful for understanding its function. As the number of new genomes has dramatically increased over recent years, a reliable and efficient system to predict protein subcellular location is urgently needed. RESULTS: Esub8 was developed to predict protein subcellular localizations for eukaryotic proteins based on amino acid composition. In this research, the proteins are classified into the following eight groups: chloroplast, cytoplasm, extracellular, Golgi apparatus, lysosome, mitochondria, nucleus and peroxisome. We know subcellular localization is a typical classification problem; consequently, a one-against-one (1-v-1) multi-class support vector machine was introduced to construct the classifier. Unlike previous methods, ours considers the order information of protein sequences by a different method. Our method is tested in three subcellular localization predictions for prokaryotic proteins and four subcellular localization predictions for eukaryotic proteins on Reinhardt's dataset. The results are then compared to several other methods. The total prediction accuracies of two tests are both 100% [...] self-consistency test, and are 92.9% [...] test, respectively. Esub8 also provides excellent results: the total prediction accuracies are 100% [...] 87% [...] a different approach for predicting protein subcellular localization and achieved a satisfactory result; furthermore, we believe Esub8 will be a useful tool for predicting protein subcellular localizations in eukaryotic organisms.

Keywords: biosvm
[Conte2004Thirty] D. Conte, P. Foggia, C. Sansone, and M. Vento. Thirty years of graph matching in pattern recognition. Int. J. Pattern. Recogn. Artif. Intell., 18(3):265-298, 2004. [ bib | DOI | http ]
A recent paper posed the question: "Graph Matching: What are we really talking about?". Far from providing a definite answer to that question, in this paper we will try to characterize the role that graphs play within the Pattern Recognition field. To this aim two taxonomies are presented and discussed. The first includes almost all the graph matching algorithms proposed from the late seventies, and describes the different classes of algorithms. The second taxonomy considers the types of common applications of graph-based techniques in the Pattern Recognition and Machine Vision field.

[Collier2004Comparison] Nigel Collier and Koichi Takeuchi. Comparison of character-level and part of speech features for name recognition in biomedical texts. J Biomed Inform, 37(6):423-35, Dec 2004. [ bib | DOI | http | .pdf ]
The immense volume of data which is now available from experiments in molecular biology has led to an explosion in reported results most of which are available only in unstructured text format. For this reason there has been great interest in the task of text mining to aid in fact extraction, document screening, citation analysis, and linkage with large gene and gene-product databases. In particular there has been an intensive investigation into the named entity (NE) task as a core technology in all of these tasks which has been driven by the availability of high volume training sets such as the GENIA v3.02 corpus. Despite such large training sets accuracy for biology NE has proven to be consistently far below the high levels of performance in the news domain where F scores above 90 are commonly reported which can be considered near to human performance. We argue that it is crucial that more rigorous analysis of the factors that contribute to the model's performance be applied to discover where the underlying limitations are and what our future research direction should be. Our investigation in this paper reports on variations of two widely used feature types, part of speech (POS) tags and character-level orthographic features, and makes a comparison of how these variations influence performance. We base our experiments on a proven state-of-the-art model, support vector machines using a high quality subset of 100 annotated MEDLINE abstracts. Experiments reveal that the best performing features are orthographic features with F score of 72.6. Although the Brill tagger trained in-domain on the GENIA v3.02p POS corpus gives the best overall performance of any POS tagger, at an F score of 68.6, this is still significantly below the orthographic features. In combination these two features types appear to interfere with each other and degrade performance slightly to an F score of 72.3.

Keywords: biosvm nlp
[Coleman2004Noninvasive] D. Jackson Coleman, Ronald H Silverman, Mark J Rondeau, H. Culver Boldt, Harriet O Lloyd, Frederic L Lizzi, Thomas A Weingeist, Xue Chen, Sumalee Vangveeravong, and Robert Folberg. Noninvasive in vivo detection of prognostic indicators for high-risk uveal melanoma: ultrasound parameter imaging. Ophthalmology, 111(3):558-64, Mar 2004. [ bib | DOI | http | .pdf ]
PURPOSE: Primary malignant melanoma of the choroid and ciliary body has traditionally been treated without histologic staging, using purely clinical indicators. The presence of extravascular matrix patterns (EMP) in histologic sections of uveal melanoma has been shown to be an independent indicator of metastatic risk. These patterns are of a dimension and physical composition that are likely to be detected with ultrasound backscatter analysis. Our aim was to determine whether ultrasound parameter imaging could detect the presence of EMP at a diagnostically significant level for treatment staging and for planning investigational studies of therapeutic modalities. DESIGN: Prospective, masked ultrasound-pathologic correlative study. PARTICIPANTS: One hundred seventeen patients diagnosed with previously untreated choroidal melanoma were scanned within 2 weeks before enucleation. METHODS: Tumors were evaluated histologically and divided into high-risk and low-risk groups on the basis of the presence of 2% or more histologic cross-sectional area composed of EMP patterns. Digital ultrasound data were processed to generate parameter images representing the size and concentration of ultrasound scatterers. Histologic and ultrasound images and data were correlated, and linear and nonlinear statistical methods were used to create multivariate models for noninvasive differentiation of high-risk and low-risk tumors. MAIN OUTCOME MEASURES: Presence or absence of high-risk EMP and associated ultrasound parameter classification models. RESULTS: Of the 117 tumors, 69 were classified as low risk, and 48 were classified as high-risk with histologic analysis. A classification that used ultrasound parameter image features with linear discriminant analysis could correctly identify 79.5% of cases retrospectively and 75.2% of cases by use of cross-validation, an estimate of prospective classification ability. 
By use of a more powerful classification technique (support vector machine), 93.1% of cases were correctly classified retrospectively. With a cross-validation procedure, 80.10% of cases were correctly classified. CONCLUSIONS: Ultrasound can be used noninvasively to classify tumors into high-risk and low-risk groups by detecting the presence of EMP patterns. By the use of previous studies that compared the histologic presence of EMP patterns with patient survival, estimates of hazard rates associated with ultrasound risk groups can be made. The noninvasive ultrasound classification is potentially useful as a prognostic variable and as a tool for stratification of patient populations for tumor treatment evaluation.

Keywords: Algorithms, Ambergris, Combinatorial Chemistry Techniques, Eye Enucleation, Humans, Melanoma, Models, Molecular, Molecular Conformation, Non-U.S. Gov't, Odors, P.H.S., Perfume, Predictive Value of Tests, Prognosis, Prospective Studies, Quantitative Structure-Activity Relationship, Research Support, U.S. Gov't, Uveal Neoplasms, 15019336
[Cohen2004application] Gilles Cohen, Mélanie Hilario, Hugo Sax, Stéphane Hugonnet, Christian Pellegrini, and Antoine Geissbuhler. An application of one-class support vector machine to nosocomial infection detection. Medinfo, 11(Pt 1):716-20, 2004. [ bib ]
Nosocomial infections (NIs), those acquired in health care settings, are among the major causes of increased mortality among hospitalized patients. They are a significant burden for patients and health authorities alike; it is thus important to monitor and detect them through an effective surveillance system. This paper describes a retrospective analysis of a prevalence survey of NIs done in the Geneva University Hospital. Our goal is to identify patients with one or more NIs on the basis of clinical and other data collected during the survey. In this two-class classification task, the main difficulty lies in the significant imbalance between positive or infected (11%) and negative (89%) cases. To cope with class imbalance, we investigate one-class SVMs which can be trained to distinguish two classes on the basis of examples from a single class (in this case, only "normal" or non-infected patients). The infected ones are then identified as "abnormal" cases or outliers that deviate significantly from the normal profile. Experimental results are encouraging: whereas standard 2-class SVMs scored a baseline sensitivity of 50.6% on this problem, the one-class approach increased sensitivity to as much as 92.6%. These results are comparable to those obtained by the authors in a previous study on asymmetrical soft margin SVMs; they suggest that one-class SVMs can provide an effective and efficient way of overcoming data imbalance in classification problems.

Keywords: Aged, Air, Algorithms, Amino Acids, Animals, Area Under Curve, Artifacts, Artificial Intelligence, Atrial, Automated, Canada, Carotid Stenosis, Cerebrovascular Accident, Cerebrovascular Circulation, Comparative Study, Computer-Assisted, Cross Infection, Cysteine, Data Collection, Decision Trees, Dementia, Diagnosis, Disulfides, Doppler, Embolism, Expert Systems, Extramural, Factor Analysis, Female, Gene Expression, Gene Expression Profiling, Health Status, Heart Septal Defects, Hospitals, Humans, Infection Control, Intracranial Embolism, Male, Models, Molecular, Myocardial Infarction, N.I.H., Neoplasms, Neural Networks (Computer), Non-U.S. Gov't, Oligonucleotide Array Sequence Analysis, Oxidation-Reduction, P.H.S., Pattern Recognition, Population Surveillance, Prevalence, Prognosis, Protein Binding, Protein Folding, Proteins, ROC Curve, Research Support, Retrospective Studies, Sensitivity and Specificity, Software, Statistical, Switzerland, Transcranial, Treatment Outcome, U.S. Gov't, Ultrasonography, University, 15360906
[Cianfrocca2004Prognostic] M. Cianfrocca and L. J. Goldstein. Prognostic and predictive factors in early-stage breast cancer. Oncologist, 9(6):606-616, 2004. [ bib | DOI | http | .pdf ]
Breast cancer is the most common malignancy among American women. Due to increased screening, the majority of patients present with early-stage breast cancer. The Oxford Overview Analysis demonstrates that adjuvant hormonal therapy and polychemotherapy reduce the risk of recurrence and death from breast cancer. Adjuvant systemic therapy, however, has associated risks and it would be useful to be able to optimally select patients most likely to benefit. The purpose of adjuvant systemic therapy is to eradicate distant micrometastatic deposits. It is essential therefore to be able to estimate an individual patient's risk of harboring clinically silent micrometastatic disease using established prognostic factors. It is also beneficial to be able to select the optimal adjuvant therapy for an individual patient based on established predictive factors. It is standard practice to administer systemic therapy to all patients with lymph node-positive disease. However, there are clearly differences among node-positive women that may warrant a more aggressive therapeutic approach. Furthermore, there are many node-negative women who would also benefit from adjuvant systemic therapy. Prognostic factors therefore must be differentiated from predictive factors. A prognostic factor is any measurement available at the time of surgery that correlates with disease-free or overall survival in the absence of systemic adjuvant therapy and, as a result, is able to correlate with the natural history of the disease. In contrast, a predictive factor is any measurement associated with response to a given therapy. Some factors, such as hormone receptors and HER2/neu overexpression, are both prognostic and predictive.

Keywords: csbcbook, csbcbook-ch3
[Cherkassky2004Practical] Vladimir Cherkassky and Yunqian Ma. Practical selection of SVM parameters and noise estimation for SVM regression. Neural Netw, 17(1):113-26, Jan 2004. [ bib | DOI | http | .pdf ]
We investigate practical selection of hyper-parameters for support vector machines (SVM) regression (that is, epsilon-insensitive zone and regularization parameter C). The proposed methodology advocates analytic parameter selection directly from the training data, rather than re-sampling approaches commonly used in SVM applications. In particular, we describe a new analytical prescription for setting the value of insensitive zone epsilon, as a function of training sample size. Good generalization performance of the proposed parameter selection is demonstrated empirically using several low- and high-dimensional regression problems. Further, we point out the importance of Vapnik's epsilon-insensitive loss for regression problems with finite samples. To this end, we compare generalization performance of SVM regression (using proposed selection of epsilon-values) with regression using 'least-modulus' loss (epsilon=0) and standard squared loss. These comparisons indicate superior generalization performance of SVM regression under sparse sample settings, for various types of additive noise.

[Chen2004Prediction] Y.C. Chen, Y.S. Lin, C.J. Lin, and J.K. Hwang. Prediction of the bonding states of cysteines using the support vector machines based on multiple feature vectors and cysteine state sequences. Proteins, 55(4):1036-1042, 2004. [ bib | DOI | .pdf | .pdf ]
The support vector machine (SVM) method is used to predict the bonding states of cysteines. Besides using local descriptors such as the local sequences, we include global information, such as amino acid compositions and the patterns of the states of cysteines (bonded or nonbonded), or cysteine state sequences, of the proteins. We found that SVM based on local sequences or global amino acid compositions yielded similar prediction accuracies for the data set comprising 4136 cysteine-containing segments extracted from 969 nonhomologous proteins. However, the SVM method based on multiple feature vectors (combining local sequences and global amino acid compositions) significantly improves the prediction accuracy, from 80% [...] cysteine state sequences, SVM based on multiple feature vectors yields 90% [...] coefficient, around 10% [...] obtained by SVM based on local sequence information.

Keywords: biosvm
[Chen2004Sparse] Sheng Chen, Xia Hong, and Chris J Harris. Sparse kernel density construction using orthogonal forward regression with leave-one-out test score and local regularization. IEEE Trans Syst Man Cybern B Cybern, 34(4):1708-17, Aug 2004. [ bib | DOI | http | .pdf ]
This paper presents an efficient construction algorithm for obtaining sparse kernel density estimates based on a regression approach that directly optimizes model generalization capability. Computational efficiency of the density construction is ensured using an orthogonal forward regression, and the algorithm incrementally minimizes the leave-one-out test score. A local regularization method is incorporated naturally into the density construction process to further enforce sparsity. An additional advantage of the proposed algorithm is that it is fully automatic and the user is not required to specify any criterion to terminate the density construction procedure. This is in contrast to an existing state-of-art kernel density estimation method using the support vector machine (SVM), where the user is required to specify some critical algorithm parameter. Several examples are included to demonstrate the ability of the proposed algorithm to effectively construct a very sparse kernel density estimate with comparable accuracy to that of the full sample optimized Parzen window density estimate. Our experimental results also demonstrate that the proposed algorithm compares favorably with the SVM method, in terms of both test accuracy and sparsity, for constructing kernel density estimates.

[Chen2004Level] Q. Chen, Z.M. Zhou, Y.G. Qu, P.A. Heng, and D.S. Xia. Level set based auto segmentation of the tagged left ventricle MR images. Stud Health Technol Inform, 98:63-5, 2004. [ bib ]
To facilitate automatic segmentation, we adopt SVM (Support Vector Machine) to localize the left ventricle, and the segmentation is then carried out with narrow band level set. The method of generating the narrow band is improved such that the time used is reduced. Based on the imaging characteristics of the tagged left ventricle MR images, BPV (block-pixel variation) and intensity comparability are introduced to improve the speed term of level set and to increase the precision of segmentation. Our method can perform the segmentation of the tagged left ventricle MR images accurately and automatically.

[Chen2004Reducing] Jiun-Hung Chen and Chu-Song Chen. Reducing SVM classification time using multiple mirror classifiers. IEEE Trans Syst Man Cybern B Cybern, 34(2):1173-83, Apr 2004. [ bib | DOI | http | .pdf ]
We propose an approach that uses mirror point pairs and a multiple classifier system to reduce the classification time of a support vector machine (SVM). Decisions made with multiple simple classifiers formed from mirror pairs are integrated to approximate the classification rule of a single SVM. A coarse-to-fine approach is developed for selecting a given number of member classifiers. A clustering method, derived from the similarities between classifiers, is used for a coarse selection. A greedy strategy is then used for fine selection of member classifiers. Selected member classifiers are further refined by finding a weighted combination with a perceptron. Experiment results show that our approach can successfully speed up SVM decisions while maintaining comparable classification accuracy.

[Chalk2004Improved] A.M. Chalk, C. Wahlestedt, and E.L.L. Sonnhammer. Improved and automated prediction of effective siRNA. Biochem. Biophys. Res. Commun., 319(1):264-74, Jun 2004. [ bib | DOI | http | .pdf ]
Short interfering RNAs are used in functional genomics studies to knock down a single gene in a reversible manner. The results of siRNA experiments are highly dependent on the choice of siRNA sequence. In order to evaluate siRNA design rules, we collected a database of 398 siRNAs of known efficacy from 92 genes. We used this database to evaluate previously proposed rules from smaller datasets, and to find a new set of rules that are optimal for the entire database. We also trained a regression tree with full cross-validation. It was however difficult to obtain the same precision as methods previously tested on small datasets from one or two genes. We show that those methods are overfitting as they work poorly on independent validation datasets from multiple genes. Our new design rules can predict siRNAs with efficacy ≥ 50% in 91% of cases, and with efficacy ≥ 90% in 52% of cases, which is more than a twofold improvement over random selection. Software for designing siRNAs is available online via a web server at or as a standalone version for high-throughput applications.

Keywords: sirna
[Cawley2004Fast] Gavin C Cawley and Nicola L C Talbot. Fast exact leave-one-out cross-validation of sparse least-squares support vector machines. Neural Netw, 17(10):1467-75, Dec 2004. [ bib | DOI | http | .pdf ]
Leave-one-out cross-validation has been shown to give an almost unbiased estimator of the generalisation properties of statistical models, and therefore provides a sensible criterion for model selection and comparison. In this paper we show that exact leave-one-out cross-validation of sparse Least-Squares Support Vector Machines (LS-SVMs) can be implemented with a computational complexity of only O(l n^2) floating point operations, rather than the O(l^2 n^2) operations of a naïve implementation, where l is the number of training patterns and n is the number of basis vectors. As a result, leave-one-out cross-validation becomes a practical proposition for model selection in large scale applications. For clarity the exposition concentrates on sparse least-squares support vector machines in the context of non-linear regression, but is equally applicable in a pattern recognition setting.

[Causier2004Studying] Barry Causier. Studying the interactome with the yeast two-hybrid system and mass spectrometry. Mass Spectrom Rev, 23(5):350-367, 2004. [ bib | DOI | http ]
Protein interactions are crucial to the life of a cell. The analysis of such interactions is allowing biologists to determine the function of uncharacterized proteins and the genes that encode them. The yeast two-hybrid system has become one of the most popular and powerful tools to study protein-protein interactions. With the advent of proteomics, the two-hybrid system has found a niche in interactome mapping. However, it is clear that only by combining two-hybrid data with that from complementary approaches such as mass spectrometry (MS) can the interactome be analyzed in full. This review introduces the yeast two-hybrid system to those unfamiliar with the technique, and discusses how it can be used in combination with MS to unravel the network of protein interactions that occur in a cell.

Keywords: Genes, Fungal; Genome; Mass Spectrometry; Proteins; Proteomics; Yeasts
[Camps-Valls2004Profiled] G. Camps-Valls, A.M. Chalk, A.J. Serrano-Lopez, J.D. Martin-Guerrero, and E.L. Sonnhammer. Profiled support vector machines for antisense oligonucleotide efficacy prediction. BMC Bioinformatics, 5(135):135, 2004. [ bib | DOI | http | .pdf ]
Background: This paper presents the use of Support Vector Machines (SVMs) for prediction and analysis of antisense oligonucleotide (AO) efficacy. The collected database comprises 315 AO molecules including 68 features each, inducing a problem well-suited to SVMs. The task of feature selection is crucial given the presence of noisy or redundant features, and the well-known problem of the curse of dimensionality. We propose a two-stage strategy to develop an optimal model: (1) feature selection using correlation analysis, mutual information, and SVM-based recursive feature elimination (SVM-RFE), and (2) AO prediction using standard and profiled SVM formulations. A profiled SVM gives different weights to different parts of the training data to focus the training on the most important regions. Results: In the first stage, the SVM-RFE technique was most efficient and robust in the presence of a low number of samples and high input space dimension. This method yielded an optimal subset of 14 representative features, which were all related to energy and sequence motifs. The second stage evaluated the performance of the predictors (overall correlation coefficient between observed and predicted efficacy, r; mean error, ME; and root-mean-square-error, RMSE) using 8-fold and minus-one-RNA cross-validation methods. The profiled SVM produced the best results (r = 0.44, ME = 0.022, and RMSE = 0.278) and predicted high (>75% [...] gene expression) and low efficacy (<25% [...] of 83.3% [...] approaches. A web server for AO prediction is available online at http://aosvm.cgb.ki.se/. Conclusions: The SVM approach is well suited to the AO prediction problem, and yields a prediction accuracy superior to previous methods. The profiled SVM was found to perform better than the standard SVM, suggesting that it could lead to improvements in other prediction problems as well.

Keywords: biosvm
[Campanini2004novel] Renato Campanini, Danilo Dongiovanni, Emiro Iampieri, Nico Lanconelli, Matteo Masotti, Giuseppe Palermo, Alessandro Riccardi, and Matteo Roffilli. A novel featureless approach to mass detection in digital mammograms based on support vector machines. Phys Med Biol, 49(6):961-75, Mar 2004. [ bib | DOI | http | .pdf ]
In this work, we present a novel approach to mass detection in digital mammograms. The great variability of the appearance of masses is the main obstacle to building a mass detection method. It is indeed demanding to characterize all the varieties of masses with a reduced set of features. Hence, in our approach we have chosen not to extract any feature, for the detection of the region of interest; in contrast, we exploit all the information available on the image. A multiresolution overcomplete wavelet representation is performed, in order to codify the image with redundancy of information. The vectors of the very-large space obtained are then provided to a first support vector machine (SVM) classifier. The detection task is considered here as a two-class pattern recognition problem: crops are classified as suspect or not, by using this SVM classifier. False candidates are eliminated with a second cascaded SVM. To further reduce the number of false positives, an ensemble of experts is applied: the final suspect regions are achieved by using a voting strategy. The sensitivity of the presented system is nearly 80% with a false-positive rate of 1.1 marks per image, estimated on images coming from the USF DDSM database.

[Cai2004Identify] Y.D. Cai, G.P. Zhou, C.H. Jen, S.L. Lin, and K.C. Chou. Identify catalytic triads of serine hydrolases by support vector machines. J. Theor. Biol., 228(4):551-557, 2004. [ bib | DOI | http | .pdf ]
The core of an enzyme molecule is its active site from the viewpoints of both academic research and industrial application. To reveal the structural and functional mechanism of an enzyme, one needs to know its active site; to conduct structure-based drug design by regulating the function of an enzyme, one needs to know the active site and its microenvironment as well. Given the atomic coordinates of an enzyme molecule, how can we predict its active site? To tackle such a problem, a distance group approach was proposed and the support vector machine algorithm applied to predict the catalytic triad of serine hydrolase family. The success rate by jackknife test for the 139 serine hydrolases was 85% [...] promising and may become a useful tool in structural bioinformatics.

Keywords: biosvm
[Cai2004Application] Y.D. Cai, P.W. Ricardo, C.H. Jen, and K.C. Chou. Application of SVM to predict membrane protein types. J. Theor. Biol., 226(4):373-376, 2004. [ bib | DOI | http | .pdf ]
As a continuous effort to develop automated methods for predicting membrane protein types that was initiated by Chou and Elrod (PROTEINS: Structure, Function, and Genetics, 1999, 34, 137-153), the support vector machine (SVM) is introduced. Results obtained through re-substitution, jackknife, and independent data set tests, respectively, have indicated that the SVM approach is quite a promising one, suggesting that the covariant discriminant algorithm (Chou and Elrod, Protein Eng. 12 (1999) 107) and SVM, if effectively complemented with each other, will become a powerful tool for predicting membrane protein types and the other protein attributes as well.

Keywords: biosvm
[Cai2004Enzyme] C.Z. Cai, L.Y. Han, Z.L. Ji, and Y.Z. Chen. Enzyme family classification by support vector machines. Proteins, 55(1):66-76, 2004. [ bib | DOI | http | .pdf ]
One approach for facilitating protein function prediction is to classify proteins into functional families. Recent studies on the classification of G-protein coupled receptors and other proteins suggest that a statistical learning method, support vector machines (SVM), may be potentially useful for protein classification into functional families. In this work, SVM is applied and tested on the classification of enzymes into functional families defined by the Enzyme Nomenclature Committee of IUBMB. SVM classification system for each family is trained from representative enzymes of that family and seed proteins of Pfam curated protein families. The classification accuracy for enzymes from 46 families and for non-enzymes is in the range of 50.0% [...] Matthews correlation coefficient is in the range of 54.1% [...] Moreover, 80.3% [...] classified into a specific enzyme family by using a scoring function, indicating that SVM may have certain level of unique prediction capability. Testing results also suggest that SVM in some cases is capable of classification of distantly related enzymes and homologous enzymes of different functions. Effort is being made to use a more comprehensive set of enzymes as training sets and to incorporate multi-class SVM classification systems to further enhance the unique prediction accuracy. Our results suggest the potential of SVM for enzyme family classification and for facilitating protein function prediction. Our software is accessible at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi.

Keywords: biosvm
[Byvatov2004SVM-based] Evgeny Byvatov and Gisbert Schneider. SVM-based feature selection for characterization of focused compound collections. J Chem Inf Comput Sci, 44(3):993-9, 2004. [ bib | DOI | http | .pdf ]
Artificial neural networks, the support vector machine (SVM), and other machine learning methods for the classification of molecules are often considered as a "black box", since the molecular features that are most relevant for a given classifier are usually not presented in a human-interpretable form. We report on an SVM-based algorithm for the selection of relevant molecular features from a trained classifier that might be important for an understanding of ligand-receptor interactions. The original SVM approach was extended to allow for feature selection. The method was applied to characterize focused libraries of enzyme inhibitors. A comparison with classical Kolmogorov-Smirnov (KS)-based feature selection was performed. In most of the applications the SVM method showed sustained classification accuracy, thereby relying on a smaller number of molecular features than KS-based classifiers. In one case both methods produced comparable results. Limiting the calculation of descriptors to only the most relevant ones for a certain biological activity can also be used to speed up high-throughput virtual screening.

Keywords: biosvm chemoinformatics featureselection
[Busuttil2004Support] S. Busuttil, J. Abela, and G.J. Pace. Support vector machines with profile-based kernels for remote protein homology detection. Genome Inform Ser Workshop Genome Inform, 15(2):191-200, 2004. [ bib | .html | .pdf ]
Two new techniques for remote protein homology detection particularly suited for sparse data are introduced. These methods are based on position specific scoring matrices or profiles and use a support vector machine (SVM) for discrimination. The performance on standard benchmarks outperforms previous non-discriminative techniques and is comparable to that of other SVM-based methods while giving distinct advantages.

Keywords: biosvm
[Buriol04New] L. Buriol, P.M. França, and P. Moscato. A new memetic algorithm for the asymmetric traveling salesman problem. Journal of Heuristics, 10(5):483-506, 2004. [ bib | DOI ]
[Buck2004ChIP] Michael J Buck and Jason D Lieb. ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics, 83(3):349-360, Mar 2004. [ bib ]
Chromatin immunoprecipitation (ChIP) is a well-established procedure to investigate interactions between proteins and DNA. Coupled with whole-genome DNA microarrays, ChIPs allow one to determine the entire spectrum of in vivo DNA binding sites for any given protein. The design and analysis of ChIP-microarray (also called ChIP-chip) experiments differ significantly from the conventions used for locus ChIP approaches and ChIP-chip experiments, and these differences require new methods of analysis. In this light, we review the design of DNA microarrays, the selection of controls, the level of repetition required, and other critical parameters for success in the design and analysis of ChIP-chip experiments, especially those conducted in the context of mammalian or other relatively large genomes.

Keywords: Animals; Chromatin, metabolism; Genome; Humans; Immunoprecipitation, methods; Models, Theoretical; Oligonucleotide Array Sequence Analysis, methods; Research Design
[Brunet2004Metagenes] J.P. Brunet, P. Tamayo, T.R. Golub, and J.P. Mesirov. Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci U S A, 101(12):4164-9, 2004. [ bib | DOI | http | .pdf ]
We describe here the use of nonnegative matrix factorization (NMF), an algorithm based on decomposition by parts that can reduce the dimension of expression data from thousands of genes to a handful of metagenes. Coupled with a model selection mechanism, adapted to work for any stochastic clustering algorithm, NMF is an efficient method for identification of distinct molecular patterns and provides a powerful method for class discovery. We demonstrate the ability of NMF to recover meaningful biological information from cancer-related microarray data. NMF appears to have advantages over other methods such as hierarchical clustering or self-organizing maps. We found it less sensitive to a priori selection of genes or initial conditions and able to detect alternative or context-dependent patterns of gene expression in complex biological systems. This ability, similar to semantic polysemy in text, provides a general method for robust molecular pattern discovery.

[Breslin2004Autofluorescence] Tara M Breslin, Fushen Xu, Gregory M Palmer, Changfang Zhu, Kennedy W Gilchrist, and Nirmala Ramanujam. Autofluorescence and diffuse reflectance properties of malignant and benign breast tissues. Ann Surg Oncol, 11(1):65-70, Jan 2004. [ bib | DOI | http | .pdf ]
BACKGROUND: Fluorescence spectroscopy is an evolving technology that can rapidly differentiate between benign and malignant tissues. These differences are thought to be due to endogenous fluorophores, including nicotinamide adenine dinucleotide, flavin adenine dinucleotide, and tryptophan, and absorbers such as beta-carotene and hemoglobin. We hypothesized that a statistically significant difference would be demonstrated between benign and malignant breast tissues on the basis of their unique fluorescence and reflectance properties. METHODS: Optical measurements were performed on 56 samples of tumor or benign breast tissue. Autofluorescence spectra were measured at excitation wavelengths ranging from 300 to 460 nm, and diffuse reflectance was measured between 300 and 600 nm. Principal component analysis to dimensionally reduce the spectral data and a Wilcoxon ranked sum test were used to determine which wavelengths showed statistically significant differences. A support vector machine algorithm compared classification results with the histological diagnosis (gold standard). RESULTS: Several excitation wavelengths and diffuse reflectance spectra showed significant differences between tumor and benign tissues. By using the support vector machine algorithm to incorporate relevant spectral differences, a sensitivity of 70.0% and specificity of 91.7% were achieved. CONCLUSIONS: A statistically significant difference was demonstrated in the diffuse reflectance and fluorescence emission spectra of benign and malignant breast tissue. These differences could be exploited in the development of adjuncts to diagnostic and surgical procedures.

Keywords: breastcancer
[Bredel2004Chemogenomics] M. Bredel and E. Jacoby. Chemogenomics: an emerging strategy for rapid target and drug discovery. Nat. Rev. Genet., 5(4):262-275, Apr 2004. [ bib | DOI | http | .pdf ]
Keywords: chemogenomics
[Bozdech2004Antioxidant] Z. Bozdech and H. Ginsburg. Antioxidant defense in Plasmodium falciparum - data mining of the transcriptome. Malaria Journal, 3(1):23, 2004. [ bib | DOI | http | .pdf ]
The intraerythrocytic malaria parasite is under constant oxidative stress originating both from endogenous and exogenous processes. The parasite is endowed with a complete network of enzymes and proteins that protect it from those threats, but also uses redox activities to regulate enzyme activities. In the present analysis, the transcription of the genes coding for the antioxidant defense elements is viewed in the time-frame of the intraerythrocytic cycle. Time-dependent transcription data were taken from the transcriptome of the human malaria parasite Plasmodium falciparum. Whereas for several processes the transcription of the many participating genes is coordinated, in the present case there are some outstanding deviations where gene products that utilize glutathione or thioredoxin are transcribed before the genes coding for elements that control the levels of those substrates are transcribed. Such insights may hint at novel, non-classical pathways that necessitate further investigations.

Keywords: microarray plasmodium
[Boyd2004Convex] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004. [ bib | .pdf ]
[Bowd2004Confocal] Christopher Bowd, Linda M Zangwill, Felipe A Medeiros, Jiucang Hao, Kwokleung Chan, Te-Won Lee, Terrence J Sejnowski, Michael H Goldbaum, Pamela A Sample, Jonathan G Crowston, and Robert N Weinreb. Confocal scanning laser ophthalmoscopy classifiers and stereophotograph evaluation for prediction of visual field abnormalities in glaucoma-suspect eyes. Invest Ophthalmol Vis Sci, 45(7):2255-62, Jul 2004. [ bib | DOI | http | .pdf ]
PURPOSE: To determine whether Heidelberg Retina Tomograph (HRT; Heidelberg Engineering, Dossenheim, Germany) classification techniques and investigational support vector machine (SVM) analyses can detect optic disc abnormalities in glaucoma-suspect eyes before the development of visual field abnormalities. METHODS: Glaucoma-suspect eyes (n = 226) were classified as converts or nonconverts based on the development of repeatable (either two or three consecutive) standard automated perimetry (SAP)-detected abnormalities over the course of the study (mean follow-up, approximately 4.5 years). Hazard ratios for development of SAP abnormalities were calculated based on baseline classification results, follow-up time, and end point status (convert, nonconvert). Classification techniques applied were HRT classification (HRTC), Moorfields Regression Analysis, forward-selection optimized SVM (SVM fwd) and backward elimination-optimized SVM (SVM back) analysis of HRT data, and stereophotograph assessment. RESULTS: Univariate analyses indicated that all classification techniques were predictors of the development of two repeatable abnormal SAP results, with hazard ratios (95% confidence interval [CI]) ranging from 1.32 (1.00-1.75) for HRTC to 2.0 (1.48-2.76) for stereophotograph assessment (all P ≤ 0.05). Only SVM (SVM fwd and SVM back) analysis of HRT data and stereophotograph assessment were univariate predictors of the development of three repeatable abnormal SAP results, with hazard ratios (95% CI) ranging from 1.73 (1.16-2.82) for SVM fwd to 1.82 (1.19-3.12) for SVM back (both P < 0.007). Multivariate analyses including each classification technique individually in a model with age, baseline SAP pattern standard deviation [PSD], and baseline IOP indicated that all classification techniques except HRTC (P = 0.06) were predictors of the development of two repeatable abnormal SAP results with hazard ratios ranging from 1.30 (0.99, 1.73) for HRTC to 1.90 (1.37, 2.69) for stereophotograph assessment. Only SVM (SVM fwd and SVM back) analysis of HRT data and stereophotograph assessment were significant predictors of the development of three repeatable abnormal SAP results in multivariate analyses; hazard ratios of 1.57 (1.03, 2.59) and 1.70 (1.18, 2.51), respectively. SAP PSD was a significant predictor of two repeatable abnormal SAP results in multivariate models with all classification techniques, with hazard ratios ranging from 3.31 (1.39, 7.89) to 4.70 (2.02, 10.93) per 1-dB increase. CONCLUSIONS: HRT classification techniques and stereophotograph assessment can detect optic disc topography abnormalities in glaucoma-suspect eyes before the development of SAP abnormalities. These data support strongly the importance of optic disc examination for early glaucoma diagnosis.

Keywords: 80 and over, Adolescent, Adult, Aged, Algorithms, Artificial Intelligence, Auditory, Benchmarking, Binding Sites, Brain Stem, Breast Diseases, Chemical, Child, Chromosomes, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, Data Interpretation, Databases, Diagnosis, Diagnostic Errors, Differential, Drug Resistance, Electroencephalography, Epilepsy, Evoked Potentials, Female, Forecasting, Gene Expression, Gene Expression Profiling, Genetic, Genotype, Glaucoma, Greece, HIV Protease Inhibitors, HIV-1, Human, Humans, Infant, Information Management, Information Storage and Retrieval, Intraocular Pressure, Kinetics, Language Development Disorders, Lasers, Least-Squares Analysis, Linear Models, Male, Microbial Sensitivity Tests, Middle Aged, Models, Molecular, Monitoring, Nephroblastoma, Non-U.S. Gov't, Nonlinear Dynamics, Ocular Hypertension, Oligonucleotide Array Sequence Analysis, Ophthalmoscopy, Optic Disk, Optic Nerve Diseases, P.H.S., Pair 1, Perimetry, Periodicals, Phosphorylation, Phosphotransferases, Photography, Physiologic, Point Mutation, Preschool, Prognosis, Protein, Proteins, Pyrimidinones, Reaction Time, Recurrence, Reproducibility of Results, Research Support, Reverse Transcriptase Inhibitors, Sensitivity and Specificity, Sequence Alignment, Sequence Analysis, Signal Processing, Software, Sound Localization, Statistical, Stochastic Processes, Structure-Activity Relationship, Theoretical, Time Factors, U.S. Gov't, Viral, Vision Disorders, Visual Fields, 15223803
[Blanchard2004Statistical] G. Blanchard, O. Bousquet, and P. Massart. Statistical Performance of Support Vector Machine. Submitted to Ann. Stat., 2004. [ bib ]
[Bilello2004Automatic] Michel Bilello, Salih Burak Gokturk, Terry Desser, Sandy Napel, R. Brooke Jeffrey, and Christopher F Beaulieu. Automatic detection and classification of hypodense hepatic lesions on contrast-enhanced venous-phase CT. Med Phys, 31(9):2584-93, Sep 2004. [ bib ]
The objective of this work was to develop and validate algorithms for detection and classification of hypodense hepatic lesions, specifically cysts, hemangiomas, and metastases from CT scans in the portal venous phase of enhancement. Fifty-six CT sections from 51 patients were used as representative of common hypodense liver lesions, including 22 simple cysts, 11 hemangiomas, 22 metastases, and 1 image containing both a cyst and a hemangioma. The detection algorithm uses intensity-based histogram methods to find central lesions, followed by liver contour refinement to identify peripheral lesions. The classification algorithm operates on the focal lesions identified during detection, and includes shape-based segmentation, edge pixel weighting, and lesion texture filtering. Support vector machines are then used to perform a pair-wise lesion classification. For the detection algorithm, 80% lesion sensitivity was achieved at approximately 0.3 false positives (FP) per slice for central lesions, and 0.5 FP per slice for peripheral lesions, giving a total of 0.8 FP per section. For 90% sensitivity, the total number of FP rises to about 2.2 per section. The pair-wise classification yielded good discrimination between cysts and metastases (at 95% sensitivity for detection of metastases, only about 5% of cysts are incorrectly classified as metastases), perfect discrimination between hemangiomas and cysts, and was least accurate in discriminating between hemangiomas and metastases (at 90% sensitivity for detection of hemangiomas, about 28% of metastases were incorrectly classified as hemangiomas). Initial implementations of our algorithms are promising for automating liver lesion detection and classification.

[Biasotti20043D] S. Biasotti, S. Marini, M. Mortara, G. Patane, M. Spagnuolo, and B. Falcidieno. 3D shape matching through topological structures. In Discrete Geometry for Computer Imagery, pages 194-203. Springer Berlin / Heidelberg, 2004. [ bib ]
[Bhasin2004SVM] M. Bhasin and G.P.S. Raghava. SVM based method for predicting HLA-DRB1*0401 binding peptides in an antigen sequence. Bioinformatics, 20(3):421-423, 2004. [ bib | http | .pdf ]
Summary: Prediction of peptides binding with MHC class II allele HLA-DRB1*0401 can effectively reduce the number of experiments required for identifying helper T cell epitopes. This paper describes a support vector machine (SVM) based method developed for identifying HLA-DRB1*0401 binding peptides in an antigenic sequence. SVM was trained and tested on a large and clean data set consisting of 567 binders and an equal number of non-binders. The accuracy of the method was 86% [...] Available: A web server HLA-DR4Pred based on the above approach is available at http://www.imtech.res.in/raghava/hladr4pred/ and http://bioinformatics.uams.edu/mirror/hladr4pred/ (Mirror Site). Supplementary information: http://www.imtech.res.in/raghava/hladr4pred/info.html

Keywords: biosvm immunoinformatics
[Bhasin2004Prediction] M. Bhasin and G. P. S. Raghava. Prediction of CTL epitopes using QM, SVM and ANN techniques. Vaccine, 22(23-24):3195-3204, 2004. [ bib | DOI | http | .pdf ]
Cytotoxic T lymphocyte (CTL) epitopes are potential candidates for subunit vaccine design for various diseases. Most of the existing T cell epitope prediction methods are indirect methods that predict MHC class I binders instead of CTL epitopes. In this study, a systematic attempt has been made to develop a direct method for predicting CTL epitopes from an antigenic sequence. This method is based on quantitative matrix (QM) and machine learning techniques such as Support Vector Machine (SVM) and Artificial Neural Network (ANN). This method has been trained and tested on non-redundant dataset of T cell epitopes and non-epitopes that includes 1137 experimentally proven MHC class I restricted T cell epitopes. The accuracy of QM-, ANN- and SVM-based methods was 70.0, 72.2 and 75.2 has been evaluated through Leave One Out Cross-Validation (LOOCV) at a cutoff score where sensitivity and specificity was nearly equal. Finally, both machine-learning methods were used for consensus and combined prediction of CTL epitopes. The performances of these methods were evaluated on blind dataset where machine learning-based methods perform better than QM-based method. We also demonstrated through subgroup analysis that our methods can discriminate between T-cell epitopes and MHC binders (non-epitopes). In brief this method allows prediction of CTL epitopes using QM, SVM, ANN approaches. The method also facilitates prediction of MHC restriction in predicted T cell epitopes.

Keywords: biosvm immunoinformatics
[Bhasin2004GPCRpred] M. Bhasin and G. P. S. Raghava. GPCRpred: an SVM-based method for prediction of families and subfamilies of G-protein coupled receptors. Nucl. Acids Res., 32(Suppl. 2):W383-389, 2004. [ bib | DOI | arXiv | http | .pdf ]
G-protein coupled receptors (GPCRs) belong to one of the largest superfamilies of membrane proteins and are important targets for drug design. In this study, a support vector machine (SVM)-based method, GPCRpred, has been developed for predicting families and subfamilies of GPCRs from the dipeptide composition of proteins. The dataset used in this study for training and testing was obtained from http://www.soe.ucsc.edu/research/compbio/gpcr/. The method classified GPCRs and non-GPCRs with an accuracy of 99.5 evaluated using 5-fold cross-validation. The method is further able to predict five major classes or families of GPCRs with an overall Matthew's correlation coefficient (MCC) and accuracy of 0.81 and 97.5 of the rhodopsin-like family, the method achieved an average MCC and accuracy of 0.97 and 97.3 overall accuracy of 91.3 respectively when evaluated on an independent/blind dataset of 650 GPCRs. A server for recognition and classification of GPCRs based on multiclass SVMs has been set up at http://www.imtech.res.in/raghava/gpcrpred/. We have also suggested subfamilies for 42 sequences which were previously identified as unclassified ClassA GPCRs. The supplementary information is available at http://www.imtech.res.in/raghava/gpcrpred/info.html.

Keywords: biosvm
[Bhasin2004ESLpred] M. Bhasin and G. P. S. Raghava. ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucl. Acids Res., 32(Suppl. 2):W414-419, 2004. [ bib | DOI | arXiv | http | .pdf ]
Automated prediction of subcellular localization of proteins is an important step in the functional annotation of genomes. The existing subcellular localization prediction methods are based on either amino acid composition or N-terminal characteristics of the proteins. In this paper, support vector machine (SVM) has been used to predict the subcellular location of eukaryotic proteins from their different features such as amino acid composition, dipeptide composition and physico-chemical properties. The SVM module based on dipeptide composition performed better than the SVM modules based on amino acid composition or physico-chemical properties. In addition, PSI-BLAST was also used to search the query sequence against the dataset of proteins (experimentally annotated proteins) to predict its subcellular location. In order to improve the prediction accuracy, we developed a hybrid module using all features of a protein, which consisted of an input vector of 458 dimensions (400 dipeptide compositions, 33 properties, 20 amino acid compositions of the protein and 5 from PSI-BLAST output). Using this hybrid approach, the prediction accuracies of nuclear, cytoplasmic, mitochondrial and extracellular proteins reached 95.3, 85.2, 68.2 and 88.9 overall prediction accuracy of SVM modules based on amino acid composition, physico-chemical properties, dipeptide composition and the hybrid approach was 78.1, 77.8, 82.9 and 88.0 The accuracy of all the modules was evaluated using a 5-fold cross-validation technique. Assigning a reliability index (reliability index > or =3), 73.5 on the above approach, an online web server ESLpred was developed, which is available at http://www.imtech.res.in/raghava/eslpred/.

Keywords: biosvm
[Bhasin2004Classification] M. Bhasin and G. P. S. Raghava. Classification of Nuclear Receptors Based on Amino Acid Composition and Dipeptide Composition. J. Biol. Chem., 279(22):23262-23266, 2004. [ bib | DOI | arXiv | http | .pdf ]
Nuclear receptors are key transcription factors that regulate crucial gene networks responsible for cell growth, differentiation, and homeostasis. Nuclear receptors form a superfamily of phylogenetically related proteins and control functions associated with major diseases (e.g. diabetes, osteoporosis, and cancer). In this study, a novel method has been developed for classifying the subfamilies of nuclear receptors. The classification was achieved on the basis of amino acid and dipeptide composition from a sequence of receptors using support vector machines. The training and testing was done on a non-redundant data set of 282 proteins obtained from the NucleaRDB data base (1). The performance of all classifiers was evaluated using a 5-fold cross validation test. In the 5-fold cross-validation, the data set was randomly partitioned into five equal sets and evaluated five times on each distinct set while keeping the remaining four sets for training. It was found that different subfamilies of nuclear receptors were quite closely correlated in terms of amino acid composition as well as dipeptide composition. The overall accuracy of amino acid composition-based and dipeptide composition-based classifiers were 82.6 and 97.5 that different subfamilies of nuclear receptors are predictable with considerable accuracy using amino acid or dipeptide composition. Furthermore, based on above approach, an online web service, NRpred, was developed, which is available at www.imtech.res.in/raghava/nrpred.

Keywords: biosvm
[Bhasin2004Analysis] M. Bhasin and G. P. S. Raghava. Analysis and prediction of affinity of TAP binding peptides using cascade SVM. Protein Sci., 13(3):596-607, Mar 2004. [ bib | DOI | http | .pdf ]
The generation of cytotoxic T lymphocyte (CTL) epitopes from an antigenic sequence involves number of intracellular processes, including production of peptide fragments by proteasome and transport of peptides to endoplasmic reticulum through transporter associated with antigen processing (TAP). In this study, 409 peptides that bind to human TAP transporter with varying affinity were analyzed to explore the selectivity and specificity of TAP transporter. The abundance of each amino acid from P1 to P9 positions in high-, intermediate-, and low-affinity TAP binders were examined. The rules for predicting TAP binding regions in an antigenic sequence were derived from the above analysis. The quantitative matrix was generated on the basis of contribution of each position and residue in binding affinity. The correlation of r = 0.65 was obtained between experimentally determined and predicted binding affinity by using a quantitative matrix. Further a support vector machine (SVM)-based method has been developed to model the TAP binding affinity of peptides. The correlation (r = 0.80) was obtained between the predicted and experimental measured values by using sequence-based SVM. The reliability of prediction was further improved by cascade SVM that uses features of amino acids along with sequence. An extremely good correlation (r = 0.88) was obtained between measured and predicted values, when the cascade SVM-based method was evaluated through jackknife testing. A Web service, TAPPred (http://www.imtech.res.in/raghava/tappred/ or http://bioinformatics.uams.edu/mirror/tappred/), has been developed based on this approach.

Keywords: biosvm
[Bern2004Automatic] M. Bern, D. Goldberg, W. H. McDonald, and J. R. Yates III. Automatic Quality Assessment of Peptide Tandem Mass Spectra. Bioinformatics, 20(Suppl. 1):i49-i54, 2004. [ bib | http | .pdf ]
Motivation: A powerful proteomics methodology couples high-performance liquid chromatography (HPLC) with tandem mass spectrometry and database-search software, such as SEQUEST. Such a set-up, however, produces a large number of spectra, many of which are of too poor quality to be useful. Hence a filter that eliminates poor spectra before the database search can significantly improve throughput and robustness. Moreover, spectra judged to be of high quality, but that cannot be identified by database search, are prime candidates for still more computationally intensive methods, such as de novo sequencing or wider database searches including post-translational modifications. Results: We report on two different approaches to assessing spectral quality prior to identification: binary classification, which predicts whether or not SEQUEST will be able to make an identification, and statistical regression, which predicts a more universal quality metric involving the number of b- and y-ion peaks. The best of our binary classifiers can eliminate over 75 spectra while losing only 10 regression can pick out spectra of modified peptides that can be identified by a de novo program but not by SEQUEST. In a section of independent interest, we discuss intensity normalization of mass spectra.

Keywords: biosvm proteomics
[Bentele2004JCB] M. Bentele, I. Lavrik, M. Ulrich, S. Stosser, D. W. Heermann, H. Kalthoff, P. H. Krammer, and R. Eils. Mathematical modeling reveals threshold mechanism in CD95-induced apoptosis. J Cell Biol, 166(6):839-51, 2004. [ bib ]
Mathematical modeling is required for understanding the complex behavior of large signal transduction networks. Previous attempts to model signal transduction pathways were often limited to small systems or based on qualitative data only. Here, we developed a mathematical modeling framework for understanding the complex signaling behavior of CD95(APO-1/Fas)-mediated apoptosis. Defects in the regulation of apoptosis result in serious diseases such as cancer, autoimmunity, and neurodegeneration. During the last decade many of the molecular mechanisms of apoptosis signaling have been examined and elucidated. A systemic understanding of apoptosis is, however, still missing. To address the complexity of apoptotic signaling we subdivided this system into subsystems of different information qualities. A new approach for sensitivity analysis within the mathematical model was key for the identification of critical system parameters and two essential system properties: modularity and robustness. Our model describes the regulation of apoptosis on a systems level and resolves the important question of a threshold mechanism for the regulation of apoptosis.

Keywords: csbcbook
[Benito2004Adjustment] Monica Benito, Joel Parker, Quan Du, Junyuan Wu, Dong Xiang, Charles M Perou, and J. S. Marron. Adjustment of systematic microarray data biases. Bioinformatics, 20(1):105-114, Jan 2004. [ bib ]
MOTIVATION: Systematic differences due to experimental features of microarray experiments are present in most large microarray data sets. Many different experimental features can cause biases including different sources of RNA, different production lots of microarrays or different microarray platforms. These systematic effects present a substantial hurdle to the analysis of microarray data. RESULTS: We present here a new method for the identification and adjustment of systematic biases that are present within microarray data sets. Our approach is based on modern statistical discrimination methods and is shown to be very effective in removing systematic biases present in a previously published breast tumor cDNA microarray data set. The new method of 'Distance Weighted Discrimination (DWD)' is shown to be better than Support Vector Machines and Singular Value Decomposition for the adjustment of systematic microarray effects. In addition, it is shown to be of general use as a tool for the discrimination of systematic problems present in microarray data sets, including the merging of two breast tumor data sets completed on different microarray platforms. AVAILABILITY: Matlab software to perform DWD can be retrieved from https://genome.unc.edu/pubsup/dwd/

[Bengio2004Learning] Y. Bengio, O. Delalleau, N. Le Roux, J.-F. Paiement, P. Vincent, and M. Ouimet. Learning eigenfunctions links spectral embedding and kernel PCA. Neural Comput., 16(10):2197-219, Oct 2004. [ bib | DOI | http | .pdf ]
In this letter, we show a direct relation between spectral embedding methods and kernel principal components analysis and how both are special cases of a more general learning problem: learning the principal eigenfunctions of an operator defined from a kernel and the unknown data-generating density. Whereas spectral embedding methods provided only coordinates for the training points, the analysis justifies a simple extension to out-of-sample examples (the Nyström formula) for multidimensional scaling (MDS), spectral clustering, Laplacian eigenmaps, locally linear embedding (LLE), and Isomap. The analysis provides, for all such spectral embedding methods, the definition of a loss function, whose empirical average is minimized by the traditional algorithms. The asymptotic expected value of that loss defines a generalization performance and clarifies what these algorithms are trying to learn. Experiments with LLE, Isomap, spectral clustering, and MDS show that this out-of-sample embedding formula generalizes well, with a level of error comparable to the effect of small perturbations of the training set on the embedding.

Keywords: dimred
[Beer2004Predicting] M. A. Beer and S. Tavazoie. Predicting gene expression from sequence. Cell, 117:185-198, 2004. [ bib | .pdf ]
[Becker2004G] O. M. Becker, Y. Marantz, S. Shacham, B. Inbal, A. Heifetz, O. Kalid, S. Bar-Haim, D. Warshaviak, M. Fichman, and S. Noiman. G protein-coupled receptors: in silico drug discovery in 3D. Proc. Natl. Acad. Sci. USA, 101(31):11304-11309, Aug 2004. [ bib | DOI | http ]
The application of structure-based in silico methods to drug discovery is still considered a major challenge, especially when the x-ray structure of the target protein is unknown. Such is the case with human G protein-coupled receptors (GPCRs), one of the most important families of drug targets, where in the absence of x-ray structures, one has to rely on in silico 3D models. We report repeated success in using ab initio in silico GPCR models, generated by the predict method, for blind in silico screening when applied to a set of five different GPCR drug targets. More than 100,000 compounds were typically screened in silico for each target, leading to a selection of <100 "virtual hit" compounds to be tested in the lab. In vitro binding assays of the selected compounds confirm high hit rates, of 12-21% (full dose-response curves, Ki < 5 microM). In most cases, the best hit was a novel compound (New Chemical Entity) in the 1- to 100-nM range, with very promising pharmacological properties, as measured by a variety of in vitro and in vivo assays. These assays validated the quality of the hits as lead compounds for drug discovery. The results demonstrate the usefulness and robustness of ab initio in silico 3D models and of in silico screening for GPCR drug discovery.

Keywords: chemogenomics
[Baumgartner2004Unsupervised] R. Baumgartner, R. Somorjai, C. Bowman, T. C. Sorrell, C. E. Mountford, and U. Himmelreich. Unsupervised feature dimension reduction for classification of MR spectra. Magn Reson Imaging, 22(2):251-6, Feb 2004. [ bib | DOI | http | .pdf ]
We present an unsupervised feature dimension reduction method for the classification of magnetic resonance spectra. The technique preserves spectral information, important for disease profiling. We propose to use this technique as a preprocessing step for computationally demanding wrapper-based feature subset selection. We show that the classification accuracy on an independent test set can be sustained while achieving considerable feature reduction. Our method is applicable to other classification techniques, such as neural networks, support vector machines, etc.

Keywords: Algorithms, Ambergris, Candida, Candida albicans, Combinatorial Chemistry Techniques, Eye Enucleation, Humans, Magnetic Resonance Spectroscopy, Melanoma, Models, Molecular, Molecular Conformation, Non-U.S. Gov't, Odors, P.H.S., Perfume, Predictive Value of Tests, Prognosis, Prospective Studies, Quantitative Structure-Activity Relationship, Research Support, U.S. Gov't, Uveal Neoplasms, 15010118
[Baumgartner2004Supervised] C. Baumgartner, C. Bohm, D. Baumgartner, G. Marini, K. Weinberger, B. Olgemoller, B. Liebl, and A. A. Roscher. Supervised machine learning techniques for the classification of metabolic disorders in newborns. Bioinformatics, 20(17):2985-2996, 2004. [ bib | DOI | http | .pdf ]
Motivation: During the Bavarian newborn screening programme all newborns have been tested for about 20 inherited metabolic disorders. Owing to the amount and complexity of the generated experimental data, machine learning techniques provide a promising approach to investigate novel patterns in high-dimensional metabolic data which form the source for constructing classification rules with high discriminatory power. Results: Six machine learning techniques have been investigated for their classification accuracy focusing on two metabolic disorders, phenylketonuria (PKU) and medium-chain acyl-CoA dehydrogenase deficiency (MCADD). Logistic regression analysis led to superior classification rules (sensitivity >96.8 to all investigated algorithms. Including novel constellations of metabolites into the models, the positive predictive value could be strongly increased (PKU 71.9 54.6 clearly prove that the mined data confirm the known and indicate some novel metabolic patterns which may contribute to a better understanding of newborn metabolism. Availability: WEKA machine learning package: www.cs.waikato.ac.nz/~ml/weka and statistical software package ADE-4: http://pbil.univ-lyon1.fr/ADE-4

Keywords: biosvm proteomics
[Bartlett2004Sparseness] P. L. Bartlett and A. Tewari. Sparseness vs estimating conditional probabilities: Some asymptotic results. In Lecture Notes in Computer Science, volume 3120, pages 564-578. Springer, 2004. [ bib ]
[Bartel2004MicroRNAs] David P Bartel. MicroRNAs: genomics, biogenesis, mechanism, and function. Cell, 116(2):281-297, Jan 2004. [ bib | .pdf ]
MicroRNAs (miRNAs) are endogenous approximately 22 nt RNAs that can play important regulatory roles in animals and plants by targeting mRNAs for cleavage or translational repression. Although they escaped notice until relatively recently, miRNAs comprise one of the more abundant classes of gene regulatory molecules in multicellular organisms and likely influence the output of many protein-coding genes.

Keywords: sirna
[Bach2004Multiple] F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the Twenty-First International Conference on Machine Learning, page 6, New York, NY, USA, 2004. ACM. [ bib | DOI | .pdf ]
While classical kernel-based classifiers are based on a single kernel, in practice it is often desirable to base classifiers on combinations of multiple kernels. Lanckriet et al. (2004) considered conic combinations of kernel matrices for the support vector machine (SVM), and showed that the optimization of the coefficients of such a combination reduces to a convex optimization problem known as a quadratically-constrained quadratic program (QCQP). Unfortunately, current convex optimization toolboxes can solve this problem only for a small number of kernels and a small number of data points; moreover, the sequential minimal optimization (SMO) techniques that are essential in large-scale implementations of the SVM cannot be applied because the cost function is non-differentiable. We propose a novel dual formulation of the QCQP as a second-order cone programming problem, and show how to exploit the technique of Moreau-Yosida regularization to yield a formulation to which SMO techniques can be applied. We present experimental results that show that our SMO-based algorithm is significantly more efficient than the general-purpose interior point methods available in current optimization toolboxes.

[Babur2004] O. Babur, E. Demir, A. Ayaz, U. Dogrusoz, and O. Sakarya. Pathway activity inference using microarray data. Technical report, Bilkent Center for Bioinformatics (BCBI), 2004. [ bib ]
[Avidan2004Support] Shai Avidan. Support vector tracking. IEEE Trans Pattern Anal Mach Intell, 26(8):1064-72, Aug 2004. [ bib | DOI | http | .pdf ]
Support Vector Tracking (SVT) integrates the Support Vector Machine (SVM) classifier into an optic-flow-based tracker. Instead of minimizing an intensity difference function between successive frames, SVT maximizes the SVM classification score. To account for large motions between successive frames, we build pyramids from the support vectors and use a coarse-to-fine approach in the classification stage. We show results of using SVT for vehicle tracking in image sequences.

[Aoki2004KCaM] K. F. Aoki, A. Yamaguchi, N. Ueda, T. Akutsu, H. Mamitsuka, S. Goto, and M. Kanehisa. KCaM (KEGG Carbohydrate Matcher): a software tool for analyzing the structures of carbohydrate sugar chains. Nucleic Acids Res., 32(Web Server issue):W267-72, Jul 2004. [ bib | DOI | http ]
KCaM (KEGG Carbohydrate Matcher) is a tool for the analysis of carbohydrate sugar chains, or glycans. It consists of a web-based graphical user interface that allows users to enter glycans easily with the mouse. The glycan structure is then transformed into our KCF (KEGG Chemical Function) file format and sent to our program which implements an efficient tree-structure alignment algorithm, similar to sequence alignment algorithms but for branched tree structures. Users can also retrieve glycan tree structures in KCF format from their local computers for visualization over the web. The tree-matching algorithm provides several options for performing different types of tree-matching procedures on glycans. These options consist of whether to incorporate gaps in a match, whether to take the linkage information into consideration and local versus global alignment. The results of this program are returned as a list of glycan structures in order of similarity based on these options. The actual alignment can be viewed graphically, and the annotation information can also be viewed easily since all this information is linked with KEGG's comprehensive suite of genomic data. Analogously to BLAST, users are thus able to compare glycan structures of interest with glycans from different glycan databases using a variety of tree-alignment options. KCaM is currently available at http://glycan.genome.ad.jp.

Keywords: glycans
[Amarzguioui2004algorithm] M. Amarzguioui and H. Prydz. An algorithm for selection of functional siRNA sequences. Biochem. Biophys. Res. Commun., 316(4):1050-8, Apr 2004. [ bib | DOI | http ]
Randomly designed siRNA targeting different positions within the same mRNA display widely differing activities. We have performed a statistical analysis of 46 siRNA, identifying various features of the 19bp duplex that correlate significantly with functionality at the 70% knockdown level and verified these results against an independent data set of 34 siRNA recently reported by others. Features that consistently correlated positively with functionality across the two data sets included an asymmetry in the stability of the duplex ends (measured as the A/U differential of the three terminal basepairs at either end of the duplex) and the motifs S1, A6, and W19. The presence of the motifs U1 or G19 was associated with lack of functionality. A selection algorithm based on these findings strongly differentiated between the two functional groups of siRNA in both data sets and proved highly effective when used to design siRNA targeting new endogenous human genes.

Keywords: sirna
[Altun2004Exponential] Y. Altun, A. Smola, and T. Hofmann. Exponential Families for Conditional Random Fields. In 20th Conference on Uncertainty in Artificial Intelligence, 2004. [ bib | .pdf ]
Keywords: conditional-random-field
[Altun2004Gaussian] Y. Altun, T. Hofmann, and A. J. Smola. Gaussian process classification for segmenting and annotating sequences. In Twenty-first international conference on Machine learning. ACM Press, 2004. [ bib | DOI | http | .pdf ]
Many real-world classification tasks involve the prediction of multiple, inter-dependent class labels. A prototypical case of this sort deals with prediction of a sequence of labels for a sequence of observations. Such problems arise naturally in the context of annotating and segmenting observation sequences. This paper generalizes Gaussian Process classification to predict multiple labels by taking dependencies between neighboring labels into account. Our approach is motivated by the desire to retain rigorous probabilistic semantics, while overcoming limitations of parametric methods like Conditional Random Fields, which exhibit conceptual and computational difficulties in high-dimensional input spaces. Experiments on named entity recognition and pitch accent prediction tasks demonstrate the competitiveness of our approach.

Keywords: conditional-random-field
[Abernethy2004optimization] J. Abernethy, T. Evgeniou, and J.-P. Vert. An optimization framework for adaptive conjoint questionnaire design. Technical report, INSEAD, 2004. [ bib ]
[Bach2004Fast] F. R. Bach, G. Lanckriet, and M. I. Jordan. Fast kernel learning using sequential minimal optimization. Technical Report UCB/CSD-04-1307, Computer Science Division, UC Berkeley, February 2004. [ bib | .pdf ]
[Caelli2004eigenspace] T. Caelli and S. Kosinov. An eigenspace projection clustering method for inexact graph matching. IEEE Trans. Pattern Anal. Mach. Intell., 26(4):515-519, April 2004. [ bib | DOI | http ]
[Zhou2004Recognizing] GuoDong Zhou, Jie Zhang, Jian Su, Dan Shen, and ChewLim Tan. Recognizing names in biomedical texts: a machine learning approach. Bioinformatics, 20(7):1178-90, May 2004. [ bib | DOI | http | .pdf ]
MOTIVATION: With an overwhelming amount of textual information in molecular biology and biomedicine, there is a need for effective and efficient literature mining and knowledge discovery that can help biologists to gather and make use of the knowledge encoded in text documents. In order to make organized and structured information available, automatically recognizing biomedical entity names becomes critical and is important for information retrieval, information extraction and automated knowledge acquisition. RESULTS: In this paper, we present a named entity recognition system in the biomedical domain, called PowerBioNE. In order to deal with the special phenomena of naming conventions in the biomedical domain, we propose various evidential features: (1) word formation pattern; (2) morphological pattern, such as prefix and suffix; (3) part-of-speech; (4) head noun trigger; (5) special verb trigger and (6) name alias feature. All the features are integrated effectively and efficiently through a hidden Markov model (HMM) and a HMM-based named entity recognizer. In addition, a k-Nearest Neighbor (k-NN) algorithm is proposed to resolve the data sparseness problem in our system. Finally, we present a pattern-based post-processing to automatically extract rules from the training data to deal with the cascaded entity name phenomenon. From our best knowledge, PowerBioNE is the first system which deals with the cascaded entity name phenomenon. Evaluation shows that our system achieves the F-measure of 66.6 and 62.2 on the 23 classes of GENIA V3.0 and V1.1, respectively. In particular, our system achieves the F-measure of 75.8 on the "protein" class of GENIA V3.0. For comparison, our system outperforms the best published result by 7.8 on GENIA V1.1, without help of any dictionaries. 
It also shows that our HMM and the k-NN algorithm outperform other models, such as back-off HMM, linear interpolated HMM, support vector machines, C4.5, C4.5 rules and RIPPER, by effectively capturing the local context dependency and resolving the data sparseness problem. Moreover, evaluation on GENIA V3.0 shows that the post-processing for the cascaded entity name phenomenon improves the F-measure by 3.9. Finally, error analysis shows that about half of the errors are caused by the strict annotation scheme and the annotation inconsistency in the GENIA corpus. This suggests that our system achieves an acceptable F-measure of 83.6 on the 23 classes of GENIA V3.0 and in particular 86.2 on the "protein" class, without help of any dictionaries. We think that a F-measure of 90 on the 23 classes of GENIA V3.0 and in particular 92 on the "protein" class, can be achieved through refining of the annotation scheme in the GENIA corpus, such as flexible annotation scheme and annotation consistency, and inclusion of a reasonable biomedical dictionary. AVAILABILITY: A demo system is available at http://textmining.i2r.a-star.edu.sg/NLS/demo.htm. Technology license is available upon the bilateral agreement.

[Yap2004Prediction] C. W. Yap, C. Z. Cai, Y. Xue, and Y. Z. Chen. Prediction of torsade-causing potential of drugs by support vector machine approach. Toxicol Sci, 79(1):170-7, May 2004. [ bib | DOI | http | .pdf ]
In an effort to facilitate drug discovery, computational methods for facilitating the prediction of various adverse drug reactions (ADRs) have been developed. So far, attention has not been sufficiently paid to the development of methods for the prediction of serious ADRs that occur less frequently. Some of these ADRs, such as torsade de pointes (TdP), are important issues in the approval of drugs for certain diseases. Thus there is a need to develop tools for facilitating the prediction of these ADRs. This work explores the use of a statistical learning method, support vector machine (SVM), for TdP prediction. TdP involves multiple mechanisms and SVM is a method suitable for such a problem. Our SVM classification system used a set of linear solvation energy relationship (LSER) descriptors and was optimized by leave-one-out cross validation procedure. Its prediction accuracy was evaluated by using an independent set of agents and by comparison with results obtained from other commonly used classification methods using the same dataset and optimization procedure. The accuracies for the SVM prediction of TdP-causing agents and non-TdP-causing agents are 97.4 and 84.6% respectively; one is substantially improved against and the other is comparable to the results obtained by other classification methods useful for multiple-mechanism prediction problems. This indicates the potential of SVM in facilitating the prediction of TdP-causing risk of small molecules and perhaps other ADRs that involve multiple mechanisms.

Keywords: biosvm chemoinformatics
[Roth2004Bayesian] Volker Roth and Tilman Lange. Bayesian class discovery in microarray datasets. IEEE Trans Biomed Eng, 51(5):707-718, May 2004. [ bib ]
A novel approach to class discovery in gene expression datasets is presented. In the context of clinical diagnosis, the central goal of class discovery algorithms is to simultaneously find putative (sub-)types of diseases and to identify informative subsets of genes with disease-type specific expression profile. Contrary to many other approaches in the literature, the method presented implements a wrapper strategy for feature selection, in the sense that the features are directly selected by optimizing the discriminative power of the used partitioning algorithm. The usual combinatorial problems associated with wrapper approaches are overcome by a Bayesian inference mechanism. On the technical side, we present an efficient optimization algorithm with guaranteed local convergence property. The only free parameter of the optimization method is selected by a resampling-based stability analysis. Experiments with Leukemia and Lymphoma datasets demonstrate that our method is able to correctly infer partitions and corresponding subsets of genes which both are relevant in a biological sense. Moreover, the frequently observed problem of ambiguities caused by different but equally high-scoring partitions is successfully overcome by the model selection method proposed.

Keywords: Algorithms, Automated, Bayes Theorem, Cluster Analysis, Comparative Study, DNA, Databases, Gene Expression Profiling, Genetic, Genetic Screening, Humans, Leukemia, Models, Non-U.S. Gov't, Nucleic Acid, Oligonucleotide Array Sequence Analysis, Pattern Recognition, Reproducibility of Results, Research Support, Sensitivity and Specificity, Sequence Alignment, Sequence Analysis, Statistical, 15132496
[Perola2004Conformational] Emanuele Perola and Paul S Charifson. Conformational analysis of drug-like molecules bound to proteins: an extensive study of ligand reorganization upon binding. J. Med. Chem., 47(10):2499-2510, May 2004. [ bib | DOI | http ]
This paper describes a large-scale study on the nature and the energetics of the conformational changes drug-like molecules experience upon binding. Ligand strain energies and conformational reorganization were analyzed with different computational methods on 150 crystal structures of pharmaceutically relevant protein-ligand complexes. The common knowledge that ligands rarely bind in their lowest calculated energy conformation was confirmed. Additionally, we found that over 60% of the ligands do not bind in a local minimum conformation. While approximately 60% of the ligands were calculated to bind with strain energies lower than 5 kcal/mol, strain energies over 9 kcal/mol were calculated in at least 10% of the cases regardless of the method used. A clear correlation was found between acceptable strain energy and ligand flexibility, while there was no correlation between strain energy and binding affinity, thus indicating that expensive conformational rearrangements can be tolerated in some cases without overly penalizing the tightness of binding. On the basis of the trends observed, thresholds for the acceptable strain energies of bioactive conformations were defined with consideration of the impact of ligand flexibility. An analysis of the degree of folding of the bound ligands confirmed the general tendency of small molecules to bind in an extended conformation. The results suggest that the unfolding of hydrophobic ligands during binding, which exposes hydrophobic surfaces to contact with protein residues, could be one of the factors accounting for high reorganization energies. Finally, different methods for conformational analysis were evaluated, and guidelines were defined to maximize the prevalence of bioactive conformations in computationally generated ensembles.

Keywords: Drug Design; Endopeptidases; Ligands; Molecular Conformation; Pharmaceutical Preparations; Phosphotransferases; Protein Binding; Protein Folding; Proteins; Thermodynamics
[Pavey2004Microarray] S. Pavey, P. Johansson, L. Packer, J. Taylor, M. Stark, P.M. Pollock, G.J. Walker, G.M. Boyle, U. Harper, S.J. Cozzi, K. Hansen, L. Yudt, C. Schmidt, P. Hersey, K.A. Ellem, M.G. O'Rourke, P.G. Parsons, P. Meltzer, M. Ringner, and N.K. Hayward. Microarray expression profiling in melanoma reveals a BRAF mutation signature. Oncogene, 23(23):4060-4067, May 2004. [ bib | DOI | http | .pdf ]
We have used microarray gene expression profiling and machine learning to predict the presence of BRAF mutations in a panel of 61 melanoma cell lines. The BRAF gene was found to be mutated in 42 samples (69%) [...] seven samples (11%) [...] Using support vector machines, we have built a classifier that differentiates between melanoma cell lines based on BRAF mutation status. As few as 83 genes are able to discriminate between BRAF mutant and BRAF wild-type samples with clear separation observed using hierarchical clustering. Multidimensional scaling was used to visualize the relationship between a BRAF mutation signature and that of a generalized mitogen-activated protein kinase (MAPK) activation (either BRAF or NRAS mutation) in the context of the discriminating gene list. We observed that samples carrying NRAS mutations lie somewhere between those with or without BRAF mutations. These observations suggest that there are gene-specific mutation signals in addition to a common MAPK activation that result from the pleiotropic effects of either BRAF or NRAS on other signaling pathways, leading to measurably different transcriptional changes.

Keywords: biosvm microarray
[Mittal2004Improving] V. Mittal. Improving the efficiency of RNA interference in mammals. Nat. Rev. Genet., 5(5):355-65, May 2004. [ bib | DOI | http ]
Keywords: sirna
[Mestres2004Computational] Jordi Mestres. Computational chemogenomics approaches to systematic knowledge-based drug discovery. Curr Opin Drug Discov Devel, 7(3):304-313, May 2004. [ bib ]
Chemogenomics, the identification of all possible drugs for all possible targets, has recently emerged as a new paradigm in drug discovery in which efficiency in the compound design and optimization process is achieved through the gain and reuse of targeted knowledge. As targeted knowledge resides at the interface between chemistry and biology, computational tools aimed at integrating the chemical and biological spaces play a central role in chemogenomics. This review covers the recent progress made in integrative computational approaches to data annotation and knowledge generation for the systematic knowledge-based design and screening of chemical libraries.

Keywords: Chemistry, Pharmaceutical; Combinatorial Chemistry Techniques; Computational Biology; Drug Design; Genomics; Ligands; Proteins; Receptors, G-Protein-Coupled
[Ma2004Structural] J.B. Ma, K. Ye, and D.J. Patel. Structural basis for overhang-specific small interfering RNA recognition by the PAZ domain. Nature, 429(6989):318-322, May 2004. [ bib | DOI | http ]
Short RNAs mediate gene silencing, a process associated with virus resistance, developmental control and heterochromatin formation in eukaryotes. RNA silencing is initiated through Dicer-mediated processing of double-stranded RNA into small interfering RNA (siRNA). The siRNA guide strand associates with the Argonaute protein in silencing effector complexes, recognizes complementary sequences and targets them for silencing. The PAZ domain is an RNA-binding module found in Argonaute and some Dicer proteins and its structure has been determined in the free state. Here, we report the 2.6 Å crystal structure of the PAZ domain from human Argonaute eIF2c1 bound to both ends of a 9-mer siRNA-like duplex. In a sequence-independent manner, PAZ anchors the 2-nucleotide 3' overhang of the siRNA-like duplex within a highly conserved binding pocket, and secures the duplex by binding the 7-nucleotide phosphodiester backbone of the overhang-containing strand and capping the 5'-terminal residue of the complementary strand. On the basis of the structure and on binding assays, we propose that PAZ might serve as an siRNA-end-binding module for siRNA transfer in the RNA silencing pathway, and as an anchoring site for the 3' end of guide RNA within silencing effector complexes.

Keywords: sirna
[Luo2004gene-silencing] KQ. Luo and DC. Chang. The gene-silencing efficiency of siRNA is strongly dependent on the local structure of mRNA at the targeted region. Biochem. Biophys. Res. Commun., 318(1):303-10, May 2004. [ bib | DOI | http ]
The gene-silencing effect of short interfering RNA (siRNA) is known to vary strongly with the targeted position of the mRNA. A number of hypotheses have been suggested to explain this phenomenon. We would like to test if this positional effect is mainly due to the secondary structure of the mRNA at the target site. We proposed that this structural factor can be characterized by a single parameter called "the hydrogen bond (H-b) index," which represents the average number of hydrogen bonds formed between nucleotides in the target region and the rest of the mRNA. This index can be determined using a computational approach. We tested the correlation between the H-b index and the gene-silencing effects on three genes (Bcl-2, hTF, and cyclin B1) using a variety of siRNAs. We found that the gene-silencing effect is inversely dependent on the H-b index, indicating that the local mRNA structure at the targeted site is the main cause of the positional effect. Based on this finding, we suggest that the H-b index can be a useful guideline for future siRNA design.

Keywords: Animals, Apoptosis, Base Composition, Base Pairing, Base Sequence, Binding Sites, Cell Cycle, Cell Proliferation, Comparative Study, Cultured, Cyclin B, Cyclin D1, DNA-Binding Proteins, Down-Regulation, Extramural, Fluorescence, Gene Silencing, Gene Targeting, Genetic Vectors, Green Fluorescent Proteins, Hela Cells, Humans, Hydrogen Bonding, Luminescent Proteins, Male, Messenger, Mice, Microscopy, Models, Molecular, Molecular Sequence Data, N.I.H., Non-U.S. Gov't, Nucleic Acid Conformation, Nude, P.H.S., Prostatic Neoplasms, Proto-Oncogene Proteins c-bcl-2, Proto-Oncogene Proteins c-myc, RNA, Regression Analysis, Research Support, STAT3 Transcription Factor, Small Interfering, Thromboplastin, Trans-Activators, Tumor Cells, U.S. Gov't, 15110788
[Lukas2004Brain] L. Lukas, A. Devos, JA K Suykens, L. Vanhamme, FA. Howe, C. Majós, A. Moreno-Torres, M. Van der Graaf, AR. Tate, C. Arús, and S. Van Huffel. Brain tumor classification based on long echo proton MRS signals. Artif. Intell. Med., 31(1):73-89, May 2004. [ bib | DOI | http | .pdf ]
There has been a growing research interest in brain tumor classification based on proton magnetic resonance spectroscopy (1H MRS) signals. Four research centers within the EU funded INTERPRET project have acquired a significant number of long echo 1H MRS signals for brain tumor classification. In this paper, we present an objective comparison of several classification techniques applied to the discrimination of four types of brain tumors: meningiomas, glioblastomas, astrocytomas grade II and metastases. Linear and non-linear classifiers are compared: linear discriminant analysis (LDA), support vector machines (SVM) and least squares SVM (LS-SVM) with a linear kernel as linear techniques and LS-SVM with a radial basis function (RBF) kernel as a non-linear technique. Kernel-based methods can perform well in processing high dimensional data. This motivates the inclusion of SVM and LS-SVM in this study. The analysis includes optimal input variable selection, (hyper-) parameter estimation, followed by performance evaluation. The classification performance is evaluated over 200 stratified random samplings of the dataset into training and test sets. Receiver operating characteristic (ROC) curve analysis measures the performance of binary classification, while for multiclass classification, we consider the accuracy as performance measure. Based on the complete magnitude spectra, automated binary classifiers are able to reach an area under the ROC curve (AUC) of more than 0.9 except for the hard case glioblastomas versus metastases. Although, based on the available long echo 1H MRS data, we did not find any statistically significant difference between the performances of LDA and the kernel-based methods, the latter have the strength that no dimensionality reduction is required to obtain such a high performance.
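The area under the ROC curve used as the binary performance measure in the abstract above can be computed without tracing the curve, as the fraction of (positive, negative) score pairs that are ranked correctly. The following is a minimal sketch under that rank-based definition; the function name `auc` and the toy scores are illustrative assumptions and are not taken from the paper.

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison.

    scores_pos / scores_neg: classifier scores for the positive and
    negative class (hypothetical inputs). Ties count as half a win.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Toy example: 5 of the 6 positive/negative pairs are ordered correctly.
print(auc([0.9, 0.8, 0.4], [0.5, 0.3]))
```

Averaging this quantity over repeated stratified train/test splits would give a summary comparable in spirit to the 200 random samplings mentioned above, although the splitting procedure itself is not shown here.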

[Liu2004Gabor-based] Chengjun Liu. Gabor-based kernel PCA with fractional power polynomial models for face recognition. IEEE Trans Pattern Anal Mach Intell, 26(5):572-81, May 2004. [ bib ]
This paper presents a novel Gabor-based kernel Principal Component Analysis (PCA) method by integrating the Gabor wavelet representation of face images and the kernel PCA method for face recognition. Gabor wavelets first derive desirable facial features characterized by spatial frequency, spatial locality, and orientation selectivity to cope with the variations due to illumination and facial expression changes. The kernel PCA method is then extended to include fractional power polynomial models for enhanced face recognition performance. A fractional power polynomial, however, does not necessarily define a kernel function, as it might not define a positive semidefinite Gram matrix. Note that the sigmoid kernels, one of the three classes of widely used kernel functions (polynomial kernels, Gaussian kernels, and sigmoid kernels), do not actually define a positive semidefinite Gram matrix either. Nevertheless, the sigmoid kernels have been successfully used in practice, such as in building support vector machines. In order to derive real kernel PCA features, we apply only those kernel PCA eigenvectors that are associated with positive eigenvalues. The feasibility of the Gabor-based kernel PCA method with fractional power polynomial models has been successfully tested on both frontal and pose-angled face recognition, using two data sets from the FERET database and the CMU PIE database, respectively. The FERET data set contains 600 frontal face images of 200 subjects, while the PIE data set consists of 680 images across five poses (left and right profiles, left and right half profiles, and frontal view) with two different facial expressions (neutral and smiling) of 68 subjects. 
The effectiveness of the Gabor-based kernel PCA method with fractional power polynomial models is shown in terms of both absolute performance indices and comparative performance against the PCA method, the kernel PCA method with polynomial kernels, the kernel PCA method with fractional power polynomial models, the Gabor wavelet-based PCA method, and the Gabor wavelet-based kernel PCA method with polynomial kernels.

[Lee2004efficient] Martin M S Lee, S. Sathiya Keerthi, Chong Jin Ong, and Dennis DeCoste. An efficient method for computing leave-one-out error in support vector machines with Gaussian kernels. IEEE Trans Neural Netw, 15(3):750-7, May 2004. [ bib ]
In this paper, we give an efficient method for computing the leave-one-out (LOO) error for support vector machines (SVMs) with Gaussian kernels quite accurately. It is particularly suitable for iterative decomposition methods of solving SVMs. The importance of various steps of the method is illustrated in detail by showing the performance on six benchmark datasets. The new method often leads to speedups of 10-50 times compared to standard LOO error computation. It has good promise for use in hyperparameter tuning and model comparison.

Keywords: Algorithms, Bayes Theorem, Computing Methodologies, Models, Neural Networks (Computer), Non-U.S. Gov't, Normal Distribution, Research Design, Research Support, Theoretical, 15384561
[Kim2004Emotion] KH. Kim, SW. Bang, and SR. Kim. Emotion recognition system using short-term monitoring of physiological signals. Med Biol Eng Comput, 42(3):419-27, May 2004. [ bib ]
A physiological signal-based emotion recognition system is reported. The system was developed to operate as a user-independent system, based on physiological signal databases obtained from multiple subjects. The input signals were electrocardiogram, skin temperature variation and electrodermal activity, all of which were acquired without much discomfort from the body surface, and can reflect the influence of emotion on the autonomic nervous system. The system consisted of preprocessing, feature extraction and pattern classification stages. Preprocessing and feature extraction methods were devised so that emotion-specific characteristics could be extracted from short-segment signals. Although the features were carefully extracted, their distribution formed a classification problem, with large overlap among clusters and large variance within clusters. A support vector machine was adopted as a pattern classifier to resolve this difficulty. Correct-classification ratios for 50 subjects were 78.4% and 61.8%, for the recognition of three and four categories, respectively.

Keywords: Algorithms, Animals, Antisense, Artificial Intelligence, Autonomic Nervous System, Cell Line, Child, Cluster Analysis, Comparative Study, Computational Biology, Computer Simulation, Computer-Assisted, DNA Fingerprinting, Drug Evaluation, Emotions, Fluorescence, Fuzzy Logic, Gene Silencing, Gene Targeting, Genetic, Hela Cells, Humans, Imaging, Intracellular Space, Microscopy, Models, Monitoring, Neoplasms, Neural Networks (Computer), Non-U.S. Gov't, Oligonucleotides, P.H.S., Physiologic, Preclinical, Preschool, Prognosis, Proteomics, Quantitative Structure-Activity Relationship, RNA, RNA Interference, Recognition (Psychology), Research Support, Sensitivity and Specificity, Signal Processing, Small Interfering, Thionucleotides, Three-Dimensional, Tumor, U.S. Gov't, User-Computer Interface, 15191089
[Kapetanovic2004Overview] Izet M Kapetanovic, Simon Rosenfeld, and Grant Izmirlian. Overview of commonly used bioinformatics methods and their applications. Ann N Y Acad Sci, 1020:10-21, May 2004. [ bib | DOI | http | .pdf ]
Bioinformatics, in its broad sense, involves application of computer processes to solve biological problems. A wide range of computational tools are needed to effectively and efficiently process large amounts of data being generated as a result of recent technological innovations in biology and medicine. A number of computational tools have been developed or adapted to deal with the experimental riches of complex and multivariate data and transition from data collection to information or knowledge. These include a wide variety of clustering and classification algorithms, including self-organized maps (SOM), artificial neural networks (ANN), support vector machines (SVM), fuzzy logic, and even hyphenated techniques as neuro-fuzzy networks. These bioinformatics tools are being evaluated and applied in various medical areas including early detection, risk assessment, classification, and prognosis of cancer. The goal of these efforts is to develop and identify bioinformatics methods with optimal sensitivity, specificity, and predictive capabilities.

Keywords: Computational Biology, Fuzzy Logic, Humans, Neoplasms, Neural Networks (Computer), Prognosis, 15208179
[Hammond20043D] Peter Hammond, Tim J Hutton, Judith E Allanson, Linda E Campbell, Raoul C M Hennekam, Sean Holden, Michael A Patton, Adam Shaw, I. Karen Temple, Matthew Trotter, Kieran C Murphy, and Robin M Winter. 3D analysis of facial morphology. Am J Med Genet A, 126(4):339-48, May 2004. [ bib | DOI | http | .pdf ]
Dense surface models can be used to analyze 3D facial morphology by establishing a correspondence of thousands of points across each 3D face image. The models provide dramatic visualizations of 3D face-shape variation with potential for training physicians to recognize the key components of particular syndromes. We demonstrate their use to visualize and recognize shape differences in a collection of 3D face images that includes 280 controls (2 weeks to 56 years of age), 90 individuals with Noonan syndrome (NS) (7 months to 56 years), and 60 individuals with velo-cardio-facial syndrome (VCFS; 3 to 17 years of age). Ten-fold cross-validation testing of discrimination between the three groups was carried out on unseen test examples using five pattern recognition algorithms (nearest mean, C5.0 decision trees, neural networks, logistic regression, and support vector machines). For discriminating between individuals with NS and controls, the best average sensitivity and specificity levels were 92 and 93% for children, 83 and 94% for adults, and 88 and 94% for the children and adults combined. For individuals with VCFS and controls, the best results were 83 and 92%. In a comparison of individuals with NS and individuals with VCFS, a correct identification rate of 95% was achieved for both syndromes. This article contains supplementary material, which may be viewed at the American Journal of Medical Genetics website at http://www.interscience.wiley.com/jpages/0148-7299/suppmat/index.html.

[Gunderson2004Decoding] Kevin L Gunderson, Semyon Kruglyak, Michael S Graige, Francisco Garcia, Bahram G Kermani, Chanfeng Zhao, Diping Che, Todd Dickinson, Eliza Wickham, Jim Bierle, Dennis Doucet, Monika Milewski, Robert Yang, Chris Siegmund, Juergen Haas, Lixin Zhou, Arnold Oliphant, Jian-Bing Fan, Steven Barnard, and Mark S Chee. Decoding randomly ordered DNA arrays. Genome Res, 14(5):870-877, May 2004. [ bib | DOI | http ]
We have developed a simple and efficient algorithm to identify each member of a large collection of DNA-linked objects through the use of hybridization, and have applied it to the manufacture of randomly assembled arrays of beads in wells. Once the algorithm has been used to determine the identity of each bead, the microarray can be used in a wide variety of applications, including single nucleotide polymorphism genotyping and gene expression profiling. The algorithm requires only a few labels and several sequential hybridizations to identify thousands of different DNA sequences with great accuracy. We have decoded tens of thousands of arrays, each with 1520 sequences represented at approximately 30-fold redundancy by up to approximately 50,000 beads, with a median error rate of <1 x 10(-4) per bead. The approach makes use of error checking codes and provides, for the first time, a direct functional quality control of every element of each array that is manufactured. The algorithm can be applied to any spatially fixed collection of objects or molecules that are associated with specific DNA sequences.

Keywords: Algorithms; Computational Biology, methods; Oligonucleotide Array Sequence Analysis, methods/trends; Random Allocation; Research Design; Sequence Analysis, DNA, methods; Silicon Dioxide, chemistry
[Galan2004Odor-driven] Roberto Fdez Galán, Silke Sachse, C. Giovanni Galizia, and Andreas V M Herz. Odor-driven attractor dynamics in the antennal lobe allow for simple and rapid olfactory pattern classification. Neural Comput, 16(5):999-1012, May 2004. [ bib | DOI | http ]
The antennal lobe plays a central role for odor processing in insects, as demonstrated by electrophysiological and imaging experiments. Here we analyze the detailed temporal evolution of glomerular activity patterns in the antennal lobe of honeybees. We represent these spatiotemporal patterns as trajectories in a multidimensional space, where each dimension accounts for the activity of one glomerulus. Our data show that the trajectories reach odor-specific steady states (attractors) that correspond to stable activity patterns at about 1 second after stimulus onset. As revealed by a detailed mathematical investigation, the trajectories are characterized by different phases: response onset, steady-state plateau, response offset, and periods of spontaneous activity. An analysis based on support-vector machines quantifies the odor specificity of the attractors and the optimal time needed for odor discrimination. The results support the hypothesis of a spatial olfactory code in the antennal lobe and suggest a perceptron-like readout mechanism that is biologically implemented in a downstream network, such as the mushroom body.

[Chang2004Analysis] Ming-Wei Chang, Chih-Jen Lin, and Ruby Chiu-Hsing Weng. Analysis of switching dynamics with competing support vector machines. IEEE Trans Neural Netw, 15(3):720-7, May 2004. [ bib | DOI | http | .pdf ]
We present a framework for the unsupervised segmentation of switching dynamics using support vector machines. Following the architecture by Pawelzik et al., where annealed competing neural networks were used to segment a nonstationary time series, in this paper, we exploit the use of support vector machines, a well-known learning technique. First, a new formulation of support vector regression is proposed. Second, an expectation-maximization step is suggested to adaptively adjust the annealing parameter. Results indicate that the proposed approach is promising.

Keywords: Algorithms, Artificial Intelligence, Bayes Theorem, Computing Methodologies, Models, Neural Networks (Computer), Non-U.S. Gov't, Normal Distribution, Regression Analysis, Research Design, Research Support, Theoretical, 15384558
[Cai2004Prediction] Yu-Dong Cai and Andrew J Doig. Prediction of Saccharomyces cerevisiae protein functional class from functional domain composition. Bioinformatics, 20(8):1292-300, May 2004. [ bib | DOI | http | .pdf ]
MOTIVATION: A key goal of genomics is to assign function to genes, especially for orphan sequences. RESULTS: We compared the clustered functional domains in the SBASE database to each protein sequence using BLASTP. This representation for a protein is a vector, where each of the non-zero entries in the vector indicates a significant match between the sequence of interest and the SBASE domain. The machine learning methods nearest neighbour algorithm (NNA) and support vector machines are used for predicting protein functional classes from this information. We find that the best results are found using the SBASE-A database and the NNA, namely 72% accuracy for 79% coverage. We tested an assigning function based on searching for InterPro sequence motifs and by taking the most significant BLAST match within the dataset. We applied the functional domain composition method to predict the functional class of 2018 currently unclassified yeast open reading frames. AVAILABILITY: A program for the prediction method, that uses NNA called Functional Class Prediction based on Functional Domains (FCPFD) is available and can be obtained by contacting Y.D.Cai at y.cai@umist.ac.uk

Keywords: biosvm
[Berretti04Graph] S. Berretti, A. Del Bimbo, and P. Pala. A graph edit distance based on node merging. In Proc. of ACM International Conference on Image and Video Retrieval (CIVR), pages 464-472, Dublin, Ireland, July 2004. [ bib | http ]
[Zou2005Regularization] H. Zou and T. Hastie. Regularization and variable selection via the Elastic Net. J. R. Stat. Soc. Ser. B, 67:301-320, 2005. [ bib | http ]
Summary. We propose the elastic net, a new regularization and variable selection method. Real world data and a simulation study show that the elastic net often outperforms the lasso, while enjoying a similar sparsity of representation. In addition, the elastic net encourages a grouping effect, where strongly correlated predictors tend to be in or out of the model together. The elastic net is particularly useful when the number of predictors (p) is much bigger than the number of observations (n). By contrast, the lasso is not a very satisfactory variable selection method in the p ≫ n case. An algorithm called LARS-EN is proposed for computing elastic net regularization paths efficiently, much like algorithm LARS does for the lasso.

Keywords: elastic-net, feature-selection, lars, lasso
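As a rough illustration of the penalty summarized in the abstract above, the sketch below minimises the naive elastic net criterion ||y - Xb||^2 + lam2 * ||b||_2^2 + lam1 * ||b||_1 by plain coordinate descent. This is not the LARS-EN algorithm that the paper proposes; coordinate descent is swapped in only because it is short. The function names, the toy data and the penalty values are assumptions made for illustration.

```python
def soft_threshold(z, gamma):
    """Shrink z towards zero by gamma (the lasso thresholding operator)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def naive_elastic_net(X, y, lam1, lam2, n_sweeps=200):
    """Coordinate descent for ||y - Xb||^2 + lam2*||b||_2^2 + lam1*||b||_1.

    X is a list of rows, y a list of responses (hypothetical toy inputs,
    assumed to be centred; no intercept is fitted).
    """
    n, p = len(X), len(X[0])
    b = [0.0] * p
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * b[k] for k in range(p) if k != j))
                for i in range(n)
            )
            # Exact minimiser of the one-dimensional penalised problem.
            b[j] = soft_threshold(rho, lam1 / 2.0) / (col_sq[j] + lam2)
    return b


# Two identical predictors: the ridge part (lam2 > 0) splits the weight evenly.
X = [[0.6, 0.6], [0.8, 0.8]]
y = [1.2, 1.6]
print(naive_elastic_net(X, y, lam1=0.2, lam2=0.5))
```

With two identical columns and lam2 > 0, both coefficients come out equal (about 0.76 each in this toy setting), which is a small instance of the grouping effect mentioned in the summary.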
[Zhu2005Kernel] J. Zhu and T. Hastie. Kernel Logistic Regression and the Import Vector Machine. Journal of Computational & Graphical Statistics, 14(1):185-205, Mar 2005. [ bib | DOI | http | .pdf ]
The support vector machine (SVM) is known for its good performance in two-class classification, but its extension to multiclass classification is still an ongoing research issue. In this article, we propose a new approach for classification, called the import vector machine (IVM), which is built on kernel logistic regression (KLR). We show that the IVM not only performs as well as the SVM in two-class classification, but also can naturally be generalized to the multiclass case. Furthermore, the IVM provides an estimate of the underlying probability. Similar to the support points of the SVM, the IVM model uses only a fraction of the training data to index kernel basis functions, typically a much smaller fraction than the SVM. This gives the IVM a potential computational advantage over the SVM.

[Zhou2005LS] Xin Zhou and KZ. Mao. LS Bound based gene selection for DNA microarray data. Bioinformatics, 21(8):1559-64, Apr 2005. [ bib | DOI | http | .pdf ]
MOTIVATION: One problem with discriminant analysis of DNA microarray data is that each sample is represented by quite a large number of genes, and many of them are irrelevant, insignificant or redundant to the discriminant problem at hand. Methods for selecting important genes are, therefore, of much significance in microarray data analysis. In the present study, a new criterion, called LS Bound measure, is proposed to address the gene selection problem. The LS Bound measure is derived from leave-one-out procedure of LS-SVMs (least squares support vector machines), and as the upper bound for leave-one-out classification results it reflects to some extent the generalization performance of gene subsets. RESULTS: We applied this LS Bound measure for gene selection on two benchmark microarray datasets: colon cancer and leukemia. We also compared the LS Bound measure with other evaluation criteria, including the well-known Fisher's ratio and Mahalanobis class separability measure, and other published gene selection algorithms, including Weighting factor and SVM Recursive Feature Elimination. The strength of the LS Bound measure is that it provides gene subsets leading to more accurate classification results than the filter method while its computational complexity is at the level of the filter method. AVAILABILITY: A companion website can be accessed at http://www.ntu.edu.sg/home5/pg02776030/lsbound/. The website contains: (1) the source code of the gene selection algorithm; (2) the complete set of tables and figures regarding the experimental study; (3) proof of the inequality (9). CONTACT: ekzmao@ntu.edu.sg.

Keywords: biosvm featureselection microarray
[Zhou2005Recognition] GuoDong Zhou, Dan Shen, Jie Zhang, Jian Su, and SoonHeng Tan. Recognition of protein/gene names from text using an ensemble of classifiers. BMC Bioinformatics, 6 Suppl 1:S7, 2005. [ bib | DOI | http | .pdf ]
This paper proposes an ensemble of classifiers for biomedical name recognition in which three classifiers, one Support Vector Machine and two discriminative Hidden Markov Models, are combined effectively using a simple majority voting strategy. In addition, we incorporate three post-processing modules, including an abbreviation resolution module, a protein/gene name refinement module and a simple dictionary matching module, into the system to further improve the performance. Evaluation shows that our system achieves the best performance from among 10 systems with a balanced F-measure of 82.58 on the closed evaluation of the BioCreative protein/gene name recognition task (Task 1A).

Keywords: biosvm nlp
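The simple majority voting strategy mentioned in the abstract above can be sketched as follows. This is only an illustration of the general idea, assuming each classifier emits one label per token; the function name `majority_vote`, the B/I/O labels and the tie-breaking rule (the first classifier's label wins a three-way tie) are assumptions and are not taken from the paper.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-token labels from several classifiers by majority vote.

    predictions: one label sequence per classifier, all of equal length
    (hypothetical inputs). On a tie, the label seen first, i.e. the one
    from the earliest classifier in the list, is kept.
    """
    voted = []
    for labels in zip(*predictions):
        label, _count = Counter(labels).most_common(1)[0]
        voted.append(label)
    return voted


# Three classifiers tagging the same three tokens.
svm_out = ["B", "I", "O"]
hmm1_out = ["B", "O", "O"]
hmm2_out = ["O", "I", "O"]
print(majority_vote([svm_out, hmm1_out, hmm2_out]))
```

With three voters and two candidate labels a tie cannot occur, so the tie-breaking rule only matters when all three classifiers disagree.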
[Zheng2005Foley-Sammon] Wenming Zheng, Li Zhao, and Cairong Zou. Foley-Sammon optimal discriminant vectors using kernel approach. IEEE Trans Neural Netw, 16(1):1-9, Jan 2005. [ bib ]
A new nonlinear feature extraction method called kernel Foley-Sammon optimal discriminant vectors (KFSODVs) is presented in this paper. This new method extends the well-known Foley-Sammon optimal discriminant vectors (FSODVs) from linear domain to a nonlinear domain via the kernel trick that has been used in support vector machine (SVM) and other commonly used kernel-based learning algorithms. The proposed method also provides an effective technique to solve the so-called small sample size (SSS) problem which exists in many classification problems such as face recognition. We give the derivation of KFSODV and conduct experiments on both simulated and real data sets to confirm that the KFSODV method is superior to the previous commonly used kernel-based learning algorithms in terms of the performance of discrimination.

[Zhang2005Descriptor-based] Z. Zhang, S. Kochhar, and MG. Grigorov. Descriptor-based protein remote homology identification. Protein Sci., 42(2):431-444, 2005. [ bib | DOI | http | .pdf ]
Here, we report a novel protein sequence descriptor-based remote homology identification method, able to infer fold relationships without the explicit knowledge of structure. In a first phase, we have individually benchmarked 13 different descriptor types in fold identification experiments in a highly diverse set of protein sequences. The relevant descriptors were related to the fold class membership by using simple similarity measures in the descriptor spaces, such as the cosine angle. Our results revealed that the three best-performing sets of descriptors were the sequence-alignment-based descriptor using PSI-BLAST e-values, the descriptors based on the alignment of secondary structural elements (SSEA), and the descriptors based on the occurrence of PROSITE functional motifs. In a second phase, the three top-performing descriptors were combined to obtain a final method with improved performance, which we named DescFold. Class membership was predicted by Support Vector Machine (SVM) learning. In comparison with the individual PSI-BLAST-based descriptor, the rate of remote homology identification increased from 33.7% [...] was able to identify the true remote homolog for nearly every sixth sequence at the 95% [...] PSI-BLAST search. We have benchmarked the DescFold method against several other state-of-the-art fold recognition algorithms for the 172 LiveBench-8 targets, and we concluded that it was able to add value to the existing techniques by providing a confident hit for at least 10% [...] known methods.

Keywords: biosvm
[Zhang2005Study] Lu-Da Zhang, Shi-Guang Su, Lai-Sheng Wang, Jun-Hui Li, and Li-Ming Yang. Study on application of Fourier transformation near-infrared spectroscopy analysis with support vector machine (SVM). Guang Pu Xue Yu Guang Pu Fen Xi, 25(1):33-5, Jan 2005. [ bib ]
Support Vector Machine (SVM) is a method for the research on identifying two types of problem. It is the latest branch in the statistics study theories, and the identification model has a strict mathematics foundation. In this paper, the basic principle and method of SVM are not only introduced, but also applied to chemometrics. One hundred and three rhubarb samples were used as experimental materials. The identification models were established with near-infrared spectroscopy and SVM training method with the intention of identifying whether the rhubarb samples are true or false. The thirty-three samples in training set were identified by the identifying models with the accurate rate of 100%, while seventy estimate samples had an accurate rate of 96.77%. The research result provided the method of identifying the traditional Chinese medicine rhubarb quickly. So, it shows the feasibility of establishing the models with near-infrared spectroscopy and SVM method to identify biological samples. This paper introduced the theme of SVM training method in order to beget the attention of the research members who deal with chemometrics.

[Zhang2005MULTIPRED] GL. Zhang, AM. Khan, KN. Srinivasan, JT. August, and V. Brusic. MULTIPRED: a computational system for prediction of promiscuous HLA binding peptides. Nucleic Acids Res., 33(Web Server issue):W172-W179, Jul 2005. [ bib | DOI | http ]
MULTIPRED is a web-based computational system for the prediction of peptide binding to multiple molecules (proteins) belonging to human leukocyte antigens (HLA) class I A2, A3 and class II DR supertypes. It uses hidden Markov models and artificial neural network methods as predictive engines. A novel data representation method enables MULTIPRED to predict peptides that promiscuously bind multiple HLA alleles within one HLA supertype. Extensive testing was performed for validation of the prediction models. Testing results show that MULTIPRED is both sensitive and specific and it has good predictive ability (area under the receiver operating characteristic curve A(ROC) > 0.80). MULTIPRED can be used for the mapping of promiscuous T-cell epitopes as well as the regions of high concentration of these targets-termed T-cell epitope hotspots. MULTIPRED is available at http://antigen.i2r.a-star.edu.sg/multipred/.

Keywords: Algorithms, Amino Acid Sequence, Antigen-Antibody Complex, Automated, Binding Sites, Computational Biology, Drug Delivery Systems, Drug Design, Epitopes, HLA Antigens, HLA-A Antigens, HLA-DR Antigens, Humans, Internet, Markov Chains, Molecular Sequence Data, Neural Networks (Computer), Pattern Recognition, Peptides, Protein, Protein Binding, Protein Interaction Mapping, Sequence Analysis, Software, T-Lymphocyte, User-Computer Interface, Viral Vaccines, 15980449
[Zaki2005Application] NM. Zaki, S. Deris, and R. Illias. Application of string kernels in protein sequence classification. Appl. Bioinformatics, 4(1):45-52, 2005. [ bib ]
INTRODUCTION: The production of biological information has become much greater than its consumption. The key issue now is how to organise and manage the huge amount of novel information to facilitate access to this useful and important biological information. One core problem in classifying biological information is the annotation of new protein sequences with structural and functional features. METHOD: This article introduces the application of string kernels in classifying protein sequences into homogeneous families. A string kernel approach used in conjunction with support vector machines has been shown to achieve good performance in text categorisation tasks. We evaluated and analysed the performance of this approach, and we present experimental results on three selected families from the SCOP (Structural Classification of Proteins) database. We then compared the overall performance of this method with the existing protein classification methods on benchmark SCOP datasets. RESULTS: According to the F1 performance measure and the rate of false positive (RFP) measure, the string kernel method performs well in classifying protein sequences. The method outperformed all the generative-based methods and is comparable with the SVM-Fisher method. DISCUSSION: Although the string kernel approach makes no use of prior biological knowledge, it still captures sufficient biological information to enable it to outperform some of the state-of-the-art methods.

Keywords: biosvm
[Yu2005Learning] Kai Yu, Volker Tresp, and Anton Schwaighofer. Learning Gaussian processes from multiple tasks. In ICML '05: Proceedings of the 22nd international conference on Machine learning, pages 1012-1019, New York, NY, USA, 2005. ACM. [ bib | DOI ]
[Yu2005integrated] J.-K. Yu, S. Zheng, Y. Tang, and L. Li. An integrated approach utilizing proteomics and bioinformatics to detect ovarian cancer. J Zhejiang Univ Sci B, 6(4):227-31, Apr 2005. [ bib | DOI | http | .pdf ]
OBJECTIVE: To find new potential biomarkers and establish the patterns for the detection of ovarian cancer. METHODS: Sixty one serum samples including 32 ovarian cancer patients and 29 healthy people were detected by surface-enhanced laser desorption/ionization mass spectrometry (SELDI-MS). The protein fingerprint data were analyzed by bioinformatics tools. Ten folds cross-validation support vector machine (SVM) was used to establish the diagnostic pattern. RESULTS: Five potential biomarkers were found (2085 Da, 5881 Da, 7564 Da, 9422 Da, 6044 Da), combined with which the diagnostic pattern separated the ovarian cancer from the healthy samples with a sensitivity of 96.7%, a specificity of 96.7% and a positive predictive value of 96.7%. CONCLUSIONS: The combination of SELDI with bioinformatics tools could find new biomarkers and establish patterns with high sensitivity and specificity for the detection of ovarian cancer.

Keywords: biosvm
[Yu2005Classifying] C. Yu, N. Zavaljevski, FJ. Stevens, K. Yackovich, and J. Reifman. Classifying noisy protein sequence data: a case study of immunoglobulin light chains. Bioinformatics, 21(Supp 1):i495-i501, Jun 2005. [ bib | DOI | http | .pdf ]
SUMMARY: The classification of protein sequences obtained from patients with various immunoglobulin-related conformational diseases may provide insight into structural correlates of pathogenicity. However, clinical data are very sparse and, in the case of antibody-related proteins, the collected sequences have large variability with only a small subset of variations relevant to the protein pathogenicity (function). On this basis, these sequences represent a model system for development of strategies to recognize the small subset of function-determining variations among the much larger number of primary structure diversifications introduced during evolution. Under such conditions, most protein classification algorithms have limited accuracy. To address this problem, we propose a support vector machine (SVM)-based classifier that combines sequence and 3D structural averaging information. Each amino acid in the sequence is represented by a set of six physicochemical properties: hydrophobicity, hydrophilicity, volume, surface area, bulkiness and refractivity. Each position in the sequence is described by the properties of the amino acid at that position and the properties of its neighbors in 3D space or in the sequence. A structure template is selected to determine neighbors in 3D space and a window size is used to determine the neighbors in the sequence. The test data consist of 209 proteins of human antibody immunoglobulin light chains, each represented by aligned sequences of 120 amino acids. The methodology is applied to the classification of protein sequences collected from patients with and without amyloidosis, and indicates that the proposed modified classifiers are more robust to sequence variability than standard SVM classifiers, improving classification error between 5 and 25% and sensitivity between 9 and 17%. The classification results might also suggest possible mechanisms for the propensity of immunoglobulin light chains to amyloid formation. 
CONTACT: cyu@bioanalysis.org.

Keywords: biosvm
[Young2005Using] JA. Young and EA. Winzeler. Using expression information to discover new drug and vaccine targets in the malaria parasite Plasmodium falciparum. Pharmacogenomics, 6(1):17-26, Jan 2005. [ bib ]
The recent completion of the malaria parasite Plasmodium falciparum genome has opened the door for applying a variety of genomic-based systems biology approaches that complement existing gene-by-gene methods of investigation. Transcriptomic analyses of P. falciparum using DNA microarrays have allowed for the rapid elucidation of gene function, parasite drug response, and in vivo expression profiles, as well as general mechanisms guiding the parasite life cycle that are vital to disease pathogenesis. The results of these studies have identified promising novel gene targets for the development of new drug and vaccine therapies.

Keywords: plasmodium
[Young2005Plasmodium] Jason A Young, Quinton L Fivelman, Peter L Blair, Patricia de la Vega, Karine G Le Roch, Yingyao Zhou, Daniel J Carucci, David A Baker, and Elizabeth A Winzeler. The Plasmodium falciparum sexual development transcriptome: a microarray analysis using ontology-based pattern identification. Mol. Biochem. Parasitol., 143(1):67-79, Sep 2005. [ bib | DOI | http | .pdf ]
The sexual stages of malarial parasites are essential for the mosquito transmission of the disease and therefore are the focus of transmission-blocking drug and vaccine development. In order to better understand genes important to the sexual development process, the transcriptomes of high-purity stage I-V Plasmodium falciparum gametocytes were comprehensively profiled using a full-genome high-density oligonucleotide microarray. The interpretation of this transcriptional data was aided by applying a novel knowledge-based data-mining algorithm termed ontology-based pattern identification (OPI) using current information regarding known sexual stage genes as a guide. This analysis resulted in the identification of a sexual development cluster containing 246 genes, of which approximately 75% were hypothetical, exhibiting highly-correlated, gametocyte-specific expression patterns. Inspection of the upstream promoter regions of these 246 genes revealed putative cis-regulatory elements for sexual development transcriptional control mechanisms. Furthermore, OPI analysis was extended using current annotations provided by the Gene Ontology Consortium to identify 380 statistically significant clusters containing genes with expression patterns characteristic of various biological processes, cellular components, and molecular functions. Collectively, these results, available as part of a web-accessible OPI database (http://carrier.gnf.org/publications/Gametocyte), shed light on the components of molecular mechanisms underlying parasite sexual development and other areas of malarial parasite biology.

Keywords: plasmodium
[Yiu2005Filtering] SM. Yiu, Prudence WH. Wong, T.W. Lam, Y.C. Mui, HF. Kung, Marie Lin, and YT. Cheung. Filtering of Ineffective siRNAs and Improved siRNA Design Tool. Bioinformatics, 21(2):144-151, Jan 2005. To appear. [ bib | DOI | http | .pdf ]
Motivation: Short interfering RNAs (siRNAs) can be used to suppress gene expression and possess many potential applications in therapy, but how to design an effective siRNA is still not clear. Based on the MPI (Max-Planck-Institute) basic principles, a number of siRNA design tools have been developed recently. The set of candidates reported by these tools is usually large and often contains ineffective siRNAs. In view of this, we initiate the study of filtering ineffective siRNAs. Results: The contribution of this paper is 2-fold. First, we propose a fair scheme to compare existing design tools based on real data in the literature. Second, we attempt to improve the MPI principles and existing tools by an algorithm that can filter ineffective siRNAs. The algorithm is based on some new observations on the secondary structure, which we have verified by AI techniques (decision trees and support vector machines). We have tested our algorithm together with the MPI principles and the existing tools. The results show that our filtering algorithm is effective. Availability: The siRNA design software tool can be found in the website http://www.cs.hku.hksirna/ Contact: smyiu@cs.hku.hk

Keywords: biosvm
[Yap2005Prediction] CW. Yap and YZ. Chen. Prediction of Cytochrome P450 3A4, 2D6, and 2C9 Inhibitors and Substrates by Using Support Vector Machines. J Chem Inf Model, 45(4):982-92, 2005. [ bib | DOI | http | .pdf ]
Statistical learning methods have been used in developing filters for predicting inhibitors of two P450 isoenzymes, CYP3A4 and CYP2D6. This work explores the use of different statistical learning methods for predicting inhibitors of these enzymes and an additional P450 enzyme, CYP2C9, and the substrates of the three P450 isoenzymes. Two consensus support vector machine (CSVM) methods, "positive majority" (PM-CSVM) and "positive probability" (PP-CSVM), were used in this work. These methods were first tested for the prediction of inhibitors of CYP3A4 and CYP2D6 by using a significantly higher number of inhibitors and noninhibitors than that used in earlier studies. They were then applied to the prediction of inhibitors of CYP2C9 and substrates of the three enzymes. Both methods predict inhibitors of CYP3A4 and CYP2D6 at a similar level of accuracy as those of earlier studies. For classification of inhibitors of CYP2C9, the best CSVM method gives an accuracy of 88.9% for inhibitors and 96.3% for noninhibitors. The accuracies for classification of substrates and nonsubstrates of CYP3A4, CYP2D6, and CYP2C9 are 98.2 and 90.9%, 96.6 and 94.4%, and 85.7 and 98.8%, respectively. Both CSVM methods are potentially useful as filters for predicting inhibitors and substrates of P450 isoenzymes. These methods generally give better accuracies than single SVM classification systems, and the performance of the PP-CSVM method is slightly better than that of the PM-CSVM method.

Keywords: biosvm chemoinformatics
[Yanover2005Predicting] Chen Yanover and Tomer Hertz. Predicting protein-peptide binding affinity by learning peptide-peptide distance functions. In RECOMB, pages 456-471, 2005. [ bib ]
[Yamanishi2005Supervised] Y. Yamanishi, J.-P. Vert, and M. Kanehisa. Supervised enzyme network inference from the integration of genomic data and chemical information. Bioinformatics, 21:i468-i477, 2005. [ bib | DOI | http | .pdf ]
Motivation: The metabolic network is an important biological network which relates enzyme proteins and chemical compounds. A large number of metabolic pathways remain unknown nowadays, and many enzymes are missing even in known metabolic pathways. There is, therefore, an incentive to develop methods to reconstruct the unknown parts of the metabolic network and to identify genes coding for missing enzymes. Results: This paper presents new methods to infer enzyme networks from the integration of multiple genomic data and chemical information, in the framework of supervised graph inference. The originality of the methods is the introduction of chemical compatibility as a constraint for refining the network predicted by the network inference engine. The chemical compatibility between two enzymes is obtained automatically from the information encoded by their Enzyme Commission (EC) numbers. The proposed methods are tested and compared on their ability to infer the enzyme network of the yeast Saccharomyces cerevisiae from four datasets for enzymes with assigned EC numbers: gene expression data, protein localization data, phylogenetic profiles and chemical compatibility information. It is shown that the prediction accuracy of the network reconstruction consistently improves owing to the introduction of chemical constraints, the use of a supervised approach and the weighted integration of multiple datasets. Finally, we conduct a comprehensive prediction of a global enzyme network consisting of all enzyme candidate proteins of the yeast to obtain new biological findings.

Keywords: biosvm
[Yamada2005Accelerated] T. Yamada and S. Morishita. Accelerated off-target search algorithm for siRNA. Bioinformatics, 21(8):1316-24, Apr 2005. [ bib | DOI | http ]
MOTIVATION: Designing highly effective short interfering RNA (siRNA) sequences with maximum target-specificity for mammalian RNA interference (RNAi) is one of the hottest topics in molecular biology. The relationship between siRNA sequences and RNAi activity has been studied extensively to establish rules for selecting highly effective sequences. However, there is a pressing need to compute siRNA sequences that minimize off-target silencing effects efficiently and to match any non-targeted sequences with mismatches. RESULTS: The enumeration of potential cross-hybridization candidates is non-trivial, because siRNA sequences are short, ca. 19 nt in length, and at least three mismatches with non-targets are required. With at least three mismatches, there are typically four or five contiguous matches, so that a BLAST search frequently overlooks off-target candidates. By contrast, existing accurate approaches are expensive to execute; thus we need to develop an accurate, efficient algorithm that uses seed hashing, the pigeonhole principle, and combinatorics to identify mismatch patterns. Tests show that our method can list potential cross-hybridization candidates for any siRNA sequence of selected human gene rapidly, outperforming traditional methods by orders of magnitude in terms of computational performance. AVAILABILITY: http://design.RNAi.jp CONTACT: yamada@cb.k.u-tokyo.ac.jp.

Keywords: sirna
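The [Yamada2005Accelerated] abstract mentions seed hashing and the pigeonhole principle. As a generic sketch of that idea (not the authors' algorithm): if a query is split into k+1 contiguous blocks, any site matching it with at most k mismatches must contain at least one block exactly, so hashed exact block hits can serve as candidates before a full mismatch count. The function names, the block-splitting rule, and the toy sequences are assumptions for illustration.

```python
from collections import defaultdict

def split_into_seeds(query, k):
    """Split query into k+1 contiguous blocks; return (offset, block) pairs."""
    n, parts = len(query), k + 1
    bounds = [i * n // parts for i in range(parts + 1)]
    return [(bounds[i], query[bounds[i]:bounds[i + 1]]) for i in range(parts)]

def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def find_near_matches(query, text, k):
    """Start positions in text where query matches with at most k mismatches."""
    n = len(query)
    seeds = split_into_seeds(query, k)
    # Hash every substring of each seed length to its start positions.
    index = defaultdict(list)
    for length in {len(seed) for _, seed in seeds}:
        for pos in range(len(text) - length + 1):
            index[text[pos:pos + length]].append(pos)
    # Pigeonhole: every true hit shares at least one exact seed with the query.
    candidates = set()
    for offset, seed in seeds:
        for pos in index.get(seed, []):
            start = pos - offset
            if 0 <= start <= len(text) - n:
                candidates.add(start)
    # Verify each candidate with a full mismatch count.
    return sorted(s for s in candidates if hamming(query, text[s:s + n]) <= k)

# Toy 19-nt query and a text containing a copy with two substitutions.
query = "ACGTACGTTGCAACGGTCA"
text = "TTTT" + "CCGTACGTTGCAACGGTCG" + "GGGG"
print(find_near_matches(query, text, 2))
```

Only positions sharing an exact seed are verified, which is what lets this kind of filter skip most of the text; the index here is rebuilt per query for brevity, whereas a practical tool would precompute it.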
[Yabuki2005GRIFFIN] Y. Yabuki, T. Muramatsu, T. Hirokawa, H. Mukai, and M. Suwa. GRIFFIN: a system for predicting GPCR-G-protein coupling selectivity using a support vector machine and a hidden Markov model. Nucleic Acids Res., 33(Web Server issue):W148-53, Jul 2005. [ bib | DOI | http | .pdf ]
We describe a novel system, GRIFFIN (G-protein and Receptor Interaction Feature Finding INstrument), that predicts G-protein coupled receptor (GPCR) and G-protein coupling selectivity based on a support vector machine (SVM) and a hidden Markov model (HMM) with high sensitivity and specificity. Based on our assumption that whole structural segments of ligands, GPCRs and G-proteins are essential to determine GPCR and G-protein coupling, various quantitative features were selected for ligands, GPCRs and G-protein complex structures, and those parameters that are the most effective in selecting G-protein type were used as feature vectors in the SVM. The main part of GRIFFIN includes a hierarchical SVM classifier using the feature vectors, which is useful for Class A GPCRs, the major family. For the opsins and olfactory subfamilies of Class A and other minor families (Classes B, C, frizzled and smoothened), the binding G-protein is predicted with high accuracy using the HMM. Applying this system to known GPCR sequences, each binding G-protein is predicted with high sensitivity and specificity (>85% on average). GRIFFIN (http://griffin.cbrc.jp/) is freely available and allows users to easily execute this reliable prediction of G-proteins.

Keywords: biosvm
[Xie2005LOCSVMPSI] Dan Xie, Ao Li, Minghui Wang, Zhewen Fan, and Huanqing Feng. LOCSVMPSI: a web server for subcellular localization of eukaryotic proteins using SVM and profile of PSI-BLAST. Nucleic Acids Res., 33(Web Server issue):W105-10, Jul 2005. [ bib | DOI | http | .pdf ]
Subcellular location of a protein is one of the key functional characters as proteins must be localized correctly at the subcellular level to have normal biological function. In this paper, a novel method named LOCSVMPSI has been introduced, which is based on the support vector machine (SVM) and the position-specific scoring matrix generated from profiles of PSI-BLAST. With a jackknife test on the RH2427 data set, LOCSVMPSI achieved a high overall prediction accuracy of 90.2%, which is higher than the prediction results by SubLoc and ESLpred on this data set. In addition, prediction performance of LOCSVMPSI was evaluated with 5-fold cross validation test on the PK7579 data set and the prediction results were consistently better than the previous method based on several SVMs using composition of both amino acids and amino acid pairs. Further test on the SWISSPROT new-unique data set showed that LOCSVMPSI also performed better than some widely used prediction methods, such as PSORTII, TargetP and LOCnet. All these results indicate that LOCSVMPSI is a powerful tool for the prediction of eukaryotic protein subcellular localization. An online web server (current version is 1.3) based on this method has been developed and is freely available to both academic and commercial users, which can be accessed at http://Bioinformatics.ustc.edu.cn/LOCSVMPSI/LOCSVMPSI.php.

Keywords: biosvm
[Weinberger2005Nonlinear] K. Weinberger, B. Packer, and L. Saul. Nonlinear Dimensionality Reduction by Semidefinite Programming and Kernel Matrix Factorization. In RG. Cowell and Z. Ghahramani, editors, Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, Jan 6-8, 2005, Savannah Hotel, Barbados, pages 381-388. Society for Artificial Intelligence and Statistics, 2005. [ bib | www: ]
We describe an algorithm for nonlinear dimensionality reduction based on semidefinite programming and kernel matrix factorization. The algorithm learns a kernel matrix for high dimensional data that lies on or near a low dimensional manifold. In earlier work, the kernel matrix was learned by maximizing the variance in feature space while preserving the distances and angles between nearest neighbors. In this paper, adapting recent ideas from semi-supervised learning on graphs, we show that the full kernel matrix can be very well approximated by a product of smaller matrices. Representing the kernel matrix in this way, we can reformulate the semidefinite program in terms of a much smaller submatrix of inner products between randomly chosen landmarks. The new framework leads to order-of-magnitude reductions in computation time and makes it possible to study much larger problems in manifold learning.

Keywords: dimred
[Wei2005study] Liyang Wei, Yongyi Yang, Robert M Nishikawa, and Yulei Jiang. A study on several machine-learning methods for classification of malignant and benign clustered microcalcifications. IEEE Trans Med Imaging, 24(3):371-80, Mar 2005. [ bib ]
In this paper, we investigate several state-of-the-art machine-learning methods for automated classification of clustered microcalcifications (MCs). The classifier is part of a computer-aided diagnosis (CADx) scheme that is aimed at assisting radiologists in making more accurate diagnoses of breast cancer on mammograms. The methods we considered were: support vector machine (SVM), kernel Fisher discriminant (KFD), relevance vector machine (RVM), and committee machines (ensemble averaging and AdaBoost), of which most have been developed recently in statistical learning theory. We formulated differentiation of malignant from benign MCs as a supervised learning problem, and applied these learning methods to develop the classification algorithm. As input, these methods used image features automatically extracted from clustered MCs. We tested these methods using a database of 697 clinical mammograms from 386 cases, which included a wide spectrum of difficult-to-classify cases. We analyzed the distribution of the cases in this database using the multidimensional scaling technique, which reveals that in the feature space the malignant cases are not trivially separable from the benign ones. We used receiver operating characteristic (ROC) analysis to evaluate and to compare classification performance by the different methods. In addition, we also investigated how to combine information from multiple-view mammograms of the same case so that the best decision can be made by a classifier. In our experiments, the kernel-based methods (i.e., SVM, KFD, and RVM) yielded the best performance (Az = 0.85, SVM), significantly outperforming a well-established, clinically-proven CADx approach that is based on neural network (Az = 0.80).

[Waring2005Face] Christopher A Waring and Xiuwen Liu. Face detection using spectral histograms and SVMs. IEEE Trans Syst Man Cybern B Cybern, 35(3):467-76, Jun 2005. [ bib ]
We present a face detection method using spectral histograms and support vector machines (SVMs). Each image window is represented by its spectral histogram, which is a feature vector consisting of histograms of filtered images. Using statistical sampling, we show systematically that the representation groups face images together; in comparison, commonly used representations often do not exhibit this necessary and desirable property. By using an SVM trained on a set of 4500 face and 8000 nonface images, we obtain a robust classifying function for face and nonface patterns. With an effective illumination-correction algorithm, our system reliably discriminates face and nonface patterns in images under different kinds of conditions. Our method gives the best performance among recent face-detection methods on two commonly used data sets. We attribute the high performance to the desirable properties of the spectral histogram representation and good generalization of SVMs. Several further improvements in computation time and in performance are discussed.

[Wang2005Gene] Yu Wang, Igor V Tetko, Mark A Hall, Eibe Frank, Axel Facius, Klaus F X Mayer, and Hans W Mewes. Gene selection from microarray data for cancer classification-a machine learning approach. Comput. Biol. Chem., 29(1):37-46, Feb 2005. [ bib | DOI | http | .pdf ]
A DNA microarray can track the expression levels of thousands of genes simultaneously. Previous research has demonstrated that this technology can be useful in the classification of cancers. Cancer microarray data normally contains a small number of samples which have a large number of gene expression levels as features. To select relevant genes involved in different types of cancer remains a challenge. In order to extract useful gene information from cancer microarray data and reduce dimensionality, feature selection algorithms were systematically investigated in this study. Using a correlation-based feature selector combined with machine learning algorithms such as decision trees, naïve Bayes and support vector machines, we show that classification performance at least as good as published results can be obtained on acute leukemia and diffuse large B-cell lymphoma microarray data sets. We also demonstrate that a combined use of different classification and feature selection approaches makes it possible to select relevant genes with high confidence. This is also the first paper which discusses both computational and biological evidence for the involvement of zyxin in leukaemogenesis.

Keywords: biosvm microarray
[Wang2005Gene-expression] Y. Wang, J.G.M. Klijn, Y. Zhang, A.M. Sieuwerts, M.P. Look, F. Yang, D. Talantov, M. Timmermans, M.E. Meijer-van Gelder, J. Yu, T. Jatkoe, E.M.J.J. Berns, D. Atkins, and J.A. Foekens. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancers. Lancet, 365(9460):671-679, 2005. [ bib | DOI | http | .pdf ]
BACKGROUND: Genome-wide measures of gene expression can identify patterns of gene activity that subclassify tumours and might provide a better means than is currently available for individual risk assessment in patients with lymph-node-negative breast cancer. METHODS: We analysed, with Affymetrix Human U133a GeneChips, the expression of 22000 transcripts from total RNA of frozen tumour samples from 286 lymph-node-negative patients who had not received adjuvant systemic treatment. FINDINGS: In a training set of 115 tumours, we identified a 76-gene signature consisting of 60 genes for patients positive for oestrogen receptors (ER) and 16 genes for ER-negative patients. This signature showed 93% sensitivity and 48% specificity in a subsequent independent testing set of 171 lymph-node-negative patients. The gene profile was highly informative in identifying patients who developed distant metastases within 5 years (hazard ratio 5.67 [95% CI 2.59-12.4]), even when corrected for traditional prognostic factors in multivariate analysis (5.55 [2.46-12.5]). The 76-gene profile also represented a strong prognostic factor for the development of metastasis in the subgroups of 84 premenopausal patients (9.60 [2.28-40.5]), 87 postmenopausal patients (4.04 [1.57-10.4]), and 79 patients with tumours of 10-20 mm (14.1 [3.34-59.2]), a group of patients for whom prediction of prognosis is especially difficult. INTERPRETATION: The identified signature provides a powerful tool for identification of patients at high risk of distant recurrence. The ability to identify patients who have a favourable prognosis could, after independent confirmation, allow clinicians to avoid adjuvant systemic therapy or to choose less aggressive therapeutic options.

Keywords: microarray, breastcancer
[Wang2005Prediction] Ming-Lei Wang, Hui Yao, and Wen-Bo Xu. Prediction by support vector machines and analysis by Z-score of poly-L-proline type II conformation based on local sequence. Comput. Biol. Chem., 29(2):95-100, Apr 2005. [ bib | DOI | http | .pdf ]
In recent years, the poly-L-proline type II (PPII) conformation has gained more and more importance. This structure plays vital roles in many biological processes. But few studies have been made to predict PPII secondary structures computationally. The support vector machine (SVM) represents a new approach to supervised pattern classification and has been successfully applied to a wide range of pattern recognition problems. In this paper, we present a SVM prediction method of PPII conformation based on local sequence. The overall accuracy for both the independent testing set and estimate of jackknife testing reached approximately 70%. Matthew's correlation coefficient (MCC) could reach 0.4. By comparing the results of training and testing datasets with different sequence identities, we suggest that the performance of this method correlates with the sequence identity of dataset. The parameter of SVM kernel function was an important factor to the performance of this method. The propensities of residues located at different positions were also analyzed. By computing Z-scores, we found that P and G were the two most important residues to PPII structure conformation.

Keywords: biosvm
[Wang2005Using] M. Wang, J. Yang, and K.-C. Chou. Using string kernel to predict signal peptide cleavage site based on subsite coupling model. Amino Acids, 28(4):395-402, Jun 2005. [ bib | DOI | http | .pdf ]
Owing to the importance of signal peptides for studying the molecular mechanisms of genetic diseases, reprogramming cells for gene therapy, and finding new drugs for healing a specific defect, it is in great demand to develop a fast and accurate method to identify the signal peptides. Introduction of the so-called -3,-1, +1 coupling model (Chou, K. C.: Protein Engineering, 2001, 14-2, 75-79) has made it possible to take into account the coupling effect among some key subsites and hence can significantly enhance the prediction quality of peptide cleavage site. Based on the subsite coupling model, a kind of string kernels for protein sequence is introduced. Integrating the biologically relevant prior knowledge, the constructed string kernels can thus be used by any kernel-based method. A Support vector machines (SVM) is thus built to predict the cleavage site of signal peptides from the protein sequences. The current approach is compared with the classical weight matrix method. At small false positive ratios, our method outperforms the classical weight matrix method, indicating the current approach may at least serve as a powerful complemental tool to other existing methods for predicting the signal peptide cleavage site. The software that generated the results reported in this paper is available upon requirement, and will appear at http://www.pami.sjtu.edu.cn/wm.

Keywords: biosvm
[Wang2005computer] J. F. Wang, C. Z. Cai, C. Y. Kong, Z. W. Cao, and Y. Z. Chen. A computer method for validating traditional Chinese medicine herbal prescriptions. Am J Chin Med, 33(2):281-97, 2005. [ bib ]
Traditional Chinese medicine (TCM) has been widely practiced and is considered as an alternative to conventional medicine. TCM herbal prescriptions contain a mixture of herbs that collectively exert therapeutic actions and modulating effects. Traditionally defined herbal properties, related to the pharmacodynamic, pharmacokinetic and toxicological, as well as physicochemical properties of their principal ingredients, have been used as the basis for formulating TCM multi-herb prescriptions. These properties are used in this work to develop a computer program for predicting whether a multi-herb recipe is a valid TCM prescription. This program is based on a statistical learning method, support vector machine (SVM), and it is trained by using 575 well-known TCM prescriptions and 1961 non-TCM recipes generated by random combination of TCM herbs. Testing results by using 72 well-known TCM prescriptions and 5039 non-TCM recipes showed that 73.6% of the TCM prescriptions and 99.9% of non-TCM recipes are correctly classified by this system. A further test by using 48 TCM prescriptions published in recent years found that 68.7% of these are correctly classified. These accuracies are comparable to those of SVM classification of other biological systems. Our study indicates the potential of SVM for facilitating the analysis of TCM prescriptions.

Keywords: Artificial Intelligence, Conservation of Natural Resources, Decision Support Techniques, Ecosystem, Environment, Forestry, Regression Analysis, Spain, 15974487
[Wang2005Protein] J. Wang, W.-K. Sung, A. Krishnan, and K.-B. Li. Protein subcellular localization prediction for Gram-negative bacteria using amino acid subalphabets and a combination of multiple support vector machines. BMC Bioinformatics, 6(1):174, Jul 2005. [ bib | DOI | http | .pdf ]
BACKGROUND: Predicting the subcellular localization of proteins is important for determining the function of proteins. Previous works focused on predicting protein localization in Gram-negative bacteria obtained good results. However, these methods had relatively low accuracies for the localization of extracellular proteins. This paper studies ways to improve the accuracy for predicting extracellular localization in Gram-negative bacteria. RESULTS: We have developed a system for predicting the subcellular localization of proteins for Gram-negative bacteria based on amino acid subalphabets and a combination of multiple support vector machines. The recall of the extracellular site and overall recall of our predictor reach 86.0% and 89.8%, respectively, in 5-fold cross-validation. To the best of our knowledge, these are the most accurate results for predicting subcellular localization in Gram-negative bacteria. CONCLUSIONS: Clustering 20 amino acids into a few groups by the proposed greedy algorithm provides a new way to extract features from protein sequences to cover more adjacent amino acids and hence reduce the dimensionality of the input vector of protein features. It was observed that a good amino acid grouping leads to an increase in prediction performance. Furthermore, a proper choice of a subset of complementary support vector machines constructed by different features of proteins maximizes the prediction accuracy.

Keywords: biosvm
[Vlahovicek2005SBASE] Kristian Vlahovicek, László Kaján, Vilmos Agoston, and Sándor Pongor. The SBASE domain sequence resource, release 12: prediction of protein domain-architecture using support vector machines. Nucleic Acids Res, 33(Database issue):D223-5, Jan 2005. [ bib | DOI | http | .pdf ]
SBASE (http://www.icgeb.trieste.it/sbase) is an online resource designed to facilitate the detection of domain homologies based on sequence database search. The present release of the SBASE A library of protein domain sequences contains 972,397 protein sequence segments annotated by structure, function, ligand-binding or cellular topology, clustered into 8547 domain groups. SBASE B contains 169,916 domain sequences clustered into 2526 less well-characterized groups. Domain prediction is based on an evaluation of database search results in comparison with a 'similarity network' of inter-sequence similarity scores, using support vector machines trained on similarity search results of known domains.

Keywords: biosvm
[Vert2005Consistency] R. Vert and J.-P. Vert. Consistency and convergence rates of one-class SVM and related algorithms. Technical Report 1414, LRI, Université Paris-Sud, 2005. [ bib | .pdf ]
We determine the asymptotic limit of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator.

[Vert2005Supervised] J.-P. Vert and Y. Yamanishi. Supervised graph inference. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural Inform. Process. Syst., volume 17, pages 1433-1440. MIT Press, Cambridge, MA, 2005. [ bib | www: ]
Keywords: biosvm
[Vert2005Kernel] J.-P. Vert. Kernel methods in computational biology. Technical Report ccsd-00012124, CNRS-HAL, Oct 2005. [ bib | http | .pdf ]
Support vector machines and kernel methods are increasingly popular in genomics and computational biology, due to their good performance in real-world applications and strong modularity that makes them suitable to a wide range of problems, from the classification of tumors to the automatic annotation of proteins. Their ability to work in high dimension, to process non-vectorial data, and the natural framework they provide to integrate heterogeneous data are particularly relevant to various problems arising in computational biology. In this chapter we survey some of the most prominent applications published so far, highlighting the particular developments in kernel methods triggered by problems in biology, and mention a few promising research directions likely to expand in the future.

Keywords: biosvm
[Ueda2005Probabilistic] N. Ueda, K. F. Aoki-Kinoshita, A. Yamaguchi, T. Akutsu, and H. Mamitsuka. A Probabilistic Model for Mining Labeled Ordered Trees: Capturing Patterns in Carbohydrate Sugar Chains. IEEE Transactions on Knowledge and Data Engineering, 17(8):1051-1064, 2005. [ bib | DOI | http | .pdf ]
Glycans, or carbohydrate sugar chains, which play a number of important roles in the development and functioning of multicellular organisms, can be regarded as labeled ordered trees. A recent increase in the documentation of glycan structures, especially in the form of database curation, has made mining glycans important for the understanding of living cells. We propose a probabilistic model for mining labeled ordered trees, and we further present an efficient learning algorithm for this model, based on an EM algorithm. The time and space complexities of this algorithm are rather favorable, falling within the practical limits set by a variety of existing probabilistic models, including stochastic context-free grammars. Experimental results have shown that, in a supervised problem setting, the proposed method outperformed five other competing methods by a statistically significant factor in all cases. We further applied the proposed method to aligning multiple glycan trees, and we detected biologically significant common subtrees in these alignments where the trees are automatically classified into subtypes already known in glycobiology. Extended abstracts of parts of the work presented in this paper have appeared in [35], [4], and [3].

[Turlach2005Simultaneous] B. A. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 47(3):349-363, 2005. [ bib ]
[Tung2005GenSo-FDSS] W. L. Tung and C. Quek. GenSo-FDSS: a neural-fuzzy decision support system for pediatric ALL cancer subtype identification using gene expression data. Artif. Intell. Med., 33(1):61-88, Jan 2005. [ bib | DOI | http | .pdf ]
OBJECTIVE: Acute lymphoblastic leukemia (ALL) is the most common malignancy of childhood, representing nearly one third of all pediatric cancers. Currently, the treatment of pediatric ALL is centered on tailoring the intensity of the therapy applied to a patient's risk of relapse, which is linked to the type of leukemia the patient has. Hence, accurate and correct diagnosis of the various leukemia subtypes becomes an important first step in the treatment process. Recently, gene expression profiling using DNA microarrays has been shown to be a viable and accurate diagnostic tool to identify the known prognostically important ALL subtypes. Thus, there is currently a huge interest in developing autonomous classification systems for cancer diagnosis using gene expression data. This is to achieve an unbiased analysis of the data and also partly to handle the large amount of genetic information extracted from the DNA microarrays. METHODOLOGY: Generally, existing medical decision support systems (DSS) for cancer classification and diagnosis are based on traditional statistical methods such as Bayesian decision theory and machine learning models such as neural networks (NN) and support vector machine (SVM). Though high accuracies have been reported for these systems, they fall short on certain critical areas. These included (a) being able to present the extracted knowledge and explain the computed solutions to the users; (b) having a logical deduction process that is similar and intuitive to the human reasoning process; and (c) flexible enough to incorporate new knowledge without running the risk of eroding old but valid information. On the other hand, a neural fuzzy system, which is synthesized to emulate the human ability to learn and reason in the presence of imprecise and incomplete information, has the ability to overcome the above-mentioned shortcomings. 
However, existing neural fuzzy systems have their own limitations when used in the design and implementation of DSS. Hence, this paper proposed the use of a novel neural fuzzy system: the generic self-organising fuzzy neural network (GenSoFNN) with truth-value restriction (TVR) fuzzy inference, as a fuzzy DSS (denoted as GenSo-FDSS) for the classification of ALL subtypes using gene expression data. RESULTS AND CONCLUSION: The performance of the GenSo-FDSS system is encouraging when benchmarked against those of NN, SVM and the K-nearest neighbor (K-NN) classifier. On average, a classification rate of above 90% has been achieved using the GenSo-FDSS system.

[Tsochantaridis2005Large] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res., 6:1453-1484, 2005. [ bib | .html | .pdf ]
Learning general functional dependencies between arbitrary input and output spaces is one of the key challenges in computational intelligence. While recent progress in machine learning has mainly focused on designing flexible and powerful input representations, this paper addresses the complementary issue of designing classification algorithms that can deal with more complex outputs, such as trees, sequences, or sets. More generally, we consider problems involving multiple dependent output variables, structured output spaces, and classification problems with class attributes. In order to accomplish this, we propose to appropriately generalize the well-known notion of a separation margin and derive a corresponding maximum-margin formulation. While this leads to a quadratic program with a potentially prohibitive, i.e. exponential, number of constraints, we present a cutting plane algorithm that solves the optimization problem in polynomial time for a large class of problems. The proposed method has important applications in areas such as computational biology, natural language processing, information retrieval/extraction, and optical character recognition. Experiments from various domains involving different types of output spaces emphasize the breadth and generality of our approach.

[Tsirigos2005sensitive] A. Tsirigos and I. Rigoutsos. A sensitive, support-vector-machine method for the detection of horizontal gene transfers in viral, archaeal and bacterial genomes. Nucleic Acids Res., 33(12):3699-707, 2005. [ bib | DOI | http | .pdf ]
In earlier work, we introduced and discussed a generalized computational framework for identifying horizontal transfers. This framework relied on a gene's nucleotide composition, obviated the need for knowledge of codon boundaries and database searches, and was shown to perform very well across a wide range of archaeal and bacterial genomes when compared with previously published approaches, such as Codon Adaptation Index and C + G content. Nonetheless, two considerations remained outstanding: we wanted to further increase the sensitivity of detecting horizontal transfers and also to be able to apply the method to increasingly smaller genomes. In the discussion that follows, we present such a method, Wn-SVM, and show that it exhibits a very significant improvement in sensitivity compared with earlier approaches. Wn-SVM uses a one-class support-vector machine and can learn using rather small training sets. This property makes Wn-SVM particularly suitable for studying small-size genomes, similar to those of viruses, as well as the typically larger archaeal and bacterial genomes. We show experimentally that the new method results in a superior performance across a wide range of organisms and that it improves even upon our own earlier method by an average of 10% across all examined genomes. As a small-genome case study, we analyze the genome of the human cytomegalovirus and demonstrate that Wn-SVM correctly identifies regions that are known to be conserved and prototypical of all beta-herpesvirinae, regions that are known to have been acquired horizontally from the human host and, finally, regions that had not up to now been suspected to be horizontally transferred. Atypical region predictions for many eukaryotic viruses, including the alpha-, beta- and gamma-herpesvirinae, and 123 archaeal and bacterial genomes, have been made available online at http://cbcsrv.watson.ibm.com/HGTSVM/.

Keywords: biosvm
[Truss2005HuSiDa] M. Truss, M. Swat, S. M. Kielbasa, R. Schäfer, H. Herzel, and C. Hagemeier. HuSiDa - the human siRNA database: an open-access database for published functional siRNA sequences and technical details of efficient transfer into recipient cells. Nucleic Acids Res., 33(Database issue):D108-11, Jan 2005. [ bib | DOI | http ]
Small interfering RNAs (siRNAs) have become a standard tool in functional genomics. Once incorporated into the RNA-induced silencing complex (RISC), siRNAs mediate the specific recognition of corresponding target mRNAs and their cleavage. However, only a small fraction of randomly chosen siRNA sequences is able to induce efficient gene silencing. In common laboratory practice, successful RNA interference experiments typically require both, the labour and cost-intensive identification of an active siRNA sequence and the optimization of target cell line-specific procedures for optimal siRNA delivery. To optimize the design and performance of siRNA experiments, we have established the human siRNA database (HuSiDa). The database provides sequences of published functional siRNA molecules targeting human genes and important technical details of the corresponding gene silencing experiments, including the mode of siRNA generation, recipient cell lines, transfection reagents and procedures and direct links to published references (PubMed). The database can be accessed at http://www.human-siRNA-database.net. We used the siRNA sequence information stored in the database for scrutinizing published sequence selection parameters for efficient gene silencing.

Keywords: sirna
[Tompa2005Assessing] Martin Tompa, Nan Li, Timothy L Bailey, George M Church, Bart De Moor, Eleazar Eskin, Alexander V Favorov, Martin C Frith, Yutao Fu, W James Kent, Vsevolod J Makeev, Andrei A Mironov, William Stafford Noble, Giulio Pavesi, Graziano Pesole, Mireille Régnier, Nicolas Simonis, Saurabh Sinha, Gert Thijs, Jacques Van Helden, Mathias Vandenbogaert, Zhiping Weng, Christopher Workman, Chun Ye, and Zhou Zhu. Assessing computational tools for the discovery of transcription factor binding sites. Nat. Biotechnol., 23:137-144, 2005. [ bib ]
Keywords: csbcbook
[Tomfohr2005Pathway] J. Tomfohr, J. Lu, and T. B. Kepler. Pathway level analysis of gene expression using singular value decomposition. BMC Bioinformatics, 6:225, 2005. [ bib | DOI | http | .pdf ]
A promising direction in the analysis of gene expression focuses on the changes in expression of specific predefined sets of genes that are known in advance to be related (e.g., genes coding for proteins involved in cellular pathways or complexes). Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation. In this article, we present a new method of this kind that operates by quantifying the level of 'activity' of each pathway in different samples. The activity levels, which are derived from singular value decompositions, form the basis for statistical comparisons and other applications. We demonstrate our approach using expression data from a study of type 2 diabetes and another of the influence of cigarette smoke on gene expression in airway epithelia. A number of interesting pathways are identified in comparisons between smokers and non-smokers including ones related to nicotine metabolism, mucus production, and glutathione metabolism. A comparison with results from the related approach, 'gene-set enrichment analysis', is also provided. Our method offers a flexible basis for identifying differentially expressed pathways from gene expression data. The results of a pathway-based analysis can be complementary to those obtained from one more focused on individual genes. A web program PLAGE (Pathway Level Analysis of Gene Expression) for performing the kinds of analyses described here is accessible at http://dulci.biostat.edu/pathways.

[Tomari2005Perspective] Y. Tomari and P. D. Zamore. Perspective: machines for RNAi. Genes Dev., 19(5):517-29, Mar 2005. [ bib | DOI | http ]
RNA silencing pathways convert the sequence information in long RNA, typically double-stranded RNA, into approximately 21-nt RNA signaling molecules such as small interfering RNAs (siRNAs) and microRNAs (miRNAs). siRNAs and miRNAs provide specificity to protein effector complexes that repress mRNA transcription or translation, or catalyze mRNA destruction. Here, we review our current understanding of how small RNAs are produced, how they are loaded into protein complexes, and how they repress gene expression.

Keywords: sirna
[Tobita2005discriminant] M. Tobita, T. Nishikawa, and R. Nagashima. A discriminant model constructed by the support vector machine method for HERG potassium channel inhibitors. Bioorg. Med. Chem. Lett., 15(11):2886-90, Jun 2005. [ bib | DOI | http | .pdf ]
HERG attracts attention as a risk factor for arrhythmia, which might trigger torsade de pointes. A highly accurate classifier of chemical compounds for inhibition of the HERG potassium channel is constructed using support vector machine. For two test sets, our discriminant models achieved 90% and 95% accuracy, respectively. The classifier is even applied for the prediction of cardio vascular adverse effects to achieve about 70% accuracy. While modest inhibitors are partly characterized by properties linked to global structure of a molecule including hydrophobicity and diameter, strong inhibitors are exclusively characterized by properties linked to substructures of a molecule.

Keywords: biosvm chemoinformatics herg
[Tiffin2005Integration] N. Tiffin, J. F. Kelso, A. R. Powell, H. Pan, V. B. Bajic, and W. A. Hide. Integration of text- and data-mining using ontologies successfully selects disease gene candidates. Nucleic Acids Res., 33(5):1544-1552, 2005. [ bib | DOI | http | .pdf ]
Genome-wide techniques such as microarray analysis, Serial Analysis of Gene Expression (SAGE), Massively Parallel Signature Sequencing (MPSS), linkage analysis and association studies are used extensively in the search for genes that cause diseases, and often identify many hundreds of candidate disease genes. Selection of the most probable of these candidate disease genes for further empirical analysis is a significant challenge. Additionally, identifying the genes that cause complex diseases is problematic due to low penetrance of multiple contributing genes. Here, we describe a novel bioinformatic approach that selects candidate disease genes according to their expression profiles. We use the eVOC anatomical ontology to integrate text-mining of biomedical literature and data-mining of available human gene expression data. To demonstrate that our method is successful and widely applicable, we apply it to a database of 417 candidate genes containing 17 known disease genes. We successfully select the known disease gene for 15 out of 17 diseases and reduce the candidate gene set to 63.3% (+/-18.8%) of its original size. This approach facilitates direct association between genomic data describing gene expression and information from biomedical texts describing disease phenotype, and successfully prioritizes candidate genes according to their expression in disease-affected tissues.

[Tibshirani2005Sparsity] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. Ser. B Stat. Methodol., 67(1):91-108, 2005. [ bib | .html ]
[Tian2005Discovering] L. Tian, S. A. Greenberg, S. W. Kong, J. Altschuler, I. S. Kohane, and P. J. Park. Discovering statistically significant pathways in expression profiling studies. Proc Natl Acad Sci U S A, 102(38):13544-13549, Sep 2005. [ bib | DOI | http | .pdf ]
Accurate and rapid identification of perturbed pathways through the analysis of genome-wide expression profiles facilitates the generation of biological hypotheses. We propose a statistical framework for determining whether a specified group of genes for a pathway has a coordinated association with a phenotype of interest. Several issues on proper hypothesis-testing procedures are clarified. In particular, it is shown that the differences in the correlation structure of each set of genes can lead to a biased comparison among gene sets unless a normalization procedure is applied. We propose statistical tests for two important but different aspects of association for each group of genes. This approach has more statistical power than currently available methods and can result in the discovery of statistically significant pathways that are not detected by other methods. This method is applied to data sets involving diabetes, inflammatory myopathies, and Alzheimer's disease, using gene sets we compiled from various public databases. In the case of inflammatory myopathies, we have correctly identified the known cytotoxic T lymphocyte-mediated autoimmunity in inclusion body myositis. Furthermore, we predicted the presence of dendritic cells in inclusion body myositis and of an IFN-alpha/beta response in dermatomyositis, neither of which was previously described. These predictions have been subsequently corroborated by immunohistochemistry.

[Thukral2005Prediction] Sushil K Thukral, Paul J Nordone, Rong Hu, Leah Sullivan, Eric Galambos, Vincent D Fitzpatrick, Laura Healy, Michael B Bass, Mary E Cosenza, and Cynthia A Afshari. Prediction of nephrotoxicant action and identification of candidate toxicity-related biomarkers. Toxicol Pathol, 33(3):343-55, 2005. [ bib | DOI | http ]
A vast majority of pharmacological compounds and their metabolites are excreted via the urine, and within the complex structure of the kidney, the proximal tubules are a main target site of nephrotoxic compounds. We used the model nephrotoxicants mercuric chloride, 2-bromoethylamine hydrobromide, hexachlorobutadiene, mitomycin, amphotericin, and puromycin to elucidate time- and dose-dependent global gene expression changes associated with proximal tubular toxicity. Male Sprague-Dawley rats were dosed via intraperitoneal injection once daily for mercuric chloride and amphotericin (up to 7 doses), while a single dose was given for all other compounds. Animals were exposed to 2 different doses of these compounds and kidney tissues were collected on day 1, 3, and 7 postdosing. Gene expression profiles were generated from kidney RNA using 17K rat cDNA dual dye microarray and analyzed in conjunction with histopathology. Analysis of gene expression profiles showed that the profiles clustered based on similarities in the severity and type of pathology of individual animals. Further, the expression changes were indicative of tubular toxicity showing hallmarks of tubular degeneration/regeneration and necrosis. Use of gene expression data in predicting the type of nephrotoxicity was then tested with a support vector machine (SVM)-based approach. A SVM prediction module was trained using 120 profiles of total profiles divided into four classes based on the severity of pathology and clustering. Although mitomycin C and amphotericin B treatments did not cause toxicity, their expression profiles were included in the SVM prediction module to increase the sample size. Using this classifier, the SVM predicted the type of pathology of 28 test profiles with 100% selectivity and 82% sensitivity. These data indicate that valid predictions could be made based on gene expression changes from a small set of expression profiles.
A set of potential biomarkers showing a time- and dose-response with respect to the progression of proximal tubular toxicity were identified. These include several transporters (Slc21a2, Slc15, Slc34a2), Kim 1, IGFbp-1, osteopontin, alpha-fibrinogen, and Gstalpha.

Keywords: Algorithms, Animals, Antibiotics, Antineoplastic, Artificial Intelligence, Butadienes, Chloroplasts, Comparative Study, Computer Simulation, Computer-Assisted, Diagnosis, Disinfectants, Dose-Response Relationship, Drug, Drug Toxicity, Electrodes, Electroencephalography, Ethylamines, Expert Systems, Feedback, Fungicides, Gene Expression Profiling, Genes, Genetic Markers, Humans, Implanted, Industrial, Information Storage and Retrieval, Kidney, Kidney Tubules, MEDLINE, Male, Mercuric Chloride, Microarray Analysis, Molecular Biology, Motor Cortex, Movement, Natural Language Processing, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Plant Proteins, Predictive Value of Tests, Proteins, Proteome, Proximal, Puromycin Aminonucleoside, Rats, Reproducibility of Results, Research Support, Sprague-Dawley, Subcellular Fractions, Terminology, Therapy, Time Factors, Toxicogenetics, U.S. Gov't, User-Computer Interface, 15805072
[Tang2005Granular] Yuchun Tang, Bo Jin, and Yan-Qing Zhang. Granular support vector machines with association rules mining for protein homology prediction. Artif. Intell. Med., Jul 2005. [ bib | DOI | http | .pdf ]
OBJECTIVE: Protein homology prediction between protein sequences is one of critical problems in computational biology. Such a complex classification problem is common in medical or biological information processing applications. How to build a model with superior generalization capability from training samples is an essential issue for mining knowledge to accurately predict/classify unseen new samples and to effectively support human experts to make correct decisions. METHODOLOGY: A new learning model called granular support vector machines (GSVM) is proposed based on our previous work. GSVM systematically and formally combines the principles from statistical learning theory and granular computing theory and thus provides an interesting new mechanism to address complex classification problems. It works by building a sequence of information granules and then building support vector machines (SVM) in some of these information granules on demand. A good granulation method to find suitable granules is crucial for modeling a GSVM with good performance. In this paper, we also propose an association rules-based granulation method. For the granules induced by association rules with high enough confidence and significant support, we leave them as they are because of their high "purity" and significant effect on simplifying the classification task. For every other granule, a SVM is modeled to discriminate the corresponding data. In this way, a complex classification problem is divided into multiple smaller problems so that the learning task is simplified. RESULTS AND CONCLUSIONS: The proposed algorithm, here named GSVM-AR, is compared with SVM by KDDCUP04 protein homology prediction data. The experimental results show that finding the splitting hyperplane is not a trivial task (we should be careful to select the association rules to avoid overfitting) and GSVM-AR does show significant improvement compared to building one single SVM in the whole feature space.
Another advantage is that the utility of GSVM-AR is very good because it is easy to be implemented. More importantly and more interestingly, GSVM provides a new mechanism to address complex classification problems.

Keywords: 16024240
[Tang2005Discovering] Thomas Tang, Jinbo Xu, and Ming Li. Discovering sequence-structure motifs from protein segments and two applications. Pac Symp Biocomput, pages 370-81, 2005. [ bib ]
We present a novel method for clustering short protein segments having strong sequence-structure correlations, and demonstrate that these clusters contain useful structural information via two applications. When applied to local tertiary structure prediction, we achieve approximately 60% accuracy with a novel dynamic programming algorithm. When applied to secondary structure prediction based on Support Vector Machines, we obtain an approximately 2% gain in Q3 performance by incorporating cluster-derived data into training and classification. These encouraging results illustrate the great potential of using conserved local motifs to tackle protein structure predictions and possibly other important problems in biology.

Keywords: biosvm
[Tang2005siRNA] G. Tang. siRNA and miRNA: an insight into RISCs. Trends Biochem. Sci., 30(2):106-14, Feb 2005. [ bib | DOI | http ]
Two classes of short RNA molecule, small interfering RNA (siRNA) and microRNA (miRNA), have been identified as sequence-specific posttranscriptional regulators of gene expression. siRNA and miRNA are incorporated into related RNA-induced silencing complexes (RISCs), termed siRISC and miRISC, respectively. The current model argues that siRISC and miRISC are functionally interchangeable and target specific mRNAs for cleavage or translational repression, depending on the extent of sequence complementarity between the small RNA and its target. Emerging evidence indicates, however, that siRISC and miRISC are distinct complexes that regulate mRNA stability and translation. The assembly of RISCs can be traced from the biogenesis of the small RNA molecules and the recruitment of these RNAs by the RISC loading complex (RLC) to the transition of the RLC into the active RISC. Target recognition by the RISC can then take place through different interacting modes.

Keywords: sirna
[Talih2005Structural] M. Talih and N. Hengartner. Structural learning with time-varying components: tracking the cross-section of financial time series. J. R. Stat. Soc. Ser. B, 67(3):321-341, 2005. [ bib | .pdf ]
[Takeuchi2005Bio-medical] Koichi Takeuchi and Nigel Collier. Bio-medical entity extraction using support vector machines. Artif. Intell. Med., 33(2):125-37, Feb 2005. [ bib | DOI | http | .pdf ]
OBJECTIVE: Support vector machines (SVMs) have achieved state-of-the-art performance in several classification tasks. In this article we apply them to the identification and semantic annotation of scientific and technical terminology in the domain of molecular biology. This illustrates the extensibility of the traditional named entity task to special domains with large-scale terminologies such as those in medicine and related disciplines. METHODS AND MATERIALS: The foundation for the model is a sample of text annotated by a domain expert according to an ontology of concepts, properties and relations. The model then learns to annotate unseen terms in new texts and contexts. The results can be used for a variety of intelligent language processing applications. We illustrate SVMs' capabilities using a sample of 100 journal abstracts taken from the human, blood cell, transcription factor domain of MEDLINE. RESULTS: Approximately 3400 terms are annotated and the model performs at about 74% F-score on cross-validation tests. A detailed analysis based on empirical evidence shows the contribution of various feature sets to performance. CONCLUSION: Our experiments indicate a relationship between feature window size and the amount of training data and that a combination of surface words, orthographic features and head noun features achieves the best performance among the feature sets tested.

Keywords: biosvm
[Swamidass2005Kernels] S. J. Swamidass, J. Chen, J. Bruand, P. Phung, L. Ralaivola, and P. Baldi. Kernels for small molecules and the prediction of mutagenicity, toxicity and anti-cancer activity. Bioinformatics, 21(Suppl. 1):i359-i368, Jun 2005. [ bib | DOI | http | .pdf ]
MOTIVATION: Small molecules play a fundamental role in organic chemistry and biology. They can be used to probe biological systems and to discover new drugs and other useful compounds. As increasing numbers of large datasets of small molecules become available, it is necessary to develop computational methods that can deal with molecules of variable size and structure and predict their physical, chemical and biological properties. RESULTS: Here we develop several new classes of kernels for small molecules using their 1D, 2D and 3D representations. In 1D, we consider string kernels based on SMILES strings. In 2D, we introduce several similarity kernels based on conventional or generalized fingerprints. Generalized fingerprints are derived by counting in different ways subpaths contained in the graph of bonds, using depth-first searches. In 3D, we consider similarity measures between histograms of pairwise distances between atom classes. These kernels can be computed efficiently and are applied to problems of classification and prediction of mutagenicity, toxicity and anti-cancer activity on three publicly available datasets. The results derived using cross-validation methods are state-of-the-art. Tradeoffs between various kernels are briefly discussed. AVAILABILITY: Datasets available from http://www.igb.uci.edu/servers/servers.html CONTACT: pfbaldi@ics.uci.edu.

Keywords: biosvm
[Suthram2005Plasmodium] S. Suthram, T. Sittler, and T. Ideker. The plasmodium protein network diverges from those of other eukaryotes. Nature, 438(7064):108-112, Nov 2005. [ bib | DOI | http | www: ]
Plasmodium falciparum is the pathogen responsible for over 90% of human deaths from malaria. Therefore, it has been the focus of a considerable research initiative, involving the complete DNA sequencing of the genome, large-scale expression analyses, and protein characterization of its life-cycle stages. The Plasmodium genome sequence is relatively distant from those of most other eukaryotes, with more than 60% of the 5,334 encoded proteins lacking any notable sequence similarity to other organisms. To systematically elucidate functional relationships among these proteins, a large two-hybrid study has recently mapped a network of 2,846 interactions involving 1,312 proteins within Plasmodium. This network adds to a growing collection of available interaction maps for a number of different organisms, and raises questions about whether the divergence of Plasmodium at the sequence level is reflected in the configuration of its protein network. Here we examine the degree of conservation between the Plasmodium protein network and those of model organisms. Although we find 29 highly connected protein complexes specific to the network of the pathogen, we find very little conservation with complexes observed in other organisms (three in yeast, none in the others). Overall, the patterns of protein interaction in Plasmodium, like its genome sequence, set it apart from other species.

[Subramanian2005Gene] A. Subramanian, P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander, and J. P. Mesirov. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA, 102(43):15545-15550, Oct 2005. [ bib | DOI | http | .pdf ]
Although genomewide RNA expression analysis has become a routine tool in biomedical research, extracting biological insight from such information remains a major challenge. Here, we describe a powerful analytical method called Gene Set Enrichment Analysis (GSEA) for interpreting gene expression data. The method derives its power by focusing on gene sets, that is, groups of genes that share common biological function, chromosomal location, or regulation. We demonstrate how GSEA yields insights into several cancer-related data sets, including leukemia and lung cancer. Notably, where single-gene analysis finds little similarity between two independent studies of patient survival in lung cancer, GSEA reveals many biological pathways in common. The GSEA method is embodied in a freely available software package, together with an initial database of 1,325 biologically defined gene sets.

[Stucki2005JTB] J. W. Stucki and H.-U. Simon. Mathematical modeling of the regulation of caspase-3 activation and degradation. Journal of Theoretical Biology, 234(1):123-131, 2005. [ bib | DOI | http ]
Caspases are thought to be important players in the execution process of apoptosis. Inhibitors of apoptosis (IAPs) are able to block caspases and therefore apoptosis. The fact that a subgroup of the IAP family inhibits active caspases implies that not each caspase activation necessarily leads to apoptosis. In such a scenario, however, processed and enzymically active caspases should somehow be removed. Indeed, IAP-caspase complexes covalently bind ubiquitin, resulting in degradation by the 26S proteasome. Following release from mitochondria, IAP antagonists (e.g. second mitochondrial activator of caspases (Smac)) inactivate IAPs. Moreover, although pro-apoptotic factors such as irradiation or anti-cancer drugs may release Smac from mitochondria in tumor cells, high cytoplasmic survivin and ML-IAP levels might be able to neutralize it and, consequently, IAPs would further be able to bind activated caspases. Here, we propose a simple mathematical model, describing the molecular interactions between Smac deactivators, Smac, IAPs, and caspase-3, including the requirements for both induction and prevention of apoptosis, respectively. In addition, we predict a novel mechanism of caspase-3 degradation that might be particularly relevant in long-living cells.

Keywords: csbcbook
[Stelzl2005human] Ulrich Stelzl, Uwe Worm, Maciej Lalowski, Christian Haenig, Felix H Brembeck, Heike Goehler, Martin Stroedicke, Martina Zenkner, Anke Schoenherr, Susanne Koeppen, Jan Timm, Sascha Mintzlaff, Claudia Abraham, Nicole Bock, Silvia Kietzmann, Astrid Goedde, Engin Toksöz, Anja Droege, Sylvia Krobitsch, Bernhard Korn, Walter Birchmeier, Hans Lehrach, and Erich E Wanker. A human protein-protein interaction network: a resource for annotating the proteome. Cell, 122(6):957-968, Sep 2005. [ bib | DOI | http ]
Protein-protein interaction maps provide a valuable framework for a better understanding of the functional organization of the proteome. To detect interacting pairs of human proteins systematically, a protein matrix of 4456 baits and 5632 preys was screened by automated yeast two-hybrid (Y2H) interaction mating. We identified 3186 mostly novel interactions among 1705 proteins, resulting in a large, highly connected network. Independent pull-down and co-immunoprecipitation assays validated the overall quality of the Y2H interactions. Using topological and GO criteria, a scoring system was developed to define 911 high-confidence interactions among 401 proteins. Furthermore, the network was searched for interactions linking uncharacterized gene products and human disease proteins to regulatory cellular pathways. Two novel Axin-1 interactions were validated experimentally, characterizing ANP32A and CRMP1 as modulators of Wnt signaling. Systematic human protein interaction screens can lead to a more comprehensive understanding of protein function and cellular processes.

Keywords: Databases as Topic; Humans; Intracellular Signaling Peptides and Proteins; Models, Molecular; Nerve Tissue Proteins; Protein Binding; Proteins; Proteomics; Repressor Proteins; Two-Hybrid System Techniques
[Steinwart2005classification] I. Steinwart, D. Hush, and C. Scovel. A classification framework for anomaly detection. J. Mach. Learn. Res., 6:211-232, 2005. [ bib | .pdf | .pdf ]
[Steinwart2005Density] I. Steinwart, D. Hush, and C. Scovel. Density Level Detection is Classification. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17. MIT Press, Cambridge, MA, 2005. [ bib | .pdf ]
[Steinwart2005Consistency] I. Steinwart. Consistency of support vector machines and other regularized kernel classifiers. IEEE Trans. Inform. Theory, 51(1):128-142, 2005. [ bib | DOI | http | .pdf ]
[Statnikov2005comprehensive] A. Statnikov, C. F. Aliferis, I. Tsamardinos, D. Hardin, and S. Levy. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 2005. To appear. [ bib | http | .pdf ]
Motivation: Cancer diagnosis is one of the most important emerging clinical applications of gene expression microarray technology. We are seeking to develop a computer system for powerful and reliable cancer diagnostic model creation based on microarray data. To keep a realistic perspective on clinical applications we focus on multicategory diagnosis. In order to equip the system with the optimum combination of classifier, gene selection and cross-validation methods, we performed a systematic and comprehensive evaluation of several major algorithms for multicategory classification, several gene selection methods, multiple ensemble classifier methods, and two cross validation designs using 11 datasets spanning 74 diagnostic categories and 41 cancer types and 12 normal tissue types. Results: Multicategory Support Vector Machines (MC-SVMs) are the most effective classifiers in performing accurate cancer diagnosis from gene expression data. The MC-SVM techniques by Crammer and Singer, Weston and Watkins, and one-versus-rest were found to be the best methods in this domain. MC-SVMs outperform other popular machine learning algorithms such as K-Nearest Neighbors, Backpropagation and Probabilistic Neural Networks, often to a remarkable degree. Gene selection techniques can significantly improve classification performance of both MC-SVMs and other non-SVM learning algorithms. Ensemble classifiers do not generally improve performance of the best non-ensemble models. These results guided the construction of a software system GEMS (Gene Expression Model Selector) that automates high-quality model construction and enforces sound optimization and performance estimation procedures. This is the first such system to be informed by a rigorous comparative analysis of the available algorithms and datasets. Availability: The software system GEMS is available for download from http://www.gems-system.org for non-commercial use.

Keywords: biosvm microarray
[Srebro2005Rank] Nathan Srebro and Adi Shraibman. Rank, trace-norm and max-norm. In COLT, pages 545-560, 2005. [ bib ]
[Srebro2005Maximum] N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Adv. Neural. Inform. Process Syst. 17, pages 1329-1336, Cambridge, MA, 2005. MIT Press. [ bib ]
[Son2005Database] C.G. Son, S. Bilke, S. Davis, B.T. Greer, J.S. Wei, C.C. Whiteford, Q-R. Chen, N. Cenacchi, and J. Khan. Database of mRNA gene expression profiles of multiple human organs. Genome Res., 15(3):443-450, Mar 2005. [ bib | DOI | http ]
Genome-wide expression profiling of normal tissue may facilitate our understanding of the etiology of diseased organs and augment the development of new targeted therapeutics. Here, we have developed a high-density gene expression database of 18,927 unique genes for 158 normal human samples from 19 different organs of 30 different individuals using DNA microarrays. We report four main findings. First, despite very diverse sample parameters (e.g., age, ethnicity, sex, and postmortem interval), the expression profiles belonging to the same organs cluster together, demonstrating internal stability of the database. Second, the gene expression profiles reflect major organ-specific functions on the molecular level, indicating consistency of our database with known biology. Third, we demonstrate that any small (i.e., n approximately 100), randomly selected subset of genes can approximately reproduce the hierarchical clustering of the full data set, suggesting that the observed differential expression of >90% of the probed genes is of biological origin. Fourth, we demonstrate a potential application of this database to cancer research by identifying 19 tumor-specific genes in neuroblastoma. The selected genes are relatively underexpressed in all of the organs examined and belong to therapeutically relevant pathways, making them potential novel diagnostic markers and targets for therapy. We expect this database will be of utility for developing rationally designed molecularly targeted therapeutics in diseases such as cancer, as well as for exploring the functions of genes.

Keywords: Cluster Analysis; Databases, Nucleic Acid; Gene Expression Profiling; Humans; Oligonucleotide Array Sequence Analysis; Organ Specificity; Principal Component Analysis; RNA, Messenger
[Smyth2005Bioinformatics] G. K. Smyth. Bioinformatics and Computational Biology Solutions using R and Bioconductor, chapter Limma: linear model for microarray data, pages 397-420. Springer, New York, 2005. [ bib | .pdf ]
[Shulman-Peleg2005SiteEngines] Alexandra Shulman-Peleg, Ruth Nussinov, and Haim J Wolfson. Siteengines: recognition and comparison of binding sites and protein-protein interfaces. Nucleic Acids Res, 33(Web Server issue):W337-W341, Jul 2005. [ bib | DOI | http ]
Protein surface regions with similar physicochemical properties and shapes may perform similar functions and bind similar binding partners. Here we present two web servers and software packages for recognition of the similarity of binding sites and interfaces. Both methods recognize local geometrical and physicochemical similarity, which can be present even in the absence of overall sequence or fold similarity. The first method, SiteEngine (http://bioinfo3d.cs.tau.ac.il/SiteEngine), receives as an input two protein structures and searches the complete surface of one protein for regions similar to the binding site of the other. The second, Interface-to-Interface (I2I)-SiteEngine (http://bioinfo3d.cs.tau.ac.il/I2I-SiteEngine), compares protein-protein interfaces, which are regions of interaction between two protein molecules. It receives as an input two structures of protein-protein complexes, extracts the interfaces and finds the three-dimensional transformation that maximizes the similarity between two pairs of interacting binding sites. The output of both servers consists of a superimposition in PDB file format and a list of physicochemical properties shared by the compared entities. The methods are highly efficient and the freely available software packages are suitable for large-scale database searches of the entire PDB.

Keywords: Amino Acids, chemistry; Binding Sites; Internet; Multiprotein Complexes, chemistry/metabolism; Protein Conformation; Protein Interaction Mapping, methods; Software; User-Computer Interface
[Shilton2005Incremental] Alistair Shilton, M. Palaniswami, Daniel Ralph, and Ah Chung Tsoi. Incremental training of support vector machines. IEEE Trans Neural Netw, 16(1):114-31, Jan 2005. [ bib ]
We propose a new algorithm for the incremental training of support vector machines (SVMs) that is suitable for problems of sequentially arriving data and fast constraint parameter variation. Our method involves using a "warm-start" algorithm for the training of SVMs, which allows us to take advantage of the natural incremental properties of the standard active set approach to linearly constrained optimization problems. Incremental training involves quickly retraining a support vector machine after adding a small number of additional training vectors to the training set of an existing (trained) support vector machine. Similarly, the problem of fast constraint parameter variation involves quickly retraining an existing support vector machine using the same training set but different constraint parameters. In both cases, we demonstrate the computational superiority of incremental training over the usual batch retraining method.

[Shi2005Building] Lei Shi and Fabien Campagne. Building a protein name dictionary from full text: a machine learning term extraction approach. BMC Bioinformatics, 6(1):88, Apr 2005. [ bib | DOI | http | .pdf ]
BACKGROUND: The majority of information in the biological literature resides in full text articles, instead of abstracts. Yet, abstracts remain the focus of many publicly available literature data mining tools. Most literature mining tools rely on pre-existing lexicons of biological names, often extracted from curated gene or protein databases. This is a limitation, because such databases have low coverage of the many name variants which are used to refer to biological entities in the literature. RESULTS: We present an approach to recognize named entities in full text. The approach collects high frequency terms in an article, and uses support vector machines (SVM) to identify biological entity names. It is also computationally efficient and robust to noise commonly found in full text material. We use the method to create a protein name dictionary from a set of 80,528 full text articles. Only 8.3% of the names in this dictionary match SwissProt description lines. We assess the quality of the dictionary by studying its protein name recognition performance in full text. CONCLUSION: This dictionary term lookup method compares favourably to other published methods, supporting the significance of our direct extraction approach. The method is strong in recognizing name variants not found in SwissProt.

Keywords: biosvm
[Shi2005Sensitivity] D. Shi, D. S. Yeung, and J. Gao. Sensitivity analysis applied to the construction of radial basis function networks. Neural Netw, Jun 2005. [ bib | DOI | http ]
Conventionally, a radial basis function (RBF) network is constructed by obtaining cluster centers of basis function by maximum likelihood learning. This paper proposes a novel learning algorithm for the construction of radial basis function using sensitivity analysis. In training, the number of hidden neurons and the centers of their radial basis functions are determined by the maximization of the output's sensitivity to the training data. In classification, the minimal number of such hidden neurons with the maximal sensitivity will be the most generalizable to unknown data. Our experimental results show that our proposed sensitivity-based RBF classifier outperforms the conventional RBFs and is as accurate as a support vector machine (SVM). Hence, sensitivity analysis is expected to be a new alternative approach to the construction of RBF networks.

[Shen2005[Detection] Li Shen, Jie Yang, and Yue Zhou. Detection of PVCs with support vector machine. Sheng Wu Yi Xue Gong Cheng Xue Za Zhi, 22(1):78-81, Feb 2005. [ bib ]
The classification of heart beats is the foundation for automated arrhythmia monitoring devices. Support vector machines (SVMs) have meant a great advance in solving classification or pattern recognition. This study describes SVM for the identification of premature ventricular contractions (PVCs) in surface ECGs. Features for the classification task are extracted by analyzing the heart rate, morphology and wavelet energy of the heart beats from a single lead. The performance of different SVMs is evaluated on the MIT-BIH arrhythmia database following the association for the advancement of medical instrumentation (AAMI) recommendations.

Keywords: 80 and over, Adult, Aged, Algorithms, Amino Acids, Animals, Area Under Curve, Artifacts, Automated, Birefringence, Brain Chemistry, Brain Neoplasms, Comparative Study, Computer-Assisted, Cornea, Cross-Sectional Studies, Decision Trees, Diagnosis, Diagnostic Imaging, Diagnostic Techniques, Discriminant Analysis, Evolution, Face, Female, Genetic, Glaucoma, Humans, Intraocular Pressure, Lasers, Least-Squares Analysis, Magnetic Resonance Imaging, Magnetic Resonance Spectroscopy, Male, Middle Aged, Models, Molecular, Nerve Fibers, Non-U.S. Gov't, Numerical Analysis, Ophthalmological, Optic Nerve Diseases, Optical Coherence, P.H.S., Pattern Recognition, Photic Stimulation, Prospective Studies, Protein, ROC Curve, Regression Analysis, Research Support, Retinal Ganglion Cells, Sensitivity and Specificity, Sequence Analysis, Statistics, Tomography, U.S. Gov't, Visual Fields, beta-Lactamases, 15762121
[Sheinerman2005High] Felix B Sheinerman, Elie Giraud, and Abdelazize Laoui. High affinity targets of protein kinase inhibitors have similar residues at the positions energetically important for binding. J. Mol. Biol., 352(5):1134-1156, Oct 2005. [ bib | DOI | http ]
Inhibition of protein kinase activity is a focus of intense drug discovery efforts in several therapeutic areas. Major challenges facing the field include understanding of the factors determining the selectivity of kinase inhibitors and the development of compounds with the desired selectivity profile. Here, we report the analysis of sequence variability among high and low affinity targets of eight different small molecule kinase inhibitors (BIRB796, Tarceva, NU6102, Gleevec, SB203580, balanol, H89, PP1). It is observed that all high affinity targets of each inhibitor are found among a relatively small number of kinases, which have similar residues at the specific positions important for binding. The findings are highly statistically significant, and allow one to exclude the majority of kinases in a genome from a list of likely targets for an inhibitor. The findings have implications for the design of novel inhibitors with a desired selectivity profile (e.g. targeted at multiple kinases), the discovery of new targets for kinase inhibitor drugs, comparative analysis of different in vivo models, and the design of "a-la-carte" chemical libraries tailored for individual kinases.

Keywords: Amino Acid Sequence; Amino Acids; Binding Sites; Electrostatics; Humans; Ligands; Molecular Sequence Data; Piperazines; Protein Binding; Protein Kinase Inhibitors; Protein Kinases; Pyrazoles; Pyrimidines; Sequence Alignment; Thermodynamics
[Sharan2005Conserved] R. Sharan, S. Suthram, R.M. Kelley, T. Kuhn, S. McCuine, P. Uetz, T. Sittler, R.M. Karp, and T. Ideker. Conserved patterns of protein interaction in multiple species. Proc. Natl. Acad. Sci. USA, 102(6):1974-1979, Feb 2005. [ bib | DOI | http | .pdf ]
To elucidate cellular machinery on a global scale, we performed a multiple comparison of the recently available protein-protein interaction networks of Caenorhabditis elegans, Drosophila melanogaster, and Saccharomyces cerevisiae. This comparison integrated protein interaction and sequence information to reveal 71 network regions that were conserved across all three species and many exclusive to the metazoans. We used this conservation, and found statistically significant support for 4,645 previously undescribed protein functions and 2,609 previously undescribed protein interactions. We tested 60 interaction predictions for yeast by two-hybrid analysis, confirming approximately half of these. Significantly, many of the predicted functions and interactions would not have been identified from sequence similarity alone, demonstrating that network comparisons provide essential biological information beyond what is gleaned from the genome.

[Sharan2005motif-based] R. Sharan and E. W. Myers. A motif-based framework for recognizing sequence families. Bioinformatics, 21 Suppl 1:i387-i393, Jun 2005. [ bib | DOI | http | .pdf ]
MOTIVATION: Many signals in biological sequences are based on the presence or absence of base signals and their spatial combinations. One of the best known examples of this is the signal identifying a core promoter, the site at which the basal transcription machinery starts the transcription of a gene. Our goal is a fully automatic pattern recognition system for a family of sequences, which simultaneously discovers the base signals, their spatial relationships and a classifier based upon them. RESULTS: In this paper we present a general method for characterizing a set of sequences by their recurrent motifs. Our approach relies on novel probabilistic models for DNA binding sites and modules of binding sites, on algorithms to study them from the data and on a support vector machine that uses the models studied to classify a set of sequences. We demonstrate the applicability of our approach to diverse instances, ranging from families of promoter sequences to a dataset of intronic sequences flanking alternatively spliced exons. On a core promoter dataset our results are comparable with the state-of-the-art McPromoter. On a dataset of alternatively spliced exons we outperform a previous approach. We also achieve high success rates in recognizing cell cycle regulated genes. These results demonstrate that a fully automatic pattern recognition algorithm can meet or exceed the performance of hand-crafted approaches. AVAILABILITY: The software and datasets are available from the authors upon request. CONTACT: roded@tau.ac.il.

Keywords: biosvm
[Shadforth2005Protein] Ian Shadforth, Daniel Crowther, and Conrad Bessant. Protein and peptide identification algorithms using ms for use in high-throughput, automated pipelines. Proteomics, 5(16):4082-4095, Nov 2005. [ bib | DOI | http ]
Current proteomics experiments can generate vast quantities of data very quickly, but this has not been matched by data analysis capabilities. Although there have been a number of recent reviews covering various aspects of peptide and protein identification methods using MS, comparisons of which methods are either the most appropriate for, or the most effective at, their proposed tasks are not readily available. As the need for high-throughput, automated peptide and protein identification systems increases, the creators of such pipelines need to be able to choose algorithms that are going to perform well both in terms of accuracy and computational efficiency. This article therefore provides a review of the currently available core algorithms for PMF, database searching using MS/MS, sequence tag searches and de novo sequencing. We also assess the relative performances of a number of these algorithms. As there is limited reporting of such information in the literature, we conclude that there is a need for the adoption of a system of standardised reporting on the performance of new peptide and protein identification algorithms, based upon freely available datasets. We go on to present our initial suggestions for the format and content of these datasets.

Keywords: Algorithms; Alternative Splicing; Databases, Protein; Peptides; Polymorphism, Genetic; Proteins; Proteomics; Sequence Analysis; Software; Spectrometry, Mass, Matrix-Assisted Laser Desorption-Ionization
[Senawongse2005Predicting] Pasak Senawongse, Andrew R Dalby, and Zheng Rong Yang. Predicting the phosphorylation sites using hidden markov models and machine learning methods. J Chem Inf Model, 45(4):1147-52, 2005. [ bib | DOI | http | .pdf ]
Accurately predicting phosphorylation sites in proteins is an important issue in postgenomics, for which how to efficiently extract the most predictive features from amino acid sequences for modeling is still challenging. Although both the distributed encoding method and the bio-basis function method work well, they still have some limits in use. The distributed encoding method is unable to code the biological content in sequences efficiently, whereas the bio-basis function method is a nonparametric method, which is often computationally expensive. As hidden Markov models (HMMs) can be used to generate one model for one cluster of aligned protein sequences, the aim in this study is to use HMMs to extract features from amino acid sequences, where sequence clusters are determined using available biological knowledge. In this novel method, HMMs are first constructed using functional sequences only. Both functional and nonfunctional training sequences are then inputted into the trained HMMs to generate functional and nonfunctional feature vectors. From this, a machine learning algorithm is used to construct a classifier based on these feature vectors. It is found in this work that (1) this method provides much better prediction accuracy than the use of HMMs only for prediction, and (2) the support vector machines (SVMs) algorithm outperforms decision trees and neural network algorithms when they are constructed on the features extracted using the trained HMMs.

Keywords: biosvm
[Seike2005Proteomic] M. Seike, T. Kondo, K. Fujii, T. Okano, T. Yamada, Y. Matsuno, A. Gemma, S. Kudoh, and S. Hirohashi. Proteomic signatures for histological types of lung cancer. Proteomics, Jul 2005. [ bib | DOI | http | .pdf ]
We performed proteomic studies on lung cancer cells to elucidate the mechanisms that determine histological phenotype. Thirty lung cancer cell lines with three different histological backgrounds (squamous cell carcinoma, small cell lung carcinoma and adenocarcinoma) were subjected to two-dimensional difference gel electrophoresis (2-D DIGE) and grouped by multivariate analyses on the basis of their protein expression profiles. 2-D DIGE achieves more accurate quantification of protein expression by using highly sensitive fluorescence dyes to label the cysteine residues of proteins prior to two-dimensional polyacrylamide gel electrophoresis. We found that hierarchical clustering analysis and principal component analysis divided the cell lines according to their original histology. Spot ranking analysis using a support vector machine algorithm and unsupervised classification methods identified 32 protein spots essential for the classification. The proteins corresponding to the spots were identified by mass spectrometry. Next, lung cancer cells isolated from tumor tissue by laser microdissection were classified on the basis of the expression pattern of these 32 protein spots. Based on the expression profile of the 32 spots, the isolated cancer cells were categorized into three histological groups: the squamous cell carcinoma group, the adenocarcinoma group, and a group of carcinomas with other histological types. In conclusion, our results demonstrate the utility of quantitative proteomic analysis for molecular diagnosis and classification of lung cancer cells.

Keywords: biosvm proteomics
[Segal2005From] E. Segal, N. Friedman, N. Kaminski, A. Regev, and D. Koller. From signatures to models: understanding cancer using microarrays. Nat Genet, 37(6 Suppl):S38-45, 2005. [ bib | DOI | http | .pdf ]
Genomics has the potential to revolutionize the diagnosis and management of cancer by offering an unprecedented comprehensive view of the molecular underpinnings of pathology. Computational analysis is essential to transform the masses of generated data into a mechanistic understanding of disease. Here we review current research aimed at uncovering the modular organization and function of transcriptional networks and responses in cancer. We first describe how methods that analyze biological processes in terms of higher-level modules can identify robust signatures of disease mechanisms. We then discuss methods that aim to identify the regulatory mechanisms underlying these modules and processes. Finally, we show how comparative analysis, combining human data with model organisms, can lead to more robust findings. We conclude by discussing the challenges of generalizing these methods from cells to tissues and the opportunities they offer to improve cancer diagnosis and management.

Keywords: microarray
[Scott2005Neyman] C. Scott and R. Nowak. A Neyman-Pearson approach to statistical learning. IEEE Trans. Inf. Theory, 51(11):3806-3819, 2005. [ bib | DOI | http ]
The Neyman-Pearson (NP) approach to hypothesis testing is useful in situations where different types of error have different consequences or a priori probabilities are unknown. For any α > 0, the NP lemma specifies the most powerful test of size α, but assumes the distributions for each hypothesis are known or (in some cases) the likelihood ratio is monotonic in an unknown parameter. This paper investigates an extension of NP theory to situations in which one has no knowledge of the underlying distributions except for a collection of independent and identically distributed (i.i.d.) training examples from each hypothesis. Building on a "fundamental lemma" of Cannon et al., we demonstrate that several concepts from statistical learning theory have counterparts in the NP context. Specifically, we consider constrained versions of empirical risk minimization (NP-ERM) and structural risk minimization (NP-SRM), and prove performance guarantees for both. General conditions are given under which NP-SRM leads to strong universal consistency. We also apply NP-SRM to (dyadic) decision trees to derive rates of convergence. Finally, we present explicit algorithms to implement NP-SRM for histograms and dyadic decision trees.

Keywords: neyman, pearson
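
Illustrative note (not part of the cited paper): the constrained empirical risk minimization idea in the abstract above can be sketched for the simplest possible rule class, one-dimensional threshold classifiers. The function name `np_erm_threshold`, the `tol` slack parameter, and the choice of threshold rules are assumptions made for this sketch; the paper's own algorithms target histograms and dyadic decision trees.

```python
def np_erm_threshold(scores0, scores1, alpha, tol=0.0):
    """Pick a threshold t for the rule "predict class 1 if score >= t".

    Among all candidate thresholds whose empirical false-alarm rate on
    class-0 scores is at most alpha + tol, return the one with the
    smallest empirical miss rate on class-1 scores.
    Returns (threshold, miss_rate, false_alarm_rate).
    """
    # Only observed scores (plus +inf, which rejects everything) can
    # change the empirical error rates, so they suffice as candidates.
    candidates = sorted(set(scores0) | set(scores1)) + [float("inf")]
    best = None
    for t in candidates:
        false_alarm = sum(s >= t for s in scores0) / len(scores0)
        miss = sum(s < t for s in scores1) / len(scores1)
        feasible = false_alarm <= alpha + tol
        if feasible and (best is None or miss < best[1]):
            best = (t, miss, false_alarm)
    return best


if __name__ == "__main__":
    null_scores = [0.1, 0.2, 0.3, 0.4]    # hypothetical class-0 sample
    alt_scores = [0.35, 0.5, 0.6, 0.7]    # hypothetical class-1 sample
    print(np_erm_threshold(null_scores, alt_scores, alpha=0.25))
```

The threshold `+inf` always satisfies the false-alarm constraint, so the function never returns `None` for non-empty samples.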
[Schellewald2005Probabilistic] C. Schellewald and C. Schnorr. Probabilistic subgraph matching based on convex relaxation. In EMMCVPR05, pages 171-186, 2005. [ bib ]
[Sassi2005automated] Alexander P Sassi, Frank Andel, Hans-Marcus L Bitter, Michael P S Brown, Robert G Chapman, Jeraldine Espiritu, Alfred C Greenquist, Isabelle Guyon, Mariana Horchi-Alegre, Kathy L Stults, Ann Wainright, Jonathan C Heller, and John T Stults. An automated, sheathless capillary electrophoresis-mass spectrometry platform for discovery of biomarkers in human serum. Electrophoresis, 26(7-8):1500-12, Apr 2005. [ bib | DOI | http | .pdf ]
A capillary electrophoresis-mass spectrometry (CE-MS) method has been developed to perform routine, automated analysis of low-molecular-weight peptides in human serum. The method incorporates transient isotachophoresis for in-line preconcentration and a sheathless electrospray interface. To evaluate the performance of the method and demonstrate the utility of the approach, an experiment was designed in which peptides were added to sera from individuals at each of two different concentrations, artificially creating two groups of samples. The CE-MS data from the serum samples were divided into separate training and test sets. A pattern-recognition/feature-selection algorithm based on support vector machines was used to select the mass-to-charge (m/z) values from the training set data that distinguished the two groups of samples from each other. The added peptides were identified correctly as the distinguishing features, and pattern recognition based on these peptides was used to assign each sample in the independent test set to its respective group. A twofold difference in peptide concentration could be detected with statistical significance (p-value < 0.0001). The accuracy of the assignment was 95%, demonstrating the utility of this technique for the discovery of patterns of biomarkers in serum.

Keywords: 80 and over, Adult, Aged, Algorithms, Amino Acids, Animals, Area Under Curve, Artifacts, Automated, Birefringence, Brain Chemistry, Brain Neoplasms, Comparative Study, Computer-Assisted, Cornea, Cross-Sectional Studies, Decision Trees, Diagnosis, Diagnostic Imaging, Diagnostic Techniques, Discriminant Analysis, Evolution, Face, Female, Genetic, Glaucoma, Humans, Intraocular Pressure, Lasers, Least-Squares Analysis, Magnetic Resonance Imaging, Magnetic Resonance Spectroscopy, Male, Middle Aged, Models, Molecular, Nerve Fibers, Non-U.S. Gov't, Numerical Analysis, Ophthalmological, Optic Nerve Diseases, Optical Coherence, P.H.S., Pattern Recognition, Photic Stimulation, Prospective Studies, Protein, ROC Curve, Regression Analysis, Research Support, Retinal Ganglion Cells, Sensitivity and Specificity, Sequence Analysis, Statistics, Tomography, U.S. Gov't, Visual Fields, beta-Lactamases, 15765480
[Sarda2005pSLIP] Deepak Sarda, Gek Huey Chua, Kuo-Bin Li, and Arun Krishnan. pSLIP: SVM based protein subcellular localization prediction using multiple physicochemical properties. BMC Bioinformatics, 6(1):152, Jun 2005. [ bib | DOI | http | .pdf ]
BACKGROUND: Protein subcellular localization is an important determinant of protein function and hence, reliable methods for prediction of localization are needed. A number of prediction algorithms have been developed based on amino acid compositions or on the N-terminal characteristics (signal peptides) of proteins. However, such approaches lead to a loss of contextual information. Moreover, where information about the physicochemical properties of amino acids has been used, the methods employed to exploit that information are less than optimal and could use the information more effectively. RESULTS: In this paper, we propose a new algorithm called pSLIP which uses Support Vector Machines (SVMs) in conjunction with multiple physicochemical properties of amino acids to predict protein subcellular localization in eukaryotes across six different locations, namely, chloroplast, cytoplasmic, extracellular, mitochondrial, nuclear and plasma membrane. The algorithm was applied to the dataset provided by Park and Kanehisa and we obtained prediction accuracies for the different classes ranging from 87.7%-97.0% with an overall accuracy of 93.1%. CONCLUSIONS: This study presents a physicochemical property based protein localization prediction algorithm. Unlike other algorithms, contextual information is preserved by dividing the protein sequences into clusters. The prediction accuracy shows an improvement over other algorithms based on various types of amino acid composition (single, pair and gapped pair). We have also implemented a web server to predict protein localization across the six classes (available at http://pslip.bii.a-star.edu.sg).

Keywords: biosvm
[Sanguinetti2005Predicting] M. C. Sanguinetti and J. S. Mitcheson. Predicting drug-hERG channel interactions that cause acquired long QT syndrome. Trends Pharmacol. Sci., 26(3):119-124, Mar 2005. [ bib | DOI | http ]
Avoiding drug-induced cardiac arrhythmia is recognized as a major hurdle in the successful development of new drugs. The most common problem is acquired long QT syndrome caused by drugs that block human ether-a-go-go-related-gene (hERG) K(+) channels, delay cardiac repolarization and increase the risk of torsades de pointes arrhythmia (TdP). Not all hERG channel blockers induce TdP because they can also modulate other channels that counteract the hERG channel-mediated effect. However, hERG channel blockade is an important indicator of potential pro-arrhythmic liability. The molecular determinants of hERG channel blockade have been defined using a site-directed mutagenesis approach. Combined with pharmacophore models, knowledge of the drug-binding site of hERG channels will facilitate in silico design efforts to discover drugs that are devoid of this rare, but potentially lethal, side-effect.

Keywords: herg
[Saeh2005Lead] J. Saeh, P. Lyne, B. Takasaki, and D. Cosgrove. Lead hopping using SVM and 3D pharmacophore fingerprints. J Chem Inf Model, 45(4):1122-1133, Jul 2005. [ bib | DOI | http | .pdf ]
The combination of 3D pharmacophore fingerprints and the support vector machine classification algorithm has been used to generate robust models that are able to classify compounds as active or inactive in a number of G-protein-coupled receptor assays. The models have been tested against progressively more challenging validation sets where steps are taken to ensure that compounds in the validation set are chemically and structurally distinct from the training set. In the most challenging example, we simulate a lead-hopping experiment by excluding an entire class of compounds (defined by a core substructure) from the training set. The left-out active compounds comprised approximately 40% of the actives. The model trained on the remaining compounds is able to recall 75% of the actives from the "new" lead series while correctly classifying >99% of the 5000 inactives included in the validation set.

Keywords: biosvm chemoinformatics
[Roesch2005Chemotaxonomic] Petra Rösch, Michaela Harz, Michael Schmitt, Klaus-Dieter Peschke, Olaf Ronneberger, Hans Burkhardt, Hans-Walter Motzkus, Markus Lankers, Stefan Hofer, Hans Thiele, and Jürgen Popp. Chemotaxonomic identification of single bacteria by micro-Raman spectroscopy: application to clean-room-relevant biological contaminations. Appl Environ Microbiol, 71(3):1626-37, Mar 2005. [ bib | DOI | http | .pdf ]
Microorganisms, such as bacteria, which might be present as contamination inside an industrial food or pharmaceutical clean room process need to be identified on short time scales in order to minimize possible health hazards as well as production downtimes causing financial deficits. Here we describe the first results of single-particle micro-Raman measurements in combination with a classification method, the so-called support vector machine technique, allowing for a fast, reliable, and nondestructive online identification method for single bacteria.

[Raetsch2005RASE] G. Rätsch, S. Sonnenburg, and B. Schölkopf. RASE: recognition of alternatively spliced exons in C. elegans. Bioinformatics, 21(Suppl. 1):i369-i377, Jun 2005. [ bib | DOI | http | .pdf ]
MOTIVATION: Eukaryotic pre-mRNAs are spliced to form mature mRNA. Pre-mRNA alternative splicing greatly increases the complexity of gene expression. Estimates show that more than half of the human genes and at least one-third of the genes of less complex organisms, such as nematodes or flies, are alternatively spliced. In this work, we consider one major form of alternative splicing, namely the exclusion of exons from the transcript. It has been shown that alternatively spliced exons have certain properties that distinguish them from constitutively spliced exons. Although most recent computational studies on alternative splicing apply only to exons which are conserved among two species, our method only uses information that is available to the splicing machinery, i.e. the DNA sequence itself. We employ advanced machine learning techniques in order to answer the following two questions: (1) Is a certain exon alternatively spliced? (2) How can we identify yet unidentified exons within known introns? RESULTS: We designed a support vector machine (SVM) kernel well suited for the task of classifying sequences with motifs having positional preferences. In order to solve the task (1), we combine the kernel with additional local sequence information, such as lengths of the exon and the flanking introns. The resulting SVM-based classifier achieves a true positive rate of 48.5% at a false positive rate of 1%. By scanning over single EST confirmed exons we identified 215 potential alternatively spliced exons. For 10 randomly selected such exons we successfully performed biological verification experiments and confirmed three novel alternatively spliced exons. To answer question (2), we additionally used SVM-based predictions to recognize acceptor and donor splice sites. Combined with the above mentioned features we were able to identify 85.2% of skipped exons within known introns at a false positive rate of 1%. 
AVAILABILITY: Datasets, model selection results, our predictions and additional experimental results are available at http://www.fml.tuebingen.mpg.de/raetsch/RASE. CONTACT: Gunnar.Raetsch@tuebingen.mpg.de. SUPPLEMENTARY INFORMATION: http://www.fml.tuebingen.mpg.de/raetsch/RASE.

Keywords: biosvm
[Ruepp2005Assessment] S. Ruepp, F. Boess, L. Suter, M. C. de Vera, G. Steiner, T. Steele, T. Weiser, and S. Albertini. Assessment of hepatotoxic liabilities by transcript profiling. Toxicol Appl Pharmacol, Jun 2005. [ bib | DOI | http | .pdf ]
Male Wistar rats were treated with various model compounds or the appropriate vehicle controls in order to create a reference database for toxicogenomics assessment of novel compounds. Hepatotoxic compounds in the database were either known hepatotoxicants or showed hepatotoxicity during preclinical testing. Histopathology and clinical chemistry data were used to anchor the transcript profiles to an established endpoint (steatosis, cholestasis, direct acting, peroxisomal proliferation or nontoxic/control). These reference data were analyzed using a supervised learning method (support vector machines, SVM) to generate classification rules. This predictive model was subsequently used to assess compounds with regard to a potential hepatotoxic liability. A steatotic and a non-hepatotoxic 5HT(6) receptor antagonist compound from the same series were successfully discriminated by this toxicogenomics model. Additionally, an example is shown where a hepatotoxic liability was correctly recognized in the absence of pathological findings. In vitro experiments and a dog study confirmed the correctness of the toxicogenomics alert. Another interesting observation was that transcript profiles indicate toxicologically relevant changes at an earlier timepoint than routinely used methods. Together, these results support the useful application of toxicogenomics in raising alerts for adverse effects and generating mechanistic hypotheses that can be followed up by confirmatory experiments.

Keywords: biosvm
[Rudd2005Eclair] S. Rudd and I. V. Tetko. Eclair - a web service for unravelling species origin of sequences sampled from mixed host interfaces. Nucleic Acids Res, 33(Web Server issue):W724-7, Jul 2005. [ bib | DOI | http | .pdf ]
The identification of the genes that participate at the biological interface of two species remains critical to our understanding of the mechanisms of disease resistance, disease susceptibility and symbiosis. The sequencing of complementary DNA (cDNA) libraries prepared from the biological interface between two organisms provides an inexpensive way to identify the novel genes that may be expressed as a cause or consequence of compatible or incompatible interactions. Sequence classification and annotation of species origin typically use an orthology-based approach and require access to large portions of either genome, or a close relative. Novel species- or clade-specific sequences may have no counterpart within existing databases and remain ambiguous features. Here we present a web-service, Eclair, which utilizes support vector machines for the classification of the origin of expressed sequence tags stemming from mixed host cDNA libraries. In addition to providing an interface for the classification of sequences, users are presented with the opportunity to train a model to suit their preferred species pair. Eclair is freely available at http://eclair.btk.fi.

Keywords: biosvm
[Rual2005Towards] Jean-François Rual, Kavitha Venkatesan, Tong Hao, Tomoko Hirozane-Kishikawa, Amélie Dricot, Ning Li, Gabriel F Berriz, Francis D Gibbons, Matija Dreze, Nono Ayivi-Guedehoussou, Niels Klitgord, Christophe Simon, Mike Boxem, Stuart Milstein, Jennifer Rosenberg, Debra S Goldberg, Lan V Zhang, Sharyl L Wong, Giovanni Franklin, Siming Li, Joanna S Albala, Janghoo Lim, Carlene Fraughton, Estelle Llamosas, Sebiha Cevik, Camille Bex, Philippe Lamesch, Robert S Sikorski, Jean Vandenhaute, Huda Y Zoghbi, Alex Smolyar, Stephanie Bosak, Reynaldo Sequerra, Lynn Doucette-Stamm, Michael E Cusick, David E Hill, Frederick P Roth, and Marc Vidal. Towards a proteome-scale map of the human protein-protein interaction network. Nature, 437(7062):1173-1178, Oct 2005. [ bib | DOI | http ]
Systematic mapping of protein-protein interactions, or 'interactome' mapping, was initiated in model organisms, starting with defined biological processes and then expanding to the scale of the proteome. Although far from complete, such maps have revealed global topological and dynamic features of interactome networks that relate to known biological properties, suggesting that a human interactome map will provide insight into development and disease mechanisms at a systems level. Here we describe an initial version of a proteome-scale map of human binary protein-protein interactions. Using a stringent, high-throughput yeast two-hybrid system, we tested pairwise interactions among the products of approximately 8,100 currently available Gateway-cloned open reading frames and detected approximately 2,800 interactions. This data set, called CCSB-HI1, has a verification rate of approximately 78% as revealed by an independent co-affinity purification assay, and correlates significantly with other biological attributes. The CCSB-HI1 data set increases by approximately 70% the set of available binary interactions within the tested space and reveals more than 300 new connections to over 100 disease-associated proteins. This work represents an important step towards a systematic and comprehensive human interactome project.

Keywords: Cloning, Molecular; Humans; Open Reading Frames; Protein Binding; Proteome; RNA; Saccharomyces cerevisiae; Two-Hybrid System Techniques
[Rose2005Correlation] J. R. Rose, W. H. Turkett, Jr., I. C. Oroian, W. W. Laegreid, and J. Keele. Correlation of amino acid preference and mammalian viral genome type. Bioinformatics, 2005. [ bib | DOI | http | .pdf ]
MOTIVATION: In the event of an outbreak of a disease caused by an initially unknown pathogen, the ability to characterize anonymous sequences prior to isolation and culturing of the pathogen will be helpful. We show that it is possible to classify viral sequences by genome type (dsDNA, ssDNA, ssRNA positive strand, ssRNA negative strand, retroid) using amino acid distribution. RESULTS: In this paper we describe the results of analysis of amino acid preference in mammalian viruses. The study was carried out at the genome level as well as two shorter sequence levels: short (300 amino acids) and medium length (660 amino acids). The analysis indicates a correlation between the viral genome types dsDNA, ssDNA, ssRNA positive strand, ssRNA negative strand, and retroid and amino acid preference. We investigated three different models of amino acid preference. The simplest amino acid preference model, 1-AAP, is a normalized description of the frequency of amino acids in genomes of a viral genome type. A slightly more complex model is the ordered pair amino acid preference model (2-AAP), which characterizes genomes of different viral genome types by the frequency of ordered pairs of amino acids. The most complex and accurate model is the ordered triple amino acid preference model (3-AAP), which is based on ordered triples of amino acids. The results demonstrate that mammalian viral genome types differ in their amino acid preference. AVAILABILITY: The tools used to format and analyze data and supplementary material are available at http://www.cse.sc.edurose/aminoPreference/index.html.

Keywords: biosvm
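
Illustrative note (not part of the cited paper): the 1-AAP, 2-AAP, and 3-AAP descriptions in the abstract above can be read as normalized frequencies of ordered k-tuples of residues. The sketch below assumes that the tuples are taken from adjacent positions in a single sequence, which the abstract does not state; the function name `aap` is hypothetical.

```python
from collections import Counter


def aap(sequence: str, k: int = 1) -> dict:
    """Normalized frequency of ordered k-tuples of adjacent residues.

    k=1 gives single-residue frequencies, k=2 ordered pairs, and
    k=3 ordered triples. Adjacency is an assumption of this sketch.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Slide a window of width k over the sequence.
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(windows)
    total = sum(counts.values())
    # An empty result means the sequence is shorter than k.
    return {tup: n / total for tup, n in counts.items()} if total else {}


if __name__ == "__main__":
    toy = "MKKLA"  # hypothetical peptide, for illustration only
    for k in (1, 2, 3):
        print(k, aap(toy, k))
```

Feature vectors of this kind could then be passed to any classifier; which classifier the authors used is not stated in the abstract.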
[Rolland2005G-protein-coupled] C. Rolland, R. Gozalbes, A. Nicolaï, M.-F. Paugam, L. Coussy, F. Barbosa, D. Horvath, and F. Revah. G-protein-coupled receptor affinity prediction based on the use of a profiling dataset: QSAR design, synthesis, and experimental validation. J. Med. Chem., 48(21):6563-6574, Oct 2005. [ bib | DOI | http ]
A QSAR model accounting for "average" G-protein-coupled receptor (GPCR) binding was built from a large set of experimental standardized binding data (1939 compounds systematically tested over 40 different GPCRs) and applied to the design of a library of "GPCR-predicted" compounds. Three hundred and sixty of these compounds were randomly selected and tested in 21 GPCR binding assays. Positives were defined by their ability to inhibit by more than 70% the binding of reference compounds at 10 microM. A 5.5-fold enrichment in positives was observed when comparing the "GPCR-predicted" compounds with 600 randomly selected compounds predicted as "non-GPCR" from a general collection. The model was efficient in predicting strongest binders, since enrichment was greater for higher cutoffs. Significant enrichment was also observed for peptidic GPCRs and receptors not included to develop the QSAR model, suggesting the usefulness of the model to design ligands binding with newly identified GPCRs, including orphan ones.

Keywords: chemogenomics
[Rohlf2005J] F. James Rohlf. J. Felsenstein, Inferring Phylogenies, Sinauer Assoc., 2004, pp. xx + 664. J. Classif., 22(1):139-142, 2005. [ bib | DOI ]
[Rice2005Mining] Simon B Rice, Goran Nenadic, and Benjamin J Stapley. Mining protein function from text using term-based support vector machines. BMC Bioinformatics, 6 Suppl 1:S22, 2005. [ bib | DOI | http | .pdf ]
BACKGROUND: Text mining has spurred huge interest in the domain of biology. The goal of the BioCreAtIvE exercise was to evaluate the performance of current text mining systems. We participated in Task 2, which addressed assigning Gene Ontology terms to human proteins and selecting relevant evidence from full-text documents. We approached it as a modified form of the document classification task. We used a supervised machine-learning approach (based on support vector machines) to assign protein function and select passages that support the assignments. As classification features, we used a protein's co-occurring terms that were automatically extracted from documents. RESULTS: The results evaluated by curators were modest, and quite variable for different problems: in many cases we have relatively good assignment of GO terms to proteins, but the selected supporting text was typically non-relevant (precision spanning from 3% to 50%). The method appears to work best when a substantial set of relevant documents is obtained, while it works poorly on single documents and/or short passages. The initial results suggest that our approach can also mine annotations from text even when an explicit statement relating a protein to a GO term is absent. CONCLUSION: A machine learning approach to mining protein function predictions from text can yield good performance only if sufficient training data is available, and significant amount of supporting data is used for prediction. The most promising results are for combined document retrieval and GO term assignment, which calls for the integration of methods developed in BioCreAtIvE Task 1 and Task 2.

Keywords: biosvm
[Rice2005Reconstructing] J. J. Rice, Y. Tu, and G. Stolovitzky. Reconstructing biological networks using conditional correlation analysis. Bioinformatics, 21(6):765-773, Mar 2005. [ bib | DOI | http ]
MOTIVATION: One of the present challenges in biological research is the organization of the data originating from high-throughput technologies. One way in which this information can be organized is in the form of networks of influences, physical or statistical, between cellular components. We propose an experimental method for probing biological networks, analyzing the resulting data and reconstructing the network architecture. METHODS: We use networks of known topology consisting of nodes (genes), directed edges (gene-gene interactions) and a dynamics for the genes' mRNA concentrations in terms of the gene-gene interactions. We proposed a network reconstruction algorithm based on the conditional correlation of the mRNA equilibrium concentration between two genes given that one of them was knocked down. Using simulated gene expression data on networks of known connectivity, we investigated how the reconstruction error is affected by noise, network topology, size, sparseness and dynamic parameters. RESULTS: Errors arise from correlation between nodes connected through intermediate nodes (false positives) and when the correlation between two directly connected nodes is obscured by noise, non-linearity or multiple inputs to the target node (false negatives). Two critical components of the method are as follows: (1) the choice of an optimal correlation threshold for predicting connections and (2) the reduction of errors arising from indirect connections (for which a novel algorithm is proposed). With these improvements, we can reconstruct networks with the topology of the transcriptional regulatory network in Escherichia coli with a reasonably low error rate.

Keywords: Algorithms; Computer Simulation; Gene Expression Profiling; Gene Expression Regulation; Models, Biological; Models, Statistical; Oligonucleotide Array Sequence Analysis; Protein Interaction Mapping; Signal Transduction; Statistics as Topic; Transcription Factors
[Rhodes2005Probabilistic] D. R. Rhodes, S. A. Tomlins, S. Varambally, V. Mahavisno, T. Barrette, S. Kalyana-Sundaram, D. Ghosh, A. Pandey, and A. M. Chinnaiyan. Probabilistic model of the human protein-protein interaction network. Nat Biotechnol, 23(8):951-959, Aug 2005. [ bib | DOI | http | .pdf ]
A catalog of all human protein-protein interactions would provide scientists with a framework to study protein deregulation in complex diseases such as cancer. Here we demonstrate that a probabilistic analysis integrating model organism interactome data, protein domain data, genome-wide gene expression data and functional annotation data predicts nearly 40,000 protein-protein interactions in humans, a result comparable to those obtained with experimental and computational approaches in model organisms. We validated the accuracy of the predictive model on an independent test set of known interactions and also experimentally confirmed two predicted interactions relevant to human cancer, implicating uncharacterized proteins into definitive pathways. We also applied the human interactome network to cancer genomics data and identified several interaction subnetworks activated in cancer. This integrative analysis provides a comprehensive framework for exploring the human protein interaction network.

[Rhodes2005Mining] D. R. Rhodes, S. Kalyana-Sundaram, V. Mahavisno, T. R. Barrette, D. Ghosh, and A. M. Chinnaiyan. Mining for regulatory programs in the cancer transcriptome. Nat. Genet., 37(6):579-583, Jun 2005. [ bib | DOI | http | .pdf ]
DNA microarrays have been widely applied to cancer transcriptome analysis. The Oncomine database contains a large collection of such data, as well as hundreds of derived gene-expression signatures. We studied the regulatory mechanisms responsible for gene deregulation in these cancer signatures by searching for the coordinate regulation of genes with common transcription factor binding sites. We found that genes with binding sites for the archetypal cancer transcription factor, E2F, were disproportionately overexpressed in a wide variety of cancers, whereas genes with binding sites for other transcription factors, such as Myc-Max, c-Rel and ATF, were disproportionately overexpressed in specific cancer types. These results suggest that alterations in pathways activating these transcription factors may be responsible for the observed gene deregulation and cancer pathogenesis.

[Rhodes2005Integrative] D. R. Rhodes and A. M. Chinnaiyan. Integrative analysis of the cancer transcriptome. Nat. Genet., 37 Suppl:S31-S37, Jun 2005. [ bib | DOI | http | .pdf ]
DNA microarrays have been widely applied to the study of human cancer, delineating myriad molecular subtypes of cancer, many of which are associated with distinct biological underpinnings, disease progression and treatment response. These primary analyses have begun to decipher the molecular heterogeneity of cancer, but integrative analyses that evaluate cancer transcriptome data in the context of other data sources are often capable of extracting deeper biological insight from the data. Here we discuss several such integrative computational and analytical approaches, including meta-analysis, functional enrichment analysis, interactome analysis, transcriptional network analysis and integrative model system analysis.

[Rensing2005Protein] Stefan A Rensing, Dana Fritzowsky, Daniel Lang, and Ralf Reski. Protein encoding genes in an ancient plant: analysis of codon usage, retained genes and splice sites in a moss, Physcomitrella patens. BMC Genomics, 6(1):43, Mar 2005. [ bib | DOI | http | .pdf ]
BACKGROUND: The moss Physcomitrella patens is an emerging plant model system due to its high rate of homologous recombination, haploidy, simple body plan, physiological properties as well as phylogenetic position. Available EST data was clustered and assembled, and provided the basis for a genome-wide analysis of protein encoding genes. RESULTS: We have clustered and assembled Physcomitrella patens EST and CDS data in order to represent the transcriptome of this non-seed plant. Clustering of the publicly available data and subsequent prediction resulted in a total of 19,081 non-redundant ORF. Of these putative transcripts, approximately 30% have a homolog in both rice and Arabidopsis transcriptome. More than 130 transcripts are not present in seed plants but can be found in other kingdoms. These potential "retained genes" might have been lost during seed plant evolution. Functional annotation of these genes reveals unequal distribution among taxonomic groups and intriguing putative functions such as cytotoxicity and nucleic acid repair. Whereas introns in the moss are larger on average than in the seed plant Arabidopsis thaliana, position and amount of introns are approximately the same. Contrary to Arabidopsis, where CDS contain on average 44% G/C, in Physcomitrella the average G/C content is 50%. Interestingly, moss orthologs of Arabidopsis genes show a significant drift of codon fraction usage, towards the seed plant. While averaged codon bias is the same in Physcomitrella and Arabidopsis, the distribution pattern is different, with 15% of moss genes being unbiased. Species-specific, sensitive and selective splice site prediction for Physcomitrella has been developed using a dataset of 368 donor and acceptor sites, utilizing a support vector machine. The prediction accuracy is better than those achieved with tools trained on Arabidopsis data. 
CONCLUSION: Analysis of the moss transcriptome displays differences in gene structure, codon and splice site usage in comparison with the seed plant Arabidopsis. Putative retained genes exhibit possible functions that might explain the peculiar physiological properties of mosses. Both the transcriptome representation (including a BLAST and retrieval service) and splice site prediction have been made available on http://www.cosmoss.org, setting the basis for assembly and annotation of the Physcomitrella genome, of which draft shotgun sequences will become available in 2005.

Keywords: biosvm
[Rennie2005Fast] J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd international conference on Machine learning, pages 713-719, New York, NY, USA, 2005. ACM Press. [ bib | DOI | http ]
[Ren2005HIV] J. Ren and D. K. Stammers. HIV reverse transcriptase structures: designing new inhibitors and understanding mechanisms of drug resistance. Trends Pharmacol. Sci., 26:4-7, 2005. [ bib ]
[Rangwala2005Profile-based] H. Rangwala and G. Karypis. Profile-based direct kernels for remote homology detection and fold recognition. Bioinformatics, 21(23):4239-4247, Dec 2005. [ bib | DOI | http ]
MOTIVATION: Protein remote homology detection is a central problem in computational biology. Supervised learning algorithms based on support vector machines are currently one of the most effective methods for remote homology detection. The performance of these methods depends on how the protein sequences are modeled and on the method used to compute the kernel function between them. RESULTS: We introduce two classes of kernel functions that are constructed by combining sequence profiles with new and existing approaches for determining the similarity between pairs of protein sequences. These kernels are constructed directly from these explicit protein similarity measures and employ effective profile-to-profile scoring schemes for measuring the similarity between pairs of proteins. Experiments with remote homology detection and fold recognition problems show that these kernels are capable of producing results that are substantially better than those produced by all of the existing state-of-the-art SVM-based methods. In addition, the experiments show that these kernels, even when used in the absence of profiles, produce results that are better than those produced by existing non-profile-based schemes. AVAILABILITY: The programs for computing the various kernel functions are available on request from the authors.

Keywords: biosvm
[Ralaivola2005Graph] L. Ralaivola, S. J. Swamidass, H. Saigo, and P. Baldi. Graph kernels for chemical informatics. Neural Netw., 18(8):1093-1110, Sep 2005. [ bib | DOI | http | .pdf ]
Increased availability of large repositories of chemical compounds is creating new challenges and opportunities for the application of machine learning methods to problems in computational chemistry and chemical informatics. Because chemical compounds are often represented by the graph of their covalent bonds, machine learning methods in this domain must be capable of processing graphical structures with variable size. Here, we first briefly review the literature on graph kernels and then introduce three new kernels (Tanimoto, MinMax, Hybrid) based on the idea of molecular fingerprints and counting labeled paths of depth up to d using depth-first search from each possible vertex. The kernels are applied to three classification problems to predict mutagenicity, toxicity, and anti-cancer activity on three publicly available data sets. The kernels achieve performances at least comparable, and most often superior, to those previously reported in the literature reaching accuracies of 91.5% on the Mutag dataset, 65-67% on the PTC (Predictive Toxicology Challenge) dataset, and 72% on the NCI (National Cancer Institute) dataset. Properties and tradeoffs of these kernels, as well as other proposed kernels that leverage 1D or 3D representations of molecules, are briefly discussed.

Keywords: chemoinformatics
[Rajagopalan2005Inferring] D. Rajagopalan and P. Agarwal. Inferring pathways from gene lists using a literature-derived network of biological relationships. Bioinformatics, 21(6):788-793, Mar 2005. [ bib | DOI | http | .pdf ]
MOTIVATION: A number of omic technologies such as transcriptional profiling, proteomics, literature searches, genetic association, etc. help in the identification of sets of important genes. A subset of these genes may act in a coordinated manner, possibly because they are part of the same biological pathway. Interpreting such gene lists and relating them to pathways is a challenging task. Databases of biological relationships between thousands of mammalian genes can help in deciphering omics data. The relationships between genes can be assembled into a biological network with each protein as a node and each relationship as an edge between two proteins (or nodes). This network may then be searched for subnetworks consisting largely of interesting genes from the omics experiment. The subset of genes in the subnetwork along with the web of relationships between them helps to decipher the underlying pathways. Finding such subnetworks that maximally include all proteins from the query set but few others is the focus for this paper. RESULTS: We present a heuristic algorithm and a scoring function that work well both on simulated data and on data from known pathways. The scoring function is an extension of a previous study for a single biological experiment. We use a simple set of heuristics that provide a more efficient solution than the simulated annealing method. We find that our method works on reasonably complex curated networks containing approximately 9000 biological entities (genes and metabolites), and approximately 30,000 biological relationships. We also show that our method can pick up a pathway signal from a query list including a moderate number of genes unrelated to the pathway. In addition, we quantify the sensitivity and specificity of the technique.

[Raghava2005Correlation] Gajendra P S Raghava and Joon H Han. Correlation and prediction of gene expression level from amino acid and dipeptide composition of its protein. BMC Bioinformatics, 6(1):59, Mar 2005. [ bib | DOI | http ]
BACKGROUND: A large number of papers have been published on the analysis of microarray data, with particular emphasis on normalization of data, detection of differentially expressed genes, clustering of genes and regulatory networks. On the other hand, there are only a few studies on the relation between expression level and composition of nucleotide/protein sequence, using expression data. There is a need to understand why particular genes/proteins express more in particular conditions. In this study, we analyze 3468 genes of Saccharomyces cerevisiae obtained from Holstege et al. (1998) to understand the relationship between expression level and amino acid composition. RESULTS: We compute the correlation between expression of a gene and amino acid composition of its protein. It was observed that some residues (like Ala, Gly, Arg and Val) have significant positive correlation (r > 0.20) and some other residues (like Asp, Leu, Asn and Ser) have negative correlation (r < -0.15) with the expression of genes. A significant negative correlation (r = -0.18) was also found between length and gene expression. These observations indicate the relationship between percent composition and gene expression level. Thus, attempts have been made to develop a Support Vector Machine (SVM) based method for predicting the expression level of genes from its protein sequence. In this method the SVM is trained with proteins whose gene expression data is known in a given condition. Then the trained SVM is used to predict the gene expression of other proteins of the same organism in the same condition. A correlation coefficient r = 0.70 was obtained between predicted and experimentally determined expression of genes, which improves from r = 0.70 to 0.72 when dipeptide composition was used instead of residue composition. The method was evaluated using 5-fold cross validation test.
We also demonstrate that amino acid composition information along with gene expression data can be used for improving the function classification of proteins. CONCLUSION: There is a correlation between gene expression and amino acid composition that can be used to predict the expression level of genes up to a certain extent. A web server based on the above strategy has been developed for calculating the correlation between amino acid composition and gene expression and prediction of expression level http://kiwi.postech.ac.kr/raghava/lgepred/. This server will allow users to study the evolution from expression data.

Keywords: biosvm
[Radulescu2005JBSI] O Radulescu, S Lagarrigue, A Siegel, M Le Borgne, and P Veber. Topology and static response of interaction networks in molecular biology. J. R. Soc. Interface, Published online, 2005. [ bib ]
[Qiu2005computational] S. Qiu, C. M. Adema, and T. Lane. A computational study of off-target effects of RNA interference. Nucleic Acids Res., 33(6):1834-1847, 2005. [ bib | DOI | http ]
RNA interference (RNAi) is an intracellular mechanism for post-transcriptional gene silencing that is frequently used to study gene function. RNAi is initiated by short interfering RNA (siRNA) of approximately 21 nt in length, either generated from the double-stranded RNA (dsRNA) by using the enzyme Dicer or introduced experimentally. Following association with an RNAi silencing complex, siRNA targets mRNA transcripts that have sequence identity for destruction. A phenotype resulting from this knockdown of expression may inform about the function of the targeted gene. However, 'off-target effects' compromise the specificity of RNAi if sequence identity between siRNA and random mRNA transcripts causes RNAi to knockdown expression of non-targeted genes. The complete off-target effects must be investigated systematically on each gene in a genome by adjusting a group of parameters, which is too expensive to conduct experimentally and motivates a study in silico. This computational study examined the potential for off-target effects of RNAi, employing the genome and transcriptome sequence data of Homo sapiens, Caenorhabditis elegans and Schizosaccharomyces pombe. The chance for RNAi off-target effects proved considerable, ranging from 5 to 80% for each of the organisms, when using as parameter the exact identity between any possible siRNA sequences (arbitrary length ranging from 17 to 28 nt) derived from a dsRNA (range 100-400 nt) representing the coding sequences of target genes and all other siRNAs within the genome. Remarkably, high-sequence specificity and low probability for off-target reactivity were optimally balanced for siRNA of 21 nt, the length observed mostly in vivo. The chance for off-target RNAi increased (although not always significantly) with greater length of the initial dsRNA sequence, inclusion into the analysis of available untranslated region sequences and allowing for mismatches between siRNA and target sequences. 
siRNA sequences from within 100 nt of the 5' termini of coding sequences had low chances for off-target reactivity. This may be owing to coding constraints for signal peptide-encoding regions of genes relative to regions that encode for mature proteins. Off-target distribution varied along the chromosomes of C. elegans, apparently owing to the use of more unique sequences in gene-dense regions. Finally, biological and thermodynamical descriptors of effective siRNA reduced the number of potential siRNAs compared with those identified by sequence identity alone, but off-target RNAi remained likely, with an off-target error rate of approximately 10%. These results also suggest a direction for future in vivo studies that could both help in calibrating true off-target rates in living organisms and also in contributing evidence toward the debate of whether siRNA efficacy is correlated with, or independent of, the target molecule. In summary, off-target effects present a real but not prohibitive concern that should be considered for RNAi experiments.

Keywords: sirna
[Qin2005Application] Zhong Qin, Qiang Yu, Jun Li, Zhi-Yi Wu, and Bing-Min Hu. Application of least squares vector machines in modelling water vapor and carbon dioxide fluxes over a cropland. J Zhejiang Univ Sci, 6(6):491-5, Jun 2005. [ bib | DOI | http | .pdf ]
Least squares support vector machines (LS-SVMs), a nonlinear kernel-based machine, were introduced to investigate the prospects of applying this approach to modelling water vapor and carbon dioxide fluxes above a summer maize field, using a dataset obtained in the North China Plain with the eddy covariance technique. The performance of the LS-SVMs was compared with that of the corresponding models obtained with radial basis function (RBF) neural networks. The results indicated that the trained LS-SVMs with a radial basis function kernel performed satisfactorily in modelling surface fluxes; their excellent approximation and generalization properties shed new light on the study of complex processes in ecosystems.

[Perez-Cruz2005Convergence] Fernando Pérez-Cruz, Carlos Bousoño-Calzón, and Antonio Artés-Rodríguez. Convergence of the IRWLS Procedure to the Support Vector Machine Solution. Neural Comput, 17(1):7-18, Jan 2005. [ bib ]
An iterative reweighted least squares (IRWLS) procedure recently proposed is shown to converge to the support vector machine solution. The convergence to a stationary point is ensured by modifying the original IRWLS procedure.

Keywords: 80 and over, Aged, Algorithms, Amino Acids, Animals, Area Under Curve, Automated, Brain Chemistry, Brain Neoplasms, Comparative Study, Computer-Assisted, Cross-Sectional Studies, Decision Trees, Diagnosis, Diagnostic Imaging, Diagnostic Techniques, Discriminant Analysis, Evolution, Face, Genetic, Glaucoma, Humans, Lasers, Least-Squares Analysis, Magnetic Resonance Imaging, Magnetic Resonance Spectroscopy, Middle Aged, Models, Molecular, Nerve Fibers, Non-U.S. Gov't, Numerical Analysis, Ophthalmological, Optic Nerve Diseases, P.H.S., Pattern Recognition, Photic Stimulation, Protein, ROC Curve, Regression Analysis, Research Support, Retinal Ganglion Cells, Sensitivity and Specificity, Sequence Analysis, Statistics, U.S. Gov't, beta-Lactamases, 15779160
[Prill2005PlosBiol] Robert J Prill, Pablo A Iglesias, and Andre Levchenko. Dynamic properties of network motifs contribute to biological network organization. PLoS Biol, 3(11):e343, Nov 2005. [ bib | DOI | http ]
Biological networks, such as those describing gene regulation, signal transduction, and neural synapses, are representations of large-scale dynamic systems. Discovery of organizing principles of biological networks can be enhanced by embracing the notion that there is a deep interplay between network structure and system dynamics. Recently, many structural characteristics of these non-random networks have been identified, but dynamical implications of the features have not been explored comprehensively. We demonstrate by exhaustive computational analysis that a dynamical property (stability, or robustness to small perturbations) is highly correlated with the relative abundance of small subnetworks (network motifs) in several previously determined biological networks. We propose that robust dynamical stability is an influential property that can determine the non-random structure of biological networks.

Keywords: Animals; Caenorhabditis elegans, physiology; Computational Biology, methods; Computer Simulation; Drosophila melanogaster, physiology; Escherichia coli, physiology; Models, Biological; Nerve Net; Saccharomyces cerevisiae, physiology; Signal Transduction; Statistics as Topic; Systems Theory; Transcription, Genetic
[Plewczyski2005support] Dariusz Plewczynski, Adrian Tkacz, Adam Godzik, and Leszek Rychlewski. A support vector machine approach to the identification of phosphorylation sites. Cell Mol Biol Lett, 10(1):73-89, 2005. [ bib | .pdf ]
We describe a bioinformatics tool that can be used to predict the position of phosphorylation sites in proteins based only on sequence information. The method uses the support vector machine (SVM) statistical learning theory. The statistical models for phosphorylation by various types of kinases are built using a dataset of short (9-amino acid long) sequence fragments. The sequence segments are dissected around post-translationally modified sites of proteins that are on the current release of the Swiss-Prot database, and that were experimentally confirmed to be phosphorylated by any kinase. We represent them as vectors in a multidimensional abstract space of short sequence fragments. The prediction method is as follows. First, a given query protein sequence is dissected into overlapping short segments. All the fragments are then projected into the multidimensional space of sequence fragments via a collection of different representations. Those points are classified with pre-built statistical models (the SVM method with linear, polynomial and radial kernel functions) either as phosphorylated or inactive ones. The resulting list of plausible sites for phosphorylation by various types of kinases in the query protein is returned to the user. The efficiency of the method for each type of phosphorylation is estimated using leave-one-out tests and presented here. The sensitivities of the models can reach over 70%, depending on the type of kinase. The additional information from profile representations of short sequence fragments helps in gaining a higher degree of accuracy in some phosphorylation types. The further development of an automatic phosphorylation site annotation predictor based on our algorithm should yield a significant improvement when using statistical algorithms in order to quantify the results.

Keywords: biosvm
[Pinkel2005Array] D. Pinkel and D. G. Albertson. Array comparative genomic hybridization and its applications in cancer. Nat. Genet., 37 Suppl:S11-S17, Jun 2005. [ bib | DOI | http | .pdf ]
Alteration in DNA copy number is one of the many ways in which gene expression and function may be modified. Some variations are found among normal individuals, others occur in the course of normal processes in some species and still others participate in causing various disease states. For example, many defects in human development are due to gains and losses of chromosomes and chromosomal segments that occur before or shortly after fertilization, and DNA dosage-alteration changes occurring in somatic cells are frequent contributors to cancer. Detecting these aberrations and interpreting them in the context of broader knowledge facilitates the identification of crucial genes and pathways involved in biological processes and disease. Over the past several years, array comparative genomic hybridization has proven its value for analyzing DNA copy-number variations. Here, we discuss the state of the art of array comparative genomic hybridization and its applications in cancer, emphasizing general concepts rather than specific results.

Keywords: csbcbook, cgh, csbcbook-ch2
[Picard2005statistical] F. Picard, S. Robin, M. Lavielle, C. Vaisse, and J.-J. Daudin. A statistical approach for array CGH data analysis. BMC Bioinformatics, 6:27, 2005. [ bib | DOI | http | .pdf ]
BACKGROUND: Microarray-CGH experiments are used to detect and map chromosomal imbalances, by hybridizing targets of genomic DNA from a test and a reference sample to sequences immobilized on a slide. These probes are genomic DNA sequences (BACs) that are mapped on the genome. The signal has a spatial coherence that can be handled by specific statistical tools. Segmentation methods seem to be a natural framework for this purpose. A CGH profile can be viewed as a succession of segments that represent homogeneous regions in the genome whose BACs share the same relative copy number on average. We model a CGH profile by a random Gaussian process whose distribution parameters are affected by abrupt changes at unknown coordinates. Two major problems arise: to determine which parameters are affected by the abrupt changes (the mean and the variance, or the mean only), and the selection of the number of segments in the profile. RESULTS: We demonstrate that existing methods for estimating the number of segments are not well adapted in the case of array CGH data, and we propose an adaptive criterion that detects previously mapped chromosomal aberrations. The performances of this method are discussed based on simulations and publicly available data sets. Then we discuss the choice of modeling for array CGH data and show that the model with a homogeneous variance is adapted to this context. CONCLUSIONS: Array CGH data analysis is an emerging field that needs appropriate statistical tools. Process segmentation and model selection provide a theoretical framework that allows precise biological interpretations. Adaptive methods for model selection give promising results concerning the estimation of the number of altered regions on the genome.

[Pham2005Support] Tho Hoan Pham, Kenji Satou, and Tu Bao Ho. Support vector machines for prediction and analysis of beta and gamma-turns in proteins. J. Bioinform. Comput. Biol., 3(2):343-58, Apr 2005. [ bib ]
Tight turns have long been recognized as one of the three important features of proteins, together with alpha-helix and beta-sheet. Tight turns play an important role in globular proteins from both the structural and functional points of view. More than 90% of tight turns are beta-turns and most of the rest are gamma-turns. Analysis and prediction of beta-turns and gamma-turns is very useful for the design of new molecules such as drugs, pesticides, and antigens. In this paper we investigated two aspects of applying the support vector machine (SVM), a promising machine learning method for bioinformatics, to prediction and analysis of beta-turns and gamma-turns. First, we developed two SVM-based methods, called BTSVM and GTSVM, which predict beta-turns and gamma-turns in a protein from its sequence. When compared with other methods, BTSVM has a superior performance and GTSVM is competitive. Second, we used SVMs with a linear kernel to estimate the support of amino acids for the formation of beta-turns and gamma-turns depending on their position in a protein. Our analysis results are more comprehensive and easier to use than the previous results in designing turns in proteins.

Keywords: biosvm
[Peters2005Generating] Bjoern Peters and Alessandro Sette. Generating quantitative models describing the sequence specificity of biological processes with the stabilized matrix method. BMC Bioinformatics, 6:132, 2005. [ bib | DOI | http ]
BACKGROUND: Many processes in molecular biology involve the recognition of short sequences of nucleic or amino acids, such as the binding of immunogenic peptides to major histocompatibility complex (MHC) molecules. From experimental data, a model of the sequence specificity of these processes can be constructed, such as a sequence motif, a scoring matrix or an artificial neural network. The purpose of these models is two-fold. First, they can provide a summary of experimental results, allowing for a deeper understanding of the mechanisms involved in sequence recognition. Second, such models can be used to predict the experimental outcome for yet untested sequences. In the past we reported the development of a method to generate such models called the Stabilized Matrix Method (SMM). This method has been successfully applied to predicting peptide binding to MHC molecules, peptide transport by the transporter associated with antigen presentation (TAP) and proteasomal cleavage of protein sequences. RESULTS: Herein we report the implementation of the SMM algorithm as a publicly available software package. Specific features determining the type of problems the method is most appropriate for are discussed. Advantageous features of the package are: (1) the output generated is easy to interpret, (2) input and output are both quantitative, (3) specific computational strategies to handle experimental noise are built in, (4) the algorithm is designed to effectively handle bounded experimental data, (5) experimental data from randomized peptide libraries and conventional peptides can easily be combined, and (6) it is possible to incorporate pair interactions between positions of a sequence. CONCLUSION: Making the SMM method publicly available enables bioinformaticians and experimental biologists to easily access it, to compare its performance to other prediction methods, and to extend it to other applications.

Keywords: Algorithms; Amino Acid Sequence; Biology; Computational Biology; Computer Simulation; Data Interpretation, Statistical; Databases, Protein; Models, Biological; Models, Statistical; Neural Networks (Computer); Peptide Library; Peptides; Programming Languages; Protein Binding; Sensitivity and Specificity; Software
[Perkins2005Expanding] D O Perkins, C Jeffries, and P Sullivan. Expanding the 'central dogma': the regulatory role of nonprotein coding genes and implications for the genetic liability to schizophrenia. Molecular Psychiatry, 10:69-78, 2005. [ bib ]
Keywords: csbcbook
[Perez-Iratxeta2005] Carolina Perez-Iratxeta, Matthias Wjst, Peer Bork, and Miguel A Andrade. G2d: a tool for mining genes associated with disease. BMC Genet, 6:45, 2005. [ bib | DOI | http ]
BACKGROUND: Human inherited diseases can be associated by genetic linkage with one or more genomic regions. The availability of the complete sequence of the human genome allows examining those locations for an associated gene. We previously developed an algorithm to prioritize genes on a chromosomal region according to their possible relation to an inherited disease using a combination of data mining on biomedical databases and gene sequence analysis. RESULTS: We have implemented this method as a web application in our site G2D (Genes to Diseases). It allows users to inspect any region of the human genome to find candidate genes related to a genetic disease of their interest. In addition, the G2D server includes pre-computed analyses of candidate genes for 552 linked monogenic diseases without an associated gene, and the analysis of 18 asthma loci. CONCLUSION: G2D can be publicly accessed at http://www.ogic.ca/projects/g2d2/.

Keywords: Algorithms; Alzheimer Disease; Asthma; Genetic Diseases, Inborn; Genetic Predisposition to Disease; Humans; Internet; Linkage (Genetics)
[Peng2005Feature] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(8):1226-1238, 2005. [ bib ]
[Pawitan2005Gene] Y. Pawitan, J. Bjöhle, L. Amler, A. L. Borg, S. Egyhazi, P. Hall, X. Han, L. Holmberg, F. Huang, S. Klaar, et al. Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts. Breast Cancer Research, 7(6):R953, 2005. [ bib ]
[Patil2005Uncovering] K. R. Patil and J. Nielsen. Uncovering transcriptional regulation of metabolism by using metabolic network topology. Proc. Natl. Acad. Sci. U. S. A., 102(8):2685-2689, Feb 2005. [ bib | DOI | http | .pdf ]
Cellular response to genetic and environmental perturbations is often reflected and/or mediated through changes in the metabolism, because the latter plays a key role in providing Gibbs free energy and precursors for biosynthesis. Such metabolic changes are often exerted through transcriptional changes induced by complex regulatory mechanisms coordinating the activity of different metabolic pathways. It is difficult to map such global transcriptional responses by using traditional methods, because many genes in the metabolic network have relatively small changes at their transcription level. We therefore developed an algorithm that is based on hypothesis-driven data analysis to uncover the transcriptional regulatory architecture of metabolic networks. By using information on the metabolic network topology from genome-scale metabolic reconstruction, we show that it is possible to reveal patterns in the metabolic network that follow a common transcriptional response. Thus, the algorithm enables identification of so-called reporter metabolites (metabolites around which the most significant transcriptional changes occur) and a set of connected genes with significant and coordinated response to genetic or environmental perturbations. We find that cells respond to perturbations by changing the expression pattern of several genes involved in the specific part(s) of the metabolism in which a perturbation is introduced. These changes then are propagated through the metabolic network because of the highly connected nature of metabolism.

[Papadopoulos2005Characterization] A. Papadopoulos, D. I. Fotiadis, and A. Likas. Characterization of clustered microcalcifications in digitized mammograms using neural networks and support vector machines. Artif. Intell. Med., 34(2):141-50, Jun 2005. [ bib | DOI | http | .pdf ]
OBJECTIVE: Detection and characterization of microcalcification clusters in mammograms is vital in daily clinical practice. The scope of this work is to present a novel computer-based automated method for the characterization of microcalcification clusters in digitized mammograms. METHODS AND MATERIAL: The proposed method has been implemented in three stages: (a) the cluster detection stage to identify clusters of microcalcifications, (b) the feature extraction stage to compute the important features of each cluster and (c) the classification stage, which provides the final characterization. In the classification stage, a rule-based system, an artificial neural network (ANN) and a support vector machine (SVM) have been implemented and evaluated using receiver operating characteristic (ROC) analysis. The proposed method was evaluated using the Nijmegen and Mammographic Image Analysis Society (MIAS) mammographic databases. The original feature set was enhanced by the addition of four rule-based features. RESULTS AND CONCLUSIONS: In the case of the Nijmegen dataset, the performance of the SVM was Az=0.79 and 0.77 for the original and enhanced feature set, respectively, while for the MIAS dataset the corresponding characterization scores were Az=0.81 and 0.80. Utilizing neural network classification methodology, the corresponding performance for the Nijmegen dataset was Az=0.70 and 0.76 while for the MIAS dataset it was Az=0.73 and 0.78. Although the obtained high classification performance can be successfully applied to microcalcification cluster characterization, further studies must be carried out for the clinical evaluation of the system using larger datasets. The use of additional features originating either from the image itself (such as cluster location and orientation) or from the patient data may further improve the diagnostic value of the system.

Keywords: Apoptosis, Gene Expression Profiling, Humans, Neoplasms, Non-U.S. Gov't, Oligonucleotide Array Sequence Analysis, Polymerase Chain Reaction, Proteins, Research Support, Subcellular Fractions, Unknown Primary, 15894178
[Pang2005Face] Shaoning Pang, Daijin Kim, and Sung Yang Bang. Face membership authentication using SVM classification tree generated by membership-based LLE data partition. IEEE Trans Neural Netw, 16(2):436-46, Mar 2005. [ bib ]
This paper presents a new membership authentication method by face classification using a support vector machine (SVM) classification tree, in which the size of membership group and the members in the membership group can be changed dynamically. Unlike our previous SVM ensemble-based method, which performed only one face classification in the whole feature space, the proposed method employed a divide and conquer strategy that first performs a recursive data partition by membership-based locally linear embedding (LLE) data clustering, then does the SVM classification in each partitioned feature subset. Our experimental results show that the proposed SVM tree not only keeps the good properties that the SVM ensemble method has, such as a good authentication accuracy and the robustness to the change of members, but also has a considerable improvement on the stability under the change of membership group size.

Keywords: 80 and over, Aged, Algorithms, Area Under Curve, Cross-Sectional Studies, Decision Trees, Diagnostic Imaging, Diagnostic Techniques, Face, Glaucoma, Humans, Lasers, Least-Squares Analysis, Middle Aged, Nerve Fibers, Non-U.S. Gov't, Ophthalmological, Optic Nerve Diseases, P.H.S., Photic Stimulation, ROC Curve, Research Support, Retinal Ganglion Cells, Sensitivity and Specificity, Statistics, U.S. Gov't, 15787150
[Pahikkala2005Contextual] Tapio Pahikkala, Filip Ginter, Jorma Boberg, Jouni Jarvinen, and Tapio Salakoski. Contextual weighting for Support Vector Machines in literature mining: an application to gene versus protein name disambiguation. BMC Bioinformatics, 6(1):157, Jun 2005. [ bib | DOI | http | .pdf ]
BACKGROUND: The ability to distinguish between genes and proteins is essential for understanding biological text. Support Vector Machines (SVMs) have been proven to be very efficient in general data mining tasks. We explore their capability for the gene versus protein name disambiguation task. RESULTS: We incorporated into the conventional SVM a weighting scheme based on distances of context words from the word to be disambiguated. This weighting scheme increased the performance of SVMs by five percentage points giving performance better than 85% as measured by the area under ROC curve and outperformed the Weighted Additive Classifier, which also incorporates the weighting, and the Naive Bayes classifier. CONCLUSIONS: We show that the performance of SVMs can be improved by the proposed weighting scheme. Furthermore, our results suggest that in this study the increase of the classification performance due to the weighting is greater than that obtained by selecting the underlying classifier or the kernel part of the SVM.

Keywords: biosvm
[Oyang2005Data] Yen-Jen Oyang, Shien-Ching Hwang, Yu-Yen Ou, Chien-Yu Chen, and Zhi-Wei Chen. Data classification with radial basis function networks based on a novel kernel density estimation algorithm. IEEE Trans Neural Netw, 16(1):225-36, Jan 2005. [ bib ]
This paper presents a novel learning algorithm for efficient construction of the radial basis function (RBF) networks that can deliver the same level of accuracy as the support vector machines (SVMs) in data classification applications. The proposed learning algorithm works by constructing one RBF subnetwork to approximate the probability density function of each class of objects in the training data set. With respect to algorithm design, the main distinction of the proposed learning algorithm is the novel kernel density estimation algorithm that features an average time complexity of O(n log n), where n is the number of samples in the training data set. One important advantage of the proposed learning algorithm, in comparison with the SVM, is that the proposed learning algorithm generally takes far less time to construct a data classifier with an optimized parameter setting. This feature is of significance for many contemporary applications, in particular, for those applications in which new objects are continuously added into an already large database. Another desirable feature of the proposed learning algorithm is that the RBF networks constructed are capable of carrying out data classification with more than two classes of objects in one single run. In other words, unlike with the SVM, there is no need to resort to mechanisms such as one-against-one or one-against-all for handling datasets with more than two classes of objects. The comparison with SVM is of particular interest, because it has been shown in a number of recent studies that SVM generally are able to deliver higher classification accuracy than the other existing data classification algorithms. As the proposed learning algorithm is instance-based, the data reduction issue is also addressed in this paper. 
One interesting observation in this regard is that, for all three data sets used in data reduction experiments, the number of training samples remaining after a naive data reduction mechanism is applied is quite close to the number of support vectors identified by the SVM software. This paper also compares the performance of the RBF networks constructed with the proposed learning algorithm and those constructed with a conventional cluster-based learning algorithm. The most interesting observation learned is that, with respect to data classification, the distributions of training samples near the boundaries between different classes of objects carry more crucial information than the distributions of samples in the inner parts of the clusters.

[Ordonez2005Learning] Celestino Ordóñez, Javier Taboada, Fernando Bastante, Jose María Matías, and Angel Manuel Felicísimo. Learning machines applied to potential forest distribution. Environ Manage, 35(1):109-20, Jan 2005. [ bib ]
The clearing of forests to obtain land for pasture and agriculture and the replacement of autochthonous species by other faster-growing varieties of trees for timber have both led to the loss of vast areas of forest worldwide. At present, many developed countries are attempting to reverse these effects, establishing policies for the restoration of older woodland systems. Reforestation is a complex matter, planned and carried out by experts who need objective information regarding the type of forest that can be sustained in each area. This information is obtained by drawing up feasibility models constructed using statistical methods that make use of the information provided by morphological and environmental variables (height, gradient, rainfall, etc.) that partially condition the presence or absence of a specific kind of forestation in an area. The aim of this work is to construct a set of feasibility models for woodland located in the basin of the River Liébana (NW Spain), to serve as a support tool for the experts entrusted with carrying out the reforestation project. The techniques used are multilayer perceptron neural networks and support vector machines. Their results will be compared to the results obtained by traditional techniques (such as discriminant analysis and logistic regression) by measuring the degree of fit between each model and the existing distribution of woodlands. The interpretation and problems of the feasibility models are commented on in the Discussion section.

Keywords: Artificial Intelligence, Conservation of Natural Resources, Decision Support Techniques, Ecosystem, Environment, Forestry, Regression Analysis, Spain, 15984068
[Olson2005Closed-loop] Byron P Olson, Jennie Si, Jing Hu, and Jiping He. Closed-loop cortical control of direction using support vector machines. IEEE Trans Neural Syst Rehabil Eng, 13(1):72-80, Mar 2005. [ bib ]
Motor neuroprosthetics research has focused on reproducing natural limb motions by correlating firing rates of cortical neurons to continuous movement parameters. We propose an alternative system where specific spatial-temporal spike patterns, emerging in tasks, allow detection of classes of behavior with the aid of sophisticated nonlinear classification algorithms. Specifically, we attempt to examine ensemble activity from motor cortical neurons, not to reproduce the action this neural activity normally precedes, but rather to predict an output supervisory command to potentially control a vehicle. To demonstrate the principle, this design approach was implemented in a discrete directional task taking a small number of motor cortical signals (8-10 single units) fed into a support vector machine (SVM) to produce the commands Left and Right. In this study, rats were placed in a conditioning chamber performing a binary paddle pressing task mimicking the control of a wheelchair turning left or right. Four animal subjects (male Sprague-Dawley rats) were able to use such a brain-machine interface (BMI) with an average accuracy of 78% on their first day of exposure. Additionally, one animal continued to use the interface for three consecutive days with an average accuracy over 90%.

Keywords: Algorithms, Animals, Artificial Intelligence, Computer-Assisted, Diagnosis, Electrodes, Electroencephalography, Feedback, Implanted, Male, Motor Cortex, Movement, Non-P.H.S., Non-U.S. Gov't, Rats, Research Support, Sprague-Dawley, Therapy, U.S. Gov't, User-Computer Interface, 15813408
[ODonnell2005Gene] Rebekah K O'Donnell, Michael Kupferman, S Jack Wei, Sunil Singhal, Randal Weber, Bert O'Malley, Yi Cheng, Mary Putt, Michael Feldman, Barry Ziober, and Ruth J Muschel. Gene expression signature predicts lymphatic metastasis in squamous cell carcinoma of the oral cavity. Oncogene, 24(7):1244-51, Feb 2005. [ bib | DOI | http | .pdf ]
Metastasis via the lymphatics is a major risk factor in squamous cell carcinoma of the oral cavity (OSCC). We sought to determine whether the presence of metastasis in the regional lymph node could be predicted by a gene expression signature of the primary tumor. A total of 18 OSCCs were characterized for gene expression by hybridizing RNA to Affymetrix U133A gene chips. Genes with differential expression were identified using a permutation technique and verified by quantitative RT-PCR and immunohistochemistry. A predictive rule was built using a support vector machine, and the accuracy of the rule was evaluated using crossvalidation on the original data set and prediction of an independent set of four patients. Metastatic primary tumors could be differentiated from nonmetastatic primary tumors by a signature gene set of 116 genes. This signature gene set correctly predicted the four independent patients as well as associating five lymph node metastases from the original patient set with the metastatic primary tumor group. We concluded that lymph node metastasis could be predicted by gene expression profiles of primary oral cavity squamous cell carcinomas. The presence of a gene expression signature for lymph node metastasis indicates that clinical testing to assess risk for lymph node metastasis should be possible.

Keywords: biosvm microarray
[Nicholls2005Openeye] A. Nicholls. OEChem, version 1.3.4, OpenEye Scientific Software. Website, 2005. [ bib ]
[Nguyen2005Two-stage] M. N. Nguyen and J. C. Rajapakse. Two-stage multi-class support vector machines to protein secondary structure prediction. Pac Symp Biocomput, pages 346-57, 2005. [ bib ]
Bioinformatics techniques for protein secondary structure (PSS) prediction are mostly single-stage approaches in the sense that they predict secondary structures of proteins by taking into account only the contextual information in amino acid sequences. In this paper, we propose a two-stage Multi-class Support Vector Machine (MSVM) approach where an MSVM predictor is introduced to the output of the first stage MSVM to capture the sequential relationship among secondary structure elements for the prediction. By using position specific scoring matrices, generated by PSI-BLAST, the two-stage MSVM approach achieves Q3 accuracies of 78.0% and 76.3% on the RS126 dataset of 126 nonhomologous globular proteins and the CB396 dataset of 396 nonhomologous proteins, respectively, which are better than the highest scores published on both datasets to date.

Keywords: biosvm
[Nguyen2005Prediction] Minh N Nguyen and Jagath C Rajapakse. Prediction of protein relative solvent accessibility with a two-stage SVM approach. Proteins, 59(1):30-7, Apr 2005. [ bib | DOI | http | .pdf ]
Information on relative solvent accessibility (RSA) of amino acid residues in proteins provides valuable clues to the prediction of protein structure and function. A two-stage approach with support vector machines (SVMs) is proposed, where an SVM predictor is introduced to the output of the single-stage SVM approach to take into account the contextual relationships among solvent accessibilities for the prediction. By using the position-specific scoring matrices (PSSMs) generated by PSI-BLAST, the two-stage SVM approach achieves accuracies up to 90.4% and 90.2% on the Manesh data set of 215 protein structures and the RS126 data set of 126 nonhomologous globular proteins, respectively, which are better than the highest published scores on both data sets to date. A Web server for protein RSA prediction using a two-stage SVM method has been developed and is available (http://birc.ntu.edu.sgpas0186457/rsa.html).

Keywords: biosvm
[Nattkemper2005Evaluation] Tim W Nattkemper, Bert Arnrich, Oliver Lichte, Wiebke Timm, Andreas Degenhard, Linda Pointon, Carmel Hayes, Martin O Leach, and The UK MARIBS Breast Screening Study. Evaluation of radiological features for breast tumour classification in clinical screening with machine learning methods. Artif. Intell. Med., 34(2):129-39, Jun 2005. [ bib ]
OBJECTIVE: In this work, methods utilizing supervised and unsupervised machine learning are applied to analyze radiologically derived morphological and calculated kinetic tumour features. The features are extracted from dynamic contrast enhanced magnetic resonance imaging (DCE-MRI) time-course data. MATERIAL: The DCE-MRI data of the female breast are obtained within the UK Multicenter Breast Screening Study. The group of patients imaged in this study is selected on the basis of an increased genetic risk for developing breast cancer. METHODS: The k-means clustering and self-organizing maps (SOM) are applied to analyze the signal structure in terms of visualization. We employ k-nearest neighbor classifiers (k-nn), support vector machines (SVM) and decision trees (DT) to classify features using a computer aided diagnosis (CAD) approach. RESULTS: Regarding the unsupervised techniques, clustering according to features indicating benign and malignant characteristics is observed to a limited extent. The supervised approaches classified the data with 74% accuracy (DT) and provided an area under the receiver-operator-characteristics (ROC) curve (AUC) of 0.88 (SVM). CONCLUSION: It was found that contour and wash-out type (WOT) features determined by the radiologists lead to the best SVM classification results. Although a fast signal uptake in early time-point measurements is an important feature for malignant/benign classification of tumours, our results indicate that the wash-out characteristics might be considered as important.

Keywords: breastcancer
[Nair2005Mimicking] Rajesh Nair and Burkhard Rost. Mimicking cellular sorting improves prediction of subcellular localization. J Mol Biol, 348(1):85-100, Apr 2005. [ bib | DOI | http | .pdf ]
Predicting the native subcellular compartment of a protein is an important step toward elucidating its function. Here we introduce LOCtree, a hierarchical system combining support vector machines (SVMs) and other prediction methods. LOCtree predicts the subcellular compartment of a protein by mimicking the mechanism of cellular sorting and exploiting a variety of sequence and predicted structural features in its input. Currently LOCtree does not predict localization for membrane proteins, since the compositional properties of membrane proteins significantly differ from those of non-membrane proteins. While any information about function can be used by the system, we present estimates of performance that are valid when only the amino acid sequence of a protein is known. When evaluated on a non-redundant test set, LOCtree achieved sustained levels of 74% accuracy for non-plant eukaryotes, 70% for plants, and 84% for prokaryotes. We rigorously benchmarked LOCtree in comparison to the best alternative methods for localization prediction. LOCtree outperformed all other methods in nearly all benchmarks. Localization assignments using LOCtree agreed quite well with data from recent large-scale experiments. Our preliminary analysis of a few entirely sequenced organisms, namely human (Homo sapiens), yeast (Saccharomyces cerevisiae), and weed (Arabidopsis thaliana) suggested that over 35% of all non-membrane proteins are nuclear, about 20% are retained in the cytosol, and that every fifth protein in the weed resides in the chloroplast.

Keywords: biosvm
[Nabieva2005Whole-proteome] Elena Nabieva, Kam Jim, Amit Agarwal, Bernard Chazelle, and Mona Singh. Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics, 21 Suppl 1:i302-i310, Jun 2005. [ bib | DOI | http ]
MOTIVATION: Determining protein function is one of the most important problems in the post-genomic era. For the typical proteome, there are no functional annotations for one-third or more of its proteins. Recent high-throughput experiments have determined proteome-scale protein physical interaction maps for several organisms. These physical interactions are complemented by an abundance of data about other types of functional relationships between proteins, including genetic interactions, knowledge about co-expression and shared evolutionary history. Taken together, these pairwise linkages can be used to build whole-proteome protein interaction maps. RESULTS: We develop a network-flow based algorithm, FunctionalFlow, that exploits the underlying structure of protein interaction maps in order to predict protein function. In cross-validation testing on the yeast proteome, we show that FunctionalFlow has improved performance over previous methods in predicting the function of proteins with few (or no) annotated protein neighbors. By comparing several methods that use protein interaction maps to predict protein function, we demonstrate that FunctionalFlow performs well because it takes advantage of both network topology and some measure of locality. Finally, we show that performance can be improved substantially as we consider multiple data sources and use them to create weighted interaction networks. AVAILABILITY: http://compbio.cs.princeton.edu/function

Keywords: Algorithms; Computational Biology, methods; Evolution, Molecular; Fungal Proteins, chemistry; Genomics; Models, Statistical; Models, Theoretical; Protein Interaction Mapping, methods; Proteins, chemistry; Proteomics, methods
[Mueller2005Classifying] K.-R. Müller, G. Rätsch, S. Sonnenburg, S. Mika, M. Grimm, and N. Heinrich. Classifying 'drug-likeness' with Kernel-based learning methods. J Chem Inf Model, 45(2):249-53, 2005. [ bib | DOI | http | .pdf ]
In this article we report on a successful application of modern machine learning technology, namely Support Vector Machines, to the problem of assessing the 'drug-likeness' of a chemical from a given set of descriptors of the substance. We were able to drastically improve the recent result by Byvatov et al. (2003) on this task and achieved an error rate of about 7% on unseen compounds using Support Vector Machines. We see a very high potential of such machine learning techniques for a variety of computational chemistry problems that occur in the drug discovery and drug design process.

Keywords: biosvm chemoinformatics
[Mohamed2005Prostate] S. S. Mohamed, M. M. A. Salama, M. Kamel, E. F. El-Saadany, K. Rizkalla, and J. Chin. Prostate cancer multi-feature analysis using trans-rectal ultrasound images. Phys Med Biol, 50(15):N175-85, Aug 2005. [ bib | DOI | http | .pdf ]
This note focuses on extracting and analysing prostate texture features from trans-rectal ultrasound (TRUS) images for tissue characterization. One of the principal contributions of this investigation is the use of the information of the images' frequency domain features and spatial domain features to attain a more accurate diagnosis. Each image is divided into regions of interest (ROIs) by the Gabor multi-resolution analysis, a crucial stage, in which segmentation is achieved according to the frequency response of the image pixels. The pixels with a similar response to the same filter are grouped to form one ROI. Next, from each ROI two different statistical feature sets are constructed; the first set includes four grey level dependence matrix (GLDM) features and the second set consists of five grey level difference vector (GLDV) features. These constructed feature sets are then ranked by the mutual information feature selection (MIFS) algorithm. Here, the features that provide the maximum mutual information of each feature and class (cancerous and non-cancerous) and the minimum mutual information of the selected features are chosen, yielding a reduced feature subset. The two constructed feature sets, GLDM and GLDV, as well as the reduced feature subset, are examined in terms of three different classifiers: the condensed k-nearest neighbour (CNN), the decision tree (DT) and the support vector machine (SVM). The accuracy classification results range from 87.5% to 93.75%, where the performance of the SVM and that of the DT are significantly better than the performance of the CNN.

Keywords: 16030375
[Mitsumori2005Gene] Tomohiro Mitsumori, Sevrani Fation, Masaki Murata, Kouichi Doi, and Hirohumi Doi. Gene/protein name recognition based on support vector machine using dictionary as features. BMC Bioinformatics, 6 Suppl 1:S8, 2005. [ bib | DOI | http | .pdf ]
BACKGROUND: Automated information extraction from biomedical literature is important because a vast amount of biomedical literature has been published. Recognition of the biomedical named entities is the first step in information extraction. We developed an automated recognition system based on the SVM algorithm and evaluated it in Task 1.A of BioCreAtIvE, a competition for automated gene/protein name recognition. RESULTS: In the work presented here, our recognition system uses the feature set of the word, the part-of-speech (POS), the orthography, the prefix, the suffix, and the preceding class. We call these features "internal resource features", i.e., features that can be found in the training data. Additionally, we consider the features of matching against dictionaries to be external resource features. We investigated and evaluated the effect of these features as well as the effect of tuning the parameters of the SVM algorithm. We found that the dictionary matching features contributed slightly to the improvement in the performance of the f-score. We attribute this to the possibility that the dictionary matching features might overlap with other features in the current multiple feature setting. CONCLUSION: During SVM learning, each feature alone had a marginally positive effect on system performance. This supports the fact that the SVM algorithm is robust on the high dimensionality of the feature vector space and means that feature selection is not required.

Keywords: biosvm nlp
[Miteva2005Fast] M. A. Miteva, W. H. Lee, M. O. Montes, and B. O. Villoutreix. Fast structure-based virtual ligand screening combining FRED, DOCK, and Surflex. J. Med. Chem., 48(19):6012-6022, Sep 2005. [ bib | DOI | http ]
A protocol was devised in which FRED, DOCK, and Surflex were combined in a multistep virtual ligand screening (VLS) procedure to screen the pocket of four different proteins. One goal was to evaluate the impact of chaining "freely available packages to academic users" on docking/scoring accuracy and CPU time consumption. A bank of 65 660 compounds including 49 known actives was generated. Our procedure is successful because docking/scoring parameters are tuned according to the nature of the binding pocket and because a shape-based filtering tool is applied prior to flexible docking. The obtained enrichment factors are in line with those reported in recent studies. We suggest that consensus docking/scoring could be valuable to some drug discovery projects. The present protocol could process the entire bank for one receptor in less than a week on one processor, suggesting that VLS experiments could be performed even without large computer resources.

Keywords: Binding Sites, Databases, Estrogen, Factor VIIa, Factual, Ligands, Molecular Structure, Neuraminidase, Non-U.S. Gov't, Protein Binding, Quantitative Structure-Activity Relationship, Receptors, Research Support, Thymidine Kinase, 16162004
[Michiels2005Prediction] S. Michiels, S. Koscielny, and C. Hill. Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet, 365(9458):488-492, 2005. [ bib | DOI | http ]
BACKGROUND: General studies of microarray gene-expression profiling have been undertaken to predict cancer outcome. Knowledge of this gene-expression profile or molecular signature should improve treatment of patients by allowing treatment to be tailored to the severity of the disease. We reanalysed data from the seven largest published studies that have attempted to predict prognosis of cancer patients on the basis of DNA microarray analysis. METHODS: The standard strategy is to identify a molecular signature (ie, the subset of genes most differentially expressed in patients with different outcomes) in a training set of patients and to estimate the proportion of misclassifications with this signature on an independent validation set of patients. We expanded this strategy (based on unique training and validation sets) by using multiple random sets, to study the stability of the molecular signature and the proportion of misclassifications. FINDINGS: The list of genes identified as predictors of prognosis was highly unstable; molecular signatures strongly depended on the selection of patients in the training sets. For all but one study, the proportion misclassified decreased as the number of patients in the training set increased. Because of inadequate validation, our chosen studies published overoptimistic results compared with those from our own analyses. Five of the seven studies did not classify patients better than chance. INTERPRETATION: The prognostic value of published microarray results in cancer studies should be considered with caution. We advocate the use of validation by repeated random sampling.

Keywords: featureselection, breastcancer, microarray
[Micchelli2005On] Charles A Micchelli and Massimiliano Pontil. On learning vector-valued functions. Neural Comput, 17(1):177-204, Jan 2005. [ bib | DOI | http ]
In this letter, we provide a study of learning in a Hilbert space of vector-valued functions. We motivate the need for extending learning theory of scalar-valued functions by practical considerations and establish some basic results for learning vector-valued functions that should prove useful in applications. Specifically, we allow an output space Y to be a Hilbert space, and we consider a reproducing kernel Hilbert space of functions whose values lie in Y. In this setting, we derive the form of the minimal norm interpolant to a finite set of data and apply it to study some regularization functionals that are important in learning theory. We consider specific examples of such functionals corresponding to multiple-output regularization networks and support vector machines, for both regression and classification. Finally, we provide classes of operator-valued kernels of the dot product and translation-invariant type.

Keywords: Algorithms, Amino Acid, Amino Acids, Artificial Intelligence, Ascomycota, Automated, Base Sequence, Chromosome Mapping, Codon, Colonic Neoplasms, Comparative Study, Computer Simulation, Computer-Assisted, Computing Methodologies, Crystallography, DNA, DNA Primers, Databases, Decision Support Techniques, Diagnostic Imaging, Enzymes, Feedback, Fixation, Gene Expression Profiling, Genetic, Hordeum, Host-Parasite Relations, Humans, Image Enhancement, Image Interpretation, Informatics, Information Storage and Retrieval, Kinetics, Logistic Models, Magnetic Resonance Spectroscopy, Mathematical Computing, Models, Nanotechnology, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Nonlinear Dynamics, Ocular, Oligonucleotide Array Sequence Analysis, P.H.S., Pattern Recognition, Plant, Plants, Predictive Value of Tests, Protein, Protein Conformation, Regression Analysis, Research Support, Sample Size, Selection (Genetics), Sequence Alignment, Sequence Analysis, Sequence Homology, Signal Processing, Skin, Software, Statistical, Subtraction Technique, Theoretical, Thermodynamics, U.S. Gov't, Viral Proteins, X-Ray, 15563752
[Menchetti2005Weighted] S. Menchetti, F. Costa, and P. Frasconi. Weighted decomposition kernels. In L. De Raedt and S. Wrobel, editors, Proceedings of the Twenty-Second International Conference on Machine Learning (ICML 2005), pages 585-592. ACM Press, 2005. [ bib ]
[Mavroforakis2005Significance] Michael Mavroforakis, Harris Georgiou, Nikos Dimitropoulos, Dionisis Cavouras, and Sergios Theodoridis. Significance analysis of qualitative mammographic features, using linear classifiers, neural networks and support vector machines. Eur J Radiol, 54(1):80-9, Apr 2005. [ bib | DOI | http | .pdf ]
Advances in modern technologies and computers have enabled digital image processing to become a vital tool in conventional clinical practice, including mammography. However, the core problem of the clinical evaluation of mammographic tumors remains a highly demanding cognitive task. In order for these automated diagnostic systems to perform at levels of sensitivity and specificity similar to that of human experts, it is essential that a robust framework on problem-specific design parameters is formulated. This study is focused on identifying a robust set of clinical features that can be used as the base for designing the input of any computer-aided diagnosis system for automatic mammographic tumor evaluation. A thorough list of clinical features was constructed and the diagnostic value of each feature was verified against current clinical practices by an expert physician. These features were directly or indirectly related to the overall morphological properties of the mammographic tumor or the texture of the fine-scale tissue structures as they appear in the digitized image, while others contained external clinical data of utmost importance, like the patient's age. The entire feature set was used as an annotation list for describing the clinical properties of mammographic tumor cases in a quantitative way, such that subsequent objective analyses were possible. For the purposes of this study, a mammographic image database was created, with complete clinical evaluation descriptions and positive histological verification for each case. All tumors contained in the database were characterized according to the identified clinical feature set and the resulting dataset was used as input for discrimination and diagnostic value analysis for each one of these features. Specifically, several standard methodologies of statistical significance analysis were employed to create feature rankings according to their discriminating power. 
Moreover, three different classification models, namely linear classifiers, neural networks and support vector machines, were employed to investigate the true efficiency of each one of them, as well as the overall complexity of the diagnostic task of mammographic tumor characterization. Both the statistical and the classification results have proven the explicit correlation of all the selected features with the final diagnosis, qualifying them as an adequate input base for any type of similar automated diagnosis system. The underlying complexity of the diagnostic task has justified the high value of sophisticated pattern recognition architectures.

Keywords: Algorithms, Animals, Antibiotics, Antineoplastic, Artificial Intelligence, Butadienes, Chloroplasts, Comparative Study, Computer Simulation, Computer-Assisted, Diagnosis, Disinfectants, Dose-Response Relationship, Drug, Drug Toxicity, Electrodes, Electroencephalography, Ethylamines, Expert Systems, Feedback, Fungicides, Gene Expression Profiling, Genes, Genetic Markers, Humans, Implanted, Industrial, Information Storage and Retrieval, Kidney, Kidney Tubules, MEDLINE, Male, Mercuric Chloride, Microarray Analysis, Molecular Biology, Motor Cortex, Movement, Natural Language Processing, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Plant Proteins, Predictive Value of Tests, Proteins, Proteome, Proximal, Puromycin Aminonucleoside, Rats, Reproducibility of Results, Research Support, Sprague-Dawley, Subcellular Fractions, Terminology, Therapy, Time Factors, Toxicogenetics, U.S. Gov't, User-Computer Interface, 15797296
[Matsuda2005novel] A. Matsuda, J.-P. Vert, H. Saigo, N. Ueda, H. Toh, and T. Akutsu. A novel representation of protein sequences for prediction of subcellular location using support vector machines. Protein Sci., 14(11):2804-2813, 2005. [ bib | DOI | http ]
As the number of complete genomes rapidly increases, accurate methods to automatically predict the subcellular location of proteins are increasingly useful to help their functional annotation. In order to improve the predictive accuracy of the many prediction methods developed to date, a novel representation of protein sequences is proposed. This representation involves local compositions of amino acids and twin amino acids, and local frequencies of distance between successive (basic, hydrophobic, and other) amino acids. For calculating the local features, each sequence is split into three parts: N-terminal, middle, and C-terminal. The N-terminal part is further divided into four regions to consider ambiguity in the length and position of signal sequences. We tested this representation with support vector machines on two data sets extracted from the SWISS-PROT database. Through fivefold cross-validation tests, overall accuracies of more than 87 proteins, respectively. It is concluded that considering the respective features in the N-terminal, middle, and C-terminal parts is helpful to predict the subcellular location.

Keywords: biosvm
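
The three-part split described in the abstract of [Matsuda2005novel] can be illustrated with a minimal sketch. It covers only the local amino acid composition; the twin amino acid compositions, the distance frequencies, and the further division of the N-terminal part into four regions are left out. The function names and the default part lengths `n_len` and `c_len` are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, fixed order


def split_three(seq, n_len, c_len):
    """Split a sequence into N-terminal, middle and C-terminal parts.

    n_len and c_len are hypothetical part lengths chosen by the caller.
    """
    if n_len + c_len > len(seq):
        raise ValueError("sequence shorter than n_len + c_len")
    cut = len(seq) - c_len
    return seq[:n_len], seq[n_len:cut], seq[cut:]


def composition(part):
    """Relative frequency of each amino acid in one part (zeros if empty)."""
    counts = Counter(part)
    total = len(part)
    return [counts[a] / total if total else 0.0 for a in AMINO_ACIDS]


def local_features(seq, n_len=4, c_len=4):
    """Concatenate the compositions of the three parts into one vector."""
    features = []
    for part in split_three(seq, n_len, c_len):
        features.extend(composition(part))
    return features


# Toy sequence, invented for illustration only.
vector = local_features("MKKLAAAAGGWW")
print(len(vector))  # 3 parts x 20 amino acids
```

A vector built this way could be passed to any classifier; the abstract reports using support vector machines.
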
[Martin2005Predicting] S. Martin, D. Roe, and J.-L. Faulon. Predicting protein-protein interactions using signature products. Bioinformatics, 21(2):218-226, Jan 2005. [ bib | DOI | http | .pdf ]
Motivation: Proteome-wide prediction of protein-protein interaction is a difficult and important problem in biology. Although there have been recent advances in both experimental and computational methods for predicting protein-protein interactions, we are only beginning to see a confluence of these techniques. In this paper, we describe a very general, high-throughput method for predicting protein-protein interactions. Our method combines a sequence-based description of proteins with experimental information that can be gathered from any type of protein-protein interaction screen. The method uses a novel description of interacting proteins by extending the signature descriptor, which has demonstrated success in predicting peptide/protein binding interactions for individual proteins. This descriptor is extended to protein pairs by taking signature products. The signature product is implemented within a support vector machine classifier as a kernel function. Results: We have applied our method to publicly available yeast, Helicobacter pylori, human and mouse datasets. We used the yeast and H.pylori datasets to verify the predictive ability of our method, achieving from 70 to 80 human and mouse datasets to demonstrate that our method is capable of cross-species prediction. Finally, we reused the yeast dataset to explore the ability of our algorithm to predict domains. Contact: smartin@sandia.gov.

Keywords: biosvm
[Mao2005Multiclass] Yong Mao, Xiaobo Zhou, Daoying Pi, Youxian Sun, and Stephen T C Wong. Multiclass cancer classification by using fuzzy support vector machine and binary decision tree with gene selection. J Biomed Biotechnol, 2005(2):160-71, 2005. [ bib | DOI | http | .pdf ]
We investigate the problems of multiclass cancer classification with gene selection from gene expression data. Two different constructed multiclass classifiers with gene selection are proposed, which are fuzzy support vector machine (FSVM) with gene selection and binary classification tree based on SVM with gene selection. Using F test and recursive feature elimination based on SVM as gene selection methods, binary classification tree based on SVM with F test, binary classification tree based on SVM with recursive feature elimination based on SVM, and FSVM with recursive feature elimination based on SVM are tested in our experiments. To accelerate computation, preselecting the strongest genes is also used. The proposed techniques are applied to analyze breast cancer data, small round blue-cell tumors, and acute leukemia data. Compared to existing multiclass cancer classifiers and binary classification tree based on SVM with F test or binary classification tree based on SVM with recursive feature elimination based on SVM mentioned in this paper, FSVM based on recursive feature elimination based on SVM can find most important genes that affect certain types of cancer with high recognition accuracy.

Keywords: biosvm
[Majumder2005Support] S. K. Majumder, N. Ghosh, and P. K. Gupta. Support vector machine for optical diagnosis of cancer. J Biomed Opt, 10(2):024034, 2005. [ bib | DOI | http | .pdf ]
We report the application of a support vector machine (SVM) for the development of diagnostic algorithms for optical diagnosis of cancer. Both linear and nonlinear SVMs have been investigated for this purpose. We develop a methodology that makes use of SVM for both feature extraction and classification jointly by integrating the newly developed recursive feature elimination (RFE) in the framework of SVM. This leads to significantly improved classification results compared to those obtained when an independent feature extractor such as principal component analysis (PCA) is used. The integrated SVM-RFE approach is also found to outperform the classification results yielded by traditional Fisher's linear discriminant (FLD)-based algorithms. All the algorithms are developed using spectral data acquired in a clinical in vivo laser-induced fluorescence (LIF) spectroscopic study conducted on patients being screened for cancer of the oral cavity and normal volunteers. The best sensitivity and specificity values provided by the nonlinear SVM-RFE algorithm over the data sets investigated are 95 and 96% toward cancer for the training set data based on leave-one-out cross validation and 93 and 97% toward cancer for the independent validation set data. When tested on the spectral data of the uninvolved oral cavity sites from the patients it yielded a specificity of 85%.

[Majumder2005Relevance] Shovan K Majumder, Nirmalya Ghosh, and Pradeep K Gupta. Relevance vector machine for optical diagnosis of cancer. Lasers Surg Med, 36(4):323-33, Apr 2005. [ bib | DOI | http | .pdf ]
BACKGROUND AND OBJECTIVES: A probability-based, robust diagnostic algorithm is an essential requirement for successful clinical use of optical spectroscopy for cancer diagnosis. This study reports the use of the theory of relevance vector machine (RVM), a recent Bayesian machine-learning framework of statistical pattern recognition, for development of a fully probabilistic algorithm for autofluorescence diagnosis of early stage cancer of human oral cavity. It also presents a comparative evaluation of the diagnostic efficacy of the RVM algorithm with that based on support vector machine (SVM) that has recently received considerable attention for this purpose. STUDY DESIGN/MATERIALS AND METHODS: The diagnostic algorithms were developed using in vivo autofluorescence spectral data acquired from human oral cavity with a N(2) laser-based portable fluorimeter. The spectral data of both patients as well as normal volunteers, enrolled at Out Patient department of the Govt. Cancer Hospital, Indore for screening of oral cavity, were used for this purpose. The patients selected had no prior confirmed malignancy and were diagnosed of squamous cell carcinoma (SCC), Grade-I on the basis of histopathology of biopsy taken from abnormal site subsequent to acquisition of spectra. Autofluorescence spectra were recorded from a total of 171 tissue sites from 16 patients and 154 healthy squamous tissue sites from 13 normal volunteers. Of 171 tissues sites from patients, 83 were SCC and the rest were contralateral uninvolved squamous tissue. Each site was treated separately and classified via the diagnostic algorithm developed. Instead of the spectral data from uninvolved sites of patients, the data from normal volunteers were used as the normal database for the development of diagnostic algorithms. 
RESULTS: The diagnostic algorithms based on RVM were found to provide classification performance comparable to the state-of-the-art SVMs, while at the same time explicitly predicting the probability of class membership. The sensitivity and specificity towards cancer were up to 88% and 95% for the training set data based on leave-one-out cross validation and up to 91% and 96% for the validation set data. When implemented on the spectral data of the uninvolved oral cavity sites from the patients, it yielded a specificity of up to 91%. CONCLUSIONS: The Bayesian framework of RVM formulation makes it possible to predict the posterior probability of class membership in discriminating early SCC from the normal squamous tissue sites of the oral cavity in contrast to dichotomous classification provided by the non-Bayesian SVM. Such classification is very helpful in handling asymmetric misclassification costs like assigning different weights for having a false negative result for identifying cancer compared to false positive. The results further demonstrate that for comparable diagnostic performances, the RVM-based algorithms use significantly fewer kernel functions and do not need to estimate any ad hoc parameters associated with the learning or the optimization technique to be used. This implies a considerable saving in memory and computation in a practical implementation.

Keywords: 15825208
[Mahe2005Graph] P. Mahé, N. Ueda, T. Akutsu, J.-L. Perret, and J.-P. Vert. Graph kernels for molecular structure-activity relationship analysis with support vector machines. J. Chem. Inf. Model., 45(4):939-51, 2005. [ bib | DOI | http | .pdf ]
The support vector machine algorithm together with graph kernel functions has recently been introduced to model structure-activity relationships (SAR) of molecules from their 2D structure, without the need for explicit molecular descriptor computation. We propose two extensions to this approach with the double goal to reduce the computational burden associated with the model and to enhance its predictive accuracy: description of the molecules by a Morgan index process and definition of a second-order Markov model for random walks on 2D structures. Experiments on two mutagenicity data sets validate the proposed extensions, making this approach a possible complementary alternative to other modeling strategies.

Keywords: biosvm chemoinformatics
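
The "Morgan index process" mentioned in the abstract of [Mahe2005Graph] can be sketched as follows. This is one standard formulation of the Morgan iteration (every atom starts at 1, and each step replaces an atom's index by the sum of its neighbours' indices). Whether the paper uses exactly this variant is an assumption, and the adjacency-dictionary representation and the function name are hypothetical.

```python
def morgan_indices(adjacency, iterations):
    """Return the Morgan index of every atom after a number of iterations.

    adjacency maps each atom id to the list of its bonded neighbours
    (a hypothetical input format chosen for this sketch).
    """
    indices = {atom: 1 for atom in adjacency}  # iteration 0: all ones
    for _ in range(iterations):
        # Build the new indices from the previous ones in a single step.
        indices = {
            atom: sum(indices[n] for n in neighbours)
            for atom, neighbours in adjacency.items()
        }
    return indices


# Toy molecular graph: an unbranched chain of four atoms 0-1-2-3.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(morgan_indices(chain, 1))  # after one step each index equals the atom degree
print(morgan_indices(chain, 2))
```

After a few iterations, atoms with the same element but different surroundings tend to receive different indices. Such indices can refine atom labels, so that fewer label paths match between two molecules.
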
[Machado2005Detection] Roberto F Machado, Daniel Laskowski, Olivia Deffenderfer, Timothy Burch, Shuo Zheng, Peter J Mazzone, Tarek Mekhail, Constance Jennings, James K Stoller, Jacqueline Pyle, Jennifer Duncan, Raed A Dweik, and Serpil C Erzurum. Detection of lung cancer by sensor array analyses of exhaled breath. Am J Respir Crit Care Med, 171(11):1286-91, Jun 2005. [ bib | DOI | http | .pdf ]
RATIONALE: Electronic noses are successfully used in commercial applications, including detection and analysis of volatile organic compounds in the food industry. OBJECTIVES: We hypothesized that the electronic nose could identify and discriminate between lung diseases, especially bronchogenic carcinoma. METHODS: In a discovery and training phase, exhaled breath of 14 individuals with bronchogenic carcinoma and 45 healthy control subjects or control subjects without cancer was analyzed. Principal components and canonic discriminant analysis of the sensor data was used to determine whether exhaled gases could discriminate between cancer and noncancer. Discrimination between classes was performed using Mahalanobis distance. Support vector machine analysis was used to create and apply a cancer prediction model prospectively in a separate group of 76 individuals, 14 with and 62 without cancer. MAIN RESULTS: Principal components and canonic discriminant analysis demonstrated discrimination between samples from patients with lung cancer and those from other groups. In the validation study, the electronic nose had 71.4% sensitivity and 91.9% specificity for detecting lung cancer; positive and negative predictive values were 66.6 and 93.4%, respectively. In this population with a lung cancer prevalence of 18%, positive and negative predictive values were 66.6 and 94.5%, respectively. CONCLUSION: The exhaled breath of patients with lung cancer has distinct characteristics that can be identified with an electronic nose. The results provide feasibility to the concept of using the electronic nose for managing and detecting lung cancer.
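
The validation figures quoted in the abstract above can be checked with a short calculation. The abstract gives the group sizes (14 with and 62 without cancer) but not the confusion-matrix counts. The values of 10 true positives and 57 true negatives used below are inferred from the reported sensitivity and specificity, and should be read as an assumption.

```python
# Validation group sizes, as stated in the abstract.
with_cancer = 14
without_cancer = 62

# Inferred counts (assumption): the integers that reproduce 71.4% and 91.9%.
true_pos = 10
true_neg = 57
false_neg = with_cancer - true_pos      # 4
false_pos = without_cancer - true_neg   # 5

sensitivity = true_pos / with_cancer
specificity = true_neg / without_cancer
ppv = true_pos / (true_pos + false_pos)  # positive predictive value
npv = true_neg / (true_neg + false_neg)  # negative predictive value
prevalence = with_cancer / (with_cancer + without_cancer)

for name, value in [
    ("sensitivity", sensitivity),
    ("specificity", specificity),
    ("PPV", ppv),
    ("NPV", npv),
    ("prevalence", prevalence),
]:
    print(f"{name}: {100 * value:.1f}%")
```

Under these inferred counts, the positive predictive value is 10/15 and the negative predictive value is 57/61. These are consistent with the first pair of predictive values reported in the abstract, up to rounding.
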

[Ma2005PNAS] L. Ma, J. Wagner, J. J. Rice, W. Hu, A. J. Levine, and G. A. Stolovitzky. A plausible model for the digital response of p53 to DNA damage. Proc Natl Acad Sci U S A, 102(40):14266-71, 2005. [ bib ]
Recent observations show that the single-cell response of p53 to ionizing radiation (IR) is "digital" in that it is the number of oscillations rather than the amplitude of p53 that shows dependence on the radiation dose. We present a model of this phenomenon. In our model, double-strand break (DSB) sites induced by IR interact with a limiting pool of DNA repair proteins, forming DSB-protein complexes at DNA damage foci. The persisting complexes are sensed by ataxia telangiectasia mutated (ATM), a protein kinase that activates p53 once it is phosphorylated by DNA damage. The ATM-sensing module switches on or off the downstream p53 oscillator, consisting of a feedback loop formed by p53 and its negative regulator, Mdm2. In agreement with experiments, our simulations show that by assuming stochasticity in the initial number of DSBs and the DNA repair process, p53 and Mdm2 exhibit a coordinated oscillatory dynamics upon IR stimulation in single cells, with a stochastic number of oscillations whose mean increases with IR dose. The damped oscillations previously observed in cell populations can be explained as the aggregate behavior of single cells.

Keywords: csbcbook
[Ma2005Structural] J.-B. Ma, Y.-R. Yuan, G. Meister, Y. Pei, T. Tuschl, and D. J. Patel. Structural basis for 5'-end-specific recognition of guide RNA by the A. fulgidus PIWI protein. Nature, 434:666-670, 2005. [ bib ]
Keywords: sirna
[Luan2005Classification] Feng Luan, Ruisheng Zhang, Chunyan Zhao, Xiaojun Yao, Mancang Liu, Zhide Hu, and Botao Fan. Classification of the carcinogenicity of N-nitroso compounds based on support vector machines and linear discriminant analysis. Chem Res Toxicol, 18(2):198-203, Feb 2005. [ bib | DOI | http | .pdf ]
The support vector machine (SVM), as a novel type of learning machine, was used to develop a classification model of carcinogenic properties of 148 N-nitroso compounds. The seven descriptors calculated solely from the molecular structures of compounds selected by forward stepwise linear discriminant analysis (LDA) were used as inputs of the SVM model. The obtained results confirmed the discriminative capacity of the calculated descriptors. The result of SVM (total accuracy of 95.2%) is better than that of LDA (total accuracy of 89.8%).

Keywords: biosvm
[Lu2005Potential] Wei-Zhen Lu and Wen-Jian Wang. Potential assessment of the "support vector machine" method in forecasting ambient air pollutant trends. Chemosphere, 59(5):693-701, Apr 2005. [ bib | DOI | http | .pdf ]
Monitoring and forecasting of air quality parameters are popular and important topics of atmospheric and environmental research today due to the health impact caused by exposing to air pollutants existing in urban air. The accurate models for air pollutant prediction are needed because such models would allow forecasting and diagnosing potential compliance or non-compliance in both short- and long-term aspects. Artificial neural networks (ANN) are regarded as reliable and cost-effective method to achieve such tasks and have produced some promising results to date. Although ANN has addressed more attentions to environmental researchers, its inherent drawbacks, e.g., local minima, over-fitting training, poor generalization performance, determination of the appropriate network architecture, etc., impede the practical application of ANN. Support vector machine (SVM), a novel type of learning machine based on statistical learning theory, can be used for regression and time series prediction and have been reported to perform well by some promising results. The work presented in this paper aims to examine the feasibility of applying SVM to predict air pollutant levels in advancing time series based on the monitored air pollutant database in Hong Kong downtown area. At the same time, the functional characteristics of SVM are investigated in the study. The experimental comparisons between the SVM model and the classical radial basis function (RBF) network demonstrate that the SVM is superior to the conventional RBF network in predicting air quality parameters with different time series and of better generalization performance than the RBF model.

[Lu2005MicroRNA] J. Lu, G. Getz, E. A. Miska, E. Alvarez-Saavedra, J. Lamb, D. Peck, A. Sweet-Cordero, D. L. Ebert, R. H. Mak, A. A. Ferrando, J. R. Downing, T. Jacks, H. R. Horvitz, and T. R. Golub. MicroRNA expression profiles classify human cancers. Nature, 435(7043):834-838, Jun 2005. [ bib | DOI | http | .pdf ]
Recent work has revealed the existence of a class of small non-coding RNA species, known as microRNAs (miRNAs), which have critical functions across various biological processes. Here we use a new, bead-based flow cytometric miRNA expression profiling method to present a systematic expression analysis of 217 mammalian miRNAs from 334 samples, including multiple human cancers. The miRNA profiles are surprisingly informative, reflecting the developmental lineage and differentiation state of the tumours. We observe a general downregulation of miRNAs in tumours compared with normal tissues. Furthermore, we were able to successfully classify poorly differentiated tumours using miRNA expression profiles, whereas messenger RNA profiles were highly inaccurate when applied to the same samples. These findings highlight the potential of miRNA profiling in cancer diagnosis.

Keywords: csbcbook, csbcbook-ch3
[Lo2005Effect] Siaw Ling Lo, Cong Zhong Cai, Yu Zong Chen, and Maxey C M Chung. Effect of training datasets on support vector machine prediction of protein-protein interactions. Proteomics, 5(4):876-84, Mar 2005. [ bib | DOI | http | .pdf ]
Knowledge of protein-protein interaction is useful for elucidating protein function via the concept of 'guilt-by-association'. A statistical learning method, Support Vector Machine (SVM), has recently been explored for the prediction of protein-protein interactions using artificial shuffled sequences as hypothetical noninteracting proteins and it has shown promising results (Bock, J. R., Gough, D. A., Bioinformatics 2001, 17, 455-460). It remains unclear however, how the prediction accuracy is affected if real protein sequences are used to represent noninteracting proteins. In this work, this effect is assessed by comparison of the results derived from the use of real protein sequences with that derived from the use of shuffled sequences. The real protein sequences of hypothetical noninteracting proteins are generated from an exclusion analysis in combination with subcellular localization information of interacting proteins found in the Database of Interacting Proteins. Prediction accuracy using real protein sequences is 76.9% compared to 94.1% using artificial shuffled sequences. The discrepancy likely arises from the expected higher level of difficulty for separating two sets of real protein sequences than that for separating a set of real protein sequences from a set of artificial sequences. The use of real protein sequences for training a SVM classification system is expected to give better prediction results in practical cases. This is tested by using both SVM systems for predicting putative protein partners of a set of thioredoxin related proteins. The prediction results are consistent with observations, suggesting that real sequence is more practically useful in development of SVM classification system for facilitating protein-protein interaction prediction.

Keywords: biosvm
[Liu2005Gene] Zhenqiu Liu, Dechang Chen, and Halima Bensmail. Gene expression data classification with kernel principal component analysis. J Biomed Biotechnol, 2005(2):155-9, 2005. [ bib | DOI | http | .pdf ]
One important feature of the gene expression data is that the number of genes M far exceeds the number of samples N . Standard statistical methods do not work well when N < M . Development of new methodologies or modification of existing methodologies is needed for the analysis of the microarray data. In this paper, we propose a novel analysis procedure for classifying the gene expression data. This procedure involves dimension reduction using kernel principal component analysis (KPCA) and classification with logistic regression (discrimination). KPCA is a generalization and nonlinear version of principal component analysis. The proposed algorithm was applied to five different gene expression datasets involving human tumor samples. Comparison with other popular classification methods such as support vector machines and neural networks shows that our algorithm is very promising in classifying gene expression data.

Keywords: biosvm
[Liu2005Multiclass] Jane Jijun Liu, Gene Cutler, Wuxiong Li, Zheng Pan, Sihua Peng, Tim Hoey, Liangbiao Chen, and Xuefeng Bruce Ling. Multiclass cancer classification and biomarker discovery using GA-based algorithms. Bioinformatics, 21(11):2691-7, Jun 2005. [ bib | DOI | http | .pdf ]
MOTIVATION: The development of microarray-based high-throughput gene profiling has led to the hope that this technology could provide an efficient and accurate means of diagnosing and classifying tumors, as well as predicting prognoses and effective treatments. However, the large amount of data generated by microarrays requires effective reduction of discriminant gene features into reliable sets of tumor biomarkers for such multiclass tumor discrimination. The availability of reliable sets of biomarkers, especially serum biomarkers, should have a major impact on our understanding and treatment of cancer. RESULTS: We have combined genetic algorithm (GA) and all paired (AP) support vector machine (SVM) methods for multiclass cancer categorization. Predictive features can be automatically determined through iterative GA/SVM, leading to very compact sets of non-redundant cancer-relevant genes with the best classification performance reported to date. Interestingly, these different classifier sets harbor only modest overlapping gene features but have similar levels of accuracy in leave-one-out cross-validations (LOOCV). Further characterization of these optimal tumor discriminant features, including the use of nearest shrunken centroids (NSC), analysis of annotations and literature text mining, reveals previously unappreciated tumor subclasses and a series of genes that could be used as cancer biomarkers. With this approach, we believe that microarray-based multiclass molecular analysis can be an effective tool for cancer biomarker discovery and subsequent molecular cancer diagnosis.

Keywords: biosvm
[Liu2005[Establishment] Jian Liu, Shu Zheng, Jie kai Yu, Xue bin Yu, Wei guo Liu, Jian min Zhang, and Xun Hu. [Establishment of diagnostic model of cerebrospinal protein fingerprint pattern for glioma and its clinical application.]. Zhejiang Da Xue Xue Bao Yi Xue Ban, 34(2):141-7, Mar 2005. [ bib ]
OBJECTIVE: To establish the diagnostic model of cerebrospinal protein profile for gliomas by surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF MS) and bioinformatics. METHODS: Seventy-five samples of cerebrospinal fluid from patients with gliomas, benign brain tumors and mild brain traumas were collected. A total of 50 samples from gliomas and non-brain-tumors were divided into training sets (33 cases including 17 gliomas and 16 non-brain-tumors) and testing sets (17 cases including 5 gliomas and 12 non-brain-tumors). The cerebrospinal proteins bound to H4 chip were detected by SELDI-TOF MS, the profiles of cerebrospinal protein were gained and then analyzed with artificial neural network algorithm (ANN); and the diagnostic model of cerebrospinal protein profiles for differentiating gliomas from non-brain-tumors was established. Forty-seven of cerebrospinal samples of gliomas and benign brain tumors were divided into training sets (31 cases including 13 gliomas and 18 benign brain tumors) and testing sets (16 cases including 9 gliomas and 7 benign brain tumors), the diagnostic model of cerebrospinal protein profiles for differentiating gliomas from benign brain tumors was established based on the same method. The support vector machine (SVM) algorithm was also used for evaluation, both results were very similar, but the result derived from ANN was more stable than that from SVM. RESULT: The diagnostic model of cerebrospinal protein profiles for differentiating gliomas from non-brain-tumors was established and was challenged with the test set randomly, the sensitivity and specificity were 100% and 91.7%, respectively. The cerebrospinal protein profiling model for differentiating gliomas from benign brain tumors was also developed and was challenged with the test set randomly, the sensitivity and specificity were 88.9%, and 100%, respectively. 
CONCLUSION: The technology of SELDI-TOF MS which combined with analysis tools of bioinformatics is a novel effective method for screening and identifying tumor biomarkers of gliomas and it may provide a new approach for the clinical diagnosis of glioma.

Keywords: Algorithms, Animals, Artificial Intelligence, Computer-Assisted, Diagnosis, Electrodes, Electroencephalography, Feedback, Implanted, Male, Motor Cortex, Movement, Non-P.H.S., Non-U.S. Gov't, Rats, Research Support, Sprague-Dawley, Therapy, U.S. Gov't, User-Computer Interface, 15812888
[Liu2005Use] Huiqing Liu, Jinyan Li, and Limsoon Wong. Use of extreme patient samples for outcome prediction from gene expression data. Bioinformatics, Jun 2005. [ bib | DOI | http | .pdf ]
MOTIVATION: Patient outcome prediction using microarray technologies is an important application in bioinformatics. Based on patients' genotypic microarray data, predictions are made to estimate patients' survival time and their risk of tumor metastasis or recurrence. So, accurate prediction can potentially help to provide better treatment for patients. RESULTS: We present a new computational method for patient outcome prediction. In the training phase of this method, we make use of two types of extreme patient samples: short-term survivors who got an unfavorable outcome within a short period and long-term survivors who were maintaining a favorable outcome after a long follow-up time. These extreme training samples yield a clear platform for us to identify relevant genes whose expression is closely related to the outcome. The selected extreme samples and the relevant genes are then integrated by a support vector machine to build a prediction model, by which each validation sample is assigned a risk score that falls into one of special pre-defined risk groups. We apply this method to several public data sets. In most cases, patients in high and low risk groups stratified by our method have clearly distinguishable outcome status as seen in their Kaplan-Meier curves. We also show that the idea of selecting only extreme patient samples for training is effective for improving the prediction accuracy when different gene selection methods are used. SUPPLEMENTARY INFORMATION: http://research.i2r.a-star.edu.sg/huiqing/supplementaldata/survival/survival.html.

Keywords: biosvm
[Lim2005Microarray] Lee P Lim, Nelson C Lau, Philip Garrett-Engele, Andrew Grimson, Janell M Schelter, John Castle, David P Bartel, Peter S Linsley, and Jason M Johnson. Microarray analysis shows that some microRNAs downregulate large numbers of target mRNAs. Nature, 433(7027):769-773, Feb 2005. [ bib | DOI | http ]
MicroRNAs (miRNAs) are a class of noncoding RNAs that post-transcriptionally regulate gene expression in plants and animals. To investigate the influence of miRNAs on transcript levels, we transfected miRNAs into human cells and used microarrays to examine changes in the messenger RNA profile. Here we show that delivering miR-124 causes the expression profile to shift towards that of brain, the organ in which miR-124 is preferentially expressed, whereas delivering miR-1 shifts the profile towards that of muscle, where miR-1 is preferentially expressed. In each case, about 100 messages were downregulated after 12 h. The 3' untranslated regions of these messages had a significant propensity to pair to the 5' region of the miRNA, as expected if many of these messages are the direct targets of the miRNAs. Our results suggest that metazoan miRNAs can reduce the levels of many of their target transcripts, not just the amount of protein deriving from these transcripts. Moreover, miR-1 and miR-124, and presumably other tissue-specific miRNAs, seem to downregulate a far greater number of targets than previously appreciated, thereby helping to define tissue-specific gene expression in humans.

Keywords: sirna
[Li2005robust] L. Li, W. Jiang, X. Li, K. L. Moser, Z. Guo, L. Du, Q. Wang, E. J. Topol, Q. Wang, and S. Rao. A robust hybrid between genetic algorithm and support vector machine for extracting an optimal feature gene subset. Genomics, 85(1):16-23, 2005. [ bib | DOI | http | .pdf ]
Development of a robust and efficient approach for extracting useful information from microarray data continues to be a significant and challenging task. Microarray data are characterized by a high dimension, high signal-to-noise ratio, and high correlations between genes, but with a relatively small sample size. Current methods for dimensional reduction can further be improved for the scenario of the presence of a single (or a few) high influential gene(s) in which its effect in the feature subset would prohibit inclusion of other important genes. We have formalized a robust gene selection approach based on a hybrid between genetic algorithm and support vector machine. The major goal of this hybridization was to exploit fully their respective merits (e.g., robustness to the size of solution space and capability of handling a very large dimension of feature genes) for identification of key feature genes (or molecular signatures) for a complex biological phenotype. We have applied the approach to the microarray data of diffuse large B cell lymphoma to demonstrate its behaviors and properties for mining the high-dimension data of genome-wide gene expression profiles. The resulting classifier(s) (the optimal gene subset(s)) has achieved the highest accuracy (99 for prediction of independent microarray samples in comparisons with marginal filters and a hybrid between genetic algorithm and K nearest neighbors.

Keywords: biosvm
[Li2005Prediction] H. Li, C. Ung, C. Yap, Y. Xue, Z. Li, Z. Cao, and Y. Chen. Prediction of genotoxicity of chemical compounds by statistical learning methods. Chem. Res. Toxicol., 18(6):1071-1080, Jun 2005. [ bib | DOI | http | .pdf ]
Various toxicological profiles, such as genotoxic potential, need to be studied in drug discovery processes and submitted to the drug regulatory authorities for drug safety evaluation. As part of the effort for developing low cost and efficient adverse drug reaction testing tools, several statistical learning methods have been used for developing genotoxicity prediction systems with an accuracy of up to 73.8% for genotoxic (GT+) and 92.8% for nongenotoxic (GT-) agents. These systems have been developed and tested by using less than 400 known GT+ and GT- agents, which is significantly less in number and diversity than the 860 GT+ and GT- agents known at present. There is a need to examine if a similar level of accuracy can be achieved for the more diverse set of molecules and to evaluate other statistical learning methods not yet applied to genotoxicity prediction. This work is intended for testing several statistical learning methods by using 860 GT+ and GT- agents, which include support vector machines (SVM), probabilistic neural network (PNN), k-nearest neighbor (k-NN), and C4.5 decision tree (DT). A feature selection method, recursive feature elimination, is used for selecting molecular descriptors relevant to genotoxicity study. The overall accuracies of SVM, k-NN, and PNN are comparable to and those of DT lower than the results from earlier studies, with SVM giving the highest accuracies of 77.8% for GT+ and 92.7% for GT- agents. Our study suggests that statistical learning methods, particularly SVM, k-NN, and PNN, are useful for facilitating the prediction of genotoxic potential of a diverse set of molecules.

Keywords: biosvm chemoinformatics
[Lewis2005Conserved] Benjamin P Lewis, Christopher B Burge, and David P Bartel. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microrna targets. Cell, 120(1):15-20, Jan 2005. [ bib | DOI | http | .pdf ]
We predict regulatory targets of vertebrate microRNAs (miRNAs) by identifying mRNAs with conserved complementarity to the seed (nucleotides 2-7) of the miRNA. An overrepresentation of conserved adenosines flanking the seed complementary sites in mRNAs indicates that primary sequence determinants can supplement base pairing to specify miRNA target recognition. In a four-genome analysis of 3' UTRs, approximately 13,000 regulatory relationships were detected above the estimate of false-positive predictions, thereby implicating as miRNA targets more than 5300 human genes, which represented 30% of our gene set. Targeting was also detected in open reading frames. In sum, well over one third of human genes appear to be conserved miRNA targets.

Keywords: sirna
[Lee2005extensive] J. W. Lee, J. B. Lee, M. Park, and S. H. Song. An extensive comparison of recent classification tools applied to microarray data. Comput. Stat. Data An., 48:869-885, 2005. [ bib | .pdf ]
[Lee2005improved] Jaewook Lee and Daewon Lee. An improved cluster labeling method for support vector clustering. IEEE Trans Pattern Anal Mach Intell, 27(3):461-4, Mar 2005. [ bib ]
The support vector clustering (SVC) algorithm is a recently emerged unsupervised learning method inspired by support vector machines. One key step involved in the SVC algorithm is the cluster assignment of each data point. A new cluster labeling method for SVC is developed based on some invariant topological properties of a trained kernel radius function. Benchmark results show that the proposed method outperforms previously reported labeling techniques.

[Lavielle2005Adaptive] M. Lavielle and G. Teyssière. Adaptive detection of multiple change-points in asset price volatility. In G. Teyssière and A. Kirman, editors, Long-Memory in Economics, pages 129-156. Springer Verlag, Berlin, 2005. [ bib ]
[Lavielle2005Using] M. Lavielle. Using penalized contrasts for the change-point problem. Signal Process., 85(8):1501-1510, 2005. [ bib | DOI | .pdf ]
[Laurie2005Q-Site] A. T. R. Laurie and R. M. Jackson. Q-sitefinder: an energy-based method for the prediction of protein-ligand binding sites. Bioinformatics, 21(9):1908-1916, 2005. [ bib | DOI ]
[Lasso2005Vessel] András Lassó and Emanuele Trucco. Vessel enhancement in digital X-ray angiographic sequences by temporal statistical learning. Comput Med Imaging Graph, 29(5):343-55, Jul 2005. [ bib | DOI | http ]
Keywords: Apoptosis, Gene Expression Profiling, Humans, Neoplasms, Non-U.S. Gov't, Oligonucleotide Array Sequence Analysis, Polymerase Chain Reaction, Proteins, Research Support, Subcellular Fractions, Unknown Primary, 15893453
[Larsen2005integrative] Mette Voldby Larsen, Claus Lundegaard, Kasper Lamberth, Søren Buus, Søren Brunak, Ole Lund, and Morten Nielsen. An integrative approach to CTL epitope prediction: a combined algorithm integrating MHC class I binding, TAP transport efficiency, and proteasomal cleavage predictions. Eur. J. Immunol., 35(8):2295-2303, Aug 2005. [ bib | DOI | http ]
Reverse immunogenetic approaches attempt to optimize the selection of candidate epitopes, and thus minimize the experimental effort needed to identify new epitopes. When predicting cytotoxic T cell epitopes, the main focus has been on the highly specific MHC class I binding event. Methods have also been developed for predicting the antigen-processing steps preceding MHC class I binding, including proteasomal cleavage and transporter associated with antigen processing (TAP) transport efficiency. Here, we use a dataset obtained from the SYFPEITHI database to show that a method integrating predictions of MHC class I binding affinity, TAP transport efficiency, and C-terminal proteasomal cleavage outperforms any of the individual methods. Using an independent evaluation dataset of HIV epitopes from the Los Alamos database, the validity of the integrated method is confirmed. The performance of the integrated method is found to be significantly higher than that of the two publicly available prediction methods BIMAS and SYFPEITHI. To identify 85% of the epitopes in the HIV dataset, 9% and 10% of all possible nonamers in the HIV proteins must be tested when using the BIMAS and SYFPEITHI methods, respectively, for the selection of candidate epitopes. This number is reduced to 7% when using the integrated method. In practical terms, this means that the experimental effort needed to identify an epitope in a hypothetical protein with 85% probability is reduced by 20-30% when using the integrated method. The method is available at http://www.cbs.dtu.dk/services/NetCTL. Supplementary material is available at http://www.cbs.dtu.dk/suppl/immunology/CTL.php.

Keywords: Algorithms; Data Interpretation, Statistical; Epitopes, T-Lymphocyte; Histocompatibility Antigens Class I; Humans; Hydrolysis; Predictive Value of Tests; Proteasome Endopeptidase Complex; Protein Binding; Research Support, N.I.H., Extramural; Research Support, Non-U.S. Gov't; Research Support, U.S. Gov't, P.H.S.; T-Lymphocytes, Cytotoxic
[Lapinsh2005Improved] M. Lapinsh, P. Prusis, S. Uhlén, and J. E. S. Wikberg. Improved approach for proteochemometrics modeling: application to organic compound-amine G protein-coupled receptor interactions. Bioinformatics, 21(23):4289-4296, Dec 2005. [ bib | DOI | http ]
MOTIVATION: Proteochemometrics is a novel technology for the analysis of interactions of series of proteins with series of ligands. We have here customized it for analysis of large datasets and evaluated it for the modeling of the interaction of psychoactive organic amines with all the five known families of amine G protein-coupled receptors (GPCRs). RESULTS: The model exploited data for the binding of 22 compounds to 31 amine GPCRs, correlating chemical descriptions and cross-descriptions of compounds and receptors to binding affinity using a novel strategy. A highly valid model (q2 = 0.76) was obtained which was further validated by external predictions using data for 10 other entirely independent compounds, yielding the high q2ext = 0.67. Interpretation of the model reveals molecular interactions that govern psychoactive organic amines overall affinity for amine GPCRs, as well as their selectivity for particular amine GPCRs. The new modeling procedure allows us to obtain fully interpretable proteochemometrics models using essentially unlimited number of ligand and protein descriptors.

Keywords: chemogenomics
[LaCount2005protein] Douglas J LaCount, Marissa Vignali, Rakesh Chettier, Amit Phansalkar, Russell Bell, Jay R Hesselberth, Lori W Schoenfeld, Irene Ota, Sudhir Sahasrabudhe, Cornelia Kurschner, Stanley Fields, and Robert E Hughes. A protein interaction network of the malaria parasite Plasmodium falciparum. Nature, 438(7064):103-107, Nov 2005. [ bib | DOI | http | .pdf ]
Plasmodium falciparum causes the most severe form of malaria and kills up to 2.7 million people annually. Despite the global importance of P. falciparum, the vast majority of its proteins have not been characterized experimentally. Here we identify P. falciparum protein-protein interactions using a high-throughput version of the yeast two-hybrid assay that circumvents the difficulties in expressing P. falciparum proteins in Saccharomyces cerevisiae. From more than 32,000 yeast two-hybrid screens with P. falciparum protein fragments, we identified 2,846 unique interactions, most of which include at least one previously uncharacterized protein. Informatic analyses of network connectivity, coexpression of the genes encoding interacting fragments, and enrichment of specific protein domains or Gene Ontology annotations were used to identify groups of interacting proteins, including one implicated in chromatin modification, transcription, messenger RNA stability and ubiquitination, and another implicated in the invasion of host cells. These data constitute the first extensive description of the protein interaction network for this important human pathogen.

Keywords: plasmodium
[LaConte2005Support] Stephen LaConte, Stephen Strother, Vladimir Cherkassky, Jon Anderson, and Xiaoping Hu. Support vector machines for temporal classification of block design fMRI data. Neuroimage, 26(2):317-29, Jun 2005. [ bib | DOI | http ]
This paper treats support vector machine (SVM) classification applied to block design fMRI, extending our previous work with linear discriminant analysis [LaConte, S., Anderson, J., Muley, S., Ashe, J., Frutiger, S., Rehm, K., Hansen, L.K., Yacoub, E., Hu, X., Rottenberg, D., Strother S., 2003a. The evaluation of preprocessing choices in single-subject BOLD fMRI using NPAIRS performance metrics. NeuroImage 18, 10-27; Strother, S.C., Anderson, J., Hansen, L.K., Kjems, U., Kustra, R., Siditis, J., Frutiger, S., Muley, S., LaConte, S., Rottenberg, D. 2002. The quantitative evaluation of functional neuroimaging experiments: the NPAIRS data analysis framework. NeuroImage 15, 747-771]. We compare SVM to canonical variates analysis (CVA) by examining the relative sensitivity of each method to ten combinations of preprocessing choices consisting of spatial smoothing, temporal detrending, and motion correction. Important to the discussion are the issues of classification performance, model interpretation, and validation in the context of fMRI. As the SVM has many unique properties, we examine the interpretation of support vector models with respect to neuroimaging data. We propose four methods for extracting activation maps from SVM models, and we examine one of these in detail. For both CVA and SVM, we have classified individual time samples of whole brain data, with TRs of roughly 4 s, thirty slices, and nearly 30,000 brain voxels, with no averaging of scans or prior feature selection.

[LaBaer2005Protein] Joshua LaBaer and Niroshan Ramachandran. Protein microarrays as tools for functional proteomics. Curr Opin Chem Biol, 9(1):14-19, Feb 2005. [ bib | DOI | http ]
Protein microarrays present an innovative and versatile approach to study protein abundance and function at an unprecedented scale. Given the chemical and structural complexity of the proteome, the development of protein microarrays has been challenging. Despite these challenges there has been a marked increase in the use of protein microarrays to map interactions of proteins with various other molecules, and to identify potential disease biomarkers, especially in the area of cancer biology. In this review, we discuss some of the promising advances made in the development and use of protein microarrays.

Keywords: Protein Array Analysis; Proteins; Proteomics; Surface Properties
[Kumar2005BhairPred] M. Kumar, M. Bhasin, N. K. Natt, and G. P. S. Raghava. BhairPred: prediction of beta-hairpins in a protein from multiple alignment information using ANN and SVM techniques. Nucleic Acids Res, 33(Web Server issue):W154-9, Jul 2005. [ bib | DOI | http | .pdf ]
This paper describes a method for predicting a supersecondary structural motif, beta-hairpins, in a protein sequence. The method was trained and tested on a set of 5102 hairpins and 5131 non-hairpins, obtained from a non-redundant dataset of 2880 proteins using the DSSP and PROMOTIF programs. Two machine-learning techniques, an artificial neural network (ANN) and a support vector machine (SVM), were used to predict beta-hairpins. An accuracy of 65.5% was achieved using ANN when an amino acid sequence was used as the input. The accuracy improved from 65.5 to 69.1% when evolutionary information (PSI-BLAST profile), observed secondary structure and surface accessibility were used as the inputs. The accuracy of the method further improved from 69.1 to 79.2% when the SVM was used for classification instead of the ANN. The performances of the methods developed were assessed in a test case, where predicted secondary structure and surface accessibility were used instead of the observed structure. The highest accuracy achieved by the SVM based method in the test case was 77.9%. A maximum accuracy of 71.1% with Matthew's correlation coefficient of 0.41 in the test case was obtained on a dataset previously used by X. Cruz, E. G. Hutchinson, A. Shephard and J. M. Thornton (2002) Proc. Natl Acad. Sci. USA, 99, 11157-11162. The performance of the method was also evaluated on proteins used in the '6th community-wide experiment on the critical assessment of techniques for protein structure prediction (CASP6)'. Based on the algorithm described, a web server, BhairPred (http://www.imtech.res.in/raghava/bhairpred/), has been developed, which can be used to predict beta-hairpins in a protein using the SVM approach.

Keywords: biosvm
[Kuang2005Profile-based] R. Kuang, E. Ie, K. Wang, K. Wang, M. Siddiqi, Y. Freund, and C. Leslie. Profile-based string kernels for remote homology detection and motif extraction. J. Bioinform. Comput. Biol., 3(3):527-550, Jun 2005. [ bib ]
We introduce novel profile-based string kernels for use with support vector machines (SVMs) for the problems of protein classification and remote homology detection. These kernels use probabilistic profiles, such as those produced by the PSI-BLAST algorithm, to define position-dependent mutation neighborhoods along protein sequences for inexact matching of k-length subsequences ("k-mers") in the data. By use of an efficient data structure, the kernels are fast to compute once the profiles have been obtained. For example, the time needed to run PSI-BLAST in order to build the profiles is significantly longer than both the kernel computation time and the SVM training time. We present remote homology detection experiments based on the SCOP database where we show that profile-based string kernels used with SVM classifiers strongly outperform all recently presented supervised SVM methods. We further examine how to incorporate predicted secondary structure information into the profile kernel to obtain a small but significant performance improvement. We also show how we can use the learned SVM classifier to extract "discriminative sequence motifs"-short regions of the original profile that contribute almost all the weight of the SVM classification score-and show that these discriminative motifs correspond to meaningful structural features in the protein data. The use of PSI-BLAST profiles can be seen as a semi-supervised learning technique, since PSI-BLAST leverages unlabeled data from a large sequence database to build more informative profiles. Recently presented "cluster kernels" give general semi-supervised methods for improving SVM protein classification performance. We show that our profile kernel results also outperform cluster kernels while providing much better scalability to large datasets.

Keywords: biosvm
[Kratochwil2005automated] N. A. Kratochwil, P. Malherbe, L. Lindemann, M. Ebeling, M. C. Hoener, A. Mühlemann, R. H. P. Porter, M. Stahl, and P. R. Gerber. An automated system for the analysis of G protein-coupled receptor transmembrane binding pockets: alignment, receptor-based pharmacophores, and their application. J. Chem. Inf. Model., 45(5):1324-1336, 2005. [ bib | DOI | http | .pdf ]
G protein-coupled receptors (GPCRs) share a common architecture consisting of seven transmembrane (TM) domains. Various lines of evidence suggest that this fold provides a generic binding pocket within the TM region for hosting agonists, antagonists, and allosteric modulators. Here, a comprehensive and automated method allowing fast analysis and comparison of these putative binding pockets across the entire GPCR family is presented. The method relies on a robust alignment algorithm based on conservation indices, focusing on pharmacophore-like relationships between amino acids. Analysis of conservation patterns across the GPCR family and alignment to the rhodopsin X-ray structure allows the extraction of the amino acids lining the TM binding pocket in a so-called ligand binding pocket vector (LPV). In a second step, LPVs are translated to simple 3D receptor pharmacophore models, where each amino acid is represented by a single spherical pharmacophore feature and all atomic detail is omitted. Applications of the method include the assessment of selectivity issues, support of mutagenesis studies, and the derivation of rules for focused screening to identify chemical starting points in early drug discovery projects. Because of the coarseness of this 3D receptor pharmacophore model, however, meaningful scoring and ranking procedures of large sets of molecules are not justified. The LPV analysis of the trace amine-associated receptor family and its experimental validation is discussed as an example. The value of the 3D receptor model is demonstrated for a class C GPCR family, the metabotropic glutamate receptors.

Keywords: chemogenomics
[Komura2005Multidimensional] D. Komura, H. Nakamura, S. Tsutsumi, H. Aburatani, and S. Ihara. Multidimensional support vector machines for visualization of gene expression data. Bioinformatics, 21(4):439-444, Feb 2005. [ bib | DOI | http | .pdf ]
Motivation: Since DNA microarray experiments provide us with huge amount of gene expression data, they should be analyzed with statistical methods to extract the meanings of experimental results. Some dimensionality reduction methods such as Principal Component Analysis (PCA) are used to roughly visualize the distribution of high dimensional gene expression data. However, in the case of binary classification of gene expression data, PCA does not utilize class information when choosing axes. Thus clearly separable data in the original space may not be so in the reduced space used in PCA. Results: For visualization and class prediction of gene expression data, we have developed a new SVM-based method called multidimensional SVMs, that generate multiple orthogonal axes. This method projects high dimensional data into lower dimensional space to exhibit properties of the data clearly and to visualize a distribution of the data roughly. Furthermore, the multiple axes can be used for class prediction. The basic properties of conventional SVMs are retained in our method: solutions of mathematical programming are sparse, and nonlinear classification is implemented implicitly through the use of kernel functions. The application of our method to the experimentally obtained gene expression datasets for patients' samples indicates that our algorithm is efficient and useful for visualization and class prediction.

Keywords: biosvm
[Koehn-05] P. Koehn. Europarl: A parallel corpus for statistical machine translation. In MT Summit, 2005. [ bib ]
[Kitano2005NatBiotechnol] H. Kitano, A. Funahashi, Y. Matsuoka, and K. Oda. Using process diagrams for the graphical representation of biological networks. Nat. Biotechnol., 8:961-966, 2005. [ bib | DOI ]
With the increased interest in understanding biological networks, such as protein-protein interaction networks and gene regulatory networks, methods for representing and communicating such networks in both human- and machine-readable form have become increasingly important. Although there has been significant progress in machine-readable representation of networks, as exemplified by the Systems Biology Mark-up Language (SBML) (http://www.sbml.org) issues in human-readable representation have been largely ignored. This article discusses human-readable diagrammatic representations and proposes a set of notations that enhances the formality and richness of the information represented. The process diagram is a fully state transition-based diagram that can be translated into machine-readable forms such as SBML in a straightforward way. It is supported by CellDesigner, a diagrammatic network editing software (http://www.celldesigner.org/), and has been used to represent a variety of networks of various sizes (from only a few components to several hundred components).

Keywords: csbcbook
[Kim2005Locally] Tae-Kyun Kim and Josef Kittler. Locally linear discriminant analysis for multimodally distributed classes for face recognition with a single model image. IEEE Trans Pattern Anal Mach Intell, 27(3):318-27, Mar 2005. [ bib ]
We present a novel method of nonlinear discriminant analysis involving a set of locally linear transformations called "Locally Linear Discriminant Analysis (LLDA)." The underlying idea is that global nonlinear data structures are locally linear and local structures can be linearly aligned. Input vectors are projected into each local feature space by linear transformations found to yield locally linearly transformed classes that maximize the between-class covariance while minimizing the within-class covariance. In face recognition, linear discriminant analysis (LDA) has been widely adopted owing to its efficiency, but it does not capture nonlinear manifolds of faces which exhibit pose variations. Conventional nonlinear classification methods based on kernels such as generalized discriminant analysis (GDA) and support vector machine (SVM) have been developed to overcome the shortcomings of the linear method, but they have the drawback of high computational cost of classification and overfitting. Our method is for multiclass nonlinear discrimination and it is computationally highly efficient as compared to GDA. The method does not suffer from overfitting by virtue of the linear base structure of the solution. A novel gradient-based learning algorithm is proposed for finding the optimal set of local linear bases. The optimization does not exhibit a local-maxima problem. The transformation functions facilitate robust face recognition in a low-dimensional subspace, under pose variations, using a single model image. The classification results are given for both synthetic and real face data.

[Khan2005Proteome] Shahid M Khan, Blandine Franke-Fayard, Gunnar R Mair, Edwin Lasonder, Chris J Janse, Matthias Mann, and Andrew P Waters. Proteome analysis of separated male and female gametocytes reveals novel sex-specific Plasmodium biology. Cell, 121(5):675-687, Jun 2005. [ bib | DOI | http | .pdf ]
Gametocytes, the precursor cells of malaria-parasite gametes, circulate in the blood and are responsible for transmission from host to mosquito vector. The individual proteomes of male and female gametocytes were analyzed using mass spectrometry, following separation by flow sorting of transgenic parasites expressing green fluorescent protein, in a sex-specific manner. Promoter tagging in transgenic parasites confirmed the designation of stage and sex specificity of the proteins. The male proteome contained 36% (236 of 650) male-specific and the female proteome 19% (101 of 541) female-specific proteins, but they share only 69 proteins, emphasizing the diverged features of the sexes. Of all the malaria life-cycle stages analyzed, the male gametocyte has the most distinct proteome, containing many proteins involved in flagellar-based motility and rapid genome replication. By identification of gender-specific protein kinases and phosphatases and using targeted gene disruption of two kinases, new sex-specific regulatory pathways were defined.

Keywords: plasmodium
[BioCyc2005] P. D. Karp, C. A. Ouzounis, C. Moore-Kochlacs, L. Goldovsky, P. Kaipa, D. Ahren, S. Tsoka, N. Darzentas, V. Kunin, and N. Lopez-Bigas. Expansion of the BioCyc collection of pathway/genome databases to 160 genomes. Nucleic Acids Res, 33(19):6083-9, 2005. [ bib ]
The BioCyc database collection is a set of 160 pathway/genome databases (PGDBs) for most eukaryotic and prokaryotic species whose genomes have been completely sequenced to date. Each PGDB in the BioCyc collection describes the genome and predicted metabolic network of a single organism, inferred from the MetaCyc database, which is a reference source on metabolic pathways from multiple organisms. In addition, each bacterial PGDB includes predicted operons for the corresponding species. The BioCyc collection provides a unique resource for computational systems biology, namely global and comparative analyses of genomes and metabolic networks, and a supplement to the BioCyc resource of curated PGDBs. The Omics viewer available through the BioCyc website allows scientists to visualize combinations of gene expression, proteomics and metabolomics data on the metabolic maps of these organisms. This paper discusses the computational methodology by which the BioCyc collection has been expanded, and presents an aggregate analysis of the collection that includes the range of number of pathways present in these organisms, and the most frequently observed pathways. We seek scientists to adopt and curate individual PGDBs within the BioCyc collection. Only by harnessing the expertise of many scientists we can hope to produce biological databases, which accurately reflect the depth and breadth of knowledge that the biomedical research community is producing.

Keywords: Animals; Computational Biology; *Databases, Genetic; *Genome; Genome, Archaeal; Genome, Bacterial; Genomics; Humans; Metabolism/genetics; Research Support, N.I.H., Extramural; Research Support, Non-U.S. Gov't; Research Support, U.S. Gov't, P.H.S.
[Baringhaus2005A] Gerhard Hessler and Karl-Heinz Baringhaus. A chemical genomics approach for ion channel modulators. In Chemogenomics in Drug Discovery, chapter 8, pages 221-242. Wiley-VCH, 2005. [ bib ]
[Karklin2005Classificationa] Yan Karklin, Richard F Meraz, and Stephen R Holbrook. Classification of non-coding RNA using graph representations of secondary structure. Pac Symp Biocomput, pages 4-15, 2005. [ bib ]
Some genes produce transcripts that function directly in regulatory, catalytic, or structural roles in the cell. These non-coding RNAs are prevalent in all living organisms, and methods that aid the understanding of their functional roles are essential. RNA secondary structure, the pattern of base-pairing, contains the critical information for determining the three dimensional structure and function of the molecule. In this work we examine whether the basic geometric and topological properties of secondary structure are sufficient to distinguish between RNA families in a learning framework. First, we develop a labeled dual graph representation of RNA secondary structure by adding biologically meaningful labels to the dual graphs proposed by Gan et al [1]. Next, we define a similarity measure directly on the labeled dual graphs using the recently developed marginalized kernels [2]. Using this similarity measure, we were able to train Support Vector Machine classifiers to distinguish RNAs of known families from random RNAs with similar statistics. For 22 of the 25 families tested, the classifier achieved better than 70% accuracy, with much higher accuracy rates for some families. Training a set of classifiers to automatically assign family labels to RNAs using a one vs. all multi-class scheme also yielded encouraging results. From these initial learning experiments, we suggest that the labeled dual graph representation, together with kernel machine methods, has potential for use in automated analysis and classification of uncharacterized RNA molecules or efficient genome-wide screens for RNA molecules from existing families.

Keywords: Base Sequence, Models, Molecular, Non-, Nucleic Acid Conformation, P.H.S., RNA, Research Support, U.S. Gov't, Untranslated, 15759609
[Karklin2005Classification] Y. Karklin, R. F. Meraz, and S. R. Holbrook. Classification of non-coding RNA using graph representations of secondary structure. Pac. Symp. Biocomput., pages 4-15, 2005. [ bib | .pdf ]
Keywords: biosvm
[Karchin2005Improving] R. Karchin, L. Kelly, and A. Sali. Improving functional annotation of non-synonomous SNPs with information theory. Pac Symp Biocomput, pages 397-408, 2005. [ bib ]
Automated functional annotation of nsSNPs requires that amino-acid residue changes are represented by a set of descriptive features, such as evolutionary conservation, side-chain volume change, effect on ligand-binding, and residue structural rigidity. Identifying the most informative combinations of features is critical to the success of a computational prediction method. We rank 32 features according to their mutual information with functional effects of amino-acid substitutions, as measured by in vivo assays. In addition, we use a greedy algorithm to identify a subset of highly informative features. The method is simple to implement and provides a quantitative measure for selecting the best predictive features given a set of features that a human expert believes to be informative. We demonstrate the usefulness of the selected highly informative features by cross-validated tests of a computational classifier, a support vector machine (SVM). The SVM's classification accuracy is highly correlated with the ranking of the input features by their mutual information. Two features describing the solvent accessibility of "wild-type" and "mutant" amino-acid residues and one evolutionary feature based on superfamily-level multiple alignments produce comparable overall accuracy and 6% fewer false positives than a 32-feature set that considers physiochemical properties of amino acids, protein electrostatics, amino-acid residue flexibility, and binding interactions.

Keywords: biosvm
[Reactome2005] G. Joshi-Tope, M. Gillespie, I. Vastrik, P. D'Eustachio, E. Schmidt, B. de Bono, B. Jassal, G. R. Gopinath, G. R. Wu, L. Matthews, S. Lewis, E. Birney, and L. Stein. Reactome: a knowledgebase of biological pathways. Nucleic Acids Res, 33(Database issue):D428-32, 2005. 1362-4962 (Electronic) Journal Article. [ bib ]
Reactome, located at http://www.reactome.org is a curated, peer-reviewed resource of human biological processes. Given the genetic makeup of an organism, the complete set of possible reactions constitutes its reactome. The basic unit of the Reactome database is a reaction; reactions are then grouped into causal chains to form pathways. The Reactome data model allows us to represent many diverse processes in the human system, including the pathways of intermediary metabolism, regulatory pathways, and signal transduction, and high-level processes, such as the cell cycle. Reactome provides a qualitative framework, on which quantitative data can be superimposed. Tools have been developed to facilitate custom data entry and annotation by expert biologists, and to allow visualization and exploration of the finished dataset as an interactive process map. Although our primary curational domain is pathways from Homo sapiens, we regularly create electronic projections of human pathways onto other organisms via putative orthologs, thus making Reactome relevant to model organism research communities. The database is publicly available under open source terms, which allows both its content and its software infrastructure to be freely used and redistributed.

Keywords: Animals; Databases, Factual; Gene Expression Profiling; Humans; Metabolism; Physiological Processes; Research Support, Non-U.S. Gov't; Research Support, U.S. Gov't, P.H.S.; Signal Transduction; User-Computer Interface
[Jorissen2005Virtual] R. N. Jorissen and M. K. Gilson. Virtual screening of molecular databases using a support vector machine. J Chem Inf Model, 45(3):549-61, 2005. [ bib | DOI | http | .pdf ]
The Support Vector Machine (SVM) is an algorithm that derives a model used for the classification of data into two categories and which has good generalization properties. This study applies the SVM algorithm to the problem of virtual screening for molecules with a desired activity. In contrast to typical applications of the SVM, we emphasize not classification but enrichment of actives by using a modified version of the standard SVM function to rank molecules. The method employs a simple and novel criterion for picking molecular descriptors and uses cross-validation to select SVM parameters. The resulting method is more effective at enriching for active compounds with novel chemistries than binary fingerprint-based methods such as binary kernel discrimination.

Keywords: biosvm
[Johnson2005Kinomics] S. A. Johnson and T. Hunter. Kinomics: methods for deciphering the kinome. Nat. Methods, 2(1):17-25, Jan 2005. [ bib | DOI | http | .pdf ]
Phosphorylation by protein kinases is the most widespread and well-studied signaling mechanism in eukaryotic cells. Phosphorylation can regulate almost every property of a protein and is involved in all fundamental cellular processes. Cataloging and understanding protein phosphorylation is no easy task: many kinases may be expressed in a cell, and one-third of all intracellular proteins may be phosphorylated, representing as many as 20,000 distinct phosphoprotein states. Defining the kinase complement of the human genome, the kinome, has provided an excellent starting point for understanding the scale of the problem. The kinome consists of 518 kinases, and every active protein kinase phosphorylates a distinct set of substrates in a regulated manner. Deciphering the complex network of phosphorylation-based signaling is necessary for a thorough and therapeutically applicable understanding of the functioning of a cell in physiological and pathological states. We review contemporary techniques for identifying physiological substrates of the protein kinases and studying phosphorylation in living cells.

Keywords: csbcbook, csbcbook-ch2
[Jerebko2005Support] Anna K Jerebko, James D Malley, Marek Franaszek, and Ronald M Summers. Support vector machines committee classification method for computer-aided polyp detection in CT colonography. Acad Radiol, 12(4):479-86, Apr 2005. [ bib | DOI | http ]
RATIONALE AND OBJECTIVES: A new classification scheme for the computer-aided detection of colonic polyps in computed tomographic colonography is proposed. MATERIALS AND METHODS: The scheme involves an ensemble of support vector machines (SVMs) for classification, a smoothed leave-one-out (SLOO) cross-validation method for obtaining error estimates, and use of a bootstrap aggregation method for training and model selection. Our use of an ensemble of SVM classifiers with bagging (bootstrap aggregation), built on different feature subsets, is intended to improve classification performance compared with single SVMs and reduce the number of false-positive detections. The bootstrap-based model-selection technique is used for tuning SVM parameters. In our first experiment, two independent data sets were used: the first, for feature and model selection, and the second, for testing to evaluate the generalizability of our model. In the second experiment, the test set that contained higher resolution data was used for training and testing (using the SLOO method) to compare SVM committee and single SVM performance. RESULTS: The overall sensitivity on independent test set was 75%, with 1.5 false-positive detections/study, compared with 76%-78% sensitivity and 4.5 false-positive detections/study estimated using the SLOO method on the training set. The sensitivity of the SVM ensemble retrained on the former test set estimated using the SLOO method was 81%, which is 7%-10% greater than the sensitivity of a single SVM. The number of false-positive detections per study was 2.6, a 1.5 times reduction compared with a single SVM. CONCLUSION: Training an SVM ensemble on one data set and testing it on the independent data has shown that the SVM committee classification method has good generalizability and achieves high sensitivity and a low false-positive rate. The model selection and improved error estimation method are effective for computer-aided polyp detection.

Keywords: 15831422
[Jarzab2005Gene] Barbara Jarzab, Malgorzata Wiench, Krzysztof Fujarewicz, Krzysztof Simek, Michal Jarzab, Malgorzata Oczko-Wojciechowska, Jan Wloch, Agnieszka Czarniecka, Ewa Chmielik, Dariusz Lange, Agnieszka Pawlaczek, Sylwia Szpak, Elzbieta Gubala, and Andrzej Swierniak. Gene expression profile of papillary thyroid cancer: sources of variability and diagnostic implications. Cancer Res., 65(4):1587-97, Feb 2005. [ bib | DOI | http | .pdf ]
The study looked for an optimal set of genes differentiating between papillary thyroid cancer (PTC) and normal thyroid tissue and assessed the sources of variability in gene expression profiles. The analysis was done by oligonucleotide microarrays (GeneChip HG-U133A) in 50 tissue samples taken intraoperatively from 33 patients (23 PTC patients and 10 patients with other thyroid disease). In the initial group of 16 PTC and 16 normal samples, we assessed the sources of variability in the gene expression profile by singular value decomposition which specified three major patterns of variability. The first and the most distinct mode grouped transcripts differentiating between tumor and normal tissues. Two consecutive modes contained a large proportion of immunity-related genes. To generate a multigene classifier for tumor-normal difference, we used support vector machines-based technique (recursive feature replacement). It included the following 19 genes: DPP4, GJB3, ST14, SERPINA1, LRP4, MET, EVA1, SPUVE, LGALS3, HBB, MKRN2, MRC2, IGSF1, KIAA0830, RXRG, P4HA2, CDH3, IL13RA1, and MTMR4, and correctly discriminated 17 of 18 additional PTC/normal thyroid samples and all 16 samples published in a previous microarray study. Selected novel genes (LRP4, EVA1, TMPRSS4, QPCT, and SLC34A2) were confirmed by Q-PCR. Our results prove that the gene expression signal of PTC is easily detectable even when cancer cells do not prevail over tumor stroma. We indicate and separate the confounding variability related to the immune response. Finally, we propose a potent molecular classifier able to discriminate between PTC and nonmalignant thyroid in more than 90% of investigated samples.

Keywords: biosvm
[Janoueix-Lerosey2005Preferential] Isabelle Janoueix-Lerosey, Philippe Hupé, Zofia Maciorowski, Philippe La Rosa, Gudrun Schleiermacher, Gaëlle Pierron, Stéphane Liva, Emmanuel Barillot, and Olivier Delattre. Preferential occurrence of chromosome breakpoints within early replicating regions in neuroblastoma. Cell Cycle, 4(12):1842-1846, Dec 2005. [ bib | http | .pdf ]
Neuroblastoma (NB) is a frequent paediatric extra cranial solid tumor characterized by the occurrence of unbalanced chromosome translocations, frequently, but not exclusively, involving chromosomes 1 and 17. We have used a 1 Mb resolution BAC array to further refine the mapping of breakpoints in NB cell lines. Replication timing profiles were evaluated in 7 NB cell lines, using DNAs from G1 and S phases flow sorted nuclei hybridised on the same array. Strikingly, these replication timing profiles were highly similar between the different NB cell lines. Furthermore, a significant level of similarity was also observed between NB cell lines and lymphoblastoid cells. A segmentation analysis using the Adaptative Weights Smoothing procedure was performed to determine regions of coordinate replication. More than 50% of the breakpoints mapped to early replicating regions, which account for 23.7% of the total genome. The breakpoints frequency per 10^8 bases was therefore 10.84 for early replicating regions, whereas it was only 2.94 for late replicating regions, this difference being highly significant (p < 10^-4). This strong association was also observed when chromosomes 1 and 17, the two most frequent translocation partners in NB, were excluded from the statistical analysis. These results unambiguously establish a link between unbalanced translocations, whose most likely mechanism of occurrence relies on break-induced replication, and early replication of the genome.

Keywords: csbcbook, csbcbook-ch2
[Irwin2005ZINC] J. J. Irwin and B. K. Shoichet. ZINC - a free database of commercially available compounds for virtual screening. J Chem Inf Model, 45(1):177-82, 2005. [ bib | DOI | http | .pdf ]
A critical barrier to entry into structure-based virtual screening is the lack of a suitable, easy to access database of purchasable compounds. We have therefore prepared a library of 727,842 molecules, each with 3D structure, using catalogs of compounds from vendors (the size of this library continues to grow). The molecules have been assigned biologically relevant protonation states and are annotated with properties such as molecular weight, calculated LogP, and number of rotatable bonds. Each molecule in the library contains vendor and purchasing information and is ready for docking using a number of popular docking programs. Within certain limits, the molecules are prepared in multiple protonation states and multiple tautomeric forms. In one format, multiple conformations are available for the molecules. This database is available for free download (http://zinc.docking.org) in several common file formats including SMILES, mol2, 3D SDF, and DOCK flexibase format. A Web-based query tool incorporating a molecular drawing interface enables the database to be searched and browsed and subsets to be created. Users can process their own molecules by uploading them to a server. Our hope is that this database will bring virtual screening libraries to a wide community of structural biologists and medicinal chemists.

Keywords: Databases, Digital, Drug Design, Factual, Libraries, Molecular Conformation, Molecular Structure, P.H.S., Research Support, U.S. Gov't, 15667143
[Ioannidis2005Microarrays] J. P. A. Ioannidis. Microarrays and molecular research: noise discovery? Lancet, 365(9458):454, 2005. [ bib | .pdf ]
Keywords: microarray
[Ioannidis2005most] J.P.A. Ioannidis. Why most published research findings are false. PLoS medicine, 2(8):e124, 2005. [ bib ]
[Ikeda2005asymptotic] Kazushi Ikeda and Tsutomu Aoishi. An asymptotic statistical analysis of support vector machines with soft margins. Neural Netw, 18(3):251-9, Apr 2005. [ bib | DOI | http | .pdf ]
The generalization properties of support vector machines (SVMs) are examined. From a geometrical point of view, the estimated parameter of an SVM is the one nearest the origin in the convex hull formed with given examples. Since introducing soft margins is equivalent to reducing the convex hull of the examples, an SVM with soft margins has a different learning curve from the original. In this paper we derive the asymptotic average generalization error of SVMs with soft margins in simple cases, that is, only when the dimension of inputs is one, and quantitatively show that soft margins increase the generalization error.

Keywords: Apoptosis, Gene Expression Profiling, Humans, Neoplasms, Non-U.S. Gov't, Oligonucleotide Array Sequence Analysis, Polymerase Chain Reaction, Proteins, Research Support, Subcellular Fractions, Unknown Primary, 15896573
[Huppi2005Defining] K. Huppi, S. E. Martin, and N. J. Caplen. Defining and assaying RNAi in mammalian cells. Mol. Cell, 17(1):1-10, Jan 2005. [ bib | DOI | http ]
The investigation of protein function through the inhibition of activity has been critical to our understanding of many normal and abnormal biological processes. Until recently, functional inhibition in biological systems has been induced using a variety of approaches including small molecule antagonists, antibodies, aptamers, ribozymes, antisense oligonucleotides or transcripts, morpholinos, dominant-negative mutants, and knockout transgenic animals. Although all of these approaches have made substantial advances in our understanding of the function of many proteins, a lack of specificity or restricted applicability has limited their utility. Recently, exploitation of the naturally occurring posttranscriptional gene silencing mechanism triggered by double-stranded RNA (dsRNA), termed RNA interference (RNAi), has gained much favor as an alternative means for analyzing gene function. Aspects of the basic biology of RNAi, its application as a functional genomics tool, and its potential as a therapeutic approach have been extensively reviewed (Hannon and Rossi, 2004; Meister and Tuschl, 2004); however, there has been only limited discussion as to how to design and validate an individual RNAi effector molecule and how to interpret RNAi data overall, particularly with reference to experimentation in mammalian cells. This perspective will aim to consider some of the issues encountered when conducting and interpreting RNAi experiments in mammalian cells.

Keywords: sirna
[Huesken2005Design] D. Huesken, J. Lange, C. Mickanin, J. Weiler, F. Asselbergs, J. Warner, B. Meloon, S. Engel, A. Rosenberg, D. Cohen, M. Labow, M. Reinhardt, F. Natt, and J. Hall. Design of a genome-wide siRNA library using an artificial neural network. Nat. Biotechnol., 23(8):995-1001, Aug 2005. [ bib | DOI | http | .pdf ]
The largest gene knock-down experiments performed to date have used multiple short interfering/short hairpin (si/sh)RNAs per gene [1, 2, 3]. To overcome this burden for design of a genome-wide siRNA library, we used the Stuttgart Neural Net Simulator to train algorithms on a data set of 2,182 randomly selected siRNAs targeted to 34 mRNA species, assayed through a high-throughput fluorescent reporter gene system. The algorithm (BIOPREDsi) reliably predicted activity of 249 siRNAs of an independent test set (Pearson coefficient r = 0.66) and siRNAs targeting endogenous genes at mRNA and protein levels. Neural networks trained on a complementary 21-nucleotide (nt) guide sequence were superior to those trained on a 19-nt sequence. BIOPREDsi was used in the design of a genome-wide siRNA collection with two potent siRNAs per gene. When this collection of 50,000 siRNAs was used to identify genes involved in the cellular response to hypoxia, two of the most potent hits were the key hypoxia transcription factors HIF1A and ARNT.

Keywords: sirna
[Huang2005Supporta] Yu-Len Huang and Dar-Ren Chen. Support vector machines in sonography: application to decision making in the diagnosis of breast cancer. Clin Imaging, 29(3):179-84, 2005. [ bib | DOI | http ]
We evaluated a series of pathologically proven breast tumors using the support vector machine (SVM) in the differential diagnosis of solid breast tumors. This study evaluated two ultrasonic image databases, i.e., DB1 and DB2. The DB1 contained 140 ultrasonic images of solid breast nodules (52 malignant and 88 benign). The DB2 contained 250 ultrasonic images of solid breast nodules (35 malignant and 215 benign). The physician-located regions of interest (ROI) of sonography and textural features were utilized to classify breast tumors. An SVM classifier using interpixel textural features classified the tumor as benign or malignant. The receiver operating characteristic (ROC) area index for the proposed system on the DB1 and the DB2 are 0.9695+/-0.0150 and 0.9552+/-0.0161, respectively. The proposed system differentiates solid breast nodules with a relatively high accuracy and helps inexperienced operators avoid misdiagnosis. The main advantage in the proposed system is that the training procedure of SVM was very fast and stable. The training and diagnosis procedure of the proposed system is almost 700 times faster than that of multilayer perceptron neural networks (MLPs). With the growth of the database, new ultrasonic images can be collected and used as reference cases while performing diagnoses. This study reduces the training and diagnosis time dramatically.

[Huang2005Gene] T. M. Huang and V. Kecman. Gene extraction for cancer diagnosis by support vector machines - An improvement. Artif. Intell. Med., Jul 2005. [ bib | DOI | http | .pdf ]
OBJECTIVE: To improve the performance of gene extraction for cancer diagnosis by recursive feature elimination with support vector machines (RFE-SVMs): A cancer diagnosis by using the DNA microarray data faces many challenges the most serious one being the presence of thousands of genes and only several dozens (at the best) of patient's samples. Thus, making any kind of classification in high-dimensional spaces from a limited number of data is both an extremely difficult and a prone to an error procedure. The improved RFE-SVMs is introduced and used here for an elimination of less relevant genes and just for a reduction of the overall number of genes used in a medical diagnostic. METHODS: The paper shows why and how the, usually neglected, penalty parameter C and some standard data preprocessing techniques (normalizing and scaling) influence classification results and the gene selection of RFE-SVMs. The gene selected by RFE-SVMs is compared with eight other gene selection algorithms implemented in the Rankgene software to investigate whether there is any consensus among the algorithms, so the scope of finding the right set of genes can be reduced. RESULTS: The improved RFE-SVMs is applied on the two benchmarking colon and lymphoma cancer data sets with various C parameters and different standard preprocessing techniques. Here, decreasing C leads to the smaller diagnosis error in comparisons to other known methods applied to the benchmarking data sets. With an appropriate parameter C and with a proper preprocessing procedure, the reduction in a diagnosis error is as high as 36%. CONCLUSIONS: The results suggest that with a properly chosen parameter C, the extracted genes and the constructed classifier will ensure less overfitting of the training data leading to an increased accuracy in selecting relevant genes. Finally, comparison in gene ranking obtained by different algorithms shows that there is a significant consensus among the various algorithms as to which set of genes is relevant.

Keywords: biosvm
[Huang2005Computation] Shao-Wei Huang and Jenn-Kang Hwang. Computation of conformational entropy from protein sequences using the machine-learning method - application to the study of the relationship between structural conservation and local structural stability. Proteins, 59(4):802-9, Jun 2005. [ bib | DOI | http | .pdf ]
A complete protein sequence can usually determine a unique conformation; however, the situation is different for shorter subsequences: some of them are able to adopt unique conformations, independent of context, while others assume diverse conformations in different contexts. The conformations of subsequences are determined by the interplay between local and nonlocal interactions. A quantitative measure of such structural conservation or variability will be useful in the understanding of the sequence-structure relationship. In this report, we developed an approach using the support vector machine method to compute the conformational variability directly from sequences, which is referred to as the sequence structural entropy. As a practical application, we studied the relationship between sequence structural entropy and the hydrogen exchange for a set of well-studied proteins. We found that the slowest exchange cores usually comprise amino acids of the lowest sequence structural entropy. Our results indicate that structural conservation is closely related to the local structural stability. This relationship may have interesting implications in the protein folding processes, and may be useful in the study of the sequence-structure relationship.

Keywords: biosvm
[Huang2005CTKPred] N. Huang, H. Chen, and Z. Sun. CTKPred: an SVM-based method for the prediction and classification of the cytokine superfamily. Protein Eng. Des. Sel., Jun 2005. [ bib | DOI | http | .pdf ]
Cell proliferation, differentiation and death are controlled by a multitude of cell-cell signals and loss of this control has devastating consequences. Prominent among these regulatory signals is the cytokine superfamily, which has crucial functions in the development, differentiation and regulation of immune cells. In this study, a support vector machine (SVM)-based method was developed for predicting families and subfamilies of cytokines using dipeptide composition. The taxonomy of the cytokine superfamily with which our method complies was described in the Cytokine Family cDNA Database (dbCFC) and the dataset used in this study for training and testing was obtained from the dbCFC and Structural Classification of Proteins (SCOP). The method classified cytokines and non-cytokines with an accuracy of 92.5% by 7-fold cross-validation. The method is further able to predict seven major classes of cytokine with an overall accuracy of 94.7%. A server for recognition and classification of cytokines based on multi-class SVMs has been set up at http://bioinfo.tsinghua.edu.cn/~huangni/CTKPred/.

Keywords: biosvm
[Huang2005Support] Jing Huang and Feng Shi. Support vector machines for predicting apoptosis proteins types. Acta Biotheor., 53(1):39-47, 2005. [ bib | DOI | http | .pdf ]
Apoptosis proteins have a central role in the development and homeostasis of an organism. These proteins are very important for understanding the mechanism of programmed cell death, and their function is related to their types. According to the classification scheme by Zhou and Doctor (2003), the apoptosis proteins are categorized into the following four types: (1) cytoplasmic protein; (2) plasma membrane-bound protein; (3) mitochondrial inner and outer proteins; (4) other proteins. A powerful learning machine, the Support Vector Machine, is applied for predicting the type of a given apoptosis protein by incorporating the sqrt-amino acid composition effect. High success rates were obtained by the re-substitute test (98/98 = 100%) and the jackknife test (89/98 = 90.8%).

Keywords: biosvm
[Hua2005Optimal] J. Hua, Z. Xiong, J. Lowey, E. Suh, and E. R. Dougherty. Optimal number of features as a function of sample size for various classification rules. Bioinformatics, 21(8):1509-1515, Apr 2005. To appear. [ bib | DOI | http | .pdf ]
Motivation: Given the joint feature-label distribution, increasing the number of features always results in decreased classification error; however, this is not the case when a classifier is designed via a classification rule from sample data. Typically (but not always), for fixed sample size, the error of a designed classifier decreases and then increases as the number of features grows. The potential downside of using too many features is most critical for small samples, which are commonplace for gene-expression-based classifiers for phenotype discrimination. For fixed sample size and feature-label distribution, the issue is to find an optimal number of features. Results: Since only in rare cases is there a known distribution of the error as a function of the number of features and sample size, this study employs simulation for various feature-label distributions and classification rules, and across a wide range of sample and feature-set sizes. To achieve the desired end, finding the optimal number of features as a function of sample size, it employs massively parallel computation. Seven classifiers are treated: 3-nearest-neighbor, Gaussian kernel, linear support vector machine, polynomial support vector machine, perceptron, regular histogram and linear discriminant analysis. Three Gaussian-based models are considered: linear, nonlinear and bimodal. In addition, real patient data from a large breast-cancer study is considered. To mitigate the combinatorial search for finding optimal feature sets, and to model the situation in which subsets of genes are co-regulated and correlation is internal to these subsets, we assume that the covariance matrix of the features is blocked, with each block corresponding to a group of correlated features. Altogether there is a large number of error surfaces for the many cases. These are provided in full on a companion web-site, which is meant to serve as resource for those working with small-sample classification. Availability: For the companion web-site, please visit http://public.tgen.org/tamu/ofs/.

Keywords: biosvm
[Hua2005JImmunol] F. Hua, M. G. Cornejo, M. H. Cardone, C. L. Stokes, and D. A. Lauffenburger. Effects of bcl-2 levels on fas signaling-induced caspase-3 activation: molecular genetic tests of computational model predictions. J Immunol, 175(2):985-95, 2005. [ bib ]
Fas-induced apoptosis is a critical process for normal immune system development and function. Although many molecular components in the Fas signaling pathway have been identified, a systematic understanding of how they work together to determine network dynamics and apoptosis itself has remained elusive. To address this, we generated a computational model for interpreting and predicting effects of pathway component properties. The model integrates current information concerning the signaling network downstream of Fas activation, through both type I and type II pathways, until activation of caspase-3. Unknown parameter values in the model were estimated using experimental data obtained from human Jurkat T cells. To elucidate critical signaling network properties, we examined the effects of altering the level of Bcl-2 on the kinetics of caspase-3 activation, using both overexpression and knockdown in the model and experimentally. Overexpression was used to distinguish among alternative hypotheses for inhibitory binding interactions of Bcl-2 with various components in the mitochondrial pathway. In comparing model simulations with experimental results, we find the best agreement when Bcl-2 blocks the release of cytochrome c by binding to both Bax and truncated Bid instead of Bax, truncated Bid, or Bid alone. Moreover, although Bcl-2 overexpression strongly reduces caspase-3 activation, Bcl-2 knockdown has a negligible effect, demonstrating a general model finding that varying the expression levels of signal molecules frequently has asymmetric effects on the outcome. Finally, we demonstrate that the relative dominance of type I vs type II pathways can be switched by varying particular signaling component levels without changing network structure.

Keywords: csbcbook
[Holford2005Visual] N. Holford. VPC, the visual predictive check - superiority to standard diagnostic (Rorschach) plots. In PAGE 14 (http://www.page-meeting.org/?abstract=738), 2005. [ bib | http ]
[Hizukuri2005Extraction] Y. Hizukuri, Y. Yamanishi, O. Nakamura, F. Yagi, S. Goto, and M. Kanehisa. Extraction of leukemia specific glycan motifs in humans by computational glycomics. Carbohydr. Res., 340(14):2270-8, Oct 2005. [ bib | DOI | http ]
There have been almost no standard methods for conducting computational analyses on glycan structures in comparison to DNA and proteins. In this paper, we present a novel method for extracting functional motifs from glycan structures using the KEGG/GLYCAN database. First, we developed a new similarity measure for comparing glycan structures taking into account the characteristic mechanisms of glycan biosynthesis, and we tested its ability to classify glycans of different blood components in the framework of support vector machines (SVMs). The results show that our method can successfully classify glycans from four types of human blood components: leukemic cells, erythrocyte, serum, and plasma. Next, we extracted characteristic functional motifs of glycans considered to be specific to each blood component. We predicted the substructure alpha-d-Neup5Ac-(2->3)-beta-d-Galp-(1->4)-d-GlcpNAc as a leukemia specific glycan motif. Based on the fact that the Agrocybe cylindracea galectin (ACG) specifically binds to the same substructure, we conducted an experiment using cell agglutination assay and confirmed that this fungal lectin specifically recognized human leukemic cells.

Keywords: glycans
[Hendrix2005Phosphodiesterase] Martin Hendrix and Christopher Kallus. Phosphodiesterase inhibitors: A chemogenomic view. In Chemogenomics in Drug Discovery, chapter 9, pages 243-288. Wiley-VCH, 2005. [ bib ]
[Han2005Fold] Sangjo Han, Byung-Chul Lee, Seung Taek Yu, Chan-Seok Jeong, Soyoung Lee, and Dongsup Kim. Fold recognition by combining profile-profile alignment and support vector machine. Bioinformatics, 21(11):2667-73, Jun 2005. [ bib | DOI | http | .pdf ]
MOTIVATION: Currently, the most accurate fold-recognition method is to perform profile-profile alignments and estimate the statistical significances of those alignments by calculating Z-score or E-value. Although this scheme is reliable in recognizing relatively close homologs related at the family level, it has difficulty in finding the remote homologs that are related at the superfamily or fold level. RESULTS: In this paper, we present an alternative method to estimate the significance of the alignments. The alignment between a query protein and a template of length n in the fold library is transformed into a feature vector of length n + 1, which is then evaluated by support vector machine (SVM). The output from SVM is converted to a posterior probability that a query sequence is related to a template, given SVM output. Results show that a new method shows significantly better performance than PSI-BLAST and profile-profile alignment with Z-score scheme. While PSI-BLAST and Z-score scheme detect 16 and 20% of superfamily-related proteins, respectively, at 90% specificity, a new method detects 46% of these proteins, resulting in more than 2-fold increase in sensitivity. More significantly, at the fold level, a new method can detect 14% of remotely related proteins at 90% specificity, a remarkable result considering the fact that the other methods can detect almost none at the same level of specificity.

Keywords: biosvm
[Han2005Prediction] L.Y. Han, C.Z. Cai, Z.L. Ji, and Y.Z. Chen. Prediction of functional class of novel viral proteins by a statistical learning method irrespective of sequence similarity. Virology, 331(1):136-143, 2005. [ bib | DOI | http | .pdf ]
The function of a substantial percentage of the putative protein-coding open reading frames (ORFs) in viral genomes is unknown. As their sequence is not similar to that of proteins of known function, the function of these ORFs cannot be assigned on the basis of sequence similarity. Methods that complement, or are used in combination with, sequence similarity-based approaches are being explored. The web-based software SVMProt to some extent assigns protein functional family irrespective of sequence similarity and has been found to be useful for studying distantly related proteins [Cai, C.Z., Han, L.Y., Ji, Z.L., Chen, X., Chen, Y.Z., 2003. SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res. 31(13): 3692-3697]. Here 25 novel viral proteins are selected to test the capability of SVMProt for functional family assignment of viral proteins whose function cannot be confidently predicted by sequence similarity methods at present. These proteins are without a sequence homolog in the Swissprot database, with their precise function provided in the literature, and not included in the training sets of SVMProt. The predicted functional classes of 72% of these proteins match the literature-described function, which is compared to the overall accuracy of 87% [...] proteins. This suggests that SVMProt to some extent is capable of functional class assignment irrespective of sequence similarity and it is potentially useful for facilitating functional study of novel viral proteins.

Keywords: biosvm
[Hall2005comprehensive] Neil Hall, Marianna Karras, J Dale Raine, Jane M Carlton, Taco W A Kooij, Matthew Berriman, Laurence Florens, Christoph S Janssen, Arnab Pain, Georges K Christophides, Keith James, Kim Rutherford, Barbara Harris, David Harris, Carol Churcher, Michael A Quail, Doug Ormond, Jon Doggett, Holly E Trueman, Jacqui Mendoza, Shelby L Bidwell, Marie-Adele Rajandream, Daniel J Carucci, John R Yates, Fotis C Kafatos, Chris J Janse, Bart Barrell, C Michael R Turner, Andrew P Waters, and Robert E Sinden. A comprehensive survey of the Plasmodium life cycle by genomic, transcriptomic, and proteomic analyses. Science, 307(5706):82-86, Jan 2005. [ bib | DOI | http | .pdf ]
Plasmodium berghei and Plasmodium chabaudi are widely used model malaria species. Comparison of their genomes, integrated with proteomic and microarray data, with the genomes of Plasmodium falciparum and Plasmodium yoelii revealed a conserved core of 4500 Plasmodium genes in the central regions of the 14 chromosomes and highlighted genes evolving rapidly because of stage-specific selective pressures. Four strategies for gene expression are apparent during the parasites' life cycle: (i) housekeeping; (ii) host-related; (iii) strategy-specific related to invasion, asexual replication, and sexual development; and (iv) stage-specific. We observed posttranscriptional gene silencing through translational repression of messenger RNA during sexual development, and a 47-base 3' untranslated region motif is implicated in this process.

Keywords: plasmodium
[Hakenberg2005Systematic] Jörg Hakenberg, Steffen Bickel, Conrad Plake, Ulf Brefeld, Hagen Zahn, Lukas Faulstich, Ulf Leser, and Tobias Scheffer. Systematic feature evaluation for gene name recognition. BMC Bioinformatics, 6 Suppl 1:S9, 2005. [ bib | DOI | http | .pdf ]
In task 1A of the BioCreAtIvE evaluation, systems had to be devised that recognize words and phrases forming gene or protein names in natural language sentences. We approach this problem by building a word classification system based on a sliding window approach with a Support Vector Machine, combined with a pattern-based post-processing for the recognition of phrases. The performance of such a system crucially depends on the type of features chosen for consideration by the classification method, such as pre- or postfixes, character n-grams, patterns of capitalization, or classification of preceding or following words. We present a systematic approach to evaluate the performance of different feature sets based on recursive feature elimination, RFE. Based on a systematic reduction of the number of features used by the system, we can quantify the impact of different feature sets on the results of the word classification problem. This helps us to identify descriptive features, to learn about the structure of the problem, and to design systems that are faster and easier to understand. We observe that the SVM is robust to redundant features. RFE improves the performance by 0.7%, compared to using the complete set of attributes. Moreover, a performance that is only 2.3% below this maximum can be obtained using fewer than 5% of the features.

Keywords: biosvm
[Haigh2005Small] J.A. Haigh, B.T. Pickup, J.A. Grant, and A. Nicholls. Small molecule shape-fingerprints. J. Chem. Inf. Model., 45(3):673-684, 2005. [ bib | DOI | http ]
The optimal overlap between two molecular structures is a useful measure of shape similarity. However, it usually requires significant computation. This work describes the design of shape-fingerprints: binary bit strings that encode molecular shape. Standard measures of similarity between two shape-fingerprints are shown to be an excellent surrogate for similarity based on volume overlap but several orders of magnitude faster to compute. Consequently, shape-fingerprints can be used for clustering of large data sets, evaluating the diversity of compound libraries, as descriptors in SAR and as a prescreen for exact shape comparison against large virtual databases. Our results show that a small set of shapes can be used to build these fingerprints and that this set can be applied universally.

Keywords: 15921457
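The abstract of [Haigh2005Small] above refers to "standard measures of similarity" between binary shape-fingerprints without naming one. As a minimal sketch, assuming the Tanimoto coefficient is one such measure and that fingerprints are given as equal-length strings of 0s and 1s (both are assumptions for illustration, not details taken from the paper), the comparison could look like this:

```python
def tanimoto(fp_a: str, fp_b: str) -> float:
    """Tanimoto coefficient between two equal-length bit strings, e.g. '1011'.

    Hypothetical helper for illustration; the paper's actual similarity
    measure and fingerprint encoding are not given in the abstract.
    """
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    a = int(fp_a, 2)  # parse the bit string as a binary integer
    b = int(fp_b, 2)
    both = bin(a & b).count("1")    # bits set in both fingerprints
    either = bin(a | b).count("1")  # bits set in at least one fingerprint
    # Two all-zero fingerprints are treated as identical by convention here.
    return both / either if either else 1.0


if __name__ == "__main__":
    print(tanimoto("1100", "1010"))
```

Because the comparison reduces to two bitwise operations and two bit counts, it is cheap relative to an explicit volume-overlap optimization, which is the speed advantage the abstract describes.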
[Haferlach2005global] Torsten Haferlach, Alexander Kohlmann, Susanne Schnittger, Martin Dugas, Wolfgang Hiddemann, Wolfgang Kern, and Claudia Schoch. A global approach to the diagnosis of leukemia using gene expression profiling. Blood, 106(4):1189-1198, Aug 2005. [ bib | DOI | http | .pdf ]
Accurate diagnosis and classification of leukemias are the bases for the appropriate management of patients. The diagnostic accuracy and efficiency of present methods may be improved by the use of microarrays for gene expression profiling. We analyzed gene expression profiles in bone marrow and peripheral blood samples from 937 patients with all clinically relevant leukemia subtypes (n=892) and non-leukemic controls (n=45) by U133A and B GeneChips (Affymetrix). For each subgroup differentially expressed genes were calculated. Class prediction was performed using support vector machines. Prediction accuracies were estimated by 10-fold cross validation and assessed for robustness in a 100-fold resampling approach using randomly chosen test-sets consisting of 1/3 of the samples. Applying the top 100 genes of each subgroup an overall prediction accuracy of 95.1% was achieved which was confirmed by resampling (median, 93.8%; 95% confidence interval, 91.4%-95.8%). In particular, AML with t(15;17), t(8;21), or inv(16), CLL, and Pro-B-ALL with t(11q23) were classified with 100% sensitivity and 100% specificity. Accordingly, cluster analysis completely separated all of the 13 subgroups analyzed. Gene expression profiling can predict all clinically relevant subentities of leukemia with high accuracy.

Keywords: biosvm microarray
[Haferlach2005AML] Torsten Haferlach, Alexander Kohlmann, Susanne Schnittger, Martin Dugas, Wolfgang Hiddemann, Wolfgang Kern, and Claudia Schoch. AML M3 and AML M3 variant each have a distinct gene expression signature but also share patterns different from other genetically defined AML subtypes. Genes Chromosomes Cancer, 43(2):113-27, Jun 2005. [ bib | DOI | http | .pdf ]
Acute promyelocytic leukemia (APL) with t(15;17) appears in two phenotypes: AML M3, with abnormal promyelocytes showing heavy granulation and bundles of Auer rods, and AML M3 variant (M3v), with non- or hypogranular cytoplasm and a bilobed nucleus. We investigated the global gene expression profiles of 35 APL patients (19 AML M3, 16 AML M3v) by using high-density DNA-oligonucleotide microarrays. First, an unsupervised approach clearly separated APL samples from other AMLs characterized genetically as t(8;21) (n = 35), inv(16) (n = 35), or t(11q23)/MLL (n = 35) or as having a normal karyotype (n = 50). Second, we found genes with functional relevance for blood coagulation that were differentially expressed between APL and other AMLs. Furthermore, a supervised pairwise comparison between M3 and M3v revealed differential expression of genes that encode for biological functions and pathways such as granulation and maturation of hematologic cells, explaining morphologic and clinical differences. Discrimination between M3 and M3v based on gene signatures showed a median classification accuracy of 90% by use of 10-fold CV and support vector machines. Additional molecular mutations such as FLT3-LM, which were significantly more frequent in M3v than in M3 (P < 0.0001), may partly contribute to the different phenotypes. However, linear regression analysis demonstrated that genes differentially expressed between M3 and M3v did not correlate with FLT3-LM.

Keywords: biosvm microarray
[Haasdonk2005Feature] Bernard Haasdonk. Feature space interpretation of SVMs with indefinite kernels. IEEE Trans Pattern Anal Mach Intell, 27(4):482-92, Apr 2005. [ bib | DOI | http | .pdf ]
Kernel methods are becoming increasingly popular for various kinds of machine learning tasks, the most famous being the support vector machine (SVM) for classification. The SVM is well understood when using conditionally positive definite (cpd) kernel functions. However, in practice, non-cpd kernels arise and demand application in SVMs. The procedure of "plugging" these indefinite kernels in SVMs often yields good empirical classification results. However, they are hard to interpret due to missing geometrical and theoretical understanding. In this paper, we provide a step toward the comprehension of SVM classifiers in these situations. We give a geometric interpretation of SVMs with indefinite kernel functions. We show that such SVMs are optimal hyperplane classifiers not by margin maximization, but by minimization of distances between convex hulls in pseudo-Euclidean spaces. By this, we obtain a sound framework and motivation for indefinite SVMs. This interpretation is the basis for further theoretical analysis, e.g., investigating uniqueness, and for the derivation of practical guidelines like characterizing the suitability of indefinite SVMs.

Keywords: Algorithms, Animals, Antibiotics, Antineoplastic, Artificial Intelligence, Automated, Automatic Data Processing, Butadienes, Chloroplasts, Cluster Analysis, Comparative Study, Computer Simulation, Computer-Assisted, Computing Methodologies, Database Management Systems, Databases, Diagnosis, Disinfectants, Dose-Response Relationship, Drug, Drug Toxicity, Electrodes, Electroencephalography, Ethylamines, Expert Systems, Factual, Feedback, Fungicides, Gene Expression Profiling, Genes, Genetic Markers, Humans, Image Enhancement, Image Interpretation, Implanted, Industrial, Information Storage and Retrieval, Kidney, Kidney Tubules, MEDLINE, Male, Mercuric Chloride, Microarray Analysis, Molecular Biology, Motor Cortex, Movement, Natural Language Processing, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Numerical Analysis, Pattern Recognition, Plant Proteins, Predictive Value of Tests, Proteins, Proteome, Proximal, Puromycin Aminonucleoside, Rats, Reproducibility of Results, Research Support, Sensitivity and Specificity, Signal Processing, Sprague-Dawley, Subcellular Fractions, Terminology, Therapy, Time Factors, Toxicogenetics, U.S. Gov't, User-Computer Interface, 15794155
[Gusterson2005Basal] B.A. Gusterson, D.T. Ross, V.J. Heath, and T. Stein. Basal cytokeratins and their relationship to the cellular origin and functional classification of breast cancer. Breast Cancer Res., 7(4):143-148, 2005. [ bib | DOI | http | .pdf ]
Recent publications have classified breast cancers on the basis of expression of cytokeratin-5 and -17 at the RNA and protein levels, and demonstrated the importance of these markers in defining sporadic tumours with bad prognosis and an association with BRCA1-related breast cancers. These important observations using different technology platforms produce a new functional classification of breast carcinoma. However, it is important in developing hypotheses about the pathogenesis of this tumour type to review the nomenclature that is being used to emphasize potential confusion between terminology that defines clinical subgroups and markers of cell lineage. This article reviews the lineages in the normal breast in relation to what have become known as the 'basal-like' carcinomas.

Keywords: breastcancer
[Guo2005Towards] Z. Guo, T. Zhang, X. Li, Q. Wang, J. Xu, H. Yu, J. Zhu, H. Wang, C. Wang, E.J. Topol, Q. Wang, and S. Rao. Towards precise classification of cancers based on robust gene functional expression profiles. BMC Bioinformatics, 6:58, 2005. [ bib | DOI | http | .pdf ]
Development of robust and efficient methods for analyzing and interpreting high dimension gene expression profiles continues to be a focus in computational biology. The accumulated experimental evidence supports the assumption that genes express and perform their functions in modular fashions in cells. Therefore, there is an open space for development of the timely and relevant computational algorithms that use robust functional expression profiles towards precise classification of complex human diseases at the modular level. Inspired by the insight that genes act as a module to carry out a highly integrated cellular function, we thus define a low dimension functional expression profile for data reduction. After annotating each individual gene to functional categories defined in a proper gene function classification system such as Gene Ontology applied in this study, we identify those functional categories enriched with differentially expressed genes. For each functional category or functional module, we compute a summary measure (s) for the raw expression values of the annotated genes to capture the overall activity level of the module. In this way, we can treat the gene expressions within a functional module as an integrative data point to replace the multiple values of individual genes. We compare the classification performance of decision trees based on functional expression profiles with the conventional gene expression profiles using four publicly available datasets, which indicates that precise classification of tumour types and improved interpretation can be achieved with the reduced functional expression profiles. This modular approach is demonstrated to be a powerful alternative approach to analyzing high dimension microarray data and is robust to high measurement noise and intrinsic biological variance inherent in microarray data. Furthermore, efficient integration with current biological knowledge has facilitated the interpretation of the underlying molecular mechanisms for complex human diseases at the modular level.
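
The data-reduction step in the [Guo2005Towards] abstract, replacing the expression values of the genes annotated to a functional category with one summary value per sample, can be sketched as follows. The abstract does not say which summary measure is used, so the arithmetic mean here is an assumption, and the function name, gene names, and category labels are hypothetical.

```python
from statistics import mean


def module_profile(expression, annotations):
    """Collapse gene-level expression into one value per functional module.

    expression:  dict mapping gene -> list of expression values (one per sample)
    annotations: dict mapping category -> list of annotated genes
    Returns a dict mapping category -> list of per-sample means.
    The mean is an assumed summary measure, chosen only for illustration.
    """
    profile = {}
    for category, genes in annotations.items():
        # Keep only annotated genes that were actually measured.
        rows = [expression[g] for g in genes if g in expression]
        if not rows:
            continue  # no measured genes in this category
        # zip(*rows) iterates over samples; average across genes per sample.
        profile[category] = [mean(column) for column in zip(*rows)]
    return profile


if __name__ == "__main__":
    expr = {"g1": [1.0, 3.0], "g2": [3.0, 5.0], "g3": [10.0, 0.0]}
    annot = {"module_A": ["g1", "g2"], "module_B": ["g3", "g_unmeasured"]}
    print(module_profile(expr, annot))
```

The resulting per-module vectors would then serve as the input features for a classifier such as the decision trees mentioned in the abstract.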

[Guo2005novel] Ting Guo, Yanxin Shi, and Zhirong Sun. A novel statistical ligand-binding site predictor: application to ATP-binding sites. Protein Eng Des Sel, 18(2):65-70, Feb 2005. [ bib | DOI | http | .pdf ]
Structural genomics initiatives are leading to rapid growth in newly determined protein 3D structures, the functional characterization of which may still be inadequate. As an attempt to provide insights into the possible roles of the emerging proteins whose structures are available and/or to complement biochemical research, a variety of computational methods have been developed for the screening and prediction of ligand-binding sites in raw structural data, including statistical pattern classification techniques. In this paper, we report a novel statistical descriptor (the Oriented Shell Model) for protein ligand-binding sites, which utilizes the distance and angular position distribution of various structural and physicochemical features present in immediate proximity to the center of a binding site. Using the support vector machine (SVM) as the classifier, our model identified 69% of the ATP-binding sites in whole-protein scanning tests and in eukaryotic proteins the accuracy is particularly high. We propose that this feature extraction and machine learning procedure can screen out ligand-binding-capable protein candidates and can yield valuable biochemical information for individual proteins.

Keywords: biosvm
[Guo2005Feature] Hong Guo, Lindsay B Jack, and Asoke K Nandi. Feature generation using genetic programming with application to fault classification. IEEE Trans Syst Man Cybern B Cybern, 35(1):89-99, Feb 2005. [ bib ]
One of the major challenges in pattern recognition problems is the feature extraction process which derives new features from existing features, or directly from raw data in order to reduce the cost of computation during the classification process, while improving classifier efficiency. Most current feature extraction techniques transform the original pattern vector into a new vector with increased discrimination capability but lower dimensionality. This is conducted within a predefined feature space, and thus, has limited searching power. Genetic programming (GP) can generate new features from the original dataset without prior knowledge of the probabilistic distribution. In this paper, a GP-based approach is developed for feature extraction from raw vibration data recorded from a rotating machine with six different conditions. The created features are then used as the inputs to a neural classifier for the identification of six bearing conditions. Experimental results demonstrate the ability of GP to discover automatically the different bearing conditions using features expressed in the form of nonlinear functions. Furthermore, four sets of results (using GP extracted features with artificial neural networks (ANN) and support vector machines (SVM), as well as traditional features with ANN and SVM) have been obtained. This GP-based approach is used for bearing fault classification for the first time and exhibits superior searching power over other techniques. Additionally, it significantly reduces the time for computation compared with genetic algorithm (GA), and therefore makes a more practical realization of the solution.

[Guo2005Learning] Guodong Guo and Charles R Dyer. Learning from examples in the small sample case: face expression recognition. IEEE Trans Syst Man Cybern B Cybern, 35(3):477-88, Jun 2005. [ bib ]
Example-based learning for computer vision can be difficult when a large number of examples to represent each pattern or object class is not available. In such situations, learning from a small number of samples is of practical value. To study this issue, the task of face expression recognition with a small number of training images of each expression is considered. A new technique based on linear programming for both feature selection and classifier training is introduced. A pairwise framework for feature selection, instead of using all classes simultaneously, is presented. Experimental results compare the method with three others: a simplified Bayes classifier, support vector machine, and AdaBoost. Finally, each algorithm is analyzed and a new categorization of these algorithms is given, especially for learning from examples in the small sample case.

[Golland2005Detection] Polina Golland, W Eric L Grimson, Martha E Shenton, and Ron Kikinis. Detection and analysis of statistical differences in anatomical shape. Med Image Anal, 9(1):69-86, Feb 2005. [ bib | DOI | http ]
We present a computational framework for image-based analysis and interpretation of statistical differences in anatomical shape between populations. Applications of such analysis include understanding developmental and anatomical aspects of disorders when comparing patients versus normal controls, studying morphological changes caused by aging, or even differences in normal anatomy, for example, differences between genders. Once a quantitative description of organ shape is extracted from input images, the problem of identifying differences between the two groups can be reduced to one of the classical questions in machine learning of constructing a classifier function for assigning new examples to one of the two groups while making as few misclassifications as possible. The resulting classifier must be interpreted in terms of shape differences between the two groups back in the image domain. We demonstrate a novel approach to such interpretation that allows us to argue about the identified shape differences in anatomically meaningful terms of organ deformation. Given a classifier function in the feature space, we derive a deformation that corresponds to the differences between the two classes while ignoring shape variability within each class. Based on this approach, we present a system for statistical shape analysis using distance transforms for shape representation and the support vector machines learning algorithm for the optimal classifier estimation and demonstrate it on artificially generated data sets, as well as real medical studies.

Keywords: Algorithms, Amino Acid, Artificial Intelligence, Ascomycota, Automated, Base Sequence, Chromosome Mapping, Codon, Colonic Neoplasms, Comparative Study, Computer-Assisted, Crystallography, DNA, DNA Primers, Databases, Diagnostic Imaging, Gene Expression Profiling, Hordeum, Host-Parasite Relations, Humans, Image Interpretation, Informatics, Kinetics, Magnetic Resonance Spectroscopy, Models, Nanotechnology, Non-P.H.S., Non-U.S. Gov't, Oligonucleotide Array Sequence Analysis, P.H.S., Pattern Recognition, Plant, Plants, Predictive Value of Tests, Protein, Research Support, Selection (Genetics), Sequence Alignment, Sequence Analysis, Sequence Homology, Skin, Software, Statistical, Theoretical, Thermodynamics, U.S. Gov't, Viral Proteins, X-Ray, 15581813
[Glotsos2005Automated] Dimitris Glotsos, Jussi Tohka, Panagiota Ravazoula, Dionisis Cavouras, and George Nikiforidis. Automated diagnosis of brain tumours astrocytomas using probabilistic neural network clustering and support vector machines. Int J Neural Syst, 15(1-2):1-11, 2005. [ bib ]
A computer-aided diagnosis system was developed for assisting brain astrocytomas malignancy grading. Microscopy images from 140 astrocytic biopsies were digitized and cell nuclei were automatically segmented using a Probabilistic Neural Network pixel-based clustering algorithm. A decision tree classification scheme was constructed to discriminate low, intermediate and high-grade tumours by analyzing nuclear features extracted from segmented nuclei with a Support Vector Machine classifier. Nuclei were segmented with an average accuracy of 86.5%. Low, intermediate, and high-grade tumours were identified with 95%, 88.3%, and 91% accuracies respectively. The proposed algorithm could be used as a second opinion tool for the histopathologists.

[Giallourakis2005Disease] C. Giallourakis, C. Henson, M. Reich, X. Xie, and V.K. Mootha. Disease gene discovery through integrative genomics. Annu. Rev. Genomics Hum. Genet., 6:381-406, 2005. [ bib | DOI | http | .pdf ]
The availability of complete genome sequences and the wealth of large-scale biological data sets now provide an unprecedented opportunity to elucidate the genetic basis of rare and common human diseases. Here we review some of the emerging genomics technologies and data resources that can be used to infer gene function to prioritize candidate genes. We then describe some computational strategies for integrating these large-scale data sets to provide more faithful descriptions of gene function, and how such approaches have recently been applied to discover genes underlying Mendelian disorders. Finally, we discuss future prospects and challenges for using integrative genomics to systematically discover not only single genes but also entire gene networks that underlie and modify human disease.

[Ghosh2005Classification] Debashis Ghosh and Arul M Chinnaiyan. Classification and Selection of Biomarkers in Genomic Data Using LASSO. J Biomed Biotechnol, 2005(2):147-54, 2005. [ bib | DOI | http | .pdf ]
High-throughput gene expression technologies such as microarrays have been utilized in a variety of scientific applications. Most of the work has been done on assessing univariate associations between gene expression profiles and clinical outcome (variable selection) or on developing classification procedures with gene expression data (supervised learning). We consider a hybrid variable selection/classification approach that is based on linear combinations of the gene expression profiles that maximize an accuracy measure summarized using the receiver operating characteristic curve. Under a specific probability model, this leads to the consideration of linear discriminant functions. We incorporate an automated variable selection approach using LASSO. An equivalence between LASSO estimation and support vector machines allows for model fitting using standard software. We apply the proposed method to simulated data as well as data from a recently published prostate cancer study.

Keywords: 16046820
[Gaudet2005MCPorteomics] S. Gaudet, K.A. Janes, J.G. Albeck, E.A. Pace, D.A. Lauffenburger, and P.K. Sorger. A compendium of signals and responses triggered by prodeath and prosurvival cytokines. Mol Cell Proteomics, 4(10):1569-1590, 2005. [ bib ]
Cell-signaling networks consist of proteins with a variety of functions (receptors, adaptor proteins, GTPases, kinases, proteases, and transcription factors) working together to control cell fate. Although much is known about the identities and biochemical activities of these signaling proteins, the ways in which they are combined into networks to process and transduce signals are poorly understood. Network-level understanding of signaling requires data on a wide variety of biochemical processes such as posttranslational modification, assembly of macromolecular complexes, enzymatic activity, and localization. No single method can gather such heterogeneous data in high throughput, and most studies of signal transduction therefore rely on series of small, discrete experiments. Inspired by the power of systematic datasets in genomics, we set out to build a systematic signaling dataset that would enable the construction of predictive models of cell-signaling networks. Here we describe the compilation and fusion of approximately 10,000 signal and response measurements acquired from HT-29 cells treated with tumor necrosis factor-alpha, a proapoptotic cytokine, in combination with epidermal growth factor or insulin, two prosurvival growth factors. Nineteen protein signals were measured over a 24-h period using kinase activity assays, quantitative immunoblotting, and antibody microarrays. Four different measurements of apoptotic response were also collected by flow cytometry for each time course. Partial least squares regression models that relate signaling data to apoptotic response data reveal which aspects of compendium construction and analysis were important for the reproducibility, internal consistency, and accuracy of the fused set of signaling measurements. We conclude that it is possible to build self-consistent compendia of cell-signaling data that can be mined computationally to yield important insights into the control of mammalian cell responses.

Keywords: csbcbook
[Gaudan2005Resolving] S. Gaudan, H. Kirsch, and D. Rebholz-Schuhmann. Resolving abbreviations to their senses in Medline. Bioinformatics, Jul 2005. [ bib | DOI | http | .pdf ]
MOTIVATION: Biological literature contains many abbreviations with one particular sense in each document. However, most abbreviations do not have a unique sense across the literature. Furthermore, many documents do not contain the long-forms of the abbreviations. Resolving an abbreviation in a document consists of retrieving its sense in use. Abbreviation resolution improves accuracy of document retrieval engines and of information extraction systems. RESULTS: We combine an automatic analysis of Medline abstracts and linguistic methods to build a dictionary of abbreviation/sense pairs. The dictionary is used for the resolution of abbreviations occurring with their long-forms. Ambiguous global abbreviations are resolved using Support Vector Machines that have been trained on the context of each instance of the abbreviation/sense pairs, previously extracted for the dictionary setup. The system disambiguates abbreviations with a precision of 98.9% for a recall of 98.2% (98.5% accuracy). This performance is superior in comparison to previously reported research work. AVAILABILITY: The abbreviation resolution module is available at http://www.ebi.ac.uk/Rebholz/software.html.

Keywords: biosvm nlp
[Garg2005SVM-based] A. Garg, M. Bhasin, and G.P. Raghava. SVM-based method for subcellular localization of human proteins using amino acid compositions, their order and similarity search. J. Biol. Chem., 280(15):14427-32, Apr 2005. [ bib | DOI | http | .pdf ]
Here we report a systematic approach for predicting subcellular localization (cytoplasm, mitochondrial, nuclear and plasma membrane) of human proteins. Firstly, SVM based modules for predicting subcellular localization using traditional amino acid and dipeptide (i+1) composition achieved overall accuracy of 76.6% [...] when carried out using similarity-based search against non-redundant database of experimentally annotated proteins yielded 73.3% [...] To gain further insight, hybrid module (hybrid1) was developed based on amino acid composition, dipeptide composition, and similarity information and attained better accuracy of 84.9% [...] SVM module based on different higher order dipeptide i.e. i+2, i+3, and i+4 were also constructed for the prediction of subcellular localization of human proteins and overall accuracy of 79.7% [...] and 77.1% [...] module hybrid2 was developed using traditional dipeptide (i+1) and higher order dipeptide (i+2, i+3, and i+4) compositions, which gave an overall accuracy of 81.3% [...] based on amino acid composition, traditional and higher order dipeptide compositions and PSI-BLAST output and achieved an overall accuracy of 84.4% [...] or http://bioinformatics.uams.edu/raghava/hslpred/) has been designed to predict subcellular localization of human proteins using the above approaches.

Keywords: biosvm
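The [Garg2005SVM-based] abstract above distinguishes traditional dipeptide composition (residue pairs at positions i and i+1) from higher-order dipeptide composition (pairs at i and i+2, i+3, or i+4). A minimal sketch of how such a 400-dimensional feature vector might be computed follows; the function name, the normalization to fractions, and the choice to skip non-standard residues are assumptions for illustration and are not taken from the paper.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def dipeptide_composition(sequence, gap=1):
    """Fraction of each ordered residue pair (sequence[i], sequence[i + gap]).

    gap=1 gives the traditional dipeptide composition; gap=2, 3, 4 give the
    higher-order variants. Pairs containing non-standard letters are skipped
    (an assumption made here, not a detail from the abstract).
    """
    counts = dict.fromkeys(PAIRS, 0)
    total = 0
    for i in range(len(sequence) - gap):
        pair = sequence[i] + sequence[i + gap]
        if pair in counts:
            counts[pair] += 1
            total += 1
    # Return a fixed-order vector of length 400; all zeros if nothing counted.
    return [counts[p] / total if total else 0.0 for p in PAIRS]


if __name__ == "__main__":
    vec = dipeptide_composition("ACAC", gap=1)
    print(len(vec), vec[PAIRS.index("AC")], vec[PAIRS.index("CA")])
```

Vectors computed for several gap values could be concatenated to form the kind of combined input that the hybrid modules in the abstract appear to use, with the resulting features passed to an SVM.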
[Gardy2005PSORTb] J.L. Gardy, M.R. Laird, F. Chen, S. Rey, C.J. Walsh, M. Ester, and F.S.L. Brinkman. PSORTb v.2.0: expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis. Bioinformatics, 21(5):617-623, Mar 2005. [ bib | DOI | http | .pdf ]
Motivation: PSORTb v.1.1 is the most precise bacterial localization prediction tool available. However the program's predictive coverage and recall are low and the method is only applicable to Gram-negative bacteria. The goals of the present work were: increase PSORTb's coverage while maintaining the existing precision level, expand it to include Gram-positive bacteria, and then carry out a comparative analysis of localization. Results: An expanded database of proteins of known localization and new modules using frequent subsequence-based support vector machines were introduced into PSORTb v.2.0. The program attains a precision of 96% [...] bacteria and predictive coverage comparable to other tools for whole proteome analysis. We show that the proportion of proteins at each localization is remarkably consistent across species, even in species with varying proteome size. Availability: Web-based version: http://www.psort.org/psortb. Standalone version: Available through the website under GNU General Public License. Supplementary Information: http://www.psort.org/psortb/supplementaryinfo.html.

Keywords: biosvm
[Gao2005Improving] Yuan Gao and George Church. Improving molecular cancer class discovery through sparse non-negative matrix factorization. Bioinformatics, 21(21):3970-3975, Nov 2005. [ bib | DOI | http | .pdf ]
MOTIVATION: Identifying different cancer classes or subclasses with similar morphological appearances presents a challenging problem and has important implication in cancer diagnosis and treatment. Clustering based on gene-expression data has been shown to be a powerful method in cancer class discovery. Non-negative matrix factorization is one such method and was shown to be advantageous over other clustering techniques, such as hierarchical clustering or self-organizing maps. In this paper, we investigate the benefit of explicitly enforcing sparseness in the factorization process. RESULTS: We report an improved unsupervised method for cancer classification by the use of gene-expression profile via sparse non-negative matrix factorization. We demonstrate the improvement by direct comparison with classic non-negative matrix factorization on the three well-studied datasets. In addition, we illustrate how to identify a small subset of co-expressed genes that may be directly involved in cancer.

[Gangal2005Human] Rajeev Gangal and Pankaj Sharma. Human pol II promoter prediction: time series descriptors and machine learning. Nucleic Acids Res, 33(4):1332-6, 2005. [ bib | DOI | http | .pdf ]
Although several in silico promoter prediction methods have been developed to date, they are still limited in predictive performance. The limitations are due to the challenge of selecting appropriate features of promoters that distinguish them from non-promoters and the generalization or predictive ability of the machine-learning algorithms. In this paper we attempt to define a novel approach by using unique descriptors and machine-learning methods for the recognition of eukaryotic polymerase II promoters. In this study, non-linear time series descriptors along with non-linear machine-learning algorithms, such as support vector machine (SVM), are used to discriminate between promoter and non-promoter regions. The basic idea here is to use descriptors that do not depend on the primary DNA sequence and provide a clear distinction between promoter and non-promoter regions. The classification model built on a set of 1000 promoter and 1500 non-promoter sequences, showed a 10-fold cross-validation accuracy of 87% and an independent test set had an accuracy >85% in both promoter and non-promoter identification. This approach correctly identified all 20 experimentally verified promoters of human chromosome 22. The high sensitivity and selectivity indicates that n-mer frequencies along with non-linear time series descriptors, such as Lyapunov component stability and Tsallis entropy, and supervised machine-learning methods, such as SVMs, can be useful in the identification of pol II promoters.

Keywords: biosvm
[Fuster2005sweet] MN. Fuster and JD. Esko. The sweet and sour of cancer: glycans as novel therapeutic targets. Nat. Rev. Cancer, 5(7):526-42, Jul 2005. [ bib | DOI | http | .pdf ]
A growing body of evidence supports crucial roles for glycans at various pathophysiological steps of tumour progression. Glycans regulate tumour proliferation, invasion, haematogenous metastasis and angiogenesis, and increased understanding of these roles sets the stage for developing pharmaceutical agents that target these molecules. Such novel agents might be used alone or in combination with operative and/or chemoradiation strategies for treating cancer.

Keywords: glycans
[Fundel2005simple] Katrin Fundel, Daniel Güttler, Ralf Zimmer, and Joannis Apostolakis. A simple approach for protein name identification: prospects and limits. BMC Bioinformatics, 6 Suppl 1:S15, 2005. [ bib | DOI | http | .pdf ]
BACKGROUND: Significant parts of biological knowledge are available only as unstructured text in articles of biomedical journals. By automatically identifying gene and gene product (protein) names and mapping these to unique database identifiers, it becomes possible to extract and integrate information from articles and various data sources. We present a simple and efficient approach that identifies gene and protein names in texts and returns database identifiers for matches. It has been evaluated in the recent BioCreAtIvE entity extraction and mention normalization task by an independent jury. METHODS: Our approach is based on the use of synonym lists that map the unique database identifiers for each gene/protein to the different synonym names. For yeast and mouse, synonym lists were used as provided by the organizers who generated them from public model organism databases. The synonym list for fly was generated directly from the corresponding organism database. The lists were then extensively curated in a largely automated procedure and matched against MEDLINE abstracts by exact text matching. Rule-based and support vector machine-based post filters were designed and applied to improve precision. RESULTS: Our procedure showed high recall and precision with F-measures of 0.897 for yeast and 0.764/0.773 for mouse in the BioCreAtIvE assessment (Task 1B) and 0.768 for fly in a post-evaluation. CONCLUSION: The results were close to the best over all submissions. Depending on the synonym properties it can be crucial to consider context and to filter out erroneous matches. This is especially important for fly, which has a very challenging nomenclature for the protein name identification task. Here, the support vector machine-based post filter proved to be very effective.

Keywords: 15960827
[Fu2005Image] JC. Fu, SK. Lee, ST C Wong, JY. Yeh, AH. Wang, and HK. Wu. Image segmentation feature selection and pattern classification for mammographic microcalcifications. Comput Med Imaging Graph, Jul 2005. [ bib | DOI | http ]
Since microcalcifications in X-ray mammograms are the primary indicator of breast cancer, detection of microcalcifications is central to the development of an effective diagnostic system. This paper proposes a two-stage detection procedure. In the first stage, a data driven, closed form mathematical model is used to calculate the location and shape of suspected microcalcifications. When tested on the Nijmegen University Hospital (Netherlands) database, data analysis shows that the proposed model can effectively detect the occurrence of microcalcifications. The proposed mathematical model not only eliminates the need for system training, but also provides information on the borders of suspected microcalcifications for further feature extraction. In the second stage, 61 features are extracted for each suspected microcalcification, representing texture, the spatial domain and the spectral domain. From these features, a sequential forward search (SFS) algorithm selects the classification input vector, which consists of features sensitive only to microcalcifications. Two types of classifiers-a general regression neural network (GRNN) and a support vector machine (SVM)-are applied, and their classification performance is compared using the Az value of the Receiver Operating Characteristic curve. For all 61 features used as input vectors, the test data set yielded Az values of 97.01% for the SVM and 96.00% for the GRNN. With input features selected by SFS, the corresponding Az values were 98.00% for the SVM and 97.80% for the GRNN. The SVM outperformed the GRNN, whether or not the input vectors first underwent SFS feature selection. In both cases, feature selection dramatically reduced the dimension of the input vectors (82% for the SVM and 59% for the GRNN). Moreover, SFS feature selection improved the classification performance, increasing the Az value from 97.01 to 98.00% for the SVM and from 96.00 to 97.80% for the GRNN.

Keywords: Archaeal, Artificial Intelligence, Bacterial, Cytomegalovirus, Gene Transfer, Genome, Genomics, Horizontal, Non-U.S. Gov't, Research Support, Viral, 16002263
[Frohlich2005Optimal] H. Fröhlich, JK. Wegner, F. Sieker, and A. Zell. Optimal assignment kernels for attributed molecular graphs. In Proceedings of the 22nd international conference on Machine learning, pages 225-232, New York, NY, USA, 2005. ACM Press. [ bib | DOI ]
Keywords: chemoinformatics
[Frimurer2005physicogenetic] TM. Frimurer, T. Ulven, CE. Elling, L.-O. Gerlach, E. Kostenis, and T. Högberg. A physicogenetic method to assign ligand-binding relationships between 7TM receptors. Bioorg. Med. Chem. Lett., 15(16):3707-3712, Aug 2005. [ bib | DOI | http | .pdf ]
A computational protocol has been devised to relate 7TM receptor proteins (GPCRs) with respect to physicochemical features of the core ligand-binding site as defined from the crystal structure of bovine rhodopsin. The identification of such receptors that already are associated with ligand information (e.g., small molecule ligands with mutagenesis or SAR data) is used to support structure-guided drug design of novel ligands. A case targeting the newly identified prostaglandin D2 receptor CRTH2 serves as a primary example to illustrate the procedure.

Keywords: chemogenomics
[Friedel2005Support] CC. Friedel, KHV. Jahn, S. Sommer, S. Rudd, HW. Mewes, and IV. Tetko. Support vector machines for separation of mixed plant-pathogen EST collections based on codon usage. Bioinformatics, 21:1383-1388, 2005. [ bib | DOI | http | .pdf ]
Motivation: Discovery of host and pathogen genes expressed at the plant-pathogen interface often requires the construction of mixed libraries that contain sequences from both genomes. Sequence identification requires high-throughput and reliable classification of genome origin. When using single-pass cDNA sequences difficulties arise from the short sequence length, the lack of sufficient taxonomically relevant sequence data in public databases and ambiguous sequence homology between plant and pathogen genes. Results: A novel method is described, which is independent of the availability of homologous genes and relies on subtle differences in codon usage between plant and fungal genes. We used support vector machines (SVMs) to identify the probable origin of sequences. SVMs were compared to several other machine learning techniques and to a probabilistic algorithm (PF-IND, Maor et al., 2003) for EST classification also based on codon bias differences. Our software (ECLAT) has achieved a classification accuracy of 93.1% [...] vulgare and B. graminis, which is a significant improvement compared to PF-IND (prediction accuracy of 81.2% [...]). EST sequences with at least 50 nt of coding sequence can be classified by ECLAT with high confidence. ECLAT allows training of classifiers for any host-pathogen combination for which there are sufficient classified training sequences. Availability: ECLAT is freely available on the internet (http://mips.gsf.de/proj/est) or on request as a standalone version.

Keywords: biosvm
[Freyhult2005Unbiased] E. Freyhult, P. Prusis, M. Lapinsh, JES. Wikberg, V. Moulton, and MG. Gustafsson. Unbiased descriptor and parameter selection confirms the potential of proteochemometric modelling. BMC Bioinformatics, 6:50, 2005. [ bib | DOI | http ]
BACKGROUND: Proteochemometrics is a new methodology that allows prediction of protein function directly from real interaction measurement data without the need of 3D structure information. Several reported proteochemometric models of ligand-receptor interactions have already yielded significant insights into various forms of bio-molecular interactions. The proteochemometric models are multivariate regression models that predict binding affinity for a particular combination of features of the ligand and protein. Although proteochemometric models have already offered interesting results in various studies, no detailed statistical evaluation of their average predictive power has been performed. In particular, variable subset selection performed to date has always relied on using all available examples, a situation also encountered in microarray gene expression data analysis. RESULTS: A methodology for an unbiased evaluation of the predictive power of proteochemometric models was implemented and results from applying it to two of the largest proteochemometric data sets yet reported are presented. A double cross-validation loop procedure is used to estimate the expected performance of a given design method. The unbiased performance estimates (P2) obtained for the data sets that we consider confirm that properly designed single proteochemometric models have useful predictive power, but that a standard design based on cross validation may yield models with quite limited performance. The results also show that different commercial software packages employed for the design of proteochemometric models may yield very different and therefore misleading performance estimates. In addition, the differences in the models obtained in the double CV loop indicate that detailed chemical interpretation of a single proteochemometric model is uncertain when data sets are small. 
CONCLUSION: The double CV loop employed offers unbiased performance estimates about a given proteochemometric modelling procedure, making it possible to identify cases where the proteochemometric design does not result in useful predictive models. Chemical interpretations of single proteochemometric models are uncertain and should instead be based on all the models selected in the double CV loop employed here.

Keywords: chemogenomics
[Fraunholz2005Systems] MJ. Fraunholz. Systems biology in malaria research. Trends Parasitol., 21(9):393-395, Sep 2005. [ bib | DOI | http | .pdf ]
A recent publication of genome and expression analyses of the murine parasites Plasmodium chabaudi chabaudi and Plasmodium berghei presents the state of the art in Plasmodium systems biology. By integrating genomics, transcriptomics and proteomics, the authors can classify and annotate genes by their expression profiles and can even detect evidence of posttranscriptional gene silencing in the murine malaria species.

Keywords: plasmodium
[Fliri2005Biospectra] Anton F Fliri, William T Loging, Peter F Thadeio, and Robert A Volkmann. Biospectra analysis: model proteome characterizations for linking molecular structure and biological response. J. Med. Chem., 48(22):6918-6925, Nov 2005. [ bib | DOI | http | .pdf ]
Establishing quantitative relationships between molecular structure and broad biological effects has been a long-standing goal in drug discovery. Evaluation of the capacity of molecules to modulate protein functions is a prerequisite for understanding the relationship between molecular structure and in vivo biological response. A particular challenge in these investigations is to derive quantitative measurements of a molecule's functional activity pattern across different proteins. Herein we describe an operationally simple probabilistic structure-activity relationship (SAR) approach, termed biospectra analysis, for identifying agonist and antagonist effect profiles of medicinal agents by using pattern similarity between biological activity spectra (biospectra) of molecules as the determinant. Accordingly, in vitro binding data (percent inhibition values of molecules determined at single high drug concentration in a battery of assays representing a cross section of the proteome) are useful for identifying functional effect profile similarity between medicinal agents. To illustrate this finding, the relationship between biospectra similarity of 24 molecules, identified by hierarchical clustering of a 1567 molecule dataset as being most closely aligned with the neurotransmitter dopamine, and their agonist or antagonist properties was probed. Distinguishing the results described in this study from those obtained with affinity-based methods, the observed association between biospectra and biological response profile similarity remains intact even upon removal of putative drug targets from the dataset (four dopaminergic [D1/D2/D3/D4] and two adrenergic [alpha1 and alpha2] receptors). These findings indicate that biospectra analysis provides an unbiased new tool for forecasting structure-response relationships and for translating broad biological effect information into chemical structure design.

Keywords: chemoinformatics
[Fliri2005Biological] AF. Fliri, WT. Loging, PF. Thadeio, and RA. Volkmann. Biological spectra analysis: Linking biological activity profiles to molecular structure. Proc. Natl. Acad. Sci. USA, 102(2):261-266, Jan 2005. [ bib | DOI | http | .pdf ]
Establishing quantitative relationships between molecular structure and broad biological effects has been a longstanding challenge in science. Currently, no method exists for forecasting broad biological activity profiles of medicinal agents even within narrow boundaries of structurally similar molecules. Starting from the premise that biological activity results from the capacity of small organic molecules to modulate the activity of the proteome, we set out to investigate whether descriptor sets could be developed for measuring and quantifying this molecular property. Using a 1,567-compound database, we show that percent inhibition values, determined at single high drug concentration in a battery of in vitro assays representing a cross section of the proteome, provide precise molecular property descriptors that identify the structure of molecules. When broad biological activity of molecules is represented in spectra form, organic molecules can be sorted by quantifying differences between biological spectra. Unlike traditional structure-activity relationship methods, sorting of molecules by using biospectra comparisons does not require knowledge of a molecule's putative drug targets. To illustrate this finding, we selected as starting point the biological activity spectra of clotrimazole and tioconazole because their putative target, lanosterol demethylase (CYP51), was not included in the bioassay array. Spectra similarity obtained through profile similarity measurements and hierarchical clustering provided an unbiased means for establishing quantitative relationships between chemical structures and biological activity spectra. This methodology, which we have termed biological spectra analysis, provides the capability not only of sorting molecules on the basis of biospectra similarity but also of predicting simultaneous interactions of new molecules with multiple proteins.

Keywords: chemoinformatics
[Fliri2005Analysis] AF. Fliri, WT. Loging, PF. Thadeio, and RA. Volkmann. Analysis of drug-induced effect patterns to link structure and side effects of medicines. Nat. Chem. Biol., 1(7):389-397, Dec 2005. [ bib | DOI | http | .pdf ]
The high failure rate of experimental medicines in clinical trials accentuates inefficiencies of current drug discovery processes caused by a lack of tools for translating the information exchange between protein and organ system networks. Recently, we reported that biological activity spectra (biospectra), derived from in vitro protein binding assays, provide a mechanism for assessing a molecule's capacity to modulate the function of protein-network components. Herein we describe the translation of adverse effect data derived from 1,045 prescription drug labels into effect spectra and show their utility for diagnosing drug-induced effects of medicines. In addition, notwithstanding the limitation imposed by the quality of drug label information, we show that biospectrum analysis, in concert with effect spectrum analysis, provides an alignment between preclinical and clinical drug-induced effects. The identification of this alignment provides a mechanism for forecasting clinical effect profiles of medicines.

Keywords: chemoinformatics
[Ferrer2005Offline] Miguel A Ferrer, Jesús B Alonso, and Carlos M Travieso. Offline geometric parameters for automatic signature verification using fixed-point arithmetic. IEEE Trans Pattern Anal Mach Intell, 27(6):993-7, Jun 2005. [ bib ]
This paper presents a set of geometric signature features for offline automatic signature verification based on the description of the signature envelope and the interior stroke distribution in polar and Cartesian coordinates. The features have been calculated using 16 bits fixed-point arithmetic and tested with different classifiers, such as hidden Markov models, support vector machines, and Euclidean distance classifier. The experiments have shown promising results in the task of discriminating random and simple forgeries.

[Feng2005Boosting] Kai-Yan Feng, Yu-Dong Cai, and Kuo-Chen Chou. Boosting classifier for predicting protein domain structural class. Biochem Biophys Res Commun, 334(1):213-7, Aug 2005. [ bib | DOI | http | .pdf ]
A novel classifier, the so-called "LogitBoost" classifier, was introduced to predict the structural class of a protein domain according to its amino acid sequence. LogitBoost is characterized by a log-likelihood loss function that reduces the sensitivity to noise and outliers, and by classification that combines many weak classifiers to build up a very strong and robust classifier. It was demonstrated through jackknife cross-validation tests that LogitBoost outperformed other classifiers including the "support vector machine," a very powerful classifier widely used in the biological literature. It is anticipated that LogitBoost can also become a useful vehicle in classifying other attributes of proteins according to their sequences, such as subcellular localization and enzyme family class, among many others.

Keywords: Archaeal, Artificial Intelligence, Bacterial, Cytomegalovirus, Gene Transfer, Genome, Genomics, Horizontal, Non-U.S. Gov't, Research Support, Viral, 15993842
[Evgeniou2005Learning] T. Evgeniou, C. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. J. Mach. Learn. Res., 6:615-637, 2005. [ bib | http ]
We study the problem of learning many related tasks simultaneously using kernel methods and regularization. The standard single-task kernel methods, such as support vector machines and regularization networks, are extended to the case of multi-task learning. Our analysis shows that the problem of estimating many task functions with regularization can be cast as a single task learning problem if a family of multi-task kernel functions we define is used. These kernels model relations among the tasks and are derived from a novel form of regularizers. Specific kernels that can be used for multi-task learning are provided and experimentally tested on two real data sets. In agreement with past empirical work on multi-task learning, the experiments show that learning multiple related tasks simultaneously using the proposed approach can significantly outperform standard single-task learning particularly when there are many related tasks but few data per task.

[Evers2005Structure-based] A. Evers and T. Klabunde. Structure-based drug discovery using GPCR homology modeling: successful virtual screening for antagonists of the alpha1A adrenergic receptor. J. Med. Chem., 48(4):1088-1097, Feb 2005. [ bib | DOI | http ]
In this paper, we describe homology modeling of the alpha1A receptor based on the X-ray structure of bovine rhodopsin. The protein model has been generated by applying ligand-supported homology modeling, using mutational and ligand SAR data to guide the protein modeling procedure. We performed a virtual screening of the company's compound collection to test how well this model is suited to identify alpha1A antagonists. We applied a hierarchical virtual screening procedure guided by 2D filters and three-dimensional pharmacophore models. The ca. 23,000 filtered compounds were docked into the alpha1A homology model with GOLD and scored with PMF. From the top-ranked compounds, 80 diverse compounds were tested in a radioligand displacement assay. 37 compounds revealed K(i) values better than 10 microM; the most active compound binds with 1.4 nM to the alpha1A receptor. Our findings suggest that rhodopsin-based homology models may be used as the structural basis for GPCR lead finding and compound optimization.

Keywords: chemogenomics
[Engelhardt2005Protein] BE. Engelhardt, MI. Jordan, KE. Muratore, and SE. Brenner. Protein Molecular Function Prediction by Bayesian Phylogenomics. PLoS Comput. Biol., 1(5):e45, Oct 2005. [ bib | DOI | http | .pdf ]
We present a statistical graphical model to infer specific molecular function for unannotated protein sequences using homology. Based on phylogenomic principles, SIFTER (Statistical Inference of Function Through Evolutionary Relationships) accurately predicts molecular function for members of a protein family given a reconciled phylogeny and available function annotations, even when the data are sparse or noisy. Our method produced specific and consistent molecular function predictions across 100 Pfam families in comparison to the Gene Ontology annotation database, BLAST, GOtcha, and Orthostrapper. We performed a more detailed exploration of functional predictions on the adenosine-5'-monophosphate/adenosine deaminase family and the lactate/malate dehydrogenase family, in the former case comparing the predictions against a gold standard set of published functional characterizations. Given function annotations for 3% of the proteins in the deaminase family, SIFTER achieves 96% accuracy in predicting molecular function for experimentally characterized proteins as reported in the literature. The accuracy of SIFTER on this dataset is a significant improvement over other currently available methods such as BLAST (75%), GeneQuiz (64%), GOtcha (89%), and Orthostrapper (11%). We also experimentally characterized the adenosine deaminase from Plasmodium falciparum, confirming SIFTER's prediction. The results illustrate the predictive power of exploiting a statistical model of function evolution in phylogenomic problems. A software implementation of SIFTER is available from the authors.

Keywords: biogm
[Ein-Dor2005Outcome] L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany. Outcome signature genes in breast cancer: is there a unique set? Bioinformatics, 21(2):171-178, Jan 2005. [ bib | DOI | http | .pdf ]
MOTIVATION: Predicting the metastatic potential of primary malignant tissues has direct bearing on the choice of therapy. Several microarray studies yielded gene sets whose expression profiles successfully predicted survival. Nevertheless, the overlap between these gene sets is almost zero. Such small overlaps were observed also in other complex diseases, and the variables that could account for the differences had evoked a wide interest. One of the main open questions in this context is whether the disparity can be attributed only to trivial reasons such as different technologies, different patients and different types of analyses. RESULTS: To answer this question, we concentrated on a single breast cancer dataset, and analyzed it by a single method, the one which was used by van't Veer et al. to produce a set of outcome-predictive genes. We showed that, in fact, the resulting set of genes is not unique; it is strongly influenced by the subset of patients used for gene selection. Many equally predictive lists could have been produced from the same analysis. Three main properties of the data explain this sensitivity: (1) many genes are correlated with survival; (2) the differences between these correlations are small; (3) the correlations fluctuate strongly when measured over different subsets of patients. A possible biological explanation for these properties is discussed. CONTACT: eytan.domany@weizmann.ac.il SUPPLEMENTARY INFORMATION: http://www.weizmann.ac.il/physics/complex/compphys/downloads/liate/

Keywords: breastcancer, microarray, featureselection
[Ehlers2005NBS1] Justis P Ehlers and J. William Harbour. NBS1 expression as a prognostic marker in uveal melanoma. Clin. Cancer Res., 11(5):1849-53, Mar 2005. [ bib | DOI | http | .pdf ]
PURPOSE: Up to half of uveal melanoma patients die of metastatic disease. Treatment of the primary eye tumor does not improve survival in high-risk patients due to occult micrometastatic disease, which is present at the time of eye tumor diagnosis but is not detected and treated until months to years later. Here, we use microarray gene expression data to identify a new prognostic marker. EXPERIMENTAL DESIGN: Microarray gene expression profiles were analyzed in 25 primary uveal melanomas. Tumors were ranked by support vector machine (SVM) and by cytologic severity. Nbs1 protein expression was assessed by quantitative immunohistochemistry in 49 primary uveal melanomas. Survival was assessed using Kaplan-Meier life-table analysis. RESULTS: Expression of the Nijmegen breakage syndrome (NBS1) gene correlated strongly with SVM and cytologic tumor rankings (P < 0.0001). Further, immunohistochemistry expression of the Nbs1 protein correlated strongly with both SVM and cytologic rankings (P < 0.0001). The 6-year actuarial survival was 100% in patients with low immunohistochemistry expression of Nbs1 and 22% in those with high Nbs1 expression (P = 0.01). CONCLUSIONS: NBS1 is a strong predictor of uveal melanoma survival and potentially could be used as a clinical marker for guiding clinical management.

Keywords: 80 and over, Adult, Aged, Algorithms, Amino Acid Sequence, Amino Acids, Analysis of Variance, Animals, Area Under Curve, Artifacts, Automated, Bacteriophage T4, Base Sequence, Biological, Birefringence, Brain Chemistry, Brain Neoplasms, Cell Cycle Proteins, Comparative Study, Computational Biology, Computer-Assisted, Cornea, Cross-Sectional Studies, Databases, Decision Trees, Diagnosis, Diagnostic Imaging, Diagnostic Techniques, Discriminant Analysis, Evolution, Extramural, Face, Female, Gene Expression Profiling, Genetic, Glaucoma, Humans, Immunohistochemistry, Intraocular Pressure, Lasers, Least-Squares Analysis, Likelihood Functions, Magnetic Resonance Imaging, Magnetic Resonance Spectroscopy, Male, Markov Chains, Melanoma, Middle Aged, Models, Molecular, Mutation, N.I.H., Nerve Fibers, Non-P.H.S., Non-U.S. Gov't, Nuclear Proteins, Nucleic Acid, Nucleic Acid Conformation, Numerical Analysis, Oligonucleotide Array Sequence Analysis, Ophthalmological, Optic Nerve Diseases, Optical Coherence, P.H.S., Pattern Recognition, Photic Stimulation, Polymorphism, Prognosis, Prospective Studies, Protein, Protein Structure, Proteins, RNA, ROC Curve, Regression Analysis, Reproducibility of Results, Research Support, Retinal Ganglion Cells, Secondary, Sensitivity and Specificity, Sequence Analysis, Single Nucleotide, Single-Stranded Conformational, Software, Statistics, Survival Analysis, Tertiary, Tomography, Tumor Markers, U.S. Gov't, Untranslated, Uveal Neoplasms, Visual Fields, beta-Lactamases, 15756009
[Donnes2005Integrated] P. Dönnes and O. Kohlbacher. Integrated modeling of the major events in the MHC class I antigen processing pathway. Protein Sci., 14:2132-2140, Jun 2005. [ bib | DOI | http | .pdf ]
Rational design of epitope-driven vaccines is a key goal of immunoinformatics. Typically, candidate selection relies on the prediction of MHC-peptide binding only, as this is known to be the most selective step in the MHC class I antigen processing pathway. However, proteasomal cleavage and transport by the transporter associated with antigen processing (TAP) are essential steps in antigen processing as well. While prediction methods exist for the individual steps, no method has yet offered an integrated prediction of all three major processing events. Here we present WAPP, a method combining prediction of proteasomal cleavage, TAP transport, and MHC binding into a single prediction system. The proteasomal cleavage site prediction employs a new matrix-based method that is based on experimentally verified proteasomal cleavage sites. Support vector regression is used for predicting peptides transported by TAP. MHC binding is the last step in the antigen processing pathway and was predicted using a support vector machine method, SVMHC. The individual methods are combined in a filtering approach mimicking the natural processing pathway. WAPP thus predicts peptides that are cleaved by the proteasome at the C terminus, transported by TAP, and show significant affinity to MHC class I molecules. This results in a decrease in false positive rates compared to MHC binding prediction alone. Compared to prediction of MHC binding only, we report an increased overall accuracy and a lower rate of false positive predictions for the HLA-A*0201, HLA-B*2705, HLA-A*01, and HLA-A*03 alleles using WAPP. The method is available online through our prediction server at http://www-bs.informatik.uni-tuebingen.de/WAPP.

Keywords: biosvm immunoinformatics
[Dubey2005Support] Anshul Dubey, Matthew J Realff, Jay H Lee, and Andreas S Bommarius. Support vector machines for learning to identify the critical positions of a protein. J Theor Biol, 234(3):351-61, Jun 2005. [ bib | DOI | http | .pdf ]
A method for identifying the positions in the amino acid sequence, which are critical for the catalytic activity of a protein using support vector machines (SVMs) is introduced and analysed. SVMs are supported by an efficient learning algorithm and can utilize some prior knowledge about the structure of the problem. The amino acid sequences of the variants of a protein, created by inducing mutations, along with their fitness are required as input data by the method to predict its critical positions. To investigate the performance of this algorithm, variants of the beta-lactamase enzyme were created in silico using simulations of both mutagenesis and recombination protocols. Results from literature on beta-lactamase were used to test the accuracy of this method. It was also compared with the results from a simple search algorithm. The algorithm was also shown to be able to predict critical positions that can tolerate two different amino acids and retain function.

Keywords: biosvm
[Dror2005Accurate] G. Dror, R. Sorek, and R. Shamir. Accurate identification of alternatively spliced exons using support vector machine. Bioinformatics, 21(7):897-901, Apr 2005. [ bib | DOI | http | .pdf ]
Motivation: Alternative splicing is a major component of the regulation acting on mammalian transcriptomes. It is estimated that over half of all human genes have more than one splice variant. Previous studies have shown that alternatively spliced exons possess several features that distinguish them from constitutively spliced ones. Recently, we have demonstrated that such features can be used to distinguish alternative from constitutive exons. In the current study we use advanced machine learning methods to generate a robust alternative exon classifier. Results: We extracted several hundred local sequence features of constitutive as well as alternative exons. Using feature selection methods we find seven attributes that are dominant for the task of classification. Several less informative features help to slightly increase the performance of the classifier. The classifier achieves a true positive rate of 50% [...] positive rate of 0.5% [...] alternatively spliced exons in exon databases that are believed to be dominated by constitutive exons. Availability: Upon request from the authors.

Keywords: biosvm
[Doytchinova2005Towards] Irini A Doytchinova, Valerie Walshe, Persephone Borrow, and Darren R Flower. Towards the chemometric dissection of peptide-HLA-A*0201 binding affinity: comparison of local and global QSAR models. J Comput Aided Mol Des, 19(3):203-212, Mar 2005. [ bib | DOI | http ]
The affinities of 177 nonameric peptides binding to the HLA-A*0201 molecule were measured using a FACS-based MHC stabilisation assay and analysed using chemometrics. Their structures were described by global and local descriptors, QSAR models were derived by genetic algorithm, stepwise regression and PLS. The global molecular descriptors included molecular connectivity chi indices, kappa shape indices, E-state indices, molecular properties like molecular weight and log P, and three-dimensional descriptors like polarizability, surface area and volume. The local descriptors were of two types. The first used a binary string to indicate the presence of each amino acid type at each position of the peptide. The second was also position-dependent but used five z-scales to describe the main physicochemical properties of the amino acids forming the peptides. The models were developed using a representative training set of 131 peptides and validated using an independent test set of 46 peptides. It was found that the global descriptors could not explain the variance in the training set nor predict the affinities of the test set accurately. Both types of local descriptors gave QSAR models with better explained variance and predictive ability. The results suggest that, in their interactions with the MHC molecule, the peptide acts as a complicated ensemble of multiple amino acids mutually potentiating each other.

Keywords: Algorithms; Amino Acid Sequence; Binding Sites; HLA-A Antigens; Models, Theoretical; Oligopeptides; Quantitative Structure-Activity Relationship; Regression Analysis
[Doyle2005PlosBiol] John Doyle and Marie Csete. Motifs, control, and stability. PLoS Biol, 3(11):e392, Nov 2005. [ bib | DOI | http ]
Keywords: Amino Acid Motifs; Bacterial Physiological Phenomena; Bacterial Proteins, chemistry; Escherichia coli, metabolism; Genes, Bacterial; Genes, Plant; Glycolysis; Heat-Shock Proteins, chemistry; Models, Biological; Models, Theoretical; Molecular Chaperones, chemistry; Plant Proteins, chemistry; Protein Interaction Mapping; Protein Structure, Tertiary; Transcription Factors, chemistry; Transcription, Genetic
[Dong2005Fast] Jian-xiong Dong, Adam Krzyzak, and Ching Y Suen. Fast SVM training algorithm with decomposition on very large data sets. IEEE Trans Pattern Anal Mach Intell, 27(4):603-18, Apr 2005. [ bib ]
Training a support vector machine on a data set of huge size with thousands of classes is a challenging problem. This paper proposes an efficient algorithm to solve this problem. The key idea is to introduce a parallel optimization step to quickly remove most of the nonsupport vectors, where block diagonal matrices are used to approximate the original kernel matrix so that the original problem can be split into hundreds of subproblems which can be solved more efficiently. In addition, some effective strategies such as kernel caching and efficient computation of kernel matrix are integrated to speed up the training process. Our analysis of the proposed algorithm shows that its time complexity grows linearly with the number of classes and size of the data set. In the experiments, many appealing properties of the proposed algorithm have been investigated and the results show that the proposed algorithm has a much better scaling capability than Libsvm, SVMlight, and SVMTorch. Moreover, the good generalization performances on several large databases have also been achieved.

Keywords: Algorithms, Animals, Antibiotics, Antineoplastic, Artificial Intelligence, Automated, Automatic Data Processing, Butadienes, Chloroplasts, Comparative Study, Computer Simulation, Computer-Assisted, Database Management Systems, Databases, Diagnosis, Disinfectants, Dose-Response Relationship, Drug, Drug Toxicity, Electrodes, Electroencephalography, Ethylamines, Expert Systems, Factual, Feedback, Fungicides, Gene Expression Profiling, Genes, Genetic Markers, Humans, Image Enhancement, Image Interpretation, Implanted, Industrial, Information Storage and Retrieval, Kidney, Kidney Tubules, MEDLINE, Male, Mercuric Chloride, Microarray Analysis, Molecular Biology, Motor Cortex, Movement, Natural Language Processing, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Numerical Analysis, Pattern Recognition, Plant Proteins, Predictive Value of Tests, Proteins, Proteome, Proximal, Puromycin Aminonucleoside, Rats, Reproducibility of Results, Research Support, Sensitivity and Specificity, Signal Processing, Sprague-Dawley, Subcellular Fractions, Terminology, Therapy, Time Factors, Toxicogenetics, U.S. Gov't, User-Computer Interface, 15794164
[Dong2005Prediction] Hai-Long Dong and Yan-Fang Sui. Prediction of HLA-A2-restricted CTL epitope specific to HCC by SYFPEITHI combined with polynomial method. World J Gastroenterol, 11(2):208-211, Jan 2005. [ bib ]
AIM: To predict the HLA-A2-restricted CTL epitopes of tumor antigens associated with hepatocellular carcinoma (HCC). METHODS: MAGE-1, MAGE-3, MAGE-8, P53 and AFP were selected as objective antigens in this study for the close association with HCC. The HLA-A*0201 restricted CTL epitopes of objective tumor antigens were predicted by SYFPEITHI prediction method combined with the polynomial quantitative motifs method. The threshold of polynomial scores was set to -24. RESULTS: The SYFPEITHI prediction values of all possible nonamers of a given protein sequence were added together and the ten high-scoring peptides of each protein were chosen for further analysis in primary prediction. Thirty-five candidates of CTL epitopes (nonamers) derived from the primary prediction results were selected by analyzing with the polynomial method and compared with reported CTL epitopes. CONCLUSION: The combination of SYFPEITHI prediction method and polynomial method can improve the prediction efficiency and accuracy. These nonamers may be useful in the design of therapeutic peptide vaccine for HCC and as immunotherapeutic strategies against HCC after identified by immunology experiment.

Keywords: Amino Acid Sequence; Carcinoma, Hepatocellular; Databases, Protein; Epitopes; HLA-A2 Antigen; Humans; Liver Neoplasms; Major Histocompatibility Complex; Research Support, Non-U.S. Gov't; T-Lymphocytes, Cytotoxic
[Dobson2005Predicting] P.D. Dobson and A.J. Doig. Predicting enzyme class from protein structure without alignments. J. Mol. Biol., 345(1):187-199, Jan 2005. [ bib | DOI | http | .pdf ]
Methods for predicting protein function from structure are becoming more important as the rate at which structures are solved increases more rapidly than experimental knowledge. As a result, protein structures now frequently lack functional annotations. The majority of methods for predicting protein function are reliant upon identifying a similar protein and transferring its annotations to the query protein. This method fails when a similar protein cannot be identified, or when any similar proteins identified also lack reliable annotations. Here, we describe a method that can assign function from structure without the use of algorithms reliant upon alignments. Using simple attributes that can be calculated from any crystal structure, such as secondary structure content, amino acid propensities, surface properties and ligands, we describe each enzyme in a non-redundant set. The set is split according to Enzyme Classification (EC) number. We combine the predictions of one-class versus one-class support vector machine models to make overall assignments of EC number to an accuracy of 35 to 60% [...] the utility of simple structural attributes in protein function prediction and shed light on the link between structure and function. We apply our methods to predict the function of every currently unclassified protein in the Protein Data Bank.

Keywords: biosvm
[Ding2005Minimum] Chris Ding and Hanchuan Peng. Minimum redundancy feature selection from microarray gene expression data. J Bioinform Comput Biol, 3(2):185-205, Apr 2005. [ bib ]
Selecting a small subset out of the thousands of genes in microarray data is important for accurate classification of phenotypes. Widely used methods typically rank genes according to their differential expressions among phenotypes and pick the top-ranked genes. We observe that feature sets so obtained have certain redundancy and study methods to minimize it. We propose a minimum redundancy - maximum relevance (MRMR) feature selection framework. Genes selected via MRMR provide a more balanced coverage of the space and capture broader characteristics of phenotypes. They lead to significantly improved class predictions in extensive experiments on 6 gene expression data sets: NCI, Lymphoma, Lung, Child Leukemia, Leukemia, and Colon. Improvements are observed consistently among 4 classification methods: Naive Bayes, Linear discriminant analysis, Logistic regression, and Support vector machines. SUPPLEMENTARY: The top 60 MRMR genes for each of the datasets are listed in http://crd.lbl.gov/~cding/MRMR/. More information related to MRMR methods can be found at http://www.hpeng.net/.

Keywords: Adult, Aged, Aging, Algorithms, Animals, Apoptosis, Artificial Intelligence, Automated, Biological, Bone Marrow, Breast Neoplasms, Classification, Cluster Analysis, Comparative Study, Computer Simulation, Computer-Assisted, Diagnosis, Dose-Response Relationship, Drug, Female, Foot, Gait, Gene Expression Profiling, Gene Expression Regulation, Gene Silencing, Genetic Vectors, Humans, Image Interpretation, Information Storage and Retrieval, Kidney, Liver, Logistic Models, Male, Messenger, Models, Myocardium, Neoplasms, Non-U.S. Gov't, Oligonucleotide Array Sequence Analysis, Pattern Recognition, Pharmaceutical Preparations, Polymerase Chain Reaction, Principal Component Analysis, Proteins, RNA, Rats, Reproducibility of Results, Research Support, Sensitivity and Specificity, Small Interfering, Sprague-Dawley, Statistical, Subcellular Fractions, Unknown Primary, 15852500
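
The minimum redundancy - maximum relevance idea in the abstract above can be illustrated with a small greedy sketch. This is not the authors' implementation: it assumes discrete features, uses mutual information for both relevance and redundancy, and scores a candidate as relevance minus mean redundancy with the features already chosen. The names `mutual_information` and `mrmr_select` and the toy data are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr_select(features, labels, k):
    """Greedily pick k feature names by relevance minus mean redundancy.

    `features` maps a feature name to its column of discrete values.
    """
    relevance = {
        name: mutual_information(col, labels) for name, col in features.items()
    }
    selected = []

    def score(name):
        # First pick: relevance only. Later picks: penalise overlap
        # with the features that were already selected.
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while len(selected) < min(k, len(features)):
        candidates = [name for name in features if name not in selected]
        selected.append(max(candidates, key=score))
    return selected


# Toy data (made up): g2 duplicates g1, g3 carries complementary information.
labels = [0, 0, 1, 1, 2, 2, 3, 3]
features = {
    "g1": [0, 0, 0, 0, 1, 1, 1, 1],
    "g2": [0, 0, 0, 0, 1, 1, 1, 1],
    "g3": [0, 0, 1, 1, 0, 0, 1, 1],
}
selected = mrmr_select(features, labels, 2)
```

On this toy input a plain relevance ranking cannot distinguish the three features, whereas the redundancy penalty steers the second pick away from the duplicate `g2`.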
[Dhingra2005Substantial] Vikas Dhingra, Mukta Gupta, Tracy Andacht, and Zhen F Fu. New frontiers in proteomics research: a perspective. Int. J. Pharm., 299(1-2):1-18, Aug 2005. [ bib | DOI | http ]
Substantial advances have been made in the fundamental understanding of human biology, ranging from DNA structure to identification of diseases associated with genetic abnormalities. Genome sequence information is becoming available in unprecedented amounts. The absence of a direct functional correlation between gene transcripts and their corresponding proteins, however, represents a significant roadblock for improving the efficiency of biological discoveries. The success of proteomics depends on the ability to identify and analyze protein products in a cell or tissue and, this is reliant on the application of several key technologies. Proteomics is in its exponential growth phase. Two-dimensional electrophoresis complemented with mass spectrometry provides a global view of the state of the proteins from the sample. Proteins identification is a requirement to understand their functional diversity. Subtle difference in protein structure and function can contribute to complexity and diversity of life. This review focuses on the progress and the applications of proteomics science with special reference to integration of the evolving technologies involved to address biological questions.

Keywords: Computational Biology; Electrophoresis, Gel, Two-Dimensional; Humans; Peptide Mapping; Protein Interaction Mapping; Proteomics; Spectrometry, Mass, Matrix-Assisted Laser Desorption-Ionization
[Devos2005use] A. Devos, A.W. Simonetti, M. van der Graaf, L. Lukas, J.A.K. Suykens, L. Vanhamme, L.M.C. Buydens, A. Heerschap, and S. Van Huffel. The use of multivariate MR imaging intensities versus metabolic data from MR spectroscopic imaging for brain tumour classification. J Magn Reson, 173(2):218-28, Apr 2005. [ bib | DOI | http | .pdf ]
This study investigated the value of information from both magnetic resonance imaging and magnetic resonance spectroscopic imaging (MRSI) to automated discrimination of brain tumours. The influence of imaging intensities and metabolic data was tested by comparing the use of MR spectra from MRSI, MR imaging intensities, peak integration values obtained from the MR spectra and a combination of the latter two. Three classification techniques were objectively compared: linear discriminant analysis, least squares support vector machines (LS-SVM) with a linear kernel as linear techniques and LS-SVM with radial basis function kernel as a nonlinear technique. Classifiers were evaluated over 100 stratified random splittings of the dataset into training and test sets. The area under the receiver operating characteristic (ROC) curve (AUC) was used as a global performance measure on test data. In general, all techniques obtained a high performance when using peak integration values with or without MR imaging intensities. For example for low- versus high-grade tumours, low- versus high-grade gliomas and gliomas versus meningiomas, the mean test AUC was higher than 0.91, 0.94, and 0.99, respectively, when both MR imaging intensities and peak integration values were used. The use of metabolic data from MRSI significantly improved automated classification of brain tumour types compared to the use of MR imaging intensities solely.

[Desobry2005online] F. Desobry, M. Davy, and C. Doncarli. An online kernel change detection algorithm. IEEE T. Signal. Proces., 53(8):2961-2974, 2005. [ bib | DOI | http | .pdf ]
[Dennis2005Hunting] Jayne L Dennis and Karin A Oien. Hunting the primary: novel strategies for defining the origin of tumours. J Pathol, 205(2):236-47, Jan 2005. [ bib | DOI | http | .pdf ]
In 1995, two methods of genome-wide expression profiling were first described: expression microarrays and serial analysis of gene expression (SAGE). In the subsequent 10 years, many hundreds of papers have been published describing the application of these technologies to a wide spectrum of biological and clinical questions. Common to all of this research is a basic process of data gathering and analysis. The techniques and statistical and bio-informatic tools involved in this process are reviewed. The processes of class discovery (using clustering and self-organizing maps), class prediction (weighted voting, k nearest neighbour, support vector machines, and artificial neural networks), target identification (fold change, discriminant analysis, and principal component analysis), and target validation (RT-PCR and tissue microarrays) are described. Finally, the diagnostic problem of adenocarcinomas that present as metastases of unknown origin is reviewed, and it is demonstrated how integration of expression profiling techniques promises to throw new light on this important clinical challenge.

[Denis2005Learning] F. Denis, R. Gilleron, and F. Letouzey. Learning from positive and unlabeled examples. Theor. Comput. Sci., 348(1):70-83, 2005. [ bib | DOI ]
In many machine learning settings, labeled examples are difficult to collect while unlabeled data are abundant. Also, for some binary classification problems, positive examples which are elements of the target concept are available. Can these additional data be used to improve accuracy of supervised learning algorithms? We investigate in this paper the design of learning algorithms from positive and unlabeled data only. Many machine learning and data mining algorithms, such as decision tree induction algorithms and naive Bayes algorithms, use examples only to evaluate statistical queries (SQ-like algorithms). Kearns designed the statistical query learning model in order to describe these algorithms. Here, we design an algorithm scheme which transforms any SQ-like algorithm into an algorithm based on positive statistical queries (estimate for probabilities over the set of positive instances) and instance statistical queries (estimate for probabilities over the instance space). We prove that any class learnable in the statistical query learning model is learnable from positive statistical queries and instance statistical queries only if a lower bound on the weight of any target concept f can be estimated in polynomial time. Then, we design a decision tree induction algorithm POSC4.5, based on C4.5, that uses only positive and unlabeled examples and we give experimental results for this algorithm. In the case of imbalanced classes in the sense that one of the two classes (say the positive class) is heavily underrepresented compared to the other class, the learning problem remains open. This problem is challenging because it is encountered in many real-world applications.

Keywords: PUlearning
[Degroeve2005SpliceMachine] S. Degroeve, Y. Saeys, B. De Baets, P. Rouze, and Y. Van de Peer. SpliceMachine: predicting splice sites from high-dimensional local context representations. Bioinformatics, 21:1332-1338, 2005. [ bib | DOI | http | .pdf ]
Motivation: In this age of complete genome sequencing, finding the location and structure of genes is crucial for further molecular research. The accurate prediction of intron boundaries largely facilitates the correct prediction of gene structure in nuclear genomes. Many tools for localizing these boundaries on DNA sequences have been developed and are available to researchers through the internet. Nevertheless, these tools still make many false positive predictions. Results: This manuscript presents a novel publicly available splice site prediction tool named SpliceMachine that (i) shows state-of-the-art prediction performance on Arabidopsis thaliana and human sequences, (ii) performs a computationally fast annotation, and (iii) can be trained by the user on its own data. Availability: Results, figures and software are available at http://bioinformatics.psb.ugent.be/supplementarydata/.

Keywords: biosvm
[Davies2005Array] J.J. Davies, I.M. Wilson, and W.L. Lam. Array CGH technologies and their applications to cancer genomes. Chromosome Res., 13(3):237-248, 2005. [ bib | DOI | http | .pdf ]
Cancer is a disease characterized by genomic instability. Comparative genomic hybridization (CGH) is a technique designed for detecting segmental genomic alterations. Recent advances in array-based CGH technology have enabled examination of chromosomal regions in unprecedented detail, revolutionizing our understanding of tumour genomes. A number of array-based technologies have been developed, aiming to improve the resolution of CGH, enabling researchers to refine and define regions in the genome that may be causal to cancer, and facilitating gene discovery at a rapid rate. This article reviews the various array CGH platforms and their use in the study of cancer genomes. In addition, the need for high-resolution analysis is discussed as well as the importance of studying early-stage disease to discover genetic alterations that may be causal to cancer progression and aetiology.

Keywords: csbcbook, csbcbook-ch2
[Dai2005Evolving] M. Dai, P. Wang, A.D. Boyd, G. Kostov, B. Athey, E.G. Jones, W.E. Bunney, R.M. Myers, T.P. Speed, H. Akil, et al. Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Research, 33(20):e175-e175, 2005. [ bib ]
[Cuturi2005context-tree] M. Cuturi and J.-P. Vert. The context-tree kernel for strings. Neural Network., 18(4):1111-1123, 2005. [ bib | DOI | http | .pdf ]
We propose a new kernel for strings which borrows ideas and techniques from information theory and data compression. This kernel can be used in combination with any kernel method, in particular Support Vector Machines for string classification, with notable applications in proteomics. By using a Bayesian averaging framework with conjugate priors on a class of Markovian models known as probabilistic suffix trees or context-trees, we compute the value of this kernel in linear time and space while only using the information contained in the spectrum of the considered strings. This is ensured through an adaptation of a compression method known as the context-tree weighting algorithm. Encouraging classification results are reported on a standard protein homology detection experiment, showing that the context-tree kernel performs well with respect to other state-of-the-art methods while using no biological prior knowledge.

Keywords: biosvm
[Cuturi2005Semigroup] M. Cuturi and J.-P. Vert. Semigroup kernels on finite sets. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Adv. Neural Inform. Process. Syst., volume 17, pages 329-336. MIT Press, Cambridge, MA, 2005. [ bib | www: ]
[Cuturi2005Semigroupa] M. Cuturi, K. Fukumizu, and J.-P. Vert. Semigroup kernels on measures. J. Mach. Learn. Res., 6:1169-1198, 2005. [ bib | .html | .pdf ]
Keywords: kernel-theory
[Curtis2005] Keira R. Curtis, Matej Oresic, and Antonio Vidal-Puig. Pathways to the analysis of microarray data. Trends in Biotechnology, 23(8):429-435, 2005. [ bib | DOI | http ]
The development of microarray technology allows the simultaneous measurement of the expression of many thousands of genes. The information gained offers an unprecedented opportunity to fully characterize biological processes. However, this challenge will only be successful if new tools for the efficient integration and interpretation of large datasets are available. One of these tools, pathway analysis, involves looking for consistent but subtle changes in gene expression by incorporating either pathway or functional annotations. We review several methods of pathway analysis and compare the performance of three, the binomial distribution, z scores, and gene set enrichment analysis, on two microarray datasets. Pathway analysis is a promising tool to identify the mechanisms that underlie diseases, adaptive physiological compensatory responses and new avenues for investigation.

Keywords: networks pathways
[Cole2005Comparing] Jason C Cole, Christopher W Murray, J Willem M Nissink, Richard D Taylor, and Robin Taylor. Comparing protein-ligand docking programs is difficult. Proteins, 60(3):325-332, Aug 2005. [ bib | DOI | http ]
There is currently great interest in comparing protein-ligand docking programs. A review of recent comparisons shows that it is difficult to draw conclusions of general applicability. Statistical hypothesis testing is required to ensure that differences in pose-prediction success rates and enrichment rates are significant. Numerical measures such as root-mean-square deviation need careful interpretation and may profitably be supplemented by interaction-based measures and visual inspection of dockings. Test sets must be of appropriate diversity and of good experimental reliability. The effects of crystal-packing interactions may be important. The method used for generating starting ligand geometries and positions may have an appreciable effect on docking results. For fair comparison, programs must be given search problems of equal complexity (e.g. binding-site regions of the same size) and approximately equal time in which to solve them. Comparisons based on rescoring require local optimization of the ligand in the space of the new objective function. Re-implementations of published scoring functions may give significantly different results from the originals. Ostensibly minor details in methodology may have a profound influence on headline success rates.

Keywords: Algorithms; Artificial Intelligence; Binding Sites; Computational Biology, methods; Computer Simulation; Crystallization; Crystallography, X-Ray; Databases, Protein; Ligands; Models, Molecular; Molecular Structure; Programming Languages; Protein Binding; Proteins, chemistry; Proteomics, methods; Reproducibility of Results; Software
[Ciliberto2005CellCycle] A. Ciliberto, B. Novak, and J.J. Tyson. Steady states and oscillations in the p53/mdm2 network. Cell Cycle, 4(3):488-93, 2005. [ bib ]
p53 is activated in response to events compromising the genetic integrity of a cell. Recent data show that p53 activity does not increase steadily with genetic damage but rather fluctuates in an oscillatory fashion. Theoretical studies suggest that oscillations can arise from a combination of positive and negative feedbacks or from a long negative feedback loop alone. Both negative and positive feedbacks are present in the p53/Mdm2 network, but it is not known what roles they play in the oscillatory response to DNA damage. We developed a mathematical model of p53 oscillations based on positive and negative feedbacks in the p53/Mdm2 network. According to the model, the system reacts to DNA damage by moving from a stable steady state into a region of stable limit cycles. Oscillations in the model are born with large amplitude, which guarantees an all-or-none response to damage. As p53 oscillates, damage is repaired and the system moves back to a stable steady state with low p53 activity. The model reproduces experimental data in quantitative detail. We suggest new experiments for dissecting the contributions of negative and positive feedbacks to the generation of oscillations.

Keywords: csbcbook
[Cianchetta2005Predictive] G. Cianchetta, Y. Li, J. Kang, D. Rampe, A. Fravolini, G. Cruciani, and R.J. Vaz. Predictive models for hERG potassium channel blockers. Bioorg. Med. Chem. Lett., 15(15):3637-3642, Aug 2005. [ bib | DOI | http ]
We report here a general method for the prediction of hERG potassium channel blockers using computational models generated from correlation analyses of a large dataset and pharmacophore-based GRIND descriptors. These 3D-QSAR models are compared favorably with other traditional and chemometric based HQSAR methods.

Keywords: chemoinformatics herg
[Chu2005improved] Wei Chu, Chong Jin Ong, and S Sathiya Keerthi. An improved conjugate gradient scheme to the solution of least squares SVM. IEEE Trans Neural Netw, 16(2):498-501, Mar 2005. [ bib | DOI | http | .pdf ]
The least square support vector machines (LS-SVM) formulation corresponds to the solution of a linear system of equations. Several approaches to its numerical solutions have been proposed in the literature. In this letter, we propose an improved method to the numerical solution of LS-SVM and show that the problem can be solved using one reduced system of linear equations. Compared with the existing algorithm for LS-SVM, the approach used in this letter is about twice as efficient. Numerical results using the proposed method are provided for comparisons with other existing algorithms.

Keywords: 80 and over, Aged, Algorithms, Area Under Curve, Cross-Sectional Studies, Diagnostic Imaging, Diagnostic Techniques, Glaucoma, Humans, Lasers, Least-Squares Analysis, Middle Aged, Nerve Fibers, Non-U.S. Gov't, Ophthalmological, Optic Nerve Diseases, P.H.S., ROC Curve, Research Support, Retinal Ganglion Cells, Sensitivity and Specificity, U.S. Gov't, 15787157
[Cheng2005Protein] Betty Yee Man Cheng, Jaime G Carbonell, and Judith Klein-Seetharaman. Protein classification based on text document classification techniques. Proteins, 58(4):955-70, Mar 2005. [ bib | DOI | http | .pdf ]
The need for accurate, automated protein classification methods continues to increase as advances in biotechnology uncover new proteins. G-protein coupled receptors (GPCRs) are a particularly difficult superfamily of proteins to classify due to extreme diversity among its members. Previous comparisons of BLAST, k-nearest neighbor (k-NN), hidden markov model (HMM) and support vector machine (SVM) using alignment-based features have suggested that classifiers at the complexity of SVM are needed to attain high accuracy. Here, analogous to document classification, we applied Decision Tree and Naive Bayes classifiers with chi-square feature selection on counts of n-grams (i.e. short peptide sequences of length n) to this classification task. Using the GPCR dataset and evaluation protocol from the previous study, the Naive Bayes classifier attained an accuracy of 93.0 and 92.4% in level I and level II subfamily classification respectively, while SVM has a reported accuracy of 88.4 and 86.3%. This is a 39.7 and 44.5% reduction in residual error for level I and level II subfamily classification, respectively. The Decision Tree, while inferior to SVM, outperforms HMM in both level I and level II subfamily classification. For those GPCR families whose profiles are stored in the Protein FAMilies database of alignments and HMMs (PFAM), our method performs comparably to a search against those profiles. Finally, our method can be generalized to other protein families by applying it to the superfamily of nuclear receptors with 94.5, 97.8 and 93.6% accuracy in family, level I and level II subfamily classification respectively.
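
The "reduction in residual error" figures quoted in the abstract above follow from the reported accuracies if residual error is taken to be 100 minus accuracy. The short check below makes that arithmetic explicit; the function name `residual_error_reduction` is hypothetical, and the definition of residual error is an assumption that happens to reproduce the quoted 39.7 and 44.5%.

```python
def residual_error_reduction(baseline_acc, new_acc):
    """Relative drop in error (in percent) when accuracy rises from
    baseline_acc to new_acc, both given in percent."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (baseline_err - new_err) / baseline_err


# Accuracies as quoted in the abstract: SVM versus Naive Bayes.
level1 = residual_error_reduction(88.4, 93.0)  # level I subfamilies
level2 = residual_error_reduction(86.3, 92.4)  # level II subfamilies
print(f"level I: {level1:.1f}%, level II: {level2:.1f}%")
```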

[Chen2005Understanding] Y. Chen and D. Xu. Understanding protein dispensability through machine-learning analysis of high-throughput data. Bioinformatics, 21:575-581, Mar 2005. [ bib | DOI | http | .pdf ]
Motivation: Protein dispensability is fundamental to understanding of gene function and evolution. Recent advances in generating high-throughput data such as genomic sequence data, protein-protein interaction data, gene-expression data, and growth-rate data of mutants allow us to investigate protein dispensability systematically at the genome scale. Results: In our studies, protein dispensability is represented as a fitness score that is measured by the growth rate of gene-deletion mutants. Through analyses of high-throughput data in yeast Saccharomyces cerevisiae, we found that a protein's dispensability had significant correlations with its evolutionary rate and duplication rate, as well as its connectivity in protein-protein interaction network and gene-expression correlation network. Neural network and support vector machine were applied to predict protein dispensability through high-throughput data. Our studies shed some light on global characteristics of protein dispensability and evolution. Availability: The original datasets for protein dispensability analysis and prediction, together with related scripts, are available at http://digbio.missouri.edu/~ychen/ProDispen/.

Keywords: biosvm
[Chen2005stochastic] K.-C. Chen, T.-Y. Wang, H.-H. Tseng, C.-Y.F. Huang, and C.-Y. Kao. A stochastic differential equation model for quantifying transcriptional regulatory network in Saccharomyces cerevisiae. Bioinformatics, 21(12):2883-2890, Jun 2005. [ bib | DOI | http ]
MOTIVATION: The explosion of microarray studies has promised to shed light on the temporal expression patterns of thousands of genes simultaneously. However, available methods are far from adequate in efficiently extracting useful information to aid in a greater understanding of transcriptional regulatory network. Biological systems have been modeled as dynamic systems for a long history, such as genetic networks and cell regulatory network. This study evaluated if the stochastic differential equation (SDE), which is prominent for modeling dynamic diffusion process originating from the irregular Brownian motion, can be applied in modeling the transcriptional regulatory network in Saccharomyces cerevisiae. RESULTS: To model the time-continuous gene-expression datasets, a model of SDE is applied to depict irregular patterns. Our goal is to fit a generalized linear model by combining putative regulators to estimate the transcriptional pattern of a target gene. Goodness-of-fit is evaluated by log-likelihood and Akaike Information Criterion. Moreover, estimations of the contribution of regulators and inference of transcriptional pattern are implemented by statistical approaches. Our SDE model is basic but the test results agree well with the observed dynamic expression patterns. It implies that advanced SDE model might be perfectly suited to portray transcriptional regulatory networks. AVAILABILITY: The R code is available on request. CONTACT: cykao@csie.ntu.edu.tw SUPPLEMENTARY INFORMATION: http://www.csie.ntu.edu.tw/~b89x035/yeast/

[Chen2005ChemDB] J. Chen, S. J. Swamidass, Y. Dou, J. Bruand, and P. Baldi. ChemDB: a public database of small molecules and related chemoinformatics resources. Bioinformatics, 21(22):4133-4139, Sep 2005. [ bib | DOI | http ]
MOTIVATION: The development of chemoinformatics has been hampered by the lack of large, publicly available, comprehensive repositories of molecules, in particular of small molecules. Small molecules play a fundamental role in organic chemistry and biology. They can be used as combinatorial building blocks for chemical synthesis, as molecular probes in chemical genomics and systems biology, and for the screening and discovery of new drugs and other useful compounds. RESULTS: We describe ChemDB, a public database of small molecules available over the Web. ChemDB is built using the digital catalogs of over a hundred vendors and other public sources and is annotated with information derived from these sources as well as from computational methods, such as predicted solubility and 3D structure. It supports multiple molecular formats and is periodically updated, automatically whenever possible. The current version of the database contains approximately 4.1 M commercially available compounds, 8.2 M counting isomers. The database includes a user-friendly graphical interface, chemical reactions capabilities, as well as unique search capabilities. AVAILABILITY: Database, datasets, and supplementary materials available through: http://cdb.ics.uci.edu.

[Chapelle2005A] Olivier Chapelle and Zaid Harchaoui. A machine learning approach to conjoint analysis. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 257-264. MIT Press, Cambridge, MA, 2005. [ bib ]
[Chang2005Automatic] Ruey-Feng Chang, Wen-Jie Wu, Woo Kyung Moon, and Dar-Ren Chen. Automatic ultrasound segmentation and morphology based diagnosis of solid breast tumors. Breast Cancer Res Treat, 89(2):179-85, Jan 2005. [ bib | DOI | http | .pdf ]
Ultrasound (US) is a useful diagnostic tool to distinguish benign from malignant masses of the breast. It is a very convenient and safe diagnostic method. However, there is considerable overlap between benignancy and malignancy in ultrasonic images, and interpretation is subjective. A high-performance computer-aided diagnosis (CAD) system for breast tumors can provide an accurate and reliable diagnostic second opinion for physicians to distinguish benign breast lesions from malignant ones. The potential of sonographic texture analysis to improve breast tumor classification has been demonstrated. However, texture analysis is system-dependent: systems that use texture analysis to classify tumors usually perform well only on one specific ultrasound system. Morphology-based US diagnosis of breast tumors, by contrast, has the advantage of being nearly independent of both the settings of the US system and the US machine used. In this study, the tumors are first segmented using the newly developed level set method, and then six morphologic features are used to distinguish the benign and malignant cases. The support vector machine (SVM) is used to classify the tumors. There are 210 ultrasonic images of pathologically proven benign breast tumors from 120 patients and carcinomas from 90 patients in the ultrasonic image database. The database contains only one image from each patient. The ultrasonic images are captured at the largest diameter of the tumor. The images were collected consecutively from August 1, 1999 to May 31, 2000; the patients' ages ranged from 18 to 64 years. Sonography is performed using an ATL HDI 3000 system with a L10-5 small part transducer. In the experiment, the accuracy of SVM with shape information for classifying malignancies is 90.95% (191/210), the sensitivity is 88.89% (80/90), the specificity is 92.5% (111/120), the positive predictive value is 89.89% (80/89), and the negative predictive value is 91.74% (111/121).

Keywords: breastcancer
[Chalk2005siRNAdb] A. M. Chalk, R. E. Warfinge, P. Georgii-Hemming, and E. L. L. Sonnhammer. siRNAdb: a database of siRNA sequences. Nucleic Acids Res., 33(Database issue):D131-D134, Jan 2005. [ bib | DOI | http | .pdf ]
Short interfering RNAs (siRNAs) are a popular method for gene-knockdown, acting by degrading the target mRNA. Before performing experiments it is invaluable to locate and evaluate previous knockdown experiments for the gene of interest. The siRNA database provides a gene-centric view of siRNA experimental data, including siRNAs of known efficacy and siRNAs predicted to be of high efficacy by a combination of methods. Linked to these sequences is information such as siRNA thermodynamic properties and the potential for sequence-specific off-target effects. The database enables the user to evaluate an siRNA's potential for inhibition and non-specific effects. The database is available at http://siRNA.cgb.ki.se.

Keywords: sirna
[Cavalieri2005] D. Cavalieri and C. De Filippo. Bioinformatic methods for integrating whole-genome expression results into cellular networks. Drug Discov Today, 10(10):727-34, 2005. [ bib ]
Extracting a comprehensive overview from the huge amount of information arising from whole-genome analyses is a significant challenge. This review critically surveys the state of the art methods that are used to connect information from functional genomic studies to biological function. Cluster analysis methods for inferring the correlation between genes are discussed, as are the methods for integrating gene expression information with existing information on biological pathways and the methods that combine cluster analysis with biological information to reconstruct novel biological networks.

Keywords: Cluster Analysis; *Computational Biology/methods/organization & administration/trends; *Genomics/methods/organization & administration/trends; Humans; Oligonucleotide Array Sequence Analysis/methods
[Capriotti2005I-Mutant] E. Capriotti, P. Fariselli, and R. Casadio. I-Mutant2.0: predicting stability changes upon mutation from the protein sequence or structure. Nucleic Acids Res., 33(Web Server issue):W306-10, Jul 2005. [ bib | DOI | http | www: ]
I-Mutant2.0 is a support vector machine (SVM)-based tool for the automatic prediction of protein stability changes upon single point mutations. I-Mutant2.0 predictions are performed starting either from the protein structure or, more importantly, from the protein sequence. This latter task, to the best of our knowledge, is exploited for the first time. The method was trained and tested on a data set derived from ProTherm, which is presently the most comprehensive available database of thermodynamic experimental data of free energy changes of protein stability upon mutation under different conditions. I-Mutant2.0 can be used both as a classifier for predicting the sign of the protein stability change upon mutation and as a regression estimator for predicting the related DeltaDeltaG values. Acting as a classifier, I-Mutant2.0 correctly predicts (with a cross-validation procedure) 80% or 77% of the data set, depending on the usage of structural or sequence information, respectively. When predicting DeltaDeltaG values associated with mutations, the correlation of predicted with expected/experimental values is 0.71 (with a standard error of 1.30 kcal/mol) and 0.62 (with a standard error of 1.45 kcal/mol) when structural or sequence information are respectively adopted. Our web interface allows the selection of a predictive mode that depends on the availability of the protein structure and/or sequence. In this latter case, the web server requires only pasting of a protein sequence in a raw format. We therefore introduce I-Mutant2.0 as a unique and valuable helper for protein design, even when the protein structure is not yet known with atomic resolution. Availability: http://gpcr.biocomp.unibo.it/cgi/predictors/I-Mutant2.0/I-Mutant2.0.cgi.

Keywords: biosvm
[Candes2005Decoding] E. Candès and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203-4215, 2005. [ bib ]
[Cai2005Using] Yu-Dong Cai, Kai-Yan Feng, Wen-Cong Lu, and Kuo-Chen Chou. Using LogitBoost classifier to predict protein structural classes. J Theor Biol, Jul 2005. [ bib | DOI | http | .pdf ]
Prediction of protein classification is an important topic in molecular biology. This is because it is able to not only provide useful information from the viewpoint of structure itself, but also greatly stimulate the characterization of many other features of proteins that may be closely correlated with their biological functions. In this paper, the LogitBoost, one of the boosting algorithms developed recently, is introduced for predicting protein structural classes. It performs classification using a regression scheme as the base learner, which can handle multi-class problems and is particularly superior in coping with noisy data. It was demonstrated that the LogitBoost outperformed the support vector machines in predicting the structural classes for a given dataset, indicating that the new classifier is very promising. It is anticipated that the power in predicting protein structural classes as well as many other bio-macromolecular attributes will be further strengthened if the LogitBoost and some other existing algorithms can be effectively complemented with each other.

[Cabusora2005Differential] L. Cabusora, E. Sutton, A. Fulmer, and C. V. Forst. Differential network expression during drug and stress response. Bioinformatics, 21(12):2898-2905, Jun 2005. [ bib | DOI | http | .pdf ]
The application of microarray chip technology has led to an explosion of data concerning the expression levels of the genes in an organism under a plethora of conditions. One of the major challenges of systems biology today is to devise generally applicable methods of interpreting this data in a way that will shed light on the complex relationships between multiple genes and their products. The importance of such information is clear, not only as an aid to areas of research like drug design, but also as a contribution to our understanding of the mechanisms behind an organism's ability to react to its environment. We detail one computational approach for using gene expression data to identify response networks in an organism. The method is based on the construction of biological networks given different sets of interaction information and the reduction of the said networks to important response sub-networks via the integration of the gene expression data. As an application, the expression data of known stress responders and DNA repair genes in Mycobacterium tuberculosis is used to construct a generic stress response sub-network. This is compared to similar networks constructed from data obtained from subjecting M. tuberculosis to various drugs; we are thus able to distinguish between generic stress response and specific drug response. We anticipate that this approach will be able to accelerate target identification and drug development for tuberculosis in the future. Contact: chris@lanl.gov. Supplementary information: Supplementary Figures 1 through 6 on drug response networks and differential network analyses on cerulenin, chlorpromazine, ethionamide, ofloxacin, thiolactomycin and triclosan. Supplementary Tables 1 to 3 on predicted protein interactions. http://www.santafe.educhris/DifferentialNW.

[Burckin2005Exploring] T. Burckin, R. Nagel, Y. Mandel-Gutfreund, L. Shiue, T. A. Clark, J.-L. Chong, T.-H. Chang, S. Squazzo, G. Hartzog, and M. Ares. Exploring functional relationships between components of the gene expression machinery. Nat. Struct. Mol. Biol., 12(2):175-82, Feb 2005. [ bib | DOI | http | .pdf ]
Eukaryotic gene expression requires the coordinated activity of many macromolecular machines including transcription factors and RNA polymerase, the spliceosome, mRNA export factors, the nuclear pore, the ribosome and decay machineries. Yeast carrying mutations in genes encoding components of these machineries were examined using microarrays to measure changes in both pre-mRNA and mRNA levels. We used these measurements as a quantitative phenotype to ask how steps in the gene expression pathway are functionally connected. A multiclass support vector machine was trained to recognize the gene expression phenotypes caused by these mutations. In several cases, unexpected phenotype assignments by the computer revealed functional roles for specific factors at multiple steps in the gene expression pathway. The ability to resolve gene expression pathway phenotypes provides insight into how the major machineries of gene expression communicate with each other.

Keywords: biosvm microarray
[Bunescu2005Comparative] R. Bunescu, R. Ge, R. J. Kate, E. M. Marcotte, R. J. Mooney, A. K. Ramani, and Y. W. Wong. Comparative experiments on learning information extractors for proteins and their interactions. Artif. Intell. Med., 33(2):139-55, Feb 2005. [ bib | DOI | http | .pdf ]
OBJECTIVE: Automatically extracting information from biomedical text holds the promise of easily consolidating large amounts of biological knowledge in computer-accessible form. This strategy is particularly attractive for extracting data relevant to genes of the human genome from the 11 million abstracts in Medline. However, extraction efforts have been frustrated by the lack of conventions for describing human genes and proteins. We have developed and evaluated a variety of learned information extraction systems for identifying human protein names in Medline abstracts and subsequently extracting information on interactions between the proteins. METHODS AND MATERIAL: We used a variety of machine learning methods to automatically develop information extraction systems for extracting information on gene/protein name, function and interactions from Medline abstracts. We present cross-validated results on identifying human proteins and their interactions by training and testing on a set of approximately 1000 manually-annotated Medline abstracts that discuss human genes/proteins. RESULTS: We demonstrate that machine learning approaches using support vector machines and maximum entropy are able to identify human proteins with higher accuracy than several previous approaches. We also demonstrate that various rule induction methods are able to identify protein interactions with higher precision than manually-developed rules. CONCLUSION: Our results show that it is promising to use machine learning to automatically build systems for extracting information from biomedical text. The results also give a broad picture of the relative strengths of a wide variety of methods when tested on a reasonably large human-annotated corpus.

Keywords: biosvm
[Buijsman2005Structural] Rogier Buijsman. Structural aspects of kinases and their inhibitors. In Chemogenomics in Drug Discovery, chapter 7, pages 191-219. Wiley-VCH, 2005. [ bib ]
[Bui2005Automated] Huynh-Hoa Bui, John Sidney, Bjoern Peters, Muthuraman Sathiamurthy, Asabe Sinichi, Kelly-Anne Purton, Bianca R Mothé, Francis V Chisari, David I Watkins, and Alessandro Sette. Automated generation and evaluation of specific MHC binding predictive tools: ARB matrix applications. Immunogenetics, 57(5):304-314, Jun 2005. [ bib | DOI | http ]
Prediction of which peptides can bind major histocompatibility complex (MHC) molecules is commonly used to assist in the identification of T cell epitopes. However, because of the large numbers of different MHC molecules of interest, each associated with different predictive tools, tool generation and evaluation can be a very resource intensive task. A methodology commonly used to predict MHC binding affinity is the matrix or linear coefficients method. Herein, we described Average Relative Binding (ARB) matrix methods that directly predict IC(50) values allowing combination of searches involving different peptide sizes and alleles into a single global prediction. A computer program was developed to automate the generation and evaluation of ARB predictive tools. Using an in-house MHC binding database, we generated a total of 85 and 13 MHC class I and class II matrices, respectively. Results from the automated evaluation of tool efficiency are presented. We anticipate that this automation framework will be generally applicable to the generation and evaluation of large numbers of MHC predictive methods and tools, and will be of value to centralize and rationalize the process of evaluation of MHC predictions. MHC binding predictions based on ARB matrices were made available at http://epitope.liai.org:8080/matrix web server.

Keywords: Animals; Binding Sites; Computer Simulation; Databases, Protein; Epitopes; Histocompatibility Antigens; Humans; Major Histocompatibility Complex; Models, Biological; Protein Binding
[Buchala2005role] Samarasena Buchala, Neil Davey, Ray J Frank, Martin Loomes, and Tim M Gale. The role of global and feature based information in gender classification of faces: a comparison of human performance and computational models. Int J Neural Syst, 15(1-2):121-8, 2005. [ bib ]
Most computational models for gender classification use global information (the full face image), giving equal weight to the whole face area irrespective of the importance of the internal features. Here, we use a global and feature based representation of face images that includes both global and featural information. We use dimensionality reduction techniques and a support vector machine classifier and show that this method performs better than either global or feature based representations alone. We also present results of human subjects' performance on a gender classification task and evaluate how the different dimensionality reduction techniques compare with human subjects' performance. The results support the psychological plausibility of the global and feature based representation.

[Briem2005Classifying] Hans Briem and Judith Günther. Classifying "kinase inhibitor-likeness" by using machine-learning methods. ChemBioChem, 6(3):558-66, Mar 2005. [ bib | DOI | http | .pdf ]
By using an in-house data set of small-molecule structures, encoded by Ghose-Crippen parameters, several machine learning techniques were applied to distinguish between kinase inhibitors and other molecules with no reported activity on any protein kinase. All four approaches pursued, namely support-vector machines (SVM), artificial neural networks (ANN), k nearest neighbor classification with GA-optimized feature selection (GA/kNN), and recursive partitioning (RP), proved capable of providing a reasonable discrimination. Nevertheless, substantial differences in performance among the methods were observed. For all techniques tested, the use of a consensus vote of the 13 different models derived improved the quality of the predictions in terms of accuracy, precision, recall, and F1 value. Support-vector machines, followed by the GA/kNN combination, outperformed the other techniques when comparing the average of individual models. By using the respective majority votes, the prediction of neural networks yielded the highest F1 value, followed by SVMs.

Keywords: biosvm chemoinformatics
[Breslin2005Signal] T. Breslin, M. Krogh, C. Peterson, and C. Troein. Signal transduction pathway profiling of individual tumor samples. BMC Bioinformatics, 6:163, 2005. [ bib | DOI | http | .pdf ]
Signal transduction pathways convey information from the outside of the cell to transcription factors, which in turn regulate gene expression. Our objective is to analyze tumor gene expression data from microarrays in the context of such pathways. We use pathways compiled from the TRANSPATH/TRANSFAC databases and the literature, and three publicly available cancer microarray data sets. Variation in pathway activity, across the samples, is gauged by the degree of correlation between downstream targets of a pathway. Two correlation scores are applied; one considers all pairs of downstream targets, and the other considers only pairs without common transcription factors. Several pathways are found to be differentially active in the data sets using these scores. Moreover, we devise a score for pathway activity in individual samples, based on the average expression value of the downstream targets. Statistical significance is assigned to the scores using permutation of genes as null model. Hence, for individual samples, the status of a pathway is given as a sign, + or -, and a p-value. This approach defines a projection of high-dimensional gene expression data onto low-dimensional pathway activity scores. For each dataset and many pathways we find a much larger number of significant samples than expected by chance. Finally, we find that several sample-wise pathway activities are significantly associated with clinical classifications of the samples. This study shows that it is feasible to infer signal transduction pathway activity, in individual samples, from gene expression data. Furthermore, these pathway activities are biologically relevant in the three cancer data sets.

Keywords: csbcbook-ch4
[Brennecke2005Principles] Julius Brennecke, Alexander Stark, Robert B Russell, and Stephen M Cohen. Principles of microRNA-target recognition. PLoS Biol, 3(3):e85, Mar 2005. [ bib | DOI | http ]
MicroRNAs (miRNAs) are short non-coding RNAs that regulate gene expression in plants and animals. Although their biological importance has become clear, how they recognize and regulate target genes remains less well understood. Here, we systematically evaluate the minimal requirements for functional miRNA-target duplexes in vivo and distinguish classes of target sites with different functional properties. Target sites can be grouped into two broad categories. 5' dominant sites have sufficient complementarity to the miRNA 5' end to function with little or no support from pairing to the miRNA 3' end. Indeed, sites with 3' pairing below the random noise level are functional given a strong 5' end. In contrast, 3' compensatory sites have insufficient 5' pairing and require strong 3' pairing for function. We present examples and genome-wide statistical support to show that both classes of sites are used in biologically relevant genes. We provide evidence that an average miRNA has approximately 100 target sites, indicating that miRNAs regulate a large fraction of protein-coding genes and that miRNA 3' ends are key determinants of target specificity within miRNA families.

Keywords: sirna
[Brein2005Inparanoid] K. Brein, M. Remm, and E. Sonnhammer. Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res., 33, 2005. [ bib ]
[Brahnam2005Machine] Sheryl Brahnam, Chao-Fa Chuang, Frank Y Shih, and Melinda R Slack. Machine recognition and representation of neonatal facial displays of acute pain. Artif. Intell. Med., Jun 2005. [ bib | DOI | http | .pdf ]
OBJECTIVE: It has been reported in medical literature that health care professionals have difficulty distinguishing a newborn's facial expressions of pain from facial reactions to other stimuli. Although a number of pain instruments have been developed to assist health professionals, studies demonstrate that health professionals are not entirely impartial in their assessment of pain and fail to capitalize on all the information exhibited in a newborn's facial displays. This study tackles these problems by applying three different state-of-the-art face classification techniques to the task of distinguishing a newborn's facial expressions of pain. METHODS: The facial expressions of 26 neonates between the ages of 18 h and 3 days old were photographed experiencing the pain of a heel lance and a variety of stressors, including transport from one crib to another (a disturbance that can provoke crying that is not in response to pain), an air stimulus on the nose, and friction on the external lateral surface of the heel. Three face classification techniques, principal component analysis (PCA), linear discriminant analysis (LDA), and support vector machine (SVM), were used to classify the faces. RESULTS: In our experiments, the best recognition rates of pain versus nonpain (88.00%), pain versus rest (94.62%), pain versus cry (80.00%), pain versus air puff (83.33%), and pain versus friction (93.00%) were obtained from an SVM with a polynomial kernel of degree 3. The SVM outperformed two commonly used methods in face classification: PCA and LDA, each using the L(1) distance metric. CONCLUSION: The results of this study indicate that the application of face classification techniques in pain assessment and management is a promising area of investigation.

Keywords: Artificial Intelligence, Conservation of Natural Resources, Decision Support Techniques, Ecosystem, Environment, Forestry, Regression Analysis, Spain, 15979291
[Bradford2005Improved] James R Bradford and David R Westhead. Improved prediction of protein-protein binding sites using a support vector machines approach. Bioinformatics, 21(8):1487-94, Apr 2005. [ bib | DOI | http | .pdf ]
MOTIVATION: Structural genomics projects are beginning to produce protein structures with unknown function; therefore, accurate, automated predictors of protein function are required if all these structures are to be properly annotated in reasonable time. Identifying the interface between two interacting proteins provides important clues to the function of a protein and can reduce the search space required by docking algorithms to predict the structures of complexes. RESULTS: We have combined a support vector machine (SVM) approach with surface patch analysis to predict protein-protein binding sites. Using a leave-one-out cross-validation procedure, we were able to successfully predict the location of the binding site on 76% of our dataset made up of proteins with both transient and obligate interfaces. With heterogeneous cross-validation, where we trained the SVM on transient complexes to predict on obligate complexes (and vice versa), we still achieved success rates comparable to the leave-one-out cross-validation, suggesting that sufficient properties are shared between transient and obligate interfaces. AVAILABILITY: A web application based on the method can be found at http://www.bioinformatics.leeds.ac.uk/ppipred. The dataset of 180 proteins used in this study is also available via the same web site. CONTACT: westhead@bmb.leeds.ac.uk SUPPLEMENTARY INFORMATION: http://www.bioinformatics.leeds.ac.uk/ppi-pred/supp-material.

Keywords: biosvm
[Boyle2005Cancer] P. Boyle and J. Ferlay. Cancer incidence and mortality in Europe, 2004. Ann. Oncol., 16(3):481-488, Mar 2005. [ bib | DOI | http | .pdf ]
BACKGROUND: There are no recent estimates of the incidence and mortality from cancer at a European level. Those data that are available generally refer to the mid-1990s and are of limited use for cancer control planning. We present estimates of the cancer burden in Europe in 2004, including data for the (25 Member States) European Union. METHODS: The most recent sources of incidence and mortality data available in the Descriptive Epidemiology Group at IARC were applied to population projections to derive the best estimates of the burden of cancer, in terms of incidence and mortality, for Europe in 2004. RESULTS: In 2004 in Europe, there were an estimated 2,886,800 incident cases of cancer diagnosed and 1,711,000 cancer deaths. The most common incident form of cancer was lung cancer (13.3% of all incident cases), followed by colorectal cancer (13.2%) and breast cancer (13%). Lung cancer was also the most common cause of cancer death (341,800 deaths), followed by colorectal (203,700), stomach (137,900) and breast (129,900). CONCLUSIONS: With an estimated 2.9 million new cases (54% occurring in men, 46% in women) and 1.7 million deaths (56% in men, 44% in women) each year, cancer remains an important public health problem in Europe, and the ageing of the European population will cause these numbers to continue to increase even if age-specific rates remain constant. To make great progress quickly against cancer in Europe, the need is evident to make a concerted attack on the big killers: lung, colorectal, breast and stomach cancer. Stomach cancer rates are falling everywhere in Europe and public health measures are available to reduce the incidence and mortality of lung cancer, colorectal cancer and breast cancer.

Keywords: breastcancer
[Bowd2005Relevance] Christopher Bowd, Felipe A Medeiros, Zuohua Zhang, Linda M Zangwill, Jiucang Hao, Te-Won Lee, Terrence J Sejnowski, Robert N Weinreb, and Michael H Goldbaum. Relevance vector machine and support vector machine classifier analysis of scanning laser polarimetry retinal nerve fiber layer measurements. Invest Ophthalmol Vis Sci, 46(4):1322-9, Apr 2005. [ bib | DOI | http | .pdf ]
PURPOSE: To classify healthy and glaucomatous eyes using relevance vector machine (RVM) and support vector machine (SVM) learning classifiers trained on retinal nerve fiber layer (RNFL) thickness measurements obtained by scanning laser polarimetry (SLP). METHODS: Seventy-two eyes of 72 healthy control subjects (average age = 64.3 +/- 8.8 years, visual field mean deviation = -0.71 +/- 1.2 dB) and 92 eyes of 92 patients with glaucoma (average age = 66.9 +/- 8.9 years, visual field mean deviation = -5.32 +/- 4.0 dB) were imaged with SLP with variable corneal compensation (GDx VCC; Laser Diagnostic Technologies, San Diego, CA). RVM and SVM learning classifiers were trained and tested on SLP-determined RNFL thickness measurements from 14 standard parameters and 64 sectors (approximately 5.6 degrees each) obtained in the circumpapillary area under the instrument-defined measurement ellipse (total 78 parameters). Ten-fold cross-validation was used to train and test RVM and SVM classifiers on unique subsets of the full 164-eye data set and areas under the receiver operating characteristic (AUROC) curve for the classification of eyes in the test set were generated. AUROC curve results from RVM and SVM were compared to those for 14 SLP software-generated global and regional RNFL thickness parameters. Also reported was the AUROC curve for the GDx VCC software-generated nerve fiber indicator (NFI). RESULTS: The AUROC curves for RVM and SVM were 0.90 and 0.91, respectively, and increased to 0.93 and 0.94 when the training sets were optimized with sequential forward and backward selection (resulting in reduced dimensional data sets). AUROC curves for optimized RVM and SVM were significantly larger than those for all individual SLP parameters. The AUROC curve for the NFI was 0.87. CONCLUSIONS: Results from RVM and SVM trained on SLP RNFL thickness measurements are similar and provide accurate classification of glaucomatous and healthy eyes. RVM may be preferable to SVM, because it provides a Bayesian-derived probability of glaucoma as an output. These results suggest that these machine learning classifiers show good potential for glaucoma diagnosis.

Keywords: 80 and over, Aged, Algorithms, Area Under Curve, Cross-Sectional Studies, Diagnostic Imaging, Diagnostic Techniques, Glaucoma, Humans, Lasers, Middle Aged, Nerve Fibers, Non-U.S. Gov't, Ophthalmological, Optic Nerve Diseases, P.H.S., ROC Curve, Research Support, Retinal Ganglion Cells, Sensitivity and Specificity, U.S. Gov't, 15790898
[Borgwardt2005Shortest-Path] Karsten M. Borgwardt and Hans-Peter Kriegel. Shortest-path kernels on graphs. In ICDM '05: Proceedings of the Fifth IEEE International Conference on Data Mining, pages 74-81, Washington, DC, USA, 2005. IEEE Computer Society. [ bib | DOI | .pdf ]
Keywords: chemoinformatics kernel-theory
[Borgwardt2005Protein] K.M. Borgwardt, C.S. Ong, S. Schönauer, S.V.N. Vishwanathan, A.J. Smola, and H.-P. Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(Suppl. 1):i47-i56, Jun 2005. [ bib | DOI | http | .pdf ]
MOTIVATION: Computational approaches to protein function prediction infer protein function by finding proteins with similar sequence, structure, surface clefts, chemical properties, amino acid motifs, interaction partners or phylogenetic profiles. We present a new approach that combines sequential, structural and chemical information into one graph model of proteins. We predict functional class membership of enzymes and non-enzymes using graph kernels and support vector machine classification on these protein graphs. RESULTS: Our graph model, derivable from protein sequence and structure only, is competitive with vector models that require additional protein information, such as the size of surface pockets. If we include this extra information into our graph model, our classifier yields significantly higher accuracy levels than the vector models. Hyperkernels allow us to select and to optimally combine the most relevant node attributes in our protein graphs. We have laid the foundation for a protein function prediction system that integrates protein information from various sources efficiently and effectively. AVAILABILITY: More information available via www.dbs.ifi.lmu.de/Mitarbeiter/borgwardt.html. CONTACT: borgwardt@dbs.ifi.lmu.de.

Keywords: biosvm
[Bordner2005Statistical] Andrew J Bordner and Ruben Abagyan. Statistical analysis and prediction of protein-protein interfaces. Proteins, 60(3):353-66, Aug 2005. [ bib | DOI | http | .pdf ]
Predicting protein-protein interfaces from a three-dimensional structure is a key task of computational structural proteomics. In contrast to geometrically distinct small molecule binding sites, protein-protein interfaces are notoriously difficult to predict. We generated a large nonredundant data set of 1494 true protein-protein interfaces using biological symmetry annotation where necessary. The data set was carefully analyzed and a Support Vector Machine was trained on a combination of a new robust evolutionary conservation signal with the local surface properties to predict protein-protein interfaces. Fivefold cross validation verifies the high sensitivity and selectivity of the model. As much as 97% of the predicted patches had an overlap with the true interface patch while only 22% of the surface residues were included in an average predicted patch. The model allowed the identification of potential new interfaces and the correction of mislabeled oligomeric states.

Keywords: biosvm
[Boese2005Mechanistic] Q. Boese, D. Leake, A. Reynolds, S. Read, S.A. Scaringe, W.S. Marshall, and A. Khvorova. Mechanistic insights aid computational short interfering RNA design. Methods Enzymol., 392:73-96, 2005. [ bib | DOI | http ]
RNA interference is widely recognized for its utility as a functional genomics tool. In the absence of reliable target site selection tools, however, the impact of RNA interference (RNAi) may be diminished. The primary determinants of silencing are influenced by highly coordinated RNA-protein interactions that occur throughout the RNAi process, including short interfering RNA (siRNA) binding and unwinding followed by target recognition, cleavage, and subsequent product release. Recently developed strategies for identification of functional siRNAs reveal that thermodynamic and siRNA sequence-specific properties are crucial to predict functional duplexes (Khvorova et al., 2003; Reynolds et al., 2004; Schwarz et al., 2003). Additional assessments of siRNA specificity reveal that more sophisticated sequence comparison tools are also required to minimize potential off-target effects (Jackson et al., 2003; Semizarov et al., 2003). This chapter reviews the biological basis for current computational design tools and how best to utilize and assess their predictive capabilities for selecting functional and specific siRNAs.

Keywords: sirna
[Bock2005Virtual] J.R. Bock and D.A. Gough. Virtual screen for ligands of orphan G protein-coupled receptors. J. Chem. Inform. Model., 45(5):1402-1414, 2005. [ bib | DOI | http | .pdf ]
This paper describes a virtual screening methodology that generates a ranked list of high-binding small molecule ligands for orphan G protein-coupled receptors (oGPCRs), circumventing the requirement for receptor three-dimensional structure determination. Features representing the receptor are based only on physicochemical properties of primary amino acid sequence, and ligand features use the two-dimensional atomic connection topology and atomic properties. An experimental screen comprised nearly 2 million hypothetical oGPCR-ligand complexes, from which it was observed that the top 1.96% predicted affinity scores corresponded to "highly active" ligands against orphan receptors. Results representing predicted high-scoring novel ligands for many oGPCRs are presented here. Validation of the method was carried out in several ways: (1) A random permutation of the structure-activity relationship of the training data was carried out; by comparing test statistic values of the randomized and nonshuffled data, we conclude that the value obtained with nonshuffled data is unlikely to have been encountered by chance. (2) Biological activities linked to the compounds with high cross-target binding affinity were analyzed using computed log-odds from a structure-based program. This information was correlated with literature citations where GPCR-related pathways or processes were linked to the bioactivity in question. (3) Anecdotal, out-of-sample predictions for nicotinic targets and known ligands were performed, with good accuracy in the low-to-high "active" binding range. (4) An out-of-sample consistency check using the commercial antipsychotic drug olanzapine produced "active" to "highly-active" predicted affinities for all oGPCRs in our study, an observation that is consistent with documented findings of cross-target affinity of this compound for many different GPCRs. 
It is suggested that this virtual screening approach may be used in support of the functional characterization of oGPCRs by identifying potential cognate ligands. Ultimately, this approach may have implications for pharmaceutical therapies to modulate the activity of faulty or disease-related cellular signaling pathways. In addition to application to cell surface receptors, this approach is a generalized strategy for discovery of small molecules that may bind intracellular enzymes and involve protein-protein interactions.

Keywords: chemogenomics
[Blaveri2005Bladder] E. Blaveri, J.L. Brewer, R. Roydasgupta, J. Fridlyand, S. DeVries, T. Koppie, S. Pejavar, K. Mehta, P. Carroll, J.P. Simko, and F.M. Waldman. Bladder cancer stage and outcome by array-based comparative genomic hybridization. Clin. Cancer Res., 11(19 Pt 1):7012-7022, Oct 2005. [ bib | DOI | http ]
PURPOSE: Bladder carcinogenesis is believed to follow alternative pathways of disease progression driven by an accumulation of genetic alterations. The purpose of this study was to evaluate associations between measures of genomic instability and bladder cancer clinical phenotype. EXPERIMENTAL DESIGN: Genome-wide copy number profiles were obtained for 98 bladder tumors of diverse stages (29 pT(a), 14 pT1, 55 pT(2-4)) and grades (21 low-grade and 8 high-grade superficial tumors) by array-based comparative genomic hybridization (CGH). Each array contained 2,464 bacterial artificial chromosome and P1 clones, providing an average resolution of 1.5 Mb across the genome. A total of 54 muscle-invasive cases had follow-up information available. Overall outcome analysis was done for patients with muscle-invasive tumors having "good" (alive >2 years) versus "bad" (dead in <2 years) prognosis. RESULTS: Array CGH analysis showed significant increases in copy number alterations and genomic instability with increasing stage and with outcome. The fraction of genome altered (FGA) was significantly different between tumors of different stages (pT(a) versus pT1, P = 0.0003; pT(a) versus pT(2-4), P = 0.02; and pT1 versus pT(2-4), P = 0.03). Individual clones that differed significantly between different tumor stages were identified after adjustment for multiple comparisons (false discovery rate < 0.05). For muscle-invasive tumors, the FGA was associated with patient outcome (bad versus good prognosis patients, P = 0.002) and was identified as the only independent predictor of overall outcome based on a multivariate Cox proportional hazards method. Unsupervised hierarchical clustering separated "good" and "bad" prognosis muscle-invasive tumors into clusters that showed significant association with FGA and survival (Kaplan-Meier, P = 0.019). Supervised tumor classification (prediction analysis for microarrays) had a 71% classification success rate based on 102 unique clones. 
CONCLUSIONS: Array-based CGH identified quantitative and qualitative differences in DNA copy number alterations at high resolution according to tumor stage and grade. Fraction genome altered was associated with worse outcome in muscle-invasive tumors, independent of other clinicopathologic parameters. Measures of genomic instability add independent power to outcome prediction of bladder tumors.

Keywords: Chromosome Mapping; Chromosomes, Artificial, Bacterial; Cluster Analysis; DNA; Disease Progression; Gene Deletion; Gene Expression Profiling; Gene Expression Regulation, Neoplastic; Genome; Humans; Image Processing, Computer-Assisted; Linkage (Genetics); Multivariate Analysis; Nucleic Acid Hybridization; Oligonucleotide Array Sequence Analysis; Phenotype; Prognosis; Proportional Hazards Models; Time Factors; Treatment Outcome; Urinary Bladder Neoplasms
[Bhasin2005Pcleavage] M. Bhasin and G.P.S. Raghava. Pcleavage: an SVM based method for prediction of constitutive proteasome and immunoproteasome cleavage sites in antigenic sequences. Nucleic Acids Res., 33(Web Server issue):W202-7, Jul 2005. [ bib | DOI | http | .pdf ]
This manuscript describes a support vector machine based method for the prediction of constitutive as well as immunoproteasome cleavage sites in antigenic sequences. This method achieved Matthews correlation coefficients of 0.54 and 0.43 on in vitro and major histocompatibility complex ligand data, respectively. This shows that the performance of our method is comparable to that of the NetChop method, which is currently considered to be the best method for proteasome cleavage site prediction. Based on the method, a web server, Pcleavage, has also been developed. This server accepts protein sequences in any standard format and presents results in a user-friendly format. The server is available for free use by all academic users at the URL http://www.imtech.res.in/raghava/pcleavage/ or http://bioinformatics.uams.edu/mirror/pcleavage/.

Keywords: biosvm immunoinformatics
[Bhasin2005GPCRsclass] M. Bhasin and G.P.S. Raghava. GPCRsclass: a web tool for the classification of amine type of G-protein-coupled receptors. Nucleic Acids Res., 33(Web Server issue):W143-7, Jul 2005. [ bib | DOI | http | .pdf ]
The receptors of the amine subfamily are major drug targets for therapy of nervous disorders and psychiatric diseases. The recognition of novel amine-type receptors and their cognate ligands is of paramount interest for pharmaceutical companies. In the past, Chou and co-workers have shown that different types of amine receptors are correlated with their amino acid composition and are predictable on its basis with considerable accuracy [Elrod and Chou (2002) Protein Eng., 15, 713-715]. This motivated us to develop a better method for the recognition of novel amine receptors and for their further classification. The method was developed on the basis of amino acid composition and dipeptide composition of proteins using a support vector machine. The method was trained and tested on 167 proteins of the amine subfamily of G-protein-coupled receptors (GPCRs). The method discriminated the amine subfamily of GPCRs from globular proteins with Matthews correlation coefficients of 0.98 and 0.99 using amino acid composition and dipeptide composition, respectively. In classifying different types of amine receptors using amino acid composition and dipeptide composition, the method achieved an accuracy of 89.8 and 96.4%, respectively. The performance of the method was evaluated using 5-fold cross-validation. The dipeptide composition based method predicted 67.6% of protein sequences with an accuracy of 100% with a reliability index ≥5. A web server, GPCRsclass, has been developed for predicting amine-binding receptors from their amino acid sequences [http://www.imtech.res.in/raghava/gpcrsclass/ and http://bioinformatics.uams.edu/raghava/gpersclass/ (mirror site)].

Keywords: biosvm
[Bernardo2005Chemogenomica] D. di Bernardo, M.J. Thompson, T.S. Gardner, S.E. Chobot, E.L. Eastwood, A.P. Wojtovich, S.J. Elliott, S.E. Schaus, and J.J. Collins. Chemogenomic profiling on a genome-wide scale using reverse-engineered gene networks. Nat Biotechnol, 23(3):377-383, Mar 2005. [ bib | DOI | http ]
A major challenge in drug discovery is to distinguish the molecular targets of a bioactive compound from the hundreds to thousands of additional gene products that respond indirectly to changes in the activity of the targets. Here, we present an integrated computational-experimental approach for computing the likelihood that gene products and associated pathways are targets of a compound. This is achieved by filtering the mRNA expression profile of compound-exposed cells using a reverse-engineered model of the cell's gene regulatory network. We apply the method to a set of 515 whole-genome yeast expression profiles resulting from a variety of treatments (compounds, knockouts and induced expression), and correctly enrich for the known targets and associated pathways in the majority of compounds examined. We demonstrate our approach with PTSB, a growth inhibitory compound with a previously unknown mode of action, by predicting and validating thioredoxin and thioredoxin reductase as its target.

Keywords: Algorithms; Artificial Intelligence; Computer Simulation; Drug Delivery Systems; Drug Design; Gene Expression Profiling; Gene Expression Regulation; Models, Biological; Models, Statistical; Protein Engineering; Protein Interaction Mapping; Saccharomyces cerevisiae; Saccharomyces cerevisiae Proteins; Signal Transduction; Thioredoxin-Disulfide Reductase; Thioredoxins
[Ben-Hur2005Kernel] A. Ben-Hur and W.S. Noble. Kernel methods for predicting protein-protein interactions. Bioinformatics, 21(Suppl. 1):i38-i46, Jun 2005. [ bib | DOI | http | .pdf ]
MOTIVATION: Despite advances in high-throughput methods for discovering protein-protein interactions, the interaction networks of even well-studied model organisms are sketchy at best, highlighting the continued need for computational methods to help direct experimentalists in the search for novel interactions. RESULTS: We present a kernel method for predicting protein-protein interactions using a combination of data sources, including protein sequences, Gene Ontology annotations, local properties of the network, and homologous interactions in other species. Whereas protein kernels proposed in the literature provide a similarity between single proteins, prediction of interactions requires a kernel between pairs of proteins. We propose a pairwise kernel that converts a kernel between single proteins into a kernel between pairs of proteins, and we illustrate the kernel's effectiveness in conjunction with a support vector machine classifier. Furthermore, we obtain improved performance by combining several sequence-based kernels based on k-mer frequency, motif and domain content and by further augmenting the pairwise sequence kernel with features that are based on other sources of data. We apply our method to predict physical interactions in yeast using data from the BIND database. At a false positive rate of 1% the classifier retrieves close to 80% of a set of trusted interactions. We thus demonstrate the ability of our method to make accurate predictions despite the sizeable fraction of false positives that are known to exist in interaction databases. AVAILABILITY: The classification experiments were performed using PyML available at http://pyml.sourceforge.net. Data are available at: http://noble.gs.washington.edu/proj/sppi CONTACT: asa@gs.washington.edu.

Keywords: biosvm
[Begg2005machine] R. Begg and J. Kamruzzaman. A machine learning approach for automated recognition of movement patterns using basic, kinetic and kinematic gait data. J Biomech, 38(3):401-8, Mar 2005. [ bib | DOI | http | .pdf ]
This paper investigated application of a machine learning approach (support vector machine, SVM) for the automatic recognition of gait changes due to ageing using three types of gait measures: basic temporal/spatial, kinetic and kinematic. The gaits of 12 young and 12 elderly participants were recorded and analysed using a synchronized PEAK motion analysis system and a force platform during normal walking. Altogether, 24 gait features describing the three types of gait characteristics were extracted for developing gait recognition models and later testing of generalization performance. Test results indicated an overall accuracy of 91.7% by the SVM in its capacity to distinguish the two gait patterns. The classification ability of the SVM was found to be unaffected across six kernel functions (linear, polynomial, radial basis, exponential radial basis, multi-layer perceptron and spline). Gait recognition rate improved when features were selected from different gait data types. A feature selection algorithm demonstrated that as few as three gait features, one selected from each data type, could effectively distinguish the age groups with 100% accuracy. These results demonstrate considerable potential in applying SVMs in gait classification for many applications.

[Beal2005Bayesian] M.J. Beal, F. Falciani, Z. Ghahramani, C. Rangel, and D.L. Wild. A Bayesian approach to reconstructing genetic regulatory networks with hidden factors. Bioinformatics, 21(3):349-356, Feb 2005. [ bib | DOI | http | .pdf ]
MOTIVATION: We have used state-space models (SSMs) to reverse engineer transcriptional networks from highly replicated gene expression profiling time series data obtained from a well-established model of T cell activation. SSMs are a class of dynamic Bayesian networks in which the observed measurements depend on some hidden state variables that evolve according to Markovian dynamics. These hidden variables can capture effects that cannot be directly measured in a gene expression profiling experiment, for example: genes that have not been included in the microarray, levels of regulatory proteins, the effects of mRNA and protein degradation, etc. RESULTS: We have approached the problem of inferring the model structure of these state-space models using both classical and Bayesian methods. In our previous work, a bootstrap procedure was used to derive classical confidence intervals for parameters representing 'gene-gene' interactions over time. In this article, variational approximations are used to perform the analogous model selection task in the Bayesian context. Certain interactions are present in both the classical and the Bayesian analyses of these regulatory networks. The resulting models place JunB and JunD at the centre of the mechanisms that control apoptosis and proliferation. These mechanisms are key for clonal expansion and for controlling the long term behavior (e.g. programmed cell death) of these cells. AVAILABILITY: Supplementary data is available at http://public.kgi.edu/wild/index.htm and Matlab source code for variational Bayesian learning of SSMs is available at http://www.cse.ebuffalo.edu/faculty/mbeal/software.html.

Keywords: biogm
[Bartlett2005Local] P.L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher Complexities. Ann. Stat., 33(4):1497-1537, 2005. [ bib | DOI | http | .pdf ]
[Bao2005Identifying] Lei Bao. Identifying genes related to chemosensitivity using support vector machine. Methods Mol Med, 111:233-40, 2005. [ bib ]
In an effort to identify genes involved in chemosensitivity and to evaluate the functional relationships between genes and anticancer drugs acting by the same mechanism, a supervised machine learning approach called support vector machine (SVM) is used to associate genes with any of five predefined anticancer drug mechanistic categories. The drug activity profiles are used as training examples to train the SVM and then the gene expression profiles are used as test examples to predict their associated mechanistic categories. This method of correlating drugs and genes provides a strategy for finding novel biologically significant relationships for molecular pharmacology.

Keywords: biosvm
[Bagos2005Evaluation] Pantelis G Bagos, Theodore D Liakopoulos, and Stavros J Hamodrakas. Evaluation of methods for predicting the topology of beta-barrel outer membrane proteins and a consensus prediction method. BMC Bioinformatics, 6(1):7, Jan 2005. [ bib | DOI | http | .pdf ]
BACKGROUND: Prediction of the transmembrane strands and topology of beta-barrel outer membrane proteins is of interest in current bioinformatics research. Several methods have been applied so far for this task, utilizing different algorithmic techniques and a number of freely available predictors exist. The methods can be grossly divided to those based on Hidden Markov Models (HMMs), on Neural Networks (NNs) and on Support Vector Machines (SVMs). In this work, we compare the different available methods for topology prediction of beta-barrel outer membrane proteins. We evaluate their performance on a non-redundant dataset of 20 beta-barrel outer membrane proteins of gram-negative bacteria, with structures known at atomic resolution. Also, we describe, for the first time, an effective way to combine the individual predictors, at will, to a single consensus prediction method. RESULTS: We assess the statistical significance of the performance of each prediction scheme and conclude that Hidden Markov Model based methods, HMM-B2TMR, ProfTMB and PRED-TMBB, are currently the best predictors, according to either the per-residue accuracy, the segments overlap measure (SOV) or the total number of proteins with correctly predicted topologies in the test set. Furthermore, we show that the available predictors perform better when only transmembrane beta-barrel domains are used for prediction, rather than the precursor full-length sequences, even though the HMM-based predictors are not influenced significantly. The consensus prediction method performs significantly better than each individual available predictor, since it increases the accuracy up to 4% regarding SOV and up to 15% in correctly predicted topologies. 
CONCLUSIONS: The consensus prediction method described in this work optimizes the predicted topology with a dynamic programming algorithm and is implemented in a web-based application freely available to non-commercial users at http://bioinformatics.biol.uoa.gr/ConBBPRED.

Keywords: Algorithms, Cell Nucleus, Cytoplasm, Databases, Genetic Vectors, Humans, Internet, Mitochondria, Models, Non-U.S. Gov't, Peptides, Protein, Proteins, Proteomics, Reproducibility of Results, Research Support, Software, Theoretical, 15647112
[Bagga2005Quantitative] Harmohina Bagga, David S Greenfield, and William J Feuer. Quantitative assessment of atypical birefringence images using scanning laser polarimetry with variable corneal compensation. Am J Ophthalmol, 139(3):437-46, Mar 2005. [ bib | DOI | http | .pdf ]
PURPOSE: To define the clinical characteristics of atypical birefringence images and to describe a quantitative method for their identification. DESIGN: Prospective, comparative, clinical observational study. METHODS: Normal and glaucomatous eyes underwent complete examination, standard automated perimetry, scanning laser polarimetry with variable corneal compensation (GDx-VCC), and optical coherence tomography (OCT) of the macula, peripapillary retinal nerve fiber layer (RNFL), and optic disk. Eyes were classified into two groups: normal birefringence pattern (NBP) and atypical birefringence pattern (ABP). Clinical, functional, and structural characteristics were assessed separately. A multiple logistic regression model was used to predict eyes with ABP on the basis of a quantitative scan score generated by a support vector machine (SVM) with GDx-VCC. RESULTS: Sixty-five eyes of 65 patients were enrolled. ABP images were observed in 5 of 20 (25%) normal eyes and 23 of 45 (51%) glaucomatous eyes. Compared with eyes with NBP, glaucomatous eyes with ABP demonstrated significantly lower SVM scores (P < .0001, < 0.0001, 0.008, 0.03, and 0.03, respectively) and greater temporal, mean, inferior, and nasal RNFL thickness using GDx-VCC; and a weaker correlation with OCT generated RNFL thickness (R(2) = .75 vs .27). ABP images were significantly correlated with older age (R(2) = .16, P = .001). The SVM score was the only significant (P < .0001) predictor of ABP images and provided high discriminating power between eyes with NBP and ABP (area under the receiver operator characteristic curve = 0.98). CONCLUSIONS: ABP images exist in a subset of normal and glaucomatous eyes, are associated with older patient age, and produce an artifactual increase in RNFL thickness using GDx-VCC. The SVM score is highly predictive of ABP images.

Keywords: 80 and over, Adult, Aged, Algorithms, Amino Acids, Animals, Area Under Curve, Artifacts, Automated, Birefringence, Brain Chemistry, Brain Neoplasms, Comparative Study, Computer-Assisted, Cornea, Cross-Sectional Studies, Decision Trees, Diagnosis, Diagnostic Imaging, Diagnostic Techniques, Discriminant Analysis, Evolution, Face, Female, Genetic, Glaucoma, Humans, Intraocular Pressure, Lasers, Least-Squares Analysis, Magnetic Resonance Imaging, Magnetic Resonance Spectroscopy, Male, Middle Aged, Models, Molecular, Nerve Fibers, Non-U.S. Gov't, Numerical Analysis, Ophthalmological, Optic Nerve Diseases, Optical Coherence, P.H.S., Pattern Recognition, Photic Stimulation, Prospective Studies, Protein, ROC Curve, Regression Analysis, Research Support, Retinal Ganglion Cells, Sensitivity and Specificity, Sequence Analysis, Statistics, Tomography, U.S. Gov't, Visual Fields, beta-Lactamases, 15767051
[Bach2005Computing] F.R. Bach, R. Thibaux, and M.I. Jordan. Computing regularization paths for learning multiple kernels. In L.K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 73-80, Cambridge, MA, 2005. MIT Press. [ bib ]
[Bach2005Predictive] F.R. Bach and M.I. Jordan. Predictive low-rank decomposition for kernel methods. In ICML '05: Proceedings of the 22nd international conference on Machine learning, pages 33-40, New York, NY, USA, 2005. ACM. [ bib | DOI ]
Low-rank matrix decompositions are essential tools in the application of kernel methods to large-scale learning problems. These decompositions have generally been treated as black boxes (the decomposition of the kernel matrix that they deliver is independent of the specific learning task at hand), and this is a potentially significant source of inefficiency. In this paper, we present an algorithm that can exploit side information (e.g., classification labels, regression responses) in the computation of low-rank decompositions for kernel matrices. Our algorithm has the same favorable scaling as state-of-the-art methods such as incomplete Cholesky decomposition: it is linear in the number of data points and quadratic in the rank of the approximation. We present simulation results that show that our algorithm yields decompositions of significantly smaller rank than those found by incomplete Cholesky decomposition.

[Atalay2005Implicit] V. Atalay and R. Cetin-Atalay. Implicit motif distribution based hybrid computational kernel for sequence classification. Bioinformatics, 21(8):1429-1436, Apr 2005. [ bib | DOI | http | .pdf ]
MOTIVATION: We designed a general computational kernel for classification problems that require specific motif extraction and search from sequences. Instead of searching for explicit motifs, our approach finds the distribution of implicit motifs and uses it as a feature for classification. The implicit motif distribution approach may be used as a modus operandi for bioinformatics problems that require specific motif extraction and search, which is otherwise computationally prohibitive. RESULTS: A system named P2SL that infers protein subcellular targeting was developed through this computational kernel. The targeting signal was modeled by the distribution of subsequence occurrences (implicit motifs) using self-organizing maps. The boundaries among the classes were then determined with a set of support vector machines. The P2SL hybrid computational system achieved a prediction accuracy of approximately 81% over ER targeted, cytosolic, mitochondrial and nuclear protein localization classes. P2SL additionally offers the distribution potential of proteins among localization classes, which is particularly important for proteins that shuttle between nucleus and cytosol. AVAILABILITY: http://staff.vbi.vt.edu/volkan/p2sl and http://www.i-cancer.fen.bilkent.edu.tr/p2sl CONTACT: rengul@bilkent.edu.tr.

Keywords: biosvm
[Asefa2005Support] Tirusew Asefa, Mariush Kemblowski, Gilberto Urroz, and Mac McKee. Support vector machines (SVMs) for monitoring network design. Ground Water, 43(3):413-22, 2005. [ bib | DOI | http | .pdf ]
In this paper we present a hydrologic application of a new statistical learning methodology called support vector machines (SVMs). SVMs are based on minimization of a bound on the generalized error (risk) model, rather than just the mean square error over a training set. Due to Mercer's conditions on the kernels, the corresponding optimization problems are convex and hence have no local minima. In this paper, SVMs are illustratively used to reproduce the behavior of Monte Carlo-based flow and transport models that are in turn used in the design of a ground water contamination detection monitoring system. The traditional approach, which is based on solving transient transport equations for each new configuration of a conductivity field, is too time consuming in practical applications. Thus, there is a need to capture the behavior of the transport phenomenon in random media in a relatively simple manner. The objective of the exercise is to maximize the probability of detecting contaminants that exceed some regulatory standard before they reach a compliance boundary, while minimizing cost (i.e., number of monitoring wells). Application of the method at a generic site showed a rather promising performance, which leads us to believe that SVMs could be successfully employed in other areas of hydrology. The SVM was trained using 510 monitoring configuration samples generated from 200 Monte Carlo flow and transport realizations. The best configurations of well networks selected by the SVM were identical with the ones obtained from the physical model, but the reliabilities provided by the respective networks differ slightly.

Keywords: Adult, Aged, Aging, Algorithms, Apoptosis, Artificial Intelligence, Automated, Computer-Assisted, Female, Foot, Gait, Gene Expression Profiling, Humans, Image Interpretation, Male, Neoplasms, Non-U.S. Gov't, Oligonucleotide Array Sequence Analysis, Pattern Recognition, Polymerase Chain Reaction, Proteins, Reproducibility of Results, Research Support, Sensitivity and Specificity, Subcellular Fractions, Unknown Primary, 15882333
[Aronov2005Predictive] A.M. Aronov. Predictive in silico modeling for hERG channel blockers. Drug Discov. Today, 10(2):149-155, Jan 2005. [ bib | DOI | http | .pdf ]
hERG-mediated sudden death as a side effect of non-antiarrhythmic drugs has been receiving increased regulatory attention. Perhaps owing to the unique shape of the ligand-binding site and its hydrophobic character, the hERG channel has been shown to interact with pharmaceuticals of widely varying structure. Several in silico approaches have attempted to predict hERG channel blockade. Some of these approaches are aimed primarily at filtering out potential hERG blockers in the context of virtual libraries, others involve understanding structure-activity relationships governing hERG-drug interactions. This review summarizes the most recent efforts in this emerging field.

Keywords: chemoinformatics herg
[Arodz2005Pattern] Tomasz Arodź, Marcin Kurdziel, Erik O D Sevre, and David A Yuen. Pattern recognition techniques for automatic detection of suspicious-looking anomalies in mammograms. Comput. Methods Programs Biomed., 79(2):135-49, Aug 2005. [ bib | DOI | http | .pdf ]
We have employed two pattern recognition methods used commonly for face recognition in order to analyse digital mammograms. The methods are based on novel classification schemes, the AdaBoost and the support vector machines (SVM). A number of tests have been carried out to evaluate the accuracy of these two algorithms under different circumstances. Results for the AdaBoost classifier method are promising, especially for classifying mass-type lesions. In the best case the algorithm achieved accuracy of 76% for all lesion types and 90% for masses only. The SVM based algorithm did not perform as well. In order to achieve a higher accuracy for this method, we should choose image features that are better suited for analysing digital mammograms than the currently used ones.

Keywords: biosvm image
[Arimoto2005Development] Rieko Arimoto, Madhu-Ashni Prasad, and Eric M Gifford. Development of CYP3A4 inhibition models: comparisons of machine-learning techniques and molecular descriptors. J Biomol Screen, 10(3):197-205, Apr 2005. [ bib | DOI | http ]
Computational models of cytochrome P450 3A4 inhibition were developed based on high-throughput screening data for 4470 proprietary compounds. Multiple models differentiating inhibitors (IC50 < 3 µM) and noninhibitors were generated using various machine-learning algorithms (recursive partitioning [RP], Bayesian classifier, logistic regression, k-nearest-neighbor, and support vector machine [SVM]) with structural fingerprints and topological indices. Nineteen models were evaluated by internal 10-fold cross-validation and also by an independent test set. Three most predictive models, Barnard Chemical Information (BCI)-fingerprint/SVM, MDL-keyset/SVM, and topological indices/RP, correctly classified 249, 248, and 236 compounds of 291 noninhibitors and 135, 137, and 147 compounds of 179 inhibitors in the validation set. Their overall accuracies were 82%, 82%, and 81%, respectively. Investigating applicability of the BCI/SVM model found a strong correlation between the predictive performance and the structural similarity to the training set. Using Tanimoto similarity index as a confidence measurement for the predictions, the limitation of the extrapolation was 0.7 in the case of the BCI/SVM model. Taking consensus of the 3 best models yielded a further improvement in predictive capability, kappa = 0.65 and accuracy = 83%. The consensus model could also be tuned to minimize either false positives or false negatives depending on the emphasis of the screening.

Keywords: biosvm chemoinformatics
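
As a quick arithmetic check on the entry above, the reported overall accuracies can be recomputed from the validation-set counts quoted in the abstract (291 noninhibitors, 179 inhibitors). This is only a sketch: the helper name `overall_accuracy` and the dictionary layout are illustrative assumptions, not anything taken from the paper.

```python
# Validation-set sizes quoted in the abstract.
N_NONINHIBITORS = 291
N_INHIBITORS = 179

# (correctly classified noninhibitors, correctly classified inhibitors)
# for each of the three models, as quoted in the abstract.
CORRECT = {
    "BCI-fingerprint/SVM": (249, 135),
    "MDL-keyset/SVM": (248, 137),
    "topological indices/RP": (236, 147),
}


def overall_accuracy(correct_neg, correct_pos):
    """Fraction of all validation compounds classified correctly.

    Hypothetical helper, written for this check only.
    """
    return (correct_neg + correct_pos) / (N_NONINHIBITORS + N_INHIBITORS)


# Accuracy per model, rounded to a whole percentage.
ACCURACY_PERCENT = {
    name: round(100 * overall_accuracy(*counts))
    for name, counts in CORRECT.items()
}

if __name__ == "__main__":
    for name, pct in ACCURACY_PERCENT.items():
        print(f"{name}: {pct}%")
```

Under these counts the rounded values agree with the 82%, 82%, and 81% stated in the abstract.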
[Concorde:website] DL. Applegate, RE. Bixby, V. Chvatal, and WJ. Cook. Concorde TSP solver. http://www.tsp.gatech.edu/concorde.html, 2005. [ bib ]
[Aphinyanaphongs2005Text] Yindalon Aphinyanaphongs, Ioannis Tsamardinos, Alexander Statnikov, Douglas Hardin, and Constantin F Aliferis. Text categorization models for high-quality article retrieval in internal medicine. J. Am. Med. Inform. Assoc., 12(2):207-16, 2005. [ bib | DOI | http | .pdf ]
OBJECTIVE Finding the best scientific evidence that applies to a patient problem is becoming exceedingly difficult due to the exponential growth of medical publications. The objective of this study was to apply machine learning techniques to automatically identify high-quality, content-specific articles for one time period in internal medicine and compare their performance with previous Boolean-based PubMed clinical query filters of Haynes et al. DESIGN The selection criteria of the ACP Journal Club for articles in internal medicine were the basis for identifying high-quality articles in the areas of etiology, prognosis, diagnosis, and treatment. Naive Bayes, a specialized AdaBoost algorithm, and linear and polynomial support vector machines were applied to identify these articles. MEASUREMENTS The machine learning models were compared in each category with each other and with the clinical query filters using area under the receiver operating characteristic curves, 11-point average recall precision, and a sensitivity/specificity match method. RESULTS In most categories, the data-induced models have better or comparable sensitivity, specificity, and precision than the clinical query filters. The polynomial support vector machine models perform the best among all learning methods in ranking the articles as evaluated by area under the receiver operating curve and 11-point average recall precision. CONCLUSION This research shows that, using machine learning methods, it is possible to automatically build models for retrieving high-quality, content-specific articles using inclusion or citation by the ACP Journal Club as a gold standard in a given time period in internal medicine that perform better than the 1994 PubMed clinical query filters.

Keywords: biosvm nlp
[Aoki2005score] KF. Aoki, H. Mamitsuka, T. Akutsu, and M. Kanehisa. A score matrix to reveal the hidden links in glycans. Bioinformatics, 21(8):1457-63, Apr 2005. [ bib | DOI | http ]
MOTIVATION: Glycans are the third major class of biomolecules following DNA and proteins. They are extremely vital for the functioning of multicellular organisms. However, compared with the fast development of sequence analysis techniques, informatics work on glycans has a long way to go. Alignment algorithms for glycan tree structures are one of the foremost concerns. In addition, the statistical analysis of these algorithms in terms of biological significance needs to be addressed. RESULTS: We developed a tree-structure alignment algorithm for glycans and performed a statistical analysis of these alignment scores such that biologically interesting features could be captured into a score matrix for glycans. We generated our score matrix in a manner similar to BLOSUM, but with slight variations to accommodate our glycan data, including the incorporation of linkage information. We verified the effectiveness of our new glycan score matrix by illustrating how well the resulting score matrix entries correspond with biological knowledge. Future work for even better improvements with the use of a variety of score matrices for different subclasses of glycans due to their complexity is also discussed. CONTACT: mami@kuicr.kyoto-u.ac.jp SUPPLEMENTARY INFORMATION: The glycan score matrix can be downloaded from http://kanehisa.kuicr.kyoto-u.ac.jp/Paper/kcam/glycanMatrix0.1.txt.

Keywords: glycans
[Ando2005A] Rie Kubota Ando, Tong Zhang, and Peter Bartlett. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817-1853, 2005. [ bib ]
[Aires-de-Sousa2005Prediction] J. Aires-de-Sousa and J. Gasteiger. Prediction of enantiomeric excess in a combinatorial library of catalytic enantioselective reactions. J Comb Chem, 7(2):298-301, 2005. [ bib | DOI | http | .pdf ]
A quantitative structure-enantioselectivity relationship was established for a combinatorial library of enantioselective reactions performed by addition of diethyl zinc to benzaldehyde. Chiral catalysts and additives were encoded by their chirality codes and presented as input to neural networks. The networks were trained to predict the enantiomeric excess. With independent test sets, predictions of enantiomeric excess could be made with an average error as low as 6% ee. Multilinear regression, perceptrons, and support vector machines were also evaluated as modeling tools. The method is of interest for the computer-aided design of combinatorial libraries involving chiral compounds or enantioselective reactions. This is the first example of a quantitative structure-property relationship based on chirality codes.

Keywords: biosvm chemoinformatics
[Adie2005Speeding] EA. Adie, RR. Adams, KL. Evans, DJ. Porteous, and BS. Pickard. Speeding disease gene discovery by sequence based candidate prioritization. BMC Bioinformatics, 6:55, 2005. [ bib | DOI | http | .pdf ]
BACKGROUND: Regions of interest identified through genetic linkage studies regularly exceed 30 centimorgans in size and can contain hundreds of genes. Traditionally this number is reduced by matching functional annotation to knowledge of the disease or phenotype in question. However, here we show that disease genes share patterns of sequence-based features that can provide a good basis for automatic prioritization of candidates by machine learning. RESULTS: We examined a variety of sequence-based features and found that for many of them there are significant differences between the sets of genes known to be involved in human hereditary disease and those not known to be involved in disease. We have created an automatic classifier called PROSPECTR based on those features using the alternating decision tree algorithm which ranks genes in the order of likelihood of involvement in disease. On average, PROSPECTR enriches lists for disease genes two-fold 77% of the time, five-fold 37% of the time and twenty-fold 11% of the time. CONCLUSION: PROSPECTR is a simple and effective way to identify genes involved in Mendelian and oligogenic disorders. It performs markedly better than the single existing sequence-based classifier on novel data. PROSPECTR could save investigators looking at large regions of interest time and effort by prioritizing positional candidate genes for mutation detection and case-control association studies.

[Natraj2005Three] N. Iyer, S. Jayanti, K. Lou, Y. Kalyanaraman, and K. Ramani. Three-dimensional shape searching: state-of-the-art review and future trends. Computer-Aided Design, 37(5):509-530, April 2005. [ bib | DOI | http ]
Three-dimensional shape searching is a problem of current interest in several different fields. Most techniques have been developed for a particular domain and reduce a shape into a simpler shape representation. The techniques developed for a particular domain will also find applications in other domains. We classify and compare various 3D shape searching techniques based on their shape representations. A brief description of each technique is provided followed by a detailed survey of the state-of-the-art. The paper concludes by identifying gaps in current shape search techniques and directions for future research.

Keywords: 3d-feature-extraction, cad, feature-extraction, object-modeling, object-representation, object-retrieval, pattern-recognition, search-benchmark, survey
[Zhang2005Improved] Qidong Zhang, Sukjoon Yoon, and William J Welsh. Improved method for predicting beta-turn using support vector machine. Bioinformatics, 21(10):2370-4, May 2005. [ bib | DOI | http | .pdf ]
MOTIVATION: Numerous methods for predicting beta-turns in proteins have been developed based on various computational schemes. Here, we introduce a new method of beta-turn prediction that uses the support vector machine (SVM) algorithm together with predicted secondary structure information. Various parameters from the SVM have been adjusted to achieve optimal prediction performance. RESULTS: The SVM method achieved excellent performance as measured by the Matthews correlation coefficient (MCC = 0.45) using a 7-fold cross validation on a database of 426 non-homologous protein chains. To our best knowledge, this MCC value is the highest achieved so far for predicting beta-turn. The overall prediction accuracy Qtotal was 77.3%, which is the best among the existing prediction methods. Among its unique attractive features, the present SVM method avoids overtraining and compresses information and provides a predicted reliability index.

Keywords: biosvm
[Yu2005Ovarian] JS. Yu, S. Ongarello, R. Fiedler, XW. Chen, G. Toffolo, C. Cobelli, and Z. Trajanoski. Ovarian cancer identification based on dimensionality reduction for high-throughput mass spectrometry data. Bioinformatics, 21(10):2200-9, May 2005. [ bib | DOI | http | .pdf ]
MOTIVATION: High-throughput and high-resolution mass spectrometry instruments are increasingly used for disease classification and therapeutic guidance. However, the analysis of immense amount of data poses considerable challenges. We have therefore developed a novel method for dimensionality reduction and tested on a published ovarian high-resolution SELDI-TOF dataset. RESULTS: We have developed a four-step strategy for data preprocessing based on: (1) binning, (2) Kolmogorov-Smirnov test, (3) restriction of coefficient of variation and (4) wavelet analysis. Subsequently, support vector machines were used for classification. The developed method achieves an average sensitivity of 97.38% (sd = 0.0125) and an average specificity of 93.30% (sd = 0.0174) in 1000 independent k-fold cross-validations, where k = 2, ..., 10. AVAILABILITY: The software is available for academic and non-commercial institutions.

Keywords: biosvm proteomics
[Yeh2005Liver] Wen-Chun Yeh, Yung-Ming Jeng, Cheng-Han Li, Po-Huang Lee, and Pai-Chi Li. Liver steatosis classification using high-frequency ultrasound. Ultrasound Med Biol, 31(5):599-605, May 2005. [ bib | DOI | http | .pdf ]
High-frequency B-mode images of 19 fresh human liver samples were obtained to evaluate their usefulness in determining the steatosis grade. The images were acquired by a mechanically controlled single-crystal probe at 25 MHz. Image features derived from gray-level concurrence and nonseparable wavelet transform were extracted to classify steatosis grade using a classifier known as the support vector machine. A subsequent histologic examination of each liver sample graded the steatosis from 0 to 3. The four grades were then combined into two, three and four classes. The classification results were correlated with histology. The best classification accuracies of the two, three and four classes were 90.5%, 85.8% and 82.6%, respectively, which were markedly better than those at 7 MHz. These results indicate that liver steatosis can be more accurately characterized using high-frequency B-mode ultrasound. Limitations and their potential solutions of applying high-frequency ultrasound to liver imaging are also discussed.

[Yang2005Prediction] Zheng Rong Yang. Prediction of caspase cleavage sites using Bayesian bio-basis function neural networks. Bioinformatics, 21(9):1831-7, May 2005. [ bib | DOI | http | .pdf ]
MOTIVATION: Apoptosis has drawn the attention of researchers because of its importance in treating some diseases through finding a proper way to block or slow down the apoptosis process. Having understood that caspase cleavage is the key to apoptosis, we find novel methods or algorithms are essential for studying the specificity of caspase cleavage activity and this helps the effective drug design. As bio-basis function neural networks have proven to outperform some conventional neural learning algorithms, there is a motivation, in this study, to investigate the application of bio-basis function neural networks for the prediction of caspase cleavage sites. RESULTS: Thirteen protein sequences with experimentally determined caspase cleavage sites were downloaded from NCBI. Bayesian bio-basis function neural networks are investigated and the comparisons with single-layer perceptrons, multilayer perceptrons, the original bio-basis function neural networks and support vector machines are given. The impact of the sliding window size used to generate sub-sequences for modelling on prediction accuracy is studied. The results show that the Bayesian bio-basis function neural network with two Gaussian distributions for model parameters (weights) performed the best and the highest prediction accuracy is 97.15 +/- 1.13%. AVAILABILITY: The package of Bayesian bio-basis function neural network can be obtained by request to the author.

[Tothill2005expression-based] Richard W Tothill, Adam Kowalczyk, Danny Rischin, Alex Bousioutas, Izhak Haviv, Ryan K van Laar, Paul M Waring, John Zalcberg, Robyn Ward, Andrew V Biankin, Robert L Sutherland, Susan M Henshall, Kwun Fong, Jonathan R Pollack, David D L Bowtell, and Andrew J Holloway. An expression-based site of origin diagnostic method designed for clinical application to cancer of unknown origin. Cancer Res., 65(10):4031-40, May 2005. [ bib | DOI | http | .pdf ]
Gene expression profiling offers a promising new technique for the diagnosis and prognosis of cancer. We have applied this technology to build a clinically robust site of origin classifier with the ultimate aim of applying it to determine the origin of cancer of unknown primary (CUP). A single cDNA microarray platform was used to profile 229 primary and metastatic tumors representing 14 tumor types and multiple histologic subtypes. This data set was subsequently used for training and validation of a support vector machine (SVM) classifier, demonstrating 89% accuracy using a 13-class model. Further, we show the translation of a five-class classifier to a quantitative PCR-based platform. Selecting 79 optimal gene markers, we generated a quantitative-PCR low-density array, allowing the assay of both fresh-frozen and formalin-fixed paraffin-embedded (FFPE) tissue. Data generated using both quantitative PCR and microarray were subsequently used to train and validate a cross-platform SVM model with high prediction accuracy. Finally, we applied our SVM classifiers to 13 cases of CUP. We show that the microarray SVM classifier was capable of making high confidence predictions in 11 of 13 cases. These predictions were supported by comprehensive review of the patients' clinical histories.

Keywords: biosvm microarray
[Teramoto2005Prediction] Reiji Teramoto, Mikio Aoki, Toru Kimura, and Masaharu Kanaoka. Prediction of siRNA functionality using generalized string kernel and support vector machine. FEBS Lett., 579(13):2878-82, May 2005. [ bib | DOI | http | .pdf ]
Small interfering RNAs (siRNAs) are becoming widely used for sequence-specific gene silencing in mammalian cells, but designing an effective siRNA is still a challenging task. In this study, we developed an algorithm for predicting siRNA functionality by using generalized string kernel (GSK) combined with support vector machine (SVM). With GSK, siRNA sequences were represented as vectors in a multi-dimensional feature space according to the numbers of subsequences in each siRNA, and subsequently classified with SVM into effective or ineffective siRNAs. We applied this algorithm to published siRNAs, and could classify effective and ineffective siRNAs with 90.6%, 86.2% accuracy, respectively.

Keywords: sirna biosvm
[Takahashi2005Rigorous] Norikazu Takahashi and Tetsuo Nishi. Rigorous proof of termination of SMO algorithm for support vector machines. IEEE Trans Neural Netw, 16(3):774-6, May 2005. [ bib ]
Sequential minimal optimization (SMO) algorithm is one of the simplest decomposition methods for learning of support vector machines (SVMs). Keerthi and Gilbert have recently studied the convergence property of SMO algorithm and given a proof that SMO algorithm always stops within a finite number of iterations. In this letter, we point out the incompleteness of their proof and give a more rigorous proof.

[Scott2005Identifying] MS. Scott, T. Perkins, S. Bunnell, F. Pepin, DY. Thomas, and M. Hallett. Identifying regulatory subnetworks for a set of genes. Mol. Cell. Proteomics, 4(5):683-692, May 2005. [ bib | DOI | http | .pdf ]
High throughput genomic/proteomic strategies, such as microarray studies, drug screens, and genetic screens, often produce a list of genes that are believed to be important for one or more reasons. Unfortunately it is often difficult to discern meaningful biological relationships from such lists. This study presents a new bioinformatic approach that can be used to identify regulatory subnetworks for lists of significant genes or proteins. We demonstrate the utility of this approach using an interaction network for yeast constructed from BIND, TRANSFAC, SCPD, and chromatin immunoprecipitation (ChIP)-Chip data bases and lists of genes from well known metabolic pathways or differential expression experiments. The approach accurately rediscovers known regulatory elements of the heat shock response as well as the gluconeogenesis, galactose, glycolysis, and glucose fermentation pathways in yeast. We also find evidence supporting a previous conjecture that approximately half of the enzymes in a metabolic pathway are transcriptionally co-regulated. Finally we demonstrate a previously unknown connection between GAL80 and the diauxic shift in yeast.

[Schubert2005Local] S. Schubert, A. Grünweller, VA. Erdmann, and J. Kurreck. Local RNA target structure influences siRNA efficacy: systematic analysis of intentionally designed binding regions. J. Mol. Biol., 348(4):883-893, May 2005. [ bib | DOI | http ]
Contradictory reports in the literature have emphasised either the sequence of small interfering RNAs (siRNA) or the structure of their target molecules to be the major determinant of the efficiency of RNA interference (RNAi) approaches. In the present study, we analyse systematically the contributions of these parameters to siRNA activity by using deliberately designed mRNA constructs. The siRNA target sites were included in well-defined structural elements rendering them either highly accessible or completely involved in stable base-pairing. Furthermore, complementary sequence elements and various hairpins with different stem lengths and designs were used as target sites. Only one of the strands of the siRNA duplex was found to be capable of silencing via its respective target site, indicating that thermodynamic characteristics intrinsic to the siRNA strands are a basic determinant of siRNA activity. A significant obstruction of gene silencing by the same siRNA, however, was observed to be caused by structural features of the substrate RNA. Bioinformatic analysis of the mRNA structures suggests a direct correlation between the extent of gene-knockdown and the local free energy in the target region. Our findings indicate that, although a favourable siRNA sequence is a necessary prerequisite for efficient RNAi, complex target structures may limit the applicability even of carefully chosen siRNAs.

Keywords: sirna
[Res2005evolution] I. Res, I. Mihalek, and O. Lichtarge. An evolution based classifier for prediction of protein interfaces without using protein structures. Bioinformatics, 21(10):2496-501, May 2005. [ bib | DOI | http | .pdf ]
MOTIVATION: The number of available protein structures still lags far behind the number of known protein sequences. This makes it important to predict which residues participate in protein-protein interactions using only sequence information. Few studies have tackled this problem until now. RESULTS: We applied support vector machines to sequences in order to generate a classification of all protein residues into those that are part of a protein interface and those that are not. For the first time evolutionary information was used as one of the attributes and this inclusion of evolutionary importance rankings improves the classification. Leave-one-out cross-validation experiments show that prediction accuracy reaches 64%.

Keywords: biosvm
[Plewczynski2005AutoMotif] Dariusz Plewczynski, Adrian Tkacz, Lucjan Stanislaw Wyrwicz, and Leszek Rychlewski. AutoMotif server: prediction of single residue post-translational modifications in proteins. Bioinformatics, 21(10):2525-7, May 2005. [ bib | DOI | http | .pdf ]
The AutoMotif Server allows for identification of post-translational modification (PTM) sites in proteins based only on local sequence information. The local sequence preferences of short segments around PTM residues are described here as linear functional motifs (LFMs). Sequence models for all types of PTMs are trained by support vector machine on short-sequence fragments of proteins in the current release of Swiss-Prot database (phosphorylation by various protein kinases, sulfation, acetylation, methylation, amidation, etc.). The accuracy of the identification is estimated using the standard leave-one-out procedure. The sensitivities for all types of short LFMs are in the range of 70%. AVAILABILITY: The AutoMotif Server is available free for academic use at http://automotif.bioinfo.pl/

Keywords: biosvm
[Overhoff2005Local] M. Overhoff, M. Alken, RK. Far, M. Lemaitre, B. Lebleu, G. Sczakiel, and I. Robbins. Local RNA target structure influences siRNA efficacy: a systematic global analysis. J. Mol. Biol., 348(4):871-881, May 2005. [ bib | DOI | http ]
The efficiency with which small interfering RNAs (siRNAs) down-regulate specific gene expression in living cells is variable and a number of sequence-governed, biochemical parameters of the siRNA duplex have been proposed for the design of an efficient siRNA. Some of these parameters have been clearly identified to influence the assembly of the RNA-induced silencing complex (RISC), or to favour the sequence preferences of the RISC endonuclease. For other parameters, it is difficult to ascertain whether the influence is a determinant of the siRNA per se, or a determinant of the target RNA, especially its local structural characteristics. In order to gain an insight into the effects of local target structure on the biological activity of siRNA, we have used large sets of siRNAs directed against local targets of the mRNAs of ICAM-1 and survivin. Target structures were classified as accessible or inaccessible using an original, iterative computational approach and by experimental RNase H mapping. The effectiveness of siRNA was characterized by measuring the IC50 values in cell culture and the maximal extent of target suppression. Mean IC50 values were tenfold lower for accessible local target sites, with respect to inaccessible ones. Mean maximal target suppression was improved. These data illustrate that local target structure does, indeed, influence the activity of siRNA. We suggest that local target screening can significantly improve the hit rate in the design of biologically active siRNAs.

Keywords: sirna
[OFlanagan2005Non] RA. O'Flanagan, G. Paillard, R. Lavery, and AM. Sengupta. Non-additivity in protein-DNA binding. Bioinformatics, 21(10):2254-63, May 2005. [ bib | DOI | http | .pdf ]
MOTIVATION: Localizing protein binding sites within genomic DNA is of considerable importance, but remains difficult for protein families, such as transcription factors, which have loosely defined target sequences. It is generally assumed that protein affinity for DNA involves additive contributions from successive nucleotide pairs within the target sequence. This is not necessarily true, and non-additive effects have already been experimentally demonstrated in a small number of cases. The principal origin of non-additivity involves the so-called indirect component of protein-DNA recognition which is related to the sequence dependence of DNA deformation induced during complex formation. Non-additive effects are difficult to study because they require the identification of many more binding sequences than are normally necessary for describing additive specificity (typically via the construction of weight matrices). RESULTS: In the present work we will use theoretically estimated binding energies as a basis for overcoming this problem. Our approach enables us to study the full combinatorial set of sequences for a variety of DNA-binding proteins, make a detailed analysis of non-additive effects and exploit this information to improve binding site predictions using either weight matrices or support vector machines. The results underline the fact that, even in the presence of significant deformation, non-additive effects may involve only a limited number of dinucleotide steps. This information helps to reduce the number of binding sites which need to be identified for successful predictions and to avoid problems of over-fitting. AVAILABILITY: The SVM software is available upon request from the authors.

Keywords: biosvm
[Natsoulis2005Classification] Georges Natsoulis, Laurent El Ghaoui, Gert R G Lanckriet, Alexander M Tolley, Fabrice Leroy, Shane Dunlea, Barrett P Eynon, Cecelia I Pearson, Stuart Tugendreich, and Kurt Jarnagin. Classification of a large microarray data set: algorithm comparison and analysis of drug signatures. Genome Res., 15(5):724-36, May 2005. [ bib | DOI | http | .pdf ]
A large gene expression database has been produced that characterizes the gene expression and physiological effects of hundreds of approved and withdrawn drugs, toxicants, and biochemical standards in various organs of live rats. In order to derive useful biological knowledge from this large database, a variety of supervised classification algorithms were compared using a 597-microarray subset of the data. Our studies show that several types of linear classifiers based on Support Vector Machines (SVMs) and Logistic Regression can be used to derive readily interpretable drug signatures with high classification performance. Both methods can be tuned to produce classifiers of drug treatments in the form of short, weighted gene lists which upon analysis reveal that some of the signature genes have a positive contribution (act as "rewards" for the class-of-interest) while others have a negative contribution (act as "penalties") to the classification decision. The combination of reward and penalty genes enhances performance by keeping the number of false positive treatments low. The results of these algorithms are combined with feature selection techniques that further reduce the length of the drug signatures, an important step towards the development of useful diagnostic biomarkers and low-cost assays. Multiple signatures with no genes in common can be generated for the same classification end-point. Comparison of these gene lists identifies biological processes characteristic of a given class.

[Morris2005Real] RJ. Morris, R.J. Najmanovich, A. Kahraman, and J.M. Thornton. Real spherical harmonic expansion coefficients as 3D shape descriptors for protein binding pocket and ligand comparisons. Bioinformatics, 21(10):2347-2355, May 2005. [ bib ]
MOTIVATION: An increasing number of protein structures are being determined for which no biochemical characterization is available. The analysis of protein structure and function assignment is becoming an unexpected challenge and a major bottleneck towards the goal of well-annotated genomes. As shape plays a crucial role in biomolecular recognition and function, the examination and development of shape description and comparison techniques is likely to be of prime importance for understanding protein structure-function relationships. RESULTS: A novel technique is presented for the comparison of protein binding pockets. The method uses the coefficients of a real spherical harmonics expansion to describe the shape of a protein's binding pocket. Shape similarity is computed as the L2 distance in coefficient space. Such comparisons in several thousands per second can be carried out on a standard linux PC. Other properties such as the electrostatic potential fit seamlessly into the same framework. The method can also be used directly for describing the shape of proteins and other molecules. AVAILABILITY: A limited version of the software for the real spherical harmonics expansion of a set of points in PDB format is freely available upon request from the authors. Binding pocket comparisons and ligand prediction will be made available through the protein structure annotation pipeline Profunc (written by Roman Laskowski) which will be accessible from the EBI website shortly.
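
The abstract above states that shape similarity is computed as the L2 distance in coefficient space. A minimal sketch of that single step follows, assuming each shape has already been reduced to a flat list of real expansion coefficients of the same length and ordering. The function name and the sample coefficient values are hypothetical, and the expansion itself (truncation order, normalization, alignment) is not shown because the abstract does not specify it.

```python
import math


def l2_distance(coeffs_a, coeffs_b):
    """Euclidean (L2) distance between two equal-length coefficient vectors.

    Hypothetical helper: a smaller distance means more similar shapes,
    provided both vectors come from the same expansion settings.
    """
    if len(coeffs_a) != len(coeffs_b):
        raise ValueError("coefficient vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(coeffs_a, coeffs_b)))


# Made-up coefficient vectors, for illustration only.
POCKET = [1.0, 0.5, -0.25, 0.0]
LIGAND = [1.0, 0.1, -0.25, 0.3]

if __name__ == "__main__":
    print(f"L2 distance: {l2_distance(POCKET, LIGAND):.3f}")
```

Because the comparison is a single pass over two short vectors, it is cheap enough that running thousands of comparisons per second, as the abstract reports, is plausible.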

[Martin2005bioavailability] YC. Martin. A bioavailability score. J. Med. Chem., 48(9):3164-3170, May 2005. [ bib | DOI | http ]
Responding to a demonstrated need for scientists to forecast the permeability and bioavailability (F) properties of compounds before their purchase, synthesis, or advanced testing, we have developed a score that assigns the probability that a compound will have F > 10% in the rat. Neither the rule-of-five, log P, log D, nor the combination of the number of rotatable bonds and polar surface area successfully categorized compounds. Instead, different properties govern the bioavailability of compounds depending on their predominant charge at biological pH. The fraction of anions with >10% F falls from 85% if the polar surface area (PSA) is ≤ 75 Å², to 56% if 75 < PSA < 150 Å², to 11% if PSA is ≥ 150 Å². On the other hand, whereas 55% of the neutral, zwitterionic, or cationic compounds that pass the rule-of-five have >10% F, only 17% of those that fail have > 10% F. This same categorization distinguishes compounds that are poorly permeable from those that are permeable in Caco-2 cells. Further validation is provided with human bioavailability values from the literature.

Keywords: chemogenomics
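The categories quoted in the bioavailability-score abstract above can be written as a small lookup. This is only a sketch of the figures stated in the abstract (PSA in square ångströms); the function name, argument names, and category labels are hypothetical, and the published score may involve details the abstract does not give.

```python
def fraction_with_f_above_10(charge_class, psa=None, passes_rule_of_five=None):
    """Return the fraction of compounds with F > 10% quoted in the abstract."""
    if charge_class == "anion":
        # Anions are categorized by polar surface area (PSA).
        if psa is None:
            raise ValueError("psa is required for anions")
        if psa <= 75:
            return 0.85
        if psa < 150:
            return 0.56
        return 0.11
    if charge_class in {"neutral", "zwitterion", "cation"}:
        # All other charge classes are categorized by the rule-of-five.
        if passes_rule_of_five is None:
            raise ValueError("passes_rule_of_five is required for this class")
        return 0.55 if passes_rule_of_five else 0.17
    raise ValueError(f"unknown charge class: {charge_class!r}")


# Example: an anion with a PSA of 120 falls in the middle band.
example = fraction_with_f_above_10("anion", psa=120)
```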
[Majoros2005Efficient] W. H. Majoros, M. Pertea, and S. L. Salzberg. Efficient implementation of a generalized pair hidden Markov model for comparative gene finding. Bioinformatics, 21(9):1782-1788, May 2005. [ bib | DOI | http | .pdf ]
MOTIVATION: The increased availability of genome sequences of closely related organisms has generated much interest in utilizing homology to improve the accuracy of gene prediction programs. Generalized pair hidden Markov models (GPHMMs) have been proposed as one means to address this need. However, all GPHMM implementations currently available are either closed-source or the details of their operation are not fully described in the literature, leaving a significant hurdle for others wishing to advance the state of the art in GPHMM design. RESULTS: We have developed an open-source GPHMM gene finder, TWAIN, which performs very well on two related Aspergillus species, A. fumigatus and A. nidulans, finding 89% of the exons and predicting 74% of the gene models exactly correctly in a test set of 147 conserved gene pairs. We describe the implementation of this GPHMM and we explicitly address the assumptions and limitations of the system. We suggest possible ways of relaxing those assumptions to improve the utility of the system without sacrificing efficiency beyond what is practical. AVAILABILITY: Available at http://www.tigr.org/software/pirate/twain/twain.html under the open-source Artistic License.

Keywords: biogm
[Kelley2005Systematic] R. Kelley and T. Ideker. Systematic interpretation of genetic interactions using protein networks. Nat. Biotechnol., 23(5):561-566, May 2005. [ bib | DOI | http ]
Genetic interaction analysis, in which two mutations have a combined effect not exhibited by either mutation alone, is a powerful and widespread tool for establishing functional linkages between genes. In the yeast Saccharomyces cerevisiae, ongoing screens have generated >4,800 such genetic interaction data. We demonstrate that by combining these data with information on protein-protein, protein-DNA or metabolic networks, it is possible to uncover physical mechanisms behind many of the observed genetic effects. Using a probabilistic model, we found that 1,922 genetic interactions are significantly associated with either between- or within-pathway explanations encoded in the physical networks, covering approximately 40% of known genetic interactions. These models predict new functions for 343 proteins and suggest that between-pathway explanations are better than within-pathway explanations at interpreting genetic interactions identified in systematic screens. This study provides a road map for how genetic and physical interactions can be integrated to reveal pathway organization and function.

[Kato2005Selective] T. Kato, K. Tsuda, and K. Asai. Selective integration of multiple biological data for supervised network inference. Bioinformatics, 21(10):2488-2495, May 2005. [ bib | DOI | http ]
MOTIVATION: Inferring networks of proteins from biological data is a central issue of computational biology. Most network inference methods, including Bayesian networks, take unsupervised approaches in which the network is totally unknown in the beginning, and all the edges have to be predicted. A more realistic supervised framework, proposed recently, assumes that a substantial part of the network is known. We propose a new kernel-based method for supervised graph inference based on multiple types of biological datasets such as gene expression, phylogenetic profiles and amino acid sequences. Notably, our method assigns a weight to each type of dataset and thereby selects informative ones. Data selection is useful for reducing data collection costs. For example, when a similar network inference problem must be solved for other organisms, the dataset excluded by our algorithm need not be collected. RESULTS: First, we formulate supervised network inference as a kernel matrix completion problem, where the inference of edges boils down to estimation of missing entries of a kernel matrix. Then, an expectation-maximization algorithm is proposed to simultaneously infer the missing entries of the kernel matrix and the weights of multiple datasets. By introducing the weights, we can integrate multiple datasets selectively and thereby exclude irrelevant and noisy datasets. Our approach is favorably tested in two biological networks: a metabolic network and a protein interaction network. AVAILABILITY: Software is available on request.

[Jonsdottir2005Prediction] Svava Osk Jónsdóttir, Flemming Steen Jørgensen, and Søren Brunak. Prediction methods and databases within chemoinformatics: emphasis on drugs and drug candidates. Bioinformatics, 21(10):2145-2160, May 2005. [ bib | DOI | http ]
MOTIVATION: To gather information about available databases and chemoinformatics methods for prediction of properties relevant to the drug discovery and optimization process. RESULTS: We present an overview of the most important databases with 2-dimensional and 3-dimensional structural information about drugs and drug candidates, and of databases with relevant properties. Access to experimental data and numerical methods for selecting and utilizing these data is crucial for developing accurate predictive in silico models. Many interesting predictive methods for classifying the suitability of chemical compounds as potential drugs, as well as for predicting their physico-chemical and ADMET properties have been proposed in recent years. These methods are discussed, and some possible future directions in this rapidly developing field are described.

Keywords: Chemistry, Pharmaceutical; Computational Biology; Databases, Factual; Drug Design; Models, Chemical; Models, Molecular; Pharmaceutical Preparations; Structure-Activity Relationship
[Ikeda2005[Tongue] Naoya Ikeda and Takashi Uozumi. [Tongue diagnosis support system]. Hokkaido Igaku Zasshi, 80(3):269-77, May 2005. [ bib ]
Tongue diagnosis is one of the most important diagnostic methods in Oriental Medical Science (OMS). It is painless and non-invasive. However, it is not easy to train skilled practitioners. One reason is that the definition of tongue color is a rather subjective, sensory measure that is not tied to a quantitative physical value. It is therefore necessary to associate tongue color with a physical numerical value. Two problems must be solved to do so: 1) because a tongue picture consists of two regions, the tongue and the background, the region used for diagnosis must be extracted from the whole picture; 2) tongue color must be associated with a physical numerical value. To extract the tongue region, we used the Progressive LiveWire method, which is an Active Contour Model. To associate tongue color with physical measurements, we propose a hierarchical method that uses static rules and a support vector machine for clustering colors. The performance of the developed system is improved compared with an earlier one. In addition, the developed system did not make any critical misjudgment in the rule-base layer that would lead to an incorrect choice of examination. In this research, color is appraised as the average over a region of 37 points. However, color judgment by the human eye, as described in the literature, is not necessarily limited to an average judgment, and some weighting may be applied. Therefore, evaluation on a large amount of data and comparison with a group of specialists are both necessary.

Keywords: Color, Computer-Assisted, Diagnosis, English Abstract, Expert Systems, Humans, Tongue, 15960161
[Hofmann2005Concept-based] Oliver Hofmann and Dietmar Schomburg. Concept-based annotation of enzyme classes. Bioinformatics, 21(9):2059-66, May 2005. [ bib | DOI | http | .pdf ]
MOTIVATION: Given the explosive growth of biomedical data as well as the literature describing results and findings, it is getting increasingly difficult to keep up to date with new information. Keeping databases synchronized with current knowledge is a time-consuming and expensive task, one which can be alleviated by automatically gathering findings from the literature using linguistic approaches. We describe a method to automatically annotate enzyme classes with disease-related information extracted from the biomedical literature for inclusion in such a database. RESULTS: Enzyme names for the 3901 enzyme classes in the BRENDA database, a repository for quantitative and qualitative enzyme information, were identified in more than 100,000 abstracts retrieved from the PubMed literature database. Phrases in the abstracts were assigned to concepts from the Unified Medical Language System (UMLS) utilizing the MetaMap program, allowing for the identification of disease-related concepts by their semantic fields in the UMLS ontology. Assignments between enzyme classes and diseases were created based on their co-occurrence within a single sentence. False positives could be removed by a variety of filters including minimum number of co-occurrences, removal of sentences containing a negation and the classification of sentences based on their semantic fields by a Support Vector Machine. Verification of the assignments with a manually annotated set of 1500 sentences yielded favorable results of 92% precision at 50% recall, sufficient for inclusion in a high-quality database. AVAILABILITY: Source code is available from the author upon request. SUPPLEMENTARY INFORMATION: ftp.uni-koeln.de/institute/biochemie/pub/brenda/info/diseaseSupp.pdf.

Keywords: biosvm
[Camastra2005novel] Francesco Camastra and Alessandro Verri. A novel kernel method for clustering. IEEE Trans. Pattern Anal. Mach. Intell., 27(5):801-5, May 2005. [ bib | DOI | http | .pdf ]
Kernel Methods are algorithms that, by replacing the inner product with an appropriate positive definite function, implicitly perform a nonlinear mapping of the input data into a high-dimensional feature space. In this paper, we present a kernel method for clustering inspired by the classical K-Means algorithm in which each cluster is iteratively refined using a one-class Support Vector Machine. Our method, which can be easily implemented, compares favorably with respect to popular clustering algorithms, like K-Means, Neural Gas, and Self-Organizing Maps, on a synthetic data set and three UCI real data benchmarks (IRIS data, Wisconsin breast cancer database, Spam database).

[Begg2005Support] Rezaul K Begg, Marimuthu Palaniswami, and Brendan Owen. Support vector machines for automated gait classification. IEEE Trans Biomed Eng, 52(5):828-38, May 2005. [ bib | DOI | http | .pdf ]
Ageing influences gait patterns, causing constant threats to control of locomotor balance. Automated recognition of gait changes has many advantages, including early identification of at-risk gait and monitoring the progress of treatment outcomes. In this paper, we apply an artificial intelligence technique [support vector machines (SVM)] for the automatic recognition of young-old gait types from their respective gait-patterns. Minimum foot clearance (MFC) data of 30 young and 28 elderly participants were analyzed using a PEAK-2D motion analysis system during a 20-min continuous walk on a treadmill at self-selected walking speed. Gait features extracted from individual MFC histogram-plot and Poincaré-plot images were used to train the SVM. Cross-validation test results indicate that the generalization performance of the SVM was on average 83.3% (+/-2.9) to recognize young and elderly gait patterns, compared to a neural network's accuracy of 75.0+/-5.0%. A "hill-climbing" feature selection algorithm demonstrated that a small subset (3-5) of gait features extracted from MFC plots could differentiate the gait patterns with 90% accuracy. Performance of the gait classifier was evaluated using areas under the receiver operating characteristic plots. Improved performance of the classifier was evident when trained with a reduced number of selected good features and with a radial basis function kernel. These results suggest that SVMs can function as an efficient gait classifier for recognition of young and elderly gait patterns, and have the potential for wider applications in gait identification for falls-risk minimization in the elderly.

Keywords: Adult, Aged, Aging, Algorithms, Apoptosis, Artificial Intelligence, Automated, Computer-Assisted, Female, Foot, Gait, Gene Expression Profiling, Humans, Image Interpretation, Male, Neoplasms, Non-U.S. Gov't, Oligonucleotide Array Sequence Analysis, Pattern Recognition, Polymerase Chain Reaction, Proteins, Reproducibility of Results, Research Support, Sensitivity and Specificity, Subcellular Fractions, Unknown Primary, 15887532
[Barry2005Significance] W. T. Barry, A. B. Nobel, and F. A. Wright. Significance analysis of functional categories in gene expression studies: a structured permutation approach. Bioinformatics, 21(9):1943-1949, May 2005. [ bib | DOI | http ]
MOTIVATION: In high-throughput genomic and proteomic experiments, investigators monitor expression across a set of experimental conditions. To gain an understanding of broader biological phenomena, researchers have until recently been limited to post hoc analyses of significant gene lists. METHOD: We describe a general framework, significance analysis of function and expression (SAFE), for conducting valid tests of gene categories ab initio. SAFE is a two-stage, permutation-based method that can be applied to various experimental designs, accounts for the unknown correlation among genes and enables permutation-based estimation of error rates. RESULTS: The utility and flexibility of SAFE are illustrated with a microarray dataset of human lung carcinomas and gene categories based on Gene Ontology and the Protein Family database. Significant gene categories were observed in comparisons of (1) tumor versus normal tissue, (2) multiple tumor subtypes and (3) survival times. AVAILABILITY: Code to implement SAFE in the statistical package R is available from the authors. SUPPLEMENTARY INFORMATION: http://www.bios.unc.edu/~fwright/SAFE.

[Bao2005Prediction] Lei Bao and Yan Cui. Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information. Bioinformatics, 21(10):2185-90, May 2005. [ bib | DOI | http | .pdf ]
MOTIVATION: There has been great expectation that the knowledge of an individual's genotype will provide a basis for assessing susceptibility to diseases and designing individualized therapy. Non-synonymous single nucleotide polymorphisms (nsSNPs) that lead to an amino acid change in the protein product are of particular interest because they account for nearly half of the known genetic variations related to human inherited diseases. To facilitate the identification of disease-associated nsSNPs from a large number of neutral nsSNPs, it is important to develop computational tools to predict the phenotypic effects of nsSNPs. RESULTS: We prepared a training set based on the variant phenotypic annotation of the Swiss-Prot database and focused our analysis on nsSNPs having homologous 3D structures. Structural environment parameters derived from the 3D homologous structure as well as evolutionary information derived from the multiple sequence alignment were used as predictors. Two machine learning methods, support vector machine and random forest, were trained and evaluated. We compared the performance of our method with that of the SIFT algorithm, which is one of the best predictive methods to date. An unbiased evaluation study shows that for nsSNPs with sufficient evolutionary information (no fewer than 10 homologous sequences), the performance of our method is comparable with the SIFT algorithm, while for nsSNPs with insufficient evolutionary information (<10 homologous sequences), our method outperforms the SIFT algorithm significantly. These findings indicate that incorporating structural information is critical to achieving good prediction accuracy when sufficient evolutionary information is not available. AVAILABILITY: The codes and curated dataset are available at http://compbio.utmem.edu/snp/dataset/

Keywords: biosvm
[Deshpande2005Frequent] M. Deshpande, M. Kuramochi, N. Wale, and G. Karypis. Frequent Substructure-Based Approaches for Classifying Chemical Compounds. IEEE T. Knowl. Data. En., 17(8):1036-1050, August 2005. [ bib | DOI ]
Keywords: chemoinformatics
[Leordeanu2005Spectral] M. Leordeanu and M. Hebert. A spectral technique for correspondence problems using pairwise constraints. In International Conference of Computer Vision (ICCV), volume 2, pages 1482-1489, October 2005. [ bib ]
[Rasmussen2005Gaussian] Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, December 2005. [ bib ]
Gaussian processes (GPs) provide a principled, practical, probabilistic approach to learning in kernel machines. GPs have received increased attention in the machine-learning community over the past decade, and this book provides a long-needed systematic and unified treatment of theoretical and practical aspects of GPs in machine learning. The treatment is comprehensive and self-contained, targeted at researchers and students in machine learning and applied statistics. The book deals with the supervised-learning problem for both regression and classification, and includes detailed algorithms. A wide variety of covariance (kernel) functions are presented and their properties discussed. Model selection is discussed both from a Bayesian and a classical perspective. Many connections to other well-known techniques from machine learning and statistics are discussed, including support-vector machines, neural networks, splines, regularization networks, relevance vector machines and others. Theoretical issues including learning curves and the PAC-Bayesian framework are treated, and several approximation methods for learning with large datasets are discussed. The book contains illustrative examples and exercises, and code and datasets are available on the Web. Appendixes provide mathematical background and a discussion of Gaussian Markov processes.

Keywords: machine_learning
[2006Chemical] S. E. Jaroch and H. Weinmann, editors. Chemical Genomics: Small Molecule Probes to Study Cellular Function. Ernst Schering Research Foundation Workshop. Springer, Berlin, 2006. [ bib ]
Keywords: chemogenomics
[Jacoby2006Chemogenomics] E. Jacoby, editor. Chemogenomics: Knowledge-based Approaches to Drug Discovery. Imperial College Press, 2006. [ bib ]
[Guyon2006Feature] I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh, editors. Feature Extraction, Foundations and Applications. Springer, 2006. [ bib ]
[Zou2006Sparse] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. J. Comput. Graph. Stat., 15(2):265-286, 2006. [ bib | DOI | http ]
Principal component analysis (PCA) is widely used in data processing and dimensionality reduction. However, PCA suffers from the fact that each principal component is a linear combination of all the original variables, thus it is often difficult to interpret the results. We introduce a new method called sparse principal component analysis (SPCA) using the lasso (elastic net) to produce modified principal components with sparse loadings. We first show that PCA can be formulated as a regression-type optimization problem; sparse loadings are then obtained by imposing the lasso (elastic net) constraint on the regression coefficients. Efficient algorithms are proposed to fit our SPCA models for both regular multivariate data and gene expression arrays. We also give a new formula to compute the total variance of modified principal components. As illustrations, SPCA is applied to real and simulated data with encouraging results.

[Zhu2006Improving] Shanfeng Zhu, Keiko Udaka, John Sidney, Alessandro Sette, Kiyoko F. Aoki-Kinoshita, and Hiroshi Mamitsuka. Improving MHC binding peptide prediction by incorporating binding data of auxiliary MHC molecules. Bioinformatics, 22(13):1648-1655, 2006. [ bib ]
[Zhao2006model] P. Zhao and B. Yu. On model selection consistency of lasso. J. Mach. Learn. Res., 7:2541, 2006. [ bib | .html | .pdf ]
Sparsity or parsimony of statistical models is crucial for their proper interpretations, as in sciences and social sciences. Model selection is a commonly used method to find such models, but usually involves a computationally heavy combinatorial search. Lasso (Tibshirani, 1996) is now being used as a computationally feasible alternative to model selection. Therefore it is important to study Lasso for model selection purposes. In this paper, we prove that a single condition, which we call the Irrepresentable Condition, is almost necessary and sufficient for Lasso to select the true model both in the classical fixed p setting and in the large p setting as the sample size n gets large. Based on these results, sufficient conditions that are verifiable in practice are given to relate to previous works and help applications of Lasso for feature selection and sparse representation. This Irrepresentable Condition, which depends mainly on the covariance of the predictor variables, states that Lasso selects the true model consistently if and (almost) only if the predictors that are not in the true model are "irrepresentable" (in a sense to be clarified) by predictors that are in the true model. Furthermore, simulations are carried out to provide insights and understanding of this result.

Keywords: lasso
[Zhang2006Similarity] Z. Zhang and M.G. Grigorov. Similarity networks of protein binding sites. Proteins, 62(2):470-478, Feb 2006. [ bib ]
Increasing attention has been devoted to the characterization of complex networks within the protein world. This work reports how we uncovered networked structures that reflect the structural similarities among protein binding sites. First, a dataset of 211 binding sites was compiled by removing the redundant proteins in the Protein Ligand Database (PLD) (http://www-mitchell.ch.cam.ac.uk/pld/). Using a clique detection algorithm, we performed all-against-all comparisons among the 211 binding sites. Each binding site was represented as a node, and an edge was added whenever a pair of binding sites had a similarity higher than a threshold value. The generated similarity networks revealed that many nodes had few links and only a few were highly connected, but due to the limited data available it was not possible to definitively prove a scale-free architecture. Within the same dataset, the binding site similarity networks were compared with sequence and fold similarity networks. In the protein world, indications were found that structure is better conserved than sequence, but on its own, sequence was better conserved than the subset of functional residues forming the binding site. Because a binding site is strongly linked with protein function, the identification of protein binding site similarity networks could accelerate the functional annotation of newly identified genes. In view of this, we discuss several potential applications of binding site similarity networks, such as the construction of novel binding site classification databases, as well as the implications for protein molecular design in general and computational chemogenomics in particular.

Keywords: complex network, binding site, small world, scale free, sequence, fold, drug design
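The network construction described in the abstract above (one node per binding site, one edge per pair whose similarity exceeds a threshold) can be sketched as follows. The similarity scores here are made-up toy values; in the paper they come from a clique detection algorithm, which this sketch does not implement. All names are hypothetical.

```python
from itertools import combinations


def build_similarity_network(sites, similarity, threshold):
    """Return an adjacency map with an edge for each pair scoring above threshold."""
    neighbors = {site: set() for site in sites}
    for a, b in combinations(sites, 2):
        if similarity[(a, b)] > threshold:
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


# Toy all-against-all similarity scores (illustrative values only).
sites = ["s1", "s2", "s3", "s4"]
similarity = {
    ("s1", "s2"): 0.9, ("s1", "s3"): 0.8, ("s1", "s4"): 0.2,
    ("s2", "s3"): 0.1, ("s2", "s4"): 0.3, ("s3", "s4"): 0.4,
}

network = build_similarity_network(sites, similarity, threshold=0.5)

# Node degrees: the kind of quantity inspected when asking whether
# a few nodes are highly connected while most have few links.
degrees = {site: len(linked) for site, linked in network.items()}
```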
[Yuan2006Model] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B, 68(1):49-67, 2006. [ bib | .pdf ]
[Ylstra2006BAC] Bauke Ylstra, Paul Van den Ijssel, Beatriz Carvalho, Ruud H Brakenhoff, and Gerrit A Meijer. BAC to the future! or oligonucleotides: a perspective for micro array comparative genomic hybridization (array CGH). Nucleic Acids Res., 34:445-450, 2006. [ bib ]
Keywords: csbcbook, csbcbook-ch2
[Yao2006Coupling] X. Yao, C. Parnot, X. Deupi, V. R. P. Ratnala, G. Swaminath, D. Farrens, and B. Kobilka. Coupling ligand structure to specific conformational switches in the beta2-adrenoceptor. Nat. Chem. Biol., 2(8):417-422, Aug 2006. [ bib | DOI | http ]
G protein-coupled receptors (GPCRs) regulate a wide variety of physiological functions in response to structurally diverse ligands ranging from cations and small organic molecules to peptides and glycoproteins. For many GPCRs, structurally related ligands can have diverse efficacy profiles. To investigate the process of ligand binding and activation, we used fluorescence spectroscopy to study the ability of ligands having different efficacies to induce a specific conformational change in the human beta2-adrenoceptor (beta2-AR). The 'ionic lock' is a molecular switch found in rhodopsin-family GPCRs that has been proposed to link the cytoplasmic ends of transmembrane domains 3 and 6 in the inactive state. We found that most partial agonists were as effective as full agonists in disrupting the ionic lock. Our results show that disruption of this important molecular switch is necessary, but not sufficient, for full activation of the beta2-AR.

Keywords: chemogenomics
[Xia2006IntNetDB] K. Xia, D. Dong, and J-D.J. Han. IntNetDB v1.0: an integrated protein-protein interaction network database generated by a probabilistic model. BMC Bioinformatics, 7:508, 2006. [ bib | DOI | http ]
BACKGROUND: Although protein-protein interaction (PPI) networks have been explored by various experimental methods, the maps so built are still limited in coverage and accuracy. To further expand the PPI network and to extract more accurate information from existing maps, studies have been carried out to integrate various types of functional relationship data. A frequently updated database of computationally analyzed potential PPIs to provide biological researchers with rapid and easy access to analyze original data as a biological network is still lacking. RESULTS: By applying a probabilistic model, we integrated 27 heterogeneous genomic, proteomic and functional annotation datasets to predict PPI networks in human. In addition to previously studied data types, we show that phenotypic distances and genetic interactions can also be integrated to predict PPIs. We further built an easy-to-use, updatable integrated PPI database, the Integrated Network Database (IntNetDB) online, to provide automatic prediction and visualization of PPI network among genes of interest. The networks can be visualized in SVG (Scalable Vector Graphics) format for zooming in or out. IntNetDB also provides a tool to extract topologically highly connected network neighborhoods from a specific network for further exploration and research. Using the MCODE (Molecular Complex Detections) algorithm, 190 such neighborhoods were detected among all the predicted interactions. The predicted PPIs can also be mapped to worm, fly and mouse interologs. CONCLUSION: IntNetDB includes 180,010 predicted protein-protein interactions among 9,901 human proteins and represents a useful resource for the research community. Our study has increased prediction coverage by five-fold. IntNetDB also provides easy-to-use network visualization and analysis tools that allow biological researchers unfamiliar with computational biology to access and analyze data over the internet. 
The web interface of IntNetDB is freely accessible at http://hanlab.genetics.ac.cn/IntNetDB.htm. Visualization requires Mozilla version 1.8 (or higher) or Internet Explorer with installation of SVGviewer.

[Winzeler2006Applied] E. A. Winzeler. Applied systems biology and malaria. Nat. Rev. Microbiol., 4(2):145-151, Feb 2006. [ bib | DOI | http | .pdf ]
One of the goals of systems-biology research is to discover networks and interactions by integrating diverse data sets. So far, systems-biology research has focused on model organisms, which are well characterized and therefore suited to testing new methods. Systems biology has great potential for use in the search for therapies for disease. Here, the potential of systems-biology approaches in the search for new drugs and vaccines to treat malaria is examined.

Keywords: plasmodium
[Willett2006Similarity-based] P. Willett. Similarity-based virtual screening using 2D fingerprints. Drug Discov Today, 11(23-24):1046-1053, Dec 2006. [ bib | DOI | http | .pdf ]
This paper summarizes recent work at the University of Sheffield on virtual screening methods that use 2D fingerprint measures of structural similarity. A detailed comparison of a large number of similarity coefficients demonstrates that the well-known Tanimoto coefficient remains the method of choice for the computation of fingerprint-based similarity, despite possessing some inherent biases related to the sizes of the molecules that are being sought. Group fusion involves combining the results of similarity searches based on multiple reference structures and a single similarity measure. We demonstrate the effectiveness of this approach to screening, and also describe an approximate form of group fusion, turbo similarity searching, that can be used when just a single reference structure is available.

Keywords: PUlearning, chemoinformatics
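For readers unfamiliar with the Tanimoto coefficient mentioned in the abstract above, a minimal sketch follows. It assumes each 2D fingerprint is represented as the set of indices of its "on" bits, and it returns 0.0 when both fingerprints are empty (a convention chosen here, not one stated in the source). The function name and example fingerprints are hypothetical.

```python
def tanimoto(fingerprint_a, fingerprint_b):
    """Tanimoto coefficient: shared on-bits divided by on-bits set in either."""
    a, b = set(fingerprint_a), set(fingerprint_b)
    union_size = len(a | b)
    if union_size == 0:
        return 0.0  # assumed convention for two empty fingerprints
    return len(a & b) / union_size


# Toy fingerprints given as on-bit indices (illustrative values only).
reference = {1, 2, 3, 5}
candidate = {2, 3, 4, 5, 8}

# 3 shared bits out of 6 distinct bits.
score = tanimoto(reference, candidate)
```

The denominator grows with the number of bits set in either molecule, which gives one intuition for the size-related biases the abstract mentions.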
[Whitfield2006Common] M.L. Whitfield, L.K. George, G.D. Grant, and C.M. Perou. Common markers of proliferation. Nature Reviews Cancer, 6(2):99-106, 2006. [ bib ]
[Wheeler2008Homologene] D. L. Wheeler, T. Barrett, D. A. Benson, and S. H. Bryant. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 31:28-33, 2006. [ bib ]
[Weinberger2006Distance] K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In Y. Weiss, B. Schoelkopf, and J. Platt, editors, Adv. Neural. Inform. Process Syst., volume 18, Cambridge, MA, 2006. MIT Press. [ bib ]
[Warren2006Critical] G.L. Warren, C.W. Andrews, A.M. Capelli, B. Clarke, J. LaLonde, M.H. Lambert, M. Lindvall, N. Nevins, S.F. Semus, S. Senger, G. Tedesco, I.D. Wall, J.M. Woolven, C.E. Peishoff, and M.S. Head. A critical assessment of docking programs and scoring functions. J. Med. Chem., 49(20):5912-5931, Oct 2006. [ bib ]
Docking is a computational technique that samples conformations of small molecules in protein binding sites; scoring functions are used to assess which of these conformations best complements the protein binding site. An evaluation of 10 docking programs and 37 scoring functions was conducted against eight proteins of seven protein types for three tasks: binding mode prediction, virtual screening for lead identification, and rank-ordering by affinity for lead optimization. All of the docking programs were able to generate ligand conformations similar to crystallographically determined protein/ligand complex structures for at least one of the targets. However, scoring functions were less successful at distinguishing the crystallographic conformation from the set of docked poses. Docking programs identified active compounds from a pharmaceutically relevant pool of decoy compounds; however, no single program performed well for all of the targets. For prediction of compound affinity, none of the docking programs or scoring functions made a useful prediction of ligand binding affinity.

[Wang2006Correspondence] H. F. Wang and E. R. Hancock. Correspondence matching using kernel principal components analysis and label consistency constraints. Pattern Recogn., 39(6):1012-1025, 2006. [ bib | DOI ]
[Wainwright2006Sharp] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy recovery of sparsity. Technical Report 709, UC Berkeley, Department of Statistics, 2006. [ bib | .pdf | .pdf ]
[Vert2006Consistency] R. Vert and J.-P. Vert. Consistency and convergence rates of one-class SVMs and related algorithms. J. Mach. Learn. Res., 7:817-854, 2006. [ bib | .html ]
We determine the asymptotic behaviour of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator.

[Vert2006Kernels] J.-P. Vert, R. Thurman, and W. S. Noble. Kernels for gene regulatory regions. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Adv. Neural. Inform. Process. Syst., volume 18, pages 1401-1408, Cambridge, MA, 2006. MIT Press. [ bib ]
Keywords: biosvm
[Vert2006accurate] J.-P. Vert, N. Foveau, C. Lajaunie, and Y. Vandenbrouck. An accurate and interpretable model for siRNA efficacy prediction. BMC Bioinformatics, 7:520, 2006. [ bib | DOI | http ]
Background: The use of exogenous small interfering RNAs (siRNAs) for gene silencing has quickly become a widespread molecular tool providing a powerful means for gene functional study and new drug target identification. Although considerable progress has been made recently in understanding how the RNAi pathway mediates gene silencing, the design of potent siRNAs remains challenging. Results: We propose a simple linear model combining basic features of siRNA sequences for siRNA efficacy prediction. Trained and tested on a large dataset of siRNA sequences made recently available, it performs as well as more complex state-of-the-art models in terms of potency prediction accuracy, with the advantage of being directly interpretable. The analysis of this linear model allows us to detect and quantify the effect of nucleotide preferences at particular positions, including previously known and new observations. We also detect and quantify a strong propensity of potent siRNAs to contain short asymmetric motifs in their sequence, and show that, surprisingly, these motifs alone contain at least as much relevant information for potency prediction as the nucleotide preferences for particular positions. Conclusion: The model proposed for prediction of siRNA potency is as accurate as a state-of-the-art nonlinear model and is easily interpretable in terms of biological features. It is freely available on the web at http://cbio.ensmp.fr/dsir

[Vert2006Low-rank] Jean-Philippe Vert, Francis Bach, and Theodoros Evgeniou. Low-rank matrix factorization with attributes, 2006. [ bib ]
[Vert2006Kernel] J.-P. Vert. Kernel methods in bioinformatics : a survey. To appear, 2006. [ bib ]
[Tropp2006Algorithms] Joel A. Tropp, Anna C. Gilbert, and Martin J. Strauss. Algorithms for simultaneous sparse approximation: Part I: Greedy pursuit. Signal Process., 86(3):572-588, 2006. [ bib | DOI ]
[Tillmann06Efficient] C. Tillmann. Efficient Dynamic Programming Search Algorithms For Phrase-Based SMT. In Workshop On Computationally Hard Problems And Joint Inference In Speech And Language Processing, 2006. [ bib | .pdf ]
[Tartakovsky2006novel] A. G. Tartakovsky, B. L. Rozovskii, R. B. Blazek, and Hongjoong Kim. A novel approach to detection of intrusions in computer networks via adaptive sequential and batch-sequential change-point detection methods. IEEE T. Signal. Proces., 54(9):3372-3382, 2006. [ bib | DOI | http | .pdf ]
[Soerlie2006Gene] T. Sørlie, C. M. Perou, C. Fan, S. Geisler, T. Aas, A. Nobel, G. Anker, L. A. Akslen, D. Botstein, A.-L. Børresen-Dale, and P. E. Lønning. Gene expression profiles do not consistently predict the clinical treatment response in locally advanced breast cancer. Mol. Cancer Ther., 5(11):2914-2918, Nov 2006. [ bib | DOI | http | .pdf ]
Neoadjuvant treatment offers an opportunity to correlate molecular variables to treatment response and to explore mechanisms of drug resistance in vivo. Here, we present a statistical analysis of large-scale gene expression patterns and their relationship to response following neoadjuvant chemotherapy in locally advanced breast cancers. We analyzed cDNA expression data from 81 tumors from two patient series, one treated with doxorubicin alone (51) and the other treated with 5-fluorouracil and mitomycin (30), and both were previously studied for correlations between TP53 status and response to therapy. We observed a low frequency of progressive disease within the luminal A subtype from both series (2 of 36 versus 13 of 45 patients; P = 0.0089) and a high frequency of progressive disease among patients with luminal B type tumors treated with doxorubicin (5 of 8 patients; P = 0.0078); however, aside from these two observations, no other consistent associations between response to chemotherapy and tumor subtype were observed. These specific associations could possibly be explained by covariance with TP53 mutation status, which also correlated with tumor subtype. Using supervised analysis, we could not uncover a gene profile that could reliably (>70% accuracy and specificity) predict response to either treatment regimen.

Keywords: csbcbook, csbcbook-ch3
[Surgand2006chemogenomic] Jean-Sebastien Surgand, Jordi Rodrigo, Esther Kellenberger, and Didier Rognan. A chemogenomic analysis of the transmembrane binding cavity of human G-protein-coupled receptors. Proteins, 62(2):509-538, Feb 2006. [ bib | DOI | http ]
The amino acid sequences of 369 human nonolfactory G-protein-coupled receptors (GPCRs) have been aligned at the seven transmembrane domain (TM) and used to extract the nature of 30 critical residues supposed (from the X-ray structure of bovine rhodopsin bound to retinal) to line the TM binding cavity of ground-state receptors. Interestingly, the clustering of human GPCRs from these 30 residues mirrors the recently described phylogenetic tree of full-sequence human GPCRs (Fredriksson et al., Mol Pharmacol 2003;63:1256-1272) with few exceptions. A TM cavity could be found for all investigated GPCRs with physicochemical properties matching that of their cognate ligands. The current approach allows a very fast comparison of most human GPCRs from the focused perspective of the predicted TM cavity and makes it easy to detect key residues that drive ligand selectivity or promiscuity.

Keywords: Amino Acid Sequence; Binding Sites; Genomics; Humans; Ligands; Models, Molecular; Phylogeny; Receptors, G-Protein-Coupled
[Stransky2006Regional] N. Stransky, C. Vallot, F. Reyal, I. Bernard-Pierrot, S. G. Diez de Medina, R. Segraves, Y. de Rycke, P. Elvin, A. Cassidy, C. Spraggon, A. Graham, J. Southgate, B. Asselain, Y. Allory, C. C. Abbou, D. G. Albertson, J.-P. Thiery, D. K. Chopin, D. Pinkel, and F. Radvanyi. Regional copy number-independent deregulation of transcription in cancer. Nat. Genet., 38(12):1386-1396, Dec 2006. [ bib | DOI | http | .pdf ]
Genetic and epigenetic alterations have been identified that lead to transcriptional deregulation in cancers. Genetic mechanisms may affect single genes or regions containing several neighboring genes, as has been shown for DNA copy number changes. It was recently reported that epigenetic suppression of gene expression can also extend to a whole region; this is known as long-range epigenetic silencing. Various techniques are available for identifying regional genetic alterations, but no large-scale analysis has yet been carried out to obtain an overview of regional epigenetic alterations. We carried out an exhaustive search for regions susceptible to such mechanisms using a combination of transcriptome correlation map analysis and array CGH data for a series of bladder carcinomas. We validated one candidate region experimentally, demonstrating histone methylation leading to the loss of expression of neighboring genes without DNA methylation.

Keywords: csbcbook
[Stark2006BioGRID:] C. Stark, B.-J. Breitkreutz, T. Reguly, L. Boucher, A. Breitkreutz, and M. Tyers. BioGRID: a general repository for interaction datasets. Nucleic Acids Res, 34(Database issue):D535-D539, Jan 2006. [ bib | DOI | http ]
Access to unified datasets of protein and genetic interactions is critical for interrogation of gene/protein function and analysis of global network properties. BioGRID is a freely accessible database of physical and genetic interactions available at http://www.thebiogrid.org. BioGRID release version 2.0 includes >116 000 interactions from Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster and Homo sapiens. Over 30 000 interactions have recently been added from 5778 sources through exhaustive curation of the Saccharomyces cerevisiae primary literature. An internally hyper-linked web interface allows for rapid search and retrieval of interaction data. Full or user-defined datasets are freely downloadable as tab-delimited text files and PSI-MI XML. Pre-computed graphical layouts of interactions are available in a variety of file formats. User-customized graphs with embedded protein, gene and interaction attributes can be constructed with a visualization system called Osprey that is dynamically linked to the BioGRID.

[Srebrow2006Connection] A. Srebrow and A. R. Kornblihtt. The connection between splicing and cancer. J. Cell Sci., 119:2635-2641, 2006. [ bib ]
Keywords: csbcbook
[Sotiriou2006Gene] Christos Sotiriou, Pratyaksha Wirapati, Sherene Loi, Adrian Harris, Steve Fox, Johanna Smeds, Hans Nordgren, Pierre Farmer, Viviane Praz, Benjamin Haibe-Kains, Christine Desmedt, Denis Larsimont, Fatima Cardoso, Hans Peterse, Dimitry Nuyten, Marc Buyse, Marc J Van de Vijver, Jonas Bergh, Martine Piccart, and Mauro Delorenzi. Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J Natl Cancer Inst, 98(4):262-272, Feb 2006. [ bib | DOI | http ]
Histologic grade in breast cancer provides clinically important prognostic information. However, 30%-60% of tumors are classified as histologic grade 2. This grade is associated with an intermediate risk of recurrence and is thus not informative for clinical decision making. We examined whether histologic grade was associated with gene expression profiles of breast cancers and whether such profiles could be used to improve histologic grading. We analyzed microarray data from 189 invasive breast carcinomas and from three published gene expression datasets from breast carcinomas. We identified differentially expressed genes in a training set of 64 estrogen receptor (ER)-positive tumor samples by comparing expression profiles between histologic grade 3 tumors and histologic grade 1 tumors and used the expression of these genes to define the gene expression grade index. Data from 597 independent tumors were used to evaluate the association between relapse-free survival and the gene expression grade index in a Kaplan-Meier analysis. All statistical tests were two-sided. We identified 97 genes in our training set that were associated with histologic grade; most of these genes were involved in cell cycle regulation and proliferation. In validation datasets, the gene expression grade index was strongly associated with histologic grade 1 and 3 status; however, among histologic grade 2 tumors, the index spanned the values for histologic grade 1-3 tumors. Among patients with histologic grade 2 tumors, a high gene expression grade index was associated with a higher risk of recurrence than a low gene expression grade index (hazard ratio = 3.61, 95% confidence interval = 2.25 to 5.78; P < .001, log-rank test). Gene expression grade index appeared to reclassify patients with histologic grade 2 tumors into two groups with high versus low risks of recurrence. This approach may improve the accuracy of tumor grading and thus its prognostic value.

Keywords: Breast Neoplasms, chemistry/genetics/pathology; Cell Cycle; Cell Proliferation; Disease-Free Survival; Female; Gene Expression Profiling; Gene Expression Regulation, Neoplastic; Humans; Lymphatic Metastasis; Mathematical Computing; Middle Aged; Multivariate Analysis; Oligonucleotide Array Sequence Analysis; Prognosis; Proportional Hazards Models; Receptors, Estrogen, analysis; Risk Factors
[Sonnenburg2006Large] Sören Sonnenburg, Gunnar Rätsch, Christin Schäfer, and Bernhard Schölkopf. Large scale multiple kernel learning. J. Mach. Learn. Res., 7:1531-1565, 2006. [ bib ]
[Song2006Development] M. Song and M. Clark. Development and evaluation of an in silico model for hERG binding. J. Chem. Inf. Model., 46(1):392-400, 2006. [ bib | DOI | http | .pdf ]
It has been recognized that drug-induced QT prolongation is related to blockage of the human ether-a-go-go-related gene (hERG) ion channel. Therefore, it is prudent to evaluate the hERG binding of active compounds in early stages of drug discovery. In silico approaches provide an economic and quick method to screen for potential hERG liability. A diverse set of 90 compounds with hERG IC(50) inhibition data was collected from literature references. Fragment-based QSAR descriptors and three different statistical methods, support vector regression, partial least squares, and random forests, were employed to construct QSAR models for hERG binding affinity. Important fragment descriptors relevant to hERG binding affinity were identified through an efficient feature selection method based on sparse linear support vector regression. The support vector regression predictive model built upon selected fragment descriptors outperforms the other two statistical methods in this study, resulting in an r(2) of 0.912 and 0.848 for the training and testing data sets, respectively. The support vector regression model was applied to predict hERG binding affinities of 20 in-house compounds belonging to three different series. The model predicted the relative binding affinity well for two out of three compound series. The hierarchical clustering and dendrogram results show that the compound series with the best prediction has much higher structural similarity and more neighbors of training compounds than the other two compound series, demonstrating the predictive scope of the model. The combination of a QSAR model and postprocessing analysis, such as clustering and visualization, provides a way to assess the confidence level of QSAR prediction results on the basis of similarity to the training set.

Keywords: chemoinformatics herg
[Smeaton2006Evaluation] A. F. Smeaton, P. Over, and W. Kraaij. Evaluation campaigns and TRECVid. In MIR '06: Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval, pages 321-330, New York, NY, USA, 2006. ACM Press. [ bib | DOI | http ]
[Shabalina2006Computational] S. Shabalina, A. Spiridonov, and A. Ogurtsov. Computational models with thermodynamic and composition features improve siRNA design. BMC Bioinformatics, 7(1):65, Feb 2006. [ bib | DOI | http | .pdf ]
ABSTRACT: BACKGROUND: Small interfering RNAs (siRNAs) have become an important tool in cell and molecular biology. Reliable design of siRNA molecules is essential for the needs of large functional genomics projects. RESULTS: To improve the design of efficient siRNA molecules, we performed a comparative, thermodynamic and correlation analysis on a heterogeneous set of 653 siRNAs collected from the literature. We used this training set to select siRNA features and optimize computational models. We identified 18 parameters that correlate significantly with silencing efficiency. Some of these parameters characterize only the siRNA sequence, while others involve the whole mRNA. Most importantly, we derived an siRNA position-dependent consensus, and optimized the free-energy difference of the 5' and 3' terminal dinucleotides of the siRNA antisense strand. The position-dependent consensus is based on correlation and t-test analyses of the training set, and accounts for both significantly preferred and avoided nucleotides in all sequence positions. On the training set, the two parameters' correlation with silencing efficiency was 0.5 and 0.36, respectively. Among other features, a dinucleotide content index and the frequency of potential targets for siRNA in the mRNA added predictive power to our model (R = 0.55). We showed that our model is effective for predicting the efficiency of siRNAs at different concentrations. We optimized a neural network model on our training set using three parameters characterizing the siRNA sequence, and predicted efficiencies for the test siRNA dataset recently published by Novartis. On this validation set, the correlation coefficient between predicted and observed efficiency was 0.75. Using the same model, we performed a transcriptome-wide analysis of optimal siRNA targets for 22,600 human mRNAs. CONCLUSIONS: We demonstrated that the properties of the siRNAs themselves are essential for efficient RNA interference. 
The 5' ends of antisense strands of efficient siRNAs are U-rich and possess a content similarity to the pyrimidine-rich oligonucleotides interacting with the polypurine RNA tracks that are recognized by RNase H. The advantage of our method over similar methods is the small number of parameters. As a result, our method requires a much smaller training set to produce consistent results. Other mRNA features, though expensive to compute, can slightly improve our model.

Keywords: sirna
[Schumacher2006Microarray] A. Schumacher, P. Kapranov, Z. Kaminsky, J. Flanagan, A. Assadzadeh, P. Yau, C. Virtanen, N. Winegarden, J. Cheng, T. Gingeras, and A. Petronis. Microarray-based DNA methylation profiling: technology and applications. Nucleic Acids Res., 34:528-542, 2006. [ bib ]
Keywords: csbcbook, csbcbook-ch2
[Sanguinetti2006hERG] M. C. Sanguinetti and M. Tristani-Firouzi. hERG potassium channels and cardiac arrhythmia. Nature, 440(7083):463-469, Mar 2006. [ bib | DOI | http | .pdf ]
hERG potassium channels are essential for normal electrical activity in the heart. Inherited mutations in the HERG gene cause long QT syndrome, a disorder that predisposes individuals to life-threatening arrhythmias. Arrhythmia can also be induced by a blockage of hERG channels by a surprisingly diverse group of drugs. This side effect is a common reason for drug failure in preclinical safety trials. Insights gained from the crystal structures of other potassium channels have helped our understanding of the block of hERG channels and the mechanisms of gating.

Keywords: herg
[Salomon2006Predicting] J. Salomon and D. R. Flower. Predicting Class II MHC-Peptide binding: a kernel based approach using similarity scores. BMC Bioinformatics, 7:501, 2006. [ bib | DOI | http ]
BACKGROUND: Modelling the interaction between potentially antigenic peptides and Major Histocompatibility Complex (MHC) molecules is a key step in identifying potential T-cell epitopes. For Class II MHC alleles, the binding groove is open at both ends, causing ambiguity in the positional alignment between the groove and peptide, as well as creating uncertainty as to what parts of the peptide interact with the MHC. Moreover, the antigenic peptides have variable lengths, making naive modelling methods difficult to apply. This paper introduces a kernel method that can handle variable length peptides effectively by quantifying similarities between peptide sequences and integrating these into the kernel. RESULTS: The kernel approach presented here shows increased prediction accuracy with a significantly higher number of true positives and negatives on multiple MHC class II alleles, when testing data sets from MHCPEP 1, MCHBN 2, and MHCBench 3. Evaluation by cross validation, when segregating binders and non-binders, produced an average of 0.824 AROC for the MHCBench data sets (up from 0.756), and an average of 0.96 AROC for multiple alleles of the MHCPEP database. CONCLUSION: The method improves performance over existing state-of-the-art methods of MHC class II peptide binding predictions by using a custom, knowledge-based representation of peptides. Similarity scores, in contrast to a fixed-length, pocket-specific representation of amino acids, provide a flexible and powerful way of modelling MHC binding, and can easily be applied to other dynamic sequence problems.

Keywords: Amino Acid, Binding Sites, Computational Biology, Databases, Epitope Mapping, Genetic, HLA-A Antigens, HLA-DR Antigens, Histocompatibility Antigens Class II, Humans, Peptides, Protein, Protein Binding, Protein Conformation, ROC Curve, Reproducibility of Results, Sequence Alignment, Sequence Analysis, Sequence Homology, 17105666
[Salgado2006RegulonDB] H. Salgado, S. Gama-Castro, M. Peralta-Gil, E. Díaz-Peredo, F. Sánchez-Solano, A. Santos-Zavaleta, I. Martínez-Flores, V. Jiménez-Jacinto, C. Bonavides-Martínez, J. Segura-Salazar, A. Martínez-Antonio, and J. Collado-Vides. RegulonDB (version 5.0): Escherichia coli K-12 transcriptional regulatory network, operon organization, and growth conditions. Nucleic Acids Res., 34(Database issue):D394-D397, Jan 2006. [ bib | DOI | http ]
RegulonDB is the internationally recognized reference database of Escherichia coli K-12 offering curated knowledge of the regulatory network and operon organization. It is currently the largest electronically-encoded database of the regulatory network of any free-living organism. We present here the recently launched RegulonDB version 5.0 radically different in content, interface design and capabilities. Continuous curation of original scientific literature provides the evidence behind every single object and feature. This knowledge is complemented with comprehensive computational predictions across the complete genome. Literature-based and predicted data are clearly distinguished in the database. Starting with this version, RegulonDB public releases are synchronized with those of EcoCyc since our curation supports both databases. The complex biology of regulation is simplified in a navigation scheme based on three major streams: genes, operons and regulons. Regulatory knowledge is directly available in every navigation step. Displays combine graphic and textual information and are organized allowing different levels of detail and biological context. This knowledge is the backbone of an integrated system for the graphic display of the network, graphic and tabular microarray comparisons with curated and predicted objects, as well as predictions across bacterial genomes, and predicted networks of functionally related gene products. Access RegulonDB at http://regulondb.ccg.unam.mx.

Keywords: Databases, Genetic; Escherichia coli K12; Gene Expression Regulation, Bacterial; Genome, Bacterial; Internet; Operon; Regulon; Software; Transcription, Genetic; User-Computer Interface
[Saigo2006linear] H. Saigo, T. Kadowaki, and K. Tsuda. A linear programming approach for molecular QSAR analysis. In T. Gärtner, G. C. Garriga, and T. Meinl, editors, International workshop on mining and learning with graphs (MLG), pages 85-96, 2006. [ bib | .pdf ]
[Ren2006siRecords] Y. Ren, W. Gong, Q. Xu, X. Zheng, D. Lin, Y. Wang, and T. Li. siRecords: an extensive database of mammalian siRNAs with efficacy ratings. Bioinformatics, Jan 2006. [ bib | DOI | http ]
SUMMARY: Short interfering RNAs have been gaining popularity as the gene knock-down tool of choice by many researchers due to the clean nature of their workings as well as the technical simplicity and cost efficiency in their applications. We have constructed siRecords, a database of siRNAs experimentally tested by researchers with consistent efficacy ratings. This database will help siRNA researchers develop more reliable siRNA design rules; in the meantime, it will benefit experimental researchers directly by providing them with information about the siRNAs that have been experimentally tested against the genes of their interest. Currently, more than 4,100 carefully annotated siRNA sequences obtained from more than 1,200 published siRNA studies are hosted in siRecords. This database will continue to expand as more experimentally tested siRNAs are published. AVAILABILITY: The siRecords database can be accessed at http://siRecords.umn.edu/siRecords/.

Keywords: sirna
[Redon2006Global] Richard Redon, Shumpei Ishikawa, Karen R Fitch, Lars Feuk, George H Perry, T. Daniel Andrews, Heike Fiegler, Michael H Shapero, Andrew R Carson, Wenwei Chen, Eun Kyung Cho, Stephanie Dallaire, Jennifer L Freeman, Juan R González, Mònica Gratacòs, Jing Huang, Dimitrios Kalaitzopoulos, Daisuke Komura, Jeffrey R MacDonald, Christian R Marshall, Rui Mei, Lyndal Montgomery, Kunihiro Nishimura, Kohji Okamura, Fan Shen, Martin J Somerville, Joelle Tchinda, Armand Valsesia, Cara Woodwark, Fengtang Yang, Junjun Zhang, Tatiana Zerjal, Jane Zhang, Lluis Armengol, Donald F Conrad, Xavier Estivill, Chris Tyler-Smith, Nigel P Carter, Hiroyuki Aburatani, Charles Lee, Keith W Jones, Stephen W Scherer, and Matthew E Hurles. Global variation in copy number in the human genome. Nature, 444(7118):444-454, Nov 2006. [ bib | DOI | http ]
Copy number variation (CNV) of DNA sequences is functionally significant but has yet to be fully ascertained. We have constructed a first-generation CNV map of the human genome through the study of 270 individuals from four populations with ancestry in Europe, Africa or Asia (the HapMap collection). DNA from these individuals was screened for CNV using two complementary technologies: single-nucleotide polymorphism (SNP) genotyping arrays, and clone-based comparative genomic hybridization. A total of 1,447 copy number variable regions (CNVRs), which can encompass overlapping or adjacent gains or losses, covering 360 megabases (12% of the genome) were identified in these populations. These CNVRs contained hundreds of genes, disease loci, functional elements and segmental duplications. Notably, the CNVRs encompassed more nucleotide content per genome than SNPs, underscoring the importance of CNV in genetic diversity and evolution. The data obtained delineate linkage disequilibrium patterns for many CNVs, and reveal marked variation in copy number among populations. We also demonstrate the utility of this resource for genetic disease studies.

Keywords: Chromosome Mapping; Gene Dosage; Genetic Variation; Genetics, Population; Genome, Human; Genomics, methods; Genotype; Humans; Linkage Disequilibrium; Molecular Diagnostic Techniques; Oligonucleotide Array Sequence Analysis, methods; Polymorphism, Single Nucleotide
[Radulescu2006ECCS] O. Radulescu, A. Gorban, S. Vakulenko, and A. Zinovyev. Hierarchies and modules in complex biological systems. In Proceedings of European Conference on Complex Systems, Oxford, UK, 2006. [ bib ]
[Peters2006community] Bjoern Peters, Huynh-Hoa Bui, Sune Frankild, Morten Nielson, Claus Lundegaard, Emrah Kostem, Derek Basch, Kasper Lamberth, Mikkel Harndahl, Ward Fleri, Stephen S Wilson, John Sidney, Ole Lund, Soren Buus, and Alessandro Sette. A community resource benchmarking predictions of peptide binding to MHC-I molecules. PLoS Comput. Biol., 2(6):e65, Jun 2006. [ bib | DOI | http ]
Recognition of peptides bound to major histocompatibility complex (MHC) class I molecules by T lymphocytes is an essential part of immune surveillance. Each MHC allele has a characteristic peptide binding preference, which can be captured in prediction algorithms, allowing for the rapid scan of entire pathogen proteomes for peptide likely to bind MHC. Here we make public a large set of 48,828 quantitative peptide-binding affinity measurements relating to 48 different mouse, human, macaque, and chimpanzee MHC class I alleles. We use this data to establish a set of benchmark predictions with one neural network method and two matrix-based prediction methods extensively utilized in our groups. In general, the neural network outperforms the matrix-based predictions mainly due to its ability to generalize even on a small amount of data. We also retrieved predictions from tools publicly available on the internet. While differences in the data used to generate these predictions hamper direct comparisons, we do conclude that tools based on combinatorial peptide libraries perform remarkably well. The transparent prediction evaluation on this dataset provides tool developers with a benchmark for comparison of newly developed prediction methods. In addition, to generate and evaluate our own prediction methods, we have established an easily extensible web-based prediction framework that allows automated side-by-side comparisons of prediction methods implemented by experts. This is an advance over the current practice of tool developers having to generate reference predictions themselves, which can lead to underestimating the performance of prediction methods they are not as familiar with as their own. The overall goal of this effort is to provide a transparent prediction evaluation allowing bioinformaticians to identify promising features of prediction methods and providing guidance to immunologists regarding the reliability of prediction tools.

Keywords: Animals; Databases, Factual; HLA Antigens; Histocompatibility Antigens Class I; Humans; Inhibitory Concentration 50; Macaca; Mice; Neural Networks (Computer); Pan troglodytes; Peptides; ROC Curve; Software
[Paik2006Gene] Soonmyung Paik, Gong Tang, Steven Shak, Chungyeul Kim, Joffre Baker, Wanseop Kim, Maureen Cronin, Frederick L. Baehner, Drew Watson, John Bryant, Joseph P. Costantino, Charles E Geyer, Jr, D Lawrence Wickerham, and Norman Wolmark. Gene expression and benefit of chemotherapy in women with node-negative, estrogen receptor-positive breast cancer. J Clin Oncol, 24(23):3726-3734, Aug 2006. [ bib | DOI | http ]
The 21-gene recurrence score (RS) assay quantifies the likelihood of distant recurrence in women with estrogen receptor-positive, lymph node-negative breast cancer treated with adjuvant tamoxifen. The relationship between the RS and chemotherapy benefit is not known. The RS was measured in tumors from the tamoxifen-treated and tamoxifen plus chemotherapy-treated patients in the National Surgical Adjuvant Breast and Bowel Project (NSABP) B20 trial. Cox proportional hazards models were utilized to test for interaction between chemotherapy treatment and the RS. A total of 651 patients were assessable (227 randomly assigned to tamoxifen and 424 randomly assigned to tamoxifen plus chemotherapy). The test for interaction between chemotherapy treatment and RS was statistically significant (P = .038). Patients with high-RS (≥ 31) tumors (ie, high risk of recurrence) had a large benefit from chemotherapy (relative risk, 0.26; 95% CI, 0.13 to 0.53; absolute decrease in 10-year distant recurrence rate: mean, 27.6%; SE, 8.0%). Patients with low-RS (< 18) tumors derived minimal, if any, benefit from chemotherapy treatment (relative risk, 1.31; 95% CI, 0.46 to 3.78; absolute decrease in distant recurrence rate at 10 years: mean, -1.1%; SE, 2.2%). Patients with intermediate-RS tumors did not appear to have a large benefit, but the uncertainty in the estimate cannot exclude a clinically important benefit. The RS assay not only quantifies the likelihood of breast cancer recurrence in women with node-negative, estrogen receptor-positive breast cancer, but also predicts the magnitude of chemotherapy benefit.

Keywords: Adult; Aged; Antineoplastic Combined Chemotherapy Protocols, administration & dosage/therapeutic use; Breast Neoplasms, drug therapy/metabolism/pathology/prevention & control; Cisplatin, administration & dosage; Female; Fluorouracil, administration & dosage; Gene Expression Regulation, Neoplastic; Humans; Linear Models; Lymphatic Metastasis; Methotrexate, administration & dosage; Middle Aged; Mitomycins, administration & dosage; Neoplasm Proteins, metabolism; Neoplasm Recurrence, Local, metabolism/prevention & control; Odds Ratio; Predictive Value of Tests; Prognosis; Proportional Hazards Models; Randomized Controlled Trials as Topic; Receptors, Estrogen, metabolism; Recurrence, prevention & control; Reverse Transcriptase Polymerase Chain Reaction; Risk Assessment; Risk Factors; Tamoxifen, administration & dosage; Tumor Markers, Biological, metabolism
[Pai2006Prospects] S. I. Pai, Y.-Y. Lin, B. Macaes, A. Meneshian, C.-F. Hung, and T.-C. Wu. Prospects of RNA interference therapy for cancer. Gene Ther., 13(6):464-477, Mar 2006. [ bib | DOI | http ]
RNA interference (RNAi) is a powerful gene-silencing process that holds great promise in the field of cancer therapy. The discovery of RNAi has generated enthusiasm within the scientific community, not only because it has been used to rapidly identify key molecules involved in many disease processes including cancer, but also because RNAi has the potential to be translated into a technology with major therapeutic applications. Our evolving understanding of the molecular pathways important for carcinogenesis has created opportunities for cancer therapy employing RNAi technology to target the key molecules within these pathways. Many gene products involved in carcinogenesis have already been explored as targets for RNAi intervention, and RNAi targeting of molecules crucial for tumor-host interactions and tumor resistance to chemo- or radiotherapy has also been investigated. In most of these studies, the silencing of critical gene products by RNAi technology has generated significant antiproliferative and/or proapoptotic effects in cell-culture systems or in preclinical animal models. Nevertheless, significant obstacles, such as in vivo delivery, incomplete suppression of target genes, nonspecific immune responses and the so-called off-target effects, need to be overcome before this technology can be successfully translated into the clinical arena. Significant progress has already been made in addressing some of these issues, and it is foreseen that early phase clinical trials will be initiated in the very near future.

Keywords: sirna
[Oti2006Predicting] M. Oti, B. Snel, MA. Huynen, and HG. Brunner. Predicting disease genes using protein-protein interactions. J Med Genet, 43(8):691-698, Aug 2006. [ bib | DOI | http ]
BACKGROUND: The responsible genes have not yet been identified for many genetically mapped disease loci. Physically interacting proteins tend to be involved in the same cellular process, and mutations in their genes may lead to similar disease phenotypes. OBJECTIVE: To investigate whether protein-protein interactions can predict genes for genetically heterogeneous diseases. METHODS: 72,940 protein-protein interactions between 10,894 human proteins were used to search 432 loci for candidate disease genes representing 383 genetically heterogeneous hereditary diseases. For each disease, the protein interaction partners of its known causative genes were compared with the disease associated loci lacking identified causative genes. Interaction partners located within such loci were considered candidate disease gene predictions. Prediction accuracy was tested using a benchmark set of known disease genes. RESULTS: Almost 300 candidate disease gene predictions were made. Some of these have since been confirmed. On average, 10% or more are expected to be genuine disease genes, representing a 10-fold enrichment compared with positional information only. Examples of interesting candidates are AKAP6 for arrhythmogenic right ventricular dysplasia 3 and SYN3 for familial partial epilepsy with variable foci. CONCLUSIONS: Exploiting protein-protein interactions can greatly increase the likelihood of finding positional candidate disease genes. When applied on a large scale they can lead to novel candidate gene predictions.

Keywords: Animals; Benchmarking; Databases, Protein; Disease; Genetic Predisposition to Disease; Humans; Protein Binding; Proteins
[Oloff2006Chemometric] Scott Oloff, Shuxing Zhang, Nagamani Sukumar, Curt Breneman, and Alexander Tropsha. Chemometric analysis of ligand receptor complementarity: identifying complementary ligands based on receptor information (colibri). J. Chem. Inf. Model., 46(2):844-851, 2006. [ bib | DOI | http ]
We have developed a novel structure-based chemoinformatics approach to search for Complimentary Ligands Based on Receptor Information (CoLiBRI). CoLiBRI is based on the representation of both receptor binding sites and their respective ligands in a space of universal chemical descriptors. The binding site atoms involved in the interaction with ligands are identified by the means of a computational geometry technique known as Delaunay tessellation as applied to X-ray characterized ligand-receptor complexes. TAE/RECON multiple chemical descriptors are calculated independently for each ligand as well as for its active site atoms. The representation of both ligands and active sites using chemical descriptors allows the application of well-known chemometric techniques in order to correlate chemical similarities between active sites and their respective ligands. We have established a protocol to map patterns of nearest neighbor active site vectors in a multidimensional TAE/RECON space onto those of their complementary ligands and vice versa. This protocol affords the prediction of a virtual complementary ligand vector in the ligand chemical space from the position of a known active site vector. This prediction is followed by chemical similarity calculations between this virtual ligand vector and those calculated for molecules in a chemical database to identify real compounds most similar to the virtual ligand. Consequently, the knowledge of the receptor active site structure affords straightforward and efficient identification of its complementary ligands in large databases of chemical compounds using rapid chemical similarity searches. Conversely, starting from the ligand chemical structure, one may identify possible complementary receptor cavities as well. We have applied the CoLiBRI approach to a data set of 800 X-ray characterized ligand-receptor complexes in the PDBbind database. 
Using a k nearest neighbor (kNN) pattern recognition approach and variable selection, we have shown that knowledge of the active site structure affords identification of its complementary ligand among the top 1% of a large chemical database in over 90% of all test active sites when a binding site of the same protein family was present in the training set. In the case where test receptors are highly dissimilar and not present among the receptor families in the training set, the prediction accuracy is decreased; however, CoLiBRI was still able to quickly eliminate 75% of the chemical database as improbable ligands. CoLiBRI affords rapid prefiltering of a large chemical database to eliminate compounds that have little chance of binding to a receptor active site.

Keywords: Algorithms; Binding Sites; Binding, Competitive; Computational Biology; Databases, Factual; Drug Design; Drug Evaluation, Preclinical; Ligands; Models, Biological; Structure-Activity Relationship
[Okuno2006GLIDA] Y. Okuno, J. Yang, K. Taneishi, H. Yabuuchi, and G. Tsujimoto. GLIDA: GPCR-ligand database for chemical genomic drug discovery. Nucleic Acids Res., 34(Database issue):D673-D677, Jan 2006. [ bib | DOI | http ]
G-protein coupled receptors (GPCRs) represent one of the most important families of drug targets in pharmaceutical development. GPCR-LIgand DAtabase (GLIDA) is a novel public GPCR-related chemical genomic database that is primarily focused on the correlation of information between GPCRs and their ligands. It provides correlation data between GPCRs and their ligands, along with chemical information on the ligands, as well as access information to the various web databases regarding GPCRs. These data are connected with each other in a relational database, allowing users in the field of GPCR-related drug discovery to easily retrieve such information from either biological or chemical starting points. GLIDA includes structure similarity search functions for the GPCRs and for their ligands. Thus, GLIDA can provide correlation maps linking the searched homologous GPCRs (or ligands) with their ligands (or GPCRs). By analyzing the correlation patterns between GPCRs and ligands, we can gain more detailed knowledge about their interactions and improve drug design efforts by focusing on inferred candidates for GPCR-specific drugs. GLIDA is publicly available at http://gdds.pharm.kyoto-u.ac.jp:8081/glida. We hope that it will prove very useful for chemical genomic research and GPCR-related drug discovery.

Keywords: chemogenomics
[OConnor2006BiotechnolLett] KC. O'Connor, JW. Muhitch, DJ. Lacks, and M. Al-Rubeai. Modeling suppression of cell death by bcl-2 over-expression in myeloma ns0 6a1 cells. Biotechnol Lett, 28(23):1919-24, 2006. [ bib ]
A novel population-balance model was employed to evaluate the suppression of cell death in myeloma NS0 6A1 cells metabolically engineered to over-express the apoptotic suppressor Bcl-2. The model is robust in its ability to simulate cell population dynamics in batch suspension culture and in response to thymidine-induced growth inhibition: 89 percent of simulated cell concentrations are within two standard deviations of experimental data. Kinetic rate constants in model equations suggest that Bcl-2 over-expression extends culture longevity from 6 days to at least 15 days by suppressing the specific rate of early apoptotic cell formation by more than 6-fold and necrotic cell formation by at least 3-fold, despite nearly a 3-fold decrease in initial cell growth rate and no significant change in the specific rate of late apoptotic cell formation. This computational analysis supports a mechanism in which Bcl-2 is a common mediator of early apoptotic and necrotic events occurring at rates that are dependent on cellular factors accumulating over time. The model has current application to the rational design of cell cultures through metabolic engineering for the industrial production of biopharmaceuticals.

Keywords: csbcbook
[Nocedal2006Numerical] J. Nocedal and S. Wright. Numerical optimization. Springer, 2006. [ bib ]
[Nguyen2006Experimental] Nam-Ky Nguyen and ER. Williams. Experimental designs for 2-colour cdna microarray experiments. Applied Stochastic Models in Business and Industry, 22:631-638, 2006. [ bib ]
[Neuvial2006Spatial] Pierre Neuvial, Philippe Hupé, Isabel Brito, Stéphane Liva, Elodie Manié, Caroline Brennetot, François Radvanyi, Alain Aurias, and Emmanuel Barillot. Spatial normalization of array-cgh data. BMC Bioinformatics, 7:264, 2006. [ bib | DOI | http ]
BACKGROUND: Array-based comparative genomic hybridization (array-CGH) is a recently developed technique for analyzing changes in DNA copy number. As in all microarray analyses, normalization is required to correct for experimental artifacts while preserving the true biological signal. We investigated various sources of systematic variation in array-CGH data and identified two distinct types of spatial effect of no biological relevance as the predominant experimental artifacts: continuous spatial gradients and local spatial bias. Local spatial bias affects a large proportion of arrays, and has not previously been considered in array-CGH experiments. RESULTS: We show that existing normalization techniques do not correct these spatial effects properly. We therefore developed an automatic method for the spatial normalization of array-CGH data. This method makes it possible to delineate and to eliminate and/or correct areas affected by spatial bias. It is based on the combination of a spatial segmentation algorithm called NEM (Neighborhood Expectation Maximization) and spatial trend estimation. We defined quality criteria for array-CGH data, demonstrating significant improvements in data quality with our method for three data sets coming from two different platforms (198, 175 and 26 BAC-arrays). CONCLUSION: We have designed an automatic algorithm for the spatial normalization of BAC CGH-array data, preventing the misinterpretation of experimental artifacts as biologically relevant outliers in the genomic profile. This algorithm is implemented in the R package MANOR (Micro-Array NORmalization), which is described at http://bioinfo.curie.fr/projects/manor and available from the Bioconductor site http://www.bioconductor.org. It can also be tested on the CAPweb bioinformatics platform at http://bioinfo.curie.fr/CAPweb.

Keywords: Algorithms; Artifacts; Base Sequence; Chromosome Mapping, methods; Computer Simulation; Data Interpretation, Statistical; Gene Dosage; In Situ Hybridization, methods; Models, Genetic; Models, Statistical; Molecular Sequence Data; Oligonucleotide Array Sequence Analysis, methods; Sequence Analysis, DNA, methods
[Neuhaus06Fast] M. Neuhaus, K. Riesen, and H. Bunke. Fast suboptimal algorithms for the computation of graph edit distance. In Dit-Yan Yeung, James T. Kwok, Ana LN. Fred, Fabio Roli, and Dick de Ridder, editors, SSPR/SPR, volume 4109 of Lecture Notes in Computer Science, pages 163-172. Springer, 2006. [ bib | http ]
[Neuhaus2006Edit] M. Neuhaus and H. Bunke. Edit distance-based kernel functions for structural pattern classification. Pattern Recognition, 39(10):1852-1863, Oct 2006. [ bib | DOI | http | .pdf ]
A common approach in structural pattern classification is to define a dissimilarity measure on patterns and apply a distance-based nearest-neighbor classifier. In this paper, we introduce an alternative method for classification using kernel functions based on edit distance. The proposed approach is applicable to both string and graph representations of patterns. By means of the kernel functions introduced in this paper, string and graph classification can be performed in an implicit vector space using powerful statistical algorithms. The validity of the kernel method cannot be established for edit distance in general. However, by evaluating theoretical criteria we show that the kernel functions are nevertheless suitable for classification, and experiments on various string and graph datasets clearly demonstrate that nearest-neighbor classifiers can be outperformed by support vector machines using the proposed kernel functions.
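
As a rough illustration of the idea in this abstract, the sketch below turns a string edit distance into a similarity function. The exponential form exp(-gamma * d) and the names edit_distance, edit_kernel and gamma are assumptions chosen for illustration, not necessarily the kernel functions defined in the paper; as the abstract itself notes, such functions are not guaranteed to be valid (positive definite) kernels in general.

```python
import math


def edit_distance(s, t):
    """Levenshtein distance between two strings (unit costs)."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]  # cost of deleting the first i characters of s
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_kernel(s, t, gamma=0.5):
    """Hypothetical similarity: decays exponentially with edit distance."""
    return math.exp(-gamma * edit_distance(s, t))
```

A Gram matrix built from edit_kernel could then be passed to a support vector machine that accepts precomputed kernels, with the caveat above about positive definiteness.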

Keywords: image
[Nakabayashi2006Mathematical] J. Nakabayashi and A. Sasaki. A mathematical model for apoptosome assembly: The optimal cytochrome c/apaf-1 ratio. J. Theor. Biol., 242(2):280-287, 2006. [ bib | DOI | http ]
Apoptosis, a highly conserved form of cell suicide, is regulated by apoptotic signals and their transduction with caspases, a family of cysteine proteases. Caspases are constantly expressed in the normal cells as inactive pro-enzymes. The activity of caspase is regulated by the proteolysis. Sequential proteolytic reactions of caspases are needed to execute apoptosis. Mitochondrial pathway is one of these apoptotic signal pathways, in which caspases are oligomerized into characteristic heptamer structure, called apoptosome, with caspase-9 that activate the effector caspases for apoptosis. To investigate the dynamics of signal transduction pathway regulated by oligomerization, we construct a mathematical model for Apaf-1 heptamer assembly process. The model first reveals that intermediate products can remain unconverted even after all assembly reactions are completed. The second result of the model is that the conversion efficiency of Apaf-1 heptamer assembly is maximized when the initial concentration of cytochrome c is equal to that of Apaf-1. When the concentration of cytochrome c is sufficiently larger or smaller than that of Apaf-1, the final Apaf-1 heptamer production is decreased, because intermediate Apaf-1 oligomers (tetramers and bigger oligomers), which themselves are unable to form active heptamer, accumulate too fast in the cells, choking a smooth production of Apaf-1 heptamer. Slow activation of Apaf-1 monomers and small oligomers increases the conversion efficiency. We also study the optimal number of subunits comprising an active oligomer that maximizes the conversion efficiency in assembly process, and found that the tetramer is the optimum.

Keywords: csbcbook
[Naderi2006gene-expression] A. Naderi, AE. Teschendorff, NL. Barbosa-Morais, SE. Pinder, AR. Green, DG. Powe, JFR. Robertson, S. Aparicio, IO. Ellis, JD. Brenton, and C. Caldas. A gene-expression signature to predict survival in breast cancer across independent data sets. Oncogene, 26(10):1507-1516, 2006. [ bib | DOI | http ]
[Mishra2006Human] G.R. Mishra, M. Suresh, K. Kumaran, N. Kannabiran, S. Suresh, P. Bala, K. Shivakumar, N. Anuradha, R. Reddy, T.M. Raghavan, S. Menon, G. Hanumanthu, M. Gupta, S. Upendran, S. Gupta, M. Mahesh, B. Jacob, P. Mathew, P. Chatterjee, K.S. Arun, S. Sharma, K.N. Chandrika, N. Deshpande, K. Palvankar, R. Raghavnath, R. Krishnakanth, H. Karathia, B. Rekha, R. Nayak, G. Vishnupriya, H.G.M. Kumar, M. Nagini, G.S.S. Kumar, R. Jose, P. Deepthi, S.S. Mohan, T.K.B. Gandhi, H.C. Harsha, K.S. Deshpande, M. Sarker, T.S.K. Prasad, and A. Pandey. Human protein reference database-2006 update. Nucleic Acids Res, 34(Database issue):D411-D414, Jan 2006. [ bib | DOI | http ]
Human Protein Reference Database (HPRD) (http://www.hprd.org) was developed to serve as a comprehensive collection of protein features, post-translational modifications (PTMs) and protein-protein interactions. Since the original report, this database has increased to >20 000 protein entries and has become the largest database for literature-derived protein-protein interactions (>30 000) and PTMs (>8000) for human proteins. We have also introduced several new features in HPRD including: (i) protein isoforms, (ii) enhanced search options, (iii) linking of pathway annotations and (iv) integration of a novel browser, GenProt Viewer (http://www.genprot.org), developed by us that allows integration of genomic and proteomic information. With the continued support and active participation by the biomedical community, we expect HPRD to become a unique source of curated information for the human proteome and spur biomedical discoveries based on integration of genomic, transcriptomic and proteomic data.

Keywords: Databases, Protein; Genomics; Humans; Internet; Protein Interaction Mapping; Protein Isoforms; Protein Processing, Post-Translational; Proteins; Proteome; Proteomics; Signal Transduction; Systems Integration; User-Computer Interface
[Miranda2006pattern-based] Kevin C Miranda, Tien Huynh, Yvonne Tay, Yen-Sin Ang, Wai-Leong Tam, Andrew M Thomson, Bing Lim, and Isidore Rigoutsos. A pattern-based method for the identification of microrna binding sites and their corresponding heteroduplexes. Cell, 126(6):1203-1217, Sep 2006. [ bib | DOI | http | .pdf ]
We present rna22, a method for identifying microRNA binding sites and their corresponding heteroduplexes. Rna22 does not rely upon cross-species conservation, is resilient to noise, and, unlike previous methods, it first finds putative microRNA binding sites in the sequence of interest, then identifies the targeting microRNA. Computationally, we show that rna22 identifies most of the currently known heteroduplexes. Experimentally, with luciferase assays, we demonstrate average repressions of 30% or more for 168 of 226 tested targets. The analysis suggests that some microRNAs may have as many as a few thousand targets, and that between 74% and 92% of the gene transcripts in four model genomes are likely under microRNA control through their untranslated and amino acid coding regions. We also extended the method's key idea to a low-error microRNA-precursor-discovery scheme; our studies suggest that the number of microRNA precursors in mammalian genomes likely ranges in the tens of thousands.

Keywords: sirna
[Meinshausen2006High] N. Meinshausen and P. Bühlmann. High dimensional graphs and variable selection with the lasso. Ann. Stat., 34:1436-1462, 2006. [ bib | DOI | http | .pdf ]
The pattern of zero entries in the inverse covariance matrix of a multivariate normal distribution corresponds to conditional independence restrictions between variables. Covariance selection aims at estimating those structural zeros from data. We show that neighborhood selection with the Lasso is a computationally attractive alternative to standard covariance selection for sparse high-dimensional graphs. Neighborhood selection estimates the conditional independence restrictions separately for each node in the graph and is hence equivalent to variable selection for Gaussian linear models. We show that the proposed neighborhood selection scheme is consistent for sparse high-dimensional graphs. Consistency hinges on the choice of the penalty parameter. The oracle value for optimal prediction does not lead to a consistent neighborhood estimate. Controlling instead the probability of falsely joining some distinct connectivity components of the graph, consistent estimation for sparse graphs is achieved (with exponential rates), even when the number of variables grows as the number of observations raised to an arbitrary power.

[Mattick2006Non] John S. Mattick and Igor V. Makunin. Non-coding RNA. Hum. Mol. Genet., 15:R17-R29, 2006. [ bib ]
Keywords: csbcbook
[Margolin2006ARACNE] AA. Margolin, I. Nemenman, K. Basso, C. Wiggins, G. Stolovitzky, R. Dalla Favera, and A. Califano. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics, 7 Suppl 1:S7, 2006. [ bib | DOI | http | .pdf ]
BACKGROUND: Elucidating gene regulatory networks is crucial for understanding normal cell physiology and complex pathologic phenotypes. Existing computational methods for the genome-wide "reverse engineering" of such networks have been successful only for lower eukaryotes with simple genomes. Here we present ARACNE, a novel algorithm, using microarray expression profiles, specifically designed to scale up to the complexity of regulatory networks in mammalian cells, yet general enough to address a wider range of network deconvolution problems. This method uses an information theoretic approach to eliminate the majority of indirect interactions inferred by co-expression methods. RESULTS: We prove that ARACNE reconstructs the network exactly (asymptotically) if the effect of loops in the network topology is negligible, and we show that the algorithm works well in practice, even in the presence of numerous loops and complex topologies. We assess ARACNE's ability to reconstruct transcriptional regulatory networks using both a realistic synthetic dataset and a microarray dataset from human B cells. On synthetic datasets ARACNE achieves very low error rates and outperforms established methods, such as Relevance Networks and Bayesian Networks. Application to the deconvolution of genetic networks in human B cells demonstrates ARACNE's ability to infer validated transcriptional targets of the cMYC proto-oncogene. We also study the effects of misestimation of mutual information on network reconstruction, and show that algorithms based on mutual information ranking are more resilient to estimation errors. CONCLUSION: ARACNE shows promise in identifying direct transcriptional interactions in mammalian cellular networks, a problem that has challenged existing reverse engineering algorithms. 
This approach should enhance our ability to use microarray data to elucidate functional mechanisms that underlie cellular processes and to identify molecular targets of pharmacological compounds in mammalian cellular networks.
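
The elimination of indirect interactions mentioned in this abstract can be sketched as a pruning pass based on the data processing inequality: in every fully connected triplet of genes, the edge with the smallest mutual information is treated as indirect and dropped. The function name dpi_prune, the threshold parameter, and the dictionary layout are assumptions made for this illustration; mutual information estimation and the tolerance handling of the published algorithm are left out.

```python
from itertools import combinations


def dpi_prune(mi, threshold=0.0):
    """Keep edges above `threshold`, then drop the weakest edge of each triangle.

    `mi` maps frozenset({gene_a, gene_b}) to a mutual information estimate.
    """
    edges = {pair for pair, value in mi.items() if value > threshold}
    nodes = sorted({gene for pair in edges for gene in pair})
    to_remove = set()
    for a, b, c in combinations(nodes, 3):
        triangle = [frozenset((a, b)), frozenset((a, c)), frozenset((b, c))]
        # Triangles are checked against the thresholded graph, so the result
        # does not depend on the order in which triplets are visited.
        if all(edge in edges for edge in triangle):
            to_remove.add(min(triangle, key=lambda edge: mi[edge]))
    return edges - to_remove
```

For example, with mutual information 0.9 for A-B, 0.8 for B-C and 0.3 for A-C, the A-C edge is the weakest side of the triangle and is removed, leaving the chain A-B-C.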

[Mahe2006Pharmacophore] P. Mahé, L. Ralaivola, V. Stoven, and J.-P. Vert. The pharmacophore kernel for virtual screening with support vector machines. J. Chem. Inf. Model., 46(5):2003-2014, 2006. [ bib | DOI | http | .pdf ]
We introduce a family of positive definite kernels specifically optimized for the manipulation of 3D structures of molecules with kernel methods. The kernels are based on the comparison of the three-point pharmacophores present in the 3D structures of molecules, a set of molecular features known to be particularly relevant for virtual screening applications. We present a computationally demanding exact implementation of these kernels, as well as fast approximations related to the classical fingerprint-based approaches. Experimental results suggest that this new approach is competitive with state-of-the-art algorithms based on the 2D structure of molecules for the detection of inhibitors of several drug targets.

Keywords: chemoinformatics
[Mahe2006pharmacophorea] P. Mahé, L. Ralaivola, V. Stoven, and J.-P. Vert. The pharmacophore kernel for virtual screening with support vector machines. Technical Report HAL:ccsd-00020066, Ecole des Mines de Paris, March 2006. [ bib | http ]
Keywords: chemoinformatics kernel-theory
[Ma2006MSB] Wenzhe Ma, Luhua Lai, Qi Ouyang, and Chao Tang. Robustness and modular design of the drosophila segment polarity network. Mol Syst Biol, 2:70, 2006. [ bib | DOI | http ]
Biomolecular networks have to perform their functions robustly. A robust function may have preferences in the topological structures of the underlying network. We carried out an exhaustive computational analysis on network topologies in relation to a patterning function in Drosophila embryogenesis. We found that whereas the vast majority of topologies can either not perform the required function or only do so very fragilely, a small fraction of topologies emerges as particularly robust for the function. The topology adopted by Drosophila, that of the segment polarity network, is a top ranking one among all topologies with no direct autoregulation. Furthermore, we found that all robust topologies are modular, each being a combination of three kinds of modules. These modules can be traced back to three subfunctions of the patterning function, and their combinations provide a combinatorial variability for the robust topologies. Our results suggest that the requirement of functional robustness drastically reduces the choices of viable topology to a limited set of modular combinations among which nature optimizes its choice under evolutionary and other biological constraints.

Keywords: Animals; Biological Evolution; Body Patterning; Computer Simulation; Drosophila Proteins, physiology; Drosophila melanogaster, anatomy & histology/physiology; Feedback, Physiological; Gene Expression Regulation, Developmental; Genes, Insect; Models, Biological; Signal Transduction; Systems Biology, methods; Transcription Factors
[Loosli2006SimpleSVM] G. Loosli. Simplesvm toolbox. Available at http://asi.insa-rouen.frgloosli/simpleSVM.html, 2006. [ bib ]
[Llinas2006Comparative] M. Llinás, Z. Bozdech, ED. Wong, AT. Adai, and JL. DeRisi. Comparative whole genome transcriptome analysis of three Plasmodium falciparum strains. Nucleic Acids Res., 34(4):1166-1173, 2006. [ bib | DOI | http ]
Gene expression patterns have been demonstrated to be highly variable between similar cell types, for example lab strains and wild strains of Saccharomyces cerevisiae cultured under identical growth conditions exhibit a wide range of expression differences. We have used a genome-wide approach to characterize transcriptional differences between strains of Plasmodium falciparum by characterizing the transcriptome of the 48 h intraerythrocytic developmental cycle (IDC) for two strains, 3D7 and Dd2 and compared these results to our prior work using the HB3 strain. These three strains originate from geographically diverse locations and possess distinct drug sensitivity phenotypes. Our goal was to identify transcriptional differences related to phenotypic properties of these strains including immune evasion and drug sensitivity. We find that the highly streamlined transcriptome is remarkably well conserved among all three strains, and differences in gene expression occur mainly in genes coding for surface antigens involved in parasite-host interactions. Our analysis also detects several transcripts that are unique to individual strains as well as identifying large chromosomal deletions and highly polymorphic regions across strains. The majority of these genes are uncharacterized and have no homology to other species. These tractable transcriptional differences provide important phenotypes for these otherwise highly related strains of Plasmodium.

Keywords: plasmodium
[Legewie2006PLOSCompBiol] S. Legewie, N. Blüthgen, and H. Herzel. Mathematical modeling identifies inhibitors of apoptosis as mediators of positive feedback and bistability. PLoS Comput Biol, 2(9):e120, Sep 2006. [ bib | DOI | http ]
The intrinsic, or mitochondrial, pathway of caspase activation is essential for apoptosis induction by various stimuli including cytotoxic stress. It depends on the cellular context, whether cytochrome c released from mitochondria induces caspase activation gradually or in an all-or-none fashion, and whether caspase activation irreversibly commits cells to apoptosis. By analyzing a quantitative kinetic model, we show that inhibition of caspase-3 (Casp3) and Casp9 by inhibitors of apoptosis (IAPs) results in an implicit positive feedback, since cleaved Casp3 augments its own activation by sequestering IAPs away from Casp9. We demonstrate that this positive feedback brings about bistability (i.e., all-or-none behaviour), and that it cooperates with Casp3-mediated feedback cleavage of Casp9 to generate irreversibility in caspase activation. Our calculations also unravel how cell-specific protein expression brings about the observed qualitative differences in caspase activation (gradual versus all-or-none and reversible versus irreversible). Finally, known regulators of the pathway are shown to efficiently shift the apoptotic threshold stimulus, suggesting that the bistable caspase cascade computes multiple inputs into an all-or-none caspase output. As cellular inhibitory proteins (e.g., IAPs) frequently inhibit consecutive intermediates in cellular signaling cascades (e.g., Casp3 and Casp9), the feedback mechanism described in this paper is likely to be a widespread principle on how cells achieve ultrasensitivity, bistability, and irreversibility.

Keywords: csbcbook
[Leach2006Prediction] AR. Leach, BK. Shoichet, and CE. Peishoff. Prediction of protein-ligand interactions. docking and scoring: successes and gaps. J. Med. Chem., 49(20):5851-5855, Oct 2006. [ bib | DOI | http ]
Keywords: Binding Sites; Drug Design; Ligands; Models, Molecular; Protein Binding; Proteins, chemistry; Quantitative Structure-Activity Relationship
[Lavielle2006Detection] M. Lavielle and G. Teyssière. Detection of multiple change-points in multivariate time series. Lithuanian Mathematical Journal, 46(3):287-306, 2006. [ bib | DOI | http | .pdf ]
[Lai2006comparison] C. Lai, MJT. Reinders, LJ. van't Veer, and LFA. Wessels. A comparison of univariate and multivariate gene selection techniques for classification of cancer datasets. BMC bioinformatics, 7(1):235, 2006. [ bib | DOI | http | .pdf ]
[Kurata2006PlosCompBio] Hiroyuki Kurata, Hana El-Samad, Rei Iwasaki, Hisao Ohtake, John C Doyle, Irina Grigorova, Carol A Gross, and Mustafa Khammash. Module-based analysis of robustness tradeoffs in the heat shock response system. PLoS Comput Biol, 2(7):e59, Jul 2006. [ bib | DOI | http ]
Biological systems have evolved complex regulatory mechanisms, even in situations where much simpler designs seem to be sufficient for generating nominal functionality. Using module-based analysis coupled with rigorous mathematical comparisons, we propose that in analogy to control engineering architectures, the complexity of cellular systems and the presence of hierarchical modular structures can be attributed to the necessity of achieving robustness. We employ the Escherichia coli heat shock response system, a strongly conserved cellular mechanism, as an example to explore the design principles of such modular architectures. In the heat shock response system, the sigma-factor sigma32 is a central regulator that integrates multiple feedforward and feedback modules. Each of these modules provides a different type of robustness with its inherent tradeoffs in terms of transient response and efficiency. We demonstrate how the overall architecture of the system balances such tradeoffs. An extensive mathematical exploration nevertheless points to the existence of an array of alternative strategies for the existing heat shock response that could exhibit similar behavior. We therefore deduce that the evolutionary constraints facing the system might have steered its architecture toward one of many robustly functional solutions.

Keywords: Computer Simulation; Escherichia coli Proteins, metabolism; Escherichia coli, metabolism; Feedback, physiology; Gene Expression Regulation, Bacterial, physiology; Heat-Shock Proteins, metabolism; Heat-Shock Response, physiology; Models, Biological; Oxidative Stress, physiology; Signal Transduction, physiology; Systems Biology, methods
[Kubinyi2006Chemogenomics] H. Kubinyi. Chemogenomics in drug discovery. Ernst Schering Res Found Workshop, 58:1-19, 2006. [ bib ]
Chemogenomics is a new strategy in drug discovery which, in principle, searches for all molecules that are capable of interacting with any biological target. Because of the almost infinite number of drug-like organic molecules, this is an impossible task. Therefore chemogenomics has been defined as the investigation of classes of compounds (libraries) against families of functionally related proteins. In this definition, chemogenomics deals with the systematic analysis of chemical-biological interactions. Congeneric series of chemical analogs are probes to investigate their action on specific target classes, e.g., GPCRs, kinases, phosphodiesterases, ion channels, serine proteases, and others. Whereas such a strategy developed in pharmaceutical industry almost 20 years ago, it is now more systematically applied in the search for target- and subtype-specific ligands. The term "privileged structures" has been defined for scaffolds, such as the benzodiazepines, which very often produce biologically active analogs in a target family, in this case in the class of G-protein-coupled receptors. The SOSA approach is a strategy to modify the selectivity of biologically active compounds, generating new drug candidates from the side activities of therapeutically used drugs.

Keywords: Animals; Chemistry, Pharmaceutical; Combinatorial Chemistry Techniques; Drug Design; Drug Industry; Genomics; Humans; Models, Chemical; Molecular Structure; Mutation; Pharmacogenetics; Protein Binding
[TransPath2006] M. Krull, S. Pistor, N. Voss, A. Kel, I. Reuter, D. Kronenberg, H. Michael, K. Schwarzer, A. Potapov, C. Choi, O. Kel-Margoulis, and E. Wingender. TRANSPATH: an information resource for storing and visualizing signaling pathways and their pathological aberrations. Nucleic Acids Res, 34(Database issue):D546-51, 2006. [ bib ]
TRANSPATH is a database about signal transduction events. It provides information about signaling molecules, their reactions and the pathways these reactions constitute. The representation of signaling molecules is organized in a number of orthogonal hierarchies reflecting the classification of the molecules, their species-specific or generic features, and their post-translational modifications. Reactions are similarly hierarchically organized in a three-layer architecture, differentiating between reactions that are evidenced by individual publications, generalizations of these reactions to construct species-independent 'reference pathways' and the 'semantic projections' of these pathways. A number of search and browse options allow easy access to the database contents, which can be visualized with the tool PathwayBuilder™. The module PathoSign adds data about pathologically relevant mutations in signaling components, including their genotypes and phenotypes. TRANSPATH and PathoSign can be used as an encyclopaedia, in the educational process, for visualization and modeling of signal transduction networks and for the analysis of gene expression data. TRANSPATH Public 6.0 is freely accessible for users from non-profit organizations under http://www.gene-regulation.com/pub/databases.html.

[Koyutuerk2006Pairwise] M. Koyutürk, Y. Kim, U. Topkara, S. Subramaniam, W. Szpankowski, and A. Grama. Pairwise alignment of protein interaction networks. J. Comput. Biol., 13(2):182-199, Mar 2006. [ bib | DOI | http ]
With an ever-increasing amount of available data on protein-protein interaction (PPI) networks and research revealing that these networks evolve at a modular level, discovery of conserved patterns in these networks becomes an important problem. Although available data on protein-protein interactions is currently limited, recently developed algorithms have been shown to convey novel biological insights through employment of elegant mathematical models. The main challenge in aligning PPI networks is to define a graph theoretical measure of similarity between graph structures that captures underlying biological phenomena accurately. In this respect, modeling of conservation and divergence of interactions, as well as the interpretation of resulting alignments, are important design parameters. In this paper, we develop a framework for comprehensive alignment of PPI networks, which is inspired by duplication/divergence models that focus on understanding the evolution of protein interactions. We propose a mathematical model that extends the concepts of match, mismatch, and gap in sequence alignment to that of match, mismatch, and duplication in network alignment and evaluates similarity between graph structures through a scoring function that accounts for evolutionary events. By relying on evolutionary models, the proposed framework facilitates interpretation of resulting alignments in terms of not only conservation but also divergence of modularity in PPI networks. Furthermore, as in the case of sequence alignment, our model allows flexibility in adjusting parameters to quantify underlying evolutionary relationships. Based on the proposed model, we formulate PPI network alignment as an optimization problem and present fast algorithms to solve this problem. Detailed experimental results from an implementation of the proposed framework show that our algorithm is able to discover conserved interaction patterns very effectively, in terms of both accuracies and computational cost.

[Korber2006Immunoinformatics] Bette Korber, Montiago LaBute, and Karina Yusim. Immunoinformatics comes of age. PLoS Comput. Biol., 2(6):e71, Jun 2006. [ bib | DOI | http ]
With the burgeoning immunological data in the scientific literature, scientists must increasingly rely on Internet resources to inform and enhance their work. Here we provide a brief overview of the adaptive immune response and summaries of immunoinformatics resources, emphasizing those with Web interfaces. These resources include searchable databases of epitopes and immune-related molecules, and analysis tools for T cell and B cell epitope prediction, vaccine design, and protein structure comparisons. There is an agreeable synergy between the growing collections in immune-related databases and the growing sophistication of analysis software; the databases provide the foundation for developing predictive computational tools, which in turn enable more rapid identification of immune responses to populate the databases. Collectively, these resources contribute to improved understanding of immune responses and escape, and evolution of pathogens under immune pressure. The public health implications are vast, including designing vaccines, understanding autoimmune diseases, and defining the correlates of immune protection.

Keywords: Amino Acid Sequence; Animals; Computational Biology; Databases, Factual; Epitopes, B-Lymphocyte; Epitopes, T-Lymphocyte; Humans; Immunity
[Klabunde2006Chemogenomics] T. Klabunde and R. Jäger. Chemogenomics approaches to g-protein coupled receptor lead finding. Ernst Schering Res Found Workshop, 58:31-46, 2006. [ bib ]
G-protein coupled receptors (GPCRs) are promising targets for the discovery of novel drugs. In order to identify novel chemical series, high-throughput screening (HTS) is often complemented by rational chemogenomics lead finding approaches. We have compiled a GPCR directed screening set by ligand-based virtual screening of our corporate compound database. This set of compounds is supplemented with novel libraries synthesized around proprietary scaffolds. These target-directed libraries are designed using the knowledge of privileged fragments and pharmacophores to address specific GPCR subfamilies (e.g., purinergic or chemokine-binding GPCRs). Experimental testing of the GPCR collection has provided novel chemical series for several GPCR targets including the adenosine A1, the P2Y12, and the chemokine CCR1 receptor. In addition, GPCR sequence motifs linked to the recognition of GPCR ligands (termed chemoprints) are identified using homology modeling, molecular docking, and experimental profiling. These chemoprints can support the design and synthesis of compound libraries tailor-made for a novel GPCR target.

Keywords: chemogenomics
[Klabunde2006ChemogenomicsA] T. Klabunde. Chemogenomics approaches to ligand design. In Ligand Design for G Protein-coupled Receptors, chapter 7, pages 115-135. Wiley-VCH, Great Britain, 2006. [ bib ]
[Kim2006Blockwise] Y. Kim, J. Kim, and K. Kim. Blockwise sparse regression. Statistica Sinica, 16:375-390, 2006. [ bib | .pdf ]
Keywords: lasso
[Kharchenko2006Identifying] P. Kharchenko, L. Chen, Y. Freund, D. Vitkup, and G. M. Church. Identifying metabolic enzymes with multiple types of association evidence. BMC Bioinformatics, 7:177, 2006. [ bib | DOI | http ]
BACKGROUND: Existing large-scale metabolic models of sequenced organisms commonly include enzymatic functions which can not be attributed to any gene in that organism. Existing computational strategies for identifying such missing genes rely primarily on sequence homology to known enzyme-encoding genes. RESULTS: We present a novel method for identifying genes encoding for a specific metabolic function based on a local structure of metabolic network and multiple types of functional association evidence, including clustering of genes on the chromosome, similarity of phylogenetic profiles, gene expression, protein fusion events and others. Using E. coli and S. cerevisiae metabolic networks, we illustrate predictive ability of each individual type of association evidence and show that significantly better predictions can be obtained based on the combination of all data. In this way our method is able to predict 60% of enzyme-encoding genes of E. coli metabolism within the top 10 (out of 3551) candidates for their enzymatic function, and as a top candidate within 43% of the cases. CONCLUSION: We illustrate that a combination of genome context and other functional association evidence is effective in predicting genes encoding metabolic enzymes. Our approach does not rely on direct sequence homology to known enzyme-encoding genes, and can be used in conjunction with traditional homology-based metabolic reconstruction methods. The method can also be used to target orphan metabolic activities.

[Kapp2006Discovery] Amy V Kapp, Stefanie S Jeffrey, Anita Langerød, Anne-Lise Børresen-Dale, Wonshik Han, Dong-Young Noh, Ida R K Bukholm, Monica Nicolau, Patrick O Brown, and Robert Tibshirani. Discovery and validation of breast cancer subtypes. BMC Genomics, 7:231, 2006. [ bib | DOI | http ]
Previous studies demonstrated breast cancer tumor tissue samples could be classified into different subtypes based upon DNA microarray profiles. The most recent study presented evidence for the existence of five different subtypes: normal breast-like, basal, luminal A, luminal B, and ERBB2+. Based upon the analysis of 599 microarrays (five separate cDNA microarray datasets) using a novel approach, we present evidence in support of the most consistently identifiable subtypes of breast cancer tumor tissue microarrays being: ESR1+/ERBB2-, ESR1-/ERBB2-, and ERBB2+ (collectively called the ESR1/ERBB2 subtypes). We validate all three subtypes statistically and show the subtype to which a sample belongs is a significant predictor of overall survival and distant-metastasis free probability. As a consequence of the statistical validation procedure we have a set of centroids which can be applied to any microarray (indexed by UniGene Cluster ID) to classify it to one of the ESR1/ERBB2 subtypes. Moreover, the method used to define the ESR1/ERBB2 subtypes is not specific to the disease. The method can be used to identify subtypes in any disease for which there are at least two independent microarray datasets of disease samples.

Keywords: Algorithms; Breast Neoplasms, classification/genetics/pathology; Female; Gene Expression Profiling, methods/statistics & numerical data; Humans; Multivariate Analysis; Oligonucleotide Array Sequence Analysis, methods/statistics & numerical data; Proportional Hazards Models; Risk Factors; Survival Analysis
[Jojic2006Learning] N. Jojic, M. Reyes-Gomez, D. Heckerman, C. Kadie, and O. Schueler-Furman. Learning MHC I-peptide binding. Bioinformatics, 22(14):e227-e235, Jul 2006. [ bib | DOI | http ]
MOTIVATION AND RESULTS: Motivated by the ability of a simple threading approach to predict MHC I-peptide binding, we developed a new and improved structure-based model for which parameters can be estimated from additional sources of data about MHC-peptide binding. In addition to the known 3D structures of a small number of MHC-peptide complexes that were used in the original threading approach, we included three other sources of information on peptide-MHC binding: (1) MHC class I sequences; (2) known binding energies for a large number of MHC-peptide complexes; and (3) an even larger binary dataset that contains information about strong binders (epitopes) and non-binders (peptides that have a low affinity for a particular MHC molecule). Our model significantly outperforms the standard threading approach in binding energy prediction. In our approach, which we call adaptive double threading, the parameters of the threading model are learnable, and both MHC and peptide sequences can be threaded onto structures of other alleles. These two properties make our model appropriate for predicting binding for alleles for which very little data (if any) is available beyond just their sequence, including prediction for alleles for which 3D structures are not available. The ability of our model to generalize beyond the MHC types for which training data is available also separates our approach from epitope prediction methods which treat MHC alleles as symbolic types, rather than biological sequences. We used the trained binding energy predictor to study viral infections in 246 HIV patients from the West Australian cohort, and over 1000 sequences in HIV clade B from Los Alamos National Laboratory database, capturing the course of HIV evolution over the last 20 years. Finally, we illustrate short-, medium-, and long-term adaptation of HIV to the human immune system. AVAILABILITY: http://www.research.microsoft.comjojic/hlaBinding.html.

Keywords: immunoinformatics
[Jia2006Demonstration] P. Jia, T. Shi, Y. Cai, and Y. Li. Demonstration of two novel methods for predicting functional siRNA efficiency. BMC Bioinformatics, 7:271, 2006. [ bib | DOI | http | .pdf ]
BACKGROUND: siRNAs are small RNAs that serve as sequence determinants during the gene silencing process called RNA interference (RNAi). It is well known that siRNA efficiency is crucial in the RNAi pathway, and the siRNA efficiency for targeting different sites of a specific gene varies greatly. Therefore, there is high demand for reliable siRNA prediction tools and for design methods able to pick out siRNAs with high silencing potential. RESULTS: In this paper, two systems have been established for the prediction of functional siRNAs: (1) a statistical model based on sequence information and (2) a machine learning model based on three features of siRNA sequences, namely binary description, thermodynamic profile and nucleotide composition. Both methods show high performance on the two datasets we have constructed for training the model. CONCLUSION: Both methods studied in this paper emphasize the importance of sequence information for the prediction of functional siRNAs. The way of denoting a bio-sequence by a binary system in mathematical language might be helpful in other analysis work associated with fixed-length bio-sequences.

Keywords: sirna
[Jamieson2006Medicinal] C. Jamieson, E. M. Moir, Z. Rankovic, and G. Wishart. Medicinal chemistry of hERG optimizations: Highlights and hang-ups. J. Med. Chem., 49(17):5029-5046, Aug 2006. [ bib | DOI | http | .pdf ]
Keywords: herg
[Jacoby20067] Edgar Jacoby, Rochdi Bouhelal, Marc Gerspacher, and Klaus Seuwen. The 7 tm g-protein-coupled receptor target family. ChemMedChem, 1(8):761-782, Aug 2006. [ bib | DOI | http ]
Chemical biology approaches have a long history in the exploration of the G-protein-coupled receptor (GPCR) family, which represents the largest and most important group of targets for therapeutics. The analysis of the human genome revealed a significant number of new members with unknown physiological function which are today the focus of many reverse pharmacology drug-discovery programs. As the seven hydrophobic transmembrane segments are a defining common structural feature of these receptors, and as signaling through heterotrimeric G proteins is not demonstrated in all cases, these proteins are also referred to as seven transmembrane (7 TM) or serpentine receptors. This review summarizes important historic milestones of GPCR research, from the beginning, when pharmacology was mainly descriptive, to the age of modern molecular biology, with the cloning of the first receptor and now the availability of the entire human GPCR repertoire at the sequence and protein level. It shows how GPCR-directed drug discovery was initially based on the careful testing of a few specifically made chemical compounds and is today pursued with modern drug-discovery approaches, including combinatorial library design, structural biology, molecular informatics, and advanced screening technologies for the identification of new compounds that activate or inhibit GPCRs specifically. Such compounds, in conjunction with other new technologies, allow us to study the role of receptors in physiology and medicine, and will hopefully result in novel therapies. We also outline how basic research on the signaling and regulatory mechanisms of GPCRs is advancing, leading to the discovery of new GPCR-interacting proteins and thus opening new perspectives for drug development. Practical examples from GPCR expression studies, HTS (high-throughput screening), and the design of monoamine-related GPCR-focused combinatorial libraries illustrate ongoing GPCR chemical biology research. Finally, we outline future progress that may relate today's discoveries to the development of new medicines.

[Jacob2006Epitope] L. Jacob and J.-P. Vert. Epitope prediction improved by multitask support vector machines. Technical Report arXiv:q-bio/0702008v1, arXiv, 2006. [ bib ]
[Ivshina2006Genetic] A.V. Ivshina, J. George, O. Senko, B. Mow, T.C. Putti, J. Smeds, T. Lindahl, Y. Pawitan, P. Hall, H. Nordgren, et al. Genetic reclassification of histologic grade delineates new clinical subtypes of breast cancer. Cancer Research, 66(21):10292-10301, 2006. [ bib ]
[Irizarry2006Comparison] Rafael A Irizarry, Zhijin Wu, and Harris A Jaffee. Comparison of affymetrix genechip expression measures. Bioinformatics, 22(7):789-794, Apr 2006. [ bib | DOI | http ]
MOTIVATION: In the Affymetrix GeneChip system, preprocessing occurs before one obtains expression level measurements. Because the number of competing preprocessing methods was large and growing we developed a benchmark to help users identify the best method for their application. A webtool was made available for developers to benchmark their procedures. At the time of writing over 50 methods had been submitted. RESULTS: We benchmarked 31 probe set algorithms using a U95A dataset of spike in controls. Using this dataset, we found that background correction, one of the main steps in preprocessing, has the largest effect on performance. In particular, background correction appears to improve accuracy but, in general, worsen precision. The benchmark results put this balance in perspective. Furthermore, we have improved some of the original benchmark metrics to provide more detailed information regarding precision and accuracy. A handful of methods stand out as providing the best balance using spike-in data with the older U95A array, although different experiments on more current arrays may benchmark differently. AVAILABILITY: The affycomp package, now version 1.5.2, continues to be available as part of the Bioconductor project (http://www.bioconductor.org). The webtool continues to be available at http://affycomp.biostat.jhsph.edu CONTACT: rafa@jhu.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Keywords: Algorithms; Benchmarking; Gene Expression Profiling, methods; Oligonucleotide Array Sequence Analysis, instrumentation/methods; Reproducibility of Results; Software
[Huebert2006Genome-wide] Dana J Huebert, Michael Kamal, Aisling O'Donovan, and Bradley E Bernstein. Genome-wide analysis of histone modifications by chip-on-chip. Methods, 40(4):365-369, Dec 2006. [ bib | DOI | http ]
Post-translational modifications to histone proteins regulate the packaging of genomic DNA into chromatin, gene activity and other functions of the genome. They are understood to play key roles in embryonic development and disease pathogenesis. Recent advances in technology have made it possible to analyze chromatin structure genome-wide in mammalian cells. Global patterns of histone modifications can be observed using a technique called ChIP-on-chip, which combines the specificity of chromatin immunoprecipitation with the unbiased, high-throughput capabilities of microarrays. The resulting maps provide insight into the functions of, and relationships between, different modifications. Here, we provide validated ChIP-on-chip methods for analyzing histone modification patterns at genome-scale in mammalian cells.

Keywords: Animals; Chromatin Immunoprecipitation; Chromosomes, Mammalian; Genomics; Histone Code; Histones; Oligonucleotide Array Sequence Analysis; Protein Processing, Post-Translational
[Huang2006Ligsite] Bingding Huang and Michael Schroeder. Ligsitecsc: predicting ligand binding sites using the connolly surface and degree of conservation. BMC Struct Biol, 6:19, 2006. [ bib | DOI | http ]
BACKGROUND: Identifying pockets on protein surfaces is of great importance for many structure-based drug design applications and protein-ligand docking algorithms. Over the last ten years, many geometric methods for the prediction of ligand-binding sites have been developed. RESULTS: We present LIGSITEcsc, an extension and implementation of the LIGSITE algorithm. LIGSITEcsc is based on the notion of surface-solvent-surface events and the degree of conservation of the involved surface residues. We compare our algorithm to four other approaches, LIGSITE, CAST, PASS, and SURFNET, and evaluate all on a dataset of 48 unbound/bound structures and 210 bound-structures. LIGSITEcsc performs slightly better than the other tools and achieves a success rate of 71% and 75%, respectively. CONCLUSION: The use of the Connolly surface leads to slight improvements, the prediction re-ranking by conservation to significant improvements of the binding site predictions. A web server for LIGSITEcsc and its source code is available at scoppi.biotec.tu-dresden.de/pocket

Keywords: Algorithms; Binding Sites; Databases, Protein; Ligands; Models, Molecular; Proteins, chemistry
[Hu2006molecular] Zhiyuan Hu, Cheng Fan, Daniel S Oh, J. S. Marron, Xiaping He, Bahjat F Qaqish, Chad Livasy, Lisa A Carey, Evangeline Reynolds, Lynn Dressler, Andrew Nobel, Joel Parker, Matthew G Ewend, Lynda R Sawyer, Junyuan Wu, Yudong Liu, Rita Nanda, Maria Tretiakova, Alejandra Ruiz Orrico, Donna Dreher, Juan P Palazzo, Laurent Perreard, Edward Nelson, Mary Mone, Heidi Hansen, Michael Mullins, John F Quackenbush, Matthew J Ellis, Olufunmilayo I Olopade, Philip S Bernard, and Charles M Perou. The molecular portraits of breast tumors are conserved across microarray platforms. BMC Genomics, 7:96, 2006. [ bib | DOI | http ]
Validation of a novel gene expression signature in independent data sets is a critical step in the development of a clinically useful test for cancer patient risk-stratification. However, validation is often unconvincing because the size of the test set is typically small. To overcome this problem we used publicly available breast cancer gene expression data sets and a novel approach to data fusion, in order to validate a new breast tumor intrinsic list. A 105-tumor training set containing 26 sample pairs was used to derive a new breast tumor intrinsic gene list. This intrinsic list contained 1300 genes and a proliferation signature that was not present in previous breast intrinsic gene sets. We tested this list as a survival predictor on a data set of 311 tumors compiled from three independent microarray studies that were fused into a single data set using Distance Weighted Discrimination. When the new intrinsic gene set was used to hierarchically cluster this combined test set, tumors were grouped into LumA, LumB, Basal-like, HER2+/ER-, and Normal Breast-like tumor subtypes that we demonstrated in previous datasets. These subtypes were associated with significant differences in Relapse-Free and Overall Survival. Multivariate Cox analysis of the combined test set showed that the intrinsic subtype classifications added significant prognostic information that was independent of standard clinical predictors. From the combined test set, we developed an objective and unchanging classifier based upon five intrinsic subtype mean expression profiles (i.e. centroids), which is designed for single sample predictions (SSP). The SSP approach was applied to two additional independent data sets and consistently predicted survival in both systemically treated and untreated patient groups. This study validates the "breast tumor intrinsic" subtype classification as an objective means of tumor classification that should be translated into a clinical assay for further retrospective and prospective validation. In addition, our method of combining existing data sets can be used to robustly validate the potential clinical value of any new gene expression profile.

Keywords: Breast Neoplasms, genetics; Cluster Analysis; Conserved Sequence, genetics; Female; Gene Expression Regulation, Neoplastic, genetics; Genes, Neoplasm, genetics; Genetic Predisposition to Disease; Humans; Oligonucleotide Array Sequence Analysis, methods; Predictive Value of Tests; Reproducibility of Results; Research Design; Sample Size; Survival Analysis
[Hornberg2006Cancer] J. J. Hornberg, F. J. Bruggeman, H. V. Westerhoff, and J. Lankelma. Cancer: a systems biology disease. Biosystems, 83(2-3):81-90, 2006. [ bib | DOI | http | .pdf ]
Cancer research has focused on the identification of molecular differences between cancerous and healthy cells. The emerging picture is overwhelmingly complex. Molecules out of many parallel signal transduction pathways are involved. Their activities appear to be controlled by multiple factors. The action of regulatory circuits, cross-talk between pathways and the non-linear reaction kinetics of biochemical processes complicate the understanding and prediction of the outcome of intracellular signaling. In addition, interactions between tumor and other cell types give rise to a complex supra-cellular communication network. If cancer is such a complex system, how can one ever predict the effect of a mutation in a particular gene on a functionality of the entire system? And, how should one go about identifying drug targets? Here, we argue that one aspect is to recognize, where the essence resides, i.e. recognize cancer as a Systems Biology disease. Then, more cancer biologists could become systems biologists aiming to provide answers to some of the above systemic questions. To this aim, they should integrate the available knowledge stemming from quantitative experimental results through mathematical models. Models that have contributed to the understanding of complex biological systems are discussed. We show that the architecture of a signaling network is important for determining the site at which an oncologist should intervene. Finally, we discuss the possibility of applying network-based drug design to cancer treatment and how rationalized therapies, such as the application of kinase inhibitors, may benefit from Systems Biology.

Keywords: csbcbook, csbcbook-mustread
[Hoheisel2006Microarray] J. D. Hoheisel. Microarray technology: beyond transcript profiling and genotype analysis. Nat Rev Genet, 7(3):200-210, Mar 2006. [ bib | DOI | http | .pdf ]
Understanding complex functional mechanisms requires the global and parallel analysis of different cellular processes. DNA microarrays have become synonymous with this kind of study and, in many cases, are the obvious platform to achieve this aim. They have already made important contributions, most notably to gene-expression studies, although the true potential of this technology is far greater. Whereas some assays, such as transcript profiling and genotyping, are becoming routine, others are still in the early phases of development, and new areas of application, such as genome-wide epigenetic analysis and on-chip synthesis, continue to emerge.

Keywords: csbcbook, csbcbook-ch2
[Hinton2006fast] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Comput, 18(7):1527-1554, Jul 2006. [ bib | DOI | http ]
We show how to use "complementary priors" to eliminate the explaining-away effects that make inference difficult in densely connected belief nets that have many hidden layers. Using complementary priors, we derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory. The fast, greedy algorithm is used to initialize a slower learning procedure that fine-tunes the weights using a contrastive version of the wake-sleep algorithm. After fine-tuning, a network with three hidden layers forms a very good generative model of the joint distribution of handwritten digit images and their labels. This generative model gives better digit classification than the best discriminative learning algorithms. The low-dimensional manifolds on which the digits lie are modeled by long ravines in the free-energy landscape of the top-level associative memory, and it is easy to explore these ravines by using the directed connections to display what the associative memory has in mind.

[Hinton2006Unsupervised] Geoffrey Hinton, Simon Osindero, Max Welling, and Yee-Whye Teh. Unsupervised discovery of nonlinear structure using contrastive backpropagation. Cogn Sci, 30(4):725-731, Jul 2006. [ bib | DOI | http ]
We describe a way of modeling high-dimensional data vectors by using an unsupervised, nonlinear, multilayer neural network in which the activity of each neuron-like unit makes an additive contribution to a global energy score that indicates how surprised the network is by the data vector. The connection weights that determine how the activity of each unit depends on the activities in earlier layers are learned by minimizing the energy assigned to data vectors that are actually observed and maximizing the energy assigned to "confabulations" that are generated by perturbing an observed data vector in a direction that decreases its energy under the current model.

[Hill2006G-protein-coupled] S. J. Hill. G-protein-coupled receptors: past, present and future. Br. J. Pharmacol., 147 Suppl 1:S27-S37, Jan 2006. [ bib | DOI | http ]
The G-protein-coupled receptor (GPCR) family represents the largest and most versatile group of cell surface receptors. Drugs active at these receptors have therapeutic actions across a wide range of human diseases ranging from allergic rhinitis to pain, hypertension and schizophrenia. This review provides a brief historical overview of the properties and signalling characteristics of this important family of receptors.

Keywords: chemogenomics
[Hess2006Pharmacogenomic] KR. Hess, K. Anderson, WF. Symmans, V. Valero, N. Ibrahim, JA. Mejia, D. Booser, RL. Theriault, AU. Buzdar, PJ. Dempsey, R. Rouzier, N. Sneige, JS. Ross, T. Vidaurre, HL. Gómez, GN. Hortobagyi, and L. Pusztai. Pharmacogenomic predictor of sensitivity to preoperative chemotherapy with paclitaxel and fluorouracil, doxorubicin, and cyclophosphamide in breast cancer. J Clin Oncol, 24(26):4236-4244, Sep 2006. [ bib | DOI | http | .pdf ]
We developed a multigene predictor of pathologic complete response (pCR) to preoperative weekly paclitaxel and fluorouracil-doxorubicin-cyclophosphamide (T/FAC) chemotherapy and assessed its predictive accuracy on independent cases. One hundred thirty-three patients with stage I-III breast cancer were included. Pretreatment gene expression profiling was performed with oligonucleotide microarrays on fine-needle aspiration specimens. We developed predictors of pCR from 82 cases and assessed accuracy on 51 independent cases. Overall pCR rate was 26% in both cohorts. In the training set, 56 probes were identified as differentially expressed between pCR versus residual disease, at a false discovery rate of 1%. We examined the performance of 780 distinct classifiers (set of genes + prediction algorithm) in full cross-validation. Many predictors performed equally well. A nominally best 30-probe set Diagonal Linear Discriminant Analysis classifier was selected for independent validation. It showed significantly higher sensitivity (92% v 61%) than a clinical predictor including age, grade, and estrogen receptor status. The negative predictive value (96% v 86%) and area under the curve (0.877 v 0.811) were nominally better but not statistically significant. The combination of genomic and clinical information yielded a predictor not significantly different from the genomic predictor alone. In 31 samples, RNA was hybridized in replicate with resulting predictions that were 97% concordant. A 30-probe set pharmacogenomic predictor predicted pCR to T/FAC chemotherapy with high sensitivity and negative predictive value. This test correctly identified all but one of the patients who achieved pCR (12 of 13 patients) and all but one of those who were predicted to have residual disease had residual cancer (27 of 28 patients).

Keywords: breastcancer
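Note on [Hess2006Pharmacogenomic]: the percentages quoted in the abstract can be recovered from the patient counts it gives. The short Python check below is only an illustrative sketch; the helper name `percent` is hypothetical and not taken from the paper.

```python
# Illustrative arithmetic check of figures quoted in the abstract above.
# `percent` is a hypothetical helper, not part of the cited work.

def percent(numerator: int, denominator: int) -> float:
    """Return numerator/denominator as a percentage."""
    return 100.0 * numerator / denominator

# Cohort sizes: 82 training cases plus 51 independent validation cases.
total_patients = 82 + 51

# Sensitivity: patients with pCR who were correctly identified (12 of 13).
sensitivity = percent(12, 13)

# Negative predictive value: patients predicted to have residual disease
# who did have residual cancer (27 of 28).
npv = percent(27, 28)

print(f"total patients = {total_patients}")
print(f"sensitivity    = {sensitivity:.1f}%")
print(f"NPV            = {npv:.1f}%")
```

Rounded to whole percentages, these values agree with the 92% sensitivity and 96% negative predictive value reported for the 30-probe predictor, and the cohort sizes sum to the 133 patients stated.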
[Hertz2006PepDist] T. Hertz and C. Yanover. PepDist: a new framework for protein-peptide binding prediction based on learning peptide distance functions. BMC Bioinformatics, 7 Suppl 1:S3, 2006. [ bib | DOI | http ]
BACKGROUND: Many different aspects of cellular signalling, trafficking and targeting mechanisms are mediated by interactions between proteins and peptides. Representative examples are MHC-peptide complexes in the immune system. Developing computational methods for protein-peptide binding prediction is therefore an important task with applications to vaccine and drug design. METHODS: Previous learning approaches address the binding prediction problem using traditional margin based binary classifiers. In this paper we propose PepDist: a novel approach for predicting binding affinity. Our approach is based on learning peptide-peptide distance functions. Moreover, we suggest to learn a single peptide-peptide distance function over an entire family of proteins (e.g. MHC class I). This distance function can be used to compute the affinity of a novel peptide to any of the proteins in the given family. In order to learn these peptide-peptide distance functions, we formalize the problem as a semi-supervised learning problem with partial information in the form of equivalence constraints. Specifically, we propose to use DistBoost, which is a semi-supervised distance learning algorithm. RESULTS: We compare our method to various state-of-the-art binding prediction algorithms on MHC class I and MHC class II datasets. In almost all cases, our method outperforms all of its competitors. One of the major advantages of our novel approach is that it can also learn an affinity function over proteins for which only small amounts of labeled peptides exist. In these cases, our method's performance gain, when compared to other computational methods, is even more pronounced. We have recently uploaded the PepDist webserver which provides binding prediction of peptides to 35 different MHC class I alleles. The webserver which can be found at http://www.pepdist.cs.huji.ac.il is powered by a prediction engine which was trained using the framework presented in this paper. 
CONCLUSION: The results obtained suggest that learning a single distance function over an entire family of proteins achieves higher prediction accuracy than learning a set of binary classifiers for each of the proteins separately. We also show the importance of obtaining information on experimentally determined non-binders. Learning with real non-binders generalizes better than learning with randomly generated peptides that are assumed to be non-binders. This suggests that information about non-binding peptides should also be published and made publicly available.

Keywords: immunoinformatics
[He2006Why] X. He and J. Zhang. Why do hubs tend to be essential in protein networks? PLoS Genet, 2(6):e88, Jun 2006. [ bib | DOI | http ]
The protein-protein interaction (PPI) network has a small number of highly connected protein nodes (known as hubs) and many poorly connected nodes. Genome-wide studies show that deletion of a hub protein is more likely to be lethal than deletion of a non-hub protein, a phenomenon known as the centrality-lethality rule. This rule is widely believed to reflect the special importance of hubs in organizing the network, which in turn suggests the biological significance of network architectures, a key notion of systems biology. Despite the popularity of this explanation, the underlying cause of the centrality-lethality rule has never been critically examined. We here propose the concept of essential PPIs, which are PPIs that are indispensable for the survival or reproduction of an organism. Our network analysis suggests that the centrality-lethality rule is unrelated to the network architecture, but is explained by the simple fact that hubs have large numbers of PPIs, therefore high probabilities of engaging in essential PPIs. We estimate that approximately 3% of PPIs are essential in the yeast, accounting for approximately 43% of essential genes. As expected, essential PPIs are evolutionarily more conserved than nonessential PPIs. Considering the role of essential PPIs in determining gene essentiality, we find the yeast PPI network functionally more robust than random networks, yet far less robust than the potential optimum. These and other findings provide new perspectives on the biological relevance of network structure and robustness.
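
Note on [He2006Why]: the abstract's argument that hubs are more likely to engage in essential PPIs simply because they have many PPIs can be illustrated with a toy calculation. The sketch below assumes each interaction is essential independently with probability 0.03 (the abstract's approximate fraction for yeast); the independence assumption and the function name are illustrative simplifications, not the authors' actual model.

```python
# Toy illustration, not the authors' model: assume every PPI of a protein
# is essential independently with the same probability p.

def prob_has_essential_ppi(degree: int, p: float = 0.03) -> float:
    """Probability that a protein with `degree` PPIs has at least one
    essential PPI, under the independence assumption stated above."""
    return 1.0 - (1.0 - p) ** degree

# Compare poorly connected proteins with hypothetical hub degrees.
for degree in (1, 5, 20, 50):
    print(f"degree {degree:3d}: {prob_has_essential_ppi(degree):.3f}")
```

Under this assumption the probability rises monotonically with degree, which is the qualitative effect the abstract attributes to hubs, without any appeal to network architecture.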

[Guba2006Chemogenomics] W. Guba. Chemogenomics strategies for G-protein coupled receptor hit finding. Ernst Schering Res Found Workshop, 58:21-29, 2006. [ bib | DOI ]
Targeting protein superfamilies via chemogenomics is based on a similarity clustering of gene sequences and molecular structures of ligands. Both target and ligand clusters are linked by generating binding affinity profiles of chemotypes vs a target panel. The application of this multidimensional similarity paradigm will be described in the context of Lead Generation to identify novel chemical hit classes for G-protein coupled receptors.

Keywords: chemogenomics
[Gold2006SitesBase] N.D. Gold and R.M. Jackson. SitesBase: a database for structure-based protein-ligand binding site comparisons. Nucleic Acids Res., 34:D231-D234, Jan 2006. [ bib ]
There are many components which govern the function of a protein within a cell. Here, we focus on the molecular recognition of small molecules and the prediction of common recognition by similarity between protein-ligand binding sites. SitesBase is an easily accessible database which is simple to use and holds information about structural similarities between known ligand binding sites found in the Protein Data Bank. These similarities are presented to the wider community enabling full analysis of molecular recognition and potentially protein structure-function relationships. SitesBase is accessible at http://www.bioinformatics.leeds.ac.uk/sb.

[Gold2006Fold] N.D. Gold and R.M. Jackson. Fold independent structural comparisons of protein-ligand binding sites for exploring functional relationships. J. Mol. Biol., 355(5):1112-1124, Feb 2006. [ bib ]
The rapid growth in protein structural data and the emergence of structural genomics projects have increased the need for automatic structure analysis and tools for function prediction. Small molecule recognition is critical to the function of many proteins; therefore, determination of ligand binding site similarity is important for understanding ligand interactions and may allow their functional classification. Here, we present a binding sites database (SitesBase) that given a known protein-ligand binding site allows rapid retrieval of other binding sites with similar structure independent of overall sequence or fold similarity. However, each match is also annotated with sequence similarity and fold information to aid interpretation of structure and functional similarity. Similarity in ligand binding sites can indicate common binding modes and recognition of similar molecules, allowing potential inference of function for an uncharacterised protein or providing additional evidence of common function where sequence or fold similarity is already known. Alternatively, the resource can provide valuable information for detailed studies of molecular recognition including structure-based ligand design and in understanding ligand cross-reactivity. Here, we show examples of atomic similarity between superfamily or more distant fold relatives as well as between seemingly unrelated proteins. Assignment of unclassified proteins to structural superfamilies is also undertaken and in most cases substantiates assignments made using sequence similarity. Correct assignment is also possible where sequence similarity fails to find significant matches, illustrating the potential use of binding site comparisons for newly determined proteins.

Keywords: geometric hashing, SitesBase, structural genomics, 3D structure comparison
[Gevaert2006Predicting] O. Gevaert, F. De Smet, D. Timmerman, Y. Moreau, and B. De Moor. Predicting the prognosis of breast cancer by integrating clinical and microarray data with Bayesian networks. Bioinformatics, 22(14):e184-e190, 2006. [ bib ]
[GevaZatorsky2006MSB] N. Geva-Zatorsky, N. Rosenfeld, S. Itzkovitz, R. Milo, A. Sigal, E. Dekel, T. Yarnitzky, Y. Liron, P. Polak, G. Lahav, and U. Alon. Oscillations and variability in the p53 system. Mol Syst Biol, 2:2006.0033, 2006. [ bib ]
Understanding the dynamics and variability of protein circuitry requires accurate measurements in living cells as well as theoretical models. To address this, we employed one of the best-studied protein circuits in human cells, the negative feedback loop between the tumor suppressor p53 and the oncogene Mdm2. We measured the dynamics of fluorescently tagged p53 and Mdm2 over several days in individual living cells. We found that isogenic cells in the same environment behaved in highly variable ways following DNA-damaging gamma irradiation: some cells showed undamped oscillations for at least 3 days (more than 10 peaks). The amplitude of the oscillations was much more variable than the period. Sister cells continued to oscillate in a correlated way after cell division, but lost correlation after about 11 h on average. Other cells showed low-frequency fluctuations that did not resemble oscillations. We also analyzed different families of mathematical models of the system, including a novel checkpoint mechanism. The models point to the possible source of the variability in the oscillations: low-frequency noise in protein production rates, rather than noise in other parameters such as degradation rates. This study provides a view of the extensive variability of the behavior of a protein circuit in living human cells, both from cell to cell and in the same cell over time.

Keywords: csbcbook
[Garnis2006High] C. Garnis, WW. Lockwood, E. Vucic, Y. Ge, L. Girard, JD. Minna, AF. Gazdar, S. Lam, C. MacAulay, and WL. Lam. High resolution analysis of non-small cell lung cancer cell lines by whole genome tiling path array CGH. Int. J. Cancer, 118(6):1556-1564, 2006. [ bib | DOI | http ]
Chromosomal regions harboring tumor suppressors and oncogenes are often deleted or amplified. Array comparative genomic hybridization detects segmental DNA copy number alterations in tumor DNA relative to a normal control. The recent development of a bacterial artificial chromosome array, which spans the human genome in a tiling path manner with >32,000 clones, has facilitated whole genome profiling at an unprecedented resolution. Using this technology, we comprehensively describe and compare the genomes of 28 commonly used non-small cell lung carcinoma (NSCLC) cell models, derived from 18 adenocarcinomas (AC), 9 squamous cell carcinomas and 1 large cell carcinoma. Analysis at such resolution not only provided a detailed genomic alteration template for each of these model cell lines, but revealed novel regions of frequent duplication and deletion. Significantly, a detailed analysis of chromosome 7 identified 6 distinct regions of alterations across this chromosome, implicating the presence of multiple novel oncogene loci on this chromosome. As well, a comparison between the squamous and AC cells revealed alterations common to both subtypes, such as the loss of 3p and gain of 5p, in addition to multiple hotspots more frequently associated with only 1 subtype. Interestingly, chromosome 3q, which is known to be amplified in both subtypes, showed 2 distinct regions of alteration, 1 frequently altered in squamous and 1 more frequently altered in AC. In summary, our data demonstrate the unique information generated by high resolution analysis of NSCLC genomes and uncover the presence of genetic alterations prevalent in the different NSCLC subtypes.

Keywords: Carcinoma, Non-Small-Cell Lung, genetics/pathology; Cell Line, Tumor; Chromosomes, Artificial, Bacterial, genetics; Gene Amplification; Gene Dosage; Gene Expression Profiling; Genome, Human, genetics; Humans; Loss of Heterozygosity; Lung Neoplasms, genetics/pathology; Microarray Analysis, methods; Nucleic Acid Hybridization, methods
[Freeman2006Copy] JL. Freeman, GH. Perry, L. Feuk, R. Redon, SA. McCarroll, DM. Altshuler, H. Aburatani, KW. Jones, C. Tyler-Smith, ME. Hurles, NP. Carter, SW. Scherer, and C. Lee. Copy number variation: new insights in genome diversity. Genome Res, 16(8):949-961, Aug 2006. [ bib | DOI | http | .pdf ]
DNA copy number variation has long been associated with specific chromosomal rearrangements and genomic disorders, but its ubiquity in mammalian genomes was not fully realized until recently. Although our understanding of the extent of this variation is still developing, it seems likely that, at least in humans, copy number variants (CNVs) account for a substantial amount of genetic variation. Since many CNVs include genes that result in differential levels of gene expression, CNVs may account for a significant proportion of normal phenotypic variation. Current efforts are directed toward a more comprehensive cataloging and characterization of CNVs that will provide the basis for determining how genomic diversity impacts biological function, evolution, and common human diseases.

Keywords: cgh, csbcbook, csbcbook-ch2
[Freedman2006Statistical] D.A. Freedman. Statistical models for causation: what inferential leverage do they provide? Evaluation Review, 30(6):691-713, 2006. [ bib ]
[Franke2006Reconstruction] L. Franke, H. van Bakel, L. Fokkens, E.D. de Jong, M. Egmont-Petersen, and C. Wijmenga. Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am. J. Hum. Genet., 78(6):1011-1025, Jun 2006. [ bib | DOI | http | .pdf ]
Most common genetic disorders have a complex inheritance and may result from variants in many genes, each contributing only weak effects to the disease. Pinpointing these disease genes within the myriad of susceptibility loci identified in linkage studies is difficult because these loci may contain hundreds of genes. However, in any disorder, most of the disease genes will be involved in only a few different molecular pathways. If we know something about the relationships between the genes, we can assess whether some genes (which may reside in different loci) functionally interact with each other, indicating a joint basis for the disease etiology. There are various repositories of information on pathway relationships. To consolidate this information, we developed a functional human gene network that integrates information on genes and the functional relationships between genes, based on data from the Kyoto Encyclopedia of Genes and Genomes, the Biomolecular Interaction Network Database, Reactome, the Human Protein Reference Database, the Gene Ontology database, predicted protein-protein interactions, human yeast two-hybrid interactions, and microarray co-expressions. We applied this network to interrelate positional candidate genes from different disease loci and then tested 96 heritable disorders for which the Online Mendelian Inheritance in Man database reported at least three disease genes. Artificial susceptibility loci, each containing 100 genes, were constructed around each disease gene, and we used the network to rank these genes on the basis of their functional interactions. By following up the top five genes per artificial locus, we were able to detect at least one known disease gene in 54% of the loci studied, representing a 2.8-fold increase over random selection. 
This suggests that our method can significantly reduce the cost and effort of pinpointing true disease genes in analyses of disorders for which numerous loci have been reported but for which most of the genes are unknown.

[Flannick2006Graemlin] J. Flannick, A. Novak, B.S. Srinivasan, H.H. McAdams, and S. Batzoglou. Graemlin: general and robust alignment of multiple large interaction networks. Genome Res., 16(9):1169-1181, Sep 2006. [ bib | DOI | http | .pdf ]
The recent proliferation of protein interaction networks has motivated research into network alignment: the cross-species comparison of conserved functional modules. Previous studies have laid the foundations for such comparisons and demonstrated their power on a select set of sparse interaction networks. Recently, however, new computational techniques have produced hundreds of predicted interaction networks with interconnection densities that push existing alignment algorithms to their limits. To find conserved functional modules in these new networks, we have developed Graemlin, the first algorithm capable of scalable multiple network alignment. Graemlin's explicit model of functional evolution allows both the generalization of existing alignment scoring schemes and the location of conserved network topologies other than protein complexes and metabolic pathways. To assess Graemlin's performance, we have developed the first quantitative benchmarks for network alignment, which allow comparisons of algorithms in terms of their ability to recapitulate the KEGG database of conserved functional modules. We find that Graemlin achieves substantial scalability gains over previous methods while improving sensitivity.

[Fan2006Illumina] Jian-Bing Fan, Kevin L Gunderson, Marina Bibikova, Joanne M Yeakley, Jing Chen, Eliza Wickham Garcia, Lori L Lebruska, Marc Laurent, Richard Shen, and David Barker. Illumina universal bead arrays. Methods Enzymol, 410:57-73, 2006. [ bib | DOI | http ]
This chapter describes an accurate, scalable, and flexible microarray technology. It includes a miniaturized array platform where each individual feature is quality controlled and a versatile assay that can be adapted for various genetic analyses, such as single nucleotide polymorphism genotyping, DNA methylation detection, and gene expression profiling. This chapter describes the concept of the BeadArray technology, two different Array of Arrays formats, the assay scheme and protocol, the performance of the system, and its use in large-scale genetic, epigenetic, and expression studies.

Keywords: Animals; Humans; Microspheres; Oligonucleotide Array Sequence Analysis, instrumentation/methods
[Fan2006Concordance] C. Fan, D.S. Oh, L. Wessels, B. Weigelt, D.S.A. Nuyten, A.B. Nobel, L.J. van't Veer, and C.M. Perou. Concordance among gene-expression-based predictors for breast cancer. N. Engl. J. Med., 355(6):560, 2006. [ bib | DOI | http | .pdf ]
Keywords: breastcancer, microarray
[Theres2006Structural] Theres Fagerberg, Jean-Charles Cerottini, and Olivier Michielin. Structural prediction of peptides bound to MHC class I. J. Mol. Biol., 356(2):521-546, Feb 2006. [ bib | DOI | http ]
An ab initio structure prediction approach adapted to the peptide-major histocompatibility complex (MHC) class I system is presented. Based on structure comparisons of a large set of peptide-MHC class I complexes, a molecular dynamics protocol is proposed using simulated annealing (SA) cycles to sample the conformational space of the peptide in its fixed MHC environment. A set of 14 peptide-human leukocyte antigen (HLA) A0201 and 27 peptide-non-HLA A0201 complexes for which X-ray structures are available is used to test the accuracy of the prediction method. For each complex, 1000 peptide conformers are obtained from the SA sampling. A graph theory clustering algorithm based on heavy atom root-mean-square deviation (RMSD) values is applied to the sampled conformers. The clusters are ranked using cluster size, mean effective or conformational free energies, with solvation free energies computed using Generalized Born MV 2 (GB-MV2) and Poisson-Boltzmann (PB) continuum models. The final conformation is chosen as the center of the best-ranked cluster. With conformational free energies, the overall prediction success is 83% using a 1.00 Angstroms crystal RMSD criterion for main-chain atoms, and 76% using a 1.50 Angstroms RMSD criterion for heavy atoms. The prediction success is even higher for the set of 14 peptide-HLA A0201 complexes: 100% of the peptides have main-chain RMSD values ≤ 1.00 Angstroms and 93% of the peptides have heavy atom RMSD values ≤ 1.50 Angstroms. This structure prediction method can be applied to complexes of natural or modified antigenic peptides in their MHC environment with the aim to perform rational structure-based optimizations of tumor vaccines.

Keywords: Algorithms, Amino Acid Sequence, Antibodies, Artificial Intelligence, Automated, Binding Sites, Chemical, Computer Simulation, Databases, Epitope Mapping, Genes, HLA-A Antigens, HLA-DQ Antigens, Histocompatibility Antigens Class I, Humans, Immunoassay, Immunological, MHC Class I, Models, Molecular, Molecular Sequence Data, Pattern Recognition, Peptides, Protein, Protein Binding, Protein Conformation, Protein Interaction Mapping, Protein Structure, Sequence Alignment, Sequence Analysis, Software, Tertiary, Water, 16368108
[Esquela-Kerscher2006Oncomirs] A. Esquela-Kerscher and FJ. Slack. Oncomirs - microRNAs with a role in cancer. Nat. Rev. Cancer, 6(4):259-269, Apr 2006. [ bib | DOI | http | .pdf ]
MicroRNAs (miRNAs) are an abundant class of small non-protein-coding RNAs that function as negative gene regulators. They regulate diverse biological processes, and bioinformatic data indicates that each miRNA can control hundreds of gene targets, underscoring the potential influence of miRNAs on almost every genetic pathway. Recent evidence has shown that miRNA mutations or mis-expression correlate with various human cancers and indicates that miRNAs can function as tumour suppressors and oncogenes. miRNAs have been shown to repress the expression of important cancer-related genes and might prove useful in the diagnosis and treatment of cancer.

Keywords: csbcbook
[Erhan2006Collaborative] D. Erhan, P.-J. L'Heureux, SY. Yue, and Y. Bengio. Collaborative filtering on a family of biological targets. J. Chem. Inf. Model., 46(2):626-635, 2006. [ bib | DOI | http | .pdf ]
Building a QSAR model of a new biological target for which few screening data are available is a statistical challenge. However, the new target may be part of a bigger family, for which we have more screening data. Collaborative filtering or, more generally, multi-task learning, is a machine learning approach that improves the generalization performance of an algorithm by using information from related tasks as an inductive bias. We use collaborative filtering techniques for building predictive models that link multiple targets to multiple examples. The more commonalities between the targets, the better the multi-target model that can be built. We show an example of a multi-target neural network that can use family information to produce a predictive model of an undersampled target. We evaluate JRank, a kernel-based method designed for collaborative filtering. We show their performance on compound prioritization for an HTS campaign and the underlying shared representation between targets. JRank outperformed the neural network both in the single- and multi-target models.

Keywords: chemogenomics
[Elfilali2006ITTACA] A. Elfilali, S. Lair, C. Verbeke, P. La Rosa, F. Radvanyi, and E. Barillot. ITTACA: a new database for integrated tumor transcriptome array and clinical data analysis. Nucleic Acids Res., 34(Database issue):D613-D616, Jan 2006. [ bib | DOI | http ]
Transcriptome microarrays have become one of the tools of choice for investigating the genes involved in tumorigenesis and tumor progression, as well as finding new biomarkers and gene expression signatures for the diagnosis and prognosis of cancer. Here, we describe a new database for Integrated Tumor Transcriptome Array and Clinical data Analysis (ITTACA). ITTACA centralizes public datasets containing both gene expression and clinical data. ITTACA currently focuses on the types of cancer that are of particular interest to research teams at Institut Curie: breast carcinoma, bladder carcinoma and uveal melanoma. A web interface allows users to carry out different class comparison analyses, including the comparison of expression distribution profiles, tests for differential expression and patient survival analyses. ITTACA is complementary to other databases, such as GEO and SMD, because it offers a better integration of clinical data and different functionalities. It also offers more options for class comparison analyses when compared with similar projects such as Oncomine. For example, users can define their own patient groups according to clinical data or gene expression levels. This added flexibility and the user-friendly web interface make ITTACA especially useful for comparing personal results with the results in the existing literature. ITTACA is accessible online at http://bioinfo.curie.fr/ittaca.

[Ein-Dor2006Thousands] L. Ein-Dor, O. Zuk, and E. Domany. Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proc. Natl. Acad. Sci. USA, 103(15):5923-5928, 2006. [ bib | DOI | http | .pdf ]
[Diaz-Uriarte2006Gene] R. Díaz-Uriarte and S.A. De Andres. Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7(1):3, 2006. [ bib ]
[Dubus2006In] E. Dubus, I. Ijjaali, F. Petitet, and A. Michel. In Silico Classification of hERG Channel Blockers: a Knowledge-Based Strategy. Chem. Med. Chem., 1(6):622-630, Jun 2006. [ bib | DOI | http ]
The blockage of the hERG potassium channel by a wide number of diverse compounds has become a major pharmacological safety concern as it can lead to sudden cardiac death. In silico models can be potent tools to screen out potential hERG blockers as early as possible during the drug-discovery process. In this study, predictive models developed using the recursive partitioning method and created using diverse datasets from 203 molecules tested on the hERG channel are described. The first model was built with hERG compounds grouped into two classes, with a separation limit set at an IC50 value of 1 µM, and reaches an overall accuracy of 81%. The misclassification of molecules having a range of activity between 1 and 10 µM led to the generation of a tri-class model able to correctly classify high, moderate, and weak hERG blockers with an overall accuracy of 90%. Another model, constructed with the high and weak hERG-blocker categories, successfully increases the accuracy to 96%. The results reported herein indicate that a combination of precise knowledge management resources and powerful modeling tools is invaluable to assessing potential cardiotoxic side effects related to hERG blockage.

Keywords: herg chemoinformatics
[Dostie2006Chromosome] Josée Dostie, Todd A Richmond, Ramy A Arnaout, Rebecca R Selzer, William L Lee, Tracey A Honan, Eric D Rubio, Anton Krumm, Justin Lamb, Chad Nusbaum, Roland D Green, and Job Dekker. Chromosome conformation capture carbon copy (5C): a massively parallel solution for mapping interactions between genomic elements. Genome Res, 16(10):1299-1309, Oct 2006. [ bib | DOI | http | .pdf ]
Physical interactions between genetic elements located throughout the genome play important roles in gene regulation and can be identified with the Chromosome Conformation Capture (3C) methodology. 3C converts physical chromatin interactions into specific ligation products, which are quantified individually by PCR. Here we present a high-throughput 3C approach, 3C-Carbon Copy (5C), that employs microarrays or quantitative DNA sequencing using 454-technology as detection methods. We applied 5C to analyze a 400-kb region containing the human beta-globin locus and a 100-kb conserved gene desert region. We validated 5C by detection of several previously identified looping interactions in the beta-globin locus. We also identified a new looping interaction in K562 cells between the beta-globin Locus Control Region and the gamma-beta-globin intergenic region. Interestingly, this region has been implicated in the control of developmental globin gene switching. 5C should be widely applicable for large-scale mapping of cis- and trans- interaction networks of genomic elements and for the study of higher-order chromosome structure.

[Do2006Normalization] JH. Do and DK. Choi. Normalization of microarray data: single-labeled and dual-labeled arrays. Molecules and Cells, 22:254-261, 2006. [ bib ]
[Didiano2006Perfect] Dominic Didiano and Oliver Hobert. Perfect seed pairing is not a generally reliable predictor for miRNA-target interactions. Nat Struct Mol Biol, 13(9):849-851, Sep 2006. [ bib | DOI | http | .pdf ]
We use Caenorhabditis elegans to test proposed general rules for microRNA (miRNA)-target interactions. We show that G.U base pairing is tolerated in the 'seed' region of the lsy-6 miRNA interaction with its in vivo target cog-1, and that 6- to 8-base-pair perfect seed pairing is not a generally reliable predictor for an interaction of lsy-6 with a 3' untranslated region (UTR). Rather, lsy-6 can functionally interact with its target site only in specific 3' UTR contexts. Our findings illustrate the difficulty of establishing generalizable rules of miRNA-target interactions.

Keywords: sirna
[Deshpande2006Targeting] DA. Deshpande and RB. Penn. Targeting G protein-coupled receptor signaling in asthma. Cell. Signal., 18(12):2105-2120, Dec 2006. [ bib | DOI | http ]
The complex disease asthma, an obstructive lung disease in which excessive airway smooth muscle (ASM) contraction as well as increased ASM mass reduces airway lumen size and limits airflow, can be viewed as a consequence of aberrant airway G protein-coupled receptor (GPCR) function. The central role of GPCRs in determining airway resistance is underscored by the fact that almost every drug used in the treatment of asthma directly or indirectly targets either GPCR-ligand interaction, GPCR signaling, or processes that produce GPCR agonists. Although many airway cells contribute to the regulation of airway resistance and architecture, ASM properties and functions have the greatest impact on airway homeostasis. The theme of this review is that GPCR-mediated regulation of ASM tone and ASM growth is a major determinant of the acute and chronic features of asthma, and multiple strategies targeting GPCR signaling may be employed to prevent or manage these features.

[Demvsar2006Statistical] J. Demšar. Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res., 7:1-30, 2006. [ bib | .html ]
[Davis2006Reliable] C.A. Davis, F. Gerick, V. Hintermair, C.C. Friedel, K. Fundel, R. Küffner, and R. Zimmer. Reliable gene signatures for microarray classification: assessment of stability and performance. Bioinformatics, 22(19):2356-2363, 2006. [ bib ]
[Dalma-Weiszhausz2006Affymetrix] Dennise D Dalma-Weiszhausz, Janet Warrington, Eugene Y Tanimoto, and C. Garrett Miyada. The Affymetrix GeneChip platform: an overview. Methods Enzymol., 410:3-28, 2006. [ bib | DOI | http | .pdf ]
The intent of this chapter is to provide the reader with a review of GeneChip technology and the complete system it represents, including its versatility, components, and the exciting applications that are enabled by this platform. The following aspects of the technology are reviewed: array design and manufacturing, target preparation, instrumentation, data analysis, and both current and future applications. There are key differentiators between Affymetrix' GeneChip technology and other microarray-based methods. The most distinguishing feature of GeneChip microarrays is that their manufacture is directed by photochemical synthesis. Because of this manufacturing technology, more than a million different probes can be synthesized on an array roughly the size of a thumbnail. These numbers allow the inclusion of multiple probes to interrogate the same target sequence, providing statistical rigor to data interpretation. Over the years the GeneChip platform has proven to be a reliable and robust system, enabling many new discoveries and breakthroughs to be made by the scientific community.

Keywords: Animals; Humans; Oligonucleotide Array Sequence Analysis, instrumentation/methods
[Cour2006Balanced] T. Cour, P. Srinivasan, and J. Shi. Balanced graph matching. In Advances in Neural Information Processing Systems, 2006. [ bib ]
Keywords: graph-matching, spectral, spectral-matching
[Coupez2006Docking] B. Coupez and RA. Lewis. Docking and scoring - theoretically easy, practically impossible? Curr. Med. Chem., 13(25):2995-3003, 2006. [ bib ]
Structure-based Drug Design (SBDD) is an essential part of the modern medicinal chemistry, and has led to the acceleration of many projects, and even to drugs on the market. Programs that perform docking and scoring of ligands to receptors are powerful tools in the drug designer's armoury that enhance the process of SBDD. They are even deployed on the desktop of many bench chemists. It is timely to review the state of the art, to understand how good our docking programs are, and what are the issues. In this review we would like to provide a guide around the reliable aspects of docking and scoring and the associated pitfalls aiming at an audience of medicinal chemists rather than modellers. For convenience, we will divide the review into two parts: docking and scoring. Docking concerns the preparation of the receptor and the ligand(s), the sampling of conformational space and stereochemistry (if appropriate). Scoring concerns the evaluation of all of the ligand-receptor poses generated by docking. The two processes are not truly independent, and this will be discussed here in detail. The preparation of the receptor and ligand(s) before docking requires great care. For the receptor, issues of protonation, tautomerisation and hydration are key, and we will discuss current approaches to these issues. Even more important is the degree of sampling: can the algorithms reproduce what is observed experimentally? If they can, are the scoring algorithms good enough to recognise this pose as the best? Do the scores correlate with observed binding affinity? How does local knowledge of the target (for example hinge-binding to a kinase) affect the accuracy of the predictions? We will review the key findings from several evaluation studies and present conclusions about when and how to interpret and trust the results of docking and scoring. Finally, we will present an outline of some of the latest developments in the area of scoring functions.

Keywords: Cluster Analysis; Computational Biology, methods; Computer Simulation; Computer-Aided Design; Databases, Factual; Drug Design; Ligands; Models, Chemical; Software; Structure-Activity Relationship
[Consortium2006MicroArray] MAQC Consortium, Leming Shi, Laura H Reid, Wendell D Jones, Richard Shippy, Janet A Warrington, Shawn C Baker, Patrick J Collins, Francoise de Longueville, Ernest S Kawasaki, Kathleen Y Lee, Yuling Luo, Yongming Andrew Sun, James C Willey, Robert A Setterquist, Gavin M Fischer, Weida Tong, Yvonne P Dragan, David J Dix, Felix W Frueh, Frederico M Goodsaid, Damir Herman, Roderick V Jensen, Charles D Johnson, Edward K Lobenhofer, Raj K Puri, Uwe Scherf, Jean Thierry-Mieg, Charles Wang, Mike Wilson, Paul K Wolber, Lu Zhang, Shashi Amur, Wenjun Bao, Catalin C Barbacioru, Anne Bergstrom Lucas, Vincent Bertholet, Cecilie Boysen, Bud Bromley, Donna Brown, Alan Brunner, Roger Canales, Xiaoxi Megan Cao, Thomas A Cebula, James J Chen, Jing Cheng, Tzu-Ming Chu, Eugene Chudin, John Corson, J. Christopher Corton, Lisa J Croner, Christopher Davies, Timothy S Davison, Glenda Delenstarr, Xutao Deng, David Dorris, Aron C Eklund, Xiao hui Fan, Hong Fang, Stephanie Fulmer-Smentek, James C Fuscoe, Kathryn Gallagher, Weigong Ge, Lei Guo, Xu Guo, Janet Hager, Paul K Haje, Jing Han, Tao Han, Heather C Harbottle, Stephen C Harris, Eli Hatchwell, Craig A Hauser, Susan Hester, Huixiao Hong, Patrick Hurban, Scott A Jackson, Hanlee Ji, Charles R Knight, Winston P Kuo, J. Eugene LeClerc, Shawn Levy, Quan-Zhen Li, Chunmei Liu, Ying Liu, Michael J Lombardi, Yunqing Ma, Scott R Magnuson, Botoul Maqsodi, Tim McDaniel, Nan Mei, Ola Myklebost, Baitang Ning, Natalia Novoradovskaya, Michael S Orr, Terry W Osborn, Adam Papallo, Tucker A Patterson, Roger G Perkins, Elizabeth H Peters, Ron Peterson, Kenneth L Philips, P. Scott Pine, Lajos Pusztai, Feng Qian, Hongzu Ren, Mitch Rosen, Barry A Rosenzweig, Raymond R Samaha, Mark Schena, Gary P Schroth, Svetlana Shchegrova, Dave D Smith, Frank Staedtler, Zhenqiang Su, Hongmei Sun, Zoltan Szallasi, Zivana Tezak, Danielle Thierry-Mieg, Karol L Thompson, Irina Tikhonova, Yaron Turpaz, Beena Vallanat, Christophe Van, Stephen J Walker, Sue Jane Wang, Yonghong Wang, Russ Wolfinger, Alex Wong, Jie Wu, Chunlin Xiao, Qian Xie, Jun Xu, Wen Yang, Liang Zhang, Sheng Zhong, Yaping Zong, and William Slikker. The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat. Biotechnol., 24(9):1151-1161, Sep 2006. [ bib | DOI | http ]
Over the last decade, the introduction of microarray technology has had a profound impact on gene expression research. The publication of studies with dissimilar or altogether contradictory results, obtained using different microarray platforms to analyze identical RNA samples, has raised concerns about the reliability of this technology. The MicroArray Quality Control (MAQC) project was initiated to address these concerns, as well as other performance and data analysis issues. Expression data on four titration pools from two distinct reference RNA samples were generated at multiple test sites using a variety of microarray-based and alternative technology platforms. Here we describe the experimental design and probe mapping efforts behind the MAQC project. We show intraplatform consistency across test sites as well as a high level of interplatform concordance in terms of genes identified as differentially expressed. This study provides a resource that represents an important first step toward establishing a framework for the use of microarrays in clinical and regulatory settings.

Keywords: Equipment Design; Equipment Failure Analysis; Gene Expression Profiling, instrumentation/methods; Oligonucleotide Array Sequence Analysis, instrumentation; Quality Assurance, Health Care, methods; Quality Control; Reproducibility of Results; Sensitivity and Specificity; United States
[Coe2006Differential] BP. Coe, WW. Lockwood, L. Girard, R. Chari, C. MacAulay, S. Lam, AF. Gazdar, JD. Minna, and WL. Lam. Differential disruption of cell cycle pathways in small cell and non-small cell lung cancer. Br. J. Cancer, 94:1927-1935, 2006. [ bib ]
[Citri2006MolCelBiol] Ami Citri and Yosef Yarden. EGF-ERBB signalling: towards the systems level. Nat Rev Mol Cell Biol, 7(7):505-516, Jul 2006. [ bib | DOI | http ]
Signalling through the ERBB/HER receptors is intricately involved in human cancer and already serves as a target for several cancer drugs. Because of its inherent complexity, it is useful to envision ERBB signalling as a bow-tie-configured, evolvable network, which shares modularity, redundancy and control circuits with robust biological and engineered systems. Because network fragility is an inevitable trade-off of robustness, systems-level understanding is expected to generate therapeutic opportunities to intercept aberrant network activation.

Keywords: Animals; Endocytosis, physiology; Epidermal Growth Factor, metabolism; Feedback, Physiological; Humans; Ligands; Models, Molecular; Oncogene Proteins v-erbB, genetics/metabolism; Phosphatidylinositol 3-Kinases, metabolism; Protein Conformation; Receptor, Epidermal Growth Factor, chemistry/genetics/metabolism; Signal Transduction, physiology
[Chapelle2006Semi-Supervised] O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006. [ bib | http ]
[Candes2006Stable] E. Candès, JK. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math., 59(8):1207-1223, 2006. [ bib ]
[Calzone2006BIOCHAM] L. Calzone, F. Fages, and S. Soliman. BIOCHAM: an environment for modeling biological systems and formalizing experimental knowledge. Bioinformatics, 22(14):1805-1807, 2006. [ bib | DOI | arXiv | http | .pdf ]
Summary: BIOCHAM (the BIOCHemical Abstract Machine) is a software environment for modeling biochemical systems. It is based on two aspects: (1) the analysis and simulation of boolean, kinetic and stochastic models and (2) the formalization of biological properties in temporal logic. BIOCHAM provides tools and languages for describing protein networks with a simple and straightforward syntax, and for integrating biological properties into the model. It then becomes possible to analyze, query, verify and maintain the model with respect to those properties. For kinetic models, BIOCHAM can search for appropriate parameter values in order to reproduce a specific behavior observed in experiments and formalized in temporal logic. Coupled with other methods such as bifurcation diagrams, this search assists the modeler/biologist in the modeling process. Availability: BIOCHAM (v. 2.5) is a free software available for download, with example models, at http://contraintes.inria.fr/BIOCHAM/ Contact: Sylvain.Soliman@inria.fr

Keywords: csbcbook
[Calin2006MicroRNA-cancer] G.A. Calin and CM. Croce. MicroRNA-cancer connection: the beginning of a new tale. Cancer Res., 66(15):7390-7394, Aug 2006. [ bib | DOI | http | .pdf ]
Cancer initiation and progression can involve microRNAs (miRNA), which are small noncoding RNAs that can regulate gene expression. Their expression profiles can be used for the classification, diagnosis, and prognosis of human malignancies. Loss or amplification of miRNA genes has been reported in a variety of cancers, and altered patterns of miRNA expression may affect cell cycle and survival programs. Germ-line and somatic mutations in miRNAs or polymorphisms in the mRNAs targeted by miRNAs may also contribute to cancer predisposition and progression. We propose that alterations in miRNA genes play a critical role in the pathophysiology of many, perhaps all, human cancers.

Keywords: csbcbook
[Calin2006MicroRNA] GA. Calin and CM. Croce. MicroRNA signatures in human cancers. Nat. Rev. Cancer, 6(11):857-866, Nov 2006. [ bib | DOI | http | .pdf ]
MicroRNA (miRNA) alterations are involved in the initiation and progression of human cancer. The causes of the widespread differential expression of miRNA genes in malignant compared with normal cells can be explained by the location of these genes in cancer-associated genomic regions, by epigenetic mechanisms and by alterations in the miRNA processing machinery. MiRNA-expression profiling of human tumours has identified signatures associated with diagnosis, staging, progression, prognosis and response to treatment. In addition, profiling has been exploited to identify miRNA genes that might represent downstream targets of activated oncogenic pathways, or that target protein-coding genes involved in cancer.

Keywords: csbcbook, csbcbook-ch3
[Buerckstuemmer2006efficient] Tilmann Bürckstümmer, Keiryn L Bennett, Adrijana Preradovic, Gregor Schütze, Oliver Hantschel, Giulio Superti-Furga, and Angela Bauch. An efficient tandem affinity purification procedure for interaction proteomics in mammalian cells. Nat Methods, 3(12):1013-1019, Dec 2006. [ bib | DOI | http | .pdf ]
Tandem affinity purification (TAP) is a generic two-step affinity purification protocol that enables the isolation of protein complexes under close-to-physiological conditions for subsequent analysis by mass spectrometry. Although TAP was instrumental in elucidating the yeast cellular machinery, in mammalian cells the method suffers from a low overall yield. We designed several dual-affinity tags optimized for use in mammalian cells and compared the efficiency of each tag to the conventional TAP tag. A tag based on protein G and the streptavidin-binding peptide (GS-TAP) resulted in a tenfold increase in protein-complex yield and improved the specificity of the procedure. This allows purification of protein complexes that were hitherto not amenable to TAP and use of less starting material, leading to higher success rates and enabling systematic interaction proteomics projects. Using the well-characterized Ku70-Ku80 protein complex as an example, we identified both core elements as well as new candidate effectors.

[Buyse2006Validation] M. Buyse, S. Loi, S. van't Veer, G. Viale, M. Delorenzi, AM. Glas, M. Saghatchian d'Assignies, J. Bergh, R. Lidereau, P. Ellis, A. Harris, J. Bogaerts, P. Therasse, A. Floore, M. Amakrane, F. Piette, E. Rutgers, C. Sotiriou, F. Cardoso, MJ. Piccart, and TRANSBIG Consortium. Validation and clinical utility of a 70-gene prognostic signature for women with node-negative breast cancer. J. Natl. Canc. Inst., 98(17):1183-1192, Sep 2006. [ bib | DOI | http | .pdf ]
BACKGROUND: A 70-gene signature was previously shown to have prognostic value in patients with node-negative breast cancer. Our goal was to validate the signature in an independent group of patients. METHODS: Patients (n = 307, with 137 events after a median follow-up of 13.6 years) from five European centers were divided into high- and low-risk groups based on the gene signature classification and on clinical risk classifications. Patients were assigned to the gene signature low-risk group if their 5-year distant metastasis-free survival probability as estimated by the gene signature was greater than 90%. Patients were assigned to the clinicopathologic low-risk group if their 10-year survival probability, as estimated by Adjuvant! software, was greater than 88% (for estrogen receptor [ER]-positive patients) or 92% (for ER-negative patients). Hazard ratios (HRs) were estimated to compare time to distant metastases, disease-free survival, and overall survival in high- versus low-risk groups. RESULTS: The 70-gene signature outperformed the clinicopathologic risk assessment in predicting all endpoints. For time to distant metastases, the gene signature yielded HR = 2.32 (95% confidence interval [CI] = 1.35 to 4.00) without adjustment for clinical risk and hazard ratios ranging from 2.13 to 2.15 after adjustment for various estimates of clinical risk; clinicopathologic risk using Adjuvant! software yielded an unadjusted HR = 1.68 (95% CI = 0.92 to 3.07). For overall survival, the gene signature yielded an unadjusted HR = 2.79 (95% CI = 1.60 to 4.87) and adjusted hazard ratios ranging from 2.63 to 2.89; clinicopathologic risk yielded an unadjusted HR = 1.67 (95% CI = 0.93 to 2.98). For patients in the gene signature high-risk group, 10-year overall survival was 0.69 for patients in both the low- and high-clinical risk groups; for patients in the gene signature low-risk group, the 10-year survival rates were 0.88 and 0.89, respectively. 
CONCLUSIONS: The 70-gene signature adds independent prognostic information to clinicopathologic risk assessment for patients with early breast cancer.

Keywords: csbcbook, csbcbook-ch3
[Bunea2006Aggregation] F. Bunea, A. Tsybakov, and M. Wegkamp. Aggregation and sparsity via l_1 penalized least squares. In G. Lugosi and HU. Simon, editors, Proceedings of the 19th Annual Conference on Learning Theory, COLT 2006, number 4005 in LNAI, pages 379-391, Berlin Heidelberg, 2006. Springer-Verlag. [ bib | .pdf ]
Keywords: lasso
[Bulyk2006DNA] Martha L Bulyk. DNA microarray technologies for measuring protein-DNA interactions. Curr Opin Biotechnol, 17(4):422-430, Aug 2006. [ bib | DOI | http ]
DNA-binding proteins have key roles in many cellular processes, including transcriptional regulation and replication. Microarray-based technologies permit the high-throughput identification of binding sites and enable the functional roles of these binding proteins to be elucidated. In particular, microarray readout either of chromatin immunoprecipitated DNA-bound proteins (ChIP-chip) or of DNA adenine methyltransferase fusion proteins (DamID) enables the identification of in vivo genomic target sites of proteins. A complementary approach to analyse the in vitro binding of proteins directly to double-stranded DNA microarrays (protein binding microarrays; PBMs), permits rapid characterization of their DNA binding site sequence specificities. Recent advances in DNA microarray synthesis technologies have facilitated the definition of DNA-binding sites at much higher resolution and coverage, and advances in these and emerging technologies will further increase the efficiencies of these exciting new approaches.

Keywords: Animals; Chromatin Immunoprecipitation, methods; Cross-Linking Reagents, chemistry; DNA, analysis/chemistry/metabolism; DNA-Binding Proteins, analysis/genetics/metabolism; Humans; Oligonucleotide Array Sequence Analysis, methods; Protein Binding
[Bui2006Structural] H.-H. Bui, AJ. Schiewe, H. von Grafenstein, and IS. Haworth. Structural prediction of peptides binding to MHC class I molecules. Proteins, 63(1):43-52, Apr 2006. [ bib | DOI | http ]
Peptide binding to class I major histocompatibility complex (MHCI) molecules is a key step in the immune response and the structural details of this interaction are of importance in the design of peptide vaccines. Algorithms based on primary sequence have had success in predicting potential antigenic peptides for MHCI, but such algorithms have limited accuracy and provide no structural information. Here, we present an algorithm, PePSSI (peptide-MHC prediction of structure through solvated interfaces), for the prediction of peptide structure when bound to the MHCI molecule, HLA-A2. The algorithm combines sampling of peptide backbone conformations and flexible movement of MHC side chains and is unique among other prediction algorithms in its incorporation of explicit water molecules at the peptide-MHC interface. In an initial test of the algorithm, PePSSI was used to predict the conformation of eight peptides bound to HLA-A2, for which X-ray data are available. Comparison of the predicted and X-ray conformations of these peptides gave RMSD values between 1.301 and 2.475 A. Binding conformations of 266 peptides with known binding affinities for HLA-A2 were then predicted using PePSSI. Structural analyses of these peptide-HLA-A2 conformations showed that peptide binding affinity is positively correlated with the number of peptide-MHC contacts and negatively correlated with the number of interfacial water molecules. These results are consistent with the relatively hydrophobic binding nature of the HLA-A2 peptide binding interface. In summary, PePSSI is capable of rapid and accurate prediction of peptide-MHC binding conformations, which may in turn allow estimation of MHCI-peptide binding affinity.

Keywords: Algorithms, Amino Acid Sequence, Antigens, Artificial Intelligence, Automated, Binding Sites, Chemical, Computational Biology, Computer Simulation, Crystallography, Electrostatics, Genes, Genetic, HLA Antigens, Histocompatibility Antigens Class I, Humans, Hydrogen Bonding, Ligands, MHC Class I, Major Histocompatibility Complex, Models, Molecular, Molecular Conformation, Molecular Sequence Data, Pattern Recognition, Peptides, Protein, Protein Binding, Protein Conformation, Proteomics, Quantitative Structure-Activity Relationship, Sequence Alignment, Sequence Analysis, Software, Structural Homology, Structure-Activity Relationship, Thermodynamics, Water, X-Ray, X-Rays, 16447245
[Bonachera2006Fuzzy] F. Bonachéra, B. Parent, F. Barbosa, N. Froloff, and D. Horvath. Fuzzy tricentric pharmacophore fingerprints. 1. Topological fuzzy pharmacophore triplets and adapted molecular similarity scoring schemes. J. Chem. Inf. Model., 46(6):2457-2477, 2006. [ bib | DOI | http | .pdf ]
This paper introduces a novel molecular description, topological (2D) fuzzy pharmacophore triplets (2D-FPT), using the number of interposed bonds as the measure of separation between the atoms representing pharmacophore types (hydrophobic, aromatic, hydrogen-bond donor and acceptor, cation, and anion). 2D-FPT features three key improvements with respect to the state-of-the-art pharmacophore fingerprints: (1) The first key novelty is fuzzy mapping of molecular triplets onto the basis set of pharmacophore triplets: unlike in the binary scheme where an atom triplet is set to highlight the bit of a single, best-matching basis triplet, the herein-defined fuzzy approach allows for gradual mapping of each atom triplet onto several related basis triplets, thus minimizing binary classification artifacts. (2) The second innovation is proteolytic equilibrium dependence, by explicitly considering all of the conjugated acids and bases (microspecies). 2D-FPTs are concentration-weighted (as predicted at pH=7.4) averages of microspecies fingerprints. Therefore, small structural modifications, not affecting the overall pharmacophore pattern (in the sense of classical rule-based assignment), but nevertheless triggering a pKa shift, will have a major impact on 2D-FPT. Pairs of almost identical compounds with significantly differing activities ("activity cliffs" in classical descriptor spaces) were in many cases predictable by 2D-FPT. (3) The third innovation is a new similarity scoring formula, acknowledging that the simultaneous absence of a triplet in two molecules is a less-constraining indicator of similarity than its simultaneous presence. It displays excellent neighborhood behavior, outperforming 2D or 3D two-point pharmacophore descriptors or chemical fingerprints. The 2D-FPT calculator was developed using the chemoinformatics toolkit of ChemAxon (www.chemaxon.com).

Keywords: chemoinformatics
[Blaschko2006Conformal] M.B. Blaschko and T. Hofmann. Conformal Multi-Instance Kernels. In NIPS 2006 Workshop on Learning to Compare Examples, 2006. [ bib ]
[Bishop2006Pattern] C.M. Bishop. Pattern recognition and machine learning. Springer, 2006. [ bib ]
[Birge2006Minimal] L. Birgé and P. Massart. Minimal penalties for Gaussian model selection. Probab. Theory Relat. Fields, 138:33-73, 2006. [ bib | DOI | http | .pdf ]
[Bild2006Oncogenic] AH. Bild, G. Yao, JT. Chang, Q. Wang, A. Potti, D. Chasse, MB. Joshi, D. Harpole, JM. Lancaster, A. Berchuck, JA. Olson Jr., JR. Marks, HK. Dressman, M. West, and JR. Nevins. Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature, 439(7074):353-7, 2006. [ bib | DOI | http | .pdf ]
The development of an oncogenic state is a complex process involving the accumulation of multiple independent mutations that lead to deregulation of cell signalling pathways central to the control of cell growth and cell fate. The ability to define cancer subtypes, recurrence of disease and response to specific therapies using DNA microarray-based gene expression signatures has been demonstrated in multiple studies. Various studies have also demonstrated the potential for using gene expression profiles for the analysis of oncogenic pathways. Here we show that gene expression signatures can be identified that reflect the activation status of several oncogenic pathways. When evaluated in several large collections of human cancers, these gene expression signatures identify patterns of pathway deregulation in tumours and clinically relevant associations with disease outcomes. Combining signature-based predictions across several pathways identifies coordinated patterns of pathway deregulation that distinguish between specific cancers and tumour subtypes. Clustering tumours based on pathway signatures further defines prognosis in respective patient subsets, demonstrating that patterns of oncogenic pathway deregulation underlie the development of the oncogenic phenotype and reflect the biology and outcome of specific cancers. Predictions of pathway deregulation in cancer cell lines are also shown to predict the sensitivity to therapeutic agents that target components of the pathway. Linking pathway deregulation with sensitivity to therapeutics that target components of the pathway provides an opportunity to make use of these oncogenic pathway signatures to guide the use of targeted therapeutics.

Keywords: breastcancer
[Biau2006Statistical] G. Biau and K. Bleakley. Statistical inference on graphs. Statistics and Decisions, 24(2):209-232, 2006. [ bib ]
[Bhavani2006Substructure-based] S. Bhavani, A. Nagargadde, A. Thawani, V. Sridhar, and N. Chandra. Substructure-based support vector machine classifiers for prediction of adverse effects in diverse classes of drugs. J. Chem. Inform. Model., 46(6):2478-2486, 2006. [ bib | DOI | http ]
Unforeseen adverse effects exhibited by drugs contribute heavily to late-phase failure and even withdrawal of marketed drugs. Torsade de pointes (TdP) is one such important adverse effect, which causes cardiac arrhythmia and, in some cases, sudden death, making it crucial for potential drugs to be screened for torsadogenicity. The need to tap the power of computational approaches for the prediction of adverse effects such as TdP is increasingly becoming evident. The availability of screening data including those in organized databases greatly facilitates exploration of newer computational approaches. In this paper, we report the development of a prediction method based on a support vector machine algorithm. The method uses a combination of descriptors, encoding both the type of toxicophore as well as the position of the toxicophore in the drug molecule, thus considering both the pharmacophore and the three-dimensional shape information of the molecule. For delineating toxicophores, a novel pattern-recognition method that utilizes substructures within a molecule has been developed. The results obtained using the hybrid approach have been compared with those available in the literature for the same data set. An improvement in prediction accuracy is clearly seen, with the accuracy reaching up to 97% in predicting compounds that can cause TdP and 90% for predicting compounds that do not cause TdP. The generic nature of the method has been demonstrated with four data sets available for carcinogenicity, where prediction accuracies were significantly higher, with a best receiver operating characteristics (ROC) value of 0.81 as against a best ROC value of 0.7 reported in the literature for the same data set. Thus, the method holds promise for wide applicability in toxicity prediction.

Keywords: Algorithms; Carcinogens; Chemistry, Pharmaceutical; Computational Biology; Drug Evaluation, Preclinical; Drug Industry; Humans; Models, Chemical; Models, Statistical; Neural Networks (Computer); Pattern Recognition, Automated; ROC Curve; Sequence Analysis, Protein; Software; Torsades de Pointes
[Berg2006Cross-species] J. Berg and M. Lässig. Cross-species analysis of biological networks by Bayesian alignment. Proc. Natl. Acad. Sci. USA, 103(29):10967-10972, Jul 2006. [ bib | DOI | http | .pdf ]
Complex interactions between genes or proteins contribute a substantial part to phenotypic evolution. Here we develop an evolutionarily grounded method for the cross-species analysis of interaction networks by alignment, which maps bona fide functional relationships between genes in different organisms. Network alignment is based on a scoring function measuring mutual similarities between networks, taking into account their interaction patterns as well as sequence similarities between their nodes. High-scoring alignments and optimal alignment parameters are inferred by a systematic Bayesian analysis. We apply this method to analyze the evolution of coexpression networks between humans and mice. We find evidence for significant conservation of gene expression clusters and give network-based predictions of gene function. We discuss examples where cross-species functional relationships between genes do not concur with sequence similarity.

[Ben-Hur2006Choosing] A. Ben-Hur and WS. Noble. Choosing negative examples for the prediction of protein-protein interactions. BMC Bioinformatics, 7 Suppl 1:S2, 2006. [ bib | DOI | http ]
The protein-protein interaction networks of even well-studied model organisms are sketchy at best, highlighting the continued need for computational methods to help direct experimentalists in the search for novel interactions. This need has prompted the development of a number of methods for predicting protein-protein interactions based on various sources of data and methodologies. The common method for choosing negative examples for training a predictor of protein-protein interactions is based on annotations of cellular localization, and the observation that pairs of proteins that have different localization patterns are unlikely to interact. While this method leads to high quality sets of non-interacting proteins, we find that this choice can lead to biased estimates of prediction accuracy, because the constraints placed on the distribution of the negative examples makes the task easier. The effects of this bias are demonstrated in the context of both sequence-based and non-sequence based features used for predicting protein-protein interactions.

[Beers2006Array-CGH] E. van Beers and P. Nederlof. Array-CGH and breast cancer. Breast Cancer Research, 8(3):210, 2006. [ bib | DOI | http | .pdf ]
The introduction of comparative genomic hybridization (CGH) in 1992 opened new avenues in genomic investigation; in particular, it advanced analysis of solid tumours, including breast cancer, because it obviated the need to culture cells before their chromosomes could be analyzed. The current generation of CGH analysis uses ordered arrays of genomic DNA sequences and is therefore referred to as array-CGH or matrix-CGH. It was introduced in 1998, and further increased the potential of CGH to provide insight into the fundamental processes of chromosomal instability and cancer. This review provides a critical evaluation of the data published on array-CGH and breast cancer, and discusses some of its expected future value and developments.

Keywords: breastcancer, cgh
[Bansal2006Inference] M. Bansal, G. Della Gatta, and D. di Bernardo. Inference of gene regulatory networks and compound mode of action from time course gene expression profiles. Bioinformatics, 22(7):815-822, Apr 2006. [ bib | DOI | http ]
MOTIVATION: Time series expression experiments are an increasingly popular method for studying a wide range of biological systems. Here we developed an algorithm that can infer the local network of gene-gene interactions surrounding a gene of interest. This is achieved by a perturbation of the gene of interest and subsequently measuring the gene expression profiles at multiple time points. We applied this algorithm to computer simulated data and to experimental data on a nine gene network in Escherichia coli. RESULTS: In this paper we show that it is possible to recover the gene regulatory network from a time series data of gene expression following a perturbation to the cell. We show this both on simulated data and on a nine gene subnetwork part of the DNA-damage response pathway (SOS pathway) in the bacteria E. coli. CONTACT: dibernardo@tigem.it SUPPLEMENTARY INFORMATION: Supplementary data are available at http://dibernado.tigem.it

[Bandyopadhyay2006Systematic] S. Bandyopadhyay, R. Sharan, and T. Ideker. Systematic identification of functional orthologs based on protein network comparison. Genome Res., 16(3):428-435, Mar 2006. [ bib | DOI | http | .pdf ]
Annotating protein function across species is an important task that is often complicated by the presence of large paralogous gene families. Here, we report a novel strategy for identifying functionally related proteins that supplements sequence-based comparisons with information on conserved protein-protein interactions. First, the protein interaction networks of two species are aligned by assigning proteins to sequence homology clusters using the Inparanoid algorithm. Next, probabilistic inference is performed on the aligned networks to identify pairs of proteins, one from each species, that are likely to retain the same function based on conservation of their interacting partners. Applying this method to Drosophila melanogaster and Saccharomyces cerevisiae, we analyze 121 cases for which functional orthology assignment is ambiguous when sequence similarity is used alone. In 61 of these cases, the network supports a different protein pair than that favored by sequence comparisons. These results suggest that network analysis can be used to provide a key source of information for refining sequence-based homology searches.

[Bagci2006BiophysJ] E. Z. Bagci, Y. Vodovotz, T. R. Billiar, G. B. Ermentrout, and I. Bahar. Bistability in apoptosis: roles of Bax, Bcl-2, and mitochondrial permeability transition pores. Biophys J, 90(5):1546-59, 2006. [ bib ]
We propose a mathematical model for mitochondria-dependent apoptosis, in which kinetic cooperativity in formation of the apoptosome is a key element ensuring bistability. We examine the role of Bax and Bcl-2 synthesis and degradation rates, as well as the number of mitochondrial permeability transition pores (MPTPs), on the cell response to apoptotic stimuli. Our analysis suggests that cooperative apoptosome formation is a mechanism for inducing bistability, much more robust than that induced by other mechanisms, such as inhibition of caspase-3 by the inhibitor of apoptosis (IAP). Simulations predict a pathological state in which cells will exhibit a monostable cell survival if Bax degradation rate is above a threshold value, or if Bax expression rate is below a threshold value. Otherwise, cell death or survival occur depending on initial caspase-3 levels. We show that high expression rates of Bcl-2 can counteract the effects of Bax. Our simulations also demonstrate a monostable (pathological) apoptotic response if the number of MPTPs exceeds a threshold value. This study supports our contention, based on mathematical modeling, that cooperativity in apoptosome formation is critically important for determining the healthy responses to apoptotic stimuli, and helps define the roles of Bax, Bcl-2, and MPTP vis-a-vis apoptosome formation.

Keywords: csbcbook
[Bacilieri2006Ligand] Magdalena Bacilieri and Stefano Moro. Ligand-based drug design methodologies in drug discovery process: an overview. Curr Drug Discov Technol, 3(3):155-165, Sep 2006. [ bib ]
Ligand-based drug design represents an important research field in the drug discovery and optimisation process. This review provides an overview about the theoretical background of the quantitative structure activity relationship (QSAR) models.

Keywords: Drug Design; Ligands; Models, Theoretical; Molecular Structure; Pharmaceutical Preparations, chemistry/metabolism; Quantitative Structure-Activity Relationship; Technology, Pharmaceutical, methods
[Antes2006DynaPred] Iris Antes, Shirley W I Siu, and Thomas Lengauer. DynaPred: a structure and sequence based method for the prediction of MHC class I binding peptide sequences and conformations. Bioinformatics, 22(14):e16-e24, Jul 2006. [ bib | DOI | http ]
MOTIVATION: The binding of endogenous antigenic peptides to MHC class I molecules is an important step during the immunologic response of a host against a pathogen. Thus, various sequence- and structure-based prediction methods have been proposed for this purpose. The sequence-based methods are computationally efficient, but are hampered by the need of sufficient experimental data and do not provide a structural interpretation of their results. The structural methods are data-independent, but are quite time-consuming and thus not suited for screening of whole genomes. Here, we present a new method, which performs sequence-based prediction by incorporating information obtained from molecular modeling. This allows us to perform large databases screening and to provide structural information of the results. RESULTS: We developed a SVM-trained, quantitative matrix-based method for the prediction of MHC class I binding peptides, in which the features of the scoring matrix are energy terms retrieved from molecular dynamics simulations. At the same time we used the equilibrated structures obtained from the same simulations in a simple and efficient docking procedure. Our method consists of two steps: First, we predict potential binders from sequence data alone and second, we construct protein-peptide complexes for the predicted binders. So far, we tested our approach on the HLA-A0201 allele. We constructed two prediction models, using local, position-dependent (DynaPred(POS)) and global, position-independent (DynaPred) features. The former model outperformed the two sequence-based methods used in our evaluation; the latter shows a much higher generalizability towards other alleles than the position-dependent models. The constructed peptide structures can be refined within seconds to structures with an average backbone RMSD of 1.53 Å from the corresponding experimental structures.

Keywords: immunoinformatics
[Amato2006multi-step] R. Amato, A. Ciaramella, N. Deniskina, C. Del Mondo, D. di Bernardo, C. Donalek, G. Longo, G. Mangano, G. Miele, G. Raiconi, A. Staiano, and R. Tagliaferri. A multi-step approach to time series analysis and gene expression clustering. Bioinformatics, 22(5):589-596, Mar 2006. [ bib | DOI | http ]
MOTIVATION: The huge growth in gene expression data calls for the implementation of automatic tools for data processing and interpretation. RESULTS: We present a new and comprehensive machine learning data mining framework consisting in a non-linear PCA neural network for feature extraction, and probabilistic principal surfaces combined with an agglomerative approach based on Negentropy aimed at clustering gene microarray data. The method, which provides a user-friendly visualization interface, can work on noisy data with missing points and represents an automatic procedure to get, with no a priori assumptions, the number of clusters present in the data. Cell-cycle dataset and a detailed analysis confirm the biological nature of the most significant clusters. AVAILABILITY: The software described here is a subpackage part of the ASTRONEURAL package and is available upon request from the corresponding author. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

[Abramovich2006Adapting] F. Abramovich, Y. Benjamini, D. L. Donoho, and I. M. Johnstone. Adapting to unknown sparsity by controlling the false discovery rate. Ann. Stat., 34(2):584-653, 2006. [ bib | DOI | http | .pdf ]
[Abernethy2006Low-rank] J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report cs/0611124, arXiv, 2006. [ bib ]
[Glaser2006Method] F. Glaser, R. J. Morris, R. J. Najmanovich, R. A. Laskowski, and J. M. Thornton. A method for localizing ligand binding pockets in protein structures. Proteins, 62(2):479-488, February 2006. [ bib | DOI | http ]
The accurate identification of ligand binding sites in protein structures can be valuable in determining protein function. Once the binding site is known, it becomes easier to perform in silico and experimental procedures that may allow the ligand type and the protein function to be determined. For example, binding pocket shape analysis relies heavily on the correct localization of the ligand binding site. We have developed SURFNET-ConSurf, a modular, two-stage method for identifying the location and shape of potential ligand binding pockets in protein structures. In the first stage, the SURFNET program identifies clefts in the protein surface that are potential binding sites. In the second stage, these clefts are trimmed in size by cutting away regions distant from highly conserved residues, as defined by the ConSurf-HSSP database. The largest clefts that remain tend to be those where ligands bind. To test the approach, we analyzed a nonredundant set of 244 protein structures from the PDB and found that SURFNET-ConSurf identifies a ligand binding pocket in 75% of them. The trimming procedure reduces the original cleft volumes by 30% on average, while still encompassing an average 87% of the ligand volume. From the analysis of the results we conclude that for those cases in which the ligands are found in large, highly conserved clefts, the combined SURFNET-ConSurf method gives pockets that are a better match to the ligand shape and location. We also show that this approach works better for enzymes than for nonenzyme proteins.

Keywords: ligand-volume, protein-ligand, surface
[Zheng2006Robust] Yefeng Zheng and D. Doermann. Robust point matching for nonrigid shapes by preserving local neighborhood structures. IEEE Trans. Pattern Anal. Mach. Intell., 28(4):643-649, April 2006. [ bib | DOI ]
[Tong2006Prediction] J. C. Tong, G. L. Zhang, T. W. Tan, J. T. August, V. Brusic, and S. Ranganathan. Prediction of HLA-DQ3.2beta ligands: evidence of multiple registers in class II binding peptides. Bioinformatics, 22(10):1232-1238, May 2006. [ bib | DOI | http ]
Keywords: immunoinformatics
[Kamangar2006Patterns] F. Kamangar, G. M. Dores, and W. F. Anderson. Patterns of cancer incidence, mortality, and prevalence across five continents: defining priorities to reduce cancer disparities in different geographic regions of the world. J. Clin. Oncol., 24(14):2137-2150, May 2006. [ bib | DOI | http ]
Efforts to reduce global cancer disparities begin with an understanding of geographic patterns in cancer incidence, mortality, and prevalence. Using the GLOBOCAN (2002) and Cancer Incidence in Five Continents databases, we describe overall cancer incidence, mortality, and prevalence, age-adjusted temporal trends, and age-specific incidence patterns in selected geographic regions of the world. For the eight most common malignancies (cancers of lung, breast, colon and rectum, stomach, prostate, liver, cervix, and esophagus), the most important risk factors, cancer prevention and control measures are briefly reviewed. In 2002, an estimated 11 million new cancer cases and 7 million cancer deaths were reported worldwide; nearly 25 million persons were living with cancer. Among the eight most common cancers, global disparities in cancer incidence, mortality, and prevalence are evident, likely due to complex interactions of nonmodifiable (ie, genetic susceptibility and aging) and modifiable risk factors (ie, tobacco, infectious agents, diet, and physical activity). Indeed, when risk factors among populations are intertwined with differences in individual behaviors, cultural beliefs and practices, socioeconomic conditions, and health care systems, global cancer disparities are inevitable. For the eight most common cancers, priorities for reducing cancer disparities are discussed.

[Frigola2006Epigenetic] J. Frigola, J. Song, C. Stirzaker, R. A. Hinshelwood, M. A. Peinado, and S. J. Clark. Epigenetic remodeling in colorectal cancer results in coordinate gene suppression across an entire chromosome band. Nat. Genet., 38(5):540-549, May 2006. [ bib | DOI | http | .pdf ]
We report a new mechanism in carcinogenesis involving coordinate long-range epigenetic gene silencing. Epigenetic silencing in cancer has always been envisaged as a local event silencing discrete genes. However, in this study of silencing in colorectal cancer, we found common repression of the entire 4-Mb band of chromosome 2q.14.2, associated with global methylation of histone H3 Lys9. DNA hypermethylation within the repressed genomic neighborhood was localized to three separate enriched CpG island 'suburbs', with the largest hypermethylated suburb spanning 1 Mb. These data change our understanding of epigenetic gene silencing in cancer cells: namely, epigenetic silencing can span large regions of the chromosome, and both DNA-methylated and neighboring unmethylated genes can be coordinately suppressed by global changes in histone modification. We propose that loss of gene expression can occur through long-range epigenetic silencing, with similar implications as loss of heterozygosity in cancer.

Keywords: csbcbook
[Farid2006New] R. Farid, T. Day, R. A. Friesner, and R. A. Pearlstein. New insights about HERG blockade obtained from protein modeling, potential energy mapping, and docking studies. Bioorg. Med. Chem., 14(9):3160-3173, May 2006. [ bib | DOI | http | .pdf ]
We created a homology model of the homo-tetrameric pore domain of HERG using the crystal structure of the bacterial potassium channel, KvAP, as a template. We docked a set of known blockers with well-characterized effects on channel function into the lumen of the pore between the selectivity filter and extracellular entrance using a novel docking and refinement procedure incorporating Glide and Prime. Key aromatic groups of the blockers are predicted to form multiple simultaneous ring stacking and hydrophobic interactions among the eight aromatic residues lining the pore. Furthermore, each blocker can achieve these interactions via multiple docking configurations. To further interpret the docking results, we mapped hydrophobic and hydrophilic potentials within the lumen of each refined docked complex. Hydrophilic iso-potential contours define a 'propeller-shaped' volume at the selectivity filter entrance. Hydrophobic contours define a hollow 'crown-shaped' volume located above the 'propeller', whose hydrophobic 'rim' extends along the pore axis between Tyr652 and Phe656. Blockers adopt conformations/binding orientations that closely mimic the shapes and properties of these contours. Blocker basic groups are localized in the hydrophilic 'propeller', forming electrostatic interactions with Ser624 rather than a generally accepted pi-cation interaction with Tyr652. Terfenadine, cisapride, sertindole, ibutilide, and clofilium adopt similar docked poses, in which their N-substituents bridge radially across the hollow interior of the 'crown' (analogous to the hub and spokes of a wheel), and project aromatic/hydrophobic portions into the hydrophobic 'rim'. MK-499 docks with its longitudinal axis parallel to the axis of the pore and 'crown', and its hydrophobic groups buried within the hydrophobic 'rim'.

Keywords: chemoinformatics herg
[Driel2006text-mining] M.A. van Driel, J. Bruggeman, G. Vriend, H.G. Brunner, and J.A.M. Leunissen. A text-mining analysis of the human phenome. Eur. J. Hum. Genet., 14(5):535-542, May 2006. [ bib | DOI | http ]
A number of large-scale efforts are underway to define the relationships between genes and proteins in various species. But, few attempts have been made to systematically classify all such relationships at the phenotype level. Also, it is unknown whether such a phenotype map would carry biologically meaningful information. We have used text mining to classify over 5000 human phenotypes contained in the Online Mendelian Inheritance in Man database. We find that similarity between phenotypes reflects biological modules of interacting functionally related genes. These similarities are positively correlated with a number of measures of gene function, including relatedness at the level of protein sequence, protein motifs, functional annotation, and direct protein-protein interaction. Phenotype grouping reflects the modular nature of human disease genetics. Thus, phenotype mapping may be used to predict candidate genes for diseases as well as functional relations between genes and proteins. Such predictions will further improve if a unified system of phenotype descriptors is developed. The phenotype similarity data are accessible through a web interface at http://www.cmbi.ru.nl/MimMiner/.

Keywords: Chromosome Mapping; Databases, Genetic; Genetic Predisposition to Disease; Genetic Vectors; Genome, Human; Genotype; Humans; Models, Genetic; Models, Statistical; Multigene Family; Phenotype
[Coi2006Prediction] A. Coi, I. Massarelli, L. Murgia, M. Saraceno, V. Calderone, and A. M. Bianucci. Prediction of hERG potassium channel affinity by the CODESSA approach. Bioorg. Med. Chem., 14(9):3153-3159, May 2006. [ bib | DOI | http ]
The problem of predicting torsadogenic cardiotoxicity of drugs is afforded in this work. QSAR studies on a series of molecules, acting as hERG K+ channel blockers, were carried out for this purpose by using the CODESSA program. Molecules belonging to the analyzed dataset are characterized by different therapeutic targets and by high molecular diversity. The predictive power of the obtained models was estimated by means of rigorous validation criteria implying the use of highly diagnostic statistical parameters on the test set, other than the training set. Validation results obtained for a blind set, disjoined from the whole dataset initially considered, confirmed the predictive potency of the models proposed here, so suggesting that they are worth to be considered as a valuable tool for practical applications in predicting the blockade of hERG K+ channels.

Keywords: chemoinformatics herg
[Bertucci2006Gene] F. Bertucci, P. Finetti, N. Cervera, E. Charafe-Jauffret, E. Mamessier, J. Adélaïde, S. Debono, G. Houvenaeghel, D. Maraninchi, P. Viens, C. Charpin, J. Jacquemier, and D. Birnbaum. Gene expression profiling shows medullary breast cancer is a subgroup of basal breast cancers. Cancer Res., 66(9):4636-4644, May 2006. [ bib | DOI | http | .pdf ]
Medullary breast cancer (MBC) is a rare but enigmatic pathologic type of breast cancer. Despite features of aggressiveness, MBC is associated with a favorable prognosis. Morphologic diagnosis remains difficult in many cases. Very little is known about the molecular alterations involved in MBC. Notably, it is not clear whether MBC and ductal breast cancer (DBC) represent molecularly distinct entities and what genes/proteins might account for their differences. Using whole-genome oligonucleotide microarrays, we compared gene expression profiles of 22 MBCs and 44 grade III DBCs. We show that MBCs are less heterogeneous than DBCs. Whereas different molecular subtypes (luminal A, luminal B, basal, ERBB2-overexpressing, and normal-like) exist in DBCs, 95% MBCs display a basal profile, similar to that of basal DBCs. Supervised analysis identified gene expression signatures that discriminated MBCs from DBCs. Discriminator genes are associated with various cellular processes related to MBC features, in particular immune reaction and apoptosis. As compared with MBCs, basal DBCs overexpress genes involved in smooth muscle cell differentiation, suggesting that MBCs are a distinct subgroup of basal breast cancer with limited myoepithelial differentiation. Finally, MBCs overexpress a series of genes located on the 12p13 and 6p21 chromosomal regions known to contain pluripotency genes. Our results contribute to a better understanding of MBC and of mammary oncogenesis in general.

[Aerts2006Gene] S. Aerts, D. Lambrechts, S. Maity, P. Van Loo, B. Coessens, F. De Smet, L.-C. Tranchevent, B. De Moor, P. Marynen, B. Hassan, P. Carmeliet, and Y. Moreau. Gene prioritization through genomic data fusion. Nat. Biotechnol., 24(5):537-544, May 2006. [ bib | DOI | http | .pdf ]
The identification of genes involved in health and disease remains a challenge. We describe a bioinformatics approach, together with a freely accessible, interactive and flexible software termed Endeavour, to prioritize candidate genes underlying biological processes or diseases, based on their similarity to known genes involved in these phenomena. Unlike previous approaches, ours generates distinct prioritizations for multiple heterogeneous data sources, which are then integrated, or fused, into a global ranking using order statistics. In addition, it offers the flexibility of including additional data sources. Validation of our approach revealed it was able to efficiently prioritize 627 genes in disease data sets and 76 genes in biological pathway sets, identify candidates of 16 mono- or polygenic diseases, and discover regulatory genes of myeloid differentiation. Furthermore, the approach identified a novel gene involved in craniofacial development from a 2-Mb chromosomal region, deleted in some patients with DiGeorge-like birth defects. The approach described here offers an alternative integrative method for gene discovery.

[Bailey2006MEME] Timothy L. Bailey, Nadya Williams, Chris Misleh, and Wilfred W. Li. MEME: discovering and analyzing DNA and protein sequence motifs. Nucl. Acids Res., 34(suppl_2):W369-373, July 2006. [ bib | DOI | http ]
MEME (Multiple EM for Motif Elicitation) is one of the most widely used tools for searching for novel 'signals' in sets of biological sequences. Applications include the discovery of new transcription factor binding sites and protein domains. MEME works by searching for repeated, ungapped sequence patterns that occur in the DNA or protein sequences provided by the user. Users can perform MEME searches via the web server hosted by the National Biomedical Computation Resource (http://meme.nbcr.net) and several mirror sites. Through the same web server, users can also access the Motif Alignment and Search Tool to search sequence databases for matches to motifs encoded in several popular formats. By clicking on buttons in the MEME output, users can compare the motifs discovered in their input sequences with databases of known motifs, search sequence databases for matches to the motifs and display the motifs in various formats. This article describes the freely accessible web server and its architecture, and discusses ways to use MEME effectively to find new sequence patterns in biological sequences and analyze their significance.

Keywords: motif-identification, sequence-pattern-recognition, software
[Mahe2006Graph] P. Mahé and J.-P. Vert. Graph kernels based on tree patterns for molecules. Technical Report ccsd-00095488, HAL, September 2006. [ bib | http ]
Keywords: chemoinformatics kernel-theory
[Chin2006Using] S.-F. Chin, Y. Wang, N. P. Thorne, A. E. Teschendorff, S. E. Pinder, M. Vias, A. Naderi, I. Roberts, N. L. Barbosa-Morais, M. J. Garcia, N. G. Iyer, T. Kranjac, J. F. R. Robertson, S. Aparicio, S. Tavare, I. Ellis, J. D. Brenton, and C. Caldas. Using array-comparative genomic hybridization to define molecular portraits of primary breast cancers. Oncogene, 26(13):1959-1970, September 2006. [ bib | DOI | http | .pdf ]
Keywords: breastcancer
[Zou2006adaptive] Hui Zou. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc., 101:1418-1429, December 2006. [ bib | .html | .pdf ]
[Bottou2007Large-scale] L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors. Large-scale kernel machines. MIT Press, 2007. [ bib ]
[Zien2007Multiclass] A. Zien and C. Ong. Multiclass multiple kernel learning. In Zoubin Ghahramani, editor, Proceedings of the 24th Annual International Conference on Machine Learning (ICML 2007), pages 1191-1198. Omnipress, 2007. [ bib ]
[Yuan2007Predicting] Y. Yuan, L. Guo, L. Shen, and J. S. Liu. Predicting gene expression from sequence: A reexamination. PLoS Comput. Biol., 3(11):e243, 2007. [ bib | DOI | http | .pdf ]
[Yuan2007On] Ming Yuan and Yi Lin. On the non-negative garrotte estimator. Journal Of The Royal Statistical Society Series B, 69(2):143-161, 2007. [ bib | .html ]
We study the non-negative garrotte estimator from three different aspects: consistency, computation and flexibility. We argue that the non-negative garrotte is a general procedure that can be used in combination with estimators other than the original least squares estimator as in its original form. In particular, we consider using the lasso, the elastic net and ridge regression along with ordinary least squares as the initial estimate in the non-negative garrotte. We prove that the non-negative garrotte has the nice property that, with probability tending to 1, the solution path contains an estimate that correctly identifies the set of important variables and is consistent for the coefficients of the important variables, whereas such a property may not be valid for the initial estimators. In general, we show that the non-negative garrotte can turn a consistent estimate into an estimate that is not only consistent in terms of estimation but also in terms of variable selection. We also show that the non-negative garrotte has a piecewise linear solution path. Using this fact, we propose an efficient algorithm for computing the whole solution path for the non-negative garrotte. Simulations and a real example demonstrate that the non-negative garrotte is very effective in improving on the initial estimator in terms of variable selection and estimation accuracy. Copyright 2007 Royal Statistical Society.

[Yuan2007Model] Ming Yuan and Yi Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19-35, 2007. [ bib | .html | .pdf ]
[Yu2007Robust] Shipeng Yu, Volker Tresp, and Kai Yu. Robust multi-task learning with t-processes. In ICML '07: Proceedings of the 24th international conference on Machine learning, pages 1103-1110, New York, NY, USA, 2007. ACM. [ bib | DOI ]
[Yu2007Pathway] J. Yu, A. Sieuwerts, Y. Zhang, J. Martens, M. Smid, J. Klijn, Y. Wang, and J. Foekens. Pathway analysis of gene signatures predicting metastasis of node-negative primary breast cancer. BMC Cancer, 7(1):182, 2007. [ bib ]
[Yi2007Strategy] Y. Yi, C. Li, C. Miller, and A. L. George. Strategy for encoding and comparison of gene expression signatures. Genome Biol., 8(7):R133, 2007. [ bib | DOI | http ]
EXALT (EXpression signature AnaLysis Tool) is a computational system enabling comparisons of microarray data across experimental platforms and different laboratories (http://seq.mc.vanderbilt.edu/exalt/). An essential feature of EXALT is a database holding thousands of gene expression signatures extracted from the Gene Expression Omnibus, and encoded in a searchable format. This novel approach to performing global comparisons of shared microarray data may have enormous value when coupled directly with a shared data repository.

[Yan2007Machine] R. J. Yan and C. X. Ling. Machine learning for stock selection. In KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1038-1042, New York, NY, USA, 2007. ACM Press. [ bib | DOI | http | .pdf ]
In this paper, we propose a new method called Prototype Ranking (PR) designed for the stock selection problem. PR takes into account the huge size of real-world stock data and applies a modified competitive learning technique to predict the ranks of stocks. The primary target of PR is to select the top performing stocks among many ordinary stocks. PR is designed to perform the learning and testing in a noisy stocks sample set where the top performing stocks are usually the minority. The performance of PR is evaluated by a trading simulation of the real stock data. Each week the stocks with the highest predicted ranks are chosen to construct a portfolio. In the period of 1978-2004, PR's portfolio earns a much higher average return as well as a higher risk-adjusted return than Cooper's method, which shows that the PR method leads to a clear profit improvement.

[Yan2007Determining] Mingjin Yan and Keying Ye. Determining the number of clusters using the weighted gap statistic. Biometrics, 63(4):1031-1037, Dec 2007. [ bib | DOI | http ]
Estimating the number of clusters in a data set is a crucial step in cluster analysis. In this article, motivated by the gap method (Tibshirani, Walther, and Hastie, 2001, Journal of the Royal Statistical Society B63, 411-423), we propose the weighted gap and the difference of difference-weighted (DD-weighted) gap methods for estimating the number of clusters in data using the weighted within-clusters sum of errors: a measure of the within-clusters homogeneity. In addition, we propose a "multilayer" clustering approach, which is shown to be more accurate than the original gap method, particularly in detecting the nested cluster structure of the data. The methods are applicable when the input data contain continuous measurements and can be used with any clustering method. Simulation studies and real data are investigated and compared among these proposed methods as well as with the original gap method.

Keywords: Algorithms; Biometry, methods; Cluster Analysis; Computer Simulation; Data Interpretation, Statistical; Models, Biological; Models, Statistical; Pattern Recognition, Automated, methods
[Xue2007Multi-task] Ya Xue, Xuejun Liao, Lawrence Carin, and Balaji Krishnapuram. Multi-task learning for classification with Dirichlet process priors. Journal of Machine Learning Research, 8:2007, January 2007. [ bib ]
[Weskamp2007Multiple] N. Weskamp, E. Hullermeier, D. Kuhn, and G. Klebe. Multiple graph alignment for the structural analysis of protein active sites. IEEE/ACM Trans. Comput. Biol. Bioinformatics, 4(2):310-320, 2007. [ bib | DOI ]
[Weinberg2007Biology] R A Weinberg. The biology of cancer. Garland Science, Taylor & Francis Group, LLC, 2007. [ bib ]
Keywords: csbcbook
[Wang2007new] J.Z. Wang, Z. Du, R. Payattakool, P.S. Yu, and C.F. Chen. A new method to measure the semantic similarity of GO terms. Bioinformatics, 23(10):1274, 2007. [ bib | DOI | http ]
[Vishwanathan2007Fast] S.V.N. Vishwanathan, K. Borgwardt, and N. Schraudolph. Fast Computation of Graph Kernels. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Adv. Neural Inform. Process. Syst., volume 19, pages 1-2, Cambridge, MA, 2007. MIT Press. [ bib ]
[Vert2007new] J.-P. Vert, J. Qiu, and W. S. Noble. A new pairwise kernel for biological network inference with support vector machines. BMC Bioinformatics, 8 Suppl 10:S8, 2007. [ bib | DOI | http ]
BACKGROUND: Much recent work in bioinformatics has focused on the inference of various types of biological networks, representing gene regulation, metabolic processes, protein-protein interactions, etc. A common setting involves inferring network edges in a supervised fashion from a set of high-confidence edges, possibly characterized by multiple, heterogeneous data sets (protein sequence, gene expression, etc.). RESULTS: Here, we distinguish between two modes of inference in this setting: direct inference based upon similarities between nodes joined by an edge, and indirect inference based upon similarities between one pair of nodes and another pair of nodes. We propose a supervised approach for the direct case by translating it into a distance metric learning problem. A relaxation of the resulting convex optimization problem leads to the support vector machine (SVM) algorithm with a particular kernel for pairs, which we call the metric learning pairwise kernel. This new kernel for pairs can easily be used by most SVM implementations to solve problems of supervised classification and inference of pairwise relationships from heterogeneous data. We demonstrate, using several real biological networks and genomic datasets, that this approach often improves upon the state-of-the-art SVM for indirect inference with another pairwise kernel, and that the combination of both kernels always improves upon each individual kernel. CONCLUSION: The metric learning pairwise kernel is a new formulation to infer pairwise relationships with SVM, which provides state-of-the-art results for the inference of several biological networks from heterogeneous genomic data.

[Vert2007Kernel] J.-P. Vert. Kernel methods in genomics and computational biology. In G. Camps-Valls, J.-L. Rojo-Alvarez, and M. Martinez-Ramon, editors, Kernel Methods in Bioengineering, Signal and Image Processing. IDEA Group, 2007. [ bib ]
[Vasudevan2007Switching] Shobha Vasudevan, Yingchun Tong, and Joan A Steitz. Switching from repression to activation: microRNAs can up-regulate translation. Science, 318(5858):1931-1934, Dec 2007. [ bib | DOI | http ]
AU-rich elements (AREs) and microRNA target sites are conserved sequences in messenger RNA (mRNA) 3' untranslated regions (3'UTRs) that control gene expression posttranscriptionally. Upon cell cycle arrest, the ARE in tumor necrosis factor-alpha (TNFalpha) mRNA is transformed into a translation activation signal, recruiting Argonaute (AGO) and fragile X mental retardation-related protein 1 (FXR1), factors associated with micro-ribonucleoproteins (microRNPs). We show that human microRNA miR369-3 directs association of these proteins with the AREs to activate translation. Furthermore, we document that two well-studied microRNAs, Let-7 and the synthetic microRNA miRcxcr4, likewise induce translation up-regulation of target mRNAs on cell cycle arrest, yet they repress translation in proliferating cells. Thus, activation is a common function of microRNPs on cell cycle arrest. We propose that translation regulation by microRNPs oscillates between repression and activation during the cell cycle.

Keywords: sirna
[Valentini2007Mosclust:] Giorgio Valentini. Mosclust: a software library for discovering significant structures in bio-molecular data. Bioinformatics, 23(3):387-389, Feb 2007. [ bib | DOI | http ]
The R package mosclust (model order selection for clustering problems) implements algorithms based on the concept of stability for discovering significant structures in bio-molecular data. The software library provides stability indices obtained through different data perturbation methods (resampling, random projections, noise injection), as well as statistical tests to assess the significance of multi-level structures singled out from the data. Availability: http://homes.dsi.unimi.it/~valenti/SW/mosclust/download/mosclust1.0.tar.gz. Supplementary information: http://homes.dsi.unimi.it/~valenti/SW/mosclust.

Keywords: Algorithms; Artificial Intelligence; Cluster Analysis; Gene Expression Profiling, methods; Oligonucleotide Array Sequence Analysis, methods; Pattern Recognition, Automated, methods; Programming Languages; Proteome, metabolism; Signal Transduction, physiology; Software
[Vaidya2007Breast] Jayant S Vaidya. Breast cancer: an artistic view. The Lancet Oncology, 8:583-585, 2007. [ bib ]
Keywords: csbcbook
[Tung2007POPI:] Chun-Wei Tung and Shinn-Ying Ho. Popi: predicting immunogenicity of mhc class i binding peptides by mining informative physicochemical properties. Bioinformatics, 23(8):942-949, Apr 2007. [ bib | DOI | http ]
MOTIVATION: Both modeling of antigen-processing pathway including major histocompatibility complex (MHC) binding and immunogenicity prediction of those MHC-binding peptides are essential to develop a computer-aided system of peptide-based vaccine design that is one goal of immunoinformatics. Numerous studies have dealt with modeling the immunogenic pathway but not the intractable problem of immunogenicity prediction due to complex effects of many intrinsic and extrinsic factors. Moderate affinity of the MHC-peptide complex is essential to induce immune responses, but the relationship between the affinity and peptide immunogenicity is too weak to use for predicting immunogenicity. This study focuses on mining informative physicochemical properties from known experimental immunogenicity data to understand immune responses and predict immunogenicity of MHC-binding peptides accurately. RESULTS: This study proposes a computational method to mine a feature set of informative physicochemical properties from MHC class I binding peptides to design a support vector machine (SVM) based system (named POPI) for the prediction of peptide immunogenicity. High performance of POPI arises mainly from an inheritable bi-objective genetic algorithm, which aims to automatically determine the best number m out of 531 physicochemical properties, identify these m properties and tune SVM parameters simultaneously. The dataset consisting of 428 human MHC class I binding peptides belonging to four classes of immunogenicity was established from MHCPEP, a database of MHC-binding peptides (Brusic et al., 1998). POPI, utilizing the m = 23 selected properties, performs well with the accuracy of 64.72% using leave-one-out cross-validation, compared with two sequence alignment-based prediction methods ALIGN (54.91%) and PSI-BLAST (53.23%). POPI is the first computational system for prediction of peptide immunogenicity based on physicochemical properties. 
AVAILABILITY: A web server for prediction of peptide immunogenicity (POPI) and the used dataset of MHC class I binding peptides (PEPMHCI) are available at http://iclab.life.nctu.edu.tw/POPI

Keywords: Algorithms; Artificial Intelligence; Binding Sites; Epitope Mapping; Histocompatibility Antigens Class I; Oligopeptides; Pattern Recognition, Automated; Protein Binding; Software; Structure-Activity Relationship
[Tsuda2007Entire] K. Tsuda. Entire regularization path for graph data. In ICML '07: Proceedings of the 24th international conference on Machine learning, pages 919-926, New York, NY, USA, 2007. ACM. [ bib | DOI | http | .pdf ]
[Tong2007In] Joo Chuan Tong, Tin Wee Tan, and Shoba Ranganathan. In silico grouping of peptide/hla class i complexes using structural interaction characteristics. Bioinformatics, 23(2):177-183, Jan 2007. [ bib | DOI | http ]
MOTIVATION: Classification of human leukocyte antigen (HLA) proteins into supertypes underpins the development of epitope-based vaccines with wide population coverage. Current methods for HLA supertype definition, based on common structural features of HLA proteins and/or their functional binding specificities, leave structural interaction characteristics among different HLA supertypes with antigenic peptides unexplored. METHODS: We describe the use of structural interaction descriptors for the analysis of 68 peptide/HLA class I crystallographic structures. Interaction parameters computed include the number of intermolecular hydrogen bonds between each HLA protein and its corresponding bound peptide, solvent accessibility, gap volume and gap index. RESULTS: The structural interactions patterns of peptide/HLA class I complexes investigated herein vary among individual alleles and may be grouped in a supertype dependent manner. Using the proposed methodology, eight HLA class I supertypes were defined based on existing experimental crystallographic structures which largely overlaps (77% consensus) with the definitions by binding motifs. This mode of classification, which considers conformational information of both peptide and HLA proteins, provides an alternative to the characterization of supertypes using either peptide or HLA protein information alone.

[Tomlins2007Integrative] Scott A Tomlins, Rohit Mehra, Daniel R Rhodes, Xuhong Cao, Lei Wang, Saravana M Dhanasekaran, Shanker Kalyana-Sundaram, John T Wei, Mark A Rubin, Kenneth J Pienta, Rajal B Shah, and Arul M Chinnaiyan. Integrative molecular concept modeling of prostate cancer progression. Nat. Genet., 39(1):41-51, Jan 2007. [ bib | DOI | http | .pdf ]
Despite efforts to profile prostate cancer, the genetic alterations and biological processes that correlate with the observed histological progression are unclear. Using laser-capture microdissection to isolate 101 cell populations, we have profiled prostate cancer progression from benign epithelium to metastatic disease. By analyzing expression signatures in the context of over 14,000 'molecular concepts', or sets of biologically connected genes, we generated an integrative model of progression. Molecular concepts that demarcate critical transitions in progression include protein biosynthesis, E26 transformation-specific (ETS) family transcriptional targets, androgen signaling and cell proliferation. Of note, relative to low-grade prostate cancer (Gleason pattern 3), high-grade cancer (Gleason pattern 4) shows an attenuated androgen signaling signature, similar to metastatic prostate cancer, which may reflect dedifferentiation and explain the clinical association of grade with prognosis. Taken together, these data show that analyzing gene expression signatures in the context of a compendium of molecular concepts is useful in understanding cancer biology.

[Thomassen2007Comparison] M. Thomassen, Q. Tan, F. Eiriksdottir, M. Bak, S. Cold, and T.A. Kruse. Comparison of gene sets for expression profiling: prediction of metastasis from low-malignant breast cancer. Clinical Cancer Research, 13(18):5355-5360, 2007. [ bib ]
[Taylor2007minimum] Chris F Taylor, Norman W Paton, Kathryn S Lilley, Pierre-Alain Binz, Randall K Julian, Andrew R Jones, Weimin Zhu, Rolf Apweiler, Ruedi Aebersold, Eric W Deutsch, Michael J Dunn, Albert J R Heck, Alexander Leitner, Marcus Macht, Matthias Mann, Lennart Martens, Thomas A Neubert, Scott D Patterson, Peipei Ping, Sean L Seymour, Puneet Souda, Akira Tsugita, Joel Vandekerckhove, Thomas M Vondriska, Julian P Whitelegge, Marc R Wilkins, Ioannis Xenarios, John R Yates, and Henning Hermjakob. The minimum information about a proteomics experiment (miape). Nat Biotechnol, 25(8):887-893, Aug 2007. [ bib | DOI | http ]
Both the generation and the analysis of proteomics data are now widespread, and high-throughput approaches are commonplace. Protocols continue to increase in complexity as methods and technologies evolve and diversify. To encourage the standardized collection, integration, storage and dissemination of proteomics data, the Human Proteome Organization's Proteomics Standards Initiative develops guidance modules for reporting the use of techniques such as gel electrophoresis and mass spectrometry. This paper describes the processes and principles underpinning the development of these modules; discusses the ramifications for various interest groups such as experimentalists, funders, publishers and the private sector; addresses the issue of overlap with other reporting guidelines; and highlights the criticality of appropriate tools and resources in enabling 'MIAPE-compliant' reporting.

Keywords: Databases, Protein; Gene Expression Profiling; Genome, Human; Guidelines as Topic; Humans; Information Storage and Retrieval; Internationality; Proteomics; Research
[Strobl2007Bias] C. Strobl, A.L. Boulesteix, A. Zeileis, and T. Hothorn. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8(1):25, 2007. [ bib ]
[Starkuviene2007potential] V. Starkuviene and R. Pepperkok. The potential of high-content high-throughput microscopy in drug discovery. Br. J. Pharmacol., 152(1):62-71, Sep 2007. [ bib | DOI | http ]
Fluorescence microscopy is a powerful method to study protein function in its natural habitat, the living cell. With the availability of the green fluorescent protein and its spectral variants, almost any gene of interest can be fluorescently labelled in living cells opening the possibility to study protein localization, dynamics and interactions. The emergence of automated cellular systems allows rapid visualization of large groups of cells and phenotypic analysis in a quantitative manner. Here, we discuss recent advances in high-content high-throughput microscopy and its potential application to several steps of the drug discovery process.

[Singh2007Pairwise] R. Singh, J. Xu, and B. Berger. Pairwise global alignment of protein interaction networks by matching neighborhood topology. The Proceedings of the 11th International Conference on Research in Computational Molecular Biology (RECOMB), 2007. [ bib ]
[Simonis2007evaluation] Marieke Simonis, Jurgen Kooren, and Wouter de Laat. An evaluation of 3c-based methods to capture dna interactions. Nat Methods, 4(11):895-901, Nov 2007. [ bib | DOI | http ]
The shape of the genome is thought to play an important part in the coordination of transcription and other DNA-metabolic processes. Chromosome conformation capture (3C) technology allows us to analyze the folding of chromatin in the native cellular state at a resolution beyond that provided by current microscopy techniques. It has been used, for example, to demonstrate that regulatory DNA elements communicate with distant target genes through direct physical interactions that loop out the intervening chromatin fiber. Here we discuss the intricacies of 3C and new 3C-based methods including the 4C, 5C and ChIP-loop assay.

Keywords: Animals; Chromatin Immunoprecipitation, methods; Chromatin, chemistry/metabolism; DNA Ligases, chemistry/metabolism; DNA Restriction Enzymes, chemistry/metabolism; DNA, chemistry/genetics/metabolism; Formaldehyde, chemistry; Genetic Techniques; Humans; Reproducibility of Results
[Shock2007Whole-genome] J. L. Shock, K. F. Fischer, and J. L. DeRisi. Whole-genome analysis of mrna decay in plasmodium falciparum reveals a global lengthening of mrna half-life during the intra-erythrocytic development cycle. Genome Biol., 8(7):R134, 2007. [ bib | DOI | http ]
BACKGROUND: The rate of mRNA decay is an essential element of post-transcriptional regulation in all organisms. Previously, studies in several organisms found that the specific half-life of each mRNA is precisely related to its physiologic role, and plays an important role in determining levels of gene expression. RESULTS: We used a genome-wide approach to characterize mRNA decay in Plasmodium falciparum. We found that, globally, rates of mRNA decay increase dramatically during the asexual intra-erythrocytic developmental cycle. During the ring stage of the cycle, the average mRNA half-life was 9.5 min, but this was extended to an average of 65 min during the late schizont stage of development. Thus, a major determinant of mRNA decay rate appears to be linked to the stage of intra-erythrocytic development. Furthermore, we found specific variations in decay patterns superimposed upon the dominant trend of progressive half-life lengthening. These variations in decay pattern were frequently enriched for genes with specific cellular functions or processes. CONCLUSION: Elucidation of Plasmodium mRNA decay rates provides a key element for deciphering mechanisms of genetic control in this parasite, by complementing and extending previous mRNA abundance studies. Our results indicate that progressive stage-dependent decreases in mRNA decay rate function are a major determinant of mRNA accumulation during the schizont stage of intra-erythrocytic development. This type of genome-wide change in mRNA decay rate has not been observed in any other organism to date, and indicates that post-transcriptional regulation may be the dominant mechanism of gene regulation in P. falciparum.

Keywords: plasmodium
[Shah2007Modeling] S.P. Shah, W.L. Lam, R.T. Ng, and K.P. Murphy. Modeling recurrent DNA copy number alterations in array CGH data. Bioinformatics, 23(13):i450-i458, 2007. [ bib | .pdf ]
[Sebat2007Major] Jonathan Sebat. Major changes in our dna lead to major changes in our thinking. Nat. Genet., 39(7 Suppl):S3-S5, Jul 2007. [ bib | DOI | http ]
Variability in the human genome has far exceeded expectations. In the course of the past three years, we have learned that much of our naturally occurring genetic variation consists of large-scale differences in genome structure, including copy-number variants (CNVs) and balanced rearrangements such as inversions. Recent studies have begun to reveal that structural variants are an important contributor to disease risk; however, structural variants as a class may not conform well to expectations of current methods for gene mapping. New approaches are needed to understand the contribution of structural variants to disease.

Keywords: DNA; Gene Dosage; Gene Rearrangement; Genetic Diseases, Inborn; Genetic Variation; Genome, Human; Humans
[Schuster2007Next] S. C. Schuster. Next-generation sequencing transforms today's biology. Nat. Methods, 5:16-18, 2007. [ bib ]
[Saeys2007review] Y. Saeys, I. Inza, and P. Larrañaga. A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19):2507-2517, 2007. [ bib ]
[Rosset2007Piecewise] S. Rosset and J. Zhu. Piecewise linear regularized solution paths. Annals of Statistics, 35(3):1012-1030, 2007. [ bib ]
[Robine2007Genome-wide] N. Robine, N. Uematsu, F. Amiot, X. Gidrol, E. Barillot, A. Nicolas, and V. Borde. Genome-wide redistribution of meiotic double-strand breaks in saccharomyces cerevisiae. Mol. Cell. Biol., 27(5):1868-1880, Mar 2007. [ bib | DOI | http ]
Meiotic recombination is initiated by the formation of programmed DNA double-strand breaks (DSBs) catalyzed by the Spo11 protein. DSBs are not randomly distributed along chromosomes. To better understand factors that control the distribution of DSBs in budding yeast, we have examined the genome-wide binding and cleavage properties of the Gal4 DNA binding domain (Gal4BD)-Spo11 fusion protein. We found that Gal4BD-Spo11 cleaves only a subset of its binding sites, indicating that the association of Spo11 with chromatin is not sufficient for DSB formation. In centromere-associated regions, the centromere itself prevents DSB cleavage by tethered Gal4BD-Spo11 since its displacement restores targeted DSB formation. In addition, we observed that new DSBs introduced by Gal4BD-Spo11 inhibit surrounding DSB formation over long distances (up to 60 kb), keeping constant the number of DSBs per chromosomal region. Together, these results demonstrate that the targeting of Spo11 to new chromosomal locations leads to both local stimulation and genome-wide redistribution of recombination initiation and that some chromosomal regions are inherently cold regardless of the presence of Spo11.

[Rivals2007Enrichment] I. Rivals, L. Personnaz, L. Taing, and M.-C. Potier. Enrichment or depletion of a go category within a class of genes: which test? Bioinformatics, 23(4):401-407, Feb 2007. [ bib | DOI | http | .pdf ]
A number of available program packages determine the significant enrichments and/or depletions of GO categories among a class of genes of interest. Whereas a correct formulation of the problem leads to a single exact null distribution, these GO tools use a large variety of statistical tests whose denominations often do not clarify the underlying P-value computations. We review the different formulations of the problem and the tests they lead to: the binomial, chi2, equality of two probabilities, Fisher's exact and hypergeometric tests. We clarify the relationships existing between these tests, in particular the equivalence between the hypergeometric test and Fisher's exact test. We recall that the other tests are valid only for large samples, the test of equality of two probabilities and the chi2-test being equivalent. We discuss the appropriateness of one- and two-sided P-values, as well as some discreteness and conservatism issues. Supplementary data are available at Bioinformatics online.

[Rhodes2007Oncomine] Daniel R. Rhodes, Shanker Kalyana-Sundaram, Vasudeva Mahavisno, Radhika Varambally, Jianjun Yu, Benjamin B. Briggs, Terrence R. Barrette, Matthew J. Anstet, Colleen Kincead-Beal, Prakash Kulkarni, Sooryanaryana Varambally, Debashis Ghosh, and Arul M. Chinnaiyan. Oncomine 3.0: genes, pathways, and networks in a collection of 18,000 cancer gene expression profiles. Neoplasia, 9(2):166-180, Feb 2007. [ bib ]
DNA microarrays have been widely applied to cancer transcriptome analysis; however, the majority of such data are not easily accessible or comparable. Furthermore, several important analytic approaches have been applied to microarray analysis; however, their application is often limited. To overcome these limitations, we have developed Oncomine, a bioinformatics initiative aimed at collecting, standardizing, analyzing, and delivering cancer transcriptome data to the biomedical research community. Our analysis has identified the genes, pathways, and networks deregulated across 18,000 cancer gene expression microarrays, spanning the majority of cancer types and subtypes. Here, we provide an update on the initiative, describe the database and analysis modules, and highlight several notable observations. Results from this comprehensive analysis are available at http://www.oncomine.org.

Keywords: Antineoplastic Agents, pharmacology; Automatic Data Processing; Chromosome Mapping; Chromosomes, Human, genetics; Computational Biology, organization /&/ administration; Data Collection; Data Display; Data Interpretation, Statistical; Databases, Genetic; Drug Design; Gene Expression Profiling, statistics /&/ numerical data; Gene Expression Regulation, Neoplastic; Genes, Neoplasm; Humans; Internet; Models, Biological; Neoplasm Proteins, biosynthesis/chemistry/genetics; Neoplasms, classification/genetics/metabolism; Oligonucleotide Array Sequence Analysis; Subtraction Technique; Transcription, Genetic
[Rapaport2007Classification] F. Rapaport, A. Zynoviev, M. Dutreix, E. Barillot, and J.-P. Vert. Classification of microarray data using gene networks. BMC Bioinformatics, 8:35, 2007. [ bib ]
[Rakotomamonjy2007More] Alain Rakotomamonjy, Francis Bach, Stéphane Canu, and Yves Grandvalet. More efficiency in multiple kernel learning. In ICML '07: Proceedings of the 24th international conference on Machine learning, pages 775-782, New York, NY, USA, 2007. ACM. [ bib | DOI ]
[Perry2007Diet] George H Perry, Nathaniel J Dominy, Katrina G Claw, Arthur S Lee, Heike Fiegler, Richard Redon, John Werner, Fernando A Villanea, Joanna L Mountain, Rajeev Misra, Nigel P Carter, Charles Lee, and Anne C Stone. Diet and the evolution of human amylase gene copy number variation. Nat Genet, 39(10):1256-1260, Oct 2007. [ bib | DOI | http ]
Starch consumption is a prominent characteristic of agricultural societies and hunter-gatherers in arid environments. In contrast, rainforest and circum-arctic hunter-gatherers and some pastoralists consume much less starch. This behavioral variation raises the possibility that different selective pressures have acted on amylase, the enzyme responsible for starch hydrolysis. We found that copy number of the salivary amylase gene (AMY1) is correlated positively with salivary amylase protein level and that individuals from populations with high-starch diets have, on average, more AMY1 copies than those with traditionally low-starch diets. Comparisons with other loci in a subset of these populations suggest that the extent of AMY1 copy number differentiation is highly unusual. This example of positive selection on a copy number-variable gene is, to our knowledge, one of the first discovered in the human genome. Higher AMY1 copy numbers and protein levels probably improve the digestion of starchy foods and may buffer against the fitness-reducing effects of intestinal disease.

Keywords: Animals; Diet; Evolution, Molecular; Gene Dosage; Genetic Variation; Humans; Starch, metabolism; alpha-Amylases, genetics
[Papadakis2007Efficient] P. Papadakis, I. Pratikakis, S. Perantonis, and T. Theoharis. Efficient 3d shape matching and retrieval using a concrete radialized spherical projection representation. Pattern Recogn., 40(9):2437-2452, 2007. [ bib | DOI ]
[Paik2007Development] Soonmyung Paik. Development and clinical utility of a 21-gene recurrence score prognostic assay in patients with early breast cancer treated with tamoxifen. Oncologist, 12(6):631-635, Jun 2007. [ bib | DOI | http ]
Although patients diagnosed with axillary node-negative estrogen receptor-positive breast cancer have an excellent prognosis, about 15% of them fail after 5 years of tamoxifen treatment. Clinical trials have provided evidence that there is a significant benefit from chemotherapy for these patients, but it would be significant overtreatment if all of them were treated with chemotherapy. Therefore, context-specific prognostic assays that can identify those who need chemotherapy in addition to tamoxifen, or those who are essentially cured by tamoxifen alone, and can be performed using routinely processed tumor biopsy tissue would be clinically useful. Using a stepwise approach of going through independent model-building and validation sets, a 21-gene recurrence score (RS), based on monitoring of mRNA expression levels of 16 cancer-related genes in relation to five reference genes, has been developed. The RS identified approximately 50% of the patients who had excellent prognosis after tamoxifen alone. Subsequent study suggested that high-risk patients identified with the RS preferentially benefit from chemotherapy. Ideally the RS should be used as a continuous variable. A prospective study, the Trial Assigning Individualized Options for Treatment (Rx) (TAILORx), to examine whether chemotherapy is required for the intermediate-risk group defined by the RS, is accruing in North America.

Keywords: Breast Neoplasms, drug therapy/genetics/pathology; Estrogen Antagonists, therapeutic use; Female; Gene Expression Profiling; Gene Expression Regulation, Neoplastic; Genetic Predisposition to Disease; Humans; Neoplasm Recurrence, Local; Prognosis; Risk Factors; Tamoxifen, therapeutic use; Time Factors; Treatment Outcome
[Okuno2007GLIDA] Y. Okuno, A. Tamon, H. Yabuuchi, S. Niijima, Y. Minowa, K. Tonomura, R. Kunimoto, and C. Feng. GLIDA: GPCR ligand database for chemical genomics drug discovery database and tools update. Nucleic Acids Res., 36(Database issue):D907-D912, Nov 2007. [ bib | DOI | http ]
G-protein coupled receptors (GPCRs) represent one of the most important families of drug targets in pharmaceutical development. GLIDA is a public GPCR-related Chemical Genomics database that is primarily focused on the integration of information between GPCRs and their ligands. It provides interaction data between GPCRs and their ligands, along with chemical information on the ligands, as well as biological information regarding GPCRs. These data are connected with each other in a relational database, allowing users in the field of Chemical Genomics research to easily retrieve such information from either biological or chemical starting points. GLIDA includes a variety of similarity search functions for the GPCRs and for their ligands. Thus, GLIDA can provide correlation maps linking the searched homologous GPCRs (or ligands) with their ligands (or GPCRs). By analyzing the correlation patterns between GPCRs and ligands, we can gain more detailed knowledge about their conserved molecular recognition patterns and improve drug design efforts by focusing on inferred candidates for GPCR-specific drugs. This article provides a summary of the GLIDA database and user facilities, and describes recent improvements to database design, data contents, ligand classification programs, similarity search options and graphical interfaces. GLIDA is publicly available at http://pharminfo.pharm.kyoto-u.ac.jp/services/glida/. We hope that it will prove very useful for Chemical Genomics research and GPCR-related drug discovery.

Keywords: chemogenomics
[Okamoto2007Prediction] S. Okamoto, Y. Yamanishi, S. Ehira, S. Kawashima, K. Tonomura, and M. Kanehisa. Prediction of nitrogen metabolism-related genes in anabaena by kernel-based network analysis. Proteomics, 7(6):900-909, Mar 2007. [ bib | DOI | http ]
Prediction of molecular interaction networks from large-scale datasets in genomics and other omics experiments is an important task in terms of both developing bioinformatics methods and solving biological problems. We have applied a kernel-based network inference method for extracting functionally related genes to the response of nitrogen deprivation in cyanobacteria Anabaena sp. PCC 7120 integrating three heterogeneous datasets: microarray data, phylogenetic profiles, and gene orders on the chromosome. We obtained 1348 predicted genes that are somehow related to known genes in the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. While this dataset contained previously known genes related to the nitrogen deprivation condition, it also contained additional genes. Thus, we attempted to select any relevant genes using the constraints of Pfam domains and NtcA-binding sites. We found candidates of nitrogen metabolism-related genes, which are depicted as extensions of existing KEGG pathways. The prediction of functional relationships between proteins rather than functions of individual proteins will thus assist the discovery from the large-scale datasets.

[Obozinski2007Multi-task] G. Obozinski, B. Taskar, and M. I. Jordan. Multi-task feature selection. Technical report, arXiv, 2007. [ bib ]
[Npg2007DNA] Nature Publishing Group. DNA Technologies - Milestones timeline. Nature Milestones, 2007. http://www.nature.com/milestones/miledna/timeline.html. [ bib ]
Keywords: csbcbook, csbcbook-ch2
[Nacu2007Gene] S. Nacu, R. Critchley-Thorne, P. Lee, and S. Holmes. Gene expression network analysis and applications to immunology. Bioinformatics, 23(7):850-858, Apr 2007. [ bib | DOI | http | .pdf ]
We address the problem of using expression data and prior biological knowledge to identify differentially expressed pathways or groups of genes. Following an idea of Ideker et al. (2002), we construct a gene interaction network and search for high-scoring subnetworks. We make several improvements in terms of scoring functions and algorithms, resulting in higher speed and accuracy and easier biological interpretation. We also assign significance levels to our results, adjusted for multiple testing. Our methods are successfully applied to three human microarray data sets, related to cancer and the immune system, retrieving several known and potential pathways. The method, denoted by the acronym GXNA (Gene eXpression Network Analysis) is implemented in software that is publicly available and can be used on virtually any microarray data set. SUPPLEMENTARY INFORMATION: The source code and executable for the software, as well as certain supplemental materials, can be downloaded from http://stat.stanford.edu/~serban/gxna.

[Morris2007Identification] Stephanie A Morris, Bhargavi Rao, Benjamin A Garcia, Sandra B Hake, Robert L Diaz, Jeffrey Shabanowitz, Donald F Hunt, C David Allis, Jason D Lieb, and Brian D Strahl. Identification of histone h3 lysine 36 acetylation as a highly conserved histone modification. J Biol Chem, 282(10):7632-7640, Mar 2007. [ bib | DOI | http ]
Histone lysine acetylation is a major mechanism by which cells regulate the structure and function of chromatin, and new sites of acetylation continue to be discovered. Here we identify and characterize histone H3K36 acetylation (H3K36ac). By mass spectrometric analyses of H3 purified from Tetrahymena thermophila and Saccharomyces cerevisiae (yeast), we find that H3K36 can be acetylated or methylated. Using an antibody specific to H3K36ac, we show that this modification is conserved in mammals. In yeast, genome-wide ChIP-chip experiments show that H3K36ac is localized predominantly to the promoters of RNA polymerase II-transcribed genes, a pattern inversely related to that of H3K36 methylation. The pattern of H3K36ac localization is similar to that of other sites of H3 acetylation, including H3K9ac and H3K14ac. Using histone acetyltransferase complexes purified from yeast, we show that the Gcn5-containing SAGA complex that regulates transcription specifically acetylates H3K36 in vitro. Deletion of GCN5 completely abolishes H3K36ac in vivo. These data expand our knowledge of the genomic targets of Gcn5, show H3K36ac is highly conserved, and raise the intriguing possibility that the transition between H3K36ac and H3K36me acts as an "acetyl/methyl switch" governing chromatin function along transcription units.

Keywords: Acetylation; Amino Acid Sequence; Animals; Chromatin Immunoprecipitation; Conserved Sequence; Histone Acetyltransferases, physiology; Histones, chemistry; Humans; Lysine; Methylation; Mice; Molecular Sequence Data; Promoter Regions, Genetic; Saccharomyces cerevisiae Proteins, physiology; Saccharomyces cerevisiae, chemistry; Tetrahymena, chemistry
[Mitra2007p53] A.P. Mitra, M. Birkhahn, and R.J. Cote. p53 and retinoblastoma pathways in bladder cancer. World J Urol., 25:563-571, 2007. [ bib ]
[Mitelman2007Impact] F. Mitelman, B. Johansson, and F. Mertens. The impact of translocations and gene fusions on cancer causation. Nat. Rev. Cancer, 7:233-245, 2007. [ bib ]
Keywords: csbcbook
[Miller2007Expression] L.D. Miller and E.T. Liu. Expression genomics in breast cancer research: microarrays at the crossroads of biology and medicine. Breast Cancer Res., 9:206, 2007. [ bib | DOI | http | .pdf ]
Genome-wide expression microarray studies have revealed that the biological and clinical heterogeneity of breast cancer can be partly explained by information embedded within a complex but ordered transcriptional architecture. Comprising this architecture are gene expression networks, or signatures, reflecting biochemical and behavioral properties of tumors that might be harnessed to improve disease subtyping, patient prognosis and prediction of therapeutic response. Emerging 'hypothesis-driven' strategies that incorporate knowledge of pathways and other biological phenomena in the signature discovery process are linking prognosis and therapy prediction with transcriptional readouts of tumorigenic mechanisms that better inform therapeutic options.

Keywords: csbcbook, csbcbook-ch3
[Meinshausen2007Discussion] N. Meinshausen, G. Rocha, and B. Yu. Discussion: A tale of three cousins: Lasso, l2boosting and dantzig. Ann. Stat., 35:2373, 2007. [ bib | http ]
[McKusick2007Mendelian] V.A. McKusick. Mendelian inheritance in man and its online version, omim. Am. J. Hum. Genet., 80(4):588-604, Apr 2007. [ bib | DOI | http ]
Keywords: Databases, Genetic; Epigenesis, Genetic; Genetic Predisposition to Disease; Genetic Variation; History, 20th Century; History, 21st Century; Internet; Phenotype; Terminology as Topic
[Markowetz2007Inferring] F. Markowetz and R. Spang. Inferring cellular networks - a review. BMC Bioinformatics, 8(Suppl 6):S5, 2007. [ bib | DOI | http ]
In this review we give an overview of computational and statistical methods to reconstruct cellular networks. Although this area of research is vast and fast developing, we show that most currently used methods can be organized by a few key concepts. The first part of the review deals with conditional independence models including Gaussian graphical models and Bayesian networks. The second part discusses probabilistic and graph-based methods for data from experimental interventions and perturbations.

[Loenning2007Breast] P.E. Lønning. Breast cancer prognostication and prediction: are we making progress? Ann. Oncol., 18 Suppl 8:viii3-viii7, Sep 2007. [ bib | DOI | http | .pdf ]
Currently, much effort is being invested in the identification of new, accurate prognostic and predictive factors in breast cancer. Prognostic factors assess the patient's risk of relapse based on indicators such as intrinsic tumor biology and disease stage at diagnosis, and are traditionally used to identify patients who can be spared unnecessary adjuvant therapy based only on the risk of relapse. Lymph node status and tumor size are accepted as well-defined prognostic factors in breast cancer. Predictive factors, in contrast, determine the responsiveness of a particular tumor to a specific treatment. Despite recent advances in the understanding of breast cancer biology and changing practices in disease management, with the exception of hormone receptor status, which predicts responsiveness to endocrine treatment, no predictive factor for response to systemic therapy in breast cancer is widely accepted. While gene expression studies have provided important new information with regard to tumor biology and prognostication, attempts to identify predictive factors have not been successful so far. This article will focus on recent advances in prognostication and prediction, with emphasis on findings from gene expression profiling studies.

Keywords: csbcbook, csbcbook-ch3
[Lytle2007Target] J Robin Lytle, Therese A Yario, and Joan A Steitz. Target mrnas are repressed as efficiently by microrna-binding sites in the 5' utr as in the 3' utr. Proc Natl Acad Sci U S A, 104(23):9667-9672, Jun 2007. [ bib | DOI | http ]
In animals, microRNAs (miRNAs) bind to the 3' UTRs of their target mRNAs and interfere with translation, although the exact mechanism of inhibition of protein synthesis remains unclear. Functional miRNA-binding sites in the coding regions or 5' UTRs of endogenous mRNAs have not been identified. We studied the effect of introducing miRNA target sites into the 5' UTR of luciferase reporter mRNAs containing internal ribosome entry sites (IRESs), so that potential steric hindrance by a microribonucleoprotein complex would not interfere with the initiation of translation. In human HeLa cells, which express endogenous let-7a miRNA, the translational efficiency of these IRES-containing reporters with 5' let-7 complementary sites from the Caenorhabditis elegans lin-41 3' UTR was repressed. Similarly, the IRES-containing reporters were translationally repressed when human Ago2 was tethered to either the 5' or 3' UTR. Interestingly, the method of DNA transfection affected our ability to observe miRNA-mediated repression. Our results suggest that association with any position on a target mRNA is mechanistically sufficient for a microribonucleoprotein to exert repression of translation at some step downstream of initiation.

Keywords: sirna
[Loi2007Definition] Sherene Loi, Benjamin Haibe-Kains, Christine Desmedt, Françoise Lallemand, Andrew M. Tutt, Cheryl Gillet, Paul Ellis, Adrian Harris, Jonas Bergh, John A. Foekens, Jan G M. Klijn, Denis Larsimont, Marc Buyse, Gianluca Bontempi, Mauro Delorenzi, Martine J. Piccart, and Christos Sotiriou. Definition of clinically distinct molecular subtypes in estrogen receptor-positive breast carcinomas through genomic grade. J Clin Oncol, 25(10):1239-1246, Apr 2007. [ bib | DOI | http ]
A number of microarray studies have reported distinct molecular profiles of breast cancers (BC), such as basal-like, ErbB2-like, and two to three luminal-like subtypes. These were associated with different clinical outcomes. However, although the basal and the ErbB2 subtypes are repeatedly recognized, identification of estrogen receptor (ER)-positive subtypes has been inconsistent. Therefore, refinement of their molecular definition is needed. We have previously reported a gene expression grade index (GGI), which defines histologic grade based on gene expression profiles. Using this algorithm, we assigned ER-positive BC to either high- or low-genomic grade subgroups and compared these with previously reported ER-positive molecular classifications. As further validation, we classified 666 ER-positive samples into subtypes and assessed their clinical outcome. Two ER-positive molecular subgroups (high and low genomic grade) could be defined using the GGI. Despite tracking a single biologic pathway, these were highly comparable to the previously described luminal A and B classification and significantly correlated to the risk groups produced using the 21-gene recurrence score. The two subtypes were associated with statistically distinct clinical outcome in both systemically untreated and tamoxifen-treated populations. The use of genomic grade can identify two clinically distinct ER-positive molecular subtypes in a simple and highly reproducible manner across multiple data sets. This study emphasizes the important role of proliferation-related genes in predicting prognosis in ER-positive BC.

[Liu2007Network-based] M. Liu, A. Liberzon, A.W. Kong, W.R. Lai, P.J. Park, I.S. Kohane, and S. Kasif. Network-based analysis of affected biological processes in type 2 diabetes models. PLoS Genet., 3(6):e96, Jun 2007. [ bib | DOI | http | .pdf ]
Type 2 diabetes mellitus is a complex disorder associated with multiple genetic, epigenetic, developmental, and environmental factors. Animal models of type 2 diabetes differ based on diet, drug treatment, and gene knockouts, and yet all display the clinical hallmarks of hyperglycemia and insulin resistance in peripheral tissue. The recent advances in gene-expression microarray technologies present an unprecedented opportunity to study type 2 diabetes mellitus at a genome-wide scale and across different models. To date, a key challenge has been to identify the biological processes or signaling pathways that play significant roles in the disorder. Here, using a network-based analysis methodology, we identified two sets of genes, associated with insulin signaling and a network of nuclear receptors, which are recurrent in a statistically significant number of diabetes and insulin resistance models and transcriptionally altered across diverse tissue types. We additionally identified a network of protein-protein interactions between members from the two gene sets that may facilitate signaling between them. Taken together, the results illustrate the benefits of integrating high-throughput microarray studies, together with protein-protein interaction networks, in elucidating the underlying biological processes associated with a complex disorder.

[Liu2007Computational] H. Liu and H. Motoda. Computational methods of feature selection. Chapman & Hall/CRC, 2007. [ bib ]
[Li2007JBiosci] C. Li, Q.W. Ge, M. Nakata, H. Matsuno, and S. Miyano. Modelling and simulation of signal transductions in an apoptosis pathway by using timed petri nets. J Biosci, 32(1):113-27, 2007. [ bib ]
This paper first presents basic Petri net components representing molecular interactions and mechanisms of signalling pathways, and introduces a method to construct a Petri net model of a signalling pathway with these components. Then a simulation method of determining the delay time of transitions, by using timed Petri nets - i.e. the time taken in firing of each transition - is proposed based on some simple principles that the number of tokens flowed into a place is equivalent to the number of tokens flowed out. Finally, the availability of proposed method is confirmed by observing signalling transductions in biological pathways through simulation experiments of the apoptosis signalling pathways as an example.

Keywords: csbcbook
[Levy2007Diploid] Samuel Levy, Granger Sutton, Pauline C Ng, Lars Feuk, Aaron L Halpern, Brian P Walenz, Nelson Axelrod, Jiaqi Huang, Ewen F Kirkness, Gennady Denisov, Yuan Lin, Jeffrey R MacDonald, Andy Wing Chun Pang, Mary Shago, Timothy B Stockwell, Alexia Tsiamouri, Vineet Bafna, Vikas Bansal, Saul A Kravitz, Dana A Busam, Karen Y Beeson, Tina C McIntosh, Karin A Remington, Josep F Abril, John Gill, Jon Borman, Yu-Hui Rogers, Marvin E Frazier, Stephen W Scherer, Robert L Strausberg, and J Craig Venter. The diploid genome sequence of an individual human. PLoS Biol, 5(10):e254, Sep 2007. [ bib | DOI | http ]
Presented here is a genome sequence of an individual human. It was produced from approximately 32 million random DNA fragments, sequenced by Sanger dideoxy technology and assembled into 4,528 scaffolds, comprising 2,810 million bases (Mb) of contiguous sequence with approximately 7.5-fold coverage for any given region. We developed a modified version of the Celera assembler to facilitate the identification and comparison of alternate alleles within this individual diploid genome. Comparison of this genome and the National Center for Biotechnology Information human reference assembly revealed more than 4.1 million DNA variants, encompassing 12.3 Mb. These variants (of which 1,288,319 were novel) included 3,213,401 single nucleotide polymorphisms (SNPs), 53,823 block substitutions (2-206 bp), 292,102 heterozygous insertion/deletion events (indels)(1-571 bp), 559,473 homozygous indels (1-82,711 bp), 90 inversions, as well as numerous segmental duplications and copy number variation regions. Non-SNP DNA variation accounts for 22% of all events identified in the donor, however they involve 74% of all variant bases. This suggests an important role for non-SNP genetic alterations in defining the diploid genome structure. Moreover, 44% of genes were heterozygous for one or more variants. Using a novel haplotype assembly strategy, we were able to span 1.5 Gb of genome sequence in segments >200 kb, providing further precision to the diploid nature of the genome. These data depict a definitive molecular portrait of a diploid human genome that provides a starting point for future genome comparisons and enables an era of individualized genomic information.

Keywords: Base Sequence; Chromosome Mapping, instrumentation/methods; Chromosomes, Human; Chromosomes, Human, Y, genetics; Diploidy; Gene Dosage; Genome, Human; Genotype; Haplotypes; Human Genome Project; Humans; INDEL Mutation; In Situ Hybridization, Fluorescence; Male; Microarray Analysis; Middle Aged; Molecular Sequence Data; Pedigree; Phenotype; Polymorphism, Single Nucleotide; Reproducibility of Results; Sequence Analysis, DNA, instrumentation/methods
[Lavrik2007JBC] I.N. Lavrik, A. Golks, D. Riess, M. Bentele, R. Eils, and P.H. Krammer. Analysis of cd95 threshold signaling: triggering of cd95 (fas/apo-1) at low concentrations primarily results in survival signaling. J Biol Chem, 282(18):13664-71, 2007. [ bib ]
Recently we generated a mathematical model (Bentele, M., Lavrik, I., Ulrich, M., Stosser, S., Heermann, D. W., Kalthoff, H., Krammer, P. H., and Eils, R. (2004) J. Cell Biol. 166, 839-851) of signaling in CD95(Fas/APO-1)-mediated apoptosis. Mathematical modeling in combination with experimental data provided new insights into CD95-mediated apoptosis and allowed us to establish a threshold mechanism of life and death. Here, we further assessed the predictability of the model experimentally by a detailed analysis of the threshold behavior of CD95 signaling. Using the model predictions for the mechanism of the threshold behavior we found that the CD95 DISC (death-inducing signaling complex) is formed at the cell membrane upon stimulation with low concentrations of agonistic anti-APO-1 monoclonal antibodies; however, activation of procaspase-8 at the DISC is blocked due to high cellular FLICE-inhibitory protein recruitment into the DISC. Given that death signaling does not occur upon CD95 stimulation at low (threshold) anti-APO-1 concentrations, we also analyzed survival signaling, focusing on mitogen-activated protein kinase activation. Interestingly, we found that mitogen-activated protein kinase activation takes place under threshold conditions. These findings show that triggering of CD95 can signal both life or death, depending on the strength of the stimulus.

Keywords: csbcbook
[Lage2007human] K. Lage, E.O. Karlberg, Z.M. Størling, P.I. Olason, A.G. Pedersen, O. Rigina, A.M. Hinsby, Z. Tümer, F. Pociot, N. Tommerup, Y. Moreau, and S. Brunak. A human phenome-interactome network of protein complexes implicated in genetic disorders. Nat. Biotechnol., 25(3):309-316, Mar 2007. [ bib | DOI | http | .pdf ]
We performed a systematic, large-scale analysis of human protein complexes comprising gene products implicated in many different categories of human disease to create a phenome-interactome network. This was done by integrating quality-controlled interactions of human proteins with a validated, computationally derived phenotype similarity score, permitting identification of previously unknown complexes likely to be associated with disease. Using a phenomic ranking of protein complexes linked to human disease, we developed a Bayesian predictor that in 298 of 669 linkage intervals correctly ranks the known disease-causing protein as the top candidate, and in 870 intervals with no identified disease-causing gene, provides novel candidates implicated in disorders such as retinitis pigmentosa, epithelial ovarian cancer, inflammatory bowel disease, amyotrophic lateral sclerosis, Alzheimer disease, type 2 diabetes and coronary heart disease. Our publicly available draft of protein complexes associated with pathology comprises 506 complexes, which reveal functional relationships between disease-promoting genes that will inform future experimentation.

[Kroemer2007Structure] Romano T Kroemer. Structure-based drug design: docking and scoring. Curr. Protein Pept. Sci., 8(4):312-328, Aug 2007. [ bib ]
This review gives an introduction into ligand - receptor docking and illustrates the basic underlying concepts. An overview of different approaches and algorithms is provided. Although the application of docking and scoring has led to some remarkable successes, there are still some major challenges ahead, which are outlined here as well. Approaches to address some of these challenges and the latest developments in the area are presented. Some aspects of the assessment of docking program performance are discussed. A number of successful applications of structure-based virtual screening are described.

Keywords: Algorithms; Artificial Intelligence; Computational Biology; Computer Simulation; Computer-Aided Design; Drug Design; Imaging, Three-Dimensional; Ligands; Models, Molecular; Protein Binding; Protein Conformation; Software; Structure-Activity Relationship
[Kosorok2007Marginal] M.R. Kosorok and S. Ma. Marginal asymptotics for the "large p, small n" paradigm: With applications to microarray data. Ann. Stat., 35(4):1456-1486, 2007. [ bib | DOI | http ]
The "large p, small n" paradigm arises in microarray studies, image analysis, high throughput molecular screening, astronomy, and in many other high dimensional applications. False discovery rate (FDR) methods are useful for resolving the accompanying multiple testing problems. In cDNA microarray studies, for example, p-values may be computed for each of p genes using data from n arrays, where typically p is in the thousands and n is less than 30. For FDR methods to be valid in identifying differentially expressed genes, the p-values for the nondifferentially expressed genes must simultaneously have uniform distributions marginally. While feasible for permutation p-values, this uniformity is problematic for asymptotic based p-values since the number of p-values involved goes to infinity and intuition suggests that at least some of the p-values should behave erratically. We examine this neglected issue when n is moderately large but p is almost exponentially large relative to n. We show the somewhat surprising result that, under very general dependence structures and for both mean and median tests, the p-values are simultaneously valid. A small simulation study and data analysis are used for illustration.

[Korn2007Cell-based] K. Korn and E. Krausz. Cell-based high-content screening of small-molecule libraries. Curr. Opin. Chem. Biol., 11(5):503-510, Oct 2007. [ bib | DOI | http ]
Advanced microscopy and the corresponding image analysis have been developed in recent years into a powerful tool for studying molecular and morphological events in cells and tissues. Cell-based high-content screening (HCS) is an upcoming methodology for the investigation of cellular processes and their alteration by multiple chemical or genetic perturbations. Multiparametric characterization of responses to such changes can be analyzed using intact live cells as reporter. These disturbances are screened for effects on a variety of molecular and cellular targets, including subcellular localization and redistribution of proteins. In contrast to biochemical screening, they detect the responses within the context of the intercellular structural and functional networks of normal and diseased cells, respectively. As cell-based HCS of small-molecule libraries is applied to identify and characterize new therapeutic lead compounds, large pharmaceutical companies are major drivers of the technology and have already shown image-based screens using more than 100,000 compounds.

[Koren2007Autocorrelation] Amnon Koren, Itay Tirosh, and Naama Barkai. Autocorrelation analysis reveals widespread spatial biases in microarray experiments. BMC Genomics, 8:164, 2007. [ bib | DOI | http ]
BACKGROUND: DNA microarrays provide the ability to interrogate multiple genes in a single experiment and have revolutionized genomic research. However, the microarray technology suffers from various forms of biases and relatively low reproducibility. A particular source of false data has been described, in which non-random placement of gene probes on the microarray surface is associated with spurious correlations between genes. RESULTS: In order to assess the prevalence of this effect and better understand its origins, we applied an autocorrelation analysis of the relationship between chromosomal position and expression level to a database of over 2000 individual yeast microarray experiments. We show that at least 60% of these experiments exhibit spurious chromosomal position-dependent gene correlations, which nonetheless appear in a stochastic manner within each experimental dataset. Using computer simulations, we show that large spatial biases caused in the microarray hybridization step and independently of printing procedures can exclusively account for the observed spurious correlations, in contrast to previous suggestions. Our data suggest that such biases may generate more than 15% false data per experiment. Importantly, spatial biases are expected to occur regardless of microarray design and over a wide range of microarray platforms, organisms and experimental procedures. CONCLUSIONS: Spatial biases comprise a major source of noise in microarray studies; revision of routine experimental practices and normalizations to account for these biases may significantly and comprehensively improve the quality of new as well as existing DNA microarray data.

Keywords: DNA Probes; Diagnostic Errors; Oligonucleotide Array Sequence Analysis, methods/standards; Reproducibility of Results; Research, methods/standards; Yeasts
[Korbel2007Paired-end] Jan O Korbel, Alexander Eckehart Urban, Jason P Affourtit, Brian Godwin, Fabian Grubert, Jan Fredrik Simons, Philip M Kim, Dean Palejev, Nicholas J Carriero, Lei Du, Bruce E Taillon, Zhoutao Chen, Andrea Tanzer, AC Eugenia Saunders, Jianxiang Chi, Fengtang Yang, Nigel P Carter, Matthew E Hurles, Sherman M Weissman, Timothy T Harkins, Mark B Gerstein, Michael Egholm, and Michael Snyder. Paired-end mapping reveals extensive structural variation in the human genome. Science, 318(5849):420-426, Oct 2007. [ bib | DOI | http | .pdf ]
Structural variation of the genome involves kilobase- to megabase-sized deletions, duplications, insertions, inversions, and complex combinations of rearrangements. We introduce high-throughput and massive paired-end mapping (PEM), a large-scale genome-sequencing method to identify structural variants (SVs) approximately 3 kilobases (kb) or larger that combines the rescue and capture of paired ends of 3-kb fragments, massive 454 sequencing, and a computational approach to map DNA reads onto a reference genome. PEM was used to map SVs in an African and in a putatively European individual and identified shared and divergent SVs relative to the reference genome. Overall, we fine-mapped more than 1000 SVs and documented that the number of SVs among humans is much larger than initially hypothesized; many of the SVs potentially affect gene function. The breakpoint junction sequences of more than 200 SVs were determined with a novel pooling strategy and computational analysis. Our analysis provided insights into the mechanisms of SV formation in humans.

Keywords: ngs
[Koh2007interior] K. Koh, S.J. Kim, and S. Boyd. An interior-point method for large-scale l1-regularized logistic regression. Journal of Machine Learning Research, 8(8):1519-1555, 2007. [ bib ]
[Kobilka2007G] B.K. Kobilka. G protein coupled receptor structure and activation. Biochim. Biophys. Acta, 1768(4):794-807, Apr 2007. [ bib | DOI | http ]
G protein coupled receptors (GPCRs) are remarkably versatile signaling molecules. The members of this large family of membrane proteins are activated by a spectrum of structurally diverse ligands, and have been shown to modulate the activity of different signaling pathways in a ligand specific manner. In this manuscript I will review what is known about the structure and mechanism of activation of GPCRs focusing primarily on two model systems, rhodopsin and the β2 adrenoceptor.

Keywords: chemogenomics
[Knight-Yamada-02] K. Knight and K. Yamada. Integer programming decoder for machine translation. Patent US 7,177,792, 2007. Application filed in 2002. [ bib ]
[Kim2007Sparse] Hyunsoo Kim and Haesun Park. Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics, 23(12):1495-1502, Jun 2007. [ bib | DOI | http | .pdf ]
Many practical pattern recognition problems require non-negativity constraints. For example, pixels in digital images and chemical concentrations in bioinformatics are non-negative. Sparse non-negative matrix factorizations (NMFs) are useful when the degree of sparseness in the non-negative basis matrix or the non-negative coefficient matrix in an NMF needs to be controlled in approximating high-dimensional data in a lower dimensional space. In this article, we introduce a novel formulation of sparse NMF and show how the new formulation leads to a convergent sparse NMF algorithm via alternating non-negativity-constrained least squares. We apply our sparse NMF algorithm to cancer-class discovery and gene expression data analysis and offer biological analysis of the results obtained. Our experimental results illustrate that the proposed sparse NMF algorithm often achieves better clustering performance with shorter computing time compared to other existing NMF algorithms. The software is available as supplementary material.

[Kertesz2007role] Michael Kertesz, Nicola Iovino, Ulrich Unnerstall, Ulrike Gaul, and Eran Segal. The role of site accessibility in microrna target recognition. Nat Genet, 39(10):1278-1284, Oct 2007. [ bib | DOI | http | .pdf ]
MicroRNAs are key regulators of gene expression, but the precise mechanisms underlying their interaction with their mRNA targets are still poorly understood. Here, we systematically investigate the role of target-site accessibility, as determined by base-pairing interactions within the mRNA, in microRNA target recognition. We experimentally show that mutations diminishing target accessibility substantially reduce microRNA-mediated translational repression, with effects comparable to those of mutations that disrupt sequence complementarity. We devise a parameter-free model for microRNA-target interaction that computes the difference between the free energy gained from the formation of the microRNA-target duplex and the energetic cost of unpairing the target to make it accessible to the microRNA. This model explains the variability in our experiments, predicts validated targets more accurately than existing algorithms, and shows that genomes accommodate site accessibility by preferentially positioning targets in highly accessible regions. Our study thus demonstrates that target accessibility is a critical factor in microRNA function.

Keywords: sirna
[Kalousis2007Stability] A. Kalousis, J. Prados, and M. Hilario. Stability of feature selection algorithms: a study on high-dimensional spaces. Knowledge and Information Systems, 12(1):95-116, 2007. [ bib ]
[Kahraman2007Shape] A. Kahraman, R.J. Morris, R.A. Laskowski, and J.M. Thornton. Shape variation in protein binding pockets and their ligands. J. Mol. Biol., 368(1):283-301, Apr 2007. [ bib | DOI | http ]
A common assumption about the shape of protein binding pockets is that they are related to the shape of the small ligand molecules that can bind there. But to what extent is that assumption true? Here we use a recently developed shape matching method to compare the shapes of protein binding pockets to the shapes of their ligands. We find that pockets binding the same ligand show greater variation in their shapes than can be accounted for by the conformational variability of the ligand. This suggests that geometrical complementarity in general is not sufficient to drive molecular recognition. Nevertheless, we show when considering only shape and size that a significant proportion of the recognition power of a binding pocket for its ligand resides in its shape. Additionally, we observe a "buffer zone" or a region of free space between the ligand and protein, which results in binding pockets being on average three times larger than the ligand that they bind.

Keywords: Binding Sites; Computer Simulation; Ligands; Models, Molecular; Models, Statistical; Protein Binding; Protein Conformation; Protein Folding
[Johnson2007Adjusting] W Evan Johnson, Cheng Li, and Ariel Rabinovic. Adjusting batch effects in microarray expression data using empirical bayes methods. Biostatistics, 8(1):118-127, Jan 2007. [ bib | DOI | http ]
Non-biological experimental variation or "batch effects" are commonly observed across multiple batches of microarray experiments, often rendering the task of combining data from these batches difficult. The ability to combine microarray data sets is advantageous to researchers to increase statistical power to detect biological phenomena from studies where logistical considerations restrict sample size or in studies that require the sequential hybridization of arrays. In general, it is inappropriate to combine data sets without adjusting for batch effects. Methods have been proposed to filter batch effects from data, but these are often complicated and require large batch sizes ( > 25) to implement. Because the majority of microarray studies are conducted using much smaller sample sizes, existing methods are not sufficient. We propose parametric and non-parametric empirical Bayes frameworks for adjusting data for batch effects that is robust to outliers in small sample sizes and performs comparable to existing methods for large samples. We illustrate our methods using two example data sets and show that our methods are justifiable, easy to apply, and useful in practice. Software for our method is freely available at: http://biosun1.harvard.edu/complab/batch/.

Keywords: Bayes Theorem; Data Interpretation, Statistical; Gene Expression Profiling, methods; Humans; Oligonucleotide Array Sequence Analysis, methods
[Johnson2007Genome-wide] D.S. Johnson, A. Mortazavi, R.M. Myers, and B. Wold. Genome-wide mapping of in vivo protein-dna interactions. Science, 316(5830):1497-1502, Jun 2007. [ bib | DOI | http | .pdf ]
In vivo protein-DNA interactions connect each transcription factor with its direct targets to form a gene network scaffold. To map these protein-DNA interactions comprehensively across entire mammalian genomes, we developed a large-scale chromatin immunoprecipitation assay (ChIPSeq) based on direct ultrahigh-throughput DNA sequencing. This sequence census method was then used to map in vivo binding of the neuron-restrictive silencer factor (NRSF; also known as REST, for repressor element-1 silencing transcription factor) to 1946 locations in the human genome. The data display sharp resolution of binding position [±50 base pairs (bp)], which facilitated our finding motifs and allowed us to identify noncanonical NRSF-binding motifs. These ChIPSeq data also have high sensitivity and specificity [ROC (receiver operator characteristic) area ≥ 0.96] and statistical confidence (P < 10^-4), properties that were important for inferring new candidate interactions. These include key transcription factors in the gene network that regulates pancreatic islet cell development.

[Jacob2007Kernel] L. Jacob and J.-P. Vert. Kernel methods for in silico chemogenomics. Technical Report 0709.3931v1, arXiv, 2007. [ bib | http ]
Keywords: chemogenomics
[Ivanciuc2007Applications] O. Ivanciuc. Applications of support vector machines in chemistry. In K.B. Lipkowitz and T.R. Cundari, editors, Reviews in Computational Chemistry, volume 23, pages 291-400, Weinheim, 2007. Wiley-VCH. [ bib ]
[Hudis2007Trastuzumab] C.A. Hudis. Trastuzumab-mechanism of action and use in clinical practice. N. Engl. J. Med., 357(1):39-51, Jul 2007. [ bib | DOI | http | .pdf ]
Keywords: csbcbook
[Hertz2007Identifying] Tomer Hertz and Chen Yanover. Identifying HLA supertypes by learning distance functions. Bioinformatics, 23(2):e148-e155, Jan 2007. [ bib | DOI | http ]
MOTIVATION: The development of epitope-based vaccines crucially relies on the ability to classify Human Leukocyte Antigen (HLA) molecules into sets that have similar peptide binding specificities, termed supertypes. In their seminal work, Sette and Sidney defined nine HLA class I supertypes and claimed that these provide an almost perfect coverage of the entire repertoire of HLA class I molecules. HLA alleles are highly polymorphic and polygenic and therefore experimentally classifying each of these molecules to supertypes is at present an impossible task. Recently, a number of computational methods have been proposed for this task. These methods are based on defining protein similarity measures, derived from analysis of binding peptides or from analysis of the proteins themselves. RESULTS: In this paper we define both peptide derived and protein derived similarity measures, which are based on learning distance functions. The peptide derived measure is defined using a peptide-peptide distance function, which is learned using information about known binding and non-binding peptides. The protein derived similarity measure is defined using a protein-protein distance function, which is learned using information about alleles previously classified to supertypes by Sette and Sidney (1999). We compare the classification obtained by these two complementary methods to previously suggested classification methods. In general, our results are in excellent agreement with the classifications suggested by Sette and Sidney (1999) and with those reported by Buus et al. (2004). The main important advantage of our proposed distance-based approach is that it makes use of two different and important immunological sources of information: HLA alleles and peptides that are known to bind or not bind to these alleles. Since each of our distance measures is trained using a different source of information, their combination can provide a more confident classification of alleles to supertypes.

[Heckerman2007Leveraging] D. Heckerman, D. Kadie, and J. Listgarten. Leveraging information across HLA alleles/supertypes improves epitope prediction. J. Comput. Biol., 14(6):736-746, 2007. [ bib | DOI | http ]
We present a model for predicting HLA class I restricted CTL epitopes. In contrast to almost all other work in this area, we train a single model on epitopes from all HLA alleles and supertypes, yet retain the ability to make epitope predictions for specific HLA alleles. We are therefore able to leverage data across all HLA alleles and/or their supertypes, automatically learning what information should be shared and also how to combine allele-specific, supertype-specific, and global information in a principled way. We show that this leveraging can improve prediction of epitopes having HLA alleles with known supertypes, and dramatically increases our ability to predict epitopes having alleles which do not fall into any of the known supertypes. Our model, which is based on logistic regression, is simple to implement and understand, is solved by finding a single global maximum, and is more accurate (to our knowledge) than any other model.

[Harchaoui2007Image] Z. Harchaoui and F. Bach. Image classification with segmentation graph kernels. In 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2007), pages 1-8. IEEE Computer Society, 2007. [ bib | DOI | http | .pdf ]
We propose a family of kernels between images, defined as kernels between their respective segmentation graphs. The kernels are based on soft matching of subtree-patterns of the respective graphs, leveraging the natural structure of images while remaining robust to the associated segmentation process uncertainty. Indeed, output from morphological segmentation is often represented by a labelled graph, each vertex corresponding to a segmented region, with edges joining neighboring regions. However, such image representations have mostly remained underused for learning tasks, partly because of the observed instability of the segmentation process and the inherent hardness of inexact graph matching with uncertain graphs. Our kernels count common virtual substructures amongst images, which enables efficient supervised classification of natural images with a support vector machine. Moreover, the kernel machinery allows us to take advantage of recent advances in kernel-based learning: (i) semi-supervised learning reduces the required number of labelled images, while (ii) multiple kernel learning algorithms efficiently select the most relevant similarity measures between images within our family.

Keywords: image
[Guo2007Edge-based] Z. Guo, L. Wang, Y. Li, X. Gong, C. Yao, W. Ma, D. Wang, Y. Li, J. Zhu, M. Zhang, D. Yang, S. Rao, and J. Wan. Edge-based scoring and searching method for identifying condition-responsive protein-protein interaction sub-network. Bioinformatics, 23(16):2121-2128, Aug 2007. [ bib | DOI | http | .pdf ]
Current high-throughput protein-protein interaction (PPI) data do not provide information about the condition(s) under which the interactions occur. Thus, the identification of condition-responsive PPI sub-networks is of great importance for investigating how a living cell adapts to changing environments. In this article, we propose a novel edge-based scoring and searching approach to extract a PPI sub-network responsive to conditions related to some investigated gene expression profiles. Using this approach, what we constructed is a sub-network connected by the selected edges (interactions), instead of only a set of vertices (proteins) as in previous works. Furthermore, we suggest a systematic approach to evaluate the biological relevance of the identified responsive sub-network by its ability of capturing condition-relevant functional modules. We apply the proposed method to analyze a human prostate cancer dataset and a yeast cell cycle dataset. The results demonstrate that the edge-based method is able to efficiently capture relevant protein interaction behaviors under the investigated conditions. Supplementary data are available at Bioinformatics online.

[Guiot2007Morphological] Caterina Guiot, Pier P Delsanto, and Thomas S Deisboeck. Morphological instability and cancer invasion: a 'splashing water drop' analogy. Theor. Biol. Med. Model., 4:4, 2007. [ bib | DOI | http ]
BACKGROUND: Tissue invasion, one of the hallmarks of cancer, is a major clinical problem. Recent studies suggest that the process of invasion is driven at least in part by a set of physical forces that may be susceptible to mathematical modelling which could have practical clinical value. MODEL AND CONCLUSION: We present an analogy between two unrelated instabilities. One is caused by the impact of a drop of water on a solid surface while the other concerns a tumor that develops invasive cellular branches into the surrounding host tissue. In spite of the apparent abstractness of the idea, it yields a very practical result, i.e. an index that predicts tumor invasion based on a few measurable parameters. We discuss its application in the context of experimental data and suggest potential clinical implications.

Keywords: Animals; Biomechanics; Cell Adhesion; Humans; Mathematics; Models, Biological; Neoplasm Invasiveness; Neoplasms, pathology; Surface Tension
[Grimson2007MicroRNA] Andrew Grimson, Kyle Kai-How Farh, Wendy K Johnston, Philip Garrett-Engele, Lee P Lim, and David P Bartel. MicroRNA targeting specificity in mammals: determinants beyond seed pairing. Mol Cell, 27(1):91-105, Jul 2007. [ bib | DOI | http | .pdf ]
Mammalian microRNAs (miRNAs) pair to 3'UTRs of mRNAs to direct their posttranscriptional repression. Important for target recognition are approximately 7 nt sites that match the seed region of the miRNA. However, these seed matches are not always sufficient for repression, indicating that other characteristics help specify targeting. By combining computational and experimental approaches, we uncovered five general features of site context that boost site efficacy: AU-rich nucleotide composition near the site, proximity to sites for coexpressed miRNAs (which leads to cooperative action), proximity to residues pairing to miRNA nucleotides 13-16, positioning within the 3'UTR at least 15 nt from the stop codon, and positioning away from the center of long UTRs. A model combining these context determinants quantitatively predicts site performance both for exogenously added miRNAs and for endogenous miRNA-message interactions. Because it predicts site efficacy without recourse to evolutionary conservation, the model also identifies effective nonconserved sites and siRNA off-targets.

Keywords: sirna
[Garcia2007Organismal] Benjamin A Garcia, Sandra B Hake, Robert L Diaz, Monika Kauer, Stephanie A Morris, Judith Recht, Jeffrey Shabanowitz, Nilamadhab Mishra, Brian D Strahl, C David Allis, and Donald F Hunt. Organismal differences in post-translational modifications in histones H3 and H4. J Biol Chem, 282(10):7641-7655, Mar 2007. [ bib | DOI | http ]
Post-translational modifications (PTMs) of histones play an important role in many cellular processes, notably gene regulation. Using a combination of mass spectrometric and immunobiochemical approaches, we show that the PTM profile of histone H3 differs significantly among the various model organisms examined. Unicellular eukaryotes, such as Saccharomyces cerevisiae (yeast) and Tetrahymena thermophila (Tet), for example, contain more activation than silencing marks as compared with mammalian cells (mouse and human), which are generally enriched in PTMs more often associated with gene silencing. Close examination reveals that many of the better-known modified lysines (Lys) can be either methylated or acetylated and that the overall modification patterns become more complex from unicellular eukaryotes to mammals. Additionally, novel species-specific H3 PTMs from wild-type asynchronously grown cells are also detected by mass spectrometry. Our results suggest that some PTMs are more conserved than previously thought, including H3K9me1 and H4K20me2 in yeast and H3K27me1, -me2, and -me3 in Tet. On histone H4, methylation at Lys-20 showed a similar pattern as H3 methylation at Lys-9, with mammals containing more methylation than the unicellular organisms. Additionally, modification profiles of H4 acetylation were very similar among the organisms examined.

Keywords: Acetylation; Animals; Hela Cells; Histones, chemistry/metabolism; Humans; Methylation; Mice; NIH 3T3 Cells; Protein Processing, Post-Translational; Saccharomyces cerevisiae, metabolism; Species Specificity; Tandem Mass Spectrometry; Tetrahymena, metabolism
[Friedman2007Pathwise] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization. Ann. Appl. Statist., 1(1):302-332, 2007. [ bib | DOI | http | .pdf ]
We consider "one-at-a-time" coordinate-wise descent algorithms for a class of convex optimization problems. An algorithm of this kind has been proposed for the L1-penalized regression (lasso) in the literature, but it seems to have been largely ignored. Indeed, it seems that coordinate-wise algorithms are not often used in convex optimization. We show that this algorithm is very competitive with the well-known LARS (or homotopy) procedure in large lasso problems, and that it can be applied to related methods such as the garotte and elastic net. It turns out that coordinate-wise descent does not work in the "fused lasso", however, so we derive a generalized algorithm that yields the solution in much less time than a standard convex optimizer. Finally, we generalize the procedure to the two-dimensional fused lasso, and demonstrate its performance on some image smoothing problems.
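Annotation: the "one-at-a-time" coordinate-wise update for the lasso mentioned in this abstract can be illustrated with a minimal sketch. This is not the authors' code. It assumes the objective (1/(2n))·||y − Xb||² + lam·||b||₁, and the function names, the plain-list data layout, and the fixed number of sweeps are hypothetical choices made only for illustration.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator: shrink z toward zero by gamma."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for (1/(2n))*||y - Xb||^2 + lam*||b||_1.

    X is a list of n rows, each a list of p floats; y is a list of n floats.
    The fixed sweep count is an illustrative stopping rule (an assumption).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta while beta is all zeros
    # Mean squared norm of each column, used to scale the update.
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the partial residual,
            # i.e. the residual with coordinate j's contribution added back.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j])
                      for i in range(n)) / n
            new_beta_j = soft_threshold(rho, lam) / col_sq[j]
            delta = new_beta_j - beta[j]
            if delta != 0.0:
                # Keep the residual in sync with the updated coefficient.
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta_j
    return beta
```

Each coordinate update has a closed form, so a full sweep costs on the order of n·p operations in this sketch.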

[Faith2007Large-scale] J. J. Faith, B. Hayete, J. T. Thaden, I. Mogno, J. Wierzbowski, G. Cottarel, S. Kasif, J. J. Collins, and T. S. Gardner. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol., 5(1):e8, Jan 2007. [ bib | DOI | http | .pdf ]
Machine learning approaches offer the potential to systematically identify transcriptional regulatory interactions from a compendium of microarray expression profiles. However, experimental validation of the performance of these methods at the genome scale has remained elusive. Here we assess the global performance of four existing classes of inference algorithms using 445 Escherichia coli Affymetrix arrays and 3,216 known E. coli regulatory interactions from RegulonDB. We also developed and applied the context likelihood of relatedness (CLR) algorithm, a novel extension of the relevance networks class of algorithms. CLR demonstrates an average precision gain of 36% relative to the next-best performing algorithm. At a 60% true positive rate, CLR identifies 1,079 regulatory interactions, of which 338 were in the previously known network and 741 were novel predictions. We tested the predicted interactions for three transcription factors with chromatin immunoprecipitation, confirming 21 novel interactions and verifying our RegulonDB-based performance estimates. CLR also identified a regulatory link providing central metabolic control of iron transport, which we confirmed with real-time quantitative PCR. The compendium of expression data compiled in this study, coupled with RegulonDB, provides a valuable model system for further improvement of network inference algorithms using experimental data.

[Esteller2007Cancer] M. Esteller. Cancer epigenomics: DNA methylomes and histone-modification maps. Nat. Rev. Genet., 8(4):286-298, Apr 2007. [ bib | DOI | http | .pdf ]
An altered pattern of epigenetic modifications is central to many common human diseases, including cancer. Many studies have explored the mosaic patterns of DNA methylation and histone modification in cancer cells on a gene-by-gene basis; among their results has been the seminal finding of transcriptional silencing of tumour-suppressor genes by CpG-island-promoter hypermethylation. However, recent technological advances are now allowing cancer epigenetics to be studied genome-wide - an approach that has already begun to provide both biological insight and new avenues for translational research. It is time to 'upgrade' cancer epigenetics research and put together an ambitious plan to tackle the many unanswered questions in this field using epigenomics approaches.

Keywords: csbcbook
[Eissing2007Responsea] T. Eissing, S. Waldherr, F. Allgower, P. Scheurich, and E. Bullinger. Response to bistability in apoptosis: Roles of Bax, Bcl-2, and mitochondrial permeability transition pores. Biophys. J., 92(9):3332-3334, 2007. [ bib | DOI | http ]
Keywords: csbcbook
[Eissing2007Response] T. Eissing, S. Waldherr, F. Allgower, P. Scheurich, and E. Bullinger. Response to bistability in apoptosis: Roles of Bax, Bcl-2, and mitochondrial permeability transition pores. Biophysical Journal, 92(9):3332-3334, 2007. [ bib | DOI | http | .pdf ]
Keywords: csbcbook
[Efroni2007Identification] S. Efroni, C. F. Schaefer, and K. H. Buetow. Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS One, 2(5):e425, 2007. [ bib | DOI | http | .pdf ]
Cancer is recognized to be a family of gene-based diseases whose causes are to be found in disruptions of basic biologic processes. An increasingly deep catalogue of canonical networks details the specific molecular interaction of genes and their products. However, mapping of disease phenotypes to alterations of these networks of interactions is accomplished indirectly and non-systematically. Here we objectively identify pathways associated with malignancy, staging, and outcome in cancer through application of an analytic approach that systematically evaluates differences in the activity and consistency of interactions within canonical biologic processes. Using large collections of publicly accessible genome-wide gene expression, we identify small, common sets of pathways - Trka Receptor, Apoptosis response to DNA Damage, Ceramide, Telomerase, CD40L and Calcineurin - whose differences robustly distinguish diverse tumor types from corresponding normal samples, predict tumor grade, and distinguish phenotypes such as estrogen receptor status and p53 mutation state. Pathways identified through this analysis perform as well or better than phenotypes used in the original studies in predicting cancer outcome. This approach provides a means to use genome-wide characterizations to map key biological processes to important clinical features in disease.

[Dostie2007Chromosome] Josée Dostie, Ye Zhan, and Job Dekker. Chromosome conformation capture carbon copy technology. Curr Protoc Mol Biol, Chapter 21:Unit 21.14, Oct 2007. [ bib | DOI | http ]
Chromosome conformation capture (3C) is used to quantify physical DNA contacts in vivo at high resolution. 3C was first used in yeast to map the spatial chromatin organization of chromosome III, and in higher eukaryotes to demonstrate that genomic DNA elements regulate target genes by physically interacting with them. 3C has been widely adopted for small-scale analysis of functional chromatin interactions along (cis) or between (trans) chromosomes. For larger-scale applications, chromosome conformation capture carbon copy (5C) combines 3C with ligation-mediated amplification (LMA) to simultaneously quantify hundreds of thousands of physical DNA contacts by microarray or ultra-high-throughput DNA sequencing. 5C allows the mapping of extensive networks of physical interactions among large sets of genomic elements throughout the genome. Such networks can provide important biological insights, e.g., by identifying relationships between regulatory elements and their target genes. This unit describes 5C for large-scale analysis of cis- and trans-chromatin interactions in mammalian cells.

Keywords: Chromosomes, Artificial, Bacterial; Chromosomes, chemistry; DNA Primers, metabolism; Molecular Biology, methods; Nucleic Acid Conformation; Oligonucleotide Array Sequence Analysis; Polymerase Chain Reaction; Sequence Analysis, DNA; Templates, Genetic
[Deupi2007Structural] X. Deupi, N. Dölker, M. L. López-Rodríguez, M. Campillo, J. A. Ballesteros, and L. Pardo. Structural models of class A G protein-coupled receptors as a tool for drug design: insights on transmembrane bundle plasticity. Curr. Top. Med. Chem., 7(10):991-998, 2007. [ bib ]
G protein-coupled receptors (GPCRs) interact with an extraordinary diversity of ligands by means of their extracellular domains and/or the extracellular part of the transmembrane (TM) segments. Each receptor subfamily has developed specific sequence motifs to adjust the structural characteristics of its cognate ligands to a common set of conformational rearrangements of the TM segments near the G protein binding domains during the activation process. Thus, GPCRs have fulfilled this adaptation during their evolution by customizing a preserved 7TM scaffold through conformational plasticity. We use this term to describe the structural differences near the binding site crevices among different receptor subfamilies, responsible for the selective recognition of diverse ligands among different receptor subfamilies. By comparing the sequence of rhodopsin at specific key regions of the TM bundle with the sequences of other GPCRs we have found that the extracellular region of TMs 2 and 3 provides a remarkable example of conformational plasticity within Class A GPCRs. Thus, rhodopsin-based molecular models need to include the plasticity of the binding sites among GPCR families, since the "quality" of these homology models is intimately linked with the success in the processes of rational drug-design or virtual screening of chemical databases.

Keywords: chemogenomics
[Deodhar2007framework] Meghana Deodhar and Joydeep Ghosh. A framework for simultaneous co-clustering and learning from complex data. In KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 250-259, New York, NY, USA, 2007. ACM. [ bib | DOI ]
[Bie2007Kernel-based] T. De Bie, L.-C. Tranchevent, L. M. M. van Oeffelen, and Y. Moreau. Kernel-based data fusion for gene prioritization. Bioinformatics, 23(13):i125-i132, Jul 2007. [ bib | DOI | http | .pdf ]
MOTIVATION: Hunting disease genes is a problem of primary importance in biomedical research. Biologists usually approach this problem in two steps: first a set of candidate genes is identified using traditional positional cloning or high-throughput genomics techniques; second, these genes are further investigated and validated in the wet lab, one by one. To speed up discovery and limit the number of costly wet lab experiments, biologists must test the candidate genes starting with the most probable candidates. So far, biologists have relied on literature studies, extensive queries to multiple databases and hunches about expected properties of the disease gene to determine such an ordering. Recently, we have introduced the data mining tool ENDEAVOUR (Aerts et al., 2006), which performs this task automatically by relying on different genome-wide data sources, such as Gene Ontology, literature, microarray, sequence and more. RESULTS: In this article, we present a novel kernel method that operates in the same setting: based on a number of different views on a set of training genes, a prioritization of test genes is obtained. We furthermore provide a thorough learning theoretical analysis of the method's guaranteed performance. Finally, we apply the method to the disease data sets on which ENDEAVOUR (Aerts et al., 2006) has been benchmarked, and report a considerable improvement in empirical performance. AVAILABILITY: The MATLAB code used in the empirical results will be made publicly available.

[Davies2007Poisson] J.R. Davies, R.M. Jackson, K.V. Mardia, and C.C. Taylor. The Poisson index: a new probabilistic model for protein ligand binding site similarity. Bioinformatics, 23(22):3001-3008, Nov 2007. [ bib ]
MOTIVATION: The large-scale comparison of protein-ligand binding sites is problematic, in that measures of structural similarity are difficult to quantify and are not easily understood in terms of statistical similarity that can ultimately be related to structure and function. We present a binding site matching score, the Poisson Index (PI), based upon a well-defined statistical model. PI requires only the number of matching atoms between two sites and the size of the two sites, the same information used by the Tanimoto Index (TI), a comparable and widely used measure for molecular similarity. We apply PI and TI to a previously automatically extracted set of binding sites to determine the robustness and usefulness of both scores. RESULTS: We found that PI outperforms TI; moreover, site similarity is poorly defined for TI at values around the 99.5% confidence level for which PI is well defined. A difference map at this confidence level shows that PI gives much more meaningful information than TI. We show individual examples where TI fails to distinguish either a false or a true site pairing in contrast to PI, which performs much better. TI cannot handle large or small sites very well, or the comparison of large and small sites, in contrast to PI that is shown to be much more robust. Despite the difficulty of determining a biological 'ground truth' for binding site similarity we conclude that PI is a suitable measure of binding site similarity and could form the basis for a binding site classification scheme comparable to existing protein domain classification schema.
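Annotation: the abstract notes that the Tanimoto Index uses only the number of matching atoms and the two site sizes. A minimal sketch of that baseline score follows, using the standard definition matches / (size_a + size_b − matches). The Poisson Index itself is not reproduced here because the abstract does not give its formula. The function name, argument names, and the convention for two empty sites are hypothetical.

```python
def tanimoto_index(n_match, size_a, size_b):
    """Tanimoto similarity from a match count and two set sizes.

    n_match: number of atoms matched between the two sites.
    size_a, size_b: number of atoms in each site.
    """
    if n_match < 0 or n_match > min(size_a, size_b):
        raise ValueError("n_match must lie between 0 and the smaller site size")
    union = size_a + size_b - n_match
    if union == 0:
        return 0.0  # both sites empty: similarity is undefined, 0.0 is an assumed convention
    return n_match / union


# A small site fully contained in a large one still scores low,
# which illustrates the size sensitivity the abstract describes.
small_in_large = tanimoto_index(2, 2, 50)
```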

[Dalton2007Evaluation] James A R Dalton and Richard M Jackson. An evaluation of automated homology modelling methods at low target template sequence similarity. Bioinformatics, 23(15):1901-1908, Aug 2007. [ bib | DOI | http ]
MOTIVATION: There are two main areas of difficulty in homology modelling that are particularly important when sequence identity between target and template falls below 50%: sequence alignment and loop building. These problems become magnified with automatic modelling processes, as there is no human input to correct mistakes. As such we have benchmarked several stand-alone strategies that could be implemented in a workflow for automated high-throughput homology modelling. These include three new sequence-structure alignment programs: 3D-Coffee, Staccato and SAlign, plus five homology modelling programs and their respective loop building methods: Builder, Nest, Modeller, SegMod/ENCAD and Swiss-Model. The SABmark database provided 123 targets with at least five templates from the same SCOP family and sequence identities ≤50%. RESULTS: When using Modeller as the common modelling program, 3D-Coffee outperforms Staccato and SAlign using both multiple templates and the best single template, and across the sequence identity range 20-50%. The mean model RMSD generated from 3D-Coffee using multiple templates is 15 and 28% (or using single templates, 3 and 13%) better than those generated by Staccato and SAlign, respectively. 3D-Coffee gives equivalent modelling accuracy from multiple and single templates, but Staccato and SAlign are more successful with single templates, their quality deteriorating as additional lower sequence identity templates are added. Evaluating the different homology modelling programs, on average Modeller performs marginally better in overall modelling than the others tested. However, on average Nest produces the best loops with an 8% improvement by mean RMSD compared to the loops generated by Builder.

Keywords: Algorithms; Amino Acid Sequence; Computer Simulation; Models, Chemical; Models, Molecular; Molecular Sequence Data; Proteins, chemistry; Reproducibility of Results; Sensitivity and Specificity; Sequence Alignment, methods; Sequence Analysis, Protein, methods; Software; Software Validation
[Daily2007Distinct] J. P. Daily, D. Scanfeld, N. Pochet, K. Le Roch, D. Plouffe, M. Kamal, O. Sarr, S. Mboup, O. Ndir, D. Wypij, K. Levasseur, E. Thomas, P. Tamayo, C. Dong, Y. Zhou, E. S. Lander, D. Ndiaye, D. Wirth, E. A. Winzeler, J. P. Mesirov, and A. Regev. Distinct physiological states of Plasmodium falciparum in malaria-infected patients. Nature, 450(7172):1091-1095, Dec 2007. [ bib | DOI | http | .pdf ]
Infection with the malaria parasite Plasmodium falciparum leads to widely different clinical conditions in children, ranging from mild flu-like symptoms to coma and death. Despite the immense medical implications, the genetic and molecular basis of this diversity remains largely unknown. Studies of in vitro gene expression have found few transcriptional differences between different parasite strains. Here we present a large study of in vivo expression profiles of parasites derived directly from blood samples from infected patients. The in vivo expression profiles define three distinct transcriptional states. The biological basis of these states can be interpreted by comparison with an extensive compendium of expression data in the yeast Saccharomyces cerevisiae. The three states in vivo closely resemble, first, active growth based on glycolytic metabolism, second, a starvation response accompanied by metabolism of alternative carbon sources, and third, an environmental stress response. The glycolytic state is highly similar to the known profile of the ring stage in vitro, but the other states have not been observed in vitro. The results reveal a previously unknown physiological diversity in the in vivo biology of the malaria parasite, in particular evidence for a functional mitochondrion in the asexual-stage parasite, and indicate in vivo and in vitro studies to determine how this variation may affect disease manifestations and treatment.

Keywords: plasmodium
[Cuturi2007kernel] M. Cuturi, J. P. Vert, O. Birkenes, and T. Matsui. A kernel for time series based on global alignment. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2007), volume 2, pages II-413-II-416, 2007. [ bib | DOI | http | .pdf ]
[Cowell2007application] J.K. Cowell and L. Hawthorn. The application of microarray technology to the analysis of the cancer genome. Curr. Mol. Med., 7(1):103-120, Feb 2007. [ bib ]
The identification of genetic events that are involved in the development of human cancer has been facilitated through the development and application of a diverse series of high resolution, high throughput microarray platforms. Essentially there are two types of array; those that carry PCR products from cloned nucleic acids (e.g. cDNA, BACs, cosmids) and those that use oligonucleotides. Each has advantages and disadvantages but it is now possible to survey genome wide DNA copy number abnormalities and expression levels to allow correlations between losses, gains and amplifications in tumor cells with genes that are over- and under-expressed in the same samples. The gene expression arrays that provide estimates of mRNA levels in tumors have given rise to exon-specific arrays that can identify both gene expression levels, alternative splicing events and mRNA processing alterations. Oligonucleotide arrays are also being used to interrogate single nucleotide polymorphisms (SNPs) throughout the genome for linkage and association studies and these have been adapted to quantify copy number abnormalities and loss of heterozygosity events. To identify as yet unknown transcripts tiling arrays across the genome have been developed which can also identify DNA methylation changes and be used to identify DNA-protein interactions using ChIP on Chip protocols. Ultimately DNA sequencing arrays will allow resequencing of chromosome regions and whole genomes. With all of these capabilities becoming routine in genomics laboratories, the idea of a systematic characterization of the sum genetic events that give rise to a cancer cell is rapidly becoming a reality.

Keywords: csbcbook, csbcbook-ch2
[Cour2007Recognizing] T. Cour and J. Shi. Recognizing objects by piecing together the segmentation puzzle. In Proc. IEEE Conference on Computer Vision and Pattern Recognition CVPR '07, pages 1-8, 17-22 June 2007. [ bib | DOI ]
[Chuang2007Network-based] H.-Y. Chuang, E. Lee, Y.-T. Liu, D. Lee, and T. Ideker. Network-based classification of breast cancer metastasis. Mol. Syst. Biol., 3:140, 2007. [ bib | DOI | http | .pdf ]
Mapping the pathways that give rise to metastasis is one of the key challenges of breast cancer research. Recently, several large-scale studies have shed light on this problem through analysis of gene expression profiles to identify markers correlated with metastasis. Here, we apply a protein-network-based approach that identifies markers not as individual genes but as subnetworks extracted from protein interaction databases. The resulting subnetworks provide novel hypotheses for pathways involved in tumor progression. Although genes with known breast cancer mutations are typically not detected through analysis of differential expression, they play a central role in the protein network by interconnecting many differentially expressed genes. We find that the subnetwork markers are more reproducible than individual marker genes selected without network information, and that they achieve higher accuracy in the classification of metastatic versus non-metastatic tumors.

Keywords: breastcancer
[Choi2007Coupled] H.-S. Choi, S. Han, H. Yokota, and K.-H. Cho. Coupled positive feedbacks provoke slow induction plus fast switching in apoptosis. FEBS Letters, 581(14):2684-2690, 2007. [ bib | DOI | http | .pdf ]
Apoptosis is a form of a programmed cell death for multicellular organisms to remove unwanted or damaged cells. This critical choice of cellular fate is an all-or-none process, but its dynamics remains unraveled. The switch-like apoptotic decision has to be reliable, and once a pro-apoptotic fate is determined it requires fast and irreversible execution. One of the key regulators in apoptosis is caspase-3. Interestingly, activated caspase-3 quickly executes apoptosis, but it takes considerable time to activate it. Here, we have analyzed this slow induction plus fast switching mechanism of caspase-3 through mathematical modeling and computational simulation. First, we have shown that two positive feedbacks, composed of caspase-8 and XIAP, are essential for the slow induction plus fast switching behavior of caspase-3. Second, we have found that XIAP in the feedback loops primarily regulates induction time of caspase-3. In many cancer cells activation of caspase-3 is suppressed. Our results suggest that reinforcement of the positive feedback by XIAP, which relieves XIAP-mediated caspase-3 inhibition, might favor a pro-apoptotic cellular fate.

Keywords: csbcbook
[Chin2007High-resolution] S.F. Chin, A.E. Teschendorff, J.C. Marioni, Y. Wang, N.L. Barbosa-Morais, N.P. Thorne, J.L. Costa, S.E. Pinder, M.A. van de Wiel, A.R. Green, I.O. Ellis, P.L. Porter, S. Tavaré, J.D. Brenton, B. Ylstra, and C. Caldas. High-resolution aCGH and expression profiling identifies a novel genomic subtype of ER negative breast cancer. Genome Biol., 8(10):R215, 2007. [ bib | DOI | http | .pdf ]
BACKGROUND: The characterization of copy number alteration patterns in breast cancer requires high-resolution genome-wide profiling of a large panel of tumor specimens. To date, most genome-wide array comparative genomic hybridization studies have used tumor panels of relatively large tumor size and high Nottingham Prognostic Index (NPI) that are not as representative of breast cancer demographics. RESULTS: We performed an oligo-array-based high-resolution analysis of copy number alterations in 171 primary breast tumors of relatively small size and low NPI, which was therefore more representative of breast cancer demographics. Hierarchical clustering over the common regions of alteration identified a novel subtype of high-grade estrogen receptor (ER)-negative breast cancer, characterized by a low genomic instability index. We were able to validate the existence of this genomic subtype in one external breast cancer cohort. Using matched array expression data we also identified the genomic regions showing the strongest coordinate expression changes ('hotspots'). We show that several of these hotspots are located in the phosphatome, kinome and chromatinome, and harbor members of the 122-breast cancer CAN-list. Furthermore, we identify frequently amplified hotspots on 8q22.3 (EDD1, WDSOF1), 8q24.11-13 (THRAP6, DCC1, SQLE, SPG8) and 11q14.1 (NDUFC2, ALG8, USP35) associated with significantly worse prognosis. Amplification of any of these regions identified 37 samples with significantly worse overall survival (hazard ratio (HR) = 2.3 (1.3-1.4) p = 0.003) and time to distant metastasis (HR = 2.6 (1.4-5.1) p = 0.004) independently of NPI. CONCLUSION: We present strong evidence for the existence of a novel subtype of high-grade ER-negative tumors that is characterized by a low genomic instability index. 
We also provide a genome-wide list of common copy number alteration regions in breast cancer that show strong coordinate aberrant expression, and further identify novel frequently amplified regions that correlate with poor prognosis. Many of the genes associated with these regions represent likely novel oncogenes or tumor suppressors.

Keywords: breastcancer, cgh
[Cherezov2007High-resolution] Vadim Cherezov, Daniel M Rosenbaum, Michael A Hanson, Søren G F Rasmussen, Foon Sun Thian, Tong Sun Kobilka, Hee-Jung Choi, Peter Kuhn, William I Weis, Brian K Kobilka, and Raymond C Stevens. High-resolution crystal structure of an engineered human beta2-adrenergic G protein-coupled receptor. Science, 318(5854):1258-1265, Nov 2007. [ bib | DOI | http | .pdf ]
Heterotrimeric guanine nucleotide-binding protein (G protein)-coupled receptors constitute the largest family of eukaryotic signal transduction proteins that communicate across the membrane. We report the crystal structure of a human beta2-adrenergic receptor-T4 lysozyme fusion protein bound to the partial inverse agonist carazolol at 2.4 angstrom resolution. The structure provides a high-resolution view of a human G protein-coupled receptor bound to a diffusible ligand. Ligand-binding site accessibility is enabled by the second extracellular loop, which is held out of the binding cavity by a pair of closely spaced disulfide bridges and a short helical segment within the loop. Cholesterol, a necessary component for crystallization, mediates an intriguing parallel association of receptor molecules in the crystal lattice. Although the location of carazolol in the beta2-adrenergic receptor is very similar to that of retinal in rhodopsin, structural differences in the ligand-binding site and other regions highlight the challenges in using rhodopsin as a template model for this large receptor family.

[Chen2007GPCR] J.-Z. Chen, J. Wang, and X.-Q. Xie. GPCR structure-based virtual screening approach for CB2 antagonist search. J. Chem. Inf. Model., 47(4):1626-1637, 2007. [ bib | DOI | http | .pdf ]
The potential for therapeutic specificity in regulating diseases has made cannabinoid (CB) receptors one of the most important G-protein-coupled receptor (GPCR) targets in search for new drugs. Considering the lack of related 3D experimental structures, we have established a structure-based virtual screening protocol to search for CB2 bioactive antagonists based on the 3D CB2 homology structure model. However, the existing homology-predicted 3D models often deviate from the native structure and therefore may incorrectly bias the in silico design. To overcome this problem, we have developed a 3D testing database query algorithm to examine the constructed 3D CB2 receptor structure model as well as the predicted binding pocket. In the present study, an antagonist-bound CB2 receptor complex model was initially generated using flexible docking simulation and then further optimized by molecular dynamic and mechanical (MD/MM) calculations. The refined 3D structural model of the CB2-ligand complex was then inspected by exploring the interactions between the receptor and ligands in order to predict the potential CB2 binding pocket for its antagonist. The ligand-receptor complex model and the predicted antagonist binding pockets were further processed and validated by FlexX-Pharm docking against a testing compound database that contains known antagonists. Furthermore, a consensus scoring (CScore) function algorithm was established to rank the binding interaction modes of a ligand on the CB2 receptor. Our results indicated that the known antagonists seeded in the testing database can be distinguished from a significant amount of randomly chosen molecules. Our studies demonstrated that the established GPCR structure-based virtual screening approach provided a new strategy with a high potential for in silico identifying novel CB2 antagonist leads based on the homology-generated 3D CB2 structure model.

Keywords: chemogenomics
[Chen2007FEBS] C. Chen, J. Cui, W. Zhang, and P. Shen. Robustness analysis identifies the plausible model of the Bcl-2 apoptotic switch. FEBS Lett, 581(26):5143-50, 2007. [ bib ]
In this paper two competing models of the B-cell lymphoma 2 (Bcl-2) apoptotic switch were contrasted by mathematical modeling and robustness analysis. Since switch-like behaviors are required for models that attempt to explain the all-or-none decisions of apoptosis, ultrasensitivity was employed as a criterion for comparison. Our results show that the direct activation model operates more reliably to achieve a robust switch in cellular conditions. Moreover, by investigating the robustness of other important features of the Bcl-2 apoptotic switch (including low Bax basal activation, inhibitory role of anti-apoptotic proteins and insensitivity to small perturbations) the direct activation model was further supported. In all, we identified the direct activation model as a more plausible explanation for the Bcl-2 apoptotic switch.

Keywords: csbcbook
[Chen2007BiophysJ] C. Chen, J. Cui, H. Lu, R. Wang, S. Zhang, and P. Shen. Modeling of the role of a Bax-activation switch in the mitochondrial apoptosis decision. Biophys J, 92(12):4304-15, 2007. [ bib ]
We performed in silico modeling of the regulatory network of mitochondrial apoptosis through which we examined the role of a Bax-activation switch in governing the mitochondrial apoptosis decision. Two distinct modeling methods were used in this article. One is continuous and deterministic, comprised of a set of ordinary differential equations. The other, carried out in a discrete manner, is based on a cellular automaton, which takes stochastic fluctuations into consideration. We focused on dynamic properties of the mitochondrial apoptosis regulatory network. The roles of Bcl-2 family proteins in cellular responses to apoptotic stimuli were examined. In our simulations, a self-amplification process of Bax-activation is indicated. Further analysis suggests that the core module of Bax-activation is bistable in both deterministic and stochastic models, and this feature is robust to noise and wide ranges of parameter variation. When coupling with Bax-polymerization, it forms a one-way-switch, which governs irreversible behaviors of Bax-activation even with attenuation of apoptotic stimulus. Together with the growing biochemical evidence, we propose a novel molecular switch mechanism embedded in the mitochondrial apoptosis regulatory network and give a plausible explanation for the all-or-none, irreversible character of mitochondrial apoptosis.

Keywords: csbcbook
[Cela2007Quadratic] E. Cela. Quadratic assignment problem library, 2007. [ bib | http ]
[Catapano2007G] L.A. Catapano and H.K. Manji. G protein-coupled receptors in major psychiatric disorders. Biochim. Biophys. Acta, 1768(4):976-993, Apr 2007. [ bib | DOI | http ]
Keywords: chemogenomics
[Candes2007Dantzig] E. Candès and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Stat., 35(6):2313-2351, 2007. [ bib | DOI | http | .pdf ]
Keywords: lasso
[Calvo2007partially] B. Calvo, N. López-Bigas, S.J. Furney, P. Larrañaga, and J.A. Lozano. A partially supervised classification approach to dominant and recessive human disease gene prediction. Comput. Methods Programs Biomed., 85(3):229-237, Mar 2007. [ bib | DOI | http ]
The discovery of the genes involved in genetic diseases is a very important step towards the understanding of the nature of these diseases. In-lab identification is a difficult, time-consuming task, where computational methods can be very useful. In silico identification algorithms can be used as a guide in future studies. Previous works in this topic have not taken into account that no reliable sets of negative examples are available, as it is not possible to ensure that a given gene is not related to any genetic disease. In this paper, this feature of the nature of the problem is considered, and identification is approached as a partially supervised classification problem. In addition, we have performed a more specific method to identify disease genes by classifying, for the first time, genes causing dominant and recessive diseases independently. We base this separation on previous results that show that these two types of genes present differences in their sequence properties. In this paper, we have applied a new model averaging algorithm to the identification of human genes associated with both dominant and recessive Mendelian diseases.

Keywords: Algorithms; Genes, Dominant; Genes, Recessive; Genetic Predisposition to Disease; Humans; Sequence Analysis, DNA; Spain
[Bunea2007Sparsity] F. Bunea, A. Tsybakov, and M. Wegkamp. Sparsity oracle inequalities for the lasso. Electron. J. Statist., 1:169-194, 2007. [ bib | DOI | http | .pdf ]
Keywords: lasso
[Bock2007Effective] Mary Ellen Bock, Claudio Garutti, and Concettina Guerra. Effective labeling of molecular surface points for cavity detection and location of putative binding sites. Comput Syst Bioinformatics Conf, 6:263-274, 2007. [ bib ]
We present a method for detecting and comparing cavities on protein surfaces that is useful for protein binding site recognition. The method is based on a representation of the protein structures by a collection of spin-images and their associated spin-image profiles. Results of the cavity detection procedure are presented for a large set of non-redundant proteins and compared with SURFNET-ConSurf. Our comparison method is used to find a surface region in one cavity of a protein that is geometrically similar to a surface region in the cavity of another protein. Such a finding would be an indication that the two regions likely bind to the same ligand. Our overall approach for cavity detection and comparison is benchmarked on several pairs of known complexes, obtaining a good coverage of the atoms of the binding sites.

Keywords: Binding Sites; Computer Simulation; Models, Chemical; Models, Molecular; Protein Binding; Protein Conformation; Protein Folding; Proteins, chemistry/ultrastructure; Sequence Analysis, Protein, methods; Surface Properties
[Bleakley2007Supervised] K. Bleakley, G. Biau, and J.-P. Vert. Supervised reconstruction of biological networks with local models. Bioinformatics, 23(13):i57-i65, Jul 2007. [ bib | DOI | http | .pdf ]
MOTIVATION: Inference and reconstruction of biological networks from heterogeneous data is currently an active research subject with several important applications in systems biology. The problem has been attacked from many different points of view with varying degrees of success. In particular, predicting new edges with a reasonable false discovery rate is highly demanded for practical applications, but remains extremely challenging due to the sparsity of the networks of interest. RESULTS: While most previous approaches based on the partial knowledge of the network to be inferred build global models to predict new edges over the network, we introduce here a novel method which predicts whether there is an edge from a newly added vertex to each of the vertices of a known network using local models. This involves learning individually a certain subnetwork associated with each vertex of the known network, then using the discovered classification rule associated with only that vertex to predict the edge to the new vertex. Excellent experimental results are shown in the case of metabolic and protein-protein interaction network reconstruction from a variety of genomic data. AVAILABILITY: An implementation of the proposed algorithm is available upon request from the authors. CONTACT: Jean-Philippe.Vert@ensmp.fr.

[Bickel2007Discriminative] Steffen Bickel, Michael Brückner, and Tobias Scheffer. Discriminative learning for differing training and test distributions. In ICML '07: Proceedings of the 24th international conference on Machine learning, pages 81-88. ACM Press, 2007. [ bib ]
[Bartlett2007Sparseness] P.L. Bartlett and A. Tewari. Sparseness vs estimating conditional probabilities: Some asymptotic results. J. Mach. Learn. Res., 8:775-790, 2007. [ bib | .html ]
One of the nice properties of kernel classifiers such as SVMs is that they often produce sparse solutions. However, the decision functions of these classifiers cannot always be used to estimate the conditional probability of the class label. We investigate the relationship between these two properties and show that these are intimately related: sparseness does not occur when the conditional probabilities can be unambiguously estimated. We consider a family of convex loss functions and derive sharp asymptotic results for the fraction of data that becomes support vectors. This enables us to characterize the exact trade-off between sparseness and the ability to estimate conditional probabilities for these loss functions.

Keywords: PUlearning
[Bantscheff2007Quantitative] Marcus Bantscheff, Markus Schirle, Gavain Sweetman, Jens Rick, and Bernhard Kuster. Quantitative mass spectrometry in proteomics: a critical review. Anal Bioanal Chem, 389(4):1017-1031, Oct 2007. [ bib | DOI | http ]
The quantification of differences between two or more physiological states of a biological system is among the most important but also most challenging technical tasks in proteomics. In addition to the classical methods of differential protein gel or blot staining by dyes and fluorophores, mass-spectrometry-based quantification methods have gained increasing popularity over the past five years. Most of these methods employ differential stable isotope labeling to create a specific mass tag that can be recognized by a mass spectrometer and at the same time provide the basis for quantification. These mass tags can be introduced into proteins or peptides (i) metabolically, (ii) by chemical means, (iii) enzymatically, or (iv) provided by spiked synthetic peptide standards. In contrast, label-free quantification approaches aim to correlate the mass spectrometric signal of intact proteolytic peptides or the number of peptide sequencing events with the relative or absolute protein quantity directly. In this review, we critically examine the more commonly used quantitative mass spectrometry methods for their individual merits and discuss challenges in arriving at meaningful interpretations of quantitative proteomic data.

Keywords: Automatic Data Processing; Isotope Labeling; Mass Spectrometry; Peptides; Proteins; Proteome; Proteomics; Reference Standards
[Bansal2007How] M. Bansal, V. Belcastro, A. Ambesi-Impiombato, and D. di Bernardo. How to infer gene networks from expression profiles. Mol. Syst. Biol., 3:78, 2007. [ bib | DOI | http | .pdf ]
Inferring, or 'reverse-engineering', gene networks can be defined as the process of identifying gene interactions from experimental data through computational analysis. Gene expression data from microarrays are typically used for this purpose. Here we compared different reverse-engineering algorithms for which ready-to-use software was available and that had been tested on experimental data sets. We show that reverse-engineering algorithms are indeed able to correctly infer regulatory interactions among genes, at least when one performs perturbation experiments complying with the algorithm requirements. These algorithms are superior to classic clustering algorithms for the purpose of finding regulatory interactions among genes, and, although further improvements are needed, have reached a fair level of performance for being practically useful.

[Azencott2007One] C.-A. Azencott, A. Ksikes, S.J. Swamidass, J.H. Chen, L. Ralaivola, and P. Baldi. One- to four-dimensional kernels for virtual screening and the prediction of physical, chemical, and biological properties. J. Chem. Inform. Model., 47(3):965-974, 2007. [ bib | DOI | http | .pdf ]
Many chemoinformatics applications, including high-throughput virtual screening, benefit from being able to rapidly predict the physical, chemical, and biological properties of small molecules to screen large repositories and identify suitable candidates. When training sets are available, machine learning methods provide an effective alternative to ab initio methods for these predictions. Here, we leverage rich molecular representations including 1D SMILES strings, 2D graphs of bonds, and 3D coordinates to derive efficient machine learning kernels to address regression problems. We further expand the library of available spectral kernels for small molecules developed for classification problems to include 2.5D surface and 3D kernels using Delaunay tetrahedrization and other techniques from computational geometry, 3D pharmacophore kernels, and 3.5D or 4D kernels capable of taking into account multiple molecular configurations, such as conformers. The kernels are comprehensively tested using cross-validation and redundancy-reduction methods on regression problems using several available data sets to predict boiling points, melting points, aqueous solubility, octanol/water partition coefficients, and biological activity with state-of-the-art results. When sufficient training data are available, 2D spectral kernels in general tend to yield the best and most robust results, better than state-of-the-art. On data sets containing thousands of molecules, the kernels achieve a squared correlation coefficient of 0.91 for aqueous solubility prediction and 0.94 for octanol/water partition coefficient prediction. Averaging over conformations improves the performance of kernels based on the three-dimensional structure of molecules, especially on challenging data sets. Kernel predictors for aqueous solubility (kSOL), LogP (kLOGP), and melting point (kMELT) are available over the Web through: http://cdb.ics.uci.edu.

[Avlani2007Critical] V.A. Avlani, K.J. Gregory, C.J. Morton, M.W. Parker, P.M. Sexton, and A. Christopoulos. Critical role for the second extracellular loop in the binding of both orthosteric and allosteric G protein-coupled receptor ligands. J. Biol. Chem., 282(35):25677-25686, Aug 2007. [ bib | DOI | http ]
The second extracellular (E2) loop of G protein-coupled receptors (GPCRs) plays an essential but poorly understood role in the binding of non-peptidic small molecules. We have utilized both orthosteric ligands and allosteric modulators of the M2 muscarinic acetylcholine receptor, a prototypical Family A GPCR, to probe possible E2 loop binding dynamics. We developed a homology model based on the crystal structure of bovine rhodopsin and predicted novel cysteine substitutions that should dramatically reduce E2 loop flexibility via disulfide bond formation and significantly inhibit the binding of both types of ligands. This prediction was validated experimentally using radioligand binding, dissociation kinetics, and cell-based functional assays. The results argue for a flexible "gatekeeper" role of the E2 loop in the binding of both allosteric and orthosteric GPCR ligands.

Keywords: chemogenomics
[Argyriou2007Multi-task] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Adv. Neural. Inform. Process Syst. 19, pages 41-48, Cambridge, MA, 2007. MIT Press. [ bib ]
[Applegate06Traveling] D.L. Applegate, R.E. Bixby, V. Chvatal, and W.J. Cook. The Traveling Salesman Problem: A Computational Study (Princeton Series in Applied Mathematics). Princeton University Press, January 2007. [ bib ]
This book presents the latest findings on one of the most intensely investigated subjects in computational mathematics: the traveling salesman problem. It sounds simple enough: given a set of cities and the cost of travel between each pair of them, the problem challenges you to find the cheapest route by which to visit all the cities and return home to where you began. Though seemingly modest, this exercise has inspired studies by mathematicians, chemists, and physicists. Teachers use it in the classroom. It has practical applications in genetics, telecommunications, and neuroscience. The authors of this book are the same pioneers who for nearly two decades have led the investigation into the traveling salesman problem. They have derived solutions to almost eighty-six thousand cities, yet a general solution to the problem has yet to be discovered. Here they describe the method and computer code they used to solve a broad range of large-scale problems, and along the way they demonstrate the interplay of applied mathematics with increasingly powerful computing platforms. They also give the fascinating history of the problem: how it developed, and why it continues to intrigue us.

[Aouba2007Les] A. Aouba, F. Péquignot, A. Le Toullec, and E. Jougla. Les causes médicales de décès en France en 2004 et leur évolution 1980-2004. Bulletin épidémiologique hebdomadaire, 35-36:308-314, 2007. [ bib | .pdf | .pdf ]
Keywords: csbcbook
[Andersson2007Multivariate] C.D. Andersson, E. Thysell, A. Lindström, M. Bylesjö, F. Raubacher, and A. Linusson. A multivariate approach to investigate docking parameters' effects on docking performance. J. Chem. Inform. Model., 47(4):1673-1687, 2007. [ bib | DOI | http ]
Increasingly powerful docking programs for analyzing and estimating the strength of protein-ligand interactions have been developed in recent decades, and they are now valuable tools in drug discovery. Software used to perform dockings relies on a number of parameters that affect various steps in the docking procedure. However, identifying the best choices of the settings for these parameters is often challenging. Therefore, the settings of the parameters are quite often left at their default values, even though scientists with long experience with a specific docking tool know that modifying certain parameters can improve the results. In the study presented here, we have used statistical experimental design and subsequent regression based on root-mean-square deviation values using partial least-square projections to latent structures (PLS) to scrutinize the effects of different parameters on the docking performance of two software packages: FRED and GOLD. Protein-ligand complexes with a high level of ligand diversity were selected from the PDBbind database for the study, using principal component analysis based on 1D and 2D descriptors, and space-filling design. The PLS models showed quantitative relationships between the docking parameters and the ability of the programs to reproduce the ligand crystallographic conformation. The PLS models also revealed which of the parameters and what parameter settings were important for the docking performance of the two programs. Furthermore, the variation in docking results obtained with specific parameter settings for different protein-ligand complexes in the diverse set examined indicates that there is great potential for optimizing the parameter settings for selected sets of proteins.

[Amit2007Uncovering] Yonatan Amit, Michael Fink, Nathan Srebro, and Shimon Ullman. Uncovering shared structures in multiclass classification. In ICML '07: Proceedings of the 24th international conference on Machine learning, pages 17-24, New York, NY, USA, 2007. ACM. [ bib | DOI ]
[Ameres2007Molecular] Stefan Ludwig Ameres, Javier Martinez, and Renée Schroeder. Molecular basis for target RNA recognition and cleavage by human RISC. Cell, 130(1):101-112, Jul 2007. [ bib | DOI | http | .pdf ]
The RNA-Induced Silencing Complex (RISC) is a ribonucleoprotein particle composed of a single-stranded short interfering RNA (siRNA) and an endonucleolytically active Argonaute protein, capable of cleaving mRNAs complementary to the siRNA. The mechanism by which RISC cleaves a target RNA is well understood, however it remains enigmatic how RISC finds its target RNA. Here, we show, both in vitro and in vivo, that the accessibility of the target site correlates directly with the efficiency of cleavage, demonstrating that RISC is unable to unfold structured RNA. In the course of target recognition, RISC transiently contacts single-stranded RNA nonspecifically and promotes siRNA-target RNA annealing. Furthermore, the 5' part of the siRNA within RISC creates a thermodynamic threshold that determines the stable association of RISC and the target RNA. We therefore provide mechanistic insights by revealing features of RISC and target RNAs that are crucial to achieve efficiency and specificity in RNA interference.

Keywords: sirna
[Alon2007Network] U. Alon. Network motifs: theory and experimental approaches. Nat Rev Genet, 8(6):450-461, Jun 2007. [ bib | DOI | http | .pdf ]
Transcription regulation networks control the expression of genes. The transcription networks of well-studied microorganisms appear to be made up of a small set of recurring regulation patterns, called network motifs. The same network motifs have recently been found in diverse organisms from bacteria to humans, suggesting that they serve as basic building blocks of transcription networks. Here I review network motifs and their functions, with an emphasis on experimental studies. Network motifs in other biological networks are also mentioned, including signalling and neuronal networks.

[Aldea2007Image] E. Aldea, J. Atif, and I. Bloch. Image classification using marginalized kernels for graphs. In Graph-Based Representations in Pattern Recognition, volume 4538/2007 of Lecture Notes in Computer Science, pages 103-113. Springer Berlin / Heidelberg, 2007. [ bib | DOI | http | .pdf ]
We propose in this article an image classification technique based on kernel methods and graphs. Our work explores the possibility of applying marginalized kernels to image processing. In machine learning, performant algorithms have been developed for data organized as real valued arrays; these algorithms are used for various purposes like classification or regression. However, they are inappropriate for direct use on complex data sets. Our work consists of two distinct parts. In the first one we model the images by graphs to be able to represent their structural properties and inherent attributes. In the second one, we use kernel functions to project the graphs in a mathematical space that allows the use of performant classification algorithms. Experiments are performed on medical images acquired with various modalities and concerning different parts of the body.

Keywords: image
[Aguda2007PLOSCompBiol] B.D. Aguda and A.B. Goryachev. From pathways databases to network models of switching behavior. PLoS Comput Biol, 3(9):1674-8, 2007. [ bib ]
Keywords: csbcbook
[Aspremont2007Direct] A. d'Aspremont, L. El Ghaoui, M.I. Jordan, and G.R.G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434-448, 2007. [ bib | DOI | http ]
[Bonilla2007Kernel] Edwin V. Bonilla, Felix V. Agakov, and Christopher K. I. Williams. Kernel multi-task learning using task-specific features. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics. Omnipress, March 2007. [ bib ]
[Yamanishi2007Glycan] Y. Yamanishi, F. Bach, and J.-P. Vert. Glycan classification with tree kernels. Bioinformatics, 23(10):1211-1216, May 2007. [ bib | DOI | http ]
MOTIVATION: Glycans are covalent assemblies of sugar that play crucial roles in many cellular processes. Recently, comprehensive data about the structure and function of glycans have been accumulated; the need for methods and algorithms to analyze these data is therefore growing fast. RESULTS: This article presents novel methods for classifying glycans and detecting discriminative glycan motifs with support vector machines (SVM). We propose a new class of tree kernels to measure the similarity between glycans. These kernels are based on the comparison of tree substructures, and take into account several glycan features such as the sugar type, the sugar bound type or layer depth. The proposed methods are tested on their ability to classify human glycans into four blood components: leukemia cells, erythrocytes, plasma and serum. They are shown to outperform a previously published method. We also applied a feature selection approach to extract glycan motifs which are characteristic of each blood component. We confirmed that some leukemia-specific glycan motifs detected by our method corresponded to several results in the literature. AVAILABILITY: Software is available upon request. SUPPLEMENTARY INFORMATION: Datasets are available at the following website: http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/glycankernel/

[Rognan2007Chemogenomic] D. Rognan. Chemogenomic approaches to rational drug design. Br. J. Pharmacol., 152:38-52, May 2007. [ bib | DOI | http ]
Paradigms in drug design and discovery are changing at a significant pace. Concomitant to the sequencing of over 180 genomes, the high-throughput miniaturization of chemical synthesis and biological evaluation of multiple compounds on gene/protein expression and function opens the way to global drug-discovery approaches, no longer focused on a single target but on an entire family of related proteins or on a full metabolic pathway. Chemogenomics is this emerging research field aimed at systematically studying the biological effect of a wide array of small molecular-weight ligands on a wide array of macromolecular targets. Since the quantity of existing data (compounds, targets and assays) and of produced information (gene/protein expression levels and binding constants) is too large for manual manipulation, information technologies play a crucial role in planning, analysing and predicting chemogenomic data. The present review will focus on predictive in silico chemogenomic approaches to foster rational drug design and derive information from the simultaneous biological evaluation of multiple compounds on multiple targets. State-of-the-art methods for navigating in either ligand or target space will be presented and concrete drug design applications will be mentioned. British Journal of Pharmacology advance online publication, 29 May 2007; doi:10.1038/sj.bjp.0707307.

[Qiu2007structural] J. Qiu, J. Hue, A. Ben-Hur, J.-P. Vert, and W. S. Noble. A structural alignment kernel for protein structures. Bioinformatics, 23(9):1090-1098, May 2007. [ bib | DOI | http ]
MOTIVATION: This work aims to develop computational methods to annotate protein structures in an automated fashion. We employ a support vector machine (SVM) classifier to map from a given class of structures to their corresponding structural (SCOP) or functional (Gene Ontology) annotation. In particular, we build upon recent work describing various kernels for protein structures, where a kernel is a similarity function that the classifier uses to compare pairs of structures. RESULTS: We describe a kernel that is derived in a straightforward fashion from an existing structural alignment program, MAMMOTH. We find in our benchmark experiments that this kernel significantly out-performs a variety of other kernels, including several previously described kernels. Furthermore, in both benchmarks, classifying structures using MAMMOTH alone does not work as well as using an SVM with the MAMMOTH kernel. AVAILABILITY: http://noble.gs.washington.edu/proj/3dkernel

Keywords: Algorithms; Amino Acid Sequence; Artificial Intelligence; Molecular Sequence Data; Pattern Recognition, Automated; Proteins; Sequence Alignment; Sequence Analysis, Protein; Sequence Homology, Amino Acid
[Liben-Nowell2007link-prediction] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol., 58(7):1019-1031, May 2007. [ bib | DOI | http ]
[Klabunde2007Chemogenomic] T. Klabunde. Chemogenomic approaches to drug discovery: similar receptors bind similar ligands. Br. J. Pharmacol., 152:5-7, May 2007. [ bib | DOI | http ]
Within recent years, a paradigm shift from traditional receptor-specific studies to a cross-receptor view has taken place within pharmaceutical research to increase the efficiency of modern drug discovery. Receptors are no longer viewed as single entities but grouped into sets of related proteins or receptor families that are explored in a systematic manner. This interdisciplinary approach attempting to derive predictive links between the chemical structures of bioactive molecules and the receptors with which these molecules interact is referred to as chemogenomics. Insights from chemogenomics are used for the rational compilation of screening sets and for the rational design and synthesis of directed chemical libraries to accelerate drug discovery. British Journal of Pharmacology advance online publication, 29 May 2007; doi:10.1038/sj.bjp.0707308.

Keywords: chemogenomics
[Jin2007yeast] Fulai Jin, Larisa Avramova, Jing Huang, and Tony Hazbun. A yeast two-hybrid smart-pool-array system for protein-interaction mapping. Nat Methods, 4(5):405-407, May 2007. [ bib | DOI | http ]
We present here a new two-hybrid smart pool array (SPA) system in which, instead of individual activation domain strains, well-designed activation domain pools are screened in an array format that allows built-in replication and prey-bait deconvolution. Using this method, a Saccharomyces cerevisiae genome SPA increases yeast two-hybrid screening efficiency by an order of magnitude.

Keywords: Genome, Fungal; Protein Interaction Mapping; Saccharomyces cerevisiae; Saccharomyces cerevisiae Proteins; Two-Hybrid System Techniques
[Goh2007human] K.-I. Goh, M. E. Cusick, D. Valle, B. Childs, M. Vidal, and A.-L. Barabási. The human disease network. Proc. Natl. Acad. Sci. U. S. A., 104(21):8685-8690, May 2007. [ bib | DOI | http | .pdf ]
A network of disorders and disease genes linked by known disorder-gene associations offers a platform to explore in a single graph-theoretic framework all known phenotype and disease gene associations, indicating the common genetic origin of many diseases. Genes associated with similar disorders show both higher likelihood of physical interactions between their products and higher expression profiling similarity for their transcripts, supporting the existence of distinct disease-specific functional modules. We find that essential human genes are likely to encode hub proteins and are expressed widely in most tissues. This suggests that disease genes also would play a central role in the human interactome. In contrast, we find that the vast majority of disease genes are nonessential and show no tendency to encode hub proteins, and their expression pattern indicates that they are localized in the functional periphery of the network. A selection-based model explains the observed difference between essential and disease genes and also suggests that diseases caused by somatic mutations should not be peripheral, a prediction we confirm for cancer genes.

[Fredholm2007G-protein-coupled] B. B. Fredholm, T. Hökfelt, and G. Milligan. G-protein-coupled receptors: an update. Acta Physiol., 190(1):3-7, May 2007. [ bib | DOI | http ]
The receptors that couple to G proteins (GPCR) and which span the cell membranes seven times (7-TM receptors) were the focus of a symposium in Stockholm 2006. The ensemble of GPCR has now been mapped in several animal species. They remain a major focus of interest in drug development, and their diverse physiological and pathophysiological roles are being clarified, i.a. by genetic targeting. Recent developments hint at novel levels of complexity. First, many, if not all, GPCRs are part of multimeric ensembles, and physiology and pharmacology of a given GPCR may be at least partly guided by the partners it was formed together with. Secondly, at least some GPCRs may be constitutively active. Therefore, drugs that are inverse agonists may prove useful. Furthermore, the level of activity may vary in such a profound way between cells and tissues that this could offer new ways of achieving specificity of drug action. Finally, it is becoming increasingly clear that some of these receptors can signal via novel types of pathways, and hence that 'GPCRs' may not always be G-protein-coupled. Thus there are many challenges for the basic scientist and the drug industry.

Keywords: chemogenomics
[Davies2007Harnessing] Matthew N Davies and Darren R Flower. Harnessing bioinformatics to discover new vaccines. Drug Discov Today, 12(9-10):389-395, May 2007. [ bib | DOI | http ]
Vaccine design is highly suited to the application of in silico techniques, for both the discovery and development of new and existing vaccines. Here, we discuss computational contributions to epitope mapping and reverse vaccinology, two techniques central to the new discipline of immunomics. Also discussed are methods to improve the efficiency of vaccination, such as codon optimization and adjuvant discovery in addition to the identification of allergenic proteins. We also review current software developed to facilitate vaccine design.

Keywords: Animals; Computational Biology; Drug Design; Epitope Mapping; Humans; Software Design; Vaccination; Vaccines
[Coe2007Resolving] B. P. Coe, B. Ylstra, B. Carvalho, G. A. Meijer, C. Macaulay, and W. L. Lam. Resolving the resolution of array CGH. Genomics, 89(5):647-653, May 2007. [ bib | DOI | http | .pdf ]
Many recent technologies have been designed to supplant conventional metaphase CGH technology with the goal of refining the description of segmental copy number status throughout the genome. However, the emergence of new technologies has led to confusion as to how to describe adequately the capabilities of each array platform. The design of a CGH array can incorporate a uniform or a highly variable element distribution. This can lead to bias in the reporting of average or median resolutions, making it difficult to provide a fair comparison of platforms. In this report, we propose a new definition of resolution for array CGH technology, termed "functional resolution," that incorporates the uniformity of element spacing on the array, as well as the sensitivity of each platform to single-copy alterations. Calculation of these metrics is automated through the development of a Java-based application, "ResCalc," which is applicable to any array CGH platform.

Keywords: csbcbook, csbcbook-ch2
[Xue2007The] Ya Xue, David Dunson, and Lawrence Carin. The matrix stick-breaking process for flexible multi-task learning. In ICML '07: Proceedings of the 24th international conference on Machine learning, pages 1063-1070, New York, NY, USA, June 2007. ACM. [ bib | DOI ]
[Leordeanu2007Beyond] M. Leordeanu, M. Hebert, and R. Sukthankar. Beyond local appearance: Category recognition from pairwise interactions of simple features. In Proceedings of CVPR, June 2007. [ bib ]
[Neuhaus2007Bridging] M. Neuhaus and H. Bunke. Bridging the Gap Between Graph Edit Distance and Kernel Machines. World Scientific, September 2007. [ bib ]
[VonLuxburg2007A] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395-416, December 2007. [ bib | DOI | http ]
In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. At first glance spectral clustering appears slightly mysterious, and it is not obvious why it works at all and what it really does. The goal of this tutorial is to give some intuition on those questions. We describe different graph Laplacians and their basic properties, present the most common spectral clustering algorithms, and derive those algorithms from scratch by several different approaches. Advantages and disadvantages of the different spectral clustering algorithms are discussed.

[Cohen2008Machine] William W. Cohen, Andrew McCallum, and Sam T. Roweis, editors. Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), Helsinki, Finland, June 5-9, 2008, volume 307 of ACM International Conference Proceeding Series. ACM, 2008. [ bib ]
[Zucknick2008Comparing] M. Zucknick, S. Richardson, E.A. Stronach, et al. Comparing the characteristics of gene expression profiles derived by univariate and multivariate classification methods. Stat Appl Genet Mol Biol, 7(1):7, 2008. [ bib ]
[Zinovyev2008Bioinformatics] A. Zinovyev, E. Viara, L. Calzone, and E. Barillot. BiNoM: a Cytoscape plugin for manipulating and analyzing biological networks. Bioinformatics, 24(6):876-877, 2008. [ bib | DOI | arXiv | http ]
BiNoM (Biological Network Manager) is a new bioinformatics software that significantly facilitates the usage and the analysis of biological networks in standard systems biology formats (SBML, SBGN, BioPAX). BiNoM implements a full-featured BioPAX editor and a method of 'interfaces' for accessing BioPAX content. BiNoM is able to work with huge BioPAX files such as whole pathway databases. In addition, BiNoM allows the analysis of networks created with CellDesigner software and their conversion into BioPAX format. BiNoM comes as a library and as a Cytoscape plugin which adds a rich set of operations to Cytoscape such as path and cycle analysis, clustering sub-networks, decomposition of network into modules, clipboard operations and others. Availability: Last version of BiNoM distributed under the LGPL licence together with documentation, source code and API are available at http://bioinfo.curie.fr/projects/binom Contact: andrei.zinovyev@curie.fr

Keywords: csbcbook
[Zhang2008Machine] Y. Zhang and J.C. Rajapakse. Machine learning in bioinformatics, volume 4. Wiley, 2008. [ bib ]
[Zhang2008Variable] H. H. Zhang, Y. Liu, Y. Wu, and J. Zhu. Variable selection for multicategory SVM via adaptive sup-norm regularization. Electronic Journal of Statistics, 2:149-167, 2008. [ bib | DOI | http ]
Keywords: lasso
[Zaslavskiy2008patha] M. Zaslavskiy, F. Bach, and J.-P. Vert. A path following algorithm for the graph matching problem. Technical Report 00232851, HAL, 2008. To appear in IEEE Trans. Pattern Anal. Mach. Intell. [ bib | http ]
We propose a convex-concave programming approach for the labeled weighted graph matching problem. The convex-concave programming formulation is obtained by rewriting the weighted graph matching problem as a least-square problem on the set of permutation matrices and relaxing it to two different optimization problems: a quadratic convex and a quadratic concave optimization problem on the set of doubly stochastic matrices. The concave relaxation has the same global minimum as the initial graph matching problem, but the search for its global minimum is also a hard combinatorial problem. We therefore construct an approximation of the concave problem solution by following a solution path of a convex-concave problem obtained by linear interpolation of the convex and concave formulations, starting from the convex relaxation. This method makes it easy to integrate the information on graph label similarities into the optimization problem, and therefore to perform labeled weighted graph matching. The algorithm is compared with some of the best performing graph matching methods on four datasets: simulated graphs, QAPLib, retina vessel images and handwritten Chinese characters. In all cases, the results are competitive with the state-of-the-art.

[Zaslavskiy2008GraphM] M. Zaslavskiy, F. Bach, and J.-P. Vert. GRAPHM: Graph matching package, 2008. Available at http://cbio.ensmp.fr/graphm. [ bib | http ]
[Zaslavskiy2008path] M. Zaslavskiy, F. Bach, and J.-P. Vert. A path following algorithm for graph matching. In A. Elmoataz, O. Lezoray, F. Nouboud, and D. Mammass, editors, Image and Signal Processing, Proceedings of the 3rd International Conference, ICISP 2008, volume 5099 of LNCS, pages 329-337. Springer Berlin / Heidelberg, 2008. [ bib | DOI | http ]
We propose a convex-concave programming approach for the labelled weighted graph matching problem. The convex-concave programming formulation is obtained by rewriting the graph matching problem as a least-square problem on the set of permutation matrices and relaxing it to two different optimization problems: a quadratic convex and a quadratic concave optimization problem on the set of doubly stochastic matrices. The concave relaxation has the same global minimum as the initial graph matching problem, but the search for its global minimum is also a complex combinatorial problem. We therefore construct an approximation of the concave problem solution by following a solution path of the convex-concave problem obtained by linear interpolation of the convex and concave formulations, starting from the convex relaxation. The algorithm is compared with some of the best performing graph matching methods on three datasets: simulated graphs, QAPLib and handwritten Chinese characters.

[Yu2008Stable] L. Yu, C. Ding, and S. Loscalzo. Stable feature selection via dense feature groups. In Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 803-811. ACM, 2008. [ bib ]
[Yu2008High-quality] H. Yu, P. Braun, M. A. Yildirim, I. Lemmens, K. Venkatesan, J. Sahalie, T. Hirozane-Kishikawa, F. Gebreab, N. Li, N. Simonis, T. Hao, J.-F. Rual, A. Dricot, A. Vazquez, R. R. Murray, C. Simon, L. Tardivo, S. Tam, N. Svrzikapa, C. Fan, A.-S. de Smet, A. Motyl, M. E. Hudson, J. Park, X. Xin, M. E. Cusick, T. Moore, C. Boone, M. Snyder, F. R. Roth, A.-L. Barabási, J. Tavernier, D. E. Hill, and M. Vidal. High-quality binary protein interaction map of the yeast interactome network. Science, 322(5898):104-110, Oct 2008. [ bib | DOI | http | .pdf ]
Current yeast interactome network maps contain several hundred molecular complexes with limited and somewhat controversial representation of direct binary interactions. We carried out a comparative quality assessment of current yeast interactome data sets, demonstrating that high-throughput yeast two-hybrid (Y2H) screening provides high-quality binary interaction information. Because a large fraction of the yeast binary interactome remains to be mapped, we developed an empirically controlled mapping framework to produce a "second-generation" high-quality, high-throughput Y2H data set covering approximately 20% of all yeast binary interactions. Both Y2H and affinity purification followed by mass spectrometry (AP/MS) data are of equally high quality but of a fundamentally different and complementary nature, resulting in networks with different topological and biological properties. Compared to co-complex interactome models, this binary map is enriched for transient signaling interactions and intercomplex connections with a highly significant clustering between essential proteins. Rather than correlating with essentiality, protein connectivity correlates with genetic pleiotropy.

[Yosef2008Improved] N. Yosef, R. Sharan, and W. S. Noble. Improved network-based identification of protein orthologs. Bioinformatics, 24(16):i200-i206, Aug 2008. [ bib | DOI | http | .pdf ]
MOTIVATION: Identifying protein orthologs is an important task that is receiving growing attention in the bioinformatics literature. Orthology detection provides a fundamental tool towards understanding protein evolution, predicting protein functions and interactions, aligning protein-protein interaction (PPI) networks of different species and detecting conserved modules within these networks. RESULTS: Here, we present a novel diffusion-based framework that builds on the Rankprop algorithm for protein orthology detection and enhances it in several important ways. Specifically, we enhance the Rankprop algorithm to account for the presence of multiple paralogs, utilize PPI, and consider multiple (>2) species in parallel. We comprehensively benchmarked our framework using a variety of training datasets and experimental settings. The results, based on the yeast, fly and human proteomes, show that the novel enhancements of Rankprop provide substantial improvements over its original formulation as well as over a number of state of the art methods for network-based orthology detection. AVAILABILITY: datasets and source code are available upon request.

[Yamanishi2008Prediction] Y. Yamanishi, M. Araki, A. Gutteridge, W. Honda, and M. Kanehisa. Prediction of drug-target interaction networks from the integration of chemical and genomic spaces. Bioinformatics, 24(13):i232-i240, Jul 2008. [ bib | DOI | http | .pdf ]
MOTIVATION: The identification of interactions between drugs and target proteins is a key area in genomic drug discovery. Therefore, there is a strong incentive to develop new methods capable of detecting these potential drug-target interactions efficiently. RESULTS: In this article, we characterize four classes of drug-target interaction networks in humans involving enzymes, ion channels, G-protein-coupled receptors (GPCRs) and nuclear receptors, and reveal significant correlations between drug structure similarity, target sequence similarity and the drug-target interaction network topology. We then develop new statistical methods to predict unknown drug-target interaction networks from chemical structure and genomic sequence information simultaneously on a large scale. The originality of the proposed method lies in the formalization of the drug-target interaction inference as a supervised learning problem for a bipartite graph, the lack of need for 3D structure information of the target proteins, and in the integration of chemical and genomic spaces into a unified space that we call 'pharmacological space'. In the results, we demonstrate the usefulness of our proposed method for the prediction of the four classes of drug-target interaction networks. Our comprehensively predicted drug-target interaction networks enable us to suggest many potential drug-target interactions and to increase research productivity toward genomic drug discovery. AVAILABILITY: Softwares are available upon request. SUPPLEMENTARY INFORMATION: Datasets and all prediction results are available at http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/drugtarget/.

[Yakoby2008combinatorial] Nir Yakoby, Christopher A. Bristow, Danielle Gong, Xenia Schafer, Jessica Lembong, Jeremiah J. Zartman, Marc S. Halfon, Trudi Schüpbach, and Stanislav Y. Shvartsman. A combinatorial code for pattern formation in Drosophila oogenesis. Dev Cell, 15(5):725-737, Nov 2008. [ bib | DOI | http | .pdf ]
Two-dimensional patterning of the follicular epithelium in Drosophila oogenesis is required for the formation of three-dimensional eggshell structures. Our analysis of a large number of published gene expression patterns in the follicle cells suggests that they follow a simple combinatorial code based on six spatial building blocks and the operations of union, difference, intersection, and addition. The building blocks are related to the distribution of inductive signals, provided by the highly conserved epidermal growth factor receptor and bone morphogenetic protein signaling pathways. We demonstrate the validity of the code by testing it against a set of patterns obtained in a large-scale transcriptional profiling experiment. Using the proposed code, we distinguish 36 distinct patterns for 81 genes expressed in the follicular epithelium and characterize their joint dynamics over four stages of oogenesis. The proposed combinatorial framework allows systematic analysis of the diversity and dynamics of two-dimensional transcriptional patterns and guides future studies of gene regulation.

[Wu2008Network-based] X. Wu, R. Jiang, M.Q. Zhang, and S. Li. Network-based global inference of human disease genes. Mol. Syst. Biol., 4:189, 2008. [ bib | DOI | http ]
Deciphering the genetic basis of human diseases is an important goal of biomedical research. On the basis of the assumption that phenotypically similar diseases are caused by functionally related genes, we propose a computational framework that integrates human protein-protein interactions, disease phenotype similarities, and known gene-phenotype associations to capture the complex relationships between phenotypes and genotypes. We develop a tool named CIPHER to predict and prioritize disease genes, and we show that the global concordance between the human protein network and the phenotype network reliably predicts disease genes. Our method is applicable to genetically uncharacterized phenotypes, effective in the genome-wide scan of disease genes, and also extendable to explore gene cooperativity in complex diseases. The predicted genetic landscape of over 1000 human phenotypes, which reveals the global modular organization of phenotype-genotype relationships. The genome-wide prioritization of candidate genes for over 5000 human phenotypes, including those with under-characterized disease loci or even those lacking known association, is publicly released to facilitate future discovery of disease genes.

Keywords: BRCA1 Protein; Bias (Epidemiology); Breast Neoplasms; Disease; Female; Gene Regulatory Networks; Genes; Genome, Human; Genotype; Humans; Linkage (Genetics); Phenotype; Software
[Wirapati2008Meta-analysis] P. Wirapati, C. Sotiriou, S. Kunkel, P. Farmer, S. Pradervand, B. Haibe-Kains, C. Desmedt, M. Ignatiadis, T. Sengstag, F. Schütz, D. R. Goldstein, M. Piccart, and M. Delorenzi. Meta-analysis of gene expression profiles in breast cancer: toward a unified understanding of breast cancer subtyping and prognosis signatures. Breast Cancer Res., 10(4):R65, 2008. [ bib | DOI | http | .pdf ]
INTRODUCTION: Breast cancer subtyping and prognosis have been studied extensively by gene expression profiling, resulting in disparate signatures with little overlap in their constituent genes. Although a previous study demonstrated a prognostic concordance among gene expression signatures, it was limited to only one dataset and did not fully elucidate how the different genes were related to one another nor did it examine the contribution of well-known biological processes of breast cancer tumorigenesis to their prognostic performance. METHOD: To address the above issues and to further validate these initial findings, we performed the largest meta-analysis of publicly available breast cancer gene expression and clinical data, which are comprised of 2,833 breast tumors. Gene coexpression modules of three key biological processes in breast cancer (namely, proliferation, estrogen receptor [ER], and HER2 signaling) were used to dissect the role of constituent genes of nine prognostic signatures. RESULTS: Using a meta-analytical approach, we consolidated the signatures associated with ER signaling, ERBB2 amplification, and proliferation. Previously published expression-based nomenclature of breast cancer 'intrinsic' subtypes can be mapped to the three modules, namely, the ER-/HER2- (basal-like), the HER2+ (HER2-like), and the low- and high-proliferation ER+/HER2- subtypes (luminal A and B). We showed that all nine prognostic signatures exhibited a similar prognostic performance in the entire dataset. Their prognostic abilities are due mostly to the detection of proliferation activity. Although ER- status (basal-like) and ERBB2+ expression status correspond to bad outcome, they seem to act through elevated expression of proliferation genes and thus contain only indirect information about prognosis. Clinical variables measuring the extent of tumor progression, such as tumor size and nodal status, still add independent prognostic information to proliferation genes. 
CONCLUSION: This meta-analysis unifies various results of previous gene expression studies in breast cancer. It reveals connections between traditional prognostic factors, expression-based subtyping, and prognostic signatures, highlighting the important role of proliferation in breast cancer prognosis.

Keywords: microarray, breastcancer
[Willett2008From] P. Willett. From chemical documentation to chemoinformatics: 50 years of chemical information science. J. Inf. Sci., 34(4):477-499, 2008. [ bib | DOI ]
[Wheeler2008Complete] David A Wheeler, Maithreyan Srinivasan, Michael Egholm, Yufeng Shen, Lei Chen, Amy McGuire, Wen He, Yi-Ju Chen, Vinod Makhijani, G. Thomas Roth, Xavier Gomes, Karrie Tartaro, Faheem Niazi, Cynthia L Turcotte, Gerard P Irzyk, James R Lupski, Craig Chinault, Xing zhi Song, Yue Liu, Ye Yuan, Lynne Nazareth, Xiang Qin, Donna M Muzny, Marcel Margulies, George M Weinstock, Richard A Gibbs, and Jonathan M Rothberg. The complete genome of an individual by massively parallel DNA sequencing. Nature, 452(7189):872-876, Apr 2008. [ bib | DOI | http ]
The association of genetic variation with disease and drug response, and improvements in nucleic acid technologies, have given great optimism for the impact of 'genomic medicine'. However, the formidable size of the diploid human genome, approximately 6 gigabases, has prevented the routine application of sequencing methods to deciphering complete individual human genomes. To realize the full potential of genomics for human health, this limitation must be overcome. Here we report the DNA sequence of a diploid genome of a single individual, James D. Watson, sequenced to 7.4-fold redundancy in two months using massively parallel sequencing in picolitre-size reaction vessels. This sequence was completed in two months at approximately one-hundredth of the cost of traditional capillary electrophoresis methods. Comparison of the sequence to the reference genome led to the identification of 3.3 million single nucleotide polymorphisms, of which 10,654 cause amino-acid substitution within the coding sequence. In addition, we accurately identified small-scale (2-40,000 base pair (bp)) insertion and deletion polymorphism as well as copy number variation resulting in the large-scale gain and loss of chromosomal segments ranging from 26,000 to 1.5 million base pairs. Overall, these results agree well with recent results of sequencing of a single individual by traditional methods. However, in addition to being faster and significantly less expensive, this sequencing technology avoids the arbitrary loss of genomic sequences inherent in random shotgun sequencing by bacterial cloning because it amplifies DNA in a cell-free system. As a result, we further demonstrate the acquisition of novel human sequence, including novel genes not previously identified by traditional genomic sequencing. This is the first genome sequenced by next-generation technologies. Therefore it is a pilot for the future challenges of 'personalized genome sequencing'.

Keywords: Alleles; Computational Biology; Genetic Predisposition to Disease, genetics; Genetic Variation, genetics; Genome, Human, genetics; Genomics, economics/methods/trends; Genotype; Humans; Individuality; Male; Oligonucleotide Array Sequence Analysis; Polymorphism, Single Nucleotide, genetics; Reproducibility of Results; Sensitivity and Specificity; Sequence Alignment; Sequence Analysis, DNA, economics/methods; Software
[Weis2008Structural] William I Weis and Brian K Kobilka. Structural insights into G-protein-coupled receptor activation. Curr Opin Struct Biol, 18(6):734-740, Dec 2008. [ bib | DOI | http ]
G-protein-coupled receptors (GPCRs) are the largest family of eukaryotic plasma membrane receptors, and are responsible for the majority of cellular responses to external signals. GPCRs share a common architecture comprising seven transmembrane (TM) helices. Binding of an activating ligand enables the receptor to catalyze the exchange of GTP for GDP in a heterotrimeric G protein. GPCRs are in a conformational equilibrium between inactive and activating states. Crystallographic and spectroscopic studies of the visual pigment rhodopsin and two beta-adrenergic receptors have defined some of the conformational changes associated with activation.

Keywords: Animals; Crystallography; Humans; Membrane Proteins; Models, Molecular; Receptors, Adrenergic, beta; Receptors, G-Protein-Coupled; Rhodopsin
[Watanabe2008Endogenous] T. Watanabe, Y. Totoki, A. Toyoda, M. Kaneda, S. Kuramochi-Miyagawa, Y. Obata, H. Chiba, Y. Kohara, T. Kono, T. Nakano, M.A. Surani, Y. Sakaki, and H. Sasaki. Endogenous siRNAs from naturally formed dsRNAs regulate transcripts in mouse oocytes. Nature, 453:539-543, 2008. [ bib ]
Keywords: csbcbook
[Wang2008diploid] Jun Wang, Wei Wang, Ruiqiang Li, Yingrui Li, Geng Tian, Laurie Goodman, Wei Fan, Junqing Zhang, Jun Li, Juanbin Zhang, Yiran Guo, Binxiao Feng, Heng Li, Yao Lu, Xiaodong Fang, Huiqing Liang, Zhenglin Du, Dong Li, Yiqing Zhao, Yujie Hu, Zhenzhen Yang, Hancheng Zheng, Ines Hellmann, Michael Inouye, John Pool, Xin Yi, Jing Zhao, Jinjie Duan, Yan Zhou, Junjie Qin, Lijia Ma, Guoqing Li, Zhentao Yang, Guojie Zhang, Bin Yang, Chang Yu, Fang Liang, Wenjie Li, Shaochuan Li, Dawei Li, Peixiang Ni, Jue Ruan, Qibin Li, Hongmei Zhu, Dongyuan Liu, Zhike Lu, Ning Li, Guangwu Guo, J. Zhang, J. Ye, L. Fang, Q. Hao, Q. Chen, Y. Liang, Y. Su, A. San, C. Ping, S. Yang, F. Chen, L. Li, K. Zhou, H. Zheng, Y. Ren, L. Yang, Y. Gao, G. Yang, Z. Li, X. Feng, K. Kristiansen, G. K.-S. Wong, R. Nielsen, R. Durbin, L. Bolund, X. Zhang, S. Li, H. Yang, and J. Wang. The diploid genome sequence of an Asian individual. Nature, 456(7218):60-65, Nov 2008. [ bib | DOI | http ]
Here we present the first diploid genome sequence of an Asian individual. The genome was sequenced to 36-fold average coverage using massively parallel sequencing technology. We aligned the short reads onto the NCBI human reference genome to 99.97% coverage, and guided by the reference genome, we used uniquely mapped reads to assemble a high-quality consensus sequence for 92% of the Asian individual's genome. We identified approximately 3 million single-nucleotide polymorphisms (SNPs) inside this region, of which 13.6% were not in the dbSNP database. Genotyping analysis showed that SNP identification had high accuracy and consistency, indicating the high sequence quality of this assembly. We also carried out heterozygote phasing and haplotype prediction against HapMap CHB and JPT haplotypes (Chinese and Japanese, respectively), sequence comparison with the two available individual genomes (J. D. Watson and J. C. Venter), and structural variation identification. These variations were considered for their potential biological impact. Our sequence data and analyses demonstrate the potential usefulness of next-generation sequencing technologies for personal genomics.

Keywords: ngs
[Wadman2008James] Meredith Wadman. James Watson's genome sequenced at high speed. Nature, 452(7189):788, Apr 2008. [ bib | DOI | http ]
Keywords: Genetic Counseling, trends; Genome, Human; Genomics, economics/trends; History, 21st Century; Humans; Individuality; Male; Reference Standards; Sequence Analysis, DNA, economics/trends; Time Factors
[Waaijenborg2008Quantifying] S. Waaijenborg, P. C. Verselewel de Witt Hamer, and A. H. Zwinderman. Quantifying the association between gene expressions and DNA-markers by penalized canonical correlation analysis. Stat Appl Genet Mol Biol, 7(1):Article3, 2008. [ bib | DOI | http ]
Multiple changes at the DNA level are at the basis of complex diseases. Identifying the genetic networks that are influenced by these changes might help in understanding the development of these diseases. Canonical correlation analysis is used to associate gene expressions with DNA-markers and thus reveals sets of co-expressed and co-regulated genes and their associating DNA-markers. However, when the number of variables gets high, e.g. in the case of microarray studies, interpretation of these results can be difficult. By adapting the elastic net to canonical correlation analysis the number of variables reduces, and interpretation becomes easier, moreover, due to the grouping effect of the elastic net co-regulated and co-expressed genes cluster. Additionally, our adaptation works well in situations where the number of variables exceeds by far the number of subjects.

Keywords: Cluster Analysis; DNA, genetics; Gene Expression; Genetic Markers
[Vishwanathan2008Graph] S. V. N. Vishwanathan, K. M. Borgwardt, R. I. Kondor, and N. N. Schraudolph. Graph kernels. CoRR, abs/0807.0093, 2008. [ bib ]
[Vincent-Salomon2008Integrated] A. Vincent-Salomon, C. Lucchesi, N. Gruel, V. Raynal, G. Pierron, R. Goudefroye, F. Reyal, F. Radvanyi, R. Salmon, J.-P. Thiery, X. Sastre-Garau, B. Sigal-Zafrani, A. Fourquet, and A. Delattre. Integrated genomic and transcriptomic analysis of ductal carcinoma in situ of the breast. Clin. Cancer Res., 14(7):1956-1965, Apr 2008. [ bib | DOI | http | .pdf ]
PURPOSE: To gain insight into genomic and transcriptomic subtypes of ductal carcinomas in situ of the breast (DCIS). EXPERIMENTAL DESIGN: We did a combined phenotypic and genomic analysis of a series of 57 DCIS integrated with gene expression profile analysis for 26 of the 57 cases. RESULTS: Thirty-two DCIS exhibited a luminal phenotype; 21 were ERBB2 positive, and 4 were ERBB2/estrogen receptor (ER) negative with 1 harboring a bona fide basal-like phenotype. Based on a CGH analysis, genomic types were identified in this series of DCIS with the 1q gain/16q loss combination observed in 3 luminal DCIS, the mixed amplifier pattern including all ERBB2, 12 luminal and 2 ERBB2(-)/ER(-) DCIS, and the complex copy number alteration profile encompassing 14 luminal and 1 ERBB2(-)/ER(-) DCIS. Eight cases (8 of 57; 14%) presented a TP53 mutation, all being amplifiers. Unsupervised analysis of gene expression profiles of 26 of the 57 DCIS showed that luminal and ERBB2-amplified, ER-negative cases clustered separately. We further investigated the effect of high and low copy number changes on gene expression. Strikingly, amplicons but also low copy number changes especially on 1q, 8q, and 16q in DCIS regulated the expression of a subset of genes in a very similar way to that recently described in invasive ductal carcinomas. CONCLUSIONS: These combined approaches show that the molecular heterogeneity of breast ductal carcinomas exists already in in situ lesions and further indicate that DCIS and invasive ductal carcinomas share genomic alterations with a similar effect on gene expression profile.

Keywords: breastcancer, cgh
[Vert2008optimal] J.-P. Vert. The optimal assignment kernel is not positive definite. Technical Report 0801.4061, arXiv, 2008. [ bib | http ]
[Vermeulen2008Cancer] L. Vermeulen, M.R. Sprick, K. Kemper, G. Stassi, and J.P. Medema. Cancer stem cells - old concepts, new insights. Cell Death and Differentiation, 15:947-58, 2008. [ bib ]
Keywords: csbcbook
[Tomioka2008Sparse] R. Tomioka and M. Sugiyama. Sparse learning with duality gap guarantee. Technical report, Department of Computer Science; Graduate School of Information Science and Engineering, Tokyo Institute of Technology, 152-8552, Tokyo, Japan, 2008. [ bib ]
[Tibshirani2008Spatial] R. Tibshirani and P. Wang. Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics (Oxford, England), 9(1):18-29, January 2008. [ bib | DOI | http ]
We apply the "fused lasso" regression method of (TSRZ2004) to the problem of "hot-spot detection", in particular, detection of regions of gain or loss in comparative genomic hybridization (CGH) data. The fused lasso criterion leads to a convex optimization problem, and we provide a fast algorithm for its solution. Estimates of false-discovery rate are also provided. Our studies show that the new method generally outperforms competing methods for calling gains and losses in CGH data.

Keywords: copy_number
[Taylor2008Guidelines] Chris F Taylor, Pierre-Alain Binz, Ruedi Aebersold, Michel Affolter, Robert Barkovich, Eric W Deutsch, David M Horn, Andreas Hühmer, Martin Kussmann, Kathryn Lilley, Marcus Macht, Matthias Mann, Dieter Müller, Thomas A Neubert, Janice Nickson, Scott D Patterson, Roberto Raso, Kathryn Resing, Sean L Seymour, Akira Tsugita, Ioannis Xenarios, Rong Zeng, and Randall K Julian. Guidelines for reporting the use of mass spectrometry in proteomics. Nat Biotechnol, 26(8):860-861, Aug 2008. [ bib | DOI | http ]
Keywords: Databases, Protein; Guidelines as Topic; Mass Spectrometry; Proteomics
[Szafranski2008Composite] M. Szafranski, Y. Grandvalet, and A. Rakotomamonjy. Composite kernel learning. In ICML '08: Proceedings of the 25th international conference on Machine learning, Helsinki, Finland, July 2008. [ bib | http ]
[Szafranski2008Hierarchical] Marie Szafranski, Yves Grandvalet, and Pierre Morizet-Mahoudeaux. Hierarchical penalization. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1457-1464. MIT Press, Cambridge, MA, 2008. [ bib ]
[Suter2008Two-hybrid] Bernhard Suter, Saranya Kittanakom, and Igor Stagljar. Two-hybrid technologies in proteomics research. Curr Opin Biotechnol, 19(4):316-323, Aug 2008. [ bib | DOI | http ]
Given that protein-protein interactions (PPIs) regulate nearly every living process, the exploration of global and pathway-specific protein interaction networks is expected to have major implications in the understanding of diseases and for drug discovery. Consequently, the development and application of methodologies that address physical associations among proteins is of major importance in today's proteomics research. The most widely and successfully used methodology to assess PPIs is the yeast two-hybrid system (YTH). Here we present an overview on the current applications of YTH and variant technologies in yeast and mammalian systems. Two-hybrid-based methods will not only continue to have a dominant role in the assessment of protein interactomes but will also become important in the development of novel compounds that target protein interaction interfaces for therapeutic intervention.

Keywords: Animals; Drug Design; Mammals; Proteomics; Two-Hybrid System Techniques
[Stratton2008Emerging] M.R. Stratton and N. Rahman. The emerging landscape of breast cancer susceptibility. Nat. Genet., 40:17-22, 2008. [ bib ]
Keywords: csbcbook
[Smart2008Cascading] A. G. Smart, L. A. N. Amaral, and J. M. Ottino. Cascading failure and robustness in metabolic networks. Proc. Natl. Acad. Sci. USA, 105(36):13223-13228, Sep 2008. [ bib | DOI | http | .pdf ]
We investigate the relationship between structure and robustness in the metabolic networks of Escherichia coli, Methanosarcina barkeri, Staphylococcus aureus, and Saccharomyces cerevisiae, using a cascading failure model based on a topological flux balance criterion. We find that, compared to appropriate null models, the metabolic networks are exceptionally robust. Furthermore, by decomposing each network into rigid clusters and branched metabolites, we demonstrate that the enhanced robustness is related to the organization of branched metabolites, as rigid cluster formations in the metabolic networks appear to be consistent with null model behavior. Finally, we show that cascading in the metabolic networks can be described as a percolation process.

[Singh2008Global] R. Singh, J. Xu, and B. Berger. Global alignment of multiple protein interaction networks with application to functional orthology detection. Proc. Natl. Acad. Sci. USA, 105(35):12763-12768, Sep 2008. [ bib | DOI | http | .pdf ]
Protein-protein interactions (PPIs) and their networks play a central role in all biological processes. Akin to the complete sequencing of genomes and their comparative analysis, complete descriptions of interactomes and their comparative analysis is fundamental to a deeper understanding of biological processes. A first step in such an analysis is to align two or more PPI networks. Here, we introduce an algorithm, IsoRank, for global alignment of multiple PPI networks. The guiding intuition here is that a protein in one PPI network is a good match for a protein in another network if their respective sequences and neighborhood topologies are a good match. We encode this intuition as an eigenvalue problem in a manner analogous to Google's PageRank method. Using IsoRank, we compute a global alignment of the Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, Mus musculus, and Homo sapiens PPI networks. We demonstrate that incorporating PPI data in ortholog prediction results in improvements over existing sequence-only approaches and over predictions from local alignments of the yeast and fly networks. Previous methods have been effective at identifying conserved, localized network patterns across pairs of networks. This work takes the further step of performing a global alignment of multiple PPI networks. It simultaneously uses sequence similarity and network data and, unlike previous approaches, explicitly models the tradeoff inherent in combining them. We expect IsoRank, with its simultaneous handling of node similarity and network similarity, to be applicable across many scientific domains.

[Simon2008Lost] R. Simon. Lost in translation: problems and pitfalls in translating laboratory observations to clinical utility. European journal of cancer (Oxford, England: 1990), 44(18):2707, 2008. [ bib ]
[Shen2008Pathway] R. Shen, A. M. Chinnaiyan, and D. Ghosh. Pathway analysis reveals functional convergence of gene expression profiles in breast cancer. BMC Medical Genomics, 1(1):28, 2008. [ bib | DOI | http | .pdf ]
[Shann2008Genome] Y.J. Shann, C. Cheng, C.H. Chiao, D.T. Chen, P.H. Li, and M.T. Hsu. Genome-wide mapping and characterization of hypomethylated sites in human tissues and breast cancer cell lines. Genome Res., 18:791-801, 2008. [ bib ]
Keywords: csbcbook
[Shah2008SVM-HUSTLE] A. R. Shah, C. S. Oehmen, and B.-J. Webb-Robertson. SVM-HUSTLE: an iterative semi-supervised machine learning approach for pairwise protein remote homology detection. Bioinformatics, 24(6):783-790, Mar 2008. [ bib | DOI | http | .pdf ]
MOTIVATION: As the amount of biological sequence data continues to grow exponentially we face the increasing challenge of assigning function to this enormous molecular 'parts list'. The most popular approaches to this challenge make use of the simplifying assumption that similar functional molecules, or proteins, sometimes have similar composition, or sequence. However, these algorithms often fail to identify remote homologs (proteins with similar function but dissimilar sequence) which often are a significant fraction of the total homolog collection for a given sequence. We introduce a Support Vector Machine (SVM)-based tool to detect homology using semi-supervised iterative learning (SVM-HUSTLE) that identifies significantly more remote homologs than current state-of-the-art sequence or cluster-based methods. As opposed to building profiles or position specific scoring matrices, SVM-HUSTLE builds an SVM classifier for a query sequence by training on a collection of representative high-confidence training sets, recruits additional sequences and assigns a statistical measure of homology between a pair of sequences. SVM-HUSTLE combines principles of semi-supervised learning theory with statistical sampling to create many concurrent classifiers to iteratively detect and refine, on-the-fly, patterns indicating homology. RESULTS: When compared against existing methods for identifying protein homologs (BLAST, PSI-BLAST, COMPASS, PROFSIM, RANKPROP and their variants) on two different benchmark datasets SVM-HUSTLE significantly outperforms each of the above methods using the most stringent ROC(1) statistic with P-values less than 1e-20. SVM-HUSTLE also yields results comparable to HHSearch but at a substantially reduced computational cost since we do not require the construction of HMMs. AVAILABILITY: The software executable to run SVM-HUSTLE can be downloaded from http://www.sysbio.org/sysbio/networkbio/svmhustle

Keywords: PUlearning
[Schones2008Genome] Dustin E. Schones and Keji Zhao. Genome-wide approaches to studying chromatin modifications. Nat. Rev. Genet., 9:179-191, 2008. [ bib ]
Keywords: csbcbook, csbcbook-ch2
[Schalon2008Simple] C. Schalon, J.-S. Surgand, E. Kellenberger, and D. Rognan. A simple and fuzzy method to align and compare druggable ligand-binding sites. Proteins, 71(4):1755-1778, Jun 2008. [ bib | DOI | http ]
A novel method to measure distances between druggable protein cavities is presented. Starting from user-defined ligand binding sites, eight topological and physicochemical properties are projected from cavity-lining protein residues to an 80 triangle-discretised sphere placed at the centre of the binding site, thus defining a cavity fingerprint. Representing binding site properties onto a discretised sphere presents many advantages: (i) a normalised distance between binding sites of different sizes may be easily derived by summing up the normalised differences between the 8 computed descriptors; (ii) a structural alignment of two proteins is simply done by systematically rotating/translating one mobile sphere around one immobile reference; (iii) a certain degree of fuzziness in the comparison is reached by projecting global amino acid properties (e.g., charge, size, functional groups count, distance to the site centre) independently of local rotameric/tautomeric states of cavity-lining residues. The method was implemented in a new program (SiteAlign) and tested in a number of various scenarios: measuring the distance between 376 related active site pairs, computing the cross-similarity of members of a protein family, predicting the targets of ligands with various promiscuity levels. The proposed method is robust enough to detect local similarity among active sites of different sizes, to discriminate between protein subfamilies and to recover the known targets of promiscuous ligands by virtual screening.

Keywords: Algorithms; Amino Acid Sequence; Binding Sites, drug effects; Drug Design; Hydrogen Bonding; Ligands; Protein Binding; Sequence Alignment; Structure-Activity Relationship
[Schachtner2008Knowledge-based] R. Schachtner, D. Lutter, P. Knollmüller, A. M. Tomé, F. J. Theis, G. Schmitz, M. Stetter, P. Gómez Vilda, and E. W. Lang. Knowledge-based gene expression classification via matrix factorization. Bioinformatics, 24(15):1688-1697, Aug 2008. [ bib | DOI | http ]
MOTIVATION: Modern machine learning methods based on matrix decomposition techniques, like independent component analysis (ICA) or non-negative matrix factorization (NMF), provide new and efficient analysis tools which are currently explored to analyze gene expression profiles. These exploratory feature extraction techniques yield expression modes (ICA) or metagenes (NMF). These extracted features are considered indicative of underlying regulatory processes. They can as well be applied to the classification of gene expression datasets by grouping samples into different categories for diagnostic purposes or group genes into functional categories for further investigation of related metabolic pathways and regulatory networks. RESULTS: In this study we focus on unsupervised matrix factorization techniques and apply ICA and sparse NMF to microarray datasets. The latter monitor the gene expression levels of human peripheral blood cells during differentiation from monocytes to macrophages. We show that these tools are able to identify relevant signatures in the deduced component matrices and extract informative sets of marker genes from these gene expression profiles. The methods rely on the joint discriminative power of a set of marker genes rather than on single marker genes. With these sets of marker genes, corroborated by leave-one-out or random forest cross-validation, the datasets could easily be classified into related diagnostic categories. The latter correspond to either monocytes versus macrophages or healthy vs Niemann Pick C disease patients.

Keywords: Algorithms; Artificial Intelligence; Gene Expression Profiling; Oligonucleotide Array Sequence Analysis; Pattern Recognition, Automated
[Sawyers2008cancer] C. L. Sawyers. The cancer biomarker problem. Nature, 452(7187):548-552, Apr 2008. [ bib | DOI | http | .pdf ]
Genomic technologies offer the promise of a comprehensive understanding of cancer. These technologies are being used to characterize tumours at the molecular level, and several clinical successes have shown that such information can guide the design of drugs targeted to a relevant molecule. One of the main barriers to further progress is identifying the biological indicators, or biomarkers, of cancer that predict who will benefit from a particular targeted therapy.

Keywords: csbcbook-ch3, csbcbook
[Satzinger2008Theodor] Helga Satzinger. Theodor and Marcella Boveri: chromosomes and cytoplasm in heredity and development. Nat. Rev. Genet., 9:231-238, 2008. [ bib ]
Keywords: csbcbook
[Rusk2008Primer] Nicole Rusk and Veronique Kiermer. Primer: Sequencing - the next generation. Nat. Methods, 5:15, 2008. [ bib ]
Keywords: csbcbook, csbcbook-ch2
[Rothman2008Sparse] A. J. Rothman, P. J. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance estimation. Electron. J. Statist., 2:494-515, 2008. [ bib | DOI | http | .pdf ]
[Roth2008The] V. Roth and B. Fischer. The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In ICML '08: Proceedings of the 25th international conference on Machine learning, pages 848-855, 2008. [ bib | .pdf ]
Keywords: lasso
[Reyal2008comprehensive] F. Reyal, M. H. van Vliet, N. J. Armstrong, H. M. Horlings, K. E. de Visser, M. Kok, A. E. Teschendorff, S. Mook, L. van't Veer, C. Caldas, R. J. Salmon, Remy, M. J. van de Vijver, and L. F. A. Wessels. A comprehensive analysis of prognostic signatures reveals the high predictive capacity of the proliferation, immune response and RNA splicing modules in breast cancer. Breast Cancer Res., 10(6):R93, 2008. [ bib | DOI | http | .pdf ]
[Reinders2008Genome] Jon Reinders, Celine Delucinge Vivier, Gregory Theiler, Didier Chollet, Patrick Descombes, and Jerzy Paszkowski. Genome-wide, high-resolution DNA methylation profiling using bisulfite-mediated cytosine conversion. Genome Res., 18:469-76, 2008. [ bib ]
Keywords: csbcbook, csbcbook-ch2
[Ravikumar2008SpAM] Pradeep Ravikumar, Han Liu, John Lafferty, and Larry Wasserman. SpAM: Sparse additive models. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1201-1208. MIT Press, Cambridge, MA, 2008. [ bib ]
[Rapaport2008Classification] F. Rapaport, E. Barillot, and J.-P. Vert. Classification of arrayCGH data using fused SVM. Bioinformatics, 24(13):i375-i382, Jul 2008. [ bib | DOI | http | .pdf ]
MOTIVATION: Array-based comparative genomic hybridization (arrayCGH) has recently become a popular tool to identify DNA copy number variations along the genome. These profiles are starting to be used as markers to improve prognosis or diagnosis of cancer, which implies that methods for automated supervised classification of arrayCGH data are needed. Like gene expression profiles, arrayCGH profiles are characterized by a large number of variables usually measured on a limited number of samples. However, arrayCGH profiles have a particular structure of correlations between variables, due to the spatial organization of bacterial artificial chromosomes along the genome. This suggests that classical classification methods, often based on the selection of a small number of discriminative features, may not be the most accurate methods and may not produce easily interpretable prediction rules. RESULTS: We propose a new method for supervised classification of arrayCGH data. The method is a variant of support vector machine that incorporates the biological specificities of DNA copy number variations along the genome as prior knowledge. The resulting classifier is a sparse linear classifier based on a limited number of regions automatically selected on the chromosomes, leading to easy interpretation and identification of discriminative regions of the genome. We test this method on three classification problems for bladder and uveal cancer, involving both diagnosis and prognosis. We demonstrate that the introduction of the new prior on the classifier leads not only to more accurate predictions, but also to the identification of known and new regions of interest in the genome. AVAILABILITY: All data and algorithms are publicly available.

Keywords: cgh
[Rapaport2008Introduction] F. Rapaport. Introduction de la connaissance a priori dans l’étude des puces à ADN. PhD thesis, Université Pierre et Marie Curie - Paris 6, 2008. [ bib ]
[Rakotomamonjy2008SimpleMKL] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. J. Mach. Learn. Res., 9:2491-2521, 2008. [ bib ]
[Post2008Extensions] T.M. Post, J.I. Freijer, and B.A. Ploeger. Extensions to the Visual Predictive Check to facilitate model performance evaluation. J Pharmacokinet Pharmacodyn, 35:185-02, 2008. [ bib | DOI ]
[Poor2008Quickest] H. V. Poor and O. Hadjiliadis. Quickest Detection. Cambridge University Press, 2008. [ bib ]
Keywords: segmentation
[Pfeifer2008Multiple] Nico Pfeifer and Oliver Kohlbacher. Multiple instance learning allows MHC class II epitope predictions across alleles. In WABI '08: Proceedings of the 8th international workshop on Algorithms in Bioinformatics, pages 210-221, Berlin, Heidelberg, 2008. Springer-Verlag. [ bib | DOI ]
[Oti2008Conserved] M. Oti, J. van Reeuwijk, M.A. Huynen, and H.G. Brunner. Conserved co-expression for candidate disease gene prioritization. BMC Bioinformatics, 9:208, 2008. [ bib | DOI | http ]
BACKGROUND: Genes that are co-expressed tend to be involved in the same biological process. However, co-expression is not a very reliable predictor of functional links between genes. The evolutionary conservation of co-expression between species can be used to predict protein function more reliably than co-expression in a single species. Here we examine whether co-expression across multiple species is also a better prioritizer of disease genes than is co-expression between human genes alone. RESULTS: We use co-expression data from yeast (S. cerevisiae), nematode worm (C. elegans), fruit fly (D. melanogaster), mouse and human and find that the use of evolutionary conservation can indeed improve the predictive value of co-expression. The effect that genes causing the same disease have higher co-expression than do other genes from their associated disease loci, is significantly enhanced when co-expression data are combined across evolutionarily distant species. We also find that performance can vary significantly depending on the co-expression datasets used, and just using more data does not necessarily lead to better prioritization. Instead, we find that dataset quality is more important than quantity, and using a consistent microarray platform per species leads to better performance than using more inclusive datasets pooled from various platforms. CONCLUSION: We find that evolutionarily conserved gene co-expression prioritizes disease candidate genes better than human gene co-expression alone, and provide the integrated data as a new resource for disease gene prioritization tools.

Keywords: Animals; Base Sequence; Caenorhabditis elegans; Conserved Sequence; Databases, Genetic; Disease; Drosophila melanogaster; Evolution, Molecular; Gene Dosage; Gene Expression; Gene Expression Profiling; Gene Frequency; Genetic Predisposition to Disease; Humans; Mice; Oligonucleotide Array Sequence Analysis; Penetrance; Predictive Value of Tests; Saccharomyces cerevisiae; Sample Size; Species Specificity
[Network2008Comprehensive] Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature, 455(7216):1061-1068, Oct 2008. [ bib | DOI | http | .pdf ]
Human cancer cells typically harbour multiple chromosomal aberrations, nucleotide substitutions and epigenetic modifications that drive malignant transformation. The Cancer Genome Atlas (TCGA) pilot project aims to assess the value of large-scale multi-dimensional analysis of these molecular characteristics in human cancer and to provide the data rapidly to the research community. Here we report the interim integrative analysis of DNA copy number, gene expression and DNA methylation aberrations in 206 glioblastomas (the most common type of adult brain cancer) and nucleotide sequence aberrations in 91 of the 206 glioblastomas. This analysis provides new insights into the roles of ERBB2, NF1 and TP53, uncovers frequent mutations of the phosphatidylinositol-3-OH kinase regulatory subunit gene PIK3R1, and provides a network view of the pathways altered in the development of glioblastoma. Furthermore, integration of mutation, DNA methylation and clinical treatment data reveals a link between MGMT promoter methylation and a hypermutator phenotype consequent to mismatch repair deficiency in treated glioblastomas, an observation with potential clinical implications. Together, these findings establish the feasibility and power of TCGA, demonstrating that it can rapidly expand knowledge of the molecular basis of cancer.

[Najmanovich2008Detection] R. Najmanovich, N. Kurbatova, and J. Thornton. Detection of 3D atomic similarities and their use in the discrimination of small molecule protein-binding sites. Bioinformatics, 24(16):i105-i111, Aug 2008. [ bib ]
MOTIVATION: Current computational methods for the prediction of function from structure are restricted to the detection of similarities and subsequent transfer of functional annotation. In a significant minority of cases, global sequence or structural (fold) similarities do not provide clues about protein function. In these cases, one alternative is to detect local binding site similarities. These may still reflect more distant evolutionary relationships as well as unique physico-chemical constraints necessary for binding similar ligands, thus helping pinpoint the function. In the present work, we ask the following question: is it possible to discriminate within a dataset of non-homologous proteins those that bind similar ligands based on their binding site similarities? METHODS: We implement a graph-matching-based method for the detection of 3D atomic similarities introducing some simplifications that allow us to extend its applicability to the analysis of large all-atom binding site models. This method, called IsoCleft, does not require atoms to be connected either in sequence or space. We apply the method to a cognate-ligand bound dataset of non-homologous proteins. We define a family of binding site models with decreasing knowledge about the identity of the ligand-interacting atoms to uncouple the questions of predicting the location of the binding site and detecting binding site similarities. Furthermore, we calculate the individual contributions of binding site size, chemical composition and geometry to prediction performance. RESULTS: We find that it is possible to discriminate between different ligand-binding sites. In other words, there is a certain uniqueness in the set of atoms that are in contact to specific ligand scaffolds. This uniqueness is restricted to the atoms in close proximity of the ligand in which case, size and chemical composition alone are sufficient to discriminate binding sites. Discrimination ability decreases with decreasing knowledge about the identity of the ligand-interacting binding site atoms. The decrease is quite abrupt when considering size and chemical composition alone, but much slower when including geometry. We also observe that certain ligands are easier to discriminate. Interestingly, the subset of binding site atoms belonging to highly conserved residues is not sufficient to discriminate binding sites, implying that convergently evolved binding sites arrived at dissimilar solutions. AVAILABILITY: IsoCleft can be obtained from the authors.

[Mosley2008Cell] J.D. Mosley and R.A. Keri. Cell cycle correlated genes dictate the prognostic power of breast cancer gene lists. BMC Medical Genomics, 1(1):11, 2008. [ bib ]
[Mortazavi2008Mapping] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods, 5(7):621-628, Jul 2008. [ bib | DOI | http | .pdf ]
We have mapped and quantified mouse transcriptomes by deeply sequencing them and recording how frequently each gene is represented in the sequence sample (RNA-Seq). This provides a digital measure of the presence and prevalence of transcripts from known and previously unknown genes. We report reference measurements composed of 41-52 million mapped 25-base-pair reads for poly(A)-selected RNA from adult mouse brain, liver and skeletal muscle tissues. We used RNA standards to quantify transcript prevalence and to test the linear range of transcript detection, which spanned five orders of magnitude. Although >90% of uniquely mapped reads fell within known exons, the remaining data suggest new and revised gene models, including changed or additional promoters, exons and 3' untranscribed regions, as well as new candidate microRNA precursors. RNA splice events, which are not readily measured by standard gene expression microarray or serial analysis of gene expression methods, were detected directly by mapping splice-crossing sequence reads. We observed 1.45 × 10^5 distinct splices, and alternative splices were prominent, with 3,500 different genes expressing one or more alternate internal splices.

[Morin2008Application] Ryan D Morin, Michael D O'Connor, Malachi Griffith, Florian Kuchenbauer, Allen Delaney, Anna-Liisa Prabhu, Yongjun Zhao, Helen McDonald, Thomas Zeng, Martin Hirst, Connie J Eaves, and Marco A Marra. Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells. Genome Res, 18(4):610-621, Apr 2008. [ bib | DOI | http | .pdf ]
MicroRNAs (miRNAs) are emerging as important, albeit poorly characterized, regulators of biological processes. Key to further elucidation of their roles is the generation of more complete lists of their numbers and expression changes in different cell states. Here, we report a new method for surveying the expression of small RNAs, including microRNAs, using Illumina sequencing technology. We also present a set of methods for annotating sequences deriving from known miRNAs, identifying variability in mature miRNA sequences, and identifying sequences belonging to previously unidentified miRNA genes. Application of this approach to RNA from human embryonic stem cells obtained before and after their differentiation into embryoid bodies revealed the sequences and expression levels of 334 known plus 104 novel miRNA genes. One hundred seventy-one known and 23 novel microRNA sequences exhibited significant expression differences between these two developmental states. Owing to the increased number of sequence reads, these libraries represent the deepest miRNA sampling to date, spanning nearly six orders of magnitude of expression. The predicted targets of those miRNAs enriched in either sample shared common features. Included among the high-ranked predicted gene targets are those implicated in differentiation, cell cycle control, programmed cell death, and transcriptional regulation.

Keywords: ngs, sirna
[Mordelet2008SIRENE] F. Mordelet and J.-P. Vert. SIRENE: Supervised inference of regulatory networks. Bioinformatics, 24(16):i76-i82, 2008. [ bib | DOI | http | .pdf ]
[Moitessier2008Towards] N. Moitessier, P. Englebienne, D. Lee, J. Lawandi, and C. R. Corbeil. Towards the development of universal, fast and highly accurate docking/scoring methods: a long way to go. Br. J. Pharmacol., 153 Suppl 1:S7-26, Mar 2008. [ bib | DOI | http ]
Accelerating the drug discovery process requires predictive computational protocols capable of reducing or simplifying the synthetic and/or combinatorial challenge. Docking-based virtual screening methods have been developed and successfully applied to a number of pharmaceutical targets. In this review, we first present the current status of docking and scoring methods, with exhaustive lists of these. We next discuss reported comparative studies, outlining criteria for their interpretation. In the final section, we describe some of the remaining developments that would potentially lead to a universally applicable docking/scoring method.

Keywords: Algorithms; Animals; Artificial Intelligence; Computer Simulation; Drug Evaluation, Preclinical, methods; Humans; Metals, chemistry; Models, Molecular; Molecular Conformation; Nucleic Acids, chemistry/drug effects; Proteins, chemistry/drug effects; Reproducibility of Results; Stochastic Processes
[Mishra2008Review] K. P. Mishra, L. Ganju, M. Sairam, P. K. Banerjee, and R. C. Sawhney. A review of high throughput technology for the screening of natural products. Biomed Pharmacother, 62(2):94-98, Feb 2008. [ bib | DOI | http ]
High throughput screening is commonly defined as automatic testing of potential drug candidates at a rate in excess of 10,000 compounds per week. The aim of high throughput drug discovery is to test large compound collections for potentially active compounds ('hits') in order to allow further development of compounds for pre-clinical testing ('leads'). High throughput technology has emerged over the last few years as an important tool for drug discovery and lead optimisation. In this approach, the molecular diversity and range of biological properties displayed by secondary metabolites constitutes a challenge to combinatorial strategies for natural products synthesis and derivatization. This article reviews the approach of High throughput technique for the screening of natural products for drug discovery.

Keywords: Automation; Biological Products, pharmacology; Combinatorial Chemistry Techniques; Drug Design; Drug Evaluation, Preclinical; Technology, Pharmaceutical, methods
[Meier2008group] L. Meier, S. van de Geer, and P. Bühlmann. The group lasso for logistic regression. J. R. Stat. Soc. Ser. B, 70(1):53-71, 2008. [ bib | DOI | http | .pdf ]
Keywords: lasso
[Mardis2008Next] Elaine R Mardis. Next-generation DNA sequencing methods. Annu. Rev. Genomics Hum. Genet., 9:387-402, 2008. [ bib | DOI | http ]
Recent scientific discoveries that resulted from the application of next-generation DNA sequencing technologies highlight the striking impact of these massively parallel platforms on genetics. These new methods have expanded previously focused readouts from a variety of DNA preparation protocols to a genome-wide scale and have fine-tuned their resolution to single base precision. The sequencing of RNA also has transitioned and now includes full-length cDNA analyses, serial analysis of gene expression (SAGE)-based methods, and noncoding RNA discovery. Next-generation sequencing has also enabled novel applications such as the sequencing of ancient DNA samples, and has substantially widened the scope of metagenomic analysis of environmentally derived samples. Taken together, an astounding potential exists for these technologies to bring enormous change in genetic and biological research and to enhance our fundamental biological knowledge.

Keywords: Chromatin Immunoprecipitation; Fossils; Gene Expression Profiling; Genome, Human; Genomics; Humans; RNA, Untranslated; Sequence Analysis, DNA
[Mardis2008Impact] Elaine R. Mardis. The impact of next-generation sequencing technology on genetics. Trends Genet., 24:133-141, 2008. [ bib ]
[Ma2008Penalized] S. Ma and J. Huang. Penalized feature selection and classification in bioinformatics. Briefings in Bioinformatics, 9(5):392-403, 2008. [ bib ]
[Lowery2008MicroRNAs] A. J. Lowery, N. Miller, R. E. McNeill, and M. J. Kerin. MicroRNAs as prognostic indicators and therapeutic targets: potential effect on breast cancer management. Clin. Cancer Res., 14(2):360-365, Jan 2008. [ bib | DOI | http | .pdf ]
The discovery of microRNAs (miRNA) as novel modulators of gene expression has resulted in a rapidly expanding repertoire of molecules in this family, as reflected in the concomitant expansion of scientific literature. MiRNAs are a category of naturally occurring RNA molecules that play important regulatory roles in plants and animals by targeting mRNAs for cleavage or translational repression. Characteristically, miRNAs are noncoding, single-stranded short (18-22 nucleotides) RNAs, features which possibly explain why they had not been intensively investigated until recently. Accumulating experimental evidence indicates that miRNAs play a pivotal role in many cellular functions via the regulation of gene expression. Furthermore, their dysregulation and/or mutation has been shown in carcinogenesis. We provide a brief review of miRNA biogenesis and discuss the technical challenges of modifying experimental techniques to facilitate the identification and characterization of these small RNAs. MiRNA function and their involvement in malignancy, particularly their putative role as oncogenes or tumor suppressors is also discussed, with a specific emphasis on breast cancer. Finally, we comment on the potential role of miRNAs in breast cancer management, particularly in improving current prognostic tools and achieving the goal of individualized cancer treatment.

Keywords: csbcbook, csbcbook-ch3
[Lounici2008Sup-norm] K. Lounici. Sup-norm convergence rate and sign concentration property of Lasso and Dantzig estimators. Electron. J. Statist., 2:90-102, 2008. [ bib | DOI | http | .pdf ]
Keywords: lasso
[Lopez08Statistical] A. Lopez. Statistical machine translation. ACM Comput. Surv., 40(3):1-49, 2008. [ bib | DOI ]
[Liang2008Gene] K.-C. Liang and X. Wang. Gene regulatory network reconstruction using conditional mutual information. EURASIP J Bioinform Syst Biol, page 253894, 2008. [ bib | DOI | http ]
The inference of gene regulatory network from expression data is an important area of research that provides insight to the inner workings of a biological system. The relevance-network-based approaches provide a simple and easily-scalable solution to the understanding of interaction between genes. Up until now, most works based on relevance network focus on the discovery of direct regulation using correlation coefficient or mutual information. However, some of the more complicated interactions such as interactive regulation and co-regulation are not easily detected. In this work, we propose a relevance network model for gene regulatory network inference which employs both mutual information and conditional mutual information to determine the interactions between genes. For this purpose, we propose a conditional mutual information estimator based on adaptive partitioning which allows us to condition on both discrete and continuous random variables. We provide experimental results that demonstrate that the proposed regulatory network inference algorithm can provide better performance when the target network contains coregulated and interactively regulated genes.

[Li2008Mapping] H. Li, J. Ruan, and R. Durbin. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res., 18(11):1851-1858, Nov 2008. [ bib | DOI | http | .pdf ]
New sequencing technologies promise a new era in the use of DNA sequence. However, some of these technologies produce very short reads, typically of a few tens of base pairs, and to use these reads effectively requires new algorithms and software. In particular, there is a major issue in efficiently aligning short reads to a reference genome and handling ambiguity or lack of accuracy in this alignment. Here we introduce the concept of mapping quality, a measure of the confidence that a read actually comes from the position it is aligned to by the mapping algorithm. We describe the software MAQ that can build assemblies by mapping shotgun short reads to a reference genome, using quality scores to derive genotype calls of the consensus sequence of a diploid genome, e.g., from a human sample. MAQ makes full use of mate-pair information and estimates the error probability of each read alignment. Error probabilities are also derived for the final genotype calls, using a Bayesian statistical model that incorporates the mapping qualities, error probabilities from the raw sequence quality scores, sampling of the two haplotypes, and an empirical model for correlated errors at a site. Both read mapping and genotype calling are evaluated on simulated data and real data. MAQ is accurate, efficient, versatile, and user-friendly. It is freely available at http://maq.sourceforge.net.

Keywords: ngs
[Lefkowitz2008crystal] R. J. Lefkowitz, J.-P. Sun, and A. K. Shukla. A crystal clear view of the beta2-adrenergic receptor. Nat. Biotechnol., 26(2):189-191, Feb 2008. [ bib | DOI | http ]
Keywords: chemogenomics
[Lee2008Inferring] E. Lee, H.-Y. Chuang, J. W. Kim, T. Ideker, and D. Lee. Inferring pathway activity toward precise disease classification. PLoS Comput Biol, 4(11):e1000217, Nov 2008. [ bib | DOI | http | .pdf ]
The advent of microarray technology has made it possible to classify disease states based on gene expression profiles of patients. Typically, marker genes are selected by measuring the power of their expression profiles to discriminate among patients of different disease states. However, expression-based classification can be challenging in complex diseases due to factors such as cellular heterogeneity within a tissue sample and genetic heterogeneity across patients. A promising technique for coping with these challenges is to incorporate pathway information into the disease classification procedure in order to classify disease based on the activity of entire signaling pathways or protein complexes rather than on the expression levels of individual genes or proteins. We propose a new classification method based on pathway activities inferred for each patient. For each pathway, an activity level is summarized from the gene expression levels of its condition-responsive genes (CORGs), defined as the subset of genes in the pathway whose combined expression delivers optimal discriminative power for the disease phenotype. We show that classifiers using pathway activity achieve better performance than classifiers based on individual gene expression, for both simple and complex case-control studies including differentiation of perturbed from non-perturbed cells and subtyping of several different kinds of cancer. Moreover, the new method outperforms several previous approaches that use a static (i.e., non-conditional) definition of pathways. Within a pathway, the identified CORGs may facilitate the development of better diagnostic markers and the discovery of core alterations in human disease.

[Leclerc2008Survival] R.D. Leclerc. Survival of the sparsest: robust gene networks are parsimonious. Mol Syst Biol, 4:213, 2008. [ bib | DOI | http ]
Biological gene networks appear to be dynamically robust to mutation, stochasticity, and changes in the environment and also appear to be sparsely connected. Studies with computational models, however, have suggested that denser gene networks evolve to be more dynamically robust than sparser networks. We resolve this discrepancy by showing that misassumptions about how to measure robustness in artificial networks have inadvertently discounted the costs of network complexity. We show that when the costs of complexity are taken into account, that robustness implies a parsimonious network structure that is sparsely connected and not unnecessarily complex; and that selection will favor sparse networks when network topology is free to evolve. Because a robust system of heredity is necessary for the adaptive evolution of complex phenotypes, the maintenance of frugal network complexity is likely a crucial design constraint that underlies biological organization.

Keywords: Adaptation, Physiological; Computer Simulation; Evolution, Molecular; Gene Regulatory Networks; Genotype; Heredity; Humans; Male; Models, Genetic; Mutation; Reproducibility of Results; Selection, Genetic; Stochastic Processes; Transcription, Genetic
[LeRoux2008Representational] Nicolas Le Roux and Yoshua Bengio. Representational power of restricted Boltzmann machines and deep belief networks. Neural Comput, 20(6):1631-1649, Jun 2008. [ bib | DOI | http ]
Deep belief networks (DBN) are generative neural network models with many layers of hidden explanatory factors, recently introduced by Hinton, Osindero, and Teh (2006) along with a greedy layer-wise unsupervised learning algorithm. The building block of a DBN is a probabilistic model called a restricted Boltzmann machine (RBM), used to represent one layer of the model. Restricted Boltzmann machines are interesting because inference is easy in them and because they have been successfully used as building blocks for training deeper models. We first prove that adding hidden units yields strictly improved modeling power, while a second theorem shows that RBMs are universal approximators of discrete distributions. We then study the question of whether DBNs with more layers are strictly more powerful in terms of representational power. This suggests a new and less greedy criterion for training RBMs within DBNs.

[Le2008systematic] K. G. Le Roch, J. R. Johnson, H. Ahiboh, D.-W. D. Chung, J. Prudhomme, D. Plouffe, K. Henson, Y. Zhou, W. Witola, J. R. Yates, C. Ben Mamoun, E. A. Winzeler, and H. Vial. A systematic approach to understand the mechanism of action of the bisthiazolium compound T4 on the human malaria parasite, Plasmodium falciparum. BMC Genomics, 9:513, 2008. [ bib | DOI | http ]
BACKGROUND: In recent years, a major increase in the occurrence of drug resistant falciparum malaria has been reported. Choline analogs, such as the bisthiazolium T4, represent a novel class of compounds with strong potency against drug sensitive and resistant P. falciparum clones. Although T4 and its analogs are presumed to target the parasite's lipid metabolism, their exact mechanism of action remains unknown. Here we have employed transcriptome and proteome profiling analyses to characterize the global response of P. falciparum to T4 during the intraerythrocytic cycle of this parasite. RESULTS: No significant transcriptional changes were detected immediately after addition of T4 despite the drug's effect on the parasite metabolism. Using the Ontology-based Pattern Identification (OPI) algorithm with an increased T4 incubation time, we demonstrated cell cycle arrest and a general induction of genes involved in gametocytogenesis. Proteomic analysis revealed a significant decrease in the level of the choline/ethanolamine-phosphotransferase (PfCEPT), a key enzyme involved in the final step of synthesis of phosphatidylcholine (PC). This effect was further supported by metabolic studies, which showed a major alteration in the synthesis of PC from choline and ethanolamine by the compound. CONCLUSION: Our studies demonstrate that the bisthiazolium compound T4 inhibits the pathways of synthesis of phosphatidylcholine from choline and ethanolamine in P. falciparum, and provide evidence for post-transcriptional regulations of parasite metabolism in response to external stimuli.

Keywords: lasso
[Launay2008Homology] G. Launay and T. Simonson. Homology modelling of protein-protein complexes: a simple method and its possibilities and limitations. BMC Bioinformatics, 9:427, 2008. [ bib | DOI | http ]
BACKGROUND: Structure-based computational methods are needed to help identify and characterize protein-protein complexes and their function. For individual proteins, the most successful technique is homology modelling. We investigate a simple extension of this technique to protein-protein complexes. We consider a large set of complexes of known structures, involving pairs of single-domain proteins. The complexes are compared with each other to establish their sequence and structural similarities and the relation between the two. Compared to earlier studies, a simpler dataset, a simpler structural alignment procedure, and an additional energy criterion are used. Next, we compare the X-ray structures to models obtained by threading the native sequence onto other, homologous complexes. An elementary requirement for a successful energy function is to rank the native structure above any threaded structure. We use the DFIREbeta energy function, whose quality and complexity are typical of the models used today. Finally, we compare near-native models to distinctly non-native models. RESULTS: If weakly stable complexes are excluded (defined by a binding energy cutoff), as well as a few unusual complexes, a simple homology principle holds: complexes that share more than 35% sequence identity share similar structures and interaction modes; this principle was less clearcut in earlier studies. The energy function was then tested for its ability to identify experimental structures among sets of decoys, produced by a simple threading procedure. On average, the experimental structure is ranked above 92% of the alternate structures. Thus, discrimination of the native structure is good but not perfect. The discrimination of near-native structures is fair. Typically, a single, alternate, non-native binding mode exists that has a native-like energy. Some of the associated failures may correspond to genuine, alternate binding modes and/or native complexes that are artefacts of the crystal environment. In other cases, additional model filtering with more sophisticated tools is needed. CONCLUSION: The results suggest that the simple modelling procedure applied here could help identify and characterize protein-protein complexes. The next step is to apply it on a genomic scale.

Keywords: Algorithms; Protein Binding; Protein Conformation; Protein Interaction Domains and Motifs; Proteins, chemistry/metabolism; Structural Homology, Protein
[Kohler2008Walking] S. Köhler, S. Bauer, D. Horn, and P. N. Robinson. Walking the interactome for prioritization of candidate disease genes. Am. J. Hum. Genet., 82(4):949-958, Apr 2008. [ bib | DOI | http ]
The identification of genes associated with hereditary disorders has contributed to improving medical care and to a better understanding of gene functions, interactions, and pathways. However, there are well over 1500 Mendelian disorders whose molecular basis remains unknown. At present, methods such as linkage analysis can identify the chromosomal region in which unknown disease genes are located, but the regions could contain up to hundreds of candidate genes. In this work, we present a method for prioritization of candidate genes by use of a global network distance measure, random walk analysis, for definition of similarities in protein-protein interaction networks. We tested our method on 110 disease-gene families with a total of 783 genes and achieved an area under the ROC curve of up to 98% on simulated linkage intervals of 100 genes surrounding the disease gene, significantly outperforming previous methods based on local distance measures. Our results not only provide an improved tool for positional-cloning projects but also add weight to the assumption that phenotypically similar diseases are associated with disturbances of subnetworks within the larger protein interactome that extend beyond the disease proteins themselves.

Keywords: Animals; Chromosome Mapping; Computational Biology; Databases, Genetic; Genetic Diseases, Inborn; Genetic Predisposition to Disease; Humans; Internet; Linkage (Genetics); Mice; Pedigree; Protein Interaction Mapping; Software
[Kuksa2008Scalable] Pavel P. Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. Scalable algorithms for string kernels with inexact matching. In Daphne Koller, Dale Schuurmans, Yoshua Bengio, and Léon Bottou, editors, NIPS, pages 881-888. MIT Press, 2008. [ bib | http ]
[Kondor2008skew] R. Kondor and K. M. Borgwardt. The skew spectrum of graphs. In ICML '08: Proceedings of the 25th international conference on Machine learning, pages 496-503, New York, NY, USA, 2008. ACM. [ bib | DOI | .pdf ]
[Kim2008Robust] J. S. Kim and C. Scott. Robust kernel density estimation. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing ICASSP 2008, pages 3381-3384, 2008. [ bib | DOI | http | .pdf ]
Keywords: kernelbook
[Kim2008Insights] Eddo Kim, Amir Goren, and Gil Ast. Insights into the connection between cancer and alternative splicing. Trends Genet., 24:7-10, 2008. [ bib ]
Keywords: csbcbook
[Kellenberger2008How] E. Kellenberger, C. Schalon, and D. Rognan. How to measure the similarity between protein ligand-binding sites? Current Computer-Aided Drug Design, 4(3):209-220, Sep. 2008. [ bib | DOI | http | .pdf ]
Quantification of local similarity between protein 3D structures is a promising tool in computer-aided drug design and prediction of biological function. Over the last ten years, several computational methods were proposed, mostly based on geometrical comparisons. This review summarizes the recent literature and gives an overview of available programs. A particular interest is given to the underlying methodologies. Our analysis points out strengths and weaknesses of the various approaches. If all described methods work relatively well when two binding sites obviously resemble each other, scoring potential solutions remains a difficult issue, especially if the similarity is low. The other challenging question is the protein flexibility, which is indeed difficult to evaluate from a static representation. Last, most of recently developed techniques are fast and can be applied to large amounts of data. Examples were carefully chosen to illustrate the wide applicability domain of the most popular methods: detection of common structural motifs, identification of secondary targets for a drug-like compound, comparison of binding sites across a functional family, comparison of homology models, database screening.

Keywords: chemogenomics
[Kallioniemi2008CGH] A. Kallioniemi. CGH microarrays and cancer. Curr Opin Biotechnol, 19(1):36-40, Feb 2008. [ bib | DOI | http | .pdf ]
Genetic alterations are a key feature of cancer cells and typically target biological processes and pathways that contribute to cancer pathogenesis. Array-based comparative genomic hybridization (aCGH) has provided a wealth of new information on copy number changes in cancer on a genome-wide level and aCGH data have also been utilized in cancer classification. More importantly, aCGH analyses have allowed highly accurate localization of specific genetic alterations that, for example, are associated with tumor progression, therapy response, or patient outcome. The genes involved in these aberrations are likely to contribute to cancer pathogenesis, and the high-resolution mapping by aCGH greatly facilitates the subsequent identification of these cancer-associated genes.

Keywords: csbcbook, csbcbook-ch2, cgh
[Jacob2008Protein] L. Jacob and J.-P. Vert. Protein-ligand interaction prediction: an improved chemogenomics approach. Bioinformatics, 24(19):2149-2156, 2008. [ bib | DOI | http | .pdf ]
Keywords: chemogenomics
[Jacob2008Efficient] L. Jacob and J.-P. Vert. Efficient peptide-MHC-I binding prediction for alleles with few known binders. Bioinformatics, 24(3):358-366, Feb 2008. [ bib | DOI | http | .pdf ]
MOTIVATION: In silico methods for the prediction of antigenic peptides binding to MHC class I molecules play an increasingly important role in the identification of T-cell epitopes. Statistical and machine learning methods in particular are widely used to score candidate binders based on their similarity with known binders and non-binders. The genes coding for the MHC molecules, however, are highly polymorphic, and statistical methods have difficulties building models for alleles with few known binders. In this context, recent work has demonstrated the utility of leveraging information across alleles to improve the performance of the prediction. RESULTS: We design a support vector machine algorithm that is able to learn peptide-MHC-I binding models for many alleles simultaneously, by sharing binding information across alleles. The sharing of information is controlled by a user-defined measure of similarity between alleles. We show that this similarity can be defined in terms of supertypes, or more directly by comparing key residues known to play a role in the peptide-MHC binding. We illustrate the potential of this approach on various benchmark experiments where it outperforms other state-of-the-art methods. AVAILABILITY: The method is implemented on a web server: http://cbio.ensmp.fr/kiss. All data and codes are freely and publicly available from the authors.

Keywords: chemogenomics immunoinformatics
[Jacob2008Virtual] L. Jacob, B. Hoffmann, V. Stoven, and J.-P. Vert. Virtual screening of GPCRs: an in silico chemogenomics approach. BMC Bioinformatics, 9:363, 2008. [ bib | DOI | http | .pdf ]
Keywords: chemogenomics
[Jacob2008VirtualOLD] L. Jacob, B. Hoffmann, V. Stoven, and J.-P. Vert. Virtual screening of GPCRs: an in silico chemogenomics approach. Technical Report 0801.4301, arXiv, 2008. [ bib ]
[Ideker2008Protein] T. Ideker and R. Sharan. Protein networks in disease. Genome Res, 18(4):644-652, Apr 2008. [ bib | DOI | http | .pdf ]
During a decade of proof-of-principle analysis in model organisms, protein networks have been used to further the study of molecular evolution, to gain insight into the robustness of cells to perturbation, and for assignment of new protein functions. Following these analyses, and with the recent rise of protein interaction measurements in mammals, protein networks are increasingly serving as tools to unravel the molecular basis of disease. We review promising applications of protein networks to disease in four major areas: identifying new disease genes; the study of their network properties; identifying disease-related subnetworks; and network-based disease classification. Applications in infectious disease, personalized medicine, and pharmacology are also forthcoming as the available protein network information improves in quality and coverage.

[Harris2008Single-molecule] Timothy D Harris, Phillip R Buzby, Hazen Babcock, Eric Beer, Jayson Bowers, Ido Braslavsky, Marie Causey, Jennifer Colonell, James Dimeo, J William Efcavitch, Eldar Giladi, Jaime Gill, John Healy, Mirna Jarosz, Dan Lapen, Keith Moulton, Stephen R Quake, Kathleen Steinmann, Edward Thayer, Anastasia Tyurina, Rebecca Ward, Howard Weiss, and Zheng Xie. Single-molecule DNA sequencing of a viral genome. Science, 320(5872):106-109, Apr 2008. [ bib | DOI | http ]
The full promise of human genomics will be realized only when the genomes of thousands of individuals can be sequenced for comparative analysis. A reference sequence enables the use of short read length. We report an amplification-free method for determining the nucleotide sequence of more than 280,000 individual DNA molecules simultaneously. A DNA polymerase adds labeled nucleotides to surface-immobilized primer-template duplexes in stepwise fashion, and the asynchronous growth of individual DNA molecules was monitored by fluorescence imaging. Read lengths of >25 bases and equivalent phred software program quality scores approaching 30 were achieved. We used this method to sequence the M13 virus to an average depth of >150x and with 100% coverage; thus, we resequenced the M13 genome with high-sensitivity mutation detection. This demonstrates a strategy for high-throughput low-cost resequencing.

Keywords: Algorithms; Bacteriophage M13; Computational Biology; DNA Primers; DNA, Viral; Genome, Viral; Mutation; Sequence Alignment; Sequence Analysis, DNA; Software; Templates, Genetic
[Harchaoui2008Catching] Z. Harchaoui and C. Levy-Leduc. Catching change-points with lasso. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Adv. Neural. Inform. Process Syst., volume 20, pages 617-624. MIT Press, Cambridge, MA, 2008. [ bib ]
[Harchaoui2008Methodes] Z. Harchaoui. Méthodes à noyaux pour la détection. PhD thesis, Telecom ParisTech, 2008. [ bib ]
[Han2008Apoptosis] L. Han, Y. Zhao, and X. Jia. Mathematical modeling identified c-FLIP as an apoptotic switch in death receptor induced apoptosis. Apoptosis, 13(10):1198-1204, 2008. [ bib ]
Apoptosis is an essential process to get rid of injured or unwanted cells. In this study, we proposed a mathematical modeling for death receptor mediated apoptosis to investigate the role of c-FLIP in controlling the balance between apoptosis and survival. In order to get insight into how NF-kappa B mediated pro-survival pathway affects the outcome of our modeling, we implemented reduced models without taking such regulation into consideration. Our simulation revealed that c-FLIP could act as a pivotal death or life switch and this switch-like behavior is bistable, irreversible, and robust. We introduce a new term, probability apoptosis, to delineate the likelihood in occurrence of apoptosis events. This simulation system is plausible and may offer several valuable clinical indications for the abnormal apoptosis related disease, such as cancer.

Keywords: csbcbook
[Haibe-Kains2008comparative] B. Haibe-Kains, C. Desmedt, C. Sotiriou, and G. Bontempi. A comparative study of survival models for breast cancer prognostication based on microarray data: does a single gene beat them all? Bioinformatics, 24(19):2200-2208, 2008. [ bib | DOI | http | .pdf ]
[Haibe-Kains2008Comparison] B. Haibe-Kains, C. Desmedt, F. Piette, M. Buyse, F. Cardoso, L. Van't Veer, M. Piccart, G. Bontempi, and C. Sotiriou. Comparison of prognostic gene expression signatures for breast cancer. BMC Genomics, 9:394, 2008. [ bib | DOI | http | .pdf ]
BACKGROUND: During the last years, several groups have identified prognostic gene expression signatures with apparently similar performances. However, signatures were never compared on an independent population of untreated breast cancer patients, where risk assessment was computed using the original algorithms and microarray platforms. RESULTS: We compared three gene expression signatures, the 70-gene, the 76-gene and the Gene expression Grade Index (GGI) signatures, in terms of predicting distant metastasis free survival (DMFS) for the individual patient. To this end, we used the previously published TRANSBIG independent validation series of node-negative untreated primary breast cancer patients. We observed agreement in prediction for 135 of 198 patients (68%) when considering the three signatures. When comparing the signatures two by two, the agreement in prediction was 71% for the 70- and 76-gene signatures, 76% for the 76-gene signature and the GGI, and 88% for the 70-gene signature and the GGI. The three signatures had similar capabilities of predicting DMFS and added significant prognostic information to that provided by the classical parameters. CONCLUSION: Despite the difference in development of these signatures and the limited overlap in gene identity, they showed similar prognostic performance, adding to the growing evidence that these prognostic signatures are of clinical relevance.

Keywords: breastcancer
[Goendoer2008High-resolution] Anita Göndör, Carole Rougier, and Rolf Ohlsson. High-resolution circular chromosome conformation capture assay. Nat Protoc, 3(2):303-313, 2008. [ bib | DOI | http ]
The pioneering chromosome conformation capture (3C) method provides the opportunity to study chromosomal folding in the nucleus. It is based on formaldehyde cross-linking of living cells followed by enzyme digestion, intramolecular ligation and quantitative (Q)-PCR analysis. However, 3C requires prior knowledge of the bait and interacting sequence (termed interactor) rendering it less useful for genome-wide studies. As several recent reports document, this limitation has been overcome by exploiting a circular intermediate in a variant of the 3C method, termed 4C (for circular 3C). The strategic positioning of primers within the bait enables the identification of unknown interacting sequences, which form part of the circular DNA. Here, we describe a protocol for our 4C method, which produces a high-resolution interaction map potentially suitable for the analysis of cis-regulatory elements and for comparison with chromatin marks obtained by chromatin immunoprecipitation (ChIP) on chip at the sites of interaction. Following optimization of enzyme digestions and amplification conditions, the protocol can be completed in 2-3 weeks.

Keywords: Chromatin; Chromosomes, Human, Pair 11; DNA; DNA Restriction Enzymes; DNA, Circular; Formaldehyde; Genetic Techniques; Humans; Nucleic Acid Conformation
[Gutin09Asymmetric] G. Gutin, D. Karapetyan, and N. Krasnogor. Memetic algorithm for the generalized asymmetric traveling salesman problem. In NICSO 2007, pages 199-210. Springer Berlin, 2008. [ bib ]
[Geppert2008Support-vector-machine-based] H. Geppert, T. Horváth, T. Gärtner, S. Wrobel, and J. Bajorath. Support-vector-machine-based ranking significantly improves the effectiveness of similarity searching using 2d fingerprints and multiple reference compounds. J Chem Inf Model, 48(4):742-746, Apr 2008. [ bib | DOI | http | .pdf ]
Similarity searching using molecular fingerprints is computationally efficient and a surprisingly effective virtual screening tool. In this study, we have compared ranking methods for similarity searching using multiple active reference molecules. Different 2D fingerprints were used as search tools and also as descriptors for a support vector machine (SVM) algorithm. In systematic database search calculations, a SVM-based ranking scheme consistently outperformed nearest neighbor and centroid approaches, regardless of the fingerprints that were tested, even if only very small training sets were used for SVM learning. The superiority of SVM-based ranking over conventional fingerprint methods is ascribed to the fact that SVM makes use of information about database molecules, in addition to known active compounds, during the learning phase.

Keywords: chemoinformatics, PUlearning
[Galluzzi2008Cell] L. Galluzzi and G. Kroemer. Necroptosis: A specialized pathway of programmed necrosis. Cell, 135(7):1161-1163, 2008. doi: 10.1016/j.cell.2008.12.004. [ bib ]
Keywords: csbcbook
[Fukumizu2008Statistical] K. Fukumizu, F. R. Bach, and A. Gretton. Statistical consistency of kernel canonical correlation analysis. J. Mach. Learn. Res., 8:361-383, 2008. [ bib | .pdf | .pdf ]
[Friedman2008Sparse] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432-441, Jul 2008. [ bib | DOI | http | .pdf ]
We consider the problem of estimating sparse graphs by a lasso penalty applied to the inverse covariance matrix. Using a coordinate descent procedure for the lasso, we develop a simple algorithm, the graphical lasso, that is remarkably fast: It solves a 1000-node problem (approximately 500,000 parameters) in at most a minute and is 30-4000 times faster than competing methods. It also provides a conceptual link between the exact problem and the approximation suggested by Meinshausen and Bühlmann (2006). We illustrate the method on some cell-signaling data from proteomics.

[Finetti2008Sixteen-kinase] P. Finetti, N. Cervera, E. Charafe-Jauffret, C. Chabannon, C. Charpin, M. Chaffanet, J. Jacquemier, P. Viens, D. Birnbaum, and F. Bertucci. Sixteen-kinase gene expression identifies luminal breast cancers with poor prognosis. Cancer Res., 68(3):767-776, Feb 2008. [ bib | DOI | http | .pdf ]
Breast cancer is a heterogeneous disease made of various molecular subtypes with different prognosis. However, evolution remains difficult to predict within some subtypes, such as luminal A, and treatment is not as adapted as it should be. Refinement of prognostic classification and identification of new therapeutic targets are needed. Using oligonucleotide microarrays, we profiled 227 breast cancers. We focused our analysis on two major breast cancer subtypes with opposite prognosis, luminal A (n = 80) and basal (n = 58), and on genes encoding protein kinases. Whole-kinome expression separated luminal A and basal tumors. The expression (measured by a kinase score) of 16 genes encoding serine/threonine kinases involved in mitosis distinguished two subgroups of luminal A tumors: Aa, of good prognosis and Ab, of poor prognosis. This classification and its prognostic effect were validated in 276 luminal A cases from three independent series profiled across different microarray platforms. The classification outperformed the current prognostic factors in univariate and multivariate analyses in both training and validation sets. The luminal Ab subgroup, characterized by high mitotic activity compared with luminal Aa tumors, displayed clinical characteristics and a kinase score intermediate between the luminal Aa subgroup and the luminal B subtype, suggesting a continuum in luminal tumors. Some of the mitotic kinases of the signature represent therapeutic targets under investigation. The identification of luminal A cases of poor prognosis should help select appropriate treatment, whereas the identification of a relevant kinase set provides potential targets.

Keywords: csbcbook, csbcbook-ch3
[Filipowicz2008Mechanisms] W. Filipowicz, S. N. Bhattacharyya, and N. Sonenberg. Mechanisms of post-transcriptional regulation by microRNAs: are the answers in sight? Nat. Rev. Genet., 9(2):102-114, Feb 2008. [ bib | DOI | http | .pdf ]
MicroRNAs constitute a large family of small, approximately 21-nucleotide-long, non-coding RNAs that have emerged as key post-transcriptional regulators of gene expression in metazoans and plants. In mammals, microRNAs are predicted to control the activity of approximately 30% of all protein-coding genes, and have been shown to participate in the regulation of almost every cellular process investigated so far. By base pairing to mRNAs, microRNAs mediate translational repression or mRNA degradation. This Review summarizes the current understanding of the mechanistic aspects of microRNA-induced repression of translation and discusses some of the controversies regarding different modes of microRNA function.

Keywords: csbcbook
[Fiebitz2008High-throughput] Andrea Fiebitz, Lajos Nyarsik, Bernard Haendler, Yu-Hui Hu, Florian Wagner, Sabine Thamm, Hans Lehrach, Michal Janitz, and Dominique Vanhecke. High-throughput mammalian two-hybrid screening for protein-protein interactions using transfected cell arrays. BMC Genomics, 9:68, 2008. [ bib | DOI | http | .pdf ]
BACKGROUND: Most of the biological processes rely on the formation of protein complexes. Investigation of protein-protein interactions (PPI) is therefore essential for understanding of cellular functions. It is advantageous to perform mammalian PPI analysis in mammalian cells because the expressed proteins can then be subjected to essential post-translational modifications. Until now mammalian two-hybrid assays have been performed on individual gene scale. We here describe a new and cost-effective method for the high-throughput detection of protein-protein interactions in mammalian cells that combines the advantages of mammalian two-hybrid systems with those of DNA microarrays. RESULTS: In this cell array protein-protein interaction assay (CAPPIA), mixtures of bait and prey expression plasmids together with an auto-fluorescent reporter are immobilized on glass slides in defined array formats. Adherent cells that grow on top of the micro-array will become fluorescent only if the expressed proteins interact and subsequently trans-activate the reporter. Using known interaction partners and by screening 160 different combinations of prey and bait proteins associated with the human androgen receptor we demonstrate that this assay allows the quantitative detection of specific protein interactions in different types of mammalian cells and under the influence of different compounds. Moreover, different strategies in respect to bait-prey combinations are presented. CONCLUSION: We demonstrate that the CAPPIA assay allows the quantitative detection of specific protein interactions in different types of mammalian cells and under the influence of different compounds. The high number of preys that can be tested per slide together with the flexibility to interrogate any bait of interest and the small amounts of reagents that are required makes this assay currently one of the most economical high-throughput detection assays for protein-protein interactions in mammalian cells.

[Faith2008Many] J.J. Faith, M.E. Driscoll, V.A. Fusaro, E.J. Cosgrove, B. Hayete, F.S. Juhn, S.J. Schneider, and T.S. Gardner. Many microbe microarrays database: uniformly normalized affymetrix compendia with structured experimental metadata. Nucleic Acids Res., 36(Database issue):D866-D870, Jan 2008. [ bib | DOI | http | .pdf ]
Many Microbe Microarrays Database (M3D) is designed to facilitate the analysis and visualization of expression data in compendia compiled from multiple laboratories. M3D contains over a thousand Affymetrix microarrays for Escherichia coli, Saccharomyces cerevisiae and Shewanella oneidensis. The expression data is uniformly normalized to make the data generated by different laboratories and researchers more comparable. To facilitate computational analyses, M3D provides raw data (CEL file) and normalized data downloads of each compendium. In addition, web-based construction, visualization and download of custom datasets are provided to facilitate efficient interrogation of the compendium for more focused analyses. The experimental condition metadata in M3D is human curated with each chemical and growth attribute stored as a structured and computable set of experimental features with consistent naming conventions and units. All versions of the normalized compendia constructed for each species are maintained and accessible in perpetuity to facilitate the future interpretation and comparison of results published on M3D data. M3D is accessible at http://m3d.bu.edu/.

[Fabbri2008MicroRNAs] M. Fabbri, C. M. Croce, and G. A. Calin. MicroRNAs. Cancer J., 14(1):1-6, 2008. [ bib | DOI | http ]
MicroRNAs (miRNAs) are small, noncoding RNAs with regulatory functions, which play an important role in many human diseases, including cancer. An emerging number of studies show that miRNAs can act either as oncogenes or as tumor suppressor genes or sometimes as both. Germline, somatic mutations and polymorphisms can contribute to cancer predisposition. miRNA expression levels have diagnostic and prognostic implications, and their role as anticancer therapeutic agents is promising and currently under investigation.

Keywords: csbcbook
[Eulalio2008Getting] Ana Eulalio, Eric Huntzinger, and Elisa Izaurralde. Getting to the root of mirna-mediated gene silencing. Cell, 132(1):9-14, Jan 2008. [ bib | DOI | http ]
MicroRNAs are approximately 22 nucleotide-long RNAs that silence gene expression posttranscriptionally by binding to the 3' untranslated regions of target mRNAs. Although much is known about their biogenesis and biological functions, the mechanisms allowing miRNAs to silence gene expression in animal cells are still under debate. Here, we discuss current models for miRNA-mediated gene silencing and formulate a hypothesis to reconcile differences.

Keywords: sirna
[Esteller2008Epigenetics] M. Esteller. Epigenetics in cancer. N. Engl. J. Med., 358(11):1148-1159, Mar 2008. [ bib | DOI | http | .pdf ]
Keywords: csbcbook
[Elkan2008Learning] C. Elkan and K. Noto. Learning classifiers from only positive and unlabeled data. In KDD '08: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 213-220, New York, NY, USA, 2008. ACM. [ bib | DOI | http | .pdf ]
Keywords: PUlearning
[Duchi2008Efficient] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Andrew McCallum and Sam Roweis, editors, Proceedings of the 25th Annual International Conference on Machine Learning (ICML 2008), pages 272-279. Omnipress, 2008. [ bib ]
[Didiano2008Molecular] Dominic Didiano and Oliver Hobert. Molecular architecture of a mirna-regulated 3' utr. RNA, 14(7):1297-1317, Jul 2008. [ bib | DOI | http | .pdf ]
Animal genomes contain hundreds of microRNAs (miRNAs), small regulatory RNAs that control gene expression by binding to complementary sites in target mRNAs. Some rules that govern miRNA/target interaction have been elucidated but their general applicability awaits further experimentation on a case-by-case basis. We use here an assay system in transgenic nematodes to analyze the interaction of the Caenorhabditis elegans lsy-6 miRNA with 3' UTR sequences. In contrast to many previously described assay systems used to analyze miRNA/target interactions, our assay system operates within the cellular context in which lsy-6 normally functions, a single neuron in the nervous system of C. elegans. Through extensive mutational analysis, we define features in the known and experimentally validated target of lsy-6, the 3' UTR of the cog-1 homeobox gene, that are required for a functional miRNA/target interaction. We describe that both in the context of the cog-1 3' UTR and in the context of heterologous 3' UTRs, one or more seed matches are not a reliable predictor for a functional miRNA/target interaction. We rather find that two nonsequence specific contextual features beyond miRNA target sites are critical determinants of miRNA-mediated 3' UTR regulation. The contextual features reside 3' of lsy-6 binding sites in the 3' UTR and act in a combinatorial manner; mutation of each results in limited defects in 3' UTR regulation, but a combinatorial deletion results in complete loss of 3' UTR regulation. Together with two lsy-6 sites, these two contextual features are capable of imparting regulation on a heterologous 3' UTR. Moreover, the contextual features need to be present in a specific configuration relative to miRNA binding sites and could either represent protein binding sites or provide an appropriate structural context. 
We conclude that a given target site resides in a 3' UTR context that evolved beyond target site complementarity to support regulation by a specific miRNA. The large number of 3' UTRs that we analyzed in this study will also be useful to computational biologists in designing the next generation of miRNA/target prediction algorithms.

Keywords: sirna
[Dias2008Molecular] Raquel Dias and Walter Filgueira de Azevedo. Molecular docking algorithms. Curr. Drug Targets, 9(12):1040-1047, Dec 2008. [ bib ]
By means of virtual screening of small molecules databases it is possible to identify new potential inhibitors against a target of interest. Molecular docking is a computer simulation procedure to predict the conformation of a receptor-ligand complex. Each docking program makes use of one or more specific search algorithms, which are the methods used to predict the possible conformations of a binary complex. In the present review we describe several molecular-docking search algorithms, and the programs which apply such methodologies. We also discuss how virtual screening can be optimized, describing methods that may increase accuracy of the simulation process, with relatively fast docking algorithms.

Keywords: Algorithms; Database Management Systems; Information Storage and Retrieval; Models, Molecular; Molecular Conformation; Monte Carlo Method
[Desmedt2008Biological] C. Desmedt, B. Haibe-Kains, P. Wirapati, M. Buyse, D. Larsimont, G. Bontempi, M. Delorenzi, M. Piccart, and C. Sotiriou. Biological processes associated with breast cancer clinical outcome depend on the molecular subtypes. Clin. Cancer Res., 14(16):5158-5165, Aug 2008. [ bib | DOI | http | .pdf ]
Recently, several prognostic gene expression signatures have been identified; however, their performance has never been evaluated according to the previously described molecular subtypes based on the estrogen receptor (ER) and human epidermal growth factor receptor 2 (HER2), and their biological meaning has remained unclear. Here we aimed to perform a comprehensive meta-analysis integrating both clinicopathologic and gene expression data, focusing on the main molecular subtypes. We developed gene expression modules related to key biological processes in breast cancer such as tumor invasion, immune response, angiogenesis, apoptosis, proliferation, and ER and HER2 signaling, and then analyzed these modules together with clinical variables and several prognostic signatures on publicly available microarray studies (>2,100 patients). Multivariate analysis showed that in the ER+/HER2- subgroup, only the proliferation module and the histologic grade were significantly associated with clinical outcome. In the ER-/HER2- subgroup, only the immune response module was associated with prognosis, whereas in the HER2+ tumors, the tumor invasion and immune response modules displayed significant association with survival. Proliferation was identified as the most important component of several prognostic signatures, and their performance was limited to the ER+/HER2- subgroup. Although proliferation is the strongest parameter predicting clinical outcome in the ER+/HER2- subtype and the common denominator of most prognostic gene signatures, immune response and tumor invasion seem to be the main molecular processes associated with prognosis in the ER-/HER2- and HER2+ subgroups, respectively. These findings may help to define new clinicogenomic models and to identify new therapeutic strategies in the specific molecular subgroups.

[Cui2008Two] J. Cui, C. Chen, H. Lu, T. Sun, and P. Shen. Two independent positive feedbacks and bistability in the bcl-2 apoptotic switch. PLoS ONE, 3(1):e1469, 01 2008. [ bib | DOI | http | .pdf ]
Background - The complex interplay between B-cell lymphoma 2 (Bcl-2) family proteins constitutes a crucial checkpoint in apoptosis. Its detailed molecular mechanism remains controversial. Our former modeling studies have selected the "Direct Activation Model" as a better explanation for experimental observations. In this paper, we continue to extend this model by adding interactions according to updating experimental findings. Methodology/Principal Findings - Through mathematical simulation we found bistability, a kind of switch, can arise from a positive (double negative) feedback in the Bcl-2 interaction network established by anti-apoptotic group of Bcl-2 family proteins. Moreover, Bax/Bak auto-activation as an independent positive feedback can enforce the bistability, and make it more robust to parameter variations. By ensemble stochastic modeling, we also elucidated how intrinsic noise can change ultrasensitive switches into gradual responses. Our modeling result agrees well with recent experimental data where bimodal Bax activation distributions in cell population were found. Conclusions/Significance - Along with the growing experimental evidences, our studies successfully elucidate the switch mechanism embedded in the Bcl-2 interaction network and provide insights into pharmacological manipulation of Bcl-2 apoptotic switch as further cancer therapies.

Keywords: csbcbook
[Croce2008Oncogenes] C. M. Croce. Oncogenes and cancer. N. Engl. J. Med., 358(5):502-511, Jan 2008. [ bib | DOI | http | .pdf ]
Keywords: csbcbook
[Chin2008Translating] L. Chin and J. W. Gray. Translating insights from the cancer genome into clinical practice. Nature, 452(7187):553-563, Apr 2008. [ bib | DOI | http | .pdf ]
Cancer cells have diverse biological capabilities that are conferred by numerous genetic aberrations and epigenetic modifications. Today's powerful technologies are enabling these changes to the genome to be catalogued in detail. Tomorrow is likely to bring a complete atlas of the reversible and irreversible alterations that occur in individual cancers. The challenge now is to work out which molecular abnormalities contribute to cancer and which are simply 'noise' at the genomic and epigenomic levels. Distinguishing between these will aid in understanding how the aberrations in a cancer cell collaborate to drive pathophysiology. Past successes in converting information from genomic discoveries into clinical tools provide valuable lessons to guide the translation of emerging insights from the genome into clinical end points that can affect the practice of cancer medicine.

Keywords: csbcbook-ch3
[Chi2008year] K. R. Chi. The year of sequencing. Nat. Methods, 5(1):11-14, Jan 2008. [ bib | DOI | http | .pdf ]
In 2007, the next-generation sequencing technologies have come into their own with an impressive array of successful applications. Kelly Rae Chi reports.

Keywords: csbcbook, csbcbook-ch2
[Chen2008Mapping] Wei Chen, Vera Kalscheuer, Andreas Tzschach, Corinna Menzel, Reinhard Ullmann, Marcel Holger Schulz, Fikret Erdogan, Na Li, Zofia Kijas, Ger Arkesteijn, Isidora Lopez Pajares, Margret Goetz-Sothmann, Uwe Heinrich, Imma Rost, Andreas Dufke, Ute Grasshoff, Birgitta Glaeser, Martin Vingron, and H. Hilger Ropers. Mapping translocation breakpoints by next-generation sequencing. Genome Res., 18(7):1143-1149, Jul 2008. [ bib | DOI | http | .pdf ]
Balanced chromosome rearrangements (BCRs) can cause genetic diseases by disrupting or inactivating specific genes, and the characterization of breakpoints in disease-associated BCRs has been instrumental in the molecular elucidation of a wide variety of genetic disorders. However, mapping chromosome breakpoints using traditional methods, such as in situ hybridization with fluorescent dye-labeled bacterial artificial chromosome clones (BAC-FISH), is rather laborious and time-consuming. In addition, the resolution of BAC-FISH is often insufficient to unequivocally identify the disrupted gene. To overcome these limitations, we have performed shotgun sequencing of flow-sorted derivative chromosomes using "next-generation" (Illumina/Solexa) multiplex sequencing-by-synthesis technology. As shown here for three different disease-associated BCRs, the coverage attained by this platform is sufficient to bridge the breakpoints by PCR amplification, and this procedure allows the determination of their exact nucleotide positions within a few weeks. Its implementation will greatly facilitate large-scale breakpoint mapping and gene finding in patients with disease-associated balanced translocations.

Keywords: ngs, csbcbook, csbcbook-ch2
[Chang2008Fast] C. Q. Chang, Z. Ding, Y. S. Hung, and P. C. Fung. Fast network component analysis (fastnca) for gene regulatory network reconstruction from microarray data. Bioinformatics, 24(11):1349-1358, 2008. [ bib ]
[Cavasotto2008Discovery] C. N. Cavasotto, A. J. W. Orry, N. J. Murgolo, M. F. Czarniecki, S. A. Kocsi, B. E. Hawes, K. A. O'Neill, H. Hine, M. S. Burton, J. H. Voigt, R. A. Abagyan, M. L. Bayne, and F. J. Monsma. Discovery of novel chemotypes to a G-protein-coupled receptor through ligand-steered homology modeling and structure-based virtual screening. J. Med. Chem., 51(3):581-588, Feb 2008. [ bib | DOI | http ]
Melanin-concentrating hormone receptor 1 (MCH-R1) is a G-protein-coupled receptor (GPCR) and a target for the development of therapeutics for obesity. The structure-based development of MCH-R1 and other GPCR antagonists is hampered by the lack of an available experimentally determined atomic structure. A ligand-steered homology modeling approach has been developed (where information about existing ligands is used explicitly to shape and optimize the binding site) followed by docking-based virtual screening. Top scoring compounds identified virtually were tested experimentally in an MCH-R1 competitive binding assay, and six novel chemotypes as low micromolar affinity antagonist "hits" were identified. This success rate is more than a 10-fold improvement over random high-throughput screening, which supports our ligand-steered method. Clearly, the ligand-steered homology modeling method reduces the uncertainty of structure modeling for difficult targets like GPCRs.

Keywords: chemogenomics
[Candes2008The] E. Candès. The restricted isometry property. Comptes Rendus de l'Académie des Sciences, Paris, 1(346):589-592, 2008. [ bib ]
[Campbell2008Identification] Peter J Campbell, Philip J Stephens, Erin D Pleasance, Sarah O'Meara, Heng Li, Thomas Santarius, Lucy A Stebbings, Catherine Leroy, Sarah Edkins, Claire Hardy, Jon W Teague, Andrew Menzies, Ian Goodhead, Daniel J Turner, Christopher M Clee, Michael A Quail, Antony Cox, Clive Brown, Richard Durbin, Matthew E Hurles, Paul A W Edwards, Graham R Bignell, Michael R Stratton, and P Andrew Futreal. Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nat. Genet., 40(6):722-729, Jun 2008. [ bib | DOI | http | .pdf ]
Human cancers often carry many somatically acquired genomic rearrangements, some of which may be implicated in cancer development. However, conventional strategies for characterizing rearrangements are laborious and low-throughput and have low sensitivity or poor resolution. We used massively parallel sequencing to generate sequence reads from both ends of short DNA fragments derived from the genomes of two individuals with lung cancer. By investigating read pairs that did not align correctly with respect to each other on the reference human genome, we characterized 306 germline structural variants and 103 somatic rearrangements to the base-pair level of resolution. The patterns of germline and somatic rearrangement were markedly different. Many somatic rearrangements were from amplicons, although rearrangements outside these regions, notably including tandem duplications, were also observed. Some somatic rearrangements led to abnormal transcripts, including two from internal tandem duplications and two fusion transcripts created by interchromosomal rearrangements. Germline variants were predominantly mediated by retrotransposition, often involving AluY and LINE elements. The results demonstrate the feasibility of systematic, genome-wide characterization of rearrangements in complex human cancer genomes, raising the prospect of a new harvest of genes associated with cancer using this strategy.

Keywords: ngs
[Calzone2008comprehensive] L. Calzone, A. Gelay, A. Zinovyev, F. Radvanyi, and E. Barillot. A comprehensive modular map of molecular interactions in RB/E2F pathway. Mol. Syst. Biol., 4:173, 2008. [ bib | DOI | http | .pdf ]
We present, here, a detailed and curated map of molecular interactions taking place in the regulation of the cell cycle by the retinoblastoma protein (RB/RB1). Deregulations and/or mutations in this pathway are observed in most human cancers. The map was created using Systems Biology Graphical Notation language with the help of CellDesigner 3.5 software and converted into BioPAX 2.0 pathway description format. In the current state the map contains 78 proteins, 176 genes, 99 protein complexes, 208 distinct chemical species and 165 chemical reactions. Overall, the map recapitulates biological facts from approximately 350 publications annotated in the diagram. The network contains more details about RB/E2F interaction network than existing large-scale pathway databases. Structural analysis of the interaction network revealed a modular organization of the network, which was used to elaborate a more summarized, higher-level representation of RB/E2F network. The simplification of complex networks opens the road for creating realistic computational models of this regulatory pathway.

Keywords: csbcbook
[Bouchard2008Efficient] Alexandre Bouchard-Côté, Michael I. Jordan, and Dan Klein. Efficient inference in phylogenetic indel trees. In Daphne Koller, Dale Schuurmans, Yoshua Bengio, and Léon Bottou, editors, NIPS, pages 177-184. MIT Press, 2008. [ bib | http ]
[Bonilla2008Multi-task] Edwin Bonilla, Kian Ming Chai, and Chris Williams. Multi-task gaussian process prediction. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, MA, 2008. [ bib ]
[Bonachera2008Fuzzy] F. Bonachéra and D. Horvath. Fuzzy tricentric pharmacophore fingerprints. 2. application of topological fuzzy pharmacophore triplets in quantitative structure-activity relationships. J. Chem. Inf. Model., 48(2):409-425, Feb 2008. [ bib | DOI | http | .pdf ]
Topological fuzzy pharmacophore triplets (2D-FPT), using the number of interposed bonds to measure separation between the atoms representing pharmacophore types, were employed to establish and validate quantitative structure-activity relationships (QSAR). Thirteen data sets for which state-of-the-art QSAR models were reported in literature were revisited in order to benchmark 2D-FPT biological activity-explaining propensities. Linear and nonlinear QSAR models were constructed for each compound series (following the original author's splitting into training/validation subsets) with three different 2D-FPT versions, using the genetic algorithm-driven Stochastic QSAR sampler (SQS) to pick relevant triplets and fit their coefficients. 2D-FPT QSARs are computationally cheap, interpretable, and perform well in benchmarking. In a majority of cases (10/13), default 2D-FPT models validated better than or as well as the best among those reported, including 3D overlay-dependent approaches. Most of the analogues series, either unaffected by protonation equilibria or unambiguously adopting expected protonation states, were equally well described by rule- or pKa-based pharmacophore flagging. Thermolysin inhibitors represent a notable exception: pKa-based flagging boosts model quality, although, surprisingly, not due to proteolytic equilibrium effects. The optimal degree of 2D-FPT fuzziness is compound set dependent. This work further confirmed the higher robustness of nonlinear over linear SQS models. In spite of the wealth of studied sets, benchmarking is nevertheless flawed by low intraset diversity: a whole series of thereby caused artifacts were evidenced, implicitly raising questions about the way QSAR studies are conducted nowadays. An in-depth investigation of thrombin inhibition models revealed that some of the selected triplets make sense (one of these stands for a topological pharmacophore covering the P1 and P2 binding pockets).
Nevertheless, equations were either unable to predict the activity of the structurally different ligands or tended to indiscriminately predict any compound outside the training family to be active. 2D-FPT QSARs do however not depend on any common scaffold required for molecule superimposition and may in principle be trained on hand of diverse sets, which is a must in order to obtain widely applicable models. Adding (assumed) inactives of various families for training enabled discovery of models that specifically recognize the structurally different actives.

Keywords: chemoinformatics
[Blow2008DNA] N. Blow. DNA sequencing: generation next-next. Nat. Meth., 5(3):267-274, 2008. [ bib | DOI | http | .pdf ]
Emboldened by the success of next-generation sequencing, scientists are pursuing the holy grail of genomics, the '$1,000 genome', with single-molecule approaches. Nathan Blow reports.

Keywords: csbcbook-ch2, csbcbook
[Bickel2008Multi-task] Steffen Bickel, Jasmina Bogojeska, Thomas Lengauer, and Tobias Scheffer. Multi-task learning for hiv therapy screening. In ICML'08: Proceedings of the 25th international conference on Machine learning, pages 56-63, 2008. [ bib ]
[Bertoni2008Discovering] Alberto Bertoni and Giorgio Valentini. Discovering multi-level structures in bio-molecular data through the bernstein inequality. BMC Bioinformatics, 9 Suppl 2:S4, 2008. [ bib | DOI | http ]
The unsupervised discovery of structures (i.e. clusterings) underlying data is a central issue in several branches of bioinformatics. Methods based on the concept of stability have been recently proposed to assess the reliability of a clustering procedure and to estimate the "optimal" number of clusters in bio-molecular data. A major problem with stability-based methods is the detection of multi-level structures (e.g. hierarchical functional classes of genes), and the assessment of their statistical significance. In this context, a chi-square based statistical test of hypothesis has been proposed; however, to assure the correctness of this technique some assumptions about the distribution of the data are needed. To assess the statistical significance and to discover multi-level structures in bio-molecular data, a new method based on Bernstein's inequality is proposed. This approach makes no assumptions about the distribution of the data, thus assuring a reliable application to a large range of bioinformatics problems. Results with synthetic and DNA microarray data show the effectiveness of the proposed method. The Bernstein test, due to its loose assumptions, is more sensitive than the chi-square test to the detection of multiple structures simultaneously present in the data. Nevertheless it is less selective, that is subject to more false positives, but adding independence assumptions, a more selective variant of the Bernstein inequality-based test is also presented. The proposed methods can be applied to discover multiple structures and to assess their significance in different types of bio-molecular data.

[Behre2008Structural] J. Behre, T. Wilhelm, A. von Kamp, E. Ruppin, and S. Schuster. Structural robustness of metabolic networks with respect to multiple knockouts. J Theor Biol, 252(3):433-441, Jun 2008. [ bib | DOI | http | .pdf ]
We present a generalised framework for analysing structural robustness of metabolic networks, based on the concept of elementary flux modes (EFMs). Extending our earlier study on single knockouts [Wilhelm, T., Behre, J., Schuster, S., 2004. Analysis of structural robustness of metabolic networks. IEE Proc. Syst. Biol. 1(1), 114-120], we are now considering the general case of double and multiple knockouts. The robustness measures are based on the ratio of the number of remaining EFMs after knockout vs. the number of EFMs in the unperturbed situation, averaged over all combinations of knockouts. With the help of simple examples we demonstrate that consideration of multiple knockouts yields additional information going beyond single-knockout results. It is proven that the robustness score decreases as the knockout depth increases. We apply our extended framework to metabolic networks representing amino acid anabolism in Escherichia coli and human hepatocytes, and the central metabolism in human erythrocytes. Moreover, in the E. coli model the two subnetworks synthesising amino acids that are essential and those that are non-essential for humans are studied separately. The results are discussed from an evolutionary viewpoint. We find that E. coli has the most robust metabolism of all the cell types studied here. Considering only the subnetwork of the synthesis of non-essential amino acids, E. coli and the human hepatocyte show about the same robustness.

[Bashir2008Evaluation] Ali Bashir, Stanislav Volik, Colin Collins, Vineet Bafna, and Benjamin J Raphael. Evaluation of paired-end sequencing strategies for detection of genome rearrangements in cancer. PLoS Comput. Biol., 4(4):e1000051, Apr 2008. [ bib | DOI | http | .pdf ]
Paired-end sequencing is emerging as a key technique for assessing genome rearrangements and structural variation on a genome-wide scale. This technique is particularly useful for detecting copy-neutral rearrangements, such as inversions and translocations, which are common in cancer and can produce novel fusion genes. We address the question of how much sequencing is required to detect rearrangement breakpoints and to localize them precisely using both theoretical models and simulation. We derive a formula for the probability that a fusion gene exists in a cancer genome given a collection of paired-end sequences from this genome. We use this formula to compute fusion gene probabilities in several breast cancer samples, and we find that we are able to accurately predict fusion genes in these samples with a relatively small number of fragments of large size. We further demonstrate how the ability to detect fusion genes depends on the distribution of gene lengths, and we evaluate how different parameters of a sequencing strategy impact breakpoint detection, breakpoint localization, and fusion gene detection, even in the presence of errors that suggest false rearrangements. These results will be useful in calibrating future cancer sequencing efforts, particularly large-scale studies of many cancer genomes that are enabled by next-generation sequencing technologies.

Keywords: ngs
[Banerjee2008Model] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. J. Mach. Learn. Res., 9:485-516, 2008. [ bib | .pdf | .pdf ]
[Bagci2008PLOS1] E. Z. Bagci, Y. Vodovotz, T. R. Billiar, B. Ermentrout, and I. Bahar. Computational insights on the competing effects of nitric oxide in regulating apoptosis. PLoS One, 3(5):e2249, 2008. [ bib ]
Despite the establishment of the important role of nitric oxide (NO) on apoptosis, a molecular-level understanding of the origin of its dichotomous pro- and anti-apoptotic effects has been elusive. We propose a new mathematical model for simulating the effects of nitric oxide (NO) on apoptosis. The new model integrates mitochondria-dependent apoptotic pathways with NO-related reactions, to gain insights into the regulatory effect of the reactive NO species N2O3, non-heme iron nitrosyl species (FeLnNO), and peroxynitrite (ONOO-). The biochemical pathways of apoptosis coupled with NO-related reactions are described by ordinary differential equations using mass-action kinetics. In the absence of NO, the model predicts either cell survival or apoptosis (a bistable behavior) with shifts in the onset time of apoptotic response depending on the strength of extracellular stimuli. Computations demonstrate that the relative concentrations of anti- and pro-apoptotic reactive NO species, and their interplay with glutathione, determine the net anti- or pro-apoptotic effects at long time points. Interestingly, transient effects on apoptosis are also observed in these simulations, the duration of which may reach up to hours, despite the eventual convergence to an anti-apoptotic state. Our computations point to the importance of precise timing of NO production and external stimulation in determining the eventual pro- or anti-apoptotic role of NO.

Keywords: csbcbook
[Baek2008impact] Daehyun Baek, Judit Villén, Chanseok Shin, Fernando D Camargo, Steven P Gygi, and David P Bartel. The impact of micrornas on protein output. Nature, 455(7209):64-71, Sep 2008. [ bib | DOI | http | .pdf ]
MicroRNAs are endogenous approximately 23-nucleotide RNAs that can pair to sites in the messenger RNAs of protein-coding genes to downregulate the expression from these messages. MicroRNAs are known to influence the evolution and stability of many mRNAs, but their global impact on protein output had not been examined. Here we use quantitative mass spectrometry to measure the response of thousands of proteins after introducing microRNAs into cultured cells and after deleting mir-223 in mouse neutrophils. The identities of the responsive proteins indicate that targeting is primarily through seed-matched sites located within favourable predicted contexts in 3' untranslated regions. Hundreds of genes were directly repressed, albeit each to a modest degree, by individual microRNAs. Although some targets were repressed without detectable changes in mRNA levels, those translationally repressed by more than a third also displayed detectable mRNA destabilization, and, for the more highly repressed targets, mRNA destabilization usually comprised the major component of repression. The impact of microRNAs on the proteome indicated that for most interactions microRNAs act as rheostats to make fine-scale adjustments to protein output.

Keywords: sirna
[Bach2008Consistency] F. R. Bach. Consistency of trace norm minimization. J. Mach. Learn. Res., 9:1019-1048, 2008. [ bib | .pdf | .pdf ]
Regularization by the sum of singular values, also referred to as the trace norm, is a popular technique for estimating low rank rectangular matrices. In this paper, we extend some of the consistency results of the Lasso to provide necessary and sufficient conditions for rank consistency of trace norm minimization with the square loss. We also provide an adaptive version that is rank consistent even when the necessary condition for the non adaptive version is not fulfilled.

[Bach2008Bolasso] F. R. Bach. Bolasso: model consistent Lasso estimation through the bootstrap. In William W. Cohen, Andrew McCallum, and Sam T. Roweis, editors, Proceedings of the 25th international conference on Machine learning, volume 308 of ACM International Conference Proceeding Series, pages 33-40, New York, NY, USA, 2008. ACM. [ bib | DOI ]
[Bach2008Consistencya] F. Bach. Consistency of the group lasso and multiple kernel learning. J. Mach. Learn. Res., 9:1179-1225, 2008. [ bib | .html | .pdf ]
We consider the least-square regression problem with regularization by a block l1-norm, that is, a sum of Euclidean norms over spaces of dimensions larger than one. This problem, referred to as the group Lasso, extends the usual regularization by the l1-norm where all spaces have dimension one, where it is commonly referred to as the Lasso. In this paper, we study the asymptotic group selection consistency of the group Lasso. We derive necessary and sufficient conditions for the consistency of group Lasso under practical assumptions, such as model misspecification. When the linear predictors and Euclidean norms are replaced by functions and reproducing kernel Hilbert norms, the problem is usually referred to as multiple kernel learning and is commonly used for learning from heterogeneous data sources and for non linear variable selection. Using tools from functional analysis, and in particular covariance operators, we extend the consistency results to this infinite dimensional case and also propose an adaptive scheme to obtain a consistent model estimate, even when the necessary condition required for the non adaptive scheme is not satisfied.

Keywords: lasso
[Auliac2008Evolutionary] Cédric Auliac, Vincent Frouin, Xavier Gidrol, and Florence d'Alché Buc. Evolutionary approaches for the reverse-engineering of gene regulatory networks: A study on a biologically realistic dataset. BMC Bioinformatics, 9, 2008. [ bib ]
[Argyriou2008A] Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil, and Yiming Ying. A spectral regularization framework for multi-task structure learning. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 25-32. MIT Press, Cambridge, MA, 2008. [ bib ]
[Argyriou2008When] Andreas Argyriou, Charles A. Micchelli, and Massimiliano Pontil. When is there a representer theorem? vector versus matrix regularizers. CoRR, abs/0809.1590, 2008. [ bib ]
[Argyriou2008Convex] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243-272, 2008. [ bib ]
[Albert2008SCBM] I. Albert, J. Thakar, S. Li, R. Zhang, and R. Albert. Boolean network simulations for life scientists. Source Code Biol Med, 3:16, 2008. [ bib ]
ABSTRACT: Modern life sciences research increasingly relies on computational solutions, from large scale data analyses to theoretical modeling. Within the theoretical models Boolean networks occupy an increasing role as they are eminently suited at mapping biological observations and hypotheses into a mathematical formalism. The conceptual underpinnings of Boolean modeling are very accessible even without a background in quantitative sciences, yet it allows life scientists to describe and explore a wide range of surprisingly complex phenomena. In this paper we provide a clear overview of the concepts used in Boolean simulations, present a software library that can perform these simulations based on simple text inputs and give three case studies. The large scale simulations in these case studies demonstrate the Boolean paradigms and their applicability as well as the advanced features and complex use cases that our software package allows. Our software is distributed via a liberal Open Source license and is freely accessible from http://booleannet.googlecode.com.

[Albeck2008PLOSBiol] John G. Albeck, John M. Burke, Sabrina L. Spencer, Douglas A. Lauffenburger, and Peter K. Sorger. Modeling a snap-action, variable-delay switch controlling extrinsic cell death. PLoS Biol, 6(12):e299, 2008. [ bib ]
A combination of single-cell experiments and mathematical modeling reveals the mechanisms underlying all-or-none caspase activation during receptor-induced apoptosis.

[Ala2008Prediction] U. Ala, R.M. Piro, E. Grassi, C. Damasco, L. Silengo, M. Oti, P. Provero, and F. Di Cunto. Prediction of human disease genes by human-mouse conserved coexpression analysis. PLoS Comput. Biol., 4(3):e1000043, Mar 2008. [ bib | DOI | http ]
BACKGROUND: Even in the post-genomic era, the identification of candidate genes within loci associated with human genetic diseases is a very demanding task, because the critical region may typically contain hundreds of positional candidates. Since genes implicated in similar phenotypes tend to share very similar expression profiles, high throughput gene expression data may represent a very important resource to identify the best candidates for sequencing. However, so far, gene coexpression has not been used very successfully to prioritize positional candidates. METHODOLOGY/PRINCIPAL FINDINGS: We show that it is possible to reliably identify disease-relevant relationships among genes from massive microarray datasets by concentrating only on genes sharing similar expression profiles in both human and mouse. Moreover, we show systematically that the integration of human-mouse conserved coexpression with a phenotype similarity map allows the efficient identification of disease genes in large genomic regions. Finally, using this approach on 850 OMIM loci characterized by an unknown molecular basis, we propose high-probability candidates for 81 genetic diseases. CONCLUSION: Our results demonstrate that conserved coexpression, even at the human-mouse phylogenetic distance, represents a very strong criterion to predict disease-relevant relationships among human genes.

Keywords: Algorithms; Animals; Biological Markers; Chromosome Mapping; Conserved Sequence; Diagnosis, Computer-Assisted; Gene Expression Profiling; Genetic Diseases, Inborn; Genetic Predisposition to Disease; Humans; Mice; Proteome
[Aguilera2008Genome] A. Aguilera and B. Gómez-González. Genome instability: a mechanistic view of its causes and consequences. Nat. Rev. Genet., 9(3):204-217, Mar 2008. [ bib | DOI | http | .pdf ]
Genomic instability in the form of mutations and chromosome rearrangements is usually associated with pathological disorders, and yet it is also crucial for evolution. Two types of elements have a key role in instability leading to rearrangements: those that act in trans to prevent instability (among them are replication, repair and S-phase checkpoint factors) and those that act in cis (chromosomal hotspots of instability such as fragile sites and highly transcribed DNA sequences). Taking these elements as a guide, we review the causes and consequences of instability with the aim of providing a mechanistic perspective on the origin of genomic instability.

Keywords: csbcbook
[Abernethy2008Eliciting] J. Abernethy, T. Evgeniou, O. Toubia, and J.-P. Vert. Eliciting consumer preferences using robust adaptive choice questionnaires. IEEE Trans. Knowl. Data Eng., 20(2):145-155, 2008. [ bib | DOI | http | .pdf ]
We propose a framework for designing adaptive choice-based conjoint questionnaires that are robust to response error. It is developed based on a combination of experimental design and statistical learning theory principles. We implement and test a specific case of this framework using Regularization Networks. We also formalize within this framework the polyhedral methods recently proposed in marketing. We use simulations as well as an online market research experiment with 500 participants to compare the proposed method to benchmark methods. Both experiments show that the proposed adaptive questionnaires outperform existing ones in most cases. This work also indicates the potential of using machine learning methods in marketing.

[Abernethy2008New-techreport] J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. A new approach to collaborative filtering: Operator estimation with spectral regularization. Technical Report 00250231, HAL, 2008. [ bib ]
[Veer2008Enabling] L. J. van't Veer and R. Bernards. Enabling personalized cancer medicine through analysis of gene-expression patterns. Nature, 452(7187):564-570, Apr 2008. [ bib | DOI | http | .pdf ]
Therapies for patients with cancer have changed gradually over the past decade, moving away from the administration of broadly acting cytotoxic drugs towards the use of more-specific therapies that are targeted to each tumour. To facilitate this shift, tests need to be developed to identify those individuals who require therapy and those who are most likely to benefit from certain therapies. In particular, tests that predict the clinical outcome for patients on the basis of the genes expressed by their tumours are likely to increasingly affect patient management, heralding a new era of personalized medicine.

Keywords: csbcbook, csbcbook-ch3
[Dunson2008The] D. Dunson, Y. Xue, and L. Carin. The matrix stick-breaking process: Flexible bayes meta-analysis. Journal of the American Statistical Association, 103(481):317-327, March 2008. [ bib | http ]
In analyzing data from multiple related studies, it often is of interest to borrow information across studies and to cluster similar studies. Although parametric hierarchical models are commonly used, of concern is sensitivity to the form chosen for the random-effects distribution. A Dirichlet process (DP) prior can allow the distribution to be unknown, while clustering studies; however, the DP does not allow local clustering of studies with respect to a subset of the coefficients without making independence assumptions. Motivated by this problem, we propose a matrix stick-breaking process (MSBP) as a prior for a matrix of random probability measures. Properties of the MSBP are considered, and methods are developed for posterior computation using Markov chain Monte Carlo. Using the MSBP as a prior for a matrix of study-specific regression coefficients, we demonstrate advantages over parametric modeling in simulated examples. The methods are further illustrated using a multinational uterotrophic bioassay study.

[Shulman2008MultiBind] A. Shulman-Peleg, M. Shatsky, R. Nussinov, and H. J. J. Wolfson. Multibind and mappis: webservers for multiple alignment of protein 3d-binding sites and their interactions. Nucleic Acids Res., 36:260-264, May 2008. [ bib | DOI | http ]
Analysis of protein-ligand complexes and recognition of spatially conserved physico-chemical properties is important for the prediction of binding and function. Here, we present two webservers for multiple alignment and recognition of binding patterns shared by a set of protein structures. The first webserver, MultiBind (http://bioinfo3d.cs.tau.ac.il/MultiBind), performs multiple alignment of protein binding sites. It recognizes the common spatial chemical binding patterns even in the absence of similarity of the sequences or the folds of the compared proteins. The input to the MultiBind server is a set of protein-binding sites defined by interactions with small molecules. The output is a detailed list of the shared physico-chemical binding site properties. The second webserver, MAPPIS (http://bioinfo3d.cs.tau.ac.il/MAPPIS), aims to analyze protein-protein interactions. It performs multiple alignment of protein-protein interfaces (PPIs), which are regions of interaction between two protein molecules. MAPPIS recognizes the spatially conserved physico-chemical interactions, which often involve energetically important hot-spot residues that are crucial for protein-protein associations. The input to the MAPPIS server is a set of protein-protein complexes. The output is a detailed list of the shared interaction properties of the interfaces.

Keywords: binding-site
[Irizarry2008Comprehensive] Rafael A Irizarry, Christine Ladd-Acosta, Benilton Carvalho, Hao Wu, Sheri A Brandenburg, Jeffrey A Jeddeloh, Bo Wen, and Andrew P Feinberg. Comprehensive high-throughput arrays for relative methylation (charm). Genome Res, 18(5):780-790, May 2008. [ bib | DOI | http ]
This study was originally conceived to test in a rigorous way the specificity of three major approaches to high-throughput array-based DNA methylation analysis: (1) MeDIP, or methylated DNA immunoprecipitation, an example of antibody-mediated methyl-specific fractionation; (2) HELP, or HpaII tiny fragment enrichment by ligation-mediated PCR, an example of differential amplification of methylated DNA; and (3) fractionation by McrBC, an enzyme that cuts most methylated DNA. These results were validated using 1466 Illumina methylation probes on the GoldenGate methylation assay and further resolved discrepancies among the methods through quantitative methylation pyrosequencing analysis. While all three methods provide useful information, there were significant limitations to each, specifically bias toward CpG islands in MeDIP, relatively incomplete coverage in HELP, and location imprecision in McrBC. However, we found that with an original array design strategy using tiling arrays and statistical procedures that average information from neighboring genomic locations, much improved specificity and sensitivity could be achieved, e.g., approximately 100% sensitivity at 90% specificity with McrBC. We term this approach "comprehensive high-throughput arrays for relative methylation" (CHARM). While this approach was applied to McrBC analysis, the array design and computational algorithms are fractionation method-independent and make this a simple, general, relatively inexpensive tool suitable for genome-wide analysis, and in which individual samples can be assayed reliably at very high density, allowing locus-level genome-wide epigenetic discrimination of individuals, not just groups of samples. Furthermore, unlike the other approaches, CHARM is highly quantitative, a substantial advantage in application to the study of human disease.

Keywords: Bias (Epidemiology); CpG Islands; DNA Methylation; Genome, Human; Genomics; Humans; Oligonucleotide Array Sequence Analysis; Reference Standards; Reproducibility of Results; Sensitivity and Specificity
[Frith2008Discovering] Martin C. Frith, Neil F. W. Saunders, Bostjan Kobe, and Timothy L. Bailey. Discovering sequence motifs with arbitrary insertions and deletions. PLoS Comput. Biol., 4(5):e1000071, May 2008. [ bib | DOI | http ]
Keywords: glam
[Callison-Burch08Proceedings] C. Callison-Burch, P. Koehn, C. Monz, J. Schroeder, and C. S. Fordyce, editors. Proceedings of the Third Workshop on SMT. ACL, Columbus, Ohio, June 2008. [ bib | http ]
[Zhang2008Progress] Yang Zhang. Progress and challenges in protein structure prediction. Curr. Opin. Struct. Biol., 18(3):342-348, June 2008. [ bib | DOI | http ]
Depending on whether similar structures are found in the PDB library, the protein structure prediction can be categorized into template-based modeling and free modeling. Although threading is an efficient tool to detect the structural analogs, the advancements in methodology development have come to a steady state. Encouraging progress is observed in structure refinement which aims at drawing template structures closer to the native; this has been mainly driven by the use of multiple structure templates and the development of hybrid knowledge-based and physics-based force fields. For free modeling, exciting examples have been witnessed in folding small proteins to atomic resolutions. However, predicting structures for proteins larger than 150 residues still remains a challenge, with bottlenecks from both force field and conformational search.

Keywords: casp8, modeling, zhang
[Hoang08Moses] H. Hoang and P. Koehn. Design of the Moses decoder for statistical machine translation. In ACL 2008 Software workshop, pages 58-65, Columbus, Ohio, June 2008. ACL. [ bib | http ]
Keywords: moses, smt
[Obozinski2008Union] G. Obozinski, M. J. Wainwright, and M. I. Jordan. Union support recovery in high-dimensional multivariate regression. Technical Report 0808.0711v1, arXiv, August 2008. [ bib | .pdf ]
Keywords: lasso
[Yin2008Activity] J. Yin, Q. Yang, D. Shen, and Z.-N. Li. Activity recognition via user-trace segmentation. ACM Trans. Sen. Netw., 4(4):19:1-19:34, September 2008. [ bib | DOI | http | .pdf ]
[Vert2008Machine] J.-P. Vert and L. Jacob. Machine learning for in silico virtual screening and chemical genomics: New strategies. Combinatorial Chemistry & High Throughput Screening, 11(8):677-685, September 2008. [ bib | DOI | http ]
Keywords: cheminformatics, virtual_screening
[Pan2008A] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. Technical Report HKUST-CS08-08, Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China, November 2008. [ bib | .pdf ]
[Scherer2009Batch] A. Scherer, editor. Batch Effects and Noise in Microarray Experiments: Sources and Solutions. John Wiley and Sons, 2009. [ bib ]
[Zhao2009composite] P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, 37(6A):3468-3497, 2009. [ bib ]
[Zhao2008Grouped] P. Zhao, G. Rocha, and B. Yu. Grouped and hierarchical model selection through composite absolute penalties. Ann. Stat., 37(6A):3468-3497, 2009. [ bib ]
[Zhang2009Maximum] K. Zhang, I.W. Tsang, and J.T. Kwok. Maximum margin clustering made practical. IEEE T. Neural Networ., 20(4):583-596, 2009. [ bib | DOI ]
Keywords: learning (artificial intelligence), optimisation, pattern clustering, Laplacian/square loss, alternating optimization, maximum margin clustering, nonconvex optimization problem, nonconvex problem, semidefinite programs, supervised learning, Large margin methods, maximum margin clustering (MMC), scalability, unsupervised learning
[Zhang2009Penalized] D. Zhang, Y. Lin, and M. Zhang. Penalized orthogonal-components regression for large p small n data. Electron. J. Statist., 3:781-796, 2009. [ bib | DOI | http ]
Here we propose a penalized orthogonal-components regression (POCRE) for large p small n data. Orthogonal components are sequentially constructed to maximize, upon standardization, their correlation to the response residuals. A new penalization framework, implemented via empirical Bayes thresholding, is presented to effectively identify sparse predictors of each component. POCRE is computationally efficient owing to its sequential construction of leading sparse principal components. In addition, such construction offers other properties such as grouping highly correlated predictors and allowing for collinear or nearly collinear predictors. With multivariate responses, POCRE can construct common components and thus build up latent-variable models for large p small n data.

[Zaslavskiy2009Global] M. Zaslavskiy, F. Bach, and J.-P. Vert. Global alignment of protein-protein interaction networks by graph matching methods. Bioinformatics, 25(12), 2009. [ bib ]
[Zaslavskiy2009Path] M. Zaslavskiy, F. Bach, and J.-P. Vert. A path following algorithm for the graph matching problem. IEEE Trans. Pattern Anal. Mach. Intell., 31(12):2227-2242, 2009. [ bib | DOI | http | .pdf ]
We propose a convex-concave programming approach for the labeled weighted graph matching problem. The convex-concave programming formulation is obtained by rewriting the weighted graph matching problem as a least-square problem on the set of permutation matrices and relaxing it to two different optimization problems: a quadratic convex and a quadratic concave optimization problem on the set of doubly stochastic matrices. The concave relaxation has the same global minimum as the initial graph matching problem, but the search for its global minimum is also a hard combinatorial problem. We, therefore, construct an approximation of the concave problem solution by following a solution path of a convex-concave problem obtained by linear interpolation of the convex and concave formulations, starting from the convex relaxation. This method makes it easy to integrate the information on graph label similarities into the optimization problem, and therefore to perform labeled weighted graph matching. The algorithm is compared with some of the best performing graph matching methods on four data sets: simulated graphs, QAPLib, retina vessel images, and handwritten Chinese characters. In all cases, the results are competitive with the state of the art.

[Yoon2009Sensitive] Seungtai Yoon, Zhenyu Xuan, Vladimir Makarov, Kenny Ye, and Jonathan Sebat. Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res., 19(9):1586-1592, Sep 2009. [ bib | DOI | http | .pdf ]
Methods for the direct detection of copy number variation (CNV) genome-wide have become effective instruments for identifying genetic risk factors for disease. The application of next-generation sequencing platforms to genetic studies promises to improve sensitivity to detect CNVs as well as inversions, indels, and SNPs. New computational approaches are needed to systematically detect these variants from genome sequence data. Existing sequence-based approaches for CNV detection are primarily based on paired-end read mapping (PEM) as reported previously by Tuzun et al. and Korbel et al. Due to limitations of the PEM approach, some classes of CNVs are difficult to ascertain, including large insertions and variants located within complex genomic regions. To overcome these limitations, we developed a method for CNV detection using read depth of coverage. Event-wise testing (EWT) is a method based on significance testing. In contrast to standard segmentation algorithms that typically operate by performing likelihood evaluation for every point in the genome, EWT works on intervals of data points, rapidly searching for specific classes of events. Overall false-positive rate is controlled by testing the significance of each possible event and adjusting for multiple testing. Deletions and duplications detected in an individual genome by EWT are examined across multiple genomes to identify polymorphism between individuals. We estimated error rates using simulations based on real data, and we applied EWT to the analysis of chromosome 1 from paired-end shotgun sequence data (30x) on five individuals. Our results suggest that analysis of read depth is an effective approach for the detection of CNVs, and it captures structural variants that are refractory to established PEM-based methods.

Keywords: ngs
[Xie2009Unified] Lei Xie, Li Xie, and Philip E Bourne. A unified statistical model to support local sequence order independent similarity searching for ligand-binding sites and its application to genome-based drug discovery. Bioinformatics, 25(12):i305-i312, Jun 2009. [ bib | DOI | http ]
Functional relationships between proteins that do not share global structure similarity can be established by detecting their ligand-binding-site similarity. For a large-scale comparison, it is critical to accurately and efficiently assess the statistical significance of this similarity. Here, we report an efficient statistical model that supports local sequence order independent ligand-binding-site similarity searching. Most existing statistical models only take into account the matching vertices between two sites that are defined by a fixed number of points. In reality, the boundary of the binding site is not known or is dependent on the bound ligand, making these approaches limited. To address these shortcomings and to perform binding-site mapping on a genome-wide scale, we developed a sequence-order independent profile-profile alignment (SOIPPA) algorithm that is able to detect local similarity between unknown binding sites a priori. The SOIPPA scoring integrates geometric, evolutionary and physical information into a unified framework. However, this imposes a significant challenge in assessing the statistical significance of the similarity because the conventional probability model that is based on fixed-point matching cannot be applied. Here we find that scores for binding-site matching by SOIPPA follow an extreme value distribution (EVD). Benchmark studies show that the EVD model is at least two orders of magnitude faster and is more accurate than the non-parametric statistical method in the previous SOIPPA version. Efficient statistical analysis makes it possible to apply SOIPPA to genome-based drug discovery. Consequently, we have applied the approach to the structural genome of Mycobacterium tuberculosis to construct a protein-ligand interaction network. The network reveals highly connected proteins, which represent suitable targets for promiscuous drugs.

Keywords: Binding Sites; Computational Biology, methods; Drug Discovery, methods; Genome; Ligands; Models, Statistical; Mycobacterium tuberculosis, genetics/metabolism; Proteins, chemistry
[Xie2009CNV-seq] Chao Xie and Martti T Tammi. CNV-seq, a new method to detect copy number variation using high-throughput sequencing. BMC Bioinformatics, 10:80, 2009. [ bib | DOI | http | .pdf ]
BACKGROUND: DNA copy number variation (CNV) has been recognized as an important source of genetic variation. Array comparative genomic hybridization (aCGH) is commonly used for CNV detection, but the microarray platform has a number of inherent limitations. RESULTS: Here, we describe a method to detect copy number variation using shotgun sequencing, CNV-seq. The method is based on a robust statistical model that describes the complete analysis procedure and allows the computation of essential confidence values for detection of CNV. Our results show that the number of reads, not the length of the reads, is the key factor determining the resolution of detection. This favors the next-generation sequencing methods that rapidly produce large amounts of short reads. CONCLUSION: Simulation of various sequencing methods with coverage between 0.1x and 8x shows overall specificity between 91.7% and 99.9%, and sensitivity between 72.2% and 96.5%. We also show the results for assessment of CNV between two individual human genomes.

Keywords: ngs
[Witten2009penalized] Daniela M. Witten, Robert Tibshirani, and Trevor Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515-534, Jul 2009. [ bib | DOI | http | .pdf ]
We present a penalized matrix decomposition (PMD), a new framework for computing a rank-$K$ approximation for a matrix. We approximate the matrix $X$ as $\hat{X} = \sum_{k=1}^{K} d_k u_k v_k^T$, where $d_k$, $u_k$, and $v_k$ minimize the squared Frobenius norm of $X - \hat{X}$, subject to penalties on $u_k$ and $v_k$. This results in a regularized version of the singular value decomposition. Of particular interest is the use of $L_1$-penalties on $u_k$ and $v_k$, which yields a decomposition of $X$ using sparse vectors. We show that when the PMD is applied using an $L_1$-penalty on $v_k$ but not on $u_k$, a method for sparse principal components results. In fact, this yields an efficient algorithm for the "SCoTLASS" proposal (Jolliffe and others 2003) for obtaining sparse principal components. This method is demonstrated on a publicly available gene expression data set. We also establish connections between the SCoTLASS method for sparse principal component analysis and the method of Zou and others (2006). In addition, we show that when the PMD is applied to a cross-products matrix, it results in a method for penalized canonical correlation analysis (CCA). We apply this penalized CCA method to simulated data and to a genomic data set consisting of gene expression and DNA copy number measurements on the same set of samples.

[Witten2009Covariance-regularized] D. M. Witten and R. Tibshirani. Covariance-regularized regression and classification for high dimensional problems. J. R. Stat. Soc. Ser. B, 71(3), 2009. [ bib | DOI | http | .pdf ]
[wiki:tsp] Wikipedia. Travelling Salesman Problem - Wikipedia, The Free Encyclopedia, 2009. [Online; accessed 5-May-2009]. [ bib ]
[Weill2009Development] Nathanael Weill and Didier Rognan. Development and Validation of a Novel Protein-Ligand Fingerprint To Mine Chemogenomic Space: Application to G Protein-Coupled Receptors and Their Ligands. Journal of Chemical Information and Modeling, 49(4):1049-1062, 2009. [ bib ]
[Wassermann2009Ligand] Anne Mai Wassermann, Hanna Geppert, and Jürgen Bajorath. Ligand prediction for orphan targets using support vector machines and various target-ligand kernels is dominated by nearest neighbor effects. J Chem Inf Model, 49(10):2155-2167, Oct 2009. [ bib | DOI | http | .pdf ]
Support vector machine (SVM) calculations combining protein and small molecule information have been applied to identify ligands for simulated orphan targets (i.e., targets for which no ligands were available). The combination of protein and ligand information was facilitated through the design of target-ligand kernel functions that account for pairwise ligand and target similarity. The design and biological information content of such kernel functions was expected to play a major role for target-directed ligand prediction. Therefore, a variety of target-ligand kernels were implemented to capture different types of target information including sequence, secondary structure, tertiary structure, biophysical properties, ontologies, or structural taxonomy. These kernels were tested in ligand predictions for simulated orphan targets in two target protein systems characterized by the presence of different intertarget relationships. Surprisingly, although there were target- and set-specific differences in prediction rates for alternative target-ligand kernels, the performance of these kernels was overall similar and also similar to SVM linear combinations. Test calculations designed to better understand possible reasons for these observations revealed that ligand information provided by nearest neighbors of orphan targets significantly influenced SVM performance, much more so than the inclusion of protein information. As long as ligands of closely related neighbors of orphan targets were available for SVM learning, orphan target ligands could be well predicted, regardless of the type and sophistication of the kernel function that was used. These findings suggest simplified strategies for SVM-based ligand prediction for orphan targets.

Keywords: chemogenomics, chemoinformatics
[Wang2009RNA] Z. Wang, M. Gerstein, and M. Snyder. RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet., 10(1):57-63, Jan 2009. [ bib | DOI | http | .pdf ]
RNA-Seq is a recently developed approach to transcriptome profiling that uses deep-sequencing technologies. Studies using this method have already altered our view of the extent and complexity of eukaryotic transcriptomes. RNA-Seq also provides a far more precise measurement of levels of transcripts and their isoforms than other methods. This article describes the RNA-Seq approach, the challenges associated with its application, and the advances made so far in characterizing several eukaryote transcriptomes.

Keywords: ngs, rnaseq
[Wainwright2009Sharp] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_1$-constrained quadratic programming (Lasso). IEEE T. Inform. Theory, 55(5):2183-2202, 2009. [ bib | DOI | http | .pdf ]
[Wainwright2009Information-Theoretic] M. J. Wainwright. Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting. IEEE T. Inform. Theory, 55(12):5728-5741, 2009. [ bib | DOI ]
[Vishwanathan2009Graph] S. V. N. Vishwanathan, N. N. Schraudolph, R. Kondor, and K. M. Borgwardt. Graph kernels. J. Mach. Learn. Res., 10:1-41, 2009. [ bib | .pdf ]
[Vert2009High-level] J. P. Vert, T. Matsui, S. Satoh, and Y. Uchiyama. High-level feature extraction using SVM with walk-based graph kernel. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2009), 2009. [ bib | .pdf ]
[Vassetzky2009Chromosome] Yegor Vassetzky, Alexey Gavrilov, Elvira Eivazova, Iryna Pirozhkova, Marc Lipinski, and Sergey Razin. Chromosome conformation capture (from 3C to 5C) and its ChIP-based modification. Methods Mol Biol, 567:171-188, 2009. [ bib | DOI | http ]
Chromosome conformation capture (3C) methodology was developed to study spatial organization of long genomic regions in living cells. Briefly, chromatin is fixed with formaldehyde in vivo to cross-link interacting sites, digested with a restriction enzyme and ligated at a low DNA concentration so that ligation between cross-linked fragments is favored over ligation between random fragments. Ligation products are then analyzed and quantified by PCR. So far, semi-quantitative PCR methods have been widely used to estimate the ligation frequencies. However, it is often important to estimate the ligation frequencies more precisely, which is only possible by using real-time PCR. At the same time, it is equally necessary to monitor the specificity of PCR amplification. That is why real-time PCR with TaqMan probes is becoming more and more popular in 3C studies. In this chapter, we describe the general protocol for 3C analysis with the subsequent estimation of ligation frequencies by using real-time PCR technology with TaqMan probes. We discuss in detail all steps of the experimental procedure, paying special attention to weak points and possible ways to solve the problems. Special attention is also paid to the problems in interpretation of the results and necessary control experiments. In addition, we consider in theory other approaches to analysis of the ligation products used within the so-called 4C and 5C methods. The recently developed chromatin immunoprecipitation (ChIP)-loop assay representing a combination of 3C and ChIP is also discussed.

Keywords: Chromatin Immunoprecipitation; Chromosome Mapping; Chromosomes; Cross-Linking Reagents; Humans; Models, Biological; Nucleic Acid Conformation; Polymerase Chain Reaction; Quality Control
[Tuteja2009Extracting] Geetu Tuteja, Peter White, Jonathan Schug, and Klaus H Kaestner. Extracting transcription factor targets from ChIP-Seq data. Nucleic Acids Res, 37(17):e113, Sep 2009. [ bib | DOI | http | .pdf ]
ChIP-Seq technology, which combines chromatin immunoprecipitation (ChIP) with massively parallel sequencing, is rapidly replacing ChIP-on-chip for the genome-wide identification of transcription factor binding events. Identifying bound regions from the large number of sequence tags produced by ChIP-Seq is a challenging task. Here, we present GLITR (GLobal Identifier of Target Regions), which accurately identifies enriched regions in target data by calculating a fold-change based on random samples of control (input chromatin) data. GLITR uses a classification method to identify regions in ChIP data that have a peak height and fold-change which do not resemble regions in an input sample. We compare GLITR to several recent methods and show that GLITR has improved sensitivity for identifying bound regions closely matching the consensus sequence of a given transcription factor, and can detect bona fide transcription factor targets missed by other programs. We also use GLITR to address the issue of sequencing depth, and show that sequencing biological replicates identifies far more binding regions than re-sequencing the same sample.

Keywords: ngs
[Tseng2009Coordinate] P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization. Math. Program., 117(1-2):387-423, 2009. [ bib | DOI | http | .pdf ]
Keywords: lasso
[Trinh2009Elementary] C. T. Trinh, A. Wlaschin, and F. Srienc. Elementary mode analysis: a useful metabolic pathway analysis tool for characterizing cellular metabolism. Appl Microbiol Biotechnol, 81(5):813-826, Jan 2009. [ bib | DOI | http | .pdf ]
Elementary mode analysis is a useful metabolic pathway analysis tool to identify the structure of a metabolic network that links the cellular phenotype to the corresponding genotype. The analysis can decompose the intricate metabolic network comprised of highly interconnected reactions into uniquely organized pathways. These pathways consisting of a minimal set of enzymes that can support steady state operation of cellular metabolism represent independent cellular physiological states. Such pathway definition provides a rigorous basis to systematically characterize cellular phenotypes, metabolic network regulation, robustness, and fragility that facilitate understanding of cell physiology and implementation of metabolic engineering strategies. This mini-review aims to overview the development and application of elementary mode analysis as a metabolic pathway analysis tool in studying cell physiology and as a basis of metabolic engineering.

[Tournier2009JTB] L. Tournier and M. Chaves. Uncovering operational interactions in genetic networks using asynchronous Boolean dynamics. Journal of Theoretical Biology, 260(2):196-209, 2009. [ bib | DOI | http ]
Biological networks of large dimensions, with their diagram of interactions, are often well represented by a Boolean model with a family of logical rules. The state space of a Boolean model is finite, and its asynchronous dynamics are fully described by a transition graph in the state space. In this context, a model reduction method will be developed for identifying the active or operational interactions responsible for a given dynamic behaviour. The first step in this procedure is the decomposition of the asynchronous transition graph into its strongly connected components, to obtain a reduced and hierarchically organized graph of transitions. The second step consists of the identification of a partial graph of interactions and a sub-family of logical rules that remain operational in a given region of the state space. This model reduction method and its usefulness are illustrated by an application to a model of programmed cell death. The method identifies two mechanisms used by the cell to respond to death-receptor stimulation and decide between the survival and apoptotic pathways.

Keywords: csbcbook
[Topiol2009X-ray] Sid Topiol and Michael Sabio. X-ray structure breakthroughs in the GPCR transmembrane region. Biochem Pharmacol, 78(1):11-20, Jul 2009. [ bib | DOI | http ]
G-protein-coupled receptor (GPCR) proteins [Lundstrom KH, Chiu ML, editors. G protein-coupled receptors in drug discovery. CRC Press; 2006] are the single largest drug target, representing 25-50% of marketed drugs [Overington JP, Al-Lazikani B, Hopkins AL. How many drug targets are there? Nat Rev Drug Discov 2006;5(12):993-6; Parrill AL. Crystal structures of a second G protein-coupled receptor: triumphs and implications. ChemMedChem 2008;3:1021-3]. While there are six subclasses of GPCR proteins, the hallmark of all GPCR proteins is the transmembrane-spanning region. The general architecture of this transmembrane (TM) region has been known for some time to contain seven alpha-helices. From a drug discovery and design perspective, structural information of the GPCRs has been sought as a tool for structure-based drug design. The advances in the past decade of technologies for structure-based design have proven to be useful in a number of areas. Invoking these approaches for GPCR targets has remained challenging. Until recently, the most closely related structures available for GPCR modeling have been those of bovine rhodopsin. While a representative of class A GPCRs, bovine rhodopsin is not a ligand-activated GPCR and is fairly distant in sequence homology to other class A GPCRs. Thus, there is a variable degree of uncertainty in the use of the rhodopsin X-ray structure as a template for homology modeling of other GPCR targets. Recent publications of X-ray structures of class A GPCRs now offer the opportunity to better understand the molecular mechanism of action at the atomic level, to deploy X-ray structures directly for their use in structure-based design, and to provide more promising templates for many other ligand-mediated GPCRs. We summarize herein some of the recent findings in this area and provide an initial perspective of the emerging opportunities, possible limitations, and remaining questions. 
Other aspects of the recent X-ray structures are described by Weis and Kobilka [Weis WI, Kobilka BK. Structural insights into G-protein-coupled receptor activation. Curr Opin Struct Biol 2008;18:734-40] and Mustafi and Palczewski [Mustafi D, Palczewski K. Topology of class A G protein-coupled receptors: insights gained from crystal structures of rhodopsins, adrenergic and adenosine receptors. Mol Pharmacol 2009;75:1-12].

Keywords: Animals; Cell Membrane; Humans; Models, Molecular; Molecular Conformation; Pindolol; Propanolamines; Protein Conformation; Receptor, Adenosine A2A; Receptors, Adrenergic, beta-2; Receptors, G-Protein-Coupled; Retinaldehyde; Rhodopsin; X-Ray Diffraction
[Terentiev2009Dynamic] A. A. Terentiev, N. T. Moldogazieva, and K. V. Shaitan. Dynamic proteomics in modeling of the living cell. Protein-protein interactions. Biochemistry (Mosc), 74(13):1586-1607, Dec 2009. [ bib ]
This review is devoted to describing, summarizing, and analyzing dynamic proteomics data obtained over the last few years and concerning the role of protein-protein interactions in modeling of the living cell. Principles of modern high-throughput experimental methods for investigation of protein-protein interactions are described. Systems biology approaches based on an integrative view of cellular processes are used to analyze the organization of protein interaction networks. It is proposed that the finding of some proteins in different protein complexes can be explained by their multi-modular and polyfunctional properties; the different protein modules can be located in the nodes of protein interaction networks. Mathematical and computational approaches to modeling of the living cell with emphasis on molecular dynamics simulation are provided. The role of network analysis in fundamental medicine is also briefly reviewed.

Keywords: Animals; Humans; Mass Spectrometry; Models, Theoretical; Molecular Dynamics Simulation; Multiprotein Complexes; Protein Conformation; Protein Interaction Mapping; Proteins; Proteomics; Systems Biology; Two-Hybrid System Techniques
[Taslim2009Comparative] C. Taslim, J. Wu, P. Yan, G. Singer, J. Parvin, T. Huan, S. Lin, and K. Huang. Comparative study on ChIP-seq data: normalization and binding pattern characterization. Bioinformatics, 25(18):2334-2340, Sep 2009. [ bib | DOI | http ]
MOTIVATION: Antibody-based Chromatin Immunoprecipitation assay followed by high-throughput sequencing technology (ChIP-seq) is a relatively new method to study the binding patterns of specific protein molecules over the entire genome. ChIP-seq technology allows scientists to get more comprehensive results in shorter time. Here, we present a non-linear normalization algorithm and a mixture modeling method for comparing ChIP-seq data from multiple samples and characterizing genes based on their RNA polymerase II (Pol II) binding patterns. RESULTS: We apply a two-step non-linear normalization method based on a locally weighted regression (LOESS) approach to compare ChIP-seq data across multiple samples and model the difference using an Exponential-Normal(K) mixture model. The fitted model is used to identify genes associated with differential binding sites based on local false discovery rate (fdr). These genes are then standardized and hierarchically clustered to characterize their Pol II binding patterns. As a case study, we apply the analysis procedure comparing a normal breast cancer (MCF7) cell line to a tamoxifen-resistant (OHT) cell line. We find enriched regions that are associated with cancer (P < 0.0001). Our findings also imply that there may be a dysregulation of cell cycle and gene expression control pathways in the tamoxifen-resistant cells. These results show that the non-linear normalization method can be used to analyze ChIP-seq data across multiple samples. AVAILABILITY: Data are available at http://www.bmi.osu.edu/~khuang/Data/ChIP/RNAPII/.

[Sutherland2009Transcription] Heidi Sutherland and Wendy A Bickmore. Transcription factories: gene expression in unions? Nat Rev Genet, 10(7):457-466, Jul 2009. [ bib | DOI | http ]
Transcription is a fundamental step in gene expression, yet it remains poorly understood at a cellular level. Visualization of transcription sites and active genes has led to the suggestion that transcription occurs at discrete sites in the nucleus, termed transcription factories, where multiple active RNA polymerases are concentrated and anchored to a nuclear substructure. However, this concept is not universally accepted. This Review discusses the experimental evidence in support of the transcription factory model and the evidence that argues against such a spatially structured view of transcription. The transcription factory model has implications for the regulation of transcription initiation and elongation, for the organization of genes in the genome, for the co-regulation of genes and for genome instability.

Keywords: Animals; Cell Nucleus; DNA-Directed RNA Polymerases; Genome; Genomic Instability; Humans; Models, Biological; Transcription, Genetic
[Stempfel2009Learning] G. Stempfel and L. Ralaivola. Learning SVMs from sloppily labeled data. In Cesare Alippi, Marios Polycarpou, Christos Panayiotou, and Georgios Ellinas, editors, Artificial Neural Networks – ICANN 2009, volume 5768 of Lecture Notes in Computer Science, pages 884-893. Springer Berlin / Heidelberg, 2009. [ bib | http ]
[Stempfel2009Robustesse] G. Stempfel. Robustesse des séparateurs linéaires au bruit de classification. PhD thesis, Université de Provence, 2009. [ bib ]
[Sriphaew2009Cool] K. Sriphaew, H. Takamura, and M. Okumura. Cool blog classification from positive and unlabeled examples. In T. Theeramunkong, B. Kijsirikul, C. Cercone, and T-B. Ho, editors, Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD '09, pages 62-73, Berlin, Heidelberg, 2009. Springer-Verlag. [ bib | DOI | http ]
We address the problem of cool blog classification using only positive and unlabeled examples. We propose an algorithm, called PUB, that exploits the information of unlabeled data together with the positive examples to predict whether the unseen blogs are cool or not. The algorithm uses the weighting technique to assign a weight to each unlabeled example which is assumed to be negative in the training set, and the bagging technique to obtain several weak classifiers, each of which is learned on a small training set generated by randomly sampling some positive examples and some unlabeled examples, which are assumed to be negative. Each of the weak classifiers must achieve admissible performance measure evaluated based on the whole labeled positive examples or has the best performance measure within iteration limit. The majority voting function on all weak classifiers is employed to predict the class of a test instance. The experimental results show that PUB can correctly predict the classes of unseen blogs where this situation cannot be handled by the traditional learning from positive and negative examples. The results also show that PUB outperforms other algorithms for learning from positive and unlabeled examples in the task of cool blog classification.

Keywords: Cool blog, PU-learning, bagging, weighting examples
[Spyrou2009BayesPeak] Christiana Spyrou, Rory Stark, Andy G Lynch, and Simon Tavaré. BayesPeak: Bayesian analysis of ChIP-seq data. BMC Bioinformatics, 10:299, 2009. [ bib | DOI | http | .pdf ]
BACKGROUND: High-throughput sequencing technology has become popular and widely used to study protein and DNA interactions. Chromatin immunoprecipitation, followed by sequencing of the resulting samples, produces large amounts of data that can be used to map genomic features such as transcription factor binding sites and histone modifications. METHODS: Our proposed statistical algorithm, BayesPeak, uses a fully Bayesian hidden Markov model to detect enriched locations in the genome. The structure accommodates the natural features of the Solexa/Illumina sequencing data and allows for overdispersion in the abundance of reads in different regions. Moreover, a control sample can be incorporated in the analysis to account for experimental and sequence biases. Markov chain Monte Carlo algorithms are applied to estimate the posterior distributions of the model parameters, and posterior probabilities are used to detect the sites of interest. CONCLUSION: We have presented a flexible approach for identifying peaks from ChIP-seq reads, suitable for use on both transcription factor binding and histone modification data. Our method estimates probabilities of enrichment that can be used in downstream analysis. The method is assessed using experimentally verified data and is shown to provide high-confidence calls with low false positive rates.

Keywords: ngs
[Sotiriou2009Gene-Expression] C. Sotiriou and L. Pusztai. Gene-expression signatures in breast cancer. N. Engl. J. Med., 360(8):790-800, 2009. [ bib | DOI | http | .pdf ]
[Soares2009Identifying] H. D. Soares, Y. Chen, M. Sabbagh, A. Roher, A. Rohrer, E. Schrijvers, and M. Breteler. Identifying early markers of Alzheimer's disease using quantitative multiplex proteomic immunoassay panels. Ann. N. Y. Acad. Sci., 1180:56-67, Oct 2009. [ bib | DOI | http | .pdf ]
Alzheimer's disease (AD) is a debilitating neurodegenerative disorder with incidence expected to increase four-fold over the next decade. Extensive research efforts are focused upon identifying new treatments, and early diagnosis is considered key to successful intervention. Although imaging and cerebrospinal fluid biomarkers have shown promise in identifying patients in very early stages of the disease, more noninvasive cost-effective tools have remained elusive. Recent studies have reported that an 18-analyte multiplexed plasma panel can differentiate AD from controls suggesting plasma-based screening tools for early AD diagnosis exists. The current study tested the reproducibility of a subset of the original 18-analyte panel using a bead-based multiplex technology. Preliminary results suggest diagnostic accuracy using the subset was 61%. Multivariate analysis of an 89-analyte multivariate panel yielded a diagnostic accuracy of 70% suggesting a plasma-based AD signature that may be a useful screening tool.

[Snijder2009Population] Berend Snijder, Raphael Sacher, Pauli Rämö, Eva-Maria Damm, Prisca Liberali, and Lucas Pelkmans. Population context determines cell-to-cell variability in endocytosis and virus infection. Nature, 461(7263):520-523, Sep 2009. [ bib | DOI | http | .pdf ]
Single-cell heterogeneity in cell populations arises from a combination of intrinsic and extrinsic factors. This heterogeneity has been measured for gene transcription, phosphorylation, cell morphology and drug perturbations, and used to explain various aspects of cellular physiology. In all cases, however, the causes of heterogeneity were not studied. Here we analyse, for the first time, the heterogeneous patterns of related cellular activities, namely virus infection, endocytosis and membrane lipid composition in adherent human cells. We reveal correlations with specific cellular states that are defined by the population context of a cell, and we derive probabilistic models that can explain and predict most cellular heterogeneity of these activities, solely on the basis of each cell's population context. We find that accounting for population-determined heterogeneity is essential for interpreting differences between the activity levels of cell populations. Finally, we reveal that synergy between two molecular components, focal adhesion kinase and the sphingolipid GM1, enhances the population-determined pattern of simian virus 40 (SV40) infection. Our findings provide an explanation for the origin of heterogeneity patterns of cellular activities in adherent cell populations.

Keywords: highcontentscreening
[Smalter2009Feature] Aaron Smalter, Jun Huan, and Gerald Lushington. Feature selection in the tensor product feature space. In IEEE International Conference on Data Mining, pages 1004-1009, 2009. [ bib | DOI ]
[Shivakumar2009Structural] Pavithra Shivakumar and Michael Krauthammer. Structural similarity assessment for drug sensitivity prediction in cancer. BMC Bioinformatics, 10 Suppl 9:S17, 2009. [ bib | DOI | http | .pdf ]
BACKGROUND: The ability to predict drug sensitivity in cancer is one of the exciting promises of pharmacogenomic research. Several groups have demonstrated the ability to predict drug sensitivity by integrating chemo-sensitivity data and associated gene expression measurements from large anti-cancer drug screens such as NCI-60. The general approach is based on comparing gene expression measurements from sensitive and resistant cancer cell lines and deriving drug sensitivity profiles consisting of lists of genes whose expression is predictive of response to a drug. Importantly, it has been shown that such profiles are generic and can be applied to cancer cell lines that are not part of the anti-cancer screen. However, one limitation is that the profiles can not be generated for untested drugs (i.e., drugs that are not part of an anti-cancer drug screen). In this work, we propose using an existing drug sensitivity profile for drug A as a substitute for an untested drug B given high structural similarities between drugs A and B. RESULTS: We first show that structural similarity between pairs of compounds in the NCI-60 dataset highly correlates with the similarity between their activities across the cancer cell lines. This result shows that structurally similar drugs can be expected to have a similar effect on cancer cell lines. We next set out to test our hypothesis that we can use existing drug sensitivity profiles as substitute profiles for untested drugs. In a cross-validation experiment, we found that the use of substitute profiles is possible without a significant loss of prediction accuracy if the substitute profile was generated from a compound with high structural similarity to the untested compound. CONCLUSION: Anti-cancer drug screens are a valuable resource for generating omics-based drug sensitivity profiles. 
We show that it is possible to extend the usefulness of existing screens to untested drugs by deriving substitute sensitivity profiles from structurally similar drugs part of the screen.

Keywords: chemogenomics
[Sherashidze2009Efficient] N. Shervashidze, S. V. N. Vishwanathan, T. H. Petri, K. Mehlhorn, and K. M. Borgwardt. Efficient graphlet kernels for large graph comparison. In 12th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 488-495, Clearwater Beach, Florida USA, 2009. Society for Artificial Intelligence and Statistics. [ bib | .pdf ]
[Shah2009Mutational] S. P. Shah, R. D. Morin, J. Khattra, L. Prentice, T. Pugh, A. Burleigh, A. Delaney, K. Gelmon, R. Guliany, J. Senz, C. Steidl, R. A. Holt, S. Jones, M. Sun, G. Leung, R. Moore, T. Severson, G. A. Taylor, A. E. Teschendorff, K. Tse, G. Turashvili, R. Varhol, R. L. Warren, P. Watson, Y. Zhao, C. Caldas, D. Huntsman, M. Hirst, M. A. Marra, and A. Aparicio. Mutational evolution in a lobular breast tumour profiled at single nucleotide resolution. Nature, 461(7265):809-813, Oct 2009. [ bib | DOI | http | .pdf ]
Recent advances in next generation sequencing have made it possible to precisely characterize all somatic coding mutations that occur during the development and progression of individual cancers. Here we used these approaches to sequence the genomes (>43-fold coverage) and transcriptomes of an oestrogen-receptor-alpha-positive metastatic lobular breast cancer at depth. We found 32 somatic non-synonymous coding mutations present in the metastasis, and measured the frequency of these somatic mutations in DNA from the primary tumour of the same patient, which arose 9 years earlier. Five of the 32 mutations (in ABCB11, HAUS3, SLC24A4, SNX4 and PALB2) were prevalent in the DNA of the primary tumour removed at diagnosis 9 years earlier, six (in KIF1C, USP28, MYH8, MORC1, KIAA1468 and RNASEH2A) were present at lower frequencies (1-13%), 19 were not detected in the primary tumour, and two were undetermined. The combined analysis of genome and transcriptome data revealed two new RNA-editing events that recode the amino acid sequence of SRP9 and COG3. Taken together, our data show that single nucleotide mutational heterogeneity can be a property of low or intermediate grade primary breast cancers and that significant evolution can occur with disease progression.

Keywords: ngs
[Scott2009Novelty] C. Scott and G. Blanchard. Novelty detection: Unlabeled data definitely help. In D. van Dyk and M. Welling, editors, Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS) 2009, volume 5, pages 464-471, Clearwater Beach, Florida, 2009. JMLR: W&CP 5. [ bib | .html | .pdf ]
In machine learning, one formulation of the novelty detection problem is to build a detector based on a training sample consisting of only nominal data. The standard (inductive) approach to this problem has been to declare novelties where the nominal density is low, which reduces the problem to density level set estimation. In this paper, we consider the setting where an unlabeled and possibly contaminated sample is also available at learning time. We argue that novelty detection is naturally solved by a general reduction to a binary classification problem. In particular, a detector with a desired false positive rate can be achieved through a reduction to Neyman-Pearson classification. Unlike the inductive approach, our approach yields detectors that are optimal (e.g., statistically consistent) regardless of the distribution on novelties. Therefore, in novelty detection, unlabeled data have a substantial impact on the theoretical properties of the decision rule.

Keywords: PUlearning
[Saigo2009gBoost] H. Saigo, S. Nowozin, T. Kadowaki, T. Kudo, and K. Tsuda. gBoost: a mathematical programming approach to graph classification and regression. Mach. Learn., 75(1):69-89, 2009. [ bib | DOI | http | .pdf ]
[Rinaldo2009Properties] A. Rinaldo. Properties and refinements of the fused lasso. Ann. Stat., 37(5B):2922-2952, 2009. [ bib | http ]
[Reyal2009Analyse] F. Reyal. Analyse du profil d'expression par la technique des puces à ADN. Application à la caractérisation moléculaire et à la détermination du pronostic des cancers canalaires infiltrants du sein. PhD thesis, Université Paris 11, 2009. [ bib ]
Keywords: breastcancer, microarray
[Rehm2009Dynamics] M. Rehm, H. J. Huber, C. T. Hellwig, S. Anguissola, H. Dussmann, and J. H. M. Prehn. Dynamics of outer mitochondrial membrane permeabilization during apoptosis. Cell Death Differ., 16:613-623, 2009. [ bib | .html ]
Keywords: csbcbook
[Rehm2009CellDeathDiff] M. Rehm, H. J. Huber, C. T. Hellwig, S. Anguissola, H. Dussmann, and J. H. M. Prehn. Dynamics of outer mitochondrial membrane permeabilization during apoptosis. Cell Death and Differentiation, 16:613-623, 2009. [ bib | .html ]
Keywords: csbcbook
[Ravikumar2009Model] Pradeep Ravikumar, Garvesh Raskutti, Martin Wainwright, and Bin Yu. Model selection in gaussian graphical models: High-dimensional consistency of ℓ1-regularized mle. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1329-1336. MIT Press, 2009. [ bib ]
[Pushkarev2009Single] Dmitry Pushkarev, Norma F Neff, and Stephen R Quake. Single-molecule sequencing of an individual human genome. Nat Biotechnol, 27(9):847-852, Sep 2009. [ bib | DOI | http ]
Recent advances in high-throughput DNA sequencing technologies have enabled order-of-magnitude improvements in both cost and throughput. Here we report the use of single-molecule methods to sequence an individual human genome. We aligned billions of 24- to 70-bp reads (32 bp average) to approximately 90% of the National Center for Biotechnology Information (NCBI) reference genome, with 28x average coverage. Our results were obtained on one sequencing instrument by a single operator with four data collection runs. Single-molecule sequencing enabled analysis of human genomic information without the need for cloning, amplification or ligation. We determined approximately 2.8 million single nucleotide polymorphisms (SNPs) with a false-positive rate of less than 1% as validated by Sanger sequencing and 99.8% concordance with SNP genotyping arrays. We identified 752 regions of copy number variation by analyzing coverage depth alone and validated 27 of these using digital PCR. This milestone should allow widespread application of genome sequencing to many aspects of genetics and human health, including personal genomics.

Keywords: Computer Simulation; Genome, Human; Genomics, methods; Humans; Polymorphism, Single Nucleotide; Reproducibility of Results; Sequence Analysis, DNA, methods
[Praz2009CleanEx:] Viviane Praz and Philipp Bucher. Cleanex: new data extraction and merging tools based on mesh term annotation. Nucleic Acids Res, 37(Database issue):D880-D884, Jan 2009. [ bib | DOI | http ]
The CleanEx expression database (http://www.cleanex.isb-sib.ch) provides access to public gene expression data via unique gene names as well as via the biomedical characteristics of experiments. To achieve this, a dual annotation of both sequences and experiments has been generated. First, the system links official gene symbols to any kind of sequences used for gene expression measurements (cDNA, Affymetrix, oligonucleotide arrays, SAGE or MPSS tags, Expressed Sequence Tags or other mRNA sequences, etc.). For the biomedical annotation, we re-annotate each experiment from the CleanEx database with the MeSH (Medical Subject Headings) terms, primarily used by NLM (National Library of Medicine) for indexing articles for the MEDLINE/PubMed database. This annotation allows a fast and easy retrieval of expression data with common biological or medical features. The numerical data can then be exported as matrix-like tab-delimited text files. Data can be extracted from either one dataset or from heterogeneous datasets.

Keywords: Animals; Chromosome Mapping; Databases, Genetic; Gene Expression Profiling; Humans; Medical Subject Headings; Mice; Oligonucleotide Array Sequence Analysis; Software
[Popova2009Genome] Tatiana Popova, Elodie Manié, Dominique Stoppa-Lyonnet, Guillem Rigaill, Emmanuel Barillot, and Marc Henri Stern. Genome alteration print (gap): a tool to visualize and mine complex cancer genomic profiles obtained by snp arrays. Genome Biol, 10(11):R128, 2009. [ bib | DOI | http ]
We describe a method for automatic detection of absolute segmental copy numbers and genotype status in complex cancer genome profiles measured with single-nucleotide polymorphism (SNP) arrays. The method is based on pattern recognition of segmented and smoothed copy number and allelic imbalance profiles. Assignments were verified by DNA indexes of primary tumors and karyotypes of cell lines. The method performs well even for poor-quality data, low tumor content, and highly rearranged tumor genomes.

Keywords: Allelic Imbalance; Automation; Breast Neoplasms, genetics; Gene Expression Profiling; Gene Expression Regulation, Neoplastic; Genome; Genomics; Genotype; Homozygote; Humans; Karyotyping; Loss of Heterozygosity; Models, Genetic; Ploidies; Polymorphism, Single Nucleotide
[Philippi2009BMCSysBio] N. Philippi, D. Walter, R. Schlatter, K. Ferreira, M. Ederer, O. Sawodny, J. Timmer, C. Borner, and T. Dandekar. Modeling system states in liver cells: survival, apoptosis and their modifications in response to viral infection. BMC Syst Biol, 3:97, 2009. [ bib ]
BACKGROUND: The decision pro- or contra apoptosis is complex, involves a number of different inputs, and is central for the homeostasis of an individual cell as well as for the maintenance and regeneration of the complete organism. RESULTS: This study centers on Fas ligand (FasL)-mediated apoptosis, and a complex and internally strongly linked network is assembled around the central FasL-mediated apoptosis cascade. Different bioinformatical techniques are employed and different crosstalk possibilities including the integrin pathway are considered. This network is translated into a Boolean network (74 nodes, 108 edges). System stability is dynamically sampled and investigated using the software SQUAD. Testing a number of alternative crosstalk possibilities and networks we find that there are four stable system states, two states comprising cell survival and two states describing apoptosis by the intrinsic and the extrinsic pathways, respectively. The model is validated by comparing it to experimental data from kinetics of cytochrome c release and caspase activation in wildtype and Bid knockout cells grown on different substrates. Pathophysiological modifications such as input from cytomegalovirus proteins M36 and M45 again produces output behavior that well agrees with experimental data. CONCLUSION: A network model for apoptosis and crosstalk in hepatocytes shows four different system states and reproduces a number of different conditions around apoptosis including effects of different growth substrates and viral infections. It produces semi-quantitative predictions on the activity of individual nodes, agreeing with experimental data. The model (SBML format) and all data are available for further predictions and development.

Keywords: csbcbook
[Pelckmans2009Transductively] K. Pelckmans and J.A.K. Suykens. Transductively learning from positive examples only. In Proc. of the European Symposium on Artificial Neural Networks (ESANN 2009), 2009. [ bib | .pdf | .pdf ]
Keywords: PUlearning
[Parkhomenko2009Sparse] E. Parkhomenko, D. Tritchler, and J. Beyene. Sparse canonical correlation analysis with application to genomic data integration. Stat Appl Genet Mol Biol, 8(1):Article 1, Jan 2009. [ bib | DOI | http ]
Large scale genomic studies with multiple phenotypic or genotypic measures may require the identification of complex multivariate relationships. In multivariate analysis a common way to inspect the relationship between two sets of variables based on their correlation is canonical correlation analysis, which determines linear combinations of all variables of each type with maximal correlation between the two linear combinations. However, in high dimensional data analysis, when the number of variables under consideration exceeds tens of thousands, linear combinations of the entire sets of features may lack biological plausibility and interpretability. In addition, insufficient sample size may lead to computational problems, inaccurate estimates of parameters and non-generalizable results. These problems may be solved by selecting sparse subsets of variables, i.e. obtaining sparse loadings in the linear combinations of variables of each type. In this paper we present Sparse Canonical Correlation Analysis (SCCA) which examines the relationships between two types of variables and provides sparse solutions that include only small subsets of variables of each type by maximizing the correlation between the subsets of variables of different types while performing variable selection. We also present an extension of SCCA, adaptive SCCA. We evaluate their properties using simulated data and illustrate practical use by applying both methods to the study of natural variation in human gene expression.

Keywords: Algorithms; Genomics, statistics & numerical data; Humans; Models, Statistical; Sample Size
[Parker2009Supervised] Joel S Parker, Michael Mullins, Maggie C U Cheang, Samuel Leung, David Voduc, Tammi Vickery, Sherri Davies, Christiane Fauron, Xiaping He, Zhiyuan Hu, John F Quackenbush, Inge J Stijleman, Juan Palazzo, J. S. Marron, Andrew B Nobel, Elaine Mardis, Torsten O Nielsen, Matthew J Ellis, Charles M Perou, and Philip S Bernard. Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol, 27(8):1160-1167, Mar 2009. [ bib | DOI | http ]
PURPOSE: To improve on current standards for breast cancer prognosis and prediction of chemotherapy benefit by developing a risk model that incorporates the gene expression-based "intrinsic" subtypes luminal A, luminal B, HER2-enriched, and basal-like. METHODS: A 50-gene subtype predictor was developed using microarray and quantitative reverse transcriptase polymerase chain reaction data from 189 prototype samples. Test sets from 761 patients (no systemic therapy) were evaluated for prognosis, and 133 patients were evaluated for prediction of pathologic complete response (pCR) to a taxane and anthracycline regimen. RESULTS: The intrinsic subtypes as discrete entities showed prognostic significance (P = 2.26E-12) and remained significant in multivariable analyses that incorporated standard parameters (estrogen receptor status, histologic grade, tumor size, and node status). A prognostic model for node-negative breast cancer was built using intrinsic subtype and clinical information. The C-index estimate for the combined model (subtype and tumor size) was a significant improvement on either the clinicopathologic model or subtype model alone. The intrinsic subtype model predicted neoadjuvant chemotherapy efficacy with a negative predictive value for pCR of 97%. CONCLUSION: Diagnosis by intrinsic subtype adds significant prognostic and predictive information to standard parameters for patients with breast cancer. The prognostic properties of the continuous risk score will be of value for the management of node-negative breast cancers. The subtypes and risk score can also be used to assess the likelihood of efficacy from neoadjuvant chemotherapy.

Keywords: Adult; Aged; Breast Neoplasms, classification/drug therapy/etiology/mortality; Chemotherapy, Adjuvant; Female; Humans; Middle Aged; Neoplasm Recurrence, Local, etiology; Prognosis; Receptor, erbB-2, analysis; Receptors, Estrogen, analysis; Reverse Transcriptase Polymerase Chain Reaction; Risk
[Park2009ChIP] Peter J Park. Chip-seq: advantages and challenges of a maturing technology. Nat Rev Genet, 10(10):669-680, Oct 2009. [ bib | DOI | http ]
Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is a technique for genome-wide profiling of DNA-binding proteins, histone modifications or nucleosomes. Owing to the tremendous progress in next-generation sequencing technology, ChIP-seq offers higher resolution, less noise and greater coverage than its array-based predecessor ChIP-chip. With the decreasing cost of sequencing, ChIP-seq has become an indispensable tool for studying gene regulation and epigenetic mechanisms. In this Review, I describe the benefits and challenges in harnessing this technique with an emphasis on issues related to experimental design and data analysis. ChIP-seq experiments generate large quantities of data, and effective computational analysis will be crucial for uncovering biological mechanisms.

Keywords: Animals; Chromatin Immunoprecipitation, methods; Computational Biology; DNA-Binding Proteins, genetics; Epigenesis, Genetic; Humans; Nucleosomes, genetics; Sequence Analysis, DNA, methods
[Nilsson2009reliable] R. Nilsson, J. Björkegren, and J. Tegnér. On reliable discovery of molecular signatures. BMC Bioinformatics, 10:38, 2009. [ bib | DOI | http | .pdf ]
Background: Molecular signatures are sets of genes, proteins, genetic variants or other variables that can be used as markers for a particular phenotype. Reliable signature discovery methods could yield valuable insight into cell biology and mechanisms of human disease. However, it is currently not clear how to control error rates such as the false discovery rate (FDR) in signature discovery. Moreover, signatures for cancer gene expression have been shown to be unstable, that is, difficult to replicate in independent studies, casting doubts on their reliability. Results: We demonstrate that with modern prediction methods, signatures that yield accurate predictions may still have a high FDR. Further, we show that even signatures with low FDR may fail to replicate in independent studies due to limited statistical power. Thus, neither stability nor predictive accuracy are relevant when FDR control is the primary goal. We therefore develop a general statistical hypothesis testing framework that for the first time provides FDR control for signature discovery. Our method is demonstrated to be correct in simulation studies. When applied to five cancer data sets, the method was able to discover molecular signatures with 5% FDR in three cases, while two data sets yielded no significant findings. Conclusion: Our approach enables reliable discovery of molecular signatures from genome-wide data with current sample sizes. The statistical framework developed herein is potentially applicable to a wide range of prediction problems in bioinformatics.

[Mustafi2009Topology] Debarshi Mustafi and Krzysztof Palczewski. Topology of class a g protein-coupled receptors: insights gained from crystal structures of rhodopsins, adrenergic and adenosine receptors. Mol Pharmacol, 75(1):1-12, Jan 2009. [ bib | DOI | http ]
Biological membranes are densely packed with membrane proteins that occupy approximately half of their volume. In almost all cases, membrane proteins in the native state lack the higher-order symmetry required for their direct study by diffraction methods. Despite many technical difficulties, numerous crystal structures of detergent solubilized membrane proteins have been determined that illustrate their internal organization. Among such proteins, class A G protein-coupled receptors have become amenable to crystallization and high resolution X-ray diffraction analyses. The derived structures of native and engineered receptors not only provide insights into their molecular arrangements but also furnish a framework for designing and testing potential models of transformation from inactive to active receptor signaling states and for initiating rational drug design.

Keywords: Animals; Crystallography, X-Ray; Humans; Models, Molecular; Protein Structure, Secondary; Receptors, Adrenergic; Receptors, G-Protein-Coupled; Receptors, Purinergic P1; Rhodopsin
[McKernan2009Sequence] Kevin Judd McKernan, Heather E Peckham, Gina L Costa, Stephen F McLaughlin, Yutao Fu, Eric F Tsung, Christopher R Clouser, Cisyla Duncan, Jeffrey K Ichikawa, Clarence C Lee, Zheng Zhang, Swati S Ranade, Eileen T Dimalanta, Fiona C Hyland, Tanya D Sokolsky, Lei Zhang, Andrew Sheridan, Haoning Fu, Cynthia L Hendrickson, Bin Li, Lev Kotler, Jeremy R Stuart, Joel A Malek, Jonathan M Manning, Alena A Antipova, Damon S Perez, Michael P Moore, Kathleen C Hayashibara, Michael R Lyons, Robert E Beaudoin, Brittany E Coleman, Michael W Laptewicz, Adam E Sannicandro, Michael D Rhodes, Rajesh K Gottimukkala, Shan Yang, Vineet Bafna, Ali Bashir, Andrew MacBride, Can Alkan, Jeffrey M Kidd, Evan E Eichler, Martin G Reese, Francisco M De La Vega, and Alan P Blanchard. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Res., 19(9):1527-1541, Sep 2009. [ bib | DOI | http | .pdf ]
We describe the genome sequencing of an anonymous individual of African origin using a novel ligation-based sequencing assay that enables a unique form of error correction that improves the raw accuracy of the aligned reads to >99.9%, allowing us to accurately call SNPs with as few as two reads per allele. We collected several billion mate-paired reads yielding approximately 18x haploid coverage of aligned sequence and close to 300x clone coverage. Over 98% of the reference genome is covered with at least one uniquely placed read, and 99.65% is spanned by at least one uniquely placed mate-paired clone. We identify over 3.8 million SNPs, 19% of which are novel. Mate-paired data are used to physically resolve haplotype phases of nearly two-thirds of the genotypes obtained and produce phased segments of up to 215 kb. We detect 226,529 intra-read indels, 5590 indels between mate-paired reads, 91 inversions, and four gene fusions. We use a novel approach for detecting indels between mate-paired reads that are smaller than the standard deviation of the insert size of the library and discover deletions in common with those detected with our intra-read approach. Dozens of mutations previously described in OMIM and hundreds of nonsynonymous single-nucleotide and structural variants in genes previously implicated in disease are identified in this individual. There is more genetic variation in the human genome still to be uncovered, and we provide guidance for future surveys in populations and cancer biopsies.

Keywords: ngs
[Mayr2009Novel] Lorenz M Mayr and Dejan Bojanic. Novel trends in high-throughput screening. Curr Opin Pharmacol, 9(5):580-588, Oct 2009. [ bib | DOI | http ]
High-throughput screening (HTS) is a well-established process for lead discovery in Pharma and Biotech companies and is now also being used for basic and applied research in academia. It comprises the screening of large chemical libraries for activity against biological targets via the use of automation, miniaturized assays and large-scale data analysis. Since its first advent in the early to mid 1990s, the field of HTS has seen not only a continuous change in technology and processes, but also an adaptation to various needs in lead discovery. HTS has now evolved into a mature discipline that is a crucial source of chemical starting points for drug discovery. Whereas in previous years much emphasis has been put on a steady increase in screening capacity ('quantitative increase') via automation and miniaturization, the past years have seen a much greater emphasis on content and quality ('qualitative increase'). Today, many experts in the field see HTS at a crossroad with the need to decide on either higher throughput/more experimentation or a greater focus on assays of greater physiological relevance, both of which may lead to higher productivity in pharmaceutical R&D. In this paper, we describe the development of HTS over the past decade and point out our own ideas for future directions of HTS in biomedical research. We predict that the trend toward further miniaturization will slow down with the balanced implementation of 384 well, 1536 well, and 384 low volume well plates. Furthermore, we envisage that there will be much more emphasis on rigorous assay and chemical characterization, particularly considering that novel and more difficult target classes will be pursued. In recent years we have witnessed a clear trend in the drug discovery community toward rigorous hit validation by the use of orthogonal readout technologies, label free and biophysical methodologies. 
We also see a trend toward a more flexible use of the various screening approaches in lead discovery, that is, the use of both full deck compound screening as well as the use of focused screening and iterative screening approaches. Moreover, we expect greater usage of target identification strategies downstream of phenotypic screening and the more effective implementation of affinity selection technologies as a result of advances in chemical diversity methodologies. We predict that, ultimately, each hit finding strategy will be much more project-related, tailor-made, and better integrated into the broader drug discovery efforts.

Keywords: Animals; Automation, Laboratory; Computer Simulation; Computer-Aided Design, trends; Cost-Benefit Analysis; Drug Discovery, economics/standards/trends; High-Throughput Screening Assays, economics/standards/trends; Humans; Miniaturization; Models, Molecular; Quality Control; Small Molecule Libraries; Structure-Activity Relationship; Systems Integration; Time Factors
[Marbach2009Generating] D. Marbach, T. Schaffter, C. Mattiussi, and D. Floreano. Generating realistic in silico gene networks for performance assessment of reverse engineering methods. J. Comput. Biol., 16(2):229-239, 2009. [ bib | DOI | http ]
[Marbach2009Replaying] D. Marbach, C. Mattiussi, and D. Floreano. Replaying the evolutionary tape: biomimetic reverse engineering of gene networks. Ann N Y Acad Sci, 1158:234-245, Mar 2009. [ bib | DOI | http ]
In this paper, we suggest a new approach for reverse engineering gene regulatory networks, which consists of using a reconstruction process that is similar to the evolutionary process that created these networks. The aim is to integrate prior knowledge into the reverse-engineering procedure, thus biasing the search toward biologically plausible solutions. To this end, we propose an evolutionary method that abstracts and mimics the natural evolution of gene regulatory networks. Our method can be used with a wide range of nonlinear dynamical models. This allows us to explore novel model types such as the log-sigmoid model introduced here. We apply the biomimetic method to a gold-standard dataset from an in vivo gene network. The obtained results won a reverse engineering competition of the second DREAM conference (Dialogue on Reverse Engineering Assessments and Methods 2007, New York, NY).

Keywords: Algorithms; Biomimetics; Computational Biology; Databases, Genetic; Evolution; Gene Regulatory Networks; Models, Biological; Nonlinear Dynamics
[Mahe2009Graph] P. Mahé and J. P. Vert. Graph kernels based on tree patterns for molecules. Mach. Learn., 75(1):3-35, 2009. [ bib | DOI | http | .pdf ]
[Levy-Leduc2009Detection] C. Lévy-Leduc and F. Roueff. Detection and localization of change-points in high-dimensional network traffic data. Ann. Appl. Stat., 3(2):637-662, 2009. [ bib | DOI | http | .pdf ]
[Lounici2009Taking] Karim Lounici, Massimiliano Pontil, Alexandre B. Tsybakov, and Sara van de Geer. Taking advantage of sparsity in multi-task learning. In Proceedings of COLT, 2009. [ bib ]
[Liu2009Nonparametric] Han Liu, John Lafferty, and Larry Wasserman. Nonparametric regression and classification with joint sparsity constraints. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 969-976. MIT Press, 2009. [ bib ]
[Linghu2009Genome-wide] B. Linghu, E.S. Snitkin, Z. Hu, Y. Xia, and C. Delisi. Genome-wide prioritization of disease genes and identification of disease-disease associations from an integrated human functional linkage network. Genome Biol., 10(9):R91, 2009. [ bib | DOI | http ]
We integrate 16 genomic features to construct an evidence-weighted functional-linkage network comprising 21,657 human genes. The functional-linkage network is used to prioritize candidate genes for 110 diseases, and to reliably disclose hidden associations between disease pairs having dissimilar phenotypes, such as hypercholesterolemia and Alzheimer's disease. Many of these disease-disease associations are supported by epidemiology, but with no previous genetic basis. Such associations can drive novel hypotheses on molecular mechanisms of diseases and therapies.

[Lima-Mendez2009powerful] G. Lima-Mendez and J. van Helden. The powerful law of the power law and other myths in network biology. Mol Biosyst, 5(12):1482-1493, Dec 2009. [ bib | DOI | http ]
For almost 10 years, topological analysis of different large-scale biological networks (metabolic reactions, protein interactions, transcriptional regulation) has been highlighting some recurrent properties: power law distribution of degree, scale-freeness, small world, which have been proposed to confer functional advantages such as robustness to environmental changes and tolerance to random mutations. Stochastic generative models inspired different scenarios to explain the growth of interaction networks during evolution. The power law and the associated properties appeared so ubiquitous in complex networks that they were qualified as "universal laws". However, these properties are no longer observed when the data are subjected to statistical tests: in most cases, the data do not fit the expected theoretical models, and the cases of good fitting merely result from sampling artefacts or improper data representation. The field of network biology seems to be founded on a series of myths, i.e. widely believed but false ideas. The weaknesses of these foundations should however not be considered as a failure for the entire domain. Network analysis provides a powerful frame for understanding the function and evolution of biological processes, provided it is brought to an appropriate level of description, by focussing on smaller functional modules and establishing the link between their topological properties and their dynamical behaviour.

Keywords: Computational Biology, methods; Gene Regulatory Networks; Metabolic Networks and Pathways; Models, Biological; Semantics; Signal Transduction
[Lievens2009Mammalian] Sam Lievens, Irma Lemmens, and Jan Tavernier. Mammalian two-hybrids come of age. Trends Biochem Sci, 34(11):579-588, Nov 2009. [ bib | DOI | http ]
A diverse series of mammalian two-hybrid technologies for the detection of protein-protein interactions have emerged in the past few years, complementing the established yeast two-hybrid approach. Given the mammalian background in which they operate, these assays open new avenues to study the dynamics of mammalian protein interaction networks, i.e. the temporal, spatial and functional modulation of protein-protein associations. In addition, novel assay formats are available that enable high-throughput mammalian two-hybrid applications, facilitating their use in large-scale interactome mapping projects. Finally, as they can be applied in drug discovery and development programs, these techniques also offer exciting new opportunities for biomedical research.

Keywords: Animals; Genes, Reporter; Humans; Models, Biological; Protein Binding; Protein Interaction Mapping; Proteins; Recombinant Fusion Proteins; Transfection; Two-Hybrid System Techniques
[Lieberman-Aiden2009Comprehensive] Erez Lieberman-Aiden, Nynke L van Berkum, Louise Williams, Maxim Imakaev, Tobias Ragoczy, Agnes Telling, Ido Amit, Bryan R Lajoie, Peter J Sabo, Michael O Dorschner, Richard Sandstrom, Bradley Bernstein, M. A. Bender, Mark Groudine, Andreas Gnirke, John Stamatoyannopoulos, Leonid A Mirny, Eric S Lander, and Job Dekker. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science, 326(5950):289-293, Oct 2009. [ bib | DOI | http | .pdf ]
We describe Hi-C, a method that probes the three-dimensional architecture of whole genomes by coupling proximity-based ligation with massively parallel sequencing. We constructed spatial proximity maps of the human genome with Hi-C at a resolution of 1 megabase. These maps confirm the presence of chromosome territories and the spatial proximity of small, gene-rich chromosomes. We identified an additional level of genome organization that is characterized by the spatial segregation of open and closed chromatin to form two genome-wide compartments. At the megabase scale, the chromatin conformation is consistent with a fractal globule, a knot-free, polymer conformation that enables maximally dense packing while preserving the ability to easily fold and unfold any genomic locus. The fractal globule is distinct from the more commonly used globular equilibrium model. Our results demonstrate the power of Hi-C to map the dynamic conformations of whole genomes.

Keywords: hic, ngs
[Li2009SNP] R. Li, Y. Li, X. Fang, H. Yang, J. Wang, K. Kristiansen, and J. Wang. SNP detection for massively parallel whole-genome resequencing. Genome Res., 19(6):1124-1132, Jun 2009. [ bib | DOI | http | .pdf ]
Next-generation massively parallel sequencing technologies provide ultrahigh throughput at two orders of magnitude lower unit cost than capillary Sanger sequencing technology. One of the key applications of next-generation sequencing is studying genetic variation between individuals using whole-genome or target region resequencing. Here, we have developed a consensus-calling and SNP-detection method for sequencing-by-synthesis Illumina Genome Analyzer technology. We designed this method by carefully considering the data quality, alignment, and experimental errors common to this technology. All of this information was integrated into a single quality score for each base under Bayesian theory to measure the accuracy of consensus calling. We tested this methodology using a large-scale human resequencing data set of 36x coverage and assembled a high-quality nonrepetitive consensus sequence for 92.25% of the diploid autosomes and 88.07% of the haploid X chromosome. Comparison of the consensus sequence with Illumina human 1M BeadChip genotyped alleles from the same DNA sample showed that 98.6% of the 37,933 genotyped alleles on the X chromosome and 98% of 999,981 genotyped alleles on autosomes were covered at 99.97% and 99.84% consistency, respectively. At a low sequencing depth, we used prior probability of dbSNP alleles and were able to improve coverage of the dbSNP sites significantly as compared to that obtained using a nonimputation model. Our analyses demonstrate that our method has a very low false call rate at any sequencing depth and excellent genome coverage at a high sequencing depth.

Keywords: ngs
[Li2009BriefBioinform] C. Li, M. Courtot, N. Le Novere, and C. Laibe. BioModels.net web services, a free and integrated toolkit for computational modelling software. Brief Bioinform, 2009. [ bib ]
Exchanging and sharing scientific results are essential for researchers in the field of computational modelling. BioModels.net defines agreed-upon standards for model curation. A fundamental one, MIRIAM (Minimum Information Requested in the Annotation of Models), standardises the annotation and curation process of quantitative models in biology. To support this standard, MIRIAM Resources maintains a set of standard data types for annotating models, and provides services for manipulating these annotations. Furthermore, BioModels.net creates controlled vocabularies, such as SBO (Systems Biology Ontology) which strictly indexes, defines and links terms used in Systems Biology. Finally, BioModels Database provides a free, centralised, publicly accessible database for storing, searching and retrieving curated and annotated computational models. Each resource provides a web interface to submit, search, retrieve and display its data. In addition, the BioModels.net team provides a set of Web Services which allows the community to programmatically access the resources. A user is then able to perform remote queries, such as retrieving a model and resolving all its MIRIAM Annotations, as well as getting the details about the associated SBO terms. These web services use established standards. Communications rely on SOAP (Simple Object Access Protocol) messages and the available queries are described in a WSDL (Web Services Description Language) file. Several libraries are provided in order to simplify the development of client software. BioModels.net Web Services make one step further for the researchers to simulate and understand the entirety of a biological system, by allowing them to retrieve biological models in their own tool, combine queries in workflows and efficiently analyse models.

Keywords: csbcbook
[Lenovere2009The] Nicolas Le Novère, Michael Hucka, Huaiyu Mi, Stuart Moodie, Falk Schreiber, Anatoly Sorokin, Emek Demir, Katja Wegner, Mirit I Aladjem, Sarala M Wimalaratne, Frank T Bergman, Ralph Gauges, Peter Ghazal, Hideya Kawaji, Lu Li, Yukiko Matsuoka, Alice Villéger, Sarah E Boyd, Laurence Calzone, Melanie Courtot, Ugur Dogrusoz, Tom C Freeman, Akira Funahashi, Samik Ghosh, Akiya Jouraku, Sohyoung Kim, Fedor Kolpakov, Augustin Luna, Sven Sahle, Esther Schmidt, Steven Watterson, Guanming Wu, Igor Goryanin, Douglas B Kell, Chris Sander, Herbert Sauro, Jacky L Snoep, Kurt Kohn, and Hiroaki Kitano. The systems biology graphical notation. Nat Biotechnol, 27(8):735-741, Aug 2009. [ bib | DOI | http ]
Circuit diagrams and Unified Modeling Language diagrams are just two examples of standard visual languages that help accelerate work by promoting regularity, removing ambiguity and enabling software tool support for communication of complex information. Ironically, despite having one of the highest ratios of graphical to textual information, biology still lacks standard graphical notations. The recent deluge of biological knowledge makes addressing this deficit a pressing concern. Toward this goal, we present the Systems Biology Graphical Notation (SBGN), a visual language developed by a community of biochemists, modelers and computer scientists. SBGN consists of three complementary languages: process diagram, entity relationship diagram and activity flow diagram. Together they enable scientists to represent networks of biochemical interactions in a standard, unambiguous way. We believe that SBGN will foster efficient and accurate representation, visualization, storage, exchange and reuse of information on all kinds of biological knowledge, from gene regulation, to metabolism, to cellular signaling.

[Langmead2009Ultrafast] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol, 10(3):R25, 2009. [ bib | DOI | http | .pdf ]
Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. For the human genome, Burrows-Wheeler indexing allows Bowtie to align more than 25 million reads per CPU hour with a memory footprint of approximately 1.3 gigabytes. Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve even greater alignment speeds. Bowtie is open source (http://bowtie.cbcb.umd.edu).

Keywords: ngs
[LeCao2009Sparse] K.-A. Lê Cao, P. G. P. Martin, C. Robert-Granié, and P. Besse. Sparse canonical methods for biological data integration: application to a cross-platform study. BMC Bioinformatics, 10:34, 2009. [ bib | DOI | http ]
In the context of systems biology, few sparse approaches have been proposed so far to integrate several data sets. It is however an important and fundamental issue that will be widely encountered in post genomic studies, when simultaneously analyzing transcriptomics, proteomics and metabolomics data using different platforms, so as to understand the mutual interactions between the different data sets. In this high dimensional setting, variable selection is crucial to give interpretable results. We focus on a sparse Partial Least Squares approach (sPLS) to handle two-block data sets, where the relationship between the two types of variables is known to be symmetric. Sparse PLS has been developed either for a regression or a canonical correlation framework and includes a built-in procedure to select variables while integrating data. To illustrate the canonical mode approach, we analyzed the NCI60 data sets, where two different platforms (cDNA and Affymetrix chips) were used to study the transcriptome of sixty cancer cell lines. We compare the results obtained with two other sparse or related canonical correlation approaches: CCA with Elastic Net penalization (CCA-EN) and Co-Inertia Analysis (CIA). The latter does not include a built-in procedure for variable selection and requires a two-step analysis. We stress the lack of statistical criteria to evaluate canonical correlation methods, which makes biological interpretation absolutely necessary to compare the different gene selections. We also propose comprehensive graphical representations of both samples and variables to facilitate the interpretation of the results. sPLS and CCA-EN selected highly relevant genes and complementary findings from the two data sets, which enabled a detailed understanding of the molecular characteristics of several groups of cell lines. These two approaches were found to bring similar results, although they highlighted the same phenomenons with a different priority. They outperformed CIA that tended to select redundant information.

Keywords: Computational Biology, methods; Genomics; Metabolomics; Proteomics; Systems Biology, methods
[Korbel2009PEMer] J. Korbel, A. Abyzov, X. Mu, N. Carriero, P. Cayting, Z. Zhang, Z. Snyder, and M. Gerstein. PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data. Genome Biol., 10(2):R23, Feb 2009. [ bib | DOI | http | .pdf ]
Personal-genomics endeavors, such as the 1000 Genomes project, are generating maps of genomic structural variants by analyzing ends of massively sequenced genome fragments. To process these we developed Paired-End Mapper (PEMer; http://sv.gersteinlab.org/pemer). This comprises an analysis pipeline, compatible with several next-generation sequencing platforms; simulation-based error models, yielding confidence-values for each structural variant; and a back-end database. The simulations demonstrated high structural variant reconstruction efficiency for PEMer's coverage-adjusted multi-cutoff scoring-strategy and showed its relative insensitivity to base-calling errors.

Keywords: ngs
[Kondor2009graphlet] R. Kondor, N. Shervashidze, and K. M. Borgwardt. The graphlet spectrum. In ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning, pages 529-536, New York, NY, USA, 2009. ACM. [ bib | DOI ]
[Koller2009Probabilistic] D. Koller and N. Friedman. Probabilistic Graphical Models. MIT Press, 2009. [ bib ]
[Kim2009Effects] S.Y. Kim. Effects of sample size on robustness and prediction accuracy of a prognostic gene signature. BMC Bioinformatics, 10(1):147, 2009. [ bib ]
[Karni2009network-based] S. Karni, H. Soreq, and R. Sharan. A network-based method for predicting disease-causing genes. J. Comput. Biol., 16(2):181-189, Feb 2009. [ bib | DOI | http | .pdf ]
A fundamental problem in human health is the inference of disease-causing genes, with important applications to diagnosis and treatment. Previous work in this direction relied on knowledge of multiple loci associated with the disease, or causal genes for similar diseases, which limited its applicability. Here we present a new approach to causal gene prediction that is based on integrating protein-protein interaction network data with gene expression data under a condition of interest. The latter are used to derive a set of disease-related genes which is assumed to be in close proximity in the network to the causal genes. Our method applies a set-cover-like heuristic to identify a small set of genes that best "cover" the disease-related genes. We perform comprehensive simulations to validate our method and test its robustness to noise. In addition, we validate our method on real gene expression data and on gene specific knockouts. Finally, we apply it to suggest possible genes that are involved in myasthenia gravis.

[Kaplunovsky2009Statistics] A. Kaplunovsky, V. Khailenko, A. Bolshoy, S. Atambayeva, and A. Ivashchenko. Statistics of exon lengths in animals, plants, fungi, and protists, 2009. [ bib ]
[Jiang2009Statistical] H. Jiang and W. H. Wong. Statistical inferences for isoform expression in RNA-Seq. Bioinformatics, 25(8):1026-1032, Apr 2009. [ bib | DOI | http | .pdf ]
SUMMARY: The development of RNA sequencing (RNA-Seq) makes it possible for us to measure transcription at an unprecedented precision and throughput. However, challenges remain in understanding the source and distribution of the reads, modeling the transcript abundance and developing efficient computational methods. In this article, we develop a method to deal with the isoform expression estimation problem. The count of reads falling into a locus on the genome annotated with multiple isoforms is modeled as a Poisson variable. The expression of each individual isoform is estimated by solving a convex optimization problem and statistical inferences about the parameters are obtained from the posterior distribution by importance sampling. Our results show that isoform expression inference in RNA-Seq is possible by employing appropriate statistical methods.

Keywords: ngs, rnaseq
[Jiang2009Compensatory] D. Jiang, S. Zhou, and Y.-P. P. Chen. Compensatory ability to null mutation in metabolic networks. Biotechnol. Bioeng., 103(2):361-369, Jun 2009. [ bib | DOI | http | .pdf ]
Robustness is an inherent property of biological system. It is still a limited understanding of how it is accomplished at the cellular or molecular level. To this end, this article analyzes the impact degree of each reaction to others, which is defined as the number of cascading failures of following and/or forward reactions when an initial reaction is deleted. By analyzing more than 800 organism's metabolic networks, it suggests that the reactions with larger impact degrees are likely essential and the universal reactions should also be essential. Alternative metabolic pathways compensate null mutations, which represents that average impact degrees for all organisms are small. Interestingly, average impact degrees of archaea organisms are smaller than other two categories of organisms, eukaryote and bacteria, indicating that archaea organisms have strong robustness to resist the various perturbations during the evolution process. The results show that scale-free feature and reaction reversibility contribute to the robustness in metabolic networks. The optimal growth temperature of organism also relates the robust structure of metabolic network.

[Jensen2009STRING] L.J. Jensen, M. Kuhn, M. Stark, S. Chaffron, C. Creevey, J. Muller, T. Doerks, P. Julien, A. Roth, M. Simonovic, P. Bork, and C. von Mering. STRING 8 - a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res, 37(Database issue):D412-D416, Jan 2009. [ bib | DOI | http ]
Functional partnerships between proteins are at the core of complex cellular phenotypes, and the networks formed by interacting proteins provide researchers with crucial scaffolds for modeling, data reduction and annotation. STRING is a database and web resource dedicated to protein-protein interactions, including both physical and functional interactions. It weights and integrates information from numerous sources, including experimental repositories, computational prediction methods and public text collections, thus acting as a meta-database that maps all interaction evidence onto a common set of genomes and proteins. The most important new developments in STRING 8 over previous releases include a URL-based programming interface, which can be used to query STRING from other resources, improved interaction prediction via genomic neighborhood in prokaryotes, and the inclusion of protein structures. Version 8.0 of STRING covers about 2.5 million proteins from 630 organisms, providing the most comprehensive view on protein-protein interactions currently available. STRING can be reached at http://string-db.org/.

Keywords: Databases, Protein; Genomics; Multiprotein Complexes; Protein Interaction Mapping; Proteins; User-Computer Interface
[Jenatton2009Structured] R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Technical Report 0904.3523, arXiv, 2009. [ bib | http | .pdf ]
We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual 1-norm and the group 1-norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero patterns. We also present an efficient active set algorithm, and analyze the consistency of variable selection for least-squares linear regression in low and high-dimensional settings.

Keywords: variable selection;sparsity; convex optimization;learning theory
[Jacob2009Group] L. Jacob, G. Obozinski, and J.-P. Vert. Group lasso with overlap and graph lasso. In ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning, pages 433-440, New York, NY, USA, 2009. ACM. [ bib | DOI | .pdf ]
[Jacob2009Clustered] L. Jacob, F. Bach, and J.-P. Vert. Clustered multi-task learning: A convex formulation. In Advances in Neural Information Processing Systems 21, pages 745-752. MIT Press, 2009. [ bib | .pdf ]
[Ioannidis2009Repeatability] John P A Ioannidis, David B Allison, Catherine A Ball, Issa Coulibaly, Xiangqin Cui, Aedín C Culhane, Mario Falchi, Cesare Furlanello, Laurence Game, Giuseppe Jurman, Jon Mangion, Tapan Mehta, Michael Nitzberg, Grier P Page, Enrico Petretto, and Vera van Noort. Repeatability of published microarray gene expression analyses. Nat Genet, 41(2):149-155, Feb 2009. [ bib | DOI | http ]
Given the complexity of microarray-based gene expression studies, guidelines encourage transparent design and public data availability. Several journals require public data deposition and several public databases exist. However, not all data are publicly available, and even when available, it is unknown whether the published results are reproducible by independent scientists. Here we evaluated the replication of data analyses in 18 articles on microarray-based gene expression profiling published in Nature Genetics in 2005-2006. One table or figure from each article was independently evaluated by two teams of analysts. We reproduced two analyses in principle and six partially or with some discrepancies; ten could not be reproduced. The main reason for failure to reproduce was data unavailability, and discrepancies were mostly due to incomplete data annotation or specification of data processing and analysis. Repeatability of published microarray studies is apparently limited. More strict publication rules enforcing public data availability and explicit description of data processing and analysis should be considered.

Keywords: Animals; Data Interpretation, Statistical; Databases, Genetic; Gene Expression Profiling, standards; Genome-Wide Association Study, standards; Humans; Oligonucleotide Array Sequence Analysis, standards; Peer Review, Research; Publications, standards; Reproducibility of Results
[Huang2009Learning] J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 417-424. ACM, 2009. [ bib ]
[Hu2009Genetic] Xiaolan Hu, Howard M Stern, Lin Ge, Carol O'Brien, Lauren Haydu, Cynthia D Honchell, Peter M Haverty, Brock A Peters, Thomas D Wu, Lukas C Amler, John Chant, David Stokoe, Mark R Lackner, and Guy Cavet. Genetic alterations and oncogenic pathways associated with breast cancer subtypes. Mol Cancer Res, 7(4):511-522, Apr 2009. [ bib | DOI | http | .pdf ]
Breast cancers can be divided into subtypes with important implications for prognosis and treatment. We set out to characterize the genetic alterations observed in different breast cancer subtypes and to identify specific candidate genes and pathways associated with subtype biology. mRNA expression levels of estrogen receptor, progesterone receptor, and HER2 were shown to predict marker status determined by immunohistochemistry and to be effective at assigning samples to subtypes. HER2(+) cancers were shown to have the greatest frequency of high-level amplification (independent of the ERBB2 amplicon itself), but triple-negative cancers had the highest overall frequencies of copy gain. Triple-negative cancers also were shown to have more frequent loss of phosphatase and tensin homologue and mutation of RB1, which may contribute to genomic instability. We identified and validated seven regions of copy number alteration associated with different subtypes, and used integrative bioinformatics analysis to identify candidate oncogenes and tumor suppressors, including ERBB2, GRB7, MYST2, PPM1D, CCND1, HDAC2, FOXA1, and RASA1. We tested the candidate oncogene MYST2 and showed that it enhances the anchorage-independent growth of breast cancer cells. The genome-wide and region-specific differences between subtypes suggest the differential activation of oncogenic pathways.

[Horner2009Bioinformatics] D. S. Horner, G. Pavesi, T. Castrignanò, P. D. De Meo, S. Liuni, M. Sammeth, E. Picardi, and G. Pesole. Bioinformatics approaches for genomics and post genomics applications of next-generation sequencing. Brief Bioinform, Oct 2009. [ bib | DOI | http | .pdf ]
Technical advances such as the development of molecular cloning, Sanger sequencing, PCR and oligonucleotide microarrays are key to our current capacity to sequence, annotate and study complete organismal genomes. Recent years have seen the development of a variety of so-called 'next-generation' sequencing platforms, with several others anticipated to become available shortly. The previously unimaginable scale and economy of these methods, coupled with their enthusiastic uptake by the scientific community and the potential for further improvements in accuracy and read length, suggest that these technologies are destined to make a huge and ongoing impact upon genomic and post-genomic biology. However, like the analysis of microarray data and the assembly and annotation of complete genome sequences from conventional sequencing data, the management and analysis of next-generation sequencing data requires (and indeed has already driven) the development of informatics tools able to assemble, map, and interpret huge quantities of relatively or extremely short nucleotide sequence data. Here we provide a broad overview of bioinformatics approaches that have been introduced for several genomics and functional genomics applications of next-generation sequencing.

Keywords: ngs
[Hormozdiari2009Combinatorial] Fereydoun Hormozdiari, Can Alkan, Evan E Eichler, and S. Cenk Sahinalp. Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. Genome Res., 19(7):1270-1278, Jul 2009. [ bib | DOI | http | .pdf ]
Recent studies show that along with single nucleotide polymorphisms and small indels, larger structural variants among human individuals are common. The Human Genome Structural Variation Project aims to identify and classify deletions, insertions, and inversions (>5 Kbp) in a small number of normal individuals with a fosmid-based paired-end sequencing approach using traditional sequencing technologies. The realization of new ultra-high-throughput sequencing platforms now makes it feasible to detect the full spectrum of genomic variation among many individual genomes, including cancer patients and others suffering from diseases of genomic origin. Unfortunately, existing algorithms for identifying structural variation (SV) among individuals have not been designed to handle the short read lengths and the errors implied by the "next-gen" sequencing (NGS) technologies. In this paper, we give combinatorial formulations for the SV detection between a reference genome sequence and a next-gen-based, paired-end, whole genome shotgun-sequenced individual. We describe efficient algorithms for each of the formulations we give, which all turn out to be fast and quite reliable; they are also applicable to all next-gen sequencing methods (Illumina, 454 Life Sciences [Roche], ABI SOLiD, etc.) and traditional capillary sequencing technology. We apply our algorithms to identify SV among individual genomes very recently sequenced by Illumina technology.

Keywords: ngs
[Hoefling2009path] H. Hoefling. A path algorithm for the Fused Lasso Signal Approximator. Technical Report 0910.0526v1, arXiv, Oct. 2009. [ bib | .pdf ]
Keywords: segmentation
[Harismendy2009Evaluation] O. Harismendy, P. C. Ng, R. L. Strausberg, X. Wang, T. B. Stockwell, K. Y. Beeson, N. J. Schork, S. S. Murray, E. J. Topol, S. Levy, and K. A. Frazer. Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biol., 10(3):R32, 2009. [ bib | DOI | http | .pdf ]
Next generation sequencing (NGS) platforms are currently being utilized for targeted sequencing of candidate genes or genomic intervals to perform sequence-based association studies. To evaluate these platforms for this application, we analyzed human sequence generated by the Roche 454, Illumina GA, and the ABI SOLiD technologies for the same 260 kb in four individuals. Local sequence characteristics contribute to systematic variability in sequence coverage (>100-fold difference in per-base coverage), resulting in patterns for each NGS technology that are highly correlated between samples. A comparison of the base calls to 88 kb of overlapping ABI 3730xL Sanger sequence generated for the same samples showed that the NGS platforms all have high sensitivity, identifying >95% of variant sites. At high coverage depth, base calling errors are systematic, resulting from local sequence contexts; as the coverage is lowered additional 'random sampling' errors in base calling occur. Our study provides important insights into systematic biases and data variability that need to be considered when utilizing NGS platforms for population targeted sequencing studies.

Keywords: ngs
[Harchaoui2009regularized] Z. Harchaoui, F. Vallet, A. Lung-Yut-Fong, and O. Cappe. A regularized kernel-based approach to unsupervised audio segmentation. In ICASSP '09: Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1665-1668, Washington, DC, USA, 2009. IEEE Computer Society. [ bib | DOI | http | .pdf ]
Keywords: segmentation
[Goendoer2009Chromosome] Anita Göndör and Rolf Ohlsson. Chromosome crosstalk in three dimensions. Nature, 461(7261):212-217, Sep 2009. [ bib | DOI | http ]
The genome forms extensive and dynamic physical interactions with itself in the form of chromosome loops and bridges, thus exploring the three-dimensional space of the nucleus. It is now possible to examine these interactions at the molecular level, and we have gained glimpses of their functional implications. Chromosomal interactions can contribute to the silencing and activation of genes within the three-dimensional context of the nuclear architecture. Technical advances in detecting these interactions contribute to our understanding of the functional organization of the genome, as well as its adaptive plasticity in response to environmental changes during development and disease.

Keywords: Animals; Cell Nucleus, genetics/metabolism; Chromosome Positioning; Chromosomes, chemistry/genetics/metabolism; Gene Expression Regulation; Humans; Nucleic Acid Conformation; Transcription, Genetic
[Gutin09Generalized] G. Gutin and D. Karapetyan. A memetic algorithm for the generalized traveling salesman problem. Natural Computing, 2009. [ bib | DOI | http ]
The generalized traveling salesman problem (GTSP) is an extension of the well-known traveling salesman problem. In GTSP, we are given a partition of cities into groups and we are required to find a minimum length tour that includes exactly one city from each group. The recent studies on this subject consider different variations of a memetic algorithm approach to the GTSP. The aim of this paper is to present a new memetic algorithm for GTSP with a powerful local search procedure. The experiments show that the proposed algorithm clearly outperforms all of the known heuristics with respect to both solution quality and running time. While the other memetic algorithms were designed only for the symmetric GTSP, our algorithm can solve both symmetric and asymmetric instances.

[Gusterson2009Do] B. Gusterson. Do 'basal-like' breast cancers really exist? Nat. Rev. Cancer, 9(2):128-134, Feb 2009. [ bib | DOI | http | .pdf ]
It has been proposed that gene expression profiles will revolutionize the classification of breast cancer, eventually replacing histopathology with a more reproducible technology. These new approaches, combined with a better understanding of the cellular origins of breast cancer, should enable us to identify patient subgroups for more effective therapy. However, in such a rapidly advancing field it is essential that initial and thought-provoking results do not become established as 'facts' without question. This Opinion addresses some of the negatives and positives generated by the term 'basal-like' breast cancer, and questions its existence as an entity.

Keywords: breastcancer
[Gonzalez2009Highlighting] I. González, S. Déjean, P. G. P. Martin, O. Gonçalves, P. Besse, and A. Baccini. Highlighting relationships between heterogeneous biological data through graphical displays based on regularized canonical correlation analysis. J Biol Syst, 17:173-199, 2009. [ bib ]
[Goldstein2009Common] D. B. Goldstein. Common genetic variation and human traits. N. Engl. J. Med., 360(17):1696-1698, Apr 2009. [ bib | DOI | http | .pdf ]
[Girirajan2009Sequencing] Santhosh Girirajan, Lin Chen, Tina Graves, Tomas Marques-Bonet, Mario Ventura, Catrina Fronick, Lucinda Fulton, Mariano Rocchi, Robert S Fulton, Richard K Wilson, Elaine R Mardis, and Evan E Eichler. Sequencing human-gibbon breakpoints of synteny reveals mosaic new insertions at rearrangement sites. Genome Res., 19(2):178-190, Feb 2009. [ bib | DOI | http | .pdf ]
The gibbon genome exhibits extensive karyotypic diversity with an increased rate of chromosomal rearrangements during evolution. In an effort to understand the mechanistic origin and implications of these rearrangement events, we sequenced 24 synteny breakpoint regions in the white-cheeked gibbon (Nomascus leucogenys, NLE) in the form of high-quality BAC insert sequences (4.2 Mbp). While there is a significant deficit of breakpoints in genes, we identified seven human gene structures involved in signaling pathways (DEPDC4, GNG10), phospholipid metabolism (ENPP5, PLSCR2), beta-oxidation (ECH1), cellular structure and transport (HEATR4), and transcription (ZNF461), that have been disrupted in the NLE gibbon lineage. Notably, only three of these genes show the expected evolutionary signatures of pseudogenization. Sequence analysis of the breakpoints suggested both nonclassical nonhomologous end-joining (NHEJ) and replication-based mechanisms of rearrangement. A substantial number (11/24) of human-NLE gibbon breakpoints showed new insertions of gibbon-specific repeats and mosaic structures formed from disparate sequences including segmental duplications, LINE, SINE, and LTR elements. Analysis of these sites provides a model for a replication-dependent repair mechanism for double-strand breaks (DSBs) at rearrangement sites and insights into the structure and formation of primate segmental duplications at sites of genomic rearrangements during evolution.

Keywords: ngs
[Fullwood2009Next-generation] Melissa J Fullwood, Chia-Lin Wei, Edison T Liu, and Yijun Ruan. Next-generation DNA sequencing of paired-end tags (PET) for transcriptome and genome analyses. Genome Res, 19(4):521-532, Apr 2009. [ bib | DOI | http | .pdf ]
Comprehensive understanding of functional elements in the human genome will require thorough interrogation and comparison of individual human genomes and genomic structures. Such an endeavor will require improvements in the throughputs and costs of DNA sequencing. Next-generation sequencing platforms have impressively low costs and high throughputs but are limited by short read lengths. An immediate and widely recognized solution to this critical limitation is the paired-end tag (PET) sequencing for various applications, collectively called the PET sequencing strategy, in which short and paired tags are extracted from the ends of long DNA fragments for ultra-high-throughput sequencing. The PET sequences can be accurately mapped to the reference genome, thus demarcating the genomic boundaries of PET-represented DNA fragments and revealing the identities of the target DNA elements. PET protocols have been developed for the analyses of transcriptomes, transcription factor binding sites, epigenetic sites such as histone modification sites, and genome structures. The exclusive advantage of the PET technology is its ability to uncover linkages between the two ends of DNA fragments. Using this unique feature, unconventional fusion transcripts, genome structural variations, and even molecular interactions between distant genomic elements can be unraveled by PET analysis. Extensive use of PET data could lead to efficient assembly of individual human genomes, transcriptomes, and interactomes, enabling new biological and clinical insights. With its versatile and powerful nature for DNA analysis, the PET sequencing strategy has a bright future ahead.

[Fullwood2009oestrogen-receptor-alpha-bound] Melissa J Fullwood, Mei Hui Liu, You Fu Pan, Jun Liu, Han Xu, Yusoff Bin Mohamed, Yuriy L Orlov, Stoyan Velkov, Andrea Ho, Poh Huay Mei, Elaine G Y Chew, Phillips Yao Hui Huang, Willem-Jan Welboren, Yuyuan Han, Hong Sain Ooi, Pramila N Ariyaratne, Vinsensius B Vega, Yanquan Luo, Peck Yean Tan, Pei Ye Choy, K. D Senali Abayratna Wansa, Bing Zhao, Kar Sian Lim, Shi Chi Leow, Jit Sin Yow, Roy Joseph, Haixia Li, Kartiki V Desai, Jane S Thomsen, Yew Kok Lee, R. Krishna Murthy Karuturi, Thoreau Herve, Guillaume Bourque, Hendrik G Stunnenberg, Xiaoan Ruan, Valere Cacheux-Rataboul, Wing-Kin Sung, Edison T Liu, Chia-Lin Wei, Edwin Cheung, and Yijun Ruan. An oestrogen-receptor-alpha-bound human chromatin interactome. Nature, 462(7269):58-64, Nov 2009. [ bib | DOI | http ]
Genomes are organized into high-level three-dimensional structures, and DNA elements separated by long genomic distances can in principle interact functionally. Many transcription factors bind to regulatory DNA elements distant from gene promoters. Although distal binding sites have been shown to regulate transcription by long-range chromatin interactions at a few loci, chromatin interactions and their impact on transcription regulation have not been investigated in a genome-wide manner. Here we describe the development of a new strategy, chromatin interaction analysis by paired-end tag sequencing (ChIA-PET) for the de novo detection of global chromatin interactions, with which we have comprehensively mapped the chromatin interaction network bound by oestrogen receptor alpha (ER-alpha) in the human genome. We found that most high-confidence remote ER-alpha-binding sites are anchored at gene promoters through long-range chromatin interactions, suggesting that ER-alpha functions by extensive chromatin looping to bring genes together for coordinated transcriptional regulation. We propose that chromatin interactions constitute a primary mechanism for regulating transcription in mammalian genomes.

Keywords: Binding Sites; Cell Line; Chromatin; Chromatin Immunoprecipitation; Cross-Linking Reagents; Estrogen Receptor alpha; Formaldehyde; Genome, Human; Humans; Promoter Regions, Genetic; Protein Binding; Reproducibility of Results; Sequence Analysis, DNA; Transcription, Genetic; Transcriptional Activation
[Elliott2009Current] M Elliott, C Parker, D Smith, and CH Borchers. Current trends in quantitative proteomics. The Journal of Mass Spectrometry, 44:1637-60, 2009. [ bib ]
[Eid2009Real] John Eid, Adrian Fehr, Jeremy Gray, Khai Luong, John Lyle, Geoff Otto, Paul Peluso, David Rank, Primo Baybayan, Brad Bettman, Arkadiusz Bibillo, Keith Bjornson, Bidhan Chaudhuri, Frederick Christians, Ronald Cicero, Sonya Clark, Ravindra Dalal, Alex Dewinter, John Dixon, Mathieu Foquet, Alfred Gaertner, Paul Hardenbol, Cheryl Heiner, Kevin Hester, David Holden, Gregory Kearns, Xiangxu Kong, Ronald Kuse, Yves Lacroix, Steven Lin, Paul Lundquist, Congcong Ma, Patrick Marks, Mark Maxham, Devon Murphy, Insil Park, Thang Pham, Michael Phillips, Joy Roy, Robert Sebra, Gene Shen, Jon Sorenson, Austin Tomaney, Kevin Travers, Mark Trulson, John Vieceli, Jeffrey Wegener, Dawn Wu, Alicia Yang, Denis Zaccarin, Peter Zhao, Frank Zhong, Jonas Korlach, and Stephen Turner. Real-time DNA sequencing from single polymerase molecules. Science, 323(5910):133-138, Jan 2009. [ bib | DOI | http ]
We present single-molecule, real-time sequencing data obtained from a DNA polymerase performing uninterrupted template-directed synthesis using four distinguishable fluorescently labeled deoxyribonucleoside triphosphates (dNTPs). We detected the temporal order of their enzymatic incorporation into a growing DNA strand with zero-mode waveguide nanostructure arrays, which provide optical observation volume confinement and enable parallel, simultaneous detection of thousands of single-molecule sequencing reactions. Conjugation of fluorophores to the terminal phosphate moiety of the dNTPs allows continuous observation of DNA synthesis over thousands of bases without steric hindrance. The data report directly on polymerase dynamics, revealing distinct polymerization states and pause sites corresponding to DNA secondary structure. Sequence data were aligned with the known reference sequence to assay biophysical parameters of polymerization for each template position. Consensus sequences were generated from the single-molecule reads at 15-fold coverage, showing a median accuracy of 99.3%, with no systematic error beyond fluorophore-dependent error rates.

Keywords: Base Sequence; Consensus Sequence; DNA, Circular, chemistry; DNA, Single-Stranded, chemistry; DNA, biosynthesis; DNA-Directed DNA Polymerase, metabolism; Deoxyribonucleotides, metabolism; Enzymes, Immobilized; Fluorescent Dyes; Kinetics; Nanostructures; Sequence Analysis, DNA, methods; Spectrometry, Fluorescence
[Duchenne2009Tensor] O. Duchenne, F. Bach, I. Kweon, and J. Ponce. A tensor-based algorithm for high-order graph matching. In Proc. IEEE Conference on Computer Vision and Pattern Recognition CVPR 2009, pages 1980-1987, 20-25 June 2009. [ bib | DOI ]
[Declercq2009RIP] W. Declercq, T. Vanden Berghe, and P. Vandenabeele. RIP kinases at the crossroads of cell death and survival. Cell, 138(2):229-232, 2009. [ bib | DOI | http | .pdf ]
Keywords: csbcbook
[Daume2009Bayesian] H. Daume. Bayesian Multitask Learning with Latent Hierarchies. In 25th Conference on Uncertainty in Artificial Intelligence, 2009. [ bib ]
[Curturi2009White] M. Cuturi, J.-P. Vert, and A. D'Aspremont. White functionals for anomaly detection in dynamical systems. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 432-440, 2009. [ bib | .pdf | .pdf ]
[Cianfrocca2009New] M. Cianfrocca and W. Gradishar. New molecular classifications of breast cancer. CA Cancer J. Clin., 59(5):303-313, 2009. [ bib | DOI | http | .pdf ]
Traditionally, pathologic determinations of tumor size, lymph node status, endocrine receptor status, and human epidermal growth factor receptor 2 (HER2) status have driven prognostic predictions and adjuvant therapy recommendations for patients with early stage breast cancer. However, these prognostic and predictive factors are relatively crude measures, resulting in many patients being overtreated or undertreated. As a result of gene expression assays, there is growing recognition that breast cancer is a molecularly heterogeneous disease. Evidence from gene expression microarrays suggests the presence of multiple molecular subtypes of breast cancer. The recent commercial availability of gene expression profiling techniques that predict risk of disease recurrence as well as potential chemotherapy benefit have shown promise in refining clinical decision making. These techniques will be reviewed in this article.

Keywords: csbcbook, breastcancer
[Chiang2009High-resolution] Derek Y Chiang, Gad Getz, David B Jaffe, Michael J T O'Kelly, Xiaojun Zhao, Scott L Carter, Carsten Russ, Chad Nusbaum, Matthew Meyerson, and Eric S Lander. High-resolution mapping of copy-number alterations with massively parallel sequencing. Nat. Methods, 6(1):99-103, Jan 2009. [ bib | DOI | http | .pdf ]
Cancer results from somatic alterations in key genes, including point mutations, copy-number alterations and structural rearrangements. A powerful way to discover cancer-causing genes is to identify genomic regions that show recurrent copy-number alterations (gains and losses) in tumor genomes. Recent advances in sequencing technologies suggest that massively parallel sequencing may provide a feasible alternative to DNA microarrays for detecting copy-number alterations. Here we present: (i) a statistical analysis of the power to detect copy-number alterations of a given size; (ii) SegSeq, an algorithm to segment equal copy numbers from massively parallel sequence data; and (iii) analysis of experimental data from three matched pairs of tumor and normal cell lines. We show that a collection of approximately 14 million aligned sequence reads from human cell lines has comparable power to detect events as the current generation of DNA microarrays and has over twofold better precision for localizing breakpoints (typically, to within approximately 1 kilobase).

Keywords: ngs
[Chen2009A] Jianhui Chen, Lei Tang, Jun Liu, and Jieping Ye. A convex formulation for learning shared structures from multiple tasks. In ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning, pages 137-144, New York, NY, USA, 2009. ACM. [ bib | DOI ]
[Candes2009Exact] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Found. Comput. Math., 9(6):717-772, 2009. [ bib | DOI | http | .pdf ]
[Byerly2009Effects] S. Byerly, K. Sundin, R. Raja, J. Stanchfield, B. A. Bejjani, and L. G. Shaffer. Effects of ozone exposure during microarray posthybridization washes and scanning. J. Mol. Diagn., 11(6):590-597, Nov 2009. [ bib | DOI | http ]
The increasing prevalence of array-based comparative genomic hybridization in the clinical laboratory necessitates the implementation of quality control measures to attain accurate results with a high level of confidence. Environmental ozone is present in all industrialized cities and has been found to be detrimental to array data even at levels considered acceptable by US Environmental Protection Agency standards. In this study, we characterized the effect of ozone on microarray data on three different labeling platforms that use different fluorescent dyes (Cy3 and Cy5, Alexa Fluor 555 and Alexa Fluor 647, and Alexa Fluor 3 and Alexa Fluor 5) that are commonly used in array-based comparative genomic hybridization. We investigated the effects of ozone on microarray data by washing the array in variable ozone environments. In addition, we observed the effects of prolonged exposure to ozone on the microarray after washing in an ozone-free environment. Our results demonstrate the necessity of minimizing ozone exposure when washing and drying the microarray. We also found that washed microarrays produce the best results when immediately scanned; however, if a low-ozone environment is maintained, there will be little compromise in the data collected.

Keywords: Humans; Oligonucleotide Array Sequence Analysis, methods; Ozone
[Bungaro2009Integration] Silvia Bungaro, Marta Campo Dell'Orto, Andrea Zangrando, Dario Basso, Tatiana Gorletta, Luca Lo Nigro, Anna Leszl, Bryan D. Young, Giuseppe Basso, Silvio Bicciato, Andrea Biondi, Gertruy te Kronnie, and Giovanni Cazzaniga. Integration of genomic and gene expression data of childhood ALL without known aberrations identifies subgroups with specific genetic hallmarks. Genes Chromosomes Cancer, 48(1):22-38, Jan 2009. [ bib | DOI | http ]
Pediatric acute lymphoblastic leukemia (ALL) comprises genetically distinct subtypes. However, 25% of cases still lack defined genetic hallmarks. To identify genomic aberrancies in childhood ALL patients nonclassifiable by conventional methods, we performed a single nucleotide polymorphisms (SNP) array-based genomic analysis of leukemic cells from 29 cases. The vast majority of cases analyzed (19/24, 79%) showed genomic abnormalities; at least one of them affected either genes involved in cell cycle regulation or in B-cell development. The most relevant abnormalities were CDKN2A/9p21 deletions (7/24, 29%), ETV6 (TEL)/12p13 deletions (3/24, 12%), and intrachromosomal amplifications of chromosome 21 (iAMP21) (3/24, 12%). To identify variation in expression of genes directly or indirectly affected by recurrent genomic alterations, we integrated genomic and gene expression data generated by microarray analyses of the same samples. SMAD1 emerged as a down-regulated gene in CDKN2A homozygous deleted cases compared with nondeleted. The JAG1 gene, encoding the Jagged 1 ligand of the Notch receptor, was among a list of differentially expressed (up-regulated) genes in ETV6-deleted cases. Our findings demonstrate that integration of genomic analysis and gene expression profiling can identify genetic lesions undetected by routine methods and potential novel pathways involved in B-progenitor ALL pathogenesis.

[Brodersen2009Revisiting] Peter Brodersen and Olivier Voinnet. Revisiting the principles of microRNA target recognition and mode of action. Nat Rev Mol Cell Biol, 10(2):141-148, Feb 2009. [ bib | DOI | http | .pdf ]
MicroRNAs (miRNAs) are fundamental regulatory elements of animal and plant gene expression. Although rapid progress in our understanding of miRNA biogenesis has been achieved by experimentation, computational approaches have also been influential in determining the general principles that are thought to govern miRNA target recognition and mode of action. We discuss how these principles are being progressively challenged by genetic and biochemical studies. In addition, we discuss the role of target-site-specific endonucleolytic cleavage, which is the hallmark of experimental RNA interference and a mechanism that is used by plant miRNAs and a few animal miRNAs. Generally thought to be merely a degradation mechanism, we propose that this might also be a biogenesis mechanism for biologically functional, non-coding RNA fragments.

Keywords: sirna
[Boysen2009Consistencies] L. Boysen, A. Kempe, V. Liebscher, A. Munk, and O. Wittich. Consistencies and rates of convergence of jump-penalized least squares estimators. Ann. Stat., 37(1):157-183, 2009. [ bib | DOI | http | .pdf ]
We study the asymptotics for jump-penalized least squares regression aiming at approximating a regression function by piecewise constant functions. Besides conventional consistency and convergence rates of the estimates in L2([0, 1)) our results cover other metrics like Skorokhod metric on the space of càdlàg functions and uniform metrics on C([0, 1]). We will show that these estimators are in an adaptive sense rate optimal over certain classes of "approximation spaces." Special cases are the class of functions of bounded variation, (piecewise) Hölder continuous functions of order 0<alpha<=1, and the class of step functions with a finite but arbitrary number of jumps. In the latter setting, we will also deduce the rates known from change-point analysis for detecting the jumps. Finally, the issue of fully automatic selection of the smoothing parameter is addressed.

Keywords: lasso
[Boulesteix2009Stability] A.L. Boulesteix and M. Slawski. Stability and aggregation of ranked gene lists. Briefings in Bioinformatics, 10(5):556-568, 2009. [ bib ]
[Bohnert2009Transcript] R. Bohnert, J. Behr, and G. Rätsch. Transcript quantification with RNA-Seq data. BMC Bioinformatics, 10 (Suppl 13):P5, 2009. [ bib | DOI | http | .pdf ]
Keywords: ngs, rnaseq
[Bickel2009Simultaneous] P. J. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat., 37(4):1705-1732, 2009. [ bib | .pdf ]
Keywords: lasso
[Bengtsson2009Single] Henrik Bengtsson, Pratyaksha Wirapati, and Terence P Speed. A single-array preprocessing method for estimating full-resolution raw copy numbers from all Affymetrix genotyping arrays including GenomeWideSNP 5 & 6. Bioinformatics, 25(17):2149-2156, Sep 2009. [ bib | DOI | http ]
MOTIVATION: High-resolution copy-number (CN) analysis has in recent years gained much attention, not only for the purpose of identifying CN aberrations associated with a certain phenotype, but also for identifying CN polymorphisms. In order for such studies to be successful and cost effective, the statistical methods have to be optimized. We propose a single-array preprocessing method for estimating full-resolution total CNs. It is applicable to all Affymetrix genotyping arrays, including the recent ones that also contain non-polymorphic probes. A reference signal is only needed at the last step when calculating relative CNs. RESULTS: As with our method for earlier generations of arrays, this one controls for allelic crosstalk, probe affinities and PCR fragment-length effects. Additionally, it also corrects for probe sequence effects and co-hybridization of fragments digested by multiple enzymes that takes place on the latest chips. We compare our method with Affymetrix's CN5 method and the dChip method by assessing how well they differentiate between various CN states at the full resolution and various amounts of smoothing. Although CRMA v2 is a single-array method, we observe that it performs as well as or better than alternative methods that use data from all arrays for their preprocessing. This shows that it is possible to do online analysis in large-scale projects where additional arrays are introduced over time.

Keywords: Base Pairing, genetics; Chromosomes, Human, Pair 10, genetics; Gene Dosage, genetics; Genome, Human, genetics; Genotype; Humans; Oligonucleotide Array Sequence Analysis, methods; Polymorphism, Single Nucleotide, genetics; ROC Curve
[Beck2009fast] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Img. Sci., 2(1):183-202, 2009. [ bib | DOI | http | .pdf ]
[Barrett2009NCBI] T. Barrett, D.B. Troup, S.E. Wilhite, P. Ledoux, D. Rudnev, C. Evangelista, I.F. Kim, A. Soboleva, M. Tomashevsky, K.A. Marshall, et al. NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Research, 37(suppl 1):D885-D890, 2009. [ bib ]
[Baraud2009Gaussian] Yannick Baraud, Christophe Giraud, and Sylvie Huet. Gaussian model selection with an unknown variance. Annals of Statistics, 37:630, 2009. [ bib | http ]
[Bach2009Exploring] F. Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In Adv. Neural. Inform. Process Syst., volume 21, 2009. [ bib | .pdf ]
[Alkan2009Personalized] Can Alkan, Jeffrey M Kidd, Tomas Marques-Bonet, Gozde Aksay, Francesca Antonacci, Fereydoun Hormozdiari, Jacob O Kitzman, Carl Baker, Maika Malig, Onur Mutlu, S. Cenk Sahinalp, Richard A Gibbs, and Evan E Eichler. Personalized copy number and segmental duplication maps using next-generation sequencing. Nat. Genet., 41(10):1061-1067, Oct 2009. [ bib | DOI | http | .pdf ]
Despite their importance in gene innovation and phenotypic variation, duplicated regions have remained largely intractable owing to difficulties in accurately resolving their structure, copy number and sequence content. We present an algorithm (mrFAST) to comprehensively map next-generation sequence reads, which allows for the prediction of absolute copy-number variation of duplicated segments and genes. We examine three human genomes and experimentally validate genome-wide copy number differences. We estimate that, on average, 73-87 genes vary in copy number between any two individuals and find that these genic differences overwhelmingly correspond to segmental duplications (odds ratio = 135; P < 2.2 x 10(-16)). Our method can distinguish between different copies of highly identical genes, providing a more accurate assessment of gene content and insight into functional constraint without the limitations of array-based technology.

Keywords: ngs
[Agarwal2009Ranking] S. Agarwal and S. Sengupta. Ranking genes by relevance to a disease. In Proceedings of the 8th Annual International Conference on Computational Systems Bioinformatics, 2009. [ bib ]
[Abernethy2008new] J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. A new approach to collaborative filtering: operator estimation with spectral regularization. J. Mach. Learn. Res., 10:803-826, 2009. [ bib ]
[Tayrac2009Simultaneous] M. de Tayrac, S. Lê, M. Aubry, J. Mosser, and F. Husson. Simultaneous analysis of distinct omics data sets with integration of biological knowledge: Multiple factor analysis approach. BMC Genomics, 10:32, 2009. [ bib | DOI | http ]
Genomic analysis will greatly benefit from considering in a global way various sources of molecular data with the related biological knowledge. It is thus of great importance to provide useful integrative approaches dedicated to ease the interpretation of microarray data. Here, we introduce a data-mining approach, Multiple Factor Analysis (MFA), to combine multiple data sets and to add formalized knowledge. MFA is used to jointly analyse the structure emerging from genomic and transcriptomic data sets. The common structures are underlined and graphical outputs are provided such that biological meaning becomes easily retrievable. Gene Ontology terms are used to build gene modules that are superimposed on the experimentally interpreted plots. Functional interpretations are then supported by a step-by-step sequence of graphical representations. When applied to genomic and transcriptomic data and associated Gene Ontology annotations, our method prioritize the biological processes linked to the experimental settings. Furthermore, it reduces the time and effort to analyze large amounts of 'Omics' data.

Keywords: Animals; Comparative Genomic Hybridization; Factor Analysis, Statistical; Gene Expression Profiling, methods; Genomics, methods; Glioma, genetics; Humans; Mice; Models, Biological; Oligonucleotide Array Sequence Analysis, methods
[Zhang2009Copy] Y. Zhang, J. W. M. Martens, J. X. Yu, J. Jiang, A. M. Sieuwerts, M. Smid, J. G. M. Klijn, Y. Wang, and J. A. Foekens. Copy number alterations that predict metastatic capability of human breast cancer. Cancer Res, 69(9):3795-3801, May 2009. [ bib | DOI | http | .pdf ]
We have analyzed the DNA copy numbers for over 100,000 single-nucleotide polymorphism loci across the human genome in genomic DNA from 313 lymph node-negative primary breast tumors for which genome-wide gene expression data were also available. Combining these two data sets allowed us to identify the genomic loci and their mapped genes, having high correlation with distant metastasis. An estimation of the likely response based on published predictive signatures was performed in the identified prognostic subgroups defined by gene expression and DNA copy number data. In the training set of 200 patients, we constructed an 81-gene prognostic copy number signature (CNS) that identified a subgroup of patients with increased probability of distant metastasis in the independent validation set of 113 patients [hazard ratio (HR), 2.8; 95% confidence interval (95% CI), 1.4-5.6] and in an external data set of 116 patients (HR, 3.7; 95% CI, 1.3-10.6). These high-risk patients constituted a subset of the high-risk patients predicted by our previously established 76-gene gene expression signature (GES). This very poor prognostic group identified by CNS and GES was putatively more resistant to preoperative paclitaxel and 5-fluorouracil-doxorubicin-cyclophosphamide combination chemotherapy (P = 0.0048), particularly against the doxorubicin compound, while potentially benefiting from etoposide. Our study shows the feasibility of using copy number alterations to predict patient prognostic outcome. When combined with gene expression-based signatures for prognosis, the CNS refines risk classification and can help identify those breast cancer patients who have a significantly worse outlook in prognosis and a potential differential response to chemotherapeutic drugs.

[Trapnell2009TopHat] C. Trapnell, L. Pachter, and S. L. Salzberg. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics, 25(9):1105-1111, May 2009. [ bib | DOI | http | .pdf ]
A new protocol for sequencing the messenger RNA in a cell, known as RNA-Seq, generates millions of short sequence fragments in a single run. These fragments, or 'reads', can be used to measure levels of gene expression and to identify novel splice variants of genes. However, current software for aligning RNA-Seq data to a genome relies on known splice junctions and cannot identify novel ones. TopHat is an efficient read-mapping algorithm designed to align reads from an RNA-Seq experiment to a reference genome without relying on known splice sites. We mapped the RNA-Seq reads from a recent mammalian RNA-Seq experiment and recovered more than 72% of the splice junctions reported by the annotation-based software from that study, along with nearly 20,000 previously unreported junctions. The TopHat pipeline is much faster than previous systems, mapping nearly 2.2 million reads per CPU hour, which is sufficient to process an entire RNA-Seq experiment in less than a day on a standard desktop computer. We describe several challenges unique to ab initio splice site discovery from RNA-Seq reads that will require further algorithm development. TopHat is free, open-source software available from http://tophat.cbcb.umd.edu. Supplementary data are available at Bioinformatics online.

Keywords: ngs, rnaseq
[Spencer2009Non-genetic] Sabrina L Spencer, Suzanne Gaudet, John G Albeck, John M Burke, and Peter K Sorger. Non-genetic origins of cell-to-cell variability in TRAIL-induced apoptosis. Nature, 459(7245):428-432, May 2009. [ bib | DOI | http ]
In microorganisms, noise in gene expression gives rise to cell-to-cell variability in protein concentrations. In mammalian cells, protein levels also vary and individual cells differ widely in their responsiveness to uniform physiological stimuli. In the case of apoptosis mediated by TRAIL (tumour necrosis factor (TNF)-related apoptosis-inducing ligand) it is common for some cells in a clonal population to die while others survive-a striking divergence in cell fate. Among cells that die, the time between TRAIL exposure and caspase activation is highly variable. Here we image sister cells expressing reporters of caspase activation and mitochondrial outer membrane permeabilization after exposure to TRAIL. We show that naturally occurring differences in the levels or states of proteins regulating receptor-mediated apoptosis are the primary causes of cell-to-cell variability in the timing and probability of death in human cell lines. Protein state is transmitted from mother to daughter, giving rise to transient heritability in fate, but protein synthesis promotes rapid divergence so that sister cells soon become no more similar to each other than pairs of cells chosen at random. Our results have implications for understanding 'fractional killing' of tumour cells after exposure to chemotherapy, and for variability in mammalian signal transduction in general.

Keywords: highcontentscreening
[Rumble2009SHRiMP] S. M. Rumble, P. Lacroute, A. V. Dalca, M. Fiume, A. Sidow, and M. Brudno. SHRiMP: accurate mapping of short color-space reads. PLoS Comput. Biol., 5(5):e1000386, May 2009. [ bib | DOI | http ]
The development of Next Generation Sequencing technologies, capable of sequencing hundreds of millions of short reads (25-70 bp each) in a single run, is opening the door to population genomic studies of non-model species. In this paper we present SHRiMP - the SHort Read Mapping Package: a set of algorithms and methods to map short reads to a genome, even in the presence of a large amount of polymorphism. Our method is based upon a fast read mapping technique, separate thorough alignment methods for regular letter-space as well as AB SOLiD (color-space) reads, and a statistical model for false positive hits. We use SHRiMP to map reads from a newly sequenced Ciona savignyi individual to the reference genome. We demonstrate that SHRiMP can accurately map reads to this highly polymorphic genome, while confirming high heterozygosity of C. savignyi in this second individual. SHRiMP is freely available at http://compbio.cs.toronto.edu/shrimp.

[Meinshausen2009Stability] Nicolai Meinshausen and Peter Buehlmann. Stability selection, May 2009. [ bib | arXiv | http ]
Estimation of structure, such as in variable selection, graphical modelling or cluster analysis is notoriously difficult, especially for high-dimensional data. We introduce stability selection. It is based on subsampling in combination with (high-dimensional) selection algorithms. As such, the method is extremely general and has a very wide range of applicability. Stability selection provides finite sample control for some error rates of false discoveries and hence a transparent principle to choose a proper amount of regularisation for structure estimation. Variable selection and structure estimation improve markedly for a range of selection methods if stability selection is applied. We prove for randomised Lasso that stability selection will be variable selection consistent even if the necessary conditions needed for consistency of the original Lasso method are violated. We demonstrate stability selection for variable selection and Gaussian graphical modelling, using real and simulated data.

Keywords: model_selection, regularization, resampling, sensitivity
[Foucart2009Sparsest] Simon Foucart and Ming-Jun Lai. Sparsest solutions of underdetermined linear systems via ℓq-minimization for 0 < q <= 1. Applied and Computational Harmonic Analysis, 26(3):395-407, May 2009. [ bib | http ]
[Fu2009DISCOVER] Wenjie Fu, Pradipta Ray, and Eric P. Xing. DISCOVER: a feature-based discriminative method for motif search in complex genomes. Bioinformatics (Oxford, England), 25(12):i321-329, June 2009. [ bib | DOI | http ]
MOTIVATION: Identifying transcription factor binding sites (TFBSs) encoding complex regulatory signals in metazoan genomes remains a challenging problem in computational genomics. Due to degeneracy of nucleotide content among binding site instances or motifs, and intricate 'grammatical organization' of motifs within cis-regulatory modules (CRMs), extant pattern matching-based in silico motif search methods often suffer from impractically high false positive rates, especially in the context of analyzing large genomic datasets, and noisy position weight matrices which characterize binding sites. Here, we try to address this problem by using a framework to maximally utilize the information content of the genomic DNA in the region of query, taking cues from values of various biologically meaningful genetic and epigenetic factors in the query region such as clade-specific evolutionary parameters, presence/absence of nearby coding regions, etc. We present a new method for TFBS prediction in metazoan genomes that utilizes both the CRM architecture of sequences and a variety of features of individual motifs. Our proposed approach is based on a discriminative probabilistic model known as conditional random fields that explicitly optimizes the predictive probability of motif presence in large sequences, based on the joint effect of all such features. RESULTS: This model overcomes weaknesses in earlier methods based on less effective statistical formalisms that are sensitive to spurious signals in the data. We evaluate our method on both simulated CRMs and real Drosophila sequences in comparison with a wide spectrum of existing models, and outperform the state of the art by 22% in F1 score. Availability and Implementation: The code is publicly available at http://www.sailing.cs.cmu.edu/discover.html. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Keywords: complex, discovery, genomes, motif, tfbs
[Zaslavskiy2009Phrase] M. Zaslavskiy, M. Dymetman, and N. Cancedda. Phrase-based statistical machine translation as a traveling salesman problem. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 333-341, Suntec, Singapore, August 2009. Association for Computational Linguistics. [ bib | http ]
[Croce2009Causes] Carlo M. Croce. Causes and consequences of microRNA dysregulation in cancer. Nat Rev Genet, 10(10):704-714, October 2009. [ bib | DOI | http | .pdf ]
Keywords: csbcbook
[Bleakley2009Joint] K. Bleakley and J.-P. Vert. Joint segmentation of many aCGH profiles using fast group LARS. Technical Report hal-00422430, HAL, October 2009. [ bib | http ]
Array-Based Comparative Genomic Hybridization (aCGH) is a method used to search for genomic regions with copy numbers variations. For a given aCGH profile, one challenge is to accurately segment it into regions of constant copy number. Subjects sharing the same disease status, for example a type of cancer, often have aCGH profiles with similar copy number variations, due to duplications and deletions relevant to that particular disease. We introduce a constrained optimization algorithm that jointly segments aCGH profiles of many subjects. It simultaneously penalizes the amount of freedom the set of profiles have to jump from one level of constant copy number to another, at genomic locations known as breakpoints. We show that breakpoints shared by many different profiles tend to be found first by the algorithm, even in the presence of significant amounts of noise. The algorithm can be formulated as a group LARS problem. We propose an extremely fast way to find the solution path, i.e., a sequence of shared breakpoints in order of importance. For no extra cost the algorithm smoothes all of the aCGH profiles into piecewise-constant regions of equal copy number, giving low-dimensional versions of the original data. These can be shown for all profiles on a single graph, allowing for intuitive visual interpretation. Simulations and an implementation of the algorithm on bladder cancer aCGH profiles are provided.

[Zoppoli2010TimeDelay] P. Zoppoli, S. Morganella, and M. Ceccarelli. TimeDelay-ARACNE: Reverse engineering of gene networks from time-course data by an information theoretic approach. BMC Bioinformatics, 11(1):154, 2010. [ bib ]
[Zhang2010Detecting] N. R. Zhang, D. O. Siegmund, H. Ji, and J. Li. Detecting simultaneous change-points in multiple sequences. Biometrika, 97(3):631-645, 2010. [ bib | DOI | http | .pdf ]
Keywords: segmentation
[Zaslavskiy2010Many-to-Many] M. Zaslavskiy, F. Bach, and J.-P. Vert. Many-to-many graph matching: A continuous relaxation approach. In J. Balcázar, F. Bonchi, A. Gionis, and M. Sebag, editors, Machine Learning and Knowledge Discovery in Databases, volume 6323 of Lecture Notes in Computer Science, pages 515-530. Springer Berlin / Heidelberg, 2010. [ bib | DOI | http ]
[Yu2010L2-norm] S. Yu, T. Falck, A. Daemen, L.-C. Tranchevent, J. A. K. Suykens, B. De Moor, and Y. Moreau. L2-norm multiple kernel learning and its application to biomedical data fusion. BMC Bioinformatics, 11:309, 2010. [ bib | DOI | http ]
BACKGROUND: This paper introduces the notion of optimizing different norms in the dual problem of support vector machines with multiple kernels. The selection of norms yields different extensions of multiple kernel learning (MKL) such as L(infinity), L1, and L2 MKL. In particular, L2 MKL is a novel method that leads to non-sparse optimal kernel coefficients, which is different from the sparse kernel coefficients optimized by the existing L(infinity) MKL method. In real biomedical applications, L2 MKL may have more advantages over sparse integration method for thoroughly combining complementary information in heterogeneous data sources. RESULTS: We provide a theoretical analysis of the relationship between the L2 optimization of kernels in the dual problem with the L2 coefficient regularization in the primal problem. Understanding the dual L2 problem grants a unified view on MKL and enables us to extend the L2 method to a wide range of machine learning problems. We implement L2 MKL for ranking and classification problems and compare its performance with the sparse L(infinity) and the averaging L1 MKL methods. The experiments are carried out on six real biomedical data sets and two large scale UCI data sets. L2 MKL yields better performance on most of the benchmark data sets. In particular, we propose a novel L2 MKL least squares support vector machine (LSSVM) algorithm, which is shown to be an efficient and promising classifier for large scale data sets processing. CONCLUSIONS: This paper extends the statistical framework of genomic data fusion based on MKL. Allowing non-sparse weights on the data sources is an attractive option in settings where we believe most data sources to be relevant to the problem at hand and want to avoid a "winner-takes-all" effect seen in L(infinity) MKL, which can be detrimental to the performance in prospective studies. 
The notion of optimizing L2 kernels can be straightforwardly extended to ranking, classification, regression, and clustering algorithms. To tackle the computational burden of MKL, this paper proposes several novel LSSVM based MKL algorithms. Systematic comparison on real data sets shows that LSSVM MKL has performance comparable to that of the conventional SVM MKL algorithms. Moreover, large scale numerical experiments indicate that when cast as semi-infinite programming, LSSVM MKL can be solved more efficiently than SVM MKL. AVAILABILITY: The MATLAB code of algorithms implemented in this paper is downloadable from http://homes.esat.kuleuven.be/~sistawww/bioi/syu/l2lssvm.html.

[Yang2010Fast] A. Yang, A. Ganesh, S. Sastry, and Y. Ma. Fast l1-minimization algorithms and an application in robust face recognition: a review. In Proceedings of the International Conference on Image Processing, pages 1849-1852, 2010. [ bib ]
[Weinberg2010Point] R. Weinberg. Point: Hypotheses first. Nature, 464(7289):678, Apr 2010. [ bib | DOI | http ]
[Weigelt2010Molecular] B. Weigelt and J. S. Reis-Filho. Molecular profiling currently offers no more than tumour morphology and basic immunohistochemistry. Breast Cancer Res., 12(S4):S5, 2010. [ bib | DOI | http | .pdf ]
[Weigelt2010Breast] Britta Weigelt, Alan Mackay, Roger A'hern, Rachael Natrajan, David S P Tan, Mitch Dowsett, Alan Ashworth, and Jorge S Reis-Filho. Breast cancer molecular profiling with single sample predictors: a retrospective analysis. Lancet Oncol, 11(4):339-349, Apr 2010. [ bib | DOI | http ]
Microarray expression profiling classifies breast cancer into five molecular subtypes: luminal A, luminal B, basal-like, HER2, and normal breast-like. Three microarray-based single sample predictors (SSPs) have been used to define molecular classification of individual samples. We aimed to establish agreement between these SSPs for identification of breast cancer molecular subtypes. Previously described microarray-based SSPs were applied to one in-house (n=53) and three publicly available (n=779) breast cancer datasets. Agreement was analysed between SSPs for the whole classification system and for the five molecular subtypes individually in each cohort. Fair-to-substantial agreement between every pair of SSPs in each cohort was recorded (kappa=0.238-0.740). Of the five molecular subtypes, only basal-like cancers consistently showed almost-perfect agreement (kappa>0.812). The proportion of cases classified as basal-like in each cohort was consistent irrespective of the SSP used; however, the proportion of each remaining molecular subtype varied substantially. Assignment of individual cases to luminal A, luminal B, HER2, and normal breast-like subtypes was dependent on the SSP used. The significance of associations with outcome of each molecular subtype, other than basal-like and luminal A, varied depending on SSP used. However, different SSPs produced broadly similar survival curves. Although every SSP identifies molecular subtypes with similar survival, they do not reliably assign the same patients to the same molecular subtypes. For molecular subtype classification to be incorporated into routine clinical practice and treatment decision making, stringent standardisation of methodologies and definitions for identification of breast cancer molecular subtypes is needed. Funding: Breakthrough Breast Cancer, Cancer Research UK.

[Wallace2010Identification] Emma V B Wallace, David Stoddart, Andrew J Heron, Ellina Mikhailova, Giovanni Maglia, Timothy J Donohoe, and Hagan Bayley. Identification of epigenetic DNA modifications with a protein nanopore. Chem Commun (Camb), 46(43):8195-8197, Nov 2010. [ bib | DOI | http ]
Two DNA bases, 5-methylcytosine (5mC) and 5-hydroxymethylcytosine (hmC), marks of epigenetic modification, are recognized in immobilized DNA strands and distinguished from G, A, T and C by nanopore current recording. Therefore, if further aspects of nanopore sequencing can be addressed, the approach will provide a means to locate epigenetic modifications in unamplified genomic DNA.

Keywords: 5-Methylcytosine, chemistry; Cyclodextrins, chemistry; Cytosine, analogs & derivatives/chemistry; DNA, chemistry; Epigenesis, Genetic; Hemolysin Proteins, chemistry; Nanopores
[Voduc2010Breast] K. David Voduc, Maggie C U Cheang, Scott Tyldesley, Karen Gelmon, Torsten O Nielsen, and Hagen Kennecke. Breast cancer subtypes and the risk of local and regional relapse. J Clin Oncol, 28(10):1684-1691, Apr 2010. [ bib | DOI | http | .pdf ]
The risk of local and regional relapse associated with each breast cancer molecular subtype was determined in a large cohort of patients with breast cancer. Subtype assignment was accomplished using a validated six-marker immunohistochemical panel applied to tissue microarrays. Semiquantitative analysis of estrogen receptor (ER), progesterone receptor (PR), Ki-67, human epidermal growth factor receptor 2 (HER2), epidermal growth factor receptor (EGFR), and cytokeratin (CK) 5/6 was performed on tissue microarrays constructed from 2,985 patients with early invasive breast cancer. Patients were classified into the following categories: luminal A, luminal B, luminal-HER2, HER2 enriched, basal-like, or triple-negative phenotype-nonbasal. Multivariable Cox analysis was used to determine the risk of local or regional relapse associated with the intrinsic subtypes, adjusting for standard clinicopathologic factors. The intrinsic molecular subtype was successfully determined in 2,985 tumors. The median follow-up time was 12 years, and there have been a total of 325 local recurrences and 227 regional lymph node recurrences. Luminal A tumors (ER or PR positive, HER2 negative, Ki-67 < 1%) had the best prognosis and the lowest rate of local or regional relapse. For patients undergoing breast conservation, HER2-enriched and basal subtypes demonstrated an increased risk of regional recurrence, and this was statistically significant on multivariable analysis. After mastectomy, luminal B, luminal-HER2, HER2-enriched, and basal subtypes were all associated with an increased risk of local and regional relapse on multivariable analysis. Luminal A tumors are associated with a low risk of local or regional recurrence. Molecular subtyping of breast tumors using a six-marker immunohistochemical panel can identify patients at increased risk of local and regional recurrence.

Keywords: Adult; Breast Neoplasms, mortality/pathology; Female; Humans; Ki-67 Antigen, metabolism; Lymphatic Metastasis; Middle Aged; Neoplasm Metastasis; Neoplasm Recurrence, Local; Neoplasms, Hormone-Dependent; Receptor, Epidermal Growth Factor, metabolism; Receptors, Estrogen, analysis; Receptors, Progesterone, analysis; Tissue Array Analysis; Tumor Markers, Biological, analysis
[Vert2010Fast] J-P. Vert and K. Bleakley. Fast detection of multiple change-points shared by many signals using group LARS. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R.S. Zemel, and A. Culotta, editors, Adv. Neural. Inform. Process Syst., volume 22, pages 2343-2352, 2010. [ bib ]
[Vanunu2010Associating] O. Vanunu, O. Magger, E. Ruppin, T. Shlomi, and R. Sharan. Associating genes and protein complexes with disease via network propagation. PLoS Comput. Biol., 6(1):e1000641, Jan 2010. [ bib | DOI | http ]
A fundamental challenge in human health is the identification of disease-causing genes. Recently, several studies have tackled this challenge via a network-based approach, motivated by the observation that genes causing the same or similar diseases tend to lie close to one another in a network of protein-protein or functional interactions. However, most of these approaches use only local network information in the inference process and are restricted to inferring single gene associations. Here, we provide a global, network-based method for prioritizing disease genes and inferring protein complex associations, which we call PRINCE. The method is based on formulating constraints on the prioritization function that relate to its smoothness over the network and usage of prior information. We exploit this function to predict not only genes but also protein complex associations with a disease of interest. We test our method on gene-disease association data, evaluating both the prioritization achieved and the protein complexes inferred. We show that our method outperforms extant approaches in both tasks. Using data on 1,369 diseases from the OMIM knowledgebase, our method is able (in a cross validation setting) to rank the true causal gene first for 34% of the diseases, and infer 139 disease-related complexes that are highly coherent in terms of the function, expression and conservation of their member proteins. Importantly, we apply our method to study three multi-factorial diseases for which some causal genes have been found already: prostate cancer, Alzheimer's disease and type 2 diabetes mellitus. PRINCE's predictions for these diseases highly match the known literature, suggesting several novel causal genes and protein complexes for further investigation.

Keywords: Algorithms; Alzheimer Disease; Databases, Genetic; Diabetes Mellitus; Disease; Genes; Humans; Male; Multiprotein Complexes; Prostatic Neoplasms; Protein Interaction Mapping; Proteins; Reproducibility of Results
[Ulitsky2010DEGAS] I. Ulitsky, A. Krishnamurthy, R. M. Karp, and R. Shamir. DEGAS: de novo discovery of dysregulated pathways in human diseases. PLoS One, 5(10):e13367, 2010. [ bib | DOI | http | .pdf ]
Molecular studies of the human disease transcriptome typically involve a search for genes whose expression is significantly dysregulated in sick individuals compared to healthy controls. Recent studies have found that only a small number of the genes in human disease-related pathways show consistent dysregulation in sick individuals. However, those studies found that some pathway genes are affected in most sick individuals, but genes can differ among individuals. While a pathway is usually defined as a set of genes known to share a specific function, pathway boundaries are frequently difficult to assign, and methods that rely on such definition cannot discover novel pathways. Protein interaction networks can potentially be used to overcome these problems. We present DEGAS (DysrEgulated Gene set Analysis via Subnetworks), a method for identifying connected gene subnetworks significantly enriched for genes that are dysregulated in specimens of a disease. We applied DEGAS to seven human diseases and obtained statistically significant results that appear to home in on compact pathways enriched with hallmarks of the diseases. In Parkinson's disease, we provide novel evidence for involvement of mRNA splicing, cell proliferation, and the 14-3-3 complex in the disease progression. DEGAS is available as part of the MATISSE software package (http://acgt.cs.tau.ac.il/matisse). The subnetworks identified by DEGAS can provide a signature of the disease potentially useful for diagnosis, pinpoint possible pathways affected by the disease, and suggest targets for drug intervention.

[Teer2010Exome] Jamie K Teer and James C Mullikin. Exome sequencing: the sweet spot before whole genomes. Hum Mol Genet, Aug 2010. [ bib | DOI | http ]
The development of massively parallel sequencing technologies, coupled with new massively parallel DNA enrichment technologies (genomic capture), has allowed the sequencing of targeted regions of the human genome in rapidly increasing numbers of samples. Genomic capture can target specific areas in the genome, including genes of interest and linkage regions, but this limits the study to what is already known. Exome capture allows an unbiased investigation of the complete protein-coding regions in the genome. Researchers can use exome capture to focus on a critical part of the human genome, allowing larger numbers of samples than are currently practical with whole-genome sequencing. In this review, we briefly describe some of the methodologies currently used for genomic and exome capture and highlight recent applications of this technology.

[Taby2010Cancer] Rodolphe Taby and Jean-Pierre J Issa. Cancer epigenetics. CA Cancer J Clin, 60(6):376-392, 2010. [ bib | DOI | http ]
Epigenetics refers to stable alterations in gene expression with no underlying modifications in the genetic sequence and is best exemplified by differentiation, in which multiple cell types diverge physiologically despite a common genetic code. Interest in this area of science has grown over the past decades, especially since it was found to play a major role in physiologic phenomena such as embryogenesis, imprinting, and X chromosome inactivation, and in disease states such as cancer. The latter had been previously thought of as a disease with an exclusive genetic etiology. However, recent data have demonstrated that the complexity of human carcinogenesis cannot be accounted for by genetic alterations alone, but also involves epigenetic changes in processes such as DNA methylation, histone modifications, and microRNA expression. In turn, these molecular alterations lead to permanent changes in the expression of genes that regulate the neoplastic phenotype, such as cellular growth and invasiveness. Targeting epigenetic modifiers has been referred to as epigenetic therapy. The success of this approach in hematopoietic malignancies validates the importance of epigenetic alterations in cancer, not only at the therapeutic level but also with regard to prevention, diagnosis, risk stratification, and prognosis.

Keywords: Animals; Cell Cycle, genetics; Cell Transformation, Neoplastic, genetics; DNA Methylation; Epigenesis, Genetic; Histones, genetics; Humans; MicroRNAs, genetics; Neoplasm Invasiveness, genetics; Neoplasms, classification/diagnosis/genetics/metabolism/prevention & control/therapy; Prognosis; Risk Assessment; Tumor Markers, Biological, genetics
[Stoddart2010Nucleobase] David Stoddart, Andrew J Heron, Jochen Klingelhoefer, Ellina Mikhailova, Giovanni Maglia, and Hagan Bayley. Nucleobase recognition in ssDNA at the central constriction of the alpha-hemolysin pore. Nano Lett, 10(9):3633-3637, Sep 2010. [ bib | DOI | http ]
Nanopores are under investigation for single-molecule DNA sequencing. The alpha-hemolysin (alphaHL) protein nanopore contains three recognition points capable of nucleobase discrimination in individual immobilized ssDNA molecules. We have modified the recognition point R(1) by extensive mutagenesis of residue 113. Amino acids that provide an energy barrier to ion flow (e.g., bulky or hydrophobic residues) strengthen base identification, while amino acids that lower the barrier weaken it. Amino acids with related side chains produce similar patterns of nucleobase recognition providing a rationale for the redesign of recognition points.

Keywords: Amino Acid Substitution; Base Sequence; DNA, Single-Stranded, chemistry; Hemolysin Proteins, chemistry; Models, Molecular; Mutagenesis
[Stein2010Case] Lincoln D Stein. The case for cloud computing in genome informatics. Genome Biol, 11(5):207, 2010. [ bib | DOI | http ]
[Soneson2010Integrative] Charlotte Soneson, Henrik Lilljebjörn, Thoas Fioretos, and Magnus Fontes. Integrative analysis of gene expression and copy number alterations using canonical correlation analysis. BMC Bioinformatics, 11:191, 2010. [ bib | DOI | http ]
With the rapid development of new genetic measurement methods, several types of genetic alterations can be quantified in a high-throughput manner. While the initial focus has been on investigating each data set separately, there is an increasing interest in studying the correlation structure between two or more data sets. Multivariate methods based on Canonical Correlation Analysis (CCA) have been proposed for integrating paired genetic data sets. The high dimensionality of microarray data imposes computational difficulties, which have been addressed for instance by studying the covariance structure of the data, or by reducing the number of variables prior to applying the CCA. In this work, we propose a new method for analyzing high-dimensional paired genetic data sets, which mainly emphasizes the correlation structure and still permits efficient application to very large data sets. The method is implemented by translating a regularized CCA to its dual form, where the computational complexity depends mainly on the number of samples instead of the number of variables. The optimal regularization parameters are chosen by cross-validation. We apply the regularized dual CCA, as well as a classical CCA preceded by a dimension-reducing Principal Components Analysis (PCA), to a paired data set of gene expression changes and copy number alterations in leukemia. Using the correlation-maximizing methods, regularized dual CCA and PCA+CCA, we show that without pre-selection of known disease-relevant genes, and without using information about clinical class membership, an exploratory analysis singles out two patient groups, corresponding to well-known leukemia subtypes. Furthermore, the variables showing the highest relevance to the extracted features agree with previous biological knowledge concerning copy number alterations and gene expression changes in these subtypes.
Finally, the correlation-maximizing methods are shown to yield results which are more biologically interpretable than those resulting from a covariance-maximizing method, and provide different insight compared to when each variable set is studied separately using PCA. We conclude that regularized dual CCA as well as PCA+CCA are useful methods for exploratory analysis of paired genetic data sets, and can be efficiently implemented also when the number of variables is very large.

Keywords: Algorithms; Databases, Genetic; Gene Dosage; Gene Expression; Gene Expression Profiling, methods; Genomics, methods; Humans; Leukemia, classification/genetics; Principal Component Analysis
[Shoval2010Cell] Oren Shoval and Uri Alon. Snapshot: network motifs. Cell, 143(2):326-3e1, Oct 2010. [ bib | DOI | http ]
Keywords: Animals; Feedback; Gene Regulatory Networks; Humans; Signal Transduction
[Shi2010Functional] W. Shi, M. Bessarabova, D. Dosymbekov, Z. Dezso, T. Nikolskaya, M. Dudoladova, T. Serebryiskaya, A. Bugrim, A. Gyuryanov, R. J. Brennan, R. Shah, J. Dopazo, M. Chen, Y. Deng, T. Shi, G. Jurman, C. Furlanello, R. S. Thomas, J. C. Corton, W. Tong, L. Shi, and Y. Nikolsky. Functional analysis of multiple genomic signatures demonstrates that classification algorithms choose phenotype-related genes. Pharmacogenomics J., 10(4):310-323, 2010. [ bib | DOI | http | .pdf ]
[Shi2010MicroArray] Leming Shi, Gregory Campbell, Wendell D Jones, Fabien Campagne, Zhining Wen, Stephen J Walker, Zhenqiang Su, Tzu-Ming Chu, Federico M Goodsaid, Lajos Pusztai, John D Shaughnessy, André Oberthuer, Russell S Thomas, Richard S Paules, Mark Fielden, Bart Barlogie, Weijie Chen, Pan Du, Matthias Fischer, Cesare Furlanello, Brandon D Gallas, Xijin Ge, Dalila B Megherbi, W. Fraser Symmans, May D Wang, John Zhang, Hans Bitter, Benedikt Brors, Pierre R Bushel, Max Bylesjo, Minjun Chen, Jie Cheng, Jing Cheng, Jeff Chou, Timothy S Davison, Mauro Delorenzi, Youping Deng, Viswanath Devanarayan, David J Dix, Joaquin Dopazo, Kevin C Dorff, Fathi Elloumi, Jianqing Fan, Shicai Fan, Xiaohui Fan, Hong Fang, Nina Gonzaludo, Kenneth R Hess, Huixiao Hong, Jun Huan, Rafael A Irizarry, Richard Judson, Dilafruz Juraeva, Samir Lababidi, Christophe G Lambert, Li Li, Yanen Li, Zhen Li, Simon M Lin, Guozhen Liu, Edward K Lobenhofer, Jun Luo, Wen Luo, Matthew N McCall, Yuri Nikolsky, Gene A Pennello, Roger G Perkins, Reena Philip, Vlad Popovici, Nathan D Price, Feng Qian, Andreas Scherer, Tieliu Shi, Weiwei Shi, Jaeyun Sung, Danielle Thierry-Mieg, Jean Thierry-Mieg, Venkata Thodima, Johan Trygg, Lakshmi Vishnuvajjala, Sue Jane Wang, Jianping Wu, Yichao Wu, Qian Xie, Waleed A Yousef, Liang Zhang, Xuegong Zhang, Sheng Zhong, Yiming Zhou, Sheng Zhu, Dhivya Arasappan, Wenjun Bao, Anne Bergstrom Lucas, Frank Berthold, Richard J Brennan, Andreas Buness, Jennifer G Catalano, Chang Chang, Rong Chen, Yiyu Cheng, Jian Cui, Wendy Czika, Francesca Demichelis, Xutao Deng, Damir Dosymbekov, Roland Eils, Yang Feng, Jennifer Fostel, Stephanie Fulmer-Smentek, James C Fuscoe, Laurent Gatto, Weigong Ge, Darlene R Goldstein, Li Guo, Donald N Halbert, Jing Han, Stephen C Harris, Christos Hatzis, Damir Herman, Jianping Huang, Roderick V Jensen, Rui Jiang, Charles D Johnson, Giuseppe Jurman, Yvonne Kahlert, Sadik A Khuder, Matthias Kohl, Jianying Li, Li Li, Menglong Li, Quan-Zhen Li, Shao Li, 
Zhiguang Li, Jie Liu, Ying Liu, Zhichao Liu, Lu Meng, Manuel Madera, Francisco Martinez-Murillo, Ignacio Medina, Joseph Meehan, Kelci Miclaus, Richard A Moffitt, David Montaner, Piali Mukherjee, George J Mulligan, Padraic Neville, Tatiana Nikolskaya, Baitang Ning, Grier P Page, Joel Parker, R. Mitchell Parry, Xuejun Peng, Ron L Peterson, John H Phan, Brian Quanz, Yi Ren, Samantha Riccadonna, Alan H Roter, Frank W Samuelson, Martin M Schumacher, Joseph D Shambaugh, Qiang Shi, Richard Shippy, Shengzhu Si, Aaron Smalter, Christos Sotiriou, Mat Soukup, Frank Staedtler, Guido Steiner, Todd H Stokes, Qinglan Sun, Pei-Yi Tan, Rong Tang, Zivana Tezak, Brett Thorn, Marina Tsyganova, Yaron Turpaz, Silvia C Vega, Roberto Visintainer, Juergen von Frese, Charles Wang, Eric Wang, Junwei Wang, Wei Wang, Frank Westermann, James C Willey, Matthew Woods, Shujian Wu, Nianqing Xiao, Joshua Xu, Lei Xu, Lun Yang, Xiao Zeng, Jialu Zhang, Li Zhang, Min Zhang, Chen Zhao, Raj K Puri, Uwe Scherf, Weida Tong, Russell D Wolfinger, and MAQC Consortium. The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nat Biotechnol, 28(8):827-838, Aug 2010. [ bib ]
Gene expression data from microarrays are being applied to predict preclinical and clinical endpoints, but the reliability of these predictions has not been established. In the MAQC-II project, 36 independent teams analyzed six microarray data sets to generate predictive models for classifying a sample with respect to one of 13 endpoints indicative of lung or liver toxicity in rodents, or of breast cancer, multiple myeloma or neuroblastoma in humans. In total, >30,000 models were built using many combinations of analytical methods. The teams generated predictive models without knowing the biological meaning of some of the endpoints and, to mimic clinical reality, tested the models on data that had not been used for training. We found that model performance depended largely on the endpoint and team proficiency and that different approaches generated models of similar performance. The conclusions and recommendations from MAQC-II should be useful for regulatory agencies, study committees and independent investigators that evaluate methods for global gene expression analysis.

Keywords: Animals; Breast Neoplasms, diagnosis/genetics; Disease Models, Animal; Female; Gene Expression Profiling, methods/standards; Guidelines as Topic; Humans; Liver Diseases, etiology/genetics/pathology; Lung Diseases, etiology/genetics/pathology; Multiple Myeloma, diagnosis/genetics; Neoplasms, diagnosis/genetics/mortality; Neuroblastoma, diagnosis/genetics; Oligonucleotide Array Sequence Analysis, methods/standards; Predictive Value of Tests; Quality Control; Rats; Survival Analysis
[Portela2010Epigenetic] Anna Portela and Manel Esteller. Epigenetic modifications and human disease. Nat Biotechnol, 28(10):1057-1068, Oct 2010. [ bib | DOI | http ]
Epigenetics is one of the most rapidly expanding fields in biology. The recent characterization of a human DNA methylome at single nucleotide resolution, the discovery of the CpG island shores, the finding of new histone variants and modifications, and the unveiling of genome-wide nucleosome positioning maps highlight the accelerating speed of discovery over the past two years. Increasing interest in epigenetics has been accompanied by technological breakthroughs that now make it possible to undertake large-scale epigenomic studies. These allow the mapping of epigenetic marks, such as DNA methylation, histone modifications and nucleosome positioning, which are critical for regulating gene and noncoding RNA expression. In turn, we are learning how aberrant placement of these epigenetic marks and mutations in the epigenetic machinery is involved in disease. Thus, a comprehensive understanding of epigenetic mechanisms, their interactions and alterations in health and disease, has become a priority in biomedical research.

Keywords: Amino Acid Sequence; Autoimmune Diseases, genetics; DNA Methylation, genetics; Disease, genetics; Epigenesis, Genetic; Histones, chemistry/metabolism; Humans; Molecular Sequence Data; Nerve Degeneration, genetics
[Popovici2010Effect] V. Popovici, W. Chen, B.G. Gallas, C. Hatzis, W. Shi, F.W. Samuelson, Y. Nikolsky, M. Tsyganova, A. Ishkin, T. Nikolskaya, et al. Effect of training-sample size and classification difficulty on the accuracy of genomic predictors. Breast Cancer Res, 12(1):R5, 2010. [ bib ]
[Peng2010Regularized] Jie Peng, Ji Zhu, Anna Bergamaschi, Wonshik Han, Dong-Young Noh, Jonathan R. Pollack, and Pei Wang. Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer. Ann. Appl. Stat., 4(1):53-77, 2010. [ bib | DOI | http ]
[Ortiz-Estevez2010ACNE] Maria Ortiz-Estevez, Henrik Bengtsson, and Angel Rubio. ACNE: a summarization method to estimate allele-specific copy numbers for Affymetrix SNP arrays. Bioinformatics, 26(15):1827-1833, Aug 2010. [ bib | DOI | http ]
MOTIVATION: Current algorithms for estimating DNA copy numbers (CNs) borrow concepts from gene expression analysis methods. However, single nucleotide polymorphism (SNP) arrays have special characteristics that, if taken into account, can improve the overall performance. For example, cross hybridization between alleles occurs in SNP probe pairs. In addition, most of the current CN methods are focused on total CNs, while it has been shown that allele-specific CNs are of paramount importance for some studies. Therefore, we have developed a summarization method that estimates high-quality allele-specific CNs. RESULTS: The proposed method estimates the allele-specific DNA CNs for all Affymetrix SNP arrays dealing directly with the cross hybridization between probes within SNP probesets. This algorithm outperforms (or at least it performs as well as) other state-of-the-art algorithms for computing DNA CNs. It better discerns an aberration from a normal state and it also gives more precise allele-specific CNs. AVAILABILITY: The method is available in the open-source R package ACNE, which also includes an add on to the aroma.affymetrix framework (http://www.aroma-project.org/).

[Obozinski2010Joint] G. Obozinski, B. Taskar, and M.I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2):231-252, 2010. [ bib ]
[Nielsen2010Visualizing] Cydney B Nielsen, Michael Cantor, Inna Dubchak, David Gordon, and Ting Wang. Visualizing genomes: techniques and challenges. Nat Methods, 7(3 Suppl):S5-S15, Mar 2010. [ bib | DOI | http ]
As our ability to generate sequencing data continues to increase, data analysis is replacing data generation as the rate-limiting step in genomics studies. Here we provide a guide to genomic data visualization tools that facilitate analysis tasks by enabling researchers to explore, interpret and manipulate their data, and in some cases perform on-the-fly computations. We will discuss graphical methods designed for the analysis of de novo sequencing assemblies and read alignments, genome browsing, and comparative genomics, highlighting the strengths and limitations of these approaches and the challenges ahead.

[Netterwald20101000] J Netterwald. The $1000 genome: Coming soon? Drug Discovery & Development, 13:14-15, 2010. [ bib ]
[Mordelet2010Learning] F. Mordelet. Learning from positive and unlabeled examples in biology. PhD thesis, Mines ParisTech, 2010. [ bib ]
[Metzker2010Sequencing] M. L. Metzker. Sequencing technologies - the next generation. Nat. Rev. Genet., 11(1):31-46, Jan 2010. [ bib | DOI | http ]
Demand has never been greater for revolutionary technologies that deliver fast, inexpensive and accurate genome information. This challenge has catalysed the development of next-generation sequencing (NGS) technologies. The inexpensive production of large volumes of sequence data is the primary advantage over conventional methods. Here, I present a technical review of template preparation, sequencing and imaging, genome alignment and assembly approaches, and recent advances in current and near-term commercially available NGS instruments. I also outline the broad range of applications for NGS technologies, in addition to providing guidelines for platform selection to address biological questions of interest.

Keywords: ngs
[Meinshausen2010Stability] N. Meinshausen and P. Bühlmann. Stability selection. J. R. Stat. Soc. Ser. B, 72(4):417-473, 2010. [ bib | DOI | http | .pdf ]
[McCarthy2010Third] Alice McCarthy. Third generation DNA sequencing: Pacific Biosciences' single molecule real time technology. Chem Biol, 17(7):675-676, Jul 2010. [ bib | DOI | http ]
[Markowetz2010How] Florian Markowetz. How to understand the cell by breaking it: network analysis of gene perturbation screens. PLoS Comput Biol, 6(2):e1000655, 2010. [ bib | DOI | http ]
Keywords: Animals; Cell Physiological Processes; Cluster Analysis; Gene Regulatory Networks; Genomics; Humans; Models, Genetic; Models, Statistical; Phenotype; Signal Transduction; Systems Biology
[Marbach2010Revealing] D. Marbach, R. J. Prill, T. Schaffter, C. Mattiussi, D. Floreano, and G. Stolovitzky. Revealing strengths and weaknesses of methods for gene network inference. Proc. Natl. Acad. Sci. USA, 107(14):6286-6291, 2010. [ bib | DOI | arXiv | http ]
Numerous methods have been developed for inferring gene regulatory networks from expression data, however, both their absolute and comparative performance remain poorly understood. In this paper, we introduce a framework for critical performance assessment of methods for gene network inference. We present an in silico benchmark suite that we provided as a blinded, community-wide challenge within the context of the DREAM (Dialogue on Reverse Engineering Assessment and Methods) project. We assess the performance of 29 gene-network-inference methods, which have been applied independently by participating teams. Performance profiling reveals that current inference methods are affected, to various degrees, by different types of systematic prediction errors. In particular, all but the best-performing method failed to accurately infer multiple regulatory inputs (combinatorial regulation) of genes. The results of this community-wide experiment show that reliable network inference from gene expression data remains an unsolved problem, and they indicate potential ways of network reconstruction improvements.

[Mamanova2010Target-enrichment] Lira Mamanova, Alison J Coffey, Carol E Scott, Iwanka Kozarewa, Emily H Turner, Akash Kumar, Eleanor Howard, Jay Shendure, and Daniel J Turner. Target-enrichment strategies for next-generation sequencing. Nat Methods, 7(2):111-118, Feb 2010. [ bib | DOI | http ]
We have not yet reached a point at which routine sequencing of large numbers of whole eukaryotic genomes is feasible, and so it is often necessary to select genomic regions of interest and to enrich these regions before sequencing. There are several enrichment approaches, each with unique advantages and disadvantages. Here we describe our experiences with the leading target-enrichment technologies, the optimizations that we have performed and typical results that can be obtained using each. We also provide detailed protocols for each technology so that end users can find the best compromise between sensitivity, specificity and uniformity for their particular project.

Keywords: Chromosome Mapping; Forecasting; Gene Targeting; In Situ Hybridization; Molecular Probe Techniques; Polymerase Chain Reaction; Sequence Analysis, DNA
[Malyutov2010Recovery] M. Malyutov. Recovery of sparse active inputs in general systems: A review. In Proc. IEEE Region 8 Int. Conf. on Computational Technologies in Electrical and Electronics Engineering (SIBIRCON), pages 15-22, 2010. [ bib | DOI ]
[Mairal2010Online] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res., 11:19-60, 2010. [ bib | .html | .pdf ]
Sparse coding, that is, modelling data vectors as sparse linear combinations of basis elements, is widely used in machine learning, neuroscience, signal processing, and statistics. This paper focuses on the large-scale matrix factorization problem that consists of learning the basis set in order to adapt it to specific data. Variations of this problem include dictionary learning in signal processing, non-negative matrix factorization and sparse principal component analysis. In this paper, we propose to address these tasks with a new online optimization algorithm, based on stochastic approximations, which scales up gracefully to large data sets with millions of training samples, and extends naturally to various matrix factorization formulations, making it suitable for a wide range of learning problems. A proof of convergence is presented, along with experiments with natural images and genomic data demonstrating that it leads to state-of-the-art performance in terms of speed and optimization for both small and large data sets.

[Mairal2010Sparse] J. Mairal. Sparse coding for machine learning, image processing and computer vision. PhD thesis, École normale supérieure de Cachan (ENS Cachan), 2010. [ bib ]
[Laird2010Principles] Peter W Laird. Principles and challenges of genome-wide DNA methylation analysis. Nat Rev Genet, 11(3):191-203, Feb 2010. [ bib | DOI | http ]
Methylation of cytosine bases in DNA provides a layer of epigenetic control in many eukaryotes that has important implications for normal biology and disease. Therefore, profiling DNA methylation across the genome is vital to understanding the influence of epigenetics. There has been a revolution in DNA methylation analysis technology over the past decade: analyses that previously were restricted to specific loci can now be performed on a genome-scale and entire methylomes can be characterized at single-base-pair resolution. However, there is such a diversity of DNA methylation profiling techniques that it can be challenging to select one. This Review discusses the different approaches and their relative merits and introduces considerations for data analysis.

[Kondor2010Ranking] R. I. Kondor and M. S. Barbosa. Ranking with kernels in Fourier space. In A. T. Kalai and M. Mohri, editors, COLT 2010 - The 23rd Conference on Learning Theory, Haifa, Israel, June 27-29, 2010, pages 451-463. Omnipress, 2010. [ bib | .pdf ]
[Kallioniemi2010DNA] Anne Kallioniemi. DNA copy number analysis on tissue microarrays. Methods Mol Biol, 664:127-134, 2010. [ bib | DOI | http ]
Detection of DNA sequence copy number changes is essential in both clinical practice and basic research, especially in cancer research. The combination of fluorescence in situ hybridization (FISH) and tissue microarray (TMA) technology provides high-throughput means for the evaluation of genetic aberrations in a large number of tissue samples. FISH on TMA is technically demanding and several protocols that include a variety of tissue pretreatment steps have been developed to improve the success of this methodology. Despite the technical difficulties, FISH analysis on TMA has been successfully used not only to uncover genetic alterations in various malignancies but also to rapidly establish the clinical significance of such changes.

[Jovanovic2010epigenetics] Jovana Jovanovic, Jo Anders Rønneberg, Jörg Tost, and Vessela Kristensen. The epigenetics of breast cancer. Mol Oncol, 4(3):242-254, Jun 2010. [ bib | DOI | http ]
Epigenetic changes can be defined as stable molecular alterations of a cellular phenotype such as the gene expression profile of a cell that are heritable during somatic cell divisions (and sometimes germ line transmissions) but do not involve changes of the DNA sequence itself. Epigenetic phenomena are mediated by several molecular mechanisms comprising histone modifications, polycomb/trithorax protein complexes, small non-coding or antisense RNAs and DNA methylation. These different modifications are closely interconnected. Epigenetic regulation is critical in normal growth and development and closely conditions the transcriptional potential of genes. Epigenetic mechanisms convey genomic adaptation to an environment, thereby ultimately contributing towards a given phenotype. In this review we will describe the various aspects of epigenetics and in particular DNA methylation in breast carcinogenesis and their potential application for diagnosis, prognosis and treatment decision.

Keywords: Breast Neoplasms, diagnosis/genetics/pathology/therapy; Chromatin, chemistry/metabolism; DNA Methylation; DNA Modification Methylases, metabolism; DNA, chemistry/metabolism; Epigenesis, Genetic; Female; Gene Expression Regulation, Neoplastic; Histones, metabolism; Humans; MicroRNAs, genetics/metabolism; Molecular Structure; Prognosis; Receptors, Estrogen, genetics/metabolism; Tumor Markers, Biological, metabolism
[Iwamoto2010Predicting] T. Iwamoto, L. Pusztai, et al. Predicting prognosis of breast cancer with gene signatures: are we lost in a sea of data? Genome Medicine, 2(11):81, 2010. [ bib ]
[Hwang2010Heterogeneous] T. Hwang and R. Kuang. A heterogeneous label propagation algorithm for disease gene discovery. In Proceedings of the SIAM International Conference on Data Mining, SDM 2010, April 29 - May 1, 2010, Columbus, Ohio, USA, pages 583-594, 2010. [ bib ]
[Huynh-Thu2010Inferring] V. A. Huynh-Thu, A. Irrthum, L. Wehenkel, and P. Geurts. Inferring regulatory networks from expression data using tree-based methods. PLoS One, 5(9):e12776, 2010. [ bib | DOI | http | .pdf ]
One of the pressing open problems of computational systems biology is the elucidation of the topology of genetic regulatory networks (GRNs) using high throughput genomic data, in particular microarray gene expression data. The Dialogue for Reverse Engineering Assessments and Methods (DREAM) challenge aims to evaluate the success of GRN inference algorithms on benchmarks of simulated data. In this article, we present GENIE3, a new algorithm for the inference of GRNs that was the best performer in the DREAM4 In Silico Multifactorial challenge. GENIE3 decomposes the prediction of a regulatory network between p genes into p different regression problems. In each of the regression problems, the expression pattern of one of the genes (target gene) is predicted from the expression patterns of all the other genes (input genes), using the tree-based ensemble methods Random Forests or Extra-Trees. The importance of an input gene in the prediction of the target gene expression pattern is taken as an indication of a putative regulatory link. Putative regulatory links are then aggregated over all genes to provide a ranking of interactions from which the whole network is reconstructed. In addition to performing well on the DREAM4 In Silico Multifactorial challenge simulated data, we show that GENIE3 compares favorably with existing algorithms to decipher the genetic regulatory network of Escherichia coli. It doesn't make any assumption about the nature of gene regulation, can deal with combinatorial and non-linear interactions, produces directed GRNs, and is fast and scalable. In conclusion, we propose a new algorithm for GRN inference that performs well on both synthetic and real gene expression data. The algorithm, based on feature selection with tree-based ensemble methods, is simple and generic, making it adaptable to other types of genomic data and interactions.

[Hue2010learning] M. Hue and J-P. Vert. On learning with kernels for unordered pairs. In J. Fürnkranz and T. Joachims, editors, Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pages 463-470. Omnipress, 2010. [ bib | .pdf ]
[Huang2010benefit] J. Huang and T. Zhang. The benefit of group sparsity. Ann. Stat., 38(4):1978-2004, 2010. [ bib | DOI | http ]
This paper develops a theory for group Lasso using a concept called strong group sparsity. Our result shows that group Lasso is superior to standard Lasso for strongly group-sparse signals. This provides a convincing theoretical justification for using group sparse regularization when the underlying group structure is consistent with the data. Moreover, the theory predicts some limitations of the group Lasso formulation that are confirmed by simulation studies.

[Hohlbein2010Surfing] Johannes Hohlbein, Kristofer Gryte, Mike Heilemann, and Achillefs N Kapanidis. Surfing on a new wave of single-molecule fluorescence methods. Phys Biol, 7(3):031001, 2010. [ bib | DOI | http ]
Single-molecule fluorescence microscopy is currently one of the most popular methods in the single-molecule toolbox. In this review, we discuss recent advances in fluorescence instrumentation and assays: these methods are characterized by a substantial increase in complexity of the instrumentation or biological samples involved. Specifically, we describe new multi-laser and multi-colour fluorescence spectroscopy and imaging techniques, super-resolution microscopy imaging and the development of instruments that combine fluorescence detection with other single-molecule methods such as force spectroscopy. We also highlight two pivotal developments in basic and applied biosciences: the new information available from detection of single molecules in single biological cells and exciting developments in fluorescence-based single-molecule DNA sequencing.

[Hoffmann2010new] Brice Hoffmann, Mikhail Zaslavskiy, Jean-Philippe Vert, and Veronique Stoven. A new protein binding pocket similarity measure based on comparison of clouds of atoms in 3D: application to ligand prediction. BMC Bioinformatics, 11(1):99, 2010. [ bib | DOI | http ]
BACKGROUND: Predicting which molecules can bind to a given binding site of a protein with known 3D structure is important to decipher the protein function, and useful in drug design. A classical assumption in structural biology is that proteins with similar 3D structures have related molecular functions, and therefore may bind similar ligands. However, proteins that do not display any overall sequence or structure similarity may also bind similar ligands if they contain similar binding sites. Quantitatively assessing the similarity between binding sites may therefore be useful to propose new ligands for a given pocket, based on those known for similar pockets. RESULTS: We propose a new method to quantify the similarity between binding pockets, and explore its relevance for ligand prediction. We represent each pocket by a cloud of atoms, and assess the similarity between two pockets by aligning their atoms in the 3D space and comparing the resulting configurations with a convolution kernel. Pocket alignment and comparison is possible even when the corresponding proteins share no sequence or overall structure similarities. In order to predict ligands for a given target pocket, we compare it to an ensemble of pockets with known ligands to identify the most similar pockets. We discuss two criteria to evaluate the performance of a binding pocket similarity measure in the context of ligand prediction, namely, area under ROC curve (AUC scores) and classification based scores. We show that the latter is better suited to evaluate the methods with respect to ligand prediction, and demonstrate the relevance of our new binding site similarity compared to existing similarity measures. CONCLUSIONS: This study demonstrates the relevance of the proposed method to identify ligands binding to known binding pockets. We also provide a new benchmark for future work in this field. The new method and the benchmark are available at http://cbio.ensmp.fr/paris

[He2010Stable] Z. He and W. Yu. Stable feature selection for biomarker discovery. arXiv preprint arXiv:1001.0887, 2010. [ bib ]
[Haury2010stability] A.-C. Haury and J-P. Vert. On the stability and interpretability of prognosis signatures in breast cancer. In Proceedings of the Fourth International Workshop on Machine Learning in Systems Biology (MLSB10), 2010. To appear. [ bib ]
[Haury2010Increasing] A.C. Haury, L. Jacob, and J.P. Vert. Increasing stability and interpretability of gene expression signatures. arXiv preprint arXiv:1001.3109, 2010. [ bib ]
[Harchaoui2010Multiple] Z. Harchaoui and C. Levy-Leduc. Multiple change-point estimation with a total variation penalty. J. Am. Stat. Assoc., 105(492):1480-1493, 2010. [ bib | DOI | http | .pdf ]
[Han2010variance] Y. Han and L. Yu. A variance reduction framework for stable feature selection. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 206-215. IEEE, 2010. [ bib ]
[Golub2010Counterpoint] T. Golub. Counterpoint: Data first. Nature, 464(7289):679, Apr 2010. [ bib | DOI | http ]
[Gibney2010Epigenetics] E. R. Gibney and C. M. Nolan. Epigenetics and gene expression. Heredity, 105(1):4-13, Jul 2010. [ bib | DOI | http ]
Transcription, translation and subsequent protein modification represent the transfer of genetic information from the archival copy of DNA to the short-lived messenger RNA, usually with subsequent production of protein. Although all cells in an organism contain essentially the same DNA, cell types and functions differ because of qualitative and quantitative differences in their gene expression. Thus, control of gene expression is at the heart of differentiation and development. Epigenetic processes, including DNA methylation, histone modification and various RNA-mediated processes, are thought to influence gene expression chiefly at the level of transcription; however, other steps in the process (for example, translation) may also be regulated epigenetically. The following paper will outline the role epigenetics is believed to have in influencing gene expression.

Keywords: Animals; Cell Differentiation; Cell Lineage; DNA Modification Methylases, metabolism; Epigenesis, Genetic; Gene Expression; Heredity; Humans
[Genuer2010Variable] R. Genuer, J.M. Poggi, and C. Tuleau-Malot. Variable selection using random forests. Pattern Recognition Letters, 31(14):2225-2236, 2010. [ bib ]
[Gehlenborg2010Visualization] Nils Gehlenborg, Seán I O'Donoghue, Nitin S Baliga, Alexander Goesmann, Matthew A Hibbs, Hiroaki Kitano, Oliver Kohlbacher, Heiko Neuweger, Reinhard Schneider, Dan Tenenbaum, and Anne-Claude Gavin. Visualization of omics data for systems biology. Nat Methods, 7(3 Suppl):S56-S68, Mar 2010. [ bib | DOI | http ]
High-throughput studies of biological systems are rapidly accumulating a wealth of 'omics'-scale data. Visualization is a key aspect of both the analysis and understanding of these data, and users now have many visualization methods and tools to choose from. The challenge is to create clear, meaningful and integrated visualizations that give biological insight, without being overwhelmed by the intrinsic complexity of the data. In this review, we discuss how visualization tools are being used to help interpret protein interaction, gene expression and metabolic profile data, and we highlight emerging new directions.

Keywords: Genomics; Image Processing, Computer-Assisted; Mass Spectrometry; Metabolomics; Nuclear Magnetic Resonance, Biomolecular; Protein Binding; Proteomics; Systems Biology
[Gardner2010Reverse] T. S. Gardner and J. J. Faith. Reverse-engineering transcription control networks. Phys. Life Rev., 2(1):65-88, Apr 2010. [ bib | DOI | http | .pdf ]
Microarray technologies, which enable the simultaneous measurement of all RNA transcripts in a cell, have spawned the development of algorithms for reverse-engineering transcription control networks. In this article, we classify the algorithms into two general strategies: physical modeling and influence modeling. We discuss the biological and computational principles underlying each strategy, and provide leading examples of each. We also discuss the practical considerations for developing and applying the various methods.

[Fullwood2010Chromatin] Melissa J Fullwood, Yuyuan Han, Chia-Lin Wei, Xiaoan Ruan, and Yijun Ruan. Chromatin interaction analysis using paired-end tag sequencing. Curr Protoc Mol Biol, Chapter 21:Unit 21.15.1-21.15.25, Jan 2010. [ bib | DOI | http ]
Chromatin Interaction Analysis using Paired-End Tag sequencing (ChIA-PET) is a technique developed for large-scale, de novo analysis of higher-order chromatin structures. Cells are treated with formaldehyde to cross-link chromatin interactions, DNA segments bound by protein factors are enriched by chromatin immunoprecipitation, and interacting DNA fragments are then captured by proximity ligation. The Paired-End Tag (PET) strategy is applied to the construction of ChIA-PET libraries, which are sequenced by high-throughput next-generation sequencing technologies. Finally, raw PET sequences are subjected to bioinformatics analysis, resulting in a genome-wide map of binding sites and chromatin interactions mediated by the protein factor under study. This unit describes ChIA-PET for genome-wide analysis of chromatin interactions in mammalian cells, with the application of Roche/454 and Illumina sequencing technologies.

Keywords: Animals; Chromatin; Computational Biology; Databases, Nucleic Acid; Genome-Wide Association Study; Humans; Sequence Analysis, DNA
[Freedman2010Lies] D.H. Freedman. Lies, damned lies, and medical science. The Atlantic, 306(4):76-84, 2010. [ bib ]
[Flusberg2010Direct] Benjamin A Flusberg, Dale R Webster, Jessica H Lee, Kevin J Travers, Eric C Olivares, Tyson A Clark, Jonas Korlach, and Stephen W Turner. Direct detection of DNA methylation during single-molecule, real-time sequencing. Nat Methods, 7(6):461-465, Jun 2010. [ bib | DOI | http ]
We describe the direct detection of DNA methylation, without bisulfite conversion, through single-molecule, real-time (SMRT) sequencing. In SMRT sequencing, DNA polymerases catalyze the incorporation of fluorescently labeled nucleotides into complementary nucleic acid strands. The arrival times and durations of the resulting fluorescence pulses yield information about polymerase kinetics and allow direct detection of modified nucleotides in the DNA template, including N6-methyladenine, 5-methylcytosine and 5-hydroxymethylcytosine. Measurement of polymerase kinetics is an intrinsic part of SMRT sequencing and does not adversely affect determination of primary DNA sequence. The various modifications affect polymerase kinetics differently, allowing discrimination between them. We used these kinetic signatures to identify adenine methylation in genomic samples and found that, in combination with circular consensus sequencing, they can enable single-molecule identification of epigenetic modifications with base-pair resolution. This method is amenable to long read lengths and will likely enable mapping of methylation patterns in even highly repetitive genomic regions.

Keywords: DNA Methylation; DNA-Directed DNA Polymerase, metabolism; Escherichia coli, genetics; Kinetics; Principal Component Analysis; Sequence Analysis, DNA, methods
[Ernst2010Discovery] J. Ernst and M. Kellis. Discovery and characterization of chromatin states for systematic annotation of the human genome. Nat. Biotechnol., 28(8):817-825, Aug 2010. [ bib | DOI | http ]
A plethora of epigenetic modifications have been described in the human genome and shown to play diverse roles in gene regulation, cellular differentiation and the onset of disease. Although individual modifications have been linked to the activity levels of various genetic functional elements, their combinatorial patterns are still unresolved and their potential for systematic de novo genome annotation remains untapped. Here, we use a multivariate Hidden Markov Model to reveal 'chromatin states' in human T cells, based on recurrent and spatially coherent combinations of chromatin marks. We define 51 distinct chromatin states, including promoter-associated, transcription-associated, active intergenic, large-scale repressed and repeat-associated states. Each chromatin state shows specific enrichments in functional annotations, sequence motifs and specific experimentally observed characteristics, suggesting distinct biological roles. This approach provides a complementary functional annotation of the human genome that reveals the genome-wide locations of diverse classes of epigenetic function.

[Efcavitch2010Single-molecule] J. William Efcavitch and John F Thompson. Single-molecule DNA analysis. Annu Rev Anal Chem (Palo Alto Calif), 3:109-128, 2010. [ bib | DOI | http ]
The ability to detect single molecules of DNA or RNA has led to an extremely rich area of exploration of the single most important biomolecule in nature. In cases in which the nucleic acid molecules are tethered to a solid support, confined to a channel, or simply allowed to diffuse into a detection volume, novel techniques have been developed to manipulate the DNA and to examine properties such as structural dynamics and protein-DNA interactions. Beyond the analysis of the properties of nucleic acids themselves, single-molecule detection has enabled dramatic improvements in the throughput of DNA sequencing and holds promise for continuing progress. Both optical and nonoptical detection methods that use surfaces, nanopores, and zero-mode waveguides have been attempted, and one optically based instrument is already commercially available. The breadth of literature related to single-molecule DNA analysis is vast; this review focuses on a survey of efforts in molecular dynamics and nucleic acid sequencing.

[Domon2010Options] Bruno Domon and Ruedi Aebersold. Options and considerations when selecting a quantitative proteomics strategy. Nat Biotechnol, 28(7):710-721, Jul 2010. [ bib | DOI | http ]
The vast majority of proteomic studies to date have relied on mass spectrometric techniques to identify, and in some cases quantify, peptides that have been generated by proteolysis. Current approaches differ in the types of instrument used, their performance profiles, the manner in which they interface with biological research strategies, and their reliance on and use of prior information. Here, we consider the three main mass spectrometry (MS)-based proteomic approaches used today: shotgun (or discovery), directed and targeted strategies. We discuss the principles of each technique, their strengths and weaknesses and the dependence of their performance profiles on the composition of the biological sample. Our goal is to provide a rational framework for selecting strategies optimally suited to address the specific research issue under consideration.

[David2010Cancer:] A. Rosalie David and Michael R Zimmerman. Cancer: an old disease, a new disease or something in between? Nat Rev Cancer, 10(10):728-733, Oct 2010. [ bib | DOI | http ]
In industrialized societies, cancer is second only to cardiovascular disease as a cause of death. The history of this disorder has the potential to improve our understanding of disease prevention, aetiology, pathogenesis and treatment. A striking rarity of malignancies in ancient physical remains might indicate that cancer was rare in antiquity, and so poses questions about the role of carcinogenic environmental factors in modern societies. Although the rarity of cancer in antiquity remains undisputed, the first published histological diagnosis of cancer in an Egyptian mummy demonstrates that new evidence is still forthcoming.

Keywords: Animals; Art; Fossils; Hominidae; Humans; Neoplasms; Paleopathology
[Dalca2010VARiD] A. V. Dalca, S. M. Rumble, S. Levy, and M. Brudno. VARiD: a variation detection framework for color-space and letter-space platforms. Bioinformatics, 26(12):i343-i349, Jun 2010. [ bib | DOI | http | .pdf ]
High-throughput sequencing (HTS) technologies are transforming the study of genomic variation. The various HTS technologies have different sequencing biases and error rates, and while most HTS technologies sequence the residues of the genome directly, generating base calls for each position, the Applied Biosystems SOLiD platform generates dibase-coded (color space) sequences. While combining data from the various platforms should increase the accuracy of variation detection, to date there are only a few tools that can identify variants from color space data, and none that can analyze color space and regular (letter space) data together. We present VARiD, a probabilistic method for variation detection from both letter- and color-space reads simultaneously. VARiD is based on a hidden Markov model and uses the forward-backward algorithm to accurately identify heterozygous, homozygous and tri-allelic SNPs, as well as micro-indels. Our analysis shows that VARiD performs better than the AB SOLiD toolset at detecting variants from color-space data alone, and improves the calls dramatically when letter- and color-space reads are combined. The toolset is freely available at http://compbio.cs.utoronto.ca/varid.

[Cook2010model] Peter R Cook. A model for all genomes: the role of transcription factories. J Mol Biol, 395(1):1-10, Jan 2010. [ bib | DOI | http ]
A model for all genomes involving one major architectural motif is presented: DNA or chromatin loops are tethered to "factories" through the transcription machinery, that is, a polymerase (active or inactive) or its transcription factors (activators or repressors). These loops appear and disappear as polymerases initiate and terminate (and as factors bind and dissociate), so the structure is ever-changing and self-organizing. This model is parsimonious, detailed (and so easily tested), and incorporates elements found in various other models.

Keywords: Animals; Bacteria; Eukaryota; Gene Expression Regulation; Genome; Humans; Models, Genetic; Transcription, Genetic
[Chowdhury2010Identification] S. A. Chowdhury and M. Koyutürk. Identification of coordinately dysregulated subnetworks in complex phenotypes. Pac. Symp. Biocomput., pages 133-144, 2010. [ bib | .pdf ]
In the study of complex phenotypes, single gene markers can only provide limited insights into the manifestation of phenotype. To this end, protein-protein interaction (PPI) networks prove useful in the identification of multiple interacting markers. Recent studies show that, when considered together, many proteins that are connected via physical and functional interactions exhibit significant differential expression with respect to various complex phenotypes, including cancers. As compared to single gene markers, these "coordinately dysregulated subnetworks" improve diagnosis and prognosis of cancer significantly and offer novel insights into the network dynamics of phenotype. However, the problem of identifying coordinately dysregulated subnetworks presents significant algorithmic challenges. Existing approaches utilize heuristics that aim to greedily maximize information-theoretic class separability measures, however, by definition of "coordinate" dysregulation, such greedy algorithms do not suit well to this problem. In this paper, we formulate coordinate dysregulation in the context of the well-known set-cover problem, with a view to capturing the coordination between multiple genes at a sample-specific resolution. Based on this formulation, we adapt state-of-the-art approximation algorithms for set-cover to the identification of coordinately dysregulated subnetworks. Comprehensive experimental results on human colorectal cancer (CRC) show that, when compared to existing algorithms, the proposed algorithm, NETCOVER, improves diagnosis of cancer and prediction of metastasis significantly. Our results also demonstrate that subnetworks in the neighborhood of known CRC driver genes exhibit significant coordinate dysregulation, indicating that the notion of coordinate dysregulation may indeed be useful in understanding the network dynamics of complex phenotypes.

[Choudhary2010Decoding] Chunaram Choudhary and Matthias Mann. Decoding signalling networks by mass spectrometry-based proteomics. Nat Rev Mol Cell Biol, 11(6):427-439, Jun 2010. [ bib | DOI | http ]
Signalling networks regulate essentially all of the biology of cells and organisms in normal and disease states. Signalling is often studied using antibody-based techniques such as western blots. Large-scale 'precision proteomics' based on mass spectrometry now enables the system-wide characterization of signalling events at the levels of post-translational modifications, protein-protein interactions and changes in protein expression. This technology delivers accurate and unbiased information about the quantitative changes of thousands of proteins and their modifications in response to any perturbation. Current studies focus on phosphorylation, but acetylation, methylation, glycosylation and ubiquitylation are also becoming amenable to investigation. Large-scale proteomics-based signalling research will fundamentally change our understanding of signalling networks.

Keywords: Animals; Humans; Mass Spectrometry; Protein Processing, Post-Translational; Proteome; Proteomics; Signal Transduction
[Caie2010High] Peter D Caie, Rebecca E Walls, Alexandra Ingleston-Orme, Sandeep Daya, Tom Houslay, Rob Eagle, Mark E Roberts, and Neil O Carragher. High-content phenotypic profiling of drug response signatures across distinct cancer cells. Mol Cancer Ther, 9(6):1913-1926, Jun 2010. [ bib | DOI | http ]
The application of high-content imaging in conjunction with multivariate clustering techniques has recently shown value in the confirmation of cellular activity and further characterization of drug mode of action following pharmacologic perturbation. However, such practical examples of phenotypic profiling of drug response published to date have largely been restricted to cell lines and phenotypic response markers that are amenable to basic cellular imaging. As such, these approaches preclude the analysis of both complex heterogeneous phenotypic responses and subtle changes in cell morphology across physiologically relevant cell panels. Here, we describe the application of a cell-based assay and custom designed image analysis algorithms designed to monitor morphologic phenotypic response in detail across distinct cancer cell types. We further describe the integration of these methods with automated data analysis workflows incorporating principal component analysis, Kohonen neural networking, and kNN classification to enable rapid and robust interrogation of such data sets. We show the utility of these approaches by providing novel insight into pharmacologic response across four cancer cell types, Ovcar3, MiaPaCa2, and MCF7 cells wild-type and mutant for p53. These methods have the potential to drive the development of a new generation of novel therapeutic classes encompassing pharmacologic compositions or polypharmacology in appropriate disease context.

[Bullard2010Evaluation] J. H. Bullard, E. Purdom, K. D. Hansen, and S. Dudoit. Evaluation of statistical methods for normalization and differential expression in mrna-seq experiments. BMC Bioinformatics, 11:94, 2010. [ bib | DOI | http ]
High-throughput sequencing technologies, such as the Illumina Genome Analyzer, are powerful new tools for investigating a wide range of biological and medical questions. Statistical and computational methods are key for drawing meaningful and accurate conclusions from the massive and complex datasets generated by the sequencers. We provide a detailed evaluation of statistical methods for normalization and differential expression (DE) analysis of Illumina transcriptome sequencing (mRNA-Seq) data. We compare statistical methods for detecting genes that are significantly DE between two types of biological samples and find that there are substantial differences in how the test statistics handle low-count genes. We evaluate how DE results are affected by features of the sequencing platform, such as varying gene lengths, base-calling calibration method (with and without phi X control lane), and flow-cell/library preparation effects. We investigate the impact of the read count normalization method on DE results and show that the standard approach of scaling by total lane counts (e.g., RPKM) can bias estimates of DE. We propose more general quantile-based normalization procedures and demonstrate an improvement in DE detection. Our results have significant practical and methodological implications for the design and analysis of mRNA-Seq experiments. They highlight the importance of appropriate statistical methods for normalization and DE inference, to account for features of the sequencing platform that could impact the accuracy of results. They also reveal the need for further research in the development of statistical and computational methods for mRNA-Seq.

Keywords: Computational Biology; Databases, Genetic; RNA, Messenger; Sequence Analysis, RNA
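
Note on [Bullard2010Evaluation] (illustrative sketch, not taken from the paper): the abstract contrasts scaling by total lane counts with quantile-based normalization. The toy example below shows why the two can disagree. It assumes the upper quartile (75th percentile) as the quantile, computed over genes with a nonzero count in at least one lane; the counts, the lane names, and the helpers `upper_quartile` and `scale_factors` are hypothetical.

```python
def upper_quartile(values):
    """75th percentile, interpolating linearly between order statistics."""
    ordered = sorted(values)
    pos = 0.75 * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def scale_factors(counts, method):
    """Per-lane scaling factors, rescaled so that they average to 1.

    counts: dict mapping lane name -> list of per-gene read counts,
            with genes in the same order in every lane.
    method: "total" (sum of counts) or "upper_quartile".
    """
    lanes = list(counts)
    n_genes = len(counts[lanes[0]])
    if method == "total":
        raw = {lane: float(sum(counts[lane])) for lane in lanes}
    elif method == "upper_quartile":
        # Assumption: ignore genes that have zero reads in every lane.
        keep = [g for g in range(n_genes)
                if any(counts[lane][g] > 0 for lane in lanes)]
        raw = {lane: float(upper_quartile([counts[lane][g] for g in keep]))
               for lane in lanes}
    else:
        raise ValueError("unknown method: " + method)
    mean = sum(raw.values()) / len(raw)
    return {lane: raw[lane] / mean for lane in lanes}


# Eight genes with identical counts in both lanes, plus one gene that is
# massively over-expressed in lane B only.
counts = {
    "A": [10, 20, 30, 40, 50, 60, 70, 80, 90],
    "B": [10, 20, 30, 40, 50, 60, 70, 80, 5000],
}
tot = scale_factors(counts, "total")
uq = scale_factors(counts, "upper_quartile")
print("total-count factors:   ", tot)
print("upper-quartile factors:", uq)
```

In this toy data the single dominant gene inflates lane B's total, so total-count scaling would make the eight unchanged genes look strongly down-regulated in B, whereas the upper-quartile factors of the two lanes coincide.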
[Bullard2010Polygenic] J. H. Bullard, Y. Mostovoy, S. Dudoit, and R. B. Brem. Polygenic and directional regulatory evolution across pathways in Saccharomyces. PNAS, 107(11):5058-5063, 2010. [ bib | http ]
[Boulesteix2010Over-optimism] A.L. Boulesteix. Over-optimism in bioinformatics research. Bioinformatics, 26(3):437-439, 2010. [ bib ]
[Berkum2010HiC] N. L. van Berkum, E. Lieberman-Aiden, L. Williams, M. Imakaev, A. Gnirke, L. A. Mirny, J. Dekker, and E. S. Lander. Hi-C: a method to study the three-dimensional architecture of genomes. J. Vis. Exp., 39:e1869, 2010. [ bib | DOI | http ]
The three-dimensional folding of chromosomes compartmentalizes the genome and can bring distant functional elements, such as promoters and enhancers, into close spatial proximity (2-6). Deciphering the relationship between chromosome organization and genome activity will aid in understanding genomic processes, like transcription and replication. However, little is known about how chromosomes fold. Microscopy is unable to distinguish large numbers of loci simultaneously or at high resolution. To date, the detection of chromosomal interactions using chromosome conformation capture (3C) and its subsequent adaptations required the choice of a set of target loci, making genome-wide studies impossible (7-10). We developed Hi-C, an extension of 3C that is capable of identifying long range interactions in an unbiased, genome-wide fashion. In Hi-C, cells are fixed with formaldehyde, causing interacting loci to be bound to one another by means of covalent DNA-protein cross-links. When the DNA is subsequently fragmented with a restriction enzyme, these loci remain linked. A biotinylated residue is incorporated as the 5' overhangs are filled in. Next, blunt-end ligation is performed under dilute conditions that favor ligation events between cross-linked DNA fragments. This results in a genome-wide library of ligation products, corresponding to pairs of fragments that were originally in close proximity to each other in the nucleus. Each ligation product is marked with biotin at the site of the junction. The library is sheared, and the junctions are pulled down with streptavidin beads. The purified junctions can subsequently be analyzed using a high-throughput sequencer, resulting in a catalog of interacting fragments. Direct analysis of the resulting contact matrix reveals numerous features of genomic organization, such as the presence of chromosome territories and the preferential association of small gene-rich chromosomes.
Correlation analysis can be applied to the contact matrix, demonstrating that the human genome is segregated into two compartments: a less densely packed compartment containing open, accessible, and active chromatin and a more dense compartment containing closed, inaccessible, and inactive chromatin regions. Finally, ensemble analysis of the contact matrix, coupled with theoretical derivations and computational simulations, revealed that at the megabase scale Hi-C reveals features consistent with a fractal globule conformation.

Keywords: ngs, hic
[Baker2010Next] Monya Baker. Next-generation sequencing: adjusting to data overload. Nature Methods, 7:495-499, 2010. [ bib ]
[Auer2010Statistical] P. L. Auer and R. W. Doerge. Statistical design and analysis of rna sequencing data. Genetics, 185(2):405-416, Jun 2010. [ bib | DOI | http ]
Next-generation sequencing technologies are quickly becoming the preferred approach for characterizing and quantifying entire genomes. Even though data produced from these technologies are proving to be the most informative of any thus far, very little attention has been paid to fundamental design aspects of data collection and analysis, namely sampling, randomization, replication, and blocking. We discuss these concepts in an RNA sequencing framework. Using simulations we demonstrate the benefits of collecting replicated RNA sequencing data according to well known statistical designs that partition the sources of biological and technical variation. Examples of these designs and their corresponding models are presented with the goal of testing differential expression.

[Aranda2010IntAct] B. Aranda, P. Achuthan, Y. Alam-Faruque, I. Armean, A. Bridge, C. Derow, M. Feuermann, A. T. Ghanbarian, S. Kerrien, J. Khadake, J. Kerssemakers, C. Leroy, M. Menden, M. Michaut, L. Montecchi-Palazzi, S. N. Neuhauser, S. Orchard, V. Perreau, B. Roechert, K. van Eijk, and H. Hermjakob. The intact molecular interaction database in 2010. Nucleic Acids Res, 38(Database issue):D525-D531, Jan 2010. [ bib | DOI | http ]
IntAct is an open-source, open data molecular interaction database and toolkit. Data is abstracted from the literature or from direct data depositions by expert curators following a deep annotation model providing a high level of detail. As of September 2009, IntAct contains over 200,000 curated binary interaction evidences. In response to the growing data volume and user requests, IntAct now provides a two-tiered view of the interaction data. The search interface allows the user to iteratively develop complex queries, exploiting the detailed annotation with hierarchical controlled vocabularies. Results are provided at any stage in a simplified, tabular view. Specialized views then allow 'zooming in' on the full annotation of interactions, interactors and their properties. IntAct source code and data are freely available at http://www.ebi.ac.uk/intact.

Keywords: Animals; Computational Biology; Databases, Genetic; Databases, Protein; False Positive Reactions; Humans; Information Storage and Retrieval; Internet; Programming Languages; Protein Interaction Mapping; Protein Structure, Tertiary; Proteins; Software; User-Computer Interface; Vocabulary, Controlled
[Abraham2010Prediction] Gad Abraham, Adam Kowalczyk, Sherene Loi, Izhak Haviv, and Justin Zobel. Prediction of breast cancer prognosis using gene set statistics provides signature stability and biological context. BMC Bioinformatics, 11(1):277, 2010. [ bib | DOI | http | .pdf ]
[Abeel2010Robust] T. Abeel, T. Helleputte, Y. Van de Peer, P. Dupont, and Y. Saeys. Robust biomarker identification for cancer diagnosis with ensemble feature selection methods. Bioinformatics, 26:392-398, Nov 2010. [ bib | DOI | http | .pdf ]
MOTIVATION: Biomarker discovery is an important topic in biomedical applications of computational biology, including applications such as gene and SNP selection from high dimensional data. Surprisingly, the stability with respect to sampling variation or robustness of such selection processes has received attention only recently. However, robustness of biomarkers is an important issue, as it may greatly influence subsequent biological validations. In addition, a more robust set of markers may strengthen the confidence of an expert in the results of a selection method. RESULTS: Our first contribution is a general framework for the analysis of the robustness of a biomarker selection algorithm. Secondly, we conducted a large-scale analysis of the recently introduced concept of ensemble feature selection, where multiple feature selections are combined in order to increase the robustness of the final set of selected features. We focus on selection methods that are embedded in the estimation of support vector machines (SVMs). SVMs are powerful classification models that have shown state-of-the-art performance on several diagnosis and prognosis tasks on biological data. Their feature selection extensions also offered good results for gene selection tasks. We show that the robustness of SVMs for biomarker discovery can be substantially increased by using ensemble feature selection techniques, while at the same time improving upon classification performances. The proposed methodology is evaluated on four microarray data sets showing increases of up to almost 30% in robustness of the selected biomarkers, along with an improvement of about 15% in classification performance. The stability improvement with ensemble methods is particularly noticeable for small signature sizes (a few tens of genes), which is most relevant for the design of a diagnosis or prognosis model from a gene signature. CONTACT: yvan.saeys@psb.ugent.be.

[Consortium2010map] 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature, 467(7319):1061-1073, Oct 2010. [ bib | DOI | http ]
The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother-father-child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10^-8 per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.

Keywords: Calibration; Chromosomes, Human, Y, genetics; Computational Biology; DNA Mutational Analysis; DNA, Mitochondrial, genetics; Evolution, Molecular; Female; Genetic Association Studies; Genetic Variation, genetics; Genetics, Population, methods; Genome, Human, genetics; Genome-Wide Association Study; Genomics, methods; Genotype; Haplotypes, genetics; Humans; Male; Mutation, genetics; Pilot Projects; Polymorphism, Single Nucleotide, genetics; Recombination, Genetic, genetics; Sample Size; Selection, Genetic, genetics; Sequence Alignment; Sequence Analysis, DNA, methods
[Zanella2010High] Fabian Zanella, James B Lorens, and Wolfgang Link. High content screening: seeing is believing. Trends Biotechnol, 28(5):237-245, May 2010. [ bib | DOI | http ]
High content screening (HCS) combines the efficiency of high-throughput techniques with the ability of cellular imaging to collect quantitative data from complex biological systems. HCS technology is integrated into all aspects of contemporary drug discovery, including primary compound screening, post-primary screening capable of supporting structure-activity relationships, and early evaluation of ADME (absorption, distribution, metabolism and excretion)/toxicity properties and complex multivariate drug profiling. Recently, high content approaches have been used extensively to interrogate stem cell biology. Despite these dramatic advances, a number of significant challenges remain related to the use of more biology- and disease-relevant cell systems, the development of informative reagents to measure and manipulate cellular events, and the integration of data management and informatics.

Keywords: Animals; Drug Evaluation, Preclinical, instrumentation/methods; High-Throughput Screening Assays, instrumentation/methods; Humans; Stem Cells, drug effects; Structure-Activity Relationship
[Trapnell2010Transcript] C. Trapnell, B. A. Williams, G. Pertea, A. Mortazavi, G. Kwan, M. J. van Baren, S. L. Salzberg, B. J. Wold, and L. Pachter. Transcript assembly and quantification by rna-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol, 28(5):511-515, May 2010. [ bib | DOI | http | .pdf ]
High-throughput mRNA sequencing (RNA-Seq) promises simultaneous transcript discovery and abundance estimation. However, this would require algorithms that are not restricted by prior gene annotations and that account for alternative transcription and splicing. Here we introduce such algorithms in an open-source software program called Cufflinks. To test Cufflinks, we sequenced and analyzed >430 million paired 75-bp RNA-Seq reads from a mouse myoblast cell line over a differentiation time series. We detected 13,692 known transcripts and 3,724 previously unannotated ones, 62% of which are supported by independent expression data or by homologous genes in other species. Over the time series, 330 genes showed complete switches in the dominant transcription start site (TSS) or splice isoform, and we observed more subtle shifts in 1,304 other genes. These results suggest that Cufflinks can illuminate the substantial regulatory flexibility and complexity in even this well-studied model of muscle development and that it can improve transcriptome-based genome annotation.

Keywords: ngs, rnaseq
[Tomizaki2010Protein] Kin-ya Tomizaki, Kenji Usui, and Hisakazu Mihara. Protein-protein interactions and selection: array-based techniques for screening disease-associated biomarkers in predictive/early diagnosis. FEBS J, 277(9):1996-2005, May 2010. [ bib | DOI | http ]
There has been considerable interest in recent years in the development of miniaturized and parallelized array technology for protein-protein interaction analysis and protein profiling, namely 'protein-detecting microarrays'. Protein-detecting microarrays utilize a wide variety of capture agents (antibodies, fusion proteins, DNA/RNA aptamers, synthetic peptides, carbohydrates, and small molecules) immobilized at high spatial density on a solid surface. Each capture agent binds selectively to its target protein in a complex mixture, such as serum or cell lysate samples. Captured proteins are subsequently detected and quantified in a high-throughput fashion, with minimal sample consumption. Protein-detecting microarrays were first described by MacBeath and Schreiber in 2000, and the number of publications involving this technology is rapidly increasing. Furthermore, the first multiplex immunoassay systems have been cleared by the US Food and Drug Administration, signaling recognition of the usefulness of miniaturized and parallelized array technology for protein detection in predictive/early diagnosis. Although genetic tests still predominate, with further development protein-based diagnosis will become common in clinical use within a few years.

Keywords: Animals; Biological Markers, analysis/metabolism; Early Diagnosis; Humans; Mass Screening, methods; Protein Array Analysis, methods; Proteins, analysis/metabolism; Risk Factors
[Blows2010Subtyping] Fiona M Blows, Kristy E Driver, Marjanka K Schmidt, Annegien Broeks, Flora E van Leeuwen, Jelle Wesseling, Maggie C Cheang, Karen Gelmon, Torsten O Nielsen, Carl Blomqvist, Päivi Heikkilä, Tuomas Heikkinen, Heli Nevanlinna, Lars A Akslen, Louis R Bégin, William D Foulkes, Fergus J Couch, Xianshu Wang, Vicky Cafourek, Janet E Olson, Laura Baglietto, Graham G Giles, Gianluca Severi, Catriona A McLean, Melissa C Southey, Emad Rakha, Andrew R Green, Ian O Ellis, Mark E Sherman, Jolanta Lissowska, William F Anderson, Angela Cox, Simon S Cross, Malcolm W R Reed, Elena Provenzano, Sarah-Jane Dawson, Alison M Dunning, Manjeet Humphreys, Douglas F Easton, Montserrat García-Closas, Carlos Caldas, Paul D Pharoah, and David Huntsman. Subtyping of breast cancer by immunohistochemistry to investigate a relationship between subtype and short and long term survival: a collaborative analysis of data for 10,159 cases from 12 studies. PLoS Med, 7(5):e1000279, May 2010. [ bib | DOI | http | .pdf ]
Immunohistochemical markers are often used to classify breast cancer into subtypes that are biologically distinct and behave differently. The aim of this study was to estimate mortality for patients with the major subtypes of breast cancer as classified using five immunohistochemical markers, to investigate patterns of mortality over time, and to test for heterogeneity by subtype. We pooled data from more than 10,000 cases of invasive breast cancer from 12 studies that had collected information on hormone receptor status, human epidermal growth factor receptor-2 (HER2) status, and at least one basal marker (cytokeratin [CK]5/6 or epidermal growth factor receptor [EGFR]) together with survival time data. Tumours were classified as luminal and nonluminal tumours according to hormone receptor expression. These two groups were further subdivided according to expression of HER2, and finally, the luminal and nonluminal HER2-negative tumours were categorised according to expression of basal markers. Changes in mortality rates over time differed by subtype. In women with luminal HER2-negative subtypes, mortality rates were constant over time, whereas mortality rates associated with the luminal HER2-positive and nonluminal subtypes tended to peak within 5 y of diagnosis and then decline over time. In the first 5 y after diagnosis the nonluminal tumours were associated with a poorer prognosis, but over longer follow-up times the prognosis was poorer in the luminal subtypes, with the worst prognosis at 15 y being in the luminal HER2-positive tumours. Basal marker expression distinguished the HER2-negative luminal and nonluminal tumours into different subtypes. These patterns were independent of any systemic adjuvant therapy. The six subtypes of breast cancer defined by expression of five markers show distinct behaviours with important differences in short term and long term prognosis.
Application of these markers in the clinical setting could have the potential to improve the targeting of adjuvant chemotherapy to those most likely to benefit. The different patterns of mortality over time also suggest important biological differences between the subtypes that may result in differences in response to specific therapies, and that stratification of breast cancers by clinically relevant subtypes in clinical trials is urgently required.

Keywords: Adult; Aged; Aged, 80 and over; Breast Neoplasms, metabolism/mortality/pathology; Female; Hormones, analysis; Humans; Immunohistochemistry; Keratins; Middle Aged; Prognosis; Proportional Hazards Models; Receptor, Epidermal Growth Factor, analysis; Receptors, Cell Surface, metabolism; Tumor Markers, Biological, analysis; Young Adult
[Barash2010Deciphering] Yoseph Barash, John A Calarco, Weijun Gao, Qun Pan, Xinchen Wang, Ofer Shai, Benjamin J Blencowe, and Brendan J Frey. Deciphering the splicing code. Nature, 465(7294):53-59, May 2010. [ bib | DOI | http ]
Alternative splicing has a crucial role in the generation of biological complexity, and its misregulation is often involved in human disease. Here we describe the assembly of a 'splicing code', which uses combinations of hundreds of RNA features to predict tissue-dependent changes in alternative splicing for thousands of exons. The code determines new classes of splicing patterns, identifies distinct regulatory programs in different tissues, and identifies mutation-verified regulatory sequences. Widespread regulatory strategies are revealed, including the use of unexpectedly large combinations of features, the establishment of low exon inclusion levels that are overcome by features in specific tissues, the appearance of features deeper into introns than previously appreciated, and the modulation of splice variant levels by transcript structure characteristics. The code detected a class of exons whose inclusion silences expression in adult tissues by activating nonsense-mediated messenger RNA decay, but whose exclusion promotes expression during embryogenesis. The code facilitates the discovery and detailed characterization of regulated alternative splicing events on a genome-wide scale.

Keywords: Alternative Splicing, genetics; Animals; Gene Expression Regulation; Gene Silencing; Genetic Code, genetics; Humans; Mice; Models, Genetic; RNA, Messenger, metabolism; Reproducibility of Results
[Mordelet2010bagging] F. Mordelet and J-P. Vert. A bagging SVM to learn from positive and unlabeled examples. Technical Report 00523336, HAL, October 2010. [ bib | http ]
We consider the problem of learning a binary classifier from a training set of positive and unlabeled examples, both in the inductive and in the transductive setting. This problem, often referred to as PU learning, differs from the standard supervised classification problem by the lack of negative examples in the training set. It corresponds to a ubiquitous situation in many applications such as information retrieval or gene ranking, when we have identified a set of data of interest sharing a particular property, and we wish to automatically retrieve additional data sharing the same property among a large and easily available pool of unlabeled data. We propose a conceptually simple method, akin to bagging, to approach both inductive and transductive PU learning problems, by converting them into series of supervised binary classification problems discriminating the known positive examples from random subsamples of the unlabeled set. We empirically demonstrate the relevance of the method on simulated and real data, where it performs at least as well as existing methods while being faster.
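
Note on [Mordelet2010bagging] (illustrative sketch, not the authors' implementation): the abstract describes turning PU learning into a series of binary problems that discriminate the known positives from random subsamples of the unlabeled set. A minimal transductive version of that idea is sketched below. To stay dependency-free it swaps the SVM for a simple difference-of-centroids linear scorer. The function names, the out-of-bag averaging rule, the subsample size, and the simulated data are all assumptions made for illustration.

```python
import random


def centroid(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def train_centroid_scorer(pos, neg):
    """Stand-in base learner (NOT an SVM): linear score along the line
    joining the two class centroids, zero at their midpoint."""
    mp, mn = centroid(pos), centroid(neg)
    w = [a - b for a, b in zip(mp, mn)]
    mid = [(a + b) / 2 for a, b in zip(mp, mn)]
    return lambda x: sum(wi * (xi - mi) for wi, xi, mi in zip(w, x, mid))


def bagging_pu_scores(positives, unlabeled, n_rounds=50, bag_size=20, seed=0):
    """Average out-of-bag score of each unlabeled point over n_rounds rounds.

    Each round treats a random subsample of the unlabeled set as negatives,
    trains positives-vs-subsample, and scores the unlabeled points that
    were left out of the subsample.
    """
    rng = random.Random(seed)
    totals = [0.0] * len(unlabeled)
    counts = [0] * len(unlabeled)
    for _ in range(n_rounds):
        bag = set(rng.sample(range(len(unlabeled)), bag_size))
        scorer = train_centroid_scorer(positives, [unlabeled[i] for i in bag])
        for i, x in enumerate(unlabeled):
            if i not in bag:
                totals[i] += scorer(x)
                counts[i] += 1
    # A point that was never out of bag gets a neutral score of 0.
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


# Simulated data: positives near (2, 2), negatives near (-2, -2).
data_rng = random.Random(1)


def draw(center, n):
    return [[data_rng.gauss(center, 1.0), data_rng.gauss(center, 1.0)]
            for _ in range(n)]


positives = draw(2.0, 20)       # labeled positives
hidden_pos = draw(2.0, 30)      # positives hidden in the unlabeled pool
negatives = draw(-2.0, 70)
unlabeled = hidden_pos + negatives

scores = bagging_pu_scores(positives, unlabeled)
mean_hidden = sum(scores[:30]) / 30
mean_neg = sum(scores[30:]) / 70
print("mean score, hidden positives:", mean_hidden)
print("mean score, true negatives:  ", mean_neg)
```

Ranking the unlabeled points by their averaged score is then one way to retrieve likely positives; any base classifier that outputs a real-valued score could replace the centroid scorer.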

[Yaffe2011Probabilistic] E. Yaffe and A. Tanay. Probabilistic modeling of Hi-C contact maps eliminates systematic biases to characterize global chromosomal architecture. Nat. Genet., 43(11):1059-1065, 2011. [ bib | DOI | http | .pdf ]
Hi-C experiments measure the probability of physical proximity between pairs of chromosomal loci on a genomic scale. We report on several systematic biases that substantially affect the Hi-C experimental procedure, including the distance between restriction sites, the GC content of trimmed ligation junctions and sequence uniqueness. To address these biases, we introduce an integrated probabilistic background model and develop algorithms to estimate its parameters and renormalize Hi-C data. Analysis of corrected human lymphoblast contact maps provides genome-wide evidence for interchromosomal aggregation of active chromatin marks, including DNase-hypersensitive sites and transcriptionally active foci. We observe extensive long-range (up to 400 kb) cis interactions at active promoters and derive asymmetric contact profiles next to transcription start sites and CTCF binding sites. Clusters of interacting chromosomal domains suggest physical separation of centromere-proximal and centromere-distal regions. These results provide a computational basis for the inference of chromosomal architectures from Hi-C experiments.

Keywords: hic, ngs
[Xia2011NSMAP] Z. Xia, J. Wen, C.-C. Chang, and X. Zhou. NSMAP: a method for spliced isoforms identification and quantification from RNA-Seq. BMC Bioinformatics, 12:162, 2011. [ bib | DOI | http | .pdf ]
The development of techniques for sequencing the messenger RNA (RNA-Seq) makes it possible to study biological mechanisms such as alternative splicing and gene expression regulation more deeply and accurately. Most existing methods employ RNA-Seq to quantify the expression levels of already annotated isoforms from the reference genome. However, the current reference genome is very incomplete due to the complexity of the transcriptome, which hinders the comprehensive investigation of the transcriptome using RNA-Seq. Novel study on isoform inference and estimation purely from RNA-Seq without annotation information is desirable. A Nonnegativity and Sparsity constrained Maximum APosteriori (NSMAP) model has been proposed to estimate the expression levels of isoforms from RNA-Seq data without the annotation information. In contrast to previous methods, NSMAP performs identification of the structures of expressed isoforms and estimation of the expression levels of those expressed isoforms simultaneously, which enables better identification of isoforms. In the simulations parameterized by two real RNA-Seq data sets, more than 77% of expressed isoforms are correctly identified and quantified. Then, we apply NSMAP on two RNA-Seq data sets of myelodysplastic syndromes (MDS) samples and one normal sample in order to identify differentially expressed known and novel isoforms in MDS disease. NSMAP provides a good strategy to identify and quantify novel isoforms without the knowledge of annotated reference genome, which can further realize the potential of the RNA-Seq technique in transcriptome analysis. NSMAP package is freely available at https://sites.google.com/site/nsmapforrnaseq.

[Witten2011New] D. M. Witten, J. H. Friedman, and N. Simon. New insights and faster computations for the graphical lasso. J. Comput. Graph. Stat., 20(4):892-900, 2011. [ bib | DOI | http | .pdf ]
We consider the graphical lasso formulation for estimating a Gaussian graphical model in the high-dimensional setting. This approach entails estimating the inverse covariance matrix under a multivariate normal model by maximizing the ℓ1-penalized log-likelihood. We present a very simple necessary and sufficient condition that can be used to identify the connected components in the graphical lasso solution. The condition can be employed to determine whether the estimated inverse covariance matrix will be block diagonal, and if so, then to identify the blocks. This in turn can lead to drastic speed improvements, since one can simply apply a standard graphical lasso algorithm to each block separately. Moreover, the necessary and sufficient condition provides insight into the graphical lasso solution: the set of connected nodes at any given tuning parameter value is a superset of the set of connected nodes at any larger tuning parameter value. This article has supplementary material online.
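
Note on [Witten2011New] (illustrative sketch, not the authors' code): the abstract does not spell out the condition, but it is commonly stated as follows. Two variables lie in the same connected component of the graphical lasso solution at tuning parameter λ exactly when they are connected in the graph that has an edge (i, j) whenever |S_ij| > λ, where S is the sample covariance matrix. Under that reading, the blocks can be found with a union-find pass over the off-diagonal entries. The function name `glasso_blocks` and the toy matrix are hypothetical.

```python
def glasso_blocks(S, lam):
    """Connected components of the graph with an edge i-j iff |S[i][j]| > lam.

    S is a symmetric covariance matrix given as a list of lists.
    Returns a sorted list of blocks, each a sorted list of variable indices.
    """
    p = len(S)
    parent = list(range(p))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(p):
        for j in range(i + 1, p):
            if abs(S[i][j]) > lam:
                parent[find(i)] = find(j)

    blocks = {}
    for i in range(p):
        blocks.setdefault(find(i), []).append(i)
    return sorted(blocks.values())


# Toy covariance: variables {0, 1} and {2, 3} are strongly coupled within
# each pair and only weakly coupled across pairs.
S = [
    [1.00, 0.80, 0.05, 0.00],
    [0.80, 1.00, 0.10, 0.02],
    [0.05, 0.10, 1.00, 0.60],
    [0.00, 0.02, 0.60, 1.00],
]
for lam in (0.07, 0.3, 0.9):
    print(lam, glasso_blocks(S, lam))
```

A standard graphical lasso solver can then be run on each block separately. Increasing λ can only remove edges from the thresholded graph, so the blocks at a larger λ refine those at a smaller λ, which matches the nesting property mentioned in the abstract.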

[Venet2011Most] D. Venet, J.E. Dumont, and V. Detours. Most random gene expression signatures are significantly associated with breast cancer outcome. PLoS computational biology, 7(10):e1002240, 2011. [ bib ]
[Tubio2011Cancer:] Jose M C Tubio and Xavier Estivill. Cancer: When catastrophe strikes a cell. Nature, 470(7335):476-477, Feb 2011. [ bib | DOI | http ]
Keywords: Apoptosis; Bone Neoplasms, genetics/pathology; Cell Survival; Cell Transformation, Neoplastic, genetics; Chromosomes, Human, genetics/metabolism; DNA Breaks; DNA Copy Number Variations, genetics; DNA Repair; Disease Progression; Genes, Neoplasm, genetics; Humans; Leukemia, genetics; Mutagenesis, genetics; Mutation, genetics; Neoplasms, genetics/pathology; Recombination, Genetic, genetics
[Tranchevent2010guide] L.-C. Tranchevent, F. B. Capdevila, D. Nitsch, B. De Moor, P. De Causmaecker, and Y. Moreau. A guide to web tools to prioritize candidate genes. Brief. Bioinform., 12(12):22-32, 2011. [ bib | DOI | arXiv | http ]
Finding the most promising genes among large lists of candidate genes has been defined as the gene prioritization problem. It is a recurrent problem in genetics in which genetic conditions are reported to be associated with chromosomal regions. In the last decade, several different computational approaches have been developed to tackle this challenging task. In this study, we review 19 computational solutions for human gene prioritization that are freely accessible as web tools and illustrate their differences. We summarize the various biological problems to which they have been successfully applied. Ultimately, we describe several research directions that could increase the quality and applicability of the tools. In addition, we developed a website (http://www.esat.kuleuven.be/gpp) containing detailed information about these and other tools, which is regularly updated. This review and the associated website together constitute a guide to help users select a gene prioritization strategy that best suits their needs.

[Tomioka2011Super] R. Tomioka, T. Suzuki, and M. Sugiyama. Super-linear convergence of dual augmented-Lagrangian algorithm for sparsity regularized estimation. J. Mach. Learn. Res., 12:1537-1586, 2011. [ bib | .pdf ]
[Thompson2011properties] John F Thompson and Patrice M Milos. The properties and applications of single-molecule DNA sequencing. Genome Biol, 12(2):217, Feb 2011. [ bib | DOI | http ]
Single-molecule sequencing enables DNA or RNA to be sequenced directly from biological samples, making it well-suited for diagnostic and clinical applications. Here we review the properties and applications of this rapidly evolving and promising technology.

[Stephens2011Massive] Philip J Stephens, Chris D Greenman, Beiyuan Fu, Fengtang Yang, Graham R Bignell, Laura J Mudie, Erin D Pleasance, King Wai Lau, David Beare, Lucy A Stebbings, Stuart McLaren, Meng-Lay Lin, David J McBride, Ignacio Varela, Serena Nik-Zainal, Catherine Leroy, Mingming Jia, Andrew Menzies, Adam P Butler, Jon W Teague, Michael A Quail, John Burton, Harold Swerdlow, Nigel P Carter, Laura A Morsberger, Christine Iacobuzio-Donahue, George A Follows, Anthony R Green, Adrienne M Flanagan, Michael R Stratton, P. Andrew Futreal, and Peter J Campbell. Massive genomic rearrangement acquired in a single catastrophic event during cancer development. Cell, 144(1):27-40, Jan 2011. [ bib | DOI | http ]
Cancer is driven by somatically acquired point mutations and chromosomal rearrangements, conventionally thought to accumulate gradually over time. Using next-generation sequencing, we characterize a phenomenon, which we term chromothripsis, whereby tens to hundreds of genomic rearrangements occur in a one-off cellular crisis. Rearrangements involving one or a few chromosomes crisscross back and forth across involved regions, generating frequent oscillations between two copy number states. These genomic hallmarks are highly improbable if rearrangements accumulate over time and instead imply that nearly all occur during a single cellular catastrophe. The stamp of chromothripsis can be seen in at least 2%-3% of all cancers, across many subtypes, and is present in ∼25% of bone cancers. We find that one, or indeed more than one, cancer-causing lesion can emerge out of the genomic crisis. This phenomenon has important implications for the origins of genomic remodeling and temporal emergence of cancer.

Keywords: Bone Neoplasms, genetics; Cell Line, Tumor; Chromosome Aberrations; Chromosome Painting; Female; Gene Rearrangement; Humans; Leukemia, Lymphocytic, Chronic, B-Cell, genetics; Middle Aged; Neoplasms, genetics/pathology
[Smoot2011Cytoscape] M.E. Smoot, K. Ono, J. Ruscheinski, P.L. Wang, and T. Ideker. Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics, 27(3):431-432, 2011. [ bib ]
[Shin2011Partitionable] K. Shin. Partitionable kernels for mapping kernels. In D. J. Cook, J. Pei, W. Wang, O. R. Zaïane, and X. Wu, editors, Proceedings of the 11th IEEE International Conference on Data Mining, ICDM 2011, Vancouver, BC, Canada, December 11-14, 2011., pages 645-654, 2011. [ bib | DOI | http | .pdf ]
[Schaffter2011GeneNetWeaver] T. Schaffter, D. Marbach, and D. Floreano. GeneNetWeaver: in silico benchmark generation and performance profiling of network inference methods. Bioinformatics, 27(16):2263-2270, 2011. [ bib | DOI | arXiv | http ]
Motivation: Over the last decade, numerous methods have been developed for inference of regulatory networks from gene expression data. However, accurate and systematic evaluation of these methods is hampered by the difficulty of constructing adequate benchmarks and the lack of tools for a differentiated analysis of network predictions on such benchmarks. Results: Here, we describe a novel and comprehensive method for in silico benchmark generation and performance profiling of network inference methods available to the community as an open-source software called GeneNetWeaver (GNW). In addition to the generation of detailed dynamical models of gene regulatory networks to be used as benchmarks, GNW provides a network motif analysis that reveals systematic prediction errors, thereby indicating potential ways of improving inference methods. The accuracy of network inference methods is evaluated using standard metrics such as precision-recall and receiver operating characteristic curves. We show how GNW can be used to assess the performance and identify the strengths and weaknesses of six inference methods. Furthermore, we used GNW to provide the international Dialogue for Reverse Engineering Assessments and Methods (DREAM) competition with three network inference challenges (DREAM3, DREAM4 and DREAM5). Availability: GNW is available at http://gnw.sourceforge.net along with its Java source code, user manual and supporting data. Supplementary information: Supplementary data are available at Bioinformatics online. Contact: dario.floreano@epfl.ch

[Rodriguez-Paredes2011Cancer] Manuel Rodríguez-Paredes and Manel Esteller. Cancer epigenetics reaches mainstream oncology. Nat Med, 17(3):330-339, Mar 2011. [ bib | DOI | http ]
Epigenetics is one of the most promising and expanding fields in the current biomedical research landscape. Since the inception of epigenetics in the 1940s, the discoveries regarding its implications in normal and disease biology have not stopped, compiling a vast amount of knowledge in the past decade. The field has moved from just one recognized marker, DNA methylation, to a variety of others, including a wide spectrum of histone modifications. From the methodological standpoint, the successful initial single gene candidate approaches have been complemented by the current comprehensive epigenomic approaches that allow the interrogation of genomes to search for translational applications in an unbiased manner. Most important, the discovery of mutations in the epigenetic machinery and the approval of the first epigenetic drugs for the treatment of subtypes of leukemias and lymphomas has been an eye-opener for many biomedical scientists and clinicians. Herein, we will summarize the progress in the field of cancer epigenetics research that has reached mainstream oncology in the development of new biomarkers of the disease and new pharmacological strategies.

Keywords: Amino Acid Sequence; DNA Methylation; Epigenesis, Genetic; Humans; Molecular Sequence Data; Neoplasms, genetics/therapy; Tumor Markers, Biological
[Roberts2011Improving] A. Roberts, C. Trapnell, J. Donaghey, J. L. Rinn, and L. Pachter. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol, 12(3):R22, 2011. [ bib | DOI | http | .pdf ]
The biochemistry of RNA-Seq library preparation results in cDNA fragments that are not uniformly distributed within the transcripts they represent. This non-uniformity must be accounted for when estimating expression levels, and we show how to perform the needed corrections using a likelihood based approach. We find improvements in expression estimates as measured by correlation with independently performed qRT-PCR and show that correction of bias leads to improved replicability of results across libraries and sequencing technologies.

Keywords: ngs, rnaseq
[Roberts2011Identification] A. Roberts, H. Pimentel, C. Trapnell, and L. Pachter. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics, 27(17):2325-2329, Sep 2011. [ bib | DOI | http | .pdf ]
We describe a new 'reference annotation based transcript assembly' problem for RNA-Seq data that involves assembling novel transcripts in the context of an existing annotation. This problem arises in the analysis of expression in model organisms, where it is desirable to leverage existing annotations for discovering novel transcripts. We present an algorithm for reference annotation-based transcript assembly and show how it can be used to rapidly investigate novel transcripts revealed by RNA-Seq in comparison with a reference annotation. The methods described in this article are implemented in the Cufflinks suite of software for RNA-Seq, freely available from http://bio.math.berkeley.edu/cufflinks. The software is released under the BOOST license. Contact: cole@broadinstitute.org; lpachter@math.berkeley.edu. Supplementary data are available at Bioinformatics online.

Keywords: ngs, rnaseq
[Ravikumar2011High-dimensional] P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electron. J. Statist., 5:935-980, 2011. [ bib | DOI | http | .pdf ]
[Pachter2011Models] L. Pachter. Models for transcript quantification from RNA-seq. Technical Report 1104-3889, arXiv, 2011. [ bib | .pdf ]
[Obozinski2011Support] G. Obozinski, M. J. Wainwright, and M. I. Jordan. Support union recovery in high-dimensional multivariate regression. Ann. Stat., 39(1):1-47, 2011. [ bib | .pdf ]
[Obozinski2011Group] G. Obozinski, L. Jacob, and J.-P. Vert. Group lasso with overlaps: the latent group lasso approach. Technical report, arXiv, 2011. [ bib ]
[Nielsen2011Genotype] R. Nielsen, J. S. Paul, A. Albrechtsen, and Y. S. Song. Genotype and SNP calling from next-generation sequencing data. Nat. Rev. Genet., 12(6):443-451, Jun 2011. [ bib | DOI | http | .pdf ]
Meaningful analysis of next-generation sequencing (NGS) data, which are produced extensively by genetics and genomics studies, relies crucially on the accurate calling of SNPs and genotypes. Recently developed statistical methods both improve and quantify the considerable uncertainty associated with genotype calling, and will especially benefit the growing number of studies using low- to medium-coverage data. We review these methods and provide a guide for their use in NGS studies.

Keywords: ngs
[Mordelet2011ProDiGe] Fantine Mordelet and Jean-Philippe Vert. ProDiGe: Prioritization of disease genes with multitask machine learning from positive and unlabeled examples. BMC Bioinformatics, 12:389, 2011. [ bib | DOI | http | .pdf ]
Elucidating the genetic basis of human diseases is a central goal of genetics and molecular biology. While traditional linkage analysis and modern high-throughput techniques often provide long lists of tens or hundreds of disease gene candidates, the identification of disease genes among the candidates remains time-consuming and expensive. Efficient computational methods are therefore needed to prioritize genes within the list of candidates, by exploiting the wealth of information available about the genes in various databases. We propose ProDiGe, a novel algorithm for Prioritization of Disease Genes. ProDiGe implements a novel machine learning strategy based on learning from positive and unlabeled examples, which allows to integrate various sources of information about the genes, to share information about known disease genes across diseases, and to perform genome-wide searches for new disease genes. Experiments on real data show that ProDiGe outperforms state-of-the-art methods for the prioritization of genes in human diseases. ProDiGe implements a new machine learning paradigm for gene prioritization, which could help the identification of new disease genes. It is freely available at http://cbio.ensmp.fr/prodige.

[Li2011IsoLasso] W. Li, J. Feng, and T. Jiang. IsoLasso: a LASSO regression approach to RNA-Seq based transcriptome assembly. J Comput Biol, 18(11):1693-1707, Nov 2011. [ bib | DOI | http | .pdf ]
The new second generation sequencing technology revolutionizes many biology-related research fields and poses various computational biology challenges. One of them is transcriptome assembly based on RNA-Seq data, which aims at reconstructing all full-length mRNA transcripts simultaneously from millions of short reads. In this article, we consider three objectives in transcriptome assembly: the maximization of prediction accuracy, minimization of interpretation, and maximization of completeness. The first objective, the maximization of prediction accuracy, requires that the estimated expression levels based on assembled transcripts should be as close as possible to the observed ones for every expressed region of the genome. The minimization of interpretation follows the parsimony principle to seek as few transcripts in the prediction as possible. The third objective, the maximization of completeness, requires that the maximum number of mapped reads (or "expressed segments" in gene models) be explained by (i.e., contained in) the predicted transcripts in the solution. Based on the above three objectives, we present IsoLasso, a new RNA-Seq based transcriptome assembly tool. IsoLasso is based on the well-known LASSO algorithm, a multivariate regression method designated to seek a balance between the maximization of prediction accuracy and the minimization of interpretation. By including some additional constraints in the quadratic program involved in LASSO, IsoLasso is able to make the set of assembled transcripts as complete as possible. Experiments on simulated and real RNA-Seq datasets show that IsoLasso achieves, simultaneously, higher sensitivity and precision than the state-of-art transcript assembly tools.

Keywords: ngs, rnaseq
[Jenatton2011Proximal] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. J. Mach. Learn. Res., 12(Jul):2297-2334, 2011. [ bib | .html ]
[Haury2011influence] A.-C. Haury, P. Gestraud, and J.-P. Vert. The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures. PLoS One, 6(12):e28210, 2011. [ bib | DOI | http | .pdf ]
Biomarker discovery from high-dimensional data is a crucial problem with enormous applications in biology and medicine. It is also extremely challenging from a statistical viewpoint, but surprisingly few studies have investigated the relative strengths and weaknesses of the plethora of existing feature selection methods. In this study we compare 32 feature selection methods on 4 public gene expression datasets for breast cancer prognosis, in terms of predictive performance, stability and functional interpretability of the signatures they produce. We observe that the feature selection method has a significant influence on the accuracy, stability and interpretability of signatures. Surprisingly, complex wrapper and embedded methods generally do not outperform simple univariate feature selection methods, and ensemble feature selection has generally no positive effect. Overall a simple Student's t-test seems to provide the best results.

[Hanahan2011Hallmarks] D. Hanahan and R. A. Weinberg. Hallmarks of cancer: the next generation. Cell, 144(5):646-674, Mar 2011. [ bib | DOI | http | .pdf ]
The hallmarks of cancer comprise six biological capabilities acquired during the multistep development of human tumors. The hallmarks constitute an organizing principle for rationalizing the complexities of neoplastic disease. They include sustaining proliferative signaling, evading growth suppressors, resisting cell death, enabling replicative immortality, inducing angiogenesis, and activating invasion and metastasis. Underlying these hallmarks are genome instability, which generates the genetic diversity that expedites their acquisition, and inflammation, which fosters multiple hallmark functions. Conceptual progress in the last decade has added two emerging hallmarks of potential generality to this list: reprogramming of energy metabolism and evading immune destruction. In addition to cancer cells, tumors exhibit another dimension of complexity: they contain a repertoire of recruited, ostensibly normal cells that contribute to the acquisition of hallmark traits by creating the "tumor microenvironment." Recognition of the widespread applicability of these concepts will increasingly affect the development of new means to treat human cancer.

[Haibe-Kains2011genefu] B. Haibe-Kains, M. Schroeder, G. Bontempi, C. Sotiriou, and J. Quackenbush. genefu: Relevant Functions for Gene Expression Analysis, Especially in Breast Cancer, 2011. R package version 1.4.0. [ bib | http ]
[Guedj2011refined] M. Guedj, L. Marisa, A. de Reynies, B. Orsetti, R. Schiappa, F. Bibeau, G. Macgrogan, F. Lerebours, P. Finetti, M. Longy, P. Bertheau, F. Bertrand, F. Bonnet, A. L. Martin, J. P. Feugeas, I. Bièche, J. Lehmann-Che, R. Lidereau, D. Birnbaum, F. Bertucci, H. de Thé, and C. Theillet. A refined molecular taxonomy of breast cancer. Oncogene, Jul 2011. [ bib | DOI | http | .pdf ]
The current histoclinical breast cancer classification is simple but imprecise. Several molecular classifications of breast cancers based on expression profiling have been proposed as alternatives. However, their reliability and clinical utility have been repeatedly questioned, notably because most of them were derived from relatively small initial patient populations. We analyzed the transcriptomes of 537 breast tumors using three unsupervised classification methods. A core subset of 355 tumors was assigned to six clusters by all three methods. These six subgroups overlapped with previously defined molecular classes of breast cancer, but also showed important differences, notably the absence of an ERBB2 subgroup and the division of the large luminal ER+ group into four subgroups, two of them being highly proliferative. Of the six subgroups, four were ER+/PR+/AR+, one was ER-/PR-/AR+ and one was triple negative (AR-/ER-/PR-). ERBB2-amplified tumors were split between the ER-/PR-/AR+ subgroup and the highly proliferative ER+ LumC subgroup. Importantly, each of these six molecular subgroups showed specific copy-number alterations. Gene expression changes were correlated to specific signaling pathways. Each of these six subgroups showed very significant differences in tumor grade, metastatic sites, relapse-free survival or response to chemotherapy. All these findings were validated on large external datasets including more than 3000 tumors. Our data thus indicate that these six molecular subgroups represent well-defined clinico-biological entities of breast cancer. Their identification should facilitate the detection of novel prognostic factors or therapeutical targets in breast cancer. Oncogene advance online publication, 25 July 2011; doi:10.1038/onc.2011.301.

[Geurts2011Learning] Pierre Geurts. Learning from positive and unlabeled examples by enforcing statistical significance. Journal of Machine Learning Research - Proceedings Track, 15:305-314, 2011. [ bib | http ]
Keywords: dblp
[Gedela2011Integration] Srinubabu Gedela. Integration, warehousing, and analysis strategies of omics data. Methods Mol Biol, 719:399-414, 2011. [ bib | DOI | http ]
"-Omics" is a current suffix for numerous types of large-scale biological data generation procedures, which naturally demand the development of novel algorithms for data storage and analysis. With next generation genome sequencing burgeoning, it is pivotal to decipher a coding site on the genome, a gene's function, and information on transcripts next to the pure availability of sequence information. To explore a genome and downstream molecular processes, we need umpteen results at the various levels of cellular organization by utilizing different experimental designs, data analysis strategies and methodologies. Here comes the need for controlled vocabularies and data integration to annotate, store, and update the flow of experimental data. This chapter explores key methodologies to merge Omics data by semantic data carriers, discusses controlled vocabularies as eXtensible Markup Languages (XML), and provides practical guidance, databases, and software links supporting the integration of Omics data.

[Gama-Castro2011RegulonDB] S. Gama-Castro, H. Salgado, M. Peralta-Gil, A. Santos-Zavaleta, L. Muñiz-Rascado, H. Solano-Lira, V. Jimenez-Jacinto, Verena Weiss, J. S. García-Sotelo, A. López-Fuentes, L. Porrón-Sotelo, S. Alquicira-Hernández, A. Medina-Rivera, I. Martínez-Flores, K. Alquicira-Hernández, R. Martínez-Adame, C. Bonavides-Martínez, J. Miranda-Ríos, A. M. Huerta, A. Mendoza-Vargas, L. Collado-Torres, B. Taboada, L. Vega-Alvarado, M. Olvera, L. Olvera, R. Grande, E. Morett, and J. Collado-Vides. RegulonDB version 7.0: transcriptional regulation of Escherichia coli K-12 integrated within genetic sensory response units (gensor units). Nucleic Acids Res., 39(suppl 1):D98-D105, 2011. [ bib | DOI | arXiv | http ]
RegulonDB (http://regulondb.ccg.unam.mx/) is the primary reference database of the best-known regulatory network of any free-living organism, that of Escherichia coli K-12. The major conceptual change since 3 years ago is an expanded biological context so that transcriptional regulation is now part of a unit that initiates with the signal and continues with the signal transduction to the core of regulation, modifying expression of the affected target genes responsible for the response. We call these genetic sensory response units, or Gensor Units. We have initiated their high-level curation, with graphic maps and superreactions with links to other databases. Additional connectivity uses expandable submaps. RegulonDB has summaries for every transcription factor (TF) and TF-binding sites with internal symmetry. Several DNA-binding motifs and their sizes have been redefined and relocated. In addition to data from the literature, we have incorporated our own information on transcription start sites (TSSs) and transcriptional units (TUs), obtained by using high-throughput whole-genome sequencing technologies. A new portable drawing tool for genomic features is also now available, as well as new ways to download the data, including web services, files for several relational database manager systems and text files including BioPAX format.

[Drier2011two] Y. Drier and E. Domany. Do two machine-learning based prognostic signatures for breast cancer capture the same biological processes? PLoS One, 6(3):e17795, 2011. [ bib ]
[Dimitriadou2011e1071] E. Dimitriadou, K. Hornik, F. Leisch, D. Meyer, and A. Weingessel. e1071: Misc Functions of the Department of Statistics (e1071), TU Wien, 2011. R package version 1.6. [ bib | http ]
[Chowdhury2011Subnetwork] S. A. Chowdhury, R. K. Nibbe, M. R. Chance, and M. Koyutürk. Subnetwork state functions define dysregulated subnetworks in cancer. J. Comput. Biol., 18(3):263-281, Mar 2011. [ bib | DOI | http | .pdf ]
Emerging research demonstrates the potential of protein-protein interaction (PPI) networks in uncovering the mechanistic bases of cancers, through identification of interacting proteins that are coordinately dysregulated in tumorigenic and metastatic samples. When used as features for classification, such coordinately dysregulated subnetworks improve diagnosis and prognosis of cancer considerably over single-gene markers. However, existing methods formulate coordination between multiple genes through additive representation of their expression profiles and utilize fast heuristics to identify dysregulated subnetworks, which may not be well suited to the potentially combinatorial nature of coordinate dysregulation. Here, we propose a combinatorial formulation of coordinate dysregulation and decompose the resulting objective function to cast the problem as one of identifying subnetwork state functions that are indicative of phenotype. Based on this formulation, we show that coordinate dysregulation of larger subnetworks can be bounded using simple statistics on smaller subnetworks. We then use these bounds to devise an efficient algorithm, Crane, that can search the subnetwork space more effectively than existing algorithms. Comprehensive cross-classification experiments show that subnetworks identified by Crane outperform those identified by additive algorithms in predicting metastasis of colorectal cancer (CRC).

[Chen2011Removing] Chao Chen, Kay Grennan, Judith Badner, Dandan Zhang, Elliot Gershon, Li Jin, and Chunyu Liu. Removing batch effects in analysis of expression microarray data: an evaluation of six batch adjustment methods. PLoS One, 6(2):e17238, 2011. [ bib | DOI | http ]
The expression microarray is a frequently used approach to study gene expression on a genome-wide scale. However, the data produced by the thousands of microarray studies published annually are confounded by "batch effects," the systematic error introduced when samples are processed in multiple batches. Although batch effects can be reduced by careful experimental design, they cannot be eliminated unless the whole study is done in a single batch. A number of programs are now available to adjust microarray data for batch effects prior to analysis. We systematically evaluated six of these programs using multiple measures of precision, accuracy and overall performance. ComBat, an Empirical Bayes method, outperformed the other five programs by most metrics. We also showed that it is essential to standardize expression data at the probe level when testing for correlation of expression profiles, due to a sizeable probe effect in microarray data that can inflate the correlation among replicates and unrelated samples.

Keywords: Bayes Theorem; Case-Control Studies; Data Interpretation, Statistical; Gene Expression Profiling, standards/statistics & numerical data; Humans; Microarray Analysis, standards/statistics & numerical data; ROC Curve; Reference Standards; Research Design; Sample Size; Selection Bias; Validation Studies as Topic
[Candes2011Robust] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? J. ACM, 58(3):11:1-11:37, jun 2011. [ bib | DOI | http | .pdf ]
[Brancotte2011Gene] B. Brancotte, A. Biton, I. Bernard-Pierrot, F. Radvanyi, F. Reyal, and S. Cohen-Boulakia. Gene list significance at-a-glance with GeneValorization. Bioinformatics, 27(8):1187-1189, Apr 2011. [ bib | DOI | http ]
High-throughput technologies provide fundamental information concerning thousands of genes. Many of the current research laboratories daily use one or more of these technologies and end up with lists of genes. Assessing the originality of the results obtained includes being aware of the number of publications available concerning individual or multiple genes and accessing information about these publications. Faced with the exponential growth of publications available and number of genes involved in a study, this task is becoming particularly difficult to achieve. We introduce GeneValorization, a web-based tool that gives a clear and handful overview of the bibliography available corresponding to the user input formed by (i) a gene list (expressed by gene names or ids from EntrezGene) and (ii) a context of study (expressed by keywords). From this input, GeneValorization provides a matrix containing the number of publications with co-occurrences of gene names and keywords. Graphics are automatically generated to assess the relative importance of genes within various contexts. Links to publications and other databases offering information on genes and keywords are also available. To illustrate how helpful GeneValorization is, we will consider the gene list of the OncotypeDX prognostic marker test. Availability: http://bioguide-project.net/gv. Contact: cohen@lri.fr. Supplementary data are available at Bioinformatics online.

[Boyd2011Distributed] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011. [ bib | DOI | http | .pdf ]
[Boeva2011Control-free] V. Boeva, A. Zinovyev, K. Bleakley, J.-P. Vert, I. Janoueix-Lerosey, O. Delattre, and E. Barillot. Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization. Bioinformatics, 27(2):268-269, Jan 2011. [ bib | DOI | http | .pdf ]
We present a tool for control-free copy number alteration (CNA) detection using deep-sequencing data, particularly useful for cancer studies. The tool deals with two frequent problems in the analysis of cancer deep-sequencing data: absence of control sample and possible polyploidy of cancer cells. FREEC (control-FREE Copy number caller) automatically normalizes and segments copy number profiles (CNPs) and calls CNAs. If ploidy is known, FREEC assigns absolute copy number to each predicted CNA. To normalize raw CNPs, the user can provide a control dataset if available; otherwise GC content is used. We demonstrate that for Illumina single-end, mate-pair or paired-end sequencing, GC-content normalization provides smooth profiles that can be further segmented and analyzed in order to predict CNAs. Source code and sample data are available at http://bioinfo-out.curie.fr/projects/freec/. Contact: freec@curie.fr. Supplementary data are available at Bioinformatics online.

Keywords: ngs
[Behr2011Simultaneous] J. Behr, R. Bohnert, A. Kahles, and G. Rätsch. Simultaneous RNA-seq-based transcript inference and quantification using mixed integer programming. NIPS Machine Learning in Computational Biology Workshop, Sierra Nevada, 2011. [ bib ]
[Barrett2011NCBI] T. Barrett, D.B. Troup, S.E. Wilhite, P. Ledoux, C. Evangelista, I.F. Kim, M. Tomashevsky, K.A. Marshall, K.H. Phillippy, P.M. Sherman, et al. NCBI GEO: archive for functional genomics data sets - 10 years on. Nucleic Acids Research, 39(suppl 1):D1005-D1010, 2011. [ bib ]
[Bach2011Structured] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Structured sparsity through convex optimization. arXiv preprint arXiv:1109.2397, 2011. [ bib ]
[Bach2011Optimization] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1-106, 2011. [ bib | DOI | http | .pdf ]
[Niedringhaus2011Landscape] Thomas P Niedringhaus, Denitsa Milanova, Matthew B Kerby, Michael P Snyder, and Annelise E Barron. Landscape of next-generation sequencing technologies. Anal Chem, May 2011. [ bib | DOI | http ]
[Fang2011Design] Zhide Fang and Xiangqin Cui. Design and validation issues in RNA-seq experiments. Brief Bioinform, 12(3):280-287, May 2011. [ bib | DOI | http ]
The next-generation sequencing technologies are being rapidly applied in biological research. Tens of millions of short sequences generated in a single experiment provide us enormous information on genome composition, genetic variants, gene expression levels and protein binding sites depending on the applications. Various methods are being developed for analyzing the data generated by these technologies. However, the relevant experimental design issues have rarely been discussed. In this review, we use RNA-seq as an example to bring this topic into focus and to discuss experimental design and validation issues pertaining to next-generation sequencing in the quantification of transcripts.

[Li2011Sparse] J. J. Li, C.-R. Jiang, J. B. Brown, H. Huang, and P. J. Bickel. Sparse linear modeling of next-generation mRNA sequencing (RNA-Seq) data for isoform discovery and abundance estimation. Proc. Natl. Acad. Sci. USA, 108(50):19867-19872, December 2011. [ bib | DOI | http | .pdf ]
Since the inception of next-generation mRNA sequencing (RNA-Seq) technology, various attempts have been made to utilize RNA-Seq data in assembling full-length mRNA isoforms de novo and estimating abundance of isoforms. However, for genes with more than a few exons, the problem tends to be challenging and often involves identifiability issues in statistical modeling. We have developed a statistical method called "sparse linear modeling of RNA-Seq data for isoform discovery and abundance estimation" (SLIDE) that takes exon boundaries and RNA-Seq data as input to discern the set of mRNA isoforms that are most likely to present in an RNA-Seq sample. SLIDE is based on a linear model with a design matrix that models the sampling probability of RNA-Seq reads from different mRNA isoforms. To tackle the model unidentifiability issue, SLIDE uses a modified Lasso procedure for parameter estimation. Compared with deterministic isoform assembly algorithms (e.g., Cufflinks), SLIDE considers the stochastic aspects of RNA-Seq reads in exons from different isoforms and thus has increased power in detecting more novel isoforms. Another advantage of SLIDE is its flexibility of incorporating other transcriptomic data such as RACE, CAGE, and EST into its model to further increase isoform discovery accuracy. SLIDE can also work downstream of other RNA-Seq assembly algorithms to integrate newly discovered genes and exons. Besides isoform discovery, SLIDE sequentially uses the same linear model to estimate the abundance of discovered isoforms. Simulation and real data studies show that SLIDE performs as well as or better than major competitors in both isoform discovery and abundance estimation. The SLIDE software package is available at https://sites.google.com/site/jingyijli/SLIDE.zip.

Keywords: ngs, rnaseq
[Zhang2012Spatial] Y. Zhang, R. A. McCord, Y.-J. Ho, B. R. Lajoie, D. G. Hildebrand, A. C. Simon, M. S. Becker, F. W. Alt, and J. Dekker. Spatial organization of the mouse genome and its role in recurrent chromosomal translocations. Cell, 148(5):908-921, 2012. [ bib | DOI | http | .pdf ]
The extent to which the three-dimensional organization of the genome contributes to chromosomal translocations is an important question in cancer genomics. We generated a high-resolution Hi-C spatial organization map of the G1-arrested mouse pro-B cell genome and used high-throughput genome-wide translocation sequencing to map translocations from target DNA double-strand breaks (DSBs) within it. RAG endonuclease-cleaved antigen-receptor loci are dominant translocation partners for target DSBs regardless of genomic position, reflecting high-frequency DSBs at these loci and their colocalization in a fraction of cells. To directly assess spatial proximity contributions, we normalized genomic DSBs via ionizing radiation. Under these conditions, translocations were highly enriched in cis along single chromosomes containing target DSBs and within other chromosomes and subchromosomal domains in a manner directly related to pre-existing spatial proximity. By combining two high-throughput genomic methods in a genetically tractable system, we provide a new lens for viewing cancer genomes.

Keywords: hic, ngs
[Vliet2012Integration] M.H. van Vliet, H.M. Horlings, M.J. van de Vijver, M.J.T. Reinders, and L.F.A. Wessels. Integration of clinical and gene expression data has a synergetic effect on predicting breast cancer outcome. PLoS ONE, 7(7):e40358, 2012. [ bib ]
[Trapnell2012Differential] C. Trapnell, A. Roberts, L. Goff, G. Pertea, D. Kim, D. R. Kelley, H. Pimentel, S. L. Salzberg, J. L. Rinn, and L. Pachter. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc, 7(3):562-578, Mar 2012. [ bib | DOI | http | .pdf ]
Recent advances in high-throughput cDNA sequencing (RNA-seq) can reveal new genes and splice variants and quantify expression genome-wide in a single assay. The volume and complexity of data from RNA-seq experiments necessitate scalable, fast and mathematically principled analysis software. TopHat and Cufflinks are free, open-source software tools for gene discovery and comprehensive expression analysis of high-throughput mRNA sequencing (RNA-seq) data. Together, they allow biologists to identify new genes and new splice variants of known ones, as well as compare gene and transcript expression under two or more conditions. This protocol describes in detail how to use TopHat and Cufflinks to perform such analyses. It also covers several accessory tools and utilities that aid in managing data, including CummeRbund, a tool for visualizing RNA-seq analysis results. Although the procedure assumes basic informatics skills, these tools assume little to no background with RNA-seq analysis and are meant for novices and experts alike. The protocol begins with raw sequencing reads and produces a transcriptome assembly, lists of differentially expressed and regulated genes and transcripts, and publication-quality visualizations of analysis results. The protocol's execution time depends on the volume of transcriptome sequencing data and available computing resources but takes less than 1 d of computer time for typical experiments and ∼1 h of hands-on time.

Keywords: ngs, rnaseq
[Takemoto2012Analysis] K. Takemoto, T. Tamura, Y. Cong, W.-K. Ching, J.-P. Vert, and T. Akutsu. Analysis of the impact degree distribution in metabolic networks using branching process approximation. Physica A, 391:379-387, 2012. [ bib | DOI | http | .pdf ]
[Staiger2012Critical] C. Staiger, S. Cadot, R. Kooter, M. Dittrich, T. Müller, G.W. Klau, and L.F.A. Wessels. A critical evaluation of network and pathway-based classifiers for outcome prediction in breast cancer. PLoS ONE, 7(4):e34796, 2012. [ bib ]
[RCoreTeam2012R] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2012. ISBN 3-900051-07-0. [ bib | http ]
[Mazumder2012Exact] R. Mazumder and T. Hastie. Exact covariance thresholding into connected components for large-scale graphical lasso. J. Mach. Learn. Res., 13:781-794, Mar 2012. [ bib | .pdf | .pdf ]
[Marbach2012Wisdom] D. Marbach, J.C. Costello, R. Küffner, N. Vega, R.J. Prill, D.M. Camacho, K.R. Allison, the DREAM5 Consortium, M. Kellis, J.J. Collins, and G. Stolovitzky. Wisdom of crowds for robust gene network inference. Nat. Methods, 9(8):796-804, 2012. [ bib | DOI | http | .pdf ]
[Mairal2012Path] J. Mairal and B. Yu. Path coding penalties for directed acyclic graphs. 2012. [ bib ]
[Lazar2012survey] C. Lazar, J. Taminau, S. Meganck, D. Steenhoff, A. Coletta, C. Molter, V. de Schaetzen, R. Duque, H. Bersini, and A. Nowé. A survey on filter techniques for feature selection in gene expression microarray analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 9(4):1106-1119, 2012. [ bib ]
[Kim2012Robust] J. S. Kim and C. D. Scott. Robust kernel density estimation. J. Mach. Learn. Res., 13:2529-2565, 2012. [ bib | .pdf | .pdf ]
[Karr2012whole] J. R. Karr, J. C. Sanghvi, D. N. Macklin, M. V. Gutschow, J. M. Jacobs, B. Bolival, N. Assad-Garcia, J. I. Glass, and M. W. Covert. A whole-cell computational model predicts phenotype from genotype. Cell, 150(2):389-401, Jul 2012. [ bib | DOI | http | .pdf ]
Understanding how complex phenotypes arise from individual molecules and their interactions is a primary challenge in biology that computational approaches are poised to tackle. We report a whole-cell computational model of the life cycle of the human pathogen Mycoplasma genitalium that includes all of its molecular components and their interactions. An integrative approach to modeling that combines diverse mathematics enabled the simultaneous inclusion of fundamentally different cellular processes and experimental measurements. Our whole-cell model accounts for all annotated gene functions and was validated against a broad range of data. The model provides insights into many previously unobserved cellular behaviors, including in vivo rates of protein-DNA association and an inverse relationship between the durations of DNA replication initiation and replication. In addition, experimental analysis directed by model predictions identified previously undetected kinetic parameters and biological functions. We conclude that comprehensive whole-cell models can be used to facilitate biological discovery.

Keywords: Phenotype
[Kueffner2012Inferring] R. Küffner, T. Petri, P. Tavakkolkhah, L. Windhager, and R. Zimmer. Inferring gene regulatory networks by ANOVA. Bioinformatics, 2012. [ bib | DOI | http ]
[Hornberger2012Clinical] J. Hornberger, M.D. Alvarado, C. Rebecca, H.R. Gutierrez, M.Y. Tiffany, and W.J. Gradishar. Clinical validity/utility, change in practice patterns, and economic implications of risk stratifiers to predict outcomes for early-stage breast cancer: A systematic review. Journal of the National Cancer Institute, 104(14):1068-1079, 2012. [ bib ]
[Haury2012TIGRESS] A.C. Haury, F. Mordelet, P. Vera-Licona, and J.P. Vert. TIGRESS: trustful inference of gene regulation using stability selection. arXiv preprint arXiv:1205.1181, 2012. [ bib ]
[Estrach2012Scattering] J. B. Estrach. Scattering representations for recognition. PhD thesis, Ecole Polytechnique, 2012. [ bib | .pdf ]
[Dixon2012Topological] J. R. Dixon, S. Selvaraj, F. Yue, A. Kim, Y. Li, Y. Shen, M. Hu, J. S. Liu, and B. Ren. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature, 485(5):376-380, 2012. [ bib | DOI | http | .pdf ]
Keywords: ngs, hic
[Curtis2012genomic] Christina Curtis, Sohrab P. Shah, Suet-Feung Chin, Gulisa Turashvili, Oscar M. Rueda, Mark J. Dunning, Doug Speed, Andy G. Lynch, Shamith Samarajiwa, Yinyin Yuan, Stefan Gräf, Gavin Ha, Gholamreza Haffari, Ali Bashashati, Roslin Russell, Steven McKinney, METABRIC Group, Anita Langerød, Andrew Green, Elena Provenzano, Gordon Wishart, Sarah Pinder, Peter Watson, Florian Markowetz, Leigh Murphy, Ian Ellis, Arnie Purushotham, Anne-Lise Børresen-Dale, James D. Brenton, Simon Tavaré, Carlos Caldas, and Samuel Aparicio. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature, 486(7403):346-352, Jun 2012. [ bib | DOI | http | .pdf ]
The elucidation of breast cancer subgroups and their molecular drivers requires integrated views of the genome and transcriptome from representative numbers of patients. We present an integrated analysis of copy number and gene expression in a discovery and validation set of 997 and 995 primary breast tumours, respectively, with long-term clinical follow-up. Inherited variants (copy number variants and single nucleotide polymorphisms) and acquired somatic copy number aberrations (CNAs) were associated with expression in ~40% of genes, with the landscape dominated by cis- and trans-acting CNAs. By delineating expression outlier genes driven in cis by CNAs, we identified putative cancer genes, including deletions in PPP2R2A, MTAP and MAP2K4. Unsupervised analysis of paired DNA–RNA profiles revealed novel subgroups with distinct clinical outcomes, which reproduced in the validation cohort. These include a high-risk, oestrogen-receptor-positive 11q13/14 cis-acting subgroup and a favourable prognosis subgroup devoid of CNAs. Trans-acting aberration hotspots were found to modulate subgroup-specific gene networks, including a TCR deletion-mediated adaptive immune response in the ‘CNA-devoid’ subgroup and a basal-specific chromosome 5 deletion-associated mitotic network. Our results provide a novel molecular stratification of the breast cancer population, derived from the impact of somatic CNAs on the transcriptome.

[Biau2012Analysis] G. Biau. Analysis of a random forests model. J. Mach. Learn. Res., 13:1063-1095, 2012. [ bib ]
[Barretina2012Cancer] Jordi Barretina, Giordano Caponigro, Nicolas Stransky, Kavitha Venkatesan, Adam A. Margolin, Sungjoon Kim, Christopher J. Wilson, Joseph Lehár, Gregory V. Kryukov, Dmitriy Sonkin, Anupama Reddy, Manway Liu, Lauren Murray, Michael F. Berger, John E. Monahan, Paula Morais, Jodi Meltzer, Adam Korejwa, Judit Jané-Valbuena, Felipa A. Mapa, Joseph Thibault, Eva Bric-Furlong, Pichai Raman, Aaron Shipway, Ingo H. Engels, Jill Cheng, Guoying K. Yu, Jianjun Yu, Peter Aspesi, Jr, Melanie de Silva, Kalpana Jagtap, Michael D. Jones, Li Wang, Charles Hatton, Emanuele Palescandolo, Supriya Gupta, Scott Mahan, Carrie Sougnez, Robert C. Onofrio, Ted Liefeld, Laura MacConaill, Wendy Winckler, Michael Reich, Nanxin Li, Jill P. Mesirov, Stacey B. Gabriel, Gad Getz, Kristin Ardlie, Vivien Chan, Vic E. Myer, Barbara L. Weber, Jeff Porter, Markus Warmuth, Peter Finan, Jennifer L. Harris, Matthew Meyerson, Todd R. Golub, Michael P. Morrissey, William R. Sellers, Robert Schlegel, and Levi A. Garraway. The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature, 483(7391):603-607, Mar 2012. [ bib | DOI | http | .pdf ]
The systematic translation of cancer genomic data into knowledge of tumour biology and therapeutic possibilities remains challenging. Such efforts should be greatly aided by robust preclinical model systems that reflect the genomic diversity of human cancers and for which detailed genetic and pharmacological annotation is available. Here we describe the Cancer Cell Line Encyclopedia (CCLE): a compilation of gene expression, chromosomal copy number and massively parallel sequencing data from 947 human cancer cell lines. When coupled with pharmacological profiles for 24 anticancer drugs across 479 of the cell lines, this collection allowed identification of genetic, lineage, and gene-expression-based predictors of drug sensitivity. In addition to known predictors, we found that plasma cell lineage correlated with sensitivity to IGF1 receptor inhibitors; AHR expression was associated with MEK inhibitor efficacy in NRAS-mutant lines; and SLFN11 expression predicted sensitivity to topoisomerase inhibitors. Together, our results indicate that large, annotated cell-line collections may help to enable preclinical stratification schemata for anticancer agents. The generation of genetic predictions of drug response in the preclinical setting and their incorporation into cancer clinical trial design could speed the emergence of 'personalized' therapeutic regimens.

[Argyriou2012Sparse] A. Argyriou, R. Foygel, and N. Srebro. Sparse prediction with the k-overlap norm. arXiv preprint arXiv:1204.5043, 2012. [ bib ]
[Tjong2012Physical] H. Tjong, K. Gong, L. Chen, and F. Alber. Physical tethering and volume exclusion determine higher-order genome organization in budding yeast. Genome Res., May 2012. [ bib | DOI | http | .pdf ]
In this paper we show that tethering of heterochromatic regions to nuclear landmarks and random encounters of chromosomes in the confined nuclear volume are sufficient to explain the higher-order organization of the budding yeast genome. We have quantitatively characterized the contact patterns and nuclear territories that emerge when chromosomes are allowed to behave as constrained but otherwise randomly configured flexible polymer chains in the nucleus. Remarkably, this constrained random encounter model explains in a statistical manner the experimental hallmarks of the S. cerevisiae genome organization, including (1) the folding patterns of individual chromosomes; (2) the highly enriched interactions between specific chromatin regions and chromosomes; (3) the emergence, shape, and position of gene territories; (4) the mean distances between pairs of telomeres; and (5) even the co-location of functionally related gene loci, including early replication start sites and tRNA genes. Therefore, most aspects of the yeast genome organization can be explained without calling on biochemically mediated chromatin interactions. Such interactions may modulate the pre-existing propensity for co-localization but seem not to be the cause for the observed higher-order organization. The fact that geometrical constraints alone yield a highly organized genome structure, on which different functional elements are specifically distributed, has strong implications for the folding principles of the genome and the evolution of its function.

[Hu2012HiCNorm] Ming Hu, Ke Deng, Siddarth Selvaraj, Zhaohui Qin, Bing Ren, and Jun S. Liu. HiCNorm: removing biases in Hi-C data via Poisson regression. Bioinformatics, 28(23):3131-3133, December 2012. [ bib | DOI | http ]
Summary: We propose a parametric model, HiCNorm, to remove systematic biases in the raw Hi-C contact maps, resulting in a simple, fast, yet accurate normalization procedure. Compared with the existing Hi-C normalization method developed by Yaffe and Tanay, HiCNorm has fewer parameters, runs >1000 times faster and achieves higher reproducibility. Availability: Freely available on the web at: http://www.people.fas.harvard.edu/∼junliu/HiCNorm/. Contact: jliu@stat.harvard.edu. Supplementary information: Supplementary data are available at Bioinformatics online.

Keywords: hi-c
[Trapnell2013Differential] C. Trapnell, D. G. Hendrickson, M. Sauvageau, L. Goff, J. L. Rinn, and L. Pachter. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol, 31(1):46-53, Jan 2013. [ bib | DOI | http | .pdf ]
Differential analysis of gene and transcript expression using high-throughput RNA sequencing (RNA-seq) is complicated by several sources of measurement variability and poses numerous statistical challenges. We present Cuffdiff 2, an algorithm that estimates expression at transcript-level resolution and controls for variability evident across replicate libraries. Cuffdiff 2 robustly identifies differentially expressed transcripts and genes and reveals differential splicing and promoter-preference changes. We demonstrate the accuracy of our approach through differential analysis of lung fibroblasts in response to loss of the developmental transcription factor HOXA1, which we show is required for lung fibroblast and HeLa cell cycle progression. Loss of HOXA1 results in significant expression level changes in thousands of individual transcripts, along with isoform switching events in key regulators of the cell cycle. Cuffdiff 2 performs robust differential analysis in RNA-seq experiments at transcript resolution, revealing a layer of regulation not readily observable with other high-throughput technologies.

Keywords: ngs, rnaseq
[Mezlini2013iReckon] A. M. Mezlini, E. J. M. Smith, M. Fiume, O. Buske, G. L. Savich, S. Shah, S. Aparicio, D. Y. Chiang, A. Goldenberg, and M. Brudno. iReckon: Simultaneous isoform discovery and abundance estimation from RNA-seq data. Genome Res, 23(3):519-529, Mar 2013. [ bib | DOI | http | .pdf ]
High-throughput RNA sequencing (RNA-seq) promises to revolutionize our understanding of genes and their role in human disease by characterizing the RNA content of tissues and cells. The realization of this promise, however, is conditional on the development of effective computational methods for the identification and quantification of transcripts from incomplete and noisy data. In this article, we introduce iReckon, a method for simultaneous determination of the isoforms and estimation of their abundances. Our probabilistic approach incorporates multiple biological and technical phenomena, including novel isoforms, intron retention, unspliced pre-mRNA, PCR amplification biases, and multimapped reads. iReckon utilizes regularized expectation-maximization to accurately estimate the abundances of known and novel isoforms. Our results on simulated and real data demonstrate a superior ability to discover novel isoforms with a significantly reduced number of false-positive predictions, and our abundance accuracy prediction outmatches that of other state-of-the-art tools. Furthermore, we have applied iReckon to two cancer transcriptome data sets, a triple-negative breast cancer patient sample and the MCF7 breast cancer cell line, and show that iReckon is able to reconstruct the complex splicing changes that were not previously identified. QT-PCR validations of the isoforms detected in the MCF7 cell line confirmed all of iReckon's predictions and also showed strong agreement (r = 0.94) with the predicted abundances.

Keywords: ngs, rnaseq
[Homouz20133D] D. Homouz and A. S. Kudlicki. The 3D organization of the yeast genome correlates with co-expression and reflects functional relations between genes. PLoS ONE, 8(1):e54699, Jan 2013. [ bib | DOI | http | .pdf ]
The spatial organization of eukaryotic genomes is thought to play an important role in regulating gene expression. The recent advances in experimental methods including chromatin capture techniques, as well as the large amounts of accumulated gene expression data allow studying the relationship between spatial organization of the genome and co-expression of protein-coding genes. To analyse this genome-wide relationship at a single gene resolution, we combined the interchromosomal DNA contacts in the yeast genome measured by Duan et al. with a comprehensive collection of 1,496 gene expression datasets. We find significant enhancement of co-expression among genes with contact links. The co-expression is most prominent when two gene loci fall within 1,000 base pairs from the observed contact. We also demonstrate an enrichment of inter-chromosomal links between functionally related genes, which suggests that the non random nature of the genome organization serves to facilitate coordinated transcription in groups of genes.

Keywords: hic, ngs
[Ben-Elazar2013Spatial] S. Ben-Elazar, Z. Yakhini, and I. Yanai. Spatial localization of co-regulated genes exceeds genomic gene clustering in the Saccharomyces cerevisiae genome. Nucleic Acids Res, 41(4):2191-2201, Feb 2013. [ bib | DOI | http | .pdf ]
While it has been long recognized that genes are not randomly positioned along the genome, the degree to which its 3D structure influences the arrangement of genes has remained elusive. In particular, several lines of evidence suggest that actively transcribed genes are spatially co-localized, forming transcription factories; however, a generalized systematic test has hitherto not been described. Here we reveal transcription factories using a rigorous definition of genomic structure based on Saccharomyces cerevisiae chromosome conformation capture data, coupled with an experimental design controlling for the primary gene order. We develop a data-driven method for the interpolation and the embedding of such datasets and introduce statistics that enable the comparison of the spatial and genomic densities of genes. Combining these, we report evidence that co-regulated genes are clustered in space, beyond their observed clustering in the context of gene order along the genome and show this phenomenon is significant for 64 out of 117 transcription factors. Furthermore, we show that those transcription factors with high spatially co-localized targets are expressed higher than those whose targets are not spatially clustered. Collectively, our results support the notion that, at a given time, the physical density of genes is intimately related to regulatory activity.

Keywords: ngs, hic
