microarray references

[Schena1995Quantitative] M. Schena, D. Shalon, R. W. Davis, and P. O. Brown. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270(5235):467-470, Oct 1995.
A high-capacity system was developed to monitor the expression of many genes in parallel. Microarrays prepared by high-speed robotic printing of complementary DNAs on glass were used for quantitative expression measurements of the corresponding genes. Because of the small format and high density of the arrays, hybridization volumes of 2 microliters could be used that enabled detection of rare transcripts in probe mixtures derived from 2 micrograms of total cellular messenger RNA. Differential expression measurements of 45 Arabidopsis genes were made by means of simultaneous, two-color fluorescence hybridization.

Keywords: microarray
[DeRisi1997Exploring] J. L. DeRisi, V. R. Iyer, and P. O. Brown. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278(5338):680-686, 1997.
[Spellman1998Comprehensive] P.T. Spellman, G. Sherlock, M.Q. Zhang, V.R. Iyer, K. Anders, M.B. Eisen, P.O. Brown, D. Botstein, and B. Futcher. Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization. Mol. Biol. Cell, 9:3273-3297, 1998.
[Mukherjee1998Support] S. Mukherjee, P. Tamayo, J. P. Mesirov, D. Slonim, A. Verri, and T. Poggio. Support vector machine classification of microarray data. Technical Report 182, C.B.C.L., 1998. A.I. Memo 1677.
Keywords: biosvm microarray
[Eisen1998Cluster] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 95:14863-14868, Dec 1998.
[Chu1998Transcriptional] S. Chu, J. DeRisi, M. Eisen, J. Mulholland, D. Botstein, P.O. Brown, and I. Herskowitz. The Transcriptional Program of Sporulation in Budding Yeast. Science, 282:699-705, 1998.
[Tavazoie1999Systematic] S. Tavazoie, J. D. Hughes, M. J. Campbell, R. J. Cho, and G. M. Church. Systematic determination of genetic network architecture. Nat. Genet., 22:281-285, 1999.
[Golub1999Molecular] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286:531-537, 1999.
Although cancer classification has improved over the past 30 years, there has been no general approach for identifying new cancer classes (class discovery) or for assigning tumors to known classes (class prediction). Here, a generic approach to cancer classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemias as a test case. A class discovery procedure automatically discovered the distinction between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) without previous knowledge of these classes. An automatically derived class predictor was able to determine the class of new leukemia cases. The results demonstrate the feasibility of cancer classification based solely on gene expression monitoring and suggest a general strategy for discovering and predicting cancer classes for other types of cancer, independent of previous biological knowledge.

Keywords: csbcbook, csbcbook-ch3, csbcbook-ch4
[Ferea1999Systematic] T. L. Ferea, D. Botstein, P. O. Brown, and R. F. Rosenzweig. Systematic changes in gene expression patterns following adaptive evolution in yeast. Proc. Natl. Acad. Sci. USA, 96(17):9721-9726, 1999.
[Zhu2000Two] G. Zhu, P. T. Spellman, T. Volpe, P. O. Brown, D. Botstein, T. N. Davis, and B. Futcher. Two yeast forkhead genes regulate the cell cycle and pseudohyphal growth. Nature, 406:90-94, 2000.
[Selinger2000RNA] Douglas W. Selinger, Kevin J. Cheung, Rui Mei, Erik M. Johansson, Craig S. Richmond, Frederick R. Blattner, David J. Lockhart, and George M. Church. RNA expression analysis using a 30 base pair resolution Escherichia coli genome array. Nat. Biotechnol., 18:1262-1268, 2000.
[Ogawa2000New] Nobuo Ogawa, Joseph DeRisi, and Patrick O. Brown. New Components of a System for Phosphate Accumulation and Polyphosphate Metabolism in Saccharomyces cerevisiae Revealed by Genomic Expression Analysis. Mol. Biol. Cell, 11:4309-4321, Dec 2000.
[Gross2000Identification] C. Gross, M. Kelleher, V.R. Iyer, P.O. Brown, and D.R. Winge. Identification of the copper regulon in Saccharomyces cerevisiae by DNA microarrays. J. Biol. Chem., 275(41):32310-32316, 2000.
[Gasch2000Genomic] A. P. Gasch, P. T. Spellman, C. M. Kao, O. Carmel-Harel, M. B. Eisen, G. Storz, D. Botstein, and P. O. Brown. Genomic Expression Programs in the Response of Yeast Cells to Environmental Changes. Mol. Biol. Cell, 11:4241-4257, Dec 2000.
[Friedman2000Using] N. Friedman, M. Linial, I. Nachman, and D. Pe'er. Using Bayesian networks to analyze expression data. J. Comput. Biol., 7(3-4):601-620, 2000.
DNA hybridization arrays simultaneously measure the expression level for thousands of genes. These measurements provide a "snapshot" of transcription levels within the cell. A major challenge in computational biology is to uncover, from such measurements, gene/protein interactions and key biological features of cellular systems. In this paper, we propose a new framework for discovering interactions between genes based on multiple expression measurements. This framework builds on the use of Bayesian networks for representing statistical dependencies. A Bayesian network is a graph-based model of joint multivariate probability distributions that captures properties of conditional independence between variables. Such models are attractive for their ability to describe complex stochastic processes and because they provide a clear methodology for learning from (noisy) observations. We start by showing how Bayesian networks can describe interactions between genes. We then describe a method for recovering gene interactions from microarray data using tools for learning Bayesian networks. Finally, we demonstrate this method on the S. cerevisiae cell-cycle measurements of Spellman et al. (1998).

Keywords: biogm
[Brown2000Exploring] P.O. Brown and D. Botstein. Exploring the new world of the genome with DNA microarrays. Nat. Genet., 21:33-37, 1999.
[Brown2000Knowledge-based] M. P. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. W. Sugnet, T. S. Furey, M. Ares, and D. Haussler. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. USA, 97(1):262-267, Jan 2000.
We introduce a method of functionally classifying genes by using gene expression data from DNA microarray hybridization experiments. The method is based on the theory of support vector machines (SVMs). SVMs are considered a supervised computer learning method because they exploit prior knowledge of gene function to identify unknown genes of similar function from expression data. SVMs avoid several problems associated with unsupervised clustering methods, such as hierarchical clustering and self-organizing maps. SVMs have many mathematical features that make them attractive for gene expression analysis, including their flexibility in choosing a similarity function, sparseness of solution when dealing with large data sets, the ability to handle large feature spaces, and the ability to identify outliers. We test several SVMs that use different similarity metrics, as well as some other supervised learning methods, and find that the SVMs best identify sets of genes with a common function using expression data. Finally, we use SVMs to predict functional roles for uncharacterized yeast ORFs based on their expression data.

Keywords: biosvm microarray
[Ben-Dor2000Tissue] A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z. Yakhini. Tissue classification with gene expression profiles. J. Comput. Biol., 7(3-4):559-583, 2000.
Constantly improving gene expression profiling technologies are expected to provide understanding and insight into cancer-related cellular processes. Gene expression data is also expected to significantly aid in the development of efficient cancer diagnosis and classification platforms. In this work we examine three sets of gene expression data measured across sets of tumor(s) and normal clinical samples: The first set consists of 2,000 genes, measured in 62 epithelial colon samples (Alon et al., 1999). The second consists of approximately 100,000 clones, measured in 32 ovarian samples (unpublished extension of data set described in Schummer et al. (1999)). The third set consists of approximately 7,100 genes, measured in 72 bone marrow and peripheral blood samples (Golub et al., 1999). We examine the use of scoring methods, measuring separation of tissue type (e.g., tumors from normals) using individual gene expression levels. These are then coupled with high-dimensional classification methods to assess the classification power of complete expression profiles. We present results of performing leave-one-out cross validation (LOOCV) experiments on the three data sets, employing nearest neighbor classifier, SVM (Cortes and Vapnik, 1995), AdaBoost (Freund and Schapire, 1997) and a novel clustering-based classification technique. As tumor samples can differ from normal samples in their cell-type composition, we also perform LOOCV experiments using appropriately modified sets of genes, attempting to eliminate the resulting bias. We demonstrate success rate of at least 90% in tumor versus normal classification, using sets of selected genes, with, as well as without, cellular-contamination-related members. These results are insensitive to the exact selection mechanism, over a certain range.

Keywords: biosvm microarray
[Sherlock2001Stanford] G. Sherlock, T. Hernandez-Boussard, A. Kasarskis, G. Binkley, J.C. Matese, S.S. Dwight, M. Kaloper, S. Weng, H. Jin, C.A. Ball, M.B. Eisen, and P.T. Spellman. The Stanford Microarray Database. Nucleic Acids Res., 29(1):152-155, Jan 2001.
[Ramaswamy2001Multiclass] S. Ramaswamy, P. Tamayo, R. Rifkin, S. Mukherjee, C.H. Yeang, M. Angelo, C. Ladd, M. Reich, E. Latulippe, J.P. Mesirov, T. Poggio, W. Gerald, M. Loda, E.S. Lander, and T.R. Golub. Multiclass cancer diagnosis using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA, 98(26):15149-15154, Dec 2001.
The optimal treatment of patients with cancer depends on establishing accurate diagnoses by using a complex combination of clinical and histopathological data. In some instances, this task is difficult or impossible because of atypical clinical presentation or histopathology. To determine whether the diagnosis of multiple common adult malignancies could be achieved purely by molecular classification, we subjected 218 tumor samples, spanning 14 common tumor types, and 90 normal tissue samples to oligonucleotide microarray gene expression analysis. The expression levels of 16,063 genes and expressed sequence tags were used to evaluate the accuracy of a multiclass classifier based on a support vector machine algorithm. Overall classification accuracy was 78% [...]. Poorly differentiated cancers resulted in low-confidence predictions and could not be accurately classified according to their tissue of origin, indicating that they are molecularly distinct entities with dramatically different gene expression patterns compared with their well differentiated counterparts. Taken together, these results demonstrate the feasibility of accurate, multiclass molecular cancer classification and suggest a strategy for future clinical implementation of molecular cancer diagnostics.

Keywords: biosvm microarray
[Pilpel2001Identifying] Y. Pilpel, P. Sudarsanam, and G. M. Church. Identifying regulatory networks by combinatorial analysis of promoter elements. Nat. Genet., 29:153-159, 2001.
[Kuhn2001Global] K. M. Kuhn, J. L. DeRisi, P. O. Brown, and P. Sarnow. Global and specific translational regulation in the genomic response of Saccharomyces cerevisiae to a rapid transfer from a fermentable to a nonfermentable carbon source. Mol. Cell. Biol., 21(3):916-927, 2001.
[Gasch2001Genomic] A.P. Gasch, M. Huang, S. Metzner, D. Botstein, S.J. Elledge, and P.O. Brown. Genomic expression responses to DNA-damaging agents and the regulatory role of the yeast ATR homolog Mec1p. Mol. Biol. Cell, 12(10):2987-3003, 2001.
[Chiang2001Visualizing] D. Y. Chiang, P. O. Brown, and M. B. Eisen. Visualizing associations between genome sequences and gene expression data using genome-mean expression profiles. Bioinformatics, 17:49S-55S, 2001.
[Bussemaker2001Regulatory] H. J. Bussemaker, H. Li, and E. D. Siggia. Regulatory element detection using correlation with expression. Nat. Genet., 27:167-174, 2001.
[Hanisch2002Co-clustering] D. Hanisch, A. Zien, R. Zimmer, and T. Lengauer. Co-clustering of biological networks and gene expression data. Bioinformatics, 2002.
[Bao2002Identifying] L. Bao and Z. Sun. Identifying genes related to drug anticancer mechanisms using support vector machine. FEBS Lett., 521:109-114, 2002.
In an effort to identify genes related to the cell line chemosensitivity and to evaluate the functional relationships between genes and anticancer drugs acting by the same mechanism, a supervised machine learning approach called support vector machine was used to label genes into any of the five predefined anticancer drug mechanistic categories. Among dozens of unequivocally categorized genes, many were known to be causally related to the drug mechanisms. For example, a few genes were found to be involved in the biological process triggered by the drugs (e.g. DNA polymerase epsilon was the direct target for the drugs from DNA antimetabolites category). DNA repair-related genes were found to be enriched for about eight-fold in the resulting gene set relative to the entire gene set. Some uncharacterized transcripts might be of interest in future studies. This method of correlating the drugs and genes provides a strategy for finding novel biologically significant relationships for molecular pharmacology.

Keywords: biosvm microarray
[Aliferis2002Machine] C.F. Aliferis, D.P. Hardin, and P. Massion. Machine Learning Models For Lung Cancer Classification Using Array Comparative Genomic Hybridization. In Proceedings of the 2002 American Medical Informatics Association (AMIA) Annual Symposium, pages 7-11, 2002.
Array CGH is a recently introduced technology that measures changes in the gene copy number of hundreds of genes in a single experiment. The primary goal of this study was to develop machine learning models that classify non-small Lung Cancers according to histopathology types and to compare several machine learning methods in this learning task. DNA from tumors of 37 patients (21 squamous carcinomas, and 16 adenocarcinomas) were extracted and hybridized onto a 452 BAC clone array. The following algorithms were used: KNN, Decision Tree Induction, Support Vector Machines and Feed-Forward Neural Networks. Performance was measured via leave-one-out classification accuracy. The best multi-gene model found had a leave-one-out accuracy of 89.2%. Decision Trees performed poorer than the other methods in this learning task and dataset. We conclude that gene copy numbers as measured by array CGH are, collectively, an excellent indicator of histological subtype. Several interesting research directions are discussed.

Keywords: biosvm microarray, cgh
[Peng2003Molecular] S. Peng, Q. Xu, X.B. Ling, X. Peng, W. Du, and L. Chen. Molecular classification of cancer types from microarray data using the combination of genetic algorithms and support vector machines. FEBS Lett., 555(2):358-362, 2003.
Simultaneous multiclass classification of tumor types is essential for future clinical implementations of microarray-based cancer diagnosis. In this study, we have combined genetic algorithms (GAs) and all paired support vector machines (SVMs) for multiclass cancer identification. The predictive features have been selected through iterative SVMs/GAs, and recursive feature elimination post-processing steps, leading to a very compact cancer-related predictive gene set. Leave-one-out cross-validations yielded accuracies of 87.93% for the eight-class and 85.19% [...] outperforming the results derived from previously published methods.

Keywords: biosvm microarray
[Meireles2003Differentially] S.I. Meireles, A.F. Carvalho, R. Hirata, A.L. Montagnini, W.K. Martins, F.B. Runza, B.S. Stolf, L. Termini, C.E. Neto, R.L. Silva, F.A. Soares, E.J. Neves, and L.F. Reis. Differentially expressed genes in gastric tumors identified by cDNA array. Cancer Lett., 190(2):199-211, Feb 2003.
Using cDNA fragments from the FAPESP/LICR Cancer Genome Project, we constructed a cDNA array having 4512 elements and determined gene expression in six normal and six tumor gastric tissues. Using t-statistics, we identified 80 cDNAs whose expression in normal and tumor samples differed more than 3.5 sample standard deviations. Using Self-Organizing Map, the expression profile of these cDNAs allowed perfect separation of malignant and non-malignant samples. Using the supervised learning procedure Support Vector Machine, we identified trios of cDNAs that could be used to classify samples as normal or tumor, based on single-array analysis. Finally, we identified genes with altered linear correlation when their expression in normal and tumor samples were compared. Further investigation concerning the function of these genes could contribute to the understanding of gastric carcinogenesis and may prove useful in molecular diagnostics.

Keywords: biosvm microarray
[LeRoch2003Discovery] K. G. Le Roch, Y. Zhou, P. L. Blair, M. Grainger, J. K. Moch, J. D. Haynes, P. De la Vega, A. A. Holder, S. Batalov, D. J. Carucci, and E. A. Winzeler. Discovery of gene function by expression profiling of the malaria parasite life cycle. Science, 301(5639):1503-1508, 2003.
The completion of the genome sequence for Plasmodium falciparum, the species responsible for most malaria human deaths, has the potential to reveal hundreds of new drug targets and proteins involved in pathogenesis. However, only approximately 35% [...] with an identifiable function. The absence of routine genetic tools for studying Plasmodium parasites suggests that this number is unlikely to change quickly if conventional serial methods are used to characterize encoded proteins. Here, we use a high-density oligonucleotide array to generate expression profiles of human and mosquito stages of the malaria parasite's life cycle. Genes with highly correlated levels and temporal patterns of expression were often involved in similar functions or cellular processes.

Keywords: microarray plasmodium
[Bozdech2003Expression] Z. Bozdech, J. Zhu, M. Joachimiak, F. Cohen, B. Pulliam, and J. DeRisi. Expression profiling of the schizont and trophozoite stages of Plasmodium falciparum with a long-oligonucleotide microarray. Genome Biology, 4(2):R9, 2003.
BACKGROUND: The worldwide persistence of drug-resistant Plasmodium falciparum, the most lethal variety of human malaria, is a global health concern. The P. falciparum sequencing project has brought new opportunities for identifying molecular targets for antimalarial drug and vaccine development. RESULTS: We developed a software package, ArrayOligoSelector, to design an open reading frame (ORF)-specific DNA microarray using the publicly available P. falciparum genome sequence. Each gene was represented by one or more long 70-mer oligonucleotides selected on the basis of uniqueness within the genome, exclusion of low-complexity sequence, balanced base composition and proximity to the 3' end. A first-generation microarray representing approximately 6,000 ORFs of the P. falciparum genome was constructed. Array performance was evaluated through the use of control oligonucleotide sets with increasing levels of introduced mutations, as well as traditional northern blotting. Using this array, we extensively characterized the gene-expression profile of the intraerythrocytic trophozoite and schizont stages of P. falciparum. The results revealed extensive transcriptional regulation of genes specialized for processes specific to these two stages. CONCLUSIONS: DNA microarrays based on long oligonucleotides are powerful tools for the functional annotation and exploration of the P. falciparum genome. Expression profiling of trophozoites and schizonts revealed genes associated with stage-specific processes and may serve as the basis for future drug targets and vaccine development.

Keywords: microarray plasmodium
[Bozdech2003Transcriptome] Z. Bozdech, M. Llinas, B. L. Pulliam, E. D. Wong, J. Zhu, and J. L. DeRisi. The Transcriptome of the Intraerythrocytic Developmental Cycle of Plasmodium falciparum. PLoS Biology, 1(1):e5, 2003.
Plasmodium falciparum is the causative agent of the most burdensome form of human malaria, affecting 200-300 million individuals per year worldwide. The recently sequenced genome of P. falciparum revealed over 5,400 genes, of which 60% encode proteins of unknown function. Insights into the biochemical function and regulation of these genes will provide the foundation for future drug and vaccine development efforts toward eradication of this disease. By analyzing the complete asexual intraerythrocytic developmental cycle (IDC) transcriptome of the HB3 strain of P. falciparum, we demonstrate that at least 60% of the genome is transcriptionally active during this stage. Our data demonstrate that this parasite has evolved an extremely specialized mode of transcriptional regulation that produces a continuous cascade of gene expression, beginning with genes corresponding to general cellular processes, such as protein synthesis, and ending with Plasmodium-specific functionalities, such as genes involved in erythrocyte invasion. The data reveal that genes contiguous along the chromosomes are rarely coregulated, while transcription from the plastid genome is highly coregulated and likely polycistronic. Comparative genomic hybridization between HB3 and the reference genome strain (3D7) was used to distinguish between genes not expressed during the IDC and genes not detected because of possible sequence variations. Genomic differences between these strains were found almost exclusively in the highly antigenic subtelomeric regions of chromosomes. The simple cascade of gene regulation that directs the asexual development of P. falciparum is unprecedented in eukaryotic biology. The transcriptome of the IDC resembles a "just-in-time" manufacturing process whereby induction of any given gene occurs once per cycle and only at a time when it is required. These data provide to our knowledge the first comprehensive view of the timing of transcription throughout the intraerythrocytic development of P. falciparum and provide a resource for the identification of new chemotherapeutic and vaccine candidates.

Keywords: microarray plasmodium
[Tsai2004Gene] C.A. Tsai, C.H. Chen, T.C. Lee, I.C. Ho, U.C. Yang, and J.J. Chen. Gene selection for sample classifications in microarray experiments. DNA Cell Biol., 23(10):607-614, 2004.
DNA microarray technology provides useful tools for profiling global gene expression patterns in different cell/tissue samples. One major challenge is the large number of genes relative to the number of samples. The use of all genes can suppress or reduce the performance of a classification rule due to the noise of nondiscriminatory genes. Selection of an optimal subset from the original gene set becomes an important prestep in sample classification. In this study, we propose a family-wise error (FWE) rate approach to selection of discriminatory genes for two-sample or multiple-sample classification. The FWE approach controls the probability of the number of one or more false positives at a prespecified level. A public colon cancer data set is used to evaluate the performance of the proposed approach for the two classification methods: k nearest neighbors (k-NN) and support vector machine (SVM). The selected gene sets from the proposed procedure appears to perform better than or comparable to several results reported in the literature using the univariate analysis without performing multivariate search. In addition, we apply the FWE approach to a toxicogenomic data set with nine treatments (a control and eight metals, As, Cd, Ni, Cr, Sb, Pb, Cu, and AsV) for a total of 55 samples for a multisample classification. Two gene sets are considered: the gene set omegaF formed by the ANOVA F-test, and a gene set omegaT formed by the union of one-versus-all t-tests. The predicted accuracies are evaluated using the internal and external crossvalidation. Using the SVM classification, the overall accuracies to predict 55 samples into one of the nine treatments are above 80% for internal crossvalidation. OmegaF has slightly higher accuracy rates than omegaT. The overall predicted accuracies are above 70% for external crossvalidation; the two gene sets omegaT and omegaF performed equally well.

Keywords: biosvm microarray
[Pochet2004Systematic] N. Pochet, F. De Smet, J. A. K. Suykens, and B. L. R. De Moor. Systematic benchmarking of microarray data classification: assessing the role of non-linearity and dimensionality reduction. Bioinformatics, 20(17):3185-3195, Nov 2004.
Motivation: Microarrays are capable of determining the expression levels of thousands of genes simultaneously. In combination with classification methods, this technology can be useful to support clinical management decisions for individual patients, e.g. in oncology. The aim of this paper is to systematically benchmark the role of non-linear versus linear techniques and dimensionality reduction methods. Results: A systematic benchmarking study is performed by comparing linear versions of standard classification and dimensionality reduction techniques with their non-linear versions based on non-linear kernel functions with a radial basis function (RBF) kernel. A total of 9 binary cancer classification problems, derived from 7 publicly available microarray datasets, and 20 randomizations of each problem are examined. Conclusions: Three main conclusions can be formulated based on the performances on independent test sets. (1) When performing classification with least squares support vector machines (LS-SVMs) (without dimensionality reduction), RBF kernels can be used without risking too much overfitting. The results obtained with well-tuned RBF kernels are never worse and sometimes even statistically significantly better compared to results obtained with a linear kernel in terms of test set receiver operating characteristic and test set accuracy performances. (2) Even for classification with linear classifiers like LS-SVM with linear kernel, using regularization is very important. (3) When performing kernel principal component analysis (kernel PCA) before classification, using an RBF kernel for kernel PCA tends to result in overfitting, especially when using supervised feature selection. It has been observed that an optimal selection of a large number of features is often an indication for overfitting. Kernel PCA with linear kernel gives better results. Availability: Matlab scripts are available on request. Supplementary information: http://www.esat.kuleuven.ac.be/~npochet/Bioinformatics/

Keywords: biosvm microarray
[Ishkanian2004tiling] A. S. Ishkanian, C. A. Malloff, S. K. Watson, R. J. DeLeeuw, B. Chi, B. P. Coe, A. Snijders, D. G. Albertson, D. Pinkel, M. A. Marra, V. Ling, C. MacAulay, and W. L. Lam. A tiling resolution DNA microarray with complete coverage of the human genome. Nat. Genet., 36(3):299-303, Mar 2004.
We constructed a tiling resolution array consisting of 32,433 overlapping BAC clones covering the entire human genome. This increases our ability to identify genetic alterations and their boundaries throughout the genome in a single comparative genomic hybridization (CGH) experiment. At this tiling resolution, we identified minute DNA alterations not previously reported. These alterations include microamplifications and deletions containing oncogenes, tumor-suppressor genes and new genes that may be associated with multiple tumor types. Our findings show the need to move beyond conventional marker-based genome comparison approaches, that rely on inference of continuity between interval markers. Our submegabase resolution tiling set for array CGH (SMRT array) allows comprehensive assessment of genomic integrity and thereby the identification of new genes associated with disease.

Keywords: csbcbook, microarray
[Bozdech2004Antioxidant] Z. Bozdech and H. Ginsburg. Antioxidant defense in Plasmodium falciparum - data mining of the transcriptome. Malaria Journal, 3(1):23, 2004.
The intraerythrocytic malaria parasite is under constant oxidative stress originating both from endogenous and exogenous processes. The parasite is endowed with a complete network of enzymes and proteins that protect it from those threats, but also uses redox activities to regulate enzyme activities. In the present analysis, the transcription of the genes coding for the antioxidant defense elements are viewed in the time-frame of the intraerythrocytic cycle. Time-dependent transcription data were taken from the transcriptome of the human malaria parasite Plasmodium falciparum. Whereas for several processes the transcription of the many participating genes is coordinated, in the present case there are some outstanding deviations where gene products that utilize glutathione or thioredoxin are transcribed before the genes coding for elements that control the levels of those substrates are transcribed. Such insights may hint to novel, non-classical pathways that necessitate further investigations.

Keywords: microarray plasmodium
[Pavey2004Microarray] S. Pavey, P. Johansson, L. Packer, J. Taylor, M. Stark, P.M. Pollock, G.J. Walker, G.M. Boyle, U. Harper, S.J. Cozzi, K. Hansen, L. Yudt, C. Schmidt, P. Hersey, K.A. Ellem, M.G. O'Rourke, P.G. Parsons, P. Meltzer, M. Ringner, and N.K. Hayward. Microarray expression profiling in melanoma reveals a BRAF mutation signature. Oncogene, 23(23):4060-4067, May 2004.
We have used microarray gene expression profiling and machine learning to predict the presence of BRAF mutations in a panel of 61 melanoma cell lines. The BRAF gene was found to be mutated in 42 samples (69%) [...] seven samples (11%). Using support vector machines, we have built a classifier that differentiates between melanoma cell lines based on BRAF mutation status. As few as 83 genes are able to discriminate between BRAF mutant and BRAF wild-type samples with clear separation observed using hierarchical clustering. Multidimensional scaling was used to visualize the relationship between a BRAF mutation signature and that of a generalized mitogen-activated protein kinase (MAPK) activation (either BRAF or NRAS mutation) in the context of the discriminating gene list. We observed that samples carrying NRAS mutations lie somewhere between those with or without BRAF mutations. These observations suggest that there are gene-specific mutation signals in addition to a common MAPK activation that result from the pleiotropic effects of either BRAF or NRAS on other signaling pathways, leading to measurably different transcriptional changes.

Keywords: biosvm microarray
[Zhou2005LS] Xin Zhou and K. Z. Mao. LS Bound based gene selection for DNA microarray data. Bioinformatics, 21(8):1559-64, Apr 2005. [ bib | DOI | http | .pdf ]
MOTIVATION: One problem with discriminant analysis of DNA microarray data is that each sample is represented by quite a large number of genes, and many of them are irrelevant, insignificant or redundant to the discriminant problem at hand. Methods for selecting important genes are, therefore, of much significance in microarray data analysis. In the present study, a new criterion, called LS Bound measure, is proposed to address the gene selection problem. The LS Bound measure is derived from leave-one-out procedure of LS-SVMs (least squares support vector machines), and as the upper bound for leave-one-out classification results it reflects to some extent the generalization performance of gene subsets. RESULTS: We applied this LS Bound measure for gene selection on two benchmark microarray datasets: colon cancer and leukemia. We also compared the LS Bound measure with other evaluation criteria, including the well-known Fisher's ratio and Mahalanobis class separability measure, and other published gene selection algorithms, including Weighting factor and SVM Recursive Feature Elimination. The strength of the LS Bound measure is that it provides gene subsets leading to more accurate classification results than the filter method while its computational complexity is at the level of the filter method. AVAILABILITY: A companion website can be accessed at http://www.ntu.edu.sg/home5/pg02776030/lsbound/. The website contains: (1) the source code of the gene selection algorithm; (2) the complete set of tables and figures regarding the experimental study; (3) proof of the inequality (9). CONTACT: ekzmao@ntu.edu.sg.

Keywords: biosvm featureselection microarray
[Wang2005Gene] Yu Wang, Igor V Tetko, Mark A Hall, Eibe Frank, Axel Facius, Klaus F X Mayer, and Hans W Mewes. Gene selection from microarray data for cancer classification-a machine learning approach. Comput. Biol. Chem., 29(1):37-46, Feb 2005. [ bib | DOI | http | .pdf ]
A DNA microarray can track the expression levels of thousands of genes simultaneously. Previous research has demonstrated that this technology can be useful in the classification of cancers. Cancer microarray data normally contains a small number of samples which have a large number of gene expression levels as features. To select relevant genes involved in different types of cancer remains a challenge. In order to extract useful gene information from cancer microarray data and reduce dimensionality, feature selection algorithms were systematically investigated in this study. Using a correlation-based feature selector combined with machine learning algorithms such as decision trees, naïve Bayes and support vector machines, we show that classification performance at least as good as published results can be obtained on acute leukemia and diffuse large B-cell lymphoma microarray data sets. We also demonstrate that a combined use of different classification and feature selection approaches makes it possible to select relevant genes with high confidence. This is also the first paper which discusses both computational and biological evidence for the involvement of zyxin in leukaemogenesis.

Keywords: biosvm microarray
[Wang2005Gene-expression] Y. Wang, J.G.M. Klijn, Y. Zhang, A.M. Sieuwerts, M.P. Look, F. Yang, D. Talantov, M. Timmermans, M.E. Meijer-van Gelder, J. Yu, T. Jatkoe, E.M.J.J. Berns, D. Atkins, and J.A. Foekens. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancers. Lancet, 365(9460):671-679, 2005. [ bib | DOI | http | .pdf ]
BACKGROUND: Genome-wide measures of gene expression can identify patterns of gene activity that subclassify tumours and might provide a better means than is currently available for individual risk assessment in patients with lymph-node-negative breast cancer. METHODS: We analysed, with Affymetrix Human U133a GeneChips, the expression of 22000 transcripts from total RNA of frozen tumour samples from 286 lymph-node-negative patients who had not received adjuvant systemic treatment. FINDINGS: In a training set of 115 tumours, we identified a 76-gene signature consisting of 60 genes for patients positive for oestrogen receptors (ER) and 16 genes for ER-negative patients. This signature showed 93% sensitivity and 48% specificity in a subsequent independent testing set of 171 lymph-node-negative patients. The gene profile was highly informative in identifying patients who developed distant metastases within 5 years (hazard ratio 5.67 [95% CI 2.59-12.4]), even when corrected for traditional prognostic factors in multivariate analysis (5.55 [2.46-12.5]). The 76-gene profile also represented a strong prognostic factor for the development of metastasis in the subgroups of 84 premenopausal patients (9.60 [2.28-40.5]), 87 postmenopausal patients (4.04 [1.57-10.4]), and 79 patients with tumours of 10-20 mm (14.1 [3.34-59.2]), a group of patients for whom prediction of prognosis is especially difficult. INTERPRETATION: The identified signature provides a powerful tool for identification of patients at high risk of distant recurrence. The ability to identify patients who have a favourable prognosis could, after independent confirmation, allow clinicians to avoid adjuvant systemic therapy or to choose less aggressive therapeutic options.

Keywords: microarray, breastcancer
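Note: the sensitivity and specificity quoted in the abstract above are ratios over a two-by-two confusion table. A minimal sketch of that arithmetic follows; the function name and the counts are hypothetical, chosen only for illustration, and are not taken from the paper.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-table counts.

    tp / fn: patients with the event, predicted high / low risk.
    tn / fp: patients without the event, predicted low / high risk.
    """
    sensitivity = tp / (tp + fn)  # fraction of true events flagged as high risk
    specificity = tn / (tn + fp)  # fraction of non-events flagged as low risk
    return sensitivity, specificity


# Hypothetical counts (an assumption, not data from the study).
sens, spec = sensitivity_specificity(tp=45, fn=5, tn=60, fp=40)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

A signature tuned for high sensitivity, as in the abstract, accepts more false positives, which is why the two figures are reported together.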
[Thukral2005Prediction] Sushil K Thukral, Paul J Nordone, Rong Hu, Leah Sullivan, Eric Galambos, Vincent D Fitzpatrick, Laura Healy, Michael B Bass, Mary E Cosenza, and Cynthia A Afshari. Prediction of nephrotoxicant action and identification of candidate toxicity-related biomarkers. Toxicol Pathol, 33(3):343-55, 2005. [ bib | DOI | http ]
A vast majority of pharmacological compounds and their metabolites are excreted via the urine, and within the complex structure of the kidney, the proximal tubules are a main target site of nephrotoxic compounds. We used the model nephrotoxicants mercuric chloride, 2-bromoethylamine hydrobromide, hexachlorobutadiene, mitomycin, amphotericin, and puromycin to elucidate time- and dose-dependent global gene expression changes associated with proximal tubular toxicity. Male Sprague-Dawley rats were dosed via intraperitoneal injection once daily for mercuric chloride and amphotericin (up to 7 doses), while a single dose was given for all other compounds. Animals were exposed to 2 different doses of these compounds and kidney tissues were collected on day 1, 3, and 7 postdosing. Gene expression profiles were generated from kidney RNA using 17K rat cDNA dual dye microarray and analyzed in conjunction with histopathology. Analysis of gene expression profiles showed that the profiles clustered based on similarities in the severity and type of pathology of individual animals. Further, the expression changes were indicative of tubular toxicity showing hallmarks of tubular degeneration/regeneration and necrosis. Use of gene expression data in predicting the type of nephrotoxicity was then tested with a support vector machine (SVM)-based approach. An SVM prediction module was trained using 120 profiles of total profiles divided into four classes based on the severity of pathology and clustering. Although mitomycin C and amphotericin B treatments did not cause toxicity, their expression profiles were included in the SVM prediction module to increase the sample size. Using this classifier, the SVM predicted the type of pathology of 28 test profiles with 100% selectivity and 82% sensitivity. These data indicate that valid predictions could be made based on gene expression changes from a small set of expression profiles.
A set of potential biomarkers showing a time- and dose-response with respect to the progression of proximal tubular toxicity were identified. These include several transporters (Slc21a2, Slc15, Slc34a2), Kim 1, IGFbp-1, osteopontin, alpha-fibrinogen, and Gstalpha.

Keywords: Algorithms, Animals, Antibiotics, Antineoplastic, Artificial Intelligence, Butadienes, Chloroplasts, Comparative Study, Computer Simulation, Computer-Assisted, Diagnosis, Disinfectants, Dose-Response Relationship, Drug, Drug Toxicity, Electrodes, Electroencephalography, Ethylamines, Expert Systems, Feedback, Fungicides, Gene Expression Profiling, Genes, Genetic Markers, Humans, Implanted, Industrial, Information Storage and Retrieval, Kidney, Kidney Tubules, MEDLINE, Male, Mercuric Chloride, Microarray Analysis, Molecular Biology, Motor Cortex, Movement, Natural Language Processing, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Plant Proteins, Predictive Value of Tests, Proteins, Proteome, Proximal, Puromycin Aminonucleoside, Rats, Reproducibility of Results, Research Support, Sprague-Dawley, Subcellular Fractions, Terminology, Therapy, Time Factors, Toxicogenetics, U.S. Gov't, User-Computer Interface, 15805072
[Statnikov2005comprehensive] A. Statnikov, C. F. Aliferis, I. Tsamardinos, D. Hardin, and S. Levy. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 2005. To appear. [ bib | http | .pdf ]
Motivation: Cancer diagnosis is one of the most important emerging clinical applications of gene expression microarray technology. We are seeking to develop a computer system for powerful and reliable cancer diagnostic model creation based on microarray data. To keep a realistic perspective on clinical applications we focus on multicategory diagnosis. In order to equip the system with the optimum combination of classifier, gene selection and cross-validation methods, we performed a systematic and comprehensive evaluation of several major algorithms for multicategory classification, several gene selection methods, multiple ensemble classifier methods, and two cross validation designs using 11 datasets spanning 74 diagnostic categories and 41 cancer types and 12 normal tissue types. Results: Multicategory Support Vector Machines (MC-SVMs) are the most effective classifiers in performing accurate cancer diagnosis from gene expression data. The MC-SVM techniques by Crammer and Singer, Weston and Watkins, and one-versus-rest were found to be the best methods in this domain. MC-SVMs outperform other popular machine learning algorithms such as K-Nearest Neighbors, Backpropagation and Probabilistic Neural Networks, often to a remarkable degree. Gene selection techniques can significantly improve classification performance of both MC-SVMs and other non-SVM learning algorithms. Ensemble classifiers do not generally improve performance of the best non-ensemble models. These results guided the construction of a software system GEMS (Gene Expression Model Selector) that automates high-quality model construction and enforces sound optimization and performance estimation procedures. This is the first such system to be informed by a rigorous comparative analysis of the available algorithms and datasets. Availability: The software system GEMS is available for download from http://www.gems-system.org for non-commercial use.

Keywords: biosvm microarray
[Segal2005From] E. Segal, N. Friedman, N. Kaminski, A. Regev, and D. Koller. From signatures to models: understanding cancer using microarrays. Nat Genet, 37(6 Suppl):S38-45, 2005. [ bib | DOI | http | .pdf ]
Genomics has the potential to revolutionize the diagnosis and management of cancer by offering an unprecedented comprehensive view of the molecular underpinnings of pathology. Computational analysis is essential to transform the masses of generated data into a mechanistic understanding of disease. Here we review current research aimed at uncovering the modular organization and function of transcriptional networks and responses in cancer. We first describe how methods that analyze biological processes in terms of higher-level modules can identify robust signatures of disease mechanisms. We then discuss methods that aim to identify the regulatory mechanisms underlying these modules and processes. Finally, we show how comparative analysis, combining human data with model organisms, can lead to more robust findings. We conclude by discussing the challenges of generalizing these methods from cells to tissues and the opportunities they offer to improve cancer diagnosis and management.

Keywords: microarray
[ODonnell2005Gene] Rebekah K O'Donnell, Michael Kupferman, S. Jack Wei, Sunil Singhal, Randal Weber, Bert O'Malley, Yi Cheng, Mary Putt, Michael Feldman, Barry Ziober, and Ruth J Muschel. Gene expression signature predicts lymphatic metastasis in squamous cell carcinoma of the oral cavity. Oncogene, 24(7):1244-51, Feb 2005. [ bib | DOI | http | .pdf ]
Metastasis via the lymphatics is a major risk factor in squamous cell carcinoma of the oral cavity (OSCC). We sought to determine whether the presence of metastasis in the regional lymph node could be predicted by a gene expression signature of the primary tumor. A total of 18 OSCCs were characterized for gene expression by hybridizing RNA to Affymetrix U133A gene chips. Genes with differential expression were identified using a permutation technique and verified by quantitative RT-PCR and immunohistochemistry. A predictive rule was built using a support vector machine, and the accuracy of the rule was evaluated using crossvalidation on the original data set and prediction of an independent set of four patients. Metastatic primary tumors could be differentiated from nonmetastatic primary tumors by a signature gene set of 116 genes. This signature gene set correctly predicted the four independent patients as well as associating five lymph node metastases from the original patient set with the metastatic primary tumor group. We concluded that lymph node metastasis could be predicted by gene expression profiles of primary oral cavity squamous cell carcinomas. The presence of a gene expression signature for lymph node metastasis indicates that clinical testing to assess risk for lymph node metastasis should be possible.

Keywords: biosvm microarray
[Michiels2005Prediction] S. Michiels, S. Koscielny, and C. Hill. Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet, 365(9458):488-492, 2005. [ bib | DOI | http ]
BACKGROUND: General studies of microarray gene-expression profiling have been undertaken to predict cancer outcome. Knowledge of this gene-expression profile or molecular signature should improve treatment of patients by allowing treatment to be tailored to the severity of the disease. We reanalysed data from the seven largest published studies that have attempted to predict prognosis of cancer patients on the basis of DNA microarray analysis. METHODS: The standard strategy is to identify a molecular signature (ie, the subset of genes most differentially expressed in patients with different outcomes) in a training set of patients and to estimate the proportion of misclassifications with this signature on an independent validation set of patients. We expanded this strategy (based on unique training and validation sets) by using multiple random sets, to study the stability of the molecular signature and the proportion of misclassifications. FINDINGS: The list of genes identified as predictors of prognosis was highly unstable; molecular signatures strongly depended on the selection of patients in the training sets. For all but one study, the proportion misclassified decreased as the number of patients in the training set increased. Because of inadequate validation, our chosen studies published overoptimistic results compared with those from our own analyses. Five of the seven studies did not classify patients better than chance. INTERPRETATION: The prognostic value of published microarray results in cancer studies should be considered with caution. We advocate the use of validation by repeated random sampling.

Keywords: featureselection, breastcancer, microarray
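Note: the multiple random validation strategy described in the abstract above can be sketched as repeated random train/validation splits, with the gene signature re-selected on every training set. The code below is an illustration under stated assumptions, not the authors' procedure: the mean-difference gene ranking, the nearest-centroid rule, the toy data and all function names are stand-ins chosen for brevity.

```python
import random
from statistics import mean


def select_genes(X, y, rows, k):
    """Rank genes by |difference of class means| on the given rows; keep the top k."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]
    scores = []
    for g in range(len(X[0])):
        diff = mean(X[i][g] for i in pos) - mean(X[i][g] for i in neg)
        scores.append((abs(diff), g))
    scores.sort(reverse=True)
    return [g for _, g in scores[:k]]


def centroids(X, y, rows, genes):
    """Per-class mean expression over the selected genes."""
    return {
        label: [mean(X[i][g] for i in rows if y[i] == label) for g in genes]
        for label in (0, 1)
    }


def predict(x, genes, cents):
    """Assign the class whose centroid is closest in squared distance."""
    def dist(label):
        return sum((x[g] - c) ** 2 for g, c in zip(genes, cents[label]))
    return 1 if dist(1) < dist(0) else 0


def multiple_random_validation(X, y, train_frac=0.5, k=3, n_splits=50, seed=0):
    """Repeat a stratified random split; assumes at least 2 samples per class."""
    rng = random.Random(seed)
    by_class = {c: [i for i, v in enumerate(y) if v == c] for c in (0, 1)}
    error_rates, signatures = [], []
    for _ in range(n_splits):
        train, test = [], []
        for members in by_class.values():
            members = members[:]
            rng.shuffle(members)
            cut = max(1, int(len(members) * train_frac))
            train += members[:cut]
            test += members[cut:]
        genes = select_genes(X, y, train, k)  # signature chosen on training data only
        cents = centroids(X, y, train, genes)
        wrong = sum(predict(X[i], genes, cents) != y[i] for i in test)
        error_rates.append(wrong / len(test))
        signatures.append(frozenset(genes))
    return error_rates, signatures


def toy_data(n_per_class=20, n_genes=20, shift=5.0, seed=1):
    """Synthetic data (an assumption): only gene 0 differs between the classes."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0, 1) for _ in range(n_genes)]
            row[0] += shift * label
            X.append(row)
            y.append(label)
    return X, y


X, y = toy_data()
errors, sigs = multiple_random_validation(X, y)
print("mean misclassification:", mean(errors))
print("distinct signatures across splits:", len(set(sigs)))
```

The spread of the error rates and the number of distinct signatures across splits are the two quantities the abstract refers to as the proportion of misclassifications and the stability of the molecular signature.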
[Mavroforakis2005Significance] Michael Mavroforakis, Harris Georgiou, Nikos Dimitropoulos, Dionisis Cavouras, and Sergios Theodoridis. Significance analysis of qualitative mammographic features, using linear classifiers, neural networks and support vector machines. Eur J Radiol, 54(1):80-9, Apr 2005. [ bib | DOI | http | .pdf ]
Advances in modern technologies and computers have enabled digital image processing to become a vital tool in conventional clinical practice, including mammography. However, the core problem of the clinical evaluation of mammographic tumors remains a highly demanding cognitive task. In order for these automated diagnostic systems to perform in levels of sensitivity and specificity similar to that of human experts, it is essential that a robust framework on problem-specific design parameters is formulated. This study is focused on identifying a robust set of clinical features that can be used as the base for designing the input of any computer-aided diagnosis system for automatic mammographic tumor evaluation. A thorough list of clinical features was constructed and the diagnostic value of each feature was verified against current clinical practices by an expert physician. These features were directly or indirectly related to the overall morphological properties of the mammographic tumor or the texture of the fine-scale tissue structures as they appear in the digitized image, while others contained external clinical data of utmost importance, like the patient's age. The entire feature set was used as an annotation list for describing the clinical properties of mammographic tumor cases in a quantitative way, such that subsequent objective analyses were possible. For the purposes of this study, a mammographic image database was created, with complete clinical evaluation descriptions and positive histological verification for each case. All tumors contained in the database were characterized according to the identified clinical features' set and the resulting dataset was used as input for discrimination and diagnostic value analysis for each one of these features. Specifically, several standard methodologies of statistical significance analysis were employed to create feature rankings according to their discriminating power.
Moreover, three different classification models, namely linear classifiers, neural networks and support vector machines, were employed to investigate the true efficiency of each one of them, as well as the overall complexity of the diagnostic task of mammographic tumor characterization. Both the statistical and the classification results have proven the explicit correlation of all the selected features with the final diagnosis, qualifying them as an adequate input base for any type of similar automated diagnosis system. The underlying complexity of the diagnostic task has justified the high value of sophisticated pattern recognition architectures.

Keywords: Algorithms, Animals, Antibiotics, Antineoplastic, Artificial Intelligence, Butadienes, Chloroplasts, Comparative Study, Computer Simulation, Computer-Assisted, Diagnosis, Disinfectants, Dose-Response Relationship, Drug, Drug Toxicity, Electrodes, Electroencephalography, Ethylamines, Expert Systems, Feedback, Fungicides, Gene Expression Profiling, Genes, Genetic Markers, Humans, Implanted, Industrial, Information Storage and Retrieval, Kidney, Kidney Tubules, MEDLINE, Male, Mercuric Chloride, Microarray Analysis, Molecular Biology, Motor Cortex, Movement, Natural Language Processing, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Plant Proteins, Predictive Value of Tests, Proteins, Proteome, Proximal, Puromycin Aminonucleoside, Rats, Reproducibility of Results, Research Support, Sprague-Dawley, Subcellular Fractions, Terminology, Therapy, Time Factors, Toxicogenetics, U.S. Gov't, User-Computer Interface, 15797296
[Ioannidis2005Microarrays] J. P. A. Ioannidis. Microarrays and molecular research: noise discovery? Lancet, 365(9458):454, 2005. [ bib | .pdf ]
Keywords: microarray
[Haferlach2005global] Torsten Haferlach, Alexander Kohlmann, Susanne Schnittger, Martin Dugas, Wolfgang Hiddemann, Wolfgang Kern, and Claudia Schoch. A global approach to the diagnosis of leukemia using gene expression profiling. Blood, 106(4):1189-1198, Aug 2005. [ bib | DOI | http | .pdf ]
Accurate diagnosis and classification of leukemias are the bases for the appropriate management of patients. The diagnostic accuracy and efficiency of present methods may be improved by the use of microarrays for gene expression profiling. We analyzed gene expression profiles in bone marrow and peripheral blood samples from 937 patients with all clinically relevant leukemia subtypes (n=892) and non-leukemic controls (n=45) by U133A and B GeneChips (Affymetrix). For each subgroup differentially expressed genes were calculated. Class prediction was performed using support vector machines. Prediction accuracies were estimated by 10-fold cross validation and assessed for robustness in a 100-fold resampling approach using randomly chosen test-sets consisting of 1/3 of the samples. Applying the top 100 genes of each subgroup an overall prediction accuracy of 95.1% was achieved which was confirmed by resampling (median, 93.8%; 95% confidence interval, 91.4%-95.8%). In particular, AML with t(15;17), t(8;21), or inv(16), CLL, and Pro-B-ALL with t(11q23) were classified with 100% sensitivity and 100% specificity. Accordingly, cluster analysis completely separated all of the 13 subgroups analyzed. Gene expression profiling can predict all clinically relevant subentities of leukemia with high accuracy.

Keywords: biosvm microarray
[Haferlach2005AML] Torsten Haferlach, Alexander Kohlmann, Susanne Schnittger, Martin Dugas, Wolfgang Hiddemann, Wolfgang Kern, and Claudia Schoch. AML M3 and AML M3 variant each have a distinct gene expression signature but also share patterns different from other genetically defined AML subtypes. Genes Chromosomes Cancer, 43(2):113-27, Jun 2005. [ bib | DOI | http | .pdf ]
Acute promyelocytic leukemia (APL) with t(15;17) appears in two phenotypes: AML M3, with abnormal promyelocytes showing heavy granulation and bundles of Auer rods, and AML M3 variant (M3v), with non- or hypogranular cytoplasm and a bilobed nucleus. We investigated the global gene expression profiles of 35 APL patients (19 AML M3, 16 AML M3v) by using high-density DNA-oligonucleotide microarrays. First, an unsupervised approach clearly separated APL samples from other AMLs characterized genetically as t(8;21) (n = 35), inv(16) (n = 35), or t(11q23)/MLL (n = 35) or as having a normal karyotype (n = 50). Second, we found genes with functional relevance for blood coagulation that were differentially expressed between APL and other AMLs. Furthermore, a supervised pairwise comparison between M3 and M3v revealed differential expression of genes that encode for biological functions and pathways such as granulation and maturation of hematologic cells, explaining morphologic and clinical differences. Discrimination between M3 and M3v based on gene signatures showed a median classification accuracy of 90% by use of 10-fold CV and support vector machines. Additional molecular mutations such as FLT3-LM, which were significantly more frequent in M3v than in M3 (P < 0.0001), may partly contribute to the different phenotypes. However, linear regression analysis demonstrated that genes differentially expressed between M3 and M3v did not correlate with FLT3-LM.

Keywords: biosvm microarray
[Haasdonk2005Feature] Bernard Haasdonk. Feature space interpretation of SVMs with indefinite kernels. IEEE Trans Pattern Anal Mach Intell, 27(4):482-92, Apr 2005. [ bib | DOI | http | .pdf ]
Kernel methods are becoming increasingly popular for various kinds of machine learning tasks, the most famous being the support vector machine (SVM) for classification. The SVM is well understood when using conditionally positive definite (cpd) kernel functions. However, in practice, non-cpd kernels arise and demand application in SVMs. The procedure of "plugging" these indefinite kernels in SVMs often yields good empirical classification results. However, they are hard to interpret due to missing geometrical and theoretical understanding. In this paper, we provide a step toward the comprehension of SVM classifiers in these situations. We give a geometric interpretation of SVMs with indefinite kernel functions. We show that such SVMs are optimal hyperplane classifiers not by margin maximization, but by minimization of distances between convex hulls in pseudo-Euclidean spaces. By this, we obtain a sound framework and motivation for indefinite SVMs. This interpretation is the basis for further theoretical analysis, e.g., investigating uniqueness, and for the derivation of practical guidelines like characterizing the suitability of indefinite SVMs.

Keywords: Algorithms, Animals, Antibiotics, Antineoplastic, Artificial Intelligence, Automated, Automatic Data Processing, Butadienes, Chloroplasts, Cluster Analysis, Comparative Study, Computer Simulation, Computer-Assisted, Computing Methodologies, Database Management Systems, Databases, Diagnosis, Disinfectants, Dose-Response Relationship, Drug, Drug Toxicity, Electrodes, Electroencephalography, Ethylamines, Expert Systems, Factual, Feedback, Fungicides, Gene Expression Profiling, Genes, Genetic Markers, Humans, Image Enhancement, Image Interpretation, Implanted, Industrial, Information Storage and Retrieval, Kidney, Kidney Tubules, MEDLINE, Male, Mercuric Chloride, Microarray Analysis, Molecular Biology, Motor Cortex, Movement, Natural Language Processing, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Numerical Analysis, Pattern Recognition, Plant Proteins, Predictive Value of Tests, Proteins, Proteome, Proximal, Puromycin Aminonucleoside, Rats, Reproducibility of Results, Research Support, Sensitivity and Specificity, Signal Processing, Sprague-Dawley, Subcellular Fractions, Terminology, Therapy, Time Factors, Toxicogenetics, U.S. Gov't, User-Computer Interface, 15794155
[Ein-Dor2005Outcome] L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany. Outcome signature genes in breast cancer: is there a unique set? Bioinformatics, 21(2):171-178, Jan 2005. [ bib | DOI | http | .pdf ]
MOTIVATION: Predicting the metastatic potential of primary malignant tissues has direct bearing on the choice of therapy. Several microarray studies yielded gene sets whose expression profiles successfully predicted survival. Nevertheless, the overlap between these gene sets is almost zero. Such small overlaps were observed also in other complex diseases, and the variables that could account for the differences had evoked a wide interest. One of the main open questions in this context is whether the disparity can be attributed only to trivial reasons such as different technologies, different patients and different types of analyses. RESULTS: To answer this question, we concentrated on a single breast cancer dataset, and analyzed it by a single method, the one which was used by van't Veer et al. to produce a set of outcome-predictive genes. We showed that, in fact, the resulting set of genes is not unique; it is strongly influenced by the subset of patients used for gene selection. Many equally predictive lists could have been produced from the same analysis. Three main properties of the data explain this sensitivity: (1) many genes are correlated with survival; (2) the differences between these correlations are small; (3) the correlations fluctuate strongly when measured over different subsets of patients. A possible biological explanation for these properties is discussed. CONTACT: eytan.domany@weizmann.ac.il SUPPLEMENTARY INFORMATION: http://www.weizmann.ac.il/physics/complex/compphys/downloads/liate/

Keywords: breastcancer, microarray, featureselection
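Note: the sensitivity of an outcome gene list to the choice of patients, as described in the abstract above, can be illustrated by ranking genes by their correlation with outcome on several random patient subsets and measuring how much the resulting top-k lists overlap. This is a hedged sketch only: the synthetic data, the subset sizes and the function names are assumptions, and the ranking is a plain Pearson correlation rather than a reproduction of the paper's analysis.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def top_k_genes(X, outcome, rows, k):
    """Set of the k genes most correlated (in absolute value) with outcome."""
    rows = sorted(rows)  # fixed order so the result depends only on the subset
    out = [outcome[i] for i in rows]
    scored = []
    for g in range(len(X[0])):
        r = pearson([X[i][g] for i in rows], out)
        scored.append((abs(r), g))
    scored.sort(reverse=True)
    return {g for _, g in scored[:k]}


def mean_pairwise_overlap(X, outcome, subset_size, k, n_subsets=20, seed=0):
    """Average fraction of shared genes between top-k lists from random subsets."""
    rng = random.Random(seed)
    lists = [
        top_k_genes(X, outcome, rng.sample(range(len(X)), subset_size), k)
        for _ in range(n_subsets)
    ]
    overlaps = [len(a & b) / k for a, b in combinations(lists, 2)]
    return sum(overlaps) / len(overlaps)


# Synthetic example (an assumption): 50 genes, each weakly correlated with outcome.
rng = random.Random(42)
outcome = [rng.gauss(0, 1) for _ in range(30)]
X = [[0.3 * outcome[i] + rng.gauss(0, 1) for _ in range(50)] for i in range(30)]
print("overlap, subsets of 15 patients:", mean_pairwise_overlap(X, outcome, 15, 10))
print("overlap, all 30 patients:", mean_pairwise_overlap(X, outcome, 30, 10))
```

When every subset contains all patients the lists coincide, so any drop in overlap for smaller subsets comes from patient selection alone.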
[Dong2005Fast] Jian-xiong Dong, Adam Krzyzak, and Ching Y Suen. Fast SVM training algorithm with decomposition on very large data sets. IEEE Trans Pattern Anal Mach Intell, 27(4):603-18, Apr 2005. [ bib ]
Training a support vector machine on a data set of huge size with thousands of classes is a challenging problem. This paper proposes an efficient algorithm to solve this problem. The key idea is to introduce a parallel optimization step to quickly remove most of the nonsupport vectors, where block diagonal matrices are used to approximate the original kernel matrix so that the original problem can be split into hundreds of subproblems which can be solved more efficiently. In addition, some effective strategies such as kernel caching and efficient computation of kernel matrix are integrated to speed up the training process. Our analysis of the proposed algorithm shows that its time complexity grows linearly with the number of classes and size of the data set. In the experiments, many appealing properties of the proposed algorithm have been investigated and the results show that the proposed algorithm has a much better scaling capability than Libsvm, SVMlight, and SVMTorch. Moreover, the good generalization performances on several large databases have also been achieved.

Keywords: Algorithms, Animals, Antibiotics, Antineoplastic, Artificial Intelligence, Automated, Automatic Data Processing, Butadienes, Chloroplasts, Comparative Study, Computer Simulation, Computer-Assisted, Database Management Systems, Databases, Diagnosis, Disinfectants, Dose-Response Relationship, Drug, Drug Toxicity, Electrodes, Electroencephalography, Ethylamines, Expert Systems, Factual, Feedback, Fungicides, Gene Expression Profiling, Genes, Genetic Markers, Humans, Image Enhancement, Image Interpretation, Implanted, Industrial, Information Storage and Retrieval, Kidney, Kidney Tubules, MEDLINE, Male, Mercuric Chloride, Microarray Analysis, Molecular Biology, Motor Cortex, Movement, Natural Language Processing, Neural Networks (Computer), Non-P.H.S., Non-U.S. Gov't, Numerical Analysis, Pattern Recognition, Plant Proteins, Predictive Value of Tests, Proteins, Proteome, Proximal, Puromycin Aminonucleoside, Rats, Reproducibility of Results, Research Support, Sensitivity and Specificity, Signal Processing, Sprague-Dawley, Subcellular Fractions, Terminology, Therapy, Time Factors, Toxicogenetics, U.S. Gov't, User-Computer Interface, 15794164
[Burckin2005Exploring] T. Burckin, R. Nagel, Y. Mandel-Gutfreund, L. Shiue, T. A. Clark, J.-L. Chong, T.-H. Chang, S. Squazzo, G. Hartzog, and M. Ares. Exploring functional relationships between components of the gene expression machinery. Nat. Struct. Mol. Biol., 12(2):175-82, Feb 2005. [ bib | DOI | http | .pdf ]
Eukaryotic gene expression requires the coordinated activity of many macromolecular machines including transcription factors and RNA polymerase, the spliceosome, mRNA export factors, the nuclear pore, the ribosome and decay machineries. Yeast carrying mutations in genes encoding components of these machineries were examined using microarrays to measure changes in both pre-mRNA and mRNA levels. We used these measurements as a quantitative phenotype to ask how steps in the gene expression pathway are functionally connected. A multiclass support vector machine was trained to recognize the gene expression phenotypes caused by these mutations. In several cases, unexpected phenotype assignments by the computer revealed functional roles for specific factors at multiple steps in the gene expression pathway. The ability to resolve gene expression pathway phenotypes provides insight into how the major machineries of gene expression communicate with each other.

Keywords: biosvm microarray
[Tothill2005expression-based] Richard W Tothill, Adam Kowalczyk, Danny Rischin, Alex Bousioutas, Izhak Haviv, Ryan K van Laar, Paul M Waring, John Zalcberg, Robyn Ward, Andrew V Biankin, Robert L Sutherland, Susan M Henshall, Kwun Fong, Jonathan R Pollack, David D L Bowtell, and Andrew J Holloway. An expression-based site of origin diagnostic method designed for clinical application to cancer of unknown origin. Cancer Res., 65(10):4031-40, May 2005. [ bib | DOI | http | .pdf ]
Gene expression profiling offers a promising new technique for the diagnosis and prognosis of cancer. We have applied this technology to build a clinically robust site of origin classifier with the ultimate aim of applying it to determine the origin of cancer of unknown primary (CUP). A single cDNA microarray platform was used to profile 229 primary and metastatic tumors representing 14 tumor types and multiple histologic subtypes. This data set was subsequently used for training and validation of a support vector machine (SVM) classifier, demonstrating 89% accuracy using a 13-class model. Further, we show the translation of a five-class classifier to a quantitative PCR-based platform. Selecting 79 optimal gene markers, we generated a quantitative-PCR low-density array, allowing the assay of both fresh-frozen and formalin-fixed paraffin-embedded (FFPE) tissue. Data generated using both quantitative PCR and microarray were subsequently used to train and validate a cross-platform SVM model with high prediction accuracy. Finally, we applied our SVM classifiers to 13 cases of CUP. We show that the microarray SVM classifier was capable of making high confidence predictions in 11 of 13 cases. These predictions were supported by comprehensive review of the patients' clinical histories.

Keywords: biosvm microarray
[Garnis2006High] C. Garnis, W. W. Lockwood, E. Vucic, Y. Ge, L. Girard, J. D. Minna, A. F. Gazdar, S. Lam, C. MacAulay, and W. L. Lam. High resolution analysis of non-small cell lung cancer cell lines by whole genome tiling path array CGH. Int. J. Cancer, 118(6):1556-1564, 2006. [ bib | DOI | http ]
Chromosomal regions harboring tumor suppressors and oncogenes are often deleted or amplified. Array comparative genomic hybridization detects segmental DNA copy number alterations in tumor DNA relative to a normal control. The recent development of a bacterial artificial chromosome array, which spans the human genome in a tiling path manner with >32,000 clones, has facilitated whole genome profiling at an unprecedented resolution. Using this technology, we comprehensively describe and compare the genomes of 28 commonly used non-small cell lung carcinoma (NSCLC) cell models, derived from 18 adenocarcinomas (AC), 9 squamous cell carcinomas and 1 large cell carcinoma. Analysis at such resolution not only provided a detailed genomic alteration template for each of these model cell lines, but revealed novel regions of frequent duplication and deletion. Significantly, a detailed analysis of chromosome 7 identified 6 distinct regions of alterations across this chromosome, implicating the presence of multiple novel oncogene loci on this chromosome. As well, a comparison between the squamous and AC cells revealed alterations common to both subtypes, such as the loss of 3p and gain of 5p, in addition to multiple hotspots more frequently associated with only 1 subtype. Interestingly, chromosome 3q, which is known to be amplified in both subtypes, showed 2 distinct regions of alteration, 1 frequently altered in squamous and 1 more frequently altered in AC. In summary, our data demonstrate the unique information generated by high resolution analysis of NSCLC genomes and uncover the presence of genetic alterations prevalent in the different NSCLC subtypes.

Keywords: Carcinoma, Non-Small-Cell Lung, genetics/pathology; Cell Line, Tumor; Chromosomes, Artificial, Bacterial, genetics; Gene Amplification; Gene Dosage; Gene Expression Profiling; Genome, Human, genetics; Humans; Loss of Heterozygosity; Lung Neoplasms, genetics/pathology; Microarray Analysis, methods; Nucleic Acid Hybridization, methods
[Fan2006Concordance] C. Fan, D.S. Oh, L. Wessels, B. Weigelt, D.S.A. Nuyten, A.B. Nobel, L.J. van't Veer, and C.M. Perou. Concordance among gene-expression-based predictors for breast cancer. N. Engl. J. Med., 355(6):560, 2006. [ bib | DOI | http | .pdf ]
Keywords: breastcancer, microarray
[Levy2007Diploid] Samuel Levy, Granger Sutton, Pauline C Ng, Lars Feuk, Aaron L Halpern, Brian P Walenz, Nelson Axelrod, Jiaqi Huang, Ewen F Kirkness, Gennady Denisov, Yuan Lin, Jeffrey R MacDonald, Andy Wing Chun Pang, Mary Shago, Timothy B Stockwell, Alexia Tsiamouri, Vineet Bafna, Vikas Bansal, Saul A Kravitz, Dana A Busam, Karen Y Beeson, Tina C McIntosh, Karin A Remington, Josep F Abril, John Gill, Jon Borman, Yu-Hui Rogers, Marvin E Frazier, Stephen W Scherer, Robert L Strausberg, and J. Craig Venter. The diploid genome sequence of an individual human. PLoS Biol, 5(10):e254, Sep 2007. [ bib | DOI | http ]
Presented here is a genome sequence of an individual human. It was produced from approximately 32 million random DNA fragments, sequenced by Sanger dideoxy technology and assembled into 4,528 scaffolds, comprising 2,810 million bases (Mb) of contiguous sequence with approximately 7.5-fold coverage for any given region. We developed a modified version of the Celera assembler to facilitate the identification and comparison of alternate alleles within this individual diploid genome. Comparison of this genome and the National Center for Biotechnology Information human reference assembly revealed more than 4.1 million DNA variants, encompassing 12.3 Mb. These variants (of which 1,288,319 were novel) included 3,213,401 single nucleotide polymorphisms (SNPs), 53,823 block substitutions (2-206 bp), 292,102 heterozygous insertion/deletion events (indels) (1-571 bp), 559,473 homozygous indels (1-82,711 bp), 90 inversions, as well as numerous segmental duplications and copy number variation regions. Non-SNP DNA variation accounts for 22% of all events identified in the donor, however they involve 74% of all variant bases. This suggests an important role for non-SNP genetic alterations in defining the diploid genome structure. Moreover, 44% of genes were heterozygous for one or more variants. Using a novel haplotype assembly strategy, we were able to span 1.5 Gb of genome sequence in segments >200 kb, providing further precision to the diploid nature of the genome. These data depict a definitive molecular portrait of a diploid human genome that provides a starting point for future genome comparisons and enables an era of individualized genomic information.

Keywords: Base Sequence; Chromosome Mapping, instrumentation/methods; Chromosomes, Human; Chromosomes, Human, Y, genetics; Diploidy; Gene Dosage; Genome, Human; Genotype; Haplotypes; Human Genome Project; Humans; INDEL Mutation; In Situ Hybridization, Fluorescence; Male; Microarray Analysis; Middle Aged; Molecular Sequence Data; Pedigree; Phenotype; Polymorphism, Single Nucleotide; Reproducibility of Results; Sequence Analysis, DNA, instrumentation/methods
[Wirapati2008Meta-analysis] P. Wirapati, C. Sotiriou, S. Kunkel, P. Farmer, S. Pradervand, B. Haibe-Kains, C. Desmedt, M. Ignatiadis, T. Sengstag, F. Schütz, D. R. Goldstein, M. Piccart, and M. Delorenzi. Meta-analysis of gene expression profiles in breast cancer: toward a unified understanding of breast cancer subtyping and prognosis signatures. Breast Cancer Res., 10(4):R65, 2008. [ bib | DOI | http | .pdf ]
INTRODUCTION: Breast cancer subtyping and prognosis have been studied extensively by gene expression profiling, resulting in disparate signatures with little overlap in their constituent genes. Although a previous study demonstrated a prognostic concordance among gene expression signatures, it was limited to only one dataset and did not fully elucidate how the different genes were related to one another nor did it examine the contribution of well-known biological processes of breast cancer tumorigenesis to their prognostic performance. METHOD: To address the above issues and to further validate these initial findings, we performed the largest meta-analysis of publicly available breast cancer gene expression and clinical data, which are comprised of 2,833 breast tumors. Gene coexpression modules of three key biological processes in breast cancer (namely, proliferation, estrogen receptor [ER], and HER2 signaling) were used to dissect the role of constituent genes of nine prognostic signatures. RESULTS: Using a meta-analytical approach, we consolidated the signatures associated with ER signaling, ERBB2 amplification, and proliferation. Previously published expression-based nomenclature of breast cancer 'intrinsic' subtypes can be mapped to the three modules, namely, the ER-/HER2- (basal-like), the HER2+ (HER2-like), and the low- and high-proliferation ER+/HER2- subtypes (luminal A and B). We showed that all nine prognostic signatures exhibited a similar prognostic performance in the entire dataset. Their prognostic abilities are due mostly to the detection of proliferation activity. Although ER- status (basal-like) and ERBB2+ expression status correspond to bad outcome, they seem to act through elevated expression of proliferation genes and thus contain only indirect information about prognosis. Clinical variables measuring the extent of tumor progression, such as tumor size and nodal status, still add independent prognostic information to proliferation genes. 
CONCLUSION: This meta-analysis unifies various results of previous gene expression studies in breast cancer. It reveals connections between traditional prognostic factors, expression-based subtyping, and prognostic signatures, highlighting the important role of proliferation in breast cancer prognosis.

Keywords: microarray, breastcancer
[Reyal2009Analyse] F. Reyal. Analyse du profil d'expression par la technique des puces à ADN. Application à la caractérisation moléculaire et à la détermination du pronostic des cancers canalaires infiltrants du sein. PhD thesis, Université Paris 11, 2009. [ bib ]
Keywords: breastcancer, microarray
[Chen2011Removing] Chao Chen, Kay Grennan, Judith Badner, Dandan Zhang, Elliot Gershon, Li Jin, and Chunyu Liu. Removing batch effects in analysis of expression microarray data: an evaluation of six batch adjustment methods. PLoS One, 6(2):e17238, 2011. [ bib | DOI | http ]
The expression microarray is a frequently used approach to study gene expression on a genome-wide scale. However, the data produced by the thousands of microarray studies published annually are confounded by "batch effects," the systematic error introduced when samples are processed in multiple batches. Although batch effects can be reduced by careful experimental design, they cannot be eliminated unless the whole study is done in a single batch. A number of programs are now available to adjust microarray data for batch effects prior to analysis. We systematically evaluated six of these programs using multiple measures of precision, accuracy and overall performance. ComBat, an Empirical Bayes method, outperformed the other five programs by most metrics. We also showed that it is essential to standardize expression data at the probe level when testing for correlation of expression profiles, due to a sizeable probe effect in microarray data that can inflate the correlation among replicates and unrelated samples.

Keywords: Bayes Theorem; Case-Control Studies; Data Interpretation, Statistical; Gene Expression Profiling, standards/statistics & numerical data; Humans; Microarray Analysis, standards/statistics & numerical data; ROC Curve; Reference Standards; Research Design; Sample Size; Selection Bias; Validation Studies as Topic
