So many ways to analyze high-throughput biological data
I am a huge advocate for using R for data analysis. Returning from my vacation, I got to catch up on my reading. This week I read about a number of R packages for high-throughput biological, or more accurately, genomic data analysis.
While reading my favorite R blog, I discovered an article about two popular gene annotation/pathway analysis packages: clusterProfiler and GeneAnswers. I discovered GeneAnswers 4 years ago at ISMB Boston through Simon Lin’s group (the package creators). The article, titled “why clusterProfiler fails”, emphasizes the importance of setting the correct background gene list for pathway analysis based on the hypergeometric model. The title is actually a little misleading. The article states that clusterProfiler usually reports a higher p-value, which leads to fewer false positives, and that this is a good thing. Exploring the clusterProfiler package further, I found that it works well with many pathway databases, including KEGG and Reactome. As KEGG is transitioning to a pay model, Reactome seems to be a good replacement. That led me to explore Reactome further, and I discovered that the Reactome website is quite well developed and maintained. One of the selling points of KEGG is its pathway layout. After briefly exploring the Reactome website, I found that they have worked hard to provide a similar layout, with connectivity to Cytoscape via an app called ReactomeFIViz. There is also a Bioconductor package, ReactomePA, for performing pathway analysis using the Reactome database. This is very exciting. Now, if only the Bioconductor package pathview would let me draw Reactome diagrams, I would be a happy camper.
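To make the background issue concrete, here is a minimal sketch of the hypergeometric enrichment test in plain Python. This is not the code of either package; the function name, argument names, and all the numbers are made up for illustration. It only shows how the same overlap between a gene list and a pathway gets a different p-value when the background size changes.

```python
from math import comb


def hypergeom_enrichment_pvalue(background_size, pathway_size, list_size, overlap):
    """Upper-tail hypergeometric p-value, P(X >= overlap).

    X is the number of pathway genes seen when list_size genes are drawn
    without replacement from a background of background_size genes, of
    which pathway_size belong to the pathway. All names are hypothetical.
    """
    max_overlap = min(pathway_size, list_size)
    total = comb(background_size, list_size)
    tail = sum(
        comb(pathway_size, k) * comb(background_size - pathway_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Illustrative numbers only: 200 significant genes, a 100-gene pathway,
# and 10 genes in common, tested against two different backgrounds.
p_large_background = hypergeom_enrichment_pvalue(20000, 100, 200, 10)
p_small_background = hypergeom_enrichment_pvalue(6000, 100, 200, 10)

print("background = 20000:", p_large_background)
print("background =  6000:", p_small_background)
```

With the smaller background, more overlap is expected by chance, so the same 10 shared genes yield a higher (more conservative) p-value. That is why the choice of background gene list matters so much for this kind of analysis.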
In the area of integrated data analysis, I read about Epiviz. Epiviz is a web tool that integrates multiple genomic datasets for visualization and analysis. The idea is to tie together multiple dashboard-like panels to connect data across multiple databases and data types. For example, I might be interested in connecting my significant gene list to potential transcription factor binding sites. By laying out these two pieces of information side by side, you can easily search them in the same space. The implementation is very elaborate and sleek. What excites me the most is its accompanying Bioconductor package, Epivizr, which allows me to compile, build and control how various data types are combined and visualized in the web browser. This sounds like a very good student project, using a specific example for biomedical research and discovery. The R package mentions the AnnotationHub package, which was designed to pull information from big databases into genomic analysis.
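The gene list and binding site example boils down to an interval overlap question. Here is a rough sketch in plain Python of what that connection might look like. It has nothing to do with how Epiviz or Epivizr work internally; the `Interval` type, the `overlapping_sites` function, and every coordinate below are invented for illustration, and intervals are assumed to be closed on both ends.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A named genomic interval (hypothetical, closed coordinates)."""
    chrom: str
    start: int
    end: int
    name: str


def overlapping_sites(genes, sites, flank=0):
    """Map each gene name to the binding sites that overlap the gene,
    optionally extended by `flank` bases on both sides."""
    hits = {}
    for gene in genes:
        hits[gene.name] = [
            site.name
            for site in sites
            if site.chrom == gene.chrom
            and site.start <= gene.end + flank
            and site.end >= gene.start - flank
        ]
    return hits


# Made-up genes and transcription factor binding sites.
genes = [
    Interval("chr1", 1000, 5000, "GENE_A"),
    Interval("chr2", 200, 900, "GENE_B"),
]
sites = [
    Interval("chr1", 800, 950, "TF_site_1"),
    Interval("chr1", 4000, 4020, "TF_site_2"),
    Interval("chr2", 5000, 5020, "TF_site_3"),
]

hits = overlapping_sites(genes, sites, flank=100)
print(hits)
```

A simple scan like this is fine for a handful of genes. For genome-scale data you would want an indexed structure instead, which is the kind of heavy lifting I would rather leave to a dedicated package.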
I am always interested in how to use datasets from big data consortia for original research. This week I read a paper doing just that. The thesis of the paper was to use pathways, instead of gene lists, to classify breast cancer types more accurately and consistently. The authors extracted the datasets from TCGA, EGA and GEO. I think it is very important for biological data scientists to understand these data repositories in order to better utilize them for future analysis and discovery.