My Research Projects
Details of my top fifteen projects are as follows.
The Center for Expanded Data Annotation and Retrieval (CEDAR) is one of the Centers of Excellence supported through the NIH Big Data to Knowledge (BD2K) program to meet the challenge of harvesting the wealth of information contained in biomedical Big Data. CEDAR’s overall mission is to develop technologies to address Big Data impediments, including lack of appropriate tools, poor data accessibility, and insufficient training. Details can be found at the CEDAR website: http://metadatacenter.org
As part of the CEDAR core team, I have devised a framework for the ontological mapping of immunological data and have developed CEDAR’s NCBI submission module. I am now enhancing my ontology-aware metadata submission pipeline to incorporate AIRR data. Scientists can submit scientific metadata and data from the CEDAR Workbench to NCBI repositories such as BioProject, BioSample, and the Sequence Read Archive (SRA).
1- The CAIRR Pipeline
Abstract: As part of a Working Group, we have developed MiAIRR (Minimum information about an Adaptive Immune Receptor Repertoire Sequencing Experiment), a standard for reporting AIRR sequencing (AIRR-seq) studies. The MiAIRR standard has been implemented using the National Center for Biotechnology Information (NCBI) repositories. Submissions of AIRR-seq data to the NCBI repositories typically use a combination of web-based and flat-file templates and include only a minimal amount of term validation. As a result, metadata at the NCBI are often described using inconsistent terminologies, limiting scientists’ ability to access, find, interoperate, and reuse the data sets and to understand how the experiments were performed. CEDAR (Center for Expanded Data Annotation and Retrieval) develops technologies involving the use of data standards and ontologies to improve metadata quality. In order to improve metadata quality and ease the AIRR-seq study submission process, we have developed an AIRR-seq data submission pipeline named CEDAR-AIRR (CAIRR). CAIRR leverages CEDAR’s technologies to i) create web-based templates whose entries are controlled by ontology terms, and ii) submit the ontology-linked metadata and sequence files (FASTQ) to the NCBI BioProject, BioSample, and Sequence Read Archive (SRA) databases. Thus, CAIRR provides a web-based metadata submission interface that supports compliance with MiAIRR standards. The interface enables ontology-based validation for several data elements, including organism, disease, cell type and subtype, and tissue. This pipeline will facilitate the NCBI submission process and improve the metadata quality of AIRR-seq studies.
Abstract: The National Center for Biotechnology Information (NCBI) repositories constitute a major public resource for sharing diverse types of biomedical data, such as DNA sequences, genes and biological samples. These repositories provide the necessary infrastructure for researchers to submit experimental data and associated metadata as part of the science-dissemination process. However, the corresponding metadata lack use of globally accepted standard identifiers, formats, and terminologies. The ability to find, access, interoperate, and reuse biomedical data sets such as those deposited in the NCBI repositories depends on how well described the associated metadata are. To this end, the Center for Expanded Data Annotation and Retrieval (CEDAR) develops technologies involving the use of data standards and ontologies to improve biomedical metadata. In this paper, we describe a CEDAR–NCBI data submission prototype that incorporates CEDAR’s standards-oriented, ontology-aware technologies to create metadata templates and to enter metadata using these templates for data submission to several NCBI repositories: BioProject, BioSample, and Sequence Read Archive (SRA). We demonstrate how this prototype works by creating NCBI-compatible templates (with extended features) and populating these templates with metadata associated with an Adaptive Immune Receptor Repertoire (AIRR) sequencing data set from a study of Myasthenia Gravis (MG) patients. Our approach has the potential to improve overall quality of metadata in NCBI as well as non-NCBI repositories.
3- CEDAR OnDemand: A browser extension to generate ontology-based scientific metadata
Abstract: Background: Public data repositories generally provide Web-based interfaces to collect experimental metadata. However, these interfaces typically reflect the ad hoc metadata specification practices of the associated repositories, leading to a lack of standardization in the collected metadata. This lack of standardization limits the ability of the source datasets to be broadly discovered, reused, and integrated with other datasets. To increase reuse, discoverability, and reproducibility of the described experiments, datasets should be appropriately annotated by using agreed-upon terms, ideally from ontologies or other controlled term sources. Results: This work presents “CEDAR OnDemand”, a browser extension powered by the NCBO BioPortal API that enables users to seamlessly enter ontology-based metadata through existing Web forms native to individual repositories. CEDAR OnDemand analyzes the web page contents to identify the text input fields and their labels, associates them with related ontologies, and then suggests ontology-based metadata. We employed the NCBI BioSample as a use case to test the CEDAR OnDemand functionality. Conclusion: CEDAR OnDemand helps lower the barrier to incorporating ontologies into standardized metadata entry for public data repositories. CEDAR OnDemand is freely available on the Google Chrome store: https://chrome.google.com/webstore/search/CEDAROndemand
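To give a rough feel for the field-detection step described in the abstract, here is a minimal Python sketch. It is not the extension's code: the real tool runs in the browser and uses the NCBO BioPortal API, whereas this sketch only parses a static HTML string with the standard library. The class `FieldFinder`, the function `suggest_ontologies`, and the keyword table `LABEL_TO_ONTOLOGY` are hypothetical names, and the pairing of labels with ontology acronyms is an assumption made for illustration only.

```python
from html.parser import HTMLParser


class FieldFinder(HTMLParser):
    """Collect text inputs and the <label for="..."> text that describes them."""

    def __init__(self):
        super().__init__()
        self.labels = {}       # input id -> label text
        self.text_inputs = []  # ids of text inputs, in page order
        self._current_for = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "label":
            self._current_for = attrs.get("for")
            if self._current_for:
                self.labels[self._current_for] = ""
        elif tag == "input" and attrs.get("type", "text") == "text" and "id" in attrs:
            self.text_inputs.append(attrs["id"])

    def handle_endtag(self, tag):
        if tag == "label":
            self._current_for = None

    def handle_data(self, data):
        # Accumulate text only while inside a <label for="...">
        if self._current_for:
            self.labels[self._current_for] += data


# Hypothetical keyword -> ontology table (an assumption for this sketch).
LABEL_TO_ONTOLOGY = {
    "organism": "NCBITAXON",
    "tissue": "UBERON",
    "disease": "DOID",
    "cell type": "CL",
}


def suggest_ontologies(html):
    """Return {input id: ontology acronym} for text inputs whose label matches a keyword."""
    finder = FieldFinder()
    finder.feed(html)
    suggestions = {}
    for field_id in finder.text_inputs:
        label = finder.labels.get(field_id, "").strip().lower()
        for keyword, ontology in LABEL_TO_ONTOLOGY.items():
            if keyword in label:
                suggestions[field_id] = ontology
                break
    return suggestions
```

In this sketch a field is matched by a simple substring test on its label; fields with no matching label, and inputs that are not text fields, are left without a suggestion.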
4- MiAIRR: Minimum information about an Adaptive Immune Receptor Repertoire Sequencing Experiment
Our talk starts at 2:04:14
5- AIRRPort: The Adaptive Immune Receptor Repertoire Study Database and Discovery Portal
Human Immunology Project Consortium Projects
6- HIPC Data Template Standardization: The story so far
Abstract: Background: Systems biology is playing an increasingly important role in unraveling the complexity of human immune responses. An important aspect of this approach involves the integration and analysis of different types of high-throughput measurements (“omics” data) to understand or predict the behavior of biological systems under different experimental conditions. With the growing number of biological databases, we have seen a proliferation of custom templates provided by different repositories/databases for data submission. Public repositories such as GEO and ArrayExpress provide MIAME-compliant templates for submission of microarray gene expression data. Although these templates represent the same type of data and are compliant with the MIAME standard, they are described using different terminologies (e.g., platform vs. array design). This makes data submission, comparison, and integration difficult between different repositories. Methods: We propose a domain-specific architecture to enable mapping of local terms from individual data submission templates to ontology concepts through the NCBO ontology annotator. In this paper, we mapped template types such as study, human subject, biosamples, and reagents, template columns (e.g., biosample type and age), and value sets (e.g., controlled values such as serum and urine for biosample type) to ontology concepts found in BioPortal. Results and conclusion: Such ontology term mapping allows us to achieve semantic normalization across different templates used for submitting data to bio-repositories. Ontology-based data standards constitute a powerful way to enable semantic data integration across studies and lower the barriers to data sharing. Experimental results can be shared without the problem of lengthy and ambiguous descriptions of experimental conditions and file formats.
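The idea of semantic normalization in the abstract above can be illustrated with a small Python sketch. It is only a toy, not the architecture from the paper: a hand-written dictionary stands in for the NCBO annotator, the concept identifiers (`EX:0001` and so on) are made-up placeholders rather than real ontology IDs, and the function names `normalize_columns` and `shared_concepts` are hypothetical. The term "sample type" is also an assumed synonym added for the example.

```python
# Hypothetical local term -> concept ID table. The IDs are placeholders,
# not real ontology identifiers; in practice an annotator service would
# supply the mapping.
TERM_TO_CONCEPT = {
    "platform": "EX:0001",
    "array design": "EX:0001",   # different local term, same concept
    "biosample type": "EX:0002",
    "sample type": "EX:0002",    # assumed synonym, for illustration only
    "age": "EX:0003",
}


def normalize_columns(columns):
    """Map each local column name to a concept ID; unmapped names map to None."""
    return {col: TERM_TO_CONCEPT.get(col.strip().lower()) for col in columns}


def shared_concepts(template_a, template_b):
    """Return the concept IDs that two templates have in common after normalization."""
    a = {c for c in normalize_columns(template_a).values() if c}
    b = {c for c in normalize_columns(template_b).values() if c}
    return a & b
```

With this table, a template that uses the column "Platform" and another that uses "Array design" resolve to the same concept, so the two columns can be compared or merged even though the local names differ.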
7- Towards Ontological Mapping of Immunological Data Standards
Semantic Enrichment and Similarity Approximation on Biomedical Sequence Images
12- BioNLP-SADI: An Interoperable framework for BioNLP SADI web service
13- HC2LOD: Canadian Health Census to Linked Open Data
14- Smart Mechanism to automate the personalized itinerary based on Type-2 Fuzzy Ontology
15- Intelligent Vessel Traffic Service Simulator
These are my top 15 projects. I have worked on several other data science and artificial intelligence projects. You can find details of a few of them in the Selected Publications tab. If you need more information about my research work or would like to collaborate, please email me at ahmad[dot]chan[@]yale.edu