AHA Approved Data Repositories

The ArrayExpress Archive is a database of functional genomics experiments including gene expression where you can query and download data collected to MIAME and MINSEQE standards. Gene Expression Atlas contains a subset of curated and re-annotated Archive data which can be queried for individual gene expression under different biological conditions across experiments.

BioModels Database is a repository of computational models of biological processes. Models described from literature are manually curated and enriched with cross-references. All models are provided in the Public Domain.

The cancer Nanotechnology Laboratory (caNanoLab) data portal is an NIH-supported, publicly-accessible repository designed to enable sharing of nanomaterials data, and to expedite and validate the use of nanoparticles in biomedicine.

The purpose of CellML is to store and exchange computer-based mathematical models. CellML allows scientists to share models even if they are using different modeling tools. It also enables them to reuse components from one model in another, thus accelerating model development.

ClinicalTrials.gov is a Web-based resource that provides patients, their family members, health care professionals, researchers, and the public with easy access to information on publicly and privately supported clinical studies on a wide range of diseases and conditions.

ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence. ClinVar facilitates access to and communication about the relationships asserted between human variation and observed health status, and the history of that interpretation.

COSMIC is designed to store and display somatic mutation information and related details and contains information relating to human cancers.

Dataverse (general)
The Dataverse Network is an open source application to publish, share, reference, extract and analyze research data.

database of Genotypes and Phenotypes (dbGaP) (open section)
The database of Genotypes and Phenotypes (dbGaP) was developed to archive and distribute the results of studies that investigate the interaction of genotype and phenotype. Such studies include genome-wide association studies, medical sequencing, molecular diagnostic assays, as well as association between genotype and non-clinical traits.
dbGaP provides two levels of access - open and controlled - in order to allow broad release of non-sensitive data, while providing oversight and investigator accountability for sensitive data sets involving personal health information. The assumption is that AHA-funded data would fall under the open category unless there are exceptional circumstances.

database of Single Nucleotide Polymorphisms (dbSNP)
In collaboration with the National Human Genome Research Institute, the National Center for Biotechnology Information has established the dbSNP database to serve as a central repository for both single base nucleotide substitutions and short deletion and insertion polymorphisms.

European Genome-phenome Archive (EGA)
The European Genome-phenome Archive (EGA) is designed to be a repository for all types of genotype experiments, including case control, population, and family studies. It includes SNP and CNV genotypes from array based methods and genotyping done with re-sequencing methods. This data may be either publicly available or limited access, depending on the design of the study.

Electron Microscopy DataBank is a unified global portal for deposition and retrieval of 3DEM density maps, atomic models, and associated metadata, as well as a resource for news, events, software tools, data standards, and validation methods for the 3DEM community.

European Nucleotide Archive (ENA)
Europe's primary nucleotide sequence resource. The main sources for DNA and RNA sequences are direct submissions from individual researchers, genome sequencing projects and patent applications. It is one part of the European Bioinformatics Institute (EMBL-EBI), which maintains the world’s most comprehensive range of freely available and up-to-date molecular data resources.

exRNA Atlas
The exRNA Atlas is the data repository of the Extracellular RNA Communication Consortium (ERCC). It is developed and maintained by the Data Management and Resource Repository (DMRR). It includes qPCR-derived exRNA profiles from human and mouse biofluids and conditions and currently stores data profiled from small RNA sequencing assays.

figshare (general)
figshare allows users to upload any file format to be made visualisable in the browser so that figures, datasets, media, papers, posters, presentations and file sets can be disseminated in a way that the current scholarly publishing model does not allow.

FlowRepository is a web-based application accessible from a web browser that serves as an online database of flow cytometry experiments where users can query and download data collected and annotated according to the MIFlowCyt standard.

FlyBase is a database of genetic and molecular data for D. melanogaster and other Drosophila species, targeted to an audience of research professionals.

GenBank is an annotated collection of publicly available DNA sequences through the National Center for Biotechnology Information databases. GenBank contains over 135,000,000 sequence records and is updated every two months. GenBank is part of the International Nucleotide Sequence Database Collaboration along with the DNA DataBank of Japan and the European Molecular Biology Laboratory.

GitHub (source code)
Repository for open source code.

Global Health Data Exchange (GHDx)
The Global Health Data Exchange (GHDx) is a catalog of global health and demographic data. The goal of the GHDx is to help people locate data by cataloging information about data including the topics covered, by providing links to data providers or explaining how to acquire the data, and in cases where we have permission, providing the data directly for download. Use the GHDx to research population census data, surveys, registries, indicators and estimates, administrative health data, and financial data related to health.

GlycoPOST is a mass spectrometry data repository for glycomics. It consists of a high-speed file upload process, flexible file management system and easy-to-use interfaces. Submission conditions are in accordance with the Minimum Information Required for a Glycomics Experiment (MIRAGE) guidelines.

ImmPort Shared Data enables searching and downloading of shared biomedical research data funded from NIAID, DAIT, DMID, other NIH agencies, and non-government sources. Additional resources include step-by-step data reuse tutorials with example R and Python analysis code, the Cell Ontology Visualizer, the Cytokine Registry, 10,000 Immunomes - a reference dataset for human immunology, and immuneXpresso - the cytokine and cell interaction literature mining tool.

IntAct provides a freely available, open source database system and analysis tools for protein interaction data. All interactions are derived from literature curation or direct user submissions and are freely available.

International Mouse Phenotyping Consortium (IMPC)
The International Mouse Phenotyping Consortium is an international scientific endeavor to create and characterize the phenotype of 20,000 knockout mouse strains. Using a standardized phenotyping protocol, the IMPC integrates data with existing mouse and human disease resources and provides strains and phenotype data for use by the research community.

MetaboLights is a database for Metabolomics experiments and derived information. The database is cross-species, cross-technique and covers metabolite structures and their reference spectra as well as their biological roles, locations and concentrations, and experimental data from metabolic experiments.

Metabolomics Workbench
The Metabolomics Workbench serves as a national and international repository for metabolomics data and metadata and provides analysis tools and access to metabolite standards, protocols, tutorials, training, and more. The Workbench is a companion to RCMRCs and is a part of the Common Fund Initiative in metabolomics.

National Collection of Pathogenic Viruses
A wide-ranging archive of well-characterized, authenticated human pathogens that supports the supply of viruses, and materials derived from them, to the scientific community. The National Collection of Pathogenic Viruses comprises over 300 human pathogenic viruses available for supply which require handling at UK biosafety containment levels 2, 3 and 4.

National Collection of Type Cultures
The National Collection of Type Cultures (NCTC) is a specialized laboratory located in the Central Public Health Laboratory, Colindale. It accesses, preserves and supplies authentic cultures of bacteria and mycoplasmas that are pathogenic to man or other animals that may occur in food or water and in hospital or health-related environments and which can be preserved by freeze-drying.

NCBI BioProject
The BioProject repository collects projects with biological data that relates to a single initiative that originates from a single entity or consortium. Records provide users with a single location for the links to diverse data types generated for those projects.

NCBI BioSample
The BioSample database contains descriptions of biological source materials used in experimental assays.

NCBI Conserved Domains Database
The Conserved Domains Database (CDD) contains annotations of functional units in proteins, including multiple sequence alignment models for ancient domains and full-length proteins. This collection of models includes 3D structures that display the sequence/structure/function relationships in proteins. Users can identify amino acids in protein sequences with the resources available through CDD as well as view single sequences embedded within multiple sequence alignments.

NCBI dbVar
dbVar is a database of genomic structural variation containing data from multiple gene studies. It is designed to store data on variant DNA ≥ 1 bp in size. It is recommended that variation data > 50 bp be submitted to dbVar and variation data ≤ 50 bp to dbSNP. All clinically relevant structural variation should be submitted to ClinVar or dbGaP. Users can browse data containing the number of variant calls from each study, and filter studies by organism, study type, method and genomic variant. Organisms include human, mouse, cattle, and additional animals.

NCBI Gene
Integrating information from a variety of species, records in Gene include nomenclature, Reference Sequences, maps, pathways, variations, phenotypes, and links to genome-specific, phenotype-specific, and locus-specific resources.

NCBI Genome
The Genome database contains annotations and analyses of eukaryotic and prokaryotic genomes, as well as tools that allow users to compare genomes and gene sequences from humans, microbes, plants, viruses and organelles. Users can browse by organism, and view genome maps and protein clusters.

NCBI GEO Datasets
An international public repository, GEO (Gene Expression Omnibus) DataSets archives and distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomics data. The records include original submitter-supplied records (Series, Samples and Platforms) and curated DataSets. GEO aims to provide a database that efficiently stores this data; offer simple submission procedures and formats that support complete and well-annotated data deposits from the research community; and provide user-friendly mechanisms for users to find and use studies and gene expression profiles of interest. GEO DataSets provides tools to identify differences in gene expression levels and cluster heatmaps.

NCBI GEO Profiles
The Gene Expression Omnibus (GEO) database stores individual gene expression profiles from NCBI databases and is searchable by gene annotation as well as gene profile characteristics. GEO archives microarray and next-generation sequencing as well as other forms of genomic data submitted by researchers within the scientific community.

NCBI HomoloGene
The HomoloGene database provides a system for the automated detection of homologs among annotated genes of genomes across multiple species. These homologs are fully documented and organized by homology group. HomoloGene processing uses proteins from input organisms to compare and sequence homologs, mapping back to corresponding DNA sequences.

NCBI Nucleotide
The NCBI Nucleotide database collects sequences from such sources as GenBank, RefSeq, TPA, and PDB. Sequences collected relate to genome, gene, and transcript sequence data, and provide a foundation for research related to the biomedical field.

NCBI PopSet collects DNA sequences to analyze the ways that populations are related by evolution. Such sequences indicate if populations originate from different members of the same species or from organisms of different species entirely.

NCBI Protein
The Protein database collects protein sequences related to biological structure and function. The sequences in NCBI Protein come from the translations from annotated coding regions in GenBank, RefSeq, and TPA, and records from SwissProt, PIR, PRF, and PDB.

NCBI Protein Clusters
The Entrez Protein Clusters database contains annotation information, publications, structures and analysis tools for related protein sequences encoded by complete genomes. The data available in the Protein Clusters Database is generated from prokaryotic genomic studies and is intended to assist researchers studying micro-organism evolution as well as other biological sciences. Available genomes include plants and viruses as well as organelles and microbial genomes.

NCBI Reference Sequence
The Reference Sequence database provides explicitly linked nucleotide and protein sequences, as well as comprehensive and annotated sequence sets with genomic DNA, proteins and transcripts. Users have access to a wealth of resources for gene identification, comparative analysis and genome research. Reference Sequences are available for naturally occurring DNA, RNA and protein sequences in organic species worldwide.

NCBI Structure
The Structure database provides three-dimensional structures of macromolecules for a variety of research purposes and allows the user to retrieve structures for specific molecule types as well as structures for genes and proteins of interest. Three main databases comprise Structure: the Molecular Modeling Database; Conserved Domains and Protein Classification; and the BioSystems Database. Structure also links to the PubChem databases to connect biological activity data to the macromolecular structures. Users can locate structural templates for proteins and interactively view structures and sequence data to closely examine sequence-structure relationships.

NCBI Taxonomy
Currently covering about 10 percent of the described species on the planet and more than 175,000 taxa, Taxonomy is a curated classification and nomenclature for all organisms in the public sequence databases. Taxonomy gives species names and higher-level classifications of the organisms represented in the Entrez sequence databases. It maintains a phylogenetic classification (containing only monophyletic groups if possible). Most species are represented only by a small piece of sequence data that's insufficient to construct a full phylogeny, but some species contain complete genomes.

NITRC facilitates finding and comparing neuroimaging resources for functional and structural neuroimaging analyses. NITRC and its components, the Resources Registry (NITRC-R), Image Repository (NITRC-IR), and Computational Environment (NITRC-CE), offer researchers PET/SPECT, CT, EEG/MEG, optical imaging, clinical neuroinformatics, computational neuroscience, and imaging genomics software tools, data, and computational resources.

Online Mendelian Inheritance in Animals (OMIA)
Online Mendelian Inheritance in Animals contains textual information, references, links, and relevant records related to genes, traits, and inherited disorders in animals.

Online Mendelian Inheritance in Man (OMIM)
OMIM contains authoritative medical data on all known mendelian disorders as well as full-text and referenced overviews on the relationship between phenotype and genotype. Users can search the OMIM database by chromosome, narrow their search results by known gene sequences, phenotypes and gene map locus, or search only clinical synopses using any combination of 22 specified criteria. The information contained in OMIM is available to download for personal, educational and research uses.

A service of the Inter-university Consortium for Political and Social Research (ICPSR), openICPSR is a self-publishing repository for social, behavioral, and health sciences research data. openICPSR is particularly well-suited for the deposit of replication data sets for researchers who need to publish their raw data associated with a journal article so that other researchers can replicate their findings.

OpenNeuro
A free and open platform for sharing MRI, MEG, EEG, iEEG, ECoG, and ASL data.

Open Science Framework (general)
Open Science Framework is a repository hosted by the Center for Open Science, which is a non-profit technology company providing free and open services to increase inclusivity and transparency of research.

Panorama is a freely-available, open-source repository server application for targeted mass spectrometry assays that integrates into a Skyline mass spec workflow.

Protein Data Bank in Europe
The EBI Protein Structure Database in Europe is a project for the collection, management and distribution of data about macromolecular structures, derived from the Protein Data Bank (PDB). It is one of the founding members of the Worldwide Protein Data Bank (wwPDB).

The ProteomeXchange Consortium was established to provide globally coordinated standard data submission and dissemination pipelines involving the main proteomics repositories, and to encourage open data policies in the field. ProteomeXchange fully supports both MS/MS proteomics and SRM data submission. Submission of other types of proteomics data is also possible using the Partial Submission mechanism.

PRoteomics IDEntifications database (PRIDE)
The PRoteomics IDEntifications (PRIDE) database at EMBL-EBI is a centralized, standards-compliant, public data repository for proteomics data. It has been developed to provide the proteomics community with a public repository for protein and peptide identifications together with the evidence supporting these identifications. PRIDE is also able to capture details of post-translational modifications. It is a core member in the ProteomeXchange (PX) consortium.

PubChem is an open chemistry database at the NIH. It accepts and stores information on chemical structures, identifiers, chemical and physical properties, biological activities, patents, health, safety, toxicity data, and many others.

Rat Genome Database
The Rat Genome Database houses genomic, genetic, functional, physiological, pathway and disease data for the laboratory rat as well as comparative genomics between rat, human and mouse.

RCSB Protein Data Bank (PDB)
The Protein Data Bank (PDB) archive is the single worldwide repository of information about the 3D structures of large biological molecules, including proteins and nucleic acids.

Sequence Read Archive (SRA)
The Sequence Read Archive stores the raw sequencing data from such sequencing platforms as the Roche 454 GS System, the Illumina Genome Analyzer, the Applied Biosystems SOLiD System, the Helicos Heliscope, and Complete Genomics. It archives the sequencing data associated with RNA-Seq, ChIP-Seq, Genomic and Transcriptomic assemblies, and 16S ribosomal RNA data.

The Cancer Imaging Archive (TCIA)
TCIA is a service that de-identifies and hosts a large archive of medical images of cancer accessible for public download. The data are organized as “Collections”, typically patients related by a common disease (e.g. lung cancer), image modality (MRI, CT, etc) or research focus. DICOM is the primary file format used by TCIA for image storage. Supporting data related to the images such as patient outcomes, treatment details, genomics, pathology, and expert analyses are also provided when available.

The Image Data Resource (IDR)
The Image Data Resource (IDR) is a public repository of reference image datasets from published scientific studies. IDR enables access, search and analysis of these highly annotated datasets.

TriTrypDB is an integrated genomic and functional genomic database for pathogens of the family Trypanosomatidae, including organisms in both Leishmania and Trypanosoma genera.

UK Data Archive
The UK Data Archive (UKDA) is a center of expertise in data acquisition, preservation, dissemination and promotion and is curator of the largest collection of digital data in the social sciences and humanities in the UK.

WikiPathways is an open, collaborative platform dedicated to the curation of biological pathways. Building on the same MediaWiki software that powers Wikipedia, the platform has a custom graphical pathway editing tool and integrated databases covering major gene, protein, and small-molecule systems.

WormBase is an online biological database about the biology and genome of the nematode model organism Caenorhabditis elegans and contains information about other related nematodes.

Zenodo (general)
Zenodo builds and operates a simple and innovative service that enables researchers, scientists, EU projects and institutions to share and showcase multidisciplinary research results (data and publications) that are not part of the existing institutional or subject-based repositories of the research communities.

ZFIN serves as the zebrafish model organism database, an online database of information for zebrafish researchers.

Subject-focused repositories, when available, are preferred over general repositories.

