AHA Approved Data Repositories
Repository policies -- including storage limitations, support, and fees -- are subject to change. The AHA reviews and updates this list periodically, but individual investigators are encouraged to verify a repository's acceptability and appropriateness for their specific project.
ArrayExpress
ArrayExpress, the functional genomics data collection, stores data from high-throughput functional genomics experiments and provides data for reuse to the research community. In line with community guidelines, a study typically contains metadata such as detailed sample annotations, protocols, processed data and raw data. It generates MAGE-TAB format and supports high metadata standards in compliance with MIAME/MINSEQE/MINSCE guidelines.
BioMagResBank (BMRB)
BioMagResBank (BMRB) collects, annotates, archives, and disseminates spectral and quantitative data derived from NMR spectroscopic investigations of biological macromolecules and metabolites.
BioModels
BioModels is a repository of mathematical models of biological and biomedical systems. It hosts a vast selection of existing literature-based physiologically and pharmaceutically relevant mechanistic models in standard formats. All models are provided in the Public Domain.
caNanoLab
The cancer Nanotechnology Laboratory (caNanoLab) data portal is an NIH-supported, publicly accessible repository designed to enable sharing of nanomaterials data and to expedite and validate the use of nanoparticles in biomedicine.
CellML
The purpose of CellML is to store and exchange computer-based mathematical models. CellML allows scientists to share models even if they are using different modeling tools. It also enables them to reuse components from one model in another, thus accelerating model development.
ClinicalTrials.gov
This repository may be selected if your study must be registered as a clinical trial. ClinicalTrials.gov is a Web-based resource that provides patients, their family members, health care professionals, researchers, and the public with easy access to information on publicly and privately supported clinical studies on a wide range of diseases and conditions.
If you choose this repository, however, you must use it in conjunction with a general repository to satisfy the AHA’s Open Data requirements, because data deposited into ClinicalTrials.gov are only summary data. Please also select one of the following: Dataverse, Dryad, figshare, Mendeley, Open Science Framework, or Zenodo.
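The pairing rule above can be expressed as a simple check. The following is a minimal sketch, not an AHA tool: the function name and the idea of representing a selection as a list of repository names are assumptions made for illustration, and the set of general repositories is copied from the sentence above.

```python
# Hypothetical helper for illustration; not part of any AHA system.
# General repositories named in the ClinicalTrials.gov pairing rule.
GENERAL_REPOSITORIES = {
    "Dataverse",
    "Dryad",
    "figshare",
    "Mendeley",
    "Open Science Framework",
    "Zenodo",
}


def satisfies_pairing_rule(selected):
    """Return True if a repository selection meets the pairing rule.

    The rule applies only when ClinicalTrials.gov is selected; in that
    case at least one of the listed general repositories must be
    selected as well.
    """
    chosen = set(selected)
    if "ClinicalTrials.gov" not in chosen:
        return True
    # A non-empty intersection means a general repository was also chosen.
    return bool(chosen & GENERAL_REPOSITORIES)
```

For example, a selection of ClinicalTrials.gov plus Dryad passes this check, while ClinicalTrials.gov alone, or ClinicalTrials.gov plus a subject-focused repository only, does not.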
ClinVar
ClinVar is a freely accessible, public archive of reports of human variations classified for diseases and drug responses, with supporting evidence. ClinVar thus facilitates access to and communication about the relationships asserted between human variation and observed conditions, and the history of those assertions.
COSMIC
COSMIC, the Catalogue of Somatic Mutations in Cancer, is designed to store and display somatic mutation information and related details, and contains information relating to human cancers.
Culture Collections
The UK Health Security Agency (UKHSA) is the custodian of four unique collections that consist of expertly preserved, authenticated cell lines and microbial strains of known provenance. Scientists across the world use UKHSA Culture Collection materials to determine the effects of drugs, cosmetics, radiation, viruses, pesticides and household chemicals on human cells. Culture Collection cells and microorganisms are also used worldwide as controls for diagnostic and antimicrobial susceptibility tests and for developing vaccines, anti-cancer drugs and treatments for metabolic diseases. All the collections are developed, managed and maintained by highly trained, dedicated staff who work in accordance with internationally recognised quality standards including certification to ISO 9001:2015.
Dataverse (general)
The Dataverse Project is an open source web application to share, preserve, cite, explore, and analyze research data.
database of Genotypes and Phenotypes (dbGaP) (open section)
The database of Genotypes and Phenotypes (dbGaP) was developed to archive and distribute the results of studies that investigate the interaction of genotype and phenotype. Such studies include genome-wide association studies, medical sequencing, molecular diagnostic assays, as well as association between genotype and non-clinical traits.
dbGaP provides two levels of access - open and controlled - in order to allow broad release of non-sensitive data, while providing oversight and investigator accountability for sensitive data sets involving personal health information. The assumption is that AHA-funded data would fall under the open category unless there are exceptional circumstances.
Dryad (general)
Dryad is a non-profit, open-source data repository that allows researchers to share, publish, and preserve publicly available research data related to the basic sciences and medicine. Dryad is a curated general-purpose repository that makes data discoverable, freely reusable, and citable.
Electron Microscopy Data Bank (EMDB)
Electron Microscopy Data Bank is a public repository for electron cryo-microscopy volume maps and tomograms of macromolecular complexes and subcellular structures. It covers a variety of techniques, including single-particle analysis, electron tomography, and electron crystallography.
European Genome-phenome Archive (EGA)
The European Genome-phenome Archive (EGA) is designed to be a repository for all types of genotype experiments, including case control, population, and family studies. It includes SNP and CNV genotypes from array based methods and genotyping done with re-sequencing methods. This data may be either publicly available or limited access, depending on the design of the study.
European Molecular Biology Laboratory/European Bioinformatics Institute (EMBL-EBI)
The European Bioinformatics Institute (EBI), part of the European Molecular Biology Laboratory (EMBL), supports data resources and repositories encompassing DNA and protein sequences and structures, genome annotation, gene expression information, and molecular interactions and pathways.
European Nucleotide Archive (ENA)
The European Nucleotide Archive (ENA) is an open, supported platform that provides management, sharing, integration, archiving and dissemination of sequence data. ENA provides a comprehensive record of the world’s nucleotide sequencing information, covering raw sequencing data, sequence assembly information and functional annotation.
exRNA Atlas
The exRNA Atlas is the data repository of the Extracellular RNA Communication Consortium (ERCC). The repository includes small RNA sequencing and qPCR-derived exRNA profiles from human and mouse biofluids. All RNA-seq datasets are processed using version 4 of the exceRpt small RNA-seq pipeline (Rozowsky et al., 2019) and ERCC-developed quality metrics are uniformly applied to these datasets.
figshare (general)
figshare is a repository where users can make all of their research outputs available in a citable, shareable and discoverable manner.
FlowRepository
FlowRepository is a web-based application accessible from a web browser that serves as an online database of flow cytometry experiments where users can query and download data collected and annotated according to the MIFlowCyt standard.
FlyBase
FlyBase is a database of genetic and molecular data for D. melanogaster and other Drosophila species, targeted to an audience of research professionals.
GenBank
GenBank® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences (Nucleic Acids Research, 2013 Jan;41(D1):D36-42). GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Nucleotide Archive (ENA), and GenBank at NCBI. These three organizations exchange data on a daily basis. A GenBank release occurs every two months.
Genomic Expression Archive (GEA)
Genomic Expression Archive (GEA) is a public database of functional genomics data such as gene expression, epigenetics, and genotyping SNP array.
GitHub (source code)
Repository for open source code.
Global Health Data Exchange (GHDx)
The Global Health Data Exchange (GHDx) is a catalog of global health and demographic data. The goal of the GHDx is to help people locate data by cataloging information about data including the topics covered, by providing links to data providers or explaining how to acquire the data, and in cases where we have permission, providing the data directly for download. Use the GHDx to research population census data, surveys, registries, indicators and estimates, administrative health data, and financial data related to health.
GlycoPOST
GlycoPOST is a mass spectrometry data repository for glycomics. It consists of a high-speed file upload process, flexible file management system and easy-to-use interfaces. Submission conditions are in accordance with the Minimum Information Required for a Glycomics Experiment (MIRAGE) guidelines.
ImmPort
The Immunology Database and Analysis Portal (ImmPort) has been developed under the ImmPort Contract by the Peraton team for the National Institutes of Health (NIH), National Institute of Allergy and Infectious Diseases (NIAID), Division of Allergy, Immunology, and Transplantation (DAIT). The ImmPort project provides advanced information technology support in the archiving and exchange of scientific data for the diverse community of life science researchers supported by NIAID/DAIT and serves as a long-term, sustainable archive of research and clinical data. The core component of ImmPort is an extensive data warehouse containing experimental data and metadata describing the purpose of the study and the methods of data generation. The functionality of ImmPort will be expanded continuously over the life of the BISC project to accommodate the needs of expanding research communities. The shared research and clinical data, as well as the analytical tools in ImmPort are available to any researcher after registration.
IntAct
IntAct provides a freely available, open source database system and analysis tools for protein interaction data. All interactions are derived from literature curation or direct user submissions and are freely available.
International Molecular Exchange Consortium (IMEx)
The International Molecular Exchange Consortium (IMEx) provides access to a non-redundant set of physical molecular interaction data from a broad taxonomic range of organisms.
International Mouse Phenotyping Consortium (IMPC)
The International Mouse Phenotyping Consortium is an international scientific endeavor to create and characterize the phenotype of 20,000 knockout mouse strains. Using a standardized phenotyping protocol, the IMPC integrates data with existing mouse and human disease resources and provides strains and phenotype data for use by the research community.
MassIVE
MassIVE is a community resource developed by the NIH-funded Center for Computational Mass Spectrometry to promote the global, free exchange of mass spectrometry data.
Mendeley (general)
Mendeley Data is a free and secure cloud-based communal repository where you can store your data, ensuring it is easy to share, access and cite, wherever you are. Elsevier's Mendeley Data repository is a participating member of the National Institutes of Health (NIH) Office of Data Science Strategy GREI project. The GREI includes established generalist repositories funded by the NIH to work together to establish consistent metadata, develop use cases for data sharing, train and educate researchers on FAIR data and the importance of data sharing, and more.
MetaboLights
MetaboLights is a database for Metabolomics experiments and derived information. The database is cross-species, cross-technique and covers metabolite structures and their reference spectra as well as their biological roles, locations and concentrations, and experimental data from metabolic experiments.
Metabolomics Workbench
The Metabolomics Workbench serves as a national and international repository for metabolomics data and metadata and provides analysis tools and access to metabolite standards, protocols, tutorials, training, and more. The Workbench is a companion to Regional Comprehensive Metabolomics Resource Cores (RCMRC) and is a part of the Common Fund Initiative in metabolomics.
NCBI BioProject
The BioProject repository collects projects with biological data that relates to a single initiative that originates from a single entity or consortium. Records provide users with a single location for the links to diverse data types generated for those projects.
NCBI BioSample
The BioSample database contains descriptions of biological source materials used in experimental assays.
NCBI Conserved Domains Database
The Conserved Domains Database (CDD) contains annotations of functional units in proteins, including multiple sequence alignment models for ancient domains and full-length proteins. This collection of models includes 3D structures that display the sequence/structure/function relationships in proteins. Users can identify amino acids in protein sequences with the resources available through CDD as well as view single sequences embedded within multiple sequence alignments.
NCBI dbVar
dbVar is NCBI's database of human genomic structural variation: large variants (>50 bp) including insertions, deletions, duplications, inversions, mobile elements, translocations, and complex variants. Users can search, view, and download data from submitted studies. dbVar stopped supporting data from non-human organisms on November 1, 2017; however, existing non-human data remain available via FTP download. In keeping with the common definition of structural variation, most variants are larger than 50 base pairs in length, although a handful of smaller variants may also be found.
NCBI Gene
Integrating information from a variety of species, records in Gene include nomenclature, Reference Sequences, maps, pathways, variations, phenotypes, and links to genome-specific, phenotype-specific, and locus-specific resources.
NCBI GEO Datasets
An international public repository, GEO (Gene Expression Omnibus) DataSets archives and distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomics data. The records include original submitter-supplied records (Series, Samples and Platforms) and curated DataSets. GEO aims to provide a database that efficiently stores this data; offer simple submission procedures and formats that support complete and well-annotated data deposits from the research community; and provide user-friendly mechanisms for users to find and use studies and gene expression profiles of interest. GEO DataSets provides tools to identify differences in gene expression levels and cluster heatmaps.
NCBI GEO Profiles
The Gene Expression Omnibus (GEO) database stores individual gene expression profiles from NCBI databases and is searchable by gene annotation as well as gene profile characteristics. GEO archives microarray and next-generation sequencing as well as other forms of genomic data submitted by researchers within the scientific community.
NCBI Nucleotide
The NCBI Nucleotide database collects sequences from such sources as GenBank, RefSeq, TPA, and PDB. Sequences collected relate to genome, gene, and transcript sequence data, and provide a foundation for research related to the biomedical field.
NCBI PopSet
The PopSet database is a collection of related DNA sequences derived from population, phylogenetic, mutation and ecosystem studies that have been submitted to GenBank.
NCBI Protein
The Protein database collects protein sequences related to biological structure and function. The sequences in NCBI Protein come from the translations from annotated coding regions in GenBank, RefSeq, and TPA, and records from SwissProt, PIR, PRF, and PDB.
NCBI Protein Clusters
This collection of related protein sequences (clusters) consists of proteins derived from the annotations of whole genomes, organelles and plasmids. It is currently limited to Archaea, Bacteria, Plants, Fungi, Protozoans, and Viruses.
NCBI Reference Sequence
The Reference Sequence database provides explicitly linked nucleotide and protein sequences, as well as comprehensive and annotated sequence sets with genomic DNA, proteins and transcripts. Users have access to a wealth of resources for gene identification, comparative analysis and genome research. Reference Sequences are available for naturally occurring DNA, RNA and protein sequences in organic species worldwide.
NCBI Structure
The Structure database provides three-dimensional structures of macromolecules for a variety of research purposes and allows the user to retrieve structures for specific molecule types as well as structures for genes and proteins of interest. Three main databases comprise Structure: the Molecular Modeling Database; Conserved Domains and Protein Classification; and the BioSystems Database. Structure also links to the PubChem databases to connect biological activity data to the macromolecular structures. Users can locate structural templates for proteins and interactively view structures and sequence data to closely examine sequence-structure relationships.
NCBI Taxonomy
The NCBI Taxonomy database is a curated set of names and classifications for all of the organisms that are represented in GenBank, currently covering about 10 percent of the described species on the planet. When new sequences are submitted to GenBank, the submission is checked for new organism names, which are then classified and added to the Taxonomy database. As of April 2003, there were 176,890 total taxa represented.
There are two main tools for viewing the information in the Taxonomy database: the Taxonomy Browser and Entrez Taxonomy. Both systems allow searching of the Taxonomy database for names, and both link to the relevant sequence data. However, the Taxonomy Browser provides a hierarchical view of the classification (the best display for most casual users interested in exploring the classification), whereas Entrez Taxonomy provides a uniform indexing, search, and retrieval engine with a common mechanism for linking between the Taxonomy and other relevant Entrez databases.
NeuroImaging Tools & Resource Collaboratory (NITRC)
NeuroImaging Tools & Resource Collaboratory (NITRC) is a free, web-based resource, recognized with a United States Department of Health and Human Services award, that offers comprehensive information on an ever-expanding scope of neuroinformatics software and data. NITRC has met the stringent FAIR sharing and open access requirements to be listed as an NIH-Supported Scientific Data Repository, an NLM Domain-Specific Repository and a neuroscience repository on Scientific Data. NITRC and its components, the Resources Registry (NITRC-R), Image Repository (NITRC-IR), and Computational Environment (NITRC-CE), offer researchers MR, PET/SPECT, CT, EEG/MEG, optical imaging, clinical neuroimaging, computational neuroscience, and imaging genomics software tools, data, and computational resources.
Online Mendelian Inheritance in Animals (OMIA)
Online Mendelian Inheritance in Animals contains textual information, references, links, and relevant records related to genes, traits, and inherited disorders in animals.
Online Mendelian Inheritance in Man (OMIM)
OMIM contains authoritative medical data on all known mendelian disorders as well as full-text and referenced overviews on the relationship between phenotype and genotype. Users can search the OMIM database by chromosome, narrow their search results by known gene sequences, phenotypes and gene map locus, or search using only clinical synopses containing any combination of 22 specified criteria. The information contained in OMIM is available to download for personal, educational and research uses.
Open Data Commons for Traumatic Brain Injury (ODC-TBI)
Open Data Commons for Traumatic Brain Injury (ODC-TBI) is a dedicated data sharing portal and repository for the field of traumatic brain injuries.
openICPSR
A service of the Inter-university Consortium for Political and Social Research (ICPSR), openICPSR is a self-publishing repository for social, behavioral, and health sciences research data. openICPSR is particularly well-suited for the deposit of replication data sets for researchers who need to publish their raw data associated with a journal article so that other researchers can replicate their findings.
OpenNeuro
A free and open platform for validating and sharing Brain Imaging Data Structure (BIDS) compliant MRI, PET, MEG, EEG, and iEEG data.
Open Science Framework (general)
Open Science Framework is a repository hosted by the Center for Open Science, which is a non-profit technology company providing free and open services to increase inclusivity and transparency of research.
Panorama
Panorama is a freely available, open-source repository server application for targeted mass spectrometry assays that integrates into a Skyline mass spec workflow.
PhysioNet
PhysioNet hosts large collections of physiological and clinical data and related open-source software. This includes digital recordings and biomedical signals (including cardiopulmonary and neural) from healthy subjects and patients with a variety of conditions with major public health implications, including sudden cardiac death, congestive heart failure, epilepsy, gait disorders, sleep apnea, and aging. PhysioNet also includes clinical and imaging data related to critical care.
Protein Data Bank in Europe
The EBI Protein Structure Database in Europe is a project for the collection, management and distribution of data about macromolecular structures, derived from the Protein Data Bank (PDB). It is one of the founding members of the Worldwide Protein Data Bank (wwPDB).
ProteomeXchange
The ProteomeXchange Consortium was established to provide globally coordinated standard data submission and dissemination pipelines involving the main proteomics repositories, and to encourage open data policies in the field. ProteomeXchange fully supports both MS/MS proteomics and SRM data submission. Submission of other types of proteomics data is also possible using the Partial Submission mechanism.
PRoteomics IDEntifications Database (PRIDE)
The PRoteomics IDEntifications (PRIDE) Archive database is a centralized, standards-compliant, public data repository for mass spectrometry proteomics data, including protein and peptide identifications and the corresponding expression values, post-translational modifications and supporting mass spectra evidence (both as raw data and peak list files). PRIDE is a core member in the ProteomeXchange (PX) consortium, which provides a standardised way for submitting mass spectrometry based proteomics data to public-domain repositories. Datasets are submitted to ProteomeXchange via PRIDE and are handled by expert bio-curators. All PRIDE public datasets can also be searched in ProteomeCentral, the portal for all ProteomeXchange datasets.
PubChem
PubChem is an open chemistry database at the NIH. It accepts and stores information on chemical structures, identifiers, chemical and physical properties, biological activities, patents, health, safety, toxicity data, and many others.
Rat Genome Database
The Rat Genome Database (RGD) is the premier site for genetic, genomic, phenotype, and disease-related data generated from rat research. In addition, RGD has expanded to include a large body of structured and standardized data for ten species (rat, mouse, human, chinchilla, bonobo, 13-lined ground squirrel, dog, pig, green monkey/vervet and naked mole-rat). RGD also offers a growing suite of innovative tools for querying, analyzing and visualizing these data, making it a valuable resource for researchers worldwide.
RGD is deeply committed to the principles of FAIR data exchange, making data Findable, Accessible, Interoperable and Reusable.
RCSB Protein Data Bank (PDB)
RCSB Protein Data Bank (PDB) is the US data center for the global PDB archive of 3D structure data for large biological molecules (proteins, DNA, and RNA) essential for research and education in fundamental biology, health, energy, and biotechnology.
Sequence Read Archive (SRA)
Sequence Read Archive (SRA) data, available through multiple cloud providers and NCBI servers, is the largest publicly available repository of high throughput sequencing data. The archive accepts data from all branches of life as well as metagenomic and environmental surveys. SRA stores raw sequencing data and alignment information to enhance reproducibility and facilitate new discoveries through data analysis.
Single Nucleotide Polymorphism Database (dbSNP)
The Single Nucleotide Polymorphism Database (dbSNP) is a public-domain archive for a broad collection of simple genetic polymorphisms. This collection of polymorphisms includes single-base nucleotide substitutions (also known as single nucleotide polymorphisms or SNPs), small-scale multi-base deletions or insertions (also called deletion insertion polymorphisms or DIPs), and retroposable element insertions and microsatellite repeat variations (also called short tandem repeats or STRs). dbSNP has been designed to support submissions and research into a broad range of biological problems. These include physical mapping, functional analysis, pharmacogenomics, association studies, and evolutionary studies. Because dbSNP was developed to complement GenBank, it may contain nucleotide sequences from any organism.
Synapse (general)
Synapse is a cloud-based data repository and sharing platform where researchers can share and describe content to co-analyze, learn from, and improve knowledge of health and disease.
The Cancer Imaging Archive (TCIA)
TCIA is a service that de-identifies and hosts a large archive of medical images of cancer accessible for public download. The data are organized as “Collections”, typically patients related by a common disease (e.g., lung cancer), image modality (MRI, CT, etc.) or research focus. DICOM is the primary file format used by TCIA for image storage. Supporting data related to the images such as patient outcomes, treatment details, genomics, pathology, and expert analyses are also provided when available.
The Image Data Resource (IDR)
The Image Data Resource (IDR) is a public repository of reference image datasets from published scientific studies. IDR enables access, search and analysis of these highly annotated datasets.
The Zebrafish Information Network (ZFIN)
The Zebrafish Information Network (ZFIN) is the database of genetic and genomic data for the zebrafish (Danio rerio) as a model organism. ZFIN provides a wide array of expertly curated, organized and cross-referenced zebrafish research data.
TriTrypDB
TriTrypDB is an integrated genomic and functional genomic database for pathogens of the families Eubodonida and Trypanosomatida.
UK Data Archive
The UK Data Archive (UKDA) is a center of expertise in data acquisition, preservation, dissemination and promotion and is curator of the largest collection of digital data in the social sciences and humanities in the UK.
WikiPathways
WikiPathways was established to facilitate the contribution and maintenance of pathway information by the biology community. WikiPathways is an open, collaborative platform dedicated to the curation of biological pathways. WikiPathways thus presents a new model for pathway databases that enhances and complements ongoing efforts, such as KEGG, Reactome and Pathway Commons. The familiar web-based format of WikiPathways greatly reduces the barrier to participate in pathway modeling. More importantly, the open, public approach of WikiPathways allows for broader participation by the entire community, ranging from students to senior experts in each field.
WormBase
WormBase is an international consortium of biologists and computer scientists providing the research community with accurate, current, accessible information concerning the genetics, genomics and biology of C. elegans and related nematodes.
Zenodo (general)
Zenodo builds and operates a simple and innovative service that enables researchers, scientists, EU projects and institutions to share and showcase multidisciplinary research results (data and publications) that are not part of the existing institutional or subject-based repositories of the research communities.
Subject-focused repositories, when available, are preferred over general repositories.