Curated, open-access datasets for training and evaluating medical AI models: variants, sequences, clinical trials, and more.
Whole-genome sequences from 10,000 Indian individuals across 100+ ethnic groups. Includes SNP, INDEL, and structural variant annotations. De-identified and ethics-approved.
Curated subset of ClinVar with 450,000 pathogenic and likely-pathogenic variants, enriched with functional evidence, ACMG classifications, and literature links.
42 million biomedical research papers from PubMed, PMC Open Access, and preprint servers. Preprocessed for LLM training with structured metadata, abstracts, and full text where available.
2 million drug-target interaction pairs with SMILES notation, protein sequences, binding affinities, ADMET properties, and clinical outcome data from ChEMBL and BindingDB.
2 million annotated histopathology images from Indian cancer centers, covering 18 cancer types. Includes WHO grading, tumor boundaries, and pathologist consensus labels.
The MedQA benchmark suite for evaluating medical AI models across 12 tasks including variant classification, gene function prediction, and clinical report generation.