📊 Medical Datasets

Curated, open-access datasets for training and evaluating medical AI models — variants, sequences, clinical trials, and more.

Whole-genome sequences from 10,000 Indian individuals across 100+ ethnic groups. Includes SNP, INDEL, and structural variant annotations. De-identified and ethics-approved.

EHR · Population · VCF

10,247 Samples

deepcog-ai · ✓ Verified

ClinVar-Pathogenic-2024

Curated subset of ClinVar with 450,000 pathogenic and likely-pathogenic variants, enriched with functional evidence, ACMG classifications, and literature links.

deepcog-ai · aiims-delhi

BioMed-Papers-42M

42 million biomedical research papers from PubMed, PMC Open Access, and preprints. Preprocessed for LLM training with structured metadata, abstracts, and full texts where available.

Literature · Pretraining · JSONL

2 million drug-target interaction pairs with SMILES notation, protein sequences, binding affinities, ADMET properties, and clinical outcome data from ChEMBL and BindingDB.

Drug Discovery · SMILES · CSV

iit-madras-bioai · ✓ Verified

HistoPath-India-2M

2 million annotated histopathology images from Indian cancer centers, covering 18 cancer types. Includes WHO grading, tumor boundaries, and pathologist consensus labels.

deepcog-ai · ✓ Benchmark

GeneTuring-Benchmark-v2

The MedQA benchmark suite for evaluating medical AI models across 12 tasks including variant classification, gene function prediction, and clinical report generation.

Benchmark · Evaluation · JSON