Princeton University
🎯 The Problem: Language models lack deep domain expertise despite demonstrating cross-domain capabilities. Current top-down training on internet-derived text corpora is constrained by the efficacy of the domain abstractions present in the training data, making it insufficient to elicit domain-specific reasoning. The path to domain-specific superintelligence may require a bottom-up approach that acquires deep expertise by explicitly learning to compose simple concepts of a domain into more complex ones.
💡 Our Approach: We introduce a new paradigm for domain-specific superintelligence through a bottom-up approach leveraging knowledge graphs (KGs). We traverse multiple paths of a KG to generate a reasoning curriculum grounded in the domain primitives encoded along those paths. More specifically, we construct a task-generation pipeline that directly synthesizes reasoning tasks from the domain-specific primitives of the KG. We then fine-tune a language model on the resulting bottom-up, KG-grounded curriculum to demonstrate domain-specific superintelligence.
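The core idea of turning KG paths into reasoning tasks can be sketched in a few lines. The snippet below is a minimal, illustrative mock-up, not the paper's actual pipeline: the toy triples, relation names, and question template are all our own assumptions, and the real system uses a backend LLM rather than string templates to produce questions and thinking traces.

```python
import random

# Toy medical KG as (head, relation, tail) triples; entities and relations
# are illustrative placeholders, not drawn from the paper's KG.
TRIPLES = [
    ("metformin", "treats", "type 2 diabetes"),
    ("type 2 diabetes", "increases_risk_of", "diabetic nephropathy"),
    ("diabetic nephropathy", "presents_with", "proteinuria"),
]

def adjacency(triples):
    """Index triples by head entity for fast traversal."""
    adj = {}
    for h, r, t in triples:
        adj.setdefault(h, []).append((r, t))
    return adj

def sample_path(adj, start, length):
    """Random walk of up to `length` hops from `start`; returns (h, r, t) hops."""
    path, node = [], start
    for _ in range(length):
        if node not in adj:
            break
        r, t = random.choice(adj[node])
        path.append((node, r, t))
        node = t
    return path

def path_to_question(path):
    """Template a multi-hop path into a QA item; the final tail is the answer.
    (A real pipeline would hand the path to an LLM instead of a template.)"""
    chain = " -> ".join(f"{h} ({r})" for h, r, _ in path)
    question = (f"A patient is on {path[0][0]}. Reasoning along the chain "
                f"{chain} -> ?, which clinical finding may eventually arise?")
    return question, path[-1][2]

adj = adjacency(TRIPLES)
question, answer = path_to_question(sample_path(adj, "metformin", 3))
```

Solving the generated question requires reasoning through every hop of the relational chain, which is exactly the property the curriculum is designed to exploit.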
⚙️ Our Implementation: While our approach is applicable to a variety of domains, we demonstrate it in the context of medicine, where a reliable KG is readily available. We instantiate our pipeline and generate 24,000 reasoning tasks from a medical KG, which are then used to fine-tune QwQ-32B to obtain QwQ-Med-3, a language model equipped with diverse medical primitives.
📊 ICD-Bench: We introduce a comprehensive evaluation suite with 3,675 high-quality medical reasoning questions across 15 different medical sub-categories and 5 difficulty levels to rigorously quantify medical reasoning capabilities.
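The benchmark's stratification implies a simple per-cell count, sketched below. The per-category and per-cell figures are our own arithmetic from "evenly distributed", not numbers stated in the paper, and the even split across difficulty levels within each category is an assumption.

```python
# 3,675 questions split evenly over 15 ICD-10 categories gives 245 per
# category; a further even split over 5 difficulty levels (our assumption)
# gives 49 questions per (category, difficulty) cell.
total, categories, levels = 3675, 15, 5

per_category = total // categories
per_cell = per_category // levels

# The divisions are exact, so the stratification has no remainder.
assert per_category * categories == total
assert per_cell * levels == per_category
```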
🤖 QwQ-Med-3: Our curriculum-tuned model achieves state-of-the-art performance on ICD-Bench, significantly outperforming both open-source and proprietary reasoning models. Additionally, it demonstrates superior or equivalent performance compared to similarly sized models across multiple established medical QA benchmarks.
🧪 Try It Yourself: Experience the medical reasoning challenge firsthand with our interactive quiz: attempt the same questions used to evaluate AI models and see how you compare!
Our central insight is that paths in a KG can be translated into grounded natural language reasoning tasks, whose solution requires reasoning along the relational chain encoded in the paths (Top). We use a medical KG as a scaffold to generate QA items from KG paths, along with their corresponding thinking traces, using a backend Large Language Model (LLM) (Middle). We leverage this KG-path-to-QA transformation to design a task-generation pipeline that curates a high-quality, verifiable, and diverse curriculum of reasoning tasks (Bottom).
We introduce a new evaluation suite for quantifying medical reasoning capabilities: ICD-Bench. The benchmark comprises 3,675 evaluation tasks, systematically generated and evenly distributed across 15 International Classification of Diseases, 10th Revision (ICD-10) taxonomy categories.
We evaluated our models against baselines on ICD-Bench, where QwQ-Med-3 significantly outperforms other proprietary and open-source reasoning models across all categories (Top). Further, our inference-time scaling analysis reveals that curriculum-tuned models are more compute-optimal than their open-source counterparts, with an expanding curriculum from QwQ-Med-1 to QwQ-Med-3 yielding improved inference-time scaling performance (Bottom).
Our interactive quiz lets you attempt questions from our ICD-Bench evaluation suite, organized across ICD-10 categories.
Take on the ICD-Bench challenge and see how you fare against our QwQ-Med-3!
🎯 Launch ICD-Bench Quiz →
From easy to hard across diverse categories
3,675 carefully curated medical questions across 15 ICD-10 categories
Detailed explanations with knowledge graph paths and AI reasoning
Monitor your performance across categories and difficulty levels
Find key insights, complete technical details, and additional experiments in our paper.
📄 Read the Full Paper →
@misc{dedhia2025bottomupsuperintelligence,
  author = {Dedhia, Bhishma and Kansal, Yuval and Jha, Niraj K.},
  title  = {Bottom-up Domain-specific Superintelligence: A Reliable Knowledge Graph is What We Need},
  year   = {2025},
  url    = {https://arxiv.org/abs/2507.13966},
}