Bottom-up Domain-specific Superintelligence: A Reliable Knowledge Graph is What We Need

Bhishma Dedhia

Princeton University

Yuval Kansal

Princeton University

Niraj K. Jha

Princeton University

📝 TL;DR

🎯 The Problem: Language models lack deep domain expertise despite demonstrating cross-domain capabilities. Current top-down training on internet-derived text corpora is limited by how well the training data captures domain abstractions, making it insufficient to elicit domain-specific reasoning. The path to domain-specific superintelligence may require a bottom-up approach that acquires deep expertise by explicitly learning to compose simple concepts of a domain into more complex ones.

💡 Our Approach: We introduce a new paradigm for domain-specific superintelligence: a bottom-up approach that leverages knowledge graphs (KGs). We traverse multiple paths of a KG to generate a reasoning curriculum grounded in the domain primitives encoded along those paths. More specifically, we construct a task-generation pipeline that synthesizes reasoning tasks directly from the domain-specific primitives of the KG. We then fine-tune a language model on the resulting bottom-up, KG-grounded curriculum to demonstrate domain-specific superintelligence.

⚙️ Our Implementation: While our approach is applicable to a variety of domains, we demonstrate it in medicine, where a reliable KG is readily available. We instantiate our pipeline and generate 24,000 reasoning tasks from a medical KG, which are then used to fine-tune QwQ-32B to obtain QwQ-Med-3, a language model equipped with diverse medical primitives.

📊 ICD-Bench: We introduce a comprehensive evaluation suite with 3,675 high-quality medical reasoning questions across 15 different medical sub-categories and 5 difficulty levels to rigorously quantify medical reasoning capabilities.

🤖 QwQ-Med-3: Our curriculum-tuned model achieves state-of-the-art performance on ICD-Bench, significantly outperforming both open-source and proprietary reasoning models. It also matches or exceeds similarly sized models across multiple established medical QA benchmarks.

🧪 Try It Yourself: Experience the medical reasoning challenge firsthand with our interactive quiz: attempt the same questions used to evaluate AI models and see how you compare!

24K Training Tasks: KG-derived reasoning curriculum

84.72% ICD-Bench Score: SOTA performance

15 ICD-10 Categories: Comprehensive evaluation

Bottom-Up Curriculum Generation


Our central insight is that paths in a KG can be translated into grounded natural language reasoning tasks whose solution requires reasoning along the relational chain encoded in the path (Top). We use a medical KG as a scaffold to generate QA items from KG paths, along with their corresponding thinking traces, using a backend large language model (LLM) (Middle). We leverage this KG-path-to-QA transformation to design a task-generation pipeline that curates a high-quality, verifiable, and diverse curriculum of reasoning tasks (Bottom).


Diagram of the QA generation pipeline using a KG.
Diagram of the QA generation pipeline from the KG.
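
The path-to-task idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the toy graph contains only the three edges from the sample KG path shown on this page, the helper names (`build_adjacency`, `sample_path`, `path_to_prompt`) are hypothetical, the prompt wording is invented for illustration, and the call to a backend LLM is left out entirely.

```python
import random

# Toy KG as (head, relation, tail) triples. The three edges mirror the sample
# KG path on this page; a real medical KG would be far larger.
TRIPLES = [
    ("Juvenile colonic polyposis", "may_be_allelic_with",
     "Hereditary haemorrhagic telangiectasia"),
    ("Hereditary haemorrhagic telangiectasia", "maybe_cause",
     "Arteriovenous malformation"),
    ("Arteriovenous malformation", "maybe_cause",
     "Intracranial arteriovenous malformation"),
]


def build_adjacency(triples):
    """Map each head node to its outgoing (relation, tail) edges."""
    adj = {}
    for head, rel, tail in triples:
        adj.setdefault(head, []).append((rel, tail))
    return adj


def sample_path(adj, start, num_hops, rng):
    """Random walk of up to num_hops edges from start, never revisiting a node.

    Returns an alternating list: [node, relation, node, relation, node, ...].
    """
    path = [start]
    visited = {start}
    node = start
    for _ in range(num_hops):
        choices = [(r, t) for r, t in adj.get(node, []) if t not in visited]
        if not choices:
            break  # dead end: return the shorter path
        rel, node = rng.choice(choices)
        path += [rel, node]
        visited.add(node)
    return path


def path_to_prompt(path):
    """Render a KG path as an instruction for a question-writing LLM
    (hypothetical wording, for illustration only)."""
    chain = " -> ".join(path)
    return (
        "Write a multiple-choice medical question whose answer requires "
        f"reasoning along this chain of relations: {chain}. "
        f"The clinical vignette should point to '{path[0]}' and the correct "
        f"option should be '{path[-1]}'."
    )


adj = build_adjacency(TRIPLES)
path = sample_path(adj, "Juvenile colonic polyposis", 3, random.Random(0))
prompt = path_to_prompt(path)
print(prompt)
```

Because the answer is tied to the final node of the path, a generated item can in principle be checked against the graph, which is one way to read the "verifiable" property mentioned above.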

ICD-Bench Evaluation Suite


We introduce a new evaluation suite for quantifying medical reasoning capabilities: ICD-Bench. The benchmark comprises 3,675 evaluation tasks, systematically generated and evenly distributed across 15 categories of the International Classification of Diseases (ICD-10) taxonomy.
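
As a quick check of the numbers above, an even split of 3,675 tasks over 15 categories gives 245 tasks per category. The sketch below also shows one plausible way to score a model per category; the record layout `(category, predicted_option, gold_option)` and the function name `per_category_accuracy` are assumptions made for illustration, and "Hypothetical Category" is a placeholder rather than a real ICD-Bench category name.

```python
from collections import defaultdict

NUM_TASKS = 3675
NUM_CATEGORIES = 15
# Even split stated in the text: 3675 / 15 = 245 tasks per category.
TASKS_PER_CATEGORY = NUM_TASKS // NUM_CATEGORIES


def per_category_accuracy(records):
    """Compute accuracy per category.

    records: iterable of (category, predicted_option, gold_option) tuples
    (an assumed layout, not the benchmark's actual file format).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, pred, gold in records:
        total[category] += 1
        correct[category] += int(pred == gold)
    return {c: correct[c] / total[c] for c in total}


# Tiny made-up example: two answers in one category, one in another.
records = [
    ("Digestive System Diseases", "C", "C"),
    ("Digestive System Diseases", "A", "C"),
    ("Hypothetical Category", "B", "B"),
]
scores = per_category_accuracy(records)
print(TASKS_PER_CATEGORY, scores)
```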

Sample Responses


🍽️ Digestive System Diseases
Question:
A 14-year-old male presents with recurrent episodes of painless rectal bleeding and iron deficiency anemia. Colonoscopy reveals numerous hamartomatous polyps throughout the colon. Further investigation reveals scattered telangiectasias on his lips and nasal mucosa. Which of the following neurovascular abnormalities should be considered in this patient, given the constellation of findings?
A. Moyamoya disease
B. Cerebral amyloid angiopathy
C. Intracranial arteriovenous malformation
D. Cavernous malformations
🧠 KG Path: Juvenile colonic polyposis-may_be_allelic_with-Hereditary haemorrhagic telangiectasia-maybe_cause-Arteriovenous malformation-maybe_cause-Intracranial arteriovenous malformation

Results


We evaluated our models against baselines on ICD-Bench, where QwQ-Med-3 significantly outperforms other proprietary and open-source reasoning models across all categories (Top). Our inference-time scaling analysis further shows that curriculum-tuned models are more compute-optimal than their open-source counterparts, and that expanding the curriculum from QwQ-Med-1 to QwQ-Med-3 improves inference-time scaling performance (Bottom).

Comparison of our curriculum-tuned models with existing models on ICD-Bench.
Performance across different categories on ICD-Bench.
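
This page does not say how inference-time compute was scaled in the analysis above. One common form of inference-time scaling is parallel sampling with majority voting: draw several answers for the same question and keep the most frequent one. The sketch below illustrates only that generic technique, with a hypothetical `majority_vote` helper and made-up sampled answers; it should not be read as the authors' procedure.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the earliest-seen answer.

    Raises ValueError on an empty list, since there is nothing to vote on.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer


# Made-up option letters standing in for five independent model samples.
sampled = ["C", "D", "C", "A", "C"]
final = majority_vote(sampled)
print(final)
```

Under this scheme, the compute budget is the number of samples drawn per question, so accuracy can be plotted against sample count to compare models.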

🩺 Try our Interactive ICD-Bench Quiz

Our interactive quiz lets you attempt questions from our ICD-Bench evaluation suite, organized across ICD-10 categories.


🚀 Ready to Test Your Medical Knowledge?

Take on the ICD-Bench challenge and see how you fare against QwQ-Med-3!


🎯 Launch ICD-Bench Quiz →

📊 5 Difficulty Levels

From easy to hard across diverse categories

📝 High-Quality Questions

3,675 carefully curated medical questions across 15 ICD-10 categories

⚡ Enhanced Feedback

Detailed explanations with knowledge graph paths and AI reasoning

📈 Progress Tracking

Monitor your performance across categories and difficulty levels

🚀 Dive Deeper

Find key insights, complete technical details, and additional experiments in our paper.

📄 Read the Full Paper →

BibTeX citation

    @misc{dedhia2025bottomupsuperintelligence,
      author = {Dedhia, Bhishma and Kansal, Yuval and Jha, Niraj K.},
      title = "Bottom-up Domain-specific Superintelligence: A Reliable Knowledge Graph is What We Need",
      year = "2025",
      url = {https://arxiv.org/abs/2507.13966},
    }