Princeton University
🎯 The Problem: Language models lack deep domain expertise despite demonstrating cross-domain capabilities. Current top-down training on internet-derived text corpora is constrained by the efficacy of the domain abstractions present in the training data, making it insufficient to elicit domain-specific reasoning. The path to domain-specific superintelligence may require a bottom-up approach that acquires deep expertise by explicitly learning to compose simple concepts of a domain into more complex ones.
💡 Our Approach: We introduce a new paradigm for domain-specific superintelligence through a bottom-up approach leveraging knowledge graphs (KGs). We traverse multiple paths of a KG to generate a reasoning curriculum grounded in the domain primitives encoded along those paths. More specifically, we construct a task-generation pipeline that directly synthesizes reasoning tasks from the domain-specific primitives of the KG. We then fine-tune a language model on the resulting bottom-up, KG-grounded curriculum to demonstrate domain-specific superintelligence.
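The core idea of turning KG paths into reasoning tasks can be sketched in a few lines. The snippet below is a minimal, illustrative mock-up, not the paper's actual pipeline: the toy triples, relation names, and question template are all our own assumptions, and the real system uses a backend LLM rather than string templates to produce questions and thinking traces.

```python
import random

# Toy medical KG as (head, relation, tail) triples; entities and relations
# are illustrative placeholders, not drawn from the paper's KG.
TRIPLES = [
    ("metformin", "treats", "type 2 diabetes"),
    ("type 2 diabetes", "increases_risk_of", "diabetic nephropathy"),
    ("diabetic nephropathy", "presents_with", "proteinuria"),
]

def adjacency(triples):
    """Index triples by head entity for fast traversal."""
    adj = {}
    for h, r, t in triples:
        adj.setdefault(h, []).append((r, t))
    return adj

def sample_path(adj, start, length):
    """Random walk of up to `length` hops from `start`; returns (h, r, t) hops."""
    path, node = [], start
    for _ in range(length):
        if node not in adj:
            break
        r, t = random.choice(adj[node])
        path.append((node, r, t))
        node = t
    return path

def path_to_question(path):
    """Template a multi-hop path into a QA item; the final tail is the answer.
    (A real pipeline would hand the path to an LLM instead of a template.)"""
    chain = " -> ".join(f"{h} ({r})" for h, r, _ in path)
    question = (f"A patient is on {path[0][0]}. Reasoning along the chain "
                f"{chain} -> ?, which clinical finding may eventually arise?")
    return question, path[-1][2]

adj = adjacency(TRIPLES)
question, answer = path_to_question(sample_path(adj, "metformin", 3))
```

Solving the generated question requires reasoning through every hop of the relational chain, which is exactly the property the curriculum is designed to exploit.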
⚙️ Our Implementation: While our approach is applicable to a variety of domains, we demonstrate it in the context of medicine, where a reliable KG is readily available. We instantiate our pipeline and generate 24,000 reasoning tasks from a medical KG, which are then used to fine-tune QwQ-32B to obtain QwQ-Med-3, a language model equipped with diverse medical primitives.
📊 ICD-Bench: We introduce a comprehensive evaluation suite with 3,675 high-quality medical reasoning questions across 15 different medical sub-categories and 5 difficulty levels to rigorously quantify medical reasoning capabilities.
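The benchmark's stratification implies a simple per-cell count, sketched below. The per-category and per-cell figures are our own arithmetic from "evenly distributed", not numbers stated in the paper, and the even split across difficulty levels within each category is an assumption.

```python
# 3,675 questions split evenly over 15 ICD-10 categories gives 245 per
# category; a further even split over 5 difficulty levels (our assumption)
# gives 49 questions per (category, difficulty) cell.
total, categories, levels = 3675, 15, 5

per_category = total // categories
per_cell = per_category // levels

# The divisions are exact, so the stratification has no remainder.
assert per_category * categories == total
assert per_cell * levels == per_category
```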
🤖 QwQ-Med-3: Our curriculum-tuned model achieves state-of-the-art performance on ICD-Bench, significantly outperforming both open-source and proprietary reasoning models. Additionally, it demonstrates superior or equivalent performance compared to similarly sized models across multiple established medical QA benchmarks.
🧪 Try It Yourself: Experience the medical reasoning challenge firsthand with our interactive quiz: attempt the same questions used to evaluate AI models and see how you compare!
Our central insight is that paths in a KG can be translated into grounded natural language reasoning tasks, whose solution requires reasoning along the relational chain encoded in the paths (Top). We use a medical KG as a scaffold to generate QA items from KG paths, along with their corresponding thinking traces, using a backend Large Language Model (LLM) (Middle). We leverage this KG-path-to-QA transformation to design a task-generation pipeline that curates a high-quality, verifiable, and diverse curriculum of reasoning tasks (Bottom).
We introduce a new evaluation suite for quantifying medical reasoning capabilities: ICD-Bench. The benchmark comprises 3,675 evaluation tasks, systematically generated and evenly distributed across 15 International Classification of Diseases, 10th Revision (ICD-10) taxonomy categories.
We evaluated our models against baselines on ICD-Bench, where QwQ-Med-3 significantly outperforms other proprietary and open-source reasoning models across all categories (Top). Further, our inference-time scaling analysis reveals that curriculum-tuned models are more compute-optimal than their open-source counterparts, with an expanding curriculum from QwQ-Med-1 to QwQ-Med-3 yielding improved inference-time scaling performance (Bottom).
Our interactive quiz lets you attempt questions from our ICD-Bench evaluation suite, organized across ICD-10 categories.
Take on the ICD-Bench challenge and see how you fare against our QwQ-Med-3!
🎯 Launch ICD-Bench Quiz →
From easy to hard across diverse categories
3,675 carefully curated medical questions across 15 ICD-10 categories
Detailed explanations with knowledge graph paths and AI reasoning
Monitor your performance across categories and difficulty levels
Find key insights, complete technical details, and additional experiments in our paper.
📄 Read the Full Paper →
@misc{dedhia2025bottomupsuperintelligence,
  author = {Dedhia, Bhishma and Kansal, Yuval and Jha, Niraj K.},
  title  = {Bottom-up Domain-specific Superintelligence: A Reliable Knowledge Graph is What We Need},
  year   = {2025},
  url    = {https://arxiv.org/abs/2507.13966},
}