CGAS: Constraint-Guided Agentic Search for Human-AI Collaborative Kernel Generation

A three-layer framework combining structured kernel design spaces, Hierarchical Constrained Monte Carlo Tree Search (HC-MCTS), and mixed-initiative interaction for GPU kernel optimization.

Problem & Framework

GPU kernel optimization involves navigating a vast combinatorial design space. Purely autonomous agents waste compute on configurations experts would reject; purely manual tuning cannot scale. CGAS addresses this with three layers:

Layer 1 -- Structured Design Space: 9 typed parameters (tile sizes, memory placement, thread blocks, vectorization, unroll factor) with hardware-semantic annotations. Total space: 230,400 configurations.
Layer 2 -- HC-MCTS Agent: UCB1-guided exploration with hard constraint pruning and soft preference biasing. Concentrates evaluation on performance-relevant configurations.
Layer 3 -- Mixed-Initiative Protocol: Uncertainty-adaptive consultation where experts inject constraints and provide feedback, focused on high-uncertainty parameters.
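
The interplay of hard pruning, soft biasing, and UCB1 in Layer 2 can be illustrated with a minimal sketch. The source does not give the HC-MCTS selection rule, so everything here is an assumption for illustration: the names HARD_CONSTRAINTS, SOFT_PREFERENCES, allowed_values, ucb1_score, and select_value are hypothetical, the example constraint and preference values are made up, rewards are assumed to be normalized to [0, 1], and the soft preference is modeled as an additive bonus that decays with visits.

```python
import math

# Hypothetical hard constraints: parameter name -> predicate that must hold.
HARD_CONSTRAINTS = {
    "tile_m": lambda v: v >= 64,
    "vec": lambda v: v >= 2,
}

# Hypothetical soft preferences: (parameter, value) -> additive bonus.
SOFT_PREFERENCES = {("tile_k", 32): 0.5}


def allowed_values(param, candidates):
    """Hard pruning: drop values that violate an expert constraint."""
    check = HARD_CONSTRAINTS.get(param, lambda v: True)
    return [v for v in candidates if check(v)]


def ucb1_score(total_reward, visits, parent_visits, c=math.sqrt(2), bias=0.0):
    """Standard UCB1 value plus a soft-preference bonus that decays with visits."""
    if visits == 0:
        return math.inf  # unvisited children are tried first
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore + bias / (1 + visits)


def select_value(param, stats, candidates):
    """Pick the next value of `param` to expand.

    `stats` maps value -> (total_reward, visits) for values tried so far.
    """
    legal = allowed_values(param, candidates)
    parent_visits = max(1, sum(stats.get(v, (0.0, 0))[1] for v in legal))
    return max(
        legal,
        key=lambda v: ucb1_score(
            *stats.get(v, (0.0, 0)),
            parent_visits,
            bias=SOFT_PREFERENCES.get((param, v), 0.0),
        ),
    )
```

In this sketch a hard constraint removes values before scoring, so pruned configurations are never evaluated, while a soft preference only tilts the ranking and is outweighed once measured rewards accumulate.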

Design Space Reduction Through Expert Constraints

[Figure panels: Design Space Size vs. Constraints; Mean Random-Sample Performance (TFLOPS)]

Constraints        Size       Reduction   Mean TFLOPS
None               230,400     0.0%        8.04
Tile M >= 64       138,240    40.0%        8.89
+ Vec >= 2         103,680    55.0%       10.82
+ Tile N >= 64      62,208    73.0%       12.09
+ Block >= 64       46,656    79.8%       12.75
+ Tile K >= 16      34,992    84.8%       12.33
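
The Reduction column, and the relative gain in mean TFLOPS over the unconstrained baseline, follow directly from the Size and Mean TFLOPS columns. The short script below recomputes both from the table values; the names ROWS, reduction_pct, and gain_pct are introduced here purely for illustration.

```python
BASELINE_SIZE = 230_400   # unconstrained design space
BASELINE_TFLOPS = 8.04    # mean random-sample TFLOPS with no constraints

# (constraint set, remaining configurations, mean random-sample TFLOPS)
ROWS = [
    ("None", 230_400, 8.04),
    ("Tile M >= 64", 138_240, 8.89),
    ("+ Vec >= 2", 103_680, 10.82),
    ("+ Tile N >= 64", 62_208, 12.09),
    ("+ Block >= 64", 46_656, 12.75),
    ("+ Tile K >= 16", 34_992, 12.33),
]


def reduction_pct(size):
    """Percentage of the baseline design space removed."""
    return 100.0 * (1.0 - size / BASELINE_SIZE)


def gain_pct(tflops):
    """Relative improvement in mean TFLOPS over the baseline, in percent."""
    return 100.0 * (tflops / BASELINE_TFLOPS - 1.0)


for name, size, tflops in ROWS:
    print(f"{name:<16} {reduction_pct(size):5.1f}%  {gain_pct(tflops):+6.1f}%")
```

Recomputed this way, the last row gives an 84.8% reduction with a gain of about 53.4%, while the largest gain, about 58.6%, belongs to the "+ Block >= 64" row (12.75 TFLOPS at a 79.8% reduction).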

Parameter Sensitivity Analysis

[Figure panels: TFLOPS Range by Parameter; Performance by Parameter Value (Top 4)]

Search Strategy: Bottleneck Distribution

[Figure panels: Bottleneck Distribution by Strategy; Memory-Bounded Fraction]

HC-MCTS: 94.4% memory-bounded
Random: 55.6% memory-bounded
Best TFLOPS: 19.305
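
The source does not say how a configuration is labeled memory-bounded. One common way to make such a label concrete is a roofline-style test, sketched below as an assumption rather than as the method used here: a kernel is treated as memory-bounded when its arithmetic intensity (FLOP per byte moved) falls below the ratio of peak compute to peak memory bandwidth. The function names and the hardware peaks passed in are hypothetical.

```python
def classify_bottleneck(flops, bytes_moved, peak_tflops, peak_bandwidth_gbs):
    """Roofline-style label: 'memory' if intensity is below the ridge point."""
    intensity = flops / bytes_moved                          # FLOP per byte
    ridge = (peak_tflops * 1e12) / (peak_bandwidth_gbs * 1e9)  # FLOP per byte
    return "memory" if intensity < ridge else "compute"


def memory_bounded_fraction(labels):
    """Fraction of evaluated configurations labeled 'memory'."""
    return sum(1 for label in labels if label == "memory") / len(labels)
```

Applying memory_bounded_fraction to the labels of all configurations a strategy evaluated would yield a per-strategy figure of the kind reported above.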

Convergence & Efficiency

[Figure: Evaluations to Reach TFLOPS Threshold]

Both the agent-only and human-assisted strategies reach 15 TFLOPS within 15 evaluations and 19+ TFLOPS within about 21 evaluations. The primary benefit of human expertise is therefore design space reduction and focused evaluation, not faster convergence.
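
An evaluations-to-threshold figure can be read off a search trace as the first evaluation at which the best TFLOPS seen so far meets the threshold. The helper below is a minimal sketch of that bookkeeping; the function name is hypothetical and the trace in the usage lines consists of made-up values for illustration, not measurements from this work.

```python
def evals_to_threshold(tflops_per_eval, threshold):
    """Return the 1-based evaluation count at which the best-so-far
    TFLOPS first reaches `threshold`, or None if it never does."""
    best = float("-inf")
    for count, tflops in enumerate(tflops_per_eval, start=1):
        best = max(best, tflops)
        if best >= threshold:
            return count
    return None


# Made-up trace, for illustration only.
trace = [6.1, 9.4, 12.0, 15.2, 14.8, 19.3]
print(evals_to_threshold(trace, 15.0))  # 4
print(evals_to_threshold(trace, 19.0))  # 6
```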

Key Results

Design Space

84.8% space reduction (230,400 to 34,992 configurations)
Up to 58.6% improvement in mean random-sample performance (8.04 to 12.75 TFLOPS)

Search Focus

94.4% memory-bounded configurations with HC-MCTS
vs. 55.6% with random search

Parameter Sensitivity

Tile K: 15.1 TFLOPS range
Block dims: 11-14 TFLOPS range