The operational primitives of deep learning, primarily matrix multiplication and convolution, form
a fragmented landscape of highly specialized tools. This paper introduces the Generalized Windowed
Operation (GWO), a theoretical framework that unifies these operations by decomposing them into three
orthogonal components: Path, defining operational locality; Shape, defining geometric structure and
underlying symmetry assumptions; and Weight, defining feature importance.
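To make the decomposition concrete, the minimal sketch below (our own illustration; gwo_apply and the dictionary-based Path/Shape/Weight encoding are hypothetical names, not the paper's implementation) expresses a 1-D convolution and a dense layer (matrix multiplication) as two settings of the same windowed operation.

```python
import numpy as np

def gwo_apply(x, path, shape, weight):
    """Generic windowed operation: each output is a weighted sum over a window of the input.

    path:   dict with "n_out" (number of outputs) and "map" (output index -> centre location).
    shape:  dict with "window" (centre -> ordered list of input indices).
    weight: dict with "value" ((output index, slot) -> scalar importance).
    """
    y = np.zeros(path["n_out"])
    for i in range(path["n_out"]):
        window = shape["window"](path["map"](i))
        y[i] = sum(weight["value"](i, j) * x[idx] for j, idx in enumerate(window))
    return y

x = np.arange(8, dtype=float)

# Convolution: a local sliding Path, a small fixed Shape, and Weights shared
# across output positions (encoding a translation-symmetry assumption).
kernel = np.array([0.25, 0.5, 0.25])
conv = gwo_apply(
    x,
    path={"n_out": len(x) - len(kernel) + 1, "map": lambda i: i},
    shape={"window": lambda c: list(range(c, c + len(kernel)))},
    weight={"value": lambda i, j: kernel[j]},
)

# Matrix multiplication (a dense layer): a global Path and Shape (every output
# sees every input) with unshared, position-specific Weights.
W = np.random.default_rng(0).normal(size=(4, len(x)))
dense = gwo_apply(
    x,
    path={"n_out": W.shape[0], "map": lambda i: 0},
    shape={"window": lambda c: list(range(len(x)))},
    weight={"value": lambda i, j: W[i, j]},
)

assert np.allclose(conv, np.convolve(x, kernel[::-1], mode="valid"))
assert np.allclose(dense, W @ x)
```

Only the (P, S, W) triple changes between the two cases: convolution commits to a local Path, a small fixed Shape, and shared Weights, while the dense layer adopts a global Shape with unshared Weights.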
We elevate this framework to a predictive theory grounded in two fundamental principles. First, we
introduce the Principle of Structural Alignment, which posits that optimal generalization is achieved
when the GWO’s (P, S, W) configuration mirrors the data’s intrinsic structure. Second, we show that
this principle is a direct consequence of the Information Bottleneck (IB) principle. To formalize
this, we define an Operational Complexity metric based on Kolmogorov complexity. However, we
move beyond the simplistic view that lower complexity is always better. We argue that the nature of
this complexity—whether it contributes to brute-force capacity or to adaptive regularization—is
the true determinant of generalization. Our theory predicts that a GWO whose complexity is spent on
adaptive alignment with the data's structure will achieve a superior generalization bound. Canonical operations
and their modern variants emerge as optimal solutions to the IB objective, and our experiments reveal that
the quality, not just the quantity, of an operation’s complexity governs its performance. The GWO theory
thus provides a grammar for creating neural operations and a principled pathway from data properties
to generalizable architecture design.
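For orientation, the display below recalls the standard Information Bottleneck Lagrangian for a representation Z of input X with target Y; how the Operational Complexity C(P, S, W) enters as a regularizer, and the trade-off weights beta and gamma, are schematic assumptions for illustration rather than the paper's exact objective.

```latex
% Standard IB Lagrangian: compress X into Z while preserving information about Y.
\min_{p(z \mid x)} \; \mathcal{L}_{\mathrm{IB}} \;=\; I(X;Z) \;-\; \beta\, I(Z;Y)

% Schematic complexity-regularized form (an assumed illustration, not the paper's
% exact formulation): a Kolmogorov-style description length C(P,S,W) penalizes
% operational complexity that does not contribute to alignment with the data.
\min_{(P,\,S,\,W)} \; I\!\left(X; Z_{P,S,W}\right) \;-\; \beta\, I\!\left(Z_{P,S,W}; Y\right) \;+\; \gamma\, C(P,S,W)
```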