[Hero image: a teacher teaching addition to a child and a tiny robot]

Curriculum Pretraining Enables 10-Digit Addition for a 296-Parameter GPT with 99% Accuracy

A 296-parameter GPT learns to add 10-digit numbers not by changing the architecture, but by changing the training recipe.

Abstract

Can a transformer with fewer than 300 parameters reliably solve 10-digit addition? The answer turns out to be yes, if you train it right. This post describes AdditionGPT, a minimal causal transformer that treats digit-wise addition as a sequence classification task. The key insight is that the architecture need not change at all: a two-stage training recipe, curriculum-style pre-training on variable-length addition followed by fine-tuning, is sufficient for a sub-300-parameter GPT to achieve 99% test accuracy on 10-digit addition....
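To make the "digit-wise addition as sequence classification" framing concrete, here is a minimal sketch of how such training examples might be constructed. This is an illustrative assumption, not the post's actual data pipeline: the function name and the exact interleaving are hypothetical. The key ideas it shows are zero-padding to a fixed width and reversing the digits so the least-significant digit comes first, which lets a causal model predict each sum digit from the digits (and carries) it has already seen.

```python
# Hypothetical data-format sketch (not AdditionGPT's actual code):
# each example pairs the digits of two addends with the digits of
# their sum, least-significant digit first.
def make_example(a: int, b: int, n_digits: int = 10):
    # Zero-pad to a fixed width, then reverse so position i holds the 10^i digit.
    xs = [int(d) for d in f"{a:0{n_digits}d}"[::-1]]
    ys = [int(d) for d in f"{b:0{n_digits}d}"[::-1]]
    # The sum of two n-digit numbers has at most n+1 digits.
    zs = [int(d) for d in f"{a + b:0{n_digits + 1}d}"[::-1]]
    # Inputs pair up addend digits per position; targets are the sum digits,
    # one classification (0-9) per output position.
    return list(zip(xs, ys)), zs

inputs, targets = make_example(1234567890, 9876543210)
```

With digits reversed, the carry at each position depends only on earlier positions in the sequence, so a left-to-right causal transformer can resolve it without looking ahead.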