Why all ARC-AGI solvers fail today (and how to fix them)
Update 5th Dec 2025: I tested this hypothesis out and broke the Pareto frontier on ARC-1. Here’s its twitter thread and here’s the code. These experiments are still in progress and I expect performance to hit 30%; currently it’s at 12%. Also, here’s the original twitter thread of this blog post, which gives a good summary of it.
Every solver today fails because it leaves some data uncompressed.
Fixing this will boost accuracy and allow us to drop hard-coded tricks like augmentations.
Outline:
- there are 3 uncompressed data sources
- compressing each one will boost performance mathematically
- this explains why “test time training” is necessary
- augmentations / other tricks are unnecessary and can backfire
- simple predictions you can falsify
Introduction
The ideas in this blog are built on the MDL principle:
Minimum Description Length
The best AI model describes the data (and itself) in the fewest bits
This act of describing lots of data in very few “bits” is called compression.
MDL says that a more intelligent AI model will compress its training dataset better
(Btw: If “compression” sounds weird anywhere, replace it with “is trained to predict”)
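To make that substitution concrete, here is a minimal sketch (my own illustration, not from any solver): under arithmetic coding, the number of bits a model spends on data is its negative log-likelihood, so better prediction literally is better compression.

```python
import math

def description_length_bits(probs):
    """Total code length (in bits) a predictive model assigns to a sequence.
    probs[i] is the probability the model gave to the symbol that actually
    occurred at step i; under arithmetic coding this sum is (essentially)
    the size of the compressed file."""
    return sum(-math.log2(p) for p in probs)

# Hypothetical per-symbol probabilities from two models on the same data.
good_model = [0.9, 0.8, 0.95, 0.9]
bad_model = [0.3, 0.2, 0.25, 0.3]

print(description_length_bits(good_model))  # ~0.7 bits  -> strong compression
print(description_length_bits(bad_model))   # ~7.8 bits  -> weak compression
```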
ARC-AGI is a benchmark that tests how well your AI model can solve simple spatial puzzles.
Every AI model today fails at it.
I claim this is because no AI model compresses all data sources provided by ARC.
Specifically, there are 3 uncompressed sources:
- example input grids
- test input grids
- private puzzles
Instead of fixing this flaw, approaches today rely on human scaffolding
(This means hardcoded data augmentations, special architectures and custom DSLs)
This is bad - it is not scalable, is mathematically unnecessary, and can backfire
But first, let’s look at the
Uncompressed data sources
We’ll explore each data source using this naive approach1:
Naive AI model:
Given an ARC puzzle,
Find the shortest program $P$ such that $P(\text{Input}) = \text{Output}$ for all example pairs.
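A hedged sketch of what this naive solver looks like in code, assuming a hypothetical interpreter `run_program` and an enumeration of candidate programs in order of increasing length:

```python
def naive_solver(example_pairs, candidate_programs, run_program):
    """Return the shortest program P with P(input) == output on every example pair.

    `candidate_programs` must be iterable in order of increasing description
    length, and `run_program(P, grid)` is a hypothetical interpreter for the
    program space (a DSL, or a neural net indexed by its weights).
    """
    for P in candidate_programs:  # shortest candidates come first
        if all(run_program(P, inp) == out for inp, out in example_pairs):
            return P  # note: the example inputs themselves were never compressed
    return None
```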
This looks like MDL, but it is mathematically wrong.
It leaves all 3 sources uncompressed
Source 1: Example inputs
In the naive solver, $P$ consumes the input and produces the output.
This is bad - it completely fails to compress the input grids.
Proof:
Let’s write the training data as
\[\boxed{D_E = \big(I_1, O_1,\; I_2, O_2,\; \dots,\; I_n, O_n\big).}\]
Each grid is expensive to describe (~=3000 bits), so $D_E$ is long.
Naive method:
We find the shortest $P$ such that $P(I_k) = O_k$ for all $k$.
We can now shorten the description to
\[\boxed{D_E = \big(I_1, I_2, \dots, I_n,\; \text{"apply"}, P\big).}\]
Look closely:
This only replaces the outputs by a short program.
The inputs $I_1,\dots,I_n$ are still listed verbatim, so they are not compressed at all.
This is terrible because the input grids in ARC are highly regular and compressible. This means a shorter description definitely exists.
(Eg: In Fig 2, all example inputs share obvious structure: same kinds of red squares, same object layouts, same background patterns, etc. These patterns can be used to compress the input.)
In short, the naive method violates MDL: we know $D_E$ can be shortened further.2
Better alternative:
Instead, let’s find a program $F$ such that $F(k) = (I_k, O_k)$
Note the difference: the old one was $P(I_k) = O_k$
Immediately we have a much shorter description3:
\[\boxed{D_E = \big(\text{"for every integer from 1 to n, apply"}, F\big).}\]

Here $F$ jointly compresses both the inputs and the outputs. Any regularity that appears in the inputs, the outputs, or their relationship can be folded into a single short program.
This is strictly a better compression4 than only explaining $O_k$ given $I_k$.
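A sketch of the two objectives side by side (illustrative only; `log2_prob_output_given_input` and `log2_prob_pair` are hypothetical methods of whatever probabilistic model you are fitting):

```python
def naive_codelength(model, pairs):
    """Bits paid by the naive objective: only -log2 p(O_k | I_k).
    The inputs I_k are shipped verbatim and never compressed."""
    # `model` is a hypothetical probabilistic model of grids.
    return sum(-model.log2_prob_output_given_input(I, O) for I, O in pairs)

def joint_codelength(model, pairs):
    """Bits paid by the MDL objective: -log2 p(I_k, O_k).
    Regularities in the inputs now also shorten the description."""
    return sum(-model.log2_prob_pair(I, O) for I, O in pairs)
```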
The only approach that compresses the example inputs jointly like this is CompressARC (though it still suffers from Source 2 below).
We can generalise this concept:
Whenever datasets $A$ and $B$ share mutual information,
compressing them together is shorter (in bits) than compressing them separately.5 6
Think of it as the compressor exploiting common patterns in $A$ and $B$.
ARC has a lot of such common patterns to exploit.
Eg: Both the inputs and outputs are grids built from the same kinds of objects, shapes, colors, symmetries, etc.
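In Kolmogorov-complexity notation this is the standard symmetry-of-information bound, stated loosely up to logarithmic terms:
\[K(A, B) \;\le\; K(A) + K(B \mid A) \;\le\; K(A) + K(B),\]
and the saving from compressing jointly rather than separately is roughly the algorithmic mutual information $I(A\!:\!B) \approx K(A) + K(B) - K(A, B)$.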
This generalised principle makes it easy to prove the MDL violation for the next 2 uncompressed sources:
Source 2: Private puzzles
There is a HUGE amount of mutual information among all puzzles.
- Every puzzle in ARC is built using the same basic concepts like objectness, geometric transforms, etc.
- Crucially, almost all concepts needed for the eval puzzles already appear in the train puzzles. Eg: occlusion, symmetry, color swaps, etc.
Now we apply the same (generalised) argument:
Because every puzzle shares common patterns, MDL demands compressing ALL of them together, including the private ones.
Hence, the naive solver fails again - it trains on each puzzle separately. (This also applies to “no-pretraining” methods like CompressARC and NCAs)
Test time training
The need to compress private puzzles implies that test time training is 100% necessary.
Every approach must train on the public puzzles offline AND on the private puzzles during runtime.
However, all current “test time training” methods are suboptimal because they ignore data sources 1 and 3
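A minimal sketch of the loop this section argues for (illustrative pseudostructure, not any particular team’s pipeline; `model`, `optimiser`, `compression_loss` and the puzzle objects are all hypothetical stand-ins): train offline on the public puzzles, then keep training on everything visible in the private puzzle at runtime.

```python
def solve_with_test_time_training(model, optimiser, compression_loss,
                                  public_puzzles, private_puzzle, steps=100):
    """Offline: compress the public puzzles.
    At test time: keep compressing the private puzzle itself (its example
    pairs AND its test input) before predicting the test output."""
    # Offline phase: train on every public puzzle.
    for puzzle in public_puzzles:
        optimiser.step(model, compression_loss(model, puzzle))

    # Test-time-training phase: the private puzzle is also data, so compress it.
    for _ in range(steps):
        optimiser.step(model, compression_loss(model, private_puzzle))

    return model.predict(private_puzzle.test_input)
```

The important part is the second loop: without it, the private puzzles and their test inputs (sources 2 and 3) are never compressed.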
Source 3: Test inputs
The test grid in the figure clearly shares patterns with the example grids. Hence, MDL says compressing them jointly will improve performance (same mutual info argument). Obviously, this proves that the naive method fails too.
Ideally we’d be done here, but there’s a catch:
Test inputs are distribution shifted
In a typical ARC puzzle,
If the example pairs are sampled from some distribution $X$, then the test grids are sampled from a transformed distribution $\mathcal{f}(X)$. Eg: In fig 3, the test grid is rotated 90 degrees.
This is bad.
Vanilla MDL applies ONLY when training and inference data come from the same distribution. To make MDL applicable, we need to train on a dataset sampled from the union of $X$ and $\mathcal{f}(X)$.
Luckily, jointly compressing the example pairs and the test input grids does exactly this.
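Concretely, one way to write the objective this implies (my notation; $p_\theta$ is whatever model/compressor you are fitting) is
\[L(\theta) \;=\; \sum_{k=1}^{n} -\log p_\theta(I_k, O_k) \;+\; \big(-\log p_\theta(I_{\text{test}})\big),\]
so the training samples now come from both $X$ (the example pairs) and $\mathcal{f}(X)$ (the test input), which is exactly the union MDL needs.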
“Doesn’t unsupervised generalisation happen across different distributions?”
No, generalisation only happens AFTER you jointly compress some part of the new distribution.
Eg: Base LLMs are terrible chatbots. They only become good at chatting AFTER they are trained on chat conversations (supervised finetuning).
Wait, then most ARC solvers are dead on arrival!
Yeah, the only approach today that compresses the test inputs is CompressARC. The rest should all score 0, since they can’t generalise across the distribution shift.
Obviously, many of them do get a non-zero score. This is because:
1. In some puzzles $\mathcal{f}$ is an identity transform. Here, MDL works
2. Some puzzles reuse $\mathcal{f}$ from the example grids of other puzzles. (This makes $\mathcal{f}$ a part of $X$, hence MDL works)
3. Some approaches are designed with hardcoded augmentations or special DSLs/architectures.
(3) is actually very problematic.
Special architectures/DSLs essentially bake $\mathcal{f}$ into the model itself. Augmentations expand $X$, making it a superset of $\mathcal{f}(X)$. Both these tricks allow MDL to apply again.
However such human intervention is anti-bitter lesson.
It is not scalable and goes against the spirit of the benchmark. It should not be done.
We discuss this below.
Human interventions
Hardcoding augmentations (synthetic data)
Augmentations can make a lot of puzzles trivial to solve. We just discussed why this happens - they expand the train distribution to cover the test distribution. For example, a tiny NCA model immediately solves this when augmented:
But augmentations can also make a puzzle impossible to solve. In this puzzle for example, an AI model might memorise brown as the filling color. It won’t know what to do when it sees the test input:
Theoretically, augmentations are unnecessary. By definition, a Kolmogorov compressor with access to all data will solve the puzzle without augmentations.7 Augmentations can only worsen its performance (by contradicting correct information)
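For concreteness, here is roughly what such hardcoded augmentations look like (a generic sketch, not any specific solver’s pipeline): random dihedral transforms plus colour permutations, i.e. a human guessing parts of $\mathcal{f}$ in advance and expanding $X$ with them.

```python
import numpy as np

def hardcoded_augmentation(grid, rng):
    """A typical hand-designed ARC augmentation: a random rotation/reflection
    (the dihedral group D8) plus a random relabelling of the 10 colours.
    This expands the training distribution X toward f(X), but only for the
    transforms the human thought to include."""
    grid = np.rot90(grid, k=rng.integers(4))   # random 0/90/180/270 degree rotation
    if rng.integers(2):
        grid = np.fliplr(grid)                 # random horizontal flip
    palette = rng.permutation(10)              # random colour relabelling
    return palette[grid]

rng = np.random.default_rng(0)
example_input = np.array([[0, 1], [2, 3]])
print(hardcoded_augmentation(example_input, rng))
```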
So if your AI model needs augmentations, that means:
- it is a weak compressor (augmentation transforms the data into a format it recognises), or
- it forgot to compress some necessary source of data (and the augmentation is accidentally leaking bits from here)
Setting the math aside, the main problem is that the augmentation is hardcoded.
In general, anything intentionally designed by a human:
- moves some of the “intelligence” from the AI model into the human who designed the trick
- and is not scalable (the test maker can always create a new puzzle that is out of distribution)
- might fail by contradicting info not available to the human (eg: private puzzles)
Hardcoded architectures/DSLs
(By “architecture” I mean the neural network design and by “DSLs” I mean the hand-picked set of operations allowed in program synthesis)
Same problem as augmentations: doing this smuggles in extra information about the task that never appears in the data. This can make some puzzles trivial. It can also make some puzzles impossible to solve
For example, the puzzle above requires partial color equivariance. If you bake it into the architecture, then it becomes trivial. In an ideal world, your AI model should learn this equivariance by itself.
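As an illustration of what “baking it in” means (my own sketch, not the architecture of any listed solver), hardcoded colour equivariance amounts to symmetrising the model over colour relabellings; `model_predict` is a hypothetical grid-to-grid predictor.

```python
import itertools
import numpy as np

def colour_equivariant_predict(model_predict, grid, colours=(1, 2, 3)):
    """Hardcoded (partial) colour equivariance by symmetrisation: relabel a
    subset of colours, run the model, undo the relabelling, and take a
    per-cell majority vote over all relabellings. The knowledge that colour
    identity doesn't matter is supplied by the human here, not learned."""
    votes = []
    for perm in itertools.permutations(colours):
        forward = dict(zip(colours, perm))
        inverse = dict(zip(perm, colours))
        relabelled = np.vectorize(lambda c: forward.get(int(c), int(c)))(grid)
        prediction = model_predict(relabelled)
        votes.append(np.vectorize(lambda c: inverse.get(int(c), int(c)))(prediction))
    votes = np.stack(votes)  # shape: (num_relabellings, height, width)
    # Majority vote per cell across all colour relabellings.
    return np.apply_along_axis(
        lambda v: np.bincount(v, minlength=10).argmax(), 0, votes)
```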
Testable predictions
Predictions:
1. Every solver will improve once it starts compressing currently ignored data sources
2. Once those are compressed, handcrafted tricks can be removed without losing performance
3. For some methods, performance will improve when you remove the tricks
The modifications required to execute (2) and (3) might be very complex.
The appendix lists the exact problems of each approach
Conclusion
Scores on ARC-AGI can be improved
You just gotta compress everything
I’m gonna test this in the next few days.
Appendix
List of problems in each approach
| Approach | Uncompressed Test Input | Uncompressed Example Inputs | Uncompressed Private Puzzles | Human Intervention | Comments |
|---|---|---|---|---|---|
| MindsAI (TTFT, AIRV) | ✓ | ✓ | ✓ (example outputs are compressed though) | Augmented puzzles | |
| Combining induction and transduction (both approaches) | ✓ | ✓ | ✓ | Large synthetic dataset | |
| CompressARC | | | ✓ | special decoder architecture with ARC-related biases | |
| TRM / HRM | ✓ | ✓ | ✓ (example outputs are compressed though) | Augmented puzzles | |
| ARChitects | ✓ | ✓ | ✓ (example outputs are compressed though) | Augmentations | |
| Surprising Effectiveness of Test-Time Training | ✓ | ? | ✓ (coz test inputs uncompressed) | Synthetic dataset | training on example inputs performed worse for them, but lots of confounding factors |
| Ouellette’ Neurally-Guided Program Induction | ✓ | ✓ | ✓ | hardcoded DSL + synthetic data | |
| OmniARC | ✓ | ✓ | ✓ (example outputs are compressed though) | puzzle augmentations, extra datasets | TBD |
| Latent program network | ✓ | ✓ | ✓ | Augmented puzzles | Searches in latent space for a function that solves the example pairs; encoder/decoder conditioned on the train set. |
| Greenblatt’s “Draw more samples” | ✓ (only on private set) | ✓ (only on private set) | ✓ | Some, mentioned here | Frontier LLMs likely trained on public set |
| Evolutionary Test-time compute | ✓ (only on private set) | ✓ (only on private set) | ✓ | - | Frontier LLMs likely trained on public set |
| Efficient Evolutionary Program Synthesis | ✓ (only on private set) | ✓ (only on private set) | ✓ | - | Frontier LLMs likely trained on public set |
| Icecuber | ✓ | ✓ | ✓ | Intentionally designed DSL | |
| A 2D nGPT Model for ARC Prize | ✓ | ✓ | ✓ (example outputs are compressed though) | Augmented puzzles | |
| Partial Functions | ? | ✓ | ✓ | - | (Theoretical approach) Domain restriction might cause some test input compression |
Everything below applies to deep learning just as well as to program synthesis. Gradient descent is just a locally aware search algorithm. A neural network is a set of programs: each unique weight combination corresponds to one program ↩
There are some technical caveats here. It is possible to find a pathological Turing machine where this is the best compression, but a more careful application of MDL avoids this. ↩
These comparisons hold asymptotically. The original length is ~$O(2n)$ bits. The naive solver costs ~$O(n) + O(1)$ bits. The alternate solver costs $O(\log n)$ bits ↩
Technically, this is strict only in the limit because of the constant term from the cost of $F$ and the “apply …” instruction ↩
with some caveats $KC(X|Y) \le KC(X) + O(1)$ ↩
Technically, there are 2 types of compression possible here: conditional and joint. We use them interchangeably because you can prove that both are better than isolated compression. ↩
Unless you’re in a pathological Turing machine. If that’s the case, move to a better one. ↩
