Symmetry in subword segmentation

Language Log 2025-08-30

Olga Pelloni et al., "Subword Symmetry in Natural Languages," Royal Society Open Science 12 (August 21, 2025).

Abstract

Symmetric patterns are found in the orderly arrangements of natural structures, from proteins to the symmetry in animals’ bodies. Symmetric structures are more stable and easier to describe and compress, which is why they may have been preferred as building blocks in natural selection. The idea that natural languages undergo an evolutionary process akin to the evolution of species has been pervasive in the study of language. This process might result in symmetric patterns as in other natural structures, but the notion of symmetry is rarely associated with the study of natural language. In this study, we look for symmetric patterns in text data, considering the length of subword units under a range of possible subword analyses. We study the length of subword units in 32 languages and discover that the splits of long words tend to be symmetric regardless of the segmentation method and that some automatic methods give symmetric splits at all word lengths. These results include natural language in the set of phenomena that can be described in terms of symmetry, opening a new research avenue for the empirical study of text data as a structure comparable to various other structures in the natural world.

Discussion

Our findings suggest that natural language can be described in terms of symmetry like the morphology of different physical objects in natural science or more abstract objects and processes in psychology, cognitive sciences and artificial intelligence. Previous studies of perceptual organization in gestalt psychology, for instance, named symmetry as one of the key principles involved in the perceptual grouping of objects. The human mind will tend to perceive a visual field consisting of multiple objects as a single figure if the shape of the visual field is symmetrical [68]. The speed of visual processing is also shown to be impacted by symmetry: symmetric shapes are processed faster than asymmetric ones, and vertically symmetric shapes, in particular, are preferred starting from infancy [69]. The perception principles known for human cognition have found application in artificial intelligence as well, for example, for building AI agents that use symmetry as the organizing principle for enhanced visual reasoning [70]. Here, we discuss the broader implications of our findings for the theoretical study of natural language and for its computational modelling.

Conclusion

We have introduced a new perspective on the study of subword units in natural languages manifested as text samples. Regarding text as a one-dimensional object whose main property is length, we show that the lengths of subword units tend to be more symmetric in long words than in short ones. This distinction holds for all subword segmentation methods that we considered, including manual segmentation. A comparison of segmentation methods reveals the impact of the modelling choice on the symmetry of the resulting segments: naive methods that rely on raw frequency produce more symmetric units than probabilistic methods. Manual segmentation turns out to be relatively symmetric only in the sense of evenness (not strict symmetry). The most symmetric units are produced by an automatic subword segmentation method (BPE-MR) that minimizes text redundancy.
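The notion of symmetry here concerns the lengths of a word's subword segments read left-to-right versus right-to-left. As a minimal sketch of one way to quantify this (an illustrative metric of my own, not necessarily the exact measure used in the paper):

```python
def length_symmetry(segments):
    """Score in [0, 1]: 1.0 when the sequence of segment lengths is a palindrome.

    Compares the length of the i-th segment from the left with the i-th
    segment from the right and penalizes mismatches, normalized by total
    length. A hypothetical, illustrative metric, not the paper's own.
    """
    lengths = [len(s) for s in segments]
    total = sum(lengths)
    if total == 0:
        return 1.0
    # Sum of absolute differences between mirror-image positions.
    mismatch = sum(abs(a - b) for a, b in zip(lengths, reversed(lengths)))
    # Each mismatch is counted twice by the zip, hence the factor of 2.
    return 1.0 - mismatch / (2 * total)

# A perfectly even split (lengths 2, 4, 2) scores 1.0:
print(length_symmetry(["un", "help", "ly"]))
# A lopsided split (lengths 1, 9) scores much lower:
print(length_symmetry(["a", "symmetric"]))
```

Under a score like this, the paper's finding would read: averaged over a corpus, splits of long words score closer to 1.0 than splits of short words, regardless of how the splits were produced.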

Our findings are in line with the known opposition between frequency and regularity discussed in the linguistic literature and contribute new insights concerning the uniform information density hypothesis. They also provide a scientific basis for dealing with the problem of subword tokenization in large language models, suggesting a simple and principled method for achieving symmetry in subword segmentation in any language.

An earnest attempt to build bridges between cognitive and physical sciences.


Selected readings

[Thanks to Ted McClure]