TY - JOUR

T1 - A scaling law beyond Zipf's law and its relation to Heaps' law

AU - Font-Clos, Francesc

AU - Boleda, Gemma

AU - Corral, Álvaro

PY - 2013/9/1

Y1 - 2013/9/1

N2 - The dependence on text length of the statistical properties of word occurrences has long been considered a severe limitation on the usefulness of quantitative linguistics. We propose a simple scaling form for the distribution of absolute word frequencies that brings to light the robustness of this distribution as text grows. In this way, the shape of the distribution is always the same, and it is only a scale parameter that increases (linearly) with text length. By analyzing very long novels we show that this behavior holds both for raw, unlemmatized texts and for lemmatized texts. In the latter case, the distribution of frequencies is well approximated by a double power law, maintaining the Zipf's exponent value γ ≃ 2 for large frequencies but yielding a smaller exponent in the low-frequency regime. The growth of the distribution with text length allows us to estimate the size of the vocabulary at each step and to propose a generic alternative to Heaps' law, which turns out to be intimately connected to the distribution of frequencies, thanks to its scaling behavior. © IOP Publishing and Deutsche Physikalische Gesellschaft.

DO - 10.1088/1367-2630/15/9/093033

M3 - Article

VL - 15

JO - New Journal of Physics

JF - New Journal of Physics

SN - 1367-2630

M1 - 093033

ER -