Large-scale analysis of Zipf's law in English texts

Isabel Moreno-Sánchez, Francesc Font-Clos, Álvaro Corral

    Research output: Contribution to journal, Article, Research, peer-review

    46 Citations (Scopus)

    Abstract

    © 2016 Moreno-Sánchez et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

    Despite being a paradigm of quantitative linguistics, Zipf's law for words suffers from three main problems: its formulation is ambiguous, its validity has not been tested rigorously from a statistical point of view, and it has not been tested against a representatively large number of texts. So, we can summarize the current support of Zipf's law in texts as anecdotal. We try to solve these issues by studying three different versions of Zipf's law and fitting them to all available English texts in the Project Gutenberg database (consisting of more than 30 000 texts). To do so, we use state-of-the-art tools in fitting and goodness-of-fit tests, carefully tailored to the peculiarities of text statistics. Remarkably, one of the three versions of Zipf's law, consisting of a pure power-law form in the complementary cumulative distribution function of word frequencies, is able to fit more than 40% of the texts in the database (at the 0.05 significance level), for the whole domain of frequencies (from 1 to the maximum value), and with only one free parameter (the exponent).
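
The one-parameter fit mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the formula, so the form used here is an assumption: the complementary cumulative distribution of word frequencies is taken to be S(n) = n^(-gamma) for n >= 1, which implies a probability mass f(n) = n^(-gamma) - (n+1)^(-gamma). The symbol gamma, the function names, the tokenizer, and the golden-section search are all illustrative choices and not necessarily the authors' procedure. The goodness-of-fit test is not covered.

```python
import math
import random
import re
from collections import Counter


def word_frequencies(text):
    """Return the absolute frequency n of each word type in `text`."""
    words = re.findall(r"[a-z']+", text.lower())  # crude tokenizer (assumption)
    return list(Counter(words).values())


def log_likelihood(gamma, freq_counts):
    """Log-likelihood under the assumed form S(n) = n**(-gamma), n >= 1.

    `freq_counts` maps each frequency n to the number of word types having it.
    log f(n) = -gamma*log(n) + log(1 - (n/(n+1))**gamma), evaluated stably.
    """
    total = 0.0
    for n, types in freq_counts.items():
        one_minus_ratio = -math.expm1(-gamma * math.log1p(1.0 / n))
        total += types * (-gamma * math.log(n) + math.log(one_minus_ratio))
    return total


def fit_exponent(frequencies, lo=0.01, hi=10.0, tol=1e-6):
    """Maximum-likelihood estimate of gamma by golden-section search.

    The log-likelihood is concave in gamma, so a bracketing search suffices.
    """
    freq_counts = Counter(frequencies)
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if log_likelihood(c, freq_counts) > log_likelihood(d, freq_counts):
            b = d  # maximum lies in [a, d]
        else:
            a = c  # maximum lies in [c, b]
    return (a + b) / 2


def sample_frequencies(gamma, size, seed=0):
    """Draw synthetic frequencies with P(N >= n) = n**(-gamma) (inverse transform)."""
    rng = random.Random(seed)
    # 1 - rng.random() lies in (0, 1], so the power below is always finite.
    return [int((1.0 - rng.random()) ** (-1.0 / gamma)) for _ in range(size)]
```

For a real text, `fit_exponent(word_frequencies(text))` would return the estimated exponent. `sample_frequencies` is only there to check the fitting routine on synthetic data with a known exponent.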
    Original language: English
    Article number: e0147073
    Journal: PLoS ONE
    Volume: 11
    Issue number: 1
    Publication status: Published - 1 Jan 2016
