_posts/WP_2025-05-11-Zipf-s-Law-on-La-Comedie-humaine.md
4 additions & 4 deletions
@@ -11,7 +11,7 @@ Where:
 - $f(r)$ is the frequency of a word with rank $r$
 - $r$ is the rank of the word when all words are arranged by decreasing frequency

-It's worth noting that this law was first investigated by the French stenographer Jean-Baptiste Estoup (left/top) in 1916 and later extended and widely popularized by the American linguist George Kingsley Zipf (right/bottom).
+It's worth noting that this law was first investigated by the French stenographer Jean-Baptiste Estoup [left/top] in 1916 and later extended and widely popularized by the American linguist George Kingsley Zipf [right/bottom].
@@ -20,7 +20,7 @@ It's worth noting that this law was first investigated by the French stenographe

 This low applied to words may suggest that language follows a principle of least effort, where communicators balance the desire for precision ($r$ is large) with efficiency ($r$ is small). Zipf wrote:

-> The power laws in linguistics and in other human systems reflect an economical rule: everything carried out by human beings and other biological entities must be done with least effort (at least statistically)
+> The power laws in linguistics and in other human systems reflect an economical rule: everything carried out by human beings and other biological entities must be done with least effort [at least statistically]

 More generally, the relationship is written as:

@@ -34,7 +34,7 @@ In a log-log plot, Zipf's law appears as a straight line with slope $-\alpha$:

 $$\log(f(r)) = \log(C) - \alpha \log(r)$$

-Let's check how this applies to a text corpus. For this, we are using Honoré de Balzac's extensive series of books called "La Comédie Humaine" (in French). It's a collection of many novels depicting French society, written between 1829 and 1850.
+Let's check how this applies to a text corpus. For this, we are using Honoré de Balzac's extensive series of books called "La Comédie Humaine" [in French]. It's a collection of many novels depicting French society, written between 1829 and 1850.
-Looking at the visualization, we can observe that Zipf's law doesn't fit very well for the highest-ranked words ("de", "la", "le", "à", "et", ...). The most frequent words tend to be less frequent than what the simple power law predicts. And because we gave these words the largest weights in the regression, this deteriorates the fit. Maybe the Zipf-Mandelbrot law might provide a better fit. It is a more general formula that includes an additional parameter:
+Looking at the visualization, we can observe that Zipf's law doesn't fit very well for the highest-ranked words ["de", "la", "le", "à", "et", ...]. The most frequent words tend to be less frequent than what the simple power law predicts. And because we gave these words the largest weights in the regression, this deteriorates the fit. Maybe the Zipf-Mandelbrot law might provide a better fit. It is a more general formula that includes an additional parameter: