Updated Vocabulary Coverage Statistics

In various mailing list posts, blog posts and talks, I’ve shown vocabulary coverage statistics. It’s time to update the code to use more recent data and republish the results here.

The vocabulary coverage tables have a number of different parameters:

what are the items being learnt: lexemes or forms or something else?
what are the targets: verses or sentences or something else?
what ordering is being used: item frequency or something else?

and, of course, what text and lemmatization is being used.

Most of my previously published stats were based on the UBS3 version of MorphGNT. Here I’m going to use the latest MorphGNT based on the SBLGNT (MorphGNT 6.06), and I’m going to explore not just verses but (in follow-up posts) clauses and sentences from the GBI Syntax Trees and paragraphs from the SBLGNT.

I also want to start incorporating the information from my morphological lexicon into the item/target modeling and ordering algorithms.

But first let’s just update the basic stats.

Verses-Lexemes with Frequency Ordering

A target-item file for verses-lexemes can be generated with:

awk '{print $1,$7}' sblgnt/*-morphgnt.txt

If we then feed that to vocab-coverage.py, we get the following result:

             ANY    50.00%    75.00%    90.00%    95.00%   100.00%
------------------------------------------------------------------
   100    99.91%    91.07%    24.36%     2.13%     0.64%     0.48%
   200    99.92%    96.83%    51.80%     9.75%     3.43%     2.54%
   500    99.97%    99.13%    82.23%    36.57%    17.81%    13.81%
  1000    99.99%    99.71%    93.60%    62.57%    37.28%    29.99%
  2000   100.00%    99.92%    98.41%    84.95%    65.38%    56.43%
  5000   100.00%   100.00%   100.00%    99.51%    96.44%    94.58%
   ALL   100.00%   100.00%   100.00%   100.00%   100.00%   100.00%

What this table says is that if you learn, say, the 200 most frequent lexemes, you’ll be able to read at least 95% of the lexemes in 3.43% of verses.
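To make the calculation behind a table like this concrete, here is a minimal sketch in Python. It is not the actual vocab-coverage.py; the function name, its parameters, and the choice to count coverage per token (rather than per distinct item) within each target are my assumptions for illustration.

```python
from collections import Counter, defaultdict


def coverage_table(pairs, item_counts, percents=(50, 75, 90, 95, 100)):
    """Sketch of a coverage table (hypothetical helper, not vocab-coverage.py).

    pairs: iterable of (target, item) tuples, one per token,
           e.g. the two columns printed by the awk command.
    item_counts: how many of the most frequent items are "learnt";
                 None means all items.
    """
    freq = Counter()
    targets = defaultdict(list)
    for target, item in pairs:
        freq[item] += 1
        targets[target].append(item)

    # Frequency ordering: most frequent item first.
    ranked = [item for item, _ in freq.most_common()]

    table = {}
    for n in item_counts:
        known = set(ranked[:n])
        # (known tokens, total tokens) for each target. Assumption: per token.
        counts = [
            (sum(item in known for item in items), len(items))
            for items in targets.values()
        ]
        total = len(counts)
        # ANY: targets containing at least one known item.
        row = {"ANY": sum(k > 0 for k, _ in counts) / total}
        for pct in percents:
            # Integer comparison avoids floating-point edge cases at the threshold.
            row[pct] = sum(k * 100 >= pct * size for k, size in counts) / total
        table[n] = row
    return table
```

Each cell is the fraction of targets in which at least the given percentage of tokens are among the first N items, which is why the columns decrease from left to right and the rows increase from top to bottom.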

Verses-Forms with Frequency Ordering

A target-item file for verses-forms can be generated with:

awk '{print $1,$6}' sblgnt/*-morphgnt.txt

If we then feed that to vocab-coverage.py, but with 10000 added as an item count, we get the following result:

             ANY    50.00%    75.00%    90.00%    95.00%   100.00%
------------------------------------------------------------------
   100    99.82%    57.63%     1.10%     0.04%     0.01%     0.01%
   200    99.86%    78.86%     6.51%     0.34%     0.05%     0.05%
   500    99.91%    92.85%    26.95%     2.23%     0.59%     0.52%
  1000    99.94%    96.95%    51.23%     7.75%     2.31%     1.74%
  2000    99.96%    98.65%    72.52%    21.74%     7.86%     5.80%
  5000    99.97%    99.74%    90.97%    52.13%    28.52%    21.61%
 10000   100.00%    99.94%    98.31%    78.28%    55.19%    45.28%
   ALL   100.00%   100.00%   100.00%   100.00%   100.00%   100.00%

What this table says is that if you learn, say, the 500 most frequent forms, you’ll be able to read at least 75% of the forms in 26.95% of verses.
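The two awk commands differ only in which column is taken as the item: $7 for the lexeme and $6 for the form, with $1 as the verse reference in both cases. For anyone who would rather stay in Python, the same extraction might look like the sketch below. The function name and its parameter are hypothetical, and unlike awk it skips lines that are too short instead of printing an empty field.

```python
def target_item_pairs(lines, item_column):
    """Yield (verse reference, item) pairs from whitespace-separated lines.

    item_column is 1-based, like awk's $N: 6 for forms, 7 for lexemes.
    Hypothetical helper that mirrors awk '{print $1,$N}'.
    """
    for line in lines:
        fields = line.split()  # split on runs of whitespace, as awk does by default
        if len(fields) < item_column:
            continue  # skip blank or short lines
        yield fields[0], fields[item_column - 1]
```

Passing an open file (or several files chained together) as lines gives the same pairs as the awk one-liners, so switching between lexemes and forms is a one-argument change.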

Various talks, including those at BibleTech in 2010 and 2015, explain a ton of caveats around these numbers, but I wanted to at least refresh them (and the code) with the latest data.