2 February 2022

Non-literal matches: Jaccard distance

In a former entry we started to analyze Paul's non-literal quotations in Romans. Before continuing the list of such quotations, we need to learn how similarity is defined for non-literal matches.

The technique for comparing texts is well known from data mining. We combine two ideas, both of which can be found in the popular book Leskovec-Rajaraman-Ullman: Mining of Massive Datasets, Cambridge University Press, 2014, p. 77-78. The first idea is to create the 2-shingles of an input passage; the second is to use the Jaccard similarity for bags (see footnote 2 on page 77).

Let's illustrate the concept on Romans 15:11, which is a quotation of Psalms 117:1. (This is the last non-literal quotation in Romans, close to the end of the letter. We temporarily skip a couple of other non-literal quotations to get a clear example of the two ideas.)

First, issue the following commands:
  1. lookup SBLGNT Romans 15:11
  2. lookup LXX Psalms 117:1
It is quite clear that several parts of the two passages match literally. Indeed, the command getrefs SBLGNT LXX Psalms 117:1 finds two of them: “επαινεσατ” and “αυτον παντες οι λαοι”. The other literal parts, however, are not detected, because they are not unique in the LXX. These are “αινειτε”, “τον κυριον” and “παντα τα εθνη”. These short texts appear in this order in the LXX (1-2-3), but they are ordered differently in Romans (1-3-2). Moreover, the word “και” (“and”) is inserted in Romans. Finally, Paul uses a variant of the LXX word “επαινεσατε”: “επαινεσατωσαν” (the first 9 characters are literally the same). Luckily, the two ideas taken from data mining help us conclude that the two passages are still quite similar: they differ by only about 19%.

The first step is to create the 2-shingles of the passages:
  1. lookup2 SBLGNT Romans 15:11+8 15:11
  2. lookup1 LXX Psalms 117:1+9 117:1
The 2-shingles of a passage are all substrings of length 2 in the passage. For the quotation in Romans, written in Latin letters, they are ai, in, ne, ei, it, te, …, il, la and ao. Some substrings appear more than once, and we will count each appearance later. For the quoted string in Psalm 117 we find similar 2-shingles, but there are some differences. The following table shows the 2-shingles and the number of their appearances in the two passages:

                     ai in ne ei it te ep pa an nt ta at ae eu un nh ht to on nk ky yr ri io ka ie es sa tv vs na ay yt np so oi il la ao et he ea
Romans 15:11+8 15:11  3  2  2  1  1  2  2  3  3  2  2  2  1  1  1  1  1  2  3  2  1  1  1  1  1  1  2  2  1  1  1  1  1  1  1  2  1  1  1  0  0  0
Psalms 117:1+9 117:1  2  2  2  1  1  3  1  3  2  2  2  2  1  1  1  1  0  2  3  1  1  1  1  1  0  0  2  1  0  0  0  1  1  2  1  2  1  1  1  1  1  1
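
The counting can also be reproduced outside the tool. The following is a minimal Python sketch, not the tool's own implementation. It assumes the two passages in Latin letters with the spaces removed; the two strings were reconstructed by hand so that they agree with the 2-shingles listed above, and the name shingle_counts is made up for this illustration.

```python
from collections import Counter


def shingle_counts(text: str, k: int = 2) -> Counter:
    """Return the bag of k-shingles: every substring of length k with its count."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


# Assumption: hand-made transliterations of the two passages, spaces removed.
ROMANS = "aineite panta ta eunh ton kyrion kai epainesatvsan ayton pantes oi laoi".replace(" ", "")
PSALMS = "aineite ton kyrion panta ta eunh epainesate ayton pantes oi laoi".replace(" ", "")

romans_bag = shingle_counts(ROMANS)
psalms_bag = shingle_counts(PSALMS)

# A text of length n has n - 1 overlapping 2-shingles.
print(len(ROMANS), sum(romans_bag.values()))  # prints "60 59"
print(len(PSALMS), sum(psalms_bag.values()))  # prints "54 53"
print(romans_bag["ai"], psalms_bag["ai"])     # prints "3 2"
```

A Counter returns 0 for a shingle that does not occur, so psalms_bag["ka"] is 0 because the inserted word “και” appears only in Romans.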

Next, the numbers in each row are summed, giving 59 and 53. Each sum is 1 less than the length of the corresponding text (60 and 54, respectively; you can use the commands length2 and length1 to check this). After that, for each 2-shingle we take the smaller of its two counts:
        ai in ne ei it te ep pa an nt ta at ae eu un nh ht to on nk ky yr ri io ka ie es sa tv vs na ay yt np so oi il la ao et he ea
Minimum  2  2  2  1  1  2  1  3  2  2  2  2  1  1  1  1  0  2  3  1  1  1  1  1  0  0  2  1  0  0  0  1  1  1  1  2  1  1  1  0  0  0

The sum of these numbers is 48. Dividing it by the maximum of 59 and 53 gives the similarity ratio of the two texts: 48/59, which is about 81%. So the difference is about 19%.
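
The whole computation fits in a few lines of Python. This is only a sketch under assumptions: the two strings are hand-made transliterations of the passages (spaces removed), the function names are invented for this example, and the division by the larger total follows the description above rather than any particular library.

```python
from collections import Counter


def shingle_counts(text: str, k: int = 2) -> Counter:
    """Return the bag of k-shingles: every substring of length k with its count."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def bag_similarity(text_a: str, text_b: str, k: int = 2) -> float:
    """Shared shingle count divided by the larger of the two shingle totals."""
    bag_a = shingle_counts(text_a, k)
    bag_b = shingle_counts(text_b, k)
    # Counter's & operator keeps, for each shingle, the smaller of the two counts.
    shared = sum((bag_a & bag_b).values())
    larger_total = max(sum(bag_a.values()), sum(bag_b.values()))
    return shared / larger_total


# Assumption: hand-made transliterations of the two passages, spaces removed.
romans = "aineite panta ta eunh ton kyrion kai epainesatvsan ayton pantes oi laoi".replace(" ", "")
psalms = "aineite ton kyrion panta ta eunh epainesate ayton pantes oi laoi".replace(" ", "")

similarity = bag_similarity(romans, psalms)
print(f"similarity: {similarity:.0%}, difference: {1 - similarity:.0%}")
# prints "similarity: 81%, difference: 19%"
```

Because the sketch works on counts rather than positions, the reordering of the three short texts in Romans costs only the few shingles that span their boundaries.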

The same computation can be performed automatically by using the command jaccard12 (after both clipboards are set – they should be, if you followed the previous steps).

A short summary of the above result can be seen in this diagram:

This diagram, however, was not created in a fully automated way.


Zoltán Kovács
Linzer Zentrum für Mathematik Didaktik
Johannes Kepler Universität
Altenberger Strasse 54
A-4040 Linz