1 February 2022

Literal matches: the minunique and getrefs algorithms

In the previous entry we defined the notions of minimal uniquity and maximal extension of a passage. Usually we say that an Old Testament passage is minimally unique if it is unique in the Old Testament, but after removing its first or last character it is no longer unique anymore. Also, we say that a New Testament passage is maximally extended if it is a quotation of an Old Testament passage, but any extension of it (by continuing the text with previous or next consecutive characters from the same book of the New Testament) is no longer a quotation.

In this blog entry two algorithms are presented. The first one is called the minunique algorithm: it finds all minimally unique substrings in the input passage. The second one is the getrefs algorithm: it finds all passages in the New Testament that may be quotations of the input passage. Here simplified versions of both algorithms are provided, in a Python-like code style. (According to multiple sources, Python is the top programming language as of December 2021.) These algorithms are implemented, however, in C++, for the bibref tool.

The minunique algorithm

The first algorithm searches for candidates of substrings of the input passage, and if a candidate is indeed minimally unique, it prints it:

def minunique(input_passage):
  l = len(input_passage)
  for i in range(l):
    for j in range(l-i):
      U[i][j] = -1
  for i in range(l):
    for j in range(l-i):
      if i>0 and (U[i-1][j]>0 or U[i-1][j+1]>0):
        U[i][j] = 2
      else:
        candidate = input_passage[j:j+i-1]
        if is_unique(candidate):
          U[i][j] = 1
          print(candidate)
        else:
          U[i][j] = 0

To show an example run of the algorithm we consider the input passage abcdef with the assumption that ab and cde are minimally unique. We create the following table:

	1	2	3	4	5	6
1	a	b	c	d	e	f
2	ab↓	bc	cd	de	ef
3	abc↓	bcd	↙cde↓	def
4	abcd↓	↙bcde↓	↙cdef
5	abcde↓	↙bcdef
6	abcdef

The table contains all possible substrings of the input passage, in a triangular form. The colored entries are unique. The red entries are minimally unique. It is clear that a colored entry produces two colored entries below it (to the South and South-West), if there exist further texts in the next row. By using this property it is simple to fill the table from top-to-bottom and left-to-right and get the sought output.

Here is a minified version of the table shown above. This saves some space and it may explain the practical way of executing the algorithm better:

a	b	c	d	e	f
b	c	d	e	f
c	d	e	f
d	e	f
e	f
f

The advantage of this second table is that we only need to read off the texts upside down beginning with a green letter and stopping at a red one. Actually, the Python code fills the same table, with the only difference that black corresponds to 0, red to 1, and blue to 2.

Let's perform the same concept on the four middle words of Psalm 82:5-6: “εγω ειπα θεοι εστε και” – for the better readability we use Latin characters here.

You may want to check this result by using the bibref program. To achieve this:

Then, issue the following commands:

lookup1 LXX Psalms 82:5+72 82:6-17
minunique1 LXX

At this point we are already able to search for potential quotations that match a minimally unique passage. A potential quotation is not necesserily a quotation: it can be a false positive (in other words: a random match), or an uncertain match (if we cannot or cannot fully agree if it is written on purpose). Let us check the minimally unique text “eipau”:

latintext1 eipau
find1 SBLGNT

These are the obtained results:

Found in Luke 24:26+11 24:26-41 (book position 93880-93884)
Found in John 10:34+58 10:34-7 (book position 36128-36132)
Found in Acts 17:3+40 17:3-73 (book position 55952-55956)
Found in Acts 26:23 26:23-80 (book position 87546-87550)

In the previous blog entry we already mentioned John 10:34 as the matching passage from the New Testament, so this result was, of course, expected. The other three matches are, however, random appearances, and no extension of them can be made. In fact, most verified quotations are far longer than 5 characters, so such false positives can be filtered quite effectively with a mechanical search.

As a final part of this section, we prove that the minunique algorithm correctly identifies all substrings of the input passage. We show that at the end of the algorithm the value of \(U[i][j]\) contains the value 0 if the substring \([j:j+i-1]\) is not unique, 1 if it is minimally unique and 2 if it is unique (but not minimally). We use induction on \(i\):

Let \(i=0\), in this case \(U[0][j]\) is set to 1 or 0, depending on the uniqueness of the single character at position \(j\) in the input.
Now let \(i\) be arbitrary. The value of \(U[i][j]\) describes the passage \([j:j+i-1]\) which is unique if and only if the values \(U[i-1][j]\) and \(U[i-1][j+1]\) (describing the passages \([j:j+i-2]\) and \([j+1:j+i-1]\), respectively) are such that:
1. one of them is unique, in this case the passage \([j:j+i-1]\) is unique but not minimally, so the value of \(U[i][j]=\color{blue}2\) is correct,
2. neither of them is unique, but \([j:j+i-1]\) is already unique, so this passage is also minimally unique: in this case the value of \(U[i][j]=\color{red}1\) is correct again.
In the remaining case \(U[i][j]\) is set to 0, because the corresponding text is not unique.

The minunique algorithm makes a remarkable speedup on finding the minimally unique passages. In most cases uniqueness check for a string is an expensive operation. For example, in Psalm 82 in the passage above we have 21 characters of input – instead of computing 231 uniqueness operations we need only 124 to perform. Many of the further computations can also be avoided since not all of the 114 unique subpassages need to be searched but only for 7 minimally unique ones.

The getrefs algorithm

Now we are ready to build a second algorithm that finds all potential quotations for a given passage in the Old Testament. Some matches will be false positive, but the search is complete in the sense that no potential quotation is skipped. The algorithm is sketched again as a Python code snippet:

def getrefs(input_passage):
  for M in output_of(minunique(input_passage)):
    for F in output_of(find(M in SBLGNT)):
      print(extend(M))

Let's try the getrefs algorithm on Psalm 82:6: issue the command: getrefs SBLGNT LXX Psalms 82:6 – this results in a couple of lines of outputs, at last in the following three best matches:

…
LXX Psalms 82:6+11 82:6-13 = SBLGNT I_Thessalonians 5:5+23 5:5-31 (length=11, pos1=96148, pos2=6064)
LXX Psalms 82:6+18 82:6-6 = SBLGNT Luke 6:35+102 6:35-48 (length=11, pos1=96155, pos2=23621)
LXX Psalms 82:6 82:6-20 = SBLGNT John 10:34+55 10:34 (length=15, pos1=96137, pos2=36125)

This is somewhat promising: the expected verse John 10:34 is reported as the best match. On the other hand, sometimes we can obtain false positives on quite long passages as well. For example, checking Psalm 91:11 results in well-identified matches, however, a false positive is also found with a length of 19 characters. To check this case, issue getrefs SBLGNT LXX Psalms 91:11 which results in

LXX Psalms 91:11+24 91:11-41 = SBLGNT Revelation_of_John 11:8+54 11:8-55 (length=7, pos1=107380, pos2=20295)
LXX Psalms 91:11+24 91:11-40 = SBLGNT Acts 10:32+45 10:32-51 (length=8, pos1=107380, pos2=33969)
LXX Psalms 91:11+24 91:11-40 = SBLGNT Acts 10:5+57 10:5-4 (length=8, pos1=107380, pos2=31685)
LXX Psalms 91:11 91:11-63 = SBLGNT Romans 8:28+9 8:28-62 (length=9, pos1=107356, pos2=17205)
LXX Psalms 91:11+25 91:11-38 = SBLGNT Luke 17:2+32 17:2-73 (length=9, pos1=107381, pos2=66321)
LXX Psalms 91:11+2 91:11-51 = SBLGNT Matthew 25:41+95 25:42-56 (length=19, pos1=107358, pos2=76695)
LXX Psalms 91:11 91:11-35 = SBLGNT Matthew 4:6+54 4:6-55 (length=37, pos1=107356, pos2=6566)
LXX Psalms 91:11 91:11-20 = SBLGNT Luke 4:10+12 4:10 (length=52, pos1=107356, pos2=13765)

The last two matches are correct, but the third-best one is just a false positive.

As already mentioned, literal matches cannot provide a full picture of the system of quotations in the Bible, since several quotations are non-literal. Even though, we can identify many of the quotations quite well just by this technique – without using any fuzzy comparisons. In fact, the fuzzy quotations often contain at least one longer part of literal quotation, hence there are good chances to find parts of a fuzzy match just with a literal search.

Exercises

Find the longest potential quotations for Psalm 2 by using the command getrefs SBLGNT LXX Psalms 2:1 2:12. (You may need to wait several seconds in the browser to complete the search.) Which potential quotations can be identified as real quotations?
Re-run the search for Psalms 91:11 by including also the next verse in the input. That is, issue getrefs SBLGNT LXX Psalms 91:11 91:12. Explain the result.

Entries on topic internal references in the Bible

Zoltán Kovács
Linz School of Education
Johannes Kepler University
Altenberger Strasse 69
A-4040 Linz

zoltan@geogebra.org