1 February 2022

Literal matches: the minunique and getrefs algorithms

In the previous entry we defined the notions of minimal uniquity and maximal extension of a passage. Usually we say that an Old Testament passage is minimally unique if it is unique in the Old Testament, but after removing its first or last character it is no longer unique anymore. Also, we say that a New Testament passage is maximally extended if it is a quotation of an Old Testament passage, but any extension of it (by continuing the text with previous or next consecutive characters from the same book of the New Testament) is no longer a quotation.

In this blog entry two algorithms are presented. The first one is called the minunique algorithm: it finds all minimally unique substrings in the input passage. The second one is the getrefs algorithm: it finds all passages in the New Testament that may be quotations of the input passage. Here simplified versions of both algorithms are provided, in a Python-like code style. (According to multiple sources, Python is the top programming language as of December 2021.) These algorithms are implemented, however, in C++, for the bibref tool.

The minunique algorithm

The first algorithm searches for candidates of substrings of the input passage, and if a candidate is indeed minimally unique, it prints it:
def minunique(input_passage):
  l = len(input_passage)
  for i in range(l):
    for j in range(l-i):
      U[i][j] = -1
  for i in range(l):
    for j in range(l-i):
      if i>0 and (U[i-1][j]>0 or U[i-1][j+1]>0):
        U[i][j] = 2
      else:
        candidate = input_passage[j:j+i-1]
        if is_unique(candidate):
          U[i][j] = 1
          print(candidate)
        else:
          U[i][j] = 0
To show an example run of the algorithm we consider the input passage abcdef with the assumption that ab and cde are minimally unique. We create the following table:

1 2 3 4 5 6
1 a b c d e f
2 ab↓ bc cd de ef
3 abc↓ bcd ↙cde↓ def
4 abcd↓ ↙bcde↓ ↙cdef
5 abcde↓ ↙bcdef
6 abcdef

The table contains all possible substrings of the input passage, in a triangular form. The colored entries are unique. The red entries are minimally unique. It is clear that a colored entry produces two colored entries below it (to the South and South-West), if there exist further texts in the next row. By using this property it is simple to fill the table from top-to-bottom and left-to-right and get the sought output.

Here is a minified version of the table shown above. This saves some space and it may explain the practical way of executing the algorithm better:

a b c d e f
b c d e f
c d e f
d e f
e f
f

The advantage of this second table is that we only need to read off the texts upside down beginning with a green letter and stopping at a red one. Actually, the Python code fills the same table, with the only difference that black corresponds to 0, red to 1, and blue to 2.

Let's perform the same concept on the four middle words of Psalm 82:5-6: “εγω ειπα θεοι εστε και” – for the better readability we use Latin characters here.

ghsegveipaueoiestekai
hsegveipaueoiestekai
segveipaueoiestekai
egveipaueoiestekai
gveipaueoiestekai
veipaueoiestekai
eipaueoiestekai
ipaueoiestekai
paueoiestekai
aueoiestekai
ueoiestekai
eoiestekai
oiestekai
iestekai
estekai
stekai
tekai
ekai
kai
ai
i

You may want to check this result by using the bibref program. To achieve this:


Then, issue the following commands:
  1. lookup1 LXX Psalms 82:5+72 82:6-17
  2. minunique1 LXX
At this point we are already able to search for potential quotations that match a minimally unique passage. A potential quotation is not necesserily a quotation: it can be a false positive (in other words: a random match), or an uncertain match (if we cannot or cannot fully agree if it is written on purpose). Let us check the minimally unique text “eipau”:
  1. latintext1 eipau
  2. find1 SBLGNT
These are the obtained results:
  1. Found in Luke 24:26+11 24:26-41 (book position 93880-93884)
  2. Found in John 10:34+58 10:34-7 (book position 36128-36132)
  3. Found in Acts 17:3+40 17:3-73 (book position 55952-55956)
  4. Found in Acts 26:23 26:23-80 (book position 87546-87550)
In the previous blog entry we already mentioned John 10:34 as the matching passage from the New Testament, so this result was, of course, expected. The other three matches are, however, random appearances, and no extension of them can be made. In fact, most verified quotations are far longer than 5 characters, so such false positives can be filtered quite effectively with a mechanical search.

As a final part of this section, we prove that the minunique algorithm correctly identifies all substrings of the input passage. We show that at the end of the algorithm the value of \(U[i][j]\) contains the value 0 if the substring \([j:j+i-1]\) is not unique, 1 if it is minimally unique and 2 if it is unique (but not minimally). We use induction on \(i\):
  1. Let \(i=0\), in this case \(U[0][j]\) is set to 1 or 0, depending on the uniqueness of the single character at position \(j\) in the input.
  2. Now let \(i\) be arbitrary. The value of \(U[i][j]\) describes the passage \([j:j+i-1]\) which is unique if and only if the values \(U[i-1][j]\) and \(U[i-1][j+1]\) (describing the passages \([j:j+i-2]\) and \([j+1:j+i-1]\), respectively) are such that:
    1. one of them is unique, in this case the passage \([j:j+i-1]\) is unique but not minimally, so the value of \(U[i][j]=\color{blue}2\) is correct,
    2. neither of them is unique, but \([j:j+i-1]\) is already unique, so this passage is also minimally unique: in this case the value of \(U[i][j]=\color{red}1\) is correct again.
    In the remaining case \(U[i][j]\) is set to 0, because the corresponding text is not unique.
The minunique algorithm makes a remarkable speedup on finding the minimally unique passages. In most cases uniqueness check for a string is an expensive operation. For example, in Psalm 82 in the passage above we have 21 characters of input – instead of computing 231 uniqueness operations we need only 124 to perform. Many of the further computations can also be avoided since not all of the 114 unique subpassages need to be searched but only for 7 minimally unique ones.

The getrefs algorithm

Now we are ready to build a second algorithm that finds all potential quotations for a given passage in the Old Testament. Some matches will be false positive, but the search is complete in the sense that no potential quotation is skipped. The algorithm is sketched again as a Python code snippet:
def getrefs(input_passage):
  for M in output_of(minunique(input_passage)):
    for F in output_of(find(M in SBLGNT)):
      print(extend(M))
Let's try the getrefs algorithm on Psalm 82:6: issue the command: getrefs SBLGNT LXX Psalms 82:6 – this results in a couple of lines of outputs, at last in the following three best matches: This is somewhat promising: the expected verse John 10:34 is reported as the best match. On the other hand, sometimes we can obtain false positives on quite long passages as well. For example, checking Psalm 91:11 results in well-identified matches, however, a false positive is also found with a length of 19 characters. To check this case, issue getrefs SBLGNT LXX Psalms 91:11 which results in The last two matches are correct, but the third-best one is just a false positive.

As already mentioned, literal matches cannot provide a full picture of the system of quotations in the Bible, since several quotations are non-literal. Even though, we can identify many of the quotations quite well just by this technique – without using any fuzzy comparisons. In fact, the fuzzy quotations often contain at least one longer part of literal quotation, hence there are good chances to find parts of a fuzzy match just with a literal search.

Exercises

  1. Find the longest potential quotations for Psalm 2 by using the command getrefs SBLGNT LXX Psalms 2:1 2:12. (You may need to wait several seconds in the browser to complete the search.) Which potential quotations can be identified as real quotations?
  2. Re-run the search for Psalms 91:11 by including also the next verse in the input. That is, issue getrefs SBLGNT LXX Psalms 91:11 91:12. Explain the result.


Entries on topic internal references in the Bible

  1. Web version of bibref (12 January 2022)
  2. Order in chaos (17 January 2022)
  3. Reproducibility and imperfection (20 January 2022)
  4. A student of Gamaliel's (23 January 2022)
  5. Non-literal matches in the Romans (26 January 2022)
  6. Literal matches: minimal uniquity and maximal extension (31 January 2022)
  7. Literal matches: the minunique and getrefs algorithms (1 February 2022)
  8. Non-literal matches: Jaccard distance (2 February 2022)
  9. Non-literal matches in the Romans: Part 2 (3 February 2022)
  10. A summary on the Romans (5 February 2022)
  11. The Psalms (6 February 2022)
  12. The Psalms: Part 2 (7 February 2022)
  13. A classification of structure diagrams (15 February 2022)
  14. Isaiah: Part 1 (19 February 2022)
  15. Isaiah: Part 2 (26 February 2022)
  16. Isaiah: Part 3 (2 March 2022)
  17. Isaiah: Part 4 (7 March 2022)
  18. Isaiah: Part 5 (15 March 2022)
  19. Isaiah: Part 6 (23 March 2022)
  20. Isaiah: Part 7 (30 March 2022)
  21. A summary (7 April 2022)
  22. On the Wuppertal Project, concerning Matthew (17 July 2022)
  23. Matthew, a summary (25 July 2022)
  24. Isaiah, a second summary (31 July 2022)
  25. Long false positives (23 August 2022)
  26. A general visualization (25 August 2022)
  27. Stephen's defense speech (19 September 2022)
  28. Statistical Restoration Greek New Testament (31 July 2023)
  29. Qt version of bibref (11 March 2024)
  30. Statements on Bible references (5 August 2024)

Zoltán Kovács
Linz School of Education
Johannes Kepler University
Altenberger Strasse 69
A-4040 Linz