This article started out as a talk,* Is FileMan Relational?* that I’ve given on a few occasions at MUMPS and VistA conferences. File Manager (or just FileMan) is a MUMPS application that provides another level of abstraction and a set utilities that comes close to being a database management system built on top of MUMPS. In fact, I often compare the relationship between FileMan and MUMPS to the storage engines in MySQL.

Without going into too much detail, I ask to what degree FileMan follows Dr. E. F. Codd’s famous twelve rules, and conclude that it compares favorably with popular relational database management systems. This meant to be counterintuitive, and a bit provocative, because the basic abstraction in FileMan (the file) has a hierarchical structure, uses things called pointers, and doesn’t have a special query language. Today, my thinking has changed somewhat. I would classify MUMPS as a type of NoSQL database, coupled with a programming language. Specifically, MUMPS stores data in a key/value store similar to Riak. The main difference, of course, is that MUMPS supports multilevel keys. What FileMan does is add a level of abstraction similar to that of *document oriented* databases such as CouchDB or MongoDB.

Roughly speaking, the database management systems used in health IT can be divided into three broad groups:

- MUMPS and related technologies (such as InterSystems Caché)
- Relational database management systems
- NoSQL databases

Of these, NoSQL is an eclectic group including (at least) document oriented databases, key/value stores, and graph databases.

Today, we’ll take a closer look at the second of these options (relational databases) and what makes them attractive. It is usual to start out with the mathematical concept of *relation* upon which the model is based. We are all familiar numeric relations such as equality (a = b) and inequality (a > b). These are both examples of *binary* relations; an example of a ternary relation is: x is *between* a and b. The degree of a binary relation is 2, and the degree of a ternary relation is 3. We can have relations of degree 1 (called unary), too. For example, we can define r(x) to mean x > 0 (or x is positive). These examples have all involved numbers, but this isn’t required. For example, we can define a relation on strings; e.g., S1 is a *prefix* of S2, or S1 *contains* S2 Of course, there are relations on people such as John is Mary’s grandfather, or John and Mary live on the same street.

Let’s try to make this a little more formal. We start with the notion of a set, which is a collection of objects with no duplicates. Some examples of sets are {1, 2, 3, 4}, {Peter, Paul, Mary} and {}, where the last example is the set with no elements (empty or *null* set). Sets may also be infinite. Some important properties of sets are:

- Sets contain no duplicate elements.
- The order of elements in a set does not matter.
- Sets may be finite or infinite.

Sets also have a notion of membership. We say 1 is a member (or *element*) of {1, 2, 3} and write 1 in {1, 2, 3}. Two sets A and B are equal if x in A if and only if x in B. We can define a relation (!) on sets by defining A to to be a subset of B if x in A implies x in B.

As an aside, talking informally about sets in this way can get us into trouble. To see how, consider Russel’s Paradox: Let R be the set of all sets that don’t contain themselves. Is R a member of R? If it is, then by definition, it is *not* a member R. On the other hand, if it’s not a member of R, then it is by definition! This, and similar paradoxes led to the development of axiomatic set theory, and formal foundations such as Zermelo-Fraenkel set theory, or Gödel-Bernays set theory. (Yes, *that* Gödel.) Fortunately, none of this affects what we are doing here.

We need one more concept from set theory, the Cartesian product. Given two sets A and B, the product of A and B is the set A x B consisting of all ordered pairs (a, b) where a in A and b in B. This concept can be extended to three sets, where the product A x B x C is a set of ordered triples and, in general, to any number of sets.

At last, we come to the key concept of the *relation*. Formally, given n sets, an n-ary relationship is an arbitrary subset of the Cartesian product of those sets. That’s pretty abstract, but it’s not as bad as it sounds. Suppose we define R to be the set {(x, x) | x in A} (read: all pairs (x, x) such that x is an element of A.) if we write a ~ b if (a, b) in R, then a ~ b is only possible if a = b. Conversely, if a = b, then (a, b) and (a, a) are the same, so (a, b) in R by definition. We have just defined a relation that turns out to be the same thing as equality!

Now, in the database world, we’re interested in different types of relations. Suppose we want to express the idea that John Doe has a patient ID of 12. This an be expressed by saying (12, “John”, “Doe”) in R, where R is a subset of Integer x String x String where there is one element of R for each patient in the database, and that entry expresses the fact the patient with a given ID (in this case, 12) has a given name (in this case, John Doe). If you think about it, the set of elements of R can be thought of as rows in a table (think of it as a spreadsheet if you like). Now, there are some technicalities here: if we want to think of relations as tables, we will want to assign a *heading* to each component of the relation, say PatientID, FirstName and LastName. Note also that, since R is a set, there can be no duplicate rows (this is not true of spreadsheets, or even of views in some relational database products). Note also that there is no provision for NULL values. If we want to allow them, we’ll have to adjust the definitions accordingly.

After all of this, we finally come to

**Definition**: A relational database is a database that stores facts as tuples in relations.

That seems a bit anticlimactic, doesn’t it? The interesting thing about relational databases is how we get information out of them. With MUMPS databases, we normally write procedural code to do it. In other words, we describe the steps we need to go through to find the data we want. Said differently, we say *how* to find the data, not *what* we want. If we could design a language that would allow us to express what it is we want know, without going through the steps we need to follow to find it, then we would have a *declarative* query language. Examples of query languages people have developed include Steuctured Query Language ( SQL), Query by Example ( QBE) and QUEL.

The basic idea is to take advantage of the formal foundations of relational databases and develop a mathematical language, such as relational algebra, which will serve as the model of practical query languages. It then becomes a mathematical problem to find a way to minimize the number of steps needed to carry out a given query. This is the job of system components called the query compiler and query optimizer. It’s not unlike the problem of of optimizing the object code generated when compiling languages such as C or Java.

The trouble is that this is all easier said than done. Careful indexing will help, but in practice, queries will often involve table scans and can be slow. The upshot is that relational databases are often outperformed by MUMPS and other databases that give the programmer more direct control over how queries are actually executed. So, it really comes down to a trade-off. Relational databases can be more flexible and require less application specific code (though the complexity of writing SQL is not to be underestimated). On the other hand, they can sometimes be slow, and physical database design can be a real challenge. MUMPS systems can include a lot of specialized code.

In short, the choice of technology is not always clear. There are many options available today, and each comes with its own advantages and disadvantages.

Tagged: database, mathematics, mumps, rdbms, relational, theory