Extended Set Theory Storage Model

Relational math can describe the logic of the data it provides to you, but it isn't strong enough to describe the storage space upon which it is implemented. ExtendedSetTheory (XST) can model the user data, but it can also model the storage, all the way down to the bits.

Why is it a bug? Specifically, why would we want to restrict the modelling of database representation issues to have to use same theory that is used to model the abstract database structure? Especially when there is an existing theory of abstraction functions and representation invariants that is perfectly adequate?

It's a bug because we are not able at all to model using the same approach all the way down if we want to. No one says we would have to. The problem is that now, if we want to, we cannot. We are forced now to change models even if we don't want to.

It seems kind of obvious to me that a unified theory that can cover everything would be the preferred approach, just as we seek in physics and other sciences. I guess it doesn't seem that way to you.


The reason it doesn't seem that way to me is that modelling data representation in terms of abstraction functions works for any paradigm, not just relational. I.e. we already have a "unified theory". -- DavidSarahHopwood

[XST would appear to offer an alternate "unified theory". Is it a problem to have more than one such complete paradigm? -- DougMerritt]

[...] ExtendedSetTheory can model the user data, but it can also model the storage, all the way down to the bits.

I wonder why one would want to do this. You use the relational model to model the logic level. Than you use whatever you want -- hierarchies, networks, files, any kind of graph or list -- to map that to implementation. A RDBMS should have all this implementation functionality, but very separate from the logical model; and users should never deal with the physical layer or its mapping to the logical schema, only DBAs and the such should care about this.

Here's a simple example. (I don't see a good way to unthread all this, but hope it's still worth getting down.)

Imagine some simple relational table:

 | Name             | Age |
 | Bill Jones       |  36 |
 | Cassandra Walker |  29 |

This might well be implemented in some flat way on the disk: say 30 bytes of alpha for the Name, one byte for the short Age. Now suppose we want all the people whose age is less than thirty.

One way to do this is with some code that would look like this:

  Relation young = new Relation();
  foreach ( Person p in People ) 
    if (p.Age < 30)

This would be nearly efficient, except that it has to create and destroy all these Person records, and then flatten them back out into bytes and put them on the disk.

Another way might be something like this:

  inByte= 0;
  outByte = 0;
  while (inByte<= People.Size)
    int age = People.File[inByte+30];
    if ( age < 30)
      Movebytes(People.File[inByte], Young.File[outByte], People.Recordlength);
      outByte += People.Recordlength;
    inByte += People.Recordlength;

In words, we just map to the disk, compare the byte, and if it's less than 30, we move one record from input to output.

This would be tons more efficient than the first example. The problem that an RDBMS has is that it has to figure out all those addresses and file-to-memory mappings, and finally pop down into a routine that basically does what the one just above does: whip through all the data comparing and moving bytes.

It turns out that this is quite doable, of course. Oracle proves that. But it isn't easy.

I don't understand why it isn't easy. Translating from the first example to code based on byte offsets is just what any compiler for a language with structures would do. Then memory-mapped files or similar can be used to support persistence. All well-understood technology, not requiring any new theory.

One of the putative advantages of ExtendedSetTheory is that since the math can map all the way from an abstract relation, through the relation's representation on disk (however complicated), down to bytes in memory, the whole operation above can be done by mathematical expressions, rather than by "flimsy, ill-defined operations in code".

People who have built an RDBMS in the ad-hoc way, or who don't have to build one at all, might say "OK, so what, I can still do it." And they're right. But I have some input for you.

Back in the day, teams of mine built three RDBMS products, two of which were ad hoc, and one of which was built on what we understood of XST. The latter was very neat. What it did was take the relational expression desired by the user (SQL-equivalent) and do mathematical substitutions on that expression until it came down to a series of operations to be done on storage (disk mapped to memory). Then it executed those expressions, producing the result.

It was really neat. You could implement any relational operation you wanted "just" by expressing the mathematics that defined it. It was very hard, but in the end we who had built DB both ways really felt that it was better in the XST style.

But that was long ago, and in another company, and besides, the product's dead.

That's very interesting. Is there somewhere I could learn more about that approach?

Unfortunately, there is not much really good to read. Some stuff by the originator is at http://www.xegesis.org/xsp/ . Those few who have built things based on the theory have this in common: we all got something good, and we are to this day not sure that what we did was what Dave was talking about. He's a hard man to understand ...

Well, could you give any examples you recall of how the substitutions worked in what you did?

I'll try. I don't know if they'll be evocative of how cool it is when it works.

Suppose we have a query like SELECT NAME, AGE, ADDRESS FROM PEOPLE WHERE AGE = 21.

Now the extended set theoretic model for PEOPLE means that its records look like this:

  PEOPLE ::=
    { Jones^NAME, 36^AGE, 1234 Main St^ADDRESS }
    { Smith^NAME, 24^AGE, 221 Baker St^ADDRESS }

And the query would like to use an input set like this:

  QUERY ::=
    { 21^AGE }

And the result set we want is exactly


where | is the XST operator "Restrict", defined as

  A | B = { a^i such that a element-i A and exists b element B and a intersect b = b }

Now at this level of "abstraction", the operation is defined in terms of arbitrary elements and exponents (which are here modeling field names, although they don't have to).

But the same operation can be used, directly, to refer, for example to memory.

Suppose also that we have the data in a flat table, with 2 bytes for age, 30 for name, 50 for address.

We can imagine that we keep symbol data in a standard format like this:

 FieldName  Offset  Length
 Age        0        2
 Name       2        30
 Address    32       50

The standard format, of course, can be represented in its own format:

 FieldName  Offset  Length
 FieldName  0        16
 Offset     16       2
 Length     18       2

And the data for our table looks like this (with appropriate blanks to line it up):

  36 Jones 1234 Main St
  21 Smith 221 Baker St


The operation that produces the bytes that make up our result table is


Using the same definition of restrict as we started with. The only difference is that we are now interpreting the table as something like

    { 3^0 6^1 J^2 o^3 n^4 e^5 s^6 ... }
    { 2^0 1^1 S^2 m^3 i^4 t^5 h^6 ... }

and the query (of course) is { { 2^0 1^1 }}

So the result of all this is that we can parse the SQL in the "usual" way, but we only have to determine that the result we want is the original restrict.

Then we do various math to map the byte addresses around. This math effects the same table lookups one would do using ad hoc code, but, for example, it tends to use the same operations yet again.

For example, if we want to know the byte offset and length for the Name field, because we want to print it or something, it turns out that the information in question is in the record

  PEOPLE_SYMBOLS | { { Name^FieldName} }

How can I get to that operation? One way, certainly, is with the necessary ad hoc code to rip apart the symbol table, figure out a restricting record, and plug the stuff in. And the ad hoc code isn't incredibly hard to write.

However it is rather easier to write in set theory, with the right set of operations.

It turns out, in the simple case, that the system can completely bootstrap all the information it needs from a few simple facts: the location and format of a table saying where all the tables are, and the format of the SYMBOLS table. (We can get the location from the tables table.)

Most everything else then turns out to be expressions in math, very little procedural code.

When we use SQL or some other relational language, we find ourselves thinking in aggregates -- whole sets of reocrds -- most of the time.

When we implement a relational system, we usually have to think rather procedurally, or at best in terms of objects, to do the implementation. Using XST, more of our thinking and programming can be in terms of aggregates.

So it turns out to be very powerful in the hands of someone who knows how to use it.

Is it worth bootstrapping up to know how to use it? I'd not make that claim today, when there are, as far as I know, no available XST systems to work with. Things might have been different, but that's the way things are.


Cool. Your example makes it seem easy, so I'm probably missing a lot of the nuances. I'll have to think upon this. Thanks.

Hi all,

I have gotten an interest in D.L. Childs' works when I read about them in Ron's articles in 2005. I have recently come to the conclusion that XST, XSN and XSP go much further than simply saying "Well, my table is a set, and I'll be selecting all entries that have this field equal to this.". This does nothing more than SQL, and I'd say it actually does less.

Childs talks about three different things: Buried into one of the papers, there is a particular sentence I want stress. I will mention it further down.

First, Childs mentions that Codd's works on relational data models required strong Data Independence between application, model and storage, so that the performance of the storage would be determined by the storage model only, and not by the way the queries were written. Yet, in nowadays RDBMS, you have to pay attention to the way queries are written, and nobody laughs.

Basically, the application should query the model, and the model should query the storage. The applicative queries to the model should only determine what data you want. Not how you get them from storage. XSN Restrict would be one way. SQL without hints would be another.

About storage performance, one of the keypoints of applying XSP is at I/O level. The sentence that is buried down hild's papers is that: column storage is more efficient than row storage. Part of what I realized is this:

Suppose you could store a table with two columns "Name" and "FirstName?" that have 5 characters each. (for clarity)

You could represent each of your row data as a tuple of tuples like this:
  <D, o, e, ., ., B, i, l, l, y>,
  <D, o, e, ., ., J, o, h, n, .>,
  <B, o, b, ., ., B, i, l, l, y>,
  <P, o, w, e, r, M, a, x, ., .>,
  <G, r, a, n, t, H, u, g, h, .>,
  <D, o, w, ., ., J, o, n, e, s>

And store them in a flat file as "row storage" :

 <D, o, e, ., ., B, i, l, l, y, D, o, e, ., ., J, o, h, n, .,B, o, b, ., ., B, i, l, l, y,P, o, w, e, r, M, a, x, ., .,G, r, a, n, t, H, u, g, h, ., D, o, w, ., ., J, o, n, e, s>

Suppose, now, that each one I/O gives you 4 (arbitrary but) consecutive bytes from your storage. Your query is "the first name of all names that start with D and o". Your I/Os would give you :
 <D, o, e, .>
 <B, i, l, l> (selecting the first first name)
 <y, D, o, e> (end of selecting the first first name, plus enough info on the second name)
 <J, o, h, n>
 <., B, o, b>
 <P, o, w, e>
 <G, r, a, n>
 <D, o, w, .>
 <J, o, n, e>

For a total of 10 I/Os. Which is already improved with regards to the dumb conditioning, because I took advantage of conditioning every I/O.

Now, imagine that I store characters per "column".

Let's admit that I use one file for each column position. I'd be using:

At this point, I have tested all "rows" to my condition. I know exactly which data to fetch.

I'd be using 8 more I/Os to fetch all relevant data. That totals to 12 I/O, which kinda defeats the demonstration.

But wait! Look at tuples number 2, 4 and 5!

I can express them with a default in the data definition!
 2 -> 'o', unless otherwise stated. + { r^5 }
 4 -> '.' unless otherwise stated + { e^4, n^5 }
 5 -> '.' unless otherwise stated + { r^4, t^5 }

Admittedly, I can save 1 I/O on the condition on the second character. That totals to 11 I/O.

Wait, there is more! What if I used a row storage for the FirstName? column only?

(I am tired of typing, so please imagine)

I would have to fetch rows 1, 4 and 5. 4 and 5 being consecutive, 5 I/Os are enough. That totals to 5+3 = 8 I/Os with such a structure.

That does not look like a impressive improvement. Yet, this is because my I/O buffer is only 4-byte wide. If you admit that at least one fetched byte is useful per I/O, it floors the useful data ratio to 25% in any configuration, even row storage. Actual buffers are bigger than this (like 4KB) with rows have an average of 100 bytes each. If you condition on 10-byte names, 90% of the data you fetched was wasted if the row does not match. You can only condition 40 rows per I/O. With a column storage, you can condition 4096 rows in one shot.

XSP allows you to express such transformations from (e.g.) "SQL tables" to storage and has mathematical soundness to prove the equivalences. (and content preservation) The performance would then be dependent on the way the table-storage equivalence is implemented, and not on the query stating "HAVING" or "WHERE (SELECT .. WHERE)".

Other things can be imagined: Laurent LA RIZZA

View edit of December 16, 2010 or FindPage with title or text search