Sparse Columns

Sparse columns are sometimes considered a weak-spot in relational theory and design.

For example, suppose we used a DataDictionary-like table to model GUI widgets. Widgets will have many things in common, such as coordinates or location (on screens or panels). However, many widgets will have attributes that are only meaningful to them, such as "slider width" (the middle part) on a slider bar. If we put columns for these in our Widgets table, then most of the "cells" will be blank for non-slider widgets.

One viewpoint is that this is not a problem if the DBMS is designed right. Sparse cells will not take up any space except in the column definition structure. If a column in "not mentioned" in a particular physical record, then it is assumed to be blank/empty/null. However, the schema might tend to get rather large and unwieldy. (I tend to put less commonly used columns on the "right" side of the table.)

Usually, views on a base table are used to hide unneeded columns. An example: suppose we have foo's and subfoo's. All foo's have an attribute bar, but only subfoo's have an attribute baz. We create the following table and views:
  CREATE TABLE foo_base (
    foo        NUMBER,
    foo_type   VARCHAR2(10),
    bar        NUMBER,
    subfoo_baz NUMBER
  );

CREATE VIEW foo (foo, bar) AS (SELECT foo, bar FROM foo_base WHERE foo_type = 'foo');

CREATE VIEW subfoo (foo, bar, baz) AS (SELECT foo, bar, subfoo_baz FROM foo_base WHERE foo_type = 'subfoo');

Another solution would be to use an AttributeTable. The jury is still out on these.

A third solution is to abandon relational and go with NetworkDatabases (NDB), HierarchicalDatabases (HDB), and/or their newer cousin, the ObjectOrientedDatabase or OOP in general. Hierarchies could allegedly allow all widgets to inherit the common behaviors and attributes, but specific "types" (leaves) of widgets could have dedicated attributes. One problem is that the differences may not actually be tree-shaped. Tree taxonomies are sometimes criticized for not scaling elegantly (LimitsOfHierarchies). For example, if we sub-divide widgets into Images and Buttons (plus others), then what if we later decide to include images on buttons? What looks like a "button" on some web pages is actually an image that changes with "onRollover" and "onClick" events. The short story is that the border between an image and button can be blurry. Some GUI engines also allow image backgrounds to input boxes. The combinations could get sticky to model as trees. We may have to overhaul or toss our original taxonomy tree when such changes come along.

The NDB approach basically has a "record" or attribute dictionary as its primary structure. (These generally correspond to "object" in OOP, see ObjectsAreDictionaries). Each record can have its own set of attributes (or methods or pointers) without worrying about conforming to a larger structure, such as a "table". This gives it the latitude to be anything it wants. (Inheritance will be dealt with later.)

However, the drawback is that no "aggregate reasoning" can be done on independent records (IR). We have to potentially look at *each* one to see what it is an what is in it. They are nomads without a country, you could say. Relational allows reasoning about "tables", which may contain thousands of instances (rows). It is a "math" that can operate at an entity level instead of at just a record level. It allows "forest-level" operations and views instead of just tree-level ones. "Having to navigate at a low level" is a common complaint about NetworkDatabase-like arrangements. The only mechanism that ties IR's together is perhaps inheritance (which includes class or type membership). However, this brings us back to the potential limits of trees.

What if we managed the differences with sets instead of trees? I have not seen a lot of tests of this idea, at least not built into the database or language.

I don't think there is any perfect solution. There are indeed attributes which many things share in common, but dealing with those that are different is the biggest problem. The pattern or "shape" of the differences may not follow a predictable enough pattern(s) to deal with cleanly. Tables don't deal well with the differences, at least not out-of-the-box; and independent records don't deal well with the commonality.

I personally think that one is the lessor of two evils, but will not state it here without finding more external evidence first.

(Some of this duplicates material in OoLacksMathArgument. When the debate grows more mature or stable, a refactoring is perhaps in order.)


A table per type would seem to avoid the problem. Aggregate queries become clumsy, but not too difficult.

I think that is an anti-pattern. Many sub-entities don't stay as clean sub-types over time. For example, suppose only managers were giving dedicated parking spaces at first. We may put the parking space number in a Manager table. But later some non-managers may get numbered parking also. We would then have to move the parking number column back to Employees, and perhaps change a bunch of code (although views may be able to simplify some of the changes). The only column that would probably be guarenteed to stay manager-only would be peopleManaged perhaps (number of people managed). But it is not worth a dedicated table for just one column. See ThereAreNoTypes.

I would have the parking space 'key' in both tables as it will appear in both types, I see no point in maintaining the abstract inheritance structure in the instance. I tend to avoid concrete supertypes also, but that is a design issue rather than an implementation issue.

Perhaps not have a Manager table if this is all that is in it. If there is only one column guarenteed to *stay* dedidated to being a manager ("peopleManaged" number), a separate table is not worth it. Some of the "normalization" material over-encourages little sub-tables for stuff that may drift over time IMO.

This whole 'it may change later' argument is bogus. You will have exactly the same problem even without a database as you will have to change the classes and *all the clients* of those classes (woe to you if there are any published interfaces). With a database, the overhead of change is because you have to synchronize two things (code and DB schema). Even OODBMS won't save you, as you will still have to migrate persisted objects. This migration is conceptually very similar to changing the schema.


See Also: DeltaIsolation, MultiParadigmDatabase, DynamicRelational

EditText of this page (last edited March 28, 2008) or FindPage with title or text search