Copy And Baste

Making a sub-domain-specific subset copy of corporate or "master" databases and adjusting it to reflect local needs. This pruning, tuning, and re-translating for local needs has been nicknamed "basting" here, making a play on words surrounding CopyAndPaste (with a "P").
From DepartmentInfoMirroring...

If a way can be found to automate a copy process, it can be a very powerful tool when the "copy" can be reprojected in a way that better fits the local (section-specific) users or needs. For example, your particular department may only need a small subset of all available info (columns), and the translation step can replace convoluted coding schemes, such as product, service, or status codes, with something more meaningful or simpler for local needs by tossing out idioms or layers meant for other departments but not useful to yours. This can make for simpler and faster queries and data usage. It's customized data abstraction.

As an example, I once worked for a big subscription provider that provided several products and services. However, my particular section, which dealt with mostly marketing research and junk mailing/spam (sorry about that), usually only cared about general categories of products instead of individual products. The raw customer product table resembled:


(Dots to prevent TabMunging and a wiki spacing bug.)

There were others tables, not shown, that mapped specific products into categories. The early programmers kept using the raw table and manually coded product-to-category translations[1] because the joins for obtaining categories was too slow because of the complex relationships in that system. Eventually a smarter set came along and produced a bi-weekly table for our section that resembled:


A "1" indicated a given customer has at least one product/service that falls into the given category, and "0" means does not have. (Integers were used instead of Boolean or Character to simplify aggregate counting). This simplified queries immensely, and we could still join with the raw table for those occasional queries that needed info about specific products. (This is a somewhat over-simplification for illustration purposes.)

One could argue that one should put each category on a different row (columns: customerID, categCode, hasFlag) to make the Customer_Product_Categ_Table more stable such that one doesn't have to change the schema for each new category that came along. However, this would complicate the most common type of queries we did. Further, new categories only came along about once every 3 years on average. It was decided to live with the once-per-3-year change spike to have simpler production queries for everyday work. -t

This looks like accidental complexity - working around inadequacies in your technology. Now, that's understandable (you must MakeItWork), but do this often enough and it's time to refactor the solution into your database technology. One possibility: lightweight 'views' that are maintained automatically during relative down-time for the database?

What do you mean by "lightweight" views? Views are usually immediately generated based on live data, not cached or copied.

I mean views that are cached and copied, and thus distinct from the 'usual' views generated by expensive joins on live data before being expensively delivered across the network at each request. Was this unclear in context? The name 'lightweight' alludes to lightweight directory access protocol (LDAP), which is designed for read-mostly work.

I do find a common pattern where the "master" data has a good many elements, too many for timely "deep" queries that are exploratory or research in nature. Different parts of the organization may only want to see a subset of roughly 5% to 20% of the total organization domain data, along with some customized re-calculated columns to simplify certain common needs similar to the categories described above, or maybe give semi-meaningful abbreviations to numeric status codes that make more sense to local staff but wouldn't necessarily for other sections, etc. If you extract 10% of the data that your local group needs, queries will typically run roughly 10 times faster.

{I think the term y'all are looking for is "materialized view". A materialized view is a table that caches live data, typically the result of a query. It may be updated manually or automatically.}

Because Oracle (and DB2) licensing can be expensive, many organizations I've worked with use MS-Sql-Server, MicrosoftAccess, and/or MySql for the reprojected tables. Since such tables are usually read-only, they are not really playing to Oracle's strength of transaction reliability, and thus Oracle is considered overkill. Therefore, a formal MATERIALIZED VIEW command may not be used.

Knowing what changed

If the master dataset is too large to rely on a full copy per update cycle, then some sort of incremental approach is needed. If each table or entity group (such as product or customer) has a "last-updated" column, that can be used to know what changed. If not, then one can use a "round-robbin" approach whereby the local dataset has a "last-updated" date and the oldest in a virtual queue get updated next using a result-set size limiter such as Microsoft's "TOP X" or MySql's LIMIT clauses, as illustrated below. Detection of new records will likely require domain-specific analysis, but often relies on "customer-since-date" or the like. -t

  /* select a chunk of the "stalest" ID's to process */
  SELECT TOP 5000 customerID
  FROM myLocalCustomerTable
  ORDER BY last_updated ASCENDING   /* oldest first */

Reasons for CopyAndBaste include but are not limited to:


[1] To be fair, there were some political and "dumb PHB" reasons for resisting improvement. For one, they wanted PlugCompatibleInterchangeableEngineers, and a custom table for a small group was seen as against this goal. They first had to be convinced that the simpler table justified a potentially longer learning curve.
See DepartmentInfoMirroring, OnceAndOnlyOnce


View edit of October 16, 2010 or FindPage with title or text search