SQL Criticism



Updated: 5/17/2002

Below are some criticisms of standard SQL. Note that I am not necessarily suggesting that SQL be abandoned, just that it may need some work done on it to improve its rough spots.

Divisibility

A "real" programming language always allows a programmer to break constructs into simpler or smaller parts to make managing them easier. SQL is lacking in this department. Therefore, one has to over-rely on nesting.

For example, if the same expression appears more than once in an SQL statement, there is no way to factor it out via subroutine-like syntax.
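For instance, here is a hypothetical sketch (the column names are made up) in which the same extended-price expression must be typed twice, because standard SQL does not let the WHERE clause reference a column alias and provides no way to name the expression once and reuse it:

  select price * qty * (1 - discount) as extended
  from invcDetails
  where price * qty * (1 - discount) > 100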

In larger statements, the nesting approach can get quite cumbersome. Although proprietary versions of SQL or the combining of strings using SQL API's can allow some of this, why is a solution not part of the standard? (See Alternative to SQL.)

Another syntactical annoyance is field selection. If a table has 100 fields, and you want to specify about 30 of them, it gets tedious to name each one. One solution, though perhaps not the only one, is the use of field set groups in Data Dictionaries, as sketched below.
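For example, a hypothetical field-set syntax (not part of any SQL standard) might let the data dictionary define a named group of fields once and then reference the group in queries:

  select FIELDSET contactFields from customers
  -- where "contactFields" expands to the ~30 fields
  -- pre-defined in the data dictionary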

Set Orientation Often Not Intuitive To Most

Let's look at the following table of forest workers, called "Worker".

  name   age   lodging
  -----  ----  ----------
  fred   30    papa king
  bob    25    rose hill
  rich   40    papa king
  pete   37    rose hill
  pat    19    mullers
  kay    18    mullers
  don    34    papa king
  grey   28    roswell

Suppose you wanted to list the oldest worker at each location (lodge). Here is one SQL solution from [insert reference]:

  select name, lodging, age from worker w where age = 
    (select max(age) from worker where w.lodging = lodging)
    order by lodging

It produces this result:

  name   lodging     age
  -----  ----------  ----
  pat    mullers     19
  rich   papa king   40
  pete   rose hill   37
  grey   roswell     28

Although this SQL statement is fairly brief, I do not find it very intuitive. Part of the reason is that it is tough to break it into mental (or actual) parts in order to understand it bit by bit.

Many programmers, including me, often solve problems by envisioning how we would do it by hand, and then convert that to program code. I have yet to find a way to envision the above SQL as a straightforward physical task.

Let's see how we could break the problem down into a more procedural approach.

One approach that comes readily to mind is to sort by descending age and pick who comes to the top. We can also sort by lodge to group by location. Using simple SQL, or any other multi-field sorting utility, we could do this:

  select * from worker order by lodging, age desc

  name   age   lodging
  -----  ----  ----------
  pat    19    mullers
  kay    18    mullers
  rich   40    papa king
  don    34    papa king
  fred   30    papa king
  pete   37    rose hill
  bob    25    rose hill
  grey   28    roswell

If you run your finger down the table, the oldest person in a given lodge is the first name of each lodge group. In procedural programming, one usually detects a new group by comparing a "current" record to the previous. Thus, we can directly translate our physical approach to code:

  prev = "[init]"
  open "worker" as w
  addTempColumn(w, "mark", "")        // add a temp, empty column called 'mark'
  scan w order by lodging, age desc   // traverse each record with a loop
     if w.lodging != prev then w.mark = 'X'   // Mark 1st in new group
     prev = w.lodging
  endScan
  select * from w where mark = 'X'

Now, this solution is (so far) larger than the SQL version, but most programmers would likely agree that it is more intuitive.

The first SQL solution reminds me of recursion in some ways. Recursion is not very intuitive to most programmers. Sure, if you use it often you will get used to it, and maybe it will even become intuitive. However, tree traversal, the classic use for recursion, is not a problem that most programmers encounter frequently.

The problem is that non-recursive solutions to tree traversal are rarely simple and intuitive. Therefore, recursion is still often the best solution for tree traversal, despite its oddity.

In this case, our procedural solution has roughly 2 or 3 times more parts than the SQL solution. It is my opinion that the procedural solution is still preferable because it is much more intuitive to most people than the SQL approach. Thus, the advantage of intuitiveness outweighs the disadvantage of solution size.

(You could possibly argue that most programmers are "stupid", but if that is what is out there, then techniques and languages which claim to be "general use" must deal with that alleged fact.)

Now, let's see if we can provide language constructs which could simplify our procedural approach.

  select * from worker where lodging != prev.lodging
    order by lodging, age desc

In this case, a "Prev" dummy record, a built-in construct, is added that returns the value of the corresponding field of the prior record during the implied traversal of the records. Category breaks are common enough in procedural table traversal that a table-aware Prev record or function is probably justified.

There! Now we have a concise procedural way to solve the problem that is also intuitive, unlike the set-oriented SQL approach.

Another procedural way to break the problem into parts is to have an outer loop that loops for each category, and an inner loop to find the max of that given category. It would resemble this simple pseudo-code structure:

   for x = each lodge group in worker {
      find max(age) in x
   }

Or, perhaps:

   for x = each lodge group in worker {
      find record of max(age) in x
   }

This is based on a very simple algorithm:

1. Split the data into groups by lodge.

2. Find the max age in each group (lodge).

In our example, "x" is simply the record set for one lodge (at a time). This is a very natural and intuitive algorithm generation process. It allows us to consider the problem from the viewpoint of a single lodge. The issue that there are multiple lodges has been moved to an outer loop, allowing to inside to deal with only one lodge at a time. (Note that it is possible that there could be more than one person with the same max age in each group, but it is only a relatively minor change in our example to deal with this.)

This also highlights a related annoyance of SQL. It is easy to find the minimum, maximum, etc. VALUE of a set or group, but fairly tricky to return the ENTIRE record(s) having the extreme value, especially when groups are involved. This tends to violate Bertrand Meyer's "continuity" principle: small changes in requirements should only result in small changes in code.
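To illustrate, getting the extreme VALUE per group is a one-liner, but the seemingly small requirement change of wanting the whole record forces a jump to the correlated sub-query shown near the top of this article:

  -- Easy: the maximum VALUE per group
  select lodging, max(age) from worker group by lodging

  -- Hard: returning the ENTIRE record with that value
  -- requires the correlated sub-query given earlier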

Most SQL problems that require correlated sub-queries are generally much more intuitive when solved with a procedural approach. I know of only one person who claims correlated sub-queries (CSQ) are intuitive, yet he was unable to explain them by providing written steps of the mental process used to arrive at them. (He referred us to books about mathematical set theory.)

However, we could always provide a stepwise breakdown of our procedural approaches (general-to-specific), or at least provide a hands-on equivalent (pencil, paper, and rolodexes). CSQs so far appear immune to stepwise reduction in the traditional sense. I have yet to find a book that even tries to explain how to solve CSQ problems. They simply say, "here is the solution."

It is my opinion that SQL focuses on a set-oriented approach instead of a procedural/cursor-oriented approach because it is based on fields of mathematics familiar to the originators of SQL. Procedural/cursor-oriented approaches, even though most would find them more intuitive, are (unfortunately) not related to any common school curriculum.

Note that some readers have claimed that set orientation is more optimizable for speed and resources than procedural/cursor (P/C) approaches for table manipulation. The jury is still out on this; however, in many circumstances they appear to be right so far.

Even if CSQ approaches do turn out to be inherently faster than P/C approaches (which I don't buy at this time), the programmer should still be given a choice of intuitiveness over speed if desired, such as when execution speed is not a bottleneck for a given problem.

These next examples suggest a variation of SQL that is more "group friendly":

  SELECT PER GROUP TOP 1 name, lodging, age
    FROM worker GROUP BY lodging 
    ORDER BY lodging, age desc

The TOP clause is borrowed from Microsoft Access. The PER GROUP clause means apply the statement to each group. If we wanted to list all workers with the top age (there may be more than one per group), then something like this may be feasible:
  SELECT PER GROUP name, lodging, age FROM worker
    WHERE age = MAX(group.age)
    GROUP BY lodging
    ORDER BY lodging, age desc

This variation uses the WHERE clause. The PER GROUP clause is intended to make any comparisons or actions treat each group in isolation. It in essence replaces the double loop of our prior examples where the outer loop was per group. The pseudo-table "group" in the MAX() function is needed so that the age is not compared to the maximum across all groups.

Someone also suggested using the HAVING clause for a similar purpose. Their solution resembled:

  SELECT name, lodging, age FROM worker
    GROUP BY lodging
    HAVING age = MAX(age)
    ORDER BY lodging, age desc

All the field names in the HAVING clause would be based on the scope of the "current" group. However, their solution did not work on at least 2 platforms, and it is not standard SQL: since name and age are neither grouped nor aggregated, they could not legally appear in the select list, and HAVING filters whole groups rather than picking rows within them.

One reader has suggested that SQL is missing a "universal quantifier", sometimes known as "FOR ALL". The universal quantifier is part of standard predicate logic and set theory, but is missing in SQL for some reason, so it must be simulated with double negation, as shown below.
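For example, to list the lodges where ALL workers are at least 21, the missing FOR ALL must be phrased as a double negative ("there is no worker under 21") via NOT EXISTS:

  select distinct lodging from worker w
  where not exists
    (select * from worker
     where lodging = w.lodging and age < 21)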

These are only preliminary suggestions for improving grouping operations. These approaches may also make auto-optimization easier than the for-each-group example does.

Native Calls Lacking

Working with SQL API's in a programming language can be likened to dealing with a bank teller. All transactions have to go through a little window between you and the bank. Actually, a country border crossing booth is a more accurate analogy because all your stuff has to be packaged a certain way such that it can be properly carried and inspected.

SQL is a protocol between your program and the database server. Although SQL is quite powerful, there are times when you want to manipulate the tables, yet do something that also requires a complex algorithm.

For example, suppose you created a nifty function in your favorite language that translated English to Spanish. Example usage:

  spanish = Eng2Spanish("Pull my finger")

Suppose you need to do a lot of translating at your company, and suppose you want to use your new function to translate the description field in a parts-list table. You might be tempted to do something like this:
  Update parts set descript = Eng2Spanish(descript)

But guess what? You cannot do this in standard SQL. You have to either use messy cursors or hope that some proprietary feature of your database server allows you to somehow install your function.

Thus, the power of your SQL server is somewhat isolated from the power of your favorite programming language.

It does not have to be this way. In XBase, for example, one can easily say:

  Use parts
  Replace all descript with Eng2Spanish(descript)

Providing such communication between SQL (or its equivalent) and your programming language is not a simple task, but it is probably doable to a large extent.

Virtual Direct Access To Tables

Although this complaint is not directly the fault of SQL, it is related to approaches that SQL often encourages.

To manipulate or reference individual fields, especially for dealing with GUI's, one often ends up doing something roughly similar to this:

  record = getFromSQL(sql_access_statement)
  title = record.getField("title")      // copy field to a memory variable
  displayGUI title, ...[other fields]
  record.setField("title", title)       // copy the memory variable back to the field
  record.save

Although our example has only one field, in the real world one may see 50+ fields in many cases. If we had 50 fields, then we would have 100 conversion operations: 50 to copy the fields to memory variables, and 50 to copy them back.

Also note that some variations require one to replace the setField operation with code that puts the value into an SQL Update command.
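For example (a sketch, assuming the $-style string substitution used in the hash-table example later in this article, and a made-up emp table):

  sql = "update emp set title = '$title' where empID = $currentID"
  runSQL(sql)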

Rather than doing the get/set dance to translate fields to memory variables and then back to fields when we are done with them, wouldn't it be nice to "talk" to the field as if it were a regular variable? For example:

  open emp as e
  locate(e, where firstName = "Fred")   // find a Fred
  displayGUI e.firstname
  e.firstname = "Freddy"
  close e

The syntax shown is only one suggestion. However, we are able to use "e.firstname" just like an ordinary variable. (Some approaches may require an explicit "save" operation, while others may automatically save upon close or upon retrieval of a new record.)

Note that this approach can be done through an SQL translation layer: the "get" and "set" operations and their translation to SQL are done in the background. Data Dictionaries can also help by automating or hiding the set/get process from the programmer.

A related topic is that sometimes returning a "result set" is not an efficient or natural paradigm for our needs. For example, suppose we had a table that had all the phone book entries for a big city. We want an application where one puts in a last name and the closest match is displayed in the middle of a scrollable grid, sorted by last name. If we looked up "Scrooge" for instance, we might get this:

  Last       First    MI   Phone
  ---------  -------  ---  ---------
  Scottson   Bod           123-4567
  Screm      Shelly        439-2828
  Scrime     Dan      R.   114-3983
  Scroa      Micheal       545-6566
  Scrog      Bella         123-5432
  Scrooj     Tina          881-4432
  Scrop      Jack     S.   234-2342
  Scroto     Randy         543-5432
  Sdallena   Sandy         998-3421
  Sela       Mandy         202-3300
  Sella      Hoga          789-2025

We want the user to be able to scroll up or down as far as they want. Many of the non-set-oriented database systems allowed one to issue a command that returns a "record pointer" positioned at the nearest match. One could then scroll forward or backward one record at a time using the record pointer. Here is an example:

  open phonebook as b
  selectIndex(b, "Lastname")   // show in last name order
  set near on           // setting to find nearest match 
  locate(b, where Lastname="Scrooge")  // put record pointer to nearest match
  while not eot(b) {    // while not end-of-table
     display b.firstname, b.lastname, b.phone
     skip(b, 1)         // move 1 record forward
  } 
  skip(b, -1)    // we could also go backward

Standard SQL cannot really return a window around the nearest match, and the Like operator is not really what we need: "Like 'Scro%'" can find rows matching a prefix, but it cannot center a scrollable window on the nearest entry when nothing matches at all.

Many SQL-based systems can do something similar with proprietary cursors; however, many of them must first generate a result set for the cursor to traverse. Returning a result set of a million names may not be very efficient or practical. (There are games one can play to reduce the result set, as sketched below, but they are often imperfect and kludgy.)
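For example, one common kludge is to chop the result at the search key, which handles forward scrolling but makes backward scrolling require a second, reversed query:

  select * from phonebook
  where lastname >= 'Scrooge'
  order by lastname
  -- then fetch only the first N rows (via a vendor-specific
  -- TOP or LIMIT clause); scrolling backward needs another
  -- query with lastname < 'Scrooge' ordered descending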

Thus, the set oriented paradigm is not always appropriate.

I have found that treating tables as memory-accessible entities makes many tasks more natural and straightforward than the set-oriented approach.

Let's assume that you have a sort of hash-like table where you also want to track the number of times each key is accessed. Suppose we want the record deleted once its access count passes 10:

TABLE: hashy
FIELDS: thekey, thevalue, count

In Clipper you could do something like this:

Proc hashit(tablehand, mykey, myvalue)
  select (tablehand)
  seek mykey     // assumes an index on thekey is selected (xbase is weird at this)
  if not found()
    append blank   // insert new record
    thekey = mykey
  endif
  thevalue = myvalue
  count = count + 1   // it might also accept ++
  if count > 10 
    delete
  endif
Return

If I used my preferences it might look more like:

sub hashit(tablehand, mykey, myvalue)
  private t = tablehand  // shorten the reference a bit
  if not found(t, _where thekey = mykey)
    insert_blank(t)
    t.thekey = mykey
  endif
  t.thevalue = myvalue
  t.count++
  if t.count > 10 do delete_rec(t)
endSub


However, SQL would require something like this:

procedure hashit(tablename, mykey, myvalue)
  private sql, mycount, rs
  mycount = 0
  sql = "select * from $tablename where thekey = '$mykey'"
  rs = runSQL(sql)
  if rs.eof   // not found
    sql = "insert into $tablename (thekey) values ($mykey)"
    runsql(sql)
  else
    mycount = rs("count")
  end if
  rs.close
  sql = "update $tablename set thevalue = '$myvalue', "
  sql = sql & " count = count + 1 where thekey = '$mykey'"
  runSQL(sql)
  if mycount > 9
    sql = "delete $tablename where thekey = '$mykey'"
    runSQL(sql)
  end if
endProcedure

In the cursor-ish approach we don't have to keep looking up the record. A virtual pointer is set to it, and we can do anything we wish to the current record. SQL, however, requires isolated operations for adding, changing, and deleting. I find the cursor approach much more natural and shorter for such operations.

Note that this does not necessarily require that the record be locked for the duration of our activities. The changes could (hopefully at our option) be buffered until we tell it to write them or until a waiting period is met.

A related common need is getting back a sequentially assigned unique ID number (assigned by the RDBMS). It is a pain to write SQL to save a new record and then go back and get the just-created ID number, especially on a multi-user system where many new records may have been added between the time your program added the record and the time it reads it back. (There are vendor-specific solutions, of which a sample is shown below, but a standard would be nice.)
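For example, each vendor has its own incantation for fetching the just-assigned ID; a sample (none of it standard SQL):

  insert into parts (descript) values ('widget')
  select LAST_INSERT_ID()                  -- MySQL
  -- or:  select @@IDENTITY                -- Microsoft SQL Server
  -- or:  select mySeq.currval from dual   -- Oracle, via a sequence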

SQL triggers can perhaps be used. However, triggers separate the logic from where it is needed and used. The example(s) I gave may only need to be done for one application or script, while triggers are generally meant for global business-rule enforcement. Even if you created and dropped triggers within the code, I find that solution a little on the "hacky" side: what if the program croaked in the middle of setting the triggers and thus never got around to removing them? (They are essentially global.) It is also not as compact as my original suggestion.

Inconsistency

An oddity of SQL syntax is the different way fields and values are specified in the Insert and Update commands. The Insert command uses separate field-name and value lists, yet the Update command uses "Set [field] = [value]" syntax. I can find no practical reason why these are not interchangeable. (There are times when one approach is preferred over the other, but both commands should allow both syntax types.) The inconsistency can greatly complicate programming because a completely different algorithm is needed to build Update lists versus Insert lists, as the example below shows.
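For example, the same three field/value pairs must be arranged completely differently for the two commands:

  insert into worker (name, age, lodging)
    values ('sam', 22, 'mullers')

  update worker set name = 'sam', age = 22, lodging = 'mullers'
    where name = 'fred'

A routine that builds the parallel name and value lists for Insert cannot be reused to build the pair-wise assignments for Update, even though the information supplied is identical.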

Simplifying Joins

Most join paths between different tables are the same throughout an application. Thus, it may simplify application development to be able to specify a "standard join" rather than naming the fields to join on every time a join is needed. The syntax might resemble:
  SELECT *
  JOIN invoices TO invcDetails,
       invoices TO customers

The system assumes that if no field is supplied, then the "standard" join fields are used. These would have to be set up ahead of time by the table designer or DBA.

If we want to specify outer joins, perhaps something like:

  SELECT *
  JOIN invoices TO invcDetails OUTER,
       invoices TO customers OUTER

The direction of the outer join would probably also be part of the standard-join setup process. Those who don't want the standard direction can still use the "old fashioned" explicit syntax.

In case somebody merely forgets to mention the fields (rather than actually wanting the standard join), perhaps an explicit clause should be required:

  SELECT *
  JOIN invoices TO invcDetails STD,
       invoices TO customers STD OUTER

Some of this could be done via a wrapper API around SQL. But there are problems with that approach, such as the inability to prototype the SQL expression by itself before applying it to the application.

Alternative or Complementary Specimens

One existing alternative or enhancement to the SQL set approach can be found in parts of the XBase language (and its dialects). An XBase-like cursor-oriented approach avoids some of the problems described above, such as arbitrary result boundaries and disjointed adds, changes, and deletes. (XBase has many problem areas as a language; thus, I am only talking about certain features, not necessarily the entire language.)

In XBase, one traverses tables by selecting (or making) a single "controlling" index. What appears in "next" and "previous" record-pointer (cursor) movements depends on the controlling index. (Note that an SQL engine has to make a temporary index anyhow for orderings that do not already have an index. However, the fact that SQL does it automatically is nice.)

When a new record is added, it defaults to a blank record (or to whatever field defaults have been defined). This remains the current record until the cursor is moved. Thus, one can add, change, and delete the same record multiple times without having to reissue queries, unlike in SQL.

Relational operations are done by specifying the links between two or more tables. When the cursor for one of the related tables moves, the cursors for the others are automatically moved. (An index is needed for the relation key.)

Note that even though there is only one cursor at a time, disk buffering in RAM generally improves the speed of access. Thus, XBase does not necessarily have to physically read one record at a time from disk. (This of course depends on the size and order of the physical records.)

XBase is far from perfect, and contains(ed) some archaic concepts. However, it can serve as a guide to cursor-oriented approaches and to integrating collection handling into the language to reduce DB-engine-to-code-and-back translations. (The cursor languages built into some of the commercial RDBMS SQL systems still lack this integration, beyond API's.)

There are times when the set-oriented approach is preferred and times when a cursor-oriented approach is preferred. Perhaps someday somebody will find ways to integrate all of them to give us the best of both worlds in a single package.

I have found that the API's needed for cursor and set approaches have enough similarities to perhaps simplify the integration. For one, when traversing records, set orientation tends to first collect a "result set" into an internal buffer, whereas cursor orientation tends to fetch each record as it is requested. (Note that buffering may still be used to speed up cursoring; however, it is not tied to a logical result set.)

To the application programmer, these approaches look nearly the same in the code body. Most of the differences will be in the query or table-handle setup code. Thus, if a collection system offered both approaches, one could switch from one to the other without major code changes, and we avoid any major continuity problems.

   t = openRecordSetHandle(setUpStuff)
   while getNext(t)
      print t.a
      print t.b
      t.c = foo
      t.b = bar
   end while
   t.save   // may not be needed in some cases

Most of the differences would only be in "setUpStuff". Sometimes cursoring will move the ending condition into the While-loop expression instead of a query. However, this actually highlights one of the advantages of cursoring: being able to traverse data without having to know the end condition (the size of the result) in advance.


Related Documents

SQL Alternative
Automated Joins?
Table Oriented Programming
Collection Taxonomy Criticism
Remote SQL
XBase


© Copyright 1999, 2000, 2001 by Findy Services and B. Jacobs