How To Design A Programming Language

A survey of scripting programming language feature options

In search of the ultimate language
We Dare to Compare
(Note: This is about syntax design, not interpreter building)

Updated 2/19/2004

Most of the popular scripting languages were built with poor research or debate about why one approach is "better" than another. The author(s) of the scripting languages simply built upon something they were already familiar with or gave it a bunch of features without documenting their tradeoff considerations. A lot of "history" is thus lost to new language inventors.

To reduce such problems when the Next Great Scripting Language is built, I have assembled together a list of language options, possibilities, and my preferences based on the pros and cons given. You are welcome to contribute any comments or options. I am not promising the best decision, only the best collection of possibilities and analysis of the design options.

Audience and Scope

This article focuses primarily on scripting languages for professional programmers, although simplicity and easy learning are certainly criteria that we will apply (among others). After all, programmers have to learn, remember, use, read, and fix software in many languages. We should not assume that they always have the time or budget to master a bunch of funny little rules of any one language.

We are also focusing on a middle-of-the-road scripting language; one that is fairly good at quick and dirty stuff, yet has enough "safety features" to build fairly large applications. More on scripting limits and target applications is presented under the flexibility criteria. Some people believe that a dichotomy between scripting and non-scripting languages is false. If you are in this camp, then simply ignore the word "scripting" whenever you see it. Or perhaps replace it with "weak-typed" or "dynamically-typed". More on the definition of scripting can be found here.

No one single language is to be promoted here. We are exploring concepts, options, and possibilities; not specific existing languages. If I mention a language, it is only to serve as an example for those who may wish to go see a concept in real-world action.

My Bias

I must admit that I am a fan of Table Oriented Programming (TOP). However, I will attempt to put this bias aside most of the time, but mention how TOP could be applied under some circumstances. Languages like Perl tend to use strings, streams, lists, and arrays to do operations that, in my opinion, are really meant for TOP. I am also not much of an OOP fan.

As far as language familiarity, I have had exposure to *Pascal, XBase, *C, *Java, Visual Basic, Perl, *Fortran, *COBOL, and briefer exposure to many others, including Lisp, Tcl, Python and APL. (Items marked with an asterisk are generally not considered scripting languages.)

Some also complain that I have never built a formal compiler or full interpreter, only simple ones; and thus allegedly have no business designing or critiquing a language. To this I reply that composing music well and playing music well are not prerequisites to each other; and that most of the options presented below already exist in other languages. They have already proven to be implement-able. Focusing too much on implementation may also bias one to create an interpreter-friendly language over a human-friendly one.

Most of the options and discussions presented here are based on my own observations and experience. I have already incorporated some observations mentioned by others. The feedback of others is welcome, although I cannot give personal credit unless it is an extended work.

General Criteria

Before we get into the actual options, we will look at the criteria that are being used to evaluate the options. The goal is to optimize for all of them, although the more important ones for scripting are listed first. In most cases, the difference between a scripting language and a non-scripting language is in the ordering of these criteria.

1. Rapid Development

Rapid development is the primary reason for using a scripting language. This is often achieved by having powerful, concise operations that leverage common needs. Parsing, sorting, searching, and file processing are common examples of what I call "aggregate power" functions. They are generic operations which are generally easy to describe but time-consuming to implement from scratch.

Another way to simplify development is to simplify the syntax. This is done by avoiding the need for excessive type conversion functions, wrappers, mandatory error handling, etc. It is also achieved by allowing or assuming common defaults. Passive defaults are those built into the language. Active defaults can be set by the programmer.

Still another approach is the use of abbreviations, aliases, and/or macros to avoid the repetition of long statements. (Surgeon General Warning: poor usage of defaults and abbreviations can lead to programs that are difficult to read.)

Many scripting languages are also known to have a good set of string manipulation and parsing operations and functions. String processing is often needed to extract data from one system and prepare it for another. Hence, scripting languages are sometimes called "glue languages".

2. Maintainability

Just because a program is easy to write does not mean it is easy to read. Maintenance is an important part of the programming profession. There is nothing more irritating than inheriting cryptic or poorly designed code to maintain or fix. Maintainability is often considered a weak point in scripting languages.

Although it is often said that bad programmers can ruin any language, some languages are much more abusable than others. To prevent bad or selfish programmers from doing too much damage, I propose that a good scripting language should be designed such that it is always easy to tell where one command/assignment ends and another starts, and easy to tell the relationship (if any) between the command/assignments. One way to achieve this is to avoid unnecessary deviations from the function rule.

Thus, you will always know what is a parameter, an assignment, a control statement, a function call, etc. The programmer may be able to scramble the molecules, but not the atoms.

Languages like Pascal and Java (non-scripting languages) are examples of languages that are the least abusable from this standpoint, while Perl and C++ can be notorious in the wrong hands. We will try to reach a good compromise here.

The amount of readability should be given much thought. Programmers are more heavily rewarded for finishing on time than for producing readable code. This is primarily because meeting deadlines is easier to measure than code readability. Everybody knows when a programmer is late, but few if any know how maintainable their code is. Thus, in my opinion the language needs to protect itself (and the company) from this unfortunate bias to some extent. Reverse-engineering somebody else's code can be an expensive and time-consuming task. Don't ignore this cost just because it is down the road a bit. (In finance it is customary to downplay future results, but not ignore them.)

Some "bad" programmers claim that it is the reader's fault if they cannot decipher the code or don't know the shadier side of a language. The problem is that many programmers are called on to use many different languages in one organization and cannot always become a complete expert on any one language or style. How would you like a TV repair person to completely rewire your TV set outside of factory specs? Would you blame the follow-up repair guy if he quoted you 7 grand?

There is a saying in the Unix community that one should not prevent idiots from abusing something because it may prevent someone else from making good use of it. In other words, give everyone chain-saws because (hopefully) more people will build useful things than will damage something or someone. However, it is my observation that programmers are more likely to abuse a language than to make good use of it. This is usually because the incentive to finish fast is greater (and easier to measure) than the incentive to make coherent and maintainable systems. Thus, I unfortunately have to disagree with the "Unix Chain-saw" rule.

3. Flexibility

This is sort of a catchall phrase for being able to be applied and/or extended to many uses. Thus, it must be easy to bang out throwaway tools, or write a 30,000 line application without too many issues with variable and function scope. (Note that very large applications, life-support applications, mass distribution, fast math, or base tools like compilers and OS's are usually not the target of scripting languages. These are usually not what scripting languages are meant for and one should not be offended by this.) Convenient ways to link to database and GUI engines should also be considered.

4. Language Simplicity

If we include everything plus the kitchen sink, the compiler and/or interpreter would be too difficult or expensive to build and debug. The learning curve would also be greater. One should not assume that a given scripting language will be the only language a person will ever want or need. Perl and Python are more complex than they need to be because the authors had a vision for the "ultimate language" with seemingly every feature they ever envisioned. (Of course, some of this depends on one's design philosophy. For example, I tend to use databases and table API's for things that others put into arrays.)

One way to make life easier for the compiler or interpreter (C/I) is simply to use function calls or API's for as much as possible. We will call this the function rule. The function rule says to use function/subroutine syntax to implement all operations unless you have a good reason to deviate.

As an example of an unjustifiable violation of the function rule, XBase uses the dollar operator ($) to mean "is contained in". However, we see little reason not to make this a simple built-in function, such as InStr(), meaning "in string". Thus, if x $ "abc" could instead be written as if InStr(x, "abc"). Because the logic for parsing a function call is already built into the C/I, implementing InStr adds very little to the C/I's burden; whereas the dollar-sign operator is another token that has to be parsed. (It is also harder to document and index in a manual.)

On the other hand, functions can get annoying for some commonly-used operations. For example, math operations could be represented with just functions like Plus(), Minus(), and Divide(). However, most language builders chose to implement these using the operators +, -, and / instead. Thus, instead of using:

 
    total = a + b + c + d + e

one would have to use this if the function rule was strictly followed:

 
    total = add(a, add(b, add(c, add(d, e))))

(Note that it may be possible to implement something like add(a,b,c,d,e) but this approach has other sticky issues associated with it.)
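
The variadic form mentioned in the note can be sketched in Python, used here purely for illustration. The name add comes from the example above; everything else, including the choice of 0 as the result for an empty call, is an assumption of this sketch.

```python
from functools import reduce
import operator


def add(*terms):
    """Sum any number of terms, e.g. add(a, b, c, d, e)."""
    # reduce folds the binary + over the terms, which is what the nested
    # add(a, add(b, ...)) calls spell out by hand; 0 is returned for no terms.
    return reduce(operator.add, terms, 0)


nested = add(1, add(2, add(3, add(4, 5))))  # strict two-argument style
flat = add(1, 2, 3, 4, 5)                   # variadic style
print(nested, flat)
```

One design question that surfaces even in this small sketch is what add() with no arguments should return; here it is simply 0.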

Although functions keep the syntax simple by being a generic programmer interface, they do have stylistic limitations.

Deviations from the function rule often indicate the "orientedness" (target audience or usage) of the language. This is where the art and politics of language design come in. For instance, in table-oriented programming, one would rather see and type table2.fieldx instead of fieldvalue(table2,"fieldx") required by some API's. Object-oriented programming uses the dot operator for yet another purpose.

Footnote: Perhaps there is a pattern to deviations from the function rule that an implementer may want to build-in or prepare for. This could allow for custom oriented-ness to be added onto a language. This way a base language could be built, but then have certain ways to "overload" some operators so that fans of Objects, Tables, Strings, Streams, Pipes, Math, etc. can do their favorite stuff without too many unnecessary functions, parentheses, or quotes. This may be an interesting topic of research that is outside the scope of this article.

5. Speed

This is the speed at which the result runs. There is usually a direct tradeoff between speed and programming effort. For example, a scripting language may have one numeric type, while a speed-oriented language may have many numeric types such as single, double, integer, short, long, etc. This is because the more information the compiler has about a number, the better decisions it can make when choosing which machine instructions to use for the resulting executable or byte-code. For example, if a type is integer, then there is no need to check for and/or process decimals. Also, short integers require fewer bytes to process than long ones.

Of course, this can put a burden on the programmer to plan types better and add more type converting between number types. It is basically a tradeoff of burdening the programmer with the details or burdening the CPU and compiler/interpreter with the details.

Although it is true that CPU's are becoming faster and faster, it is also true that more is being asked of programs. GUI's and speech recognition are examples of new burdens that come along. Fortunately, these functions can often be off-loaded to API's or frameworks written in C or chips that do not burden the decision processing code itself. The idea is that a scripting-like language makes the decisions, but lets fast but ugly components do the actual number crunching.

6. Check Before Run

This is the ability to prevent as many errors as possible before run-time. Scripting languages are usually not good at this because preventing such problems often involves implementing picky little rules that tend to bloat up the syntax with strong typing, conversion functions, and other "anal retentive" concepts. Although it is something to keep in the back of your mind when making a scripting language, it is not something to obsess over. Perhaps "Lint" type utilities can be made to look for suspicious code before running. Such a utility would only point out possible trouble-spots upon request, not abort compilation or running.

7. Keep With Past

A sticky issue is how much to follow existing patterns at the expense of a better idea. Familiarity has the advantage of being time tested and it reduces the learning curve. Newness is somewhat risky because we don't know how it will be used or what unforeseen drawbacks there are. As a general rule of thumb, we will say that a new idea must appear at least 20 percent better than the existing one before it replaces the current one.

Click here for more information about cataloging the evaluation criteria for languages, paradigms, and tools.

The Options

Option Selectors

I enjoy debating language features. However, in doing so I have encountered certain "hot button" topics. In these cases the choices seem related to personal preferences, habits, and programming philosophies. Rather than forcing one's own preferences on other programmers, perhaps the language builder can make hot button features optional or select-able with preference indicators at the top of a program file or directory. Example:

 
   #Prefs 
       LineBased:   on    // line separators instead of semicolons 
       NullHalt:   off    // no halt if nulls in expression 
       DivideResult: halt   // halt if division by zero 
       // Other DivideResult options: null, zero, -1 
       IgnoreCaps:  on    // ignore case in string comparisons 
   #EndPrefs

Or, if we already have a file with our favorite preference settings, then we could have something like:

 
   #Preffile: /prefs/Joes_prefs.txt

There are two drawbacks to allowing too many variations. First, the compiler/interpreter has to be more complicated to handle the variations, and second, the learning curve for the language may be steeper. Perhaps the learning curve would actually be smaller in some cases because the programmer can make the settings closer to what they are familiar with.
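
To give a feel for the extra work on the compiler/interpreter side, here is a minimal Python sketch that reads a preference block shaped like the example above. The function name parse_prefs and the rule that "//" always starts a comment are assumptions of this sketch, not part of any existing language.

```python
def parse_prefs(source):
    """Collect 'Name: value' settings found between #Prefs and #EndPrefs."""
    prefs = {}
    inside = False
    for raw_line in source.splitlines():
        line = raw_line.split("//", 1)[0].strip()  # drop trailing comments
        if line == "#Prefs":
            inside = True
        elif line == "#EndPrefs":
            inside = False
        elif inside and ":" in line:
            name, value = line.split(":", 1)
            prefs[name.strip()] = value.strip()
    return prefs


program = """
#Prefs
    LineBased:   on    // line separators instead of semicolons
    NullHalt:   off    // no halt if nulls in expression
    DivideResult: halt   // halt if division by zero
    // Other DivideResult options: null, zero, -1
    IgnoreCaps:  on    // ignore case in string comparisons
#EndPrefs
x = 5
"""
print(parse_prefs(program))
```

Reading the settings is the easy part; the real cost is that every later stage of the compiler/interpreter has to consult them.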

Whether to allow the selection of the options or to pick what you as a language designer think is the best is part of the art of language design. I have indicated areas where option selectors may be a better alternative to pissing off a programmer by forcing your philosophies down his/her throat. Remember the fast-food slogan:

Have it your way

Footnote: Perhaps votes can be taken on features where the voter ranks the strength of their preference. Those features with the most variance in the strength rankings would be the best candidates for option selectors. (Perhaps a sum allocation system may prevent voters from exaggerating all their preferences in order to inflate their vote.) This may be a more scientific approach than counting the personal insults in forum discussions. Of course, the difficulty of implementing selectors for a given feature should also be factored in. Something like optional halting on division by zero should be fairly easy to implement. However, the handling of null values in expressions can get complicated.
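
The variance idea in the footnote can be sketched in a few lines of Python. The ballots, the -5 to +5 rating scale, and the function name selector_candidates are all invented for illustration, and the sum allocation rule is not modeled here.

```python
from statistics import pvariance

# Hypothetical ballots: each voter rates a feature choice from
# -5 (strongly prefer style A) to +5 (strongly prefer style B).
ballots = {
    "statement separator": [-5, 5, -4, 5, -5, 4],
    "noise words":         [-1, 0, -2, -1, 0, -1],
}


def selector_candidates(ballots):
    """Rank features by the variance of their ratings, highest first."""
    return sorted(ballots, key=lambda feature: pvariance(ballots[feature]),
                  reverse=True)


print(selector_candidates(ballots))
```

Features at the front of the list are the ones where voters disagree most strongly, and so are the likelier candidates for an option selector.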

Statement Separation

This is how statements are separated from each other. Some languages, like C and Perl, use only semicolons; and others use line-feeds as their primary separator, such as Basic, XBase, and Fortran. Note that contrary to popular belief, Visual Basic does allow multiple commands on one line using a colon (:) as a separator. Example:

 
    thing = 5 : foo = "this" : bar = 99.99

(A semicolon could perhaps be used for this purpose instead of a colon.)

Line-feed languages also allow continuation of a statement to the next line(s) with explicit markers at the end of a line such as underlines or semicolons. Example:

 
   aBigLongVariableName = aBigFatFunctionName( BigFatParameter1,  _ 
      BigFatParameter2, BigFatParameter3 )

Keep in mind that the primary difference between the two is not whether semicolons are used, but whether line-feeds can also act as a separator (sometimes the only separator). Often the best way to distinguish is by whether a continuation character is needed to wrap a long statement. If one is needed, then it is probably line-feed based. (However, JavaScript still seems to be an odd hybrid.)
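
As a rough illustration of what a line-feed based reader has to do, the following Python sketch joins continued lines before treating each line as a statement. The function name split_statements and the rule that the marker must be preceded by a space are assumptions made for this sketch.

```python
def split_statements(source, continuation="_"):
    """Split line-feed separated source into statements, joining continued lines."""
    marker = " " + continuation   # require a space so names ending in "_" are safe
    statements = []
    pending = ""
    for raw_line in source.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.endswith(marker):
            # Statement continues on the next line: keep the text, drop the marker.
            pending += line[: -len(marker)].rstrip() + " "
        else:
            statements.append(pending + line)
            pending = ""
    if pending:
        statements.append(pending.rstrip())
    return statements


source = """
thing = 5
aBigLongVariableName = aBigFatFunctionName( BigFatParameter1,  _
   BigFatParameter2, BigFatParameter3 )
"""
for statement in split_statements(source):
    print(statement)
```

A semicolon-based reader needs none of this, which is part of why it is considered easier to implement.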

Option 1 - Semicolon

Pro - Easier to write compilers/interpreters for. Familiar to many C, Java, and UNIX programmers. Easier to use for very long statements that should be wrapped. (Long lines are less common in scripting languages than formal OOP languages.)
Con - Drives beginners or multiple language users nuts when they often forget to include a few. A little extra typing.

Option 2 - Line-Feed

Pro - Uses the line-feed, which is already there, as a separator; thus closer to the code's physical appearance.
Con - see above under semicolon pro.

Option 3 - Select-able

This would be a setting at the top of the program file to select which style to compile/interpret for.

Pro - Allows programmer to select the style they are most comfortable with.
Con - Makes the compiler/interpreter harder to build.

Favorite

We choose the select-able option. I have participated in too many heated debates about what is the best approach. (My personal preference is line-feeds, by the way.)

Blocking Markers

This is the method used to specify which group (block) of commands an operation applies to. The two most common block marking approaches are curly braces, "{}", and what we will call "X/endX". In its simplest form, X/endX simply uses the word "end" concatenated with the first word of the control statement. Thus, if a control statement starts with "While", it will end with "EndWhile".

Unfortunately, BASIC and other languages have muddied up X/endX with "noise words" like "do" and "then". These extra words are of no real use, and therefore should not be included. Thus, a language should have "while" and "if ...", not "do while" and "if ... then". (I will not include a pro/con section on these noise words since I know of no known benefits.)

My personal preference is X/EndX because it gives the reader and interpreter/compiler better locational information if there is a typo somewhere. You can read more about this at this link.

A third possibility is to use the indentation itself to indicate a block. A possible problem with this is that when tabs and spaces are mixed together, the indentation may get messed up. This is because different text editors and printers interpret tabs differently. On some systems a tab is 4 spaces, on others 7, etc. The compiler would probably have to reject (halt on) either leading tabs or leading spaces to enforce consistency. I would give this option much higher marks if tabs were standardized.

Since this topic also incites riots, the formal favorite is a style option selector at the top of the program file. (See below about other possible selection methods.)

To recap, the options are:

Curly Braces (or a consistent pair)
X/EndX
Indentation
Style Switch

Note that braces (above) and semicolons often seem to go hand-in-hand. Therefore, it might simplify the compiler/interpreter to offer the selection of either braces and semicolons together, or X/EndX using line-feeds as the primary separator. The camps can generally be split into the UNIX-syntax camp and the non-UNIX-syntax camp. (UNIX systems seem to be the biggest source or influence of the semicolon-and-braces languages.) Rather than pick one side or the other of these two large groups, it may make more sense to offer both styles somehow.

It might even be possible to allow a mixture of both approaches in the same program without explicit option switches. If a blocked statement ends with a left brace, then every statement in the block must be separated with a semicolon and the block must end with a right brace. If the control statement does not end with a left brace, then X/EndX and the line-feed approach is assumed. Perhaps this only has to be done at the subroutine definition level -- it is not likely that someone would mix styles within the same subroutine.

   sub a() {    // interpreter sees brace
     foo();     // semicolons
     b = 9;
     ...
   }

   sub a()       // no brace, endx assumed
     foo()       // no semicolons
     b = 9
     ...
   endSub

Whether self-indication at the block or subroutine level is feasible or not needs some study.
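
As a starting point for such a study, the detection step itself is small. The Python sketch below looks only at the subroutine header line; the function name block_style is hypothetical, and the sketch assumes "//" never appears inside a string on that line.

```python
def block_style(header_line):
    """Guess the block style of a subroutine from its header line alone."""
    code = header_line.split("//", 1)[0].rstrip()  # ignore a trailing comment
    return "braces" if code.endswith("{") else "endx"


print(block_style("sub a() {    // interpreter sees brace"))
print(block_style("sub a()       // no brace, endx assumed"))
```

The harder part, which this sketch does not touch, is switching the statement separator rules for the body once the style is known.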

Further note that Pascal's Begin...End pairs are actually more similar in structure to braces than to X/EndX. Since braces appear more common than Begin...End, I will not consider Begin...End separately here.

Note 3: I will mix styles in the coming examples, although most will use X/EndX. I will also use "//" as comment markers in the examples.

Block-less Statements

A sub-issue is whether to allow single statements after control structures. For example, be able to say "if x then do this statement" without placing blocking markers around the statement. One way to handle this is to have a simple keyword such as "do" to indicate we are acting on only one statement. Example:

 
   if x > 3   // regular blocked approach 
      m = x 
   endif 
   while y < 0 {   // regular with braces 
      mysub; 
   } 
   if x > 3 do m = x       // combined with "do" 
   while y < 0 do mysub

Some languages allow such without an indicator such as "do", but in my opinion, "do" would prevent a lot of syntactical accidents and misunderstandings. ("Do" is only meant for the non-semicolon code style.)

Parameter Calling

There are two sides to the parameters issue, sending them and receiving them. The sending side can use positional parameters, named parameters, clause-based parameters, or a mix.

Option 1 - Positional Parameters

Positional parameters are the most common. A parameter's position determines how it is received on the subroutine side.

Pro - Compact syntax.
Con - Requires one to know the proper position. Can get cumbersome if there are a lot of parameters, many of which you would rather use defaults for.

Option 2 - Named Parameters

Here is an example of named parameters:

 
   openTable(table="clients", access="read", sharemode="shared")

Pro - You do not need to know the proper position. Very useful if there are a lot of parameters for which the defaults are acceptable. (You don't need to list those parameters for which you want to use a default.) Named parameters are very convenient when different functions use the same parameter naming conventions, such as a "from" and "to" pair for Copy, Rename, and Move functions. Similarly, in SQL many different commands have a "Where" clause that filters the row scope. (Although SQL better fits the next variation.)
Con - Not very compact syntactically.

(Note: some variations do not allow default parameters. All parameters must be supplied under this limit.)

Option 3 - Mixed Clauses

This approach is fairly common in database languages and somewhat resembles Smalltalk's messages. Any mandatory parameters come first, and the rest are optional clauses. Examples:

   Open "Clients"                      // Minimum is table name
   Open "Clients" readonly shared
   Open "Clients" shared readonly      // position does not matter
   Open "Clients" _shared _readonly    // with underscores

   Select * from Clients Where status = "M"             // SQL
   Select("*" _from "Clients" _where "status = 'M'")    // variation

Some of the examples use an underscore to indicate a clause. This is to distinguish between the clauses and the "parameter" of the clauses. Thus, you can have a kind of "sub-parameter", such as the criteria ('status = M') after the _where clause in the last example. Note that the "_shared" clause does not have any sub-parameters because it provides information by itself. This provides two ways to possibly specify some types of parameters:

 
   Open "clients" _shared off 
   Open "clients" _noshare       // means the same thing 
   Open "clients" _exclusive     // same as no-share

We are not proposing that all variations be implemented; we are only presenting design options.

Note that if there is more than one mandatory parameter, they are separated by commas in traditional fashion:

 
   foo("this", "that", "pat" _thingy "stuff") 
   foo("this", "that", "pat")

The first 3 parameters are mandatory. Thus, those who don't like to program clauses can stay with the familiar, assuming the built-in functions do not use them.

One possible way to implement the processing of these types of parameters in the callee routine is presented at this link. It uses a function called Clause(). Clause() with one parameter simply returns a boolean indicating whether the clause exists. With two parameters it returns the sub-parameter(s) of the given clause. For example, for an SQL-like statement, Clause("Where") would return True if there is a Where clause, and Clause("Where",1) would return something like "status = 'M'".
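
To make the behavior of Clause() concrete, here is a small Python sketch. It is not the implementation referred to above; the Call class, its token-list input, and the rule that any string starting with an underscore begins a clause are all assumptions made for illustration.

```python
class Call:
    """Holds one call's mandatory parameters and its optional clauses."""

    def __init__(self, *tokens):
        self.mandatory = []
        self.clauses = {}   # clause name (lower case) -> list of sub-parameters
        current = None
        for token in tokens:
            if isinstance(token, str) and token.startswith("_"):
                current = token[1:].lower()
                self.clauses[current] = []
            elif current is None:
                self.mandatory.append(token)
            else:
                self.clauses[current].append(token)

    def clause(self, name, index=None):
        """One argument: does the clause exist? Two: its index-th sub-parameter (1-based)."""
        subparams = self.clauses.get(name.lower())
        if index is None:
            return subparams is not None
        if subparams is None or not 1 <= index <= len(subparams):
            return None
        return subparams[index - 1]


call = Call("*", "_from", "Clients", "_where", "status = 'M'")
print(call.clause("Where"))      # True
print(call.clause("Where", 1))   # status = 'M'
print(call.clause("OrderBy"))    # False
```

A known limitation of this sketch is that a mandatory string value beginning with an underscore would be mistaken for a clause; real syntax support would not have that ambiguity.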

Note that there are other variations that allow for an unknown quantity of parameters, such as field lists. Of course, field names could be put into one big string instead. There are many ways to skin a cat. The final choice usually depends on the orientation of the language. An advantage of the clause approach is that it can easily be expanded to allow zero or many sub-parameters per clause if the language builder later decides to add these features.

Pro - Good compromise between positional and named. Can handle both types of parameters. Can also handle non-parametered clauses and potentially multiple sub-parameters per clause.
Con - May be a bit difficult or slow to parse. Not very common outside of database languages, so it may cause some confusion.

Option 4 - Mixed Named Parameters

This is very similar to mixed clauses. Example:

 
   rr(1, 2, "m") 
   rr(1, y=2) 
   rr(m, y=n) 
   rr(1) 
   rr(x=1, y=7, z="hey") 
   rr(y=7, z="hey", x=1) 
   // The subroutine definition 
   sub rr(x=0, y=0, z="a") 
      ... 
   endsub

This example shows the subroutine definition with defaults assigned. (Defaults are addressed in the Parameter Receiving section.)

Pro - Good compromise between positional and named. Can handle both types of parameters. Easy to specify defaults.
Con - Allows exactly one parameter per parameter name. (Contrast this with clauses, which can potentially have zero or many sub-parameters per clause.)
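
Python is one existing language where this style can be seen in action. The sketch below mirrors the rr example; the body, which just returns the received values, is added only so the binding of each call is visible.

```python
def rr(x=0, y=0, z="a"):
    """Return the parameters as received, to show how each call is bound."""
    return (x, y, z)


print(rr(1, 2, "m"))            # all positional
print(rr(1, y=2))               # positional, then named
print(rr(1))                    # defaults used for y and z
print(rr(y=7, z="hey", x=1))    # all named, in any order
```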

Favorite

We chose either of the mixed parameter types for their versatility, although we lean toward the mixed clause approach for its expandability.

Emulating Named Parameters

Rather than create special syntax just for named parameters (described above), it is possible to use a string parameter to simulate named parameters. All that is needed are parsing functions to extract them. Example:

  x = myFunc(12, "blah", "foo='nork',glob=13");
  ....
  function myFunc(a, b, myNamedParams) {
    ....
    if paramExists(myNamedParams,"foo") {
      foo = getParamValue(myNamedParams,"foo");
      ....
    }
    ....
  }

getParamValue would return an empty string if "foo" does not exist. Thus, in some cases we may not need to call paramExists. Note that I would perhaps recommend a shorter function name for getParamValue in practice because it may be used often.

One drawback of this approach is that we cannot add new fixed-position parameters without changing any existing calls that use named parameters. However, new parameters tend to be optional anyhow, meaning they could be implemented with a named parameter. Another possible drawback is that it may not run as fast as dedicated syntax, although this depends on how the functions are implemented.
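
The parsing functions could look roughly like the following Python sketch. The names param_exists and get_param_value follow the example above; the exact string format (comma-separated name=value pairs, with optional single quotes around values) is an assumption inferred from that one example.

```python
import re

# name=value pairs; a value is either 'single quoted' or runs to the next comma.
_PAIR = re.compile(r"(\w+)\s*=\s*(?:'([^']*)'|([^,]*))")


def parse_named(named_params):
    """Turn "foo='nork',glob=13" into {'foo': 'nork', 'glob': '13'}."""
    result = {}
    for name, quoted, bare in _PAIR.findall(named_params):
        result[name] = quoted if quoted else bare.strip()
    return result


def param_exists(named_params, name):
    return name in parse_named(named_params)


def get_param_value(named_params, name):
    """Return the value as a string, or an empty string if the name is absent."""
    return parse_named(named_params).get(name, "")


def my_func(a, b, named_params=""):
    foo = get_param_value(named_params, "foo")
    return a, b, foo


print(my_func(12, "blah", "foo='nork',glob=13"))
```

Note that every value comes back as a string, so the callee must convert numbers itself; this is one place where dedicated syntax would do more of the work.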

Parameter Receiving

Parameter receiving generally has 2 issues: First is providing protection options (such as passing by value), and second is specifying defaults.

Scripting languages usually pass by reference because it is the most generic. The incoming parameter is simply a full-featured alias of the original.

Passing by value is used to give protection to the originator of the subroutine calls. No matter what is done to the local version, the original (caller) parameter is protected. The rules often get a little tricky though if large structures like arrays are passed.

Another possibility is to make all calls be by reference, but be able to designate them as read-only if protection is desired. This has the benefit of avoiding the problems with arrays that by-value methods have. (The read-only mechanisms of the compiler/interpreter can also be used to implement constants. Another possible benefit over by-value is speed, because internal copying is not needed.)

One of the drawbacks of the read-only method is that you cannot change the local copy. This is easily solved by explicitly making a copy if one is needed. Many consider this good programming practice anyhow. Altering parameters for local-only use can be misleading and risky.
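
The read-only idea can be approximated in Python with a read-only wrapper around a dictionary. This is a library-level imitation rather than a parameter-passing mode, and the names report, original, and view are invented for this sketch.

```python
from types import MappingProxyType


def report(settings):
    """Receives a read-only alias; it can read but not alter the caller's data."""
    try:
        settings["mode"] = "write"      # attempt to change the caller's data
    except TypeError:
        changed = False
    else:
        changed = True
    return settings["mode"], changed


original = {"mode": "read"}
view = MappingProxyType(original)       # no copy is made; the view wraps the dict
print(report(view))                     # ('read', False)

local = dict(view)                      # explicit copy when a changeable version is needed
local["mode"] = "write"
print(original["mode"], local["mode"])  # read write
```

The protection in this sketch is shallow: a mutable value stored inside the dictionary could still be altered through the view.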

We vote for the read-only option, if you can accept a little newness. (I don't know any languages that currently support it.)

We think it best that by-value (or read-only) parameters be the default because they are safer and more common than changeable parameters. Some marker can indicate changeable parameter(s):

 
   sub foo(this, *that, those)   // asterisk indicates changeable param. 
      that = this + those 
   endsub

In this case, an asterisk is used to mark changeable parameters. Pascal uses "var" for the same purpose.

Now on to the issue of default parameter values. Here is one approach:

 
   sub routinex(foo=9, bar="", stuff=0) 
     ... 
   endsub

If "foo" is not supplied by the caller, for example, it will receive the default value of 9. One possible drawback of this approach is that it may be tough to determine if the caller omitted required parameters. Another possible approach is to use the Clause() function set already described under Parameter Calling.

Another way to handle omitted and default parameters is with the OOP multiple prototype approach. However, we consider this a bit cumbersome and anti-scripting.

Alternatives are to use null values to indicate omitted parameters, or a ParamCount() function to indicate the number of parameters supplied. The chosen method would probably reflect the type of parameters supported. For example, ParamCount() would be more appropriate if only positional parameters are supported (discussed in prior topic), and Clause() more appropriate if named or clause-based parameters are supported.

Here is a possible way to implement variable quantities of parameters for positional schemes:

 
 sub foo(*) { 
    if paramCount() >= 1 { 
       param1 = param(1); 
    } 
    if paramCount() >= 2 { 
       param2 = param(2); 
    } 
    ... 
 }

Note that Param() is a function and not an array. It returns the value of the corresponding parameter. The drawback of this approach is that it only operates on by-value parameters, which may be acceptable for most variable-quantity parameter uses. A special scope operator could perhaps give direct access to the parent routine. For example, L can use "parent$foo = 12".

We could also consider a Perl-like approach which puts the parameters into an array; however, this also does not offer us control over by-reference versus by-value parameters, unless pointer-like constructs are used. However, pointers are ripe for abuse and complicate the language, so we will avoid them like the plague, right?
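For comparison, Python takes roughly the array route: extra positional arguments arrive in a tuple. The sketch below shows how the ParamCount()/Param() idea maps onto it. The routine foo and the use of None for a missing parameter are illustrative choices only.

```python
def foo(*params):
    # len(params) plays the role of ParamCount();
    # params[i - 1] plays the role of Param(i).
    param1 = params[0] if len(params) >= 1 else None
    param2 = params[1] if len(params) >= 2 else None
    return param1, param2

print(foo())
print(foo("a"))
print(foo(1, 2, 3))
```

As with the Param() scheme, rebinding a name inside foo does not change the caller's variable.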

Error Handling

Error handling is very tricky to design. The basic options are to handle an error, ignore the error (go on), or allow a default run-time halt. The tricky part is specifying when and where to do each of these three.

Option 1 - Java-Type Handling (forced catch)

Pro - The programmer must deal with most or all kinds of errors, thus perhaps improving error handling quality. Provides ways to "pass" error on to compiler or system error handler. Can handle groups of statements.
Con - Forcing is Anti-scripting. Forced catching makes for cluttered and bloated code.

Option 2 - Blocked Checking (non-forced)

This is similar to the Java style, but is not forced. If there is no error block, then a default run-time error is triggered upon error. Example:

 
    block { 
      x = openfile(blah) 
      blah = read(x) 
      close(x) 
    } catch { 
      show("Error reading file") 
    }

Pro - Not forced. Can handle groups of statements.
Con - Causes a fair amount of extra indenting and clutter that gets in the way of seeing the primary statements (It sort of elevates the status of errors above that of the statements themselves). The block length can end up being rather arbitrary, and combining of error types can make it even worse.

Option 3 - Function Status Return

This is where any operation that can detect an error returns a status code as a value. This is used by many C functions.

Pro - Easy to ignore by programmer if they don't want to check. No extra syntax structure is needed.
Con - Uses up the returned value. Tough to tell interpreter what to do if not directly handled (ignore or halt). Does not provide block (grouped) checking.

Option 4 - Error Handling Routine

This is the specification or designation of a routine that will be triggered if there is an error. One variation resembles this:

 
   On Error Call RoutineX()

This allows the handling routine to be reassigned as needed throughout the program. The second variation is to have a pre-designated routine that is always the error handler routine.

Pro - Does not clutter the code with error handling logic.
Con - Many agree it is best to keep the error handling logic close to the area of the cause, or at least have the option to.

Note that Visual Basic can have an error-handling section at the end of a given routine. This is sort of a compromise between a global routine and per-command handling. We are not that fond of this approach because we prefer handling to be right next to the offending command.

Option 5 - Check/Clear/Off Method

This is perhaps best introduced with sample code:

 
   errhalt("off") 
   stuff = read(x) 
   if err()         // check #1 
      show "Error in read. Error Number: " + errno() 
   endif 
   stuff = read(x) 
   write(x, stuff) 
   if err()         // check #2 
      show "One of the two above statements errored" 
   endif 
   stuff = read(x)  // checkpoint Lisa 
   clearerr()       // clear error status 
   write x, stuff 
   if err()        // check #3 
      show "The Write statement errored, Error#: " + errno() 
   endif 
   errhalt("on")   // halt if future error 
   stuff = read(x) 
   if err()        // check #4 
      show 'A useless check because of the halt' 
   endif 
   errhalt "off"     // don't halt if error 
   write(x, stuff)   // checkpoint Amy 
   status = err()    // check #5 
   if err()          // check #6 
      show "Will never trigger because prior err() cleared it" 
   endif

The Err() function returns True if any prior statement generated an error. It is cleared either after checking its value (such as in an IF statement), or if Clearerr() is called.

Thus, if the Read function at checkpoint Lisa errored, we could not catch it because we did a Clearerr() right after. Similarly, the statement at checkpoint Amy will not trigger check#6 because the check#5 "took" the error already.

The ErrHalt() function can be used to turn on or turn off halting if there is an error. The default is halting, thus you would want to issue ErrHalt("off") before using Err().

Since the error status is only cleared when sampled or upon a ClearErr(), one does not have to check for errors after each statement. Example:

 
   write h, a 
   write h, b 
   write h, c 
   write h, d 
   if err() { 
     show "An error occurred somewhere above" 
   }

The Err() function will tell if there was an error in any one of the above 4 Write statements, not just the most recent one. Thus, it provides many of the benefits of blocked catching (Java-like).

Pro - Can handle both an active or passive handling philosophy with the ErrHalt setting. Does not hog parameters or function results. Can detect errors for groups of statements, not just one.
Con - Not very common, could take some getting used to. Programmer may forget to reset the ErrHalt setting after setting it.
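To make the Check/Clear/Off behavior concrete, here is a minimal Python sketch of a sticky error status. It is only one way the idea could work: the class name ErrState, the run() wrapper, and the choice to treat only OSError as a "run-time error" are all assumptions, not part of the proposal above.

```python
class ErrState:
    """Sticky error status in the style of Err(), ClearErr() and ErrHalt()."""

    def __init__(self):
        self.halt = True        # default is to halt on error
        self._pending = False   # has any statement failed since the last check?
        self._errno = 0

    def errhalt(self, setting):
        self.halt = (setting == "on")

    def run(self, operation, *args):
        """Run one statement; on failure either halt or record the error."""
        try:
            return operation(*args)
        except OSError as e:
            if self.halt:
                raise
            self._pending = True
            self._errno = e.errno or -1
            return None

    def err(self):
        """Report whether an error is pending; sampling clears the status."""
        pending, self._pending = self._pending, False
        return pending

    def errno(self):
        return self._errno      # number of the most recent recorded error

    def clearerr(self):
        self._pending = False
        self._errno = 0


def good_write(text):
    return len(text)

def failing_write(text):
    raise OSError(5, "simulated I/O failure")

es = ErrState()
es.errhalt("off")
es.run(good_write, "a")
es.run(failing_write, "b")
es.run(good_write, "c")
if es.err():
    print("An error occurred somewhere above, number:", es.errno())
```

Note that the single err() check at the end catches the failure in the middle statement, which is the grouped-checking benefit claimed for this option.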

Option 6 - Optional Parameter Indicator

This would use a non-positional parameter (see parameter options) to optionally "sample" the error status. Example:

 
   openfile "stuff.dat" _handle h  _read  _errto errstat 
   if errstat != 0 
     show "Error on open, Number: " + errstat 
   endif 
   // Example 2 -- No error clause: 
   openfile "stuff.dat" _handle h  _read

The first example would put an error status value into the variable "errstat". If the program does not "sample" the error status, such as in example 2, then a run-time halt is generated if there is an error. Note that it is the existence of the _errto clause that determines if a halt is generated, NOT the existence of an "if" statement to evaluate it. In fact, the "if" statement is not required to be there. An _errto clause without an if statement is a way to ignore an error.

Pro - Gives the programmer the option of handling the error, letting the interpreter handle it, or ignoring the error.
Con - Tends to add more parameters. Designed mostly for languages that support named parameters. Does not provide easy block (grouped) checking.

Option 7 - Optional Parameter Indicator Variation

   sub myRoutine
     var locmark
     on error handlerA   // set up handler routine
     locmark = 1
     foo
     foo
     foo _errorto x
     if x <> 0
       message "Error: " & errText()
       return
     endif
     locmark = 3
     bar
     bar _errorto x
     if x <> 0
       message "Error: " & errText()
       return
     endif
     bar
   endsub
   sub handlerA inher  // see scoping about 'inher'
     if locmark = 1
       message "Error during breakfast: " & errText()
       return true   // go back to just after the error
     endif
     if locmark = 3
       message "Error during lunch: " & errText()
       return false   // don't go back (exit MyRoutine)
     endif
   endsub

This is a more complex variation of option 6. This sets up a default handler routine called "handlerA" in this case. If there is a run-time error, then handlerA is called unless the problem statement has an _errorto clause. There should also be a way to turn the error handling off (causing program aborts). Perhaps "on error abort" or something.

"Locmark" is an ordinary variable that demonstrates how some block detection can be done with this method. (An old VB trick I picked up).

Some may prefer the error handling section to somehow be at the bottom of the routine instead of a separate routine, similar to VB's approach (except a little more modern). I will leave the details of such syntax up to you.

Pro - Provides routine-level and command-level error handling within the same routine. Does not require as much switching between error modes as the check/clear/off method. Does not require Java-like block structures.
Con - Tends to add more parameters. Designed mostly for languages that support named parameters (although positional parameter languages could support such a clause). Block-level (group) checking has to be implemented manually (but is not that hard).

Favorite

I will not vote on this one yet. I am holding out hope that a better approach will be found. However, I will suggest that Java's approach is anti-scripting in philosophy. It is best left to the more pedantic languages.

Auto Error Logging

A nice feature of a language would be the (optional) automatic logging of all run-time errors to a log file (in addition to console messages). The log could contain the error message, error number (if any), function name, line number, date, time, and perhaps others such as the function stack frame, etc.

This would be an addition to, not a replacement for the above methods. Perhaps only run-time halts should go to the log, not handled errors. There should also be a way to optionally pass an error or message onto the log from the handler(s).
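As a rough illustration of automatic logging, the following Python sketch routes every unhandled error to a log file while still printing the usual console message. It relies on the standard sys.excepthook, logging, and traceback facilities; the function name install_error_log, the logger name "autoerr", and the log line layout are assumptions for the example.

```python
import logging
import sys
import traceback

def install_error_log(path):
    """Log every unhandled exception to 'path', then show it on the console."""
    logger = logging.getLogger("autoerr")
    logger.setLevel(logging.ERROR)
    handler = logging.FileHandler(path)
    # asctime supplies the date and time for each entry
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)

    def hook(exc_type, exc, tb):
        frames = traceback.extract_tb(tb)
        where = frames[-1] if frames else None   # innermost frame
        func = where.name if where else "?"
        line = where.lineno if where else 0
        logger.error("%s: %s (function %s, line %d)",
                     exc_type.__name__, exc, func, line)
        sys.__excepthook__(exc_type, exc, tb)    # usual console message

    sys.excepthook = hook
    return logger
```

Only run-time halts reach the hook, so handled errors stay out of the log by default. A handler that wants to pass a message on can call error() on the returned logger.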

Capitalization Handling

There are two issues to case handling. One is whether to have the machine ignore case when identifying variables and tokens, and the other is ignoring case when comparing strings.

I think it is in the spirit of scripting that the machine should generally ignore case, provided that case-sensitive string comparing functions are available for the rare times where case matters.

I can understand why the more formal languages might want to default to sensitivity, but it does not belong in scripting languages.

Dealing with file names gets a bit trickier though. Generally the interpreter should follow the conventions of the host OS.

The Leaky Assignment Controversy

A sticky controversy in languages is whether or not to allow the mixing of assignment statements with control statements and others. We call them "leaky assignment statements" because they leak or pass on the assigned value to other constructs.

The most common use for leaky statements is in While statements. Here is a common example that reads and echoes the contents of a text file:

 
   handle = fileopen("afile.dat") 
   while( line = readline(handle)) { 
       print(line) 
   }

It is assumed that a null value evaluates to False within the While statement. (Whether this is good or not will not be debated here.)

In the example, the assignment to "line" also echoes that value to the While statement. If this echoing (leaking) were not allowed, we may have to write something like this:

 
   handle = fileopen("afile.dat") 
   line = readline(handle) 
   while( line != null ) { 
       print(line) 
       line = readline(handle)   // note this! 
   }

This forces us to write the "readline" statement twice. This is not ideal because it is extra coding and because we may change the first readline but forget to change the second one, especially in a longer loop.

In this particular case we can solve the problem by adding an EOF() (end-of-file) function. However, we have observed that this "repeat checker" problem pops up fairly often in loops, not just file I/O. I cringe every time I see it in a language book or manual, which is fairly often.

We also do not like the idea of the assignment statement returning a value; assignment statements should be by themselves. If it were not for this Repeat Checker problem, there would be little reason to allow leaky assignment statements. Their potential for confusion and abuse is too great. For example:

 
   result = (a = b)

Is this two assignments, or one assignment and one boolean statement? Using "==" (2 equals) for booleans does not solve the problem, as described later. Some languages use a colon instead of an equal sign for assignments. Example:

 
   result : 8 
   result : (a = b)    // assignment of a boolean value

However, the equal sign for assignments is perhaps too familiar to do away with. The colon also requires the pressing of the shift key, unlike the equal sign. (Since it is used so often, we are looking at keystrokes in this case.)

Thus, if we can find a way to fix the Repeat Checker problem, we can get rid of a legitimate reason to allow leaky assignments. Here is one solution:

 
   prewhile { 
      A 
   } goif X { 
      B 
   } 
  
   // Applied Example: 
   handle = fileopen("afile.dat") 
   prewhile { 
      line = readline(handle) 
   } goif line != null { 
      print(line) 
   }

It is sort of like an if-else crossbred with While loops. In the first construct above, all relevant statements in position "A" are executed at the top of each iteration. Expression "X" is then evaluated. If expression "X" is true, then all relevant statements in position "B" are executed, otherwise the loop is terminated (not returning to "A", "X", nor "B" any more).

It is basically no different from existing While loops, except that there is no pressure to cram as many operations as possible into one statement. You are now given plenty of statements to calculate the looping criteria.

Although a naming contest should be held to find better names for "prewhile" and "goif", the concept is very useful. It is not a very compact structure, but it beats allowing some of the notorious UNIX-influenced spaghetti-one-liner assignment statements.

Note that some Repeat Checker loops require too many statements for even leaky assignments to handle. Our new construct can handle these as well.
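The prewhile/goif shape can be imitated in an existing language with a small helper. Below is a Python sketch, assuming a helper named prewhile that takes the "A" step and the "X" test as functions; the caller's loop body plays the part of "B". Internally the helper still uses an ordinary loop with an early exit, so it only hides the mechanics rather than removing them.

```python
import io

def prewhile(step, goif):
    """Yield each value from step() for as long as goif(value) is true."""
    while True:
        value = step()          # position "A": runs at the top of each pass
        if not goif(value):     # expression "X"
            return
        yield value             # the caller's loop body is position "B"

handle = io.StringIO("first\nsecond\nthird\n")   # stands in for an open file
for line in prewhile(handle.readline, lambda ln: ln != ""):
    print(line.rstrip("\n"))
```

The readline call appears only once, and no assignment has to leak its value into the loop test. (Python's readline returns an empty string at end of file, which is why the test compares against "".)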

At least consider it.

A secondary advantage of eliminating leaky assignments is that the programmer does not have to alternate between a single equal sign and double equal signs (= and ==). This would eliminate a constant source of typos and bugs for beginners and multiple language users.

An astute reader pointed out that a simple break statement may accomplish the same thing. Example:

 
   while true 
      x = stuff() 
      y = morestuff(x) 
      if not y 
         break   // exit loop 
      endif 
      normal_loop_stuff() 
   endwhile

However, breaks are kin to ill-reputed Goto's, and are not significantly less code than my proposed solution. But, they are common in existing languages. It is a choice between the dirty, familiar past or a cleaner, but unfamiliar future.

Another possible use of leaky assignments is simplifying initialization:

 
   foo = bar = them = those = 0

This assigns zero to all the listed variables. However, there are other ways to do this without leaky assignments:

 
   Store 0 to foo, bar, them, those    // XBase approach 
   // or 
   init(0, foo, bar, them, those)

The Init() function keeps with the function rule, but may be tricky to implement because of the variable quantity of parameters. However, just because it is tough to build using the language does not mean that the interpreter/compiler cannot do it.

The point is, we still have plenty of alternatives to leaky assignments, which are too abusable to release to the general public in my opinion.

Variable Scope

The two basic issues with variable scope are subroutine related scope rules and individual variable modifiers that allow overriding the subroutine related scope rules.

Subroutine Influence

First we will discuss how and if a subroutine (or function) inherits its parent's (caller's) variables and variable scope, aside from parameters and global variables. Here is an example of scope inheritance:

 
   x = 5 
   Aroutine() 
   ... 
   sub Aroutine() { 
      print x 
   }

If scope inheritance is active, then Aroutine will print "5" because it inherits "x" from its caller routine (which may be the main routine). Scripting languages tend to be a bit looser with regard to this than other languages.

Most languages fit into one of three categories. First, some languages inherit scope automatically. Pascal is a (non-scripting) language that allows scope to be inherited only if a subroutine is nested within another. We will put Pascal in a second category that requires explicit coding to inherit scope, or at least gives both options. C will be in the third category of not allowing any inheriting except via global (or per-file) variables and parameters.

In the hands of a sloppy programmer (or bad luck) automatic inheritance can result in hard-to-maintain or hard-to-debug code. At the other extreme, option 3 can make subroutines hard to split up when they grow beyond original expectations and tend to encourage overuse of globals. Therefore, we are proposing the middle option.

However, Pascal provides a cumbersome way to control variable scope inheritance. Here is an example of an alternative:

   Sub myroutine(x, y, z) child of aroutine 
      // stuff goes here 
   EndSub 
 
   // Example 2: 
   Sub myroutine(x, y, z) child of * 
      // stuff goes here 
   EndSub

The first example lets "myroutine" inherit the variable scope of its designated parent (calling) routine, known as "aroutine" in this case.

The second example lets "myroutine" inherit the scope of any and all subroutines that call it. The asterisk acts as a wild-card.

If you find this a little too formal for a scripting language, you could use simpler keywords such as "inherit" or "noinher". "Inherit" says that a routine inherits the scope of its caller routine(s), while "noinher" prevents inheritance. "Isolate" could perhaps be used instead of "noinher". Example:

 
   m = 5 
   arout "foo" 
   arout2 "foo" 
   ... 
   sub arout(x) 
      // I see m 
   endsub 
   sub arout2(x) isolate 
      // I DON'T see m 
   endsub

Note that this approach would not determine how child routines see the scope. It would be up to the children to "isolate" themselves if needed. Also note that if a statement like m = 2 was put inside of both arout() and arout2(), arout() would modify the original m to 2, whereas arout2() would not. In arout2(), m would be a regular local variable that would "disappear" when arout2() was done, leaving the original m untouched with 5.

Also note that "Isolate" has no influence on parameters nor globals (described later), only on regular routine-level variables created in parent and ancestor routines. (If globals and statics can be declared in routines, then Isolate would not influence these either; they would still be visible. Isolate only targets the "call stack" variables.)
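The difference between an inheriting and an isolated routine can be modeled with a chain of variable tables. This Python sketch is a toy model only: the Scope class and its get/set methods are invented for the illustration, and parameters and globals are left out.

```python
class Scope:
    """Variables of one routine call, optionally linked to the caller's."""

    def __init__(self, caller=None, isolate=False):
        self.vars = {}
        self.caller = None if isolate else caller

    def _owner(self, name):
        # Walk up the call chain looking for the scope that holds 'name'.
        scope = self
        while scope is not None:
            if name in scope.vars:
                return scope
            scope = scope.caller
        return None

    def get(self, name):
        owner = self._owner(name)
        if owner is None:
            raise NameError(name)
        return owner.vars[name]

    def set(self, name, value):
        # Assign to the inherited variable if one is visible, else make a local.
        owner = self._owner(name) or self
        owner.vars[name] = value

main = Scope()
main.set("m", 5)

arout = Scope(caller=main)                  # inherits: sees and changes m
arout.set("m", 2)

arout2 = Scope(caller=main, isolate=True)   # isolated: m here is a new local
arout2.set("m", 99)

print(main.get("m"))   # the caller's m after both calls
```

The assignment inside the inheriting scope lands on the caller's m, while the isolated scope only ever touches its own local m.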

Choosing between "inherit" and "noinher" may depend on whether scope inheritance is the default or non-inheritance is the default. We will be satisfied as long as both options are easily made available. Perhaps "inherit" would be "safer" from a software engineering standpoint because it would require an explicit request to inherit scope. (XBase uses scope inheritance by default).

Some people suggest using packaging or groups of routines to create a sort of "neighborhood globals", or "regional" scope. However, this approach still requires moving local variables out of the routine and into this semi-global (or regional) location. Often a routine will grow a bit unwieldy or suddenly gain the need to access a given portion of code from two or more different points within the routine. The easiest solution is to have these "sprouts" be able to inherit the parent's scope. This eliminates the need to physically move variables to "higher ground" (regional or global) in order to be shared by both.

I don't see how this poses a "safety problem" as long as the relationship of the sprout(s) to the parent is clear. However, lack of a one-to-one relationship is common and generally accepted in scripting. Perhaps a language can have both "quicky" inheritance using something like the "inher" option, but also have a "child of" option for a tight relationship that specifies explicitly which routine(s) it inherits from.

Some languages allow access to non-local variables via a scope modifier. For example "caller::x" or "caller$x" refers to variable x of the calling routine. This is perfectly fine as an option, but it should not replace scope inheritance. Otherwise, routines cannot be split up when they grow larger without adding a bunch of scope modifiers.

Some languages, like PHP, require a "Global" scope modifier in order to access regional variables. ("Global" is a misnomer since they are not really global.) However, this suffers from the split problem mentioned above. The chosen approach should not make splitting routines difficult.

Individual Variable Scope

Many languages supply modifier keywords that allow specific variables to override the normal scope they would fall under. The most common is Global (sometimes called Public). Some languages allow it to be applied inside of routines, while others require that it be applied only outside of or before any routines are defined.

Then there is Private, which allows a variable to have the same name as a variable in a parent routine without interfering with the parent variable. Private is not needed nearly as much if some form of the "noinher" option discussed above is available.

Related to Private is Local. Local is similar to Private, except that child routines also cannot inherit its scope (in addition to the parent).

(Note that different languages use different names for these concepts. The names we use here are only to give labels for discussion purposes, not serve as a final suggestion.)

Finally, there is the Static type. It allows the variable to keep its value between subroutine calls. Normally, variables are reinitialized for each call to a routine. Static simply overrides this behavior. It is very useful for building add-on packages.

It can be argued that most of these modifiers are more than a scripting language needs. If the routine-level modifiers discussed above are available, then Private and Local may be somewhat redundant. Globals, however, are a near must for any scripting language. Globals can even serve as a "dirty" substitute for Static. (Globals have a greater risk of unintended alteration by a distant routine.)

Favorites

I vote for a subroutine scope inheritance toggle option, such as "inher", and the Global modifier. Static variables would be a nice feature, but hardly mandatory. Local and Private variables are probably not both really needed if we get our "inher" or "isolate" subroutine options, so we suggest choosing one or the other unless you are building a higher-end scripting language. In my opinion, Private is a better choice than Local if you must choose between them. (The reasons are too lengthy to present here.)

Null Values and Zero Divides

I never was fond of null values. To me they are anti-scripting and meant for the more formal anal-retentive languages. This is because they tend to force one into checking for them before using a value. Example:

 
   if x != null    // typical annoyance check 
      return x 
   else 
      return "" 
   endif 
   // variation: 
   return iif(x != null, x, "")     // shows the useful iif function

If you are forced to check, you might as well check to prevent a null rather than after it is generated. (This gets into the sticky area of where nulls actually originate from in a program.) Further, they do not import or export very well between spreadsheets and database tables. If you really think that nulls earn their keep, then perhaps a compromise can be worked out. Nulls could be tested for, but otherwise return legitimate values such as blanks or zero. Thus, a function like isnull() can be used if one wants to deal with nulls, but no operation will trigger "Null Error" exceptions.

Also, null strings are even more problematic than null numbers. Or perhaps said another way, nulls have a little more meaning with numbers than with strings. Thus, if one feels that null numbers are computationally useful, then at least get rid of null strings. A blank, zero length string is sufficient.

Null is more of an attribute than a value under one plan. If we were OOP fans, we might have x.value and x.null as separate attributes and/or methods. Setting x.null = true would (perhaps) also set the value to an empty string. x.value = null would not be valid because again nullness is an attribute, not a value. However, the syntax is simpler if we simply have an IsNull() function to check nullness instead of using OOP constructs. Here is an example:

 
   a = space(0)    // same as "" 
   b = null        // also assigns "" 
   show "lengths: " & len(a) & ", " & len(b) & ", " & len(a & b) 
   show 'nullcheck: ' & isnull(a) & ", " & isnull(b) 
   show 'equiv: ' & (a = b) 
   // end of code 
  
   Output Results: 
   lengths: 0, 0, 0 
   nullcheck: false, true 
   equiv: true

Note how both a and b are empty strings; however, only b is null. If this seems a little odd to you, then please consider the messiness and confusion of the alternatives. My suggestion simply isolates nullness from the value so that a programmer does not have to deal with the paradoxes of null values if they do not wish to. Consider this silliness of some languages:

 
   x = null 
   if x > 34  { 
      show "greater than 34" 
   } 
   if x <= 34  { 
      show "less than or equal to 34" 
   }

Neither of these Show statements would execute under some languages. To me this is a bit silly.

There are further issues to consider in deciding how, if, and when to carry nullness to a result. Example:

 
   a = null 
   y = 3 + a

What will y be? 3? null? zero? 3 and null? zero and null? Rather than give my favorite answer, I will point out that treating and/or storing nullness separate from the value does allow more possibilities than traditional null handling. The attribute approach can emulate the old way if needed, plus handle the "new" way and many combinations. In other words, the new (attribute) way can potentially act either old or new, but the old way cannot act new.
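Here is a minimal Python sketch of the attribute approach, to show that the value and the nullness flag can travel separately. The class name Val, the helper add(), and the carry_null switch are assumptions made up for the example. The switch simply demonstrates two of the possible answers to "what will y be?" without endorsing either.

```python
class Val:
    """A value that carries nullness as a separate flag."""

    def __init__(self, value, null=False):
        self.value = value
        self.null = null

def null_string():
    return Val("", null=True)     # a null string still holds ""

def isnull(v):
    return v.null

def add(x, y, carry_null=True):
    # The sum always uses the stored values; only the flag policy varies.
    return Val(x.value + y.value, null=carry_null and (x.null or y.null))

a = Val("")
b = null_string()
print(len(a.value), len(b.value), isnull(a), isnull(b), a.value == b.value)

null_number = Val(0, null=True)
y_flagged = add(Val(3), null_number)                   # 3, flagged as null
y_plain = add(Val(3), null_number, carry_null=False)   # plain 3
print(y_flagged.value, y_flagged.null, y_plain.value, y_plain.null)
```

A caller who wants the traditional behavior can treat any flagged result as null, while a caller who does not care about nulls can just use the value.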

A related and heated topic is division by zero. Should the result halt in error, be null, or be zero? Perhaps this and null selections can be a compiler/interpreter option so that the programmer can choose instead of the language builder. (Debates about these get almost as personal and heated as debates about semicolons.)

String Functions

Many Perl fans argue that much of Perl's potentially cryptic syntax is necessary to provide Perl's power. However, we disagree. With enough well-chosen string and parsing functions, the power of Perl can be approached without sacrificing readability. Even Unix- and Perl-like regular expressions can be used within normal functions. (Although in my opinion, complex regular expressions should be broken up into separate statements for separate steps if possible. Some programmers like to use tricky, multi-operation regular expressions as a macho nerd litmus test. Thus, the expression compactness serves a social purpose instead of a business purpose.)

First, get rid of "hidden linkers" such as the "@_" operator. Their harm to readability is greater than their contribution to rapid development. Second, provide a rich set of string and parsing functions. Examples:

 
   replace("ab123ab456", "ab", "xy") // result: "xy123xy456" 
   list = split(",", "123,456,789")  // similar to Perl's Split 
   astring = combine(",", list)   // opposite of Split 
   trim("  abc  ")   // result: "abc" 
   at("quick brown fox", "brown")  // result: 7 (position 7), zero = not found 
   rat("ababababa","b")  // result: 8; starts search from right 
   stuff("the low level",5,3,"high")  // replace "low" with "high" 
   empty("   ")   // result: true; returns true if blank or white spaces 
   format(123.12, "######.##")  // result: "   123.12"

This is just a sample of string functions. Perl-like regular expressions can be implemented in some functions. Some of the above functions could also have optional parameters for options such as case sensitivity.
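Several of these functions are only a few lines each when built on top of an existing string library. The Python sketch below implements four of the less common ones with the 1-based positions used in the samples above; the parameter names are my own.

```python
def at(text, target):
    """1-based position of the first match, or 0 if not found."""
    return text.find(target) + 1

def rat(text, target):
    """1-based position of the last match, or 0 if not found."""
    return text.rfind(target) + 1

def stuff(text, start, length, replacement):
    """Replace 'length' characters beginning at 1-based 'start'."""
    return text[:start - 1] + replacement + text[start - 1 + length:]

def empty(text):
    """True if the string is blank or only white space."""
    return text.strip() == ""

print(at("quick brown fox", "brown"))
print(rat("ababababa", "b"))
print(stuff("the low level", 5, 3, "high"))
print(empty("   "))
```

The others in the list map closely onto Python built-ins: str.replace, str.split, str.join, str.strip, and format(123.12, "9.2f") for the fixed-width number.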

See the Macro section for some more examples of string parsing functions.

Here is one function that I wish to promote:

 
   AppendLine("filex.txt", astring)

This would open the given file, write the given line to the end of a file, and then close the file. It is one-stop shopping. It is ideal for debugging, and could be used for log and trace files. Note that a file handle does not have to be tracked. False is returned if there is an I/O error.
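A function like this is small enough to sketch in full. Here is one possible Python version, assuming UTF-8 text and a newline terminator; the name append_line is just a Python-style spelling of AppendLine.

```python
def append_line(path, text):
    """Open the file, append one line, close it. Returns False on an I/O error."""
    try:
        with open(path, "a", encoding="utf-8") as f:
            f.write(text + "\n")
        return True
    except OSError:
        return False
```

A debugging call is then a one-liner, such as append_line("trace.txt", "reached step 3"), with no handle to keep track of.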

Note that some of our example function names are somewhat long by scripting standards. This is primarily to make the examples easier to follow. However, it is perfectly understandable if a scripter wants shorter names, like AppendLn() instead of AppendLine(), and so forth. Keep in mind, however, that similar names like "substring" and "substitute" can cause confusion if not abbreviated carefully. (In this case, perhaps "replace" is a better choice than "substitute".) Generally, commonly-used functions should be shorter than the more obscure ones. (That means the language builder has to do some usage guessing.)

Associative Arrays Versus Tables

Associative arrays in Perl are indeed a nice feature to have in a language. However, they strike me as being a subset of table operations. Tables have much more flexibility than associative arrays. For example, they automatically can have multiple keys (indexes), non-unique keys, automatic persistence, record locking, filtered views, and many other features commonly found in relational and SQL-based systems. One does not have to rewrite the code when their collection (array) graduates from being simple to complex.

I rarely used arrays in XBase languages because tables were much more natural and convenient in XBase. XBase has a lot of flaws, but it provides hints of the power and convenience of tables. Unfortunately, table-friendly languages are rarely the focus of scripting organizations and research. It is an area where much language research and improvements can be done. Many programmers end up using arrays, files, and streams in a cumbersome way to do table-oriented operations when what is really needed is table-orientation. Arrays are a poor substitute for table operations!

Here is a hint of table-power using SQL-influenced syntax:

 
   Directory "../foo/*.txt" _alias flist    // put dir list in a table 
   default _alias flist           // pick default table handle to simplify syntax 
   list "*" _orderby "fdate:d"    // list by date order, descending 
   list "*" _orderby "fname"      // list by file name order 
   list "*" _where "fexten = 'dat'"    // list all .dat files 
   list "fname" _tofile "names.txt" _orderby "fname"   // list names to a file

   // Another syntactical variation:

   d = directory("../foo/*.txt")
   list(d, "*", #orderby "fdate:d")
   list(d, "*", #orderby "fname")
   list(d, "*", #where "fexten = 'dat'")
   list(d, "fname", #tofile "names.txt")

A very table-oriented language may not need quotes around many of these parameters.
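Something close to the second variation can be approximated today with an embedded SQL engine. The Python sketch below loads a directory listing into an in-memory SQLite table and exposes a small list_rows() helper. The table layout (fname, fexten, fdate) follows the samples above, while the helper names and the decision to pass the where and orderby clauses through as trusted SQL text are assumptions for the sake of brevity.

```python
import glob
import os
import sqlite3
import time

def directory(pattern):
    """Load the files matching 'pattern' into an in-memory table named flist."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE flist (fname TEXT, fexten TEXT, fdate TEXT)")
    for path in glob.glob(pattern):
        if os.path.isfile(path):
            name = os.path.basename(path)
            exten = os.path.splitext(name)[1].lstrip(".")
            stamp = time.strftime("%Y-%m-%d %H:%M:%S",
                                  time.localtime(os.path.getmtime(path)))
            db.execute("INSERT INTO flist VALUES (?, ?, ?)", (name, exten, stamp))
    return db

def list_rows(db, columns="*", where=None, orderby=None):
    """Tiny stand-in for the 'list' command; clauses are trusted SQL text."""
    sql = "SELECT " + columns + " FROM flist"
    if where:
        sql += " WHERE " + where
    if orderby:
        sql += " ORDER BY " + orderby
    return db.execute(sql).fetchall()
```

With these in place, d = directory("../foo/*") followed by list_rows(d, orderby="fdate DESC") or list_rows(d, "fname", where="fexten = 'dat'") mirrors the listing commands shown above.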

Tables could also assist with things like string parsing. For example, a substring search could produce the following table structure for each match:

Position - Starting position of the match.
End_Pos - Ending position of the match
Before_Char - Character just before the match (if any).
After_Char - Character just after the match (if any).
Case_Diff - Number of characters with a different case than the template.
Etc.

Tables can make it easy to reference multiple pieces of information about multiple matches and other multiple-entity operations.
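As a sketch of what such a match table could look like, the Python function below returns one row per match with the columns listed above. It uses a list of dictionaries as a stand-in for a real table, matches case-insensitively so that Case_Diff has something to count, and reports 1-based positions; all of these are choices made for the illustration.

```python
import re

def find_matches(text, template):
    """Return one row (dict) per case-insensitive match of 'template'."""
    rows = []
    for m in re.finditer(re.escape(template), text, re.IGNORECASE):
        start, end = m.start(), m.end()
        found = m.group()
        rows.append({
            "position": start + 1,       # 1-based start of the match
            "end_pos": end,              # 1-based end of the match
            "before_char": text[start - 1] if start > 0 else "",
            "after_char": text[end] if end < len(text) else "",
            # characters whose case differs from the template
            "case_diff": sum(1 for a, b in zip(found, template) if a != b),
        })
    return rows

for row in find_matches("Fox, fox and FOX.", "fox"):
    print(row)
```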

Note that the SQL APIs used by many languages to talk to database systems are a bit cumbersome for those used to scripting. It does not have to be this way. SQL was not originally designed for scripting, but could possibly be altered (or translated) a bit to be less verbose and cumbersome. Tables can be convenient and light-on-their-feet if done in the right spirit. I will now step off of my table soapbox.

Associative arrays using "dot" syntax (in addition to square brackets) can actually be quite handy for use as interface mechanisms, such as holding an individual data record. However, using them as an alternative to multi-record databases/tables often backfires in my opinion as projects change and scale.

Using arrays of arrays in languages like Perl makes the interface to the collection too tied to the implementation. If you later want to use a linked list or static arrays or a database, then your collection calls may all have to be changed. A table-based interface (API) is more flexible in this respect. True, a "lite" engine may not provide all the features you need, but upgrading will not require rewriting the code to handle the more powerful collection engine. (Even if you don't like databases, putting collection manipulation operations behind an API can improve change-friendliness over "raw" access.)

Also see the Array section and OOP notes for alternatives or additions to associative arrays.

Packages and Naming Conflicts

Many languages offer ways to link in or reference groups of other functions. We will call these groups "packages". Other packages are often referenced with commands such as "include", "use", "attach", "tie", etc.

One of the problems introduced by this approach is naming conflicts between same-named variables and/or functions. Often a prefix is used to distinguish names. Example:

 
   use "package1.prg" as pk1 
   pk1::varx = 7 
   foo = pk1::bar()

This syntax can get a bit cumbersome. Therefore, we propose that a package reference not be needed unless there is a conflict between names. In our example, unless there is a local bar() function, there would be no requirement to include the "pk1::" prefix. (Having it there anyhow may be considered good documenting by some, but excess "path coupling" by others.)

Note that a "sys::" prefix could be used to distinguish between user-defined functions and built-in functions. Thus, if you define your own "len" (length) function, then all references to the built-in function of the same name would have to be like "sys::len(x)". This may seem cumbersome, but it is a good reason to avoid using reserved words.

The two colons are only one possible symbol option. Some may suggest a period, although I prefer to reserve the period for possible dictionary arrays and/or objects. A dollar sign is another candidate.

Subroutine Hunts

Some languages have a handy feature: if a subroutine is not found in the current file, the interpreter looks for a file with the same name as the routine and checks whether the routine is in that file. For example, if a call to "foo(x)" is made, and the routine is not found using the standard approaches, then the interpreter will look for a file called "foo.prg" in the current directory and see if that routine is in there. ("prg" is an example language extension, but would otherwise fit the language.)

The search priority can be 1) current file 2) referenced modules/files, 3) any file with name of function. For #2 (referenced files), the search order is usually the order in which they are declared, but at times it would be nice to override this with an optional search order:

  module("foo.prg", 5)
  blahblah(x)
  module("bar.prg", 4)

Here, module/file "bar" is searched before "foo" if "blahblah" is not local. Note that unlike an "include" operation used in some languages, the modules are only searched if a local routine is not found.
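
The first two steps of that search priority can be sketched in Python as follows. The function name find_routine and the (name, routines, order) tuple layout are assumptions for illustration; step 3 (hunting for a file named after the routine) is left out, and a lower order number is taken to mean "searched earlier", as in the example above.

```python
def find_routine(name, local_routines, modules):
    """Resolve a routine: the local file first, then modules by search order.

    modules is a list of (module_name, routines_dict, order) tuples.
    sorted() is stable, so equal order numbers keep declaration order.
    """
    if name in local_routines:
        return "(local)", local_routines[name]
    for module_name, routines, _order in sorted(modules, key=lambda m: m[2]):
        if name in routines:
            return module_name, routines[name]
    raise LookupError(f"routine {name!r} not found")

# Both modules define blahblah; bar.prg has the lower order number.
modules = [
    ("foo.prg", {"blahblah": lambda x: f"foo:{x}"}, 5),
    ("bar.prg", {"blahblah": lambda x: f"bar:{x}"}, 4),
]
owner, routine = find_routine("blahblah", {}, modules)
```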

Variable Scope and Modules

I recommend that variables not be searched for in other modules unless they have explicit scope indicators (ex: "module7::myvar", see above), or perhaps the language could define a wild-card ("*::myvar") if a search is desired. (Global variables are a separate issue.)

Reducing Subroutine Scope

In a given module, sometimes you don't want a routine to have scope outside of its containing module. This can reduce name collisions. One approach is to have an explicit routine modifier that limits its scope. In other words, an indicator that excludes it from "name hunts" from other modules.

I have concluded that Pascal's nested routine approach appears to be the best solution to this. A nested routine is not going to be part of an outside name-hunt. If we want wider scope, we simply un-nest it. Nesting avoids extra modifier keywords/operations, and also solves variable scope issues at the same time, killing two birds with one stone. It just takes a while to get used to. But I have not tested the Pascal approach in really large applications, so I am suggesting it with caution. A keyword approach may still offer more dynamism and meta-abilities, even though it is messier. (Pascal traditionally requires routines be defined in the order used, but we don't have to keep that convention.)

Aliasing Routines

One way to deal with naming conflicts between subroutines is to alias a routine that conflicts with another:

    alias aPackage::myfunc1 as anotherfunc

    public alias aPackage::myfunc1 as anotherfunc  // global version

This could also be done by simply having anotherfunc call myfunc1; however, matching parameters can be a maintenance headache using such an approach.

Implied Declarations

A feature unique to scripting languages is the instant or implied formation of a new variable. One can suddenly say "x = 5" regardless of whether "x" was ever declared.

Although this is a nice feature, there are times and application types where one may want to limit this. Visual Basic, for example, provides an optional "Option Explicit" designator in the case that a programmer wants variable declarations to be mandatory.

However, one of the reasons Visual Basic needs this is that references are also given implied declarations, not just assignments. Let's look at the example assignment of "x = 3 + y". If "y" has not been assigned or declared, Visual Basic assumes some default value (something like null, blank, or zero).

In other languages, such as XBase and Python, only assignments can provide implied declarations. Any referenced variable must have been declared or assigned previously, otherwise a run-time error is triggered. I consider this more logical since there is no sense in reading an unassigned variable. The only possible reason I can think of for accepting undeclared references is to reduce run-time halts at the expense of having garbage output. Otherwise, I don't know what Visual Basic's reasoning is for that rather odd approach.

   Allowed under Python-style:

   y = 7
   x = 3 + y       // x is a new var

   Not allowed under Python, but okay in VB:

   x = 3 + y      // x and y are new vars

Thus, there are 4 options:

1. Assignments can declare and new references allowed. (both x and y can be new in x = y)
2. Assignments can declare, but new references not allowed. (x can be new but not y in x = y)
3. All variables must be declared.
4. Option Switch (selection of the above)

Favored - I prefer the option switch (#4), with #2 the runner up. #1 is totally out of the game and only mentioned because some languages actually use it.
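
Python itself behaves like option #2, which makes for a quick demonstration. In this small sketch (the function and variable names are made up), the assignment to x would create a new variable, but reading the never-assigned name on the right side triggers a run-time error instead of a silent default.

```python
def reads_before_assignment():
    """Show that reading an unassigned name fails at run time."""
    try:
        x = 3 + undeclared_y   # undeclared_y was never assigned anywhere
    except NameError as err:
        return f"error: {err}"
    return x

outcome = reads_before_assignment()
```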

Basic Types

Most scripting languages either have very few basic types, or allow easy and/or automatic conversions between the types. The concept of "type" can even be foreign to some languages. Since there are many variations on typing, I will not attempt to list mutually exclusive options. Instead we will describe some of the issues involved.

Almost all scripting languages provide strings and numbers. Sometimes the conversion between them is automatic, and sometimes explicit.

One of the problems with implicit conversion is "dirty numbers". For example, we can make a string "123.45stuff876". If we do math on it, it may be interpreted as 123.45. To reduce the chance of dirty numbers, there should be a tonum() function to convert or clean a number to its purest form. Perhaps also have a numcheck() function that returns zero if the number is clean, or the position of the first unrecognized character (1 is the first position).

Note that the Tonum or any conversion function should probably not generate a run-time error if used on its native type. For example, Tonum(5) and ToString("yep") should be perfectly legal under weak typing.
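
Here is one way tonum() and numcheck() might behave, sketched in Python. The exact number grammar (optional sign, digits, optional decimal point) and the choice to return zero when no leading number exists are assumptions, not settled design.

```python
import re

# Optional sign, then digits with an optional fraction, or a bare fraction.
_NUMBER = re.compile(r"[+-]?(?:\d+(?:\.\d*)?|\.\d+)")

def numcheck(value):
    """Return 0 if value is a clean number, otherwise the 1-based position
    of the first unrecognized character."""
    text = str(value)
    match = _NUMBER.match(text)
    consumed = match.end() if match else 0
    return 0 if match and consumed == len(text) else consumed + 1

def tonum(value):
    """Clean a possibly dirty number; real numbers pass through untouched."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return value                      # native type: no error, no change
    match = _NUMBER.match(str(value))
    return float(match.group()) if match else 0.0   # assumed default: zero
```

With these definitions, tonum("123.45stuff876") gives 123.45 and numcheck("123.45stuff876") gives 7, the position of the "s".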

If the distinction between numbers and strings is weak, then some operators like the plus sign (+) cannot be used for both concatenation and math addition. A different operator will have to be used for concatenation. Visual Basic started using the ampersand (&) for string concatenation instead of plus when Microsoft loosened VB's typing. Other languages use a period for concatenation. One thing that bothers me about the period is that it can be mistaken for an OOP or "field" separator sometimes. Others have disagreed.

It may make the conversions a little bit cleaner and/or better documented to require conversions from numbers to strings be automatic (or transparent), but not the other way around. Example:

  print anum & astring & another_num

  x = anum + toNum(astring)

A "ToString" operation is not needed because there is never a chance of a number not being convertible. (Unless you allow those darn nulls.)

A similar issue must be addressed for comparison operations. In something like if "05" > 4 it is not obvious what the result will be. Perl solves this by using a different set of comparison operations for strings and numbers. However, under ASP (VB script) one must use something resembling if num(X) > num(Y) to be sure numbers are being compared and not strings. (See the section on Comparing for more on this. Comparing and type management tend to be closely related topics.)

Some languages offer a date type. However, dates can be handled just fine as an agreed-upon internal number or string representation and plenty of conversion, date parsing, and cleaning functions.

Note that this would generally be an internal (program) representation. There should still be formatting functions for external representations for input and output that would handle most of the international formats. Thus, the internal (program) representation and the interface to external date formats are not necessarily related.

Suppose we agree that dates are represented as "yyyy/mm/dd" for all built-in date functions. Here is an example that reads two dates in "mm/dd/yy" format, and picks the largest of the two, then outputs the result in original format:

 
   date1 = readline(h)     // input: "12/31/97"  (strings) 
   date2 = readline(h)     // input: "5/15/99" 
   d1 = normdate(date1,"m/d/y")  // to "1997/12/31" (strings) 
   d2 = normdate(date2,"m/d/y")  // to "1999/05/15" 
   result = max(d1, d2) 
   result = dformat(result,"m/d/yy:2")   // "5/15/99" 
   writeln(result)

The NormDate function normalizes the date into yyyy/mm/dd format. The template "m/d/y" does not need to know how many digits each component has because it knows the slash (/) is the separator in this case. However, the number of digits is important for the Dformat function. The minimum digits is given by the number of characters for each part, but the maximum is given by the digit after the colon. In this case the minimum number of result year digits is two because there are two y's, and the maximum is also two because of the ":2".

If the century digits are not given, then the nearest year is assumed in the NormDate function.

Note that we could have used a different separator, a period, by doing:

 
   result = dformat(result,"m.d.yy:2")  // gives "5.15.99"

If any countries use colons as separators then we may have to pick a different maximizer indicator. Further, perhaps other operators besides digits after the colon could indicate things like text months. Thus, we could have a template such as "m:t.d.yy:2" where the "t" indicates text months like "jan, feb, mar" etc. This is only a suggestion; perhaps somebody has a better formatting scheme.

It would be useful to have functions like DateDiff() that indicates the number of days between two dates, DayName() that indicates the day of the week, such as "Thursday", and so forth. There is no need to have an explicit date type to perform these operations.
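
To show that plain "yyyy/mm/dd" strings are enough, here is a hedged Python sketch of NormDate, DateDiff, and DayName. The signatures, the optional "today" parameter used for the nearest-year rule, and the English day names are assumptions of this sketch; Dformat is omitted.

```python
from datetime import date

def normdate(text, template, today=None):
    """Normalize e.g. "12/31/97" with template "m/d/y" to "1997/12/31"."""
    sep = next(ch for ch in template if not ch.isalpha())
    names = [field[0] for field in template.split(sep)]   # 'm', 'd', 'y'
    parts = dict(zip(names, text.split(sep)))
    year, month, day = int(parts["y"]), int(parts["m"]), int(parts["d"])
    if len(parts["y"]) <= 2:
        # No century given: pick the candidate year nearest the reference year.
        ref = (today or date.today()).year
        base = ref - ref % 100
        year = min((base + offset + year for offset in (-100, 0, 100)),
                   key=lambda candidate: abs(candidate - ref))
    return f"{year:04d}/{month:02d}/{day:02d}"

def _to_date(normalized):
    year, month, day = (int(piece) for piece in normalized.split("/"))
    return date(year, month, day)

def datediff(later, earlier):
    """Number of days from earlier to later (both normalized strings)."""
    return (_to_date(later) - _to_date(earlier)).days

_DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
         "Friday", "Saturday", "Sunday"]

def dayname(normalized):
    return _DAYS[_to_date(normalized).weekday()]
```

Because the normalized strings are zero-padded and ordered year first, an ordinary string max() or sort puts them in date order without any date type.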

I will leave the representation of booleans up to you. The arguments for and against a dedicated boolean type seem to depend on the philosophy for the quantity and representation (below) of other dedicated types.

Internal Type Tracking

Some scripting languages have an internal flag or code that determines the type of a given variable. Others just use context. I don't like the internal flag approach. It is too "black box" from the programmer's viewpoint and can lead to misleading code. The context-based approach is more WYSIWYG. I don't know what the alleged advantages of internal type codes are. Perhaps they allow for strong typing to be later added to the language or a strong-typing option.

Visual Basic is one of the few languages that could reasonably straddle the strong-typing and dynamic-typing bridge. I am not recommending VB here in general, just pointing out a unique feature of it. Unlike, say, Java, this feature allowed different programmers to use the style that suited them or the project. I believe an internal type flag helped facilitate this. However, its "variant" type (unknown type) still allowed type-specific operations to be done on it.

Unicode

Rather than directly muck up string operations by implementing Unicode, we propose Unicode be represented as such:

 
   Unistring = "1301,802,12101,10012,3321,etc." 
   // or hex: 
   Unistring = "0fc1,1c05,0ffa,1a23,etc."

Special functions can then be supplied to handle these kinds of strings.
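
Two such special functions might look like this in Python. The names to_unistring and from_unistring and the four-digit hex padding are assumptions; the idea is simply a round trip between text and a comma-separated list of code points.

```python
def to_unistring(text, hex_codes=False):
    """Encode text as comma-separated code points, decimal or hex."""
    fmt = "{:04x}" if hex_codes else "{}"
    return ",".join(fmt.format(ord(ch)) for ch in text)

def from_unistring(unistring, hex_codes=False):
    """Decode a comma-separated code point list back into text."""
    base = 16 if hex_codes else 10
    return "".join(chr(int(code, base))
                   for code in unistring.split(",") if code)
```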

Quotes

Since HTML and SQL became common, it is my opinion that languages should allow both single quotes and double quotes as string containers. Example:

 
    x = '<TAG COLOR="#808080">' 
    x = "Select * from tb where name='HOCKENS' "

This makes it easy to have single or double quotes inside the string without excessive use of escape characters. XBase and to some extent Perl allow both types of quotes.

Many Unix-influenced languages allow variables to be inserted into strings. Example:

 
   // the common way 
   show "My name is " & name & " and I am " & age & " years old." 
   // The Unix way 
   show "My name is $name and I am $age years old."

See how much nicer the Unix way is? This would be very helpful for HTML applications. However, I think it might be safer to require the variable marker to be on both sides of the variable name. Example:

 
   show "My name is $name$ and I am $age$ years old."

There are other syntactical variations on this theme that we will not go into here because there is no clear winner among them.
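
For comparison, the two-sided marker is easy to emulate in Python with a regular expression and a dictionary of variables. The function name interpolate is an assumption; a built-in version would read the program's own variables instead of a dictionary.

```python
import re

def interpolate(template, variables):
    """Replace each $name$ marker with the string form of variables[name]."""
    def lookup(match):
        return str(variables[match.group(1)])   # KeyError if name is unknown
    return re.sub(r"\$([A-Za-z_]\w*)\$", lookup, template)

line = interpolate("My name is $name$ and I am $age$ years old.",
                   {"name": "Bob", "age": 42})
```

One benefit of requiring the marker on both sides shows up here: a lone dollar sign, as in "Costs $5.00", is left untouched.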

Object Orientation

As you may know, I am not much of an OOP fan. OOP tends to create syntax bloat and its constructs can usually be handled just fine with traditional syntax as long as the language supports static structures. (Static means that it lasts between subroutine calls. However, if statics are not available, globals can be a substitute.) In short, OOP does not belong in a scripting language.

The biggest benefit of direct OOP is that it prevents one from using the wrong type of object with an operation. This tilts toward the strict typing of non-scripting languages and is not worth the extra language constructs and syntax bloat.

Using an object-oriented approach without direct OOP is somewhat like using files and file handles. The handle serves the same purpose as an instantiated object, except that it is usually an integer (or long) instead of an object type. Here is a file operation using direct OOP and indirect OOP:

 
   // OOP 
   file fi = new file("sample.dat", READ)   // open for read 
   fi.binmode = true     // set to binary mode (see Perl 'binmode') 
   line = fi.readline() 
   fi.close 
  
   // Non-OOP 
   fi = fopen("sample.dat", READ) 
   fbinmode(fi, true) 
   line = readline(fi) 
   fclose(fi)

As you can see, the biggest difference is where "fi" goes. The OOP way provides no usage benefits over the more traditional way. Perhaps building the file class/package itself may be somewhat easier in OOP, but using it is not. Therefore, I propose that complicating the language to handle direct OOP is not worth it. It nearly doubles the complexity of the language with only a few percent increase in utility. There are much better areas to "spend complexity" on. (See the discussion about the function rule.)

Note that it may be possible to internally translate some OOP usage syntax into function call syntax so that those ingrained with the OOP way can use OOP syntax. If the interpreter saw "x.y = z", it could translate it as "y(x, z)".

If you do decide to go ahead and put OOP in your scripting language, then perhaps think about leaving out inheritance. Or, at least greatly reduce the syntax elements devoted to it. Inheritance is the most over-hyped and least useful aspect of OOP in my opinion. OOP as a component building paradigm is perhaps becoming too common (perhaps out of sheer habit) to ignore. However, this aspect of it can still be realized without syntactical inheritance for the most part.

One way to get OOP without adding many new language constructs is to use associative arrays as objects. In each "key slot" (method or attribute) goes either a value or method code. The method code could be interpreted at run-time by either an Eval()-like function, or by putting parentheses after the key:

   var x[]   // declare associative array
   x.attributeA = "foo"
   x["attributeA"] = "foo"  // same as prior (traditional syntax)
   x.methodC = "if z < 4 {zark(); lo=park()} return(lo)"
   x.parents = "zark, dark"
   ....
   print(x.methodC())      // execute method
   print(eval(x.methodC))  // same as prior

The "parents" key would tell the array to hunt the listed routine(s) or other dictionary arrays for any referenced key not defined in the current array. It is like a "search path" defined in some OS file systems. This, and perhaps parentheses for an "Eval" or "Execute" shortcut, are the only new features needed for full OOP. Languages that do this sometimes provide two different syntaxes for dictionaries: one with square brackets, and one with the dot. The square brackets allow spaces and other characters to be embedded within the dictionary key.

   myDict.keyWithoutSpaces = "foo"
   myDict["key with spaces"] = "bar"

The language could have formal class and method blocks, but these would simply be an alternative to creating objects via dictionary array syntax alone.

Note that such an approach makes no linguistical distinction between "object" and "class". This is fairly common in scripting OOP. One can perhaps define "object" as something that happens to inherit all its methods.
Also, one may want to have the parent list key be something like "~~parents" or "__parents" instead of "parents" to avoid name collisions.
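
The parent-hunting idea can be tried out with ordinary Python dictionaries. In this sketch the helper names lookup and call are assumptions, methods are stored as Python callables rather than code strings to be evaluated, and cycles in the parent lists are not guarded against.

```python
PARENTS_KEY = "__parents"   # avoids colliding with an ordinary "parents" key

def lookup(obj, key):
    """Find key in obj, then depth-first in the dictionaries it lists
    under PARENTS_KEY, much like a search path."""
    if key in obj:
        return obj[key]
    for parent in obj.get(PARENTS_KEY, []):
        try:
            return lookup(parent, key)
        except KeyError:
            continue
    raise KeyError(key)

def call(obj, key, *args):
    """Run a method slot, passing the receiving dictionary as first argument."""
    return lookup(obj, key)(obj, *args)

animal = {
    "sound": "...",
    "speak": lambda self: f"{lookup(self, 'name')} says {lookup(self, 'sound')}",
}
dog = {PARENTS_KEY: [animal], "sound": "Woof", "name": "Rex"}
spoken = call(dog, "speak")   # speak is found in animal, run against dog
```

Nothing here distinguishes a "class" from an "object": animal and dog are both just dictionaries, and dog happens to borrow its method from animal.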

Function and Subroutine Difference

Some languages make a large distinction between subroutines and functions, and others make almost none. In scripting languages we know of no reason to make a large distinction. If no value is explicitly returned, then the return value (if read from) could simply default to a blank string or zero. Example:

 
   gotback = mysub(12) 
   show 'Returned result: ' & gotback 
   show 'Length: ' & len(gotback) 
   // 
   sub mysub(thing)             // define subroutine 
      thing = thing + 1         // or thing++ 
      // notice no Return statement here 
   endsub 
  
   The output: 
 
 
   Returned result:           [blank] 
   Length: 0

Thus, there is no real syntactical distinction here between subroutines and functions. If you think a distinction is important, then simply generate an error if an attempt is made to read a value from a "sub" that has no Return statement (or did not get to it).

One syntactical issue to consider is whether or not to require parentheses around subroutine parameters. It probably would be a bad idea to omit them for function calls, especially when assignments are involved [such as x = y(z) ]; but it can be argued that parentheses are unnecessary for stand-alone subroutine calls.

Statement Communication Shortcuts

As mentioned in the criteria section, there are various methods employed by scripting languages to reduce repetition in code. We will examine methods of shortening inter- and intra- statement communication.

Perl and other languages often use default parameters and results to string (link) together commands. We will call this "command piping". However, the simplification provided by these constructs does not outweigh the readability risk that they cause. One might as well do something like this:

 
    t = somefunc(x) 
    another_op(t)     // "t" is used to pass result

This is not significantly more code than:

 
    // Perl-like implied piping 
    somefunc(x) 
    another_op()     // operates on the result of prior func.

(Perl has a default result indicator specified by "$_".) The piping approach saved only a few keystrokes. This savings is not nearly enough to compensate for the readability risk. Thus, we recommend against built-in command piping.

A similar shortcut is eliminating the need to re-reference the result variable when something is being done to the same variable. Examples:

 
    i = i + 1    // increment 
    i++          // C-like simplification 
    i =+ 1       // Another variation 
    astring = substitute(astring, "Borland", "Inprise")  // search and replace 
    astring =~ substitute("Borland", "Inprise")   // more Perl-like

The "=~" operator eliminates mentioning the variable again. The problem with these shortcuts is that they may hamper readability and violate the sacred function rule. However, altering the same variable is common enough in occurrence that some method should be provided for this. It also makes code modification safer because you only have to change the result variable in one place if you decide to change the variable being affected.

Thus, we propose a special operator that represents the assignment variable. Let's try "@". Examples:

 
    i = @ + 5    // same as i = i + 5 
    astring =  substitute(@, "Borland", "Inprise") 
    astring = @ & "append this"   // assuming & is concatenation

This approach is more informative than the "hidden" Perl approach because @ is visible and explicit. It also does not require any weird syntax because it is basically just a replacement macro to the interpreter/compiler. (Note that the standard "++" incrementer should probably be kept. The @ is really meant for more complicated statements.) I know of no language that supports this approach, so I guess that makes me the tentative inventor.

Thus, I think we found a very nice compromise between the hiding approach of Perl and readability.
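
Since "@" is basically a replacement macro, its expansion can be sketched as a tiny source-to-source pass. The Python function below (expand_at is a made-up name) rewrites one assignment statement, skipping any "@" inside quoted strings. It assumes a single simple target to the left of the first "=" and does no real parsing.

```python
import re

def expand_at(statement):
    """Replace each @ outside string literals with the assignment target."""
    target, sep, expr = statement.partition("=")
    if not sep:
        return statement              # not an assignment: leave untouched
    name = target.strip()
    # Splitting with a capturing group keeps the quoted strings at the
    # odd indexes, so only the even-indexed pieces are rewritten.
    pieces = re.split(r'("[^"]*"|\'[^\']*\')', expr)
    expanded = [piece if i % 2 else piece.replace("@", name)
                for i, piece in enumerate(pieces)]
    return target + "=" + "".join(expanded)
```

For example, expand_at('i = @ + 5') yields 'i = i + 5'.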

Run-time Macros

Some languages allow statements to be created and evaluated at run-time. Although there are a lot of names for this feature, perhaps some more appropriate than "macros", we will call them Macros for the time being. ("Macro" came from dBASE's usage). Examples:

 
   x = 1 
   y = 2 
   amac = "x + y" 
   show %amac%      // result: "3" 
   // or perhaps: 
   show eval(amac) 
   // Note that we may not know the string at compile time 
   line = readline(handle) 
   eval(line)

In the first example, amac is a string that is evaluated at run time by putting percent signs around it and/or using the Eval() function.
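
Python's built-in eval() offers the same kind of run-time evaluation, so the first example can be mirrored closely. The wrapper name run_macro is an assumption of this sketch.

```python
def run_macro(expression, variables):
    """Evaluate an expression string at run time against the given variables."""
    # Emptying __builtins__ narrows what the string can reach, but it is NOT
    # a real sandbox; untrusted input should never be passed to eval.
    return eval(expression, {"__builtins__": {}}, dict(variables))

x = 1
y = 2
amac = "x + y"
result = run_macro(amac, {"x": x, "y": y})
```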

Although I found these very useful in some languages, it was often to get around weaknesses in the language itself rather than add power. However, there still are many situations where they are very useful. They are almost necessary for certain types of high-level abstraction factoring. Control Tables in Table Oriented Programming can make fine use of them, for instance.

Macros are very powerful, but can also lead to sloppy code. They also make compilation nearly impossible, limiting some or all of the language to interpretation. These problems can be reduced by limiting the places and circumstances where macros can be used. For example, some languages only allow macros to return the results of an expression. Under this limitation, you could not do this:

 
   // A no-no in some langs 
   amac = "(" 
   show afunc%amac%x)   // intended: afunc(x)

A further way to simplify implementation is to only allow variables in macros, not expressions. This would avoid having to implement a run-time expression evaluator.

Here is a hierarchy of macro implementation:

1. Program text substitution of any kind.
2. A block of code (multiple statements, but no control structures such as "if" and loops).
3. Expressions. Example: "myfunc(x) / foo(abs(y) - 7)". Clipper, a semi-compiled language, still allowed expressions.
4. A single function or variable reference. Ex: "myfunc(x,y,z)"
5. A single variable.

Note that these are not the only variations, but do provide options to consider. Generally the lower the number, the harder it is to fully compile the language.

Examples

An example use of run-time macros would be to implement embedded variable evaluation if the language does not support it. Embedded variables were described in the Quotes section. Here is another example:

 
   repeats = 7 
   thing = "puppy" 
   show "He kicked the $thing$ at least $repeats$ times." 
   // Result: He kicked the puppy at least 7 times.

If a language did not implement this feature, macros could help us do so on our own:

 
   sub show(s) 
     private i, curvar="", result="", invar=false 
     for i = 1 to length(s) step 1 
       c = substr(s, i, 1)  // get one character at a time 
       if c = '$'      // variable name marker? 
           if invar    // must be ending marker 
             result = @ & eval(curvar)  // evaluate macro and append to result 
             invar = false   // reinitialize 
             curvar = "" 
           else       // starting marker for embedded variable name 
             invar = true 
           endif 
       else 
           if invar do curvar = @ & c   // append character to var name 
           else do result = @ & c      // append character to result 
       endif 
     endfor 
     output(result) 
   endsub

Note that the "@" operator is presented in the Statement Communication Shortcuts section, and that "&" is used for string concatenation in this example. We also assume that such a For loop would not loop if the length is zero. (Why some languages loop under "for i = 1 to 0 step 1" escapes me.)

This example simply traverses through the input string and substitutes the value of the variable in place of the variable name. This would be nearly impossible to implement without run-time macros, especially if the input strings are read from a file or keyed in during execution.

Complex regular expressions could also be implemented with macros (if not built into the language). Let's look at a Perl example first:

 
   $content =~ s/%(..)/pack("c",hex($1))/ge;    # Perl version

This is used for CGI processing. It replaces all occurrences of %xx with its ASCII equivalent where "xx" is a hexadecimal number. Thus, "123%4b123" would translate to "123K123". Here is our version, which can be implemented by using macros:

 
   content = replace(@, "%(..)", "cnvrt('c',hex($1$))", "rge")

The "e" in the 4th parameter tells the Replace function to execute (using Eval(x)) the third parameter, which contains the Cnvrt() function (equivalent to Perl's Pack). The "r" in the 4th parameter tells the Replace command to use regular expressions. (The code to implement regular expressions and Replace is too long to show here.)
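
Python reaches the same result without a string-evaluating flag by letting the replacement be a function that is called for each match. A minimal sketch follows; decode_percent is a made-up name, and unlike the Perl pattern it only accepts valid hex digits.

```python
import re

def decode_percent(content):
    """Replace every %xx (two hex digits) with the character it encodes."""
    return re.sub(r"%([0-9A-Fa-f]{2})",
                  lambda m: chr(int(m.group(1), 16)),
                  content)

decoded = decode_percent("123%4b123")
```

A callable replacement covers much of what the "e" option is used for, though it cannot run code that only arrives as text at run time. (For real CGI work, Python's standard library already provides urllib.parse.unquote.)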

Here is an alternative way to implement it without macros. It uses a function called Parse() which puts all the pieces of a string into a list (array) and splits on a pattern. For example, Parse("mm", "123mm456mm789") would produce a list of "123", "mm", "456", "mm", "789".

 
   array list[] 
   list = parse("%..",myString)   // uses a simple reg-ex 
   for i = 1 to len(list)        // for each array element 
      if substr(list[i],1,1) = '%'   // has hex indicator? 
         list[i] = cnvrt('c',hex(substr(@,2)))   // convert hex digits to character
      endif 
   endfor 
   myString = concat(list)   // put list back together
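
The split-and-rejoin alternative maps onto Python's re.split, which keeps the separators in the result list when the pattern has a capturing group. The names parse and decode_by_splitting are assumptions mirroring the functions described above, and the pattern is restricted to valid hex digits.

```python
import re

def parse(pattern, text):
    """Split text on pattern, keeping the matched separators in the list."""
    return re.split(f"({pattern})", text)

def decode_by_splitting(my_string):
    pieces = parse("%[0-9A-Fa-f]{2}", my_string)
    # With one capturing group, the separators sit at the odd indexes.
    for i in range(1, len(pieces), 2):
        pieces[i] = chr(int(pieces[i][1:], 16))   # drop '%', convert hex
    return "".join(pieces)                        # put the list back together

pieces = parse("mm", "123mm456mm789")
```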

In summary, run-time macros allow one to add fancy features and domain-specific shortcuts that may not be built into the original language.

Arrays

One of the more interesting ways to implement arrays is to treat them more like associative arrays than like a big grid. This allows them to be more dynamic than traditional arrays, and more in the spirit of scripting. Examples:

 
    array a[]    // declaration 
    a[1] = 3 
    a[1,1] = "hey" 
    a[12,12,12,4] = 9 
    a["bob"] = 15 
    a[-1] = "foo"

If we could peer inside the storage structure for this array, it would resemble:

Key	Value
"1"	3
"1,1"	"hey"
"12,12,12,4"	9
"bob"	15
"-1"	"foo"

This allows the array user to pick any type of subscript range or type they want without pre-declaration. It also allows the array to double as an associative array; thus it's a two-for-one deal!

Note that the interpreter/compiler would probably have to treat numeric and string subscripts differently, otherwise, A[1] and A[01] would be different positions. Thus, if there are no quotes around a subscript, then the interpreter would clean up the number before inserting it into the hash table. Realize that A["1"] and A[1] would be considered the same, which is generally not a problem in scripting languages.

This approach would not be as fast as traditional arrays for intensive math operations, but fast math is not the common domain of scripting languages anyhow. (Traditional arrays use subscript multiplication for position lookups instead of hash tables.)

Some basic built-in operations related to these arrays could be ClearArray() to erase all the elements, ElemCnt() to see how many elements are in the array, and GetElem(3) would return the 3rd element (as sorted in the hash table).
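
A hash-backed array of this kind can be mocked up in a few lines of Python. The class name ScriptArray and its method names are assumptions; numeric subscripts are cleaned into a canonical string, quoted (string) subscripts are stored as written, and GetElem is left out.

```python
class ScriptArray:
    """Array whose subscripts are flattened into string keys of a hash table."""

    def __init__(self):
        self._slots = {}

    @staticmethod
    def _clean(part):
        # Unquoted numbers are cleaned up, so 1 and 1.0 share the key "1".
        if isinstance(part, float) and part.is_integer():
            part = int(part)
        return str(part)

    def _key(self, subscript):
        parts = subscript if isinstance(subscript, tuple) else (subscript,)
        return ",".join(self._clean(part) for part in parts)

    def __setitem__(self, subscript, value):
        self._slots[self._key(subscript)] = value

    def __getitem__(self, subscript):
        return self._slots[self._key(subscript)]

    def elem_cnt(self):
        return len(self._slots)

    def clear_array(self):
        self._slots.clear()

a = ScriptArray()
a[1] = 3
a[1, 1] = "hey"
a[12, 12, 12, 4] = 9
a["bob"] = 15
a[-1] = "foo"
```

As in the storage table above, a[1, 1] and a["1,1"] land in the same slot, which is the price of flattening subscripts into one string key.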

See Also:
Arrays Can Be Harmful

Comparing

Comparing two variables or expressions is often a lot more tricky than most languages seem to assume. Should spaces be ignored? Capitalization ignored? Compare them as strings or numbers? You get the idea. These issues seem to get short attention in most languages. If you are not using the default comparison handling, then they make you jump through hoops.

To help out with this, I suggest a compare() function or perhaps cmp() function that has various compare options. Example:

    if cmp( foo, ">=", bar, "LC") ...

This compares foo to bar. In the options parameter, "L" means ignore leading spaces, and "C" means pay attention to upper/lower case (otherwise ignored).

Although this follows the Function Rule, some may consider it too cumbersome for frequent usage. Perhaps the language could allow something like this:

    if foo >= bar (LC) ...

This has the options after the comparison in parentheses. Other syntax candidates are:

    if foo (>=, LC) bar ...
    if foo >=(LC) bar ...
    if foo >=,LC bar ...
    if cmp(foo, "LC>=", bar) ...

Although I take a lot of heat for it, I am leaning toward the last one. The letter positions would not make any difference (as long as they don't divide the comparison operator set itself). For example, all of these except the last would be interchangeable:

   if cmp(foo, "LC>=", bar) ...
   if cmp(foo, "LC >=", bar) ...
   if cmp(foo, ">=LC", bar) ...
   if cmp(foo, ">= LC", bar) ...
   if cmp(foo, ">= L C", bar) ...
   if cmp(foo, ">L=C", bar) ...  // Invalid!

Other possible option letters could be:

I - compare as integer, lop off any fractional part.
T - ignore trailing (white) spaces. (Similar to L).
S - compare as strings
N - compare as numbers
U - compare as Unicode chunks
A - ignore all white spaces, even in between

Note that these should not be case-sensitive, so "n" would mean the same as "N". (I hate case-sensitivity in the UNIX world).
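
A Python sketch of the three-argument form shows that the option string is simple to parse. Several details are assumptions: the set of operators, string comparison as the default when neither N nor I is given, and leaving out the U option. The S option needs no code here because strings are the assumed default.

```python
import operator
import re

_OPS = {">=": operator.ge, "<=": operator.le, "<>": operator.ne,
        "=": operator.eq, ">": operator.gt, "<": operator.lt}
# Option letters may come before or after the operator, never inside it.
_SPEC = re.compile(r"([A-Za-z\s]*)(>=|<=|<>|=|>|<)([A-Za-z\s]*)")

def cmp(left, spec, right):
    """Compare using an operator plus option letters, e.g. cmp(a, "LC>=", b)."""
    match = _SPEC.fullmatch(spec)
    if not match:
        raise ValueError(f"invalid comparison spec: {spec!r}")
    letters = match.group(1) + match.group(3)
    options = set("".join(letters.split()).upper())   # case-insensitive
    compare = _OPS[match.group(2)]

    def prepare(value):
        text = str(value)
        if "A" in options:
            text = "".join(text.split())   # drop all white space
        if "L" in options:
            text = text.lstrip()
        if "T" in options:
            text = text.rstrip()
        if "I" in options:
            return int(float(text))        # lop off any fractional part
        if "N" in options:
            return float(text)
        return text if "C" in options else text.lower()

    return compare(prepare(left), prepare(right))
```

With this sketch, cmp("05", "N>", 4) is true while cmp("05", ">", 4) compares the strings "05" and "4" and is false, and the spec ">L=C" is rejected because the letter splits the operator.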

Symbols

There is a fair amount of friction about the usage of symbols (&, *, %, @, #, etc.) in languages. UNIX-influenced languages like Perl tend to have many more symbols than languages that came out of other camps. (See Blocking for some examples and also the function rule.)

It is my opinion that some languages have taken symbols too far. Most symbols almost completely lack "mnemonics". Words and abbreviations serve as better (but not always perfect) memory aids than symbols. However, I also realize that there perhaps should be some compromises due to common traditions.

Symbols are best used for common operations that tend to occur in long expressions. A good example is string concatenation. The "&" in parameter lists in C may also be justified under this criterion.

Beyond this, symbols are rarely justified, except perhaps for habitual reasons.

For example, UNIX-influenced languages tend to use certain symbols for Boolean operations. Among these are && ("and"), || ("or"), and ! ("not"). Since these are ingrained in the minds of a good many programmers, perhaps the ideal language would accept both the symbol and the word forms. (I believe there are a few that already do.)

Such a language could accept both of these:

   x and y or not c

   x && y || ! c
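
One way a language front end might accept both spellings is to map the symbols onto the words while tokenizing, so the rest of the interpreter only ever sees one form. The following Python sketch is an illustration only: the token pattern is deliberately crude (no string literals, no decimal numbers) and the names SYMBOL_TO_WORD and normalize are made up for this example.

```python
import re

# Hypothetical mapping from the symbol spellings to the word spellings.
SYMBOL_TO_WORD = {"&&": "and", "||": "or", "!": "not"}

# Two-character operators are listed before "!" so that "!=" is not split apart.
_TOKEN = re.compile(r"&&|\|\||[<>=!]=|!|\w+|\S")


def normalize(expression):
    """Rewrite &&, || and ! as the words and, or and not; leave other tokens alone."""
    tokens = _TOKEN.findall(expression)
    return " ".join(SYMBOL_TO_WORD.get(tok, tok) for tok in tokens)
```

Under this sketch both sample lines above normalize to the same token sequence, so a programmer could use whichever habit they arrived with.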

One justification that some have tried to use for symbols is that they are more (spoken) language neutral than words or abbreviations. However, there are not enough symbols on the keyboard to cover more than a fraction of the reserved words of most languages. Besides, words are no more cryptic than symbols to a nonnative speaker of the target spoken language. Further, it is easier to alphabetize words and abbreviations than symbols, making translation dictionaries easier to use.

See "Context Insensitivity" below for more about symbols.

Context Insensitivity

The recent web-oriented language PHP has triggered a little battle over context-sensitivity and symbols in languages.

Web languages often have "include" clauses that allow the program to pull in program code or text. How dynamic the timing of the "pull in" can be depends somewhat on the language being context insensitive. This means that the language interpreter does not have to look far ahead, or even beyond the current token, to find out what the current symbol or token is.

One way to achieve this is to have a clear set of indicators that tell what type the token is. For example, in PHP the dollar sign indicates a variable. The interpreter does not have to look around to see that it is a variable based on context. Thus, PHP is "context insensitive" in this regard.

However, I am bothered by the heavy use of dollar signs (indicators) for every single variable. This not only increases typing, but also increases syntax errors, because programmers often switch between different languages, some of which don't use dollars, and will keep forgetting things like dollar signs on variables.

Before we look at possible solutions, let's look at typical indicators used to create context insensitive tokens:

     x    - no indicator means it is a keyword
    $x    - a variable
     x(   - a routine (or at least the start of one)
     x[   - an array
     x.y  - a table.field reference (I tossed that in for thought)

One way to reduce "dollarage" is to swap the indicators for variables and keywords. Thus, you would get code such as:

    $sub bar(x, y, z) {
       $while x > y {
         x--
         $if y = 7 {
           $break   // exit loop
         } $else {
           y = foo(z - y, x)
         }
       }
    }

This example has 5 keywords and 11 variable references; thus we saved 6 dollars (11 - 5). Variables are usually more common than keywords, so dollaring the keywords usually reduces the quantity of indicators needed.

(Note that an underscore or some other symbol may be preferable because the $ convention is too common. If an underscore is used, then no variable or function may be named with a leading underscore. Also note that we have pegged underscores for use in named parameters in a different section, so they may not be appropriate. The examples that follow use percent signs.)

We can reduce indicators even more! Keywords can be identified simply by their name because we already know what the keywords are. Thus, keywords and variables both don't need indicators. Before we discuss some small caveats, here is the algorithm in pseudo-code:

   if the token is fully non-alphanumeric then
      it is an operator (+,-,/,},{, etc.)
   else if the token is in the keyword list then
      it is a keyword
   else if the indicator is "%" then
      it is a keyword
   else if the indicator is "(" then
      it is a routine definition or declaration
   else if the indicator is "[" then
      it is an array
   etc...
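
The pseudo-code above can be turned into a small runnable sketch in Python. The keyword list, the function name classify, and the final fallback to "variable" are assumptions made for illustration (the pseudo-code ends with "etc..."), and literals such as numbers are not handled separately here.

```python
# Hypothetical keyword list for the imaginary language used in the examples.
KEYWORDS = {"sub", "while", "if", "else"}


def classify(token, next_char=""):
    """Classify a token from the token itself plus the one character that follows it."""
    if token and not any(ch.isalnum() for ch in token):
        return "operator"      # fully non-alphanumeric: +, -, /, }, {, etc.
    if token in KEYWORDS:
        return "keyword"       # known keywords need no indicator
    if token.startswith("%"):
        return "keyword"       # the indicator marks a keyword explicitly
    if next_char == "(":
        return "routine"
    if next_char == "[":
        return "array"
    return "variable"          # assumed fallback: an unmarked name is a variable
```

For instance, classify("while") and classify("%break") both give "keyword", while a bare classify("break") gives "variable" because "break" is not in this sketch's keyword list.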

Notice that the keyword indicator ("%" in our example) is still optional for all keywords. The reason for this has to do with additions to keywords over time. For example, suppose the "break" statement in our example above was not in the original language specification, but later added to the language. We might use variables named "break" because it was not a keyword at the time of the program writing. If we did this, we could not run some programs under the new version of the language if they contained variables named "break".

For the record, I do not think this is a major problem at all. If by chance it happens to a program, simply find all the occurrences of "break", and change them to something else, like "mybreak" or something.

However, many programmers are anal-retentive (AR) about such potential "name overlap" problems. (Why they would rather type jillions of dollar signs to prevent such a rare and easy-to-solve problem, I have no idea. But, I will cater somewhat to their concerns here.)

Thus, we will do 2 things to satisfy the AR programmers. First, if a new keyword comes along, then simply require an indicator for it. Our prior example would look something like this:

    sub bar(x, y, z) {
       while x > y {
         x--
         if y = 7 {
           %break   // exit loop
         } else {
           y = foo(z - y, x)
         }
       }
    }

Some programmers objected to the idea that some keywords have indicators and some do not. They did not like such "inconsistency." Although I find this a very minor issue, such programmers can simply still use indicators on all keywords for their programs. Remember that the indicator is still optional for old keywords. (They may be annoyed by other programmers' code, but this is a small price for a little flexibility in my opinion. Programmers rarely like other programmers' styles anyhow.)

In summary, we found a way to greatly reduce the need for indicators, yet kept the need for context out of the picture. What is PHP's excuse now?

Footnotes

Someone suggested that in-string variable expansion (such as found in Perl and other UNIX-influenced languages) could not happen if dollars were eliminated from variables. However, what happens inside quotes and what happens outside are not necessarily related. A dollar sign inside of quotes can still indicate variable expansion under our new plan.
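
Python happens to illustrate this separation: its variables carry no indicator in code, yet the standard library's string.Template still uses a dollar sign to mark expansion inside a string. The variable names below are made up for the example.

```python
from string import Template

name = "world"   # an ordinary variable, written without any indicator

# Inside the quoted text, "$name" marks where the value should be expanded.
greeting = Template("Hello, $name!").substitute(name=name)
```

After this runs, greeting holds the text "Hello, world!", even though no dollar sign appears on any variable outside the quotes.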

Also, some programmers claim that heavy use of indicators makes programs easier to read because the indicator reduces their need to read the whole token to figure out whether it is a keyword or a variable. However, I do not find this to be the case. To me, the keywords are well-learned after a few programs and are instantly recognized as keywords. This may just vary by individual, and a psychological study would be needed to see what is the most common reaction in the programmer population.

Minimalism

Lately I have been fascinated with the idea of minimalism. The goal of minimalism is to keep the syntax simple without sacrificing power. This is often done by having complex libraries/API's instead of complex syntax. The trick is to find simple syntax that can "bend" to do or represent many different things.

We have seen examples of this with the function rule and the consolidation of dictionary arrays with OOP. It came up again when a reader suggested that I recommend "sets" in a language. However, a language with dynamic parameters could handle such just by adding functions to the library(s):

  if x in {0, 1, 2, 10, 20}    // set version

  if isIn(x, 0, 1, 2, 10, 20)  // function version

I agree that the first version is slightly more "natural" than the second, but it is not used commonly enough to justify dedicated syntax. (At least not the way I program.)
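
As a rough illustration of the function version, here is how such a library routine could look in Python, whose *args feature plays the role of dynamic parameters. The name is_in is just this sketch's spelling of isIn.

```python
def is_in(value, *candidates):
    """Return True if value equals any of the remaining arguments."""
    return value in candidates
```

A call such as is_in(x, 0, 1, 2, 10, 20) then reads much like the set version, with no extra syntax in the language itself.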

Smalltalk and some versions of LISP have done a pretty good job at keeping the syntax simple. Even IF statements in some languages are nothing more than expressions or function-like things. However, that is probably taking things too far. IF statements are common enough to justify dedicated syntax.

A simplified version of LISP (LISP-Lite) can be represented with only these "syntax generators":

  statement -> (command params)
  statement -> (command)
  params    -> params param
  params    -> param
  param     -> constant
  param     -> variable
  param     -> statement

IF's, loops, function declarations, assignments, etc. can all be specified with just this simple syntax. All this with only a few piddly generators! Wow! Although I question the human-readability of such a language, the concept is fascinating. (The parentheses in the first 2 lines are part of the language.)
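
To show how little machinery those generators need, here is a minimal recursive-descent parser for them, sketched in Python. It is an illustration under assumptions: constants are limited to integers, any other bare word is treated as a variable, and the command names in the usage note (if, gt, set) are invented rather than taken from any real LISP.

```python
def tokenize(source):
    """Split source text into parentheses and bare words."""
    return source.replace("(", " ( ").replace(")", " ) ").split()


def parse_atom(token):
    """param -> constant | variable (integers are constants, other words variables)."""
    try:
        return int(token)
    except ValueError:
        return token


def parse_statement(tokens, pos=0):
    """statement -> (command params) | (command); returns (tree, next position)."""
    if pos >= len(tokens) or tokens[pos] != "(":
        raise SyntaxError("a statement must start with '('")
    pos += 1
    if pos >= len(tokens) or tokens[pos] in ("(", ")"):
        raise SyntaxError("a statement needs a command name")
    tree = [tokens[pos]]                    # the command comes first
    pos += 1
    while pos < len(tokens) and tokens[pos] != ")":
        if tokens[pos] == "(":              # param -> statement
            param, pos = parse_statement(tokens, pos)
        else:                               # param -> constant | variable
            param, pos = parse_atom(tokens[pos]), pos + 1
        tree.append(param)
    if pos >= len(tokens):
        raise SyntaxError("missing ')'")
    return tree, pos + 1                    # step past the closing parenthesis


def parse(source):
    """Parse exactly one statement and return it as nested lists."""
    tokens = tokenize(source)
    tree, pos = parse_statement(tokens)
    if pos != len(tokens):
        raise SyntaxError("unexpected text after the statement")
    return tree
```

Parsing "(if (gt x 5) (set y 1) (set y 0))" yields a nested list with "if" at the head and three sub-statements as its parameters; an IF here is nothing more than a command whose params happen to be statements.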

One area ripe for minimalism in practical languages is collections. Languages like Python have dedicated syntax for tuples, lists, dictionaries, etc. This makes the language confusing and harder to read and learn in my opinion. Most collection handling can be moved to libraries/API's. Smalltalk has done a pretty good job of this, although its collection libraries are too hierarchical (IS-A) in my opinion. (The only "native" collection is dictionaries, a.k.a. "associative arrays", in my pet language. They are used as a primary interface mechanism there; not so much data holding. Lists can be useful too, but can come from a library alone.)

See Also:
Arrays Can Be Harmful

More Comparisons To Come . . .

Contents Index

General Criteria
Option Selectors
Statement Separation
Blocking Markers
Parameter Calling
Emulating Named Parameters
Parameter Receiving
Error Handling
Capitalization Handling
The Leaky Assignment Controversy
Variable Scope
Null Values and Zero Divides
String Functions
Associative Arrays Versus Tables
Packages and Naming Conflicts
Implied Declarations
Basic Types
Unicode
Quotes
Object Orientation
Function and Subroutine Difference
Statement Communication Shortcuts
Run-time "Macros"
Arrays
Comparing
Symbols
Context Insensitivity
Minimalism
Misc. Notes on Closures, Etc.

See Also:
"L" - A Draft Language Description based on some of the above favorites
Procedural/Relational Language Helpers
Dynamic Relational Database
