Updated 2/19/2004
Most of the popular scripting languages were built with little research into, or debate about, why one approach is "better" than another. The authors of these scripting languages simply built upon something they were already familiar with, or added a pile of features without documenting their tradeoff considerations. A lot of "history" is thus lost to new language inventors.
To reduce such problems when the Next Great Scripting Language is built, I have assembled together a list of language options, possibilities, and my preferences based on the pros and cons given. You are welcome to contribute any comments or options. I am not promising the best decision, only the best collection of possibilities and analysis of the design options.
We are also focusing on a middle-of-the-road scripting language; one that is fairly good at quick and dirty stuff, yet has enough "safety features" to build fairly large applications. More on scripting limits and target applications is presented under the flexibility criteria. Some people believe that a dichotomy between scripting and non-scripting languages is false. If you are in this camp, then simply ignore the word "scripting" whenever you see it. Or perhaps replace it with "weak-typed" or "dynamically-typed". More on the definition of scripting can be found here.
No one single language is to be promoted here. We are exploring concepts, options, and possibilities; not specific existing languages. If I mention a language, it is only to serve as an example for those who may wish to go see a concept in real-world action.
As far as language familiarity, I have had exposure to *Pascal, XBase, *C, *Java, Visual Basic, Perl, *Fortran, *COBOL, and briefer exposure to many others, including Lisp, Tcl, Python and APL. (Items marked with an asterisk are generally not considered scripting languages.)
Some also complain that I have never built a formal compiler or full interpreter, only simple ones; and thus allegedly have no business designing or critiquing a language. To this I reply that composing music well and playing music well are not prerequisites to each other; and that most of the options presented below already exist in other languages. They have already proven to be implementable. Focusing too much on implementation may also bias one toward creating an interpreter-friendly language over a human-friendly one.
Most of the options and discussions presented here are based on my own observations and experience. I have already incorporated some observations mentioned by others. The feedback of others is welcome, although I cannot give personal credit unless it is an extended work.
Another way to simplify development is to simplify the syntax. This is done by avoiding the need for excessive type conversion functions, wrappers, mandatory error handling, etc. It is also achieved by allowing or assuming common defaults. Passive defaults are those built into the language. Active defaults can be set by the programmer.
Still another approach is the use of abbreviations, aliases, and/or macros to avoid the repetition of long statements. (Surgeon General Warning: poor usage of defaults and abbreviations can lead to programs that are difficult to read.)
Many scripting languages are also known to have a good set of string manipulation and parsing operations and functions. String processing is often needed to extract data from one system and prepare it for another. Hence, scripting languages are sometimes called "glue languages".
Although it is often said that bad programmers can ruin any language, some languages are much more abusable than others. To prevent bad or selfish programmers from doing too much damage, I propose that a good scripting language should be designed such that it is always easy to tell where one command/assignment ends and another starts, and easy to tell the relationship (if any) between the command/assignments. One way to achieve this is to avoid unnecessary deviations from the function rule.
Thus, you will always know what is a parameter, an assignment, a control statement, a function call, etc. The programmer may be able to scramble the molecules, but not the atoms.
Languages like Pascal and Java (non-scripting languages) are examples of languages that are the least abusable from this standpoint, while Perl and C++ can be notorious in the wrong hands. We will try to reach a good compromise here.
The amount of readability should be given much thought. Programmers are more heavily rewarded for finishing on time than for producing readable code. This is primarily because meeting deadlines is easier to measure than code readability. Everybody knows when a programmer is late, but few if any know how maintainable their code is. Thus, in my opinion the language needs to protect itself (and the company) from this unfortunate bias to some extent. Reverse-engineering somebody else's code can be an expensive and time-consuming task. Don't ignore this cost just because it is down the road a bit. (In finance it is customary to downplay future results, but not ignore them.)
Some "bad" programmers claim that it is the reader's fault if they cannot decipher the code or don't know the shadier side of a language. The problem is that many programmers are called on to use many different languages in one organization and cannot always become a complete expert on any one language or style. How would you like a TV repair person to completely rewire your TV set outside of factory specs? Would you blame the follow-up repair guy if he quoted you 7 grand?
There is a saying in the Unix community that one should not prevent idiots from abusing something because it may prevent someone else from making good use of it. In other words, give everyone chain-saws because (hopefully) more people will build useful things than the number who will damage something or someone. However, it is my observation that programmers are more likely to abuse a language than to make good use of it. This is usually because the incentive to finish fast is greater (and easier to measure) than the incentive to make coherent and maintainable systems. Thus, I unfortunately have to disagree with the "Unix Chain-saw" rule.
One way to make life easier for the compiler or interpreter (C/I) is simply to use function calls or API's for as much as possible. We will call this the function rule. The function rule says to use function/subroutine syntax to implement all operations unless you have a good reason to deviate.
As an example of an unjustifiable violation of the function rule, XBase uses the dollar operator ($) to mean "is contained in". However, we see little reason not to make this a simple built-in function, such as InStr(), meaning "in string".
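A minimal Python sketch of the same idea (in_str is a hypothetical name; Python itself happens to use an "in" operator for this):

```python
# The function rule applied to substring testing: instead of a
# dedicated operator like XBase's $, a plain built-in function
# does the same job with ordinary call syntax.
def in_str(needle, haystack):
    """Return True if needle occurs within haystack."""
    return needle in haystack

print(in_str("ab", "xaby"))   # True
print(in_str("zz", "xaby"))   # False
```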
On the other hand, functions can get annoying for some commonly-used operations. For example, math functions could be represented with just functions like Plus(), Minus(), and Divide(). However, most language builders chose to implement these using the operators +, -, and / instead. Thus, instead of using:
    total = a + b + c + d + e

one would have to use this if the function rule were strictly followed:

    total = add(a, add(b, add(c, add(d, e))))

(Note that it may be possible to implement something like add(a,b,c,d,e), but this approach has other sticky issues associated with it.)
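As a sketch of how a variadic add() could sidestep the nesting (hypothetical, in Python):

```python
from functools import reduce

def add(*args):
    """Variadic add(), following the function rule for arithmetic.
    Accepts any number of arguments; zero arguments yields 0."""
    return reduce(lambda x, y: x + y, args, 0)

total = add(1, 2, 3, 4, 5)
print(total)   # 15
```

The "sticky issues" the text mentions (operator precedence, readability of mixed expressions) remain; this only shows that the variable-quantity call itself is implementable.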
Although functions keep the syntax simple by being a generic programmer interface, they do have stylistic limitations.
Deviations from the function rule often indicate the "orientedness" (target audience or usage) of the language. This is where the art and politics of language design come in. For instance, in table-oriented programming, one would rather see and type dedicated table syntax than generic function-call wrappers around every operation.
Footnote: Perhaps there is a pattern to deviations from the function rule that an implementer may want to build-in or prepare for. This could allow for custom oriented-ness to be added onto a language. This way a base language could be built, but then have certain ways to "overload" some operators so that fans of Objects, Tables, Strings, Streams, Pipes, Math, etc. can do their favorite stuff without too many unnecessary functions, parentheses, or quotes. This may be an interesting topic of research that is outside the scope of this article.
Of course, this can put a burden on the programmer to plan types better and add more type converting between number types. It is basically a tradeoff of burdening the programmer with the details or burdening the CPU and compiler/interpreter with the details.
Although it is true that CPU's are becoming faster and faster, it is also true that more is being asked of programs. GUI's and speech recognition are examples of new burdens that come along. Fortunately, these functions can often be off-loaded to API's or frameworks written in C or chips that do not burden the decision processing code itself. The idea is that a scripting-like language makes the decisions, but lets fast but ugly components do the actual number crunching.
Click here for more information about cataloging the evaluation criteria for languages, paradigms, and tools.
I enjoy debating language features. However, in doing so I have encountered certain "hot button" topics. In these cases the choices seem related to personal preferences, habits, and programming philosophies. Rather than forcing one's own preferences on other programmers, perhaps the language builder can make hot button features optional or select-able with preference indicators at the top of a program file or directory. Example:
    #Prefs
      LineBased: on       // line separators instead of semicolons
      NullHalt: off       // no halt if nulls in expression
      DivideResult: halt  // halt if division by zero
      // Other DivideResult options: null, zero, -1
      IgnoreCaps: on      // ignore case in string compares
    #EndPrefs

Or, if we already have a file with our favorite preference settings, then we could have something like:
    #Preffile: /prefs/Joes_prefs.txt

There are two drawbacks to allowing too many variations. First, the compiler/interpreter has to be more complicated to handle the variations, and second, the learning curve for the language may be steeper. Perhaps the learning curve would actually be smaller in some cases because the programmer can make the settings closer to what they are familiar with.
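To illustrate that such a preference header is cheap to support, here is a rough Python sketch of a parser for it (the setting names are the article's hypothetical examples, and a real interpreter would add validation):

```python
def parse_prefs(source):
    """Extract 'Name: value' settings from a #Prefs ... #EndPrefs
    header block. Text after // is treated as a comment."""
    prefs = {}
    in_block = False
    for line in source.splitlines():
        line = line.split("//")[0].strip()   # strip comments
        if line == "#Prefs":
            in_block = True
        elif line == "#EndPrefs":
            break
        elif in_block and ":" in line:
            name, value = line.split(":", 1)
            prefs[name.strip()] = value.strip()
    return prefs

src = """#Prefs
LineBased: on   // line separators instead of semicolons
NullHalt: off
#EndPrefs"""
print(parse_prefs(src))   # {'LineBased': 'on', 'NullHalt': 'off'}
```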
Whether to allow the selection of the options or to pick what you as a language designer think is the best is part of the art of language design. I have indicated areas where option selectors may be a better alternative to pissing off a programmer by forcing your philosophies down his/her throat. Remember the fast-food slogan:
Have it your way
Footnote: Perhaps votes can be taken on features where the voter ranks the strength of their preference. Those features with the most variance in the strength rankings would be the best candidates for option selectors. (Perhaps a sum-allocation system could prevent voters from exaggerating all their preferences in order to inflate their vote.) This may be a more scientific approach than counting the personal insults in forum discussions. Of course, the difficulty of implementing selectors for a given feature should also be factored in. Something like optional halting on division by zero should be fairly easy to implement. However, the handling of null values in expressions can get complicated.
This is how statements are separated from each other. Some languages, like C and Perl, use only semicolons; and others use line-feeds as their primary separator, such as Basic, XBase, and Fortran. Note that contrary to popular belief, Visual Basic does allow multiple commands on one line using a colon (:) as a separator. Example:
    thing = 5 : foo = "this" : bar = 99.99

(A semicolon could perhaps be used for this purpose instead of a colon.)
Line-feed languages also allow continuation of a statement to the next line(s) with explicit markers at the end of a line such as underlines or semicolons. Example:
    aBigLongVariableName = aBigFatFunctionName( BigFatParameter1, _
        BigFatParameter2, BigFatParameter3 )

Keep in mind that the primary difference between the two is not whether semicolons are used, but whether line-feeds can also act as a separator (sometimes the only separator). Often the best way to distinguish is by whether a continuation character is needed to wrap a long statement. If one is needed, then it is probably line-feed based. (However, JavaScript still seems to be an odd hybrid.)
This would be a setting at the top of the program file to select which style to compiler/interpret for.
This is the method used to specify which group (block) of commands an operation applies to. The two most common block marking approaches are curly braces, "{}", and what we will call "X/endX". In its simplest form, X/endX simply uses the word "end" concatenated with the first word of the control statement. Thus, if a control statement starts with "While", it will end with "EndWhile".
Unfortunately, BASIC and other languages have muddied up X/endX with "noise words" like "do" and "then". These extra words are of no real use, and therefore should not be included. Thus, a language should have "while" and "if ...", not "do while" and "if ... then". (I will not include a pro/con section on these noise words since I know of no known benefits.)
My personal preference is X/EndX because it gives the reader and interpreter/compiler better locational information if there is a typo somewhere. You can read more about this at this link.
A third possibility is to use the indentation itself to indicate a block. A possible problem with this is that when tabs and spaces are mixed together, the indentation may get messed up. This is because different text editors and printers interpret tabs differently. On some systems a tab is 4 spaces, on others 7, etc. The compiler would probably have to reject (halt on) one of leading tabs or leading spaces to enforce consistency. I would give this option much higher marks if tabs were standardized.
Since this topic also incites riots, my favored solution is a style option selector at the top of the program file. (See below about other possible selection methods.)
To recap, the options are:
It might even be possible to allow a mixture of both approaches in the same program without explicit option switches. If a blocked statement ends with a left brace, then every statement in the block must be separated with a semicolon and the block must end with a right brace. If the control statement does not end with a left brace, then X/endX and the line-feed approach are assumed. Perhaps this only has to be done at the subroutine definition level -- it is not likely that someone would mix styles within the same subroutine.
    sub a() {       // interpreter sees brace
      foo();        // semicolons
      b = 9;
      ...
    }

    sub a()         // no brace, endx assumed
      foo()         // no semicolons
      b = 9
      ...
    endSub
Whether self-indication at the block or subroutine level is feasible or not needs some study.
Further note that Pascal's Begin...End pairs are actually closer in structure to braces than to X/EndX. Since braces appear to be more common than Begin...End, I will not consider Begin...End here.
Note 3: I will mix styles in the coming examples, although most will use X/EndX. I will also use "//" as comment markers in the examples.
    if x > 3               // regular blocked approach
      m = x
    endif

    while y < 0 {          // regular with braces
      mysub;
    }

    if x > 3 do m = x      // combined with "do"
    while y < 0 do mysub;

Some languages allow such without an indicator such as "do", but in my opinion, "do" would prevent a lot of syntactical accidents and misunderstandings. ("Do" is only meant for the non-semicolon code style.)
There are two sides to the parameters issue, sending them and receiving them. The sending side can use positional parameters, named parameters, clause-based parameters, or a mix.
Positional parameters are the most common. A parameter's position determines how it is received on the subroutine side.
Here is an example of named parameters:
openTable(table="clients", access="read", sharemode="shared")
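Python's keyword arguments behave much like this and can serve as a working model (open_table and its parameters are stand-ins for the hypothetical openTable):

```python
def open_table(table, access="read", sharemode="shared"):
    """Named parameters with defaults: callers name what they pass,
    in any order, and omitted parameters take the defaults."""
    return (table, access, sharemode)

print(open_table(table="clients", sharemode="exclusive"))
# ('clients', 'read', 'exclusive')
```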
This approach is fairly common in database languages and somewhat resembles Smalltalk's messages. Any mandatory parameters come first, and the rest are optional clauses. Examples:
    Open "Clients"                    // Minimum is table name
    Open "Clients" readonly shared
    Open "Clients" shared readonly    // position does not matter
    Open "Clients" _shared _readonly  // with underscores

Some of the examples use an underscore to indicate a clause. This is to distinguish between the clauses and the "parameters" of the clauses. Thus, you can have a kind of "sub-parameter", such as the criteria ('status = M') after the _where clause in the example below. Note that the "_shared" clause does not have any sub-parameters because it provides information by itself. This provides two ways to possibly specify some types of parameters:

    Select * from Clients Where status = "M"            // SQL
    Select("*" _from "Clients" _where "status = 'M'")   // variation
    Open "clients" _shared off
    Open "clients" _noshare      // means the same thing
    Open "clients" _exclusive    // same as no-share

We are not proposing that all variations be implemented; we are only presenting design options.
Note that if there are more than one mandatory parameters, they are separated by a comma in traditional fashion:
    foo("this", "that", "pat" _thingy "stuff")
    foo("this", "that", "pat")

The first 3 parameters are mandatory. Thus, those who don't like to program with clauses can stay with the familiar style, assuming the built-in functions do not require them.
One possible way to implement the processing of these types of parameters in the callee routine is presented at this link. It uses a function called Clause(). Clause() with one parameter is simply a boolean test for the clause's existence. With two parameters it returns the sub-parameter(s) of the given clause. For example, for an SQL-like statement, Clause("Where") would return True if there is a Where clause, and Clause("Where", 1) would return something like "status = 'M'".
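A rough Python model of the Clause() accessor, assuming the parser has already collected the clauses into a dictionary (all names here are hypothetical):

```python
def make_clause(clauses):
    """Build a Clause()-style accessor over a dict mapping clause
    names to their lists of sub-parameters."""
    def clause(name, index=None):
        if index is None:
            return name in clauses          # existence test
        return clauses[name][index - 1]     # 1-based sub-parameter
    return clause

# Simulating: Select("*" _from "Clients" _where "status = 'M'")
clause = make_clause({"From": ["Clients"], "Where": ["status = 'M'"]})
print(clause("Where"))      # True
print(clause("Where", 1))   # status = 'M'
print(clause("Shared"))     # False
```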
Note that there are other variations that allow for an unknown quantity of parameters, such as field lists. Of course, field names could be put into one big string instead. There are many ways to skin a cat. The final choice usually depends on the orientation of the language. An advantage of this approach is that it can easily be expanded to allow zero or many sub-parameters per clauses if the language builder later decides to add these features.
This is very similar to mixed clauses. Example:
    rr(1, 2, "m")
    rr(1, y=2)
    rr(m, y=n)
    rr(1)
    rr(x=1, y=7, z="hey")
    rr(y=7, z="hey", x=1)

    // The subroutine definition
    sub rr(x=0, y=0, z="a")
      ...
    endsub

This example shows the subroutine definition with defaults assigned. (Defaults are addressed in the Parameter Receiving section.)
Rather than create special syntax just for named parameters (described above), it is possible to use a string parameter to simulate named parameters. All that is needed are parsing functions to extract them. Example:
    x = myFunc(12, "blah", "foo='nork',glob=13");
    ....
    function myFunc(a, b, myNamedParams) {
      ....
      if paramExists(myNamedParams, "foo") {
        foo = getParamValue(myNamedParams, "foo");
        ....
      }
      ....
    }

getParamValue would return an empty string if "foo" does not exist. Thus, in some cases we may not need to call paramExists. Note that I would perhaps recommend a shorter function name than getParamValue in practice because it may be used often.
One drawback of this approach is that we cannot add new fixed-position parameters without changing any existing calls that use named parameters. However, new parameters tend to be optional anyhow, meaning they could be implemented with a named parameter. Another possible drawback is that it may not run as fast as dedicated syntax, although this depends on how the functions are implemented.
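A minimal Python sketch of the string-based approach (a toy parser; real code would need proper quoting and escaping rules):

```python
def get_param_value(named, key):
    """Parse "foo='nork',glob=13"-style strings; return '' if the
    named parameter is absent."""
    for pair in named.split(","):
        name, _, value = pair.partition("=")
        if name.strip() == key:
            return value.strip().strip("'")
    return ""

def param_exists(named, key):
    return get_param_value(named, key) != ""

print(get_param_value("foo='nork',glob=13", "foo"))   # nork
print(get_param_value("foo='nork',glob=13", "glob"))  # 13
print(param_exists("foo='nork',glob=13", "bar"))      # False
```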
Parameter receiving generally has 2 issues: First is providing protection options (such as passing by value), and second is specifying defaults.
Scripting languages usually pass by reference because it is the most generic. The incoming parameter is simply a full-featured alias of the original.
Passing by value is used to give protection to the originator of the subroutine call. No matter what is done to the local version, the original (caller) parameter is protected. The rules often get a little tricky, though, if large structures like arrays are passed.
Another possibility is to make all calls be by reference, but be able to designate them as read-only if protection is desired. This has the benefit of avoiding the problems with arrays that by-value methods have. (The read-only mechanisms of the compiler/interpreter can also be used to implement constants. Another possible benefit over by-value is speed, because internal copying is not needed.)
One of the drawbacks of the read-only method is that you cannot change the local copy. This is easily solved by explicitly making a copy if one is needed. Many consider this good programming practice anyhow. Altering parameters for local-only use can be misleading and risky.
We vote for the read-only option, if you can accept a little newness. (I don't know of any languages that currently support it.)
We think it best that by-value (or read-only) parameters be the default because they are safer and more common than changeable parameters. Some marker can indicate changeable parameter(s):
    sub foo(this, *that, those)   // asterisk indicates changeable param.
      that = this + those
    endsub

In this case, an asterisk is used to mark changeable parameters. Pascal uses "var" for the same purpose.
Now on to the issue of default parameter values. Here is one approach:
    sub routinex(foo=9, bar="", stuff=0)
      ...
    endsub

If "foo" is not supplied by the caller, for example, it will receive the default value of 9. One possible drawback of this approach is that it may be tough to determine whether the caller omitted required parameters. Another possible approach is to use the Clause() function set already described under Parameter Calling.
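For comparison, Python programmers often detect omitted parameters with a unique sentinel default; a sketch (routinex is the hypothetical routine above, with "bar" made mandatory for illustration):

```python
_OMITTED = object()   # unique sentinel; never equal to a real argument

def routinex(foo=9, bar=_OMITTED, stuff=0):
    """'bar' is required; 'foo' and 'stuff' have passive defaults."""
    if bar is _OMITTED:
        raise TypeError("required parameter 'bar' was omitted")
    return (foo, bar, stuff)

print(routinex(bar="hello"))   # (9, 'hello', 0)
```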
Another way to handle omitted and default parameters is with the OOP multiple prototype approach. However, we consider this a bit cumbersome and anti-scripting.
Alternatives are to use null values to indicate omitted parameters, or a ParamCount() function to indicate the number of parameters supplied. The chosen method would probably reflect the type of parameters supported. For example, ParamCount() would be more appropriate if only positional parameters are supported (discussed in prior topic), and Clause() more appropriate if named or clause-based parameters are supported.
Here is a possible way to implement variable quantities of parameters for positional schemes:
    sub foo(*) {
      if paramCount() >= 1 {
        param1 = param(1);
      }
      if paramCount() >= 2 {
        param2 = param(2);
      }
      ...
    }

Note that Param() is a function and not an array. It returns the value of the corresponding parameter. The drawback of this approach is that it only operates on by-value parameters, which may be acceptable for most variable-quantity parameter uses. A special scope operator could perhaps give direct access to the parent routine. For example, the callee could use "parent$foo = 12".
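Python's *args mechanism models Param() and ParamCount() directly; a sketch (foo is hypothetical):

```python
def foo(*args):
    """Variable-quantity positional parameters. len(args) plays the
    role of ParamCount(); args[i-1] plays the role of param(i)."""
    param_count = len(args)
    total = 0
    for i in range(1, param_count + 1):
        total += args[i - 1]        # param(i), 1-based
    return (param_count, total)

print(foo(5, 6, 7))   # (3, 18)
```

Note that args holds copies of references, consistent with the text's point that this style effectively operates on by-value parameters.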
We could also consider a Perl-like approach which puts the parameters into an array; however, this also does not offer us control over by-reference versus by-value parameters, unless pointer-like constructs are used. However, pointers are ripe for abusability and complicate the language, so we will avoid them like the plague, right?
Error handling is very tricky to design. The basic options are to handle an error, ignore the error (go on), or allow a default run-time halt. The tricky part is specifying when and where to do one of these 3.
This is similar to the Java style, but is not forced. If there is no error block, then a default run-time error is triggered upon error. Example:
    block {
      x = openfile(blah)
      blah = read(x)
      close(x)
    }
    catch {
      show("Error writing to file")
    }
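This maps closely onto try/except in languages that have it; a Python sketch with a stand-in file operation (risky_read and the file name are hypothetical):

```python
def risky_read(path):
    """Stand-in for openfile/read/close; raises OSError on failure."""
    with open(path) as f:
        return f.read()

try:
    data = risky_read("no_such_file_8qz3.dat")
except OSError:
    data = None
    print("Error reading file")   # the 'catch' block
```

As the text notes, the key design difference is whether such blocking is forced (as in checked-exception styles) or optional, with an unguarded error producing a default run-time halt.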
This is where any operation that can detect an error returns a status code as a value. This is used by many C functions.
This is the specification or designation of a routine that will be triggered if there is an error. One variation resembles this:
    On Error Call RoutineX()

This allows the handling routine to be reassigned as needed throughout the program. The second variation is to have a pre-designated routine that is always the error handler routine.
This is perhaps best introduced with sample code:
    errhalt("off")
    stuff = read(x)
    if err()                 // check #1
      show "Error in read. Error Number: " + errno()
    endif
    stuff = read(x)
    write(x, stuff)
    if err()                 // check #2
      show "One of the two above statements errored"
    endif
    stuff = read(x)          // checkpoint Lisa
    clearerr()               // clear error status
    write x, stuff
    if err()                 // check #3
      show "The Write statement errored, Error#: " + errno()
    endif
    errhalt("on")            // halt if future error
    stuff = read(x)
    if err()                 // check #4
      show 'A useless check because of the halt'
    endif
    errhalt "off"            // don't halt if error
    write(x, stuff)          // checkpoint Amy
    status = err()           // check #5
    if err()                 // check #6
      show "Will never trigger because prior err() cleared it"
    endif
The Err() function returns True if any prior statement generated an error. It is cleared either after checking its value (such as in an IF statement), or if Clearerr() is called.
Thus, if the Read function at checkpoint Lisa errored, we could not catch it because we did a Clearerr() right after. Similarly, the statement at checkpoint Amy will not trigger check#6 because the check#5 "took" the error already.
The ErrHalt() function can be used to turn on or turn off halting if there is an error. The default is halting, thus you would want to issue ErrHalt("off") before using Err().
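The sticky-status semantics can be modeled in a few lines; here is a rough Python sketch (ErrState and its method names are hypothetical, mirroring the pseudocode above):

```python
class ErrState:
    """Sketch of sticky error status (Err/ClearErr/ErrHalt semantics)."""
    def __init__(self):
        self.errno = 0       # last error number, 0 = none
        self.halt = True     # errhalt default is "on"

    def record(self, errno):
        """Called by any failing operation."""
        if self.halt:
            raise RuntimeError("run-time halt, error %d" % errno)
        self.errno = errno   # status stays set until sampled

    def err(self):
        """True if any prior statement errored; sampling clears it."""
        flagged = self.errno != 0
        self.errno = 0
        return flagged

E = ErrState()
E.halt = False      # errhalt("off")
E.record(13)        # a failing statement
E.record(14)        # another failure; status remains set
print(E.err())      # True  -- an error occurred somewhere above
print(E.err())      # False -- the prior err() cleared it
```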
Since the error status is only cleared when sampled or upon a ClearErr(), one does not have to check for errors after each statement. Example:
    write h, a
    write h, b
    write h, c
    write h, d
    if err() {
      show "An error occurred somewhere above"
    }

The Err() function will tell if there was an error in any one of the above 4 Write statements, not just the most recent one. Thus, it provides many of the benefits of blocked catching (Java-like).
This would use a non-positional parameter (see parameter options) to optionally "sample" the error status. Example:
    openfile "stuff.dat" _handle h _read _errto errstat
    if errstat != 0
      show "Error on open, Number: " + errstat
    endif

    // Example 2 -- No error clause:
    openfile "stuff.dat" _handle h _read

The first example would put an error status value into the variable "errstat". If the program does not "sample" the error status, such as in example 2, then a run-time halt is generated if there is an error. Note that it is the existence of the _errto clause that determines if a halt is generated, NOT the existence of an "if" statement to evaluate it. In fact, the "if" statement is not required to be there. An _errto clause without an if statement is a way to ignore an error.
    sub myRoutine
      var locmark
      on error handlerA      // set up handler routine
      locmark = 1
      foo
      foo
      foo _errorto x
      if x <> 0
        message "Error: " & errText()
        return
      endif
      locmark = 3
      bar
      bar _errorto x
      if x <> 0
        message "Error: " & errText()
        return
      endif
      bar
    endsub

    sub handlerA inher       // see scoping about 'inher'
      if locmark = 1
        message "Error during breakfast: " & errText()
        return true          // go back to just after the error
      endif
      if locmark = 3
        message "Error during lunch: " & errText()
        return false         // don't go back (exit MyRoutine)
      endif
    endsub

This is a more complex variation of option 6. This sets up a default handler routine, called "handlerA" in this case. If there is a run-time error, then handlerA is called unless the problem statement has an _errorto clause. There should also be a way to turn the error handling off (causing program aborts). Perhaps "on error abort" or something similar.
"Locmark" is an ordinary variable that demonstrates how some block detection can be done with this method. (An old VB trick I picked up).
Some may prefer the error handling section to somehow be at the bottom of the routine instead of a separate routine, similar to VB's approach (except a little more modern). I will leave the details of such syntax up to you.
See also: Error Handling Notes
This would be an addition to, not a replacement for the above methods. Perhaps only run-time halts should go to the log, not handled errors. There should also be a way to optionally pass an error or message onto the log from the handler(s).
There are two issues to case handling. One is whether to have the machine ignore case when identifying variables and tokens, and the other is ignoring case when comparing strings.
I think it is in the spirit of scripting that the machine should generally ignore case, provided that case-sensitive string comparison functions are available for the rare times when case matters.
I can understand why the more formal languages might want to default to sensitivity, but it does not belong in scripting languages.
Dealing with file names gets a bit trickier though. Generally the interpreter should follow the conventions of the host OS.
A sticky controversy in languages is whether or not to allow the mixing of assignment statements with control statements and others. We call them "leaky assignment statements" because they leak or pass on the assigned value to other constructs.
The most common use for leaky statements is in While statements. Here is a common example that reads and echoes the contents of a text file:
    handle = fileopen("afile.dat")
    while( line = readline(handle)) {
      print(line)
    }

It is assumed that a null value evaluates to False within the While statement. (Whether this is good or not will not be debated here.)
In the example, the assignment to "line" also echoes that value to the While statement. If this echoing (leaking) were not allowed, we may have to write something like this:
    handle = fileopen("afile.dat")
    line = readline(handle)
    while( line != null ) {
      print(line)
      line = readline(handle)   // note this!
    }

This forces us to repeat the "readline" statement twice. This is not ideal because it is extra coding and because we may change the first readline but forget to change the second one, especially in a longer loop.
In this particular case we can solve the problem by adding an EOF() (end-of-file) function. However, we have observed that this "repeat checker" problem pops up fairly often in loops, not just file I/O. I cringe every time I see it in a language book or manual, which is fairly often.
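For what it is worth, Python took a middle road on this exact problem: ordinary assignment is a statement and cannot leak, but version 3.8 added a visually distinct assignment-expression operator (:=, PEP 572) specifically for repeat-checker loops like this one:

```python
import io

# A stand-in for a file handle, so the example is self-contained.
f = io.StringIO("alpha\nbeta\n")
lines = []
# ':=' assigns AND yields the value; plain '=' still cannot appear
# in an expression, so accidental leaks remain impossible.
while (line := f.readline()):
    lines.append(line.strip())
print(lines)   # ['alpha', 'beta']
```

The distinct spelling addresses the abuse concern: a reader can never mistake := for an equality test.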
We also do not like the idea of the assignment statement returning a value; assignment statements should be by themselves. If it were not for this Repeat Checker problem, there would be little reason to allow leaky assignment statements. Their potential for confusion and abuse is too great. For example:
result = (a = b)Is this two assignments, or one assignment and one boolean statement? Using "==" (2 equals) for booleans does not solve the problem, as described later. Some languages use a colon instead of an equal sign for assignments. Example:
    result : 8
    result : (a = b)   // assignment of a boolean value

However, the equal sign for assignments is perhaps too familiar to do away with. The colon also requires pressing the shift key, unlike the equal sign. (Since assignment is used so often, we are counting keystrokes in this case.)
Thus, if we can find a way to fix the Repeat Checker problem, we can get rid of a legitimate reason to allow leaky assignments. Here is one solution:
    prewhile {
      A
    } goif X {
      B
    }

It is sort of like an if-else crossbred with While loops. In the first construct above, all relevant statements in position "A" are executed at the top of each iteration. Expression "X" is then evaluated. If expression "X" is true, then all relevant statements in position "B" are executed; otherwise the loop is terminated (not returning to "A", "X", nor "B" any more).

    // Applied Example:
    handle = fileopen("afile.dat")
    prewhile {
      line = readline(handle)
    } goif line != null {
      print(line)
    }
It is basically no different from existing While loops, except that there is no pressure to cram as many operations as possible into one statement. You are now given plenty of statements to calculate the looping criteria.
Although a naming contest should be held to find better names for "prewhile" and "goif", the concept is very useful. It is not a very compact structure, but it beats allowing some of the notorious UNIX-influenced spaghetti-one-liner assignment statements.
Note that some Repeat Checker loops require too many statements for even leaky assignments to handle. Our new construct can handle these as well.
At least consider it.
A secondary advantage of eliminating leaky assignments is that the programmer does not have to alternate between a single equal sign and double equal signs (= and ==). This would eliminate a constant source of typos and bugs for beginners and multiple language users.
An astute reader pointed out that a simple break statement may accomplish the same thing. Example:
    while true
       x = stuff()
       y = morestuff(x)
       if not y
          break   // exit loop
       endif
       normal_loop_stuff()
    endwhile

However, breaks are kin to ill-reputed Goto's, and are not significantly less code than my proposed solution. But, they are common in existing languages. It is a choice between the dirty, familiar past or a cleaner, but unfamiliar future.
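For comparison, here is that break-based "repeat checker" shape in idiomatic Python (`read_all` is a made-up example that drains any iterable using a mid-loop break):

```python
def read_all(lines):
    """Collect items until the source is exhausted, using a mid-loop break."""
    it = iter(lines)
    result = []
    while True:
        line = next(it, None)   # "stuff()": fetch the next value; None signals end
        if line is None:
            break               # exit loop -- the part that is kin to a goto
        result.append(line)     # "normal_loop_stuff()"
    return result

print(read_all(["a", "b", "c"]))  # -> ['a', 'b', 'c']
```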
Another possible use of leaky assignments is simplifying initialization:
    foo = bar = them = those = 0

This assigns zero to all the listed variables. However, there are other ways to do this without leaky assignments:
    Store 0 to foo, bar, them, those   // XBase approach
    // or
    init(0, foo, bar, them, those)

The Init() function keeps with the function rule, but may be tricky to implement because of the variable quantity of parameters. However, just because it is tough to build using the language does not mean that the interpreter/compiler cannot do it.
The point is, we still have plenty of alternatives to leaky assignments, which are too abusable to release to the general public in my opinion.
The two basic issues with variable scope are subroutine-related scope rules, and individual variable modifiers that can override those rules.
    x = 5
    Aroutine()
    ...
    sub Aroutine() {
       print x
    }

If scope inheritance is active, then Aroutine will print "5" because it inherits "x" from its caller routine (which may be the main routine). Scripting languages tend to be a bit looser with regard to this than other languages.
Most languages fit into one of three categories. In the first category, scope is inherited automatically. The second category requires explicit coding to inherit scope, or at least gives both options; Pascal, a (non-scripting) language that allows scope to be inherited only if a subroutine is nested within another, belongs here. The third category, which includes C, allows no inheriting except via global (or per-file) variables and parameters.
In the hands of a sloppy programmer (or with bad luck), automatic inheritance can result in hard-to-maintain or hard-to-debug code. At the other extreme, the third option can make subroutines hard to split up when they grow beyond original expectations, and tends to encourage overuse of globals. Therefore, we are proposing the middle option.
However, Pascal provides a cumbersome way to control variable scope inheritance. Here is an example of an alternative:
    // Example 1:
    Sub myroutine(x, y, z) child of aroutine
       // stuff goes here
    EndSub

The first example lets "myroutine" inherit the variable scope of its designated parent (calling) routine, known as "aroutine" in this case.

    // Example 2:
    Sub myroutine(x, y, z) child of *
       // stuff goes here
    EndSub
The second example lets "myroutine" inherit the scope of any and all subroutines that call it. The asterisk acts as a wild-card.
If you find this a little too formal for a scripting language, you could use simpler keywords such as "inherit" or "noinher". "Inherit" says that a routine inherits the scope of its caller routine(s), while "noinher" prevents inheritance. "Isolate" could perhaps be used instead of "noinher". Example:
    m = 5
    arout "foo"
    arout2 "foo"
    ...
    sub arout(x)
       // I see m
    endsub
    sub arout2(x) isolate
       // I DON'T see m
    endsub

Note that this approach would not determine how child routines see the scope. It would be up to the children to "isolate" themselves if needed. Also note that if a statement like m = 2 was put inside of both arout() and arout2(), arout() would modify the original m to 2, whereas arout2() would not. In arout2(), m would be a regular local variable that would "disappear" when arout2() was done, leaving the original m untouched at 5.
Also note that "Isolate" has no influence on parameters nor globals (described later), only on regular routine-level variables created in parent and ancestor routines. (If globals and statics can be declared in routines, then Isolate would not influence these either; they would still be visible. Isolate only targets the "call stack" variables.)
Choosing between "inherit" and "noinher" may depend on whether scope inheritance is the default or non-inheritance is the default. We will be satisfied as long as both options are easily made available. Perhaps "inherit" would be "safer" from a software engineering standpoint because it would require an explicit request to inherit scope. (XBase uses scope inheritance by default).
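As a real-world reference point (Python, not the proposed language), nested functions show both behaviors at once: a nested routine reads its parent's variables automatically, but must explicitly request `nonlocal` to modify them, while a plain local assignment behaves like the "isolate" case above:

```python
# Python's nesting rules as an analog of "inherit" vs. "isolate".
def parent():
    m = 5
    def arout():
        nonlocal m     # explicit request to share the parent's m
        m = 2
    def arout2():
        m = 99         # a fresh local m; the parent's m is untouched
        return m
    arout()
    arout2()
    return m           # arout changed it to 2; arout2 did not

print(parent())  # -> 2
```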
Some people suggest using packaging or groups of routines to create a sort of "neighborhood globals", or "regional" scope. However, this approach still requires moving local variables out into this semi-global (or regional) location. Often a routine will grow a bit unwieldy, or suddenly gain the need to access a given portion of code from two or more different points within the routine. The easiest solution is to let these "sprouts" inherit the parent's scope. This eliminates the need to physically move variables to "higher ground" (regional or global) in order to be shared by both.
I don't see how this poses a "safety problem" as long as the relationship of the sprout(s) to the parent is clear. However, lack of a one-to-one relationship is common and generally accepted in scripting. Perhaps a language can have both "quicky" inheritance using something like the "inher" option, but also have a "child of" option for a tight relationship that specifies explicitly which routine(s) it inherits from.
Some languages allow access to non-local variables via a scope modifier. For example "caller::x" or "caller$x" refers to variable x of the calling routine. This is perfectly fine as an option, but should not replace scope inheritance so that routines can be split up when they grow larger without having to add a bunch of scope modifiers.
Some languages, like PHP, require a "Global" scope modifier in order to access regional variables. ("Global" is a misnomer since they are not really global.) However, this suffers from the split problem mentioned above. The chosen approach should not make splitting routines difficult.
Then there is Private, which allows a variable to have the same name as a variable in a parent routine without interfering with the parent variable. Private is not needed nearly as much if some form of the "noinher" option discussed above is available.
Related to Private is Local. Local is similar to Private, except that child routines also cannot inherit its scope (in addition to parents).
(Note that different languages use different names for these concepts. The names we use here are only to give labels for discussion purposes, not serve as a final suggestion.)
Finally, there is the Static type. It allows the variable to keep its value between subroutine calls. Normally, variables are reinitialized for each call to a routine. Static simply overrides this behavior. It is very useful for building add-on packages.
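Where a language lacks Static, one common workaround (shown here in Python; `counter` is a made-up example) is to hang the persistent value on the function itself so it survives between calls:

```python
# Emulating a "static" routine variable with a function attribute.
def counter():
    # the attribute persists between calls, like a Static variable
    counter.calls = getattr(counter, "calls", 0) + 1
    return counter.calls

print(counter(), counter(), counter())  # -> 1 2 3
```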
It can be argued that most of these modifiers are more than a scripting language needs. If the routine-level modifiers discussed above are available, then Private and Local may be somewhat redundant. Globals, however, are a near must for any scripting language. Globals can even serve as a "dirty" substitute for Static. (Globals have a greater risk of unintended alteration by a distant routine.)
I never was fond of null values. To me they are anti-scripting and meant for the more formal anal-retentive languages. This is because they tend to force one into checking for them before using a value. Example:
    if x != null   // typical annoyance check
       return x
    else
       return ""
    endif
    // variation:
    return iif(x != null, x, "")   // shows the useful iif function

If you are forced to check, you might as well check to prevent a null rather than after it is generated. (This gets into the sticky area of where nulls actually originate from in a program.) Further, they do not import or export very well between spreadsheets and database tables. If you really think that nulls earn their keep, then perhaps a compromise can be worked out. Nulls could be tested for, but otherwise return legitimate values such as blanks or zero. Thus, a function like isnull() can be used if one wants to deal with nulls, but no operation will trigger "Null Error" exceptions.
Also, null strings are even more problematic than null numbers. Or perhaps said another way, nulls have a little more meaning with numbers than with strings. Thus, if one feels that null numbers are computationally useful, then at least get rid of null strings. A blank, zero length string is sufficient.
Null is more of an attribute than a value under one plan. If we were OOP fans, we might say that nullness is a property of the variable rather than part of its value. Example:
    a = space(0)   // same as ""
    b = null       // also assigns ""
    show "lengths: " & len(a) & ", " & len(b) & ", " & len(a & b)
    show 'nullcheck: ' & isnull(a) & ", " & isnull(b)
    show 'equiv: ' & (a = b)
    // end of code

Output Results:

    lengths: 0, 0, 0
    nullcheck: false, true
    equiv: true

Note how both a and b are empty strings; however, only b is null. If this seems a little odd to you, then please consider the messiness and confusion of the alternatives. My suggestion simply isolates nullness from the value so that a programmer does not have to deal with the paradoxes of null values if they do not wish to. Consider this silliness of some languages:
    x = null
    if x > 34 { show "greater than 34" }
    if x <= 34 { show "less than or equal to 34" }

Neither of these Show statements would execute under some languages. To me this is a bit silly.
There are further issues to consider in deciding how, if, and when to carry nullness to a result. Example:
    a = null
    y = 3 + a

What will y be? 3? Null? Zero? 3 and null? Zero and null? Rather than give my favorite answer, I will point out that treating and/or storing nullness separate from the value does allow more possibilities than traditional null handling. The attribute approach can emulate the old way if needed, plus handle the "new" way and many combinations. In other words, the new (attribute) way can potentially act either old or new, but the old way cannot act new.
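The attribute idea can be sketched in Python by subclassing `str` so that nullness rides along without changing the value (`NullStr` and `isnull` are made-up names, not a real library API):

```python
# Nullness as an attribute: operations see an ordinary empty string,
# but the null flag can still be tested with isnull().
class NullStr(str):
    """An empty string that remembers it was 'null'."""
    is_null = True

def isnull(v):
    return getattr(v, "is_null", False)

a = ""           # plain empty string
b = NullStr()    # "null" empty string
print(len(a), len(b), len(a + b))   # -> 0 0 0
print(isnull(a), isnull(b))         # -> False True
print(a == b)                       # -> True
```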
A related and heated topic is division by zero. Should the result halt in error, be null, or be zero? Perhaps this and null selections can be a compiler/interpreter option so that the programmer can choose instead of the language builder. (Debates about these get almost as personal and heated as debates about semicolons.)
Many Perl fans argue that much of Perl's potentially cryptic syntax is necessary to provide Perl's power. However, we disagree. With enough well-chosen string and parsing functions, the power of Perl can be approached without sacrificing readability. Even Unix- and Perl-like regular expressions can be used within normal functions. (Although in my opinion, complex regular expressions should be broken up into separate statements for separate steps if possible. Some programmers like to use tricky, multi-operation regular expressions as a macho nerd litmus test. Thus, the expression compactness serves a social purpose instead of a business purpose.)
First, get rid of "hidden linkers" such as the "@_" variable. Their harm to readability is greater than their contribution to rapid development. Second, provide a rich set of string and parsing functions. Examples:
    replace("ab123ab456", "ab", "xy")     // result: "xy123xy456"
    list = split(",", "123,456,789")      // similar to Perl's Split
    astring = combine(",", list)          // opposite of Split
    trim(" abc ")                         // result: "abc"
    at("quick brown fox", "brown")        // result: 7 (position 7), zero = not found
    rat("ababababa", "b")                 // result: 8; starts search from right
    stuff("the low level", 5, 3, "high")  // replace "low" with "high"
    empty("   ")                          // result: true; returns true if blank or white spaces
    format(123.12, "######.##")           // result: "   123.12"

This is just a sample of string functions. Perl-like regular expressions can be implemented in some functions. Some of the above functions could also have optional parameters for options such as case sensitivity.
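For reference, most of these have close counterparts in Python's standard string methods (the 1-based positions below are adjusted from Python's 0-based `find`/`rfind`):

```python
# Python stdlib equivalents of the string functions listed above.
print("ab123ab456".replace("ab", "xy"))     # replace   -> xy123xy456
print("123,456,789".split(","))             # split     -> ['123', '456', '789']
print(",".join(["123", "456", "789"]))      # combine   -> 123,456,789
print(" abc ".strip())                      # trim      -> abc
print("quick brown fox".find("brown") + 1)  # at  (1-based) -> 7
print("ababababa".rfind("b") + 1)           # rat (1-based) -> 8
print("   ".strip() == "")                  # empty     -> True
print("%9.2f" % 123.12)                     # format    -> "   123.12"
```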
See the Macro section for some more examples of string parsing functions.
Here is one function that I wish to promote:
    AppendLine("filex.txt", astring)

This would open the given file, write the given line to the end of the file, and then close the file. It is one-stop shopping. It is ideal for debugging, and could be used for log and trace files. Note that a file handle does not have to be tracked. False is returned if there is an I/O error.
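A minimal Python sketch of this function (`append_line` is an assumed name; the error policy follows the text: return False on I/O failure rather than raising):

```python
import os
import tempfile

def append_line(filename, astring):
    """Open, append one line, close -- no handle to track."""
    try:
        with open(filename, "a") as f:
            f.write(astring + "\n")
        return True
    except OSError:
        return False

# demo in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "filex.txt")
append_line(path, "first")
append_line(path, "second")
print(open(path).read())   # prints "first" then "second"
```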
Note that some of our example functions names are somewhat long by scripting standards. This is primarily to make the examples easier to follow. However, it is perfectly understandable if a scripter wants shorter names, like AppendLn() instead of AppendLine(), and so forth. Keep in mind, however, that similar names like "substring" and "substitute" can cause confusion if not abbreviated carefully. (In this case, perhaps "replace" is a better choice than "substitute".) Generally, commonly-used functions should be shorter than the more obscure ones. (That means the language builder has to do some usage guessing.)
Associative arrays in Perl are indeed a nice feature to have in a language. However, they strike me as being a subset of table operations. Tables have much more flexibility than associative arrays. For example, they automatically can have multiple keys (indexes), non-unique keys, automatic persistence, record locking, filtered views, and many other features commonly found in relational and SQL-based systems. One does not have to rewrite the code when their collection (array) graduates from being simple to complex.
I rarely used arrays in XBase languages because tables were much more natural and convenient in XBase. XBase has a lot of flaws, but it provides hints of the power and convenience of tables. Unfortunately, table-friendly languages are rarely the focus of scripting organizations and research. It is an area where much language research and improvements can be done. Many programmers end up using arrays, files, and streams in a cumbersome way to do table-oriented operations when what is really needed is table-orientation. Arrays are a poor substitute for table operations!
Here is a hint of table-power using SQL-influenced syntax:
    Directory "../foo/*.txt" _alias flist   // put dir list in a table
    default _alias flist                    // pick default table handle to simplify syntax
    list "*" _orderby "fdate:d"             // list by date order, descending
    list "*" _orderby "fname"               // list by file name order
    list "*" _where "fexten = 'dat'"        // list all .dat files
    list "fname" _tofile "names.txt" _orderby "fname"   // list names to a file

    // Another syntactical variation:
    d = directory("../foo/*.txt")
    list(d, "*", #orderby "fdate:d")
    list(d, "*", #orderby "fname")
    list(d, "*", #where "fexten = 'dat'")
    list(d, "fname", #tofile "names.txt")

A very table-oriented language may not need quotes around many of these parameters.
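The same idea can be approximated today with an in-memory SQL table; here is an illustrative Python sketch using the standard sqlite3 module (the file names and dates are made-up sample data):

```python
import sqlite3

# A "directory listing as a table", queried like the examples above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE flist (fname TEXT, fexten TEXT, fdate TEXT)")
con.executemany("INSERT INTO flist VALUES (?, ?, ?)", [
    ("a.txt", "txt", "2004-02-01"),
    ("b.dat", "dat", "2004-01-15"),
    ("c.dat", "dat", "2004-02-10"),
])

# list "*" _orderby "fdate:d"  (by date, descending)
print(con.execute("SELECT fname FROM flist ORDER BY fdate DESC").fetchall())
# -> [('c.dat',), ('a.txt',), ('b.dat',)]

# list "*" _where "fexten = 'dat'"
print(con.execute(
    "SELECT fname FROM flist WHERE fexten = 'dat' ORDER BY fname").fetchall())
# -> [('b.dat',), ('c.dat',)]
```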
Tables could also assist with things like string parsing. For example, a substring search could produce the following table structure for each match:
    Position    - Starting position of the match.
    End_Pos     - Ending position of the match.
    Before_Char - Character just before the match (if any).
    After_Char  - Character just after the match (if any).
    Case_Diff   - Number of characters with a different case than the template.
    Etc.
Tables can make it easy to reference multiple pieces of information about multiple matches and other multiple-entity operations.
Note that the SQL APIs used by many languages to talk to database systems are a bit cumbersome for those used to scripting. They do not have to be this way. SQL was not originally designed for scripting, but could possibly be altered (or translated) a bit to be less verbose and cumbersome. Tables can be convenient and light on their feet if done in the right spirit. I will now step off of my table soapbox.
Associative arrays using "dot" syntax (in addition to square brackets) can actually be quite handy for use as interface mechanisms, such as holding an individual data record. However, using them as an alternative to multi-record databases/tables often backfires in my opinion as projects change and scale.
Using arrays of arrays in languages like Perl makes the interface to the collection too tied to the implementation. If you later want to use a linked list or static arrays or a database, then your collection calls may all have to be changed. A table-based interface (API) is more flexible in this respect. True, a "lite" engine may not provide all the features you need, but upgrading will not require rewriting the code to handle the more powerful collection engine. (Even if you don't like databases, putting collection manipulation operations behind an API can improve change-friendliness over "raw" access.)
Also see the Array section and OOP notes for alternatives or additions to associative arrays.
Many languages offer ways to link in or reference groups of other functions. We will call these groups "packages". Other packages are often referenced with commands such as "include", "use", "attach", "tie", etc.
One of the problems introduced by this approach is naming conflicts between same-named variables and/or functions. Often a prefix is used to distinguish names. Example:
    use "package1.prg" as pk1
    pk1::varx = 7
    foo = pk1::bar()

This syntax can get a bit cumbersome. Therefore, we propose that a package reference not be needed unless there is a conflict between names. In our example, unless there is a local bar() function, there would be no requirement to include the "pk1::" prefix. (Having it there anyhow may be considered good documenting by some, but excess "path coupling" by others.)
Note that a "sys::" prefix could be used to distinguish between user-defined functions and built-in functions. Thus, if you define your own "len" (length) function, then all references to the built-in function of the same name would have to be like "sys::len(x)". This may seem cumbersome, but it is a good reason to avoid using reserved words.
The two colons are only one possible symbol option. Some may suggest a period, although I prefer to reserve the period for possible dictionary arrays and/or objects. A dollar sign is another candidate.
Some languages have a handy feature that if a subroutine is not found in the current file, it looks for a file with the same name of the routine and sees if that routine is in the file. For example, if a call to "foo(x)" is made, and the routine is not found using the standard approaches, then the interpreter will look for a file called "foo.prg" in the current directory and see if that routine is in there. ("prg" is an example language extension, but would otherwise fit the language.)
The search priority can be 1) current file 2) referenced modules/files, 3) any file with name of function. For #2 (referenced files), the search order is usually the order in which they are declared, but at times it would be nice to override this with an optional search order:
    module("foo.prg", 5)
    blahblah(x)
    module("bar.prg", 4)

Here, module/file "bar" is searched before "foo" if "blahblah" is not local. Note that unlike an "include" operation used in some languages, the modules are only searched if a local routine is not found.
Pascal's nested routine approach appears to be the best solution to this, I have concluded. A nested routine is not going to be part of an outside name-hunt. If we want wider scope, we simply un-nest it. Nesting avoids extra modifier keywords/operations, and also solves variable scope issues at the same time, killing two birds with one stone. It just takes a while to get used to. But I have not tested the Pascal approach in really large applications, so I suggest it with caution. A keyword approach may still offer more dynamism and meta-abilities, even though it is messier. (Pascal traditionally requires routines to be defined in the order used, but we don't have to keep that convention.)
Aliasing is another option for managing names across packages:

    alias aPackage::myfunc1 as anotherfunc
    public alias aPackage::myfunc1 as anotherfunc   // global version

This could also be done by simply having anotherfunc call myfunc1; however, matching parameters can be a maintenance headache using such an approach.
A feature unique to scripting languages is the instant or implied formation of a new variable. One can suddenly say "x = 5" regardless of whether "x" was ever declared.
Although this is a nice feature, there are times and application types where one may want to limit this. Visual Basic, for example, provides an optional "Option Explicit" designator in the case that a programmer wants variable declarations to be mandatory.
However, one of the reasons Visual Basic needs this is that references are also given implied declarations, not just assignments. Let's look at the example assignment of "x = 3 + y". If "y" has not been assigned or declared, Visual Basic assumes some default value (something like null, blank, or zero).
In other languages, such as XBase and Python, only assignments can provide implied declarations. Any referenced variable must have been declared or assigned previously, otherwise a run-time error is triggered. I consider this more logical since there is no sense in reading an unassigned variable. The only possible reason I can think of for accepting undeclared references is to reduce run-time halts at the expense of having garbage output. Otherwise, I don't know what Visual Basic's reasoning is for that rather odd approach.
    // Allowed under Python-style:
    y = 7
    x = 3 + y   // x is a new var

    // Not allowed under Python, but okay in VB:
    x = 3 + y   // x and y are new vars
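Python itself demonstrates the assignments-only rule: assignment creates a variable, but reading a never-assigned name is a run-time error rather than a silent default:

```python
# Implied declaration on assignment; error on undeclared reference.
y = 7
x = 3 + y          # fine: x springs into existence on assignment
try:
    q = 3 + z      # z was never assigned
except NameError as e:
    print("error:", e)   # Python refuses to invent a default value
```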
Thus, there are 4 options:

1. Implied declarations for both assignments and references (the Visual Basic approach).
2. Implied declarations for assignments only; referencing an unassigned variable is a run-time error (the Python/XBase approach).
3. Explicit declarations required for everything.
4. A compiler/interpreter switch that lets the programmer choose among the above.
Favored - I prefer the option switch (#4), with #2 the runner up. #1 is totally out of the game and only mentioned because some languages actually use it.
Most scripting languages either have very few basic types, or allow easy and/or automatic conversions between the types. The concept of "type" can even be foreign to some languages. Since there are many variations on typing, I will not attempt to list mutually exclusive options. Instead we will describe some of the issues involved.
Almost all scripting languages provide strings and numbers. Sometimes the conversion between them is automatic, and sometimes explicit.
One of the problems with implicit conversion is "dirty numbers". For example, we can make a string "123.45stuff876". If we do math on it, it may be interpreted as 123.45. To reduce the chance of dirty numbers, there should be a tonum() function to convert or clean a number to its purest form. Perhaps also have a numcheck() function that returns zero if the number is clean, or the position of the first unrecognized character (1 is the first position).
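A sketch of what tonum()/numcheck() might look like, written in Python with regular expressions (the function names and exact cleaning rules are assumptions based on the text, not an existing API):

```python
import re

def tonum(s):
    """Extract the leading numeric portion of a string; 0.0 if none."""
    m = re.match(r"[-+]?\d*\.?\d+", str(s))
    return float(m.group()) if m else 0.0

def numcheck(s):
    """Return 0 if the string is a clean number, else the 1-based
    position of the first unrecognized character."""
    s = str(s)
    if re.fullmatch(r"[-+]?\d*\.?\d+", s):
        return 0
    m = re.match(r"[-+]?\d*\.?\d+", s)
    return (m.end() if m else 0) + 1

print(tonum("123.45stuff876"))    # -> 123.45
print(numcheck("123.45stuff876")) # -> 7   (the 's' of "stuff")
print(numcheck("123.45"))         # -> 0   (clean)
```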
Note that the Tonum or any conversion function should probably not generate a run-time error if used on its native type. For example, Tonum(5) and ToString("yep") should be perfectly legal under weak typing.
If the distinction between numbers and strings is weak, then some operators like the plus sign (+) cannot be used for both concatenation and math addition. A different operator will have to be used for concatenation. Visual Basic started using the ampersand (&) for string concatenation instead of plus when Microsoft loosened VB's typing. Other languages use a period for concatenation. One thing that bothers me about the period is that it can be mistaken for an OOP or "field" separator sometimes. Others have disagreed.
It may make the conversions a little bit cleaner and/or better documented to require conversions from numbers to strings be automatic (or transparent), but not the other way around. Example:
    print anum & astring & another_num
    x = anum + toNum(astring)

A "ToString" operation is not needed because there is never a chance of a number not being convertible. (Unless you allow those darn nulls.)
A similar issue must be addressed for comparison operations: in something like x = y, should the operands be compared as numbers or as strings when their apparent types differ?
Some languages offer a date type. However, dates can be handled just fine as an agreed-upon internal number or string representation and plenty of conversion, date parsing, and cleaning functions.
Note that this would generally be an internal (program) representation. There should still be formatting functions for external representations for input and output that would handle most of the international formats. Thus, the internal (program) representation and the interface to external date formats are not necessarily related.
Suppose we agree that dates are represented as "yyyy/mm/dd" for all built-in date functions. Here is an example that reads two dates in "mm/dd/yy" format, and picks the largest of the two, then outputs the result in original format:
    date1 = readline(h)   // input: "12/31/97" (strings)
    date2 = readline(h)   // input: "5/15/99"
    d1 = normdate(date1, "m/d/y")   // to "1997/12/31" (strings)
    d2 = normdate(date2, "m/d/y")   // to "1999/05/15"
    result = max(d1, d2)
    result = dformat(result, "m/d/yy:2")   // "5/15/99"
    writeln(result)

The NormDate function normalizes the date into yyyy/mm/dd format. The template "m/d/y" does not need to know how many digits each component has because it knows the slash (/) is the separator in this case. However, the number of digits is important for the Dformat function. The minimum digits is given by the number of characters for each part, but the maximum is given by the digit after the colon. In this case the minimum number of result year digits is two because there are two y's, and the maximum is also two because of the ":2".
If the century digits are not given, then the nearest year is assumed in the NormDate function.
Note that we could have used a different separator, a period, by doing:
    result = dformat(result, "m.d.yy:2")   // gives "5.15.99"

If any countries use colons as separators, then we may have to pick a different maximizer indicator. Further, perhaps other operators besides digits after the colon could indicate things like text months. Thus, we could have a template such as "m:t.d.yy:2" where the "t" indicates text months like "jan, feb, mar" etc. This is only a suggestion; perhaps somebody has a better formatting scheme.
It would be useful to have functions like DateDiff(), which indicates the number of days between two dates, and DayName(), which indicates the day of the week, such as "Thursday", and so forth. There is no need for an explicit date type to perform these operations.
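A Python sketch of the dates-as-strings approach: normalize to "yyyy/mm/dd" and comparison and difference operations follow from string or library operations. `normdate` and `date_diff` are hypothetical names modeled on the text, and the two-digit-year rule approximates the "nearest year" guess with a 50-year pivot:

```python
from datetime import date

def normdate(s, sep="/"):
    """Normalize 'm/d/y' strings to 'yyyy/mm/dd' (assumed 50-year pivot
    for two-digit years)."""
    m, d, y = (int(p) for p in s.split(sep))
    if y < 100:
        y += 2000 if y < 50 else 1900
    return "%04d/%02d/%02d" % (y, m, d)

def date_diff(d1, d2):
    """Days between two 'yyyy/mm/dd' strings (a DateDiff stand-in)."""
    to_date = lambda s: date(*map(int, s.split("/")))
    return (to_date(d2) - to_date(d1)).days

d1 = normdate("12/31/97")   # -> '1997/12/31'
d2 = normdate("5/15/99")    # -> '1999/05/15'
print(max(d1, d2))          # plain string compare works once normalized
print(date_diff(d1, d2))    # -> 500
```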
I will leave the representation of booleans up to you. The arguments for and against a dedicated boolean type seem to depend on the philosophy for the quantity and representation (below) of other dedicated types.
Visual Basic is one of the few languages that could reasonably straddle the strong-typing and dynamic-typing bridge. I am not recommending VB here in general, just pointing out a unique feature of it. Unlike, say, Java, this feature allowed different programmers to use the style that suited them or the project. I believe an internal type flag helped facilitate this. However, its "variant" type (unknown type) still allowed type-specific operations to be done on it.
Rather than directly muck up string operations by implementing Unicode, we propose Unicode be represented as such:
    Unistring = "1301,802,12101,10012,3321,etc."
    // or hex:
    Unistring = "0fc1,1c05,0ffa,1a23,etc."

Special functions can then be supplied to handle these kinds of strings.
Since HTML and SQL became common, it is my opinion that languages should allow both single quotes and double quotes as string containers. Example:
    x = '<TAG COLOR="#808080">'
    x = "Select * from tb where name='HOCKENS' "

This makes it easy to have single or double quotes inside the string without excessive use of escape characters. XBase and, to some extent, Perl allow both types of quotes.
Many Unix-influenced languages allow variables to be inserted into strings. Example:
    // the common way
    show "My name is " & name & " and I am " & age & " years old."
    // The Unix way
    show "My name is $name and I am $age years old."

See how much nicer the Unix way is? This would be very helpful for HTML applications. However, I think it might be safer to require the variable marker to be on both sides of the variable name. Example:
    show "My name is $name$ and I am $age$ years old."

There are other syntactical variations on this theme that we will not go into here because there is no clear winner among them.
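Python's standard `string.Template` gives a feel for the single-marker variant (`$name` rather than `$name$`); this sketch contrasts it with the concatenation style:

```python
from string import Template

name, age = "Alice", 30

# The "common way": explicit concatenation.
common = "My name is " + name + " and I am " + str(age) + " years old."

# The "Unix way", via string.Template's $name markers.
unix = Template("My name is $name and I am $age years old.").substitute(
    name=name, age=age)

print(common)
print(unix)                # same text either way
```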
As you may know, I am not much of an OOP fan. OOP tends to create syntax bloat, and its constructs can usually be handled just fine with traditional syntax as long as the language supports static structures. (Static means that it lasts between subroutine calls. However, if statics are not available, globals can be a substitute.) In short, OOP does not belong in a scripting language.
The biggest benefit of direct OOP is that it prevents one from using the wrong type of object with an operation. This tilts toward the strict typing of non-scripting languages and is not worth the extra language constructs and syntax bloat.
Using an object-oriented approach without direct OOP is somewhat like using files and file handles. The handle serves the same purpose as an instantiated object, except that it is usually an integer (or long) instead of an object type. Here is a file operation using direct OOP and indirect OOP:
    // OOP file
    fi = new file("sample.dat", READ)   // open for read
    fi.binmode = true                   // set to binary mode (see Perl 'binmode')
    line = fi.readline()
    fi.close

    // Non-OOP
    fi = fopen("sample.dat", READ)
    fbinmode(fi, true)
    line = readline(fi)
    fclose(fi)

As you can see, the biggest difference is where "fi" goes. The OOP way provides no usage benefits over the more traditional way. Perhaps building the file class/package itself may be somewhat easier in OOP, but using it is not. Therefore, I propose that complicating the language to handle direct OOP is not worth it. It nearly doubles the complexity of the language with only a few percentage points' increase in utility. There are much better areas to "spend complexity" on. (See the discussion about the function rule.)
Note that it may be possible to internally translate some OOP usage syntax into function call syntax so that those ingrained with the OOP way can use OOP syntax. If the interpreter saw "x.y = z", it could translate it as "y(x, z)".
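A rough Python sketch of that translation idea, using an attribute hook to turn `x.y = z` into `y(x, z)` (`OOPFacade` and `binmode` here are made-up illustration names, not a real API):

```python
def binmode(handle, flag):
    """An ordinary function playing the role of a 'method'."""
    settings = getattr(handle, "_settings", {})
    settings["binmode"] = flag
    object.__setattr__(handle, "_settings", settings)

class OOPFacade:
    """Rewrites obj.name = value as name(obj, value) when name is a function."""
    def __setattr__(self, name, value):
        fn = globals().get(name)
        if callable(fn):
            fn(self, value)          # the OOP-to-function translation
        else:
            object.__setattr__(self, name, value)

fi = OOPFacade()
fi.binmode = True        # actually executes binmode(fi, True)
print(fi._settings)      # -> {'binmode': True}
```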
If you do decide to go ahead and put OOP in your scripting language, then perhaps think about leaving out inheritance. Or, at least greatly reduce the syntax elements devoted to it. Inheritance is the most over-hyped and least useful aspect of OOP in my opinion. OOP as a component-building paradigm is perhaps becoming too common (out of sheer habit, maybe) to ignore. However, this aspect of it can still be realized without syntactical inheritance for the most part.
One way to get OOP without adding many new language constructs is to use associative arrays as objects. In each "key slot" (method or attribute) goes either a value or method code. The method code could be interpreted at run-time by either an Eval()-like function, or by putting parenthesis after the key:
    var x[]   // declare associative array
    x.attributeA = "foo"
    x["attributeA"] = "foo"   // same as prior (traditional syntax)
    x.methodC = "if z < 4 {zark(); lo=park()} return(lo)"
    x.parents = "zark, dark"
    ....
    print(x.methodC())       // execute method
    print(eval(x.methodC))   // same as prior

The "parents" key would tell the array to hunt the listed routine(s) or other dictionary arrays for any referenced key not defined in the current array. It is like a "search path" defined in some OS file systems. This, and perhaps parentheses for an "Eval" or "Execute" shortcut, are the only new features needed for full OOP. Languages that do this sometimes provide two different syntaxes for dictionaries: one with square brackets, and one with the dot. The square brackets allow spaces and other characters to be embedded within the dictionary key.
myDict.keyWithoutSpaces = "foo"
myDict["key with spaces"] = "bar"

The language could have formal class and method blocks, but these would simply be an alternative to creating objects via dictionary-array syntax alone.
Note that such an approach makes no linguistic distinction between "object" and "class". This is fairly common in scripting OOP. One can perhaps define an "object" as something that happens to inherit all its methods. Also, one may want to make the parent-list key something like "~~parents" or "__parents" instead of "parents" to avoid name collisions.
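The dictionary-as-object scheme with a "parents" search path can be sketched in a few lines of Python. The class and key names here (ProtoDict, __parents) are invented for illustration; the point is only that prototype-style inheritance falls out of an ordinary key lookup with a fallback chain.

```python
# Hypothetical sketch of the dictionary-as-object idea: a key lookup that
# falls back to a "parents" search path, similar to prototype inheritance.
PARENTS = "__parents"

class ProtoDict(dict):
    def lookup(self, key):
        """Return a key's value, hunting through parent dictionaries if needed."""
        if key in self:
            return self[key]
        for parent in self.get(PARENTS, []):
            try:
                return parent.lookup(key)
            except KeyError:
                continue
        raise KeyError(key)

zark = ProtoDict(speak="woof")
x = ProtoDict(attributeA="foo")
x[PARENTS] = [zark]

print(x.lookup("attributeA"))  # foo  (defined locally)
print(x.lookup("speak"))       # woof (inherited from a parent)
```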
Some languages make a large distinction between subroutines and functions, and others make almost none. In scripting languages there is no compelling reason to make a large distinction. If no value is explicitly returned, then the return value (if read) could simply default to a blank string or zero. Example:
gotback = mysub(12)
show 'Returned result: ' & gotback
show 'Length: ' & len(gotback)
//
sub mysub(thing)        // define subroutine
  thing = thing + 1     // or thing++
  // notice no Return statement here
endsub

Thus, there is no real syntactical distinction here between subroutines and functions. If you think a distinction is important, then simply generate an error if an attempt is made to read a value from a "sub" that has no Return statement (or did not get to it). The output:
Returned result: [blank] Length: 0
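The "default blank return" rule can be approximated in Python, where a function that falls off the end returns None; a thin wrapper (the invented call_sub helper below) coerces that to a blank string, matching the output shown above.

```python
# A rough Python analogue of the default-blank-return rule: if a subroutine
# falls off the end without returning, the caller sees a blank string.
def call_sub(sub, *args):
    result = sub(*args)
    return "" if result is None else result

def mysub(thing):
    thing = thing + 1   # notice no return statement here

gotback = call_sub(mysub, 12)
print("Returned result:", repr(gotback))  # ''
print("Length:", len(gotback))            # 0
```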
One syntactical issue to consider is whether or not to require parentheses around subroutine parameters. It probably would be a bad idea to omit them for function calls, especially when assignments are involved.
As mentioned in the criteria section, there are various methods employed by scripting languages to reduce repetition in code. We will examine methods of shortening inter- and intra-statement communication.
Perl and other languages often use default parameters and results to string (link) together commands. We will call this "command piping". However, the simplification provided by these constructs does not outweigh the readability risk they cause. One might as well do something like this:
t = somefunc(x)
another_op(t)   // "t" is used to pass result

This is not significantly more code than:
// Perl-like implied piping
somefunc(x)
another_op()    // operates on the result of prior func.

(Perl has a default result indicator specified by "$_".) The piping approach saved only a few keystrokes. This savings is not nearly enough to compensate for the readability risk. Thus, we recommend against built-in command piping.
A similar shortcut is eliminating the need to re-reference the result variable when something is being done to the same variable. Examples:
i = i + 1    // increment
i++          // C-like simplification
i =+ 1       // Another variation
astring = substitute(astring, "Borland", "Inprise")   // search and replace
astring =~ substitute("Borland", "Inprise")           // more Perl-like

The "=~" operator eliminates mentioning the variable again. The problem with these shortcuts is that they may hamper readability and violate the sacred function rule. However, altering the same variable is common enough in occurrence that some method should be provided for it. It also makes code modification safer because you have to change the result variable in only one place if you decide to change the variable being affected.
Thus, we propose a special operator that represents the assignment variable. Let's try "@". Examples:
i = @ + 5                                      // same as i = i + 5
astring = substitute(@, "Borland", "Inprise")
astring = @ & "append this"                    // assuming & is concatenation

This approach is more informative than the "hidden" Perl approach because @ is visible and explicit. It also does not require any weird syntax because it is basically just a replacement macro to the interpreter/compiler. (Note that the standard "++" incrementer should probably be kept. The @ is really meant for more complicated statements.) I know of no language that supports this approach, so I guess that makes me the tentative inventor.
Thus, I think we found a very nice compromise between the hiding approach of Perl and readability.
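Since @ is described as a simple replacement macro, a preprocessor pass for it is easy to sketch. The Python below is an invented illustration: it finds "lhs = expression" statements and substitutes the left-hand variable for each @. (It is naive; a real pass would need to skip "==", string literals, and so on.)

```python
import re

# Sketch of a preprocessor that expands the "@" assignment-variable shortcut:
# in "lhs = expression", every "@" in the expression becomes the lhs name.
_STMT = re.compile(r"^(\s*)(\w+)\s*=\s*(.+)$")

def expand_at(stmt):
    m = _STMT.match(stmt)
    if not m:
        return stmt
    indent, lhs, rhs = m.groups()
    return f"{indent}{lhs} = {rhs.replace('@', lhs)}"

print(expand_at("i = @ + 5"))
# i = i + 5
print(expand_at('astring = substitute(@, "Borland", "Inprise")'))
# astring = substitute(astring, "Borland", "Inprise")
```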
Some languages allow statements to be created and evaluated at run-time. Although there are a lot of names for this feature, perhaps some more appropriate than "macros", we will call them Macros for the time being. ("Macro" came from dBASE's usage). Examples:
x = 1
y = 2
amac = "x + y"
show %amac%        // result: "3"
// or perhaps:
show eval(amac)
// Note that we may not know the string at compile time
line = readline(handle)
eval(line)

In the first example, amac is a string that is evaluated at run time by putting percent signs around it and/or using the Eval() function.
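For a real-world reference point, the same idea exists in Python, whose built-in eval() evaluates a string as an expression at run time:

```python
# Run-time "macro" evaluation using Python's built-in eval():
x = 1
y = 2
amac = "x + y"
print(eval(amac))  # 3
```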
Although I found these very useful in some languages, it was often to get around weaknesses in the language itself rather than add power. However, there still are many situations where they are very useful. They are almost necessary for certain types of high-level abstraction factoring. Control Tables in Table Oriented Programming can make fine use of them, for instance.
Macros are very powerful, but can also lead to sloppy code. They also make compilation nearly impossible, limiting some or all of the language to interpretation. These problems can be reduced by limiting the places and circumstances where macros can be used. For example, some languages only allow macros to return the result of an expression. Under this limitation, you could not do this:
// A no-no in some langs
amac = "("
show afunc%amac%x)   // intended: afunc(x)

A further way to simplify implementation is to only allow variables in macros, not expressions. This would avoid having to implement a run-time expression evaluator.
Here is a hierarchy of macro implementation:
Note that these are not the only variations, but do provide options to consider. Generally the lower the number, the harder it is to fully compile the language.
An example use of run-time macros would be to implement embedded variable evaluation if the language does not support it. Embedded variables were described in the Quotes section. Here is another example:
repeats = 7
thing = "puppy"
show "He kicked the $thing$ at least $repeats$ times."
// Result: He kicked the puppy at least 7 times.

If a language did not implement this feature, macros could help us do so on our own:
sub show(s)
  private i, curvar="", result="", invar=false
  for i = 1 to length(s) step 1
    c = substr(s, i, 1)             // get one character at a time
    if c = '$'                      // variable name marker?
      if invar                      // must be ending marker
        result = @ & eval(curvar)   // evaluate macro and append to result
        invar = false               // reinitialize
        curvar = ""
      else                          // starting marker for embedded variable name
        invar = true
      endif
    else
      if invar do curvar = @ & c    // append character to var name
      else do result = @ & c        // append character to result
    endif
  endfor
  output(result)
endsub

Note that the "@" operator is presented in the Statement Communications Shortcut section, and that "&" is used for string concatenation in this example. We also assume that such a For loop would not loop if the length is zero. (Why some languages loop under "for i = 1 to 0 step 1" escapes me.)
This example simply traverses through the input string and substitutes the value of the variable in place of the variable name. This would be nearly impossible to implement without run-time macros, especially if the input strings are read from a file or keyed in during execution.
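The same character-scanning routine is runnable in Python. As a hedge against arbitrary code execution, this sketch looks the embedded names up in a dictionary rather than calling a full eval(); the function and parameter names are invented for illustration.

```python
# A runnable sketch of the interpolation routine above: variable names
# found between '$' markers are looked up in a dictionary.
def show(s, variables):
    result, curvar, invar = "", "", False
    for c in s:                     # one character at a time
        if c == "$":                # variable-name marker?
            if invar:               # must be the ending marker
                result += str(variables[curvar])
                invar, curvar = False, ""
            else:                   # starting marker
                invar = True
        elif invar:
            curvar += c             # append character to variable name
        else:
            result += c             # append character to result
    return result

env = {"thing": "puppy", "repeats": 7}
print(show("He kicked the $thing$ at least $repeats$ times.", env))
# He kicked the puppy at least 7 times.
```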
Complex regular expressions could also be implemented with macros (if not built into the language). Let's look at a Perl example first:
$content =~ s/%(..)/pack("c",hex($1))/ge;   # Perl version

This is used for CGI processing. It replaces all occurrences of %xx with its ASCII equivalent, where "xx" is a hexadecimal number. Thus, "123%4b123" would translate to "123K123". Here is our version, which can be implemented by using macros:
content = replace(@, "%(..)", "cnvrt('c',hex($1$))", "rge")The "e" in the 4th parameter tells the Replace function to execute (using Eval(x)) the third parameter, which contains the Cnvrt() function (equivalent to Perl's Pack). The "r" in the 4th parameter tells the Replace command to use regular expressions. (The code to implement regular expressions and Replace is too long to show here.)
Here is an alternative way to implement it without macros. It uses a function called Parse() which puts all the pieces of a string into a list (array), splitting on a pattern. For example:
array list[]
list = parse("%..", myString)          // uses a simple reg-ex
for i = 1 to len(list)                 // for each array element
  if substr(list[i], 1, 1) = '%'       // has hex indicator?
    list[i] = cnvrt('c', substr(@, 2)) // convert to character
  endif
endfor
myString = concat(list)                // put list back together

In summary, run-time macros allow one to add fancy features and domain-specific shortcuts that may not be built into the original language.
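For comparison, the %xx substitution is a one-liner in Python: re.sub accepts a function as the replacement, which plays the same role as Perl's /e "execute" modifier. The decode_percent name is invented for illustration.

```python
import re

# %xx-to-character substitution; the function argument to re.sub computes
# each replacement, much like Perl's s///e.
def decode_percent(s):
    return re.sub(r"%([0-9a-fA-F]{2})",
                  lambda m: chr(int(m.group(1), 16)),
                  s)

print(decode_percent("123%4b123"))  # 123K123
```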
One of the more interesting ways to implement arrays is to treat them more like associative arrays than a big grid. This allows them to be more dynamic than traditional arrays, and more in the spirit of scripting. Examples:
array a[]   // declaration
a[1] = 3
a[1,1] = "hey"
a[12,12,12,4] = 9
a["bob"] = 15
a[-1] = "foo"

If we could peer inside the storage structure for this array, it would resemble:
Key          | Value
-------------|-------
"1"          | 3
"1,1"        | "hey"
"12,12,12,4" | 9
"bob"        | 15
"-1"         | "foo"
This allows the array user to pick any type or range of subscripts they want without pre-declaration. It also allows the array to double as an associative array; thus it's a two-for-one deal!
Note that the interpreter/compiler would probably have to treat numeric and string subscripts differently.
This approach would not be as fast as traditional arrays for intensive math operations, but fast math is not the common domain of scripting languages anyhow. (Traditional arrays use subscript multiplication for position lookups instead of hash tables.)
Some basic built-in operations for these arrays could be ClearArray() to erase all the elements, ElemCnt() to return how many elements are in the array, and GetElem(3) to return the 3rd element (as sorted in the hash table).
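A minimal sketch of this storage scheme in Python: all subscripts, however many and of whatever type, are flattened into one comma-joined string key, matching the table above. The FlexArray class and its method names are invented for illustration.

```python
# "Everything is an associative array" storage: subscripts are flattened
# into a single string key backed by an ordinary hash table (dict).
class FlexArray:
    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(subscripts):
        if not isinstance(subscripts, tuple):
            subscripts = (subscripts,)
        return ",".join(str(s) for s in subscripts)

    def __setitem__(self, subscripts, value):
        self._store[self._key(subscripts)] = value

    def __getitem__(self, subscripts):
        return self._store[self._key(subscripts)]

    def elem_cnt(self):
        return len(self._store)

a = FlexArray()
a[1] = 3
a[1, 1] = "hey"
a[12, 12, 12, 4] = 9
a["bob"] = 15
a[-1] = "foo"
print(a[12, 12, 12, 4])  # 9
print(a.elem_cnt())      # 5
```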
Comparing two variables or expressions is often a lot trickier than most languages seem to assume. Should spaces be ignored? Should capitalization be ignored? Should they be compared as strings or as numbers? You get the idea. These issues seem to get short attention in most languages. If you are not using the default comparison handling, then they make you jump through hoops.
To help out with this, I suggest a compare() function, or perhaps cmp(), that takes various compare options. Example:
if cmp(foo, ">=", bar, "LC") ...

This compares foo to bar. In the options parameter, "L" means ignore leading spaces, and "C" means pay attention to upper/lower case (otherwise ignored).
Although this follows the Function Rule, some may consider it too cumbersome for frequent usage. Perhaps the language could allow something like this:
if foo >= bar (LC) ...

This has the options after the comparison in parentheses. Other syntax candidates are:
if foo (>=, LC) bar ...
if foo >=(LC) bar ...
if foo >=,LC bar ...
if cmp(foo, "LC>=", bar) ...

Although I take a lot of heat for it, I am leaning toward the last one. The letter positions would not make any difference (as long as they don't divide the comparison operator set itself). For example, these would all be interchangeable:
if cmp(foo, "LC>=", bar) ...
if cmp(foo, "LC >=", bar) ...
if cmp(foo, ">=LC", bar) ...
if cmp(foo, ">= LC", bar) ...
if cmp(foo, ">= L C", bar) ...
if cmp(foo, ">L=C", bar) ...   // Invalid!
Other possible option letters could be:
I - compare as integer, lop off any fractional part.
T - ignore trailing (white) spaces. (Similar to L).
S - compare as strings
N - compare as numbers
U - compare as Unicode chunks
A - ignore all white spaces, even in between
Note that these should not be case-sensitive, so "n" would mean the same as "N". (I hate case-sensitivity in the UNIX world).
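Parsing such a combined operator/options string is straightforward; here is a sketch in Python handling a few of the option letters above. The details (which operators, how conflicts like "S" vs "N" resolve) are assumptions for illustration, not a full specification.

```python
import re

# Sketch of a cmp() that parses a combined operator/options string.
# Case-insensitive by default; "C" turns case on, "L"/"T" strip
# leading/trailing spaces, "N" compares as numbers.
def cmp(a, op_opts, b):
    m = re.search(r"(<=|>=|!=|==|=|<|>)", op_opts)
    op = {"=": "=="}.get(m.group(1), m.group(1))
    opts = set(op_opts.replace(m.group(1), "").replace(" ", "").upper())

    def prep(v):
        if "N" in opts:
            return float(v)
        v = str(v)
        if "L" in opts:
            v = v.lstrip()
        if "T" in opts:
            v = v.rstrip()
        if "C" not in opts:
            v = v.lower()
        return v

    a, b = prep(a), prep(b)
    return {"==": a == b, "!=": a != b, "<": a < b,
            "<=": a <= b, ">": a > b, ">=": a >= b}[op]

print(cmp("  Foo", "L=", "foo"))  # True: leading spaces and case ignored
print(cmp("Foo", "C=", "foo"))    # False: case matters with "C"
print(cmp("12", "N>=", "3.5"))    # True: numeric comparison
```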
There is a fair amount of friction about the usage of symbols (&, *, %, @, #, etc.) in languages. UNIX-influenced languages like Perl tend to have many more symbols than languages that came out of other camps. (See Blocking for some examples, and also the function rule.)
It is my opinion that some languages have taken symbols too far. Most symbols almost completely lack "mnemonics". Words and abbreviations serve as better (but not always perfect) memory aides than symbols. However, I also realize that there perhaps should be some compromises due to common traditions.
Symbols are best used for common operations that tend to occur in long expressions. A good example is string concatenation. The "&" in parameter lists in C may also be justified under this criterion.
Beyond this, symbols are rarely justified, except perhaps for habitual reasons.
For example, UNIX-influenced languages tend to use certain symbols for Boolean operations. Among these are && ("and"), || ("or"), and ! ("not"). Since these are ingrained in the minds of a good many programmers, perhaps the ideal language would accept both sets of operators. (I believe there are a few that already do.)
Such a language could accept both of these:
x and y or not c
x && y || ! c

One justification that some have tried to use for symbols is that they are more (spoken-)language neutral than words or abbreviations. However, there are not enough symbols on the keyboard to cover more than a fraction of the reserved words of most languages. Besides, words are no more cryptic than symbols to a nonnative speaker of the target spoken language. Further, it is easier to alphabetize words and abbreviations than symbols, making translation dictionaries easier to use.
See "Context Insensitivity" below for more about symbols.
The recent web-oriented language PHP has triggered a little battle over context-sensitivity and symbols in languages.
Web languages often have "include" clauses that allow the program to pull in program code or text. How dynamic the timing of the "pull in" can be depends somewhat on the language being context-insensitive. This means that the language interpreter does not have to look far ahead, or even beyond the current token, to find out what the current symbol or token is.
One way to achieve this is to have a clear set of indicators that tell what type the token is. For example, in PHP the dollar sign indicates a variable. The interpreter does not have to look around to see that it is a variable based on context. Thus, PHP is "context insensitive" in this regard.
However, I am bothered by the heavy use of dollar signs (indicators) for every single variable. This not only increases typing, but also increases syntax errors because programmers often switch between different languages, some of which don't use dollar signs, and will keep forgetting things like dollar signs on variables.
Before we look at possible solutions, let's look at typical indicators used to create context-insensitive tokens:
x    - no indicator means it is a keyword
$x   - a variable
x(   - a routine (or at least the start of one)
x[   - an array
x.y  - a table.field reference (I tossed that in for thought)

One way to reduce "dollarage" is to swap the indicators for variables and keywords. Thus, you would get code such as:
$sub bar(x, y, z) {
  $while x > y {
    x--
    $if y = 7 {
      $break   // exit loop
    } $else {
      y = foo(z - y, x)
    }
  }
}

This example has 5 keywords and 11 variable references; thus we saved 6 dollars (11 - 5). Variables are usually more common than keywords, so dollaring the keywords usually reduces the quantity of indicators needed.
(Note that an underscore or some other symbol may be preferable because the $ convention is too common. If an underscore is used, then no variable or function may be named with a leading underscore. Also note that we have pegged underscores for use in named parameters in a different section, so they may not be appropriate. Our examples will use percent signs.)
We can reduce indicators even more! Keywords can be identified simply by their name because we already know what the keywords are. Thus, keywords and variables both don't need indicators. Before we discuss some small caveats, here is the algorithm in pseudo-code:
if the token is fully non-alphanumeric
  then it is an operator (+, -, /, }, {, etc.)
else if the token is in the keyword list
  then it is a keyword
else if the indicator is "%"
  then it is a keyword
else if the indicator is "("
  then it is a routine definition or declaration
else if the indicator is "["
  then it is an array
etc...

Notice that the keyword indicator ("%" in our example) is still optional for all keywords. The reason for this has to do with additions to the keyword list over time. For example, suppose the "break" statement in our example above was not in the original language specification, but was later added to the language. We might use variables named "break" because it was not a keyword at the time the program was written. If we did this, we could not run some programs under the new version of the language if they contained variables named "break".
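The classification pseudo-code above can be sketched as a small Python function. The keyword set and the idea of passing in the following character as the "trailing indicator" are assumptions for illustration; a real lexer would integrate this into its token stream.

```python
# Sketch of the token-classification algorithm: spelling first, then the
# optional "%" keyword indicator, then the trailing indicator character.
KEYWORDS = {"sub", "while", "if", "else", "break", "return"}

def classify(token, next_char=""):
    if not any(c.isalnum() for c in token):
        return "operator"           # e.g. +, -, /, {, }
    if token in KEYWORDS:
        return "keyword"
    if token.startswith("%"):
        return "keyword"            # explicit indicator
    if next_char == "(":
        return "routine"
    if next_char == "[":
        return "array"
    return "variable"

print(classify("while"))      # keyword
print(classify("%break"))     # keyword (explicit indicator)
print(classify("foo", "("))   # routine
print(classify("total"))      # variable
```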
For the record, I do not think this is a major problem at all. If by chance it happens to a program, simply find all the occurrences of "break" and change them to something else, like "mybreak".
However, many programmers are anal-retentive (AR) about such potential "name overlap" problems. (Why they would rather type jillions of dollar signs to prevent such a rare and easy-to-solve problem, I have no idea. But, I will cater somewhat to their concerns here.)
Thus, we will do two things to satisfy the AR programmers. First, if a new keyword comes along, then simply require an indicator for it. Our prior example would look something like this:
sub bar(x, y, z) {
  while x > y {
    x--
    if y = 7 {
      %break   // exit loop
    } else {
      y = foo(z - y, x)
    }
  }
}

Some programmers objected to the idea that some keywords have indicators and some do not. They did not like such "inconsistency." Although I find this a very minor issue, such programmers can simply still use indicators on all keywords in their own programs. Remember that the indicator is still optional for old keywords. (They may be annoyed by other programmers' code, but this is a small price for a little flexibility in my opinion. Programmers rarely like other programmers' styles anyhow.)
In summary, we found a way to greatly reduce the need for indicators, yet kept the need for context out of the picture. What is PHP's excuse now?
Also, some programmers claim that heavy use of indicators makes programs easier to read because the indicator reduces their need to read the whole token to figure out whether it is a keyword or a variable. However, I do not find this to be the case. To me, the keywords are well-learned after a few programs and are instantly recognized as keywords. This may just vary by individual, and a psychological study would be needed to see which reaction is most common in the programmer population.
Lately I have been fascinated with the idea of minimalism. The goal of minimalism is to keep the syntax simple without sacrificing power. This is often done by having complex libraries/API's instead of complex syntax. The trick is to find simple syntax that can "bend" to do or represent many different things.
We have seen examples of this with the function rule and the consolidation of dictionary arrays with OOP. It came up again when a reader suggested that I recommend "sets" in a language. However, a language with dynamic parameters could handle such a feature just by adding functions to the library(s):
if x in {0, 1, 2, 10, 20}        // set version
if isIn(x, 0, 1, 2, 10, 20)      // function version

I agree that the first version is slightly more "natural" than the second, but it is not used commonly enough to justify dedicated syntax. (At least not the way I program.)
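With variadic parameters, the "set membership" feature reduces to one ordinary library function; here is the one-line Python equivalent (isIn is the invented name from the example above):

```python
# Set membership as a plain library function instead of dedicated syntax:
def isIn(x, *values):
    return x in values

print(isIn(10, 0, 1, 2, 10, 20))  # True
print(isIn(7, 0, 1, 2, 10, 20))   # False
```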
Smalltalk and some versions of LISP have done a pretty good job at keeping the syntax simple. Even IF statements in some languages are nothing more than expressions or function-like things. However, that is probably taking things too far. IF statements are common enough to justify dedicated syntax.
A simplified version of LISP (LISP-Lite) can be represented with only these "syntax generators":
statement -> (command params)
statement -> (command)
params    -> params param
params    -> param
param     -> constant
param     -> variable
param     -> statement

IFs, loops, function declarations, assignments, etc. can all be specified with just this simple syntax. All this with only a few piddly generators! Wow! Although I question the human-readability of such a language, the concept is fascinating. (The parentheses in the first two lines are part of the language.)
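To show just how little machinery the grammar above needs, here is a sketch of a complete parser for it in Python: every statement is "(command params)" and a param may itself be a nested statement. The tokenizer and the nested-list parse tree are implementation assumptions for illustration.

```python
import re

# Minimal parser for the LISP-Lite grammar: tokens are parentheses or
# runs of non-space, non-paren characters; statements become nested lists.
def tokenize(src):
    return re.findall(r"[()]|[^\s()]+", src)

def parse(tokens):
    tok = tokens.pop(0)
    if tok != "(":
        return tok                  # constant or variable
    stmt = []
    while tokens[0] != ")":
        stmt.append(parse(tokens))  # a param may be a nested statement
    tokens.pop(0)                   # drop the ")"
    return stmt

tree = parse(tokenize("(if (< x 4) (zark) (park lo))"))
print(tree)
# ['if', ['<', 'x', '4'], ['zark'], ['park', 'lo']]
```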
One area ripe for minimalism in practical languages is collections. Languages like Python have dedicated syntax for tuples, lists, dictionaries, etc. This makes the language confusing and harder to read and learn in my opinion. Most collection handling can be moved to libraries/API's. Smalltalk has done a pretty good job of this, although its collection libraries are too hierarchical (IS-A) in my opinion. (The only "native" collection in my pet language is the dictionary, a.k.a. "associative array". Dictionaries are used there primarily as an interface mechanism, not so much for data holding. Lists can be useful too, but can come from a library alone.)
See Also:
"L" - A Draft Language Description
based on some of the above favorites
Procedural/Relational Language Helpers
Dynamic Relational Database