AJS's Software Development Blog: The Sand programming language

The following is a direct cut-and-paste from my specification of the Sand programming language on my Wiki (which is now down and may never be revived, depending on how much free time I have). This copy was fetched from Google's cache.

Syntax

Encoding

A Sand program consists of UTF-8 encoded Unicode text.

Characters

Throughout this document the word "character" will be used loosely to refer to a non-combining, printable Unicode codepoint, roughly analogous to a grapheme.

Characters also include the "non-printable" characters that make up the traditional "whitespace" set:

tab      U+0009
linefeed U+000A
return   U+000D
space    U+0020

Codepoints that the Unicode standard defines as whitespace (other than those above), combining or non-rendering are valid only within POD and within special parser blocks which explicitly allow for them. The same is true for "private use" codepoints, which some implementations may not allow, even within parser blocks.

See Sand: Rules for a discussion of terminology such as "punctuation" and "alphanumeric" with respect to Unicode. The terminology used in that document is also used here. Note, however, that there are some differences, specifically in the definition of "whitespace" which is much more liberal in rules than it is outside of them.

Lines

By default, lines of program text are terminated by a linefeed or the end of input. However, because carriage returns (U+000D) are considered whitespace, the carriage return/linefeed combination which is in use on some systems is almost never harmful.

Identifiers

Identifiers are sequences of one or more characters. They may not contain code-points which are used only for combining or are otherwise non-printing. The first character must be an alphabetic character or the underscore (_, U+005F). Subsequent characters must be alpha-numeric characters or the underscore. The alphabetic characters which make up an identifier must all come from the same Unicode block. Similarly, the numeric characters must all come from the same Unicode block. However, extension and supplemental blocks may be freely mixed with their basic block (e.g. to allow all Chinese characters to be used in the same identifier). So, for example, the following are all valid identifiers:

_
apple
_1234
æther

Translation for those not familiar with Unicode: identifiers are alphanumeric, much as in C or Perl 5, but characters from multiple languages may be mixed in one identifier only by combining numbers from one language with alphabetic characters from another. Identifiers which contain illegally mixed Unicode blocks will be accepted by the lexical analyzer as identifier tokens, but will then result in an error.

Japanese: a case study

In Japanese, there are three "native" alphabets and it is not uncommon to also mix in roman letters and numbers as needed. The strictures of Sand identifiers would allow only one of the three Japanese alphabets to be used in each identifier. Here are some examples:

サンド - "sando", a phonetic translation of "sand" in Katakana
砂 - "suna", a literal translation of "sand" in Kanji
サンド1 - Katakana with a 1

Identifiers may also be composed of multiple individual identifiers, joined with two colons, e.g.:

fruit::apple::seeds

The validity of each component identifier is determined on its own, without respect to the others in the chain. This is important, as it allows module hierarchies to contain identifiers made up of different Unicode blocks, e.g.:

Math::π
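The identifier rules above can be approximated in a few lines of Python. This is only a sketch: Python's standard library does not expose Unicode blocks, so the hypothetical helper block_of() uses the first word of each character's Unicode name (LATIN, KATAKANA, CJK, GREEK, DIGIT, ...) as a rough stand-in, which also happens to group supplemental Latin letters such as æ with basic Latin. All function names here are illustrative and not part of Sand.

```python
import unicodedata


def block_of(ch: str) -> str:
    """Rough stand-in for a Unicode block: the first word of the character name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def is_valid_component(ident: str) -> bool:
    """Check one identifier (no '::') against the rules described above."""
    if not ident:
        return False
    first, rest = ident[0], ident[1:]
    # First character: alphabetic or underscore.
    if not (first == "_" or first.isalpha()):
        return False
    # Subsequent characters: alphanumeric or underscore.
    if not all(ch == "_" or ch.isalnum() for ch in rest):
        return False
    # Alphabetic characters must share one block; so must numeric characters.
    alpha_blocks = {block_of(ch) for ch in ident if ch.isalpha()}
    digit_blocks = {block_of(ch) for ch in ident if ch.isnumeric()}
    return len(alpha_blocks) <= 1 and len(digit_blocks) <= 1


def is_valid_identifier(name: str) -> bool:
    """Each '::'-separated component is validated on its own."""
    return all(is_valid_component(part) for part in name.split("::"))
```

With this sketch, サンド1 and Math::π are accepted, while an identifier that mixes Katakana and Kanji in one component is rejected.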

By convention, all identifiers starting with two underscore characters are used only by Sand and its libraries. Use by end-user programs should be viewed with considerable suspicion.

There are four special, predefined identifiers which may be used as bareword terms:

undef
true
false
inf

Variables

All variables must be prefixed with a dollar sign ($, U+0024) and otherwise be composed of a single, valid identifier. There is no difference between a container variable (such as an array) and a scalar variable in terms of how the variable name is written or prefixed.

$x;
$Math::pi;
$cars = [ "Ford", "Toyota", "BMW" ];

The combined dollar sign and identifier are considered to be a single token, not an identifier with a unary operator. Other prefixes for identifiers are actually unary operators:

::ident - The type/class/namespace named ident
&ident  - A function reference to the sub, ident

Indexing

A variable, close-square bracket or close-parenthesis which is followed by an open-bracket ([, U+005B) indicates an indexing operation. The indexing operation is terminated by a matching close-bracket (], U+005D).

$x[10]
$y["flower"]
[1,2,3][1]
["a"=>1,"b"=>2]["a"]

Auto-quoting

In the cases of indexing and pair construction, literal strings may be written without the use of quotes as long as the string:

appears on the left-hand side of a pair construction (=>) or within brackets ([...])
is a valid identifier
does not contain ::

$y[flower]
[a=>1,b=>2][a]

Function invocation

An identifier, variable, close-square bracket or close-parenthesis which is followed by an open-parenthesis ((, U+0028) is the start of a function invocation.

function();
$functionref();
$function_vector["apple"]();
($functionref)();
function_returning_functionref()();

All function and method invocation must use parentheses around the parameter list, even when the parameter list is empty.

Method invocation

A variable or close-parenthesis which is followed by a period (., U+002E) indicates a method invocation. The period must be followed by either an identifier (the method name) or a variable (indirect method name).

Keywords

Keywords are those identifiers which have a predefined meaning when used without a prefix (such as $), suffix (such as (...)) or some other context which would map it to the name of some user-defined data. Typically keywords are used for constants which are constant with respect to the language, not the program or one of its libraries; or they are used for control structures and other non-function builtins. The full list is:

__END
class
else
elsif
false
if
inf
method
module
my
never
our
sub
true
undef
use
when

Note that variables and functions may have names which collide with keywords. The designation of a word as a keyword is only of interest to the compiler for purposes of tokenization.

Inline data

Data is represented in four primary ways:

Numeric

Numeric data can take all of the forms specified by this rule:

[ \+ | \- ]? [ \d<[\d_]>* [ \. \d* ]? | \. \d+ ] [ <:i>e \d<[\d_]>* ]?

Underscores are notational only, and do not affect the numeric value.
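As a rough illustration, the rule above can be transcribed into a Python regular expression. This is a sketch of one reading of the rule, not Sand's actual lexer; the name parse_numeric is hypothetical. Note that, as written, the rule allows no sign on the exponent.

```python
import re

NUMERIC = re.compile(
    r"""
    [+-]?                      # optional sign
    (?: \d[\d_]* (?:\.\d*)?    # digits (underscores allowed), optional fraction
      | \.\d+                  # or a bare fraction such as .5
    )
    (?: [eE] \d[\d_]* )?       # optional unsigned exponent, case-insensitive 'e'
    """,
    re.VERBOSE,
)


def parse_numeric(text: str):
    """Return the numeric value of a literal, or None if it does not match."""
    if NUMERIC.fullmatch(text) is None:
        return None
    cleaned = text.replace("_", "")  # underscores are notational only
    # Anything with a fraction or exponent becomes a float, the rest an int.
    return float(cleaned) if re.search(r"[.eE]", cleaned) else int(cleaned)
```

For example, parse_numeric("1_000") and parse_numeric("1000") give the same value, and a leading underscore is rejected because the rule requires a digit first.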

String

String literals take 4 forms: single-quoted or double quoted in balanced or unbalanced notation:

'...' or q{...} 
"..." or qq{...}

The only transformation performed on a single-quoted string is the replacement of two backslashes (\, U+005C) with a single backslash and the replacement of a backslash followed by a single quote (Unicode or ASCII) or close-brace with a literal single-quote or close-brace, respectively (which will not be counted when matching the initial token).

A double quoted string is scanned for backslashes, open-braces and dollar signs. Backslash sequences are:

\n - newline.
\r - return.
\f - form feed.
\a - bell.
\x - Followed by a hexadecimal codepoint to insert
\x{...} - Enclosed hex codepoint is inserted
\N{...} - The enclosed character name is inserted

Also, a newline or carriage-return/newline pair after a backslash consumes any subsequent spaces or tabs, and emits a single space. Any other non-alpha-numeric character following a backslash will simply result in that character as a literal, which will not be counted toward balancing the initial token (quote or open-brace). Thus \ and } may both appear within the string, preceded by a backslash to escape them.
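The backslash handling described above might be sketched in Python as follows. This covers only the simple escapes, line continuation and literal punctuation; \x, \N, dollar signs and braces are deliberately left untouched, and the function name unescape is an assumption made for illustration.

```python
SIMPLE = {"n": "\n", "r": "\r", "f": "\f", "a": "\a"}


def unescape(body: str) -> str:
    """Process backslash sequences in the body of a double-quoted string."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\" or i + 1 == len(body):
            out.append(ch)  # ordinary character, or a lone trailing backslash
            i += 1
            continue
        nxt = body[i + 1]
        i += 2
        is_crlf = nxt == "\r" and body[i:i + 1] == "\n"
        if nxt in SIMPLE:
            out.append(SIMPLE[nxt])
        elif nxt == "\n" or is_crlf:
            if is_crlf:
                i += 1  # skip the linefeed of a CR/LF pair
            while i < len(body) and body[i] in " \t":
                i += 1  # consume following spaces and tabs
            out.append(" ")  # the continuation emits a single space
        elif not nxt.isalnum():
            out.append(nxt)  # escaped punctuation stands for itself
        else:
            out.append("\\" + nxt)  # \x, \N etc. are outside this sketch
    return "".join(out)
```

Scanning left to right, rather than using a single substitution, keeps an escaped backslash that happens to precede a newline from being mistaken for a line continuation.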

A dollar sign allows the insertion of a simple variable. The only operation that may be performed on the variable is subscripting. Unlike the shell and Perl, there is no bracketing construct to isolate the name of the variable from surrounding text. Instead, use braces.

A brace-delimited substring is replaced by its evaluated results. So, the code:

$x = "Hello";
$y = [ "World" ];
$z = "{$x}, {$y[0]}\n";

yields the string "Hello, World\n" in $z. Any valid Sand program may be placed inside the braces. Its return value may be delivered by default (the value of the last statement) or using the return directive.

qw{...}

The contained text is stripped of leading and trailing whitespace, and then split on whitespace and returned as a list.
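For readers who know Python, the behavior described for qw{...} closely matches str.split() with no arguments, which also ignores leading and trailing whitespace. The function name qw below is only for illustration.

```python
def qw(text: str) -> list:
    """Split on runs of whitespace, ignoring leading and trailing whitespace."""
    return text.split()


fruit = qw("  apple banana\n  coconut ")
```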

String literals may contain any valid Unicode sequences.

Pairs

A pair is a non-delimited data structure which consists of a left-hand-side value, the => operator and a right-hand-side value. Pairs are typically used in the construction of named parameters, hash entries or elements of an ordered, associative list.

Lists

Lists are described using the square brackets ([, U+005B and ], U+005D), enclosing a comma-separated list of values:

$fruit = [ "apple", "banana", "coconut" ];

Associative lists are lists whose elements are all pairs:

$produce = [
  "apple"   => 1.95,
  "banana"  => 1.60,
  "coconut" => 2.50
];

Hash, array and alist variables may be populated from lists.

Adverbs

Adverbs (also called "modifiers") are identifiers preceded by a colon. They are used to modify the meaning of code or data. The colon must not have a space after it. The meaning of any given adverb is determined by its definition.

Adverbs may take parameters, just as a subroutine invocation, but do not require empty parentheses when no parameters are passed.

Declarations

The keywords my and our introduce declarations and limit scope.

Structure

Grammar

The full grammar for Sand may be found at: Sand: Grammar

Expressions

Like almost all C-like languages, expressions include terminals grouped by operators or surrounded by parentheses. Flow control statements are, with some exceptions, not available in expressions.

Unlike most C-like languages, blocks can be part of expressions (see below).

Parentheses are used to group expressions like so:

3 * ((6 / (8 + 100) / 4) - (7 ** 2))

Variable declarations are considered expressions, so the following is valid:

my $x = 7 + (my $y = 10);

Statements

A statement consists of one of the following:

An expression
A flow control statement
A block which has no trailing expression (see below)

Statement termination

A statement may always be terminated by a semicolon, though semicolons are optional in many cases and contraindicated in some. Here are the rules for statement termination:

A close-paren always marks the end of a statement when it is followed by whitespace which contains a newline unless:
- A balanced operator, such as parens or brackets, has not been closed in the current statement
A close-brace always marks the end of a statement when it is not followed by an equal or open-paren
A semi-colon always terminates a statement unless it occurs within a parser block (which may have its own rules for statement termination)

So the following are all valid statements which terminate without the use of a semicolon:

 print("Hello")

 if $a == 1 {
   ...
 }

These are not terminated statements:

 print(("Hello")

 {
   foo()
 } ==>

Blocks

Parser blocks are covered below. Braces that do not directly follow an identifier (with no intervening whitespace) indicate a code block, or simply a block. Blocks begin with an open-brace ({, U+007B) and end with a balanced, matching close-brace (}, U+007D). Typically, the matching close-brace flags the end of a statement, but the following cases (involving the characters that follow the close-brace) allow the current statement to continue past the close-brace:

==>
(

Each of these operates on the value of the expression to which the block belongs.

map *$list -> ($item) { $item+1 } ==> sort *;

{ &func }();

Parser blocks

A parser block is composed of a valid identifier followed by an open-brace ({, U+007B) and a matching, balanced, close-brace (}, U+007D). What comes between the two braces is determined by the grammar specified by the identifier. No space may appear between the identifier and the open-brace. Some examples include:

qq{...} - Interpolating quoted string.
q{...} - Non-interpolating quoted string.
re{...} - Regex/rule.

Single and double quotes are a special case. They are interpreted as a parser block as if q or qq had been used.

POD

Any line which begins with an equal sign (=, U+003D) followed by an alphabetic character begins a special sort of parser block which continues until the first line that contains =cut by itself. Within this block, all text is ignored for purposes of code generation and execution. These regions are intended only for documentation in the POD format.

print("Hello, world");

=head1 NAME

hello - The hello world program.

=cut

Note: Perl 6's perldoc format may be transitioned to in the future.

Data blocks

Defining complex data can be done using lists like so:

 $data = [ 1, 2, [ a => 3] ];

or, it can be done using special parser blocks called "data blocks". These data blocks can come in any form, but the two supported by the core language are YAML and JSON:

 $data = yaml{
   - 1
   - 2
   - a: 3
 }

and

 $data = json{ [ 1, 2, { "a":3 } ] };

The JSON implementation used is specifically a proper subset of the YAML implementation used, so any data placed inside a json data block can be placed inside a yaml data block without change, but not vice versa.

Regexes

Regexes are a special sort of parser block which generates closures with a special signature for use with rules for matching text. Each character of a regex is, by default, a literal which matches itself in the input. Exceptions such as alternations, quantifiers and special grouping constructs allow any grammar to be described.

For more information, see Sand: Rules.

Rule definition

When the rule keyword is used, special processing takes place. Its "block" is implicitly of parser block type "re". That is:

rule digits { \d+ }

Is parsed as if the block were preceded by the re parser block identifier, and is not parsed as normal Sand code.

Rules may take parameters like subroutines, but may not redefine their return value (which is always passed via the auto-lexical, $/).

Operators

The operator precedence levels are (all operators are infix, binary unless otherwise noted):

terms, ... term
(), [] post-circumfix
.
++, -- prefix/suffix
**
!, ::, &, *, +, - prefix
~~, !~
*, /, @
+, -
<<, >>
<, >, <=, >=, lt, gt, le, ge
==, !=, <=>, eq, ne, cmp
&
| ^
&&
||
.., ^.., ..^, ^..^
??!! ternary
=, +=, -=, etc.
=>
,
not prefix
and
or, xor

The associativity for these operators is:

left: terms, ... term,
., ~~, !~, &, | ^, &&, ||, *, /, =>, ,, and, or, xor
right: (), [] post-circumfix,
!, ::, &, *, +, -, not prefix,
??!! ternary
**, @, +, -, <<, >>, =, +=, -=, etc.
non: ++, -- prefix/suffix,
<, >, <=, >=, lt, gt, le, ge, ==, !=, <=>, eq, ne, cmp, .., ^.., ..^, ^..^

Blocks, closures and subroutines

Any text which is not a parser block, and is enclosed by braces is called a "block" or "simple block". A block is a grouping of statements which can form a closure. For example, the following code creates a closure and then calls it:

$hello = { print("Hello, world"); };
$hello();

Blocks are also used by most of the loop and control operators such as if and while:

if $a == $b { print($a); }

while and for both accept blocks that are optionally parameterized. Parameters are specified before the start of a block with -> followed by a parenthesized parameter list:

for 1..100 -> ($odd,$even) {
 print("Odd: $odd Even: $even");
}
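The example above appears to consume the range two values at a time, one per block parameter. Assuming that reading, and assuming 1..100 is inclusive at both ends, a Python approximation looks like this; in_pairs is a hypothetical helper, not a Sand builtin.

```python
def in_pairs(values):
    """Yield consecutive, non-overlapping pairs from an iterable."""
    it = iter(values)
    return zip(it, it)  # each step pulls two items from the same iterator


lines = [f"Odd: {odd} Even: {even}" for odd, even in in_pairs(range(1, 101))]
```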

A semicolon may appear after a block, but if a close-brace occurs at the end of a line, then it automatically terminates the current statement.

As a special case, these two usages have identical behavior:

-> ($a, $b, $c) { ... }
{ -> ($a, $b, $c) { ... } }

That is, you may provide the parameter declaration for a block within an enclosing set of braces. This allows interpolated blocks such as those found in strings and rules to parse more easily, and yet still provide parametrization.

Subroutines

Subroutines are blocks of code, no different from any other block except that they can be named and thus scoped. A subroutine definition matches this rule:

[ my | our ]? \s+ sub \s+ <identifier> <parameters> [ <return> ]? <block>

The parameters are enclosed by parentheses and may include a leading invocant which is separated from the other parameters by a semicolon.

The optional return value specifier is of the form:

\-\> \s* [ '!'? <type> | <return-typelist> | never ]

As a special case, the keyword never indicates that the routine does not return. This typically happens when a routine is intended to perform final debugging, notification or cleanup before it calls exit.

The return-typelist consists of a parenthesized sequence of zero or more type names, separated by commas and optional whitespace:

sub swapnums(num $a, num $b) -> (num, num) { return($b, $a) }

Invocants are only used in conjunction with methods (declared with the "method" keyword, rather than the "sub" keyword).

method x ($me; $when, $where, $how) { ... }

See the section on classes for more detail on methods and their invocants.

Functions or methods may be declared with a single return value which is preceded by an exclamation point like so:

method x ($a, $b, $c) -> (!bool) { ... }

This declares that the method returns a boolean value as normal, but it implicitly wraps the function in an exception handler and calls the appropriate method on any exception raised in order to convert it to the given type, returning that type. If more than one value is normally returned, then any return values not declared with the exclamation point will be undefined.
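One way to picture the implicit exception handler is as a Python decorator. The sketch below is an analogy only: SandError, its to_bool method and the decorator name returns_bang_bool are all hypothetical, and the choice to convert a failure to False is an assumption.

```python
import functools


class SandError(Exception):
    """Hypothetical exception type that knows how to convert itself."""

    def to_bool(self) -> bool:
        return False  # assumed: a failure reads as false


def returns_bang_bool(func):
    """Wrap func so that a raised SandError is returned as a bool instead."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except SandError as err:
            return err.to_bool()  # convert the exception to the declared type

    return wrapper


@returns_bang_bool
def is_even(n):
    if n < 0:
        raise SandError("negative input")
    return n % 2 == 0
```

A caller of is_even never sees the exception; it only ever receives a boolean.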

Control structures

Conditionals

The conditional control structures are if and given.

if $a == $b {
  print("$a == $b");
}

given $x {
  when 1 { print("x is one") }
  default { print("x is unexpected") }
}

Loops

There are two primary looping operators, while and for.

while =$stdin {
  print();
}

for *$list -> ($e) {
  $total += $e;
}

The following are variations of for which collect the block's return value:

*$list2 = map *$list1 -> ($e) {
  $e + 1;
}

*$list2 = grep *$list1 -> ($e) {
  $e > 0;
}

Because chaining map and grep can be cumbersome, the ==> operator is provided:

*$list2 = map *$list1 -> ($e) {
  $e + 1;
} ==> grep * -> ($e) {
  $e > 0;
} ==> sort *;

The ==> operator creates a pseudo-value * which contains the list returned by the left-hand-side expression, and makes it available to the right-hand-side expression, returning the result of the right-hand-side expression. When used in this way, map and grep act as expressions, not as statements.

Because of this use, map, grep and any related control loops are considered expressions, not statements like for.
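Reading the chained example from left to right, each ==> stage receives the previous stage's list as *. A Python sketch of the same data flow, with the corresponding Sand fragment in comments, could look like this (the function name pipeline is illustrative):

```python
def pipeline(values):
    """Mirror the map ==> grep ==> sort chain as three explicit stages."""
    incremented = [e + 1 for e in values]         # map *$list1 -> ($e) { $e + 1 }
    positive = [e for e in incremented if e > 0]  # ==> grep * -> ($e) { $e > 0 }
    return sorted(positive)                       # ==> sort *


list2 = pipeline([3, -5, 0, -1, 1])
```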

Types

Variables are declared with the my and our keywords. Between the keyword and the variable is an optional type name. The built-in types are:

int
num
buf
str
array
alist
hash
bool
type
class
module

The following adverbs modify the meaning of these types:

For int and num:

:bits(width)
:unsigned

For str:

:encoding(name)
:charset(name)
:language(name)

For array, alist and hash:

:of(type)
:key(type)

Namespaces

A namespace is introduced with the module keyword. There are two forms:

module Foo; # until the end of current lexical scope
module Foo { ... } # only inside the given block

Lexicals from the enclosing scope are available:

module Foo {
  my $x = 1;
  module Bar {
    print($x);
  }
}

A library which is accessed via the "use" directive must define its module namespace before performing any other action, and no other statements may come after that module. Because a class is a kind of namespace, a class declaration may substitute for a module declaration. That isn't to say that only one module can be declared per library, but subsequent modules must be nested like so:

 module Foo {
   module Bar {
     ...
   }
 }

Would be accessed like so:

 use Foo;
 print(Foo::Bar::somefunc())

The module declaration can take the adverb :exports after the module name and before the brace. It must give a list of exported symbols as strings. This is not the advised way to use modules. Typically, they export nothing and users of the module can selectively import anything they wish or nothing.

The use statement cannot import symbols that are declared with the "my" keyword, which can be applied to modules, classes and subroutines as well as its more traditional use in declaring variables. Methods cannot be exported from a class, but may also use "my" to indicate that they may not be called from external classes unless those classes are nested within the current class.

Here are some examples:

 module Foo {
   my $private_var = 1;
   our $public_var = 2;
   my sub private_sub() { ... }
 }

The use statement may take several adverbs:

:import: A list of strings which name items to import from the target namespace into the current namespace.
:alias: A string which is the name of an alias for the module within the current scope.
:version: A major and minor version which is the minimum version of the module that will be accepted. e.g. :version(major=>3, minor=>0)

Classes

See Sand: Object Oriented Features

A class is a namespace that is introduced with the class keyword.

class Dog :is(::Animal) {
  our int $legs = 4;
  my str $color;
  my int $age :rw;
  method bark() { .dosound() }
}

This example demonstrates a class Dog which derives from class Animal and has three attributes: an integer number of legs set to 4, common to all dogs; a string describing its color which has no default value but can be initialized at construction; and age which is an integer, and can be changed externally at any time. This class might be used like so:

my Dog $spot(color => "white");
$spot.age(2);
$spot.bark();

Only one parent class may be defined using :is. If no :is modifier is provided, the immediate ancestor of the class is assumed to be the parent of the namespace path. That is:

class Animal::Dog { ... }

Would define the same parentage as the previous example.

If there is only one element in the identifier for the class name, then its parent defaults to Object.
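The default-parent rule can be stated compactly in Python. This is a sketch of the rule as described, operating on class names as plain strings; default_parent is a hypothetical name.

```python
def default_parent(class_name: str) -> str:
    """Parent implied by the namespace path when no :is modifier is given."""
    parts = class_name.split("::")
    # Drop the last component; a single-element name falls back to Object.
    return "::".join(parts[:-1]) if len(parts) > 1 else "Object"
```

So Animal::Dog gets Animal as its parent, and a bare Dog gets Object.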

Roles

A role looks much like a class, and is very similar, but roles cannot be used to instantiate objects. Instead, they are used to control the composition of classes. A role is never the parent of a class. Instead, it is "composed" into the class's definition. For example:

role Animal::Flying {
  has $wings;
  method take_flight() { ... }
}
class Animal::Dog :does(::Animal::Flying) {
  # a dog that can fly!
}

Now Animal::Dog will be composed as if it contained the text of Animal::Flying. Because classes can only have a single parent, this can greatly increase the flexibility of class construction. To test for a role, use the does function:

if does($x, ::Animal::Flying) { ... }

This does not determine if the named role was used in composing the object in question's class (that information is lost during composition). It only tests the object's capabilities (called "properties") to determine their compatibility with the properties of the given role. If they do match, true is returned. If they do not, then false is returned.
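The capability test described here resembles structural ("duck") typing. As a loose Python analogy, and assuming that a role's properties are simply its public attribute and method names, a check might look like the following. The classes and the does function are illustrative only.

```python
def public_names(obj) -> set:
    """Names of the public attributes and methods visible on obj."""
    return {name for name in dir(obj) if not name.startswith("_")}


def does(obj, role) -> bool:
    """True if obj offers everything the role declares, however it got it."""
    return public_names(role) <= public_names(obj)


class Flying:  # plays the part of the role
    wings = None

    def take_flight(self):
        raise NotImplementedError


class Dog:  # unrelated to Flying by inheritance
    def __init__(self):
        self.wings = 2

    def take_flight(self):
        return "up"


class Cat:
    def meow(self):
        return "meow"
```

Here does(Dog(), Flying) succeeds even though Dog never mentions Flying, which mirrors the point that only capabilities are compared.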

Notice that take_flight is undefined. This is typical of roles. They define interfaces (in the Java sense) that describe what the class is responsible for providing (through composition with other roles, inheritance from a class, or definition within the class).

Notes

While & is not used to prefix subroutine calls, Perl 4/5-like usage, &function(...) will actually work because it takes a code-ref and then invokes it.
Perl 6 handles parens more elegantly than Sand, currently. More work needed there. Specifically, the ambiguity between list-context and expression grouping needs to be resolved.

TODO

Types
Classes/objects/roles
Object system
Dispatch
Operators
- Comparisons
- Precedence
- Hyper-operations?
Exceptions
Evaluation
Regex doc
Generators / reduction / etc.

Tuesday, January 21, 2014
