Self-Understanding Programs


                          Mitch Bradley


                            _A_B_S_T_R_A_C_T

         This presentation describes a wordset that makes it
     possible  to  write  portable programs which understand
     themselves.  The most obvious example is a Forth decom-
     piler, which is traditionally non-portable.


The Problem

    It is not possible to write a "standard" decompiler,  because
the  details  of  what  goes  into  memory when compiling a colon
definition are left to the implementor.  A decompiler  must  know
these  details,  thus it is non-portable.  Other interesting pro-
grams face the same problem (e.g. Kim  Harris's  Structure  Chart
program[1]).

    Such programs are examples of "Self-Understanding  Programs",
programs  that  know  the format in which they are stored.  Self-
Understanding is possible in a few other languages, such as  Lisp
and  PostScript.  In these languages, a program is a visible data
type.  In Lisp, a program is just a list, which is the basic Lisp
data  type.  In PostScript, which is very much like Forth in many
ways, the equivalent of a colon  definition  is  compiled  as  an
array of objects, and a PostScript "word" is one kind of object.

    In constrast, there is no standard way in which the body of a
Forth colon definition is stored in memory.

The Solution

    Rather than constrain  Forth  implementations  by  specifying
details  of  how  definitions  are  compiled, it is sufficient to
define a set of words which can "unravel"  the  compiled  defini-
tions.

    In the process of porting  a  decompiler  to  four  different
Forth implementations, I have developed a relatively small set of
Forth words which are helpful in encapsulating the implementation
details of the system. The key observation is that a colon defin-
ition body contains only a few different kinds of things:  refer-
ences  to  other  Forth  words,  branches,  literals,  and string
literals (like ." ).  The new wordset understands  the  way  that


W. M. Bradley            March 25, 1986         Page 1 of 5 pages


references,  branches, literals and string literals are compiled.
Only this small set of words has to be re-implemented for  a  new
system, and the decompiler is largely portable.

    This wordset can be used to write other  programs,  not  just
decompilers,  which can decode and understand compiled Forth pro-
grams.

    The  wordset  works  well  for  different  threading  schemes
(direct,  indirect,  token,  subroutine), different branch styles
(absolute, relative), and different address lengths (16-bit,  32-
bit,  absolute,  relative).   It is probably insufficient to deal
with Forth systems that compile directly to native code [2],  but
it may be helpful even there.

The Colon Definition Model

    This scheme assumes that Forth colon definitions contain  the
following kinds of things:

Tokens
    A Token is a compiled reference to another Forth word.   This
    does not imply the use of token-threaded code.  Tokens can be
    absolute addresses, relative addresses, indices into a table,
    jump-to-subroutine  instructions, or whatever the system com-
    piles  when  a  Forth  word  is  referenced  inside  another.
    Another  way  of looking at the same thing is that a token is
    the way that a compilation address is stored within the  body
    of a colon definition.

Branches
    A Branch is something that is compiled inside a Forth defini-
    tion  in  order  to  change the order of execution of Tokens.
    Each Branch has a destination, which is the the new value  of
    the  Interpreter  Pointer  if the branch is taken.  Different
    implementations compile the branch destination  in  different
    ways;  absolute addresses, relative addresses, etc.  Examples
    of branches are the run-time words compiled by IF  ,  ELSE  ,
    WHILE , REPEAT , LOOP , etc.

Literals
    A Literal is a parameter that is compiled "in-line" inside  a
    colon  definition.   The  most  common  example is a compiled
    number.

String Literals
    A String Literal is a string that is compiled  into  a  colon
    definition,  perhaps  by  ."  or ABORT" .  The implementation
    differences between  compiled  strings  usually  result  from
    alignment  restrictions  of various processors.  For example,
    the 68000 requires a 16-bit item to  be  stored  at  an  even
    address,  so compiled strings are usually padded so that they
    end on an even boundary.


W. M. Bradley            March 25, 1986         Page 2 of 5 pages


Properties of Words

    Each word, regardless of its  type  (colon  definition,  code
definition,  variable,  constant,  etc)  is assumed to have these
attributes:

Compilation Address
    The compilation address of a word is  the  address  which  is
    returned  by FIND.  If the word is headerless, it still has a
    compilation address, which is the  address  that  would  have
    been returned by FIND before the name was thrown away.

Name
    The name is some  string  which  represents  the  word.   The
    "real"  name  of  the  word cannot always be determined.  For
    instance, the word may be headerless or the  name  stored  in
    the  name  field  may  be  truncated, as in systems that only
    store the first 3 characters of the name.  In the first case,
    the name could be represented as Wnnnn, where nnnn is the hex
    representation of the compilation address of  the  word.   In
    the  second case, the name could be represented as, for exam-
    ple, INT------, where the dashes show the missing characters.

Immediate Flag
    A word is either IMMEDIATE or not.  If the  word  is  header-
    less, in which case it may be impossible to determine whether
    or not the word is immediate, it is reasonable to assume that
    the  word  is not immediate, since headerless compiling words
    aren't used often.

Definer
    The Definer of a word "A" is the compilation address  of  the
    word "B" which created the word "A".  For a colon definition,
    the definer is the compilation address of the word ":".

The Wordset

    The _s_p_e_c_i_f_i_c_a_t_i_o_n and use of the following  words  is  system
independent  and  portable.   The _i_m_p_l_e_m_e_n_t_a_t_i_o_n is highly system
dependent, hence I do not show the implementation.

    A unifying rule followed here is that an  address  is  always
represented  _o_n  _t_h_e  _s_t_a_c_k as an absolute address, regardless of
how it is stored inside a colon definition.   This  is  true  for
both  compilation  addresses  and  branch  destinations.   If the
address is on the stack, it is absolute.

TOKEN@ ( addr -- cfa )
    cfa is the (absolute) compilation address which  is  compiled
    at  addr.   In  a  system  that  compiles absolute addresses,
    TOKEN@ is the same as @ .  Otherwise, TOKEN@ has to do  what-
    ever  is  necessary to put an absolute compilation address on
    the stack.


W. M. Bradley            March 25, 1986         Page 3 of 5 pages


TOKEN! ( cfa addr -- )
    The absolute compilation address cfa is stored at addr.  In a
    system  that  compiles absolute addresses, TOKEN! is the same
    as ! .   Otherwise,  TOKEN!  compiles  cfa  in  some  system-
    dependent manner.

/TOKEN ( -- n )
    Pronounced _p_e_r _t_o_k_e_n.  n is the length in bytes of a compiled
    token.   For example, a system which compiles 16-bit absolute
    addresses would have /TOKEN equal to 2.

>TARGET ( addr -- destination-addr )
    Pronounced _t_o _t_a_r_g_e_t.  destination-addr is the branch  desti-
    nation address of the (high level) branch that is compiled at
    addr.  Note that addr is the address of the  branch  instruc-
    tion,  not the offset word.  For instance, if the system com-
    piles  BRANCH -10  for REPEAT , addr would be the address  of
    the BRANCH .

/BRANCH ( -- n )
    Pronounced _p_e_r _b_r_a_n_c_h.  n is the length in bytes of  compiled
    branch  destination.  For example, a system which compiles 1-
    byte relative branch offsets would have /BRANCH equal  to  1.
    A  system  which  compiles  16-bit  branch offsets would have
    /BRANCH equal to 2.

+STR ( addr1 -- addr2 )
    addr2 is the first available address after  the  string  com-
    piled  at  addr1, taking into account any padding that may be
    appended to the string.

>DATA ( cfa -- data-address )
    Pronounced _t_o _d_a_t_a.  data-address is the address of the  data
    storage  area  associated  with  the  word  whose compilation
    address is cfa  .   For  example,  for  variables,  >DATA  is
    equivalent  to  >BODY  , for user variables, >DATA returns an
    address in the user area, etc.  The implementation of this is
    not necessarily very efficient.

DOES-IP? ( addr1 -- addr2 flag )
    addr1 is assumed to point just after the run-time  word  com-
    piled  by  ;CODE or DOES> .  flag is true if the code at addr
    is a DOES> clause as opposed to a ;CODE clause.  In the  case
    of  a  ;CODE  clause, addr2 points to the actual machine code
    that is the ;CODE clause, skipping whatever linkage code  was
    necessary to set up the execution of the clause.  In the case
    of a DOES> clause, addr2 points to the first token  that  was
    compiled as part of the DOES> clause, skipping whatever link-
    age code was necessary to set up the execution of the clause.
    This is a necessary kludge.

C.ID ( cfa -- )
    Displays the name of the word whose  compilation  address  is
    cfa.   This is similar to the phrase  >NAME .ID , but it must


W. M. Bradley            March 25, 1986         Page 4 of 5 pages


    cope properly with the case of headerless code.  If the  word
    is  headerless, C.ID displays the string Wnnnn, where nnnn is
    the hex representation of the compilation address cfa .

IMMEDIATE? ( cfa -- flag )
    Flag is true if the word whose compilation address is cfa  is
    immediate.   If  the  correct answer cannot be determined, as
    for instance in a headerless definition, flag is false (which
    may be wrong, but is nevertheless a reasonable guess).

/N ( -- n )
    n is the number of bytes occupied by a number that is  stored
    in  memory.  This is necessary for portability between 16 and
    32 bit implementations. [3]

FOLLOW ( voc -- )
    Used for scanning the list of words in  a  vocabulary.   Ini-
    tializes a system-dependent data structure in preparation for
    scanning a vocabulary.   voc-threads  is  a  system-dependent
    number  which represents a vocabulary.  It is the same number
    that is stored in CONTEXT and CURRENT .  FOLLOW  is  used  in
    conjunction with ANOTHER? .

ANOTHER? ( -- nfa true   OR   false )
    If there is another word in the  vocabulary  currently  being
    scanned,  nfa  is the name field address of that word and the
    flag is true.  If there are no more words, the flag is false.
    Used  in  conjunction  with  FOLLOW .  Example:  To print the
    names of all the words in the CURRENT vocabulary,

            : WORDS ( -- )
              CONTEXT @ FOLLOW
              BEGIN   ANOTHER?   WHILE   .ID    REPEAT
            ;


>THREADS ( cfa -- voc-threads )
    cfa is the compilation address of a vocabulary.   voc-threads
    is the number which is stored in CONTEXT when that vocabulary
    is executed.  I.e.  " ' FORTH >THREADS "   and   " FORTH CON-
    TEXT  @  "  both  leave the same number on the stack (but the
    second form has the side effect of also affecting the  search
    order).

Other Useful Portability Words

    These words do not fit into the category of building programs
which  understand  Forth definitions.  However, I have found them
useful in writing portable Forth code for other purposes.

ALIGNED ( addr1 -- addr2 )
    Addr2 is the result of aligning addr1 to the "natural"  boun-
    dary  for  the  particualar  system.  Addr2 is always greater
    than  or  equal  to  addr1.   For  example,  in  most   68000


W. M. Bradley            March 25, 1986         Page 5 of 5 pages


    implementations  ALIGNED would move up to the next even boun-
    dary.  In most 8 bit processor implementations, ALIGNED would
    be a no-op.

ALIGN ( -- )
    Advance the dictionary pointer to the  next  alignment  boun-
    dary.   This can be defined in a portable fashion in terms of
    ALIGNED .

SKIPSTR ( -- addr len )
    This is a very important word.  It is compiled as part of the
    definition  of  a run-time word which is followed by a string
    literal.  When SKIPSTR executes, it leaves  the  address  and
    length  of  its  string literal, and advances the Interpreter
    Pointer past the string.  For example, the run time word  for
    ." could be defined as:

            : (.") ( -- )    SKIPSTR TYPE  ;

    SKIPSTR could be defined in some systems with:

            : SKIPSTR ( -- addr len )
              R> COUNT  2DUP +  ALIGNED  >R
            ;

    The use of SKIPSTR replaces many instances of phrases like  "
    R>  COUNT  +  >R  ", which are not always portable because of
    differences in the way that systems compile in-line  strings.
    SKIPSTR is a good candidate for a CODE word.

FIND-CFA ( addr -- cfa )
    addr is an address within the body of a forth word.   cfa  is
    the compilation address of that word.  This word is difficult
    to implement so that it is guaranteed to work, but is  fairly
    easy  to  implement  so that it works most of the time. It is
    useful for things like a word to unravel the return stack and
    print  the  names  of the words on it.  It is also useful for
    implementing the word DEFINER.

Discussion

    Using these words, I have moved a number of utilities, previ-
ously  considered  highly  system-dependent, across several Forth
implementations with  widely  differing  implementation  details.
The  wordset does not ensure complete portability, but it helps a
lot.  The problems that remain are  mostly  the  different  names
that  different  systems  give  to the run-time words compiled by
branches and various  kinds  of  literals.   These  problems  are
easily  handled  by a table which enumerates the various run-time
words.  The wordset described here takes  care  of  the  annoying
details which otherwise tend to get buried in hard-to-find places
in the code.


W. M. Bradley            March 25, 1986         Page 6 of 5 pages


Acknowledgements

    The FOLLOW / ANOTHER? scheme was invented  by  Gordon  Smith.
My  portable  decompiler  started  with  the  decompiler  in  the
Laxen/Perry F83 model [4], and was then extended to do structured
decomilation of compiling words, to do sensible things with chil-
dren of user-defined compiling words, and to be  portable.   Mike
Perry suggested the word >DATA .

References

1.  Sebok, William L.  "A VAX  Immplementation  of  the  FORTH-79
    Standard",  _T_h_e  _J_o_u_r_n_a_l  _o_f  _F_o_r_t_h _A_p_p_l_i_c_a_t_i_o_n _a_n_d _R_e_s_e_a_r_c_h,
    Vol. 2 No. 4, 1984.

2.  Harris, Kim, "Structure Charts",  presented  at  the  Silicon
    Valley Chapter FIG Meeting, October 1985.

3.  Bradley, Mitch, and Sebok, William L.  "Compatible Forth on a
    32-bit   Machine",  _T_h_e  _J_o_u_r_n_a_l  _o_f  _F_o_r_t_h  _A_p_p_l_i_c_a_t_i_o_n  _a_n_d
    _R_e_s_e_a_r_c_h, Vol. 2 No. 4, 1984.

4.  Perry, Michael, "F83, a Public Domain Model Implementation of
    the  Forth-83  Standard",  _P_r_o_c.  _1_9_8_3 _A_s_i_l_o_m_a_r _F_O_R_M_L _C_o_n_f_e_r_-
    _e_n_c_e".


W. M. Bradley            March 25, 1986         Page 7 of 5 pages