Self-Understanding Programs Mitch Bradley _A_B_S_T_R_A_C_T This presentation describes a wordset that makes it possible to write portable programs which understand themselves. The most obvious example is a Forth decom- piler, which is traditionally non-portable. The Problem It is not possible to write a "standard" decompiler, because the details of what goes into memory when compiling a colon definition are left to the implementor. A decompiler must know these details, thus it is non-portable. Other interesting pro- grams face the same problem (e.g. Kim Harris's Structure Chart program[1]). Such programs are examples of "Self-Understanding Programs", programs that know the format in which they are stored. Self- Understanding is possible in a few other languages, such as Lisp and PostScript. In these languages, a program is a visible data type. In Lisp, a program is just a list, which is the basic Lisp data type. In PostScript, which is very much like Forth in many ways, the equivalent of a colon definition is compiled as an array of objects, and a PostScript "word" is one kind of object. In constrast, there is no standard way in which the body of a Forth colon definition is stored in memory. The Solution Rather than constrain Forth implementations by specifying details of how definitions are compiled, it is sufficient to define a set of words which can "unravel" the compiled defini- tions. In the process of porting a decompiler to four different Forth implementations, I have developed a relatively small set of Forth words which are helpful in encapsulating the implementation details of the system. The key observation is that a colon defin- ition body contains only a few different kinds of things: refer- ences to other Forth words, branches, literals, and string literals (like ." ). The new wordset understands the way that W. M. Bradley March 25, 1986 Page 1 of 5 pages references, branches, literals and string literals are compiled. Only this small set of words has to be re-implemented for a new system, and the decompiler is largely portable. This wordset can be used to write other programs, not just decompilers, which can decode and understand compiled Forth pro- grams. The wordset works well for different threading schemes (direct, indirect, token, subroutine), different branch styles (absolute, relative), and different address lengths (16-bit, 32- bit, absolute, relative). It is probably insufficient to deal with Forth systems that compile directly to native code [2], but it may be helpful even there. The Colon Definition Model This scheme assumes that Forth colon definitions contain the following kinds of things: Tokens A Token is a compiled reference to another Forth word. This does not imply the use of token-threaded code. Tokens can be absolute addresses, relative addresses, indices into a table, jump-to-subroutine instructions, or whatever the system com- piles when a Forth word is referenced inside another. Another way of looking at the same thing is that a token is the way that a compilation address is stored within the body of a colon definition. Branches A Branch is something that is compiled inside a Forth defini- tion in order to change the order of execution of Tokens. Each Branch has a destination, which is the the new value of the Interpreter Pointer if the branch is taken. Different implementations compile the branch destination in different ways; absolute addresses, relative addresses, etc. Examples of branches are the run-time words compiled by IF , ELSE , WHILE , REPEAT , LOOP , etc. Literals A Literal is a parameter that is compiled "in-line" inside a colon definition. The most common example is a compiled number. String Literals A String Literal is a string that is compiled into a colon definition, perhaps by ." or ABORT" . The implementation differences between compiled strings usually result from alignment restrictions of various processors. For example, the 68000 requires a 16-bit item to be stored at an even address, so compiled strings are usually padded so that they end on an even boundary. W. M. Bradley March 25, 1986 Page 2 of 5 pages Properties of Words Each word, regardless of its type (colon definition, code definition, variable, constant, etc) is assumed to have these attributes: Compilation Address The compilation address of a word is the address which is returned by FIND. If the word is headerless, it still has a compilation address, which is the address that would have been returned by FIND before the name was thrown away. Name The name is some string which represents the word. The "real" name of the word cannot always be determined. For instance, the word may be headerless or the name stored in the name field may be truncated, as in systems that only store the first 3 characters of the name. In the first case, the name could be represented as Wnnnn, where nnnn is the hex representation of the compilation address of the word. In the second case, the name could be represented as, for exam- ple, INT------, where the dashes show the missing characters. Immediate Flag A word is either IMMEDIATE or not. If the word is header- less, in which case it may be impossible to determine whether or not the word is immediate, it is reasonable to assume that the word is not immediate, since headerless compiling words aren't used often. Definer The Definer of a word "A" is the compilation address of the word "B" which created the word "A". For a colon definition, the definer is the compilation address of the word ":". The Wordset The _s_p_e_c_i_f_i_c_a_t_i_o_n and use of the following words is system independent and portable. The _i_m_p_l_e_m_e_n_t_a_t_i_o_n is highly system dependent, hence I do not show the implementation. A unifying rule followed here is that an address is always represented _o_n _t_h_e _s_t_a_c_k as an absolute address, regardless of how it is stored inside a colon definition. This is true for both compilation addresses and branch destinations. If the address is on the stack, it is absolute. TOKEN@ ( addr -- cfa ) cfa is the (absolute) compilation address which is compiled at addr. In a system that compiles absolute addresses, TOKEN@ is the same as @ . Otherwise, TOKEN@ has to do what- ever is necessary to put an absolute compilation address on the stack. W. M. Bradley March 25, 1986 Page 3 of 5 pages TOKEN! ( cfa addr -- ) The absolute compilation address cfa is stored at addr. In a system that compiles absolute addresses, TOKEN! is the same as ! . Otherwise, TOKEN! compiles cfa in some system- dependent manner. /TOKEN ( -- n ) Pronounced _p_e_r _t_o_k_e_n. n is the length in bytes of a compiled token. For example, a system which compiles 16-bit absolute addresses would have /TOKEN equal to 2. >TARGET ( addr -- destination-addr ) Pronounced _t_o _t_a_r_g_e_t. destination-addr is the branch desti- nation address of the (high level) branch that is compiled at addr. Note that addr is the address of the branch instruc- tion, not the offset word. For instance, if the system com- piles BRANCH -10 for REPEAT , addr would be the address of the BRANCH . /BRANCH ( -- n ) Pronounced _p_e_r _b_r_a_n_c_h. n is the length in bytes of compiled branch destination. For example, a system which compiles 1- byte relative branch offsets would have /BRANCH equal to 1. A system which compiles 16-bit branch offsets would have /BRANCH equal to 2. +STR ( addr1 -- addr2 ) addr2 is the first available address after the string com- piled at addr1, taking into account any padding that may be appended to the string. >DATA ( cfa -- data-address ) Pronounced _t_o _d_a_t_a. data-address is the address of the data storage area associated with the word whose compilation address is cfa . For example, for variables, >DATA is equivalent to >BODY , for user variables, >DATA returns an address in the user area, etc. The implementation of this is not necessarily very efficient. DOES-IP? ( addr1 -- addr2 flag ) addr1 is assumed to point just after the run-time word com- piled by ;CODE or DOES> . flag is true if the code at addr is a DOES> clause as opposed to a ;CODE clause. In the case of a ;CODE clause, addr2 points to the actual machine code that is the ;CODE clause, skipping whatever linkage code was necessary to set up the execution of the clause. In the case of a DOES> clause, addr2 points to the first token that was compiled as part of the DOES> clause, skipping whatever link- age code was necessary to set up the execution of the clause. This is a necessary kludge. C.ID ( cfa -- ) Displays the name of the word whose compilation address is cfa. This is similar to the phrase >NAME .ID , but it must W. M. Bradley March 25, 1986 Page 4 of 5 pages cope properly with the case of headerless code. If the word is headerless, C.ID displays the string Wnnnn, where nnnn is the hex representation of the compilation address cfa . IMMEDIATE? ( cfa -- flag ) Flag is true if the word whose compilation address is cfa is immediate. If the correct answer cannot be determined, as for instance in a headerless definition, flag is false (which may be wrong, but is nevertheless a reasonable guess). /N ( -- n ) n is the number of bytes occupied by a number that is stored in memory. This is necessary for portability between 16 and 32 bit implementations. [3] FOLLOW ( voc -- ) Used for scanning the list of words in a vocabulary. Ini- tializes a system-dependent data structure in preparation for scanning a vocabulary. voc-threads is a system-dependent number which represents a vocabulary. It is the same number that is stored in CONTEXT and CURRENT . FOLLOW is used in conjunction with ANOTHER? . ANOTHER? ( -- nfa true OR false ) If there is another word in the vocabulary currently being scanned, nfa is the name field address of that word and the flag is true. If there are no more words, the flag is false. Used in conjunction with FOLLOW . Example: To print the names of all the words in the CURRENT vocabulary, : WORDS ( -- ) CONTEXT @ FOLLOW BEGIN ANOTHER? WHILE .ID REPEAT ; >THREADS ( cfa -- voc-threads ) cfa is the compilation address of a vocabulary. voc-threads is the number which is stored in CONTEXT when that vocabulary is executed. I.e. " ' FORTH >THREADS " and " FORTH CON- TEXT @ " both leave the same number on the stack (but the second form has the side effect of also affecting the search order). Other Useful Portability Words These words do not fit into the category of building programs which understand Forth definitions. However, I have found them useful in writing portable Forth code for other purposes. ALIGNED ( addr1 -- addr2 ) Addr2 is the result of aligning addr1 to the "natural" boun- dary for the particualar system. Addr2 is always greater than or equal to addr1. For example, in most 68000 W. M. Bradley March 25, 1986 Page 5 of 5 pages implementations ALIGNED would move up to the next even boun- dary. In most 8 bit processor implementations, ALIGNED would be a no-op. ALIGN ( -- ) Advance the dictionary pointer to the next alignment boun- dary. This can be defined in a portable fashion in terms of ALIGNED . SKIPSTR ( -- addr len ) This is a very important word. It is compiled as part of the definition of a run-time word which is followed by a string literal. When SKIPSTR executes, it leaves the address and length of its string literal, and advances the Interpreter Pointer past the string. For example, the run time word for ." could be defined as: : (.") ( -- ) SKIPSTR TYPE ; SKIPSTR could be defined in some systems with: : SKIPSTR ( -- addr len ) R> COUNT 2DUP + ALIGNED >R ; The use of SKIPSTR replaces many instances of phrases like " R> COUNT + >R ", which are not always portable because of differences in the way that systems compile in-line strings. SKIPSTR is a good candidate for a CODE word. FIND-CFA ( addr -- cfa ) addr is an address within the body of a forth word. cfa is the compilation address of that word. This word is difficult to implement so that it is guaranteed to work, but is fairly easy to implement so that it works most of the time. It is useful for things like a word to unravel the return stack and print the names of the words on it. It is also useful for implementing the word DEFINER. Discussion Using these words, I have moved a number of utilities, previ- ously considered highly system-dependent, across several Forth implementations with widely differing implementation details. The wordset does not ensure complete portability, but it helps a lot. The problems that remain are mostly the different names that different systems give to the run-time words compiled by branches and various kinds of literals. These problems are easily handled by a table which enumerates the various run-time words. The wordset described here takes care of the annoying details which otherwise tend to get buried in hard-to-find places in the code. W. M. Bradley March 25, 1986 Page 6 of 5 pages Acknowledgements The FOLLOW / ANOTHER? scheme was invented by Gordon Smith. My portable decompiler started with the decompiler in the Laxen/Perry F83 model [4], and was then extended to do structured decomilation of compiling words, to do sensible things with chil- dren of user-defined compiling words, and to be portable. Mike Perry suggested the word >DATA . References 1. Sebok, William L. "A VAX Immplementation of the FORTH-79 Standard", _T_h_e _J_o_u_r_n_a_l _o_f _F_o_r_t_h _A_p_p_l_i_c_a_t_i_o_n _a_n_d _R_e_s_e_a_r_c_h, Vol. 2 No. 4, 1984. 2. Harris, Kim, "Structure Charts", presented at the Silicon Valley Chapter FIG Meeting, October 1985. 3. Bradley, Mitch, and Sebok, William L. "Compatible Forth on a 32-bit Machine", _T_h_e _J_o_u_r_n_a_l _o_f _F_o_r_t_h _A_p_p_l_i_c_a_t_i_o_n _a_n_d _R_e_s_e_a_r_c_h, Vol. 2 No. 4, 1984. 4. Perry, Michael, "F83, a Public Domain Model Implementation of the Forth-83 Standard", _P_r_o_c. _1_9_8_3 _A_s_i_l_o_m_a_r _F_O_R_M_L _C_o_n_f_e_r_- _e_n_c_e". W. M. Bradley March 25, 1986 Page 7 of 5 pages