Implementing tail recursion in the MLDS code generator

Tail recursion optimization versus last call optimization

Most implementations of declarative languages implement last call optimization (LCO) to allow recursive algorithms to handle arbitrary amounts of data using constant stack space. With LCO, when the last thing that a procedure does is call a callee whose vector of return values is the same as the vector of return values of the caller, it deallocates the stack frame of the caller before the call, allowing its space to be used to store the stack frame of the callee.

In its general form, LCO does not require any knowledge of the callee; it works even when the identity of the callee is unknown at compile time, as when the last call is a higher order call or a method call, or its code is unavailable, as when it is defined in a different compilation unit. However, implementing this general form of LCO requires the implementation to have direct control over the use of the stack, and the ability to generate jumps (not calls) to arbitrary locations. In the Mercury compiler, the LLDS code generator can do both these things, but the MLDS code generator can do neither. This is why it can implement only tail recursion optimization (TRO). This differs from LCO in two ways.

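This means that TRO is a less general form of last call optimization, because it is applicable only when the identity of the callee is known at compile time and the callee's code can be reached from the call site by a local transfer of control.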

Tail recursion optimization is applicable to both self-recursion and mutual recursion. TRO for self-recursion is significantly simpler, so we describe that first. This is a general principle we use everywhere below: we introduce the simplest case first, and add the complications (and their solutions) later.

Self tail recursion

To explain how the Mercury compiler applies TRO to self-recursive calls, we will use this example predicate:

:- pred len(list(int)::in, int::in, int::out) is det.

len(L, Len0, Len) :-
  (
    L = [],
    Len = Len0
  ;
    L = [_ | T],
    Len1 = Len0 + 1,
    len(T, Len1, Len)
  ).

Here is the C code generated by the Mercury compiler for this predicate without TRO:

void MR_CALL
x__len_3_p_0(
  MR_Word L_4,
  MR_Integer Len0_5,
  MR_Integer * Len_6)
{
  if ((L_4 == ((MR_Word) MR_mkword(MR_mktag(0), MR_mkbody((MR_Integer) 0)))))
    *Len_6 = Len0_5;
  else
  {
    MR_Word T_8;
    MR_Integer Len1_9;
    MR_Integer Var_10;
    MR_Integer Var_7;

    Var_7 = ((MR_Integer) (MR_hl_field(MR_mktag(1), L_4, (MR_Integer) 0)));
    T_8 = ((MR_Word) (MR_hl_field(MR_mktag(1), L_4, (MR_Integer) 1)));
    Var_10 = (MR_Integer) 1;
    Len1_9 = (Len0_5 + Var_10);
    x__len_3_p_0(T_8, Len1_9, Len_6);
  }
}

The last call is a tail recursive call. When this predicate is compiled with TRO, we get this C code:

void MR_CALL
x__len_3_p_0(
  MR_Word L_4,
  MR_Integer Len0_5,
  MR_Integer * Len_6)
{
  while (MR_TRUE)
  {
    if ((L_4 == ((MR_Word) MR_mkword(MR_mktag(0), MR_mkbody((MR_Integer) 0)))))
      *Len_6 = Len0_5;
    else
    {
      MR_Word T_8 = ((MR_Word) (MR_hl_field(MR_mktag(1), L_4, (MR_Integer) 1)));
      MR_Integer Len1_9;
      MR_Integer Var_10 = (MR_Integer) 1;
      MR_Integer Var_7 = ((MR_Integer) (MR_hl_field(MR_mktag(1), L_4, (MR_Integer) 0)));
      MR_Word next_value_of_L_4;
      MR_Integer next_value_of_Len0_5;

      Len1_9 = (Len0_5 + Var_10);
      // direct tailcall eliminated
      next_value_of_L_4 = T_8;
      next_value_of_Len0_5 = Len1_9;
      L_4 = next_value_of_L_4;
      Len0_5 = next_value_of_Len0_5;
      continue;
    }
    break;
  }
}

This differs from the unoptimized code in two major aspects.

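The first aspect affected by TRO is the translation of the self tail call (or self tail calls, plural, in the general case). TRO replaces the call with code that assigns the actual parameters of the call to the input arguments of the procedure, followed by a transfer of control back to the start of the procedure body (the continue statement in the code above).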

There is no code for handling the output arguments, since (by the definition of tail calls) these must be the same in the caller and the callee. On every non-recursive path, we return the values of the output arguments using the exact same code as we would use without TRO.

Note that the code that passes the input arguments does so in two stages: assignments of the actual parameter values to the next_value_of_ forms of the input arguments, followed by assignments of these next_value_of_ forms to the input arguments themselves. This is to handle the case where some variable is both an input argument and an actual parameter of the call. If we just assigned each actual parameter to the corresponding input directly in (say) ascending order of argument number, then the translation of a call such as foo(In2, In1, outputs) in a predicate whose head looks like foo(In1, In2, outputs) would consist of the assignments

In1 = In2;
In2 = In1;

and the first assignment would clobber the value to be assigned by the second. This is the standard problem of swapping two values, and its solution requires at least one temporary variable (if we don't want to resort to unnecessarily complicated code using xors). Our solution works because the next_value_of_ forms of the input arguments are never live outside the small blocks of code resulting from a single tail recursive call (we simply don't generate references to them in any other context), and inside each block, each such variable is written exactly once and read exactly once (in that order). The fact that we use more temporaries than may be strictly necessary does not matter, because the final decision on how the assigned values end up in their target locations is not up to the Mercury compiler; it is up to the compiler that translates the generated C, C# or Java to machine code.
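
The clobbering problem can be reproduced in hand-written C. The sketch below is not compiler output: it assumes a hypothetical tail-recursive procedure fib(N, Cur, Prev) whose tail call is fib(N - 1, Cur + Prev, Cur), so that the input argument Cur is also the third actual parameter. The function names fib_two_stage and fib_direct are invented for this illustration. The first function uses the two-stage scheme described above; the second assigns directly in ascending argument order.

```c
#include <assert.h>

/* Two-stage parameter passing, as described in the text. */
long fib_two_stage(long n, long cur, long prev)
{
  assert(n >= 0);
  while (1)
  {
    if (n == 0)
      return prev;
    else
    {
      long n1 = n - 1;
      long next = cur + prev;
      long next_value_of_n;
      long next_value_of_cur;
      long next_value_of_prev;

      /* stage 1: read every actual parameter before any input changes */
      next_value_of_n = n1;
      next_value_of_cur = next;
      next_value_of_prev = cur;
      /* stage 2: overwrite the input arguments */
      n = next_value_of_n;
      cur = next_value_of_cur;
      prev = next_value_of_prev;
      continue;
    }
  }
}

/* Direct assignment in ascending argument order: deliberately wrong. */
long fib_direct(long n, long cur, long prev)
{
  assert(n >= 0);
  while (1)
  {
    if (n == 0)
      return prev;
    else
    {
      long n1 = n - 1;
      long next = cur + prev;

      n = n1;
      cur = next;
      prev = cur;   /* clobbered: cur already holds next, not its old value */
      continue;
    }
  }
}
```

With the initial call (10, 1, 0), fib_two_stage returns 55, the tenth Fibonacci number, while fib_direct returns 512, because after the first iteration cur and prev are always equal and the sum simply doubles.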

The second aspect affected by TRO is that the entire body of the target language (in this case C) code we generate for the procedure is wrapped up in a loop.

The usual way we wrap the procedure body is with a while loop:

ret_type func_name(args)
{
  while (MR_TRUE)
  {
    // procedure body
    // in which tail calls transfer control using "continue"
  }
}

However, we can also use gotos:

ret_type func_name(args)
{
top_of_proc:
  {
    // procedure body
    // in which tail calls transfer control using "goto top_of_proc"
  }
}
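
As a concrete instance of the two wrapping schemes, here is a hand-written sketch (not compiler output) of a hypothetical accumulator procedure sum(N, Acc) whose tail call is sum(N - 1, Acc + N). The names sum_while and sum_goto are invented for this illustration, and the output is returned by value for brevity.

```c
#include <assert.h>

/* Procedure body wrapped in a while loop; the tail call is "continue". */
long sum_while(long n, long acc)
{
  long result = 0;

  assert(n >= 0);
  while (1)
  {
    if (n == 0)
      result = acc;
    else
    {
      long next_value_of_n = n - 1;
      long next_value_of_acc = acc + n;

      n = next_value_of_n;
      acc = next_value_of_acc;
      continue;
    }
    break;
  }
  return result;
}

/* Procedure body placed after a label; the tail call is "goto". */
long sum_goto(long n, long acc)
{
  long result = 0;

  assert(n >= 0);
top_of_proc:
  {
    if (n == 0)
      result = acc;
    else
    {
      long next_value_of_n = n - 1;
      long next_value_of_acc = acc + n;

      n = next_value_of_n;
      acc = next_value_of_acc;
      goto top_of_proc;
    }
  }
  return result;
}
```

Both sum_while(100, 0) and sum_goto(100, 0) return 5050 without growing the C stack, whatever the value of the first argument.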

Mutual tail recursion

The MLDS code generator partitions the procedures of a module into a sequence of SCCs, where each SCC (strongly connected component) consists of a set of procedures that are all reachable from each other via calls, whether tail or non-tail. Since TRO applies only to tail calls, it also partitions each SCC further into one or more TSCCs (tail SCCs), which are strongly connected components of a graph whose nodes represent procedures and in which there are edges only for tail calls. This means that by definition, every procedure in a TSCC is reachable from every procedure in that TSCC using only tail calls. It then implements tail recursion optimization in each TSCC that contains tail calls.

Most TSCCs contain only one procedure, which means that we can implement TRO for them using only the techniques above, without using any of the techniques below. The techniques below are needed only for TSCCs that contain two or more procedures.

Note that two (or more) mutually recursive procedures can end up in different TSCCs even if there is a tail call between them, if the tail calls go only one way, e.g. if procedure p calls procedure q using tail calls, but q calls p using only ordinary non-tail calls. The LLDS backend can optimize the tail calls to q in p, but the MLDS backend cannot do so, because it cannot generate nonlocal gotos.
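
The partition into TSCCs can be illustrated with a small sketch. This is not the algorithm the Mercury compiler uses; it is a minimal illustration, assuming a fixed number of procedures and a boolean adjacency matrix that records tail calls only. It computes reachability with Warshall's algorithm and puts two procedures in the same TSCC exactly when each can reach the other. The names NUM_PROCS, tail_call and compute_tscc_ids are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_PROCS 3

/*
** tail_call[i][j] is true iff procedure i contains a tail call to
** procedure j. Non-tail calls are deliberately not recorded.
** On return, tscc_id[i] == tscc_id[j] iff i and j are in the same TSCC.
*/
void compute_tscc_ids(bool tail_call[NUM_PROCS][NUM_PROCS],
  int tscc_id[NUM_PROCS])
{
  bool reach[NUM_PROCS][NUM_PROCS];
  int i, j, k;
  int next_id = 0;

  /* every procedure reaches itself and its direct tail callees */
  for (i = 0; i < NUM_PROCS; i++)
    for (j = 0; j < NUM_PROCS; j++)
      reach[i][j] = (i == j) || tail_call[i][j];

  /* transitive closure */
  for (k = 0; k < NUM_PROCS; k++)
    for (i = 0; i < NUM_PROCS; i++)
      for (j = 0; j < NUM_PROCS; j++)
        if (reach[i][k] && reach[k][j])
          reach[i][j] = true;

  for (i = 0; i < NUM_PROCS; i++)
    tscc_id[i] = -1;

  /* group procedures that reach each other via tail calls only */
  for (i = 0; i < NUM_PROCS; i++)
  {
    if (tscc_id[i] != -1)
      continue;
    for (j = i; j < NUM_PROCS; j++)
      if (reach[i][j] && reach[j][i])
        tscc_id[j] = next_id;
    next_id++;
  }
  assert(next_id <= NUM_PROCS);
}
```

For example, take procedures p, q and r (numbered 0, 1 and 2) where p tail-calls q, q and r tail-call each other, and q calls p only through a non-tail call. The matrix then has edges 0 to 1, 1 to 2 and 2 to 1, and the sketch assigns p to one TSCC and q and r to another, matching the one-way situation described above.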

To implement mutual tail recursion between the procedures of a nontrivial TSCC, we need to generalize

Transfers of control

The easiest to generalize is the last one: the transfer of control. To see how it is done, consider a small TSCC containing two procedures, tscc_a and tscc_b. Since we need to translate tail calls into local transfers of control, we translate each TSCC together, either using labels and gotos like this:

ret_type_a tscc_a(args_a)
{
  goto top_of_proc_1;
top_of_proc_1:
  {
    // body of procedure tscc_a
    // in which tail calls transfer control using "goto top_of_proc_N"
    goto tscc_end;
  }
top_of_proc_2:
  {
    // body of procedure tscc_b
    // in which tail calls transfer control using "goto top_of_proc_N"
    goto tscc_end;
  }
tscc_end:
  return ...
}

ret_type_b tscc_b(args_b)
{
  goto top_of_proc_2;
top_of_proc_1:
  {
    // body of procedure tscc_a
    // in which tail calls transfer control using "goto top_of_proc_N"
    goto tscc_end;
  }
top_of_proc_2:
  {
    // body of procedure tscc_b
    // in which tail calls transfer control using "goto top_of_proc_N"
    goto tscc_end;
  }
tscc_end:
  return ...
}

or using while loops and switches like this:

ret_type_a tscc_a(args_a)
{
  int tscc_selector = 1;
  while (MR_TRUE)
  {
    switch (tscc_selector)
    {
      case 1:
        {
          // body of procedure tscc_a
          // in which tail calls transfer control using
          // "tscc_selector = N; continue"
        }
        break;
      case 2:
        {
          // body of procedure tscc_b
          // in which tail calls transfer control using
          // "tscc_selector = N; continue"
        }
        break;
    }
    // no tail call was taken, so leave the loop
    break;
  }

  return ...
}

ret_type_b tscc_b(args_b)
{
  int tscc_selector = 2;
  while (MR_TRUE)
  {
    switch (tscc_selector)
    {
      case 1:
        {
          // body of procedure tscc_a
          // in which tail calls transfer control using
          // "tscc_selector = N; continue"
        }
        break;
      case 2:
        {
          // body of procedure tscc_b
          // in which tail calls transfer control using
          // "tscc_selector = N; continue"
        }
        break;
    }
    // no tail call was taken, so leave the loop
    break;
  }

  return ...
}

In both cases, each procedure in the TSCC has its own number in the TSCC (in this case, tscc_a is procedure 1 in the TSCC and tscc_b is procedure 2 in the TSCC). We call this number the procedure's in-TSCC id number.
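
A runnable sketch of the loop-and-switch form, written by hand rather than taken from compiler output, is shown below for a hypothetical TSCC containing the classic mutually recursive pair is_even (in-TSCC id 1) and is_odd (in-TSCC id 2). To keep the focus on the transfer of control, the sketch assumes both procedures can share the single C variable n; it therefore ignores the parameter passing question entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Container procedure for is_even: starts at wrapped procedure 1. */
bool is_even(unsigned n)
{
  int tscc_selector = 1;
  bool result = false;

  while (true)
  {
    switch (tscc_selector)
    {
      case 1:   /* body of is_even */
        if (n == 0)
          result = true;
        else
        {
          n = n - 1;            /* tail call is_odd(n - 1) */
          tscc_selector = 2;
          continue;
        }
        break;
      case 2:   /* body of is_odd */
        if (n == 0)
          result = false;
        else
        {
          n = n - 1;            /* tail call is_even(n - 1) */
          tscc_selector = 1;
          continue;
        }
        break;
      default:
        assert(0 && "unknown in-TSCC id");
    }
    break;      /* a base case was reached: leave the loop */
  }
  return result;
}

/* Container procedure for is_odd: same bodies, starts at procedure 2. */
bool is_odd(unsigned n)
{
  int tscc_selector = 2;
  bool result = false;

  while (true)
  {
    switch (tscc_selector)
    {
      case 1:   /* body of is_even */
        if (n == 0)
          result = true;
        else
        {
          n = n - 1;
          tscc_selector = 2;
          continue;
        }
        break;
      case 2:   /* body of is_odd */
        if (n == 0)
          result = false;
        else
        {
          n = n - 1;
          tscc_selector = 1;
          continue;
        }
        break;
      default:
        assert(0 && "unknown in-TSCC id");
    }
    break;
  }
  return result;
}
```

The two containers differ only in the initial value of tscc_selector. Inside the switch, "continue" restarts the enclosing while loop, whereas "break" leaves only the switch, which is why a second "break" follows the switch.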

We translate each procedure in the TSCC into MLDS code just once, yielding the code represented by "body of procedure ..." above. We call these inner or wrapped procedures. If the TSCC contains N procedures, then each C function we generate will contain N wrapped procedures. We call the entirety of each C function an outer or container procedure, since each contains two or more wrapped procedures. We must generate a container procedure for every member of the TSCC that may be called by a non-tail call from anywhere: from other modules, from other (higher) SCCs in the current module, from procedures in the current SCC that are not in the TSCC, and via non-tail calls from any procedure in the TSCC itself. This means that the code of every procedure in a TSCC that contains N procedures will be present up to N times in the executable. Since mutually-tail-recursive procedures are relatively rare, and most TSCCs contain only two or three procedures, this increase in the total code memory requirement is usually a more than acceptable price to pay for the ability to handle arbitrarily deep recursion in constant stack space. (In fact, the increased memory requirement is probably not as important as the reduction of the effectiveness of the instruction cache: the cache misses that bring in the code of a wrapped procedure from main memory have to be incurred for each one of its executed copies.)

Parameter passing

Parameter passing between the procedures of a TSCC at tail calls is not as simple as parameter passing at self-tail-recursive calls, because (except in the case of self-tail-calls) the actual parameters in the caller and the formal parameters of the callee will come from two different procedures, and thus from two different varsets. Since every procedure's varset contains variables whose numbers are allocated consecutively from one, the sets of variable numbers in two different procedures will of course greatly overlap, and it is possible for a variable with a given number to have the same name in both varsets as well. We don't want any such accidental name collisions to result in the generated code using the same C variable to represent both of the colliding variables, since that would be semantically wrong. (For starters, the two variables could even have different types, but the sharing of their storage would be a bug even if they had the same type.) We therefore need a mechanism to avoid this problem.

One possible solution would be to rename (or renumber) apart either the varsets of the procedures in the TSCC before code generation, or their sets of MLDS variables either during or after code generation. Both are problematic. HLDS procedures contain lots of fields that contain variables, so the code for renaming or renumbering variables in all of them would be big (which would pose a program maintenance burden) and would turn over a lot of memory (a performance problem). And some compiler-generated MLDS variables have fixed names and no changeable number.

Our chosen solution sidesteps such problems altogether by inventing a new set of compiler-generated MLDS variables specifically for parameter passing in TSCCs.

In every procedure, every argument that participates in parameter passing (i.e. every argument that is not of a dummy type and whose mode is not unused) has either a corresponding tscc_proc_N_input_M_VarName variable (if it is an input argument) or a corresponding tscc_output_M_VarName variable (if it is an output argument). In each such pair of corresponding variables, we call the MLDS variable representing the argument the procedure's own variable, and we call the other the tscc variable.

Suppose both tscc_a and tscc_b are det functions whose argument vectors are tscc_a(AIn1, AIn2) = AOut1 and tscc_b(BIn1) = BOut1 respectively, and the name of the MLDS type of each variable is the name of the variable with "Type" inserted before its number (so the type of AIn1 is AInType1). However, since AOutType1 must be the same as BOutType1, we will replace both with just "OutType1". Then the parameter passing code we generate will look like this:

OutType1
tscc_a(
  AInType1      tscc_proc_1_input_1_AIn1,
  AInType2      tscc_proc_1_input_2_AIn2)
{
  BInType1      tscc_proc_2_input_1_BIn1;
  OutType1      tscc_output_1_AOut1;

  goto top_of_proc_1;
top_of_proc_1:
  {
    AInType1    AIn1 = tscc_proc_1_input_1_AIn1;
    AInType2    AIn2 = tscc_proc_1_input_2_AIn2;
    OutType1    AOut1;

    // body of procedure tscc_a in which
    // tail calls to tscc_a look like this:
    //      tscc_proc_1_input_1_AIn1 = input arg 1 of tail call;
    //      tscc_proc_1_input_2_AIn2 = input arg 2 of tail call;
    //      goto top_of_proc_1;
    // tail calls to tscc_b look like this:
    //      tscc_proc_2_input_1_BIn1 = input arg 1 of tail call;
    //      goto top_of_proc_2;
    // and base cases assign to AOut1 as usual

    tscc_output_1_AOut1 = AOut1;
    goto tscc_end;
  }
top_of_proc_2:
  {
    BInType1    BIn1 = tscc_proc_2_input_1_BIn1;
    OutType1    BOut1;

    // body of procedure tscc_b in which
    // tail calls to both tscc_a and tscc_b look like they do above
    // and base cases assign to BOut1 as usual

    tscc_output_1_AOut1 = BOut1;
    goto tscc_end;
  }
tscc_end:
  return tscc_output_1_AOut1;
}

OutType1
tscc_b(
  BInType1      tscc_proc_2_input_1_BIn1)
{
  AInType1      tscc_proc_1_input_1_AIn1;
  AInType2      tscc_proc_1_input_2_AIn2;
  OutType1      tscc_output_1_AOut1;

  goto top_of_proc_2;
top_of_proc_1:
  {
    AInType1    AIn1 = tscc_proc_1_input_1_AIn1;
    AInType2    AIn2 = tscc_proc_1_input_2_AIn2;
    OutType1    AOut1;

    // body of procedure tscc_a in which
    // tail calls to tscc_a look like this:
    //      tscc_proc_1_input_1_AIn1 = input arg 1 of tail call;
    //      tscc_proc_1_input_2_AIn2 = input arg 2 of tail call;
    //      goto top_of_proc_1;
    // tail calls to tscc_b look like this:
    //      tscc_proc_2_input_1_BIn1 = input arg 1 of tail call;
    //      goto top_of_proc_2;
    // and base cases assign to AOut1 as usual

    tscc_output_1_AOut1 = AOut1;
    goto tscc_end;
  }
top_of_proc_2:
  {
    BInType1    BIn1 = tscc_proc_2_input_1_BIn1;
    OutType1    BOut1;

    // body of procedure tscc_b in which
    // tail calls to both tscc_a and tscc_b look like they do above
    // and base cases assign to BOut1 as usual

    tscc_output_1_AOut1 = BOut1;
    goto tscc_end;
  }
tscc_end:
  return tscc_output_1_AOut1;
}
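
To make the scheme above concrete, here is a hand-written, runnable sketch that follows the same naming pattern. It is not compiler output. It assumes two hypothetical det functions on non-negative integers: a(N, Acc), which returns Acc when N is 0 and otherwise tail-calls b(N - 1), and b(M), which returns 0 when M is 0 and otherwise tail-calls a(M / 2, M) for even M or a(M - 1, M) for odd M. All types are assumed to be long.

```c
#include <assert.h>

/* Container procedure for a, which has in-TSCC id 1. */
long tscc_a(long tscc_proc_1_input_1_N, long tscc_proc_1_input_2_Acc)
{
  long tscc_proc_2_input_1_M = 0;   /* tscc variable for b's input */
  long tscc_output_1_Out = 0;       /* tscc variable for the output */

  assert(tscc_proc_1_input_1_N >= 0);
  goto top_of_proc_1;
top_of_proc_1:
  {
    long N = tscc_proc_1_input_1_N;       /* a's own variables */
    long Acc = tscc_proc_1_input_2_Acc;
    long Out;

    if (N == 0)
      Out = Acc;
    else
    {
      tscc_proc_2_input_1_M = N - 1;      /* tail call b(N - 1) */
      goto top_of_proc_2;
    }
    tscc_output_1_Out = Out;
    goto tscc_end;
  }
top_of_proc_2:
  {
    long M = tscc_proc_2_input_1_M;       /* b's own variables */
    long Out;

    if (M == 0)
      Out = 0;
    else
    {
      /* tail call a(M / 2, M) or a(M - 1, M) */
      tscc_proc_1_input_1_N = (M % 2 == 0) ? M / 2 : M - 1;
      tscc_proc_1_input_2_Acc = M;
      goto top_of_proc_1;
    }
    tscc_output_1_Out = Out;
    goto tscc_end;
  }
tscc_end:
  return tscc_output_1_Out;
}

/* Container procedure for b, which has in-TSCC id 2. */
long tscc_b(long tscc_proc_2_input_1_M)
{
  long tscc_proc_1_input_1_N = 0;   /* tscc variables for a's inputs */
  long tscc_proc_1_input_2_Acc = 0;
  long tscc_output_1_Out = 0;

  assert(tscc_proc_2_input_1_M >= 0);
  goto top_of_proc_2;
top_of_proc_1:
  {
    long N = tscc_proc_1_input_1_N;
    long Acc = tscc_proc_1_input_2_Acc;
    long Out;

    if (N == 0)
      Out = Acc;
    else
    {
      tscc_proc_2_input_1_M = N - 1;
      goto top_of_proc_2;
    }
    tscc_output_1_Out = Out;
    goto tscc_end;
  }
top_of_proc_2:
  {
    long M = tscc_proc_2_input_1_M;
    long Out;

    if (M == 0)
      Out = 0;
    else
    {
      tscc_proc_1_input_1_N = (M % 2 == 0) ? M / 2 : M - 1;
      tscc_proc_1_input_2_Acc = M;
      goto top_of_proc_1;
    }
    tscc_output_1_Out = Out;
    goto tscc_end;
  }
tscc_end:
  return tscc_output_1_Out;
}
```

Both wrapped procedures have an own variable named Out, yet they never share storage, because each lives in its own block and values cross between procedures only through the tscc variables. In this sketch, tscc_a(5, 0) evaluates as b(4), a(2, 4), b(1), a(0, 1) and returns 1.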

The general principles of our parameter passing scheme are as follows.

For model semi predicates, the succeeded (own) variable and the tscc_output_succeeded variable corresponding to it effectively function as an output argument returned by value. By its position in the vector of output arguments, it is effectively output argument 0.

The final detail is the treatment of output arguments that are passed by reference. Our chosen approach is designed to work even in cases where the procedures of the TSCC, although they return the same vector of outputs, return different subsets of them by reference.

The basic idea is to generate the wrapped procedures as if all the output arguments were returned by value, exactly as shown in the example above, and to handle the difference at the container function level.
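
One way to picture this idea is the hand-written sketch below. It is an assumption about how a container might bridge the two conventions, not a description of the code the compiler actually emits. It uses a hypothetical procedure sum(N, Acc) = Sum and, for brevity, a container with a single wrapped procedure. The wrapped code is identical in both containers and always stores its result in the tscc output variable; only the container's interface and exit code differ. The pointer parameter is named tscc_output_ptr_1_Sum after the tscc_output_ptr_M_VarName pattern used for by-reference outputs.

```c
#include <assert.h>
#include <stddef.h>

/* Container whose output argument is returned by value. */
long sum_by_value(long n, long acc)
{
  long tscc_output_1_Sum = 0;

  assert(n >= 0);
  while (1)
  {
    /* wrapped procedure: written as if Sum were returned by value */
    if (n == 0)
      tscc_output_1_Sum = acc;
    else
    {
      acc = acc + n;    /* tail call sum(n - 1, acc + n) */
      n = n - 1;
      continue;
    }
    break;
  }
  return tscc_output_1_Sum;
}

/* Container whose output argument is passed by reference (assumed shape). */
void sum_by_ref(long n, long acc, long *tscc_output_ptr_1_Sum)
{
  long tscc_output_1_Sum = 0;

  assert(n >= 0);
  assert(tscc_output_ptr_1_Sum != NULL);
  while (1)
  {
    /* the same wrapped procedure, unchanged */
    if (n == 0)
      tscc_output_1_Sum = acc;
    else
    {
      acc = acc + n;
      n = n - 1;
      continue;
    }
    break;
  }
  /* container-level difference: copy the value out through the pointer */
  *tscc_output_ptr_1_Sum = tscc_output_1_Sum;
}
```

Here acc is read before n is updated, so the direct assignments are safe without next_value_of_ temporaries. Calling sum_by_value(4, 0) returns 10, and sum_by_ref(4, 0, &x) stores 10 in x.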

The parts of a container function corresponding to an output argument passed by value are the following.

When an output argument is passed by reference, we create a tscc_output_ptr_M_VarName variable for it as well as a tscc_output_M_VarName variable. The parts of a container function corresponding to such an output argument are the following.