21 June 2020

Conway's Squasher Coroutine

Since some time I am researching coroutines. Wikipedia says that a coroutine is a computer program component that generalises subroutines, allowing execution to be suspended and resumed. [...] Subroutines are special cases of coroutines. When subroutines are invoked, execution begins at the start, and once a subroutine exits, it is finished; an instance of a subroutine only returns once, and does not hold state between invocations. By contrast, coroutines can exit at any time, and execution may later return to these points. [...] A coroutine instance holds state, and varies between invocations. Coroutines are useful for many things. For example the Kotlin programming language lists several uses cases including asynchronous computations, generators and implementing state machines in a sequential way. More and more languages provide them.

The Paper
According to Donald Knuth, Melvin E. Conway coined the term coroutine in 1958 when he applied it to the construction of an assembly program. He first published it 1963 in his paper Design of separable transition-diagram compiler. Conway states that the coroutine idea was concurrently developed by him and by Joel Erdwinn. The paper is about implementing a COBOL compiler, the first two pages introduce the coroutine together with its implementation in assembly for the Burroughs model 220, a late vacuum-tube computer.

Code Example
Conway shows an example, the asterisk squasher subroutine, in both subroutine and coroutine form. The mentioned assembly commands are unknown to me and the flowcharts are not 100% readable, so I decided to reimplement the example in modern assembly to see what is going on. The asterisk squasher is a program which reads cards and writes the string of characters it finds, column 1 of card 1 to column 80 of card 1, then column 1 of card 2, and so on, with the following wrinkle: every time there are adjacent asterisks they will be paired off from the left and each "**" will be replaced by the single character "^". This operation is done with the exponentiation operator of FORTRAN and COBOL.

Asterisk Squasher Subroutine
The problem is that out - the output area of the SQUASHER routine - only holds a single byte, which means the routine can only return and subsequentially write a single character. When looking for the second asterisk, the routine reads a second element and thus needs somehow to return it on the next call unless it was a "*". Conway continues that the simple requirement of compressing asterisks requires the introduction of a switch which bifurcates the subroutine and selects the part to be used at each call depending on the situation at the last call. The reason for the switch is that each call of the subroutine must result in output of exactly one character. The second read character is stored in a (module) global variable t2 until the next call. Here is its flowchart:

Figure 1: Asterisk squasher subroutine with environment
I use my trusted Windows assembly setup for Intel IA-32 assembly. The assembly is in NASM format. First define the data:
card_len equ    80

switch: resd    1
%define ON      dword 1
%define OFF     dword 0

i:      resd    1
card:   resb    card_len

t1:     resd    1
t2:     resd    1

out:    resd    1
Let's warm up with the code for RDCRD, which reads 80 characters of a card. The code strictly follows the flowchart above.
RDCRD:
        mov     eax, [i]
        cmp     eax, card_len
        jne     .exit

        mov     [i], dword 0

        ; Omitted using Windows system library functions
        ; _GetStdHandle@4 and _ReadFile@20 to read 80
        ; characters into [card].

.exit:
        ret
The main method is SQUASHER with the switch variable.
SQUASHER:
        mov     eax, [switch]
        cmp     eax, OFF
        je      .off
.on:
        mov     eax, [t2]
        mov     [out], eax

        mov     [switch], OFF
        jmp     .exit

.off:
        call    RDCRD

        mov     esi, [i]
        xor     eax, eax
        mov     al, [card + esi]
        mov     [t1], eax

        inc     esi
        mov     [i], esi

        mov     eax, [t1]               ; redundant, value still in register
        cmp     eax, '*'
        jne     .not_equal_ast

.equal_ast:
        call    RDCRD

        mov     esi, [i]                ; redundant, value still in register
        xor     eax, eax
        mov     al, [card + esi]
        mov     [t2], eax

        inc     esi
        mov     [i], esi

        mov     eax, [t2]               ; redundant, value still in register
        cmp     eax, '*'
        jne     .not_equal_second_ast

.equal_second_ast:
        mov     [t1], dword '^'
        jmp     .not_equal_ast

.not_equal_second_ast:
        mov     [switch], ON
        jmp     .not_equal_ast

.not_equal_ast:                         ; label 1
        mov     eax, [t1]
        mov     [out], eax

.exit:
        ret
There is nothing special happening in the code above. There are the two main branches .on and .off. To run the whole program on a single test card I need a WRITE and a main method. WRITE iterates a card and prints the result of each call to SQUASHER to standard out.
WRITE:
.loop:
        call    SQUASHER

        ; Omitted using Windows system library functions
        ; _GetStdHandle@4 and _WriteFile@20 to write 1
        ; character from [out].

        mov     eax, [i]
        cmp     eax, card_len
        jne     .loop

        ret

; ----------------------------------------
        global  _main
_main:
        ; set up switch
        mov     [switch], OFF

        ; set up global data
        mov     [i], dword card_len

        call    WRITE

        push    0
        call    _ExitProcess@4
Asterisk Squasher as Coroutine
Conway writes further in his paper: The coroutine approach to the same problem accomplishes the switching job implicitly by use of the subroutine calling sequence. When coroutines A and B are connected so that A sends items to B, B runs for a while until it encounters a read command, which means it needs something from A. Then control is transferred to A until it wants to "write," whereupon control is returned to B at the point where it left off. In this situation A is the SQUASHER and B is the WRITE coroutine. The following flowchart shows SQUASHER when both it and the using program (WRITE) are coroutines.

Figure 2: Asterisk squasher as coroutine
Coroutine in Assembly?
While some architectures provide support for coroutines, according to the Art of Assembly Language 80x86 CPUs do not provide a COCALL instruction. Conway describes the coroutine linkage on the Burroughs 220 as follows: The UNCONDITIONAL BRANCH instruction BUN A works by placing its address A into the P-register (the Instruction Pointer). The STORE P instruction STP B places the contents of P plus one into the address part of the contents of location B. The standard subroutine call is
1:      STP EXIT                        ; store 3 into target address of address EXIT
2:      BUN ENTRANCE                    ; jump to address ENTRANCE
3:
where location EXIT contains a BUN instruction whose address will be altered by the STP instruction in the call. (So this is self-modifying code and the return of the subroutine is then EXIT: BUN 3 or something like that.) A pair of subroutines becomes a pair of coroutines by adding to each an isolated BUN instruction which we can call its router, and by changing the addresses of the STP and BUN calls as follows: when coroutine A calls coroutine B the call is
        STP AROUTER
        BUN BROUTER
That seems a small change for the Burroughs but how could I recreate that?

Coroutine in Assembly!
The BUN is a JMP on Intel. Unfortunately not all instructions are one byte, so using the current IP + 1 as done by STP does not work. I tried to follow the original idea with a short macro. When coroutine A calls coroutine B the call is coro A B,
%macro coro 2
        mov     ebx, %%_cnt             ; basically IP + 1
        mov     [route_%1], ebx         ; store it in target of router A
        jmp     %2                      ; jump to router B
%%_cnt: nop
%endmacro

route_SQUASHER: resd 1
route_WRITE:    resd 1

SQUASHER:
        jmp     [route_SQUASHER]

WRITE:
        jmp     [route_WRITE]
This needs storage for the route of each coroutine. The call of the macro does not look so good as it needs the name of the current coroutine as first argument for the macro to use the right route storage. If each coroutine would be its own module, I could drop that and just use [route]. The router variables [route_*] will be initialised at startup. The code of the two routines SQUASHER and WRITE changes slightly, strictly following the second flowchart above.
; router
SQUASHER:
        jmp     [route_SQUASHER]

; coroutine
SQUASHER_CORO:                          ; label 1
        call    RDCRD

        mov     esi, [i]
        xor     eax, eax
        mov     al, [card + esi]
        mov     [t1], eax

        inc     esi
        mov     [i], esi

        mov     eax, [t1]               ; redundant, value still in register
        cmp     eax, '*'
        jne     .not_equal_ast

.equal_ast:
        call    RDCRD

        mov     esi, [i]                ; redundant, value still in register
        xor     eax, eax
        mov     al, [card + esi]
        mov     [t2], eax

        inc     esi
        mov     [i], esi

        mov     eax, [t2]               ; redundant, value still in register
        cmp     eax, '*'
        jne     .not_equal_second_ast

.equal_second_ast:
        mov     [t1], dword '^'
        jmp     .not_equal_ast

.not_equal_second_ast:
        mov     eax, [t1]
        mov     [out], eax
        coro    SQUASHER, WRITE

        mov     eax, [t2]
        mov     [out], eax
        coro    SQUASHER, WRITE

        jmp     SQUASHER_CORO

.not_equal_ast:                         ; label 2
        mov     eax, [t1]
        mov     [out], eax
        coro    SQUASHER, WRITE

        jmp     SQUASHER_CORO

; ----------------------------------------
; router
WRITE:
        jmp     [route_WRITE]

; coroutine
WRITE_CORO:
.loop:
        coro    WRITE, SQUASHER

        ; Omitted using Windows system library functions
        ; _GetStdHandle@4 and _WriteFile@20 to write 1
        ; character from [out].

        mov     eax, [i]
        cmp     eax, card_len
        jne     .loop

        ret

; ----------------------------------------
        global  _main
_main:
        ; set up coroutine routers
        mov     ebx, SQUASHER_CORO
        mov     [route_SQUASHER], ebx

        mov     ebx, WRITE_CORO
        mov     [route_WRITE], ebx

        ; set up global data
        mov     [i], dword card_len

        call    WRITE

        push    0
        call    _ExitProcess@4

Conslusion
I hope that clarifies how the first coroutine presented in 1963 worked. Re-coding the example was a lot of work and in the end it paid off because everything is clear now, even more so after I wrote this blog post about it. This is the golden triangle of learning: theory (e.g. reading about it), practice (e.g. creating it) and finally teaching (documenting and sharing it).

Full Source Code
After I published this article Dmitry Kandalov pushed me to make the code available on GitHub. And then we went to port it to macOS, so he could play with it as well. To compile and run the code you need GNU make, NASM and a linker depending to your platform. Thanks to Samir Talwar's Smoke, it even comes with tests. Give it a try: Conway's Squasher Coroutine @ GitHub.

2 comments:

Dmitry Kandalov said...

For anyone interested in the topic I wrote a follow-up blog post with 64-bit assembler and Kotlin :)

Peter Kofler said...

Thank you Dmitry for bringing the code to macOS and Linux. The concept is easier to understand if you have code that actually runs on your machine.

Great follow-up post. I like it. I see your post as a consequence of mine, taking the original idea further and comparing it with today's technology. Doing so provides more and new insight into the topic. (Also I know the topic well so I guess I am biased ;-)