The Paper
According to Donald Knuth, Melvin E. Conway coined the term coroutine in 1958 when he applied it to the construction of an assembly program. He first published it 1963 in his paper Design of separable transition-diagram compiler. Conway states that the coroutine idea was concurrently developed by him and by Joel Erdwinn. The paper is about implementing a COBOL compiler, the first two pages introduce the coroutine together with its implementation in assembly for the Burroughs model 220, a late vacuum-tube computer.
Code Example
Conway shows an example, the asterisk squasher subroutine, in both subroutine and coroutine form. The mentioned assembly commands are unknown to me and the flowcharts are not 100% readable, so I decided to reimplement the example in modern assembly to see what is going on. The asterisk squasher is a program which reads cards and writes the string of characters it finds, column 1 of card 1 to column 80 of card 1, then column 1 of card 2, and so on, with the following wrinkle: every time there are adjacent asterisks they will be paired off from the left and each "**" will be replaced by the single character "^". This operation is done with the exponentiation operator of FORTRAN and COBOL.
Asterisk Squasher Subroutine
The problem is that
out
- the output area of the SQUASHER
routine - only holds a single byte, which means the routine can only return and subsequentially write a single character. When looking for the second asterisk, the routine reads a second element and thus needs somehow to return it on the next call unless it was a "*". Conway continues that the simple requirement of compressing asterisks requires the introduction of a switch which bifurcates the subroutine and selects the part to be used at each call depending on the situation at the last call. The reason for the switch is that each call of the subroutine must result in output of exactly one character. The second read character is stored in a (module) global variable t2
until the next call. Here is its flowchart:I use my trusted Windows assembly setup for Intel IA-32 assembly. The assembly is in NASM format. First define the data:
card_len equ 80 switch: resd 1 %define ON dword 1 %define OFF dword 0 i: resd 1 card: resb card_len t1: resd 1 t2: resd 1 out: resd 1Let's warm up with the code for
RDCRD
, which reads 80 characters of a card. The code strictly follows the flowchart above.RDCRD: mov eax, [i] cmp eax, card_len jne .exit mov [i], dword 0 ; Omitted using Windows system library functions ; _GetStdHandle@4 and _ReadFile@20 to read 80 ; characters into [card]. .exit: retThe main method is
SQUASHER
with the switch
variable.SQUASHER: mov eax, [switch] cmp eax, OFF je .off .on: mov eax, [t2] mov [out], eax mov [switch], OFF jmp .exit .off: call RDCRD mov esi, [i] xor eax, eax mov al, [card + esi] mov [t1], eax inc esi mov [i], esi mov eax, [t1] ; redundant, value still in register cmp eax, '*' jne .not_equal_ast .equal_ast: call RDCRD mov esi, [i] ; redundant, value still in register xor eax, eax mov al, [card + esi] mov [t2], eax inc esi mov [i], esi mov eax, [t2] ; redundant, value still in register cmp eax, '*' jne .not_equal_second_ast .equal_second_ast: mov [t1], dword '^' jmp .not_equal_ast .not_equal_second_ast: mov [switch], ON jmp .not_equal_ast .not_equal_ast: ; label 1 mov eax, [t1] mov [out], eax .exit: retThere is nothing special happening in the code above. There are the two main branches
.on
and .off
. To run the whole program on a single test card I need a WRITE
and a main method. WRITE
iterates a card and prints the result of each call to SQUASHER
to standard out.WRITE: .loop: call SQUASHER ; Omitted using Windows system library functions ; _GetStdHandle@4 and _WriteFile@20 to write 1 ; character from [out]. mov eax, [i] cmp eax, card_len jne .loop ret ; ---------------------------------------- global _main _main: ; set up switch mov [switch], OFF ; set up global data mov [i], dword card_len call WRITE push 0 call _ExitProcess@4Asterisk Squasher as Coroutine
Conway writes further in his paper: The coroutine approach to the same problem accomplishes the switching job implicitly by use of the subroutine calling sequence. When coroutines A and B are connected so that A sends items to B, B runs for a while until it encounters a read command, which means it needs something from A. Then control is transferred to A until it wants to "write," whereupon control is returned to B at the point where it left off. In this situation A is the
SQUASHER
and B is the WRITE
coroutine. The following flowchart shows SQUASHER
when both it and the using program (WRITE
) are coroutines.Coroutine in Assembly?
While some architectures provide support for coroutines, according to the Art of Assembly Language 80x86 CPUs do not provide a
COCALL
instruction. Conway describes the coroutine linkage on the Burroughs 220 as follows: The UNCONDITIONAL BRANCH
instruction BUN A
works by placing its address A
into the P-register (the Instruction Pointer). The STORE P
instruction STP B
places the contents of P
plus one into the address part of the contents of location B
. The standard subroutine call is1: STP EXIT ; store 3 into target address of address EXIT 2: BUN ENTRANCE ; jump to address ENTRANCE 3:where location
EXIT
contains a BUN
instruction whose address will be altered by the STP
instruction in the call. (So this is self-modifying code and the return of the subroutine is then EXIT: BUN 3
or something like that.) A pair of subroutines becomes a pair of coroutines by adding to each an isolated BUN
instruction which we can call its router, and by changing the addresses of the STP
and BUN
calls as follows: when coroutine A calls coroutine B the call isSTP AROUTER BUN BROUTERThat seems a small change for the Burroughs but how could I recreate that?
Coroutine in Assembly!
The
BUN
is a JMP
on Intel. Unfortunately not all instructions are one byte, so using the current IP + 1 as done by STP
does not work. I tried to follow the original idea with a short macro. When coroutine A calls coroutine B the call is coro A B
,%macro coro 2 mov ebx, %%_cnt ; basically IP + 1 mov [route_%1], ebx ; store it in target of router A jmp %2 ; jump to router B %%_cnt: nop %endmacro route_SQUASHER: resd 1 route_WRITE: resd 1 SQUASHER: jmp [route_SQUASHER] WRITE: jmp [route_WRITE]This needs storage for the route of each coroutine. The call of the macro does not look so good as it needs the name of the current coroutine as first argument for the macro to use the right route storage. If each coroutine would be its own module, I could drop that and just use
[route]
. The router variables [route_*]
will be initialised at startup. The code of the two routines SQUASHER
and WRITE
changes slightly, strictly following the second flowchart above.; router SQUASHER: jmp [route_SQUASHER] ; coroutine SQUASHER_CORO: ; label 1 call RDCRD mov esi, [i] xor eax, eax mov al, [card + esi] mov [t1], eax inc esi mov [i], esi mov eax, [t1] ; redundant, value still in register cmp eax, '*' jne .not_equal_ast .equal_ast: call RDCRD mov esi, [i] ; redundant, value still in register xor eax, eax mov al, [card + esi] mov [t2], eax inc esi mov [i], esi mov eax, [t2] ; redundant, value still in register cmp eax, '*' jne .not_equal_second_ast .equal_second_ast: mov [t1], dword '^' jmp .not_equal_ast .not_equal_second_ast: mov eax, [t1] mov [out], eax coro SQUASHER, WRITE mov eax, [t2] mov [out], eax coro SQUASHER, WRITE jmp SQUASHER_CORO .not_equal_ast: ; label 2 mov eax, [t1] mov [out], eax coro SQUASHER, WRITE jmp SQUASHER_CORO ; ---------------------------------------- ; router WRITE: jmp [route_WRITE] ; coroutine WRITE_CORO: .loop: coro WRITE, SQUASHER ; Omitted using Windows system library functions ; _GetStdHandle@4 and _WriteFile@20 to write 1 ; character from [out]. mov eax, [i] cmp eax, card_len jne .loop ret ; ---------------------------------------- global _main _main: ; set up coroutine routers mov ebx, SQUASHER_CORO mov [route_SQUASHER], ebx mov ebx, WRITE_CORO mov [route_WRITE], ebx ; set up global data mov [i], dword card_len call WRITE push 0 call _ExitProcess@4Conslusion
I hope that clarifies how the first coroutine presented in 1963 worked. Re-coding the example was a lot of work and in the end it paid off because everything is clear now, even more so after I wrote this blog post about it. This is the golden triangle of learning: theory (e.g. reading about it), practice (e.g. creating it) and finally teaching (documenting and sharing it).
Full Source Code
After I published this article Dmitry Kandalov pushed me to make the code available on GitHub. And then we went to port it to macOS, so he could play with it as well. To compile and run the code you need GNU make, NASM and a linker depending to your platform. Thanks to Samir Talwar's Smoke, it even comes with tests. Give it a try: Conway's Squasher Coroutine @ GitHub.