# Application Note 001: Implementing MD5 On a GreenArrays™ Chip

MD5 is a well known cryptographic algorithm, calculating a one way hash function of 128 bits on an arbitrarily long sequence of octets; it is a standard component of many cryptosystems including IPSec and Public Key Infrastructures. There are many possible ways to implement such an algorithm on GreenArrays chips; the implementation described in the first part of this document is the first we have produced (written by staff member Charley Shattuck) and is written as machine code for a cluster of F18 computers. This method generally leads to maximum speed, no dependency on external memory or high level programming such as eForth, and may be adapted for commitment to ROM if desired.

A definitive specification for the MD5 algorithm is published by the Internet Engineering Task Force as RFC 1321 and a discussion of performance is available in RFC 1810 as well as tables in books such as Applied Cryptography by Bruce Schneier. Incidentally, our own reference FORTH implementation running on a PC with 2.9 GHz modern Intel processor hashes 1,000,000 bytes in 3.782 milliseconds for a hash speed of 2.12 gigabits per second.

MD5 presents a few problems for programming a Green Arrays device. For one thing it depends on modulo 32 bit addition and rotation. Green Arrays chips deal in 18 bit quantities. For another, md5 is complicated enough that neither the code nor the set of constants required to implement the algorithm will fit into one or even two or three nodes of a Green Arrays computer. Let's see how to deal with that.

## Data Sources

During each step of the MD5 algorithm there are three main sources of data and several numerical sequences for accessing this data.

The first data source is the current state of the message digest, represented as ABCD. It contains 4 32 bit numbers. Temporary storage for A, B, C, and D is required as well. Let's call that AA, BB, CC, and DD, each a 32 bit number. ABCD is accessed in rotating fashion as ABCD, DABC, CDAB, BCDA, and so on as the algorithm proceeds.

The second data source is the message buffer, represented by X(k). The message is divided into blocks of 64 bytes or 16 32 bit words, indexed by the sequence represented by k. The message buffer is accessed in an order that is not linear, but which can be calculated in less space than it can be listed in.

The third data source is 64 32 bit word constants in a table represented by T(i). These constants are accessed in a linear fashion via the index i.

Finally there is a sequence of 64 numbers representing a rotation amount, called s. This sequence can also be calculated in less space than would be required to list it.

There are four bitwise functions applied to the ABCD data. They are called f', g', h', and i' where:

f'(X, Y, Z) = (X and Y) ior (not(X) and Z)

g'(X, Y, Z) = (X and Z) ior (Y and not(Z))

h'(X, Y, Z) = (X xor Y) xor Z

i'(X, Y, Z) = Y xor (X ior not(Z))

Note that ior (inclusive or) and xor (exclusive or) are spelled out in order to be perfectly clear.

Let abcd represent the current rotation of ABCD, DABC, CDAB, or BCDA. Let "function" stand for one of the four functions listed above. Using the other symbols also introduced above this is how the 64 operations for each message buffer will look:

a := b + (rotate ( s, (a + function(bcd) + X(k) + T(i))))

## Green Arrays Architecture

A couple of things about the Green Arrays architecture dominate this implementation of MD5. One is the fact that MD5 is a 32 bit algorithm and GA architecture is 18 bits. The other is that each node of a Green Arrays chip is limited to 64 18 bit words of memory for both program and data.

To address the first item we will perform all the 32 bit arithmetic and logic in parallel, 16 bits per partner node. For addition, carries can be accumulated in the upper two bits of the low word and communicated to the node handling the high word in time to avoid overflow. The second item is addressed by distributing program code and data amoung several nodes which communicate with each other.

Note that the four functions specified above are bitwise functions. There is no carry to ripple and no bit rotations occur. This means that two separate nodes can independently perform the high 16 bit and low 16 bit operations without interacting. Nice.

### 32 Bit Addition and Rotation

We've split the 32 bit numbers into 16 bits high and 16 bits low, handled by separate but adjacent nodes. The low node can add and maintain the carries for up to three additions before having to pass its two bits of carry up to the high word to be added in.

32 bit rotation can be implemented using the Green Arrays +* instruction. For example, an 18 bit rotation could be performed by putting zeros into S and A and executing +* repeatedly. The 0 in S would be added to T without changing anything and bit(s) would shift from T into A. At the end you could fetch A and or it with T to construct the rotated value.

Rotating a 32 bit number is just a bit more complicated. Both the high and low words would be shifted from T into A, but this time the nodes would swap their T values before oring with A. Also the A values would be shifted right two more times via 2/ and anded with ffff before being ored with T.

### A Lookup Table with 64 32 Bit Words

Now we've already split the high and low 16 bit word operations into separate rows of nodes. We therefore need two tables of 64 16 bit words each. We can do that using two nodes filled with all data, no code, which jump to neighbor ports to receive the code that looks up data. Once each data node has its A register initialized to zero, The data can be looked up with the simple line `@+ !p ..` executed in its port. The `..` fills the rest of the current instruction word with nops so that the next opcode will go into slot 0 of the next word.

### Other Code and Data

The other data buffers, ABCD and the message buffer, are smaller and can be in nodes that also contain code. One pair of nodes is pretty much dedicated to calculating the message buffer index, reading a word each from their buffers, and passing those words down the line to nodes that apply the MD5 algorithm. Since the rotation amount is the same for high word and low word, a single node can calculate this amount and pass it to both the high row and the low row. Similarly a single node buffers ABCD when a particular message block begins and adds that back in to the message digest at the end of that block's processing. One pair of nodes is pretty well occupied with the four bitwise functions but has room to also do a bit of addition before passing the sum off to another pair that handles resolving the carry, fetching and adding the constants from the lookup tables, and performing the 32 bit rotations.

## Data flow block diagram

In planning a multi-node application for GreenArrays chips, we think in terms of a data flow diagram. The following Figure diagrams this implementation of MD5:

## Allocation of Code and Data for MD5

Let's describe some of the thinking that led up to this layout. After reading and understanding the md5 specification it was obvious that some sort of extended precision arithmetic and logic was needed. MD5 is a 32 bit algorithm and Green Arrays architecture is 18 bits. The first thought was to define 36 bit operators for addition and each of the four bit-oriented functions needed, as well as 32 bit rotation. Each of these was actually coded in order to see how much memory they would consume. After coding the four functions there wasn't room for much else. Each function implied having three double precision items on the stack as inputs. This occupies six stack locations leaving only four more to work with. Stack overflow was a serious possibility. In addition, watching the numbers in the simulator was difficult since they didn't line up in nibbles as 32 bit hex numbers would.

### 32 Bit Operations in 16 Bit Halves

This problem of visibility while debugging is what inspired the idea of splitting 32 bit operations into two 16 bit operations in partner nodes. Suddenly the bit-wise functions only required three stack locations per node instead of six. The lookup tables could be split into a high 16 bit word and a low 16 bit word. Numbers being passed from node to node could easily be recognized as high or low 16 bit halves. In addition this introduced some parallel processing to an otherwise very sequential set of operations.

When it comes to addition or rotation you might think that propagating the carry across nodes would waste some time and it does, but not as much as you might expect. The low partner node can accumulate carries from up to three additions before propagating it to the high partner in a single addition. 32 bit rotation can be done using the +* instruction to shift bits from T into A then sharing and oring those bits across partner nodes to achieve a 32 bit rotation with the minimum of inter-node communication.

### Layout

The layout of the md5 block diagram is determined to some extent by the need to have partner nodes in communication with each other for carry and rotations. The obvious layout is two rows of nodes one above the other. The block diagram shows high 16 bit nodes in the 200 row and low nodes in the 100 row.

First the 64 word lookup tables were placed on the east end of the layout in nodes 207 and 107. After coding the four bit-wise functions in nodes 205 and 105 it was seen that there was still room to put the MD (Message Digest) buffer in the same node simplifying calculations. There wasn't going to be room to code the rotations in the same node, so that was put along with carry resolution and lookup code for the constant tables in 206 and 106 between the functions and the constant tables. This leaves 204 and 104 to the west of the MD buffer for the message buffer. The message buffer contains 16 32 bit words for a total of 64 bytes. The high 16 bit words of the message buffer are in the high row and the low 16 bit words are of course in the low row. There is room in these nodes to calculate the index into the message buffer. Unfortunately there is not room in any of the aforementioned nodes to calculate the rotation amount, so another pair of nodes is tacked on to the west. Since both rows get the same rotation amount, it is only needs to be calculated in the high row and sent down to the low row. The low member of this pair is free for other code and in fact has code to save the old MD and add it to the new after a message block is hashed.

Five pairs of nodes have been mentioned so far. These nodes are active in calculating the md5 hash once the MD and message buffers have been filled. One of the western most of these nodes, 103, actually just passes the rotation amount calculated by its partner 203 during the hashing phase. Once the md5 algorithm has been applied to the whole message buffer though, this is the node that receives the new message digest and adds it to the old message digest saved there. Before starting work on a new message block this node will send the current MD values over to the pair that contain the MD buffer and functions.

Nodes 206 and 106 need to be adjacent in order for carry to propagate in addition and for rotation, but 207 and 107 only need to communicate with their neighbors 206 and 106 respectively. They could be moved up to 306 and down to 006. 205 and 105 don't need to talk to each other so they could move up to 306 and down to 006 instead of the constant tables. Likewise 204 and 104 don't need to talk to each other, so they could be up in the 300 row and down in the 000 row too. 203 and 103 do need to talk to each other, so they really need to be adjacent as do 202 and 102.

### Beginning and Ending the Application

Another pair of nodes, 202 and 102, is added to the west in order to receive each octet of the message block, assemble them into 16 bit words and distribute those to the high and low buffer nodes. One last node, 002, is added below the low row on the far west side. It receives and passes on message octets, pads the message buffer at the end and gets the other nodes to start their processing. Each node except this one begins by jumping to a neighbor, waiting to be told what to do. This node starts a chain reaction by telling its neighbor to start, that neighbor tells two others to start, each of those tells a neighbor to start until the application is up and running. When the hash is finished each node has returned to its neighbor waiting for instructions except this one which jumps to WARM. This way all the nodes used in the application can be reclaimed to do something else without having to reset the chip. The final addition is node 002. This node accepts octets from the outside and passes them on to node 202, accumulating a bitcount, until a marker is received indicating end of message. At that point node 002 will pad the last message block and insert the bit count before passing the final octets on to node 102.

## Discussion of the details

It's time to talk about the code.

 `md5`` `` lod rot msg md5 con dat`` `` 202 203 204 205 206 207`` `` lod sum low md5 con dat`` `` 102 103 104 105 106 107`` `` oct tst`` `` 002 003``tst`` sends a test stream to 002``oct`` receives octets and pads buffer``lod`` loads msg buffer``rot`` generates rotation amount``sum`` adds new md to old``msg`` message buffer``md5`` md5 buffer and functions``con`` constant generator and rotator``dat`` constant data table``the`` 200`` line works on the high words``the`` 100`` line works on the low words``they communicate to resolve carry for``addition and rotation                         ` 840 list `md5 loader``207`` node`` data high`` 842`` load`` 040``107`` node`` low`` 844`` load`` 040``const and adder and rotator``206`` node`` high`` 846`` load`` 850`` load`` 027``106`` node`` low`` 848`` load`` 850`` load`` 031``md5``205`` node`` high`` 852`` load`` 040``105`` node`` low`` 852`` load`` 040``msg``204`` node`` high`` 0`` org`` 854`` load`` 039``104`` node`` low`` 0`` org`` 854`` load`` 039``rots``203`` node`` high`` 856`` load`` 03B``103`` node`` low`` 858`` load`` 02F``message loader``202`` node`` high`` 860`` load`` 01A``102`` node`` low`` 862`` load`` 027``octet feeder``2`` node`` 864`` load`` 039``3`` node`` 866`` load`` 02C                           `
Starting with the main load block note that the shadow screen on the left has a very simple block diagram with hints as to the functions and placement of each node in the application. The load block on the right follows a pattern of declaring the node being programmed and then loading the source block(s) for that node. The first two lines of yellow words load nodes 207 and 107 with their 64 words of lookup table. Yellow words are executed rather than compiled. Comments show up as a dark shade of gray in html, rather than the white you see in arrayforth. That's because we're using a white background for html rather than the black of arrayforth. Gray words show as light gray italic here. They're used to learn the address at that point in your compiled code.

 `207`` high data``high`` 16`` bit word of`` 32`` bit lookup table       ` 842 list `207`` high data`` 0`` org``D76A`` ,`` E8C7`` ,`` 2420`` ,`` C1BD`` ,``F57C`` ,`` 4787`` ,`` A830`` ,`` FD46`` ,``6980`` ,`` 8B44`` ,`` FFFF`` ,`` 895C`` ,``6B90`` ,`` FD98`` ,`` A679`` ,`` 49B4`` ,``F61E`` ,`` C040`` ,`` 265E`` ,`` E9B6`` ,``D62F`` ,`` 244`` ,`` D8A1`` ,`` E7D3`` ,``21E1`` ,`` C337`` ,`` F4D5`` ,`` 455A`` ,``A9E3`` ,`` FCEF`` ,`` 676F`` ,`` 8D2A`` ,``FFFA`` ,`` 8771`` ,`` 6D9D`` ,`` FDE5`` ,``A4BE`` ,`` 4BDE`` ,`` F6BB`` ,`` BEBF`` ,``289B`` ,`` EAA1`` ,`` D4EF`` ,`` 488`` ,``D9D4`` ,`` E6DB`` ,`` 1FA2`` ,`` C4AC`` ,``F429`` ,`` 432A`` ,`` AB94`` ,`` FC93`` ,``655B`` ,`` 8F0C`` ,`` FFEF`` ,`` 8584`` ,``6FA8`` ,`` FE2C`` ,`` A301`` ,`` 4E08`` ,``F753`` ,`` BD3A`` ,`` 2AD7`` ,`` EB86`` ,``                                 `
Here you see the high 16 bits of lookup table data compiled into node 207. Note that the numbers are dark yellow and italic, in other words hexadecimal. You'll find these numbers in RFC 1321 starting on page 12.

 `107`` low data``low`` 16`` bit word of`` 32`` bit lookup table        ` 844 list `107`` low data`` 0`` org``A478`` ,`` B756`` ,`` 70DB`` ,`` CEEE`` ,``FAF`` ,`` C62A`` ,`` 4613`` ,`` 9501`` ,``98D8`` ,`` F7AF`` ,`` 5BB1`` ,`` D7BE`` ,``1122`` ,`` 7193`` ,`` 438E`` ,`` 821`` ,``2562`` ,`` B340`` ,`` 5A51`` ,`` C7AA`` ,``105D`` ,`` 1453`` ,`` E681`` ,`` FBC8`` ,``CDE6`` ,`` 7D6`` ,`` D87`` ,`` 14ED`` ,``E905`` ,`` A3F8`` ,`` 2D9`` ,`` 4C8A`` ,``3942`` ,`` F681`` ,`` 6122`` ,`` 380C`` ,``EA44`` ,`` CFA9`` ,`` 4B60`` ,`` BC70`` ,``7EC6`` ,`` 27FA`` ,`` 3085`` ,`` 1D05`` ,``D039`` ,`` 99E5`` ,`` 7CF8`` ,`` 5665`` ,``2244`` ,`` FF97`` ,`` 23A7`` ,`` A039`` ,``59C3`` ,`` CC92`` ,`` F47D`` ,`` 5DD1`` ,``7E4F`` ,`` E6E0`` ,`` 4314`` ,`` 11A1`` ,``7E82`` ,`` F235`` ,`` D2BB`` ,`` D391`` ,``                                 `
And here are the low 16 bit words of each constant, compiled into node 107. Both nodes 107 and 207 jump to their right ports to await instructions from their neighbors.

 `206 106`` constant generator rotator adder``both partner nodes start at the same address``to make it easier to get this node started``from outside                                  ` 850 list `206 106`` constant generator rotator adder``here`` 0`` org``get`` 000`` -n`` right b! @p ..`` /`` @+ !p ..`` /`` !b @b ;``go`` 004`` left a! right b!``    `` @p !b ..`` /`` dup or a! ..`` /``    `` 63`` for @ @ @ . + get +c``    `` a push @ rotate +c pop a! ! next ;``014`` org`` 031``                      `
Here is some compiled code, rather than just data. Since this is appnote number one, we'll go into more detail about the source code than will probably be the case in later appnotes. We'll try to point out any arrayforth idioms as well as the meanings of the different colors. Remember that red words are names, green words are compiled, yellow words are executed. Block 850 is common to and compiled by both nodes 206 and 106.

`get` is the same in each node. It grabs the next constant from the neighbor to the right. And each node has the same source for `go`, though as you'll see in the next two blocks, they call different versions of `rotate` and `+c`.

`get` writes the instruction word `@+ !p ..` to the right port. The right neighbor is waiting on that port for this instruction, having had its `a` register initialized to zero. The result is that a value is retrieved from the table and the pointer is incremented for next time. The idiom in arrayforth is to point the a or b register to a neighbor's port and then fetch a word of instruction and store that word to the port. The code for `get` begins by storing the address of the right port into the b register. This is followed by `@p ..`. `@p` fetches the following word in memory onto the data stack. `..` aligns memory to the nearest word by padding with nops. `@+ !p ..` is the instruction word that will be placed on the stack. It is meant to be executed by the neighbor in its right port and what it does there is to fetch the next piece of data, increment the data pointer, and send the data back to the neighbor listening on the right port. Finally `!b @b ;` will be executed on the local node to send the instruction word to its neighbor and receive the data word being sent back. That was a lot of words to about a small amount of code, but this code must be understood for the rest of the program to make any sense.

The main program `go` first initalizes the a and b registers to point left and right. Then the right neighbor is told to execute `dup or a! ..` which initializes the data pointer by putting a zero into a for the first table reading. The rest of the word is a for next loop that runs 64 times. Three words are fetched from the neighbor to the left, the top two are added, `get` fetches a constant from the right and that's added with `+c`, resolving the accumulated carry. Now we save the value of `a` on the return stack because `rotate` is going to change the a register. `@` fetches the rotation amount from our neighbor to the left and the rotation is performed. The result is added via `+c` to what was left on the stack earlier and `a` is restored in order to send the result back to the left neighbor for storage.

 `206`` high constant generator and adder``rotate`` shift right via +* and partner node``to effect a`` 32`` bit rotation``+c`` receive carry from low word and add in     ` 846 list `206`` high constant generator and adder``14`` org``rotate`` 014`` ni-n`` up b! dup dup or``    `` a! push push a pop pop``    `` -if - push push @b pop !b pop then``    `` for +* unext !b drop @b``    `` a 2/ 2/`` FFFF`` and or ;``+c`` 022`` nn-n`` + up b! @b . +`` FFFF`` and ;``027                                           `
Here you see the definitions of `rotate` and `+c` for the high parts of those 32 bit operations.

`+c` can afford to execute `+` without a preceding `.` since the top two stack items will have been stable for a sufficient time in making the call. `b` is set to `up` and the accumulated carry is retrieved from node 106, the low partner. This carry is added and the result is clipped to 16 bits.

`rotate` begins with some initializations. `b` is set to `up` in anticipation of swapping bits with the partner node. `dup dup or` puts a zero on the stack, which is then placed into the a register and into the S register, the second location on the data stack. The rotation amount is on top of the stack. If it's negative that's a signal that we want to rotate by more than 16 places. The partner nodes exchange their T registers in this case. That's equivalent to a rotation of 16 places. Also the rotation amount is inverted via `-` to get the number of places left to rotate the number. The resulting number is pushed onto the return stack by `for` as a loop counter. `for +* unext` loop shifts T into A repeatedly. Finally the T values are swapped between partners and ored with the A values after shifting right twice via `2/ 2/` `ffff` `and` to construct the rotated number.

 `106`` low constant generator and adder``rotate`` shift right via +* and partner node``to effect a`` 32`` bit rotation``+c`` send carry to high word and mask off       ` 848 list `106`` low constant generator rotator and adder``14`` org``rotate`` 014`` ni-n`` up b! dup dup or a!``    `` push push a pop pop``    `` -if - push !b @b pop then``    `` for +* unext @b push !b drop pop``    `` a 2/ 2/`` FFFF`` and or ;``+c`` 022`` nn-n`` + dup dup or push ..``    `` -if pop`` 2`` or push then 2*``    `` -if pop`` 1`` or push then 2/``    `` FFFF`` and pop up b! !b ;``031                                           `
`rotate` and `+c` are defined differently for the low partner.

`+c` still does the addition. The carry has accumulated in the high two bits of an 18 bit word, and we need it in the low two bits. We could have shifted right with `15 for 2/ unext` `ffff` `and`. This takes less code space but a longer time to execute. Instead the carry was constructed by checking the top two bits one at a time.

`rotate` is different only in that the high partner sends before receiving while the low partner receives before sending.

 `205 105`` md5 buffer code``round`` send b , send func+a , pass msg ,``receive new a and store it in md5 buffer``must patch the address where f' is called``see yellow`` 2E`` ,`` and gray`` 2E``note that`` +or`` is a`` 16`` bit operation           ` 852 list `205 105`` md5 buffer code`` 0`` org`` 0`` ,`` 0`` ,`` 0`` ,`` 0`` ,``prep`` 004`` -3`` right b!`` 3`` dup dup or a! ;``toss`` 008`` prep for @+ !b unext ;``grab`` 00B`` prep for @b !+ unext ;``clip`` 00E`` a`` 3`` and a! ;``f'`` 011`` xyz-n`` push over - push and pop pop and``+or`` 014`` nn-n`` over`` FFFF`` or and or ;``g'`` 017`` xyz-n`` a! push a and pop a - and +or ;``h'`` 01B`` xyz-n`` or or ;``i'`` 01C`` xyz-n`` a! push a - +or pop or ;``pass`` 01F`` right b! @b``  send`` left b! !b ;``round`` 023`` 2E`` a! ! dup dup or a!``    `` 15`` for @+ clip @+ clip``    `` dup send @+ clip @+ clip``    `` a push ..`` 02E`` f' . + pop a!``    `` send pass pass @b !+``    `` @+ drop @+ drop clip next ;``md5`` 038``    `` @p round f' @p round g'``    `` @p round h' @p round ; i'``040``                             `
The exact same code is compiled for both nodes 205 and 105. A temporary message digest buffer of four words is reserved at address 0. The words `toss` and `grab` are used to receive the current hash from node 103 at the beginning of a 64 octet block and to send it back toward 103 at the end. Both words are executed remotely by the 204, 104 pair of nodes because there isn't room here for that much code. `clip` ensures that the a register will wrap around from 3 to 0 so it always points into the local md5 buffer.

The four md5 bitwise functions are then defined followed by `pass` and `send` which are used to pass data from right to left or just to send data to the left. Note that `pass` falls through into `send`. There are four rounds for each 64 octet message block. The word `round` first patches the current function into address `02e` and yes, this is self-modifying code. The code just fits into 64 words of RAM this way. Then `round` loops 16 times where it fetches the current hash values, sending the "b" hash value on to the right neighbor before executing the current function and adding it to the "a" hash value. This sum is sent to the right neighbor. Then the message fragment and rotation amount are received from the left and passed on to the right. The right neighbor does its calculations and sends a result back, which is fetched and stored here back to the temporary hash buffer. The hash buffer pointer in the a register is then incremented twice and kept within the range 0-3 by `@+ drop @+ drop clip` before moving on to the next step of this round.

Back to the patching. In the word `md5` you see phrases such as `@p round f'`. The `@p` is compiled into slot 0 with the call to `round` in slot 1. The next word in memory will contain a call to `f'`. This call to f' will not be executed here. It is simply data to be stored into address 2e where it will be executed later.

 `204 104`` message buffer code``msg gets both neighbor pairs started``205,105 and 206,106``it also computes the sequence in which``the message buffer is accessed                ` 854 list `204 104`` message buffer code`` 0`` org``0`` ,`` 0`` ,`` 0`` ,`` 0`` ,`` 0`` ,`` 0`` ,`` 0`` ,`` 0`` ,``0`` ,`` 0`` ,`` 0`` ,`` 0`` ,`` 0`` ,`` 0`` ,`` 0`` ,`` 0`` ,``msgs`` 010`` ni`` pop a! @+ @+ a push``    `` 15`` for dup a! @ !b over . +``    `` F`` and left a! @ !b next drop drop ;``msg`` 01C`` right b! @p ..`` /`` @p b! @p ..`` /``    `` !b left !b @p`` /`` go`` /``    `` !b @p !b @p`` /`` !b grab`` /`` md5`` /``    `` left a!`` 3`` dup push``    `` for @ !b unext !b``    `` msgs`` 1`` ,`` 0`` ,`` f's``    `` msgs`` 5`` ,`` 1`` ,`` g's``    `` msgs`` 3`` ,`` 5`` ,`` h's``    `` msgs`` 7`` ,`` 0`` ,`` i's``    `` @p !b ..`` /`` toss`` /``    `` begin @b ! unext ;``039``                              `
Again, the same code is compiled for both node 204 and node 104.

Each node has its own half of the message buffer. Node 204 has the upper 16 bits of each of 16 32 bit words of message. Node 104 has the lower 16 bits of each 32 bit word. The buffer is filled remotely by nodes 202 and 102 before this node is started up. The word `msg` is kicked off by the neighbor to the left. Nodes 204 and 104 kick off both their neighbors to the right with `grab` and then `md5` and their neighbors to the left with `go`.

The word `msgs` begins by reading the two following constants via `pop a! @+ @+ a push`. The address of the first constant is popped from the return stack and the incremented address is pushed back onto the return stack to continue execution just after the constants. This may seem convoluted but it saves a bit of memory over using conventional constant values with @p. These numbers are used to calculate the current message index at each step of the algorithm. For each round the message index starts at a specified number for that round and is incremented each step by that round's specified increment. The index thus calculated is used to fetch the current chunk of message and send it on to the right for processing. The rotation amount is then fetched from the left neighbor and passed on to the right neighbor.

Finally after all four rounds have run, the right neighbor is told to run `toss` in order to send the temporary hash back to node 103, and this node passes the values on.

 `203`` rots generator``rotation amounts are encoded as``sx`` 1111`` 1111`` 1111`` 1111``col-3 col-2 col-1 col-0 at address zero.``where`` s`` signals need to swap words and``invert this value for rotations greater``than 16 bits in the third slot and`` x`` is don't care.``this node passes the rotation amounts to``both the high word row and the low word row``                                              ` 856 list `203`` rots generator`` 0`` org``29E38`` ,`` B16A`` ,`` 28F4B`` ,`` A059`` ,``send`` n`` dup left a! ! up a! ! ;``keep`` 009`` up b! left a!`` 3`` for @ !b unext ;``put`` 00F`` up b! left a!`` 3`` for @b ! unext ;``ncol`` 015`` for 2/ unext``0col`` 017`` F`` and ;``  1col`` 019`` 3`` ncol ;``2col`` 01B`` dup push`` 7`` ncol pop``    `` -if drop - dup then drop ;``3col`` 021`` 11`` ncol - ;``jump`` 024`` i`` pop + push ;``rots`` 025`` i`` dup 2/ 2/ 2/ 2/ a! @ over``    `` 3`` and jump 0col ; 1col ; 2col ; 3col ;``rotgen`` 02E`` left a! @p ..`` /`` msg`` /`` ! put``    `` 0 63`` for dup rots send`` 1`` . + next``    `` drop keep ;``03B``                             `
Nodes 203 and 103 have very different functions. They should not be considered partner nodes as all the others up to now have been. During each of the 64 steps of the md5 algorithm node 203 calculates the rotation amount needed for `rotate` in nodes 106 and 206. The rotation amount is passed on to node 204 which passes it to 205 which passes it to 206 where it is used. Also 203 passes the rotation amount to node 103 which then passes it on to 106 via 104 and 105. The only function of node 103 during the 64 steps is as a wire, passing data from 203 to 104.

Rotation amounts are encoded into the first four words of node 203. Each of the four words contains the encoded rotation amounts for one round of the algrithm. Within one round there are four rotation amounts to be cycled through. In general the rotation amount is encoded as a single nibble. Only 16 positions of rotation can be encoded in a nibble. It happens that the first two rotation amounts are always less than 16 and the last one is always greater than 16. The third amount is sometimes less, sometimes greater, and once equal to 16. If the third amount is greater than 16 that is encode by setting bit 17. When the rotation amount is greater than 16 that amount is inverted before being sent on to the 206 106 pair, to signal that 16 bit words must be swapped to effect rotation by 16 before performing the rest of the rotation.

The first word of block 203 is `send`, which sends the same data word (rotation amount) to both neighbor nodes 103 and 204. `keep` and `put` are responsible for sending and receiving the temporary hash values from 103 to 205 in the beginning, and from 205 back to 103 in the end. The words `0col 1col 2col 3col` each extract a nibble from the coded word, inverting the word where appropriate, and `jump` implements a computed goto so that the appropriate `ncol` word can be executed by number. The word `rots` transforms an index, 0-63, into a rotation amount to be sent on to the pair 206 106. Bits in the index are used to decide which coded word to fetch and then which nibble to decode. A lot of work, but it would burn a whole 64 word node to simply look up the value directly.

`rotgen` is the main program for this node. It starts by telling neighbor to run `msg`. Then it acts as a wire via `put` passing the temporary hash values from 103 over to 204 which passes them to 205. Now we go into a for next loop for 64 iterations, calculating the rotation amount and sending it on. At last the work `keep` is executed to act as a wire passing the currently calculated hash values back to 103.

 `103`` save and add abcd back``the current message digest is first sent to``the md5 node and finally added back into``this buffer                                   ` 858 list `103`` save and add abcd back`` 0`` org``2301`` ,`` 6745`` ,`` AB89`` ,`` EFCD`` ,``DCFE`` ,`` 98BA`` ,`` 5476`` ,`` 1032`` ,``put`` 008`` dup dup or a!`` 3`` for``    `` left b! @+ !b up b! @+ !b next ;``sum`` 012`` dup dup or a!`` 3`` for``    `` left b! @b @ . + dup`` FFFF`` and !+``    `` up b! @b @ . + over 2*``    `` -if drop`` 1`` . + dup then drop``    `` FFFF`` and !+ next`` ;``sums`` 025`` left a! @p ..`` /`` msg`` /`` ! put``    `` left a!`` 63`` for @b ! unext sum ;``02F``                             `
Node 103 is used to house the calculated md5 hash values between message blocks. The first 8 words of memory are used to buffer this data. The first word defined, `put` is meant to fetch each word of the buffer in sequence and send it to the apppropriate row, high or low. The word `sum` is used at the end of a message block in order to add the new 32 bit hash values to the old ones that were saved here. First the a register is initialized to zero with the idiom `dup dup or a!` and then for each 32 bit hash value the 16 bit halves are fetched from up and left neighbors and added with carry resolved to the hash values preserved here at the beginning of the current message block.

`sums` ties this all together beginning by pointing the a register left and telling the left neighbor to run `msg`. Then it runs `put`, tranferring the current hash values to nodes 205 and 105. During the 64 steps of the algorithm this node acts as a wire, passing rotation amounts on. Finally the word `sum` is executed to receive new hash values and add them to the preserved old values.

The ten nodes mentioned so far implement the complete md5 algorithm for a single 64 octet message buffer. Once the buffer in nodes 204 and 104 has been filled, only these nodes operate.

 `202`` main control``nodes`` 202`` and`` 102`` setup and control``the others``only these two nodes need to be customized``in order to feed in the message and``read out the digest                           ` 860 list `202`` main control`` 0`` org``code`` 203`` left b!`` 20`` for @p !b unext ..``    `` 204`` dup dup or a!`` 15`` for``    `` @p !+ unext ..`` 009`` --l- ; r--- ;`` 00B``highs`` 00B`` right b! dup dup or a!``    `` 8`` for @+ !b unext a push up a!``    `` 15`` for @ !b unext pop a!``    `` @+ !b @+ !b .. @p !b ;`` /`` rotgen`` /``01A``                                `
Nodes 202 and 102 are fancy wire nodes passing the message stream in to nodes 204 and 104. Node 202 here is mostly wire. After starting port pumps in neighbor 203 and its neighbor 204 it passes the 16 bit high parts of the 16 32 bit message chunks in from 102 to 204. The word `code` is really a stream of code to be sent to and executed in the left port by the neighbor, node 203. Some of the code in the stream is in turn sent to 203's neighbor 204 to be executed there.

The first line, labeled `203`, is executed by node 203 in it's right port. It causes 203 to point its b register to its left port and run a port pump, `for @p !b unext ..` through 21 iterations. This causes the next bit of code to be executed in the left port of node 204, which puts a zero into the a register and runs another port pump, `for @p !+ unext ..` for 16 iterations, ultimately storing the 16 high 16 bit parts of the message block into node 204. Node 204 ends up jumping left and node 203 jumps right to await further instructions from 202. The final instruction send is the call to `rotgen` which kicks off node 203 who in turn kicks of the rest of the high row to run 64 steps of the md5 algorithm.

 `                                              ` 862 list `102`` main control`` 0`` org``code`` 103`` left b!`` 20`` for @p !b unext ..``    `` 104`` dup dup or a!`` 15`` for``    `` @p !+ unext ..`` 009`` --l- ; r--- ;`` 00B``half`` p`` push down a! @ @`` 7`` for 2* unext``    `` or pop a! ! ;``lows`` 012`` up b! @p ..`` /`` highs`` /`` !b``    `` right b!`` dup dup or a!``    `` 8`` for @+ !b unext a push``    `` 15`` for right half up half next``    `` pop a! @+ !b @+ !b .. @p !b ;`` /`` sums`` /``027``                              `
Node 102 has a little more work to do than 202 did. It receives the 64 octets of a message buffer from node 002 and assembles them into the high and low 16 bit words of message chunks before sending those on to nodes 202 and 103, eventually to be stored in 204 and 104. The word `code` is identical to that already mentioned in node 202 and sets up similar port pumps in nodes 103 and 104.

`half` does the work of assembling a 16 bit word from two octets, then sends it to the port passed as a parameter. The main program for this node, `lows` first starts node 202 running `highs`. Then it sets up the port pumps in nodes 103 and 104 before reading in octets from 002, assembling them into 16 bit words and sending those on to the high row and the low row. Its final act is to tell node 103 to run `sums` in order to add the new hash values to the old ones saved in node 103.

 `002 main interface``data entry point                              ` 864 list `002 main interface`` 0`` org`` 0`` ,`` 0`` ,`` 0`` ,``add`` 003`` n-n`` @ . + dup`` FFFF`` and !+ ;``count`` 006`` 1`` a!`` 8`` add 2* -if``    `` drop`` 1`` add then drop ;``get`` 00F`` @ -if dup or a!`` 80`` ; then``    `` a if count then a! ;``octets`` 016`` n`` for get !b next ;``ablk`` 01A`` n`` 55`` octets``    `` a if drop`` 7`` octets ; then drop``    `` @+ drop a for``    `` @+ dup`` FF`` and !b`` 7`` for 2/ unext``    `` FF`` and !b next``    `` 0`` a!`` 3`` for @ !b unext warm ;``start`` 031`` here boot``    `` down b! right a!``    `` begin @p !b ..`` /`` lows`` /`` ablk end``039                                           `
Node 002 has the job of receiving message octets delimited by a negative number. It counts the message bits and in the last message block, after receiving the delimiter, the message is padded and the bit count is inserted. As far as neighbor 102 is concerned 002 just sends blocks of 64 message octets.

The first word, `add` is part of counting message bits. It fetches the value pointed to by the a register, then adds it to what was already on the stack. The result is clipped to 16 bits before being stored back into where a points. The a register is incremented for the next addition and the sum is left on the stack, including any accumulated carry in the upper two bits. Here's how `count` uses `add` to count bits in the message. Two words are reserved and initialized to zero at addresses 1 and 2 in memory. The word `count` adds 8 to the low word at address 1. If bit 16 indicates a carry, then 1 is added to address 2.

The word `get` is used to get the next octet. Normally the a register points to the right point and the octet is fetched from there. If the octet is not negative then the bit count is incremented by eight and the octet is left on the data stack. However if the octet was negative, indicating end of message, the top of stack is changed to zero via `dup or` and stored into the a register and an `80` is returned. `80` is the padding octet, a one bit followed by zeros. Now the a register points to address 0 which contains a zero, so further calls to `get` will simply return zero.

`octets` is a factor used within `ablk` that allows us to say `55 octets` in order to read in the first 56 bytes, and `7 octets` for the next 8 bytes when end of message has not yet been received. After the first 56 octets have been read with `get` the a register will tell us if end of message has been received, by containing a zero. When the a register is zero we execute the following code, `@+ drop a for`. The a register contains zero to begin with. Its incremented to contain 1, which is fetched as parameter to for. This loop then executes twice to fetch and send on the bit count stored at addresses 1 and 2 as four octets. Then the a register is set back to zero so that `3 for @ !b unext` will finish padding the message block with zeros. At this point the md5 algorithm is finished, as far as node 002 is concerned, so it jumps to warm where it awaits new instructions. The final md5 hash can now be extracted from node 103.

## To be continued...

• Testing strategy
• Timing estimation and measurement
• Energy estimation and measurement
• Comparison with other implentations
• Opportunities for improvement