Sunday, April 20, 2008
How far we've come... Full speed CAN traffic!
In this post I have a bunch of waveforms to present. The good news is that each zooms in on a rather simple observation, so there isn't much detail to any one of them, but the resulting behavior is too much fun not to show.
To the right is an overview of a Bus Census with something you haven't seen here before. This is the Census being transmitted from PropCAN (as directed by the host software, CDNC) and fifteen, yes 15!, CAN Nodes all eagerly responding. Do you see what happened? In the last post we saw that we couldn't handle even 5 nodes responding, but look at this. Here we have 15 nodes all responding as fast as the CAN Bus will allow, and the new SPI Engine receive side is so fast that we never even needed our 2nd receive buffer. The /Rx0BF line (bright green, near top) shows the signal being asserted, and our SPI Engine back-end Cog is so fast that we empty the buffer while we are still receiving the next message. It is wonderful to see an idea pan out, isn't it? Let's quickly walk through a few significant parts of this traffic, which we'll do by presenting four more pictures.
In this next picture I've zoomed in on the assertion of the /Rx0BF line (buffer full for the 1st message to arrive). I've placed markers M1 and M2 at the time buffer full is signaled and at the time our Rx-side begins to send the buffer unload command, and I've placed a time interval measurement label at the top left. We can clearly see that M1 to M2 = 1.655 uSeconds (or 33 assembly instructions at roughly 50 nSec each). Our SPI Rx back-end Cog was in a loop deciding what needed to be done next when it saw the /Rx0BF line being asserted, at which time it formulated the command to unload the interrupting buffer and initiated sending that command via SPI. Not a bad response time.
As long as we have this view, let's look around. The other waveforms quickly show us the relative speeds of the communications happening at the same time as our unload. From top to bottom, we see the SPI CLK signal showing us our SPI write speed as the command to unload the buffer is sent to the CAN controller (MCP 2515). Below the SPI CLK we see the SPItoCAN signal, which is our command data being sent, and below that is the SPIfmCAN signal, which is data returning from the 2515 (our received message arriving). Finally, at the bottom we see 2515-CAN Rx toggling. This is our next CAN Message arriving while this buffer unload is occurring. Compared to our SPI Tx and Rx rates the CAN bits seem pretty slow. We remember that our CAN rate is 800,000 bits/second. What is our new SPI Tx/Rx rate? Let's look at our next picture to see.
In this next picture we've shifted to the right along our unload so that we can clearly see the SPI Tx and SPI Rx bit rates. The code is asymmetrical for reasons similar to why our Serial Rx and Tx turned out to be asymmetrical: the work we are doing is different for the two sides of the effort. Anyway, I've placed two markers for Tx and two more for Rx, along with two measurements for each, which tell us the duration and bit-rate we are now seeing. Let's look at them.
Markers M1 and M2 span a single byte write from the Propeller to the MCP 2515 device. At the top we see the measurement of 350 nSec, giving a 2.857M bits/second (bps) SPI transmit rate. On the receive side, markers M3 and M4 bracket a single byte being received. Here the measurement is 255 nSec, and the resulting bit-rate is 3.922M bps. We are commanding the CAN device at nearly 3M bps and receiving data at nearly 4M bps. If you remember the performance criteria we set for ourselves, these speeds are better than what we needed to meet our goals. At this rate, and since we are getting it done during the receive of our next CAN message, just how fast are we unloading the receive buffer of the MCP 2515?
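Before we read the answer off the analyzer, it's worth a quick back-of-the-envelope check of how fast we actually need to be. Using the frame sizes from the earlier "Performance - a quick review" post, here is the arithmetic as a small C sketch (purely illustrative; this is not PropCAN firmware):

#include <stdio.h>

/* Back-of-the-envelope: at our 800 Kbps CAN rate, how long does each
   message size occupy the bus? That window is how long we have to get
   a receive buffer unloaded. (Illustrative arithmetic only.)           */
int main(void)
{
    const double can_bps = 800000.0;              /* AMSAT CAN bit rate       */
    const int bits_per_msg[] = { 47, 63, 111 };   /* 0-, 2-, 8-byte payloads  */

    for (int i = 0; i < 3; i++) {
        double usec = (double)bits_per_msg[i] / can_bps * 1e6;
        printf("%3d-bit message occupies the bus for %6.2f uSec\n",
               bits_per_msg[i], usec);
    }
    return 0;
}

Even the shortest, 0-byte-payload message occupies the bus for roughly 59 uSeconds, so that is the window our unload has to fit inside if we never want to fall behind.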
In this fourth picture I've zoomed back out a bit so that we can see the entire unload. I've placed new markers (M5 and M6) along with a new interval measurement. M5 is at the start of the first received byte while M6 is placed at the end of the last byte emptied from the receive buffer. The interval measurement shows that we are unloading the buffer in 29.89 uSeconds, well within even the shortest message's arrival time. Finally, shortly after the last byte is read, /CS is de-asserted and /Rx0BF is also de-asserted, which happens automatically when we unload (read) the receive buffer. Now, let's zoom out one last time to look at the full transaction again, but let's zoom far enough to see the serial transmission of the messages to the connected PC via USB.
Here in this final picture you can see that the dominating feature of the waveform is the overhead of handing the messages off between cogs and finally getting them sent over USB. You can see what I mean when you notice that all 16 of our CAN messages (1 outbound, 15 inbound responses) are gathered in the colorful burst at the left of the waveforms. Then, spread out over the remainder of the picture to the right of that burst, we see the yellow message transmits over serial from the Propeller to the FTDI chip to be sent over USB to our PC. While each message is very fast (each yellow burst is narrow), there is a lot of dead time between messages. Depending upon the rate of messages arriving via CAN, this could spell doom for our system if we don't have enough buffering to handle the majority case. That is, enough space in the Rx Queue that we can place arriving CAN messages into it and hold them there until they are sent via serial to the FTDI chip, at which point each entry is freed up for reuse by another arriving CAN message.
So, in the end, we've got a system that's performing much better and so far is meeting our basic performance specifications. Now it's time to move on to quantifying the performance of PropCAN under normal system loads and with normal traffic patterns. We're almost there!
Friday, April 11, 2008
Object reorganization to achieve SPI performance
This new object diagram may be confusing, but I will try to sort it out in this post. (For the original diagram refer to the January post entitled "Which Cog is doing what?") New in this diagram is the appearance of the TV Debug object-set on the right. The A.2 hardware has its four debug pins carefully chosen so that they can be used to drive the TV-out signals. I even created a little daughter card containing the DAC resistors and the RCA connector attached to a socket header, so it simply plugs into the 4-pin debug header.
(If you are interested, pictures of the hardware can be found at the project web-site: http://propcan.moraco.us in the "PropCAN Detail", "DEMO Board", and "Gallery" sections.)
Let's start this review by going over the list of significant changes between the v1.0 and the v2.0 diagrams:
- /Rx0BF, /Rx1BF pins now route to the ASM portion of the SPI Engine
- /INT now routes to both the SPI Engine and to the CANtrafficTxHandler
- The RxQueue now straddles the ASM portion of the SPI Engine and the CANtrafficTxHandler
- CmdReceiver fully encloses Serial Receive functions while PacketHandler fully encloses the Serial Transmit and the SPI Transmit/Receive functions
Now let's review the reasons why I made these changes.
The Rx buffer full pins now route only to the ASM portion of the SPI Engine so that I could move the entire receive function into the ASM code. This is also the reason the RxQueue now overlaps the ASM portion of the engine. Not only did I move all of the receive handling, I also rewrote it so that the MCP 2515 buffers are unloaded directly into the Queue data structure. There is no longer any middle-man routine transferring unloaded buffer content out of the SPI engine and into the Rx Queue. We were suffering from CAN messages arriving faster than we could unload them; we have to be much faster.
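To make that concrete, here is the shape of the new receive path as a small C sketch. The real code is Propeller assembly inside the SPI Engine cog, and every helper name below (the pin test, the SPI byte transfer, the queue claim/publish calls) is invented for illustration; the point is the flow: watch /Rx0BF, read the 2515's receive buffer straight into the next RxQueue slot, and append the time-stamp.

#include <stdint.h>
#include <stdbool.h>

#define MSG_BYTES   13   /* MCP2515 receive buffer image (header + 8 data bytes) */
#define ENTRY_BYTES 16   /* 13 bytes + status byte + 2-byte relative time-stamp  */

/* Hypothetical platform helpers, stand-ins for the PASM pin and SPI code. */
extern bool     rx0bf_asserted(void);        /* is /Rx0BF currently low?            */
extern void     spi_select(void);            /* assert /CS to the MCP2515           */
extern void     spi_deselect(void);          /* de-assert /CS                       */
extern uint8_t  spi_xfer(uint8_t out);       /* clock one byte out, return byte in  */
extern uint8_t *rxqueue_claim_entry(void);   /* next free 16-byte RxQueue slot      */
extern void     rxqueue_publish_entry(void); /* mark that slot as filled            */
extern uint16_t relative_timestamp(void);    /* time since the previous message     */

void service_rx_buffer0(void)
{
    if (!rx0bf_asserted())
        return;                              /* nothing waiting in receive buffer 0 */

    uint8_t *entry = rxqueue_claim_entry();

    spi_select();
    spi_xfer(0x90);                          /* MCP2515 "Read RX Buffer 0" command  */
    for (int i = 0; i < MSG_BYTES; i++)
        entry[i] = spi_xfer(0x00);           /* bytes land directly in the RxQueue  */
    spi_deselect();                          /* raising /CS auto-clears /Rx0BF      */

    /* Append the relative time-stamp in the final two bytes of the entry
       (entry[13] is left for the optional status byte).                   */
    uint16_t ts = relative_timestamp();
    entry[ENTRY_BYTES - 2] = (uint8_t)(ts >> 8);
    entry[ENTRY_BYTES - 1] = (uint8_t)(ts & 0xFF);
    rxqueue_publish_entry();
}

The important part is the middle loop: the bytes coming back over SPI land directly in the RxQueue entry with no intermediate copy, which is exactly the middle-man we just eliminated.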
The /INT line is now handled in two areas of the code but for two different reasons. The SPI engine watches /INT to know when it can get Tx-complete statuses and when it needs to clear the Tx-complete interrupt. CANtrafficTxHandler receives /INT so that it can deal with Rx errors.
Lastly, I moved to more fully enclosing objects with minimal interface facades for the enclosed objects so that code outside of the enclosure is simpler and easier to read/understand.
Well, now you see what reorganization we did. Next post, I'll show measurements of the surprising performance increase these changes have yielded.
Monday, April 7, 2008
Improving SPI performance - our starting line
Let me show you what I found the performance to be. The picture to the right shows us a number of things (most of which are not good ;-)
First let's look at how we initiated the traffic. I'm set up with 5 CAN nodes on the CAN bus. I sent a message out to the five CAN Nodes to have them identify themselves (part of the AMSAT protocol). Locating this in the logic analyzer trace, we first see the serial RxD (Red) traffic at the bottom left, which is followed by the serial TxD (Yellow) "OK" response. Next we have the MCP2515 transmit buffer load (SPI-labelled signals in the middle) and then the send-buffer command, again via SPI. Then we see the actual CAN transmit (Green) signal at the bottom, followed by the CAN receive (Red) with the Green CAN-Tx ACKs for each arriving CAN message.
Notice how quickly these messages arrive? CAN is a great little protocol in that all devices listen to all traffic on the bus and they inject their messages as soon as they can. They wait only the mandatory gap between successive messages (just a few bit times) and then the next message is transmitted. This means that as a receiving device we wait around for traffic which will come in bursts, and sometimes, as in this test case, the traffic will arrive in maximum-speed bursts. How fun! Well, OK, it's not fun at all. It means that our device has to support running at the best possible speed right away. The only freedom we have is that the MCP2515 can double-buffer received traffic, but that is very little additional freedom.
Now, turning our attention to the top signals, we see the /INT line (Red) being asserted, which is our transmit-complete signal to the Propeller. We then see the /Rx0BF signal (Green) asserting, followed almost exactly one CAN message later by the /Rx1BF signal (Yellow) asserting. These are the "receive buffer zero full" and "receive buffer one full" signals. This is all working exactly as we want. We are being notified of each event as we intended!
The first indication of performance issues with our prototype code is that we don't begin to unload the 2nd message until after all 5 messages have arrived. (The first unload occurs during the tail end of the 4th message's arrival, and the 2nd unload occurs quite a bit after the last message arrived.) Given that we only have two receive buffers, we just proved that we lost 3 of the 5 arriving messages. We are simply not fast enough.
Our clearing of the transmit-complete interrupt can happen much faster (it should likely happen before the first message arrives), and certainly our receives need to be much faster. In fact, this almost certainly proves that we can't have a separate Cog watching the /Rx0BF and /Rx1BF lines and then asking the SPI back-end Cog to unload the buffers, as we do here. That simply isn't fast enough.
This means that our initial functional decomposition and assignment of responsibility amongst Cogs is not going to work. It looks like we have to move some of the transmit acknowledgement handling and most, if not all, of the receive handling. Well, it's back to the drawing board for me to figure out which Cog needs to do what, one more time...
In my next post I'll describe how things are to get rearranged and then I'll follow up with measurements of the new organization.
Monday, March 31, 2008
New Serial Driver testing results
I spent my time today testing and making some late adjustments to code so that it would run correctly. As you remember in my last post I noted that I was running Rx and Tx at what Windows calls 921,600 baud and it was working well. However, next I had to modify my Windows FTDI driver installation to get to higher baud rates so I could continue testing.
After a few false starts (read- two driver reinstallations and a first incorrect attempt at modifying the 3 alias baud rate entries...), I managed to get my FTDI drivers configured correctly. What's fun is how I know I did. In my last efforts yesterday I upgraded my interaction with the autobaud routines to show detailed findings when asked and I also added a sign-on transmission which happens immediately after the autobaud completes. This is where it gets fun! I'll show you.
First, here are my speed aliases for my updated serial drivers:
- 300 baud -> 3M baud
- 1,200 baud -> 1.5M baud
- 2,400 baud -> 2M baud
I chose these aliases so that I would have some hope (in my older age ;-) of remembering the new alias values. I also have no desire to be running this device at those original baud rates.
When I first start PropCAN now and send it a $0d <CR> it responds with the following message:
* PropCAN connected at 923076 baud, (92307 chars/sec) *
(It's pretty easy to tell when things are working well enough for it to send this <grin>. The chars/sec figure is simply the baud rate divided by 10: one start bit, eight data bits, and one stop bit per character.)
So I select 1200 baud (intending to get 1.5M baud) and connect to PropCAN and I see the following:
* PropCAN connected at 1500000 baud, (150000 chars/sec) *
I'm smiles all over! So I ask it its current configuration by entering ?<CR> and get the following:
D PropCAN - A Propeller-based USB to CAN Controller by KZ0Q
D0 F/W Debug: OFF
VA201
O? CAN Closed
S7 CAN at 800 Kbps
X0 AUTO Fwd: OFF
Z0 Time-stamp: OFF
I move on to 2400 (new 2.0 Mbaud) and try once more:
* PropCAN connected at 2000000 baud, (200000 chars/sec) *
Now I'm grinning ear to ear... but then I entered the ? command and got nothing. Then I moved up to 300 baud (3M baud) and again got nothing... Hmm, this is an amazing start to my testing, but I've got work to do.
Now remember, I have an analyzer connected which is currently sampling at a 5 nSec rate, so I can quite accurately determine whether what is being sent and received is actually at the bit rates I think it should be. I verified each of these speeds and yes, the driver is correctly sending at the 1.5, 2.0 and 3.0 Mbaud rates!
This is where I had to go back and adjust things. My transmit side needed some code tweaking as it was exhibiting some really fat start-bits at these three speeds. I was able to make better instruction choices to bring this into alignment really well. On the receive side it took some studying. However, eventually I replaced a couple of test and jmp loops with waitpne/waitpeq and the receive side snapped to as well. After these changes, I was able to fully interact at 2.0M baud and I also retried my 300 baud ;-) test with the following results:
* PropCAN connected at 3000000 baud, (300000 chars/sec) *
So, in the end, this has been a great day. Now let's look at what's happened to our overall timing picture. I measured a full CAN transmit along with the CAN Node's response being sent back via USB. (I've no markers in this picture, so I'll talk in signal color and location instead...)
At the left, just up from the bottom, we see two signals (Red and Yellow) which are labelled as PropSIO RxD and TxD. I make the red color receive so these are easy to remember. The red descending blip at the left is the packet in ASCII coming in over serial, "T1430<CR>", and the yellow blip following it is the <CR> reply back via USB. Moving on to the bottom two lines (red and green this time, 2515-CAN labels) we see our CAN message transmission in green, and we see our CAN receiver seeing our full outgoing message (our receiver always sees our own transmissions in the MCP 2515). Then we see the CAN Node replying to the message with a longer (8-byte payload) message. This is the next red burst. Just near the end of the red burst you see a green blip. This is the MCP 2515 ACK'ing the message (as each listener does on the CAN bus). Now, look all the way to the right and you see the ASCII representation of the 8-byte payload message being sent to the PC via USB: "t36286263623726040034<CR>" (that's a standard frame with ID $362, length 8, and the payload bytes 62 63 62 37 26 04 00 34 in hex). Finally, and less important but so you know, the burst of activity above the CAN signals we just spoke of is the SPI traffic with interrupts.
The important part of this is seeing the relative time now taken by each of these actions. Just "eyeballing" the picture you can see that our serial transmission (in Yellow, at 2M baud) is about the same width (takes about the same amount of time) as the CAN message arriving from the CAN Node (in Red). Contrast this with the 2nd picture, originally shown in the Jan 22 post "Anatomy of a full transaction - Part 2". In this older picture we again see the serial transmission burst (this time in Red) at the right, but look how much time it was taking: 1.2 mSec, versus the 128.9 uSec of the CAN message arriving!
I think you'll agree this is exactly the improvement we've needed to meet our timing goals for this project. We needed 1.5M baud and we've got a functioning 2.0M baud, giving us timing margin, and in fact our transmit side works at 3.0M baud according to our testing.
This has been some fun work on serial drivers for PropCAN! We see the autobaud object working well at all speeds of the FT232R. We see the serial transmit object working with all speeds as well. And finally we have the serial receive object working at the supported speeds up through 2.0M baud. This is all but the highest 3.0M baud speed. We've done ok.
Now it's time to work on the next performance upgrade for the SPI subsystem. I'm off to study what to do next there... I'll be posting more later this week after I've made some progress.
Sunday, March 30, 2008
Circular ring of Buffers for Serial In/Out - fast!
Our study of the serial traffic has shown us that this application has string-length maximums and an overall performance need. Let's make use of this knowledge and transfer from Cog to Hub a long at a time, on command (max string length) boundaries, not character boundaries. In the highest traffic cases we will then have 6 transfers (a 22-character message fits in six longs) where we were seeing 22 transfers earlier. Let's also not make our Cogs wait on each other: by providing more than one buffer, a buffer can be being transmitted while another is being filled. Likewise, a buffer can be filled by the receiving Cog while another is being acted upon by the command-processing Cog.
In our serial transmit case we can transmit a buffer at full speed and not break stride until we next need another buffer. If another has been prepared we simply switch to it and start sending it; again at full speed.
In our serial receive case the original object was limited by our ability to dump characters into Main RAM. This severely impacted our ability to be ready to receive the next character. Since we now know that the characters streaming at us are broken into strings no longer than the maximum command/message size, we can defer writing into Main RAM until we hit one of those natural boundaries where we have more time. This means that we can receive serial data from programs controlling the PropCAN at the full serial rate without missing characters!
The command protocol for PropCAN ensures that activities are gated by acknowledgements, which means that only one, or sometimes a couple, of commands will arrive before an acknowledgement must be sent, at which time the sender stops sending and waits for a response. This is great in that it allows our new routines to perform well without reaching the performance limits that they, too, have.
These new routines are built around a new data structure whose contents control the actions of the single producer and single consumer of the data held within it. The new data structure is simple. We have an array of pointers to fixed-size buffers. The number of pointers in the array is adjustable. The fixed-size buffers contain a length/flag byte, and the rest of the space is for the data (to be received or to be transmitted). The fixed length of these buffers is the same for all buffers pointed to by the array but is also adjustable. For the PropCAN device we use 32-byte fixed-length buffers, and the array pointing to them consists of 4 entries pointing to four unique 32-byte buffers. We chose 32 because it is the next power of two greater than our maximum 22-character command/message. We chose 4 rather arbitrarily (read: based on "no real data"), but it can be adjusted separately for Transmit and for Receive as we discover what our real depth needs to be. The Transmit and Receive sides each have their own independent data structure instance (array of pointers and the set of buffers to which they point).
Did I catch you on my having an array of pointers to fixed-length buffers? Isn't a concatenated set of fixed-length buffers also an array? Why have the array of pointers to the array of buffers? The answer is really quite simple. It's the old standard trade-off between memory and performance. Think of this array of pointers as a pre-calculated set of answers. With the array of pointers I have much less code, and therefore much less execution time, in calculating which buffer will be used next. I simply move the preparation of these pre-calculated answers out of the critical path, from when I need to access buffers to when I'm starting up the PropCAN device; a much less time-critical moment.
I mentioned that the length byte in each fixed buffer is really both length and flag (dual purpose). Let's look at how that works. The array of pointers to these buffers lets us easily treat the set of buffers as a circular list. So we'll let the producer start at array[0]'s buffer, and when it is filled the last value written will be the length byte. The consumer will also start at array[0] and will not consume the buffer until the length byte becomes non-zero. When the consumer is finally done with the buffer it zeros out the length byte and moves on to array[1]'s buffer, waiting for it to have a non-zero length. Likewise, our producer sets a length in array[0]'s buffer and then moves on to filling array[1]'s buffer. When either side goes looking for the next buffer after finishing the last one in the array (in this case array[3]'s), it wraps its index back to zero and starts again with array[0]'s buffer. So we have the buffers being used in a circular fashion, and we have the length field within each buffer being used as a "buffer is empty - can be filled" or "buffer is full - can be emptied" flag.
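To make the scheme concrete, here is a minimal sketch of the structure and the producer/consumer rules written in C. The real objects are Spin and Propeller assembly, and all of the names here are mine (illustrative only); the sizes and the rules, though, are exactly the ones just described:

#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define BUF_COUNT 4                 /* adjustable: depth of the ring          */
#define BUF_SIZE  32                /* adjustable: 1 length/flag byte + data  */

/* Each buffer: byte 0 is the length (0 = empty/available), rest is data.    */
static uint8_t  buffers[BUF_COUNT][BUF_SIZE];
static uint8_t *ring[BUF_COUNT];    /* pre-calculated pointers to the buffers */
static int      prod_idx, cons_idx; /* producer and consumer positions        */

void ring_init(void)
{
    for (int i = 0; i < BUF_COUNT; i++)
        ring[i] = buffers[i];       /* the "pre-calculated answers", done once */
    prod_idx = cons_idx = 0;
}

/* Producer: returns 0 on success, -1 if the next buffer is still full. */
int ring_put(const uint8_t *data, uint8_t len)
{
    uint8_t *buf = ring[prod_idx];
    if (buf[0] != 0 || len > BUF_SIZE - 1)
        return -1;
    memcpy(&buf[1], data, len);
    buf[0] = len;                               /* written last: "buffer is full" */
    prod_idx = (prod_idx + 1) % BUF_COUNT;      /* wrap back toward array[0]      */
    return 0;
}

/* Consumer: returns number of bytes copied out, 0 if nothing is ready. */
int ring_get(uint8_t *out)
{
    uint8_t *buf = ring[cons_idx];
    uint8_t  len = buf[0];
    if (len == 0)
        return 0;
    memcpy(out, &buf[1], len);
    buf[0] = 0;                                 /* "buffer is empty - can be filled" */
    cons_idx = (cons_idx + 1) % BUF_COUNT;
    return len;
}

int main(void)
{
    uint8_t msg[BUF_SIZE];
    ring_init();
    ring_put((const uint8_t *)"T1430\r", 6);    /* e.g. one command string */
    printf("consumed %d bytes\n", ring_get(msg));
    return 0;
}

On the Propeller the producer and consumer live in different Cogs and the structure lives in Main RAM, but the hand-shake is the same: the length byte is the only coordination the two sides need.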
Well, it's time to end this post, but first let me describe the state of the code since my Wednesday post. I've implemented and tested the new "circular fixed-size buffer" handling in the Serial Receive and Serial Transmit objects. I've run them at baud rates from 19,200 to 923,076 without error or data loss. Tomorrow I'll update my FTDI driver installation on my Windows XP PC so I can test them at 1.5M baud and 2M baud (I know they won't run at 3M baud so I won't go to the FTDI 3M baud during my testing...<sigh>)
My next post will show Logic Analyzer waveforms of the areas shown in earlier posts where we saw the serial traffic taking so long. We hope to see the serial traffic no longer dominating the waveform as it did in the past. We hope now to be taking up only as much time as we really need!
Performance - a quick review
The AMSAT CAN bus runs at 800,000 bps. We have three CAN message sizes in the protocol: 0-byte payloads, 2-byte payloads, and 8-byte payloads. If the bus traffic were the maximum load of each of these message sizes, the traffic rates would be as follows:
- 0-byte payload: min. of 47-bits per message, max 17,021 messages / sec.
- 2-byte payload: min. of 63-bits per message, max 12,698 messages / sec.
- 8-byte payload: min of 111-bits per message, max of 7,207 messages / sec.
PropCAN translates these messages into ASCII strings and sends them up the USB interface to the connected PC. After formatting in ASCII, let's look at what happens to the message size:
- 0-byte payload: tiiiL<CR> (6-chars x 17,021 /sec = 1,021,277 bits per sec.)
- 2-byte payload: tiiiL0011<CR> (10-chars x 12,698 /sec = 1,269,841 bits per sec.)
- 8-byte payload: tiiiL0011223344556677<CR> (22-chars x 7,207 /sec = 1,585,586 bits per sec.)
OK, that's a lot of numbers but it shows us something interesting. It shows us that two message formats cause different ends of PropCAN to be working the hardest (at the fastest rate) and NOT at the same time. The 0-byte payload messages (the shortest ones) make our SPI offload run at maximum performance (1.8 Mbits/Sec) but the demand on the Serial Transmit routines is lower. However, the 8-byte payload messages don't load the SPI offload routines as much but now our Serial Transmit routines must run at 1.5Mbps! And now you see where I came up with my ~1,500,000 baud quoted in my opening paragraph.
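For anyone who wants to check the arithmetic, here it is in one place as a small C sketch. The only assumption beyond the numbers above is 10 bits per serial character (one start bit, eight data bits, one stop bit):

#include <stdio.h>

int main(void)
{
    const double can_bps = 800000.0;              /* AMSAT CAN bus rate           */
    const int    frame_bits[]  = { 47, 63, 111 }; /* 0-, 2-, 8-byte payloads      */
    const int    ascii_chars[] = {  6, 10,  22 }; /* tiiiL...<CR> string lengths  */
    const double bits_per_char = 10.0;            /* start + 8 data + stop        */

    for (int i = 0; i < 3; i++) {
        double msgs_per_sec = can_bps / frame_bits[i];
        double serial_bps   = msgs_per_sec * ascii_chars[i] * bits_per_char;
        printf("%3d-bit frame: %8.0f msgs/sec -> %2d chars each -> %10.0f serial bits/sec\n",
               frame_bits[i], msgs_per_sec, ascii_chars[i], serial_bps);
    }
    return 0;
}

The last line is the one that matters most: full-rate 8-byte-payload traffic needs just under 1.6M serial bits per second, which is where that ~1.5M baud figure comes from.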
Now we've reminded ourselves of the max SPI offload rate we need, the max message handling rate we need, and the maximum serial transmit rate we need. Our goal is to not let our I/O performance fall short at any point in the design, which would force us to buffer messages and then, ultimately, fail to keep up with traffic arrival rates. We know we've got work to do to make this system perform at the level we need.
Now that we've clarified our goals, let's move on to the topic for the next post: my now working code for the new Serial Transmit and Receive Objects.
Wednesday, March 26, 2008
Internal message storage formats
When last we discussed the internals, we were musing about all the "tuning" still ahead of us in terms of the performance of our existing "proof of concept" drivers. It is this driver tuning which will be the subject of my next posts, but in this post let's get into how we are choosing to move data through the system. Reducing data movement and transformation is our first step in tuning this system's performance.
Our goal is "best possible throughput" and to meet this we must be efficient. This means we want a minimum of format translations and a minimum of copying data from one location to another within propCAN. Let's look at our USB and CAN interfaces to see in what natural forms our traffic exists and this should lead us to defining what storage forms we will use. First let's look at the USB side.
Messages to be sent via CAN arrive as ASCII strings via USB. These strings are anywhere from 5 to 22 bytes for standard mode (11-bit ID) CAN packets and 11 to 27 bytes for extended mode (29-bit ID) packets. In addition to messages, commands to propCAN and status strings from propCAN are also sent via USB but these are all smaller than our 27-byte longest message. So this then is our USB side definition:
- USB I/F: strings are sent and received which are 2-bytes to 27-bytes in length.
Now let's look at our CAN side. The MicroChip MCP 2515 receives packets into 13-byte buffers with an additional (optional) byte of status at the end. Both our standard and extended messages fit into this size of buffer. In order to send a message, the 2515 also wants the message formatted into this same layout, for the most efficient hand-off from the Propeller to the 2515. Also, we have an additional requirement to time-stamp arriving messages (relative time from packet to packet), so we'll add another two bytes to this length. So here we have our CAN side definition:
- CAN I/F: messages are in 14-byte buffer with two byte time-stamp (16-bytes total).
If we round our needed buffer sizes up to multiples of longs (our largest native Propeller data size) we end up with 32-byte buffers on the USB side and 16-byte buffers on the CAN side. Let's look at the resulting minimum transforms we end up with:
- SEND side: a message arrives as an ASCII string received by our Serial Receive COG. It transfers the message into a 32-byte buffer in Main RAM. A pointer to this 32-byte message is eventually handed to validation and then conversion code. This conversion code grabs a 16-byte buffer, again in Main RAM, and re-formats the ASCII text into an MCP 2515 layout message (both buffer layouts are sketched just after this list). The pointer to this 16-byte buffer is then passed to the SPI send routines, at which time the buffer is accessed by the SPI cog and its bytes are sent to the MCP 2515 device. Since the layout of the first 13 bytes in the buffer is exactly what the 2515 wants, the transmission to the 2515 is efficient: a command followed by 13 bytes of data (actually, fewer bytes need be sent if the message payload is less than the full 8 bytes.)
- RECEIVE side: this is pretty much the same effort but in reverse sequence. The 2515 receives the message and then sends an interrupt to the Propeller. Upon receiving the interrupt a 16-byte buffer is populated with the data offloaded from the 2515. We append the time-stamp of the unload effort as the last couple of bytes and the rest is the reverse of the send side effort.
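To make the two buffer shapes concrete, here they are as C structs. The field names are mine and purely illustrative (the real definitions live in the Spin objects), but the sizes match what we just described:

#include <assert.h>
#include <stdint.h>

/* CAN-side buffer: the 13 bytes laid out the way the MCP2515 wants them,
   one status byte, and our two-byte relative time-stamp; 16 bytes total.  */
typedef struct {
    uint8_t sidh, sidl;      /* standard ID bits (and extended-ID flag)    */
    uint8_t eid8, eid0;      /* extended ID bits                           */
    uint8_t dlc;             /* data length code: payload is 0..8 bytes    */
    uint8_t data[8];         /* payload                                    */
    uint8_t status;          /* optional status byte from the 2515         */
    uint8_t timestamp[2];    /* relative time, appended on receive         */
} can_buf_t;

/* USB-side buffer: one length/flag byte plus room for the longest ASCII
   message string (27 characters), rounded up to 32 bytes total.           */
typedef struct {
    uint8_t length;          /* 0 = empty, otherwise characters used       */
    uint8_t text[31];        /* e.g. "t36286263623726040034" plus <CR>     */
} usb_buf_t;

static_assert(sizeof(can_buf_t) == 16, "CAN-side buffer must be 16 bytes");
static_assert(sizeof(usb_buf_t) == 32, "USB-side buffer must be 32 bytes");

int main(void) { return 0; }

The SEND-side conversion described above is then just a walk from a usb_buf_t to a can_buf_t (parse the ASCII ID, length, and hex payload into the first 13 bytes), and the RECEIVE side is the same walk in reverse, with the time-stamp appended at the end.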
Now we are beginning to understand the data flow through the propCAN system. We are not moving data around or reformatting it unless we need to. We've normalized our system buffers in Main RAM to 16- and 32-byte objects, convenient sizes for the Propeller.
Next we'll look at an approach to getting our serial drivers to be much better performing.