Sunday, April 20, 2008

How far we've come... Full speed CAN traffic!

In the last post I presented the new object organization. In the end we had to move responsibilities around (amongst cogs) in order to get the system to be as responsive as we needed. I started with simple symmetry and had to abandon it. At some level this surprised me, but our goals had to be met, so here we are.

In this post I have a bunch of waveforms to present. The good news is that each zooms in on a rather simple observation, so there isn't much detail to any of them, but the resulting behavior is too much fun not to show.

To the right is an overview of a Bus Census with something you haven't seen here before. This is the Census being transmitted from PropCAN (as directed by the host software, CDNC) and fifteen, yes 15!, CAN Nodes all eagerly responding. Do you see what happened? Remembering earlier posts, we saw that we couldn't handle even 5 nodes responding, but look at this. Here we have 15 nodes all responding as fast as the CAN Bus will allow, and the new SPI Engine receive side is so fast that we never even needed our 2nd receive buffer. The /Rx0BF line (bright green, near top) shows the signal being asserted, and then our SPI Engine back-end Cog is so fast that we empty the buffer while we are still receiving the next message. It is wonderful to see an idea pan out, isn't it? Let's quickly walk through a few significant parts of this traffic, which we'll do by presenting four more pictures.

In this next picture I've zoomed in on the assertion of the /Rx0BF line (buffer full for the 1st message to arrive). I've placed markers M1 and M2 at the time buffer full is signaled and at the time our Rx-side begins to send the buffer unload command, and I've placed a time interval measurement label at the top left. We can clearly see that M1 to M2 = 1.655 uSeconds (or 33 assembly instructions). Our SPI Rx back-end Cog was in a loop deciding what needed to be done next when it saw the /Rx0BF line being asserted, at which time it formulated the command to unload the interrupting buffer and initiated sending the command via SPI. Not a bad response time.
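
For the curious, the conversion from time to instruction count is simple arithmetic. It assumes the usual 80 MHz Propeller system clock, where an ordinary PASM instruction takes 4 clocks, or 50 nSec:

    1.655 uSec / 50 nSec per instruction ≈ 33 instructions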

As long as we have this view, let's look around. The other waveforms quickly show us the relative speeds of each of the communications happening at the same time as our unload. From top to bottom, we see the SPI CLK signal showing us our SPI Write speed as the command to unload the buffer is sent to the CAN controller (MCP 2515). Below the SPI CLK we see SPItoCAN, which is our command data being sent, and below that is the SPIfmCAN signal, which is data returning from the 2515 (our received message arriving). Finally, at the bottom we see 2515-CAN Rx toggling. This is our next CAN Message arriving while this buffer unload is occurring. Compared to our SPI Tx and Rx rates the CAN bits seem pretty slow. We remember that our CAN rate is 800,000 bits/second. What is our new SPI Tx/Rx rate? Let's look at our next picture to see.

In this next picture we've shifted to the right along our unload so that we can clearly see the SPI Tx and SPI Rx bit rates. The code is asymmetrical for reasons similar to why our Serial Rx and Tx turned out to be asymmetrical: the work we are doing is different for the two sides of the effort. Anyway, I've placed two markers for Tx and two more for Rx, along with two measurements for each, which tell us the duration and bit-rate we are now seeing. Let's look at them.

Markers M1 and M2 span a single bit within a byte write from the Propeller to the MCP 2515 device. At the top we see that this bit is 350 nSec long, resulting in a 2.857M bits/second (bps) SPI transmit rate. On the receive side, markers M3 and M4 bracket a single received bit. The duration of this bit is 255 nSeconds, and the resulting bit-rate is 3.922M bps. We are commanding the CAN device at nearly 3M bps and receiving data at nearly 4M bps. If you remember the performance criteria we set for ourselves, these speeds are better than what we needed to meet our goals. At this rate, and since we are getting it done during the receive of our next CAN message, just how fast are we unloading the receive buffer from the MCP 2515?
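
Using that same 50 nSec-per-instruction figure (again assuming an 80 MHz system clock), the measured bit times translate into a per-bit instruction budget, which is where the Tx/Rx asymmetry shows up:

    Tx:  350 nSec per bit / 50 nSec ≈ 7 instructions per bit  ->  2.857M bps
    Rx:  255 nSec per bit / 50 nSec ≈ 5 instructions per bit  ->  3.922M bps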

In this fourth picture I've zoomed back out a bit so that we can see the entire unload. I've placed new markers (M5 and M6) along with a new interval measurement. M5 is at the start of the first received byte while M6 is placed at the end of the last byte emptied from the receive buffer. The interval measurement shows that we are unloading the buffer in 29.89 uSeconds. Finally, just shortly after the last byte is read, /CS is de-asserted and /Rx0BF is de-asserted as well, which happens automatically when we unload (read) the receive buffer. Now, let's zoom out one last time to look at the full transaction again, but let's zoom far enough to see the serial transmitting of the messages to the connected PC via USB.
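
That 29.89 uSec figure lines up nicely with the SPI rates we just measured. If we assume the unload is one command byte clocked out at the Tx rate followed by the 13 bytes of a full message (SIDH, SIDL, EID8, EID0, DLC, and 8 data bytes) clocked in at the Rx rate, we would expect roughly:

    command:   1 byte  x 8 bits x 350 nSec ≈  2.8 uSec
    message:  13 bytes x 8 bits x 255 nSec ≈ 26.5 uSec
                                    total  ≈ 29.3 uSec

That is right on top of the measured 29.89 uSec, and still small compared to the roughly 160 uSec a full message occupies on the bus at 800 Kbps, which is why receive buffer 1 never came into play in this capture.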

Here in this final picture you can see that the overhead of handing the messages off between cogs and finally getting them sent over USB is the dominating feature of the waveform. You can see what I mean when you notice that all 16 of our CAN messages (1 outbound, 15 inbound responses) are gathered in the colorful burst at the left of the waveforms. Then, spread out over the remainder of the picture to the right of the burst, we see the yellow message transmits over serial from the Propeller to the FTDI chip to be sent over USB to our PC. While each message is very fast (the yellow burst is narrow), there is a lot of dead time between each message. Depending upon the rate of messages arriving via CAN, this could spell doom for our system if we don't have enough buffering to handle the majority case. That is, we need enough space in the Rx Queue that we can place arriving CAN messages into it and hold them there until they are sent via serial to the FTDI chip, at which point each Rx Queue entry is freed up for use by another arriving CAN message.
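
For readers who like to see the idea in code, here is a minimal sketch of the kind of Rx Queue being described: a fixed-size ring of message slots with the SPI receive side as the producer and the serial/USB side as the consumer. It is written in C for readability rather than Spin/PASM, and all of the names and sizes are illustrative rather than PropCAN's actual structures.

    #include <stdint.h>
    #include <stdbool.h>

    #define RXQ_SLOTS  16              /* must be deep enough to absorb a burst */

    typedef struct {
        uint32_t id;                   /* 29-bit CAN identifier        */
        uint8_t  dlc;                  /* data length code, 0..8       */
        uint8_t  data[8];
    } can_msg_t;

    typedef struct {
        can_msg_t        slot[RXQ_SLOTS];
        volatile uint8_t head;         /* written only by the producer */
        volatile uint8_t tail;         /* written only by the consumer */
    } rx_queue_t;

    /* Producer side: called as each buffer is unloaded from the MCP2515. */
    static bool rxq_push(rx_queue_t *q, const can_msg_t *m)
    {
        uint8_t next = (uint8_t)((q->head + 1) % RXQ_SLOTS);
        if (next == q->tail)
            return false;              /* queue full: a message would be lost */
        q->slot[q->head] = *m;
        q->head = next;
        return true;
    }

    /* Consumer side: called when the serial side is ready to forward a message. */
    static bool rxq_pop(rx_queue_t *q, can_msg_t *m)
    {
        if (q->tail == q->head)
            return false;              /* empty */
        *m = q->slot[q->tail];
        q->tail = (uint8_t)((q->tail + 1) % RXQ_SLOTS);
        return true;
    }

The depth of the ring (RXQ_SLOTS here) is exactly the "enough buffering to handle the majority case" knob described above.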

So, in the end, we've got a system that's performing much better and so far is meeting our basic performance specifications. Now it's time to move on to quantifying the performance of PropCAN under normal system loads and with normal traffic patterns. We're almost there!

Friday, April 11, 2008

Object reorganization to achieve SPI performance

I finally revised the PropCAN object diagram so I can continue this SPI performance discussion! ;-) Last post I identified why the CAN receive system had to be reorganized: we found that the draft organization was too slow to unload the CAN messages as they arrived!

This new object diagram may be confusing, but I will try to sort it out in this post. (For the original diagram, refer to the January post entitled "Which Cog is doing what?") New in this diagram is the appearance of the TV Debug object-set on the right. The A.2 hardware has its four debug pins carefully chosen so that they can be used to drive the TV-out signals. I even created a little daughter card which contains the DAC resistors and the RCA connector attached to a socket header, so it simply plugs into the 4-pin debug header.

(If you are interested, pictures of the hardware can be found at the project web-site: http://propcan.moraco.us in the "PropCAN Detail", "DEMO Board", and "Gallery" sections.)

Let's start this review by going over the list of significant changes between the v1.0 and the v2.0 diagrams:

  • /Rx0BF, /Rx1BF pins now route to the ASM portion of the SPI Engine
  • /INT now routes to both the SPI Engine and to the CANtrafficTxHandler
  • The RxQueue now straddles the ASM portion of the SPI Engine and the CANtrafficTxHandler
  • CmdReceiver fully encloses Serial Receive functions while PacketHandler fully encloses the Serial Transmit and the SPI Transmit/Receive functions

Now let's review the reasons why I made these changes.

The Rx buffer full pins now route only to the ASM portion of the SPI Engine so that I could move the entire receive function into the ASM code. This is also the reason the RxQueue now overlaps the ASM portion of the engine. Not only did I move all of the receive handling, but I also rewrote it so that the MCP 2515 buffers are unloaded directly into the Queue data structure. There is no longer any middle-man routine transferring unloaded buffer content out of the SPI engine and into the Rx Queue. We were suffering from CAN messages arriving faster than we could unload them; we have to be much faster.
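
As a sketch of what "unloading directly" looks like at the SPI level, here is the idea expressed in C (illustrative only; the real code is PASM). It uses the MCP2515's READ RX BUFFER instruction, which starts reading at RXB0SIDH (0x90) or RXB1SIDH (0x94) without a separate address byte, and which clears the corresponding buffer-full flag automatically when /CS is raised. The spi_* helpers and the message struct stand in for the SPI engine's real primitives and for a slot in the Rx Queue, and the identifier unpacking assumes extended (29-bit) IDs.

    #include <stdint.h>

    #define MCP2515_READ_RXB0  0x90    /* read starting at RXB0SIDH */
    #define MCP2515_READ_RXB1  0x94    /* read starting at RXB1SIDH */

    typedef struct {                   /* stand-in for one Rx Queue slot */
        uint32_t id;                   /* 29-bit extended identifier     */
        uint8_t  dlc;                  /* data length code, 0..8         */
        uint8_t  data[8];
    } can_msg_t;

    /* Placeholders for the SPI engine's low-level primitives. */
    extern void    spi_select(void);
    extern void    spi_deselect(void);
    extern void    spi_write(uint8_t b);
    extern uint8_t spi_read(void);

    /* Unload one MCP2515 receive buffer straight into a queue slot. */
    static void unload_rx_buffer(uint8_t read_cmd, can_msg_t *m)
    {
        uint8_t hdr[5];                /* SIDH, SIDL, EID8, EID0, DLC */

        spi_select();
        spi_write(read_cmd);
        for (int i = 0; i < 5; i++)
            hdr[i] = spi_read();

        m->dlc = hdr[4] & 0x0F;
        m->id  = ((uint32_t)hdr[0]          << 21)   /* SID10..SID3  */
               | ((uint32_t)(hdr[1] & 0xE0) << 13)   /* SID2..SID0   */
               | ((uint32_t)(hdr[1] & 0x03) << 16)   /* EID17..EID16 */
               | ((uint32_t)hdr[2]          <<  8)   /* EID15..EID8  */
               |  (uint32_t)hdr[3];                  /* EID7..EID0   */

        for (int i = 0; i < m->dlc && i < 8; i++)
            m->data[i] = spi_read();

        spi_deselect();                /* raising /CS clears the RXnIF flag,
                                          which releases the /RxnBF pin     */
    }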

The /INT line is now handled in two areas of the code but for two different reasons. The SPI engine watches /INT to know when it can get Tx-complete statuses and when it needs to clear the Tx-complete interrupt. CANtrafficTxHandler receives /INT so that it can deal with Rx errors.
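
For reference, clearing a Tx-complete interrupt on the MCP2515 is normally done with its BIT MODIFY instruction (opcode 0x05) against the CANINTF register (address 0x2C), where the TXnIF flags live in bits 2..4. Here is a small sketch of that one operation, reusing the placeholder spi_* helpers from the sketch above:

    #define MCP2515_BIT_MODIFY  0x05
    #define MCP2515_CANINTF     0x2C
    #define CANINTF_TX0IF       0x04   /* TX1IF = 0x08, TX2IF = 0x10 */

    /* Clear the transmit-buffer-0-empty interrupt flag without
       disturbing the other CANINTF bits. */
    static void clear_tx0_interrupt(void)
    {
        spi_select();
        spi_write(MCP2515_BIT_MODIFY);
        spi_write(MCP2515_CANINTF);    /* register address               */
        spi_write(CANINTF_TX0IF);      /* mask: only this bit may change */
        spi_write(0x00);               /* data: clear it to 0            */
        spi_deselect();
    }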

Lastly, I moved to more fully enclosing objects, with minimal interface facades for the enclosed objects, so that code outside of the enclosure is simpler and easier to read and understand.

Well, now you see what reorganization we did. Next post, I'll show measurements of the surprising performance increase these changes have yielded.

Monday, April 7, 2008

Improving SPI performance - our starting line

OK, I've spent the past week first discovering just what the current SPI performance is relative to live CAN traffic, then determining how to address the issues, and finally implementing the solution. The exciting news is that it was a very productive week. It appears that the performance is now where it needs to be, but further testing must be done to prove this.

Let me show you what I found the performance to be. The picture to the right shows us a number of things (most of which are not good ;-)

First let's look at how we initiated the traffic. I'm set up with 5 CAN nodes on the CAN bus. I sent a message out to the five CAN Nodes to have them identify themselves (part of the AMSAT protocol). Locating this in the logic analyzer trace, we first see the serial RxD (Red) traffic at the bottom left, which is followed by the serial TxD (Yellow) "OK" response. Next we have the MCP2515 transmit buffer load (the SPI-labelled signals in the middle) and the send-buffer command, again via SPI. Then we see the actual CAN transmit (Green) signal at the bottom, followed by the CAN receive (Red) with the Green CAN-Tx ACKs as each CAN message arrives.

Notice how quickly these messages arrive? CAN is a great little protocol in that all devices listen to all traffic on the bus and inject their own messages as soon as they can. They wait only as long as the mandatory gap between successive messages, and then the next message is transmitted. This means that as a receiving device we wait around for traffic which will come in bursts, and sometimes, as in this test case, our traffic will arrive in maximum-speed bursts. How fun! Well, OK, it's not fun at all. This means that our device has to immediately support running at the best possible speed. The only freedom we have is that the MCP2515 can double-buffer received traffic, but this is very little additional freedom.
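
To put some rough numbers on "as soon as they can" at our 800 Kbps bit rate (1.25 uSec per bit): the mandatory gap is the 3-bit interframe space, and if we assume full extended-ID frames with 8 data bytes (on the order of 130 bits before stuffing), back-to-back traffic looks something like:

    interframe gap:    3 bits x 1.25 uSec ≈ 3.75 uSec
    one message:    ~130 bits x 1.25 uSec ≈  162 uSec

With only two receive buffers in the MCP2515, that means a buffer has to come free within roughly one message time, every message time, or traffic is lost.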

Now, turning our attention to the top signals, we see the /INT line (Red) being asserted, which is our transmit-complete signal to the Propeller. We then see the /Rx0BF signal (Green) asserting, followed almost exactly one CAN message later by the /Rx1BF signal (Yellow) asserting. These are the "receive buffer zero full" and "receive buffer one full" signals. This is all working exactly as we want. We are being notified of each event as we intended!

Where we see the first indication of performance issues with our prototype code is that we don't begin to unload the 2nd message until after all 5 messages have arrived. (The first unload occurs during the tail end of the 4th message's arrival, and the 2nd unload occurs quite a bit after the last message arrived.) Given that we know we only have two receive buffers, we just proved that we lost 3 of the 5 messages arriving. We just are not fast enough.

Our clearing of the transmit-complete interrupt can happen much faster (it should likely happen before the first message arrives), and certainly our receives need to be much faster. In fact, this almost certainly proves that we can't have a separate Cog watching the /Rx0BF and /Rx1BF lines and then asking the SPI back-end Cog to unload the buffers, as we do here. This simply isn't fast enough.

This means that our initial functional decomposition and assignment of responsibilities amongst Cogs is not going to work. It looks like we have to move some of the transmit acknowledgement handling and most, if not all, of the receive handling. Well, it's back to the drawing board for me to figure out which Cog needs to do what, one more time...

In my next post I'll describe how things are to get rearranged and then I'll follow up with measurements of the new organization.