Some time back, I was working on an embedded system that crashed intermittently. A core dump was generated on each crash, but it was not much help in finding the root cause. We knew that a pointer was getting corrupted, but we could not find the culprit, i.e. who was corrupting the pointer.

It became a nightmare for us, and we felt helpless. One fine day, while I was searching for something, the keyword stack-overflow somehow got stuck in my mind. I was in a meeting, but stack-overflow kept coming back to me again and again. So I started studying stack-overflow in embedded systems and learned that there are many ways to detect it. In this article, we will discuss one method. We later found that the crash was indeed the result of a stack-overflow.

Let us first understand what stack-overflow means in embedded systems. When we initialize a thread, we also configure the size of its stack. For example, suppose we initialize a thread “uart_comm” with a stack size of 5KB. The memory beyond those 5KB may be used by some other thread’s stack or by the heap. So the stack usage of this thread must never go beyond 5KB; otherwise the thread will be writing in someone else’s territory, and the data of some other thread may get corrupted.

[Figure: Representation of Stack]

During development, we keep adding new code and sometimes forget to adjust the stack size accordingly. If a lot of new code is added to a thread and its stack size is not increased, there is a high chance that a stack-overflow will occur.

Now the question is how we can detect that a stack-overflow is happening and that we should increase the size of the stack. There are many ways to do so.

Some RTOSes support a feature that triggers an interrupt/exception when a stack-overflow is detected.

But if our RTOS does not have this feature, or we do not want to enable it for some reason, we can introduce the logic below to detect stack-overflow.

  • During initialization, we write a known magic number at the end of the stack of every thread, e.g. 0x54FE.
    • Some RTOSes already initialize the stack with a predetermined magic number.
    • Before writing the magic number, confirm whether the stack grows upward or downward so that you write at the correct address. Writing at the wrong end will itself cause corruption.
  • In a monitoring thread, we validate this magic number for each thread’s stack at a regular interval.
  • If we find that the magic number has changed, it confirms that the stack of that thread has grown to its full size and may have gone beyond it.
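
The steps above can be sketched in C roughly as follows. This is only a minimal illustration, not the API of any particular RTOS: the thread_stack_t descriptor and the stack_guard_* function names are hypothetical, and it assumes the monitoring code can reach each thread's stack buffer and already knows the growth direction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_MAGIC 0x54FEu   /* magic number used as the example in the text */

/* Hypothetical descriptor for one thread's stack. A real RTOS keeps
 * equivalent information in its own thread control block. */
typedef struct {
    const char *name;        /* thread name, for reporting only */
    uint16_t   *base;        /* lowest address of the stack buffer */
    size_t      words;       /* stack size in 16-bit words */
    bool        grows_down;  /* true if the stack grows toward lower addresses */
} thread_stack_t;

/* Address of the last word the stack is allowed to reach:
 * the lowest word for a downward-growing stack, the highest otherwise. */
static volatile uint16_t *stack_guard(const thread_stack_t *s)
{
    assert(s != NULL && s->base != NULL && s->words > 0);
    return s->grows_down ? &s->base[0] : &s->base[s->words - 1];
}

/* Call once during initialization, before the thread starts running. */
void stack_guard_init(const thread_stack_t *s)
{
    *stack_guard(s) = STACK_MAGIC;
}

/* Call at a regular interval from the monitoring thread.
 * Returns false once the magic number has been overwritten. */
bool stack_guard_intact(const thread_stack_t *s)
{
    return *stack_guard(s) == STACK_MAGIC;
}
```

The monitoring thread would loop over all descriptors and log or halt as soon as stack_guard_intact() returns false for any of them. Note that a changed magic number only shows that the last word was reached; the damage beyond the stack may already have happened by the time the monitor runs.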

The above logic tells us that a stack-overflow has occurred. But during the development or validation phase, the stack may grow up to 99% without ever touching 100%, while the same system might touch 100% in production/field.

So we should add one more check that detects 80% consumption of the stack. If any thread consumes more than 80% of its stack during the development/validation phase, we should consider increasing the stack size at the right time, i.e. before release to the field.
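
One possible way to implement the 80% check is to fill the whole stack with the magic number at initialization and later count how many words at the far end are still untouched. The sketch below assumes exactly that; the function names and the FILL_PATTERN and WARN_PERCENT macros are hypothetical and do not come from any specific RTOS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FILL_PATTERN 0x54FEu  /* same example magic number as in the text */
#define WARN_PERCENT 80u      /* warning threshold from the text */

/* Fill the entire stack with the pattern before the thread starts. */
void stack_fill(uint16_t *base, size_t words)
{
    assert(base != NULL && words > 0);
    for (size_t i = 0; i < words; i++) {
        base[i] = FILL_PATTERN;
    }
}

/* Peak stack usage in percent. Scans from the far end of the stack
 * (lowest address if it grows down, highest if it grows up) and counts
 * the words that still hold the fill pattern. */
unsigned stack_peak_percent(const uint16_t *base, size_t words, bool grows_down)
{
    assert(base != NULL && words > 0);
    size_t untouched = 0;
    while (untouched < words) {
        size_t idx = grows_down ? untouched : words - 1 - untouched;
        if (base[idx] != FILL_PATTERN) {
            break;  /* first word the thread has written to */
        }
        untouched++;
    }
    return (unsigned)(((words - untouched) * 100u) / words);
}

/* True when the thread has used more than WARN_PERCENT of its stack. */
bool stack_needs_resize(const uint16_t *base, size_t words, bool grows_down)
{
    return stack_peak_percent(base, words, grows_down) > WARN_PERCENT;
}
```

The monitoring thread could call stack_needs_resize() for every thread during development and validation and report the offenders. The estimate can read slightly low if the thread happens to store the value 0x54FE at its deepest point, so it is best treated as an approximation.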

With this simple method, we can detect stack-overflow well before it causes an issue in the field, and enjoy our sleep 🙂