Nasty bug due to python's garbage collection resolved after 2 years

December 8, 2022

The system and the symptom

At Nanonets, we had intermittent failures for the longest time in one particular experiment during model training. The experiment was a job in a kubernetes cluster, coded in python. The python process would just get stuck: no CPU usage. But this didn’t happen every time; it happened in roughly 1 in 100 runs, which made it difficult to replicate.

Initial investigation

The logs weren’t helpful and I couldn’t replicate the issue locally, but fortunately I could exec into the pod and get a shell. Python processes getting stuck aren’t that rare, especially in a multi-processing environment, so I thought this would be an easy fix.

py-spy makes it very easy to get a line-level python traceback of the currently running threads. Its output was similar to this:

Thread 9954 (idle): "MainThread"
    term (zmq/sugar/context.py:195)
    __del__ (zmq/sugar/context.py:74)
    random_function (mycodebase/mycode.py:83)
    ...
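For reference, a dump like the one above can be taken by attaching py-spy to the already running process (the PID here is just illustrative):

py-spy dump --pid 9954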

Curiously, the process would get stuck at random places in our code, and there was no call to __del__ of zmq/sugar/context.py anywhere in mycode.py. This happens because the garbage collector (GC) is the one calling the __del__ method, and a GC pass can run at any step of the code.
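Here is a minimal, self-contained toy example (not our code) of the same effect: an object that is only reachable through a reference cycle gets finalized by the cyclic GC during whatever allocation happens to trigger a collection, so its __del__ shows up under an unrelated line in the traceback.

import gc

class Noisy:
    def __del__(self):
        print("Noisy.__del__ ran in the middle of unrelated work")

def make_cycle():
    a = Noisy()
    a.self_ref = a              # reference cycle: only the cyclic GC can free it

def unrelated_work():
    # Any allocation here may trigger a GC pass; that is when __del__ fires,
    # so a traceback taken at that moment points at this "random" line.
    return [list(range(100)) for _ in range(1000)]

make_cycle()
gc.set_threshold(1)             # make collections frequent so the effect is easy to see
unrelated_work()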

The term method of the zmq Context class goes on to call the term method in the underlying libzmq library, so the stack trace continues into the C part of the code, which py-spy doesn’t show.

zmq is a messaging library for communicating over transports like TCP, and it was being used by the bert-client library to retrieve BERT embeddings in our code.

It just didn’t make any sense why ZMQ Context termination would hang sometimes but succeed most of the time. Based on the documentation of pyzmq and zmq, one has to be careful to close sockets and the context once done. We were doing just that, in a manner similar to the following:

from bert_client import BertClient

# retry_on_timeout and IP are defined elsewhere in our codebase; the decorator
# re-invokes the function when a timeout error is raised.
@retry_on_timeout
def get_embedding(sentence):
    with BertClient(IP) as bert_client:
        return bert_client.encode(sentence)

Not a multiprocessing problem

Python processes getting stuck is common in multiprocessing environments in fork mode if you aren’t careful about what you’re forking. For example, python threads and open sockets aren’t safe to fork. I had similar encounters in the past, such as initializing a boto3 session before forking, which led to programs getting intermittently stuck.

I changed the start method from fork to spawn, which avoids the pitfalls associated with forking, but the problem persisted. Replacing BertClient with ConcurrentBertClient didn’t help either.
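For context, the switch was roughly the following (the worker function is just a stand-in, not our actual job):

import multiprocessing as mp

def work(x):
    return x * x                # stand-in for the real per-item job

if __name__ == "__main__":
    # "spawn" starts each worker from a fresh interpreter instead of fork()ing
    # the parent's threads and sockets into the children.
    mp.set_start_method("spawn")
    with mp.Pool(processes=4) as pool:
        print(pool.map(work, range(8)))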

Finally, still confident that it must be a multiprocessing-related issue, I resorted to running everything in a single process, but I was wrong: the problem persisted in the synchronous code.

GDB traceback

Since py-spy misses the C stack trace, GDB comes to the rescue.

GDB can be used to debug a running python program. To simplify life, you need to have the python debug symbols present in the environment, and we didn’t have python3-dbg installed in the production environment.

Sometimes I had to manually link the debugging symbols, but most of the time a simple apt install python3.7-dbg did the trick, and a GDB traceback with python symbols was available.

The GDB backtrace looked something like this:

#1  0x00007ff5599f90da in zmq::signaler_t::wait (this=this@entry=0x2a064c8, timeout_=timeout_@entry=-1) at src/signaler.cpp:218
#2  0x00007ff5599e4ed0 in zmq::mailbox_t::recv (this=this@entry=0x2a06468, cmd_=cmd_@entry=0x7ffc879dea40, timeout_=timeout_@entry=-1) at src/mailbox.cpp:80
#3  0x00007ff5599d65bc in zmq::ctx_t::terminate (this=0x2a063d0) at src/ctx.cpp:165

You can go up and down the frames and inspect the values of registers, and in python frames you can inspect python variables using py-print (the python-gdb extension needs to be configured).
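A typical session looked roughly like this (the PID is illustrative; py-bt and py-print come from the python-gdb extension):

gdb -p 9954
(gdb) bt                    # C-level backtrace, like the one above
(gdb) frame 2               # jump to a specific frame
(gdb) py-bt                 # python-level backtrace
(gdb) py-print variable     # print a python variable while in a python frame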

Even though it was a good learning experience, this exercise didn’t give much insight into the problem itself. It did, however, help me learn a bit about the workings of the garbage collector, which turned out to be connected to the problem.

Resolution

Around July this year, the failures started increasing. From 1 in 100 they went up to 1 in 5. That’s when I decided to dedicate my complete effort to this.

After some digging, I was able to find similar issues reported upstream:

https://github.com/zeromq/libzmq/issues/2586
https://github.com/zeromq/pyzmq/issues/1003
https://github.com/zeromq/pyzmq/issues/1512

None matched the exact problem, but there were some hints that helped me understand the issue better. The final breakthrough came when I added a lot more logging and ran the experiments again.

Cause

The root cause was that although the context manager used in the get_embedding method above closes the socket and context on exception, if the error is raised during __init__ of BertClient, the sockets created inside __init__ are never closed.

I confirmed that exceptions were being raised inside BertClient while checking the health of the remote server, which was under load. The remote server wouldn’t respond in time, and a timeout error would be raised. This is caught by the retry_on_timeout decorator, which calls the method again. The retries ensured that the method eventually succeeded and training continued.
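The decorator isn’t shown in the snippet above; it was presumably something along these lines (a hypothetical sketch, not the actual implementation):

import functools

def retry_on_timeout(func, retries=3):
    # Hypothetical sketch: swallow timeout errors and simply call the function again.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for attempt in range(retries):
            try:
                return func(*args, **kwargs)
            except TimeoutError:
                if attempt == retries - 1:
                    raise
    return wrapper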

However, at some indefinite later point, the garbage collector cleans up the older socket and context (the ones created before the exception) in a way pyzmq doesn’t expect. An assumption in pyzmq goes wrong here.

The code assumes that since the context object holds a reference to the socket objects, socket.__del__ will be called before context.__del__. But such an ordering is not guaranteed by the python garbage collector.

Additional references were created because of the exception (the traceback keeps the half-constructed objects alive in reference cycles), so it is the cyclic garbage collector that ends up cleaning them, and its order of cleanup is non-deterministic.

context.__del__ then calls term(), which hangs indefinitely waiting for the sockets to be closed. They never are, because the GC runs synchronously in the same thread and is itself stuck inside term(), so the sockets’ __del__ never gets a chance to run.
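Putting it all together, the failure pattern looks roughly like this (a simplified sketch, not BertClient’s actual code; whether a given run hangs depends on the GC ordering and the pyzmq version):

import zmq

class FlakyClient:
    # Stand-in for BertClient: __init__ opens zmq resources and can then fail,
    # so the with-statement's __exit__ never gets a chance to clean them up.
    def __init__(self, ip):
        self.context = zmq.Context()
        self.socket = self.context.socket(zmq.REQ)
        self.socket.connect("tcp://%s:5555" % ip)
        raise TimeoutError("health check timed out")   # simulated server timeout

try:
    FlakyClient("10.0.0.1")
except TimeoutError:
    pass    # the retry decorator swallows this and tries again in the real code

# The half-built instance and its zmq objects were never closed explicitly.
# In the real code, extra references created by the exception put them into
# reference cycles, so the cyclic GC collected them at some arbitrary later
# point. If it runs context.__del__ -> term() before socket.__del__ -> close(),
# term() blocks forever: the GC is stuck inside term(), and the socket's
# __del__ that would unblock it never gets to run.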

Solution

This was solved temporarily in our code by catching the exception raised during BertClient.__init__ and explicitly closing the zmq sockets.
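A sketch of that workaround, assuming BertClient keeps its zmq objects in attributes named receiver, sender, and context (those names are assumptions about its internals):

from bert_client import BertClient

class SafeBertClient(BertClient):
    def __init__(self, *args, **kwargs):
        try:
            super().__init__(*args, **kwargs)
        except Exception:
            # Close whatever zmq resources __init__ managed to create before
            # failing, so nothing is left open for the GC to stumble over later.
            # The attribute names below are assumptions about BertClient.
            for name in ("receiver", "sender"):
                sock = getattr(self, name, None)
                if sock is not None and not sock.closed:
                    sock.close(linger=0)
            ctx = getattr(self, "context", None)
            if ctx is not None:
                ctx.term()
            raise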

With pyzmq==24, released in September, this should be permanently solved (based on the code changes, although I haven’t tested it).