Using Python for High-Performance Mathematical Computing in Software

In the realm of software, there exists a distinct class of tasks whose characteristics have remained constant for decades; only the means of solving them have changed. This category comprises complex, high-load calculations that solve problems in applied mathematics and numerical methods. It is important to distinguish these from big data or stream processing systems: algorithmically complex tasks have a much broader scope than processing video streams or recognizing images with neural networks. They encompass problems in numerical modelling, methods for solving differential equations in physics, noise reduction in optical, hydroacoustic, and radio channels, and many others.

During the 1960s, this area experienced rapid growth due to the then-widespread introduction of computing technology into scientific and technical settings. The era was marked by the development of the first high-level language, Fortran, and by the growth of specialized research institutes and faculties worldwide. Since then, computing tasks have gradually moved away from the forefront of mass-market software. Instead, the focus of software specialists has shifted towards business applications, databases, and, more recently, web and mobile development. Although algorithms and calculations are still present in these fields, they play a smaller, less critical role in the development process and are frequently addressed with pre-existing frameworks. Tools for specific computational tasks have become increasingly sophisticated, and the shortage of skilled professionals has driven up development costs. This article examines potential solutions to problems related to high-load mathematical calculations in software development, considering the client’s business interests, with examples from projects completed by the Auriga team.

Possible solutions

Nowadays, when businesses face the challenge of solving the tasks described above, their choice of tech stack is limited. Finding Fortran specialists can take a long time and only makes sense if the work is planned for many years ahead and you intend to maintain your laboratory with minimal staff changes. It is reassuring that experts in this field are not prone to sudden mood swings or to “project-hopping” every two years.

Until recently, the prevalent approach was to search for and employ mathematician-programmers skilled in C/C++. With a surplus of engineers available and a comprehensive ecosystem of tools, including those provided by semiconductor giants and SoC manufacturers, the resulting code was high-performing, compact, efficient, and portable to embedded systems. However, this method was costly: development was labor-intensive, and each line of code came at a high price due to prototyping and the search for optimal solutions in C/C++.

The third option addressed the issue of high labor intensity by using MATLAB and similar tools, which allowed researchers to conduct their own investigations and seek mathematical solutions without hiring programmers. The downside of this approach is that the resulting solution is typically bulky and difficult to replicate or share with colleagues. Additionally, MATLAB is not particularly well suited to production use and requires purchasing a license.

Python has been widely adopted in the scientific world as an alternative to MATLAB due to its simplicity, conciseness, extensive libraries, and free availability; knowing Python has become as important as knowing LaTeX. Private businesses can also benefit from Python by passing researchers’ code to developers for optimization, “industrialization,” and further integration into large systems. Contrary to the common belief that it is unsuitable for production systems and must be ported to C/C++, Python can be used beyond prototypes and demo versions and can handle complex mathematical calculations “in production.” This allows businesses to minimize code, hire cost-efficient developers, and solve the business problems of their product without depending on a single star performer in the team.

Challenge description

A few years ago, Auriga developed a solution for a medtech startup that dealt with the parallel processing of data from multiple optical channels. These data were vital to the success of minimally invasive surgical procedures, as they were the surgeon’s only source of information. The solution needed to meet strict data transmission requirements: synchronous processing of parallel data streams at 30 frames per second, with a 400 ms limit for a frame to travel from the device driver to the doctor’s display.

The first concept – TCP sockets

During operation, the analog device produced 15+ data streams, which the device driver wrote in their unprocessed state to a circular buffer in non-blocking mode.

The computationally demanding mathematical calculations, including the Hilbert transform, Fourier transforms, adaptive filtering, and other cascaded transformations of up to 20 steps, were performed in Python modules built on the NumPy and SciPy libraries. As true multithreading was impossible due to the GIL, the developers had to run a separate instance of the Python handler as an isolated OS process for each data stream. The scaled-down resulting data was then sent to the PyQt UI. The initial implementation used TCP sockets.
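
For illustration, the per-stream layout could be sketched roughly as follows; the frame size, the two cascade steps, and the loopback socket are placeholders rather than the project’s actual values.

```python
import socket

import numpy as np
from scipy.signal import hilbert

FRAME_SAMPLES = 50_000   # assumed frame length, not the project's figure
UI_POINTS = 1_024        # assumed size of the scaled-down result

def run_cascade(raw):
    """Stand-in for the real cascade of up to 20 transformation steps."""
    analytic = hilbert(raw)                    # Hilbert transform
    spectrum = np.fft.rfft(np.abs(analytic))   # Fourier transform
    # ... adaptive filtering and further cascade steps would follow ...
    return np.abs(spectrum[:UI_POINTS]).astype(np.float32)

def stream_handler(read_frame, sock, n_frames):
    """One handler like this ran as an isolated OS process per stream."""
    for _ in range(n_frames):
        sock.sendall(run_cascade(read_frame()).tobytes())

if __name__ == "__main__":
    # Loopback demo: socketpair() stands in for the TCP link to the UI,
    # and a random-number generator stands in for the device driver.
    ui_end, worker_end = socket.socketpair()
    stream_handler(lambda: np.random.rand(FRAME_SAMPLES), worker_end, 3)
    print(len(ui_end.recv(UI_POINTS * 4)))   # bytes of one scaled frame
```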

At the time, the mathematical processing of a frame took around 150 ms, leaving 250 ms for all other stages between the driver and the image appearing on the screen. Performance tests of various message queues (ActiveMQ, Mosquitto, RabbitMQ, ZeroMQ) showed insufficient stability and data transmission rates, resulting in desynchronization of the streams and noticeable visualization gaps, which could not provide an adequate level of quality for a medical product.

The engineers initially designed for a total data stream of 25 Mbps and a frame size of 100+ kilobytes, which a TCP socket connection handled comfortably. However, further study of the device’s optical component revealed the need to handle a much larger volume of data due to a significant increase in the detail of the read data. The frame size grew to 1.2+ megabytes, resulting in a data stream of 26 Gbps and a processing time of 250-300 ms per frame. Meanwhile, the TCP socket on the customer’s test stand could manage a maximum of only 1.7 Gbps in synthetic tests.

The order-of-magnitude gap between the available and required data transmission rates caused an immediate overflow of TCP buffers and subsequent cascading packet loss. A new solution had to be invented.

Next step – named/unnamed pipes in Python

Named and unnamed pipes were considered as the next candidate transport. While a synthetic test on the stand demonstrated a rate of around 21.6 Gbps, closely approaching the requirements, technical issues surfaced early in the implementation phase (a sketch of such a synthetic test follows the list below):

  • The data acquisition rate was higher than the transmission rate, resulting in uncontrolled buffer growth and unpredictable transmission times of 50 to 150 ms.
  • The math calculation processes caused irregular, unexpected overloading of the high-end 40-core processor. Within 10-15 minutes of launch, the load on all occupied cores sharply increased from approximately 80% to 100%, and the processing time for a single frame grew from a reasonable 300 ms to 600-700 ms.
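
A synthetic throughput test of the kind mentioned above fits in a few lines. This sketch uses multiprocessing.Pipe with a made-up message size, so the figure it prints depends entirely on the machine it runs on:

```python
import time
import multiprocessing as mp

CHUNK = 1 << 20      # 1 MB messages, close to the projected frame size
DURATION = 5.0       # seconds to measure

def producer(conn):
    payload = bytes(CHUNK)
    while True:
        conn.send_bytes(payload)

if __name__ == "__main__":
    recv_end, send_end = mp.Pipe(duplex=False)   # one-way pipe
    mp.Process(target=producer, args=(send_end,), daemon=True).start()
    received, start = 0, time.perf_counter()
    while time.perf_counter() - start < DURATION:
        received += len(recv_end.recv_bytes())
    elapsed = time.perf_counter() - start
    print(f"pipe throughput: {received * 8 / elapsed / 1e9:.2f} Gbit/s")
```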

Analysis showed that the problem was caused by a combination of intensive memory allocation and deallocation, data copying, and the operating system’s zeroing of freed pages. None of these factors was problematic on its own, but together they meant that the page-zeroing work could no longer run quietly in the background: it spilled onto the same CPU cores that were performing the calculations, increasing their load through frequent context switches.

Code refactoring with the help of the NumPy library noticeably reduced the amount of “dirty” memory produced but did not completely solve the issue. Fast, efficient data transfer was still required at the interface between the device driver and Python, where the volume of transmitted data was at its maximum. However, porting the math from Python to C++ was not feasible given the project’s time and budget constraints.
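
The pattern behind this kind of refactoring can be illustrated as follows; the filter step itself is invented, but preallocating buffers and routing results through out= arguments is what reduces the churn of temporary arrays:

```python
import numpy as np

N = 600_000   # assumed number of samples per frame

def step_naive(x, c0=0.5, c1=0.25):
    # Every operation allocates a fresh temporary array, which the OS
    # must later zero out again once the memory is freed.
    return x * c0 + np.abs(x) * c1

# Work buffers allocated once at start-up and reused for every frame.
_y = np.empty(N)
_tmp = np.empty(N)

def step_inplace(x, c0=0.5, c1=0.25):
    np.multiply(x, c0, out=_y)   # y = c0 * x, no new allocation
    np.abs(x, out=_tmp)          # |x| written into the reused buffer
    _tmp *= c1                   # in-place scaling
    _y += _tmp                   # in-place accumulation
    return _y

frame = np.random.rand(N)
assert np.allclose(step_naive(frame), step_inplace(frame))
```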

Final solution – IPC protocol

The solution was a custom IPC protocol over a circular buffer in shared memory – not a common approach, but one well known to engineers. The only challenge was the use of different programming languages: a C++ connector on the driver side and a computational module in Python. The PyWin32 package was used to work with Windows semaphores, allowing the same semaphore to be accessed from independent applications.
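
A heavily simplified sketch of the Python (consumer) side of such a channel might look like the following. It is Windows-only, requires the pywin32 package, and the shared-memory name, slot layout, and semaphore name are illustrative assumptions rather than the actual protocol:

```python
import mmap

import numpy as np
import win32event

SLOT_COUNT = 8              # assumed number of slots in the ring
FRAME_BYTES = 1_228_800     # ~1.2 MB per frame, as described above

# Named shared memory; the C++ connector maps the same region by name.
buf = mmap.mmap(-1, SLOT_COUNT * FRAME_BYTES, tagname="Local\\frames")

# Named semaphore counting frames ready for consumption; the producer
# calls ReleaseSemaphore after filling each slot.
sem = win32event.CreateSemaphore(None, 0, SLOT_COUNT, "Local\\frames_ready")

def consume(process, n_frames):
    read_idx = 0
    for _ in range(n_frames):
        # Block until the producer signals that a slot holds a new frame.
        win32event.WaitForSingleObject(sem, win32event.INFINITE)
        offset = (read_idx % SLOT_COUNT) * FRAME_BYTES
        # Zero-copy NumPy view of the slot in shared memory.
        frame = np.frombuffer(buf, dtype=np.uint8,
                              count=FRAME_BYTES, offset=offset)
        process(frame)
        read_idx += 1

if __name__ == "__main__":
    # Single-process demo: play the producer's role for one frame.
    buf.seek(0)
    buf.write(b"\x01" * FRAME_BYTES)
    win32event.ReleaseSemaphore(sem, 1)
    consume(lambda f: print("frame received, first byte:", f[0]), 1)
```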

The simplicity of both the protocol and the transmitted data brought the average frame transfer time down to 5 ms. A synthetic test indicated a maximum throughput of around 84.7 Gbps, comfortably exceeding the requirements for data volume and delivery time.

Computation process

The data received through IPC had to undergo complex multi-step transformations and noise filtration. Although the mathematical foundation had mostly been prepared and tested in MATLAB, concerns were raised that the GIL would prevent parallel synchronous processing. Nevertheless, despite the libraries’ limitations on out-of-the-box multithreading, an in-house mechanism for dividing the calculations allowed the process to be configured to fully utilize the resources of a specific processor model, with performance reserves for random spikes. It is worth remembering that the solution was intended for a life-critical medical device.
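
As a rough illustration of such a division of work, the channels of a frame can be spread across a pool of worker processes; the channel count and the two-step stand-in cascade below are invented for the example:

```python
import numpy as np
from scipy.signal import hilbert, medfilt
from concurrent.futures import ProcessPoolExecutor

CHANNELS, SAMPLES = 16, 65_536   # assumed frame geometry

def process_channel(ch):
    """Stand-in for the real cascade: envelope detection plus denoising."""
    envelope = np.abs(hilbert(ch))            # Hilbert-transform envelope
    return medfilt(envelope, kernel_size=5)   # median filter as denoising

if __name__ == "__main__":
    frame = np.random.rand(CHANNELS, SAMPLES)
    # Separate OS processes sidestep the GIL; the worker count can be
    # tuned to a specific CPU model, leaving headroom for load spikes.
    with ProcessPoolExecutor(max_workers=CHANNELS) as pool:
        result = np.stack(list(pool.map(process_channel, frame)))
    print(result.shape)   # (16, 65536)
```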

As previously stated, the second issue was the libraries’ constant generation of large numbers of temporary structures in memory, along with data copying and subsequent release. However, the availability of the open-source libraries’ source code and proper estimation of array sizes at the start of functions enabled us to mitigate this issue swiftly.

As a finishing touch, our team reduced the computational complexity of the algorithm code, minimized the number of loops and iterations, reused data structures, and applied other techniques familiar to developers but foreign to scientists. This achieved performance comparable to, and in places surpassing, that of MATLAB, thanks to fuller utilization of processor resources. There was still room for ideas to further increase calculation speed, although the product owner did not require them.
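
A typical instance of such loop minimization is replacing per-sample Python loops with whole-array NumPy operations; the moving-average step below is invented purely to show the pattern:

```python
import numpy as np

def moving_avg_loop(x, w=8):
    # One Python-level iteration per output sample: slow for long frames.
    out = np.empty(len(x) - w + 1)
    for i in range(len(out)):
        out[i] = x[i:i + w].mean()
    return out

def moving_avg_vec(x, w=8):
    # The same result from two whole-array operations via prefix sums.
    c = np.cumsum(np.insert(x, 0, 0.0))
    return (c[w:] - c[:-w]) / w

x = np.random.rand(100_000)
assert np.allclose(moving_avg_loop(x), moving_avg_vec(x))
```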

The subsequent migration of the solution to a real-time, Linux-based operating system went smoothly. Although the porting tasks required an additional budget, the expenses were accurately estimated and controlled, thanks in no small part to the cross-platform nature of Python.

Lessons Learned

A group of engineers with higher technical education, only one of whom had scientific research experience, implemented a mathematical idea on a final device within a limited timeframe. Python played a key role in this: it proved a suitable tool for production, easily adapted and optimized, and did not require unique knowledge from the developers. All tasks were predictable and manageable with proper decomposition, estimation, and planning.

The choice of tool is the most important decision to be made at the start of work on industrializing computational tasks. In addition, proper work planning, including prototyping bottlenecks and writing synthetic tests, helped avoid multiple development iterations and unpleasant surprises during testing of the final medical device.