Access to memory is one of the major performance bottlenecks in applications that handle large amounts of data (e.g., machine learning, scientific computing, neural networks). Many of these applications access large data structures that do not fit in cache memory. Intensive memory traffic is then generated when reading/writing data from/to main memory, and the CPU is often stalled waiting for the requested data to arrive. The delays introduced by main memory may reach a few hundred cycles and can imply a significant performance penalty.
Every application exhibits particular patterns in the generation of memory addresses. In many cases these patterns are simple, e.g., sequential access to a vector or row-based access to matrix elements. In other cases they are difficult to identify, especially when several data structures are accessed simultaneously (e.g., matrix multiplication generates memory accesses to three matrices).
Question: would it be possible to learn the patterns generated by a specific application and predict its future memory accesses, so that the CPU finds the data in cache memory without waiting for it to arrive from main memory?
If these patterns can be learned, would it be possible to design a parallel program, e.g., a helper thread, that cooperates with the main program and prefetches the required data in advance?
The plots below show the memory access patterns generated by four different algorithms: matrix multiplication, LU decomposition, merge sort, and quick sort. The colors identify the different data structures involved. For example, three matrices are involved in matrix multiplication (C = A*B).
From the plots it is quite clear that the memory addresses are not randomly generated. Matrix multiplication reveals very predictable patterns that can be easily modeled with simple recurrences. The other algorithms are less predictable, although there is clear evidence that some underlying patterns govern their address generation.
The tasks involved in this project are outlined below: