About us
High-order mathematical and computational infrastructure for streamed data that enhances contemporary generative and large language models
DataSıg is a research programme reinventing how we understand sequential data. We bring together rough path theory and modern machine learning to build a new generation of transformers and tools for real-world sequential data.
DataSıg has already established rough path-based methods as effective tools for sequential data with complex internal dependencies, with applications in numerous domains. (See the Research tab for our papers and examples.) Meanwhile, transformers have revolutionised models for sequential data and underpin large language models (LLMs). Our aim is to use rough path theory to develop a deeper mathematical understanding of sequence-to-sequence models such as transformers and to expand the techniques for representing, manipulating, and comparing sequential data.
DataSıg is organised around four interconnected research themes, each shaped by real-world applications through our project partners:
- Rough to rough transformers explores transformer-like architectures grounded in rough path theory.
- Data representation and dimensionality reduction investigates representations for streamed data that make the data easier to work with.
- Outlying streams provides techniques for determining whether a new stream is similar to an existing collection of streams.
- Computation develops computational tools for working with streamed data using rough path theory.
DataSıg is led by Prof. Terry Lyons, Dr Blanka Horvath, and Dr Sam Morley (Oxford), Prof. Thomas Cass (Imperial), and Prof. Hao Ni (UCL). The programme is funded by a Programme Grant from the EPSRC.
Rough to rough transformers
Transformers are models for large-scale sequential data built around a self-attention mechanism (attention is all you need). Most implementations use multi-headed attention, which allows these models to be trained on very large datasets. For example, the transformers that underpin large language models are trained on text equivalent to millions of books. However, transformers still face significant challenges. The token space is typically very large, making them computationally expensive to train and evaluate, and state-of-the-art models require hundreds of billions or even trillions of parameters.
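To fix ideas, here is a minimal NumPy sketch of single-head scaled dot-product attention, the building block referred to above. The names, dimensions, and toy data are purely illustrative and are not taken from any particular implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # pairwise token similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                # weighted mix of value vectors

# Toy sequence: 5 tokens embedded in 8 dimensions; self-attention uses Q = K = V.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
out = scaled_dot_product_attention(X, X, X)
print(out.shape)  # (5, 8): one updated vector per token
```

Multi-headed attention simply runs several such maps in parallel with different learned projections and concatenates the results.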
We can reformulate these transformer-like models as a specific class of controlled differential equations. (See a similar reformulation due to Ricky Chen et al.) This realisation has two major implications. First, it provides a mathematical framework for understanding, analysing, and generalising these models. Second, it allows us to apply the techniques of rough path theory and numerical analysis to gain new insight into model complexity and tokenisation, and to unlock performance improvements. The more general class of models describes rough path to rough path transformations, hence the name rough to rough transformers.
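Schematically, and with notation chosen here purely for illustration, a controlled differential equation driven by an input stream $X$ takes the form

$$
\mathrm{d}Y_t \;=\; f(Y_t)\,\mathrm{d}X_t, \qquad Y_0 = y_0,
$$

where $Y$ is the response stream and $f$ is a (possibly learned) vector field. Viewing a transformer layer as such a response is what makes the analytical and numerical toolkit of rough path theory available for studying it.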
Data representation and dimensionality reduction
One of the key challenges in dealing with sequential data is the high dimensionality and fragility of naive representations. For instance, small changes in the units of measurement, sampling rate, or noise can adversely affect the performance of machine learning models. Rough path theory provides the signature as a robust, faithful representation for streams over short intervals. Replacing raw data sequences with signatures yields a dramatic reduction in dimensionality, although the signature itself may still be large.
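As a rough sketch of the kind of reduction involved, the snippet below computes a truncated signature of a sampled stream. It assumes the third-party iisignature package; the library, the toy path, and the truncation depth are all illustrative choices, not a prescription.

```python
import numpy as np
import iisignature  # one of several packages that compute signatures

# A 2-dimensional stream sampled at 1000 time points (a random walk here).
rng = np.random.default_rng(1)
path = np.cumsum(rng.normal(size=(1000, 2)), axis=0)

depth = 4
sig = iisignature.sig(path, depth)    # truncated signature up to level 4
print(path.size, "->", sig.shape)     # 2000 raw values -> 30 signature terms
```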
Throughout the first five years of DataSıg, the signature has been widely applied in machine learning across numerous domains such as healthcare, finance, and cybersecurity. However, it is not the only stream representation that arises from rough path theory. A small change in the formulation gives rise to a family of robust, faithful representations of streams that emphasise different properties of the underlying process. (See the paper on the path development network.) Crucially, we can learn the choice of representation from this family that best fits the task and the data we have, making these representations very flexible.
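Schematically, and again with notation chosen here only for illustration, one member of this family is the path development: given a linear map $M$ sending increments of the stream into a matrix Lie algebra, the development $Z$ of a stream $X$ solves

$$
\mathrm{d}Z_t \;=\; Z_t\, M(\mathrm{d}X_t), \qquad Z_0 = \mathrm{Id},
$$

and learning $M$ amounts to selecting the representation in the family best suited to the task.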
In a different direction, kernel methods are widely used to quantify the similarity between objects. For streams, one can use the signature kernel in these methods. The signature kernel can be computed without computing the signatures themselves. The theory surrounding the signature kernel is still evolving.
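In symbols (with notation chosen here for illustration), the signature kernel of two streams $x$ and $y$ is the inner product of their signatures,

$$
k(x, y) \;=\; \langle S(x), S(y) \rangle,
$$

and for sufficiently smooth paths it can be obtained by solving a Goursat-type PDE of the form $\partial_s \partial_t\, k_{x,y}(s,t) = \langle \dot{x}_s, \dot{y}_t \rangle\, k_{x,y}(s,t)$, rather than by forming the signatures themselves. This is what makes kernel methods on streams practical.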
Outlying streams
Identifying when a stream “doesn’t fit” a known pattern or conform with an existing set of streams is a widespread challenge. This could be detecting malware on a computer, identifying interference in radioastronomy data, or spotting differences in RNA molecules. Using the signature, we can build a data-driven method to determine whether a stream belongs to a set of known “good” streams. This approach respects natural transformations of the data, such as changes in units, and performs well for streams in moderate dimensions.
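The sketch below illustrates one possible signature-based scoring scheme of this flavour: summarise each stream by its truncated signature, fit the "good" corpus, and score a new stream by a Mahalanobis-style distance in signature space. It is a minimal illustration only, not the programme's specific method; the iisignature dependency, the truncation depth, the regularisation, and the distance are all assumptions made for the example.

```python
import numpy as np
import iisignature  # assumed available, as in the earlier sketch

def sig_features(stream, depth=3):
    """Truncated signature of a stream given as an (N, d) array."""
    return iisignature.sig(stream, depth)

def fit_corpus(streams, depth=3):
    """Mean and regularised inverse covariance of the corpus signatures."""
    S = np.stack([sig_features(s, depth) for s in streams])
    mean = S.mean(axis=0)
    cov = np.cov(S, rowvar=False) + 1e-6 * np.eye(S.shape[1])
    return mean, np.linalg.inv(cov)

def anomaly_score(stream, mean, cov_inv, depth=3):
    """Mahalanobis-style distance of a new stream from the corpus."""
    diff = sig_features(stream, depth) - mean
    return float(np.sqrt(diff @ cov_inv @ diff))

# Corpus of "good" 2-dimensional streams, plus one candidate to score.
rng = np.random.default_rng(2)
corpus = [np.cumsum(rng.normal(size=(200, 2)), axis=0) for _ in range(50)]
mean, cov_inv = fit_corpus(corpus)
candidate = np.cumsum(rng.normal(size=(200, 2)), axis=0)
print(anomaly_score(candidate, mean, cov_inv))  # larger score = less conformant
```

Because the signature is invariant or equivariant under natural transformations of the data, a score built on it inherits much of that robustness.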
Scaling this approach to higher dimensions remains a challenge. One promising strategy is to use a foundation model (such as a large language model) together with other techniques to reduce the high-dimensional input to an information-rich low-dimensional stream that captures the structure of the inputs, and then to apply the signature-based methods.
Computation
Theoretical advances are only useful if they can be translated into efficient, practical tools. In data science, this means high-quality software capable of computing signatures and manipulating streams at scale. Our Python library RoughPy is designed so that code closely mirrors the underlying mathematics. This makes it easier to move between theory and practice and provides efficient access to stream data.
The largest challenge is developing software that can be used at the scales required by machine learning. This requires a different approach compared to writing code for isolated problems. Frameworks like Torch and JAX are built with scale in mind, but don’t provide the primitive operations needed for working with streams and signatures. RoughPy serves as a prototype and template for the next generation of software for rough streams. With the help of our project partners, we are developing high-performance implementations for GPUs, which will make it easier to use our techniques on large-scale machine learning projects.