Cloud computing means many different things to different people. There are public, private and hybrid models, and the variations are endless. A key characteristic of the cloud is rapid elasticity, which offers compute power unheard of in prior infrastructures. Such parallelized scalability makes solutions to previously intractable problems a reality. There are two key underlying components behind this: a computational model and distributed data.
OneTick Map-Reduce is a Hadoop-based solution combining OneTick's analytical engine with the MapReduce computational model, which can be used to perform distributed computations over large volumes of financial tick data. As a distributed tick data management system, OneTick's internal architecture provides support for databases that are spread across multiple physical machines. This architecture, designed for distributed parallel processing, improves query performance: the typical OneTick query is easily parallelizable at logical boundaries (e.g. running the same query analytics across a large symbol universe) and can be processed on separate physical machines.
OneTick Map-Reduce leverages elastic computation by dynamically distributing both data (stored in OneTick historical archives) and analytics across the nodes of a Hadoop cluster, using a combination of a distributed file system (HDFS) and the MapReduce computational framework.
- OneTick archives are stored on a distributed file system (e.g. HDFS, with Amazon S3 as a backup). The distributed file system serves as an abstraction layer providing shared access — physically, the data resides on different nodes of the cluster. The distributed file system is also responsible for balancing disk utilization and minimizing network bandwidth usage.
- Hadoop’s MapReduce daemons are responsible for distributing the query across the nodes of the cluster, taking into account the locality of the queried data.
- The distributed OneTick query is an analytical process that semantically defines a user’s business function. OneTick query analytics are designed specifically for that purpose.
OneTick Analytics
OneTick provides a large collection of built-in analytical functions that are applied to streams of historical or real-time data. These functions, referred to as Event Processors (EPs), are a set of business and generic processors that are semantically assembled and ultimately define the logical, time-series result set of a query. Event Processors include aggregations, filters, transformers, joins and unions, statistical and finance-specific functions, order book management, sorting and ranking, and input and output functions. Also included is a reference data architecture for managing security identifiers, holiday calendars and corporate action information. Together these allow time-series tick streams originating from any of the OneTick storage sources (archive, in-memory or real-time) to be filtered, reduced and/or enriched into the business logic supporting a wide variety of use cases:
- Quantitative Research
- Algorithmic, low-touch and program trading
- Firm-wide profit / loss monitoring
- Real-time transaction cost analysis
- Statistical arbitrage and market making
- Regulatory compliance and surveillance
OneTick, Hadoop and Spark
Spark and Hadoop are middleware frameworks that facilitate parallel processing of data, whereas MapReduce is a computational model. These frameworks provide a platform for distributed computation and, combined with HDFS, offer distributed data access as well. HDFS is by definition the file system component of Hadoop, and Spark can use HDFS as an input data source. Yet neither Hadoop nor Spark provides targeted, business-oriented functions to support the use-case solutions mentioned above. Furthermore, those trade-related solutions depend on the cleansed, normalized, high-quality data available in OneTick data management, either on its own or integrated with Hadoop.

OneTick has its own very efficient mechanisms for parallelizing computations (e.g. concurrent processing of symbol sets across a load-balanced group of tick servers, client-side and server-side database partitioning with concurrent partition access, and splitting queries locally into multiple execution threads). OneTick also supports Hadoop as an alternative mechanism for parallelizing computations. The OneTick Map-Reduce design makes it easy to switch between data representation/job dispatching models, supporting both an internal model and external models (Hadoop, Spark, etc.).

The idea is that you start with a collection of data items and apply map and reduce operations to that collection (as in functional programming). Map operations transform existing items into new items, and reduce operations group multiple items into a single aggregated item. The computation must be stateless so that it is easier to parallelize: each transformation creates a new collection rather than manipulating the existing one. Users define their “map” and “reduce” operations within this restricted computational model, and the framework takes care of the parallelization.
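As an illustrative sketch of that model in plain Python (not OneTick or Hadoop APIs — just the functional map/reduce pattern over an immutable collection):

```python
from functools import reduce

# A collection of data items: (symbol, price, size) trade ticks.
ticks = [("IBM", 100.0, 300), ("IBM", 101.0, 200), ("MSFT", 50.0, 500)]

# Map: transform each item into a new item (its notional value).
# The original collection is untouched -- a new one is created.
notionals = list(map(lambda t: (t[0], t[1] * t[2]), ticks))

# Reduce: group multiple items into a single aggregated item.
total_notional = reduce(lambda acc, item: acc + item[1], notionals, 0.0)

print(total_notional)  # 100*300 + 101*200 + 50*500 = 75200.0
```

Because each step only reads its input collection and emits a new one, a framework is free to split the map phase across many machines and combine the partial reductions afterward.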
How does this translate to OneTick’s data model?
- Data items <-> OneTick time series
- Map operations <-> OneTick transformer EPs
- Reduce operations <-> OneTick merge/join EPs
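A minimal sketch of that correspondence in plain Python — the per-symbol VWAP transformer and the merge step below are hypothetical stand-ins for OneTick EPs, not actual OneTick code:

```python
# Each data item is one symbol's time series of (price, size) ticks.
series = {
    "IBM":  [(100.0, 300), (101.0, 200)],
    "MSFT": [(50.0, 500)],
}

def vwap(ticks):
    """Transformer analog: reduce one symbol's series to its VWAP."""
    notional = sum(price * size for price, size in ticks)
    volume = sum(size for _, size in ticks)
    return notional / volume

# Map phase: apply the transformer independently per symbol --
# each symbol's series could run on a different cluster node.
per_symbol = {sym: vwap(ticks) for sym, ticks in series.items()}

# Reduce phase: merge the per-symbol results into one result set,
# the analog of a merge/join EP combining parallel streams.
merged = sorted(per_symbol.items())
print(merged)  # [('IBM', 100.4), ('MSFT', 50.0)]
```

The per-symbol independence in the map phase is exactly the "logical boundary" mentioned earlier that makes a typical OneTick query easy to parallelize.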
Spark is similar to Hadoop, yet it overcomes Hadoop's limitation of long job-startup times. Like OneTick's own dispatching model, Spark appears to be better suited to interactive data processing. Nonetheless, both are suitable for large batch-processing tasks, which is the reason for OneTick's integration with them as complementary technologies.
Once again thanks for reading.
Louis Lovas
For an occasional opinion or commentary on technology in Capital Markets you can follow me on twitter, here.