Video (IBM Rüschlikon talk, 23.7.2012). Dr. Efstratios Rappos
We propose a novel framework, called Picasso (PredICtive cAching framework for faSt temporal StOrage), for increasing the performance of hardware and software storage solutions for very large temporal data sets. Our algorithms use temporal models to predict access to temporal data, which can lead to a drastic reduction in cost. The decreased cost of storing and accessing temporal data will benefit both enterprise and scientific applications and allow more efficient data mining and time-series analysis.
In 2009 alone, the SIX Swiss Exchange executed more than 30 million security trades (http://bit.ly/bYm61T). Most of the data generated by these trades is temporal in nature and is stored in time series (ordered sets of timestamp/value pairs). The Swiss stock-exchange example is minuscule compared to all the temporal data generated by science, engineering, finance, and the Internet in general (consider the dominance of time series in biology, climatology, structure analysis, etc.). Even the data stored by the popular web service Twitter (http://twitter.com) can be considered temporal.
Today, temporal data is rarely differentiated and treated specially, and when it is, the solutions are ad hoc, not proven optimal, and infrastructure-specific. Temporal data is often stored in relational databases with significant space overhead (a time series stored in Sybase, for example, used to occupy roughly 100 times more space than a naïve, binary, non-compressed representation of the same data). Even when the data is properly organized and indexed, businesses lose hundreds of millions of dollars while waiting for their traditional storage and database facilities to access and manipulate it. Only a few months ago, Twitter made world headlines by experiencing network/capacity problems (http://bit.ly/d5acoY).
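To make the phrase "naïve, binary, non-compressed representation" concrete, here is a minimal Python sketch. The 16-byte record layout (a 64-bit integer timestamp followed by a 64-bit float value) and the function names are assumptions chosen for illustration; the proposal does not specify a storage format.

```python
import struct

# Assumed layout: little-endian int64 timestamp + float64 value, no padding.
POINT = struct.Struct("<qd")  # 16 bytes per timestamp/value pair


def pack_series(points):
    """Serialize an ordered list of (timestamp, value) pairs to raw bytes."""
    return b"".join(POINT.pack(ts, value) for ts, value in points)


def unpack_series(blob):
    """Inverse of pack_series: read fixed-size records back into tuples."""
    return [POINT.unpack_from(blob, offset)
            for offset in range(0, len(blob), POINT.size)]
```

With this layout a series of n points occupies exactly 16 * n bytes, which gives a baseline against which the overhead of a general-purpose database can be measured.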
We propose a radically new approach that can change the way temporal data is handled. We call our proposed framework Picasso (PredICtive cAching framework for faSt temporal StOrage). Picasso has the potential to decrease the access time to time series and to increase the throughput of temporal data by at least several orders of magnitude, beyond what is theoretically possible with existing storage, indexing, and caching schemes. To do that, Picasso borrows techniques from Artificial Intelligence (AI) modeling, temporal reasoning, and statistics. The main idea behind Picasso is to use predictive caching in a data storage solution specialized for temporal data/time series.
The idea of Picasso is very simple. In many environments, batch processes (derivation of data, indexing, etc.) account for a significant portion of the overall I/O load. These batch processes are fully predictable (known in advance) and repetitive. Our idea is to design a simple modeling language for describing the data use of such back-end processes. The emphasis in this Picasso data access modeling language is temporal (frequency of data access, duration, etc.), but we also plan facilities for representing spatial access patterns (e.g., if time series X is read at time T, then, with high probability, time series Y is read at time T + 1). Based on this model, we can implement a scheduler that predicts which data will be used and fetches it into faster memory ahead of time. The above modeling is sufficient for production, but during the design of the Picasso storage solution (recall that a temporal storage is a combination of hardware, software, network elements, APIs, etc.) we plan to use additional modeling and statistical techniques.
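The scheduler idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Picasso modeling language or implementation: the class names (PeriodicAccess, FollowOn, AccessModel, PrefetchScheduler), the integer "tick" clock, the dictionary used as slow storage, and the absence of cache eviction are all hypothetical simplifications.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PeriodicAccess:
    """Temporal rule: a batch job reads `series` every `period` ticks from `first`."""
    series: str
    first: int
    period: int

    def reads_at(self, t):
        return t >= self.first and (t - self.first) % self.period == 0


@dataclass(frozen=True)
class FollowOn:
    """Spatial rule: if `trigger` is read at T, `follower` is expected at T + lag."""
    trigger: str
    follower: str
    lag: int = 1


@dataclass
class AccessModel:
    periodic: list = field(default_factory=list)
    follow_ons: list = field(default_factory=list)

    def predicted_reads(self, t):
        """Names of the series the model expects to be read at tick t."""
        reads = {p.series for p in self.periodic if p.reads_at(t)}
        for rule in self.follow_ons:
            # One level of follow-on only: triggers must be periodic reads.
            if any(p.series == rule.trigger and p.reads_at(t - rule.lag)
                   for p in self.periodic):
                reads.add(rule.follower)
        return reads


class PrefetchScheduler:
    """Copies predicted series from slow storage into a fast cache ahead of use."""

    def __init__(self, model, slow_store, lead=1):
        self.model = model
        self.slow_store = slow_store  # assumed: dict of series name -> data
        self.lead = lead              # how many ticks ahead to prefetch
        self.cache = {}               # stands in for "faster memory"

    def tick(self, t):
        """Prefetch what is predicted for tick t + lead; return names loaded."""
        loaded = []
        for name in sorted(self.model.predicted_reads(t + self.lead)):
            if name not in self.cache and name in self.slow_store:
                self.cache[name] = self.slow_store[name]
                loaded.append(name)
        return loaded

    def read(self, name):
        """Return (data, hit), where hit tells whether the cache served the read."""
        if name in self.cache:
            return self.cache[name], True
        return self.slow_store[name], False
```

For example, a model with PeriodicAccess("X", first=2, period=5) and FollowOn("X", "Y") causes the scheduler to load X during tick 1 and Y during tick 2, so both reads are served from the cache when the batch job issues them. A real design would also need an eviction policy and probabilistic rules, which this sketch leaves out.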
Picasso is a highly relevant and practical project, as indicated by the initial group of users who have expressed interest in the idea of predictive temporal storage. At the time of writing of this proposal, we have received an expression of interest from the Financial Markets division of ING Group NV, Amsterdam. The ING Group is a well-known global financial conglomerate of Dutch origin, offering banking, investments, life insurance and retirement services to over 85 million private, corporate and institutional customers in more than 40 countries in Europe, North and Latin America, Asia and Australia. ING uses terabytes of temporal data, spread across hundreds of thousands of time series, for trading strategy analyses, Value at Risk analyses, etc.
- Efstratios Rappos, Stephan Robert, Rudolf H. Riedi: “A Cloud Data Center Optimization Approach Using Dynamic Data Interchanges”, The 2nd IEEE International Conference on Cloud Networking, IEEE CloudNet 2013, November 11-13, San Francisco, USA.
- Efstratios Rappos, Stephan Robert: “Using GPU Simulation to Accurately Fit to the Power-Law Distribution”, arXiv:1305.6738, May 29, 2013, Cornell University Library, USA.
- Efstratios Rappos, Stephan Robert: “Predictive caching in computer grids”, The 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid, May 13-16, 2013, Delft, The Netherlands.