Customer Performance Requirements
A customer recently asked me to recommend a solution for running analytics on data generated at ATMs and POS machines. In essence, the problem was that the data could not be queried on the fly while it was being generated. Traditional databases such as Oracle, DB2, SQL Server and others could not solve this problem with their current product sets, since they were not built to analyze large amounts of data online. They also required constant tuning, indexing and materializing of the data before any business intelligence query could be run. Essentially, someone had to prepare the data and make it query-ready, which costs a lot of time and money that this particular customer was trying to avoid.
Linear Scaling for Big Data in Real Time
In my opinion, the only way to do this was to use an in-memory database like SAP HANA, which was built for running analytics on live data. I did have some doubts about HANA’s scalability and asked SAP for guidance. They briefed me on a recent scale-out test of HANA in which they simulated 100 TB of actual BW customer raw data on an SAP-certified configuration: a 16-node cluster, each node with 4 IBM X5 CPUs of 10 cores each and 512 GB of memory. The 100 TB test database consisted of one large fact table (85 TB, 100 billion records) and several dimension tables. A 20x data compression was observed, resulting in a 4 TB HANA instance distributed equally across the 16 nodes (238 GB per node). Without indexing, materializing the data, or caching query results, the queries ran in 300 to 500 milliseconds, which in my opinion is close enough to real time. There were also ad-hoc analytic query scenarios where materialized views cannot easily be used, such as listing the top 100 customers in a sliding time window, and year-to-year comparisons for a given month or quarter.
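To make the sliding-time-window scenario concrete, here is a minimal Python sketch of a "top N customers in a window" query computed directly over raw rows, with no precomputed aggregate. It is only an illustration of the query shape and not how HANA executes it; the `SALES` rows, the field layout and the `top_customers` function are hypothetical names invented for this example.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical fact rows: (customer_id, transaction_date, amount).
SALES = [
    ("C1", date(2012, 1, 5), 120.0),
    ("C2", date(2012, 1, 20), 300.0),
    ("C1", date(2012, 2, 10), 80.0),
    ("C3", date(2012, 2, 15), 500.0),
    ("C2", date(2012, 3, 1), 50.0),
]


def top_customers(rows, window_end, window_days, n):
    """Rank customers by total amount inside a window ending at window_end."""
    window_start = window_end - timedelta(days=window_days)
    totals = defaultdict(float)
    for customer, day, amount in rows:
        # Aggregate straight from the raw rows that fall inside the window.
        if window_start <= day <= window_end:
            totals[customer] += amount
    # Highest total first; customer id breaks ties deterministically.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


# Sliding the window only changes the arguments, not any stored aggregate.
january = top_customers(SALES, date(2012, 1, 31), 30, 2)
february = top_customers(SALES, date(2012, 2, 29), 30, 2)
```

Because the window boundaries are only known at query time, a materialized view built for one fixed period does not help here, which is why this kind of query is a useful test of raw scan performance.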
In my opinion, these tests demonstrate that SAP HANA offers linear scalability with sustained performance at large data volumes. Very advanced compression methods were applied directly to the columnar database without degrading query performance. The standard BW workload provides validation not only for SAP BW customers but for any data mart use case. This is the first time I have encountered a solution that offers BW the potential to access raw transactional ERP data in virtually real time.
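The briefing did not go into which compression methods were used, so the following is only an assumption for illustration: dictionary encoding is one common technique in columnar stores, and it shows how a column can be compressed and still be queried without decompressing it first. The function names `dictionary_encode` and `count_equal` and the sample city values are hypothetical.

```python
def dictionary_encode(column):
    """Replace each value with a small integer code into a sorted dictionary."""
    dictionary = sorted(set(column))
    code_of = {value: code for code, value in enumerate(dictionary)}
    codes = [code_of[value] for value in column]
    return dictionary, codes


def count_equal(dictionary, codes, value):
    """Count matching rows by comparing integer codes, without decoding."""
    if value not in dictionary:
        return 0
    target = dictionary.index(value)
    return sum(1 for code in codes if code == target)


# Assumed sample column: repeated values are what makes this compress well.
cities = ["Berlin", "Paris", "Berlin", "Rome", "Paris", "Berlin"]
dictionary, codes = dictionary_encode(cities)
berlin_rows = count_equal(dictionary, codes, "Berlin")
```

Each distinct string is stored once and every row holds only a small integer, so the filter compares integers rather than strings. That is the intuition behind compression that does not slow queries down.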
Data Management Architecture for the Next Generation of Analytics
Readers of this blog may also be interested to know that new databases optimized for business intelligence, such as HANA, have inherent architectural advantages over traditional databases. Old database architectures were optimized for transactional data storage on disk-based systems. These products focused on transactional integrity in the age of single-CPU machines connected through low-bandwidth distributed networks, while economizing on expensive memory. The computing environment has changed significantly over the last decade. With multi-core architectures now available on commodity hardware, processing large volumes of data in real time over high-speed distributed networks is becoming a reality thanks to products such as SAP HANA.
All in-memory Database Appliances are Not Created Equal
Apparently some solutions on the market, such as Oracle’s Exadata, also cache data in Exalytics/TimesTen for in-memory acceleration. However, TimesTen is a row-based in-memory database, not a columnar database like HANA, and columnar databases are faster for business intelligence applications. Oracle also uses these databases as an in-memory cache, whereas HANA is the primary data persistence layer for BW or a data mart. Therefore, in my opinion, Oracle’s solution is better suited to faster transactional performance but creates data latency issues for the real-time data that analytics requires. From a cost and effort perspective, it will also require a significant amount of tuning and a large database maintenance effort for ad-hoc queries (sliding time window, month-to-month comparison, etc.), because you are trying to reconfigure an architecture meant for transactional systems and deploy it for analytics.
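The row-based versus columnar distinction can be sketched in a few lines of Python. This is a toy model under my own assumptions, not a description of how TimesTen or HANA store data internally; the `ROWS` records and the helper functions are hypothetical.

```python
# Hypothetical ATM transactions stored row by row (one record per transaction).
ROWS = [
    {"txn_id": 1, "atm_id": "A7", "amount": 40.0},
    {"txn_id": 2, "atm_id": "A7", "amount": 100.0},
    {"txn_id": 3, "atm_id": "B2", "amount": 60.0},
]


def to_columns(rows):
    """Pivot a list of row records into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_from_rows(rows, field):
    """Row layout: every whole record is visited to read a single field."""
    return sum(row[field] for row in rows)


def total_from_columns(columns, field):
    """Column layout: only the one column the query needs is scanned."""
    return sum(columns[field])


COLUMNS = to_columns(ROWS)
row_total = total_from_rows(ROWS, "amount")
column_total = total_from_columns(COLUMNS, "amount")
```

Both layouts return the same answer. The difference is in how much data has to be touched: an analytic query that aggregates one field over billions of records reads a single contiguous column in the columnar layout, while the row layout is the natural fit for transactions that read or write one complete record at a time.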
I hope this blog is useful and provides general guidance to people who are considering new database technologies like SAP HANA.