Hi, welcome to the first of our Big Data Top Trumps series. We will see how SAP solutions and Open Source solutions stack up in the enterprise, where one may make sense over another, and also where it might make sense to use both. First up is Speed.
When it comes to speed, let’s first look at the end-user perspective. In this context, I am going to distinguish between the “known queries” that are run every day, week or month to give the business its “normal reports” and the “ad hoc queries” that answer a specific business question but are not part of the standard “business reporting”.
Starting with the former, a relational database can be carefully tuned with good index management for fast response to the “known questions”. However, the more challenging part is the questions that the business wants to ask of its data that are ad hoc in nature. This is where classic relational databases can grind to a halt, because the query may miss the tuned indexes and so take a long time to scan the data for the answer.
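To see the difference between a “known” and an ad hoc query, here is a minimal sketch using SQLite from the Python standard library as a stand-in for a classic relational database. The `sales` table, its columns and the `idx_sales_region` index are all hypothetical names made up for this illustration; the point is only that the query planner can use the tuned index for one query and has to fall back to a full scan for the other.

```python
import sqlite3


def query_plan(conn, sql):
    """Return SQLite's query plan for `sql` as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The fourth column of each plan row holds the human-readable detail.
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "id INTEGER PRIMARY KEY, region TEXT, product TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (region, product, amount) VALUES (?, ?, ?)",
    [
        ("North", "Widget", 10.0),
        ("South", "Gadget", 25.0),
        ("North", "Gadget", 40.0),
    ],
)

# The index is tuned for the "known" report, which always filters by region.
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")

# Known query: filters on the indexed column.
known_plan = query_plan(
    conn, "SELECT SUM(amount) FROM sales WHERE region = 'North'"
)

# Ad hoc query: filters on a column that nobody indexed.
adhoc_plan = query_plan(
    conn, "SELECT SUM(amount) FROM sales WHERE product = 'Gadget'"
)

print("known :", known_plan)
print("ad hoc:", adhoc_plan)
```

On three rows the difference is invisible, but the plan text shows which query would have to read every row once the table grows. The exact wording of the plan varies between SQLite versions.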
A columnar database organises and stores data tables by columns rather than rows. This enables information to be “targeted” more accurately when retrieving data, reducing the amount of serial scanning that occurs, particularly where queries are ad hoc in nature as previously described. This means that columnar databases are ideally suited to online analytical processing (OLAP) solutions. SAP has two columnar databases in its product suite, namely SAP IQ and SAP HANA; the difference is that SAP IQ is disk-based and SAP HANA is a fully in-memory solution. The advantage that SAP HANA has over both SAP IQ and its rivals is that it can also be used to support online transaction processing (OLTP) systems. This provides SAP with the perfect database platform for all of its enterprise solutions and has led it to announce the removal of support for non-HANA databases from 2025.
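To make the row-versus-column point concrete, here is a small sketch in plain Python. It is not how SAP IQ or SAP HANA are implemented; the data, the function names and the “values touched” counter are assumptions invented for the illustration, with the counter acting as a crude stand-in for how much data has to be read to answer a query.

```python
# Hypothetical order data, stored row by row as a row store would hold it.
rows = [
    {"order_id": 1, "region": "North", "product": "Widget", "amount": 10.0},
    {"order_id": 2, "region": "South", "product": "Gadget", "amount": 25.0},
    {"order_id": 3, "region": "North", "product": "Gadget", "amount": 40.0},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_amount_row_store(rows):
    """Sum `amount` row by row; every field of every row is read."""
    total = 0.0
    values_touched = 0
    for row in rows:
        values_touched += len(row)  # whole row is fetched to reach one field
        total += row["amount"]
    return total, values_touched


def total_amount_column_store(columns):
    """Sum `amount` by reading only the `amount` column."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)


columns = to_columns(rows)
row_total, row_touched = total_amount_row_store(rows)
col_total, col_touched = total_amount_column_store(columns)

print("row store   :", row_total, "values touched:", row_touched)
print("column store:", col_total, "values touched:", col_touched)
```

Both layouts give the same answer, but the columnar layout only reads the one column the question is about, which is why it copes better with questions nobody planned an index for.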
SAP HANA can therefore answer queries in near real time, based on data that is itself real-time in nature. To most business users this is the most important consideration: knowing that when I ask the question (as long as the data can support it), I will not be left watching a spinning clock for the next 5 minutes until the query times out and that, if I am lucky enough to get a response, the data returned will not already be out of date.
In the Open Source world, Apache HBase offers a column-oriented database layer on top of Hadoop HDFS and can perform in near real time for pre-defined queries, provided the row key is designed to match them. Whilst it’s not a database, Apache Spark also provides an in-memory computing platform for running applications. Spark can be used in conjunction with Hadoop for persistence, and the Spark project claims up to a 100-fold performance improvement over Hadoop MapReduce for in-memory workloads.
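The point about key makeup is easier to see with a toy model. The sketch below is plain Python, not the HBase API: `SortedKeyStore`, the `customer#date` key format and the sample data are all hypothetical. It only mimics the idea that rows are kept sorted by key, so a query that matches the start of the key can jump straight to its rows, while any other query has to look at everything.

```python
import bisect


class SortedKeyStore:
    """Toy key-value store that keeps its keys in sorted order."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep keys sorted on insert
        self._values[key] = value

    def scan_prefix(self, prefix):
        """Return (key, value) pairs whose key starts with `prefix`."""
        start = bisect.bisect_left(self._keys, prefix)  # jump to first match
        results = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break  # sorted order: no later key can match
            results.append((key, self._values[key]))
        return results

    def scan_filter(self, predicate):
        """Return (key, value) pairs matching `predicate`; checks every key."""
        return [(k, self._values[k]) for k in self._keys if predicate(k)]


store = SortedKeyStore()
# Hypothetical composite key: customer id first, then order date.
store.put("cust042#2016-02-11", 40.0)
store.put("cust007#2016-01-05", 10.0)
store.put("cust100#2016-01-09", 15.0)
store.put("cust042#2016-01-03", 25.0)

# Pre-defined query the key was designed for: all orders for one customer.
by_customer = store.scan_prefix("cust042#")

# Query the key was not designed for: all orders in January, any customer.
in_january = store.scan_filter(lambda key: key.split("#")[1].startswith("2016-01"))

print(by_customer)
print(in_january)
```

If the business mostly asked “what happened in January?”, putting the date first in the key would be the better makeup; the key has to be chosen with the known queries in mind.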
Other options in the Open Source world include Apache Cassandra and Apache Ignite for highly performant transactions. Cassandra is a NoSQL database, whilst Apache Ignite is an in-memory database that supports ACID-compliant transactions, thus addressing one of the main concerns with NoSQL versus RDBMS.
In summary, when it comes to highly performant OLAP, SAP HANA comes out on top with its columnar, in-memory architecture, but that must be weighed against cost. We’ll cover that in our next Top Trumps card…