Although there have been a number of articles already written about the column store in SAP HANA, it’s still a question that gets asked regularly, particularly around data reconstruction. The aim of this blog is to explain:
1. How we can access the statistics behind a table stored in column store format within SAP HANA, such as the main memory size and compression rate.
2. The workings of a column store, including data reconstruction.
3. Some of the advantages of the column store for business insights and analytics.
In order to discuss these aims, let’s work with an example of a row based table from a typical relational database. We will load the data into SAP HANA in order to store it in a column store table within main memory.
A sample of the table we will work with is in Figure 1. In total we have 650,216 records. It shows New York Stock Exchange Data. For each company that trades (stock symbol) it shows the prices for each day.
In SAP HANA, we can choose to store the table in a row store or column store, both of which are stored in memory. For this blog we have loaded the same data into two separate tables, one row store, one column store, in order to analyse them both.
The size of a table in main memory can be seen in its definition view (see Figure 2 for how to access this in the HANA Studio) as well as the type of store it uses.
Figure 2 – Accessing the table definition
If we loaded the table into row store the table is essentially stored as-is from source. The screenshot in Figure 3 shows the size of the table in memory (45,948 Kb = 44.9 Mb) for our 650,216 records. No fields in a column have been merged/compressed.
Figure 3 – Seeing the table definition of the row store table
If we loaded the data into column store the size of the table in memory is 15,754 Kb (15.4 Mb) which can be seen in Figure 4. This size is significantly lower than the memory size of the same table loaded into row store. The way the data is stored in column store to enable this reduced size is discussed shortly.
Figure 4 – Seeing the table definition of the column store table
Compression has occurred against the columns. We can see the compression rate against the columns of the table in Figure 5. This info is also visible in the table definition view within the HANA Studio.
Figure 5 – Seeing the table definition of the column store table
But how does the compression in column store actually work?
SAP HANA has a number of different compression techniques (Run-length encoding, Cluster encoding and Dictionary encoding) deep down in its inner workings. We will be looking theoretically at how Dictionary Encoding works. For each column in the table, the database is storing a Dictionary and an Attribute Vector. We’ll take a look at both of these objects in Figure 6 for our Stock Symbol field based on our table sample of 9 records.
Figure 6 – The original column for Stock Symbol and how it is now represented in column store format
The next question to answer is how we reconstruct the row when querying the data in column store. Let’s suppose we want to run a query showing us the sum of the Stock Volume for the Stock Symbol ABB. First we would read the dictionary of the Stock Symbol to get the Value ID for ABB (value ID 2) and use the inverted index to access the record IDs that use Value ID 2 (records 1, 4 & 7). See reference 7 for a visual reference of this.
Figure 7 – Reading of the Stock Symbol Dictionary and Index for ABB records
Then with the index of the Stock Volume we can map the records (1, 4 & 7) to the Stock Volume Value IDs (4, 5 & 6) and then map these to the actual Stock Volumes values in the dictionary (see Figure 8). Then we apply the sum function against the values to give us 8,804,800.
Figure 8 – Reading of the Stock Volume Dictionary and Index
Key thing with our query is that we only read the columns we needed, i.e. Stock Symbol and Stock Volume.
From a business insights and analytics perspective, here are some of the main benefits in us being able to store our data in column store format.
Store more data, in more detail than before
We can achieve very efficient compression because a large proportion of fields have a low number of distinct values, that are used many times, which we can merge together. The compressed data means the storage space is reduced, meaning more main memory is available whilst helping to eliminate data redundancy.
We can store large detailed and historic data sets that were too large for previous data marts and warehouses from both storage and read/write performance perspectives.
Quicker access to this larger and more detailed dataset
The compressed data can be moved quicker from main memory into the CPUs in order to be processed. SAP HANA already allows data to be moved to the CPU quickly because everything is in the main memory layer and not from disk. With the compressed data the transfer from main memory to the CPU can happen even quicker, meaningful rapid response times for users.
Only read what we need
When querying the data we only read the columns we need. Research shows¹ that we only use a small number of columns/fields within a table. Also, we would typically want to scan the whole column for each field (e.g. Sum the Sales Revenue, for all Countries). In row store we read all of the columns regardless of whether we actually need them. Column store eliminates unnecessary scanning.
Of course, we need to be aware that the write performance of a column store compared to row store isn’t as good but we can counter this with the current differentiation of read and write areas within the column store (and the subsequent delta merge process). SAP is developing the capabilities of SAP HANA in order to improve write performance into the column store.
Ultimately from a business insights perspective we no longer need to compromise or make a decision on what type of data (detailed or aggregated, or at which level of aggregation) we need to store in our analytical database. Neither do we need to think about the most common granularity of data access requests in order to build are indexes optimally (e.g. is most retail data accessed by region and material category or by store and material?). With SAP HANA we can store the complete dataset as it is and the column store technology will ensure everything is optimised for storage and retrieval (we do still need however to think about lifecycle of data, we shouldn’t load data that isn’t going to be needed).
The column store is removing the boundaries of what’s possible, ultimately meaning better insights leading to better decisions.
1 – OpenHPI In-Memory Database course http://openhpi.de