-
Notifications
You must be signed in to change notification settings - Fork 360
Speed up Big Data User Behavior Analysis using Bi dimension Ordering Structure
User analysis (account analysis) computes detail user or account data and analyzes or summarizes it. Often such a case is user behavior analysis, bank account analysis, calculating conversion funnel rate or insurance policy analysis.
This type of computing scenarios involves historical data of a huge number of users. The total data volume is gigantic (up to tens of millions of even a hundred million records), and the external storage is needed. Yet, a single user or account contains a small volume of data (from several to thousands of records). Generally, the user analysis is performed online and requires an instant result. High computing speed is necessary to get an immediate response. So, it is important to study these computing scenarios to find out characteristics of the computations and related data and determine a suitable optimization algorithm according to their characteristics so that the best possible performance can be achieve.
Filtering on time dimension is one characteristic of the user analysis. The time span of all data is long but that of filtered data is relatively short. Performance will be noticeably increased if we can quickly get filtering result without traversing all data. Creating an index on time dimension cannot bring the desired effect because the size of filtered data in those scenarios is still large. Even an index helps locate the target data quickly, a large number of useless reads are also involved when the target data does not exist continuously on the disk and speed cannot substantially improved. The only effective way is to physically store data in order of the time dimension. Traditional relational databases, however, are based on the concept of unordered sets and cannot ensure that data is stored physically in order. This can only be made up for through engineering optimization. Some database products try to make use of the data orderliness in their optimization engines, yet the actual effect is uncertain because in essence databases do not support orderliness.
The other characteristic of user analysis is that different users are not related. Computing data of one user generally does not involve data of another user. Imagine that when the source data contains a lot of users, even a simple distinct count on user will be very inconvenient. An alternative solution is loading and computing data of each user separately. This can effectively reduce coding and computing complexity, as well as increasing performance. Sometimes when the computing logic is complex and we need to load data of a single user into the memory and write complicated code to achieve it, handling multiple users one by one becomes particularly necessary.
Similarly, creating an index on the user dimension cannot make the process more efficient. When data of one user isn’t physically stored continuously, reading data bit by bit using index usually result in worse performance (much worse because data of all users will be traversed). Data of each user also needs to be physically stored continuously, that is to say, data should be ordered by the user dimension, in order to handle them one by one fast and efficiently.
It is a contradiction to order data by time dimension (for the convenience of filtering) and to order data by user dimension (for the convenience of later computations). Obviously, data cannot be ordered by two dimensions at the same time (ordering data by two dimensions one by one makes no sense). Even if an optimized relational database can, to some degree, make use of the order in which data is imported, the order is on one dimension. It is impossible to perform optimization on both the time and user dimensions, and thus perform computations fast.
SPL, the open-source data computing engine, offers bi-dimension ordering structure that during user analysis scenario makes data ordered by time dimension as a whole (for a fast filtering) while letting it ordered by users at retrieval (for reading them one by one for subsequent computations). This creates the impression of data being ordered by two dimensions at the same time. Now we can exploit the above-mentioned two characteristics to increase performance of user analysis tasks.
SPL stores data in multiple data tables of same structure (which we call zone tables) in time order. Each zone table stores data within a specific time range. Zone tables are arranged by time dimension and in each zone table data is ordered by user dimension and time dimension.
When performing a filtering operation on time dimension, SPL can quickly locate the zone table containing the target data according to the start time and end time specified in the filtering condition. The number of involved zone tables is much less than the total number of zone tables as the location efficiently excludes the useless data that occupies the greater part of data. Though each involved zone table is not ordered by time and we need to traverse each and perform filtering on time dimension at retrieval, the process is much faster than traversing all data to get target data.
If there is only one table meeting the filtering condition, we can quickly retrieve users in order for use in later computations as data in the zone table is already ordered by user. If there are multiple eligible zone tables, SPL has the efficient order-based merge algorithm to merge them together on the user dimension as data in each zone table is ordered by user, and we can still retrieve data of users one by one.
About the principle behind bi-dimension ordering structure, See SPL Pseudo Table Bi-dimension Ordering Structure.
We’ll further explain the use of bi-dimension ordering structure through two examples. First is a simple, regular computing task involving distinct count.
Suppose we have TransDetail table T that stores detailed account transaction data in one year in multiple fields, including userid (user account), dt (date), city (city the account belongs to), product (product) and amt (transaction amount). The task is to get records whose dt field values are within a specified time period, group them by product, and in each group count the distinct user ids and sum amounts.
Here the distinct count is troublesome. The regular method is to maintain a distinct-value result set, compare each original record with the result set to see if there is a same record in the latter, and decide whether or not retain the record. This requires a large portion of memory space and involves complex comparisons. Sometimes the result set of a user-based distinct operation is huge. If it cannot fit into the memory and external buffer is needed, performance will go down sharply.
But, if data is already ordered by the user dimension, we can read in data of users in order, during which a simple count is performed whenever the user dimension value changes. One traversal is enough to calculate the distinct count fast, with light memory usage and simple comparisons and without the need of external buffer no matter how much data we have.
With SPL, we can store the one year’s detail data into 12 zone tables by month using the bi-dimension ordering structure. Each zone table stores data of one month. The zone tables as a whole are ordered by dt. In each zone table, data is ordered by userid and dt. Now we can use the above algorithm to perform distinct count quickly:
A | |
1 | =create(file,zone,date).record(["T.ctx",to(12),"dt"]) |
2 | =pseudo(A1,0) |
3 | =A2.select(dt>=date("2021-05-15") && dt<= date("2021-07-05") ) |
4 | =A3.groups(product;icount(userid),sum(amt)) |
A1 A2 Generate a pseudo table object of bi-dimension ordering structure.
A3 performs fast filtering on ordered dt. In A4, groups function implements the above-mentioned algorithm to perform a fast order-based distinct count by making use of the ordered userid.
Let’s look at a more complicated account analysis scenario – the eCommerce conversion funnel analysis.
Suppose AccountTrans table T1 stores 12 months of data in the same way. The table has multiple fields – userid (user account), etime (event time) and etype (event type). An event type can be "login", "search" or "pageview". We are trying to count the distinct users who perform a series of continuous operations like login, search and pageview in a specified time range. An event that comes late in the series has smaller number of accounts. We can see that the numbers of accounts corresponding to different event types form a funnel shape that has big top and small bottom.
In essence the funnel analysis is a time-series computation, during which each user is matched to events in time order. As SQL is based on unordered sets, it is extremely hard to achieve complex time-series computations. There will be hundreds of lines of code even if the computation can be implemented. Moreover, the code is related with the number of events. Adding events needs modifying the code. This makes optimization impossible. Programmers often write complicated UDFs in the database to achieve the computation. The code is complex and hard to maintain. And as data is stored in the database, they cannot be ordered by two dimensions at the same time and thus high performance is still unattainable.
If data is ordered by user dimension, we can load data of each account into the memory one by one and form an ordered set. The order of records in this set is that of events happened in time order. So, it is easy to code computation and even maintain a general code for multiple scenarios.
SPL’s bi-dimension ordering structure can help achieve the computing logic. The result set of a fast filtering on time is also ordered by user account. We can retrieve data of user circularly and load it into the memory. As data in each account is ordered by time, it is easy to achieve the funnel analysis with multistep coding. The SPL code is significantly less than SQL code and has much higher performance. Below is the SPL code:
A | B | C | D | |
1 | =["login","search","pageview"] | =A1.(0) | ||
2 | =T1.select(dt>=date("2021-05-15") && dt<=date("2021-06-14")).cursor() | |||
3 | for A2;userid | =first=A3.select(etype==A1(1)) | if first.len()==0 | next |
4 | =t=null | =A1.(null) | ||
5 | for A1 | if #B5==1 | >C4(1)=t=t1=first.dt | |
6 | else | >C4(#B5)=t=if(t,A3.select@1(etype==B5 && dt>t && dt | ||
7 | =C4.(if(~==null,0,1)) | >C1=C1++B7 |
A2 defines cursor for the result set of filtering on time. A3 loops to retrieve one user account and perform complex computation each time. As each round of the loop retrieves all data of one userid quickly, only a simple piece of code is enough to achieve the complex conversion funnel analysis.
There are more explanations about conversion funnel computations in SQL Performance Enhancement: Conversion Funnel Analysis.
SPL bi-dimension ordering structure also supports multithreaded computations. We can use the computing abilities of multiple CPUs or multiple CPU cores to further increase computing speed.
To speed up user analysis tasks, it is necessary that data is ordered by both time dimension and user dimension. The traditional unordered-set-based relational databases cannot bring into play the data orderliness. Even if through engineering optimizations the data importing order can be exploited, it is impossible that data is ordered by two dimensions at the same time. SPL bi-dimension ordering structure helps to reach a similar effect of data being ordered by both time dimension and user dimension. This enables programmers to make effective use of the user analysis’s two prominent characteristics to speed up computations.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code