In this section, we cover the most complex and data intensive sample queries. These queries may join two or more tables or involve use of specific HIVE keywords.
One thing to note is that the queries here may not match metrics on your dashboard as your dashboard metrics may have specific definitions for your applications. In most circumstances, your metrics should be the most correct source of information, though these examples should produce results that are close unless your specific metric differs significantly.
Some of these examples include
app_id in the
WHERE clause and some do not. In general, you will likely want to limit your queries to a specific application or add
app_id to the
GROUP BY to get per-app information.
This query calculates the number of users who installed on Day 0 (January 21, 2017 in the example) and returned on Day 1 (January 22).
For day 1 retention we need to identify users who installed one day and had a session exactly one day after that. A user may have had more than one session so we need to
SELECT DISTINCT users with a session on the relevant day. It is uncommon, but possible, for duplicate installs to be sent. To avoid any errors, this table is also de-duped with
SELECT DISTINCT. The
LEFT OUTER JOIN selects all rows on the LEFT table, the
install table, and finds any row with an
sid that matches from the
session table. If there is no matching session. the row will contain a
NULL instead of 1 in the
SELECT SUM(retained) as retained_users, SUM(installs)as installs FROM ( SELECT DISTINCT sid, 1 as installs FROM install WHERE day = 20170121 ) inst LEFT OUTER JOIN ( SELECT DISTINCT sid, 1 as retained FROM session WHERE day = 20170122 ) ret ON inst.sid = ret.sid
This query outputs the country, number of returning users, as well as the numbers of installs for a given date period.
Compared to day 1 retention, if you want to look at a larger range of install days, it is important to be sure you properly match the day the session occurred. Since DataMine and HIVE only support equality type joins, we need to use the
date_add function to match installs to returning users. If we want to see retention by another dimension, such as country, it is important to consider the case where a user installed in one country and returned in a different country. The usual treatment of this edge case is to cohort users based on the country they installed in and ignore the country they returned in.
SELECT location_country_ip, COALESCE(SUM(retained),0) as retained_users, SUM(installs) as installs FROM ( SELECT DISTINCT sid, 1 as installs, to_date(from_unixtime(ts)) as installed_day, location_country_ip FROM install WHERE day between 20170120 and 20170123 ) inst LEFT OUTER JOIN ( SELECT DISTINCT sid, 1 as retained, to_date(from_unixtime(ts)) as returned_day FROM session WHERE day between 20170127 and 20170130 ) ret ON inst.sid = ret.sid AND date_add( inst.installed_day,7) = ret.returned_day GROUP BY location_country_ip
This query looks at how many users who were active one day and returned the next in a given date range. To do this, the first step is to limit the data to one row per user per day. The
LEAD() function then compares each user’s day of activity to their next day of activity. If they are consecutive, we count them as a returning user.
SELECT day, COUNT(1) as users, SUM( active_next_day) as returners FROM ( SELECT day, IF( datediff( lead(day) OVER( PARTITION BY sid ORDER BY day), day) = 1, 1, 0) as active_next_day FROM ( SELECT sid, to_date(from_unixtime(ts)) as day, row_number() OVER( PARTITION BY sid, day ) as msg_per_day FROM session WHERE day between 20170101 and 20170110 ) t1 WHERE msg_per_day = 1 ) t2 GROUP BY day
This query calculates the number of users who are active within the application within the given date range that did not install on the current day. We will calculate this by finding all users who’s install day is different than the day they generated a session message. Much like day over day retention, we start by reducing the data to one row per user per day.
SELECT day, COUNT(1) as users, SUM( IF(install_day = day, 0, 1 ) ) as returners FROM ( SELECT sid, to_date(from_unixtime(ts)) as day, to_date(from_unixtime(install_ts)) as install_day, row_number() OVER( PARTITION BY sid, day ) as msg_per_day FROM session WHERE day between 20170101 and 20170110 ) t2 WHERE msg_per_day = 1 GROUP BY day
This query outputs the last three actions of every non-returning (churned) user within a given time frame. We define a churned user as any user who was active in the last 30-day period, but not active in the last 2 weeks.
First, the query selects all users and their events in the month of October.
IF(day > 20171015,1,0) checks user activity and assigns a value of
1 to active users and
0 to inactive users. We then use the filter
WHERE active_user = 0 to cut down to just inactive users.
Now, we have all events for all inactive users. The next thing to do is to find the last 3 events. In the outer
SELECT statement, we will create a column that records the user’s last activity date using
MAX(day) OVER( PARTITION BY app_id, sid). Then, in the
WHERE day = last_day clause we will keep only events that happened on their final day.
We have filtered down to the last day of events for all inactive users. Next, we must find their last 3 events on this day. We partition the users based on their application and
sid, then use
ROW_NUMBER() to order their events from most recent to least recent.
In the final
SELECT statement, we can finally filter down to the last 3 actions of a churned user with
WHERE user_rank_desc <= 3.
-- Prints last 3 actions of each churned user SELECT app_id, sid, from_unixtime(ts) as ts, type FROM( SELECT *, row_number() OVER( PARTITION BY app_id, sid ORDER BY day DESC, ts DESC, seq_id DESC) as user_rank_desc -- orders events from most recent to least recent FROM( SELECT *, MAX(day) OVER( PARTITION BY app_id, sid) as last_day FROM( SELECT app_id, sid, type, ts, seq_id, day, MAX(IF(day > 20171015,1,0) OVER( PARTITION BY app_id, sid) as active_user FROM event WHERE day between 20171001 and 20171031 ) t1 WHERE active_user = 0 -- filter on churned users ) t2 WHERE day = last_day -- keeps events from churned users’ last day ) WHERE user_rank_desc =< 3 -- filter to last 3 events
This query will read all of the events sent on a given day, count how many of each event is sent, and then look at all of the values in the pub_data field, finding the top 5 most common keys for each of those values. This is often useful in debugging or exploring the custom event data you are sending to Upsight. The
user_attributes field can also be used in place of the
SELECT app_id, key_sum, type, key, value as five_most_common_values FROM ( SELECT app_id, n, type, key, value, row_number() OVER(PARTITION BY app_id, type, key ORDER BY n ) as val_rank, sum(n) OVER(PARTITION BY app_id, type, key) as key_sum FROM ( SELECT app_id, COUNT(1) as n, type, key, value FROM event LATERAL VIEW EXPLODE ( pub_data ) sub as key, value WHERE day in (20170131) GROUP BY app_id, type, key, value )t1 )t2 WHERE val_rank =< 5