Advanced Queries

Overview


In this section, we cover the most complex and data intensive sample queries. These queries may join two or more tables or involve use of specific HIVE keywords.

One thing to note is that the queries here may not match metrics on your dashboard as your dashboard metrics may have specific definitions for your applications. In most circumstances, your metrics should be the most correct source of information, though these examples should produce results that are close unless your specific metric differs significantly.

Some of these examples include app_id in the WHERE clause and some do not. In general, you will likely want to limit your queries to a specific application or add app_id to the GROUP BY to get per-app information.

Day 1 Retention

Advanced Query


Overview

This query calculates the number of users who installed on Day 0 (January 21, 2017 in the example) and returned on Day 1 (January 22).

For day 1 retention we need to identify users who installed one day and had a session exactly one day after that. A user may have had more than one session so we need to SELECT DISTINCT users with a session on the relevant day. It is uncommon, but possible, for duplicate installs to be sent. To avoid any errors, this table is also de-duped with SELECT DISTINCT. The LEFT OUTER JOIN selects all rows on the LEFT table, the install table, and finds any row with an sid that matches from the session table. If there is no matching session. the row will contain a NULL instead of 1 in the retained column.

Query

SELECT SUM(retained) as retained_users, SUM(installs)as installs 
FROM 
( 
    SELECT DISTINCT sid, 1 as installs 
    FROM install 
    WHERE day = 20170121 
) inst 
LEFT OUTER JOIN 
( 
    SELECT DISTINCT sid, 1 as retained 
    FROM session 
    WHERE day = 20170122 
) ret 
ON inst.sid = ret.sid

Day 7 Retention With Additional Dimensions

Advanced Query


Overview

This query outputs the country, number of returning users, as well as the numbers of installs for a given date period.

Compared to day 1 retention, if you want to look at a larger range of install days, it is important to be sure you properly match the day the session occurred. Since DataMine and HIVE only support equality type joins, we need to use the date_add function to match installs to returning users. If we want to see retention by another dimension, such as country, it is important to consider the case where a user installed in one country and returned in a different country. The usual treatment of this edge case is to cohort users based on the country they installed in and ignore the country they returned in.

Query

SELECT 
  location_country_ip, 
  COALESCE(SUM(retained),0) as retained_users, 
  SUM(installs) as installs 
FROM 
( 
    SELECT DISTINCT sid, 1 as installs, 
      to_date(from_unixtime(ts)) as installed_day, 
      location_country_ip 
    FROM install 
    WHERE day between 20170120 and 20170123 
) inst 
LEFT OUTER JOIN 
( 
    SELECT DISTINCT sid, 1 as retained, 
      to_date(from_unixtime(ts)) as returned_day 
    FROM session 
    WHERE day between 20170127 and 20170130 
) ret 
ON inst.sid = ret.sid 
AND date_add( inst.installed_day,7) = ret.returned_day 
GROUP BY location_country_ip

Day Over Day Retention

Advanced Query


Overview

This query looks at how many users who were active one day and returned the next in a given date range. To do this, the first step is to limit the data to one row per user per day. The LEAD() function then compares each user’s day of activity to their next day of activity. If they are consecutive, we count them as a returning user.

Query

SELECT day, 
  COUNT(1) as users, SUM( active_next_day) as returners 
FROM 
( 
    SELECT day, IF( datediff( 
       lead(day) OVER( PARTITION BY sid ORDER BY day), 
       day) = 1, 1, 0) as active_next_day 
    FROM 
    ( 
        SELECT sid, to_date(from_unixtime(ts)) as day, 
          row_number() OVER( PARTITION BY sid, day ) as msg_per_day 
        FROM session 
        WHERE day between 20170101 and 20170110 
    ) t1 
    WHERE msg_per_day = 1 
) t2 
GROUP BY day

Percent Returning Users

Advanced Query


Overview

This query calculates the number of users who are active within the application within the given date range that did not install on the current day. We will calculate this by finding all users who’s install day is different than the day they generated a session message. Much like day over day retention, we start by reducing the data to one row per user per day.

Query

SELECT day, 
  COUNT(1) as users, SUM( IF(install_day = day, 0, 1 ) ) as returners 
FROM 
( 
    SELECT sid, to_date(from_unixtime(ts)) as day, 
      to_date(from_unixtime(install_ts)) as install_day, 
      row_number() OVER( PARTITION BY sid, day ) as msg_per_day 
    FROM session 
    WHERE day between 20170101 and 20170110 
) t2 
WHERE msg_per_day = 1 
GROUP BY day

Last 3 actions of inactive (churned) users

Advanced Query


Overview

This query outputs the last three actions of every non-returning (churned) user within a given time frame. We define a churned user as any user who was active in the last 30-day period, but not active in the last 2 weeks.

First, the query selects all users and their events in the month of October. IF(day > 20171015,1,0) checks user activity and assigns a value of 1 to active users and 0 to inactive users. We then use the filter WHERE active_user = 0 to cut down to just inactive users.

Now, we have all events for all inactive users. The next thing to do is to find the last 3 events. In the outer SELECT statement, we will create a column that records the user’s last activity date using MAX(day) OVER( PARTITION BY app_id, sid). Then, in the WHERE day = last_day clause we will keep only events that happened on their final day.

We have filtered down to the last day of events for all inactive users. Next, we must find their last 3 events on this day. We partition the users based on their application and sid, then use ROW_NUMBER() to order their events from most recent to least recent.

In the final SELECT statement, we can finally filter down to the last 3 actions of a churned user with WHERE user_rank_desc <= 3.

Query

-- Prints last 3 actions of each churned user
SELECT app_id, sid, from_unixtime(ts) as ts, type
FROM(
  SELECT
    *, row_number() OVER( PARTITION BY app_id, sid ORDER BY day DESC, ts DESC, seq_id DESC) as user_rank_desc -- orders events from most recent to least recent
  FROM(
    SELECT
      *, MAX(day) OVER( PARTITION BY app_id, sid) as last_day
    FROM(
      SELECT
        app_id, sid, type, ts, seq_id, day,
        MAX(IF(day > 20171015,1,0) OVER( PARTITION BY app_id, sid) as active_user
      FROM event
      WHERE day between 20171001 and 20171031
    ) t1
    WHERE active_user = 0 -- filter on churned users
  ) t2
  WHERE day = last_day -- keeps events from churned users’ last day
)
WHERE user_rank_desc =< 3 -- filter to last 3 events

Find popular events and pub data keys

Advanced Query


Overview

This query will read all of the events sent on a given day, count how many of each event is sent, and then look at all of the values in the pub_data field, finding the top 5 most common keys for each of those values. This is often useful in debugging or exploring the custom event data you are sending to Upsight. The user_attributes field can also be used in place of the pub_data field.

Query

SELECT app_id, key_sum, type, key, value as five_most_common_values 
FROM 
( 
    SELECT app_id, n, type, key, value, 
      row_number() OVER(PARTITION BY app_id, type, key ORDER BY n ) as val_rank, 
      sum(n)       OVER(PARTITION BY app_id, type, key) as key_sum 
    FROM 
    ( 
        SELECT app_id, COUNT(1) as n, type, key, value 
        FROM event 
        LATERAL VIEW EXPLODE ( pub_data ) sub as key, value 
        WHERE  day in (20170131) 
        GROUP BY app_id, type, key, value 
    )t1 
)t2 
WHERE val_rank =< 5