Best Practices

General Guidelines


Good Naming Conventions

When aliasing or otherwise naming elements of a query, always make your names clear, consistent, and self-explanatory. Longer names can seem inconvenient while you are first building a query, but a naming scheme that can be easily understood by both you and Upsight support staff makes debugging much, much easier.
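
As a sketch (using the monetization table and columns that appear later in this guide), compare an opaque alias scheme with a descriptive one. The names per_user and total_revenue are illustrative; the point is that they describe what the subquery and column contain, where t1 and v do not.

-- Hard to debug: what are t1 and v?
SELECT t1.sid, t1.v
FROM (
    SELECT sid, SUM(upsight_data_value_usd) AS v
    FROM monetization
    WHERE month = 201701
    GROUP BY sid ) t1

-- Self-explanatory:
SELECT per_user.sid, per_user.total_revenue
FROM (
    SELECT sid, SUM(upsight_data_value_usd) AS total_revenue
    FROM monetization
    WHERE month = 201701
    GROUP BY sid ) per_user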

Restrict Data Range and Filter Data Early

Your data in Upsight is stored in monthly, daily, and hourly partitions. Always specify at least a month in the WHERE clause while you are creating or modifying a query; where possible, including the day or hour partitions provides additional optimization and faster query times. By specifying the range of data you want to see, you avoid reading data that is not needed. A query without month = YYYYMM, day = YYYYMMDD, or hour = YYYYMMDDHH in its WHERE clause will almost certainly process so much data that it runs slowly. If, after running the query with a month restriction, you would like a greater range of data, you can widen the restriction to several months with WHERE month IN (month1, month2). Remove the month restriction only if you have a use case that requires data for all time (e.g. finding the number of days a user has used your application), and only after you have followed the previous steps.
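
As a sketch, here is how a query might be narrowed while it is being developed and then widened once it works (the app_id value is illustrative):

-- While developing: restrict to a single day (or hour)
SELECT COUNT(DISTINCT sid)
FROM event
WHERE app_id = 123456
AND month = 201701
AND day = 20170115

-- Once the query works: widen to several months if needed
SELECT COUNT(DISTINCT sid)
FROM event
WHERE app_id = 123456
AND month IN (201701, 201702)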

The same applies to app_id and message type; pay close attention to these when creating your innermost queries.

Deduplicate

Because we use all raw data exactly as it is sent to our servers, it is very important to dedupe your data for aggregation-related metrics. For example, our mobile SDK sends session messages regularly while users use your application, so a user who has used your application several times in the course of a day may generate tens or hundreds of session messages. To get the number of active users for a specific day, we therefore have to use COUNT(DISTINCT sid) to filter out these 'duplicate' messages; COUNT(sid) alone would give us a count of messages, not a count of users.

Let's say that we want to find the number of users that made a purchase in a given month. Without thinking about deduplication, one might write this query:

SELECT COUNT(sid) 
FROM monetization
WHERE month = 201701
AND app_id = 123456

However, if we take a closer look at the data, we will see that a single user may sometimes send more than one monetization message.

SELECT sid, count(*) as totalmsgs 
FROM monetization
WHERE month = 201701
AND app_id = 123456
GROUP BY sid
ORDER BY totalmsgs DESC
sid | totalmsgs
4897645 | 610
56798567 | 547
658678 | 64
3354723 | 44

Instead, we should use DISTINCT to get the number of unique users that sent monetization messages this month:

SELECT COUNT(DISTINCT sid) 
FROM monetization
WHERE month = 201701
AND app_id = 123456

Operator Precedence

When writing complex queries it is easy to confuse the order of operations of arithmetic operators, or of AND/OR precedence in JOIN or WHERE statements. Be sure to always use parentheses to explicitly specify precedence.

For example, suppose you would like to select two types of events from your application (app_id = 123456) for month = 201601. The following example shows the wrong way to structure the WHERE clause. Because AND binds more tightly than OR, this clause retrieves records that satisfy either the first three conditions or the last condition alone. That means you would retrieve records with type = "pub.economy.sell" that do not belong to the right month or application.

WHERE app_id = 123456 AND month = 201601 AND type = "pub.economy.buy" OR type ="pub.economy.sell"

The correct way to structure the WHERE clause is shown below. The parentheses explicitly specify precedence, so the query checks whether each message matches either of the two event types (buy or sell) and then checks whether the app_id and month requirements are met.

WHERE app_id = 123456 AND month = 201601 AND (type = "pub.economy.buy" OR type = "pub.economy.sell")

Missing ON Criteria

When making a JOIN in a query, be certain to add an ON clause every time. Without the ON clause, your query will attempt to join every row of the first table with every row of the second table, which is almost never what is intended. Also be certain not to use only one ON clause when you have multiple JOINs: there should be an ON for every JOIN.
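
As a sketch using tables from this guide (the app_id value is illustrative), a query with two JOINs should have two ON clauses, one immediately after each JOIN:

SELECT i.sid, m.upsight_data_value_usd
FROM install i
JOIN monetization m
ON i.sid = m.sid
JOIN event e
ON m.sid = e.sid
WHERE i.month = 201701
AND m.month = 201701
AND e.month = 201701
AND i.app_id = 123456

Omitting either ON clause would produce a cross product of the tables involved, multiplying the number of rows scanned.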

Querying Large Datasets - Sampling

For many large applications, it can be time-consuming or impractical to gather all of the data for a query. For these situations, we instead focus on uncovering trends or patterns in a subset of the data. Random sampling is a great way to achieve this goal. Assuming that the unique identifier for each of your users is uniformly distributed, we can take the last two digits of each user ID to get a good random sample of your user base.

  1. Before sampling, let's find out the size of the population.

    SELECT COUNT(DISTINCT(sid))
    FROM event
    WHERE month >= 201506
    

    Result: 2,441,856

  2. When sampling a percentage of the population, we should ensure that the sampling technique is not biased towards users with a specific property. In most cases, user IDs are generated randomly, which is why we can rely on the last few digits of the user ID as a way to randomly sample our population. To sample 1% of the population of 2.4 million users, we can use the following query, which only selects IDs whose last two digits are "11":

    SELECT COUNT(DISTINCT(sid))
    FROM event
    WHERE month >= 201506
    AND SUBSTR(sid, LENGTH(sid) - 1, 2) = '11';
    

    Result: 24,414

  3. In order to verify that our sampling technique is random, let's find out the distribution of all the user IDs based on their last two digits. Ideally it should be evenly distributed, so that when we sample we will get an accurate percentage and a random sample. If it is not evenly distributed, we can use the HASH() function on the user ID, sid, to get a more even distribution.

    SELECT SUBSTR(distributiontest.sid, LENGTH(distributiontest.sid) - 1, 2), COUNT (*)
    FROM (
            SELECT DISTINCT(sid)
            FROM event
            WHERE month >= 201506) distributiontest
    GROUP BY SUBSTR(distributiontest.sid, LENGTH(distributiontest.sid) - 1, 2)
    
    Last 2 digits of userid | Number of occurrences
    00 | 24598
    01 | 24546
    02 | 24622
    03 | 24622
    04 | 24313
    05 | 24283
    06 | 24578
    07 | 24393
    08 | 24611
    09 | 24475
    10 | 24594
    ... | ...
    41 | 24342
    ... | ...
    99 | 24420

    If you find that the user IDs are unevenly distributed, then you can use the HASH function on the user ID to get a more even distribution.

    SELECT SUBSTR(CAST(distributiontest.s AS STRING), LENGTH(CAST(distributiontest.s AS STRING)) - 1, 2), COUNT(*)
    FROM (
            SELECT DISTINCT(HASH(sid)) AS s
            FROM event
            WHERE month >= 201506) distributiontest
    GROUP BY SUBSTR(CAST(distributiontest.s AS STRING), LENGTH(CAST(distributiontest.s AS STRING)) - 1, 2)
    
    Last 2 digits of userid | Number of occurrences
    00 | 24664
    01 | 24255
    02 | 24482
    03 | 24369
    04 | 24425
    05 | 24737
    06 | 24333
    07 | 24517
    08 | 24369
    09 | 24305
    10 | 24476
    ... | ...
    41 | 24632
    ... | ...
    99 | 24352
  4. References

    To sample x% of the total population, here are some example functions you can use.

    Sampling Percentage | Example Function
    0.1% | SUBSTR(sid, LENGTH(sid) - 2, 3) = 111
    1% | SUBSTR(sid, LENGTH(sid) - 1, 2) = 11
    25% | SUBSTR(sid, LENGTH(sid) - 1, 2) BETWEEN 11 AND 35
  5. Sampling Use Cases

    • Extracting user data from a percentage of the user population for post-processing in R or other statistical programs.

    • Calculating the distribution of device types in an application. We can sample 25% of the total population by gathering messages from only 25% of our SIDs.

Use UNION over JOIN whenever possible

Our basic, intermediate, and advanced query sections go over this in more detail. Typically a UNION will be more efficient than a JOIN.
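
As a sketch of the idea: to count how many messages each user sent across two tables, you can UNION ALL the per-table rows inside a subquery and aggregate once, instead of joining the tables to each other.

SELECT combined.sid, COUNT(*) AS total_msgs
FROM (
    SELECT sid FROM install WHERE month = 201701
    UNION ALL
    SELECT sid FROM monetization WHERE month = 201701 ) combined
GROUP BY combined.sid

Note that UNION ALL keeps duplicate rows; deduplicate afterwards (for example with COUNT(DISTINCT ...)) if your metric requires it.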

Efficient Instrumentation

During instrumentation, implement pub_data and user_attribute fields that contribute to query efficiency. For example, implementing a total_spend user attribute allows for spender group segmentation without having to query for and sum every monetization message for a user. Your CSM can help you determine the best attributes to implement.

Set up a Custom Table

For complex queries, consider setting up a custom table that first aggregates your data into fewer rows, and use that table to query from instead of querying from the Upsight standard tables.
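
As a hedged sketch (the table name daily_user_revenue is hypothetical, and whether you can create tables depends on your DataMine setup), Hive's CREATE TABLE AS SELECT can materialize an aggregate that later queries read instead of the raw table:

CREATE TABLE daily_user_revenue AS
SELECT day, sid, SUM(upsight_data_value_usd) AS revenue
FROM monetization
WHERE month = 201701
GROUP BY day, sid

Subsequent queries against daily_user_revenue then scan one row per user per day rather than every monetization message.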

Note For customers that make heavy use of custom tables in queries, Upsight also offers the option to set up “Intermediate Datasets” which are custom tables that contain aggregated data from a DataMine query and are scheduled to update automatically. Consult your CSM to determine if Intermediate Datasets are right for your use case.

Common Mistakes


Mistake #1: GROUP BY and Aggregate Functions

Imagine we have the following table and query, and we would like to find out how much revenue each user has generated. (sid is the user ID and upsight_data_value_usd is the value of the transaction converted to US currency, derived from upsight_data_total_price and upsight_data_currency. If you would like to use a different currency, simply multiply the USD value by your exchange rate.)

sid | upsight_data_value_usd | ts
1 | 1000 | 1234
1 | 2000 | 2345
2 | 1000 | 1234
2 | 1000 | 4567
SELECT sid, SUM(upsight_data_value_usd), ts 
FROM monetization

The above query will result in a syntax error. We could fix this by adding a GROUP BY clause. We might try the following:

SELECT sid, SUM(upsight_data_value_usd), ts 
FROM monetization
GROUP BY sid

Unfortunately, this still results in an error.

Let's explain why by looking at user sid = 1, who has two different ts values. What should the query do with them? Take the minimum? The maximum? A random one? The first one it sees, or both? This ambiguity is the reason for the error. To fix it, we need to specify what to do with every column that is not being grouped or aggregated. For this query, it makes the most sense to group by sid (to see the total revenue of each user) and take MIN(ts) to see the first monetization event of that user.

SELECT sid, SUM(upsight_data_value_usd), MIN(ts) as first_purchase 
FROM monetization 
GROUP BY sid

Resulting Table

sid | upsight_data_value_usd | first_purchase
1 | 3000 | 1234
2 | 2000 | 1234

Mistake #2: Cannot Group by Alias

In Hive QL, we cannot use an alias to refer to a column in a GROUP BY clause. Instead, we must use the actual expression.

Incorrect Query

SELECT FROM_UNIXTIME(ts, 'yyyy-MM-dd') as day, SUM(upsight_data_value_usd) as total_revenue
FROM monetization 
WHERE month = 201512
GROUP BY day

Query Error Message

FAILED: Error in semantic analysis: Line 4:9 Invalid table alias or column reference day

The error in the above query is in the GROUP BY clause. If you want to group by a computed column, you have to repeat the expression rather than use its alias.

Correct Query

SELECT FROM_UNIXTIME(ts, 'yyyy-MM-dd') as day, SUM(upsight_data_value_usd) as total_revenue
FROM monetization
WHERE month = 201512
GROUP BY FROM_UNIXTIME(ts, 'yyyy-MM-dd')

Differences Between SQL/Hive QL


Aliasing Syntax

SQL Syntax

SELECT ins.sid 
FROM install AS ins

The above query is how you would alias in SQL. However, Hive QL does not use AS to alias a table:

HQL Syntax

SELECT ins.sid 
FROM install ins

LEFT OUTER JOIN instead of EXCEPT or NOT IN

Assume that we have a flight booking application: we want to identify the users who have signed up through our app, and determine the average amount of money spent by a typical registered user. However, our application has a few registered users that have purchased three or more flights in the past 12 months, also known as frequent fliers. We want to exclude frequent fliers from our query, as they aren't representative of a typical user.

To compute this metric, we need to query the following tables:

  • install messages from the install table
  • monetization messages from the monetization table

If you were going to use the Upsight Analytics schema to try to write this in SQL instead of Hive, you might try the following:

SELECT SUM(upsight_data_value_usd)/COUNT(DISTINCT (sid))
FROM monetization
WHERE month = 201507 AND sid NOT IN
(
SELECT purchase_history.sid
FROM(
    SELECT sid, count(*) as purchase_count
    FROM monetization
    WHERE month = 201507
    GROUP BY sid )purchase_history
WHERE purchase_history.purchase_count >= 3)

However, the NOT IN clause is not supported in Hive. To get around this, we can use a LEFT OUTER JOIN as shown in the following example:

SELECT all_purchases.sid AS userid, all_purchases.v AS revenue,
       CASE WHEN frequent_buyer.sid IS NULL THEN "no" ELSE "yes" END AS if_frequent_flyer
FROM
    (SELECT sid, upsight_data_value_usd AS v
    FROM monetization
    WHERE month = 201507)all_purchases
    LEFT OUTER JOIN
    (
    SELECT purchase_history.sid
    FROM(
        SELECT sid, count(*) as purchase_count
        FROM monetization
        WHERE month = 201507
        GROUP BY sid )purchase_history
    WHERE purchase_history.purchase_count >= 3)frequent_buyer
    ON all_purchases.sid=frequent_buyer.sid

This query would yield the following sample return:

# | userid | revenue | if_frequent_flyer
1 | 100000000001 | 1500 | no
2 | 100000000001 | 20 | no
3 | 100000000022 | 1500 | yes
4 | 100000000022 | 1500 | yes
5 | 100000000022 | 1500 | yes
6 | 100000000022 | 1500 | yes
10 | 100000000050 | 1500 | no
11 | 100000000008 | 20 | no
12 | 100000000600 | 400 | no
13 | 133004555559 | 1500 | yes
14 | 133004555559 | 1500 | yes
15 | 133004555559 | 1500 | yes

Now to exclude the frequent fliers, we simply wrap the above query with another select statement:

SELECT user_bucketing.userid,user_bucketing.revenue
FROM
    (SELECT all_purchases.sid AS userid, all_purchases.v AS revenue,
            CASE WHEN frequent_buyer.sid IS NULL THEN "no" ELSE "yes" END AS if_frequent_flyer
FROM
    (SELECT sid, upsight_data_value_usd AS v
    FROM monetization
    WHERE month = 201507)all_purchases
    LEFT OUTER JOIN
    (
    SELECT purchase_history.sid
    FROM(
        SELECT sid, count(*) as purchase_count
        FROM monetization
        WHERE month = 201507
        GROUP BY sid )purchase_history
    WHERE purchase_history.purchase_count >= 3)frequent_buyer
    ON all_purchases.sid=frequent_buyer.sid) user_bucketing
WHERE user_bucketing.if_frequent_flyer = "no"

This would yield the following return:

# | userid | revenue | if_frequent_flyer
1 | 100000000001 | 1500 | no
2 | 100000000001 | 20 | no
3 | 100000000050 | 1500 | no
4 | 100000000008 | 20 | no
5 | 100000000600 | 400 | no

NOT LIKE

In SQL, the NOT LIKE syntax allows you to retrieve rows that don't match a given pattern. For example, assume we are working with the schema for our Event tables:

SELECT sid, pub_data
FROM event
WHERE app_id = 123456 and month = 201601
LIMIT 100
sid | pub_data
13139 | email:user.last@gmail.com;first_name:user;last_name:last
14130 | email:user2.last@companyx.com;first_name:user2;last_name:last
13144 | email:user3.last@companyx.com;first_name:user3;last_name:last

For the next example, assume we have instrumented the data column with the appropriate JSON key value pairs so that we are sending the email address of the person who triggered each custom event. You want to look at all the custom events for your application that did not come from someone within your company. For clarity, let’s call your company CompanyX. In SQL, you would do the following:

SELECT * 
FROM demo_evt evt
WHERE month = 201210 AND get_json_object(evt.json_data, '$.email') NOT LIKE '%@companyx.com'

However, the NOT LIKE syntax is not supported in Hive. Instead, we can negate the LIKE expression with NOT in the WHERE clause. Here is an example of a query in Hive that achieves the same result:

SELECT * 
FROM demo_evt evt
WHERE month = 201210 AND NOT get_json_object(evt.json_data, '$.email') LIKE '%@companyx.com'
# | utc_timestamp | type | s | n | v | l | st1 | st2 | st3 | json_data | month
1 | 1350936966 | evt | 13139 | Page Viewed | 0 | 0 | Homepage | 1 | page2 | | 201210

HiveQL Documentation

Full HiveQL Documentation