Take care when creating aliases, or otherwise naming elements of a query, that your names are clear, consistent, and self-explanatory. Longer names can feel inconvenient while you are first building a query, but a naming scheme that both you and Upsight support staff can understand at a glance makes debugging much, much easier.
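As a sketch of the difference, compare one-letter aliases with descriptive ones (the join below is hypothetical, reusing tables that appear elsewhere in this guide):

```sql
-- Unclear: one-letter aliases force the reader back to the FROM clause
SELECT a.sid, b.ts FROM install a JOIN monetization b ON a.sid = b.sid

-- Clearer: each alias says what the table contributes
SELECT installs.sid, purchases.ts
FROM install installs
JOIN monetization purchases ON installs.sid = purchases.sid
```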
Always specify a month in the WHERE clause while you are creating or modifying a query. Where possible, restricting to the daily or hourly partitions provides additional optimization and faster query times. Your data in Upsight is stored in monthly, daily, and hourly partitions, so specifying the range of data you want avoids reading data that is not needed. A query that does not include month = YYYYMM, day = YYYYMMDD, or hour = YYYYMMDDHH in its WHERE clause will almost certainly process so much data that it runs slowly. If, after running the query with a month restriction, you would like a greater range of data, you can relax the restriction and query several months' worth of data with WHERE month IN (month1, month2). Only if you have a use case that requires data for all time (e.g. finding the number of days a user has used your application), and only after you have followed the previous steps, should you remove the month restriction.
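For example, a query restricted to two monthly partitions might look like the following (the app_id is the placeholder value used elsewhere in this guide):

```sql
-- Restricting to two monthly partitions avoids a full-table scan
SELECT COUNT(DISTINCT sid)
FROM event
WHERE app_id = 123456
  AND month IN (201601, 201602)
```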
The same advice applies to app_id and message type: pay close attention to filtering on them when creating your innermost queries.
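A minimal sketch of this pattern, filtering app_id and message type in the innermost subquery so the outer query never reads rows it does not need (the event type reuses an example from later in this guide):

```sql
SELECT buyers.sid, COUNT(*) AS buy_events
FROM (
  -- Filter partition, app_id, and type as early as possible
  SELECT sid
  FROM event
  WHERE month = 201601
    AND app_id = 123456
    AND type = "pub.economy.buy"
) buyers
GROUP BY buyers.sid
```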
You should be aware that since we are using all raw data as it is sent to our servers, this means for aggregation related metrics it is very important to dedupe your data. For example, with our mobile SDK, session messages are sent regularly when users use your application. This means if a user has used your application several times in the course of a day, they may have tens or hundreds of session messages. If you wanted to get the number of active users for a specific day, we would then have to do
COUNT(DISTINCT sid) to filter out all of these 'duplicate' messages, as simply doing COUNT(sid) would give us a count of messages, not a count of users.
Let's say that we want to find the number of users that made a purchase for a specific day. Without thinking about deduplication, one might write this query:
SELECT COUNT(sid) FROM monetization WHERE month = 201701 AND app_id = 123456
However, if we take a closer look at the data, we will see that a single user sometimes sends more than one monetization message.
SELECT sid, count(*) as totalmsgs FROM monetization WHERE month = 201701 AND app_id = 123456 GROUP BY sid ORDER BY totalmsgs DESC
Instead we should use DISTINCT to get the unique number of users that have sent monetization messages this month:
SELECT COUNT(DISTINCT sid) FROM monetization WHERE month = 201701 AND app_id = 123456
When writing complex queries it is easy to confuse the order of operations of arithmetic operators, or the precedence of AND and OR in WHERE clauses. Always use parentheses to specify precedence explicitly.
For example, suppose you would like to select two types of events from your application (app_id = 123456) for month = 201601. The following example is the wrong way to structure the WHERE clause. Because AND binds more tightly than OR, this clause retrieves records that satisfy either the first three conditions or the last condition alone. That is, you would retrieve records with type = "pub.economy.sell" that do not belong to the right month or application.
WHERE app_id = 123456 AND month = 201601 AND type = "pub.economy.buy" OR type ="pub.economy.sell"
The correct way to structure the WHERE clause is shown below. Including the parentheses to specify precedence explicitly ensures only the intended records are returned: the query checks whether each message is one of the two event types (buy or sell), and separately checks that the app_id and month requirements are met.
WHERE app_id =123456 AND month=201601 AND (type ="pub.economy.buy" OR type ="pub.economy.sell")
When making a JOIN in a query, be certain to add an ON criterion every time. Without it, your query will attempt to join every row of the first table with every row of the second table, which is almost never what is intended. Also be certain that you do not use only one ON criterion if you have multiple JOIN clauses: there should be an ON for every JOIN.
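For example, with two JOIN clauses there should be two ON criteria, one per JOIN (the tables and columns reuse names from this guide; the exact schema is an assumption):

```sql
SELECT ins.sid, mon.upsight_data_value_usd
FROM install ins
JOIN event evt ON ins.sid = evt.sid          -- ON for the first JOIN
JOIN monetization mon ON ins.sid = mon.sid   -- ON for the second JOIN
WHERE ins.month = 201601
  AND evt.month = 201601
  AND mon.month = 201601
```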
For many large applications, it can be time consuming or impractical to gather all of the data for a query. For these situations, we instead focus on uncovering trends or patterns on a subset of the data. Random sampling is a great way to achieve this goal. Assuming that the unique identifier for each of your users is uniformly distributed, we can take the last two digits of each user ID to get a good random sample of your user base.
Before sampling, let's find out the size of the population.
SELECT COUNT(DISTINCT(sid)) FROM event WHERE month >= 201506
When sampling for a percentage of the population, we should ensure that the sampling technique is not biased towards users with a specific property. In most cases, user IDs are randomly created, which is why we can rely on the last few digits of the user IDs as a way to randomly sample our population. Say we want to sample 1% of the population (2.4 million users); we can use the following query, which selects only IDs whose last two digits are "11":
SELECT COUNT(DISTINCT(sid)) FROM event WHERE month >= 201506 AND SUBSTR(sid, LENGTH(sid) - 1, 2) = 11;
In order to verify that our sampling technique is random, let's find out the distribution of all the user IDs based on their last 2 digits. Ideally it should be evenly distributed, so that when we sample we will get an accurate percentage and a random sample. If it is not evenly distributed, we can use the HASH() function on the user ID, sid, to get a more even distribution.
SELECT SUBSTR(distributiontest.sid, LENGTH(distributiontest.sid) - 1, 2), COUNT(*) FROM ( SELECT DISTINCT(sid) FROM event WHERE month >= 201506) distributiontest GROUP BY SUBSTR(distributiontest.sid, LENGTH(distributiontest.sid) - 1, 2)
| Last 2 digits of userid | Number of occurrences |
| --- | --- |
If you find that the user IDs are unevenly distributed, then you can use the HASH function on the user ID to get a more even distribution.
SELECT SUBSTR(distributiontest.s, LENGTH(distributiontest.s) - 1, 2), COUNT(*) FROM ( SELECT DISTINCT(HASH(sid)) AS s FROM event WHERE month >= 201506) distributiontest GROUP BY SUBSTR(distributiontest.s, LENGTH(distributiontest.s) - 1, 2)
| Last 2 digits of userid | Number of occurrences |
| --- | --- |
To sample x% of the total population, here are some example functions you can use.
| Sampling Percentage | Example Function |
| --- | --- |
| 0.1% | SUBSTR(sid, LENGTH(sid) - 2, 3) = 111 |
| 1% | SUBSTR(sid, LENGTH(sid) - 1, 2) = 11 |
| 25% | SUBSTR(sid, LENGTH(sid) - 1, 2) BETWEEN 11 AND 35 |
Sampling Use Cases
- Extracting user data from a percentage of the user population for post-processing in R or other statistical programs.
- Calculating the distribution of device types in an application: we can sample 25% of the total population by gathering messages from only 25% of our SIDs.
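The device-type case might be sketched as follows (the device column name is an assumption; substitute the field your messages actually carry):

```sql
-- 25% sample: keep only SIDs whose last two digits fall in 11..35
SELECT device, COUNT(DISTINCT sid) AS users
FROM event
WHERE month = 201601
  AND SUBSTR(sid, LENGTH(sid) - 1, 2) BETWEEN 11 AND 35
GROUP BY device
```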
During instrumentation, implement user_attribute fields that contribute to query efficiency. For example, a total_spend user attribute allows for spender-group segmentation without having to query for and sum every monetization message for a user. Your CSM can help you determine the best attributes to implement.
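A hedged sketch of spender-group segmentation with such an attribute (the total_spend column, the thresholds, and the table it lives in are all assumptions; your CSM can confirm how attributes are exposed in your schema):

```sql
SELECT CASE
         WHEN total_spend = 0 THEN 'non-spender'
         WHEN total_spend < 10 THEN 'low spender'
         ELSE 'high spender'
       END AS spender_group,
       COUNT(DISTINCT sid) AS users
FROM event
WHERE month = 201601
-- Hive cannot GROUP BY an alias, so the expression is repeated
GROUP BY CASE
           WHEN total_spend = 0 THEN 'non-spender'
           WHEN total_spend < 10 THEN 'low spender'
           ELSE 'high spender'
         END
```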
For complex queries, consider setting up a custom table that first aggregates your data into fewer rows, and use that table to query from instead of querying from the Upsight standard tables.
Note: For customers that make heavy use of custom tables in queries, Upsight also offers the option to set up "Intermediate Datasets": custom tables that contain aggregated data from a DataMine query and are scheduled to update automatically. Consult your CSM to determine if Intermediate Datasets are right for your use case.
GROUP BY and Aggregate Functions
Imagine we have the following table and query, and we would like to find out how much revenue each user has generated. (sid is the user ID and upsight_data_value_usd is the value of the transaction converted to US dollars, derived by multiplying upsight_data_currency by upsight_data_total_price. If you would like to use a different currency, simply multiply the USD value by your exchange rate.)
SELECT sid, SUM(upsight_data_value_usd), ts FROM monetization
The above query will result in a syntax error. We could fix this by adding a GROUP BY clause. We might try the following:
SELECT sid, SUM(upsight_data_value_usd), ts FROM monetization GROUP BY sid
Unfortunately, this still results in an error.
Let's explain why by looking at user sid = 1. The question becomes: what should happen to that user's two values of ts? Should we take the minimum? The maximum? A random one? The first one seen, or both? This ambiguity is the reason for the syntax error. To fix it, we need to specify what to do with the columns that are not being grouped. For this query, it makes the most sense to group by sid (to see the total revenue of each user) and take MIN(ts) to see the first monetization event of that user.
SELECT sid, SUM(upsight_data_value_usd), MIN(ts) as first_purchase FROM monetization GROUP BY sid
In Hive QL, we cannot use an alias to refer to a column in a GROUP BY clause. Instead, we must use the actual function.
SELECT FROM_UNIXTIME(ts, 'yyyy-MM-dd') as day, SUM(upsight_data_value_usd) as total_revenue FROM monetization WHERE month = 201512 GROUP BY day
Query Error Message
FAILED: Error in semantic analysis: Line 4:9 Invalid table alias or column reference day
The error in the above query is in the GROUP BY clause. If you want to group by a computed column, you must repeat the full expression rather than its alias.
SELECT FROM_UNIXTIME(ts, 'yyyy-MM-dd') as day, SUM(upsight_data_value_usd) as total_revenue FROM monetization WHERE month = 201512 GROUP BY FROM_UNIXTIME(ts, 'yyyy-MM-dd')
SELECT ins.sid FROM install AS ins
The above query is how you would alias in SQL. However, Hive QL does not use AS to alias a table:
SELECT ins.sid FROM install ins
Assume that we have a flight booking application: we want to identify the users who have signed up through our app, and determine the average amount of money spent by a typical registered user. However, our application has a few registered users that have purchased more than three flights in the past 12 months, also known as frequent fliers. We want to exclude frequent fliers from our query, as they aren’t representative of a typical user.
To achieve this metric, we would need to query the following tables:
If you were going to use the Upsight Analytics schema to try to write this in SQL instead of Hive, you might try the following:
SELECT SUM(upsight_data_value_usd) / COUNT(DISTINCT sid)
FROM monetization
WHERE month = 201507
  AND sid NOT IN (
    SELECT purchase_history.sid
    FROM (
      SELECT sid, COUNT(*) AS purchase_count
      FROM monetization
      WHERE month = 201507
      GROUP BY sid
    ) purchase_history
    WHERE purchase_history.purchase_count > 3
  )
However, the NOT IN clause is not supported in Hive. To get around this, we can use a LEFT OUTER JOIN, as shown in the following example:
SELECT all_purchases.sid AS userid,
       all_purchases.v AS revenue,
       CASE WHEN frequent_buyer.sid IS NULL THEN "no" ELSE "yes" END AS if_frequent_flyer
FROM (
  SELECT sid, upsight_data_value_usd AS v
  FROM monetization
  WHERE month = 201507
) all_purchases
LEFT OUTER JOIN (
  SELECT purchase_history.sid
  FROM (
    SELECT sid, COUNT(*) AS purchase_count
    FROM monetization
    WHERE month = 201507
    GROUP BY sid
  ) purchase_history
  WHERE purchase_history.purchase_count > 3
) frequent_buyer
ON all_purchases.sid = frequent_buyer.sid
This query would yield the following sample return:
Now to exclude the frequent fliers, we simply wrap the above query with another select statement:
SELECT user_bucketing.userid, user_bucketing.revenue
FROM (
  SELECT all_purchases.sid AS userid,
         all_purchases.v AS revenue,
         CASE WHEN frequent_buyer.sid IS NULL THEN "no" ELSE "yes" END AS if_frequent_flyer
  FROM (
    SELECT sid, upsight_data_value_usd AS v
    FROM monetization
    WHERE month = 201507
  ) all_purchases
  LEFT OUTER JOIN (
    SELECT purchase_history.sid
    FROM (
      SELECT sid, COUNT(*) AS purchase_count
      FROM monetization
      WHERE month = 201507
      GROUP BY sid
    ) purchase_history
    WHERE purchase_history.purchase_count > 3
  ) frequent_buyer
  ON all_purchases.sid = frequent_buyer.sid
) user_bucketing
WHERE user_bucketing.if_frequent_flyer = "no"
This would yield the following return:
In SQL, the NOT LIKE syntax allows you to retrieve rows that do not match a certain pattern. For example, assume we are working with the schema for our Event tables:
SELECT sid, pub_data FROM event WHERE app_id = 123456 and month = 201601 LIMIT 100
For the next example, assume we have instrumented the data column with the appropriate JSON key value pairs so that we are sending the email address of the person who triggered each custom event. You want to look at all the custom events for your application that did not come from someone within your company. For clarity, let’s call your company CompanyX. In SQL, you would do the following:
SELECT * FROM demo_evt evt WHERE month = 201210 AND get_json_object(evt.json_data, '$.email') NOT LIKE '%@companyx.com'
However, in Hive the NOT LIKE syntax is not supported. Instead, we can negate the LIKE condition with NOT in the WHERE clause. Here is a Hive query that achieves the same result:
SELECT * FROM demo_evt evt WHERE month = 201210 AND NOT get_json_object(evt.json_data, '$.email') LIKE '%@companyx.com'