When creating aliases or otherwise naming elements of a query, always make sure your names are clear, consistent, and self-explanatory. Longer names can feel inconvenient while you are first building a query, but debugging is far easier when the naming scheme can be readily understood by both you and Upsight support staff.
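For example, the two queries below are equivalent, but the second is far easier to read and debug because the aliases explain themselves (the app_id and month values are illustrative):
-- unclear: what do m and x stand for?
SELECT m.sid, SUM(m.upsight_data_value_usd) AS x
FROM monetization m
WHERE m.month = 201701 AND m.app_id = 123456
GROUP BY m.sid

-- clear: the aliases are self-explanatory
SELECT purchases.sid, SUM(purchases.upsight_data_value_usd) AS total_revenue_usd
FROM monetization purchases
WHERE purchases.month = 201701 AND purchases.app_id = 123456
GROUP BY purchases.sid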
Always specify a month in the WHERE clause while you are modifying or creating a query. Where possible, including the day or hour partitions will provide additional optimization and faster query times. Your data in Upsight is stored in monthly, daily, and hourly partitions, so specifying the range of data that you want to see avoids reading data that is not needed. The amount of data processed without a month = YYYYMM, day = YYYYMMDD, or hour = YYYYMMDDHH condition in your WHERE clause will almost certainly be large enough to slow down your query. If, after you have run the query with a month restriction, you would like a greater range of data, you can relax the restriction and query several months' worth of data using WHERE month IN (month1, month2). Only if you have a use case that requires data for all time (e.g. finding the number of days a user has used your application), and after you have followed the previous steps, should you remove the month restriction.
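For instance, a query that only needs a single day of data can restrict to the daily partition as well as the month (the values below are illustrative):
SELECT COUNT(DISTINCT sid)
FROM event
-- restrict to one monthly and one daily partition so only that data is read
WHERE month = 201701
AND day = 20170115
AND app_id = 123456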
The same is true when it comes to app_id and message type: be sure to pay close attention to these when creating your innermost queries, as shown below.
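For instance, in a nested query the partition, app_id, and message type filters belong in the innermost SELECT so that the outer query only reads the rows it needs (the values are illustrative):
SELECT COUNT(DISTINCT buyers.sid)
FROM (
-- restrict partitions, app_id and message type as early as possible
SELECT sid
FROM event
WHERE month = 201701
AND app_id = 123456
AND type = "pub.economy.buy"
) buyers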
Be aware that because we use all raw data as it is sent to our servers, it is very important to dedupe your data for aggregation-related metrics. For example, with our mobile SDK, session messages are sent regularly while users use your application. If a user has used your application several times in the course of a day, they may have tens or hundreds of session messages. To get the number of active users for a specific day, we would then have to use COUNT(DISTINCT sid) to filter out all of these 'duplicate' messages, since simply doing COUNT(sid) would give us a count of messages and not a count of users.
Let's say that we want to find the number of users that made a purchase for a specific day. Without thinking about deduplication, one might write this query:
SELECT COUNT(sid)
FROM monetization
WHERE month = 201701
AND app_id = 123456
However, if we take a closer look at the data, we will see that a single user may sometimes send more than one monetization message.
SELECT sid, count(*) as totalmsgs
FROM monetization
WHERE month = 201701
AND app_id = 123456
GROUP BY sid
ORDER BY totalmsgs DESC
sid | totalmsgs |
---|---|
48976456 | 10 |
5679856754 | 7 |
6586786 | 4 |
33547234 | 4 |
Instead, we should use DISTINCT to get the number of unique users that have sent monetization messages this month:
SELECT COUNT(DISTINCT sid)
FROM monetization
WHERE month = 201701
AND app_id = 123456
When writing complex queries, it is easy to confuse the order of operations of arithmetic operators, or the precedence of AND/OR in JOIN or WHERE clauses. Always use parentheses to make the intended precedence explicit.
For example, suppose you would like to select two types of events from your application, with app_id = 123456 and month = 201601. The following is the wrong way to structure the WHERE clause: because AND binds more tightly than OR, the query retrieves either rows that satisfy the first three conditions or rows that satisfy only the last one. In other words, you would retrieve records with type = "pub.economy.sell" that do not belong to the right month or application.
WHERE app_id = 123456 AND month = 201601 AND type = "pub.economy.buy" OR type = "pub.economy.sell"
The correct way to structure the WHERE clause is shown below. Including the parentheses to explicitly specify precedence ensures that a row is returned only if it matches one of the two event types (buy or sell) and also satisfies the app_id and month requirements.
WHERE app_id = 123456 AND month = 201601 AND (type = "pub.economy.buy" OR type = "pub.economy.sell")
When making a JOIN in a query, be certain to add an ON criterion every time. Without the ON criteria, your query will attempt to join every row of the first table with every row of the second table, which is almost never what is intended. Also be certain that you do not use only one ON criterion when you have multiple JOIN clauses: there should be an ON for every JOIN.
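For example, here is a structural sketch using the install, monetization, and event tables from elsewhere in this guide (the month and app_id values are illustrative); note that each JOIN has its own ON clause:
SELECT ins.sid, m.upsight_data_value_usd, e.type
FROM install ins
JOIN monetization m
ON ins.sid = m.sid
JOIN event e
ON ins.sid = e.sid
WHERE ins.month = 201701
AND m.month = 201701
AND e.month = 201701
AND ins.app_id = 123456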
For many large applications, it can be time consuming or impractical to gather all of the data for a query. For these situations, we instead focus on uncovering trends or patterns on a subset of the data. Random sampling is a great way to achieve this goal. Assuming that the unique identifier for each of your users is uniformly distributed, we can take the last two digits of each user ID to get a good random sample of your user base.
Before sampling, let's find out the size of the population.
SELECT COUNT(DISTINCT(sid))
FROM event
WHERE month >= 201506
Result: 2,441,856
When sampling for a percentage of the population, we should ensure that the sampling technique is not biased towards users with a specific property. In most cases, user IDs are randomly created, which is why we can rely on the last few digits of the user IDs as a way to randomly sample our population. Say we want to sample 1% of the population of 2.4 million users; we can use the following query, which selects only the IDs whose last two digits are "11":
SELECT COUNT(DISTINCT(sid))
FROM event
WHERE month >= 201506
AND SUBSTR(sid, LENGTH(sid) - 1, 2) = 11;
Result: 24,414
In order to verify that our sampling technique is random, let's find out the distribution of all user IDs based on their last two digits. Ideally they should be evenly distributed, so that sampling gives us an accurate percentage and a truly random sample. If they are not evenly distributed, we can use the HASH() function on the user ID, sid, to get a more even distribution.
SELECT SUBSTR(distributiontest.sid, LENGTH(distributiontest.sid) - 1, 2), COUNT(*)
FROM (
SELECT DISTINCT(sid)
FROM event
WHERE month >= 201506) distributiontest
GROUP BY SUBSTR(distributiontest.sid, LENGTH(distributiontest.sid) - 1, 2)
Last 2 digits of userid | Number of occurrences |
---|---|
00 | 24598 |
01 | 24546 |
02 | 24622 |
03 | 24622 |
04 | 24313 |
05 | 24283 |
06 | 24578 |
07 | 24393 |
08 | 24611 |
09 | 24475 |
10 | 24594 |
... | ... |
41 | 24342 |
... | ... |
99 | 24420 |
If you find that the user IDs are unevenly distributed, you can use the HASH function on the user ID to get a more even distribution:
SELECT SUBSTR(distributiontest.s, LENGTH(distributiontest.s) - 1, 2), COUNT(*)
FROM (
SELECT DISTINCT(HASH(sid)) AS s
FROM event
WHERE month >= 201506) distributiontest
GROUP BY SUBSTR(distributiontest.s, LENGTH(distributiontest.s) - 1, 2)
Last 2 digits of userid | Number of occurrences |
---|---|
00 | 24664 |
01 | 24255 |
02 | 24482 |
03 | 24369 |
04 | 24425 |
05 | 24737 |
06 | 24333 |
07 | 24517 |
08 | 24369 |
09 | 24305 |
10 | 24476 |
... | ... |
41 | 24632 |
... | ... |
99 | 24352 |
References
To sample x% of the total population, here are some example functions you can use.
Sampling Percentage | Example Function |
---|---|
0.1% | SUBSTR(sid, LENGTH(sid) - 2, 3) = 111 |
1% | SUBSTR(sid, LENGTH(sid) - 1, 2) = 11 |
25% | SUBSTR(sid, LENGTH(sid) - 1, 2) between 11 and 35 |
Sampling Use Cases
Extracting user data from a percentage of the user population for post-processing in R or other statistical programs.
Calculating the distribution of device types in an application. We can sample 25% of the total population by gathering messages from only 25% of our SIDs, as sketched below.
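For instance, a sketch of the device-type use case using the 25% sampling function from the table above. Here device_type is a hypothetical placeholder for whichever field your application actually records the device in, and the month and app_id values are illustrative:
-- device_type is a hypothetical column used only for illustration
SELECT device_type, COUNT(DISTINCT sid) AS users
FROM event
WHERE month = 201506
AND app_id = 123456
-- 25% sample: last two digits of sid between 11 and 35
AND SUBSTR(sid, LENGTH(sid) - 1, 2) BETWEEN 11 AND 35
GROUP BY device_type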
Our basic, intermediate, and advanced query sections go over this in more detail. Typically, a UNION will be more efficient than a JOIN.
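As a minimal sketch of that pattern (in Hive the UNION ALL sits inside a subquery), here is one way to count users that appear in either the monetization or the event table without joining the two (the values are illustrative):
SELECT COUNT(DISTINCT combined.sid)
FROM (
SELECT sid
FROM monetization
WHERE month = 201701 AND app_id = 123456
UNION ALL
SELECT sid
FROM event
WHERE month = 201701 AND app_id = 123456
) combined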
During instrumentation, implement pub_data and user_attribute fields that contribute to query efficiency. For example, implementing a total_spend user attribute allows for spender-group segmentation without having to query for and sum every monetization message for a user. Your CSM can help you determine the best attributes to implement.
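As a rough illustration of that kind of segmentation, the sketch below assumes a user_attribute table exposing a numeric total_spend column per sid; the table name, column names, thresholds, and filter values are all hypothetical, so check your actual attribute schema with your CSM:
-- user_attribute and total_spend are hypothetical names, for illustration only
SELECT CASE
WHEN ua.total_spend >= 100 THEN "high spender"
WHEN ua.total_spend > 0 THEN "spender"
ELSE "non-spender"
END AS spender_group,
COUNT(DISTINCT ua.sid) AS users
FROM user_attribute ua
WHERE ua.month = 201701 AND ua.app_id = 123456
-- Hive cannot GROUP BY an alias, so repeat the expression
GROUP BY CASE
WHEN ua.total_spend >= 100 THEN "high spender"
WHEN ua.total_spend > 0 THEN "spender"
ELSE "non-spender"
END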
For complex queries, consider setting up a custom table that first aggregates your data into fewer rows, and use that table to query from instead of querying from the Upsight standard tables.
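For example, a sketch of such a pre-aggregated table, assuming your DataMine environment allows CREATE TABLE AS SELECT (the table name daily_user_revenue is illustrative):
-- pre-aggregate revenue per user per day into a smaller custom table
CREATE TABLE daily_user_revenue AS
SELECT sid,
FROM_UNIXTIME(ts, 'yyyy-MM-dd') AS purchase_day,
SUM(upsight_data_value_usd) AS daily_revenue
FROM monetization
WHERE month = 201701 AND app_id = 123456
GROUP BY sid, FROM_UNIXTIME(ts, 'yyyy-MM-dd')
Subsequent queries can then read daily_user_revenue instead of re-scanning every monetization message.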
Note: For customers that make heavy use of custom tables in queries, Upsight also offers the option to set up “Intermediate Datasets”, which are custom tables that contain aggregated data from a DataMine query and are scheduled to update automatically. Consult your CSM to determine if Intermediate Datasets are right for your use case.
GROUP BY and Aggregate Functions
Imagine we have the following table and query, and we would like to find out how much revenue each user has generated. (sid is the user ID and upsight_data_value_usd is the value of the transaction converted to US dollars, derived by multiplying upsight_data_currency x upsight_data_total_price. If you would like to use a different currency, simply multiply the USD value by your exchange rate.)
sid | upsight_data_value_usd | ts |
---|---|---|
1 | 1000 | 1234 |
1 | 2000 | 2345 |
2 | 1000 | 1234 |
2 | 1000 | 4567 |
SELECT sid, SUM(upsight_data_value_usd), ts
FROM monetization
The above query will result in a syntax error. We could fix this by adding a GROUP BY clause. We might try the following:
SELECT sid, SUM(upsight_data_value_usd), ts
FROM monetization
GROUP BY sid
Unfortunately, this still results in an error.
Let's explain why by looking at user sid = 1. What should we do with the two ts values: take the minimum, the maximum, a random one, the first one the query sees, or both? This ambiguity is the reason for the error. To fix it, we need to specify what to do with the columns that are not being aggregated. For this query, it makes the most sense to group by sid (to see the total revenue of each user) and then take MIN(ts) to see the first monetization event of that user.
SELECT sid, SUM(upsight_data_value_usd), MIN(ts) as first_purchase
FROM monetization
GROUP BY sid
sid | upsight_data_value_usd | first_purchase |
---|---|---|
1 | 3000 | 1234 |
2 | 2000 | 1234 |
Another common pitfall: in Hive QL, we cannot use an alias to refer to a column in a GROUP BY clause; we must repeat the actual expression. The following query, which groups by the alias day, fails:
SELECT FROM_UNIXTIME(ts, 'yyyy-MM-dd') as day, SUM(upsight_data_value_usd) as total_revenue
FROM monetization
WHERE month = 201512
GROUP BY day
Query Error Message
FAILED: Error in semantic analysis: Line 4:9 Invalid table alias or column reference day
The error in the above query is in the GROUP BY clause. If you want to group by a derived column, you have to repeat the original expression rather than use its alias:
SELECT FROM_UNIXTIME(ts, 'yyyy-MM-dd') as day, SUM(upsight_data_value_usd) as total_revenue
FROM monetization
WHERE month = 201512
GROUP BY FROM_UNIXTIME(ts, 'yyyy-MM-dd')
SELECT ins.sid
FROM install AS ins
The query above is how you would alias a table in SQL. Hive QL, however, does not use AS when aliasing a table:
SELECT ins.sid
FROM install ins
Assume that we have a flight booking application: we want to identify the users who have signed up through our app, and determine the average amount of money spent by a typical registered user. However, our application has a few registered users that have purchased more than three flights in the past 12 months, also known as frequent fliers. We want to exclude frequent fliers from our query, as they aren’t representative of a typical user.
To achieve this metric, we need to query the monetization table.
If you were going to use the Upsight Analytics schema to try to write this in SQL instead of Hive, you might try the following:
SELECT SUM(upsight_data_value_usd) / COUNT(DISTINCT(sid))
FROM monetization
WHERE month = 201507 AND sid NOT IN
(
SELECT purchase_history.sid
FROM (
SELECT sid, COUNT(*) AS purchase_count
FROM monetization
WHERE month = 201507
GROUP BY sid
) purchase_history
WHERE purchase_history.purchase_count > 3
)
However, the NOT IN clause with a subquery is not supported in HIVE. To get around this, we can use a LEFT OUTER JOIN as shown in the following example:
SELECT all_purchases.sid AS userid, all_purchases.v AS revenue,
CASE WHEN frequent_buyer.sid IS NULL THEN "no" ELSE "yes" END AS if_frequent_flyer
FROM
(
SELECT sid, upsight_data_value_usd AS v
FROM monetization
WHERE month = 201507
) all_purchases
LEFT OUTER JOIN
(
SELECT purchase_history.sid
FROM (
SELECT sid, COUNT(*) AS purchase_count
FROM monetization
WHERE month = 201507
GROUP BY sid
) purchase_history
WHERE purchase_history.purchase_count >= 3
) frequent_buyer
ON all_purchases.sid = frequent_buyer.sid
This query would yield the following sample return:
# | userid | revenue | if_frequent_flyer |
---|---|---|---|
1 | 100000000001 | 1500 | no |
2 | 100000000001 | 20 | no |
3 | 100000000022 | 1500 | yes |
4 | 100000000022 | 1500 | yes |
5 | 100000000022 | 1500 | yes |
6 | 100000000022 | 1500 | yes |
10 | 100000000050 | 1500 | no |
11 | 100000000008 | 20 | no |
12 | 100000000600 | 400 | no |
13 | 133004555559 | 1500 | yes |
14 | 133004555559 | 1500 | yes |
15 | 133004555559 | 1500 | yes |
Now to exclude the frequent fliers, we simply wrap the above query with another select statement:
SELECT user_bucketing.userid, user_bucketing.revenue
FROM
(
SELECT all_purchases.sid AS userid, all_purchases.v AS revenue,
CASE WHEN frequent_buyer.sid IS NULL THEN "no" ELSE "yes" END AS if_frequent_flyer
FROM
(
SELECT sid, upsight_data_value_usd AS v
FROM monetization
WHERE month = 201507
) all_purchases
LEFT OUTER JOIN
(
SELECT purchase_history.sid
FROM (
SELECT sid, COUNT(*) AS purchase_count
FROM monetization
WHERE month = 201507
GROUP BY sid
) purchase_history
WHERE purchase_history.purchase_count >= 3
) frequent_buyer
ON all_purchases.sid = frequent_buyer.sid
) user_bucketing
WHERE user_bucketing.if_frequent_flyer = "no"
This would yield the following return:
# | userid | revenue |
---|---|---|
1 | 100000000001 | 1500 |
2 | 100000000001 | 20 |
3 | 100000000050 | 1500 |
4 | 100000000008 | 20 |
5 | 100000000600 | 400 |
In SQL, the NOT LIKE syntax allows you to retrieve rows that do not match a certain pattern. For example, consider the following sample from our Event table:
SELECT sid, pub_data
FROM event
WHERE app_id = 123456 and month = 201601
LIMIT 100
sid | pub_data |
---|---|
13139 | email:user.last@gmail.com;first_name:user;last_name:last |
14130 | email:user2.last@companyx.com;first_name:user2;last_name:last |
13144 | email:user3.last@companyx.com;first_name:user3,last_name:last |
For the next example, assume we have instrumented the data column with the appropriate JSON key value pairs so that we are sending the email address of the person who triggered each custom event. You want to look at all the custom events for your application that did not come from someone within your company. For clarity, let’s call your company CompanyX. In SQL, you would do the following:
SELECT *
FROM demo_evt evt
WHERE month = 201210 AND get_json_object(evt.json_data, '$.email') NOT LIKE '%@companyx.com'
However, in HIVE the NOT LIKE syntax is not supported. Instead, we can negate the LIKE condition with NOT in the WHERE clause. Here is an example of a query in HIVE that achieves the same result:
SELECT *
FROM demo_evt evt
WHERE month = 201210 AND NOT get_json_object(evt.json_data, '$.email') LIKE '%@companyx.com'
# | utc_timestamp | type | s | n | v | l | st1 | st2 | st3 | json_data | month |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1350936966 | evt | 13139 | Page Viewed | 0 | 0 | Home | page1 | page2 | | 201210 |