Best Practices

General Guidelines


Good Naming Convention

Whenever you create aliases or otherwise name elements of a query, take care that the names are clear, consistent, and self-explanatory. Longer names can seem inconvenient while you are first building a query, but debugging is much easier when the naming scheme can be understood at a glance by both you and Upsight support staff.

Restrict Data Range

Always specify a month in the WHERE clause while you are changing a query or creating a new one. Without a WHERE month = YYYYMM restriction, the amount of data processed will almost certainly be large enough to slow your query down. Once the query runs correctly with the month restriction, you can widen the range by querying several months of data. Only remove the month restriction entirely if you have a genuine all-time use case (e.g. finding a user's first install time), and only after following the steps above.

Deduplicate

Because we store all raw data exactly as it is sent to our servers, deduplication is very important for aggregation metrics. For example, our mobile SDK sends a PGR message every 30 seconds as a heartbeat, so 10 minutes of activity produces roughly 20 PGR messages from the same user. To get the number of active users for a specific day, we therefore have to use COUNT(DISTINCT s) to collapse these 'duplicate' messages; COUNT(s) would give a count of messages, not a count of users.

Let's say that we want to find the number of users that made a purchase for a specific day. Without thinking about deduplication, one might write this query:

SELECT COUNT(s) 
FROM demoapp_mtu
WHERE month = $(months_ago(0))

However, if we take a closer look at the data, we will see that a single user can send more than one MTU message.

SELECT s, count(*) as totalmsgs 
FROM demoapp_mtu
WHERE month = $(months_ago(0))
GROUP BY s
ORDER BY totalmsgs DESC
s | totalmsgs
48976456 | 10
5679856754 | 7
6586786 | 4
33547234 | 4

Instead, we should use DISTINCT to get the number of unique users that have sent MTU messages:

SELECT COUNT(DISTINCT s) 
FROM demoapp_mtu
WHERE month = $(months_ago(0))
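The difference between COUNT(s) and COUNT(DISTINCT s) can be sketched outside Hive as well; this is an illustrative Python analogue with a made-up message list, not part of Data Mine:

```python
# Each element is the user ID (s) attached to one raw message.
# The same user appears once per heartbeat, so plain counting
# inflates the result.
messages = ["u1", "u1", "u1", "u2", "u2", "u3"]

count_s = len(messages)                # COUNT(s): counts messages
count_distinct_s = len(set(messages))  # COUNT(DISTINCT s): counts users

print(count_s)           # 6 messages
print(count_distinct_s)  # 3 unique users
```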

Operator Precedence

When writing complex queries it is easy to confuse the order of operations of arithmetic operators, or of AND/OR precedence in JOIN or WHERE statements. Be sure to always use parentheses to explicitly specify precedence.
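The same pitfall exists in most languages; here is a small Python illustration (the condition itself is made up) of why explicit parentheses matter:

```python
# AND binds tighter than OR, in Hive as in Python.
a, b, c = True, True, False

implicit = a or b and c    # parsed as: a or (b and c) -> True
explicit = (a or b) and c  # what a reader might have intended -> False

print(implicit, explicit)
```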

Missing ON Criteria

When writing a JOIN, be certain to add an ON clause every time. Without it, the query attempts to join every row of the first table with every row of the second table, which is almost never what is intended. Also make sure you do not supply only one ON clause when you have multiple JOINs: there should be an ON for every JOIN.
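The blow-up from a missing ON clause is easy to see with a toy Python example (the tables are made up); a join without a key degenerates into a Cartesian product:

```python
from itertools import product

users = [("u1",), ("u2",), ("u3",)]
purchases = [("u1", 100), ("u2", 250)]

# JOIN without ON: every row pairs with every row.
no_on = list(product(users, purchases))  # 3 * 2 = 6 rows

# JOIN ... ON users.s = purchases.s: only matching keys survive.
with_on = [(u, p) for u in users for p in purchases if u[0] == p[0]]

print(len(no_on), len(with_on))  # 6 2
```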

Querying Large Datasets - Sampling

For many large applications, it can be time consuming or impractical to gather all of the data for a query. For these situations, we instead focus on uncovering trends or patterns on a subset of the data. Random sampling is a great way to achieve this goal. Assuming that the unique identifier for each of your users is uniformly distributed, we can take the last two digits of each user ID to get a good random sample of your user base.
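As a sketch, the last-two-digits trick looks like this in Python (the user IDs here are synthetic); with IDs uniformly distributed, matching one fixed two-digit suffix keeps roughly 1% of them:

```python
# Synthetic user IDs 0..9999, zero-padded so every ID has at least 2 digits.
user_ids = [str(i).zfill(4) for i in range(10000)]

# Keep only IDs whose last two digits are "11" -- the 1% sample.
sample = [s for s in user_ids if s[-2:] == "11"]

print(len(sample), len(user_ids))  # 100 of 10000, i.e. 1%
```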

  1. Before sampling, let's find out the size of the population.

    SELECT COUNT(DISTINCT(s))
    FROM demoapp_evt
    WHERE month >= 201212
    

    Result: 2,441,856

  2. When sampling a percentage of the population, we should ensure that the technique is not biased towards users with a specific property. In most cases user IDs are generated randomly, which is why we can rely on segmenting the last few digits of the user IDs as a way to randomly sample our population. To sample 1% of the 2.4-million-user population, we can use the following query, which selects only IDs whose last two digits are "11":

    SELECT COUNT(DISTINCT(s))
    FROM demoapp_evt
    WHERE month >= 201212
    AND SUBSTR(s, LENGTH(s) -1, 2) = 11;
    

    Result: 24,414

  3. In order to verify that our sampling technique is random, let's look at the distribution of user IDs by their last two digits. Ideally it is evenly distributed, so that sampling yields an accurate percentage and a truly random subset. If it is not evenly distributed, we can apply the HASH() function to the user ID, s, to get a more even distribution.

    SELECT SUBSTR(distributiontest.s, LENGTH(distributiontest.s) - 1, 2), COUNT (*)
    FROM (
            SELECT DISTINCT(s)
            FROM demoapp_evt
            WHERE month >= 201212) distributiontest
    GROUP BY SUBSTR(distributiontest.s, LENGTH(distributiontest.s) - 1, 2)
    
    Last 2 digits of user ID | Number of occurrences
    00 | 24598
    01 | 24546
    02 | 24622
    03 | 24622
    04 | 24313
    05 | 24283
    06 | 24578
    07 | 24393
    08 | 24611
    09 | 24475
    10 | 24594
    ... | ...
    41 | 24342
    ... | ...
    99 | 24420

    If you find that the user IDs are unevenly distributed, then you can use the HASH function on the user ID to get a more even distribution.

    SELECT SUBSTR(distributiontest.s, LENGTH(distributiontest.s) - 1, 2), COUNT(*)
    FROM (
            SELECT DISTINCT(HASH(s)) AS s
            FROM demoapp_evt
            WHERE month >= 201212) distributiontest
    GROUP BY SUBSTR(distributiontest.s, LENGTH(distributiontest.s) - 1, 2)
    
    Last 2 digits of user ID | Number of occurrences
    00 | 24664
    01 | 24255
    02 | 24482
    03 | 24369
    04 | 24425
    05 | 24737
    06 | 24333
    07 | 24517
    08 | 24369
    09 | 24305
    10 | 24476
    ... | ...
    41 | 24632
    ... | ...
    99 | 24352
  4. References

    To sample x% of the total population, here are some example functions you can use.

    Sampling Percentage | Example Function
    0.1% | SUBSTR(s, LENGTH(s) - 2, 3) = 111
    1% | SUBSTR(s, LENGTH(s) - 1, 2) = 11
    25% | SUBSTR(s, LENGTH(s) - 1, 2) BETWEEN 11 AND 35
  5. Sampling Use Cases

    • For extracting user data from a percentage of the user population for post data processing in R or other statistical programs.
    • Queries that contain inner queries which in turn require multiple map-reduce steps. For example, if we wanted to know the total time spent across all users, we could sample 25% of the total population and multiply the result by 4. For this query we first gather the timestamps for each user from multiple tables in an inner query, then aggregate the total time spent per user in the outer query using the UDF kt_session. Sampling in the inner query ensures that we select only a percentage of the data from the beginning, which significantly speeds up the entire query.

Common Mistakes


Mistake #1: GROUP BY and Aggregate Functions

Imagine we have the following table and query, and we would like to find out how much revenue each user has generated (s is the user ID and v is the value of the transaction):

s | v | utc_timestamp
1 | 1000 | 1234
1 | 2000 | 2345
2 | 1000 | 1234
2 | 1000 | 4567
SELECT s, SUM(v), utc_timestamp 
FROM demo_mtu

The above query will result in an error. We could fix this by adding a GROUP BY clause. We might try the following:

SELECT s, SUM(v), utc_timestamp 
FROM demo_mtu
GROUP BY s

Unfortunately, this still results in an error.

To see why, look at user s = 1. That group has two utc_timestamp values: should Hive take the minimum? The maximum? A random one? The first one it sees, or both? This ambiguity is why the query fails. To fix it, we must specify what to do with every column that is not being aggregated or grouped. For this query it makes the most sense to group by s (to see the total revenue of each user) and take MIN(utc_timestamp) to get each user's first monetization event.

SELECT s, SUM(v), MIN(utc_timestamp) as first_purchase 
FROM demo_mtu 
GROUP BY s

Resulting Table

s | SUM(v) | first_purchase
1 | 3000 | 1234
2 | 2000 | 1234
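The GROUP BY semantics can be made concrete with a Python analogue of the query above (data copied from the table; this is an illustrative sketch, not Hive itself):

```python
# (s, v, utc_timestamp) rows from the demo_mtu table above.
rows = [(1, 1000, 1234), (1, 2000, 2345), (2, 1000, 1234), (2, 1000, 4567)]

result = {}
for s, v, ts in rows:
    total, first = result.get(s, (0, float("inf")))
    # SUM(v) and MIN(utc_timestamp) per group key s.
    result[s] = (total + v, min(first, ts))

print(result)  # {1: (3000, 1234), 2: (2000, 1234)}
```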

Mistake #2: Cannot Group by Alias

Incorrect Query

In Hive QL, we cannot use an alias to refer to a column in a GROUP BY clause. Instead, we must use the actual function.

SELECT FROM_UNIXTIME(utc_timestamp, 'yyyy-MM-dd') as day, SUM(v) as total_revenue
FROM demo_mtu mtu
WHERE month = 201302
GROUP BY day

Query Error Message

FAILED: Error in semantic analysis: Line 4:9 Invalid table alias or column reference day

The error in the above query is in the GROUP BY clause. If you want to group by a column, you have to use the column name and not the alias.

Correct Query

SELECT FROM_UNIXTIME(utc_timestamp, 'yyyy-MM-dd') as day, SUM(v) as total_revenue
FROM demo_mtu mtu
WHERE month = 201302
GROUP BY FROM_UNIXTIME(utc_timestamp, 'yyyy-MM-dd')

Differences Between SQL/Hive QL


Aliasing Syntax

SQL Syntax

SELECT apa.s 
FROM demo_apa AS apa

The above query is how you would alias in SQL. However, Hive QL does not use AS to alias a table:

HQL Syntax

SELECT apa.s 
FROM demo_apa apa

LEFT OUTER JOIN instead of EXCEPT or NOT IN

Assume that we have a flight booking application: we want to identify the users who have signed up through our app, and determine the average amount of money spent by a typical registered user. However, our application has a few registered users that have purchased more than three flights in the past 12 months, also known as frequent fliers. We want to exclude frequent fliers from our query, as they aren’t representative of a typical user.

To achieve this metric, you would instrument the following:

  • Send an APA message when a user signs up
  • Then send a MTU call when a registered user makes a purchase

If you were going to use the Upsight Analytics schema to try to write this in SQL instead of Hive, you might try the following:

SELECT SUM(v)/COUNT(DISTINCT (s))
FROM demo_mtu
WHERE month = 201107 AND s NOT IN
(
SELECT purchase_history.s
FROM(
    SELECT s, count(*) as purchase_count
    FROM demo_mtu
    WHERE month = 201107
    GROUP BY s )purchase_history
WHERE purchase_history.purchase_count > 3)

However, the NOT IN clause is not supported in Hive. To work around this, we can use a LEFT OUTER JOIN, as in the following example:

SELECT all_purchases.s AS userid,all_purchases.v AS revenue,CASE WHEN frequent_buyer.s is NULL THEN "no" ELSE "yes" END AS if_frequent_flyer
FROM
    (SELECT s,v
    FROM demo_mtu
    WHERE month = 201107)all_purchases
    LEFT OUTER JOIN
    (
    SELECT purchase_history.s
    FROM(
        SELECT s, count(*) as purchase_count
        FROM demo_mtu
        WHERE month = 201107
        GROUP BY s )purchase_history
    WHERE purchase_history.purchase_count >= 3)frequent_buyer
    ON all_purchases.s=frequent_buyer.s

This query would yield the following sample return:

# | userid | revenue | if_frequent_flyer
1 | 100000000001 | 1500 | no
2 | 100000000001 | 20 | no
3 | 100000000002 | 1500 | yes
4 | 100000000002 | 1500 | yes
5 | 100000000002 | 1500 | yes
6 | 100000000002 | 1500 | yes
10 | 100000000050 | 1500 | no
11 | 100000000008 | 20 | no
12 | 100000000600 | 400 | no
13 | 133004555559 | 1500 | yes
14 | 133004555559 | 1500 | yes
15 | 133004555559 | 1500 | yes

Now to exclude the frequent fliers, we simply wrap the above query with another select statement:

SELECT user_bucketing.userid,user_bucketing.revenue
FROM
    (SELECT all_purchases.s AS userid,all_purchases.v AS revenue,CASE WHEN frequent_buyer.s is NULL THEN "no" ELSE "yes" END AS if_frequent_flyer
    FROM
        (SELECT s,v
        FROM demo_mtu
        WHERE month = 201107)all_purchases
        LEFT OUTER JOIN
        (
        SELECT purchase_history.s
        FROM(
            SELECT s, count(*) as purchase_count
            FROM demo_mtu
            WHERE month = 201107
            GROUP BY s )purchase_history
        WHERE purchase_history.purchase_count >= 3)frequent_buyer
        ON all_purchases.s=frequent_buyer.s) user_bucketing
WHERE user_bucketing.if_frequent_flyer = "no"

This would yield the following return:

# | userid | revenue | if_frequent_flyer
1 | 100000000001 | 1500 | no
2 | 100000000001 | 20 | no
3 | 100000000050 | 1500 | no
4 | 100000000008 | 20 | no
5 | 100000000600 | 400 | no
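The anti-join logic above can be sketched in Python: count purchases per user, then keep only the spend of users below the frequent-flier threshold (the purchase data here is made up):

```python
from collections import Counter

# (s, v) purchase rows, as in demo_mtu.
purchases = [("a", 500), ("a", 20), ("b", 10), ("b", 10), ("b", 10), ("b", 5)]

counts = Counter(s for s, _ in purchases)

# frequent_buyer: users with 3 or more purchases, to be excluded.
frequent = {s for s, n in counts.items() if n >= 3}

kept = [(s, v) for s, v in purchases if s not in frequent]
print(kept)  # only user "a" survives; "b" made 4 purchases
```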

NOT LIKE

In SQL, the NOT LIKE syntax lets you retrieve rows that don't match a certain pattern. For the examples below, assume the standard schema for our EVT tables:

SELECT * 
FROM demo_evt
WHERE month = 201210
# | utc_timestamp | type | s | n | v | l | st1 | st2 | st3 | json_data | month
1 | 1350936966 | evt | 13139 | action | 0 | 0 | Homepage1 | page2 | | | 201210
2 | 1350936972 | evt | 14130 | action | 0 | 0 | reporting | dashboard | default | | 201210
3 | 1350936980 | evt | 13144 | action | 0 | 0 | Homepage1 | page2 | | | 201210

For the next example, assume we have instrumented the data column with the appropriate JSON key value pairs so that we are sending the email address of the person who triggered each custom event. You want to look at all the custom events for your application that did not come from someone within your company. For clarity, let’s call your company CompanyX. In SQL, you would do the following:

SELECT * 
FROM demo_evt evt
WHERE month = 201210 AND get_json_object(evt.json_data, '$.email') NOT LIKE '%@companyx.com'

However, Hive does not support the NOT LIKE syntax. Instead, we can negate the LIKE condition with NOT in the WHERE clause. Here is a Hive query that achieves the same result:

SELECT * 
FROM demo_evt evt
WHERE month = 201210 AND NOT get_json_object(evt.json_data, '$.email') LIKE '%@companyx.com'
# | utc_timestamp | type | s | n | v | l | st1 | st2 | st3 | json_data | month
1 | 1350936966 | evt | 13139 | Page Viewed | 0 | 0 | Homepage1 | page2 | | | 201210

HiveQL Documentation

Full HiveQL Documentation

UDFs and Macros

Overview


Function | Return Type | Description

kt_date(bigint unixtime, [string pattern]) | string | kt_date has been deprecated. For an example of using the UDF from_unixtime() instead, please refer to this page.
kt_session(string, array, [int timeout], ['include-start-end']) | tuple(string s, int label, int length, ['include-start-end'])

To calculate the number of sessions and the duration of each session.

  • The input timeout is measured in minutes and defaults to 30.
  • Output length is measured in seconds.
  • The parameter 'include-start-end' is optional; it will give two additional columns displaying the start and end time of each session.

Sessions with only a start time and no end time (only one message was sent) count as one session with a session length of 0.

kt_ipcountry(string ip) | string | Returns a two-letter ISO country code for the given IP address.

kt_collectall(column foo) | array | Transposes all of the values of the specified column into an array. The order of the rows is maintained.

$(months_ago(INT)) | N/A | Dynamically selects the month partition in your WHERE clause. For more information, see the example below.


Use kt_session to Return the Number of Sessions for Each User


The kt_session call will calculate the length of each session for each user. This example query is composed of three parts. The query in its entirety is shown as follows:

SELECT kt_session(calc_session.s, calc_session.evt_lst,2) AS (s, session_number, session_length)
FROM (
   SELECT session_set.s, collect_set(session_set.utc_timestamp) evt_lst
   FROM (
      SELECT total_list.s, total_list.utc_timestamp 
      FROM (
         SELECT pgr.s, pgr.utc_timestamp
         FROM XYZ_pgr pgr
         WHERE pgr.month = 201205 AND pgr.s != ""

         UNION ALL

         SELECT evt.s, evt.utc_timestamp
         FROM XYZ_evt evt
         WHERE evt.month = 201205 and evt.s != ""
      ) total_list 
   )session_set
   GROUP BY session_set.s
) calc_session 
ORDER BY s, session_number DESC

The above query can be broken down into three distinct queries: two inner queries and one outer query.

Inner Query #1

SELECT total_list.s, total_list.utc_timestamp 
FROM (
   SELECT pgr.s, pgr.utc_timestamp
   FROM XYZ_pgr pgr
   WHERE pgr.month = 201205 AND pgr.s != ""

   UNION ALL

   SELECT evt.s, evt.utc_timestamp
   FROM XYZ_evt evt
   WHERE evt.month = 201205 and evt.s != ""
) total_list

Custom events (evt) and page requests (pgr) are the two message types that contribute to a user's session, so we must combine both tables into one master table. The information we need is the user ID and the timestamp:

# | s | utc_timestamp
1 | 123456 | 1307321107
2 | 123456 | 1307321105
3 | 987654 | 1307551408
4 | 987654 | 1307551431

Inner Query #2

SELECT session_set.s, collect_set(session_set.utc_timestamp) evt_lst
FROM (
   SELECT total_list.s, total_list.utc_timestamp 
   FROM (
      SELECT pgr.s, pgr.utc_timestamp
      FROM XYZ_pgr pgr
      WHERE pgr.month = 201205
      AND pgr.s != ""

      UNION ALL

      SELECT evt.s, evt.utc_timestamp
      FROM XYZ_evt evt
      WHERE evt.month = 201205 and evt.s != ""
   ) total_list
   ORDER BY utc_timestamp DESC
) session_set
GROUP BY session_set.s

Using COLLECT_SET() and GROUP BY, we will collect the unique timestamps in an array for each user.

# | s | evt_lst
1 | 123456 | [1307321117,1307321099,1307321097,1306928320,1307321110,1307321111,1306928314,1306928315,1307321107,1307321105]
2 | 987654 | [1307551408,1307551431,1307916523,1308342427,1307551391,1307916518,1307916693]

Outer Query

Now, using the Upsight Analytics UDF kt_session, we can find out the session length for each session of each user, assuming that for a single session the interval between messages received is less than 2 minutes.

# | s | session_number | session_length
1 | 123456 | 3 | 234
2 | 123456 | 2 | 30
3 | 123456 | 1 | 845
4 | 154532 | 1 | 56
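To make the timeout semantics concrete, here is an illustrative Python sessionizer that mirrors what kt_session computes (the timestamps are made up and the function is a sketch, not the real UDF, which runs inside Hive):

```python
def sessionize(timestamps, timeout_minutes=30):
    """Split unix timestamps into sessions: a gap of at least
    timeout_minutes starts a new session. Returns session lengths
    in seconds, matching kt_session's length output."""
    gap = timeout_minutes * 60
    lengths = []
    ts = sorted(timestamps)
    start = prev = ts[0]
    for t in ts[1:]:
        if t - prev >= gap:
            lengths.append(prev - start)  # close the current session
            start = t
        prev = t
    lengths.append(prev - start)
    return lengths

# A lone event yields one session of length 0, as described above.
print(sessionize([1000]))                        # [0]
print(sessionize([0, 60, 120, 7200, 7260], 30))  # [120, 60]
```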


Get "Total sessions / Total time spent per user"


Sample Result:

s | total_sessions | total_timespent_minute
10055268937534300000 | 1 | 5
1006536536664320000 | 2 | 10.4
10074295863763100000 | 18 | 61.9
10090273024484100000 | 2 | 6

To get the total number of sessions per user, or the total time spent per user, we wrap the original kt_session example in an aggregate MAX() or SUM(). Here is an example query:

-- Total sessions/ Total time spent per user 
-- Replace all text enclosed with ## ## (example ##month_in_YYYYMM_format## could be 201303) using the following instructions:
-- Replace ##APP1_pgr##, ##APP1_evt## with the name of your applications you are interested in reviewing, for example: demogame_pgr, demogame_evt
-- Replace ##month_in_YYYYMM_format## with the month you are interested in querying, for example 201303
-- Replace ##timeout_in_minutes_integer## with the desired timeout, for example a value of 2 would mean that when a user is inactive for 2 minutes or more, the session would end
-- For mobile applications we recommend setting the timeout to 2 minutes and for web applications we recommend using 30 minutes.

SELECT session_data.s, MAX(session_data.session_number) AS total_sessions, ROUND(SUM(session_data.session_length)/60,1) AS total_timespent_minute
FROM
(
   SELECT kt_session(calc_session.s, calc_session.evt_lst,##timeout_in_minutes_integer##) AS (s, session_number, session_length)
   FROM (
      SELECT session_set.s, collect_set(session_set.utc_timestamp) evt_lst
      FROM (
         SELECT total_list.s, total_list.utc_timestamp 
         FROM (
            SELECT pgr.s, pgr.utc_timestamp
            FROM ##APP1_pgr## pgr
            WHERE pgr.month = ##month_in_YYYYMM_format##
            AND pgr.s != ""

            UNION ALL

            SELECT evt.s, evt.utc_timestamp
            FROM ##APP1_evt## evt
            WHERE evt.month = ##month_in_YYYYMM_format## and evt.s != ""
         ) total_list 
      )session_set
      GROUP BY session_set.s
   ) calc_session 
   ORDER BY s, session_number DESC
)session_data
GROUP BY session_data.s


Get "Total sessions / Total time spent per user per day"


Sample Result:

s | ymd | total_sessions | total_timespent_minute
1006536536664320000 | 2013-08-18 | 1 | 3.5
1006536536664320000 | 2013-08-22 | 1 | 6.9
10074295863763100000 | 2013-08-24 | 8 | 39.7
10074295863763100000 | 2013-08-25 | 10 | 22.2
10140941397783700000 | 2013-08-05 | 5 | 9.9
10140941397783700000 | 2013-08-20 | 1 | 2.5
10140941397783700000 | 2013-08-27 | 2 | 3

There are two things to keep in mind when writing this query.

  1. kt_session() returns session data per session, so we need to wrap the final query with SUM() and GROUP BY s to get the total time spent per user, and MAX() to get the total number of sessions per user.

  2. We also have to group by both user and day. Since the kt_session() UDF doesn't take day as a bucket, we need to rework our inner queries a bit.

Right before we apply the kt_session UDF, we would usually have the timestamps for each user organized in an array per user. At this stage, we should organize the timestamps by user and day (instead of just users), like so:

s | ymd | timestamps
1234 | 2013-08-01 | [3456464,686586,234545,6786786]
1234 | 2013-08-02 | [5756756,658657876,3453456,86586867]
2345 | 2013-08-01 | [12325435]

In the outer query, instead of calling kt_session() on s alone, use CONCAT_WS to combine ymd and s into one bucket key, and later use the SPLIT() function to separate them again, like so:

SELECT SPLIT(session_data.ymd__s, '__')[0] AS s,
   SPLIT(session_data.ymd__s, '__')[1] AS ymd,
   MAX(session_data.session_number) AS total_sessions,
   ROUND(SUM(session_data.session_length/60),1) AS total_timespent_minute
FROM (
   SELECT kt_session(CONCAT_WS('__', calc_session.s,calc_session.ymd),
      calc_session.evt_lst, 2) AS (ymd__s, session_number,
      session_length
)
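The composite-key trick itself is simple; here is a Python sketch of the CONCAT_WS/SPLIT round trip (the values are made up):

```python
s, ymd = "123456", "2013-08-01"

# CONCAT_WS('__', s, ymd): fuse the two grouping keys into one string.
key = "__".join([s, ymd])

# SPLIT(key, '__')[0] and [1]: recover the keys afterwards.
parts = key.split("__")
print(key, parts[0], parts[1])
```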

Example Query:

-- Total sessions/ Total time spent per user per day
-- Replace all text enclosed with ## ## (example ##month_in_YYYYMM_format## could be 201303) using the following instructions:
-- Replace ##APP1_pgr##, ##APP1_evt## with the name of your applications you are interested in reviewing, for example: demogame_pgr, demogame_evt
-- Replace ##month_in_YYYYMM_format## with the month you are interested in querying, for example 201303
-- Replace ##timeout_in_minutes_integer## with the desired timeout, for example a value of 2 would mean that when a user is inactive for 2 minutes or more, the session would end
-- For mobile applications we recommend setting the timeout to 2 minutes and for web applications we recommend using 30 minutes.

SELECT SPLIT(session_data.ymd__s, '__')[0] AS s,
   SPLIT(session_data.ymd__s, '__')[1] AS ymd,
   MAX(session_data.session_number) AS total_sessions,
   ROUND(SUM(session_data.session_length/60),1) AS total_timespent_minute
FROM (
   SELECT kt_session(CONCAT_WS('__', calc_session.s, calc_session.ymd), calc_session.evt_lst, ##timeout_in_minutes_integer##) AS (ymd__s, session_number, session_length)
   FROM (
      SELECT session_set.s, FROM_UNIXTIME(CAST(session_set.ts AS BIGINT), 'yyyy-MM-dd') ymd, collect_set(session_set.ts) evt_lst
      FROM (
         SELECT total_list.s, total_list.ts
         FROM (
            SELECT pgr.s, pgr.ts
            FROM ##APP1_pgr## pgr
            WHERE pgr.month = ##month_in_YYYYMM_format##
            AND pgr.s != ""

            UNION ALL

            SELECT evt.s, evt.ts
            FROM ##APP1_evt## evt
            WHERE evt.month = ##month_in_YYYYMM_format##
            AND evt.s != ""
         ) total_list
      ) session_set
      GROUP BY session_set.s, FROM_UNIXTIME(CAST(session_set.ts AS BIGINT), 'yyyy-MM-dd')
   ) calc_session
) session_data
GROUP BY SPLIT(session_data.ymd__s, '__')[0],SPLIT(session_data.ymd__s, '__')[1]
ORDER BY s


Use the "Dynamic Month" macro to select a specific time period


Dynamically select the month partition in your WHERE clause. This will increase the speed and performance of your query. The macro is especially useful if you have scheduled queries where you are only interested in metrics from a specific time period (for example, the most recent month).

Select data from 2 months ago:

WHERE month = $(months_ago(2))

Select data from this month, 2 months ago, and 3 months ago:

WHERE month in ( $(months_ago(0)), $(months_ago(2)), $(months_ago(3)) )

UDFs


Function | Return Type | Description
from_unixtime(bigint unixtime[, string format]) | string | Converts the unix timestamp into various formats. For more information, refer to this page.
datediff(string enddate, string startdate) | int | Returns the difference in days from startdate to enddate:
datediff('2009-03-01', '2009-02-27') = 2
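For reference, the same difference-in-days computation sketched in Python:

```python
from datetime import date

def datediff(enddate: str, startdate: str) -> int:
    """Difference in days, end minus start, like Hive's datediff."""
    return (date.fromisoformat(enddate) - date.fromisoformat(startdate)).days

print(datediff('2009-03-01', '2009-02-27'))  # 2
print(datediff('2009-02-27', '2009-03-01'))  # -2
```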

Examples

Function Call Description Return Value
FROM_UNIXTIME(utc_timestamp, 'yyyy-MM-dd') Changes the unix timestamp into year-month-day numerical format. 2013-09-17
FROM_UNIXTIME(unix_timestamp(), 'yyyy-MM-dd') Returns the current date in year-month-day numerical format. This is useful for scheduled queries. 2014-09-17
FROM_UNIXTIME(CAST(ts AS bigint), 'yyyy-MM-dd') Casts the string ts to a BIGINT before passing it to FROM_UNIXTIME since it only accepts BIGINT types. Note Since the ts parameter is a string, you will need to do this every time you pass it to FROM_UNIXTIME. 2013-09-17
FROM_UTC_TIMESTAMP((utc_timestamp*1000), "CET") Changes the timezone from GMT to CET (GMT+1). Note two things: first, the timestamp must be converted to milliseconds; second, for other possible timezones, you can refer to this page. 2013-09-17 20:48:35
FROM_UNIXTIME((utc_timestamp+3600), 'yyyy-MM-dd HH:mm') Changes the timezone from GMT to GMT+1. Since GMT+1 is one hour ahead of GMT, we simply add 3600 seconds to the timestamp. 2013-09-17 20:48
FROM_UNIXTIME(utc_timestamp) Changes the unix timestamp into year-month-day numerical format with the hours, minutes, and seconds from the time appended in 24 hour format. 2013-09-17 20:48:35
MONTH(FROM_UNIXTIME(utc_timestamp)) Retrieves the month number from a unix time stamp. 9
FROM_UNIXTIME(utc_timestamp,'HH:mm') Retrieves the hours and minutes of the time from a unix timestamp in 24 hour format. 20:48
FROM_UNIXTIME(utc_timestamp,'hh:mm') Retrieves the hours and minutes of the time from a unix timestamp in 12 hour format. 08:48
FROM_UNIXTIME(utc_timestamp, 'EEEE') Retrieves the day of the week from a unix timestamp. Tuesday
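Several of these conversions can be reproduced with Python's datetime to sanity-check a format string before running a full query; the timestamp below corresponds to 2013-09-17 20:48:35 UTC, matching the examples above:

```python
from datetime import datetime, timezone

ts = 1379450915  # 2013-09-17 20:48:35 UTC

dt = datetime.fromtimestamp(ts, tz=timezone.utc)
print(dt.strftime('%Y-%m-%d'))           # like 'yyyy-MM-dd'
print(dt.strftime('%Y-%m-%d %H:%M:%S'))  # default FROM_UNIXTIME output
print(dt.strftime('%H:%M'))              # 24-hour 'HH:mm'
print(dt.strftime('%A'))                 # day of week, like 'EEEE'
```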

CSV Format


Data Mine uses the following CSV format both for importing and exporting data from the system.

CSV Properties

  1. Columns delimited by ,
  2. Mapkeys are separated by :
  3. Array elements are separated by ;
  4. The characters , : and ; are all escaped by \

Validation Rules

  1. Only string type columns are allowed to have data of empty string.
  2. NULL data value is represented by \N (backslash capital n). Any column type can be NULL.
  3. Extra columns in a row are ignored.
  4. Unspecified columns are treated as NULL.
  5. Column types are enforced.
  6. Array elements are separated by ;.
  7. Mapkeys are separated by :.
  8. Delimiters are escaped by \.
  9. Quoting of fields is not supported.
  10. Leading space between field and delimiter is not allowed unless column is of type string.
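A sketch of the escaping rules in Python (the escape_field helper is illustrative, not part of Data Mine):

```python
def escape_field(value: str) -> str:
    """Escape the CSV delimiters , : ; (and backslash itself) with a backslash."""
    for ch in "\\,:;":
        value = value.replace(ch, "\\" + ch)
    return value

row = [escape_field("one, two, and three"), "4", "5"]
line = ",".join(row)
print(line)  # one\, two\, and three,4,5  -- matches the example below
```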

Example Validations

string | int | int valid rule comment
abc,1,2 Yes. 5 Column data types are correct.
,1,2 Yes. 1 The first column is a string and is allowed to be empty.
abc,,2 No. 1 The second column is an int and is not allowed to be empty.
abc,\N,2 Yes. 2 The second column is an int and is set to NULL.
abc,1,2,3,4 Yes. 3 More columns in the row than in the table is ok.
abc Yes. 4 Fewer columns in the row than in the table is ok. The trailing columns will have the value NULL.
abc,foo,3 No 5 The second column value foo is not an integer.
one\, two\, and three,4,5 Yes 8 The comma in the string column is escaped by \.
"abc","2","3" No 9 Fields can not be quoted.
abc, 2,3 No 5,10 The leading space in column two is not allowed.
array | int valid rule comment
abc;def;hij,1 Yes. 6
abc;;hij,2 Yes. 6,1 The array elements are checked for validity. Since the array is a string an empty array element is valid.
one\;two;three\;four,3 Yes. 6,8 The semicolon is escaped by \.
array | string valid rule comment
1;2;3,abc Yes. 6
1;hi;3,abc No. 6,5 Array elements are type enforced as well.
1;;3,abc No. 6,1 The second element in the array can not be empty because the array is of type int.
1;\N;3,abc Yes. 6,2 The second element in the array is represented as NULL.