Numerical data is ubiquitous as is our reliance upon it. It can be found in earnings reports used to convey past and future company performance, IT support tickets where the time to respond and successfully close the ticket is captured or can be derived, and even in the tracking of our own personal goals over time. Luckily, the pervasiveness of data and its ever-increasing nature has been embraced and systems have been created to effectively manage and make use of it. Relational database management systems (RDBMS) come equipped with built-in functions meant to handle numeric data and assist us in answering questions based on, in full or in part, the data itself.
Let’s begin exploring numeric functions by tackling a classic programming problem where the task is to classify integer values as even or odd. The schemas, corresponding tables, and data used in the examples can be found at livesql.oracle.com.
Classifying Integers as Even or Odd
SELECT
oe.orders.order_id,
CASE MOD(oe.orders.order_id, 2)
WHEN 0
THEN 'Even'
WHEN 1
THEN 'Odd'
ELSE NULL
END AS order_id_type
FROM
oe.orders
FETCH FIRST 5 ROWS ONLY;
| order_id | order_id_type |
|---|---|
| 2354 | Even |
| 2355 | Odd |
| 2356 | Even |
| 2357 | Odd |
| 2358 | Even |
In the example above, the MOD function has been used to classify whether the pre-existing order ID values are even or odd. The function has two parameters, the numerator and denominator, and returns the remainder of the resulting division. This function is particularly useful because it provides a straightforward way to complete the task. If the denominator is two and the remainder is zero, then the order ID or first argument passed must be even. Otherwise, the value must be odd when the denominator is two. The example below is a bit more involved. It includes a recursive common table expression (CTE) and the use ROUND and DBMS_RANDOM.VALUE functions to generate one hundred random integer values. Subsequently, the MOD function is used to classify each integer as even or odd. For an in-depth explanation on this example, see my page on recursion.
/*
DBMS_RANDOM.VALUE() is used to generate 100 random integers. Subsequently, MOD is used to classify each integer as even or odd.
*/
WITH random_number_generator (iteration, random_number) AS (
SELECT
1,
ROUND(DBMS_RANDOM.VALUE(0, 100))
FROM
dual
UNION ALL
SELECT
iteration + 1,
ROUND(DBMS_RANDOM.VALUE(0, 100))
FROM
random_number_generator
WHERE
random_number_generator.iteration < 100
)
SELECT
random_number_generator.iteration,
random_number_generator.random_number,
CASE MOD(random_number_generator.random_number, 2)
WHEN 0
THEN 'Even'
WHEN 1
THEN 'Odd'
ELSE NULL
END AS integer_type
FROM
random_number_generator;
| iteration | random_number | integer_type |
|---|---|---|
| 1 | 98 | Even |
| 2 | 13 | Odd |
| 3 | 16 | Even |
| 4 | 28 | Even |
| 5 | 45 | Odd |
| 6 | 74 | Even |
| 7 | 91 | Odd |
| 8 | 73 | Odd |
| 9 | 81 | Odd |
| 10 | 48 | Even |
| 11 | 6 | Even |
| 12 | 4 | Even |
| 13 | 2 | Even |
| 14 | 5 | Odd |
| 15 | 74 | Even |
| 16 | 13 | Odd |
| 17 | 18 | Even |
| 18 | 54 | Even |
| 19 | 6 | Even |
| 20 | 66 | Even |
| 21 | 35 | Odd |
| 22 | 78 | Even |
| 23 | 31 | Odd |
| 24 | 83 | Odd |
| 25 | 75 | Odd |
| 26 | 67 | Odd |
| 27 | 55 | Odd |
| 28 | 50 | Even |
| 29 | 44 | Even |
| 30 | 27 | Odd |
| 31 | 60 | Even |
| 32 | 69 | Odd |
| 33 | 43 | Odd |
| 34 | 75 | Odd |
| 35 | 82 | Even |
| 36 | 89 | Odd |
| 37 | 55 | Odd |
| 38 | 48 | Even |
| 39 | 67 | Odd |
| 40 | 20 | Even |
| 41 | 75 | Odd |
| 42 | 6 | Even |
| 43 | 1 | Odd |
| 44 | 38 | Even |
| 45 | 54 | Even |
| 46 | 96 | Even |
| 47 | 47 | Odd |
| 48 | 22 | Even |
| 49 | 34 | Even |
| 50 | 42 | Even |
| 51 | 18 | Even |
| 52 | 34 | Even |
| 53 | 64 | Even |
| 54 | 61 | Odd |
| 55 | 4 | Even |
| 56 | 73 | Odd |
| 57 | 80 | Even |
| 58 | 71 | Odd |
| 59 | 78 | Even |
| 60 | 89 | Odd |
| 61 | 0 | Even |
| 62 | 51 | Odd |
| 63 | 41 | Odd |
| 64 | 24 | Even |
| 65 | 23 | Odd |
| 66 | 88 | Even |
| 67 | 93 | Odd |
| 68 | 58 | Even |
| 69 | 52 | Even |
| 70 | 41 | Odd |
| 71 | 33 | Odd |
| 72 | 93 | Odd |
| 73 | 61 | Odd |
| 74 | 7 | Odd |
| 75 | 3 | Odd |
| 76 | 90 | Even |
| 77 | 74 | Even |
| 78 | 78 | Even |
| 79 | 6 | Even |
| 80 | 9 | Odd |
| 81 | 9 | Odd |
| 82 | 73 | Odd |
| 83 | 84 | Even |
| 84 | 8 | Even |
| 85 | 88 | Even |
| 86 | 64 | Even |
| 87 | 42 | Even |
| 88 | 23 | Odd |
| 89 | 46 | Even |
| 90 | 60 | Even |
| 91 | 78 | Even |
| 92 | 56 | Even |
| 93 | 98 | Even |
| 94 | 8 | Even |
| 95 | 44 | Even |
| 96 | 98 | Even |
| 97 | 94 | Even |
| 98 | 94 | Even |
| 99 | 95 | Odd |
| 100 | 7 | Odd |
Rounding Numeric Values
One of the most basic transformations conceptually and in its implementation for numeric data is rounding. It’s highly likely that you’ve had the need to do this type of operation and have done so in other programming languages or software (e.g., Tableau, Microsoft Excel). Using SQL, the task is no more difficult. Built-in functions, such as ROUND, CEIL (i.e., ceiling), and FLOOR are typically available to RDBMS users in some form. ROUND is used to round numeric values in the most common sense and accepts two arguments, the value to be rounded and the integer value associated with the number of places to the right of the decimal point. It allows us to round a numeric value to the nearest integer value, tenth, hundredth, thousandth, etc. CEIL and FLOOR functions behave a bit differently. CEIL will round any non-whole number upward to the next integer value while FLOOR will round any non-whole number downward to the previous integer value. The following example code and corresponding result set illustrates the functionality afforded by ROUND, CEIL, and FLOOR functions given the original total sales value corresponding to each sales representative.
SELECT
oe.orders.sales_rep_id,
SUM(oe.orders.order_total) AS total_sales,
ROUND(SUM(oe.orders.order_total), 0) AS round,
CEIL(SUM(oe.orders.order_total)) AS ceiling,
FLOOR(SUM(oe.orders.order_total)) AS floor
FROM
oe.orders
WHERE
oe.orders.sales_rep_id IS NOT NULL
GROUP BY
oe.orders.sales_rep_id;
| sales_rep_id | total_sales | round | ceiling | floor |
|---|---|---|---|---|
| 158 | 156296.2 | 156296 | 156297 | 156296 |
| 161 | 661734.5 | 661735 | 661735 | 661734 |
| 159 | 151167.2 | 151167 | 151168 | 151167 |
| 163 | 128249.5 | 128250 | 128250 | 128249 |
| 155 | 134415.2 | 134415 | 134416 | 134415 |
| 156 | 202617.6 | 202618 | 202618 | 202617 |
| 153 | 114215.7 | 114216 | 114216 | 114215 |
| 154 | 171973.1 | 171973 | 171974 | 171973 |
| 160 | 88238.4 | 88238 | 88239 | 88238 |
Calculating Mean Absolute Deviation (MAD)
In statistics, the mean absolution deviation is the average distance between each data point and the mean of the data. In the example below, the common table expression (CTE) by the name of statistic is responsible for computing the average order total across all orders. Subsequently, the summation of the absolute differences between the average order total value and each order total value is computed. This row-level difference of the mean and each order total value is made possible by the CROSS JOIN. If you’re not familiar with the CROSS JOIN, then please consider checking out my page dedicated to the topic. Moving on, the result of that sum is then divided by the number of rows (i.e., observations or order total values) and finally rounded to the nearest hundredth using Oracle’s built-in function, ROUND.
WITH statistic AS (
SELECT
AVG(oe.orders.order_total) AS average_order_total
FROM
oe.orders
)
SELECT
ROUND(
SUM(
ABS(oe.orders.order_total - statistic.average_order_total)
)
/ COUNT(*),
2
) AS mad
FROM
oe.orders
CROSS JOIN
statistic;
| mad |
|---|
| 32812.29 |
Calculating Sample Variance
Like mean absolution deviation, we can calculate the sample variance using SQL. Sample variance is a measure of variability and can be calculated by taking the average of the squared deviations from the mean. It tells us how far each data point is from the mean of the data. In the following example, I will first calculate the sample variance manually, without a built-in function. Where possible, it’s important to not allow the lack of built-in functions provided by a specific RDBMS to limit analyses. Following the manual calculation, I will use the built-in function provided by Oracle. These functions are convenient in that the heavy lifting in terms of implementing the functionality is transferred from the developer or code writer to the RDBMS.
In the example below, there exists two CTEs. The first CTE, job_statistic, calculates the average salary observed for those employees corresponding to each job ID value. The second CTE, employee_salary, simply retrieves the job ID and salary for each employee. The outputs from both CTEs are necessary to compute the difference between each employee’s salary and the average salary of those employees with the same job. This is observed in the SELECT statement following the initial CTEs. In the SELECT statement, the JOIN and corresponding ON clause ensure the correct differences are computed. Correct meaning the average salary for the job corresponding to the specific employee’s job is used. Additionally, a CASE expression is used to determine the number of employees corresponding to each job ID. When the number of employees holding a specific job is one, the sample variance value returned is NULL. Otherwise, the average of the squared deviations from the mean is returned.
-- Manually calculating sample variance of employee salaries within each department.
WITH job_statistic AS (
SELECT
hr.employees.job_id,
AVG(hr.employees.salary) AS average_salary
FROM
hr.employees
GROUP BY
hr.employees.job_id
),
employee_salary AS (
SELECT
hr.employees.job_id,
hr.employees.salary
FROM
hr.employees
)
SELECT
job_statistic.job_id,
CASE
WHEN COUNT(employee_salary.job_id) = 1
THEN NULL
ELSE SUM(
(employee_salary.salary - job_statistic.average_salary) * (employee_salary.salary - job_statistic.average_salary)
)
/ (COUNT(employee_salary.job_id) - 1)
END AS sample_variance
FROM
job_statistic
INNER JOIN
employee_salary
ON job_statistic.job_id = employee_salary.job_id
GROUP BY
job_statistic.job_id;
| job_id | sample_variance |
|---|---|
| AD_VP | 0 |
| FI_ACCOUNT | 587000 |
| PU_CLERK | 57000 |
| SH_CLERK | 300289.4737 |
| HR_REP | – |
| AC_MGR | – |
| PU_MAN | – |
| ST_CLERK | 205552.6316 |
| AD_ASST | – |
| IT_PROG | 3708000 |
| AC_ACCOUNT | – |
| SA_MAN | 2325000 |
| FI_MGR | – |
| ST_MAN | 1137000 |
| AD_PRES | – |
| MK_MAN | – |
| SA_REP | 2323275.862 |
| MK_REP | – |
| PR_REP | – |
In the example code below, Oracle’s built-in function, VAR_SAMP, is used to calculate the sample variance of employee salaries within each department. Underneath the hood, the implementation is similar to that seen above in the manual calculation. Both methods return the same result.
-- Using the built-in function, VAR_SAMP to calculate the sample variance of employee salaries within each department.
SELECT
hr.employees.job_id,
VAR_SAMP(hr.employees.salary) AS sample_variance
FROM
hr.employees
GROUP BY
hr.employees.job_id;
| job_id | sample_variance |
|---|---|
| AD_VP | 0 |
| FI_ACCOUNT | 587000 |
| PU_CLERK | 57000 |
| SH_CLERK | 300289.4737 |
| HR_REP | – |
| AC_MGR | – |
| PU_MAN | – |
| ST_CLERK | 205552.6316 |
| AD_ASST | – |
| IT_PROG | 3708000 |
| AC_ACCOUNT | – |
| SA_MAN | 2325000 |
| FI_MGR | – |
| ST_MAN | 1137000 |
| AD_PRES | – |
| MK_MAN | – |
| SA_REP | 2323275.862 |
| MK_REP | – |
| PR_REP | – |
Calculating Sample Standard Deviation
When learning statistics, calculating variance is often tackled first because standard deviation is simply the square root of the variance number. I’ll use this fact to adjust the proposed manual solution above to calculate the standard deviation of employee salaries for each job ID value. I do cheat a little here by taking advantage of Oracle’s built-in square root function, SQRT. It’s worth noting that the units of the resulting standard deviation values are of the same units as the values used in the computation (e.g., $). Following the manual solution is a solution utilizing Oracle’s built-in function, STDDEV_SAMP.
-- Manually calculating sample standard deviation of employee salaries within each department.
WITH job_statistic AS (
SELECT
hr.employees.job_id,
AVG(hr.employees.salary) AS average_salary
FROM
hr.employees
GROUP BY
hr.employees.job_id
),
employee_salary AS (
SELECT
hr.employees.job_id,
hr.employees.salary
FROM
hr.employees
)
SELECT
job_statistic.job_id,
CASE
WHEN COUNT(employee_salary.job_id) = 1
THEN NULL
ELSE SQRT(
SUM(
(employee_salary.salary - job_statistic.average_salary) * (employee_salary.salary - job_statistic.average_salary)
)
/ (COUNT(employee_salary.job_id) - 1)
)
END AS sample_standard_deviation
FROM
job_statistic
INNER JOIN
employee_salary
ON job_statistic.job_id = employee_salary.job_id
GROUP BY
job_statistic.job_id;
| job_id | sample_standard_deviation |
|---|---|
| AD_VP | 0 |
| FI_ACCOUNT | 766.1592524 |
| PU_CLERK | 238.7467277 |
| SH_CLERK | 547.9867459 |
| HR_REP | – |
| AC_MGR | – |
| PU_MAN | – |
| ST_CLERK | 453.3791257 |
| AD_ASST | – |
| IT_PROG | 1925.616784 |
| AC_ACCOUNT | – |
| SA_MAN | 1524.795068 |
| FI_MGR | – |
| ST_MAN | 1066.302021 |
| AD_PRES | – |
| MK_MAN | – |
| SA_REP | 1524.229596 |
| MK_REP | – |
| PR_REP | – |
As seen previously when calculating sample variance, using Oracle’s built-in functions can significantly reduce the amount of effort involved in completing a task.
-- Using the built-in function, STDDEV_SAMP to calculate the sample standard deviation of employee salaries within each department.
SELECT
hr.employees.job_id,
STDDEV_SAMP(hr.employees.salary) AS sample_standard_deviation
FROM
hr.employees
GROUP BY
hr.employees.job_id;
| job_id | sample_standard_deviation |
|---|---|
| AD_VP | 0 |
| FI_ACCOUNT | 766.1592524 |
| PU_CLERK | 238.7467277 |
| SH_CLERK | 547.9867459 |
| HR_REP | – |
| AC_MGR | – |
| PU_MAN | – |
| ST_CLERK | 453.3791257 |
| AD_ASST | – |
| IT_PROG | 1925.616784 |
| AC_ACCOUNT | – |
| SA_MAN | 1524.795068 |
| FI_MGR | – |
| ST_MAN | 1066.302021 |
| AD_PRES | – |
| MK_MAN | – |
| SA_REP | 1524.229596 |
| MK_REP | – |
| PR_REP | – |
While the numeric functions used above should not be considered an exhaustive list, I hope exploring those functions leads to a general understanding on the usefulness of built-in functions and the importance of being able to derive or compute something manually when a function to do so is not offered by a specific RDBMS.