Last updated 2 years ago
Improving your data analysis skills is possible by mastering advanced SQL techniques, which can assist in deeper exploration of datasets leading to valuable insights. This article highlights seven advanced SQL queries that can be helpful for data analysis. Each query will be presented along with its results, syntax details and an explanation of how the database computes it.
For the 1st Example, let us scrutinize information found in the sales table. The said data set contains a unique identifier for every product, the date when the deal has transpired, and the overall value of the transaction.
id | product_id | amount | date |
---|---|---|---|
1 | 1 | 120 | 2023-04-01 |
2 | 112 | 310 | 2023-03-11 |
3 | 810 | 520 | 2023-04-08 |
4 | 11 | 410 | 2023-02-26 |
5 | 810 | 520 | 2023-04-08 |
6 | 11 | 410 | 2023-04-13 |
Avery common in analysis scenario required data like specific date periods, such as month or year.
SELECT EXTRACT(YEAR FROM date) AS year, EXTRACT(MONTH FROM date) AS month, SUM(amount) AS total_sales_amount FROM sales GROUP BY year, month ORDER BY year, month;
Output:
year | month | total_sales_amount |
---|---|---|
2023 | 2 | 410 |
2023 | 3 | 310 |
2023 | 4 | 1570 |
Using the EXTRACT() function, we retrieve the year and month from the date field in this query. The data is then grouped based on the new year and month fields, and the total sales for each group are calculated by employing the SUM() function.
Ranking information is frequently needed for complex data analysis purposes. For instance, you may want to rank your merchandise based on their sales figures to identify which items generate the majority of the earnings, or you may need to rank your outlets based on their subpar sales figures to pinpoint which locations are performing poorly.
Utilizing functions such as RANK() or DENSE_RANK() as window functions can allocate rankings to rows based on the value of a particular column. To begin, we shall examine the RANK() function:
SELECT product_id, SUM(amount) AS total_sales_amount, RANK() OVER(ORDER BY SUM(amount) DESC) AS rank FROM sales GROUP BY id;
Output:
product_id | total_sales_amount | rank |
---|---|---|
810 | 520 | 1 |
810 | 520 | 1 |
11 | 410 | 3 |
11 | 410 | 3 |
112 | 310 | 5 |
1 | 120 | 6 |
At times, it becomes necessary to juxtapose the values of successive rows, such as when retrieving sales figures from the preceding month to measure against those of the current month. By having data on sales from both the current and previous months at hand, it becomes possible to calculate the variance between the two values.
To accomplish this, you may use window functions such as LAG() and LEAD(). For this i will use a new table named sales_stats.
id | total | month | year |
---|---|---|---|
3 | 10 | 1 | 2022 |
4 | 2 | 2 | 2022 |
1 | 10 | 1 | 2023 |
2 | 20 | 2 | 2023 |
Using the LAG() function, this inquiry computes the variance between the sales of the present month and the sales of the preceding month.
SELECT year, month, total, LAG(total) OVER (ORDER BY year, month) AS previous_month, total-LAG(total) OVER (ORDER BY year, month) AS sales_difference FROM sales_stats ORDER BY year, month;
Output:
year | month | total | previous_month | sales_difference |
---|---|---|---|---|
2022 | 1 | 10 | NULL | NULL |
2022 | 2 | 2 | 10 | -8 |
2023 | 1 | 10 | 2 | 8 |
2023 | 2 | 20 | 10 | 10 |
LAG() is a window function which retrieves the total amount from the preceding row, sorted by year and month, and assigns an alias of previous_month to that column.
The sales_difference column is obtained by deducting the previous_month value from the total in each row. In the initial row, NULL values are present because there is no previous row.
Using a moving average, which calculates the mean of a given value and a set number of its preceding values, can be more beneficial in studying sales trends through data analysis rather than relying solely on individual data points. This method is also referred to as a rolling average.
For this example, i will use total_sales table.
total_sales
id | total_sales | month | year |
---|---|---|---|
1 | 320 | 1 | 2022 |
2 | 400 | 2 | 2022 |
3 | 200 | 3 | 2022 |
4 | 600 | 4 | 2022 |
5 | 650 | 5 | 2022 |
6 | 50 | 6 | 2022 |
7 | 623 | 7 | 2022 |
8 | 700 | 8 | 2022 |
let's calculate the average
SELECT year, month, total_sales, AVG(total_sales) OVER(ORDER BY year, month ROWS BETWEEN 2 PRECEDING and CURRENT ROW) AS moving_average FROM total_sales ORDER BY year, month;
Output:
year | month | total_sales | moving_average |
---|---|---|---|
2022 | 1 | 320 | 320.0000 |
2022 | 2 | 400 | 360.0000 |
2022 | 3 | 200 | 306.6667 |
2022 | 4 | 600 | 400.0000 |
2022 | 5 | 650 | 483.3333 |
2022 | 6 | 50 | 433.3333 |
2022 | 7 | 623 | 441.0000 |
2022 | 8 | 700 | 457.6667 |
A cumulative sum, also known as a running total or cumulative total, sums up a series of numbers by adding each new value to the previous running total.
Accumulating amounts are valuable in computing the combined influence of data elements as time progresses. One instance where it can be utilized is in computing the combined count of web visitors up to a specific moment to comprehend your website's user expansion.
For this example i will use the same table of previous point. Table name total_sales
SELECT year, month, total_sales, SUM(total_sales) OVER(ORDER BY year, month) AS cumulative_total FROM total_sales
ORDER BY year, month;
Output:
year | month | total_sales | cumulative_total |
---|---|---|---|
2022 | 1 | 320 | 320 |
2022 | 2 | 400 | 720 |
2022 | 3 | 200 | 920 |
2022 | 4 | 600 | 1520 |
2022 | 5 | 650 | 2170 |
2022 | 6 | 50 | 2220 |
2022 | 7 | 623 | 2843 |
2022 | 8 | 700 | 3543 |
You’d like to display a number of seconds as a time value in hours, minutes, and seconds in MySQL.
Read ItCSS or Cascading Style Sheets have become a really popular programming language in recent times and...
Read It