Business Intelligence with SQL Server & ME

Sunday, August 7, 2011

Setup Date Dimension

You cannot think of business intelligence without date dimension and it is so much so that all our measures mostly spins around date dimension for date/time slicing, YTD, MTD, Month to Month comparison and goes on. In my current assignment I had to build date dimension and that’s when I thought of checking with the BI core group in my company. I got plenty of replies from the group. However I was looking for something which is very easy to understand and implement and thinking so I built one date dimension. Note that in my post I have not kept attributes for fiscal year but if you need to, you would merely have to add bunch of attributes to store fiscal month, quarter etc.

I am straight away jumping into the script. We need 4 scripts as explained below -

Date dimension table. Name of the table is [OS_Date_Dimension].
Stored procedure to populate date dimension table.
Script to populate your fact with UNKNOWN time member if lookup does not find one key in the date dimension.
Call stored procedure to fill date table.

Date dimension table. Name of the table is [OS_Date_Dimension]

CREATE TABLE [dbo].[Os_Dim_Date](
    --Keys
    [DateKey] [int] PRIMARY KEY CLUSTERED NOT NULL,    --20110101
    [AlternateDateKey] [nvarchar](10) NOT NULL,        --2001-07-01
    [CalenderDate] [datetime] NOT NULL,                --2000-01-01 00:00:00.000
    
    --Day
    [DayOfWeek] [nvarchar](9) NOT NULL,                --Sunday
    [ShortDayOfWeek] [nvarchar](3) NOT NULL,        --Sun
    [DayNumOfYear] [int] NOT NULL,                    --1, 2, 3..365
    [DayNumOfQuarter] [int] NOT NULL,                --1, 2, 3..90
    [DayNumOfMonth] [int] NOT NULL,                    --1, 2, 3..31
    [DayNumOfWeek] [int] NOT NULL,                    --1, 2, 3..7
    
    --Week
    [WeekName] [nvarchar](20) NOT NULL,                --2000-Week-1
    [WeekNumOfYear] [int] NOT NULL,                    --1, 2, 3..49
    [WeekNumOfQuarter] [int] NOT NULL,                --1, 2, 3..14
    [WeekNumOfMonth] [int] NOT NULL,                --1, 2, 3..6
    
    --Month
    [MonthName] [nvarchar](20) NOT NULL,            --January 2001
    [MonthShortName] [nvarchar](6) NOT NULL,        --Jan
    [MonthNumOfYear] [int] NOT NULL,                --1, 2..12
    [MonthNumOfQuarter] [int] NOT NULL,                --1, 2 and 3
    
    --Quarter
    [QuarterName] [nvarchar](10) NOT NULL,            --2000 Q1
    [QuarterShortName] [nvarchar](5) NOT NULL,        --Q1
    [Quarter] [int] NOT NULL,                        --1,2,3 and 4
        
    --Year
    [YearName] [nvarchar](7) NOT NULL,                --2011
    [Year] [int] NOT NULL,                            --2011
        
    --Additional
    [SemiMonth] [nvarchar](50) NULL,                --SemiMonthP1 January 2011
    [Semester] [int] NOT NULL,                        --1 and 2
    [SemesterName] [nvarchar](11) NOT NULL            --H1 CY 2001
)
GO

Stored procedure to populate date dimension table

--Stored procedure to create date dimension table
CREATE PROCEDURE [dbo].[Os_Populate_Date_Dim]
(
  @FIRST_DATE       DATETIME,
  @LAST_DATE        DATETIME
)
AS
BEGIN
    WITH TE_Os_Dim_Date as 
    ( 
    SELECT Cast (@FIRST_DATE as DateTime) Date        --Start Date 
    UNION ALL 
    SELECT Date + 1 FROM TE_Os_Dim_Date 
        WHERE Date + 1 < = @LAST_DATE                --End date 
    ) 
    INSERT INTO [dbo].[Os_Dim_Date] (
         [DateKey]
        ,[AlternateDateKey]
        ,[CalenderDate]
        ,[DayOfWeek]
        ,[ShortDayOfWeek]
        ,[DayNumOfYear]
        ,[DayNumOfQuarter]
        ,[DayNumOfMonth]
        ,[DayNumOfWeek]
        ,[WeekName]
        ,[WeekNumOfYear]
        ,[WeekNumOfQuarter]
        ,[WeekNumOfMonth]
        ,[MonthName]
        ,[MonthShortName]
        ,[MonthNumOfYear]
        ,[MonthNumOfQuarter]
        ,[QuarterName]
        ,[QuarterShortName]
        ,[Quarter]
        ,[YearName]
        ,[Year]
        ,[SemiMonth]
        ,[Semester]
        ,[SemesterName]
    )
    SELECT 
        CAST(CONVERT(NVARCHAR(30),DATE, 112) AS INT) AS DateKey
        ,LEFT(CAST(CONVERT(NVARCHAR(30),DATE, 120) AS VARCHAR), 10) AS AlternateDateKey
        ,DATE AS CalenderDate
        
        ,DateName (dw, date) as DayOfWeek
        ,LEFT (DateName (dw, date), 3) as ShortDayOfWeek
        ,DATEPART (DY, DATE) AS DayNumOfYear 
        ,DATEDIFF(DD,DATEADD(QQ,DATEDIFF(QQ,0,DATE),0),DATE)+1 AS DayNumOfQuarter 
        ,DAY (DATE) AS DayNumOfMonth 
        ,DATEPART (DW, DATE) AS DayNumOfWeek 
        
        ,'Week ' + CAST(DatePart (wk, Date) AS VARCHAR) + ' CY ' + convert(varchar(4),YEAR(Date)) AS WeekName
        ,DatePart (wk, Date) as WeekNumOfYear 
        ,datediff(wk,dateadd(qq,datediff(qq,0,date),0),date)+1 as WeekNumOfQuarter 
        ,datediff(wk,dateadd(mm,datediff(mm,0,date),0),date)+1 as WeekNumOfMonth 

        ,DateName (mm, date) + ' ' + datename(year,Date) as MonthName 
        ,LEFT ( DateName (mm, date), 3) MonthShortName 
        ,MONTH (date) as MonthNumOfYear 
        ,DateDiff(mm,DateAdd(qq,DateDiff(qq,0,date),0),date)+1 as MonthNumOfQuarter 
        
        ,'Q' + datename(quarter,Date) + ' CY ' + datename(year,Date) AS QuarterName
        ,'Q'+ datename(quarter,Date) AS QuarterShortName
        ,DatePart ( qq, date) as Quarter 
        
        ,datename(year,Date) AS YearName
        ,YEAR (date) as Year 
        
        ,'SemiMonthP' +  CASE WHEN Day(Date) <= 15 then '1' ELSE '2' END + ' ' +  DateName (mm, date) + ' ' +  datename(year,Date) AS SemiMonth
        ,CASE WHEN MONTH(Date) <= 6 then 1 ELSE 2 END AS Semester
        ,'H' + CAST(CASE WHEN MONTH(Date) <= 6 then 1 ELSE 2 END AS VARCHAR) + ' CY ' +  datename(year,Date) AS [SemesterName]
    FROM TE_Os_Dim_Date 
    
    OPTION (MAXRECURSION 0)
END
GO

Script to populate your fact with UNKNOWN time member if lookup does not find one key in the date dimension

--Insert a record for unknown lookup
INSERT INTO [dbo].[Os_Dim_Date] (
         [DateKey]
        ,[AlternateDateKey]
        ,[CalenderDate]
        ,[DayOfWeek]
        ,[ShortDayOfWeek]
        ,[DayNumOfYear]
        ,[DayNumOfQuarter]
        ,[DayNumOfMonth]
        ,[DayNumOfWeek]
        ,[WeekName]
        ,[WeekNumOfYear]
        ,[WeekNumOfQuarter]
        ,[WeekNumOfMonth]
        ,[MonthName]
        ,[MonthShortName]
        ,[MonthNumOfYear]
        ,[MonthNumOfQuarter]
        ,[QuarterName]
        ,[QuarterShortName]
        ,[Quarter]
        ,[YearName]
        ,[Year]
        ,[SemiMonth]
        ,[Semester]
        ,[SemesterName]
    )
VALUES
    (
         19000101
        ,'1900-01-01' 
        ,'1900-01-01 00:00:00'
        ,'Unknown' ,'Unk', 1, 1 ,1 ,1
        ,'Unknown', 1 ,1 ,1
        ,'Unknown', 'Unk' ,1 ,1
        ,'Unkown', 'Q1', 1
        ,'1900', 1900
        ,'Unknown', 1
        ,'Unknown'
    )
GO

Call stored procedure to fill date table

--Populate Date from 2001 to 2020
EXEC [dbo].[Os_Populate_Date_Dim] '2001-01-01', '2020-12-31'
GO

Now let’s setup dimension attribute key, name and hierarchies in the cube. Your Dimension structure (Hierarchies) and attributes relationship should look like this at the end of the exercise.

Set attribute key and names as illustrated below -

Attribute Name	KeyColumns Property	NameColumn Property
DateKey	Os_Dim_Date.DateKey (Integer)	Os_Dim_Date.CalenderDate (WChar) ValueColumn: Os_Dim_Date.CalenderDate (WChar)
Month Name	Os_Dim_Date.Year (Integer) Os_Dim_Date.MonthNumOfYear (Integer)	Os_Dim_Date.MonthName (WChar)
Quarter	Os_Dim_Date.Year (Integer) Os_Dim_Date.Quarter (Integer)	Os_Dim_Date.QuarterName (WChar)
Year	Os_Dim_Date.Year (Integer)	Os_Dim_Date.YearName (WChar)
Week Name	Os_Dim_Date.MonthName (WChar) Os_Dim_Date.WeekNumOfMonth (Integer)	Os_Dim_Date.WeekName (WChar)

Next process dimension and browse Date and Week Hier. Find below the illustrations.

You may need more attributes related to fiscal year, holidays etc. which you should be able to add easily by extending table and stored procedure. Hope this helps.

Thursday, June 16, 2011

Reports using Cube or Relational Database..some points to consider

We should try and use cube and not relational database for building SSRS reports. Here are the some of the advantages that cube has over relational database –

Cube is much faster source for canned and ad-hoc reporting. Reason, aggregated and processed data is stored in cube database (of course if implemented using MOLAP).
Information or the measures are organized in a way which is easy to build excel pivot table which business user can easily understand.
It is very easy to implement hierarchy and drill down using cube and MDX compared with Relational database and SQL queries.
With cube we can use power of MDX to build complex reports, like YTD, Running total, Semi additive measure, trend, parallel period etc.
Calculated measures can be stored in the cube and can be used by developer to build report or business user browsing them directly using any client application or excel. Other hand we never store calculated column in the relational database table.
Even ad-hoc reporting becomes handy and easy with cube as some of the client software like ProClarity only work with cube.
Using cube we can easily get different view of a measure using linked dimensions. Say spend by vendor, location etc.
However not all the report should be build using the cube. For example, detail reports where we want to view detail data as they were pulled from the source system, we should use SQL report. In this case we are not making use of aggregated data from the cube and SQL query can returns the detail rows much master than MDX.

Monday, June 6, 2011

SSAS NULL value default behavior

One of my direct reports asked me last week about how SSAS handles NULL measure if present in the fact table. I think he encountered this in one of his business intelligence job interview :).

I thought let’s write a small post about the topic and this way others will also be get benefited plus I don’t have to explain it to him separately.

I have a sales fact table which is linked to product, measure type, organization and time dimensions. Here is part of the fact table -

As you see we have got plenty of NULL value under plan and forecast measures. We will stick to Actual and Plan measures for our discussion. Let’s browse the cube and put this in perspective for you to see how SSAS treat NULL measure value. Any guess at this point?

I have filter products with PRODUCT_SK in (48, 55 and 56) where we have NULL under plan but has got some Actual sales value. This is bit strange with SSAS which replaces NULL with 0 by default, meaning does not preserve NULL. However, this may mislead the business and you may want to preserve NULL for all intense purposes. In this case it is just that business has not decided on the plan, forecast etc for certain products does not necessarily mean that plan amount is 0.

Lets get straight to the SSAS measure properties and there is a way to preserver NULL value. Go to the cube solution, select you measure and change NullProcessing property under Source from Automatic to Preserve. Next process the cube and reconnect to OLAP db to refresh.

Here is what I mean. We don’t see 0 anymore. You should carefully decide whether or not you want to keep NULL and make use of this property.

Please leave any feedback or suggestions you have!

NON EMPTY vs NONEMPTY(<set>, <set>)

To remove NULL values from the query output we use NON EMPTY or NONEMPTY functions. There is small difference between NON EMPTY and NONEMPTY function, I am providing difference with examples through this post.

Let’s run following MDX query -

SELECT 
{
    [Measures].[INV PROD SALES ACTUAL],
    [Measures].[LEAD TO CALL PRODUCT ACTUAL]} ON COLUMNS,
{
    [MEASURE TYPE].[MEASURE TYPE].Children    
} ON ROWS
FROM [YOUR CUBE]

Result of the above query. As you see there are plenty of NULL values we got for the MEASURE TYPE set.

Now let’s use NON EMPTY on the set, here is the changed query -

SELECT 
{
    [Measures].[INV PROD SALES ACTUAL],
    [Measures].[LEAD TO CALL PRODUCT ACTUAL]} ON COLUMNS,
NON EMPTY
{
    [MEASURE TYPE].[MEASURE TYPE].Children    
} ON ROWS
FROM [YOUR CUBE]

Here is the output with NON EMPTY -

As you see now wherever we had NULL for both the measure those measure types have been filtered (meaning removed). So NON EMPTY works on the overall result, meaning after set get first evaluated, all the tuples with NULL values for all the measures, are removed.

On the other hand NONEMPTY() takes 2 sets as parameters. Its uses tuples of the first set and filtered out all empty tuples based on the cross product with second set. Meaning behind the scene what happens is, NONEMPTY is evaluated when the set defining the rows axis is evaluated. Let’s analyze the query -

SELECT 
NON EMPTY
{
    [Measures].[INV PROD SALES ACTUAL],
    [Measures].[LEAD TO CALL PRODUCT ACTUAL]} ON COLUMNS,
{
    NONEMPTY([MEASURE TYPE].[MEASURE TYPE].Children,
    [Measures].[LEAD TO CALL PRODUCT ACTUAL])    
} ON ROWS
FROM [YOUR CUBE]

Before we get into the result, let’s analyze the above query. As we discussed above NONEMPTY will be evaluated with LEAD TO CALL PRODUCT ACTUAL measure in the ROW axis first and will bring all the MEASURE TYPE tuples wherever we have value for LEAD TO CALL PRODUCT ACTUAL measure. Also note that if we don’t pass any measure in the NONEMPTY it takes default measure and filter non empty value for the set. From the first query output I am graying out the cells that we don’t expect in the query result.

Here is the above query output.

NONEMPTY should be used to get rid of empty tuples in the set for specified measure and it helps improve query performance. Specially when you cross join the dimension(s) members you should always consider NONEMPTY. Leave your comments..see you back soon!

Friday, June 3, 2011

SCD Type – 2 using MERGE and Output $action

In this article we will talk about achieving SCD behavior using Sql MERGE statement. MERGE join works well based on join condition with source table and through available condition you can have INSERT, UPDATE or DELETE actions. Primarily what it does is that it compare key value of target with source table and perform different operations specified like WHEN MATCHED, WHEN NOT MATCHED etc.

The main focus of this post is to make use of $action virtual output column when we execute MERGE sql and decide as in whether INSERT or UPDATE has been performed on the source.

Let get to the business area and see these in action.

Create a table first and populate some records, or you can use the same table from my previous post.
http://takeatourtobi.blogspot.com/2011/06/delete-and-truncate-behavior.html

--Create student table
CREATE TABLE [DBO].[TTB_STUDENT](
    ID INT,
    NAME VARCHAR(100) NULL,
    AGE INT NULL,
    ACTIVE CHAR(1)
)

--Insert some records
INSERT INTO [DBO].[TTB_STUDENT] VALUES
    (1, 'Mark', 32, 'Y'),
    (2, 'Elaine', 31, 'Y'),
    (3, 'Elton', 33, 'Y'),
    (4, 'John', 40, 'Y'),
    (5, 'Smith', 30, 'Y')

Select student table. This is our target table -

Create another table which will have the delta which we have to apply on student table. Note that we will be using ID column value as key to decide whether or not student already exists. Here is the delta table script and rows..

CREATE TABLE [DBO].[TTB_STG_STUDENT](
    ID INT PRIMARY KEY,
    NAME VARCHAR(100) NULL,
    AGE INT NULL
)

INSERT INTO [DBO].[TTB_STG_STUDENT] VALUES
    (1, 'Mark Luther', 42),
    (2, 'Elaine Wang', 41),
    (6, 'Elton B', 22)

Since we are doing SCD Type – 2, this is what will be the scenarios -

If get a new student (New ID value), we will insert into the target table and set Active = ‘Y’
If get an existing student, we will insert it to the target table with Active = ‘Y’ and set Active = ‘N’ for all other records of that student. Meaning, at a any given point in time, we should have only one active record for each student.

As said earlier idea here is to make use of OUTPUT $action to retrieve the records that got updated and once we have this list, we can insert it finally and set flag to ‘N’ for all other records for a student.

Here is the final SQL statement and target table output.

INSERT INTO [DBO].[TTB_STUDENT]    
SELECT ID, NAME, AGE, 'Y' FROM
(
    MERGE [DBO].[TTB_STUDENT] AS TRG
        USING [DBO].[TTB_STG_STUDENT] AS SRC
            ON (SRC.ID = TRG.ID AND TRG.ACTIVE = 'Y')
        WHEN NOT MATCHED THEN 
            INSERT VALUES (SRC.ID, SRC.NAME, SRC.AGE, 'Y')
        WHEN MATCHED THEN 
            UPDATE SET TRG.ACTIVE = 'N'
        OUTPUT $action, SRC.ID, SRC.NAME, SRC.AGE

) AS DELTA (action, ID, Name, Age)
WHERE action = 'Update'

As you see, we had 2 existing students in the delta table for which active is set to ‘N’ and 2 new entries have been created in the table with Active = ‘Y’. There were one new student (ID = 6), which got inserted into the table. This way we can achieve SCD Type – 2 using MERGE SQL. Please leave your comments/feedback.

DELETE and TRUNCATE behavior

We will create a student table then apply Delete and Truncate within Transaction and rollback. Leave your comment about the behavior and/or output you expect.

--Create student table
CREATE TABLE [DBO].[TTB_STUDENT](
ID INT IDENTITY(1, 1) PRIMARY KEY,
NAME VARCHAR(100) NULL,
AGE INT NULL,
ACTIVE CHAR(1))

--Insert some records
INSERT INTO [DBO].[TTB_STUDENT] VALUES
    ('Mark', 32, 'Y'),
    ('Elaine', 32, 'Y'),
    ('Elton', 32, 'Y'),
    ('John', 32, 'Y'),
    ('Smith', 32, 'Y')

--Delete records using DELETE 
--Rollback to bring back the records
--You can because it write to the log and
--can be rolled back
BEGIN TRANSACTION
DELETE [DBO].[TTB_STUDENT]
ROLLBACK 

--Select table
SELECT * FROM [DBO].[TTB_STUDENT]

Here is the output of the above select - 
 


--Do the same but use TRUNCATE TABLE
--Note that it does not write to log  by definition
BEGIN TRANSACTION
TRUNCATE TABLE [DBO].[TTB_STUDENT]
ROLLBACK 

--What do you expect ??
SELECT * FROM [DBO].[TTB_STUDENT]

Thursday, June 2, 2011

Dynamic vs Static set in SSAS

We deal with named set a lot when it comes to access multi dimensional data from the cube. There are two types of named set – Static and Dynamic. Till SQL Server 2005 we had only Static named set but now in SQL Server 2008 both are supported (static and dynamic named set).

You create named set either using CREATE SET (when defined in the cube) or WITH SET MDX (when defined in the MDX/Report) syntax. In case of static member, it get evaluated when the CREATE SET statement is executed or MDX is executed in case when it is defined in MDX using WITH SET. In case of static named set once it is executed it persists static during whole session or query or cube. In effect what will happen since named set is static and cached it will give same set of members even when you query with different condition and expect different result.

Assume a case where you need to get top 10 product by sales for whole period and then for a month. So after you run the query for whole period, you will end up getting same set of products for a month also when you execute next. So in such glitches we now have option to create dynamic set which works well.

I will explain what exactly I mean by all this.

Create a static set in the cube as under. In my example I am retrieving top 5 products based on sale.

CREATE SET CURRENTCUBE.[TOP PRODUCT SALES]
 AS TOPCOUNT(
        [INVESTMENT PRODUCT].[PRODUCT NAME].[PRODUCT NAME].MEMBERS,
        5,
        [MEASURES].[INV PROD SALES ACTUAL]) ;

Next, run following MDX to get top 5 product sales. Note that I am not using any time slicer here so what I am getting in here is across the time.

SELECT 
{
    [MEASURES].[INV PROD SALES ACTUAL]
} ON COLUMNS,
{
    NONEMPTY([TOP PRODUCT SALES], 
    [MEASURES].[INV PROD SALES ACTUAL])
} ON ROWS
FROM [YOUR_CUBE];

Here is output of above MDX.

Next let’s apply February time filter in the MDX and I expect at least 1-2 different product than one we got above.

SELECT 
{
    [MEASURES].[INV PROD SALES ACTUAL]
} ON COLUMNS,
{
    NONEMPTY([TOP PRODUCT SALES], 
    [MEASURES].[INV PROD SALES ACTUAL])
} ON ROWS
FROM [YOUR CUBE]
WHERE [TIME].[YQMD].[Month].&[2011-02-01T00:00:00]

Here is the output of above MDX - 
  
Note that though we now got different sales actual, it is the same set of products we earlier got when we ran without using any time filter. So what is happening here is when we created the set using CREATE SET statement in the cube and processed it, it evaluated the top 5 products and same set is being returned.

Now let’s change the set type to dynamic and retry the MDX and this time we expect it to be dynamic. Meaning, we should get different set of products based on top 5 sales when passed different time filter.

CREATE DYNAMIC SET CURRENTCUBE.[TOP PRODUCT SALES]
 AS TOPCOUNT(
        [INVESTMENT PRODUCT].[PRODUCT NAME].[PRODUCT NAME].MEMBERS,
        5,
        [MEASURES].[INV PROD SALES ACTUAL]) ;

Now run above 2 MDX queries and analyze the result.They should be different now.
SELECT 
{
    [MEASURES].[INV PROD SALES ACTUAL]
} ON COLUMNS,
{
    NONEMPTY([TOP PRODUCT SALES], 
    [MEASURES].[INV PROD SALES ACTUAL])
} ON ROWS
FROM [YOUR CUBE]



SELECT 
{
    [MEASURES].[INV PROD SALES ACTUAL]
} ON COLUMNS,
{
    NONEMPTY([TOP PRODUCT SALES], 
    [MEASURES].[INV PROD SALES ACTUAL])
} ON ROWS
FROM [YOUR CUBE]
WHERE [TIME].[YQMD].[Month].&[2011-02-01T00:00:00]

Now you would have noticed that we are getting different set of top 5 products and it is correct now and reflecting per sales. Please leave your feedback/comment.