Thursday, February 3, 2011

SQL find duplicate using Group and Partition By

There are various ways to find duplicate records in SQL Server. Like using group and count, partitioning result set etc.

This article talks about finding duplicate records using two of these methods. I will straight away jump into the SQL example for you to understand easily and quickly.

Create a table name with [STG].[TMS_TOP_PROGRAMS] with following entries into it. This table I have all the programs along with some other attributes like ID, Sort Order etc -

PROGRAM_ID PROGRAM_NAME SORT_ORDER BATCH_ID
11 Align the Right Customers to the Right Solutions at Optimized cost 1 100
11 Align the Right Customers to the Right Solutions at Optimized cost 1 100
12 Create competitive advantage through business analytics and benchmarking 6 100
13 Deploy future state CSSG operating model 2 100
13 Deploy future state CSSG operating model 2 100
14 Enhance employee engagement 7 100
14 Enhance employee engagement 7 100
15 Expand Direct Sales Capabilities to Grow Revenue 3 100
16 Optimize Service Contacts to Create Sales Opportunities 4 100
16 Optimize Service Contacts to Create Sales Opportunities 4 100
17 Simplify End to End Customer Experience 5 100
17 Simplify End to End Customer Experience 5 100
6 OpEx Enterprise-Wide Programs 6 100
6 OpEx Enterprise-Wide Programs 6 100

As you see there are lot of duplicate programs in this table. One ways to find duplicate records is by using Group clause as under -

SELECT 
PROGRAM_ID, [PROGRAM_NAME], KEY_PROGRAM
,
SORT_ORDER, [BATCH_ID], COUNT(PROGRAM_ID
)
FROM [STG].[TMS_TOP_PROGRAMS]
GROUP BY PROGRAM_ID, [PROGRAM_NAME], KEY_PROGRAM, PROGRAM_SORT_ORDER, [BATCH_ID]
HAVING COUNT(PROGRAM_ID) > 1

Another and yet more efficient way to find duplicate by partitioning the table based on the key, here is what I mean..

SELECT PROGRAM_ID, [PROGRAM_NAME], KEY_PROGRAM, PROGRAM_SORT_ORDER, [BATCH_ID], RANK_ID FROM
(
SELECT PROGRAM_ID, [PROGRAM_NAME], KEY_PROGRAM
,
SORT_ORDER, [BATCH_ID]
,
ROW_NUMBER() OVER (
PARTITION BY PROGRAM_ID ORDER BY PROGRAM_ID
) AS RANK_ID
FROM [STG].[TMS_TOP_PROGRAMS]
) AS INTBL WHERE RANK_ID > 1

No comments:

Post a Comment