Using Checksums in SQL to Find Groupings of Like Data

File under “Quick and dirty, but it works.”

September 11, 2010

OK, this is a bit of an edge case and there are probably better ways to approach it, but it took me less than five minutes to whip up and saved someone else in our company a bunch of time.

We have at least two custom apps (neither of which I had any part of writing, BTW – as the post goes on you’ll see why I disavow them) that use denormalized tables that hold both rows for individual items (users in one case, pick list items in the other) and groups (user groups and pick list names, respectively) in the same table. Then, in both apps there is a second “mapping” table that maps the individual items to the group items. Perhaps a sample would be clearer.

Here is the Users table – misnamed, since it holds both user and group definitions:

Id Name Type
1 Admins Group
2 Joe User
3 Jim User
4 Lusers Group
5 Jon User
6 Jack User
7 Jerry User


And here is the GroupUser table – the “mapping” table:

GroupId UserId
1 2
1 3
4 5
4 6
4 7


With me so far? If you haven’t clawed your eyes out by now, it’s probably because you’ve seen this sort of thing, too. Heinous, heinous stuff. Especially because that Users data table doesn’t just have those three columns in it – no, it has other columns, some of them holding orthogonal data based on what the Type column contains.

Now, consider a really flawed application that we want to rewrite. The application depends on two tables such as the above to hold users and groups in the Users table, with the GroupUser mapping table holding which users are in which groups. For reasons that are too painful to even contemplate the original authors of the app, which deals with reports, decided that each report should have it’s own group for defining who can access it. Worse, they didn’t allow for groups within groups, which means there are literally hundreds of groups (like, almost 800), many of which hold – you guessed it – identical group members, all manually maintained as reports are added or people are hired or leave or change job positions. So obviously as we rewrite the app we want to be able to define such groups once and reuse them across reports.

How to figure out which groups are identical in terms of user membership? Well, one way would be to have the person who administers them sit for a day and figure it all out by hand and hope they don’t miss the fact that some groups may differ from others by only one member (and some of these groups can hold a hundred users in them, so good luck with that). Instead, I came up with the following using T-SQL’s CHECKSUM function.

Here is it in all its glory:

SELECT
    [Group],
    SUM(Checksum) AS Total
FROM
(
    SELECT
        G.NAME AS [Group],
        CAST(CHECKSUM(U.NAME) AS BIGINT) AS Checksum
    FROM GROUPUSER GU
    INNER JOIN USERS G ON
        GU.GROUPID = G.ID
        AND G.TYPE = 'Group'
    INNER JOIN USERS U ON
        GU.USERID = U.ID
        AND U.TYPE = 'User'
    GROUP BY G.NAME, U.NAME
) A
GROUP BY [Group]
ORDER BY 2, [Group] 

Basically, this joins each user to their appropriate groups (a row per group/user combo) and takes the checksum of the user’s name (login id, actually). Then those checksums are summed together by group and displayed in order by those sums and then group names underneath that. When looking at the checksums all the rows with the same sum likely hold the same group members.

Now, purists will note there is some chance of collision with checksums, and summing checksums certainly raises the chances of that slightly, but in our case the results that came out look right, in terms of the person who admins all the groups looking at the clusters and saying, “Yes, those groups all have the same members in them.” So I wouldn’t claim this would work across really large data sets but for the purpose at hand it did Just Fine, and five minutes worth of work saved someone a lot of manual cross-checking.