Socks: Data Matching Lessons from the Laundry Pile

Ten years ago, before I specialized in Master Data Management matching, my wife and I had a friendly sock-matching competition. I am a self-proclaimed expert at everything; my wife, a self-proclaimed expert at proving me wrong. The goal: out of a pile of laundry, we each had to find our respective socks, match them, roll them, and put them away. There was an equal number of socks between us, fresh out of the laundry. Fastest person wins. Game on!

It started off great for me: I grabbed a sock, searched through the pile, quickly found its twin, and put it aside. I did this again, and again. My head was down, I was completely focused, in the zone. I was finding pair after pair after pair. I felt victory just around the corner. And then, “Would you like some help?” My wife was already doing her victory dance.

I asked her how she did it. “Simple,” she said. “You were picking out one sock at a time and looking for its mate through the entire pile. I simply found all of my socks, grouped them together in one pile, and matched the socks within the group. Much more efficient, don’t you think, dear?”

I soon realized that instead of comparing socks to shirts and pants, I should have created a smaller candidate group of socks. That would have let me match socks to socks based on qualifying characteristics such as light or dark, dress or sport, knee-length or ankle-length, brand name or generic, and colour pattern.

This lesson has stayed with me as I have gone on to define data matching methodologies for highly complex MDM systems. It’s safe to say my trials in sock matching optimization provided me with a foundation to better prepare for data matching processes.

Matching data can be cumbersome. For one, where do you start? With socks, you know what you are trying to match and exactly what your optimal outcome is. It’s a simple manual process. With data, there are many different approaches to take, each with varying results. To manage the complexity, we need to automate the matching process as much as possible, and we need a tool that can handle matching algorithms or, at a minimum, complex ETL joins.

So, how do you home in on an optimal strategy for matching your data? Let’s break it down into steps.

1. Set Your Match Scope

First, to understand your scope, determine exactly what you are trying to match. Is it customer data, address data, or relationship data? Are you going to match records within a single data set or across different data sets? In the sock scenario, for example, we knew we were trying to match socks within a specific pile of laundry.

2. Determine Your Match Strategy

Once you know what you’re trying to match, the next step is to determine a strategy. How are you going to match the records? Are you going to match incoming records to other incoming records, or match them to a golden master record (a previously matched and enriched record that provides your source of truth)? For our scenario, we were matching single socks to single socks.
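
To make the two strategies concrete, here is a minimal Python sketch. The records, the field names (name, dob), and both helper functions are hypothetical, made up purely for illustration; a real MDM tool would do this with its own matching engine.

```python
# Hypothetical data: two golden master records and three incoming records.
golden = {
    "G1": {"name": "ann lee", "dob": "1980-02-01"},
    "G2": {"name": "bob roy", "dob": "1975-07-19"},
}
incoming = [
    {"id": "A", "name": "ann lee", "dob": "1980-02-01"},
    {"id": "B", "name": "cy wu", "dob": "1990-11-30"},
    {"id": "C", "name": "cy wu", "dob": "1990-11-30"},
]

def key(record):
    """Comparison key used by both strategies (an illustrative choice)."""
    return (record["name"], record["dob"])

def match_to_golden(incoming, golden):
    """Strategy 1: look up each incoming record against the golden masters."""
    index = {key(rec): gid for gid, rec in golden.items()}
    return {rec["id"]: index.get(key(rec)) for rec in incoming}

def match_within_batch(incoming):
    """Strategy 2: find incoming records that match each other."""
    groups = {}
    for rec in incoming:
        groups.setdefault(key(rec), []).append(rec["id"])
    return [ids for ids in groups.values() if len(ids) > 1]

to_golden = match_to_golden(incoming, golden)
within_batch = match_within_batch(incoming)
```

In this toy data, record A finds its golden master, while B and C find no master but do match each other. That is the kind of difference the choice of strategy makes.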

3. Choose Your Match Type

Then, you need to choose either an exact or a fuzzy match type. Exact matching uses rules that require a stringent comparison; fuzzy matching allows a degree of variance. For our scenario, we were matching socks using an exact comparison.
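
As a rough illustration of the difference, the sketch below compares two name strings both ways. It uses SequenceMatcher from Python's standard difflib module as a stand-in similarity measure; the 0.85 threshold and the normalize helper are assumptions for this example, not recommendations.

```python
from difflib import SequenceMatcher

def normalize(value):
    """Lowercase and collapse repeated whitespace before comparing."""
    return " ".join(value.lower().split())

def exact_match(a, b):
    """Stringent comparison: normalized values must be identical."""
    return normalize(a) == normalize(b)

def fuzzy_match(a, b, threshold=0.85):
    """Comparison with a degree of variance: similarity ratio must reach
    the threshold (0.85 is an arbitrary illustrative value)."""
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold

same_exact = exact_match("Jon Smith", "jon  smith")
typo_exact = exact_match("Jon Smith", "John Smith")
typo_fuzzy = fuzzy_match("Jon Smith", "John Smith")
other_fuzzy = fuzzy_match("Jon Smith", "Mary Jones")
```

Here "Jon Smith" and "John Smith" fail the exact comparison but pass the fuzzy one, while "Mary Jones" fails both.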

4. Determine Your Match Methods

Going even further, decide on either probabilistic or deterministic rules. The probabilistic approach assumes records are probably a match because a few of their most important characteristics are similar. A weighted average is calculated to determine match probability, which makes this more of a black-box approach. The deterministic approach passes the records through a series of defined rules to check for pairs based on our match type, and only results in a match if the rules are satisfied, which lets us know why and how the records were matched. Our sock matching used a deterministic, rules-based approach, ensuring that all characteristics of the socks had to match.
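
The contrast can be sketched in a few lines of Python. This is a deliberately simplified illustration: the field names and weights are invented, and the score is a plain weighted share of agreeing fields rather than the statistical model a commercial matching engine would use.

```python
# Illustrative weights only; real weights would be tuned or estimated from data.
WEIGHTS = {"name": 0.5, "dob": 0.3, "postal_code": 0.2}

def deterministic_match(a, b, rule=("name", "dob")):
    """Match only if every field in the rule is present and equal."""
    return all(a.get(f) is not None and a.get(f) == b.get(f) for f in rule)

def probabilistic_score(a, b, weights=WEIGHTS):
    """Weighted share of fields that agree, between 0 and 1."""
    total = sum(weights.values())
    agreed = sum(
        w for f, w in weights.items()
        if a.get(f) is not None and a.get(f) == b.get(f)
    )
    return agreed / total

a = {"name": "ann lee", "dob": "1980-02-01", "postal_code": "M5V 2T6"}
b = {"name": "ann lee", "dob": "1980-02-01", "postal_code": "K1A 0B1"}
c = {"name": "ann lee", "dob": "1981-02-01", "postal_code": "M5V 2T6"}

rule_ab = deterministic_match(a, b)   # name and dob agree
rule_ac = deterministic_match(a, c)   # dob differs
score_ab = probabilistic_score(a, b)  # name + dob agree
score_ac = probabilistic_score(a, c)  # name + postal_code agree
```

The deterministic rule gives a yes or no that can be traced back to the rule itself. The score only says how similar the records are, so someone still has to decide where to draw the line.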

5. Choose Your Match Fields

Now, pick out the fields that will determine our matching rules. This process can be the most time-consuming, since it requires iterations and refinement that may even change the strategy mid-flight. The goal here is to test out different match rules and combinations of rules to determine the best set of matching results. Will you use a combination of Name, Date of Birth and Identifier? Or a combination of Name, SIN/SSN and Address? Or both, depending on the level of completeness for each field? There is no plug-and-play solution for matching rules; it depends on the type of data you are trying to match, the fields you use, the overall quality of your data, and other factors. Accelerators and algorithms do exist, but mapping to them, as well as ordering, combining and building the overall solution, requires effort and patience. For the socks, we determined that all the sock characteristics would be used as match fields, and we defined one rule: all the characteristics need to match.
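
One way to picture "both, depending on completeness" is an ordered list of rules, where a rule is only attempted when both records have all of its fields filled in. The sketch below is a hypothetical illustration; the rule names, field names, and sample values are all made up.

```python
# Ordered rules: each is a name plus the fields that must all agree.
RULES = [
    ("name_dob_identifier", ("name", "dob", "identifier")),
    ("name_ssn_address", ("name", "ssn", "address")),
]

def is_complete(record, fields):
    """A rule can only be applied if every one of its fields is populated."""
    return all(record.get(f) not in (None, "") for f in fields)

def first_matching_rule(a, b, rules=RULES):
    """Return the name of the first applicable rule that matches, else None."""
    for rule_name, fields in rules:
        if not (is_complete(a, fields) and is_complete(b, fields)):
            continue  # skip rules whose fields are missing on either record
        if all(a[f] == b[f] for f in fields):
            return rule_name
    return None

a = {"name": "ann lee", "dob": "1980-02-01", "identifier": "",
     "ssn": "000-00-0000", "address": "1 main st"}
b = {"name": "ann lee", "dob": "1980-02-01", "identifier": "X1",
     "ssn": "000-00-0000", "address": "1 main st"}

matched_by = first_matching_rule(a, b)
```

Because record a has no identifier, the first rule is skipped and the pair is matched by the second rule. Recording which rule fired also keeps the result explainable.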

6. Define Your Candidate Groups

Having determined our matching fields, we next define the candidate groups: sub-sets of the data that limit the number of match attempts needed for each record. We can choose our candidate groups either before or after we choose our match fields, but we will likely revise our candidate grouping strategy after reviewing some preliminary match results. Based on the match rules, we will select a few fields to determine how to group records together. Do we choose multiple fields and create a much larger group? Or do we set a hierarchy of operations and narrow down our match candidate groups? The sock scenario used a hierarchical candidate grouping strategy: first socks, then my wife’s socks and my socks, and, for a further level of optimization, lights and darks. This grouping allowed our match rules to run more efficiently and with less overhead.
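
The saving is easy to see by counting comparisons. The sketch below is a toy illustration with a made-up laundry pile and a hypothetical candidate_pairs helper; it groups items by type, owner, and shade, and only compares items within the same group.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records, group_key):
    """Yield pairs of records that fall into the same candidate group."""
    groups = defaultdict(list)
    for rec in records:
        groups[group_key(rec)].append(rec)
    for members in groups.values():
        yield from combinations(members, 2)

# A tiny, invented laundry pile.
laundry = [
    {"id": 1, "type": "sock", "owner": "me", "shade": "dark"},
    {"id": 2, "type": "sock", "owner": "me", "shade": "dark"},
    {"id": 3, "type": "sock", "owner": "wife", "shade": "light"},
    {"id": 4, "type": "sock", "owner": "wife", "shade": "light"},
    {"id": 5, "type": "shirt", "owner": "me", "shade": "light"},
    {"id": 6, "type": "pants", "owner": "wife", "shade": "dark"},
]

# Without grouping: every item is compared with every other item.
all_pairs = list(combinations(laundry, 2))

# With grouping: compare only within (type, owner, shade) groups.
grouped_pairs = list(
    candidate_pairs(laundry, lambda r: (r["type"], r["owner"], r["shade"]))
)

print(len(all_pairs), len(grouped_pairs))
```

For these six items, the ungrouped approach makes 15 comparisons and the grouped approach makes 2. The trade-off is that a true match placed in the wrong group is never compared at all, which is one reason the grouping strategy tends to get revised after reviewing early results.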

7. Manage Your Exceptions

Of course, there are exceptions to every rule. What happens if you don’t find an exact match, but you find a similar match? Do you want to lose this potential match, or investigate it further? What happens if you find no match at all? It may not be worth the effort to build rules to match every single record. Aim for a majority of records first, then refine your rules to address the rest. To manage these match exceptions, develop a process to identify them and pass them off to a steward for manual investigation, along with enough information to make decisions on the data. For our sock scenario, we always want an exact match to another sock, and if we can’t find the matching sock anywhere, well then, it might just be relegated to the dreaded remainder drawer, from which few socks return!
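
A common way to implement this kind of hand-off is to route each candidate pair by its match score into one of three queues. The sketch below is an assumption-laden illustration: the thresholds (0.9 and 0.7), the queue names, and the sample scores are all invented for the example.

```python
def route(score, auto_match=0.9, review=0.7):
    """Decide what to do with a scored pair (thresholds are illustrative)."""
    if score >= auto_match:
        return "match"
    if score >= review:
        return "steward_review"  # similar but not certain: a person decides
    return "no_match"

def triage(scored_pairs):
    """Sort (pair_id, score) tuples into the three queues."""
    queues = {"match": [], "steward_review": [], "no_match": []}
    for pair_id, score in scored_pairs:
        queues[route(score)].append(pair_id)
    return queues

queues = triage([("p1", 0.97), ("p2", 0.78), ("p3", 0.31), ("p4", 0.90)])
```

In this example p1 and p4 are matched automatically, p2 goes to a steward, and p3 is left unmatched. In practice, the review queue would also carry the compared field values so the steward has enough information to decide.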

Imagine the victory dance I could have done if I’d known then what I know now!

We’re only just scratching the surface. In my next post, I’ll talk about the importance of Data Quality for matching. Then later, we’ll be primed to discuss different methods for merging records together to define the all-powerful Golden Master Record.

Rahim Hajee DQ, MDM, and RDM Practice Lead
