Data Cleansing: How To Clean 34 Layers Of SaaS Customer Data

This is chapter two in our 5-part series in the customer data academy - subscribe above for more!

Previously, we’ve talked about the building a schemaless customer data model with the “Customer Stack” - a closed loop of customer data between messaging, tracking, profiling, and actions. This means we can act and react to each customer. We also discussed the importance of a customer profile.

In this chapter, we’re discussing data cleansing - how do you get your customer data fit for your customer operations.

Your customer operations is only as good as your data

What makes the building blocks of your customer operations?

It’s data!

Data’s the raw material you need to work with to power everything:

✅ Trigger and personalize messages
✅ Trigger activities for sales
✅ Track reactions from customers
✅ Progressively profile users and accounts
✅ Drive lead qualification, segmentation and targeting
✅ Power automated workflows

But your data is messy

Data is a raw material. Raw data needs to be treated like iron ore, murky seawater, or harvested crops. It needs to be cleansed, processed, and distributed - “data cleansing” - before it can be useful.

Hull - Data Cleansing - Layers of data puke

It’s no good trying to make this all happen with data puke. Avinash has talked about data puke before in his seminal post on the difference between web reporting and web analysis.

At the highest level, analysis is making your data useful through insight. But this is only ONE way to make your customer data useful. We want to make our data actionable!

Most teams fail to create an effective customer data cleansing strategy, so they lose out on these chances to take action with their data. Data puke results in:

🚫 Failure to send messages to their target audience
🚫 Incapable of sending personalized messages
🚫 Failure to trigger activities for sales
🚫 Incapable of tracking customer reactions
🚫 Incapable of profiling users and accounts
🚫 Unqualified leads, poor segmentation, and untargeted messages
🚫 Incapable of sending automated workflows

How did you end up spewing junk data everywhere?

If you never architected a house, would you be shocked if it fell down?

If you never followed a tried-and-tested recipe, would you be shocked if your creation tasted foul?

If you bolted different, maligned components to your car, would you be shocked if it didn’t work?

Of course not. Yet, this is exactly what happens with customer data management. In most scaling companies, customer data is FUBAR.

The problem is you never had a customer data management strategy from the get go.

Your teams, tools, and data were NOT all carefully chosen, precisely setup by someone who knows what they were doing.

One tool after another is added to the stack. Each tool tells a different fragment of the whole story - even if they’re never integrated with each other.

Hull - Data Cleansing - Something missing here

Google Analytics was probably your first “Wild West” wasteland. It’s free, easy to install, and tells you vanity metrics out-of-the-box - OMG how many people came to our website from our TechCrunch launch article?!

Of course, you’re going to have GA installed. But your pages, website, event tracking, your setup within Google Analytics is going to be touched and adapted over time by a multitude of different people.

But Google Analytics is most useful and actionable with clean, consistent, complete tracking across your website, and has more value with more traffic. By the time your website traffic has grown, and you’ve tasked a friend, team, or some costly consultant to “fix it”, your Google Analytics is already FUBAR.

But that’s not even the worst tool for that “cowboy data” attitude syndrome.

Checkout your CRM…

Checkout your (lack of) product tracking…

Checkout your database of users…

How confident are you that your team can react to ANY activity in ANY tool?

Your. Data. Is. S i l o e d. And. brR0 kEn.

The biggest burden of unclean, unprocessed customer data is the human cost on your team.

We have a whole separate chapter to help you fix your “Wild West” of customer data (subscribe above to get it). But with your “data strategy” run by cowboys, it creates a ton of data puke that’s a burden for other teams.

For instance, marketing may pull together a list of leads, qualify leads based on defunct email addresses, and then lob them over the wall to sales. They may never feel the consequences, but they incur a huge human cost on another team.

Data that is not immediately accessible and useful is NOT good enough.

Any data that isn’t relevant, timely, precise, correct, complete, credible incurs work for other teams:

  • Is this useful?
  • Do I have to do anything to this data to make it useful?
  • If so, how do I do that?

Teams struggle to trust data to depend on it. Data has to be clean and useful instead of siloed puke.

Customer data cleansing has three areas

  1. Combine - Capture and store customer data
  2. Connect - Deduplicate and merge contacts with common identifiers
  3. Compute - Cleanup, reformat, and process customer data with computation

By combining, connect, and computing your customer data, you can transform data puke (scattered raw material) into high-grade, actionable fuel for your customer operations.

Let’s discuss them in turn.

Capture and store customer data

From the customer stack model, we know that customer profiling is at the heart of your customer data management. You need to build a complete picture of each customer and account over their entire customer lifecycle.

But in real life, there’s a ton of headaches which make this really hard to pull off…

Hull - Data Cleansing - Combine

Let’s start with how data is captured.

Data Puke Layer #1: Humans are NOT consistent in how they enter data. Any form fill, spreadsheet or manually inputted data is usually messy.

Data Puke Layer #2: Scraped data is often a mess too. Since it’s built by humans (see above), it’s often chaotic in structure, style, and completeness.

Data enrichment with clean data can help verify and validate this data, but this can be expensive.

Data Puke Layer #3: The “Dump” approach. Customer data cowboys will often “dump” data into tools via some import tool for sales prospecting, testing new tools, or adding data from a new kind of source.

Incriminating phrases include “we’ll figure it out later”, “we’ll probably hire an intern”, and “we’ll clean it up afterwards”.

Data Puke Layer #4: The integration setup firehose. Integrations can add huge value to different tools, but setting up new integrations usually involves an initial “dump” of data (see above). Rarely is this a small test to see how the tool uses the data, usually you end up with a ton of junk data in one fell swoop.

If data wasn’t so intangible (imagine this with iron ore or something), this would be visibly ridiculous.

Incriminating phrases include “I want to play around with the X integration” and “I just plugged in X to {really important tool}”.

Data Puke Layer #5: Not every tool can take unstructured data. Modern web apps are getting better and better at parsing junky data in their own way, but many tools are not geared for this at all. Instead of attempting to self-format unstructured data (like from a CSV importer), they reject the data outright.

This isn’t always consistent either, so different tools have different versions and ‘completeness’ of customer data.

Data Puke Layer #6: Structured data is different between tools. If you’ve ever configured any tool built by Salesforce, you’ll know exactly what this feels like. An entire customer profile must be perfectly formatted, else it’s rejected. The onus is 100% on you - the customer data cowboys - to make sure, somehow, everything is accurate, consistent, complete, and able to pass in correctly.

Data Puke Layer #7: Small tweaks break EVERYTHING. Perhaps a small innocent update to a form, or updating the format of something, makes sense on its own. But in the wider CRM system, these small tweaks can break entire workflows because inflexible tools can no longer accept the updated data types. Customer data management is like treading on thin glass.

Data Puke Layer #8: Legacy “silent archives” still house useful data. Inevitably, tools and databases get retired from use. But the data that lies within these data graveyards can still be useful. With older formats, data structures, and integration methods, it can be a pain-in-the-ass to re-build the entire customer journey.

Data Puke Layer #9: Different systems are NOT up-to-date. Whatever method you use for customer data integration, the data captured in one place does not always mirror the “same” data in another tool in near real-time. For your team, this can mean they see two different versions of the same thing - and they don’t know which version to trust and depend on.

The solution? Store all your customer data in one place.

Don’t leave your customer data siloed and scattered. Combine it all into one database instead.

Make sure you can store any customer data in any form - a “schemaless” database. You need to be able to capture the whole customer journey in whatever form that appears. This rules out tools like Salesforce which can only take data which fits their mold.

But all this data still needs to be pieced together. The next step is to combine that data around each person and account.

Deduplicate and merge contacts with common identifiers

With every tool tracking your users, customers, and accounts, you will find a partial customer profile. The challenge is to connect together this scattered jigsaw of each customer profile so you can see (and use) the complete picture on each person and company in every tool and database.

  • Who they are: Form submissions, CRM profiles, email contact profiles, user signup data, enrichment data
  • What they’ve done: Analytics events, product usage, email opens, sales activities, etc.
  • Who they work for: Form submissions, CRM profiles, sales activities, email contact profiles, enrichment data, IP address

But in real life, there’s a ton of headaches which make this hard to pull off…

Hull - Data Cleansing - Connect

Data Puke Layer #10: Merging on name is dangerous. How many “Joe Bloggs” are in your database? You may have no real idea. Each person’s name is individual to them, but not in the context of a larger userbase. This rules it out from being a reliable identifier.

Data Puke Layer #11: Users have both private and work email addresses. Particularly painful for B2B companies, signup with and other freemail addresses obscures their employer and other crucial details.

Data Puke Layer #12: Only using an email address as an identifier. This means there’s no context on a user journey before their email signup. It also means the customer journey is lost if their email is changed, updated, or submitted as a new email address (like for a newsletter).

Data Puke Layer #13: Multiple email addresses for the same person. It becomes very difficult to build a complete picture from many “partial profiles” where one person has signed up to multiple things with different email addresses (newsletter, product trial, new account, etc.).

Data Puke Layer #14: Duplicate email addresses in the same database. This is surprising but catches many teams out. Having the same email address multiple times in the same database (sometimes due to error like Upper/lowercase - see #1) but tools aren’t 100% consistent are reconciling around email addresses. This is too common to overlook.

Data Puke Layer #15: Every system has a different ID system. Your Salesforce, Intercom, email tool, and backend database will all have a different identifier system. The same person will have multiple IDs of different formats - like having many different passports. These all need to be matched up 1:1.

Data Puke Layer #16: Duplicate and incorrect identifiers. This is more common on backend databases built by in-house engineers where some inconsistency or bug leads to duplicate IDs for each person, splitting up the customer journey.

Data Puke Layer #17: You need to map your identifiers between EVERY system. Most teams don’t have a global, leading identifier to act as the “master ID” across all the tools and databases they’re using. Only once you have a common identifier can you map together the entire customer profile and customer journey.

Data Puke Layer #18: Legacy systems can have incorrect identifiers. Different tools and databases will develop their identification methods over time. Sometimes legacy tools and legacy data can get left behind, which leaves fragments of the customer story out of touch.

Data Puke Layer #19: Relying on a tool to match up customer data that does NOT actually work. Some tools have a system called Aliasing to try and combine customer data through an API call to associate multiple IDs with one person. Sometimes the matching doesn’t work, and data is left behind.

Data Puke Layer #20: Few tools support automated updating and merging of contacts and accounts. Even for tools which do support this, you have to code and define the conditions where you want to merge contacts correctly - without messing up your database. This is very hard to test and diagnose accuracy too.

Data Puke Layer #21: Merging contacts needs a master identifier. Some tools redirect identifiers (like overwriting email addresses and IDs) instead of tagging both in and using all historical data.

Data Puke Layer #22: It’s hard to implement custom logic to merge contacts in everyday tools. Many B2B businesses have complex relationships between users of their product which need custom logic to map together. Many sales, marketing, and customer-facing tools aren’t built for this custom logic, so it has to be engineered elsewhere and synced back in some frankensteins-form.

Data Puke Layer #23: It’s hard to setup hidden form fields. One common method for combining analytics and customer profile data is to pass through hidden fields on a form submission (like a user signup), but this can be hard to setup with your many tools. This usually involves custom engineering, which is resource constrained, and limits its rollout only to key forms like signups.

The solution? One customer profile to power them all.

Instead of wrestling duplicates and partial profiles with dozens of different identifiers, have one “master identifier” and one “master profile” for each customer.

With all your data in one place, you can combine data all your tools and databases into your master profile. Now, you have the full picture of each customer and account all in one place.

  • Who they are
  • What they’ve done
  • Who they work for

But all this data about each person is not ready to be used by any team in any tool.

Cleanup, reformat, and process customer data with computation

Even with all your data captured and stored, and all your different “partial profiles” deduplicated and merged together with a common identifier system, you still have the challenge of cleansing your data.

Remember, the original goal is to get data into an immediately useful format.

Hull - Data Cleansing - Compute

Often, this will involve taking “raw” data and computing “new data” with it. For instance:

  • Cleansing the name on profiles to be consistent
  • Enriching an email address with social, firmographic, and other data
  • Scoring an account for sales
  • Segmenting a profile for different messages and sales activities
  • Transforming a stream of events into a new attribute to sync to another tool

But real life happens. There’s a ton of headaches which make cleaning, reformatting, and processing data hard.

Data Puke Layer #24: No one system can digest all data. Any kind of event, attribute, table, or anything can’t be imported and used immediately in all your tools. Your backend database is too far detached from where the work gets done to be useful too.

Data Puke Layer #25: Tools can’t compute data they can’t digest. Following the headache above, tools can’t compute (cleanse, segment, score etc.) data that they can’t digest. This results in sub-optimal “versions” of different criteria in each tool.

Data Puke Layer #26: Not every tool can do every computation. A simple example of this is segmentation. Mailchimp is very popular but has a very basic segmentation tool. is another email tool, with a much more advanced segmentation tool. With differences like this across your tools, you have approximations and abstractions (like meaningless b.s. scores) which aren’t useful.

Data Puke Layer #27: Not every tool can process data fast enough. Delays in processing and syncing data results in two different versions of the same data, which undermines teams trust in the data they see and use. See Data Puke Layer #9.

Data Puke Layer #28: Spreadsheet lookups and matching is possible, but hard. There’s a reason the fastest growing companies don’t depend on spreadsheets to define their customer logic given the volume and complexity of mapping all their data together.

Data Puke Layer #29: No changelogs of data. There’s no trail of how customer data has been changed or updated over time, so there’s no record of what’s happened.

Data Puke Layer #30: No documentation of data management. As well as not seeing what has changed (see Data Puke Layer #29), there’s no understanding why it has changed.

Data Puke Layer #31: No idea if and when data processing has completed. Most computation tools are NOT transparent in how their data is processed. No progress bar. No real error messages. No time to compute.

Data Puke Layer #32: Customer data warehouses and business intelligence tools cannot run necessary computations. Sure, some tools can be overlaid to run some basic analysis. But this is NOT the “immediately useful” customer data we need in all our tools.

Data Puke Layer #33: Computation is too hard and slow that we cannot experiment fast. With near-instant computation and data processing, more experimentation, queries, and ideas can be tested quickly.

Data Puke Layer #34: There’s exponential complexity with the number of tools we add. Each new tool adds its own data and logic to the spider's web of tools and databases, and each needs an integration with each other.

The solution? One central “brain” for your customer data.

Instead of “praying and spraying” whatever data you can into whatever tools can digest your computed data, create it once in a central customer data hub, then sync that data to your other tools.

  • Cleansing data before it ever reaches your other tools
  • Enriching data once, then sync ALL the enriched data to all your tools
  • Scoring data centrally, then updating the score each time there’s an update to the master profile.
  • Segmenting profiles centrally using all your data (and updating each segment in real-time), then syncing precise segments to all your tools to use.
  • Transforming your data to be useful and “accepted” in every tool before it is synced.

Hull - Data Cleansing - Brain

Customer data platforms make total data cleansing possible

Combining data in one place is NOT enough.

Combining data around each person is NOT enough.

Cleansing and computing data centrally is NOT enough.

You need bring all three of these tasks together - combine, connect, compute - to truly cleanse your customer data and make it immediately useful in all your tools.

Customer data platforms are designed to solve this problem from start to finish - learn more how customer data platforms like Hull work.

Next, how do you sync your customer data together?

In the next chapter of our customer data academy, learn best practices and methods for customer data integration.

Fear your team’s customer data puke? Get help, anonymously.

Share your problems with our team and partners to answer. We’ll record a short video for you to share discussing your problem and possible solutions. Share your Frankenstack's Anonymous submission here.

Ed Fry

Prev 'Ed of Growth at Hull, working on all things content, acquisition & conversion. Conference speaker, flight hacker, prev. employee #1 at (acq. HubSpot). Now at Behind The Growth

If you've questions or ideas, I'd love to geek out together on Twitter or LinkedIn. 👇