DELPHIX MASKING SENSITIVE DATA PROFILING

Today's article discusses one of the most useful features of the Delphix masking tool: the profiler

The idea behind it is to help customers detect whether table columns are sensitive, based on their contents

The Masking profiler uses two different methods to identify the sensitivity of data:

Column level (out of scope for this article): queries the database metadata of the target database and looks for specific column names (e.g. a column named CITY is expected to contain city names)

Data level (the subject of this article): looks at the data itself, using a sampling algorithm, to see whether a column contains sensitive data

The data profiler samples the first n rows of the column (n being 100, 1,000, 10,000, 100,000, and so on) and tries to match the sampled values against the profile expressions, which are Java regular expressions

A column is considered a match when at least 80% of the sampled values match the expression, as defined by the NO_OF_ROWS=100 and PERCENTAGE_REQUIRED=80 parameters in the configuration file kettle-profiling.properties
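The sampling and threshold rule described above can be sketched in a few lines of Python. This is only an illustration, not the profiler's actual code: the function name column_is_sensitive, the treatment of NULL values as non-matching, and the ZIP-code test data are all assumptions, and Python's re module stands in for Java regular expressions.

```python
import re
from itertools import islice

# Values taken from the article's kettle-profiling.properties example.
NO_OF_ROWS = 100          # how many leading rows are sampled
PERCENTAGE_REQUIRED = 80  # minimum percentage of sampled values that must match


def column_is_sensitive(values, pattern,
                        no_of_rows=NO_OF_ROWS,
                        percentage_required=PERCENTAGE_REQUIRED):
    """Return True if enough of the first `no_of_rows` values match `pattern`.

    Hypothetical helper: NULLs (None) are counted as non-matching, which is
    an assumption about how the real profiler behaves.
    """
    sample = list(islice(values, no_of_rows))
    if not sample:
        return False
    regex = re.compile(pattern)
    matches = sum(1 for v in sample
                  if v is not None and regex.search(str(v)))
    # Integer comparison avoids floating-point rounding at the threshold.
    return matches * 100 >= percentage_required * len(sample)


# Made-up column: 85 values that look like 5-digit ZIP codes, 15 that do not.
zip_like = ["%05d" % i for i in range(85)] + ["n/a"] * 15
print(column_is_sensitive(zip_like, r"^\d{5}$"))  # True (85% >= 80%)
```

With only 79 matching values out of 100, the same call would return False, because the sample falls just below the 80% threshold.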

Let’s demonstrate it by creating a profile and a regular expression to profile email addresses in one of the demo tables with 100 rows

Create the "MY_EMAIL_DL" expression as follows
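The exact expression is entered in the masking UI. As an illustration of what a data-level expression for email addresses could look like, here is a hypothetical pattern (an assumption, not necessarily the one used in this demo), checked with Python's re module. The syntax used here is also valid in Java regular expressions.

```python
import re

# Hypothetical email pattern for illustration only; it is deliberately
# simple and does not cover every address allowed by the email standards.
EMAIL_PATTERN = r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$"
email_re = re.compile(EMAIL_PATTERN)


def looks_like_email(value):
    """Return True if the whole string matches the sketch email pattern."""
    return email_re.fullmatch(value) is not None


# Made-up sample values, similar to what an EMAIL column might hold.
samples = [
    "john.doe@example.com",
    "jane_roe@mail.example.org",
    "not an email",
    "missing-at.example.com",
]
for s in samples:
    print(s, "->", looks_like_email(s))
```

A looser pattern tags more columns at the risk of false positives, while a stricter one may miss columns whose values are slightly irregular, so the pattern is worth testing against real sample values before profiling.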

To keep the demo simple, create a profile set "EMAIL PF" and assign the previous expression to it

Create a connector and a ruleset for the demo table "MEDICAL RECORDS", which has 100 or more rows

Let's check the inventory before using the data profiler

Create a profiler job using the created ruleset and profile set

Execute the profiler job and, if I did my job right :), the EMAIL column will be tagged as sensitive and assigned the EMAIL domain

Here we are: the profiler tagged the EMAIL column as sensitive based on its content

I hope this article helped you understand what’s going on behind the scenes when using the Delphix masking data profiler