DELPHIX MASKING SENSITIVE DATA PROFILING


Today's article discusses one of the greatest features of the Delphix masking tool: the profiler

The idea behind it is to help customers detect whether table columns are sensitive, based on their contents

The Masking profiler uses two different methods to identify the sensitivity of data:

  • Column level (out of scope of this article): queries the database metadata of the target database and looks for specific column names (e.g. a column named CITY will most likely contain city names)


  • Data level (subject of this article): looks at the data itself, using a sampling algorithm, to see whether it contains any sensitive data

The data profiler takes a sample of the first n rows of the column (n being 100, 1,000, 10,000, 100,000, and so on) and tries to match it against the profile expressions (Java regular expressions)

An expression must match at least 80% of the sampled rows for the column to be flagged, as defined by the NO_OF_ROWS=100 and PERCENTAGE_REQUIRED=80 parameters in the kettle-profiling.properties configuration file
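To make the sampling rule concrete, here is a minimal sketch of the idea in Java. It is not Delphix code: the class name DataProfilerSketch and the method isSensitive are made up for illustration, and two details are assumptions of mine (the expression is applied with find(), and null values count as non-matching). Only the two constants come from the parameters quoted above

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of data-level profiling; not the Delphix implementation.
class DataProfilerSketch {

    // Values quoted from kettle-profiling.properties in the article.
    static final int NO_OF_ROWS = 100;
    static final int PERCENTAGE_REQUIRED = 80;

    // Returns true when enough of the sampled values match the expression.
    static boolean isSensitive(List<String> columnValues, Pattern expression) {
        // Sample only the first NO_OF_ROWS values of the column.
        int sampleSize = Math.min(NO_OF_ROWS, columnValues.size());
        if (sampleSize == 0) {
            return false; // nothing to profile
        }
        List<String> sample = columnValues.subList(0, sampleSize);

        // Assumption: null values are treated as non-matching.
        long matches = sample.stream()
                .filter(v -> v != null && expression.matcher(v).find())
                .count();

        // Integer arithmetic avoids rounding issues: matches / sampleSize >= 80%.
        return matches * 100 >= (long) PERCENTAGE_REQUIRED * sampleSize;
    }

    public static void main(String[] args) {
        Pattern fiveDigits = Pattern.compile("^\\d{5}$");
        // 4 of 5 values match, which is exactly 80%.
        List<String> column = List.of("75001", "69002", "13008", "31000", "n/a");
        System.out.println(isSensitive(column, fiveDigits));
    }
}
```

With this sample, four of the five values match, so the 80% threshold is just reached and the column would be flagged. Dropping one more matching value brings the ratio to 60% and the column would be left untagged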

Let’s demonstrate it by creating a profile and a regular expression to profile email addresses in one of the demo tables with 100 rows

Create the "MY_EMAIL_DL" expression as follows
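The exact regular expression behind "MY_EMAIL_DL" is not reproduced in the text. As a stand-in, the sketch below shows what a data-level email expression could look like and how it behaves on a few values. The pattern is a deliberately simple assumption of mine, not the expression from this demo, and the class and method names are hypothetical

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for an email profile expression; not the article's actual regex.
class EmailExpressionSketch {

    // Simple illustrative pattern: local part, "@", domain, dot, alphabetic TLD.
    // Real-world email validation is more involved than this.
    static final Pattern EMAIL =
            Pattern.compile("^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$");

    static boolean looksLikeEmail(String value) {
        return value != null && EMAIL.matcher(value).matches();
    }

    public static void main(String[] args) {
        String[] candidates = {"john.doe@example.com", "not-an-email", "a@b"};
        for (String candidate : candidates) {
            System.out.println(candidate + " -> " + looksLikeEmail(candidate));
        }
    }
}
```

Trying a candidate expression on a handful of known values like this, before pasting it into the masking tool, is a cheap way to check that it is neither too strict nor too permissive for the 80% threshold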



To keep the demo simple, create a profile "EMAIL PF" and assign the previous expression to it


Create a connector and a ruleset for the demo table "MEDICAL RECORDS", which has 100 or more rows



Let's check the inventory before using the data profiler


Create a profiler job using the created ruleset and profile set


Execute the profiler job and, if I did my job right :), the EMAIL column will be tagged as sensitive and assigned the EMAIL domain


Here we are, the profiler tagged the EMAIL column as sensitive based on its content 

I hope this article helped you understand what’s going on behind the scenes when using the Delphix masking data profiler
