Today's article discusses one of the greatest features of the Delphix masking tool: the profiler
The idea behind it is to help customers detect whether table columns are sensitive, based on their contents
The Masking profiler uses two different methods to identify the sensitivity of data:
- Column level (out of scope for this article): queries the metadata of the target database and looks for specific column names (e.g. a column named city is expected to contain city names)
- Data level (the subject of this article): looks at the data itself, using a sampling algorithm, to see whether it contains any sensitive data
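Although column-level profiling is out of scope here, a tiny sketch helps contrast it with the data-level approach. The Python below is only an illustration: the pattern table and the function name are hypothetical and are not the expressions shipped with Delphix, which are written as Java regular expressions.

```python
import re

# Hypothetical column-name patterns mapped to domains (assumption, not Delphix's own list).
COLUMN_NAME_PATTERNS = {
    "CITY": re.compile(r"city", re.IGNORECASE),
    "EMAIL": re.compile(r"e[-_]?mail", re.IGNORECASE),
}

def profile_by_column_name(column_names):
    """Return {column_name: domain} for every column whose name matches a pattern."""
    tagged = {}
    for name in column_names:
        for domain, pattern in COLUMN_NAME_PATTERNS.items():
            if pattern.search(name):
                tagged[name] = domain
                break  # first matching domain wins in this sketch
    return tagged

# Only the column names are inspected; the data is never read.
print(profile_by_column_name(["PATIENT_ID", "CITY", "E_MAIL", "NOTES"]))
```

Note that a column with an uninformative name (for example COL_17) would slip through this kind of check, which is where data-level profiling comes in.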
The data profiler takes a sample of the first n rows of the column (n being 100, 1,000, 10,000, 100,000, and so on) and tries to match the sampled values against the profile expressions (Java regular expressions)
A column is tagged when at least 80% of the sampled values match, as defined by the NO_OF_ROWS=100 and PERCENTAGE_REQUIRED=80 parameters in the configuration file kettle-profiling.properties
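To make the threshold rule concrete, here is a minimal sketch of the sampling logic in Python. It is not the product's implementation: the function name is hypothetical, Python's re module stands in for Java regular expressions, and details such as requiring a full match, skipping NULL values, and computing the percentage over the rows actually sampled are assumptions.

```python
import re

NO_OF_ROWS = 100          # sample size, mirroring kettle-profiling.properties
PERCENTAGE_REQUIRED = 80  # minimum percentage of sampled rows that must match

def column_is_sensitive(values, pattern,
                        no_of_rows=NO_OF_ROWS,
                        percentage_required=PERCENTAGE_REQUIRED):
    """Return True when enough of the first `no_of_rows` values match `pattern`."""
    sample = values[:no_of_rows]          # first n rows only, no random sampling
    if not sample:
        return False
    regex = re.compile(pattern)
    # Assumption: NULLs never match, and the whole value must match the expression.
    matches = sum(1 for v in sample
                  if v is not None and regex.fullmatch(str(v)))
    return matches * 100 >= percentage_required * len(sample)

# 85 of 100 sampled values look like five-digit codes: above the 80% threshold.
print(column_is_sensitive(["12345"] * 85 + ["n/a"] * 15, r"\d{5}"))
# 79 of 100 match: below the threshold, so the column is left untagged.
print(column_is_sensitive(["12345"] * 79 + ["n/a"] * 21, r"\d{5}"))
```

Because only the first rows are read, a table whose leading rows are unrepresentative (for example, test records loaded first) could be misclassified in either direction.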
Let’s demonstrate this by creating a profile and a regular expression to profile email addresses in one of the demo tables with 100 columns
Create an expression named "MY_EMAIL_DL" with a data-level regular expression that matches email addresses
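The exact pattern behind "MY_EMAIL_DL" is not reproduced in the text, so the one below is purely hypothetical. It is a deliberately simple email pattern that uses syntax common to Java and Python regular expressions, which makes it easy to try out before entering anything in the masking tool.

```python
import re

# Hypothetical pattern (assumption): the real MY_EMAIL_DL expression is not shown in the text.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def looks_like_email(value):
    """Return True when the whole value matches the email pattern."""
    return EMAIL_PATTERN.fullmatch(value) is not None

# Try the pattern against a few candidate values before using it for profiling.
for candidate in ["jane.doe@example.com", "not-an-email", "bob@localhost"]:
    print(candidate, looks_like_email(candidate))
```

A stricter or looser pattern changes how many sampled rows match, and therefore whether the column reaches the required percentage.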
For ease of demo, create a profile "EMAIL PF" and assign the previous expression to it
Create a connector and a ruleset for the demo table "MEDICAL RECORDS", which has 100 or more rows
Let's check the inventory before using the data profiler
Create a profiler job using the created ruleset and profile set
Execute the profiler job and, if I did my job right :), the EMAIL column will be tagged as sensitive and assigned the EMAIL domain
Here we are, the profiler tagged the EMAIL column as sensitive based on its content
I hope this article helped you understand what’s going on behind the scenes when using the Delphix masking data profiler