The testers’ guide to masking, obfuscating and scrambling
What data do you use for testing and development – especially when this crucial work is increasingly being off-shored? Huw Price of Grid-Tools says that the trio of masking, obfuscating and scrambling offer a practical solution.
A major area of interest is the question of what data to use for development and testing, especially in offshore applications. Currently, most companies use copies of production data. This has obvious security issues but also has several disadvantages familiar to all CIOs. Copied data is usually out of date by the time it is used for testing making time specific tests irrelevant and new functionality will not have any pertinent data with which to test it. Multiple users will set up specific test scenarios which will be destroyed every time production is re-copied to testing. Large copies of production data on less powerful testing hardware make queries and searches run slowly and take up lots of expensive disk space.
In addition to full-sized copies, most companies will have additional approaches to building testing environments, which include: a small development database in which users create data by hand, this usually contains a large amount of invalid data; an extract subset of production data for use in development using tools such as GT Subset. They also use capture playback tools such as QTP, Forecast Studio etc to populate transactions using the online applications and also use data generation tools such as Datamaker to build accurate test data.
At some stage during the development and testing lifecycle, users will access production data. Here we outline some of the simpler techniques to obfuscate or mask the data where you cannot identify an original customer, account or secure entity from the masked data and the overall data trends cannot be easily identified.
Where to scramble
The first consideration when designing a scrambling architecture is where you want to physically scramble the data. Is it good enough to copy production, move it into development, run a few scripts and off you go? I would put this into the ‘doing as little as possible’ category, as there are a few specific problems with this approach. For instance, the live data lives unscrambled in a development environment for a time; the scripts to scramble the data tend to get forgotten, are not kept updated and tend to be built by a single DBA who may move on and there is no traceability.
Better and more systematic approaches to data scrambling will depend upon your specific infrastructure. Many sites, for example, will already have copies of production data for use as reporting databases or for access by data warehousing toolset. These copy databases are protected by security layers and access control. The reporting databases can then be used as a source for scrambling extracts without impacting on production performance.
The main secure methods of scrambling are:
- Extract the data through scrambling functions either on a live copy of production or preferably a reporting copy.
- Build a set of views that use scrambling functions to mask the data. Data will be extracted through these views. The access to these views can be granted using normal database security. As an indicator the initial data retrieval through the views is usually 5 to 10 times slower than against a native table.
- Create a secure environment and take a copy of production data, update the data in situ and when complete copy this to development. The same functions outlined in this paper can be used in update scripts to scramble the data in situ.
Ad hoc or systematic?
It is very easy to write a few scripts, change a few customer names or alter a few characters of an account ID, but there are problems with this approach. The scripts tend to fall outside normal programming control and are written in SQL and non standard languages such as PERL. They may well be perfect but tend not to be documented, not incorporated in source control systems and are not subject to testing by the test department.
Database structures tend to change regularly and the scrambling functionality needs to be upgraded with each release. After a while the scrambling routines tend to be forgotten. In an attempt to be more systematic, the use of tools can be helpful as well as turning the scrambling task into a normal IT project. The scrambling project would be subject to the infrastructure, testing and control used in your normal development lifecycle. The benefits of this are:
- Traceability – extremely important if data loss occurs.
- The scrambling tends to be more thorough and more useful to testing teams.
- Scrambling for new releases of software is automatically upgraded as part of the normal life cycle.
Know your data
Before beginning a scrambling project you will need to understand your data. A request from management may be as simple as “make sure no one can recognise a customer”; however, understanding what a customer consists of is the first task before you can begin scrambling. To begin building up a picture, you will need to gather all of the available ‘free’ information surrounding the data to be scrambled, this includes:
- Foreign Keys. How are tables related in the database?
- Documentation. This is usually held in a variety of formats and applications; however, they are rarely current.
- User knowledge. What is the user’s understanding of how and where key data is held and displayed?
- Naming standards. A surprisingly good source of information, column names in tables can give a strong hint to their use and relationship to other columns.
Once you have gathered a basic picture you will need to investigate the data itself to verify any documentation and try to understand in detail where the data is held and how it relates to other data.
There are a number of problems common across most systems, including data columns being used for multiple purposes. It is quite common for limitations in an application to be overcome by creative use of fields. Thus a field used for one purpose contains data to identify data for other uses. Examples of this type of usage are comment fields being used to hold structured information; these comments may contain data that is sensitive for example a temporary address or phone number.
As applications and databases evolve and merge with other systems data may be created that is invalid. Users usually have an idea that this invalid data exists, however, but have made the decision to ignore the data problems as there is no critical problem that would justify the time to clean up the data. These data issues must be understood prior to scrambling a database.
Documentation and traceability
It should be obvious that the ability to demonstrate that efforts have been made to scramble data requires a documented trail. Turning the task into a normal IT project will allow you to use your normal change control, testing and delivery methodologies. These are usually very mature in most organisations. The documentation and control should include:
- Which columns are sensitive and need scrambling?
- Who has access to any scrambling functions, ie the code that scrambles should be protected as well.
- A before and after report of what the data has been changed from and to. You can use database compare tools such as Datamaker for this or generate triggers to update audit tables.
- Who has access to any working schemas or files used in the scrambling process?
Figure 1 – Datamaker stores and audits which scrambling functions have been applied to each column.
Scrambling methods
There are numerous methods used to scramble data, however I shall break them down into three categories:
1 Simple independent functions to put in random text, dates and numbers.
2 Multi-table column values, for example, an account number is used in lots of tables and as an identifier in other applications.
3 Offset values, for example, if a date is adjusted then other related dates must shift in line with the original date; if a post code is changed then corresponding address lines must also shift.
When building up your library of functions remember that there are lots of powerful ones readily available from sources like database functions – every RDBMS comes with a vast library of built in functions many of which can be built up to scramble data quite easily; toolsets – Tools come with many pre built functions; your own code – Some of the scrambling you need will be very specific to your systems, for example, customer numbers can be built up of combinations of locations, dates of birth and partial names. There will be code in your system already that builds these numbers so use the same function as part of your scrambling strategy. And the internet provides a vast array of free code snippets which can be easily used.
Using seed tables
A very effective technique to scramble data is to use one or many static or temporary tables to hold data that can be included as part of your scrambling routines. These tables can include a list of made up customer names, product names, addresses etc.
Static tables – It only takes a simple piece of SQL or, even better, a simple database function call to replace the data being extracted with data from the seed table. The seed tables contain data familiar to testers and can be added to very easily and be customized to specific locales. For example, it is very easy to create specific groups of addresses for each country. Seed tables should be populated prior to beginning the scrambling and verified that they contain no production data.
Dynamic tables – an easy to use and effective technique is to build tables that are exclusively used for a scrambling build. The next time data is extracted you would simply drop and recreate the tables. These tables tend to be used when data identifiers are used across multiple tables and a number must be changed to the same number across each of these tables.
These cross reference tables can be very useful as they ensure that even if someone knows an internal ID, they will not be able to find the specific detail of a customer. So, for example if you scramble customer names and addresses and you shift the internal customer ID field, the data will still retain full integrity, however it will be difficult to identify a specific customer. This can also be used for detaching address IDs from Customer IDs allowing separating customer transaction details from address details; and detaching sign on information from personal information.
Independent functions
A library of simple functions to apply to data as it is extracted should be built up. It should include the adding of a small decimal increment to transaction values to mask individual transactions; and the adding of a number of days to all dates, a very simple transformation to implement, assuming all your dates are identified as date data type. This also has the advantage of allowing time-dependant process testing to be more accurate.
Bear in mind end of month processing can be affected by this. You may be better off using a cross reference table to match up periods.
Multi-table column values
Many column values are repeated in tables across the system. These values can be external identifiers, for example, an account number may be used extensively across a system and across other applications, or they could be internal identifiers for example an account ID. While changing external values is obvious, changing the internal identifiers should be considered. If a user can identify an internal ID, sometimes they are displayed in reports, XML messages, error messages etc then data is no longer masked.
The first problem is tracking down all of the links between tables and columns once this is done you need to make sure that the same scramble functions are applied to all of the related columns. Identifying these columns may need a tree walk across your model.
Figure 2 – A tree walk using Datamaker to identify internal IDs
Functions that are used in multi-column scrambling should include:
- A simple character by character replacement. Basically shift character five and six in a string identifier to one more less and one more respectively.
- Hashing, which is a key component in multi-column replacement. It allows values to be transformed to the same value every time dependant on a hash key. For example, 1 would be transformed to 7, 2 to 6, 3 to 1 etc. Each value has a unique hashed value and is repeatable based on the hash key.
- Using dynamic seed tables will build an exact replacement value for each identifier. You need to protect the seed table as it contains the ‘key to crack the code’ and you must also protect the offset algorithm, as this can be used to identify data.
Using the above techniques, it is possible to apply incremental updates to your development environment. So, for example, if you extract sub sets of production transaction data to keep your development data up-to-date, you can run an extract and import the new data which will still makes referential sense. You must, however from time to time change any offset values, hash keys or simple algorithms and refresh the test databases as users will eventually work out the old and new values.
Offset values
Managing offset values can be more complicated as data in one column/table must be related to the other column/table’s values for the system to operate in the same way as production. Obviously understanding where these types of relationships exist is crucial. The types of offset values include:
- Micro aggregation – The values of prior rows in a set of transactions refer to each other.
- Application process driven aggregations – Many systems have application components that calculate balances based on transaction throughput. These tend to be separate processes which can be run as stand alones. For example, if you are adjusting transaction amounts the customer balances may not match. Running the balance adjustment process may be required to reset these values.
- Offset dates – If you adjust dates then other values may well have to be reset. An example of this is dates of birth; adjusting this by a few days will alter the age and also the age bracket so a person may move into a different insurance premium.
The aggregation and setting of offset values can be the hardest part of any scrambling project. In my experience, the extensive use of dynamic tables beforehand, to create offset lookups is an effective technique, especially with date-type data. In addition, the ability to use portions of application code may need to be incorporated into the scrambling routines to reset end of day balances, customer totals etc.
Data creation
Many organisations have very strict data usage restrictions and may well need to create data from scratch to use in their development and testing environments. Having no access to production data is common in many government departments as well as more and more health care and financial sectors. At first, creating data from scratch may seem like a difficult task, however when you consider some of the difficulties with scrambling and the basic problem of developing with full size databases that are often out of date it may well be an easier option.
The basic procedures to do this are:
- Create a standard data object, for example a Customer.
- Tokenise or parameterise those objects so that you can vary them when you create them.
- Inherit objects so that you can make your own edits without affecting the original data objects.
Library of functions
Once you have completed your analysis and built a library of standard functions you can use these functions across the enterprise to scramble data in other systems and linked applications.
For each table in the application you must produce and store some level of documentation to verify the scrambling is effective and sufficient. Spend some time looking at each column deciding on which function or transformation should be applied to each column. You must also note any pre processing required to build cross reference lookups and any post-processing balance aggregations etc.
New builds and releases
An often forgotten problem with scrambling is that minor and major new releases and builds may come thick and fast so the scrambling toolset must be able to handle the addition of new, possibly sensitive columns, the addition of cross column relationships and offset values and any new functional requirements which may need new seed tables.
It is important that the scrambling methods are updatable quickly and you do not have to spend time waiting, for example, for a DBA to update some scripts. If this happens the initial work will be quickly lost and production data will begin being used again in development and testing.
Scrambling projects often start out as a request to the DBA team to “do something to make sure production data is not being used in development”. While the DBAs may be able to do a quick job to scramble some columns, the responsibility for data ownership cannot live with the DBA team. A scrambling project should come under normal application control and be the responsibility of the development team in conjunction with the application ‘data owners’. Too many projects have failed to deliver even basic data security by going down this route.
Box out: What a scrambling project should include
- Understand your data – Its peculiarities, quality issues and interfaces.
- Agree up front what data is sensitive – Don’t forget to include internal IDs.
- Discuss the different methods and how they apply to the application as a whole and in detail at the column level.
- Build up a library of functions – Consider reusing some of your existing code which may handle non standard situations.
- Consider buying in a toolset to scramble data.
- Consider creating data from scratch – It is often easier than you think and has many benefits in terms of the amount of code covered in testing.
- Document and audit everything you are doing.
- Use normal project control mechanisms.
- Don’t forget that the scrambling has to be kept up to date – design your scrambling application with this in mind.




Green
T.E.S.T Magazine
June Issue
T.E.S.T Magazine