Privacy and Small Result Sets
Maintaining Anonymity in Small Communities
by Judson B. Tunnell, 02-February-2023
Data allows us to do wonderful things. Properly written demographic profiles can provide a wealth of information about people, and a good profile allows you to extract meaning from the data. This information should always be kept separate from personally identifiable information (PII). Even so, small data sets, and small numbers within large data sets, can still make it possible to pair a profile with an individual.
Consider the following example. A researcher has the data from 75,000 mental health surveys with no PII, but with detailed demographic profiles. When checking the data, the researcher realizes that 135 of the results are from their home town, where the vast majority of residents were born in Canada. Curious, they check the data and find one result where the birth country of the respondent is Ukraine, the gender is male, and the age is between 45 and 50. They realize that this has to be their brother-in-law. You can see how quickly a detailed profile can be used to identify someone from a result set with no PII.
Normally we trust researchers and academics not to go out of their way to identify someone. However, we know that all people are susceptible to curiosity, or to malicious intent. When building an application for the gathering and analysis of data, there is an opportunity for mitigation. You can build in ways to reduce the chance that an individual can be singled out in the data.
There are several ways to mitigate this:
Copy Data Subsets for Reporting Tables
When retrieving a large data set, perform a distinct count on the result and change the least common values to a shared value such as 'Other'. For example, if the data from a town contains small numbers of people from each of a list of countries, these can all be changed to 'Other' and counted together. This takes some extra processing when extracting the data, but will help prevent individual identification.
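The grouping step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name collapse_rare_values, the birth_country column, and the min_count value are all hypothetical, and the rows are made-up sample data rather than real survey results.

```python
from collections import Counter


def collapse_rare_values(rows, column, min_count, other_label="Other"):
    """Return copies of rows with rare values in `column` replaced by other_label.

    A value is treated as rare when it appears fewer than min_count times.
    The input rows are not modified.
    """
    counts = Counter(row[column] for row in rows)  # distinct count per value
    return [
        {**row, column: row[column] if counts[row[column]] >= min_count else other_label}
        for row in rows
    ]


# Made-up sample: six respondents born in Canada, one each from two other countries.
town_rows = (
    [{"birth_country": "Canada"} for _ in range(6)]
    + [{"birth_country": "Ukraine"}, {"birth_country": "Poland"}]
)

reporting_rows = collapse_rare_values(town_rows, "birth_country", min_count=3)
summary = Counter(row["birth_country"] for row in reporting_rows)
```

Here the two single-respondent countries end up counted together under 'Other'. If the combined 'Other' group is itself very small, it may still need to be hidden or merged further.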
Hide Sets With Under 100 Results
In large reports, hide all counts under 100, or again collect them together under the value 'Other'. This may render some small data sets useless, but will help protect against individual identification.
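As a rough sketch of this rule, the Python function below takes a report of value counts and hides anything under the threshold. The function name suppress_small_counts and the sample numbers are hypothetical. Dropping the 'Other' group when it is still under the threshold is an assumption made here, not something the rule above requires.

```python
def suppress_small_counts(counts, threshold=100, other_label="Other"):
    """Return a report with counts below threshold folded into other_label.

    Assumption: if the folded total is still below threshold, it is
    hidden entirely rather than shown as a small 'Other' group.
    """
    visible = {}
    other_total = 0
    for value, count in counts.items():
        if count < threshold:
            other_total += count  # too small to show on its own
        else:
            visible[value] = count
    if other_total >= threshold:
        visible[other_label] = visible.get(other_label, 0) + other_total
    return visible


# Made-up report counts for illustration only.
large_report = {"Canada": 4200, "United States": 310, "Ukraine": 60, "Poland": 45, "Brazil": 12}
small_report = {"Canada": 130, "Ukraine": 1}

large_result = suppress_small_counts(large_report)
small_result = suppress_small_counts(small_report)
```

In the first report the three small groups add up to 117 and appear together as 'Other'. In the second, the single result is simply hidden.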
Fewer Columns = Harder to Identify
Return fewer columns of the profile in an individual query. Only return the profile columns that are relevant to the research or question. For example, if the birth country is irrelevant to the search, do not query on it or return it in data sets.
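One simple way to apply this is to name the allowed columns explicitly for each query, so that anything not requested is never returned. The sketch below assumes profiles are held as Python dictionaries. The function name select_columns and the column names are hypothetical.

```python
def select_columns(rows, allowed_columns):
    """Return copies of rows that contain only the allowed columns."""
    return [
        {column: row[column] for column in allowed_columns if column in row}
        for row in rows
    ]


# Made-up profile rows for illustration only.
profiles = [
    {"age_band": "45-50", "gender": "male", "birth_country": "Ukraine", "score": 7},
    {"age_band": "30-35", "gender": "female", "birth_country": "Canada", "score": 4},
]

# The research question only needs age band and score, so nothing else is returned.
result = select_columns(profiles, ["age_band", "score"])
```

With birth country and gender left out, the first row can no longer be singled out the way it was in the earlier example.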
If the application will include reporting tools, you can build these mechanisms in so that the data is never revealed. If the data itself will be transmitted or accessed, you should use a reporting stack and have overnight queries pull the results using the same mechanisms.
Using any of these methods should prevent the accidental or intentional identification of individuals from within large data sets.
Dimensions of Wellness is committed to never revealing your personal information. We are using all three of these methods for our reporting data.
- Judson B. Tunnell