Data Science In A Privacy-Centric World

Eliud Nduati
Mar 31, 2022

Data is an essential resource today. All businesses should be using it or working toward data-centric operations, because data supports growth and competitive advantage.

However, there is a limit to what data can be collected, stored, and processed. With the need for data to ensure competitive advantage and growth come privacy regulations.

Data regulation laws lay down the rules that must be followed to protect personally identifiable information, that is, personal data, as it relates to processing and movement. Here, movement means data sharing. Data protection regulations such as GDPR matter to everyone because they enhance people’s privacy. However, they are a limiting factor in data science and AI. For data science and machine learning to prosper, data is integral: not just any data, but data in ample supply, variety, and diversity.

Therefore, the question that arises is, how do data professionals navigate the minefield that is privacy regulations without compromising or breaching the laws?

Various actions can be taken to ensure that the regulations in force are followed and that the resulting applications and platforms do not breach them. These actions concern how data is stored, how it is processed, and how much control is given to its owner.

Data catalog

One issue with data collectors is that they do not inform their customers about what data they collect. As a result, the data owner is unaware of how much information has been collected. Additionally, how the data is used is not communicated to the customers.

A good approach to address this is to tell customers what data is being collected and what types are being stored. To process most online transactions, for example, data such as names and credit card numbers is necessary.

Similarly, location data is essential in such scenarios. However, not all data is needed for the transaction to be completed. Collecting only the necessary data, and communicating in a catalog what has been stored about the customer, is essential. This will help avoid breaching regulations about data collection.
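The idea of a catalog combined with collecting only what is necessary can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed design: the CatalogEntry class, the CHECKOUT_CATALOG entries, and the minimise and describe helpers are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    field_name: str  # name of the collected field
    purpose: str     # why it is collected, in customer-facing words
    required: bool   # whether the transaction cannot complete without it


# Hypothetical catalog for an online checkout; the entries are examples only.
CHECKOUT_CATALOG = [
    CatalogEntry("name", "identify the card holder", True),
    CatalogEntry("card_number", "process the payment", True),
    CatalogEntry("location", "calculate delivery and tax", True),
    CatalogEntry("browsing_history", "personalised recommendations", False),
]


def minimise(record: dict, catalog: list) -> dict:
    """Keep only the fields the catalog marks as required."""
    required = {entry.field_name for entry in catalog if entry.required}
    return {key: value for key, value in record.items() if key in required}


def describe(catalog: list) -> str:
    """Render the catalog as plain text that can be shown to the customer."""
    lines = []
    for entry in catalog:
        status = "required" if entry.required else "optional"
        lines.append(f"{entry.field_name} ({status}): {entry.purpose}")
    return "\n".join(lines)


submitted = {
    "name": "A. Customer",
    "card_number": "4111111111111111",
    "location": "Nairobi",
    "browsing_history": ["shoes", "bags"],
    "device_id": "abc-123",
}
stored = minimise(submitted, CHECKOUT_CATALOG)
print(sorted(stored))
print(describe(CHECKOUT_CATALOG))
```

Fields that are not in the catalog at all, such as device_id here, are dropped before storage, and the same catalog doubles as the notice shown to the customer.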

Information security

One provision of GDPR, the best-known data protection regulation, is the requirement to communicate data breaches to customers. However, it is far more important to prevent unauthorized access to this data in the first place, which is why investment in data security matters today. Some governments are still against the use of shared cloud data storage because of the implications they see in data being stored in foreign countries.

As a result, it is important to invest in information security and in storage infrastructure that meets customer needs. Encrypting the data and storing it with anonymized user names and identifiers also helps preserve confidentiality if unauthorized persons do access the data.
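One way to store records without real user names is to replace each name with a keyed hash. The sketch below uses only the Python standard library and assumes a hypothetical record layout with a user_name column; PSEUDONYM_KEY, pseudonymise, and strip_identity are names made up for illustration. Strictly speaking this is pseudonymization rather than full anonymization, since whoever holds the key can recompute the mapping. Encrypting the stored data itself would need a vetted cryptography library and is not shown here.

```python
import hashlib
import hmac
import secrets

# Assumption: in a real system this key lives in a secrets manager, stored
# separately from the data set. It is generated here only for the demo.
PSEUDONYM_KEY = secrets.token_bytes(32)


def pseudonymise(user_id: str, key: bytes = PSEUDONYM_KEY) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256, hex encoded)."""
    return hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def strip_identity(record: dict) -> dict:
    """Return a copy of the record with the user name swapped for a pseudonym."""
    cleaned = {k: v for k, v in record.items() if k != "user_name"}
    cleaned["user_ref"] = pseudonymise(record["user_name"])
    return cleaned


row = {"user_name": "jane.doe", "purchase_total": 42.5}
safe_row = strip_identity(row)
print(safe_row)
```

The same name always maps to the same reference, so records can still be joined for analysis, while someone who obtains the data without the key sees only opaque strings.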

Data processing

The purpose of storing data is to use it to gain an advantage in the market, make decisions, and strategize on the way forward. To achieve this, in come data scientists and machine learning engineers. Their job is to use the data to generate insights and models that help the organization make good decisions and plans for its operations. However, how the information is processed, and what information is processed, can be unethical. When facial recognition was first implemented, there were issues with it failing to identify the faces of people of certain races.

This resulted in significant concern about the propagation of racism by these models. However, this issue would not have been corrected without intensive data modeling and training using vast amounts of data. Data processing is important, and to avoid the regulatory problems that come with holding precise data about an individual, the organization can store what the processing produces instead. In this case, I suggest storing the results from the processing and not the raw data. Storing processed data would improve anonymity, as reversing it to recover the raw data would seemingly be impossible.
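Storing results rather than raw data could look something like the sketch below, where per-customer transactions are reduced to per-region summaries before anything is kept. The records, the summarise_by_region function, and the min_group_size threshold are all assumptions made for this example. The threshold is there because a summary over very few people can still point to an individual, so aggregation alone should not be treated as a guarantee of anonymity.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw records; in practice these would come from the source system.
raw_transactions = [
    {"customer": "alice", "region": "north", "amount": 120.0},
    {"customer": "bob", "region": "north", "amount": 80.0},
    {"customer": "carol", "region": "south", "amount": 200.0},
    {"customer": "dan", "region": "south", "amount": 100.0},
    {"customer": "erin", "region": "east", "amount": 50.0},
]


def summarise_by_region(rows, min_group_size=2):
    """Aggregate amounts per region, dropping customer identifiers."""
    amounts = defaultdict(list)
    for row in rows:
        amounts[row["region"]].append(row["amount"])

    summary = {}
    for region, values in amounts.items():
        if len(values) < min_group_size:
            # Too few people: the average could reveal one individual's amount.
            continue
        summary[region] = {"customers": len(values), "mean_amount": mean(values)}
    return summary


# Only this summary is kept; the raw rows can then be discarded.
stored_results = summarise_by_region(raw_transactions)
print(stored_results)
```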

Data quality and control

Organizations need to implement data quality mechanisms that support the control of data accuracy and allow users to correct, request, and delete their data whenever they want to. As has been seen on most social media platforms recently, users can update their information on the platform, request their data, and delete their accounts and data altogether.

This will ensure that users are comfortable and that their data is in their hands. Additionally, the companies will be able to comply with data regulation rules.
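The three user controls described here (request, correct, delete) map onto a very small interface. The following is a toy in-memory sketch under that assumption; UserDataStore and its method names are hypothetical, and a real system would also have to cover backups, logs, and copies held by third parties.

```python
import copy


class UserDataStore:
    """Toy in-memory store exposing request, correct, and delete operations."""

    def __init__(self):
        self._records = {}

    def save(self, user_id, data):
        self._records[user_id] = dict(data)

    def request(self, user_id):
        """Return a copy of everything held about the user (empty if nothing)."""
        return copy.deepcopy(self._records.get(user_id, {}))

    def correct(self, user_id, field_name, new_value):
        """Let the user fix an inaccurate field."""
        if user_id not in self._records:
            raise KeyError(f"no data held for {user_id}")
        self._records[user_id][field_name] = new_value

    def delete(self, user_id):
        """Remove the user's data; return True if something was deleted."""
        return self._records.pop(user_id, None) is not None


store = UserDataStore()
store.save("u1", {"email": "old@example.com", "city": "Nairobi"})
store.correct("u1", "email", "new@example.com")
export = store.request("u1")
deleted = store.delete("u1")
print(export, deleted, store.request("u1"))
```

Returning a copy from request keeps the caller from changing stored data by accident, so every change goes through correct.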

Preventing bias in the model

One issue with machine learning models is that some give biased results. In models that predict criminal behavior, for example, bias can arise from the data available. Imbalanced data might result in biased models.

A data scientist should ensure that the data is well balanced before using it to train the model. Further engineering mechanisms should also check that the resulting models address such issues. This will help prevent biased decisions and wrongful arrests and build user confidence in the models and the tools built on them.
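Checking balance before training can be as simple as counting labels. The sketch below also shows random oversampling, one of several possible rebalancing techniques, using only the standard library. The oversample function and the toy rows are assumptions made for illustration. Equal class counts alone do not guarantee an unbiased model, and in practice resampling would normally be applied to the training split only.

```python
import random
from collections import Counter


def oversample(rows, label_key, seed=0):
    """Randomly duplicate rows of smaller classes until every class
    has as many rows as the largest one."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)

    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw extra rows with replacement to reach the target size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Toy data set: 8 rows of class "a" and only 2 rows of class "b".
rows = [{"x": i, "label": "a"} for i in range(8)]
rows += [{"x": i, "label": "b"} for i in range(2)]

before = Counter(r["label"] for r in rows)
balanced = oversample(rows, "label")
after = Counter(r["label"] for r in balanced)
print(before, after)
```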

While data science and its benefits are important to businesses and people, it is crucial to ensure, first, that the regulations about using this data are met and, second, that the resulting platforms allow for ethical decision-making and uphold the moral norm of impartiality.
