Public LLMs Are the Hotel California for Data: Your Data Can Go In Anytime, but It Can Never Leave

Kristina Avrionova, Fortanix
Published: Jun 17, 2024
Reading Time: 3 mins

Or at least, data will be very, very hard to delete.

But perhaps before we dive into the challenges of removing data from a GenAI model, let's understand the implications of feeding data, intentionally or not, into LLMs.

Imagine an employee feeding a document full of sensitive and confidential information, such as legal matters, customer contracts, proprietary IP, source code, and employee data, into a GenAI model. This data now becomes a relevant and referenceable data point, and the LLM can create associations with other publicly available information.

Suddenly, a generic AI prompt about your organization can make this information available to anyone. Before you know it, you have a major data leak or breach on your hands, and it can be embarrassing and very expensive.

Doesn’t that send shivers down your spine? This is why many organizations are banning the use of ChatGPT and the like at work and are actively trying to mitigate the risk of shadow AI.

Understanding the challenges of deleting data from GenAI models will make you realize why LLMs are the Hotel California for your data.

  • Model Training Complexity: LLMs are trained on extensive data through complex processes with multiple layers and parameters. Once integrated into the model during training, the data becomes deeply embedded, making it difficult to isolate for deletion.
  • Data Entanglement: Training data can get highly entangled, complicating the isolation of specific data points without impacting the model's performance. Deleting one data point could unintentionally disrupt learned relationships, degrading the model's capabilities and accuracy.
  • Lack of Traceability: Once data is fed into the model, tracing individual data points back to their sources becomes challenging. The model's knowledge is aggregated, so any piece of information may derive from multiple inputs, with no clear way to identify specific sources.
  • Technical Limitations: Current machine learning infrastructure lacks tools to surgically remove data from a trained model. Unlike databases, which allow direct record deletion, LLMs must be retrained on an adjusted dataset, a computationally expensive and time-consuming process.
  • Resource Intensity: Retraining an LLM from scratch to exclude specific data points is resource-intensive in terms of time and computational power. The process could span days, weeks, or even months based on factors like model size, data volume, and processing capabilities.

So what can be done?

The long answer: employee awareness, new rules of engagement, and updated operational practices. The short answer: secure the data. A lot of headaches can be prevented if sensitive and critical data is encrypted, making it useless as GenAI input.

Consider using at-rest encryption or a data tokenization technique that masks data yet keeps it portable, so the data stays secure and private. And compliant, for that matter: regulatory implications are yet another negative outcome when data is carelessly fed into GenAI.
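To make that concrete, here is a minimal sketch of tokenizing sensitive values out of a prompt before it ever reaches a public LLM. It uses the open-source Python cryptography package as a stand-in for a real tokenization service; the document text and sensitive values are hypothetical:

```python
# A minimal sketch, not a production tokenizer: sensitive values are
# swapped for opaque encrypted tokens before the text leaves your boundary.
from cryptography.fernet import Fernet

key = Fernet.generate_key()  # in practice, fetched from a key manager, never stored with the data
fernet = Fernet(key)

def tokenize(text: str, sensitive_values: list[str]) -> tuple[str, dict[str, str]]:
    """Replace each sensitive value with an encrypted token; keep the mapping local."""
    mapping = {}
    for value in sensitive_values:
        token = "tok_" + fernet.encrypt(value.encode()).decode()
        text = text.replace(value, token)
        mapping[token] = value  # stays inside your boundary for re-identification
    return text, mapping

safe_prompt, mapping = tokenize(
    "Summarize our renewal contract with Acme Corp at $1.2M.",
    ["Acme Corp", "$1.2M"],
)
print(safe_prompt)  # safe to send to a public LLM; the mapping and key never leave
```

The tokens are meaningless to the model and to anyone who later prompts it, yet the prompt stays well-formed enough for the LLM to work with.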

Once data is secured, the encryption keys used to protect it must be properly managed. Access to keys, decryption policies, and key storage should be well thought out. Role-based access control (RBAC) and quorum approvals will ensure that only the right people can access or modify keys.

Fine-grained policies will determine who can see and work with what data. To adhere to best practices, do not store the encryption keys next to the data or in an individual's file folder. Until engineers and data scientists solve the complex task of deleting data from LLMs, take these basic steps and secure that data.
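As a sketch of that separation, here is what envelope encryption can look like, assuming a generic key-management service (the kms object and its wrap/unwrap calls are hypothetical placeholders, not any specific vendor's API):

```python
# A sketch of envelope encryption under the assumptions above: data is
# encrypted with a short-lived data key, and only a wrapped (encrypted)
# copy of that key travels with the ciphertext.
from cryptography.fernet import Fernet

def encrypt_document(plaintext: bytes, kms) -> dict:
    data_key = Fernet.generate_key()  # per-document data-encryption key
    ciphertext = Fernet(data_key).encrypt(plaintext)
    wrapped_key = kms.wrap(data_key)  # the key-encryption key never leaves the KMS
    return {"ciphertext": ciphertext, "wrapped_key": wrapped_key}

def decrypt_document(record: dict, kms) -> bytes:
    data_key = kms.unwrap(record["wrapped_key"])  # RBAC and quorum policies are enforced here
    return Fernet(data_key).decrypt(record["ciphertext"])
```

Because every decryption routes through the key manager, access policies are enforced in one place, and the key never sits next to the data.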

About Fortanix

Fortanix is a global leader in data security. We prioritize data exposure management, as traditional perimeter-defense measures leave your data vulnerable to malicious threats in hybrid multicloud environments.

Our unified data security platform makes it simple to discover, assess, and remediate data exposure risks, whether it’s to enable a Zero Trust enterprise or to prepare for the post-quantum computing era. We empower enterprises worldwide to maintain the privacy and compliance of their most sensitive and regulated data, wherever it may be.
