Jul 20, 2023
In light of recent lawsuits against makers of generative AI solutions, it's worth asking whether we can expect to retain exclusive ownership of any of the data we post online.
It’s difficult to read an online article, whitepaper, email, or anything else today without wondering whether generative AI had a hand in creating it. Few recent technology developments have polarized IT departments, businesses, and individuals quite like tools in the vein of OpenAI’s ChatGPT.
Recently, news broke that OpenAI is being sued for allegedly scraping massive amounts of personal data without permission to train its AI models. It is also involved in a separate lawsuit, in which the plaintiffs argue that OpenAI trained its large language model (LLM) using books without permission from their authors. Celebrities including the comedian Sarah Silverman are hopping aboard the litigation train, likewise citing worries that their material has been used without their consent.
If you ask me, this was to be expected: to gather the colossal amounts of data needed to train their machine learning models, developers turn to whatever data is provided to them, or simply whatever they already have access to.
Why is this relevant, or at least worthy of another article? Well, it demonstrates just how much data these models must consume before they become useful. For me, this fact raises a few questions:
- How can we ensure that ownership of data stays with its creator? Is that even possible in our new, AI-enabled world?
- How do we determine who or what has access to our data? How can we ensure that we maintain control?
- What will the increasing use of generative AI mean for the consumption of our data and the infrastructure that enables that consumption?
Many civic organizations and government bodies are already discussing AI risks and ethics. More and more companies are creating, if they haven’t already, a code of conduct for their AI use that specifies what types of data may be entered and which tools may be used. But absent legislation mandating such policies, we can expect some organizations to be more careless than others.
Hype aside, generative AI has significant cybersecurity implications. Only some people will choose to follow a high-minded code of conduct; others will use the power of generative AI to steal data, produce misinformation, and engage in many other illicit activities. This makes it extremely difficult to decide whether we should establish more guardrails for the use of generative AI or instead focus on protecting our crucial data and digital assets as effectively as possible.
Without waiting for the dust to settle on AI-related regulation, which will undoubtedly vary by region, privacy-focused IT leaders must take action to reduce the chances of sensitive data leaving their organizations as generative AI input. This will necessitate the evolution of data loss prevention (DLP) tools with new or enhanced capabilities. Controls can determine who has access to AI assistants, for example. Decision-makers should evaluate which AI-enabled tools satisfy acceptable use policies based on both their inherent security postures and their data handling practices, such as whom they share prompt data with and whether such sharing can be opted out of.
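To make that idea concrete, here is a minimal sketch of what a DLP-style prompt filter could look like. It is purely hypothetical and not based on any specific product: it scans outbound prompt text for patterns that commonly indicate sensitive data and redacts them before the prompt ever leaves the organization.

```python
import re

# Hypothetical DLP-style filter: patterns that commonly indicate sensitive
# data. A real deployment would use far richer detection (classifiers,
# dictionaries, document fingerprints), but the principle is the same.
SENSITIVE_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "CREDIT_CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_prompt(prompt: str) -> tuple[str, list[str]]:
    """Redact likely-sensitive spans from a prompt before it is sent to an
    external AI assistant. Returns the cleaned prompt plus the names of the
    patterns that fired, for audit logging."""
    findings = []
    for name, pattern in SENSITIVE_PATTERNS.items():
        if pattern.search(prompt):
            findings.append(name)
            prompt = pattern.sub(f"[REDACTED-{name}]", prompt)
    return prompt, findings

clean, findings = redact_prompt(
    "Summarize this: contact jane@example.com, card 4111 1111 1111 1111"
)
print(clean)     # sensitive spans replaced with placeholders
print(findings)  # ['EMAIL', 'CREDIT_CARD'] -> feed into audit logs
```

A real control would sit inline (in a browser plugin, proxy, or API gateway) and might block rather than redact, but even this toy version shows how "what can be input" becomes enforceable policy rather than a line in a handbook.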
Even in our information-saturated age, today’s IT teams still hunger to know more about their data: What is “critical”? Who or what has access to it? Does everyone with access to sensitive data need it, and vice versa? Businesses must retain the power to decide who has access to their intellectual property. Citizens want to remain the stewards of their photos, text, and transactions so they can make their own decisions regarding their privacy.
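The "does everyone with access actually need it" question lends itself to a simple audit. The sketch below uses invented data and field names: it cross-references who is granted access to assets classified as sensitive against when that access was last exercised, surfacing grants that are candidates for revocation.

```python
from datetime import datetime, timedelta

# Hypothetical inventory: access grants to classified assets, with the
# last time each grant was actually exercised (None = never used).
grants = [
    {"user": "alice", "asset": "customer_db", "classification": "sensitive",
     "last_used": datetime(2023, 7, 18)},
    {"user": "bob",   "asset": "customer_db", "classification": "sensitive",
     "last_used": None},
    {"user": "carol", "asset": "press_kit", "classification": "public",
     "last_used": datetime(2023, 2, 1)},
]

def stale_sensitive_grants(grants, as_of, max_idle_days=90):
    """Flag grants to sensitive assets that are unused or idle --
    candidates for revocation under a least-privilege policy."""
    cutoff = as_of - timedelta(days=max_idle_days)
    return [
        g for g in grants
        if g["classification"] == "sensitive"
        and (g["last_used"] is None or g["last_used"] < cutoff)
    ]

for g in stale_sensitive_grants(grants, as_of=datetime(2023, 7, 20)):
    print(f"Review: {g['user']} -> {g['asset']} (last used: {g['last_used']})")
```

Run against real inventories, this kind of report turns the abstract stewardship question into a weekly to-do list.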
Therefore, the rise of generative AI should prompt us to look more closely at data classification and protection matters. With continuous cloud adoption and ongoing automation, we should see greater adoption of core zero trust principles. In light of enhanced data consumption, expect smart security teams to shift their focus from implementing more powerful protection to exploring smarter ways to secure their data, e.g., reducing the attack surface and demanding authentication with each access request.
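The "authenticate with each access request" principle is worth a brief illustration. The following is a minimal, hypothetical sketch of a zero-trust-style gate, with invented token and entitlement stores: rather than trusting a session established once, every read re-validates the caller and re-checks their entitlement against the resource's classification.

```python
# Minimal zero-trust-style gate (hypothetical): no ambient trust; every
# access request is re-authenticated and re-authorized on the spot.

# Invented token store, entitlements, and classifications for illustration.
VALID_TOKENS = {"tok-alice": "alice", "tok-bob": "bob"}
ENTITLEMENTS = {"alice": {"sensitive", "public"}, "bob": {"public"}}
RESOURCES = {"customer_db": "sensitive", "press_kit": "public"}

class AccessDenied(Exception):
    pass

def read_resource(token: str, resource: str) -> str:
    # Step 1: authenticate -- the token is verified on *every* call,
    # never cached into a long-lived session.
    user = VALID_TOKENS.get(token)
    if user is None:
        raise AccessDenied("invalid or expired token")
    # Step 2: authorize against the resource's classification.
    classification = RESOURCES[resource]
    if classification not in ENTITLEMENTS.get(user, set()):
        raise AccessDenied(f"{user} not entitled to {classification} data")
    return f"{user} read {resource}"

print(read_resource("tok-alice", "customer_db"))  # allowed
try:
    read_resource("tok-bob", "customer_db")
except AccessDenied as e:
    print(f"Denied: {e}")  # bob holds only a 'public' entitlement
```

The design point is that denial is the default: access follows from a fresh check of identity plus classification, not from being inside the network.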
Lastly, we must rethink our infrastructure for widespread generative AI use. These technologies will tax existing data centers and related hardware in ways we haven’t yet seen. The firm Tirias Research, for example, forecasts that generative AI data center power consumption will approach 4,250 megawatts by 2028, 212 times the 2023 total. To cope, many data centers will require retrofitting with more advanced computing hardware, a trend likely to unfold over the coming decade.
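To put that 212x figure in perspective, a quick back-of-the-envelope calculation (assuming both numbers describe the same category of consumption) backs out the implied 2023 baseline and growth rate:

```python
# Back-of-the-envelope check on the Tirias Research figures cited above.
forecast_2028_mw = 4250   # forecast power consumption, megawatts
growth_factor = 212       # 2028 figure as a multiple of 2023

implied_2023_mw = forecast_2028_mw / growth_factor
print(f"Implied 2023 baseline: ~{implied_2023_mw:.0f} MW")  # ~20 MW

# Equivalent compound annual growth over the five years 2023-2028:
cagr = growth_factor ** (1 / 5) - 1
print(f"Implied annual growth: ~{cagr:.0%}")  # roughly 190% per year
```

In other words, the forecast implies demand roughly tripling every year for five years straight, which is why retrofitting cannot happen overnight.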
Generative AI is spurring the evolution of related legislation, organizational policies, and even hardware requirements. Advancements in all of these fields will be necessary to preserve our existing notions of data privacy. LLMs require massive data sets for training and refining their models. How much of it will be yours has yet to be decided.