Building Ethical AI Starts with the Data Team — Here’s Why

TDS · March 20

GenAI is an ethical quagmire. What responsibility do data leaders have to navigate it? In this article, we consider the need for ethical AI and why data ethics are AI ethics.

Image courtesy of aniqpixel on Shutterstock.

When it comes to the technology race, moving quickly has always been the hallmark of future success.

Unfortunately, moving too quickly also means we can risk overlooking the hazards waiting in the wings.

It’s a tale as old as time. One minute you’re sequencing prehistoric mosquito genes, the next minute you’re opening a dinosaur theme park and designing the world’s first failed hyperloop (but certainly not the last).

When it comes to GenAI, life imitates art.

No matter how much we might like to consider AI a known quantity, the harsh reality is that not even the creators of this technology are totally sure how it works.

After multiple high-profile AI snafus from the likes of United Healthcare, Google, and even the Canadian courts, it’s time to consider where we went wrong.

Now, to be clear, I believe GenAI (and AI more broadly) will eventually be critical to every industry — from expediting engineering workflows to answering common questions. However, in order to realize the potential value of AI, we’ll first have to start thinking critically about how we develop AI applications — and the role data teams play in it.

In this post, we’ll look at three ethical concerns in AI, how data teams are involved, and what you as a data leader can do today to deliver more ethical and reliable AI for tomorrow.

The Three Layers of AI Ethics

When I was chatting with my colleague Shane Murray, the former New York Times SVP of Data & Insights, he shared one of the first times he was presented with a real ethical quandary. While his team was developing an ML model for financial incentives at the New York Times, a discussion arose about the ethical implications of a machine learning model that could determine discounts.

On its face, an ML model for discount codes seemed like a pretty innocuous request all things considered. But as innocent as it might have seemed to automate away a few discount codes, the act of removing human empathy from that business problem created all kinds of ethical considerations for the team.

The race to automate simple but traditionally human activities seems like an exclusively pragmatic decision — a simple binary of improving or not improving efficiency. But the second you remove human judgment from any equation, whether an AI is involved or not, you also lose the ability to directly manage the human impact of that process.

That’s a real problem.

When it comes to the development of AI, there are three primary ethical considerations:

1. Model Bias

This gets to the heart of our discussion at the New York Times. Will the model itself have any unintended consequences that could advantage or disadvantage one person over another?

The challenge here is to design your GenAI in such a way that — all other considerations being equal — it will consistently provide fair and impartial outputs for every interaction.

2. AI Usage

Arguably the most existential — and interesting — of the ethical considerations for AI is understanding how the technology will be used and what the implications of that use-case might be for a company or society more broadly.

Was this AI designed for an ethical purpose? Will its usage directly or indirectly harm any person or group of people? And ultimately, will this model provide net good over the long-term?

As it was so poignantly defined by Dr. Ian Malcolm in the first act of Jurassic Park, just because you can build something doesn’t mean you should.

3. Data Responsibility

And finally, the most important concern for data teams (as well as where I’ll be spending the majority of my time in this piece): how does the data itself impact an AI’s ability to be built and leveraged responsibly?

This consideration deals with understanding what data we’re using, under what circumstances it can be used safely, and what risks are associated with it.

For example, do we know where the data came from and how it was acquired? Are there any privacy issues with the data feeding a given model? Are we leveraging any personal data that puts individuals at undue risk of harm?

Is it safe to build on a closed-source LLM when you don’t know what data it’s been trained on?

And, as highlighted in the lawsuit filed by the New York Times against OpenAI — do we have the right to use any of this data in the first place?

This is also where the quality of our data comes into play. Can we trust the reliability of data that’s feeding a given model? What are the potential consequences of quality issues if they’re allowed to reach AI production?

So, now that we’ve taken a 30,000-foot look at some of these ethical concerns, let’s consider the data team’s responsibility in all this.

Why Data Teams Are Responsible for AI Ethics

Of all the ethical AI considerations adjacent to data teams, the most salient by far is the issue of data responsibility.

In the same way GDPR forced business and data teams to work together to rethink how data was being collected and used, GenAI will force companies to rethink what workflows can — and can’t — be automated away.

While we as data teams absolutely have a responsibility to try to speak into the construction of any AI model, we can’t directly affect the outcome of its design. However, by keeping the wrong data out of that model, we can go a long way toward mitigating the risks posed by those design flaws.

And if the model itself is outside our locus of control, the existential questions of can and should are on a different planet entirely. Again, we have an obligation to point out pitfalls where we see them, but at the end of the day, the rocket is taking off whether we get on board or not.

The most important thing we can do is make sure that the rocket takes off safely. (Or steal the fuselage.)

So — as in all areas of the data engineer’s life — where we want to spend our time and effort is where we can have the greatest direct impact for the greatest number of people. And that opportunity resides in the data itself.

Why Data Responsibility Should Matter to the Data Team

It seems almost too obvious to say, but I’ll say it anyway:

Data teams need to take responsibility for how data is leveraged into AI models because, quite frankly, they’re the only team that can. Of course, there are compliance teams, security teams, and even legal teams that will be on the hook when ethics are ignored. But no matter how much responsibility can be shared around, at the end of the day, those teams will never understand the data at the same level as the data team.

Imagine your software engineering team creates an app using a third-party LLM from OpenAI or Anthropic. Not realizing that you’re tracking and storing location data in addition to the data they actually need for their application, they leverage an entire database to power the model. With the right deficiencies in logic, a bad actor could easily engineer a prompt to track down any individual using the data stored in that dataset. (This is exactly the tension between open and closed source LLMs.)

Or let’s say the software team knows about that location data but they don’t realize that location data could actually be approximate. They could use that location data to create AI mapping technology that unintentionally leads a 16-year-old down a dark alley at night instead of the Pizza Hut down the block. Of course, this kind of error isn’t volitional, but it underscores the unintended risks inherent to how the data is leveraged.

These examples and others highlight the data team’s role as the gatekeeper when it comes to ethical AI.

So, how can data teams remain ethical?

In most cases, data teams are used to dealing with approximate and proxy data to make their models work. But when it comes to the data that feeds an AI model, you actually need a much higher level of validation.

To effectively stand in the gap for consumers, data teams will need to take an intentional look at both their data practices and how those practices relate to their organization at large.

As we consider how to mitigate the risks of AI, below are three steps data teams must take to move AI toward a more ethical future.

1. Get a seat at the table

Data teams aren’t ostriches — they can’t bury their heads in the sand and hope the problem goes away. In the same way that data teams have fought for a seat at the leadership table, data teams need to advocate for their seat at the AI table.

Like any data quality fire drill, it’s not enough to jump into the fray after the earth is already scorched. When we’re dealing with the type of existential risks that are so inherent to GenAI, it’s more important than ever to be proactive about how we approach our own personal responsibility.

And if they won’t let you sit at the table, then you have a responsibility to educate from the outside. Do everything in your power to deliver excellent discovery, governance, and data quality solutions to arm those teams at the helm with the information to make responsible decisions about the data. Teach them what to use, when to use it, and the risks of using third-party data that can’t be validated by your team’s internal protocols.

This isn’t just a business issue. As United Healthcare and the province of British Columbia can attest, in many cases, these are real people’s lives, and livelihoods, on the line. So, let’s make sure we’re operating with that perspective.

2. Leverage methodologies like RAG to curate more responsible — and reliable — data

We often talk about retrieval augmented generation (RAG) as a resource to create value from an AI. But it’s also just as much a resource to safeguard how that AI will be built and used.

Imagine for example that a model is accessing private customer data to feed a consumer-facing chat app. The right user prompt could send all kinds of critical PII spilling out into the open for bad actors to seize upon. So, the ability to validate and control where that data is coming from is critical to safeguarding the integrity of that AI product.

Knowledgeable data teams mitigate a lot of that risk by leveraging methodologies like RAG to carefully curate compliant, safer and more model-appropriate data.

Taking a RAG approach to AI development also helps to minimize the risk associated with ingesting too much data, as referenced in our location-data example.
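To make the "too much data" point concrete, here is a minimal sketch of data minimization at the retrieval step. It is an illustration under stated assumptions, not a RAG implementation: the field names, the ALLOWED_FIELDS allowlist, and the minimize and build_context helpers are all hypothetical, and a real pipeline would sit between an actual retriever and an actual model call.

```python
# Hypothetical allowlist: the only fields this use case is approved to expose.
ALLOWED_FIELDS = {"title", "genre", "release_year"}


def minimize(record: dict) -> dict:
    """Keep only allowlisted fields; everything else never reaches the prompt."""
    return {key: value for key, value in record.items() if key in ALLOWED_FIELDS}


def build_context(records: list[dict]) -> str:
    """Render minimized records as plain-text context for a prompt."""
    lines = []
    for record in records:
        safe = minimize(record)
        # Sort keys so the rendered context is deterministic.
        lines.append(", ".join(f"{key}={safe[key]}" for key in sorted(safe)))
    return "\n".join(lines)


# Hypothetical retrieved record that carries more than the use case needs.
retrieved = [
    {
        "title": "Example Show",
        "genre": "drama",
        "release_year": 2021,
        "viewer_email": "a@example.com",
        "last_known_location": "40.7,-74.0",
    },
]

context = build_context(retrieved)
```

Using an allowlist rather than a blocklist means a newly added sensitive column is excluded by default until someone explicitly approves it.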

So what does that look like in practice? Let’s say you’re a media company like Netflix that needs to leverage first-party content data with some level of customer data to create a personalized recommendation model. Once you define what the specific — and limited — data points are for that use case, you’ll be able to more effectively define:

Who’s responsible for maintaining and validating that data,
Under what circumstances that data can be used safely,
And who’s ultimately best suited to build and maintain that AI product over time.
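One way to make those three answers explicit is to record them alongside the dataset and check them before the data is used. The sketch below is only an assumption about how that could look: the DatasetPolicy class, its field names, the team names, and the require_permission helper are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatasetPolicy:
    """Hypothetical record of who owns a dataset and what it may be used for."""

    name: str
    data_owner: str      # who maintains and validates the data
    product_owner: str   # who builds and maintains the AI product
    approved_purposes: frozenset = field(default_factory=frozenset)

    def permits(self, purpose: str) -> bool:
        """Return True only if this purpose was explicitly approved."""
        return purpose in self.approved_purposes


def require_permission(policy: DatasetPolicy, purpose: str) -> None:
    """Raise before any data is read if the purpose is not approved."""
    if not policy.permits(purpose):
        raise PermissionError(f"{policy.name} is not approved for {purpose!r}")


# Illustrative values only.
policy = DatasetPolicy(
    name="viewing_history",
    data_owner="data-platform-team",
    product_owner="recommendations-team",
    approved_purposes=frozenset({"personalized_recommendations"}),
)

require_permission(policy, "personalized_recommendations")  # passes silently
```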

Tools like data lineage can also be helpful here by enabling your team to quickly validate the origins of your data as well as where it’s being used — or misused — in your team’s AI products over time.
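At its simplest, a lineage lookup is a graph traversal. The sketch below assumes a hypothetical in-memory lineage graph (the UPSTREAM mapping and its asset names are invented for illustration); dedicated lineage tools build and store this graph for you, but the question being asked is the same: which sources ultimately feed this AI product?

```python
from collections import deque

# Hypothetical lineage graph: each asset maps to the assets it is built from.
UPSTREAM = {
    "recommendation_index": ["content_catalog", "viewing_history"],
    "viewing_history": ["raw_playback_events"],
    "content_catalog": [],
    "raw_playback_events": [],
}


def upstream_sources(asset: str, graph: dict) -> set:
    """Return every asset that directly or indirectly feeds the given asset."""
    seen = set()
    queue = deque(graph.get(asset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited; also guards against cycles
        seen.add(node)
        queue.extend(graph.get(node, []))
    return seen


sources = upstream_sources("recommendation_index", UPSTREAM)
```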

3. Prioritize data reliability

When we’re talking about data products, we often say “garbage in, garbage out,” but in the case of GenAI, that adage falls a hair short. In reality, when garbage goes into an AI model, it’s not just garbage that comes out — it’s garbage plus real human consequences as well.

That’s why, as much as you need a RAG architecture to control the data being fed into your models, you need robust data observability that connects to vector databases like Pinecone to make sure that data is actually clean, safe, and reliable.

One of the most common complaints I’ve heard from customers pursuing production-ready AI is that if you’re not actively monitoring the ingestion of indexes into the vector data pipeline, it’s nearly impossible to validate the trustworthiness of the data.

More often than not, the only way data and AI engineers will know that something went wrong with the data is when that model spits out a bad prompt response — and by then, it’s already too late.
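As a rough illustration of catching problems before they reach the model, the sketch below runs a few checks on documents before they would be embedded and written to a vector index. Everything here is an assumption for the example: the document shape, the 30-day freshness threshold, the simple email pattern, and the check_document and partition helpers. It deliberately makes no vector database calls, and it is far narrower than what a data observability tool covers.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed thresholds and patterns, chosen only for illustration.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
MAX_AGE = timedelta(days=30)


def check_document(doc: dict, now: datetime) -> list:
    """Return a list of problems found in one document (empty means clean)."""
    problems = []
    text = doc.get("text") or ""
    if not text.strip():
        problems.append("empty text")
    if EMAIL_PATTERN.search(text):
        problems.append("possible email address in text")
    updated_at = doc.get("updated_at")
    if updated_at is None:
        problems.append("missing updated_at")
    elif now - updated_at > MAX_AGE:
        problems.append("stale document")
    return problems


def partition(docs: list, now: datetime):
    """Split documents into those safe to ingest and those held back."""
    accepted, rejected = [], []
    for doc in docs:
        problems = check_document(doc, now)
        if problems:
            rejected.append((doc.get("id"), problems))
        else:
            accepted.append(doc)
    return accepted, rejected


# Fixed timestamp so the example is reproducible.
now = datetime(2024, 3, 20, tzinfo=timezone.utc)
docs = [
    {"id": "a", "text": "Refund policy: 30 days.",
     "updated_at": datetime(2024, 3, 10, tzinfo=timezone.utc)},
    {"id": "b", "text": "Contact jane@example.com",
     "updated_at": datetime(2024, 3, 15, tzinfo=timezone.utc)},
    {"id": "c", "text": "  ",
     "updated_at": datetime(2023, 1, 1, tzinfo=timezone.utc)},
]

accepted, rejected = partition(docs, now)
```

Rejected documents are returned with their reasons rather than silently dropped, so someone can investigate the upstream cause instead of discovering it through a bad response.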

There’s no time like the present

The need for greater data reliability and trust is the very same challenge that inspired our team to create the data observability category in 2019.

Today, as AI promises to upend many of the processes and systems we’ve come to rely on day-to-day, the challenges — and more importantly, the ethical implications — of data quality are becoming even more dire.

Building Ethical AI Starts with the Data Team — Here’s Why was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
