Why AI struggles to understand that a six-year-old can’t be a doctor or claim a pension

by time news

2024-07-31 13:10:44

When you go to the hospital for a blood test, the results are recorded and compared with those of other patients and with population data. This allows doctors to compare you (your blood, age, sex, health history, viruses and so on) with the results and histories of other patients, which in turn lets them predict outcomes, manage conditions and develop new treatments.

For centuries, this has been the basis of scientific research: identify a problem, gather data, find patterns, and build a model to solve it. The hope is that artificial intelligence (AI) – specifically machine learning, which builds models from data – will be able to do this much faster, more efficiently and more accurately than humans.

However, training these AI models requires a lot of data – so much that some of it has to be synthetic: not real data from real people, but data that replicates existing patterns. Most synthetic datasets are themselves generated by machine learning AI.

Wild biases from image generators and chatbots are easy to spot, but synthetic data also produces distortions – improbable, biased or outright impossible results. As with images and words, these can be entertaining, but the spread of these systems into every area of public life means the potential for harm is great.



What is synthetic data?

AI models require much more data than the real world can offer. Synthetic data provides a solution: an AI-based analysis examines the statistical distributions in real data and creates a new, synthetic dataset that can be used to train other AI models.

This synthetic ‘pseudo’ data is similar but not identical to the original, which means it can preserve privacy, sidestep data-protection constraints, and be freely distributed or shared.

Synthetic data can also supplement real datasets, making them large enough to train an AI system. Or, if a real dataset is biased (it has too few women, for example, or over-represents cardigans rather than pumps), synthetic data can balance it out. There is an ongoing debate about how far synthetic data can deviate from the original.
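
To make the idea concrete, here is a minimal Python sketch of a naive generator that mimics the statistical distribution of each column independently. It is not the method used in the research described here, and all column names and values are invented; real generators (GANs, copulas and so on) also try to preserve relationships between columns, which is exactly where the problems discussed below arise.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)

    # A tiny, made-up "real" dataset (column names are hypothetical).
    real = pd.DataFrame({
        "age":            [34, 51, 6, 45, 67, 29],
        "marital_status": ["married", "married", "never-married",
                           "divorced", "widowed", "never-married"],
        "occupation":     ["nurse", "teacher", "none", "doctor",
                           "retired", "tech-support"],
    })

    # Naive generator: sample each column from its own observed distribution.
    synthetic = pd.DataFrame({
        col: rng.choice(real[col].to_numpy(), size=1000, replace=True)
        for col in real.columns
    })

    # Because columns are sampled independently here, impossible combinations
    # (e.g. age 6 with occupation "doctor") can appear in the output.
    print(synthetic.head())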

Missing edge cases

Without careful curation, tools that generate synthetic data tend to over-represent whatever already dominates a dataset and to under-represent (or even omit) uncommon ‘edge cases’.

This is what first sparked my interest in synthetic data. Medical research has historically under-represented women and other minorities, and I was concerned that synthetic data would exacerbate this problem. So I teamed up with a technologist, Dr Saghi Hajisharif, to investigate the occurrence of missing edge cases.

Image caption: Visual hallucinations are often easy to spot – an AI-generated image adds a non-existent train line to the Glenfinnan Viaduct, a famous railway bridge in Scotland. Wikimedia Commons

In our research, we used a type of AI called a GAN (generative adversarial network) to create synthetic versions of 1990 US census data. As expected, edge cases were missing from the synthetic datasets. The original data contained 40 countries of origin, but the synthetic version contained only 31 – the synthetic data left out immigrants from nine countries.
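
A check like the one just described can be sketched in a few lines of Python (assuming the real and synthetic data are loaded as pandas DataFrames; the column name used here is an assumption, not the actual census field):

    import pandas as pd

    def missing_categories(real: pd.DataFrame, synthetic: pd.DataFrame,
                           column: str) -> set:
        """Return the category values that the synthetic data dropped."""
        real_values = set(real[column].dropna().unique())
        synth_values = set(synthetic[column].dropna().unique())
        return real_values - synth_values

    # Example usage (DataFrames loaded elsewhere; the column name is a guess):
    # dropped = missing_categories(real_census, synthetic_census, "country_of_origin")
    # print(len(dropped), "countries of origin are missing:", sorted(dropped))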

Once we became aware of this omission, we were able to tweak our methods and restore the missing cases in a new synthetic dataset. It is possible – but only with careful handling.

‘Intersectional hallucinations’ – when AI creates impossible data

Then we started noticing something else in the data: intersectional hallucinations.

Intersectionality is a concept from feminist studies. It describes how power structures produce discrimination and advantage for different people in different ways. It is not just about gender, but also age, race, class, disability and so on, and how these elements ‘combine’ in any given situation.

This framing can inform how we analyze synthetic data – all data, not just population data – because the intersecting parts of a dataset produce complex combinations of whatever the data describes.

In our synthetic dataset, the statistical representation of separate categories was very good. The age distribution, for example, was similar in the synthetic data to the original – not identical, but close. This is good, because synthetic data should be similar to the original, not an exact copy.
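
As a rough illustration of this kind of check, a marginal (single-column) comparison might look like the following Python sketch. Close agreement here is necessary but, as the next paragraphs show, not sufficient; the variable names are placeholders.

    import pandas as pd

    def compare_marginals(real: pd.Series, synthetic: pd.Series) -> pd.DataFrame:
        """Side-by-side relative frequencies for a single column."""
        return pd.DataFrame({
            "real": real.value_counts(normalize=True),
            "synthetic": synthetic.value_counts(normalize=True),
        }).fillna(0.0).sort_index()

    # Example usage:
    # print(compare_marginals(real_census["age"], synthetic_census["age"]))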

We then analyzed our synthetic data for intersections. Some of the more complex intersections were reproduced well, too. For example, in our synthetic dataset, intersections involving income were also quite faithfully preserved. We call this property ‘intersectional fidelity’.

But we also noticed that the synthetic data contained 333 datapoints labeled both “husband/wife” and “never married” – an intersectional hallucination. The AI had not been taught (or told) that this combination is impossible. Of these, more than 100 datapoints were “unmarried husbands earning under 50,000 USD a year”, a combination that was not present in the original data.

On the other hand, the original data included many “widowed women working in technical support”, but they are completely absent from the synthetic version.
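
Detecting these two failure modes – combinations that appear only in the synthetic data, and combinations that appear only in the original – amounts to comparing the sets of value combinations in each. A minimal Python sketch (column names are illustrative, not the actual census fields):

    import pandas as pd

    def intersection_diff(real: pd.DataFrame, synthetic: pd.DataFrame,
                          columns: list) -> tuple:
        """Compare the combinations of values that appear across `columns`."""
        real_combos = set(map(tuple, real[columns].drop_duplicates().to_numpy()))
        synth_combos = set(map(tuple, synthetic[columns].drop_duplicates().to_numpy()))
        hallucinated = synth_combos - real_combos  # e.g. ("husband", "never-married")
        missing = real_combos - synth_combos       # e.g. ("widowed", "tech-support")
        return hallucinated, missing

    # Example usage (column names are assumptions):
    # bad, lost = intersection_diff(real_census, synthetic_census,
    #                               ["relationship", "marital_status"])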

This means we could use our synthetic data to research income-related questions (where there is intersectional fidelity), but not if we were interested in “widowed women working in technical support”. And anyone using it should watch out for the “unmarried husbands” in the results.

The big question is: where does this stop? The hallucinations here involve two- and three-part intersections, but what about four-part intersections? Or five-part? At what point (and for what purposes) does synthetic data become irrelevant, misleading, useless or dangerous?

Inviting intersectional hallucinations

Structured databases exist because the relationships between the columns of a spreadsheet tell us something useful. Think back to the blood test: doctors want to know how your blood compares with normal blood, and with other diseases and treatment outcomes. That is why we structure data in the first place, and why it has been done for centuries.

However, when we use synthetic data, intersectional hallucinations will occur, because synthetic data must differ slightly from the original – otherwise it would simply be a copy. Synthetic data therefore invites hallucinations, but only the right kind: ones that amplify or expand the dataset without creating anything impossible, misleading or biased.

The existence of intersectional hallucinations means that a single synthetic dataset cannot serve many different uses. Each use case will require a bespoke synthetic dataset with its own acceptable hallucinations, and that requires a system for identifying them.

Building reliable AI systems

For AI to be reliable, we need to know which intersectional hallucinations are in its training data, especially when it is used to predict how people will behave, or to control, manage, treat or police us. We need to make sure AI is not trained on dangerous or misleading impossibilities – like the six-year-old doctor receiving pension payments.

But what happens when synthetic datasets are used carelessly? There is currently no standard way to mark them, and they are often mixed with real data. When a dataset is shared for others to use, it can be impossible to know whether it can be trusted, or what is a hallucination and what is not. We need clear, universal methods for identifying synthetic data.
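
There is no such standard today, so the following is purely a hypothetical sketch of what minimal provenance labelling for a synthetic dataset could look like; every field name is invented for illustration.

    import json

    # Hypothetical provenance record shipped alongside a synthetic dataset.
    provenance = {
        "dataset": "census_synthetic_v1",
        "synthetic": True,
        "generator": "GAN (architecture unspecified)",
        "source": "1990 US census extract",
        "known_limitations": [
            "nine countries of origin missing",
            "hallucinated 'husband/wife + never married' records",
            "'widowed women in technical support' absent",
        ],
    }

    with open("census_synthetic_v1.provenance.json", "w") as f:
        json.dump(provenance, f, indent=2)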

Hallucinations in synthetic data may not be as entertaining as a hand with 15 fingers, or instructions to put glue on a pizza. They are dull rows of numbers and statistics, but they will affect us all: sooner or later synthetic data will be everywhere, and by its nature it will often contain intersectional hallucinations. Some we will want, some we won’t – the problem is telling them apart. We need to make that possible before it is too late.
