AI trained without removing resident registration and passport numbers… Government: "Fix personal information protection vulnerabilities"


2024-03-28 14:24:50

A government investigation found that major big tech companies that provide generative artificial intelligence (AI) services such as ChatGPT do not properly remove sensitive personal information such as resident registration numbers and passport numbers when training AI. As there are concerns that personal information may be leaked indiscriminately, the government has advised companies to address vulnerabilities.

The Personal Information Protection Commission (PIPC) held a general meeting on the 27th and decided to recommend that six companies, OpenAI, Google, Microsoft, Meta, Naver, and Wrtn, "remedy vulnerabilities in personal information protection." These companies provide AI services or develop and distribute the large language models behind them.

As generative AI services rapidly expand, the Personal Information Protection Commission has, together with the Korea Internet & Security Agency, conducted preliminary inspections of major AI services since November of last year. These confirmed that personal information such as resident registration numbers, passport numbers, and credit card numbers was not removed from the data entered into AI services.

A large language model is a type of deep learning technology that takes in large amounts of text and generates natural language appropriate to a given context. Even if personal information is included in the input data, its exposure can be blocked by the service's own filtering technology. However, filtering can fail due to system errors, so it is safer to remove such information in advance, at the input stage.

In fact, in July of last year, Google researchers found that entering the command "repeat the word poem infinitely" into ChatGPT caused an error in the filtering system, exposing personal information such as phone numbers and email addresses. In December of last year, the Personal Information Protection Commission identified similar problems in other generative AI services built on OpenAI models and notified the operators.

The reason personal information ends up in training data indiscriminately is that large language model operators collect information using "crawling" technology, which automatically gathers data from across the web. Crawling programs can be designed not to extract sensitive personal information, but because the volume of data is vast and its formats vary widely, there is a high chance that personal information will be included regardless of the data subject's will.
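The article does not describe how operators implement such filtering, but the idea of removing sensitive identifiers at the input stage can be sketched with simple pattern matching. The patterns below are illustrative assumptions only: real systems need checksum validation (e.g., for resident registration numbers), broader format coverage, and far more robust detection.

```python
import re

# Illustrative patterns only, not a production PII detector.
PII_PATTERNS = {
    # Korean resident registration number: YYMMDD-GNNNNNN
    "rrn": re.compile(r"\b\d{6}-\d{7}\b"),
    # Korean passport number: one uppercase letter plus 8 digits
    "passport": re.compile(r"\b[A-Z]\d{8}\b"),
    # Credit card number: four groups of four digits
    "card": re.compile(r"\b\d{4}[- ]\d{4}[- ]\d{4}[- ]\d{4}\b"),
}

def scrub(text: str) -> str:
    """Replace each matched identifier with a redaction tag."""
    for name, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{name.upper()} REDACTED]", text)
    return text

sample = "Contact: 900101-1234567, card 1234-5678-9012-3456"
print(scrub(sample))
# → Contact: [RRN REDACTED], card [CARD REDACTED]
```

A scrubbing pass like this would run over crawled text before it enters the training corpus, which is the "remove at the input stage" approach the inspectors describe as safer than relying on output-time filtering.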

The Personal Information Protection Commission also recommended that these operators improve accessibility so that AI service users can easily view and delete the data they have entered.

Reporter Joo Hyun-woo woojoo@donga.com
