Korea Needs Special Law to Safeguard Culture in AI Era

August 19, 2025

The Challenge of Building AI Datasets

Artificial intelligence needs large volumes of data to learn and improve. But how much data is enough to be internationally competitive? DeepSeek, a Chinese AI startup, reportedly trained its models on around 50 TB of text. If 20% of that data is in Chinese, that is approximately 10 TB, roughly equivalent to 30 million books. The National Library of Korea holds about 10 million volumes. Even if every Korean book ever published were digitized, the total would still fall far short of this volume.
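The comparison above can be checked with a back-of-envelope calculation. The sketch below uses only the figures quoted in this article (50 TB, a 20% share, 30 million books, 10 million library volumes). Two things are assumptions made here for illustration and are not stated in the source: that a terabyte means 10^12 bytes, and that a Korean book takes up about as many bytes as the average book implied by the 10 TB figure.

```python
# Back-of-envelope check of the dataset comparison.
# Assumption: decimal terabytes (1 TB = 10**12 bytes).
TB = 10**12


def implied_bytes_per_book(corpus_bytes: float, books: int) -> float:
    """Average size of one book if `books` books fill `corpus_bytes`."""
    return corpus_bytes / books


def corpus_size_tb(books: int, bytes_per_book: float) -> float:
    """Size in TB of a corpus of `books` books at the given average size."""
    return books * bytes_per_book / TB


# Figures quoted in the article.
total_tb = 50                    # reported DeepSeek text corpus
chinese_share_percent = 20       # hypothetical share used in the article
books_equivalent = 30_000_000    # "equivalent to 30 million books"
library_volumes = 10_000_000     # National Library of Korea holdings

chinese_tb = total_tb * chinese_share_percent / 100
bytes_per_book = implied_bytes_per_book(chinese_tb * TB, books_equivalent)

# Assumption: a Korean book is about the same size in bytes as the implied average.
korean_tb = corpus_size_tb(library_volumes, bytes_per_book)

print(f"Chinese-language share: {chinese_tb:.0f} TB")
print(f"Implied size per book: {bytes_per_book / 1000:.0f} KB")
print(f"Library collection as text: {korean_tb:.1f} TB")
```

Under these assumptions, the library's entire collection would come to roughly a third of the 10 TB Chinese-language share, which is consistent with the claim that digitizing Korean books alone would fall far short.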

Text alone is no longer sufficient for modern AI systems. We are now entering an era of large world models — systems that learn not only from words but also from images and records of human actions. These models are crucial for applications such as self-driving cars, robotics, and medicine. At this stage, tens of terabytes of multimodal data are required. Without access to such extensive datasets, Korea risks falling behind in the global technology landscape.

Legal Hurdles in Data Collection

The biggest obstacle to building these massive training datasets is the legal framework. Copyright, database rights, privacy law, and portrait rights all complicate the collection and use of data. Eliminating even minor risks of infringement is nearly impossible. Lawsuits over data usage are already piling up in Korea and other countries. Years of debate have not settled the issue, yet the rapid development of AI demands a swift solution.

One potential answer is a special law tailored to training data. Simply allowing companies to use all works without restriction would unfairly disadvantage rights holders. Instead, Korea could consider exempting from liability the use of certain categories of data that carry low infringement risk but high value for AI training. Given the fast pace of technological change, such a law could be introduced for a limited period and made subject to regular review and renewal.

Limiting Exemptions and Ensuring Fairness

These exemptions should be carefully controlled. They should apply only to general-purpose AI, meaning models that enhance productivity and benefit society broadly, and only to large-scale systems requiring multiple terabytes of data. For example, a parameter count in the hundreds of billions could serve as the threshold.

Interestingly, large AI systems may actually help protect the rights of content creators. General-purpose models are increasingly capable of recognizing legal and ethical boundaries, including identifying potential copyright violations. Recent advancements show that large vision-language models can already assess infringement risks. However, these capabilities are only possible after training on vast datasets. In other words, to prevent copyright violations, mass training on copyrighted works may need to be allowed — another reason to consider legal exemptions.

Safeguards and Public Responsibility

However, these exemptions must not be unconditional. The benefits of AI training should feed back into Korea’s industrial ecosystem. Several safeguards could be implemented. For instance, training data could be required to be stored and processed domestically, preventing uncontrolled overseas transfers and ensuring economic value stays within the country. Transparency and accountability measures, such as a government-run registration system for training datasets, could also strengthen trust and safety.

This approach would not only apply to private firms but could also empower the government to build and provide large-scale training databases through public institutions. One of the most symbolic resources would be the National Library of Korea’s collection of over 10 million volumes, which could be digitized and refined into a high-quality text database. Rather than leaving companies to purchase books individually, the state could supply standardized, legally sound data.

Expanding the Public Data Hub

Such a public data hub could incorporate other Korean-language resources, including online archives, academic papers, court rulings, and textbooks. Over time, it could expand to include speech and video data. Applying the special law to this effort would reduce costs and eliminate legal uncertainty around these vast resources.

Designing such a law will require broad debate and careful consideration. However, delay is no longer an option. With the current legal framework, building training datasets of tens of terabytes is nearly impossible. If nothing changes, the Korean language and Korean culture risk being sidelined in the age of AI.

Korea must act quickly to adapt its laws to the new technological environment. Passing a special law for AI training data would be the first step toward securing a competitive position in the global AI landscape.
