Legal Use of LLM Training Data Reduces AI Business Risks

권순우 2024.05.12 05:09 PDT
(Source: BECU AI)

Using Legal News Data for AI Training Reduces Business Risks
Content safeguards are here to stay

OpenAI and Google have collected training data for their large language models (LLMs) indiscriminately, gathering press articles, YouTube videos, podcasts, and more without authorization.

The New York Times reported as recently as last month that OpenAI staff were aware of the potentially illicit nature of this collection, but believed the training of their AI remained aligned with its designated objectives.

In Silicon Valley, where 'data-hungry' has become a byword, major U.S. companies are exploiting vast datasets to fuel their AI operations.

This insatiable demand has given rise to 'synthetic data,' crafted to supplement the immense volume of data, from hundreds of billions to trillions of tokens, necessary for training sophisticated AI models.

In this landscape, the lawful utilization of 'refined' news data by AI platforms stands out as a global anomaly, drawing significant interest.

South Korea’s BECU AI, formerly known as Bflysoft, leads in providing cutting-edge knowledge data across various domains including news, media, and legal sectors.

BECU AI’s CEO, Lim Kyunghwan, underscores the pivotal roles of data quality and quantity in shaping AI's efficacy and reliability. Lim also highlighted the crucial role of real-time news data in training AI systems, while addressing the ethical and legal ramifications concerning data sourcing and usage.

Lim further discussed the strategic value of news data and the legal frameworks supporting its legitimate use, stressing the necessity for stringent compliance with established protocols. Under Lim’s stewardship, BECU AI has recalibrated its mission to amplify its AI endeavors and is gearing up for upcoming international projects.

Lim KyungHwan, CEO of BECU AI (출처 : BECU AI)

Why is there a shortage of LLM training data?

Lim : The widespread reliance on Wikipedia for training language models is encountering scrutiny due to its limitations and potential biases, sources familiar with the matter say.

Wikipedia has long been a staple in AI development, offering an expansive trove of text data from over 100 countries that is both accessible and free. This has significantly aided in training AI systems to grasp a multitude of languages and cultural nuances globally.

However, industry insiders point out that Wikipedia's data is unverified, raising concerns over the accuracy of AI learning.

AI models trained on such data may inherit and perpetuate any biases or inaccuracies present, potentially compromising the reliability of their outputs. Given Wikipedia's reputation for credibility, there is a risk that outcomes derived from its data are also perceived as equally credible, despite possible flaws.

Further complicating the training process is the inconsistent structure and format of Wikipedia entries. The style and organization of content can vary widely depending on the contributors, making it challenging for AI to parse and learn from the data effectively. Variations in the presentation and depth of content on similar topics can lead to uneven AI performance.

To address these issues, additional resources and time are required to verify the factual accuracy of Wikipedia's content and to filter out irrelevant or erroneous information, highlighting a growing challenge in the field of AI development as reliance on open-source data faces increasing scrutiny.

The legal use of refined news data by AI platforms is drawing significant interest as a global anomaly, highlighting the importance of data quality and quantity in shaping AI's efficacy and reliability.

News data is also being used for AI training. Why is it important?

Lim : The role of real-time updated news data is increasingly critical in the development of artificial intelligence systems, providing a wealth of information on global events that helps these technologies learn and adapt.

This continuously updated stream of data is invaluable for AI to accurately understand real-world scenarios, language dynamics, and the evolving patterns of human behavior and thought.

Industry leaders emphasize that such data is not just beneficial but essential for training AI to operate effectively in diverse environments. It enables AI systems to process human language in context and predict future trends based on current events, which is pivotal for businesses and governments alike looking to leverage AI for decision-making and strategy development.

As AI technology becomes more integrated into commercial applications and public services, the demand for timely and comprehensive news data feeds to train these systems is surging.

This reliance on up-to-the-minute information underscores the rapid pace at which AI is evolving and the corresponding need for data that keeps pace with global changes.

So, how are companies utilizing news-based data?

Lim : Several AI firms, including OpenAI, are facing legal challenges due to their past practice of using online data from press agencies without authorization.

This has raised significant copyright issues, as numerous media outlets have initiated lawsuits claiming unauthorized use of their proprietary content, leading to financial losses.

Companies must now navigate the complexities of acquiring appropriate permissions or establish formal agreements with the respective press agencies to use their content legally.

This requirement serves not only to sidestep potential legal conflicts but also to honor and remunerate the creators of news content appropriately.

While these steps may involve substantial costs and effort, they are seen as essential for reducing future legal risks and fostering sustainable relationships with media companies.

Amid this backdrop, the EU is preparing to enforce its AI Act in 2026, which mandates transparency from providers of AI models, including makers of general-purpose models such as GPT-style chatbots.

The regulation will require these firms to adopt data usage policies that comply with EU copyright laws and to disclose detailed summaries of the news content utilized in training AI models, ensuring a clear, accountable framework for data usage in the burgeoning AI sector.

Should AI companies negotiate with individual media outlets in each country?

Lim : BECU AI, an official news copyright distribution agency endorsed by the Korea Press Foundation, offers a unique solution for AI companies seeking lawful access to news data: a comprehensive blanket contract.

With over two decades of experience in managing large-scale news big data, BECU AI holds the most extensive collection of media data in South Korea, encompassing content from more than 600 domestic press outlets.

The company’s extensive network includes legal partnerships with 580 print publications, 60 broadcasters, telecommunications entities, and 2,600 online media platforms.

BECU AI's offerings are not limited to the mere provision of source data; the firm also supplies labeled news data and operates an advanced system for 24-hour real-time automatic labeling, which significantly enhances the data's accuracy and reliability.
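BECU AI has not published the internals of this labeling system, so the following is only a minimal sketch of what a real-time labeling stage might look like. The `Article` structure, the keyword-to-category map, and the matching rule are all illustrative assumptions, not the company's actual method.

```python
# Minimal sketch of a real-time news-labeling stage (hypothetical;
# not BECU AI's actual pipeline). Each incoming article is tagged
# with coarse topic categories before delivery to downstream consumers.
from dataclasses import dataclass, field


@dataclass
class Article:
    outlet: str
    headline: str
    body: str
    labels: list[str] = field(default_factory=list)


# Assumed keyword-to-category map; a real system would use a trained
# classifier instead of substring matching.
CATEGORY_KEYWORDS = {
    "economy": ["inflation", "exports", "stock market"],
    "technology": ["semiconductor", "llm", "artificial intelligence"],
    "politics": ["election", "parliament", "legislation"],
}


def label_article(article: Article) -> Article:
    """Attach every category whose keywords appear in the article text."""
    text = f"{article.headline} {article.body}".lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            article.labels.append(category)
    return article


if __name__ == "__main__":
    sample = Article(
        outlet="Example Daily",
        headline="Semiconductor exports lift the stock market",
        body="Analysts credit demand for artificial intelligence servers.",
    )
    print(label_article(sample).labels)  # ['economy', 'technology']
```

A production system would replace the keyword matching with a trained classifier and run continuously against the incoming article stream.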

A key advantage of partnering with BECU AI is its robust capability in securing legal rights to news data. Through diligent copyright management and distribution practices, BECU AI not only safeguards the rights of original content creators but also ensures that its clients can use this data legally and ethically, thus mitigating legal risks and fostering trust among users of news data.

What are the advantages of a single blanket contract with a specialized AI news data company?

Lim : Navigating the complex landscape of acquiring news data for AI applications can be a daunting task when approached on a country-by-country basis.

The challenge extends beyond mere logistics; it encompasses the negotiation of varied data supply standards, ranges, and pricing, along with the intricate legal frameworks that govern data usage in different jurisdictions.

Direct negotiations with individual press agencies often involve time-consuming discussions and can lead to elevated costs due to the agencies' high demands.

Additionally, these agencies might not always have the necessary technical and legal expertise related to data pricing and sales, which can further complicate negotiations and lead to misunderstandings or delays.

In contrast, partnering with a company that specializes in AI-oriented news data offers a more streamlined and efficient solution.

Such companies have already established supply prices and terms with multiple press agencies and bring a wealth of experience to negotiations.

Furthermore, they possess deep expertise in navigating the legal intricacies and managing data effectively, thereby reducing the risk and enhancing the efficiency of the data acquisition process for AI companies.

BECU AI leads in providing cutting-edge knowledge data across various domains including news, media, and legal sectors.

How can BECU AI provide timely information to LLM companies?

Last year, BECU AI launched 'RDP LINE (Real-time Data PipeLine),' a cutting-edge news data distribution platform tailored for AI companies.

RDP LINE is engineered to streamline the acquisition and application of news data, facilitating the training of AI models with high-quality, refined data, and providing real-time news data pipeline services.

The platform, leveraging BECU AI's two decades of expertise in news big data processing, provides access to a comprehensive archive of news from over 700 diverse media outlets, spanning historical to current events. This vast repository enables companies to efficiently locate and procure the specific news data they require for their operations.

RDP LINE's robust infrastructure not only simplifies the data procurement process but also ensures the availability of diverse and consistent data. This is crucial for the development of AI models, allowing for more sophisticated and balanced learning by exposing them to a wide range of perspectives and information.
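RDP LINE's client interface is not publicly documented, so the sketch below only illustrates the general shape of consuming a real-time news pipeline. The endpoint URL, the bearer-token authentication, and the `articles`/`next_cursor` response fields are hypothetical placeholders.

```python
# Hypothetical client for a real-time news data pipeline. The endpoint,
# auth scheme, and response fields are illustrative assumptions;
# RDP LINE's actual interface is not publicly documented.
import time

import requests

API_URL = "https://api.example.com/v1/news/stream"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"                            # placeholder credential


def poll_news(cursor=None):
    """Fetch one batch of articles published since `cursor`."""
    response = requests.get(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={"since": cursor} if cursor else {},
        timeout=10,
    )
    response.raise_for_status()
    payload = response.json()
    for article in payload.get("articles", []):
        # Downstream steps: labeling, deduplication, storage for LLM training.
        print(article.get("outlet"), "-", article.get("headline"))
    return payload.get("next_cursor")


if __name__ == "__main__":
    cursor = None
    while True:
        cursor = poll_news(cursor)
        time.sleep(60)  # poll once a minute; a real feed might push instead
```

A push-based feed (webhooks or a message queue) would serve the same role with lower latency; polling is shown here only because it is the simplest self-contained illustration.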

Which companies are using it so far?

In a significant move to bolster local AI capabilities, BECU AI has inked deals to provide major South Korean conglomerates, including Samsung Electronics, LG Electronics, SK Telecom, and KT, with the news data essential for training their large language models (LLMs).

Particularly noteworthy is SK Telecom's AI service, A dot, which is actively delivering real-time news via a sophisticated data pipeline. These companies are also engaged in collaborative projects spearheaded by government bodies such as the National Information Society Agency (NIA), aimed at constructing robust AI learning databases.

Through these initiatives, the firms have acquired vital technologies spanning the full spectrum of news data management—from collection and refinement to processing techniques like document summarization and detection of sensational content, ultimately facilitating the efficient use and distribution of news data. This strategic integration of advanced technologies underscores South Korea's commitment to advancing its AI infrastructure and capabilities.
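The article does not specify how these processing techniques are implemented, so the sketch below pairs a naive frequency-based extractive summarizer with a keyword flag for sensational content. Both heuristics and the word list are assumptions for demonstration only, not the technology the firms actually acquired.

```python
# Illustrative sketches of two processing steps named above: extractive
# summarization and sensational-content detection. Both heuristics and
# the word list are assumptions for demonstration, not the actual methods.
import re
from collections import Counter

SENSATIONAL_WORDS = {"shocking", "outrage", "explosive", "scandal"}  # assumed list


def summarize(text: str, num_sentences: int = 2) -> str:
    """Score sentences by total word frequency and keep the top ones in order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: sum(freq[w] for w in re.findall(r"[a-z']+", sentences[i].lower())),
        reverse=True,
    )
    keep = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in keep)


def is_sensational(headline: str) -> bool:
    """Flag headlines containing any word from the assumed sensational list."""
    tokens = set(re.findall(r"[a-z']+", headline.lower()))
    return bool(tokens & SENSATIONAL_WORDS)


if __name__ == "__main__":
    story = (
        "Chipmakers posted record exports this quarter. "
        "Analysts credit demand for AI servers. "
        "One executive said the pace of orders keeps rising."
    )
    print(summarize(story))
    print(is_sensational("Shocking scandal rocks the chip market"))  # True
```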

Is it possible to secure global news data for AI?

The 'RDP LINE Alliance,' in partnership with prominent media companies across various nations, is broadening its capabilities to supply real-time news data for AI learning and services through a comprehensive platform.

This strategic expansion aims to furnish major technology firms with lawful access to a diverse array of news data from around the world.

Facilitating this global outreach, TheMiilk, a Silicon Valley-based media company and agency, has solidified its commitment by signing a memorandum of understanding and the main contract.

TheMiilk is now actively implementing strategies to enhance the global news data pipeline, positioning itself as a key player in the international AI and data services market. This move is expected to provide significant leverage to tech companies seeking to innovate and improve their AI offerings with legally sourced, diverse news content.

What is BECU AI?

Established in 1988, BECU AI has spent more than 20 years providing a service that connects data from over 3,000 domestic media outlets, securing proprietary big data technology as it has grown. It is researching technologies and data interfaces that effectively and efficiently support companies seeking to change the competitive paradigm through high-quality enhancement of individual datasets and the adoption of data-centric artificial intelligence.

It is also pursuing expansion as a global Multi Contents Provider (MCP), offering solutions and services that span from the 'Data Pre-Processing' stage to 'End-Point' news data supply and monitoring. Last month, Bflysoft changed its name to 'BECU AI' and began its leap as a core company in the AI field, building on its big data and media-industry technology expertise.

👉For more information: https://becuai.com
