Unpacking the AI Shift: Copyright, Disinformation, and the Future of News

The subversion of trust creates significant risk

2024 promises to unpack some of the implications of the AI shift, with realizations about its impact suddenly coming to light. (That just happened!)

As we begin to understand the composition of ‘training data’ and its downstream effects, issues arise.

The New York Times’ copyright infringement claim against OpenAI and Microsoft over AI-generated content, and the ensuing discussions, signal a growing awareness of intellectual property (IP) concerns. As more instances of AI-generated IP infringement come to light, legal consequences for other users of these tools are likely.

The New York Times is the largest proprietary source in the Common Crawl data used to train GPT.

The sheer volume of data required for AI training presents a challenge. Readily available datasets such as social media posts were easily ingested (garbage in, garbage out), but access to high-quality, curated data, like that of The New York Times, is crucial for producing reliable and ethical AI outputs.

The recent demonstration of ‘substantial similarity’, or outright copying, between New York Times articles and ChatGPT’s outputs moves AI-generated copyright infringement beyond a theoretical concern.

On social media, the loudest voices against this lawsuit’s implications belong to the people who stand to benefit the most as AI moves unchecked into mass adoption.

What is clear to me is that AI systems like DALL-E and ChatGPT, trained on undisclosed materials including copyrighted works, create content without indicating potential infringement.

The complexity and unpredictability of the tool’s output shift greater responsibility to the company that builds it.

Beyond copyright concerns, AI’s ability to generate false information attributed to real sources poses a significant threat. The ease with which AI can create fabricated articles, mimicking the style and tone of trusted publications like The New York Times, raises serious concerns about large-scale disinformation. The ability to leverage copyrighted materials for disinformation generation further amplifies the potential harm.

The year ahead of unpacking LLMs requires us to confront these challenges. By scrutinizing these complexities, we can lay the groundwork for new values in how data is used: transparency, ethical data use, and user accountability. The goal is a future shaped by responsible and trustworthy AI-generated content, not by fabricated narratives woven from stolen fabrics.

Not your model, not your property.