Amazon Investigates Perplexity AI Over Potential Data-Scraping Violations

June 28, 2024

Amazon Web Services is investigating Perplexity AI over its data-scraping practices after multiple news outlets, including Forbes and Wired, reported that the AI startup is swiping their web archives to train its models without consent or compensation. An AWS rep confirmed Thursday that Amazon is looking into Perplexity’s behavior, Wired reports. The rep also said all AWS clients must follow the instructions in robots.txt files. Robots.txt files are typically added to websites to ask bots and web crawlers not to scrape their data, whether for generative AI tools or other purposes. PCMag, for instance, has a robots.txt that disallows scraping from Perplexity, Anthropic’s Claude, and the GPTBot, to name a few.

“AWS’s terms of service prohibit customers from using our services for any illegal activity, and our customers are responsible for complying with our terms and all applicable laws,” the AWS rep said in a statement.

This month, Perplexity sparked a frustrated response from Forbes over the AI firm’s decision to publish AI-generated news articles that pull from human journalists’ work. Forbes Chief Content Officer Randall Lane accused Perplexity of conducting “cynical theft,” and further alleged that Perplexity is creating “knockoff stories” using “eerily similar wording” and “entirely lifted fragments” from its articles. Forbes is also taking issue with the lack of adequate citation and the omission of the outlet’s name in the AI-generated stories.

While many bots adhere to the robots.txt standard, others do not. Perplexity, OpenAI, and Anthropic have all been accused of purposefully ignoring it, which may have inspired Reddit’s recent decision to take further action to try to lock down its own content. Last week, Wired reported that Perplexity is ignoring the robots.txt standard and dubbed Perplexity’s AI a “bullshit machine.” The publication identified an IP address it believes Perplexity is using to crawl its sites, as well as those of its parent company, Condé Nast. The Guardian, Forbes, and The New York Times also told Wired they have seen the same IP address on their servers.

Notably, Perplexity is backed by Amazon founder Jeff Bezos’ Family Fund as well as Nvidia. The AI startup is trying to position itself as a Google competitor with the goal of offering an AI-powered “answer engine.” PCMag has reached out to Amazon and Perplexity for comment.
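
For readers curious how the robots.txt mechanism works in practice, here is a minimal Python sketch of the check a well-behaved crawler could run before fetching a page. The file contents, the `example.com` URL, and the `is_allowed` helper are illustrative assumptions, not PCMag's actual robots.txt or any company's crawler code. The user-agent names are used only as sample tokens.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only; not any real site's file.
EXAMPLE_ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/news/story"
    for agent in ("PerplexityBot", "GPTBot", "SomeOtherBot"):
        verdict = "allowed" if is_allowed(EXAMPLE_ROBOTS_TXT, agent, page) else "disallowed"
        print(f"{agent}: {verdict}")
```

Nothing in this check is enforced by the website itself: the file only states the site's request, and it is up to each crawler to run a check like this and honor the answer. That is why the accusations above center on bots choosing to ignore the file.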

Tech firms’ attitudes toward news sites and other web content have sparked ongoing backlash more broadly. Google and OpenAI have admitted that they train their AI tools on “publicly available” data, but haven’t provided full training data transparency.

Microsoft’s AI CEO, Mustafa Suleyman, claimed this week that any content on the “open web” is “fair use” for AI companies to scrape, use, and monetize for their own financial gain because he believes a “social contract” has been in place for decades that permits this behavior. The New York Times, however, is suing OpenAI and Microsoft, alleging that they infringed its copyright by training on and pulling from its articles without its consent.

While some media outlets are rejecting and fighting AI firms’ nonconsensual scraping of their sites, others, like Semafor, TIME, and The Financial Times, have signed AI deals to proactively license their content.
