Several AI firms have reportedly been disregarding a protocol designed to let websites opt out of scraping. The Robots Exclusion Protocol, better known as robots.txt, was established in 1994 to tell web crawlers which pages they are allowed to access, yet it is being overlooked by these companies. The development has raised concerns among publishers and content creators about how their data is being used.
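For context, compliance with robots.txt is entirely voluntary: a crawler must choose to download the file and honor its rules. The sketch below, written with Python's standard urllib.robotparser module and using a placeholder site and user-agent string, shows roughly how a well-behaved crawler checks permission before fetching a page.

    # Minimal sketch of a crawler consulting robots.txt before fetching a page.
    # "example.com" and "ExampleBot" are placeholders, not real entities.
    from urllib.robotparser import RobotFileParser

    robots = RobotFileParser()
    robots.set_url("https://example.com/robots.txt")
    robots.read()  # download and parse the site's robots.txt rules

    # Ask whether this crawler (identified by its user-agent) may fetch a URL.
    url = "https://example.com/articles/some-story"
    if robots.can_fetch("ExampleBot", url):
        print("Allowed: fetch the page")
    else:
        print("Disallowed: skip the page")  # honoring this is voluntary, not enforced

Nothing technically stops a crawler from skipping this check and fetching the page anyway, which is exactly the behavior publishers are objecting to.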
According to a letter from TollBit, a startup that facilitates licensing agreements between publishers and AI firms, "AI agents from multiple sources (not limited to one company) are choosing to bypass the robots.txt protocol for retrieving content from websites." While adherence to robots.txt is not legally mandatory, it has long been standard practice on the web. The letter did not name specific companies, but reports indicate that OpenAI and Anthropic, creators of the popular ChatGPT and Claude chatbots respectively, are among those disregarding the protocol.
Implications for Publishers and AI Companies
The unauthorized copying of website content by AI companies poses challenges for both publishers and the AI industry. Publishers worry about misuse of their intellectual property and loss of control over their content. They are concerned that AI-generated summaries or rephrased versions of their articles could be inaccurate or fail to give credit, potentially harming their reputation and income.
On the other hand, AI companies argue that the Robots Exclusion Protocol is not a legally binding framework and that a new understanding between publishers and AI firms may be necessary. Some AI leaders have defended their actions, suggesting that publishers may have to adjust to an evolving landscape of content creation and distribution in the age of artificial intelligence. Nevertheless, the lack of transparency and the apparent disregard for website owners' preferences have raised concerns about these companies' practices.