If we want to keep the web open and thriving, we need more tools that let content creators express how they want their data to be used while still allowing open access. Today the choices are too limited. Either website operators keep their content open to the web and risk it being used for unwanted purposes, or they move their content behind logins and limit their audience.
To address the concerns our customers have today about how their content is being used by crawlers and data scrapers, we are launching the Content Signals Policy. This policy is a new addition to robots.txt that allows you to express your preferences for how your content can be used after it has been accessed.
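To give a sense of the shape this takes, a content signal sits alongside the familiar robots.txt directives. The snippet below is a simplified sketch rather than the canonical syntax, and the exact signal names and values are defined by the policy itself:

# Illustrative sketch of a content signal expressing preferences
# for how content may be used after it has been accessed
Content-Signal: search=yes, ai-train=no

User-agent: *
Allow: /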
What robots.txt does, and does not, do today
Robots.txt is a plain text file hosted on your domain that implements the Robots Exclusion Protocol. It allows you to specify which crawlers and bots can access which parts of your site. Many crawlers and some bots obey robots.txt files, but not all do.
For example, if you wanted to allow all crawlers to access every part of your site, you could host a robots.txt file that has the following:
User-agent: *
Allow: /
A user-agent is how your browser, or a bot, identifies itself to the resource it is accessing. In this case, the asterisk tells visitors that any user agent, on any device or browser, can access the content. The / in the Allow field tells the visitor that they can access any part of the site as well.
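The same directives can also narrow access. For example, to ask one specific crawler to stay out of part of a site while leaving everything else open, a robots.txt file could look like the following (the bot name here is just a placeholder):

# Ask a single crawler (placeholder name) to avoid one directory
User-agent: ExampleBot
Disallow: /private/

# Everyone else may access the whole site
User-agent: *
Allow: /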
The robots.txt file can also include commentary by adding characters after a # symbol. Bots and machines will ignore these comments, but they are one way to leave more human-readable notes for someone reviewing the file. Here is one example:
#    .__________________________.
#    | .___________________. |==|
#    | | ................. | |  |
#    | | ::[ Dear robot ]: | |  |
#    | | ::::[ be nice ]:: | |  |
#    | | ::::::::::::::::: | |  |
#    | | ::::::::::::::::: | |  |
#    | | ::::::::::::::::: | |  |
#    | | ::::::::::::::::: | | ,|
#    | !___________________! |(c|
#    !_______________________!__!
#   /                            \
#  /  [][][][][][][][][][][][][]  \
# /  [][][][][][][][][][][][][][]  \
#(  [][][][][____________][][][][]  )
# \ ------------------------------ /
#  \______________________________/