Jul 15, 2025
Reading time: 6 Min.

Creative Commons Signals: Fairness in the Age of Generative AI?

Creative Commons (CC) publicly announced the CC Signals project on June 26, 2025. The project is introduced as "a major step forward in building a more equitable, sustainable AI ecosystem rooted in shared benefits". Signals is currently in a pre-alpha state, and CC is seeking public feedback. The alpha launch is scheduled for November 2025. I was intrigued by how CC might imagine a sustainable AI ecosystem.

Prelude

CC has always been about providing a middle ground between "all rights reserved" copyright law and having no control at all over the dissemination of your content. The idea is that whoever uses the commons should contribute to the commons in turn, by treating shared content as indicated by the author. Consequently, the Signals project appears to seek a way of adapting this spirit to the new ubiquity of generative AI (GenAI).

At the same time, there is a renewed discussion around the regulation of web crawlers and web scraping. The regulation of automatic processing of web content has been with the web since its infancy. The now ubiquitous robots.txt was originally introduced back in 1994 to regulate the load that crawlers put on servers.
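For readers who have never looked inside a robots.txt file, here is a minimal sketch of how such rules are expressed and evaluated, using the parser from Python's standard library. The crawler names (ExampleAIBot, SomeSearchBot), the example.org URLs, and the rules themselves are made up for illustration. Keep in mind that robots.txt is purely advisory: a crawler has to choose to honour it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: shut out one (made-up) AI crawler entirely,
# and keep every other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    for agent, url in [
        ("ExampleAIBot", "https://example.org/blog/post"),
        ("SomeSearchBot", "https://example.org/blog/post"),
        ("SomeSearchBot", "https://example.org/private/notes"),
    ]:
        verdict = "allowed" if parser.can_fetch(agent, url) else "disallowed"
        print(f"{agent} -> {url}: {verdict}")
```

The file only states the site owner's wishes per user agent. Nothing in it describes what a crawler owes in return, which is the gap that the rest of this post is about.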

Discourse around the ethical nature and regulation of web crawlers and web scraping became more prominent with the advent of internet search engines during the 2000s. Their large-scale reuse of content also brought attention to copyright issues. A prominent case in 2006, for example, saw Google being sued with the claim that their caching violates copyright.

Over time, the internet settled on the following pragmatic agreement: web crawlers are acceptable so long as they provide reciprocation in the form of traffic. In other words, it is okay that Google crawls my site because I get traffic from that. With this implicit social contract between search engines and content providers, the discussion around regulating web crawlers settled down.

The Contemporary Issue

The rise of GenAI and large language models (LLMs) has eroded the aforementioned social contract. GenAI is fuelled by large amounts of data scraped from the web. It can use this data in two ways:

  1. as training data for the model itself (i.e., the data is "baked into" the LLM).
  2. as augmentation after training and deployment, in response to a prompt; e.g., retrieval augmented generation (RAG).

GenAI will rarely, if ever, attribute output drawn from its training data. This is relevant from two perspectives: 1) through the lens of the Commons, it means no credit is given, and 2) with regard to the social contract, it means no traffic is generated.

As for RAG: most GenAI will cite sources of information retrieved from the web in real time (in my experience, at the time of writing). This potentially provides attribution to and traffic for the author. Nevertheless, by aggregating and summarising information, GenAI heavily reduces the incentive for the user to actually visit the content provider's site.

Enter CC Signals

In From Human Content to Machine Data, CC reasons that the erosion of the social contract around web crawlers will lead to the enclosure of information, to the detriment of the public interest.

To sustain public access to knowledge, CC argues that reciprocity between content creators and AI models is required. They propose so-called signals, a suite of elements that indicate to AI operators how they should contribute back to the commons. Following the argument of CC, these signals would be the foundation for reciprocity.

A signal can combine up to four core elements. At the time of writing, these elements are expressed as intent at best and are hence rather vague.

  1. Credit: this seems to be the most straightforward signal. It does what the name suggests: require attribution. The authors expect "this signal to require citation of the training dataset by the reuser" at minimum. For RAG and similar techniques, "outputs must cite the collection as a source with a link".
  2. Direct Contribution: this appears to be aimed at helping the provider of a data set deal with the costs of making data accessible. If you are running, e.g., a content platform, I surmise that using this signal would request some contribution from AI operators toward keeping your platform up and running.
  3. Ecosystem Contribution: this is phrased rather nebulously, and I expect it to take shape only over the course of the alpha phase. Seemingly, it is aimed at giving back to a broader system than the Direct Contribution signal.
  4. Open: a more concise signal, this appears to interpret openness of the AI models themselves as reciprocity. My interpretation is that this signal is satisfied if an AI operator publishes weights, training procedure, code, etc. of their AI model. So, in essence: "you may only use my data set if your model is open source".

CC proposes four combinations of signals, which they consider mutually exclusive in the current state of the proposal: Credit, Credit + Direct Contribution, Credit + Ecosystem Contribution, and Credit + Open. (I am not sure why the Open signal is mutually exclusive with the Contribution signals.)
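To make the combination rule concrete, here is a minimal sketch of how I would model it in Python. This is my own modelling, not a format or API published by CC: the names Signal, ALLOWED_COMBINATIONS, and is_allowed are hypothetical, and the allowed sets simply mirror the four combinations listed above.

```python
from enum import Enum
from typing import Iterable


class Signal(Enum):
    """The four core elements, named as in the CC proposal."""
    CREDIT = "Credit"
    DIRECT_CONTRIBUTION = "Direct Contribution"
    ECOSYSTEM_CONTRIBUTION = "Ecosystem Contribution"
    OPEN = "Open"


# The four combinations CC currently proposes. Every one of them
# contains Credit, and at most one further element.
ALLOWED_COMBINATIONS = frozenset({
    frozenset({Signal.CREDIT}),
    frozenset({Signal.CREDIT, Signal.DIRECT_CONTRIBUTION}),
    frozenset({Signal.CREDIT, Signal.ECOSYSTEM_CONTRIBUTION}),
    frozenset({Signal.CREDIT, Signal.OPEN}),
})


def is_allowed(signals: Iterable[Signal]) -> bool:
    """Return True if the given elements form one of the proposed combinations."""
    return frozenset(signals) in ALLOWED_COMBINATIONS


if __name__ == "__main__":
    print(is_allowed([Signal.CREDIT, Signal.OPEN]))
    print(is_allowed([Signal.CREDIT, Signal.OPEN, Signal.DIRECT_CONTRIBUTION]))
    print(is_allowed([Signal.OPEN]))
```

Seen this way, the proposal is essentially "Credit plus at most one extra", which makes the exclusivity of Open and the Contribution elements look like a design decision rather than a technical constraint.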

Mixed Signals

From Human Content to Machine Data is the document that provides the foundation for Signals. Reading through it, I get mixed signals (sorry) about the purpose and target audience.

The argument about reciprocity, and the social contract it is derived from, suggested to me that creators of blogs, news sites, image galleries, and similar would apply signals: that such creators can indicate to AI operators what reciprocation should look like, and that I could attach signals to this blog to indicate that an LLM must reference it if it was used as training data.

Later in the document (p. 29), CC states that signals are intended to be "attached to large collections of content" (emphasis is mine). The descriptions of the signals themselves suggest that providers of data sets for AI training are the target audience.

This opens a gap in reciprocity: AI operators reciprocate with the holders of large data sets, but not necessarily with the creators of the content that makes up such data sets. CC themselves give the examples of StackOverflow and Reddit in their reasoning, cases where users protested heavily against the content they created being used as training data for AI. It is unclear to me how Signals will alleviate such grievances by addressing only the platform as a whole.

Optimistic Reciprocity

CC believes that reciprocity is "a critical ingredient to widespread consent" (From Human Content to Machine Data, p.19). I share this view. I am not so certain that large players in GenAI will also share this view. Consider the following questions:

  1. Would GenAI companies already attribute training data of their models if it could easily be done with no repercussions?
  2. Is there a reliable way to prove that a model has been trained on a CC licensed data set?
  3. Have current models been trained on material that is under more restrictive copyright than any CC license?

CC further seems aware of technical hurdles, stating that they "seek to establish norms around what is possible, not letting the perfect be the enemy of the good" (emphasis theirs, From Human Content to Machine Data, p.26). The Credit signal leaves a loophole that may pertain to this. In its description, attribution in the case of RAG is required "where it is technically feasible to connect content with particular outputs" (also p.26). That is a pretty big escape hatch.

Concluding Thoughts

In the end, I am left unsure how signals will contribute to restoring the social contract between content creators and AI operators. Reciprocity with data set providers and platforms does not necessarily mean reciprocity with the individuals who create content for said platforms. I am curious about what the alpha version of Signals will look like in November.

That being said, there is no denying the need for an initiative such as signals. A week after CC went public with signals, Cloudflare announced they will start blocking AI web crawlers by default, proclaiming July 1 to be "Content Independence Day". Their reasoning is similar to that of CC; their goal, in contrast, appears to be a purely monetary compensation model. Such pressure may, in the end, be enough to force large players in the GenAI space into reciprocity.

Cloudflare's move showcases an increasing awareness that AI has disrupted the established agreement between content creators and web scrapers. I think CC correctly identified the need to give creators of the Commons some way to re-establish balance. It is still too early to say whether signals can achieve this. The attempt is certainly timely.