Data, Text, Image

How we describe creative work matters.

May 04, 2023

(NB: I had hoped to post at least monthly, but as anyone writing and thinking about AI knows, there’s so much going on that it’s hard to find a moment to breathe and reflect. As a result of the flood of AI-related articles, posts, and conversations, I’ve started four different posts. And, as T. S. Eliot wrote and as I take every opportunity to remind people in the hallway, “April is the cruellest month,” so I’ve not actually finished three of those.)

This post was inspired by the serendipitous experience of hearing Professor Meredith McGill’s fascinating lecture “The Immaterial Book” at my home institution (University of Kansas) last month and then reading Alberto Romero’s post “A Bull, a Rose, a Tempest” four days later. In both, I found myself thinking about the word “data” and its relationship to human creative work.

In part of McGill’s talk, “From Book to Work,” she discussed the theoretical implications of using the words “text” and “work” as well as the practical challenges and potential of the Black Bibliography Project. The BBP’s data model crafts semantic and machine-readable links between works, editions, copies, authors, publishers, etc., which is an exciting development. But in a talk that teased out the theoretical implications of using different terms to talk about books (“work,” “text,” “copy”), it struck me that “data” stood out as an under-theorized term. This is, of course, no fault of McGill: “data” is the common term used to describe digital information. But it seems important to think about its use as carefully as we do those other words.

“Data” is not a recent term, of course; it is the plural form of the singular Latin datum, “that which is given.” The meaning of the word in English has evolved. According to the Oxford English Dictionary, “data” in the 1600s referred broadly to information collected for analysis, or facts known or assumed; from the 1800s onward, it gained a philosophical connotation related to direct perception or experience by the mind or senses; and more recently, it has taken on its pervasive technological meaning. Starting in the 1940s with the rise of computing, “data” was used to refer to items of information in digital form. In its specific technical usage, “data” refers to “the quantities, characters, or symbols on which operations are performed by a computer and which may be stored or transmitted in the form of electrical signals and held on recording media” (OED). From the 1950s onward, various corpora (e.g., Google Ngram and English Corpora) show that the use of the word “data” skyrocketed, corresponding with the rise of computing. We can similarly see an exponential increase in the use of the terms “dataset” and “training data” (not surprisingly, corresponding with a similar trend in the phrase “machine learning”).

I remember the first time I looked at JPEG code back in the 1990s. I was both amazed at the human ingenuity that allowed for this kind of translation and overwhelmed by the profoundly alien notion that what I was looking at would be, at some point, rendered as a photo I had taken. One thing I could not do was look at that data and see the photo. I could not see the result of my choice to frame the image in a particular way; I could not see the trees in the foreground or the buildings in the background; I could not see the mural that was the subject of the photo.

The code was data. The images translated by the data were not.

Language matters, and the use of the word “data” to describe what AI systems are trained on has implications for how developers, users, and regulators think about generative AI and its relationship to human creativity.

Referring to and thinking about texts and images as “data” allows for deliberately alienating reductionism. First of all, using “data” to talk about literary texts and artwork suggests that all of the component parts are equivalent and discrete, like pixels. Further, employing the word “data” to talk about the texts and art scraped from the internet means divorcing them from their historical and cultural contexts. And, ultimately, talking about human creations as “data” obscures both the creative process and the creators themselves.

As Nick Vincent and Hanlin Li note in “ChatGPT Stole Your Work. So What Are You Going to Do?” (Wired):

If you’ve uploaded photos or art, written a review, “liked” content, answered a question on Reddit, contributed to open source code, or done any number of other activities online, you’ve done free work for tech companies, because downloading all this content from the web is how their AI systems learn about the world.
Tech companies know this, but they mask your contributions to their products with technical terms like “training data,” “unsupervised learning,” and “data exhaust” (and, of course, impenetrable “Terms of Use” documents). In fact, much of the innovation in AI over the past few years has been in ways to use more and more of your content for free.

Let us not be complicit in the kleptography being perpetuated by tech companies by using “data” to refer to the human creations scraped from the internet to train their systems. I’ve been guilty of this tendency; I am trying to resist. McGill, in the Q&A for her talk, asserted that “books aren’t information”—another way of saying that texts (or other human creations) cannot and should not be reduced to “data.”

In “A Bull, a Rose, a Tempest,” Alberto Romero considers the difference between the writing he loves and the outputs of AI. He writes that

Borges's final output, what we can see printed on the page, isn't there by a mere process of statistical fitness. It's there because someone with desires and hopes, fears and anxieties, a vast past, an eternal future, and a thoughtful and decidedly conscious present, put it there.

This shouldn’t seem like a radical statement, yet Romero’s reflection runs contrary to the literary critical tendencies inspired by Barthes’ now-classic 1967 essay, “The Death of the Author.” At the time of its publication and for long after, it was a call to democratic reading practices: it dethroned the author as the single creator of meaning and moved that authority to readers. The point was that a “text consists of multiple writings, issuing from several cultures and entering into dialogue with each other, into parody, into contestation”; it was and still is a productive starting point for reexamining the assumptions about how meaning is produced.

In the current moment, however, where “reading” is what computers do to code and LLMs are “writers” trained on the “thousand sources of culture,” the following sounds more like a warning than a declaration of independence:

…The text is a tissue of citations, resulting from the thousand sources of culture. …The writer can only imitate a gesture forever anterior, never original; his only power is to combine the different kinds of writing, to oppose some by others, so as never to sustain himself by just one of them; if he wants to express himself, at least he should know that the internal "thing" he claims to "translate" is itself only a readymade dictionary whose words can be explained (defined) only by other words, and so on ad infinitum. Succeeding the Author, the writer no longer contains within himself passions, humors, sentiments, impressions, but that enormous dictionary, from which he derives a writing which can know no end or halt: life can only imitate the book, and the book itself is only a tissue of signs, a lost, infinitely remote imitation.
…There is one place where this multiplicity is collected, united, and this place is not the author, as we have hitherto said it was, but the reader: the reader is the very space in which are inscribed, without any being lost, all the citations a writing consists of; the unity of a text is not in its origin, it is in its destination; but this destination can no longer be personal: the reader is a man without history, without biography, without psychology; he is only that someone who holds gathered into a single field all the paths of which the text is constituted.

Barthes’ focus on “the enormous dictionary” from which writing emerges, writing’s “imitative” quality, and the lack of the “personal” in either writer or reader now sounds like a description of a dehumanized algorithmic writing process rather than a pluralistic liberation from the hegemonic notion of authorship. I doubt justifying an erasure of the complex human creative process was Barthes’ intent; and I find, like Romero, I care about that intent.

T. S. Eliot, like Barthes, understood writing to be born of fragments and quotations; one might even say he was one of the ancestors of contemporary remix culture. But whether one is talking about remix culture, transformative work, or a modernist practice born of a “heap of broken images,” in all of those instances, fragments of human creative works are re-formed, deliberately, into new works by human agents. Creators might frame these components to evoke nostalgia; they may invoke them for satire; they may have many other intentions and purposes. But the products of that process are not simply “sophisticated averages of their training data that generate novel outputs through pattern recognition and probabilistic synthesis” (that’s one of ChatGPT’s descriptions of what it does).

In short, writing is communication, not an output. Art is not data. Books are not information. We allow these equivalencies at our peril.

Datta: what have we given?
My friend, blood shaking my heart
The awful daring of a moment’s surrender
Which an age of prudence can never retract
By this, and this only, we have existed

…

These fragments I have shored against my ruins
Why then Ile fit you. Hieronymo’s mad again.
Datta. Dayadhvam. Damyata.

Shantih shantih shantih

—from “What the Thunder Said,” The Waste Land, T. S. Eliot

PS: Grammarly recommended I change “datta” to “data.”

Pandora's Bot
