Editorial Note: I had been hoping to spend much more time on this article: going through further stages of drafting, producing a shorter blog-length version with fewer details, exploring all of the authors’ supplemental material, and perhaps even asking them to weigh in. But the discussion about this study has exploded in several circles I’m involved in, so given the speed of internet time, I think it’s more important to publish now, before people forget about the whole thing. My apologies for any comprehension difficulties, mischaracterizations, or errors that result.

I’ve recently seen a lot of buzz about this study published in November: AI-generated poetry is indistinguishable from human-written poetry and is rated more favorably. The study found people were, on average, worse than chance at evaluating whether a poem was human-written or AI-generated!

I found the claim that poetry written by a well-known human poet and poetry written by ChatGPT 3.5 are indistinguishable absurd on its face, given the quality of the AI poetry I’ve seen. But having recently taken Scott Alexander’s AI Art Turing Test, on which I scored only 60%, I wondered if I was just overconfident. So I decided to try the experiment myself.

Rather than try just one of the ten conditions in the original study, I tried all ten, with a variety of minor adjustments to the design as I went along based on the results of the previous trials. On the first trial, I got 8 out of 10 correct; in aggregate across all trials, 84 out of 87. (Furthermore, I was dramatically underconfident about my chances of being correct on each classification.) I thus have renewed confidence that AI-generated poetry of the type tested in this study is absolutely not “indistinguishable from” human-written poetry, unless I am some sort of superdiscriminator doing much better than anyone else can be expected to, which seems unlikely.

Why is my result so different? I believe the original study found a real effect, and an interesting one, but that it primarily measured people’s preconceptions about contemporary AI poetry and how quickly they could figure out what that poetry is actually like and what was going on in the study, rather than whether there is a noticeable difference that most people can detect after a little practice (which is what I think most people who have heard about this study, including me, believed it to be about). Small differences between my knowledge during the experiment and the participants’ in the original study seem to have been enough to dramatically change the results.

Feel free to jump ahead to how I distinguished the poems or discussion of what I think the original study actually shows.

Self-experiment: Design and Methods

I started the experiment without reading the original study in detail, because I thought doing so would make things unfairly easy for me. I did skim it to get a rough idea of what was involved. I only glanced very briefly at the methods section, and got lucky in that the experiment I laid out came pretty close to the design of Study 1.

The main differences between my knowledge and that of the study participants:

  • I knew what the study was about, as well as its conclusion.
  • I knew that the authors had found the heuristics people used to evaluate whether a poem was AI-generated were flawed, and roughly how they thought they were biased – in particular, that people tended to think weirdness/illegibility was an indication of AI, when nowadays it tends to be the reverse: AI poetry is too explicit/clear/obvious.
  • I was not blind to which poet I was reading, or to the fact that all the poems were either by one poet or imitating said poet, or to the fact that all the human poems were written by famous English-language poets (rather than, say, the experimenter’s nephew, which would totally be fair game for this experiment; it’d be a lot harder that way). The methods section in the study writeup is a little short on detail, but I think the participants were blind to all this. I would certainly have picked up on the poet being, e.g., Shakespeare, anyway, but might not have known that the not-obviously-Shakespeare poems were also supposed to be Shakespeare. And for some of the poets I wouldn’t have known this at all.

    This feels like the largest factor to me; while I don’t think it comes anywhere close to explaining the full difference between my result and the original study’s result, I’m confident it made the task easier than it would have been otherwise, and in particular made it easier to get good at identifying the AI’s style quickly.

Happily, the authors made their poems, data, and survey questions publicly available, so I was able to use the exact same poems they did by grabbing the survey files from the OSF files section. Initially, I simply printed out the “Assessment Poems” document, which had all of the survey questions nicely formatted, and tested myself from that, but after testing and grading the first two trials, I realized that the poems in this document were not randomized, so every poet’s section listed the 5 AI poems first and then the 5 human poems. I don’t believe this affected my results on those two trials, since I hadn’t figured it out yet, but obviously knowing it makes doing further unbiased trials impossible, so I changed tack and fussed around with the Qualtrics file to produce my own document for each poet with the order randomized.

(This was a fun problem because I had to find a way to construct the documents without looking at the data, since the answers were in there and that would spoil them! I wound up searching through the JSON for just the bits that were part of the poets I’d already graded, then iteratively running jq(1) queries which would pull just the sections I wasn’t afraid to see. I wound up seeing the classification of a single Plath poem by accident, but by the time I got to that trial I’d forgotten that this had even happened.)
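
For the curious, the extraction looked roughly like the sketch below. The jq paths shown (.SurveyElements, .Payload.QuestionText, and the poet-in-the-question-name idea) are illustrative guesses at a Qualtrics export layout rather than the exact queries I ran, so treat it as a sketch of the approach, not the real commands.

    # Hypothetical sketch of the spoiler-free extraction. The idea: learn the
    # JSON layout by inspecting only a poet I had already graded, then pull
    # nothing but the question text (the poems) for the remaining poets,
    # leaving any authorship fields untouched and unseen.
    jq -r '
      .SurveyElements[]
      | select(.Element == "SQ")                                      # survey questions only
      | select((.Payload.DataExportTag // "") | test("Dickinson"))    # one poet (assumed naming)
      | .Payload.QuestionText                                         # the poem itself, nothing else
    ' survey_export.qsf > dickinson_poems.txt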

I ended up with a really janky Bash script that generated cruddy unstyled HTML files with the poems I wanted.
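
The real script isn’t worth sharing, but its general shape was something like this minimal sketch (the one-poem-file-per-trial-directory layout and the <pre> wrapping are illustrative assumptions; the shuffling and random removal described in the trial breakdown below are sketched separately after that list):

    #!/usr/bin/env bash
    # Minimal sketch of the document generator: wrap each plain-text poem file
    # for one trial in an unstyled HTML page, one <pre> block per poem.
    set -euo pipefail

    trial_dir="$1"      # e.g. dickinson/ containing one .txt file per poem
    outfile="$2"        # e.g. dickinson.html

    {
      echo "<html><body>"
      for poem in "$trial_dir"/*.txt; do
        echo "<h2>Poem</h2>"
        echo "<pre>"
        cat "$poem"
        echo "</pre>"
      done
      echo "</body></html>"
    } > "$outfile"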

There were ten conditions in the original study, into which participants were randomized, each containing five poems by a specific human poet and five poems created by asking ChatGPT 3.5 to imitate that poet. The prompt was an extremely basic oneshot: “Write a short poem in the style of POET.” The first five results were used.

I used all ten conditions for a series of serial self-experiments, pausing after every two to score my results, regroup, and decide what to explore next. The breakdown of trials went like this:

  • Ginsberg, Shakespeare: I printed out the authors’ Assessment Poems document, as described above (2-up, double-sided, because it was 100 pages long!), and classified the poems from there. I let myself spend as long as I wanted and go back and compare poems or change my answers. I did not know at this point that five would always be human and five would always be AI.

  • Lasky, Dickinson, Chaucer, Eliot: These four trials all used the same design as one another, with only the poet / set of poems changed. I used my newly generated documents, with all ten poems in properly random order. (To randomize, I had each poem as one line in a Unix pipeline and threw it through shuf(1); a sketch of this randomize-and-grade pipeline, including the random removal used in later trials, appears after this list. I extracted the answers from the JSON separately after each trial and matched up the answers to the poems by their first line.) I knew at this point that five were always human and five were always AI, having graded and explored the answers to generate the next documents, and this might have helped me in a few cases. I graded Lasky and Dickinson as a pair, and Chaucer and Eliot as a pair (that is, I completed both trials in each pair before seeing my results on either).

  • Plath and Byron: I got worried about whether knowing there were five of each condition present was making this much too easy. To partially blind myself to this, for these two trials I tweaked my document generator script to randomly remove four of the poems, leaving me with six each. (I still knew the rough probability distribution since I knew how the randomization was done, though I intentionally tried not to think very much about the chances!) Otherwise these trials were the same as the previous four.

  • Whitman and Butler: I combined these two poets into a single trial, to see whether having AI poems that were not all attempts at the same style would make things harder. I randomly removed five poems for the same reasons as mentioned above, leaving fifteen. I also added 8 inches of space before each header in my document to implement a new restriction: for this one, I had to answer AI or human for each poem in isolation before proceeding to the next one, without the ability to go back and change my answer, because in earlier trials I had frequently benefited from comparing the styles of the different poems. It was unclear to me whether the original survey allowed going back or not, but in any case the new rule seems like a good match for the real-world situation where you see a single poem and wonder whether it was written by AI.
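
Here is the promised sketch of the randomization, removal, and grading steps. The file names, the one-poem-per-line encoding, and the tab-separated answer-key layout are all illustrative assumptions rather than my actual files:

    # Shuffle the ten poems into a random order; for the partially-blinded
    # trials, additionally keep only a random subset (e.g. 6 of the 10).
    shuf plath_poems.txt > plath_trial.txt              # all 10, random order
    shuf plath_poems.txt | head -n 6 > plath_blind.txt  # drop 4 at random

    # Grade afterwards by joining my guesses against the authorship key on
    # each poem's first line (assumed layout: "first line<TAB>label").
    join -t $'\t' <(sort my_guesses.tsv) <(sort answer_key.tsv)

    # For the final trial, one blunt way to force each poem onto its own
    # screenful (the "8 inches of space" trick) is to pad before each header:
    sed 's|<h2>|<div style="height: 8in"></div><h2>|g' whitman_butler.html > spaced.html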

It’s worth pointing out that making the conditions progressively more complicated/difficult can’t be expected to hurt performance the way it would if this weren’t a self-experiment and the trials were being done by different people: I was running these trials serially, so by the time I got to the hardest trial at the end I already had a lot of experience and the task in general had become much easier.

My process for most of the trials (the initial Ginsberg was a little more free-form):

  1. I sat down with a printout of the document.
  2. I read through all the poems and assigned a “vibe” of human or AI to each. A few of the trials felt so easy that I basically just went with this initial answer for every poem after a quick second look.
  3. I went through the poems again, considered whether the vibe ratings were correct, compared different poems, and so on, and after some time assigned a final judgment as well as a percentage confidence rating (I used 60%, 70%, 75%, 80%, 90%, and 95%+).

I didn’t give myself a time limit; the time I spent on each trial ranged from 5 minutes to 20 minutes (generally, getting faster as I went along).

For the trials that had some difficulty to them, I also highlighted phrases or passages I thought were telling in either direction, and after each set of trials I dictated some further thoughts, which I referred to when writing this report.

For the final trial of Whitman and Butler, I did not do step 3, since that would involve looking back at previously answered poems, though I took more time on step 2 for critical thinking. I also read this one off my computer screen and wrote my answers on an index card, rather than printing it out and writing on the page, as it was easier to ensure I could see only one poem at a time that way.

Self-experiment: Results

I correctly classified 84 out of 87 poems. I got two wrong on the first trial (Ginsberg), classifying 2 AI Ginsberg poems as human. I got one wrong on the last trial (Whitman and Butler), classifying an AI Butler poem as a human poem (while noting that it was a difficult one and I wasn’t sure). I got 100% correct on all other trials.

The trials were of varying difficulty. The AI did a better job imitating some of the poets than others, and in some of the trials the AI exhibited very obvious tells. Also, I got better and had a much easier time distinguishing as I went along, because even when instructed to imitate a certain poet’s style, ChatGPT 3.5 has a poetic style of its own that is usually easy to pick out once you get familiar with it. “Imitating the poet’s style” looks more like picking out some tropes associated with that poet and using them, and less like actually successfully writing poems that look like that poet wrote them. It’s (surprisingly to me, maybe due to RLHF aimed at making it a more useful chatbot) particularly bad at imitating distinctive features of punctuation or grammar – e.g., it didn’t capitalize any extra nouns in Dickinson. It also tends to stick with extremely established forms, and does not easily alter them to match the poet it’s imitating; e.g., it often writes exclusively in quatrains even if the poet whose style it’s imitating seldom does. The most subjectively difficult poems to classify were very short ones, where it was hard to get much signal.

My overall impression of the ChatGPT poems is that their form is really quite good; they clearly look like poems, they generally have decent meter and rhyme if it makes sense, they sound harmonious, they cover poetic topics. This was somewhat surprising to me, actually; the last time I looked at AI poetry it was quite bad at this.

However, they don’t really “say anything.” E.g., here’s a couplet imitating Dickinson:

A single rose, so bright and fair,
Its petals soft, its fragrance rare.

My initial vibe rating was “human”, because I was skeptical ChatGPT would be bold enough to write a single couplet as its poem, and because Dickinson does have some super short vignettes like this. But as I kept looking at it, and comparing it with another short poem I was a little unsure about, I realized that this couplet doesn’t say anything. How am I supposed to feel differently about the world after reading this? There’s just nothing there, it’s a sentence about a rose. There is no creativity, no novelty, no surprising metaphor, no unusual idea.

ChatGPT 3.5 also has poor poetic vocabulary. Most famous human poets have great vocabulary and don’t hesitate to deploy it, so at least in the data here, a poem that has no unusual words is probably AI.

Occasionally it does come up with surprising metaphors. For instance, there was a Shakespeare sonnet that I was initially tempted to classify as human specifically because the metaphor (the lover as cooling relief) was so unusual:

When summer’s heat does scorch the burning sand,
And all the world seems lost in fiery haze;
I think of thee, and how thy gentle hand,
Brings cool relief in all its soothing ways.

For in thy touch, there is a magic spell,
That quells the flames of passion in my heart;
And in thy eyes, I see a love so well,
That from its depths, I never shall depart.

But after reading and thinking about the poem for a few moments, I decided this “inventive” metaphor just didn’t work: it had no depth, and there was no reason for the reversal of the usual metaphor. If you do something like this, you need to illuminate something new, not just invert the metaphor and leave it at that. So I correctly concluded it was ChatGPT’s; Shakespeare wouldn’t have started this idea and failed to finish it.

In general, ChatGPT 3.5’s poetry suffers from beating you over the head with the theme. In several places, I highlighted a section of a poem and wrote “Hallmark Card”. It knows what kind of thing one tries to convey in a poem, but it doesn’t know how to be subtle; it just literally says it, in appropriate form. This makes it feel like it was written by a teenager. (I am not at all convinced I could tell the difference between a novice human poet and ChatGPT 3.5; I’d be willing to believe ChatGPT 3.5 is a better poet than the median human who writes poetry.)

If you’re interested in learning more about the tells and my specific experiences with each poet, here are some details about each trial, presented in the order I did them. Or you can skip to the discussion.

Ginsberg

Familiarity with Ginsberg: Medium-low. I correctly recalled that Ginsberg was a Beat poet (not that I have read much Beat poetry) and that he wrote a famous poem called “Howl” (though I knew nothing whatsoever of its content). I had a vague idea of what sort of thing I thought he might write, and chances are I’ve run into a couple of his poems at some point in my life. But I had never set out to read any Ginsberg, and could not name anything by him I had read.

Score: 8/10.

This was the first trial I did and one of only two that I got any wrong on. I rated 7 human and 3 AI. A few things that contributed to doing worse here:

  • I wasn’t familiar with Ginsberg.
  • I didn’t know for sure how many human and how many AI there were, though I did logically expect it would be even or close to even.
  • I thought the presence of clearly different styles must mean that Ginsberg had written in several different styles, and that surely the experimenters were trying carefully to make this hard, so I was trying to pick at least one human and one AI poem in each distinct style. (If I’d read the methods section prior to trying the experiment, I would have known that this wasn’t the case, and that there wasn’t much selection.)
  • I had not yet seen any of the newer crop of AI poetry represented in this study, and didn’t know what to expect.

I was pretty confident I had incorrectly rated some, but didn’t know for sure which ones.

I didn’t know when I did this trial that the questions were unrandomized in the form I had pulled them from the authors’ data, so all the AI poems were first. I didn’t notice, though, so I don’t think this affected my performance! I did notice that the style changed markedly partway through, but I believed that this was because they were giving two distinct sorts of Ginsberg poetry within which some were AI and some were human, and they had naturally chosen to present one set before the other. In this sense, I actually did worse because ChatGPT was such a bad imitator of Ginsberg, compared to my prior that it would be pretty good, given the results of the study; had I known it might match styles so poorly, I wouldn’t have been afraid to mark more as AI.

I’m not trying to replicate the authors’ Study 2, but as a side note: while doing the trial, still blind to which poems were AI and which were human and before grading my answers, I remarked that I liked the style that turned out to be the human poems much more than the style that turned out to be the AI poems; in fact, I found myself disappointed at how bad Ginsberg was (except that all of those poems were the AI “Ginsberg”). Of the three (in retrospect) most obviously human poems, I couldn’t get myself to believe any were AI, even though I thought surely one of them had to be, since they were all in the same stylistic cluster.

Shakespeare

Familiarity with Shakespeare: Medium to high. I’m sure the baseline familiarity is high here – what English speaker hasn’t read or heard some of Shakespeare’s work? I’ve read probably about a third of the sonnets and memorized one.

Score: 10/10.

I noticed as I was finishing doing this one that all the AI poems were first and all the human ones second. I don’t think this affected the results, as I’d already largely made my choices by the time I noticed.

Also, there was an egregious editing error in this trial (in the original study’s document) which made the order giveaway almost irrelevant: all the poems were sonnets, and all of the AI sonnets were rendered in four stanzas, while all of the human sonnets were rendered in one stanza with the final couplet indented. Once I recognized that there were two distinct categories of style / content, and had identified which was human Shakespeare (easy when they were side by side), this served as extremely strong confirming evidence.

During the first five poems, where all were AI and I thought probably some would be human and some would be AI, I initially marked three as possibly-leaning-towards-human. However, as soon as I started reading the actual Shakespeare sonnets, it became obvious by comparison that all those had been AI; if nothing else, the language was totally different.

I think the fact that all these were sonnets made discrimination easier. Shakespearean dramatic verse feels like it would be easier to fake. Also, Shakespeare’s sonnets are quite unified thematically and stylistically, and given that ChatGPT was not told to focus on the sonnets specifically, it would be surprising if it matched them. If you had never read any Shakespeare sonnets, this might be a little harder, but a reasonably intelligent person could still easily notice the similarities after seeing the first couple of real Shakespeare sonnets.

Lasky

Familiarity with Lasky: Zero. I had heard the name Lasky, but didn’t know anything about her style or time period, had never knowingly read any, and even thought she was a man until I looked her up later. (After enjoying the sample poems included here though, I will be correcting this!)

Score: 10/10.

Despite the poems now being properly randomized and me knowing absolutely nothing about the poet, this trial was utterly trivial. The AI poems were all in quatrains and about nature. The Lasky poems had a different form, different themes, and a completely different vibe overall. Not only that, they were obviously really good – in an interesting counterpoint to the claim that maybe inexperienced people like AI poetry better because it is more legible and approachable, I thought these were extremely approachable too. The AI ones fit the form of a poem perfectly well, but they were completely uninteresting.

ChatGPT completely failed to pick up Lasky’s style. I see basically no similarities. I’m not sure whether she was somehow absent from the training data, or whether her work didn’t read as “poetry” to ChatGPT and so got little weight, or what.

Dickinson

Familiarity with Dickinson: High. I’ve read probably about half of Dickinson’s poems (I have a copy of her complete works on my bookshelf and have gotten partway through it). I’m certainly very familiar with the vibe.

Score: 10/10.

Despite my high familiarity with Dickinson, this was one of the hardest trials. Although I got all of them right in the end, there were two short poems whose classifications I initially had swapped and only corrected at the last minute, giving them only 60% confidence. I took 5 minutes for initial vibes and then another 10 minutes to review, think, and assign final ratings and confidence levels.

The one which I mistakenly identified as human at first was only a single couplet (the one about the rose that I quoted earlier). I thought it was human because I didn’t think ChatGPT would only write a single couplet when asked for an Emily Dickinson poem, and she does occasionally have really short fragments; somehow on vibes it seemed a little like her, too. But looking over it again, the one I’d originally thought was AI seemed quite evocative and I quite liked it, and this couplet seemed like it didn’t actually have any meaning to it.

The AI was very good at picking up Dickinson tropes – e.g., it talked about birds, storms, quietness; it was in a plausible lyric style; etc. It did not, however, pick up on a number of extremely Dickinsonian things that would have gone a long way toward making it look more plausible – in particular, it didn’t use many dashes and it didn’t capitalize any nouns. Once I realized this, I quickly became much more confident in my answers. And the AI poems regularly included stanzas or lines that I highlighted as pretty clearly not something Emily Dickinson would write.

Still, overall, I felt the Dickinson AI poems were significantly more plausible than those for most of the other poets. It would be interesting to try with a better model and a better prompt that gives a couple of examples of Dickinson poems and explicitly points out some of the things I might use as heuristics; I’d bet you could make this one quite difficult.

Chaucer

Familiarity: Low, but medium to high on the dimensions that matter here. I’ve read good chunks of the Canterbury Tales, but in a modern translation. I went to an hour-long lecture on reading Chaucer in Middle English once, and have read a few blog posts about it since.

Score: 10/10.

This was basically trivial due to poor language use by ChatGPT, though I was a little more cautious at first than on most of the other poets due to relatively low familiarity with what real Chaucer is like. Most of the AI poems were not in anything remotely resembling Middle English, nor were their themes plausible as Chaucer. One of the AI poems had something that looked like it could perhaps be Middle English at the beginning (though I don’t think it was accurate), but by the fourth line it had reverted to full modern English! The theme wasn’t right, either.

I did quite like the AI’s poem “Your two great eyes will slay me suddenly,” though it was not remotely Chaucerian and did not fool me for a moment:

Your two great eyes will slay me suddenly;
Their beauty shakes me who was once serene;
Straight through my heart the wound is quick and keen.

Only your word will heal the injury
To my hurt heart, while yet the wound is clean -
Your two great eyes will slay me suddenly;
Their beauty shakes me who was once serene.

Upon my word, I tell you faithfully
Through life and after death you are my queen;
For with my death the whole truth shall be seen.
Your two great eyes will slay me suddenly;
Their beauty shakes me who was once serene;
Straight through my heart the wound is quick and keen.

This was my favorite AI poem, and one of only two I noticed I actively liked, though I wasn’t explicitly trying to identify whether I liked any, so it’s possible I could have missed some.

T.S. Eliot

Familiarity: High. I have not read a particularly wide selection of Eliot’s work, but I’ve memorized a significant chunk of the Four Quartets, so I have a pretty good model of his style inside my head.

Score: 10/10.

The AI failed to capture Eliot’s style. I did initially misidentify one of the human poems as AI, and then, after seeing there were 5 others that were very clearly AI, I realized this one had to be human. It is more metrical and lyrical than average Eliot, but still fits once you know the answer.

I also thought the first human poem had some unusual vibes that were, for whatever reason, enough to make me put a question mark after my “Human” vibe rating:

The dogs were handsomely provided for,
But shortly afterwards the parrot died too.
The Dresden clock continued ticking on the mantelpiece,
And the footman sat upon the dining-table…

But it was much more interesting than any of the AI poems I had seen so far, and as we’ve seen, ChatGPT 3.5 actually is just plain boring. Comparing it with the ones that came afterwards, it was very obviously human (I rated it 90%).

The main tell was that all of the AI poems were a variant on walking a city street (except for the one that was only about the “city’s soul” – but I think the only reason ChatGPT didn’t get to writing about walking in it was because it was only a single couplet). Presumably there must be some famous piece of Eliot that it is emulating here (I do not know it), but I have no idea why it focused so completely on this specific theme.

Because the AI poems were all basically the same poem, it would have been pretty easy to flag them as the AI ones even without any other information. Their style also didn’t feel like Eliot.

Plath

Familiarity: Low. I knew she was a famous mid-20th-century literary mind who wrote a book called The Bell Jar in addition to her poetry (I think she has some famous diaries too?), and that she died young by suicide after a lifelong battle with depression. But I don’t recall ever having read any Plath at all, and I discovered partway through the experiment that I had been pronouncing her name wrong.

Score: 6/6.

For the Plath and Byron trials, I randomly removed 4 questions and only did 6, to blind myself to how many of the poems were AI and how many were human. I got 3 of each in the Plath trial.

The AI picked up on the kinds of things Plath would write about, but not remotely her style (and by “her style” I mean what I was able to glean of it from reading the poems during the experiment, because it was clear what could plausibly be a Sylvia Plath style and what couldn’t). And Plath’s poems were way better; it was absolutely no contest.

All the AI poems were in quatrains. I was slightly unsure about the very first one and rated it at only 75% confidence, but I wrote next to it that if I were at all familiar with Plath it would be easy; the others were 80 or 90%.

Byron

Familiarity: Low. I’m sure I’ve encountered some Byron somewhere, but have never set out to read him. I knew he was a British Romantic poet, and he was a Lord; that was about it.

Score: 6/6.

I got 2 human and 4 AI poems in this one. One of the poems was very obviously Byron, despite my having only a very vague idea of what Byron is like. Two others were easily detectable as AI because they had the same theme as each other (similar to the Eliot trial, just less obvious). The other two AI poems had a similar style to those two, and one of them had some pretty obvious tells (it had a woman as the subject of “woo”, which seemed extremely out of place for Byron’s time, and the reason she can’t have the guy is that “his heart belongs to all the earth,” which sounds profound but makes no sense at all and doesn’t fit with the rest of the poem).

The one I struggled with was only a single couplet, and I initially rated it as likely AI as it seemed a little like pseudo-profound bullshit:

The beginning of eternity, the end of time and space,
The beginning of every end, and the end of every place.

But the more I looked at it, the more it felt like something an actual British Romantic would write. I also noticed that it had a pleasant, inventive, and evidently carefully considered metrical pattern that did not feel like something an AI would write. (I have no idea how much predictive power it gave me, but I do feel like I often found lines where the meter seemed either human or AI. It was never a certainty, but often a suspicion. I have a pretty good ear for meter; I wouldn’t expect most people to be able to do this. I’d try to make it explicit as follows: ChatGPT 3.5 usually writes slightly stricter meter than humans, and when it diverges from the established meter it doesn’t do so in the way a human would. The divergence often has no relationship to the theme or structure of the line, as it usually would for a great human poet; it’s just randomly different, and the pattern of the divergence usually feels a little more jarring than it does for a human.) I changed to human at 60% confidence, and was correct.

Whitman & Butler

After analyzing the trials above, I had two main doubts I felt I could still try to clear up: was this perhaps artificially easy because (a) there was only one poet/style and I knew what it was; and/or because (b) I was able to see all the poems at once?

I designed the last trial to make both of these factors more difficult: I pooled Whitman’s and Butler’s poems (and their respective AI mimics) into a single set, then threw five of the twenty away at random, leaving fifteen, to remove certainty about both how many poems came from each poet and how many were AI vs. human. I then required myself to write down, for each poem, an answer I could not later change, before looking at the next poem.

Familiarity: Low. I’ve read a few scattered poems by Whitman and have an idea of his style. I knew absolutely nothing about Butler, not even his first name or time period.

Score: 14/15.

I misclassified one AI Butler poem as real Butler, beginning:

In wit and satire I excel,
My verses ring like tolling bell,
With humor keen and wit so sly,
I lay bare the foibles of mankind’s eye.

The meter limps and it’s cliché, but knowing very little of Butler, I thought it could be ironic doggerel; it does that well enough that it could be. (This was one of the two AI poems that I actually kind of liked, the other one being “Your two great eyes will slay me suddenly”, quoted earlier. It’s not a good poem in the sense of having high literary value, but it’s funny.)

I think I still could have avoided this mistake with a little more care, because the poem referred to Butler himself, which would have placed it outside the realm of plausibility! When I read “Like Samuel Butler, I hold no fear,” for some reason my brain mapped the name onto Samuel Johnson – perhaps because the poem’s later mention of the pen of Butler seemed much more plausible if it were referring to arguably the most important intellectual in British history? – so I didn’t notice. The fact that it was after midnight at this point and I’d been working on these experiments since mid-afternoon may also have played a part! But of course, part of the point of this condition was to test my performance when the style/poet was less clear, so this isn’t information I should have expected myself to have anyway.

I was of course easily able to tell which human poems were Whitman and which were Butler. I didn’t try to see if I could tell which poet ChatGPT was trying to imitate, and I kind of regret not trying!

I will say that if I had not read ~75 ChatGPT 3.5 poems over the hours preceding this test (e.g., if this had been my first test), it might have been considerably harder under these conditions. It would be interesting to try a trial with these conditions again with poems written by a different LLM – maybe that would change the style enough to counter my immediate familiarity.

Discussion

I hesitate to say that this report is of a “failure to replicate” because the conditions under which I did the experiment were subtly different in important ways. Nevertheless, in some ways that is exactly my point – from my experience, I do not think the original results would be robust to small changes in these conditions, especially a small increase in the experience level or task knowledge of participants.

The original study was fascinating, but based on my results and experiences, I think it doesn’t show what it claims to show (or at least what people are interpreting it as showing, though from reading the paper, I think the authors are making the strong claim themselves, perhaps leaving themselves just a touch of plausible deniability). The strong claim is that “people can’t distinguish between human and AI poetry.” That’s broad and bold, and not only was I able to correctly distinguish 84 out of 87 poems in the dataset under a variety of conditions, but having worked through this experiment, I find it completely implausible that people with even some background in poetry could possibly be unable to tell the difference between human and AI poetry of this caliber – provided that they have more than a 10-question survey with zero background or context to demonstrate their skill.

I am well above average on poetry knowledge, for sure, probably 95th percentile; I enjoy poetry, I own a few books of poetry that I read from time to time, and I’ve memorized a couple dozen poems. I would expect myself to do better than an average person! But then, the original study found virtually no correlation between self-reported poetry knowledge and discrimination performance (\(R^2 = 0.012\)), and only a 6% improvement for having seen one of the poems on the test beforehand, so it seems like it’s claiming I should not in fact be able to do much better than average.

Maybe the authors would try to classify me as an “expert” and disqualify my results; in the abstract, they limit the claim to one that “non-experts” can’t distinguish AI from human poetry, but they don’t define “expert,” and later in the paper they say that “participant expertise” had no effect. So I really am not sure what they mean. To be clear, I’m not a professional poet, the only classes I’ve taken that dealt with poetry were in Latin (where the poetry itself is less of a focus), I haven’t written any serious poetry, and I was almost completely unfamiliar with several of the poets in the trials.

I think many of the study participants were likely befuddled by not knowing what contemporary AI poetry looked like. (Frustratingly, the study did not collect any demographic information about experience with LLMs or AI poetry.) The study found that people actually performed worse than chance at discrimination, which, as the authors note, appears to indicate that they had some heuristics with predictive value but were using them backwards. If your mental model of AI poetry is based on an outdated version of what LLMs (or even pre-LLM AIs) are capable of, and you aren’t familiar with the style of the human poet you got assigned to, then I can definitely understand why you’d think that the poems which are harder to understand, have less consistent meter, etc., are more likely to be AI. This goes double for some of the poets – e.g., if you don’t know anything about Chaucer, Chaucer looks ridiculous to a modern reader. It’s not hard to see how, if someone currently believes that AI is not capable of good writing, has never encountered Chaucer, and is trying to finish a survey, they might conclude the Chaucer is slop generated by a poor-quality AI. Indeed, that is a wholly logical conclusion to come to under those circumstances. But making this mistake shows only that someone is presently unfamiliar with Chaucer and/or AI poetry, not that there is something about AI poetry that will make it lastingly difficult for them to answer correctly once they have more information.

The belief that AI writing is so bad that, without even trying, it could come out looking the way Chaucer does to a modern reader might seem silly to people reading an AI study that came out in late 2024, but presumably the actual survey was run some time ago (they did use ChatGPT 3.5, which is quite a bit behind the state of the art now). And in the latest statistics I saw, most Americans have still not used ChatGPT directly even once. Most of the places average people come into contact with AI without seeking it out, like a customer-support chatbot or the Instant Answers section of a search engine, are poor representations of what AI is capable of. So it is quite plausible to me that people would have an outdated mental model.

The fact that I struggled most with my first trial – where I didn’t know much about the task, how tricky the authors were being, or much about what the AI poetry would look like (I had not looked at any AI poetry since the original non-chat GPT-3, and most of my knowledge was from the GPT-2 era) – seems like further evidence for this interpretation. Once I got familiar with what the AI poetry looked like, it was easy, and this began to happen even before I graded the first two trials and got feedback.

So I think we can summarize as follows: what the study really shows is that people quickly taking a survey who probably don’t have much familiarity with AI poetry, are somewhat confused as to the nature of the task, start with misleading (but easily correctable) heuristics, and have never tried a similar task before, often are fooled into mixing up human and AI poetry, at least on their first trial. This is an interesting result, to be sure, but it has very different implications from the headline. It certainly does not mean that AI poetry is at the level of famous human poetry and can be substituted for it (even the original article doesn’t make this claim, but the title makes it sound like it does).

All this said, there is the authors’ Study 2 to contend with, which found that people also rated AI poetry more highly than human poetry on all kinds of dimensions indicating aesthetic preference for it. I wasn’t trying to replicate or investigate this result; I can’t do so usefully with an \(n=1\) self-experiment, and in any case the results would now be spoiled for me, because (a) it uses the same dataset as Study 1, and (b) I can now easily distinguish AI poetry of this sort, while the point of the study was that people like AI poetry more when they don’t know it’s AI (believing something is human-written makes people like it more).

Nevertheless, this is at first glance a bizarre result that could be seen as conflicting with my interpretation above, so we need to talk about it.

Here, I think the original authors’ analysis is quite good:

So why do people prefer AI-generated poems? We propose that people rate AI poems more highly across all metrics in part because they find AI poems more straightforward. AI-generated poems in our study are generally more accessible than the human-authored poems in our study. In our discrimination study, participants use variations of the phrase “doesn’t make sense” for human-authored poems more often than they do for AI-generated poems when explaining their discrimination responses (144 explanations vs. 29 explanations). In each of the 5 AI-generated poems used in the assessment study (Study 2), the subject of the poem is fairly obvious: the Plath-style poem is about sadness; the Whitman-style poem is about the beauty of nature; the Lord Byron-style poem is about a woman who is beautiful and sad; etc. These poems rarely use complex metaphors. By contrast, the human-authored poems are less obvious; T.S. Eliot’s “The Boston Evening Transcript” is a 1915 satire of a now-defunct newspaper that compares the paper’s readers to fields of corn and references the 17th-century French moralist La Rochefoucauld.

Indeed, this complexity and opacity is part of the poems’ appeal: the poems reward in-depth study and analysis, in a way that the AI-generated poetry may not. But because AI-generated poems do not have such complexity, they are better at unambiguously communicating an image, a mood, an emotion, or a theme to non-expert readers of poetry, who may not have the time or interest for the in-depth analysis demanded by the poetry of human poets. As a result, the more easily-understood AI-generated poems are on average preferred by these readers, when in fact it is one of the hallmarks of human poetry that it does not lend itself to such easy and unambiguous interpretation. One piece of evidence for this explanation of the more human than human phenomenon is the fact that Atmosphere – the factor that imagery, conveying a particular theme, and conveying a particular mood or emotion load on – has the strongest positive effect in the model that predicts beliefs about authorship based on qualitative factor scores and stimulus authorship. Thus, controlling for actual authorship and other qualitative ratings, increases in a poem’s perceived capacity to communicate a theme, an emotion, or an image result in an increased probability of being perceived as a human-authored poem.

In short, it appears that the “more human than human” phenomenon in poetry is caused by a misinterpretation of readers’ own preferences. Non-expert poetry readers expect to like human-authored poems more than they like AI-generated poems. But in fact, they find the AI-generated poems easier to interpret; they can more easily understand images, themes, and emotions in the AI-generated poetry than they can in the more complex poetry of human poets. They therefore prefer these poems, and misinterpret their own preference as evidence of human authorship. This is partly a result of real differences between AI-generated poems and human-written poems, but it is also partly a result of a mismatch between readers’ expectations and reality. Our participants do not expect AI to be capable of producing poems that they like at least as much as they like human-written poetry; our results suggest that this expectation is mistaken.

However, with regard to the claim that people legitimately prefer AI poetry, I want to point out that we are again evaluating people’s preferences under lab conditions. This situation reminds me of the Pepsi Challenge: people are asked to blind-taste Coke and Pepsi and say which they like better, and most people say they like Pepsi better. But Coke continues to outsell Pepsi, and when Coca-Cola tried to reformulate Coke to do better on this test, we got New Coke, a disaster that almost everyone hated.

Sometimes people interpret the Pepsi Challenge as evidence that marketing works, that people buy Coke for non-taste reasons, even though it is actually worse. But the story of New Coke seems to contradict that interpretation, and there’s a more interesting explanation: Pepsi is sweeter, and in a small sip thus tends to be more enjoyable. But when you go to drink a whole can of it, or a whole case, it doesn’t feel so good anymore. Intuitively, you’d probably choose a piece of chocolate over a forkful of spaghetti if offered a choice at a sample stand, but if you were offered a plate of spaghetti or a plate of chocolate for dinner, the plate of chocolate wouldn’t be so appetizing (if you’re really into binging on chocolate, imagine this choice would determine your dinner for the next week).

So I think you might get a different result if you gave people a booklet of all the AI poetry in this study and a booklet of all the human poetry in this study, keeping them blind to which was which, and had them sit down with both at home and read them through carefully. I’m sure some people would prefer the AI poetry booklet – but I suspect it would be a very different balance, especially for the folks who are more familiar with poetry and like it better.

Maybe not; maybe most people just like the simple poetry better in all conditions. But I think there’s good reason to be skeptical until somebody tries.

Having gone through all these AI poems, I can say they look nice. As the authors of the original study point out, they are easy to understand and pleasant on the surface. But they are all derivative – they all feel essentially the same to me, in fact – and they have little depth. Human poetry may not have all that many themes, but it expresses them in a nearly infinite number of ways, so it remains interesting. The ChatGPT 3.5 poetry has the same number of themes (maybe fewer, in fact), and it really only expresses them in a few ways; it seems to me that by the time you’ve read 20 of them, you’ve basically read them all. That’s definitely what it felt like reading through them during the experiment.

Further research

One of the reasons I found this experiment easy was presumably that little effort was put into making it hard. Scott Alexander’s AI Art Turing Test (on which I, and everyone else I’ve discussed it with, did only a little better than chance) involved carefully curated images that represented both human and AI art in every style, with effort put into making some of the answers counterintuitive, and with the AI art generated through a complex iterative process orchestrated by expert human prompters.

In contrast, the AI poetry in this study was generated with a very simple prompt, and the poems by human poets were randomly selected (with a few restrictions). The goal of the study was, specifically, to determine whether people could distinguish between human poetry and oneshotted, dumb-prompt, no-human-in-the-loop AI poetry, so this was a reasonable approach; but the perhaps more interesting question to me is whether a somewhat better process could produce genuinely indistinguishable results, or come much closer to it. Doing these tests has only moderately raised my estimate of how well I could perform at the maximally difficult version of this task; I’m now optimistic that I could do reasonably well if the test happened soon, before LLMs improve a lot at this task, and I’d be quite surprised if I couldn’t do substantially better than chance, but I still wouldn’t expect 84/87.

I think that most people who think current LLMs are garbage (and there are a lot of them) have given up long before seriously trying to learn the tool: they just haven’t taken the time to learn how to clearly express what they want in a prompt. This is like sitting down at a piano, finding you can’t play beautiful music on the first couple of tries, and proclaiming that the piano is a useless invention that will never amount to anything. So concluding from my results that nope, AI poetry is garbage and always will be, seems like an incredibly easy way to get egg on one’s face – even setting aside how quickly LLMs are advancing, I haven’t even tested myself against the real state of the art.

A few obvious directions would be using a better model (we have much better models than ChatGPT 3.5 now), explicitly telling the AI it should try to be indistinguishable from human poetry, selecting the best of a few generated poems, giving a couple of examples of the style to imitate, and suggesting stylistic tropes that might help it be less distinguishable.
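
As a concrete illustration of what a slightly more serious generation step might look like, here is a hedged sketch of a single request to the OpenAI chat completions endpoint. The model name, the n=5 best-of sampling, and the prompt wording are my own illustrative choices, not anything the original study did:

    # Hypothetical sketch only: a richer prompt plus best-of-5 sampling.
    # The original study's prompt was just "Write a short poem in the style
    # of POET."; the model and wording below are assumptions.
    curl -s https://api.openai.com/v1/chat/completions \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "gpt-4o",
        "n": 5,
        "messages": [{
          "role": "user",
          "content": "Write a short poem in the style of Emily Dickinson. Use her characteristic dashes, capitalized common nouns, slant rhyme, and compression; avoid tidy quatrains and avoid stating the theme outright. Two example poems follow for reference: [examples here]"
        }]
      }'

A human (or a second model) would then pick the most convincing of the five candidates, which is the “selecting the best of a few” step.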

The biggest difference between my experiment and the original study was that I knew there was one poet per trial (except for the last, two-poet one), and who that poet was. I am highly confident that I would still do pretty well without this information, but nevertheless that seems like the biggest weakness of my experiment. Removing that knowledge, and going even further by mixing together a wide variety of poets and styles in the same test, would eliminate many of the tells I used to distinguish between human and AI poems. I don’t think I really needed any tells to answer correctly on a lot of the poems once I got used to what the AI poetry looked like; I was able to go on whether the poem said anything of significance or would reward in-depth analysis. But maybe a better model would be able to say something more meaningful, or maybe removing the tells would introduce just enough uncertainty that I’d get quite a bit worse.

If anyone wants to put together some human and AI-generated poetry into a test for me, with any conditions they want, I’d be happy to take it and report my results!

Finally, a very important question to ponder if you go to generate some AI poetry yourself: If you ask Claude Sonnet to write a sonnet in the style of Claude McKay, is it a Claude Sonnet Claude sonnet?