(adapted from a Mastodon thread)
It would be interesting to train an LLM exclusively on my own writing. I’ve got millions of words here and there: about 320,000 words in published novels/short stories; probably around half a million words in handwritten journals; a few tens of thousands in handwritten notes; approximately 2.6 million words on my old (locked private) LiveJournal; plus another few hundred thousand on other blogs…
In terms of LLM training, it isn’t actually all that much. Probably a little over 3 million total words. That’s nothing, yeah?
Well it’s a fucking start. Anyway…
I think it’s a far more interesting question than the one everyone has focused on of late. Obviously an LLM trained on the great glut of the internet is what it is, and I’m in the camp of “no, gross, stop.” I’m not going to re-litigate the whole thing here. Suffice to say, you either already know what I’m talking about or you’re in the wrong place.
So what happens when we reach the stage where an individual can instantiate and train one on their laptop…
I don’t mean download and run an instance. I mean: here’s the algorithm, all open-sourcy, but there has been no training; it’s a blank slate. And maybe you do what’s already been done, or maybe you train it differently.
I’m gonna stick to writing here. So now we have writers who, like me, have a few million words lying about unused and can train this LLM to write like themselves and no one else. Except, of course, in the way said human writer has already been shaped by others.
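For intuition only: the “trained on nothing but your own words” idea can be sketched with a toy bigram model. This is nowhere near an LLM, obviously, but it illustrates the same principle of a generator whose entire statistical universe is one writer’s corpus. Everything here (function names, the sample text) is illustrative, not any real training pipeline.

```python
import random
from collections import defaultdict

def train_bigrams(text):
    """Build a next-word table from a corpus -- here, only your own words."""
    words = text.split()
    table = defaultdict(list)
    for a, b in zip(words, words[1:]):
        table[a].append(b)
    return table

def generate(table, seed, length=10, rng=None):
    """Walk the table to produce text in (a crude parody of) your voice."""
    rng = rng or random.Random(0)
    out = [seed]
    for _ in range(length - 1):
        options = table.get(out[-1])
        if not options:  # dead end: the corpus never follows this word
            break
        out.append(rng.choice(options))
    return " ".join(out)
```

Swap in Kerouac alongside your journals and the table (and the output) shifts accordingly, which is the whole moral question in miniature.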
BUT…
You’ll also have writers who do that but also say, hey, I was heavily influenced by reading Kerouac so maybe I’ll feed it a little of that, or you know, my writing is inspired by and in conversation with Pynchon (you pretentious turd) so let’s give it a taste of Gravity’s Rainbow and Inherent Vice along with my ten-thousand-page free-form journal (douche) and see what comes out and…
Veering a lot further from “no that’s perfectly fine” territory… BUT…
There are acceptable reasons a human author might steep themselves in all things Hemingway and then consciously attempt to emulate that style in an effort to respond. Hell, copying the greats is a time-honored tradition. Hunter S. Thompson used to transcribe classics on his typewriter to get the feel of writing like that. The rhythm. Training the fingers, and suchlike.
Of course there are also not-so-acceptable reasons. But that’s been covered elsewhere; let’s focus on what I’m looking at here.
Individual intent – not even getting into merit – has some meaning when there is an individual capable of intent. Would a personalized LLM count as an extension of my personal intent? Would it then count as an acceptable tool? (I’m leaning generally toward no, but it heavily depends on… Look, I lean toward no because I think humans are generally terrible, yeah? So not very likely to use it correctly.)
Of course then we will also have the Clancy effect. Tom Clancy died 10 years ago and they still put books out with his name in big letters. Much easier, and eventually cheaper, if you replace the various “co-”authors with a custom-trained AI that knows naught but Jack Ryan.
Aging titans of popular literature can thus continue generating profits regardless of failing health or unfortunate interment.
That’s… grim as fuck.
Then there’s the rare big name that tries to pay it forward. Instead of ghostwriters losing their credit, there are ghostwriters getting co-writing credits with a big name who maybe did no more than sketch an idea, or more likely just leased out a setting/characters, but who is trying to help the next generation along. Now what does that big name do with this? Pack it in? Cash in on the AI and forget the incubator?
Either way … as Hollywood would have its CGI resurrections, publishing could have its immortal bestsellers. Ick.
I generally expect humanity to select the worst applications of any new technology. But I can see totally valid uses as well and so there’s a thorny “is it worth it?”
Probably not. Probably not at all.
But look, wouldn’t it be exciting to have an LLM that was exclusively trained to write like you? Fed only your words, and entirely under your control? Or say you’re part of a collective. There are two of you, five, fifteen. Maybe you’ve decided to create a shared universe and build something sprawling. Maybe you’re going it alone. Regardless:
You could skip first drafts. Or at least the difficult bits.
You could create your own immortal content machine, which is probably a terrible thing to do. Or you could use it sparingly.
And I should say in this scenario I don’t believe you ever just turn it on and leave it. Your input is necessary: concepts, direction, revision… I won’t say synthetic intelligence will never create on its own, but no currently extant iteration of the technology is anywhere remotely approaching that level. We’re not even talking sophistication here; we’re talking an evolutionary paradigm shift. Best not get into those weeds right now; it’s beside the point. Whether that ever comes to pass or not, what we have now is not intelligent. It’s a tool.
So look, maybe one day a computer writes its masterpiece. Hell yeah. But focusing on what we have available now? No, you’ll always have to manage it. It’s an underling that can help you, at best.
Anyway, the main point is this: remove the vast, uncurated glut of “training data” that represents enormous intellectual theft, train a model entirely on one’s own personal output (or that of a consenting cooperative group), and you strip away a lot of gross context, open up new thorny moral questions, and generally make the whole conversation about a thousand times more interesting.
To me, anyway.
