orphans.ai

CHAPTER FOUR

The data problem

In the last three chapters I told you about a pattern. An inherited disposition. Carriers of it, across twenty years of working life. Scenes in kitchens and on platforms and in a spare room and over Slack, none of which were ever written down. In twenty-five years of business there have been plenty of Grahams too. The pattern does not stop. Stopping would be a different life.

Now I have to make the argument explicit. I have been putting it off, across three chapters, and the reader who has got this far has earned it.

I was going to write this chapter as an argument. Then a week happened in my own house. The argument wrote itself in live time, in Birmingham, in scenes I did not plan, in the voice of a fifteen-year-old on a phone and a boy in a baseball hat at a door. I am going to tell you about the week first. The argument is inside the week. If the week lands, the argument lands. That is how this kind of knowledge has always travelled.

Two years ago my daughter told me, on a Wednesday, that a boy she knew was going to stay the night at our house. She had already said yes. She was telling me so I could know. I asked her to describe him to me. She said — he is black, from inner Birmingham, has been expelled from six schools, and has left home again because he and his mum have had a fight.

I thought oh no.

I want to be honest about that, because it matters for everything that comes after. I did not say it out loud. I did not text my daughter to say sorry, not tonight. I did not construct a plausible excuse. I thought oh no for about one second, privately. Then the second passed, and I said yes, and he came.

He was quiet. He was shy. He would not take his hat off. He said himself, in the kitchen, that he never takes his hat off — a small declaration from a boy sitting in a house that was not his. He mumbled, mostly. I chatted with him for a while. I took the piss out of him a little, the way you would in Shields or, I suspect, in inner Birmingham. He laughed. He stayed. He was polite.

I did not think much about it afterwards. A boy came, a boy went, my daughter had a friend, the disposition did the thing it does without my paying attention to it. I forgot, for the most part.

Last night my daughter went out in Birmingham. She is fifteen. She went to a small club at the back of a pub near Moseley, called the Flag and Castle.

She was not meant to get in. You had to be sixteen. She said she was. The bouncers knew she was not. One was a big white Brummie. The other was an older black Brummie. They smiled and chatted with her at the door. They were lovely. They understood what was happening. They also had to play by the rules.

They let me walk in past them, without saying anything, to go and find a friend of hers who looked eighteen and could make the fiction a bit more plausible. The place was rammed. Four hundred kids. Fifteen to eighteen. The Flag and Castle is a dump now. Vape shops on either side. Rubbish in the doorway. Was probably a nice pub once. From the outside there is no reason to walk past it, ever. From the outside there is no reason to think it exists.

Inside, the place was full of energy and hope and alive. Hidden from view.

The chaperone for my daughter that night was the boy who had stayed at our house two years ago.

I did not recognise him at first. He was about six feet tall. White rapper clobber, head to toe. Gold chains. Massive braces when he smiled. A white baseball hat. He could not be missed. He saw me first. He walked up and said hi Doug, lovely to see you.

I said something like wow, you look great. I asked him about school. He said he had gone back. We chatted about his mum for a minute. He said he was trying not to fight any more.

I walked away from that door thinking how far has he come.

Then I walked out of the pub, and I thought — I have no idea why I am telling you this. Then the second thought came. I am telling you because this is what I do not know how to say in any other way. This book is where I am trying to say it. The boy is a case of it. The fact that I did not plan to tell you about him is the whole of why I need to.

Two days ago my daughter rang me.

Dad, I need some more money.

Why. I just gave you some two hours ago.

I bought two homeless people food. They were nice.

She laughed at me on the phone. She asked for more money. Then she said — dad, you would have done that too.

I gave her the money. I laughed. I came off the phone and I thought, my job is done. How far she has come.

Then I rang a friend.

He is the friend I told you about in chapter two. He named the pattern, years ago, as LinkyBrains. I did not ring him to tell him what had happened. I rang him because something had landed and I wanted another carrier to receive it. That is what carriers do when something has landed. The only person who would know what the phone call from my daughter had just meant was somebody who carries the same disposition. I rang him. He picked up. I told him. He understood. We moved on.

This part of the scene will also not be in any dataset. Nobody writes down the phone call to the other carrier. Nobody posts I just rang him because my daughter just said something that I cannot quite absorb without saying it to another person who would understand. It does not appear in any inbox I have given to any machine. It will not appear in this paragraph in any form a machine could train on with meaning. I rang him. That is the whole of it. The ringing is part of the pattern too.

Yesterday afternoon I walked through Birmingham with my daughter. I was carrying an Indian takeaway.

We passed a homeless man. I walked past him.

Then, about a step later, I thought — why now.

I turned round. I went back. I gave him the takeaway.

I went back because my subconscious had been holding my daughter's words from the day before. Dad, you would have done that too. And she was walking beside me. She did not say anything. She did not need to.

I want to sit with that for a moment. It is the thing the chapter has to land.

I had taught her the disposition without meaning to. I had done it the way my grandmother had done it to me. I had done it the way my uncle had done it to me on the walks to the river. I had done it by being in the room while she was growing up, doing the thing, without announcing that the thing was being done. She had absorbed it. Then she had phoned me two days ago and, in the voice of a fifteen-year-old who thought she was asking for more money, had returned the disposition to me. Dad, you would have done that too. When I walked past the homeless man yesterday, her voice from the day before was the thing in my head that made me turn round.

The grandchild taught the grandparent. The fifteen-year-old taught the sixty-two-year-old. The disposition was carried back to me, silent, by the person I had unknowingly transmitted it to.

That is how this kind of knowledge has always moved. It moves in both directions. It is not a download from elder to child. It is a loop between the people in the room, running across a lifetime, with nobody quite knowing who taught whom.

None of it is in any dataset. Not one second of any of it.

I want now to tell you what I think a machine being trained on the public record would find, if I gave it the whole of the week I have just described and told it to look for the training data.

It would find a fifteen-year-old going into a pub in Moseley. The pub would appear, if it appeared at all, as a Google Maps pin with a two-star rating and a handful of old reviews. No article, no trade press, no mention of the four hundred kids inside it on a Wednesday night. The pub is not known to the internet as a place where anything is happening. It is known to the internet as a dump. That is the known known.

The machine might be able to infer some of what is adjacent. That fifteen-year-olds in Birmingham go to pubs at weekends. That they drink. That there is probably more happening in Moseley than the internet records. These are gaps you can see the edges of from inside the public record. You know what shape of data you are missing. A lab could, in principle, commission a study.

What the machine cannot see, at all, is the third layer. The category it has not been taught exists. The oh no I thought, two years ago, when my daughter told me the boy was coming. The one second in which I might have said no. The second in which I did not say no, and did not announce to anyone that I had not said no, and then forgot the second entirely because a disposition I did not have a name for did the work underneath the thought. My daughter's laugh on the phone at me for asking. He picked up. The pause on a Birmingham street yesterday. Why now.

These are not missing because they are paywalled. They are not missing because they are in the wrong language. They are not missing because a licensing deal has not yet been done. They are missing because, in the conversation about what data the machines should be trained on, nothing in the conversation has a shape that would receive them.

There is no team at any lab whose job it is to collect the oh no that happens in private the second before a disposition overrides it. There is no dataset called phone-calls-to-other-carriers-when-something-lands. There is no budget line for why-nows. The conversation does not have the category. Because the conversation does not have the category, no amount of working hard on data collection — in the sense in which data collection is currently understood — will produce the thing.

The pub as a Google pin is the public record. The rough statistical shape of what happens inside places like the pub is what a machine might infer from it. Every specific judgement by every specific person that makes each of those places what it is, for each of the other specific people in it, across each of the specific relationships they have, over each of the years they have had those relationships — that is the layer underneath. That is the layer. It is not missing because anyone decided to leave it out. It is missing because nobody has yet said, inside the rooms where these decisions get made — that layer exists, it is bigger than any of the layers we are currently working on, and we do not know how to collect it.

This book is trying to get that sentence into those rooms.

I want to say something about why the labs, specifically, have not seen this.

This is not an accusation. The people at the labs are some of the most thoughtful people working on anything right now. They are not missing this layer because they are careless or because they do not care. They are missing it because of who they are and where they have been.

The labs are full of people who went, in their teens, to selective schools. Who went, in their early twenties, to elite universities. Who went, in their mid-twenties, into PhD programmes or into the companies that became the labs. They are brilliant and serious. They are also, by the selection funnel they have been through, people for whom the public record has always been where the action was.

Imagine the funnel from inside it. A clever child, encouraged at school. Doing well. Sent on to better schools, where doing well was the whole of what was asked. Head down. University. A subject. Then a narrower part of that subject. Then a narrower part of that. The knowledge gets deeper. It also gets narrower.

They did not, at fifteen, go to underground clubs with black kids in the backstreets of Birmingham. They did not stand in a pub doorway in Moseley while a bouncer who knew the rules let them through anyway. They did not have a chaperone in a white baseball hat saying lovely to see you Doug. The funnel did not lead through those rooms. The funnel led past them.

They grew up inside the part of human knowledge that writes itself down. Their teachers wrote books. Their supervisors wrote papers. Their colleagues write posts. The work they did to get to where they are was, mostly, legible work — work that was read and evaluated and rewarded by other people writing things down.

The kitchen layer of human knowledge was not, for them, invisible. It was just not the layer they were selected against. The thing they were selected against was producing publishable output. They produced it. They were rewarded for it. They now work on a system that is being trained on what they and people like them produced.

The disposition I have been describing — the silence, the not-writing-about-it, the private exchange, the oh no overridden in the second before anyone knew it was being overridden — is not the disposition the labs selected for. The labs selected for the opposite. The people who carry the silent layer are, almost by definition, not the people the labs hired. The hiring pipeline filtered them out, without malice, without noticing, because the pipeline was looking for something else.

This is not a problem with the labs. It is a feature of the pipeline that produces the labs. Nobody designed it. It is emergent. It means the people who could most easily see the missing layer are the people least likely to be inside the buildings where the decisions about the training data are made.

The Flag and Castle on a Wednesday night is full of four hundred fifteen-to-eighteen-year-olds, in energy and hope, hidden from view. No machine trained on the public record knows the Flag and Castle is alive on a Wednesday night. A bouncer on the door does. A fifteen-year-old going in does. A boy who has been expelled from six schools, and has gone back, and is in a white baseball hat saying hi Doug, lovely to see you, does. The people in the rooms where the training decisions get made, mostly, do not.

That is the gap. That is what this book is about.

I said, at the start of the chapter, that the layer is missing for reasons that are not about malice or neglect. I want to be specific about what kind of problem this is. The diagnosis matters for the fix.

It is not a problem that will be solved by more data of the kind the machines already have. More scraping of more public forums, more translation of more non-English text, more digitisation of more archives, more deals with more publishers — these are all worth doing. They are all known-unknowns work. None of them touches the layer this chapter has been describing.

It is not a problem that will be solved by better algorithms. The algorithms are not the bottleneck. An algorithm that has not seen the layer cannot invent the layer. It can only pattern-match on the adjacent things it has seen, and produce a competent version of the wrong thing.

It is not a problem that will be solved by asking the models. You cannot ask a model what it does not know. You can only notice, by working alongside it, the places where it is missing. Most people working alongside these models are not carriers. Most of them will not notice. The model will feel to them like it is doing fine. The output will be competent. Nothing will seem wrong. The carriers, in their kitchens and on their Birmingham streets and at their pub doors in Moseley, will be producing their short silent exchanges that nobody is logging.

What I mean by a data problem is narrower and harder. The missing layer is not a mystery. It exists. It is carried inside living people. It could, in principle, be collected. It has never been collected at any meaningful scale. The instruments we have for collecting human text are not suited to collecting this kind of text. The people being asked to build those instruments have not yet recognised that a different kind of text exists.

The problem is not that the data is secret. The problem is that the data is shaped wrong for the pipelines we have. It is short. It is private. It is embedded in relationships. It is illegible without its context. It does not fit into any dataset in the form in which it is actually produced.

This is a solvable problem. It requires a different kind of instrument than the ones the labs currently build. An instrument that can receive short, private, embedded, illegible text in the form in which it is actually produced, and do something useful with it. That instrument does not yet exist. It could. Some of the people reading this book will be the ones to build it, if they decide the thing is worth building.

This book is an attempt to argue that the thing is worth building.

I want to close this chapter with what happened, rather than with an argument.

The boy said lovely to see you Doug in a doorway in Moseley last night.

My daughter said dad, you would have done that too on the phone two days ago.

I went back and gave a homeless man an Indian takeaway yesterday, because the fifteen-year-old walking next to me had said a thing the day before that my subconscious would not let go of.

I rang him to tell him. He picked up.

None of it is in any dataset.

All of it is the training data.

The next chapter is about why the carriers will not solve this for you by writing it down. The carriers, by disposition, do not write. That is the disposition that produced the pattern. It is also the disposition that is going to produce the silence about the pattern, unless someone intervenes.

I am intervening, for now, in this book. I am one person. I cannot carry the whole layer by myself. The book is a naming. The collection is someone else's work.

Pass it on