Stop Words

Is it a paradox that the most common words in the English language are considered irrelevant by many Natural Language Processing (NLP) approaches? This is the first sentence of the entry on stop words from Wikipedia:

Stop words are words that are filtered out before or after natural language processing because they are insignificant.

If you are trying to figure out whether a document is a fishing manual or a romance novel, sure, the word “the” is insignificant. But the language of the Wikipedia entry is so unequivocal – “because they are insignificant” – that the contrarian in me can’t help but see stop words as underdogs in some aspect of how we navigate the world in language. Humble, seemingly insignificant, and yet as ubiquitous as oxygen, stop words orient the concepts around them, connecting them to one another and the wider world.

Here’s another great quote from a tutorial:

In English vocabulary, there are many words like “I”, “the” and “you” that appear very frequently in the text but they do not add any valuable information for NLP operations and modeling.

Apparently “you” and “I” have no value when it comes to analyzing text with machine learning! Something feels off about this at the gut level – how could a computer possibly make sense of the language that people actually use if it removes all pronouns? The answer is that it depends on just what patterns the application is designed to detect.

NLP applications have incredible value for tasks such as sentiment analysis, text classification, named entity recognition, and many others. The question of which words are significant and which are not depends on whether one is trying to classify a document, detect plagiarism or spam, check grammar, or do something else entirely.

So what qualifies as a stop word to a search engine is probably not a stop word to a translator or grammar checker that needs to process every word. Use cases determine the significance of various categories of words.

And yet, there’s still something that bothers me about the necessary evil of stop words. Here is the list of stop words used by spaCy, a popular NLP library for parsing text. And here is an alphabetical snippet of the list:


back be became because become becomes becoming been before beforehand behind being below beside besides between beyond both bottom but by

Try to imagine a machine extracting significance from a text after having removed any notion of ontology or causality, processing language in a way that removes `before` and `after`, removes `everything` and `nothing`, removes `is` and `was`. One is left with a weirdly existential poetry of extreme specificity – atomic units of language floating through a void without temporality or relationships.

Here’s an experiment in reading some literary texts with the stop words removed:

Call me Ishmael.

It was the best of times, it was the worst of times.

It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.
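
A minimal sketch of this experiment, assuming a tiny hand-picked stop list rather than spaCy’s actual one. The set `STOP_WORDS` and the helper `strip_stop_words` are hypothetical names made up for illustration, and the word matching is deliberately crude, so punctuation is dropped along with the stop words:

```python
import re

# Illustrative subset only: hand-picked to cover these three quotes.
# This is an assumption for the sketch, not spaCy's actual stop word list.
STOP_WORDS = {
    "a", "be", "call", "in", "is", "it", "me", "must",
    "of", "that", "the", "was",
}


def strip_stop_words(text, stop_words=STOP_WORDS):
    """Return the words of `text` that are not stop words, joined by spaces."""
    # Crude tokenizer: runs of letters and apostrophes; punctuation is discarded.
    words = re.findall(r"[A-Za-z']+", text)
    return " ".join(w for w in words if w.lower() not in stop_words)


quotes = [
    "Call me Ishmael.",
    "It was the best of times, it was the worst of times.",
    "It is a truth universally acknowledged, that a single man in "
    "possession of a good fortune, must be in want of a wife.",
]

for quote in quotes:
    print(strip_stop_words(quote))

# Prints:
# Ishmael
# best times worst times
# truth universally acknowledged single man possession good fortune want wife
```

A real library’s list is much longer than this one, so its output on other texts would be sparser still.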

These strange artifacts are, in fact, interesting ways of removing a certain kind of noise and leaving a signal that indicates thematic constellations of each text. But of course, we lose the poetry of each of the quotations – Melville’s narrator no longer directly addresses the reader and invites them to refer to the speaker in a kind of intimate gesture. Instead, a name sits alone. Significant? Depends on what you mean by that.

Stop words are strangely insignificant when looked at in isolation, but they can be arranged in such a manner as to complicate, amplify, or negate the significance of non-stop words. A list of rules that specifies “No hitting, no biting, no scratching,” once liberated from its stop words, becomes a list of the very activities it attempts to prohibit.

Negation can be a matter of nuance even in everyday communication – often tone of voice or gesture can run against the grain of an utterance and create the implication of sarcasm or ambivalence. If the statement in question is “this book is interesting,” there are innumerable ways in which it can be subverted to mean its opposite, from rolling one’s eyes to a sarcastic tone of voice.

If we limit the scope of inquiry to language itself, negation can be communicated through words or context. Here are some ways I can negate the meaning of my claim about the book:

“I do not think the book is interesting.”
“I doubt that the book is interesting.”
“The book is interesting to a small number of people.”
“The book is not interesting.”
“The book is interesting. No, actually it is quite boring!”

In most cases, stripping typical stop words makes it more difficult to extract the true meaning of the statement. When we make meaning out of language, there is always a provisional dance of background noise and signal where words that formerly seemed insignificant may turn out to be extremely important. The performance of language always holds meaning in suspense, which is the source of so much of its strangeness, playfulness, and beauty.
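
The loss can be made concrete with a small sketch. It assumes a hand-picked stop list (the names `STOP_WORDS` and `content_words` are hypothetical, chosen for illustration, and the list is not any library’s real one). Once the stop words are gone, a claim and its negation reduce to the same words:

```python
import re

# Illustrative subset only; an assumption for this sketch,
# not a real library's stop word list.
STOP_WORDS = {"do", "i", "is", "no", "not", "the", "this"}


def content_words(text, stop_words=STOP_WORDS):
    """Return the lowercased words of `text` that are not stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in stop_words]


claim = content_words("This book is interesting.")
negation = content_words("The book is not interesting.")

print(claim)              # ['book', 'interesting']
print(negation)           # ['book', 'interesting']
print(claim == negation)  # True: the negation has vanished

# The list of rules collapses into the activities it prohibits.
print(content_words("No hitting, no biting, no scratching"))
# ['hitting', 'biting', 'scratching']
```

This is one reason some applications, sentiment analysis among them, may choose to keep words like “not” and “no” even when they filter other stop words.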

Going back to spaCy’s list of stop words, I find it strangely beautiful. There’s no meaning in it – nothing trying to be conveyed. But I find the list evocative. `became because become becomes becoming been` – the stuttering accumulation of vague states of being without referent.

Think about the difference between “a tree” and “the tree.” When I talk about “a tree,” you are free to think about any tree you wish. But when I refer to “the tree,” I ask you to think about a specific tree – perhaps you should know about it because of something that happened underneath it, a few years back. There is a world of difference erased by removing stop words – so much nuance of how we use language is tethered to differences such as the distinction between “a” and “the.”

As soon as I heard of stop words, I became convinced that I wanted to write a poem composed entirely of them. But of course, this is not a small challenge. They are words that have been selected because they are devoid of specificity. And poetry thrives on specificity, experiential richness, particular moments.

So I wrote a strange thing; I’m not sure that I succeeded in proving anything. The constraint constitutes perhaps too high a bar to create significance. The poem reads as a series of abstract utterances that are too detached from any specific context to communicate much more than a feeling of weightless melancholy and sophomoric pontification. But in the narrative I tell myself about the poem, the speaker seems to be the stop words themselves, complaining about their insubstantiality and abstractness. Words that have been deemed insignificant complaining to words that have been deemed significant. Or perhaps it is something else entirely – I cede the significance of the poem to the poem itself.

The next stage for this poem was to see if I could use Stable Diffusion, a set of AI models for translating text into images, to turn the vague abstraction of stop words into images! There is something curious to me about using stop words, arranged in the form of a poem about stop words, to prompt an AI to manifest specific images.

I used each stanza as a prompt for the model to create an image. At first, Stable Diffusion gave me back garbled words as responses. I used the prompt: “Everything if becoming seems something other than itself.”