NLP (natural language processing) and SEO
The time has come when the advance of artificial intelligence in every area of human life is palpable. Today it would be difficult to list activities that have not yet undergone major changes brought about by AI. Obviously, SEO is no exception. Given that the very existence of search engine optimization is validated by search engine algorithms, what SEO is experiencing is a real revolution. The fields and approaches are multiple. In this article we will touch on one of the milestones: natural language processing, or NLP, in SEO.
As mentioned, the influences of artificial intelligence on SEO are numerous, and covering all of them in a single article would risk excessive superficiality. Therefore, this article focuses on NLP, with SEO as the big picture.
What is natural language processing (NLP)?
So, in our examination of natural language processing and search engine optimization, the first task is to answer the question: what is NLP?
Short answer: by definition, NLP is a field at the intersection of linguistics (applied linguistics, in this case), computer science, and artificial intelligence that enables software to read human language and transform its unstructured data into structured data understandable to machines. Not only that: NLP also deals with the opposite direction, starting from structured data and elaborating that content into fluent text that humans can understand.
As a matter of fact, NLP deals with the processing of natural human language by software; in short, with how machines can sustain fluid dialogues with us humans, who possess this language naturally.
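The unstructured-to-structured direction can be illustrated with a deliberately tiny, rule-based sketch. Real NLP systems use statistical models trained on large corpora; the intent name and fields below are purely illustrative:

```python
import re

def to_structured(sentence):
    """Toy rule-based parser: turns a free-text request into structured data.

    A single regular expression stands in for what real NLP systems do with
    statistical models; it only illustrates the direction of the transformation.
    """
    m = re.search(r"book (\d+) tickets? to (\w+) for (\w+)", sentence.lower())
    if not m:
        return None
    return {
        "intent": "book_tickets",      # what the user wants to do
        "quantity": int(m.group(1)),   # how many
        "destination": m.group(2),     # where
        "day": m.group(3),             # when
    }

print(to_structured("Book 2 tickets to Rome for Friday"))
# {'intent': 'book_tickets', 'quantity': 2, 'destination': 'rome', 'day': 'friday'}
```

The machine-readable dictionary on the output side is exactly the kind of "structured data" the definition above refers to.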
Language is an innate skill of human beings, a device of nature that we have not yet fully understood. According to many, human language is our representation par excellence. Not surprisingly, Ray Kurzweil writes in his book "How to Create a Mind" (p. 56):
“Language is itself highly hierarchical and evolved to take advantage of the hierarchical nature of the neocortex, which in turn reflects the structure of the neocortex.”
Emulating the neocortex and its unique skills makes developing NLP a huge challenge. It is certainly not easy to do, but we are getting closer. Let’s see how.
Some practical examples of NLP
Moving to the practical field, NLP already makes fluid interactions between people and objects possible. Just think of the ease with which we can communicate with our car, as in the case of the Mercedes-Benz MBUX system.
Another example is the convenience and versatility of voice assistants like Alexa and Google Home.
Certainly, the NLP embedded in MBUX and Alexa still has large room for improvement. Keep in mind that in this field time is truly relative, given the exponential progression of technology. The leaps forward are remarkable. Who has never heard a comment that sounds something like "but where is all of this going to end?" Yep... and we are only at the beginning.
NLP applied to SEO: BERT
Getting closer to our topic, a concrete example of the presence and evolution of NLP within the SEO industry is Google's adoption of BERT, an acronym for Bidirectional Encoder Representations from Transformers. This update to Google's algorithms was one of the first to mark a new era in SEO: there are no longer specific errors "to be fixed" after an update. Now Google simply states that it is the overall user experience that needs to improve.
In this scenario, BERT was introduced to improve the comprehension skills of Google's algorithms. Not accidentally, the acronym contains "Transformers", the model architecture used to understand the words of a text. BERT applies the Transformer model in both directions from the analyzed word, considering the context that follows as well as the context that precedes it, not just the words that come before.
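The value of looking at context on both sides of a word can be illustrated with a deliberately tiny counting model. Real BERT uses Transformer attention over the whole sentence, not counts; this sketch and its toy corpus only show why both neighbors matter when filling in a masked word:

```python
from collections import Counter

corpus = [
    "the bank of the river was muddy",
    "the bank approved the loan quickly",
    "she sat on the bank of the river",
]

def predict_masked(left, right, sentences):
    """Toy 'bidirectional' fill-in: score candidate words by how often
    they appear between the given left and right neighbors."""
    counts = Counter()
    for s in sentences:
        words = s.split()
        for i in range(1, len(words) - 1):
            if words[i - 1] == left and words[i + 1] == right:
                counts[words[i]] += 1
    return counts.most_common(1)[0][0] if counts else None

# "the ??? of" appears twice in the corpus, both times around "bank"
print(predict_masked("the", "of", corpus))
```

A left-to-right model seeing only "the" would have no way to prefer "bank" here; the right-hand neighbor is what disambiguates.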
So, when we Google something like "html 301 redirect", the search engine might answer:
"A 301 redirect indicates the permanent moving of a web page from one location to another. The 301 part refers to the HTTP status code of the redirected page. In simple terms, a 301 redirect tells the browser: 'This page has moved permanently.'"
To find this answer, the algorithms performed several sub-tasks, such as "named entity labeling" and "question type classification", to identify the most suitable result. We do not want to delve too deeply into the technicalities of BERT in this article; here we just need to understand that the interaction of the results of these sub-tasks determines the answer with the highest score.
This same technology also underlies tools used not just by SEOs but by practically everyone, such as Gmail or Android: the ability to "guess" the next word even before we type it. This is BERT applied.
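The "guess the next word" idea can be sketched with a simple bigram counter. Products like Gmail's suggestions use neural language models, not raw counts; this toy, with an invented three-line corpus, only shows the underlying principle of predicting the most likely continuation:

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count, for every word, which words follow it and how often."""
    model = defaultdict(Counter)
    for s in sentences:
        words = s.lower().split()
        for a, b in zip(words, words[1:]):
            model[a][b] += 1
    return model

def suggest(model, word):
    """Suggest the most frequent follower of `word`, if any was seen."""
    followers = model.get(word.lower())
    return followers.most_common(1)[0][0] if followers else None

model = train_bigrams([
    "thank you for your email",
    "thank you for the update",
    "thank you very much",
])
print(suggest(model, "you"))  # "for" follows "you" twice, "very" only once
```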
The natural language generation (NLG)
So far we have seen how NLP understands content that has already been written: the human equivalent of reading. But what would it take for a machine to write? This is the field of a sub-category of NLP called natural language generation, or NLG.
Natural language generation assumes the existence of structured data, from which it can build a text that carries a coherent meaning. An example of structured data could be this table, published by Trading Economics, about Italian GDP growth:
To present these data, the webpage published a paragraph with a conversational interpretation of the graph. A paragraph like this one (taken from the website itself) is a perfect example of what NLG could already do fully automatically in 2020:
"Italy's GDP shrank by 12.8 percent on quarter in the three months to June 2020, compared to a preliminary reading of a 12.4 percent plunge and following a revised 5.5 percent contraction in the previous period. That was the steepest pace of contraction since comparable series began in the 1960s as the country was one of the hardest hit by the coronavirus pandemic. The government was forced to introduce rigid restriction measures from March 9th, which were only gradually eased from May 4th."
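A minimal, template-based sketch shows how the opening of such a paragraph can be produced from structured data. The field names below are illustrative, not Trading Economics' actual schema, and production NLG systems choose among many templates and vary the wording:

```python
def describe_gdp(data):
    """Render one row of structured GDP figures as a prose sentence.

    A single hard-coded template stands in for a real NLG system's
    grammar and template library.
    """
    direction = "shrank" if data["qoq_change"] < 0 else "grew"
    return (f"{data['country']}'s GDP {direction} by "
            f"{abs(data['qoq_change'])} percent on quarter in the three "
            f"months to {data['period']}.")

row = {"country": "Italy", "qoq_change": -12.8, "period": "June 2020"}
print(describe_gdp(row))
# Italy's GDP shrank by 12.8 percent on quarter in the three months to June 2020.
```

The template picks the verb from the sign of the figure; everything else is slot-filling, which is why reporting was such a natural first niche for NLG.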
Not surprisingly, among the fast-growing applications of NLG, one of the first niches where it has managed to penetrate is that of reporting.
Now, let’s take a look at what natural language generation can do for search engine optimization.
Natural language generation applied to SEO
Previously we saw that NLG needs structured data in order to elaborate a text understandable by humans. Narrowing down to the field of SEO, which kinds of structured data can be useful to NLG?
Before that, we need to consider that structured data arrives through data points, which can be virtually infinite and which vary according to the goals of the content we want to create. Indicatively but not exhaustively, some of the structured data useful for the SEO-NLG combination are:
- keywords: not only the main keyword, but also the long-tail ones, questions, and the most frequent synonyms;
- product descriptions and attributes such as color, size, collection, brand, and price;
- competitors' content: existing pieces that compete for the same real estate in the SERPs;
- transcriptions of phone calls and voice assistants: a huge database possibility for companies dealing with customer support;
- texts already existing on the client’s website, ebooks, or blogs.
Here it is necessary to highlight that keywords are still an important element for steering the content we are about to create, matching it with the search intent we want to satisfy. Keywords are certainly less important than they were in the recent past, but they remain a fundamental tool. As data processing models evolve, however, the function of keywords tends toward disappearing completely.
Indeed, the models process an exponentially growing number of variables: the latest model, GPT-3, processes 175 billion parameters, while its predecessor stopped at "only" 1.5 billion. Speaking of NLG processing models, let's take a closer look at some of the best known.
NLG models: ELMo
An acronym for Embeddings from Language Models, the ELMo model was developed in 2018 at the Allen Institute for AI. It produces deep contextualized representations of words that capture both the complexity of word usage and polysemy (variation across linguistic contexts). It uses two bidirectional layers of vectors, looking both before and after the word under the lens.
It can be combined with other models, significantly increasing the effectiveness of the created content. It uses three representations: a character-based one (which lets it handle words beyond those seen in training), a deep one, and the one that distinguishes it: the contextual representation.
Grover: an antidote to fake news?
Born as an evolution of the GPT-2 model (more on it below), Grover plays the role of better targeting generated content, assigning more meaning to it.
Published by the University of Washington in mid-2019, Grover can predict the next word based not only on previous words but also on elements not present in GPT-2, such as the title and the author. Does it sound like business as usual?
Not at all! Curious experiments have been made using Grover to create fake news. Yes, you read that correctly: the ubiquitous fake news. Starting from a dataset such as a newspaper website, Grover has proved deadly precise in creating a huge and fairly consistent amount of similar but false news.
Specifically, by taking up every element of a news item, Grover can change all of them consistently: the title, the text, the author, the images. In the end, the news item looks similar to the original but has a completely different orientation.
As always, technology lets us do noble things as well as less noble ones. The same goes for Grover: since its model can convincingly create fake news, it is equally able to recognize them with a high degree of accuracy. In fact, it succeeded 92% of the time. Impressive, right? Eventually this could be a starting point for building fact-checking tools.
NLG models: XLNet
A team of researchers from Google Brain and Carnegie Mellon University addressed some of BERT's major flaws and leveraged the new Transformer-XL architecture to release XLNet, earning state-of-the-art (SOTA) results on 18 NLP tasks.
XLNet improves on the Transformer architecture by being much faster (more than 1,800 times faster in evaluation, according to Google) and by adding segment-level recurrence and relative positional encoding.
The XLNet model can process much longer sentences and keep long-term dependencies better than the previous Transformer architecture. As a matter of fact, Transformer-XL can capture dependencies 80% longer than standard RNNs and 450% longer than vanilla Transformers.
NLG models: the OpenAI GPT family
When we talk about natural language generation, a separate chapter must be dedicated to the American company OpenAI. Founded in San Francisco in 2015 by well-known tech personalities such as Elon Musk and Reid Hoffman, it has grown exponentially in its nearly five years of existence.
It is no coincidence that OpenAI received US$ 1 billion in investment from Microsoft in 2019. Microsoft and India's Infosys are among the companies backing OpenAI.
OpenAI's importance for natural language generation lies essentially in its models, known as GPT.
The first Generative Pretrained Transformer (GPT)
Released back in 2018, the first GPT was able to learn dependencies in language on its own (in the same way as the Transformer model), by processing huge amounts of data.
The second version, dubbed GPT-2, was released in February 2019. The company caused a stir by limiting the publicly available functions of GPT-2, justifying this with the possible misuse of the new technology, especially the proliferation of the now-familiar fake news. And this is where the Grover model, described above, was born.
The GPT-2 generator model was trained on 8 million documents, totaling 40 GB of text, with over 1.5 billion parameters.
GPT-2 represented a huge evolution in natural language generation because it was able to create truly realistic texts, like nothing before it. In many tests it was not possible to distinguish texts created by humans from those generated by GPT-2. Needless to say, it is no coincidence that OpenAI decided to limit the functions available to the public.
What happened next? Some researchers managed to replicate the model used by GPT-2, this time making all the results public. Of course, this triggered a worldwide wave of new content creation, unfortunately often for less avowable purposes. Once again, Grover was an attempt to stem the tide of this phenomenon.
As a weakness, GPT-2 has difficulty staying coherent across long stretches of text; some tests have identified roughly 400 words as the threshold.
The advent of GPT-3
May 2020 marks the arrival of the third generation of the GPT-n series. If the previous GPT-2 had already made a huge leap forward with its 1.5 billion parameters, what about the GPT-3 which comes with 175 billion parameters? That's 116 times more parameters than before!
The GPT-3 model was trained on Tesla V100 graphics processing units, reading hundreds of billions of words from sources ranging from Wikipedia to Common Crawl, plus books and the WebText2 corpus.
The evolution of the model configurations is equally remarkable: the smallest model uses 12 layers with 12 attention heads of size 64, while the largest reaches 96 layers with 96 heads of size 128.
Consequently, since July 2020, when the GPT-3 beta became available, content creators have welcomed it with enthusiasm. Specifically for SEO, the model has proven capable of producing short texts of around 200 words with high quality, without needing additional fine-tuning by humans.
Effectively using NLG for SEO purposes
One of the first published tests of GPT-3 is Will Critchlow's, in this excellent Search Pilot article. His conclusions are clear: right now the GPT-3 model can help SEO with:
- converting bulleted lists into captivating text;
- iterating extensively over text elements such as headlines, at a scale that would not be feasible with regular copywriters;
- creating longer content, such as ecommerce product descriptions, based on structured data.
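For the last use case, the SEO's job is largely to turn structured product data into a prompt for the model. The sketch below only builds such a prompt string; the wording, attribute names, and product are invented for illustration, and an actual GPT-3 call would send this string to OpenAI's completion API:

```python
def build_product_prompt(product):
    """Assemble a prompt for product-description generation from
    structured attributes (a dict of illustrative field names)."""
    attrs = ", ".join(f"{k}: {v}" for k, v in product.items())
    return ("Write an engaging ecommerce product description.\n"
            f"Attributes: {attrs}\n"
            "Description:")

prompt = build_product_prompt({
    "name": "Trail Runner X",   # hypothetical product
    "color": "blue",
    "size": "42",
    "price": "€89",
})
print(prompt)
```

The trailing "Description:" cue is a common prompting pattern: the model is asked to continue the text from that point, so its completion becomes the description itself.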
I asked Critchlow for his opinion on GPT-n developments over the next two years. In his view, "content that may require hundreds or thousands of similar pieces written mechanically, such as titles, will usually be written by artificial intelligence." He also thinks that "more creative content will still be quite experimental".
Therefore, there is still a long way to go; total autonomy is far off. I also asked him whether he expects NLG in general to be able to create longer content of proper quality any time soon. According to Critchlow, we can expect it "in terms of being able to produce coherent output", but he has doubts about whether it will replace "the kinds of things that longer-form content is for".
He then offers a sharp summary: "The direction of travel is that the technology is getting more and more coherent without necessarily being more information rich."
We can conclude that even though copywriters are already being replaced by artificial intelligence in some places, as Microsoft recently did at MSN, the moment when they will no longer be needed has not yet arrived.
Would you like to try out NLG?
Before wrapping up this topic, a suggestion: if you are curious to see GPT-3 in action, visit Philosopher AI, where you can ask questions in English. The answers are generated on the fly by artificial intelligence.
Spoiler Warning: all politically incorrect questions are "politely" declined this way: "Philosopher AI thinks this is nonsense, and is refusing to answer your query. It appears you will have to try something else."
In the midst of this SEO “AI-ation”, how will Google behave?
First, let's be more specific about this provocative-sounding question. As we have seen so far, the models used for natural language generation, especially GPT-3, have developed enormously. NLP, and specifically NLG, have reached a point where, if they cannot yet produce all content themselves, they can already do so partially, or come remarkably close.
Given that almost all relevant content is indexed by Google, which has become the omnipresent sextant of humanity, we need to consider what the search engine intends to do. In its guidelines, Google specifically says it can take action against "text generated through automated processes, such as Markov chains".
A debate arises here, because these guidelines were evidently conceived at a time when automated text generation was effectively equivalent to spamming. That has dramatically changed. We all know how much Google's algorithms focus on the quality and relevance of content. If AI can help SEO create quality pieces that truly add value to the user experience, why punish such a practice?
For now, we do not know. As Critchlow recalled in his article, Google acknowledges that the debate exists within the company. We will see. It is difficult to imagine that the guidelines will not change sooner or later, recognizing the practice as potentially valuable.
The future: will copywriters disappear, and will we only use NLP?
This question actually mirrors a broader one. Artificial intelligence is undeniably growing fast, conquering new territory at an exponential pace. Nearly every human activity is assisted, and sometimes even managed, by AI. Who has never asked: what will humans do in this scenario?
Similarly, if natural language generation becomes able to autonomously produce new, coherent, and meaningful texts for SEO, what will remain of the profession of copywriters, so important at the present moment? In theory, it would be displaced entirely. But will this happen? When could it happen? The question is justified by the extraordinary evolution of NLP, which is outpacing even the evolution of computer chips.
Right now we have no concrete evidence to say when, or even whether, that will ever really happen. What can be said, at this moment, is that the path has already been drawn.