How can we measure the proximity of meaning between two sentences that have no words in common?
Introduction to semantics
The semantic similarity task is a subtask of the Sentence Pair task, which consists of comparing a pair of texts and determining the relation between them.
In our case, the pair of texts is a pair of grammatical sentences whose semantic relation we try to determine. That is to say: are these two sentences synonymous? Antonymous? Or without any semantic relationship at all?
Sense4data uses semantic similarity in its Job Seeker Matching algorithm (Sense4Matching) to compare the skills described by candidates in their resumes with those requested in a job offer. This makes it possible to compute a compatibility rate between the offer and the resume.
Different approaches
A first approach consists of comparing the words contained in the two texts and calculating a similarity score based on the number of words they have in common.
The most common scoring method is the Jaccard index, defined as:
Jaccard(A, B) = |A ∩ B| / |A ∪ B|, with values in [0, 1]
Let the following pair of sentences (A, B) be:
A: Construction monitoring
B: Supervision of the building site
The Jaccard index of (A, B) is 0, which reflects the fact that the two sentences are totally different from a formal point of view: they have no words in common. However, they are very similar from a semantic point of view. Calculating the similarity between two sentences at the morphological level is therefore not sufficient: the sentences must be examined through their meaning.
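To make the computation concrete, here is a minimal Python sketch of this word-overlap score (lower-casing and whitespace tokenization are simplifying assumptions, not part of any particular library):

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard index between the word sets of two sentences."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

# The pair above: no shared words, so the score is 0.0.
print(jaccard("Construction monitoring", "Supervision of the building site"))
# A pair that does share words gets a higher score (here 0.5).
print(jaccard("Construction monitoring", "Monitoring of the construction"))
```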
One technique for measuring the proximity of meaning of two sentences that have no words in common, and for overcoming the limits of "keyword" approaches, is the use of lexical embeddings. These embeddings are mathematical representations of words, in the form of vectors in a multidimensional space.
One of the first models designed for this purpose is Word2Vec [1], developed at Google under the guidance of T. Mikolov in 2013. This model, built on a two-layer neural network, is trained to produce spatially close numerical vectors for words that share the same contexts. The vectors resulting from training are static.
Since they are mathematical vectors, it is possible to perform algebraic operations (addition, subtraction, etc.) to deduce relationships.
Taking the words "mom", "dad", "woman" and "man", we can deduce the following relationship:
mom = dad - man + woman
The Word2Vec vectors are computed in such a way that this equation approximately holds for the corresponding vectors.
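As a quick, hedged illustration, the same analogy can be queried with the gensim library and the pretrained Google News Word2Vec vectors distributed through gensim's downloader (note that the download is over a gigabyte):

```python
import gensim.downloader as api

# Pretrained Word2Vec vectors (300 dimensions), trained on Google News.
wv = api.load("word2vec-google-news-300")

# dad - man + woman ≈ ?  The nearest neighbours should include "mom".
print(wv.most_similar(positive=["dad", "woman"], negative=["man"], topn=3))
```

The equality is of course approximate: the analogy is recovered by looking for the vectors closest to the result of the arithmetic, not by an exact match.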
The following figure represents a set of words encoded with the Word2Vec algorithm. A dimensionality reduction [2] was applied to the resulting vectors to represent them in a two-dimensional space.
We observe that most words belonging to the same lexical field are close to each other in this space. This is the case for words like "mango" and "banana", which refer to the same idea: food.
However, let's take the case of the ambiguous word "avocat" (the French word for both "lawyer" and "avocado"), which, depending on the context, can refer to a fruit or to an individual exercising a legal function. We notice that it appears within the legal lexical field but at a great distance from the food lexical field.
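As a rough sketch of how such a two-dimensional map can be produced, the word vectors can be projected with scikit-learn's t-SNE implementation; `wv` below is the gensim model loaded in the previous sketch, and the word list and hyperparameters are arbitrary choices for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

words = ["mango", "banana", "apple", "law", "court", "judge", "lawyer"]
vectors = np.array([wv[w] for w in words])  # one 300-dimensional Word2Vec vector per word

# Reduce the vectors to 2 dimensions for plotting (perplexity must stay below the sample count).
coords = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.show()
```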
In order to solve the above-mentioned problem and obtain a more efficient and meaningful representation of words, various models have emerged in recent years. These so-called contextual models make it possible to obtain, for a given word in a sentence, a vector representation computed according to the words surrounding it, instead of a fixed vector.
Among the models developed for this purpose, BERT [3] is the one that has revolutionized the world of automatic natural language processing, by breaking several records on language-based tasks.
What is the BERT model and how does it work?
BERT (Bidirectional Encoder Representations from Transformers) is a language model developed by Google in 2018. It performs better than its predecessors in terms of results and learning speed. Once pre-trained in an unsupervised way, it can be fine-tuned to perform a more specific task with little data.
This model, based on bidirectional Transformers, can handle various tasks such as machine translation, PoS tagging, text generation, semantic similarity and more.
A Transformer is a model that takes one sentence as input and produces another as output. It consists of an equal number of encoders and decoders.
All encoders have an identical architecture, illustrated in the figure below.
They are each composed of a self-attention layer and a feed-forward neural network [4].
The self-attention mechanism allows the encoder, while encoding a given word, to look at the other words in the input sequence and determine which of them are most contextually and semantically relevant to the word being encoded, in order to produce the best possible vector representation of it. Each encoder can therefore consult words from both the past and the future.
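To make this less abstract, here is a minimal numpy sketch of scaled dot-product self-attention, the operation at the core of this layer (a single attention head, no masking, and random toy matrices rather than trained weights):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) token embeddings; Wq/Wk/Wv: learned projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # relevance of every token to every other token
    weights = softmax(scores, axis=-1)       # attention weights, each row sums to 1
    return weights @ V                       # one context-aware vector per token

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4              # a toy 5-token "sentence"
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (5, 4): one contextual vector per token
```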
Decoders also have these two layers, except that between the two there is another layer called encoder-decoder attention which allows communication between encoders and decoders.
This attention mechanism allows decoders to understand which words in the input sequence are most relevant to generating an output word. Each decoder only consults words at previous positions to generate the current word since the future is yet to be predicted.
The architecture of the decoders is shown in the figure below:
BERT uses only a part of the Transformers architecture. Indeed, as its name indicates, it is composed of a stack of bidirectional encoder blocks, without decoders. It takes as input a sequence of tokens (words, punctuation marks, etc.) that pass through these encoder layers and produces as output a sequence of contextual vectors.
BERT is available in two sizes:
a Base model, with 12 encoders;
a Large model, with 24 encoders.
For more information on Transformers and BERT, see the articles by J. Alammar [4] and P. Denoyes [5] on the subject.
As seen previously, the output of the BERT model is a list of vectors corresponding to the vector representations of the tokens of the input sentence. To obtain a single embedding for the whole sentence, several methods can be applied, such as averaging the token vectors or summing them.
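The sketch below shows one such pooling, assuming the Hugging Face transformers library and the publicly available bert-base-multilingual-cased checkpoint (the checkpoint is an assumption for illustration, not the model used by Sense4Matching):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

sentence = "Supervision of the building site"
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    token_vectors = model(**inputs).last_hidden_state   # (1, n_tokens, 768)

# Mean pooling: average the token vectors (weighted by the attention mask)
# to obtain a single 768-dimensional sentence embedding.
mask = inputs["attention_mask"].unsqueeze(-1)
sentence_vector = (token_vectors * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_vector.shape)  # torch.Size([1, 768])
```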
The method we will prefer is the one using the Sentence-BERT (S-BERT) model developed by Reimers and Gurevych [6].
S-BERT is a modified version of the BERT model in which Siamese BERT networks have been introduced. This facilitates the representation of sentences, paragraphs and images in a vector form. The text is encoded in a vector space so that texts with the same meaning are spatially close and can be identified using the cosine distance.
Cosine similarity is a similarity score between two n-dimensional vectors, equal to the cosine of the angle between them (the cosine distance is simply 1 minus this score).
For two vectors A and B, it is defined as:
cosine(A, B) = (A · B) / (‖A‖ ‖B‖), with values in [-1, 1]
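A tiny numpy sketch of this formula:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])))  # 1.0: same direction
print(cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0])))            # 0.0: orthogonal vectors
```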
The Sense4Matching algorithm uses a version of S-BERT trained with multilingual Siamese BERT networks.
This makes it possible to capture the semantic similarity of skills described in the same language as well as of skills described in different languages. It therefore allows documents without words in common to be processed, and avoids penalizing CVs whose skills and experiences are worded differently from those in the offer. This is particularly relevant for technical CVs and job offers, in which a large number of English terms are used.
Below are examples of results obtained by determining the semantic similarity between skills via S-BERT and cosine similarity:
A : Accompaniment of the construction works
B : Construction supervision
cosine (A, B) = 0.79
C : Generation of graphics
D: Drawing production
cosine (C, D) = 0.72
E : Defect correction
F : Anomaly correction
cosine (E, F) = 0.67
G : Traitement du langage naturel
H : Natural language processing
cosine (G, H) = 0.78
I : Project management
J : Projects management
cosine (I, J) = 0.97
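Comparable scores can be reproduced with the sentence-transformers library; the checkpoint below is a publicly available multilingual S-BERT model chosen for illustration (the article does not specify the exact model behind Sense4Matching, so the numbers will not match the ones above exactly):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

skills_cv = ["Accompaniment of the construction works", "Traitement du langage naturel"]
skills_offer = ["Construction supervision", "Natural language processing"]

emb_cv = model.encode(skills_cv, convert_to_tensor=True)
emb_offer = model.encode(skills_offer, convert_to_tensor=True)

# Pairwise cosine similarity matrix between CV skills and offer skills.
scores = util.cos_sim(emb_cv, emb_offer)
print(scores)  # values close to 1 indicate semantically similar skills
```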
There are many other tasks for which semantic similarity can be used.
To learn more about the range of skills within Sense4data, go here.
References
Articles
[1] Tomas Mikolov (2013), Efficient Estimation of Word Representations in Vector Space
[3] Andreas Zell (1994), Simulation Neuronaler Netze [Simulation of Neural Networks]
[6] Nils Reimers, Iryna Gurevych (2019), Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Links
[2] https://fr.wikipedia.org/wiki/Algorithme_t-SNE
[4] https://jalammar.github.io/illustrated-transformer/
[5] https://lesdieuxducode.com/blog/2019/4/bert-le-transformer-model-qui-sentraine-et-qui-represente