
Natural language processing

How to measure the proximity of meaning of two sentences without any words in common?
Introduction to semantics

The semantic similarity task is a subtask of the Sentence Pair task, which consists of comparing a pair of texts and determining the relation between them.

In our case, this pair of texts is a pair of grammatical sentences whose semantic relation we try to determine. In other words, are the two sentences synonymous? Antonymous? Or do they have no semantic relationship at all?

sense4data uses semantic similarity in its job seeker matching algorithm (Sense4Matching) to compare the skills described by candidates in their resumes with those requested in the job offer. This makes it possible to calculate a compatibility rate between the offer and the resume.

Different approaches

Keyword approach

A first approach consists of comparing the words contained in the two texts and computing a similarity score based on the number of words they have in common.

The most common scoring method is the Jaccard index, defined by the following formula:

Jaccard(A, B) = |A ∩ B| / |A ∪ B|,  with Jaccard(A, B) ∈ [0, 1]

Consider the following pair of sentences (A, B):

A: Construction monitoring

B: Supervision of the building site

The Jaccard index of (A, B) is 0. This is explained by the fact that the two sentences are completely different from a formal point of view: they have no words in common. However, they are very similar from a semantic point of view. Calculating the similarity of two sentences from a purely morphological point of view is therefore not sufficient; the sentences must be examined through their meaning.
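
In code, this keyword approach can be written as a minimal sketch in a few lines of Python:

    # Minimal sketch: Jaccard index between two sentences, based on their words.
    def jaccard(sentence_a, sentence_b):
        words_a = set(sentence_a.lower().split())
        words_b = set(sentence_b.lower().split())
        # |A ∩ B| / |A ∪ B|
        return len(words_a & words_b) / len(words_a | words_b)

    print(jaccard("Construction monitoring", "Supervision of the building site"))  # 0.0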

Word embeddings

One technique for measuring the proximity in meaning of two sentences that have no words in common, and for overcoming the limits of the "keyword" approach, is the use of word embeddings. Embeddings are mathematical representations of words in the form of vectors in a multidimensional space.

One of the first models designed for this purpose is Word2Vec [1], developed at Google under the guidance of T. Mikolov in 2013. This model, based on a two-layer neural network, is trained to produce spatially close numerical vectors for words sharing the same context. The vectors resulting from this training are static.

Since they are mathematical vectors, it is possible to perform algebraic operations (addition, subtraction, etc.) to deduce relationships.

Taking the words "mom", "dad", "woman" and "man", we can deduce the following relationship:

mom = dad - man + woman

The Word2Vec vectors are computed in such a way that this equation approximately holds for the corresponding vectors.
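
As an illustration, this kind of vector arithmetic can be sketched with the gensim library and its pre-trained "word2vec-google-news-300" vectors (an assumed setup for the example, not necessarily the model behind the figure below):

    # Minimal sketch: word analogy with pre-trained Word2Vec vectors (assumed model).
    import gensim.downloader as api

    vectors = api.load("word2vec-google-news-300")  # pre-trained Word2Vec vectors

    # "dad" - "man" + "woman" should land close to "mom" in the vector space.
    print(vectors.most_similar(positive=["dad", "woman"], negative=["man"], topn=3))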

The following figure represents a set of words encoded with the Word2Vec algorithm. A dimensionality reduction [2] was applied to the resulting vectors to represent them in a two-dimensional space.

[Figure: Word2Vec word vectors projected into a two-dimensional space]
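
As a side note, here is a minimal sketch of how such a two-dimensional projection can be produced, assuming the gensim vectors above and scikit-learn's t-SNE implementation (the word list is an arbitrary example):

    # Minimal sketch: reduce word vectors to 2D with t-SNE for visualisation.
    import numpy as np
    import gensim.downloader as api
    from sklearn.manifold import TSNE

    vectors = api.load("word2vec-google-news-300")
    words = ["mango", "banana", "judge", "court", "lawyer"]
    X = np.array([vectors[w] for w in words])

    # t-SNE projects the 300-dimensional vectors onto 2 dimensions.
    X_2d = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(X)
    for word, (x, y) in zip(words, X_2d):
        print(f"{word}: ({x:.2f}, {y:.2f})")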

We observe that most of the words belonging to the same lexical field are in close proximity in space. This is the case for words like "mango" and "banana", which refer to the same idea: food.

However, let's take the case of an ambiguous word such as the French "avocat", which can be, depending on the context, a fruit (avocado) or a person exercising a legal function (lawyer). We notice that it is placed within the legal lexical field but at a great distance from the food lexical field: since Word2Vec assigns a single static vector to each word, it cannot capture both meanings at once.

Contextual models

In order to solve the above-mentioned problem and to obtain a more efficient and meaningful representation of words, various models have emerged in recent years. These so-called contextual models make it possible to obtain, for a given word in a sentence, a vector representation computed from the words surrounding it rather than a fixed vector.

Among the models developed for this purpose, BERT [3] is the one that has revolutionized the world of automatic natural language processing, by breaking several records on language-based tasks.

What is the BERT model and how does it work?

BERT (Bidirectional Encoder Representations from Transformers) is a language model developed by Google in 2018. It performs better than its predecessors in terms of results and learning speed. Once pre-trained in an unsupervised way, it can be fine-tuned to perform a more specific task with little data.

This model, based on bidirectional Transformers, can handle various tasks such as machine translation, PoS tagging, text generation, semantic similarity and more.

A Transformer is a model that takes one sentence as input and produces another as output. It consists of an equal number of encoders and decoders.

[Figure: The Transformer architecture, composed of a stack of encoders and a stack of decoders]

All encoders have an identical architecture, illustrated in the figure below.

They are each composed of a self-attention layer and a feed-forward neural network [3].

[Figure: Encoder architecture: a self-attention layer followed by a feed-forward network]

The self-attention mechanism allows the encoder, when encoding a given word, to look at the other words of the input sequence and determine which ones are most relevant to produce the best vector representation of that word; in other words, which words of the input sequence are the most contextually and semantically related to the word being encoded. Each word can therefore be encoded by consulting words from both its past and its future.
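
To make this concrete, here is a minimal sketch of the scaled dot-product attention underlying this mechanism, in a simplified single-head form (the real model uses multi-head attention with learned projection matrices):

    # Minimal sketch: single-head scaled dot-product self-attention.
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(Q, K, V):
        # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)     # relevance of every word to every other word
        weights = softmax(scores, axis=-1)  # one attention distribution per word
        return weights @ V                  # context-aware representation of each word

    # Toy example: 4 "words" with 8-dimensional embeddings (Q = K = V here).
    X = np.random.default_rng(0).normal(size=(4, 8))
    print(self_attention(X, X, X).shape)  # (4, 8)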

Decoders also have these two layers, except that between them there is an additional layer, called encoder-decoder attention, which allows communication between the encoders and the decoders.

This attention mechanism allows decoders to understand which words in the input sequence are most relevant to generating an output word. Each decoder only consults words at previous positions to generate the current word since the future is yet to be predicted.

The architecture of the decoders is shown in the figure below:

[Figure: Decoder architecture: self-attention, encoder-decoder attention and a feed-forward network]

BERT uses only part of the Transformer architecture. Indeed, as its name indicates, it is composed of a stack of bidirectional encoder blocks, without decoders. It takes as input a sequence of tokens (words, punctuation marks, etc.) that passes through these encoder layers, and produces as output a sequence of contextual vectors.

BERT is available in two sizes:

  • the base model, with 12 encoders,
  • the large model, with 24 encoders.
For more information on Transformers and BERT, see the articles by J. Alammar [4] and P. Denoyes [5] on the subject.

Vector representation of the sentence

As seen previously, the output of the BERT model is a list of vectors corresponding to the vector representations of the tokens of the input sentence. To obtain a single embedding for the whole sentence, several methods can be applied, such as averaging the token vectors or summing them.
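
As an illustration, here is a minimal sketch of the averaging approach using the Hugging Face transformers library (the model name and the mean pooling are assumptions for the example, not the exact setup used by Sense4Matching):

    # Minimal sketch: sentence embedding by averaging BERT token vectors (assumed model).
    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModel.from_pretrained("bert-base-multilingual-cased")

    inputs = tokenizer("Supervision of the building site", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # last_hidden_state holds one contextual vector per token: shape (1, n_tokens, 768).
    token_vectors = outputs.last_hidden_state
    # Averaging over the token axis gives a single sentence vector: shape (1, 768).
    sentence_vector = token_vectors.mean(dim=1)
    print(sentence_vector.shape)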

The method we prefer is the one based on the Sentence-BERT (S-BERT) model developed by Reimers and Gurevych [6].

S-BERT is a modified version of the BERT model into which Siamese BERT networks have been introduced. This makes it easier to represent sentences, paragraphs and images in vector form. The text is encoded into a vector space in such a way that texts with the same meaning are spatially close and can be identified using cosine similarity.

The cosine similarity is a similarity score between two n-dimensional vectors: it is the value of the cosine of the angle between them.

Given two vectors A and B, the formula is:

cosine(A, B) = (A · B) / (||A|| ||B||),  with cosine(A, B) ∈ [-1, 1]

  • The closer the cosine is to 1, the more synonymous the sentences are,
  • The closer it is to -1, the more opposed the sentences are,
  • The closer it is to 0, the less semantic relationship exists between the sentences.
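
In code, this score can be computed directly from the vectors, for example with numpy (a minimal sketch):

    # Minimal sketch: cosine similarity between two vectors.
    import numpy as np

    def cosine_similarity(a, b):
        # cosine(A, B) = (A . B) / (||A|| ||B||), in [-1, 1]
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([2.0, 4.0, 6.0])
    print(cosine_similarity(a, b))  # 1.0: same direction, maximal similarity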

S-BERT comes with a wide range of pre-trained models.

The Sense4Matching algorithm uses the one trained with multilingual Siamese BERT networks.

This makes it possible to capture the semantic similarity of skills described in the same language, as well as of skills described in different languages. It can therefore process documents that have no words in common and avoid discriminating against CVs whose skills and experience are worded differently from those in the offer. This is particularly the case for technical CVs and job offers, in which a large number of English terms are used.
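
As an illustration, this type of comparison can be sketched with the sentence-transformers library (the model name below is an assumption for the example, not necessarily the one used by Sense4Matching, so the scores it returns may differ from those listed below):

    # Minimal sketch: semantic similarity between two skills with S-BERT (assumed model).
    from sentence_transformers import SentenceTransformer, util

    # A multilingual model trained with Siamese networks on paraphrase data.
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    skill_a = "Accompaniment of the construction works"
    skill_b = "Construction supervision"

    embeddings = model.encode([skill_a, skill_b])
    score = util.cos_sim(embeddings[0], embeddings[1])
    print(float(score))  # close to 1 means the two skills are nearly synonymous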

Below are examples of results obtained by determining the semantic similarity between skills via S-BERT and cosine similarity:

A : Accompaniment of the construction works

B : Construction supervision

cosine (A, B) = 0.79

C : Generation of graphics

D: Drawing production

cosine (C, D) = 0.72

E : Defect correction

F : Anomaly correction

cosine (E, F) = 0.67

G : Natural language processing

H : Natural language processing

cosine (G, H) = 0.78

I : Project management

J : Projects management

cosine (I, J) = 0.97

Conclusion

There are many tasks for which semantic similarity can be used, for example to:

  • Improve the results of a search engine,
  • Build a recommendation system for press articles,
  • Help customer support to find answers to questions more quickly, especially through chatbots,
  • Improve a website/product by grouping customer comments on the same theme,
  • etc.

Thanks to its expertise in artificial intelligence, Sense4data is able to provide solutions for all these tasks, and also to address natural language processing problems in general. Indeed, the company can equip teams with tools to automate tasks that would otherwise be repetitive and costly in time and human resources, across various NLP themes such as information extraction, document classification, automatic text generation, named entity recognition, etc.

To learn more about the range of skills within Sense4data, go here.

References

Articles

[1] Tomas Mikolov et al. (2013), Efficient Estimation of Word Representations in Vector Space
[3] Andreas Zell (1994), Simulation Neuronaler Netze [Simulation of Neural Networks]
[6] Nils Reimers, Iryna Gurevych (2019), Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Links

[2] https://fr.wikipedia.org/wiki/Algorithme_t-SNE
[4] https://jalammar.github.io/illustrated-transformer/
[5] https://lesdieuxducode.com/blog/2019/4/bert-le-transformer-model-qui-sentraine-et-qui-represente
