Deciphering the Dimensions of Embeddings: A Journey into Semantic Spaces

Anand
4 min read · Oct 17, 2023

Audience: Throughout my career, I’ve focused on improving the developer experience and sharing useful insights with technologists, whether they are developers or IT leaders. My mission has always been to make emerging technology accessible and understandable to those eager to explore it. With that goal in mind, the articles you’ll find here are not aimed at machine learning experts but at readers who are just beginning their journey with Large Language Models (LLMs) and vector databases. Join me as we explore these topics together, keeping them approachable and comprehensible for all levels of expertise!

In the world of deep learning and natural language processing, word embeddings have become a fundamental tool for understanding and representing language. These embeddings, often generated by complex neural network models, reside in high-dimensional spaces that capture intricate relationships between words and concepts. However, the question arises: What do these dimensions in the embeddings represent?

Distributed Representations

To begin with, it’s essential to understand that embeddings are distributed representations. Rather than having a clear, individual meaning for each dimension, the information about a word or concept is spread across all embedding dimensions. Each dimension doesn’t stand alone but contributes collectively to the representation.

High-dimensional Spaces
Embeddings typically exist in high-dimensional spaces, often with hundreds of dimensions. In these spaces, the relative positions and angles between vectors are more crucial than the specific values of each dimension. The relationships between words or data points are encoded by their relative positions within this multi-dimensional landscape.
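To make the idea of "relative positions and angles" concrete, here is a minimal sketch using made-up 4-dimensional vectors (the words and numbers are purely illustrative, not from any real model). Cosine similarity measures the angle between two vectors, drawing on every dimension at once rather than any single one:

```python
import numpy as np

# Hypothetical 4-dimensional embeddings, invented for illustration.
cat = np.array([0.8, 0.1, 0.3, 0.5])
kitten = np.array([0.7, 0.2, 0.4, 0.5])
car = np.array([0.1, 0.9, 0.8, 0.1])

def cosine_similarity(a, b):
    """Similarity depends on the angle between vectors, not on any one dimension."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(cat, kitten))  # close to 1: vectors point in a similar direction
print(cosine_similarity(cat, car))     # noticeably lower: different direction
```

Because "cat" and "kitten" point in roughly the same direction, their similarity is high even though no individual coordinate matches exactly; this is what "relationships are encoded by relative positions" means in practice.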

Emergent Properties

Vectors Representing Semantic and Syntactic Properties in 3D Space

While individual dimensions may not have direct, human-understandable meanings, certain directions in the embedding space can represent specific semantic or syntactic properties. For example, in a simplified 3D space, one dimension might represent “formality,” another “positivity,” and a third “activity level.”

Words like “Hello,” “Greetings,” “Smile,” “Run,” and “Walk” would have embeddings that differ along these dimensions. “Greetings” might be more formal than “Hello,” “Smile” might score highest on positivity, and “Run” might be more active than “Walk.”

Training Data and Task Influence
The nature of these dimensions is heavily influenced by the training data and the specific task for which the embedding model was trained. Embeddings trained on scientific texts may capture different nuances than those trained on general web text, leading to variations in the meaning of dimensions.

Non-Interpretability
It’s crucial to acknowledge that, in many cases, especially with deep neural networks, humans do not directly interpret the individual dimensions of embeddings. While these models can capture complex patterns and make accurate predictions, the internal representations they learn are often challenging for humans to decipher.

Dimensionality Reduction for Visualization
As described in detail in my post, to make embeddings more understandable, techniques like t-SNE or PCA are often employed to reduce their dimensionality to 2 or 3 dimensions for visualization purposes. However, this process sacrifices some of the original information in the high-dimensional space to provide a more intuitive view of relationships between words.
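As a rough illustration of the PCA side of this, here is a minimal sketch that projects some hypothetical high-dimensional embeddings down to 2D (the data is random placeholder numbers; a real workflow would start from actual model embeddings, and libraries like scikit-learn offer ready-made PCA and t-SNE implementations):

```python
import numpy as np

# Hypothetical 50-dimensional "embeddings" for five words (random placeholders).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 50))  # 5 words, 50 dimensions each

# Minimal PCA via SVD: center the data, then project onto the
# top-2 principal components (the directions of greatest variance).
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:2].T  # shape (5, 2): ready for a 2D scatter plot

print(reduced.shape)  # (5, 2)
```

The projection keeps the two directions along which the points vary most, which is exactly the information that is sacrificed least when collapsing hundreds of dimensions into a picture.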

Visualizing Word Embeddings in 3D Space with a known set of dimensions

Now, let’s take a practical step and visualize these concepts using Python code to create a 3D scatter plot representing word embeddings in a 3D space.

We’ll use a simplified example with the dimensions “Formality,” “Positivity,” and “Activity Level.”

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older Matplotlib

# Define the words and their (illustrative) embeddings
words = ["Hello", "Greetings", "Smile", "Run", "Walk"]
embeddings = {
    "Hello":     [0.5, 0.6, 0.3],
    "Greetings": [0.9, 0.6, 0.2],
    "Smile":     [0.2, 1.0, 0.4],
    "Run":       [0.2, 0.5, 0.9],
    "Walk":      [0.3, 0.5, 0.7],
}

# Extract the values for each dimension
x = [embeddings[word][0] for word in words]
y = [embeddings[word][1] for word in words]
z = [embeddings[word][2] for word in words]

# Create a 3D scatter plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Plot each point, labeled with its word for the legend
for i, word in enumerate(words):
    ax.scatter(x[i], y[i], z[i], label=word)

# Label the axes with our chosen dimensions
ax.set_xlabel('Formality')
ax.set_ylabel('Positivity')
ax.set_zlabel('Activity Level')

# Add a legend and show the plot
ax.legend()
plt.show()

This code creates a 3D scatter plot with points representing the word embeddings of “Hello,” “Greetings,” “Smile,” “Run,” and “Walk” in a 3D space defined by the dimensions “Formality,” “Positivity,” and “Activity Level.”

In summary, we’ve explored the fascinating world of distributed representations, high-dimensional spaces, and emergent properties in this look at word embeddings. While individual dimensions within embeddings may not have direct, human-understandable meanings, their collective power lies in capturing rich relationships and patterns in data.

Beyond words, it’s important to note that similar principles also apply to sentence embeddings. Just as word embeddings capture semantic and syntactic nuances, sentence embeddings encapsulate the essence of entire sentences, paragraphs, or documents. The behavior of sentence embeddings aligns with the insights gained from word embeddings and becomes even more relevant in the context of Large Language Models (LLMs) and Generative AI (Gen AI).

In these advanced AI spaces, understanding and representing not only individual words but entire sentences is pivotal. Sentence embeddings play a crucial role in text summarization, sentiment analysis, and language generation tasks. They enable LLMs and Gen AI to comprehend context, context shifts, and intricate relationships within language, allowing for more sophisticated and context-aware responses.
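One simple way to see the connection between word and sentence embeddings is mean pooling: averaging the vectors of a sentence’s words into a single fixed-size vector. This is a common baseline, not how modern LLMs actually build sentence embeddings, and the tiny 3D vectors below are invented for illustration:

```python
import numpy as np

# Toy 3-dimensional word vectors (hypothetical values; real models use hundreds of dims).
word_vectors = {
    "hello": np.array([0.5, 0.6, 0.3]),
    "world": np.array([0.4, 0.7, 0.2]),
}

def sentence_embedding(sentence):
    """Mean-pool the vectors of known words into one sentence vector."""
    vectors = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vectors, axis=0)

vec = sentence_embedding("Hello world")
print(vec)  # [0.45 0.65 0.25]
```

The resulting vector lives in the same space as the word vectors, so the same distance and angle intuitions carry over from words to whole sentences.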

As we continue our journey into the world of embeddings, it becomes evident that these representations are the backbone of AI’s ability to comprehend and generate human language, making them indispensable tools in the evolving landscape of artificial intelligence.
