Audience: I’ve prioritized enhancing the developer experience at every turn throughout my career. My goal has always been to clearly convey what technologists, whether developers or IT leaders, need to grasp about emerging technology sectors. This understanding empowers them to dive deeper and fully harness these tools. With this in mind, the articles you’ll find here are tailored not for machine learning experts but for those eager to begin their journey with Large Language Models and Vector Databases. Dive in, and let’s explore together!
In today’s data-driven world, understanding and visualizing high-dimensional data is crucial. One area where this is particularly relevant is in the realm of vector embeddings. In this post, we’ll journey through the intricacies of vector embeddings, their significance in search systems, and the art of visualizing them.
Note: All the sample embedding values are hypothetical for illustration purposes only.
1. Vector Embeddings: A Primer
Vector embeddings are mathematical representations of objects, such as words or sentences, as points in a high-dimensional space. These embeddings capture semantic relationships, meaning similar objects sit closer together in the vector space. For instance, in a word embedding space, the words “king” and “queen” might be closer to each other than “king” and “apple”.
Example: Imagine representing words in a 3D space with axes: Formality, Positivity, and Activity Level. The word “Hello” might have coordinates [0.5, 0.6, 0.3], while “Greetings” might be [0.9, 0.6, 0.2]. These coordinates, or embeddings, capture the essence of each word in relation to the defined axes.
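In code, these made-up coordinates are just small numeric arrays. Here is a minimal sketch; the axes and values are purely hypothetical, and real embedding models typically produce hundreds or thousands of dimensions:
import numpy as np
# Hypothetical 3D embeddings on the axes [Formality, Positivity, Activity Level]
embeddings = {
    "Hello": np.array([0.5, 0.6, 0.3]),
    "Greetings": np.array([0.9, 0.6, 0.2]),
}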
2. Similarity and Semantic Search
Vector embeddings power similarity and semantic search. While traditional databases use exact matching, vector databases (vectorDBs) enable searching based on semantic similarity. For instance, querying “royal lady” might return documents containing “queen”.
Example: Consider the following product reviews and their hypothetical embeddings:
- “Hello, I loved the product!” -> [0.5, 0.8, 0.3]
- “Greetings. The product was excellent.” -> [0.9, 0.9, 0.2]
- “Made me smile. Great purchase!” -> [0.2, 1.0, 0.4]
A query “Hello, it’s a good product for jogging” with an embedding [0.5, 0.7, 0.8] would find the first review most similar based on the embeddings’ proximity.
Let’s see how a plot of the example above looks.
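Since the embedding values are hypothetical, such a plot is easy to reproduce. The minimal matplotlib sketch below scatters the three review embeddings and the query in 3D, assuming the same illustrative Formality, Positivity, and Activity Level axes from section 1:
import matplotlib.pyplot as plt
# Hypothetical 3D embeddings from the example above
reviews = {
    "Hello, I loved the product!": [0.5, 0.8, 0.3],
    "Greetings. The product was excellent.": [0.9, 0.9, 0.2],
    "Made me smile. Great purchase!": [0.2, 1.0, 0.4],
}
query = [0.5, 0.7, 0.8]  # "Hello, it's a good product for jogging"
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
for text, (x, y, z) in reviews.items():
    ax.scatter(x, y, z, label=text)
ax.scatter(*query, marker="x", s=80, label="Query")
ax.set_xlabel("Formality")
ax.set_ylabel("Positivity")
ax.set_zlabel("Activity Level")
ax.legend(fontsize="small")
plt.show()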
3. Measuring Similarity: Cosine and Euclidean
Two popular measures to gauge similarity in vector spaces are cosine similarity and Euclidean distance. Cosine similarity measures the cosine of the angle between two vectors, focusing on their orientation. Euclidean distance, on the other hand, gauges the “straight line” distance between two points.
import numpy as np
# Cosine similarity: cosine of the angle between vectors A and B (focuses on orientation)
def cosine_similarity(A, B):
    return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
# Euclidean distance: "straight line" distance between points A and B
def euclidean_distance(A, B):
    return np.sqrt(np.sum((np.asarray(A) - np.asarray(B)) ** 2))
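As a quick sanity check, we can apply these helpers to the hypothetical query and first review from section 2 (the same made-up embeddings as above):
A = [0.5, 0.7, 0.8]              # query: "Hello, it's a good product for jogging"
B = [0.5, 0.8, 0.3]              # review: "Hello, I loved the product!"
print(cosine_similarity(A, B))   # ~0.90: the vectors point in a similar direction
print(euclidean_distance(A, B))  # ~0.51: the closest of the three reviews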
4. Visualizing Embeddings
While embeddings often reside in high-dimensional spaces, visualizing them in 2D or 3D can offer insights. Techniques like Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) help reduce dimensionality.
Example with the Iris Dataset:
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Load and normalize data
iris = load_iris()
scaler = StandardScaler()
scaled_data = scaler.fit_transform(iris.data)
# Apply PCA
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(scaled_data)
# Visualization
plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=iris.target, cmap='viridis', edgecolor='k')
plt.title("PCA Visualization of Iris Dataset")
plt.show()
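The same view can be produced with t-SNE, mentioned above. Here is a minimal sketch using scikit-learn’s TSNE, reusing scaled_data and iris from the snippet above; the perplexity value is a common default, not tuned for this dataset:
from sklearn.manifold import TSNE
# t-SNE is non-linear, so the resulting axes have no direct interpretation
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
tsne_data = tsne.fit_transform(scaled_data)
plt.scatter(tsne_data[:, 0], tsne_data[:, 1], c=iris.target, cmap='viridis', edgecolor='k')
plt.title("t-SNE Visualization of Iris Dataset")
plt.show()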
5. The Importance of Normalization
Before applying dimensionality reduction, it’s crucial to normalize the data. This matters especially when features are on different scales: normalization ensures each feature carries equal weight, leading to more balanced and interpretable visualizations.
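To make this concrete, here is a small sketch on the same Iris data showing what StandardScaler does: it brings every feature to roughly zero mean and unit variance, so no single feature dominates the PCA step:
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
iris = load_iris()
print(iris.data.std(axis=0))         # raw spreads differ, e.g. petal length varies ~4x more than sepal width
scaled = StandardScaler().fit_transform(iris.data)
print(scaled.mean(axis=0).round(3))  # ~0 for every feature
print(scaled.std(axis=0).round(3))   # 1.0 for every feature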
Conclusion
Vector embeddings are revolutionizing the way we handle and interpret data. Whether you’re diving into natural language processing, building recommendation systems, or simply trying to visualize complex datasets, understanding the world of embeddings is invaluable.
Note: Link to Notebook for all the plots above.