January 9, 2026
Embedding Records with Heterogeneous Image Sets
Introduction
This project explores the application of multimodal embeddings for vehicle image classification and visualization. By leveraging Voyage AI's state-of-the-art embedding model, we transform groups of vehicle images into dense vector representations that capture semantic meaning. These high-dimensional embeddings are then projected into lower-dimensional spaces using t-SNE, enabling intuitive visual exploration of how different vehicle categories relate to one another in the learned embedding space.
The primary motivation behind this work is to understand how modern multimodal models perceive and cluster different types of vehicles. By visualizing the embedding space, we can observe whether semantically similar vehicles (such as buses and minibuses, or cars and taxis) naturally cluster together, and how distinct categories like helicopters or boats separate from ground-based transportation.
Dataset and Sampling Strategy
Each record in the dataset is a heterogeneous set of vehicle images: categories such as cars, taxis, buses, minibuses, motorcycles, bicycles, helicopters, and boats are each represented by a group of images rather than a single exemplar, and group sizes vary from record to record. This per-record grouping is what the title's "heterogeneous image sets" refers to: the embedding input for a record is the whole image set, not any one photo, so sampling operates at the record level rather than the image level.
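As a sketch of record-level sampling (the file names, category mapping, and `max_images` cap below are hypothetical stand-ins, not the project's actual configuration):

```python
import random

def sample_record(image_paths, max_images=6, seed=0):
    """Sample a heterogeneous image set for one record.

    Records may contain different numbers of images, so we take up to
    `max_images` paths per category without assuming a fixed set size.
    """
    rng = random.Random(seed)
    if len(image_paths) <= max_images:
        return list(image_paths)  # small sets are kept whole
    return rng.sample(image_paths, max_images)

# Hypothetical category -> image-path mapping; real paths would come from disk.
dataset = {
    "bus": [f"bus_{i}.jpg" for i in range(10)],
    "helicopter": [f"heli_{i}.jpg" for i in range(3)],
}
records = {cat: sample_record(paths) for cat, paths in dataset.items()}
```

Capping the per-record set size keeps large categories from contributing disproportionately more visual evidence than small ones.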
Embedding Generation
Each record's image set is passed to Voyage AI's multimodal embedding model as a single input, yielding one dense vector per record. Because the model accepts a sequence of images (optionally alongside text) as one input, it can aggregate visual information across the whole set; the findings below suggest this aggregation preserves category membership rather than washing it out.
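A minimal sketch of this step, assuming the `voyageai` Python client, its `multimodal_embed` endpoint, and the `voyage-multimodal-3` model (the exact client usage and the text-label prefix are assumptions, not confirmed details of this project). Only `build_inputs` runs without an API key; at call time the path strings would be replaced by loaded image objects:

```python
def build_inputs(records):
    """Each record becomes one multimodal input: a list holding an optional
    text label followed by the record's images. The model returns a single
    embedding per inner list, aggregating over the whole image set."""
    return [[f"vehicle category: {cat}", *imgs] for cat, imgs in records.items()]

def embed_records(records, model="voyage-multimodal-3"):
    # Requires VOYAGE_API_KEY in the environment; `imgs` entries should be
    # PIL.Image objects (not path strings) when actually calling the API.
    import voyageai
    client = voyageai.Client()
    result = client.multimodal_embed(inputs=build_inputs(records), model=model)
    return result.embeddings  # one vector per record

inputs = build_inputs({"bus": ["bus_0.jpg", "bus_1.jpg"], "boat": ["boat_0.jpg"]})
```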
Dimensionality Reduction with 3D t-SNE
The resulting high-dimensional embeddings are reduced to three dimensions with t-SNE, which preserves local neighborhood structure and therefore keeps semantically similar records close together in the projection. Three output dimensions, rather than the more common two, allow the interactive viewer described below to rotate the point cloud and depth-sort it during rendering.
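The reduction step can be sketched with scikit-learn's `TSNE`; the embedding matrix below is a random stand-in for the model's output, and the perplexity value is an assumed choice sized to a small number of records (perplexity must stay below the sample count):

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in embeddings: 40 records x 512 dims (real vectors come from the model).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(40, 512))

# n_components=3 yields the rotatable point cloud used by the viewer;
# a fixed random_state makes the layout reproducible between runs.
tsne = TSNE(n_components=3, perplexity=10, init="pca", random_state=0)
coords = tsne.fit_transform(embeddings)
```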
Other Technicalities
The visualization rendering employs a straightforward 3D projection approach. Points are first centered by subtracting the mean coordinates, then rotated around the X and Y axes based on user drag input. The rotation is implemented using standard rotation matrices applied sequentially. After rotation, points are projected onto the 2D canvas using orthographic projection, with the Z coordinate preserved for depth sorting.
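The centering, sequential rotation, and orthographic projection described above can be sketched in NumPy (the function name and angle conventions here are illustrative, not the renderer's actual code):

```python
import numpy as np

def project(points, angle_x, angle_y):
    """Center, rotate about X then Y, and orthographically project to 2D,
    keeping z so the caller can depth-sort before drawing."""
    p = points - points.mean(axis=0)                    # center the cloud
    cx, sx = np.cos(angle_x), np.sin(angle_x)
    cy, sy = np.cos(angle_y), np.sin(angle_y)
    rot_x = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    rot_y = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    p = p @ rot_x.T @ rot_y.T                           # sequential rotations
    order = np.argsort(p[:, 2])                         # far points drawn first
    return p[order, :2], p[order, 2]                    # 2D coords + depth
```

Orthographic projection here simply drops the z coordinate after rotation, which is why depth must be carried separately for the painter's-algorithm sort.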
The convex hull algorithm uses the Graham scan approach, which first finds the lowest point, sorts remaining points by polar angle relative to that anchor, then iteratively builds the hull by checking cross products to determine turn directions. The resulting hull points are then smoothed using Catmull-Rom spline interpolation, which passes through all control points while maintaining C1 continuity for visually pleasing curves.
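Both steps can be sketched compactly: a Graham scan for the hull, and a uniform Catmull-Rom evaluation for the smoothing pass (the renderer's actual parameterization and sampling density may differ):

```python
import math

def convex_hull(points):
    """Graham scan: anchor at the lowest point, sort the rest by polar
    angle around it, then keep only counter-clockwise turns (cross > 0)."""
    pts = sorted(set(map(tuple, points)), key=lambda p: (p[1], p[0]))
    if len(pts) <= 2:
        return pts
    anchor = pts[0]
    rest = sorted(
        pts[1:],
        key=lambda p: (math.atan2(p[1] - anchor[1], p[0] - anchor[0]),
                       (p[0] - anchor[0]) ** 2 + (p[1] - anchor[1]) ** 2),
    )
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    hull = [anchor]
    for p in rest:
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()  # non-left turn: last point is not on the hull
        hull.append(p)
    return hull

def catmull_rom(p0, p1, p2, p3, t):
    """Uniform Catmull-Rom point at parameter t in [0, 1]; the curve passes
    through p1 (t=0) and p2 (t=1), so hull vertices are interpolated exactly."""
    t2, t3 = t * t, t * t * t
    return tuple(
        0.5 * (2 * b + (-a + c) * t
               + (2 * a - 5 * b + 4 * c - d) * t2
               + (-a + 3 * b - 3 * c + d) * t3)
        for a, b, c, d in zip(p0, p1, p2, p3)
    )
```

Evaluating `catmull_rom` over sliding windows of four consecutive hull points (wrapping around the hull) yields the closed, smoothed boundary.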
The ellipse fitting alternative computes the 2D covariance matrix of projected points, extracts eigenvalues and eigenvectors to determine the ellipse orientation and axis lengths, then scales appropriately to encompass the cluster. This approach produces smoother, more regular boundaries that better represent the statistical spread of each cluster, though it may not tightly fit irregular cluster shapes.
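The covariance-based fit can be sketched as follows (the `n_std` scale factor is an assumed concrete choice for "scales appropriately"; the actual code may size the ellipse differently):

```python
import numpy as np

def fit_ellipse(points_2d, n_std=2.0):
    """Fit a cluster ellipse: eigenvectors of the 2D covariance matrix give
    the orientation, square roots of the eigenvalues give the axis scales."""
    pts = np.asarray(points_2d, dtype=float)
    center = pts.mean(axis=0)
    cov = np.cov(pts, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    axes = n_std * np.sqrt(eigvals)               # semi-axis lengths (minor, major)
    # Orientation of the major axis = angle of the largest-eigenvalue vector.
    angle = np.arctan2(eigvecs[1, 1], eigvecs[0, 1])
    return center, axes, angle
```

Because the ellipse is derived from second moments, an `n_std` of about 2 covers most of a roughly Gaussian cluster while staying insensitive to single outliers, unlike a convex hull.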
Findings
The 3D visualization reveals clear clustering behavior in the embedding space. Vehicle categories with similar visual and functional characteristics tend to cluster near each other. Ground-based passenger vehicles like cars, taxis, and buses form a relatively cohesive region, while more distinctive categories like helicopters and boats occupy separate areas of the space.
The multi-image input approach appears to produce stable embeddings that still reflect category membership, suggesting the model successfully aggregates visual information across multiple images. Some inter-cluster proximity is observed between semantically related categories, such as buses and minibuses, or motorcycles and bicycles, indicating the embedding space captures meaningful semantic relationships beyond simple visual similarity.
Summary
This project demonstrates a complete pipeline for multimodal embedding visualization, from raw images through semantic embeddings to interactive 3D exploration. The combination of Voyage AI's powerful embedding model, 3D t-SNE dimensionality reduction, and a custom web-based visualization provides an effective framework for understanding how neural networks organize visual concepts in their learned representation spaces. The self-contained interactive visualization makes the abstract embedding space tangible and explorable, enabling intuitive discovery of clustering patterns and semantic relationships between vehicle categories.