January 9, 2026
Embedding Records with Heterogeneous Image Sets
Introduction
This project explores the application of multimodal embeddings for vehicle image classification and visualization. By leveraging Voyage AI's state-of-the-art embedding model, we transform groups of vehicle images into dense vector representations that capture semantic meaning. These high-dimensional embeddings are then projected into lower-dimensional spaces using t-SNE, enabling intuitive visual exploration of how different vehicle categories relate to one another in the learned embedding space.
The primary motivation behind this work is to understand how modern multimodal models perceive and cluster different types of vehicles. By visualizing the embedding space, we can observe whether semantically similar vehicles (such as buses and minibuses, or cars and taxis) naturally cluster together, and how distinct categories like helicopters or boats separate from ground-based transportation.
Dataset and Sampling Strategy
Each record in the dataset is a heterogeneous set of vehicle images: categories such as cars, taxis, buses, minibuses, motorcycles, bicycles, helicopters, and boats are each represented by a group of images rather than a single exemplar, and group sizes vary from record to record. This per-record grouping is what the title's "heterogeneous image sets" refers to: the embedding input for a record is the whole image set, not any one photo, so sampling operates at the record level rather than the image level.
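As a sketch of record-level sampling (the file names, category mapping, and `max_images` cap below are hypothetical stand-ins, not the project's actual configuration):

```python
import random

def sample_record(image_paths, max_images=6, seed=0):
    """Sample a heterogeneous image set for one record.

    Records may contain different numbers of images, so we take up to
    `max_images` paths per category without assuming a fixed set size.
    """
    rng = random.Random(seed)
    if len(image_paths) <= max_images:
        return list(image_paths)  # small sets are kept whole
    return rng.sample(image_paths, max_images)

# Hypothetical category -> image-path mapping; real paths would come from disk.
dataset = {
    "bus": [f"bus_{i}.jpg" for i in range(10)],
    "helicopter": [f"heli_{i}.jpg" for i in range(3)],
}
records = {cat: sample_record(paths) for cat, paths in dataset.items()}
```

Capping the per-record set size keeps large categories from contributing disproportionately more visual evidence than small ones.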
Embedding Generation
Each record's image set is passed to Voyage AI's multimodal embedding model as a single input, yielding one dense vector per record. Because the model accepts a sequence of images (optionally alongside text) as one input, it can aggregate visual information across the whole set; the findings below suggest this aggregation preserves category membership rather than washing it out.
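A minimal sketch of this step, assuming the `voyageai` Python client, its `multimodal_embed` endpoint, and the `voyage-multimodal-3` model (the exact client usage and the text-label prefix are assumptions, not confirmed details of this project). Only `build_inputs` runs without an API key; at call time the path strings would be replaced by loaded image objects:

```python
def build_inputs(records):
    """Each record becomes one multimodal input: a list holding an optional
    text label followed by the record's images. The model returns a single
    embedding per inner list, aggregating over the whole image set."""
    return [[f"vehicle category: {cat}", *imgs] for cat, imgs in records.items()]

def embed_records(records, model="voyage-multimodal-3"):
    # Requires VOYAGE_API_KEY in the environment; `imgs` entries should be
    # PIL.Image objects (not path strings) when actually calling the API.
    import voyageai
    client = voyageai.Client()
    result = client.multimodal_embed(inputs=build_inputs(records), model=model)
    return result.embeddings  # one vector per record

inputs = build_inputs({"bus": ["bus_0.jpg", "bus_1.jpg"], "boat": ["boat_0.jpg"]})
```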
Dimensionality Reduction with 3D t-SNE
The resulting high-dimensional embeddings are reduced to three dimensions with t-SNE, which preserves local neighborhood structure and therefore keeps semantically similar records close together in the projection. Three output dimensions, rather than the more common two, allow the interactive viewer described below to rotate the point cloud and depth-sort it during rendering.
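The reduction step can be sketched with scikit-learn's `TSNE`; the embedding matrix below is a random stand-in for the model's output, and the perplexity value is an assumed choice sized to a small number of records (perplexity must stay below the sample count):

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in embeddings: 40 records x 512 dims (real vectors come from the model).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(40, 512))

# n_components=3 yields the rotatable point cloud used by the viewer;
# a fixed random_state makes the layout reproducible between runs.
tsne = TSNE(n_components=3, perplexity=10, init="pca", random_state=0)
coords = tsne.fit_transform(embeddings)
```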
Other Technicalities
The visualization rendering employs a straightforward 3D projection approach. Points are first centered by subtracting the mean coordinates, then rotated around the X and Y axes based on user drag input. The rotation is implemented using standard rotation matrices applied sequentially. After rotation, points are projected onto the 2D canvas using orthographic projection, with the Z coordinate preserved for depth sorting.
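The centering, sequential rotation, and orthographic projection described above can be sketched in NumPy (the function name and angle conventions here are illustrative, not the renderer's actual code):

```python
import numpy as np

def project(points, angle_x, angle_y):
    """Center, rotate about X then Y, and orthographically project to 2D,
    keeping z so the caller can depth-sort before drawing."""
    p = points - points.mean(axis=0)                    # center the cloud
    cx, sx = np.cos(angle_x), np.sin(angle_x)
    cy, sy = np.cos(angle_y), np.sin(angle_y)
    rot_x = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    rot_y = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    p = p @ rot_x.T @ rot_y.T                           # sequential rotations
    order = np.argsort(p[:, 2])                         # far points drawn first
    return p[order, :2], p[order, 2]                    # 2D coords + depth
```

Orthographic projection here simply drops the z coordinate after rotation, which is why depth must be carried separately for the painter's-algorithm sort.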
The convex hull algorithm uses the Graham scan approach, which first finds the lowest point, sorts remaining points by polar angle relative to that anchor, then iteratively builds the hull by checking cross products to determine turn directions. The resulting hull points are then smoothed using Catmull-Rom spline interpolation, which passes through all control points while maintaining C1 continuity for visually pleasing curves.
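Both steps can be sketched compactly: a Graham scan for the hull, and a uniform Catmull-Rom evaluation for the smoothing pass (the renderer's actual parameterization and sampling density may differ):

```python
import math

def convex_hull(points):
    """Graham scan: anchor at the lowest point, sort the rest by polar
    angle around it, then keep only counter-clockwise turns (cross > 0)."""
    pts = sorted(set(map(tuple, points)), key=lambda p: (p[1], p[0]))
    if len(pts) <= 2:
        return pts
    anchor = pts[0]
    rest = sorted(
        pts[1:],
        key=lambda p: (math.atan2(p[1] - anchor[1], p[0] - anchor[0]),
                       (p[0] - anchor[0]) ** 2 + (p[1] - anchor[1]) ** 2),
    )
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    hull = [anchor]
    for p in rest:
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()  # non-left turn: last point is not on the hull
        hull.append(p)
    return hull

def catmull_rom(p0, p1, p2, p3, t):
    """Uniform Catmull-Rom point at parameter t in [0, 1]; the curve passes
    through p1 (t=0) and p2 (t=1), so hull vertices are interpolated exactly."""
    t2, t3 = t * t, t * t * t
    return tuple(
        0.5 * (2 * b + (-a + c) * t
               + (2 * a - 5 * b + 4 * c - d) * t2
               + (-a + 3 * b - 3 * c + d) * t3)
        for a, b, c, d in zip(p0, p1, p2, p3)
    )
```

Evaluating `catmull_rom` over sliding windows of four consecutive hull points (wrapping around the hull) yields the closed, smoothed boundary.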
The ellipse fitting alternative computes the 2D covariance matrix of projected points, extracts eigenvalues and eigenvectors to determine the ellipse orientation and axis lengths, then scales appropriately to encompass the cluster. This approach produces smoother, more regular boundaries that better represent the statistical spread of each cluster, though it may not tightly fit irregular cluster shapes.
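The covariance-based fit can be sketched as follows (the `n_std` scale factor is an assumed concrete choice for "scales appropriately"; the actual code may size the ellipse differently):

```python
import numpy as np

def fit_ellipse(points_2d, n_std=2.0):
    """Fit a cluster ellipse: eigenvectors of the 2D covariance matrix give
    the orientation, square roots of the eigenvalues give the axis scales."""
    pts = np.asarray(points_2d, dtype=float)
    center = pts.mean(axis=0)
    cov = np.cov(pts, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    axes = n_std * np.sqrt(eigvals)               # semi-axis lengths (minor, major)
    # Orientation of the major axis = angle of the largest-eigenvalue vector.
    angle = np.arctan2(eigvecs[1, 1], eigvecs[0, 1])
    return center, axes, angle
```

Because the ellipse is derived from second moments, an `n_std` of about 2 covers most of a roughly Gaussian cluster while staying insensitive to single outliers, unlike a convex hull.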
Findings
The 3D visualization reveals clear clustering behavior in the embedding space. Vehicle categories with similar visual and functional characteristics tend to cluster near each other. Ground-based passenger vehicles like cars, taxis, and buses form a relatively cohesive region, while more distinctive categories like helicopters and boats occupy separate areas of the space.
The multi-image input approach appears to produce stable embeddings that still reflect category membership, suggesting the model successfully aggregates visual information across multiple images. Some inter-cluster proximity is observed between semantically related categories, such as buses and minibuses, or motorcycles and bicycles, indicating the embedding space captures meaningful semantic relationships beyond simple visual similarity.
Summary
This project demonstrates a complete pipeline for multimodal embedding visualization, from raw images through semantic embeddings to interactive 3D exploration. The combination of Voyage AI's powerful embedding model, 3D t-SNE dimensionality reduction, and a custom web-based visualization provides an effective framework for understanding how neural networks organize visual concepts in their learned representation spaces. The self-contained interactive visualization makes the abstract embedding space tangible and explorable, enabling intuitive discovery of clustering patterns and semantic relationships between vehicle categories.