Vector Embeddings for E-commerce Search: From Text to Semantic Understanding
Traditional e-commerce search has long relied on keyword matching and basic text indexing, often failing to capture the true intent behind user queries. Our project to enhance ikman.lk's search capabilities demonstrates how vector embeddings can transform search from simple text matching to true semantic understanding, dramatically improving the user experience.
The Limitations of Traditional Search
Before diving into vector embeddings, it's important to understand the limitations of traditional search approaches:
- Keyword Dependence: Exact keyword matching misses synonyms and related concepts
- Language Sensitivity: Poor handling of misspellings, plurals, and language variations
- Context Blindness: Inability to understand the semantic meaning behind queries
- Ranking Challenges: Difficulty in determining truly relevant results beyond keyword frequency
For a marketplace like ikman.lk with diverse listings across multiple categories and languages, these limitations significantly impact user experience.
Vector Embeddings: The Foundation of Semantic Search
Vector embeddings address these limitations by converting text into numerical representations that capture semantic meaning:
What Are Vector Embeddings?
Vector embeddings are high-dimensional numerical representations of text where:
- Each word or phrase is represented as a vector (typically 300-1000 dimensions)
- Semantically similar concepts have vectors that are close in the vector space
- The relationships between concepts are preserved in the vector space
- The "meaning" of text is encoded in a way machines can process efficiently
How Embeddings Enable Semantic Search
With vector embeddings, search becomes a matter of finding vectors that are "close" to the query vector, as sketched after the list below:
- Query Understanding: Convert user query to a vector embedding
- Similarity Matching: Find product/listing vectors with high similarity (cosine similarity or Euclidean distance)
- Ranking: Order results by similarity score and other relevance factors
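As an illustration of this flow, the minimal sketch below embeds a query and ranks a handful of listing texts by cosine similarity. The model name (paraphrase-multilingual-MiniLM-L12-v2) and the sample listings are placeholders chosen for the example, not our production configuration.

import numpy as np
from sentence_transformers import SentenceTransformer, util

# Illustrative model; any sentence-embedding model works the same way here
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# A tiny "index" of listing texts (placeholder data)
listings = [
    "iPhone 13 Pro 256GB, excellent condition",
    "Samsung Galaxy S22 for sale, almost new",
    "Three-seater fabric sofa, light grey",
]
listing_vectors = model.encode(listings)            # shape: (3, dim)

# 1. Query understanding: convert the user query to a vector
query_vector = model.encode("second hand mobile phone")

# 2. Similarity matching: cosine similarity against every listing vector
scores = util.cos_sim(query_vector, listing_vectors)[0].numpy()

# 3. Ranking: order listings by similarity score
for idx in np.argsort(-scores):
    print(f"{scores[idx]:.3f}  {listings[idx]}")

Both phone listings should rank above the sofa even though the query shares no exact keywords with them, which is precisely the behaviour keyword search misses.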
Implementing Vector Embeddings for ikman.lk
Our implementation for ikman.lk involved several key technical decisions:
1. Embedding Model Selection
After evaluating several options, we selected a model based on:
- Multilingual Support: Handling both English and Sinhala text
- Domain Relevance: Performance on e-commerce and classified listing content
- Dimensionality: Balance between semantic richness and computational efficiency
- Inference Speed: Fast enough for real-time embedding generation
2. Text Preparation Pipeline
Before generating embeddings, we implemented a text preparation pipeline:
import re
import unicodedata

def prepare_text_for_embedding(listing_data):
    # Combine relevant fields with appropriate weighting
    title = listing_data.get('title', '').strip()
    description = listing_data.get('description', '').strip()
    category = listing_data.get('category', '').strip()
    location = listing_data.get('location', '').strip()

    # Title gets higher weight by repetition
    combined_text = f"{title} {title} {description} {category} {location}"

    # Basic text cleaning
    combined_text = combined_text.lower()

    # Remove special characters that don't add semantic value
    combined_text = re.sub(r'[^\w\s]', ' ', combined_text)

    # Collapse excessive whitespace introduced by the cleaning steps
    combined_text = ' '.join(combined_text.split())

    # Normalize Unicode (especially important for multilingual content)
    combined_text = unicodedata.normalize('NFKD', combined_text)

    return combined_text
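For example, a listing like the following (illustrative field values) is flattened into a single weighted text string before being embedded:

listing_data = {
    "title": "iPhone 13 Pro 256GB",
    "description": "Used for one year, battery health 89%, box included.",
    "category": "Mobile Phones",
    "location": "Colombo",
}

text = prepare_text_for_embedding(listing_data)
# -> "iphone 13 pro 256gb iphone 13 pro 256gb used for one year battery health 89 box included mobile phones colombo"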
3. Embedding Generation Service
We implemented a dedicated microservice for embedding generation:
- API-Based Approach: Separate service to avoid Lambda size/memory constraints
- Model Loading: Pre-loaded model for faster inference
- Batching: Support for processing multiple texts in a single request
- Caching: Temporary caching of results for identical texts
from sentence_transformers import SentenceTransformer
import torch
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load the model once at startup (intfloat/multilingual-e5-large on the Hugging Face Hub)
model = SentenceTransformer('intfloat/multilingual-e5-large')
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)

@app.route('/embed', methods=['POST'])
def generate_embedding():
    data = request.json
    if not data or 'text' not in data:
        return jsonify({'error': 'No text provided'}), 400
    text = data['text']
    try:
        # Generate the embedding without tracking gradients
        with torch.no_grad():
            embedding = model.encode(text, convert_to_tensor=True)
            embedding = embedding.cpu().numpy().tolist()
        return jsonify({'embedding': embedding})
    except Exception as e:
        return jsonify({'error': str(e)}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
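The indexing pipeline then calls this service over HTTP. A minimal client sketch is shown below; the host name is a placeholder, and listing_data and prepare_text_for_embedding refer to the earlier example.

import requests

# Placeholder host; in production this resolves to the embedding microservice
EMBEDDING_SERVICE_URL = "http://embedding-service:5000/embed"

def get_embedding(text):
    # POST the prepared listing text to the /embed endpoint defined above
    response = requests.post(EMBEDDING_SERVICE_URL, json={"text": text}, timeout=10)
    response.raise_for_status()
    return response.json()["embedding"]

embedding = get_embedding(prepare_text_for_embedding(listing_data))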
4. Vector Storage in Pinecone
Pinecone was selected as our vector database for several reasons; a short usage sketch follows the list below:
- Scalability: Handling millions of vectors with consistent performance
- Query Speed: Fast similarity search even with large vector collections
- Metadata Filtering: Combining vector similarity with metadata filters
- Managed Service: Reduced operational overhead
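The sketch below shows the general shape of upserting a listing vector and querying it back with the Pinecone Python client; the index name, metadata fields, and filter values are illustrative rather than our exact schema, and embedding / query_embedding are assumed to come from the embedding service above.

from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")              # placeholder credentials
index = pc.Index("ikman-listings")                 # illustrative index name

# Store a listing vector together with metadata used for filtering
index.upsert(vectors=[{
    "id": "listing-12345",
    "values": embedding,
    "metadata": {"category": "mobiles", "location": "Colombo", "posted_at": 1718000000},
}])

# Query: combine vector similarity with a metadata filter on category
results = index.query(
    vector=query_embedding,
    top_k=20,
    filter={"category": {"$eq": "mobiles"}},
    include_metadata=True,
)
for match in results.matches:
    print(match.id, match.score, match.metadata)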
5. Hybrid Search Implementation
To maximize search quality, we implemented a hybrid approach, sketched after the list below:
- Vector Search: Finding semantically similar listings
- Keyword Boosting: Giving additional weight to exact keyword matches
- Category Filtering: Using metadata to narrow results to relevant categories
- Recency Boosting: Giving preference to newer listings
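The function below is a simplified sketch of how these signals can be folded into one ranking score. The weights, the title-overlap keyword boost, and the 30-day recency decay are illustrative values rather than the tuned production configuration, and candidates is assumed to be the list of matches returned by the vector search, each carrying its similarity score, title, and posting timestamp.

import math
import time

def hybrid_score(match, query_terms, now=None,
                 w_vector=1.0, w_keyword=0.3, w_recency=0.2):
    # Illustrative weights; production values were tuned empirically
    now = now or time.time()

    # Base signal: cosine similarity returned by the vector search
    score = w_vector * match["similarity"]

    # Keyword boost: reward exact query-term matches in the title
    title_terms = set(match["title"].lower().split())
    overlap = len(query_terms & title_terms) / max(len(query_terms), 1)
    score += w_keyword * overlap

    # Recency boost: exponential decay over listing age in days
    age_days = (now - match["posted_at"]) / 86400
    score += w_recency * math.exp(-age_days / 30)

    return score

# Re-rank the candidates returned by the vector search (category filtering
# is already applied at query time via metadata filters)
query_terms = set("second hand mobile phone".split())
ranked = sorted(candidates, key=lambda m: hybrid_score(m, query_terms), reverse=True)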
Technical Challenges in Vector Search Implementation
1. Handling Multilingual Content
Challenge: ikman.lk contains listings in English, Sinhala, and Tamil, often mixed within the same listing.
Solution: We used a multilingual embedding model that covers all three languages and implemented language detection to apply language-specific preprocessing when needed.
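One simple way to drive that language-specific preprocessing is to classify each listing by Unicode script. The heuristic below (Sinhala block U+0D80-U+0DFF, Tamil block U+0B80-U+0BFF) is a simplified stand-in for the detection logic we used; combined_text is the output of the preparation pipeline shown earlier.

def detect_primary_language(text):
    # Count characters per script; mixed listings get the dominant script
    counts = {"si": 0, "ta": 0, "en": 0}
    for ch in text:
        code = ord(ch)
        if 0x0D80 <= code <= 0x0DFF:          # Sinhala block
            counts["si"] += 1
        elif 0x0B80 <= code <= 0x0BFF:        # Tamil block
            counts["ta"] += 1
        elif ch.isascii() and ch.isalpha():   # Latin letters, treated as English
            counts["en"] += 1
    return max(counts, key=counts.get)

# Example usage on the combined listing text from the preparation pipeline
language = detect_primary_language(combined_text)   # 'si', 'ta', or 'en'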
2. Dimensionality and Performance
Challenge: Higher-dimensional embeddings capture more semantic information but increase storage and computation costs.
Solution: We evaluated different dimensionality reduction techniques and settled on 384-dimensional embeddings as the optimal balance between semantic richness and performance.
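As an example of the kind of reduction evaluated, the sketch below fits a PCA projection from the model's native dimensionality down to 384 dimensions with scikit-learn. PCA is shown purely for illustration as one of the techniques compared; listing_embeddings and query_embedding are assumed to be NumPy arrays of full-size vectors from the embedding service.

import numpy as np
from sklearn.decomposition import PCA

# Fit the projection on a representative sample of full-size listing embeddings
# (PCA needs at least as many samples as target dimensions)
pca = PCA(n_components=384)
reduced_listings = pca.fit_transform(listing_embeddings)

# Queries must be projected with the same fitted transform before searching
reduced_query = pca.transform(query_embedding.reshape(1, -1))

# Re-normalize so cosine similarity stays meaningful after the projection
reduced_listings /= np.linalg.norm(reduced_listings, axis=1, keepdims=True)
reduced_query /= np.linalg.norm(reduced_query, axis=1, keepdims=True)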