Vector Embeddings for E-commerce Search: From Text to Semantic Understanding

August 10, 2024
5 min read

Traditional e-commerce search has long relied on keyword matching and basic text indexing, often failing to capture the true intent behind user queries. Our project to enhance ikman.lk's search capabilities demonstrates how vector embeddings can transform search from simple text matching to true semantic understanding, dramatically improving the user experience.

The Limitations of Traditional Search

Before diving into vector embeddings, it's important to understand the limitations of traditional search approaches:

  • Keyword Dependence: Exact keyword matching misses synonyms and related concepts
  • Language Sensitivity: Poor handling of misspellings, plurals, and language variations
  • Context Blindness: Inability to understand the semantic meaning behind queries
  • Ranking Challenges: Difficulty in determining truly relevant results beyond keyword frequency

For a marketplace like ikman.lk with diverse listings across multiple categories and languages, these limitations significantly impact user experience.

Vector Embeddings: The Foundation of Semantic Search

Vector embeddings address these limitations by converting text into numerical representations that capture semantic meaning:

What Are Vector Embeddings?

Vector embeddings are high-dimensional numerical representations of text where:

  • Each word or phrase is represented as a vector (typically 300-1000 dimensions)
  • Semantically similar concepts have vectors that are close in the vector space
  • The relationships between concepts are preserved in the vector space
  • The "meaning" of text is encoded in a way machines can process efficiently

How Embeddings Enable Semantic Search

With vector embeddings, search becomes a matter of finding vectors that are "close" to the query vector:

  • Query Understanding: Convert user query to a vector embedding
  • Similarity Matching: Find product/listing vectors with high similarity (cosine similarity or Euclidean distance)
  • Ranking: Order results by similarity score and other relevance factors
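
A minimal sketch of that flow, assuming the listing embeddings are already in memory (a real deployment would query a vector database instead, and the query vector would come from the embedding model):

import numpy as np

# Hypothetical in-memory index: listing_id -> embedding
listing_vectors = {
    "listing-1": np.array([0.82, 0.11, 0.35, 0.05]),
    "listing-2": np.array([0.15, 0.88, 0.02, 0.44]),
    "listing-3": np.array([0.78, 0.20, 0.41, 0.09]),
}

def semantic_search(query_vector, top_k=2):
    # Score every listing by cosine similarity to the query, then rank by score
    scores = {}
    for listing_id, vec in listing_vectors.items():
        scores[listing_id] = float(
            np.dot(query_vector, vec) / (np.linalg.norm(query_vector) * np.linalg.norm(vec))
        )
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]

# In production the query would be embedded by the model; here we fake a 4-d query vector
query_vector = np.array([0.80, 0.15, 0.38, 0.07])
print(semantic_search(query_vector))  # listings 1 and 3 rank highest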

Implementing Vector Embeddings for ikman.lk

Our implementation for ikman.lk involved several key technical decisions:

1. Embedding Model Selection

After evaluating several options, we selected a model based on:

  • Multilingual Support: Handling English, Sinhala, and Tamil text
  • Domain Relevance: Performance on e-commerce and classified listing content
  • Dimensionality: Balance between semantic richness and computational efficiency
  • Inference Speed: Fast enough for real-time embedding generation

2. Text Preparation Pipeline

Before generating embeddings, we implemented a text preparation pipeline:

import re
import unicodedata

def prepare_text_for_embedding(listing_data):
    # Combine relevant fields with appropriate weighting
    title = listing_data.get('title', '').strip()
    description = listing_data.get('description', '').strip()
    category = listing_data.get('category', '').strip()
    location = listing_data.get('location', '').strip()

    # Title gets higher weight by repetition
    combined_text = f"{title} {title} {description} {category} {location}"

    # Basic text cleaning
    combined_text = combined_text.lower()

    # Normalize Unicode (especially important for multilingual content)
    combined_text = unicodedata.normalize('NFKD', combined_text)

    # Remove special characters that don't add semantic value
    combined_text = re.sub(r'[^\w\s]', ' ', combined_text)

    # Collapse excessive whitespace
    combined_text = ' '.join(combined_text.split())

    return combined_text

3. Embedding Generation Service

We implemented a dedicated microservice for embedding generation:

  • API-Based Approach: Separate service to avoid Lambda size/memory constraints
  • Model Loading: Pre-loaded model for faster inference
  • Batching: Support for processing multiple texts in a single request
  • Caching: Temporary caching of results for identical texts

from sentence_transformers import SentenceTransformer
import torch
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load model once at startup
model = SentenceTransformer('intfloat/multilingual-e5-large')
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)

@app.route('/embed', methods=['POST'])
def generate_embedding():
    data = request.json

    if not data or 'text' not in data:
        return jsonify({'error': 'No text provided'}), 400

    text = data['text']

    try:
        # Generate embedding
        with torch.no_grad():
            embedding = model.encode(text, convert_to_tensor=True)
            embedding = embedding.cpu().numpy().tolist()

        return jsonify({'embedding': embedding})
    except Exception as e:
        return jsonify({'error': str(e)}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
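
The service above handles a single text per request. The batching and caching mentioned earlier could look roughly like the following sketch, which assumes the same app and model objects; the /embed_batch route, hashing scheme, and cache size are illustrative assumptions rather than the exact production code:

import hashlib
import torch

_embedding_cache = {}           # illustrative in-memory cache; production might use Redis with a TTL
_CACHE_MAX_ENTRIES = 10_000

def _cache_key(text):
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

@app.route('/embed_batch', methods=['POST'])
def generate_embeddings_batch():
    data = request.json
    if not data or 'texts' not in data:
        return jsonify({'error': 'No texts provided'}), 400

    texts = data['texts']
    results = [None] * len(texts)

    # Serve cache hits, collect misses for a single batched model call
    misses, miss_positions = [], []
    for i, text in enumerate(texts):
        cached = _embedding_cache.get(_cache_key(text))
        if cached is not None:
            results[i] = cached
        else:
            misses.append(text)
            miss_positions.append(i)

    if misses:
        with torch.no_grad():
            embeddings = model.encode(misses, convert_to_tensor=True)
        for position, text, embedding in zip(miss_positions, misses, embeddings):
            vector = embedding.cpu().numpy().tolist()
            results[position] = vector
            if len(_embedding_cache) < _CACHE_MAX_ENTRIES:
                _embedding_cache[_cache_key(text)] = vector

    return jsonify({'embeddings': results})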

4. Vector Storage in Pinecone

Pinecone was selected as our vector database for several reasons:

  • Scalability: Handling millions of vectors with consistent performance
  • Query Speed: Fast similarity search even with large vector collections
  • Metadata Filtering: Combining vector similarity with metadata filters
  • Managed Service: Reduced operational overhead
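
As an illustration of the storage and retrieval pattern with the current Pinecone Python client (index name, metadata fields, vector values, and API key handling are placeholders, not our exact configuration):

import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("ikman-listings")  # placeholder index name

# Placeholder vectors; in practice these come from the embedding service
listing_embedding = [0.1] * 384
query_embedding = [0.1] * 384

# Upsert a listing embedding together with filterable metadata
index.upsert(vectors=[{
    "id": "listing-12345",
    "values": listing_embedding,
    "metadata": {
        "category": "electronics",
        "location": "colombo",
        "posted_at": 1723248000,          # unix timestamp, useful for recency boosting
    },
}])

# Query: combine vector similarity with a metadata filter
results = index.query(
    vector=query_embedding,
    top_k=20,
    filter={"category": {"$eq": "electronics"}},
    include_metadata=True,
)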

5. Hybrid Search Implementation

To maximize search quality, we implemented a hybrid approach:

  • Vector Search: Finding semantically similar listings
  • Keyword Boosting: Giving additional weight to exact keyword matches
  • Category Filtering: Using metadata to narrow results to relevant categories
  • Recency Boosting: Giving preference to newer listings
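
A simplified sketch of how these signals can be combined into a final ranking score; the weights, half-life, and field names below are illustrative rather than our production tuning:

import math
import time

# Illustrative weights, not production-tuned values
KEYWORD_BOOST = 0.15
RECENCY_HALF_LIFE_DAYS = 14.0

def hybrid_score(vector_score, listing, query_terms):
    """Combine vector similarity with keyword and recency boosts."""
    score = vector_score

    # Keyword boosting: reward exact query terms appearing in the title
    title_terms = set(listing["title"].lower().split())
    matches = sum(1 for term in query_terms if term in title_terms)
    score += KEYWORD_BOOST * matches

    # Recency boosting: exponential decay based on listing age in days
    age_days = (time.time() - listing["posted_at"]) / 86400
    score *= 0.5 + 0.5 * math.exp(-age_days * math.log(2) / RECENCY_HALF_LIFE_DAYS)

    return score

def rank_results(candidates, query_terms):
    # `candidates` come from the vector search step (already category-filtered via metadata)
    scored = [(hybrid_score(c["vector_score"], c, query_terms), c) for c in candidates]
    return [c for _, c in sorted(scored, key=lambda pair: pair[0], reverse=True)]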

Technical Challenges in Vector Search Implementation

1. Handling Multilingual Content

Challenge: ikman.lk contains listings in English, Sinhala, and Tamil, often mixed within the same listing.

Solution: We used a multilingual embedding model specifically fine-tuned for multiple languages, and implemented language detection to apply language-specific preprocessing when needed.
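
The language detection step can be approximated with a simple script-based heuristic; the detect_language helper below is an illustrative sketch based on Unicode ranges, not the exact production logic:

def detect_language(text):
    """Rough script-based language detection for mixed English/Sinhala/Tamil text."""
    counts = {"si": 0, "ta": 0, "en": 0}
    for ch in text:
        code = ord(ch)
        if 0x0D80 <= code <= 0x0DFF:      # Sinhala Unicode block
            counts["si"] += 1
        elif 0x0B80 <= code <= 0x0BFF:    # Tamil Unicode block
            counts["ta"] += 1
        elif ch.isascii() and ch.isalpha():
            counts["en"] += 1
    # Return the dominant script; fall back to English for empty or ambiguous text
    return max(counts, key=counts.get) if any(counts.values()) else "en"

print(detect_language("iphone 13 pro max"))        # en
print(detect_language("ජංගම දුරකථනය විකිණීමට"))      # si
print(detect_language("புதிய மொபைல் போன்"))          # ta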

2. Dimensionality and Performance

Challenge: Higher-dimensional embeddings capture more semantic information but increase storage and computation costs.

Solution: We evaluated different dimensionality reduction techniques and settled on 384-dimensional embeddings as the optimal balance between semantic richness and performance.
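
As one example of such a technique, PCA can project higher-dimensional model output down to 384 dimensions; the snippet below is a sketch with random stand-in data, and the exact reduction method and training corpus are implementation details not shown here:

import numpy as np
from sklearn.decomposition import PCA

# Stand-in for a corpus of higher-dimensional embeddings (e.g. 1024-d model output)
corpus_embeddings = np.random.rand(10_000, 1024).astype(np.float32)

# Fit a PCA projection down to 384 dimensions on a representative sample
pca = PCA(n_components=384)
pca.fit(corpus_embeddings)

def reduce_embedding(embedding):
    # Project a single embedding into the 384-dimensional space used by the index
    return pca.transform(embedding.reshape(1, -1))[0]

reduced = reduce_embedding(corpus_embeddings[0])
print(reduced.shape)  # (384,)
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.2%}")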