Storefront Search Methodology

Introduction

This document describes the search methodology used in the Publica.la storefront. The system scores each product against the user's query using weighted metadata matches, then ranks products by their total score. Understanding this process clarifies how products are surfaced and ranked in storefront search results.

Weighted Scoring System

The core of our search functionality is a multi-level scoring system designed to prioritize the most relevant results. This system assigns different point values to various types of matches.

Prioritized Exact Matches

To ensure that precise queries yield the most direct results, the system assigns high scores to exact matches on key identifiers:

  • Exact External ID (ISBN): A perfect match on a product's External ID (commonly the ISBN) receives the highest score: 1000 points. This prioritizes specific product lookups.
  • Exact Name Match: An exact match of the user's query with a product's name receives a significant score: 500 points.
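
The two exact-match rules above can be sketched as a small function. This is only an illustration: the function name, the whitespace trimming, and the case-insensitive name comparison are assumptions, since the document does not say how the comparison is normalized or what happens when both rules match (the sketch simply returns the higher score).

```python
EXACT_EXTERNAL_ID_SCORE = 1000  # from the document: exact External ID (ISBN) match
EXACT_NAME_SCORE = 500          # from the document: exact name match


def exact_match_score(query: str, name: str, external_id: str) -> int:
    """Return the exact-match score for one product (hypothetical helper)."""
    q = query.strip()
    # Assumption: the External ID is compared literally after trimming.
    if q == external_id.strip():
        return EXACT_EXTERNAL_ID_SCORE
    # Assumption: names are compared case-insensitively.
    if q.casefold() == name.strip().casefold():
        return EXACT_NAME_SCORE
    return 0


if __name__ == "__main__":
    print(exact_match_score("9780000000001", "Some Book", "9780000000001"))  # 1000
    print(exact_match_score("some book", "Some Book", "9780000000001"))      # 500
```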

Relevance-Based Scoring

For less precise matches, the system uses relevance-based scoring:

  • Full-Text Name Search: When an exact name match isn't found, the system performs a full-text search within product names. The score assigned is based on the relevance determined by the search algorithm, rewarding partial matches or variations.
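
The document does not name the full-text algorithm or its scoring formula, so the following is a deliberately simple stand-in rather than the storefront's engine. It scores a name by the fraction of query tokens it contains, which shows how a partial match can earn a non-zero but lower score than a complete one. The function name and the tokenization are assumptions.

```python
def name_relevance(query: str, product_name: str) -> float:
    """Fraction of query tokens found in the product name (0.0 to 1.0).

    A toy token-overlap measure, used here only to illustrate
    relevance-based scoring; real full-text engines use richer formulas.
    """
    query_tokens = set(query.casefold().split())
    name_tokens = set(product_name.casefold().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & name_tokens) / len(query_tokens)


if __name__ == "__main__":
    print(name_relevance("history of rome", "A Short History of Rome"))  # 1.0
    print(name_relevance("roman history", "A Short History of Rome"))    # 0.5
```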

Searchable Fields

The search process queries several metadata fields to gather comprehensive results.

  • Product Name and External ID: These are the primary fields searched, with exact matches prioritized as described above.
  • Contributors: The system examines the names (first and last) of authors and other contributors associated with the products. Matches in contributor names add to the product's overall relevance score.
  • Terms/Taxonomies: If enabled for the specific storefront, the search also includes terms and taxonomies (like categories or tags) associated with products. Matches in these fields contribute to the score but carry a reduced weight (0.5x multiplier) compared to other fields, reflecting their broader nature.
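
As a rough illustration of how the field weights described above might combine, the sketch below adds the per-field scores and applies the 0.5x multiplier to taxonomy matches only when the feature is enabled. The function and parameter names are hypothetical, and treating the combination as a plain sum is an assumption.

```python
TAXONOMY_WEIGHT = 0.5  # from the document: taxonomy matches carry a 0.5x multiplier


def combined_score(
    name_id_score: float,
    contributor_score: float,
    taxonomy_score: float,
    taxonomy_enabled: bool,
) -> float:
    """Combine per-field scores for one product (hypothetical helper)."""
    total = name_id_score + contributor_score
    # Taxonomy matches count only if the storefront has the feature enabled.
    if taxonomy_enabled:
        total += TAXONOMY_WEIGHT * taxonomy_score
    return total


if __name__ == "__main__":
    print(combined_score(500, 2.0, 4.0, taxonomy_enabled=True))   # 504.0
    print(combined_score(500, 2.0, 4.0, taxonomy_enabled=False))  # 502.0
```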

Technical Implementation

To maintain security and stability, the user's search query is sanitized before it is executed. Sanitization removes or escapes special characters that could otherwise disrupt or break the search query, which prevents both errors and security vulnerabilities.
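
A minimal sketch of such a sanitization step is shown below. The exact character set is an assumption (the document does not list which characters are affected); the one used here covers characters that commonly act as operators in full-text query syntaxes. The sketch removes them rather than escaping them and then collapses repeated whitespace.

```python
import re

# Assumption: these characters are treated as "special". The real list
# used by the storefront is not given in the document.
_SPECIAL_CHARS = re.compile(r'[+<>()~*"@\\]')


def sanitize_query(raw: str) -> str:
    """Replace special characters with spaces and normalize whitespace."""
    cleaned = _SPECIAL_CHARS.sub(" ", raw)
    return " ".join(cleaned.split())


if __name__ == "__main__":
    print(sanitize_query('  harry +potter* '))  # harry potter
```

Sanitizing the text does not replace parameterized queries: passing the cleaned string to the database as a bound parameter is still the usual defense against injection.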

Search Process Flow

The search operation follows a structured sequence to combine results from different fields efficiently:

  1. Initial Search: The system first searches for matches based on Product Name and External ID, applying the high-priority scoring for exact matches.
  2. Contributor Integration: Results from searches within contributor names are then integrated, adding to the scores of relevant products.
  3. Taxonomy Search (Conditional): If the taxonomy search feature is active, the system performs searches across relevant terms and adds these weighted scores to the products.
  4. Score Aggregation: All scores from the different search stages (Name, ID, Contributors, Taxonomies) are combined to calculate a final relevance score for each product.
  5. Data Consolidation: Necessary product fields are selected, and results are grouped to prevent duplicate entries for the same product appearing multiple times.
  6. Filtering and Sorting: Products with a final score greater than zero are retained. The results are then sorted in descending order based on their total relevance score, presenting the most relevant products first.
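
The sequence above can be sketched end to end over a list of already-scored match rows. Everything here is illustrative: the row shape `(product_id, stage, score)`, the stage labels, and the function name are assumptions, and the real system presumably performs these steps inside a database query rather than in application code.

```python
from collections import defaultdict

TAXONOMY_WEIGHT = 0.5  # from the document


def rank_products(
    matches: list[tuple[str, str, float]],
    taxonomy_enabled: bool,
) -> list[tuple[str, float]]:
    """Aggregate match rows into one ranked entry per product.

    Each row is (product_id, stage, score), where stage is one of
    "name_id", "contributor", or "taxonomy" (hypothetical labels).
    """
    totals: dict[str, float] = defaultdict(float)
    for product_id, stage, score in matches:
        if stage == "taxonomy":
            if not taxonomy_enabled:
                continue  # step 3 is conditional
            score *= TAXONOMY_WEIGHT
        totals[product_id] += score  # steps 4 and 5: one total per product

    # Step 6: keep scores greater than zero, highest first.
    ranked = [(pid, total) for pid, total in totals.items() if total > 0]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


if __name__ == "__main__":
    rows = [
        ("book-a", "name_id", 500.0),
        ("book-b", "contributor", 3.0),
        ("book-a", "taxonomy", 2.0),
        ("book-c", "name_id", 0.0),
    ]
    print(rank_products(rows, taxonomy_enabled=True))
    # [('book-a', 501.0), ('book-b', 3.0)]
```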

Optimizations

Several optimizations are in place to ensure efficient performance:

  • Result Limitation: The search returns at most 500 records. This cap prevents performance degradation from overly broad queries and keeps the user experience responsive.
  • Data Aggregation: Functions like MAX and grouping techniques are used to consolidate information efficiently when a product matches across multiple search criteria (e.g., name and contributor).
  • Score Rounding: Calculated scores may be rounded to avoid overly precise or fractional point values, simplifying the ranking logic.
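
The three optimizations can be illustrated together in a few lines. This sketch assumes that MAX is used to collapse duplicate rows for the same product, that scores are rounded to whole points, and that the 500-record cap is applied after sorting; the document does not confirm any of these details, and the names are hypothetical.

```python
MAX_RESULTS = 500  # from the document: at most 500 records are returned


def consolidate(
    rows: list[tuple[str, float]],
    limit: int = MAX_RESULTS,
) -> list[tuple[str, int]]:
    """Collapse duplicate product rows, round scores, and cap the result size."""
    best: dict[str, float] = {}
    for product_id, score in rows:
        # Assumption: duplicate rows are merged by keeping the highest score,
        # similar to MAX(...) with GROUP BY in SQL.
        best[product_id] = max(best.get(product_id, 0.0), score)

    # Assumption: scores are rounded to whole points before ranking.
    ranked = sorted(
        ((pid, round(score)) for pid, score in best.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked[:limit]


if __name__ == "__main__":
    print(consolidate([("book-a", 3.2), ("book-a", 7.6), ("book-b", 1.4)]))
    # [('book-a', 8), ('book-b', 1)]
```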
