Learn how to use the TF-IDF variant for keyword-based search without external dependencies.
In this guide you’ll learn when to use TF-IDF instead of semantic search, how to configure and query it, and what its limitations are.

When to Use TF-IDF

| Scenario | Recommendation |
| --- | --- |
| Small corpus (< 10K docs) | TF-IDF works well |
| No network access for model download | Use TF-IDF |
| Keyword matching is sufficient | Use TF-IDF |
| Semantic understanding required | Use VectoriaDB |
| Large corpus (> 10K docs) | Use VectoriaDB + HNSW |

Basic Usage

src/tfidf-basic.ts
import { TFIDFVectoria, DocumentMetadata } from 'vectoriadb';

interface ToolDocument extends DocumentMetadata {
  toolName: string;
  category: string;
}

const db = new TFIDFVectoria<ToolDocument>({
  defaultSimilarityThreshold: 0.0,
  defaultTopK: 10,
});

// Add documents
db.addDocument('tool1', 'User authentication tool', {
  id: 'tool1',
  toolName: 'auth',
  category: 'security',
});

db.addDocument('tool2', 'User profile retrieval', {
  id: 'tool2',
  toolName: 'profile',
  category: 'user',
});

// Reindex after adding documents (required for IDF update)
db.reindex();

// Search
const results = db.search('authentication', { topK: 5 });

Key Differences from VectoriaDB

| Feature | TFIDFVectoria | VectoriaDB |
| --- | --- | --- |
| Dependencies | Zero | transformers.js (~22MB model) |
| Initialization | Synchronous | Async (model download) |
| Semantic understanding | Keyword-based | Full semantic |
| Best for | Small corpora (under 10K docs) | Any size |
| Reindex required | Yes, after changes | No |

Important: Reindexing

TF-IDF requires reindexing after document changes to update IDF (Inverse Document Frequency) values:
src/tfidf-reindex.ts
// Add documents (db is the instance from Basic Usage; metadata1..3 are placeholder ToolDocument objects)
db.addDocument('doc1', 'Text one', metadata1);
db.addDocument('doc2', 'Text two', metadata2);

// MUST reindex before searching
db.reindex();

// Now search works
const results = db.search('query');

// After adding more documents
db.addDocument('doc3', 'Text three', metadata3);
db.reindex(); // Reindex again
Warning: if you skip reindex() after changing documents, searches run against stale IDF values and return incorrect results.
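
One way to make the reindex step harder to forget is to wrap batch additions in a helper that always finishes with a single reindex() call. The sketch below is illustrative only: `ReindexableIndex`, `PendingDocument`, and `addAllAndReindex` are hypothetical names that are not part of vectoriadb, and the interface assumes only the `addDocument(id, text, metadata)` and `reindex()` calls shown in this guide.

```typescript
// Hypothetical structural type: only the two methods this guide uses.
interface ReindexableIndex<M> {
  addDocument(id: string, text: string, metadata: M): void;
  reindex(): void;
}

// Hypothetical shape for a document waiting to be indexed.
interface PendingDocument<M> {
  id: string;
  text: string;
  metadata: M;
}

// Adds every document first, then reindexes once so the IDF values
// reflect the whole batch. Returns the number of documents added.
function addAllAndReindex<M>(
  index: ReindexableIndex<M>,
  docs: PendingDocument<M>[],
): number {
  docs.forEach((doc) => index.addDocument(doc.id, doc.text, doc.metadata));
  if (docs.length > 0) {
    index.reindex(); // skip the rebuild when nothing changed
  }
  return docs.length;
}
```

Assuming a TFIDFVectoria instance matches that shape, it could be passed as the first argument directly, which keeps the add and reindex steps together in one call.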

Configuration Options

src/tfidf-config.ts
const db = new TFIDFVectoria<ToolDocument>({
  defaultSimilarityThreshold: 0.0,  // Minimum score (0-1)
  defaultTopK: 10,                  // Default results limit
});

Search Options

src/tfidf-search.ts
const results = db.search('query', {
  topK: 5,          // Maximum results
  threshold: 0.1,   // Minimum score
  filter: (metadata) => metadata.category === 'security',
});
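
To make the interaction between the three options concrete, here is a minimal sketch of how topK, threshold, and filter could combine over a list of already-scored candidates. It is not the library's implementation: `ScoredResult`, `SearchOptionsSketch`, and `applySearchOptions` are hypothetical names, and both the order of operations and the inclusive threshold comparison are assumptions. The defaults mirror the values used in the configuration example.

```typescript
// Hypothetical result shape; the library's actual result type may differ.
interface ScoredResult<M> {
  id: string;
  score: number;
  metadata: M;
}

// Hypothetical mirror of the options shown above.
interface SearchOptionsSketch<M> {
  topK?: number;
  threshold?: number;
  filter?: (metadata: M) => boolean;
}

// Assumed order: metadata filter, then score threshold, then rank, then cut to topK.
function applySearchOptions<M>(
  candidates: ScoredResult<M>[],
  options: SearchOptionsSketch<M> = {},
): ScoredResult<M>[] {
  const { topK = 10, threshold = 0, filter } = options;
  return candidates
    .filter((c) => (filter ? filter(c.metadata) : true)) // keep matching metadata
    .filter((c) => c.score >= threshold)                 // drop weak matches
    .sort((a, b) => b.score - a.score)                   // highest score first
    .slice(0, topK);                                     // cap the result count
}

const ranked = applySearchOptions(
  [
    { id: 'tool1', score: 0.8, metadata: { category: 'security' } },
    { id: 'tool2', score: 0.05, metadata: { category: 'security' } },
    { id: 'tool3', score: 0.9, metadata: { category: 'user' } },
  ],
  { topK: 5, threshold: 0.1, filter: (m) => m.category === 'security' },
);
```

In this sketch, tool3 is excluded by the filter despite having the highest score, and tool2 is excluded by the threshold, so only tool1 remains.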

TF-IDF Algorithm

TF-IDF (Term Frequency-Inverse Document Frequency) works by:
  1. Term Frequency (TF): How often a term appears in a document
  2. Inverse Document Frequency (IDF): How rare a term is across all documents
  3. TF-IDF Score: TF × IDF. Terms that are frequent in a document but rare across the corpus get high scores
This means:
  • Common words like “the”, “is”, “a” get low scores (low IDF)
  • Unique terms specific to a document get high scores
  • The query is matched against TF-IDF vectors using cosine similarity
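
The steps above can be sketched in a few small functions. This is a simplified illustration, not vectoriadb's actual implementation: the tokenizer (lowercase, split on non-alphanumeric characters) and the smoothed IDF formula `log((1 + N) / (1 + df)) + 1` are assumptions chosen for the example, and all function names are hypothetical.

```typescript
type SparseVector = Map<string, number>;

// Assumption: lowercase the text and split on non-alphanumeric characters.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// TF: occurrences of each term divided by the document length.
function termFrequency(tokens: string[]): SparseVector {
  const counts: SparseVector = new Map();
  tokens.forEach((t) => counts.set(t, (counts.get(t) ?? 0) + 1));
  const tf: SparseVector = new Map();
  counts.forEach((count, t) => tf.set(t, count / tokens.length));
  return tf;
}

// IDF: rarer terms get larger weights (smoothed variant, assumed here).
function inverseDocumentFrequency(corpus: string[][]): SparseVector {
  const df = new Map<string, number>();
  corpus.forEach((tokens) => {
    new Set(tokens).forEach((t) => df.set(t, (df.get(t) ?? 0) + 1));
  });
  const idf: SparseVector = new Map();
  df.forEach((n, t) => idf.set(t, Math.log((1 + corpus.length) / (1 + n)) + 1));
  return idf;
}

// TF-IDF vector: TF × IDF per term; terms unknown to the corpus are dropped.
function tfidfVector(tokens: string[], idf: SparseVector): SparseVector {
  const vec: SparseVector = new Map();
  termFrequency(tokens).forEach((tf, t) => {
    const weight = idf.get(t);
    if (weight !== undefined) vec.set(t, tf * weight);
  });
  return vec;
}

// Cosine similarity between two sparse vectors; 0 if either is empty.
function cosineSimilarity(a: SparseVector, b: SparseVector): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((w, t) => {
    dot += w * (b.get(t) ?? 0);
    normA += w * w;
  });
  b.forEach((w) => {
    normB += w * w;
  });
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Score the two documents from Basic Usage against a one-word query.
const corpus = ['User authentication tool', 'User profile retrieval'].map(tokenize);
const idf = inverseDocumentFrequency(corpus);
const docVectors = corpus.map((tokens) => tfidfVector(tokens, idf));
const queryVector = tfidfVector(tokenize('authentication'), idf);
const scores = docVectors.map((v) => cosineSimilarity(queryVector, v));
```

Here the first document scores above zero because it contains the query term, while the second scores exactly zero because it shares no terms with the query. Because the IDF table is computed from the whole corpus, adding a document changes the weights of existing terms, which is why the index has to be rebuilt after changes.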

Example: Tool Discovery

src/tfidf-tool-discovery.ts
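import { TFIDFVectoria, DocumentMetadata } from 'vectoriadb';

// Extends DocumentMetadata, matching the ToolDocument pattern in Basic Usage
interface Tool extends DocumentMetadata {
  id: string;
  name: string;
  category: string;
}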

const toolSearch = new TFIDFVectoria<Tool>();

// Index tools with descriptive text
toolSearch.addDocument(
  'user-create',
  'Create new user account registration signup',
  { id: 'user-create', name: 'createUser', category: 'users' }
);

toolSearch.addDocument(
  'user-delete',
  'Delete remove user account termination',
  { id: 'user-delete', name: 'deleteUser', category: 'users' }
);

toolSearch.addDocument(
  'payment-charge',
  'Charge payment credit card billing',
  { id: 'payment-charge', name: 'charge', category: 'billing' }
);

toolSearch.reindex();

// Search
const results = toolSearch.search('create account');
// Expected: user-create ranks first (it matches both "create" and "account").
// user-delete also contains "account", so it may appear with a lower score.

Limitations

  1. No semantic understanding - “car” won’t match “automobile”
  2. Reindex requirement - Must call reindex() after changes
  3. Limited to keywords - Misspellings and synonyms aren’t handled
  4. Memory for large vocabularies - IDF tables grow with vocabulary size

Hybrid Approach

For the best of both worlds, you can use TF-IDF as a fast pre-filter before semantic search:
src/tfidf-hybrid.ts
// Fast TF-IDF pre-filter
const tfidfResults = tfidfIndex.search(query, { topK: 100, threshold: 0.1 });

// Semantic re-ranking on smaller set
const semanticResults = await vectoriaDB.searchByIds(
  tfidfResults.map(r => r.id),
  query
);
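
One caveat with a keyword pre-filter is that it inherits the TF-IDF limitations: a query phrased with synonyms can produce no candidates at all. The sketch below shows one possible way to handle that by falling back to a full semantic search when the pre-filter comes back empty. Everything here is hypothetical: `Ranked`, `HybridDeps`, and `hybridSearch` are illustrative names, and the three injected functions stand in for whatever calls your TF-IDF and semantic indexes actually expose.

```typescript
interface Ranked {
  id: string;
  score: number;
}

// Hypothetical dependencies; wire these to your own indexes.
interface HybridDeps {
  keywordSearch: (query: string, limit: number) => Ranked[];
  semanticRerank: (ids: string[], query: string) => Promise<Ranked[]>;
  semanticSearch: (query: string, limit: number) => Promise<Ranked[]>;
}

async function hybridSearch(
  query: string,
  deps: HybridDeps,
  prefilterSize = 100,
  topK = 10,
): Promise<Ranked[]> {
  // Stage 1: cheap keyword pre-filter.
  const candidates = deps.keywordSearch(query, prefilterSize);

  // No keyword overlap (for example, the query uses a synonym):
  // fall back to semantic search over the whole corpus.
  if (candidates.length === 0) {
    return deps.semanticSearch(query, topK);
  }

  // Stage 2: semantic re-ranking restricted to the candidate IDs.
  const reranked = await deps.semanticRerank(candidates.map((c) => c.id), query);
  return reranked.slice(0, topK);
}
```

Injecting the search functions keeps the sketch independent of any particular index API and makes the two code paths easy to test with stubs. The trade-off is that the fallback pays the full semantic search cost, so it is worth only when empty pre-filter results are rare.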
