TF-IDF Variant

Learn how to use the TF-IDF variant for keyword-based search without external dependencies.

In this guide, you’ll learn when to use TF-IDF instead of semantic search, how to configure and query it, and what its limitations are.

When to Use TF-IDF

Scenario	Recommendation
Small corpus (< 10K docs)	TF-IDF works well
No network access for model download	Use TF-IDF
Keyword matching is sufficient	Use TF-IDF
Semantic understanding required	Use VectoriaDB
Large corpus (> 10K docs)	Use VectoriaDB + HNSW

Basic Usage

src/tfidf-basic.ts

import { TFIDFVectoria, DocumentMetadata } from 'vectoriadb';

interface ToolDocument extends DocumentMetadata {
  toolName: string;
  category: string;
}

const db = new TFIDFVectoria<ToolDocument>({
  defaultSimilarityThreshold: 0.0,
  defaultTopK: 10,
});

// Add documents
db.addDocument('tool1', 'User authentication tool', {
  id: 'tool1',
  toolName: 'auth',
  category: 'security',
});

db.addDocument('tool2', 'User profile retrieval', {
  id: 'tool2',
  toolName: 'profile',
  category: 'user',
});

// Reindex after adding documents (required for IDF update)
db.reindex();

// Search
const results = db.search('authentication', { topK: 5 });

Key Differences from VectoriaDB

Feature	TFIDFVectoria	VectoriaDB
Dependencies	Zero	transformers.js (~22MB model)
Initialization	Synchronous	Async (model download)
Semantic understanding	Keyword-based	Full semantic
Best for	Small corpora (under 10K docs)	Any size
Reindex required	Yes, after changes	No

Important: Reindexing

TF-IDF requires reindexing after document changes to update IDF (Inverse Document Frequency) values:

src/tfidf-reindex.ts

// Add documents
db.addDocument('doc1', 'Text one', metadata1);
db.addDocument('doc2', 'Text two', metadata2);

// MUST reindex before searching
db.reindex();

// Now search works
const results = db.search('query');

// After adding more documents
db.addDocument('doc3', 'Text three', metadata3);
db.reindex(); // Reindex again

If you forget to call reindex() after changes, searches run against outdated IDF values and return incorrect results.
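To make the effect concrete, here is a minimal sketch of a toy index that caches IDF values the way the reindex requirement suggests. `ToyIndex` and its `idfOf` method are hypothetical and written only for this illustration; they are not part of vectoriadb, and the plain `ln(N / df)` formula is an assumption.

```typescript
// Hypothetical toy index: illustrates why a cached IDF table goes stale.
// Not the vectoriadb implementation.
class ToyIndex {
  private docs: string[][] = [];
  private idf = new Map<string, number>();

  addDocument(text: string): void {
    // Only stores the tokens; the cached IDF table is left untouched.
    this.docs.push(text.toLowerCase().split(/\s+/).filter(Boolean));
  }

  reindex(): void {
    // Count how many documents contain each term (document frequency).
    const df = new Map<string, number>();
    for (const tokens of this.docs) {
      new Set(tokens).forEach((t) => df.set(t, (df.get(t) ?? 0) + 1));
    }
    // Rebuild the IDF table from scratch: idf(t) = ln(N / df(t)).
    const idf = new Map<string, number>();
    const total = this.docs.length;
    df.forEach((count, t) => idf.set(t, Math.log(total / count)));
    this.idf = idf;
  }

  idfOf(term: string): number | undefined {
    return this.idf.get(term);
  }
}

const toy = new ToyIndex();
toy.addDocument('Text one');
toy.addDocument('Text two');
toy.reindex();
const before = toy.idfOf('one'); // ln(2 / 1)

toy.addDocument('Text three');
const staleThree = toy.idfOf('three'); // undefined: the term is unknown until reindex
const staleOne = toy.idfOf('one');     // unchanged, although the corpus now has 3 docs

toy.reindex();
const after = toy.idfOf('one'); // ln(3 / 1)
```

Between the third `addDocument` call and the second `reindex`, the new term has no weight and the old weights still describe a two-document corpus, which is the situation the warning above describes.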

Configuration Options

src/tfidf-config.ts

const db = new TFIDFVectoria<ToolDocument>({
  defaultSimilarityThreshold: 0.0,  // Minimum score (0-1)
  defaultTopK: 10,                  // Default results limit
});

Search Options

src/tfidf-search.ts

const results = db.search('query', {
  topK: 5,          // Maximum results
  threshold: 0.1,   // Minimum score
  filter: (metadata) => metadata.category === 'security',
});
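The sketch below shows one plausible way these three options could combine: the metadata filter first, then the score threshold, then the topK cut. The order of the steps, the inclusive `>=` comparison, the fallback defaults, and the names `Scored`, `SketchSearchOptions`, and `applySearchOptions` are assumptions for illustration, not the library's implementation.

```typescript
// Hypothetical shapes for this sketch only; not vectoriadb types.
interface Scored<M> {
  id: string;
  score: number;
  metadata: M;
}

interface SketchSearchOptions<M> {
  topK?: number;
  threshold?: number;
  filter?: (metadata: M) => boolean;
}

function applySearchOptions<M>(
  candidates: Scored<M>[],
  options: SketchSearchOptions<M> = {}
): Scored<M>[] {
  const { topK = 10, threshold = 0, filter } = options;
  return candidates
    .filter((c) => (filter ? filter(c.metadata) : true)) // metadata predicate
    .filter((c) => c.score >= threshold)                 // minimum score
    .sort((a, b) => b.score - a.score)                   // best match first
    .slice(0, topK);                                     // cap the result count
}

const candidates: Scored<{ category: string }>[] = [
  { id: 'tool1', score: 0.62, metadata: { category: 'security' } },
  { id: 'tool2', score: 0.05, metadata: { category: 'security' } },
  { id: 'tool3', score: 0.9, metadata: { category: 'user' } },
];

const filtered = applySearchOptions(candidates, {
  topK: 5,
  threshold: 0.1,
  filter: (metadata) => metadata.category === 'security',
});
// filtered holds only tool1: tool3 fails the filter, tool2 falls below the threshold
```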

TF-IDF Algorithm

TF-IDF (Term Frequency-Inverse Document Frequency) is built from three quantities:

Term Frequency (TF): How often a term appears in a document
Inverse Document Frequency (IDF): How rare a term is across all documents
TF-IDF Score: TF × IDF. Terms that are frequent in a document but rare overall get high scores

This means:

Common words like “the”, “is”, “a” get low scores (low IDF)
Unique terms specific to a document get high scores
The query is matched against TF-IDF vectors using cosine similarity
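The steps above can be sketched in a few self-contained functions. This is a minimal illustration, not the library's internals: the tokenizer, the smoothed IDF formula `ln((1 + N) / (1 + df)) + 1`, and all function names here are assumptions chosen for the example.

```typescript
// Minimal TF-IDF sketch for illustration only; not vectoriadb's implementation.
type SparseVector = Map<string, number>;

// Lowercase and split on anything that is not a letter or digit.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// TF: occurrences of each term divided by the document length.
function termFrequency(tokens: string[]): SparseVector {
  const counts: SparseVector = new Map();
  for (const t of tokens) counts.set(t, (counts.get(t) ?? 0) + 1);
  const tf: SparseVector = new Map();
  counts.forEach((count, t) => tf.set(t, count / tokens.length));
  return tf;
}

// IDF (assumed smoothed variant): ln((1 + N) / (1 + df)) + 1.
function inverseDocumentFrequency(docs: string[][]): SparseVector {
  const df: SparseVector = new Map();
  for (const tokens of docs) {
    new Set(tokens).forEach((t) => df.set(t, (df.get(t) ?? 0) + 1));
  }
  const idf: SparseVector = new Map();
  df.forEach((count, t) => idf.set(t, Math.log((1 + docs.length) / (1 + count)) + 1));
  return idf;
}

// TF-IDF vector: TF × IDF per term; terms missing from the corpus are skipped.
function tfidfVector(tokens: string[], idf: SparseVector): SparseVector {
  const vec: SparseVector = new Map();
  termFrequency(tokens).forEach((tf, t) => {
    const weight = idf.get(t);
    if (weight !== undefined) vec.set(t, tf * weight);
  });
  return vec;
}

// Cosine similarity between two sparse vectors; 0 if either is empty.
function cosineSimilarity(a: SparseVector, b: SparseVector): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((v, t) => {
    dot += v * (b.get(t) ?? 0);
    normA += v * v;
  });
  b.forEach((v) => {
    normB += v * v;
  });
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

const corpus = ['User authentication tool', 'User profile retrieval'].map(tokenize);
const idf = inverseDocumentFrequency(corpus);
const docVectors = corpus.map((tokens) => tfidfVector(tokens, idf));
const queryVector = tfidfVector(tokenize('authentication'), idf);
const scores = docVectors.map((doc) => cosineSimilarity(queryVector, doc));
// scores[0] is positive (shares "authentication"); scores[1] is 0 (no shared term)
```

In this two-document corpus, "user" appears in both documents and therefore gets a lower IDF than "authentication", which appears in only one.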

Example: Tool Discovery

src/tfidf-tool-discovery.ts

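import { TFIDFVectoria, DocumentMetadata } from 'vectoriadb';

// Extend DocumentMetadata, as in Basic Usage, so Tool can be used as the metadata type
interface Tool extends DocumentMetadata {
  id: string;
  name: string;
  category: string;
}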

const toolSearch = new TFIDFVectoria<Tool>();

// Index tools with descriptive text
toolSearch.addDocument(
  'user-create',
  'Create new user account registration signup',
  { id: 'user-create', name: 'createUser', category: 'users' }
);

toolSearch.addDocument(
  'user-delete',
  'Delete remove user account termination',
  { id: 'user-delete', name: 'deleteUser', category: 'users' }
);

toolSearch.addDocument(
  'payment-charge',
  'Charge payment credit card billing',
  { id: 'payment-charge', name: 'charge', category: 'billing' }
);

toolSearch.reindex();

// Search
const results = toolSearch.search('create account');
// Returns: user-create with high score (matches "create" and "account")

Limitations

No semantic understanding - “car” won’t match “automobile”
Reindex requirement - Must call reindex() after changes
Limited to keywords - Misspellings and synonyms aren’t handled
Memory for large vocabularies - IDF tables grow with vocabulary size
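The first and third limitations follow from exact token matching. The sketch below uses a hypothetical helper, `sharedTerms`, with a simple tokenizer (an assumption; the library's tokenizer may differ) to show that a synonym, a misspelling, and an abbreviation all fail to overlap with the indexed text.

```typescript
// Hypothetical helper for illustration: which query terms occur verbatim in the document?
function sharedTerms(query: string, doc: string): string[] {
  const docTerms = new Set(doc.toLowerCase().split(/\W+/).filter(Boolean));
  return query
    .toLowerCase()
    .split(/\W+/)
    .filter((t) => t.length > 0 && docTerms.has(t));
}

// Synonym: no shared token, so a keyword score would be zero.
const synonym = sharedTerms('automobile', 'Used car listings');

// Misspelling: "authentcation" is a different token from "authentication".
const typo = sharedTerms('authentcation', 'User authentication tool');

// Abbreviation: only "user" matches; "auth" is not the same token as "authentication".
const partial = sharedTerms('user auth', 'User authentication tool');
```

One common mitigation, also used in the Tool Discovery example, is to index extra descriptive keywords (synonyms, abbreviations) alongside each document's text.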

Hybrid Approach

For the best of both worlds, you can use TF-IDF as a fast pre-filter and then re-rank the shortlist with semantic search:

src/tfidf-hybrid.ts

// Fast TF-IDF pre-filter
const tfidfResults = tfidfIndex.search(query, { topK: 100, threshold: 0.1 });

// Semantic re-ranking on smaller set
const semanticResults = await vectoriaDB.searchByIds(
  tfidfResults.map(r => r.id),
  query
);
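The same two-stage pattern can be sketched without either library. Everything in this block is hypothetical: `keywordScore` is a crude stand-in for a TF-IDF score, `fakeSemanticScore` is a stub returning hand-picked numbers rather than real embedding similarity, and `hybridSearch` is written for this illustration only.

```typescript
// Hypothetical types and scorers; not part of vectoriadb.
interface Doc {
  id: string;
  text: string;
}

interface Ranked {
  id: string;
  score: number;
}

// Stage 1 stand-in: fraction of query terms that appear verbatim in the text.
function keywordScore(query: string, text: string): number {
  const terms = query.toLowerCase().split(/\W+/).filter(Boolean);
  if (terms.length === 0) return 0;
  const words = new Set(text.toLowerCase().split(/\W+/).filter(Boolean));
  return terms.filter((t) => words.has(t)).length / terms.length;
}

async function hybridSearch(
  query: string,
  docs: Doc[],
  semanticScore: (query: string, doc: Doc) => Promise<number>,
  prefilterK = 100,
  threshold = 0.1
): Promise<Ranked[]> {
  // Cheap keyword pass: keep at most prefilterK docs scoring at or above the threshold.
  const shortlist = docs
    .map((doc) => ({ doc, score: keywordScore(query, doc.text) }))
    .filter((c) => c.score >= threshold)
    .sort((a, b) => b.score - a.score)
    .slice(0, prefilterK)
    .map((c) => c.doc);

  // Expensive pass: score only the shortlist, then order by the semantic score.
  const reranked = await Promise.all(
    shortlist.map(async (doc) => ({ id: doc.id, score: await semanticScore(query, doc) }))
  );
  return reranked.sort((a, b) => b.score - a.score);
}

const demoDocs: Doc[] = [
  { id: 'user-create', text: 'Create new user account registration signup' },
  { id: 'user-delete', text: 'Delete remove user account termination' },
  { id: 'payment-charge', text: 'Charge payment credit card billing' },
];

// Stub scores chosen by hand for the demo.
const fakeScores: Record<string, number> = {
  'user-create': 0.9,
  'user-delete': 0.4,
  'payment-charge': 0.8,
};
const fakeSemanticScore = async (_query: string, doc: Doc): Promise<number> =>
  fakeScores[doc.id] ?? 0;

const demo = hybridSearch('create account', demoDocs, fakeSemanticScore);
// Resolves to user-create, then user-delete; payment-charge never reaches stage 2.
```

Note the trade-off this exposes: a document with no keyword overlap is dropped by the pre-filter, so the semantic stage cannot recover it even if it would have scored well.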
