Learn how to use the TF-IDF variant for keyword-based search without external dependencies.
In this guide you’ll learn when to use TF-IDF instead of semantic search, how to configure and query it, and what its limitations are.
When to Use TF-IDF
| Scenario | Recommendation |
| --- | --- |
| Small corpus (< 10K docs) | TF-IDF works well |
| No network access for model download | Use TF-IDF |
| Keyword matching is sufficient | Use TF-IDF |
| Semantic understanding required | Use VectoriaDB |
| Large corpus (> 10K docs) | Use VectoriaDB + HNSW |
Basic Usage
import { TFIDFVectoria, DocumentMetadata } from 'vectoriadb';
interface ToolDocument extends DocumentMetadata {
  toolName: string;
  category: string;
}
const db = new TFIDFVectoria<ToolDocument>({
  defaultSimilarityThreshold: 0.0,
  defaultTopK: 10,
});
// Add documents
db.addDocument('tool1', 'User authentication tool', {
  id: 'tool1',
  toolName: 'auth',
  category: 'security',
});
db.addDocument('tool2', 'User profile retrieval', {
  id: 'tool2',
  toolName: 'profile',
  category: 'user',
});
// Reindex after adding documents (required for IDF update)
db.reindex();
// Search
const results = db.search('authentication', { topK: 5 });
Key Differences from VectoriaDB
| Feature | TFIDFVectoria | VectoriaDB |
| --- | --- | --- |
| Dependencies | Zero | transformers.js (~22MB model) |
| Initialization | Synchronous | Async (model download) |
| Semantic understanding | Keyword-based | Full semantic |
| Best for | Small corpora (under 10K docs) | Any size |
| Reindex required | Yes, after changes | No |
Important: Reindexing
TF-IDF requires reindexing after document changes to update IDF (Inverse Document Frequency) values:
// Add documents
db.addDocument('doc1', 'Text one', metadata1);
db.addDocument('doc2', 'Text two', metadata2);
// MUST reindex before searching
db.reindex();
// Now search works
const results = db.search('query');
// After adding more documents
db.addDocument('doc3', 'Text three', metadata3);
db.reindex(); // Reindex again
Forgetting to call reindex() after changes will result in incorrect search results.
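One way to make a forgotten reindex() less likely is to wrap the index so that it reindexes lazily before the next search. The sketch below is illustrative only: `ReindexableIndex` and `LazyReindexer` are hypothetical names declared locally, and the interface mirrors only the three methods used in this guide rather than the library's actual type exports.

```typescript
// Minimal shape of the index methods used in this guide. Declared locally so the
// sketch does not depend on vectoriadb's actual type exports (an assumption).
interface ReindexableIndex<M, R> {
  addDocument(id: string, text: string, metadata: M): void;
  reindex(): void;
  search(query: string): R[];
}

// Hypothetical helper: remembers whether documents changed since the last
// reindex and calls reindex() once, right before the next search.
class LazyReindexer<M, R> {
  private readonly index: ReindexableIndex<M, R>;
  private dirty = false;

  constructor(index: ReindexableIndex<M, R>) {
    this.index = index;
  }

  addDocument(id: string, text: string, metadata: M): void {
    this.index.addDocument(id, text, metadata);
    this.dirty = true; // IDF values are now stale
  }

  search(query: string): R[] {
    if (this.dirty) {
      this.index.reindex();
      this.dirty = false;
    }
    return this.index.search(query);
  }
}
```

With this wrapper, a batch of addDocument calls triggers a single reindex on the next search instead of one per document. It only tracks changes made through the wrapper, so documents added directly on the underlying index would still need a manual reindex().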
Configuration Options
const db = new TFIDFVectoria<ToolDocument>({
  defaultSimilarityThreshold: 0.0, // Minimum score (0-1)
  defaultTopK: 10, // Default results limit
});
Search Options
const results = db.search('query', {
  topK: 5, // Maximum results
  threshold: 0.1, // Minimum score
  filter: (metadata) => metadata.category === 'security',
});
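To show how the three options interact, here is a standalone sketch that applies them to an already-scored result list. The order used here (metadata filter, then score threshold, then sort and cut to topK) is an assumption for illustration and is not confirmed to match the library's internals; `ScoredResult`, `SearchOptions`, and `applySearchOptions` are hypothetical names.

```typescript
// Hypothetical result and option shapes, declared locally for illustration.
interface ScoredResult<M> {
  id: string;
  score: number;
  metadata: M;
}

interface SearchOptions<M> {
  topK?: number;
  threshold?: number;
  filter?: (metadata: M) => boolean;
}

// Applies filter, threshold, and topK in that (assumed) order.
// The input array is not mutated because filter() returns a new array.
function applySearchOptions<M>(
  scored: ScoredResult<M>[],
  options: SearchOptions<M> = {}
): ScoredResult<M>[] {
  const { topK = 10, threshold = 0, filter } = options;
  return scored
    .filter((r) => (filter ? filter(r.metadata) : true)) // keep matching metadata
    .filter((r) => r.score >= threshold) // drop low scores
    .sort((a, b) => b.score - a.score) // highest score first
    .slice(0, topK); // cap the result count
}
```

Filtering before the topK cut means topK counts only documents that pass the filter, which is usually what a caller expects when asking for "the top 5 security tools".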
TF-IDF Algorithm
TF-IDF (Term Frequency-Inverse Document Frequency) works by:
Term Frequency (TF): How often a term appears in a document
Inverse Document Frequency (IDF): How rare a term is across all documents
TF-IDF Score: TF × IDF, so terms that are frequent in a document but rare overall get high scores
This means:
Common words like “the”, “is”, “a” get low scores (low IDF)
Unique terms specific to a document get high scores
The query is matched against TF-IDF vectors using cosine similarity
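The steps above can be sketched in a few dozen lines. This is a self-contained illustration, not the library's implementation: the tokenizer, the smoothed IDF formula `log((1 + N) / (1 + df)) + 1`, and all function names are assumptions chosen for clarity.

```typescript
// Sparse vector: term -> TF-IDF weight.
type SparseVector = Map<string, number>;

// Lowercase and split on anything that is not a letter or digit.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Smoothed IDF (an assumed variant): rarer terms get larger weights.
function computeIdf(docs: string[][]): Map<string, number> {
  const df = new Map<string, number>();
  for (const tokens of docs) {
    new Set(tokens).forEach((term) => {
      df.set(term, (df.get(term) ?? 0) + 1);
    });
  }
  const idf = new Map<string, number>();
  df.forEach((count, term) => {
    idf.set(term, Math.log((1 + docs.length) / (1 + count)) + 1);
  });
  return idf;
}

// TF is the term count divided by text length; terms missing from the IDF table are skipped.
function toVector(tokens: string[], idf: Map<string, number>): SparseVector {
  const vec: SparseVector = new Map();
  for (const term of tokens) {
    const weight = idf.get(term);
    if (weight === undefined) continue;
    vec.set(term, (vec.get(term) ?? 0) + weight / tokens.length);
  }
  return vec;
}

// Cosine similarity between two sparse vectors (0 when either is empty).
function cosine(a: SparseVector, b: SparseVector): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((w, term) => {
    normA += w * w;
    dot += w * (b.get(term) ?? 0);
  });
  b.forEach((w) => {
    normB += w * w;
  });
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Scores every document against the query and sorts by descending similarity.
function rank(query: string, corpus: Record<string, string>): { id: string; score: number }[] {
  const ids = Object.keys(corpus);
  const tokenized = ids.map((id) => tokenize(corpus[id]));
  const idf = computeIdf(tokenized);
  const queryVec = toVector(tokenize(query), idf);
  return ids
    .map((id, i) => ({ id, score: cosine(queryVec, toVector(tokenized[i], idf)) }))
    .sort((x, y) => y.score - x.score);
}
```

Because the IDF table is computed from the whole corpus, adding a document changes the weights of every term it contains. That is the reason an index of this kind needs to be rebuilt after changes.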
src/tfidf-tool-discovery.ts
import { TFIDFVectoria } from 'vectoriadb';
interface Tool {
  id: string;
  name: string;
  category: string;
}
const toolSearch = new TFIDFVectoria<Tool>();
// Index tools with descriptive text
toolSearch.addDocument(
  'user-create',
  'Create new user account registration signup',
  { id: 'user-create', name: 'createUser', category: 'users' }
);
toolSearch.addDocument(
  'user-delete',
  'Delete remove user account termination',
  { id: 'user-delete', name: 'deleteUser', category: 'users' }
);
toolSearch.addDocument(
  'payment-charge',
  'Charge payment credit card billing',
  { id: 'payment-charge', name: 'charge', category: 'billing' }
);
toolSearch.reindex();
// Search
const results = toolSearch.search('create account');
// Returns: user-create with high score (matches "create" and "account")
Limitations
No semantic understanding - “car” won’t match “automobile”
Reindex requirement - Must call reindex() after changes
Limited to keywords - Misspellings and synonyms aren’t handled
Memory for large vocabularies - IDF tables grow with vocabulary size
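The first and third limitations come down to exact token overlap: if a query and a document share no terms, any bag-of-words score between them is zero. The following sketch makes that visible with a hypothetical `sharedTerms` helper; the tokenizer is an assumption and may differ from the one the library uses.

```typescript
// Lowercase and split on anything that is not a letter or digit (assumed tokenizer).
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Returns the distinct query terms that also occur in the document.
// An empty result means a keyword-only index has nothing to score.
function sharedTerms(query: string, document: string): string[] {
  const docTerms = tokenize(document);
  return tokenize(query).filter(
    (term, i, all) => all.indexOf(term) === i && docTerms.indexOf(term) !== -1
  );
}

// Synonym: no overlap, so the document cannot match.
const synonymOverlap = sharedTerms('car rental', 'automobile hire service');

// Misspelling: "authentification" is a different token from "authentication".
const typoOverlap = sharedTerms('authentification', 'User authentication tool');

// Exact keyword: "user" is shared, so the document can be scored.
const keywordOverlap = sharedTerms('user auth', 'User authentication tool');
```

Adding synonyms or common alternative spellings to the indexed text, as the tool descriptions in the example above do with "registration signup", is a simple way to widen what a keyword index can match.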
Hybrid Approach
For the best of both worlds, you can use TF-IDF as a fast pre-filter and then re-rank the smaller candidate set with semantic search:
// Fast TF-IDF pre-filter
const tfidfResults = tfidfIndex.search(query, { topK: 100, threshold: 0.1 });
// Semantic re-ranking on smaller set
const semanticResults = await vectoriaDB.searchByIds(
  tfidfResults.map(r => r.id),
  query
);
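The two-stage flow can also be written as a small reusable function. This sketch is library-agnostic: `hybridSearch`, `KeywordSearch`, `SemanticRerank`, and the option names are hypothetical, and both stages are passed in as plain functions so nothing here relies on a particular API.

```typescript
interface Candidate {
  id: string;
  score: number;
}

// Stage 1: synchronous keyword search (for example, a TF-IDF index).
type KeywordSearch = (
  query: string,
  options: { topK: number; threshold: number }
) => Candidate[];

// Stage 2: asynchronous semantic scoring restricted to the given IDs.
type SemanticRerank = (ids: string[], query: string) => Promise<Candidate[]>;

interface HybridOptions {
  prefilterTopK?: number;
  prefilterThreshold?: number;
  finalTopK?: number;
}

// Hypothetical helper: keyword pre-filter, then semantic re-ranking.
async function hybridSearch(
  query: string,
  keywordSearch: KeywordSearch,
  semanticRerank: SemanticRerank,
  options: HybridOptions = {}
): Promise<Candidate[]> {
  const { prefilterTopK = 100, prefilterThreshold = 0.1, finalTopK = 10 } = options;

  const candidates = keywordSearch(query, {
    topK: prefilterTopK,
    threshold: prefilterThreshold,
  });
  // Nothing matched by keyword, so skip the more expensive semantic stage.
  if (candidates.length === 0) return [];

  const reranked = await semanticRerank(candidates.map((c) => c.id), query);
  return reranked
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => b.score - a.score)
    .slice(0, finalTopK);
}
```

One trade-off to keep in mind: the pre-filter inherits TF-IDF's limitations, so a document that is semantically relevant but shares no keywords with the query never reaches the re-ranking stage. Lowering the pre-filter threshold or raising its topK reduces that risk at the cost of more semantic comparisons.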