Search Engine using TF-IDF & Cosine Similarity

• Developed a custom search engine to retrieve the most relevant U.S. presidential inaugural speech based on a user’s search query.

• Implemented a full-text retrieval pipeline that uses TF-IDF weighting and cosine similarity to score and rank documents.

• Preprocessed over 50 historical speech documents through tokenization, stop-word removal, stemming (using NLTK), and normalization.

• Constructed an inverted index (postings lists) mapping each term to its weighted document occurrences, with each postings list sorted by TF-IDF weight.

• Built a query processor that dynamically computes similarity scores between a user query and all document vectors using normalized TF-IDF.

• Implemented a top-K retrieval mechanism that limits results to the highest-scoring documents to reduce query response time.

• Added upper-bound scoring for early termination, allowing the system to skip deeper document scans once the score bound indicates that the current top results are unlikely to change.

• Designed a scalable query loop that progressively expands the document search scope when relevance is ambiguous, balancing recall and performance.
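
The core of the pipeline described above can be illustrated with a minimal sketch. It is not the project's actual code: the function names (tokenize, build_index, search), the tiny stop-word set, and the log-scaled TF-IDF variant are assumptions chosen for illustration. It uses only the Python standard library, so the NLTK stemming step is omitted, and it scans full postings lists rather than applying the upper-bound early termination.

```python
import heapq
import math
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list for illustration; the project relied on NLTK.
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "in", "is", "we", "our"}


def tokenize(text):
    """Lowercase, keep alphabetic runs, drop stop words (no stemming here)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def normalize(weights):
    """Scale a sparse vector to unit length so dot products are cosines."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def tfidf(term_counts, idf):
    """Assumed weighting: (1 + log10 tf) * idf, then length-normalized."""
    return normalize({t: (1 + math.log10(c)) * idf[t] for t, c in term_counts.items()})


def build_index(docs):
    """Return (postings, idf); postings maps term -> [(doc_id, weight), ...]."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    df = Counter(t for toks in tokens.values() for t in set(toks))
    idf = {t: math.log10(len(docs) / n) for t, n in df.items()}
    postings = defaultdict(list)
    for doc_id, toks in tokens.items():
        for term, weight in tfidf(Counter(toks), idf).items():
            postings[term].append((doc_id, weight))
    # Sort each postings list by descending weight.
    for plist in postings.values():
        plist.sort(key=lambda p: p[1], reverse=True)
    return dict(postings), idf


def search(query, postings, idf, k=3):
    """Score documents by cosine similarity and return the top k."""
    counts = Counter(t for t in tokenize(query) if t in postings)
    q_vec = tfidf(counts, idf)
    scores = defaultdict(float)
    for term, q_weight in q_vec.items():
        for doc_id, d_weight in postings[term]:
            scores[doc_id] += q_weight * d_weight
    return heapq.nlargest(k, scores.items(), key=lambda s: s[1])
```

Given a dictionary that maps document IDs to speech text, build_index(docs) returns the postings and IDF tables, and search("charity for all", postings, idf, k=2) returns up to two (doc_id, score) pairs in descending score order. Because both the query vector and the document vectors are unit length, each score is a cosine similarity between 0 and 1.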