Search Engine using TF-IDF & Cosine Similarity
• Developed a custom search engine to retrieve the most relevant U.S. presidential inaugural speech based on a user’s search query.
• Implemented a full-text retrieval pipeline that scores and ranks documents using TF-IDF weighting and cosine similarity.
• Preprocessed over 50 historical speech documents through tokenization, stop-word removal, stemming (using NLTK), and normalization.
• Constructed an inverted index (postings lists) mapping each term to its weighted document occurrences, with each list sorted by descending weight.
• Built a query processor that dynamically computes similarity scores between a user query and all document vectors using normalized TF-IDF.
• Implemented a top-K retrieval mechanism that limits results to the highest-scoring documents and reduces query response time.
• Added upper-bound scoring for early termination, allowing the system to skip deeper document scans once the upper bound on any unscanned document's score cannot displace the current top results.
• Designed a scalable query loop that progressively expands the document search scope when relevance is ambiguous, balancing recall and performance.
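
The core of the pipeline described above (normalized TF-IDF vectors, weight-sorted postings lists, cosine scoring, top-K selection) can be illustrated with a minimal Python sketch. This is not the project's actual code: the function names (tokenize, build_index, search), the tiny stop-word set, the log-scaled weighting formula, and the sample documents are all assumptions chosen for illustration. NLTK stemming and upper-bound early termination are left out to keep the example dependency-free.

```python
import heapq
import math
import re
from collections import Counter, defaultdict

# Illustrative stop-word set only; a real pipeline would use a fuller list.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "we", "our"}


def tokenize(text):
    """Lowercase, keep alphabetic runs, drop stop words (no stemming here)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_index(docs):
    """Build postings lists of (doc_id, normalized weight) plus an idf table.

    Assumed weighting: (1 + log10(tf)) * log10(N / df), with each document
    vector scaled to unit length so a dot product equals cosine similarity.
    """
    n_docs = len(docs)
    term_freqs = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter(term for tf in term_freqs.values() for term in tf)
    idf = {term: math.log10(n_docs / df) for term, df in doc_freq.items()}

    postings = defaultdict(list)
    for doc_id, tf in term_freqs.items():
        weights = {t: (1 + math.log10(c)) * idf[t] for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        for term, w in weights.items():
            postings[term].append((doc_id, w / norm))

    # Sort each postings list by descending weight.
    for plist in postings.values():
        plist.sort(key=lambda entry: entry[1], reverse=True)
    return postings, idf


def search(query, postings, idf, k=3):
    """Return up to k (doc_id, cosine score) pairs, best first."""
    q_tf = Counter(t for t in tokenize(query) if t in idf)
    q_weights = {t: (1 + math.log10(c)) * idf[t] for t, c in q_tf.items()}
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
    if q_norm == 0:
        return []  # no known query terms, or only terms present in every document

    scores = defaultdict(float)
    for term, q_w in q_weights.items():
        for doc_id, d_w in postings[term]:
            scores[doc_id] += (q_w / q_norm) * d_w
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical mini-corpus standing in for the speech documents.
sample_docs = {
    "speech_a": "liberty and freedom for the people",
    "speech_b": "the economy and taxes",
    "speech_c": "freedom of the seas and trade",
}
sample_postings, sample_idf = build_index(sample_docs)
for doc_id, score in search("freedom liberty", sample_postings, sample_idf, k=2):
    print(doc_id, round(score, 3))
```

Because every vector is normalized to unit length at index time, query-time scoring reduces to summing products over the postings of the query terms, so documents sharing no term with the query are never touched.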