Design choices that affect the query-answering capabilities of a QA system
- Lemmatizer / stemmer: which gives better results, and why?
- Stopword filter: as comprehensive and unobtrusive as possible
- Spelling errors and spell checking
- Distance metric
- TF-IDF design: which TF-IDF variant is most suitable for query answering?
- Vocabulary of the corpus
- Unigram / bigram: does a bigram vocabulary help?
- Handover to a human correspondent
- Syntactic parsing of sentences to uncover relations between words
- Named entity recognition and noun phrase extraction
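Several of the choices above (stopword filter, unigram/bigram vocabulary, TF-IDF variant, distance metric, handover to a human) can be tried out together in a small experiment. The following is a minimal sketch using only the Python standard library, not the implementation used in this project. The stopword list, the smoothed IDF formula, the cosine distance, the 0.3 handover threshold and all function names are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real filter would be larger.
STOPWORDS = {"a", "an", "the", "is", "are", "do", "does", "how", "what",
             "i", "to", "of", "in", "can"}


def tokenize(text, use_bigrams=False):
    """Lowercase, split on non-alphanumerics, drop stopwords, optionally add bigrams."""
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]
    terms = list(words)
    if use_bigrams:
        terms += [f"{a} {b}" for a, b in zip(words, words[1:])]
    return terms


def build_idf(docs_terms):
    """Smoothed IDF, log((1 + N) / (1 + df)) + 1. One variant among several."""
    n = len(docs_terms)
    df = Counter()
    for terms in docs_terms:
        df.update(set(terms))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf_vector(terms, idf):
    """Raw term count times IDF; terms outside the corpus vocabulary are ignored."""
    tf = Counter(t for t in terms if t in idf)
    return {t: count * idf[t] for t, count in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_match(query, questions, use_bigrams=False):
    """Return (index, score) of the stored question most similar to the query."""
    docs = [tokenize(q, use_bigrams) for q in questions]
    idf = build_idf(docs)
    vectors = [tfidf_vector(d, idf) for d in docs]
    query_vec = tfidf_vector(tokenize(query, use_bigrams), idf)
    scores = [cosine(query_vec, v) for v in vectors]
    best = max(range(len(questions)), key=scores.__getitem__)
    return best, scores[best]


def answer_or_handover(query, questions, threshold=0.3):
    """Return the matched index, or None to signal handover to a human."""
    index, score = best_match(query, questions)
    return index if score >= threshold else None
```

Swapping `tokenize` for a version that stems or lemmatizes each word, or toggling `use_bigrams`, gives a simple way to compare the alternatives listed above on the same set of questions.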
Database - The Quora question pairs dataset can be downloaded from: http://qim.ec.quoracdn.net/quora_duplicate_questions.tsv
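Once downloaded, the file can be read with Python's built-in csv module. This is a hedged sketch: the column names `question1`, `question2` and `is_duplicate` are assumed from the file's header row and should be checked against the actual download, and `read_question_pairs` is a hypothetical helper, not part of this project.

```python
import csv
import io


def read_question_pairs(handle):
    """Read tab-separated question pairs from an open text file.

    Assumes header columns named question1, question2 and is_duplicate.
    Rows with a missing question are skipped.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    pairs = []
    for row in reader:
        q1 = (row.get("question1") or "").strip()
        q2 = (row.get("question2") or "").strip()
        if not q1 or not q2:
            continue
        pairs.append((q1, q2, row.get("is_duplicate") == "1"))
    return pairs


# Invented two-row sample in the assumed layout, only to exercise the function.
SAMPLE = (
    "id\tqid1\tqid2\tquestion1\tquestion2\tis_duplicate\n"
    "0\t1\t2\tHow do I learn Python?\tWhat is the best way to learn Python?\t1\n"
    "1\t3\t4\tWhat is TF-IDF?\t\t0\n"
)
sample_pairs = read_question_pairs(io.StringIO(SAMPLE))
```

For the real file, open it with `open("quora_duplicate_questions.tsv", encoding="utf-8", newline="")` and pass the handle to `read_question_pairs`.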
Installation:
- Requires NLTK only