Anvay is a Flask-based Bengali text processing and topic modeling tool that uses Latent Dirichlet Allocation (LDA) to extract topics from uploaded text files. The application includes a custom rule-based Bengali stemmer, stopword removal, and support for n-grams and additional preprocessing features. It generates a visualization of topics and allows exporting the results in .txt and .csv formats.
- Accepts multiple .txt files for processing.
- Removes stopwords (Bengali stopwords from NLTK and custom stopwords).
- Removes punctuation (including Bengali-specific punctuation).
- Rule-based stemming for Bengali nouns and verbs.
- Removes most frequent words based on user-defined thresholds.
- Customizable LDA model parameters (number of topics, iterations, etc.).
- Support for unigrams, bigrams, and trigrams.
- Generates an interactive HTML visualization of topics using pyLDAvis.
- Topics can be downloaded as .txt or .csv files.
- Evaluates the coherence of the generated topics using the gensim library.
Anvay supports the following visualizations:
- PyLDAvis: Interactive visualization of topics and their relationships.
- Scatter Plot: Visual representation of topic distribution using Bokeh.
- Bar Chart: Displays the frequency of topics.
- Heatmap: Illustrates the relationships and strengths between topics.
- Chord Diagram: Visualizes connections between topics.
- Hierarchical Clustering: Groups topics based on similarity.
- Topic Evolution Chart: Shows how topics evolve over documents.
- Topic Distribution Per Document: Provides insight into topic coverage within individual documents.
- Coherence Bar Chart: Evaluates the coherence of topics for better understanding.
- Python 3.8+
- Flask
- Gensim
- NLTK
- PyLDAvis
- Bokeh 2.4.3 (Required)
- Matplotlib
- Other dependencies as specified in
requirements.txt
Note: The application requires Bokeh version 2.4.3. Using any other version may lead to compatibility issues.
The following Python libraries are required:
- Flask
- Gensim
- PyLDAvis
- NLTK
- re
- csv
- string
- os
- collections
Install the dependencies using:
pip install -r requirements.txt
Clone the repository and navigate to the project directory. Check that the necessary folders for file uploads, results, and stopwords are there:
git clone https://github.com/vinayakdasgupta/anvay/
The application automatically downloads required NLTK resources (e.g., stopwords and punkt). If needed, you can manually download them:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
Start the Flask application:
python anvay.py
The application will run at http://127.0.0.1:5000/
.
Navigate to the homepage and upload .txt files for topic modeling. Optionally, upload a custom stopwords file (a .txt file with one word per line).
Set the following parameters:
num_topics
: Number of topics to extract.iterations
: Number of iterations for model training.chunk_size
: Size of chunks to process at a time.alpha
: Dirichlet prior for topic distribution.
- Enable/disable stemming.
- Set the percentage of most frequent words to remove.
- Choose n-gram type: unigram, bigram, or trigram.
- Enable/disable stopword removal and multicore processing.
- View the interactive topic visualization.
- Download the topics as .txt or .csv files.
uploads/
: Stores uploaded text files.results/
: Stores the LDA visualization (lda_visualization.html
) and generated topic files (topics.txt
,topics.csv
).stopwords/
: Stores custom stopwords files.
Add your custom stopwords file in the stopwords/
folder or upload it through the application. The file should contain one word per line.
Edit the noun and verb root/suffix files in the stopwords/
folder:
noun_roots.txt
verb_roots.txt
noun_suffixes.txt
verb_suffixes.txt
- Upload Bengali text files (.txt).
- Configure preprocessing (e.g., stemming, stopword removal, etc.).
- Set LDA model parameters (e.g., number of topics, iterations).
- Click Process to generate the topics.
- View the interactive topic visualization or download the results.
- Integrate a more robust POS tagger for Bengali.
- Improve the stemming process with a hybrid approach (rule-based + statistical methods).
- Improve memory handling.
- Create interactive JS visualizations for static images.
- File Upload Limitation: The total file size and the number of files that can be uploaded depend on Flask’s internal limitations. The application has been tested with up to 980 files successfully.
- Threading Issue: Flask may sometimes encounter threading issues under heavy load.
This application is open-source and available under the MIT License. See LICENSE
for details.
- Gensim: Topic modeling and coherence evaluation.