Commit af876b0: Update README.md (gregoriofsg authored, Sep 10, 2024; parent 8ec3ddd)

1. Clone the repository:
```bash
git clone https://github.com/AILAB-CEFET-RJ/r2t
cd r2t/src
```
2. Create a virtual environment (optional but recommended):
```bash
python -m venv venv
source venv/bin/activate  # On Linux/macOS
venv\Scripts\activate     # On Windows
```
3. Install the required dependencies:
```bash
pip install -r requirements.txt
```

## Usage

### Step 1: Generating Embeddings

Use createEmbedding.py to generate embeddings for both legal documents (Recursos Especiais) and themes.

- For legal documents:
  ```bash
  python createEmbedding.py REsp_completo.csv recurso recurso --clean --begin_point cabimento -v
  ```
- For themes:
  ```bash
  python createEmbedding.py temas_repetitivos.csv tema tema --clean -v
  ```
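The later steps load the embeddings from the generated .pkl files. The row layout below, a list of (doc_id, text, vector) tuples, is only an assumed illustration and not the documented output of createEmbedding.py; adjust it to whatever structure you actually find when unpickling.

```python
import pickle
import tempfile

# Toy stand-in for an embeddings file: the real layout written by
# createEmbedding.py is not documented here, so this (doc_id, text, vector)
# row format is an assumption for illustration only.
rows = [
    ("REsp-1", "texto do recurso", [0.1, 0.2, 0.3]),
    ("REsp-2", "outro recurso", [0.4, 0.5, 0.6]),
]

with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as fh:
    pickle.dump(rows, fh)
    path = fh.name

# Inspect the file the same way a downstream script would load it.
with open(path, "rb") as fh:
    loaded = pickle.load(fh)

print(len(loaded), len(loaded[0][2]))  # number of rows, embedding dimension
```

Peeking at a .pkl this way is a quick check that the file holds what the next step expects before running a long pipeline stage.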

### Step 2: Summarizing Documents

Once embeddings are generated, you can summarize the documents using createTopics.py with one of the summarization methods.

```bash
python createTopics.py <corpus_embedding> <size> <type> [--verbose] [--seed_list <seed_list>] [<model>]
```

### Parameters
* corpus_embedding: Path to the corpus embeddings file (.pkl file).
* size: Number of sentences or topics to summarize.
* type: Type of topic generation:
* B: Bertopic
* G: Guided Bertopic
* L: Lexrank
* X: Guided Lexrank
* --verbose: Increase the verbosity of the process.
* --seed_list: Path to the seed list (required for type G or X).
* model: Sentence-BERT model used to generate embeddings (optional; default: distiluse-base-multilingual-cased-v1).

### Examples

* Topic generation with BERTopic:
  ```bash
  python createTopics.py corpus.pkl 10 B
  ```
* Topic generation with Guided BERTopic:
  ```bash
  python createTopics.py corpus.pkl 10 G --seed_list seeds.csv
  ```
* Summary generation with LexRank:
  ```bash
  python createTopics.py corpus.pkl 5 L
  ```
* Summary generation with Guided LexRank:
  ```bash
  python createTopics.py corpus.pkl 5 X --seed_list seeds.csv
  ```
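LexRank (and its guided variant) scores sentences by how central they are in a graph whose edges are cosine similarities between sentence embeddings. The sketch below shows that core idea with toy 2-D vectors; the damping factor, iteration count, and lack of a similarity threshold are illustrative assumptions, not the settings the scripts in this repository use.

```python
import numpy as np

def lexrank(embeddings, damping=0.85, iters=50):
    """Score sentences by centrality in their cosine-similarity graph
    (the core idea behind LexRank; parameters here are illustrative)."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-normalize rows
    sim = X @ X.T                                      # cosine similarity matrix
    np.fill_diagonal(sim, 0.0)                         # ignore self-similarity
    P = sim / sim.sum(axis=1, keepdims=True)           # row-stochastic transitions
    n = len(X)
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):                             # power iteration
        scores = (1 - damping) / n + damping * (P.T @ scores)
    return scores

# Toy vectors: sentence 1 is similar to both neighbors, sentence 2 is an outlier.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
scores = lexrank(emb)
print(int(np.argmax(scores)))  # → 1 (the most central sentence)
```

A summary of size N then simply keeps the N highest-scoring sentences; the guided variant additionally biases the scores toward sentences that match the seed list.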

### Step 3: Calculating Similarity

After summarizing the documents, use calcSimilarity.py to compute the similarity between the document summaries and the themes.

```bash
python calcSimilarity.py <corpus_file> <themes_file> <rank> <type>
```

### Parameters

* corpus_file: Path to the corpus file in pickle format.
* themes_file: Path to the themes file in pickle format.
* rank: Number of top results to retrieve.
* type: Type of similarity:
  * B: Use the BM25 method for the similarity calculation.
  * C: Use the Cosine Similarity method for the similarity calculation.

### Usage
For text-based similarity (using BM25):
```bash
python calcSimilarity.py <corpus_file> <themes_file> <rank> B
```
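For intuition about what the B option computes, here is a minimal, self-contained Okapi BM25 ranker over toy Portuguese snippets. The k1/b values are the common textbook defaults and the whitespace tokenizer is a simplification; calcSimilarity.py may use different parameters and preprocessing.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25
    (k1/b are the usual defaults; illustrative only)."""
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    df = Counter(t for d in tokenized for t in set(d))  # document frequency per term
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for t in query.lower().split():
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = ["recurso especial sobre dano moral",
        "tema repetitivo de direito tributario",
        "dano moral em contrato de consumo"]
print(bm25_scores("dano moral", docs))  # docs 0 and 2 score > 0, doc 1 scores 0
```

Sorting the themes by these scores and keeping the top `rank` entries mirrors what the similarity step produces for each document summary.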

### Output
The program generates a CSV file with the similarity results.
The file is named CLASSIFIED_<corpus_name>_<METHOD>.csv, where <METHOD> is BM25 or COSINE, depending on the similarity method used.

Example output:
* For BM25: CLASSIFIED_TOPICS_L10CLEAN_BM25.csv
* For Cosine Similarity: CLASSIFIED_TOPICS_L10CLEAN_COSINE.csv

### Notes
* Ensure the input files are in pickle format and contain the expected structure.
* The rank parameter determines how many of the top-ranked similar items are retrieved and included in the output.


### Step 4: Evaluating Performance

Finally, use metrics.py to calculate metrics and evaluate the system’s performance.
It computes metrics such as Recall, F1-Score, MAP (Mean Average Precision), NDCG (Normalized Discounted Cumulative Gain), and MRR (Mean Reciprocal Rank) based on the provided classified data.
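Two of those metrics are easy to state in a few lines. The sketch below computes MRR and Recall@k from toy (ranked predictions, relevant theme) pairs; metrics.py derives the equivalent inputs from the classified CSV, and the theme names here are made up for illustration.

```python
def mrr(ranked_lists, relevant):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit."""
    total = 0.0
    for preds, rel in zip(ranked_lists, relevant):
        for i, p in enumerate(preds, start=1):
            if p == rel:
                total += 1.0 / i
                break
    return total / len(ranked_lists)

def recall_at_k(ranked_lists, relevant, k):
    """Fraction of queries whose relevant item appears in the top k."""
    hits = sum(rel in preds[:k] for preds, rel in zip(ranked_lists, relevant))
    return hits / len(ranked_lists)

# Toy data: two documents, each with a ranked list of candidate themes
# and the single theme that is actually correct.
ranked = [["t1", "t2", "t3"], ["t2", "t1", "t3"]]
truth = ["t1", "t3"]
print(mrr(ranked, truth), recall_at_k(ranked, truth, 2))  # → 0.666... 0.5
```

MAP and NDCG follow the same pattern, additionally weighting hits by their position in the ranking.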

### Usage

```bash
python metrics.py CLASSIFIED_TOPICS_B10CLEAN_BM25.csv
```




