Commit af876b0: Update README.md (gregoriofsg authored, Sep 10, 2024; parent 8ec3ddd)

1. Clone the repository:
```bash
git clone https://github.com/AILAB-CEFET-RJ/r2t
cd r2t/src
```
2. Create a virtual environment (optional but recommended):
```bash
python -m venv venv
source venv/bin/activate  # On Linux/macOS
venv\Scripts\activate     # On Windows
```
3. Install the required dependencies:
```bash
pip install -r requirements.txt
```

## Usage

### Step 1: Generating Embeddings

Use createEmbedding.py to generate embeddings for both legal documents (Recursos Especiais) and themes.

- For legal documents:
  ```bash
  python createEmbedding.py REsp_completo.csv recurso recurso --clean --begin_point cabimento -v
  ```
- For themes:
  ```bash
  python createEmbedding.py temas_repetitivos.csv tema tema --clean -v
  ```
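The later steps load the embeddings from the generated .pkl files. The row layout below, a list of (doc_id, text, vector) tuples, is only an assumed illustration and not the documented output of createEmbedding.py; adjust it to whatever structure you actually find when unpickling.

```python
import pickle
import tempfile

# Toy stand-in for an embeddings file: the real layout written by
# createEmbedding.py is not documented here, so this (doc_id, text, vector)
# row format is an assumption for illustration only.
rows = [
    ("REsp-1", "texto do recurso", [0.1, 0.2, 0.3]),
    ("REsp-2", "outro recurso", [0.4, 0.5, 0.6]),
]

with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as fh:
    pickle.dump(rows, fh)
    path = fh.name

# Inspect the file the same way a downstream script would load it.
with open(path, "rb") as fh:
    loaded = pickle.load(fh)

print(len(loaded), len(loaded[0][2]))  # number of rows, embedding dimension
```

Peeking at a .pkl this way is a quick check that the file holds what the next step expects before running a long pipeline stage.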

### Step 2: Summarizing Documents

Once embeddings are generated, you can summarize the documents using createTopics.py with one of the summarization methods.

```bash
python createTopics.py <corpus_embedding> <size> <type> [--verbose] [--seed_list <seed_list>] [<model>]
```

### Parameters
* corpus_embedding: Path to the corpus embeddings file (.pkl file).
* size: Number of sentences or topics to summarize.
* type: Type of topic generation:
* B: Bertopic
* G: Guided Bertopic
* L: Lexrank
* X: Guided Lexrank
* --verbose: Increase the verbosity of the process.
* --seed_list: Path to the seed list (required for type G or X).
* model: Sentence-BERT model used to generate embeddings (optional; default: distiluse-base-multilingual-cased-v1).

### Examples

* Topic generation with BERTopic:
  ```bash
  python createTopics.py corpus.pkl 10 B
  ```
* Topic generation with Guided BERTopic:
  ```bash
  python createTopics.py corpus.pkl 10 G --seed_list seeds.csv
  ```
* Summary generation with LexRank:
  ```bash
  python createTopics.py corpus.pkl 5 L
  ```
* Summary generation with Guided LexRank:
  ```bash
  python createTopics.py corpus.pkl 5 X --seed_list seeds.csv
  ```
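LexRank (and its guided variant) scores sentences by how central they are in a graph whose edges are cosine similarities between sentence embeddings. The sketch below shows that core idea with toy 2-D vectors; the damping factor, iteration count, and lack of a similarity threshold are illustrative assumptions, not the settings the scripts in this repository use.

```python
import numpy as np

def lexrank(embeddings, damping=0.85, iters=50):
    """Score sentences by centrality in their cosine-similarity graph
    (the core idea behind LexRank; parameters here are illustrative)."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-normalize rows
    sim = X @ X.T                                      # cosine similarity matrix
    np.fill_diagonal(sim, 0.0)                         # ignore self-similarity
    P = sim / sim.sum(axis=1, keepdims=True)           # row-stochastic transitions
    n = len(X)
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):                             # power iteration
        scores = (1 - damping) / n + damping * (P.T @ scores)
    return scores

# Toy vectors: sentence 1 is similar to both neighbors, sentence 2 is an outlier.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
scores = lexrank(emb)
print(int(np.argmax(scores)))  # → 1 (the most central sentence)
```

A summary of size N then simply keeps the N highest-scoring sentences; the guided variant additionally biases the scores toward sentences that match the seed list.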

### Step 3: Calculating Similarity

After summarizing the documents, use calcSimilarity.py to compute the similarity between the document summaries and the themes.

```bash
python calcSimilarity.py <corpus_file> <themes_file> <rank> <type>
```

### Parameters

* corpus_file: Path to the corpus file in pickle format.
* themes_file: Path to the themes file in pickle format.
* rank: Number of top results to retrieve.
* type: Type of similarity:
  * B: Use the BM25 method for the similarity calculation.
  * C: Use the Cosine Similarity method for the similarity calculation.

### Usage
For text-based similarity (using BM25):
```bash
python calcSimilarity.py <corpus_file> <themes_file> <rank> B
```
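For intuition about what the B option computes, here is a minimal, self-contained Okapi BM25 ranker over toy Portuguese snippets. The k1/b values are the common textbook defaults and the whitespace tokenizer is a simplification; calcSimilarity.py may use different parameters and preprocessing.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25
    (k1/b are the usual defaults; illustrative only)."""
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    df = Counter(t for d in tokenized for t in set(d))  # document frequency per term
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for t in query.lower().split():
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = ["recurso especial sobre dano moral",
        "tema repetitivo de direito tributario",
        "dano moral em contrato de consumo"]
print(bm25_scores("dano moral", docs))  # docs 0 and 2 score > 0, doc 1 scores 0
```

Sorting the themes by these scores and keeping the top `rank` entries mirrors what the similarity step produces for each document summary.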

### Output
The program generates a CSV file with the similarity results.
The file is named CLASSIFIED_<corpus_name>_<METHOD>.csv, where <METHOD> is BM25 or COSINE, depending on the similarity method used.

Example output:
* For BM25: CLASSIFIED_TOPICS_L10CLEAN_BM25.csv
* For Cosine Similarity: CLASSIFIED_TOPICS_L10CLEAN_COSINE.csv

### Notes
* Ensure the input files are in pickle format and contain the expected structure.
* The rank parameter determines how many of the top-ranked similar items are retrieved and included in the output.


### Step 4: Evaluating Performance

Finally, use metrics.py to calculate metrics and evaluate the system’s performance.
It computes metrics such as Recall, F1-Score, MAP (Mean Average Precision), NDCG (Normalized Discounted Cumulative Gain), and MRR (Mean Reciprocal Rank) based on the provided classified data.
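Two of those metrics are easy to state in a few lines. The sketch below computes MRR and Recall@k from toy (ranked predictions, relevant theme) pairs; metrics.py derives the equivalent inputs from the classified CSV, and the theme names here are made up for illustration.

```python
def mrr(ranked_lists, relevant):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit."""
    total = 0.0
    for preds, rel in zip(ranked_lists, relevant):
        for i, p in enumerate(preds, start=1):
            if p == rel:
                total += 1.0 / i
                break
    return total / len(ranked_lists)

def recall_at_k(ranked_lists, relevant, k):
    """Fraction of queries whose relevant item appears in the top k."""
    hits = sum(rel in preds[:k] for preds, rel in zip(ranked_lists, relevant))
    return hits / len(ranked_lists)

# Toy data: two documents, each with a ranked list of candidate themes
# and the single theme that is actually correct.
ranked = [["t1", "t2", "t3"], ["t2", "t1", "t3"]]
truth = ["t1", "t3"]
print(mrr(ranked, truth), recall_at_k(ranked, truth, 2))  # → 0.666... 0.5
```

MAP and NDCG follow the same pattern, additionally weighting hits by their position in the ranking.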

### Usage

```bash
python metrics.py CLASSIFIED_TOPICS_B10CLEAN_BM25.csv
```




