Add graph strategy #2772
base: main
Conversation
license-eye has checked 1044 files in total.

| Valid | Invalid | Ignored | Fixed |
|---|---|---|---|
| 931 | 1 | 112 | 0 |

Invalid file list:
- nucliadb/src/nucliadb/search/search/graph_strategy.py
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:

```diff
@@           Coverage Diff            @@
##             main    #2772    +/-  ##
========================================
  Coverage   87.13%   87.14%
========================================
  Files         386      387      +1
  Lines       24269    24394    +125
========================================
+ Hits        21148    21257    +109
- Misses       3121     3137     +16
```
```python
if _type not in FIELD_TYPE_STR_TO_PB:
    # Try to parse the enum value
    # XXX: This is to support field types that are integer values of FieldType
    # Which is how legacy processor relations reported the paragraph_id
    try:
        type_pb = FieldType.ValueType(int(_type))
    except ValueError:
        raise ValueError(f"Invalid FieldId: {_type}")
```
ouch, really? So on old entities, the paragraph ids in the metadata are coming with the int value?
```python
@classmethod
def parse_field_type(cls, _type: str) -> str:
    if _type not in FIELD_TYPE_STR_TO_PB:
        # Try to parse the enum value
        # XXX: This is to support field types that are integer values of FieldType
        # Which is how legacy processor relations reported the paragraph_id
        try:
            type_pb = FieldType.ValueType(int(_type))
        except ValueError:
            raise ValueError(f"Invalid FieldId: {_type}")
        if type_pb in FIELD_TYPE_PB_TO_STR:
            return FIELD_TYPE_PB_TO_STR[type_pb]
        else:
            raise ValueError(f"Invalid FieldId: {_type}")
    return _type
```
Can you add some unit tests for this?
in this module for example: https://github.com/nuclia/nucliadb/blob/main/nucliadb/tests/nucliadb/unit/common/test_ids.py#L61
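For reference, a minimal sketch of what such unit tests could look like, assuming `parse_field_type` is a classmethod on `FieldId` in `nucliadb.common.ids`, that the mappings live in the same module, and that `"t"` is one of the short string codes in `FIELD_TYPE_STR_TO_PB`; the exact names and paths may differ:

```python
import pytest

from nucliadb.common.ids import FIELD_TYPE_PB_TO_STR, FieldId
from nucliadb_protos.resources_pb2 import FieldType


def test_parse_field_type_keeps_known_string_codes():
    # Short codes already present in FIELD_TYPE_STR_TO_PB pass through unchanged
    assert FieldId.parse_field_type("t") == "t"


def test_parse_field_type_accepts_legacy_int_values():
    # Legacy processor relations report the integer enum value instead of the short code
    legacy = str(int(FieldType.TEXT))
    assert FieldId.parse_field_type(legacy) == FIELD_TYPE_PB_TO_STR[FieldType.TEXT]


@pytest.mark.parametrize("invalid", ["not-a-type", "9999"])
def test_parse_field_type_rejects_unknown_values(invalid):
    # Non-numeric strings and out-of-range enum values should both be rejected
    with pytest.raises(ValueError):
        FieldId.parse_field_type(invalid)
```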
```python
- hops=2 will explore the neighbors of the neighbors of the starting entities.
And so on.
Bigger values will discover more intricate relationships but will also take more time to compute.""",
ge=1,
```
Should we put a max hops here? like, 10 or 20?
Something that still takes a reasonable amount of time to compute on a big graph -- let's say a minute.
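One way to do that would be an `le` constraint next to the existing `ge`; a minimal sketch, where the model name, default, and the limit of 10 are illustrative rather than benchmarked values:

```python
from pydantic import BaseModel, Field


class GraphStrategy(BaseModel):  # hypothetical model name, for illustration only
    hops: int = Field(
        default=3,  # assumed default, shown only to make the sketch self-contained
        title="Hops",
        description="Number of hops to explore from the starting entities.",
        ge=1,
        le=10,  # cap the exploration depth so large graphs stay within a reasonable compute time
    )
```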
```python
default=25,
title="Top k",
description="Number of relationships to keep after each hop after ranking them by relevance to the query. This number correlates to more paragraphs being sent as context.",
ge=1,
```
Same here, I think we should put an upper bound for the top_k
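The same mechanism would work here; a quick sketch of how Pydantic enforces such a bound at request-parsing time (the model name and the limit of 100 are assumptions):

```python
from pydantic import BaseModel, Field, ValidationError


class GraphStrategy(BaseModel):  # hypothetical model name, for illustration only
    top_k: int = Field(default=25, ge=1, le=100)


GraphStrategy(top_k=50)  # accepted
try:
    GraphStrategy(top_k=5000)
except ValidationError as exc:
    print(exc)  # rejected: 5000 exceeds the upper bound
```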
```diff
@@ -1381,6 +1422,7 @@ class AskRequest(AuditMetadataBase):
     - `neighbouring_paragraphs` will add the surrounding paragraphs to the context for each matching paragraph.
     - `metadata_extension` will add the metadata of the matching paragraphs or its resources to the context.
     - `prequeries` allows to run multiple retrieval queries before the main query and add the results to the context. The results of specific queries can be boosted by specifying weights.
+    - `graph` will retrieve context pieces by exploring the Knowledge Graph, starting from the entities present in the query.
```
You should mention clearly that, as it is now, the graph RAG strategy is not compatible with at least the `prequeries` strategy, right?
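For context, a sketch of what an `/ask` request enabling the strategy might look like, assuming the strategy name `graph` and the `hops`/`top_k` knobs shown in the diff; the other field values are illustrative, and combining it with `prequeries` in the same `rag_strategies` list would presumably be rejected until the strategies are made compatible:

```python
# Hypothetical request body; field names other than hops/top_k are assumptions
ask_body = {
    "query": "How is entity A related to entity B?",
    "rag_strategies": [
        {"name": "graph", "hops": 2, "top_k": 25},
    ],
}
```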
```python
parsed_paragraph_id = ParagraphId.from_string(paragraph_id)
return TextBlockMatch(
    paragraph_id=parsed_paragraph_id,
    score=0,
```
why is the bm25 score from the graph index is not used? are we assuming we're always reranking with an llm with the graph rag strategy?
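If the graph index does report a score per relation, one option would be to thread it through instead of hard-coding 0; a minimal sketch, where `GraphRelation` and its `score` attribute are hypothetical stand-ins for whatever the index actually returns:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GraphRelation:  # hypothetical shape of a single graph index hit
    paragraph_id: str
    score: Optional[float] = None


def text_block_score(relation: GraphRelation) -> float:
    # Prefer the index score when available; 0 keeps the current behaviour of
    # relying entirely on downstream (LLM) reranking
    return relation.score if relation.score is not None else 0.0
```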
Description
Creates the first version of the graph strategy.
Tasks done:
How was this PR tested?
Added an integration test and a unit test.