Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add graph strategy #2772

Open
wants to merge 14 commits into
base: main
Choose a base branch
from
Open

Add graph strategy #2772

wants to merge 14 commits into from

Conversation

carlesonielfa
Copy link
Contributor

@carlesonielfa carlesonielfa commented Jan 10, 2025

Description

Creates the graph strategy on its first version.

  • More details on the Feature Request document
  • Additional questions on the relevant slack conversation

Tasks done:

  • Basic 1 hop implementation
  • Use fuzzy matching for first query entities
  • Implement multihop
  • Gracefully handle errors
  • Add metrics timing
  • Tests

How was this PR tested?

Added integration test and unit test

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

license-eye has totally checked 1044 files.

Valid Invalid Ignored Fixed
931 1 112 0
Click to see the invalid file list
  • nucliadb/src/nucliadb/search/search/graph_strategy.py

Copy link

codecov bot commented Jan 10, 2025

Codecov Report

Attention: Patch coverage is 87.40741% with 17 lines in your changes missing coverage. Please review.

Project coverage is 87.14%. Comparing base (200c7bd) to head (332da3c).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...liadb/src/nucliadb/search/search/graph_strategy.py 85.26% 14 Missing ⚠️
nucliadb/src/nucliadb/common/ids.py 76.92% 3 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##             main    #2772    +/-   ##
========================================
  Coverage   87.13%   87.14%            
========================================
  Files         386      387     +1     
  Lines       24269    24394   +125     
========================================
+ Hits        21148    21257   +109     
- Misses       3121     3137    +16     
Flag Coverage Δ
nucliadb 74.75% <55.03%> (-0.24%) ⬇️
nucliadb-ingest 40.71% <31.78%> (-0.07%) ⬇️
nucliadb-reader 42.90% <29.45%> (-0.09%) ⬇️
nucliadb-search 43.85% <69.76%> (+0.17%) ⬆️
nucliadb-standalone 47.42% <29.45%> (-0.12%) ⬇️
nucliadb-train 45.62% <29.45%> (-0.11%) ⬇️
nucliadb-writer 46.10% <31.78%> (-0.10%) ⬇️
nucliadb_dataset 55.45% <ø> (ø)
nucliadb_models 85.87% <100.00%> (+0.05%) ⬆️
nucliadb_sdk 80.11% <ø> (ø)
nucliadb_sidecar 89.03% <ø> (ø)
nucliadb_telemetry 86.56% <ø> (ø)
nucliadb_utils 83.77% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@carlesonielfa carlesonielfa changed the title WIP: Add graph strategy Add graph strategy Jan 14, 2025
@carlesonielfa carlesonielfa marked this pull request as ready for review January 14, 2025 10:15
Comment on lines +130 to +137
if _type not in FIELD_TYPE_STR_TO_PB:
# Try to parse the enum value
# XXX: This is to support field types that are integer values of FieldType
# Which is how legacy processor relations reported the paragraph_id
try:
type_pb = FieldType.ValueType(int(_type))
except ValueError:
raise ValueError(f"Invalid FieldId: {_type}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ouch, really? So on old entities, the paragraph ids in the metadata are coming with the int value?

Comment on lines +128 to +143
@classmethod
def parse_field_type(cls, _type: str) -> str:
if _type not in FIELD_TYPE_STR_TO_PB:
# Try to parse the enum value
# XXX: This is to support field types that are integer values of FieldType
# Which is how legacy processor relations reported the paragraph_id
try:
type_pb = FieldType.ValueType(int(_type))
except ValueError:
raise ValueError(f"Invalid FieldId: {_type}")
if type_pb in FIELD_TYPE_PB_TO_STR:
return FIELD_TYPE_PB_TO_STR[type_pb]
else:
raise ValueError(f"Invalid FieldId: {_type}")
return _type

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some unit tests for this?

in this module for example: https://github.com/nuclia/nucliadb/blob/main/nucliadb/tests/nucliadb/unit/common/test_ids.py#L61

- hops=2 will explore the neighbors of the neighbors of the starting entities.
And so on.
Bigger values will discover more intricate relationships but will also take more time to compute.""",
ge=1,
Copy link
Contributor

@lferran lferran Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we put a max hops here? like, 10 or 20?
Something that still takes a reasonable amount of time to compute on a big graph -- let's say a minute.

default=25,
title="Top k",
description="Number of relationships to keep after each hop after ranking them by relevance to the query. This number correlates to more paragraphs being sent as context.",
ge=1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, I think we should put an upper bound for the top_k

@@ -1381,6 +1422,7 @@ class AskRequest(AuditMetadataBase):
- `neighbouring_paragraphs` will add the sorrounding paragraphs to the context for each matching paragraph.
- `metadata_extension` will add the metadata of the matching paragraphs or its resources to the context.
- `prequeries` allows to run multiple retrieval queries before the main query and add the results to the context. The results of specific queries can be boosted by the specifying weights.
- `graph` will retrieve context pieces by exploring the Knowledge Graph, starting from the entities present in the query.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should mention clearly that, as it is now, graph rag strategy is not compatible at least with prequeries strategy right?

parsed_paragraph_id = ParagraphId.from_string(paragraph_id)
return TextBlockMatch(
paragraph_id=parsed_paragraph_id,
score=0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the bm25 score from the graph index is not used? are we assuming we're always reranking with an llm with the graph rag strategy?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants