Add graph strategy #2772

carlesonielfa · 2025-01-10T14:39:43Z

Description

Creates the graph strategy on its first version.

More details on the Feature Request document
Additional questions on the relevant slack conversation

Tasks done:

How was this PR tested?

Added integration test and unit test

github-actions

license-eye has totally checked 1044 files.

Valid	Invalid	Ignored	Fixed
931	1	112	0

Click to see the invalid file list

nucliadb/src/nucliadb/search/search/graph_strategy.py

nucliadb/src/nucliadb/search/search/graph_strategy.py

codecov · 2025-01-10T14:52:33Z

Codecov Report

Attention: Patch coverage is 87.40741% with 17 lines in your changes missing coverage. Please review.

Project coverage is 87.14%. Comparing base (200c7bd) to head (332da3c).
Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
...liadb/src/nucliadb/search/search/graph_strategy.py	85.26%	14 Missing ⚠️
nucliadb/src/nucliadb/common/ids.py	76.92%	3 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff            @@
##             main    #2772    +/-   ##
========================================
  Coverage   87.13%   87.14%            
========================================
  Files         386      387     +1     
  Lines       24269    24394   +125     
========================================
+ Hits        21148    21257   +109     
- Misses       3121     3137    +16

Flag	Coverage Δ
nucliadb	`74.75% <55.03%> (-0.24%)`	⬇️
nucliadb-ingest	`40.71% <31.78%> (-0.07%)`	⬇️
nucliadb-reader	`42.90% <29.45%> (-0.09%)`	⬇️
nucliadb-search	`43.85% <69.76%> (+0.17%)`	⬆️
nucliadb-standalone	`47.42% <29.45%> (-0.12%)`	⬇️
nucliadb-train	`45.62% <29.45%> (-0.11%)`	⬇️
nucliadb-writer	`46.10% <31.78%> (-0.10%)`	⬇️
nucliadb_dataset	`55.45% <ø> (ø)`
nucliadb_models	`85.87% <100.00%> (+0.05%)`	⬆️
nucliadb_sdk	`80.11% <ø> (ø)`
nucliadb_sidecar	`89.03% <ø> (ø)`
nucliadb_telemetry	`86.56% <ø> (ø)`
nucliadb_utils	`83.77% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

lferran · 2025-01-15T05:54:05Z

nucliadb/src/nucliadb/common/ids.py

+        if _type not in FIELD_TYPE_STR_TO_PB:
+            # Try to parse the enum value
+            # XXX: This is to support field types that are integer values of FieldType
+            # Which is how legacy processor relations reported the paragraph_id
+            try:
+                type_pb = FieldType.ValueType(int(_type))
+            except ValueError:
+                raise ValueError(f"Invalid FieldId: {_type}")


ouch, really? So on old entities, the paragraph ids in the metadata are coming with the int value?

lferran · 2025-01-15T05:54:09Z

nucliadb/src/nucliadb/common/ids.py

+    @classmethod
+    def parse_field_type(cls, _type: str) -> str:
+        if _type not in FIELD_TYPE_STR_TO_PB:
+            # Try to parse the enum value
+            # XXX: This is to support field types that are integer values of FieldType
+            # Which is how legacy processor relations reported the paragraph_id
+            try:
+                type_pb = FieldType.ValueType(int(_type))
+            except ValueError:
+                raise ValueError(f"Invalid FieldId: {_type}")
+            if type_pb in FIELD_TYPE_PB_TO_STR:
+                return FIELD_TYPE_PB_TO_STR[type_pb]
+            else:
+                raise ValueError(f"Invalid FieldId: {_type}")
+        return _type
+


Can you add some unit tests for this?

in this module for example: https://github.com/nuclia/nucliadb/blob/main/nucliadb/tests/nucliadb/unit/common/test_ids.py#L61

lferran · 2025-01-15T05:58:18Z

nucliadb_models/src/nucliadb_models/search.py

+- hops=2 will explore the neighbors of the neighbors of the starting entities.
+And so on.
+Bigger values will discover more intricate relationships but will also take more time to compute.""",
+        ge=1,


Should we put a max hops here? like, 10 or 20?
Something that still takes a reasonable amount of time to compute on a big graph -- let's say a minute.

lferran · 2025-01-15T06:00:15Z

nucliadb_models/src/nucliadb_models/search.py

+        default=25,
+        title="Top k",
+        description="Number of relationships to keep after each hop after ranking them by relevance to the query. This number correlates to more paragraphs being sent as context.",
+        ge=1,


Same here, I think we should put an upper bound for the top_k

lferran · 2025-01-15T06:01:40Z

nucliadb_models/src/nucliadb_models/search.py

@@ -1381,6 +1422,7 @@ class AskRequest(AuditMetadataBase):
 - `neighbouring_paragraphs` will add the sorrounding paragraphs to the context for each matching paragraph.
 - `metadata_extension` will add the metadata of the matching paragraphs or its resources to the context.
 - `prequeries` allows to run multiple retrieval queries before the main query and add the results to the context. The results of specific queries can be boosted by the specifying weights.
+- `graph` will retrieve context pieces by exploring the Knowledge Graph, starting from the entities present in the query.


You should mention clearly that, as it is now, graph rag strategy is not compatible at least with prequeries strategy right?

lferran · 2025-01-15T06:13:11Z

nucliadb/src/nucliadb/search/search/find_merge.py

+    parsed_paragraph_id = ParagraphId.from_string(paragraph_id)
+    return TextBlockMatch(
+        paragraph_id=parsed_paragraph_id,
+        score=0,


why is the bm25 score from the graph index is not used? are we assuming we're always reranking with an llm with the graph rag strategy?

carlesonielfa added 6 commits January 9, 2025 08:28

Fix NIDX relation metadata indexing

0a82876

Fix also old index

3f9e22c

Add unit tests

f08312e

Add unit tests to nidx

19faab6

Graph Strategy begins

63f00d5

Merge remote-tracking branch 'origin/main' into add-graph-strategy

be93a6c

github-actions bot reviewed Jan 10, 2025

View reviewed changes

nucliadb/src/nucliadb/search/search/graph_strategy.py Show resolved Hide resolved

carlesonielfa and others added 7 commits January 13, 2025 09:59

Use suggest for fuzzy matching entities instead of predict

17261aa

Multi hop

ddb0d65

Refactor + fix tests

65865dd

Refactor

098fdff

Merge branch 'main' into add-graph-strategy

85d079f

Better testing

7583e11

remove todos

667db65

carlesonielfa changed the title ~~WIP: Add graph strategy~~ Add graph strategy Jan 14, 2025

carlesonielfa marked this pull request as ready for review January 14, 2025 10:15

More coverage

332da3c

lferran reviewed Jan 15, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add graph strategy #2772

Add graph strategy #2772

carlesonielfa commented Jan 10, 2025 •

edited

Loading

github-actions bot left a comment

codecov bot commented Jan 10, 2025 •

edited

Loading

lferran Jan 15, 2025

lferran Jan 15, 2025

lferran Jan 15, 2025 •

edited

Loading

lferran Jan 15, 2025

lferran Jan 15, 2025

lferran Jan 15, 2025

Add graph strategy #2772

Are you sure you want to change the base?

Add graph strategy #2772

Conversation

carlesonielfa commented Jan 10, 2025 • edited Loading

Description

How was this PR tested?

github-actions bot left a comment

Choose a reason for hiding this comment

codecov bot commented Jan 10, 2025 • edited Loading

Codecov Report

lferran Jan 15, 2025

Choose a reason for hiding this comment

lferran Jan 15, 2025

Choose a reason for hiding this comment

lferran Jan 15, 2025 • edited Loading

Choose a reason for hiding this comment

lferran Jan 15, 2025

Choose a reason for hiding this comment

lferran Jan 15, 2025

Choose a reason for hiding this comment

lferran Jan 15, 2025

Choose a reason for hiding this comment

carlesonielfa commented Jan 10, 2025 •

edited

Loading

codecov bot commented Jan 10, 2025 •

edited

Loading

lferran Jan 15, 2025 •

edited

Loading