Flow PHP - SEAL Adapter #2408
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## 1.x #2408 +/- ##
============================================
- Coverage 85.11% 85.11% -0.01%
- Complexity 0 21198 +21198
============================================
Files 1599 1597 -2
Lines 65334 65401 +67
============================================
+ Hits 55612 55668 +56
- Misses 9722 9733 +11 🚀 New features to boost your workflow:
|
|
@MrHDOLEK as we discussed offline, please find some feedback and guidance below. Extractor / LoaderIn general I like the idea of having one extractor that accepts Seal EngineInterface which is aligned with how Doctrine Adapter works right now. Same goes for the loaders. It would also work very well with frameworks as seal provides a bundle for symfony that literally configures engine so to use it with flow it would be a simple matter of dependency injection. What to testI had to think about it a bit more but I dont think we should retest every backend that seal is testing. Doctrine adapter is testing MySQL, PostgreSql nad Sqlite because flow-php/docrtine-dbal-bulk which extends doctrine default behaviors that needs to be tested. At this time seal provides following adapters:
I would chose one of them (that requires the least configuration and is the easiest to setup in docker) and cover it with proper integration tests that would confirm that moving data from flow to seal works as expected. How to testSo in general we want to cover in tests that seal adapter can handle all types of flow Entries. You can achieve that by using one of the two extractors: They both expose static method SchemaConverterSchema converters are recursive algorithms that can convert schema in both directiosn Flow to Seal and Seal to Flow. You can find some inspirations here: It might be the easiest to use LLM to help you create one based on those 3 examples for Seal (it's a recursive brain damaging exercise that might not be worth spending time on). Of course schema converters would need a DSL method. Search Engine in TestsSo in this PR you are using traits to configure backends in tests, which is fine but there is a different pattern which I found cleaner and easier to maintain. Contexts. Here is a good example of DatabaseContext that is used in DatabaseTableListCommandTest If there will be more than one integration tests you can extract an abstract In case of any questions, you know where to find me 😁 |
norberttech
left a comment
There was a problem hiding this comment.
Looks much better! I left some comments, nothing critical, but there is one gap related to Delete operation.
Please let me know if you have any questions about those comments!
0cd33ec to
7bcf17c
Compare
norberttech
left a comment
There was a problem hiding this comment.
Looks awesome @MrHDOLEK !
If you feel that work is completed, and this adapter is now usable I'm happy to merge it 🚀
|
@norberttech I found an issue while testing with the 20k dataset. SEAL cannot read data beyond 10K from elastic because it only supports offset-based pagination, and Elasticsearch blocks offsets outside the 10K result window. To work around this problem, I added cursor-based pagination. |
That's excellent finding! Does other engines also comes with such limitations? Is this keyset pagination going to work only with Elastic or all seal adapters? @alexander-schranz if by any chance you could take a look at this integration, your feedback would be greatly appreciated 🙏 |
| $search = $this->engine->createSearchBuilder($this->index); | ||
|
|
||
| if ($this->searchBuilder !== null) { | ||
| ($this->searchBuilder)($search); | ||
| } | ||
|
|
||
| $search->addSortBy($cursorField, $direction->value)->limit($this->pageSize)->offset(0); |
|
|
||
| foreach ($search->getResult() as $document) { | ||
| $documents[] = $document; | ||
| $cursor = $this->cursorValue($document, $cursorField); |
There was a problem hiding this comment.
Shouldn't this value be only check when count is equal page size? Cause otherwise it will not be used anyway yet it will check 999 times value with page limit of 1k and only 999 results, no?
Algolia — 1 000 (paginationLimitedTo, max 20k) |
In general pagination is a complex problem, especially over large datasets. Limit Offset Pagination - usually works up to 5-10k rows I guess the biggest question is if SEAL would let us now build something that would leverage KeySet pagination without any changes in SEAL or maybe some adjustments are first required there? |
|
@norberttech I can just give a quick response here without having a deeper look at the Pull Request yet. Limitations on Elasticsearch can be controlled by Elasticsearch settings, so if somebody want to fetch more they can increase it on Elasticsearch instance config. There are some discussions to allow configure more settings via DSN params, but that will take a longer time. While there is an issue around adding a
Are you in control of the Schema and the Data? If I understand the KeySet Pagination correctly you would require additional field or a value which increases. If document id is integer maybe sort by id, else a datetime field or something which can be sorted by and is possible unique or atleast not exist more as 250 (typesense default limit). @norberttech How are you handling KeySet Pagination in doctrine if the identifier are not auto incremented like uuids? |
Thank you!
That's unfortunately not acceptable solution in data processing world, it does not scale and it's also not very complex to hit that limit even when the scale is not massive + there is a limit of how much you can increase those limits up to. So
Yes, KeySet pagination requires something you can sort by, it can be either auto-incremented id, Uuid V7, additional timestamp column.
Of course, if you don't have a control over it, you can't build an reliable ETL pipeline. |
The Elasticsearch and Meilisearch adapters are replaced by a single SEAL adapter
(
flow-php/etl-adapter-seal) built on top of SEAL— a PHP Search Engine Abstraction Layer. One adapter now covers Elasticsearch, OpenSearch,
Meilisearch, Algolia, Solr, Typesense, RediSearch and Loupe: the user builds a
CmsIg\Seal\EngineInterfaceand passes it to the DSL, exactly like the PostgreSQL/Doctrineadapters take a client. Tests are backend-agnostic (Memory adapter) and prove every Flow
Entry type survives a round-trip through the adapter.
Resolves: #2168
Change Log
Added
flow-php/etl-adapter-seal) — a single, strongly typed integration for SEAL search engines (Elasticsearch, OpenSearch, Meilisearch, Algolia, Solr, Typesense, RediSearch, Loupe)from_seal()extractor andto_seal()loader, working with any SEALEngineInterfaceto_seal_schema()andseal_schema_to_flow()DSL — recursive, bi-directional Flow Schema ↔ SEAL Schema conversion (nested structures, lists and maps)seal_create_index(),seal_drop_index(),seal_create_schema()andseal_drop_schema()DSL index-lifecycle helpersFixed
Changed
Removed
flow-php/etl-adapter-elasticsearch) — superseded by the SEAL adapter; use SEAL with the Elasticsearch backend (cmsig/seal-elasticsearch-adapter)Deprecated
Security