Make search independent of word order and fuzzy #2335

jecorn · 2023-04-05T09:02:23Z

jecorn
Apr 5, 2023

Before submitting this feature request I have

Reviewed the documentation and verified that my feature request isn't already a feature.
Reviewed the existing feature requests and my feature has not been requested.
Checked the issues tagged as tasks and verified that my feature is not covered

Please Describe The Problem To Be Solved
Recipe search is currently highly dependent on search phrasing and is non-fuzzy. For example, "pinto beans" and "beans pinto" are different searches, and "eggs over easy" is different than "egg over easy". See #2325

Suggest A Solution
Levenshtein distance and trigram search are exposed through different function calls in sqlite and Postgres, which makes this a bit complicated to do through the sqlalchemy interface. @fleshgolem therefore suggested a two-stage fix:

1. Split all search terms on whitespace and combine them into a single AND query, so that the search does not depend on word order
2. Implement actual fuzzy search to account for typos etc.

This is safe, sane and makes a lot of sense.
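
As a rough illustration of stage 1, the sketch below splits the search on whitespace and ANDs one LIKE clause per term. It uses Python's built-in sqlite3 module rather than mealie's sqlalchemy models. The `recipes` table, its `name` column, and the helpers `build_and_query` and `search_recipes` are made up for the example, and LIKE wildcards in user input are not escaped.

```python
import sqlite3


def build_and_query(search: str) -> tuple[str, list[str]]:
    """Turn a free-text search into an order-independent AND of LIKE clauses."""
    tokens = search.lower().split()
    # One LIKE clause per token; every token must match, in any order.
    where = " AND ".join("lower(name) LIKE ?" for _ in tokens) or "1=1"
    params = [f"%{token}%" for token in tokens]
    return f"SELECT name FROM recipes WHERE {where} ORDER BY name", params


# Tiny in-memory database standing in for the real recipe table (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recipes (name TEXT)")
conn.executemany(
    "INSERT INTO recipes VALUES (?)",
    [("Pinto Beans with Rice",), ("Black Bean Soup",), ("Eggs Over Easy",)],
)


def search_recipes(search: str) -> list[str]:
    """Run the AND query and return the matching recipe names."""
    sql, params = build_and_query(search)
    return [row[0] for row in conn.execute(sql, params)]


print(search_recipes("pinto beans"))
print(search_recipes("beans pinto"))
```

Both calls return the same recipe because each term is matched separately. Substring matching also lets "egg" find "Eggs", but it does nothing for typos, which is what stage 2 is for.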

However, I also did some benchmarking with RapidFuzz (https://maxbachmann.github.io/RapidFuzz/) to test the crazy idea of just pulling the relevant database fields from all recipes and then fuzzy matching against everything to figure out what matches. This would be database independent and could easily be written in Python inside the search function, but I expected it to be slower. To my surprise, RapidFuzz is fast enough that this approach stays quick even with a large dataset.

I started with a test set of 5 different recipes and made a "meta string" of the "name", "description", "tags" and "recipeIngredient:note" fields for each. Running 10'000 iterations of a four-word search against these meta strings (i.e. simulating 5 separate 10'000-recipe databases) using even the slowest RapidFuzz algorithms (partial_token_set_ratio and partial_token_sort_ratio) takes on average 0.1 seconds per 10'000-recipe search. These algorithms tokenise the strings and then either sort the tokens or reduce them to a set, which makes the comparison independent of word order and removes repeated words. The fuzzy matching was excellent (although stop words should be removed). A one-word search is even faster, at an average of 0.05 seconds per 10'000-recipe search.
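
To make the meta-string idea concrete, here is a small stand-in that uses only the standard library (difflib), not RapidFuzz, so it is not the benchmarked code and will be much slower. In a real test the scorer would be swapped for `rapidfuzz.fuzz.partial_token_set_ratio` or `partial_token_sort_ratio`. The recipe dicts, the `ingredient_notes` key, and the function names are assumptions for the example.

```python
from difflib import SequenceMatcher

# Hypothetical recipes; "ingredient_notes" stands in for the recipeIngredient note field.
RECIPES = [
    {"name": "Pinto Beans", "description": "slow cooked beans with cumin",
     "tags": ["mexican"], "ingredient_notes": ["dried"]},
    {"name": "Eggs Over Easy", "description": "fried eggs with runny yolk",
     "tags": ["breakfast"], "ingredient_notes": ["large"]},
]


def meta_string(recipe: dict) -> str:
    """Join the searchable fields of one recipe into a single lowercase string."""
    parts = [recipe.get("name", ""), recipe.get("description", "")]
    parts += recipe.get("tags", [])
    parts += recipe.get("ingredient_notes", [])
    return " ".join(parts).lower()


def fuzzy_score(query: str, meta: str) -> float:
    """Score in [0, 100]: mean of each query token's best match among the meta tokens."""
    # Sets drop repeated words; sorting makes the result independent of word order.
    query_tokens = sorted(set(query.lower().split()))
    meta_tokens = sorted(set(meta.lower().split()))
    if not query_tokens or not meta_tokens:
        return 0.0
    best = [
        max(SequenceMatcher(None, q, m).ratio() for m in meta_tokens)
        for q in query_tokens
    ]
    return 100 * sum(best) / len(best)


def best_match(query: str, recipes: list[dict]) -> str:
    """Return the name of the recipe whose meta string scores highest."""
    return max(recipes, key=lambda r: fuzzy_score(query, meta_string(r)))["name"]


print(best_match("beans pinto", RECIPES))
print(best_match("pnto beens", RECIPES))
```

The token-set step handles word order and the per-token similarity handles small typos.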

Breaking the search up into a tiered search over individual fields could be even faster, if real-life usage suggests that the meta-string approach is problematic.
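
One possible shape for such a tiered search is sketched below: try the cheapest field first and only fall through to the next field when nothing scores above a threshold. The field order, the threshold of 80, the sample recipes, and the difflib-based scorer are all assumptions for illustration, not measured choices.

```python
from difflib import SequenceMatcher

# Hypothetical tier order: shortest and most relevant field first.
TIERS = ("name", "description", "notes")

RECIPES = [
    {"name": "Pinto Beans", "description": "slow cooked with cumin", "notes": "dried"},
    {"name": "Veggie Chili", "description": "pinto beans and peppers", "notes": "canned"},
]


def score(query: str, text: str) -> float:
    """Order-independent score in [0, 100] (stdlib stand-in for a RapidFuzz scorer)."""
    query_tokens = sorted(set(query.lower().split()))
    text_tokens = sorted(set(text.lower().split()))
    if not query_tokens or not text_tokens:
        return 0.0
    best = [
        max(SequenceMatcher(None, q, t).ratio() for t in text_tokens)
        for q in query_tokens
    ]
    return 100 * sum(best) / len(best)


def tiered_search(query: str, recipes: list[dict], threshold: float = 80.0):
    """Return (field, matching names) for the first tier that produces any hit."""
    for field in TIERS:
        hits = [r["name"] for r in recipes if score(query, r.get(field, "")) >= threshold]
        if hits:
            return field, hits  # stop early: later tiers are never scanned
    return None, []


print(tiered_search("beans pinto", RECIPES))
print(tiered_search("cumin", RECIPES))
```

The trade-off is that a strong match in a later tier is never seen once an earlier tier has produced a hit.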

Additional Information

If the feature is accepted I'm willing to submit a PR to provide this feature
Sort of. I have never developed inside docker and am only passingly familiar with the mealie codebase. But I could share the inner code for a potential RapidFuzz-based change with someone who has more experience with the mealie code integration workflow (e.g. testing).

michael-genson · 2023-04-05T13:56:42Z

michael-genson
Apr 5, 2023
Maintainer

If you want to give it a shot, here's the dev getting started guide:
https://nightly.mealie.io/contributors/developers-guide/starting-dev-server/

And here's the core search logic: https://github.com/hay-kot/mealie/blob/de4debe74963ef48f057588c666ca2e214ab4cde/mealie/repos/repository_recipes.py#L153

hay-kot · 2023-04-05T19:36:13Z

hay-kot
Apr 5, 2023
Maintainer

> However, I also did some benchmarking with RapidFuzz (https://maxbachmann.github.io/RapidFuzz/) to test the crazy idea of just pulling relevant database fields from all recipes and then fuzzing against everything to figure out what matches. This would be database independent and could be easily pythonized inside the search function, but possibly slower. I was surprised to instead find out that RapidFuzz is blazingly fast and this is actually very fast with even a large dataset.

Probably want to include what hardware you're bench testing on. Lots of people run this on Pis, which I think will totally tank any feasibility here, but that's just a guess. Bet it runs great on my M1 though!
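
A small harness along these lines would let timings be posted together with the machine they came from. It uses only the standard library; `hardware_summary`, `time_search`, the toy corpus, and the difflib scorer are placeholders, and the scorer would be replaced by whatever is actually being benchmarked.

```python
import os
import platform
import timeit
from difflib import SequenceMatcher


def hardware_summary() -> dict:
    """Collect basic facts about the machine so benchmark numbers can be compared."""
    return {
        "system": platform.system(),
        "machine": platform.machine(),  # e.g. an ARM or x86 identifier
        "cpu_count": os.cpu_count(),
        "python": platform.python_version(),
    }


def time_search(search_fn, query: str, corpus: list[str], repeats: int = 3) -> float:
    """Best wall-clock time in seconds for scoring the query against the whole corpus."""
    def run():
        return [search_fn(query, doc) for doc in corpus]
    return min(timeit.repeat(run, number=1, repeat=repeats))


def placeholder_scorer(query: str, doc: str) -> float:
    """Stand-in scorer; swap in the function under test."""
    return SequenceMatcher(None, query, doc).ratio()


corpus = ["pinto beans slow cooked with cumin"] * 1000  # toy corpus, not real data
seconds = time_search(placeholder_scorer, "beans pinto", corpus)
print(hardware_summary())
print(f"{seconds:.4f} s per {len(corpus)}-recipe search")
```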

3 replies

jecorn Apr 6, 2023
Author

Definitely a potential issue. But RapidFuzz uses a C++ backend for Levenshtein, which is the same algorithm as Postgres levenshtein() and sqlite editdist3(). Levenshtein doesn't benefit from an index, so there's no db advantage there. Trigram is fast with an index, but there's still the complication of maintaining two solutions with entirely different models and syntax for Postgres vs sqlite. So I (somewhat naively) think that the main overheads for RapidFuzz, compared with having the database perform the search, are pulling all records and creating the meta-string.
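
For anyone following along, the two measures can be sketched in plain Python. This is only meant to show why they behave differently: Levenshtein has to run a full comparison for every pair of strings, while trigram sets can be computed once per string, which is what makes them indexable. The padding in `trigrams` is loosely modelled on how pg_trgm pads words and is an assumption here, not a copy of either database's implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def trigrams(word: str) -> set[str]:
    """Set of 3-character windows over the lowercased word, padded with spaces."""
    padded = f"  {word.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (Jaccard similarity)."""
    set_a, set_b = trigrams(a), trigrams(b)
    return len(set_a & set_b) / len(set_a | set_b)


print(levenshtein("eggs", "egg"))
print(trigram_similarity("eggs", "egg"))
```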

I should find a way to simulate a Pi for testing...

fleshgolem Apr 6, 2023

You can reduce the amount of memory and CPU the app can use when you run it in a docker container. The last time I had an issue that seemed to be a problem on low-spec hardware, I set it to something like 512M RAM and half a core:
https://docs.docker.com/config/containers/resource_constraints/

jecorn Apr 6, 2023
Author

And one can set this within the vscode docker container environment? Sorry for the naive question. This is my first time developing inside a container and I am a bit unclear on some of the stackexchange answers about passing docker arguments inside vscode.

michael-genson · 2024-08-21T19:04:16Z

michael-genson
Aug 21, 2024
Maintainer

This has since been implemented
