Elasticsearch 7.* and 8.* integration. OpenSearch integration. #469

Open
wants to merge 21 commits into
base: main

ivanmrsulja
Member

@ivanmrsulja ivanmrsulja commented Jun 10, 2024

What does this pull request do?

Updates the current Elasticsearch 6.x integration to work with Elasticsearch 7.x and 8.x, and with OpenSearch.

What's new?

Changes to ResponseParser and to the Elasticsearch notes document (Elasticsearch_notes_on_the_first_draft.md).

Example:

  • Changed src/main/java/edu/cornell/mannlib/vitro/webapp/searchengine/elasticsearch/ResponseParser.java to be in line with the current ES API
  • Updated src/main/java/edu/cornell/mannlib/vitro/webapp/searchengine/elasticsearch/Elasticsearch_notes_on_the_first_draft.md with the new mapping
  • Updated example.applicationSetup.n3 to show an ES setup example

How should this be tested?

Initial setup

  • Install elasticsearch/opensearch somewhere.
  • Create a search index with the appropriate mapping (see below).
  • Check out VIVO and this branch of Vitro (see below), and do the usual installation procedure.
  • Modify {vitro_home}/config/applicationSetup.n3 to use this driver (see below).
  • Modify the vitro.local.searchengine.url configuration property to contain the ES index base URL. (For backward compatibility, Solr can still be configured using vitro.local.solr.url, but this logs a warning advising a switch to the new configuration parameter.)
  • Modify the vitro.local.searchengine.username configuration property to contain the ES/OS basic auth username
  • Modify the vitro.local.searchengine.password configuration property to contain the ES/OS basic auth password
  • Start elasticsearch/opensearch
  • Start VIVO
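
The property lookup described in the steps above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the class SearchEngineConfigSketch and its methods are hypothetical, and only the four property names are taken from the description. The fallback-with-warning behaviour mirrors what the steps describe for vitro.local.solr.url.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Properties;
import java.util.logging.Logger;

/** Hypothetical helper, not part of Vitro: sketches the property lookup only. */
public class SearchEngineConfigSketch {
    private static final Logger LOG = Logger.getLogger("SearchEngineConfigSketch");

    static final String URL_KEY = "vitro.local.searchengine.url";
    static final String LEGACY_SOLR_URL_KEY = "vitro.local.solr.url";
    static final String USER_KEY = "vitro.local.searchengine.username";
    static final String PASSWORD_KEY = "vitro.local.searchengine.password";

    /** Prefer the new key; fall back to the legacy Solr key and log a warning. */
    static String resolveBaseUrl(Properties props) {
        String url = props.getProperty(URL_KEY);
        if (url != null && !url.trim().isEmpty()) {
            return url.trim();
        }
        String legacy = props.getProperty(LEGACY_SOLR_URL_KEY);
        if (legacy != null && !legacy.trim().isEmpty()) {
            LOG.warning(LEGACY_SOLR_URL_KEY + " is deprecated; please use " + URL_KEY);
            return legacy.trim();
        }
        throw new IllegalStateException("No search engine URL configured; set " + URL_KEY);
    }

    /** Builds an HTTP Basic Authorization header value, or null if no username is set. */
    static String basicAuthHeader(Properties props) {
        String user = props.getProperty(USER_KEY);
        if (user == null || user.isEmpty()) {
            return null;
        }
        String password = props.getProperty(PASSWORD_KEY, "");
        byte[] raw = (user + ":" + password).getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(raw);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Example value only; the real URL depends on your installation.
        props.setProperty(URL_KEY, "http://localhost:9200/vivo");
        System.out.println(resolveBaseUrl(props));
    }
}
```

Returning null when no username is configured lets a caller skip the Authorization header entirely for clusters that run without basic auth.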

A mapping for the search index

curl -X PUT "localhost:9200/vivo?pretty" -H 'Content-Type: application/json' -d'
{
  "settings":{
    "index":{
      "analysis":{
        "tokenizer":{
          "keyword_tokenizer":{
            "type":"keyword"
          },
          "whitespace_tokenizer":{
            "type":"whitespace"
          }
        },
        "filter":{
          "lowercase_filter":{
            "type":"lowercase"
          },
          "edgengram_filter":{
            "type":"edge_ngram",
            "min_gram":2,
            "max_gram":25
          },
          "word_delimiter_filter":{
            "type":"word_delimiter",
            "generate_word_parts":true,
            "generate_number_parts":true,
            "catenate_words":false,
            "catenate_numbers":false,
            "catenate_all":false,
            "split_on_case_change":true
          },
          "porter_stem_filter":{
            "type":"snowball",
            "language":"English"
          }
        },
        "analyzer":{
          "default":{
            "type":"english"
          },
          "edgengram_untokenized":{
            "type":"custom",
            "tokenizer":"keyword_tokenizer",
            "filter":[
              "lowercase_filter",
              "edgengram_filter"
            ]
          },
          "edgengram_untokenized_query":{
            "type":"custom",
            "tokenizer":"keyword_tokenizer",
            "filter":[
              "lowercase_filter"
            ]
          },
          "edgengram_stemmed":{
            "type":"custom",
            "tokenizer":"whitespace_tokenizer",
            "filter":[
              "word_delimiter_filter",
              "lowercase_filter",
              "porter_stem_filter",
              "edgengram_filter"
            ]
          },
          "edgengram_stemmed_query":{
            "type":"custom",
            "tokenizer":"whitespace_tokenizer",
            "filter":[
              "word_delimiter_filter",
              "lowercase_filter",
              "porter_stem_filter"
            ]
          },
          "sort_field_analyzer":{
            "type":"custom",
            "tokenizer":"keyword",
            "filter":[
              "lowercase"
            ]
          }
        }
      }
    }
  },
  "mappings":{
    "dynamic_templates":[
      {
        "field_sort_template":{
          "match":"*_label_sort",
          "mapping":{
            "type":"text",
            "fields":{
              "keyword":{
                "type":"keyword"
              }
            },
            "fielddata":true,
            "analyzer":"sort_field_analyzer"
          }
        }
      },
      {
        "field_ss_template":{
          "match":"*_ss",
          "mapping":{
            "type":"text",
            "fields":{
              "keyword":{
                "type":"keyword",
                "ignore_above":256
              }
            },
            "fielddata":true
          }
        }
      },
      {
        "date_range_template":{
          "match":"*_drsim",
          "mapping":{
            "type":"date_range",
            "format":"strict_date_optional_time||epoch_millis"
          }
        }
      }
    ],
    "properties":{
      "ALLTEXT":{
        "type":"text",
        "analyzer":"english",
        "fields":{
          "keyword":{
            "type":"keyword",
            "ignore_above":256
          }
        }
      },
      "ALLTEXTUNSTEMMED":{
        "type":"text",
        "analyzer":"standard"
      },
      "DocId":{
        "type":"keyword"
      },
      "classgroup":{
        "type":"keyword"
      },
      "type":{
        "type":"keyword"
      },
      "mostSpecificTypeURIs":{
        "type":"keyword"
      },
      "indexedTime":{
        "type":"long"
      },
      "nameRaw":{
        "type":"keyword"
      },
      "URI":{
        "type":"keyword"
      },
      "THUMBNAIL":{
        "type":"integer"
      },
      "THUMBNAIL_URL":{
        "type":"keyword"
      },
      "nameLowercaseSingleValued":{
        "type":"text",
        "analyzer":"standard",
        "fielddata":true
      },
      "BETA":{
        "type":"float"
      },
      "acNameUntokenized":{
        "type":"text",
        "analyzer":"edgengram_untokenized",
        "search_analyzer":"edgengram_untokenized_query"
      },
      "acNameStemmed":{
        "type":"text",
        "analyzer":"edgengram_stemmed",
        "search_analyzer":"edgengram_stemmed_query"
      }
    }
  }
}
'
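
To make the autocomplete part of this mapping more concrete, here is a small sketch of what the edgengram_untokenized analyzer does to a value at index time: the keyword tokenizer keeps the whole input as one token, lowercase_filter lowercases it, and edgengram_filter emits its prefixes of length 2 to 25. The class EdgeNgramSketch is hypothetical and only approximates the analysis chain (it works on Java chars rather than Unicode code points); it does not call Elasticsearch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration of keyword tokenizer + lowercase + edge_ngram. */
public class EdgeNgramSketch {

    /** Returns the lowercased prefixes of input with lengths minGram..maxGram. */
    static List<String> edgeNgrams(String input, int minGram, int maxGram) {
        // Keyword tokenizer: the whole input is a single token.
        String token = input.toLowerCase(Locale.ROOT);
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, token.length());
        // Tokens shorter than minGram produce no grams at all.
        for (int len = minGram; len <= longest; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        // min_gram and max_gram values are the ones used in the mapping above.
        System.out.println(edgeNgrams("Smith", 2, 25)); // prints [sm, smi, smit, smith]
    }
}
```

Because the search-side analyzer (edgengram_untokenized_query) only lowercases the typed text, a partial input such as "smi" is matched as-is against the prefixes stored at index time.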

Modify applicationSetup.n3

  • Change this (it is already changed in this PR):
# ----------------------------
#
# Search engine module: 
#    The Solr-based implementation is the only standard option, but it can be
#    wrapped in an "instrumented" wrapper, which provides additional logging 
#    and more rigorous life-cycle checking.
#

:instrumentedSearchEngineWrapper 
    a   <java:edu.cornell.mannlib.vitro.webapp.searchengine.InstrumentedSearchEngineWrapper> , 
        <java:edu.cornell.mannlib.vitro.webapp.modules.searchEngine.SearchEngine> ;
    :wraps :solrSearchEngine .

  • To this:
# ----------------------------
#
# Search engine module: 
#    The Solr-based implementation is the only standard option, but it can be
#    wrapped in an "instrumented" wrapper, which provides additional logging 
#    and more rigorous life-cycle checking.
#

:instrumentedSearchEngineWrapper 
    a   <java:edu.cornell.mannlib.vitro.webapp.searchengine.InstrumentedSearchEngineWrapper> , 
        <java:edu.cornell.mannlib.vitro.webapp.modules.searchEngine.SearchEngine> ;
    :wraps :elasticSearchEngine .

:elasticSearchEngine
    a   <java:edu.cornell.mannlib.vitro.webapp.searchengine.elasticsearch.ElasticSearchEngine> ,
        <java:edu.cornell.mannlib.vitro.webapp.modules.searchEngine.SearchEngine> .

Your setup should now be complete 😃! After this, perform the usual manual tests that are done for every new release.

Interested parties

@chenejac

Reviewers' expertise

Candidates for reviewing this PR should have expertise in some of the following:

  1. Java
  2. Elasticsearch 7.* or 8.*

@ivanmrsulja ivanmrsulja marked this pull request as draft June 11, 2024 07:11
@chenejac chenejac marked this pull request as ready for review June 18, 2024 13:54
@ivanmrsulja ivanmrsulja changed the title Small mapping update and response parsing fix. Elasticsearch 7.* and 8.* integration. Jun 24, 2024
@ivanmrsulja ivanmrsulja changed the title Elasticsearch 7.* and 8.* integration. Elasticsearch 7.* and 8.* integration. OpenSearch integration. Jul 5, 2024
@chenejac
Contributor

chenejac commented Oct 8, 2024

The following features should be tested:

  • search form - searching, filtering and sorting (changing localization)
  • list of research, persons, org units - alphabetical index (changing localization)
  • lookup/autocompletion at some forms - for instance adding author to a publication (changing localization)
  • visualizations - map of science, collaboration network
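
Before testing lookup/autocompletion through the UI, it can help to query the index directly. The sketch below only builds the HTTP request for a match query on the acNameUntokenized field from the mapping in the PR description. It assumes Java 11+ (java.net.http) and an index URL such as http://localhost:9200/vivo. The class AutocompleteSmokeTestSketch is hypothetical and is not how Vitro itself issues queries.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical smoke-test helper: builds, but does not send, a _search request. */
public class AutocompleteSmokeTestSketch {

    /** Minimal JSON string escaping (backslash and double quote only). */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Builds a simple match query body for one field. */
    static String matchQuery(String field, String text) {
        return "{\"query\":{\"match\":{\"" + escape(field) + "\":\"" + escape(text) + "\"}}}";
    }

    /** Builds a POST request to the index's _search endpoint. */
    static HttpRequest buildRequest(String indexBaseUrl, String typedText) {
        String base = indexBaseUrl.endsWith("/")
                ? indexBaseUrl.substring(0, indexBaseUrl.length() - 1)
                : indexBaseUrl;
        return HttpRequest.newBuilder(URI.create(base + "/_search"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        matchQuery("acNameUntokenized", typedText)))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildRequest("http://localhost:9200/vivo", "smi");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

To actually run the check, the request can be sent with java.net.http.HttpClient against a running cluster; if basic auth is enabled, an Authorization header would need to be added as well.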

@chenejac chenejac linked an issue Oct 29, 2024 that may be closed by this pull request
@chenejac chenejac requested a review from litvinovg November 7, 2024 16:00
@chenejac
Contributor

@ivanmrsulja please create a VIVO PR with an updated example.runtime.properties. Also, please move the JSON configuration into the vivo-es project, add a Dockerfile to that project, and update its README file to explain how ES should be run.

chenejac
chenejac previously approved these changes Dec 5, 2024
Contributor

@chenejac chenejac left a comment

@ivanmrsulja basic VIVO search functionality works for me. I didn't review the code. The instructions in the PR description for setting up the Elasticsearch index could be replaced with a pointer to the vivo-es README file.

@ivanmrsulja ivanmrsulja requested a review from chenejac December 9, 2024 10:31
3 participants