Support ollama #1326

Open · wants to merge 8 commits into base: main

Conversation


@Smpests Smpests commented Oct 26, 2024

Description

Adds support for Ollama, which serves free, locally hosted LLMs, for users who need to run against a local API.

Related Issues

None

Proposed Changes

1. Add an ollama package in graphrag.llm (a rough sketch of the idea follows below);
2. Add an ollama package in graphrag.query.llm;
3. Supporting changes to make the above work:
   - graphrag.llm.openai.utils.py moved to graphrag.llm.utils.py;
   - new types added to graphrag.config.enums.LLMType;
   ...
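For reviewers, here is a rough sketch of the idea behind the new chat wrapper. It is illustrative only (the class and method names are made up for this example, not the exact code in the PR) and assumes the official ollama Python client:

# Illustrative sketch only -- not the exact class added in this PR.
# Assumes the official ollama Python client (pip install ollama).
import ollama


class OllamaChatLLM:
    """Thin async wrapper around an Ollama chat model."""

    def __init__(self, model: str, api_base: str = "http://localhost:11434"):
        self._model = model
        self._client = ollama.AsyncClient(host=api_base)

    async def chat(self, prompt: str, history: list[dict] | None = None) -> str:
        # Ollama uses the familiar chat-message format: [{"role": ..., "content": ...}]
        messages = (history or []) + [{"role": "user", "content": prompt}]
        response = await self._client.chat(model=self._model, messages=messages)
        # The assistant reply is available under response["message"]["content"].
        return response["message"]["content"]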

Checklist

  • [x] I have tested these changes locally.
  • [x] I have reviewed the code changes.
  • [ ] I have updated the documentation (if necessary).
  • [ ] I have added appropriate unit tests (if applicable).

Additional Notes

Following https://microsoft.github.io/graphrag/get_started with a 24,720-character book.txt, both graphrag index and graphrag query passed.

Part of settings.yaml:

llm:
  type: ollama_chat # or azure_openai_chat
  model: llama3.1:8b
  model_supports_json: false # recommended if this is available for your model.
  max_tokens: 12800
  api_base: http://localhost:11434
  concurrent_requests: 2 # the number of parallel inflight requests that may be made

embeddings:
  llm:
    type: ollama_embedding # or azure_openai_embedding
    model: nomic-embed-text:latest
    api_base: http://localhost:11434
    concurrent_requests: 2 # the number of parallel inflight requests that may be made
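
For the embeddings side the idea is the same; a minimal sketch, again with illustrative names and the official ollama client rather than the exact PR code:

# Illustrative sketch only -- not the exact class added in this PR.
import ollama


class OllamaEmbeddingLLM:
    """Thin wrapper that turns texts into embedding vectors via Ollama."""

    def __init__(self, model: str = "nomic-embed-text:latest",
                 api_base: str = "http://localhost:11434"):
        self._model = model
        self._client = ollama.Client(host=api_base)

    def embed(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            # embeddings() returns a dict-like response with an "embedding" key.
            result = self._client.embeddings(model=self._model, prompt=text)
            vectors.append(result["embedding"])
        return vectors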

@Smpests Smpests requested review from a team as code owners October 26, 2024 13:30
@Smpests Smpests commented Oct 26, 2024

@Smpests please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

@microsoft-github-policy-service agree

@Smpests Smpests changed the title from "Feature/ollama support" to "Support ollama" on Oct 26, 2024
@JoedNgangmeni

I get the following error when running this fork:

'ollama_chat' is not a valid LLMType
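For reference, that message is the ValueError Python's Enum raises when a configured string has no matching member, so it usually means the graphrag install being run still has the upstream enum. A quick check (assuming the enum lives at graphrag.config.enums, as in the Enums.py snippet later in this thread):

# Quick check that the installed graphrag actually has the new enum members.
from graphrag.config.enums import LLMType

print(LLMType("ollama_chat"))   # resolves on this branch
print(LLMType("bogus_value"))   # raises ValueError: 'bogus_value' is not a valid LLMType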

@JoedNgangmeni JoedNgangmeni commented Oct 28, 2024

I've updated my yaml and llm type files but am now getting this error:
[screenshot]

How do you make sure ollama models are actually being run? I think that is the main issue.
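One way to confirm that the Ollama server is reachable and that the expected models are actually pulled and loaded is to query its REST API directly (ollama list and ollama ps on the CLI show the same information); a small sketch, assuming the default endpoint:

# Sanity-check the Ollama server: which models are pulled, and which are loaded.
import requests

api_base = "http://localhost:11434"

# GET /api/tags lists the locally available (pulled) models.
tags = requests.get(f"{api_base}/api/tags", timeout=5).json()
print("available:", [m["name"] for m in tags.get("models", [])])

# GET /api/ps lists the models currently loaded in memory.
ps = requests.get(f"{api_base}/api/ps", timeout=5).json()
print("running:", [m["name"] for m in ps.get("models", [])])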

stats.json

#####################################################
YAML: 
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: ollama_chat # or azure_openai_chat
  model: llama3.2:latest
  model_supports_json: true # recommended if this is available for your model.
  # audience: "https://cognitiveservices.azure.com/.default"
  max_tokens: 12800
  # request_timeout: 180.0
  api_base: http://localhost:11434
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  concurrent_requests: 2 # the number of parallel inflight requests that may be made
  # temperature: 0 # temperature for sampling
  # top_p: 1 # top-p sampling
  # n: 1 # Number of completions to generate

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  # target: required # or all
  # batch_size: 16 # the number of documents to send in a single request
  # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
  vector_store:
    type: lancedb
    db_uri: 'output/lancedb'
    collection_name: entity_description_embeddings
    overwrite: true
  # vector_store: # configuration for AI Search
    # type: azure_ai_search
    # url: <ai_search_endpoint>
    # api_key: <api_key> # if not set, will attempt to use managed identity. Expects the `Search Index Data Contributor` RBAC role in this case.
    # audience: <optional> # if using managed identity, the audience to use for the token
    # overwrite: true # or false. Only applicable at index creation time
    # collection_name: <collection_name> # the name of the collection to use
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: ollama_embedding # or azure_openai_embedding
    model: mxbai-embed-large:latest
    api_base: http://localhost:11434
    # api_version: 2024-02-15-preview
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    concurrent_requests: 2 # the number of parallel inflight requests that may be made
#####################################################
From Enums.py
class LLMType(str, Enum):
    """LLMType enum class definition."""

    # Embeddings
    OpenAIEmbedding = "openai_embedding"
    AzureOpenAIEmbedding = "azure_openai_embedding"
    OllamaEmbedding = "ollama_embedding"

    # Raw Completion
    OpenAI = "openai"
    AzureOpenAI = "azure_openai"

    # Chat Completion
    OpenAIChat = "openai_chat"
    AzureOpenAIChat = "azure_openai_chat"
    OllamaChat = "ollama_chat"


    # Debug
    StaticResponse = "static_response"

    def __repr__(self):
        """Get a string representation."""
        return f'"{self.value}"'

@Smpests Smpests commented Oct 28, 2024

I changed more than the two places you mentioned; you can check the files changed in this PR, or just use this branch: https://github.com/Smpests/graphrag-ollama/tree/feature/ollama-support.
Below are my results:
graphrag index --root ./ragtest
[screenshot]
graphrag query --root ./ragtest --method local --query "Who is Scrooge and what are his main relationships?"
[screenshot]

@JoedNgangmeni

Was the main.py file removed from graphrag.index on purpose? Its removal means --init and other args no longer work.

@JoedNgangmeni

Also, did you encounter any "Error Invoking LLM" errors? I'm not sure I'm invoking it right.

@Smpests Smpests commented Oct 29, 2024

Also, did you encounter any "Error Invoking LLM" errors? I'm not sure I'm invoking it right.

I didn't. You can try setting parallelization.num_threads to a lower value in settings.yaml, according to your machine; it defaults to 50.

@Smpests Smpests commented Oct 29, 2024

Was the main.py file removed from graphrag.index on purpose? Its removal means --init and other args no longer work.

Please check this issue: #1305
The old graphrag.index entry point is now the graphrag index command.

@JoedNgangmeni

I'm unsure why it's failing.

logs.json
indexing-engine.log

I think the "Error Invoking LLM" (even though I set the request timeout in the yaml to 12800.0) leads to the model's inability to create some reports and summaries.

Below is the output from running python -m graphrag query --query "who is scrooge?" --root ./ragtest --method global:

[screenshot]

Result from python -m graphrag --query "who is scrooge?" --root ./ragtest --method global:

[screenshot]

Result from python -m graphrag query "who is scrooge?" --root ./ragtest --method global:

[screenshot]

@Smpests Smpests commented Nov 2, 2024

Change your query command to python -m graphrag query --query "who is scrooge?" --root ./ragtest --method global

I checked your indexing-engine.log. Which model did you use? (Mine was llama3.1:8b.) Maybe your model's response is not valid JSON; you should debug step by step, or log the model response and check.

I've seen a similar error where the model responds with something like: Below is my answer: {"title": "xx"...}; I then added "Answer only JSON, without any other text." to the prompt.
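A complementary workaround, if prompt changes alone don't help, is to parse tolerantly on the caller's side and extract the first JSON object from the response before handing it to json.loads; a rough sketch (not part of this PR):

# Rough sketch of tolerant JSON extraction for chatty local models (not part of this PR).
import json


def extract_json(response_text: str) -> dict:
    """Return the first JSON object embedded in a model response.

    Handles replies such as: Below is my answer: {"title": "xx", ...}
    """
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model response")
    return json.loads(response_text[start:end + 1])


print(extract_json('Below is my answer: {"title": "xx", "rating": 7.5}'))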

@drcrallen

You can get ollama working with just settings.yaml with something like this:

llm:
  type: openai_chat
  model: qwen2.5:3b-16k
  batch_max_tokens: 8191
  max_tokens: 4000
  api_base: http://127.0.0.1:11434/v1
  max_retries: 3
  model_supports_json: false
  concurrent_requests: 1

Where qwen2.5:3b-16k is created like:

FROM qwen2.5:3b-instruct-q6_K
PARAMETER num_ctx 16384

For indexing, the problem I've encountered isn't that Ollama lacks API hooks; it's that the models produce very different results from OpenAI's API, many of which are not compatible. Of the models that fit on my (paltry) 8 GB card, qwen2.5 is the only one I've gotten to produce anything reasonable. Looking at this thread, it seems llama3.2 has worked for people? But are there any other models people have found successful without doing prompt engineering (i.e., using the stock prompts)?
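For anyone wanting to verify that setup outside graphrag first: the same /v1 endpoint can be exercised with the standard openai Python client. A minimal sketch, assuming the qwen2.5:3b-16k model has already been built with something like ollama create qwen2.5:3b-16k -f Modelfile:

# Minimal check of Ollama's OpenAI-compatible endpoint using the standard openai client.
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:11434/v1",  # Ollama's OpenAI-compatible API
    api_key="ollama",                      # required by the client, ignored by Ollama
)

response = client.chat.completions.create(
    model="qwen2.5:3b-16k",  # the custom model built from the Modelfile above
    messages=[{"role": "user", "content": "Reply with one short sentence."}],
    max_tokens=100,
)
print(response.choices[0].message.content)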

@JoedNgangmeni JoedNgangmeni commented Nov 5, 2024

I checked your indexing-engine.log. Which model did you use? (Mine was llama3.1:8b.) Maybe your model's response is not valid JSON; you should debug step by step, or log the model response and check.

I've seen a similar error where the model responds with something like: Below is my answer: {"title": "xx"...}; I then added "Answer only JSON, without any other text." to the prompt.

I was using llama3.2:latest. I will try running it with llama3.1:8b.

Does your response mean you changed the prompt document from their repo? I'm new to their repo and to this LLM space and didn't want to ruin anything. I ask because I think the model outputs a parquet file.

If your answer is yes, should I add --emit json to the prompt?

Also, if we want JSON outputs, how come your YAML file sets model_supports_json to false?

@Smpests Smpests commented Nov 5, 2024

I tried with llama3.2:latest (on my branch feature/ollama-support); this is my result (with 16,078 chars in input/book.txt):
indexing-engine.log
[screenshot]

Yes, I modified their prompt document (a simple guiding tip for llama3.1:8b, but I didn't change it for llama3.2:latest).
My settings.yaml sets model_supports_json to false because this parameter makes no sense for Ollama.
