
server : return stopping_word in the partial response #10720

Closed

Conversation

z80maniac
Contributor

This PR returns the stopping_word field in the partial API response.

Older versions of the server returned the full info at the end of the stream. Here is an example for 6acce39 (Dec 1), showing the last JSON line in streaming mode:

curl -Ss --data '{"prompt": "Alice: Ask me any question.\nBob: What color is the sky on", "n_predict": 8, "cache_prompt": true, "stop": ["?\n"], "seed": 42, "stream": true}' http://127.0.0.1:8080/completion | tail -2 | sed 's/^data: //' | jq
{
  "content": "",
  "id_slot": 0,
  "stop": true,
  "model": "/opt/models/text/Llama-3.2-3B-Instruct-Q8_0.gguf",
  "tokens_predicted": 3,
  "tokens_evaluated": 16,
  "generation_settings": {
    "n_ctx": 1024,
    "n_predict": -1,
    "model": "/opt/models/text/Llama-3.2-3B-Instruct-Q8_0.gguf",
    "seed": 42,
    "seed_cur": 42,
    "temperature": 0.800000011920929,
    "dynatemp_range": 0.0,
    "dynatemp_exponent": 1.0,
    "top_k": 40,
    "top_p": 0.949999988079071,
    "min_p": 0.05000000074505806,
    "xtc_probability": 0.0,
    "xtc_threshold": 0.10000000149011612,
    "typical_p": 1.0,
    "repeat_last_n": 64,
    "repeat_penalty": 1.0,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "dry_multiplier": 0.0,
    "dry_base": 1.75,
    "dry_allowed_length": 2,
    "dry_penalty_last_n": -1,
    "dry_sequence_breakers": [
      "\n",
      ":",
      "\"",
      "*"
    ],
    "mirostat": 0,
    "mirostat_tau": 5.0,
    "mirostat_eta": 0.10000000149011612,
    "penalize_nl": false,
    "stop": [
      "?\n"
    ],
    "max_tokens": 8,
    "n_keep": 0,
    "n_discard": 0,
    "ignore_eos": false,
    "stream": true,
    "n_probs": 0,
    "min_keep": 0,
    "grammar": "",
    "samplers": [
      "dry",
      "top_k",
      "typ_p",
      "top_p",
      "min_p",
      "xtc",
      "temperature"
    ],
    "speculative": false,
    "speculative.n_max": 16,
    "speculative.n_min": 5,
    "speculative.p_min": 0.8999999761581421
  },
  "prompt": "<|begin_of_text|>Alice: Ask me any question.\nBob: What color is the sky on",
  "has_new_line": false,
  "truncated": false,
  "stopped_eos": false,
  "stopped_word": true,
  "stopped_limit": false,
  "stopping_word": "?\n",
  "tokens_cached": 18,
  "timings": {
    "prompt_n": 16,
    "prompt_ms": 20.212,
    "prompt_per_token_ms": 1.26325,
    "prompt_per_second": 791.6089451810806,
    "predicted_n": 3,
    "predicted_ms": 27.915,
    "predicted_per_token_ms": 9.305,
    "predicted_per_second": 107.46910263299303
  },
  "index": 0
}
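
For reference, here's a minimal client-side sketch (not part of this PR; it assumes a server at http://127.0.0.1:8080 and the Python requests package) of how a streaming client reads the data: lines and keeps the final chunk, which is where stopping_word used to appear:

import json
import requests

payload = {
    "prompt": "Alice: Ask me any question.\nBob: What color is the sky on",
    "n_predict": 8,
    "cache_prompt": True,
    "stop": ["?\n"],
    "seed": 42,
    "stream": True,
}

# Read the SSE-style stream and keep the last JSON chunk.
last_chunk = None
with requests.post("http://127.0.0.1:8080/completion", json=payload, stream=True) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if line and line.startswith("data: "):
            last_chunk = json.loads(line[len("data: "):])

# With the old format the final chunk carried the full info:
print(last_chunk["stopping_word"])  # "?\n" when generation stopped on a stop word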

But with the new changes (6c5bc06), the server returns very little info.

The example for ecc93d0 (Dec 8):

{
  "index": 0,
  "content": "",
  "stop_type": "word",
  "stop": true,
  "id_slot": -1,
  "tokens_predicted": 3,
  "tokens_evaluated": 16,
  "timings": {
    "prompt_n": 16,
    "prompt_ms": 20.497,
    "prompt_per_token_ms": 1.2810625,
    "prompt_per_second": 780.6020393228277,
    "predicted_n": 3,
    "predicted_ms": 28.193,
    "predicted_per_token_ms": 9.397666666666668,
    "predicted_per_second": 106.40939240236938
  },
  "truncated": false
}

IMHO this is a pretty drastic backwards-incompatible change. However, I'm not sure whether the change was intentional, so this PR does not restore the full info in the output; it only makes a minimal change to fix one small use case: it returns the stopping_word field (the API change broke my API client, which expects this field to be present).

The API result with this PR applied:

{
  "index": 0,
  "content": "",
  "stop_type": "word",
  "stopping_word": "?\n",
  "stop": true,
  "id_slot": -1,
  "tokens_predicted": 3,
  "tokens_evaluated": 16,
  "timings": {
    "prompt_n": 1,
    "prompt_ms": 55.604,
    "prompt_per_token_ms": 55.604,
    "prompt_per_second": 17.984317674987413,
    "predicted_n": 3,
    "predicted_ms": 104.018,
    "predicted_per_token_ms": 34.672666666666665,
    "predicted_per_second": 28.841162106558482
  },
  "truncated": false
}
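
The client-side expectation this restores is roughly the following (a hypothetical check, not my actual client code); final_chunk is the parsed JSON of the last data: line, as in the sketch above:

def stopped_on_word(final_chunk: dict) -> bool:
    # Clients written against the old format index the field directly;
    # against the new minimal format this raises KeyError because
    # stopping_word is missing, even though stop_type is "word".
    return final_chunk["stopping_word"] != ""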

However, I think that in the long run it would be better to return all the info, as was done before.

@ngxson
Collaborator

ngxson commented Dec 8, 2024

Hmm ok, that is because I tried to use only server_task_result_cmpl_partial for stream responses, for consistency. The old implementation uses both _final and _partial for stream responses.

I'll fix it in another PR

@ngxson
Collaborator

ngxson commented Dec 8, 2024

I'm closing this because it's fixed by #10722:

curl -Ss --data '{"prompt": "Alice: Ask me any question.\nBob: What color is the sky on", "n_predict": 8, "cache_prompt": true, "stop": ["?\n"], "seed": 42, "stream": true}' http://127.0.0.1:8080/completion
data: {"index":0,"content":" the","stop":false,"id_slot":-1,"tokens_predicted":1,"tokens_evaluated":16}

data: {"index":0,"content":" planet","stop":false,"id_slot":-1,"tokens_predicted":2,"tokens_evaluated":16}

data: {"index":0,"content":" Mars","stop":false,"id_slot":-1,"tokens_predicted":3,"tokens_evaluated":16}

data: {"index":0,"content":"","stop":false,"id_slot":-1,"tokens_predicted":4,"tokens_evaluated":16,"timings":{"prompt_n":16,"prompt_ms":209.562,"prompt_per_token_ms":13.097625,"prompt_per_second":76.34971989196514,"predicted_n":4,"predicted_ms":62.882,"predicted_per_token_ms":15.7205,"predicted_per_second":63.611208294901566}}

data: {"index":0,"content":"","id_slot":0,"stop":true,"model":"gpt-3.5-turbo-0613","tokens_predicted":4,"tokens_evaluated":16,"generation_settings":{"n_predict":8,"seed":42,"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":-1,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"penalize_nl":false,"stop":["?\n"],"max_tokens":8,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":true,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","samplers":["dry","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.n_max":16,"speculative.n_min":5,"speculative.p_min":0.8999999761581421,"timings_per_token":false},"prompt":"<|begin_of_text|>Alice: Ask me any question.\nBob: What color is the sky on","has_new_line":0,"truncated":false,"stop_type":"word","stopping_word":"?\n","tokens_cached":19,"timings":{"prompt_n":16,"prompt_ms":209.562,"prompt_per_token_ms":13.097625,"prompt_per_second":76.34971989196514,"predicted_n":4,"predicted_ms":62.903,"predicted_per_token_ms":15.72575,"predicted_per_second":63.589971861437455}}

@ngxson ngxson closed this Dec 8, 2024