
server : return stopping_word in the partial response #10720

Closed

Conversation

z80maniac
Contributor

This PR returns the stopping_word field in the partial API response.

Older versions of the server returned the full info at the end of the stream. Here is an example for 6acce39 (Dec 1), showing the last JSON line in streaming mode:

curl -Ss --data '{"prompt": "Alice: Ask me any question.\nBob: What color is the sky on", "n_predict": 8, "cache_prompt": true, "stop": ["?\n"], "seed": 42, "stream": true}' http://127.0.0.1:8080/completion | tail -2 | sed 's/^data: //' | jq
{
  "content": "",
  "id_slot": 0,
  "stop": true,
  "model": "/opt/models/text/Llama-3.2-3B-Instruct-Q8_0.gguf",
  "tokens_predicted": 3,
  "tokens_evaluated": 16,
  "generation_settings": {
    "n_ctx": 1024,
    "n_predict": -1,
    "model": "/opt/models/text/Llama-3.2-3B-Instruct-Q8_0.gguf",
    "seed": 42,
    "seed_cur": 42,
    "temperature": 0.800000011920929,
    "dynatemp_range": 0.0,
    "dynatemp_exponent": 1.0,
    "top_k": 40,
    "top_p": 0.949999988079071,
    "min_p": 0.05000000074505806,
    "xtc_probability": 0.0,
    "xtc_threshold": 0.10000000149011612,
    "typical_p": 1.0,
    "repeat_last_n": 64,
    "repeat_penalty": 1.0,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "dry_multiplier": 0.0,
    "dry_base": 1.75,
    "dry_allowed_length": 2,
    "dry_penalty_last_n": -1,
    "dry_sequence_breakers": [
      "\n",
      ":",
      "\"",
      "*"
    ],
    "mirostat": 0,
    "mirostat_tau": 5.0,
    "mirostat_eta": 0.10000000149011612,
    "penalize_nl": false,
    "stop": [
      "?\n"
    ],
    "max_tokens": 8,
    "n_keep": 0,
    "n_discard": 0,
    "ignore_eos": false,
    "stream": true,
    "n_probs": 0,
    "min_keep": 0,
    "grammar": "",
    "samplers": [
      "dry",
      "top_k",
      "typ_p",
      "top_p",
      "min_p",
      "xtc",
      "temperature"
    ],
    "speculative": false,
    "speculative.n_max": 16,
    "speculative.n_min": 5,
    "speculative.p_min": 0.8999999761581421
  },
  "prompt": "<|begin_of_text|>Alice: Ask me any question.\nBob: What color is the sky on",
  "has_new_line": false,
  "truncated": false,
  "stopped_eos": false,
  "stopped_word": true,
  "stopped_limit": false,
  "stopping_word": "?\n",
  "tokens_cached": 18,
  "timings": {
    "prompt_n": 16,
    "prompt_ms": 20.212,
    "prompt_per_token_ms": 1.26325,
    "prompt_per_second": 791.6089451810806,
    "predicted_n": 3,
    "predicted_ms": 27.915,
    "predicted_per_token_ms": 9.305,
    "predicted_per_second": 107.46910263299303
  },
  "index": 0
}
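
For reference, here's a minimal client-side sketch (not part of this PR; it assumes a server at http://127.0.0.1:8080 and the Python requests package) of how a streaming client reads the data: lines and keeps the final chunk, which is where stopping_word used to appear:

import json
import requests

payload = {
    "prompt": "Alice: Ask me any question.\nBob: What color is the sky on",
    "n_predict": 8,
    "cache_prompt": True,
    "stop": ["?\n"],
    "seed": 42,
    "stream": True,
}

# Read the SSE-style stream and keep the last JSON chunk.
last_chunk = None
with requests.post("http://127.0.0.1:8080/completion", json=payload, stream=True) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if line and line.startswith("data: "):
            last_chunk = json.loads(line[len("data: "):])

# With the old format the final chunk carried the full info:
print(last_chunk["stopping_word"])  # "?\n" when generation stopped on a stop word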

But with the new changes (6c5bc06), the server returns very little info.

The example for ecc93d0 (Dec 8):

{
  "index": 0,
  "content": "",
  "stop_type": "word",
  "stop": true,
  "id_slot": -1,
  "tokens_predicted": 3,
  "tokens_evaluated": 16,
  "timings": {
    "prompt_n": 16,
    "prompt_ms": 20.497,
    "prompt_per_token_ms": 1.2810625,
    "prompt_per_second": 780.6020393228277,
    "predicted_n": 3,
    "predicted_ms": 28.193,
    "predicted_per_token_ms": 9.397666666666668,
    "predicted_per_second": 106.40939240236938
  },
  "truncated": false
}

IMHO this is a pretty drastic backwards-incompatible change. However, I'm not sure whether the change was intentional, so this PR does not restore the full info in the output; it only makes a minimal change to fix one small use case: it returns the stopping_word field (the API change broke my API client, which expects this field to be present).

The API result with this PR applied:

{
  "index": 0,
  "content": "",
  "stop_type": "word",
  "stopping_word": "?\n",
  "stop": true,
  "id_slot": -1,
  "tokens_predicted": 3,
  "tokens_evaluated": 16,
  "timings": {
    "prompt_n": 1,
    "prompt_ms": 55.604,
    "prompt_per_token_ms": 55.604,
    "prompt_per_second": 17.984317674987413,
    "predicted_n": 3,
    "predicted_ms": 104.018,
    "predicted_per_token_ms": 34.672666666666665,
    "predicted_per_second": 28.841162106558482
  },
  "truncated": false
}
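
The client-side expectation this restores is roughly the following (a hypothetical check, not my actual client code); final_chunk is the parsed JSON of the last data: line, as in the sketch above:

def stopped_on_word(final_chunk: dict) -> bool:
    # Clients written against the old format index the field directly;
    # against the new minimal format this raises KeyError because
    # stopping_word is missing, even though stop_type is "word".
    return final_chunk["stopping_word"] != ""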

However, I think that in the long run it would be better to return all the info, as was done before.

@ngxson
Collaborator

ngxson commented Dec 8, 2024

Hmm ok, that is because I tried to use only server_task_result_cmpl_partial for stream responses, for consistency. The old implementation uses both _final and _partial for stream responses.

I'll fix it in another PR

@ngxson
Collaborator

ngxson commented Dec 8, 2024

I'm closing this because it's fixed by #10722:

curl -Ss --data '{"prompt": "Alice: Ask me any question.\nBob: What color is the sky on", "n_predict": 8, "cache_prompt": true, "stop": ["?\n"], "seed": 42, "stream": true}' http://127.0.0.1:8080/completion
data: {"index":0,"content":" the","stop":false,"id_slot":-1,"tokens_predicted":1,"tokens_evaluated":16}

data: {"index":0,"content":" planet","stop":false,"id_slot":-1,"tokens_predicted":2,"tokens_evaluated":16}

data: {"index":0,"content":" Mars","stop":false,"id_slot":-1,"tokens_predicted":3,"tokens_evaluated":16}

data: {"index":0,"content":"","stop":false,"id_slot":-1,"tokens_predicted":4,"tokens_evaluated":16,"timings":{"prompt_n":16,"prompt_ms":209.562,"prompt_per_token_ms":13.097625,"prompt_per_second":76.34971989196514,"predicted_n":4,"predicted_ms":62.882,"predicted_per_token_ms":15.7205,"predicted_per_second":63.611208294901566}}

data: {"index":0,"content":"","id_slot":0,"stop":true,"model":"gpt-3.5-turbo-0613","tokens_predicted":4,"tokens_evaluated":16,"generation_settings":{"n_predict":8,"seed":42,"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":-1,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"penalize_nl":false,"stop":["?\n"],"max_tokens":8,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":true,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","samplers":["dry","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.n_max":16,"speculative.n_min":5,"speculative.p_min":0.8999999761581421,"timings_per_token":false},"prompt":"<|begin_of_text|>Alice: Ask me any question.\nBob: What color is the sky on","has_new_line":0,"truncated":false,"stop_type":"word","stopping_word":"?\n","tokens_cached":19,"timings":{"prompt_n":16,"prompt_ms":209.562,"prompt_per_token_ms":13.097625,"prompt_per_second":76.34971989196514,"predicted_n":4,"predicted_ms":62.903,"predicted_per_token_ms":15.72575,"predicted_per_second":63.589971861437455}}

@ngxson ngxson closed this Dec 8, 2024