# :speaking_head: aspeak
[![GitHub stars](https://img.shields.io/github/stars/kxxt/aspeak)](https://github.com/kxxt/aspeak/stargazers)
[![GitHub issues](https://img.shields.io/github/issues/kxxt/aspeak)](https://github.com/kxxt/aspeak/issues)
[![GitHub forks](https://img.shields.io/github/forks/kxxt/aspeak)](https://github.com/kxxt/aspeak/network)
[![GitHub license](https://img.shields.io/github/license/kxxt/aspeak)](https://github.com/kxxt/aspeak/blob/v6/LICENSE)
<a href="https://github.com/kxxt/aspeak/graphs/contributors" alt="Contributors">
<img src="https://img.shields.io/github/contributors/kxxt/aspeak" />
</a>
<a href="https://github.com/kxxt/aspeak/pulse" alt="Activity">
<img src="https://img.shields.io/github/commit-activity/m/kxxt/aspeak" />
</a>
A simple text-to-speech client for Azure TTS API. :laughing:
## Note
Starting from version 6.0.0, `aspeak` by default uses the RESTful API of Azure TTS. If you want to use the WebSocket API,
you can specify `--mode websocket` when invoking `aspeak` or set `mode = "websocket"` in the `auth` section of your profile.
Starting from version 4.0.0, `aspeak` is rewritten in Rust. The old Python version is available on the `python` branch.
You can sign up for an Azure account and then
[choose a payment plan as needed (or stick to free tier)](https://azure.microsoft.com/en-us/pricing/details/cognitive-services/speech-services/).
The free tier includes a quota of 0.5 million characters per month, free of charge.
Please refer to the [Authentication section](#authentication) to learn how to set up authentication for aspeak.
## Installation
### Download from GitHub Releases (Recommended for most users)
Download the latest release from [here](https://github.com/kxxt/aspeak/releases/latest).
After downloading, extract the archive and you will get a binary executable file.
You can put it in a directory that is in your `PATH` environment variable so that you can run it from anywhere.
### Install from AUR (Recommended for Arch Linux users)
From v4.1.0, you can install `aspeak-bin` from AUR.
### Install from PyPI
Installing from PyPI will also install the Python binding of `aspeak` for you. Check [Library Usage#Python](#python) for more information on using the Python binding.
```bash
pip install -U aspeak==@@ASPEAK_VERSION@@
```
Currently, prebuilt wheels are available only for the x86_64 architecture.
Due to some technical issues, I haven't uploaded the source distribution to PyPI yet,
so to build a wheel from source, you need to follow the instructions in [Install from Source](#install-from-source).
Because of manylinux compatibility issues, the wheels for Linux are not available on PyPI. (But you can still build them from source.)
### Install from Source
#### CLI Only
The easiest way to install `aspeak` from source is to use cargo:
```bash
cargo install aspeak -F binary
```
Alternatively, you can also install `aspeak` from AUR.
#### Python Wheel
To build the Python wheel, you need to install `maturin` first:
```bash
pip install maturin
```
After cloning the repository and changing into its directory, you can build the wheel by running:
```bash
maturin build --release --strip -F python --bindings pyo3 --interpreter python --manifest-path Cargo.toml --out dist-pyo3
maturin build --release --strip --bindings bin -F binary --interpreter python --manifest-path Cargo.toml --out dist-bin
bash merge-wheel.bash
```
If everything goes well, you will get a wheel file in the `dist` directory.
## Usage
Run `aspeak help` to see the help message.
Run `aspeak help <subcommand>` to see the help message of a subcommand.
### Authentication
The authentication options should be placed before any subcommand.
For example, to utilize your subscription key and
an official endpoint designated by a region,
run the following command:
```sh
$ aspeak --region <YOUR_REGION> --key <YOUR_SUBSCRIPTION_KEY> text "Hello World"
```
If you are using a custom endpoint, you can use the `--endpoint` option instead of `--region`.
To avoid repetition, you can store your authentication details
in your aspeak profile.
Read the following section for more details.
From v5.2.0, you can also set the authentication secrets via the following environment variables:
- `ASPEAK_AUTH_KEY` for authentication using subscription key
- `ASPEAK_AUTH_TOKEN` for authentication using authorization token
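
When `aspeak` is driven from a script, the environment variable route keeps the key out of the command line. The following is a minimal sketch, not part of `aspeak` itself: `build_invocation` and `speak` are hypothetical helpers, and it assumes that `--region` can be combined with `ASPEAK_AUTH_KEY` in the same way it is combined with `--key`.

```python
import os
import subprocess


def build_invocation(text, region, key):
    """Return (argv, env) for an aspeak call that passes the key via the environment."""
    env = dict(os.environ)
    env["ASPEAK_AUTH_KEY"] = key  # read by aspeak v5.2.0+ according to the README
    # Authentication options go before the subcommand.
    argv = ["aspeak", "--region", region, "text", text]
    return argv, env


def speak(text, region, key):
    """Run aspeak; raises CalledProcessError if it exits with a non-zero status."""
    argv, env = build_invocation(text, region, key)
    subprocess.run(argv, env=env, check=True)


if __name__ == "__main__":
    # Only prints the command; call speak(...) to actually run aspeak.
    argv, _ = build_invocation("Hello World", "eastus", "<YOUR_SUBSCRIPTION_KEY>")
    print(argv)
```

Because the key travels in the child process environment rather than in `argv`, it does not show up in the command line of the running process.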
From v4.3.0, you can let aspeak use a proxy server to connect to the endpoint.
For now, only http and socks5 proxies are supported (no https support yet). For example:
```sh
$ aspeak --proxy http://your_proxy_server:port text "Hello World"
$ aspeak --proxy socks5://your_proxy_server:port text "Hello World"
```
aspeak also respects the `HTTP_PROXY` (or `http_proxy`) environment variable.
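
Since only `http` and `socks5` proxies are supported, a wrapper script may want to check the proxy URL before handing it to `--proxy`. This is a small sketch using only the Python standard library; `is_supported_proxy` is a hypothetical helper and does not reflect how `aspeak` itself validates the URL.

```python
from urllib.parse import urlsplit

# Schemes the README lists as supported (https is not supported yet).
SUPPORTED_PROXY_SCHEMES = {"http", "socks5"}


def is_supported_proxy(url):
    """Return True if the proxy URL uses one of the supported schemes."""
    return urlsplit(url).scheme in SUPPORTED_PROXY_SCHEMES


for candidate in (
    "http://127.0.0.1:8080",
    "socks5://127.0.0.1:1080",
    "https://127.0.0.1:8443",
):
    print(candidate, is_supported_proxy(candidate))
```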
### Configuration
aspeak v4 introduces the concept of profiles.
A profile is a configuration file where you can specify default values for the command line options.
Run the following command to create your default profile:
```sh
$ aspeak config init
```
To edit the profile, run:
```sh
$ aspeak config edit
```
If you have trouble running the above command, you can edit the profile manually.
First, get the path of the profile by running:
```sh
$ aspeak config where
```
Then edit the file with your favorite text editor.
The profile is a TOML file. Check the comments in the config file for more information about available options.
The default profile looks like this:
```toml
@@PROFILE_TEMPLATE@@
```
If you want to use a profile other than your default profile, you can use the `--profile` argument:
```sh
aspeak --profile <PATH_TO_A_PROFILE> text "Hello"
```
If you want to temporarily disable the profile, you can use the `--no-profile` argument:
```sh
aspeak --no-profile --region eastus --key <YOUR_KEY> text "Hello"
```
### Pitch and Rate
- `rate`: The speaking rate of the voice.
  - If you use a float value (say `0.5`), the value will be multiplied by 100% and become `50.00%`.
  - You can also use the following values: `x-slow`, `slow`, `medium`, `fast`, `x-fast`, `default`.
  - You can also use percentage values directly: `+10%`.
  - You can also use a relative float value (with an `f` postfix), e.g. `1.2f`:
    - According to the [Azure documentation](https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/speech-synthesis-markup?tabs=csharp#adjust-prosody), a relative value is expressed as a number that acts as a multiplier of the default.
    - For example, a value of `1f` results in no change in the rate. A value of `0.5f` results in a halving of the rate. A value of `3f` results in a tripling of the rate.
- `pitch`: The pitch of the voice.
  - If you use a float value (say `-0.5`), the value will be multiplied by 100% and become `-50.00%`.
  - You can also use the following values: `x-low`, `low`, `medium`, `high`, `x-high`, `default`.
  - You can also use percentage values directly: `+10%`.
  - You can also use a relative value (e.g. `-2st` or `+80Hz`):
    - According to the [Azure documentation](https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/speech-synthesis-markup?tabs=csharp#adjust-prosody), a relative value is expressed as a number preceded by "+" or "-" and followed by "Hz" or "st" that specifies an amount to change the pitch.
    - The "st" indicates that the change unit is a semitone, which is half of a tone (a half step) on the standard diatonic scale.
  - You can also use an absolute value, e.g. `600Hz`.

**Note**: Unreasonably high or low values will be clipped to reasonable values by Azure Cognitive Services.
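
To make the float conversion described above concrete, here is a short Python sketch. It only mirrors the formatting rule as stated in this section and is not taken from the `aspeak` source; `float_to_percent` and `relative_multiplier` are hypothetical names.

```python
def float_to_percent(value: float) -> str:
    """Format a plain float as described above: 0.5 -> '50.00%'."""
    return f"{value * 100:.2f}%"


def relative_multiplier(value: str) -> float:
    """Parse a relative value with the `f` postfix, e.g. '1.2f' -> 1.2."""
    if not value.endswith("f"):
        raise ValueError(f"expected an 'f' postfix, got {value!r}")
    return float(value[:-1])


print(float_to_percent(0.5))         # 50.00%
print(float_to_percent(-0.5))        # -50.00%
print(relative_multiplier("0.5f"))   # 0.5 (halves the rate)
```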
### Examples
The following examples assume that you have already set up authentication in your profile.
#### Speak "Hello, world!" to default speaker.
```sh
$ aspeak text "Hello, world"
```
#### SSML to Speech
```sh
$ aspeak ssml << EOF
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'><voice name='en-US-JennyNeural'>Hello, world!</voice></speak>
EOF
```
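
If the text comes from another program, it can be handy to generate the SSML document instead of writing it by hand. The sketch below builds the same `<speak>`/`<voice>` structure as the example above and escapes XML special characters; `build_ssml` is a hypothetical helper, not part of `aspeak`.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, voice="en-US-JennyNeural", lang="en-US"):
    """Wrap plain text in a minimal SSML document for a single voice."""
    return (
        "<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' "
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>{escape(text)}</voice>"
        "</speak>"
    )


if __name__ == "__main__":
    # "&" and "<" are escaped so the result stays well-formed XML.
    print(build_ssml("Tom & Jerry <3"))
```

Assuming the script is saved as `make_ssml.py` (a made-up file name), its output can be piped into `aspeak ssml`, which reads SSML from stdin as in the here-document example.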
#### List all available voices.
```sh
$ aspeak list-voices
```
#### List all available voices for Chinese.
```sh
$ aspeak list-voices -l zh-CN
```
#### Get information about a voice.
```sh
$ aspeak list-voices -v en-US-SaraNeural
```
<details>
<summary>
Output
</summary>
```
Microsoft Server Speech Text to Speech Voice (en-US, SaraNeural)
Display name: Sara
Local name: Sara @ en-US
Locale: English (United States)
Gender: Female
ID: en-US-SaraNeural
Voice type: Neural
Status: GA
Sample rate: 48000Hz
Words per minute: 157
Styles: ["angry", "cheerful", "excited", "friendly", "hopeful", "sad", "shouting", "terrified", "unfriendly", "whispering"]
```
</details>
#### Save synthesized speech to a file.
```sh
$ aspeak text "Hello, world" -o output.wav
```
If you prefer mp3/ogg/webm, you can use the `-c mp3`/`-c ogg`/`-c webm` option.
```sh
$ aspeak text "Hello, world" -o output.mp3 -c mp3
$ aspeak text "Hello, world" -o output.ogg -c ogg
$ aspeak text "Hello, world" -o output.webm -c webm
```
#### List available quality levels
```sh
$ aspeak list-qualities
```
<details>
<summary>Output</summary>
```
Qualities for MP3:
3: audio-48khz-192kbitrate-mono-mp3
2: audio-48khz-96kbitrate-mono-mp3
-3: audio-16khz-64kbitrate-mono-mp3
1: audio-24khz-160kbitrate-mono-mp3
-2: audio-16khz-128kbitrate-mono-mp3
-4: audio-16khz-32kbitrate-mono-mp3
-1: audio-24khz-48kbitrate-mono-mp3
0: audio-24khz-96kbitrate-mono-mp3
Qualities for WAV:
-2: riff-8khz-16bit-mono-pcm
1: riff-24khz-16bit-mono-pcm
0: riff-24khz-16bit-mono-pcm
-1: riff-16khz-16bit-mono-pcm
Qualities for OGG:
0: ogg-24khz-16bit-mono-opus
-1: ogg-16khz-16bit-mono-opus
1: ogg-48khz-16bit-mono-opus
Qualities for WEBM:
0: webm-24khz-16bit-mono-opus
-1: webm-16khz-16bit-mono-opus
1: webm-24khz-16bit-24kbps-mono-opus
```
</details>
#### List available audio formats (For expert users)
```sh
$ aspeak list-formats
```
<details>
<summary>Output</summary>
```
amr-wb-16000hz
audio-16khz-128kbitrate-mono-mp3
audio-16khz-16bit-32kbps-mono-opus
audio-16khz-32kbitrate-mono-mp3
audio-16khz-64kbitrate-mono-mp3
audio-24khz-160kbitrate-mono-mp3
audio-24khz-16bit-24kbps-mono-opus
audio-24khz-16bit-48kbps-mono-opus
audio-24khz-48kbitrate-mono-mp3
audio-24khz-96kbitrate-mono-mp3
audio-48khz-192kbitrate-mono-mp3
audio-48khz-96kbitrate-mono-mp3
ogg-16khz-16bit-mono-opus
ogg-24khz-16bit-mono-opus
ogg-48khz-16bit-mono-opus
raw-16khz-16bit-mono-pcm
raw-16khz-16bit-mono-truesilk
raw-22050hz-16bit-mono-pcm
raw-24khz-16bit-mono-pcm
raw-24khz-16bit-mono-truesilk
raw-44100hz-16bit-mono-pcm
raw-48khz-16bit-mono-pcm
raw-8khz-16bit-mono-pcm
raw-8khz-8bit-mono-alaw
raw-8khz-8bit-mono-mulaw
riff-16khz-16bit-mono-pcm
riff-22050hz-16bit-mono-pcm
riff-24khz-16bit-mono-pcm
riff-44100hz-16bit-mono-pcm
riff-48khz-16bit-mono-pcm
riff-8khz-16bit-mono-pcm
riff-8khz-8bit-mono-alaw
riff-8khz-8bit-mono-mulaw
webm-16khz-16bit-mono-opus
webm-24khz-16bit-24kbps-mono-opus
webm-24khz-16bit-mono-opus
```
</details>
#### Increase/Decrease audio qualities
```sh
# Less than default quality.
$ aspeak text "Hello, world" -o output.mp3 -c mp3 -q=-1
# Best quality for mp3
$ aspeak text "Hello, world" -o output.mp3 -c mp3 -q=3
```
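
The `-q` value is an index into the per-container table printed by `aspeak list-qualities`. As an illustration, the lookup for MP3 can be sketched in Python with the table copied from the sample output above. The real table lives inside `aspeak` and may change between versions. `mp3_format_for` is a hypothetical helper, and the default of `0` is an assumption based on `-q=-1` being described as less than the default.

```python
# Copied from the sample `aspeak list-qualities` output for MP3.
MP3_QUALITIES = {
    3: "audio-48khz-192kbitrate-mono-mp3",
    2: "audio-48khz-96kbitrate-mono-mp3",
    1: "audio-24khz-160kbitrate-mono-mp3",
    0: "audio-24khz-96kbitrate-mono-mp3",
    -1: "audio-24khz-48kbitrate-mono-mp3",
    -2: "audio-16khz-128kbitrate-mono-mp3",
    -3: "audio-16khz-64kbitrate-mono-mp3",
    -4: "audio-16khz-32kbitrate-mono-mp3",
}


def mp3_format_for(quality=0):
    """Return the format name for an MP3 quality level, or raise ValueError."""
    try:
        return MP3_QUALITIES[quality]
    except KeyError:
        valid = sorted(MP3_QUALITIES)
        raise ValueError(f"quality must be one of {valid}, got {quality}") from None


print(mp3_format_for(-1))  # audio-24khz-48kbitrate-mono-mp3
print(mp3_format_for(3))   # audio-48khz-192kbitrate-mono-mp3
```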
#### Read text from file and speak it.
```sh
$ cat input.txt | aspeak text
```
or
```sh
$ aspeak text -f input.txt
```
with custom encoding:
```sh
$ aspeak text -f input.txt -e gbk
```
#### Read from stdin and speak it.
```sh
$ aspeak text
```
Or, if you prefer a here-document:
```sh
$ aspeak text -l zh-CN << EOF
我能吞下玻璃而不伤身体。
EOF
```
#### Speak Chinese.
```sh
$ aspeak text "你好,世界!" -l zh-CN
```
#### Use a custom voice.
```sh
$ aspeak text "你好,世界!" -v zh-CN-YunjianNeural
```
#### Custom pitch, rate and style
```sh
$ aspeak text "你好,世界!" -v zh-CN-XiaoxiaoNeural -p 1.5 -r 0.5 -S sad
$ aspeak text "你好,世界!" -v zh-CN-XiaoxiaoNeural -p=-10% -r=+5% -S cheerful
$ aspeak text "你好,世界!" -v zh-CN-XiaoxiaoNeural -p=+40Hz -r=1.2f -S fearful
$ aspeak text "你好,世界!" -v zh-CN-XiaoxiaoNeural -p=high -r=x-slow -S calm
$ aspeak text "你好,世界!" -v zh-CN-XiaoxiaoNeural -p=+1st -r=-7% -S lyrical
```
### Advanced Usage
#### Use a custom audio format for output
**Note**: Some audio formats are not supported when outputting to speaker.
```sh
$ aspeak text "Hello World" -F riff-48khz-16bit-mono-pcm -o high-quality.wav
```
## Library Usage
### Python
The new version of `aspeak` is written in Rust, and the Python binding is provided by PyO3.
Here is a simple example:
```python
from aspeak import SpeechService
service = SpeechService()
service.connect()
service.speak_text("Hello, world")
```
First you need to create a `SpeechService` instance.
When creating a `SpeechService` instance, you can specify the following parameters:
- `audio_format`: The audio format of the output audio. Default is `AudioFormat.Riff24KHz16BitMonoPcm`.
  - You can get an audio format by providing a container format and a quality level: `AudioFormat("mp3", 2)`.
- `endpoint`: The endpoint of the speech service.
- `region`: Alternatively, you can specify the region of the speech service instead of typing the boring endpoint URL.
- `subscription_key`: The subscription key of the speech service.
- `token`: The auth token for the speech service. If you provide a token, the subscription key will be ignored.
- `headers`: Additional HTTP headers for the speech service.
Then you need to call `connect()` to connect to the speech service.
After that, you can call `speak_text()` to speak the text or `speak_ssml()` to speak the SSML.
Or you can call `synthesize_text()` or `synthesize_ssml()` to get the audio data.
For `synthesize_text()` and `synthesize_ssml()`, if you provide an `output`, the audio data will be written to that file and the function will return `None`. Otherwise, the function will return the audio data.
Here are the common options for `speak_text()` and `synthesize_text()`:
- `locale`: The locale of the voice. Default is `en-US`.
- `voice`: The voice name. Default is `en-US-JennyNeural`.
- `rate`: The speaking rate of the voice. It must be a string that meets the requirements documented in [Pitch and Rate](#pitch-and-rate).
- `pitch`: The pitch of the voice. It must be a string that meets the requirements documented in [Pitch and Rate](#pitch-and-rate).
- `style`: The style of the voice.
  - You can get a list of available styles for a specific voice by executing `aspeak list-voices -v <VOICE_ID>`.
  - The default value is `general`.
- `style_degree`: The degree of the style.
  - According to the [Azure documentation](https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/speech-synthesis-markup?tabs=csharp#adjust-speaking-styles), style degree specifies the intensity of the speaking style. It is a floating point number between 0.01 and 2, inclusive.
  - At the time of writing, style degree adjustments are supported for Chinese (Mandarin, Simplified) neural voices.
- `role`: The role of the voice.
  - According to the [Azure documentation](https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/speech-synthesis-markup?tabs=csharp#adjust-speaking-styles), `role` specifies the speaking role-play. The voice acts as a different age and gender, but the voice name isn't changed.
  - At the time of writing, role adjustments are supported for these Chinese (Mandarin, Simplified) neural voices: `zh-CN-XiaomoNeural`, `zh-CN-XiaoxuanNeural`, `zh-CN-YunxiNeural`, and `zh-CN-YunyeNeural`.
### Rust
Add `aspeak` to your `Cargo.toml`:
```bash
$ cargo add aspeak
```
Then follow the [documentation](https://docs.rs/aspeak) of the `aspeak` crate.
There are 4 examples for quick reference:
- [Simple usage of RestSynthesizer](https://github.com/kxxt/aspeak/blob/v6/examples/03-rest-synthesizer-simple.rs)
- [Simple usage of WebsocketSynthesizer](https://github.com/kxxt/aspeak/blob/v6/examples/04-websocket-synthesizer-simple.rs)
- [Synthesize all txt files in a given directory](https://github.com/kxxt/aspeak/blob/v6/examples/01-synthesize-txt-files.rs)
- [Read-Synthesize-Speak-Loop: Read text from stdin line by line and speak it](https://github.com/kxxt/aspeak/blob/v6/examples/02-rssl.rs)