Running out of Markovs to chain #143
Comments
First attempt, using the following sources:
Rules were strictly deleted after one use. Ngrams were of length 2-5, alternating words and whitespace/punctuation. The output story only reached 3337 words, the vocabulary was too heavily slanted towards Middle/Early Modern English, and fewer line breaks would be nice. Memory usage was almost a gigabyte. Sample from the start:
Sample from the end:
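For reference, tokenising into alternating word and whitespace/punctuation runs can be done with a single regex; this is a minimal sketch of the idea, not the notebook's actual code:

```python
import re

def tokenise(text):
    # Alternating tokens: runs of word characters, then runs of
    # whitespace/punctuation, so spacing and punctuation are part of
    # the model rather than being glued back on afterwards.
    return re.findall(r"\w+|\W+", text)

tokenise("Whan that Aprille, with his shoures soote")
# ['Whan', ' ', 'that', ' ', 'Aprille', ', ', 'with', ' ', 'his', ' ', 'shoures', ' ', 'soote']
```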
Second attempt, using the following sources:
Rules were deleted with probability 0.99 after each use. Output was 7236 words, with nearly 1.7 gigabytes of memory usage. I think it would be a good idea to make the generator favour productions from long ngrams, so that when those are all inaccessible it can fall back to the shorter, more permissive ngrams. Sample from the start:
Sample from the end:
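The probabilistic deletion is a small change to the strict version; roughly (hypothetical names, assuming `model` maps ngram contexts to lists of productions):

```python
import random

DELETION_PROBABILITY = 0.99

def apply_rule(model, context):
    # Pick a production for this context, then usually delete the whole
    # rule so the model decays gradually instead of strictly after one use.
    production = random.choice(model[context])
    if random.random() < DELETION_PROBABILITY:
        del model[context]
    return production
```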
Unexpectedly short stories in buggy versions of the generator:
Third attempt, same sources. The program attempts to find a rule for the most recent 5-gram, progressing to 4-grams, 3-grams, and 2-grams if it fails, and stops if none are found. 4127 much more coherent words were generated. The text stops at
Sample from the start:
Sample from the end:
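The backoff step might look roughly like this (my sketch, not the notebook's exact code; `model` is assumed to map 2- to 5-gram tuples to lists of productions):

```python
import random

def next_token(model, history, deletion_probability=0.99):
    # Try a rule for the most recent 5-gram first, then back off to
    # 4-, 3-, and 2-grams; return None (stop generating) if none match.
    for n in range(5, 1, -1):
        context = tuple(history[-n:])
        rule = model.get(context)
        if rule:
            production = random.choice(rule)
            if random.random() < deletion_probability:
                del model[context]   # destructive step: the rule is consumed
            return production
    return None
```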
Fourth attempt. I've altered capitalisation behaviour so that the first alphabetic character (including thorns and yoghs) after a full stop, question mark, or exclamation mark is capitalised, and the word
The chapters get progressively shorter (but don't reach zero length), the process gets gradually slower, and it has real difficulty reaching 15,000 words. I think it needs a larger corpus. Sample from the start:
Sample from the last chapter before I interrupted the process:
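The capitalisation rule can be expressed as one regex substitution; a sketch under the assumption that Python's Unicode-aware `str.upper()` is enough to handle þ and ȝ (it maps them to Þ and Ȝ):

```python
import re

def capitalise_sentences(text):
    # Uppercase the first word character after a full stop, question
    # mark, or exclamation mark (any intervening whitespace is kept).
    return re.sub(
        r"([.?!]\s*)(\w)",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )

capitalise_sentences("where is he? þe knight rode on. ȝet he slept.")
# 'where is he? Þe knight rode on. Ȝet he slept.'
```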
Attempt 5. I've reduced the deletion probability to 0.9 and added the following source:
Output: https://raw.githubusercontent.com/serin-delaunay/NaNoGenMo2016/master/output/markov.txt
19589 words according to Notepad++ (my code reported over 20,000, but whatever). Sample from the start:
Sample from the end:
Late output: https://raw.githubusercontent.com/serin-delaunay/NaNoGenMo2016/master/output/markov_v2.txt
Strictly speaking NaNoGenMo is over, but I changed the method of choosing the starting token. Originally I chose a random ngram in the Markov model and a random production from that ngram's rule. Later I made a list of all available productions in the whole model; that was really expensive! Now I make a set of every word encountered in the source text during parsing and convert it to a list. To start a chapter I choose one at random and delete it from the list. That makes the whole novel generation process muuuuuuuch faster. My code overestimates the number of words output, so I set its target word count to 55,000 and got 53,020 words in 762 chapters. The generation process took a matter of seconds. Parsing is still really slow, though.
Sample from the end:
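In code, that cheaper start-token selection amounts to something like this (names are my own, not the notebook's):

```python
import random

def collect_start_words(tokens):
    # One pass over the parsed tokens: keep each distinct word once,
    # then turn the set into a list so it can be sampled and shrunk.
    return list({t for t in tokens if t.strip()})

def pick_start_word(start_words):
    # Pick a random word to open a chapter and remove it from the list
    # so the same opener isn't reused.
    return start_words.pop(random.randrange(len(start_words)))
```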
IPython notebook: https://github.com/serin-delaunay/NaNoGenMo2016/blob/master/RunningOut.ipynb
Output (strict, 19589 words): https://raw.githubusercontent.com/serin-delaunay/NaNoGenMo2016/master/output/markov.txt
Output (late, 53,020 words): https://raw.githubusercontent.com/serin-delaunay/NaNoGenMo2016/master/output/markov_v2.txt
Going to try a very simple project for the last day: an ngram-based Markov chain trained on a variety of books from Project Gutenberg, but with destructive output. The first time a rule is selected from the model, the rule is deleted and it can't be used again. The novel should sound normal (for a Markov story) at the start, but gradually (or quickly) turn into something really odd.
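A minimal sketch of that idea (strict deletion, a single fixed ngram length; placeholder names, not the eventual notebook code):

```python
import random
from collections import defaultdict

def build_model(tokens, n=3):
    # Map each (n-1)-token context to the list of tokens that follow it.
    model = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        model[tuple(tokens[i:i + n - 1])].append(tokens[i + n - 1])
    return dict(model)

def generate(model, seed, max_tokens, n=3):
    out = list(seed)
    for _ in range(max_tokens):
        context = tuple(out[-(n - 1):])
        rule = model.get(context)
        if rule is None:
            break                 # run out of Markovs to chain
        out.append(random.choice(rule))
        del model[context]        # destructive: a rule can't be used twice
    return out
```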
I'll probably have to find some workarounds to make it reach 50,000 words without stopping, such as: