Text generator that reads in large .txt files, trains an LSTM model on them, and outputs text in a similar format.
The model uses the following parameters: num_hidden=512, num_layers=3, drop_prob=0.5, use_gpu=True.
I tested the model on different datasets. Some models converged quickly, some not at all.
Epochs trained = 35, Batch_size = 128
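For reference, a minimal sketch of what a character-level LSTM with these parameters might look like in PyTorch (the actual implementation may differ; n_chars is a placeholder for the size of the character vocabulary):

import torch.nn as nn

class CharLSTM(nn.Module):
    def __init__(self, n_chars, num_hidden=512, num_layers=3, drop_prob=0.5):
        super().__init__()
        # stacked LSTM over one-hot encoded characters
        self.lstm = nn.LSTM(n_chars, num_hidden, num_layers,
                            dropout=drop_prob, batch_first=True)
        self.dropout = nn.Dropout(drop_prob)
        self.fc = nn.Linear(num_hidden, n_chars)  # one score per possible next character

    def forward(self, x, hidden=None):
        out, hidden = self.lstm(x, hidden)
        out = self.dropout(out)
        return self.fc(out), hidden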
The model was trained on the names, latitudes and longitudes (all in one string) of German counties. Surprisingly, the model, which is built to predict character by character (not words), learns the structure of the text and spits out an arbitrary number of fake counties with GPS coordinates.
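To illustrate what "character by character" means here, generation could proceed roughly as sketched below (a hedged example; the trained net and the char2int/int2char lookup tables are assumptions, not the actual code):

import torch
import torch.nn.functional as F

def sample(net, char2int, int2char, seed="A", length=200):
    # repeatedly feed in the last character and draw the next one from the predicted distribution
    chars = list(seed)
    hidden = None
    for _ in range(length):
        x = torch.zeros(1, 1, len(char2int))
        x[0, 0, char2int[chars[-1]]] = 1.0        # one-hot encode the last character
        out, hidden = net(x, hidden)
        probs = F.softmax(out[0, -1], dim=0)      # probabilities for the next character
        idx = torch.multinomial(probs, 1).item()  # sample rather than take the argmax
        chars.append(int2char[idx])
    return "".join(chars)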
For this to work, the text had to be shuffled using a simple:
import random
# 'words' holds one county entry ("Name, Latitude, Longitude") per element
random.shuffle(words)  # shuffles in place, breaking the alphabetical order
We do this to prevent the model from only predicting counties that start with the letter "A".
Training time: --- 543.0422582626343 seconds ---
Epochs trained = , Batch_size =
(I blanked out my wife's first name and number for obvious reasons.)
Daniel: Wie die schlechte auf
...a Müller : Da keine Somen die Schatz
...a Müller : Dann ist dir gestimmt den Schatz
...a Müller : Danke ich mir die sorlen den Meinen?
Daniel: Bis gut in Schatz
...a Müller : Heit
...a Müller : Heh ich mich dann suer mich auch auf dir aber
...a Müller : Dass den Schatze Schatz
...a Müller Dier schon die schlacht
...a Müller : Ich dir den Sonnen ich dass sorte mit gerauten sorer ist dann den Stard
...a Müller : Ich hieb dich auch
...a Müller : Hit ist der Stit mal dich so guten Schatz
...a Müller : Hi Schatz weißen dir an mal in schöchen?
Daniel: Hab schön schönen Stutz
Daniel: Hie ich dann schot schon auch sehr suesse
...a Müller : Dann ich mal dir gut aber die Bein die Schatz ist dass auch aber schlecht
...a Müller : Hab die so gerne auf die so sehr
...a Müller : Dann da kannst
...a Müller : Hab das dann somme die Schot
Daniel: Heute mit
Daniel: Bin so schon so sein
Daniel: Bin gut aus dir ganz dass du bist du dir gehalt
(I replaced my wife's first name with "...a". Everything else was created by the machine.)
The model was trained on my wife's and my WhatsApp chat history of about 5 years. The model is able to reproduce the format found in the WhatsApp chat history and outputs some German words. However, it is generally not able to produce meaningful sentences (with a few short exceptions).
Bonus points: the model also picked up the "author distribution" of the messages between us. I don't text that much, and neither does the model pretend to (=
At the beginning, whatever I did, the model did not converge. The main reason was the number of characters to be encoded: the model treats each emoticon as its own character, so I ended up with 733 options for the algorithm to choose from when picking the subsequent character.
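A rough sketch of how the character vocabulary can be measured and shrunk (the file name and whitelist are only illustrative, not the exact preprocessing used here):

import re

with open("chat.txt", encoding="utf-8") as f:
    text = f.read()
print(len(set(text)))   # number of distinct characters before cleaning

# keep letters (incl. German umlauts), digits, common punctuation and whitespace; drop emoticons etc.
allowed = re.compile(r"[^A-Za-zÄÖÜäöüß0-9 .,:;!?'\"\n-]")
text = allowed.sub("", text)
print(len(set(text)))   # distinct characters after cleaning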
Removing the emoticons reduced the character space by approximately 85% while keeping the text readable.
TBD
Training time: --- seconds ---