This project attempts to combine Markov chains with part-of-speech (POS) tagging and word2vec to generate grammatically correct text that stays on topic.
I use a custom tokenizer that keeps the punctuation and the general text structure (line breaks, tabs, and so on). This allows the algorithm to learn the style of the text better than a regular tokenizer (like the one provided in the natural library, for example), which removes the punctuation and keeps only the words. The text is converted to lowercase before training, and capitalization is re-applied after rendering using simple heuristics.
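The idea can be illustrated with a minimal sketch. This is not the project's actual tokenizer: the function names `tokenize` and `recapitalize`, the regular expression, and the single capitalization rule (capitalize the first word of each sentence or line) are all assumptions made for illustration.

```javascript
// Hypothetical sketch, not the project's real tokenizer.
// Keeps words, punctuation marks, line breaks and tabs as separate tokens;
// only plain spaces are dropped.
function tokenize(text) {
  // 1) a word (letters/digits, optionally with inner apostrophes)
  // 2) a line break or a tab
  // 3) any other single non-space character (punctuation)
  var pattern = /[a-z0-9]+(?:['’][a-z0-9]+)*|[\n\t]|[^\sa-z0-9]/g;
  return text.toLowerCase().match(pattern) || [];
}

// Assumed heuristic: capitalize the first word after ".", "!", "?" or a line break.
function recapitalize(tokens) {
  var startOfSentence = true;
  return tokens.map(function (token) {
    var isWord = /^[a-z0-9]/.test(token);
    var result = token;
    if (isWord && startOfSentence) {
      result = token.charAt(0).toUpperCase() + token.slice(1);
    }
    if (isWord) {
      startOfSentence = false;
    } else if (/^[.!?\n]$/.test(token)) {
      startOfSentence = true;
    }
    return result;
  });
}

var tokens = tokenize("Hello, world!\nIt's me.");
console.log(tokens);
console.log(recapitalize(tokens));
```

Because punctuation and line breaks are ordinary tokens, they end up in the ngrams like any other word, which is what lets the generator reproduce pauses and paragraph rhythm.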
The data is stored in Neo4j as a graph with weighted relationships.
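As a rough illustration of what a weighted relationship could look like, the sketch below builds a Cypher statement that creates two ngram nodes if needed and increments the weight of the relationship between them. The `Ngram` label, the `NEXT` relationship type, the `text` property and the helper name `buildIncrementQuery` are assumptions, not the project's actual schema. The `$param` syntax assumes a reasonably recent Neo4j version.

```javascript
// Hypothetical sketch: builds the query and parameters only, it does not
// talk to a database. Label, relationship type and property names are assumed.
function buildIncrementQuery(fromNgram, nextWord) {
  return {
    query: [
      'MERGE (a:Ngram {text: $from})',
      'MERGE (b:Ngram {text: $to})',
      'MERGE (a)-[r:NEXT]->(b)',
      'ON CREATE SET r.weight = 1',
      'ON MATCH SET r.weight = r.weight + 1'
    ].join('\n'),
    // ngrams are arrays of strings, stringified with a pipe
    params: { from: fromNgram.join('|'), to: nextWord }
  };
}

var statement = buildIncrementQuery(['make', 'america'], 'great');
console.log(statement.query);
console.log(statement.params);
```

The resulting `query` and `params` could then be passed to whichever Neo4j client the project uses.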
The project is still under active development. Once completed, it will be converted to an NPM package.
- Node
- Neo4j
Install the NPM dependencies
npm install
Install and run Neo4j
I run my tests on Trump's speeches. He has a very recognizable style, which makes it easy to see whether the algorithm has learned it and is able to reproduce it. He also rambles a lot, which makes it easier to generate spoof speeches that look as realistic as the real thing.
var cmarkov = require('./cmarkov');

// See the options in the doc
var bot = new cmarkov({
    name: 'trump',
    depth: [1,5],
    lowpri: 3,
    weight: 1,
    depthWeight: 1,
    certainty: 0.1,
    pos: true,
    debug: false
});

// Learn the word structure
bot.read("training-data/trump.txt", function() {
    // Learn the grammatical style using Part-Of-Speech tagging.
    // You can learn the style from another dataset if you wish. You could train on the style of Obama's speeches to generate speeches with Trump's ideas in a more presidential style.
    bot.readPOS("training-data/trump.txt", function() {
        // Test the algorithm: generate a total of 200 words, starting with "I would like to talk today about".
        // The algorithm will continue that sentence. You can start with a single word, or with an entire sentence.
        bot.generate("I would like to talk today about", 200, function(str) {
            console.log(str);
            // I would like to talk today about how to develop a new vision for a new time. In fact, i get criticized all the time. And always so theatrical; [...]
        });
    });
});
bot.generate("Trump", 200, function(str) {
console.log(str);
// Trump. Ok. He’s not a protester. We’re going to win a lot. We’re going to win so much — we’re going to win so much — we’re going to win so much you’re all going to get sick.
});
The training doesn't use any machine learning, relying instead on statistics and graphs. Surprisingly, the text generated by this algorithm is much more realistic, and the training much faster, than with a deep LSTM on TensorFlow. Training takes just a third of the time taken by a 5-layer LSTM (128 neurons per layer) on similar datasets. It should be noted that my LSTM used character-level ngrams while this algorithm uses word-level ngrams.
- Convert the text to lowercase
- Tokenize the text
- Generate ngrams of various lengths (without padding). I found that using 1-grams (1 word) to 5-grams (sequences of 5 words) works best.
- Store the ngrams as nodes on Neo4j
In the code, the ngrams are represented as arrays of strings. To make them easier to query, I stringify the ngrams by joining them with a pipe: tokens.join('|')
- Map the relationships between ngrams of various lengths and the next word. A relationship can start from a node representing an ngram of any length, but it always points to an ngram of length 1, that is, a single word. Each relationship has a "weight" property that is incremented every time the relationship is found in the training dataset.
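The training steps above can be sketched in memory, using nested `Map` objects as a stand-in for the Neo4j graph. The function names `ngramKey` and `train` and the `Map` layout are assumptions for illustration only; the real project persists the nodes and relationships in Neo4j.

```javascript
// Hypothetical in-memory stand-in for the graph:
// Map("ngram|key" -> Map(nextWord -> weight)).

// Stringify an ngram (array of strings) by joining it with a pipe.
function ngramKey(tokens) {
  return tokens.join('|');
}

// For every position, link each preceding ngram (minDepth..maxDepth words,
// no padding) to the word that follows it, incrementing the weight each time.
function train(tokens, minDepth, maxDepth) {
  var graph = new Map();
  for (var i = 1; i < tokens.length; i++) {
    var next = tokens[i];
    for (var depth = minDepth; depth <= maxDepth; depth++) {
      if (i - depth < 0) break; // not enough preceding words for this depth
      var key = ngramKey(tokens.slice(i - depth, i));
      if (!graph.has(key)) graph.set(key, new Map());
      var edges = graph.get(key);
      edges.set(next, (edges.get(next) || 0) + 1);
    }
  }
  return graph;
}

var graph = train(['a', 'b', 'a', 'b', 'c'], 1, 2);
console.log(graph.get('a').get('b'));   // weight of the relationship a -> b
console.log(graph.get('a|b').get('c')); // weight of the relationship a|b -> c
```

Every relationship targets a single word, so the source ngram alone determines how much context a given vote for the next word represents.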
The text is generated word by word using a recursive algorithm. The algorithm changes slightly based on the settings. The following steps describe the best-performing settings:
- Tokenize the input to convert the text into an array of tokens. We only work with arrays of strings, which are converted back to a string right before being returned. I will refer to that array as the chain.
- For each depth in the range 5...1 (or whatever range was used during training):
  - List the possible next words and their weights.
    - Generate the last ngram from the current chain. If depth = 5, this is an array of length 5 that contains the last 5 elements of the chain. If the chain's length is less than the depth, then the ngram is a copy of the chain. We simply use chain.slice(-depth).
    - Look up the node corresponding to that ngram in the graph database (Neo4j). If the node doesn't exist, skip to the next depth.
    - List the relationships leaving that node, and associate in an object (or dictionary) the next word (the node it connects to, which is always an ngram of length 1) with the relationship's weight. Because longer ngrams lead to a more realistic output, the higher the depth, the higher the final weight should be. For that reason, the weight is multiplied by Math.pow(depth, depthWeight). You can disable that by setting depthWeight to false in the settings.
  - List what the next type of word could be (POS tag), and the weight for each option.
    - Convert the current chain (array of words) to an array of POS tags. I'll refer to that chain as the POS chain.
    - Generate the last ngram from the POS chain. If depth = 5, this is an array of length 5 that contains the last 5 elements of the POS chain. If the POS chain's length is less than the depth, then the ngram is a copy of the POS chain. We simply use posChain.slice(-depth).
    - Look up the node corresponding to that ngram in the graph database (Neo4j). If the node doesn't exist, skip to the next depth.
    - List the relationships leaving that node, and associate in an object (or dictionary) the next POS tag (the node it connects to, which is always an ngram of length 1) with the relationship's weight. Because longer ngrams lead to a more grammatically realistic output, the higher the depth, the higher the final weight should be. For that reason, the weight is multiplied by Math.pow(depth, depthWeight). You can disable that by setting depthWeight to false in the settings.
- For each word option, assemble a temporary string (chain + word) and POS tag it to see which tag the word option would get in the current context. If that tag is not in the list of possible tags (from the previous step), remove that word from the options. Otherwise, update the word option's weight to be the average of the tag's weight and its current weight.
- Convert each word option's weight to a probability (all probabilities sum to 1), and remove low-probability options so that the lowest remaining probability is at least 0.1. You can change that value (certainty) in the settings, or set it to false to keep all the compatible options. Keeping only the most probable options ensures that the generated text is likely and realistic. Any value over 0.5 keeps at most a single option, which generates very likely text but often either gets stuck in a loop or reproduces learned text word for word without any modification.
- Select a random word, with a probability equal to its weight, and add it to the current chain.
- If the chain's length is less than the required length, loop back to the first step with the new chain as input.
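The steps above can be sketched for a single generated word as follows. This is a simplified illustration, not the project's implementation: the graph is a hypothetical in-memory `Map` instead of Neo4j, the POS tagger is passed in as a plain function (`tagOf`), and summing the weighted votes across depths is an assumption, since the description does not say how weights from different depths are combined. All function names are made up for this sketch.

```javascript
// Hypothetical sketch. `graph` is Map("ngram|key" -> Map(nextWord -> weight)).

// Collect next-word options from the deepest ngram down to the shallowest,
// multiplying each weight by Math.pow(depth, depthWeight).
function collectCandidates(graph, chain, depths, depthWeight) {
  var candidates = new Map();
  for (var depth = depths[1]; depth >= depths[0]; depth--) {
    var node = graph.get(chain.slice(-depth).join('|'));
    if (!node) continue; // unknown ngram: skip to the next depth
    var factor = depthWeight === false ? 1 : Math.pow(depth, depthWeight);
    node.forEach(function (weight, word) {
      // Assumption: votes from different depths are summed.
      candidates.set(word, (candidates.get(word) || 0) + weight * factor);
    });
  }
  return candidates;
}

// Drop words whose tag is not allowed; average word and tag weights otherwise.
function filterByPos(candidates, tagWeights, tagOf) {
  var kept = new Map();
  candidates.forEach(function (weight, word) {
    var tag = tagOf(word);
    if (tagWeights.has(tag)) kept.set(word, (weight + tagWeights.get(tag)) / 2);
  });
  return kept;
}

// Normalize weights to probabilities and drop options below `certainty`.
function toProbabilities(candidates, certainty) {
  var total = 0;
  candidates.forEach(function (weight) { total += weight; });
  var probabilities = new Map();
  candidates.forEach(function (weight, word) {
    var p = weight / total;
    if (certainty === false || p >= certainty) probabilities.set(word, p);
  });
  return probabilities;
}

// Pick a word at random, proportionally to its remaining probability.
// `random` defaults to Math.random and can be replaced for testing.
function pickWeighted(probabilities, random) {
  var total = 0;
  probabilities.forEach(function (p) { total += p; });
  var threshold = (random || Math.random)() * total;
  var cumulative = 0;
  var chosen = null;
  var last = null;
  probabilities.forEach(function (p, word) {
    cumulative += p;
    last = word;
    if (chosen === null && threshold < cumulative) chosen = word;
  });
  return chosen === null ? last : chosen; // null only if there are no options
}
```

A full generator would call these four functions in a loop, pushing the picked word onto the chain until the requested length is reached. It would also need a fallback for the case where no option survives the filters, which this sketch leaves to the caller.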
I'm very happy with the outcome so far, and with how realistic the generated text is. It's performing much better than I ever thought it would. The POS tagging really makes a huge difference, ensuring that the generated sentences are grammatically correct. However, there is no control over the topic of the sentence. All you can do is start the sentence and hope it stays on subject. So I'm working on taking a word as input and having the algorithm grow the sentence forward and backward around it, so that the word ends up in the middle of the sentence. I'd also like to take a few words as input and generate text that connects them by finding paths between nodes in the graph. In theory, that should make it possible to input a few keywords and have an entire story built to connect them, a kind of story generator if the algorithm is trained on novels. Finally, I'd like to use a pre-trained word2vec model to calculate the vector distance between the current chain and the topic I want the generation to move toward.
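One possible way to measure how close the chain is to a topic, sketched here purely as an assumption about how the planned word2vec step might work, is to average the word vectors of the chain and compare the result to the topic vector with cosine similarity. The helper names and the tiny 2-dimensional vectors are made up; a real pre-trained model would provide vectors with hundreds of dimensions.

```javascript
// Hypothetical sketch of a topic-distance measure; not part of the project yet.

// Average a list of equally sized vectors, component by component.
function averageVector(vectors) {
  var sum = vectors[0].map(function () { return 0; });
  vectors.forEach(function (vector) {
    vector.forEach(function (value, i) { sum[i] += value; });
  });
  return sum.map(function (value) { return value / vectors.length; });
}

// Cosine similarity: 1 for the same direction, 0 for orthogonal vectors.
function cosineSimilarity(a, b) {
  var dot = 0, normA = 0, normB = 0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // undefined for zero vectors
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

var chainVector = averageVector([[1, 0], [0, 1]]);
console.log(cosineSimilarity(chainVector, [1, 1]));
```

Such a score could be folded into the word weights so that options pulling the chain toward the topic become more likely, but how to weigh it against the ngram and POS weights is an open design question.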
After training the algorithm on a South Park script, I generate a new script from scratch.
I trained the algorithm overnight on a text file containing many of Trump's speeches.
I got it started using simply "I" as input.
I started this journey and that’s what it is.
It’s labor, and it’s unions good and some bad and lots of people that aren’t in unions, and it’s all over the place. I hire people, they do a web site. It costs me $3. $5 billion web site.
But you know — although i think last night i gave her a big headache.
We’re building a wall, we are going to replace it.
And we’re like a dumping ground for everybody else’s problems.
It’s very simple. So i announced — and i was talking about the borders, i was talking about the wall, i was talking about the 2nd amendment will be saved, by the way, i’m not even saying that in — that’s the kind of thinking you need for this country.
So they put together this. And before i say it, i have to say this. I made it the old-fashioned way. It’s real estate. You know, professionally written. So, he made a statement, and after that, his spin people said, oh, man.
We’ve had a president who dislikes our friends and bows to our enemies.
We’re going to finally have a coherent foreign policy.
We got a military that needs equipment all over the place, thousands and thousands of wounded soldiers.
And she and my opponents — oh, isn’t it terrible that putin said — wouldn’t it be nice?
We’ve picked fights with our oldest friends, and now they’re starting to look elsewhere for help.
We’re building a wall, we are going to become strong again. And. ..
But he spent — think of it. They’re going to come in through a legal process. They’re not coming in the way they’re coming in now, just walking in like nothing.
I’m doing that to say that that’s the kind of thinking you need for this country.
We are getting out of the nation-building business, and instead focusing on creating stability in the world.
We have all the cards, but we don’t have to destroy our businesses, we don’t have to destroy our businesses, we don’t have to destroy our businesses, we don’t have to destroy our —
We protect saudi arabia. They were, during the good oil days — now it’s probably half, which is fine. But during the good old — they were making a billion — listen to this — a billion dollars a day. Whenever they have problems, we send over the ships. We say “we’re gonna protect. ” what are we doing? They’ve got nothing but money.
And you know really don’t — i’ll tell you who does not like — yeah, we’ll get ivanka. Good. Let’s do ivanka.
But it’s sealed, it’s beautiful. I don’t think they’re going to be doing it.
First, we need a long-term plan to halt the spread and reach of radical islam.
And we must only be generous to those that prove they are our friends.
It’s war. And so, you don’t like china? ”
Let’s say the cost of the wall is $10 billion. That’s a good — i can do great job at $10 billion. I think we’ll have a lot of natural barriers, right, that are extremely tough to get across. We have 1, 000 miles, and we have 1, 000 miles. We have modern cranes. We have caterpillar tractors — i want to just shoot the television, you know? They really get it. They want to get rid of the fraud, but you deserve your social security; you've been paying your security. And like, i like congressman ryan, i like a lot of the pundits on television, “well, donald will never run, and one of the things i did, and i said it strongly, it's very hard for somebody that does tremendous numbers of deals to run for politics, run for political office, any office, let alone president. Because you've done so much; you've beaten so many people; you've created so many-- look, obama, what did he do? No deal. He never did a deal. He did one deal. A house. And if you did that house you'd be i jail right now, okay. He got away with murder. But i can tell you.
But you don’t hear that from anybody else. You don’t hear it from anybody else. You don’t hear it from anybody else. And i watch these teleprompters, and by the special interests, fully.
It’s war. And so, you don’t like china? ”
Hillary clinton also refuses to say the words “radical islam, ” even as she pushes for a massive increase in refugees.
And you know, it’s — i love getting up — and for the last couple of months. And then he appointed justice roberts. And justice roberts basically approved obamacare in a shocking decision that nobody believes. So you can't have romney. He choked.
We’ve had a president who dislikes our friends and bows to our enemies.
We gave them, by decapitating iraq and then leaving, because frankly, once we did it we should have kept 20, 00 troops or something, we give iraq. So not only did they make a great deal in terms of the universe. Miss universe, by the way. I’ve been here many times. Great golfing area, right? We love it.
I’m number one by a lot and i spent no money and i’m number one. Others spent — they will have spent hundreds of millions of dollars.
— and then it comes out in big globs, right, and you — it’s stuck in your hair and you say oh my god, i’ve got to take a shower again. My hair’s all screwed up, right?
We’re building a wall, we are going to build the strongest, the best, the most powerful military ever. Ever.
— right? Right? But i hear where they don’t want me to use the pump because the other one which i really like better than going bing, bing, bing. No problems, get off stage, everybody falls asleep and that's the end of that. But we have to do it fast; we have to do something about these teleprompters.
So hillary clinton said — and here’s chuck todd — and he’s a nice guy. I’m not trying to knock him. Please don’t be insulted, chuck. But she says she doesn’t like him. But he doesn’t treat me right. So he goes the clinton campaign said they’d most like to run against donald trump. And yes — and they’re looking forward to it. Trust me, they don’t want to run against trump. The last thing she wants in her whole life is — this was only — i did this in, like, i didn’t finish the — so, the wall.
— you know, i really do live in a very nice apartment, right?
That’s right. A lot of people, which is too bad, actually. So maybe next time. I don’t know. I think this is the largest room. I know it’s a record for the room.
So it started off with me where i was going to say, but i ride down the highways and somebody makes those guard rails. You know the guard rails. The ones that sort of go like this [demonstrates with hand] that are always bent, rusted and horrible. Did you ever see what they do now? We have these little walls. They build ramps. They build a ramp. I say, “wouldn’t it be cheaper just to knoc the wall down and just. ..
We’ve had a president who dislikes our friends and bows to our enemies.
We’re going to finally have a coherent foreign policy based upon american interests, and the shared interests of our allies.
It’s labor, and it’s unions good and some bad and lots of people that aren’t in unions, and it’s all over the place. And i used to say it, there are ships at sea, and this was during the worst crisis, that were loaded up with oil, and the cartel kept the price up, because, again, they were smarter than our leaders, and we can’t sustain ourself with that. There’s too much — it’s like — it’s like take the new england patriots and tom brady and have them play your high school football te m. That’s the difference between china’s leaders and our leaders.
We have a lot to be proud of. In the 1940s we saved the world again, this time from totalitarian communism. The cold war lasted for decades, but we won.
We got a military that needs equipment all over the place, thousands and thousands of sets all the time, for some reason it's south korea. You know whether it's china, south korea, but in this case televisions, south korea. Why aren't they doing something to justify what is going on?
First, we need a long-term plan to halt the spread and reach of radical islam. But this has to be a record because every inch of the room is taken with people, and they’re standing. They’re not sitting. So that’s great. I love — we love setting records. We want the country to set records, right?
We’ve got a lot of problems. We’ve got a real problem. You’ve got to report them. I mean, four years. I’ve been “get the oil, get the oil. ” because who’s going to get the oil? Iran is taking over iraq. Think of it. I mean give me a break. So we have to change.
We are getting out of the nation-building business, and instead focusing on creating stability in the world. We are laughed at by the rest of the world.
— let me just tell you — let me just tell you — let me just tell you — let me just tell you — let me just tell you — let me just tell you a little secret. It’s the cheapest thing we can do. We’re never going to have to change our trade, immigration and economic policies to make our economy strong again – and to put americans first again. This will ensure that our own workers, right here in america, get the jobs and higher pay that will grow our tax revenue and increase our economic mi ht as a nation.
But we will look for savings and spend our money wisely. In this time of mounting debt, not one dollar can be wasted.
I’m number one by a lot and i spent no money and i’m number one. Others spent — they will have spent hundreds of millions of dollars.
… one of the early things i would do, probably before i even got in — and i wouldn’t even use — you know, i’m just hoping they’re just stupid people, ok?
… one of the early things i would do, probably before i even got in — and i wouldn’t even use — you know, it’s like, i’ve never did this before.
And it will cost you much less money, and it will be great. I mean, he’s never going to leave and you’ll make a deal. But when you announce you are leaving in 18 months or whatever the hell he said, they just pull back. And then as soon as we left, they come in ok?
That’s true. But actually i am. I think i am a nice person. ”
Hillary clinton also refuses to say the words “radical islam, ” even as she pushes for a massive increase in refugees.
And we’re in a war against radical islam, but president obama won’t even name the enemy!
I’ll say you don’t have to pay a tax, folks. And you know, you’re talking about 15, 16 people — 46 percent. Time magazine, 49 percent. Think of it. I mean the euro, china, mexico; everybody is devaluing. And when you think about it, lies, i mean are they prosecuted? Does anyone do anything? And what are the republican politicians doing about it? He lied about the doctor, he lied about every aspect. You can keep your plan. And you've all heard that hundreds of times. That's like the real estate location, location. I don't even say it anymore because everybody goes location, location. I don't even say it anymore because everybody goes location, location. But you have heard this bout obamacare.
First, we need a long-term plan to halt the spread and reach of radical islam. But this has to be a record because every inch of the room is taken with people, and they’re standing. They’re not sitting. So that’s great. I love — we love setting records. We want the country to set records, right?
We’re building a wall, we are going to make our country great again.
And i'm a conservative, actually very conservative, and i'm a republican. And i'm very disappointed by our republican politicians. Because they let the president get away with absolute murder. You see always, oh we're going to do this, we're going to--. Nothing ever happens; nothing ever happens.
That’s right. A lot of times, they will say, well mr. Trump’s people are blue-collar. I love blue collar. I’m honored by that.
But they are not blue-collar. We have blue-collar and we have executives and we have young people. They say the audience is old. It’s not older. The audience is young. The other night in iowa, i told that, and i said it very strongly, years ago, i said — and i love my father. I learned so much. He was a great negotiator. I learned so much. He was a great negotiator. I learned so much just sitting at his feet playing with blocks listening to him negotiate with subcontractors. But i lea ned a lot.
But you don’t hear that from anybody else. You don’t hear it from anybody else. And i watch these teleprompters, and by the special interests, fully.
But they all said, a lot of the people that are talking about you know cutting social security, and by the way, 20 years ago, they built the great wall of china goes 18, 000 miles.
I’m not doing that to brag, because you know what? Ford will say “if i (inaudible) pay a tax, we’re going to charge you a 35-percent tax, and that tax is going to be paid simultaneously with the transaction, and that’s it.
So it started off with me where i was going to say, but i ride down the highways and somebody makes those guard rails. You know the guard rails. The ones that sort of go like this [demonstrates with hand] that are always bent, rusted and horrible. Did you ever see more than like 20 feet which isn't corroded, or bent or the heat, if it gets too hot, it just crushes. Now they've been selling this thing for 25 years. Why doesn't someone stop them and get something that works. Because they don't know; they don't know what's happening. Somebody made a lot of money on that. They said he talks about illegal immigration, we’re not going to do anything. It’s wonderful. Oh, good, i had nice crowds. They’ll write about me every once in a while; every two years i’ll have a story, “trump ran this rather good campaign. ” if we don’t win, it’s a terrible waste of time, because you’ll have somebody else in there. They won’t be able to do what i do. They’re not going to be able to se it. I don’t want it. ”
I’ll say you don’t have to pay interest, because they took the oil that, when we left iraq, i said we should’ve taken.
It’s like japan. If somebody attacks japan, we have to do something about it and we are going to replace it.
… one of the early things i would do, probably before i even got in — and i wouldn’t even use — you know, it’s — i love getting up — and for the last — and leading a lot. I’m going to go over that because i go over (ph) polls. I love polls.
We’ve had a president who dislikes our friends and bows to our enemies.
And she and my opponents — oh, isn’t it terrible that putin said — wouldn’t it be nice if they had a gun on their waist? Somebody. Him, him, so that when they — you know what? I don’t have to brag. I don’t have to brag. I don’t have to brag. I don’t have to brag. I don’t have to, believe it or not, in terms of trade, is mexico.
And i have liabilities of about $500 million (ph). That’s long-term debt, very low interest rates. We have a lot to be proud of. In the 1940s we saved the world again, this time from totalitarian communism. The cold war lasted for decades, but we won.
— right? Right? But i hear where they don’t want me to use the pump because the other one which i really like better than going bing, bing, bing, bing. No problems, get off stage, everybody falls asleep and that's the end of that. But he's totally in favor of common core. He's very, very weak on immigration. He’s in favor of common core. I don’t see how he can possibly get the nomination. He’s weak on immigration. Don't forget--remember his statement--they come for love. I say, “would ’t it be cheaper just to knock the wall down and just. ..
We’re building a wall, we are going to get rid of the fraud, but you deserve your social security; you've been paying your security. And like, i like congressman ryan, i like a lot of the pundits on television, “well, donald will never run, and one of the beautiful polls that came out, and other things — we have a website — as soon as the hats came out, i started — i did it, i was, oh, that first couple of weeks with illegal immigration and mexico and all of this stuff, right? And for the last — and leading a lot. I’m going to do to knock the hell out of them and make me number on next year.
We’re building a wall, we are going to become strong again. And. ..
We’re building a wall, we are going to straighten out common core. Common core should — it is a disaster. You people know. Obamacare is a total catastrophe. It’s going to be one of the largest in the world. Ford. Good company.
We’ve picked fights with our oldest friends, and now they’re starting to say the wall. We have to save our country.
But we have to do it;
— you know, i’m doing the old post office site, and i got it from obama. Do you believe that? And everybody wanted it. So, obamacare, dead. Common core, gone. We are going to become strong again. And. ..
We’ve had a president who dislikes our friends and bows to our enemies.
It’s labor, and it’s unions good and some bad and lots of people that aren’t in unions, and it’s all over the place, thousands and thousands of sets all the time, okay. I run a big business. You know i've always said it's very, very sad.
We’ve let our rivals and challengers think they can get away with anything.
— you know, because i see since then — i mean, i would have probably said we are going to help your country. So, here is the story. So, we are in far worse shape now in the middle east, iran and somebody else will get the oil, and it turned out that iran is now taking over iraq. Think of it. So china is spewing up all this stuff and we’re holding back. And with china — you know, that’s wh t you say. ” she hasn’t heard that from any other candidate.
We’ve let our rivals and challengers think they can get away with anything.
— right? Right? But i hear where they don’t want me to use hair spray. They say don’t use hair spray, it’s bad for the ozone. So i’m sitting in this concealed apartment, this concealed unit — you know, i’m just hoping they’re just stupid people, ok?
It’s labor, and it’s unions good and some bad and lots of people that aren’t in unions, and it’s all over the place, thousands and thousands of sets all the time, for some reason it's south korea. You know whether it's china, south korea, but in this case televisions, south korea. Why aren't they doing something to justify what is going on?
We have made the middle east more unstable and chaotic than ever before.
— let me just tell you — let me just tell you — let me just tell you a little secret. It’s the cheapest thing we can do. We’re never going to have to use it. I don’t want it. ”
We’re going to win a lot.
We are getting out of the nation-building business, and instead focusing on creating stability in the world.
So, just to sum up, i would do various things very quickly. I would repeal and replace the big lie, obamacare.
We’re building a wall, we are going to become strong again. And. ..
It’s labor, and it’s unions good and some bad and lots of people that aren’t in unions, and it’s all over the place and building all over the world that they can do it.
And we should be very prepared now, because the world hates us. She has done a terrible job as secretary of state. Think of it. We gave iran iraq. Iran has the biggest — think of it.
And i go home and my wife said were there many people? I watched you on television tonight. I said, “you don’t understand. You don’t understand. If i don’t win, it’s a terrible waste of time, because you’ll have somebody else in there. They won’t be able to do what i do. They’re not going to be able to. And even if you get one of these republicans in, like, 15 minutes what happened to them, because the husband wants to come and she wants to accuse me of things and the husband’s one of the great hotels of the world on pennsylvania avenue, the old post office, we’re converting it into one of the world’s great hotels. It’s gonna be the best hotel in washington, d. C. We got it from the general services administration in washington. The obama administration. We got it from the general services administration in washington. The obama administration. We got it from the general services administration in washington. The obama administration. We got it. It was the most highly sought after project in the history of this country. ” newt gingrich was on a television show the other say, and said, “this is going to be strong again. America is going to be strong again. America is going to be an important moment for all of us. The one thing i have to say this. I made it the old-fashioned way. It’s real estate. You know, all of my life, i’ve heard that a truly successful person, a really, really successful person and even modestly successful cannot run for public office. Just can’t happen. And yet that’s the kind of mindset, that’s the kind of mindset that ou need to make this country great again.
We got a military that needs equipment all over the place, thousands and thousands of sets all the time, okay. I run a big business. You know i've always said it's very, very hard for a person who is very successful. I have done so many deals. Almost all of them have been tremendously successful.