Speech recognition in Linux is generally considered tough, but it can be done with very good results. Julian is a special version of Julius that performs grammar-based speech recognition.  The video shows the speech recognition in action.


As the video shows, the sentences are recognized without error.

All Speech Recognition Engines (“SREs”) are made up of the following components:

  • Language Model or Grammar – Language Models contain a very large list of words and their probability of occurrence in a given sequence.  They are used in dictation applications.  A Grammar is a much smaller file containing sets of predefined combinations of words.  Grammars are used in IVR or desktop Command and Control applications.  Each word in a Language Model or Grammar has an associated list of phonemes (which correspond to the distinct sounds that make up a word).
  • Acoustic Model – Contains a statistical representation of the distinct sounds that make up each word in the Language Model or Grammar.  Each distinct sound corresponds to a phoneme.
  • Decoder – Software program that takes the sounds spoken by a user and searches the Acoustic Model for the equivalent sounds.  When a match is made, the Decoder determines the phoneme corresponding to the sound.  It keeps track of the matching phonemes until it reaches a pause in the user's speech.  It then searches the Language Model or Grammar file for the equivalent series of phonemes.  If a match is made, it returns the text of the corresponding word or phrase to the calling program.
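The Decoder's phoneme-to-word lookup step can be sketched in a few lines of Python. This is a toy illustration, not the Julius decoder: it greedily matches the longest known pronunciation at each position in a recognized phoneme sequence. The dictionary entries are taken from the pronunciations used later in this article.

```python
# Toy sketch of a decoder's final lookup step: map a recognized
# phoneme sequence back to words via a pronunciation dictionary.
# (Illustrative only; the real Julius decoder searches statistically.)
PRONUNCIATIONS = {
    ("m", "uw", "v"): "MOVE",
    ("l", "uw", "k"): "LOOK",
    ("l", "eh", "f", "t"): "LEFT",
}

def decode(phonemes):
    """Greedily match the longest pronunciation at each position."""
    words, i = [], 0
    while i < len(phonemes):
        for j in range(len(phonemes), i, -1):
            word = PRONUNCIATIONS.get(tuple(phonemes[i:j]))
            if word:
                words.append(word)
                i = j
                break
        else:
            return None  # no match: utterance rejected
    return words

print(decode(["m", "uw", "v", "l", "eh", "f", "t"]))  # ['MOVE', 'LEFT']
```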

Although Julian uses Acoustic Models created with the HTK toolkit, it uses its own Grammar definition format.

Grammar

A recognition Grammar essentially defines constraints on what the SRE can expect as input.  It is a list of words and/or phrases that the SRE listens for.  When one of these predefined words or phrases is heard, the SRE returns the word or phrase to the calling program – usually a Dialog Manager (but could also be a script written in Perl, Python, etc.).  The Dialog Manager then does some processing based on this word or phrase.

The example video shown above is that of a voice-operated interface for robot control.  If the SRE hears the sequence of words ‘Chippu Move Forward’, it returns the textual representation of this phrase to the Dialog Manager, which then produces the control signals to turn the motors.

It is very important to understand that the words that you can use in your Grammar are limited to the words that you have ‘trained’ in your Acoustic Model.  The two are tied very closely together.

Acoustic Model

An Acoustic Model is a file that contains a statistical representation of each distinct sound that makes up a spoken word.  It must contain the sounds for each word used in your grammar.  The words in your grammar give the SRE the sequence of sounds it must listen for.  The SRE then listens for the sequence of sounds that make up a particular word, and when it finds a particular sequence, returns the textual representation of the word to the calling program (usually a Dialog Manager).  Thus, when an SRE is listening for words, it is actually listening for the sequence of sounds that make up one of the words you defined in your Grammar.  The Grammar and the Acoustic Model work together.

Therefore, when you train your Acoustic Model to recognize the phrase ‘CHIPPU MOVE FORWARD’, the SRE is actually listening for the phoneme sequence “ch” “iy” “p” “ax” “m” “uw” “v”  “f” “ao” “r” “w” “er” and “d”.  If you say each of these phonemes aloud in sequence, it will give you an idea of what the SRE is looking for.
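This word-to-phoneme expansion can be spelled out concretely. The sketch below uses the pronunciations from this article's .voca file to reproduce the phoneme string the SRE listens for when it expects ‘CHIPPU MOVE FORWARD’:

```python
# Pronunciations from the article's .voca file, as Python data.
VOCA = {
    "CHIPPU": ["ch", "iy", "p", "ax"],
    "MOVE": ["m", "uw", "v"],
    "FORWARD": ["f", "ao", "r", "w", "er", "d"],
}

phrase = ["CHIPPU", "MOVE", "FORWARD"]
# Concatenate the phonemes of each word in the phrase.
sequence = [p for word in phrase for p in VOCA[word]]
print(" ".join(sequence))  # ch iy p ax m uw v f ao r w er d
```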

Commercial SREs use large databases of speech audio to create their Acoustic Models.  Because of this, most common words that might be used in a Grammar are already included in their Acoustic Model.

When creating your own Acoustic Models and Grammars, you need to make sure that all the phonemes that make up the words in your Grammar are included in your Acoustic Model.

Background – Julian Grammars

In Julian, a recognition grammar is separated into two files:

  • the “.grammar” file which defines a set of rules governing the words the SRE is expected to recognize;  rather than listing out each word in the .grammar file, a Julian grammar file uses “Word Categories” – which is the name for a list of words to be recognized (which are defined in a separate “.voca” file);
  • the “.voca” file which defines the actual “Word Candidates” in each Word Category and their pronunciation information (Note: the phonemes that make up this pronunciation information must be the same as will be used to train your Acoustic Model).

.grammar file

The rules governing the allowed words are defined in the .grammar file using a modified BNF format.  A .grammar specification in Julian uses a set of derivation rules, written as:

Symbol: [expression with Symbols]

where:

  • Symbol is a nonterminal; and
  • [expression with Symbols] is an expression which consists of sequences of Symbols, which can be terminals and/or nonterminals.

terminal is BNF jargon for a symbol that represents a constant value.  It never appears to the left of the colon.  In Julian, terminals represent Word Categories – lists of words that are further defined in a separate “.voca” file.

nonterminal is BNF jargon for a symbol that can be expressed in terms of other symbols.  It can be replaced as a result of substitution rules.

For example, look at the following derivation rules:

S : NS_B MOVE NS_E
MOVE: NAME ACTION DIRECTION

In this example, “S” is the initial sentence symbol.  NS_B and NS_E correspond to the silence that occurs just before and just after the utterance you want to recognize.  “S”, “NS_B” and “NS_E” are required in all Julian grammars.

“NS_B”, “NS_E”, “NAME”, “ACTION” and “DIRECTION” are terminals, and represent Word Categories that must be defined in the “.voca” file.  In the “.voca” file, “ACTION” corresponds to two words, “MOVE” and “LOOK”, and their pronunciations.  “NAME” corresponds to the word “CHIPPU”.  “DIRECTION” corresponds to four words, “LEFT”, “RIGHT”, “FORWARD” and “BACKWARDS”, and their pronunciations.

“MOVE” is a nonterminal, and does not have any definition in the .voca file.  It does have a further definition in the .grammar file, where it is replaced by the expression “NAME ACTION DIRECTION”.  All nonterminals must be further defined in the .grammar file until they are finally represented by terminals (which are then defined in the .voca file as Word Categories).
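The substitution process can be sketched in Python: nonterminals are expanded via the .grammar rules until only terminals (Word Categories) remain. This is an illustrative sketch of BNF-style substitution, not the mkdfa.pl implementation.

```python
# The article's .grammar rules: nonterminal -> sequence of symbols.
RULES = {
    "S": ["NS_B", "MOVE", "NS_E"],
    "MOVE": ["NAME", "ACTION", "DIRECTION"],
}

def expand(symbol):
    """Recursively replace nonterminals until only terminals remain."""
    if symbol not in RULES:   # terminal: a Word Category
        return [symbol]
    result = []
    for s in RULES[symbol]:
        result.extend(expand(s))
    return result

print(expand("S"))  # ['NS_B', 'NAME', 'ACTION', 'DIRECTION', 'NS_E']
```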

With Julian, only one Substitution Rule per line is permitted, with the colon “:” as the separator.  Alphanumeric ASCII characters and the underscore are permitted in Symbol names, and these are case-sensitive.

.voca file

The “.voca” file contains Word Definitions for each Word Category defined in the .grammar file.

Each Word Category must be defined with “%” preceding it.  Word Definitions in each Word Category are then defined one per line.  The first column is the string which will be output when recognized, and the rest is the pronunciation.  Spaces and/or tabs can act as field separators.

Format:

%[Word Category]
[Word Definition]   [pronunciation ...]
...

For example, the Word Categories “NS_B”, “NS_E”, “NAME”, “ACTION” and “DIRECTION” were referenced in the “.grammar” file above and are defined in a “.voca” file as follows:

% NS_B
<s>        sil

% NS_E
</s>       sil

% NAME
CHIPPU     ch iy p ax

% ACTION
MOVE       m uw v
LOOK       l uw k

% DIRECTION
FORWARD    f ao r w er d
BACKWARDS  b ae k w er d z
LEFT       l eh f t
RIGHT      r ay t

In the above example, the NS_B and NS_E Word Categories each have one Word Definition with a silence model (‘sil’ is a special silence model defined in your Acoustic Model).  These correspond to the head and tail silence in speech input.

“ACTION” is broken out into two words, “MOVE” and “LOOK”, with pronunciation information, which are the phonemes that make up the words to be recognized (and which correspond to phonemes that will be included in your Acoustic Model).  “DIRECTION” is broken out into four words, “FORWARD”, “BACKWARDS”, “LEFT” and “RIGHT”, and their phonemes.

If you have words with different pronunciations, simply create the additional entries on separate lines for the same word but with the different pronunciation.
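For instance, a word with two accepted pronunciations would get two entries. (The word “READ” and its phoneme spellings below are a hypothetical illustration, not part of this article's grammar.)

```
% ACTION
READ    r iy d
READ    r eh d
```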

The .grammar and .voca files working together

Julian needs a predefined word lattice file where each word and each word-to-word transition is listed explicitly.  We get this by compiling the “.grammar” and “.voca” files together to generate the word lattice file (actually it is two files, but more on that later) with a script.  The mkdfa.pl script does this by looking for the Initial Sentence Symbol “S” in the .grammar file and replacing the Word Categories with all the possible Word Candidates from the .voca file, and making a predefined list of all the possible combinations of words and phrases Julian must recognize.  In this case, the list of all possible sentences would be:

<s> CHIPPU MOVE FORWARD </s>
<s> CHIPPU MOVE BACKWARDS </s>
<s> CHIPPU MOVE RIGHT </s>
<s> CHIPPU MOVE LEFT </s>
<s> CHIPPU LOOK FORWARD </s>
<s> CHIPPU LOOK BACKWARDS </s>
<s> CHIPPU LOOK LEFT </s>
<s> CHIPPU LOOK RIGHT </s>
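The enumeration that mkdfa.pl effectively performs can be sketched as a cross product: every Word Category in the expanded sentence is replaced by each of its Word Candidates in turn. This sketch reproduces the eight sentences above; it is an illustration, not the mkdfa.pl algorithm.

```python
from itertools import product

# Word Categories and their Candidates, from the article's .voca file.
CATEGORIES = {
    "NS_B": ["<s>"],
    "NAME": ["CHIPPU"],
    "ACTION": ["MOVE", "LOOK"],
    "DIRECTION": ["FORWARD", "BACKWARDS", "LEFT", "RIGHT"],
    "NS_E": ["</s>"],
}
# The sentence "S" after all nonterminals have been expanded.
SENTENCE = ["NS_B", "NAME", "ACTION", "DIRECTION", "NS_E"]

sentences = [" ".join(words)
             for words in product(*(CATEGORIES[c] for c in SENTENCE))]
for s in sentences:
    print(s)
# 1 name x 2 actions x 4 directions = 8 possible sentences
```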

Compiling your Grammar

The .grammar and .voca files now need to be compiled into “.dfa”  and “.dict” files so that Julian can use them.  This is done using the Julian “mkdfa.pl” grammar compiler. The .grammar and .voca files need to have the same file prefix, and this prefix is then specified to the mkdfa.pl script.   Compile your files (sample.grammar and sample.voca) as follows:

mkdfa.pl sample

The generated sample.dfa and sample.term files contain finite automaton information, and the sample.dict file contains word dictionary information.  All are in Julian format.

Now run Julius to see the speech recognition in action:

julius -quiet -input mic -C julian.jconf 2>/dev/null