Speech Research >> SphinxTrain: beginner's questions.

by Jack » Fri, 05 Nov 2004 22:48:55 GMT

I have a couple of questions about using SphinxTrain.

I use the "raw" option of wave2feat to extract feature files from wav
files. Is this correct?

agg_seg writes 133087 frames.

I use -stride 1 on both agg_seg and kmeans_init but kmeans_init reports 8
empty clusters. I don't have a good quality microphone.

Running norm after the first bw, I get the message "(mgau= 0, feat= 2,
density= xxx) never observed" for 45 densities and also "(mgau= 0, feat= 2,
density=39, component=0) < 0" for 2 components.

I was also expecting norm to print out the overall likelihood per frame
and convergence ratio (that's what it shows on the troubleshooting web
page fr6.html). However, when I run norm I don't see that information.

How can I calculate the convergence ratio myself?

Jack.

by James Salsman » Sat, 06 Nov 2004 05:49:30 GMT



Not sure; is there a "cepstral" or "mfcc" option?


Given how much input?


Well, get a better mic. Do you want to fill in those of us who haven't
used SphinxTrain in a long time on what you think you are doing there,
what the manual says you are doing, and whether you think those are
the same thing? Please also post the URL to the manual for the version
you are using.


Full URL please.


My first impulse is to suggest using AWK on the output, but who uses
AWK these days? Use Perl or Tcl.

Sincerely,
James
--
www.readsay.com - maker of the ReadSay PROnounce English literacy system
400 MHz PDA included: $499 -- http://www.readsay.com/PROnounce.html

by Jack » Sat, 06 Nov 2004 21:17:03 GMT


Thanks James for your reply. I'm attempting to train a semi-continuous
small vocabulary model using SphinxTrain-0.9.1-beta on x86. I am following
the document at http://www-2.cs.cmu.edu/~rsingh/sphinxman/s3manual.html
which has links to "Instruction set for training" and "Troubleshooting:
tools and logfiles".

I use /usr/bin/record to record wav files at 16000 Hz and use wave2feat to
extract MFCCs. I think the feature files are correct because cepview shows
the vectors having values in acceptable ranges according to the
troubleshooting doc, i.e. the first component typically starts off at about
12 and then continues in the range 5 - 10. The other components are in the
range -1 to +1. However, I just wanted to double-check. For wave2feat you
have to specify the audio file as either -nist or -raw. I'm using -raw.

Training data comprises 189 utterances or 43M of wav files resulting in
133087 frames.
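
As a rough sanity check, the frame count can be compared against the amount of audio. The sketch below assumes 16-bit mono samples and 100 feature frames per second; neither figure is stated above, so treat both as assumptions.

```python
# Rough consistency check between frame count and audio size.
# Assumptions (not stated in the post): 16-bit mono PCM, 100 frames per second.
SAMPLE_RATE_HZ = 16000      # from the post
BYTES_PER_SAMPLE = 2        # assumed: 16-bit mono
FRAMES_PER_SECOND = 100     # assumed: 10 ms frame shift


def expected_audio_bytes(num_frames):
    """Approximate size of the raw audio that yields num_frames frames."""
    seconds = num_frames / FRAMES_PER_SECOND
    return seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE


if __name__ == "__main__":
    size = expected_audio_bytes(133087)
    print(f"{size / 1e6:.1f} MB of audio")
```

Under those assumptions, 133087 frames correspond to roughly 42.6 MB of samples, which is in line with the 43M of wav files. A large mismatch here would hint that wave2feat is reading the files with the wrong format.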

I have previously gone through the entire training sequence once (CI,
CD-untied, CD-tied training and conversion to Sphinx2 format). But I don't
get any recognition when using sphinx2-continuous, so I'm trying again.

One problem was that I wasn't getting any indication of convergence so I
ran the bw/norm step 7 times for each section of training.

My main question is regarding the 8 "empty cluster" messages at
kmeans_init. The troubleshooting log suggests that this can happen if the
feature files are very small, byte swapped or contain garbage. This is why
I'm asking if I'm doing wave2feat extraction correctly. And are the "never
observed" messages at normalization a consequence of the "empty clusters"?

As for calculating convergence ratios, I was thinking of writing my own
C program to do that from successive iterations of the means/vars/mixw/tmat
files. What exactly do I have to calculate?
Surely this must have been done already.
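
One common way to define a convergence ratio is the relative change in the overall log-likelihood per frame between consecutive bw iterations. The sketch below assumes that definition and assumes the per-iteration likelihoods have already been collected by hand; the function name and the example numbers are made up for illustration and are not taken from SphinxTrain.

```python
def convergence_ratios(likelihoods):
    """Relative change in likelihood between consecutive iterations.

    Assumed definition: (current - previous) / abs(previous).
    `likelihoods` holds one overall per-frame log-likelihood per iteration.
    """
    ratios = []
    for prev, curr in zip(likelihoods, likelihoods[1:]):
        if prev == 0:
            raise ValueError("previous likelihood is zero; ratio undefined")
        ratios.append((curr - prev) / abs(prev))
    return ratios


if __name__ == "__main__":
    # Hypothetical per-frame log-likelihoods from three iterations.
    example = [-10.0, -8.0, -7.6]
    for i, r in enumerate(convergence_ratios(example), start=2):
        print(f"iteration {i}: ratio {r:.3f}")
```

With this definition the ratio should stay positive and shrink as training converges; a negative value means the likelihood went down between iterations.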

The specs for my microphone are: Sensitivity -54 +/-3dB; Output impedance
2.2K ohms; Frequency response 20Hz ~ 20KHz; Operating voltage 1.5V ~ 10V;
Sensitivity reduction within -3dB at 1V.

Thanks, Jack.

by sl_jerry1 » Sun, 07 Nov 2004 06:13:43 GMT

Suggestion: post your questions to the Help Forum for the Sphinx
project at SourceForge: http://sourceforge.net/forum/?group_id=1904 .
You'll reach a much more qualified audience.

cheers,
jerry wolf

by James Salsman » Thu, 11 Nov 2004 07:24:59 GMT


Thanks for the URL! Those docs are much improved from what I remember.
Someone at CMU must be reading this newsgroup.


If your files are .wav files, they might have a header, and the length
of that header might be shorter than the -nist header. What I'm getting
at is: make sure you aren't coding the header as audio. The .wav file
header is about 44 bytes (sometimes 56 under a rare condition, IIRC).
Also make sure you aren't chopping off any more than just the header.
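
To see how long the header actually is in a given file, one option is to walk the RIFF chunks and report where the sample data starts. This is only a sketch using the Python standard library; the function name is made up for illustration, and it assumes an ordinary little-endian RIFF/WAVE file.

```python
import struct


def wav_data_offset(path):
    """Return (offset, size) of the 'data' chunk payload in a RIFF/WAVE file."""
    with open(path, "rb") as f:
        head = f.read(12)
        if len(head) < 12:
            raise ValueError("file too short to be a WAVE file")
        riff, _, wave_id = struct.unpack("<4sI4s", head)
        if riff != b"RIFF" or wave_id != b"WAVE":
            raise ValueError("not a RIFF/WAVE file")
        while True:
            chunk_head = f.read(8)
            if len(chunk_head) < 8:
                raise ValueError("no data chunk found")
            chunk_id, chunk_size = struct.unpack("<4sI", chunk_head)
            if chunk_id == b"data":
                # Samples start right after this 8-byte chunk header.
                return f.tell(), chunk_size
            # Skip other chunks; chunks are padded to an even length.
            f.seek(chunk_size + (chunk_size & 1), 1)
```

The returned offset is the number of bytes that precede the first sample, so it can be compared directly with whatever header length the feature extraction step is skipping.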


About how many phones per utterance, roughly? And how many words?


Did you try this? http://www-2.cs.cmu.edu/~rsingh/sphinxman/FAQ.html#22


That could indicate a data format error, but it could be something else.


There seems to be a -mach_endian switch for the decoder, but not kmeans?
Are you using the same system for each step?
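
One way to test for byte swapping is to see which byte order makes the file's own header agree with its size. The sketch below assumes the feature file begins with a 4-byte integer that counts the 4-byte floats that follow; that layout is an assumption and should be checked against your own wave2feat output. The function name and the file name in the usage note are made up for illustration.

```python
import os
import struct


def guess_feature_byte_order(path):
    """Return 'little', 'big', or None for a feature file.

    Assumed layout: a 4-byte integer header holding the number of
    4-byte float values that follow it.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        raw = f.read(4)
    if len(raw) < 4:
        raise ValueError("file too short to contain a header")
    expected = (size - 4) // 4  # number of 4-byte values after the header
    for name, fmt in (("little", "<i"), ("big", ">i")):
        (count,) = struct.unpack(fmt, raw)
        if count == expected:
            return name
    return None  # header matches neither byte order
```

Comparing the result for a file such as utt.mfc with `sys.byteorder` shows whether the file matches the machine running kmeans_init; a result of None suggests the file is truncated or is not in the assumed format.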


Parse out the variances with AWK or something and make sure they
decrease.


-54 dB is fine in a quiet room.

Jerry's suggestion to try the Sphinx forums on SourceForge was a good
one.

Sincerely,
James
