wu :: forums (http://www.ocf.berkeley.edu/~wwu/cgi-bin/yabb/YaBB.cgi)
riddles >> hard >> Discovering words in a random sentence
(Message started by: Earendil on Jun 15th, 2005, 6:48am)

Title: Discovering words in a random sentence
Post by Earendil on Jun 15th, 2005, 6:48am
Some finite set of words (over a finite alphabet) is selected without your knowledge. A sentence is then constructed by repeatedly selecting a word, independently of the words already chosen, according to some unknown probability distribution.

Given an infinitely long sentence, how do you discover what the words are and the probability of choosing each one of them?

Example:
words can be:

(2,0,1)
(2,1)
(3,0,0)

A sentence could be:

30030030020121201212120130021201300...

Title: Re: Discovering words in a random sentence
Post by Barukh on Jun 15th, 2005, 7:13am
Some clarifications are required... I have one question:

May a word be a prefix of another word?

Title: Re: Discovering words in a random sentence
Post by towr on Jun 15th, 2005, 8:43am
It's easy enough to find pieces of words that are repeated. But there is no way to tell where the word boundaries are, unless there are extra constraints (e.g. no word is a prefix of another).

There is no reason not to simply say that each word consists of exactly one letter. Or take the lexicon to be every combination of two letters, or three, etc.

You get a different probability function for each assumption, but you always get one. And since the right one is unknown, that can't help you distinguish which lexicon is correct anyway.

Some context, meaning, to the sentence might be nice.

Title: Re: Discovering words in a random sentence
Post by Earendil on Jun 15th, 2005, 9:22am

on 06/15/05 at 07:13:13, Barukh wrote:
Some clarifications are required... I have one question:

May a word be a prefix of another word?


A word may not be a suffix of another word.


on 06/15/05 at 08:43:13, towr wrote:
There is no reason not to just say that each word consist of exactly one letter. Or just take the lexicon to be every combination of two letters, or three, etc.


You can guide yourself on the Principle of Maximum Likelihood...

(The parameters which maximize the probability of obtaining the sample are the best ones)
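To make the maximum-likelihood idea concrete, here is a small sketch (the lexicons and the greedy segmentation are my own simplifying assumptions, not part of the puzzle): for a candidate lexicon, segment the sentence into words, take the empirical word frequencies as the ML probability estimates, and score the lexicon by the resulting log-likelihood.

```python
from collections import Counter
import math

def segment(sentence, lexicon):
    """Greedy left-to-right segmentation; assumes the lexicon
    segments the sentence unambiguously (a simplification)."""
    words, i = [], 0
    while i < len(sentence):
        for w in lexicon:
            if sentence.startswith(w, i):
                words.append(w)
                i += len(w)
                break
        else:
            return None  # this lexicon cannot explain the sentence
    return words

def log_likelihood(sentence, lexicon):
    """ML word probabilities are the empirical frequencies of the
    segmentation; return the resulting log-likelihood (or None)."""
    words = segment(sentence, lexicon)
    if words is None:
        return None
    counts = Counter(words)
    n = len(words)
    return sum(c * math.log(c / n) for c in counts.values())

# Hypothetical fragment built from the words 300 and 201:
print(log_likelihood("300201300201", ["300", "201", "21"]))
```

Note that this only ranks lexicons that are held fixed in advance; as towr points out, richer lexicons (e.g. all single letters) also segment the sentence, so likelihood alone does not settle which lexicon is right.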

Title: Re: Discovering words in a random sentence
Post by towr on Jun 15th, 2005, 10:12am
You're suggesting a hidden markov model?

Title: Re: Discovering words in a random sentence
Post by Earendil on Jun 15th, 2005, 12:14pm

on 06/15/05 at 10:12:44, towr wrote:
You're suggesting a hidden markov model?


The words don't necessarily all have the same length...

Title: Re: Discovering words in a random sentence
Post by JocK on Jun 15th, 2005, 1:24pm
The remark that subsequent words are chosen independently of previous words is crucial here. It means that the whole sequence should fit a Markov chain model whose transition probabilities for subsequent words satisfy, for each word y and all preceding words x: P(x)(y) = P(y).

A simple example:

Suppose I would construct an infinite sequence based on the random selection of the words ''ab'' and ''ba'' with equal probabilities.

The assumption that the words are ''a'' and ''b'' would lead to a Markov chain with transition probabilities P(a)(a) = 1/4 =/= 3/4 = P(b)(a), and P(b)(b) = 1/4 =/= 3/4 = P(a)(b). This result is clearly in conflict with the given statistical independence of subsequent words.

However, starting from the assumption that the words are ''ab'' and ''ba'' would lead to the Markov chain actually used in the construction of the data. I.e. the chain with P(ab)(ab) = P(ba)(ab) = P(ab) = 1/2,  and P(ba)(ba) = P(ab)(ba) = P(ba) = 1/2, satisfying the requirement of statistical independence of subsequent words.

Of course, in this example, one could also construct a valid Markov model based on the words ''abab'', ''abba'', ''baab'' and ''baba''.
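JocK's transition probabilities under the wrong one-letter hypothesis can be checked empirically. A sketch under the stated assumptions (sequence built from ''ab'' and ''ba'' with equal probability; the hypothesis that the words are ''a'' and ''b''):

```python
import random
from collections import Counter

random.seed(0)
# Build a long sequence by drawing the words "ab" and "ba"
# independently with equal probability.
seq = "".join(random.choice(["ab", "ba"]) for _ in range(100000))

# Under the (wrong) hypothesis that the words are "a" and "b",
# estimate the letter-to-letter transition probabilities P(x)(y).
pairs = Counter(zip(seq, seq[1:]))
totals = Counter(seq[:-1])
P = {(x, y): pairs[(x, y)] / totals[x] for (x, y) in pairs}

for (x, y) in sorted(P):
    print(f"P({x})({y}) = {P[(x, y)]:.3f}")
```

The estimates come out near P(a)(a) = P(b)(b) = 1/4 and P(a)(b) = P(b)(a) = 3/4, confirming that the one-letter hypothesis violates the independence requirement P(x)(y) = P(y) = 1/2.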

The choice of the most likely Markov model explaining the data could be based on the entropy of the model:

H = -Sum_x P(x) ln(P(x))

For the 4-word model this leads to an entropy of ln(4), whilst for the 2-word model the entropy would be ln(2).

I assume Earendil wants us to pick the Markov chain that satisfies all requirements and has the smallest entropy.

How to translate this into a constructive algorithm for deriving the words is not clear to me yet.
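The entropy comparison above is straightforward to compute. A minimal sketch for JocK's example (the two candidate word distributions are taken directly from his post; the minimum-entropy selection rule is his conjecture, not an established result):

```python
import math

def entropy(probs):
    """Shannon entropy H = -sum_x P(x) ln P(x) of a word distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Candidate models consistent with the "ab"/"ba" data:
two_word = {"ab": 0.5, "ba": 0.5}
four_word = {"abab": 0.25, "abba": 0.25, "baab": 0.25, "baba": 0.25}

for name, model in [("2-word", two_word), ("4-word", four_word)]:
    print(name, entropy(model.values()))
```

The 2-word model attains entropy ln(2) and the 4-word model ln(4), so the minimum-entropy criterion picks the 2-word lexicon actually used to generate the data.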





Powered by YaBB 1 Gold - SP 1.4!
Forum software copyright © 2000-2004 Yet another Bulletin Board