Title: Discovering words in a random sentence Post by Earendil on Jun 15th, 2005, 6:48am Some finite set of words (over a finite alphabet) is selected without your knowledge. A sentence is then constructed by repeatedly selecting a word, independently of the words already chosen, according to some unknown probability function. Given an infinitely long sentence, how do you discover what the words are and what the probability of choosing each one is? Example: the words could be: (2,0,1) (2,1) (3,0,0) A sentence could be: 30030030020121201212120130021201300... |
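A minimal sketch (in Python, with made-up word probabilities; nothing here is specified by the puzzle) of how such a sentence could be generated from the example lexicon:
[code]
import random

# Example lexicon from the post: (2,0,1), (2,1), (3,0,0) written as digit strings.
words = ["201", "21", "300"]
probs = [0.3, 0.3, 0.4]   # the puzzle leaves these unknown; assumed for illustration

def make_sentence(n_words):
    # Each word is drawn independently of the words already chosen.
    return "".join(random.choices(words, weights=probs, k=n_words))

print(make_sentence(12))   # e.g. 300300300201212012...
[/code]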
||||
Title: Re: Discovering words in a random sentence Post by Barukh on Jun 15th, 2005, 7:13am Some clarifications are required... I have one question: May a word be a prefix of another word? |
||||
Title: Re: Discovering words in a random sentence Post by towr on Jun 15th, 2005, 8:43am It's easy enough to find pieces of words that are repeated. But there is no way to tell where the word boundaries are, unless there are extra constraints (e.g. no word is a prefix of another). There is no reason not to just say that each word consists of exactly one letter. Or just take the lexicon to be every combination of two letters, or three, etc. You get a different probability function for each assumption, but you always get one. And since the right one is unknown, it can't help you distinguish which lexicon is the right one anyway. Some context, some meaning, to the sentence might be nice. |
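To illustrate towr's point, here is a small sketch (Python; the example sentence is taken from the first post): treating every single letter as a word always yields some probability function, namely the empirical letter frequencies, so the data alone cannot rule that lexicon out.
[code]
from collections import Counter

def one_letter_model(sentence):
    # The trivial lexicon: every letter is a word; its probability
    # is just its relative frequency in the observed sentence.
    counts = Counter(sentence)
    return {letter: c / len(sentence) for letter, c in counts.items()}

print(one_letter_model("30030030020121201212120130021201300"))
[/code]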
||||
Title: Re: Discovering words in a random sentence Post by Earendil on Jun 15th, 2005, 9:22am on 06/15/05 at 07:13:13, Barukh wrote: May a word be a prefix of another word?
A word may not be a suffix of another word. on 06/15/05 at 08:43:13, towr wrote:
You can guide yourself by the Principle of Maximum Likelihood... (the parameters which maximize the probability of obtaining the sample are the best ones) |
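One way to make the maximum-likelihood idea concrete (a sketch under assumptions: Python, a candidate lexicon with guessed probabilities, neither given in the thread): the probability of an observed prefix under a candidate model is the sum, over all segmentations into lexicon words, of the products of the word probabilities, which a simple dynamic program computes; one would then prefer the lexicon and probabilities that maximize this.
[code]
def likelihood(sentence, lexicon):
    # lexicon maps word -> probability.
    # like[i] = total probability of generating sentence[:i],
    # summed over all ways to segment it into lexicon words.
    like = [0.0] * (len(sentence) + 1)
    like[0] = 1.0
    for i in range(1, len(sentence) + 1):
        for word, p in lexicon.items():
            if i >= len(word) and sentence[i - len(word):i] == word:
                like[i] += like[i - len(word)] * p
    return like[-1]

s = "30030030020121"   # a prefix of the example sentence
print(likelihood(s, {"201": 0.3, "21": 0.3, "300": 0.4}))
print(likelihood(s, {c: 0.25 for c in "0123"}))   # towr's one-letter lexicon
[/code]
Note that comparing such likelihoods naively across lexicons with different word lengths runs straight into towr's objection, so some extra penalty or constraint is still needed.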
||||
Title: Re: Discovering words in a random sentence Post by towr on Jun 15th, 2005, 10:12am You're suggesting a hidden Markov model? |
||||
Title: Re: Discovering words in a random sentence Post by Earendil on Jun 15th, 2005, 12:14pm on 06/15/05 at 10:12:44, towr wrote: You're suggesting a hidden Markov model?
Words don't necessarily have the same length... |
||||
Title: Re: Discovering words in a random sentence Post by JocK on Jun 15th, 2005, 1:24pm The remark that subsequent words are chosen independently of previous words is crucial here. It means that the whole sequence should fit a Markov chain model with transition probabilities that satisfy, for each word y and every preceding word x: P(y|x) = P(y).

A simple example: suppose I construct an infinite sequence based on the random selection of the words ''ab'' and ''ba'' with equal probabilities. The assumption that the words are ''a'' and ''b'' would lead to a Markov chain with transition probabilities P(a|a) = 1/4 ≠ 3/4 = P(a|b), and P(b|b) = 1/4 ≠ 3/4 = P(b|a). This result is clearly in conflict with the given statistical independence of subsequent words. However, starting from the assumption that the words are ''ab'' and ''ba'' leads to the Markov chain actually used in the construction of the data, i.e. the chain with P(ab|ab) = P(ab|ba) = P(ab) = 1/2 and P(ba|ba) = P(ba|ab) = P(ba) = 1/2, which satisfies the requirement of statistical independence of subsequent words.

Of course, in this example one could also construct a valid Markov model based on the words ''abab'', ''abba'', ''baab'' and ''baba''. The choice of the most likely Markov model explaining the data could be based on the entropy of the model: H = - Sum_x P(x) ln P(x). For the 4-word model this leads to an entropy of ln(4), whilst for the 2-word model the entropy would be ln(2). I assume Earendil wants us to pick the Markov chain that satisfies all requirements and has the smallest entropy. How to translate this into a constructive algorithm for deriving the words is not clear to me yet. |
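A rough sketch of JocK's consistency test (Python; the function name and the step of starting from an already-segmented word sequence are my assumptions, not from the post): given a candidate segmentation into words, estimate P(y|x) from adjacent word pairs, check it against P(y), and compute the entropy H = - Sum_x P(x) ln P(x) used to rank the surviving models.
[code]
import random
from collections import Counter
from math import log

def analyse(word_seq):
    n = len(word_seq)
    p = {w: c / n for w, c in Counter(word_seq).items()}
    pairs = Counter(zip(word_seq, word_seq[1:]))
    # Empirical P(y|x): pair count divided by how often x occurs
    # in the first position of a pair (approximately p[x] * (n - 1)).
    trans = {(x, y): c / (p[x] * (n - 1)) for (x, y), c in pairs.items()}
    entropy = -sum(q * log(q) for q in p.values())
    return p, trans, entropy

# JocK's ''ab''/''ba'' example under the correct 2-word lexicon:
seq = [random.choice(["ab", "ba"]) for _ in range(100_000)]
p, trans, H = analyse(seq)
print(H)       # close to ln(2) ~ 0.693
print(trans)   # all four P(y|x) close to P(y) = 1/2, as required
[/code]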