Topic: Discovering words in a random sentence (Read 676 times)
Earendil
Newbie
Posts: 46
Discovering words in a random sentence
« on: Jun 15th, 2005, 6:48am »
Some finite words (over a finite alphabet) are selected without your knowledge. A sentence is then constructed by repeatedly selecting a word, independently of the words already chosen, according to some unknown probability function. Given an infinitely long sentence, how do you discover what the words are and what the probability of choosing each one of them is?

Example: the words could be (2,0,1), (2,1), (3,0,0). A sentence could be: 30030030020121201212120130021201300...
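For concreteness, here is a minimal sketch of the generative process just described (Python; the word set is the one from the example above, while the probability values are arbitrary placeholders):

[code]
# Sketch of the setup: pick words independently with fixed (unknown to the solver)
# probabilities and concatenate them.  The word set is the example from the post;
# the probabilities below are made-up illustrative values.
import random

WORDS = ["201", "21", "300"]
PROBS = [0.4, 0.3, 0.3]

def make_sentence(n_words: int) -> str:
    """Concatenate n_words independent draws from the word distribution."""
    return "".join(random.choices(WORDS, weights=PROBS, k=n_words))

print(make_sentence(12))    # e.g. 300300201212130021...
[/code]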
Barukh
Uberpuzzler
Posts: 2276
Re: Discovering words in a random sentence
« Reply #1 on: Jun 15th, 2005, 7:13am »
Some clarifications are required... I have one question: May a word be a prefix of another word?
towr
wu::riddles Moderator Uberpuzzler
Some people are average, some are just mean.
Posts: 13730
Re: Discovering words in a random sentence
« Reply #2 on: Jun 15th, 2005, 8:43am »
It's easy enough to find pieces of words that are repeated. But there is no way to tell where word boundaries are, unless there are extra constraints (e.g. no word is a prefix of another). There is no reason not to just say that each word consists of exactly one letter. Or just take the lexicon to be every combination of two letters, or three, etc. You get a different probability function for each assumption, but you always get one. And since the right one is unknown, that can't help you distinguish which lexicon is the right one anyway. Some context, some meaning to the sentence, might be nice.
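As a small illustration of this point, the trivial one-letter lexicon always yields a probability function, namely the letter frequencies of the observed sentence (a sketch in Python; the function name is just illustrative):

[code]
# The one-letter lexicon: every single letter is declared a word, and the
# probability of each "word" is simply its relative frequency in the sentence.
from collections import Counter

def one_letter_model(sentence: str) -> dict:
    counts = Counter(sentence)
    return {letter: c / len(sentence) for letter, c in counts.items()}

print(one_letter_model("30030030020121201212120130021201300"))
[/code]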
Earendil
Newbie
Posts: 46
Re: Discovering words in a random sentence
« Reply #3 on: Jun 15th, 2005, 9:22am »
on Jun 15th, 2005, 7:13am, Barukh wrote:
    Some clarifications are required... I have one question: May a word be a prefix of another word?

A word may not be a suffix of another word.

on Jun 15th, 2005, 8:43am, towr wrote:
    There is no reason not to just say that each word consists of exactly one letter. Or just take the lexicon to be every combination of two letters, or three, etc.

You can guide yourself by the Principle of Maximum Likelihood... (the parameters that maximize the probability of obtaining the sample are the best ones).
« Last Edit: Jun 15th, 2005, 9:24am by Earendil »
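One way to make the maximum-likelihood suggestion concrete: for a candidate lexicon with candidate probabilities, compute the probability that some sequence of independently drawn words spells out the observed prefix, summing over all segmentations by dynamic programming. A sketch (Python; the two hypotheses compared below are illustrative, not claimed to be optimal):

[code]
# Likelihood of an observed prefix under a hypothesised lexicon: sum over all
# ways of segmenting the prefix into complete words, multiplying word probabilities.
def prefix_likelihood(prefix: str, lexicon: dict) -> float:
    """lexicon maps word -> probability; returns P(a word sequence spells the prefix)."""
    n = len(prefix)
    like = [0.0] * (n + 1)
    like[0] = 1.0                                   # empty prefix: probability 1
    for i in range(1, n + 1):
        for word, p in lexicon.items():
            if i >= len(word) and prefix[i - len(word):i] == word:
                like[i] += like[i - len(word)] * p
    return like[n]

data = "30030030020121201212120130021201300"        # the example sentence's visible prefix
print(prefix_likelihood(data, {"201": 1/3, "21": 1/3, "300": 1/3}))       # hypothesis A
print(prefix_likelihood(data, {"0": 0.4, "1": 0.2, "2": 0.2, "3": 0.2}))  # hypothesis B
[/code]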
towr
wu::riddles Moderator Uberpuzzler
Some people are average, some are just mean.
Posts: 13730
Re: Discovering words in a random sentence
« Reply #4 on: Jun 15th, 2005, 10:12am »
You're suggesting a hidden Markov model?
Earendil
Newbie
Posts: 46
Re: Discovering words in a random sentence
« Reply #5 on: Jun 15th, 2005, 12:14pm »
on Jun 15th, 2005, 10:12am, towr wrote:
    You're suggesting a hidden Markov model?

The words don't necessarily all have the same length...
JocK
Uberpuzzler
Posts: 877
Re: Discovering words in a random sentence
« Reply #6 on: Jun 15th, 2005, 1:24pm »
The remark that subsequent words are chosen independently of previous words is crucial here. It means that the whole sequence should fit a Markov chain model whose transition probabilities between subsequent words satisfy, for each word y and every preceding word x:

P(y|x) = P(y)

A simple example: suppose I construct an infinite sequence by randomly selecting the words "ab" and "ba" with equal probabilities. The assumption that the words are "a" and "b" would lead to a Markov chain with transition probabilities P(a|a) = 1/4 ≠ 3/4 = P(a|b) and P(b|b) = 1/4 ≠ 3/4 = P(b|a). This result is clearly in conflict with the given statistical independence of subsequent words.

However, starting from the assumption that the words are "ab" and "ba" leads to the Markov chain actually used in the construction of the data, i.e. the chain with P(ab|ab) = P(ab|ba) = P(ab) = 1/2 and P(ba|ba) = P(ba|ab) = P(ba) = 1/2, which satisfies the requirement of statistical independence of subsequent words.

Of course, in this example one could also construct a valid Markov model based on the words "abab", "abba", "baab" and "baba". The choice of the most likely Markov model explaining the data could be based on the entropy of the model:

H = -Σ_x P(x) ln P(x)

For the 4-word model this gives an entropy of ln(4), whilst for the 2-word model the entropy is ln(2). I assume Earendil wants us to pick the Markov chain that satisfies all requirements and has the smallest entropy. How to translate this into a constructive algorithm for deriving the words is not yet clear to me.
« Last Edit: Jun 15th, 2005, 1:26pm by JocK »
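JocK's check on the "ab"/"ba" example is easy to try numerically. The sketch below (Python; it is restricted to candidate lexicons of a fixed word length, and the names are illustrative) estimates word frequencies and word-to-word transition frequencies from a long generated sequence, compares P(y|x) with P(y), and reports the entropy of the word distribution:

[code]
import math
import random
from collections import Counter

def tokens(sequence: str, word_len: int) -> list:
    """Cut the sequence into consecutive words of a fixed length."""
    return [sequence[i:i + word_len]
            for i in range(0, len(sequence) - word_len + 1, word_len)]

def independence_report(sequence: str, word_len: int) -> None:
    toks = tokens(sequence, word_len)
    marg = Counter(toks)                     # marginal word counts
    trans = Counter(zip(toks, toks[1:]))     # consecutive word-pair counts
    total = len(toks)
    entropy = -sum((c / total) * math.log(c / total) for c in marg.values())
    print(f"word length {word_len}:  entropy = {entropy:.3f}")
    for (x, y), c in sorted(trans.items()):
        print(f"  P({y}|{x}) = {c / marg[x]:.2f}   vs   P({y}) = {marg[y] / total:.2f}")

seq = "".join(random.choice(["ab", "ba"]) for _ in range(20000))
independence_report(seq, 1)   # one-letter words: P(y|x) is roughly 1/4 or 3/4, not P(y) = 1/2
independence_report(seq, 2)   # two-letter words: P(y|x) ~ P(y) ~ 1/2, entropy ~ ln 2 ~ 0.69
[/code]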