A Maximum Entropy Model of Phonotactics and Phonotactic Learning

Today's paper is A Maximum Entropy Model of Phonotactics and Phonotactic Learning by Bruce Hayes and Colin Wilson, published in Linguistic Inquiry in 2008.

Phonotactics is the grammar of phonemes: which sounds can combine, and in which environments. There is a famous example (from Chomsky and Halle 1965) that in English, brick is an existing word; blick doesn't exist but is "well-formed", i.e. a perfectly possible new word; and bnick is "ill-formed": it could never be an English word.

Hayes and Wilson's paper is about their phonotactic learner, which attempts to answer the question "What is necessary for humans to learn phonotactic rules?" by modelling input, initial knowledge and learnt components.

Input

The input for this phonotactic learner is positive data only. Positive data gives information about what is valid, saying nothing about what is not valid. This corresponds fairly well to the situation of human learners: no-one ever sits a child down and says "And you can't start words with *bn, or *lk, or...". Instead, we hear lots and lots of valid words, and have to figure out the constraints that prevent the invalid ones just from those.

(A constraint might be, for example, that English words cannot start *bn. The star means it is ungrammatical.)
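A constraint of this kind can be thought of as a predicate over word forms. Here is a toy sketch in Python; note that Hayes and Wilson's real constraints are stated over phonological features, not spelling, so this orthographic version is purely illustrative:

```python
def violates_bn_onset(word):
    """*bn: flag words beginning with the cluster 'bn'.
    (A toy orthographic stand-in for a feature-based constraint.)"""
    return word.startswith("bn")

words = ["brick", "blick", "bnick"]
well_formed = [w for w in words if not violates_bn_onset(w)]
# → ['brick', 'blick']
```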

Maximum Entropy model

"A maximum entropy grammar uses weighted constraints to assign probabilities to outputs."
I shall have to do further reading for the technicalities, but what it boils down to is that there isn't a yes-no answer to the question "Is this a valid word?"; instead, words fall on a gradient between fully acceptable and fully unacceptable, and this method allows them to model that gradience.

To start with, their model assumes all constraints are equally important; it then adjusts their weights until it finds the model that most closely matches the observed words.
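As a rough sketch of the idea (using made-up substring constraints and a made-up weight, rather than the paper's feature-based constraints): each word gets a penalty score summing weight times violations, and its probability is proportional to exp(-penalty).

```python
import math

def penalty(word, grammar):
    """Sum of weight x violation-count over constraints.
    Here a 'constraint' is just a forbidden substring - a big
    simplification of the feature-based constraints in the paper."""
    return sum(weight * word.count(bad) for bad, weight in grammar.items())

def maxent_prob(word, grammar, candidates):
    """P(word) proportional to exp(-penalty), normalised over `candidates`."""
    z = sum(math.exp(-penalty(c, grammar)) for c in candidates)
    return math.exp(-penalty(word, grammar)) / z

grammar = {"bn": 5.0}                     # hypothetical weight for *bn
candidates = ["brick", "blick", "bnick"]
probs = {w: maxent_prob(w, grammar, candidates) for w in candidates}
# bnick gets a much lower probability than brick or blick, but not zero:
# well-formedness is gradient, not all-or-nothing.
```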

Choosing the initial constraints

The problem with fitting the observed words as closely as possible is, of course, that the resulting grammar rules out any possible word that wasn't in the data set - such as blick, or indeed phonotactics, unless your word list is extremely long.
So instead, we need a grammar that also admits all possible words, not just the attested ones.

Optimality Theory is a linguistic theory which assumes that all the constraints are provided by Universal Grammar (UG), and that learning a language consists of ranking them until they match the data. "Universal Grammar" is the inbuilt linguistic knowledge that humans have. A large part of modern linguistics is figuring out what, if anything, is part of UG, and what is learnable given the input and other human abilities (e.g. rhythm, auditory and visual processing, and the like).

Since one of the goals of this model is to see how much is learnable (i.e. is not part of UG), the authors did not specify a list of constraints initially. Nor did they simply give it every possible constraint (there are too many unless you are deliberately looking at a restricted problem).

Instead, they gave it a list of features and formats that constraints could be in (thus assuming that those are part of UG).
They do limit the number of features, because the computation is not feasible past a certain point - which should also be a factor for human learners, though we don't yet know where that point lies.

Finding the best fit

The search space has been shown to be convex, so they can find the maximum by following the increasing gradient from any starting point; there are no local maxima to get stuck in. The gradient is a vector of partial derivatives, each corresponding to the difference between the expected and observed violation counts of a given constraint.
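This can be sketched with toy substring constraints (an illustrative assumption, not the paper's implementation): each weight moves by the difference between the model's expected violation count and the observed one, so constraints the data never violates keep gaining weight.

```python
import math

def count(word, c):
    """Violation count: occurrences of the forbidden substring c."""
    return word.count(c)

def expected_count(c, weights, candidates):
    """Expected violations of constraint c under the current MaxEnt model
    (the candidate list stands in for the space of possible words)."""
    score = lambda w: math.exp(-sum(weights[k] * count(w, k) for k in weights))
    z = sum(score(w) for w in candidates)
    return sum(score(w) / z * count(w, c) for w in candidates)

def gradient_ascent_step(weights, observed, candidates, lr=0.5):
    """One step up the log-likelihood gradient: each partial derivative is
    (expected - observed) violations, so constraints that the data violates
    less often than the model predicts get their weights pushed up."""
    return {c: w + lr * (expected_count(c, weights, candidates)
                         - sum(count(o, c) for o in observed) / len(observed))
            for c, w in weights.items()}

weights = {"bn": 0.0}
observed = ["brick", "blick"]             # positive data only
candidates = ["brick", "blick", "bnick"]
for _ in range(20):
    weights = gradient_ascent_step(weights, observed, candidates)
# the weight of *bn grows, since no observed word violates it
```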

Since the system cannot, as mentioned, use all possible constraints, it is given search heuristics.
It primarily searches for accurate constraints; among equally accurate ones it prefers shorter constraints (those referring to fewer features); and among those, it prefers constraints over more general natural classes (those covering more phonemes).
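One way to picture that ordering (a sketch with invented natural classes and toy data, not the paper's actual feature system): score each candidate by observed/expected violations, breaking ties by length and then by class size.

```python
# A candidate constraint is a tuple of natural classes (sets of phonemes);
# a word violates it wherever consecutive segments fall into those classes.
def violations(word, constraint):
    n = len(constraint)
    return sum(all(word[i + j] in cls for j, cls in enumerate(constraint))
               for i in range(len(word) - n + 1))

def oe_ratio(constraint, observed, sample):
    """Observed/expected violation ratio: low means 'accurate' - the
    constraint's violations are rarer in real data than the model predicts."""
    obs = sum(violations(w, constraint) for w in observed)
    exp = sum(violations(w, constraint) for w in sample)
    return obs / exp if exp else float("inf")

def search_order(candidates, observed, sample):
    """Most accurate first; ties broken by brevity (fewer classes), then by
    generality (larger natural classes)."""
    return sorted(candidates, key=lambda c: (oe_ratio(c, observed, sample),
                                             len(c),
                                             -min(len(cls) for cls in c)))

STOPS, NASALS, LIQUIDS = frozenset("pbtdkg"), frozenset("mn"), frozenset("lr")
observed = ["brik", "blik", "drip", "trik"]   # toy attested words
sample = ["brik", "bnik", "dmip", "blik"]     # toy sample from the model
candidates = [
    (STOPS, NASALS),           # *[stop][nasal]: accurate and general
    (STOPS, LIQUIDS),          # [stop][liquid]: common, so inaccurate
    (frozenset("b"), NASALS),  # *bn: accurate but less general
]
ranked = search_order(candidates, observed, sample)
# *[stop][nasal] ranks above *bn (same accuracy, more general class)
```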

First test: English onsets

They tested this baseline version against English onsets (the first few consonants of a word). It works very well.

Second test: Shona vowels (Nonlocal phonotactics)

Shona is a Bantu language of Zimbabwe. It exhibits vowel harmony, which is to say that, unlike in English, certain vowels can only occur in the same word as certain other vowels.

Running the inductive baseline learner didn't really work here, because the relevant generalisations span longer strings: the consonants intervening between the vowels push the vowel-to-vowel dependencies beyond what the learner can feasibly search.

The learner is modified to "create 'projections' and scan them for phonotactic generalizations". This means that, in this instance, the vowels can be 'projected' onto their own tier, and generalisations can be learnt that refer only to vowels. And now it works! The authors claim therefore that "the concept of a vowel tier can be defended on learnability grounds".
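The projection idea is easy to sketch: strip a word down to one tier (here the vowels, applied to a made-up Shona-like form), and distant dependencies become adjacent.

```python
VOWELS = frozenset("aeiou")

def project(word, tier):
    """Keep only the segments belonging to the given tier."""
    return "".join(seg for seg in word if seg in tier)

def tier_bigrams(word, tier):
    """Adjacent pairs on the projection: a nonlocal vowel-to-vowel pattern
    in the full word becomes a local bigram here."""
    p = project(word, tier)
    return [p[i:i + 2] for i in range(len(p) - 1)]

project("chipedzo", VOWELS)       # → "ieo"  (hypothetical Shona-like form)
tier_bigrams("chipedzo", VOWELS)  # → ["ie", "eo"]
```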

Third test: Stress (Nonlocal phonotactics)

Stress (corresponding to a louder, higher, longer, more fully articulated syllable - the exact properties vary between languages) is not a binary phenomenon. There is one main stress in a word, but there can be other stressed syllables too.

Stress is frequently represented using a grid: all syllables on one row, stressed syllables marked on the row above that, and the main stress marked on the row above that. The syllable bearing the main stress is therefore marked on all three rows.
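A small sketch of such a grid (my own rendering; the stress levels in the example are invented for illustration, with dots marking empty grid positions):

```python
def stress_grid(levels):
    """Render a three-row metrical grid from per-syllable stress levels
    (0 = unstressed, 1 = secondary stress, 2 = main stress). Top row: main
    stress only; middle row: any stress; bottom row: every syllable."""
    return "\n".join(" ".join("x" if s >= level else "." for s in levels)
                     for level in (2, 1, 0))

# e.g. four syllables, secondary stress on the first, main stress on the third:
print(stress_grid([1, 0, 2, 0]))
# . . x .
# x . x .
# x x x x
```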

As with the vowels, each level of stress is projected and scanned for generalisations. This worked well.

Fourth test: Wargamay (Whole language)

Wargamay is an Australian Aboriginal language, whose phonotactics are well-described.

Using the techniques discussed above, the phonotactic learner succeeded in penalising forms described as illegal, but was perhaps too severe, banning forms that are only 'accidentally' missing from the data rather than excluded by any rule.

They suggest that either these rules may be true (more experimentation may confirm them); or that there is some kind of limit to the number of constraints, perhaps based on how much impact extra constraints have; or that there are more phonological principles that need to be built in (like the projection system).

References

Hayes and Wilson (2008). A Maximum Entropy Model of Phonotactics and Phonotactic Learning. Linguistic Inquiry 39(3): 379-440.
http://www.linguistics.ucla.edu/people/hayes/Papers/HayesAndWilsonPhonotactics2008.pdf

Chomsky and Halle (1965). Some controversial questions in phonological theory. Journal of Linguistics 1: 97-138.