Based on magnetoencephalographic (MEG) measurements, this contribution delineates a sequence of processing stages engaged in audiovisual speech perception, culminating in the fusion of phonological features derived from auditory and visual input. Although the two channels interact even within early time windows, the definite percept appears to emerge relatively late (> 250 ms after speech onset). Most notably, our data indicate that visual motion is encoded as categorical information even prior to audiovisual fusion, as demonstrated by a non-linear visual /ta/–/pa/ effect. Our findings indicate, first, that modality-specific sensory input is transformed into phonetic features prior to the generation of a definite phonological percept and, second, that cross-modal interactions extend across a relatively large time window. Conceivably, these integration processes during speech perception are susceptible not only to visual input but also to other supramodal influences such as top-down expectations and interactions with lexical representations.