Crowding & Masking

Vision starts at the retina and continues in more than 40 visual areas. Object recognition is usually assumed to proceed from the analysis of simple features, such as edges and lines, in early visual areas to increasingly complex features in higher visual areas, such as the inferotemporal cortex (IT).

There are three important characteristics of hierarchical, feedforward processing. First, processing proceeds from basic features (lines, edges) to complex ones (objects, faces). Second, processing at each level is fully determined by processing at the previous level. Third, receptive fields increase in size along the visual hierarchy because objects are more extended than their constituent elements.
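The third characteristic can be made concrete with the standard receptive-field recursion for a stack of convolution and pooling stages, the kind of feedforward hierarchy that classic models and CNNs share. The sketch below is illustrative only: the `hierarchy` of (kernel size, stride) pairs is hypothetical and is not meant as a model of the visual cortex.

```python
# Receptive-field growth in a feedforward stack of convolution/pooling stages.
# The layer parameters used in the example are hypothetical.
from typing import List, Tuple


def receptive_fields(layers: List[Tuple[int, int]]) -> List[int]:
    """Return the receptive-field size (in input positions) after each layer.

    Each layer is (kernel_size, stride). Uses the recursion
        r_l = r_{l-1} + (k_l - 1) * j_{l-1},   j_l = j_{l-1} * s_l,
    where j is the spacing between neighbouring units, in input positions.
    """
    r, j = 1, 1  # start from a single input position with unit spacing
    sizes = []
    for k, s in layers:
        r = r + (k - 1) * j  # the kernel spans (k - 1) steps of the current spacing
        j = j * s            # striding spreads out the next layer's units
        sizes.append(r)
    return sizes


if __name__ == "__main__":
    # Hypothetical hierarchy: alternating 3-wide "feature" and 2-wide "pooling" stages.
    hierarchy = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1), (2, 2)]
    print(receptive_fields(hierarchy))
```

With this hypothetical hierarchy the sizes are 3, 4, 8, 10, 18 and 22: each unit in the last stage pools over 22 input positions, compared with 3 in the first stage.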

There are four crucial implications of these characteristics. First, object recognition always becomes more difficult when objects are embedded in clutter, because object-irrelevant elements mingle with object-relevant ones. Second, features “lost” at early stages of processing are irretrievably lost. Third, only nearby elements interfere with each other. Fourth, only low-level features “interact” with each other, for example, lines of the same color and orientation.
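The first and third implications follow directly from pooling. In the toy model below, which assumes a unit that simply averages every element inside a fixed window (a hypothetical simplification, not a specific published model), adding flankers can only lower the target's share of the pooled signal, and flankers outside the window have no effect at all.

```python
# Toy pooling model: a unit averages the target with every flanker inside
# its pooling window. Window size and positions are hypothetical.
from typing import List


def target_weight(flanker_positions: List[float], window: float) -> float:
    """Fraction of the pooled response contributed by a target at position 0.

    Flankers within +/- window/2 of the target are averaged together with it;
    flankers outside the window are ignored entirely.
    """
    inside = [x for x in flanker_positions if abs(x) <= window / 2]
    return 1.0 / (1 + len(inside))


if __name__ == "__main__":
    window = 4.0  # hypothetical pooling-window width (arbitrary units)
    print(target_weight([], window))                     # target alone
    print(target_weight([-2.0, 2.0], window))            # one flanker per side
    print(target_weight([-6, -4, -2, 2, 4, 6], window))  # three flankers per side
```

Under this model the target alone gets the full weight (1.0), while one or three flankers per side both leave it one third: more flankers are never better, which is the prediction tested in the crowding experiments described below.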

In a variety of classic visual paradigms, we found that none of these predictions holds true.

The example of crowding. We presented a vernier stimulus consisting of two vertical lines offset in the horizontal direction, and human observers indicated the direction of the offset. Performance strongly deteriorated when the vernier was surrounded by the outline of a square. This is a classic crowding effect and in line with the predictions of most models of crowding, and of object recognition in general, because irrelevant information interferes with target processing. Next, we added further squares on both sides of the square containing the vernier (Figure 1). Classic models predict that performance should deteriorate even further, since more distracting information is added. However, performance was almost as good as when the vernier was presented alone. When we rotated the flanking squares so that they formed diamonds, performance was at its worst. We proposed that the repetitive array of squares forms a good Gestalt, a high-level structure that is treated differently from the vernier; for this reason, the vernier is clearly perceived and performance is good. Computationally, the squares must first be computed from their constituent lines and grouped into one array, and only then can they interact with the low-level vernier. We obtained similar crowding results with other types of stimuli (Key Publications: Malania et al., 2007; Manassi et al., 2012, 2013; Herzog et al., 2015a,b; Manassi et al., 2016; Pachai et al., 2016).

Figure 1

Figure 1: Observers were asked to discriminate whether the vernier was offset to the left or right. We determined the threshold, i.e., the offset size yielding 75% correct responses (left bar and dashed line). When the vernier was embedded in the outline of a square, thresholds increased (a-b). When three flanking squares were added on each side, thresholds strongly decreased and crowding disappeared (b-c). When the flanking squares were rotated into diamonds (d), thresholds increased. Modified from Manassi et al. (2013).
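Thresholds such as those in Figure 1 are read off a psychometric function. The sketch below shows one way to estimate the 75%-correct offset from left/right judgments; it assumes a logistic function with a 50% guess rate and a brute-force maximum-likelihood fit, which may differ from the procedures used in the cited studies. The example counts are made up for illustration.

```python
# Estimating a 75%-correct threshold from left/right vernier judgments.
# Assumptions (not taken from the cited studies): a logistic psychometric
# function with a 50% guess rate, fitted by grid-search maximum likelihood.
import math
from typing import List, Tuple


def p_correct(offset: float, m: float, s: float) -> float:
    """Probability correct in a two-alternative task; equals 0.75 at offset m."""
    return 0.5 + 0.5 / (1.0 + math.exp(-(offset - m) / s))


def fit_threshold(data: List[Tuple[float, int, int]]) -> Tuple[float, float]:
    """Return (m, s) maximizing the binomial log-likelihood over a coarse grid.

    data holds (offset, n_correct, n_trials) for each offset size.
    """
    best, best_ll = (0.0, 0.0), -math.inf
    for i in range(2, 201):      # m = 1.0 ... 100.0 in steps of 0.5
        m = i * 0.5
        for j in range(1, 41):   # s = 0.5 ... 20.0 in steps of 0.5
            s = j * 0.5
            ll = 0.0
            for x, k, n in data:
                # Clamp so log() never sees 0 when the curve saturates numerically.
                p = min(max(p_correct(x, m, s), 1e-9), 1 - 1e-9)
                ll += k * math.log(p) + (n - k) * math.log(1 - p)
            if ll > best_ll:
                best, best_ll = (m, s), ll
    return best


if __name__ == "__main__":
    # Hypothetical data: (offset in arcsec, correct responses, trials).
    data = [(10, 22, 40), (20, 24, 40), (30, 30, 40),
            (40, 36, 40), (50, 38, 40), (60, 40, 40)]
    m, s = fit_threshold(data)
    print(f"75% threshold = {m:.1f} arcsec (slope parameter s = {s:.1f})")
```

With the logistic form used here, the parameter `m` is itself the 75% threshold, because the curve rises halfway from chance (50%) to perfect (100%) at `offset == m`. Crowding then shows up as a larger fitted `m`.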

Thus, using crowding as an example, we have shown clear evidence against the predictions of the classic models of object recognition. A) Contrary to classic models, adding uninformative elements to a target does not necessarily deteriorate performance: bigger can be better. B) Low-level information is not irretrievably “lost”, neither at a low level nor at any other stage. C) Performance on a target depends on all elements in the visual field and is not restricted to nearby elements. D) Performance depends on the overall configuration of the stimulus, i.e., high-level processing determines low-level processing. For the same reasons, processing cannot be feedforward (Herzog et al., 2016; Jaekel et al., 2016).

Neurophysiology. Where in the human brain does crowding occur? Using high-density 192-channel EEG, we showed that high-level rather than low-level areas were involved, which again argues against interference in low-level visual areas (Chicherov et al., 2014).

The examples of visual masking and surround suppression. We obtained very similar results in visual masking and surround suppression and, as with crowding, found little evidence for the classic vision framework. On the contrary, in these paradigms too, processing is determined by global, Gestalt processing (Key Publications: Herzog & Fahle, 2002; Saarela & Herzog, 2009).

Modeling. We showed that classic feedforward architectures, including convolutional neural networks (CNNs), cannot explain crowding and masking. Models that include a recurrent grouping stage, however, accounted well for the results (Key Publications, Crowding: Francis et al., 2017; Doerig et al., 2019, 2020; Bornet et al., 2019; Masking: Hermens et al., 2008; see also Ghose et al., 2012).
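The contrast between pure pooling and "group first, then pool" can be sketched in a few lines. This is a deliberately simplified toy, not any of the published models cited above: shapes are labels in a one-dimensional row, grouping is a hypothetical rule that merges neighbouring identical shapes in repeated passes until nothing changes, and grouped elements are segmented away before pooling.

```python
# Toy contrast: feedforward pooling vs. a grouping stage before pooling.
# All rules and parameters here are hypothetical illustrations.
from typing import List


def group_labels(shapes: List[str]) -> List[int]:
    """Iteratively merge neighbouring identical shapes into groups.

    Each element starts with its own label; on every pass, an element adopts
    the smaller label of an identical neighbour. Passes repeat until no label
    changes (a crude, iterative stand-in for recurrent grouping).
    """
    labels = list(range(len(shapes)))
    changed = True
    while changed:
        changed = False
        for i in range(len(shapes)):
            for j in (i - 1, i + 1):
                if 0 <= j < len(shapes) and shapes[j] == shapes[i] and labels[j] < labels[i]:
                    labels[i] = labels[j]
                    changed = True
    return labels


def target_weight(shapes: List[str], center: int, window: int, grouping: bool) -> float:
    """Target's share of the pooled signal for a vernier inside shapes[center].

    Pooling only (grouping=False): every shape within +/- window positions of
    the center is averaged with the target. Group then pool (grouping=True):
    shapes belonging to a group of two or more identical elements are removed
    from pooling first. Higher values mean less crowding in this toy.
    """
    labels = group_labels(shapes) if grouping else list(range(len(shapes)))
    sizes = {lab: labels.count(lab) for lab in labels}
    pooled = [i for i in range(len(shapes))
              if abs(i - center) <= window and sizes[labels[i]] == 1]
    return 1.0 / (1 + len(pooled))


if __name__ == "__main__":
    displays = {
        "one square": (["square"], 0),
        "seven squares": (["square"] * 7, 3),
        "diamond flankers": (["diamond"] * 3 + ["square"] + ["diamond"] * 3, 3),
    }
    for name, (shapes, center) in displays.items():
        ff = target_weight(shapes, center, window=3, grouping=False)
        gr = target_weight(shapes, center, window=3, grouping=True)
        print(f"{name:17s} pooling only: {ff:.3f}  group then pool: {gr:.3f}")
```

Pooling alone gives 0.5, 0.125 and 0.125 for the three displays, so more flankers always hurt. The grouping variant gives 0.5, 1.0 and 0.5: the seven identical squares are segmented away and the vernier is effectively alone, whereas a central square that differs from its diamond flankers stays ungrouped and keeps crowding the vernier, qualitatively like Figure 1.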

Publications – Crowding

Peripheral vision
Foveal vision
Convolutional Neural Networks (CNNs)

Publications – Overlay masking / Surround suppression

Publications – Masking

Meta-contrast masking
Masking in general
Pattern masking & Shine-through effect
Modeling of masking
Dyslexia, Development & Visual Backward Masking