Below is an indicative illustration of zones compared to the reasonably linear response of a silicon sensor with a 12-bit A/D-converted output. All the zone charts I have seen show equal-length steps, with the tone values doubling at each step. The illustration varies the length of each step to reflect the share of tone values that zone may occupy in the converted data.
From a numeric point of view there seems to be tremendous room for detail to be held in zones 9 and 10, so it appears that ETTR is a valid approach. Theoretically zones 9 and 10 occupy even more room than is shown, but the illustration conveys the concept of highlight headroom, provided clipping does not start to happen.
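To put rough numbers on that headroom, here is a minimal sketch that counts how many raw levels each stop occupies, working down from clipping. It assumes an ideal, perfectly linear 12-bit encoding with no black-level offset and no noise; the function name `levels_per_stop` is my own. Stops are counted from the clipping point rather than given zone numbers, because that mapping depends on where the exposure is placed.

```python
def levels_per_stop(bit_depth=12, stops=6):
    """List (stops_below_clipping, lowest_code, highest_code, level_count).

    Assumes an ideal linear encoding: each stop down from clipping
    halves the signal, so it also halves the number of available codes.
    """
    rows = []
    high = 2 ** bit_depth          # exclusive upper bound (4096 for 12 bits)
    for stop in range(1, stops + 1):
        low = high // 2            # one stop down is half the signal
        rows.append((stop, low, high - 1, high - low))
        high = low
    return rows


for stop, low, high, count in levels_per_stop():
    print(f"stop {stop} below clipping: codes {low}-{high} ({count} levels)")
```

Under those assumptions the brightest stop alone takes 2048 of the 4096 codes and the next takes 1024, which is the numeric basis for the ETTR argument.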
Warning: the above assumes the A/D conversion is linear and the results are stored as RAW values. To speed up the conversion, manufacturers may not bother to fully resolve the higher values, or may apply a reference curve during the A/D conversion. Why spend so much time and storage on highlights when the extra precision is basically redundant? Examining the RAW file may give a clue.
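One way to look for such a curve is to count how many distinct codes actually appear in the brightest stop of the raw data. The sketch below does this on synthetic values only. The square-root curve in `compand_sqrt` is purely hypothetical, chosen to illustrate the idea, and is not taken from any particular camera; both function names are my own. Getting the raw values out of a real file is left out here.

```python
import math


def used_codes_in_top_stop(raw_values, bit_depth=12):
    """Return (distinct codes seen, codes available) for the brightest stop."""
    full_scale = 2 ** bit_depth
    half = full_scale // 2
    used = {v for v in raw_values if half <= v < full_scale}
    return len(used), full_scale - half


def compand_sqrt(value, bit_depth=12, out_bits=8):
    """Hypothetical curve: squeeze to out_bits via a square root, then expand."""
    in_max = 2 ** bit_depth - 1
    out_max = 2 ** out_bits - 1
    code = round(math.sqrt(value / in_max) * out_max)   # compressed code
    return round((code / out_max) ** 2 * in_max)        # back on the 12-bit scale


linear = list(range(4096))                     # every code present
curved = [compand_sqrt(v) for v in linear]     # same data through the curve

print("linear:", used_codes_in_top_stop(linear))
print("curved:", used_codes_in_top_stop(curved))
```

If a real file showed far fewer distinct highlight codes than a linear encoding allows, that would hint at a curve or reduced precision somewhere in the chain, although a scene with little highlight content would produce the same symptom.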
I am curious as to how the designers approach it, so if any of you can provide factual insights I would be pleased to know.