When you think about it, the light "gathering" characteristics of the camera's sensor are set in silicon, so all of the magic of high ISO can only happen electronically. So if you're shooting at, say, ISO 800, all that's really happening (as far as the sensor is concerned) is that you're only feeding it 1/8th of the light it would have received for the same shot at ISO 100. That's 3 stops less light, and the camera makes up the difference electronically.

To put that another way: picture a 12 foot well with fresh water at the top and mud at the bottom. Shooting at ISO 800 is like drawing water from 3 feet below the top, which is also 3 feet closer to the mud. But if the pipe was ALREADY 2 feet into the water (meaning a 2 stop under-exposure) and you lower it an additional 3 feet, then the water isn't as clean and clear as it would have been 2 feet further up. In a digital context this means you're getting closer and closer to the noise floor.
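If it helps to see the arithmetic behind the well analogy, here's a rough sketch in Python. It assumes a base ISO of 100 and one foot of well per stop, as in the example above; the function names are just made up for illustration.

```python
import math

BASE_ISO = 100       # assumed base ISO for the comparison
WELL_DEPTH_FT = 12   # depth of the well in the analogy, one foot per stop


def stops_above_base(iso, base_iso=BASE_ISO):
    """Stops between the chosen ISO and base ISO (each stop doubles the ISO)."""
    return math.log2(iso / base_iso)


def light_fraction(iso, underexposure_stops=0.0, base_iso=BASE_ISO):
    """Fraction of the base-ISO light that actually reaches the sensor."""
    total_stops = stops_above_base(iso, base_iso) + underexposure_stops
    return 1.0 / 2 ** total_stops


def feet_above_mud(iso, underexposure_stops=0.0, base_iso=BASE_ISO):
    """How far the pipe still is from the mud (the noise floor) in the analogy."""
    depth = stops_above_base(iso, base_iso) + underexposure_stops
    return WELL_DEPTH_FT - depth


if __name__ == "__main__":
    print(stops_above_base(800))    # stops between ISO 100 and ISO 800
    print(light_fraction(800))      # share of light at ISO 800
    print(light_fraction(800, 2))   # same, with a 2 stop under-exposure
    print(feet_above_mud(800, 2))   # feet of clean water left above the mud
```

ISO 800 on its own puts the pipe 3 feet down; add the 2 stop under-exposure and it's 5 feet down, with only 1/32nd of the light reaching the sensor.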
In a camera sense, it's not the under-exposure that causes the noise per se. If you want to test it, set your camera to its highest ISO setting and under-expose a shot by 10 stops: you still won't see any noise, because the frame is simply too dark to show it. BUT it's when you try to adjust that under-exposure out in post processing that you find the signal and the noise floor are basically one and the same, and you can't display one without revealing the other. So the "trick" is to stay as far away from the noise floor as possible, and that means not under-exposing. Even a shot that looks over-exposed on the review screen is fine so long as important areas aren't blown. It's one of the few occasions where ETTR (Expose to the Right) is actually beneficial.
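To put some rough numbers on why pushing an under-exposed shot in post reveals the noise, here's a simplified sketch. It assumes a toy noise model (photon shot noise plus a fixed read noise, both in electrons), and the figures of 1000 electrons and 3 electrons of read noise are hypothetical, not measurements from any real camera.

```python
import math


def snr(signal_electrons, read_noise_electrons=3.0):
    """Signal-to-noise ratio for a toy model: shot noise plus fixed read noise."""
    shot_noise = math.sqrt(signal_electrons)  # shot noise grows as sqrt(signal)
    total_noise = math.sqrt(shot_noise ** 2 + read_noise_electrons ** 2)
    return signal_electrons / total_noise


def snr_after_push(correct_signal, stops_under, read_noise_electrons=3.0):
    """SNR of a shot under-exposed by stops_under and then brightened in post.

    Brightening multiplies signal and noise by the same factor, so the SNR
    stays whatever it was at capture time.
    """
    captured = correct_signal / 2 ** stops_under
    return snr(captured, read_noise_electrons)


if __name__ == "__main__":
    # Hypothetical mid-tone that would collect 1000 electrons when exposed properly.
    for stops in (0, 2, 5, 8):
        print(stops, "stops under:", round(snr_after_push(1000, stops), 2))
```

The point of the sketch is only the trend: each stop of under-exposure drags the captured signal nearer the read noise, and lifting it back up in post lifts the noise right along with it.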