I think there are more than a few enthusiasts who would be very interested in buying one or more of these cards (if they had 32+ GB of memory), but I don't have any data to back that opinion up. It's not only those who can't afford a 4090, though.
While the 4090 can run models that use less than 24GB of memory at blistering speeds, models are going to keep scaling up, and 24GB is fairly limiting. Because LLM inference can take advantage of splitting a model's layers across multiple GPUs, high-memory GPUs that aren't super expensive are desirable.
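To make "splitting the layers" concrete: with Hugging Face transformers plus accelerate it's basically one argument. A minimal sketch, where the model ID is just a placeholder for whatever you actually run:

    # Sketch: shard an LLM's layers across all available GPUs.
    # Requires: pip install torch transformers accelerate
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "mistralai/Mistral-7B-v0.1"  # placeholder model
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",          # spread layers across GPUs (and CPU RAM if needed)
        torch_dtype=torch.float16,  # halve memory vs. fp32
    )
    inputs = tok("More VRAM means", return_tensors="pt").to(model.device)
    print(tok.decode(model.generate(**inputs, max_new_tokens=32)[0]))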
To share a personal perspective: I have a desktop with a 3090 and an M1 Max Studio with 64GB of memory. I use the M1 for local LLMs because I can use up to ~57GB of memory, even though the output (in terms of tok/s) is much slower than for models I can fit on the 3090.
Right now I have a 3090 Ti, so it's not worth it for me to upgrade to a 4090, but I do run into VRAM constraints a lot, particularly when merging Stable Diffusion models as they get larger (XL, Cascade, etc.). As I move toward running multiple LLMs at a time I run into similar problems.
I would gladly buy a card that ran a touch slower but had massive VRAM, especially if it was affordable, but I guess that puts me into that camp of enthusiasts you mentioned.
It doesn't matter whether anyone is "spoiled" or not.
The fact is large language models require a lot of VRAM, and the more interesting ones need more than 24GB to run.
The people who are able to afford systems with more than 24GB VRAM will go buy hardware that gives them that, and when GPU vendors release products with insufficient VRAM they limit their market.
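The arithmetic is unforgiving, too. Weights alone, ignoring the KV cache and activations that come on top:

    # Back-of-envelope VRAM for the weights of an LLM.
    def weight_gb(params_billions, bits_per_param):
        # params * bits / 8 bytes; the 1e9 in params and in GB cancel out
        return params_billions * bits_per_param / 8

    for bits in (16, 8, 4):
        print(f"70B model @ {bits}-bit: ~{weight_gb(70, bits):.0f} GB")
    # -> ~140 GB, ~70 GB, ~35 GB: even aggressively quantized, 24GB isn't enough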
I mean inequality is definitely increasing at a worrying rate these days, but let's keep the discussion on topic...
I'm just fascinated that the response/demand to running out of RAM is "Just sell us more RAM, god damn!" instead of engineering a solution to make do with what is practically (and realistically) available.
I would say that increasing RAM to avoid engineering a solution has long been a successful strategy.
I learned my RAM lesson when I bought my first real Linux PC. It had 4MB of RAM, which was enough to run X, bash, xterm, and emacs. But once I ran all that and also wanted to compile with g++, it would start swapping, which in the days of slow hard drives was death to productivity.
I spent $200 to double to 8MB, and then another $200 to double to 16MB, and then finally, $200 to max out the RAM on my machine-- 32MB! And once I did that everything flew.
Rather than attempting to solve the problem by making emacs (eight megs and constantly swapping) use less RAM, or finding a way to hack without X, I deployed money to max out my machine (which was practical, but not realistically available to me unless I gave up other things in life for the short term). Not only was I more productive, I used that time to work on other engineering problems which helped build my career, while also learning an important lesson about swapping/paging.
People demand RAM and what was not practically available is often available 2 years later as standard. Seems like a great approach to me, especially if you don't have enough smart engineers to work around problems like that (see "How would you sort 4M integers in 2M of RAM?")
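For reference, the standard answer to that interview question is an external merge sort. A rough Python sketch; a real implementation would stream the input from disk too rather than taking a list:

    # External merge sort: sort RAM-sized chunks, spill each sorted
    # run to a temp file, then k-way merge the runs.
    import heapq
    import tempfile

    def external_sort(numbers, chunk_size=100_000):
        runs = []
        for i in range(0, len(numbers), chunk_size):
            run = tempfile.TemporaryFile(mode="w+")
            run.writelines(f"{n}\n" for n in sorted(numbers[i:i + chunk_size]))
            run.seek(0)
            runs.append(run)
        # heapq.merge holds only one value per run in memory at a time.
        for line in heapq.merge(*runs, key=int):
            yield int(line)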
While saying "we want more efficiency" is great, there is a trade-off between size and accuracy here.
It is possible that compressing and using all of human knowledge takes a lot of memory and in some cases the accuracy is more important than reducing memory usage.
For example, [1] shows how Gemma 2B running with AVX512 instructions could solve problems it couldn't solve with AVX2, because of rounding issues in the narrower instructions. It's likely that most quantization schemes (and other memory-reduction techniques) have similar problems.
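A toy numpy illustration of that trade-off (unrelated to the specific Gemma issue in [1], just naive symmetric int8 quantization):

    # How much error does rounding weights to int8 introduce?
    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=(256, 256)).astype(np.float32)  # stand-in "weights"
    x = rng.normal(size=256).astype(np.float32)         # stand-in activations

    scale = np.abs(w).max() / 127.0               # one scale for the whole tensor
    w_int8 = np.round(w / scale).astype(np.int8)  # 4x smaller than fp32
    w_back = w_int8.astype(np.float32) * scale    # what inference actually sees

    print("max abs error in w @ x:", np.abs(w @ x - w_back @ x).max())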
As we develop more multi-modal models that can do things like understand 3D video in better than real time it's likely memory requirements will increase, not decrease.
Quantization, CPU-only mode, and hybrid mode (where the model is split between CPU and GPU) all exist and work well for LLMs. But in the end, more VRAM is a massive quality-of-life improvement for running them (and probably even more so for training, which has higher memory needs and for which quantization isn't useful, AFAIK), even if you technically can run them CPU-only, or hybrid, with no/lower VRAM requirements.
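For anyone who hasn't tried the hybrid mode: in llama.cpp's Python bindings it's a single parameter. A sketch where the model path and layer count are placeholders you'd tune to your card:

    # Hybrid inference: keep n_gpu_layers layers in VRAM, run the rest on CPU.
    # Requires: pip install llama-cpp-python (built with GPU support)
    from llama_cpp import Llama

    llm = Llama(
        model_path="./model.gguf",  # placeholder: any quantized GGUF file
        n_gpu_layers=20,            # raise this until you run out of VRAM
        n_ctx=4096,
    )
    out = llm("Q: Why buy more VRAM? A:", max_tokens=48)
    print(out["choices"][0]["text"])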
What makes you think people aren't trying to engineer a solution that uses less RAM?
There are millions (billions?) of dollars at stake here, and obviously the best minds are already tackling the problem. Only plebs like us who don't have the skills to do so bicker on an internet forum... It's not like we could realistically spend the time inventing ways to run inference with fewer resources and make significant headway.
I tend to agree that it would be niche. The machine learning enthusiast market is far smaller than the gamer market.
But selling to machine learning enthusiasts is not a bad place to be. A lot of these enthusiasts are going to go on to work at places that are deploying enterprise AI at scale. Right now, almost all of their experience is CUDA and they're likely to recommend hardware they're familiar with. By making consumer Intel GPUs attractive to ML enthusiasts, Intel would make their enterprise GPUs much more interesting for enterprise.
The problem is that this now becomes a long term investment, which doesn't work out when we have CEOs chasing quarterly profits and all that. Meanwhile Nvidia stuck with CUDA all those years back (while ensuring that it worked well on both the consumer and enterprise line) and now they reap the rewards.
It's about mindshare. Random people using your product to do AI means that the tooling is going to improve because people will try to use them. But as it stands right now if you think there's any chance you want to use AI in the next 5 years, then why would you buy anything other than Nvidia?
It doesn't even matter if that's your primary goal or not.
Microsoft got where they are because they developed tools that everyone used. They got the developers, and the consumers followed. Intel (or AMD) could do the same thing: get a big card with lots of RAM out there so that developers get used to your ecosystem, then sell the enterprise GPUs to make the $$$. It's a clear path with a lot of history behind it, and it blows my mind that Intel and AMD aren't doing it.
AFAIK, unless you are a huge American corp with orders above $100M, Nvidia will only sell you old and expensive server cards like the crappy A40 (PCIe 4.0, 48GB GDDR6) at $5,000. Good luck getting SXM H100s or a GH200.
If Intel sells a stackable kit with a lot of RAM and a reasonable interconnect, a lot of corporate customers will buy it. It doesn't even have to be that good, just halfway between PCIe 5.0 and NVLink.
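To put rough numbers on "halfway" (these are commonly quoted aggregate figures, so treat them as ballpark):

    # Ballpark bidirectional bandwidth in GB/s
    pcie5_x16 = 128    # PCIe 5.0 x16: ~64 GB/s per direction
    nvlink_h100 = 900  # NVLink on H100, all links combined
    print(f"halfway target: ~{(pcie5_x16 + nvlink_h100) / 2:.0f} GB/s")  # ~514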
But it seems they are still too stuck in their old ways. I wouldn't count on them waking up. Nor AMD. It's sad.
It will be a niche product with poor sales.