PCIe switches just shouldn't be so damned expensive. A decade ago there was a lot more market competition, but now there are, what, two companies with chips?
It has gotten a good bit harder to build, especially with so many of the tricks & tight timings in PCIe 5 and 6, but the lack of market competition has made getting any kind of parts at all much much more expensive.
With the cost of PLX PCIE Switches (allows you to e.g. PCIE 4.0 x16 -> PCIE 3.0 x32) it is actually worth considering just buying a second desktop and throwing in some high-bandwidth NIC and forming your own HPC. Or instead of using desktop parts just going EPYC/Xeon/Threadripper.
Of course it all comes down to the fact that if you need those PCIE lanes, there's a very good chance that it's for your job, meaning that businesses are the target market, not the enthusiast building a homelab for tinkering with LLM off the clock.
Coincidentally, the PCIe 5.0 variant is in "Samples available", i.e. not full production yet, which is very likely the reason for this card only being PCIe 4.0.
That's the price of the development/evaluation kit. Those are produced in small numbers, have provisions for everything and debugging the kitchen sink, and are thus always this expensive.
With the PM40100 being $800 (single unit, no bulk pricing), the PM41100 / PM42100 are probably < $1500. (They do seem to have more features, not quite clear without proper datasheet sadly.)
This actually sounds like it could be a nice mid-range product. For lots of money, you can get a fancy motherboard and enclosure that routes a ton of CPU PCIe lanes direct to the NVMe drives. This ends up with a lot of performance per unit storage, which one might not want.
With a card like this, one can get a ton of high-speed (much better than SATA but not as fast as direct NVMe) storage in a regular machine.
People like to throw the innuendo around but the NSA's pathetic little datacenter is something that you would lose in a corner of a real datacenter operated by a real hyperscale system like Amazon or Google.
Just like a cluster of Bitcoin miners will run absolute circles around a similarly sized corner of an AWS data center, and the supercomputer at Oak Ridge will run circles around a similarly sized corner of AWS made up of GPU EC2 instances connected via gigabit Ethernet, the NSA's cluster has a different use case than running web services for every SaaS company that wants to run in AWS. I imagine it's aimed at saving and analyzing/decrypting large amounts of data, so it is architected and tuned towards that purpose, and thus runs circles around a similarly sized corner of AWS for that particular task.
Unless you have experience with the NSA's cluster that you'd like to share with the rest of the class, that is.
Needs citation. I think the idea that the NSA has stronger data storage and analysis infrastructure than commercial operators is not even conjecture, it's something weaker, a fantasy. Commercial hyperscale operators claimed the ability to sort 50PB datasets at 600GB/s, eleven years ago. Storing and analyzing bulk data is the #1 thing these guys are good at.
Does anyone else marvel at data throughputs nowadays? People talk about 5GB/s NVME cards as being "slow." Same with Internet speeds. It's unreal the progress that we've made (and continue to make.)
I marvel at 5GB/s mass-storage, but then I groan at the knowledge it will be used to run a denormalized Postgres database table containing JSON blobs for queries that will do full tablescans because apparently dropping $Lots on a high-end storage system is preferable to learning how to do things properly.
I'm in the automation/manufacturing industry, and while I know there are big servers running ERPs and SCMs and MES and other TLAs with poorly-optimized giant databases I don't personally touch those. But still I suspect the same truths apply whether you're talking about "big iron" computers or literal metals: when evaluating the strength of an average weldment on an ordinary conveyor or machine, we like to say "Steel is cheap, engineering is expensive."
Obviously there are situations when you have to employ more rigor and do the FEA, but typically, when choosing between a just-right solution and one that's obviously strong enough, just overbuilding it is a lot more efficient in terms of value.
With 21 of the pictured $150 Samsung 1TB 990 Pro SSDs and, hypothetically, $1000 for this card, you're looking at $4,150 for this storage solution. If that solves your problem and lets you apply off-the-shelf Postgres and JSON and unoptimized queries, do it! That money only buys a handful of site visits and maybe a week of engineering hours to change a system that may involve dozens of users, tens or hundreds of thousands of lines of code, and rigid requirements from upstream and downstream...maybe you can change those eventually, and it would definitely have been cheaper if all the stakeholders had a fundamental understanding of the compute requirements of full table scans and non-native blobs and designed their business around those mathematics, but that doesn't sound likely.
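A quick sketch of that arithmetic in Python. The drive count, drive price, and hypothetical $1000 card price come from the comment above; the $100/hour loaded engineering rate is purely an assumption for illustration, so substitute your own.

```python
# Hardware figures quoted in the comment above.
DRIVE_COUNT = 21
DRIVE_PRICE_USD = 150
CARD_PRICE_USD = 1000  # hypothetical card price, as in the comment

# Assumption for illustration only: a loaded engineering rate.
ENGINEER_RATE_USD_PER_HOUR = 100


def hardware_cost(drives: int, drive_price: float, card_price: float) -> float:
    """Total cost of the card plus a full set of drives."""
    return drives * drive_price + card_price


def break_even_hours(cost: float, hourly_rate: float) -> float:
    """Engineering hours that cost as much as the hardware."""
    return cost / hourly_rate


total = hardware_cost(DRIVE_COUNT, DRIVE_PRICE_USD, CARD_PRICE_USD)
hours = break_even_hours(total, ENGINEER_RATE_USD_PER_HOUR)
print(f"hardware: ${total:,.0f}, break-even: {hours:.1f} engineering hours")
```

Under that assumed rate the hardware pays for itself if it saves roughly one working week of one engineer's time.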
The end result of this is something like Microsoft Teams where they are in the process of swapping out the entire underlying runtime because the team that wrote the application itself was so lazy (read: incompetent) that it is slow on literally every machine it runs on and this is apparently the only sane way to fix the whole mess - bar a whole rewrite of the entire application.
According to this 2021 blog article[1], MS Teams is undergoing two simultaneous overhauls:
1. Teams will use a shared, Windows systemwide Blink-based WebView2-based host instead of using its own private Electron environment.
2. The Teams' UI is changing from Angular to ReactJS.
So Teams will remain a modern-day HTA, for better or for worse, but sourcing from my own experiences working with Angular, ReactJS, and MS's WebView2 vs. Electron, I'm not convinced any of these changes will substantially benefit the end-user experience except perhaps a modest reduction in memory-usage attributed to using WebView2 instead of Electron.
Microsoft doesn't hire FTE SEs on the basis of their knowledge of a single platform or library - anyone who is good-enough overall will be able to familiarize themselves with Angular - or React - or any other framework, platform, or entire paradigm - that's how the industry works.
Employing people for knowledge of a specific library or platform can, and does, make sense, but only in a situation where a company needs a consultant or contractor(s) to make changes to an existing product for a short contract and then, poof, they no longer work at the company.
While Microsoft does hire plenty of contractor staff (orange-badges, "v-dash trash", etc), only a minority of them are involved in product development, and an even tinier number of those are employed in any kind of consultancy role (which makes sense, considering that Microsoft almost entirely uses only its own platforms, frameworks and libraries for its consumer-facing products) - so the fact that Microsoft swallowed its pride and adopted Electron, Angular, React, Blink/Chromium in recent years marks a significant shift in the company's ideology (for want of a better word). No-one would have predicted this even as late as 2015.
Moreso, because this solution costs $4,000, it only needs to save one week of one person's engineering time to be a more cost-effective solution than “better software engineering”.
That kind of reads like a sarcastic post. The US has so much terrible internet. I'm about to try triple channel bonding Starlink, DSL, and an LTE internet connection. Starlink and DSL drop packets a lot.
I'm in the US and pay $100 a month for 5 Mbit DSL. This is not in a big city though: it's 20 miles outside of a city, next to a highway. There's fiber that runs along the street by the small subdivision this is in, kind of in the woods. The company that owns it refuses to connect us to the fiber, instead preferring to put 100 homes on 5 Mbit DSL at far more profit. There's one big commercial user that paid for the fiber. This is the story of the US of course; I'm not unique. A family member lives on the other side of the country; he's closer to a big city but has the same problem.
Let's all keep in mind that a very small portion of the global population has access to this. It is our responsibility to bring all of humanity forward with us as we write software.
While you're not wrong about the portion who can access it, I wonder how many of us are doing work that brings humanity forward at all. Sure, let's not use fast systems as an excuse for bad code, but I'm not going to pretend my CRUD app is elevating humanity. Even the stuff that claims to feels like at best a lateral move a lot of the time.
Got a VIC-20 for my 13th birthday in 1983. The local TV shop sold Commodore hardware and hosted a BBS that my little nerd cronies and I connected to at 300 bps. All of us knew the owner who ran the store and mentored us and hired some of us when we got old enough. On one visit he took me back, behind an actual curtain, and showed me the BBS machine. If you know the 1541 Disk Drive, you know. Well this beast had a Commodore branded 5 MB hard disk connected to a C-64. In my memory it was 6” by 8” by 24” long, with an enormous power supply, and must have cost thousands of Reagan-first-term US dollars.
Two or three years later another mentor hired me to put a 33 MB hard disk into his IBM PC. Not a clone. My memory tells me it was a DOS imposed limit, those 33 MB: the biggest drive available. I managed to plug the connector in upside down and released the magic smoke. That was a many-hundreds-of-dollars mistake. (And a good lesson in patient mentoring.)
In 1991 I obtained a used 80 MB drive (half height!) to put into my own PC XT clone, via a local Usenet group. I set the volume name to $1_PER_MB because going under that threshold was so impressive.
300 baud was comfortable reading speed, at least for me as a child. It was also fun to pick up the phone and be able to distinguish individual bytes of data (though not the actual content of the byte). Once we got to 1200 baud, it just became a stream of warbling.
> I set the volume name to $1_PER_MB because going under that threshold was so impressive.
Hah! Too funny, my own personal memory for 'cheap' storage was keeping an eye on the Fry's print ads in the Sunday newspaper while saving all my allowance and summer job money, finally buying the outrageously large 200GB HDD for a mere $1 a gig!
Sometimes you have to move 15TB about and then nothing is fast enough.
At 5GB/s, that would take nearly an hour, and 5GB/s would be the fastest part of the trip. If it has to land on tape or travel through the net, it's going to take days.
"Never underestimate the bandwidth of a station wagon full of magnetic tapes hurtling down the highway" - Andrew S. Tanenbaum
But you are right, writing to and reading from tape will take a long time. Modern tape drives can do ~500MB/s, so 15TB will still take ~9 hours. Though that may still be faster than a 1 gbit internet connection (depending on how far you must drive).
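For anyone who wants to redo the numbers in this subthread, here is a minimal Python sketch. It assumes decimal units (1 TB = 10^12 bytes) and sustained rates with no protocol overhead; the three rates are the ones mentioned above, and the names are just illustrative.

```python
TB = 10**12
GB = 10**9
MB = 10**6


def transfer_hours(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Hours needed to move size_bytes at a sustained rate."""
    return size_bytes / rate_bytes_per_s / 3600


size = 15 * TB
rates = {
    "NVMe at 5 GB/s": 5 * GB,
    "tape at ~500 MB/s": 500 * MB,
    "1 Gbit/s network": 10**9 / 8,  # bits to bytes
}

for name, rate in rates.items():
    print(f"{name}: {transfer_hours(size, rate):.1f} h")
```

With these assumptions NVMe comes out a bit under an hour, tape at a little over eight hours (closer to nine if the 15TB is really 15 TiB), and a saturated gigabit link at well over a day.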
Yes and no. Throughputs are crazy high these days, but the rate at which they increase is slower than the rate at which compute increases, by a lot. So if we're discussing in a relative sense, then there's a growing divergence and thus one can argue that I/O is getting "slower." This is actually one of the major topics in HPC discussion and why there have been so many crazy hacks. Things like flashbuffers are pretty much essential these days. Even if you're doing multi-node ML training you see pretty big differences using InfiniBand due to the frequency with which nodes need to communicate (there are regularization interval tricks too). In scientific computing this is a big limitation to our ability to visualize at high resolutions and is why in situ visualization is growing popular.
As far as consumer hardware and consumer usages, yeah, everything just feels fast though.
It's starting to become difficult to comprehend much of it now. My ISP is pushing me to replace my 300 Mbps service with 2Gbps, but I never even saturate what I have. Maybe if I were a gamer with huge downloads it would make sense to upgrade.
I'm a gamer on 300 Mbps. Even 100 GB huge games aren't much more than a half hour away on steam. I don't really see the value in being able to download that much in 5 minutes.
I do not know if it is still true, but originally the PlayStation did not perform patch diffs, and any kind of update could be 10s of gigabytes. If I were routinely having to wait to start a frequently patched game, that would get old pretty quickly.
Outside of that, yeah, I am not sure what use greater than gigabit would be for 95% of the population.
The protocol implementation doesn't matter if everything is serialized into ASCII. At some point there's going to be a web 4.0 where people figure out the performance advantages of binary data.
U.2 NVMe uses 2.5" bays and SATA Express connectors to offer up to 4 lanes of PCI-e or two lanes of SATA. That's probably where most of the enterprisey NAS is going.
In my humble opinion some home NAS use cases are beneficially converted to Thunderbolt. It's a lot more practical than the faster varieties of ethernet. If you have two hosts that need fast access to the NAS you can do it with TB4, and one of those hosts can re-export the stored resources for applications with lesser performance requirements, over SMB or iSCSI or whatever.
Yeah, but then the primary host inherits the uptime requirements of a NAS and this gets really awkward when you have workflows in different operating systems that both want to use the fast storage. Both OSes can access thunderbolt, of course, but only one can be a good server. Now that I've experienced the separation of concerns that comes from an independent NAS I do not want to go back.
Which host have you nominated as "primary" in this scenario? I'm only thinking of the TB4 link as a relatively cheap and simple point-to-point networking link. I think it's a lot more practical, and cheaper, than 25Gbps Ethernet. A dual-port 25Gb Ethernet NIC costs hundreds of dollars, while you can find all kinds of cheap computers with 2 TB4 ports.
I really need to finish my YouTube video on this...
-- -----
So, the answer is you don't need a lot to get big impact.
Let's take some NICs like the Intel XL710-QDA1. These are about $250 used. This is a 40Gbps NIC using just PCIe 3.0 x8. I used DACs to drop power consumption through the floor, and to reduce latency. All of this is presented via iSCSI, with jumbo frames to eke out a few extra percent of throughput. If you're using passive DACs, figure 4W of power per port for the NIC, plus another 1.2W at the switch (assuming four SFP+ fan-out), for one end of that connection. You could also just direct connect between the server and the client.
At this point, you can basically shove older prosumer (say 980 Pro) PCIe 4.0 x4 NVMe sustained transfer over the network. Granted, if you're outside of that STR use case, you'll fundamentally be limited by IOPS. Figure for every 40Gbps, you can throw up to 1,200,000 4K IOPS worth of data over the wire.
If you hit the IOPS cap, increase your link speed. A PCIe 4.0 x16 card can handle two 100GbE ports just fine. Note that as you increase IOPS, you'll eventually hit your operating system's IO scheduler limits somewhere in the 10M-15M IOPS range.
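A rough Python sketch of the sizing rules above. It treats every bit on the wire as payload (no Ethernet, TCP, or iSCSI overhead) and uses raw PCIe bandwidth after 128b/130b line encoding (PCIe 3.0 through 5.0), so real numbers will land somewhat lower. The function names are made up for illustration.

```python
def link_iops(link_gbps: float, block_bytes: int = 4096) -> int:
    """Upper bound on block-sized IOPS over a link, ignoring overhead."""
    return int(link_gbps * 1e9 / 8 // block_bytes)


def pcie_gbps(gt_per_s: float, lanes: int) -> float:
    """Raw PCIe bandwidth in Gbit/s after 128b/130b line encoding."""
    return gt_per_s * 128 / 130 * lanes


# 4K IOPS ceiling per link speed mentioned in the comment.
for gbps in (40, 100, 200):
    print(f"{gbps} Gbps -> {link_iops(gbps):,} 4K IOPS")

# Does the slot have room for the NIC?
print(f"PCIe 3.0 x8:  {pcie_gbps(8, 8):.1f} Gbps (one 40GbE port)")
print(f"PCIe 4.0 x16: {pcie_gbps(16, 16):.1f} Gbps (two 100GbE ports)")
```

The 40Gbps case works out to about 1.22 million 4K IOPS, in line with the 1,200,000 figure above, and both slot widths leave headroom over the NIC's line rate.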
-- -----
The question then is actually keeping the NIC fed if you're going over the network. If you're local, it's at least much easier.
First, you likely have data buffered in memory for reads, so RAM will cover you there. For writes, you probably want a FUSE pass-through filesystem on NVMe in front of your real backing store (if it's disk, and you're not pure NVMe), or alternatively a writeback cache. The idea here is that the pass-through filesystem is basically a storage tier that sits in front of another volume (or volumes) in an entirely transparent manner, so it still appears as if you're writing to a volume that might just be a RAID with a large number of disks, but instead the data is written to the (presumably mirrored or striped+mirrored) NVMe first and moved to the final destination later. Alternatively, you could also have it double as additional read cache, if you don't need to use it all as a write cache too.
-- -----
That's basically what I've done. I have QSFP+ NICs in the storage server and my HEDT (10GbE, 2.5GbE and 1GbE in the rest of the lab hosts), a mirrored NVMe cache/ingest tier, and a boatload of 16TB HDDs in ZFS across two ZFS volumes behind it. This is further backed by 128GB of ECC memory for ZFS ARC, and all powered by an AMD Ryzen 7 PRO 5750GE that maxes out at just under 39W. The whole system, with the NIC, HBA, two 2TB NVMe drives, eight 16TB HDDs, SATA DOM, 128GB ECC, 8-core 35W TDP CPU, and onboard BMC, idles at about 43W, sits in the 60-65W range under loads that primarily hit the cache, and has an absolute system peak under 140W. These figures can be confirmed by a metered PDU.
This gives me ~80TiB of usable redundant storage, with native prosumer PCIe 4.0 x4 NVMe performance, over the network... with an idle of 42-43W that bumps up to 60-65W for most work.
You can do a passive x16 -> 4 x4, iff your board supports pci-e bifurcation. Theoretically, you could do x16 -> 8 x2 also passively, but I haven't seen bifurcation go down to x2. PCI-e switches are probably too expensive for anything active though.
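To put numbers on those splits, a small Python sketch. It assumes a PCIe 4.0 slot (16 GT/s per lane, 128b/130b encoding) and reports raw per-drive bandwidth, not what a given SSD will actually sustain; the helper names are hypothetical.

```python
PCIE4_GT_PER_S = 16   # assumed: PCIe 4.0 signalling rate per lane
ENCODING = 128 / 130  # 128b/130b line encoding


def drives_supported(slot_lanes: int, lanes_per_drive: int) -> int:
    """How many drives a passive, even bifurcation of the slot can feed."""
    if slot_lanes % lanes_per_drive:
        raise ValueError("slot width must divide evenly")
    return slot_lanes // lanes_per_drive


def per_drive_gbytes_per_s(lanes_per_drive: int) -> float:
    """Raw bandwidth each drive gets, in GB/s."""
    return PCIE4_GT_PER_S * ENCODING * lanes_per_drive / 8


for width in (4, 2):
    count = drives_supported(16, width)
    bandwidth = per_drive_gbytes_per_s(width)
    print(f"x16 -> {count} x{width}: {bandwidth:.2f} GB/s per drive")
```

Either way the slot's total stays the same; going past that drive count without giving up per-drive lanes is what needs an active switch.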
I'm gonna claim that this card isn't aimed at enterprise use-cases that need this kind of service. I'd put it along the lines of "nearline SAS" HDDs, aimed at non-critical applications where you care more about bulk capacity than reliability.
Relatedly, M.2 SSDs are inherently slower than the same pile of silicon in a U.2/2.5" form factor, because the power/heat budget is noticeably lower.
Enterprise SSDs almost always include power loss protection capacitors on the drive itself, so the drive can either directly advertise that its write caches are non-volatile or simply ignore cache flush requests from the host since data in the cache can already be considered durable from the host's perspective.
Unfortunately for this product, enterprise M.2 SSDs are almost always 110mm long rather than 80mm long, precisely because of the space taken up by those capacitors.
>Apex Storage doesn't reveal the inner details of the X21
>In a single-card configuration, the X21 delivers sequential read and write speeds up to 30.5 GBps and 28.5 GBps, respectively.
did they test it or reprint press release?
>According to Apex Storage
ah
>The AIC has an average read and write access latency of 79us and 52us
that doesn't make sense unless it's the additional latency of the controller, or they ship it populated with drives.
>However, Apex Storage didn't expose the type of RAID arrays. The X21 also flaunts "enterprise-grade reliability," NVMe 2.0 support, advanced ECC, data protection, and error recovery. Apex Storage didn't reveal the pricing or availability for the X21.
>A cross platform drop in card that can massively increase the amount of storage for your computer using cost effective m.2 SSDs.
You had to read the fine print to realize it's just an M.2 stand that requires a proper 16-port SATA controller to function. It still hasn't shipped to this day. I'm mildly optimistic.
Not yet. Orico seems pretty disappointing: 600MB/s for their 20Gbps enclosures. I would have thought you should already get that with their 10Gbps ones. So I'm a bit wary of trying out their 40Gbps enclosures.
While dealing with the Samsung Pro Firmware issue, I read that SSDs mounted on a hardware RAID controller need to be removed from the RAID in order to have their Firmware update applied, since Samsung's tool won't see the SSDs if they are placed on the controller.
The chip this uses is likely a PM4x100 (x ∈ {0, 1, 2}) from Microchip (formerly Microsemi (formerly PMC-Sierra)):
https://www.microchip.com/en-us/product/PM40100
^ runs you $800 without bulk discounts [https://www2.mouser.com/ProductDetail/Microchip-Technology-A...] — if you can get them, that is.
https://www.microchip.com/en-us/product/PM41100
https://www.microchip.com/en-us/product/PM42100
^ these latter two I don't see publicly listed prices for anywhere.
The PCIe 5.0 equivalent is in "Samples available", i.e. not full production yet, which is likely why the card only does PCIe 4.0:
https://www.microchip.com/en-us/product/PM50100