How can reads exploit inter-rack bandwidth in the Google File System?

It says (my interpretation after reading the research paper and its reviews): "Inter-rack bandwidth is lower than the aggregate intra-rack bandwidth (I'm not sure what 'aggregate' means here; the comparison doesn't quite make sense to me). Thus, by placing data across various racks, clients can exploit the aggregate bandwidth of reads from multiple racks.

In the case of mutations, where the client has to send data, multiple racks are disadvantageous because the data has to travel longer distances."…

I don't get the point it's trying to make about bandwidth. Can anyone explain? Why would it be different for reads and writes? I understand writes: if you write at distance 0 and then have to write at distance 1000, your data needs to travel a longer distance. But why is it beneficial for reads?

Some background information:

A rack is a collection of chunkservers (30-40).

Chunkservers are collections of 64MB chunks.

Chunks are collections of 64KB blocks.

Here's the GFS architecture:

(See the figure in the GFS paper; I can't share images as I'm a new user.)

GFS research paper

Other sources:

What's written in some solution manuals I saw online:

To put it as simply as possible: you have multiple copies of each chunk, so you can read any one of them from anywhere, but you need to write to all of them, everywhere.

But there can be scenarios where reads also travel far and spend a lot of bandwidth, since the data might not be nearby. Plus, there's tunable consistency in these systems: you can't always just read from one place and send the result to the client; you may need to read from multiple places.

Another blog gives this example, but I'm not absolutely clear about it, even though I'm well versed in undergraduate networking courses:

Let's say you have 10 chunkservers in a rack, all with NVMe drives delivering up to 3,200MB/s. The aggregate (reading from all chunkservers in a rack at the same time) would be 32,000MB/s. Now if the inter-rack network is SFP+, then it can only deliver 10Gbps (about 1,250MB/s), which is far less than the aggregate bandwidth.

That's under ideal conditions on a single rack. Let's say the cluster has 10 racks and the entire network is SFP+. The client can still only consume at 10Gbps, but by distributing the reads among all racks that becomes an average of 1Gbps per rack. Furthermore, given that the topology may be uneven and some racks may have more latency than others for this client, the client can choose the lowest-latency ("nearest" in the paper) rack to do most of its reading from.
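Putting that blog's numbers into a toy model helped me see the point; note the figures and the assumption that clients spread evenly over replica-holding racks are mine, not from the paper:

```python
# Toy model of aggregate read bandwidth (all numbers assumed):
# each rack's uplink is SFP+ (10 Gbps ~= 1250 MB/s), and clients
# reading chunks are spread evenly over the racks holding replicas.
RACK_UPLINK_MBPS = 1250
CLIENT_LINK_MBPS = 1250

def aggregate_read_bw(num_clients: int, racks_with_replicas: int) -> int:
    """Total MB/s that all clients together can pull from the cluster."""
    total_uplink = racks_with_replicas * RACK_UPLINK_MBPS
    total_demand = num_clients * CLIENT_LINK_MBPS
    return min(total_uplink, total_demand)

# 10 clients, every replica in one rack: all share one 10 Gbps uplink.
print(aggregate_read_bw(10, 1))   # 1250
# Same 10 clients, replicas spread over 5 racks: five uplinks in play.
print(aggregate_read_bw(10, 5))   # 6250
```

In this model a single reader gains nothing (it's still capped by its own NIC), which seems to be the paper's point: the win is aggregate bandwidth across many concurrent readers, not a faster individual read.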

Another blog writes this:

More copies of data increase the maximum possible read bandwidth. But more copies of data don’t increase write bandwidth.
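A minimal sketch of that read/write asymmetry (the replica count and link speed are my assumptions; GFS pipelines write data along a chain of chunkservers, so the client uploads each byte only once):

```python
REPLICAS = 3        # GFS's default replication factor
LINK_MBPS = 1250    # assumed 10 Gbps links everywhere

def max_read_bw(num_readers: int) -> int:
    # Each reader can pull from a different replica, so up to
    # REPLICAS readers can stream at full link speed simultaneously.
    return min(num_readers, REPLICAS) * LINK_MBPS

def max_write_bw() -> int:
    # Every byte written must reach all REPLICAS copies. The push is
    # pipelined chunkserver-to-chunkserver, so the writer still gets
    # at most one link's worth, and extra replicas add nothing.
    return LINK_MBPS

print(max_read_bw(3))   # 3750: three readers served in parallel
print(max_write_bw())   # 1250: unchanged no matter how many replicas
```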

What does "bandwidth" mean here in GFS? How is it defined? I'm thinking bandwidth is the amount of data that a piece of networking equipment can transfer at a time. It looks like the blog is trying to say the same "read from anywhere, write everywhere" thing, but its use of the term bandwidth confuses me.

Another blog post writes this:

Typically, the servers in a single rack will be connected by a top of rack switch which connects to every server in that rack. The servers in the rack will be able to communicate with one another at the link speed of their interface, and all of them can do this at the same time. The top of rack switch will connect further to a core switch, using high-bandwidth connections. The core switch is connected to every other top of rack switch. But usually, the link speed of the connection to the core switch will be smaller than the sum of the link speeds of the connections to every server in the rack.

The result of this is that the bandwidth available to servers within the same rack is higher than the bandwidth to communicate to servers outside that rack. (This isn’t always true. Facebook builds networking so that the inter-rack bandwidth is the same as the intra-rack bandwidth. That gives flexibility at the cost of power efficiency.)
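The "sum of the server link speeds vs. the uplink" comparison in that quote is usually called oversubscription; a quick calculation with numbers I'm assuming for illustration:

```python
# Oversubscription at the top-of-rack (ToR) switch (numbers assumed).
SERVERS_PER_RACK = 40
SERVER_LINK_GBPS = 10    # each server's NIC into the ToR switch
UPLINK_GBPS = 100        # ToR-to-core uplink

intra_rack_aggregate = SERVERS_PER_RACK * SERVER_LINK_GBPS   # 400 Gbps
oversub_ratio = intra_rack_aggregate / UPLINK_GBPS           # 4.0

# 4:1 oversubscription: in aggregate, the servers can talk inside
# the rack four times faster than they can all leave it at once.
print(f"{oversub_ratio}:1")
```

The Facebook design the quote mentions is the 1:1 case, where the uplinks match the sum of the server links.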

It does bring to mind the three-tier design concept of core, access, and distribution layers, where the core switch has the best possible speed. But the aggregated distribution/access switches could still have more total speed than the core switch. So what? I still don't get it.

How do reads exploit the aggregate bandwidth of reads from multiple racks (according to the research paper) when the data is placed in chunks across multiple racks? It doesn't make much sense to me and is confusing.