In-Network Coherence Filtering: Snoopy Coherence without Broadcasts
Abstract
With transistor miniaturization leading to an abundance of
on-chip resources and uniprocessor designs providing diminishing returns, the industry has moved beyond single-core microprocessors and embraced the many-core wave. Scalable
cache coherence protocol implementations are necessary to allow
fast sharing of data among various cores and drive the
many-core revolution forward. Snoopy coherence protocols, if
realizable, have the desirable properties of low storage
overhead and no added indirection delay on cache-to-cache
accesses. There are various proposals, such as Token Coherence
(TokenB), Uncorq, Intel QPI, INSO and Timestamp Snooping, that tackle the ordering of requests in snoopy protocols
and make them realizable on unordered networks. However,
snoopy protocols still incur broadcast overhead because
each coherence request goes to all cores in the system. This
has substantial network bandwidth and power implications.
In this work, we propose embedding small in-network coherence filters inside on-chip routers that dynamically track sharing patterns among various cores. This sharing information is
used to filter away redundant snoop requests that are traveling
towards cores that do not share the requested data. Filtering these useless messages saves
network bandwidth and power and makes snoopy protocols on
many-core systems truly scalable. Our in-network coherence
filters reduce the total number of snoops in the system by 41.9% on average, thereby reducing total network
traffic by 25.4% on 16-processor chip multiprocessor (CMP)
systems running parallel applications. For 64-processor CMP
systems, our filtering technique achieves a 46.5% average
reduction in the total number of snoops, which reduces
total network traffic by 27.3% on average.