RDMA over Converged Ethernet: Difference between revisions
Laudabilis (talk | contribs) add archive link for dead pre-Mellanox-acquisition webpage |
|||
{{short description|Network protocol}} |
|||
'''RDMA over Converged Ethernet''' ('''RoCE''')<ref>{{Cite web|url=https://digitalvampire.org/blog/index.php/2010/12/06/two-notes-on-iboe/|title = Roland's Blog » Blog Archive » Two notes on IBoE}}</ref> is a network protocol which allows [[remote direct memory access]] (RDMA) over an [[Ethernet]] network. There are multiple RoCE versions. RoCE v1 is an Ethernet [[link layer]] protocol and hence allows communication between any two hosts in the same Ethernet [[broadcast domain]]. RoCE v2 is an [[internet layer]] protocol which means that RoCE v2 packets can be routed. Although the RoCE protocol benefits from the characteristics of a [[Data center bridging|converged Ethernet network]], the protocol can also be used on a traditional or non-converged Ethernet network.<ref name="RoCEv1">{{cite web |website=InfiniBand Trade Association |url=https://cw.infinibandta.org/document/dl/7148 |title=InfiniBand™ Architecture Specification Release 1.2.1 Annex A16: RoCE |date=13 April 2010 |access-date=29 April 2015 |archive-date=9 March 2016 |archive-url=https://web.archive.org/web/20160309123709/https://cw.infinibandta.org/document/dl/7148 |url-status=dead }}</ref><ref name="RoCEv2">{{cite web |website=InfiniBand Trade Association |url=https://cw.infinibandta.org/document/dl/7781 |title=InfiniBand™ Architecture Specification Release 1.2.1 Annex A17: RoCEv2 |date=2 September 2014 |access-date=19 October 2014 |archive-date=17 September 2020 |archive-url=https://web.archive.org/web/20200917012109/https://cw.infinibandta.org/document/dl/7781 |url-status=dead }}</ref><ref name=":0">{{cite web |author=Ophir Maor |title=RoCEv2 Considerations |url=https://community.mellanox.com/docs/DOC-1451 |website=Mellanox |date=December 2015}}</ref><ref>{{cite web |author=Ophir Maor |title=RoCE and Storage Solutions |url=https://community.mellanox.com/docs/DOC-2283 |website=Mellanox |date=December 2015}}</ref> |
||
|||
==Background== |
||
Network-intensive applications like networked storage or cluster computing need a network infrastructure with a high bandwidth and low latency. The advantages of RDMA over other network [[application programming interfaces]] such as [[Berkeley sockets]] are lower latency, lower CPU load and higher bandwidth.<ref>{{cite book |last1=Cameron |first1=Don |last2=Regnier |first2=Greg |title=Virtual Interface Architecture |isbn=978-0-9712887-0-6 |publisher=Intel Press |year=2002}}</ref> The RoCE protocol allows lower latencies than its predecessor, the [[iWARP]] protocol.<ref>{{cite web |last=Feldman |first=Michael |url=http://archive.hpcwire.com/hpcwire/2010-04-22/roce_an_ethernet-infiniband_love_story.html |title=RoCE: An Ethernet-InfiniBand Love Story |website=HPC wire |date=22 April 2010}}</ref> There are RoCE HCAs (Host Channel Adapter) with a latency as low as 1.3 microseconds<ref>{{cite web |url=http://www.mellanox.com/pdf/applications/SB_EthSolutions_financial.pdf |title=End-to-End Lowest Latency Ethernet Solution for Financial Services |website=Mellanox |date=March 2011}}</ref><ref>{{cite web |url=http://www.mellanox.com/pdf/whitepapers/WP_RoCE_vs_iWARP.pdf |title=RoCE vs. iWARP Competitive Analysis Brief |website=Mellanox |date=9 November 2010}}</ref> while the lowest known iWARP HCA latency in 2011 was 3 microseconds.<ref>{{cite web |website=Chelsio |url=http://www.chelsio.com/press-room/2164/ |title=Low Latency Server Connectivity With New Terminator 4 (T4) Adapter |date=25 May 2011}}</ref> |
||
[[File:RoCE Header format.png|thumb|RoCE Header format]] |
|||
==RoCE v1== |
||
The RoCE v1 protocol is an Ethernet link layer protocol with Ethertype 0x8915.<ref name="RoCEv1"/> This means that the frame length limits of the Ethernet protocol apply: 1500 bytes for a regular [[Ethernet frame]] and 9000 bytes for a [[jumbo frame]]. |
||
==RoCE v1.5== |
|||
RoCE v1.5 is an uncommon, experimental, non-standardized protocol that is based on the IP protocol. RoCE v1.5 uses the IP protocol field to differentiate its traffic from other IP protocols such as [[Transmission Control Protocol|TCP]] and [[User Datagram Protocol|UDP]]. The value used for the protocol number is unspecified and is left to the deployment to select.
|||
==RoCE v2== |
||
The RoCE v2 protocol exists on top of either the UDP/IPv4 or the UDP/IPv6 protocol.<ref name="RoCEv2"/> The UDP destination port number 4791 has been reserved for RoCE v2.<ref>{{cite web |author=Diego Crupnicoff |title=Service Name and Transport Protocol Port Number Registry |url=https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.xhtml?search=IP+Routable+RocE |website=IANA |date=17 October 2014 |access-date=14 October 2018}}</ref> Since RoCEv2 packets are routable, the RoCE v2 protocol is sometimes called Routable RoCE<ref>{{cite web |author=InfiniBand Trade Association |title=RoCE Status and Plans |url=http://www.ietf.org/proceedings/88/slides/slides-88-storm-3.pdf |website=IETF |date=November 2013}}</ref> or RRoCE.<ref name=":0"/> Although in general the delivery order of UDP packets is not guaranteed, the RoCEv2 specification requires that packets with the same UDP source port and the same destination address must not be reordered.<ref name=":0"/> In addition, RoCEv2 defines a congestion control mechanism that uses the IP ECN bits for marking and CNP<ref>{{cite web |author=Ophir Maor |title=RoCEv2 CNP Packet Format |url=https://community.mellanox.com/docs/DOC-2351 |website=Mellanox |date=December 2015}}</ref> frames for the acknowledgment notification.<ref>{{cite web |author=Ophir Maor |title=RoCEv2 Congestion Management |url=https://community.mellanox.com/docs/DOC-2321 |website=Mellanox |date=December 2015}}</ref> Software support for RoCE v2 is still emerging{{when|reason=software support quickly evolves|date=January 2023}}. Mellanox OFED 2.3 and later support RoCE v2, as does Linux kernel v4.5.<ref>{{cite web |title=Kernel GIT |url=https://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=048ccca8c1c8f583deec3367d7df521bb1f542ae |date=January 2016}}</ref>
||
==RoCE versus InfiniBand== |
||
RoCE defines how to perform RDMA over [[Ethernet]] while the [[InfiniBand]] architecture specification defines how to perform RDMA over an InfiniBand network. RoCE was expected to bring InfiniBand applications, which are predominantly based on clusters, onto a common Ethernet converged fabric.<ref>{{cite web |first=Rick |last=Merritt |url=http://www.eetimes.com/electronics-news/4088625/New-converged-network-blends-Ethernet-Infiniband |title=New converged network blends Ethernet, InfiniBand |website=EE Times |date=19 April 2010}}</ref> Others expected that InfiniBand would keep offering a higher bandwidth and lower latency than what is possible over Ethernet.<ref>{{cite web |first=Sean Michael |last=Kerner |url=http://www.enterprisenetworkingplanet.com/nethub/article.php/3879506/InfiniBand-Moving-to-Ethernet.htm |title=InfiniBand Moving to Ethernet ? |website=Enterprise Networking Planet |date=2 April 2010}}</ref>
||
The technical differences between the RoCE and InfiniBand protocols are: |
||
* Link Level Flow Control: InfiniBand uses a credit-based algorithm to guarantee lossless HCA-to-HCA communication. RoCE runs on top of Ethernet. Implementations may require a lossless Ethernet network to reach performance characteristics similar to those of InfiniBand. Lossless Ethernet is typically configured via [[Ethernet flow control]] or priority flow control (PFC). Configuring a [[Data center bridging]] (DCB) Ethernet network can be more complex than configuring an InfiniBand network.<ref>{{cite web |author=Mellanox |url=http://ir.mellanox.com/releasedetail.cfm?ReleaseID=851785 |title=Mellanox Releases New Automation Software to Reduce Ethernet Fabric Installation Time from Hours to Minutes |website=Mellanox |archive-url=http://web.archive.org/web/20160303215911/http://ir.mellanox.com/releasedetail.cfm?ReleaseID=851785 |archive-date=3 March 2016 |date=2 June 2014}}</ref>
|||
* RoCE v1 is a link layer protocol and hence not routable. RoCE v2 and InfiniBand are routable. |
|||
* Congestion Control: InfiniBand defines congestion control based on FECN/BECN marking, while RoCEv2 defines a congestion control protocol that uses ECN for marking, as implemented in standard switches, and CNP frames for acknowledgments.
|||
* RoCE uses priority-based flow control while InfiniBand uses a credit-based algorithm to guarantee lossless HCA-to-HCA communication. The priority-based flow control (PFC) algorithm limits cable length and increases switch cost.<ref>{{cite web |url=http://www.chelsio.com/wp-content/uploads/2011/05/A-Rocky-Road-for-Roce-White-Paper-0112.pdf |title=A Rocky Road for ROCE |website=Chelsio |date=1 May 2011}}</ref><ref>{{cite web |first=Keshav |last=Kamble |url=http://www.ieee802.org/1/files/public/docs2014/new-DCB-kamble-FlowControl-0318-v07.pdf |title=Credit based Link Level Flow Control and Capability Exchange Using DCBX for CEE ports |website=IEEE |date=17 March 2014}}</ref> PFC is good for a small number of hops (1-2), but actual congestion control is likely to be needed at larger scale, as PFC will have issues at larger number of hops.<ref>{{cite web |title=IETF 88 Proceedings - RDMA/IP Mini-BOF - minutes |url=http://www.ietf.org/proceedings/88/minutes/minutes-88-storm |website=IETF |date=7 November 2013}}</ref> |
|||
* InfiniBand switches typically have lower latency than Ethernet switches. Port-to-port latency for one particular type of Ethernet switch is 230 ns<ref>{{cite web |url=http://www.mellanox.com/page/products_dyn?product_family=115&mtag=sx1036 |title=SX1036 - 36-Port 40/56GbE Switch System |website=Mellanox |access-date=April 21, 2014}}</ref> versus 100 ns<ref>{{cite web |url=http://www.mellanox.com/page/products_dyn?product_family=91&mtag=is5024 |title=IS5024 - 36-Port Non-blocking Unmanaged 40Gb/s InfiniBand Switch System |website=Mellanox |access-date=April 21, 2014}}</ref> for an InfiniBand switch with the same number of ports. |
||
* InfiniBand bandwidths are higher towards clients. The current standard setups are based on 40- or 56-gigabit host adapters, which in Ethernet environments are normally only used in the backbone. Though, some newer host adapters are able to run either in 56 gigabit IB or in 56 gigabit Ethernet mode.<ref>{{cite web |author=Mellanox |url=http://ir.mellanox.com/releasedetail.cfm?ReleaseID=762540 |title=Mellanox Announces 56 Gigabit Ethernet Interconnect Solution Family for Data Center Compute and Storage |website=Mellanox |date=7 May 2013}}</ref> |
|||
* Configuring a DCB Ethernet network is significantly more complex than configuring an InfiniBand network.<ref>{{cite web |author=Mellanox |url=http://ir.mellanox.com/releasedetail.cfm?ReleaseID=851785 |title=Mellanox Releases New Automation Software to Reduce Ethernet Fabric Installation Time from Hours to Minutes |website=Mellanox |date=2 June 2014}}</ref> |
|||
==RoCE versus iWARP== |
||
While the RoCE protocols define how to perform RDMA using Ethernet and UDP/IP frames, the [[iWARP]] protocol defines how to perform RDMA over a connection-oriented transport like the [[Transmission Control Protocol]] (TCP). RoCE v1 is limited to a single Ethernet [[broadcast domain]]. RoCE v2 and iWARP packets are routable. The memory requirements of a large number of connections along with TCP's flow and reliability controls lead to scalability and performance issues when using iWARP in large-scale datacenters and for large-scale applications (e.g., large-scale enterprises, cloud computing, and web 2.0 applications<ref>{{cite conference |last=Rashti |first=Mohammad |url=http://post.queensu.ca/~pprl/papers/HiPC-2010.pdf |title=iWARP Redefined: Scalable Connectionless Communication over High-Speed Ethernet |book-title=International Conference on High Performance Computing (HiPC) |year=2010}}</ref>). Also, multicast is defined in the RoCE specification while the current iWARP specification does not define how to perform multicast RDMA.<ref>{{cite journal |title= Direct Data Placement over Reliable Transports|journal= RFC 5041 |author= H. Shah|date= October 2007 |doi= 10.17487/RFC5041 |access-date= May 4, 2011 |url=http://tools.ietf.org/html/rfc5041|display-authors=etal|doi-access= free}}</ref><ref>{{cite journal |title=Stream Control Transmission Protocol (SCTP) Direct Data Placement (DDP) Adaptation |journal= RFC 5043 |author= C. Bestler|editor-first1= C. |editor-first2= R. |editor-last1= Bestler |editor-last2= Stewart |date= October 2007 |doi= 10.17487/RFC5043 |access-date= May 4, 2011 |url=http://tools.ietf.org/html/rfc5043|display-authors=etal|doi-access= free }}</ref><ref>{{cite journal |title= Marker PDU Aligned Framing for TCP Specification |journal= RFC 5044 |author= P. Culley|date= October 2007 |doi= 10.17487/RFC5044 |access-date= May 4, 2011 |url=http://tools.ietf.org/html/rfc5044|display-authors=etal|doi-access= free }}</ref>
||
Reliability in [[iWARP]] is given by the protocol itself, as [[Transmission Control Protocol|TCP]] is reliable. RoCEv2 on the other hand utilizes [[User Datagram Protocol|UDP]] which has a far smaller overhead and better performance but does not provide inherent reliability, and therefore reliability must be implemented alongside RoCEv2. One solution is to use converged Ethernet switches to make the local area network reliable. This requires converged Ethernet support on all the switches in the local area network and prevents RoCEv2 packets from traveling through a wide area network such as the internet which is not reliable. Another solution is to add reliability to the RoCE protocol (i.e., reliable RoCE) which adds handshaking to RoCE to provide reliability at the cost of performance. |
|||
The question of which protocol is better depends on the vendor. Chelsio recommends and exclusively supports iWARP. Mellanox, Xilinx, and Broadcom recommend and exclusively support RoCE/RoCEv2. Intel initially supported iWARP but now supports both iWARP and RoCEv2.<ref>{{cite web |title= Intel® Ethernet 800 Series |publisher=Intel |date=May 2021 |url=https://www.intel.com/content/www/us/en/architecture-and-technology/ethernet.html}}</ref> Other vendors involved in the network industry provide support for both protocols such as Marvell, Microsoft, Linux and Kazan.<ref name="sniaesfblog.org">{{cite web |title= RoCE vs. iWARP – The Next "Great Storage Debate" |author1= T Lustig |author2= F Zhang |author3= J Ko |date= October 2007 |access-date= August 22, 2018 |url= http://sniaesfblog.org/roce-vs-iwarp-the-next-great-storage-debate/ |archive-date= May 20, 2019 |archive-url= https://web.archive.org/web/20190520151033/http://sniaesfblog.org/roce-vs-iwarp-the-next-great-storage-debate/ |url-status= dead }}</ref> Cisco supports both RoCE<ref>{{cite web |title=Benefits of Remote Direct Memory Access Over Routed Fabrics |publisher=Cisco |date=October 2018 |url=https://www.cisco.com/c/dam/en/us/products/collateral/switches/nexus-9000-series-switches/white-paper-c11-741091.pdf}}</ref> and its own VIC RDMA protocol.
|||
Both protocols are standardized, with iWARP being the standard for RDMA over TCP defined by the [[IETF]] and RoCE being the standard for RDMA over Ethernet defined by the [[IBTA]].<ref name="sniaesfblog.org"/>
|||
==Criticism== |
||
* How to translate between secondary RoCE v1 GIDs and Ethernet MAC addresses. It is not clear whether it is possible to implement secondary GIDs in the RoCE v1 protocol without adding a RoCE-specific address resolution protocol. |
||
* How to implement VLANs for the RoCE v1 protocol. Current RoCE v1 implementations store the VLAN ID in the twelfth and thirteenth byte of the sixteen-byte GID, although the RoCE v1 specification does not mention VLANs at all.<ref>{{cite web |first=Eli |last=Cohen |url=https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=af7bd463761c6abd8ca8d831f9cc0ac19f3b7d4b |title=IB/core: Add VLAN support for IBoE |website=kernel.org |date=26 August 2010}}</ref> |
||
* How to translate between RoCE v1 multicast GIDs and Ethernet MAC addresses. Implementations in 2010 used the same address mapping that has been specified for mapping IPv6 multicast addresses to Ethernet MAC addresses.<ref>{{cite web |first=Eli |last=Cohen |url=https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3c86aa70bf677a31b71c8292e349242e26cbc743 |title=RDMA/cm: Add RDMA CM support for IBoE devices |website=kernel.org |date=13 October 2010}}</ref><ref>{{cite journal |first=M. |last=Crawford |url=http://tools.ietf.org/html/rfc2464#section-7 |title=RFC 2464 - Transmission of IPv6 Packets over Ethernet Networks |website=IETF |year=1998|doi=10.17487/RFC2464 |doi-access=free }}</ref> |
||
* How to restrict RoCE v1 multicast traffic to a subset of the ports of an Ethernet switch. As of September 2013, an equivalent of the [[Multicast Listener Discovery]] protocol has not yet been defined for RoCE v1. |
||
* Software support for RoCE v2 is still emerging. Mellanox OFED 2.3 has RoCE v2 support but neither OpenFabrics OFED 3.12 nor Linux kernel 3.17 supports RoCE v2. |
|||
In addition, any protocol running over IP cannot assume the underlying network has guaranteed ordering, any more than it can assume congestion cannot occur. |
|||
* At least one vendor that offers an RDMA over Ethernet solution has chosen another wire protocol than RoCE.<ref>{{cite web |last=Malhi |first=Upinder |url=http://thread.gmane.org/gmane.linux.drivers.rdma/17221 |title=PATCH Cisco VIC RDMA Node and Transport |website=linux-rdma mailing list |date=4 September 2013}}</ref> |
|||
It is known that the use of PFC can lead to a network-wide deadlock.<ref>{{cite conference
|first1=Shuihai |last1=Hu |first2=Yibo |last2=Zhu |first3=Peng |last3=Cheng |first4=Chuanxiong
|last4=Guo |first5=Kun |last5=Tan |first6=Jitendra |last6=Padhye |first7=Kai |last7=Chen
|title=Deadlocks in Datacenter Networks: Why Do They Form, and How to Avoid Them
|url=https://www.microsoft.com/en-us/research/wp-content/uploads/2016/10/hotnets16-final67.pdf
|conference=15th ACM Workshop on Hot Topics in Networks |pages=92–98 |year=2016}}</ref>
<ref>{{cite conference |first1=Alex |last1=Shpiner |first2=Eitan |last2=Zahavi |first3=Vladimir
|last3=Zdornov |first4=Tal |last4=Anker |first5=Matty |last5=Kadosh
|title=Unlocking credit loop deadlocks
|url=https://www.researchgate.net/publication/309638534
|conference=15th ACM Workshop on Hot Topics in Networks |pages=85–91 |year=2016}}</ref>
<ref>{{cite arXiv |first1=Radhika |last1=Mittal |first2=Alexander |last2=Shpiner
|first3=Aurojit |last3=Panda |first4=Eitan |last4=Zahavi |first5=Arvind |last5=Krishnamurthy
|first6=Sylvia |last6=Ratnasamy |first7=Scott |last7=Shenker
|title=Revisiting Network Support for RDMA |date=21 June 2018 |eprint=1806.08159 |class=cs.NI }}</ref>
|||
==Vendors== |
|||
Some vendors of RoCE-enabled equipment include:
|||
* [[Mellanox]] (acquired by [[Nvidia]] in 2020,<ref>{{Cite web|url=https://www.crn.com/news/components-peripherals/nvidia-mellanox-deal-may-not-close-until-early-2020|title=Nvidia: Mellanox Deal May Not Close Until Early 2020|date=14 November 2019}}</ref> brand retained<ref>{{Cite web|url=https://blogs.nvidia.com/blog/2019/03/27/israel-mellanox-nvidia/|title = Israel's AI Ecosystem Toasts NVIDIA's Proposed Mellanox Acquisition | NVIDIA Blog|date = 27 March 2019}}</ref>) |
|||
* [[Emulex]] (acquired by [[Broadcom]]) |
|||
* [[Broadcom]] |
|||
* [[QLogic]] (acquired by [[Cavium]], rebranded) |
|||
* [[Cavium]] (acquired by [[Marvell Technology Group]], rebranded) |
|||
* [[Huawei]] |
|||
* [[ATTO Technology]] |
|||
* [[Dell Technologies]] |
|||
* [[Intel]] |
|||
* [[Bloombase]] |
|||
* [[Xilinx]] (via [[FPGA]] soft IP core) |
|||
* [https://grovf.com/products/grovf-rdma Grovf]<ref>{{Cite news|title=Grovf Inc. Releases Low Latency RDMA RoCE V2 FPGA IP Core for Smart NICs|work=Yahoo News|url=https://www.yahoo.com/now/grovf-inc-releases-low-latency-000000059.html}}</ref> |
|||
* [https://www.cerio.io Cerio]<ref>{{Cite news|title=ACCELERATED COMPUTING PLATFORM|url=https://www.cerio.io}}</ref> |
|||
==References== |
Latest revision as of 21:46, 27 September 2024
RDMA over Converged Ethernet (RoCE)[1] is a network protocol which allows remote direct memory access (RDMA) over an Ethernet network. There are multiple RoCE versions. RoCE v1 is an Ethernet link layer protocol and hence allows communication between any two hosts in the same Ethernet broadcast domain. RoCE v2 is an internet layer protocol which means that RoCE v2 packets can be routed. Although the RoCE protocol benefits from the characteristics of a converged Ethernet network, the protocol can also be used on a traditional or non-converged Ethernet network.[2][3][4][5]
Background
Network-intensive applications like networked storage or cluster computing need a network infrastructure with a high bandwidth and low latency. The advantages of RDMA over other network application programming interfaces such as Berkeley sockets are lower latency, lower CPU load and higher bandwidth.[6] The RoCE protocol allows lower latencies than its predecessor, the iWARP protocol.[7] There are RoCE HCAs (Host Channel Adapter) with a latency as low as 1.3 microseconds[8][9] while the lowest known iWARP HCA latency in 2011 was 3 microseconds.[10]
RoCE v1
The RoCE v1 protocol is an Ethernet link layer protocol with Ethertype 0x8915.[2] This means that the frame length limits of the Ethernet protocol apply: 1500 bytes for a regular Ethernet frame and 9000 bytes for a jumbo frame.
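As an illustration of what identifying a protocol by Ethertype means in practice, the following Python sketch checks whether a raw Ethernet frame carries the RoCE v1 Ethertype. It is a minimal example that assumes an untagged frame (no 802.1Q VLAN tag); the function name and the sample frame are hypothetical and are not taken from the RoCE specification.

```python
import struct

ROCE_V1_ETHERTYPE = 0x8915  # Ethertype given in the text for RoCE v1
ETH_HEADER_LEN = 14         # destination MAC (6) + source MAC (6) + Ethertype (2)


def is_roce_v1_frame(frame: bytes) -> bool:
    """Return True if an untagged Ethernet frame carries the RoCE v1 Ethertype."""
    if len(frame) < ETH_HEADER_LEN:
        return False
    # The Ethertype is a big-endian 16-bit field at byte offset 12.
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    return ethertype == ROCE_V1_ETHERTYPE


# Hypothetical frame: broadcast destination, made-up source MAC, no payload.
example_frame = (
    bytes.fromhex("ffffffffffff")
    + bytes.fromhex("020000000001")
    + struct.pack("!H", ROCE_V1_ETHERTYPE)
)
print(is_roce_v1_frame(example_frame))
```

A frame with a VLAN tag would carry 0x8100 at offset 12 and the inner Ethertype four bytes later, which this sketch deliberately does not handle.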
RoCE v1.5
RoCE v1.5 is an uncommon, experimental, non-standardized protocol that is based on the IP protocol. RoCE v1.5 uses the IP protocol field to differentiate its traffic from other IP protocols such as TCP and UDP. The value used for the protocol number is unspecified and is left to the deployment to select.
RoCE v2
The RoCE v2 protocol exists on top of either the UDP/IPv4 or the UDP/IPv6 protocol.[3] The UDP destination port number 4791 has been reserved for RoCE v2.[11] Since RoCEv2 packets are routable, the RoCE v2 protocol is sometimes called Routable RoCE[12] or RRoCE.[4] Although in general the delivery order of UDP packets is not guaranteed, the RoCEv2 specification requires that packets with the same UDP source port and the same destination address must not be reordered.[4] In addition, RoCEv2 defines a congestion control mechanism that uses the IP ECN bits for marking and CNP[13] frames for the acknowledgment notification.[14] Software support for RoCE v2 is still emerging[when?]. Mellanox OFED 2.3 and later support RoCE v2, as does Linux kernel v4.5.[15]
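The two header fields mentioned above, the UDP destination port and the IP ECN bits, can be read from a raw packet as in the following Python sketch. It is a simplified illustration for IPv4 only: the function names and the sample addresses are hypothetical, checksums are left at zero and not verified, and the position of the ECN field (the two low-order bits of the second IPv4 header byte, with the value 0b11 meaning "congestion experienced") follows RFC 3168 rather than anything stated in this article.

```python
import struct

ROCE_V2_UDP_PORT = 4791  # UDP destination port given in the text
IPPROTO_UDP = 17         # IP protocol number of UDP
ECN_CE = 0b11            # "congestion experienced" codepoint (RFC 3168)


def inspect_ipv4_packet(packet: bytes):
    """Return (is_roce_v2, ecn_bits) for a raw IPv4 packet, or None if it is malformed."""
    if len(packet) < 20 or packet[0] >> 4 != 4:
        return None
    ihl = (packet[0] & 0x0F) * 4   # IPv4 header length in bytes
    if ihl < 20:
        return None
    ecn = packet[1] & 0b11         # ECN occupies the two low-order bits of byte 1
    if packet[9] != IPPROTO_UDP or len(packet) < ihl + 8:
        return (False, ecn)
    # The UDP header starts right after the IPv4 header: source port, destination port.
    _src_port, dst_port = struct.unpack_from("!HH", packet, ihl)
    return (dst_port == ROCE_V2_UDP_PORT, ecn)


def build_example(dst_port: int, ecn: int) -> bytes:
    """Build a hypothetical IPv4/UDP packet with no payload (checksums left at zero)."""
    ip_header = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, ecn, 28, 0, 0, 64, IPPROTO_UDP, 0,
        bytes([192, 0, 2, 1]), bytes([192, 0, 2, 2]),
    )
    udp_header = struct.pack("!HHHH", 49152, dst_port, 8, 0)
    return ip_header + udp_header


is_roce, ecn = inspect_ipv4_packet(build_example(ROCE_V2_UDP_PORT, ECN_CE))
print(is_roce, ecn == ECN_CE)
```

A receiver that sees the congestion-experienced mark on such a packet is the party that would answer with a CNP; generating the CNP itself is outside the scope of this sketch.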
RoCE versus InfiniBand
RoCE defines how to perform RDMA over Ethernet while the InfiniBand architecture specification defines how to perform RDMA over an InfiniBand network. RoCE was expected to bring InfiniBand applications, which are predominantly based on clusters, onto a common Ethernet converged fabric.[16] Others expected that InfiniBand would keep offering a higher bandwidth and lower latency than what is possible over Ethernet.[17]
The technical differences between the RoCE and InfiniBand protocols are:
- Link Level Flow Control: InfiniBand uses a credit-based algorithm to guarantee lossless HCA-to-HCA communication. RoCE runs on top of Ethernet. Implementations may require a lossless Ethernet network to reach performance characteristics similar to those of InfiniBand. Lossless Ethernet is typically configured via Ethernet flow control or priority flow control (PFC). Configuring a Data center bridging (DCB) Ethernet network can be more complex than configuring an InfiniBand network.[18]
- Congestion Control: InfiniBand defines congestion control based on FECN/BECN marking, while RoCEv2 defines a congestion control protocol that uses ECN for marking, as implemented in standard switches, and CNP frames for acknowledgments.
- InfiniBand switches typically have lower latency than Ethernet switches. Port-to-port latency for one particular type of Ethernet switch is 230 ns[19] versus 100 ns[20] for an InfiniBand switch with the same number of ports.
RoCE versus iWARP
While the RoCE protocols define how to perform RDMA using Ethernet and UDP/IP frames, the iWARP protocol defines how to perform RDMA over a connection-oriented transport like the Transmission Control Protocol (TCP). RoCE v1 is limited to a single Ethernet broadcast domain. RoCE v2 and iWARP packets are routable. The memory requirements of a large number of connections along with TCP's flow and reliability controls lead to scalability and performance issues when using iWARP in large-scale datacenters and for large-scale applications (e.g., large-scale enterprises, cloud computing, and web 2.0 applications[21]). Also, multicast is defined in the RoCE specification while the current iWARP specification does not define how to perform multicast RDMA.[22][23][24]
Reliability in iWARP is given by the protocol itself, as TCP is reliable. RoCEv2, on the other hand, utilizes UDP, which has a far smaller overhead and better performance but does not provide inherent reliability, and therefore reliability must be implemented alongside RoCEv2. One solution is to use converged Ethernet switches to make the local area network reliable. This requires converged Ethernet support on all the switches in the local area network and prevents RoCEv2 packets from traveling through a wide area network such as the internet, which is not reliable. Another solution is to add reliability to the RoCE protocol (i.e., reliable RoCE), which adds handshaking to RoCE to provide reliability at the cost of performance.
The question of which protocol is better depends on the vendor. Chelsio recommends and exclusively supports iWARP. Mellanox, Xilinx, and Broadcom recommend and exclusively support RoCE/RoCEv2. Intel initially supported iWARP but now supports both iWARP and RoCEv2.[25] Other vendors involved in the network industry provide support for both protocols such as Marvell, Microsoft, Linux and Kazan.[26] Cisco supports both RoCE[27] and its own VIC RDMA protocol.
Both protocols are standardized, with iWARP being the standard for RDMA over TCP defined by the IETF and RoCE being the standard for RDMA over Ethernet defined by the IBTA.[26]
Criticism
Some aspects that could have been defined in the RoCE specification have been left out. These are:
- How to translate between primary RoCE v1 GIDs and Ethernet MAC addresses.[28]
- How to translate between secondary RoCE v1 GIDs and Ethernet MAC addresses. It is not clear whether it is possible to implement secondary GIDs in the RoCE v1 protocol without adding a RoCE-specific address resolution protocol.
- How to implement VLANs for the RoCE v1 protocol. Current RoCE v1 implementations store the VLAN ID in the twelfth and thirteenth byte of the sixteen-byte GID, although the RoCE v1 specification does not mention VLANs at all.[29]
- How to translate between RoCE v1 multicast GIDs and Ethernet MAC addresses. Implementations in 2010 used the same address mapping that has been specified for mapping IPv6 multicast addresses to Ethernet MAC addresses.[30][31]
- How to restrict RoCE v1 multicast traffic to a subset of the ports of an Ethernet switch. As of September 2013, an equivalent of the Multicast Listener Discovery protocol has not yet been defined for RoCE v1.
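Two of the conventions in the list above can be made concrete with a short Python sketch: the IPv6-style multicast mapping (RFC 2464, section 7, which forms the MAC address from 33:33 followed by the last four bytes of the address) and the VLAN ID stored in the twelfth and thirteenth GID bytes. The function names and sample values are hypothetical. Reading the two VLAN bytes as one big-endian 16-bit value is an assumption of this sketch, since the text only says where the bytes are stored.

```python
import ipaddress


def multicast_gid_to_mac(gid: bytes) -> bytes:
    """Map a 16-byte multicast GID to a MAC address the way RFC 2464 maps IPv6 multicast."""
    if len(gid) != 16:
        raise ValueError("a GID is 16 bytes long")
    if gid[0] != 0xFF:
        raise ValueError("not a multicast GID (first byte must be 0xff)")
    # RFC 2464 section 7: 33:33 followed by the last four bytes of the address.
    return bytes([0x33, 0x33]) + gid[12:16]


def vlan_id_from_gid(gid: bytes) -> int:
    """Read the twelfth and thirteenth GID bytes (indices 11 and 12) as one 16-bit value.

    Big-endian byte order is an assumption of this sketch.
    """
    if len(gid) != 16:
        raise ValueError("a GID is 16 bytes long")
    return (gid[11] << 8) | gid[12]


# Hypothetical multicast GID, written as an IPv6 address for readability.
mcast_gid = ipaddress.IPv6Address("ff02::1:ff00:1234").packed
print(multicast_gid_to_mac(mcast_gid).hex(":"))

# Hypothetical unicast GID carrying VLAN ID 100 in bytes twelve and thirteen.
vlan_gid = bytearray(16)
vlan_gid[11], vlan_gid[12] = 0x00, 0x64
print(vlan_id_from_gid(bytes(vlan_gid)))
```

Because only the last four bytes of the GID survive the multicast mapping, distinct multicast GIDs can share one MAC address, just as distinct IPv6 multicast addresses can.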
In addition, any protocol running over IP cannot assume the underlying network has guaranteed ordering, any more than it can assume congestion cannot occur.
It is known that the use of PFC can lead to a network-wide deadlock.[32] [33] [34]
Vendors
Some vendors of RoCE-enabled equipment include:
- Mellanox (acquired by Nvidia in 2020,[35] brand retained[36])
- Emulex (acquired by Broadcom)
- Broadcom
- QLogic (acquired by Cavium, rebranded)
- Cavium (acquired by Marvell Technology Group, rebranded)
- Huawei
- ATTO Technology
- Dell Technologies
- Intel
- Bloombase
- Xilinx (via FPGA soft IP core)
- Grovf[37]
- Cerio[38]
References
- ^ "Roland's Blog » Blog Archive » Two notes on IBoE".
- ^ a b "InfiniBand™ Architecture Specification Release 1.2.1 Annex A16: RoCE". InfiniBand Trade Association. 13 April 2010. Archived from the original on 9 March 2016. Retrieved 29 April 2015.
- ^ a b "InfiniBand™ Architecture Specification Release 1.2.1 Annex A17: RoCEv2". InfiniBand Trade Association. 2 September 2014. Archived from the original on 17 September 2020. Retrieved 19 October 2014.
- ^ a b c Ophir Maor (December 2015). "RoCEv2 Considerations". Mellanox.
- ^ Ophir Maor (December 2015). "RoCE and Storage Solutions". Mellanox.
- ^ Cameron, Don; Regnier, Greg (2002). Virtual Interface Architecture. Intel Press. ISBN 978-0-9712887-0-6.
- ^ Feldman, Michael (22 April 2010). "RoCE: An Ethernet-InfiniBand Love Story". HPC wire.
- ^ "End-to-End Lowest Latency Ethernet Solution for Financial Services" (PDF). Mellanox. March 2011.
- ^ "RoCE vs. iWARP Competitive Analysis Brief" (PDF). Mellanox. 9 November 2010.
- ^ "Low Latency Server Connectivity With New Terminator 4 (T4) Adapter". Chelsio. 25 May 2011.
- ^ Diego Crupnicoff (17 October 2014). "Service Name and Transport Protocol Port Number Registry". IANA. Retrieved 14 October 2018.
- ^ InfiniBand Trade Association (November 2013). "RoCE Status and Plans" (PDF). IETF.
- ^ Ophir Maor (December 2015). "RoCEv2 CNP Packet Format". Mellanox.
- ^ Ophir Maor (December 2015). "RoCEv2 Congestion Management". Mellanox.
- ^ "Kernel GIT". January 2016.
- ^ Merritt, Rick (19 April 2010). "New converged network blends Ethernet, InfiniBand". EE Times.
- ^ Kerner, Sean Michael (2 April 2010). "InfiniBand Moving to Ethernet?". Enterprise Networking Planet.
- ^ Mellanox (2 June 2014). "Mellanox Releases New Automation Software to Reduce Ethernet Fabric Installation Time from Hours to Minutes". Mellanox. Archived from the original on 3 March 2016.
- ^ "SX1036 - 36-Port 40/56GbE Switch System". Mellanox. Retrieved April 21, 2014.
- ^ "IS5024 - 36-Port Non-blocking Unmanaged 40Gb/s InfiniBand Switch System". Mellanox. Retrieved April 21, 2014.
- ^ Rashti, Mohammad (2010). "iWARP Redefined: Scalable Connectionless Communication over High-Speed Ethernet" (PDF). International Conference on High Performance Computing (HiPC).
- ^ H. Shah; et al. (October 2007). "Direct Data Placement over Reliable Transports". RFC 5041. doi:10.17487/RFC5041. Retrieved May 4, 2011.
- ^ C. Bestler; et al. (October 2007). Bestler, C.; Stewart, R. (eds.). "Stream Control Transmission Protocol (SCTP) Direct Data Placement (DDP) Adaptation". RFC 5043. doi:10.17487/RFC5043. Retrieved May 4, 2011.
- ^ P. Culley; et al. (October 2007). "Marker PDU Aligned Framing for TCP Specification". RFC 5044. doi:10.17487/RFC5044. Retrieved May 4, 2011.
- ^ "Intel® Ethernet 800 Series". Intel. May 2021.
- ^ a b T Lustig; F Zhang; J Ko (October 2007). "RoCE vs. iWARP – The Next "Great Storage Debate"". Archived from the original on May 20, 2019. Retrieved August 22, 2018.
- ^ "Benefits of Remote Direct Memory Access Over Routed Fabrics" (PDF). Cisco. October 2018.
- ^ Dreier, Roland (6 December 2010). "Two notes on IBoE". Roland Dreier's blog.
- ^ Cohen, Eli (26 August 2010). "IB/core: Add VLAN support for IBoE". kernel.org.
- ^ Cohen, Eli (13 October 2010). "RDMA/cm: Add RDMA CM support for IBoE devices". kernel.org.
- ^ Crawford, M. (1998). "RFC 2464 - Transmission of IPv6 Packets over Ethernet Networks". IETF. doi:10.17487/RFC2464.
- ^ Hu, Shuihai; Zhu, Yibo; Cheng, Peng; Guo, Chuanxiong; Tan, Kun; Padhye, Jitendra; Chen, Kai (2016). Deadlocks in Datacenter Networks: Why Do They Form, and How to Avoid Them (PDF). 15th ACM Workshop on Hot Topics in Networks. pp. 92–98.
- ^ Shpiner, Alex; Zahavi, Eitan; Zdornov, Vladimir; Anker, Tal; Kadosh, Matty (2016). Unlocking credit loop deadlocks. 15th ACM Workshop on Hot Topics in Networks. pp. 85–91.
- ^ Mittal, Radhika; Shpiner, Alexander; Panda, Aurojit; Zahavi, Eitan; Krishnamurthy, Arvind; Ratnasamy, Sylvia; Shenker, Scott (21 June 2018). "Revisiting Network Support for RDMA". arXiv:1806.08159 [cs.NI].
- ^ "Nvidia: Mellanox Deal May Not Close Until Early 2020". 14 November 2019.
- ^ "Israel's AI Ecosystem Toasts NVIDIA's Proposed Mellanox Acquisition | NVIDIA Blog". 27 March 2019.
- ^ "Grovf Inc. Releases Low Latency RDMA RoCE V2 FPGA IP Core for Smart NICs". Yahoo News.
- ^ "ACCELERATED COMPUTING PLATFORM".