RDMA over Converged Ethernet: Difference between revisions

From Wikipedia, the free encyclopedia
Ophirmaor (talk | contribs)
m RoCE v2: minor edits

Revision as of 22:04, 29 January 2016

RDMA over Converged Ethernet (RoCE) is a network protocol that allows remote direct memory access (RDMA) over an Ethernet network. There are two RoCE versions, RoCE v1 and RoCE v2. RoCE v1 uses the Ethernet protocol as a link layer protocol and hence allows communication between any two hosts in the same Ethernet broadcast domain. RoCE v2 runs on top of UDP/IP and can therefore be routed.[1][2][3]

Background

Network-intensive applications such as networked storage or cluster computing need a network infrastructure with high bandwidth and low latency. The advantages of RDMA over other network application programming interfaces such as Berkeley sockets are lower latency, lower CPU load and higher bandwidth.[4] The RoCE protocol allows lower latencies than its predecessor, the iWARP protocol.[5] RoCE HCAs (host channel adapters) exist with a latency as low as 1.3 microseconds,[6][7] while the lowest known iWARP HCA latency in 2011 was 3 microseconds.[8]

RoCE Header format

RoCE v1

The RoCE v1 protocol uses the Ethernet protocol as a link layer, with EtherType 0x8915.[1] This means that the frame length limits of the Ethernet protocol apply: 1500 bytes for a regular Ethernet frame and 9000 bytes for a jumbo frame.
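
As an illustration of the EtherType check described above, the following minimal Python sketch tests whether a raw Ethernet frame carries RoCE v1. The function name `is_roce_v1` is hypothetical, and the handling of a single IEEE 802.1Q VLAN tag (TPID 0x8100) is an assumption added for completeness; it is not taken from the RoCE v1 specification.

```python
import struct

ROCE_V1_ETHERTYPE = 0x8915  # EtherType stated in the text above
VLAN_TPID = 0x8100          # IEEE 802.1Q tag; an assumption, not from the RoCE text


def is_roce_v1(frame: bytes) -> bool:
    """Return True if a raw Ethernet frame carries the RoCE v1 EtherType."""
    # Destination MAC (6 bytes) + source MAC (6 bytes) + EtherType (2 bytes).
    if len(frame) < 14:
        return False
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    # Skip one 802.1Q tag (4 bytes) if present and read the inner EtherType.
    if ethertype == VLAN_TPID:
        if len(frame) < 18:
            return False
        (ethertype,) = struct.unpack_from("!H", frame, 16)
    return ethertype == ROCE_V1_ETHERTYPE
```

A frame whose thirteenth and fourteenth bytes are `0x89 0x15` is reported as RoCE v1, while an IPv4 frame (EtherType 0x0800) is not.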

RoCE v2

The RoCE v2 protocol is sometimes called Routable RoCE[9] or RRoCE.[3] The InfiniBand RDMA layer runs on top of UDP/IP and supports both IPv4 and IPv6.[2] The destination UDP port number 4791 has been reserved for RoCE v2.[10] Packets with the same UDP source port and the same destination address must not be reordered. Packets with different UDP source port numbers and the same destination address may be sent over different links to that destination address.[3]
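
The port and ordering rules above can be sketched in Python as follows. This is an illustrative model only: the function names `is_roce_v2_udp` and `pick_link` are hypothetical, and the use of CRC32 over the destination address and source port is an assumed stand-in for whatever hash a real switch or adapter applies. The only properties taken from the text are the destination port 4791 and the rule that packets with the same source port and destination address stay on one path.

```python
import struct
import zlib

ROCE_V2_UDP_PORT = 4791  # destination port reserved for RoCE v2 (from the text)


def is_roce_v2_udp(udp_header: bytes) -> bool:
    """Return True if an 8-byte UDP header is addressed to the RoCE v2 port."""
    if len(udp_header) < 8:
        return False
    _src, dst, _length, _checksum = struct.unpack_from("!HHHH", udp_header)
    return dst == ROCE_V2_UDP_PORT


def pick_link(src_port: int, dst_addr: str, n_links: int) -> int:
    """Choose one of n_links equal-cost links for a packet (hypothetical model).

    The choice depends only on the source port and destination address, so
    packets that share both always use the same link and cannot be reordered
    by the link choice. Packets with a different source port may hash to a
    different link.
    """
    key = f"{dst_addr}|{src_port}".encode()
    return zlib.crc32(key) % n_links
```

Under this model, a sender that wants to spread traffic to one destination across several links would vary the UDP source port between flows while keeping it fixed within a flow.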

RoCE versus InfiniBand

RoCE defines how to perform RDMA over Ethernet while the InfiniBand architecture specification defines how to perform RDMA over an InfiniBand network. RoCE was expected to bring InfiniBand applications, which are predominantly based on clusters, onto a common Ethernet converged fabric.[11] Others expected that InfiniBand would keep offering higher bandwidth and lower latency than is possible over Ethernet.[12]

The technical differences between the RoCE and InfiniBand protocols are as follows:

  • Congestion control: RoCE relies on the network being lossless, which is achieved through Ethernet flow control or priority flow control (PFC). In addition, RoCE v2 defines a congestion control protocol that uses explicit congestion notification (ECN) for marking and congestion notification packet (CNP) frames for acknowledgments. InfiniBand uses a credit-based algorithm to guarantee lossless HCA-to-HCA communication.
  • InfiniBand switches available today have, and have always had, a lower latency than Ethernet switches. Port-to-port latency for one particular type of Ethernet switch is 230 ns[13] versus 100 ns[14] for an InfiniBand switch with the same number of ports.
  • Configuring a DCB Ethernet network can be more complex than configuring an InfiniBand network.[15]

RoCE versus iWARP

While the RoCE protocols define how to perform RDMA using Ethernet frames, the iWARP protocol defines how to perform RDMA over a connection-oriented transport such as the Transmission Control Protocol (TCP). RoCE v1 is limited to a single Ethernet broadcast domain. RoCE v2 and iWARP packets are routable.[16] RoCE is bound to Ethernet but iWARP is not. The memory requirements of a large number of connections, along with TCP's flow and reliability controls, lead to scalability and performance issues when using iWARP in large-scale datacenters and for large-scale applications (e.g. large-scale enterprises, cloud computing and web 2.0 applications).[17] Also, multicast is defined in the RoCE specification while the current iWARP specification does not define how to perform multicast RDMA.[18][19][20]

Criticism

Some aspects that could have been defined in the RoCE specification have been left out. These are:

  • How to translate between primary RoCE v1 GIDs and Ethernet MAC addresses.[21]
  • How to translate between secondary RoCE v1 GIDs and Ethernet MAC addresses. It is not clear whether it is possible to implement secondary GIDs in the RoCE v1 protocol without adding a RoCE-specific address resolution protocol.
  • How to implement VLANs for the RoCE v1 protocol. Current RoCE v1 implementations store the VLAN ID in the twelfth and thirteenth bytes of the sixteen-byte GID, although the RoCE v1 specification does not mention VLANs at all.[22]
  • How to translate between RoCE v1 multicast GIDs and Ethernet MAC addresses. Implementations in 2010 used the same address mapping that has been specified for mapping IPv6 multicast addresses to Ethernet MAC addresses.[23][24]
  • How to restrict RoCE v1 multicast traffic to a subset of the ports of an Ethernet switch. As of September 2013, an equivalent of the Multicast Listener Discovery protocol has not yet been defined for RoCE v1.
  • Software support for RoCE v2 is still emerging. Mellanox OFED 2.3 has RoCE v2 support but neither OpenFabrics OFED 3.12 nor Linux kernel 3.17 supports RoCE v2.
  • At least one vendor that offers an RDMA over Ethernet solution has chosen a wire protocol other than RoCE.[25]
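
Two of the implementation conventions mentioned in the list above can be sketched in Python. This is an illustration under stated assumptions, not a description of any particular driver: the function names are hypothetical, the VLAN bytes are assumed to be counted from one and stored in network (big-endian) byte order, and multicast GIDs are assumed to start with the byte 0xFF as IPv6 multicast addresses do. The multicast mapping follows the RFC 2464 rule for IPv6, which places the last four address bytes after the prefix 33:33.

```python
def vlan_id_from_gid(gid: bytes) -> int:
    """Read the VLAN ID from the twelfth and thirteenth bytes of a GID.

    Assumption: bytes are counted from one (indices 11 and 12) and the
    value is stored big-endian.
    """
    if len(gid) != 16:
        raise ValueError("a GID is sixteen bytes long")
    return int.from_bytes(gid[11:13], "big")


def multicast_gid_to_mac(gid: bytes) -> str:
    """Map a multicast GID to an Ethernet MAC address as RFC 2464 does for IPv6."""
    if len(gid) != 16:
        raise ValueError("a GID is sixteen bytes long")
    if gid[0] != 0xFF:
        raise ValueError("not a multicast GID")  # assumed 0xFF prefix
    # RFC 2464: 33-33 followed by the last four bytes of the address.
    mac = b"\x33\x33" + gid[12:16]
    return ":".join(f"{byte:02x}" for byte in mac)
```

For example, a multicast GID ending in the bytes `12 34 56 78` maps to the MAC address `33:33:12:34:56:78` under this rule.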

References

  1. ^ a b "InfiniBand™ Architecture Specification Release 1.2.1 Annex A16: RoCE". InfiniBand Trade Association. 13 April 2010.
  2. ^ a b "InfiniBand™ Architecture Specification Release 1.2.1 Annex A17: RoCEv2". InfiniBand Trade Association. 2 September 2014.
  3. ^ a b c Ophir Maor (December 2015). "RoCEv2 Considerations". Mellanox.
  4. ^ Cameron, Don; Regnier, Greg (2002). Virtual Interface Architecture. Intel Press. ISBN 978-0-9712887-0-6.
  5. ^ Feldman, Michael (22 April 2010). "RoCE: An Ethernet-InfiniBand Love Story". HPC wire.
  6. ^ "End-to-End Lowest Latency Ethernet Solution for Financial Services" (PDF). Mellanox. March 2011.
  7. ^ "RoCE vs. iWARP Competitive Analysis Brief" (PDF). Mellanox. 9 November 2010.
  8. ^ "Low Latency Server Connectivity With New Terminator 4 (T4) Adapter". Chelsio. 25 May 2011.
  9. ^ InfiniBand Trade Association (November 2013). "RoCE Status and Plans" (PDF). IETF.
  10. ^ Diego Crupnicoff (17 October 2014). "Service Name and Transport Protocol Port Number Registry". IANA.
  11. ^ Merritt, Rick (19 April 2010). "New converged network blends Ethernet, InfiniBand". EE Times.
  12. ^ Kerner, Sean Michael (2 April 2010). "InfiniBand Moving to Ethernet ?". Enterprise Networking Planet.
  13. ^ "SX1036 - 36-Port 40/56GbE Switch System". Mellanox. Retrieved April 21, 2014.
  14. ^ "IS5024 - 36-Port Non-blocking Unmanaged 40Gb/s InfiniBand Switch System". Mellanox. Retrieved April 21, 2014.
  15. ^ Mellanox (2 June 2014). "Mellanox Releases New Automation Software to Reduce Ethernet Fabric Installation Time from Hours to Minutes". Mellanox.
  16. ^ "RoCE: Frequently Asked Questions" (PDF). Chelsio. 1 May 2011.
  17. ^ Rashti, Mohammad (2010). "iWARP Redefined: Scalable Connectionless Communication over High-Speed Ethernet" (PDF). International Conference on High Performance Computing (HiPC).
  18. ^ H. Shah; et al. (October 2007). "Direct Data Placement over Reliable Transports". RFC 5041. Retrieved May 4, 2011.
  19. ^ C. Bestler; et al. (October 2007). "Stream Control Transmission Protocol (SCTP) Direct Data Placement (DDP) Adaptation". RFC 5043. Retrieved May 4, 2011.
  20. ^ P. Culley; et al. (October 2007). "Marker PDU Aligned Framing for TCP Specification". RFC 5044. Retrieved May 4, 2011.
  21. ^ Dreier, Roland (6 December 2010). "Two notes on IBoE". Roland Dreier's blog.
  22. ^ Cohen, Eli (26 August 2010). "IB/core: Add VLAN support for IBoE". kernel.org.
  23. ^ Cohen, Eli (13 October 2010). "RDMA/cm: Add RDMA CM support for IBoE devices". kernel.org.
  24. ^ Crawford, M. (1998). "RFC 2464 - Transmission of IPv6 Packets over Ethernet Networks". IETF.
  25. ^ Malhi, Upinder (4 September 2013). "PATCH Cisco VIC RDMA Node and Transport". linux-rdma mailing list.