GPFS

From Wikipedia, the free encyclopedia
{{short description|High-performance clustered file system}}
{{Multiple issues|
{{Advert|date=May 2024}}
{{More citations needed|date=May 2024}}
}}
{{Infobox filesystem
| name = GPFS
| developer = [[IBM]]
| full_name = IBM Spectrum Scale
| introduction_date = {{Start date and age|1998}} with [[IBM AIX|AIX]]
| max_volume_size = 8 [[Yottabyte|YB]]
| max_file_size = 8 [[Exabyte|EB]]
| max_files_no = 2<sup>64</sup> per file system
| file_system_permissions = [[POSIX]]
| transparent_encryption = yes
| OS = [[IBM AIX|AIX]], [[Linux]], [[Windows Server]]
}}
'''GPFS''' ('''General Parallel File System''', brand name '''IBM Storage Scale''' and previously '''IBM Spectrum Scale''')<ref name="IBM_research_gpfs_page">{{cite web
| title = GPFS (General Parallel File System)
| publisher = IBM
| url = https://researcher.draco.res.ibm.com/researcher/view_group.php?id=4840
| access-date = 2020-04-07
}}</ref>
is high-performance [[clustered file system]] software developed by [[IBM]]. It can be deployed in [[shared-disk]] or [[shared-nothing]] distributed parallel modes, or a combination of these. It is used by many of the world's largest commercial companies, as well as some of the [[supercomputer]]s on the [[TOP500|Top 500 List]].<ref name="schmuck02">{{cite conference
| first = Frank
| last = Schmuck
|author2=Roger Haskin
| title = GPFS: A Shared-Disk File System for Large Computing Clusters
| book-title = Proceedings of the FAST'02 Conference on File and Storage Technologies
| pages = 231–244
| publisher = USENIX
| date = January 2002
| location = Monterey, California, US
| url = http://www.usenix.org/events/fast02/full_papers/schmuck/schmuck.pdf
| isbn = 1-880446-03-0
| access-date = 2008-01-18}}
</ref>
For example, it is the filesystem of the [[Summit (supercomputer)|Summit]]
<ref>{{cite web
| title = Summit compute systems
| publisher = Oak Ridge National Laboratory
| url = https://www.olcf.ornl.gov/olcf-resources/compute-systems/summit/
| access-date = 2020-04-07
}}</ref>
at [[Oak Ridge National Laboratory]], which was the #1 fastest supercomputer in the world in the November 2019 Top 500 List.<ref name="Nov 2019 top500 supercomputer list">{{cite web
| title = November 2019 top500 list
| publisher = top500.org
| url = https://www.top500.org/list/2019/11/
| access-date = 2020-04-07
| archive-date = 2020-01-02
| archive-url = https://web.archive.org/web/20200102235204/https://www.top500.org/list/2019/11/
| url-status = dead
}}</ref> Summit is a 200 [[Petaflops]] system composed of more than 9,000 [[POWER9]] processors and 27,000 NVIDIA [[Volta (microarchitecture)|Volta]] [[GPU]]s. The storage filesystem is called Alpine.<ref name="summit_faq_page">{{cite web
| title = Summit FAQ
| publisher = Oak Ridge National Laboratory
| url = https://www.olcf.ornl.gov/olcf-resources/compute-systems/summit/summit-faqs/
| access-date = 2020-04-07
}}</ref>


Like typical cluster filesystems, GPFS provides concurrent high-speed file access to applications executing on multiple nodes of clusters. It can be used with [[IBM AIX|AIX]] clusters, [[Linux]] clusters,<ref>{{cite book|chapter=BPAR: A Bundle-Based Parallel Aggregation Framework for Decoupled I/O Execution|publisher=IEEE|date=Nov 2014|doi=10.1109/DISCS.2014.6|title=2014 International Workshop on Data Intensive Scalable Computing Systems|pages=25–32|last1=Wang|first1=Teng|last2=Vasko|first2=Kevin|last3=Liu|first3=Zhuo|last4=Chen|first4=Hui|last5=Yu|first5=Weikuan|isbn=978-1-4673-6750-9|s2cid=2402391}}</ref> on Microsoft [[Windows Server]], or a heterogeneous cluster of AIX, Linux and Windows nodes running on [[x86]], [[Power ISA|Power]] or [[IBM Z]] processor architectures.


==History==
GPFS began as the ''Tiger Shark'' file system, a research project at IBM's [[IBM Research - Almaden|Almaden Research Center]] as early as 1993. Tiger Shark was initially designed to support high throughput multimedia applications. This design turned out to be well suited to scientific computing.<ref name="may00">{{cite book
| last = May
| first = John M.
| title = Parallel I/O for High Performance Computing
| publisher = Morgan Kaufmann
| year = 2000
| url = https://books.google.com/books?id=iLj516DOIKkC&q=shark+vesta+gpfs&pg=PA92
| isbn = 978-1-55860-664-7
| access-date = 2008-06-18
| page = 92}}</ref>


Another ancestor is IBM's ''Vesta'' filesystem, developed as a research project at IBM's [[Thomas J. Watson Research Center]] between 1992 and 1995.<ref name="corbett93">{{Cite book|last1=Corbett|first1=Peter F.|last2=Feitelson|first2=Dror G.|last3=Prost|first3=J.-P.|last4=Baylor|first4=S. J.|title=Proceedings of the 1993 ACM/IEEE conference on Supercomputing - Supercomputing '93 |publisher=ACM/IEEE|year=1993|location=Portland, Oregon, United States|pages=472–481|contribution=Parallel access to files in the Vesta file system|doi=10.1145/169627.169786|isbn=978-0818643408|s2cid=46409100}}<!--| access-date = 2008-06-18--></ref> Vesta introduced the concept of file partitioning to accommodate the needs of parallel applications that run on high-performance [[Parallel computing|multicomputers]] with [[parallel I/O]] subsystems. With partitioning, a file is not a sequence of bytes, but rather multiple disjoint sequences that may be accessed in parallel. The partitioning is such that it abstracts away the number and type of I/O nodes hosting the filesystem, and it allows a variety of logically partitioned views of files, regardless of the physical distribution of data within the I/O nodes. The disjoint sequences are arranged to correspond to individual processes of a parallel application, allowing for improved scalability.<ref name="corbett96">{{Cite journal
| first1 = Peter F.
| last1 = Corbett
| first2 = Dror G.
| last2 = Feitelson
| title = The Vesta parallel file system
| journal = ACM Transactions on Computer Systems
| volume = 14
| issue = 3
| pages = 225–264
| date = August 1996
| url = http://www.cs.umd.edu/class/fall2002/cmsc818s/Readings/vesta-tocs96.pdf
| doi = 10.1145/233557.233558
| s2cid = 11975458
| access-date = 2008-06-18
| archive-date = 2012-02-12
| archive-url = https://web.archive.org/web/20120212075707/http://www.cs.umd.edu/class/fall2002/cmsc818s/Readings/vesta-tocs96.pdf
| url-status = bot: unknown
}}</ref><ref>{{cite journal|author1=Teng Wang|author2=Kevin Vasko|author3=Zhuo Liu|author4=Hui Chen|author5=Weikuan Yu|title=Enhance parallel input/output with cross-bundle aggregation|journal=The International Journal of High Performance Computing Applications|volume=30|issue=2|pages=241–256|date=2016|doi=10.1177/1094342015618017|s2cid=12067366}}</ref>
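The effect of such a partitioned view can be shown with a small sketch. The following Python fragment illustrates round-robin partitioning in general, not Vesta's actual interface; the cell size, function names and file path are hypothetical. Each cooperating process sees its own disjoint sequence of fixed-size cells of a shared file and can read it without overlapping the others.
<syntaxhighlight lang="python">
import os

CELL = 65536  # illustrative cell size (64 KiB); Vesta's real parameters differed

def cell_offsets(file_size: int, rank: int, nprocs: int):
    """Yield the byte offsets of the cells owned by process `rank` when a
    file is split round-robin into fixed-size cells across `nprocs` processes."""
    offset = rank * CELL
    while offset < file_size:
        yield offset
        offset += nprocs * CELL

def read_my_partition(path: str, rank: int, nprocs: int) -> bytes:
    """Read only the disjoint sub-sequence of `path` owned by this process.
    Every process touches different cells, so all of them can read in parallel."""
    size = os.path.getsize(path)
    chunks = []
    with open(path, "rb") as f:
        for off in cell_offsets(size, rank, nprocs):
            f.seek(off)
            chunks.append(f.read(min(CELL, size - off)))
    return b"".join(chunks)

# Example: process 2 of 8 reads its logical view of a shared file.
# data = read_my_partition("/gpfs/shared/input.dat", rank=2, nprocs=8)
</syntaxhighlight>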


Vesta was commercialized as the PIOFS filesystem around 1994,<ref name="corbett95">{{cite journal |last=Corbett |first=P. F. |author2=D. G. Feitelson |author3=J.-P. Prost |author4=G. S. Almasi |author5=S. J. Baylor |author6=A. S. Bolmarcich |author7=Y. Hsu |author8=J. Satran |author9=M. Snir |author10=R. Colao |author11=B. D. Herr |author12=J. Kavaky |author13=T. R. Morgan |author14=A. Zlotek |title=Parallel file systems for the IBM SP computers |journal=IBM Systems Journal |volume=34 |issue=2 |pages=222–248 |year=1995 |url=http://www.research.ibm.com/journal/sj/342/corbett.pdf |access-date=2008-06-18 |doi=10.1147/sj.342.0222 |citeseerx=10.1.1.381.2988 |archive-date=2004-04-19 |archive-url=https://web.archive.org/web/20040419115328/http://www.research.ibm.com/journal/sj/342/corbett.pdf |url-status=bot: unknown }}</ref> and was succeeded by GPFS around 1998.<ref name="barrios99">{{cite book
|first = Marcelo
|last = Barris
|author2 = Terry Jones
|author3 = Scott Kinnane
|author4 = Mathis Landzettel Safran Al-Safran
|author5 = Jerry Stevens
|author6 = Christopher Stone
|author7 = Chris Thomas
|author8 = Ulf Troppens
|title = Sizing and Tuning GPFS
|publisher = IBM Redbooks, International Technical Support Organization
|date = September 1999
|url = https://www.redbooks.ibm.com/redbooks/pdfs/sg245610.pdf
|no-pp = true
|access-date = 2022-12-06
|page = see page 1 (''"GPFS is the successor to the PIOFS file system"'')
|archive-date = 2010-12-14
|archive-url = https://web.archive.org/web/20101214215324/https://www.redbooks.ibm.com/redbooks/pdfs/sg245610.pdf
|url-status = bot: unknown
}}</ref><ref name="snir01">{{cite web
| last = Snir
| first = Marc
| title = Scalable parallel systems: Contributions 1990-2000
| publisher = HPC seminar, Computer Architecture Department, Universitat Politècnica de Catalunya
|date=June 2001
| url = http://research.ac.upc.edu/HPCseminar/SEM0001/snir.pdf
| access-date = 2008-06-18}}
</ref> The main difference between the older and newer filesystems was that GPFS replaced the specialized interface offered by Vesta/PIOFS with the standard [[Unix]] [[API]]: all the features to support high performance parallel I/O were hidden from users and implemented under the hood.<ref name="may00"/><ref name="snir01"/> GPFS also shared many components with the related products IBM Multi-Media Server and IBM Video Charger, which is why many GPFS utilities start with the prefix ''mm''—multi-media.<ref>{{cite book | title = General Parallel File System Administration and Programming Reference Version 3.1 | publisher = IBM | date = April 2006 | url = https://www.ira.inaf.it/Computing/tecnica/GPFS/GPFS_31_Admin_Guide.pdf }}</ref>{{Rp|xi}}

In 2010, IBM previewed a version of GPFS that included a capability known as GPFS-SNC, where SNC stands for Shared Nothing Cluster. This was officially released with GPFS 3.5 in December 2012, and is now known as FPO
<ref>{{cite web
| title = IBM GPFS FPO (DCS03038-USEN-00)
| publisher = IBM Corporation
| year = 2013
| url = http://public.dhe.ibm.com/common/ssi/ecm/en/dcs03038usen/DCS03038USEN.PDF
| access-date = 2012-08-12
}}{{Dead link|date=January 2020 |bot=InternetArchiveBot |fix-attempted=yes }}</ref> (File Placement Optimizer).


==Architecture==
{{refimprove|section|date=January 2020}}
GPFS is a [[clustered file system]] that breaks a file into blocks of a configured size, less than 1 megabyte each, and distributes them across multiple cluster nodes.
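As a rough sketch of the idea (not GPFS's actual on-disk layout or API; the block size and server count below are arbitrary), striping whole blocks round-robin across a set of storage servers means a large sequential read is spread over all of them:
<syntaxhighlight lang="python">
BLOCK_SIZE = 256 * 1024  # example block size; in GPFS the block size is configurable

def block_location(byte_offset: int, num_servers: int) -> tuple[int, int]:
    """Map a byte offset in a file to (block index, server index) under
    simple round-robin striping of whole blocks across storage servers."""
    block_index = byte_offset // BLOCK_SIZE
    server_index = block_index % num_servers
    return block_index, server_index

# A 10 MB sequential read touches 40 blocks spread over all 8 servers,
# so the individual transfers can proceed in parallel.
servers_hit = {block_location(off, num_servers=8)[1]
               for off in range(0, 10 * 1024 * 1024, BLOCK_SIZE)}
print(sorted(servers_hit))  # -> [0, 1, 2, 3, 4, 5, 6, 7]
</syntaxhighlight>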


The system stores data on standard block storage volumes, but includes an internal RAID layer that can virtualize those volumes for redundancy and parallel access much like a RAID block storage system. It also has the ability to replicate across volumes at the higher file level.


Features of the architecture include:
* Distributed metadata, including the directory tree. There is no single "directory controller" or "index server" in charge of the filesystem.
* Efficient indexing of directory entries for very large directories.
* Distributed locking. This allows for full [[POSIX]] filesystem semantics, including locking for exclusive file access (see the example after this list).
* Partition Aware. A failure of the network may partition the filesystem into two or more groups of nodes that can only see the nodes in their group. This can be detected through a heartbeat protocol, and when a partition occurs, the filesystem remains live for the largest partition formed. This offers a graceful degradation of the filesystem — some machines will remain working.
* Filesystem maintenance can be performed online. Most of the filesystem maintenance chores (adding new disks, rebalancing data across disks) can be performed while the filesystem is live. This maximizes the filesystem availability, and thus the availability of the supercomputer cluster itself.
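Because locking follows POSIX semantics, ordinary byte-range locking code needs no GPFS-specific calls. The sketch below uses only the standard Python <code>fcntl</code> module; the mount path shown is hypothetical.
<syntaxhighlight lang="python">
import fcntl

# Ordinary POSIX advisory locking; on a GPFS file system the lock is
# coordinated cluster-wide by the distributed lock manager, so a process
# on another node contending for the same file will also block here.
with open("/gpfs/fs1/shared/counter.txt", "r+") as f:
    fcntl.lockf(f, fcntl.LOCK_EX)   # exclusive lock on the whole file
    value = int(f.read() or "0")
    f.seek(0)
    f.write(str(value + 1))
    f.truncate()
    fcntl.lockf(f, fcntl.LOCK_UN)   # release; closing the file also releases it
</syntaxhighlight>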


Other features include high availability, ability to be used in a heterogeneous cluster, disaster recovery, security, [[DMAPI]], [[Hierarchical Storage Management|HSM]] and [[Information Lifecycle Management|ILM]].

==Compared to Hadoop Distributed File System (HDFS)==
Hadoop's HDFS filesystem is designed to store similar or greater quantities of data on commodity hardware — that is, datacenters without RAID disks and a storage area network (SAN).


* HDFS also breaks files up into blocks, and stores them on different filesystem nodes.
* GPFS has full [[POSIX]] filesystem semantics.
* GPFS distributes its directory indices and other metadata across the filesystem. Hadoop, in contrast, keeps this on the Primary and Secondary Namenodes, large servers which must store all index information in RAM.
* GPFS breaks files up into small blocks. Hadoop HDFS prefers blocks of {{nowrap|64 MB}} or more, as this reduces the storage requirements of the Namenode. Small blocks or many small files fill up a filesystem's indices fast, limiting the filesystem's size (see the comparison sketch below).
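The influence of block size on metadata volume can be seen with a back-of-the-envelope calculation (illustrative sizes only):
<syntaxhighlight lang="python">
def block_count(file_size_bytes: int, block_size_bytes: int) -> int:
    """Number of blocks needed to store one file (ignoring replication)."""
    return -(-file_size_bytes // block_size_bytes)  # ceiling division

TB = 1024 ** 4
# The same 1 TB file needs vastly more block records with small blocks,
# which is why HDFS favours large blocks to keep the Namenode's in-memory
# index small, while GPFS instead spreads its metadata across the cluster.
print(block_count(1 * TB, 256 * 1024))        # 4,194,304 blocks at 256 KB
print(block_count(1 * TB, 64 * 1024 * 1024))  # 16,384 blocks at 64 MB
</syntaxhighlight>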


==Information lifecycle management==
Storage pools allow for the grouping of disks within a file system. An administrator can create tiers of storage by grouping disks based on performance, locality or reliability characteristics. For example, one pool could be high-performance [[Fibre Channel]] disks and another more economical SATA storage.


A fileset is a sub-tree of the file system namespace and provides a way to partition the namespace into smaller, more manageable units. Filesets provide an administrative boundary that can be used to set quotas and be specified in a policy to control initial data placement or data migration. Data in a single fileset can reside in one or more storage pools. Where the file data resides and how it is migrated is based on a set of rules in a user defined policy.


There are two types of user defined policies: file placement and file management. File placement policies direct file data to the appropriate storage pool as files are created. File placement rules are selected by attributes such as file name, the user name or the fileset. File management policies allow the file's data to be moved or replicated, or files to be deleted. File management policies can be used to move data from one pool to another without changing the file's location in the directory structure. File management policies are determined by file attributes such as last access time, path name or size of the file.
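GPFS expresses these rules in an SQL-like policy language; the sketch below is instead a simplified Python model of the same division of labour, with hypothetical pool names, thresholds and attributes, intended only to show how placement differs from management.
<syntaxhighlight lang="python">
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class FileAttrs:
    name: str
    fileset: str
    size: int
    last_access: datetime

def placement_pool(f: FileAttrs) -> str:
    """Placement policy: evaluated once, when the file is created."""
    if f.name.endswith((".log", ".tmp")):
        return "sata_pool"   # bulk or temporary data goes straight to the cheap tier
    if f.fileset == "scratch":
        return "fc_pool"     # the scratch fileset lands on fast Fibre Channel disks
    return "system"          # default pool

def management_action(f: FileAttrs, now: datetime) -> Optional[str]:
    """Management policy: applied later by the policy engine; it can move data
    between pools without changing the file's path in the namespace."""
    if f.name.endswith(".tmp") and now - f.last_access > timedelta(days=7):
        return "DELETE"
    if now - f.last_access > timedelta(days=90):
        return "MIGRATE to sata_pool"
    return None

f = FileAttrs("run42.log", "projects", size=4 << 20, last_access=datetime(2024, 1, 1))
print(placement_pool(f), management_action(f, now=datetime(2024, 6, 1)))
# -> sata_pool MIGRATE to sata_pool
</syntaxhighlight>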


The policy processing engine is scalable and can be run on many nodes at once. This allows management policies to be applied to a single file system with billions of files and to complete in a few hours.{{citation needed|date=January 2014}}

==See also==
{{Columns-list|colwidth=20em|
* [[Alluxio]]
* [[Oracle Cloud File System|ASM Cluster File System]] (ACFS)
* [[BeeGFS]]
* [[GFS2]]
* [[Moose File System|MooseFS]]
* [[OCFS2]]
* [[Panasas]] PanFS
* [[QFS]]
* [[IBM Scale-out File Services]] – NAS-grid
}}

==References==
{{Reflist}}



==External links==
* [https://web.archive.org/web/20060421011157/http://www.almaden.ibm.com/StorageSystems/file_systems/GPFS/ IBM Spectrum Scale at Almaden]
* [http://www.gpfsug.org Spectrum Scale User Group]
* [http://www-01.ibm.com/support/knowledgecenter/SSFKCN/gpfs_welcome.html?lang=en IBM Spectrum Scale Product Documentation]


{{File systems}}
[[Category:Distributed file systems supported by the Linux kernel]]
[[Category:IBM file systems]]
[[Category:IBM storage software|Spectrum Scale]]
[[Category:Distributed file systems]]
