Apache Flink: Difference between revisions

Apache Flink
Developer(s)	Apache Software Foundation
Initial release	May 2011; 13 years ago
Stable release	1.20.0 / 1 August 2024; 4 months ago
Repository	github.com/apache/flink
Written in	Java and Scala
Operating system	Cross-platform
Type	Data analytics; machine learning algorithms
License	Apache License 2.0
Website	flink.apache.org


Revision as of 16:13, 5 November 2021

Apache Flink is an open-source, unified stream-processing and batch-processing framework developed by the Apache Software Foundation. The core of Apache Flink is a distributed streaming data-flow engine written in Java and Scala.^[2]^[3] Flink executes arbitrary dataflow programs in a data-parallel and pipelined (hence task parallel) manner.^[4] Flink's pipelined runtime system enables the execution of bulk/batch and stream processing programs.^[5]^[6] Furthermore, Flink's runtime supports the execution of iterative algorithms natively.^[7]

Flink provides a high-throughput, low-latency streaming engine^[8] as well as support for event-time processing and state management. Flink applications are fault-tolerant in the event of machine failure and support exactly-once semantics.^[9] Programs can be written in Java, Scala,^[10] Python,^[11] and SQL^[12] and are automatically compiled and optimized^[13] into dataflow programs that are executed in a cluster or cloud environment.^[14]

Flink does not provide its own data-storage system, but provides data-source and sink connectors to systems such as Amazon Kinesis, Apache Kafka, HDFS, Apache Cassandra, and Elasticsearch.^[15]

Development

Apache Flink is developed under the Apache License 2.0^[16] by the Apache Flink Community within the Apache Software Foundation. The project is driven by over 25 committers and over 340 contributors.

Overview

Apache Flink's dataflow programming model provides event-at-a-time processing on both finite and infinite datasets. At a basic level, Flink programs consist of streams and transformations. “Conceptually, a stream is a (potentially never-ending) flow of data records, and a transformation is an operation that takes one or more streams as input, and produces one or more output streams as a result.”^[17]

Apache Flink includes two core APIs: a DataStream API for bounded or unbounded streams of data and a DataSet API for bounded data sets. Flink also offers a Table API, which is a SQL-like expression language for relational stream and batch processing that can be easily embedded in Flink's DataStream and DataSet APIs. The highest-level language supported by Flink is SQL, which is semantically similar to the Table API and represents programs as SQL query expressions.

Programming Model and Distributed Runtime

Upon execution, Flink programs are mapped to streaming dataflows.^[17] Every Flink dataflow starts with one or more sources (a data input, e.g., a message queue or a file system) and ends with one or more sinks (a data output, e.g., a message queue, file system, or database). An arbitrary number of transformations can be performed on the stream. These streams can be arranged as a directed, acyclic dataflow graph, allowing an application to branch and merge dataflows.
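
The source, transformation, and sink structure described above can be illustrated without Flink itself. The following is a minimal sketch in plain Scala, not Flink's API: the object `DataflowSketch`, its functions, and the `short:`/`long:` tags are hypothetical names chosen for illustration. An `Iterator` stands in for the source, a collection builder stands in for the sink, and each record passes through a branch and a merge one at a time.

```scala
object DataflowSketch {
  // Two branches of the dataflow graph (hypothetical transformations).
  def shortBranch(word: String): String = s"short:$word"
  def longBranch(word: String): String = s"long:${word.toUpperCase}"

  /** Runs the whole dataflow: source -> tokenize -> branch -> merge -> sink. */
  def run(source: Iterator[String]): Vector[String] = {
    // Stands in for a message queue, file system, or database sink.
    val sink = Vector.newBuilder[String]

    // Transformation: split each incoming line into lowercase words.
    val words = source.flatMap(_.toLowerCase.split("\\W+")).filter(_.nonEmpty)

    // Branch on word length, then merge both branches into the single sink.
    // Records are handled one at a time, as they would be in a stream.
    words.foreach { w =>
      sink += (if (w.length <= 3) shortBranch(w) else longBranch(w))
    }
    sink.result()
  }
}
```

For instance, `DataflowSketch.run(Iterator("To be or not", "Flink streams"))` sends the four short words down one branch and the two long words down the other before both reach the same sink.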

Flink offers ready-built source and sink connectors with Apache Kafka, Amazon Kinesis, HDFS, Apache Cassandra, and more.^[15]

Flink programs run as a distributed system within a cluster and can be deployed in standalone mode as well as on YARN, Mesos, and Docker-based setups, along with other resource management frameworks.^[18]

State: Checkpoints, Savepoints, and Fault-tolerance

Apache Flink includes a lightweight fault tolerance mechanism based on distributed checkpoints.^[9] A checkpoint is an automatic, asynchronous snapshot of the state of an application and the position in a source stream. In the case of a failure, a Flink program with checkpointing enabled will, upon recovery, resume processing from the last completed checkpoint, ensuring that Flink maintains exactly-once state semantics within an application. The checkpointing mechanism exposes hooks for application code to include external systems into the checkpointing mechanism as well (like opening and committing transactions with a database system).
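
The recovery behaviour described above can be sketched as a small simulation. This is a simplified, single-threaded illustration in plain Scala rather than Flink's distributed snapshot algorithm: the names `CheckpointSketch`, `Checkpoint`, `run`, `interval`, and `failAt` are hypothetical, the "state" is just a running sum, and the source is assumed to be replayable from any offset.

```scala
object CheckpointSketch {
  // A checkpoint pairs the operator state with the source position it reflects.
  final case class Checkpoint(offset: Int, sum: Long)

  /** Sums `records`, taking a checkpoint every `interval` records and simulating
    * one crash just before the record at index `failAt` (None means no crash). */
  def run(records: IndexedSeq[Int], interval: Int, failAt: Option[Int]): Long = {
    var lastCheckpoint = Checkpoint(0, 0L)
    var offset = 0
    var sum = 0L
    var pendingFailure = failAt

    while (offset < records.length) {
      if (pendingFailure.contains(offset)) {
        // Crash: in-flight state is lost. Restore the last completed checkpoint
        // and rewind the source to the position stored with it.
        pendingFailure = None
        offset = lastCheckpoint.offset
        sum = lastCheckpoint.sum
      } else {
        sum += records(offset)
        offset += 1
        if (offset % interval == 0) lastCheckpoint = Checkpoint(offset, sum)
      }
    }
    sum
  }
}
```

Because state and source position are restored together, records processed after the last checkpoint are replayed against the rolled-back state, so each record is reflected in the final sum exactly once. With `records = 1 to 10` and `interval = 3`, a run that crashes at index 5 returns the same result as a run without any failure.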

Flink also includes a mechanism called savepoints, which are manually triggered checkpoints.^[19] A user can generate a savepoint, stop a running Flink program, then resume the program from the same application state and position in the stream. Savepoints enable updates to a Flink program or a Flink cluster without losing the application's state. As of Flink 1.2, savepoints also allow an application to be restarted with a different parallelism, letting users adapt to changing workloads.
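
One way to picture restarting with a different parallelism is to store keyed state in a fixed number of partitions that do not depend on how many parallel instances are running. The sketch below is an illustration under stated assumptions, not Flink's implementation: `RescaleSketch`, `NumKeyGroups`, the use of `String.hashCode`, and the range-based assignment are all hypothetical simplifications of the key-group idea described in Flink's documentation.

```scala
object RescaleSketch {
  // Hypothetical fixed number of state partitions ("key groups").
  val NumKeyGroups = 128

  // Assumption: a key's group depends only on the key, never on the parallelism.
  def keyGroup(key: String): Int = Math.floorMod(key.hashCode, NumKeyGroups)

  // Each parallel instance owns a contiguous range of key groups.
  def instanceFor(group: Int, parallelism: Int): Int =
    group * parallelism / NumKeyGroups

  // Snapshot: organise state by key group, independent of parallelism.
  def snapshot(state: Map[String, Long]): Map[Int, Map[String, Long]] =
    state.groupBy { case (key, _) => keyGroup(key) }

  // Restore: hand every key group to its owner under the new parallelism.
  def restore(saved: Map[Int, Map[String, Long]],
              parallelism: Int): Map[Int, Map[String, Long]] =
    saved.toSeq
      .groupBy { case (group, _) => instanceFor(group, parallelism) }
      .map { case (instance, groups) => instance -> groups.flatMap(_._2).toMap }
}
```

Restoring the same snapshot with a parallelism of 2 or of 4 distributes the same entries over a different number of instances; no entry is lost or duplicated, because whole key groups move together.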

DataStream API

Flink's DataStream API enables transformations (e.g. filters, aggregations, window functions) on bounded or unbounded streams of data. The DataStream API includes more than 20 different types of transformations and is available in Java and Scala.^[20]

A simple example of a stateful stream processing program is an application that emits a word count from a continuous input stream and groups the data in 5-second windows:

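import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

case class WordCount(word: String, count: Int)

object WindowWordCount {
  def main(args: Array[String]): Unit = {

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Source: lines of text read from a socket on localhost:9999
    val text = env.socketTextStream("localhost", 9999)

    // Split lines into words, then count each word per 5-second window
    val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
      .map { WordCount(_, 1) }
      // Key selector function; replaces the deprecated keyBy("word")
      .keyBy(_.word)
      // Explicit window assigner; replaces the deprecated timeWindow(...)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
      .sum("count")

    // Sink: print the windowed counts to standard output
    counts.print()

    env.execute("Window Stream WordCount")
  }
}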
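
The grouping into 5-second windows in the program above can also be reproduced without a cluster. The following plain-Scala sketch is an illustration only: `TumblingWindowSketch`, `Event`, `WindowedCount`, and `countWords` are hypothetical names, and each record carries an explicit non-negative timestamp in milliseconds so that the result is deterministic, whereas a processing-time window in a running job depends on the wall clock.

```scala
object TumblingWindowSketch {
  final case class Event(timestampMs: Long, line: String)
  final case class WindowedCount(windowStartMs: Long, word: String, count: Int)

  // A tumbling window of size `sizeMs` starts at the largest multiple of
  // `sizeMs` that is not greater than the timestamp.
  def windowStart(timestampMs: Long, sizeMs: Long): Long =
    timestampMs - (timestampMs % sizeMs)

  /** Counts words per (window, word) pair, sorted by window start, then word. */
  def countWords(events: Seq[Event], sizeMs: Long): Seq[WindowedCount] =
    events
      .flatMap { e =>
        e.line.toLowerCase.split("\\W+").filter(_.nonEmpty)
          .map(word => (windowStart(e.timestampMs, sizeMs), word))
      }
      .groupBy(identity)
      .map { case ((start, word), hits) => WindowedCount(start, word, hits.size) }
      .toSeq
      .sortBy(c => (c.windowStartMs, c.word))
}
```

With a window size of 5000 ms, events at 1000 ms and 4999 ms fall into the window starting at 0, while an event at 5000 ms opens the next window, so the same word is counted separately in each.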

Apache Beam - Flink Runner

Apache Beam “provides an advanced unified programming model, allowing (a developer) to implement batch and streaming data processing jobs that can run on any execution engine.”^[21] The Apache Flink-on-Beam runner is the most feature-rich according to a capability matrix maintained by the Beam community.^[22]

data Artisans, in conjunction with the Apache Flink community, worked closely with the Beam community to develop a Flink runner.^[23]

DataSet API

Flink's DataSet API enables transformations (e.g., filters, mapping, joining, grouping) on bounded datasets. The DataSet API includes more than 20 different types of transformations.^[24] The API is available in Java and Scala, along with an experimental Python API. Flink's DataSet API is conceptually similar to the DataStream API.

Table API and SQL

Flink's Table API is a SQL-like expression language for relational stream and batch processing that can be embedded in Flink's Java and Scala DataSet and DataStream APIs. The Table API and SQL interface operate on a relational Table abstraction. Tables can be created from external data sources or from existing DataStreams and DataSets. The Table API supports relational operators such as selection, aggregation, and joins on Tables.

Tables can also be queried with regular SQL. The Table API and SQL offer equivalent functionality and can be mixed in the same program. When a Table is converted back into a DataSet or DataStream, the logical plan, which was defined by relational operators and SQL queries, is optimized using Apache Calcite and is transformed into a DataSet or DataStream program.^[25]

Flink Forward

Flink Forward is an annual conference about Apache Flink. The first edition of Flink Forward took place in 2015 in Berlin. The two-day conference had over 250 attendees from 16 countries. Sessions were organized in two tracks with over 30 technical presentations from Flink developers and one additional track with hands-on Flink training.

In 2016, 350 participants joined the conference and over 40 speakers presented technical talks in 3 parallel tracks. On the third day, attendees were invited to participate in hands-on training sessions.

In 2017, the event expanded to San Francisco as well. The conference day was dedicated to technical talks on how Flink is used in the enterprise, Flink system internals, ecosystem integrations with Flink, and the future of the platform. It featured keynotes, talks from Flink users in industry and academia, and hands-on training sessions on Apache Flink.

In 2020, because of the COVID-19 pandemic, Flink Forward's spring edition, which was to be hosted in San Francisco, was canceled. Instead, the conference was hosted virtually from April 22 to April 24, featuring live keynotes, Flink use cases, Apache Flink internals, and other topics on stream processing and real-time analytics.^[26]

History

In 2010, the research project "Stratosphere: Information Management on the Cloud"^[27] (funded by the German Research Foundation (DFG)^[28]) was started as a collaboration of Technical University Berlin, Humboldt-Universität zu Berlin, and Hasso-Plattner-Institut Potsdam. Flink started from a fork of Stratosphere's distributed execution engine and it became an Apache Incubator project in March 2014.^[29] In December 2014, Flink was accepted as an Apache top-level project.^[30]^[31]^[32]^[33]

Version	Original release date	Latest version	Release date
Old version, no longer maintained: 0.9	2015-06-24	0.9.1	2015-09-01
Old version, no longer maintained: 0.10	2015-11-16	0.10.2	2016-02-11
Old version, no longer maintained: 1.0	2016-03-08	1.0.3	2016-05-11
Old version, no longer maintained: 1.1	2016-08-08	1.1.5	2017-03-22
Old version, no longer maintained: 1.2	2017-02-06	1.2.1	2017-04-26
Old version, no longer maintained: 1.3	2017-06-01	1.3.3	2018-03-15
Old version, no longer maintained: 1.4	2017-12-12	1.4.2	2018-03-08
Old version, no longer maintained: 1.5	2018-05-25	1.5.6	2018-12-26
Old version, no longer maintained: 1.6	2018-08-08	1.6.3	2018-12-22
Old version, no longer maintained: 1.7	2018-11-30	1.7.2	2019-02-15
Old version, no longer maintained: 1.8	2019-04-09	1.8.3	2019-12-11
Old version, no longer maintained: 1.9	2019-08-22	1.9.2	2020-01-30
Old version, no longer maintained: 1.10	2020-02-11	1.10.3	2021-01-29
Old version, no longer maintained: 1.11	2020-07-06	1.11.4	2021-08-09
Old version, no longer maintained: 1.12	2020-12-10	1.12.5	2021-08-06
Old version, yet still maintained: 1.13	2021-05-03	1.13.2	2021-08-02
Current stable version: 1.14	2021-09-29	1.14.0	2021-09-29

Release Dates

09/2021: Apache Flink 1.14 (09/2021: v1.14.0)
05/2021: Apache Flink 1.13 (05/2021: v1.13.1; 08/2021: v1.13.2)
12/2020: Apache Flink 1.12 (01/2021: v1.12.1; 03/2021: v1.12.2; 04/2021: v1.12.3; 05/2021: v1.12.4; 08/2021: v1.12.5)
07/2020: Apache Flink 1.11 (07/2020: v1.11.1; 09/2020: v1.11.2; 12/2020: v1.11.3; 08/2021: v1.11.4)
02/2020: Apache Flink 1.10 (05/2020: v1.10.1; 08/2020: v1.10.2; 01/2021: v1.10.3)
08/2019: Apache Flink 1.9 (10/2019: v1.9.1; 01/2020: v1.9.2)
04/2019: Apache Flink 1.8 (07/2019: v1.8.1; 09/2019: v1.8.2; 12/2019: v1.8.3)
11/2018: Apache Flink 1.7 (12/2018: v1.7.1; 02/2019: v1.7.2)
08/2018: Apache Flink 1.6 (09/2018: v1.6.1; 10/2018: v1.6.2; 12/2018: v1.6.3)
05/2018: Apache Flink 1.5 (07/2018: v1.5.1; 07/2018: v1.5.2; 08/2018: v1.5.3; 09/2018: v1.5.4; 10/2018: v1.5.5; 12/2018: v1.5.6)
12/2017: Apache Flink 1.4 (02/2018: v1.4.1; 03/2018: v1.4.2)
06/2017: Apache Flink 1.3 (06/2017: v1.3.1; 08/2017: v1.3.2; 03/2018: v1.3.3)
02/2017: Apache Flink 1.2 (04/2017: v1.2.1)
08/2016: Apache Flink 1.1 (08/2016: v1.1.1; 09/2016: v1.1.2; 10/2016: v1.1.3; 12/2016: v1.1.4; 03/2017: v1.1.5)
03/2016: Apache Flink 1.0 (04/2016: v1.0.1; 04/2016: v1.0.2; 05/2016: v1.0.3)
11/2015: Apache Flink 0.10 (11/2015: v0.10.1; 02/2016: v0.10.2)
06/2015: Apache Flink 0.9 (09/2015: v0.9.1)
- 04/2015: Apache Flink 0.9-milestone-1

Apache Incubator Release Dates

01/2015: Apache Flink 0.8-incubating
11/2014: Apache Flink 0.7-incubating
08/2014: Apache Flink 0.6-incubating (09/2014: v0.6.1-incubating)
05/2014: Stratosphere 0.5 (06/2014: v0.5.1; 07/2014: v0.5.2)

Pre-Apache Stratosphere Release Dates

01/2014: Stratosphere 0.4 (version 0.3 was skipped)
08/2012: Stratosphere 0.2
05/2011: Stratosphere 0.1 (08/2011: v0.1.1)

References

[1] "Release 1.20.0". 1 August 2024. Retrieved 20 August 2024.
[2] "Apache Flink: Scalable Batch and Stream Data Processing". apache.org.
[3] "apache/flink". GitHub.
[4] Alexander Alexandrov, Rico Bergmann, Stephan Ewen, Johann-Christoph Freytag, Fabian Hueske, Arvid Heise, Odej Kao, Marcus Leich, Ulf Leser, Volker Markl, Felix Naumann, Mathias Peters, Astrid Rheinländer, Matthias J. Sax, Sebastian Schelter, Mareike Höger, Kostas Tzoumas, and Daniel Warneke. 2014. The Stratosphere platform for big data analytics. The VLDB Journal 23, 6 (December 2014), 939-964. DOI
[5] Ian Pointer (7 May 2015). "Apache Flink: New Hadoop contender squares off against Spark". InfoWorld.
[6] "On Apache Flink. Interview with Volker Markl". odbms.org.
[7] Stephan Ewen, Kostas Tzoumas, Moritz Kaufmann, and Volker Markl. 2012. Spinning fast iterative data flows. Proc. VLDB Endow. 5, 11 (July 2012), 1268-1279. DOI
[8] "Benchmarking Streaming Computation Engines at Yahoo!". Yahoo Engineering. Retrieved 2017-02-23.
[9] Carbone, Paris; Fóra, Gyula; Ewen, Stephan; Haridi, Seif; Tzoumas, Kostas (2015-06-29). "Lightweight Asynchronous Snapshots for Distributed Dataflows". arXiv:1506.08603 [cs.DC].
[10] "Apache Flink 1.2.0 Documentation: Flink DataStream API Programming Guide". ci.apache.org. Retrieved 2017-02-23.
[11] "Apache Flink 1.2.0 Documentation: Python Programming Guide". ci.apache.org. Retrieved 2017-02-23.
[12] "Apache Flink 1.2.0 Documentation: Table and SQL". ci.apache.org. Retrieved 2017-02-23.
[13] Fabian Hueske, Mathias Peters, Matthias J. Sax, Astrid Rheinländer, Rico Bergmann, Aljoscha Krettek, and Kostas Tzoumas. 2012. Opening the black boxes in data flow optimization. Proc. VLDB Endow. 5, 11 (July 2012), 1256-1267. DOI
[14] Daniel Warneke and Odej Kao. 2009. Nephele: efficient parallel data processing in the cloud. In Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS '09). ACM, New York, NY, USA, Article 8, 10 pages. DOI
[15] "Apache Flink 1.2.0 Documentation: Streaming Connectors". ci.apache.org. Retrieved 2017-02-23.
[16] "ASF Git Repos - flink.git/blob - LICENSE". apache.org. Archived from the original on 2017-10-23. Retrieved 2015-04-12.
[17] "Apache Flink 1.2.0 Documentation: Dataflow Programming Model". ci.apache.org. Retrieved 2017-02-23.
[18] "Apache Flink 1.2.0 Documentation: Distributed Runtime Environment". ci.apache.org. Retrieved 2017-02-24.
[19] "Apache Flink 1.2.0 Documentation: Distributed Runtime Environment - Savepoints". ci.apache.org. Retrieved 2017-02-24.
[20] "Apache Flink 1.2.0 Documentation: Flink DataStream API Programming Guide". ci.apache.org. Retrieved 2017-02-24.
[21] "Apache Beam". beam.apache.org. Retrieved 2017-02-24.
[22] "Apache Beam Capability Matrix". beam.apache.org. Retrieved 2017-02-24.
[23] "Why Apache Beam? A Google Perspective | Google Cloud Big Data and Machine Learning Blog | Google Cloud Platform". Google Cloud Platform. Retrieved 2017-02-24.
[24] "Apache Flink 1.2.0 Documentation: Flink DataSet API Programming Guide". ci.apache.org. Retrieved 2017-02-24.
[25] "Stream Processing for Everyone with SQL and Apache Flink". flink.apache.org. Retrieved 2020-01-08.
[26] "Flink Forward Virtual Conference 2020".
[27] "Stratosphere". stratosphere.eu.
[28] "DFG - Deutsche Forschungsgemeinschaft -". dfg.de.
[29] "Stratosphere". apache.org.
[30] "Project Details for Apache Flink". apache.org.
[31] "The Apache Software Foundation Announces Apache™ Flink™ as a Top-Level Project : The Apache Software Foundation Blog". apache.org.
[32] "Will the mysterious Apache Flink find a sweet spot in the enterprise?". siliconangle.com.
[33] (in German)

External links

Official website


@@ Line 23: / Line 23: @@
 ==Development==
 Apache Flink is developed under the [[Apache License]] 2.0<ref>{{cite web|url=https://git1-us-west.apache.org/repos/asf?p=flink.git;a=blob;f=LICENSE;hb=HEAD|title=ASF Git Repos - flink.git/blob - LICENSE|work=apache.org|access-date=2015-04-12|archive-url=https://web.archive.org/web/20171023175448/https://git1-us-west.apache.org/repos/asf?p=flink.git;a=blob;f=LICENSE;hb=HEAD|archive-date=2017-10-23|url-status=dead}}</ref> by the Apache Flink Community within the [[Apache Software Foundation]]. The project is driven by over 25 committers and over 340 contributors.
-''Ververica'' (formerly Data Artisans), a company that was founded by the original creators of Apache Flink,<ref>{{cite web|url=https://www.ververica.com/about|title=About - Ververica|website=ververica.com|language=en-US|access-date=2020-03-18}}</ref> employs many of the current Apache Flink committers.<ref>{{cite web|url=http://flink.apache.org/community.html#people|title=Apache Flink: Community & Project Info|website=flink.apache.org|language=en|access-date=2017-02-23}}</ref>
 == Overview ==
