Apache Tika: Difference between revisions

Tika
Developer(s)	Apache Software Foundation
Stable release	2.9.1 / 20 October 2023; 13 months ago
Repository	Tika Repository
Written in	Java
Operating system	Cross-platform
Type	Search and index API
License	Apache License 2.0
Website	tika.apache.org

Latest revision as of 09:30, 1 August 2024

Apache Tika is a content detection and analysis framework, written in Java and stewarded by the Apache Software Foundation.^[1] It detects and extracts metadata and text from over a thousand different file types. In addition to a Java library, it provides server and command-line editions suitable for use from other programming languages.

History

The project originated as part of the Apache Nutch codebase, where it provided content identification and extraction during crawling. In 2007, it was separated out to make it more extensible and usable by content management systems, other Web crawlers, and information retrieval systems. The standalone Tika was founded by Jérôme Charron, Chris Mattmann and Jukka Zitting.^[2] In 2011, Chris Mattmann and Jukka Zitting published the book "Tika in Action" with Manning, and the project released version 1.0.

Features

Tika can identify more than 1,400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types. For most of the more common and popular formats,^[3] Tika also provides content extraction, metadata extraction and language identification.
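
The idea of identifying a file type from its content can be illustrated with a minimal Python sketch that matches leading "magic" bytes against a small signature table. This is not Tika's implementation: the table, the function name `detect_media_type` and the fallback type are assumptions chosen for illustration, and a real detector covers far more formats and uses additional hints.

```python
# Illustrative only: a tiny hand-written signature table,
# not Tika's actual detection rules.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"PK\x03\x04", "application/zip"),
]

# Generic fallback used when no signature matches.
DEFAULT_TYPE = "application/octet-stream"


def detect_media_type(data: bytes) -> str:
    """Return the media type whose signature matches the start of `data`."""
    for magic, media_type in SIGNATURES:
        if data.startswith(magic):
            return media_type
    return DEFAULT_TYPE
```

A prefix match alone cannot tell a plain ZIP archive from a ZIP-based office document, which is one reason practical detectors also look at file names and at the contents of containers.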

It can also extract text from images by using the OCR software Tesseract.^[4]

While Tika is written in Java, it is widely used from other languages.^[5] The RESTful server and the command-line tool give non-Java programs access to Tika's functionality.
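
As a rough sketch of access from a non-Java program, the Python code below builds HTTP requests for a Tika server using only the standard library. It assumes a server is already running at `http://localhost:9998` (the server's usual default port) and that the `/detect/stream`, `/tika` and `/meta` endpoints behave as described in the Tika server documentation. The helper names are hypothetical and are not part of any official Tika client.

```python
import urllib.request

# Assumption: a Tika server is listening locally on port 9998.
TIKA_SERVER = "http://localhost:9998"


def build_tika_request(endpoint: str, payload: bytes, accept: str) -> urllib.request.Request:
    """Build (but do not send) a PUT request carrying a document."""
    return urllib.request.Request(
        url=f"{TIKA_SERVER}/{endpoint}",
        data=payload,
        method="PUT",
        headers={
            "Accept": accept,
            # Generic type so the server decides the format from the content.
            "Content-Type": "application/octet-stream",
        },
    )


def detect_request(payload: bytes) -> urllib.request.Request:
    """Request asking the server for the document's media type."""
    return build_tika_request("detect/stream", payload, "text/plain")


def text_request(payload: bytes) -> urllib.request.Request:
    """Request asking the server for the document's plain text."""
    return build_tika_request("tika", payload, "text/plain")


def metadata_request(payload: bytes) -> urllib.request.Request:
    """Request asking the server for the document's metadata as JSON."""
    return build_tika_request("meta", payload, "application/json")


def send(request: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send a prepared request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With a server running, `send(text_request(data))` would return the extracted text for the bytes in `data`. Building the request separately from sending it keeps the request-building functions usable without network access.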

Notable uses

Tika is used by financial institutions including the Fair Isaac Corporation (FICO)^[6] and Goldman Sachs,^[7] by NASA and academic researchers,^[8] and by major content management systems including Drupal^[9] and Alfresco^[10] to analyze large amounts of content and to make it available in common formats using information retrieval techniques.

On April 4, 2016,^[11] Forbes published an article identifying Tika as one of the key technologies used by more than 400 journalists to analyze 11.5 million leaked documents that exposed an international scandal involving world leaders storing money in offshore shell corporations. The leaked documents and the project to analyze them are referred to as the Panama Papers.

References

[1] "Apache Tika". Retrieved 2016-04-15.
[2] "Tika Proposal". Retrieved 2016-04-15.
[3] "The Apache Software Foundation". Apache Tika formats page. Retrieved 2016-04-16.
[4] "TikaOCR". Apache Tika. 2019-03-26. Retrieved 2019-12-02.
[5] "API Bindings for Tika". Apache Tika. Retrieved 2016-04-17.
[6] "FICO to Engage Kaggle's Community of 180,000 Data Scientists to Drive Innovation in the FICO Analytic Cloud | FICO". FICO | Decisions. Archived from the original on 2016-06-03. Retrieved 2016-04-15.
[7] "Goldman Sachs Puts Elasticsearch To Work - InformationWeek". InformationWeek. Retrieved 2017-06-21.
[8] "Studying polar data with the help of Apache Tika". Opensource.com. Retrieved 2016-04-15.
[9] "Text Extract for Drupal using Tika | Drupal.org". www.drupal.org. 2012-07-30. Retrieved 2016-04-15.
[10] "Content Transformation and Metadata Extraction with Apache Tika - alfrescowiki". wiki.alfresco.com. 2015-06-05. Retrieved 2016-04-15.
[11] Fox-Brewster, Thomas. "From Encrypted Drives To Amazon's Cloud -- The Amazing Flight Of The Panama Papers". Forbes. Retrieved 2016-04-15.


@@ Line 1: / Line 1: @@
+{{Short description|Open-source content analysis framework}}
 {{Infobox software
 | name = Tika
-| logo = [[File:Apache-Tika.png|130px|Tika logo]]
+| logo = Apache Tika Logo.svg
 | screenshot =
 | caption =
 | developer = [[Apache Software Foundation]]
-| latest release version = 1.14
+| latest release version = {{wikidata|property|edit|reference|P548=Q2804309|P348}}
-| latest release date = {{release date and age|2016|10|18}}
+| latest release date = {{start date and age|{{wikidata|qualifier|P548=Q2804309|P348|P577}}}}
 | latest preview version =
 | latest preview date =
+| repo = {{URL|https://gitbox.apache.org/repos/asf?p{{=}}tika.git|Tika Repository}}
-| status = Active
 | programming language = [[Java (programming language)|Java]]
 | operating system = [[Cross-platform]]
 | platform =
 | genre = [[Search algorithm|Search]] and [[index (search engine)|index]] [[Application programming interface|API]]
-| license = [[Apache License]] 2.0
+| license = [[Apache License 2.0]]
 | website = {{URL|http://tika.apache.org/}}
 }}
 '''Apache Tika''' is a content detection and [[content analysis|analysis]] framework, written in [[Java (programming language)|Java]], stewarded at the [[Apache Software Foundation]].<ref>{{Cite web|url=http://tika.apache.org/|title=Apache Tika|access-date=2016-04-15}}</ref> It detects and extracts metadata and text from over a thousand different [[file type]]s, and as well as providing a
 [[Java (programming language)|Java]] library, has server and command-line editions suitable for use from other programming languages.
@@ Line 25: / Line 25: @@
 == Features ==
-Tika provides capabilities for identification of more than 1400 file types from the [[Internet Assigned Numbers Authority]] taxonomy of [[MIME]] types. For most of the more common and popular formats,<ref>{{cite web|url=http://tika.apache.org/1.12/formats.html| title= The Apache Software Foundation| website=Apache Tika formats page|accessdate=16 April 2016}}</ref> Tika then provides content extraction, metadata extraction and language identification capabilities.
+Tika provides capabilities for identification of more than 1400 file types from the [[Internet Assigned Numbers Authority]] taxonomy of [[Media type|MIME types]]. For most of the more common and popular formats,<ref>{{cite web|url=http://tika.apache.org/1.12/formats.html| title= The Apache Software Foundation| website=Apache Tika formats page|access-date=16 April 2016}}</ref> Tika then provides content extraction, metadata extraction and language identification capabilities.
+It can also get text from images by using the [[Optical character recognition|OCR]] software [[Tesseract (software)|Tesseract]].<ref>{{Cite web|url=https://cwiki.apache.org/confluence/display/tika/TikaOCR|date=2019-03-26|publisher=Apache Tika|title=TikaOCR|access-date=2019-12-02}}</ref>
-While Tika is written in [[Java (programming language)|Java]], it is widely used from other languages.<ref>{{Cite web|url=https://wiki.apache.org/tika/API%20Bindings%20for%20Tika|title=API Bindings for Tika|last=|first=|date=|website=|publisher=Apache Tika|access-date=2016-04-17}}</ref> The [[Representational state transfer|RESTful]] server and [[Command-line interface|CLI Tool]] permit non-Java programs to access the Tika functionality.
+While Tika is written in [[Java (programming language)|Java]], it is widely used from other languages.<ref>{{Cite web|url=https://wiki.apache.org/tika/API%20Bindings%20for%20Tika|title=API Bindings for Tika|publisher=Apache Tika|access-date=2016-04-17}}</ref> The [[Representational state transfer|RESTful]] server and [[Command-line interface|CLI Tool]] permit non-Java programs to access the Tika functionality.
 == Notable uses ==
-Tika is used by financial institutions including the [[Fair Isaac Corporation]] (FICO),<ref>{{Cite web|url=http://www.fico.com/en/newsroom/fico-to-engage-kaggles-community-of-180000-data-scientists-to-drive-innovation-in-the-fico-analytic-cloud|title=FICO to Engage Kaggle's Community of 180,000 Data Scientists to Drive Innovation in the FICO Analytic Cloud {{!}} FICO®|website=FICO® {{!}} Decisions|access-date=2016-04-15}}</ref> Goldman Sachs<ref>{{Cite news|url=http://www.informationweek.com/software/enterprise-applications/goldman-sachs-puts-elasticsearch-to-work/d/d-id/1321778|title=Goldman Sachs Puts Elasticsearch To Work - InformationWeek|work=InformationWeek|access-date=2017-06-21|language=en}}</ref>, [[NASA]] and academic researchers<ref>{{Cite web|url=https://opensource.com/life/15/4/interview-annie-burgess-USC-JPL|title=Studying polar data with the help of Apache Tika|website=Opensource.com|access-date=2016-04-15}}</ref> and by major content management systems including [[Drupal]],<ref>{{Cite web|url=https://www.drupal.org/project/text_extract|title=Text Extract for Drupal using Tika {{!}} Drupal.org|website=www.drupal.org|access-date=2016-04-15}}</ref> and [[Alfresco (software)]]<ref>{{Cite web|url=https://wiki.alfresco.com/wiki/Content_Transformation_and_Metadata_Extraction_with_Apache_Tika|title=Content Transformation and Metadata Extraction with Apache Tika - alfrescowiki|website=wiki.alfresco.com|access-date=2016-04-15}}</ref> to analyze large amounts of content, and to make it available in common formats using information retrieval techniques.
+Tika is used by financial institutions including the [[Fair Isaac Corporation]] (FICO),<ref>{{Cite web|url=http://www.fico.com/en/newsroom/fico-to-engage-kaggles-community-of-180000-data-scientists-to-drive-innovation-in-the-fico-analytic-cloud|title=FICO to Engage Kaggle's Community of 180,000 Data Scientists to Drive Innovation in the FICO Analytic Cloud {{!}} FICO|website=FICO {{!}} Decisions|access-date=2016-04-15|archive-url=https://web.archive.org/web/20160603111240/http://www.fico.com/en/newsroom/fico-to-engage-kaggles-community-of-180000-data-scientists-to-drive-innovation-in-the-fico-analytic-cloud|archive-date=2016-06-03|url-status=dead}}</ref> Goldman Sachs,<ref>{{Cite news|url=http://www.informationweek.com/software/enterprise-applications/goldman-sachs-puts-elasticsearch-to-work/d/d-id/1321778|title=Goldman Sachs Puts Elasticsearch To Work - InformationWeek|work=InformationWeek|access-date=2017-06-21|language=en}}</ref> [[NASA]] and academic researchers<ref>{{Cite web|url=https://opensource.com/life/15/4/interview-annie-burgess-USC-JPL|title=Studying polar data with the help of Apache Tika|website=Opensource.com|access-date=2016-04-15}}</ref> and by major content management systems including [[Drupal]],<ref>{{Cite web|url=https://www.drupal.org/project/text_extract|title=Text Extract for Drupal using Tika {{!}} Drupal.org|website=www.drupal.org|date=30 July 2012 |access-date=2016-04-15}}</ref> and [[Alfresco (software)]]<ref>{{Cite web|url=https://wiki.alfresco.com/wiki/Content_Transformation_and_Metadata_Extraction_with_Apache_Tika|title=Content Transformation and Metadata Extraction with Apache Tika - alfrescowiki|website=wiki.alfresco.com|date=5 June 2015 |access-date=2016-04-15}}</ref> to analyze large amounts of content, and to make it available in common formats using information retrieval techniques.
-On April 4, 2016<ref>{{Cite web|url=http://www.forbes.com/sites/thomasbrewster/2016/04/05/panama-papers-amazon-encryption-epic-leak|title=From Encrypted Drives To Amazon's Cloud -- The Amazing Flight Of The Panama Papers|last=Fox-Brewster|first=Thomas|website=Forbes|access-date=2016-04-15}}</ref> [[Forbes]] published an article identifying Tika as one of the key technologies used by more than 400 journalists to analyze 11.5 million leaked documents that expose an international scandal involving world leaders storing money in offshore [[shell corporation]]s. The leaked documents and the project to analyze them is referred to as the [[Panama Papers]].
+On April 4, 2016<ref>{{Cite web|url=https://www.forbes.com/sites/thomasbrewster/2016/04/05/panama-papers-amazon-encryption-epic-leak|title=From Encrypted Drives To Amazon's Cloud -- The Amazing Flight Of The Panama Papers|last=Fox-Brewster|first=Thomas|website=Forbes|access-date=2016-04-15}}</ref> [[Forbes]] published an article identifying Tika as one of the key technologies used by more than 400 journalists to analyze 11.5 million leaked documents that expose an international scandal involving world leaders storing money in offshore [[shell corporation]]s. The leaked documents and the project to analyze them is referred to as the [[Panama Papers]].
 ==See also==
-*[[Magic number (programming)|Magic number]]
+*[[Magic number (programming)#Magic numbers in files|Magic number]]
 ==References==
 {{Reflist}}
-{{Apache}}
+{{Apache Software Foundation}}
-[[Category:Apache Software Foundation|Tika]]
+[[Category:Apache Software Foundation projects|Tika]]
 [[Category:Java platform]]
 [[Category:Free software programmed in Java (programming language)]]
