Caltech 101: Difference between revisions

From Wikipedia, the free encyclopedia
(106 intermediate revisions by 69 users not shown)
{{Short description|Dataset of images}}
'''Caltech 101''' is a [[dataset]] of [[digital images]] created in [[September]], [[2003]], compiled by [[Fei-Fei Li]], [[Marco Andreetto]], and [[Marc 'Aurelio Ranzato]] at the [[California Institute of Technology]]. It is intended to facilitate [[Computer Vision]] [[research]] and techniques. It is most applicable to techniques interested in [[recognition]], [[classification]], and [[categorization]]. Caltech 101 contains a total of 9146 images, split between 101 distinct object (including [[face]]s, [[watches]], [[ants]], [[pianos]], etc.) and a background category (for a total of 102 [[categories]]). Provided with the images are a set of [[annotations]] describing the outlines of each image, along with a [[Matlab]] [[Scripting_language |script]] for viewing.
'''Caltech 101''' is a [[data set]] of [[digital images]] created in September 2003 and compiled by [[Fei-Fei Li]], Marco Andreetto, Marc 'Aurelio Ranzato and [[Pietro Perona]] at the [[California Institute of Technology]]. It is intended to facilitate [[computer vision]] research and techniques and is most applicable to techniques involving [[image recognition]], classification and categorization. Caltech 101 contains a total of 9,146 images, split between 101 distinct object categories ([[face]]s, [[watches]], [[ants]], [[pianos]], etc.) and a background category. Provided with the images is a set of [[annotations]] describing the outline of the object in each image, along with a [[Matlab]] [[Scripting language|script]] for viewing.


==Purpose==
==Purpose==
Most computer vision and [[machine learning]] algorithms function by training on example inputs. They require a large and varied set of training data to work effectively. For example, the real-time [[face detection]] method used by Paul Viola and Michael J. Jones was trained on 4,916 hand-labeled faces.<ref name="Viola Jones">{{Cite journal|doi=10.1023/B:VISI.0000013087.49260.fb|title=Robust Real-Time Face Detection|year=2004|last1=Viola|first1=Paul|last2=Jones|first2=Michael J.|journal=International Journal of Computer Vision|volume=57|issue=2|pages=137–154|s2cid=2796017}}</ref>
Most Computer Vision and [[Machine Learning]] algorithms function by training on a large set of example inputs.
To work effectively, most of these techniques require a large and varied set of training data. For example, the relatively well known real time face detection method used by [[Paul Viola]] and [[Micheal J. Jones]] was trained on 4916 hand labeled faces <ref name="violajones"> P. Viola and M. J. Jones, Robust Real-Time Object Detection, , IJCV 2004</ref>.
However, acquiring a large volume of appropriate and usable images is often difficult. Furthermore, cropping and resizing many images, as well as marking point of interest by hand, is a tedious and time intensive task.


Cropping, re-sizing and hand-marking points of interest is tedious and time-consuming.
Historically, most datasets used in computer vision research have been tailored to the specific needs of the project being worked on.
[[Image:Caltech101vs256.gif|thumb | Caltech 101 vs Caltech 256 on same algorithms]]
A large problem in comparing different computer vision techniques is the fact that most groups are using their own datasets. Each of these datasets may have different properties that make reported results from different methods harder to compare directly. For example, differences in image size, image quality, relative location of objects within the images, and level of occlusion and clutter present can lead to varying results.


Historically, most data sets used in computer vision research have been tailored to the specific needs of the project being worked on.<!-- Missing image removed: [[Image:Caltech101vs256.gif|thumb | Caltech 101 vs Caltech 256 on same algorithms]] --> A large problem in comparing [[computer vision]] techniques is the fact that most groups use their own data sets. Each set may have different properties that make reported results from different methods harder to compare directly. For example, differences in image size, image quality, relative location of objects within the images and level of occlusion and clutter present can lead to varying results.<ref name="oertel">{{Cite book|doi=10.1109/AIPR.2008.4906457|chapter=Current challenges in automating visual perception|title=2008 37th IEEE Applied Imagery Pattern Recognition Workshop|year=2008|last1=Oertel|first1=Carsten|last2=Colder|first2=Brian|last3=Colombe|first3=Jeffrey|last4=High|first4=Julia|last5=Ingram|first5=Michael|last6=Sallee|first6=Phil|pages=1–8|isbn=978-1-4244-3125-0|s2cid=36669995}}</ref>


The Caltech 101 dataset aims to alleviate many of these common problems.
The Caltech 101 data set aims at alleviating many of these common problems.
*The images are cropped and re-sized.
*The work of collecting a large set of images, and cropping and resizing them appropriately has been taken care of.
*A large number of different categories are represented, which benefits both single, and multi class recognition algorithms.
*Many categories are represented, which suits both single and multiple class recognition algorithms.
*Detailed object outlines have been marked for each image.
*Detailed object outlines are marked.
*By being released for general use, the Caltech 101 acts as a common standard by which to compare different algorithms without bias due to different datasets.
*Available for general use, Caltech 101 acts as a common standard by which to compare different algorithms without bias due to different data sets.


However, a follow-up study demonstrated that tests based on uncontrolled natural images (like the Caltech 101 data set) can be seriously misleading, potentially guiding progress in the wrong direction.<ref name="pinto_et_al_2008">{{Cite journal|doi=10.1371/journal.pcbi.0040027|title=Why is Real-World Visual Object Recognition Hard?|year=2008|last1=Pinto|first1=Nicolas|last2=Cox|first2=David D.|last3=Dicarlo|first3=James J.|journal=PLOS Computational Biology|volume=4|issue=1|pages=e27|pmid=18225950|pmc=2211529|bibcode=2008PLSCB...4...27P |doi-access=free }}</ref>


==Data set==
However, a recent study <ref name="pinto_et_al_2008">[http://compbiol.plosjournals.org/perlserv/?request=get-document&doi=10.1371/journal.pcbi.0040027 | Why is Real-World Visual Object Recognition Hard? Pinto N, Cox DD, DiCarlo JJ PLoS Computational Biology Vol. 4, No. 1, e27 doi:10.1371/journal.pcbi.0040027]</ref> demonstrates that tests based on uncontrolled natural images like the Caltech 101 dataset can be seriously misleading, potentially guiding progress in the wrong direction.


==The Dataset==
===Images===
===Images===
The Caltech 101 data set consists of a total of 9,146 images, split between 101 different object categories, as well as an additional background/clutter category.
[[Image:Caltech101.gif | thumb| right | Caltech 101 images]]
The Caltech 101 dataset consists of a total of 9146 images, split between 101 different object categories, as well as an additional background/clutter category.

Each object category contains between 40 and 800 images on average. Common and popular categories such as faces tend to have a larger number of images than less used categories.
Each image is about 300x200 pixels in dimension.
Images of oriented objects such as [[airplanes]] and [[motorcycles]] were mirrored to be left-right aligned, and vertically oriented structures such as buildings were rotated to be off axis.









Each object category contains between 40 and 800 images. Common and popular categories such as faces tend to have a larger number of images than others.


Each image is about 300x200 pixels. Images of oriented objects such as [[airplanes]] and [[motorcycles]] were mirrored to be left-right aligned, and vertically oriented structures such as buildings were rotated to be off axis.


===Annotations===
===Annotations===
A set of annotations is provided for each image. Each set of annotations contains two pieces of information: the general bounding box in which the object is located and a detailed human-specified outline enclosing the object.


A Matlab script is provided with the annotations. It loads an image and its corresponding annotation file and displays them as a Matlab figure.
As a supplement to the images, a set of annotations are provided for each image. Each set of annotations contains two pieces of information.

The general bounding box in which the object is located, and a detailed human specified outline enclosing the object.
A Matlab script is provided along with the annotations that will load an image and its corresponding annotation file and display them as a Matlab figure.

[[Image:Caltech101_croc_annotated.jpg | Crocodile image with annotations.]]

The bounding box is yellow and the outline is red.


==Uses==
==Uses==
The Caltech 101 data set was used to train and test several computer vision recognition and classification algorithms. The first paper to use Caltech 101 was an incremental [[Bayesian inference|Bayesian]] approach to one-shot learning,<ref name="OneShot">[http://www.vision.caltech.edu/feifeili/Fei-Fei_GMBV04.pdf L. Fei-Fei, R. Fergus and P. Perona. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. IEEE. CVPR 2004, Workshop on Generative-Model Based Vision. 2004]</ref> an attempt to classify an object using only a few examples, by building on prior knowledge of other classes.


The Caltech 101 images, along with the annotations, were used for another one-shot learning paper at Caltech.<ref name="OneShot2">{{Cite journal |url=http://vision.cs.princeton.edu/documents/Fei-FeiFergusPerona2006.pdf |author=L. Fei-Fei |author2=R. Fergus |author3=P. Perona |title=One-Shot learning of object categories |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |volume=28 |issue=4 |pages=594–611 |date=April 2006 |doi=10.1109/TPAMI.2006.79 |pmid=16566508 |s2cid=6953475 |access-date=2008-01-16 |archive-url=https://web.archive.org/web/20070609194212/http://vision.cs.princeton.edu/documents/Fei-FeiFergusPerona2006.pdf |archive-date=2007-06-09}}</ref>
The Caltech 101 dataset has been used to train and test several Computer Vision recognition and classification algorithms.
The first paper to make use of Caltech 101 was an incremental Bayesian approach to [[one shot learning]] <ref name="OneShot">[http://www.vision.caltech.edu/feifeili/Fei-Fei_GMBV04.pdf |L. Fei-Fei, R. Fergus and P. Perona. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. IEEE. CVPR 2004, Workshop on Generative-Model Based Vision. 2004]</ref>. One shot learning is an attempt to learn a class of object using only a few examples, by building off of prior knowledge of many other classes.


Other Computer Vision papers that report using the Caltech 101 data set include:
The Caltech 101 images, along with the annotations, were used for another one shot learning paper at Caltech.
*Shape Matching and Object Recognition using Low Distortion Correspondence. Alexander C. Berg, Tamara L. Berg, [[Jitendra Malik]]. [[CVPR]] 2005
*The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features. K. Grauman and T. Darrell. International Conference on Computer Vision (ICCV), 2005<ref>[http://www.vision.caltech.edu/Image_Datasets/Caltech101/grauman_darrell_iccv05.pdf The Pyramid Match Kernel:Discriminative Classification with Sets of Image Features. K. Grauman and T. Darrell. International Conference on Computer Vision (ICCV), 2005]</ref>
*Combining Generative Models and Fisher Kernels for Object Class Recognition. Holub, AD. Welling, M. Perona, P. International Conference on Computer Vision (ICCV), 2005<ref>{{Cite conference |url=http://www.its.caltech.edu/%7Eholub/publications.htm |title=Combining Generative Models and Fisher Kernels for Object Class Recognition |author=Holub, AD |author2=Welling, M |author3=Perona, P. |conference=International Conference on Computer Vision (ICCV), 2005 |access-date=2008-01-16 |archive-url=https://web.archive.org/web/20070814004226/http://www.its.caltech.edu/%7Eholub/publications.htm |archive-date=2007-08-14}}</ref>
*Object Recognition with Features Inspired by Visual Cortex. T. Serre, L. Wolf and T. Poggio. Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), IEEE Computer Society Press, San Diego, June 2005<ref>[http://web.mit.edu/serre/www/publications/serre_etal-CVPR05.pdf Object Recognition with Features Inspired by Visual Cortex. T. Serre, L. Wolf and T. Poggio. Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), IEEE Computer Society Press, San Diego, June 2005]</ref>
*SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition. Hao Zhang, Alex Berg, Michael Maire, [[Jitendra Malik]]. CVPR, 2006<ref>[http://www.vision.caltech.edu/Image_Datasets/Caltech101/nhz_cvpr06.pdf SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition. Hao Zhang, Alex Berg, Michael Maire, Jitendra Malik. CVPR, 2006]</ref>
*Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. [[Svetlana Lazebnik]], [[Cordelia Schmid]], and Jean Ponce. CVPR, 2006<ref>[http://www.vision.caltech.edu/Image_Datasets/Caltech101/cvpr06b_lana.pdf Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories]. [[Svetlana Lazebnik]], [[Cordelia Schmid]], and Jean Ponce. CVPR, 2006</ref>
*Empirical Study of Multi-Scale Filter Banks for Object Categorization. M.J. Marín-Jiménez, and N. Pérez de la Blanca. December 2005<ref>[http://www.vision.caltech.edu/Image_Datasets/Caltech101/mjmarinVIP121505.pdf Empirical study of multi-scale filter banks for object categorization, M.J. Marín-Jiménez, and N. Pérez de la Blanca. December 2005]</ref>
*Multiclass Object Recognition with Sparse, Localized Features. Jim Mutch and David G. Lowe., pp. 11–18, CVPR 2006, IEEE Computer Society Press, New York, June 2006<ref>[https://www.mit.edu/~jmutch/papers/cvpr2006_mutch_lowe.pdf Multiclass Object Recognition with Sparse, Localized Features, Jim Mutch and David G. Lowe. , pp. 11–18, CVPR 2006, IEEE Computer Society Press, New York, June 2006]</ref>
*Using Dependent Regions for Object Categorization in a Generative Framework. G. Wang, Y. Zhang, and L. Fei-Fei. IEEE Comp. Vis. Patt. Recog. 2006<ref>{{Cite journal |url=http://vision.cs.princeton.edu/documents/WangZhangFei-Fei_CVPR2006.pdf |title=Using Dependent Regions for Object Categorization in a Generative Framework |author=G. Wang |author2=Y. Zhang |author3=L. Fei-Fei |journal=IEEE Comp. Vis. Patt. Recog. |date=2006 |access-date=2008-01-16 |archive-url=https://web.archive.org/web/20070609194157/https://vision.cs.princeton.edu/documents/WangZhangFei-Fei_CVPR2006.pdf |archive-date=2007-06-09}}</ref>


==Analysis and comparison==
L. Fei-Fei, R. Fergus and P. Perona. One-Shot learning of object categories <ref name="OneShot2"> [http://vision.cs.princeton.edu/documents/Fei-FeiFergusPerona2006.pdf | L. Fei-Fei, R. Fergus and P. Perona. One-Shot learning of object categories. IEEE Trans. Pattern Analysis and Machine Intelligence, Vol28(4), 594 - 611, 2006.]</ref>


Other Computer Vision papers that report using the Caltech 101 dataset:
*Shape Matching and Object Recognition using Low Distortion Correspondence. Alexander C. Berg, Tamara L. Berg, Jitendra Malik. CVPR 2005
*The Pyramid Match Kernel:Discriminative Classification with Sets of Image Features. K. Grauman and T. Darrell. International Conference on Computer Vision (ICCV), 2005 <ref>[http://www.vision.caltech.edu/Image_Datasets/Caltech101/grauman_darrell_iccv05.pdf | The Pyramid Match Kernel:Discriminative Classification with Sets of Image Features. K. Grauman and T. Darrell. International Conference on Computer Vision (ICCV), 2005]</ref>
*Combining Generative Models and Fisher Kernels for Object Class Recognition Holub, AD. Welling, M. Perona, P. International Conference on Computer Vision (ICCV), 2005 <ref>[http://www.its.caltech.edu/%7Eholub/publications.htm | Combining Generative Models and Fisher Kernels for Object Class Recognition Holub, AD. Welling, M. Perona, P. International Conference on Computer Vision (ICCV), 2005]</ref>
*Object Recognition with Features Inspired by Visual Cortex. T. Serre, L. Wolf and T. Poggio. Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), IEEE Computer Society Press, San Diego, June 2005.<ref>[http://web.mit.edu/serre/www/publications/serre_etal-CVPR05.pdf | Object Recognition with Features Inspired by Visual Cortex. T. Serre, L. Wolf and T. Poggio. Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), IEEE Computer Society Press, San Diego, June 2005</ref>
*SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition. Hao Zhang, Alex Berg, Michael Maire, Jitendra Malik. CVPR, 2006<ref>[http://www.vision.caltech.edu/Image_Datasets/Caltech101/nhz_cvpr06.pdf | SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition. Hao Zhang, Alex Berg, Michael Maire, Jitendra Malik. CVPR, 2006]</ref>
*Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. CVPR, 2006<ref>[http://www.vision.caltech.edu/Image_Datasets/Caltech101/cvpr06b_lana.pdf | Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. CVPR, 2006]</ref>
* Empirical study of multi-scale filter banks for object categorization, M.J. Marín-Jiménez, and N. Pérez de la Blanca. December 2005<ref>[http://www.vision.caltech.edu/Image_Datasets/Caltech101/mjmarinVIP121505.pdf | Empirical study of multi-scale filter banks for object categorization, M.J. Marín-Jiménez, and N. Pérez de la Blanca. December 2005]</ref>
*Multiclass Object Recognition with Sparse, Localized Features, Jim Mutch and David G. Lowe. , pg. 11-18, CVPR 2006, IEEE Computer Society Press, New York, June 2006<ref>[http://www.mit.edu/~jmutch/papers/cvpr2006_mutch_lowe.pdf | Multiclass Object Recognition with Sparse, Localized Features, Jim Mutch and David G. Lowe. , pg. 11-18, CVPR 2006, IEEE Computer Society Press, New York, June 2006]</ref>
*Using Dependant Regions or Object Categorization in a Generative Framework, G. Wang, Y. Zhang, and L. Fei-Fei. IEEE Comp. Vis. Patt. Recog. 2006<ref>[http://vision.cs.princeton.edu/documents/WangZhangFei-Fei_CVPR2006.pdf | Using Dependant Regions or Object Categorization in a Generative Framework, G. Wang, Y. Zhang, and L. Fei-Fei. IEEE Comp. Vis. Patt. Recog. 2006]</ref>

==Analysis and Comparison==
===Advantages===
===Advantages===
Caltech 101 has several advantages over other similar datasets:
Caltech 101 has several advantages over other similar data sets:
*Uniform size and presentation.
*Uniform size and presentation:
Almost all the images within each category are uniform in image size and in the relative position of interest objects. This means that, in general, users who wish to use the Caltech 101 dataset do not need to spend any extra time cropping and scaling the images before they can be used.
**Almost all the images within each category are uniform in image size and in the relative position of interest objects. Caltech 101 users generally do not need to crop or scale images before they can be used.
*Low level of clutter/occlusion:
*Low level of clutter/occlusion:
Algorithms concerned with recognition usually function by storing features unique to the object that is to be recognized. However, the majority of images taken have varying degrees of background clutter. Algorithms trained on cluttered images can potentially build incorrect
**Algorithms concerned with recognition usually function by storing features unique to the object. However, most images taken have varying degrees of background clutter, so algorithms trained on cluttered images may build incorrect representations of the object.
*Detailed Annotations:
*Detailed annotations
The detailed annotations of object outlines is another advantage to using the dataset.


===Weaknesses===
===Weaknesses===
There are several weaknesses to the Caltech 101 dataset <ref name="pinto_et_al_2008"/> <ref>[http://www-cvr.ai.uiuc.edu/ponce_grp/publication/paper/sicily06c.pdf | Dataset Issues in Object Recognition. J. Ponce, T. L. Berg, M. Everingham, D. A. Forsyth, M. Hebert, S. Lazebnik, M. Marszalek, C. Schmid, B. C. Russell, A. Torralba, C. K. I. Williams, J. Zhang, and A. Zisserman. Toward Category-Level Object Recognition, Springer-Verlag Lecture Notes in Computer Science. J. Ponce, M. Hebert, C. Schmid, and A. Zisserman (eds.), 2006]</ref>. Some of them are conscious trade-offs for the advantages it provides, and some are simply limitations of the dataset itself.
Some weaknesses of the Caltech 101 data set<ref name="pinto_et_al_2008"/><ref>{{Cite web |url=http://www-cvr.ai.uiuc.edu/ponce_grp/publication/paper/sicily06c.pdf |title=Dataset Issues in Object Recognition |author=J. Ponce |author2=T. L. Berg |author3=M. Everingham |author4=D. A. Forsyth |author5=M. Hebert |author6=[[Svetlana Lazebnik|S. Lazebnik]] |author7=M. Marszalek |author8=C. Schmid |author9=B. C. Russell |author10=A. Torralba |author11=C. K. I. Williams |author12=J. Zhang |author13=A. Zisserman |series=Toward Category-Level Object Recognition, Springer-Verlag Lecture Notes in Computer Science |editor=J. Ponce |editor2=M. Hebert |editor3=C. Schmid |editor4=A. Zisserman |date=2006 |access-date=2008-02-08 |archive-url=https://web.archive.org/web/20161224094302/http://www-cvr.ai.uiuc.edu/ponce_grp/publication/paper/sicily06c.pdf |archive-date=2016-12-24}}</ref> are conscious trade-offs, while others are limitations of the data set itself. Papers that rely solely on Caltech 101 are frequently rejected.

Weaknesses include:
*The data set is too clean:
**Images are very uniform in presentation, aligned from left to right, and usually not occluded. As a result, the images are not always representative of practical inputs that the algorithm might later expect to see. Under practical conditions, images are more cluttered, occluded and display greater variance in relative position and orientation of interest objects. The uniformity allows concepts to be derived using the average of a category, which is unrealistic.
*Limited number of categories:
*Limited number of categories:
There are approximately 10,000 different categories of objects. The Caltech 101 dataset represents only a small fraction of these.
**The Caltech 101 data set represents only a small fraction of possible object categories.
*Some categories contain few images:
*Some categories contain few images:
Certain categories are not represented as well as others, containing as few as 31 images.
**Certain categories are not represented as well as others, containing as few as 31 images.
This means that <math>\mathrm{N}_{\mathrm{train}} \le 30</math>. The number of images used for training must be less than or equal to 30, which is not sufficient for all purposes.
**This means that <math>\mathrm{N}_{\mathrm{train}} \le 30</math>. The number of images used for training must be less than or equal to 30, which is not sufficient for all purposes.
*Can be too easy:
Images are very uniform in presentation, left right aligned, and usually not occluded. As a result, the images are not always representative of practical inputs that the algorithm being trained might be expected to see. Under practical conditions, there is usually more clutter, occlusion, and variance in relative position and orientation of interest objects.
*Aliasing and artifacts due to manipulation:
*Aliasing and artifacts due to manipulation:
Some images have been rotated and scaled from their original orientation, and suffer from some amount of [[artifacts]] or [[aliasing]].
**Some images have been rotated and scaled from their original orientation, and suffer from some amount of [[Compression artifact|artifacts]] or [[aliasing]].
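
The training-size limit noted in the list above can be illustrated with a short sketch. It assumes that at least one image per category is held out for testing, which is one way to read why a smallest category of 31 images gives <math>\mathrm{N}_{\mathrm{train}} \le 30</math>. The function names, the <code>min_test</code> parameter and the dictionary layout are hypothetical and are not part of the Caltech 101 distribution.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical layout: category name -> list of image file names.
ImagesByCategory = Dict[str, List[str]]


def max_train_per_category(images: ImagesByCategory, min_test: int = 1) -> int:
    """Largest per-category training size that still leaves `min_test`
    images for testing in the smallest category (an assumed rule)."""
    smallest = min(len(files) for files in images.values())
    return smallest - min_test


def split_per_category(
    images: ImagesByCategory, n_train: int, seed: int = 0
) -> Tuple[ImagesByCategory, ImagesByCategory]:
    """Randomly pick `n_train` training images per category; the rest are test images."""
    limit = max_train_per_category(images)
    if n_train > limit:
        raise ValueError(f"n_train={n_train} exceeds the limit of {limit}")
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train: ImagesByCategory = {}
    test: ImagesByCategory = {}
    for category, files in images.items():
        shuffled = list(files)
        rng.shuffle(shuffled)
        train[category] = shuffled[:n_train]
        test[category] = shuffled[n_train:]
    return train, test


if __name__ == "__main__":
    # Toy data only: one category with 31 images and one with 800.
    toy = {
        "small": [f"small_{i}.jpg" for i in range(31)],
        "large": [f"large_{i}.jpg" for i in range(800)],
    }
    print(max_train_per_category(toy))
```

Using the same number of training images for every category keeps the comparison between categories even, at the cost of leaving most images of the large categories unused for training.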


===Other Datasets===
===Other data sets===
*[[Caltech 256]] is another image dataset created at the California Institute of technology in [[2007]], a successor to Caltech 101. It is intended to address some of the weaknesses inherent to Caltech 101. Overall, it is a more difficult dataset than Caltech 101 (but it suffers from the same problems <ref name="pinto_et_al_2008"/>)
*[[Caltech 256]] is another image data set, created in 2007. It is a successor to Caltech 101. It is intended to address some of the weaknesses of Caltech 101. Overall, it is a more difficult data set than Caltech 101, but it suffers from comparable problems. It includes<ref name="pinto_et_al_2008"/>
**30,607 images, covering a larger number of categories.
**30,607 images, covering a larger number of categories
**Minimum number of image per category raised to 80.
**Minimum number of images per category raised to 80
**Images not left-right aligned.
**Images are not left-right aligned
**More variation in image presentation.
**More variation in image presentation
*[[LabelMe]] is an open, dynamic data set created at [[MIT Computer Science and Artificial Intelligence Laboratory]] (CSAIL). LabelMe takes a different approach to the problem of creating a large image data set, with different trade-offs.

*[[LabelMe]] is an open, dynamic dataset created at [[MIT Computer Science and Artificial Intelligence Laboratory]] (CSAIL). LabelMe takes a different approach to the problem of creating a large image dataset, with different trade-offs.
**106,739 images, 41,724 annotated images, and 203,363 labeled objects.
**106,739 images, 41,724 annotated images, and 203,363 labeled objects.
**Users may add images to the dataset by upload, and add labels or annotations to existing images.
**Users may add images to the data set by upload, and add labels or annotations to existing images.
**Due to its open nature, LabelMe has many more images covering a much wider scope than Caltech 101. However, since each person decides what images to upload, and how to label and annotate each image, there can be a lack of consistency between images.
**Due to its open nature, LabelMe has many more images covering a much wider scope than Caltech 101. However, since each person decides what images to upload, and how to label and annotate each image, the images are less consistent.
*VOC 2008 is a European effort to collect images for benchmarking visual categorization methods. Compared to Caltech 101/256, a smaller number of categories (about 20) are collected. The number of images in each category, however, is larger.
*[[Overhead Imagery Research Data Set]] (OIRDS) is an annotated library of imagery and tools.<ref name="OIRDSVehicles">F. Tanner, B. Colder, C. Pullen, D. Heagy, C. Oertel, & P. Sallee, ''Overhead Imagery Research Data Set (OIRDS) – an annotated data library and tools to aid in the development of computer vision algorithms'', June 2009, <http://sourceforge.net/apps/mediawiki/oirds/index.php?title=Documentation {{Webarchive|url=https://web.archive.org/web/20121109142328/http://sourceforge.net/apps/mediawiki/oirds/index.php?title=Documentation# |date=2012-11-09 }}> (28 December 2009)</ref> OIRDS v1.0 is composed of passenger vehicle objects annotated in overhead imagery. Passenger vehicles in the OIRDS include cars, trucks, vans, etc. In addition to the object outlines, the OIRDS includes subjective and objective statistics that quantify the vehicle within the image's context. For example, subjective measures of image clutter, clarity, noise, and vehicle color are included along with more objective statistics such as [[ground sample distance]] (GSD), time of day, and day of year.
** ~900 images, containing ~1800 annotated images
** ~30 annotations per object
** ~60 statistical measures per object
** Wide variation in object context
** Limited to passenger vehicles in overhead imagery
*MICC-Flickr 101 is an image data set created at the Media Integration and Communication Center (MICC), [[University of Florence]], in 2012. It is based on Caltech 101 and is collected from [[Flickr]]. MICC-Flickr 101<ref name="ballan_et_al_2012">{{Cite web |url=http://www.micc.unifi.it/publications/2012/BBDSSZ12/miccflickr101.pdf |title=L. Ballan, M. Bertini, A. Del Bimbo, A.M. Serain, G. Serra, B.F. Zaccone. Combining Generative and Discriminative Models for Classifying Social Images from 101 Object Categories. Int. Conference on Pattern Recognition (ICPR), 2012. |access-date=2012-07-11 |archive-url=https://web.archive.org/web/20140826113958/http://www.micc.unifi.it/publications/2012/BBDSSZ12/miccflickr101.pdf |archive-date=2014-08-26 |url-status=dead }}</ref> corrects the main drawback of Caltech 101, i.e. its low inter-class variability and provides social annotations through user tags. It builds on a standard and widely used data set composed of a manageable number of categories (101) and therefore can be used to compare object categorization performance in a constrained scenario (Caltech 101) and object categorization "in the wild" (MICC-Flickr 101) on the same 101 categories.

==See also==
* [[List of datasets for machine learning research]]
* [[MNIST database]]
* [[LabelMe]]


==References==
==References==
{{reflist}}
{{reflist}}


==External Links==
==External links==
* http://www.vision.caltech.edu/Image_Datasets/Caltech101/ -Caltech 101 Homepage (Includes download)
* http://www.vision.caltech.edu/Image_Datasets/Caltech101/ {{Webarchive|url=https://web.archive.org/web/20131206164923/http://www.vision.caltech.edu/Image_Datasets/Caltech101/ |date=2013-12-06 }} – Caltech 101 Homepage (Includes download)
* http://www.vision.caltech.edu/Image_Datasets/Caltech256/ -Caltech 256 Homepage (Includes download)
* http://www.vision.caltech.edu/Image_Datasets/Caltech256/ Caltech 256 Homepage (Includes download)
* http://labelme.csail.mit.edu/ -LabelMe Homepage
* http://labelme.csail.mit.edu/ LabelMe Homepage
* http://www2.it.lut.fi/project/visiq/ – Randomized Caltech 101 download page (Includes download)
==See Also==
* http://www.micc.unifi.it/vim/datasets/micc-flickr-101/ – MICC-Flickr101 Homepage (Includes download)
*[[LabelMe]]

*[[Caltech 256]]
[[Category:California Institute of Technology]]
*[[Computer Vision]]
[[Category:Datasets in computer vision]]
*[[Machine Learning]]

Latest revision as of 09:58, 14 April 2024

Caltech 101 is a data set of digital images created in September 2003 and compiled by Fei-Fei Li, Marco Andreetto, Marc 'Aurelio Ranzato and Pietro Perona at the California Institute of Technology. It is intended to facilitate computer vision research and techniques and is most applicable to techniques involving image recognition, classification and categorization. Caltech 101 contains a total of 9,146 images, split between 101 distinct object categories (faces, watches, ants, pianos, etc.) and a background category. Provided with the images is a set of annotations describing the outline of the object in each image, along with a Matlab script for viewing.

Purpose

Most computer vision and machine learning algorithms function by training on example inputs. They require a large and varied set of training data to work effectively. For example, the real-time face detection method used by Paul Viola and Michael J. Jones was trained on 4,916 hand-labeled faces.[1]

Cropping, re-sizing and hand-marking points of interest is tedious and time-consuming.

Historically, most data sets used in computer vision research have been tailored to the specific needs of the project being worked on. A large problem in comparing computer vision techniques is the fact that most groups use their own data sets. Each set may have different properties that make reported results from different methods harder to compare directly. For example, differences in image size, image quality, relative location of objects within the images and level of occlusion and clutter present can lead to varying results.[2]

The Caltech 101 data set aims at alleviating many of these common problems.

  • The images are cropped and re-sized.
  • Many categories are represented, which suits both single and multiple class recognition algorithms.
  • Detailed object outlines are marked.
  • Available for general use, Caltech 101 acts as a common standard by which to compare different algorithms without bias due to different data sets.

However, a follow-up study demonstrated that tests based on uncontrolled natural images (like the Caltech 101 data set) can be seriously misleading, potentially guiding progress in the wrong direction.[3]

Data set


Images


The Caltech 101 data set consists of a total of 9,146 images, split between 101 different object categories, as well as an additional background/clutter category.

Each object category contains between 40 and 800 images. Common and popular categories such as faces tend to have a larger number of images than others.

Each image is about 300 × 200 pixels. Images of oriented objects such as airplanes and motorcycles were mirrored to be left-to-right aligned, and vertically oriented structures such as buildings were rotated to be off axis.
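
The mirroring step can be illustrated with a minimal sketch, assuming an image is held as a list of pixel rows. The function name and the tiny sample image are hypothetical and are not taken from the tools used to build Caltech 101.

```python
from typing import List

Image = List[List[int]]  # rows of pixel values (assumed representation)


def mirror_left_right(image: Image) -> Image:
    """Return a copy of the image with every row reversed (a horizontal flip)."""
    return [list(reversed(row)) for row in image]


# Tiny hypothetical 2x3 "image"; the object (value 9) sits on the right edge
image = [
    [0, 0, 9],
    [0, 9, 9],
]
print(mirror_left_right(image))  # [[9, 0, 0], [9, 9, 0]]
```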

Annotations


A set of annotations is provided for each image. Each set of annotations contains two pieces of information: the general bounding box in which the object is located and a detailed human-specified outline enclosing the object.

A Matlab script is provided with the annotations. It loads an image and its corresponding annotation file and displays them as a Matlab figure.
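
As a rough illustration of how the two pieces of annotation relate, the sketch below computes an axis-aligned bounding box from an outline given as a list of (x, y) points. The function name, the point format and the sample coordinates are assumptions made for this example. They do not reflect the layout of the actual Caltech 101 annotation files or the behaviour of the supplied Matlab script, and the boxes shipped with the data set need not be the tightest possible ones.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def bounding_box(outline: List[Point]) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) of the tightest axis-aligned box around the outline."""
    if not outline:
        raise ValueError("outline must contain at least one point")
    xs = [x for x, _ in outline]
    ys = [y for _, y in outline]
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical outline of an object in an image of roughly 300 x 200 pixels
outline = [(40.0, 60.0), (120.0, 35.0), (210.0, 80.0), (180.0, 150.0), (70.0, 140.0)]
print(bounding_box(outline))  # (40.0, 35.0, 210.0, 150.0)
```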

Uses


The Caltech 101 data set was used to train and test several computer vision recognition and classification algorithms. The first paper to use Caltech 101 was an incremental Bayesian approach to one-shot learning,[4] an attempt to classify an object using only a few examples, by building on prior knowledge of other classes.

The Caltech 101 images, along with the annotations, were used for another one-shot learning paper at Caltech.[5]

Other computer vision papers that report using the Caltech 101 data set include:

  • Shape Matching and Object Recognition using Low Distortion Correspondence. Alexander C. Berg, Tamara L. Berg, Jitendra Malik. CVPR 2005
  • The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features. K. Grauman and T. Darrell. International Conference on Computer Vision (ICCV), 2005[6]
  • Combining Generative Models and Fisher Kernels for Object Class Recognition. A.D. Holub, M. Welling and P. Perona. International Conference on Computer Vision (ICCV), 2005[7]
  • Object Recognition with Features Inspired by Visual Cortex. T. Serre, L. Wolf and T. Poggio. Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), IEEE Computer Society Press, San Diego, June 2005[8]
  • SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition. Hao Zhang, Alex Berg, Michael Maire, Jitendra Malik. CVPR, 2006[9]
  • Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. CVPR, 2006[10]
  • Empirical Study of Multi-Scale Filter Banks for Object Categorization. M.J. Marín-Jiménez and N. Pérez de la Blanca. December 2005[11]
  • Multiclass Object Recognition with Sparse, Localized Features. Jim Mutch and David G. Lowe. pp. 11–18, CVPR 2006, IEEE Computer Society Press, New York, June 2006[12]
  • Using Dependent Regions for Object Categorization in a Generative Framework. G. Wang, Y. Zhang, and L. Fei-Fei. IEEE Comp. Vis. Patt. Recog. 2006[13]

Analysis and comparison


Advantages


Caltech 101 has several advantages over other similar data sets:

  • Uniform size and presentation:
    • Almost all the images within each category are uniform in image size and in the relative position of interest objects. Caltech 101 users generally do not need to crop or scale images before they can be used.
  • Low level of clutter/occlusion:
    • Algorithms concerned with recognition usually function by storing features unique to the object. However, most images taken have varying degrees of background clutter, which means algorithms may build incorrect models that include features of the background.
  • Detailed annotations

Weaknesses


Some weaknesses of the Caltech 101 data set[3][14] may be conscious trade-offs, while others are limitations of the data set. Papers that rely solely on Caltech 101 are frequently rejected.

Weaknesses include:

  • The data set is too clean:
    • Images are very uniform in presentation, aligned from left to right, and usually not occluded. As a result, the images are not always representative of practical inputs that the algorithm might later expect to see. Under practical conditions, images are more cluttered, occluded and display greater variance in relative position and orientation of interest objects. The uniformity allows concepts to be derived using the average of a category, which is unrealistic.
  • Limited number of categories:
    • The Caltech 101 data set represents only a small fraction of possible object categories.
  • Some categories contain few images:
    • Certain categories are not represented as well as others, containing as few as 31 images.
    • This means that the number of images used for training must be less than or equal to 30, which is not sufficient for all purposes.
  • Aliasing and artifacts due to manipulation:
    • Some images have been rotated and scaled from their original orientation, and suffer from some amount of artifacts or aliasing.
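
The effect of the smallest category on the training set size can be sketched as follows. This is only an illustration: the function name, the category names and the file names are hypothetical, and the fixed per-category cap is one simple way of building a split, not a protocol prescribed by the data set. With 31 images in the smallest category, taking 30 for training leaves a single image for testing.

```python
import random
from typing import Dict, List, Tuple


def split_per_category(
    images_by_category: Dict[str, List[str]],
    n_train: int = 30,
    seed: int = 0,
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Randomly pick n_train images per category for training; the rest are kept for testing."""
    rng = random.Random(seed)
    train: Dict[str, List[str]] = {}
    test: Dict[str, List[str]] = {}
    for category, images in images_by_category.items():
        # Every category must keep at least one image for testing
        if len(images) <= n_train:
            raise ValueError(
                f"{category!r} has only {len(images)} images; need more than {n_train}"
            )
        shuffled = list(images)
        rng.shuffle(shuffled)
        train[category] = shuffled[:n_train]
        test[category] = shuffled[n_train:]
    return train, test


# Hypothetical file names; the smallest category is given 31 images, as in the text
data = {
    "small_category": [f"image_{i:04d}.jpg" for i in range(1, 32)],
    "large_category": [f"image_{i:04d}.jpg" for i in range(1, 801)],
}
train, test = split_per_category(data, n_train=30)
print({c: (len(train[c]), len(test[c])) for c in data})
# {'small_category': (30, 1), 'large_category': (30, 770)}
```

Asking for 31 training images per category would raise an error here, because the smallest category would have nothing left for testing.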

Other data sets

  • Caltech 256 is another image data set, created in 2007 as a successor to Caltech 101. It is intended to address some of the weaknesses of Caltech 101. Overall, it is a more difficult data set than Caltech 101, but it suffers from comparable problems. It includes:[3]
    • 30,607 images, covering a larger number of categories
    • Minimum number of images per category raised to 80
    • Images are not left-right aligned
    • More variation in image presentation
  • LabelMe is an open, dynamic data set created at MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). LabelMe takes a different approach to the problem of creating a large image data set, with different trade-offs.
    • 106,739 images, 41,724 annotated images, and 203,363 labeled objects.
    • Users may add images to the data set by upload, and add labels or annotations to existing images.
    • Due to its open nature, LabelMe has many more images covering a much wider scope than Caltech 101. However, since each person decides what images to upload, and how to label and annotate each image, the images are less consistent.
  • VOC 2008 is a European effort to collect images for benchmarking visual categorization methods. Compared to Caltech 101/256, a smaller number of categories (about 20) are collected. The number of images in each category, however, is larger.
  • Overhead Imagery Research Data Set (OIRDS) is an annotated library of imagery and tools.[15] OIRDS v1.0 is composed of passenger vehicle objects annotated in overhead imagery. Passenger vehicles in the OIRDS include cars, trucks, vans, etc. In addition to the object outlines, the OIRDS includes subjective and objective statistics that quantify the vehicle within the image's context. For example, subjective measures of image clutter, clarity, noise, and vehicle color are included along with more objective statistics such as ground sample distance (GSD), time of day, and day of year.
    • ~900 images, containing ~1800 annotated images
    • ~30 annotations per object
    • ~60 statistical measures per object
    • Wide variation in object context
    • Limited to passenger vehicles in overhead imagery
  • MICC-Flickr 101 is an image data set created at the Media Integration and Communication Center (MICC), University of Florence, in 2012. It is based on Caltech 101 and is collected from Flickr. MICC-Flickr 101[16] corrects the main drawback of Caltech 101, i.e. its low inter-class variability, and provides social annotations through user tags. It builds on a standard and widely used data set composed of a manageable number of categories (101), and can therefore be used to compare object categorization performance in a constrained scenario (Caltech 101) and object categorization "in the wild" (MICC-Flickr 101) on the same 101 categories.

See also


References

  1. ^ Viola, Paul; Jones, Michael J. (2004). "Robust Real-Time Face Detection". International Journal of Computer Vision. 57 (2): 137–154. doi:10.1023/B:VISI.0000013087.49260.fb. S2CID 2796017.
  2. ^ Oertel, Carsten; Colder, Brian; Colombe, Jeffrey; High, Julia; Ingram, Michael; Sallee, Phil (2008). "Current challenges in automating visual perception". 2008 37th IEEE Applied Imagery Pattern Recognition Workshop. pp. 1–8. doi:10.1109/AIPR.2008.4906457. ISBN 978-1-4244-3125-0. S2CID 36669995.
  3. ^ a b c Pinto, Nicolas; Cox, David D.; Dicarlo, James J. (2008). "Why is Real-World Visual Object Recognition Hard?". PLOS Computational Biology. 4 (1): e27. Bibcode:2008PLSCB...4...27P. doi:10.1371/journal.pcbi.0040027. PMC 2211529. PMID 18225950.
  4. ^ L. Fei-Fei, R. Fergus and P. Perona. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. IEEE. CVPR 2004, Workshop on Generative-Model Based Vision. 2004
  5. ^ L. Fei-Fei; R. Fergus; P. Perona (April 2006). "One-Shot learning of object categories" (PDF). IEEE Transactions on Pattern Analysis and Machine Intelligence. 28 (4): 594–611. doi:10.1109/TPAMI.2006.79. PMID 16566508. S2CID 6953475. Archived from the original (PDF) on 2007-06-09. Retrieved 2008-01-16.
  6. ^ The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features. K. Grauman and T. Darrell. International Conference on Computer Vision (ICCV), 2005
  7. ^ Holub, AD; Welling, M; Perona, P. Combining Generative Models and Fisher Kernels for Object Class Recognition. International Conference on Computer Vision (ICCV), 2005. Archived from the original on 2007-08-14. Retrieved 2008-01-16.
  8. ^ Object Recognition with Features Inspired by Visual Cortex. T. Serre, L. Wolf and T. Poggio. Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), IEEE Computer Society Press, San Diego, June 2005
  9. ^ SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition. Hao Zhang, Alex Berg, Michael Maire, Jitendra Malik. CVPR, 2006
  10. ^ Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. CVPR, 2006
  11. ^ Empirical Study of Multi-Scale Filter Banks for Object Categorization. M.J. Marín-Jiménez and N. Pérez de la Blanca. December 2005
  12. ^ Multiclass Object Recognition with Sparse, Localized Features. Jim Mutch and David G. Lowe. pp. 11–18, CVPR 2006, IEEE Computer Society Press, New York, June 2006
  13. ^ G. Wang; Y. Zhang; L. Fei-Fei (2006). "Using Dependent Regions for Object Categorization in a Generative Framework" (PDF). IEEE Comp. Vis. Patt. Recog. Archived from the original (PDF) on 2007-06-09. Retrieved 2008-01-16.
  14. ^ J. Ponce; T. L. Berg; M. Everingham; D. A. Forsyth; M. Hebert; S. Lazebnik; M. Marszalek; C. Schmid; B. C. Russell; A. Torralba; C. K. I. Williams; J. Zhang; A. Zisserman (2006). J. Ponce; M. Hebert; C. Schmid; A. Zisserman (eds.). "Dataset Issues in Object Recognition" (PDF). Toward Category-Level Object Recognition, Springer-Verlag Lecture Notes in Computer Science. Archived from the original (PDF) on 2016-12-24. Retrieved 2008-02-08.
  15. ^ F. Tanner, B. Colder, C. Pullen, D. Heagy, C. Oertel, & P. Sallee, Overhead Imagery Research Data Set (OIRDS) – an annotated data library and tools to aid in the development of computer vision algorithms, June 2009, <http://sourceforge.net/apps/mediawiki/oirds/index.php?title=Documentation Archived 2012-11-09 at the Wayback Machine> (28 December 2009)
  16. ^ "L. Ballan, M. Bertini, A. Del Bimbo, A.M. Serain, G. Serra, B.F. Zaccone. Combining Generative and Discriminative Models for Classifying Social Images from 101 Object Categories. Int. Conference on Pattern Recognition (ICPR), 2012" (PDF). Archived from the original (PDF) on 2014-08-26. Retrieved 2012-07-11.