EXCLAIM
Latest revision as of 21:40, 2 July 2023
The EXtensible Cross-Linguistic Automatic Information Machine (EXCLAIM) was an integrated tool for cross-language information retrieval (CLIR), created at the University of California, Santa Cruz in early 2006, with some support for more than a dozen languages. The lead developers were Justin Nuger and Jesse Saba Kirchner.
Early work on CLIR depended on manually constructed parallel corpora for each pair of languages. Building such corpora by hand is labor-intensive compared with creating them automatically. A more efficient way of finding data to train a CLIR system is to use matching pages on the web which are written in different languages.[1]
EXCLAIM capitalizes on the idea of latent parallel corpora on the web by automating the alignment of such corpora in various domains. The most significant of these is Wikipedia itself, which includes articles in 250 languages. The role of EXCLAIM is to use semantics and linguistic analytic tools to align the information in these Wikipedias so that they can be treated as parallel corpora. EXCLAIM is also extensible to incorporate information from many other sources, such as the Chinese Community Health Resource Center (CCHRC).
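The idea of treating linked articles as a latent parallel corpus can be illustrated with a minimal sketch. It assumes, purely for illustration, that articles in different language editions are connected by interlanguage links; the `pages` data, the `align_articles` function, and the language codes are hypothetical and do not describe EXCLAIM's actual alignment method, which the sources do not specify.

```python
from typing import Dict, List, Tuple

# Hypothetical toy data: (language code, title) -> interlanguage links.
Pages = Dict[Tuple[str, str], Dict[str, str]]


def align_articles(pages: Pages, source_lang: str, target_lang: str) -> List[Tuple[str, str]]:
    """Return (source title, target title) pairs joined by an interlanguage link."""
    pairs = []
    for (lang, title), links in sorted(pages.items()):
        # Keep only source-language articles that link to the target language.
        if lang == source_lang and target_lang in links:
            pairs.append((title, links[target_lang]))
    return pairs


pages = {
    ("en", "Water"): {"cy": "Dŵr", "sw": "Maji"},
    ("en", "Moon"): {"cy": "Lleuad"},
    ("en", "Tea"): {},  # no counterpart known, so it is left unaligned
}

print(align_articles(pages, "en", "cy"))
```

Each resulting pair names two documents on the same topic, which is the property a parallel corpus needs before any finer-grained alignment is attempted.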
One of the main goals of the EXCLAIM project is to provide minority languages and endangered languages with the kind of computational and CLIR tools that are often available only for powerful or prosperous majority languages.
Current status
In 2009, EXCLAIM was in a beta state, with varying degrees of functionality for different languages. Support for CLIR using the Wikipedia dataset and the most current version of EXCLAIM (v.0.5), including full UTF-8 support and Porter stemming for the English component, was available for the following twenty-three languages:
- Albanian
- Amharic
- Bengali
- Gothic
- Greek
- Icelandic
- Indonesian
- Irish
- Javanese
- Latvian
- Malagasy
- Mandarin Chinese
- Nahuatl
- Navajo
- Quechua
- Sardinian
- Swahili
- Tagalog
- Tibetan
- Turkish
- Welsh
- Wolof
- Yiddish
Support using the Wikipedia dataset and an earlier version of EXCLAIM (v.0.3) was available for the following languages:
- Dutch
- Spanish
Significant developments in the most recent version of EXCLAIM include support for Mandarin Chinese. By developing support for this language, EXCLAIM has added solutions to segmentation and encoding problems which will allow the system to be extended to many other languages written with non-European orthographic conventions. This support is supplied through the Trimming And Reformatting Modular System (TARMS) toolkit.
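To illustrate what the segmentation and encoding problems involve, the following is a minimal sketch of forward maximum matching, a common dictionary-based segmentation technique for text written without spaces. It is not the TARMS implementation, whose internals are not described here; the `segment` function, the tiny `lexicon`, and the `max_len` parameter are assumptions made for this example.

```python
from typing import List, Set


def segment(text: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Greedy forward maximum matching over a word lexicon."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


lexicon = {"跨", "语言", "信息", "检索"}  # hypothetical toy lexicon
query = "跨语言信息检索"  # "cross-language information retrieval"

# Encoding: the text must survive a UTF-8 round trip before it is segmented.
raw = query.encode("utf-8")
assert raw.decode("utf-8") == query

print(segment(query, lexicon))
```

Greedy matching is simple but can choose the wrong split when words overlap, so a practical system would likely need a larger lexicon or a statistical model.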
Future versions of EXCLAIM will extend the system to additional languages. Other goals include incorporation of available latent datasets in addition to the Wikipedia dataset.
The EXCLAIM development plan calls for EXCLAIM 1.0 to be an integrated CLIR instrument that can search from English for information in any of the supported languages, or from any of the supported languages for information in English. Future versions will allow searching from any supported language into any other, and searching from and into multiple languages.
Further applications
EXCLAIM has been incorporated into several projects which rely on cross-language query expansion as part of their backends. One such project is a cross-linguistic readability software generation framework, detailed in work presented at ACL 2009.[2]
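Cross-language query expansion can be sketched in a few lines. The example below assumes a small hand-written English-to-Spanish translation table; the `expand_query` function and the `translations` data are hypothetical and are not taken from EXCLAIM or from the projects that use it.

```python
from typing import Dict, List


def expand_query(query: str, translations: Dict[str, List[str]]) -> List[str]:
    """Append target-language equivalents after each source-language term."""
    expanded: List[str] = []
    for term in query.lower().split():
        if term not in expanded:
            expanded.append(term)
        # Terms without a known translation are kept unchanged.
        for translated in translations.get(term, []):
            if translated not in expanded:
                expanded.append(translated)
    return expanded


translations = {  # hypothetical toy dictionary
    "water": ["agua"],
    "health": ["salud", "sanidad"],
}

print(expand_query("Water health", translations))
```

The expanded term list can then be submitted to a retrieval backend that indexes documents in both languages.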
Notes and references
1. "Cross-Language Information Retrieval based on Parallel Texts and Automatic Mining of Parallel Texts in the Web" (PDF). ACM-SIGIR 1999. Retrieved 2006-12-02.
2. "A crosslinguistic readability framework" (PDF). ACL-IJNLP 2009. Retrieved 2009-09-04.
External links
- EXCLAIM Website (archived 2007-03-30): https://web.archive.org/web/20070330033504/http://www.soe.ucsc.edu/%7Ejnuger/cgi-bin/exclaim.cgi
- Semantic Web Roadmap: http://www.w3.org/DesignIssues/Semantic.html
- Chinese Cultural Health Resource Center (archived): https://web.archive.org/web/20061206133107/http://www.cchphmo.com/cchrchealth/index_E.html
- Justin Nuger's professional webpage: http://ju-st.in/
- Jesse Saba Kirchner's professional webpage: http://people.ucsc.edu/~kirchner/