Deduping: Difference between revisions

From Wikipedia, the free encyclopedia
Calzakk (talk | contribs)


* Dedupe Software http://www.winpure.com/
* Deduplication Software http://www.helpit.com
* Project Dedupe http://dedupe.sourceforge.net



Revision as of 10:24, 1 October 2006

Deduping means removing duplicate entries from a set. It is a common task when integrating multiple databases or merging datasets. When merging bibliographic data, for example, you have to compare several values that belong to each entry or record to determine whether you have duplicates and how many. These values include ISSN, ISBN, title, contributors (authors, editors, publishers), place of publication, frequency, page count, and publication date(s). How easy the task is depends on the quality of your data. For example, if your cataloguing practice is not consistent, some records without standard numbers (ISSN, ISBN, etc.) may be duplicates of records for the same works that do have standard numbers. If data quality is an issue, one way to think about deduping is to look for records whose metadata is not exactly the same but was intended to be the same (and would be the same if data quality standards had been applied).
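
The comparison described above can be sketched in Python. This is only a minimal illustration, not a standard algorithm: the field names (`isbn`, `issn`, `title`, `author`), the helper functions, and the sample records are all invented for the example. It assumes that two records are duplicates if they share a standard number, or if their title and author are the same once case and punctuation are ignored. Real data would usually need more careful rules, for instance to keep different editions of the same title apart.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def match_keys(record):
    """Return the keys under which a record may match another record."""
    keys = []
    for field in ("isbn", "issn"):
        value = record.get(field)
        if value:
            # Ignore hyphens and spaces in standard numbers.
            keys.append((field, re.sub(r"[^0-9Xx]", "", value).upper()))
    title = normalize(record.get("title", ""))
    if title:
        # Assumption: same normalized title and author means same work.
        keys.append(("title", title, normalize(record.get("author", ""))))
    return keys


def dedupe(records):
    """Keep the first record of each group of probable duplicates."""
    seen = {}  # match key -> index of the kept record it belongs to
    kept = []
    duplicates = 0
    for record in records:
        keys = match_keys(record)
        hit = next((seen[k] for k in keys if k in seen), None)
        if hit is None:
            kept.append(record)
            hit = len(kept) - 1
        else:
            duplicates += 1
        # Register every key so later records can match through any of them.
        for k in keys:
            seen.setdefault(k, hit)
    return kept, duplicates


# Invented sample data: the third record has no standard number.
records = [
    {"isbn": "0-306-40615-2", "title": "Data Quality", "author": "A. Smith"},
    {"isbn": "0306406152", "title": "Data quality.", "author": "A. Smith"},
    {"title": "DATA QUALITY", "author": "a smith"},
    {"issn": "1234-5679", "title": "Journal of Lists", "author": ""},
]
kept, duplicates = dedupe(records)
```

Here the second record matches the first on its ISBN, and the third, which has no standard number, matches on its normalized title and author, so `dedupe` keeps two records and counts two duplicates.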

Example: "Mark, do you typically dedupe these lists I send you?"

Dedupe requires an 'e' at the end. Dedup is incorrect.