We are in an era of data-intensive computing, driven by a tremendous surge in applications that process massive amounts of data from diverse sources, including web pages, stock markets, medical records, climate warning systems, telecom call data records, telescope imagery, and online transactions. In many of these applications, data is generated at an extremely fast rate, so real-time redundancy removal is needed to improve resource and compute efficiency for downstream processing. For example, each call data record (CDR) contains details about a particular call, such as the calling number, the called number, the length of the call, and the start time of the call. Multiple copies of a CDR may be entered into the database due to errors in CDR generation. Therefore, before these CDRs can be stored in a central data center, duplicates must be removed at high throughput. Performing de-duplication via database accesses is slow and inefficient.
To address this challenging problem, we design an efficient parallel redundancy removal algorithm, based on Bloom filters, for both in-memory and disk-based execution. With in-memory execution, our parallel algorithm performs complete de-duplication of 500 million records in 255 s (a throughput of roughly 2 million records per second) on a 16-core Intel Xeon 5570 system. For larger datasets (~6 billion records), the algorithm takes less than 4.5 hours with disk-based execution, using only 6 cores of the Intel Xeon 5570.
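To give a rough sense of the core idea, the following Python sketch shows sequential Bloom-filter-based de-duplication; it is a minimal illustration, not the parallel algorithm described above. The record keys, filter sizing, and double-hashing scheme are assumptions for the example.

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter over string keys (illustrative only)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions from two base hashes (double hashing).
        digest = hashlib.sha1(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add_if_new(self, item):
        # Return True if item was (probably) not seen before, and set its bits.
        positions = self._positions(item)
        seen = all((self.bits[p // 8] >> (p % 8)) & 1 for p in positions)
        for p in positions:
            self.bits[p // 8] |= 1 << (p % 8)
        return not seen

# Usage: keep only the first occurrence of each record key.
bf = BloomFilter(num_bits=10_000_000, num_hashes=7)
records = ["cdr-001", "cdr-002", "cdr-001"]
unique = [r for r in records if bf.add_if_new(r)]
print(unique)  # ['cdr-001', 'cdr-002']
```

Note that a plain Bloom filter admits false positives, so a new record may occasionally be misclassified as a duplicate; a system guaranteeing complete de-duplication would verify such collisions against the stored records.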