De-Duplication & Data Integrity

About de-duplication and data integrity

Because the crawlers are distributed and may not be in sync with each other, the
final duplication test is performed by the Firehose, using the topic URL plus the
number of posts in the thread. When a post is added to a thread, all the posts of
that thread are re-introduced to the Firehose.
This can increase content duplication in the Firehose, but it does not compromise data integrity.
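
As a rough illustration of the test described above, the sketch below keys each thread on its topic URL and post count. The function name, the in-memory set, and the example URL are assumptions made for illustration only; they are not the Firehose's actual implementation.

```python
def is_new_thread_state(seen_keys, topic_url, post_count):
    """Return True if this (topic URL, post count) pair has not been seen yet.

    Hypothetical helper: `seen_keys` is a plain set standing in for whatever
    store the real system uses.
    """
    key = (topic_url, post_count)
    if key in seen_keys:
        # Same thread, same number of posts: treat as a duplicate.
        return False
    seen_keys.add(key)
    return True


seen = set()
url = "http://example.com/forum/topic/1"  # placeholder URL

first = is_new_thread_state(seen, url, 2)   # thread first crawled with 2 posts
repeat = is_new_thread_state(seen, url, 2)  # another crawler sends the same state
grown = is_new_thread_state(seen, url, 3)   # a third post was added
```

Under these assumptions, a thread that gains a post produces a new key, so the whole thread passes the test again. That is why its earlier posts can reappear in the output.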

To ensure this, we have added the <seen_before>(True/False)</seen_before> field
to the Firehose output, which indicates whether a specific post was
previously sent out via the Firehose.
Filtering on this field, for example keeping only posts where it is False,
should ensure that you do not encounter duplicates.
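
A minimal sketch of using the field on the consumer side follows, assuming the output arrives as XML. The <posts> and <post> wrapper elements, the <url> field, and the sample data are assumptions for illustration; only <seen_before> comes from the description above.

```python
import xml.etree.ElementTree as ET


def unseen_posts(xml_text):
    """Return the <post> elements whose <seen_before> value is not True.

    Assumption: a post with a missing or empty <seen_before> is treated as
    not seen before.
    """
    root = ET.fromstring(xml_text)
    fresh = []
    for post in root.iter("post"):
        flag = (post.findtext("seen_before") or "").strip().lower()
        if flag != "true":
            fresh.append(post)
    return fresh


# Hypothetical sample payload; the real output layout may differ.
SAMPLE = """<posts>
  <post><url>http://example.com/t/1#1</url><seen_before>True</seen_before></post>
  <post><url>http://example.com/t/1#2</url><seen_before>False</seen_before></post>
</posts>"""

new_urls = [post.findtext("url") for post in unseen_posts(SAMPLE)]
```

Dropping the posts flagged True keeps only the content that has not been delivered before, even when a whole thread is re-introduced.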