Data Retrieval

Here's suggested pseudo-code for downloading Firehose data:

Set current_epoch_timestamp = get_current_epoch_timestamp()
Read list of zip files from the firehose link
Download zip file and extract
Every 1 minute:
    Read Firehose Link
    For each zip file name(*):
        If Number(zip file name) > current_epoch_timestamp:
            Download zip file and extract
    If any zip file was downloaded:
        Set current_epoch_timestamp = largest Number(zip file name) downloaded

(*) Note: Every zip file name matches the file's creation timestamp.
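
The polling loop above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not an official client: the `<timestamp>.zip` naming pattern, the helper names (`stamp_of`, `new_zip_names`, `poll_once`, `run`, `download_and_extract`), and the idea of passing in `list_names` and `fetch` callables are all hypothetical, because the format of the firehose link listing is not described here.

```python
import io
import re
import time
import zipfile
from urllib.request import urlopen

# Assumption: archives are named "<numeric timestamp>.zip".
ZIP_NAME = re.compile(r"(\d+)\.zip")


def stamp_of(name):
    """Return the numeric timestamp encoded in a zip file name, or None."""
    match = ZIP_NAME.fullmatch(name)
    return int(match.group(1)) if match else None


def new_zip_names(names, last_seen):
    """Return names whose timestamp is greater than last_seen, oldest first."""
    stamped = [(stamp_of(n), n) for n in names if stamp_of(n) is not None]
    return [n for stamp, n in sorted(stamped) if stamp > last_seen]


def download_and_extract(url, dest_dir):
    """Fetch one archive into memory and unpack it under dest_dir."""
    with urlopen(url) as response:
        payload = response.read()
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(dest_dir)


def poll_once(list_names, fetch, last_seen):
    """One pass: fetch every unseen archive and return the updated timestamp."""
    for name in new_zip_names(list_names(), last_seen):
        fetch(name)
        last_seen = stamp_of(name)  # advance so this file is not fetched again
    return last_seen


def run(list_names, fetch, start, interval_seconds=60):
    """Poll forever, once per interval (one minute by default)."""
    last_seen = start
    while True:
        last_seen = poll_once(list_names, fetch, last_seen)
        time.sleep(interval_seconds)
```

A caller would supply `list_names` (a function returning the file names currently listed at the firehose link) and `fetch` (for example, a wrapper around `download_and_extract` that builds the full URL), then call `run(list_names, fetch, start=int(time.time()))`. The unit of the file-name timestamps (seconds or milliseconds) is not stated here, so `start` must use the same unit as the names.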

About de-duplication and data integrity

Since the crawlers are distributed and may not be in sync with each other, the final duplication check is performed by the Firehose, using the topic URL plus the number of posts. If a post is added to a thread, all the posts of that thread are re-introduced to the Firehose. This can increase content duplication in the Firehose, but it does not compromise data integrity.

To help you handle this, we have added the <seen_before>(True/False)</seen_before> field to the Firehose output, which indicates whether a specific post was previously sent out via the Firehose. Filtering on this field (keeping only posts where it is False) should ensure that you do not encounter duplicates.
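
As an illustration of using the field, the sketch below keeps only posts that have not been sent before. It assumes the extracted output is XML in which each post is an element (hypothetically named `post`) with a `seen_before` child. The element name and the `unseen_posts` helper are assumptions for illustration, so adjust them to the actual output structure.

```python
import xml.etree.ElementTree as ET


def unseen_posts(xml_text, post_tag="post"):
    """Return post elements whose <seen_before> value is not 'True'."""
    root = ET.fromstring(xml_text)
    fresh = []
    for post in root.iter(post_tag):
        flag = (post.findtext("seen_before") or "").strip().lower()
        if flag != "true":  # a missing flag is treated as "not seen before"
            fresh.append(post)
    return fresh
```

Treating a missing flag as unseen is a design choice that favors keeping data over dropping it. Invert the check if duplicates are the greater concern.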