Here's suggested pseudo-code for downloading Firehose data:

Set current_epoch_timestamp = get_current_epoch_timestamp()
Read the list of zip files from the firehose link
Download and extract each zip file
Every 1 minute:
    Read the firehose link
    For each zip file name(*):
        If Number(zip file name) > current_epoch_timestamp:
            Download the zip file
            Set current_epoch_timestamp = Number(zip file name)

(*) Note: every zip file name matches the file's creation timestamp.
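The steps above can be sketched in Python. This is a minimal sketch, not the official client: `fetch_listing` and `download_and_extract` are hypothetical hooks for your own HTTP code, and the only assumption taken from the notes above is that each zip file is named with its creation epoch timestamp.

```python
import time

def new_zip_names(listing, current_epoch_timestamp):
    """Return zip file names whose numeric name (the file's creation
    epoch timestamp) is newer than the last timestamp processed."""
    fresh = []
    for name in listing:
        # Zip file name matches the file's creation epoch timestamp.
        ts = int(name.split(".")[0])
        if ts > current_epoch_timestamp:
            fresh.append(name)
    return sorted(fresh)

def poll_firehose(fetch_listing, download_and_extract, interval_sec=60):
    """Polling loop per the pseudo-code: download the initial backlog,
    then every minute re-read the firehose link and fetch only files
    newer than the last pass. fetch_listing() and
    download_and_extract() are caller-supplied placeholders."""
    current_epoch_timestamp = int(time.time())
    for name in fetch_listing():  # initial backlog
        download_and_extract(name)
    while True:
        time.sleep(interval_sec)
        for name in new_zip_names(fetch_listing(), current_epoch_timestamp):
            download_and_extract(name)
            current_epoch_timestamp = int(name.split(".")[0])
```

Advancing `current_epoch_timestamp` after each download is what keeps the loop from re-fetching the same files on every pass.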

About de-duplication and data integrity
Since the crawlers are distributed and may not be in sync with each other, the
final de-duplication test is done by the firehose using the topic URL + number of
posts. If a post is added to a thread, all the posts of that thread will be
re-introduced to the firehose.
This can indeed increase content duplication in the firehose, but it will in no way compromise data integrity.
To ensure this, we have added the <seen_before>(True/False)</seen_before> field
to the Firehose output, which indicates whether a specific post was
previously sent out via the Firehose.
Filtering on this field should ensure that you do not encounter duplicates.
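A minimal sketch of de-duplicating on the <seen_before> field, assuming the Firehose output is XML with one <post> element per record; the <post> and <id> tag names here are illustrative assumptions, since only <seen_before> is documented above:

```python
import xml.etree.ElementTree as ET

def fresh_posts(firehose_xml):
    """Yield only the posts not previously sent via the Firehose,
    based on their <seen_before> field. A missing field is treated
    as False (i.e. the post is kept)."""
    root = ET.fromstring(firehose_xml)
    for post in root.iter("post"):  # "post" tag name is an assumption
        flag = post.findtext("seen_before", default="False")
        if flag.strip().lower() != "true":
            yield post
```

Dropping every record flagged `seen_before = True` leaves exactly the posts you have not received before, so re-introduced threads do not create duplicates on your side.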