Quickstart guide - gdeltnews

This package helps you:

  • download GDELT Web NGrams files for a time range,
  • reconstruct article text from overlapping n-gram fragments,
  • filter and merge reconstructed CSVs using Boolean queries.
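For intuition, reconstruction is roughly a jigsaw of overlapping word sequences. The toy sketch below chains fragments on their longest word overlap; the actual reconstruct() step in the package is considerably more involved:

```python
def merge_fragments(fragments):
    """Toy illustration of stitching overlapping n-gram fragments
    (given as lists of words) back into longer text by chaining on
    the longest suffix/prefix word overlap.
    Illustration only; not the package's real algorithm."""
    text = list(fragments[0])
    for frag in fragments[1:]:
        frag = list(frag)
        # Find the longest suffix of `text` equal to a prefix of `frag`.
        for k in range(min(len(text), len(frag)), 0, -1):
            if text[-k:] == frag[:k]:
                text.extend(frag[k:])
                break
        else:
            # No overlap found: append the fragment as-is.
            text.extend(frag)
    return " ".join(text)
```

For example, merge_fragments([["the", "quick", "brown"], ["quick", "brown", "fox"], ["brown", "fox", "jumps"]]) yields "the quick brown fox jumps".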

For background on the dataset, see the official announcement: https://blog.gdeltproject.org/announcing-the-new-web-news-ngrams-3-0-dataset/

Input files look like: http://data.gdeltproject.org/gdeltv3/webngrams/20250316000100.webngrams.json.gz
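Judging from the example URL, files appear to be named by a per-minute YYYYMMDDHHMMSS timestamp. As an illustration (the download() helper in Step 1 builds these URLs for you), the file list for a range could be generated like this:

```python
from datetime import datetime, timedelta

BASE = "http://data.gdeltproject.org/gdeltv3/webngrams/"

def minute_urls(start: str, end: str):
    """Yield one file URL per minute in [start, end], following the
    YYYYMMDDHHMMSS.webngrams.json.gz pattern inferred from the
    example above. Illustration only: download() does this itself."""
    t = datetime.fromisoformat(start)
    stop = datetime.fromisoformat(end)
    while t <= stop:
        yield f"{BASE}{t:%Y%m%d%H%M%S}.webngrams.json.gz"
        t += timedelta(minutes=1)
```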

Reconstruction is best-effort: output quality depends on how many overlapping n-gram fragments the dataset contains for a given article.

Step 1: Download Web NGrams files

from gdeltnews.download import download

download(
    "2025-11-25T10:00:00",
    "2025-11-25T13:59:00",
    outdir="gdeltdata",
    decompress=False,
)
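With decompress=False the files stay gzipped on disk. If you want to inspect one yourself, each .webngrams.json.gz is assumed here to be newline-delimited JSON (one record per n-gram fragment); a minimal reader sketch:

```python
import gzip
import json

def read_records(path):
    """Stream records from a downloaded .webngrams.json.gz file.
    Assumes newline-delimited JSON, i.e. one JSON object per line."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```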

Step 2: Reconstruct articles (run as a script, not in Jupyter)

Python's multiprocessing does not play well inside Jupyter notebooks; run the following from a standalone .py script.

from multiprocessing import freeze_support
from gdeltnews.reconstruct import reconstruct

def main():
    reconstruct(
        input_dir="gdeltdata",
        output_dir="gdeltpreprocessed",
        language="it",
        url_filters=["repubblica.it", "corriere.it"],
        processes=10,  # use None for all available cores
    )

if __name__ == "__main__":
    freeze_support()  # important on Windows
    main()

Step 3: Filter, deduplicate, and merge CSVs

from gdeltnews.filtermerge import filtermerge

filtermerge(
    input_dir="gdeltpreprocessed",
    output_file="final_filtered_dedup.csv",
    query='((elezioni OR voto) AND (regionali OR campania)) OR ((fico OR cirielli) AND NOT veneto)'
)
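The query syntax combines keywords with AND/OR/NOT and parentheses. As a rough mental model (an illustration, not how filtermerge() is actually implemented), such a query can be evaluated by substituting each term with a word-presence test against the article text:

```python
import re

def matches(query: str, text: str) -> bool:
    """Rough mental model of Boolean keyword matching: replace each
    query term with a membership test against the text's words, then
    evaluate the resulting AND/OR/NOT expression.
    Illustration only; not the package's real implementation."""
    words = set(re.findall(r"\w+", text.lower()))

    def replace(m):
        token = m.group(0)
        if token.upper() in ("AND", "OR", "NOT"):
            return token.lower()             # Boolean operator
        return repr(token.lower() in words)  # term -> True / False

    expression = re.sub(r"\w+", replace, query)
    # Safe to eval here: the expression contains only
    # True/False, and/or/not, and parentheses.
    return eval(expression)
```

For example, matches('(elezioni OR voto) AND NOT veneto', 'il voto in campania') is True, while the same query against text mentioning "veneto" is False.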

Advanced users can pre-filter and download GDELT data via Google BigQuery, then process it directly with wordmatch.py.