Function reference
This page documents the functions shipped in src/gdeltnews/ (and the most important helpers they rely on). It focuses on behavior, inputs, outputs, and notable edge cases.
Top-level package exports (gdeltnews)
download(start, end, *, outdir="gdeltdata", overwrite=False, decompress=True, timeout=30, show_progress=True) -> DownloadStats
Download GDELT Web NGrams minute files for an inclusive time range.
- Inputs
  - start, end: either datetime or a timestamp string. Supported string formats include:
    - YYYY-MM-DDTHH:MM:SS (optionally with trailing Z)
    - YYYY-MM-DD HH:MM:SS
    - YYYYMMDDHHMMSS
  - outdir: destination directory for downloaded files
  - overwrite: redownload even if the .gz already exists locally
  - decompress: if True, also write the decompressed .json files
  - timeout: HTTP timeout in seconds
  - show_progress: show a progress bar across minute slots
- Behavior
  - Iterates minute-by-minute from start to end (inclusive), attempting to fetch <YYYYMMDDHHMMSS>.webngrams.json.gz from the GDELT Web NGrams base URL.
  - If a minute file does not exist (non-200 response), it is skipped.
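The mapping from an inclusive time range to minute file names can be sketched as follows. This is an illustration only: parse_timestamp and minute_filenames are hypothetical helpers, not part of gdeltnews, and the sketch assumes that the listed string formats are tried in order and that the start time is truncated to the whole minute.

```python
from datetime import datetime, timedelta

# String formats named in the Inputs list; trying them in this order is an assumption.
_FORMATS = ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S", "%Y%m%d%H%M%S")


def parse_timestamp(value):
    """Accept a datetime or one of the supported timestamp strings (hypothetical helper)."""
    if isinstance(value, datetime):
        return value
    text = value.strip()
    if text.endswith("Z"):
        text = text[:-1]  # drop the optional trailing Z
    for fmt in _FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unsupported timestamp: {value!r}")


def minute_filenames(start, end):
    """List one candidate file name per minute slot from start to end, inclusive."""
    # Assumption: seconds are truncated so every slot falls on a whole minute.
    current = parse_timestamp(start).replace(second=0, microsecond=0)
    last = parse_timestamp(end)
    names = []
    while current <= last:
        names.append(current.strftime("%Y%m%d%H%M%S") + ".webngrams.json.gz")
        current += timedelta(minutes=1)
    return names
```

Each name would then be appended to the base URL and fetched, with non-200 responses skipped as described above.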
reconstruct(input_dir="gdeltdata", output_dir="gdeltpreprocessed", *, language=None, url_filters=None, processes=None, delete_gz=False, delete_json=True, delete_empty_csv=True, show_progress=True) -> None
Bulk reconstruction runner for a folder of GDELT *.webngrams.json.gz files.
- Inputs
  - input_dir: directory containing the .gz files
  - output_dir: directory where per-input CSVs are written
  - language: optional language code filter (e.g. "it"). None keeps all languages
  - url_filters: optional iterable of URL substrings to keep (any match keeps a URL)
  - processes: number of worker processes used per input file (None uses all cores)
  - delete_gz: delete original .gz after processing
  - delete_json: delete the temporary decompressed .json after processing
  - delete_empty_csv: delete CSVs that contain only the header row
  - show_progress: show a progress bar across files
- Outputs
  - Writes one CSV per input file to output_dir with delimiter | and header: Text|Date|URL|Source
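To consume these files downstream, the standard csv module is enough once the delimiter is set. The sketch below is a minimal reader; read_reconstructed_csv is a hypothetical helper, the sample row is made up for illustration, and the quoting rules of the real writer are not documented here, so the default csv dialect is an assumption.

```python
import csv
import io


def read_reconstructed_csv(handle):
    """Return the rows of a pipe-delimited reconstructed CSV as dictionaries."""
    reader = csv.DictReader(handle, delimiter="|")
    return list(reader)


# Made-up sample content using the documented header.
sample = io.StringIO(
    "Text|Date|URL|Source\n"
    "An example reconstructed text|2025-03-01|https://example.com/a|example.com\n"
)
rows = read_reconstructed_csv(sample)
print(rows[0]["URL"])
```

With a real file, pass an open file object instead, for example open(path, newline="", encoding="utf-8").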
filtermerge(input_dir, output_file, *, query=None, keep_temp=False, verbose=True) -> None
Filter, merge, and deduplicate reconstructed CSVs.
- Inputs
  - input_dir: directory of CSV files (as produced by reconstruct)
  - output_file: destination CSV path
  - query: Boolean query string using AND, OR, NOT plus parentheses. Use double quotes for phrases, e.g. "julia roberts".
  - keep_temp: keep the intermediate output_file + ".tmp" file
  - verbose: print progress messages
- Behavior
  - Performs case-insensitive substring matching over the Text column.
  - Deduplicates by URL, keeping the row with the longest Text.
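The query semantics and the deduplication rule can be illustrated with a small stand-alone sketch. It is not the filtermerge implementation: matches and dedupe_longest are hypothetical names, operators are assumed to be uppercase only, and the precedence NOT > AND > OR is an assumption of this sketch.

```python
import re

# Tokens: parentheses, double-quoted phrases, or bare words.
_TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')


def matches(query, text):
    """Evaluate a Boolean query against text with case-insensitive substring tests."""
    tokens = _TOKEN.findall(query)
    haystack = text.lower()
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():
        value = parse_and()
        while peek() == "OR":
            take()
            rhs = parse_and()  # always parse the right side so tokens are consumed
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() == "AND":
            take()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        if peek() == "NOT":
            take()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        if peek() in (None, ")", "AND", "OR"):
            raise ValueError("expected a term, phrase, or opening parenthesis")
        tok = take()
        if tok == "(":
            value = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            take()
            return value
        return tok.strip('"').lower() in haystack

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens in query")
    return result


def dedupe_longest(rows):
    """Keep one row per URL, preferring the row with the longest Text."""
    best = {}
    for row in rows:
        kept = best.get(row["URL"])
        if kept is None or len(row["Text"]) > len(kept["Text"]):
            best[row["URL"]] = row
    return list(best.values())
```

For example, matches('"julia roberts" AND NOT (oscar OR bafta)', text) keeps texts that mention the phrase but neither award name.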
Module: gdeltnews.wordmatch
transform_dict(original_dict: dict[str, list[dict]]) -> dict[str, list[Entry]]
Convert raw per-URL entry dictionaries into simplified “sentence fragment” entries.
- Builds sentence from pre, ngram, post.
- Normalizes some early-position artifacts (e.g. keeps substring after " / " when pos < 20).
reconstruct_sentence(fragments: list[str], positions: list[int] | None = None) -> str
Reconstruct a longer text by merging overlapping fragments (word overlap).
- Greedy overlap merge (max overlap first).
- If positions are given, enforces a constraint that prevents obviously wrong reorderings:
  - only appends fragments whose position is not earlier than the current max position
  - only prepends fragments whose position is not later than the current min position
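The greedy strategy described above can be sketched as below. This is a simplified illustration rather than the library's code: greedy_merge and word_overlap are hypothetical names, the first fragment is used as the seed, and ties between equally large overlaps go to the first candidate found.

```python
def word_overlap(left, right):
    """Largest k such that the last k words of left equal the first k words of right."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def greedy_merge(fragments, positions=None):
    """Merge word-overlapping fragments, largest overlap first."""
    if not fragments:
        return ""
    words = [f.split() for f in fragments]
    merged = words[0]
    lo = hi = positions[0] if positions else 0
    remaining = list(range(1, len(words)))
    while remaining:
        best = None  # (overlap size, fragment index, side)
        for i in remaining:
            p = positions[i] if positions else None
            # Append only if the fragment is not earlier than the current max position.
            if p is None or p >= hi:
                k = word_overlap(merged, words[i])
                if k and (best is None or k > best[0]):
                    best = (k, i, "append")
            # Prepend only if the fragment is not later than the current min position.
            if p is None or p <= lo:
                k = word_overlap(words[i], merged)
                if k and (best is None or k > best[0]):
                    best = (k, i, "prepend")
        if best is None:
            break  # nothing left that overlaps under the constraints
        k, i, side = best
        if side == "append":
            merged = merged + words[i][k:]
            if positions:
                hi = max(hi, positions[i])
        else:
            merged = words[i][:-k] + merged
            if positions:
                lo = min(lo, positions[i])
        remaining.remove(i)
    return " ".join(merged)
```

Fragments that never overlap the growing text, or that would violate the position constraint, are simply left out in this sketch.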
remove_overlap(text: str) -> str
Conservative cleanup that removes simple duplicated prefix/suffix overlaps in the final reconstructed text.
load_and_filter_data(input_file: str, language_filter="en"|None, url_filter=None) -> (articles: dict, url_order: list[str])
Read a decompressed *.webngrams.json file line-by-line and:
- optionally filter by language (language_filter=None keeps all)
- optionally filter by URL substring(s)
- group entries by URL
- return:
  - transformed entries (via transform_dict)
  - the URL order as first encountered in the file (used to preserve ordering later)
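The filtering and grouping steps can be sketched over any iterable of JSON lines. group_by_url is a hypothetical helper; the field names url and lang are assumptions about the record layout, skipping malformed lines is an assumption, and the transform_dict step is left out.

```python
import json


def group_by_url(lines, language_filter=None, url_filter=None):
    """Group JSON-lines records by URL and remember first-seen URL order."""
    grouped = {}
    url_order = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # assumption: malformed lines are skipped
        url = record.get("url", "")
        if language_filter is not None and record.get("lang") != language_filter:
            continue
        if url_filter and not any(part in url for part in url_filter):
            continue
        if url not in grouped:
            grouped[url] = []
            url_order.append(url)
        grouped[url].append(record)
    return grouped, url_order
```

Because the function accepts any iterable of strings, it can be fed an open file object directly, which keeps memory use tied to the kept records rather than the whole file.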
determine_source_label(url: str, url_filters: list[str] | None = None) -> str
Derive a Source label from URL filters:
- exactly one filter matches: returns that filter string
- multiple filters match: returns "Multiple URL matched"
- no filters or no matches: returns ""
process_article((url, entries), url_filters=None) -> dict[str, str]
Reconstruct one article (intended for multiprocessing).
- sorts entries by pos
- merges fragments with reconstruct_sentence
- runs remove_overlap and basic output cleanup
- returns a dict with url, text, date, source
process_file_multiprocessing(input_file, output_file, language_filter="en"|None, url_filter=None, num_processes=None) -> None
Core driver for reconstructing one decompressed JSON file into a CSV.
- Uses a multiprocessing pool with imap_unordered.
- Preserves original URL order using the url_order list from load_and_filter_data.
- Always writes a CSV header even when no articles are found.
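The combination of imap_unordered with a saved url_order can be sketched as below. To keep the example runnable anywhere, it uses multiprocessing.dummy.Pool, a thread-backed pool with the same imap_unordered interface, where the real driver uses worker processes. fake_process_article and reconstruct_in_order are hypothetical stand-ins, not library functions.

```python
import csv
import io
from multiprocessing.dummy import Pool  # thread-backed stand-in, same pool API


def fake_process_article(item):
    """Stand-in worker: joins fragments instead of doing real reconstruction."""
    url, entries = item
    return {"url": url, "text": " ".join(entries), "date": "", "source": ""}


def reconstruct_in_order(articles, url_order, handle, workers=2):
    """Process articles in any completion order, then write rows in url_order."""
    with Pool(workers) as pool:
        # Results arrive in completion order, so index them by URL.
        results = {r["url"]: r for r in pool.imap_unordered(fake_process_article, articles.items())}
    writer = csv.writer(handle, delimiter="|")
    writer.writerow(["Text", "Date", "URL", "Source"])  # header is written even with no rows
    for url in url_order:
        row = results.get(url)
        if row is not None:
            writer.writerow([row["text"], row["date"], row["url"], row["source"]])


buffer = io.StringIO()
reconstruct_in_order({"u1": ["a", "b"], "u2": ["c"]}, ["u1", "u2"], buffer)
print(buffer.getvalue())
```

Indexing results by URL trades a little memory for the freedom to let workers finish in any order, which is the usual reason to prefer imap_unordered over imap.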