Function reference

This page documents the functions shipped in src/gdeltnews/ (and the most important helpers they rely on). It focuses on behavior, inputs, outputs, and notable edge cases.

Top-level package exports (gdeltnews)

download(start, end, *, outdir="gdeltdata", overwrite=False, decompress=True, timeout=30, show_progress=True) -> DownloadStats

Download GDELT Web NGrams minute files for an inclusive time range.

  • Inputs
    • start, end: either datetime or a timestamp string. Supported string formats include:
      • YYYY-MM-DDTHH:MM:SS (optionally with a trailing Z)
      • YYYY-MM-DD HH:MM:SS
      • YYYYMMDDHHMMSS
    • outdir: destination directory for downloaded files
    • overwrite: redownload even if the .gz already exists locally
    • decompress: if True, also write the decompressed .json files
    • timeout: HTTP timeout in seconds
    • show_progress: show a progress bar across minute slots
  • Behavior
    • Iterates minute-by-minute from start to end (inclusive), attempting to fetch <YYYYMMDDHHMMSS>.webngrams.json.gz from the GDELT Web NGrams base URL.
    • If a minute file does not exist (non-200 response), it is skipped.
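
For example, a minimal call (the time range is illustrative):

  from gdeltnews import download

  # Fetch ten minute files of Web NGrams data; minutes with no file are skipped.
  stats = download("2024-01-15T08:00:00", "2024-01-15T08:09:00",
                   outdir="gdeltdata", decompress=True)
  print(stats)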

reconstruct(input_dir="gdeltdata", output_dir="gdeltpreprocessed", *, language=None, url_filters=None, processes=None, delete_gz=False, delete_json=True, delete_empty_csv=True, show_progress=True) -> None

Bulk reconstruction runner for a folder of GDELT *.webngrams.json.gz files.

  • Inputs
    • input_dir: directory containing the .gz files
    • output_dir: directory where per-input CSVs are written
    • language: optional language code filter (e.g. "it"). None keeps all languages
    • url_filters: optional iterable of URL substrings to keep (any match keeps a URL)
    • processes: number of worker processes used per input file (None uses all cores)
    • delete_gz: delete original .gz after processing
    • delete_json: delete the temporary decompressed .json after processing
    • delete_empty_csv: delete CSVs that contain only the header row
    • show_progress: show a progress bar across files
  • Outputs
    • Writes one CSV per input file to output_dir with delimiter | and header:
      • Text|Date|URL|Source
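
For example, to keep only Italian-language entries from two illustrative domains:

  from gdeltnews import reconstruct

  # The language code and domain filters below are illustrative.
  reconstruct(
      "gdeltdata",
      "gdeltpreprocessed",
      language="it",
      url_filters=["ansa.it", "repubblica.it"],
  )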

filtermerge(input_dir, output_file, *, query=None, keep_temp=False, verbose=True) -> None

Filter, merge, and deduplicate reconstructed CSVs.

  • Inputs
    • input_dir: directory of CSV files (as produced by reconstruct)
    • output_file: destination CSV path
    • query: Boolean query string using AND, OR, NOT plus parentheses. Use double quotes for phrases, e.g. "julia roberts".
    • keep_temp: keep the intermediate output_file + ".tmp" file
    • verbose: print progress messages
  • Behavior
    • Performs case-insensitive substring matching over the Text column.
    • Deduplicates by URL, keeping the row with the longest Text.
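
An example call (the query, input directory, and output name are illustrative):

  from gdeltnews import filtermerge

  filtermerge(
      "gdeltpreprocessed",
      "merged.csv",
      query='("julia roberts" OR "george clooney") AND NOT festival',
  )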

Module: gdeltnews.wordmatch

transform_dict(original_dict: dict[str, list[dict]]) -> dict[str, list[Entry]]

Convert raw per-URL entry dictionaries into simplified “sentence fragment” entries.

  • Builds a sentence string from pre, ngram, and post.
  • Normalizes some early-position artifacts (e.g. keeps the substring after " / " when pos < 20).
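
A rough stand-alone sketch of the fragment assembly (the joining shown here is a simplification; the real function also carries positions and applies the normalization above):

  def fragment_from_entry(entry: dict) -> str:
      # Join the surrounding context and the ngram itself into one fragment.
      parts = (entry.get("pre", ""), entry.get("ngram", ""), entry.get("post", ""))
      return " ".join(p for p in parts if p)

  print(fragment_from_entry({"pre": "said the", "ngram": "president", "post": "on monday"}))
  # -> "said the president on monday"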

reconstruct_sentence(fragments: list[str], positions: list[int] | None = None) -> str

Reconstruct a longer text by merging overlapping fragments (word overlap).

  • Greedy overlap merge (max overlap first).
  • If positions are given, enforces a constraint that prevents obviously wrong reorderings:
    • only appends fragments whose position is not earlier than the current max position
    • only prepends fragments whose position is not later than the current min position
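
A simplified stand-alone sketch of the word-overlap idea (merging sequentially rather than largest-overlap-first, and without the position constraints):

  def merge_two(a: str, b: str) -> str | None:
      """Append b to a if the end of a overlaps the start of b by at least one word."""
      aw, bw = a.split(), b.split()
      for k in range(min(len(aw), len(bw)), 0, -1):  # try the largest overlap first
          if aw[-k:] == bw[:k]:
              return " ".join(aw + bw[k:])
      return None

  fragments = ["the quick brown fox", "brown fox jumps over", "jumps over the lazy dog"]
  text = fragments[0]
  for frag in fragments[1:]:
      text = merge_two(text, frag) or text + " " + frag
  print(text)  # -> the quick brown fox jumps over the lazy dog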

remove_overlap(text: str) -> str

Conservative cleanup that removes simple duplicated prefix/suffix overlaps in the final reconstructed text.
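
A sketch of the general idea, assuming a whole-word comparison (the actual helper may strip less aggressively):

  def drop_repeated_tail(text: str) -> str:
      """If the last k words repeat the first k words, drop the trailing copy."""
      words = text.split()
      for k in range(len(words) // 2, 0, -1):
          if words[:k] == words[-k:]:
              return " ".join(words[:-k])
      return text

  print(drop_repeated_tail("breaking news from rome breaking news"))
  # -> "breaking news from rome"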

load_and_filter_data(input_file: str, language_filter: str | None = "en", url_filter=None) -> (articles: dict, url_order: list[str])

Read a decompressed *.webngrams.json file line-by-line and:

  • optionally filter by language (language_filter=None keeps all)
  • optionally filter by URL substring(s)
  • group entries by URL
  • return:
    • transformed entries (via transform_dict)
    • the URL order as first encountered in the file (used to preserve ordering later)
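
Example usage (the path is illustrative, and passing a list of substrings for url_filter is an assumption):

  from gdeltnews.wordmatch import load_and_filter_data

  articles, url_order = load_and_filter_data(
      "gdeltdata/20240115080000.webngrams.json",
      language_filter="en",
      url_filter=["bbc.co.uk"],
  )
  print(f"{len(url_order)} URLs kept")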

determine_source_label(url: str, url_filters: list[str] | None = None) -> str

Derive a Source label from URL filters:

  • exactly one filter matches: returns that filter string
  • multiple filters match: returns "Multiple URL matched"
  • no filters or no matches: returns ""
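
A stand-alone sketch of the documented behavior (substring matching against the URL is assumed):

  def source_label(url: str, url_filters: list[str] | None = None) -> str:
      matches = [f for f in (url_filters or []) if f in url]
      if len(matches) == 1:
          return matches[0]
      if len(matches) > 1:
          return "Multiple URL matched"
      return ""

  print(source_label("https://www.bbc.co.uk/news/world", ["bbc.co.uk", "cnn.com"]))  # -> "bbc.co.uk"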

process_article((url, entries), url_filters=None) -> dict[str, str]

Reconstruct one article (intended for multiprocessing).

  • sorts entries by pos
  • merges fragments with reconstruct_sentence
  • runs remove_overlap and basic output cleanup
  • returns a dict with url, text, date, source

process_file_multiprocessing(input_file, output_file, language_filter: str | None = "en", url_filter=None, num_processes=None) -> None

Core driver for reconstructing one decompressed JSON file into a CSV.

  • Uses a multiprocessing pool with imap_unordered.
  • Preserves original URL order using the url_order list from load_and_filter_data.
  • Always writes a CSV header even when no articles are found.
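
Example usage (paths and the URL filter are illustrative; the input must already be decompressed):

  from gdeltnews.wordmatch import process_file_multiprocessing

  process_file_multiprocessing(
      "gdeltdata/20240115080000.webngrams.json",
      "gdeltpreprocessed/20240115080000.csv",
      language_filter=None,        # keep all languages
      url_filter=["bbc.co.uk"],
      num_processes=4,
  )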