Data Collection

Gather data from external sources such as websites, PDFs, ZIP archives, and APIs.

Purpose

Use this module to retrieve raw data and documents so they can be processed and analyzed downstream.

Key functions

  • ExtractTextFromPDF — Extract and clean text from PDF documents

  • FetchPDFFromURL — Download PDF files from URLs

  • FetchUSShapefile — Retrieve geographical shapefiles from the U.S. Census Bureau TIGER database

  • FetchWebsiteText — Scrape text content from websites

  • GetCompanyFilings — Access SEC EDGAR company filings

  • GetGoogleSearchResults — Fetch Google search results via the Serper API

  • GetZipFile — Download and extract ZIP files from URLs
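
The signatures of the functions above are not shown here, so the following is only a rough sketch of the kind of work a function like GetZipFile performs: download an archive from a URL and unpack it. It uses the Python standard library only. The names extract_zip_bytes and download_and_extract_zip are hypothetical and are not part of this module.

```python
import io
import urllib.request
import zipfile
from pathlib import Path


def extract_zip_bytes(data: bytes, dest_dir: str) -> list[str]:
    """Extract an in-memory ZIP archive into dest_dir; return member names."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        # extractall creates dest_dir and any subdirectories as needed.
        archive.extractall(Path(dest_dir))
        return archive.namelist()


def download_and_extract_zip(url: str, dest_dir: str, timeout: float = 30.0) -> list[str]:
    """Download a ZIP archive from url and extract it into dest_dir.

    Hypothetical helper for illustration; not the module's GetZipFile.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return extract_zip_bytes(response.read(), dest_dir)
```

Keeping the extraction step separate from the download makes it possible to test the unpacking logic on in-memory bytes without network access.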

Common use cases

  • Web scraping

  • Document processing

  • Geospatial analysis

  • Market research and competitive intelligence
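
For the web scraping use case, a minimal sketch of turning an HTML page into plain text is shown below. It assumes the HTML has already been retrieved as a string and uses only Python's standard html.parser. The helper html_to_text is hypothetical and is not the implementation of FetchWebsiteText.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, joined by single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

A real scraper would usually also handle character encodings, request errors, and site-specific markup, which this sketch leaves out.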
