Data From Docs aims to create a roadmap and a suite of tools that help journalists discover stories in large, unstructured sets of documents by testing and building upon existing BigLocal tools. The project will use police misconduct data and local governing bodies’ minutes collected by Agenda Watch to produce a set of resources that can help others tackle similar challenges. Its goals include creating a guide for journalists, building a suite of tools for analyzing large unstructured data sets, and integrating these tools into other platforms such as DocumentCloud. The project pipeline includes common tools such as OCR, entity recognition, data extraction using large language models, and other AI technologies.
Joyce Chen, ’25 BS Candidate, Stanford University; Emily Guo, ’23 BA & ’25 MS Candidate, Stanford University; Isabel Sieh, ’25 BS Candidate, Stanford University; Serdar Tumgoren, Visiting Professor, Stanford University; Hilke Schellmann, Assistant Professor, NYU