THE DECLASSIFICATION ENGINE is a partnership between faculty and students in Columbia’s Departments of History, Statistics and Computer Science. The goal was to probe the limits of official secrecy by applying natural language processing software to archives of declassified documents to examine whether it is possible to predict the contents of redacted text, attribute authorship to anonymous documents and model the geographic and temporal patterns of diplomatic communications.
UPDATE: Project Manager, Eric Gade, provided this update on the team’s progress:
The past year has seen big changes at the Declassification Engine project. In addition to hiring summer fellows and recruiting people to assist with data processing, we were able to obtain a MacArthur Foundation grant. This has helped to pay for a Project Manager and new Post-Doctoral researchers. We have also begun the process of designing a new public-facing website, which will give access to new tools we are currently developing.
Much of our day-to-day work up to this point has focused on unglamorous but essential improvements to infrastructure. Previously MIT hosted our data on a physical server in Massachusetts. Three months ago we migrated to Amazon Web Services and created backup systems. Our server now hosts the API that we are developing so that users can quickly interact with our document collections. On top of this, we have built a basic user interface so that researchers can view documents without having to use a command line. Our new website, which is being designed in partnership with The Shared Web, will take this interface to a whole new level and provide integration with the filtering and visualization tools that our other team members are creating. Soon one will be able to search for documents by country, date, people mentioned, and more – all through a clean, accessible, and pleasant interface.
In terms of research, we have continued our work on analyzing redactions, such as determining the people and places most likely to be blacked out in declassified documents, as well as identifying unreported events through “bursts” in cable traffic. We have also begun new projects, such as attempting to predict authorship in declassified documents, which we hope will lead to an automated tool to identify the likely author of anonymous memos. We are also using topic modeling to discern patterns in official secrecy, such as the most and least sensitive topics as measured by the presence of redactions. Our Network Analysis team is attempting to extract social networks from the nearly 1.4 million State Department cables that we currently possess.
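As a rough illustration of what looking for “bursts” in cable traffic can mean, here is a minimal sketch that flags days whose cable count is far above the recent average. It is not the project's actual method: the function name `find_bursts`, the seven-day trailing window, the threshold of three standard deviations, and the sample counts are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def find_bursts(daily_counts, window=7, threshold=3.0):
    """Return the indices of days whose count is more than `threshold`
    standard deviations above the mean of the preceding `window` days.

    Hypothetical helper for illustration only.
    """
    bursts = []
    for i in range(window, len(daily_counts)):
        history = daily_counts[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        # A perfectly flat history has zero deviation; fall back to 1.0
        # so the comparison below stays well defined.
        if sigma == 0:
            sigma = 1.0
        if (daily_counts[i] - mu) / sigma > threshold:
            bursts.append(i)
    return bursts


# Invented daily cable counts with one obvious spike on day 7.
counts = [10, 12, 11, 9, 10, 13, 11, 48, 12, 10]
print(find_bursts(counts))  # prints [7]
```

A flagged day is only a candidate: a spike may reflect a known crisis, a holiday backlog, or a gap in the archive, so each one would still need to be checked against the historical record.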
We’ve also participated in several conferences and workshops. Our recent paper, Topic Modeling Official Secrecy, was accepted for the “Data Science for Social Good” workshop at KDD 2014. We have also contributed a chapter on using computational methods to a methodological guide to research on U.S. foreign relations. Also, two of our researchers recently presented work on our project at the Brown Institute at Stanford to great interest, resulting in many fruitful discussions.
Altogether, the past year has set us up to better process our data, better present the results of that processing, and, in turn, to be prepared to ask and answer important research questions about official secrecy. From these we can create new knowledge not only about the past, but also about how to handle a pressing public policy issue in our current time: the exponential growth in official secrecy.
NEWSHUB is composed of graduate students and recent graduates of the Columbia School of Journalism and the School of Engineering and Applied Science. Their goal was to create a system for tracking post-publication censorship (i.e. when a story is revised or deleted after publication) by authoritarian regimes. The team hoped to create real-time assessments and monthly reports of journalistic improprieties around the globe.
UPDATE from Yue Qui, student in Columbia’s Graduate School of Journalism & Columbia School of International and Public Affairs:
Newshub’s goal is to enable readers to participate in keeping news organizations accountable. To that end, we have tried to build a comprehensive tool to help both English and Chinese speaking readers to understand the importance of our project. In order to achieve this goal, we are building a dashboard to summarize the deletions and alterations in news stories that we have collected.
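To make the idea of summarizing deletions and alterations concrete, here is a minimal sketch that compares two snapshots of the same story using Python's standard `difflib` module. It is an illustration under stated assumptions, not Newshub's implementation: the function name `summarize_revision`, the line-by-line comparison, and the sample text are all hypothetical.

```python
import difflib


def summarize_revision(original, revised):
    """Classify a story revision as deleted, altered, or unchanged and
    list the lines that were removed or added.

    Hypothetical helper for illustration only.
    """
    old_lines = original.splitlines()
    # An empty later snapshot is treated as a full deletion of the story.
    if not revised.strip():
        return {"status": "deleted", "removed": old_lines, "added": []}

    new_lines = revised.splitlines()
    removed, added = [], []
    # ndiff prefixes removed lines with "- " and added lines with "+ ".
    for line in difflib.ndiff(old_lines, new_lines):
        if line.startswith("- "):
            removed.append(line[2:])
        elif line.startswith("+ "):
            added.append(line[2:])

    status = "altered" if removed or added else "unchanged"
    return {"status": status, "removed": removed, "added": added}


# Invented example: the second sentence disappears after publication.
first = "The mayor resigned.\nProtests continued downtown."
later = "The mayor resigned."
print(summarize_revision(first, later))
```

A dashboard could aggregate these per-story summaries by outlet or by date; comparing at the sentence level rather than the line level would be a natural refinement.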
Our website is still a work-in-progress and we are trying to tackle some technical problems. We expect a functional website to be launched in late September. We are also in touch with non-profit organizations and foundations, including the Committee to Protect Journalists and the International Center for Journalists, to seek further opportunities to cooperate and develop the idea.
ENSEMBLE is a web platform created by Stanford students Joy Kim and Justin Cheng that provides structure to collaborative storytelling. In Ensemble, one person is assigned the responsibility of managing creative direction, and can then enlist a crowd of friends or strangers to perform various tasks – such as contributing narrative direction or developing a character’s back story – with an ultimate goal of creating more engaging stories by drawing from the different personal viewpoints and experiences of a group.
UPDATE from Stanford student Joy Kim:
Ensemble is a novel collaborative story writing platform. Initially, our focus was on understanding how such a system could be used to author short stories among small groups of people. We focused on two main field deployments. The first was the Stanford Story Slam, an open contest where more than 100 participants submitted over 50 stories written using the platform. We followed this up with the Arrowhead Story, a collaboration with established authors Tom Kealey and Chris Baty. Here, readers could suggest ideas for the story’s direction as it unfolded. Some of our results were published at CSCW 2014 and were well received by the research community.
More recently, we have launched a version of Ensemble powered almost entirely by the crowd. An author can now create stories on Ensemble, and launch creative microtasks on Mechanical Turk. The author then acts as a curator, using crowd-provided ideas and writing as a platform for deciding where to go next with the story.
Today, anyone can sign up on Ensemble and create stories with the crowd. For instance, a class in the Teachers College at Columbia University has been using the system to collect anecdotes and recorded material. We have also been working with researchers from Bowdoin College to create a collaborative archive and reflection space for Puerto Rican student strikes.
Ensemble suggests several promising future directions, in terms of scale as well as domain. Our experiments on Mechanical Turk suggest that Ensemble can easily scale to large numbers of collaborators: a crowd-authored novel involving hundreds of authors could be a potential next step. While we focused on creative writing, we view Ensemble and its design approach as fairly general: Ensemble could be used in journalism, potentially supporting a more collaborative and fluid news writing process, and maintaining ongoing dialog between writers and readers.
Kim, J., Cheng, J. & Bernstein, M. (2014). Exploring Complementary Strengths of Leaders and Crowds in Creative Collaboration. Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing (CSCW ’14). ACM, New York, NY, USA, 745-755.
Social Media / Citations
Matias, J. Nathan. “Writing Fiction Together: Ensemble, A Collaborative Storytelling System.”
Teevan, Jaime. “Related Work: Ensemble.” Slow Searching.
Baty, Chris. “Why You Should Consider Sharing Your Novel As You Go.” The NaNoWriMo Blog.
BUSHWIG: Columbia School of Journalism documentary film student, Adam Golub, and Rutgers University School of Communication and Information doctoral student, Jessa Lingel, will tell the story of a drag renaissance taking place in Bushwick, Brooklyn – one that is enlisting and extending social media platforms for the “identity curation” that happens in the drag community.
UPDATE from Jessa Lingel:
The Brooklyn drag archive started with the belief that members of a vibrant, creative counter-culture should have the ability to document and share media from their lives in a way that corresponds to their local needs and ethics. In the past two years, Brooklyn has seen an explosion in drag arts, led by the House of Bushwig, a group of drag performers based in the Brooklyn neighborhood of Bushwick.
Whereas it used to be that Brooklyn drag was limited to one or two queens performing irregularly, now it’s possible to see drag every night of the week, with nearly 100 performers active in the scene. Social media plays a valuable role in this community, allowing performers to promote upcoming shows, maintain relationships with fans as well as other performers, and contribute to a larger archive of queer and drag history. Yet, despite the popularity and ubiquitous use of social media sites like Facebook and Instagram among Brooklyn’s drag performers, these sites may not do the best job of respecting the ethics and values of this group. In order to address this gap between counter-cultural needs and technological design, we set out to create an online archive geared towards the specific needs, interests and ethics of Brooklyn’s drag community.
Wanting to make sure that our designs were informed by the everyday experiences of community members, we started by holding a series of focus groups with drag performers. During focus groups, we concentrated on the range of social media platforms used in everyday life as a drag performer, how and why they’re useful, and what needs weren’t being met by these platforms. We also asked specifically about whether a drag archive would be useful, and what features they’d most like to see incorporated into a site built specifically for them. In total, we interviewed 16 people in four focus groups during the fall of 2013.
Findings from this research have been written up as an academic journal article, currently under review at the Journal of Computer Mediated Communication.
Drawing on key findings from our analysis of focus group interviews, we began developing the online archive in 2014. Using a WordPress template, the archive offers the ability to create profiles as performers and upload media. Another key feature of the site is a community calendar where upcoming events can be promoted. As of this summer, development of the site is ongoing, as we continue to build and improve features requested by community members. Our goal is to launch the online archive in time for the third annual Bushwig Festival in September 2014, where attendees and performers will be encouraged to upload media to the site.
A final element of our research project has been an installation, creating a physical space where people can interact with media produced by and about this community. Designed with a team of designers and architects, the installation concentrates on divisions between onstage (performers directing attention to and demanding attention of the audience), backstage (performers interacting with each other and preparing for shows), and offstage (the everyday lives of queens when they’re not actively participating in drag). The installation is currently in its final design phase, and our plan is to open it in late September in a Brooklyn-based venue.
WIDESCOPE & SYNAPP: Developed by Pranav Dandekar and David Lee, Widescope is an online social media platform designed to crowd-source federal and state budget proposals to drive greater consensus on budgets and budget deficits. Users can design a budget – for example, proposing greater funding for education and less for defense, or vice versa – and then interact with other users to restructure various proposals to arrive at a single consensus.
UPDATE from Stanford PhD Engineering student David Lee:
The goal of our Magic Grant is to develop algorithms, systems, and mechanisms for large-scale aggregation and collaboration, all in an effort to posit online social media as an enabler of deliberative and participatory democracy.
A 2013-14 Magic Grant recipient, Widescope & Synapp will receive renewed funding to scale up the current systems to achieve widespread usage and impact by partnering with governments, schools, and media organizations. Additionally, the team will further develop and implement algorithms and mechanisms for more effective aggregation and collaboration, all in an effort to posit online social media as an enabler of deliberative and participatory democracy. In addition to David Lee, the Widescope and Synapp team includes Sukolsak Sakshuwong, a graduate student in computer science at Stanford Engineering.
To date, we have collaborated with the Government of Finland in reforming their off-road traffic law and with Chicago’s 49th Ward in their Participatory Budgeting project. Our work has been published or presented in several journals, conferences, and workshops: the Proceedings of the National Academy of Sciences (PNAS), the Conference on Human Computation (HCOMP), the Conference for Collective Intelligence (CI), the Workshop for Social Computing and User-Generated Content (SCUGC), the Workshop for Computational Social Choice (COMSOC), and the International Political Science Association (IPSA).
GISTRAKER: A collaboration by Stanford Computer Science student Richard Socher and Communication doctoral candidate Rebecca Weiss, Gistraker is a Web application that analyzes the sentiment of language used in news media. Users will be able to create filters and explore visual summaries of how different media outlets cover specific actors or issues of interest, which could reveal instances of media bias.
UPDATE from Rebecca Weiss:
The goal of our project was to build software that would enable the distribution of Natural Language Processing technology on large-scale news data collections, particularly those technologies that are related to sentiment and other semantic variables (e.g. named entities, dependencies, and latent topics). The vision was that these semantic data could shed light onto defining a new form of media bias.
We had a few pitfalls over the year because of data sharing concerns; the two primary sources of data that we intended to use were not comfortable with the sharing of complete raw text data between machines, particularly machines that they didn’t own. This was a problem since the NLP models we wanted to use required lots of training data (meaning showing lots of raw content to Mechanical Turkers), and the software that we wrote to run the initial NLP server was dependent on AWS infrastructure.
Since the last update, we now have the following:
(1) The backend technology has been converted to a standalone web service, which is now available.
This means that anyone can download this service and run it on their own machine, creating their own NLP server backed by Stanford CoreNLP. It runs the latest version of the Stanford tools. The infrastructure of this technology was accepted as a paper at the ACL 2014 Workshop on Interactive Language Learning, Visualization, and Interfaces. We can consider this a mark that part of the project’s goals has been met: the software has been successfully built, and other models (e.g. new sentiment models) can be plugged into this server.
(2) This software also is now powering the NLP functionality of the Media Cloud project. This was a big step forward. It took several months to lay down the plumbing that could process all of the English-language media sources through the Stanford CoreNLP annotation pipeline, mostly due to Media Cloud infrastructure problems. This will eventually power all of the NLP-enhanced tools on Media Meter, the front-end interface to the Media Cloud dataset. Media Meter will be free and open for public use. Currently, we are annotating the stream of New York Times data. After validating this feature, the next step on the agenda is the entire Mainstream Media set, which is 21 news sources. Eventually this service will be used to annotate all English-language media sources in Media Cloud. This is the plan for the next year.
In order to move this agenda forward and recognize the integration of these tools with Media Cloud, Rebecca is now officially a 2014-2015 Fellow at the Berkman Center for Internet and Society at the Harvard Law School as well as a 2014-15 Brown Institute Fellow. Meanwhile, Richard has graduated! (This, unfortunately, means that he has moved on to his next venture.)
(3) Rebecca, along with several members of the Media Cloud project, presented a paper on recovering latent topics from large scale news content at the 2014 KDD Workshop on News Publishing. This will likely constitute a large portion of Rebecca’s Brown Fellow research over the 2014-2015 year.
CITYBEAT: A collaboration between The New York World housed in Columbia Journalism School, and the Social Media Information Lab at Rutgers University, this project will look for newsworthy events in the patterns of real-time, geotagged social media feeds.
UPDATE from Raz Schwartz:
CityBeat is a multi-platform application for newsrooms and journalists that sources, monitors and analyzes hyper-local information from multiple social media platforms such as Instagram, Twitter and Foursquare, in real time. We use public, geo-tagged, real-time data shared via social media services in order to trace a city’s happenings and dynamics.
During our first year of the Magic Grant, we have addressed several complicated technical and journalistic challenges. We generated design requirements from news editors and reporters (Schwartz et al., 2013), developed new algorithms (Xie et al., 2013), and built a fully functional large screen ambient display that is currently in the first phases of deployment (Xia et al., 2014).
The alpha version of CityBeat received a warm welcome. After only a few public presentations in small exclusive forums such as the Hacks and Hackers NYC meetup, we received several invitations from media outlets offering their newsrooms as the testing grounds for our system. Most notably, the New York Times metro desk, Buzzfeed and the Gothamist asked us to deploy a live version in their offices across several screens and projectors.
During the first year of our Magic Grant, we worked closely with The New York World editors and reporters. This provided us with indispensable editorial direction for the development of CityBeat, including shaping training data to help the CityBeat algorithm identify true events and reject false events, and making sure the ambient display meets the needs of newsrooms seeking to discover untapped information and images. The World has also used CityBeat to curate coverage of the mayor’s inauguration and find images and sources for news events. CityBeat is on constant display in the newsroom, provoking ongoing discussion and feedback in a live context, and guiding ongoing project development.
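One simple way to picture how geotagged posts can point to a local happening is to bin posts into small map cells and flag cells that are unusually busy compared with their normal level. The sketch below does only that and is not the CityBeat algorithm: the function names `cell_of` and `candidate_events`, the 0.01-degree cell size, the minimum of five posts, the threefold ratio, and the sample coordinates are all assumptions made for the example.

```python
import math
from collections import Counter


def cell_of(lat, lon, cell_deg=0.01):
    """Map a coordinate to the integer index of its grid cell."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def candidate_events(current_posts, baseline_counts, cell_deg=0.01,
                     min_posts=5, ratio=3.0):
    """Return (cell, count) pairs for cells whose current post count is
    at least `min_posts` and at least `ratio` times the usual count.

    `current_posts` is a list of (lat, lon) pairs from the current time
    window; `baseline_counts` maps a cell to its typical count for a
    window of the same length. Hypothetical helper for illustration.
    """
    counts = Counter(cell_of(lat, lon, cell_deg) for lat, lon in current_posts)
    flagged = []
    for cell, n in counts.items():
        # Treat cells with no history as having a baseline of one post.
        expected = max(baseline_counts.get(cell, 0), 1)
        if n >= min_posts and n >= ratio * expected:
            flagged.append((cell, n))
    # Busiest cells first.
    return sorted(flagged, key=lambda item: -item[1])


# Invented example: six posts cluster in one cell, two fall elsewhere.
posts = [(40.7128, -74.0060)] * 6 + [(40.7305, -73.9905)] * 2
print(candidate_events(posts, baseline_counts={}))
```

A flagged cell is only a lead for an editor to review; separating true events from false ones is where curated training data and editorial judgment come in.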
Xia C., Schwartz R., Xie K., Krebs A., Langdon A., Ting J., Naaman M., CityBeat: Real-time Social Media Visualization of Hyper-local City Data, In Proceedings of WWW 2014, 23rd International World Wide Web Conference, Seoul, Korea, 2014.
Xie K., Xia C., Grinberg N., Schwartz R., and Naaman M., Robust detection of hyper-local events from geotagged social media data. In Proceedings of the 13th Workshop on Multimedia Data Mining in KDD, 2013.
NYC Media Lab Summit, Accepted Demo, September 19th.
In other news: