Media Watch feature, Vrije Universiteit Brussel
Automated data mining through newsfeeds using Drupal
Import, manage, search, visualize and export newsfeed items
BUILDING A WEB CRAWLER WITH NUMEROUS BENEFITS
IMPORT A HUGE AMOUNT OF NEWS
Items imported from over 400 newsfeed services, day by day.
AUTOMATIC TAGGING AND CATEGORIZATION
News selection based on the number of automatically discovered and assigned tags and categories.
USER-FRIENDLY ADMIN INTERFACE
Manage hundreds of newsfeeds and thousands of imported newsfeed items easily.
GENERATE STATISTICS
A function for admins. Use query parameters to explore data within a given date range.
USER-FRIENDLY SEARCH INTERFACE
Enable visitors to search among the imported news items by using free keywords, specifying date ranges and applying faceted filters (based on the assigned tags and categories).
VISUALIZE AND EXPORT SEARCH RESULTS
Visualize results in a Frequency Graph, Map and Heatmap. Export search results in a comma-separated (.csv) format.
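As an illustration of the last two features, the following minimal sketch counts search results per day (the kind of series a frequency graph plots) and serializes them to CSV. The field names `title`, `published` and `tags`, the sample items and both helper functions are assumptions made for this example, not the project's actual data model.

```python
import csv
import io
from collections import Counter
from datetime import date

# Hypothetical search results; field names are assumptions for illustration.
results = [
    {"title": "Item A", "published": date(2014, 3, 1), "tags": ["energy"]},
    {"title": "Item B", "published": date(2014, 3, 1), "tags": ["health"]},
    {"title": "Item C", "published": date(2014, 3, 2), "tags": ["energy", "policy"]},
]

def frequency_by_day(items):
    """Count items per publication day, sorted by date."""
    counts = Counter(item["published"] for item in items)
    return sorted(counts.items())

def to_csv(items):
    """Serialize items to comma-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title", "published", "tags"])
    for item in items:
        writer.writerow(
            [item["title"], item["published"].isoformat(), "|".join(item["tags"])]
        )
    return buffer.getvalue()

daily_counts = frequency_by_day(results)
csv_text = to_csv(results)
```

Here multiple tags are joined with `|` so that each item stays on a single CSV row; any other separator that cannot occur in a tag would work equally well.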

In 2014 we created a media watch feature for the VUB, a system able to execute various tasks (processors) in a given order with the help of Drupal's queue operations and Jenkins jobs, such as:

  • Importing the full content of every news item with the help of our Node.js based proxy server

  • Automatically indexing news items with Apache Solr.
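The idea of running processors in a given order over queued items can be sketched as follows. This is a simplified, in-memory illustration in Python, not the Drupal queue or Jenkins setup itself; the processor functions, the item fields and the example URLs are all hypothetical stand-ins.

```python
from collections import deque

# Hypothetical processors, applied in a fixed order to every item.
def import_full_content(item):
    # Stand-in for fetching the full article through the proxy server.
    item["body"] = f"full text of {item['url']}"
    return item

def index_item(item):
    # Stand-in for sending the item to the search index.
    item["indexed"] = True
    return item

PROCESSORS = [import_full_content, index_item]

def run_queue(queue):
    """Drain the queue, applying each processor in order to every item."""
    done = []
    while queue:
        item = queue.popleft()
        for processor in PROCESSORS:
            item = processor(item)
        done.append(item)
    return done

queue = deque([{"url": "http://example.com/a"}, {"url": "http://example.com/b"}])
processed = run_queue(queue)
```

Keeping the processors in a plain ordered list makes the execution order explicit and lets a new step be added without touching the loop.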
REDUCING UNNECESSARILY STORED DATA

The system removes every unrelated news item on a monthly basis to reduce the size of stored data.
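A minimal sketch of such a cleanup step is shown below. It assumes that an item counts as "unrelated" when it has fewer than a minimum number of assigned tags; that criterion, the `tags` field and the `prune_unrelated` helper are assumptions for illustration, since the exact rule is not described here.

```python
def prune_unrelated(items, min_tags=1):
    """Keep only items with at least min_tags assigned tags."""
    return [item for item in items if len(item["tags"]) >= min_tags]

# Hypothetical stored items.
stored = [
    {"title": "Relevant item", "tags": ["energy", "policy"]},
    {"title": "Untagged item", "tags": []},
]
kept = prune_unrelated(stored)
```

In a scheduled job, the same filter would run once a month over the stored items and delete everything it does not keep.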

FAILURE-PROOF DATA MANAGEMENT

A fail-safe implementation ensures that data is collected and processed continuously.
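One common way to keep a processing run going despite individual failures is to requeue a failed item a limited number of times. The sketch below illustrates that general technique only; the retry limit, the `process_with_retries` helper and the `flaky` processor are assumptions and do not describe how the actual system handles failures.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed retry limit, not taken from the original system

def process_with_retries(queue, processor):
    """Process every item; requeue failures until MAX_ATTEMPTS is reached."""
    succeeded, failed = [], []
    while queue:
        item, attempts = queue.popleft()
        try:
            succeeded.append(processor(item))
        except Exception:
            if attempts + 1 < MAX_ATTEMPTS:
                queue.append((item, attempts + 1))  # try again later
            else:
                failed.append(item)  # give up on this item, keep the run going
    return succeeded, failed

# Hypothetical processor that fails once for item "b", then succeeds.
seen = set()

def flaky(item):
    if item == "b" and item not in seen:
        seen.add(item)
        raise RuntimeError("temporary failure")
    return item.upper()

ok, bad = process_with_retries(deque([("a", 0), ("b", 0), ("c", 0)]), flaky)
```

A failing item never blocks the items behind it: it goes to the back of the queue and the run continues.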

Why Drupal?

Drupal offered the best content management framework (CMF) for this project, as the amount of content doesn’t limit Drupal’s functionality.

We used several modules to extend Drupal 7:

  • Views, to build administration interfaces with good accessibility and usability.
  • Feeds, to import the content of RSS/Atom feeds.
  • Queue operation, to execute time-consuming calculations in the background and prevent performance issues.
  • Search API, Search API Solr and Facet API, to provide a user-friendly interface for building fast, visually engaging custom search pages.
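To give a sense of what a keyword search with a date range and faceted filters looks like at the Solr level, here is a minimal sketch that builds the request parameters. The parameters `q`, `fq`, `facet`, `facet.field` and `wt` are standard Solr query parameters; the field names `published` and `tags` and the `build_solr_params` helper are assumptions for illustration. In the actual site, the Search API modules generate such queries.

```python
from urllib.parse import urlencode

def build_solr_params(keywords, start, end, tags=()):
    """Build Solr select parameters for a keyword, date-range and tag search."""
    params = [
        ("q", keywords or "*:*"),  # match everything if no keywords are given
        ("fq", f"published:[{start}T00:00:00Z TO {end}T23:59:59Z]"),
        ("facet", "true"),
        ("facet.field", "tags"),  # ask Solr for per-tag result counts
        ("wt", "json"),
    ]
    for tag in tags:
        # Each selected facet value becomes its own filter query.
        params.append(("fq", f'tags:"{tag}"'))
    return params

params = build_solr_params("climate", "2014-01-01", "2014-06-30", tags=["energy"])
query_string = urlencode(params)
```

The parameters are kept as a list of pairs rather than a dictionary because `fq` may legitimately appear more than once.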

Drupal also helped us build an efficient visualization system that renders search results by combining custom Solr queries with the D3.js and Leaflet module family.