BigDIVA: Big Data, Big Visuals, Big Searches, and Big Results
Big data and data visualization are popular concepts today: private corporations and public entities alike are rushing to build visualizations, and the most familiar big datasets and visualizations are internet search engines. Despite significant advances in internet technology, search engine interfaces have changed little since their inception in the 1990s; sites such as Google and Bing continue to display search results in paginated lists. When download speeds were slower and there were fewer sites to list, this format was immensely practical. However, as internet speeds have increased and databases have continued to grow, paginated lists have become an increasingly inefficient method of conducting internet research. After all, few people will view more than 3-5 pages of Google search results, let alone the millions of other results returned by any particular query. While search engines have invested significant sums of money to present the most relevant results at the top of their lists, companies have begun to specialize in Search Engine Optimization to ensure that their clients appear at the top of those lists. As a result, the best-funded sites, not necessarily the most relevant ones, appear at the top of many internet searches. Though its dataset is not nearly as substantial as that of Google, Yahoo, or Bing, the Advanced Research Consortium (ARC) has compiled a catalog of humanities-related digital artifacts that presently consists of 1.6 items dating from the medieval period to the twentieth century, some of which include full-text transcriptions and optical character recognition from books and pamphlets. Thanks to the efforts of the Early Modern OCR Project (eMOP) at Texas A&M University, that catalog will expand to include full-text transcriptions of the EEBO and ECCO corpora.
To deal with the challenge of this growing catalog, ARC is considering ways in which large humanities-based datasets and search interfaces could better encourage discoverability and foster new and innovative research questions. This poster considers the challenges of visualizing humanities datasets, not least of which is the more abstract nature of the ARC dataset, which in many ways resembles a card catalogue of digital humanities artifacts and research. It then presents one solution developed by ARC, the Big Data Infrastructure Visualization Application (BigDIVA). Rather than forcing users to load countless lists of search results, BigDIVA presents users with a faceted visualization of all of their results. The poster argues that BigDIVA's faceted organization of the data optimizes the search process, and it identifies three significant advantages over traditional search engines. First, BigDIVA eliminates the need for questionable, secretive, and proprietary rankings of sites and search results, which makes digital research far more transparent and reproducible. Second, it enables users to view the big picture and the individual results within their searches simultaneously. Finally, BigDIVA provides users with an experience similar to wandering the library stacks, enabling and encouraging researchers to discover results that they did not expect among those that they did.
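The contrast between a ranked list and a faceted view can be illustrated with a minimal sketch. It is not BigDIVA's implementation: the records, the field names (genre, century), and the helper functions facet_counts and drill_down are all hypothetical, chosen only to show how every result can be summarized by facet and then narrowed without any relevance ranking.

```python
from collections import Counter

# Hypothetical search results; the fields are illustrative and do not
# reflect ARC's actual catalog schema.
RESULTS = [
    {"title": "Pamphlet A", "genre": "Pamphlet", "century": 17},
    {"title": "Sermon B", "genre": "Sermon", "century": 17},
    {"title": "Poem C", "genre": "Poetry", "century": 18},
    {"title": "Poem D", "genre": "Poetry", "century": 19},
]


def facet_counts(results, facet):
    """Count how many results fall under each value of one facet."""
    return Counter(record[facet] for record in results)


def drill_down(results, **selected):
    """Keep only the results that match every selected facet value."""
    return [
        record
        for record in results
        if all(record.get(key) == value for key, value in selected.items())
    ]


if __name__ == "__main__":
    # The "big picture": every result is accounted for, none is ranked.
    print(facet_counts(RESULTS, "genre"))
    print(facet_counts(RESULTS, "century"))
    # The individual results inside one region of the overview.
    for record in drill_down(RESULTS, genre="Poetry", century=18):
        print(record["title"])
```

Because the overview is built from counts over the full result set, the same query always yields the same picture, which is the sense in which a faceted view can be more transparent and reproducible than an opaque ranking.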