Knowledge is power. That was true even before Sir Francis Bacon coined the phrase (in Latin) back in 1597. In today’s age of readily accessible information, data is a commodity used by everyone from scientists researching cancer cures to fantasy football fanatics looking for an edge in their league. The internet is not only built on data, it’s filled with unique data sets that are available to anyone with a keyboard, covering topics that range from municipal bike share programs to spam text messages.
To celebrate data in all its wonderful forms, Stacker put together this list of super bizarre data sets you might not know existed. Obviously “super bizarre” is in the eye of the beholder, but these data sets span the spectrum from pop culture to public health and everything in between. To be included in the list, the data set had to be free and available to researchers and journalists, which eliminates a wide swath of data sets that are only accessible via a subscription or one-time payment.
Read on to explore the wonderful wide world of incredibly specific data.
Wine fans can use this data set to confidently converse about all things Portuguese wine. The data set involves 12 attributes including fixed acidity, pH, and alcohol content distilled from information regarding northern Portuguese red and white Vinho Verde wine samples. The University of Minho in Portugal puts everything but the bouquet underneath your nose with this data set of almost 5,000 variants.
This massive data set takes 142.8 million Amazon reviews and parses them into searchable details. Everything is broken down into consumable data sets, whether by category or just by product name. The reviews and metadata span nearly 20 years from 1996 to 2014 and were put together by Julian McAuley of the University of California San Diego.
Reddit can be a bizarre place in and of itself, but what happens when someone aggregates every comment that’s ever been made on the platform? That’s what Reddit user Stuck_In_The_Matrix aimed to find out when creating this data set that tracks over 1.7 billion comments. The comments are categorized by author, comment, score, subreddit, and more using Reddit’s application programming interface. With so many comments on such a vast number of topics, anything could be hiding in this data set.
Many people know the feeling of having a song on the tip of the tongue when the name just won’t come to mind. To assist in the self-Shazam, the Million Song Dataset is a collection of features and metadata for a million contemporary popular music tracks. Even though it doesn’t include audio, it breaks tracks down by features of the songs and includes a community of smaller data sets that analyze lyric data and cover songs.
Television painting master Bob Ross calmed millions with his easy-going attitude about “The Joy of Painting” on PBS, and this data set from the statistical analysis artists at FiveThirtyEight analyzes the types of paintings Ross taught in each episode. Broken down by elements like trees, mountains, and water, the data set can be used by Bob Ross aficionados to create an accurate picture of the art teacher’s work or by a novice painter looking for inspiration.
Speed daters can start looking for love in all the right places as Professors Ray Fisman and Sheena Iyengar of Columbia Business School have put together this Speed Dating Experiment. Utilizing data collected from 2002 to 2004, they have broken down the information into categories such as dating habits and the beliefs people found valuable in a mate.
Find out if a 60-pound dog is more or less intelligent than a heavier canine with this data set that pits weight versus the IQ of dogs. It explores the correlation between dog size and intelligence using data derived from the Intelligence of Dogs data set. The project is based on research by Stanley Coren, professor of canine psychology at the University of British Columbia.
The search for intelligent life in the universe gets a big upgrade with the ufo-reports data set, which tracks over 80,000 sightings from the National UFO Reporting Center over the last century. The data collected includes geo-location and time-standardization for easy comparison between sightings for those studying extraterrestrial contact.
Every fantasy league manager wants to know who’s hot and who’s not. This data set lets the owner view data from the last 20 games by individual categories like batting average or home runs. It also allows baseball fans the opportunity to break down data about the top 10 players or dive into the massive data set to examine the complete raw data about players on every team.
Felines in films have been around since 1903, according to this data set, which was compiled on OpenDataSoft, a portal for over 13,000 public datasets. This list can be sorted by director, producer, and year, and can be used to find out which decade was the most feline-friendly in film.
Imagine how many clicks one user goes through in a day. Now, multiply that by 100,000 people and you have the billions of clicks registered by Google Software Engineer Mark Meiss and the Indiana University Center for Complex Networks and Systems Research. This information can be used in practical applications like accurately predicting web traffic trends or understanding online behavior.
Serbian researcher Saša Stamenković created this unique geographic data set to provide programmers with the names and ISO 3166-1 codes for every country in the world. Even more useful, the data translates the country names into many languages and offers them in multiple data formats, so it can be implemented across the globe from Mongolia to South Africa. Sometimes, even interesting data sets can bring the world together.
“Violence is the last refuge of the incompetent,” Isaac Asimov noted. The violent flows data set examines these “incompetent” actions by examining crowd violence from YouTube and analyzing the data for the paper “Violent Flows: Real-Time Detection of Violent Crowd Behavior” by Tal Hassner, Yossi Itcher, and Orit Kliper-Gross. Future researchers may be able to use the data to identify violent situations in real time if they’re captured by surveillance cameras.
Before your next big date, check out this data set compiled by the San Francisco Department of Public Health to find out which restaurants failed their health inspections. The SFDPH organized all its health grade data by zip code for easy reference.
The Caltech Pedestrian Dataset is incredibly useful to traffic researchers and consists of about 10 hours of video in which 2,300 unique pedestrians were annotated. Detection systems trained on this data can learn to recognize pedestrians both in and out of crosswalks, potentially reducing injuries and fatalities. With autonomous cars becoming a reality, this data could prove invaluable for saving lives.
It’s nearly impossible to objectively say whether something is funny or not, but The New Yorker Caption Contest Dataset aims to do just that. The data set contains 33 million ratings on over 440,000 New Yorker captions, which researchers can use to create an algorithm that could, hypothetically, create its own funny captions.
This obscure-sounding data set collects data on comic book murals on buildings in Brussels, Belgium, including the artist of each mural and the characters featured. The group VisitBrussels even put together a website publishing maps of where the murals can be found so travelers can put together a tour.
In 2013, a meteorite fell in the Ural Mountains in Russia, injuring about 1,500 people. Inspired by this, Ramon Martinez of publichealthintelligence.org created this data set of registered meteorites. The information is based on every meteorite recorded in the US Meteoritical Society database. Those using the data can now determine how often an area has been hit and the size of the meteorite that hit, possibly foreseeing what could be coming if the data has predictive value.
When members of the United Kingdom government need to entertain guests, they turn to the Government Hospitality Wine Cellar. To keep track of all the wine U.K. ministers are consuming, the British government created a data set that tracks consumption of wine by origin, vintage, and quantity to see specifically how hospitable government employees are being to their guests. For the record, only three bottles of Australian wine were consumed by the British government in July 2015.
In Quentin Tarantino’s “Pulp Fiction,” Samuel L. Jackson’s character famously says, “Check out the big brain on Brett,” but the big brains here come from FiveThirtyEight, which created this obscenely bloody data set. A movie buff can now impress friends by pinpointing the carnage in every Tarantino film with this list that breaks down the act, minute, and movie where the bleeding or swearing occurred.
The Bristol City Council in England created this hyper-specific data set to identify the location of abandoned shopping carts in the rivers of their fair town. While the relevance is limited to the citizens of Bristol, it certainly helps those wishing to round up the abandoned carts for an impromptu shopping trip.
Bigfoot hunters can now take their hunt to specific geographic locations thanks to the Bigfoot Field Researchers Organization (BFRO), which has put together this Bigfoot Sightings data set. The BFRO has broken everything down into searchable categories such as how many people have witnessed sightings or suggestions for most likely sightings based on geographic clusters of previous sightings.
Even bizarre data sets can be useful and, in this case, sponsored. The Great British Toilet Map, sponsored by a British cleaning product company, was created by the British Toilet Association to promote its “Use Our Loos” public toilet initiative. The map, which charts the location of over 11,000 facilities, is the United Kingdom’s “largest database of publicly accessible toilets,” making it relevant to any Brit or visitor who’s ever needed a loo.
While it may seem morbid, it’s also quite fascinating to examine this data set of the last words of inmates who were executed by the State of Texas. The data set includes information about the inmates and links to their last statements. While the data may not be as practical as some other sets on this list, the historical value of the information cannot be ignored.
Mushrooms can be a delicious addition to a dish or a potentially harmful one. To separate the toxic from the tasty, researcher Jeff Schlimmer took records from the National Audubon Society Field Guide to North American Mushrooms and created this data set that organizes 23 species of gilled mushrooms into categories of “definitely edible,” “definitely poisonous,” and “unknown.” The best advice is to stick to what’s definitely edible.
“Does it fart?” began as a Twitter hashtag that captivated fart-lovers everywhere. To answer the question, this odorous data set was created on OpenDataSoft with 80 different animal species along with the answer to the infamous question. The data set also includes specific notes like, “They do it often and have no shame,” for orangutans.
Toys are playthings for children, but they’re something else entirely for researchers at the Courant Institute of New York University. The NYU Object Recognition Benchmark Dataset is intended to be used for 3D recognition of objects based on shape. To create the data set, the researchers used images of 50 different toys from five categories: airplanes, trucks, four-legged animals, human figures, and cars.
Dog lovers take a lot of time to find the perfect name for their pooch, and this data set from the European Data Portal provides insight into dog owners from the Swiss city of Zurich. The database holds over 7,000 names, including Akosambo's Black Massai Ulani, Zorro of Blue Diamond, and Windy Nights Nice Angel.
Bike sharing is all the rage in major urban areas across the globe. For anyone curious about the duration of every single ride taken in Montreal through the BIXI bike-share service, BIXI provides complete trip history data sets separated by month. While the information is most useful to BIXI itself for setting prices and tracking usage rates, the data sets can also be beneficial to public health researchers looking to show whether bicycling is making Montreal residents healthier.
For anyone who’s ever been annoyed by random text messages advertising free money and crazy scams, there’s the SMS Spam Collection data set. The set comprises over 5,000 messages and is hosted by the University of California Irvine Machine Learning Repository. The information was collected from various sources, including U.K. forum Grumbletext, a Singaporean university database, and a Ph.D. thesis. The messages in the data set are categorized as real messages, aka “ham,” or illegitimate messages, aka “spam.” The ultimate goal is to use the data set to train machines to be able to recognize the difference between the two.
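For the curious, here is a rough idea of what “training a machine” on ham and spam labels can look like. The sketch below is a tiny naive Bayes word-count classifier in Python, which is one common approach and not necessarily what any researcher using this data set did. The six training messages are invented for illustration and do not come from the SMS Spam Collection, and the function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy messages, invented for illustration only; they are
# NOT taken from the SMS Spam Collection.
TRAIN = [
    ("ham", "are we still meeting for lunch today"),
    ("ham", "call me when you get home"),
    ("ham", "see you at the game tonight"),
    ("spam", "free money claim your prize now"),
    ("spam", "win a free prize text now to claim"),
    ("spam", "urgent you have won free cash"),
]


def tokenize(text):
    """Lowercase the message and split it on whitespace."""
    return text.lower().split()


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = {"ham": Counter(), "spam": Counter()}
    label_counts = Counter()
    for label, text in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["ham"]) | set(word_counts["spam"])
    return word_counts, label_counts, vocab


def classify(text, model):
    """Return the label with the higher naive Bayes log score."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Start from the prior: the share of training messages with this label.
        score = math.log(label_counts[label] / total)
        # Add-one (Laplace) smoothing so unseen words never give log(0).
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAIN)
print(classify("claim your free prize", model))
print(classify("see you at lunch", model))
```

With only six made-up messages this is a toy, but the same counting approach could in principle be pointed at the thousands of labeled messages in the real collection.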