People involved in a Machine Learning (ML) project probably know how difficult it can be to find a dataset for a model. After working on a problem and framing it as an ML problem, you need data to train the model. Maybe there is some data, but it’s not enough to feed a neural network. To help with this search, here is a list of resources containing repositories of datasets covering different topics:
Google Dataset Search: This is a search engine that lets you find datasets. It contains over 25 million datasets.
Kaggle: Kaggle provides a vast number of datasets that can be useful to everyone from beginners to experts.
Awesome Public Datasets: This is a large list of datasets on many different topics, maintained by GitHub contributors.
VisualData: It’s a list of computer vision datasets organized by category.
CMU Libraries: This is a list of high-quality datasets from Carnegie Mellon University.
Global Open Data Index: Here you can find a list of countries and their open data, ranked by the number of data sources, how often they are updated, and their quality. There are also links to download the data resources.
EU Open Data Portal: List of datasets released by the European Union.
AWS Open Data: Amazon has launched its own repository of publicly available datasets. The quality of the datasets published is very high.
UC Irvine Machine Learning: This is a repository of several datasets for Machine Learning practitioners from the University of California, Irvine.
Microsoft Research Open Data: A repository sharing several interesting datasets for researchers and ML practitioners. It currently hosts around 100 datasets on different topics, such as biology.
Academic Torrents: A large list of Machine Learning datasets. You need a BitTorrent client to download them.
Wikipedia list of datasets for Machine Learning: A table listing several datasets on different topics.
Knoema: A big collection of publicly available data and statistics on the web. Users have access to nearly 1000 topics from more than 1200 data sources, updated daily.
Eurostat: High-quality statistics on numerous industries, provided by the EU statistics office.
Harvard Dataverse: An open-source data repository that researchers use to share and manage research data.
DataHub: An open framework and tool for building data systems and accessing datasets.