David Dowe's data links

[See also Ray Solomonoff (1926-2009) 85th memorial conference (Wed 30 Nov - Fri 2 Dec 2011), 1st Call for Papers.]

Machine learning, statistics and "data mining" data
U. Calif. Irvine (UCI) ICS KDD Archive, Machine Learning Repository and other machine learning repositories and sites.
NIST (U.S.)'s Info. Tech. Lab.'s Statistical Reference Datasets (StRD) and Dataset archives.
CMU Dept. of Statistics's StatLib links, Datasets Archive and "other places" and statistical archives.
Baylor University Libraries Computer Science Data Repositories.
Machine Learning Resources - Data Repositories and competitions, maintained by David Aha.
Online Machine Learning Resources: ML Benchmarks and other Data Sources.
Bayesian Network data sets (or Bayes Net data sets) - see also Bayesian Networks using MML.
Dept of Computer Science, University of Toronto's Data for Evaluating Learning in Valid Experiments (DELVE)'s Datasets Summary Table, including the Titanic dataset.
KDNuggets's Datasets for "Data Mining" and "Data Mining" Competitions.
"The Data Mine"'s Data Sources.
Rob Hyndman's Time Series Data Library and CEC2000's Time series prediction competitions.
UCR Time Series Data Mining Archive, linked to by Eamonn Keogh.
Data on the Web - Faculty of Business and Economics, University of Sydney, Australia.
AskDrMath (The Math Forum - Math Library)'s Data Sets, Prob/Stat and Statistics: Data Sets.
Brookhaven Protein Database (old site) Gopher; SWISS-PROT Protein Sequence Database and CSSE Contig Restriction Site Mapping and links (human genome project, etc.).
Kathleen Cuningham Foundation Consortium for research into Familial Breast cancer (http://www.kconfab.org)'s policies and procedures for accessing kConFab data.
European Pulsar Network Data Archive: index, mirror site and disclaimer, with Russell Edwards's comments.
Statistical Society of Canada's Case Studies in Data Analysis for 2000.
Bayesian networks repository (started by Nir Friedman); Bayesian networks and Related sites.
University of Fribourg Section of Chemistry's Useful Chemistry Links and databases.
ICMAS-2000: market game and ICMAS-00 Trading Agent Competition Overview.
Linguistic Data Consortium (LDC): LDC-Online, LDC Catalog(ue), Obtaining corpora and Search LDC Web site. Links to text analysis resources.
Geoff McLachlan and David Peel's "Finite Mixture Models" and data sets.
Australian Antarctic Division (AAD) and Australian Antarctic Data Centre.
Search and Rescue Data collection form (HTML, Word97, postscript, pdf) - Charles Twardy.


Competitions
Machine Learning Resources - Data Repositories and competitions, maintained by David Aha.
KDNuggets's Datasets for "Data Mining" and "Data Mining" Competitions.
Rob Hyndman's Time Series Data Library and CEC2000's Time series prediction competitions.
ICMAS-2000: market game and ICMAS-00 Trading Agent Competition Overview.
KDD Cup 2000, e-mail: kddcup2000@bluemartini.com.
The Insurance Company (TIC) Benchmark homepage.


Other data
Some links to chess and games data.
Sports: Australian Rules football with data since 1993, data since 1998, other footy statistics and some other sports data.
Medical links (with some Medical data links), and EEG data (electroencephalograph data) from UCI KDD Archive (http://kdd.ics.uci.edu).
www.statoo.com: "the portal to statistics on the internet" (so they say).


Links to Random number generation software
(Pseudo-)Random number generation software in Fortran:
uniform (for multinomial), Gaussian (Normal), von Mises (circular) and Poisson.
Random number generation (and other) publications by Chris Wallace: TR #89/123 (Feb. 1989), 1990, 1996.
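
For readers without a Fortran compiler, the four distributions named above can be sampled roughly as follows. This is a minimal Python sketch using only the standard library; it is not a port of the Fortran routines linked here. The helper names `sample_multinomial` and `sample_poisson`, the seed and all parameter values are assumptions made for illustration, and the Poisson helper uses Knuth's multiplication method, which is only sensible for small means.

```python
import math
import random


def sample_multinomial(probs, rng):
    """Draw one category index by inverting the cumulative distribution."""
    u = rng.random()  # uniform on [0, 1)
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return index
    return len(probs) - 1  # guard against floating-point round-off


def sample_poisson(lam, rng):
    """Knuth's multiplication method; suitable for small lam only."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


rng = random.Random(12345)  # hypothetical fixed seed so runs are repeatable
category = sample_multinomial([0.2, 0.5, 0.3], rng)  # uniform -> multinomial
gaussian = rng.gauss(0.0, 1.0)         # mean 0, standard deviation 1
angle = rng.vonmisesvariate(0.0, 2.0)  # mean direction 0, concentration 2
arrivals = sample_poisson(3.0, rng)    # Poisson with mean 3
```

The multinomial draw shows why a uniform generator is listed "for multinomial": one uniform variate, compared against the running sum of the category probabilities, selects the category.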

http://www.almaden.ibm.com/cs/quest: synthetic market-basket dataset generator.
http://www.almaden.ibm.com/cs/people/bayardo/vinci/maxminer.html: max-miner algorithm, which generates frequent itemsets, in order to test your algorithm output. (Use the FINDALL option, unless you want only the maximal frequent itemsets.)
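
The difference between all frequent itemsets and only the maximal ones can be seen on a toy example. The sketch below is a brute-force Python illustration, not the max-miner algorithm itself; the function names, the transactions and the support threshold are all invented for this example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Brute force: map every frequent itemset to its support count."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            candidate_set = frozenset(candidate)
            support = sum(1 for t in transactions if candidate_set <= t)
            if support >= min_support:
                frequent[candidate_set] = support
                found_any = True
        if not found_any:
            break  # no frequent set of this size, so none of any larger size
    return frequent


def maximal_only(frequent):
    """Keep the frequent itemsets that have no frequent proper superset."""
    return {s for s in frequent if not any(s < other for other in frequent)}


# Hypothetical toy market-basket data, invented for this example.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"milk"}),
]
all_frequent = frequent_itemsets(transactions, min_support=2)
maximal = maximal_only(all_frequent)
```

With a minimum support of 2, `all_frequent` holds five itemsets (the three single items plus {bread, butter} and {bread, milk}), while `maximal` keeps only the two pairs. An algorithm that reports every frequent itemset should be checked against the first result, not the second.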

http://lib.stat.cmu.edu/DASL/DataArchive.html.

Random number generators and Monte Carlo methods: Information Servers, Theory, Applications and Software.

Other RNG software: "C Programming"; "Code Snippets"; "Portable functions and headers"; "Random number functions"; "Rand1.C".


Data analysis and "data mining"
Minimum Message Length (MML), an operational form of Occam's razor [see also Minimum Description Length, MDL].
Clustering, mixture modelling and unsupervised learning.


Miscellaneous and other links
Chris Wallace (1933-2004) (developer of MML in 1968),
Wallace, C.S. (2005) [posthumous], Statistical and Inductive Inference by Minimum Message Length, Springer (Series: Information Science and Statistics), 2005, XVI, 432 pp., 22 illus., Hardcover, ISBN: 0-387-23795-X [table of contents, chapter headings and more],
Wallace, C.S. (with D. L. Dowe), "Minimum Message Length and Kolmogorov complexity", Comp. J., Vol 42, No. 4 (1999), pp270-283 [this article is the Computer Journal's most downloaded "full text as .pdf" - see, e.g., here],
Bayesian networks using MML,
clustering and mixture modelling,
decision trees and decision graphs using MML,
"Minimum Message Length, MDL and Generalised Bayesian Networks with Asymmetric Languages", by J. W. Comley and D.L. Dowe; Chapter 11 (pp265-294) in P. Grunwald, M. A. Pitt and I. J. Myung (eds.), Advances in Minimum Description Length: Theory and Applications, M.I.T. Press, April 2005, ISBN 0-262-07262-9. {This is about Generalised Bayesian nets (or even the special case of hybrid Bayesian nets), generalising MML Bayesian nets or MML Bayesian networks or MML Bayes nets; and it deals with a mix of both continuous and discrete variables. (See also Comley and Dowe (2003), .pdf.)}
Occam's razor (Ockham's razor),
Snob (program for MML clustering and mixture modelling),
(econometric) time series using MML,
medical research,
a probabilistic sports prediction competition (and further reading on probabilistic scoring),
chess and game theory research,
TheHungerSite, TheRainforestSite and other "do-goody" stuff: improving the world and saving the planet.

  • Please e-mail me if you would like to know more.
  • This page, http://www.csse.monash.edu.au/~dld/datalibrary.html, was last updated no earlier than 18th Apr. 2000.

    Copyright David L. Dowe, Monash University, Australia, 15 March 2000.