Search for data tables.
pip install commasearch==0.0.1
When we search for ordinary written documents, we send words into a search engine and get pages of words back. What if we could search for spreadsheets by sending spreadsheets into a search engine and getting spreadsheets back? The order of the results would be determined by various specialized statistics; just as we use PageRank to find relevant hypertext documents, we can develop other statistics that help us find relevant spreadsheets. Read more `here <http://dada.pink/dada/pagerank-for-spreadsheets>`_ Indexing ------------ To index a new spreadsheet, run this. :: , --index [csv file] For example, :: , --index /home/tlevine/Math Scores 2009 Copy (1).csv \ http://opendata.comune.bari.it/storage/f/2013-09-02T163858/2012_comune_assessori.csv Caches from the indexing process are stored in the ``~/.,`` directory. By default, CSV files that have already been indexed will be skipped; to index the same CSV file again, run with the ``--force`` or ``-f`` option. :: , --index --force [csv file] Once you have indexed a bunch of CSV files, you can search. :: , [csv file] You'll see a bunch of data tables as results. :: $ , 'Math Scores 2009.csv' /home/tlevine/math-scores-2010-gender.csv /home/tlevine/Math Scores 2009.csv /home/tlevine/Math Scores 2009 Copy (1).csv /home/tlevine/math-scores-2009-ethnicity.csv http://opendata.comune.bari.it/storage/f/2013-09-02T163858/2012_comune_assessori.csv mysql://bob:password@localhost/schools To do -------- Change stuff so that the following is true. Indexing ~~~~~~~~~~~~ Indexing a file involves storing the file as a list of list of hashes, where each hash corresponds to a cell, each list to a column, and each list list to a data table. Searching ~~~~~~~~~~~ Searching involves running a join between an input file and all of the indexed files. When a search is run, this happens. * The input file is indexed if it has not alreaby been indexed. * On the input file, **all** factorial groups of columns of all sizes are used as potential join keys. * On the indexed files, all combinations of columns are used; we don't need to use both orderings in this case because the different orderings are present in the searched file. * Results are emitted in the order they are computed; they are not sorted. Other ideas ------------------------ * Add non-exact column matches so that there can be more matches. * Store a preview of the table in the db or load it from the cache so that the web interface can show the preview.