Accueil |  MySQL Buzz |  FAQ |  Feeds |  Proposez votre flux |  Votre feedback |  Archives |  Abonnez vous ! RSS 2.0 Français English Deutsch Español Italiano 日本語 Русский
Air traffic queries in MyISAM and Tokutek (TokuDB)
+8 Vote Up -0 Vote Down

This is next post in series
Analyzing air traffic performance with InfoBright and MonetDB
Air traffic queries in LucidDB
Air traffic queries in InfiniDB: early alpha

Let me explain the reason of choosing these engines. After initial three posts I am often asked "What is baseline ? Can we compare results with standard MySQL engines ?". So there come MyISAM to consider it as base point to see how column-oriented-analytic engines are better here.

However, take into account, that for MyISAM we need to choose proper indexes to execute queries effectively, and there is pain coming with indexes: - load of data is getting slower; - to design proper indexes is additional research, especially when MySQL optimizer is not smart in picking best one.

The really nice thing about MonetDB, InfoBright, InfiniDB is that they do not need indexes, so you may not worry about maintaining them and picking best one. I am not sure about LucidDB, I was told indexes are needed, but creating new index was really fast even on full database, so I guess, it's not B-Tree indexes. So this my reflexion on indexes turned me onto TokuDB direction.

What is so special about TokuDB ? There two things: indexes have special structure and are "cheap", by "cheap" I mean the maintenance cost is constant and independent on datasize. With regular B-Tree indexes cost grows exponentially on datasize (Bradley Kuszmaul from Tokutek will correct me if I am wrong in this statement). Another point with TokuDB, it uses compression, so I expect less size of loaded data and less IO operations during query execution.

So what indexes we need for queries. To recall you details, the schema is available in this post
http://www.mysqlperformanceblog.com/2009/10/02/analyzing-air-traffic-performance-with-infobright-and-monetdb/, and
queries I posted on sheet "Queries" in my summary Spreadsheet.

With Bradley's help we chose next indexes:

PLAIN TEXT
CODE:
  1. KEY `Year` (`Year`,`Month`),
  2.   KEY `Year_2` (`Year`,`DayOfWeek`),
  3.   KEY `DayOfWeek` (`DayOfWeek`,`Year`,`DepDelay`),
  4.   KEY `DestCityName` (`DestCityName`,`OriginCityName`,`Year`),
  5.   KEY `Year_3` (`Year`,`DestCityName`,`OriginCityName`),
  6.   KEY `Year_4` (`Year`,`Carrier`,`DepDelay`),
  7.   KEY `Origin` (`Origin`,`Year`,`DepDelay`)

And I measured load time for both MyISAM and TokuDB in empty table with created indexes.

Load time for MyISAM: 16608 sec
For TokuDB: 19131 sec

Datasize (including indexes)

MyISAM: 36.7GB
TokuDB: 6.7GB

I am a bit surprised that TokuDB is slower loading data, but my guess it is related to compression, and I expect with bigger amount of data TokuDB will be faster MyISAM.

Now to queries. Bradley pointed me that query Q5 SELECT t.carrier, c, c2, c*1000/c2 as c3 FROM (SELECT carrier,
count(*) AS c FROM ontime WHERE DepDelay>10 AND Year=2007 GROUP BY
carrier) t JOIN (SELECT carrier, count(*) AS c2 FROM ontime WHERE
Year=2007 GROUP BY carrier) t2 ON (t.Carrier=t2.Carrier) ORDER BY c3
can be rewritten as
SELECT carrier,totalflights,ndelayed,ndelayed*1000/totalflights as c3 FROM (SELECT carrier,count(*) as totalflights,sum(if(depdelay>10,1,0)) as ndelayed from ontime where year=2007 group by carrier) t order by c3 desc; ( I name it as Query Q5i)

The summary table with queries execution time (in sec, less is better):

Query MyISAM TokuDB
Q0 72.84 50.25
Q1 61.03 55.01
Q2 98.12 58.36
Q3 123.04 66.87
Q4 6.92 6.91
Q5 13.61 11.86
Q5i 7.68 6.96
Q6 123.84 69.03
Q7 187.22 159.62
Q8 (1y) 8.75 7.59
Q8 (2y) 102.17 64.95
Q8 (3y) 104.7 69.76
Q8 (4y) 107.05 70.46
Q8 (10y) 119.54 84.64
Q9 69.05 47.67

For reference I used 5.1.36-Tokutek-2.1.0 for both MyISAM and TokuDB tests.

And if you are interested to compare MyISAM with previous engines:

Query MyISAM MonetDB InfoBright LucidDB InfiniDB
Q0 72.84 29.9 4.19 103.21 NA
Q1 61.03 7.9 12.13 49.17 6.79
Q2 98.12 0.9 6.73 27.13 4.59
Q3 123.04 1.7 7.29 27.66 4.96
Q4 6.92 0.27 0.99 2.34 0.75
Q5 13.61 0.5 2.92 7.35 NA
Q6 123.84 12.5 21.83 78.42 NA
Q7 187.22 27.9 8.59 106.37 NA
Q8 (1y) 8.75 0.55 1.74 6.76 8.13
Q8 (2y) 102.17 1.1 3.68 28.82 16.54
Q8 (3y) 104.7 1.69 5.44 35.37 24.46
Q8 (4y) 107.05 2.12 7.22 41.66 32.49
Q8 (10y) 119.54 29.14 17.42 72.67 70.35
Q9 69.05 6.3 0.31 76.12 9.54

The all results are available in summary Spreadsheet

I especially do not put TokuDB in the same table with analytic oriented databases, to highlight TokuDB is OLTP engine for general purposes.
As you see it is doing better than MyISAM in all queries.


Entry posted by Vadim | 25 comments

Add to: delicious | digg | reddit | netscape | Google Bookmarks

Votes:

You must be logged in with a MySQL.com account to vote on Planet MySQL entries. More information on PlanetMySQL voting.

Planet MySQL © 1995-2008 MySQL AB, 2008-2009 Sun Microsystems, Inc.
Content reproduced on this site is the property of the respective copyright holders.
It is not reviewed in advance by Sun Microsystems, Inc. and does not
necessarily represent the opinion of Sun Microsystem, Inc. or any other party.