How BigQuery Came About
In late 2010, the site director of the Google Seattle office pulled several engineers (one of whom is an author of this book) off their projects and gave them a mission: to build a data marketplace. We tried to figure out the best way to come up with a viable marketplace. The chief issue was data size, because we didn't want to provide just a download link. A data marketplace is impractical if people need to download terabytes of data in order to work with it. How would you build a data marketplace that didn't require users to start by downloading the datasets to their own machines?
Enter a principle popularized by Jim Gray, the database pioneer. When you have "big data," Gray said, "you want to move the computation to the data, rather than move the data to the computation." Gray elaborated:
The other key issue is that as the datasets get larger, it is no longer possible to just FTP or grep them. A petabyte of data is very hard to FTP! So at some point, you need indices and you need parallel data access, and this is where databases can help you. For data analysis, one possibility is to move the data to you, but the other possibility is to move your query to the data. You can either move your questions or the data. Often it turns out to be more efficient to move the questions than to move the data.12
In the case of the data marketplace that we were building, users would not need to download the datasets to their own machines if we made it possible for them to bring their computation to the data. We would not need to provide a download link, because users could work on their data without the need to move it around.13
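Gray's principle can be sketched with a toy example. The following Python script uses the standard-library sqlite3 module as a stand-in for a remote SQL engine; the table name and values are invented for illustration. It computes the same aggregate two ways: by fetching every row and aggregating locally, and by pushing a single query to the engine.

```python
import sqlite3

# Hypothetical dataset standing in for a large remote table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("us", 10.0), ("us", 5.0), ("eu", 7.5)])

# Moving the data to the computation: fetch every row, then aggregate locally.
rows = conn.execute("SELECT region, amount FROM sales").fetchall()
local_total = sum(amount for region, amount in rows if region == "us")

# Moving the computation to the data: ship one small query, receive one small result.
(pushed_total,) = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE region = 'us'").fetchone()

assert local_total == pushed_total == 15.0
```

With three rows the difference is invisible, but at terabyte scale the first approach transfers the whole table while the second transfers one number.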
We, the Googlers who were tasked with building a data marketplace, made the decision to postpone that project and focus on building a compute engine and storage system in the cloud. After making sure that users could do something with the data, we would go back and add data marketplace features.
In what language should users write their computation when bringing computation to the data in the cloud? We chose SQL because of three key features. First, SQL is a versatile language that allows a wide range of people, not just developers, to ask questions and solve problems with their data. This ease of use was extremely important to us. Second, SQL is "relationally complete," meaning that any computation over the data can be done using SQL. SQL is not just easy and accessible; it is also very powerful. Finally, and quite important for a choice of a cloud computation language, SQL is not "Turing complete" in a key way: it always terminates.14 Because it always terminates, it is okay to host SQL computations without worrying that someone will write an infinite loop and monopolize all the computing power in a datacenter.
Next, we had to choose a SQL engine. Google had a number of internal SQL engines that could operate over data, including some that were very popular. The most advanced engine was called Dremel; it was used heavily at Google and could process terabytes' worth of logs in seconds. Dremel was quickly winning people over from building custom MapReduce pipelines to asking questions of their data.
Dremel had been created in 2006 by engineer Andrey Gubarev, who was tired of waiting for MapReduces to finish. Column stores were becoming popular in the academic literature, and he quickly came up with a column storage format (Figure 1-4) that could handle the Protocol Buffers (Protobufs) that were ubiquitous throughout Google.
Figure 1-4. Column stores can reduce the amount of data being read by queries that process all rows but not all columns
Although column stores are great in general for analytics, they are particularly useful for log analysis at Google because many teams operate over a type of Protobuf that has hundreds of thousands of columns. If Andrey had used a typical record-oriented store, users would have needed to read the files row by row, thus reading in a huge amount of data in the form of fields that they were going to discard anyway. By storing the data column by column, Andrey made it so that if a user needed just a few of the thousands of fields in the log Protobufs, they would need to read only a small fraction of the overall data size. This was one of the reasons why Dremel was able to process terabytes' worth of logs in seconds.
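The saving can be shown with a toy model. In this sketch the record layout, field names, and field sizes are all invented, and real column stores also apply encoding and compression, which is ignored here; the point is only that a query touching one field reads far fewer bytes from a columnar layout than from a row-oriented one.

```python
# Invented log records: a query wants only `latency_ms`, but each record
# also carries a large `payload` field that would be discarded.
records = [
    {"timestamp": i, "user": f"u{i}", "latency_ms": i % 100, "payload": "x" * 50}
    for i in range(1000)
]

# Row-oriented storage: reading one field still touches every byte of each record.
row_bytes_read = sum(len(str(v)) for rec in records for v in rec.values())

# Column-oriented storage: each field's values are stored contiguously,
# so the query reads just the `latency_ms` column.
columns = {name: [rec[name] for rec in records] for name in records[0]}
col_bytes_read = sum(len(str(v)) for v in columns["latency_ms"])

print(row_bytes_read, col_bytes_read)  # the column scan reads far fewer bytes
assert col_bytes_read * 10 < row_bytes_read
```

With hundreds of thousands of columns rather than four, the ratio becomes dramatic, which is what made columnar storage such a good fit for Google's log Protobufs.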
The other reason why Dremel was able to process data so fast was that its query engine used distributed computing. Dremel scaled to thousands of workers by structuring the computation as a tree, with the filters happening at the leaves and the aggregation happening toward the root.
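A minimal sketch of that tree-structured execution follows. The shard layout, predicate, and two-level tree are invented for illustration, and where a real engine runs the leaves in parallel across machines, this sketch runs them in a loop; the shape of the computation is the point.

```python
# Dremel-style tree execution: leaf workers filter their shard of the data,
# intermediate nodes partially aggregate, and the root produces the answer.

def leaf(shard, predicate):
    # Filters run at the leaves, next to the data.
    return sum(1 for row in shard if predicate(row))

def aggregate(partials):
    # Intermediate nodes combine partial results on the way up the tree.
    return sum(partials)

shards = [list(range(i, i + 25)) for i in range(0, 100, 25)]  # 4 leaf shards
predicate = lambda row: row % 2 == 0                          # count even rows

partials = [leaf(s, predicate) for s in shards]               # fan out to leaves
result = aggregate([aggregate(partials[:2]),                  # two mid-level nodes
                    aggregate(partials[2:])])                 # ...then the root

print(result)  # 50 even numbers in 0..99
```

Because each level only passes small partial aggregates upward, adding more leaves scales the scan without flooding the root, which is how Dremel spread a query across thousands of workers.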
By 2010, Google was scanning petabytes of data per day using Dremel, and many people in the company were using it in some form or another. It was the perfect tool for our nascent data marketplace team to pick up and use.
As the team productized Dremel, adding a storage system, making it self-tuning, and exposing it to external users, the team realized that a cloud version of Dremel was perhaps even more interesting than its original mission. The team renamed itself "BigQuery," following the naming convention for "Bigtable," Google's NoSQL database. At Google, Dremel is used to query files that sit on Colossus, Google's file store for storing data. BigQuery added a storage system that provides a table abstraction, not just a file abstraction. This storage system was key in making BigQuery simple to use and always fast, because it allowed key features like ACID (Atomicity, Consistency, Isolation, Durability) transactions and automatic optimizations, and it meant that users didn't need to manage files.
Initially, BigQuery retained its Dremel roots and was focused on scanning logs. However, as more customers wanted to do data warehousing and more complex queries, BigQuery added improved support for joins and advanced SQL features like analytic functions. In 2016, Google launched support for standard SQL in BigQuery, which allowed users to run queries using standards-compliant SQL rather than the awkward initial "DremelSQL" dialect.
BigQuery did not start out as a data warehouse, but it has evolved into one over the years. There are good things and bad things about this evolution. On the positive side, BigQuery was designed to solve problems people have with their data, even if those problems don't fit nicely into data warehousing models. In this way, BigQuery is more than just a data warehouse. On the downside, however, a few data warehousing features that people expected, like a Data Definition Language (DDL; for example, CREATE statements) and a Data Manipulation Language (DML; for example, INSERT statements), were missing until recently. That said, BigQuery has been focused on a dual path: first, adding differentiated features that Google is in a unique position to provide; and second, becoming a great data warehouse in the cloud.