可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I really enjoy database design and the whole concept of managing data semantically and all the logic that comes with it.
My knowledge level when it comes to databases is however (I would guess) quite basic - I can model data relationships correctly with ER diagrams, connection tables, handling many-to-many, one-to-many etc etc.
I'm experienced when it comes to programming in general, and I figure my database knowledge is like knowing the basics of object oriented programming, i.e how to model a car-class, inheriting from the vehicle class, containing wheel objects and so on.
Now I would like to further my knowledge about relational databases so that I may confidently say to an employer that I can handle the subject on a professional level.
All I can handle right now is probably my movie database in the back end of my personal website, which probably would collapse if I was Amazon and had to store millions of movies. So there's the subject of scalability right? I'm sure there's a pretty "standard" array of subjects/concepts within database design that you simply must understand and be able to apply in real life if you're going to work with databases on a professional level.
So, I would be very grateful if any database gurus in the field could namedrop some areas, concepts, case studies or anything that would be beneficial to study to get really good at databases. I'm sure there's a vast science lurking here, and I want it.
Thanks in advance!
回答1:
I'll volunteer a list of areas that you might want to consider as aspects of programming with databases. I would not claim that you need to be expert at all of them, or even most of them, in order to be able to program using a DBMS, nor even to program a DBMS. However, they are all topics that are of some relevance at some times - in no particular order:
- Query language design
- Query optimization
- Query rewriting
- Data types
- Storage organization
- Transaction management
- Communications protocols
- Encryption
- Authentication and identification
- Schema design
- Replication
- Backup and restore
- Two-phase commits
- Optimistic concurrency control
- Locking and pessimistic concurrency control
- Authorization
- Label-based access control
- Set theory
- Relational theory
- Distributed query
- Boolean logic
- User-defined types and functions
- Catalog management
- Buffer management
- Sorting
- Internationalization (I18N), Localization (L10N), Globalization (G11N)
- Quantifiers
- Auditing
- Triggers
- Stored procedures
I make no claims of completeness or minimality, either.
回答2:
The standard text in the field is "An Introduction to Database Systems", by C. J. Date.
I have twenty years C experience; I read it, thought it excellent and I wrote a relational database because of it (a proper one, not this SQL malarky!).
回答3:
A whole other area is dimensional modelling and data warehousing.
I had been working with relational modelling for years, and then I read The Data Warehouse Toolkit and received an entirely new view of how it could be used.
回答4:
Get more dirt from C. J. Date's Database In Depth: Relational Theory for Practitioners if his An Introduction to Database Systems isn't grubby enough for you.
Seriously, these two books will give you a great deal more knowledge about RDBMSes, in a great deal less space, than many other professional database workers possess. Database in Depth, in particular, looks at how to think about databases relationally even when the language doesn't support it, and how to trick SQL into being a close-to-relational language.
回答5:
being just a student of Databases I can only speak from my limited scope but I can suggest two sites that may help...
http://database-programmer.blogspot.com/2008/09/comprehensive-table-of-contents.html
This is Kenneth Downs site, he goes from the very basics if SQL and delves into more complex subjects. The man has written a framework around DB's after all.
Another one is High Scalability...
http://highscalability.com/
They get into every realm of DB's.
Hope this helps.
回答6:
I think set + relational algebra is a thing that most database users know little about but would do well to learn. When you appreciate the logic involved behind mapping one relation to another you start to see more clearly why things like normalization are good, why NULLs are best avoided if possible, etc. You also start to see the flaws in SQL compared to purer relational query languages, where features impose limitations on the paradigm due to performance reasons, etc.
回答7:
Well it's always good designing examples... See if there's anyone you know who needs a database for something. But studying VLDB (Very Large DataBase) techniques might be useful depending on the industry you're interested in.
回答8:
I believe optimization on existing databases could be interesting to dwelve into. I.e. why you should denormalize tables.
Some basic relation algebra is useful knowledge and closely related to set theory.
回答9:
A very common scenario is having to map ugly databases to an entity model which is not necessary reflected directly in the structure of the DB. Working out which way is best to model the data in your domain can be tricky.
Full text search and XML are subjects that seem to be coming up more and more.
I have no experience with it but I know DB2 (of which there is a trial version) has some crazy new features)
Have fun :-)
回答10:
http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html
http://blogs.msdn.com/pathelland/archive/2007/07/23/normalization-is-for-sissies.aspx
http://www.25hoursaday.com/weblog/2007/08/03/WhenNotToNormalizeYourSQLDatabase.aspx
http://itc.conversationsnetwork.org/shows/detail571.html
http://www.scribd.com/doc/2592098/DVPmysqlucFederation-at-Flickr-Doing-Billions-of-Queries-Per-Day
http://highscalability.com/flickr-architecture
http://highscalability.com/ebay-architecture
http://www.addsimplicity.com/downloads/eBaySDForum2006-11-29.pdf
http://blog.maxindelicato.com/2008/12/scalability-strategies-primer-database-sharding.html
http://blog.maxindelicato.com/2008/12/the-ihsdf-theorem-a-proposed-theorem-for-the-tradeoffs-in-horizontally-scalable-systems.html
http://www.iamcal.com/talks/
http://natishalom.typepad.com/nati_shaloms_blog/2009/04/writing-your-own-scalable-twitter.html
http://www.mysqlperformanceblog.com/
http://www.cecs.uci.edu/~papers/ipdps07/pdfs/SMTPS-201-paper-1.pdf
http://video.google.com/videoplay?docid=7278544055668715642
http://www.niallkennedy.com/blog/uploads/flickr_php.pdf
回答11:
It depends on what you want to do with your databases, how does your data look, what are your work flows, how many servers, clients and databases you have to work with ...
So let's pretend that you, like me, have to deal with several databases, not to big (< 100 GB each) and you have many clients with many different needs that make you develop many custom solutions, like producing custom reports or exports. That makes you more of a programmer than a DBA. And what you need is productivity, before performance.
The best solution I came up with, in that situation, is getting rid of SQL, as much as possible.
You can achieve that by using some kind of ORM, either home made or existing ones, and thus trading SQL scripting for object programming.
Doing that I do in minutes what would takes hours with SQL.
回答12:
Disclaimer: not an expert in database design.
Some of the performance issues can be handled either by:
- denormalizing your database, so to reduce the number of tables to join
- adding indexes
- filtering should be done so that you first remove the largest of the non matching data, then you cherry pick the next condition on the reduced set. It's better to go from 100 values -> 10 matching first condition -> 1 matching first and second condition than 100 values -> 80 matching second condition -> 1 matching first and second condition. Seems trivial, but it's important to keep in mind.
divide et impera is the motto for scalability. If you have something that scales in a non-linear way, say O(N^2) it makes sense to keep N as low as you can, and you should partition your data set into smaller sets, assuming they are independent and you can work out the partitioning. An example of this is sharding, typically used to keep databases of users in large social websites. (NB: an example, I would not implement it this way) Instead of having a huge database with all the users, they have 26 servers (one for each letter of the alphabet), then they put all the nicknames with the same first letter in the same server. This has the following advantages:
a. you balance the load on different machines
b. if one machine crashes, you make the site unaccessible only to a subset of your users, not to all of them
c. you preselect the search with a highly discriminating criterium (the first letter), then perform the second search (the username)
d. you reduce the number of entries each database has.
回答13:
Don't forget representing hierarchy and/or graphs in databases. It can be a pain & there's no right answer.
The standard techniques (for hierarchy at least) have been mentioned in these SO posts:
- Reporting Hierarchy in Mutltiple Tables
- What is the most efficient/elegant way to parse a flat table into a tree?
edit: There's also spatial database applications for GIS use, where you have data structures and/or indices based on point locations using R-trees and the like. Using these is a little bit different than the regular non-spatial database features.
回答14:
To my mind there are three "tracks" with database skills: Developer, DBA and Architect. From a development perspective you want to focus on development, understand Architect and pick up as much DBA stuff as you need along the way.
As a developer the key thing (to my mind) would be to get your SQL to a really good standard. As an interviewer if I'm looking for a developer I don't care if you can design databases as much as how you can write queries. Assuming you know about your basic CRUD commands, do you know about:
Stored Procedures (not just how to use them but when and why)
Views (ditto, including materialised views)
Triggers (insert, update, delete, how and why)
Cursors (especially impact on performance)
Referential integrity
Transactions
Indexes
Adding defaults, constraints and identities to tables
Complex use of group by and having
Functions especially:
- Date and time manipulation
- String manipulation
- Handling nulls
You should be able to pull any data you need from your database using SQL alone, you should never need to manipulate or parse it in any way using your procedural code (you may choose to but it will be a choice rather than you didn't know how to do it with SQL).
As a developer the one booking I'd look at is Joe Celko's SQL for Smarties. Lots of SQL to do things you may never have really thought about being able to do in SQL.
One of the best ways to learn this stuff is, tedious as it seems, writing reports (management information). I've seen so many people moan about writing reports being tedious and then do it really, really badly (and not just because they didn't try). Reports tend to be close to pure SQL so you have to really get to know the tools at hand and a complex report really exposes those who know SQL from those who don't. People also tend not to want to wait too long for them so performance is key too.
Look at your current database and come up with a bunch of really really awkward things someone might actually want to know. Think marketing, trends, most and least popular. Then try and combine a bunch of them into one query.
In terms of performance I'd also be trying to get inside how the query optimizer works, how it makes decisions about when to use an index and when to table scan, when indexes will help and when they'll hinder.
A good developer doesn't just write good queries, they write quick, maintainable queries. To really get to grips with this you'll need to play around with a database with a dozen (or more tables) containing, ideally, millions of rows. That's when you start seeing queries you thought were fine dragging their heels.
The architect/designer stuff others have covered pretty well. All I'd say on the subject is that for every database that has to be designed there are hundreds of queries that need to be written for it. You might want to consider that proportional break down of work when you're upskilling and make sure your querying is really up to scratch.
In terms of links it depends on the platform - all this stuff tends to be platform specific. But then that's what google is for.
Not I suspect entirely what you want but worth knowing as a lot of people who think they know SQL really really don't...
回答15:
I highly recommend to start with www.dbdebunk.com. It has a lot of practical stuff in oppose to theory. The site is a little outdated, but still useful. Even commercial content isn't too expensive, if you really like to become database professional.
回答16:
i would advise narrowing your scope a little. pick an sql server and become expert at it ... for example get mysql, learn the difference between the storage types, the replication types, etc. implement replication in a couple different ways. get a large dataset and try to optimize queries. do some pivots and optimize your indices for that. investigate backup strategies. see how to increase performance in replication and backups when you have a 10gb database that consistently adds 100,000 transactions every day. write software to insert the records and scripts to do replication and backups.
its hard to become an effective dba with no real experience when you try to cover all sql servers. just focus on one ... i'd suggest mysql or mssql, but whatever floats your boat.
-don
回答17:
There is only one rigorous technique for conceptually modeling a relational database schema that I know of (and I've spent a lot of time looking). It's confusingly named "Object-Role Modeling". Here are a couple references.
http://www.agilemodeling.com/artifacts/ormDiagram.htm
http://www.tdan.com/view-articles/5033
http://en.wikipedia.org/wiki/Object_role_modeling
http://en.wikipedia.org/wiki/NORMA
and here's a plugin for Visual Studio
回答18:
Well, to be frankly a database is just a way of storing and accessing data.
Pretty much what a file system does too.
A parallel from LDAP is that it is a protocol and by this not a definition of what you can do with it and how it should be implemented, the same could be said about SQL.
So if you want to know more about databases you are actually saying you want to know more about the SQL protocol and/or how to store and fetch data.
You might be interested in searching around for what a 'B-Tree' is and how it is used.
Another thing worth looking up is EAV (Entity-Attribute-Value) and why schema's are so important for it.
With that knowledge you could actually role out your own DB and while doing that appreciating what RDBM's already have done for you.
If you want a more practical approach look at the documentation that the open-source PostgreSQL provides, perhaps starting with this.
回答19:
You could start by reading one of the (almost recent) review papers that focuses on the foundations and trends in Databases: The anatomy of databases