What are the most common SQL anti-patterns? [close

2019-01-02 16:49发布

问题:

All of us who work with relational databases have learned (or are learning) that SQL is different. Eliciting the desired results, and doing so efficiently, involves a tedious process partly characterized by learning unfamiliar paradigms, and finding out that some of our most familiar programming patterns don't work here. What are the common antipatterns you've seen (or yourself committed)?

回答1:

I am consistently disappointed by most programmers' tendency to mix their UI-logic in the data access layer:

SELECT
    FirstName + ' ' + LastName as "Full Name",
    case UserRole
        when 2 then "Admin"
        when 1 then "Moderator"
        else "User"
    end as "User's Role",
    case SignedIn
        when 0 then "Logged in"
        else "Logged out"
    end as "User signed in?",
    Convert(varchar(100), LastSignOn, 101) as "Last Sign On",
    DateDiff('d', LastSignOn, getDate()) as "Days since last sign on",
    AddrLine1 + ' ' + AddrLine2 + ' ' + AddrLine3 + ' ' +
        City + ', ' + State + ' ' + Zip as "Address",
    'XXX-XX-' + Substring(
        Convert(varchar(9), SSN), 6, 4) as "Social Security #"
FROM Users

Normally, programmers do this because they intend to bind their dataset directly to a grid, and its just convenient to have SQL Server format server-side than format on the client.

Queries like the one shown above are extremely brittle because they tightly couple the data layer to the UI layer. On top of that, this style of programming thoroughly prevents stored procedures from being reusable.



回答2:

Here are my top 3.

Number 1. Failure to specify a field list. (Edit: to prevent confusion: this is a production code rule. It doesn't apply to one-off analysis scripts - unless I'm the author.)

SELECT *
Insert Into blah SELECT *

should be

SELECT fieldlist
Insert Into blah (fieldlist) SELECT fieldlist

Number 2. Using a cursor and while loop, when a while loop with a loop variable will do.

DECLARE @LoopVar int

SET @LoopVar = (SELECT MIN(TheKey) FROM TheTable)
WHILE @LoopVar is not null
BEGIN
  -- Do Stuff with current value of @LoopVar
  ...
  --Ok, done, now get the next value
  SET @LoopVar = (SELECT MIN(TheKey) FROM TheTable
    WHERE @LoopVar < TheKey)
END

Number 3. DateLogic through string types.

--Trim the time
Convert(Convert(theDate, varchar(10), 121), datetime)

Should be

--Trim the time
DateAdd(dd, DateDiff(dd, 0, theDate), 0)

I've seen a recent spike of "One query is better than two, amiright?"

SELECT *
FROM blah
WHERE (blah.Name = @name OR @name is null)
  AND (blah.Purpose = @Purpose OR @Purpose is null)

This query requires two or three different execution plans depending on the values of the parameters. Only one execution plan is generated and stuck into the cache for this sql text. That plan will be used regardless of the value of the parameters. This results in intermittent poor performance. It is much better to write two queries (one query per intended execution plan).



回答3:

  • Human readable password fields, egad. Self explanatory.

  • Using LIKE against indexed columns, and I'm almost tempted to just say LIKE in general.

  • Recycling SQL-generated PK values.

  • Surprise nobody mentioned the god-table yet. Nothing says "organic" like 100 columns of bit flags, large strings and integers.

  • Then there's the "I miss .ini files" pattern: storing CSVs, pipe delimited strings or other parse required data in large text fields.

  • And for MS SQL server the use of cursors at all. There's a better way to do any given cursor task.

Edited because there's so many!



回答4:

Don't have to dig deep for it: Not using prepared statements.



回答5:

Using meaningless table aliases:

from employee t1,
department t2,
job t3,
...

Makes reading a large SQL statement so much harder than it needs to be



回答6:

var query = "select COUNT(*) from Users where UserName = '" 
            + tbUser.Text 
            + "' and Password = '" 
            + tbPassword.Text +"'";
  1. Blindly trusting user input
  2. Not using parameterized queries
  3. Cleartext passwords


回答7:

My bugbears are the 450 column Access tables that have been put together by the 8 year old son of the Managing Director's best friends dog groomer and the dodgy lookup table that only exists because somebody doesn't know how to normalise a datastructure properly.

Typically, this lookup table looks like this:

ID INT,
Name NVARCHAR(132),
IntValue1 INT,
IntValue2 INT,
CharValue1 NVARCHAR(255),
CharValue2 NVARCHAR(255),
Date1 DATETIME,
Date2 DATETIME

I've lost count of the number of clients I've seen who have systems that rely on abominations like this.



回答8:

The ones that I dislike the most are

  1. Using spaces when creating tables, sprocs etc. I'm fine with CamelCase or under_scores and singular or plurals and UPPERCASE or lowercase but having to refer to a table or column [with spaces], especially if [ it is oddly spaced] (yes, I've run into this) really irritates me.

  2. Denormalized data. A table doesn't have to be perfectly normalized, but when I run into a table of employees that has information about their current evaluation score or their primary anything, it tells me that I will probably need to make a separate table at some point and then try to keep them synced. I will normalize the data first and then if I see a place where denormalization helps, I'll consider it.

  3. Overuse of either views or cursors. Views have a purpose, but when each table is wrapped in a view it's too much. I've had to use cursors a few times, but generally you can use other mechanisms for this.

  4. Access. Can a program be an anti-pattern? We have SQL Server at my work, but a number of people use access due to it's availabilty, "ease of use" and "friendliness" to non-technical users. There is too much here to go into, but if you've been in a similar environment, you know.



回答9:

use SP as the prefix of the store procedure name because it will first search in the System procedures location rather than the custom ones.



回答10:

Overuse of temporary tables and cursors.



回答11:

For storing time values, only UTC timezone should be used. Local time should not be used.



回答12:

using @@IDENTITY instead of SCOPE_IDENTITY()

Quoted from this answer :

  • @@IDENTITY returns the last identity value generated for any table in the current session, across all scopes. You need to be careful here, since it's across scopes. You could get a value from a trigger, instead of your current statement.
  • SCOPE_IDENTITY returns the last identity value generated for any table in the current session and the current scope. Generally what you want to use.
  • IDENT_CURRENT returns the last identity value generated for a specific table in any session and any scope. This lets you specify which table you want the value from, in case the two above aren't quite what you need (very rare). You could use this if you want to get the current IDENTITY value for a table that you have not inserted a record into.


回答13:

Re-using a 'dead' field for something it wasn't intended for (e.g. storing user data in a 'Fax' field) - very tempting as a quick fix though!



回答14:

select some_column, ...
from some_table
group by some_column

and assuming that the result will be sorted by some_column. I've seen this a bit with Sybase where the assumption holds (for now).



回答15:

SELECT FirstName + ' ' + LastName as "Full Name", case UserRole when 2 then "Admin" when 1 then "Moderator" else "User" end as "User's Role", case SignedIn when 0 then "Logged in" else "Logged out" end as "User signed in?", Convert(varchar(100), LastSignOn, 101) as "Last Sign On", DateDiff('d', LastSignOn, getDate()) as "Days since last sign on", AddrLine1 + ' ' + AddrLine2 + ' ' + AddrLine3 + ' ' + City + ', ' + State + ' ' + Zip as "Address", 'XXX-XX-' + Substring(Convert(varchar(9), SSN), 6, 4) as "Social Security #" FROM Users

Or, cramming everything into one line.



回答16:

  • The FROM TableA, TableB WHERE syntax for JOINS rather than FROM TableA INNER JOIN TableB ON

  • Making assumptions that a query will be returned sorted a certain way without putting an ORDER BY clause in, just because that was the way it showed up during testing in the query tool.



回答17:

Learning SQL in the first six months of their career and never learning anything else for the next 10 years. In particular not learning or effectively using windowing/analytical SQL features. In particular the use of over() and partition by.

Window functions, like aggregate functions, perform an aggregation on a defined set (a group) of rows, but rather than returning one value per group, window functions can return multiple values for each group.

See O'Reilly SQL Cookbook Appendix A for a nice overview of windowing functions.



回答18:

I need to put my own current favorite here, just to make the list complete. My favorite antipattern is not testing your queries.

This applies when:

  1. Your query involves more than one table.
  2. You think you have an optimal design for a query, but don't bother to test your assumptions.
  3. You accept the first query that works, with no clue about whether it's even close to optimized.

And any tests run against atypical or insufficient data don't count. If it's a stored procedure, put the test statement into a comment and save it, with the results. Otherwise, put it into a comment in the code with the results.



回答19:

Temporary Table abuse.

Specifically this sort of thing:

SELECT personid, firstname, lastname, age
INTO #tmpPeople
FROM People
WHERE lastname like 's%'

DELETE FROM #tmpPeople
WHERE firstname = 'John'

DELETE FROM #tmpPeople
WHERE firstname = 'Jon'

DELETE FROM #tmpPeople
WHERE age > 35

UPDATE People
SET firstname = 'Fred'
WHERE personid IN (SELECT personid from #tmpPeople)

Don't build a temporary table from a query, only to delete the rows you don't need.

And yes, I have seen pages of code in this form in production DBs.



回答20:

Contrarian view: over-obsession with normalization.

Most SQL/RBDBs systems give one lots of features (transactions, replication) that are quite useful, even with unnormalized data. Disk space is cheap, and sometimes it can be simpler (easier code, faster development time) to manipulate / filter / search fetched data, than it is to write up 1NF schema, and deal with all the hassles therein (complex joins, nasty subselects, etc).

I have found the over-normalized systems are often premature optimization, especially during early development stages.

(more thoughts on it... http://writeonly.wordpress.com/2008/12/05/simple-object-db-using-json-and-python-sqlite/)



回答21:

I just put this one together, based on some of the SQL responses here on SO.

It is a serious antipattern to think that triggers are to databases as event handlers are to OOP. There's this perception that just any old logic can be put into triggers, to be fired off when a transaction (event) happens on a table.

Not true. One of the big differences are that triggers are synchronous - with a vengeance, because they are synchronous on a set operation, not on a row operation. On the OOP side, exactly the opposite - events are an efficient way to implement asynchronous transactions.



回答22:

1) I don't know it's an "official" anti-pattern, but I dislike and try to avoid string literals as magic values in a database column.

An example from MediaWiki's table 'image':

img_media_type ENUM("UNKNOWN", "BITMAP", "DRAWING", "AUDIO", "VIDEO", 
    "MULTIMEDIA", "OFFICE", "TEXT", "EXECUTABLE", "ARCHIVE") default NULL,
img_major_mime ENUM("unknown", "application", "audio", "image", "text", 
    "video", "message", "model", "multipart") NOT NULL default "unknown",

(I just notice different casing, another thing to avoid)

I design such cases as int lookups into tables ImageMediaType and ImageMajorMime with int primary keys.

2) date/string conversion that relies on specific NLS settings

CONVERT(NVARCHAR, GETDATE())

without format identifier



回答23:

Identical subqueries in a query.



回答24:

  • The Altered View - A view that is altered too often and without notice or reason. The change will either be noticed at the most inappropriate time or worse be wrong and never noticed. Maybe your application will break because someone thought of a better name for that column. As a rule views should extend the usefulness of base tables while maintaining a contract with consumers. Fix problems but don't add features or worse change behavior, for that create a new view. To mitigate do not share views with other projects and, use CTEs when platforms allow. If your shop has a DBA you probably can't change views but all your views will be outdated and or useless in that case.

  • The !Paramed - Can a query have more than one purpose? Probably but the next person who reads it won't know until deep meditation. Even if you don't need them right now chances are you will, even if it's "just" to debug. Adding parameters lowers maintenance time and keep things DRY. If you have a where clause you should have parameters.

  • The case for no CASE -

    SELECT  
    CASE @problem  
      WHEN 'Need to replace column A with this medium to large collection of strings hanging out in my code.'  
        THEN 'Create a table for lookup and add to your from clause.'  
      WHEN 'Scrubbing values in the result set based on some business rules.'  
        THEN 'Fix the data in the database'  
      WHEN 'Formating dates or numbers.'   
        THEN 'Apply formating in the presentation layer.'  
      WHEN 'Createing a cross tab'  
        THEN 'Good, but in reporting you should probably be using cross tab, matrix or pivot templates'   
    ELSE 'You probably found another case for no CASE but now I have to edit my code instead of enriching the data...' END  
    


回答25:

Stored Procedures or Functions without any comments...



回答26:

Putting stuff in temporary tables, especially people who switch from SQL Server to Oracle have a habit of overusing temporary tables. Just use nested select statements.



回答27:

Developers who write queries without having a good idea about what makes SQL applications (both individual queries and multi-user systems) fast or slow. This includes ignorance about:

  • physical I/O minimization strategies, given that most queries' bottleneck is I/O not CPU
  • perf impact of different kinds of physical storage access (e.g. lots of sequential I/O will be faster than lots of small random I/O, although less so if your physical storage is an SSD!)
  • how to hand-tune a query if the DBMS produces a poor query plan
  • how to diagnose poor database performance, how to "debug" a slow query, and how to read a query plan (or EXPLAIN, depending on your DBMS of choice)
  • locking strategies to optimize throughput and avoid deadlocks in multi-user applications
  • importance of batching and other tricks to handle processing of data sets
  • table and index design to best balance space and performance (e.g. covering indexes, keeping indexes small where possible, reducing data types to minimum size needed, etc.)


回答28:

The two I find the most, and can have a significant cost in terms of performance are:

  • Using cursors instead of a set based expression. I guess this one occurs frequently when the programmer is thinking procedurely.

  • Using correlated sub-queries, when a join to a derived table can do the job.



回答29:

Using SQL as a glorified ISAM (Indexed Sequential Access Method) package. In particular, nesting cursors instead of combining SQL statements into a single, albeit larger, statement. This also counts as 'abuse of the optimizer' since in fact there isn't much the optimizer can do. This can be combined with non-prepared statements for maximum inefficiency:

DECLARE c1 CURSOR FOR SELECT Col1, Col2, Col3 FROM Table1

FOREACH c1 INTO a.col1, a.col2, a.col3
    DECLARE c2 CURSOR FOR
        SELECT Item1, Item2, Item3
            FROM Table2
            WHERE Table2.Item1 = a.col2
    FOREACH c2 INTO b.item1, b.item2, b.item3
        ...process data from records a and b...
    END FOREACH
END FOREACH

The correct solution (almost always) is to combine the two SELECT statements into one:

DECLARE c1 CURSOR FOR
    SELECT Col1, Col2, Col3, Item1, Item2, Item3
        FROM Table1, Table2
        WHERE Table2.Item1 = Table1.Col2
        -- ORDER BY Table1.Col1, Table2.Item1

FOREACH c1 INTO a.col1, a.col2, a.col3, b.item1, b.item2, b.item3
    ...process data from records a and b...
END FOREACH

The only advantage to the double loop version is that you can easily spot the breaks between values in Table1 because the inner loop ends. This can be a factor in control-break reports.

Also, sorting in the application is usually a no-no.



回答30:

Using primary keys as a surrogate for record addresses and using foreign keys as a surrogate for pointers embedded in records.



标签: