可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I have a table, with 3 columns A, B , C - where A is not the primary key. We need to select the B, C pairs for each distinct A(group by A), and append the results at the end of the final result set. Is this possible in sql ?
A | B | C
a1| b1| c1
a1| b2| c2
a1| b3| c3
a2| b1| c2
a2| b2| c5
I need to get
a1 | (c1,b1) ; (c2,b2);(c3;b3)
a2 | (c2,b1) ; (c5,b2)
as the rows appended at the end.
I normally do this via sqlalchemy, and then end up transforming the data in Python, is there a way in which I could do this directly in SQL ?
EDIT & open question :
What is the alternative to string_agg() in red shift (Postgres 8.0.2) - more info on use-case above.
On using string_agg I get ERROR: function string_agg(text, "unknown") does not exist Hint: No function matches the given name and argument types. You may need to add explicit type casts
Edit 2: Adding errors using the custom aggregate function
An error occurred when executing the SQL command:
CREATE FUNCTION cut_semicolon(text) RETURNS text AS $$
BEGIN
RETURN SUBSTRING($1 FROM 4)
ERROR: unterminated dollar-quoted string at or near "$$
BEGIN
RETURN SUBSTRING($1 FROM 4)"
Position: 53
CREATE FUNCTION cut_semicolon(text) RETURNS text AS $$
^
Execution time: 0.24s
(Statement 1 of 7 finished)
0 rows affected
END executed successfully
Execution time: 0.22s
(Statement 2 of 7 finished)
An error occurred when executing the SQL command:
$$ LANGUAGE 'plpgsql' IMMUTABLE
ERROR: unterminated dollar-quoted string at or near "$$ LANGUAGE 'plpgsql' IMMUTABLE"
Position: 1
$$ LANGUAGE 'plpgsql' IMMUTABLE
^
Execution time: 0.22s
(Statement 3 of 7 finished)
An error occurred when executing the SQL command:
CREATE FUNCTION concat_semicolon(text, text) RETURNS text AS $$
BEGIN
RETURN $1 || ' ; ' || $2
ERROR: unterminated dollar-quoted string at or near "$$
BEGIN
RETURN $1 || ' ; ' || $2"
Position: 62
CREATE FUNCTION concat_semicolon(text, text) RETURNS text AS $$
^
Execution time: 0.22s
(Statement 4 of 7 finished)
0 rows affected
END executed successfully
Execution time: 0.22s
(Statement 5 of 7 finished)
An error occurred when executing the SQL command:
$$ LANGUAGE 'plpgsql' IMMUTABLE
ERROR: unterminated dollar-quoted string at or near "$$ LANGUAGE 'plpgsql' IMMUTABLE"
Position: 1
$$ LANGUAGE 'plpgsql' IMMUTABLE
^
Execution time: 0.22s
(Statement 6 of 7 finished)
An error occurred when executing the SQL command:
CREATE AGGREGATE concat_semicolon(
BASETYPE=text,
SFUNC=concat_semicolon,
STYPE=text,
FINALFUNC=cut_semicolon,
INITCOND=''
)
ERROR: SQL command "CREATE AGGREGATE concat_semicolon(
BASETYPE=text,
SFUNC=concat_semicolon,
STYPE=text,
FINALFUNC=cut_semicolon,
INITCOND=''
)" not supported.
Execution time: 0.23s
(Statement 7 of 7 finished)
5 statements failed.
Script execution finished
Total script execution time: 1.55s
Also looked through a related answer in Google groups, & it looks like replacing the separator ";" might help? - though I am not sure, which ; to replace in this function definition.
Reference : https://groups.google.com/forum/#!topic/sql-workbench/5LHVUXTm3BI
Edit 3:
Perhaps,create function itself is not supported in Redshift ? "ERROR: CREATE FUNCTION is not supported" A 2013 thread says so forums.aws.amazon.com/thread.jspa?threadID=121137
Edit 4 :
select A, concat(concat(concat(C, ',' ) , cast(B as varchar)), ',')
from my_table
group by A,B,C
-- Is it ok to group by all A,B, C - since I can't group by A alone, which removes the related "C" columns--
gives -:
a1 c1b1b2b3
a2 c2b1b2
But not ALL the entries for C (and with semicolons)
a1 c1,b1;c2,b2;c2,b3
a2 c2,b1;c5,b2
but I would like the commas in between & also need to know if the group by A, B, C are ok ?
回答1:
PostgreSQL
SELECT
a,
STRING_AGG('(' || c || ',' || b || ')', ' ; ')
FROM
tbl
GROUP BY
a;
Edit:
For versions of PostgreSQL before 9.0 (when STRING_AGG was introduced) and even before 8.4 (when ARRAY_AGG was added) you can create your own custom aggregate function.
Edit 2: For versions before 8.0 (perhaps Amazon Redshift is based on PostgreSQL 7.4 somehow) the $$ syntax is not supported, so the function body needs to be enclosed in quotes, and quotes inside the body need to be escaped.
CREATE FUNCTION cut_semicolon(text) RETURNS text AS '
BEGIN
RETURN SUBSTRING($1 FROM 4);
END;
' LANGUAGE 'plpgsql' IMMUTABLE;
CREATE FUNCTION concat_semicolon(text, text) RETURNS text AS '
BEGIN
RETURN $1 || '' ; '' || $2;
END;
' LANGUAGE 'plpgsql' IMMUTABLE;
CREATE AGGREGATE concat_semicolon(
BASETYPE=text,
SFUNC=concat_semicolon,
STYPE=text,
FINALFUNC=cut_semicolon,
INITCOND=''
);
Then use that aggregate instead.
SELECT
a,
CONCAT_SEMICOLON('(' || c || ',' || b || ')')
FROM
tbl
GROUP BY
a;
MySQL
SELECT
a,
GROUP_CONCAT(CONCAT('(', c, ',', b, ')') SEPARATOR ' ; ')
FROM
tbl
GROUP BY
a;
回答2:
Unless you have a very specific reason to do this kind of stuff from within the DB itself, it ought to be done from within your app. Else, you'll end up with sets returning convoluted text fields that you might need to parse for post-processing and so forth.
In other words:
select A, B, C from table
And then, something like (Ruby):
res = {}
rows.each do |row|
res[row['a']] ||= []
res[row['a']][] = [row['b'], row['c']]
end
If you insist on doing it within Postgres, your options aren't many -- if any at all, in Redshift.
The array_agg()
and string_agg()
aggregate functions would both potentially be useful, but they were introduced in 8.4 and 9.0 respectively, and Redshift apparently supports neither.
Insofar as I'm aware, Redshift doesn't support array constructors, so using the ARRAY((select ...))
construct, which might have worked, flies out the window as well.
Returning something that makes use of the ROW()
construct is not possible either. Even if it were, it would have been ugly as sin, and impossibly convoluted to manipulate from within Python.
A custom aggregate function seems out of the question if the other answer, and the leads it made you follow, are anything to go by. Which is unsurprising: the docs seem clear that you cannot create a user-defined function, let alone create a pl/language to begin with.
In other words, your only option insofar as I can tell is to do this type of aggregation from within your app. And that, incidentally, is the place where you should be doing this kind of stuff anyway.
回答3:
It is probably achievable in PostgreSQL. Especially if B and C are of the same type. you may produce two column result and aggregate data in second column using ARRAY, otherwise use JSON. I'm not sure how to produce it in MySQL, but probably there you would need to serialize to string, and invert it in Python.
Either way in my opinion proper answer is: don't do it. You'll get much less readable, hacky, unportable solution, which may not necessarily be a faster one. There's nothing wrong in some post processing of data in Python to give them final form and in fact it is quite a common practice. Especially if it is purely reformatting output and not used for producing aggregate results.
回答4:
Try this One to get
a1 | (c1,b1) ; (c2,b2);(c3;b3)
a2 | (c2,b1) ; (c5,b2)
This is the code :
make temporary table, with running ID, that is example for SQL Server, you can try with another query
Select identity(int, 1, 1) as ID, A, '('+C+';'+B+')' as aa
Into #table2
From #table
Order BY A, aa
Make query with Loop
Declare @sSql as Varchar(1000), @A as Varchar(2), @A2 as Varchar(2), @aa as Varchar(10)
Declare @iRec as int, @iL as int
Set @iRec = (Select Count(*) From #table2)
Set @iL = 1
Set @sSql = ''
While @iL <= @iRec
Begin
Set @A = (Select A From #table2 Where ID = @iL)
Set @aa = (Select aa From #table2 Where ID = @iL)
if @A = @A2
Begin
Set @sSql = Left(@sSql, Len(@sSql)-1)+';'+@aa+'`'
End
Else
BEGIN
Set @sSql = @sSql + ' Union Select `'+ @A+'`,`'+@aa+'`'
END
Set @A2 = @A
Set @iL = @iL + 1
End
Set @sSql = Right(@sSql, Len(@sSql)-7)
Set @sSql = Replace(@sSql, '`', '''')
Exec(@sSql)
is it work ?
回答5:
Came to the conclusion that it cant be solved in postgres+ Redshift stack.
This is how I solved it.
import pandas as pd
df =pd.DataFrame({'A':[1,1,1,2,2,3,3,3],'B':['aaa','bbb','cc','gg','aaa','bbb','cc','gg']})
def f(x):
return [x['B'].values]
#s=df.groupby('A').apply(f)
series =df.groupby('A').apply(f)
series.name = 'metric'
s=pd.DataFrame(series.reset_index())
print s
A metric
0 1 [[aaa, bbb, cc]]
1 2 [[gg, aaa]]
2 3 [[bbb, cc, gg]]