可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I am working with a data dump of transactions which is not exported from the particular system in a very friendly format for any purpose.
The data in Excel is made up of approximately 700,000 rows currently, whereas once organised correctly should only consist of maybe 70,000, where the unique values of one particular column are transposed into some additional columns.
Currently I am working with something like this (simplified) example below;
Request_ID Status Field_Name Value
01000 Rejected Name John Smith
01000 Rejected Acc Number 123456
01000 Rejected Date 1/12/2015
01000 Rejected Enquiry Type Type 1
01000 Rejected Reason Reason 1
01001 Completed Name Jane Jones
01001 Completed Acc Number 123457
01001 Completed Date 1/12/2015
01001 Completed Enquiry Type Type 2
01001 Completed Reason Reason 2
The first two columns (Request_ID and Status) are simply duplicated values to cater for the numerous rows of data relating to each individual request.
The Field_Name column is the particular column repeats the same set of values on individual rows for each unique Request_ID - and the Value column lists the corresponding value relating to each Field_Name.
It is columns 3 and 4 (Field_Name and Value) which I would like to transpose into rows to produce something like the below:
Request_ID Status Name Acc Number Date Enquiry Type Reason
01000 Rejected John Smith 123456 1/12/2012 Type 1 Reason 1
01001 Completed Jane Jones 123457 1/12/2012 Type 2 Reason 2
As you can see, the second example above is much more logical and with less redundant data - and additionally will allow for simple filtering and analysis.
So, my first point of call was to attempt to implement the above manipulation by inserting the data into a PivotTable, however I simply could not seem to figure the logic. Or if I was able to get the fields into a similar column position, then the values would simply calculate a count of in the body of the table, rather than the value itself (as is the nature of PivotTables).
Is it possible to use the Transpose function? (which if so, I am more than happy to investigate myself)
Is there some alternative solution which I have not considered?
(Side Note - there are additional columns in my data dump which I have not included in the example as they contain values which are simply repeated for every row related to an individual Request_ID - so I figure I can use the same logic for any solution applicable above)
If you can point me in the right direction, or provide any guidance, it would be greatly appreciated as I am stumped.
回答1:
Depending on your version of Excel, you can use either Power Pivot
(2010/2013) or Get & Transform
(2016) to pivot the data appropriately. Your data, if not already in a table, will be converted into one.
For the latter, Selecting From Table
opens the Query Editor. After selecting the Field Name and Value columns, select Transform
► Pivot Column
This will bring up a Pivot Column dialog. You want to be sure the selections are as below. Also you must select advanced
to get to the do not aggregate
option.
Select OK and you have your results as in your question. When you save the query, it will write the results to a new worksheet. You'll need to format the date column properly.
I'm not sure how this will work with 700,000 rows. You might need 64-bit Excel.
However, looking at some of the comments to other responses, this solution should work with varying numbers of Field Name / Value
pairs.
回答2:
Assume your data is in the columns A:D . Copy column Request_ID to Col G. use excel Remove Duplicates function to get unique request IDs. Make headers for all possible Field Names like Name, Acc Number etc from Col H onwards and use the below formula.
H2 =VLOOKUP(G2,$A$2:$D$11,2,FALSE)
I2 =INDIRECT(ADDRESS(SUMPRODUCT(--($A$2:$A$11=$G2)*--($C$2:$C$11=I$1)*ROW($A$2:$A$11)),4))
Drag I2 formula for all other field names.
Here is a Google Sheet
I have tried with the data you have posted and works fine. If you want to automate the whole process, you can try recording Macro for all these steps.
回答3:
Classic need of the conditional aggregation query in SQL. Your initial example reflects the Entity-Attribute-Value model. To properly align to distinct columns, aggregate conditionally on the Field_Name columns, returning the max (only value) of Value, grouped by the repeating columns (add to SELECT
and GROUP BY
for all other repeat columns):
Conditional Aggregate Query
SELECT Request_ID, Status,
MAX(IIF(Field_Name='Name', Value, NULL)) AS Name,
MAX(IIF(Field_Name='Acc Number', Value, NULL)) AS [Acc Number],
MAX(IIF(Field_Name='Date', Value, NULL)) AS [Date],
MAX(IIF(Field_Name='Enquiry Type', Value, NULL)) AS [Enquiry Type]
MAX(IIF(Field_Name='Reason', Value, NULL)) AS [Reason]
FROM [Worksheet$]
GROUP BY Request_ID, Status
If using Excel for PC, you can use the ACE SQL Engine (Windows .dll files) to run the SQL query on the workbook in VBA. If using Mac, import data into a database like SQLite and run above query (replace IIF()
for CASE
statements). For 700,000+ records, SQL may be a robust solution compared to inline formulas or nested for
loops and arrays.
Below is the Windows solution using ADO (two connections available) where data resides in a DATA tab with a blank RESULTS tab used for the query results.
Sub RunSQL()
Dim conn As Object, rst As Object
Dim strConnection As String, strSQL As String, i As Integer
Set conn = CreateObject("ADODB.Connection")
Set rst = CreateObject("ADODB.Recordset")
' CONNECTION STRINGS (DRIVER AND PROVIDER)
' strConnection = "DRIVER={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};" _
' & ActiveWorkbook.FullName ";"
strConnection = "Provider=Microsoft.ACE.OLEDB.12.0;" _
& "Data Source='" & ActiveWorkbook.FullName & "';" _
& "Extended Properties=""Excel 8.0;HDR=YES;"";"
strSQL = " SELECT Request_ID, Status," _
& " MAX(IIF(Field_Name='Name', Value, NULL)) AS Name," _
& " MAX(IIF(Field_Name='Acc Number', Value, NULL)) AS [Acc Number]," _
& " MAX(IIF(Field_Name='Date', Value, NULL)) AS [Date]," _
& " MAX(IIF(Field_Name='Enquiry Type', Value, NULL)) AS [Enquiry Type]," _
& " MAX(IIF(Field_Name='Reason', Value, NULL)) AS [Reason]" _
& " FROM [DATA$]" _
& " GROUP BY Request_ID, Status;"
' OPEN CONNECTION AND RECORDSET
conn.Open strConnection
rst.Open strSQL, conn
' HEADERS
For i = 0 To rst.Fields.Count - 1
Worksheets("RESULTS").Cells(1, i + 1) = rst.Fields(i).Name
Next i
' DATA ROWS
Worksheets("RESULTS").Range("A2").CopyFromRecordset rst
rst.Close: conn.Close
End Sub
Alternatively, for variable number of Field_Name
, you can use ACE SQL's unique crosstab query which avoids hard-coding value possibilities such as the conditionals in above aggregate. And because ACE SQL limits columns to 255, below query can only return 253 or less distinct values of Field_Name
(2 for the repeated groupby columns):
Crosstab Query
strSQL = " TRANSFORM Max(Value)" _
& " SELECT Request_ID, Status" _
& " FROM [DATA$]" _
& " GROUP BY Request_ID, Status" _
& " PIVOT Field_Name;"
Data
Results
回答4:
Pick any cell, say G6 and enter:
=INDEX(A:A,(ROWS($1:1)-1)*5+2)
and copy down. In H6 enter:
=INDEX(B:B,(ROWS($1:1)-1)*5+2)
and copy down. Finally in I6 enter:
=OFFSET($D$2,COLUMNS($A:A)-1+(ROWS($1:1)-1)*5,0)
and copy this cell both across and down: