Excel: How to transpose select columns and group b

2019-04-16 07:16发布

问题:

I am working with a data dump of transactions which is not exported from the particular system in a very friendly format for any purpose.

The data in Excel is made up of approximately 700,000 rows currently, whereas once organised correctly should only consist of maybe 70,000, where the unique values of one particular column are transposed into some additional columns.

Currently I am working with something like this (simplified) example below;

Request_ID   Status       Field_Name       Value
01000        Rejected     Name             John Smith
01000        Rejected     Acc Number       123456
01000        Rejected     Date             1/12/2015
01000        Rejected     Enquiry Type     Type 1
01000        Rejected     Reason           Reason 1
01001        Completed    Name             Jane Jones
01001        Completed    Acc Number       123457
01001        Completed    Date             1/12/2015
01001        Completed    Enquiry Type     Type 2
01001        Completed    Reason           Reason 2

The first two columns (Request_ID and Status) are simply duplicated values to cater for the numerous rows of data relating to each individual request.

The Field_Name column is the particular column repeats the same set of values on individual rows for each unique Request_ID - and the Value column lists the corresponding value relating to each Field_Name.

It is columns 3 and 4 (Field_Name and Value) which I would like to transpose into rows to produce something like the below:

Request_ID  Status     Name          Acc Number    Date       Enquiry Type    Reason
01000       Rejected   John Smith    123456        1/12/2012  Type 1          Reason 1
01001       Completed  Jane Jones    123457        1/12/2012  Type 2          Reason 2

As you can see, the second example above is much more logical and with less redundant data - and additionally will allow for simple filtering and analysis.

So, my first point of call was to attempt to implement the above manipulation by inserting the data into a PivotTable, however I simply could not seem to figure the logic. Or if I was able to get the fields into a similar column position, then the values would simply calculate a count of in the body of the table, rather than the value itself (as is the nature of PivotTables).

Is it possible to use the Transpose function? (which if so, I am more than happy to investigate myself)

Is there some alternative solution which I have not considered?

(Side Note - there are additional columns in my data dump which I have not included in the example as they contain values which are simply repeated for every row related to an individual Request_ID - so I figure I can use the same logic for any solution applicable above)

If you can point me in the right direction, or provide any guidance, it would be greatly appreciated as I am stumped.

回答1:

Depending on your version of Excel, you can use either Power Pivot (2010/2013) or Get & Transform (2016) to pivot the data appropriately. Your data, if not already in a table, will be converted into one.

For the latter, Selecting From Table opens the Query Editor. After selecting the Field Name and Value columns, select TransformPivot Column

This will bring up a Pivot Column dialog. You want to be sure the selections are as below. Also you must select advanced to get to the do not aggregate option.

Select OK and you have your results as in your question. When you save the query, it will write the results to a new worksheet. You'll need to format the date column properly.

I'm not sure how this will work with 700,000 rows. You might need 64-bit Excel.

However, looking at some of the comments to other responses, this solution should work with varying numbers of Field Name / Value pairs.



回答2:

Assume your data is in the columns A:D . Copy column Request_ID to Col G. use excel Remove Duplicates function to get unique request IDs. Make headers for all possible Field Names like Name, Acc Number etc from Col H onwards and use the below formula.

H2 =VLOOKUP(G2,$A$2:$D$11,2,FALSE)

I2 =INDIRECT(ADDRESS(SUMPRODUCT(--($A$2:$A$11=$G2)*--($C$2:$C$11=I$1)*ROW($A$2:$A$11)),4))

Drag I2 formula for all other field names.

Here is a Google Sheet

I have tried with the data you have posted and works fine. If you want to automate the whole process, you can try recording Macro for all these steps.



回答3:

Classic need of the conditional aggregation query in SQL. Your initial example reflects the Entity-Attribute-Value model. To properly align to distinct columns, aggregate conditionally on the Field_Name columns, returning the max (only value) of Value, grouped by the repeating columns (add to SELECT and GROUP BY for all other repeat columns):

Conditional Aggregate Query

SELECT Request_ID, Status, 
       MAX(IIF(Field_Name='Name', Value, NULL)) AS Name,
       MAX(IIF(Field_Name='Acc Number', Value, NULL)) AS [Acc Number],
       MAX(IIF(Field_Name='Date', Value, NULL)) AS [Date],
       MAX(IIF(Field_Name='Enquiry Type', Value, NULL)) AS [Enquiry Type]
       MAX(IIF(Field_Name='Reason', Value, NULL)) AS [Reason]
FROM [Worksheet$]
GROUP BY Request_ID, Status

If using Excel for PC, you can use the ACE SQL Engine (Windows .dll files) to run the SQL query on the workbook in VBA. If using Mac, import data into a database like SQLite and run above query (replace IIF() for CASE statements). For 700,000+ records, SQL may be a robust solution compared to inline formulas or nested for loops and arrays.

Below is the Windows solution using ADO (two connections available) where data resides in a DATA tab with a blank RESULTS tab used for the query results.

Sub RunSQL()
    Dim conn As Object, rst As Object
    Dim strConnection As String, strSQL As String, i As Integer

    Set conn = CreateObject("ADODB.Connection")
    Set rst = CreateObject("ADODB.Recordset")

    ' CONNECTION STRINGS (DRIVER AND PROVIDER)
'    strConnection = "DRIVER={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};" _
'                      & ActiveWorkbook.FullName ";"
    strConnection = "Provider=Microsoft.ACE.OLEDB.12.0;" _
                       & "Data Source='" & ActiveWorkbook.FullName & "';" _
                       & "Extended Properties=""Excel 8.0;HDR=YES;"";"

    strSQL = " SELECT Request_ID, Status," _
               & "    MAX(IIF(Field_Name='Name', Value, NULL)) AS Name," _
               & "    MAX(IIF(Field_Name='Acc Number', Value, NULL)) AS [Acc Number]," _
               & "    MAX(IIF(Field_Name='Date', Value, NULL)) AS [Date]," _
               & "    MAX(IIF(Field_Name='Enquiry Type', Value, NULL)) AS [Enquiry Type]," _
               & "    MAX(IIF(Field_Name='Reason', Value, NULL)) AS [Reason]" _
               & " FROM [DATA$]" _
               & " GROUP BY Request_ID, Status;"            
    ' OPEN CONNECTION AND RECORDSET
    conn.Open strConnection
    rst.Open strSQL, conn

    ' HEADERS
    For i = 0 To rst.Fields.Count - 1
        Worksheets("RESULTS").Cells(1, i + 1) = rst.Fields(i).Name
    Next i           
    ' DATA ROWS
    Worksheets("RESULTS").Range("A2").CopyFromRecordset rst

    rst.Close: conn.Close
End Sub

Alternatively, for variable number of Field_Name, you can use ACE SQL's unique crosstab query which avoids hard-coding value possibilities such as the conditionals in above aggregate. And because ACE SQL limits columns to 255, below query can only return 253 or less distinct values of Field_Name (2 for the repeated groupby columns):

Crosstab Query

strSQL = " TRANSFORM Max(Value)" _
           & " SELECT Request_ID, Status" _
           & " FROM [DATA$]" _
           & " GROUP BY Request_ID, Status" _
           & " PIVOT Field_Name;"

Data

Results



回答4:

Pick any cell, say G6 and enter:

=INDEX(A:A,(ROWS($1:1)-1)*5+2)

and copy down. In H6 enter:

=INDEX(B:B,(ROWS($1:1)-1)*5+2)

and copy down. Finally in I6 enter:

=OFFSET($D$2,COLUMNS($A:A)-1+(ROWS($1:1)-1)*5,0)

and copy this cell both across and down: