Best Practice: How to Structure Arrays - Standards

Posted 2019-03-11 09:39

Question:

What is the best practice in multidimensional array structure in terms of what elements hold the iterator vs the detail elements?

The majority of my programming experience (and I mainly do it for fun) comes from following tutorials found on Google, so I apologize in advance if this seems an exceptionally daft question - but I do want to start improving my code.

Whenever I have needed to make a multidimensional array, my naming has always placed the counter in the first element.

For example, if I have a single dimensional array as follows:

$myArray['year']=2012;
$myArray['month']='July';
$myArray['measure']=3;
// and so on.

However, if I wanted to make that same array keep history for a few owners, I would add another dimension and format it as follows:

$myArray[$owner]['year']=2012;
$myArray[$owner]['month']='July';
$myArray[$owner]['measure']=3;

Edit: To make sure that my example isn't off-putting or leading in the wrong direction, I am basically following this structure:

$myArray[rowOfData][columnOfData]

Now, my question is about accepted convention. Should I instead be doing the following?

$myArray['year'][$owner]=2012;
$myArray['month'][$owner]='July';
$myArray['measure'][$owner]=3;

Edit: using that edit from above, should it be:

$myArray[columnOfData][rowOfData]

I have searched for array naming conventions, but keep hitting articles arguing about whether to name arrays as plurals or not. The way I have been naming them seems more logical to me, and I think it follows a structure that resembles an object better, i.e. object->secondaryLevel->detail, but for all I know I have been doing it ass-about all this time. As I am getting more and more into programming, I would prefer to change my habits if they are wrong.

Is there an accepted standard, or is it just anything goes with arrays? If you were looking at code written by someone else, what format would you be expecting? I get that any structure that makes sense/is intuitive is accepted.

Also from an iteration point of view, which one of the following is more intuitive?:

for($i=0;$i<$someNumber;$i++)
{
    echo $myArray[$i]['year'];
    // OR
    echo $myArray['year'][$i];
}

Edit: I did have this post tagged as C# and Java because I wanted to get some opinions from outside the PHP community. As arrays are used in so many different languages, I think it would have been good to get some input from programmers in various languages.

Answer 1:

Your question is subjective, in that everyone may have a different approach to the situation you described, and you are wise to even ask the question of how best to name your variables, classes, etc. Sad to say, I spend more time than I care to admit determining the best variable names that make sense and satisfy the requirements. My ultimate goal is to write code which is 'self documenting'. By writing self-documenting code you will find that it is much easier to add features or fix defects as they arise.

Over the years I have come to find that these practices work best for me:

Arrays: Always plural

I do this so loop control structures make more semantic sense, and are easier to work with.

// With a plural array it's easy to access a single element
foreach ($students as $student) {}

// Makes more sense semantically
do {} while (count($students) > 0);

Arrays of objects > deep multi-dimensional arrays

In your example, your arrays started growing into three-level-deep multi-dimensional arrays, and as correct as Robbie's code snippet is, it demonstrates the complexity it takes to iterate over multi-dimensional arrays. Instead, I would suggest creating objects, which can be added to an array. Note that the following code is demonstrative only; I always use accessors.

class Owner
{
    public $year;
    public $measure;
    public $month;
}

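// Start from an empty list so that $owners[] never appends to an undefined variable
$owners = array();

// Demonstrative hydration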
for ($i = 1 ; $i <= 3 ; $i++) {

    $owner = new Owner();

    $owner->year = 2012;
    $owner->measure = $i;
    $owner->month = rand(1,12);

    $owners[] = $owner;
}

Now, you only need to iterate over a flat array to gain access to the data you need:

foreach ($owners as $owner) {
    var_dump(sprintf('%d.%d: %d', $owner->month, $owner->year, $owner->measure));
}

The cool thing about this array-of-objects approach is how easy it is to add enhancements. What if you want to add an owner name? No problem: simply add the member variable to your class and modify your hydration a bit:

class Owner
{
    public $year;
    public $measure;
    public $month;
    public $name;
}

$names = array('Lars', 'James', 'Kirk', 'Robert');

// Reset the list before hydrating again
$owners = array();

// Demonstrative hydration
for ($i = 1 ; $i <= 3 ; $i++) {

    $owner = new Owner();

    $owner->year = 2012;
    $owner->measure = $i;
    $owner->month = rand(1,12);
    $owner->name = $names[array_rand($names)];  // array_rand() returns a random key, not a value

    $owners[] = $owner;
}

foreach ($owners as $owner) {
    var_dump(sprintf('%s: %d.%d: %d', $owner->name, $owner->month, $owner->year, $owner->measure));
}

Remember that the above code snippets are just suggestions. If you would rather stick with deep multi-dimensional arrays, you will have to figure out an element ordering that makes sense to YOU and those you work with. If you think you will have trouble with the setup six months down the road, it is best to implement a better strategy while you have the chance.



Answer 2:

You want to make it easy for yourself (and potentially other users) to understand what your code is doing, and what you were thinking when you wrote it. Imagine asking for help, imagine outputting debug code or imagine returning to fix a bug 12 months after you last touched the code. Which will be the best and fastest for you/others to understand?

If your code requires that you "For each year, display the data" then the first is more logical. If your thinking is "I need to gather all the measures together, then I'll process those" then go for the second option. If you need to re-order by year, then go the first.

Based on your example, though, the way I'd probably handle it is:

$array[$year][$month] = $measure;

You don't need a specific "measure" element. Or if you do have two elements per month:

$array[$year][$month] = array('measure' => $measure, 'value'=>$value);

or

$array[$year][$month]['measure'] = $measure;
$array[$year][$month]['value'] = $value;

Then you can go:

for($year = $minYear; $year <= $maxYear; $year++) {  // Or use "foreach" if the years are not consecutive
    for ($month = 1; $month <= 12; $month++) {
        if (isset($array[$year][$month])) {
             echo $array[$year][$month]['measure'];  // You can also check these exist using isset if required
             echo $array[$year][$month]['value'];
        } else {
             echo 'No value specified';
        }
    }
}
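If the years or months are sparse, nested foreach loops are one alternative: they visit only the keys that exist, so no isset() check is needed. This is a minimal sketch; the sample values in $array and the $lines variable are made up for illustration.

```php
<?php
// Sample data (illustrative values only); 2011 is missing entirely.
$array = array(
    2010 => array(3 => array('measure' => 5, 'value' => 1.5)),
    2012 => array(
        7  => array('measure' => 3, 'value' => 2.0),
        11 => array('measure' => 8, 'value' => 4.5),
    ),
);

$lines = array();
foreach ($array as $year => $months) {
    foreach ($months as $month => $entry) {
        // Only keys that actually exist are visited.
        $lines[] = sprintf('%d-%02d: measure=%d', $year, $month, $entry['measure']);
    }
}

echo implode("\n", $lines), "\n";
```

The trade-off is that gaps are silently skipped, so the "No value specified" branch of the for-loop version has no equivalent here.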

Hope that helps with your thinking.



Answer 3:

You're doing it right. You must realize that PHP does not have real multi-dimensional arrays; what you're looking at is an array of arrays, each one-dimensional. The major array is storing the pointers (the "iterator", as you put it). Because of this, row-first is the only reasonable way to go.

In your particular example, you can think of your two-dimensional array as containing a collection of objects, each of which has values for 'year', 'month', and 'measure' (plus the primary key, 'owner'). By filling in the major index, you can refer to each row of the two-dimensional array like this: $myArray[$owner]. Each such value is a three-element array with keys 'year', 'month', and 'measure'. In other words, it is identical to your original, one-dimensional data structure for the same information! You can pass it to a function that deals with just one row of your table, you can easily sort the rows of $myArray, etc.

If you were to put your indices the other way around, there's no way you can recover your individual records. There is no "slice" notation that can give you an entire "column" of a two-dimensional array.
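To make the difference concrete, here is a minimal sketch of the row-first layout; the owner names and values are made up. One caveat, stated as an addition rather than as part of the original argument: array_column(), available since PHP 5.5, can pull a column out of a row-first array, but it builds a re-indexed copy instead of giving direct access the way $myArray[$owner] does for a row.

```php
<?php
// Row-first layout: the outer key is the owner, the inner keys are the fields.
// Sample values are illustrative only.
$myArray = array(
    'alice' => array('year' => 2012, 'month' => 'July', 'measure' => 3),
    'bob'   => array('year' => 2010, 'month' => 'May',  'measure' => 9),
    'carol' => array('year' => 2011, 'month' => 'June', 'measure' => 4),
);

// A whole record is a single lookup.
$record = $myArray['bob'];

// Rows can be sorted as units; uasort() keeps the owner keys.
uasort($myArray, function ($a, $b) {
    return $a['year'] - $b['year'];
});

// A "column" has to be built; array_column() returns a numerically
// indexed list, so the owner keys are not preserved.
$years = array_column($myArray, 'year');

print_r(array_keys($myArray));
print_r($years);
```

With the column-first layout, the equivalent of $record would have to be assembled field by field in a loop.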

Now for a bit of broader perspective:

Since you asked your question in terms of rows and columns, note that putting the "row" index first makes your arrays compatible with matrix arithmetic. This is an enormous win if you have to do calculations with matrices. Database notation also puts records in rows, so doing it backwards would needlessly complicate things.

C has real two-dimensional arrays and arrays of pointers. Arrays of pointers work exactly as in PHP (though only numeric indices are allowed), and for the same reason: the major index selects from an array of pointers, and the minor index is simply the index of the pointed-to array. C's two-dimensional arrays work the same way: The major index is on the left, and adjacent locations in memory differ by one value of the minor (second) index (except at the end of a row, of course). This makes them compatible with pointer arrays, since it is possible to reference a row of a two-dimensional array by using a single index. For example, a[0] is abcd:

        a[.][0] a[.][1] a[.][2] a[.][3]
a[0]:      a       b       c       d    
a[1]:      e       f       g       . 
a[2]:      .       .       .       .    
a[3]:      .       .       .       .    

The system works seamlessly because the major (row) index is first. Fortran has real two-dimensional arrays but has the major index on the right: Adjacent locations in memory differ by one value of the left (first) index. I found this a pain in the neck, since there's no subexpression that reduces to a one-dimensional array in the same way. (But I have a C background so I'm certainly prejudiced).

In short: You're doing it right, and it's probably not by accident but because you learned by looking at well-written code.



Answer 4:

From my point of view this isn't a question about array naming conventions, but a question about how to structure your data. Meaning: Which pieces of information belong together - and why? To answer this question you have to look at both readability and performance.

From a Java developer's perspective (and you also tagged the question for Java) I'm no friend of multi-dimensional arrays, as they tend to result in error-prone index acrobatics in huge amounts of nested for-loops. To get rid of the second dimension in your array, one would create additional objects which enclose the information of one column.

The decision to make is now, which data should be embedded in this enclosing object. In your case the answer is simple: A collection of data about one user makes sense, a list of uncorrelated values for one property for a number of arbitrary users usually does not.

Even if you do not encapsulate your data in objects and instead prefer to use multi-dimensional arrays, you should keep these thoughts in mind and see the last dimension of the array (in this case one column) as equivalent to the encapsulating object. Your data structure should be useful at all of its abstraction levels, and this usually means that you put together what's used together.



Answer 5:

I believe that when you're using data that is meant to be kept as a collective, it's better to store it in a single data set (such as an associative array or object). The following are examples of structures that can be used to store data sets.

In PHP, as associative arrays:

$owner = array(
    'year' => 2012, 'month' => 'July', 'measure' => 3
);

In C#, as hash tables:

Hashtable owner = new Hashtable();
owner.Add("year", 2012);
owner.Add("month", "July");
owner.Add("measure", 3);

In Java, as hash tables:

import java.util.Hashtable;

Hashtable<String, Object> owner = new Hashtable<>();
owner.put("year", 2012);      // int is autoboxed to Integer
owner.put("month", "July");
owner.put("measure", 3);

In C#/Java, as objects:

public class Owner {
    public int year;
    public string month;  // C# spelling; in Java the type is String
    public int measure;
}

Owner o = new Owner();
o.year = 2012;
o.month = "July";
o.measure = 3;

In C# and Java, the advantage of using objects over hash tables is that variable types (i.e. int or string) can be declared for each field/variable, which helps to prevent errors and maintain data integrity, since errors will be thrown (or warnings will be generated) when you attempt to assign the wrong type of data to a field.

In PHP, I find that objects have no real advantages over arrays when storing collections of data. Type hinting isn't allowed for scalar variables (true of PHP 5; PHP 7 later added scalar type declarations), which means that additional code is required to check or restrict what data is entered in a property. You also have to write extra code to declare the class; you can still accidentally assign a value to the wrong property by misspelling its name (just as with arrays, so it doesn't help with data integrity); and iterating over or manipulating properties requires extra code as well. I also find it easier to work with associative arrays when converting to/from JSON in PHP.
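As a small illustration of the JSON point, an associative array round-trips through json_encode() and json_decode() without any class definition. This is only a sketch; the $owner values are the sample data from earlier in this answer.

```php
<?php
// Illustrative record; the values are sample data.
$owner = array('year' => 2012, 'month' => 'July', 'measure' => 3);

// Associative array -> JSON string.
$json = json_encode($owner);

// JSON string -> associative array. The second argument must be true;
// without it json_decode() returns a stdClass object instead.
$decoded = json_decode($json, true);

echo $json, "\n";
var_dump($decoded === $owner);
```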

Based on the code in the question, most often you'll need to add another dimension to the array when you need to access the data via one of the fields or some other criteria (such as a combination of fields or data that the field would map onto).

This is where hash tables and associative arrays are more useful for each of the languages. For example, if you want to organize the owners into groups based on year and month, you can create associative arrays (or hash tables) to do this.

The following PHP example uses PDO and fetchAll to get the information from a database:

$sth = $dbh->prepare("SELECT year, month, measure FROM owners");
$sth->execute();

$rows = $sth->fetchAll(PDO::FETCH_ASSOC);

$data = array();
foreach ($rows as $row) {
    $year = $row['year'];
    $month = $row['month'];

    if (!isset($data[$year])) {
        $data[$year] = array();
    }

    if (!isset($data[$year][$month])) {
        $data[$year][$month] = array();
    }

    array_push($data[$year][$month], $row);
}

An example of how this data might look as code is:

$data = array(
    2011 => array(
        'July' => array(
            array('year' => 2011, 'month' => 'July', 'measure' => 1),
            array('year' => 2011, 'month' => 'July', 'measure' => 3)
        ),

        'May' => array(
            array('year' => 2011, 'month' => 'May', 'measure' => 9),
            array('year' => 2011, 'month' => 'May', 'measure' => 4),
            array('year' => 2011, 'month' => 'May', 'measure' => 2)
        )
    ),

    2012 => array(
        'April' => array(
            array('year' => 2012, 'month' => 'April', 'measure' => 7)
        )
    )
);

You can then access the data using the keys.

$data[2011]['July'];

// array(
//     array('year' => 2011, 'month' => 'July', 'measure' => 1),
//     array('year' => 2011, 'month' => 'July', 'measure' => 3)
// )
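Once the rows are grouped like this, per-group calculations become a pair of nested loops. The sketch below totals 'measure' for each year and month; $totals is a hypothetical name, the data is a subset of the sample above, and array_column() assumes PHP 5.5 or later.

```php
<?php
// Same shape as the grouped $data above, with a subset of the sample rows.
$data = array(
    2011 => array(
        'July' => array(
            array('year' => 2011, 'month' => 'July', 'measure' => 1),
            array('year' => 2011, 'month' => 'July', 'measure' => 3),
        ),
        'May' => array(
            array('year' => 2011, 'month' => 'May', 'measure' => 9),
        ),
    ),
);

// Total of 'measure' per year and month.
$totals = array();
foreach ($data as $year => $months) {
    foreach ($months as $month => $rows) {
        $totals[$year][$month] = array_sum(array_column($rows, 'measure'));
    }
}

print_r($totals);
```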

Furthermore, I attempt to keep my objects that represent collections of data minimal when I create them. If you're storing your collections in objects, as you begin to add functions that perform operations on the collections, there will be more code to maintain. There are times when this is necessary, for example if you need to restrict what values can be stored via setters, but if all you're doing is passing data from a user into a database and displaying it on the screen again, then there usually isn't a need for advanced functionality in the data collection and it may make more sense to have the functionality managed elsewhere. For example, View classes can handle displaying data, Model classes can handle extracting data, and Check objects can verify that data can be saved or verify whether or not operations can be performed based on the data in the collection.



Answer 6:

I very rarely use arrays in Java and the same would apply to C#. They're object-oriented languages and you generally get much better flexibility in the long run if you use classes and objects instead of primitive constructs such as arrays.

Your example looks like a kind of cross reference or lookup known as an associative array. Which way round you put the elements really depends on how you're going to be using it. It's a design decision specific to the problem. You need to ask yourself what you will be starting with and what you want to end up with.

It looks like you want to retrieve different types depending on what you're looking up? Month is a String, year is an Integer etc. This makes an array or HashMap a bad choice because the calling code will need to second guess the type of data it is retrieving. A better design would be to wrap this data in an object which would be type safe. That's kind of the whole point of using Java and C#.

In this case using an object would give you the flexibility to check that the month is actually a real value. You could for instance create an enum that contains the months and return an instance of this in the getMonth method.
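Since the question itself is about PHP, here is a rough PHP analogue of that idea rather than a Java enum: the setter rejects any month that is not in a fixed list. The class layout and the $validMonths name are assumptions made for this sketch only.

```php
<?php
// Hypothetical PHP analogue of the enum idea: the setter rejects any
// month that is not in a fixed list.
class Owner
{
    private static $validMonths = array(
        'January', 'February', 'March', 'April', 'May', 'June', 'July',
        'August', 'September', 'October', 'November', 'December',
    );

    private $month;

    public function setMonth($month)
    {
        // Strict comparison, so only exact string matches are accepted.
        if (!in_array($month, self::$validMonths, true)) {
            throw new InvalidArgumentException("Unknown month: $month");
        }
        $this->month = $month;
    }

    public function getMonth()
    {
        return $this->month;
    }
}

$owner = new Owner();
$owner->setMonth('July');
echo $owner->getMonth(), "\n";

try {
    $owner->setMonth('Julember');   // not a real month, so this throws
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), "\n";
}
```

A plain array would accept 'Julember' without complaint, which is the data-integrity gap this answer is pointing at.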

Also, I haven't got enough rep points to leave a comment, but I wanted to reply to Dave F's answer. Hashtable is a legacy class from Java's early days and so should be avoided. HashMap is the recommended replacement; both implement the Map interface.

In Java's early days the collection classes were synchronised to make them threadsafe (Vector, Hashtable etc). This made these essential classes unnecessarily slow and hampered performance. If you need a synchronised map these days there is a wrapper for HashMap.

Map m = Collections.synchronizedMap(new HashMap(...)); 

In other words there is no reason to use Hashtable anymore unless you happen to be working with legacy code.



Answer 7:

Actually, if we generalize to all programming languages, this isn't merely a subjective question. The array-traversal performance for many programming languages is hobbled if incorrect indexing is used by the programmer, because programming languages don't always store their arrays in memory the same way. Most languages, to my knowledge, are either row-major or column-major. You need to know which it is if you are going to write programs that require high-performance number crunching.

Example: C uses row-major, which is the usual way of doing it. When you iterate through an N x N array row by row, you will probably only access memory N times, because each row is loaded together. So for the first element of the first row, you will go out to memory and get the whole row back. For the second element of the first row, you won't need to go out to memory because the row is already loaded. However, if you decided to go column by column there could be memory issues. For the first element of the first column, you'd load the whole row, then for the second element of the first column you'd load the next row... etc. Once you run out of space in the cache, the first row will probably be ditched to make room and once you start loading all the elements in column 2, you'll have to start all over again. This wouldn't be a problem in Fortran; rather, you'd want to do it this way, because the whole column is loaded at once rather than the whole row. Hope that made sense. See the Wikipedia article I linked to above for a more visual explanation.
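The addressing arithmetic behind the two layouts can be sketched as follows. PHP arrays are not laid out in memory like C or Fortran arrays, so this only illustrates how a (row, column) pair maps to a position in a flat buffer; both function names are hypothetical.

```php
<?php
// Map a (row, col) pair to a position in a flat, one-dimensional buffer.
// Function names are hypothetical, for illustration only.
function rowMajorOffset($row, $col, $numCols)
{
    return $row * $numCols + $col;   // C-style: a row's elements are adjacent
}

function columnMajorOffset($row, $col, $numRows)
{
    return $col * $numRows + $row;   // Fortran-style: a column's elements are adjacent
}

$numRows = 3;
$numCols = 4;

// Walking along row 1: row-major offsets step by 1, column-major by $numRows.
for ($col = 0; $col < $numCols; $col++) {
    printf(
        "(1,%d) row-major=%d column-major=%d\n",
        $col,
        rowMajorOffset(1, $col, $numCols),
        columnMajorOffset(1, $col, $numRows)
    );
}
```

Stepping through neighbouring offsets is what keeps a traversal cache-friendly, which is why the inner loop should run over the index that is adjacent in memory for the language in use.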

For most programs, the developer's top priority should be clean code. But knowledge of how the language handles memory can be essential in cases where performance is key. Good question.