Precisions and counts

Posted 2020-05-08 08:34

Question:

I am working with an educational dataset called IPEDS from the National Center for Education Statistics. It tracks college students by major, degree completion, etc. My problem in Stata is that I am trying to determine the total count of degrees obtained in a specific major.

The dataset has a variable cipcode whose values serve as "majors". For example, cipcode might be 14.2501 "Petroleum Engineering", 16.0102 "Linguistics", and so forth.

When I write a particular code like

tab cipcode if cipcode==14.2501 

it reports no observations. What code will give me the totals?

/*Convert Float Variable to String Variable and use Force Replace*/
tostring cipcode, gen(cipcode_str) format(%6.4f) force
replace cipcode_str = reverse(substr(reverse(cipcode_str), indexnot(reverse(cipcode_str), "0"), .))
replace cipcode_str = reverse(substr(reverse(cipcode_str), indexnot(reverse(cipcode_str), "."), .))

/* Created a total variable called total_t1 for total count of all stem majors listed in table 1*/
gen total_t1 = cipcode_str== "14.2501" + "14.3901" + "15.0999" + "40.0601"

Answer 1:

This minimal example confirms your problem. (See, by the way, https://stackoverflow.com/help/mcve for advice on good examples.)

* code 
clear
input code 
14.2501 
14.2501 
14.2501 
end 

tab code if code == 14.2501
tab code if code == float(14.2501)

* results 
. tab code if code == 14.2501
no observations

. tab code if code == float(14.2501)

       code |      Freq.     Percent        Cum.
------------+-----------------------------------
    14.2501 |          3      100.00      100.00
------------+-----------------------------------
      Total |          3      100.00

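The keyword is one you use yourself: precision. In Stata, type search precision to find resources, starting with the blog posts by William Gould. A decimal like 14.2501 cannot be held exactly in binary, and the details of holding a variable as type float can bite: the variable stores the value rounded to float precision, while the literal 14.2501 in your if condition is evaluated in double precision, so the two are not exactly equal. Wrapping the literal in float() rounds it the same way the variable was rounded, and the comparison then succeeds.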
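To see the mismatch directly, you can compare the literal with its float-rounded version. This is a minimal sketch using only display and float(); the display format %21.16f is just one choice that shows enough digits.

```stata
* The literal 14.2501 is evaluated in double precision
display %21.16f 14.2501

* The same value rounded to float precision, as a float variable stores it
display %21.16f float(14.2501)

* The two values differ, so this equality is false and 0 is displayed
display 14.2501 == float(14.2501)
```

The first two lines print values that agree only in the leading digits, which is why the unwrapped comparison finds no observations.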

It's hard to see what you're doing with your last block of code, which you don't explain. The last statement looks puzzling, as you're adding strings, and in Stata + applied to strings concatenates them. Consider what happens with

. gen whatever =  "14.2501" + "14.3901" + "15.0999" + "40.0601"

. di whatever[1]
14.250114.390115.099940.0601

The result is a long string that cannot be a valid cipcode. I suspect that you are reaching towards

 ... if inlist(cipcode_str, "14.2501", "14.3901", "15.0999", "40.0601") 

which is quite different.

But using float() is the minimal trick for this problem.
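Putting the two suggestions together, counting observations for a set of majors might look like the sketch below. The toy data and the variable name stem_t1 are made up for illustration, and cipcode is assumed to be stored as float, as in the example above.

```stata
clear
input float cipcode
14.2501
14.3901
16.0102
40.0601
end

* Indicator: 1 if the major is one of the listed codes, 0 otherwise.
* Each literal is wrapped in float() so that it matches the stored values.
generate byte stem_t1 = inlist(cipcode, float(14.2501), float(14.3901), float(15.0999), float(40.0601))

* Number of observations whose major is in the set
count if stem_t1
```

In this toy dataset count reports 3. The same inlist() pattern works on cipcode_str if you supply the codes as quoted strings instead.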



Tags: stata