Creating variable out of conditional values in ano

2019-08-05 12:59发布

问题:

I have quite a large conflict dataset (71 million observations) with many variables and date (daily).

This is from the GDELT project for which the way the dataset is structured is that for each day, there is a target country and a source country of aggression. Namely, the first of January of 2000, many countries engaged in aggressive behaviour against others or themselves, and this dataset tracks this.

It looks like this:

clear

input long date_01 str18 source_01 str19 target_01 str4 cameocode_01
20000101 "AFG"    "AFGGOV" "020" 
20000101 "AFG"    "AFGGOV" "0841"
20000101 "AFG"    "ARE"    "036" 
20000101 "AFG"    "CVL"    "043" 
20000101 "AFG"    "GOV"    "010" 
20000101 "AFG"    "GOV"    "043" 
20000101 "AFGGOV" "kasUAF" "0353"
20000101 "AFGGOV" "kasUAF" "084" 
20000101 "AFG"    "IGOUNO" "030" 
20000101 "AFG"    "IND"    "042" 
20000101 "AFG"    "IND"    "043" 
end

What I would like to do is to isolate these events per country.

For instance, I would like to create a variable for the US where, for each date, I have all the times that the US was either a target or a source, and their respective cameo code. I have a considerable number of countries but only need a subset of them and I know their names in advance.

As you can see in the example, the first variable is date, which for these cells is always 2000101 but after a couple of hundreds observations it changes to 2000102, denoting a change in day.

The second variable source_01 is a country attacking another one. In the example, IND is India, AFG is Afghanistan and the other codes are other countries.

The third variable target_01 is just the victim of the conflict.

Finally, cameocode_01 is a level of intensity of conflict measured with some algorithm that tracks the news in each language.

What I am after is to create a new (per country) variable that extracts the cameo code of that event if a specific country is involved either as source or target.

For this specific example, below is my desired output for the case of India (code IND), which is involved in two events on the specific date:

date      INDIAcameo
20000101  "042" 
20000101  "043"

I have tried this:

replace INDIA cameo=cameocode if "target"  ~ "source" ==IND

However, it says type mismatch and I doubt it would give me what I look for anyway.

回答1:

If you know in advance the countries you are interested in, then the following will work:

clear

input long date_01 str18 source_01 str19 target_01 str4 cameocode_01
20000101 "AFG"    "AFGGOV" "020" 
20000101 "AFG"    "IND"    "043"
20000101 "AFG"    "AFGGOV" "0841"
20000101 "AFG"    "ARE"    "036" 
20000101 "AFG"    "CVL"    "043" 
20000101 "AFG"    "GOV"    "010" 
20000101 "AFG"    "GOV"    "043" 
20000101 "AFGGOV" "kasUAF" "0353"
20000101 "AFGGOV" "kasUAF" "084" 
20000101 "AFG"    "IGOUNO" "030" 
20000102 "AFG"    "IND"    "042" 
end

foreach c in AFG IND ARE {
    generate ind_`c' = cameocode_01 if strmatch(source_01, "`c'") |  ///
                                       strmatch(target_01, "`c'")
}

Note that I have slightly modified your example for better illustration.

To see the results:

list, sepby(date) abbreviate(15)

     +-------------------------------------------------------------------------------+
     |  date_01   source_01   target_01   cameocode_01   ind_AFG   ind_IND   ind_ARE |
     |-------------------------------------------------------------------------------|
  1. | 20000101         AFG      AFGGOV            020       020                     |
  2. | 20000101         AFG         IND            043       043       043           |
  3. | 20000101         AFG      AFGGOV           0841      0841                     |
  4. | 20000101         AFG         ARE            036       036                 036 |
  5. | 20000101         AFG         CVL            043       043                     |
  6. | 20000101         AFG         GOV            010       010                     |
  7. | 20000101         AFG         GOV            043       043                     |
  8. | 20000101      AFGGOV      kasUAF           0353                               |
  9. | 20000101      AFGGOV      kasUAF            084                               |
 10. | 20000101         AFG      IGOUNO            030       030                     |
     |-------------------------------------------------------------------------------|
 11. | 20000102         AFG         IND            042       042       042           |
     +-------------------------------------------------------------------------------+

or

foreach v of varlist ind* {
    sort date `v'
    list date `v' if !missing(`v'), sepby(date) abbreviate(15)
}

     +--------------------+
     |  date_01   ind_AFG |
     |--------------------|
  3. | 20000101       010 |
  4. | 20000101       020 |
  5. | 20000101       030 |
  6. | 20000101       036 |
  7. | 20000101       043 |
  8. | 20000101       043 |
  9. | 20000101       043 |
 10. | 20000101      0841 |
     |--------------------|
 11. | 20000102       042 |
     +--------------------+

     +--------------------+
     |  date_01   ind_IND |
     |--------------------|
 10. | 20000101       043 |
     |--------------------|
 11. | 20000102       042 |
     +--------------------+

     +--------------------+
     |  date_01   ind_ARE |
     |--------------------|
 10. | 20000101       036 |
     +--------------------+