
Overview: [Stata] 3 - Data

1 Opening example data and web data

1-1 Opening example data: sysuse

sysuse auto,clear
(1978 Automobile Data)

1-2 Getting data from the web: webuse or use

use nlswork, clear
file nlswork.dta not found
* Fetch the dataset from a website by giving its full URL
use http://www.stata-press.com/data/r9/nlswork
(National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
* Equivalent to the previous command: fetch the dataset from Stata's official data repository
webuse nlswork, clear
(National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
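  • webuse fetches data only from its configured prefix, which by default is http://www.stata-press.com/data (plus a release folder such as r14). For data hosted anywhere else, either write out the full URL with use or change the prefix with webuse set. A working network connection is required.
  • Another site with a large collection of data is the Boston College data archive, which hosts, among others, the datasets for Wooldridge's econometrics textbook: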
    . use http://fmwww.bc.edu/ec-p/data/wooldridge/CEOSAL1
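The prefix that webuse searches can be inspected and changed. The lines below are a minimal sketch written as a do-file; they assume the Boston College address quoted above still serves CEOSAL1.dta and that the network is reachable.

```stata
* Show the prefix webuse currently uses
webuse query

* Point webuse at the Wooldridge folder (address taken from the text above)
webuse set http://fmwww.bc.edu/ec-p/data/wooldridge
webuse CEOSAL1, clear

* Restore the default Stata Press prefix
webuse set
```

Resetting the prefix at the end keeps later webuse calls pointing at the official repository.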

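The use command opens only files in Stata's own .dta format. Data in any other format cannot be opened with use and has to be imported; the simplest and most direct way is to copy and paste it (for example into the Data Editor).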

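Sometimes the originating software is not available: for example, we have data in SAS or SPSS format but neither SAS nor SPSS installed. In that case, use one of Stata's other import commands or a data conversion program such as Stat/Transfer.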

2 Data types

Stata generally groups variables into three kinds: numeric, string, and date.

2-1 Numeric variables

. dis 5 
. dis -5 
. dis 5.2 
. dis 5.2e+3 
. dis 5.2e-2

Numeric variables come in five storage types that differ in precision: byte | int | long | float | double

Storage type   Minimum                 Maximum                Closest to 0 (nonzero)   Bytes
byte           -127                    100                    +/-1                     1
int            -32,767                 32,740                 +/-1                     2
long           -2,147,483,647          2,147,483,620          +/-1                     4
float          -1.70141173319*10^38    1.70141173319*10^38    +/-10^-38                4
double         -8.9884656743*10^307    8.9884656743*10^307    +/-10^-323               8
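These limits do not have to be memorized: Stata exposes them as c() system values. The following is a small sketch that prints the range of each storage type.

```stata
* Print the smallest and largest value each numeric storage type can hold
display "byte:   " c(minbyte)   " to " c(maxbyte)
display "int:    " c(minint)    " to " c(maxint)
display "long:   " c(minlong)   " to " c(maxlong)
display "float:  " c(minfloat)  " to " c(maxfloat)
display "double: " c(mindouble) " to " c(maxdouble)
```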

. set obs 1
number of observations (_N) was 0, now 1
. gen a=1
. d
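/* d is short for describe, which displays the attributes of the dataset.
Note the storage type of a in the output. It is double here; with Stata's
factory settings gen would create a float (set type changes the default). */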
Contains data
  obs:             1                          
 vars:             1                          
 size:             8                          
              storage   display    value
variable name   type    format     label      variable label
a               double  %10.0g                
Sorted by: 
     Note: Dataset has changed since last saved.
.  compress
  variable a was double now byte
  (7 bytes saved)
. d
Contains data
  obs:             1                          
 vars:             1                          
 size:             1                          
              storage   display    value
variable name   type    format     label      variable label
a               byte    %10.0g                
Sorted by: 
     Note: Dataset has changed since last saved.
. replace a=32741
variable a was byte now long
(1 real change made)
.  gen double b=1
. recast double a
. d
Contains data
  obs:             1                          
 vars:             2                          
 size:            16                          
              storage   display    value
variable name   type    format     label      variable label
a               double  %10.0g                
b               double  %10.0g                
Sorted by: 
     Note: Dataset has changed since last saved.
// Note that a and b are both double precision
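Why the storage type matters shows up as soon as a float is compared with a decimal literal. The sketch below (variable names xf and xd are made up for illustration) stores 0.1 once as float and once as double and counts matches.

```stata
clear
set obs 1
generate float  xf = 0.1
generate double xd = 0.1

count if xf == 0.1          // 0: the float holds 0.1 rounded to float precision
count if xf == float(0.1)   // 1: both sides rounded to float precision
count if xd == 0.1          // 1: the double matches the literal exactly
```

Wrapping the literal in float() is the usual fix when the variable has to stay a float.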

2-2 String variables

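String variables have a storage type of the form str#, where # is the maximum length; str20, for example, holds up to 20 characters. Older releases capped # at 244; since Stata 13 it can go up to 2045, and strL holds longer strings. The length is counted in bytes, so a three-character Chinese name needs 6 bytes in a double-byte encoding such as GBK and 9 bytes in the UTF-8 encoding used by Stata 14 and later.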
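How much room a string needs can be checked directly. This sketch assumes Stata 14 or later (Unicode support) and a do-file saved as UTF-8; the name used is only an illustration.

```stata
display strlen("abc")        // 3: bytes, one per ASCII letter
display strlen("张三丰")      // 9: bytes, three per Chinese character in UTF-8
display ustrlen("张三丰")     // 3: Unicode characters
```

strlen() reports the byte count that determines the str# width, while ustrlen() reports what a reader would call the number of characters.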

. gen c="123.5"
. d c
              storage   display    value
variable name   type    format     label      variable label
c               str5    %9s                   
. destring c
must specify either generate or replace option
. destring c,replace
c has all characters numeric; replaced as double
. d c
              storage   display    value
variable name   type    format     label      variable label
c               double  %10.0g

2-3 Date variables

In Stata, January 1, 1960 is day 0, so December 31, 1959 is day -1 and January 25, 2001 is day 15,000.
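The day counts quoted above can be reproduced with mdy(), which returns the number of days since January 1, 1960:

```stata
display mdy(1, 1, 1960)      // 0
display mdy(12, 31, 1959)    // -1
display mdy(1, 25, 2001)     // 15000
display %td 15000            // 25jan2001
```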

Types of dates and their human readable forms (HRFs)
         Date type         Examples of HRFs
         datetime          20jan2010 09:15:22.120  
         date              20jan2010, 20/01/2010, ...
         weekly date       2010w3
         monthly date      2010m1
         quarterly date    2010q1
         half-yearly date  2010h1
         yearly date       2010

. display %d date("20060125", "YMD")
. display %td date("060125", "20YMD")
. display %td date("060125", "19YMD")
. display %tc clock("20231101113159", "YMDhms")
01nov2023 11:31:59
. di (mdy(9,18,2023)-mdy(2,27,2002))/365.25
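In practice the same functions are applied to a whole variable rather than to a literal. The sketch below uses a made-up string variable sdate, converts it to a Stata date, and attaches a %td format so it displays readably.

```stata
clear
input str10 sdate
"2006-01-25"
"2023-09-18"
end

* date() parses the string; the mask "YMD" gives the order of the components
generate ddate = date(sdate, "YMD")
format ddate %td
list
```

Without the format line, ddate would be listed as a raw day count.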


. sysuse sp500, clear
(S&P 500)
. des
Contains data from C:\Program Files (x86)\Stata14\ado\base/s/sp500.dta
  obs:           248                          S&P 500
 vars:             7                          22 Apr 2014 10:52
 size:         7,440                          (_dta has notes)
              storage   display    value
variable name   type    format     label      variable label
date            int     %td                   Date
open            float   %9.0g                 Opening price
high            float   %9.0g                 High price
low             float   %9.0g                 Low price
close           float   %9.0g                 Closing price
volume          double  %12.0gc               Volume (thousands)
change          float   %9.0g                 Closing price change
Sorted by: date
. gen d=day(date) 
. gen w=week(date) 
. gen m=month(date) 
. gen q= quarter(date) 
. gen hy= halfyear(date) 
. gen y=year(date) 
. gen ndate1=mdy(m,d,y) 
. gen weekd=dow(date)
. gen yeard=doy(date)
. des
Contains data from C:\Program Files (x86)\Stata14\ado\base/s/sp500.dta
  obs:           248                          S&P 500
 vars:            16                          22 Apr 2014 10:52
 size:        25,296                          (_dta has notes)
              storage   display    value
variable name   type    format     label      variable label
date            int     %td                   Date
open            float   %9.0g                 Opening price
high            float   %9.0g                 High price
low             float   %9.0g                 Low price
close           float   %9.0g                 Closing price
volume          double  %12.0gc               Volume (thousands)
change          float   %9.0g                 Closing price change
d               double  %10.0g                
w               double  %10.0g                
m               double  %10.0g                
q               double  %10.0g                
hy              double  %10.0g                
y               double  %10.0g                
ndate1          double  %10.0g                
weekd           double  %10.0g                
yeard           double  %10.0g                
Sorted by: date
     Note: Dataset has changed since last saved.
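The extracted components above are plain numbers. When a proper monthly or quarterly date is needed (for collapsing or for tsset, say), mofd() and qofd() convert a daily date directly. A brief sketch, with mdate and qdate as illustrative variable names:

```stata
sysuse sp500, clear

* Months and quarters elapsed since 1960m1 / 1960q1
generate mdate = mofd(date)
generate qdate = qofd(date)
format mdate %tm
format qdate %tq

list date mdate qdate in 1/3
```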

2-4 Missing values

display 2/0
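Division by zero returns the system missing value, shown as a dot. Because Stata treats missing as larger than any number, conditions of the form x > value silently include missing observations. A minimal sketch with a made-up variable x:

```stata
display 2/0                      // .
display (2/0) > 1e300            // 1: missing is larger than any number

clear
set obs 3
generate x = _n
replace x = . in 3

count if x > 1                   // 2: includes the missing observation
count if x > 1 & !missing(x)     // 1: missing excluded explicitly
```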

3 Converting data types

. *----------------Converting string data to numeric: removing spaces between characters------------
. webuse destring2, clear
. des
Contains data from http://www.stata-press.com/data/r14/destring2.dta
  obs:            10                          
 vars:             3                          3 Mar 2014 22:50
 size:           280                          
              storage   display    value
variable name   type    format     label      variable label
date            str14   %10s                  
price           str11   %11s                  
percent         str3    %9s                   
Sorted by: 
. destring date, replace
date contains nonnumeric characters; no replace
. destring date, replace ignore(" ")
date: characters space removed; replaced as long
. des date
              storage   display    value
variable name   type    format     label      variable label
date            long    %10.0g                
. list date
     |     date |
  1. | 19991210 |
  2. | 20000708 |
  3. | 19970302 |
  4. | 19990900 |
  5. | 19981004 |
  6. | 20000328 |
  7. | 20000808 |
  8. | 19971020 |
  9. | 19980116 |
 10. | 19991112 |
. destring price percent, gen(price2 percent2) ignore("$ ,%")
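For readers without network access, the same idea can be tried on a tiny dataset typed in by hand. The variables pct and amount below are invented for illustration; ignore() lists the characters to strip before conversion.

```stata
clear
input str8 pct str10 amount
"45%" "1,234.5"
"7%"  "98"
end

* Strip percent signs and thousands separators, keep the originals
destring pct amount, generate(pct_n amount_n) ignore("%,")
list
```

Using generate() instead of replace leaves the string originals in place, which makes it easy to check the conversion.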

3-2 Converting numeric to string: tostring
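A minimal sketch of tostring, using a made-up id variable: the format() option controls how the number is written into the string, here padded with leading zeros to three digits.

```stata
clear
set obs 3
generate id = _n

* Create a string copy of id: 1 becomes "001", 2 becomes "002", ...
tostring id, generate(id_s) format(%03.0f)
list
```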

3-4 Display formats: format
. webuse census10,clear
(1980 Census data by state)
. des
Contains data from http://www.stata-press.com/data/r14/census10.dta
  obs:            50                          1980 Census data by state
 vars:             4                          9 Apr 2014 08:05
 size:         1,200                          
              storage   display    value
variable name   type    format     label      variable label
state           str14   %14s                  State
region          int     %8.0g      cenreg     Census region
pop             long    %11.0g                Population
medage          float   %9.0g                 Median age
Sorted by: 
. list in 1/5
     |      state   region        pop   medage |
  1. |    Alabama    South    3893888     29.3 |
  2. |     Alaska     West     401851     26.1 |
  3. |    Arizona     West    2718215     29.2 |
  4. |   Arkansas    South    2286435     30.6 |
  5. | California     West   23667902     29.9 |
. format state %-14s
. list in 1/5
     | state        region        pop   medage |
  1. | Alabama       South    3893888     29.3 |
  2. | Alaska         West     401851     26.1 |
  3. | Arizona        West    2718215     29.2 |
  4. | Arkansas      South    2286435     30.6 |
  5. | California     West   23667902     29.9 |
. format pop %14.0gc
. list in 1/5
     | state        region          pop   medage |
  1. | Alabama       South    3,893,888     29.3 |
  2. | Alaska         West      401,851     26.1 |
  3. | Arizona        West    2,718,215     29.2 |
  4. | Arkansas      South    2,286,435     30.6 |
  5. | California     West   23,667,902     29.9 |
. format medage %8.2f
. list in 1/5
     | state        region          pop   medage |
  1. | Alabama       South    3,893,888    29.30 |
  2. | Alaska         West      401,851    26.10 |
  3. | Arizona        West    2,718,215    29.20 |
  4. | Arkansas      South    2,286,435    30.60 |
  5. | California     West   23,667,902    29.90 |
. gen id=_n
. list in 1/6
     | state        region          pop   medage   id |
  1. | Alabama       South    3,893,888    29.30    1 |
  2. | Alaska         West      401,851    26.10    2 |
  3. | Arizona        West    2,718,215    29.20    3 |
  4. | Arkansas      South    2,286,435    30.60    4 |
  5. | California     West   23,667,902    29.90    5 |
  6. | Colorado       West    2,889,964    28.60    6 |
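A display format affects only how values are shown, not what is stored. The sketch below, with an illustrative variable x, lists a value rounded to two decimals and then confirms the stored value is unchanged.

```stata
clear
set obs 1
generate double x = 29.3456
format x %8.2f

list x                  // shown rounded to two decimals
assert x == 29.3456     // passes: the stored value is untouched
```

To actually change the values, use a function such as round() with replace rather than format.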

3-5 Entering data directly in Stata: input

. input id str10 name economy
             id        name     economy
  1. 1 John 40 
  2. 2 Chris 80 
  3. 3 Jack 90 
  4. 4 Huang 70 
  5. 5 Tom 53 
  6. end
. save economy
file economy.dta saved
. list
     | id    name   economy |
  1. |  1    John        40 |
  2. |  2   Chris        80 |
  3. |  3    Jack        90 |
  4. |  4   Huang        70 |
  5. |  5     Tom        53 |

3-6 Importing other formats: insheet

insheet using 3origin.csv, clear


To store numeric variables as double rather than float, add the double option to the command.

insheet using 3origin.txt, double clear
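Since Stata 13, insheet has been superseded by import delimited, although insheet still works. The round trip below is a sketch that writes the auto data to a CSV file and reads it back; the file name auto_copy.csv is arbitrary, and asdouble plays the role of insheet's double option.

```stata
sysuse auto, clear

* Write a CSV file into the current working directory
export delimited using "auto_copy.csv", replace

* Read it back, storing floating-point columns as double
import delimited using "auto_copy.csv", clear asdouble
describe
```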


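Suppose we want to create a new folder mydata in the root of drive D to hold data files; the command is mkdir. If the folder already exists, mkdir stops with an error. Prefixing it with capture suppresses that error, so the command is effectively skipped when mydata exists and creates the folder when it does not.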

. capture mkdir D:/mydata
. cd d:/mydata
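capture also records the outcome in _rc (0 means success, anything else is the error code), which a do-file can test. The sketch below uses a relative folder name, mydata_demo, as a stand-in for D:/mydata so it runs on any system.

```stata
* First call creates the folder (or fails quietly if it is already there)
capture mkdir "mydata_demo"
display "first call:  _rc = " _rc

* Second call always meets an existing folder; capture stores the error code
capture mkdir "mydata_demo"
display "second call: _rc = " _rc
if _rc != 0 {
    display "folder already exists, nothing to do"
}
```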
