[Stata] 3 - Data

1 Opening example data and web data

1-1 Opening example data: sysuse

sysuse auto,clear
(1978 Automobile Data)
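
To see which example datasets ship with Stata before opening one, something like the following can be run. It is a minimal sketch that relies only on the built-in auto dataset.

```stata
* List the example datasets installed with Stata
sysuse dir

* Load one of them, discarding whatever is in memory
sysuse auto, clear

* Quick overview: number of observations and variables
describe, short
display "observations: " _N "   variables: " c(k)
```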

1-2 Getting data from the web: webuse or use

use nlswork, clear
file nlswork.dta not found
r(601);
* Get the data from the web by giving the full URL
use http://www.stata-press.com/data/r9/nlswork
(National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
* Equivalent to the previous command: fetch the data from Stata's official example-data site
webuse nlswork, clear
(National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
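  • By default, webuse fetches data only from the http://www.stata-press.com/data path. For data hosted anywhere else, webuse does not work and the full URL must be passed to use. Either way, a working network connection is required.
  • Another rich source of online data is the Boston College data archive, which hosts all the datasets used in Wooldridge's Introductory Econometrics. For example:
    . use http://fmwww.bc.edu/ec-p/data/wooldridge/CEOSAL1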

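The use command only opens files in Stata's own .dta format. Data in any other format cannot be read directly and has to be brought in from outside; the simplest and most direct route is to copy and paste it.

Sometimes the originating software is not available. For example, we may have data in SAS or SPSS format but neither SAS nor SPSS installed. In that case, use Stata's other import commands or a format-conversion tool such as Stat/Transfer.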
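
As one illustration of reading a non-.dta file without the originating program, the sketch below writes the auto data to an Excel workbook and reads it back with import excel. The file name auto_demo.xlsx is hypothetical, and the commented-out import spss and import sas lines assume Stata 16 or later and placeholder file names.

```stata
* Round trip through an Excel file to illustrate import excel.
* "auto_demo.xlsx" is a hypothetical file written to the working directory.
sysuse auto, clear
export excel using "auto_demo.xlsx", firstrow(variables) replace

* Read the workbook back; firstrow treats row 1 as variable names
import excel using "auto_demo.xlsx", firstrow clear
describe, short

* In Stata 16 or later, SPSS and SAS files can be read in a similar way
* (placeholder file names, so the lines are left commented out):
* import spss using "mydata.sav", clear
* import sas  using "mydata.sas7bdat", clear
```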

2 Data types

Stata usually divides variables into three kinds: numeric, string, and date.

2-1 Numeric variables

. dis 5 
5
. dis -5 
-5
. dis 5.2 
5.2
. dis 5.2e+3 
5200
. dis 5.2e-2
.052

By precision, numeric variables come in five storage types: byte | int | long | float | double

Storage type  Minimum                 Maximum                Closest to 0 (nonzero)  Bytes
byte          -127                    100                    +/-1                    1
int           -32,767                 32,740                 +/-1                    2
long          -2,147,483,647          2,147,483,620          +/-1                    4
float         -1.70141173319*10^38    1.70141173319*10^38    +/-10^-38               4
double        -8.9884656743*10^307    8.9884656743*10^307    +/-10^-323              8
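
The practical effect of these precision and range limits can be checked with a small experiment. The following is a minimal sketch; the variable names f, g, and h are made up for illustration.

```stata
clear
set obs 1

* 0.1 has no exact binary representation, so float and double store
* slightly different approximations of it
generate float  f = 0.1
generate double g = 0.1

count if f == 0.1           // 0: the float value differs from the double literal 0.1
count if f == float(0.1)    // 1: compare after rounding the literal to float
count if g == 0.1           // 1

* A value outside the range of the requested type is stored as missing
generate byte h = 101
list
```

This is why comparisons against non-integer constants are usually written with float() when the variable is stored as float.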

// Set the number of observations to one
. set obs 1
number of observations (_N) was 0, now 1
// The message says the dataset previously had no observations and now has one
. gen a=1
. d
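/* d is an abbreviation of describe, which displays the dataset's
attributes. Look at the storage type of a in the output: Stata's factory
default is float, but this session reports double, which suggests the
default had been changed beforehand (for example with set type double). */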
Contains data
  obs:             1                          
 vars:             1                          
 size:             8                          
---------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------------------------------
a               double  %10.0g                
---------------------------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.
// Compress without losing information, so the data take up as little space as possible
.  compress
  variable a was double now byte
  (7 bytes saved)
. d
Contains data
  obs:             1                          
 vars:             1                          
 size:             1                          
---------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------------------------------
a               byte    %10.0g                
---------------------------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.
. replace a=32741
variable a was byte now long
(1 real change made)
.  gen double b=1
. recast double a
. d
Contains data
  obs:             1                          
 vars:             2                          
 size:            16                          
---------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------------------------------
a               double  %10.0g                
b               double  %10.0g                
---------------------------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.
// Note that a and b are now both double precision
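
The default storage type used by generate is controlled by set type. A minimal sketch of how the setting changes what gets created, assuming a fresh session with no data that needs to be kept:

```stata
clear
set obs 1

* With the factory default, generate creates float variables
set type float
generate a = 1

* After set type double, new variables default to double
set type double
generate b = 1

describe a b

* Restore the factory default so later commands are unaffected
set type float
```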

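2-2 String variables

In Stata 12 and earlier a str# variable holds at most 244 characters; from Stata 13 on the limit is 2,045, and strL handles longer text. The # in str# gives the maximum length, so str20 allows up to 20 characters.

A three-character Chinese name typically needs 6 bytes in a double-byte encoding such as GBK; in Stata 14 and later, which store strings as UTF-8, it needs 9.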
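
The relationship between str# and the actual length of the text can be seen directly. This is a sketch that assumes Stata 14 or later and a do-file saved as UTF-8; the variable name and sample strings are illustrative only.

```stata
clear
set obs 1

* Reserve 20 bytes, then let compress shrink the variable to what is needed
generate str20 name = "Zhang San"
describe name
compress name
describe name               // now str9

* Stata 14+: strlen() counts bytes, ustrlen() counts Unicode characters
display strlen("张三丰")    // 9 when the do-file is saved as UTF-8
display ustrlen("张三丰")   // 3
```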

. gen c="123.5"
. d c
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------------------------------
c               str5    %9s                   
. destring c
must specify either generate or replace option
r(198);
. destring c,replace
c has all characters numeric; replaced as double
. d c
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------------------------------
c               double  %10.0g

2-3 Date variables

In Stata, 1 January 1960 is day 0, so 31 December 1959 is day -1 and 25 January 2001 is day 15,000.
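
These day numbers can be verified with mdy(), which returns the stored value of a date. A short check:

```stata
* Dates are stored as the number of days since 01jan1960
display mdy(1, 1, 1960)      // 0
display mdy(12, 31, 1959)    // -1
display mdy(1, 25, 2001)     // 15000

* A %td format turns the number back into a readable date
display %td 15000            // 25jan2001
```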

Types of dates and their human readable forms (HRFs)
         Date type         Examples of HRFs
         --------------------------------------------
         datetime          20jan2010 09:15:22.120  
         date              20jan2010, 20/01/2010, ...
         weekly date       2010w3
         monthly date      2010m1
         quarterly date    2010q1
         half-yearly date  2010h1
         yearly date       2010
         --------------------------------------------

. display %d date("20060125", "YMD")
25jan2006
. display %td date("060125", "20YMD")
25jan2006
. display %td date("060125", "19YMD")
25jan1906
. display %tc clock("20231101113159", "YMDhms")
01nov2023 11:31:59
// Compute your own age in years (approximate, using an average year length of 365.25 days)
. di (mdy(9,18,2023)-mdy(2,27,2002))/365.25
21.555099
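
The same functions work on variables, which is the more common case when dates arrive as text. A minimal sketch, with made-up variable names and two sample dates:

```stata
clear
input str10 sdate
"2006-01-25"
"2023-11-01"
end

* Convert the text to a daily date and attach a date display format
generate d = date(sdate, "YMD")
format d %td

* Derive the monthly date from the daily date
generate m = mofd(d)
format m %tm

list
```

Without the format commands, d and m would be listed as plain integers counted from January 1960.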

* Other date functions

. sysuse sp500, clear
(S&P 500)
. des
Contains data from C:\Program Files (x86)\Stata14\ado\base/s/sp500.dta
  obs:           248                          S&P 500
 vars:             7                          22 Apr 2014 10:52
 size:         7,440                          (_dta has notes)
---------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------------------------------
date            int     %td                   Date
open            float   %9.0g                 Opening price
high            float   %9.0g                 High price
low             float   %9.0g                 Low price
close           float   %9.0g                 Closing price
volume          double  %12.0gc               Volume (thousands)
change          float   %9.0g                 Closing price change
---------------------------------------------------------------------------------------------------
Sorted by: date
. gen d=day(date) 
. gen w=week(date) 
. gen m=month(date) 
. gen q= quarter(date) 
. gen hy= halfyear(date) 
. gen y=year(date) 
. gen ndate1=mdy(m,d,y) 
. gen weekd=dow(date)
. gen yeard=doy(date)
. des
Contains data from C:\Program Files (x86)\Stata14\ado\base/s/sp500.dta
  obs:           248                          S&P 500
 vars:            16                          22 Apr 2014 10:52
 size:        25,296                          (_dta has notes)
---------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------------------------------
date            int     %td                   Date
open            float   %9.0g                 Opening price
high            float   %9.0g                 High price
low             float   %9.0g                 Low price
close           float   %9.0g                 Closing price
volume          double  %12.0gc               Volume (thousands)
change          float   %9.0g                 Closing price change
d               double  %10.0g                
w               double  %10.0g                
m               double  %10.0g                
q               double  %10.0g                
hy              double  %10.0g                
y               double  %10.0g                
ndate1          double  %10.0g                
weekd           double  %10.0g                
yeard           double  %10.0g                
---------------------------------------------------------------------------------------------------
Sorted by: date
     Note: Dataset has changed since last saved.

2-4 Missing values

. display 2/0
.
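
Numeric missing values (shown as .) behave in ways that are easy to overlook in comparisons. The sketch below uses a made-up three-observation variable to show the two most common cases.

```stata
clear
input x
1
10
.
end

* Missing is treated as larger than any number, so it satisfies x > 5
count if x > 5                  // 2
count if x > 5 & !missing(x)    // 1

* Arithmetic on a missing value yields missing
generate y = x + 1
list
```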

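3 Converting between data types

3-1 Converting strings to numbers: destring

. *----------------Converting string data to numeric: removing spaces between characters------------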
. webuse destring2, clear
. des
Contains data from http://www.stata-press.com/data/r14/destring2.dta
  obs:            10                          
 vars:             3                          3 Mar 2014 22:50
 size:           280                          
---------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------------------------------
date            str14   %10s                  
price           str11   %11s                  
percent         str3    %9s                   
---------------------------------------------------------------------------------------------------
Sorted by: 
. destring date, replace
date contains nonnumeric characters; no replace
. destring date, replace ignore(" ")
date: characters space removed; replaced as long
. des date
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------------------------------
date            long    %10.0g                
. list date
     +----------+
     |     date |
     |----------|
  1. | 19991210 |
  2. | 20000708 |
  3. | 19970302 |
  4. | 19990900 |
  5. | 19981004 |
     |----------|
  6. | 20000328 |
  7. | 20000808 |
  8. | 19971020 |
  9. | 19980116 |
 10. | 19991112 |
     +----------+
. destring price percent, gen(price2 percent2) ignore("$ ,%")
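
When some values cannot be converted at all, destring refuses unless force is added. The following sketch, on invented data, compares the real() function with destring using ignore() and force together.

```stata
clear
input str8 s
"12.5"
"1,200"
"n/a"
end

* real() returns missing for anything that is not a plain number
generate x1 = real(s)

* ignore(",") strips the thousands separator; force turns the
* remaining nonnumeric value ("n/a") into missing
destring s, generate(x2) ignore(",") force

list
```

Because force silently converts unreadable values to missing, it is worth listing or tabulating the result afterwards.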

3-2 Converting numbers to strings: tostring
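
A minimal sketch of tostring on a made-up id variable; the names id, id_str, and id_back are illustrative only.

```stata
clear
set obs 3
generate id = _n

* Create a string copy; the numeric original is kept
tostring id, generate(id_str)
describe id id_str
list

* real() goes the other way for a single expression
generate id_back = real(id_str)
```

Like destring, tostring requires either generate() or replace.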

3-4 Display formats: format

. webuse census10, clear
(1980 Census data by state)
. des
Contains data from http://www.stata-press.com/data/r14/census10.dta
  obs:            50                          1980 Census data by state
 vars:             4                          9 Apr 2014 08:05
 size:         1,200                          
---------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------------------------------
state           str14   %14s                  State
region          int     %8.0g      cenreg     Census region
pop             long    %11.0g                Population
medage          float   %9.0g                 Median age
---------------------------------------------------------------------------------------------------
Sorted by: 
. list in 1/5
     +-----------------------------------------+
     |      state   region        pop   medage |
     |-----------------------------------------|
  1. |    Alabama    South    3893888     29.3 |
  2. |     Alaska     West     401851     26.1 |
  3. |    Arizona     West    2718215     29.2 |
  4. |   Arkansas    South    2286435     30.6 |
  5. | California     West   23667902     29.9 |
     +-----------------------------------------+
. format state %-14s
. list in 1/5
     +-----------------------------------------+
     | state        region        pop   medage |
     |-----------------------------------------|
  1. | Alabama       South    3893888     29.3 |
  2. | Alaska         West     401851     26.1 |
  3. | Arizona        West    2718215     29.2 |
  4. | Arkansas      South    2286435     30.6 |
  5. | California     West   23667902     29.9 |
     +-----------------------------------------+
. format pop %14.0gc
. list in 1/5
     +-------------------------------------------+
     | state        region          pop   medage |
     |-------------------------------------------|
  1. | Alabama       South    3,893,888     29.3 |
  2. | Alaska         West      401,851     26.1 |
  3. | Arizona        West    2,718,215     29.2 |
  4. | Arkansas      South    2,286,435     30.6 |
  5. | California     West   23,667,902     29.9 |
     +-------------------------------------------+
. format medage %8.2f
. list in 1/5
     +-------------------------------------------+
     | state        region          pop   medage |
     |-------------------------------------------|
  1. | Alabama       South    3,893,888    29.30 |
  2. | Alaska         West      401,851    26.10 |
  3. | Arizona        West    2,718,215    29.20 |
  4. | Arkansas      South    2,286,435    30.60 |
  5. | California     West   23,667,902    29.90 |
     +-------------------------------------------+
. gen id=_n
. list in 1/6
     +------------------------------------------------+
     | state        region          pop   medage   id |
     |------------------------------------------------|
  1. | Alabama       South    3,893,888    29.30    1 |
  2. | Alaska         West      401,851    26.10    2 |
  3. | Arizona        West    2,718,215    29.20    3 |
  4. | Arkansas      South    2,286,435    30.60    4 |
  5. | California     West   23,667,902    29.90    5 |
     |------------------------------------------------|
  6. | Colorado       West    2,889,964    28.60    6 |
     +------------------------------------------------+
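
A format changes only how values are displayed, not what is stored. The sketch below illustrates this on the built-in auto data rather than the census data used above.

```stata
sysuse auto, clear

* Show price with a thousands separator and two decimals
format price %9.2fc
list make price in 1/3

* The stored values are untouched, so summary statistics do not change
summarize price

* The display format currently attached to a variable can be read back
local fmt : format price
display "`fmt'"
```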

3-5 Entering data directly in Stata: input

. input id str10 name economy
            id        name    economy
  1. 1 John 40
  2. 2 Chris 80
  3. 3 Jack 90
  4. 4 Huang 70
  5. 5 Tom 53
  6. end
. save economy
file economy.dta saved
. list
     +----------------------+
     | id    name   economy |
     |----------------------|
  1. |  1    John        40 |
  2. |  2   Chris        80 |
  3. |  3    Jack        90 |
  4. |  4   Huang        70 |
  5. |  5     Tom        53 |
     +----------------------+
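
In a do-file the same input block can be written without the interactive prompts. This sketch also shows that string values containing spaces need double quotes; the names and the file economy_demo are hypothetical.

```stata
clear
input id str20 name economy
1 "John Smith" 40
2 "Chris Lee"  80
end

* replace lets the do-file be rerun without a "file already exists" error
save economy_demo, replace
list
```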

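3-6 Importing data in other formats: insheet

. insheet using 3origin.csv, clear

When a variable has very many digits, or when the imported data must keep high precision, add the double option to the command.

. insheet using 3origin.txt, double clear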
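
Since Stata 13, import delimited has superseded insheet, with asdouble playing the role of the double option. The sketch below writes a tiny CSV file first so that it is self-contained; origin_demo.csv and its contents are made up for illustration.

```stata
* Write a two-row CSV file to the working directory
tempname fh
file open `fh' using "origin_demo.csv", write text replace
file write `fh' "id,value" _n
file write `fh' "1,123456789.123456" _n
file write `fh' "2,0.1" _n
file close `fh'

* varnames(1) takes variable names from row 1;
* asdouble stores floating-point columns as double
import delimited using "origin_demo.csv", varnames(1) asdouble clear

describe
format value %18.6f
list
```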

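3-7 Creating a folder and changing directory: mkdir and cd

Suppose we want a new folder, mydata, in the root of drive D to hold data files; the command is mkdir. If the folder already exists, mkdir stops with an error. Prefixing the command with capture suppresses that error, so the line is harmless either way: if mydata exists nothing happens, and if it does not, it is created.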

. capture mkdir D:/mydata
. cd d:/mydata
d:\mydata
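
capture stores the return code of the command it wraps in _rc, which makes it possible to report what happened. A minimal sketch using a relative path so that it does not depend on a D: drive; the folder name mydata_demo is hypothetical.

```stata
* "mydata_demo" is a hypothetical folder created under the current directory
capture mkdir "mydata_demo"
if _rc != 0 {
    display as text "mkdir returned code " _rc "; the folder probably exists already"
}

* Move into the folder, confirm the location, then return
cd "mydata_demo"
pwd
cd ..
```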